In market research, a “population” is the total universe of people who meet the criteria outlined for a study. Since we often can’t reach out to the whole population, when conducting surveys, we ask questions to a “sample” of that population. Many people imagine that each sample is a perfect representation of the population and the sample’s results reflect the status of the whole population. However, in market research this kind of situation hardly ever happens.
Is the data collected within the sample wrong?
No, it is not! Gathering information through surveys can be very useful, and the data collected in a survey provide us with rich information. However, the raw sample results are difficult to interpret on a one-to-one basis as an image of the whole population.
Here comes data weighting
Data weighting allows us to make the results of the sample reflect the overall population. However, to do so we need to know some information about the population. How does that work?
Let’s imagine our population includes 10 people: 5 women and 5 men (a 50/50 split).
Now, let’s randomly choose 4 of them.
In this sample, there are three women and only one man (a 75/25 split). Based on this sample, we cannot say that the split between women and men in the population is 75 to 25. This would not be true. Additionally, the survey results for this sample (e.g., usage of certain products, satisfaction ratings, etc.) may be affected by the difference between these two distributions.
To make the results from the sample reflect the population at large more accurately, we need to create weights that adjust the numbers for each group (in this case, men vs. women).
For each group, we divide the group’s fraction in the population by its fraction in the sample. For men, who were underrepresented in the sample, the weight is 0.50 / 0.25 = 2, so it reinforces their results by multiplying them by 2. For women, who were overrepresented in the sample, the weight is 0.50 / 0.75 ≈ 0.67, so it diminishes their results by 33% (1 - 0.67).
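The calculation above can be sketched in a few lines of Python. This is only an illustration of the 10-person example from the text; the function and variable names are made up for the sketch.

```python
def group_weights(population_share, sample_counts):
    """Return one weight per group: population share / sample share."""
    sample_size = sum(sample_counts.values())
    return {
        group: population_share[group] / (count / sample_size)
        for group, count in sample_counts.items()
    }


# The example from the text: 50/50 in the population, 3 women and 1 man sampled.
population_share = {"women": 0.5, "men": 0.5}
sample_counts = {"women": 3, "men": 1}

weights = group_weights(population_share, sample_counts)

# After weighting, each respondent counts as `weight` respondents,
# so the weighted group sizes follow the population split again.
weighted_counts = {g: sample_counts[g] * weights[g] for g in sample_counts}

print(weights)          # men get a weight of 2, women roughly 0.67
print(weighted_counts)  # both groups now count as about 2 respondents each
```

Note that the weighted sample still adds up to 4 respondents; the weights only change how much each group contributes.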
What are the challenges to data weighting?
There are different ways to obtain weights, depending on the available data. When we do have the necessary information about a population, statistical tools can easily apply the weight to the results in the dataset. When all data regarding the population comes from a single source, we can easily cross variables (e.g., industry with company size).
Data weighting becomes more difficult when variables (e.g., industry distribution and company size distribution) are obtained from different sources as this makes crossing variables more challenging.
Data weighting by combining multiple sources entails a multi-step process:
1. First, create a weight based on one variable.
2. Then, based on the weighted data, create another weight for the second variable.
3. Once we have both, multiply them to calculate the final weight.
We can apply the same pattern to more variables. Each new weight is the product of the adjustment for the new variable and the previously created weight. This approach has a serious drawback: after each new weight is applied, the distributions of the previously weighted variables become skewed again. To correct that, we need to repeat the weighting for the variables whose distributions are skewed, multiplying each new adjustment by the latest weight. This can be repeated until we arrive at a weight that fairly expresses all the distributions.
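This repeat-until-stable procedure is commonly known as raking, or iterative proportional fitting. Below is a minimal sketch of it in plain Python, assuming we know only the separate (marginal) population shares for each variable. The respondent records, the target shares, and the names `rake` and `weighted_share` are all hypothetical and serve only to illustrate the idea.

```python
def rake(respondents, targets, max_iter=100, tol=1e-9):
    """Adjust weights variable by variable until all targets are met.

    respondents: list of dicts, e.g. {"industry": "retail", "size": "small"}
    targets: {variable: {category: population share}}, shares sum to 1
    """
    weights = [1.0] * len(respondents)
    for _ in range(max_iter):
        largest_change = 0.0
        for variable, shares in targets.items():
            total = sum(weights)
            # Current weighted size of each category of this variable.
            current = {}
            for person, w in zip(respondents, weights):
                current[person[variable]] = current.get(person[variable], 0.0) + w
            # Multiply the previous weight by the adjustment for this variable.
            for i, person in enumerate(respondents):
                category = person[variable]
                factor = shares[category] * total / current[category]
                weights[i] *= factor
                largest_change = max(largest_change, abs(factor - 1.0))
        if largest_change < tol:  # no variable needed a real correction
            break
    return weights


def weighted_share(respondents, weights, variable, category):
    matching = sum(w for p, w in zip(respondents, weights) if p[variable] == category)
    return matching / sum(weights)


# Hypothetical sample of 8 companies.
sample = (
    [{"industry": "retail", "size": "small"}] * 3
    + [{"industry": "retail", "size": "large"}]
    + [{"industry": "tech", "size": "small"}]
    + [{"industry": "tech", "size": "large"}] * 3
)
# Hypothetical population shares, each taken from a different source.
targets = {
    "industry": {"retail": 0.6, "tech": 0.4},
    "size": {"small": 0.7, "large": 0.3},
}

final_weights = rake(sample, targets)
print(weighted_share(sample, final_weights, "industry", "retail"))
print(weighted_share(sample, final_weights, "size", "small"))
```

The sketch assumes every category in the targets appears at least once in the sample; a category with no respondents cannot be weighted up, and real tools handle that case explicitly.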
Most statistical tools have an option to select a variable to be treated as a weight, so each case is tagged with a number (taken from the weight variable) that is used whenever results are calculated or exported.
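To illustrate what a tool does with such a weight variable, here is a small sketch that computes a weighted average for the sample of three women and one man described earlier. The satisfaction ratings are invented for the example, and `weighted_mean` is a hypothetical helper, not a function from any particular package.

```python
def weighted_mean(values, weights):
    """Average in which each case counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Invented satisfaction ratings (1 to 5): three women, then one man.
ratings = [4, 5, 3, 2]
# Weight variable: about 0.67 for each woman, 2 for the man.
case_weights = [2 / 3, 2 / 3, 2 / 3, 2.0]

unweighted = sum(ratings) / len(ratings)
weighted = weighted_mean(ratings, case_weights)

print(unweighted)  # 3.5, dominated by the overrepresented women
print(weighted)    # about 3.0, with men and women counting equally
```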
When not to weight data
It seems that this solution is almost perfect, but it is not in common use. Why is that? First of all, information about the whole population is often unavailable or unknown to the researchers (e.g., resellers of a certain product, or IT decision-makers in companies from a certain subsector). We often have limited knowledge about specific, niche audiences, as there are no sources gathering this kind of data. This information is crucial when creating any weight; without it we cannot move forward. This is not the only obstacle that stops us from introducing weights to our projects. If the sample size is below n = 50, the weighted data may be affected by the unique characteristics of a single respondent.
When conducting surveys, we often end up with outliers in the sample. Outliers are cases that are extremely different from all other companies or respondents. Such cases skew the data, which may lead to misinterpretation of the results. Let’s imagine we are conducting research among companies about their usage of paper (e.g., for printing). If one of the organizations included in the sample is USPS (United States Postal Service), its paper usage will be far higher than that of the other companies. It is a great risk to keep that company in a small sample, even without weighting. With weighting, all companies in the same weighting category as USPS will be strongly affected by its unusual paper usage.
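A quick sketch shows how strongly one such case can dominate a small weighted sample. All numbers below are invented purely for illustration: four ordinary companies, one extreme outlier, and an assumed weight of 3 for the outlier's category.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def influence_shares(values, weights):
    """Share of the weighted total contributed by each case."""
    total = sum(v * w for v, w in zip(values, weights))
    return [v * w / total for v, w in zip(values, weights)]


# Invented monthly paper usage; the last company is the outlier.
usage = [10, 12, 8, 11, 5000]
# Assumed weights; the outlier's category happens to be weighted up.
weights = [1.0, 1.0, 1.0, 1.0, 3.0]

with_outlier = weighted_mean(usage, weights)
without_outlier = weighted_mean(usage[:-1], weights[:-1])
outlier_share = influence_shares(usage, weights)[-1]

print(with_outlier)     # in the thousands
print(without_outlier)  # 10.25
print(outlier_share)    # almost the entire weighted total
```

Checking each case's share of the weighted total in this way is one simple method for spotting respondents that deserve a closer look before weights are applied.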
In summary, data weighting is a useful tool in extrapolating information about a population from a given sample. However, it can be a difficult process and can’t be done effectively if not enough information about the overall population is known.