Interpreting a Summary Report
Categorical and numeric variables
Generally, summary reports will show tables of percentages for categorical variables, such as age and gender, and tables showing averages for numeric variables. For example, in the summary report from MarketSight below we can see that the first table shows an average of a numeric variable and the second shows percentages and counts from a categorical variable.
Switching between categorical and numeric variables
In most programs it is necessary to change the metadata to switch between the average and percentages. Exceptions to this are:
- In SPSS the user specifies whether to run a mean or frequency manually for each table.
- In Q and Data Cracker you can either change the metadata, or, if it is showing a percentage you can use Statistics - Below or Statistics - Right to add averages to the table of percentages.
Multiple response questions
With multiple response questions there are a couple of different ways of computing percentages:
- Percentage of respondents. The % column in the table below (which was computed using Q) shows the proportion of respondents to have selected a response. For example, the 24% for AAPT is computed by dividing the 122 people to have selected this option by the 498 people that were shown the alternative (which, in this case, was the entire sample). Generally, it is this percentage that is used when reporting data from multiple response questions.
- Percentage of responses. The % Responses value of 6% is computed by dividing 122 by the total of all of the counts (i.e., 122 / (122 + 46 + ... + 401)). This percentage is rarely used and is perhaps never actually useful,[note 1] except as an input to data cleaning.
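The two computations can be sketched in Python as follows. This is a minimal illustration only: the data is a small hypothetical set of respondents (not the survey shown in the table), and the function name and data layout are assumptions made for the example.

```python
from collections import Counter

def multiple_response_percentages(selections, n_shown=None):
    """Return {option: (pct_of_respondents, pct_of_responses)}.

    selections: one set of chosen options per respondent (hypothetical layout).
    n_shown: number of respondents shown the question; defaults to all of them.
    """
    if n_shown is None:
        n_shown = len(selections)
    # Count how many respondents selected each option.
    counts = Counter(option for chosen in selections for option in chosen)
    # Total of all of the counts, used for the percentage of responses.
    total_responses = sum(counts.values())
    return {
        option: (100 * n / n_shown, 100 * n / total_responses)
        for option, n in counts.items()
    }

# Hypothetical data: four respondents, three brands.
respondents = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "C"}]
result = multiple_response_percentages(respondents)
for brand in sorted(result):
    pct_respondents, pct_responses = result[brand]
    print(f"{brand}: {pct_respondents:.0f}% of respondents, "
          f"{pct_responses:.0f}% of responses")
```

Note that the percentages of responses always sum to 100%, whereas the percentages of respondents can sum to more than 100% because each person can select several options.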
Multiple response summary tables with messy data
When the data from a survey is 'neat', all the main data analysis programs used for analyzing surveys produce basically the same results. However, there are a couple of situations where programs can produce wildly different results.
When multiple response data contains missing values the programs produce completely different results. Please see Counted Values and Missing Values in Multiple Response Questions for a discussion of how the programs differ and for instructions for making the programs produce more sensible results.
When not everybody is in a category
In a 'tidy' survey everybody is forced to have an answer in a multiple response question (i.e., people are not permitted to go on to the next question without selecting at least one of the alternatives). However, there are a few scenarios in which not everybody will have a response:
- When there are data integrity problems.
- When people were not compelled to select an option.
- When the multiple response question has been created by the user (e.g., if creating top 2 box scores).
In each of these situations different programs give different results. The two tables below are computed using data where everybody has selected at least one category and, as will occur with all of the standard programs, the results are the same. That is, the percentages in the table on the left, which was computed using Q, are the same, bar rounding, as those on the right side of the second table, which was computed using SPSS. The only substantive difference between these tables relates to the bottom row, where Q shows a NET, which is the proportion of people to have selected one or more of the options, whereas SPSS shows the total.
The two tables below are also computed using Q and SPSS. Further, they use the same data as used in the tables above, except that only the first four categories have been included in the analysis. Note that the Q analysis is almost the same. The percentages for each brand remain the same. The only difference relates to the NET, which is 100% for the table above, but 93% for the table below, which is because only 93% of the sample have selected one of the four brands shown. By contrast, the results for SPSS are all different. In fact, they are all about 8% higher (in relative terms) on the table below compared to the table above. The reason for this is that SPSS uses a somewhat strange formula: each percentage is computed by dividing the number of people to have selected the option by the number of people to have selected one or more of the options in the table. Looking at the AAPT data, in the table above SPSS shows 8.8%, which is computed as 44 / 498, where 498 is the number of people to have selected one or more options (i.e., the total sample). In the table below, however, 9.5% is shown, which is 44 / 462, where 462 is the number of people to have selected one or more of the four brands used to construct the table.
It is important to appreciate that the discrepancy between the results is caused by having data where some people have not selected any of the categories; where the data does not suffer from this problem, the different programs will give the same results. Additionally, the difference is one of those rare instances where one of the programs is producing numbers that are, in most situations, unhelpful (i.e., the results produced by the SPSS calculation in the second table are misleading, because most people would assume that they relate to the proportion of respondents to select the option, and such an interpretation is incorrect). Unfortunately, the SPSS calculation is the 'standard' one and is used by most data analysis programs (which have generally been written under the assumption that people are compelled to choose at least one option).
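The arithmetic behind the AAPT figures can be reproduced with a few lines of Python. This is only a sketch of the two denominators described in the text; the function names are illustrative and are not taken from Q or SPSS.

```python
def respondent_based_pct(n_selected, n_respondents):
    """Share of all respondents who selected the option."""
    return 100 * n_selected / n_respondents

def table_based_pct(n_selected, n_with_any_selection):
    """Share among respondents who selected at least one option in the table."""
    return 100 * n_selected / n_with_any_selection

aapt = 44          # people who selected AAPT
sample = 498       # total sample
any_of_four = 462  # people who selected one or more of the four brands

print(round(respondent_based_pct(aapt, sample), 1))   # 8.8
print(round(table_based_pct(aapt, any_of_four), 1))   # 9.5
print(round(100 * any_of_four / sample))              # 93, the NET
```

The ratio of the two denominators, 498 / 462, is roughly 1.08, which is why every percentage in the four-brand table is inflated by about 8% in relative terms.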
The reason that the programs do it differently
As mentioned, in situations where the NET is 100% the two methods will get the same answer. The table-based method, which divides by the number of people to have selected at least one option in the table (as in the SPSS calculation above), is the traditional approach. In a traditional survey run by a professional researcher there would always be a 'None of these' option, so the NET will always be 100% and both methods get the same results. Thus, the traditional programs use the table-based method because it is faster to compute when there is no missing data. However, in situations where there is a chance that the data will be messy in some way, the respondent-based method, which divides by the number of respondents, is preferable as it has the advantages that:
- The possibility of a problem is flagged by the NET not being 100%.
- The values that are estimated are sensible (i.e., it is much easier to explain that the percentage represents the proportion of respondents than it is to describe the percentage as representing the proportion amongst respondents that have selected at least one option).
Thus, as many of the traditional programs were developed under the assumption that the data is relatively clean, they employ a method that is best in those situations, whereas the more modern programs use the alternative method as it is safer in the modern world where the data is often messy.
How to switch between the different types of multiple response computations
In most programs it is possible to get the program to change the way that it computes the percentages on multiple response questions. In programs that use the respondent-based method the trick is to filter the table so that it only contains respondents that selected one or more options. In programs that use the table-based method the trick is to not tell the program that it is a multiple response question.
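The filtering trick can be illustrated with a short Python sketch. The data and the function name are hypothetical; the point is only that restricting a respondent-based calculation to people with at least one selection reproduces the table-based figure.

```python
def pct_selected(selections, option):
    """Respondent-based percentage: divide by every respondent in the data."""
    if not selections:
        return 0.0
    n_selected = sum(option in chosen for chosen in selections)
    return 100 * n_selected / len(selections)

# Hypothetical data: the last respondent selected none of the options.
respondents = [{"A", "B"}, {"A"}, {"B"}, set()]

# Respondent-based: 2 of the 4 respondents chose A.
print(pct_selected(respondents, "A"))

# Keep only respondents with at least one selection; the same calculation
# now divides by 3 rather than 4, matching the table-based method.
filtered = [chosen for chosen in respondents if chosen]
print(round(pct_selected(filtered, "A"), 1))
```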
Using the summary report to guide data cleaning
At a minimum, the summary report should be reviewed to check that the results make sense, which essentially involves comparing results with things that are already known about the population being studied (this is discussed in detail in Checking Representativeness), and that they are plausible (e.g., if a respondent claims to have 99 mobile phones then this suggests that there is a problem).
More thorough data cleaning involves checking that the two-way relationships between variables are sensible. For example, if checking data on firm profitability, it is useful to review the profitability per number of employees, as it may make sense for a firm to contribute 10 million dollars of profit to an industry, but it is less likely if the firm has 1 employee. This is done by creating Crosstabs. Typically, this is done as a part of the main data analysis rather than as a separate stage of data preparation.
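As a rough sketch of the profitability check, the following Python flags firms whose profit per employee looks implausible. The data layout, the function name, and the cut-off are all assumptions for illustration; a sensible threshold depends on the industry being studied.

```python
def flag_implausible_profit(firms, max_profit_per_employee):
    """Return names of firms whose profit per employee exceeds the threshold.

    firms: list of (name, profit, employees) tuples (hypothetical layout).
    max_profit_per_employee: analyst-chosen cut-off, an assumption to tune.
    """
    flagged = []
    for name, profit, employees in firms:
        # A firm with no employees recorded is also worth a second look.
        if employees <= 0 or profit / employees > max_profit_per_employee:
            flagged.append(name)
    return flagged

firms = [
    ("Firm X", 10_000_000, 400),  # 25,000 per employee
    ("Firm Y", 10_000_000, 1),    # 10,000,000 per employee
]
print(flag_implausible_profit(firms, max_profit_per_employee=1_000_000))
```

A flagged firm is not necessarily wrong; the flag only identifies records that merit investigation.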
The hard part of data cleaning is deciding what to do with “dirty” data. Consider as an example a data file that indicates that the person goes to the beach 99 times a month in summer. The options are to:
- Determine that the problem is that the metadata is incorrect. For example, it may be that a value of 99 does not represent the number of trips to the beach but instead indicates that the person said "don't know". See Correcting Metadata.
- Delete the incorrect value, replacing 99 with a special code indicating the data is invalid. This results in missing values and then there is often a need to use special analysis tools that can address the missing data. See Missing Values.
- Change the value (e.g., replacing 99 with 9). See Recoding Variables.
- Change the value to multiple values and assign probabilities to the different values. Although this can be the most appropriate thing to do, it is extraordinarily unusual for something like this to occur in a real-world commercial study and as such this approach, which is known as multiple imputation, is discussed no further.
- Delete the entire record of data that is dirty, which involves making the assumption that this one error indicates all their data is wrong. See Deleting Respondents.
In order to work out which of these is appropriate we need to understand the cause of the poor data (e.g., key punching errors, corrupted data, respondent error). If we clean the data without understanding the cause of the problems, we run a high risk that other data that we have not spotted as being dirty is inaccurate and that the “clean” data does not accurately represent the market.
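Three of the options above (replacing the value with a missing-data code, recoding it, and deleting the record) can be sketched in Python as follows. The field name `beach_trips`, the `MISSING` marker, and the strategy names are hypothetical and chosen only for this illustration; which strategy is appropriate still depends on the cause of the bad value.

```python
MISSING = None  # stand-in for a special "invalid data" code

def clean_beach_trips(records, strategy, bad_value=99):
    """Apply one cleaning strategy to a hypothetical 'beach_trips' field.

    strategy: "missing" replaces the bad value with MISSING;
              "recode" replaces it with 9, as in the example above;
              "delete" drops the whole record.
    """
    cleaned = []
    for record in records:
        if record["beach_trips"] != bad_value:
            cleaned.append(dict(record))  # copy untouched records
        elif strategy == "missing":
            cleaned.append({**record, "beach_trips": MISSING})
        elif strategy == "recode":
            cleaned.append({**record, "beach_trips": 9})
        elif strategy == "delete":
            continue  # drop the respondent entirely
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return cleaned

data = [{"id": 1, "beach_trips": 4}, {"id": 2, "beach_trips": 99}]
print(clean_beach_trips(data, "missing"))
print(clean_beach_trips(data, "recode"))
print(clean_beach_trips(data, "delete"))
```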
- Note 1: There are a couple of problems with the use of the percentage of responses computation:
- The population that the estimates relate to is the population of responses. This is an artificial population created by the survey and does not directly relate to any population of interest to users of the research (e.g., it does not reflect populations of people or sales).
- Even if there is a belief that the population of responses is relevant, the estimator is biased.