Getting a Data File

From Market Research

A data file contains the individual responses to a survey in a format that permits them to be analyzed by a program specifically designed for the analysis of survey data (e.g., SPSS, Q, DataCracker, Stata). Almost all programs that are used to conduct surveys are able to export data files.

In much the same way that some meals are great while others are inedible, some data files are great and some are unusable. A poor data file is perhaps the biggest cause of the problems people experience when learning how to analyze surveys: a common mistake is to put too little effort into obtaining 'good' data, which makes the analysis much harder than it needs to be.

What survey data looks like

Raw data

When we conduct market research we usually collect data about individual people, households or businesses.[note 1] The data provided by respondents is called the raw data. The table below shows raw data for ten households (from a larger data file). Each row in a table of raw data represents the data from an individual respondent, and there are no blank rows. In this case, each respondent was a household. Each column is referred to as a variable. Each variable is a measure of some characteristic of the respondents.

ID CARRIER INCOME MOVES AGE EDUCATION EMPLOYMENT USAGE Q1a Q1b Q1c
1 2 1 1 3 2 1 9 1 1 1
2 1 4 3 2 2 1 2 1 1 1
3 2 NA 1 NA NA NA 6 1 1 1
4 2 NA 3 6 1 5 7 1 1 1
5 2 NA 1 6 2 4 0 1 1 1
6 2 NA 1 6 NA NA 0 0 1 0
7 2 3 1 4 1 1 3 1 1 1
8 2 2 1 5 2 5 1 1 1 1
9 2 NA 1 5 1 1 0 1 1 1
10 2 3 1 4 1 4 2 1 1 0

Metadata

Data such as that above is not, on its own, readily interpretable. To interpret it, it is also necessary to have metadata explaining what it all means. The metadata, which is sometimes referred to as a data dictionary, is shown below. So, returning to the table above, the data indicates that the fourth household was not a customer of AT&T, that there is no data on the household’s income, that the household moved twice in the preceding 10 years, that the respondent who completed the survey was 65 or older, and so on.

Where a variable is categorical, the values stored in the raw data can only be interpreted by looking at the metadata. In particular, with the MOVES variable, a 1 indicates that a household has not moved, a 2 indicates it has moved once, etc. By contrast, with the USAGE variable, which is numeric, the stored value is the quantity itself: a 1 indicates that the household typically makes one long-distance call a month, a 2 indicates two calls, etc.
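The distinction matters as soon as any arithmetic is done. The following is a minimal Python sketch using the MOVES and USAGE columns of the ten households in the raw data table. The `code_to_moves` mapping is typed in by hand from the value labels for MOVES and covers only the codes that occur in these ten rows; all names are illustrative.

```python
from statistics import mean

# MOVES and USAGE columns for the ten households in the raw data table.
moves_codes = [1, 3, 1, 3, 1, 1, 1, 1, 1, 1]
usage = [9, 2, 6, 7, 0, 0, 3, 1, 0, 2]

# USAGE is numeric, so its values can be averaged directly.
mean_usage = mean(usage)

# MOVES is categorical: the stored codes are labels, not counts.
# Code 1 means 0 moves, code 2 means 1 move, code 3 means 2 moves.
code_to_moves = {1: 0, 2: 1, 3: 2}
moves = [code_to_moves[code] for code in moves_codes]

mean_of_codes = mean(moves_codes)  # averages the codes: misleading
mean_moves = mean(moves)           # averages the decoded counts

print(mean_usage, mean_of_codes, mean_moves)
```

Averaging the codes gives 1.4, which looks like a plausible number of moves, whereas the decoded data gives 0.4. Nothing in the raw data warns that the first figure is wrong; only the metadata does.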

Variable | Variable Label | Value Labels | Variable Type
ID | A unique identification number assigned to each respondent | 1 = first respondent, 2 = second respondent, ... | Categorical
CARRIER | Phone carrier of household | 1 = AT&T, 2 = Other | Categorical
INCOME | Household income bracket (in thousands) | 1 = <7.5, 2 = 7.5-15, 3 = 15-25, 4 = 25-35, 5 = 35-45, 6 = 45-75, 7 = >75 | Categorical
MOVES | Number of times the household has moved in the preceding 10 years | 1 = 0, 2 = 1, 3 = 2, ..., 8 = 7, 11 = >10 | Categorical
AGE | Age of the respondent | 1 = 18-24, 2 = 25-34, 3 = 35-44, 4 = 45-54, 5 = 55-64, 6 = 65+ | Categorical
EDUCATION | The highest level of education achieved | 1 = Did not finish school, 2 = High School, 3 = College, 4 = Postgraduate | Categorical
EMPLOYMENT | Employment status of the respondent | 1 = Full-time, 2 = Part-time, 3 = Student, 4 = At home, 5 = Retired, 6 = Unemployed | Categorical
USAGE | The typical monthly number of long-distance telephone calls by the household | (none) | Numeric
Q1a | Aware: AT&T | 0 = No, 1 = Yes | Categorical
Q1b | Aware: Verizon | 0 = No, 1 = Yes | Categorical
Q1c | Aware: CenturyLink | 0 = No, 1 = Yes | Categorical
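To make the role of the data dictionary concrete, the following Python sketch decodes the fourth household using a hand-typed slice of the metadata. The dictionary layout and the `decode` function are assumptions made for illustration, not a format used by any particular program, and `None` stands in for NA.

```python
# A small slice of the data dictionary: variable -> {code: label}.
value_labels = {
    "CARRIER": {1: "AT&T", 2: "Other"},
    "MOVES": {1: "0", 2: "1", 3: "2"},  # only the codes spelled out in the metadata
    "AGE": {1: "18-24", 2: "25-34", 3: "35-44", 4: "45-54", 5: "55-64", 6: "65+"},
    "EMPLOYMENT": {1: "Full-time", 2: "Part-time", 3: "Student",
                   4: "At home", 5: "Retired", 6: "Unemployed"},
}

# Household 4 from the raw data table; None stands in for NA.
household_4 = {"CARRIER": 2, "INCOME": None, "MOVES": 3,
               "AGE": 6, "EMPLOYMENT": 5, "USAGE": 7}


def decode(record, labels):
    """Replace codes with labels for categorical variables; leave the rest as-is."""
    decoded = {}
    for variable, value in record.items():
        if value is None:
            decoded[variable] = None  # missing stays missing
        elif variable in labels:
            decoded[variable] = labels[variable][value]
        else:
            decoded[variable] = value  # numeric variable such as USAGE
    return decoded


decoded_4 = decode(household_4, value_labels)
print(decoded_4)
```

The decoded record matches the reading given earlier: carrier other than AT&T, no income data, two moves, aged 65 or older.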

The metadata table above shows the minimal metadata necessary to analyze a survey. However, better data files will contain more information. In particular:

  • Question Type. For example, note that the last three variables, Q1a, Q1b and Q1c are related and form a part of a single question (which asked people which of the companies they had heard of); a good data file will contain metadata showing that these are linked together.
  • Versioning. For example, changes to question wording that occurred during the data collection process and different translations.
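As a rough illustration of why Question Type metadata helps, the Python sketch below links Q1a, Q1b and Q1c into a single question and summarizes them together. The `question` structure is hypothetical (real file formats record this link in their own ways); the 0/1 values are taken from the ten households in the raw data table.

```python
# Q1a, Q1b, Q1c for the ten households in the raw data table (1 = Yes, 0 = No).
rows = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
    (0, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
]

# Hypothetical Question Type metadata linking the three variables into one question.
question = {
    "name": "Q1",
    "type": "multiple response",
    "variables": ["Q1a", "Q1b", "Q1c"],
    "labels": ["AT&T", "Verizon", "CenturyLink"],
}

# With the link recorded, one summary covers the whole question.
n = len(rows)
awareness = {
    label: 100 * sum(row[i] for row in rows) / n
    for i, label in enumerate(question["labels"])
}
print(awareness)
```

Without the link, an analysis program can only treat the three variables as three unrelated Yes/No questions, and the analyst has to assemble the combined table by hand.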

A good data file will contain both the raw data and the metadata together in a single file. If you have two files, one containing the raw data and the other containing the metadata, then you do not actually have a 'data file'; you have the material needed to create one, but you still have to create it. Many data analysis programs provide tools that allow you to import the raw data and then enter the metadata, but this generally needs to be done manually (i.e., by retyping it or cutting and pasting each field of information); this is a time-consuming and error-prone process which should be avoided where possible.

Data file formats

Data collection programs export data files in a specific format. Most programs provide multiple formats for exporting, but these formats can differ markedly in terms of their usefulness.

Text files

The simplest data files are called 'text data files'. It is generally a very bad idea to obtain the data from a survey as a text file. This is because when data is obtained as a text file there will be one of two problems:

  1. It will contain no metadata, which makes it at best difficult to analyze and at worst impossible (e.g., if you do not know that a value of 2 represents an age of 25-34, then there is no way to interpret the data).
  2. It will contain text instead of numbers for all the data. Initially this may appear to be useful, but in practice it is a massive problem, as:
    • Most programs for the analysis of survey data do not permit you to do analysis with data in this format, and so you will read the data into the program and then discover that you either cannot do even the most basic analysis or need to spend a lot of time re-formatting the data to make it useful.
    • Many of the important features of the survey will not be evident in the data file. For example, if you have asked a question getting people to give ratings from 0 to 10, then when you create a table from a text file the ratings will be ordered as text: 0, 1, 10, 2, 3, .... Similarly, Grid and Multiple Response questions will generally need to be treated as if they were multiple Single Response questions.

The same ten households stored as text rather than numbers:

ID | CARRIER | INCOME | MOVES | AGE | EDUCATION | EMPLOYMENT | USAGE | Q1a | Q1b | Q1c
1 | Other | <7.5 | 0 | 35-44 | High School | Full-time | 9 | Yes | Yes | Yes
2 | AT&T | 25-35 | 2 | 25-34 | High School | Full-time | 2 | Yes | Yes | Yes
3 | Other | NA | 0 | NA | NA | NA | 6 | Yes | Yes | Yes
4 | Other | NA | 2 | 65+ | Did not finish school | Retired | 7 | Yes | Yes | Yes
5 | Other | NA | 0 | 65+ | High School | At home | 0 | Yes | Yes | Yes
6 | Other | NA | 0 | 65+ | NA | NA | 0 | No | Yes | No
7 | Other | 15-25 | 0 | 45-54 | Did not finish school | Full-time | 3 | Yes | Yes | Yes
8 | Other | 7.5-15 | 0 | 55-64 | High School | Retired | 1 | Yes | Yes | Yes
9 | Other | NA | 0 | 55-64 | Did not finish school | Full-time | 0 | Yes | Yes | Yes
10 | Other | 15-25 | 0 | 45-54 | Did not finish school | At home | 2 | Yes | Yes | No
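The ordering problem described above is easy to reproduce. The short Python sketch below assumes a 0 to 10 rating question whose answers have arrived as text; the variable names are illustrative.

```python
# Ratings from 0 to 10 as they would arrive from a text file: as strings.
ratings_as_text = [str(rating) for rating in range(11)]

# Sorting text compares character by character, so "10" lands before "2".
text_order = sorted(ratings_as_text)

# Converting to numbers first restores the intended order.
numeric_order = sorted(int(rating) for rating in ratings_as_text)

print(text_order)
print(numeric_order)
```

A program that only sees text has no way of knowing that the second ordering is the intended one, which is why this kind of information needs to travel with the data as metadata.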

CSV Files and Excel files

A CSV (comma-separated values) file is generally the best of the text file formats (although this is very much a case of being the tallest dwarf). It uses a comma to separate the variables within each row.
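As a small illustration of what a CSV file does and does not carry, the Python sketch below reads a two-line CSV using the standard `csv` module. The CSV text is a made-up export of the first household's first few variables, not a real file.

```python
import csv
import io

# Two lines of a hypothetical CSV export (header plus the first household).
csv_text = "ID,CARRIER,INCOME,MOVES,AGE\n1,2,1,1,3\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
first = rows[0]

# Every value arrives as a string, and nothing says what a CARRIER of "2" means.
print(first)
print(type(first["CARRIER"]).__name__)
```

The file supplies variable names and values, but no variable labels, value labels or variable types; those still have to come from somewhere else.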

Tab delimited files

This is similar to a CSV file, except that a tab character is used instead of a comma. Generally, if data is in this format it is appropriate to open it in Excel and then save it as a CSV file.
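Where Excel is not to hand, the same conversion can be sketched in a few lines of Python with the standard `csv` module. The tab-delimited text below is a made-up example; a real export would be read from a file instead.

```python
import csv
import io

# A hypothetical tab-delimited export (header plus the first household).
tab_text = "ID\tCARRIER\tINCOME\n1\t2\t1\n"

# Read with a tab delimiter, then write back out with the default comma delimiter.
reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(reader)

csv_text = out.getvalue()
print(csv_text)
```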

Fixed width (ASCII) files

A fixed width file is one where each character position in a row has a specific meaning. For example, in the data below the first column may represent the first variable, the second and third columns together may represent the second variable, and so on. This format was invented because it took up little hard-disk space, which was an important consideration in the 1960s and 1970s. It is rarely used today and is the worst of all of the file formats, as it cannot readily be used with Open-Ended questions and most modern programs will not read it. Generally, if data is in this format it is appropriate to open it in Excel and then save it as a CSV file.

00001
01200
01203
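Reading such a file means slicing each line at agreed character positions. The Python sketch below does this for the three lines above. The layout is hypothetical: its first two entries follow the example just described (column 1, then columns 2 and 3 together), while the third variable and all the variable names are assumptions made for illustration.

```python
# Hypothetical layout: (variable name, start, end), 0-based with an exclusive end.
# V1 and V2 follow the example in the text; V3 is assumed for illustration.
layout = [("V1", 0, 1), ("V2", 1, 3), ("V3", 3, 5)]

lines = ["00001", "01200", "01203"]


def parse_fixed_width(line, layout):
    """Slice one line into integer fields according to the layout."""
    return {name: int(line[start:end]) for name, start, end in layout}


records = [parse_fixed_width(line, layout) for line in lines]
print(records)
```

The file itself contains no hint of where one variable ends and the next begins, so without a separate layout description the data cannot be read at all.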

Good data files

The good formats

The gold standard data file is an IBM SPSS Data Collection Model data file (also known as a Dimensions, MDT or MDD data file). This file format contains all the different types of metadata. This data file is only created by the top-of-the-range IBM data collection programs and can only be read by IBM data products and a small number of other products (Q and DataCracker).

The next-best format is the Triple S format. It is a little more widely used than the IBM SPSS Data Collection Model format, but it is generally only available in the more expensive data collection programs.

The industry standard 'good' file format for data is an SPSS .SAV data file (usually called a 'dot sav' file). This is not quite as good as the other two formats, as it does not contain the versioning information and it contains only very limited Question Type information (it does not support the various types of Grid question). However, all good data collection programs can export in this format. Refer to SPSS Data File Specifications for details on how these files are best set up.

Occasionally data collection programs will export both a text file and an SPSS .sps file (also known as a syntax file). The syntax file is actually a program which contains instructions for turning the text file into an SPSS .SAV file. SPSS is the only program that can always read these files, but Q can read them in some circumstances.

Appropriate set up in the good formats

Obtaining a good data file is not just a case of specifying the desired format. In particular, in the case of SPSS .SAV data files, it is quite common for them to be created with incorrect values, incorrect metadata, or both. The most common problems are:

  • Incorrect values for options not selected in multiple response questions. That is, the files use the same value (commonly a 0 or a special missing-value category, usually called SYSMIS, NA, or NaN) to indicate that somebody was not asked a question as they use to indicate that somebody did not select an option in a Multiple Response question.
  • Labels that have been truncated (e.g., saying 'Please rate your satisfaction with the following ba'), making it impossible to determine what the data means (except by reviewing the questionnaire).

Most survey analysis programs will have some facilities for cleaning such poor data, but it is generally advisable to try instead to obtain a data file that does not contain such problems.
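To show what such cleaning involves for the multiple response problem, here is a minimal Python sketch. It assumes that a separate record exists of who was actually asked the question (the hypothetical `was_asked` list); without that information the two meanings of a missing value cannot be told apart. The data values are made up, and `None` stands in for the missing-value category.

```python
# Hypothetical export in which "not selected" and "not asked" were both stored as missing.
q1a_exported = [1, None, 1, None, None]

# Hypothetical filter recording whether each respondent was shown the question.
was_asked = [True, True, True, False, True]


def repair(values, asked):
    """Give 'not selected' its own value (0) and keep 'not asked' as missing."""
    repaired = []
    for value, shown in zip(values, asked):
        if not shown:
            repaired.append(None)  # never saw the question
        elif value is None:
            repaired.append(0)     # saw it, did not select the option
        else:
            repaired.append(value)
    return repaired


q1a_repaired = repair(q1a_exported, was_asked)
print(q1a_repaired)
```

After the repair, two of the four respondents who were asked selected the option (50%). Before it, there was no way to tell whether the base was three, four or five respondents.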

Notes

  1. There are many other units of analysis that are required from time to time, such as occasions, products and transactions.