2.2.06: Transform Data


Broadly speaking, transforming data means applying the same operation to every individual data point and then using the transformed version in subsequent analysis steps. You may need to transform quantitative data that are flawed or that do not show the distribution of values your analysis requires. You may also choose to convert quantitative data to a qualitative form, or assign quantitative values to qualitative data, in order to allow for a richer analysis. A full treatment of these issues is beyond the scope of this Guide. Our purpose here is to highlight situations that may arise and to alert you to the possible need for data transformation.

With quantitative data: Statistical tests – including familiar ones like t-tests for significance, or ANOVA (analysis of variance) – have preconditions that the data must meet in order for the test to be useful and valid. Some tests, for example, only work if the “noise” in the data (roughly, the errors that arise inevitably from unobservable factors and other sources of random variation) is “normally distributed” (more on this below, and in section 2.3.01). The good news is that even if your data are not normally distributed, there are alternatives that still allow you to conduct careful statistical analyses. It is essential, however, to determine which situation you are facing. If you will be conducting statistical inference tests as part of your evaluation, it would be appropriate to consult a statistician and/or data analysis resources. Some references are included in the Data Analysis section of this Guide.

Some data transformations are dictated by statistical considerations. Others are called for simply on practical grounds, even for less elaborate analyses. Consider the following examples:

  • Missing values: Check the software program you are using to determine how it handles blanks or non-responses. Some programs recognize a blank as a missing value; others require that the blank be replaced with something like “999” or a special character like “*”. The transformation then involves reviewing the data for each variable and replacing any blanks with the appropriate entry.
  • Uneven data distribution: Sometimes you will find that there are very few responses in a particular category or range of values. In these cases, consider transforming a continuous variable into a categorical one. For example, if age is one of your variables and a histogram or bar graph of the actual data bounces up and down a lot, you might bundle responses into categories that give a smoother bar chart. You might end up with a new categorical variable called “age range”, with possible values of below 18; 18 to 23; 24 to 29; 30 to 36, and so on.
  • Skewed data: A variable that is “normally distributed” will look bell-shaped in a simple distribution plot – that is, it will have many responses sitting in the middle, and progressively fewer below and above that middle. If this is not the case and you find that the variable’s distribution is skewed to the left or right, you can apply a statistical transformation (e.g., the log of the values) and analyze the transformed values rather than the values themselves. Again, for this kind of transformation you will need to consult a statistician or the relevant literature.
  • Item reversals: When writing a set of survey questions with scaled responses it is often appropriate to phrase some so that they are worded oppositely to the other questions. In a measure assessing interest in science, for example, where the response scale ranges from 1 = “strongly disagree” to 5 = “strongly agree”, some items may be phrased so that strong interest in science comes out as a 5 while others may be phrased negatively (“Science classes are really boring.”) so that strong interest comes out as a 1. (This can make the data less vulnerable to those who simply want to run down the list checking all the 1’s or all the 5’s.) In this case, in order to summarize the data meaningfully you need to create a series in which higher scores all mean the same thing. The transformation involves converting the scores on the negatively phrased items so that a 1 gets converted to a 5, and so on. (The usual trick is to replace the actual score with New Value = (High Value + 1) – Original Value, so that 1’s become 5’s, 2’s become 4’s and so on.)
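
The four transformations in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function names, the default missing-value code of 999, and the catch-all “37 and above” range are assumptions made for the example, and the log transform shown only applies to strictly positive values.

```python
import math

MISSING_CODE = 999  # assumed placeholder; use whatever your software expects


def fill_missing(values, missing_code=MISSING_CODE):
    """Replace blanks (None or empty/whitespace-only strings) with a code."""
    return [
        missing_code if v is None or (isinstance(v, str) and v.strip() == "") else v
        for v in values
    ]


def age_range(age):
    """Map an age in years to the illustrative ranges from the text."""
    if age < 18:
        return "below 18"
    if age <= 23:
        return "18 to 23"
    if age <= 29:
        return "24 to 29"
    if age <= 36:
        return "30 to 36"
    return "37 and above"  # assumed catch-all; the text leaves later ranges open


def log_transform(values):
    """Natural log of each value; raises ValueError for zero or negatives."""
    return [math.log(v) for v in values]


def reverse_item(score, high=5):
    """Reverse a scaled response: New Value = (High Value + 1) - Original Value."""
    return (high + 1) - score


# Hypothetical responses, for illustration only.
print(fill_missing([3, "", None, 5]))
print([age_range(a) for a in (17, 18, 29, 36)])
print([reverse_item(s) for s in (1, 2, 3, 4, 5)])
```

Whichever tool you use, apply each transformation to a copy of the data so that the original values remain available for checking.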

Another type of transformation that may be relevant in certain cases is “scoring” the data. The simplest way to think about scoring is to imagine a test given in school to assess the content knowledge of a group of students. There may be a number of types of questions on the test, including multiple choice and short answer. When the teacher receives the completed test, she or he scores each item to eventually give the student a summary score, or grade, on the test. This might involve weighting different items differently (e.g., multiple choice questions are all worth two points while short answer questions are worth five points). The teacher’s decision about whether an open-ended short answer response is correct or not is more akin to coding, yet once the code (right/wrong/partial credit) is given, the teacher will score the item and the whole exam based on how items are weighted and how items are grouped. The result is a single “score” or set of scores that combines the information provided in various item responses. In the analysis of a survey, the same principles apply.
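
The classroom example above can be sketched as a small weighted-scoring routine. The point values come from the example in the text; the credit fractions, including counting partial credit as half, and all names are assumptions made for illustration.

```python
# Point values from the example in the text.
POINTS = {"multiple_choice": 2, "short_answer": 5}

# Assumed credit for each code; treating "partial" as half credit is illustrative.
CREDIT = {"right": 1.0, "partial": 0.5, "wrong": 0.0}


def score_test(items):
    """Sum the weighted credit over (item_type, code) pairs."""
    return sum(POINTS[item_type] * CREDIT[code] for item_type, code in items)


# Hypothetical coded responses for one student.
responses = [
    ("multiple_choice", "right"),
    ("multiple_choice", "wrong"),
    ("short_answer", "partial"),
    ("short_answer", "right"),
]
print(score_test(responses))  # prints 9.5
```

Keeping the coding step (right/wrong/partial) separate from the weighting step makes it easy to change the weights later without recoding the responses.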

With qualitative data: Many of the same concerns about flaws in the data apply to qualitative data. If responses are missing or a particular intended category of respondents was not reached, more data collection may be needed. Concerns about having data with enough variation also apply to qualitative data, especially where you are trying to develop a causal explanation. In this situation, if the qualitative data do not show enough variation across the categories you selected, or enough instances in each category, it may be necessary to combine smaller categories into larger ones in order to make reasonable statements from the data. It may also be useful to convert data from one form to another, either from quantitative to qualitative, or from qualitative to quantitative. For example, you might take quantitative data and transform them into a set of narrative summaries. Or you might take narrative data and transform them into quantitative data by assigning each chunk of narrative a numeric rating on frequency of mention or emotional intensity of comments. Once you have applied transformations and/or conversions systematically to the data, re-examine the new version using the process you adopted originally to make sure that the data will now work well for your purposes.
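
As one illustration of combining smaller categories into larger ones, the sketch below relabels any coded category with too few instances under a single merged label. The threshold, the “other” label, and the example codes are all hypothetical; in practice, which categories belong together is a substantive judgment rather than a purely mechanical one.

```python
from collections import Counter


def combine_small_categories(codes, min_count, merged_label="other"):
    """Relabel categories with fewer than min_count instances as merged_label."""
    counts = Counter(codes)
    return [c if counts[c] >= min_count else merged_label for c in codes]


# Hypothetical codes assigned to interview excerpts.
codes = ["mentor", "mentor", "lab", "mentor", "field trip", "lab", "club"]
combined = combine_small_categories(codes, min_count=2)
print(Counter(combined))
```

Counting the combined codes, as in the last line, is also a simple way to turn coded narrative into a frequency-of-mention measure.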
