II.2.03: Review and Clean Data – Netway Help System

Evaluation Implementation – 2.03 Review and Clean Data

Once data collection is complete, there are still some important steps which must be addressed before the actual analysis can begin. Despite careful planning to ensure high quality data, the contingencies of data collection and the variation introduced by the ways in which participants actually respond to different measures will almost inevitably produce some data which either cannot be used, or must be “cleaned” in order to be useful.

First, it is important to review the data for quality. Like data entry, this should be done early and often, if feasible. This way, if the review reveals especially poor quality data, there may still be options to correct the data collection, entry, and management plans so as to improve data quality. Of course, it is essential to document any changes to the way data are handled. The review of data is best performed by someone who is close to the program or evaluation (so she or he understands the context and the questions or prompts well enough to notice when the data seem to be of questionable quality or utility) yet who has not been very actively entering data (so she or he brings “fresh eyes” or a new perspective to the data). Some potential considerations in data reviewing include:

Are the responses legible?
Are all important questions answered?
Are the responses complete?
Is all relevant contextual information included (e.g., date, time, place)?
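For data that have already been entered electronically, part of this review can be automated. The following is a minimal sketch in Python, not a prescribed method: it flags records in which a required question or a piece of contextual information is missing or blank. The field names (`date`, `place`, `q1`, `q2`) are hypothetical placeholders and would need to be replaced with the fields of the actual instrument. Legibility and overall quality still require a human reviewer.

```python
# Hypothetical field names; replace with the fields of the actual instrument.
REQUIRED_FIELDS = ["date", "place", "q1", "q2"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in required:
        value = record.get(field)
        # A numeric 0 is a real answer, so only None and blank text count as missing.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def review_dataset(records, required=REQUIRED_FIELDS):
    """Map the index of each incomplete record to its list of missing fields."""
    issues = {}
    for index, record in enumerate(records):
        missing = missing_fields(record, required)
        if missing:
            issues[index] = missing
    return issues
```

Running `review_dataset` on the entered records early and often gives the reviewer a short list of records to inspect by hand, while there is still time to adjust the data collection or entry plan.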

The process of cleaning data differs depending on what types of data are involved, but generally this step involves making slight changes to the way the data are represented in order to make them usable for analysis. It is essential to always keep raw data in an unedited format. Any data cleaning should be done in a way that clearly differentiates cleaned data from raw data (e.g., have a separate spreadsheet or worksheet for cleaned data). Like so many other aspects of evaluation planning and implementation, there are no absolute right or wrong answers to the question of how to clean data. It takes a mixture of carefully reasoned professional judgment and rigorous tracking and documentation, so that decisions can be described and justified at later stages of the analysis and reporting process. Four possible data problems and related approaches to data cleaning are presented below:

An online survey has an item asking for a numerical value, such as the number of participants involved in a program. In the spreadsheet, most responses look sensible, but there are a handful of responses that are listed as dates, such as 8/1/2011. At first glance, it might seem as though the respondent seriously misunderstood the question, or the data had been entered incorrectly. Yet it is also possible that the formatting of the column in the spreadsheet had a default setting to convert certain types of entries to dates. Thus, if the data were entered as a range (8 – 11 participants), the spreadsheet may have converted the entry to 8/1/2011. If you are certain that the column was formatted in such a way, it would make sense to clean that data point back to 8 – 11; if you are not certain, you may simply have to exclude these responses.
A survey has a series of items which all have a five-point response scale on a range of “Strongly agree” to “Strongly disagree.” To ensure that the respondent was carefully reading each item, you have worded some of the items so as to reverse the scale (e.g., one item is “This program was really fun” and another item is “I was very bored during this program.”) If a respondent filled in the survey by selecting only “Strongly agree” for every item, the contradiction in responses suggests that she or he didn’t actually read the survey items. In this case it would make sense to clean the data set by not using that individual’s responses.
A phone interview asks a set of semi-structured questions that take most participants 45 minutes to an hour to answer. In the middle of the recording of an interview, a respondent launches into an irrelevant, highly personal, and sensitive account of a real-life event. In this case, it would make sense to remove this section of sensitive and personally identifying information from the data before coding.
A focus group protocol is administered with focus groups in 10 locations. One of the locations is right next to an airport. The audio transcription of the proceedings includes several moments where the noise of jets taking off obscures the voices of all but the loudest participants. In this case, it would make sense to remove the data from consideration in later analysis.
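The first two problems above lend themselves to a simple automated pass. The following is a minimal sketch in Python, under stated assumptions, rather than a definitive procedure: it copies each raw row before changing anything, flags values in a numeric field that look like dates, excludes respondents who gave the identical answer to every scale item, and records each decision in a log. The function names, the field names in the usage note, and the date pattern (`m/d/yyyy`) are all hypothetical. The identical-answer check only signals a contradiction when the scale includes reverse-worded items, as in the example above.

```python
import copy
import re

# Assumed pattern for entries a spreadsheet has converted to dates, e.g. 8/1/2011.
DATE_LIKE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")


def looks_like_date(value):
    """True if the value has the form of a slash-separated date."""
    return bool(DATE_LIKE.match(str(value).strip()))


def is_straight_lined(answers):
    """True if more than one item was answered and every answer is identical."""
    return len(answers) > 1 and len(set(answers)) == 1


def clean(raw_rows, count_field, scale_fields):
    """Return (cleaned_rows, log) without modifying raw_rows."""
    cleaned, log = [], []
    for index, raw_row in enumerate(raw_rows):
        row = copy.deepcopy(raw_row)  # the raw data stay unedited
        if is_straight_lined([row[f] for f in scale_fields]):
            log.append((index, "excluded: identical answer to every scale item"))
            continue
        if looks_like_date(row[count_field]):
            # Restoring the original entry is a judgment call, so only blank it here.
            log.append((index, count_field + " looked like a date; set to None"))
            row[count_field] = None
        cleaned.append(row)
    return cleaned, log
```

For instance, `clean(rows, "participants", ["fun", "bored"])` would return the cleaned rows together with a log that documents which rows were changed or excluded and why, so that the decisions can be described and justified later. Whether a flagged value is restored, blanked, or excluded remains a matter of professional judgment.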

Qualitative data should also be “cleaned”. Examine the set of narrative, audio, video or photographic information to make sure it is in a form that can be analyzed. Is the number of unanswered questions so great that some interviews must be discarded? If participants were asked to follow a set of instructions, did they all follow them? Is there enough variation across themes and/or dimensions to establish meaningful explanations? If not, consider gathering additional data. Qualitative data may also require an extra step of assigning broader categories to finer categories, or “super-coding”, before the data are ready for further analysis.