Evaluation Planning – 3.04 Measurement and Measures
Measurement is a complex topic. Measurement approaches must be tailored to the specific program and its lifecycle phase, which makes it difficult to specify simple rules for developing a measurement plan for any given study. There is a large literature on applying the wide variety of measurement approaches that might be used in an evaluation, and it is important that the Evaluation Champion be familiar with the range of strategies and methodologies available. Widely available resources can be useful when thinking about measurement planning (e.g., http://www.socialresearchmethods.net/kb/measure.php).
This Protocol step is broken into two parts: “Defining Variables and Measurement Strategies” and “Identifying Measures.” Evaluation Champions may find that it makes the most sense to think about the first part, which builds from work on evaluation questions, directly after developing the questions. However, it may be advisable to look ahead to sampling and analysis before returning to work on identifying measures, which can be an intensive research and development process. Though the entire Protocol is designed to be iterative, you may find that the steps in this stage are particularly difficult to think about separately. This is because decisions about measures and measurement, sampling, design and analysis all depend on one another, and therefore must be considered simultaneously.
Defining Variables and Measurement Strategies
Before you can begin to think about measurement, you first need to define clearly what you are trying to measure. Begin by reviewing the evaluation questions. The variables are the things that you think influence or are influenced by something else. For example, if the evaluation question is whether participation in the program is related to an increase in science knowledge, then the variables are participation in the program and science knowledge. When a variable is an abstract idea that is derived from empirical evidence but may not be directly observable (such as behaviors, attitudes, or knowledge), it is called a construct. It is important to clarify the variables and constructs so you know exactly what you are trying to measure.
After the variables have been identified, they must be defined. Often, what “science knowledge” means to one stakeholder may be different from what it means to another. This is another step in the Protocol when it is critical to bring in multiple stakeholder perspectives. Ideally, the working group will invite stakeholders to participate in a discussion that will establish internal working definitions for the variables identified in each of the evaluation questions.
In addition to developing definitions, the working group will need to brainstorm indicators of the presence of the variables they have defined. This can be particularly tricky for constructs. By their very nature, constructs are not directly observable, so the working group needs to think about how they would “know” the construct exists. This may be best described as brainstorming what a construct “looks like.” For example, an indicator of “science knowledge” could be the ability to explain a science concept to a peer, or adequate performance on a science test, while an indicator of “interest in science” could be signing up for additional science courses, or engaging in science-based hobbies.
Once the variables are clear, you can begin thinking about measurement. There are many different strategies that you can use to measure variables. Some strategies will be more appropriate than others depending on the variable you are trying to measure and the context. Examples of common measurement strategies are surveys, observations, interviews, and focus groups. Using a table like the one shown below can help working group members think through these questions systematically.
| Evaluation Question | Constructs/Variables | How is it defined? | What does it look like? | How might it be measured? |
|---|---|---|---|---|
| Is program participation related to an increase in science knowledge? | Science knowledge | Understanding of science concepts | Ability to explain science concepts to a peer | Track and record peer-to-peer explanations using video and an observational checklist |
The process of identifying variables, defining them, and brainstorming measurement strategies should be repeated for each question. When this process is complete, the working group should be able to easily identify the number and types of measures that they will need. Often this process will lead to the realization that one measure may be able to address (or collect data that will address) more than one evaluation question. Once this list of desired measures has been created, the working group should reconsider the following about their proposed strategies before moving on to identifying measures:
- Stakeholder interests (credibility of the strategy)
- The program’s lifecycle stage
- Implied sampling, design, and analysis strategies (feasibility)
- Accuracy
- Usefulness
- Context
Identifying Measures
Choosing appropriate measures for each evaluation question will likely be one of the central challenges in writing and executing the evaluation plan. Measures will generally fall into three categories:
- demographic or descriptive measures – measures that track (simply count) events and/or participants and, if relevant, their characteristics
- process measures – measures that capture the type or quality of the program event or interaction
- outcome measures – measures that capture effects of the program, including associated change for a group, significant change for an individual, or causal relationships between activities and outcomes.
Many programs will use more than one of these types to address their evaluation questions, and each of these types could be formatted in a variety of different ways.
There are three main strategies that can be used when attempting to identify a measure that is appropriate for a given program and question. These include:
- Using existing measures (already in use within the organization)
- Locating measures in the literature
- Developing a measure (from “scratch” or by modifying one of the above)
In order to maximize accuracy, as well as credibility with many stakeholders, it is often desirable to use an established and validated measure, but at times modifying an existing measure or creating a new one is necessary. The following questions/steps can help identify measures, and are listed in the sequence in which we typically approach the task.
Existing measures (already in the office)
Begin by looking at the measures currently being used by the program. Perhaps program practitioners have developed and used their own measure for quite some time, or maybe they already have a measure that fits their needs.
- What measures are currently used in the program?
- Are there specific measures that have been mandated by a funding agency?
- Are the current measures tried and true, with literature to support them? (This would be ideal; if the measures don't fit this category, there may be a tradeoff between spending the resources to locate an appropriate established measure and using the existing one.)
- Do these measures match the evaluation questions and are they appropriate given the evaluation lifecycle? (This is a critical issue – existing internal measures may have been developed for a different evaluation purpose and for different evaluation questions. Be sure they really will serve your current purpose and questions.)
Locating Measures in the Literature
A second option is to locate measures that have already been developed and are supported in the literature. This makes sense when an evaluation question has no measure in place, or when the measure in use has not been validated. Measures obtained from other sources will need to be cited appropriately by the program, and some may have fees associated with them. Networking with colleagues can also help to broaden your measures library. Do you have colleagues in another department, office, or geographic location who may be measuring the same activity or outcome of interest? Have they found a measure that is working well?

Sometimes you may find a measure in the literature that has several subscales (portions of the measure that each address a different construct). You may not be interested in all of the subscales included in the measure, but you may find that one or more of them addresses your outcome of interest. Sometimes these relevant subscales can be found in measures that, on the surface or taken as a whole, seem unrelated to your target outcome (see the scoring sketch after the list below). The following are some questions to consider when looking for measures in the literature:
- Which evaluation questions do not yet have an identified measure?
- Which evaluation questions may already have an existing measure, but the working group would prefer to have a measure with stronger support in the literature?
- What measures are colleagues using to measure similar outcomes, and are they available for this program?
- What does a literature search for related programs, activities or outcomes reveal? A validated measure may be identified either in a research article or through a related article that is cited in the bibliography.
- Is there a larger measure that has a sub-scale that could be used to measure the outcome of interest?
- Is there a measure that can address more than one evaluation question?
- Is there an existing measure that can be modified to fit the program’s needs by changing a word or two?
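As a concrete illustration of the subscale point above, here is a minimal sketch (Python, with entirely hypothetical item names and subscale assignments) of scoring only the subscale of interest from a larger measure:

```python
# Minimal sketch: scoring one relevant subscale of a larger measure.
# Item names and the subscale mapping below are hypothetical.
responses = {  # one respondent's answers on a 1-5 scale
    "q1": 4, "q2": 5, "q3": 2, "q4": 3, "q5": 5, "q6": 1,
}

subscales = {
    "science_interest": ["q1", "q4", "q5"],  # the outcome we care about
    "science_anxiety": ["q2", "q3", "q6"],   # part of the measure, but not used here
}

def subscale_score(responses, items):
    """Mean of the items belonging to one subscale."""
    return sum(responses[i] for i in items) / len(items)

print(subscale_score(responses, subscales["science_interest"]))  # 4.0
```

Only the subscale of interest is scored here; whether the remaining items must still be administered depends on any usage conditions set by the measure's authors.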
Developing Measures
Developing a measure is likely to be appropriate for newer programs with no history of prior measurement. Any time a new measure is created, people will question how well it actually measures what it is supposed to measure (validity) and whether it does so consistently and dependably (reliability). Eventually, new measures will be expected to undergo testing to demonstrate their reliability and validity. Measures found in the literature have typically been tested already, and a good measure will report just how reliable and valid it is; this is an advantage of using a measure from the literature.
Creating new measures is disadvantageous in that it will not allow you to cite evidence of reliability and validity for the current evaluation, nor compare results to those obtained by others. However, creating a new measure may be the only option available for many programs, and creating a new measure, pilot testing it, and assessing and refining it can, over time, be the foundation for a good new measure. (See other resources, such as http://socialresearchmethods.net, for information on validity and reliability.)
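When a new measure is pilot tested, one common first check is internal consistency. The following is a minimal sketch, using invented pilot data, of how Cronbach's alpha (one widely used internal-consistency statistic) might be computed; it illustrates the idea but is no substitute for a proper psychometric evaluation:

```python
# Minimal sketch: Cronbach's alpha from pilot data (rows = respondents,
# columns = items). The data below are invented for illustration.
from statistics import variance

data = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]

k = len(data[0])                                   # number of items
item_vars = [variance(col) for col in zip(*data)]  # variance of each item
total_var = variance([sum(row) for row in data])   # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")  # values near 1 suggest high internal consistency
```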
Key Questions Regarding Measures:
When making decisions about measures, be sure to use your evaluation questions as a guide. When thinking about which measures to use, review the following questions:
- Identify outcomes to be measured. What exactly are you trying to measure? Does the measure you have actually measure the outcome of interest? For example, if the outcome of interest is self-esteem, make sure you have a measure of self-esteem and not self-concept or some other similar construct; these are different things. If the evaluation tool is a broad measure, does it address all of the outcomes you want to measure? Does it cover “too much”? That is, does it collect data you do not need and will not use? If so, try to pare it down so as not to waste your or your participants' valuable time and attention.
- Determine the measurement strategy. Which measurement strategy is most appropriate given the outcome that you are trying to measure and the context in which the program is taking place? Surveys are a common measurement strategy; however, they are not the only possible strategy. Consider whether using interviews, observations, content analysis, etc. might be appropriate.
- Match measure to sample. The sampling plan may target adults or youth, each with different literacy levels and other characteristics. Is this measure appropriate for the sample? Also consider lifecycle: does the measure fit the stage of the program and evaluation lifecycles?
- Quality of the measure. This is the “bird in the hand vs. two in the bush” decision. A program may have a choice to make between using what is on hand already (which may be ready to go, and may even have data from past years giving evaluators the opportunity to compare results), or trying to find a “better” existing measure. A “better” measure in this case might mean one that has been tested in careful studies for validity and reliability, has the credibility of having been used in additional research papers, and for which large-scale study results are available to which results can be compared.
- Feasibility. There’s no point in listing a measure in an evaluation plan if it is simply not realistic that program staff will be able to find it, afford it, modify it appropriately, test it, use it, analyze it, and/or report on it. Will staff have time available to use this measure?
- Strategic Value. If time and resources are limited then efforts should be focused on the opportunities that have the highest “payoff”. Consulting with stakeholders or advisory groups is recommended in order to be sure that the choice is made well.
- References. If using established or named measures, are they properly referenced? Measures developed and field tested by others should be cited in the evaluation plan and in any other writing where the measure is mentioned.
How does measurement change over the life course of a program? Newer programs are probably looking for rapid feedback from participants about their reactions to the program, which might be met with simple satisfaction surveys, whereas more mature programs will be looking to show cause-and-effect relationships and might use more established and tested measures. Below is a tentative chart of the intentions of evaluation at each stage of development.
| New Programs | Mature Programs |
|---|---|
| Informal measurement | Formal measurement |
| Rapid feedback on participant reactions (e.g., simple satisfaction surveys) | Reliability, consistency, and validity; standard protocol of measurement instruments |
At the risk of repeating ourselves, remember that decisions about measures don't occur in a vacuum – they are related to lifecycle, sampling, design, and analysis issues; that is, they will both affect and be affected by these other topics. For help with this decision-making process, see Appendices XXIV and XXV.
When writing the measurement plan, consider each evaluation question, identify the focal construct, and describe in detail the measures that will be used. Be sure to include a description of the measure type (e.g., survey, observation, interview), identify the origin of the measure (cite the source if the measure was found in the literature, or describe the development process if you are creating a new measure), and discuss the reliability and validity of the measure (if the reliability and/or validity has not been tested, be sure to state this as well).
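One way to keep these elements consistent across evaluation questions is to record each planned measure in a simple structure. The sketch below (Python, with hypothetical field names) simply mirrors the elements described above; a spreadsheet with the same columns would work just as well:

```python
# Minimal sketch: one entry in a measurement plan, mirroring the elements
# described above. Field names and the example content are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlannedMeasure:
    evaluation_question: str
    construct: str
    measure_type: str                    # e.g., "survey", "observation", "interview"
    origin: str                          # citation, or how the measure was developed
    reliability_evidence: Optional[str]  # None means "not yet tested"; state this in the plan
    validity_evidence: Optional[str]

entry = PlannedMeasure(
    evaluation_question="Is program participation related to an increase in science knowledge?",
    construct="Science knowledge",
    measure_type="observation",
    origin="Developed by the working group; to be pilot tested with one cohort",
    reliability_evidence=None,
    validity_evidence=None,
)
```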
Unobtrusive and Nonreactive Measures
One of the biggest practical challenges in evaluation is motivating program participants to provide data. Viewed from a systems perspective, both the participants and the evaluators typically want something out of the program. Participants usually want the program itself, or what it might potentially do for them. Evaluators want data or information about how the program affects the participants. And each has negative motivators: participants usually don't like the burdens that formal measures, especially tests, impose on them, and evaluators don't like being in the position of having to impose on the participants (and often the program staff). From a systems point of view, the ideal solution often amounts to looking for a symbiosis between the interests of the evaluator and the program participant.
Now consider how this might inform how we approach measurement. Let's say that in an informal science education program designed to teach children how to use a microscope, you would like to assess their knowledge of the material conveyed in the program. You could construct a paper-and-pencil test that the children would complete at the end of the program. But that would be obtrusive (not to say a drag!), and it would be better if you could assess knowledge more symbiotically and less obtrusively. How might this be done? One approach would be to collect data about knowledge of microscope use in the natural course of their doing the program. You might do this by observing how they try to use the microscope initially and how they perform with it in the last task of the program (a type of before-after assessment). This could be done by observing them directly, or by rating or scoring the results of what they record in the natural course of using the microscope. It may even be that the “program” doesn't involve a real microscope but one simulated through a computer program; in this case, measures of performance could be unobtrusively built into the software itself. In this example of measurement symbiosis, both parties get what they want: the children get to take part in an engaging (we hope) program without doing any burdensome tests, and the evaluator gets data on knowledge as reflected in their performance without having to cajole them into taking a test (or impose that requirement on the program staff).
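To make the before-after idea concrete, here is a minimal sketch (Python, with a hypothetical observational checklist) that scores each child's observed microscope behaviors before and after the program and reports the change:

```python
# Minimal sketch: before-after scoring of an observational checklist.
# Checklist items and the recorded observations are hypothetical.
CHECKLIST = ["adjusts focus", "prepares slide", "selects objective", "records observation"]

def checklist_score(observed):
    """Fraction of checklist behaviors the observer recorded for one child."""
    return sum(1 for item in CHECKLIST if item in observed) / len(CHECKLIST)

before = {"child_a": {"adjusts focus"}, "child_b": set()}
after = {"child_a": {"adjusts focus", "prepares slide", "selects objective"},
         "child_b": {"adjusts focus", "records observation"}}

for child in before:
    change = checklist_score(after[child]) - checklist_score(before[child])
    print(f"{child}: change = {change:+.2f}")  # e.g., child_a: change = +0.50
```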
These kinds of measurement approaches are known as unobtrusive or nonreactive measures. We have several favorites that we like to cite. For instance, when an evaluator wanted to assess people's radio station preferences, instead of doing a survey research study they came up with the clever solution of having auto mechanics in the area note what radio stations were on in the cars that were brought in. Or, when museum evaluators wanted to assess which exhibit paintings people were most interested in, they took the creative approach of replacing selected floor tiles in front of each painting and then making careful measures of wear and tear at the end of the exhibit. There are many potential ways to conduct measurement unobtrusively: direct observation and coding, photography, video, use of archival records, and so on. In fact, one of the most fertile sources of short-term outcome measures is likely to be program outputs – products that are naturally generated in the course of participating in the program. When we take these outputs and code, rate, or score them, we are in effect turning them into measures that might reflect things like performance, knowledge, or even satisfaction or interest. Although unobtrusive measures may require considerable forethought and preparation, they can also be fun to create and integrate into a program. This helps everyone get what they want and encourages a greater symbiosis between the program and its evaluation.
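As one illustration of turning program outputs into measures, the sketch below (Python, with a hypothetical rubric and artifacts) rates naturally generated products – say, pages from the children's lab notebooks – against a simple rubric:

```python
# Minimal sketch: rating naturally generated program outputs with a rubric.
# The rubric criteria and the artifact records below are hypothetical.
RUBRIC = {
    "labels_parts_correctly": 2,  # points available for each criterion
    "notes_magnification": 1,
    "describes_specimen": 2,
}

def rate_output(artifact):
    """Total rubric points earned by one program output."""
    return sum(points for criterion, points in RUBRIC.items() if artifact.get(criterion))

artifacts = [
    {"labels_parts_correctly": True, "notes_magnification": True},
    {"describes_specimen": True},
]

print([rate_output(a) for a in artifacts])  # [3, 2]
```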
Precisely because these types of measures can be implemented without the participants’ knowledge, it is especially important with these strategies to be careful with issues of privacy and protection of human subjects more broadly. If in doubt about the ethics of a particular strategy, be sure to consult with an expert in this area. Universities have Institutional Review Boards or other entities charged with ensuring protection of human subjects in research and evaluation. Outside universities, there are experts and consultants who specialize in this.
See Also: