TSO evaluators and researchers must carefully consider organization- and context-specific selection dynamics when constructing pools of comparison units. Further WSC research testing the performance of QEDs using different sets of comparison units would help inform the design of more credible quasi-experimental TSO evaluations. Another major finding of our WSC is that OLS performs better when pre-program measures of the outcome—in this case baseline test scores—are controlled for.
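To make the baseline-control finding concrete, the following is a minimal simulation sketch and not the analysis reported in this study. It assumes a hypothetical data-generating process in which less advantaged individuals are more likely to be treated, and in which the baseline score is a noisy measure of the same unobserved ability that drives the outcome. All names and parameter values (`simulate`, `ols`, `TRUE_EFFECT`, the noise levels) are illustrative assumptions. Under those assumptions, adding the baseline score to the regression shrinks the bias in the estimated treatment effect without eliminating it.

```python
import random

def ols(columns, y):
    """OLS coefficients (intercept first), solved from the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def simulate(n=5000, true_effect=0.2, seed=0):
    """Hypothetical data: low-ability units are more likely to be treated."""
    rng = random.Random(seed)
    treat, baseline, outcome = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)                       # unobserved
        t = 1.0 if ability + rng.gauss(0, 1) < 0 else 0.0
        treat.append(t)
        baseline.append(ability + rng.gauss(0, 0.5))    # noisy pre-program score
        outcome.append(true_effect * t + ability + rng.gauss(0, 0.5))
    return treat, baseline, outcome

TRUE_EFFECT = 0.2
treat, baseline, outcome = simulate(true_effect=TRUE_EFFECT)

# Coefficient on treatment without and with the baseline control
bias_naive = ols([treat], outcome)[1] - TRUE_EFFECT
bias_adjusted = ols([treat, baseline], outcome)[1] - TRUE_EFFECT

print(f"bias without baseline control: {bias_naive:+.3f}")
print(f"bias with baseline control:    {bias_adjusted:+.3f}")
```

Because the simulated baseline score measures ability with error, some bias remains after adjustment, which is consistent with treating even baseline-adjusted quasi-experimental estimates with caution.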

Similarly, kernel matching performs better when we match exactly on both grade and special education status, rather than on grade alone. This finding reinforces prior literature Betts et al. Accordingly, evaluations of TSOs without natural baseline measures should use caution when interpreting quasi-experimental results as causal.
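The general idea of kernel matching with exact matching can be sketched as follows. This is an illustrative toy, not the estimator specification used in this study: the Epanechnikov kernel, the bandwidth of 0.2, the field names (`grade`, `special_ed`, `baseline`), and the simulated data are all assumptions. Each treated unit is compared with a kernel-weighted average of comparison units that share its grade and special education status, with weights declining in baseline-score distance. The simulation assumes selection depends only on observed variables, which is the favorable case for matching.

```python
import random
from collections import defaultdict

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def kernel_match_att(units, exact_keys=("grade", "special_ed"), bandwidth=0.2):
    """ATT from kernel matching on baseline score within exact-match cells."""
    pool = defaultdict(list)
    for u in units:
        if not u["treated"]:
            pool[tuple(u[k] for k in exact_keys)].append(u)
    gaps = []
    for t in units:
        if not t["treated"]:
            continue
        w_sum = wy_sum = 0.0
        for c in pool[tuple(t[k] for k in exact_keys)]:
            w = epanechnikov((c["baseline"] - t["baseline"]) / bandwidth)
            w_sum += w
            wy_sum += w * c["outcome"]
        if w_sum > 0:  # skip treated units with no comparison unit in range
            gaps.append(t["outcome"] - wy_sum / w_sum)
    return sum(gaps) / len(gaps)

def simulate(n=4000, true_effect=0.25, seed=1):
    """Hypothetical data in which selection depends only on observables."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        grade = rng.choice([3, 4, 5])
        special_ed = rng.random() < 0.15
        baseline = rng.gauss(0, 1)
        p = (0.7 if baseline < 0 else 0.3) + 0.1 * (grade - 4) + 0.15 * special_ed
        treated = rng.random() < p
        outcome = (true_effect * treated + 0.8 * baseline + 0.5 * (grade - 3)
                   - 0.6 * special_ed + rng.gauss(0, 0.5))
        units.append(dict(grade=grade, special_ed=special_ed, baseline=baseline,
                          treated=treated, outcome=outcome))
    return units

TRUE_EFFECT = 0.25
units = simulate(true_effect=TRUE_EFFECT)

treated_y = [u["outcome"] for u in units if u["treated"]]
comparison_y = [u["outcome"] for u in units if not u["treated"]]
naive = sum(treated_y) / len(treated_y) - sum(comparison_y) / len(comparison_y)
matched = kernel_match_att(units)

print(f"unadjusted difference in means: {naive:+.3f}")
print(f"kernel matching, exact on grade and special ed: {matched:+.3f}")
```

Passing `exact_keys=("grade",)` relaxes the exact match to grade alone, which allows the two specifications to be compared on the same simulated data.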

Finally, a third takeaway is that simple OLS regression with control variables sometimes performs better than matching. The variables we use matter more than how we use them. This finding reinforces some prior studies Betts et al. Such inconsistencies indicate a need for continued research. Selection bias is central to these findings. In Anderson and Wolf, when QE approaches are used to estimate program effects among only eligible program applicants, the results often suggest significant positive effects in math, while the experimental estimates suggest null effects.

When those same methods are used within the broader sample, including program nonapplicants, the results tend to indicate significant negative effects in math. This suggests that students are negatively selective at the time of application but positively selective at the time of voucher take-up. Positive selection at take-up is consistent with prior research on school choice interventions Campbell et al.
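The way two stages of selection can flip the sign of the bias depending on the comparison pool can be illustrated with a toy simulation. It is a sketch under invented assumptions, not a reconstruction of the voucher data: the thresholds, the 50% lottery, and the single unobserved "advantage" variable are hypothetical, the true program effect is set to zero, and the quasi-experimental contrasts are simple unadjusted differences in means rather than the regression and matching estimators discussed above.

```python
import random
from collections import namedtuple
from statistics import mean

Person = namedtuple("Person", "applied won used outcome")

def simulate(n=40000, seed=2):
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        adv = rng.gauss(0, 1)                          # unobserved advantage
        applied = adv + rng.gauss(0, 1) < -0.5         # needier people apply
        won = applied and rng.random() < 0.5           # lottery among applicants
        used = won and adv + rng.gauss(0, 1) > -0.75   # advantaged winners take up
        outcome = adv + rng.gauss(0, 0.5)              # true program effect is zero
        people.append(Person(applied, won, used, outcome))
    return people

def diff_in_means(group_a, group_b):
    return mean(p.outcome for p in group_a) - mean(p.outcome for p in group_b)

people = simulate()
applicants = [p for p in people if p.applied]

# Experimental benchmark: lottery winners versus lottery losers
itt = diff_in_means([p for p in applicants if p.won],
                    [p for p in applicants if not p.won])

# Users versus non-users among applicants only
qe_applicants = diff_in_means([p for p in applicants if p.used],
                              [p for p in applicants if not p.used])

# Users versus everyone else, including non-applicants
qe_broad = diff_in_means([p for p in people if p.used],
                         [p for p in people if not p.used])

print(f"experimental (lottery) estimate:     {itt:+.3f}")
print(f"users vs non-users, applicants only: {qe_applicants:+.3f}")
print(f"users vs non-users, broad sample:    {qe_broad:+.3f}")
```

In this setup the lottery contrast stays near zero, the applicant-only contrast is pushed upward by positive selection at take-up, and the broad-sample contrast is pushed downward because non-applicants are more advantaged than users.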

A similar pattern might be expected in a variety of TSOs, particularly those in which eligible applicants may have unmet and difficult-to-measure needs, but more advantaged people in those circumstances might be more likely to take the final steps necessary to experience the intervention.

For example, many TSOs have eligibility requirements based on low income or need, thus intentionally generating negative selection at the application stage. However, individuals may face barriers when it comes to actually following through with a program or intervention Kahn et al. Or, if the processes used to select and follow up on applications for treatment allow for cream-skimming or cherry-picking of those most likely to benefit, they would create positive selection bias at this stage.

Similar WSCs can further explore these dynamics. Doing so permits evaluators to use rigorous experimental methods to identify the true difference that TSOs make in the lives of their clients. Gather lots of information about program applicants, preferably before they are served by your TSO, to aid evaluators in establishing reliable comparison groups for any analysis of outcomes. That advice often will be difficult to follow.

However confident the leaders of TSOs are that their organization delivers added value to its clients, putting that expectation to the test takes great courage. Leaders at times may be disappointed in the results. Moreover, most TSOs seek to serve every client who is eligible for their services and motivated to receive them, usually on a first-come-first-served basis. There are some TSO settings, such as the healthcare field, where lottery-based admissions to clinics or services would be unethical in many cases.

For that reason alone, QE designs likely will continue to be the dominant methodology for evaluating the performance of TSOs. Our advice to TSO researchers is more complicated. First, conduct experimental evaluations whenever possible. Experiments yield causal results and lay the foundation for WSCs.

Second, where experimental evaluations of TSOs exist or are in the works, researchers should plan to conduct WSCs as follow-ups to shed light on how nonexperimental research in the field can be improved. While ethical concerns may prevent randomization in some cases, conducting a WSC once it has been determined that randomization is feasible and ethical likely adds very little, if any, ethical concern. Indeed, as Cook et al. Just as failing to conduct randomized experiments could be considered unethical in this regard, conducting a WSC, when it is feasible, strikes us as the ethical thing to do.

Third, TSO researchers planning or conducting QE evaluations should draw upon the lessons of WSCs to make their evaluations as plausibly causal as possible. Researchers attempting to evaluate the causal effects of TSOs need to consider the contextual relevance of covariate choice, model choice, sampling frame, and the potential type and degree of selection bias at various stages in the process.

Most notably, constructing a comparison group that resembles the treatment group in key respects, such as interest in and eligibility for a program, is important for reducing selection bias. In some cases, evaluators may consider using multiple comparison groups. However, requiring geographic similarity may actually introduce bias if selection bias is due more to individual-level factors than geographic-level factors or if the pool of potential matches is small Unlu et al.

Given the complexity and diversity of institutions within the third sector, researchers must closely evaluate the selection mechanisms that are likely at play in their particular context. For example, selection may operate differently in first-come-first-served TSOs and others that select participants based on need or specific eligibility criteria. The factors that generate selective attrition from programs also should be considered. Evaluators of TSOs are advised to collect baseline data on as many relevant client characteristics as possible, especially baseline outcome measures, when available.

Some health, educational, or other life outcomes relevant to TSO evaluation have multiple measures over time, yet some, such as death or high school graduation, are singular events. In cases of the latter, evaluators should have less confidence in the performance of quasi-experimental approaches. More WSCs should explore the performance of alternative approaches when closely related baseline measures may not be available.

Similarly, while some TSOs have simple interventions and clearly defined and validly measurable outcomes and goals, some may have more complex or subjectively defined goals, such as economic or community development, in which case WSCs may still be fairly limited in their ability to describe the circumstances under which nonexperimental approaches might approximate experimental ones.

Researchers attempting to conduct WSCs within the third sector may face some unique challenges. One challenge might be obtaining data from another comparison sector, whether governmental or private. Doing so may be politically difficult, since first and second sector organizations tend to view TSOs as rivals.

It also might be legally problematic, as robust privacy protections exist regarding access to personal information in such fields as health care and education. Even when data are available from rival organizations, key variables might be measured differently than they are in the TSO. In our example, we were limited in the number of years of DCPS data for which the same testing outcome was available.

Further, even if a robust set of measures is collected within an experimental arm, it may not be feasible to obtain similar variables from comparison units outside the scope of the original evaluation. Conducting a WSC with a mix of administrative data and researcher-collected data may require some data cleaning and cross-walking to have consistent, albeit limited, measures across different arms of the WSC. The evidence from WSCs can also help inform researchers, policymakers, and other consumers of research about when it may or may not be reasonable to generalize from unique, single-case experimental findings across a broader set of contexts.

This body of work is still growing, and it has some important limitations. However, as interest in WSCs continues to grow, and as the methodological literature on how to conduct well-designed WSCs expands Cook et al. Those developments will be especially instructive for rigorous evaluation in the voluntary sector, where individuals are able to self-select into and out of available and accessible programs and providers for which they are eligible.

As third-sector organizations continue to serve clients who voluntarily seek them out, causal evaluations informed by the lessons of within-study comparisons can better determine if those organizations are serving them well.

Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review.

Does method matter? Assessing the correspondence between experimental and nonexperimental results from a school voucher program evaluation.

Vouchers for private schooling in Colombia: Evidence from a randomized natural experiment. The American Economic Review.

Testing the validity of the single interrupted time series design No. National Bureau of Economic Research.

Bettinger E, Slonim R. Using experimental economics to measure the effects of a natural educational experiment on altruism. Journal of Public Economics.

Madness in the method? A critical analysis of popular methods of estimating the effect of charter schools on student achievement. Taking measure of charter schools: Better assessments, better policymaking, better schools. Rowman and Littlefield;

Can nonexperimental estimates replicate estimates based on random assignment in evaluations of school choice? A within-study comparison. Journal of Policy Analysis and Management.

Using experiments to assess nonexperimental comparison-group methods for measuring program effect. In: Bloom HS, editor. Learning more from social experiments. Russell Sage Foundation;

Administrative exclusion: Organizations and the hidden costs of welfare claiming.

