What metric should we use to measure program success?

Let’s focus on how large program effects are, not how probable they are. By Michael Karcher

Editor’s Note: In this post, Professor Michael Karcher shares with us his considerable expertise in statistics and program evaluation. In doing so, he makes a strong and compelling case for using “effect sizes” as opposed to “statistical significance” as the benchmark for success in program evaluation. Even if you’re not familiar with the concepts, I urge you to read this. Michael’s accessible approach and compelling arguments might just bring researchers and practitioners to a shared conversation around what really works in mentoring.

We share the same goal. We just need to share the same language.  And that language needs to be a logical one that reflects the reality of what we know about mentoring, evaluation, statistics, and program development.

I assume most readers of the Chronicle find their days filled with the duties of developing, operating and sustaining mentoring programs and matches. Most of my days (and those of my academically inclined colleagues) are spent thinking about the evaluation of mentoring programs and relationships.  Our goals are the same, however: to learn how to identify what works in mentoring and improve the relationships we create between mentors and mentees.

I’d like to start a dialogue among us about evaluation. I’d like to pose a simple question about measuring the effectiveness of programs, and I encourage my colleagues, both the programmatically and the academically inclined, to educate me on why our work can’t be made simpler through the systematic use of (and a redirected emphasis on) measuring the effects of specific mentoring programs in terms of their “effect sizes” rather than statistical significance.

In this essay, I suggest we should rely less on the tests of statistical significance that social science would have us use. Perhaps those in the field supporting matches defer this question to those of us conducting the research. But I find there is little communication among us about the benefits of focusing on effect sizes or about the limitations of relying on tests of statistical significance. I believe addressing this question could not only create a common dialogue, but also could make the practice of evaluation so much more realistic for most programs.

What is an effect size?

When I first started writing this article, I tried to explain that I want to keep the conversation I hope this essay initiates a simple one. I then went off on a two-page rant about what simple means. So I deleted that ironical diatribe. Let me just say that a discussion of statistics could quickly become overly complex, just as could a discussion about the difficulties on the ground of collecting data, creating comparison groups, tracking data, et cetera by program staff. So as readers comment on this piece—which I really hope people will do—let’s all try to write at a level that can be appreciated by practitioners and researchers alike.

Effect sizes can take several forms, reflecting both group differences and the strength of relationships between phenomena. In program evaluation, however, the typical effect being measured is the difference on an important outcome (grades, attendance rates, social skills or happiness) between kids who did and did not get a mentor through a specific program. That “difference” becomes an “effect size” when it is standardized in a way that allows a common scale of measurement to be applied across all outcomes (many of which will differ in their original scale of measurement). For example, suppose we want to know whether the difference observed on attendance (between unmentored kids and mentored kids after some period of program participation) is similar to the difference between groups on a self-report measure of happiness that ranges from unhappy (1) to very happy (5). To compare them, we need to standardize these scores. One way to do this is to use a measure of how much the scores on each outcome differ among the kids in general. The standard deviation tells us how much all of the scores on an outcome, like attendance or happiness, vary around the group mean. When scores are roughly bell-shaped, more than 99% of them fall within three standard deviations on either side of the mean. When we take the difference between two groups on an outcome and divide it by the standard deviation of scores for one or both groups, we get the effect size named “d”.

In the social sciences, we can take a given value of “d” and tell if it is a small, medium, large or very large difference, regardless of the original metric of measurement. Whatever the outcome, once standardized, values of .2, .5, and .8 can be interpreted similarly as small, medium, and large differences, respectively.
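For readers who like to see the arithmetic, here is a minimal Python sketch of the calculation described above. It assumes the pooled standard deviation of both groups as the divisor (dividing by the comparison group's standard deviation alone is another common choice). The happiness ratings and the function names `cohens_d` and `size_label` are invented for illustration and do not come from any real program.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(mentored, unmentored):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(mentored), len(unmentored)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(mentored) + (n2 - 1) * variance(unmentored)
    ) / (n1 + n2 - 2)
    return (mean(mentored) - mean(unmentored)) / sqrt(pooled_var)


def size_label(d):
    """Conventional labels: .2 = small, .5 = medium, .8 = large."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "very small"


# Hypothetical happiness ratings on the 1 (unhappy) to 5 (very happy) scale.
mentored_scores = [4, 3, 5, 4, 3, 4, 5, 3]
unmentored_scores = [3, 3, 4, 4, 3, 3, 4, 3]

d = cohens_d(mentored_scores, unmentored_scores)
print(f"d = {d:.2f} ({size_label(d)})")
```

With these made-up ratings the two groups differ by half a scale point, which works out to a d of roughly .7. The same function could be applied to attendance, grades, or any other outcome, and the resulting values of d would be directly comparable.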

Mentoring typically has a “small” effect. We know this from multiple meta-analyses, specifically those reported by DuBois and colleagues (2002; 2011). I’ve heard DuBois say that many program staff are offended or “put off” when the effects of their programs are called small, but unfortunately that’s just the standard interpretation used across the social sciences (no offense intended, I assure you). So David has been known to sometimes use the word “demure” instead to assuage his listeners. Another response that I often give to program staff who might be disappointed by the word “small” is to note that a similarly demure impact is typically reported for other interventions like tutoring and after-school programs (see Ritter et al., 2009; Durlak and Weissberg, 2007). The effect size we generally achieve may be called small, by this social science convention, but it falls in the range of many other programs (see DuBois and colleagues’ 2011 meta-analysis for examples).

Don’t be fooled into relying on tests of statistical significance in your program evaluations

So from here I want to make two points and then conclude and open the dialogue to all interested. The two points are related, in that both deal with the problem of relying on tests of statistical significance as the sole or primary gauge of whether an impact is real, important, or “significant.”

Here is a definition of what “statistical significance” means. When we say that a difference is statistically significant at the “p less than .05 level,” we mean that the p-value is below .05, where p is “the likelihood of a result [program outcome] even more extreme than that observed across all possible random samples assuming that the null hypothesis [i.e., that there is no program impact] is true and all assumptions of that test statistic (e.g., independence, normality, and homogeneity of variance) are satisfied.”

“Some correct interpretations for the specific case α = .05 and p < .05 are… 1. Assuming that [null hypothesis of no effect or] H0 is true and the study is repeated many times by drawing random samples from the same population(s), less than 5% of these results will be even more inconsistent with H0 than the actual result. 2. Less than 5% of test statistics from random samples are further away from the mean of the sampling distribution under H0 than the one for the observed result. 3. The odds are less than 1 to 19 of getting a result from a random sample even more extreme than the observed one when H0 is true.” (Kline, 2008, location 2185-2192)
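One way to make the first of these interpretations concrete is to simulate it. The sketch below is my own illustration, not part of Kline's text. It assumes a simple z-test with a known standard deviation of 1, equal group sizes, and invented settings (50 youth per group, 4,000 repeated studies). Both groups are drawn from the same population, so H0 is true by construction.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a mean difference (known SD, equal n)."""
    n = len(group_a)
    standard_error = sd * sqrt(2 / n)
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(n_per_group=50, n_studies=4000, alpha=0.05, seed=1):
    """Share of simulated no-effect studies that still reach p < alpha."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: no true program effect.
        mentored = [rng.gauss(0, 1) for _ in range(n_per_group)]
        unmentored = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(mentored, unmentored) < alpha:
            significant += 1
    return significant / n_studies


rate = false_alarm_rate()
print(f"Share of no-effect studies with p < .05: {rate:.3f}")
```

The share should land close to 5%, even though no program effect exists anywhere in this simulated world. That is all a p-value below .05 tells us: results this extreme would be rare if the program did nothing.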

I should confess that I am a fan of Rex Kline (whom I quote above), because he is a crystal clear writer on complex topics. So, if you find the text above confusing, it is not because of Rex’s writing skills. It’s because, in my opinion, the concept is convoluted. I believe that p-values reflect a weird approach to achieving scientific rigor when used in program evaluation (for reasons I explain below). I prefer to rely on other scientific conventions, such as replication and consistency of findings across programs, places, people and outcomes. It seems odd to use p-values to say, in effect, “Our program had a meaningful (‘statistically significant’) effect because the difference we observe between mentees and non-mentees is so big that we would only rarely (1 in 20 times we did an evaluation) find such a difference in a world wherein no such difference really exists.” It just does not make sense to start from the position that “mentoring has no effect” and try to disprove it using probability, when we have a strong foundation of research suggesting it does (at least under a set of known conditions, namely those listed in MENTOR’s Elements of Effective Practice).

Another problem is that statistical significance is the product of four ingredients, one of which is rarely present in small-scale program evaluations that include fewer than several hundred youth. Statistical significance depends on how big the difference is, of course; on the level of significance one chooses; and on the size of the sample of youth in the evaluation. It also depends on a thing called power, which is the likelihood of finding an effect when one really does exist. (Typically the field of social science has chosen a power level of .8, which means that we’d be okay not finding an effect that really did exist two out of every 10 times we ran the study.) When conducting a study, both when planning it and after data are collected but before running statistical tests of significance, researchers and evaluators must determine whether the conditions present allow them to reasonably expect to detect the expectable difference (recall that in mentoring it is a “d” effect size of .20). Cutting to the chase: to detect a small effect in a simple two-group comparison (mentees and non-mentees) at the significance level of .05 (and power level of .8) requires a total sample size of 788, or about 394 youth per group (e-mail me and I can send you the calculation details).
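For those curious about where a number like this comes from, here is a rough sketch of the calculation. It assumes the common normal-approximation formula for a two-sided, two-group comparison with equal group sizes, which may not be the exact method behind the figure above. The function name `youth_per_group` is made up for this example.

```python
from math import ceil
from statistics import NormalDist


def youth_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect effect size d.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided cutoff
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n = youth_per_group(d)
    print(f"{label} effect (d = {d}): {n} per group, {2 * n} in total")
```

For d = .2 this approximation gives 393 youth per group, or 786 in total. That is within a couple of youth of the 788 cited above, and calculations based on the t distribution typically come out slightly higher than the normal approximation. Notice how quickly the requirement falls as the expected effect grows: a medium effect needs only about 63 per group under the same assumptions.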

So, it is generally not appropriate to apply the conventions of statistical significance testing in most mentoring program evaluations. Yet funders usually require it. Many journals require it for published research (but, program evaluation is local and does not seek to generalize to other settings, which differentiates it from research). In fact, some of the “what works” lists of effective programs rely almost exclusively on statistical significance tests and virtually ignore effect sizes. But that is for published research, which some may argue is a different matter altogether. But, personally and professionally, for most mentoring program evaluations, I think it is wrong, unethical, stupid, self-sabotaging, clueless, wasteful, and unproductive to use p-values as benchmarks of meaningful program impacts.

Given the requirements of tests for statistical significance, and specifically the common constraint that most program evaluations have sample sizes too small for the responsible use of significance tests, we need another way to think about evidence of impact. By extension, it seems fair to point out, most program evaluations that are conducted by local evaluators studying specific programs with samples that are too small (under 800) will not be able to appropriately use standard test statistics (e.g., the t-test). This is because the number of kids they can include in their evaluations is usually not large enough to reliably test the effect size we know we can expect (based on multiple meta-analyses, such as those conducted by David DuBois and colleagues). Therefore, most of these reports are of little scientific merit, and thus useless if not misleading.

Bringing the cumulative effect of program practices into view

My question, then, is how do we deal with the fact that programs need statistical evidence of impact, yet most programs would be unwise to use tests of statistical significance as the main approach in their quantitative evaluations? Most funders want quantitative evidence of program effectiveness (even though qualitative studies based on interviews, observations, or case studies, such as those written by Renee Spencer, can often be so much more interesting and informative regarding program practices and the nature of mentoring relationships in a specific context or program). So programs must evaluate using numbers of some kind. But what should programs do to evaluate themselves using numbers?

Here is my second and final point. Programs should turn their attention away from significance tests and toward the goal of increasing program impacts (effect sizes) on outcomes through the systematic inclusion of more best practices.  DuBois and colleagues (2002) showed that programs which included more than a half dozen best practices have double the impact of programs with far fewer best practices. This cumulative effect of adding more evidence-based practices is where we should be putting our focus, our energy, and our funding.

Other programs also find that when they focus on the inclusion of best practices they see program impacts rise. In their meta-analysis of tutoring programs, Ritter et al. (2009) report that the impacts of volunteer tutoring programs on reading skills differed substantially between programs that were unstructured (d = .14) and those that were structured (d = .59). That’s the difference between a very small effect and a larger-than-medium sized effect. Durlak and Weissberg (2007) also found that after-school programs that used evidence-based training approaches more than doubled their program effect sizes across a host of outcomes.

What are some of the best practices that we should focus on including? Based on DuBois and colleagues’ (2002) meta-analysis, important practices include (1) procedures for systematic monitoring of program implementation; (2) mentoring in community settings; (3) recruiting mentors with backgrounds in helping roles or jobs; (4) clearly conveying expectations for the frequency of match contact; (5) providing ongoing (post-match) training for mentors; (6) having structured activities for mentors and youth; and (7) supporting parent involvement. Finding ways to incorporate these practices is what we should be focusing on.

Programs should seek funding to support the inclusion of these best practices, rather than seek funding to determine “whether mentoring works” in their setting. We have pretty solid evidence that professionally operated mentoring programs work, and we are especially confident about those programs that include several of the aforementioned best practices in addition to the most basic practices (e.g., background checks, pre-match training, etc.).

So, I say, don’t be seduced into evaluating the “impact” of your program using significance tests to understand differences between your mentees and a comparison group on outcomes. If you must make such comparisons, restrict them to interpreting the size of the difference between groups as an effect size (.2 = small, .5 = medium, .8 = large). Alternatively, consider placing your emphasis on consistency of effects across outcomes and the size of these effects, rather than testing the probability of finding a given impact in a hypothetical world in which “no impact” exists. Use the DuBois and colleagues (2002; 2011) meta-analyses, not your program evaluation, to show funders that mentoring works. Tell funders you want to assess the increase in effectiveness that results from the inclusion of best practices that their resources are used to support.

I may be wrong, and I hope someone will show me where my thinking is off, but it makes no sense to me to estimate the likelihood that your program impacts occurred in a world in which they don’t exist (the situation of that “null hypothesis” the significance tests are used to reject). This can lead to crazy conclusions. Consider Ritter and colleagues’ (2009) meta-analysis of volunteer tutoring programs. The effect of tutoring on global reading skills was d = .26 and on global math skills was d = .27. Yet they emphasized that the test statistic for the math improvements was not statistically significant (mainly, it seems, because the number of reading studies was twice the number of math tutoring studies, but also because the effects on math varied more widely across those five studies). Their conclusion: “participation in a volunteer tutoring program results in improved overall reading measures of approximately one third of a standard deviation” but “very little is known about the effectiveness of volunteer tutoring interventions at improving math outcomes” (pp. 19-20). That’s right, tutoring in reading works, but tutoring in math…not so much. Sounds crazy? Of course this finding does not mean we should stop tutoring in math. In fact, the average effect size of tutoring in math is comparable to the effect size of tutoring in reading. The significance test was the deciding factor in their judgment of the merit of each intervention. If you don’t think that the misapplication, misuse, or misunderstanding of statistical significance tests could happen in the mentoring field or have any serious adverse consequences, may I suggest you read the evaluation of the Student Mentoring Program funded by the U.S. Department of Education (Bernstein et al., 2009) and Google its consequences.
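To see how the same effect size can pass or fail a significance test depending only on how much data sits behind it, consider the simplified sketch below. It is not a reproduction of the Ritter et al. analysis, which pooled effects across studies. It assumes a plain two-group comparison with equal group sizes and a normal approximation, and the sample sizes of 40 and 400 per group are invented.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_value(d, n_per_group):
    """Approximate two-sided p-value for an observed effect size d.

    With equal groups, the test statistic is roughly z = d * sqrt(n / 2).
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


same_effect = 0.26  # similar in size to the tutoring effects discussed above
for n in (40, 400):
    p = approx_p_value(same_effect, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"d = {same_effect}, n = {n} per group: p = {p:.3f} ({verdict})")
```

The effect is identical in both rows. Only the amount of data changes, yet one evaluation would be reported as showing that the program "works" and the other as showing that it does not.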

In conclusion, may I suggest that we think about which scientific standards will be most useful to employ in the local evaluation of mentoring programs? Based on the state of the literature (namely the evidence of effectiveness that has accumulated), I suggest we focus on program improvement rather than impact. Focus on consistency of positive program effects and size of the effects of programs across outcomes. Then, if you must, compare your program’s average effect size to the average effect size in the meta-analyses referenced below. That probably gives you as good a “comparison group” as you can find. And all that can happen in a world in which you can reasonably expect that mentoring does have an effect.

Thoughts, anyone?


Bernstein, L., Rappaport, C. D., Olsho, L., Hunt, D., & Levin, M. (2009). Impact evaluation of the U.S. Department of Education’s Student Mentoring Program: Final report. Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education, U.S. Department of Education.

DuBois, D. L., Holloway, B. E., Valentine, J. C., & Cooper, H. (2002). Effectiveness of mentoring programs for youth: A meta-analytic review. American Journal of Community Psychology, 30(2), 157-197.

DuBois, D. L., Portillo, N., Rhodes, J. E., Silverthorn, N., & Valentine, J. C. (2011). How effective are mentoring programs for youth? A systematic assessment of the evidence. Psychological Science in the Public Interest, 12, 57-91.

Durlak, J. A., & Weissberg, R. P. (2007). The impact of after-school programs that promote personal and social skills. Chicago, IL: Collaborative for Academic, Social, and Emotional Learning (CASEL).

Kline, R. B. (2008). Becoming a Behavioral Science Researcher: A Guide to Producing Research That Matters (Kindle locations 2185-2192). Guilford Press. Kindle Edition.

Ritter, G. W., et al. (2009). The effectiveness of volunteer tutoring programs for elementary and middle school students: A meta-analysis. Review of Educational Research, 79(1), 3-38.