The Campbell Collaboration www.campbellcollaboration.org
Applied topics: Interpreting the Practical Significance of Meta-Analysis Findings
Mark Lipsey
Co-Chair, The Campbell Collaboration
Co-Editor-in-Chief, Campbell Systematic Reviews
Director, Peabody Research Institute, Vanderbilt University, USA

The problem

The effect size statistics that constitute the direct findings of a meta-analysis often provide little insight into the nature, magnitude, or practical significance of the effects they represent. Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful in an applied context.

Example: The mean standardized mean difference effect size (Cohen's d or Hedges' g) for the effects of educational interventions with middle school students on standardized reading tests is about .15 and statistically significant. Seems small: Is .15 large enough to have practical significance for improving the reading skills of middle school students?

Most important to recognize: There is no necessary relationship between the numerical magnitude of an effect size and the practical significance of the effect it represents!

A widely used but inappropriate and misleading characterization of effect sizes
- Statistical effect sizes are often assessed against Cohen's "small" (.20), "medium" (.50), and "large" (.80) categories.
- These are impressionistic norms across a wide range of outcomes in social and behavioral research.
- They are almost never the appropriate norms for the particular outcomes of a particular intervention.
- Comparing an obtained mean effect size with norms can be informative, but those norms must be appropriate to the context, intervention, nature of the outcomes, etc. [more on this later]

Two approaches to review here

1. Descriptive representations of intervention effect sizes: Translations of effect sizes into forms that are more readily interpreted. Supports better intuitions about the practical significance of the effect size.
2. Direct assessment of practical significance: Assessing statistical effect sizes in relationship to criteria that have recognized practical value in the context of application. Requires that appropriate criteria be used; different criteria may yield different conclusions.

Useful Descriptive Representations of Intervention Effect Sizes

Back translation to an original metric

Useful when the original metric is readily interpretable; not so useful when it is in arbitrary units.

Example: The mean Phi coefficient for effects of intervention on the reoffense rates of juvenile offenders is < .20, allegedly trivial.

Computation of the Phi coefficient as an effect size:
          Reoffend (failure)   Don't Reoffend (success)
  Tx      a = p                b = 1 - p                   a + b = 1
  Ct      c = q                d = 1 - q                   c + d = 1
          a + c = p + q        b + d = (1-p) + (1-q)

  Phi = (ad - bc) / SQRT((a+b)(c+d)(a+c)(b+d))

Back translation to original metric: Phi coefficient example

The mean reoffense rate for the control groups in the studies was .50. Some algebra (or trial and error in a spreadsheet) yields the reoffense rate of the average treatment group required to produce Phi = .20. [Note: A similar procedure would work for an odds ratio ES as well.]
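The trial-and-error search just described can be sketched in Python. This is a minimal sketch, not part of the original deck; `phi_from_rates` is a hypothetical helper that applies the Phi formula above to the row-proportion layout in which each row (Tx, Ct) sums to 1.

```python
import math

def phi_from_rates(p_tx, p_ct):
    """Phi for the 2x2 layout above, where each row (Tx, Ct) sums to 1."""
    a, b, c, d = p_tx, 1 - p_tx, p_ct, 1 - p_ct
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

q = 0.50  # mean control-group reoffense rate from the synthesis

# Grid search over treatment-group rates below the control baseline for
# the rate that yields a Phi of magnitude .20 (the trial-and-error step).
p = min((i / 10000 for i in range(5000)),
        key=lambda x: abs(abs(phi_from_rates(x, q)) - 0.20))

print(round(p, 2))  # -> 0.3, matching the .30 treatment reoffense rate in the worked example
```

The search recovers the .30 treatment-group rate; note that a beneficial reduction yields a Phi of about -.20, and it is the magnitude that the synthesis reports.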
          Reoffend (failure)   Don't Reoffend (success)
  Tx      .30                  .70                        1.00
  Ct      .50                  .50                        1.00
          .80                  1.20

  Phi = .20

Phi = .20 thus means an average .20 reduction in the reoffense rate from a .50 average baseline value; that is, a 40% decrease in the reoffense rate. Hardly trivial!

Back translation to original metric: standardized test example

Suppose the mean standardized mean difference effect size for intervention effects on vocabulary tests is .30. The most frequently used measure of vocabulary in the contributing studies was the Peabody Picture Vocabulary Test (PPVT). The PPVT has a normed standard score of 100 with a standard deviation of 15, and differences in standard scores are readily understood by researchers and practitioners familiar with standardized tests. The control groups in the studies using the PPVT had a mean standard score of 87. How much improvement in the PPVT standard score is represented by an effect size of .30? An ES of .30 corresponds to .30 x 15 = 4.5 standard-score points, which would raise the control mean of 87 to 91.5.

[Figure: Back translation to original metric: PPVT]

Intervention effect sizes represented as percentiles on the normal distribution

Percentile values on the control distribution represent the intervention effect expressed in standard deviation units. [Figure: Translating effect sizes into percentiles from a table of areas under the normal curve]

The percentage of the treatment group that is above the control group mean is Cohen's U3 index:
  Effect Size   Proportion above the Control Mean   Additional Proportion above the Control Mean
  .10           .54                                 .04
  .20           .58                                 .08
  .30           .62                                 .12
  .40           .66                                 .16
  .50           .69                                 .19
  .60           .73                                 .23
  .70           .76                                 .26
  .80           .79                                 .29
  .90           .82                                 .32
  1.00          .84                                 .34
  1.10          .86                                 .36
  1.20          .88                                 .38

Rosenthal and Rubin Binomial Effect Size Display (BESD)

[Figure: BESD illustration for d = .80]

BESD representations of SMD and correlation ESs
  Effect Size   r     Proportion of control/intervention cases above the grand median   BESD (difference between the proportions)
  .10           .05   .47 / .52                                                         .05
  .20           .10   .45 / .55                                                         .10
  .30           .15   .42 / .57                                                         .15
  .40           .20   .40 / .60                                                         .20
  .50           .24   .38 / .62                                                         .24
  .60           .29   .35 / .64                                                         .29
  .70           .33   .33 / .66                                                         .33
  .80           .37   .31 / .68                                                         .37
  .90           .41   .29 / .70                                                         .41
  1.00          .45   .27 / .72                                                         .45
  1.10          .48   .26 / .74                                                         .48
  1.20          .51   .24 / .75                                                         .51

Even better, use an inherently meaningful threshold

Suppose we have a mean standardized mean difference effect size of .23 for the effects of treatment for depression on outcome measures of depression. For many measures of depression, a threshold score has been determined for the range that constitutes clinical levels of depression. Suppose, then, that we can determine from at least a subset of representative studies that the average proportion of the control groups whose scores are in the clinical range is 64%. Assuming that depression scores are normally distributed, we can then use this proportion and the effect size to determine the average proportion in the clinical range for the treatment groups. From that we find the proportion of clinically depressed patients moved out of the clinical range by the treatment.
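A minimal sketch of this threshold translation, using Python's standard-library `NormalDist`. The 64% control proportion and the .23 effect size are the hypothetical figures from the example above, not empirical values.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

control_clinical = 0.64  # proportion of control scores in the clinical range (assumed)
es = 0.23                # mean standardized mean difference effect size (assumed)

# z-score of the clinical threshold on the control distribution
z_threshold = norm.inv_cdf(control_clinical)

# The treatment distribution is shifted by es standard deviations in the
# beneficial direction, so the threshold sits es SD closer to the treated mean.
treated_clinical = norm.cdf(z_threshold - es)

print(round(z_threshold, 2))       # -> 0.36
print(round(treated_clinical, 2))  # -> 0.55
```

This reproduces the lookup in the table of areas under the normal curve described on the next slide: 64% of the control distribution at the threshold, 55% of the treatment distribution after the .23 SD shift.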
[Figure: Proportions of the treatment and control samples above and below a meaningful reference value (success threshold)]

Using a table of areas under the normal curve: 64% of the area of the normal curve is below Z = .36. Subtracting ES = .23 SD from Z = .36 gives Z = .13, with 55% of the area of the normal curve below.

The mean effect size of .23 thus indicates that, on average, the intervention reduced the proportion of clinically depressed patients from 64% to 55%, a 9 percentage point differential. [Figure: clinical threshold shown on control distribution (64% below, 36% above) and treatment distribution (55% below, 45% above)]

The more general point

With some understanding of the nature of the effect size index you are working with, and of the intervention context and what would be an interpretable representation of the magnitude of the intervention effect on the outcomes of interest, it will almost always be possible to translate any effect size or mean effect size into a form that facilitates interpretation of its practical significance.

Direct Assessments of Practical Significance
Assessing the practical significance of effect sizes requires a criterion from the context of application

- Neither the numerical value of an effect size nor its statistical significance is a valid indicator of the practical significance of the effect.
- Translating the numerical value into terms that are easier to understand facilitates an intuitive assessment of practical significance, but is inherently subjective.
- A more direct assessment of practical significance can often be made by comparing the effect size with an appropriate criterion drawn from the context of application and, therefore, meaningful in that context.
- The clinical and normative thresholds used as examples in the previous section are a step in that direction, but more can be learned from a more fully developed criterion framework.

Examples of some criterion frameworks that can be used to assess the practical significance of intervention effect sizes

E.g., compare the mean effect size found with:
- Established normative expectations for change
- Effects others have found on similar measures with similar interventions
- Policy-relevant performance gaps
- Intervention costs (not discussed here)

Some examples from education follow (this happens to be where we have done a lot of work recently).

Benchmarking against normative expectations for change from test norming samples

Data compiled from national norms for standardized achievement tests:
- Up to seven tests were used for reading, math, science, and social science
- The mean and standard deviation of the scores for each grade were obtained from the test manuals
- The standardized mean difference effect size across succeeding grades was computed

Annual achievement gain: Mean effect sizes across 7 nationally normed tests

  Grade Transition   Reading   Math   Science   Social Studies
  K - 1              1.52      1.14   --        --
  1 - 2              .97       1.03   .58       .63
  2 - 3              .60       .89    .48       .51
  3 - 4              .36       .52    .37       .33
  4 - 5              .40       .56    .40       .35
  5 - 6              .32       .41    .27       .32
  6 - 7              .23       .30    .28       .27
  7 - 8              .26       .32    .26       .25
  8 - 9              .24       .22    .22       .18
  9 - 10             .19       .25    .19       .19
  10 - 11            .19       .14    .15       .15
  11 - 12            .06       .01    .04       .04

Adapted from Bloom, Hill, Black, and Lipsey (2008). Spring-to-spring differences. The means shown are the simple (unweighted) means of the effect sizes from all or a subset of seven tests: CAT5, SAT9, Terra Nova-CTBS, Gates-MacGinitie, MAT8, Terra Nova-CAT, and SAT10.

Mean effect size relative to the effect size for achievement gain from pretest baseline

[Figure: Gain from the beginning to end of pre-K on a summary achievement measure for children who participated in pre-K compared to children who did not participate. The nonparticipant (control) pre-post gain is .82 SD; the mean intervention ES of .31 SD represents a 38% increase over that gain.]

Benchmarking against effect sizes for achievement from random assignment studies of education interventions

Data in our current compilation:
- 124 random assignment studies
- 181 independent subject samples
- 829 effect size estimates

Achievement effect sizes by grade level and type of achievement test

  Grade Level & Achievement Measure   n of ES Estimates   Mean   SD
  Elementary School                   693                 .28    .46
    Standardized test (broad)          89                 .08    .27
    Standardized test (narrow)        374                 .23    .42
    Specialized topic/test            230                 .40    .33
  Middle School                        70                 .33    .38
    Standardized test (broad)          13                 .13    .33
    Standardized test (narrow)         30                 .32    .26
    Specialized topic/test             27                 .43    .48
  High School                          66                 .23    .34
    Standardized test (broad)          --                 --     --
    Standardized test (narrow)         22                 .03    .07
    Specialized topic/test             43                 .34    .38
Achievement effect sizes by target recipients

  Target Recipients                  n of ES Estimates   Mean ES   SD
  Individual students (one-on-one)   232                 .40       .33
  Small groups (not classrooms)      322                 .26       .40
  Classroom of students              176                 .18       .41
  Whole school                        33                 .10       .30
  Mixed                              44                  .30       .33

Benchmarking against policy-relevant demographic performance gaps

The effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups.
Effect size gaps for groups may vary across grades, years, tests, and districts.
Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
  Subject & Grade   Black-White   Hispanic-White   Eligible-Ineligible for FRPL
  Reading
    Grade 4         1.09          1.03             .86
    Grade 8         1.02          1.14             .68
    Grade 12        1.11          1.16             .58
  Math
    Grade 4         .95           .71              .68
    Grade 8         1.11          1.07             .58
    Grade 12        1.20          1.12             .51

Adapted from Bloom, Hill, Black, and Lipsey (2008). District local outcomes are based on SAT-9 scaled scores for tests administered in spring 2000, 2001, and 2002. SAT 9: Stanford Achievement Tests, 9th Edition (Harcourt Educational Measurement, 1996).

Benchmarking against performance gaps between average and weak schools

Main idea: What is the performance gap (in effect size) for the same types of students in different schools?

Approach:
- Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
- Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution.

Performance gaps between average (50th percentile) and weak (10th percentile) schools in 4 districts as effect sizes

  Subject & Grade   District A   District B   District C   District D
  Reading
    Grade 3         .31          .18          .16          .43
    Grade 5         .41          .18          .35          .31
    Grade 7         .25          .11          .30          NA
    Grade 10        .07          .11          NA           NA
  Math
    Grade 3         .29          .25          .19          .41
    Grade 5         .27          .23          .36          .26
    Grade 7         .20          .15          .23          NA
    Grade 10        .14          .17          NA           NA

Adapted from Bloom, Hill, Black, and Lipsey (2008). NA indicates that a value is not available due to missing test score data. Means are regression-adjusted for test scores in the prior grade and students' demographic characteristics. The tests are the ITBS for District A, SAT9 for District B, MAT for District C, and SAT8 for District D.
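As a worked illustration of benchmarking (a sketch added here, not a slide from the original deck), the .15 middle school reading effect size from the opening example can be set against two of the criteria tabled above: the grade 7-8 annual reading gain (.26 SD) and the grade 8 Black-White reading gap (1.02 SD), both adapted from Bloom, Hill, Black, and Lipsey (2008).

```python
intervention_es = 0.15     # middle school reading example from the opening slide
annual_gain_es = 0.26      # normative reading gain, grade 7-8 transition
black_white_gap_es = 1.02  # grade 8 Black-White reading gap (SAT 9 district data)

# Express the intervention effect as a multiple of each benchmark.
years_of_growth = intervention_es / annual_gain_es
share_of_gap = intervention_es / black_white_gap_es

print(round(years_of_growth, 2))  # -> 0.58: over half a year of normative growth
print(round(share_of_gap, 2))     # -> 0.15: about 15% of the demographic gap
```

Against these criteria, the "small-looking" .15 amounts to more than half a year of typical reading growth for middle schoolers, which is the point of using context-specific benchmarks rather than Cohen's generic categories.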
Cost effectiveness as a framework for practical significance: Example for juvenile offender programs

[Table excerpted from Aos, Phipps, Barnoski, & Lieb, 2001]

In conclusion

The numerical values of statistical effect size indices for intervention effects provide little understanding of the practical magnitude of those effects.
Translating effect sizes into a more descriptive and intuitive form makes them easier to understand and assess for practitioners, policymakers, and researchers.
There are a number of easily applied translations that could be routinely used in reporting intervention effect sizes.
Directly assessing the practical significance of those effects, however, requires that they be benchmarked against some criterion that is meaningful in the intervention context.
Assessing practical significance directly is more difficult, but there are approaches that may be appropriate depending on the intervention and outcome construct.

References

Aos, S., Phipps, P., Barnoski, R., & Lieb, R. (2001). The comparative costs and benefits of programs to reduce crime (Version 4.0). Washington State Institute for Public Policy.

Bloom, H. S., Hill, C. J., Black, A. B., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289-328.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 237-245.

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169.
Contact Information

Campbell Collaboration
P.O. Box 7004 St. Olavs plass
0130 Oslo, Norway
E-mail: info@c2admin.org
http://www.campbellcollaboration.org

mark.lipsey@vanderbilt.edu