The Campbell Collaboration www.campbellcollaboration.org
Applied topics: Interpreting the Practical Significance of Meta-Analysis Findings
Mark Lipsey
Co-Chair, The Campbell Collaboration
Co-Editor-in-Chief, Campbell Systematic Reviews
Director, Peabody Research Institute, Vanderbilt University, USA

The problem

The effect size statistics that constitute the direct findings of a meta-analysis often provide little insight into the nature, magnitude, or practical significance of the effects they represent. Practitioners, policymakers, and even researchers have difficulty knowing whether the effects are meaningful in an applied context.

Example: The mean standardized mean difference effect size (Cohen's d or Hedges' g) for the effects of educational interventions with middle school students on standardized reading tests is about .15 and statistically significant. Seems small: Is .15 large enough to have practical significance for improving the reading skills of middle school students?

Most important to recognize: There is no necessary relationship between the numerical magnitude of an effect size and the practical significance of the effect it represents!

A widely used but inappropriate and misleading characterization of effect sizes
- Statistical effect sizes are often assessed against Cohen's "small" (.20), "medium" (.50), and "large" (.80) categories.
- These are impressionistic norms across a wide range of outcomes in social and behavioral research.
- They are almost never the appropriate norms for the particular outcomes of a particular intervention.
- Comparing an obtained mean effect size with norms can be informative, but those norms must be appropriate to the context, intervention, nature of the outcomes, etc. [more on this later]

Two approaches to review here

1. Descriptive representations of intervention effect sizes: Translations of effect sizes into forms that are more readily interpreted. Supports better intuitions about the practical significance of the effect size.
2. Direct assessment of practical significance: Assessing statistical effect sizes in relationship to criteria that have recognized practical value in the context of application. Requires that appropriate criteria be used; different criteria may yield different conclusions.

Useful Descriptive Representations of Intervention Effect Sizes

Back translation to an original metric

Useful when the original metric is readily interpretable; not so useful when it is in arbitrary units.

Example: The mean Phi coefficient for effects of intervention on the reoffense rates of juvenile offenders is < .20, allegedly trivial.

Computation of the Phi coefficient as an effect size:
          Reoffend (failure)   Don't Reoffend (success)
  Tx      a = p                b = 1 - p                   a + b = 1
  Ct      c = q                d = 1 - q                   c + d = 1
          a + c = p + q        b + d = (1-p) + (1-q)

  Phi = (ad - bc) / SQRT((a+b)(c+d)(a+c)(b+d))

Back translation to original metric: Phi coefficient example

The mean reoffense rate for the control groups in the studies was .50. Some algebra (or trial and error in a spreadsheet) yields the reoffense rate of the average treatment group required to produce Phi = .20. [Note: A similar procedure would work for an odds ratio ES as well.]
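The trial-and-error search just described can be sketched in Python. This is a minimal sketch, not part of the original deck; `phi_from_rates` is a hypothetical helper that applies the Phi formula above to the row-proportion layout in which each row (Tx, Ct) sums to 1.

```python
import math

def phi_from_rates(p_tx, p_ct):
    """Phi for the 2x2 layout above, where each row (Tx, Ct) sums to 1."""
    a, b, c, d = p_tx, 1 - p_tx, p_ct, 1 - p_ct
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

q = 0.50  # mean control-group reoffense rate from the synthesis

# Grid search over treatment-group rates below the control baseline for
# the rate that yields a Phi of magnitude .20 (the trial-and-error step).
p = min((i / 10000 for i in range(5000)),
        key=lambda x: abs(abs(phi_from_rates(x, q)) - 0.20))

print(round(p, 2))  # -> 0.3, matching the .30 treatment reoffense rate in the worked example
```

The search recovers the .30 treatment-group rate; note that a beneficial reduction yields a Phi of about -.20, and it is the magnitude that the synthesis reports.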
          Reoffend (failure)   Don't Reoffend (success)
  Tx      .30                  .70                        1.00
  Ct      .50                  .50                        1.00
          .80                  1.20

  Phi = .20

Phi = .20 thus means an average .20 reduction in the reoffense rate from a .50 average baseline value; that is, a 40% decrease in the reoffense rate. Hardly trivial!

Back translation to original metric: standardized test example

Suppose the mean standardized mean difference effect size for intervention effects on vocabulary tests is .30. The most frequently used measure of vocabulary in the contributing studies was the Peabody Picture Vocabulary Test (PPVT). The PPVT has a normed standard score of 100 with a standard deviation of 15, and differences in standard scores are readily understood by researchers and practitioners familiar with standardized tests. The control groups in the studies using the PPVT had a mean standard score of 87. How much improvement in the PPVT standard score is represented by an effect size of .30? An ES of .30 corresponds to .30 x 15 = 4.5 standard-score points, which would raise the control mean of 87 to 91.5.

[Figure: Back translation to original metric: PPVT]

Intervention effect sizes represented as percentiles on the normal distribution

Percentile values on the control distribution represent the intervention effect expressed in standard deviation units. [Figure: Translating effect sizes into percentiles from a table of areas under the normal curve]

The percentage of the treatment group that is above the control group mean is Cohen's U3 index:
  Effect Size   Proportion above the Control Mean   Additional Proportion above the Control Mean
  .10           .54                                 .04
  .20           .58                                 .08
  .30           .62                                 .12
  .40           .66                                 .16
  .50           .69                                 .19
  .60           .73                                 .23
  .70           .76                                 .26
  .80           .79                                 .29
  .90           .82                                 .32
  1.00          .84                                 .34
  1.10          .86                                 .36
  1.20          .88                                 .38

Rosenthal and Rubin Binomial Effect Size Display (BESD)

[Figure: BESD illustration for d = .80]

BESD representations of SMD and correlation ESs
  Effect Size   r     Proportion of control/intervention cases above the grand median   BESD (difference between the proportions)
  .10           .05   .47 / .52                                                         .05
  .20           .10   .45 / .55                                                         .10
  .30           .15   .42 / .57                                                         .15
  .40           .20   .40 / .60                                                         .20
  .50           .24   .38 / .62                                                         .24
  .60           .29   .35 / .64                                                         .29
  .70           .33   .33 / .66                                                         .33
  .80           .37   .31 / .68                                                         .37
  .90           .41   .29 / .70                                                         .41
  1.00          .45   .27 / .72                                                         .45
  1.10          .48   .26 / .74                                                         .48
  1.20          .51   .24 / .75                                                         .51

Even better, use an inherently meaningful threshold

Suppose we have a mean standardized mean difference effect size of .23 for the effects of treatment for depression on outcome measures of depression. For many measures of depression, a threshold score has been determined for the range that constitutes clinical levels of depression. Suppose, then, that we can determine from at least a subset of representative studies that the average proportion of the control groups whose scores are in the clinical range is 64%. Assuming that depression scores are normally distributed, we can then use this proportion and the effect size to determine the average proportion in the clinical range for the treatment groups. From that we find the proportion of clinically depressed patients moved out of the clinical range by the treatment.
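A minimal sketch of this threshold translation, using Python's standard-library `NormalDist`. The 64% control proportion and the .23 effect size are the hypothetical figures from the example above, not empirical values.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

control_clinical = 0.64  # proportion of control scores in the clinical range (assumed)
es = 0.23                # mean standardized mean difference effect size (assumed)

# z-score of the clinical threshold on the control distribution
z_threshold = norm.inv_cdf(control_clinical)

# The treatment distribution is shifted by es standard deviations in the
# beneficial direction, so the threshold sits es SD closer to the treated mean.
treated_clinical = norm.cdf(z_threshold - es)

print(round(z_threshold, 2))       # -> 0.36
print(round(treated_clinical, 2))  # -> 0.55
```

This reproduces the lookup in the table of areas under the normal curve described on the next slide: 64% of the control distribution at the threshold, 55% of the treatment distribution after the .23 SD shift.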
[Figure: Proportions of the treatment and control samples above and below a meaningful reference value (success threshold)]

Using a table of areas under the normal curve: 64% of the area of the normal curve is below Z = .36. Subtracting ES = .23 SD from Z = .36 gives Z = .13, with 55% of the area of the normal curve below.

The mean effect size of .23 thus indicates that, on average, the intervention reduced the proportion of clinically depressed patients from 64% to 55%, a 9 percentage point differential. [Figure: clinical threshold shown on control distribution (64% below, 36% above) and treatment distribution (55% below, 45% above)]

The more general point

With some understanding of the nature of the effect size index you are working with, and of the intervention context and what would be an interpretable representation of the magnitude of the intervention effect on the outcomes of interest, it will almost always be possible to translate any effect size or mean effect size into a form that facilitates interpretation of its practical significance.

Direct Assessments of Practical Significance
Assessing the practical significance of effect sizes requires a criterion from the context of application

- Neither the numerical value of an effect size nor its statistical significance is a valid indicator of the practical significance of the effect.
- Translating the numerical value into terms that are easier to understand facilitates an intuitive assessment of practical significance, but is inherently subjective.
- A more direct assessment of practical significance can often be made by comparing the effect size with an appropriate criterion drawn from the context of application and, therefore, meaningful in that context.
- The clinical and normative thresholds used as examples in the previous section are a step in that direction, but more can be learned from a more fully developed criterion framework.

Examples of some criterion frameworks that can be used to assess the practical significance of intervention effect sizes

E.g., compare the mean effect size found with:
- Established normative expectations for change
- Effects others have found on similar measures with similar interventions
- Policy-relevant performance gaps
- Intervention costs (not discussed here)

Some examples from education follow (this happens to be where we have done a lot of work recently).

Benchmarking against normative expectations for change from test norming samples

Data compiled from national norms for standardized achievement tests:
- Up to seven tests were used for reading, math, science, and social science
- The mean and standard deviation of the scores for each grade were obtained from the test manuals
- The standardized mean difference effect size across succeeding grades was computed

Annual achievement gain: Mean effect sizes across 7 nationally normed tests

  Grade Transition   Reading   Math   Science   Social Studies
  K - 1              1.52      1.14   --        --
  1 - 2              .97       1.03   .58       .63
  2 - 3              .60       .89    .48       .51
  3 - 4              .36       .52    .37       .33
  4 - 5              .40       .56    .40       .35
  5 - 6              .32       .41    .27       .32
  6 - 7              .23       .30    .28       .27
  7 - 8              .26       .32    .26       .25
  8 - 9              .24       .22    .22       .18
  9 - 10             .19       .25    .19       .19
  10 - 11            .19       .14    .15       .15
  11 - 12            .06       .01    .04       .04

Adapted from Bloom, Hill, Black, and Lipsey (2008). Spring-to-spring differences. The means shown are the simple (unweighted) means of the effect sizes from all or a subset of seven tests: CAT5, SAT9, Terra Nova-CTBS, Gates-MacGinitie, MAT8, Terra Nova-CAT, and SAT10.

Mean effect size relative to the effect size for achievement gain from pretest baseline

[Figure: Gain from the beginning to end of pre-K on a summary achievement measure for children who participated in pre-K compared to children who did not participate. The nonparticipant (control) pre-post gain is .82 SD; the mean intervention ES of .31 SD represents a 38% increase over that gain.]

Benchmarking against effect sizes for achievement from random assignment studies of education interventions

Data in our current compilation:
- 124 random assignment studies
- 181 independent subject samples
- 829 effect size estimates

Achievement effect sizes by grade level and type of achievement test

  Grade Level & Achievement Measure   n of ES Estimates   Mean   SD
  Elementary School                   693                 .28    .46
    Standardized test (broad)          89                 .08    .27
    Standardized test (narrow)        374                 .23    .42
    Specialized topic/test            230                 .40    .33
  Middle School                        70                 .33    .38
    Standardized test (broad)          13                 .13    .33
    Standardized test (narrow)         30                 .32    .26
    Specialized topic/test             27                 .43    .48
  High School                          66                 .23    .34
    Standardized test (broad)          --                 --     --
    Standardized test (narrow)         22                 .03    .07
    Specialized topic/test             43                 .34    .38
Achievement effect sizes by target recipients

  Target Recipients                  n of ES Estimates   Mean ES   SD
  Individual students (one-on-one)   232                 .40       .33
  Small groups (not classrooms)      322                 .26       .40
  Classroom of students              176                 .18       .41
  Whole school                        33                 .10       .30
  Mixed                              44                  .30       .33

Benchmarking against policy-relevant demographic performance gaps

The effectiveness of interventions can be judged relative to the sizes of existing gaps across demographic groups.
Effect size gaps for groups may vary across grades, years, tests, and districts.
Demographic performance gaps on SAT 9 scores in a large urban school district as effect sizes
  Subject & Grade   Black-White   Hispanic-White   Eligible-Ineligible for FRPL
  Reading
    Grade 4         1.09          1.03             .86
    Grade 8         1.02          1.14             .68
    Grade 12        1.11          1.16             .58
  Math
    Grade 4         .95           .71              .68
    Grade 8         1.11          1.07             .58
    Grade 12        1.20          1.12             .51

Adapted from Bloom, Hill, Black, and Lipsey (2008). District local outcomes are based on SAT-9 scaled scores for tests administered in spring 2000, 2001, and 2002. SAT 9: Stanford Achievement Tests, 9th Edition (Harcourt Educational Measurement, 1996).

Benchmarking against performance gaps between average and weak schools

Main idea: What is the performance gap (in effect size) for the same types of students in different schools?

Approach:
- Estimate a regression model that controls for student characteristics: race/ethnicity, prior achievement, gender, overage for grade, and free lunch status.
- Infer the performance gap (in effect size) between schools at different percentiles of the performance distribution.

Performance gaps between average (50th percentile) and weak (10th percentile) schools in 4 districts as effect sizes

  Subject & Grade   District A   District B   District C   District D
  Reading
    Grade 3         .31          .18          .16          .43
    Grade 5         .41          .18          .35          .31
    Grade 7         .25          .11          .30          NA
    Grade 10        .07          .11          NA           NA
  Math
    Grade 3         .29          .25          .19          .41
    Grade 5         .27          .23          .36          .26
    Grade 7         .20          .15          .23          NA
    Grade 10        .14          .17          NA           NA

Adapted from Bloom, Hill, Black, and Lipsey (2008). NA indicates that a value is not available due to missing test score data. Means are regression-adjusted for test scores in the prior grade and students' demographic characteristics. The tests are the ITBS for District A, SAT9 for District B, MAT for District C, and SAT8 for District D.
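As a worked illustration of benchmarking (a sketch added here, not a slide from the original deck), the .15 middle school reading effect size from the opening example can be set against two of the criteria tabled above: the grade 7-8 annual reading gain (.26 SD) and the grade 8 Black-White reading gap (1.02 SD), both adapted from Bloom, Hill, Black, and Lipsey (2008).

```python
intervention_es = 0.15     # middle school reading example from the opening slide
annual_gain_es = 0.26      # normative reading gain, grade 7-8 transition
black_white_gap_es = 1.02  # grade 8 Black-White reading gap (SAT 9 district data)

# Express the intervention effect as a multiple of each benchmark.
years_of_growth = intervention_es / annual_gain_es
share_of_gap = intervention_es / black_white_gap_es

print(round(years_of_growth, 2))  # -> 0.58: over half a year of normative growth
print(round(share_of_gap, 2))     # -> 0.15: about 15% of the demographic gap
```

Against these criteria, the "small-looking" .15 amounts to more than half a year of typical reading growth for middle schoolers, which is the point of using context-specific benchmarks rather than Cohen's generic categories.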
Cost effectiveness as a framework for practical significance: Example for juvenile offender programs

[Table excerpted from Aos, Phipps, Barnoski, & Lieb, 2001]

In conclusion

The numerical values of statistical effect size indices for intervention effects provide little understanding of the practical magnitude of those effects.
Translating effect sizes into a more descriptive and intuitive form makes them easier to understand and assess for practitioners, policymakers, and researchers.
There are a number of easily applied translations that could be routinely used in reporting intervention effect sizes.
Directly assessing the practical significance of those effects, however, requires that they be benchmarked against some criterion that is meaningful in the intervention context.
Assessing practical significance directly is more difficult, but there are approaches that may be appropriate depending on the intervention and outcome construct.

References

Aos, S., Phipps, P., Barnoski, R., & Lieb, R. (2001). The comparative costs and benefits of programs to reduce crime (Version 4.0). Washington State Institute for Public Policy.

Bloom, H. S., Hill, C. J., Black, A. B., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289-328.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning behavior. Review of Educational Research, 51, 237-245.

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166-169.
Contact Information

Campbell Collaboration
P.O. Box 7004 St. Olavs plass
0130 Oslo, Norway
E-mail: info@c2admin.org
http://www.campbellcollaboration.org

mark.lipsey@vanderbilt.edu