
Methods in Ecology and Evolution 2011, 2, 1–10 doi: 10.1111/j.2041-210X.2010.00056.x

Getting started with meta-analysis


Freya Harrison*
Department of Zoology, University of Oxford, Oxford, UK

Summary
1. Meta-analysis is a powerful and informative tool for basic and applied research. It provides a statistical framework for synthesizing and comparing the results of studies which have all tested a particular hypothesis. Meta-analysis has the potential to be particularly useful for ecologists and evolutionary biologists, as individual experiments often rely on small sample sizes due to the constraints of time and manpower, and therefore have low statistical power.
2. The rewards of conducting a meta-analysis can be significant. It can be the basis of a systematic review of a topic that provides a powerful exploration of key hypotheses or theoretical assumptions, thereby influencing the future development of a field of research. Alternatively, for the applied scientist, it can provide robust answers to questions of ecological, medical or economic significance. However, planning and conducting a meta-analysis can be a daunting prospect and the analysis itself is invariably demanding and labour intensive. Errors or omissions made at the planning stage can create weeks of extra work.
3. While a range of useful resources is available to help the budding meta-analyst on his or her way, much of the key information and explanation is spread across different articles and textbooks. In order to help the reader use the available information as efficiently as possible (and so avoid making time-consuming errors) this article aims to provide a road map to the existing literature. It provides a brief guide to planning, organizing and implementing a meta-analysis which focuses more on logic and implementation than on maths; it is intended to be a first port of call for those interested in the topic and should be used in conjunction with the more detailed books and articles referenced. In the main, references are cited and discussed with an emphasis on useful reading order rather than a chronological history of meta-analysis and its uses.
4. No prior knowledge of meta-analysis is assumed in the current article, though it is assumed that the reader is familiar with ANOVA and regression-type statistical models.

Key-words: effect size, meta-analysis, null hypothesis significance testing, power, P-values, sample size, statistics, systematic review

'Meta-analysis makes me very happy.'
Jacob Cohen, psychologist and statistician

Introduction: the foundations of meta-analysis

A literature review for any given topic is likely to turn up a long list of studies, with varying degrees of consistency in experimental methodology, study species and analytical approach. Often these studies have led to very different conclusions. For example, theoreticians working on the evolution of biparental care have predicted that it is only an evolutionarily stable strategy if individuals respond to a decrease in parental care effort by their mate with an increase of smaller magnitude in their own care effort. Over the last 25 years, many behavioural ecologists have performed experiments to test whether partial compensation is indeed observed if one member of a breeding pair is removed or handicapped to reduce its care input. These studies have been carried out on birds, rodents and insects and have reported every possible response to experimental manipulation, from desertion to over-compensation for the lost care effort (reviewed in Harrison et al. 2009). Given the variability in how these studies were conducted and the often small individual sample sizes, it is almost impossible to decide if the literature as a whole supports the partial compensation hypothesis simply by reading and contrasting studies. However, meta-analysis provides a formal statistical framework with which we can rigorously combine and compare the results of these experiments. In this article, I will outline the logic of meta-analysis

*Correspondence author. E-mail: freya.andersdottir@gmail.com
Correspondence site: http://www.respond2articles.com/MEE

© 2010 The Author. Methods in Ecology and Evolution © 2010 British Ecological Society

and provide a brief guide to planning, organizing and implementing a meta-analysis. The article is intended to serve as a road map to the numerous detailed resources which are available, providing an introduction which focuses more on logic and implementation than on mathematics. A glossary of key terms (marked in bold in the main text) is provided in Box 1 and key references are listed at the end of each section.

Meta-analysis gives us quantitative tools to do two things. First, if a number of attempts have been made to measure the effect of one variable on another, then meta-analysis provides a method to calculate the mean effect of the independent variable, across all attempts. Usually, the independent variable represents some form of experimental manipulation (treated vs. control groups, or a continuous variable representing treatment level). To illustrate this, Fernandez-Duque & Valeggia (1994) combined the results of five studies of the effect of selective logging on bird populations. This revealed a detrimental effect of selective logging on population density that was not immediately apparent from simply looking at the results of the individual studies. Secondly, meta-analysis allows us to measure the amount of experimentally-induced change in the dependent variable across studies and to attempt to explain this variability using defined moderator variables. Such variables could reflect phylogenetic, ecological or methodological differences between study groups. For example, in a classic meta-analysis of 20 studies, Côté & Sutherland (1997) calculated that, on average, predator removal resulted in an increase in post-breeding bird populations but not in breeding populations.

Meta-analysis achieves these goals by using effect sizes: these are statistics that provide a standardized, directional measure of the mean change in the dependent variable in each study. Effect sizes can incorporate considerations of sample size. Furthermore, when being combined in a meta-analysis, effect sizes can be weighted by the variance of the estimate, such that studies with lower variance (i.e. tighter estimated effect size) are given more weight in the data set. Because variance decreases as sample size increases, this generally means that effect sizes based on larger study populations are given greater weight. Tests which are analogous to analysis of variance (ANOVA) and weighted regression can then be applied to the population of effect sizes to identify moderator variables that explain a significant amount of variation between studies. For instance, in our meta-analysis, we found that the mean response to partner removal or handicapping was indeed partial compensation and that the sex of the responding partner and aspects of the experimental methodology explained some of the variation between individual studies (Harrison et al. 2009).

Key references. Stewart (2010) and Hillebrand (2008) provide neat introductions to the logic and power of meta-analysis from an ecological perspective.

Meta-analysis vs. vote counting

How many of us have heard the results of null hypothesis significance tests (NHST) being referred to as showing 'strong' or 'weak' effects of some variable, based on the size of the calculated P-value? It is a common fallacy to assume that the smaller the P-value, the stronger the observed relationship must be. However, the magnitude of an effect and its statistical significance are not intrinsically correlated: a small P-value does not necessarily mean that the effect of experimental treatment is large, or that the slope of a variable of interest on some covariate is steep. This is due in large part to the dependence of P on sample size: given a large enough sample size, the null hypothesis will almost always be rejected. P-values reflect a dichotomous question (is the observed pattern of data likely to be due to chance, or not?) not an open-ended one (how strong is the pattern in the data?). Cohen (1990) uses a rather wonderful example to demonstrate this point: he cites a study of 14 000 children that reported a significant link between height (measured in feet and adjusted for age and sex) and IQ. He then points out that if we take 'significant' to mean a P-value of 0.001, then a correlation coefficient of at least 0.0278 (a very shallow slope indeed) would be found to be significant in a sample this large. The authors actually reported a rather larger correlation coefficient of 0.11, but the effect of height on IQ is still small: converting height to a metric measure, this means that a 30-point increase in IQ would be associated with an increase in height of over a metre.

The pros and cons of NHST and its alternatives have been discussed by other authors (Nakagawa & Cuthill 2007; Stephens, Buskirk & del Rio 2007) and are beyond the scope of this article: suffice it to say that P-values from NHST do not measure the magnitude of the effect of independent variables on dependent variables, are heavily influenced by sample size and are not generally comparable across studies. In other words, P-values are not effect sizes: two studies can have the same effect size but different P-values, or vice versa.

This means that post hoc analyses that rely on 'vote counting' of studies with significant and non-significant results are not very reliable. Vote counting has been a common method of determining support for a hypothesis, is often used in the introduction or discussion sections of empirical papers to provide an overview of the current state of a field or to justify new work, and is sometimes published under the erroneous title of 'meta-analysis'. No quantitative estimate of the effect of interest is provided by vote counting. Furthermore, vote counting lacks statistical power for two reasons. First, the effect of sample size on P-value means that real but small effects may have been obscured by small sample size in the original studies. Secondly, simply counting votes with no attention to effect magnitude or sample size does nothing to rectify this lack of power. A formal meta-analysis ameliorates this problem. Not only are effect sizes more informative, they also represent continuous variables that can be combined and compared. A more subtle point is that NHST focuses on reducing the probability of type I errors (rejecting the null hypothesis when it is in fact true). Type II errors (failing to reject the null hypothesis when it is false) are not so tightly controlled for and this type of error can be of particular concern in fields such as conservation or medicine, where failing to detect an effect of, say, pesticide use on farm bird populations, could be more harmful than a type I

error. By definition, any method that increases the power of a test reduces the likelihood of making a type II error.

There are three commonly-used types of statistic that give reliable and comparable effect sizes for use in meta-analysis. All can be corrected for sample size and weighted by within-study variance. For studies that involve comparing a continuous response variable between control and experimental groups, the mean difference between the groups can be calculated. For studies that test for an effect of a continuous or ordinal categorical variable on a continuous response variable, the correlation coefficient can be used. Finally, for dichotomous response variables the risk ratio or odds ratio provides a measure of effect size. Once a population of effect sizes has been collected, it is possible to calculate the mean effect size and also a measure of the amount of between-study variation (heterogeneity, Q) in effect size.

We might therefore say that meta-analysis is more clearly needs driven and evidence based than simple vote counting. Box 2 provides a simple demonstration of how meta-analysis works. It should be noted that, like any statistical method, meta-analyses are only as good as the data used and can still suffer from both type I and type II errors: this is dealt with in more detail in the discussion of meta-analytic methods below.

Key references. Useful textbooks on meta-analysis include those by Borenstein et al. (2009), Cooper, Hedges & Valentine (2009) and Lipsey & Wilson (2001) and a forthcoming volume edited by Koricheva, Gurevitch & Mengerson (in press). The introductory chapters of these books all expand on the information discussed by Stewart (2010) and Hillebrand (2008). Koricheva et al.'s book is explicitly written for researchers in ecology and evolution; of the other three, Borenstein et al.'s book is probably the most accessible and Lipsey & Wilson's the most concise. Cohen (1990) addresses some of the issues discussed above in a short and illuminating article on using and interpreting statistical tests, while Nakagawa & Cuthill (2007) provide a detailed discussion of significance vs. effect size. For readers interested in early arguments against vote counting and the first applications of meta-analysis in evolution and ecology, see the seminal work of Gurevitch and Hedges (e.g. Hedges & Olkin 1980; Gurevitch et al. 1992; Gurevitch & Hedges 1993).

A 'to do' list for meta-analysis

At this point, it would be useful to outline the steps required to begin and carry out a meta-analysis.
1 Perform a thorough literature search for studies that address the hypothesis of interest, using defined keywords and search methodology. This includes searching for unpublished studies, for example by posting requests to professional newsletters or mailing lists.
2 Critically appraise the resulting studies and assess whether they should be included in the review. (Are they applicable? Is the study methodology valid? Do you have enough information to calculate an effect size?) Record the reasons for dropping any studies from your data set.
3 Choose an appropriate measure of effect size and calculate an effect size for each study that you wish to retain.
4 Enter these studies into a master data base which includes study identity, effect size(s), sample size(s) and information which codes each study for variables which you have reason to believe may affect the outcome of each study, or whose possible influence on effect size you wish to investigate (experimental design, taxonomic information on the study species, geographic location of study population, life-history variables of the species used etc). You should also record how you calculated the effect size(s) for each study (see below).
5 Use meta-analytic methods to summarize the cross-study support for the hypothesis of interest and to try to explain any variation in conclusions drawn by individual studies.
6 Assess the robustness and power of your analysis (likelihood of type I and type II errors).

Steps 1 and 2 reflect the fact that meta-analysis sits within the general methodological framework of the systematic review. Cooper, Hedges & Valentine (2009) argue that research synthesis based on systematic reviews can be viewed as a scientific discipline in its own right. As they rightly stress, a good systematic review follows exactly the same steps as an experiment: a problem is identified, a hypothesis or hypotheses formulated, a method for testing the hypothesis designed and, once applied, the results of this method are quantitatively analysed. The method itself can then be criticized. These steps allow the goals of systematic review in general, or meta-analysis in particular, to be met. It is difficult to argue that a review has usefully contributed to a field, whether it be by providing critical analysis of empirical results, highlighting key issues or addressing a conflict, if the review itself does not have a firm basis in a defined methodology for identifying, including and extracting information from the sources reviewed. A notable proponent of the systematic review approach in ecology, Stewart (e.g. Stewart, Coles & Pullin 2005; Pullin & Stewart 2006; Roberts, Stewart & Pullin 2006) has provided guidelines relevant to this field.

The keys to making meta-analysis as stress-free as possible are organization and planning. In particular, your list of potential moderator variables (step 4) should be clearly defined before you begin: it is far preferable to produce a data base which includes information that you later decide not to use, than to produce a data base that excludes a variable you later decide to explore, as the latter may require a second (or third, or fourth) trawl through your collection of studies to extract the necessary information. In the present article, I will now concentrate on the mechanics of carrying out a meta-analysis (steps 3, 5 and 6).

Key references. DeCoster (2004) and Lipsey & Wilson (2001, Chapters 2, 4 & 5) provide an excellent and comprehensive guide to literature searching and study coding. It may be helpful at this stage to read a model meta-analysis: good examples for ecologists include Fernandez-Duque & Valeggia (1994), Côté & Sutherland (1997) or Cassey et al. (2005); many further examples can be found at the website of the Meta-analysis in Ecology and Evolution working group (http://www.nceas.ucsb.edu/meta/).
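The master data base in step 4 can be as simple as a flat table with one row per study. The sketch below uses Python's standard csv module; the field names and the example row are purely illustrative (the article does not prescribe a schema), and 'Smith_1998' is a hypothetical study.

```python
# A minimal, illustrative master data base for step 4: one row per study,
# holding the effect size, sample sizes and candidate moderator variables.
import csv

FIELDS = ["study_id", "effect_size", "variance", "n_control", "n_treatment",
          "species", "taxon", "location", "design", "es_calculated_from"]

rows = [{
    "study_id": "Smith_1998",              # hypothetical example study
    "effect_size": 0.42,
    "variance": 0.09,
    "n_control": 15,
    "n_treatment": 14,
    "species": "Parus major",
    "taxon": "bird",
    "location": "UK",
    "design": "mate_removal",
    "es_calculated_from": "means_and_SDs",  # record how each ES was obtained
}]

with open("meta_db.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Coding every moderator you might conceivably want, as the text advises, costs little here: deleting a column later is trivial, while re-extracting one means another trawl through the study collection.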


Choosing an appropriate effect size statistic

A meaningful measure of effect size will depend on the nature of the data being considered. Experimental and observational studies in ecology and evolution generally generate data that fall into one of three categories, and this determines which indices of effect size are appropriate. All of the indices of effect size outlined below have known sampling distributions (generally they are normalized) and this allows us to calculate their standard errors and construct confidence intervals.
1 Continuous or ordinal data from two or more groups. Data in this form are exemplified by treatment vs. control group comparisons and are generally presented and analysed using averages and measures of variance (mean and standard deviation, median and interquartile range, etc). In such cases, a measure of the difference between the group means is an appropriate effect size. The raw difference in means can be standardized by the pooled standard deviation; two commonly-used measures of standardized mean difference are Cohen's d and Hedges' g: these differ in the method used for calculating the pooled standard deviation but it should be noted that the d and g notation has been used interchangeably by some authors. Alternatively, when the data measure rates of change in independent groups (e.g. plant growth response in normal or elevated CO2, body mass gain after supplementary feeding), the response ratio can be used. This measures the ratio of the mean change in one group to the mean change in the other. Like the standardized mean difference, it takes the standard deviations in the two groups into account. The response ratio is generally log-transformed prior to meta-analysis in order to linearize and normalize the raw ratios.
2 Continuous or ordinal data which are a response to a continuous or ordinal independent variable. Any data which are analysed using correlation or regression fall into this category. In this case, the correlation coefficient itself can be used as a measure of effect size. Generally, we are interested in a simple bivariate relationship (say, the effect of average daily rainfall on the laying date of great tits), and it may be that the studies in our data set also explore such a relationship. If a study reports the results of statistical tests which include other variables (such as average daily temperature during the breeding season), then we might use the partial correlation coefficient: the effect of rainfall on lay date if temperature is held fixed. (It may be that this is the only effect size we can calculate from the data available; if the published data allow us to calculate the simple bivariate correlation between rainfall and laying date, ignoring temperature, then we could argue that it would be better to use this as our effect size as it would be more directly comparable with the bivariate correlation coefficients retrieved from the other studies). Whichever type of correlation coefficient we use, Fisher's z transformation is generally applied in order to stabilize the variance among coefficients prior to meta-analysis.
3 Binary response data. Data that take the form of binary yes/no outcomes, such as nest success or survival to the next breeding season, are generally analysed using logistic regression or a chi-squared test. In this case, an appropriate measure of effect size is given by calculating the risk ratio or odds ratio. These types of effect size have been very rarely used in ecology and evolution, though they are common in medical research.

In some cases, more than one type of effect size can be meaningful. For instance, if an experiment involves applying some quantitatively-measurable level of treatment to the experimental group, then the experimental and control groups could meaningfully be compared using either standardized mean difference or a correlation coefficient. If different studies have applied different levels of the treatment to their experimental groups, the latter may be preferable. The best measure of effect size must be judged based on its compatibility with the available raw data and its ease of interpretation.

Much of the labour in conducting a meta-analysis lies in calculating individual effect sizes for studies. All of the effect sizes mentioned above can be calculated from reported means, variances, SEs, correlation coefficients and frequencies. If these are not available then effect sizes can be calculated from reported t, F or Chi-squared statistics or from P-values. The exact formulae for calculating effect sizes from these data differ depending on the nature of the statistical tests and experimental designs from which they were taken (e.g. paired vs. unpaired t-test). This is explained rather thoroughly by DeCoster (2004) and Nakagawa & Cuthill (2007). In general, the more directly you can calculate an effect size (the less you have to infer by using test statistics and reconstructed statistical tables), the less error will be incorporated into your estimate of the effect size. It is also possible to convert between different measures of effect size.

While the actual mathematics of converting reported data into effect sizes is rendered fairly straightforward thanks to freely-available Microsoft Excel files and meta-analytic software packages, actually harvesting the necessary data from a library of published studies can be painstaking work. The number of studies that have to be discarded due to an inability to calculate a meaningful effect size based on the information available can be surprisingly high. Studies that do not give variance statistics, do not clearly state which statistical tests were used or even do not make sample size explicit all create headaches for the would-be meta-analyst.

Key references. The textbooks referenced above all outline various effect size calculations, as do Hillebrand (2008), DeCoster (2004) and Nakagawa & Cuthill (2007). Hedges, Gurevitch & Curtis (1999) provide an introduction to the use of response ratios for ecological data and Schielzeth (2010) presents a thoughtful and interesting perspective on the calculation and presentation of correlation coefficients. Lipsey & Wilson (2002) helpfully provide an Excel spreadsheet for calculating effect size, which complements the information provided in the Appendices of their textbook; a similar spreadsheet is provided by Thalheimer & Cook (2002). The software packages outlined in 'Partitioning and explaining heterogeneity' can also calculate effect sizes from summarized data.
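The most common of these calculations can be sketched in a few lines of Python. This is an illustrative sketch rather than a library: the formulae are the standard ones (pooled-SD standardized mean difference, a widely used small-sample correction for Hedges' g, Fisher's z, the log response ratio, and d recovered from an unpaired t statistic), and the function names are my own.

```python
# Illustrative effect size calculators (standard formulae; names are mine).
from math import atanh, log, sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Cohen's d shrunk by a common small-sample bias correction."""
    j = 1 - 3 / (4 * (n1 + n2) - 9)
    return j * cohens_d(m1, s1, n1, m2, s2, n2)

def d_from_t(t, n1, n2):
    """Recover d from an unpaired t statistic when only t and n are reported."""
    return t * sqrt(1 / n1 + 1 / n2)

def fisher_z(r):
    """Fisher's variance-stabilizing transform of r; var(z) is roughly 1/(n - 3)."""
    return atanh(r)

def log_response_ratio(mean_exp, mean_ctl):
    """Log response ratio ln(mean_exp / mean_ctl), for rate-of-change data."""
    return log(mean_exp / mean_ctl)

# Example: two groups with means 12 vs. 10, SD 2, n = 20 each.
print(cohens_d(12, 2, 20, 10, 2, 20))   # 1.0 (pooled SD is exactly 2)
print(hedges_g(12, 2, 20, 10, 2, 20))   # slightly shrunk toward zero
```

The point of the sketch is the caveat in the text made concrete: each converter assumes a particular design (here, two independent groups and an unpaired t-test), and using the wrong formula for a paired design quietly biases the estimate.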


Analysing your data

There are three main steps to a meta-analysis: calculating the mean effect size, calculating a measure of between-study heterogeneity in effect size and partitioning this heterogeneity between moderator variables and error. The three papers suggested in 'A to do list for meta-analysis' as model ecological meta-analyses all provide clear and helpful guides for carrying out your own meta-analysis. Côté & Sutherland (1997) and Fernandez-Duque & Valeggia (1994) use Cohen's d as their effect size while Cassey et al. (2005) use transformed correlation coefficients. Côté & Sutherland and Cassey et al. give details of between-group heterogeneity calculations. In our meta-analysis of parental care (Harrison et al. 2009), my colleagues and I show and discuss results derived from Lipsey & Wilson's (2002) SPSS macros.

THE MEAN EFFECT SIZE

If we have effect sizes from N studies and we denote the effect size for the ith study ES_i and its variance s²(ES_i), we can then calculate the mean effect size across studies. In order to give studies with lower variance more weight in the calculation of the mean, we multiply each individual effect size by the inverse of its variance 1/s²(ES_i), henceforth denoted w_i for brevity. The weighted mean effect size ES and its standard error SE(ES) are thus calculated as follows.

ES = Σ(ES_i × w_i) / Σw_i        (eqn 1)

SE(ES) = √(1 / Σw_i)             (eqn 2)

Note that this is a fixed-effect calculation of ES as it does not take into account random sources of variance (see below for a discussion of fixed and random effects). Confidence intervals for the mean effect size can then be calculated as for the individual effect sizes, using critical values from a standard normal or t distribution. If the confidence interval does not include zero, then we conclude that on average, experimental manipulation has a significant effect on our response variable at the specified significance level. Once the mean effect size has been calculated, the next step is to determine whether the various individual effect sizes in our sample of studies are likely to estimate the same population mean effect size.

HETEROGENEITY ACROSS STUDIES: FIXED AND RANDOM EFFECTS

If our sample of effect sizes reflects a single population mean effect size, then a study's effect size will differ from the true population mean purely as a result of which individuals were studied and the distribution of effect sizes will be homogeneous. It is possible to test whether our set of effect sizes shows more heterogeneity than would be expected due to sampling error alone by calculating the Q (often called Q_Total) statistic. The significance of Q is tested using either a chi-squared distribution or randomization tests and the calculation of Q is covered in the textbooks cited in 'Meta-analysis vs. vote counting'.

Variance in effect size between studies, and a significant Q-value, may stem from one of two broad sources. On the one hand, systematic, identifiable sources of variation such as species used, sex of individuals etc. may cause heterogeneity in effect size. Such sources of variation can be identified and their impact on effect size can be quantified using statistical tests analogous to ANOVA or weighted regression; this is exactly how Côté & Sutherland (1997) showed that predator removal differentially affects breeding and non-breeding bird populations, as mentioned in 'Introduction: the foundations of meta-analysis'. An overview of how heterogeneity is quantified and partitioned between moderator variables is given in the next section of this article, but first it is necessary to discuss the second source of variation: essentially random or non-identifiable differences between studies.

Most experimental biologists will be familiar with fitting random factors in ANOVA-type analyses to produce a mixed model. For instance, in behavioural experiments with repeated measures on the same individuals, individual identity should be coded as a random variable if we wish to generalize our results from these specific individuals to the population as a whole: the true effect of the experiment is not assumed to be the same for all individuals and we are interested in the mean effect across individuals, rather than finding a single 'true' effect that holds for every member of the population. By analogy, individual studies in a meta-analysis could each have some idiosyncrasies that we cannot either reveal or include in our model (age or sex ratio of the population used, season during which observations were taken etc.). So just as we measure experimental effects on a sample of individual animals from a population, in meta-analysis we have measured a random sample of effect sizes from a population of true effect sizes. If this is the case, then there must be some component of the total variance in the data set that represents these random effects and we must incorporate this into our analysis if we wish to generalize our results beyond this specific set of studies.

Therefore, if we obtain a significant Q-value from the fixed-effects calculation of ES, we must proceed in one of three ways. First, we could continue to assume a fixed effects model but add the assumption that between-study variation is the result of systematic, identifiable moderator variables and then attempt to identify these. Secondly, we could assume that the variation between studies is random. In this case, we can use mathematical methods for estimating the random effects variance component, add this component to the variance statistic for each individual effect size and re-calculate the inverse variance weights in order to calculate a random-effects version of ES. Thirdly, we could assume that heterogeneity stems from both fixed and random sources. In this case, we can run meta-analytic models to test for the effects of fixed moderator variables, using inverse variance weights that have been adjusted for the estimated random effects component. This produces a mixed-effects model and a more robust test for the significance of moderator variables, because it does not exclude

the possibility that some of the heterogeneity remaining after a model has been fitted is due to systematic but unidentified sources of variation. The type I error rate is thus reduced, though this does come at the cost of a loss of some power in testing for the effects of moderator variables (Lipsey & Wilson, 2001). The software packages detailed in 'Partitioning and explaining heterogeneity' allow the user to specify fixed- or random-effects models and/or calculate the random effects variance component.

It is worth noting that sometimes moderator variables can explain some of the variance in the data set even when there is no evidence for significant overall heterogeneity in a fixed-effects estimate of ES. Therefore, if there are good a priori reasons for supposing an effect of a moderator, it is arguably worth testing for this even when Q is not significant. A related issue is that of non-independence of effect sizes across studies; it may be advisable to control for studies being conducted by the same research group, for instance, by including this as a moderator variable. In a population of studies carried out on different species, non-independence arises from phylogenetic relatedness. The development of phylogenetic meta-analysis is gaining momentum but is beyond the scope of this article; Adams (2008) and Lajeunesse (2009) discuss methodologies for conducting such analyses.

Key references. Part 3 of Borenstein et al. (2009) and Chapters 6-8 of Lipsey & Wilson (2001) provide clear explanations of fixed vs. random effects models and provide macros (2002) to calculate the random effects variance component. Gurevitch & Hedges (1993) provide one of the first discussions of mixed-effects models in ecological meta-analysis.

PARTITIONING AND EXPLAINING HETEROGENEITY

If moderator variables cause effect sizes to differ on a study-by-study basis, the distribution of effect sizes will be heterogeneous. For categorical moderator variables, a test analogous to ANOVA can be used to determine whether grouping studies by, for example, the sex of the focal individual explains a significant proportion of the total heterogeneity in effect sizes. The mean effect sizes within each sex can then be calculated. In the case of continuous or ordinal categorical moderator variables, an analogue to a weighted regression model can be used to determine whether fitting one or more of these explanatory variables explains a significant amount of heterogeneity (Q_Model vs. Q_Residual). In this case, for each variable treated as a covariate, an estimate of the slope of its effect and a corresponding P-value can also be calculated. Implementing these models is rendered fairly straightforward by the availability of specialist software such as MetaWin (Rosenberg, Adams & Gurevitch 2000) or Meta-analyst (Wallace et al. 2009) and by macros or extensions for common software packages (e.g. macros for SPSS: Lipsey & Wilson 2001, 2002; MIX for Microsoft Excel: Bax et al. 2008).

Criticizing meta-analysis

The robustness and utility of meta-analysis and the reliability of any inferences drawn from it are determined ultimately by the population of individual studies used. First, issues surrounding which studies can and should be included in a meta-analysis should be mentioned. Secondly, it would be useful to have some way of determining the likelihood of a significantly non-zero mean effect size being the result of a type I error and, conversely, the likelihood of a zero mean effect size being the result of a lack of statistical power rather than a reliable reflection of the true population mean effect size. The number and identity of studies used, as well as their individual sample sizes, will affect type I and II error rate in meta-analysis.

WHICH STUDIES CAN BE COMBINED IN A META-ANALYSIS?

Step 2 in the 'to do' list reflects the fact that including methodo-
neous. For instance, if our studies measure the eect of logically poor studies in the data set may add more noise than
environmental disturbance on antagonistic interactions signal, clouding our ability to calculate a robust mean eect
between individuals of a species, and if this eect is moderated size or to identify important moderator variables. Dening
by the population density, then a data set which includes stud- and reporting the criteria by which studies were assessed for
ies on high- and low-density populations may show a bimodal inclusion is therefore an essential part of the meta-analytic
distribution of eect sizes. In this case, the null hypothesis of method. Furthermore, thought must be given as to whether
homogeneity is rejected and a single mean is not the best way the studies under consideration may sensibly be combined in a
to describe the the population of true eect sizes. Sources of meta-analysis do the eect sizes calculated from the popula-
systematic variation between studies the potential moderator tion of studies all reect the same thing? For instance, both
variables identied when planning and coding the meta-analy- feeding ospring and providing thermoregulation by means of
sis must be investigated to see if they do indeed aect the brooding or incubating are types of parental care, but in our
response to experimental manipulation. meta-analysis (Harrison et al. 2009) we considered these two
Variables that explain signicant proportions of heterogene- types of care separately. The eect sizes for the two types of
ity can be identied using statistical tests that are analogous to care were signicantly dierent as dened by a Q test, but more
anova or weighted regression, but which do not rely on the fundamentally there is no reason to assume that these behav-
assumption of homogeneity of variance. If we have a categori- iours have the same cost : benet ratios for parents: therefore,
cal moderator variable, for example sex, then just as an anova we felt that combining their eect sizes would be an example of
would partition total variance in a data set into variance due to comparing apples and oranges a criticism that has often
the explanatory variables (between-sex variance) and variance been levelled at meta-analysis. This consideration is probably
due to error (within-sex variance), so heterogeneity can be split more pertinent to ecologists than to, say, medical researchers,
into between- and within-group components. If QBetween is sig- as response variables and study designs vary more widely in
nicantly greater than QWithin, this indicates that sex explains a our eld.
2010 The Author. Methods in Ecology and Evolution 2010 British Ecological Society, Methods in Ecology and Evolution, 2, 110
Getting started with meta-analysis 7
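The ANOVA-like partitioning of heterogeneity described in the previous section can be sketched in a few lines of Python. This is an illustrative sketch rather than the procedure of any particular package: it assumes each study supplies an effect size and its sampling variance, weights studies by inverse variance (a fixed-effects scheme), and tests QBetween against a chi-square distribution. The function names and the numbers in the example are hypothetical.

```python
from math import erfc, sqrt

def weighted_q(effects, weights):
    """Heterogeneity statistic Q = sum of w_i * (d_i - d_bar)^2,
    where d_bar is the inverse-variance-weighted mean effect size."""
    d_bar = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - d_bar) ** 2 for w, d in zip(weights, effects))

def partition_q(effects, variances, groups):
    """Split QTotal into QBetween and QWithin for a two-level
    categorical moderator (e.g. sex). Returns the three Q values and
    a P-value for QBetween tested against chi-square with 1 d.f.;
    a k-level moderator would use k - 1 degrees of freedom."""
    w = [1.0 / v for v in variances]
    q_total = weighted_q(effects, w)
    q_within = 0.0
    for level in set(groups):
        sub = [i for i, g in enumerate(groups) if g == level]
        q_within += weighted_q([effects[i] for i in sub], [w[i] for i in sub])
    q_between = q_total - q_within
    # For 1 d.f. the chi-square survival function reduces to erfc
    p_between = erfc(sqrt(q_between / 2.0))
    return q_total, q_between, q_within, p_between

# Hypothetical data: three studies on males, three on females,
# each with sampling variance 0.04 (i.e. weight 25)
effects = [0.0, 0.1, -0.1, 0.8, 0.9, 1.0]
groups = ["M", "M", "M", "F", "F", "F"]
qt, qb, qw, p = partition_q(effects, [0.04] * 6, groups)
```

With these numbers QBetween dwarfs QWithin, so the moderator would be reported as explaining a significant proportion of the total heterogeneity; the same decomposition logic underlies the QModel vs. QResidual comparison for continuous moderators.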

It is also worth noting that individual studies may act as outliers in a meta-analytic data set, having a very large influence on the mean effect size. It is possible to identify such studies by means of a leave-out analysis: each of our N studies is dropped from the data set in turn and a set of estimates of the mean effect sizes from the N-1 remaining studies is calculated. Software such as the aforementioned Meta-Analyst can perform an automated leave-out analysis and so flag highly influential studies. How to deal with such a study is then a matter for personal consideration; depending on the nature of the study (sample size, experimental protocol, apparent methodological quality), the meta-analyst must decide whether it is justifiable to leave it in the data set, or better to remove it. If it is retained, then it would be advisable to report the effect of dropping this study on the conclusions.

We must also consider potential sources of non-independence in the data set. Non-independence has already been mentioned in the context of moderator variables such as research group, and in the context of phylogenetic meta-analysis. However, non-independence can also result from more than one effect being measured on each individual or replicate in a study. For example, if we have data on reproductive success and survival in control and experimental groups, then including the whole population of effect sizes in a single analysis not only raises the issue of a potential 'apples and oranges' comparison, but also creates non-independence as a result of measures from the same individuals being correlated. In this scenario, arguably the best strategy is to conduct separate meta-analyses of effects on reproduction and survival. Non-independence also rears its head in another form if we test the same set of studies over and over for the effects of different moderator variables. This will compromise the reliability of our significance tests and increase the type I error rate.

PUBLICATION BIAS

The biggest potential source of type I error in meta-analysis is probably publication bias. A funnel plot of effect size vs. study size is one method of identifying publication bias in our set of studies: all things being equal, we would expect that the effect sizes reported in a number of studies should be symmetrically distributed around the underlying true effect size, with more variation from this value in smaller studies than in larger ones. Asymmetry or gaps in the plot are suggestive of bias, most often because studies which are smaller, non-significant or show an effect in the opposite direction from that expected have a lower chance of being published. A more thorough discussion of publication bias is provided by Sutton (2009). For the purposes of this article, suffice it to say that time spent uncovering unpublished data relevant to the hypothesis in question, as suggested in the 'to do' list above, is highly recommended.

Even if we discover and include some unpublished studies and produce a funnel plot with no glaring gaps, it would still be informative if we could work out the number of non-significant, unpublished studies that would have to exist, lying buried in file drawers and field notebooks, in order to make us suspect that our calculated mean effect size is the result of a type I error. This is termed the fail-safe sample size, and various simple, back-of-an-envelope methods have been suggested for calculating it, based on the number of studies included, their effect sizes and some benchmark minimal meaningful effect size. The larger the fail-safe sample size, the more confident we can be about the representativeness of our data set and the robustness of any significant findings. However, Rosenberg (2005) makes the important point that suggested methods for calculating the fail-safe sample size are overly simple and likely to be misleading, in the main because they do not take into account the weighting of individual studies in the meta-analytic data set: a curious omission, given that weighting is one of the key strengths of meta-analysis. He outlines a method for calculating the fail-safe sample size which is arguably more explicitly meta-analytic in its calculation.

The reader should therefore be aware that the utility of fail-safe sample size calculations is still debated. Jennions, Møller & Hunt (2004) and Møller & Jennions (2001) provide an interesting discussion of publication bias and type I errors in meta-analysis. These authors stress the point that meta-analysis involves (or should involve) explicit consideration of publication bias and attempts to minimize its influence, and that this should primarily consist of seeking unpublished studies (as opposed to post hoc calculations). If I may venture a tentative opinion, I would suggest that a report of fail-safe sample size is worth including in published meta-analyses, but it is no substitute for a thorough search for unpublished data and should be interpreted as only a rough reflection of the likely impact of any publication bias.

POWER

As discussed above, type II errors often concern us more than type I errors. If our mean effect size is not significantly different from zero, if no significant heterogeneity is found among studies, or if a moderator variable is concluded to have no effect on effect size, how can we start to decide if this is simply due to a lack of statistical power? Evaluating the power of meta-analytic calculations is rather more complex, as it depends on both the number of studies used and their individual sample sizes, which are related to the within-study component of variance in effect size. Hedges & Pigott (2001, 2004) provide detailed guides to power calculations for meta-analysis. In the present article, I will limit the discussion of power to the observation that small studies which in themselves have low statistical power might add more noise than signal to a meta-analytic data set and thus reduce its power: the benefits of excluding studies with very small sample size should be seriously considered, and can be quantified by calculating the power of a meta-analytic data set that either includes or excludes such studies.

Key references. Most of the general references given in earlier sections also discuss criticisms and limitations of meta-analysis. A free program for carrying out the calculations described in Rosenberg (2005) is available from http://www.rosenberglab.net/software.php. Power calculations are discussed in Chapter 29 of Borenstein et al. (2009), in Lajeunesse's chapter

in the forthcoming book by Koricheva et al. and in more detail by Hedges & Pigott (2001, 2004). Cafri & Kromrey (2008) have developed an SAS macro to calculate power using the methods described by Hedges & Pigott.

Closing remarks

Meta-analysis is a great tool for extracting as much information as possible from a set of empirical studies. The potential advantages of sharing and combining data in this way are, I hope, evident from the discussion in this article. Organizing and carrying out a meta-analysis is hard work, but the fruits of the meta-analyst's labour can be significant. In the best-case scenario, meta-analysis allows us to perform a relatively powerful test of a specific hypothesis and to draw quantitative conclusions. A low-powered analysis based on a small number of studies can still provide useful insights (e.g. by revealing publication bias through a funnel plot). Finally, by revealing the magnitude of effect sizes associated with prior research, meta-analysis can suggest how future studies might best be designed to maximize their individual power.

Most journals now include in their instructions to authors a sentence to the effect that effect sizes should be given where appropriate, or that at least the necessary information required for rapidly calculating an effect size should be provided. The lack of this information is common, but will not necessarily be noticed by the authors, interested readers or peer reviewers. For example, when conducting our meta-analysis on parental care (Harrison et al. 2009), it was only on specifically attempting to calculate effect sizes that we noticed a small number of published articles where the sample sizes used were not clear. Double-checking that sample sizes are stated explicitly and that exact test statistics and P-values are stated should not add significantly to the burden of writing up a research article and will add value to the work by allowing its ready incorporation into a meta-analysis if required. On a more positive note, we received many rapid and positive responses from colleagues whom we contacted to ask for clarification, extra data or unpublished data. There is clearly a spirit of cooperation in ecology and evolution which can lead to the production of useful and interesting syntheses of key issues in the field.

Box 1: Glossary

Effect size: A standardized measure of the response of a dependent variable to change in an independent variable; often but not always a response to experimental manipulation. Effect sizes could be thought of as P-values that have been corrected for sample size and are the cornerstone of meta-analysis: they make statistical comparison of the results of different studies valid. Commonly used effect size measurements are the standardized mean difference between control and experimental groups, correlation coefficients and response ratios.

Fail-safe sample size: If we calculate a mean effect size across studies and it is significantly different from zero, the fail-safe sample size is the number of unpublished studies with an effect size of zero that would have to exist in order to make our significant result likely to be due to sampling error rather than any real effect of the experimental treatment; i.e. the bigger this value, the smaller the probability of a type I error. The utility of fail-safe sample sizes is debated.

Heterogeneity: A measure of the among-study variance in effect size, denoted Q. Just as ANOVA-type statistical analyses partition variance between defined independent variables and error to perform significance tests, meta-analysis can partition heterogeneity between independent variables of interest and error.

Meta-analysis: A formal statistical framework for comparing the results of a number of empirical studies that have tested, or can be used to test, the same hypothesis. Meta-analysis allows us to calculate the mean response to experimental treatment across studies and to discover key variables that may explain any inconsistencies in the results of different studies.

Null hypothesis significance testing: Traditional statistical tests are tools for deciding whether an observed relationship between two or more variables is likely to be caused simply by sampling error. A test statistic is calculated based on the variance components of the data set and compared with a known frequency distribution to determine how often the observed patterns in the data set would arise by chance, given random sampling from a homogeneous population.

Power: The ability of a given test using a given data set to reject the null hypothesis (at a specified significance level) if it is false; i.e. as power increases, the probability of making a type II error decreases.

Type I error: Rejecting the null hypothesis when it is true (see Fail-safe sample size).

Type II error: Failing to reject the null hypothesis when it is false (see Power).

Box 2: The power of meta-analysis

Imagine that a novel genetic polymorphism has been discovered in a species of mammal. It has been hypothesized that the mutant genotype may affect female lifetime reproductive success (LRS) relative to the wild type. Twelve groups of researchers genotype a number of females and record their LRS. Each group studies equal numbers of wild-type and mutant females, with total sample sizes ranging from 18 to 32 animals. Six of the studies were carried out on one long-term study population in habitat A and six on a second in habitat B.

Unknown to the researchers, there is a habitat-dependent effect of genotype on female LRS. Across the whole species, wild-type females produce on average 50 ± 20 offspring that survive to reproductive age. In habitat A, mutant females also produce 50 ± 20 offspring, but in habitat B mutant LRS is increased to 58 ± 20 offspring. The standardized mean difference in female LRS is, therefore, zero in habitat A and 0.4 in habitat B.

The results of the imaginary studies are given in the table below and are based on random sampling from normal distributions with the specified means and standard deviations (Table 1). For each study, LRS (mean and SD) is given for

Table 1. Results of 12 studies investigating the effect of genotype on LRS

                 Wild-type LRS           Mutant LRS
Study  Habitat   Mean   SD     N         Mean   SD     N        P (two-tailed)
1      A         47.1   15.5   10        52.8   27.4   10       0.575
2      A         45     15.1   12        50.7   22.3   12       0.470
3      A         51.2   19.4   14        47.9   21     14       0.671
4      A         49.2   20.1   11        46.8   13.5   11       0.745
5      A         49.3   16.5   15        41.9   22.4   15       0.312
6      A         50.8   23.6   16        51.1   15.8   16       0.967
7      B         47.1   11.5   12        59.4   16.6   12       0.047
8      B         49     17     10        60.1   12.2   10       0.110
9      B         51     20.5   9         58.3   18.1   9        0.257
10     B         47.7   17.8   16        69.9   17     16       0.001
11     B         39.2   25.5   14        58.4   19.4   14       0.034
12     B         49.9   16     15        57.2   16.6   15       0.229

LRS, lifetime reproductive success.

each genotype, along with the sample sizes and the P-value resulting from a t-test. Based on t-tests, three studies reported a significant effect of the mutant allele on LRS.

Can we use meta-analytic techniques to combine these data and gain quantitative estimates for the size of the effect of genotype on LRS? Fig. 1 shows the calculated mean effect size (Cohen's d) for each study, with their 95% confidence intervals. The 95% confidence interval for the weighted mean effect size across all twelve studies is (0.06, 0.64), suggesting that the mutation does indeed increase LRS. Furthermore, if we treat habitat as a moderator variable, the genotype-by-environment interaction is revealed: the 95% confidence interval for the mean effect is (−0.35, 0.29) in habitat A and (0.40, 1.1) in habitat B. Thus the mean effect size is not significantly different from zero in habitat A, but positive in habitat B. Also, the confidence interval for habitat B (just) captures the true effect size of 0.4.

This example should serve to demonstrate that meta-analysis is a powerful way of synthesizing data and effectively increasing sample size to provide a more robust test of a hypothesis. However, like all statistical methods, the results of meta-analysis should be interpreted in the light of various checks and balances which can inform us as to the likely reliability of our conclusions: this is discussed in the main text.

[Figure 1 omitted: forest plot in two panels, (a) and (b); x-axis runs from −1.5 to 2.5, labelled 'Effect size (Cohen's d ± 95% confidence interval)'.]

Fig. 1. Results of 12 studies investigating the effect of genotype on lifetime reproductive success (LRS). Small grey circles show effect sizes from individual studies, large black circles mean effect sizes for the two habitats (A and B). The black square shows the overall mean effect size. All effect sizes are Cohen's d ± 95% confidence interval. The fictitious data set was analysed using Meta-Analyst (Wallace et al. 2009).

Acknowledgements

I would like to thank my co-authors for my own first foray into meta-analysis, Zoltán Barta, Innes Cuthill and Tamás Székely, for their support. I would also like to thank Rob Freckleton, Michael Jennions and one anonymous reviewer for their helpful criticisms and suggestions, and finally Andy Morgan for proofreading the manuscript.

References

Adams, D.C. (2008) Phylogenetic meta-analysis. Evolution, 62, 567–572.
Bax, L., Yu, L.M., Ikeda, N., Tsuruta, H. & Moons, K.G.M. (2008) MIX: comprehensive free software for meta-analysis of causal research data. Version 1.7. http://mix-for-meta-analysis.info.
Borenstein, M., Hedges, L.V., Higgins, J.P.T. & Rothstein, H.R. (2009) Introduction to Meta-Analysis (Statistics in Practice). John Wiley & Sons, Chichester.
Cafri, G. & Kromrey, J.D. (2008) ST-159: A SAS macro for statistical power calculations in meta-analysis. SESUG 2008: The Proceedings of the SouthEast SAS Users Group, St Pete Beach, FL, 2008.
Cassey, P., Blackburn, T.M., Duncan, R.P. & Lockwood, J.L. (2005) Lessons from the establishment of exotic species: a meta-analytical case study using birds. Journal of Animal Ecology, 74, 250–258.
Cohen, J. (1990) Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cooper, H.M., Hedges, L.V. & Valentine, J.C. (eds) (2009) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn. Russell Sage Foundation, New York, NY.
Côté, I.M. & Sutherland, W.J. (1997) The effectiveness of removing predators to protect bird populations. Conservation Biology, 11, 395–405.
DeCoster, J. (2004) Meta-analysis notes. Retrieved from http://www.stat-help.com/notes.html.
Fernandez-Duque, E. & Valeggia, C. (1994) Meta-analysis: a valuable tool in conservation research. Conservation Biology, 8, 555–561.
Gurevitch, J. & Hedges, L.V. (1993) Meta-analysis: combining the results of independent experiments. The Design and Analysis of Ecological Experiments (eds S.M. Scheiner & J. Gurevitch), pp. 378–398. Chapman & Hall, London.
Gurevitch, J., Morrow, L.L., Wallace, A. & Walsh, J.S. (1992) A meta-analysis of field experiments on competition. American Naturalist, 140, 539–572.
Harrison, F., Barta, Z., Cuthill, I. & Székely, T. (2009) How is sexual conflict over parental care resolved? A meta-analysis. Journal of Evolutionary Biology, 22, 1800–1812.
Hedges, L.V. & Olkin, I. (1980) Vote-counting methods in research synthesis. Psychological Bulletin, 88, 359–369.
Hedges, L.V. & Pigott, T.D. (2001) The power of statistical tests in meta-analysis. Psychological Methods, 6, 203–217.
Hedges, L.V. & Pigott, T.D. (2004) The power of statistical tests for moderators in meta-analysis. Psychological Methods, 9, 426–445.
Hedges, L.V., Gurevitch, J. & Curtis, P.S. (1999) The meta-analysis of response ratios in experimental ecology. Ecology, 80, 1150–1156.
Hillebrand, H. (2008) Meta-analysis in ecology. Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd, Chichester.
Jennions, M.D., Møller, A.P. & Hunt, J. (2004) Meta-analysis can fail: reply to Kotiaho & Tomkins. Oikos, 104, 191–193.
Koricheva, J., Gurevitch, J. & Mengersen, K. (eds) (in press) Handbook of Meta-Analysis in Ecology and Evolution. Princeton University Press, Princeton, NJ.
Lajeunesse, M.J. (2009) Meta-analysis and the comparative phylogenetic method. American Naturalist, 174, 369–381.
Lipsey, M.W. & Wilson, D.B. (2001) Practical Meta-analysis (Applied Social Research Methods Series Volume 49). SAGE Publications, Thousand Oaks, CA.
Lipsey, M.W. & Wilson, D.B. (2002) Effect size calculator and SPSS macros available from http://mason.gmu.edu/~dwilsonb/ma.html.
Møller, A.P. & Jennions, M.D. (2001) Testing and adjusting for publication bias. Trends in Ecology and Evolution, 16, 580–586.

Nakagawa, S. & Cuthill, I.C. (2007) Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews, 82, 591–605.
Pullin, A.S. & Stewart, G.B. (2006) Guidelines for systematic review in conservation and environmental management. Conservation Biology, 20, 1647–1656.
Roberts, P.D., Stewart, G.B. & Pullin, A.S. (2006) Are review articles a reliable source of evidence to support conservation and environmental management? A comparison with medicine. Biological Conservation, 132, 409–423.
Rosenberg, M.S. (2005) The file-drawer problem revisited: a general weighted method for calculating fail-safe numbers in meta-analysis. Evolution, 59, 464–468.
Rosenberg, M.S., Adams, D.C. & Gurevitch, J. (2000) MetaWin: statistical software for meta-analysis. Version 2.0. Software and manual available from http://www.metawinsoft.com.
Schielzeth, H. (2010) Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution, 1, 103–113.
Stephens, P.A., Buskirk, S.W. & del Rio, C.M. (2007) Inference in ecology and evolution. Trends in Ecology & Evolution, 22, 192–197.
Stewart, G. (2010) Meta-analysis in applied ecology. Biology Letters, 6, 78–81.
Stewart, G.B., Coles, C.F. & Pullin, A.S. (2005) Applying evidence-based practice in conservation management: lessons from the first systematic review and dissemination projects. Biological Conservation, 126, 270–278.
Sutton, A.J. (2009) Publication bias. The Handbook of Research Synthesis and Meta-Analysis, 2nd edn (eds H.M. Cooper, L.V. Hedges & J.C. Valentine), pp. 435–452. Russell Sage Foundation, New York, NY.
Thalheimer, W. & Cook, S. (2002) How to calculate effect sizes from published research articles: a simplified methodology. Retrieved from http://work-learning.com/effect_sizes.htm.
Wallace, B.C., Schmid, C.H., Lau, J. & Trikalinos, T.A. (2009) Meta-Analyst: software for meta-analysis of binary, continuous and diagnostic data. BMC Medical Research Methodology, 9, 80.

Received 10 March 2010; accepted 28 June 2010
Handling Editor: Robert P. Freckleton