
FOCUS

SCIENTIFIC COMMUNICATION

Criteria for biological reproducibility: What does n mean?
Kristen Naegle,1* Nancy R. Gough,2 Michael B. Yaffe3,4*
This Focus tackles the issue of technical versus biological replicates, what constitutes appropriate biological replicates, and appropriate statistical analysis for
data with small sample sizes.

Technical versus biological replicates


Reproducibility in biological science is critical to scientific progress. Here, we discuss
the issue of reproducibility at the level of
individual experiments. First, it is important to distinguish between technical and
biological replicates. Technical replicates
tell something about the reproducibility of
an assay, not the reproducibility of the phenomenon under study. Done in duplicate or
triplicate, these technical replicates provide
a glimpse of whether the technique used to
measure something is robust or noisy, and,
if it is noisy, whether the extent of that noise
negates the ability to distinguish the effect
from the control. Technical replication data,
however, need to be interpreted in the context of biological replicates. Poor technical
replication may still enable discovery if the
biological replicates show that the phenomenon is strong and easily distinguishable
from the controls. Conversely, tight technical replication is necessary if the goal is to
discover an effect when the phenomenon
itself is weak and highly variable between
individual experiments.

1Biomedical Engineering and the Center for
Biological Systems Engineering, Washington
University in St. Louis, St. Louis, MO, 63130,
USA. 2Editor, Science Signaling, American Association for the Advancement of Science, 1200
New York Avenue, N.W., Washington, DC 20005,
USA. 3Chief Scientific Editor of Science Signaling, American Association for the Advancement of Science, 1200 New York Avenue, N.W.,
Washington, DC 20005, USA. 4David H. Koch
Institute for Integrative Cancer Research, The
Broad Institute, and the Departments of Biology
and Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
*Corresponding author. E-mail: myaffe@mit.edu
(M.Y.); knaegle@wustl.edu (K.N.)

The concept of biological replicates is
even trickier. What exactly is a biological
replicate, and why are these important? The
definition and the importance of biological
replicates depend on exactly what conclusions are being drawn. One may argue that
biological replicates should be designed to
be as similar as possible to each other,
that is, parallel plates of cells in the same
incubator, harvested on the same day at the
same time. In fact, we would argue that by
those criteria, the replicates would most often be technical and not biological. But it
depends on what kind of truth you are seeking. If the intent of a biological replicate is
to prove that the phenomenon you observed
was real (in those cells, in that incubator, on that day), then maybe those can be
considered biological replicates inasmuch
as they are biological samples evaluated
under identical conditions. But, in general,
most biologists are interested in discovering biological processes that constitute a
fundamental truth about nature, not just
something that might be true only when the
humidity in the tissue culture hood is 63%
and the pollen count in the lab is high. We
want to learn something general about life
itself, day in and day out, and that means,
at the least, that we can observe it again and
again, at different times, on different days,
and maybe even in different thawed aliquots
or passages of the same cell line.
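
One way to keep the two kinds of replicate apart during analysis is to average the technical replicates within each independent experiment first, so that n counts experiments rather than wells. The Python sketch below illustrates the idea under that assumption; the readings, the day labels, and the function name `summarize` are hypothetical and are not taken from any real data set.

```python
from statistics import mean, stdev

# Hypothetical assay readings: three independent experiments (biological
# replicates, run on different days), each measured in technical triplicate.
readings = {
    "day 1": [10.2, 9.8, 10.0],
    "day 2": [11.1, 10.9, 11.0],
    "day 3": [9.1, 8.9, 9.0],
}


def summarize(readings_by_experiment):
    """Collapse technical replicates, then summarize across experiments."""
    # One value per biological replicate: the mean of its technical replicates.
    per_experiment = [mean(values) for values in readings_by_experiment.values()]
    return {
        "n": len(per_experiment),     # biological n, not the number of wells
        "mean": mean(per_experiment),
        "sd": stdev(per_experiment),  # sample SD across experiments
    }


summary = summarize(readings)
print(summary)
```

In this made-up example the triplicates within each day agree closely, whereas the day-to-day means differ more; pooling all nine readings as if n were 9 would hide that distinction.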
www.SCIENCESIGNALING.org

7 April 2015 Vol 8 Issue 371 fs7

Downloaded from http://stke.sciencemag.org/ on April 10, 2015

"A mathematician, a physicist, and a statistician went hunting for deer. When they
chanced upon one buck lounging about, the
mathematician fired first, missing the buck's
nose by a few inches. The physicist then
tried his hand and missed the tail by a wee
bit. The statistician started jumping up and
down saying, 'We got him! We got him!'"
(from the website CrossValidated)

The biological n for cells in culture
Just as important as understanding the difference between technical and biological
replicates is understanding what constitutes appropriate biological replicates. Is it
enough to study a phenomenon in a single
cell line? The more general a phenomenon
is, the more universal the biological truth
that is being unveiled, and thus the more important the discovery is likely to be. Showing that the p38 mitogen-activated protein
kinase (MAPK) pathway cross-talks with
the nuclear factor κB (NF-κB) pathway
in one cell line (say, U2OS cells) is interesting, but this result may have greater
impact and more important implications if
it also occurs in other cultured cell lines,
such as 293 cells and HeLa cells, and in
primary cells, such as human endothelial
cells, macrophages, and breast epithelial
cells. However, it can be equally important
to know the specificity of signaling events
to understand in what contexts a particular
regulatory process is relevant: Is the crosstalk between the p38 MAPK and NF-κB
pathways limited to mesenchymal cells and
tissues? This may be critical information
not only for the basic biological researcher
but also for the medical researcher looking to
translate the results into clinical practice.
Thus, some types of experiments require
testing in multiple cell types and in primary
cells, if possible. However, it may require
a change in the scientific value placed on
negative results to provide outlets for publication of these kinds of data. Publishing the
lack of an effect in some cell types becomes
just as important as publishing the positive
data in other cell types.
Some scientists prefer to do an n = 1 experiment in four different cell lines rather
than an n = 4 experiment in a single cell
line, arguing that if they see the same phenomenon once in four cell lines, then it must
be genuine. However, that approach fails to
provide quantitative details, such as
the mean magnitude of the effect and its
variation. Thus, to determine generalizability, it is important to perform the same
experiment multiple times (biological replicates performed with technical replicates)
in multiple types of cells. Another concern
is that scientists will use different cell lines
in an opportunistic manner. In published
studies, the authors may show a blot of
protein X shifting its migration pattern in
one cell line, the coimmunoprecipitation of
protein X with protein Y in a different cell
line, and then an effect of knocking down
protein X on some phenotype in a third cell
line. Sometimes using different cellular systems is necessary for technical reasons, but
we suspect that often it is because each of
the effects was best detected in one specific
cell line. Inevitably, it is then hard to know
if the resulting model that emerges defines
a coherent phenomenon within individual
cells or is some agglomerate of effects, the
entire sum of which never actually occurs
within any single cell. We argue that reliability and generalizability are higher if the
experiments are all done within one cell line
where technically feasible. Ideally, the key
points can be demonstrated in other cell
lines, in relevant primary cells, or in vivo.
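
The contrast between n = 1 in several cell lines and n = 4 in one cell line can be made concrete with a small sketch. The fold-change values and the helper `describe` below are hypothetical and serve only to illustrate the point: with a single measurement per cell line there is a value but no estimate of variation, whereas repeated experiments give both a mean and a standard deviation.

```python
from statistics import StatisticsError, mean, stdev

# Hypothetical fold changes in some readout after treatment.
one_shot = {"U2OS": [2.1], "293": [1.8], "HeLa": [2.4], "endothelial": [1.9]}
replicated = {"U2OS": [2.1, 1.7, 2.3, 1.9]}


def describe(values):
    """Return (n, mean, SD); SD is None when it cannot be estimated."""
    try:
        sd = stdev(values)  # the sample SD needs at least two data points
    except StatisticsError:
        sd = None
    return len(values), mean(values), sd


for label, data in (("n = 1 per line", one_shot), ("n = 4 in one line", replicated)):
    for cell_line, values in data.items():
        n, m, sd = describe(values)
        print(label, cell_line, n, round(m, 2), sd)
```

Running the same replicated design in each cell line, as the text recommends, would give a mean and a standard deviation per line that can then be compared.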

One-sided versus two-sided t tests
Ultimately, researchers must turn to statistical tests to understand whether the measurements indicate that there are significant differences between the conditions. The t test is frequently used to assess significance. As with any calculation of a P value, we are asking, "What is the likelihood that the observed differences in some biological response that we measured is real, or did it happen by chance?" If less than 5% of the time (P value < 0.05) the differences could have happened by chance, then we can consider the differences real and place an asterisk (*) on the graph. As an example, imagine that we want to understand whether kinase activity is significantly increased in one cell line versus another. In this case, the hypothesis is that the kinase activity in cell line A is increased compared with that in cell line B, which means that the null hypothesis is that the kinase activity is the same. Because the hypothesis states that the change is unidirectional (cell line A has increased kinase activity compared with cell line B), we can use a one-sided t test. But, if we had a different hypothesis involving several transfected cell lines and were comparing each of them to a control and we hypothesized that the transfected cell lines would have different (increased or decreased) kinase activity compared with the control, then we would use a two-sided t test, because the difference could be significant in either direction. Like the first experiment, the null hypothesis is that the kinase activities in all of the cell lines are the same as in the control: There is no difference. If the hypothesis in the second experiment was that all of the transfected cells would have increased kinase activity compared with the control, then we would use a one-sided t test. This is important to consider carefully, because using a two-sided t test when a one-sided test is more appropriate amounts to doubling the P value because t distributions are symmetric (Fig. 1).

Fig. 1. The effect of n on the P value. The t distribution: For larger sample sizes, the tails become smaller, and therefore the P value for the same t statistic is smaller. This trend ends when the t distribution converges to a standard normal distribution for large enough n. [Plot: Student's t probability density function, probability versus t for t from -10 to 10. P value for t = 3: n = 2, P value = 0.102; n = 3, P value = 0.048; n = 4, P value = 0.029; n = 10, P value = 0.007; n = 30, P value = 0.003.] CREDIT: H. MCDONALD/SCIENCE SIGNALING

How does a t test work, and what is a t distribution? The t statistic and its distribution of values are useful for comparing the mean values from a normally distributed set of numbers (perhaps measurements or average values from replicate sets of measurements) when the total number of observations is small. The t distribution shows the probability distribution for a t statistic (see Eq. 1), and both the t distribution and the t statistic are related to the number of samples measured (Fig. 1). As the t statistic increases, the P value decreases (Fig. 1). Using the equation for Welch's t test (Eq. 1), which can be used regardless of whether a different number of samples was collected for each condition, it is clear that the statistic, t, would get larger, and therefore the P value would decrease (becoming more significant) if any of the following happened: (i) the differences between the means (m1 and m2) became larger, (ii) the standard deviations (s1 and s2) became smaller, or (iii) more samples were used to calculate the sample means and standard deviations.

$$ t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{1} $$

How good is our estimate of the mean? That depends on how you designed the experiment. In the kinase activity example, if you have nine samples and used all of the samples to calculate the mean, then you would calculate the standard deviation of the sample mean. However, if instead you had three replicates from day 1, three from day 2, and three from day 3, then you could calculate the sample mean and standard deviation for n = 3 samples, giving you three estimates of the mean. Calculating the standard deviation of the three mean estimates produces the standard error of the mean. These values could then be analyzed with the t test.

The magic of 3
How many biological replicates is enough? Most statisticians will tell you that n = 30 is a good number from which to get a feel for the mean and its distribution. However, this is generally unrealistic in biological experiments, both for practical and financial reasons. So why, then, have most researchers settled on three biological replicates? Where does this magic n = 3 number come from? There have to be at least two samples; otherwise, you cannot calculate a standard deviation and therefore cannot use a t test (nor should you want to, because with only one sample, you have no idea as to how good your estimate of the mean is). Furthermore, only in the case where n = 3 does the value of the standard deviation actually begin to tell you anything other than the arithmetic difference between the measurements. Is n = 3 that much better than n = 2? To see this, imagine taking kinase activity measurements in two cell lines; the first time, we have two samples for each cell line (n = 2), and the second time, three samples for each cell line (n = 3). To simplify the situation, assume that the means and standard deviations are the same between experiments with n = 2 and n = 3 samples (for example, the mean in cell line one is 10 with a standard deviation of 0.5, and the mean in cell line two is 5 with a standard deviation of 0.2). Despite both experiments (n = 2 versus n = 3) having no changes in the differences of the two means or the standard deviations, the P value in the case of n = 3 will be smaller than with n = 2 for two reasons. First, the t statistic increases, because the denominator decreases as the number of samples increases. Second, the degrees of freedom increases for the t distribution and, therefore, the tails decrease and the central peak rises, meaning that for the same t statistic on this n = 3 curve, versus the n = 2 curve, there is less area under the tails (Fig. 1).
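
The two effects of sample size described in this section can be checked numerically. The following Python sketch uses only the standard library: `welch_t` evaluates the t statistic of Eq. 1, and `one_sided_p` integrates the Student's t density to get the upper-tail area. Two assumptions are ours rather than the article's: the degrees of freedom are taken as n - 1, which is the choice that matches the P values listed for t = 3 in Fig. 1, and the tail area is obtained with Simpson's rule rather than a statistics library. A full Welch test would instead estimate the degrees of freedom from the data.

```python
import math


def welch_t(m1, s1, n1, m2, s2, n2):
    """t statistic of Eq. 1 from two means, standard deviations, and sample sizes."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def one_sided_p(t, df, steps=20000):
    """Upper-tail area P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3  # the density is symmetric about 0


# Same t statistic, different n: the tail area shrinks as n grows (compare Fig. 1).
for n in (2, 3, 4, 10, 30):
    print("n =", n, "one-sided P for t = 3:", round(one_sided_p(3.0, n - 1), 3))

# Worked example from the text: means 10 and 5, SDs 0.5 and 0.2, n = 2 versus n = 3.
for n in (2, 3):
    t = welch_t(10, 0.5, n, 5, 0.2, n)
    print("n =", n, "t =", round(t, 2), "one-sided P =", one_sided_p(t, n - 1))
```

Because the t distribution is symmetric, the corresponding two-sided P value is twice the one-sided value returned here.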

Is n = 4 better than n = 3? Absolutely!
You get more statistical power (Fig. 1).
However, is it worth the expense and time of
more samples? That depends on many factors, particularly on the standard deviation
among the samples, which depends on the
effect size, the noise of the underlying biology, and the specific assay being used. Very
large differences will be significant even
with low sample numbers. Statistical significance can emerge between different sample
populations when many samples are examined (for example, in the case of automatic
microscopy measurements), even when the
difference between the means is exceedingly small and potentially not biologically
relevant. Evaluating statistical significance
is clearly important, but ultimately we care
more about the significance of the finding as
it relates to biology.
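
A rough numerical illustration of this last point follows. It is a sketch under stated assumptions: the means, standard deviations, and sample sizes are invented, and because both sample sizes are large, the standard normal distribution is used as a stand-in for the t distribution.

```python
import math


def welch_t(m1, s1, n1, m2, s2, n2):
    """t statistic from two means, standard deviations, and sample sizes."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)


def normal_upper_tail(z):
    """P(Z > z) for a standard normal; approximates the t distribution at large n."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical per-cell intensities: means 1.02 versus 1.00 (a 2% difference),
# standard deviation 1.0 in both populations.
for n in (1_000, 100_000):
    t = welch_t(1.02, 1.0, n, 1.00, 1.0, n)
    print("n =", n, "t =", round(t, 2), "one-sided P =", normal_upper_tail(t))
```

The difference between the means is identical in both cases; only the number of measurements changes, which is why a small P value alone says little about biological relevance.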
A future of reliable and reproducible science
We suspect that confusion about standard
deviation, standard error of the mean, number of samples, and one- versus two-sided
t tests, which we have only lightly touched
upon here, is common. Many cell biologists
learned statistics by necessity when dealing
with our own primary data and may have a
somewhat limited knowledge base. Others
formally learned the basics of statistics in
a mathematics course using artificial examples like riverboat gamblers with loaded
dice rather than using realistic data from
real wet-lab biological experiments. Only
by first raising the awareness of these issues
among ourselves, and then providing better and more practical training in statistical
methods to the next generation of biologists
can we hope both to enhance scientific progress and to retain the trust of the lay community that supports our endeavors.

10.1126/scisignal.aab1125

Citation: K. Naegle, N. R. Gough, M. B. Yaffe, Criteria for biological reproducibility: What does n
mean? Sci. Signal. 8, fs7 (2015).


Science Signaling (ISSN 1937-9145) is published weekly, except the last week in December, by the
American Association for the Advancement of Science, 1200 New York Avenue, NW, Washington,
DC 20005. Copyright 2015 by the American Association for the Advancement of Science; all rights
reserved.
