
Statistical Thinking and Smart Experimental Design

Course Notes

Vlaams Instituut voor Biotechnologie

Luc Wouters
April 2016

Contents

1  Introduction
2  Smart Research Design by Statistical Thinking
   2.1  The architecture of experimental research
        2.1.1  The controlled experiment
        2.1.2  Scientific research as a phased process
        2.1.3  Scientific research as an iterative, dynamic process
   2.2  Research styles - The smart researcher
   2.3  Principles of statistical thinking
3  Planning the Experiment
   3.1  The planning process
   3.2  Types of experiments
   3.3  The pilot study
4  Principles of Statistical Design
   4.1  Some terminology
   4.2  The structure of the response variable
   4.3  Defining the experimental unit
   4.4  Variation is omnipresent
   4.5  Balancing internal and external validity
   4.6  Bias and variability
   4.7  Requirements for a good experiment
   4.8  Strategies for minimizing bias and maximizing signal-to-noise ratio
        4.8.1  Strategies for minimizing bias - good experimental practice
               4.8.1.1  The use of controls
               4.8.1.2  Blinding
               4.8.1.3  The presence of a technical protocol
               4.8.1.4  Calibration
               4.8.1.5  Randomization
               4.8.1.6  Random sampling
               4.8.1.7  Standardization
        4.8.2  Strategies for controlling variability - good experimental design
               4.8.2.1  Replication
               4.8.2.2  Subsampling
               4.8.2.3  Blocking
               4.8.2.4  Covariates
   4.9  Simplicity of design
   4.10 The calculation of uncertainty
5  Common Designs in Biological Experimentation
   5.1  Error-control designs
        5.1.1  The completely randomized design
        5.1.2  The randomized complete block design
               5.1.2.1  The paired design
               5.1.2.2  Efficiency of the complete block design
        5.1.3  Incomplete block designs
        5.1.4  The Latin square design
   5.2  Treatment designs
        5.2.1  One-way layout
        5.2.2  Factorial designs
   5.3  More complex designs
        5.3.1  The split-plot and strip-plot designs
        5.3.2  The repeated measures design
6  The Required Number of Replicates - Sample Size
   6.1  Determining sample size is a risk-cost assessment
   6.2  The context of biomedical experiments
   6.3  The hypothesis testing context - the population model
   6.4  Sample size calculations
        6.4.1  Power analysis computations
        6.4.2  Mead's resource requirement equation
   6.5  How many subsamples
   6.6  Multiplicity and sample size
   6.7  The problem with underpowered studies
   6.8  Sequential plans
7  The Statistical Analysis
   7.1  The statistical triangle
   7.2  The statistical model revisited
   7.3  Significance tests
   7.4  Verifying the statistical assumptions
   7.5  The meaning of the p-value and statistical significance
   7.6  Multiplicity
8  The Study Protocol
9  Interpretation and Reporting
   9.1  The Methods section
        9.1.1  Experimental design
        9.1.2  Statistical methods
   9.2  The Results section
        9.2.1  Summarizing the data
        9.2.2  Graphical displays
        9.2.3  Interpreting and reporting significance tests
10 Concluding Remarks and Summary
   10.1 Role of the statistician
   10.2 Recommended reading
   10.3 Summary
References
Appendices
Appendix A  Glossary of Statistical Terms
Appendix B  Tools for randomization in MS Excel and R
   B.1  Completely randomized design
        B.1.1  MS Excel
        B.1.2  R-Language
   B.2  Randomized complete block design
        B.2.1  MS Excel
        B.2.2  R-Language

1. Introduction

"More often than not, we are unable to reproduce findings published by researchers in journals."
Glenn Begley, Vice President Research, Amgen (2015)

"The way we do our research [with our animals] is stone-age."
Ulrich Dirnagl, Charité University Medicine Berlin (2013)
Over the past decade, the biosciences have been plagued by problems with the replicability and reproducibility of research findings². This lack of reliability can be attributed in large part to statistical fallacies, misconceptions, and other methodological issues (Begley and Ioannidis, 2015; Loscalzo, 2012; Peng, 2015; Prinz et al., 2011; Reinhart, 2015; van der Worp et al., 2010). The following examples illustrate some of these problems and show that there is a definite need to transform and improve the research process.

Example 1.1. In 2006, a group of researchers from Duke University led by Anil Potti published a paper claiming that they had built an algorithm using genomic microarray data that allowed them to predict which cancer patients would respond to chemotherapy (Potti et al., 2006). This would spare patients the side effects of ineffective treatments. Of course this paper drew a lot of attention and many independent investigators tried to reproduce the results. Keith Baggerly and Kevin Coombes, two statisticians at MD Anderson Cancer Center, were also asked to have a look at the data. What they found was a mess of poorly conducted data analysis (Baggerly and Coombes, 2009). Some of the data was mislabeled, some samples were duplicated in the data, sometimes samples were marked as both sensitive and resistant, etc. Baggerly and Coombes concluded that they were unable to reproduce the analysis carried out by Potti et al. (2006), but the damage was done. Several clinical trials had started based on the erroneous results. In 2011, after several corrections, the original study by Potti et al. was retracted from Nature Medicine because "we have been unable to reproduce certain crucial experiments" (Potti et al., 2011).

Example 1.2. In 2009, a group of researchers from the Harvard Medical School published a study showing that cancer tumors could be destroyed by targeting the STK33 protein (Scholl et al., 2009). Scientists at Amgen Inc. pounced on the idea and assigned a group of 24 researchers to try to repeat the experiment with the objective of developing a new medicine. After six months of intensive lab work, it turned out that the project was a waste of time and money, since it was impossible for the Amgen scientists to replicate the results (Babij et al., 2011; Naik, 2011). Unfortunately, this was not the only problem of replicability the Amgen researchers encountered. Over the course of a decade, Begley and Ellis (2012) identified a set of 53 landmark publications in preclinical cancer research, i.e. papers in top journals from reputable labs. A team of 100 scientists tried to replicate the results. To their surprise, in 47 of the 53 studies (i.e. 89%) the findings could not be replicated. This outcome was particularly disturbing since Begley and Ellis made every effort to work in close collaboration with the authors of the original papers and even tried to replicate the experiments in the laboratory of the original investigator. In some cases, 50 attempts were made to reproduce the original data, without obtaining the claimed result (Begley, 2012). What is even more troubling is that Amgen's findings were consistent with those of others. In a similar setting, Bayer researchers found that only 25% of the original findings in target discovery could be validated (Prinz et al., 2011).

Example 1.3. Séralini et al. (2012) published a 2-year feeding study in rats investigating the health effects of genetically modified (GM) maize NK603 with and without glyphosate-containing herbicides. The authors of the study concluded that GM maize NK603 and low levels of glyphosate herbicide formulations, at concentrations well below officially set safe limits, induce severe adverse health effects, such as tumors, in rats. Apart from the publication, Séralini also presented his findings in a press conference, which was widely covered in the media, showing shocking photos of rats with enormous tumors. Consequently, this study had a severe impact on the general public and also on the interests of industry. The paper was used in the debate over a referendum on labeling of GM food in California, and it led to bans on importation of certain GMOs in Russia and Kenya. However, shortly after its publication many scientists, among them also researchers from the VIB (Vlaams Instituut voor Biotechnologie, 2012), heavily criticized the study and expressed their concerns about the validity of the findings. A polemic debate started with opponents of GMOs and also within the scientific community, which inspired media to refer to the controversy as "The Séralini affair" or "Séralini tumor-gate". Subsequently, the European Food Safety Authority (2012) thoroughly scrutinized the study and found that it was of inadequate design, analysis and reporting. Specifically, the number of animals was considered too small and not sufficient for reaching a solid conclusion. Eventually, the journal retracted Séralini's paper, claiming that it did not reach the journal's threshold of publication (Hayes, 2014)¹.

Example 1.4. Selwyn (1996) describes a study where an investigator examined the effect of a test compound on hepatocyte diameters. The experimenter decided to study eight rats per treatment group, three different lobes of each rat's liver, five fields per lobe, and approximately 1,000 to 2,000 cells per field. At that time, most of the work, i.e. measuring the cell diameters, was done manually, making the total amount of work, i.e. 15,000-30,000 measurements per rat, substantial. The experimenter complained about the overwhelming amount of work in this study and the tight deadlines that were set. A sample size evaluation conducted after the study was completed indicated that sampling as few as 100 cells per lobe would have been possible without appreciable loss of information.

Figure 1.1. Categories of errors that contribute to the problem of replicability in life science research: unreliable biological reagents and reference materials (36.1%), improper study design (27.6%), inadequate data analysis and reporting (25.5%), and laboratory protocol errors (10.8%) (source: Freedman et al., 2015).

Doing good science and producing high-quality data should be the concern of every serious research scientist. Unfortunately, as shown by the first three examples, this is not always the case. As mentioned above, there is a genuine concern about the reproducibility of research findings and it has been argued that most research findings are false (Ioannidis, 2005). In a recent paper, Begley and Ioannidis (2015) estimated that 85% of biomedical research is wasted at large. Freedman et al. (2015) tried to identify the root causes of the replicability problem and to estimate its economic impact. They estimated that in the United States alone approximately US$28B/year is spent on research that cannot be replicated. The main problems causing this lack of replicability are summarized in Figure 1.1. Issues in study design and data analysis accounted for more than 50% of the studies that could not be replicated. This was also concluded by Kilkenny et al. (2009), who surveyed 271 papers reporting laboratory animal experiments. They found that most of the studies had flaws in their design and almost one third of the papers that used statistical methods did not describe them or did not present their results adequately.

Not only scientists, but also the journals have a great responsibility in guarding the quality of their publications. Peer reviewers and editors, who often have little or no statistical training, let methodological errors pass undetected. Moreover, high-impact journals tend to focus on statistically significant results or unexpected findings, often without looking at the practical importance. Especially in studies with insufficient sample size, this publication bias causes high numbers of irreproducible and even false results (Ioannidis, 2005; Reinhart, 2015).

In addition to the problem of replicability of research findings, there has also been a dramatic rise in the number of journal retractions over the last decades (Cokol et al., 2008). In a review of all 2,047 biomedical and life-science research articles indexed by PubMed as retracted on May 3, 2012, Fang et al. (2012) found that 21.3% of the retractions were due to error, while 67.4% of the retractions were attributable to misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%), and plagiarism (9.8%).

Studies such as those by Potti et al. (2006), Scholl et al. (2009), and Séralini et al. (2012), as well as the lack of replicability in general and the increased number of retractions, have also caught the attention of mainstream media (Begley, 2012; Hotz, 2007; Lehrer, 2010; Naik, 2011; Zimmer, 2012) and have put the integrity of science into question with the general public.

To summarize, a substantial part of the issues of replicability can be attributed to a lack of quality in the design and execution of the studies. When little or no thought is given to methodological issues, in particular to the statistical aspects of the study design, the studies are often seriously flawed and are not capable of meeting their intended purpose. In some cases, such as the Séralini study, the experiments were designed too small to enable an answer to the research question. Conversely, as in Example 1.4, there are also studies that use too much experimental material, so that valuable resources are wasted.

To improve on these issues of credibility and efficiency, we need effective interventions and a change in the way scientists look at the research process (Ioannidis, 2014; Reinhart, 2015). This can be accomplished by introducing statistical thinking and statistical reasoning as powerful informed skills, based on the fundamentals of statistics, that enhance the quality of the research data (Vandenbroeck et al., 2006). While the science of statistics is mostly involved with the complexities and techniques of statistical analysis, statistical thinking and reasoning are generalist skills that focus on the application of nontechnical concepts and principles. There are no clear, generally accepted definitions of statistical thinking and reasoning. In our conceptualization, we consider statistical thinking as a skill that helps to better understand how statistical methods can contribute to finding answers to specific research problems and what the implications are in terms of data collection, experimental setup, data analysis, and reporting. Statistical thinking will provide us with a generic methodology to design insightful experiments. On the other hand, we will consider statistical reasoning as being more involved with the presentation and interpretation of the statistical analysis. Of course, as is apparent from the above, there is a large overlap between the concepts of statistical thinking and reasoning.

Statistical thinking permeates the entire research process and, when adequately implemented, can lead to a highly successful and productive research enterprise. This was demonstrated by an eminent scientist, the late Dr. Paul Janssen. As pointed out by Lewi (2005), the success of Dr. Paul Janssen could be attributed to a large extent to having a set of statistical precepts accepted by his collaborators. These formed the statistical foundation upon which his research was built and ensured that research proceeded in an orderly and planned fashion, while at the same time keeping an open mind for unexpected opportunities. His approach was such a success that, when he retired in 1991, his laboratory had produced 77 original medicines over a period of less than 40 years. This still represents a world record. In addition, at its peak, the Janssen laboratory produced more than 200 scientific publications per year (Lewi and Smith, 2007).

¹ Séralini managed to republish the study in Environmental Sciences Europe (Séralini et al., 2014), a journal with a considerably lower impact factor.
² Formally, we consider replicability as the replication of scientific findings using independent investigators, methods, data, equipment, and protocols. Replicability has long been, and will continue to be, the standard by which scientific claims are evaluated. On the other hand, reproducibility means that starting from the data gathered by the scientist, we can reproduce the same results, p-values, confidence intervals, tables and figures as those reported by the scientist (Peng, 2009).

2. Smart Research Design by Statistical Thinking

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write!"
Samuel S. Wilks (1951)

2.1 The architecture of experimental research

2.1.1 The controlled experiment

There are two basic approaches to implementing a scientific research project. One approach is to conduct an observational study², in which we investigate the effect of naturally occurring variation and the assignment of treatments is outside the control of the investigator. Although there are often good and valid reasons for conducting an observational study, their main drawback is that the presence of concomitant confounding variables can never be excluded, thus weakening the conclusions.

An alternative to an observational study is an experimental or manipulative study, in which the investigator manipulates the experimental system and measures the effect of his manipulations on the experimental material. Since the manipulation of the experimental system is under the control of the experimenter, one also speaks of controlled experiments. A well-designed experimental study eliminates the bias caused by confounding variables. The great power of a controlled experiment, provided it is well conceived, lies in the fact that it allows us to demonstrate causal relationships. We will focus on controlled experiments and how statistical thinking and reasoning can be of use to optimize their design and interpretation.

² also called a correlational study

2.1.2 Scientific research as a phased process

From a systems analysis point of view, the scientific research process can be divided into five distinct stages:

1. definition of the research question
2. design of the experiment
3. conduct of the experiment and data collection
4. data analysis
5. reporting

Each of these phases results in a specific deliverable (Figure 2.1). The definition of the research question will usually result in a research or grant proposal, stating the hypothesis related to the research (research hypothesis) and the implications or predictions that follow from it. The design of the experiment needed for testing the research hypothesis is formalized in a written protocol. After the experiment has been carried out, the data will be collected, providing the experimental data set. Statistical analysis of this data set will yield conclusions that answer the research question by accepting or rejecting the formalized hypothesis. Finally, a well carried out research project will result in a report, thesis, or journal article.

Figure 2.1. Research is a phased process with each of the phases having a specific deliverable (Definition: Research Proposal; Design: Protocol; Data Collection: Data set; Analysis: Conclusions; Reporting: Report).

2.1.3 Scientific research as an iterative, dynamic process

Scientific research is not a simple static activity but, as depicted in Figure 2.2, an iterative and highly dynamic process. A research project is carried out within some organizational or management context, which can be rather authoritative; this context can be academic, governmental, or corporate (business). In this context, the management objectives of the research project are put forward. The aim of the research project itself is to fill an existing information gap. Therefore, the research question is defined, the experiment is designed and carried out, and the data are analyzed. The results of this analysis allow informed decisions to be made and provide a way of feedback to adjust the definition of the research question. On the other hand, the experimental results will trigger research management to reconsider their objectives and eventually request more information.

Figure 2.2. Scientific research as an iterative process.

2.2 Research styles - The smart researcher

The five phases that make up the research process modulate between the concrete and the abstract world (Figure 2.3). Definition and reporting are conceptual and complex tasks requiring a great deal of abstract reasoning. Conversely, experimental work and data collection are very concrete, measurable tasks dealing with the practical details and complications of the specific research domain.

Figure 2.3. Modulating between the concrete and abstract world.

Scientists exhibit different styles in their research depending on the relative fraction of the available resources that they are willing to spend at each phase of the research process. This allows us to recognize different archetypes of researchers (Figure 2.4):

- the novelist, who needs to spend a lot of time distilling a report from an ill-conceived experiment;
- the data salvager, who believes that no matter how you collect the data or set up the experiment, there is always a statistical fix-up at analysis time;
- the lab freak, who strongly believes that if enough data are collected something interesting will always emerge;
- the smart researcher, who is aware of the architecture of the experiment as a sequence of steps and allocates a major part of his time budget to the first two steps: definition and design.

Figure 2.4. Archetypes of researchers based on the relative fraction of the available resources that they are willing to spend at each phase of the research process. D(1): definition phase, D(2): design phase, C: data collection, A: analysis, R: reporting.

The smart researcher is convinced that time spent planning and designing an experiment at the outset will save time and money in the long run. He opposes the lab freak by trying to reduce the number of measurements to be taken, thus effectively reducing the time spent in the lab. In contrast to the data salvager, the smart researcher recognizes that the design of the experiment will govern how the data will be analyzed, thereby reducing the time spent at the data analysis stage to a minimum. By carefully preparing and formalizing the definition and design phase, the smart researcher can look ahead to the reporting phase with peace of mind, in contrast to the novelist.

2.3 Principles of statistical thinking

The smart researcher recognizes the value of statistical thinking for his application area and is himself skilled in statistical thinking, or he collaborates with a professional who masters this skill. As noted before, statistical thinking is related to, but distinct from, statistical science (Table 2.1). While statistics is a specialized technical skill based on mathematical statistics as a science in its own right, statistical thinking is a generalist skill based on informed practice and focused on the application of nontechnical concepts and principles.

Table 2.1. Statistical thinking versus statistics

  Statistics               | Statistical Thinking
  -------------------------|--------------------------------
  Specialist skill         | Generalist skill
  Science                  | Informed practice
  Technology               | Principles, patterns
  Closure, seclusion       | Ambiguous, dialogue
  Introvert                | Extravert
  Discrete interventions   | Permeates the research process
  Builds on good thinking  | Valued skill itself

The statistical thinker attempts to understand how statistical methods can contribute to finding answers to specific research problems in terms of data collection, experimental setup, data analysis and reporting. He or she is able to postulate which statistical expertise is required to enhance the research project's success. In this capacity the statistical thinker acts as a diagnoser.

In contrast to statistics, which operates in a closed and secluded mathematical context, statistical thinking is a practice that is fully integrated with the researcher's scientific field, not merely an autonomous science. Hence, the statistical thinker operates in a more ambiguous setting, where he is deeply involved in applied research, with a good working knowledge of the substantive science. In this role, the statistical thinker acts as an intermediary between scientists and statisticians and goes into dialogue with them. He attempts to integrate the several potentially competing priorities that make up the success of a research project (resource economy, statistical power, and scientific relevance) into a coherent and statistically underpinned research strategy.

While the impact of the statistician on the research process is limited to discrete interventions, the statistical thinker truly permeates the research process. His combined skills lead to increased efficiency, which is important to increase the speed with which research data, analyses, and conclusions become available. Moreover, these skills enhance the quality of the research and reduce the associated cost. Statistical thinking then helps the scientist to build a case and negotiate it on fair and objective grounds with those in the organization seeking to contribute to more business-oriented measures of performance. In that sense, the successful statistical thinker is a persuasive communicator. This comparison clearly shows that the power of statistics in research is actually founded upon good statistical thinking.

Smart research design is based on the seven basic principles of statistical thinking:

1. Time spent thinking on the conceptualization and design of an experiment is time wisely spent.
2. The design of an experiment reflects the contributions from different sources of variability.
3. The design of an experiment balances between its internal validity (proper control of noise) and external validity (the experiment's generalizability).
4. Good experimental practice provides the clue to bias minimization.
5. Good experimental design is the clue to the control of variability.
6. Experimental design integrates various disciplines.
7. A priori consideration of statistical power is an indispensable pillar of an effective experiment.

3. Planning the Experiment

"Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge."
R. A. Fisher (1935)

3.1 The planning process

The first step in planning an experiment (Figure 3.1) is the specification of its objectives. The researcher should realize what the actual goal of his experiment is and how it integrates into the whole set of related studies on the subject. How does it relate to management or other objectives? How will the results from this particular study contribute to knowledge about the subject? Sometimes a preliminary exploratory experiment is useful to generate clear questions that will be answered in the actual experiment. The study objectives should be well defined and written out as explicitly as possible. It is wise to limit the objectives of a study to a maximum of, say, three (Selwyn, 1996). Any more than that risks designing an overly complex experiment and could compromise the integrity of the study. Trying to accomplish each of many objectives in a single study stretches its resources too thin and, as a result, often none of the study objectives is satisfied. Objectives should also be reasonable and attainable, and one should be realistic in what can be accomplished in a single study.

Figure 3.1. The planning process.

Example 3.1. The study by Séralini et al. (2012) is a typical example of a study where the research team tried to accomplish too many objectives. In this study 10 treatments were examined in both female and male rats. Since the research team apparently had a very limited amount of resources available, the investigators used only 10 animals per treatment per sex. This was far below the 50 animals per treatment group that are standard in long-term carcinogenicity studies (Gart et al., 1986; Haseman, 1984).

After having formulated the research objectives, the scientist will then try to translate them into scientific hypotheses that might answer the question. Often it is impossible to study the research objective directly, and some surrogate experimental model is used instead. For example, Séralini was not interested in whether GMOs were toxic in rats. The real objective was to establish the toxicity in humans. As a surrogate for man, the Sprague-Dawley strain of rat was chosen as experimental model. By doing so, an auxiliary hypothesis (Hempel, 1966) was put forward, namely that the experimental model was adequate to the research objectives. Séralini's choice of the Sprague-Dawley rat strain received much criticism (European Food Safety Authority, 2012), since this strain is prone to the development of tumors. Auxiliary hypotheses also play a role when it is difficult or even impossible to measure the variable of interest directly. In this case, an indirect measure as a surrogate for the target variable might be available, and the investigator relies on the premise that the indirect measure is a valid surrogate for the actual target variable.

Based on both the scientific and auxiliary hypotheses, the researcher will then predict the test implications of what to expect if these hypotheses are true. Each of these predictions should be the strongest possible test of the scientific hypotheses. The deduction of these test implications also involves additional auxiliary hypotheses. As stated by Hempel (1966), reliance on auxiliary hypotheses is the rule, rather than the exception, in testing scientific hypotheses. Therefore, it is important that the researcher is aware of the auxiliary assumptions he makes when predicting the test implications. Generating sensible predictions is one of the key factors of good experimental design. Good predictions will follow logically from the hypotheses that we wish to test, and not from other rival hypotheses. Good predictions will also lead to insightful experiments that allow the predictions to be tested.

The next step in the planning process is to decide which data are required to confirm or refute the predicted test implications. Throughout the sequence of question, hypothesis, prediction, it is essential to assess each step critically with enough skepticism and even ask a colleague to play the devil's advocate. During the design and planning stage of the study one should already have the person refereeing the manuscript in mind. It is much better that problems are identified at this early stage of the research process than after the experiment has started. At the end of the experiment the scientist should be able to determine whether the objectives have been met, i.e. whether the research questions were answered to satisfaction.

3.2 Types of experiments

We first distinguish between exploratory, pilot and confirmatory experiments. Exploratory experiments are used to explore a new research area. They provide a powerful method for discovery (Hempel, 1966), i.e. they are performed to generate new hypotheses that can then be formally tested in confirmatory experiments. Replication, sample size and formal hypothesis testing are less appropriate with this type of experiment. Currently, the vast majority of published research in the biomedical sciences originates from this sort of experiment (Kimmelman et al., 2014). The exploratory nature of these studies is also reflected in the way the data are analysed. Exploratory data analysis, as opposed to confirmatory data analysis, is a flexible approach, based mainly on graphical displays, towards formulating new theories (Tukey, 1980). Exploratory studies aim primarily at developing these new research hypotheses, but they do not answer the research question unambiguously, since using the same data that generated the research hypothesis also for its confirmation involves circular reasoning. Exploratory studies tend to consist of a package of small and flexible experiments using different methodologies (Kimmelman et al., 2014). The study by Séralini et al. (2012) was in fact an exploratory experiment, and much of the controversy around this study would not have arisen if it had been presented as such.

Pilot experiments are designed to make sure the research question is sensible; they allow the experimental procedures to be refined, and help to determine how variables should be measured, whether the experimental setup is feasible, etc. Pilot experiments are especially useful when the actual experiment is large, time-consuming or expensive (Selwyn, 1996). Information obtained in the pilot experiment is of particular importance when writing the technical and study protocol of such studies. Pilot experiments are discussed in more detail in Section 3.3.

Confirmatory experiments are used to assess the test implications of a scientific hypothesis. In biomedical research this assessment is based on statistical methodology. In contrast to exploratory studies, confirmatory experiments make use of rigid pre-specified designs and a priori stated hypotheses. Exploratory and confirmatory studies complement one another in the sense that the former generates the hypotheses that can be put to crucial testing in the latter. Confirmatory experiments are the main topic of this tutorial.

A further distinction between different types of experiments is based on the type of objective of the study in question. A comparative experiment is one in which two or more techniques, treatments, or levels of an explanatory variable are to be compared with one another. There are many examples of comparative experiments in biomedical areas. For example, in nutrition studies different diets can be compared to one another in laboratory animals. In clinical studies, the efficacy of an experimental drug is assessed in a trial by comparing it to treatment with placebo. We will focus primarily on designing comparative experiments for confirmation of research hypotheses.

A second type of experiment is the optimization experiment, which has the objective of finding conditions that give rise to a maximum or minimum response. Optimization experiments are often used in product development, such as finding the optimum combination of concentration, temperature, and pressure that gives rise to the maximum yield in a chemical production plant. Dose-finding trials in clinical development are another example of optimization experiments.

The third type of experiment is the prediction experiment, in which the objective is to provide some statistical/mathematical model to predict new responses. Examples are dose-response experiments in pharmacology and immuno-assay experiments.

The final experimental type is the variation experiment. This type of experiment has as its objective to study the size and structure of bias and random variation. Variation experiments are implemented as uniformity trials, i.e. studies without different treatment conditions; an example is the assessment of sources of variation in microtiter plate experiments. These sources of variation can be plate effects, row effects, column effects, and the combination of row and column effects (Burrows et al., 1984). A variation experiment can also tell us about the importance of cage location in animal experiments, where animals are kept in racks of 24 cages. Animals in cages close to the ventilation could respond differently from the rest (Young, 1989). Other examples of variation experiments come from the area of plant research, where the position of plants in the greenhouse (Brien et al., 2013) or the growth chamber (Potvin et al., 1990) could be the subject of investigation.

3.3 The pilot study

As researchers are often under considerable time pressure, there is the temptation to start with the actual experiment as soon as possible. However, a critical step in a new research project that is often missed is to spend a bit of time and resources at the beginning of the study collecting some pilot data. Preliminary experiments on a limited scale, or pilot experiments, are especially useful when we deal with time-consuming, important or expensive studies and are of great value for assessing the feasibility of the actual experiment. During the pilot stage the researcher is allowed to make variations in experimental conditions such as the measurement method, experimental set-up, etc. The pilot study can be of help to make sure that a sensible research question was asked. For instance, if our research question was about whether there is a difference in concentration of a certain protein between diseased and non-diseased tissue, it is of importance that this protein is present in a measurable amount. Carrying out a pilot experiment in this case can save considerable time, resources and eventual embarrassment. One could also wonder whether the effect of an intervention is large enough to warrant further study. A pilot study can then give a preliminary idea about the size of this effect and could be of help in making such a strategic decision.

A second crucial role for the pilot study is for the researcher to practice, validate and standardize the experimental techniques that will be used in the full study. When appropriate, trial runs of different types of assays allow fine-tuning them so that they will give optimal results. Finally, the pilot study provides basic data to debug and fine-tune the experimental design. Provided the experimental techniques work well, carrying out a small-scale version of the actual experiment will yield some preliminary experimental data. These pilot data can be very valuable and make it possible to calculate or adjust the required sample size of the experiment and to set up the data analysis environment.

The pilot study still belongs to the exploratory phase of the research project and is not part of the actual, final experiment. In order to preserve the quality of the data and the validity of the statistical analysis, the pilot data cannot be included in the final dataset.

4. Principles of Statistical Design

"It is easy to conduct an experiment in such a way that no useful inferences can be made."
William Cochran and Gertrude Cox (1957)

4.1 Some terminology

We refer to a factor as the condition or set of conditions that we manipulate in the experiment, e.g. the concentration of a drug. The factor level is the particular value of a factor, e.g. 10^-6 M, 10^-5 M. A treatment consists of a specific combination of factor levels. In single-factor studies a treatment corresponds to a factor level. The experimental unit is defined as the smallest physical entity to which a treatment is independently applied. The characteristic that is measured and on which the effect of the different treatments is investigated and analysed is referred to as the response or dependent variable. The observational unit is the unit on which the response is measured or observed. Often the observational unit is identical to the experimental unit, but this is not necessarily always the case. The definition of additional statistical terms can be found in Appendix A.
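To make the factor/treatment distinction concrete, the following R sketch builds the treatments of a hypothetical two-factor experiment; the second factor (vehicle) and its levels are invented here purely for illustration.

    # Treatments as specific combinations of factor levels (hypothetical example)
    treatments <- expand.grid(
      concentration = c("1e-6 M", "1e-5 M"),   # levels of the first factor
      vehicle       = c("saline", "DMSO")      # invented second factor
    )
    treatments  # 4 rows: each row is one treatment

In a single-factor study the same table would have one column, and each treatment would simply be one factor level.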

4.2 The structure of the response variable

We assume that the response obtained for a particular experimental unit can be described by a simple additive model (Figure 4.1) consisting of the effect of the specific treatment, the effect of the experimental design, and an error component that describes the deviation of this particular experimental unit from the mean value of its treatment group. There are some strong assumptions associated with this simple model:

- the treatment terms add rather than, for example, multiply;
- treatment effects are constant;
- the response in one unit is unaffected by the treatment applied to the other units.

These assumptions are particularly important in the statistical analysis. A statistical analysis is only valid when all of these assumptions are met.

Figure 4.1. The response variable as the result of an additive model.
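Written out in symbols, the additive model for the response of experimental unit j under treatment i reads (the notation below is ours, chosen to match the verbal description above):

    \[
      y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}
    \]

where \mu is the overall mean, \tau_i the effect of treatment i, \beta_j the contribution of the experimental design (for example, a block effect), and \varepsilon_{ij} the random deviation of the unit from its treatment-group mean. The three assumptions then state that the terms enter additively, that \tau_i is the same constant for every unit receiving treatment i, and that the \varepsilon_{ij} of different units are independent.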

4.3 Defining the experimental unit

The experimental unit corresponds to the smallest division of the experimental material to which a treatment can (randomly) be assigned, such that any two units can receive different treatments. It is important that the experimental units respond independently of one another, in the sense that a treatment applied to one unit cannot affect the response obtained in another unit and that the occurrence of a high or low result in one unit has no effect on the result of another unit. Correct identification of the experimental unit is of paramount importance for a valid design and analysis of the study.

In many experiments the choice of the experimental unit is obvious. However, in studies where replication is at multiple levels, or when replicates cannot be considered independent, it often happens that investigators have difficulties recognizing the proper basic unit in their experimental material. In these cases, the term pseudo-replication is often used (Fry, 2014). Pseudo-replication can result in a false estimate of the precision of the experimental results, leading to invalid conclusions (Lazic, 2010).

The following example represents a situation commonly encountered in biomedical research when multiple levels are present.

Example 4.1. Temme et al. (2001) compared two genetic strains of mice, wild-type and connexin 32 (Cx32)-deficient. They measured the diameters of bile canaliculi in the livers of three wild-type and of three Cx32-deficient animals, making several observations on each liver. Their results are shown in Figure 4.2. It should be clear that Temme et al. (2001) mistakenly took cells, which were the observational units, for experimental units and used them also as units of analysis. If we consider the genotype as the treatment, then it is clear that not the cell but the animal is the experimental unit. Moreover, cells from the same animal will be more alike than cells from different animals. This interdependency of the cells invalidates the statistical analysis as it was carried out by the investigators. Therefore, the correct experimental unit and unit of analysis is the animal, not the cell. Hence, there were only three experimental units per treatment, certainly not 280 and 162 units¹. The correct method of analysis calculates for each animal the average cell diameter and takes this value as the response variable.

Figure 4.2. Morphometric analysis of the diameter of bile canaliculi in wild-type and Cx32-deficient liver. Means ± SEM from three livers. *: P < 0.005 (after Temme et al. (2001)).

¹ If we recalculate the standard errors of the mean (SEM) using the appropriate number of experimental units, then they are a factor 7-10 larger than the reported ones.

Mistakes as in the above example are abundant whenever microscopy is concerned and the individual cell is used as experimental unit. One could wonder whether these are mistakes made out of ignorance or out of convenience. The concern is even greater when such studies get published in peer-reviewed high-impact scientific journals.
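As an illustration of the correct approach, the following R sketch first averages the cell-level measurements within each animal and only then compares the genotypes; the data are simulated (50 cells per animal and invented means, not the values of Temme et al. (2001)).

    # Per-animal averaging: the animal, not the cell, is the unit of analysis.
    set.seed(1)
    cells <- data.frame(
      genotype = rep(c("wild-type", "Cx32-deficient"), each = 150),
      animal   = rep(paste0("animal", 1:6), each = 50),
      diameter = rnorm(300, mean = rep(c(1.0, 1.3), each = 150), sd = 0.2)
    )
    # collapse the ~50 cell diameters of each animal into one response value
    per_animal <- aggregate(diameter ~ genotype + animal, data = cells, FUN = mean)
    # compare genotypes with n = 3 experimental units per group, not n = 150 cells
    t.test(diameter ~ genotype, data = per_animal)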

Independence of units can be an issue in plant research and in studies using laboratory animals, as is illustrated by the following example.

Example 4.2. (LeBlanc, 2004) In order to assess the effect of fertilization on corn plant growth, two plots are established: one fertilized and one without fertilizer (control). In each plot 100 corn plants are grown. To determine the effect of fertilization, the average mass of the corn plants is compared between the two plots. The choice of plants as experimental units is incorrect, because the observations of growth made on adjacent plants within a plot are not independent of one another. If one plant is genetically superior and grows particularly large, it will tend to shade its inferior neighbors and cause them to grow more slowly. An appropriate design would be to put the plants into separate pots and randomly treat each pot with fertilizer or not. In this case there is no competition for resources between plants. Another alternative would make use of several fertilized and unfertilized plots of ground. The outcome would then be the crop yield in each plot.

Laboratory animals such as mice and rats are often housed together in cages, and the treatment can then conveniently be supplied via the tap water, as in the following example:

Example 4.3. Rivenson et al. (1988) studied the toxicity of N-nitrosamines in rats and described their experimental set-up as follows: "The rats were housed in groups of 3 in solid-bottomed polycarbonate cages with hardwood bedding under standard conditions; diet and tap water with or without N-nitrosamines were given ad libitum." Since the treatment was supplied in the drinking water, it is impossible to provide different treatments to any two individual rats. Furthermore, the responses obtained within the different animals within a cage can be considered to be dependent upon one another, in the sense that the occurrence of extreme values in one unit can affect the result of another unit. Therefore, the experimental unit here is not the single rat, but the cage.

Even when the animals are individually treated, e.g. by injection, group-housing can cause animals in the same cage to interact, which would invalidate the assumption of independence of units. For instance, in studies with rats, a socially dominant animal may prevent others from eating at certain times. Mice housed in a group usually lie together, thereby reducing their total surface area. A reduced heat loss per animal in the group is the result. Due to this behavioral thermoregulation their metabolic rate is altered (Ritskes-Hoitinga and Strubbe, 2007).

Nevertheless, single housing of gregarious animal species is considered detrimental to their welfare, and regulations in Europe concerning animal welfare insist on group housing of such species (Council of Europe, 2006). However, when animals are housed together, the cage rather than the individual animal should be considered as the experimental unit (Fry, 2014; Gart et al., 1986). Statistical analysis should take this into account by using appropriate techniques. Fortunately, as is pointed out by Fry (2014), when the cage is taken as the experimental unit, the total number of animals needed is not just a simple multiple of the number of animals in the cage and the required number of experimental units (cages). An experiment requiring 10 animals per treatment group when housed individually is almost equivalent to an experiment with 12 animals distributed over 4 cages per treatment. For some outcomes variability is even expected to be reduced, because animals are more content when they are group-housed, which would enhance the latter experiment's efficiency (Fry, 2014).

An identical problem with the independence of the basic units is found in the study by Séralini et al. (2012). In their study, rats were housed in groups of two per cage and the treatment was present in the food delivered to the cages.

The choice of the experimental unit is also of particular concern in plant research when the treatment condition has been applied to whole pots, trays or growth chambers rather than to individual plants. By now, it should be clear that in these cases it is the pot, tray or growth chamber that constitutes the experimental unit (Fernandez, 2007). Especially in the case of growth chambers, when the treatment condition is an environmental factor, it is difficult to obtain a sufficient number of genuine replications. Therefore, experimenters are tempted to consider the individual plants, pots, trays or Petri dishes as experimental units, which is wrong. Because the treatment condition (temperature, light intensity/duration, etc.) was applied to the entire chamber, it is indeed the growth chamber itself that constitutes the experimental unit. For a sufficient number of replications when the number of growth chambers is limited, the experiment should be replicated using the same growth chamber repeatedly over time (Fernandez, 2007).

Two-generation reproductive studies, which involve exposure in utero, are standard procedures in teratology. Here too, the entire litter rather than the individual pup constitutes the experimental unit (Gart et al., 1986). This also applies to other experiments in reproductive biology.

Example 4.4. (Fry, 2014) A drug was tested for its capacity to reduce the effect of a mutation causing a common condition. To accomplish this, homozygous mutant female rats were randomly assigned to drug-treated and control groups. Then they were mated with homozygous mutant males, producing homozygous mutant offspring. Litters were weaned, pups were grouped five to a cage, and the effects on the offspring were observed. Here, although observations were made on the individual offspring, the experimental units are the mutant dams that were randomly assigned to treatment. Therefore, the observations on the offspring should be averaged to give a single figure for each dam, and these data, one for each dam, are to be used for comparing the treatments.

A single individual can also comprise several experimental units, as illustrated by the following example.

Example 4.5. (Fry, 2014) The efficacy of two agents at promoting regrowth of epithelium across a wound was evaluated by making 12 small wounds in a standardized way in a grid pattern on the back of a pig. The wounds were far enough apart for the effects on each to be independent. One of four treatments would then be applied at random to the wound in each square of the grid. In this case the experimental unit would be the wound and, as there are 12 wounds, for each treatment there would be three replicates.
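The random allocation in Example 4.5 is easy to generate in software; Appendix B describes such tools for MS Excel and R. The following R sketch is a minimal version for the 12-wound layout (the treatment labels A-D are placeholders):

    # Completely randomized allocation: 4 treatments, 3 replicates each,
    # assigned at random to the 12 wounds of the grid.
    set.seed(42)  # fix the seed so the allocation can be documented in the protocol
    treatments <- rep(c("A", "B", "C", "D"), times = 3)
    allocation <- data.frame(wound = 1:12, treatment = sample(treatments))
    allocation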

4.4 Variation is omnipresent

Variation is everywhere in the natural world and is often substantial in the life sciences. Despite accurate execution of the experiment, the measurements obtained in identically treated objects will yield different results. For example, cells grown in test tubes will vary in their growth rates, and in animal research no two animals will behave exactly the same. In general, the more complex the system that we study, the more factors will interact with each other and the greater will be the variation between the experimental units. Experiments in whole animals will undoubtedly show more variation than in vitro studies on isolated organs. When the variation cannot be controlled or its source cannot be measured, we will refer to it as noise, random variation or error. This uncontrollable variation masks the effects under investigation and is the reason why replication of experimental units and statistical methods are required to extract the necessary information. This is in contrast to other scientific areas such as physics, chemistry and engineering, where the studied effects are much larger than the natural variation.

4.5 Balancing internal and external validity

Internal validity refers to the fact that in a well-conceived experiment the effect of a given treatment is unequivocally attributed to that treatment. However, the effect of the treatment is masked by the presence of the uncontrolled variation of the experimental material. An experiment with a high level of internal validity should have a great chance to detect the effect of the treatment. If we consider the treatment effect as a signal and the inherent variation of our experimental material as noise, then a good experimental design will maximize the signal-to-noise ratio. Increasing the signal can be accomplished by choosing experimental material that is more sensitive to the treatment. Identification of factors that increase the sensitivity of the experimental material could be carried out in preliminary experiments. Reducing the noise is another way to increase the signal-to-noise ratio. This can be accomplished by repeating the experiment in a number of animals, but this is not a very efficient way of reducing the noise. An alternative for noise reduction is the use of very uniform experimental material. The use of cells harvested from a single animal is an example of noise reduction by uniformity of experimental material.

External validity is related to the extent to which our conclusions can be generalized to the target population. The choice of the target population, how a sample is selected from this population, and the experimental procedures used in the study are all determinants of its external validity. Clearly, the experimental material should mimic the target population as closely as possible. In animal experiments, specifying the species and strain of the animal, the age and weight range, and other characteristics determines the target population and makes the study as realistic and informative as possible. External validity can be very low if we work in a highly controlled environment using very uniform material. Thus there is a trade-off between internal and external validity (Figure 4.3); as one goes up, the other comes down. Fortunately, as we will see, there are statistical strategies for designing a study such that the noise is reduced while the external validity is maintained.

Figure 4.3. The basic dilemma: balancing between internal and external validity.

4.6 Bias and variability

Bias and variability (Figure 4.4) are two important concepts when dealing with the design of experiments. A good experiment will minimize or, at best, try to eliminate bias, and will control for variability. By bias, we mean a systematic deviation of the observed measurements from the true value. One of the most important sources of bias in a study is the way experimental units are allocated to treatment groups.

Figure 4.4. Bias and variability illustrated by a marksman shooting at a bull's eye.

Example 4.6. Consider an animal study in which the effect of an experimental treatment is investigated with respect to a control treatment. Suppose the experimenter allocates all males to the control treatment and all females to the experimental treatment. Further assume that at the end of the experiment the investigator finds a strong difference between the two treatment groups. It is clear that this treatment effect is a biased result, since the difference between the two groups is completely confounded with the difference between both sexes.

Example 4.7. Confounding bias can enter a study through less obvious routes. For example, microarrays allow biologists to measure the expression of thousands of genes or proteins simultaneously. Minor differences in a number of non-biological variables, such as reagents from different lots, different technicians or even changing atmospheric ozone levels, can impact the data. Microarrays are usually processed in batches, and in large studies different institutes can collaborate using different equipment. Consider a naive experimental setup in which all control samples are measured on Monday and all diseased samples on Tuesday, thus masking the effect of disease with that of the two batches. Worse is that these batch effects do not affect the entire microarray in the same manner. Correlation patterns differ by batch, and can even reverse sign across batches (Leek et al., 2010).
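A simple design-level remedy for the Monday/Tuesday setup in Example 4.7 is to process samples from both groups in every batch. The R sketch below illustrates the idea with invented sample IDs and two processing days:

    # Batch-balanced alternative: each processing day receives both groups,
    # so the disease effect is no longer confounded with the batch effect.
    set.seed(7)
    samples <- data.frame(
      id    = 1:20,
      group = rep(c("control", "diseased"), each = 10)
    )
    # within each (contiguous) group, send half the samples to each day at random
    assign_days <- function(n) sample(rep(c("Mon", "Tue"), length.out = n))
    samples$day <- c(assign_days(10), assign_days(10))
    table(samples$group, samples$day)  # 5 samples of each group on each day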

concepts when dealing with the design of exper-

Studies using laboratory animals can be subject

iments. A good experiment will minimize or, at

to bias when all animals assigned to a specific treat-

best, try to eliminate bias and will control for vari-

ment are kept in the same cage. In Section 4.3, we

ability. By bias, we mean a systematic deviation in

already mentioned the problem of independence

observed measurements from the true value. One


of the most important sources of bias in a study is

when animals were housed in groups. In addi-

the way experimental units are allocated to treat-

same treatment, the effects due to the conditions

ment groups.

in the cage are superimposed on the effects of the

tion to these, when all animals in a cage receive the

treatments. When the experiment is restricted to


Example 4.6. Consider an animal study in which a single cage per treatment, the comparisons bethe effect of an experimental treatment is investi- tween the treatments can be biased(Fry, 2014). The
gated with respect to a control treatment. Suppose same reasoning applies to the location of the cages
the experimenter allocates all males to the control in the rack (Gart et al., 1986).
treatment and all females to the experimental treatment. Further assume that at the end of the ex-

By variability, we mean a random fluctuation

periment the investigator finds a strong difference

about a central value. The terms bias and vari-

between the two treatment groups. It is clear that ability are also related to the concepts of accuracy
this treatment effect is a biased result since the dif- and precision of a measurement process. Absence
ference between the two groups is completely con- of bias means that our measurement is accurate,
founded with the difference between both sexes.

while little variability means that the measurement


is precise. Good experiments are as free as possible from bias and variability. Of the two, bias is the most important. Failure to minimize the bias of an experiment leads to erroneous conclusions and thereby jeopardizes the internal validity. Conversely, if the outcome of the experiment shows too much variability, this can sometimes be remedied by refinement of the experimental methods, increasing the sample size, or other techniques. In this case the study may still reach the correct conclusions.

4.7 Requirements for a good experiment

Cox (1958) enunciated the following requirements for a good experiment:

1. treatment comparisons should as far as possible be free of systematic error (bias);
2. the comparisons should also be made sufficiently precise (signal-to-noise);
3. the conclusions should have a wide range of validity (external validity);
4. the experimental arrangement should be as simple as possible;
5. uncertainty in the conclusions should be assessable.

These five requirements determine the basic elements of the design. We have already discussed the first three requirements in the preceding sections. The following section provides some basic strategies to fulfill these requirements.

4.8 Strategies for minimizing bias and maximizing signal-to-noise ratio

To safeguard the internal validity of his study, the scientist needs to optimize the signal-to-noise ratio (Figure 4.5). This constitutes the basic principle of statistical design of experiments. The signal can be maximized by the proper choice of the measuring device and experimental domain. The noise is minimized by reducing bias and variability. Strategies for minimizing the bias are based on good experimental practice, such as: the use of controls, blinding, the presence of a protocol, calibration, randomization, random sampling, and standardization. Variability can be minimized by elements of experimental design: replication, blocking, covariate measurement, and sub-sampling. In addition, random sampling can be added to enhance the external validity. We will now consider each of these strategies in more detail.

Figure 4.5. Overview of strategies for minimizing the bias and maximizing the signal-to-noise ratio

4.8.1 Strategies for minimizing bias - good experimental practice

4.8.1.1 The use of controls

In biomedical studies, a control or reference standard is a standard treatment condition against which all others may be compared. The control can either be a negative control or a positive control. The term active control is also used for the latter. In some studies, both negative and positive controls are present. In this case, the purpose of the positive control is mostly to provide an internal validation of the experiment¹.

When negative controls are used, subjects can sometimes act as their own control (self-control). In this case the subject is first evaluated under standard conditions (i.e. untreated). Subsequently, the treatment is applied and the subject is re-evaluated. This design, also called a pre-post design, has the property that all comparisons are made within the same subject. In general, variability within a subject is smaller than between subjects.

¹ Active controls play a special role in so-called equivalence or non-inferiority studies, where the purpose is to show that a given therapy is equivalent or non-inferior to an existing standard.

Therefore, this is a more efficient design than comparing control and treatment in two separate groups. However, the use of self-control has the shortcoming that the effect of treatment is completely confounded with the effect of time, thus introducing a potential source of bias. Furthermore, blinding, which is another method to minimize bias, is impossible in this type of design.

Another type of negative control is where one treated group does not receive any treatment at all, i.e. the experimental units remain untouched. Just as in the previous case of self-control, untreated controls cannot be blinded. Furthermore, applying the treatment (e.g. a drug) often requires extra manipulation of the subjects (e.g. injection). The effect of the treatment is then intertwined with that of the manipulation and consequently it is potentially biased.

Vehicle control (laboratory experiments) and placebo control (clinical trials) are terms that refer to a control group that receives a matching treatment condition without the active ingredient. Another term for this type of control, in the context of experimental surgery, is sham control. In the sham control group subjects or animals undergo a faked operative intervention that omits the step thought to be therapeutically necessary. This type of vehicle control, placebo control or sham control is the most desirable and truly minimizes bias. In clinical research the placebo-controlled trial has become the gold standard. However, in the same context of clinical research, ethical considerations may sometimes preclude its application.

4.8.1.2 Blinding

Researchers' expectations may influence the study outcome at many stages. For instance, the experimental material may unintentionally be handled differently based on the treatment group, or observations may be biased to confirm prior beliefs. Blinding is a very useful strategy for minimizing this subconscious experimenter bias.

In a recent survey of studies in evolutionary biology and of the life sciences at large, Holman et al. (2015) found that in unblinded studies the mean reported effect size was inflated by 27% and the number of statistically significant findings was substantially larger as compared to blinded studies. The importance of blinding in combination with randomization in animal studies was also highlighted by Hirst et al. (2014). Despite its importance, blinding of experimenters is often neglected in biomedical research. For example, in a systematic review of studies on animals in non-clinical research, van Luijk et al. (2014) found that only 24% reported blinded assessment of the outcome, while only 15% considered blinding of the caretaker/investigator.

Two types of blinding must be distinguished. In single blinding the investigators are uninformed regarding the treatment condition of the experimental subjects. Single blinding neutralizes investigator bias. The term double blinding in laboratory experiments means that both the experimenter and the observer are uninformed about the treatment condition of the experimental units. In clinical trials double blinding means that both investigators and subjects are unaware of the treatment condition.

Two methods of blinding have found their way to the laboratory: group blinding and individual blinding. Group blinding involves identical codes, say A, B, C, etc., for entire treatment groups. This approach is less appropriate, since when results accumulate and the investigator is able to break the code, blindness is completely destroyed. A much better approach is to assign a code (e.g. a sequence number) to each experimental unit individually and to maintain a list that indicates which code corresponds to which particular treatment. The sequence of the treatments in the list should be randomized. In practice, this individual blinding procedure often involves an independent person who maintains the list and prepares the treatment conditions (e.g. drugs).

Especially when the outcome of the experiment is subjectively evaluated, blinding must be consid-

ered. However, there is one situation where blinding does not seem to be appropriate, namely in toxicologic histopathology. Here, the bias that would be reduced by blinding is actually a bias favoring diagnosis of a toxicological hazard and a conservative safety evaluation, which is appropriate in this context. Blinded evaluation would result in a reduction in the sensitivity to detect anomalies. In this context, Holland and Holland (2011) suggested that for toxicological work both an unblinded and a blinded evaluation of histologic material should be carried out.

4.8.1.3 The presence of a technical protocol

The presence of a written technical protocol describing in full detail the specific definitions of measurement and scoring methods is imperative to minimize potential bias. The technical protocol specifies practical actions and gives guidelines for lab technicians on how to manipulate the experimental units (animals, etc.), the materials involved in the experiment, the required logistics, etc. It also gives details on data collection and processing. Last but not least, the technical protocol lays down the personal responsibilities of the technical staff. The importance and contents of the other protocol, the study protocol, will be discussed further in Chapter 8.

4.8.1.4 Calibration

Calibration is an operation that compares the output of a measurement device to standards of known value, leading to correction of the values indicated by the measurement device. Calibration neutralizes instrument bias, i.e. the bias in the investigator's measurement system.

4.8.1.5 Randomization

Together with blinding, randomization is an important tool for bias elimination in animal studies. In an overview of systematic reviews on animal studies, Hirst et al. (2014) found that failure to randomize is likely to result in overestimation of treatment effects across a range of disease areas and outcome measures. Randomization has been successfully applied in many areas of research, such as microarray studies (Göhlmann and Talloen, 2009; Yang et al., 2008) and studies using 96-well microplates (Faessel et al., 1999; Levasseur et al., 1995).

Formal randomization, in our context, is the process of allocating experimental units to treatment groups or conditions according to a well-defined stochastic law¹. Randomization is a critical element in proper study design. It is an objective and scientifically accepted method for the allocation of experimental units to treatment groups. Formal randomization ensures that the effect of uncontrolled sources of variability has equal probability in all treatment groups. In the long run, randomization balances treatment groups on unimportant or unobservable variables, of which we are often unaware. Any differences that exist in these variables after randomized treatment allocation are then to be attributed to the play of chance. In other words, randomization is an operation that effectively turns lethal bias into more manageable random error (Vandenbroeck et al., 2006). The random allocation of experimental units to treatment conditions is also a necessary condition for a rigorous statistical analysis, in the sense that it provides an unbiased estimate of the standard error of the treatment effects, makes experimental units independent of one another, and justifies the use of significance tests (Cox, 1958; Fisher, 1935; Lehmann, 1975). In addition, randomization is also of use as a device for blinding the experiment.

¹ By the term stochastic is meant that it involves some elements of chance, such as picking numbers out of a hat, or preferably, using a computer program to assign experimental units to treatment groups.

Example 4.8. In neurological research, animals are randomly allocated to treatments. At the end of the experimental procedures, the animals are sacrificed, slides are made from certain target areas of the brain, and these slides are investigated microscopically. At each of these stages errors can arise, leading to biased results if the original randomization order is not maintained.

Errors may arise at various stages in the experiment. Therefore, to eliminate all possible bias, it

is essential that the randomization procedure covers all important sources of variation connected with the experimental units. In addition, as far as practical, experimental units receiving the same treatment should be dealt with separately and independently at all stages at which important errors may arise. If this is not the case, additional randomization procedures should be introduced (Cox, 1958). To summarize, randomization should apply to each stage of the experiment (Fry, 2014):

• allocation of independent experimental units to treatment groups
• order of exposure to test alteration within an environment
• order of measurement

Therefore, in animal experiments, when the cage is the experimental unit, the arrangement of cages within the rack or room, the administration of substances, the taking of samples, etc. should also be randomized, even though this adds an extra burden to the laboratory staff. Of course, this can be accomplished by maintaining the original randomization sequence throughout the experiment.

Formal randomization requires the use of a randomization device. This can be the tossing of a coin, the use of randomization tables (Cox, 1958), or the use of computer software (Kilkenny et al., 2009). Methods of randomization using MS Excel and the R system (R Core Team, 2013) are contained in Appendix B.
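As a minimal sketch of such computer randomization in R (our illustration, not the Appendix B procedure itself), consider allocating twelve animals to two treatments:

> # random allocation of 12 animals to treatments A and B (6 each)
> set.seed(1735)                      # arbitrary seed, kept in the study records
> trt <- rep(c("A", "B"), each = 6)
> data.frame(animal = 1:12, treatment = sample(trt))

sample() permutes the treatment labels, so every animal has the same probability of receiving A or B, in contrast to the haphazard allocation discussed below.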
Some investigators are convinced that not randomization, but a systematic arrangement is the preferred way to eliminate the influence of uncontrolled variables. For example, when two treatments A and B have to be compared, one possibility is to set up pairs of experimental units and always assign treatment A to the first member of the pair and B to the remaining unit. However, if there is a systematic effect such that the first member of each pair consistently yields a higher or lower result than the second member, the estimated treatment effect will be biased. Other arrangements are more commonly used in the laboratory, e.g. the alternating sequence AB, BA, AB, BA, .... Here too, it cannot be excluded that a certain pattern in the uncontrolled variability coincides with this arrangement. For instance, if 8 experimental units are tested in one day, the first unit on a given day will always receive treatment A. Furthermore, when a systematic arrangement has been applied, the statistical analysis is based on the false assumption of randomness and can be totally misleading.

Researchers are sometimes tempted to improve on the random allocation of animals by re-arranging individuals so that the mean weights are almost identical. However, by reducing the variability between the treatment groups as is done in Figure 4.6, the variability within the groups is altered and can now differ between groups. This reduces the precision of the experiment and invalidates the subsequent statistical analysis. Later, we will see that the randomized block design, instead of a systematic arrangement, is the correct way of handling these last two cases.

Figure 4.6. Trying to improve the random allocation by reducing the intergroup variability increases the intragroup variability

Formal randomization must be distinguished from haphazard allocation to treatment groups (Kilkenny et al., 2009). For example, an investigator wishes to compare the effect of two treatments (A, B) on the body weight of rats. All twelve animals are delivered in a single cage to the laboratory. The laboratory technician takes six animals out of the cage and assigns them to treatment A, while the remaining animals will receive treatment B. Although many scientists would consider this a random assignment, it is not. Indeed, one could imagine the following scenario: heavy animals react slower and are easier to catch than the smaller animals. Consequently, the first six animals will on average weigh more than the remaining six.

Example 4.9. An important issue in the design of an experiment is the moment of randomization. In

an experiment, brain cells were taken from animals and placed in Petri dishes, such that one Petri dish corresponded to one particular animal. The Petri dishes were then randomly divided into two groups and placed in an incubator. After 72 hrs of incubation, one group of Petri dishes was treated with the experimental drug, while the other group received solvent. Although the investigators made a serious effort to introduce randomization in their experiment, they overlooked the fact that the placement of the Petri dishes in the incubator introduced a systematic error. Instead of randomly dividing the Petri dishes in two groups at the start of the experiment, they should have made the random treatment allocation after the incubation period.

As pointed out before, it is important that the randomization covers all substantial sources of variation connected with the experimental units. As a rule, randomization should be performed immediately before treatment application. Furthermore, after the randomization process has been carried out, the randomized sequence of the experimental units must be maintained, otherwise a new randomization procedure is required.

4.8.1.6 Random sampling

Using a random sample in our study increases its external validity and allows us to make a broad inference, i.e. a population model of inference (Lehmann, 1975). However, in practice it is often difficult and/or impractical to conduct studies with true random sampling. Clinical trials are usually conducted using eligible patients from a small number of study sites. Animal experiments are based on the available animals. This certainly limits the external validity of these studies and is one of the reasons that the results are not always replicable.

In some cases, maximizing the external validity of the study is of great importance. This is especially the case in studies that attempt to make a broad inference towards the target population (population model), like gene expression experiments that try to relate a specific pathology to the differential expression of certain gene probes (Nadon and Shoemaker, 2002). For such an experiment, the bias in the results is minimized only if it is based on a random sample from the target population.

4.8.1.7 Standardization

Standardization of the experimental conditions is an effective way to minimize the bias. In addition, it can also be used to reduce the intrinsic variability in the results. Examples of standardization of the experimental conditions are the use of genetically or phenotypically uniform animals or plants, environmental and nutritional control, acclimatization, and standardization of the measurement system. As discussed before, too much standardization of the experimental conditions can jeopardize the external validity of the results.

4.8.2 Strategies for controlling variability - good experimental design

4.8.2.1 Replication

Ronald Fisher¹ noted in his pioneering book The Design of Experiments that replication at the level of the experimental unit serves two purposes. The first is to increase the precision of estimation and the second is to supply an estimate of error by which the significance of the comparisons is to be judged.

¹ Sir Ronald Aylmer Fisher (London, 1890 - Adelaide, 1962) is considered a genius who almost single-handedly created the foundations of modern statistical science and experimental design.

The precision of an experiment depends on the standard deviation² (σ) of the experimental material and inversely on the number of experimental units (n). In a comparative experiment with two treatment groups (X̄₁) and (X̄₂) and an equal number of experimental units per treatment group, this precision is quantified by the standard error of the difference between the two averages (X̄₁ − X̄₂) as:

\[ \sigma_{\bar{X}_1 - \bar{X}_2} = \sigma\,\sqrt{2/n} \]

where σ is the common standard deviation and n is the number of experimental units in each treatment group.

² The standard deviation refers to the variation of the individual experimental units, whereas the standard error refers to the random variation of an estimate (mostly the mean) from a whole experiment. The standard deviation is a basic property of the underlying distribution and, unlike the standard error, is not altered by replication.

The standard deviation σ is composed of the intrinsic variability of the experimental material and the precision of the experimental work. Reduction of the standard deviation is possible only to a limited extent by refining experimental procedures. However, one can effectively enhance the experiment's precision by increasing the number of experimental units. Unfortunately, due to the inverse square root dependency of the standard error on the sample size, this is not an efficient way to control the precision. Indeed, the standard error is halved by a fourfold increase in the number of experimental units, but a hundredfold increase in the number of units is required to divide the standard error by ten. In other words, replication at the level of the experimental unit is an effective but expensive strategy to control variability. As we will see later, choosing an appropriate experimental design that takes into account the different sources of variability that can be identified is a more efficient way to increase the precision.
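The inverse square root law is easy to verify numerically; a small sketch in R (σ is an arbitrary assumed value):

> # standard error of a two-group comparison for increasing n
> sigma <- 10                 # assumed common standard deviation
> n <- c(5, 20, 500)          # 4-fold and 100-fold increases of n = 5
> round(sigma * sqrt(2/n), 2)
[1] 6.32 3.16 0.63

Quadrupling n halves the standard error (6.32 to 3.16), while reducing it to a tenth of its initial value (0.63) indeed requires a hundred times as many units.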
4.8.2.2 Subsampling

As mentioned above, reduction of the standard deviation is only possible to a very limited extent. This can be accomplished by standardization of the experimental conditions, but this method is also limited and it jeopardizes the external validity of the experiment. However, in some experiments it is possible to manipulate the physical size of the experimental units. In this case one could choose units of a larger size, such that the estimates are more precise. In still other experiments, there are multiple levels of sampling. The process of taking samples below the primary level of the experimental unit is known as subsampling (Cox, 1958; Selwyn, 1996) or pseudoreplication (Fry, 2014; Lazic, 2010; LeBlanc, 2004). The experiment reported by Temme et al. (2001), where the diameter of many liver cells was measured in 3 animals per experimental condition, is an example of subsampling with animals at the primary level and cells at the subsample level. Multiple observations or measurements made over time are also pseudoreplications or subsamples. In biological and chemical analysis, it is common practice to carry out duplicate or triplicate independent determinations on samples from the same experimental unit. In this case samples and determinations within samples constitute two distinct levels of subsampling.

When subsampling is present, the standard deviation used in the comparison of the treatment means is composed of the variability between the experimental units (between-unit variability) and the variability within the experimental units (within-unit variability). It can be shown that in the presence of subsampling the overall standard deviation of the experiment is equal to:

\[ \sigma = \sqrt{\sigma_n^2 + \frac{\sigma_m^2}{m}} \]

where n and m are the number of experimental units and subsamples, and σ_n and σ_m the between-sample and within-sample standard deviations.

In Section 4.8.2.1, we defined the precision of the comparative 2-treatments experiment. This now becomes:

\[ \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{2}{n}\left(\sigma_n^2 + \frac{\sigma_m^2}{m}\right)} \]

Thus, by increasing the number of experimental units n we reduce the total variability, while subsample replication m only affects the within-unit variability. A large number of subsamples only makes sense when the variability of the measurement at the sublevel σ_m is substantial as compared to the between-unit variability σ_n. How to determine the required number of subsamples will be discussed in Section 6.5. As a conclusion, we can say that subsample replication is not identical to and not as effective as replication at the level of the true experimental unit.
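A numerical sketch in R illustrates this point (the two standard deviations are assumed values):

> # standard error of the treatment comparison with subsampling
> se.diff <- function(n, m, s.n, s.m) sqrt((2/n) * (s.n^2 + s.m^2/m))
> round(se.diff(n = 5, m = 1, s.n = 4, s.m = 6), 2)    # 1 subsample per unit
[1] 4.56
> round(se.diff(n = 5, m = 10, s.n = 4, s.m = 6), 2)   # 10 subsamples per unit
[1] 2.8
> round(se.diff(n = 10, m = 1, s.n = 4, s.m = 6), 2)   # doubling the units instead
[1] 3.22

With n = 5, even infinitely many subsamples cannot push the standard error below σ_n·√(2/5) ≈ 2.53, whereas additional experimental units reduce both variance components.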

4.8.2.3 Blocking

Figure 4.7. The effect of blocking illustrated by a study of the effect of diet on running speed of dogs. Not taking age of the dog into account (left panel) masks most of the effect of the diet. In the right panel dogs are grouped (blocked) according to age and comparisons are made within each age group. The latter design is much more efficient.

Example 4.10. Consider a (hypothetical) experiment in which the effect of two diets on the running speed of dogs is studied. We can carry out the experiment by taking 6 dogs of varying age and randomly allocating 3 dogs to diet A and the 3 remaining to diet B. However, as shown in the left panel of Figure 4.7, the effect of diet will be greatly masked by the variability between dogs. A more intelligent way to set up the experiment is to group the dogs by age and make all comparisons within the same age group, thus removing the effect of different ages. This is illustrated in the right panel of Figure 4.7. With the variability due to age removed, the effect of the diets within the age groups is much more clear.

If one or more factors other than the treatment conditions can be identified as potentially influencing the outcome of the experiment, it may be appropriate to group the experimental units on these factors. Such groupings are referred to as blocks or strata. Units within a block are then randomly assigned to the treatments. Examples of blocking factors are plates (in microtiter plate experiments), greenhouses, animals, litters, date of experiment, or categorizations of continuous baseline characteristics such as body weight, baseline measurement of the response, etc. What we effectively do by blocking is divide the variation between the individuals into variation between blocks and variation within blocks. If the blocking factor has an important effect on the response, then the between-block variation is much greater than the within-block variation. We will take this into account in the analysis of the data (analysis of variance or ANOVA with blocks as an additional factor). Comparisons of treatments are then carried out within blocks, where the variation is much smaller. Blocking is an effective and efficient way to enhance the precision of the experiment. In addition, blocking allows one to reduce the bias due to an imbalance in baseline characteristics that are known to affect the outcome. However, blocking does not eliminate the need for randomization. Within each block, treatments are randomly assigned to the experimental units, thus removing the effect of the remaining unknown sources of bias.

4.8.2.4 Covariates

Figure 4.8. Additive model with a linear covariate adjustment

Blocking on a baseline characteristic such as body weight is one possible strategy to eliminate the variability induced by the heterogeneity in weight of the animals or patients. Instead of grouping in blocks, or in addition to it, one can also make use of the actual value of the measurement. Such a concomitant measurement is referred to as a covariate. It is an uncontrollable but measurable attribute of the experimental units (or their environment) that is unaffected by the treatments but may have an influence on the measured response. Examples of covariates are body weight, age, ambient temperature, measurement of the response variable before treatment, etc. The covariate filters out the effect of a particular source of variability. Rather than blocking, it represents a quantifiable attribute of the system studied. The statistical model underlying the design of an experiment with covariate adjustment is conceptualized in Figure 4.8. The model implies that there is a linear relationship between the covariate and the response

and that this relationship is the same in all treatment groups. In other words, there is a series of parallel curves, one per treatment group, relating the response to the covariate. This is exemplified in Figure 4.9, showing the results of an experiment with two treatment groups in which the baseline measurement of the response variable served as covariate. There is a linear relationship between the covariate and the response, and this relationship is almost the same in both treatment groups, as is shown by the fact that the two lines are almost parallel to one another.

Figure 4.9. Results of an experiment with baseline as covariate. There is a linear relationship between the covariate and the response and this relationship is the same in both treatment groups.
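Such a covariate adjustment is fitted as an additive linear model; a minimal sketch in R (the data frame and variable names, response, baseline, and treatment, are hypothetical):

> # analysis of covariance: common slope for the covariate,
> # one intercept shift per treatment group
> fit <- lm(response ~ baseline + treatment, data = mydata)
> summary(fit)   # the treatment coefficient is the adjusted treatment effect

Because a single slope is fitted, the model corresponds exactly to the parallel lines of Figure 4.9; the parallelism assumption itself can be checked by testing the interaction term baseline:treatment.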

conceived experiment, we should always be able


to calculate the uncertainty in the estimates of the
treatment comparisons. This usually means estimating the standard error of the difference between the treatment means. To make this calculation in a rigorous manner, the set of experimental units must respond independently to a specific
treatment and may only differ in a random way
from the set of experimental units assigned to the
other treatments. This requirement again stresses
the importance of independence of experimental
units and randomness in treatment allocation.

Figure 4.9. Results of an experiment with baseline as covariate. There is a linear relationship between the covariate and
the response and this relationship is the same in both treatment
groups.

4.9

Table 4.1. Multiplication factor to correct for the bias in estimates of the standard deviation based on small samples, after
Bolch (1968).

Simplicity of design

In addition to external validity, bias and precision,


Cox (1958) also stated that the design of our exper-

Factor

2
3
4
5
6

1.253
1.128
1.085
1.064
1.051

iment should be as simple as possible. When the


When the number of experimental units is

design of the experiment is too complex, it may


be difficult to ensure adherence to a complicated

small, the sample estimate of the standard devia-

schedule of alterations, especially if these are to be

tion is biased and underestimates the true stan-

carried out by relatively unskilled people. An ef-

dard deviation. A multiplication factor to correct

ficient and simple experimental design has the ad-

for this bias for normal distributions is given in Ta-

ditional advantage that the statistical analysis will

ble 4.1. For a sample size of 3, the estimate should

also be simple without making unreasonable as-

be increased with 13% to obtain the actual stan-

sumptions.

dard deviation.

4.10

The calculation of uncertainty

Alternatively, one can also make use of the results of previous experiments to guesstimate the
new experiments standard deviation. However,

This is the last of Coxs precepts for a good experi- we then make the strong assumption that random
ment (see Section 4.7, page 18). It is the only sta- variation is the same in the new experiment.
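The factors in Table 4.1 follow from the properties of the normal distribution and are easily reproduced in R (a sketch; c4 is the usual unbiasing constant):

> # multiplication factor 1/c4(n) for the sample standard deviation
> c4 <- function(n) sqrt(2/(n - 1)) * gamma(n/2) / gamma((n - 1)/2)
> round(1/c4(2:6), 3)
[1] 1.253 1.128 1.085 1.064 1.051

For n = 3 the factor is 1.128, the 13% increase mentioned above.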

5. Common Designs in Biological Experimentation
And so it was ... borne in upon me that very often, when the most elaborate statistical refinements
possible could increase the precision by only a few percent, yet a different design involving little or no
additional experimental labour might increase the precision two-fold, or five-fold or even more
R. A. Fisher (1962)

There is a multitude of designs that can be considered when planning an experiment, and some have been employed more commonly than others in the area of biological research. Unfortunately, the literature about experimental designs is littered with technical jargon, which makes its understanding quite a challenge. To name a few, there are: completely randomized designs, randomized complete block designs, factorial designs, split-plot designs, Latin square designs, Greco-Latin squares, Youden square designs, lattice designs, Plackett-Burman designs, simplex designs, Box-Behnken designs, etc.

It helps to find our way through this jungle of designs by keeping in mind that the major principle of experimental design is to provide a synthetic approach to minimize bias and control variability. Furthermore, as shown in Figure 5.1, we can consider each of the specialized experimental designs as integrating three different aspects of the design (Hinkelmann and Kempthorne, 2008):

• the treatment design,
• the error-control design,
• the sampling & observation design.

Figure 5.1. The three aspects of the design determine its complexity and the required resources

The treatment design is concerned with which treatments are to be included in the experiment and is closely linked to the goals and aims of the study. Should a negative or positive control be incorporated in the experiment, or should both be present? How many doses or concentrations should be tested and at which level? Is the interaction of two treatment factors of interest or not?

The error-control design implements the strategies that we learned in Section 4.8.2 to filter out different sources of variability. The sampling & observation aspect of our experiment is about how experimental units are sampled from the population, how and how many subsamples should be drawn, etc.

The complexity and the required resources of a study are determined by these three aspects of experimental design. The required resources, namely the number of experimental units, are governed by the number of treatments, the number of blocks and the standard error. The more treatments, or the more blocks, the more experimental units are needed. The complexity of the experiment is determined by the underlying statistical model of Figure 4.1. In particular, the error-control design defines the study's complexity. The randomization process is a major part of this error-control design. A justified and rigorous estimation of the standard error is only possible in a randomized experiment. In addition, randomization has the advantage that it distributes the effects of uncontrolled variability randomly over the treatment groups. Replication of experimental units should be sufficient, such that an adequate number of degrees of freedom is available for estimating the experiment's precision (standard error). This parameter is related to the sampling & observation aspect of the design.

These three aspects of experimental design provide a framework for classifying and comparing the different types of experimental design that are used in the life sciences. As we will see, each of these designs has its advantages and disadvantages.

5.1 Error-control designs

5.1.1 The completely randomized design

This is the most common and simplest possible error-control design for comparative experiments. Each experimental unit is randomly assigned to exactly one treatment condition. This is often the default design used by investigators who do not really think about design problems. The advantage of the completely randomized design is that it is simple to implement, as experimental units are simply randomized to the various treatments. The obvious disadvantage is the lack of precision in the comparisons among the treatments, which are based on the variation between the experimental units.

In the following example of a completely randomized design, the investigators used randomization, blinding, and individual housing of animals to guarantee the absence of systematic error and independence of experimental units.

Example 5.1. To investigate the interaction of chronic treatment with drugs on the proliferation of gastric epithelial cells in rats, an experiment was set up in which two experimental drugs were compared with their respective vehicles. A total of 40 rats were randomly divided into four groups of ten animals each, using the MS Excel randomization procedure described in Appendix B. To guarantee independence of the experimental units, the animals were kept in separate cages. Cages were distributed over the racks according to their sequence number. Blinding was accomplished by letting the sequence number of each animal correspond to a given treatment. One laboratory worker was familiar with the codes and prepared the daily drug solutions. Treatment codes were concealed from the rest of the laboratory staff, which was responsible for the daily treatment administration and the final histological evaluation.
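A concealed allocation list of this kind can be sketched in R (our illustration; the group labels are hypothetical):

> # 40 rats in four groups of ten; the sequence number doubles as blind code
> set.seed(2016)
> groups <- rep(c("drug 1", "vehicle 1", "drug 2", "vehicle 2"), each = 10)
> key <- data.frame(sequence.no = 1:40, treatment = sample(groups))
> table(key$treatment)   # ten animals per treatment condition

   drug 1    drug 2 vehicle 1 vehicle 2 
       10        10        10        10 

Only the laboratory worker preparing the solutions keeps the key; everyone else works with the sequence numbers.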

Example 5.2. Another example of a completely randomized design concerns eliminating the bias present in experiments using 96-well microtiter plates. Burrows et al. (1984) already described the presence of plate location effects in ELISA assays carried out in 96-well microtiter plates. Similar findings were reported by Levasseur et al. (1995) and Faessel et al. (1999), who described the presence of parabolic patterns (Figure 5.2) in cell growth experiments carried out in 96-well microtiter plates. They were not able to show conclusively the underlying causes of this systematic error, which, as shown in Figure 5.2, could be of considerable magnitude. Therefore, they concluded that only by random allocation of the treatments to the wells could these systematic errors be avoided.

Figure 5.2. Presence of bias in 96-well microtiter plates (after Levasseur et al. (1995))

To accomplish this, they developed an ingenious method for randomizing treatments in 96-well microtiter plates. Figure 5.3 sketches this procedure. Drugs are serially diluted into tubes in a 96-tube rack. Next, a randomization map is generated using an Excel macro. The randomization map is then taped to the top of an empty tube rack. The original tubes are then transferred to the new rack by pushing the numbered tubes through the corresponding numbered rack positions. Using a multipipette system, drug-containing medium is then transferred from the tubes to the wells of a 96-well microtiter plate. At the end of the assay, the random data file generated by the plate reader is imported into the Excel spreadsheet and automatically resorted.

Figure 5.3. Scheme for the addition and dilution of drug-containing medium to tubes in a 96-tube rack, randomization of the tubes, and then addition of drug-containing medium to cell-containing wells of a 96-well microtiter plate. (after Faessel et al. (1999))

The above example is also a paradigm of how randomization can be used to eliminate systematic errors by transforming them into random noise. This is one way of dealing with bias in microtiter plates. Alternative methods consist of choosing an appropriate experimental design, such as a Latin square design (see Section 5.1.4, page 33), or constructing a statistical model that allows correction for the row-column bias (Schlain et al., 2001; Straetemans et al., 2005).
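The random allocation of treatments to wells can also be generated directly in R (our illustration, for a hypothetical assay with 8 treatments in 12 replicates):

> # random layout of 8 treatments, 12 replicates each, on a 96-well plate
> set.seed(96)
> plate <- matrix(sample(rep(LETTERS[1:8], 12)), nrow = 8, ncol = 12,
+                 dimnames = list(LETTERS[1:8], 1:12))
> table(plate)   # each treatment occurs 12 times, at random positions
plate
 A  B  C  D  E  F  G  H 
12 12 12 12 12 12 12 12 

Reading such a randomization map row by row then tells the technician which dilution goes into which well, just as with the Excel macro above.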

5.1.2 The randomized complete block design

The concept of blocking as a tool to increase efficiency by enhancing the signal-to-noise ratio has already been introduced in Section 4.8.2.3 (page 24). In a randomized complete block design, a single isolated extraneous source of variability (block) closely related to the response is eliminated from the comparisons between treatment groups. The design is complete, since all treatments are applied within each block. Consequently, treatments can be compared with one another within the blocks. The randomization procedure now randomizes treatments separately within each block. The randomized complete block design is a very useful and reliable error-control design, since all treatment comparisons can be made within a block. When a study is designed such that the number of experimental units within each block and treatment is equal, the design is called a balanced design. A few examples will illustrate its use in the laboratory.

Example 5.3. In the completely randomized design of Example 5.1 (page 28), the rats were individually housed in a rack consisting of 5 shelves of 8 cages each. On different shelves, rats are likely to be

exposed to different levels of light intensity, temperature, humidity, sounds, views, etc. The investigators were convinced that shelf height affected the results. Therefore, they decided to switch to a randomized complete block design in which the blocks corresponded to the 5 shelves of the rack. Within each block separately, the animals were randomly allocated to the treatments, such that in each block each treatment condition occurred exactly twice. This example also illustrates that, although all treatments are present in each block in a randomized complete block design, more than one experimental unit per block can be allocated to a treatment condition.
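The within-block randomization can be sketched in R (our illustration; the treatment codes are hypothetical):

> # four treatments, each twice, randomized separately per shelf (block)
> set.seed(7)
> trts <- c("D1", "V1", "D2", "V2")
> plan <- replicate(5, sample(rep(trts, 2)))   # one column per shelf
> colnames(plan) <- paste("shelf", 1:5)
> plan[, "shelf 1"]   # cage order of the treatments on the first shelf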
There are two main reasons for choosing a randomized complete block design over a completely randomized design. Suppose there is an extraneous factor that is strongly related to the outcome of the experiment. It would be most unfortunate if our randomization procedure yielded a design in which there was a great imbalance on this factor. If this were the case, the comparisons between treatment groups would be confounded with differences in this nuisance factor and be biased. The second main reason for a randomized complete block design is its possibility to considerably reduce the error variation in our experiment, thereby making the comparisons more precise. The main objection to a randomized complete block design is that it makes the strong assumption that there is no interaction between the treatment variable and the blocking characteristics, i.e. that the effect of the treatments is the same among all blocks.

The basic idea behind blocking is to partition the total set of experimental units into subsets (blocks) that are as homogeneous as possible (Hinkelmann and Kempthorne, 2008). Sometimes this is accomplished by combining several extraneous sources of variation into one aggregate blocking factor, as shown by the following example.

Example 5.4. In addition to the shelf height, the investigators also suspected that the body weight of the animals might affect the results. Therefore, the animals were numbered in order of increasing body weight. The first eight animals were placed on the top shelf and randomized to the four treatment conditions, then the next eight animals were placed on the second shelf and randomized, etc. The resulting design was now simultaneously controlled for shelf height and body weight. With regard to these two characteristics, animals within a block were as much alike as possible.

5.1.2.1 The paired design

When only two treatments are compared, the randomized complete block design can be simplified to a paired design.

Figure 5.4. Outline of a paired experiment on isolated cardiomyocytes. Cardiomyocytes of a single animal were isolated and seeded in plastic Petri dishes. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while the remaining member received the vehicle.

Example 5.5. Isolated cardiomyocytes provide an easy tool to assess the effect of drugs on calcium overload (Ver Donck et al., 1986). Figure 5.4 illustrates the experimental setting. Cardiomyocytes of a single animal were isolated and seeded in plastic Petri dishes. The Petri dishes were treated with the experimental drug or with its solvent. After

the experimental drug or with its solvent. After

5.1. ERROR-CONTROL DESIGNS

31

70

70

60

50

% Viable Myocytes

% Viable Myocytes

60

50

Control

Treated

Control

Treatment

Treated

Treatment

Figure 5.5. Gain in efficiency induced by blocking illustrated in a paired design. In the left panel, the myocyte experiment is
considered as a completely randomized design in which the two samples largely overlap one another. In the right panel the lines
connect the data of the same animal and show a marked effect of the treatment.

a stabilization period, the cells were exposed to a stimulating substance (i.e. veratridine) and the percentage of viable, i.e. rod-shaped, cardiomyocytes in a dish was counted. Although comparison of the treatment with the solvent control within a single animal provides the best precision, it lacks external validity. Therefore, a paired experiment with myocytes from different animals and with animal as blocking factor was carried out. From each animal two Petri dishes containing exactly 100 cardiomyocytes were prepared. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while the remaining member received the vehicle. After stabilization and exposure to the stimulus, the number of viable cardiomyocytes in each Petri dish was counted. The resulting data are contained in Table 5.1 and displayed in Figure 5.5.

Table 5.1. Results of an experiment using a paired design for testing the effect of a drug on the number of viable cardiomyocytes after calcium overload

Rat No.   Vehicle   Drug   Drug - Vehicle
1         44        46      2
2         64        75     11
3         60        67      7
4         50        64     14
5         76        77      1

Figure 5.5. Gain in efficiency induced by blocking illustrated in a paired design (y-axis: % viable myocytes; x-axis: control vs. treated). In the left panel, the myocyte experiment is considered as a completely randomized design in which the two samples largely overlap one another. In the right panel the lines connect the data of the same animal and show a marked effect of the treatment.

There are 10 experimental units, since Petri dishes can be independently assigned to vehicle or drug. However, the statistical analysis should take the particular structure of the experiment into account. More specifically, the pairing has imposed restrictions on the randomization, such that data obtained from one animal cannot be freely interchanged with that from another animal. This is illustrated in the right panel of Figure 5.5 by the lines that connect the data from the same animal. It is clear that for each pair the drug-treated Petri dish consistently yielded a higher result than its vehicle control counterpart. Since the different pairs (animals) are independent from one another, the mean difference and its standard error can be calculated. The mean difference is 7.0 with a standard error of 2.51.

5.1.2.2 Efficiency of the randomized complete block design

Example 5.6. Suppose that in Example 5.5 the experimenter had not used blocking, i.e. consider it as if he had used myocytes originating from 10 completely different animals. The 10 Petri dishes would then be randomly distributed over the two treatment groups and we would have been confronted with a completely randomized design. Assume also that the results of this hypothetical experiment were identical to those obtained in the actual paired experiment. As is illustrated in the left panel of Figure 5.5, the two groups largely overlap one another. Since all experimental units are now independent of one another, the effect of the drug is evaluated by calculating the difference between the two mean values and comparing it with its standard error¹. Obviously, the mean difference is the same as in the paired experiment. However, the standard error on the mean difference has risen considerably, from a value of 2.51 to 7.83, i.e. the use of blocking induced a substantial increase in the precision of the experiment².

¹ As already mentioned in Section 4.8.2.1 (page 22), the standard error of the difference between two means is equal to σ√(2/n).
² Kutner et al. (2004) provide a method to compare designs on the basis of their relative efficiency. For the design in Example 5.5, the calculations show that this paired design is 7.7 times more efficient than the completely randomized design in Example 5.6. In other words, about 8 times as many replications per treatment with a completely randomized design are required to achieve the same results.
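These numbers are easily reproduced in R from Table 5.1 (a sketch; var() and sd() give the usual unbiased estimates):

> vehicle <- c(44, 64, 60, 50, 76)
> drug <- c(46, 75, 67, 64, 77)
> d <- drug - vehicle
> round(c(mean = mean(d), se = sd(d)/sqrt(5)), 2)   # paired analysis
mean   se 
7.00 2.51 
> s2 <- (var(vehicle) + var(drug))/2   # pooled variance, ignoring the pairing
> round(sqrt(2 * s2/5), 2)             # standard error of Example 5.6
[1] 7.83

The values 2.51 and 7.83 are exactly the standard errors quoted in Examples 5.5 and 5.6.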

Examples 5.5 and 5.6 clearly demonstrate that carrying out a paired experiment has the possibility of enhancing the precision of the experiment considerably, while the conclusions have the same validity as in a completely randomized experiment. However, the forming of blocks of experimental units is only successful when the criterion upon which the pairing is based is related to the outcome of the experiment. Using a characteristic that does not have an important effect on the response variables as blocking factor is worse than useless, since the statistical analysis will lose a bit of power by taking the blocking into account. This can be of particular importance for small-sized experiments.

5.1.3 Incomplete block designs

In some circumstances, the block size is smaller than the number of treatments and consequently it is impossible to assign all treatments within each of the blocks. When a certain comparison is of specific interest, such as the comparison with control, it is wise to include it in each of the blocks.

Balanced incomplete block designs allow all pairwise comparisons of treatments with equal precision using a block size that is less than the number of treatments. To achieve this, the balanced incomplete block design has to satisfy the following conditions (Bate and Clark, 2014; Cox, 1958):

• each block contains the same number of units;
• each treatment occurs the same number of times in the design;
• every pair of treatments occurs together in the same number of blocks.

Balanced incomplete block designs (BIB) exist only for certain combinations of the number of treatments and the number and size of blocks. Software to construct incomplete block designs and to test whether a candidate design satisfies the criteria for being balanced is provided by the crossdes package in R (Sailer, 2013).
Example 5.7. (Anderson and McLean, 1974; Bate and Clark, 2014). An experiment was carried out to assess the effect that vitamin A and a protein dietary supplement have on the weight gain of lambs over a 2-month period. There were four treatments, labeled A, B, C and D in the study, corresponding to low dose and high dose of vitamin A combined with low dose and high dose of protein. Three replicates per treatment were considered sufficient and blocking was carried out using pairs of sibling lambs, so six pairs of siblings were used. With the number of treatments restricted to two per block, the balanced incomplete block design shown in Table 5.2 was used.

Table 5.2. Balanced incomplete block design for Example 5.7 with treatments A, B, C and D

Sibling lamb pair   1   2   3   4   5   6
First lamb          A   A   A   B   B   C
Second lamb         B   C   D   C   D   D

A possible layout of the experiment is obtained in R by:

> require(crossdes)
> # find an incomplete block design: 4 treatments, 6 blocks of size 2
> inc.bl <- find.BIB(4, 6, 2)
> # check if it is a valid BIB
> isGYD(inc.bl)

[1] The design is a balanced incomplete block design w.r.t. rows.

> # if OK, make a suitable display:
> # replace numbers by treatment codes
> aBIB <- LETTERS[inc.bl]
> dim(aBIB) <- dim(inc.bl)
> # transpose to display the matrix as in the text
> t(aBIB)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "B"  "A"  "A"  "C"  "B"  "A"
[2,] "D"  "C"  "B"  "D"  "C"  "D"

It is clear that the BIB design generated in the R session is equivalent to the design shown in Table 5.2.

When looking for a suitable balanced incomplete block design, it can happen that if we add one or more additional treatments, a more suitable design is found. Alternatively, omitting one or more treatments can yield a more efficient design (Cox, 1958).

5.1.4 The Latin square design

The Latin square design is an extension of the randomized complete block design, but now blocking is done simultaneously on two characteristics that affect the response variable.

Example 5.8. (Hu et al., 2015). A total of 16 large (3 m height, 6.7 m² area) hexagonal ecosystem chambers were arranged in a Latin square with four repetition chambers for each treatment. The four climate treatments were (i) control (CO, with ambient conditions), (ii) increased air temperature = air warming (AW), (iii) reduced irrigation = drought (D), and (iv) air warming and drought (AWD).

In the above example 4 × 4 Latin squares were used to control for the location of the ecosystem chambers. Another example from laboratory practice concerns the simultaneous control, without loss of precision, of the row and column effects in microtiter plates (Burrows et al., 1984). Still another example concerns experiments on neuronal protection, where a pair of animals was tested each day and the investigator expected a systematic difference not only between the pairs but also between the animal tested in the morning and the one tested in the afternoon.

In a Latin square design the k treatments are arranged in a k × k square such as in Table 5.3. Each of the four treatments A, B, C, and D occurs exactly once in each row and exactly once in each column. The Latin square design is categorized as a two-way block error-control design.

Table 5.3. Arrangement for a 4 × 4 Latin square design controlling for column and row effects. The letters A-D indicate the four treatments

         Column
Row    1    2    3    4
1      B    A    D    C
2      C    B    A    D
3      A    D    C    B
4      D    C    B    A

The main advantage of the Latin square design is that it simultaneously balances out two sources of error. The disadvantage is the strong assumption that there are no interactions between the blocking variables or between the treatment variable and the blocking variables. In addition, Latin square designs are limited by the fact that the number of treatments, the number of rows, and the number of columns must all be equal. Fortunately, there are arrangements that do not have this limitation (Cox, 1958; Hinkelmann and Kempthorne, 2008).

In a k × k Latin square, only k experimental units are assigned to each treatment group. However, it may happen that more experimental units are required to obtain an adequate precision. The Latin square can then be replicated and several squares can be used to obtain the necessary sample size. In doing this, there are two possibilities to consider. Either one stacks the squares on top of each other and keeps them as separate independent squares, or one completely randomizes the order of the rows (or columns) of the design. In general, keeping the squares separate is not a good idea and leads to less precise estimation and loss of degrees of freedom, especially in the case of a 2 × 2 Latin square. It is only when there is a reason to believe that the column (or row) effects are different between the squares that keeping them separate makes sense.


In the case of 96-well microplates consisting of 8 rows and 12 columns, scientists often omit the edge rows and columns, which are known to yield extreme results. The remaining 6 rows and 10 columns are then used for experimental purposes. However, using a Latin square design that appropriately corrects for row and column effects, the complete 96-well microplate can be used, as is illustrated by the following example.


Table 5.4. A balanced lattice square arrangement of 16 treatments with 5 replicate squares on a single microtiter plate with eight lettered rows and twelve numbered columns (after Burrows et al., 1984). The wells in rows E-H of columns 9-12 remain unused.

       1    2    3    4    5    6    7    8    9   10   11   12
A     11    7    3   15   11   10   12    9   15    5   12    2
B     12    8    4   16    6    7    5    8    9    3   14    8
C      9    5    1   13    1    4    2    3    4   10    7   13
D     10    6    2   14   16   13   15   14    6   16    1   11
E     12   14    7    1    5   11    4   14
F      3    5   16   10    1   15    8   10
G     13   11    2    8    9    7   16    2
H      6    4    9   15   13    3   12    6

Example 5.9. Aoki et al. (2014) describe their assay as follows: "Samples diluted 1:1,600 and serially-diluted standards (100 µL) were placed in wells. Samples were arranged in triads, each containing samples from a case and two matched controls. In order to minimize measurement error due to a spatial gradient in binding efficiency within plate, microplate wells were grouped into 6 blocks of 16 (4 × 4) wells and each set of 3 samples was placed in the same block, along with standards, using a version of Latin square design. Placement patterns were changed across blocks such that the influence of the spatial gradient on the signal-standard level relationship was minimized."

The authors do not explicitly state what they mean by "version of Latin square". Presumably lattice square designs were used. The interested reader is referred to the literature (Cox, 1958) for this more advanced topic, but the idea is clear that this systematic arrangement allows for unbiased comparisons. By "placement patterns were changed across blocks" they presumably mean randomization of the rows and columns of the individual Latin squares.

In microplate experiments, variations on Latin square designs such as Latin rectangles (Hinkelmann and Kempthorne, 2008) and incomplete Latin squares such as Youden squares and lattice squares (Cox, 1958) can be of use. Burrows et al. (1984) investigated the properties of a number of these designs for quantitative ELISA. One of the designs they considered (see Table 5.4) was a balanced lattice square design that allows 16 treatments to be compared with one another within one plate, with 5 replicates for each treatment. Other systematic arrangements, specifically for estimating relative potencies in the presence of microplate location effects, have also been proposed (Schlain et al., 2001).

The principle of simultaneously controlling for row and column effects by means of Latin square designs is not restricted to 96-well microtiter plates, but also applies to their larger variants, like the 384- and 1536-microwell plates.

Random Latin squares can be produced using the R-package magic (Hankin, 2005). A possible layout of the experiment in Example 5.8 is obtained by:

> require(magic)
> trts<-c("CO","AW","D","AWD")
> ii<-rlatin(4)
> matrix(trts[ii],nrow=4,ncol=4,byrow=FALSE)

     [,1]  [,2]  [,3]  [,4]
[1,] "D"   "AW"  "AWD" "CO"
[2,] "AWD" "CO"  "D"   "AW"
[3,] "AW"  "D"   "CO"  "AWD"
[4,] "CO"  "AWD" "AW"  "D"
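If several replicate squares are needed, as in the stacking discussion earlier in this section, independent randomizations can be generated with the same function. A minimal sketch continuing the code above (the object name squares is ours):

> # two independently randomized 4 x 4 squares; when stacked, the
> # square should be kept as a factor in the analysis
> squares <- replicate(2, matrix(trts[rlatin(4)], nrow=4, ncol=4),
+                      simplify=FALSE)
> squares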

5.2 Treatment designs

5.2.1 One-way layout

The examples that we discussed up to now (apart from Examples 5.7 and 5.8) all considered the treatment aspect of the design as consisting of a single factor. This factor can represent the presence or absence of a single condition, or several different related treatment conditions (e.g. Drug A, Drug B, Drug C). The treatment aspect of these designs is referred to as a single-factor or one-way layout.

5.2.2 Factorial designs

In some types of experimental work, such as in Example 5.7 (page 32), it can be of interest to assess the joint effect of two or more factors, e.g. high and low dose of Vitamin A combined with high and low dose of protein. This is a typical example of a 2 × 2 full factorial design, the simplest and most frequently used factorial treatment design.

In this design, the factors and all combinations (hence full) of the levels of the factors are studied. The factorial treatment design of this example allows estimating the main effects of the individual treatments, as well as their interaction effect, i.e. the deviation from additivity of their joint effect. We will use the 2 × 2 full factorial design to explore the concepts of factorial designs and of statistical interaction.

Example 5.10. (Bate and Clark, 2014) A study was conducted to assess whether the serum chemokines JE and KC could be used as markers of atherosclerosis development in mice (Parkin et al., 2004). Two strains of apolipoprotein-E-deficient (apoE-/-) mice, C3H apoE-/- and C57BL apoE-/-, were used in the study. These mice were fed either a normal diet or a diet containing cholesterol (the Western diet). After 12 weeks the animals were sacrificed and their atherosclerotic lesion area was determined.

The study design consisted of two categorical factors, Strain and Diet. The strain factor consisted of two levels, C3H apoE-/- and C57BL apoE-/-. The diet factor also contained two levels, the normal rodent diet and the Western diet. In total there were four combinations of factor levels:
1. C3H apoE-/- + normal diet
2. C3H apoE-/- + Western diet
3. C57BL apoE-/- + normal diet
4. C57BL apoE-/- + Western diet
Let us consider some possible outcomes of this experiment.

No interaction. When there is no interaction between strains and diets, the difference between the two diets is the same, irrespective of the mouse strain. The lines connecting the mean values of the two diets are parallel to one another, as is shown in Figure 5.6. There is an overall effect of diet: animals fed the Western diet show larger lesions, and this effect is the same in both strains. There is also an overall effect of the strain: lesions are larger in the C3H apoE-/- than in the C57BL apoE-/- strain, and this difference is the same for both diets.

Since the difference between the diets is the same regardless of which strain of mice they are fed to, it is appropriate to average the results from each diet across both strains and make a single comparison between the diets, rather than making the comparison separately for each strain. In doing this, the external validity of the conclusions is broadened, since they apply to both strains. In addition, the comparison between the two strains can be made irrespective of the diet the animals are receiving: C3H apoE-/- have larger lesions than C57BL apoE-/- regardless of the diet. In both cases the comparisons make use of all the experimental units. This makes a factorial design a highly efficient design, since all the animals are used to test two hypotheses simultaneously.

Figure 5.6. Plot of mean lesion area for the case where there is no interaction between strains and diets.

Figure 5.7. Plot of mean lesion area for the case where there is a moderate interaction between strains and diets.


Moderate interaction. When there is a moderate interaction, the direction of the effect of the first factor is the same regardless of the level of the second factor. However, the size of the effect varies with the level of the second factor. This is exemplified by Figure 5.7, where the lines are not parallel, though both indicate an increase in lesion size for the Western diet as compared to the normal diet. This increase is more pronounced in the C3H apoE-/- strain than in the C57BL apoE-/- animals. Hence, the C3H apoE-/- strain is more sensitive to changes in diet.

Strong interaction. The effect of the first factor can also be entirely dependent on the level of the second factor. This is illustrated in Figure 5.8, where feeding the Western diet to the C3H apoE-/- has little effect on lesion size, whereas the effect on C57BL apoE-/- mice is substantial. This is an example of strong interaction. The Western diet always results in bigger lesions, but the effect in the C57BL apoE-/- strain is much more pronounced than in the C3H apoE-/- mice. Furthermore, when fed the normal diet the C3H apoE-/- mice show a larger lesion area than the C57BL apoE-/- strain. However, when the animals receive the Western diet the opposite is true.

Figure 5.8. Plot of mean lesion area for the case where there is a strong interaction between strains and diets.

The factorial treatment design can be combined with the error-control designs that we encountered in Section 5.1. Example 5.7 (page 32) illustrates a 2 × 2 factorial combined with a balanced incomplete block design, whereas a 2 × 2 factorial design is used in connection with a Latin square error-control design in Example 5.8 (page 33).

Factorial designs are widely used in plant research, process optimization, etc. Recent applications in the biomedical field include cDNA microarray experiments.

Example 5.11. Glonek and Solomon (2004) describe an example of a 2 × 2 factorial design using cDNA microarrays and, in doing this, highlight the importance of the interaction effect in this context:

    A 2 × 2 factorial experiment was conducted to compare the two mutants at times zero hours and 24 hours; it was anticipated that measuring changes over time would distinguish genes involved in promoting or blocking differentiation, or that suppress or enhance growth, as genes potentially involved in leukaemia. We are interested in genes differentially expressed between the two samples, i.e. in the sample main effect, but more particularly in those genes which are differentially expressed in the two samples at time 24 hours but not at time zero hours. This is the interaction of sample and time.

In the case of cDNA experiments, the design of a 2 × 2 factorial is not as straightforward as in other application areas. More information on how to set up factorial designs involving microarray experiments can be found in the specialized literature (Glonek and Solomon, 2004; Banerjee and Mukerjee, 2007).

It happens that researchers have actually planned a factorial experiment, but in the design and analysis phase failed to recognize it as such. In that case they do not take full advantage of the factorial structure for interpreting treatment effects and often analyze and interpret the experiment with an incorrect procedure (Nieuwenhuis et al., 2011). We will come back to that in Section 9.2.3.
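In the analysis phase, the factorial structure of an experiment like Example 5.10 is naturally expressed as a linear model with main effects and an interaction term. The sketch below uses simulated placeholder data, not the original measurements, and all variable names are ours:

> set.seed(1)
> strain <- factor(rep(c("C3H","C57BL"), each=10))
> diet   <- factor(rep(rep(c("Normal","Western"), each=5), times=2))
> lesion <- rnorm(20, mean=2500, sd=300)   # placeholder response
> fit <- lm(lesion ~ strain*diet)
> anova(fit)   # rows for strain, diet, and the strain:diet interaction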


When the two factors consist of concentrations or dosages of drugs, researchers tend to confuse the statistical concept of interaction with the pharmacological concept of synergism. However, the requirements for two drugs to be synergistic with each other are much more stringent than just the superadditivity associated with statistical interaction (Greco et al., 1995; Tallarida, 2001). It is easy to demonstrate that, due to the nonlinearity of the log dose-response relationship, superadditive effects will always be present for the combination, since the total drug dosage has increased, thus implying that a drug could be synergistic with itself. In connection with 96-well plates and the plate location effects that we saw earlier, Straetemans et al. (2005) provide a statistical method for assessing synergism or antagonism.

Higher dimensional factorial designs. Although our discussion here is restricted to the 2 × 2 factorial, more factors and more levels can be used. However, the number of experimental units that are involved may become rather large. For instance, an experiment with three factors at three levels involves 3 × 3 × 3 = 27 different treatment combinations. When there are at least two experimental units in each treatment group, then 54 units will be required. Such an experiment may soon be too large to be feasible. In addition, three-factor and higher order interactions are difficult to interpret (Clewer and Scarisbrick, 2001). Therefore, when a large number of factors are involved, the presence of higher order interactions is often neglected. Such high dimensional fractional factorial designs are mostly used in process optimization (Selwyn, 1996).

5.3 More complex designs

We will now consider some specialized experimental designs consisting of a somewhat more complex error-control design that is intertwined with a factorial treatment design.

5.3.1 The split-plot and strip-plot designs

These types of design incorporate subsampling and allow comparisons among different treatments at two or more sampling levels. The split-plot design allows assessment of the effects of two independent factors using different experimental units, as is illustrated by the following examples.

Example 5.12. An example of a split-plot design is the following hypothetical experiment on diets and vitamins (see Figure 5.9). Cages each containing two mice were assigned at random to a number of dietary treatments (i.e. cage was the experimental unit for comparing diets), and the color-marked mice within the cage were randomly selected to receive one of two vitamin treatments by injection (i.e. mice were the experimental units for the vitamin effect).

Figure 5.9. Outline of the split-plot experiment of Example 5.12. Cages each containing two mice were assigned at random to a number of dietary treatments, and the color-marked mice within the cage were randomly selected to receive one of two vitamin treatments by injection.

Example 5.13. Another example concerns the effects of temperature and growth medium on yeast growth rate (Ruxton and Colegrave, 2003). In this experiment Petri dishes are placed inside constant-temperature incubators (see Figure 5.10). Within each incubator growth media are randomly assigned to the individual Petri dishes. Temperature is then considered as the main-plot factor and growth medium as the subplot factor. The experiment has to be repeated using several incubators for each temperature.
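For the analysis, the two error strata of a split-plot experiment like Example 5.12 must both appear in the model specification. A minimal sketch with simulated data (all names and numbers are ours, purely for illustration):

> set.seed(2)
> cage    <- factor(rep(1:12, each=2))               # 12 cages, 2 mice each
> diet    <- factor(rep(rep(c("D1","D2","D3"), each=4), each=2))
> vitamin <- factor(rep(c("V1","V2"), times=12))     # one of each per cage
> y <- rnorm(24) + rep(rnorm(12, sd=0.5), each=2)    # mouse + cage variation
> # diet is tested in the between-cage stratum; vitamin and the
> # diet:vitamin interaction in the within-cage stratum
> summary(aov(y ~ diet*vitamin + Error(cage)))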


Figure 5.10. Outline of the split-plot experiment of Example 5.13. Six incubators were randomly assigned to 3 temperature levels in duplicate. In each incubator 8 Petri dishes were placed. Four growth media were randomly applied to the Petri dishes.

The term split-plot originates from agricultural research, where fields are randomly assigned to different levels of a primary factor and smaller areas within the fields are randomly assigned to one level of another, secondary factor. Actually, the split-plot design can be considered as two randomized complete block designs superimposed upon one another (Hinkelmann and Kempthorne, 2008). The split-plot design combines a two-way crossed (factorial) treatment design with a split-plot error design.

A special case of the split-plot design is called the strip-plot or split-block design. The strip-plot design differs from the split-plot design by applying a special kind of randomization. In the strip-plot or split-block design, both factors are applied to whole plots, which are placed orthogonal to each other. Schematically, this is represented as in Figure 5.11. There are two factors, A and B, that are applied to the two types of whole plots. As in Figure 5.11, both plots are placed orthogonal upon one another.

Figure 5.11. Schematic representation of a strip-plot experiment. Factor A is applied to whole plots in the horizontal direction at levels a1, a2, a3, a4, and a5. Factor B is also applied to whole plots at levels b1, b2, and b3.

Although strip-plot designs were originally developed for agricultural field trials, the next example shows that these designs have found a place in the modern laboratory.

Example 5.14. Casella (2008) and Lansky (2002) describe strip-plot designs that accommodate serial dilutions in 96-well microtiter plates. In this case, the 8 rows of the microplate are randomly assigned to the samples (e.g. a reference and three test samples, in duplicate) and the columns are assigned to the serial dilution levels (dose). The dilution level is fixed for a given column, such that all samples in a column have the same serial dilution level, which makes it a strip-plot design.

5.3.2 The repeated measures design

The repeated measures design is a special case of the split-plot, and more specifically of the strip-plot, designs. In a repeated measures design, we typically take multiple measurements on a subject over time. If any treatment is applied to the subjects, they immediately become the whole plots and time is the subplot factor. A typical experimental set-up is displayed in Figure 5.12, where animals are randomized over 2 (or more) treatment groups and the variable of interest is measured just before and at several time points following treatment application. The major disadvantage of repeated measures designs is the presence of carry-over effects, by which the results obtained for a treatment are influenced by the previous treatment. In addition, any confounding of the treatment effect with time, as was the case in the use of self-controls in Section 4.8.1.1, must be avoided. Therefore, as in the example of Figure 5.12, a parallel control group which protects against time-related bias must always be included.

Figure 5.12. A typical repeated measures design. Animals are randomized to different treatment groups; the variable of interest (e.g. blood pressure) is measured at the start of the experiment and at different time points following treatment application.


Table 5.5. Bioassay experiment of Example 5.14. Row and column indicators refer to the conventional coding of 96-well plates. The four samples (A, B, C, D) are applied in duplicate to the rows, and the serial dilution level (dose) is applied to an entire column. A1/1 in a cell means sample A, first replicate, at dilution level 1; B2/3 the second replicate of sample B at dilution 3, etc.

       1      2      3      4      5      6      7      8      9      10     11     12
A    B1/2   B1/8   B1/10  B1/1   B1/11  B1/3   B1/12  B1/7   B1/4   B1/5   B1/6   B1/9
B    D2/2   D2/8   D2/10  D2/1   D2/11  D2/3   D2/12  D2/7   D2/4   D2/5   D2/6   D2/9
C    B2/2   B2/8   B2/10  B2/1   B2/11  B2/3   B2/12  B2/7   B2/4   B2/5   B2/6   B2/9
D    C1/2   C1/8   C1/10  C1/1   C1/11  C1/3   C1/12  C1/7   C1/4   C1/5   C1/6   C1/9
E    A1/2   A1/8   A1/10  A1/1   A1/11  A1/3   A1/12  A1/7   A1/4   A1/5   A1/6   A1/9
F    A2/2   A2/8   A2/10  A2/1   A2/11  A2/3   A2/12  A2/7   A2/4   A2/5   A2/6   A2/9
G    C2/2   C2/8   C2/10  C2/1   C2/11  C2/3   C2/12  C2/7   C2/4   C2/5   C2/6   C2/9
H    D1/2   D1/8   D1/10  D1/1   D1/11  D1/3   D1/12  D1/7   D1/4   D1/5   D1/6   D1/9


6. The Required Number of Replicates - Sample Size

"Data, data, data. I cannot make bricks without clay."
Sherlock Holmes
The Adventure of the Copper Beeches, A.C. Doyle.

6.1 Determining sample size is a risk - cost assessment

Replication is the basis of all experimental design, and a natural question that arises in each study is how many replicates are required. The more replicates, the more confidence we have in our conclusions. Therefore, we would prefer to carry out our experiment on a sample that is as large as possible. However, increasing the number of replicates incurs a rise in cost. Thus, the answer to how large an experiment should be is that it should be just big enough to give confidence that any biologically meaningful effect that exists can be detected.

6.2 The context of biomedical experiments

The estimation of the appropriate size of the experiment is straightforward and depends on the statistical context, the assumptions made, and the study specifications. Context and specifications in their turn depend on the study objectives and the design of the experiment. In practice, the most frequently encountered contexts in statistical inference are point estimation, interval estimation, and hypothesis testing, of which hypothesis testing is definitely the most important in biomedical studies.

6.3 The hypothesis testing context - the population model

In the hypothesis testing context² one defines a null hypothesis and, for the purpose of sample size estimation, an alternative hypothesis of interest. The null hypothesis will often be that the response variable does not really depend on the treatment condition. For example, one may state as a null hypothesis that the population means of a particular measurement are equal under two or more different treatment conditions and that any differences found can be attributed to chance.

At the end of the study, when the data are analyzed (see Section 7.3), we will either accept or reject the null hypothesis in favor of the alternative hypothesis. As is indicated in Table 6.1, there are four possible outcomes at the end of the experiment: when the null hypothesis is true and we failed to reject it, we have made the correct decision. This is also the case when the null hypothesis is false and we did reject it. However, there are two decisions that are erroneous.

² The hypothesis testing context is in statistics also known as the Neyman-Pearson system.


Table 6.1. The decision process in hypothesis testing

                                        State of Nature
Decision made                  Null hypothesis true       Alternative hypothesis true
Do not reject null hypothesis  Correct decision (1 − α)   False negative (β)
Reject null hypothesis         False positive (α)         Correct decision (1 − β)

If the null hypothesis is true and we incorrectly rejected it, then we made a false positive decision. Conversely, if the alternative hypothesis is true (i.e. the null hypothesis is false) and we failed to reject the null hypothesis, we have made a false negative decision. In statistics, a false positive decision is also referred to as a type I error and a false negative decision as a type II error.

The basis of sample size calculation is formed by specifying an allowable rate of false positives and an allowable rate of false negatives for a particular alternative hypothesis, and then estimating a sample size just large enough that these low error rates can be achieved. The allowable rate for false positives is called the level of significance or alpha level and is usually set at values of 0.01, 0.05, or 0.10. The false negative rate depends on the postulated alternative hypothesis and is usually described by its complement, i.e. the probability of rejecting the null hypothesis when the alternative hypothesis holds. This is called the power of the statistical hypothesis test. Power levels are usually expressed as percentages, and values of 80% or 90% are standard in sample size calculations.

Significance level and power are already two of the four major determinants of the sample size required for hypothesis testing. The remaining two are the inherent variability in the study parameter of interest and the size of the difference to be detected in the postulated alternative hypothesis. Other key factors that determine the sample size are the number of treatments and the number of blocks used in the experimental design.

When the significance level decreases or the power increases, the required sample size will become larger. Similarly, when the variability is larger or the difference to be detected smaller, the required sample size will also become larger. Conversely, when the difference to be detected is large or the variability low, the required sample size will be small. It is convenient, for quantitative data, to express the difference in means as an effect size by dividing it by the standard deviation¹. The effect size then takes both the difference and the inherent variability into account. Cohen (1988) argues that effect sizes of 0.2, 0.5 and 0.8 can be regarded respectively as small, medium and large. However, in basic biomedical research the size of the effect that one is interested in can be substantially larger.

6.4 Sample size calculations

6.4.1 Power analysis computations

Now that we are familiar with the concepts of hypothesis testing and the determinants of sample size, we can proceed with the actual calculations. There is a significant amount of free software available to make elementary sample size calculations. In particular, there is the R-package pwr (Champely, 2009).

Example 6.1. Consider the completely randomized experiment about cardiomyocytes discussed in Example 5.6 (page 31). The standard deviations of the two groups are each about 12.5. A large effect of 0.8 in this case corresponds to a difference between both groups of 10 myocytes. Let's assume that we wish to plan a new experiment to detect such a difference with a power of 80%, and we want to reject the null hypothesis of no difference at a level of significance of 0.05, whatever the direction of the difference between the two samples (i.e. a two-sided test²).

¹ When comparing mean values from two independent groups, the standard deviation for calculating the effect size can be taken from either group when the variances of the two groups are homogeneous; alternatively, a pooled standard deviation can be calculated as sp = √((s1² + s2²)/2).
² See Section 7.3 for a discussion of one-sided and two-sided tests.

The calculations are carried out in R in a single line of code and show that 26 experimental units are required in each of the two treatment groups:

> require(pwr)
> pwr.t.test(d=0.8,power=0.8,sig.level=0.05,
+            type="two.sample",
+            alternative="two.sided")

     Two-sample t test power calculation

              n = 25.52457
              d = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Alternatively, we could ask what would be the power of an experiment with, say, 5 animals per treatment group:

> pwr.t.test(d=0.8,n=5,sig.level=0.05,
+            type="two.sample",
+            alternative="two.sided")

     Two-sample t test power calculation

              n = 5
              d = 0.8
      sig.level = 0.05
          power = 0.2007395
    alternative = two.sided

NOTE: n is number in *each* group

For a completely randomized experiment with 2 groups of 5 animals, as described in Example 5.6, the power to detect a difference of 10 myocytes between treatment groups (Δ = 0.8) is only 20%.

A quick and dirty method for sample size calculation against a two-sided alternative in a two-group comparison, with a power of 0.8 and a Type I error of 0.05, is provided by Lehr's equation (Lehr, 1992; Van Belle, 2008):

    n ≈ 16/Δ²

where Δ represents the effect size and n stands for the required sample size in each treatment group. For the above example, the equation results in 16/0.64 = 25 animals per treatment group. The numerator of Lehr's equation depends on the desired power. Alternative values for the numerator are 8 and 21 for powers of 50% and 90%, respectively.
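Lehr's rule is a one-liner in R; a small sketch (the helper name is ours):

> lehr.n <- function(delta, numerator=16) ceiling(numerator/delta^2)
> lehr.n(0.8)   # 25 per group; use numerator 8 for 50% power, 21 for 90%
[1] 25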

6.4.2 Mead's resource requirement equation

There are occasions when it is difficult to use a power analysis, because there is no information on the inherent variability (i.e. standard deviation) and/or because the effect size of interest is difficult to specify. An alternative, quick and dirty method for approximate sample size determination was proposed by Mead (1988). The method is appropriate for comparative experiments which can be analyzed using analysis of variance (Grafen and Hails, 2002; Kutner et al., 2004), such as:

- exploratory experiments;
- complex biological experiments with several factors and treatments;
- any experiment where the power analysis method is not possible or practicable.

The method depends on the law of diminishing returns: adding one experimental unit to a small experiment gives good returns, while adding it to a large experiment does not do so. It has been used by statisticians for decades, but was explicitly justified by Mead (1988). An appropriate sample size can be roughly determined from the number of degrees of freedom for the error term in the analysis of variance (ANOVA) or t test, given by the formula

    E = N − T − B

where E, N, T and B are the error, total, treatment and block degrees of freedom (number of occurrences or levels minus 1) in the ANOVA. In order to obtain a good estimate of error it is necessary to have at least 10 degrees of freedom for E, and many statisticians would take 12 or 15 degrees of freedom as their preferred lower limit. On the other hand, if E is allowed to be large, say greater than 20, then the experimenter is wasting resources. It is recommended that, in a non-blocked design, E should be between ten and twenty.

Example 6.2. Suppose an experiment is planned with four treatments, with eight animals per group (32 rats total). In this case N = 31, B = 0 (no blocking), T = 3, hence E = 28. This experiment is a bit too large, and six animals per group might be more appropriate (23 − 3 = 20).


There is one problem with this simple equation: it makes it appear as though blocking is bad, because it reduces the error degrees of freedom. If, in the above example, the experiment were done in eight blocks, then N = 31, B = 7, T = 3 and E = 31 − 7 − 3 = 21 instead of 28. However, blocking nearly always reduces the inherent variability, which more than compensates for the decrease in the error degrees of freedom, unless the experiment is very small and the blocking criterion was not well related to the response. Therefore, when blocking is present and the error degrees of freedom are not less than about 6, the experiment is probably still of an adequate size.

Example 6.3. If we consider again, in this context, the paired experiment of Example 5.5 (page 30), we have N = 9, B = 4, T = 1. Hence, E = 9 − 4 − 1 = 4. Obviously the sample size of 10 experimental units was too small to allow an adequate estimate of the error. At least 2 experimental units should be added.
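Mead's bookkeeping is easily wrapped in a helper (the function name is ours); it reproduces Examples 6.2 and 6.3:

> mead.E <- function(units, treatments, blocks=1)
+   (units-1) - (treatments-1) - (blocks-1)
> mead.E(32, 4)             # Example 6.2: E = 28, a bit too large
> mead.E(24, 4)             # six animals per group: E = 20
> mead.E(10, 2, blocks=5)   # Example 6.3, 5 pairs: E = 4, too small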

6.5 How many subsamples

In Section 4.8.2.2 we defined the standard error of the experiment when subsamples are present as:

    √[(2/n)(σn² + σm²/m)]

where n and m are the number of experimental units and subsamples, and σn and σm the between-sample and within-sample standard deviations. Using this expression, we can establish the power for different configurations of an experiment.

Figure 6.1. Power curves for a two-group comparison to detect a difference of 1 with a two-sided t-test at significance level α = 0.05, as a function of the number of subsamples m. Lines are drawn for different numbers of experimental units n in each group. For both the left and the right panel the between-sample standard deviation (σn) is 1, while the within-sample standard deviation (σm) is 1 in the left panel and 2 in the right panel. The dots connected by the dashed line indicate where the total number of subsamples 2·n·m equals 192. The vertical line indicates an upper bound to the useful number of subsamples of m = 4(σm²/σn²).

Figure 6.1 shows the influence of the number of experimental units (n) and the number of subsamples (m) per experimental unit on the power of two experiments to detect a difference between two mean values of 1. For both experiments σn = 1. The left panel shows the case where σm = 1, while in the right panel σm = 2. The dots connected by a dashed line represent the power for experiments where the total number of subsamples equals 192 (2 treatment groups × n × m).
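To see how such power curves can be computed, here is a rough sketch based on the standard error formula above; it uses a normal approximation rather than the exact t-based calculation, so the values are indicative only (the function name is ours):

> power.subsample <- function(n, m, delta=1, sd.n=1, sd.m=1) {
+   se <- sqrt((2/n) * (sd.n^2 + sd.m^2/m))   # standard error from above
+   pnorm(delta/se - qnorm(0.975))            # two-sided test, alpha = 0.05
+ }
> power.subsample(n=32, m=3)   # about 0.93, cf. the left panel of Figure 6.1
> power.subsample(n=4, m=24)   # same 192 subsamples in total, far lower power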

Figure 6.2. Required sample size for a two-sided test with a significance level of 0.05 and a power of 80% (left panel) and 90% (right panel) as a function of the number of comparisons that are carried out. Lines are drawn for different values of the effect size (Δ). Note that the y-axis is logarithmic.

As is illustrated in the left panel of Figure 6.1, subsampling has only a limited effect on the power of the experiment when the within-sample variability σm is of the same size as (or smaller than) the between-sample variability σn. In this case it makes no sense taking more than, say, 4 subsamples per experimental unit, as is indicated by the vertical line in Figure 6.1. Furthermore, the sharp decline of the dashed line connecting the points with the same total number of subsamples indicates that subsampling in this case is rather inefficient, at least when the cost of subsamples and experimental units is not taken into consideration. An experiment with 32 experimental units and 3 subsamples has a power of more than 90%, while for an experiment with the same total number of subsamples but with 4 experimental units and 24 subsamples per unit, the power is only about 20%.

The right panel of Figure 6.1 shows the case where the within-sample standard deviation σm is twice the standard deviation between samples σn. In this example, taking more subsamples does make sense. The power curves keep increasing until the number of subsamples is about 16. The loss in efficiency by taking subsamples is also more moderate, as is indicated by the less sharp decline of the dotted line.

In both examples the power curves have flattened after crossing the vertical line where m = 4(σm²/σn²). This is known as Cox's rule of thumb about subsamples (Cox, 1958), which states that for the completely randomized design there is not much increase in power when the number of subsamples m is greater than 4(σm²/σn²). Cox's ratio provides an upper limit for a useful number of subsamples. However, this rule of thumb does not take the different costs involved with experimental units and subsamples into account. In many cases, especially in animal research, the cost of the experimental unit is substantially larger than that of the subunit. Taking these differential costs into consideration, the optimum number of subsamples can be derived as (Snedecor and Cochran, 1980):

    m = √[(cn/cm)(σm²/σn²)]

This equation shows that taking subsamples is of interest when the cost of experimental units cn is large relative to the cost of subsamples cm, or when the variation among subsamples σm is large relative to the variation among experimental units σn.


Figure 6.3. Running the cardiomyocyte experiment a large number of times, the measured effect sizes follow a broad distribution. In both plots the true effect size is 0.8. The dark area represents statistically significant results (two-sided p ≤ 0.05) and the vertical dotted line indicates the effect size which is just large enough to be statistically significant. Left panel: 26 animals are used per treatment group, which corresponds to a power of 80%. Right panel: only 5 animals per treatment group are used, which results in an underpowered experiment.

Example 6.4. In a morphologic study (Verheyen et al., 2014), the diameter of cardiomyocytes was examined in 7 sheep that underwent surgery and 6 sheep that were used as controls. For each animal the diameter of about 100 epicardial cells was measured. A sophisticated statistical technique known as mixed model analysis of variance allowed σn² and σm² to be estimated from the data as 4.58 and 13.7 respectively. Surprisingly, variability within an animal was larger than between animals. If we were to set up a new experiment, we could limit the number of measurements to 4 × 13.7/4.58 ≈ 12 per animal. Alternatively, we can take the differential costs of experimental units and subsamples into account. It makes sense to assume that the cost of 1 animal is about 100 times the cost of one diameter measurement. Making this assumption, the optimum number of subsamples per animal would be √(100 × 13.7/4.58) ≈ 17. Thus the total number of diameter measurements could be reduced from 1300 to 220. Even if animals were to cost 1000 times more than a diameter measurement, the optimum number of subsamples per animal would be about 55, which is still a reduction of about 50% of the original workload. In conclusion, this is a typical example of a study in which statistical input at the onset would have improved research efficiency considerably.
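The rules of thumb applied in Example 6.4 can be verified in a few lines, using the variance estimates quoted above:

> var.n <- 4.58              # between-animal variance
> var.m <- 13.7              # within-animal variance
> 4*var.m/var.n              # Cox's upper bound: about 12 subsamples
> sqrt(100*var.m/var.n)      # cost ratio 100: optimum about 17
> sqrt(1000*var.m/var.n)     # cost ratio 1000: optimum about 55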

6.6 Multiplicity and sample size

As we shall see in Section 7.6, when more than one statistical test is carried out on the data, the overall rate of false positive findings is higher than the false positive rate for each test separately. To circumvent this inflation of the false positive error rate, the critical value of each individual test is usually set at a more stringent level. The simplest adjustment, Bonferroni's adjustment, consists of just dividing the significance level of each individual test by the number of comparisons. Bonferroni's adjustment maintains the error rate of the totality of tests that are carried out in the same context at its original level. But, as we already noted above, when the significance level is set at a lower value, the required sample size will necessarily increase. Fortunately, the increase in the required number of replicates is surprisingly small.

Figure 6.2 shows, for a two-sided Student t-test with a significance level of 0.05 and a power of 80% (left panel) and 90% (right panel), how the required sample size increases with an increasing number of comparisons. The percent increase in sample size due to adding an extra comparison corresponds to the slope of the line segment connecting adjacent points in Figure 6.2.
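These relative increases can be traced with the pwr package; a brief sketch for Δ = 0.8 and 80% power, using a simple Bonferroni division of the significance level:

> require(pwr)
> k <- 1:10   # number of independent comparisons
> n <- sapply(k, function(kk)
+   pwr.t.test(d=0.8, power=0.8, sig.level=0.05/kk,
+              type="two.sample", alternative="two.sided")$n)
> round(n/n[1], 2)   # relative sample size: about 1.2 at k=2, 1.3 at k=3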


The evolution of the relative sample size is comparable for all values of α and power 100(1 − β). For powers of 80% and 90%, carrying out two independent statistical tests instead of one involves a 20% larger sample size to maintain the overall error rate at its level of α = 0.05. Similarly, when 3 or 4 independent tests are involved, the required sample size increases by 30% or 40% respectively. After 4 comparisons, the effect tapers off and all curves approach linearity. Adding an extra comparison in the range of 4-10 comparisons will increase the required sample size by about 2.7%, leading to a total increase in sample size for 10 comparisons of about 70%. For a larger number of comparisons, Witte et al. (2000) noted that the relative sample size increases linearly with the logarithm of the number of comparisons.

Figure 6.2 also illustrates how sample size depends on the effect size. Large sample sizes are indeed required for detecting moderate to small differences. However, for the large and very large differences that we usually want to detect in early research, the required sample size reduces to an attainable level.

6.7 The problem with underpowered studies

A survey of articles published in 2011 (Tressoldi et al., 2013) showed that in prestigious journals such as Science and Nature, fewer than 3% of the publications calculate the statistical power before starting their study. More specifically, in the field of neuroscience, published studies have a power between 8 and 32% to detect a true effect (Button et al., 2013). Low statistical power might lead the researcher to wrongly conclude there is no effect from an experimental treatment when in fact an effect does exist. In addition, in research involving animals, underpowered studies raise a significant ethical concern. If each individual study is underpowered, the true effect will likely only be discovered after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time (Button et al., 2013). Another consequence of low statistical power is that effect sizes are overestimated and results become less reproducible (Button et al., 2013). This is best illustrated by the following example.

Example 6.5. Consider the cardiomyocyte example as discussed in Example 5.6. A sample size calculation, treating the experiment as if it were a completely randomized design (which it was not), was carried out on page 42 and yielded a required sample size of 26 animals in each group to detect a large treatment effect Δ = 0.8 with a power of 80%. Imagine running several copies of this experiment, say 10,000. The effect sizes that are obtained from these experiments follow a distribution as displayed in the left-hand panel of Figure 6.3. The dark shaded area corresponds to experiments that yielded a statistically significant result. This subset yields a slightly increased estimate of the effect size of 0.89, which corresponds to an effect size inflation of 11%. This effect size inflation is to be expected when an effect has to pass a certain threshold such as statistical significance. A relative inflation of 11%, as in this case, is acceptable.

The situation is completely different in the right-hand panel of Figure 6.3, where the experiments now use only 5 animals per treatment group and the corresponding power has dropped to 20%. The variability of the results is substantially larger, as displayed by the larger scale of the x-axis. While the standard deviation of the effect sizes in the larger experiment was 0.28, it has now increased to 0.63, an increase by a factor of 2.25, which corresponds to √(26/5). The significant experiments now constitute a much smaller part of the distribution. The mean effect size in this subset has now increased to 1.57, an inflation of 96%. In addition, the maximum effect detected in all studies is now 3.16 as compared to 1.95 in the larger experiment. This phenomenon is known as truth inflation, type M error (M stands for magnitude) or the winner's curse (Button et al., 2013; Reinhart, 2015).
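The inflation figures of Example 6.5 are easy to reproduce by simulation; a minimal sketch with simulated standardized data (true effect size 0.8; the helper name is ours):

> set.seed(123)
> mean.sig.effect <- function(n, d=0.8, nrep=10000) {
+   est <- replicate(nrep, {
+     x <- rnorm(n)                  # control group
+     y <- rnorm(n, mean=d)          # treated group, true effect size d
+     p  <- t.test(y, x, var.equal=TRUE)$p.value
+     sp <- sqrt((var(x) + var(y))/2)            # pooled SD
+     c(effect=(mean(y) - mean(x))/sp, sig=p <= 0.05)
+   })
+   mean(est["effect", est["sig", ] == 1])  # mean effect among significant runs
+ }
> mean.sig.effect(26)   # close to 0.89: modest inflation at 80% power
> mean.sig.effect(5)    # close to 1.57: severe inflation when underpowered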


Figure 6.4. The winner's curse: effect size inflation as a function of statistical power.

As shown in Figure 6.4, effect inflation is worst for small low-powered studies, which can only detect treatment effects that happen to be large. Therefore, significant research findings of small studies are biased in favour of inflated effects. This has consequences when an attempt is made to replicate a published finding and the sample size is computed based on the published effect. When this is an inflated estimate, the sample size of the confirmatory experiment will be too low. To summarize, effect inflation due to small, underpowered experiments is one of the major reasons for the lack of replicability in scientific research.

6.8 Sequential plans

Sequential plans allow investigators to save on experimental material by testing at different stages, as data accumulate. These procedures have been used in clinical research and are now advocated for use in animal experiments (Fitts, 2010, 2011). Sequential plans are sometimes referred to as "sequential designs", but strictly speaking all types of designs that we discussed before can be implemented in a sequential manner. Sequential procedures are entirely based on the Neyman-Pearson hypothesis decision making approach that we saw in Section 6.3 and do not consider the accuracy or precision of the treatment effect estimation.

Therefore, in the case of early termination for a significant result, sequential plans are prone to exaggerate the treatment effect. There is certainly a place for these procedures in exploratory research such as early screening, but a fixed sample size confirmatory experiment is needed to provide an unbiased and precise estimate of the effect size.

Example 6.6. In a search for compounds that offer protection against traumatic brain injury, a rat model was used as a screening test. Preliminary power calculations showed that at least 25 animals per treatment group were required to detect a protective effect with a power of 80% against a one-sided alternative with a type I error of 0.05. Taking into consideration that a large number of test compounds would be inactive, a fixed sample size approach was regarded as unethical and inefficient. Therefore, a one-sided sequential procedure (Wilcoxon et al., 1963) was considered more appropriate. The procedure operated in different stages (Figure 6.5). At each stage a number of animals was selected, such that the group was as homogeneous as possible. The animals were then randomly allocated to the different treatment groups, three per group. At a given stage the treatments consisted of several experimental compounds and their control vehicle. After measuring the response, the procedure allowed the investigator to make the decision of rejecting the drug as uninteresting, accepting the drug as active, or continuing with a new group of animals in a next stage.

Figure 6.5. Outline of a sequential experiment.

After having tested about 50 treatment conditions,


a candidate compound was selected for further development. An advantage of this screening procedure was that, given the biologically relevant level of activity that must be detected, the expected fraction of false positive and false negative results was known and fixed. A disadvantage of the method was that a dedicated computer program was required for the follow-up of the results.


7. The Statistical Analysis

"How absurdly simple!", I cried.
"Quite so!", said he, a little nettled. "Every problem becomes very childish when once it is explained to you."
Dr. Watson and Sherlock Holmes
The Adventure of the Dancing Men, A.C. Doyle.

"We teach it because it's what we do; we do it because it's what we teach." (on the use of p < 0.05)
George Cobb (2014)

7.1 The statistical triangle

There is a one-to-one correspondence between the study objectives, the study design and the analysis. The objectives of the study will indicate which of the designs may be considered. Once a study design is selected, it will in its turn determine which type of analysis is appropriate. This principle, that the statistical analysis is determined by the way the experiment is conducted, was enunciated by Fisher (1935):

    All that we need to emphasize immediately is that, if an experiment does allow us to calculate a valid estimate of error, its structure must completely determine the statistical procedure by which this estimate is to be calculated. If this were not so, no interpretation of the data could ever be unambiguous; for we could never be sure that some other equally valid method of interpretation would not lead to a different result.

In other words, the choice of the statistical methods follows directly from the objectives and design of the study. With this in mind, many of the complexities of the statistical analysis have now almost become trivial.

7.2 The statistical model revisited

Figure 7.1. The statistical triangle: a conceptual framework for the statistical analysis.

We already stated that every experimental design is underpinned by a statistical model and that the experimental results should be considered as being generated by this model. This conceptual framework, as illustrated in Figure 7.1, greatly simplifies the statistical analysis to just fitting the statistical model to the data and comparing the model component related to the treatment effect with the error component (Grafen and Hails, 2002; Kutner et al., 2004). Hence, the choice of the appropriate statistical analysis is straightforward. However, some important statistical issues remain, such as the type of data and the assumptions we make about the distribution of the data.

Figure 7.2. Distribution of the test statistic t for the cardiomyocyte example, under the assumption that the null hypothesis of no difference between the samples is true.

7.3 Significance tests

Significance testing is related to, but not exactly the same as, hypothesis testing (see Section 6.3). Significance testing differs from the Neyman-Pearson hypothesis testing approach in that there is no need to define an alternative hypothesis. Here, we only define a null hypothesis and calculate the probability of obtaining results as extreme as or more extreme than what was actually observed, assuming the null hypothesis is true. This is done by calculating, from the experimental data, a quantity called the test statistic. Then, based on the statistical model, the distribution of this test statistic is derived under the null hypothesis. With this null distribution, the probability is calculated of obtaining a test statistic that is as extreme as or more extreme than the one observed. This probability is referred to as the p-value. It is common practice to compare this p-value to a preset level of significance α (usually 0.05). When the p-value is smaller than α, the null hypothesis is rejected; otherwise the result is inconclusive. However, this conflates the two worlds of significance testing and the formal decision-making approach of hypothesis testing. For Fisher, the p-value was an informal measure to see how surprising the data were and whether they deserved a second look (Nuzzo, 2014; Reinhart, 2015). It is good practice to follow Fisher and to report the actual p-values rather than p ≤ 0.05 (see Section 9.2.3), since this allows anyone to construct their own hypothesis tests.

The cardiomyocytes experiment of Example 5.5 (page 30) will help us to illustrate the idea of significance testing. The experiment was set up to test the null hypothesis of no difference between vehicle and drug. This null hypothesis is tested at a level of significance of 0.05, i.e. we want to limit the probability of a false positive result to 0.05. The paired design of this experiment is a special case of the randomized complete block design with only two treatments, and the response is a continuously distributed variable. In this design, calculations can be simplified by evaluating the treatment effect for each pair separately, thus removing the block effect. This is done in Table 5.1 in the column with the Drug - Vehicle differences. We now must make some assumptions about the statistical model that generated the data. Specifically, we assume that the differences are independent from one another and originate from a normal distribution.

Next, we define a relevant test statistic; in this case it is the mean value of the differences (7) divided by its standard error¹. For our example, we obtain a value of 7/2.51 = 2.79 for this statistic.

¹ The standard error of the mean of a sample is obtained by dividing the sample standard deviation by the square root of the sample size, i.e. sx̄ = SD/√n = 5.61/√5 = 2.51.
the sample size, i.e. sx = SD/ n = 5.61/ 5 = 2.51

Under the assumptions made above, and provided the null hypothesis of no difference between the two treatment conditions holds, the distribution of this statistic is known² and is depicted in Figure 7.2. In the left panel of Figure 7.2, the value of the test statistic (2.79) obtained from the experimental data is indicated, and the area under the curve to the right of this value is shaded in grey. This area corresponds to the one-sided p-value, i.e. the probability of obtaining a greater value for the test statistic than the one obtained in the experiment. Since by definition the total area under the curve equals one, we can calculate the value of the shaded area. For our example, this results in a value of 0.024, which is the probability of obtaining a value for the test statistic as extreme as or more extreme than the one obtained in the experiment, provided the null hypothesis holds.

However, before the experiment was carried out, we were also interested in finding the opposite result, i.e. we were also interested in a decrease in viable myocytes. Therefore, when we consider more extreme results, we should also look at values that are less than -2.79. This is done in the right panel of Figure 7.2. The sum of the two areas is called the two-sided p-value and corresponds to the probability of obtaining, under the null hypothesis, a more extreme result than 2.79. In our example, the obtained two-sided p-value is 0.049, which allows us to reject the null hypothesis at the pre-specified significance level of 0.05 using a two-sided test.

² Under the null hypothesis, and when the assumptions are true, the test statistic is distributed as a Student t-distribution with n − 1 degrees of freedom.
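Using only the summary statistics quoted above (mean difference 7, SD 5.61, n = 5 pairs), this calculation can be reproduced in R:

> tstat <- 7/(5.61/sqrt(5))    # the test statistic: 2.79
> 1 - pt(tstat, df=4)          # one-sided p-value: about 0.024
> 2*(1 - pt(tstat, df=4))      # two-sided p-value: about 0.049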

7.4 Verifying the statistical assumptions

When the inferential results are sensitive to the distributional and other assumptions of the statistical analysis, it is essential that these assumptions are also verified. The aptness of the statistical model is preferably assessed by informal methods such as diagnostic plotting (Grafen and Hails, 2002; Kutner et al., 2004). When planning the experiment, historical data or the results of exploratory or pilot experiments can already be used for a preliminary verification of the model assumptions. Another option is to use statistical methods that are robust against departures from the assumptions (Lehmann, 1975). It is also wise, before carrying out formal tests, to make graphical displays of the data. This allows outliers to be identified and already gives indications of whether the statistical model is appropriate or not. Such exploratory work is also a tool for gaining insight into the research project and can lead to new hypotheses.

There is one important caveat in all this: significance tests only reflect whether the obtained result could be attributed to chance alone, but they do not tell whether the difference is meaningful or not from a scientific point of view.

7.5 The meaning of the p-value and statistical significance

The literature in the life sciences is literally flooded with p-values, and yet this is also the most misunderstood, misinterpreted and sometimes miscalculated measure (Goodman, 2008).


mean that there was no difference between treatment groups. The sample size of the experiment could just be too small to establish a statistically significant result (see Chapter 6). But, what does a significant result mean?

Example 7.2. In a laboratory, 100 experimental compounds are tested against a particular biological target. Figure 7.3 illustrates the situation. Each square in the grid represents a tested compound. In reality only 10 drugs are active against the target; these are located in the top row. We call this value of 10% the prevalence or base rate. Let's assume that our statistical test has a power of 80%. This means that of the 10 active drugs, 8 will be correctly detected. These are shown in darker grey. The threshold for the p-value to declare a drug statistically significant is set to 0.05; thus there is a 5% chance of incorrectly declaring an inactive compound active. There are 90 drugs that are in reality inactive, so about 5 of them will yield a significant effect. These are shown on the second row in black. Hence, in the experiment 13 drugs are declared active, of which only 8 are truly effective, i.e. the positive predictive value is about 8/13 = 62%, and its complement, the false discovery rate (FDR), is about 38%.

From the above reasoning, it follows (Colquhoun, 2014; Wacholder et al., 2004) that the FDR depends on the significance threshold α, the power 1 − β, and the prevalence or base rate π as:

FDR = α(1 − π) / [α(1 − π) + (1 − β)π]
    = 1 / {1 + [π/(1 − π)][(1 − β)/α]}    (7.1)

For our example, the above derivation of the FDR yields a value of 0.36 when the prevalence of active drugs is 10%, the significance threshold is 0.05 and the power 1 − β is 80%. This rises to 0.69 when the power is reduced to 20%, meaning that, under these conditions, 69% of the drugs (or other research questions) that were declared active are in fact false positives.

The FDR depends highly on the prevalence rate, as is illustrated in Figure 7.4, leading to the conclusion that, when working in new areas where the a priori probability of a finding is low, say 1/100, a significant result does not necessarily imply a genuine activity. In fact, under these circumstances, even in a well-powered experiment (80% power) with a significance level of 0.05, about 86% of the positive findings are false. To make things worse, it is such surprising, groundbreaking findings, often combined with exaggerated effect sizes due to a small sample size (Section 6.7), that are likely to be published in prestigious journals like Nature and Science.
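Equation (7.1) is easily turned into a small R function. A minimal sketch (the function name fdr is ours) that reproduces the values quoted above:

# false discovery rate as a function of the significance level (alpha),
# the power (1 - beta) and the prevalence (pi) -- Equation (7.1)
fdr <- function(alpha, power, prev) {
  alpha * (1 - prev) / (alpha * (1 - prev) + power * prev)
}
fdr(alpha = 0.05, power = 0.8, prev = 0.10)   # 0.36
fdr(alpha = 0.05, power = 0.2, prev = 0.10)   # 0.69
fdr(alpha = 0.05, power = 0.8, prev = 0.01)   # 0.86, the low-prevalence scenario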

Figure 7.4. False discovery rate as a function of the prevalence π and the power 100(1 − β)% (α = 0.05); lines are drawn for power 100(1 − β)% of 80%, 50% and 20%.

Table 7.1. Minimum false discovery rate (MFDR) for some commonly used critical values of p

  p-value    MFDR
  0.1        0.385
  0.05       0.289
  0.01       0.111
  0.005      0.067
  0.001      0.0184

What is the value of p ≈ 0.05? Consider an experiment whose results yield a p-value close to 0.05, say between 0.045 and 0.05. What does it actually mean? In how many instances does this result reflect a true difference? We already deduced that, when the power or the prevalence rate are low, the FDR can easily reach 70%. But what is the most optimistic scenario? In other words, what is the minimum value of the FDR? Irrespective of power, sample size and prior probability, Sellke et al. (2001) derived an expression for what they call the conditional error probability, which is equivalent to the minimum FDR (MFDR).

This gives the minimum probability that, when a test is declared significant, the null hypothesis is in fact true. Some values of the MFDR are given in Table 7.1. For p = 0.05 the MFDR = 0.289, which means that a researcher who claims a discovery when p ≤ 0.05 is observed will make a fool of him-/herself in about 30% of the cases. Even for a p-value of 0.01, the null hypothesis can still be true in 11% of the cases (Colquhoun, 2014).
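The MFDR has a simple closed form, MFDR = 1/{1 + 1/[−e p ln(p)]}, valid for p < 1/e (Sellke et al., 2001), so Table 7.1 can be reproduced in a single line of R:

# minimum false discovery rate (Sellke et al., 2001), for p < 1/e
mfdr <- function(p) 1 / (1 + 1 / (-exp(1) * p * log(p)))
round(mfdr(c(0.1, 0.05, 0.01, 0.005, 0.001)), 3)
# 0.385 0.289 0.111 0.067 0.018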

The FDR is certainly one of the key factors responsible for the lack of replicability in research, and it certainly puts into question the decision-theoretic approach, with its irrational dichotomization of p-values into significant and non-significant. As was noted already in the introductory chapter, the issues of reproducibility and replicability of research findings have deeply concerned the scientific, and certainly also the statistical, community. This has led the board of the American Statistical Association to issue a statement on March 6, 2016, in which the society warns against the misuse of p-values (Wasserstein and Lazar, 2016). This is the first time in its 177-year history that explicit recommendations on such a fundamental matter in statistics have been made. In summary, the ASA advises researchers in its statement to avoid drawing scientific conclusions or making decisions based on p-values alone. P-values should certainly not be interpreted as measuring the probability that the studied hypothesis is true, or the probability that the data were produced by chance alone. Researchers should describe not only the data analyses that produced statistically significant results, the society says, but all statistical tests and choices made in the calculations.

7.6 Multiplicity

In Section 3.1, we already pointed out that it is wise to limit the number of objectives in a study. As already mentioned in Section 6.6, increasing the number of objectives not only increases the study's complexity, but also results in more hypotheses that are to be tested. Testing multiple related hypotheses raises the type I error rate. The same problem of multiplicity arises when a study includes a large number of variables, or when measurements are made at a large number of time points. Only in studies of the most exploratory nature is the statistical analysis of every possible variable or time point acceptable. In this case, the exploratory nature of the study should be stressed and the results interpreted with great care.

Example 7.3. Suppose a drug is tested at 20 different doses on a specific variable. Further, suppose that we reject the null hypothesis of no treatment effect for each dose separately when the probability of falsely rejecting the null hypothesis (the significance level α) is less than or equal to 0.05. Then the overall probability of falsely declaring the existence of a treatment effect when all underlying null hypotheses are in fact true is 1 − (1 − 0.05)^20 = 0.64. This means that we are more likely than not to obtain at least one significant result. The same multiplicity problem arises when a single dose of the drug is tested on 20 mutually independent variables.

The problem of multiplicity is of particular importance and magnitude in gene expression microarray experiments (Bretz et al., 2005). For example, a microarray experiment may examine the differential expression of 30,000 genes in a wildtype and in a mutant. Assume that for each gene an appropriate two-sided two-sample test is performed at the 5% significance level. Then, even if no gene were truly differentially expressed, we would expect roughly 1,500 false positives. Strategies for dealing with what is often called the curse of multiplicity in microarrays are provided by Amaratunga and Cabrera (2004) and Bretz et al. (2005).

The multiplicity problem must at least be recognized at the planning stage. Ways to deal with it (Bretz et al., 2010; Curran-Everett, 2000) should be investigated and specified in the protocol.
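The computation in Example 7.3, together with the standard p-value adjustments on which such strategies build, takes one line each in R; the raw p-values below are purely hypothetical:

# familywise error rate for 20 independent tests at alpha = 0.05
1 - (1 - 0.05)^20                               # 0.64, as in Example 7.3

# adjusting a set of raw p-values for multiplicity (base R)
p.raw <- c(0.001, 0.008, 0.020, 0.041, 0.340)   # hypothetical p-values
p.adjust(p.raw, method = "bonferroni")
p.adjust(p.raw, method = "holm")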


8. The Study Protocol


Nothing clears up a case so much as stating it to another person.
Sherlock Holmes
The Memoirs of Sherlock Holmes. Silver Blaze. A.C. Doyle.
The writing of the study protocol marks the end of the research design phase. Every study should have a written formal protocol before it is started. The complete study protocol consists of a more conceptual research protocol and the technical protocol, which we already discussed in Section 4.8.1.3. The research protocol states the rationale for performing the study, the study's objectives, the related hypotheses and working hypotheses that are tested, and their consequential predictions. It should contain a section on experimental design: how treatments will be assigned to experimental units, information on and justification of the planned sample sizes, and a description of the statistical analysis that is to be performed. Defining the statistical methods in the protocol is of importance since it allows preparation of the data analytic procedures beforehand and guards against the misleading practice of data dredging or data snooping.

Writing down the statistical analysis plan beforehand also prevents trying several methods of analysis and reporting only those results that suit the investigator. Such a practice is of course inappropriate, unscientific, and unethical. In this context, the study protocol is a safeguard for the reproducibility of research findings.

Many investigators consider writing a detailed protocol a waste of time. However, the smart researcher understands that by writing a good protocol he is actually preparing his final study report. A well-written protocol is even more essential when the design is complex or the study is collaborative. Once the protocol has been formalized, it is important that it is followed as well as possible, and every deviation from it should be documented.


9. Interpretation and Reporting


No isolated experiment, however significant in itself, can suffice for the experimental demonstration
of any natural phenomenon; for the one chance in a million will undoubtedly occur, with no less
and no more than its appropriate frequency, however surprised we may be that it should occur to us.
R. A. Fisher (1935)

While the previous chapters focused on the planning phase of the study, with the protocol as final deliverable, this chapter deals with some points to consider when interpreting and reporting the results of the statistical analysis. Several journals have published guidelines for reporting statistical results (Altman et al., 1983; Bailar III and Mosteller, 1988). Although these guidelines focus mostly on clinical research, many of the precepts also apply to other types of research.

As a general rule in writing reports containing statistical methodology, it is recommended not to use technical statistical terms such as random, normal, significant, correlation, and sample in their everyday meaning, i.e. out of the statistical context. We will now discuss the different sections of a scientific publication in which statistical reasoning is involved.

9.1 The Methods section

The Methods section should contain details of the experimental design, such as the size and number of experimental groups, how experimental units were assigned to treatment groups, how experimental outcomes were assessed, and what statistical and analytical methods were used.

9.1.1 Experimental design

Readers should be told about weaknesses and strengths in the study design, e.g. whether randomization and/or blinding was used, since this adds to the reliability of the data. A detailed description of the randomization and blinding procedures, and of how and when these were applied, will allow the reader to judge the quality of the study. The reasons for blocking and the blocking factors should be given, as well as how blocking was dealt with in the statistical analysis. When there is ambiguity about the experimental unit, the unit used in the statistical analysis should be specified and a justification for its choice should be provided.

9.1.2 Statistical methods

Statistical methods should be described in enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. The authors should report and justify which methods they used. A term like tests of significance is too vague and should be made more specific. The level of significance and, when applicable, the direction of statistical tests should be specified, e.g. two-sided p-values less than or equal to 0.05 were considered to indicate statistical significance. Some procedures, e.g. analysis of variance and chi-square tests, are by definition two-sided.

Issues of multiplicity (Section 7.6), and a justification of the strategy that deals with them, should also be addressed here.

The software used in the statistical analysis, and its version, should also be specified. When the R system is used (R Core Team, 2013), both R and the packages that were used should be referenced.

9.2 The Results section

9.2.1 Summarizing the data

The number of experimental units used in the analysis should always be clearly specified. Any discrepancies with the number of units actually randomized to treatment conditions should be accounted for. Whenever possible, findings should be quantified and presented with appropriate indicators of measurement error or uncertainty. As measures of spread and precision, standard deviations (SD) and standard errors of the mean (SEM) should not be confused. Standard deviations are a measure of spread and as such a descriptive statistic, while standard errors are a measure of the precision of the mean. Normally distributed data should preferably be summarized as mean (SD), not as mean ± SD. For non-normally distributed data, medians and inter-quartile ranges are the most appropriate summary statistics. The practice of reporting mean ± SEM should preferably be replaced by the reporting of confidence intervals, which are more informative. Extremely small datasets should not be summarized at all, but should preferably be reported or displayed as raw data.

When reporting SD (or SEM) one should realize that, for positive variables (i.e. variables measured on a ratio scale) such as concentrations, durations, and counts, the mean minus 2 SD (or minus 2 SEM × √n), which indicates the lower 2.5% point of the distribution, can lead to a ridiculous negative value. In this case, an appropriate 95% confidence interval based on the lognormal distribution, or alternatively a distribution-free interval, will avoid such a pitfall.

Spurious precision detracts from a paper's readability and credibility. Therefore, unnecessary precision, particularly in tables, should be avoided. When presenting means and standard deviations, it is important to bear in mind the precision of the original data. Means should be given one decimal place more than the raw data. Standard deviations and standard errors usually require one further decimal place. Percentages should not be expressed to more than one decimal place and, with samples of less than 100, the use of decimal places should be avoided; percentages should not be used at all for small samples. Note that these remarks about rounding apply only to the presentation of results: rounding should not be done at all before or during the statistical analysis.
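The distinction takes one line each in R; a minimal sketch with simulated data (all numbers hypothetical):

# spread (SD) versus precision of the mean (SEM)
set.seed(123)
y   <- rnorm(8, mean = 10, sd = 2)   # hypothetical measurements
n   <- length(y)
sdv <- sd(y)                         # descriptive: spread of the data
sem <- sdv / sqrt(n)                 # precision of the estimated mean
# a 95% confidence interval is more informative than mean +/- SEM
mean(y) + qt(c(0.025, 0.975), df = n - 1) * sem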

9.2.2 Graphical displays

Figure 9.1. Scatter diagram with indication of median values and 95% distribution-free confidence intervals.

Graphical displays complement tabular presentations of descriptive statistics. Generally, graphs are better suited than tables for identifying patterns in the data, whereas tables are better for providing large amounts of data with a high degree of numerical detail. Whenever possible, one should attempt to graph the individual data points. This is especially the case when treatment groups are small. Graphs such as Figure 9.1 and Figure 9.2 are much more informative than the usual bar and line graphs showing mean values ± SEM (Weissgerber et al., 2015). These graphs are easily constructed using the R language. Specifically, the R package beeswarm (Eklund, 2010) can be of great help.
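A minimal sketch of such a graph with the beeswarm package (Eklund, 2010); the data are simulated and the plotting options merely illustrative:

# individual data points per treatment group
library(beeswarm)                # install.packages("beeswarm") if needed
set.seed(42)
dat <- data.frame(response = c(rnorm(8, 10, 2), rnorm(8, 12, 2)),
                  group    = rep(c("control", "treated"), each = 8))
beeswarm(response ~ group, data = dat, pch = 16)
means <- tapply(dat$response, dat$group, mean)
segments(1:2 - 0.2, means, 1:2 + 0.2, means, lwd = 2)   # group means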


Figure 9.2. Graphical display of longitudinal data showing individual subject profiles.

9.2.3 Interpreting and reporting significance tests

When data are summarized in the Results section, the statistical methods that were used to analyze them should be specified. It is of little help to the reader to find in the Methods section a statement such as 'statistical methods included analysis of variance, regression analysis, as well as tests of significance' without any reference to which specific procedure is reported in the Results part.

Tests of statistical significance should be two-sided. When comparing two means or two proportions, there is a choice between a two-sided and a one-sided test (see Section 7.3). In a one-sided test, the investigator's alternative hypothesis specifies the direction of the difference, e.g. experimental treatment greater than control. In a two-sided test, no such direction is specified. A one-sided test is rarely appropriate and, when one-sided tests are used, their use should be justified (Bland and Altman, 1994). For all two-group comparisons, the report should clearly state whether one-sided or two-sided p-values are reported.

Exact p-values, rather than statements such as p < 0.05 or, even worse, NS (not significant), should be reported where possible. The practice of dichotomizing p-values into significant and not significant has no rational scientific basis at all and should be abandoned. This lack of rationality becomes apparent when one considers the situation where a study yielding a p-value of 0.049 would be flagged significant, while an almost equivalent result of 0.051 would be flagged as NS. Reporting exact p-values allows readers to compare the reported p-value with their own choice of significance level. One should also avoid reporting a p-value as p = 0.000, since a value with zero probability of occurrence is, by definition, an impossible value. No observed event can ever have a probability of zero; such an extremely small p-value must be reported as p < 0.001. In rounding a p-value, it can happen that a value that is technically larger than the significance level of 0.05, say 0.051, is rounded down to p = 0.05. This is inaccurate and, to avoid this error, p-values should be reported to the third decimal. If a one-sided test is used and the result is in the wrong direction, then the report must state that p > 0.05 (Levine and Atkin, 2004), or, even better, report the complement of the p-value, i.e. 1 − p.
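Base R follows essentially this convention through format.pval(); a quick sketch:

# report p-values to three decimals, with a floor at 0.001
p <- c(0.049, 0.051, 0.00004)
format.pval(round(p, 3), eps = 0.001)
# e.g. "0.049" "0.051" "<0.001"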


There is a common misconception among scientists that a nonsignificant result implies that the null hypothesis can be accepted. Consequently, they conclude that there is no effect of the treatment or that there is no difference between the treatment groups. However, from a philosophical point of view, one can never prove the nonexistence of something. As Fisher (1935) clearly pointed out:

it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.

To state it otherwise: Lack of evidence is no evidence for lack of effect. Conversely, an effect that is statistically significant is not necessarily of biomedical importance, nor is it necessarily replicable (see Section 7.5). Therefore, one should avoid sole reliance on statistical hypothesis testing and preferably supplement one's findings with confidence intervals, which are more informative. Confidence intervals on a difference of means or proportions provide information about the size of an effect and its uncertainty, and are of particular value when the results of the test fail to reject the null hypothesis. This is illustrated in Figure 9.3, showing treatment effects and their 95% confidence intervals. The shaded area indicates the region in which results are important from a scientific (biological) point of view. Three possible outcomes for treatment effects are shown as mean values and 95% confidence intervals. The region encompassed by the confidence interval can be interpreted as the set of plausible values of the treatment effect. The top row shows a result that is statistically significant: the 95% confidence interval does not encompass the zero-effect line. However, effect sizes that have no biological relevance are still plausible, as is shown by the upper limit of the confidence interval. The second row shows the result of an experiment that was not significant at the 0.05 level. However, the confidence interval reaches well within the area of biological relevance. Therefore, notwithstanding the nonsignificant outcome, this experiment is inconclusive. The third outcome concerns a result that was not significant, where the 95% confidence interval also does not reach beyond the boundaries of scientific relevance. The nonsignificant result here can be interpreted as meaning that, with 95% confidence, the treatment effect is also irrelevant from a scientific point of view.
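In R such an interval comes with every standard test; a minimal sketch for two groups of simulated data:

# confidence interval for a difference of two means
set.seed(7)
ctrl  <- rnorm(8, mean = 10, sd = 2)   # hypothetical control group
treat <- rnorm(8, mean = 12, sd = 2)   # hypothetical treated group
tt <- t.test(treat, ctrl)              # Welch two-sample t-test
tt$p.value                             # the significance test alone...
tt$conf.int                            # ...and the more informative 95% CI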

Figure 9.3. Use of confidence intervals for interpreting statistical results. Estimated treatment effects are displayed with their 95% confidence intervals. The shaded area indicates the zone of biological relevance.

In the scientific literature, it is common practice to make a sharp distinction between significant and nonsignificant findings and to make comparisons of the sort 'X is statistically significant, while Y is not'. A representative, though fictitious, example of such claims is for instance:

The percentage of neurons showing cue-related activity increased with training in the mutant mice (P < 0.05), but not in the control mice (P > 0.05). (Nieuwenhuis et al., 2011)

Such comparisons are absurd, inappropriate, and can be misleading. Indeed, the difference between significant and not significant is not itself statistically significant (Gelman and Stern, 2006). Unfortunately, such a practice is commonplace. A recent review by Nieuwenhuis et al. (2011) showed that, in the area of cellular and molecular neuroscience, the majority of authors erroneously claim an interaction effect when they obtain a significant result in one group and a nonsignificant result in the other. In view of our discussion of p-values and nonsignificant findings, it is needless to say that this approach is completely wrong and misleading. The correct approach would be to design a factorial experiment and test the interaction effect of genotype and training. In this context, it must also be noted that carrying out a statistical test to prove equivalence of baseline measurements is pointless: tests of significance are not tests of equivalence. When baseline measurements are present, their value should be included in the statistical model.

Finally, when interpreting the results of the experiment, the scientist should bear in mind the topics covered in Section 6.7 about effect size inflation and in Section 7.5 about the pitfalls of p-values.

10. Concluding Remarks and Summary


You know my methods. Apply them.
Sherlock Holmes
The Sign of the Four. A.C. Doyle.
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post
mortem examination. He can perhaps say what the experiment died of.
R.A. Fisher (1938)

10.1 Role of the statistician

What we have not touched upon yet is the role of the statistician in the research project. The statistician is a professional particularly skilled in solving research problems. She should be considered a team member and often even a collaborator or partner in the research process, in which she can have a critical role. Whenever possible, the statistician should be consulted when there is doubt with regard to design, sample size, or statistical analysis. A statistician working closely together with a scientist can greatly improve the project's likelihood of success. Many applied statisticians become involved in the subject area and, by virtue of their statistical training, take on the role of statistical thinker, thereby permeating the research process. In a great many instances this key role of the statistician is recognized and rewarded with a co-authorship.

The most effective way to work with a consulting statistician is to include her or him from the very beginning of the project, when the study objectives are formulated (Hinkelmann and Kempthorne, 2008). What should always be avoided is contacting the statistical support group after the experiment has reached its completion; perhaps they can then only say what the experiment died of.

10.2 Recommended reading

Statistics Done Wrong: The Woefully Complete Guide by Reinhart (2015) is in my opinion essential reading material for all scientists working in biomedicine and the life sciences in general. This small book (152 pages) provides a well-written, very accessible guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals. Scientists working with laboratory animals should certainly read the article by Fry (2014) and the book The Design and Statistical Analysis of Animal Experiments by Bate and Clark (2014). For those interested in the history of statistics and the lives of famous statisticians, The Lady Tasting Tea by Salsburg (2001) is a lucidly written account of the history of statistics and experimental design, and of how statistical thinking revolutionized 20th-century science. A clear, comprehensive and highly recommended work on experimental design is the book by Selwyn (1996), while, on a more introductory level, there is the book by Ruxton and Colegrave (2003).

A gentle introduction to statistics in general, and to hypothesis testing, confidence intervals and analysis of variance in particular, can be found in the highly recommended book by the two Wonnacott brothers (Wonnacott and Wonnacott, 1990). Comprehensive works at an advanced level on statistics and experimental design are the books by Kutner et al. (2004), Hinkelmann and Kempthorne (2008), Casella (2008), and Giesbrecht and Gumpertz (2004). The latter two also provide designs suitable for 96-well microtiter plates. For those who want to carry out their analyses in the freely available R language (R Core Team, 2013), the book by Dalgaard (2002) is a good starter, while the book by Everitt and Hothorn (2010) is at a more advanced level. Guidance and tips for efficient data visualization can be found in the work of Tufte (1983) and in the two books by William Cleveland (Cleveland, 1993, 1994). Finally, there is the freely available e-book Speaking of Graphics (Lewi, 2006), which takes the reader on a fascinating journey through the history of statistical graphics.¹

¹ http://www.datascope.be

10.3 Summary

We have looked at the complexities of the research process from the vantage point of a generalist. Statistical thinking was introduced as a non-specialist, generalist skill that permeates the entire research process. The seven principles of statistical thinking were formulated as: 1) time spent thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of an experiment reflects the contributions from different sources of variability; 3) the design of an experiment balances between its internal validity (proper control of noise) and its external validity (the experiment's generalizability); 4) good experimental practice provides the clue to bias minimization; 5) good experimental design is the clue to the control of variability; 6) experimental design integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar of an effective experiment.

We elaborated on each of these and finally discussed some points to consider in the interpretation and reporting of scientific results. In particular, the problems with blind trust in statistical hypothesis tests and with the exaggerated effect sizes found in small significant studies were highlighted.


References
Altman, D. G., Gore, S. M., Gardner, M. J., and Pocock, S. J.
(1983). Statistical guidelines for contributors to medical journals. BMJ 286, 1489–1493.
URL http://www.bmj.com/content/286/6376/1489
Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. New York, NY: J.
Wiley.
Anderson, V. and McLean, R. (1974). Design of Experiments.
New York, NY: Marcel Dekker Inc.
Aoki, Y., Helzlsouer, K. J., and Strickland, P. T. (2014).
Arylesterase phenotype-specific positive association between
arylesterase activity and cholinesterase specific activity in human serum. Int. J. Environ. Res. Public Health 11, 1422–1443.
doi:10.3390/ijerph110201422.
Babij, C., Zhang, Y., Kurzeja, R. J., Munzli, A., Shehabeldin, A., Fernando, M., Quon, K., Kassner, P. D., Ruefli-Brasse, A. A., Watson, V. J., Fajardo, F., Jackson, A., Zondlo,
J., Sun, Y., Ellison, A. R., Plewa, C. A., T., S., Robinson, J.,
McCarter, J., Judd, T., Carnahan, J., and Dussault, I. (2011).
STK33 kinase activity is nonessential in KRAS-dependent cancer cells. Cancer Research 71, 5818–5826. doi:10.1158/0008-5472.CAN-11-0778.
Baggerly, K. A. and Coombes, K. R. (2009). Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. Annals of Applied Statistics 3, 1309–1334. doi:10.1214/09-AOAS291.
Bailar III, J. C. and Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals. Ann. Int. Med. 108, 266–273.
URL http://www.people.vcu.edu/~albest/Guidance/guidelines_for_statistical_reporting.htm
Banerjee, T. and Mukerjee, R. (2007). Optimal factorial designs
for cDNA microarray experiments. Ann. Appl. Stat. 2, 366–385.
doi:10.1214/07-AOAS144.
Bate, S. and Clark, R. (2014). The Design and Statistical Analysis
of Animal Experiments. Cambridge, UK: Cambridge University
Press.
Begley, C. G. and Ellis, L. M. (2012). Raise standards for preclinical research. Nature 483, 531–533. doi:10.1038/483531a.
Begley, C. G. and Ioannidis, J. P. A. (2015). Reproducibility in science. Circ. Res. 116, 116–126. doi:10.1161/CIRCRESAHA.114.303819.
Begley, S. (2012). In cancer science, many "discoveries" don't
hold up. Reuters March 28.
URL
http://www.reuters.com/article/2012/03/28/usscience-cancer-idUSBRE82R12P20120328
Bland, M. and Altman, D. (1994). One and two sided tests of
significance. BMJ 309, 248.

Bretz, F., Landgrebe, J., and Brunner, E. (2005). Multiplicity issues in microarray experiments. Methods Inf. Med. 44, 431–437.
Brien, C. J., Berger, B., Rabie, H., and Tester, M. (2013). Accounting for variation in designing greenhouse experiments
with special reference to greenhouses containing plants on
conveyor systems. Plant Methods 9, 5. doi:10.1186/1746-4811-9-5.
Burrows, P. M., Scott, S. W., Barnett, O., and McLaughlin,
M. R. (1984). Use of experimental designs with quantitative
ELISA. J. Virol. Methods 8, 207–216.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A.,
Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power
failure: Why small sample size undermines the reliability
of neuroscience. Nature Reviews Neuroscience 14, 112. doi:
10.1038/nrn3475.
Casella, G. (2008). Statistical Design. New Tork, NY: Springer.
Champely, S. (2009). pwr: Basic functions for power analysis.
R package version 1.1.1.
URL http://CRAN.R-project.org/package=pwr
Cleveland, W. S. (1993). Visualizing Data. Summit, NJ: Hobart
Press.
Cleveland, W. S. (1994). The Elements of Graphing Data. Summit, NJ: Hobart Press.
Clewer, A. G. and Scarisbrick, D. H. (2001). Practical Statistics
and Experimental Design for Plant and Crop Science. Chichester,
UK: J. Wiley.
Cochran, W. and Cox, G. (1957). Experimental Designs. New
York, NY: John Wiley & Sons Inc., 2nd edition.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd edition.
Cokol, M., Ozbay, F., and Rodriguez-Esteban, R. (2008).
Retraction rates are on the rise. EMBO Rep. 9, 2. doi:
10.1038/sj.embor.7401143.
URL
http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2246630/
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1, 140216. doi:10.1098/rsos.140216.
Council of Europe (2006). Appendix A of the European Convention for the Protection of Vertebrate Animals used for Experimental and Other Scientific Purposes (ETS No. 123). Guidelines for accommodation and care of animals (Article 5 of the Convention). Approved by the multilateral consultation.
URL https://www.aaalac.org/about/AppA-ETS123.pdf
Cox, D. (1958). Planning of Experiments. New York, NY: J. Wiley.

Bolch, B. (1968). More on unbiased estimation of the standard deviation. The American Statistician 22, 27.

Bretz, F., Hothorn, T., and Westfall, P. (2010). Multiple Comparisons Using R. Boca Raton, FL: CRC Press.

Curran-Everett, D. (2000). Multiple comparisons: philosophies and illustrations. Am. J. Physiol. Regulatory Integrative Comp. Physiol. 279, R1–R8.

Dalgaard, P. (2002). Introductory Statistics with R. New York, NY: Springer.


Eklund, A. (2010). beeswarm: The bee swarm plot, an alternative to stripchart. R package version 0.0.7.
URL http://CRAN.R-project.org/package=beeswarm
European Food Safety Authority (2012). Final review of the
Séralini et al. (2012) publication on a 2-year rodent feeding
study with glyphosate formulations and GM maize NK603 as
published online on 19 September 2012 in Food and Chemical
Toxicology. EFSA Journal 10, 2986. doi:10.2903/j.efsa.2012.
2986.
Everitt, B. S. and Hothorn, T. (2010). A Handbook of Statistical
Analyses using R. Boca Raton, FL: Chapman and Hall/CRC,
2nd edition.
Faessel, H., Levasseur, L., Slocum, H., and Greco, W. (1999).
Parabolic growth patterns in 96-well plate cell growth experiments. In Vitro Cell. Dev. Biol. Anim. 35, 270–278.
Fang, F. C., Steen, R. G., and Casadevall, A. (2012). Misconduct accounts for the majority of retracted scientific publications. Proc. Natl. Acad. Sci. U.S.A. 109, 17028–17033. doi:10.1073/pnas.1212247109.


Göhlmann, H. and Talloen, W. (2009). Gene Expression Studies


Using Affymetrix Microarrays. Boca Raton, FL: CRC Press.
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Semin. Hematol. 45, 135–140. doi:10.1053/j.seminhematol.2008.04.003.
Grafen, A. and Hails, R. (2002). Modern Statistics for the Life
Sciences. Oxford, UK: Oxford University Press.
Greco, W. R., Bravo, G., and Parsons, J. C. (1995). The search
for synergy: a critical review from a response surface perspective. Pharmacol. Rev. 47, 331–385.
Hankin, R. K. S. (2005). Recreational mathematics with R: introducing the magic package. R News 5.
Haseman, J. K. (1984). Statistical issues in the design, analysis
and interpretation of animal carcinogenicity studies. Environmental Health Perspect. 58, 385–392.
URL
http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC1569418/

Fernandez, G. C. J. (2007). Design and analysis of commonly


used comparative horticultural experiments. HortScience 42,
10521069.

Hayes, W. (2014). Retraction notice to "Long term toxicity of a


Roundup herbicide and a Roundup-tolerant genetically modified maize" [Food Chem. Toxicol. 50 (2012): 4221-4231]. Food
Chem. Toxicol. 52, 244. doi:10.1016/j.fct.2013.11.047.

Fisher, R. (1962). The place of the design of experiments in the


logic of scientific inference. Colloques Int. Centre Natl. Recherche
Sci. Paris 110, 13–19.

Hempel, C. G. (1966).
Philosophy of Natural Science.
Englewood-Cliffs, NJ: Prentice-Hall.

Fisher, R. A. (1935). The Design of Experiments. Edinburgh, UK:


Oliver and Boyd.

Hinkelmann, K. and Kempthorne, O. (2008). Design and Analysis of Experiments. Volume 1. Introduction to Experimental Design.
Hoboken, NJ: J. Wiley, 2nd edition.

Fisher, R. A. (1938). Presidential address: The first session of


the Indian Statistical Conference, Calcutta. Sankhya 4, 14–17.
Fitts, D. A. (2010). Improved stopping rules for the design of
small-scale experiments in biomedical and biobehavioral research. Behavior Research Methods 42, 3–22. doi:10.3758/BRM.42.1.3.
Fitts, D. A. (2011). Minimizing animal numbers: the variable-criteria stopping rule. Comparative Medicine 61, 206–218.
Freedman, L. P., Cockburn, I. M., and Simcoe, T. S. (2015). The
economics of reproducibility in preclinical research. PLoS Biol.
13, e1002165. doi:10.1371/journal.pbio.1002165.
Fry, D. (2014). Experimental design: reduction and refinement
in studies using animals. In K. Bayne and P. Turner, editors,
Laboratory Animal Welfare, chapter 8, pages 95–112. London,
UK: Academic Press.
Gart, J., Krewski, D., P.N., L., Tarone, R., and Wahrendorf,
J. (1986). The design and analysis of long-term animal experiments, volume 3 of Statistical Methods in Cancer Research. Lyon,
France: International Agency for Research on Cancer.

Hirst, J. A., Howick, J., Aronson, J. K., Roberts, N., Perera, R.,
Koshiaris, C., and Heneghan, C. (2014). The need for randomization in animal trials: An overview of systematic reviews.
PLOS ONE 9, e98856. doi:10.1371/journal.pone.0098856.
Holland, T. and Holland, C. (2011). Unbiased histological
examinations in toxicological experiments (or, the informed
leading the blinded examination). Toxicol. Pathol. 39, 711–714.
doi:10.1177/0192623311406288.
Holman, L., Head, M. L., Lanfear, R., and Jennions, M. D.
(2015). Evidence of experimental bias in the life sciences: why
we need blind data recording. PLoS Biol. 13, e1002190. doi:
10.1371/journal.pbio.1002190.
Hotz, R. L. (2007). Most science studies appear to be tainted
by sloppy analysis. The Wall Street Journal September 14.
URL http://online.wsj.com/article/SB118972683557627104.
html

Gelman, A. and Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. Am. Stat. 60, 328331.

Hu, B., Simon, J., Günthardt-Goerg, M. S., Arend, M., Kuster,


T. M., and Rennenberg, H. (2015). Changes in the dynamics of
foliar n metabolites in oak saplings by drought and air warming depend on species and soil type. PLOS ONE 10, e0126701.
doi:10.1371/journal.pone.0126701.

Giesbrecht, F. G. and Gumpertz, M. L. (2004). Planning, Construction, and Statistical Analysis of Comparative Experiments.
New York, NY: J. Wiley.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2, e124. doi:10.1371/journal.pmed.
0020124.

Glonek, G. F. V. and Solomon, P. J. (2004). Factorial and time


course designs for cDNA microarray experiments. Biostatistics
5, 89111.

Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Med. 11, e1001747. doi:10.1371/journal.
pmed.1001747.


Kilkenny, C., Parsons, N., Kadyszewski, E., Festing, M. F. W.,


Cuthill, I. C., Fry, D., Hutton, J., and Altman, D. G. (2009). Survey of the quality of experimental design, statistical analysis
and reporting of research using animals. PLOS ONE 4, e7824.
doi:10.1371/journal.pone.0007824.
Kimmelman, J., Mogil, J. S., and Dirnagl, U. (2014). Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol. 12, e1001863. doi:
10.1371/journal.pbio.1001863.
Kutner, M. H., Nachtsheim, C., Neter, J., and Li, W. (2004).
Applied Linear Statistical Models.
Chicago, IL: McGrawHill/Irwin, 5th edition.
Lansky, D. (2002). Strip-plot designs, mixed models, and comparisons between linear and non-linear models for microtitre
plate bioassays. In W. Brown and A. R. Mire-Sluis, editors, The
Design and Analysis of Potency Assays for Biotechnology Products.
Dev. Biol., volume 107, pages 11–23. Basel: Karger.
Lazic, S. (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience
11. doi:10.1186/1471-2202-11-5.
LeBlanc, D. C. (2004). Statistics: Concepts and Applications for
Science. Sudbury, MA: Jones and Bartlett Publishers.
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead,
B., Johnson, E. W., Geman, D., Baggerly, K., and Irizarry, R. A.
(2010). Tackling the widespread and critical impact of batch
effects in high-throughput data. Nature Reviews Genetics 11,
733–739. doi:10.1038/nrg2825.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods
Based on Ranks. San Francisco, CA: Holden-Day.
Lehr, R. (1992). Sixteen s squared over d squared: a relation
for crude sample size estimates. Stat. Med. 11, 1099–1102.
Lehrer, J. (2010). The truth wears off. The New Yorker [online]
December 13.
URL http://www.newyorker.com/magazine/2010/12/13/
the-truth-wears-off
Levasseur, L., Faessel, H., Slocum, H., and Greco, W. . (1995).
Precision and pattern in 96-well plate cell growth experiments.
In Proceedings of the American Statistical Association, Biopharmaceutical Section, pages 227–232. Alexandria, Virginia: American Statistical Association.
Levine, T. R. and Atkin, C. (2004). The accurate reporting of
software-generated p-values: a cautionary note. Comm. Res.
Rep. 21, 324–327. doi:10.1080/08824090409359995.
Lewi, P. J. (2005). The role of statistics in the success of a pharmaceutical research laboratory: a historical case description. J
Chemometr. 19, 282–287.
Lewi, P. J. (2006). Speaking of graphics.
URL http://www.datascope.be
Lewi, P. J. and Smith, A. (2007). Successful pharmaceutical
discovery: Paul Janssen's concept of drug research. R&D Management 37, 355–361. doi:10.1111/j.1467-9310.2007.00481.x.
Loscalzo, J. (2012). Irreproducible experimental results: causes, (mis)interpretations, and consequences. Circulation 125, 1211–1214. doi:10.1161/CIRCULATIONAHA.112.098244.


Mead, R. (1988). The design of experiments: statistical principles


for practical application. Cambridge, UK: Cambridge University
Press.
Nadon, R. and Shoemaker, J. (2002). Statistical issues with microarrays: processing and analysis. Trends in Genetics 15, 265–271.
Naik, G. (2011). Scientists elusive goal: Reproducing study
results. The Wall Street Journal December 2.
URL
http://online.wsj.com/article/
SB10001424052970203764804577059841672541590.html
Nieuwenhuis, S., Forstmann, B. U., and Wagenmakers, E.-J.
(2011). Erroneous analysis of interactions in neuroscience: a
problem of significance. Nat. Neurosci. 14, 1105–1107.
Nuzzo, R. (2014). Scientific method: Statistical errors. Nature
506, 150–152. doi:10.1038/506150a.
Parkin, S., Pritchett, J., Grimsditch, D., Bruckdorfer, K., Sahota, P., Lloyd, A., and Overend, P. (2004). Circulating levels of the chemokines JE and KC in female C3H apolipoprotein-E-deficient and C57BL apolipoprotein-E-deficient mice as potential markers of atherosclerosis development. Biochemical Society Transactions 32, 128–130.
Peng, R. (2009). Reproducible research and biostatistics. Biostatistics 10, 405–408. doi:10.1093/biostatistics/kxp014.
Peng, R. (2015). The reproducibility crisis in science. Significance 12, 30–32. doi:10.1111/j.1740-9713.2015.00827.x.
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J., and Nevins, J. R. (2006). Genomic signature to guide
the use of chemotherapeutics. Nature Medicine 12, 1294–1300.
doi:10.1038/nm1491. (Retracted).
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J., and Nevins, J. R. (2011). Retracted: Genomic signature to guide the use of chemotherapeutics. Nature Medicine
17, 135. doi:10.1038/nm0111-135. (Retracted).
Potvin, C., Lechowicz, M. J., Bell, G., and Schoen, D. (1990).
Spatial, temporal and species-specific patterns of heterogeneity in growth chamber experiments. Canadian Journal of Botany
68, 499–504.
Prinz, F., Schlange, A., and Asadullah, K. (2011). Believe it
or not: how much can we rely on published data on potential drug targets. Nature Rev. Drug Discov. 10, 712713. doi:
10.1038/nrd3439-c1.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org
Reinhart, A. (2015). Statistics Done Wrong: The Woefully Complete Guide. San Francisco, CA: no starch press.
Ritskes-Hoitinga, M. and Strubbe, J. (2007). Nutrition and animal welfare. In E. Kaliste, editor, The Welfare of Laboratory Animals, chapter 5, pages 95–112. Dordrecht, The Netherlands:
Springer.


Rivenson, A., Hoffmann, D., Prokopczyk, B., Amin, S., and


Hecht, S. S. (1988). Induction of lung and exocrine pancreas
tumors in F344 rats by tobacco-specific and Areca-derived N-nitrosamines. Cancer Res. 48, 6912–6917.
Ruxton, G. D. and Colegrave, N. (2003). Experimental Design
for the Life Sciences. Oxford, UK: Oxford University Press.
Sailer, M. O. (2013). crossdes: Construction of Crossover Designs.
R package version 1.1-1.
URL http://CRAN.R-project.org/package=crossdes
Salsburg, D. (2001). The Lady Tasting Tea. New York, NY.: Freeman.
Schlain, B., Jethwa, H., Subramanyam, M., Moulder, K., Bhatt,
B., and Molley, M. (2001). Designs for bioassays with plate
location effects. BioPharm International 14, 40–44.
Scholl, C., Fröhling, S., Dunn, I., Schinzel, A. C., Barbie, D. A., Kim, S. Y., Silver, S. J., Tamayo, P., Wadlow, R. C., Ramaswamy, S., Döhner, K., Bullinger, L., Sandy, P., Boehm, J. S., Root, D. E., Jacks, T., Hahn, W., and Gilliland, D. G. (2009). Synthetic lethal interaction between oncogenic KRAS dependency and STK33 suppression in human cancer cells. Cell 137, 821–834. doi:10.1016/j.cell.2009.03.017.
Sellke, T., Bayarri, M., and Berger, J. (2001). Calibration of p
values for testing precise null hypotheses. The American Statistician 55, 62–71.
Selwyn, M. R. (1996). Principles of Experimental Design for the
Life Sciences. Boca Raton, FL: CRC Press.
Séralini, G.-E., Clair, E., Mesnage, R., Gress, S., Defarge, N., Malatesta, M., Hennequin, D., and Vendômois, J. (2012). Long term toxicity of a Roundup herbicide and a Roundup-tolerant genetically modified maize. Food Chem. Toxicol. 50, 4221–4231. doi:10.1016/j.fct.2012.08.005.

Séralini, G.-E., Clair, E., Mesnage, R., Gress, S., Defarge, N., Malatesta, M., Hennequin, D., and Vendômois, J. (2014). Republished study: long term toxicity of a Roundup herbicide and a Roundup-tolerant genetically modified maize. Environmental Sciences Europe 26, 14. doi:10.1186/s12302-014-0014-5.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods.
Ames, IA: Iowa State University Press, 7th edition.
Straetemans, R., O'Brien, T., Wouters, L., Van Dun, J., Janicot, M., Bijnens, L., Burzykowski, T., and Aerts, M. (2005). Design and analysis of drug combination experiments. Biometrical J. 47, 299–308.
Tallarida, R. J. (2001). Drug synergism: its detection and applications. J. Pharm. Exp. Ther. 298, 865–872.
Temme, A., Stümpel, F., Söhl, G., Rieber, E. P., Jungermann, K., Willecke, K., and Ott, T. (2001). Dilated bile canaliculi and attenuated decrease of nerve-dependent bile secretion in connexin32-deficient mouse liver. Eur. J. Physiol. 442, 961–966.


Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician 34, 23–25.
Van Belle, G. (2008). Statistical Rules of Thumb. Hoboken, NJ: J.
Wiley, 2nd edition.
van der Worp, B., Howells, D. W., Sena, E. S., Porritt, M., Rewell, S., O'Collins, V., and Macleod, M. R. (2010). Can animal models of disease reliably inform human studies? PLoS Med. 7, e1000245. doi:10.1371/journal.pmed.1000245.
van Luijk, J., Bakker, B., Rovers, M. M., Ritskes-Hoitinga, M.,
de Vries, R. B. M., and Leenaars, M. (2014). Systematic reviews of animal studies; missing link in translational research?
PLOS ONE 9, e89981. doi:10.1371/journal.pone.0089981.
Vandenbroeck, P., Wouters, L., Molenberghs, G., Van Gestel,
J., and Bijnens, L. (2006). Teaching statistical thinking to life
scientists: a case-based approach. J. Biopharm. Stat. 16, 61–75.
Ver Donck, L., Pauwels, P. J., Vandeplassche, G., and Borgers,
M. (1986). Isolated rat cardiac myocytes as an experimental
model to study calcium overload: the effect of calcium-entry
blockers. Life Sci. 38, 765–772.
Verheyen, F., Racz, R., Borgers, M., Driesen, R. B., Lenders,
M. H., and Flameng, W. J. (2014). Chronic hibernating myocardium in sheep can occur without degenerating events and
is reversed after revascularization. Cardiovasc. Pathol. 23, 160–168. doi:10.1016/j.carpath.2014.01.003.
Vlaams Instituut voor Biotechnologie (2012). A scientific
analysis of the rat study conducted by Gilles-Eric Séralini et
al.
URL http://www.vib.be/en/news/Documents/20121008_
EN_Analyse\rattenstudieS{}ralini\et\al.pdf
Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., and Rothman, N. (2004). Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434–442. doi:10.1093/jnci/djh075.
Wasserstein, R. and Lazar, N. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician 70. doi:10.1080/00031305.2016.1154108. In press.
URL http://dx.doi.org/10.1080/00031305.2016.1154108
Weissgerber, T. L., Milic, N. M., Winham, S. J., and Garovic,
V. D. (2015). Beyond bar and line graphs: time for a new
data presentation paradigm. PLoS Biol. 13, e1002128. doi:
10.1371/journal.pbio.1002128.
Wilcoxon, F., Rhodes, L. J., and Bradley, R. A. (1963). Two sequential two-sample grouped rank tests with applications to
screening experiments. Biometrics 19, 58–84.
Wilks, S. S. (1951). Undergraduate statistical education. J.
Amer. Statist. Assoc. 46, 1–18.

Tressoldi, P. E., Giofrè, D., Sella, F., and Cumming, G. (2013). High impact = high statistical standards? Not necessarily so. PLOS ONE 8, e56180. doi:10.1371/journal.pone.0056180.

Witte, J., Elston, R., and Cardon, L. (2000). On the relative


sample size required for multiple comparisons. Statist. Med.
19, 369–372.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT.: Graphics Press.

Wonnacott, T. H. and Wonnacott, R. J. (1990). Introductory


Statistics. New York, NY.: J. Wiley, 5th edition.


Yang, H., Harrington, C. A., Vartanian, K., Coldren, C. D.,


Hall, R., and Churchill, G. A. (2008). Randomization in laboratory procedure is key to obtaining reproducible microarray results. PLOS ONE 3, e3724. doi:10.1371/journal.pone.
0003724.
Young, S. S. (1989). Are there location/cage/systematic non-treatment effects in long-term rodent studies? A question revisited. Fundam. Appl. Toxicol. 13, 183–188.


Zimmer, C. (2012). A sharp rise in retractions prompts calls
for reform. The New York Times April 17.
URL http://www.nytimes.com/2012/04/17/science/risein-scientific-journal-retractions-prompts-calls-forreform.html


Appendices


A. Glossary of Statistical Terms


ANOVA: see analysis of variance.

Accuracy: the degree to which a measurement process is free of bias.

Additive model: a model in which the combined effect of several explanatory variables or factors is equal to the sum of their separate effects.

Alternative hypothesis: a hypothesis which is presumed to hold if the null hypothesis does not; the alternative hypothesis is necessary in deciding upon the direction of the test and in estimating sample sizes.

Analysis of variance: a statistical method of inference for making simultaneous comparisons between two or more means.

Balanced design: a term usually applied to any experimental design in which the same number of observations is taken for each combination of the experimental factors.

Bias: the long-run difference between the average of a measurement process and its true value.

Blinding: the condition under which individuals are uninformed as to the treatment conditions of the experimental units.

Block: a set of units which are expected to respond similarly as a result of treatment.

Completely randomized design: a design in which each experimental unit is randomized to a single treatment condition or set of treatments.

Confidence interval: a random interval that depends on the data obtained in the study and is used to indicate the reliability of an estimate. For a given confidence level, if several confidence intervals are constructed based on independent repeats of the study, then in the long run the proportion of such intervals that contain the true value of the parameter will correspond to the confidence level.

Confounding: the phenomenon in which an extraneous variable, not under control of the investigator, influences both the factors under study and the response variable.

Covariate: a concomitant measurement that is related to the response but is not affected by the treatment.

Critical value: the cutoff or decision value in hypothesis testing which separates the acceptance and rejection regions of a test.

Data set: a general term for observations and measurements collected during any type of scientific investigation.

Degrees of freedom: the number of values that are free to vary in the calculation of a statistic; e.g. for the standard deviation, the mean is already calculated and puts a restriction on the number of values that can vary; therefore the degrees of freedom of the standard deviation is the number of observations minus 1.

Effect size: when comparing treatment differences, the effect size is the mean difference divided by the standard deviation (not the standard error); the standard deviation can be from either group, or a pooled standard deviation can be used.



Estimation: an inferential process that uses the value of a statistic derived from a sample to estimate the value of a corresponding population parameter.

Experimental unit: the smallest unit to which different treatments or experimental conditions can be applied.

Explanatory variable: also called predictor, a variable which is used in a relationship to explain or to predict changes in the values of another variable; the latter is called the dependent variable.

External validity: the extent to which the results of a study can be generalized to other situations.

Factor: the condition or set of conditions that is manipulated by the investigator.

Factor level: the particular value of a factor.

False negative: the error of accepting the null hypothesis when it is false.

False positive: the error of rejecting the null hypothesis when it is true.

Hypothesis testing: a formal statistical procedure where one tests a particular hypothesis on the basis of experimental data.

Internal validity: the extent to which a causal conclusion based on a study is warranted.

Level of significance: the allowable rate of false positives, set prior to analysis of the data.

Null hypothesis: a hypothesis indicating no difference which will either be accepted or rejected as a result of a statistical test.

Observational unit: the unit on which the response is measured or observed; this is not necessarily identical to the experimental unit.

One-sided test: a statistical test for which the rejection region consists of either very large or very small values of the test statistic, but not of both.

P-value: the probability of obtaining a test statistic as extreme as or more extreme than the observed one, provided the null hypothesis is true; small p-values are unlikely when the null hypothesis holds.

Parameter: a population quantity of interest; examples are the population mean and standard deviation of a normal distribution.

Pilot study: a preliminary study performed to gain initial information to be used in planning a subsequent, definitive study; pilot studies are used to refine experimental procedures and provide information on sources of bias and variability.

Population: the collection of all subjects or units about which inference is desired.

Precision: the degree to which a measurement process is limited in terms of its variability about a central value.

Protocol: a document describing the plan for a study; protocols typically contain information on the rationale for performing the study, the study objectives, the experimental procedures to be followed, the sample sizes and their justification, and the statistical analyses to be performed; the study protocol must be distinguished from the technical protocol, which is more about lab instructions.

Randomization: a well-defined stochastic law for assigning experimental units to differing treatment conditions; randomization may also be applied elsewhere in the experiment.

Statistic: a mathematical function of the observed data.

Statistical inference: the process of drawing conclusions from data that is subject to random variation.

Stochastic: non-deterministic, chance dependent.

Subsampling: the situation in which measurements are taken at several nested levels; the highest level is called the primary sampling unit, the next level the secondary sampling unit, etc.; when subsampling is present, it is of great importance to identify the correct experimental unit.


Test statistic: a statistic used in hypothesis testing; extreme values of the test statistic are unlikely under the null hypothesis.

Treatment: a specific combination of factor levels.

Two-sided test: a statistical test for which the rejection region consists of both very large and very small values of the test statistic.

Type I error: the error made by the incorrect rejection of a true null hypothesis.

Type II error: the error made by not rejecting the null hypothesis when the alternative hypothesis is true.

Variability: the random fluctuation of a measurement process about its central value.


B. Tools for randomization in MS Excel and R
B.1 Completely randomized design

Suppose 21 experimental units have to be randomly assigned to three treatment groups, such that each treatment group contains exactly seven animals.

B.1.1 MS Excel

A randomization list is easily constructed using a spreadsheet program like Microsoft Excel. This is illustrated in Figure B.1. We enter in the first column of the spreadsheet the code for the treatment (1, 2, 3). Using the RAND() function, we fill the second column with pseudo-random numbers between 0 and 1. Subsequently, the two columns are selected and the Sort command from the Data menu is executed. In the Sort window that appears, we select column B as the column to sort by. The treatment codes in column A are now in random order, i.e. the first animal will receive treatment 2, the second treatment 3, etc.

Figure B.1. Generating a completely randomized design in MS Excel

B.1.2 R-Language

In the open source statistical language R, the same result is obtained by:

> # make randomization process reproducible
> set.seed(14391)
> # sequence of treatment codes A,B,C repeated 7 times
> x <- rep(c("A","B","C"), 7)
> x
 [1] "A" "B" "C" "A" "B" "C"
 [7] "A" "B" "C" "A" "B" "C"
[13] "A" "B" "C" "A" "B" "C"
[19] "A" "B" "C"
> # randomize the sequence in x
> rx <- sample(x)
> rx
 [1] "B" "B" "B" "A" "A" "C"
 [7] "C" "C" "B" "A" "C" "C"
[13] "A" "B" "C" "B" "A" "A"
[19] "C" "A" "B"

B.2 Randomized complete block design

Suppose 20 experimental units, organized in 5 blocks of size 4, have to be randomly assigned to 4 treatment groups A, B, C, D, such that each treatment occurs exactly once in each block.

B.2.1 MS Excel

To generate the design in MS Excel, follow the procedure depicted in Figure B.2. We enter in the first column of the spreadsheet the code for the treatment (A, B, C, D). The second column (column B) is filled with an indication of the block (1:5). Using the RAND() function, we fill the third column with pseudo-random numbers between 0 and 1. Subsequently, the three columns are selected and the Sort command from the Data menu is executed. In the Sort window that appears, we select column B as the first sort criterion and column C as the second sort criterion. The treatment codes in column A are now in random order within each block, i.e. the first animal in block 1 will receive treatment A, the second treatment D, etc.

Figure B.2. Generating a randomized complete block design in MS Excel

B.2.2 R-Language

> set.seed(3223) # some number
> # treatments repeated 5 times
> treat <- rep(c("A","B","C","D"), 5)
> # blocks and treatments
> design <- data.frame(block = rep(1:5, rep(4, 5)), treat = treat)
> head(design, 10) # first 10 exp units
   block treat
1      1     A
2      1     B
3      1     C
4      1     D
5      2     A
6      2     B
7      2     C
8      2     D
9      3     A
10     3     B
> # randomly distribute units
> rdesign <- design[sample(dim(design)[1]), ]
> # order by blocks for convenience
> rdesign <- rdesign[order(rdesign[, "block"]), ]
> # sequence of units within blocks
> # is randomly assigned to treatments
> head(rdesign, 10)
   block treat
3      1     C
4      1     D
1      1     A
2      1     B
8      2     D
6      2     B
7      2     C
5      2     A
9      3     A
11     3     C
