
DRUGS: Demonstrating R Using Graphics and Statistics


A. Jonathan R. Godfrey (Editor)

February 28, 2012
ISBN 978-0-473-17651-8

Please include the following details when citing this publication as a single volume.

Author: Godfrey, A. Jonathan R. (Editor)
Year: 2010
Title: DRUGS: Demonstrating R Using Graphics and Statistics
Publisher: Institute of Fundamental Sciences, Massey University
Location: Palmerston North, New Zealand
ISBN: 978-0-473-17651-8

Please ensure you give credit to the specific authors when citing individual chapters. An example citation would be:

Ganesh, S. (2010). Cluster Analysis: Weather in New Zealand Towns, in Godfrey, A.J.R. (Editor) DRUGS: Demonstrating R Using Graphics and Statistics, Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand.

Contents

Preface

1 Some R Basics
  1.1 Introduction
  1.2 What you need
  1.3 Getting data
  1.4 Getting help
  1.5 Other resources

2 One Way Analysis of Variance
  2.1 Introduction
  2.2 Analysis of Variance
  2.3 Analysis Using R
  2.4 Exercises
  2.5 Some R hints

3 Blocking and the Analysis of Variance
  3.1 Introduction
  3.2 Analysis of Variance
  3.3 R Analysis of the Mosquito data
  3.4 Exercises

4 Latin Square Designs and the Analysis of Variance
  4.1 Introduction
  4.2 Analysis of Latin square designs
  4.3 Analysis using R
  4.4 Exercises
  4.5 Some R hints

5 Factorial Designs
  5.1 Introduction
  5.2 The model for the Factorial Design
  5.3 Analysis using R
  5.4 Exercises
  5.5 Some R hints

6 Incomplete Block and Factorial Designs
  6.1 Introduction
  6.2 Models for incomplete block and factorial designs
  6.3 Analysis using R
  6.4 Planning an incomplete factorial experiment: an example
  6.5 Exercises

7 Case Study: An Incomplete Factorial Design
  7.1 Introduction
  7.2 The Analysis for the Planned Experiment
  7.3 The Problem
  7.4 The Solution
  7.5 The Final Analysis

8 An Introduction to the Analysis of Covariance
  8.1 Introduction
  8.2 The analysis of covariance model
  8.3 Analysis using R
  8.4 Exercises

9 An Introduction to Split Plot Designs
  9.1 Introduction
  9.2 The model for the Split-plot Design
  9.3 Analysis using R
  9.4 Exercises

10 Fixed and Random Effects in Experiments
  10.1 Introduction
  10.2 Models for Random effects
  10.3 Analysis using R
  10.4 Exercises

11 Crossover Designs and the Analysis of Variance
  11.1 Introduction
  11.2 Analysis of Crossover Designs
  11.3 Analysis using R
  11.4 Exercises

12 Multivariate Analysis of Variance
  12.1 Introduction
  12.2 The Multivariate Analysis of Variance (MANOVA) model
  12.3 Analysis Using R
  12.4 Exercises

13 An Introduction to Generalized Linear Models
  13.1 Introduction
  13.2 Generalized Linear Models
  13.3 Analysis Using R
  13.4 Exercises

14 An Introduction to Contingency Tables
  14.1 Introduction
  14.2 Contingency Table Analysis
  14.3 Analysis using R
  14.4 Exercises

15 An Introduction to Survival Analysis
  15.1 Introduction
  15.2 Survival Analysis
  15.3 Analysis Using R
  15.4 Exercises

16 Nonlinear Regression
  16.1 Introduction
  16.2 Nonlinear Regression
  16.3 Analysis Using R
  16.4 Exercises

17 An Introduction to Mixed Effects Models
  17.1 Introduction
  17.2 Models with mixed effects
  17.3 Analysis using R
  17.4 Exercises

18 An Introduction to Sample Size Determination
  18.1 Introduction
  18.2 Sample Size Determination
  18.3 Analysis using R
  18.4 Exercises

19 Principal Component Analysis
  19.1 Introduction
  19.2 Principal Components Analysis
  19.3 Analysis Using R
  19.4 Exercises

20 Discriminant Analysis
  20.1 Introduction
  20.2 Discriminant Analysis
  20.3 Analysis Using R
  20.4 Exercises

21 Cluster Analysis
  21.1 Introduction
  21.2 Cluster Analysis
  21.3 Analysis Using R
  21.4 Exercises

References

Preface
The chapters of this e-text were compiled by me and my colleagues for the course 161.331 Biostatistics, which we taught at Massey University for the first time in 2009. The course requirements have changed a little since that first offering, so the chapters have been added to and improved over time. New chapters were added in 2010, and some were no longer required as part of the coursework for the paper; they are included in this volume as I have a preference for adding and not deleting. You'll notice that each chapter is given a credited authorship.

During 2008 we decided on a textbook for the course and, upon final checking of the pricing and availability, found that we could no longer truly justify the selected book as a good value for money text. As we were already well down the path of writing supplementary chapters to complement the chosen textbook, and upon checking what we could provide to students in an alternative fashion, it was decided to generate what we could in the first draft of this volume using our material and relevant extracts from the vignettes that come with various R packages. We've cleaned up the text quite a lot since then, and maybe one day it could be published more formally.

Our ability to use the work of others was possible because we use R as the principal software for this course, and R is an open source (collaborative) project which benefits from its users sharing their work under what is known as a general public licence (GPL). You can read the full contents of the GPL at http://www.R-project.org but my interpretation of the GPL is that you as the end user can only look to me to solve any errors in the document, even if I've taken extracts from someone else's work. I take every care to make sure that the document I provide is accurate, informative, and maybe even up to date, but in the end what you then do with any information taken from the document is your call. I'm starting to like the phrase "all care but no responsibility" as a descriptor of the R project on the whole. Welcome aboard!

In the end, I must take responsibility for the errors that exist, even those of other contributors; it would of course be most helpful to me and other readers that anything that looks like an error is reported to me as soon as possible. Rest assured, updates of the book are fairly easy to prepare and distribute. Feel free to let me know how you find the e-text by sending an e-mail to: a.j.godfrey@massey.ac.nz

R is a collaborative project and development is ongoing. Most users do not need to update R with the release of every new version. I recommend staying with one R version for as long as you can. I only change versions to keep up with the latest version my students might be using and because some of the additional packages created by other R users are built on recent versions. As it happens, this document was compiled using version 2.14.1, which was released on 22 December, 2011. All code should work on other versions equally well. Please let me know if this is not the case.

Cheers,
Jonathan Godfrey
February 28, 2012

Chapter 1 Some R Basics


An original chapter written by A. Jonathan R. Godfrey¹

1.1 Introduction

This chapter aims to show you the bare minimum of basic tasks that you will need to master to get you through this text. It is (intentionally) not comprehensive! Many (some might say too many) other sources of information exist for learning to use R.

1.2 What you need

You need a current installation of R. This book used version 2.14.1, released on 22 December, 2011, for this particular edition. All code should work on any newer installation, and if it appears in this text, the code worked when the book was compiled on February 28, 2012.

You will also need the additional package of the same name as the book. The DRUGS package contains the data sets used in this book along with some additional functions. If you've got this text, then you should also have obtained or been given the associated package at the same time. If you were given the book and package, then the instructions to install the package should also have been given, but if you downloaded the package and book yourself then it is assumed you have the skills to install everything you need.
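If the DRUGS package was supplied to you as a package file rather than from a repository, installation is usually a single call to install.packages() with repos = NULL, after which the package is loaded with library(). The sketch below assumes a local file; the file name shown is hypothetical.

> install.packages("DRUGS_1.0.zip", repos = NULL)   # install from a local package file (hypothetical name)
> library(DRUGS)                                    # load the package for the current session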
¹ Jonathan is a lecturer in the Institute of Fundamental Sciences.


1.3 Getting data

1.3.1 Creating data from scratch

R has three main forms of data structure: vectors, data.frames, and matrices. This text assumes data has been created already and does not need to be typed into R. When we do import data (see below) we will usually have a data.frame, which contains a set of variables (usually called vectors if they are on their own) in columns and observations in rows. The columns of a data.frame can be of various kinds: numeric (continuous data), integer-valued, logical (TRUE or FALSE), character (this is just plain text), or factor (a special data construct). If you had a set of vectors that you wanted combined into a data.frame, you can do so using the data.frame() command; see its help for more information.

A matrix is like a data.frame but its contents must be of the same type, usually numeric, integer-valued, or logical. Matrices are used for specific tasks of a mathematical manipulation kind rather than a statistical analysis. We will see matrices used in Chapter 19 of this text, but it is more common to work with a data.frame.

The vector is a single set of values (of any type mentioned already). R does not have different storage mechanisms for row-vectors versus column-vectors. The vector is used as required by the context of the mathematical manipulation. In most examples in this text, the direction (row versus column) of the vector is irrelevant because the object is just used as a basic set of values for doing a particular job. When we need to enter a basic set of values, we will use the c() command. As a basic example, we will assign a set of values into a new vector object:
> x=c(1,2,"asdf")

Note that the equals sign has been used to assign the right hand side to the vector stored under the name x. We can see what is stored under the name of any object by typing the name into R
> x
[1] "1"    "2"    "asdf"

Note that the inclusion of the character-valued item in the last position of this trivial example has an impact on the entire vector. R now assumes that all values in the vector are character-valued so quote marks appear around the values that you might have assumed were integer.
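If you do need to build a small data set from scratch, the data.frame() command mentioned above combines vectors of equal length into columns. A minimal sketch follows; the variable names and values are invented purely for illustration.

> Dose = c(10, 20, 30, 40)             # a numeric vector (invented values)
> Yield = c(2.1, 3.5, 4.2, 5.0)        # another numeric vector of the same length
> MyData = data.frame(Dose, Yield)     # combine the vectors into a data.frame
> str(MyData)                          # check how the columns have been stored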

1.3.2 Getting data from internal sources

R has a range of datasets used for testing functions and for demonstrating statistical techniques as part of its help functionality. We can use the data() command to get to any of these datasets using a command like:


> data(airquality)

We use this command in most chapters of this text, as it is useful for obtaining data stored in add-on packages created by R users, the authors included.

1.3.3 Getting data from external sources

This text assumes you have data in an importable form. We use the comma separated values type of file most frequently, which is viewable in spreadsheet software such as Microsoft Excel. The csv extension is common for these files even though they are just plain text files where each column is separated from the next using a comma; each row of the worksheet is started on a new line. We illustrate how data can be imported from a csv file using the read.csv() command, and any necessary conversions that result from having data stored in a way R is not handling as you would like.

If your data is not in csv format, you can either convert it to csv format using your preferred spreadsheet software, or save it as a text file of the txt kind; if you do choose the second option, then the help for the read.table() family of commands should be consulted.

It is possible to import data from files created by other statistical software. This means investigating the foreign package and its help files. In the end, it's probably faster to do the conversion to plain text or csv format and import using one of the commands mentioned.

Files created by spreadsheet software, such as the xls or xlsx formats, are not imported into R easily. The principal reason is that these spreadsheet programs are aimed at multiple-sheet workbooks and have embedded information for those programs to ensure the arrangement of columns etc. in the program. These file types also allow various constructs that are not imported into R, the most obvious of which is a graph.
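As a sketch of the usual import step (the file name and column name here are hypothetical), a csv file sitting in the working directory can be read and checked like this:

> MyData = read.csv("MyExperiment.csv")              # import a comma separated values file
> str(MyData)                                        # confirm each column was stored as intended
> MyData$Treatment = as.factor(MyData$Treatment)     # coerce a column to a factor if required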

1.4 Getting help

This book contains an index which should help you get to the topic you want quickly. All R commands mentioned in the text (not the code chunks) are automatically linked to the index. This will show you a context for the command, but not give a full description of the syntax for the command. The code chunks give specific examples of syntax, but to find out more, you'll need to use the help functionality built into R itself. If you want to find out how to use a particular command you know the name of (or have guessed), use something like

> ?mean


which opens the help page for the mean() function. R's help pages are hyper-linked so you might be able to work through the links to get what you want. This may prove a little inefficient (a euphemism for frustrating), so you could let R do some of the searching for you. If you wanted the syntax for the command that finds the inter-quartile range, then you might try
> ??quartile

and peruse the search results.
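Two other help tools from base R can be worth knowing: apropos() lists the objects whose names match a pattern, and example() runs the examples from a command's help page. A quick sketch:

> apropos("mean")     # list objects with "mean" in their names
> example(mean)       # run the examples shown on the help page for mean()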

1.5 Other resources

So many texts are now being written that incorporate use of R that you could find any number of them useful. Most cost money though, and those that do not are often dated; still useful, but dated. This text assumes you have skills surpassing those obtained in an introductory statistics course, so a supplementary document that covers all those things you learnt in your introductory statistics programme, done using R instead of any other software you may have used, is about all you need. It's blatant self promotion, but you may find the document known as Let's Use R Now (or LURN for short) a useful starting point. Its online version can be accessed via http://r-resources.massey.ac.nz

Chapter 2 One Way Analysis of Variance: Tomato tissue growth


An original chapter written by A. Jonathan R. Godfrey¹

2.1 Introduction

This chapter uses a data set taken from Kuehl (2000), an experimental design textbook. It is a completely randomized experiment on the growth of tomato tissue grown in tissue cultures with differing amounts and types of sugar. The experiment has five replications for each of the four treatments. The treatments are a control and 3% concentrations of each of glucose, fructose, and sucrose. The data are given in Exhibit 2.1.

2.2 Analysis of Variance

2.2.1 The completely randomized design (CRD)

In experimental design, an experimental unit is a single item that can have a treatment applied to it. Experimental units are supposed to be independent of one another, although we can correct for many known relationships among experimental units; more on this in later chapters.

A treatment factor is the term we use to describe the one element we change from experimental unit to experimental unit. A treatment factor has levels which might be
¹ Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 2.1 Growth of tomato tissue in four different tissue cultures.

Treatment     Growth
control       45 39 40 45 42
3% glucose    25 28 30 29 33
3% fructose   28 31 24 28 27
3% sucrose    31 37 35 33 34

different amounts of the factor of interest in the experiment; it could be different drugs given to patients (one drug per patient); or, a combination of these: some patients get a zero amount or placebo of a drug while other patients get different strengths of the drug in question. All experimental units can therefore be assigned one category or level of the treatment factor in question.

The completely randomised design, often abbreviated to CRD, assigns different levels of a treatment factor to experimental units in a completely random fashion. This does not mean that we randomly assign a treatment to each experimental unit in turn, but that the combination of treatment assignments is completely random. The former could mean that we would not necessarily end up having a balanced number of experimental units for each of the treatments; the latter means we can choose how many units will get each treatment, and then randomly choose which ones get which treatments. Having an equal number of experimental units for each treatment is not crucial for an experiment with only one treatment factor, but in general, balance is a desirable property of any experimental design. We will see in later chapters why this is more important for the more complicated designs we investigate, and ask you to think about the impact of balance on the CRD as an exercise to this chapter after you've had a chance to digest the theory that follows.

The purpose behind the random assignment of treatments to units is that we do not want any unknown influences to have an impact on our experiment. We deal with known (and potential) influences in coming chapters. Of course we do not know what the unknown influences might be when we plan our experiments, but at least we can protect ourselves from inadvertently linking any other influence to the impact our treatment factor has on experimental results. This confusion of effects is called confounding in experimental design terminology. Another reason for random assignment of treatments to units is that we use the replicates, sets of independent units being given the same treatment, to help estimate the amount of general randomness that exists for experimental units in our experiments. From the previous paragraph, we know that this random noise should be just that: random.


2.2.2 One-way analysis of variance model

The basic model for a one-way analysis of variance, suitable for analysing the completely randomised design, is

$$ y_{ij} = \mu + \alpha_i + \epsilon_{ij} $$   (2.1)

where $\mu$ is the grand mean, $\alpha_i$ is the $i$th treatment effect, and $\epsilon_{ij}$ is the error term for the $j$th replicate in the $i$th treatment group. The F-test used to judge the significance of the treatment effects is based on the following assumptions:

1. the observations are independent of each other;
2. the observations in each group arise from a normal population;
3. the observations in each group arise from populations with equal variance.

The first of these assumptions should be considered in the planning phase of a designed experiment; it can be addressed by randomly assigning the treatments to experimental units. There is of course a chance that unforeseen influences may impact on the independence of the observations, which should be picked up during the residual analysis of the model, but these should only have arisen due to unforeseen circumstances. The second and third assumptions are tested using the residual analysis, but the third can be done prior to the model being fitted. This is because the residuals for each group from the model have the same variance as the observed data for each group.

2.2.3 The hypothesis test

The analysis of variance model uses the F-distribution to gauge the relative importance of the variation that arises from differences between the a levels of the treatment factor, by comparing this quantity to the amount of variation that arises from similar observations being treated with the same level of the treatment factor (replicates). Note that we do not require the total number of experimental units (n) to be a multiple of the number of levels of the treatment factor, but that this is often desirable. The analysis of variance table given in Exhibit 2.2 shows the sources of variation, their associated degrees of freedom, sums of squares, and other details leading to the hypothesis test.

The most crucial element for the success of the hypothesis testing is the mean-square error (MS_E) term, which is the variance of the residuals for the appropriate model for our experiment. This quantity is the pooled estimate of the variations that arise within the different treatment groups. The easiest way to determine this value is from the corresponding SS_E value, which is found by subtracting the sums of squares


Exhibit 2.2 The components of the Analysis of Variance (ANOVA) table for an experiment with one treatment factor.

Source     df     SS                    MS                    F
Factor A   a-1    SS_A                  MS_A = SS_A/(a-1)     MS_A/MS_E
Error      n-a    SS_E = SS_T - SS_A    MS_E = SS_E/(n-a)
Total      n-1    SS_T

attributable to between-group differences from the total sum of squares of the whole data,

$$ SS_E = SS_T - SS_A $$   (2.2)

where

$$ SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$   (2.3)

Note the use of $\bar{y}$ to indicate the grand mean of the observed data here. As we should expect, SS_Total, shortened to SS_T, is just the total sum of squared differences between all observations and the grand mean. This definition for the total sum of squares is relevant in all linear models. If we say that there are g treatment groups, having n_j replicates in group j, we can express the sum of squares attributable to differences between the group means $\bar{y}_j$ as

$$ SS_A = \sum_{j=1}^{g} n_j (\bar{y}_j - \bar{y})^2 $$   (2.4)

These sums of squares are often called between-group (SS_A) and within-group (SS_E) sums of squares. The notation changes across the many texts in experimental design but the meanings do not. We use the SS_A and SS_E notation in preference to, say, SS_B and SS_W, so that notation is kept consistent with the chapters that follow.

In all situations where we construct an ANOVA table, the mean square (MS) column is just the sum of squares (SS) column divided by the degrees of freedom (df) assigned to that source of variation. It is unusual to see the total sum of squares divided by the total degrees of freedom, and this row is often not printed by software (notably R).

Now that we have the MS_A and MS_E values, we can compute the F-ratio and its associated p-value. This p-value is just the likelihood of observing an F-value of this magnitude due to random chance. If we decide that the observed F-value from our experiment is sufficiently large (its p-value is sufficiently small), we reject the belief that the treatment group means are equal and assume that there is at least one treatment group mean that is different. We then determine how we will best illustrate which of the treatment group differences are important.

2.2.4 Standard errors of differences

No matter what experimental design has been employed, the analyst must determine what summary information is to be provided in any report. Experiments are used (in the majority of scenarios) to ascertain any differences among the treatment means. We evaluate the differences between pairs of means using the standard error of the difference of two means. This quantity is directly derived from the variance of the model's residuals, often denoted MS_E. If treatment i has been applied to r_i experimental units, then the standard error of the difference (often labelled s.e.d.) is found using

$$ \text{s.e.d.} = \sqrt{ MS_E \left( \frac{1}{r_1} + \frac{1}{r_2} \right) } $$   (2.5)

which reduces to $\sqrt{2 MS_E / r}$ when there are the same number of observations (r) for each treatment being compared.
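As a small numerical sketch of Equation 2.5, using a residual mean square of 6.9 and five replicates per treatment (the rounded values that appear in the Tomato analysis later in this chapter):

> MSE = 6.9                      # residual mean square from an ANOVA table
> r1 = 5; r2 = 5                 # replicates receiving each of the two treatments
> sqrt(MSE * (1/r1 + 1/r2))      # standard error of the difference between two means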

2.2.5 Comparing all treatments with a control

It is reasonable to assume that the Tomato experiment was set up with the particular aim of testing three treatments against the control treatment. Performing t-tests for each treatment against the control is inappropriate as the Type I errors of each test have a compounding effect. This is because the hypothesis tests and confidence intervals created for the difference between all treatments and the control treatment are not independent. Each of the hypothesis tests performed as part of the overall test is correlated with the other tests, as all tests include the control treatment in the comparison. In this instance, the tests have an easily-defined dependence structure.

Dunnett (1955) offered a solution for the creation of simultaneous confidence intervals for the difference in the mean of a treatment and that of a control. His method can also be used in a hypothesis test context. Unfortunately, R has not yet implemented Dunnett's procedure or any of its recent improvements. We will see how this can be done later using R, which is a step forward on most experimental design textbooks; textbooks such as Kuehl (2000) have provided tables for the Dunnett statistic when it has been used in the past.

You should be comfortable with the relationship that exists between hypothesis tests and confidence intervals. Recall that the critical value used in hypothesis tests is also used for developing confidence intervals; we do the same when looking for simultaneous confidence intervals for the differences between all treatments and the control treatment.


Dunnett's procedure takes the familiar t-statistic used in a simple confidence interval and replaces it with the Dunnett statistic. This is best illustrated with an example. For the meantime, it is sufficient to say that the Dunnett statistic is greater than the equivalent t-statistic. This caters for the greater level of conservatism needed when making a set of judgements in combination, whether they be via hypothesis tests or simultaneous confidence intervals. The end result of using Dunnett's procedure is that we keep the overall risk of making a bad judgement for one of a number of claims (a Type I error) held at a predetermined level we are comfortable with.
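To see the direction of this adjustment, the ordinary two-sided 5% t critical value on 16 error degrees of freedom (the value relevant to the Tomato analysis below) can be computed and compared with the Dunnett statistic obtained later in this chapter; a minimal sketch:

> qt(0.975, df = 16)     # the usual t critical value; the Dunnett statistic is larger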

2.2.6 Other multiple comparisons

Dunnett's method is just one of a vast number of what we call multiple comparison procedures. Most methods are not, however, implemented in statistical software, and many are not straightforward to use. We will see how one popular multiple comparison procedure compares all pairs of treatment levels in a subsequent chapter, as this analysis is not necessarily suitable for the Tomato data set. Whenever you decide to use a multiple comparison procedure, the decision should be made prior to the experiment being conducted; it will often have an impact on the apportioning of experimental units to treatments, especially when we cannot have a balanced experiment (that is, a constant number of replicates for treatment factor levels).

2.3 Analysis Using R

Before applying analysis of variance to the data in Exhibit 2.1 we should summarise the main features of the data by calculating means and standard deviations. After selecting your working directory, we can load the data into our R session using
> data(Tomato, package = "DRUGS")

Check that this was successful by issuing:


> ls()
[1] "Tomato"

to list all objects in our current R session. We can find out more about the data set's structure using the str() command:
> str(Tomato)
'data.frame':   20 obs. of  2 variables:
 $ Sugar : Factor w/ 4 levels "control","fructose",..: 1 1 1 1 1 3 3 3 3 3 ...
 $ Growth: int  45 39 40 45 42 25 28 30 29 33 ...


See how R has stored this data set. The response variable is numeric as we expect, but the treatment factor has been stored in a different way to the original data file. The Sugar variable has been defined a factor by R and the four levels listed; then the number of the level is listed rather than the actual level itself. This is an efficient way of storing data, as the amount of memory required to store text strings is greater than the requirement for single digit numbers. The following R code produces some informative summary statistics
> attach(Tomato)
> Tomato.mean = tapply(Growth, Sugar, mean)
> Tomato.sd = tapply(Growth, Sugar, sd)
> detach(Tomato)
> Tomato.mean
 control fructose  glucose  sucrose
    42.2     27.6     29.0     34.0
> Tomato.sd
 control fructose  glucose  sucrose
   2.775    2.510    2.915    2.236

Note that the attach() and detach() commands are used to gain or remove direct access to the variables within the chosen data.frame. This often saves typing, but watch that you don't try to attach a data.frame that is already attached; it just confuses R a little bit!

There is a rule of thumb that helps us decide if there is any heterogeneity of variance in our data. If the largest within-group standard deviation is more than twice as large as the smallest, then there is some evidence that there is some heterogeneity to worry about. Inspection of the standard deviations for the four treatment groups shows no heterogeneity, so we can move on to produce the one-way analysis of variance using the aov() and summary() commands.
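Before doing so, the rule of thumb just described can be checked in a single line; a minimal sketch using the objects created above:

> max(Tomato.sd) / min(Tomato.sd)     # values well below 2 suggest no worrying heterogeneity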
> Tomato.aov = aov(Growth ~ Sugar, data = Tomato)
> summary(Tomato.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)    
Sugar        3    653   217.7    31.7 5.8e-07 ***
Residuals   16    110     6.9                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.tables(Tomato.aov, type = "means", se = TRUE)
Tables of means
Grand mean
33.2

 Sugar 
Sugar
 control fructose  glucose  sucrose 
    42.2     27.6     29.0     34.0 


Exhibit 2.3 Residual analysis for the one-way analysis of variance model for the Tomato data.
> par(mfrow = c(2, 2))
> plot(Tomato.aov)

[Figure: four residual diagnostic plots for Tomato.aov — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Constant Leverage: Residuals vs Factor Levels.]

Standard errors for differences of means
        Sugar
        1.658
replic.     5

The summary() command is sensitive to the type of object it is working on. In this instance, the object is the outcome of the aov() command. If the summary() command was executed on another type of object, the output would be different. (You'll see this as you work through the various chapters of the book.)

Each time we perform a hypothesis test we make assumptions. These were given for the analysis of variance in Section 2.2 and, as previously stated, are tested using a residual analysis. The aov() command (like many R commands) has an associated plot() method which generates a series of residual plots. The relevant R commands and their results are shown in Exhibit 2.3.

The two elements of Exhibit 2.3 that we should pay the most attention to are the Residuals vs Fitted (top left) and the Normal Q-Q (top right) plots. In a one-way analysis of variance context the Residuals vs Fitted values plot is somewhat ugly as there is only one fitted value for each treatment, being the treatment mean. Given our total number of experimental units was only twenty, we may struggle to find a perfectly normally distributed set of residuals. In spite of these comments, it is a good habit to generate residual analyses whenever a statistical model is formed. Dramatic departures from normality or non-constant variance will be highlighted in these plots, even for the simplest of models.

As mentioned above, the comparison of all treatments with the control can be done using Dunnett's procedure, but as yet it has not been implemented in the base distribution of R or any add-on packages that the author is aware of. The DunnettStat() function has been coded in the file Dunnett.r, which needs to be placed in your working directory before you issue the following commands.
> source("Dunnett.r")

The source() command reads the contents of the named file and issues all commands therein. This is a useful way of keeping your work and using it again at a later date. In this instance, the commands in the file create a new function for our use; that's why there was no output from the source() command. To see what was created, type:
> DunnettStat
function (Alpha = 0.05, k, v) 
{
    require(mvtnorm)
    CorrMat <- matrix(0.5, nrow = k, ncol = k)
    diag(CorrMat) <- 1
    Output <- qmvt(1 - 2 * Alpha, df = v, tail = "b", corr = CorrMat, 
        abseps = 1e-05, maxpts = 1e+05)
    return(Output$quantile)
}

You do not need to know how this function works in terms of its calculations, but its application is shown below. The function uses the mvtnorm package to find quantiles of a multivariate t-distribution. The three arguments for the Dunnett statistic are: Alpha, the overall Type I error rate for the simultaneous hypothesis test; k, the number of treatments in the experiment to be compared to the control; and v, the degrees of freedom for the error term in the model used.

In our Tomato experiment we have three treatments compared to the control, and with a quick reference to our model output above we find that k = 3 and v = 16. We must determine the level of Alpha to use in this instance, being the total risk of incurring a Type I error for the three simultaneous confidence intervals. For the sake of simplicity, and for no other reason, we will use α = 0.05 on this occasion. Given we are using a two-sided confidence interval approach, we obtain the simultaneous confidence intervals using Dunnett's procedure with the following R code.

> Control.mean = Tomato.mean[1]
> Use.means = Tomato.mean[-1]
> SDResid = sqrt(anova(Tomato.aov)[2, 3])
> DS = DunnettStat(0.025, 3, 16)
> Diff = Use.means - Control.mean
> Lower = Diff - DS * SDResid * sqrt(1/5 + 1/5)
> Upper = Diff + DS * SDResid * sqrt(0.4)
> CI = cbind(Diff, Lower, Upper)
> CI
          Diff Lower   Upper
fructose -14.6 -18.9 -10.301
glucose  -13.2 -17.5  -8.901
sucrose   -8.2 -12.5  -3.901

Note that the calculation of the confidence intervals uses a version of the standard error of the difference between the control and the other treatment; compare this with the expression given in Equation 2.5. We now see that every one of the three treatments is different to the control treatment, and by using the signs of the confidence intervals, and the table of means presented earlier, we know that each treatment has significantly lower tissue growth than does the control.

2.4 Exercises

Note: The data sets in the exercises are available via the DRUGS and datasets packages.

Exercise 2.1: Sokal and Rohlf (1981) used the data in the file Fruitfly3.csv to compare the fecundity of three genotypes of fruitflies (Drosophila melanogaster). One genotype was a control, having not been selected for any specific genetic purpose, while the other two genotypes were selected for resistance and susceptibility to the poison DDT. Determine if there is any difference in the fecundity of the nonselected genotype vs the selected genotypes and then between the two selected genotypes. In the DRUGS package, this data set is called FruitFly and can be obtained using
> data(FruitFly, package = "DRUGS")

As an aside, DDT was banned for use on New Zealand farms in 1970 and for all other purposes in 1989, so you might think the research has limited application now. Apparently, DDT is still used in the tropics for the control of malaria.

Exercise 2.2: Consider how you would recommend allocating 26 experimental units to an experiment having one treatment factor with five levels, one of which is a control treatment. You expect to apply Dunnett's method for comparing this control treatment to all other levels. Hint: Refer to the theory on standard errors of differences.

Exercise 2.3: The base installation of R includes the datasets package. One of the data sets therein has the counts of insects after application of one of six different sprays. The data first appeared in Beall (1942). We can gain direct access to this data by typing

> data(InsectSprays)


Investigate this data set, determine if there are any differences between the treatments, and ensure that the assumptions of the one-way ANOVA are satisfied. (Transformation of the response may be necessary.)

Exercise 2.4: Carroll et al. (1988) used this data to demonstrate a new method for calibration. The question you need to answer is why the zero level of copper used was tested four times while the other levels of copper were tested only twice. Back up any ideas you have with evidence from the data itself. It is in the DRUGS package and can be obtained using
> data(Copper, package = "DRUGS")

or in the file Copper.csv. Fit an ANOVA model to the data. Check the degrees of freedom.

2.5 Some R hints

Using data that is already prepared has been demonstrated, but creation of data for your own needs is often tedious and frustrating. To generate a data set for a single factor experiment, you could create a spreadsheet (in any suitable software) by typing out the contents required, perhaps with some strategic cut and paste mouse clicks thrown in, or you could do the following.

For example, say we want to produce a data set with four replicates of five levels of a single factor. We will use the as.factor(), rep() and levels() commands to demonstrate an efficient way of generating the factor variable.
> OurFactor = as.factor(rep(1:5, 4))
> OurFactor
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Levels: 1 2 3 4 5
> levels(OurFactor)
[1] "1" "2" "3" "4" "5"

If you didn't use the as.factor() command here, the results of the levels() command would not be the same.
> levels(OurFactor) = c("L1", "L2", "L3", "L4", "L5")
> str(OurFactor)
 Factor w/ 5 levels "L1","L2","L3",..: 1 2 3 4 5 1 2 3 4 5 ...

Use of the str() command shows us that the variable we have created does conform to the structure we would expect if we had imported the data from an external source. Note that R did not return the exact values of the variable here but showed us which level of the factor was used.

Now we can enter the response data separately using a command like
> OurResponse = c(20.5, 22.9, ...)

and if we had the right number of values in this new variable we could use it as the response variable in the one-way ANOVA model with the treatment factor previously created.

It is a good idea, however, to generate random data to test your ability to fit the model you intend to fit once data has been collected. OK, this data and its related model are simple, but things are going to get harder so it's good practice! If we wanted to generate 20 random responses we could use either of
> OurResponse = rnorm(20)
> OurResponse = runif(20)

which create twenty standard normal variates if we use rnorm(), or twenty random variates that are uniformly distributed between zero and one if we use the runif() command. It doesn't really matter which distribution we use here as the data is random and will be discarded once the real data is collected.

If we wanted a similar structure, but did not have equal replication, we could either generate an equally replicated data set and delete appropriate entries from the created variable, or use the full functionality of the rep() command. Let's say we want an experiment that has five levels of the single treatment factor but that we have 22 experimental units on offer. Obviously, we will be able to allocate five replicates to two of the five levels of the treatment factor, and leave the other three levels with only four replicates.
> OurFactor = as.factor(rep(1:5, c(4, 4, 4, 5, 5)))
> levels(OurFactor) = c("L1", "L2", "L3", "L4", "L5")

You could play around with the second argument of the rep() command to see what happens. The current version arbitrarily chose the last two levels of the treatment factor to have greater replication. The last task in planning the experiment following the completely randomised design is to randomise the allocation of treatments to units. The sample() command is all that is required.
> sample(10)
 [1]  3  5  2  1 10  7  6  9  8  4

gives us a random ordering of the integers one to ten. (Expect yours to be different!) So if we want to allocate the five treatments to the 22 experimental units, we would number or tag our units with an identifier, and then use the sample() command again:

> Treatment = factor(sample(OurFactor), levels = levels(OurFactor))
> str(Treatment)
 Factor w/ 5 levels "L1","L2","L3",..: 3 1 2 4 2 5 4 5 5 4 ...
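Allocations produced this way will differ from run to run. If a reproducible allocation is wanted, the random number seed can be fixed before calling sample(); a sketch in which the seed value 123 is arbitrary:

> set.seed(123)
> Treatment = factor(sample(OurFactor), levels = levels(OurFactor))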


Remember that you might need to create another random variable to plan the order in which the data is to be collected; just use the sample() command again. So let's put all that together, using some commands already issued above and the data.frame() command.
> IDTag = 1:22
> RandomResponse = rnorm(22)
> Order = sample(22)
> OurData = data.frame(IDTag, RandomResponse, Treatment, Order)
> head(OurData)
  IDTag RandomResponse Treatment Order
1     1        -1.1611        L3     6
2     2        -0.9672        L1    16
3     3        -0.8375        L2     1
4     4         0.6091        L4     2
5     5        -0.8115        L2     4
6     6        -0.1648        L5     8
> str(OurData)
'data.frame':   22 obs. of  4 variables:
 $ IDTag         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RandomResponse: num  -1.161 -0.967 -0.837 0.609 -0.811 ...
 $ Treatment     : Factor w/ 5 levels "L1","L2","L3",..: 3 1 2 4 2 5 4 5 5 4 ...
 $ Order         : int  6 16 1 2 4 8 20 5 14 3 ...

Now we have a data.frame that is ready for use with the analysis shown in this chapter for a completely randomised design. Once data are collected, we can add the measured responses to the data.frame using a command like:
> OurData$Response = c(...)

Chapter 3 Blocking and the Analysis of Variance: Resistance of Mosquitos to Insecticides


An original chapter written by A. Jonathan R. Godfrey¹

3.1 Introduction

This chapter uses a data set taken from Rawlins and Wan (1995), reported in the Journal of the American Mosquito Control Association. The article compares the effectiveness of five insecticides against a particular species of mosquito. The response to be analysed is the ratio of the dose required to kill 50% of the mosquitos (often labelled LD50, for lethal dose for 50%) divided by the known dosage for a susceptible mosquito strain. This value is known as the resistance ratio; the higher the ratio, the greater the resistance of this mosquito species to the insecticide.

Mosquito larvae were collected from seven different locations in the Caribbean and then split into five batches, thus creating 35 experimental units. The larvae from each location are referred to as blocks in statistical terminology, and the block effects are assumed to be independent of the treatment effects. This assumption can be tested when
¹ Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 3.1 Resistance ratios for five insecticides tested on Aedes aegypti (mosquitos) from seven Caribbean locations.

Location   Temephos  Malathion  Fenitrothion  Fenthion  Chlorpyrifos
Anguilla        4.6        1.2           1.5       1.8           1.5
Antigua         9.2        2.9           2.0       7.0           2.0
Dominica        7.8        1.4           2.4       4.2           4.1
Guyana          1.7        1.9           2.2       1.5           1.8
Jamaica         3.4        3.7           2.0       1.5           7.1
StLucia         6.7        2.7           2.7       4.8           8.7
Suriname        1.4        1.9           2.0       2.1           1.7

there are replicate data available, which is not the case in this instance. The data appear in Exhibit 3.1, but need to be in list form to be used in the subsequent analyses.

3.2 Analysis of Variance

3.2.1 Randomized complete block experiments

Blocking is the sorting of all experimental units into groups so that the units within each block are more homogeneous than the entire set of experimental units. This places constraints on the possible allocations of treatments to experimental units, as we try to balance the number of times each treatment is applied within each block. Blocking can arise because there are obvious differences among the experimental units, or because we decide to create blocks on the basis of some already known information about the experimental units. For example, a set of patients will have a gender which we will not change before the experiment starts, but we cannot be totally sure of the weights of the patients on day 1 of our experiment; we might decide to take the lighter half of the patients and make them one block while the heavier half of the patients form the other block.

When we have been able to separate the experimental units into blocks prior to the allocation of treatments to experimental units, and the number of times each level of the treatment factor appears in each block is equal, we have a randomized complete block (RCB) design. In many instances an experiment will have only one replicate per treatment and block combination, as is the case for the mosquito data used in this chapter. When there is but one replicate, the assumption of independent block and treatment effects cannot easily be validated, so this design should be used advisedly.


3.2.2 The analysis of variance model

The appropriate two-way model for the RCB designed experiment is

$$ y_{ijk} = \mu + \beta_i + \alpha_j + \epsilon_{ijk} $$   (3.1)

where $\mu$ is the grand mean, $\beta_i$ is the $i$th block effect, $\alpha_j$ is the $j$th treatment effect, and $\epsilon_{ijk}$ is the error term for the $k$th replicate of the $j$th treatment group within the $i$th block.

There are now three sums of squares that sum to form the total sum of squares (Equation 2.3). The blocking factor, the treatment factor, and the error component of our model each have a sum of squares and associated degrees of freedom. The formulae for the blocking and treatment SS are made simple by the balance that exists in the RCB design. The relevant sums of squares, if there are a treatments and b blocks, are:

$$ SS_{Block} = a \sum_{i=1}^{b} (\bar{y}_{i} - \bar{y})^2 $$   (3.2)

$$ SS_{Treatment} = b \sum_{j=1}^{a} (\bar{y}_{j} - \bar{y})^2 $$   (3.3)

$$ SS_{Error} = SS_{Total} - SS_{Block} - SS_{Treatment} $$   (3.4)

Note the use of $\bar{y}$, $\bar{y}_{i}$ and $\bar{y}_{j}$ to indicate the grand mean, the $i$th block mean, and the $j$th treatment mean respectively. The SS_Block has b - 1 degrees of freedom, the SS_Treatment has a - 1, and the error has (a - 1)(b - 1) degrees of freedom, which sums to the necessary n - 1 required of all ANOVA tables. As before, the SS values are divided by their degrees of freedom to get the relevant mean squares used to create the F-ratios for the hypothesis tests.

The assumptions for the F-test given for the one-way ANOVA (Section 2.2.2) also apply to the two F-tests performed in the two-way analysis of variance for the randomized complete block design.

The need to present an F-test on the significance of the blocking factor is debatable. The vast majority of statistical software packages do not identify factors in an ANOVA as blocking or treatment factors, so the F-test is calculated for all factors, regardless of the need to do so. In this respect, we will see that R is no different. The argument for not presenting the F-test for the blocking factor is that the blocking is a structural feature of the experiment. The structure should be represented in the model even if it were not deemed statistically significant, so leaving it out of the resulting ANOVA table is not an option. Presentation of the sum of squares attributable to blocking is essential however, as it shows the value of employing the blocking in the experimental design. The author does not wish this discussion to be seen only as justification for deleting parts of the ANOVA table given by the chosen software, but does want to protect against the practice of deleting experimental design factors from the model that are not statistically significant.

3.2.3 Standard errors of differences

The value of including a blocking factor in the experimental design is the improvement in the accuracy and precision of the estimates of treatment means and their differences. As in the analysis of the completely randomized design, we evaluate the differences between pairs of means using the standard error of the difference of two means, using the mean square error (MSE) from our model. If treatment i has r_i observations, then the standard error of the difference is still found using Equation 2.5:

$$ \text{s.e.d.} = \sqrt{ MS_E \left( \frac{1}{r_1} + \frac{1}{r_2} \right) } $$

We do not need to alter the number of replicates per treatment level to account for the blocking. Blocking will have led to a reduction in the MSE of the experiment in the vast majority of situations where blocking was employed.
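As a small sketch, using the residual mean square of 3.44 that appears in the analysis later in this chapter, with the seven locations supplying seven observations of each insecticide (values rounded):

> MSE = 3.44             # residual mean square from the two-way ANOVA table
> r = 7                  # observations per insecticide, one from each block
> sqrt(2 * MSE / r)      # standard error of the difference between two insecticide means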

3.2.4 Multiple comparison procedures

Having determined that there is a significant departure from the notion that all treatment means are the same, we need to find which pair (or pairs) of levels are different. There are many multiple comparison procedures to choose among. These procedures all allow a comparison of all pairs of levels of a factor whilst maintaining the overall significance level at its selected value and producing adjusted confidence intervals for mean differences. The procedure we will investigate is called Tukey honest significant differences, suggested by Tukey (1953). The outcome from following Tukey's procedure is that we generate a set of simultaneous confidence intervals. Each confidence interval in the set is considered in the same way as we would if there were only two levels of the treatment factor. A confidence interval which spans zero does not show evidence of any significant difference.

3.3 R Analysis of the Mosquito data

As noted in Section 3.1, the data in Exhibit 3.1 need to be converted to list form for use with the following statistical analyses. Once you have placed a copy of the data file


Mosquito.csv into your chosen working directory, the rearrangement can be achieved using the following R commands
> Mosquito.mat = read.csv("Mosquito.csv", row.names = 1)

> Mat2List = function(DataMat, Resp = "", RLabel = "", CLabel = "") {
+     DataMat = as.matrix(DataMat)
+     RowFact = dimnames(DataMat)[[1]][as.vector(row(DataMat))]
+     ColFact = dimnames(DataMat)[[2]][as.vector(col(DataMat))]
+     Output = data.frame(as.vector(DataMat), RowFact, ColFact)
+     names(Output) = c(Resp, RLabel, CLabel)
+     return(Output)
+ }
> Mosquito = Mat2List(Mosquito.mat, Resp = "RR", RLabel = "Location",
+     CLabel = "Insecticide")
> head(Mosquito)
   RR Location Insecticide
1 4.6 Anguilla    Temephos
2 9.2  Antigua    Temephos
3 7.8 Dominica    Temephos
4 1.7   Guyana    Temephos
5 3.4  Jamaica    Temephos
6 6.7  StLucia    Temephos

Notice that we created a new R function Mat2List() by combining a set of commands to do the rearranging here. The final statement, using the head() command, proves that the new function worked. The reason for making a new function, rather than just applying the commands needed for the current example, is that all too often we must work with data that are in the wrong form. Writing new R commands that are general in their application might seem unnecessary, but the time spent now is bound to save time in the future when the same problem arises. The above panel of R code could be replaced by the following to achieve the same ends.
> Mosquito.mat = as.matrix(Mosquito.mat)
> Mosquito = as.data.frame.table(Mosquito.mat)
> names(Mosquito) = c("Location", "Insecticide", "RR")
> head(Mosquito)
  Location Insecticide  RR
1 Anguilla    Temephos 4.6
2  Antigua    Temephos 9.2
3 Dominica    Temephos 7.8
4   Guyana    Temephos 1.7
5  Jamaica    Temephos 3.4
6  StLucia    Temephos 6.7

All too often there are different ways of achieving the same outcome in R. Stylistic differences among programmers and the (sometimes very small) differences between the total set of uses for similar functions cause most of the different approaches. You should compare this behaviour with various other statistical software packages you have used.
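Whichever route is taken, one quick check that the rearranged data really do form a complete block layout, with each insecticide appearing exactly once at each location, is a two-way tabulation of the factors; a minimal sketch (output not shown):

> with(Mosquito, table(Location, Insecticide))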

The differences among approaches do demand that you develop a robust programming style. The two outputs above differ in just one respect: the order of the columns is different, a difference of no real importance as long as you refer to columns of the data.frame by variable names rather than column number. In either case, you should note that the data are now in the form of a data.frame and can be attached to the search path using the attach() command. You could check this by issuing an is.data.frame() command, and should expect the answer TRUE. After attaching the data.frame, we can access the data using the variable names themselves. Note that this is performed (temporarily) within the aov() and lm() commands when the argument data= is used. To apply analysis of variance to the data we can use the aov() function in R and then the summary() method to give us the usual analysis of variance table.
> Mosquito.aov = aov(RR ~ Location + Insecticide, data = Mosquito)
> summary(Mosquito.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
Location     6   56.7    9.46    2.74  0.036 *
Insecticide  4   39.3    9.82    2.85  0.046 *
Residuals   24   82.7    3.44                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model formula specifies an additive two-way layout (that is, without any interaction terms), where the first factor is Location, and the second factor is Insecticide. As the Location is a blocking factor, it should always be placed in the analysis of variance ahead of the treatment factor Insecticide, because it was used to place constraints on how the experimental units were formed (in this case) or had treatments allocated (in a more general sense). Although R has already identified that the blocking factor was statistically significant, it might prove useful to observe the impact of ignoring the blocking. Note this is done only for illustration and is not suggested best practice. It does show the value of considering blocking in the planning of experiments.
> bad.Mosquito.aov = aov(RR ~ Insecticide, data = Mosquito)
> summary(bad.Mosquito.aov)
            Df Sum Sq Mean Sq F value Pr(>F)
Insecticide  4   39.3    9.82    2.11    0.1
Residuals   30  139.4    4.65

Notice that the Insecticide treatment has explained exactly the same amount of variation in the Resistance Ratio (response). This is a direct consequence of having perfect balance of all treatments in all blocks within the design. The failure to recognize the blocking factor of Location in this instance means that the residuals have greater variation in this ANOVA than those in the correct one above. The consequence of this is that the F test for Insecticide is now showing no significant differences among any of the treatment means.

Exhibit 3.2 Residual analysis for the two-way analysis of variance model for the Mosquito data.

> par(mfrow = c(2, 2))
> plot(Mosquito.aov)

[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Constant Leverage: Residuals vs Factor Levels.]
Blocking has therefore been shown to improve the ability to find the significant differences among the treatment means. The plot() command can be used to obtain the residual analysis from the preferred two-way analysis of variance model suitable for the randomized complete block design, and is shown in Exhibit 3.2. The Residuals vs Fitted values plot (top left) shows that our model has some flaws. Observations with low fitted values have positive residuals, while observations with mid-range fitted values have more negative residuals. Correction for this is left as an exercise. We can use R to find the actual treatment means in two different ways. The model.tables() command can provide a list of effects for all factors, or tabulate the factor means. By default, R will provide the output for the blocking effect, which will seldom be necessary for reports. It is printed here for completeness. Note that the third command is the author's preferred option for this scenario as it provides the most useful summary information.
> model.tables(Mosquito.aov)
Tables of effects

 Location 
Location
Anguilla  Antigua Dominica   Guyana  Jamaica  StLucia Suriname 
 -1.1686   1.3314   0.6914  -1.4686   0.2514   1.8314  -1.4686 

 Insecticide 
Insecticide
    Temephos    Malathion Fenitrothion     Fenthion Chlorpyrifos 
      1.6829      -1.0457      -1.1743      -0.0171       0.5543 

> model.tables(Mosquito.aov, type = "means")
Tables of means
Grand mean
3.289 

 Location 
Location
Anguilla  Antigua Dominica   Guyana  Jamaica  StLucia Suriname 
    2.12     4.62     3.98     1.82     3.54     5.12     1.82 

 Insecticide 
Insecticide
    Temephos    Malathion Fenitrothion     Fenthion Chlorpyrifos 
       4.971        2.243        2.114        3.271        3.843 

> model.tables(Mosquito.aov, type = "means", se = TRUE, cterms = "Insecticide")
Tables of means
Grand mean
3.289 

 Insecticide 
Insecticide
    Temephos    Malathion Fenitrothion     Fenthion Chlorpyrifos 
       4.971        2.243        2.114        3.271        3.843 

Standard errors for differences of means
        Insecticide
             0.9921
replic.           7

We now find the simultaneous confidence intervals for the differences among the resistance ratios for the five insecticides using the ANOVA model fitted previously in conjunction with the TukeyHSD() command. This generates Exhibit 3.3; using the plot() command then generates the graphical representation of these simultaneous confidence intervals shown in Exhibit 3.4. It appears that there is only scant evidence for a difference between Temephos and both Fenitrothion and Malathion, with the latter pair having almost identical resistance ratios.

Exhibit 3.3 R code and resulting multiple comparison results for the Mosquito data.

> Mosquito.hsd = TukeyHSD(Mosquito.aov, "Insecticide")
> Mosquito.hsd
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = RR ~ Location + Insecticide, data = Mosquito)

$Insecticide
                             diff    lwr     upr  p adj
Malathion-Temephos        -2.7286 -5.651 0.19415 0.0754
Fenitrothion-Temephos     -2.8571 -5.780 0.06558 0.0576
Fenthion-Temephos         -1.7000 -4.623 1.22272 0.4449
Chlorpyrifos-Temephos     -1.1286 -4.051 1.79415 0.7853
Fenitrothion-Malathion    -0.1286 -3.051 2.79415 0.9999
Fenthion-Malathion         1.0286 -1.894 3.95129 0.8358
Chlorpyrifos-Malathion     1.6000 -1.323 4.52272 0.5040
Fenthion-Fenitrothion      1.1571 -1.766 4.07986 0.7698
Chlorpyrifos-Fenitrothion  1.7286 -1.194 4.65129 0.4285
Chlorpyrifos-Fenthion      0.5714 -2.351 3.49415 0.9774

Exhibit 3.4 Graphical presentation of multiple comparison results for the Mosquito data.
> plot(Mosquito.hsd)

[Plot of the 95% family-wise confidence intervals for the differences in mean levels of Insecticide.]

3.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 3.1: Repeat the analysis of the Mosquito data after finding a suitable transformation of the response data to stabilise the variance of residuals from the model. What are your conclusions about the effectiveness of the five insecticides? In the DRUGS package, this data set is called Mosquito and can be obtained using
> data(Mosquito, package = "DRUGS")

Exercise 3.2: Snee (1985) used the data in the file Chickens.csv in a paper on graphical displays for randomized block experiments with three treatments. The response data are the average weight of birds (in pounds), with each group of chickens being blocked on the basis of physical proximity in the birdhouse. We are interested in knowing if the addition of a drug to the feeding regime has value for the chickens' weight, and if so, what dose of the drug is appropriate? In the DRUGS package, this data set is called Chickens and can be obtained using
> data(Chickens, package = "DRUGS")

Exercise 3.3: Legube et al. (2002), in an article "Bromate Surveys in French Drinking Waterworks", presented measurements of bromine concentrations (in µg/L) at several waterworks. The measurements made at 15 different times at each of four waterworks are presented in the file WaterQuality.csv. Of interest is whether the bromine concentrations vary among waterworks. After constructing the appropriate ANOVA table, determine which pairs of waterworks have different bromine concentrations at the α = 0.05 level of significance. Could these data have been (correctly) analyzed with a one-way ANOVA by ignoring the time factor and using 15 observations for each of the four waterworks? In the DRUGS package, this data set is called WaterQuality and can be obtained using
> data(WaterQuality, package = "DRUGS")

Chapter 4 Latin Square Designs and the Analysis of Variance: Blisters in Rabbit Skins after Inoculation
An original chapter
written by

A. Jonathan R. Godfrey1

4.1 Introduction

The Latin square design is most commonly used as a means of analyzing data where two blocking factors are to be considered with a single treatment factor. In this context, it is an extension of the randomized complete block design covered in Chapter 3. Our data set is of this ilk. The data set we look at in this chapter is one in which it could be argued that at least two of the three factors being investigated are treatment factors. Bacharach et al. (1940) reported their findings after six rabbits had been administered a series of inoculations. Each rabbit was given the inoculations in a specific order (the treatment of interest) and each inoculation was given in a different position on the rabbits' backs. There are six rabbits, six positions, and six levels of the treatment factor; the response variable of interest is the size of the resulting blisters (in square centimetres). The data are shown in Exhibit 4.1.
1

Jonathan is a lecturer in the Institute of Fundamental Sciences.

Exhibit 4.1 Measurements of area of blister (cm²) following inoculation of diffusing factor into rabbit skins. Letters denote the order of inoculation.

                                Subject
 Position      1        2        3        4        5        6
    1        C 7.9    E 8.7    D 7.4    A 7.4    F 7.1    B 8.2
    2        D 6.1    B 8.2    F 7.7    E 7.1    C 8.1    A 5.9
    3        A 7.5    C 8.1    E 6.0    F 6.4    B 6.2    D 7.5
    4        F 6.9    A 8.5    C 6.8    B 7.7    D 8.5    E 8.5
    5        B 6.7    D 9.9    A 7.3    C 6.4    E 6.4    F 7.3
    6        E 7.3    F 8.3    B 7.3    D 5.8    A 6.4    C 7.7

4.2 Analysis of Latin square designs

A Latin square design is a design where two blocking factors are in effect and the levels of a treatment factor are spread out over the various levels so that each treatment level appears once for each level of each of the two blocking factors. This design is most obviously applied in agricultural settings, as the two blocking factors are often labelled the row and column factors. A trend may exist over the rows, the columns, or both. Often when we observe the data for a Latin square designed experiment, it is arranged in the way the data were physically laid out. Obviously, this is not the way it is handled in the analysis, however. The ability to apply a Latin square design by following instructions marked out on paper makes this a very practical design that should be simple to follow by people collecting data while not under the direct supervision of the experiment's designer.

4.2.1 Identifying a Latin square experiment

It is not so easy to spot a Latin square design applied in more modern experiments. There is clearly a need for there to be three factors and a response variable in the data supplied for analysis, but this will be more complicated if there are missing data, or more commonly, the square aspect of the design is not completed. In a set of data arising from a complete Latin square we will have:

1. A response variable.
2. A treatment factor testing t treatments.
3. Two blocking factors, each having t levels.
4. A total of n = t² observations.

Replication is not possible in a standard Latin square experiment, as any attempt to augment the design for greater numbers of observations will change the number of levels for the row or column blocking factors.

It is important to note that even if the experimental unit is defined by two blocking factors that are physical, the area itself is not necessarily a square. That is, a rectangular area divided up according to a Latin square would have equally sized rectangular experimental units. If one of the blocking factors is based on time while the other is based on something physical, we may need to consider the design as a crossover experiment and refer to Chapter 11 which deals with these experiments.
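
A minimal sketch of checking these conditions is to cross-tabulate the factors; this assumes the Rabbit data introduced in Section 4.3.2, and every cell of each two-way table should contain a 1.

# Sketch: checking the Latin square structure of the Rabbit data.
data(Rabbit, package = "DRUGS")
with(Rabbit, table(Subject, Position))   # each subject-position combination used once
with(Rabbit, table(Subject, Order))      # each order level appears once per subject
with(Rabbit, table(Position, Order))     # each order level appears once per position
nrow(Rabbit)                             # n = t^2 = 36 observations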

4.2.2 Randomized complete Latin square experiments

We need to do more randomization in the Latin square experimental design than we have done for the designs introduced in preceding chapters. Even though we may well work with specific labels for experimental treatment levels, we will apply the levels of the treatment factor to those labels using randomization. We must, however, also consider the way the levels are laid out in the Latin square. The following arrays are both Latin squares by definition.

    Arrangement 1        Arrangement 2
      A B C D              A D B C
      D A B C              D C A B
      C D A B              B A C D
      B C D A              C B D A

Arrangement 1 looks very regular and patterned while Arrangement 2 looks less so. The fact is that the second array is just the first with two alterations. The top row of the first was moved to become the third line, then the first two columns exchanged positions. (It can be shown that the order of operating on rows and columns is not at all material.) The issue we must address is what the two blocking factors are protecting us from. If a trend exists over either or both blocking factors we must be sure that randomly assigning levels of the treatment factor does not place us at risk of an outside influence being confounded with the results attributed to the treatment factor. Consider what would happen if the two arrangements above were to be laid out on a hillside, where the upper right hand corner is at the top of the hill and the bottom left corner is near the bottom of the hill. Under Arrangement 1, Treatment D will be at the top of the hill once, and at a middle altitude on the other three occasions it is tested; in contrast, Treatment B is at the bottom of the hill once and around the middle altitude three times. If an outside influence depends on altitude, e.g., the amount of water in the soil, then there will be a difference in the benefits of being Treatment D or B depending on the value of the water in the soil. Under Arrangement 2, this problem still exists but is significantly reduced in its potential impact.

There are actually four different constructions of the 4×4 Latin square design (Cochran and Cox, 1957). These four arrangements cannot be created by moving rows and/or columns, but require an interchange of parts of rows and columns. The four possible constructions are shown in Exhibit 4.2 to illustrate this point. Notice the minimal change required to create the last Latin square from the third, where only the middle four cells need to be rotated to create the new design.

Exhibit 4.2 The four possible 4×4 Latin square designs

  A B C D    A B C D    A B C D    A B C D
  B A D C    B C D A    B D A C    B A D C
  C D B A    C D A B    C A D B    C D A B
  D C A B    D A B C    D C B A    D C B A

According to Cochran and Cox (1957), we should randomly choose among the four options presented in Exhibit 4.2, and then randomly rearrange the rows and columns, before randomly assigning the levels of the treatment factor to the letter codes in the resulting square. These squares are arranged in standard order, as the first row and column is in order. There are 144 possible permutations of these squares so there is a total of 576 available 4×4 Latin squares to choose among. The process of selecting a starting square from all those possible is more difficult as the size of the square increases. There are 56 standard 5×5 squares, 9408 6×6 squares, and nearly 17 million 7×7 squares. Fisher and Yates (1957) document all possible squares up to 6×6, but only present a sample of larger squares.

4.2.3 The analysis of variance model for Latin square designs

The analysis of variance table for the simple Latin square design is presented in Exhibit 4.3. Note that only the main effects for the two blocking factors and the treatment factor are given. Under a Latin square design, no interaction effects are estimable as there are insufficient degrees of freedom to do so. We therefore require the assumption that the two blocking factors are independent of the treatment factor in order to apply this design in practice.

Exhibit 4.3 ANOVA table for Latin square design testing 4 treatment levels

 Source                   df
 Row blocking factor      t − 1 = 3
 Column blocking factor   t − 1 = 3
 Treatment factor         t − 1 = 3
 Error                    (t − 1)(t − 2) = 6
 Total                    t² − 1 = 15

The model for the Latin square design is therefore

$$y_{ijk} = \mu + \rho_i + \gamma_j + \tau_k + \epsilon_{ijk} \qquad (4.1)$$

where the $\rho_i$ and $\gamma_j$ are the row and column blocking effects respectively, and $\tau_k$ is the effect for the k-th level of the treatment factor. This model is a very slight extension of the model for the randomized complete block design given in Equation 3.1. The discussion of that model and the need for the assumptions of the linear model to be met, as well as the need to test all hypotheses, apply for the Latin square design with equal relevance. Note also that we do not need to be fussy about the ordering of the two blocking factors in the model as they are orthogonal to one another.

4.2.4 Estimation of means and their standard errors

As the Latin square design is completely balanced, and all factors are orthogonal to one another, we can use the simple (observed) treatment means as those to be reported. The standard errors of these means and, more importantly, the standard error of the difference between levels of the treatment factor, follow from those given in Chapter 3 where only one blocking factor was considered. The standard error of a treatment mean is easily seen to be $\sqrt{MSE/t}$ as each level of the treatment factor is tested t times. The standard error of the difference between two means is found using Equation 2.5, which reduces to $\sqrt{2\,MSE/t}$ for the Latin square design testing t levels of a treatment factor.
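
A minimal sketch of these calculations, assuming the model Rabbit.aov1 fitted in Section 4.3.2 (where t = 6):

# Sketch: standard errors for treatment means in a 6x6 Latin square,
# using the residual mean square of the full model from Section 4.3.2.
t = 6
MSE = deviance(Rabbit.aov1) / df.residual(Rabbit.aov1)
sem = sqrt(MSE / t)        # standard error of a single treatment mean
sed = sqrt(2 * MSE / t)    # standard error of the difference of two means
c(sem = sem, sed = sed)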

4.2.5 Extensions to the simple Latin square design

A major disadvantage of the Latin square design is the need for the number of rows, columns and treatment levels to be equal. If an experiment can benefit from the Latin square idea but a different number of rows is possible, we could either use two or more squares, or alternatively, use an incomplete Latin square design. Incomplete designs are considered further in Chapter 6, but Exhibit 4.4 shows how two (s = 2) adjoining Latin squares would be analyzed. Note that the fact that the two Latin squares are adjoined does matter. If the two squares are independent, then the analysis is of the form described in Exhibit 4.5. As a final illustration of the ways in which the Latin square has been applied in nonstandard fashion, consider the layout in Exhibit 4.6, which was presented by Cochran and Cox (1957). It is based on a 12×12 Latin square, but pairs of rows have been given the same value of the row factor.

Exhibit 4.4 ANOVA table for two adjoining Latin squares testing 4 treatment levels

 Source                   df
 Squares                  s − 1 = 1
 Rows within squares      s(t − 1) = 6
   Total Row effects      st − 1 = 7
 Columns                  t − 1 = 3
 Treatments               t − 1 = 3
 Error                    18
 Total                    st² − 1 = 31

Exhibit 4.5 ANOVA table for two independent Latin squares testing 4 treatment levels

 Source                   df
 Squares                  s − 1 = 1
 Rows within squares      s(t − 1) = 6
 Columns within squares   s(t − 1) = 6
 Treatments               t − 1 = 3
 Error                    s(t − 1)(t − 2) + (s − 1)(t − 1) = 15
 Total                    st² − 1 = 31

Exhibit 4.6 Layout of an experiment testing 12 toxins applied to cats over a 12 day period

                                           Day
 Time   Observer   1    2    3    4    5    6    7    8    9   10   11   12
 10:30     1      I,K  J,G  B,J  L,H  H,I  G,B  F,L  K,C  D,E  E,F  A,D  C,A
           2      B,E  L,D  G,F  C,G  D,J  J,K  K,A  E,L  H,C  A,I  F,B  I,H
           3      C,F  K,H  A,K  B,E  F,G  L,C  I,D  D,B  G,A  H,L  J,I  E,J
 2:30      1      J,D  C,F  E,I  K,A  A,L  I,E  H,C  F,G  B,J  G,B  L,H  D,K
           2      A,H  B,E  C,L  D,J  E,C  F,A  G,B  H,I  I,K  J,D  K,G  L,F
           3      G,L  I,A  D,H  F,I  K,B  H,D  J,E  A,J  L,F  C,K  E,C  B,G


4.3 Analysis using R

4.3.1 Analysis of the Asthma data

We obtain the Asthma data and confirm its structure using:

> data(Asthma, package = "DRUGS")
> str(Asthma)
'data.frame':   17 obs. of  7 variables:
 $ Group  : Factor w/ 2 levels "AB","BA": 1 1 1 1 1 1 1 1 2 2 ...
 $ RunIn  : num  1.09 1.38 2.27 1.34 1.31 0.96 0.66 1.69 1.74 2.41 ...
 $ Treat1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 2 2 ...
 $ Period1: num  1.28 1.6 2.46 1.41 1.4 1.12 0.9 2.41 3.06 2.68 ...
 $ WashOut: num  1.24 1.9 2.19 1.47 0.85 1.12 0.78 1.9 1.54 2.13 ...
 $ Treat2 : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 1 1 ...
 $ Period2: num  1.33 2.21 2.43 1.81 0.85 1.2 0.9 2.79 1.38 2.1 ...

4.3.2 Analysis of the Rabbit data

We obtain the Rabbit data and confirm its structure using:

> data(Rabbit, package = "DRUGS")
> str(Rabbit)
'data.frame':   36 obs. of  4 variables:
 $ Area    : num  7.9 6.1 7.5 6.9 6.7 7.3 8.7 8.2 8.1 8.5 ...
 $ Subject : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ Position: Factor w/ 6 levels "1","2","3","4",..: 1 2 3 4 5 6 1 2 3 4 ...
 $ Order   : Factor w/ 6 levels "1","2","3","4",..: 3 4 1 6 2 5 5 2 3 1 ...

To fit the straightforward analysis of variance model, with all three main effects, we use the aov() command and the associated summary() method.
> Rabbit.aov1 = aov(Area ~ Subject + Position + Order, data = Rabbit)
> summary(Rabbit.aov1)
            Df Sum Sq Mean Sq F value Pr(>F)  
Subject      5  12.83   2.567    3.91  0.012 *
Position     5   3.83   0.767    1.17  0.359  
Order        5   0.56   0.113    0.17  0.970  
Residuals   20  13.13   0.656                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see from the ANOVA table that the only factor that has a significant impact is the Subject. In some circumstances we might think to reduce the model so that only the significant factors remain.
> Rabbit.aov2 = aov(Area ~ Subject, data = Rabbit)
> summary(Rabbit.aov2)
            Df Sum Sq Mean Sq F value Pr(>F)   
Subject      5   12.8   2.567    4.39  0.004 **
Residuals   30   17.5   0.584                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We should ask ourselves which of these two ANOVA tables is more useful for reporting purposes. The first table shows the lack of significance of two of the three factors. As the Latin square design is balanced and orthogonal, the means for the subjects are the same for both models. The only difference we might concern ourselves with at this point is the estimation of the standard error of the difference in the means. A comparison of the MSE values for the two models shows that the change in MSE from the full Latin square model to the model for a one-way classification is quite minimal. Unusually, however, the more complicated model results in a higher MSE than the reduced model. If we were actually concerned with publishing the results for the differences among the rabbits under study, we would probably use the MSE from the reduced model to estimate the standard error of the difference here; this is because we can probably argue that the rabbits existed before any changes in the position or order of the inoculations were considered.
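
A minimal sketch of that comparison, using the two models fitted above and subject means based on six observations each:

# Sketch: compare the error variances and s.e.d. values of the two models.
mse1 = deviance(Rabbit.aov1) / df.residual(Rabbit.aov1)   # full Latin square model
mse2 = deviance(Rabbit.aov2) / df.residual(Rabbit.aov2)   # subjects-only model
data.frame(model = c("Latin square", "Subjects only"),
           MSE   = c(mse1, mse2),
           sed   = sqrt(2 * c(mse1, mse2) / 6))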

4.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 4.1: An experiment was conducted to assess the potency of various constituents of orchard sprays in repelling honeybees using an 8×8 Latin square design. The help page for this data set states that individual cells of dry comb were filled with measured amounts of lime sulphur emulsion in sucrose solution. Seven different concentrations of lime sulphur ranging from a concentration of 1/100 to 1/1,562,500 in successive factors of 1/5 were used as well as a solution containing no lime sulphur. The responses for the different solutions were obtained by releasing 100 bees into the chamber for two hours, and then measuring the decrease in volume of the solutions in the various cells. The data can be obtained using

> data(OrchardSprays)

Note that the row and column variables are position references within the chamber into which the bees were released and are therefore blocking factors of some description. Even if there is no real blocking to speak of, the layout of the treatments has been constrained, so we must incorporate the row and column effects in our analysis. Investigate this data set and see which levels of the treatment are in fact different to one another.

Exercise 4.2: Prove that the third main effect is totally confounded with the interaction of the other two main effects in the Latin square design. This will then suggest the row and column variables are blocking factors, and the design is in fact a logical extension of the randomized complete block design.

Exercise 4.3: Drowsiness is a side effect of antihistamines. Hedges et al. (1971) used a Latin square design as the basis of an experiment to compare the effects of a new drug (Meclastine), a placebo, and an established drug (Promethazine) on nine subjects. Three 3×3 Latin squares were combined to form the data set, which uses measurements of the critical flicker frequency (a consequence of the effect of the drug on the central nervous system) six hours after the drugs were administered. The data can be found in the DRUGS package and obtained using:

> data(Flicker, package = "DRUGS")

Determine if the new drug is any different at reducing drowsiness than the other two considered in this experiment. Use an appropriate multiple comparison procedure as part of your analysis.

Exercise 4.4: Create a data set that is similar to that given in Exhibit 4.6. It need not have the exact same placement of the treatments given there but should conform to the Latin square principle that is followed. Generate a random response variable and then analyze this random variable. It is not important what factors are significant, but are the degrees of freedom associated with your design what you expect?

4.5 Some R hints

Generating a Latin square design can be done fairly quickly, although some fancy footwork might be used along the way. We know that the rows and columns must all include the full set of treatments, but the exact ordering can be allowed to vary. So let's first make a matrix of the right kind.
> Start = matrix(3, nrow = 5, ncol = 5)
> Start = Start + row(Start) + col(Start)
> Start = Start%%5
> Start = Start + 1
> Start = matrix(LETTERS[Start], nrow = 5)
> Start
     [,1] [,2] [,3] [,4] [,5]
[1,] "A"  "B"  "C"  "D"  "E" 
[2,] "B"  "C"  "D"  "E"  "A" 
[3,] "C"  "D"  "E"  "A"  "B" 
[4,] "D"  "E"  "A"  "B"  "C" 
[5,] "E"  "A"  "B"  "C"  "D" 

This matrix is just one of the possible starting matrices for the 5×5 Latin square design. We really should choose among the possible options tabulated in Fisher and Yates (1957), but this code is just a starting point after all. To randomly reorder the rows and columns:

> NewOrder = Start[sample(5), sample(5)]
> NewOrder
     [,1] [,2] [,3] [,4] [,5]
[1,] "C"  "A"  "D"  "E"  "B" 
[2,] "E"  "C"  "A"  "B"  "D" 
[3,] "A"  "D"  "B"  "C"  "E" 
[4,] "B"  "E"  "C"  "D"  "A" 
[5,] "D"  "B"  "E"  "A"  "C" 

> Template = data.frame(Treatment = as.vector(NewOrder), Row = as.factor(row(NewOrder)),
+     Col = as.factor(col(NewOrder)))
> head(Template)
  Treatment Row Col
1         C   1   1
2         E   2   1
3         A   3   1
4         B   4   1
5         D   5   1
6         A   1   2

You can see how the above code has been generalised to create any size Latin square by looking at the function in the DRUGS package by typing
> MakeLatinSquare

Chapter 5 Factorial Designs: Treating Hypertension

An original chapter
written by

A. Jonathan R. Godfrey1

5.1 Introduction

There are many experiments that use the principles of a factorial design as an integral part of their experimental design. One major distinction among such experiments is whether there is more than one observation per treatment combination. In situations where there is just one replicate for the set of treatment combinations, we direct the interested reader to texts on designing industrial experiments, such as Box et al. (2005), or to a reference like Milliken and Johnson (1989) who show a range of analysis techniques for what they call nonreplicated experiments. The data set used in this chapter was taken from Everitt (2002) and is a study on the blood pressure of 72 patients who were given one of three drugs. The original experiment was described by Maxwell and Delaney (1990) and is a 3×2×2 factorial experiment. The analysis of this data set in this text serves two purposes; it is useful to see the

analysis of an experiment with three factors, and it is also useful to compare the analysis given here using R with that given in Everitt (2002) which presents the analysis using SAS. Note that for the purposes of brevity, we have not printed the whole data.frame in Exhibit 5.1.
1

Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 5.1 Selected observations from the Hypertension data set.

      Drug  Biofeed   Diet   BP
  1    x    present    no   170
  7    x    present   yes   161
 13    y    present    no   186
 19    y    present   yes   164
 25    z    present    no   180
 60    y    absent    yes   203
 66    z    absent     no   204
 72    z    absent    yes   179

5.2 The model for the Factorial Design

5.2.1 A model for experiments with two factors

A factorial experiment is one which includes two or more factors in a single experiment. Factors are traditionally given letters starting at the beginning of the alphabet and we use subscripts to indicate which level of each factor was applied. In a two-factor experiment, the response y from our experiment arises from applying level i of Factor A and level j of Factor B to a randomly selected experimental unit. For a balanced (equally replicated) experiment with two factors we are able to fit the model

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \qquad (5.1)$$

Note that this model includes an interaction term $(\alpha\beta)_{ij}$, which can only be evaluated if there are two or more replicates per treatment combination. It is also important to note that replicates are meant to be observations from independent experimental units that just so happen to have been given the same treatment combination. Two measurements on a single test of a treatment combination are not replicates. This scenario is considered in Chapters 9 and 10. Also note that a second run of an experiment in order to obtain a second replicate is not pure replication. Pure replication arises when any possible rearrangement of the assignment of treatment combinations to experimental units could be chosen. A second run of an entire experiment leads to a different structure to that which would be obtained if the entire (duplicated) set of runs was randomly assigned to all of the experimental units. Determining the correct analysis for this scenario is left as an exercise. The hypothesis tests for each main effect are of the same form as in the one-way analysis of variance model seen in Chapter 2. The null hypothesis is that all the means for the different levels of each factor are equal. If all treatment combinations have been tested


an equal number of times, we say that the design is balanced. In this case, the two factors are orthogonal to one another, and our analysis does not change if we swap the order of the two factors. The hypothesis tests for the two factors and subsequent comparisons of their means are, however, dependent on the outcome of the hypothesis test for the interaction term. We can independently consider the impact of each treatment factor if and only if the interaction is deemed to be insignificant. The importance of the interaction between the two factors can only be gauged if we have replicated the treatment combinations. A significant interaction will force us to consider the treatment factors in conjunction, because an interaction says that the benefit of changing the level of one treatment is dependent on the level of the other treatment. Explanation of this change in benefit may be seen using an interaction plot, which plots the mean response for all treatment combinations. An example is given in Exhibit 5.3.

As with the one-way analysis of variance, our model for the two-way analysis of variance is based on assumptions. They are the same as for the one-way model, though, so little extra thought is required. These assumptions are:

1. the observations are independent of each other;
2. the observations in each group arise from a normal population;
3. the observations in each group arise from populations with equal variance.

While the first of these assumptions should be considered in the planning phase of a designed experiment, it is likely to prove difficult to prove the other assumptions have been met. In most factorial experiments, we will have only two or three replicates per treatment combination and this will make assessment of normality and equal variances difficult; however, the tests on our residuals will be strongly indicative so all is not lost.

In this chapter, we assume that the experiment is completely balanced; that is, all treatment combinations are replicated r times. If there are a levels of Factor A and b levels of Factor B, then there are ab different treatment combinations randomly assigned to the abr experimental units. The analysis of variance table given in Exhibit 5.2 shows the sources of variation, their associated degrees of freedom, sums of squares and other details leading to the three hypothesis tests. The total sum of squares $SS_T$ for the data from a replicated two-way factorial experiment is

$$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r} (y_{ijk} - \bar{y})^2 \qquad (5.2)$$

where $\bar{y}$ is the grand mean of all abr observations. It is also the sum of the sums of squares attributable to the two main effects, their interaction effect, and the random error; that is,

$$SS_T = (SS_A + SS_B + SS_{AB}) + SS_E \qquad (5.3)$$

Note that the bracketed terms in this expression are sometimes combined to show the overall value of the model as a whole, as together they are the explained portion of the total variation in the response variable, whereas the error term is not explained and should not be explainable as it is supposed to be random.

Exhibit 5.2 The components of the Analysis of Variance (ANOVA) table for an experiment with two treatment factors.

 Source       df              SS       MS                                 F
 Factor A     a − 1           SS_A     MS_A = SS_A/(a − 1)                MS_A/MS_E
 Factor B     b − 1           SS_B     MS_B = SS_B/(b − 1)                MS_B/MS_E
 Interaction  (a − 1)(b − 1)  SS_AB    MS_AB = SS_AB/((a − 1)(b − 1))     MS_AB/MS_E
 Error        ab(r − 1)       SS_E     MS_E = SS_E/(ab(r − 1))
 Total        abr − 1         SS_T

Assuming the design is balanced, where r replicates have been applied to all combinations of the a levels of Factor A and b levels of Factor B, the sums of squares attributable to the two treatment factors and their interaction are
$$SS_A = br \sum_{i=1}^{a} (\bar{y}_{i\cdot} - \bar{y})^2 \qquad (5.4)$$

$$SS_B = ar \sum_{j=1}^{b} (\bar{y}_{\cdot j} - \bar{y})^2 \qquad (5.5)$$

$$SS_{AB} = r \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{y}_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y})^2 \qquad (5.6)$$

You could rearrange Equation 5.3 but it is almost as easy to calculate the sum of squares for the error term directly using
$$SS_E = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{r} (y_{ijk} - \bar{y}_{ij})^2 \qquad (5.7)$$
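
As a minimal sketch, the decomposition in Equations 5.2 to 5.7 can be checked numerically on a small simulated balanced data set (the data below are made up purely for illustration); in a balanced design the directly computed sums of squares agree with the sequential sums of squares reported by aov().

# Sketch: verify Equations 5.4-5.7 on a simulated balanced two-factor design
# (a = 3, b = 2, r = 2); the response is random and purely illustrative.
set.seed(123)
dat = expand.grid(A = factor(1:3), B = factor(1:2), Rep = 1:2)
dat$y = rnorm(nrow(dat), mean = 10)

grand  = mean(dat$y)
ybarA  = tapply(dat$y, dat$A, mean)                # Factor A means
ybarB  = tapply(dat$y, dat$B, mean)                # Factor B means
ybarAB = tapply(dat$y, list(dat$A, dat$B), mean)   # cell means

a = nlevels(dat$A); b = nlevels(dat$B); r = 2
SSA  = b * r * sum((ybarA - grand)^2)                          # Equation 5.4
SSB  = a * r * sum((ybarB - grand)^2)                          # Equation 5.5
SSAB = r * sum((ybarAB - outer(ybarA, ybarB, "+") + grand)^2)  # Equation 5.6
SSE  = sum((dat$y - ybarAB[cbind(dat$A, dat$B)])^2)            # Equation 5.7

c(SSA = SSA, SSB = SSB, SSAB = SSAB, SSE = SSE)
summary(aov(y ~ A * B, data = dat))   # should report the same sums of squares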

Each of these sums of squares is divided by the relevant number of degrees of freedom for the term in the model to give the mean square for each factor. Each main effect has one less degree of freedom than the number of levels for that factor; the interaction term has degrees of freedom equal to the product of the degrees of freedom assigned to the corresponding main effects. In total, the sum of the degrees of freedom assigned to the two main effects and the interaction is equal to the total number of treatment combinations minus one. If all of these rules for the degrees of freedom are met then we can be sure that all of the treatment combinations were tested. If the number of degrees of freedom assigned to the error term in the model is a multiple of the total number of treatment combinations, then we know the experiment was balanced. These checks are well worth doing whenever we see an analysis of variance as they can often show up mistakes or unexpected data structures. If the full set of treatment combinations is not tested then we will need to follow the approach discussed in Chapter 6.

As with the analysis of variance tables given in previous chapters, the most crucial element for the success of the hypothesis testing is the mean-square error ($MS_E$) term, which is the variance of the residuals for the appropriate model for our experiment. This quantity should be attributable to nothing more than the random variation that arises from applying treatments to experimental units. A check on the model's validity should therefore be undertaken by considering the residual analysis. The above ANOVA table is what should be reported as it will show all information about the relative importance of the two factors and their interaction. If the interaction term is deemed not to be significant, then (and only then) it could be left out of the model, and the analysis of the two treatment factors can be done independently of one another.

5.2.2 A model for experiments with three factors

All comments given in the previous subsection for two-way factorial experiments are relevant for a three-factor experiment. Our model needs to be embellished to include the third treatment factor and its associated interactions. Assuming there are r replicates, we would then fit the model

$$y_{ijkm} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijkm} \qquad (5.8)$$

Note that we now have three two-way interaction terms and one three-way interaction term on top of our three main effects. Separating out the various effects that the three treatment factors have on our data is made difficult if the three-way interaction (in particular) or some of the two-way interactions are significant. The main difficulty we will have is to present three-dimensional results on two-dimensional paper, whether it be using graphs or tables.

5.2.3 A model for experiments with more than three factors

It is possible to have experiments with more than three treatment factors, but it should be obvious that as the number of levels of each factor increases and the number of treatment factors increases, we need a much greater number of experimental units. Experiments with a greater number of factors seldom include replicates, and in many circumstances each treatment factor is allowed to have only two or three levels. These experiments are more common in industrial process improvement programmes or in screening experiments, where an experiment is set up to test which of a set of treatment factors is the most important. In particular, a common design is the 2^k experiment where k treatment factors are tested on two levels per factor. These experiments are beyond the scope of this chapter.

5.3 Analysis using R

5.3.1 Treating hypertension

It is assumed that the DRUGS package is installed and that you can obtain the 72 blood pressure readings found in the file Hypertension.csv by issuing the command:

> data(Hyper, package = "DRUGS")

We can look at the first six rows using the head() command:

> head(Hyper)
  Drug Biofeed Diet  BP
1    x present   no 170
2    x present   no 175
3    x present   no 165
4    x present   no 180
5    x present   no 160
6    x present   no 158

Recall that we will be able to access specific elements of the Hyper data.frame if we attach() the data.frame. To begin the analysis, it is helpful to look at some summary statistics for each of the cells in the design. Some statistical software is better than R at generating tables suitable for reports quickly. There are many ways in which the summary information for the mean, standard deviation, and count for the observations in each cell of the three-way factorial design could be generated; just one is presented below.
> attach(Hyper)
> Hyper.mean = aggregate(BP, list(Drug, Diet, Biofeed), mean)
> Hyper.sd = aggregate(BP, list(Drug, Diet, Biofeed), sd)
> Hyper.count = aggregate(BP, list(Drug, Diet, Biofeed), length)
> detach(Hyper)
> Hyper.desc = cbind(Hyper.mean, Hyper.sd[, 4], Hyper.count[, 4])
> dimnames(Hyper.desc)[[2]] = c("Drug", "Diet", "Biofeed", "Mean",
+     "StDev", "No.")
> Hyper.desc

   Drug Diet Biofeed Mean  StDev No.
1     x   no  absent  188 10.863   6
2     y   no  absent  200 10.080   6
3     z   no  absent  209 14.353   6
4     x  yes  absent  173  9.798   6
5     y  yes  absent  187 14.014   6
6     z  yes  absent  182 17.111   6
7     x   no present  168  8.602   6
8     y   no present  204 12.681   6
9     z   no present  189 12.617   6
10    x  yes present  169 14.819   6
11    y  yes present  172 10.936   6
12    z  yes present  173 11.662   6

One of the assumptions for any analysis of variance is that observations in each cell come from populations with the same variance. There are various ways in which the homogeneity of variance assumption can be tested. We use Levene's test because we cannot assume that the data come from a normal population. The leveneTest() command is implemented via an additional R package.

> require(car)
> attach(Hyper)
> Hyper.levene = leveneTest(BP, interaction(Drug, Diet, Biofeed))
> detach(Hyper)
> Hyper.levene
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group 11    0.49    0.9
      60               

The null hypothesis for Levene's test (that all variances are equal) cannot be rejected, so we can continue with the analysis of variance.
> Hyper.aov = aov(BP ~ Diet * Drug * Biofeed, data = Hyper)
> summary(Hyper.aov)
                  Df Sum Sq Mean Sq F value  Pr(>F)    
Diet               1   5202    5202   33.20 3.1e-07 ***
Drug               2   3675    1837   11.73 5.0e-05 ***
Biofeed            1   2048    2048   13.07 0.00062 ***
Diet:Drug          2    903     451    2.88 0.06382 .  
Diet:Biofeed       1     32      32    0.20 0.65294    
Drug:Biofeed       2    259     130    0.83 0.44246    
Diet:Drug:Biofeed  2   1075     537    3.43 0.03883 *  
Residuals         60   9400     157                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> model.tables(Hyper.aov, type = "means", se = TRUE)
Tables of means
Grand mean
184.5 

 Diet 
Diet
 no yes 
193 176 

 Drug 
Drug
    x     y     z 
174.5 190.8 188.2 

 Biofeed 
Biofeed
 absent present 
  189.8   179.2 

 Diet:Drug 
     Drug
Diet  x     y     z    
  no  178.0 202.0 199.0
  yes 171.0 179.5 177.5

 Diet:Biofeed 
     Biofeed
Diet  absent present
  no  199.0  187.0  
  yes 180.7  171.3  

 Drug:Biofeed 
    Biofeed
Drug absent present
   x 180.5  168.5  
   y 193.5  188.0  
   z 195.5  181.0  

 Diet:Drug:Biofeed 
, , Biofeed = absent

     Drug
Diet  x   y   z  
  no  188 200 209
  yes 173 187 182

, , Biofeed = present

     Drug
Diet  x   y   z  
  no  168 204 189
  yes 169 172 173

Standard errors for differences of means
         Diet  Drug Biofeed Diet:Drug Diet:Biofeed Drug:Biofeed Diet:Drug:Biofeed
        2.950 3.613   2.950     5.110        4.172        5.110             7.226
replic.    36    24      36        12           18           12                 6


Observe from the ANOVA table that the Diet, Drug, and Biofeed main effects are all significant if we use the 5% level of significance. The significance of the three-way interaction term in this model is a problem for our analysis. It suggests that the advantage of one drug over another is going to be inconsistent for the levels of diet and biofeedback. In complementary fashion, one could state that the benefit of changing from one diet to the other is dependent on the drug being taken as well as the biofeedback regime being followed. We can investigate the structure of the three-way interaction and its impact by plotting some simple graphs and obtaining the various marginal means in tabular form. The interaction plot of diet and biofeedback has been created separately for each drug.
> attach(Hyper)
> for (i in levels(Drug)) {
+     jpeg(filename = paste("drug", i, ".jpg", sep = ""))
+     interaction.plot(x.factor = Biofeed[Drug == i], trace.factor = Diet[Drug == i],
+         response = BP[Drug == i], ylab = "Mean", xlab = "Biofeedback",
+         trace.label = "Diet")
+     title(paste("Interaction plot for Diet and Biofeedback for Drug", i))
+     dev.off()
+ }
> detach(Hyper)

The tapply() command has produced the correct cell means that were used in each of the interaction plots that appear in Exhibit 5.3. Note that the jpeg() command has started sending the graphical output to the file specified. This has two benefits. First, the files are easy to import into a report when saved separately, and second, R will only show the last graphic created. The dev.off() command closed each file. There are ways of placing all graphics created in code like this, but it is difficult to ensure the exact presentation standard you require by doing so. You will find the graphics files used in Exhibit 5.3 have been created in your working directory. The three graphs in this exhibit illustrate the impact of the three-way interaction. We can only comment on the impact of each drug with reference to the choices of biofeedback and diet. The findings given thus far are complicated to deal with due to the evident high order interaction. Everitt (2002) notes that transformations of the data may help, and analyzes the log-transformed observations as follows.
> Hyper.ln.aov = aov(log(BP) ~ Diet * Drug * Biofeed, data = Hyper)
> summary(Hyper.ln.aov)
                  Df Sum Sq Mean Sq F value  Pr(>F)    
Diet               1 0.1496  0.1496   32.33 4.1e-07 ***
Drug               2 0.1071  0.0535   11.57 5.6e-05 ***
Biofeed            1 0.0615  0.0615   13.29 0.00056 ***
Diet:Drug          2 0.0240  0.0120    2.60 0.08297 .  
Diet:Biofeed       1 0.0007  0.0007    0.14 0.70745    
Drug:Biofeed       2 0.0065  0.0032    0.70 0.50103    
Diet:Drug:Biofeed  2 0.0303  0.0151    3.28 0.04468 *  
Residuals         60 0.2775  0.0046                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exhibit 5.3 Interaction plots for diet and biofeedback for each of the three drugs.

[Three interaction plots of mean BP against Biofeedback (absent, present), with separate traces for Diet (no, yes): one panel for each of Drug X, Drug Y and Drug Z.]

Although the results are similar to those for the untransformed observations, the three-way interaction is now only marginally significant. Everitt (2002) suggests that if no substantive explanation of this interaction is forthcoming, it might be preferable to interpret the results in terms of the very significant main effects, and to fit a main-effects-only model to the log-transformed blood pressures. Evaluation of the benefits of following this course of action is left as an exercise.
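
A minimal sketch of that simplified fit is a one-line change to the model formula used above (the object name Hyper.main.aov is arbitrary); evaluating it remains Exercise 5.2.

# Sketch: main-effects-only model for the log-transformed blood pressures.
Hyper.main.aov = aov(log(BP) ~ Diet + Drug + Biofeed, data = Hyper)
summary(Hyper.main.aov)
model.tables(Hyper.main.aov, type = "means", se = TRUE)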

5.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 5.1: Consider an experiment with two factors, one of which has three levels and the other four. Construct two data sets that each have 24 experimental units. Assign the levels of the two factors to experimental units so that in one data set, the 24 runs are entirely randomized and there are two replicates per treatment combination. In the second data set, make sure that all 12 treatment combinations are finished before the second set of 12 experimental units are given a treatment combination. Analyse the two data sets, using random response data and the appropriate model. What extra element is needed to correctly model the second data set?

Exercise 5.2: Investigate the suggestion of Everitt (2002) with respect to fitting a main-effects-only model to the log-transformed blood pressures in the Hyper data set used in this chapter. What impact are the interactions having on the final outcomes? What are the advantages and disadvantages of the simplified model?

Exercise 5.3: Using the Hypertension data given in this chapter, create a lack of balance by randomly deleting observations from the data set; delete 5, 10, and then 15 observations and re-fit the model for the factorial experiment as in the given analysis. What impact do you see? Note that we will look further at this problem in Chapter 6.

5.5 Some R hints

In Section 2.5 we saw how to create a data.frame so that we could test the experimental design proposed for its suitability. This is increasingly crucial as we increase the complexity of our experimental designs. The data structures required for the factorial experiments discussed in this chapter can be created using the commands given in Section 2.5 alone, but there are a few shortcuts that might prove useful. For example, let's say we want to have an experiment laid out that has three factors, two of which are at three levels and one at two levels, and further that there are two replicates for each treatment combination for a total of 36 experimental units. Try:
> Fact3 = expand.grid(A = c(60, 80, 100), B = c(100, 200, 300),
+     C = c("Male", "Female"), Rep = c(1, 2), KEEP.OUT.ATTRS = FALSE)
> str(Fact3)
'data.frame':   36 obs. of  4 variables:
 $ A  : num  60 80 100 60 80 100 60 80 100 60 ...
 $ B  : num  100 100 100 200 200 200 300 300 300 100 ...
 $ C  : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 2 ...
 $ Rep: num  1 1 1 1 1 1 1 1 1 1 ...

You could use the tapply() command to test that the factorial structure has been handled correctly. When you create a variable to indicate the order in which data should be collected, you would create a random list of the numbers 1 to 36. The only point to recognize about this data structure is that there is a strong likelihood that the replicate numbered 2 might be collected before the replicate marked 1, which might seem strange. The benefit of allowing this to occur is that the full randomisation of the experiment is therefore assured. Remember that doing the entire set of treatment combinations once and then repeating the set, perhaps in a different random order, still introduces a blocking factor. You may wish to use the constructed item as a template for the data collector to fill in. If so, you should investigate the write.csv() command. To save the data collector from the perils of wondering why they are collecting data for the second replicate before the first, you might remove this column from the object to be saved in the file. Re-ordering the template to appear in run order might prove useful for some experiments. These checks and additions are sketched below, and a different example of this kind of construction then follows:
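
A minimal sketch of those checks and additions, applied to the Fact3 template created above:

# Sketch: check the factorial structure of Fact3 and add a randomized run order.
with(Fact3, tapply(Rep, list(A, B, C), length))   # 2 in every cell => balanced
Fact3$RunOrder = sample(nrow(Fact3))              # random collection order, 1 to 36
head(Fact3[order(Fact3$RunOrder), ])              # template sorted into run order
# write.csv() could then save the sorted template to a file of your choosing.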
> NewFrame = expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
> NewFrame$StdOrder = 1:8
> NewFrame$RunOrder = sample(8)
> NewFrame[NewFrame$RunOrder, ]
   A  B  C StdOrder RunOrder
3 -1  1 -1        3        4
6  1 -1  1        6        1
4  1  1 -1        4        2
2  1 -1 -1        2        6
8  1  1  1        8        5
1 -1 -1 -1        1        3
7 -1  1  1        7        7
5 -1 -1  1        5        8


This data.frame is suitable for a factorial experiment with three factors at two levels, called a 2^3 factorial experiment. The variable named StdOrder is the original order of the rows created by the expand.grid() command (short for standard order). See what the printout is like without the RunOrder part of the last command. Note that the StdOrder variable is not actually necessary as (in this case) the row labels are printed as well.

Chapter 6 Incomplete Block and Factorial Designs: Improving Listening Skills of Blind Students; and, Fouling Rates in Evaporator Tubes
An original chapter
written by

A. Jonathan R. Godfrey1

6.1 Introduction

Incomplete data are commonplace. There are two main reasons why we might have to analyse a data set that is not perfectly balanced, that is, where not all combinations of factors (blocking and/or treatment) are equally replicated, or perhaps even tested. First, it just might not be possible to have perfect balance because of constraints on our experimental units. This is commonplace in experiments on animals for example, where the size of the herd, litter, family grouping etc. is beyond our control. Determining how to allocate treatments in these situations is of interest in the planning phase. Second, we may lose experimental units during the data collection phase of the experiment. If there is a chance this will occur, we will see that it is possible to plan the number of replicates so that at least there are response data for all treatment combinations.

The first data set we use in this chapter falls into the first of these scenarios. Steel and Torrie (1981) presented the data we use, which came from PhD thesis work conducted in the early
1

Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 6.1 Allocation of the students to treatments

 Preferred              Treatment
 Medium        1     2     3     4    Total
 Braille       5     4     3     2      14
 Print         6     8     8     9      31
 Total        11    12    11    11      45

1970s on the impact of several training regimes for the listening skills of 45 severely visually impaired students from grades 10 to 12 at Governor Morehead School in Raleigh, North Carolina. Rawls (1970) reported on the application of four treatments to the students. It seems that even though all of the students were competent braille readers, many of them preferred to use their residual vision with print rather than using braille. Given the number of treatments and the number of students, it is fairly obvious that the data are not balanced and that the analyses of Chapters 3 and 5 are not sufficient. Exhibit 6.1 shows how many of the students were allocated to the following four treatments, and whether they preferred print or braille.

1. Instruction in listening techniques plus practice listening to selected readings.
2. Same as Treatment 1, but with copies of the selected readings in the student's choice of braille or print to be followed as they listened.
3. Five lessons on efficient listening techniques prior to listening-practice training.
4. A control group.

Even though the preference for print or braille might be considered a blocking factor, it is explicitly linked to the second treatment. An interaction between the two factors should be investigated as the research was aimed at determining the best treatment, given the background of the students. Post-test and pre-test data were collected using the Gilmore Oral Reading Test for accuracy, but we will consider only the post-test data in this chapter.

The second set of data has an incomplete factorial structure; that is, not all treatment combinations were tested. The data presented in Exhibit 6.2 were obtained in a fouling rate trial where the researchers were interested in seeing if there was any significant difference in the fouling rate of milk in evaporator tubes due to different treatment of the tubes. The researchers knew that milk varies from day to day and this is indicated by the variation in Total Solids. It was also suspected that there is an ageing effect, in that trials done in the afternoon using the same milk might foul differently to the same milk used in the morning, even when the same treatment is used in the afternoon.

6.2. MODELS FOR INCOMPLETE BLOCK AND FACTORIAL DESIGNS Exhibit 6.2 Fouling Rates of Evaporator Tubes. Day Total Solids Time 1 2 3 4 5 6 7 13.66 Morning Afternoon 13.72 Morning Afternoon 14.1 Morning Afternoon 13.03 Morning Afternoon 14.12 Morning Afternoon 13.44 Morning Afternoon 14.29 Morning Afternoon Treatment A B A A A B A A A B A A A B Fouling Rate Induction Time 239.88 182.96 234.37 287.82 221.91 201.27 161.51 168.04 270.61 186.83 217.56 248.23 250.78 232.09 0.34 1.08 1.06 0.39 0.55 0.65 0.57 0.42 0.34 0.21 0.17 0.59 0.46 0.23

53

The primary question to be answered: Is there a significant difference between the two treatments on the fouling rates and on the induction times? The secondary question: Was there a significant ageing effect found? This research is being written up for presentation in journal articles, but has already contributed towards a conference presentation, see Meas et al. (2009). In this chapter, we will limit ourselves to considering the impact of the time of day (ageing of the milk) and the treatments on the fouling rates in the evaporator tubes.

6.2 Models for incomplete block and factorial designs

6.2.1 Incomplete block designs

When we discussed randomized complete block designs in Chapter 3, we noted the important assumption that our blocks are independent of our treatment factor. Each block in our data set had an observation for each level of the treatment being considered. As long as we can be sure that our plans have protected us from breaching this assumption, we can use an incomplete block design when we need to. Planning an incomplete block design requires quite a lot of thought but follows several principles.

First, we cannot ignore the need to randomize the allocation of treatments to units, even if we do know there are constraints on the overall allocation.

Second, we must plan our experiment knowing whether we will have an equal number of treatments in each block. If the blocks cannot hold every treatment, we will have some partial confounding of treatment effects with the block effects. Say, for example, we have a constant number of units in the blocks and that this number is one less than the number of treatments to be included in the experiment. To ensure the number of comparisons between treatment pairs is held constant, we would need to choose which treatment is missed out of each block. This is probably the simplest incomplete block design to deal with because we will have confounded the absence of the treatment with the corresponding block effect, and all pairs of treatments will be tested in the same number of blocks. This design is therefore called a balanced incomplete block design. See more on balanced incomplete block (BIB) designs in Subsection 6.2.2 below.

If, on the other hand, the number of units (and therefore treatments) per block will vary, we will need to concern ourselves with the number of comparisons that are directly possible for all pairs of treatments. This means we need to know how often a pair of treatments occurs in the same block. The concurrence matrix is the tool that is often used to show how often pairs of treatments are tested in the same block. There are criteria for establishing the optimality of the allocation of treatments to blocks but this is beyond the scope of this presentation. Let us now just think that we need to link the number of times a direct comparison is made with the importance of understanding that comparison. A common outcome is that all comparisons are equally important so the number of times every pair of treatments can be directly compared is kept as constant as possible.
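
As a small illustration of the idea (a sketch using invented block and treatment labels, not a data set from the DRUGS package), the concurrence matrix for a design with four treatments arranged in four blocks of three units can be computed in R from the treatment-by-block incidence matrix; the off-diagonal entries count how often each pair of treatments appears together in a block, and the diagonal entries give the replication of each treatment.

> block = rep(1:4, each = 3)                 # four blocks of three units
> trt = c(1, 2, 3,  1, 2, 4,  1, 3, 4,  2, 3, 4)
> N = table(trt, block)                      # incidence matrix
> N %*% t(N)                                 # concurrence matrix

For this particular allocation every off-diagonal entry is 2, so every pair of treatments is compared directly in the same number of blocks.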

When it is time to think about the hypothesis testing etc. in the analysis of the incomplete block design, we refer back to the discussion on the analysis and presentation of results given in Section 3.2 for the randomised complete block (RCB) design. In particular, it was argued that the factor representing blocks would be included in a model even if its hypothesis test deemed it to be insignificant, because it was a constraint on the experimental design and should use up degrees of freedom as a consequence. The order of terms to be included then has block factor(s) first, then treatment factors following. The difference now is that we must recognize that our findings on the treatments are dependent in some way on the blocks to which the various treatments were applied. The ability to assume that the impact block selection has on experimental findings is due purely to chance is, as a consequence, crucial.



6.2.2

Balanced incomplete block designs

If we intend fitting the model first given in Equation 3.1 for the randomized complete block design,
$$ y_{ijk} = \mu + \beta_i + \tau_j + \epsilon_{ijk} $$
to data from a balanced incomplete block design, with $t$ treatments replicated $r$ times across the $b$ blocks, each having $k$ experimental units ($k < t$), then the SS for blocks are:
$$ SSB = k \sum_{i=1}^{b} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 \qquad (6.1) $$
The SS for treatments fitted after blocks have been included in the model are:
$$ SST = \frac{k(t-1)\sum_{j=1}^{t} Q_j^2}{rt(k-1)} \qquad (6.2) $$
where $Q_j$ is the adjusted treatment total, based on the blocks in which each treatment has appeared,
$$ Q_j = \sum_{i=1}^{b} n_{ij} y_{ij} - \frac{1}{k} B_j \qquad (6.3) $$
where $B_j$ is the sum, over only the blocks that include the $j$th treatment, of the block totals,
$$ B_j = \sum_{i=1}^{b} n_{ij} \, y_{i\cdot}, \qquad \text{where } y_{i\cdot} = \sum_{j=1}^{t} n_{ij} y_{ij} \text{ is the total for the } i\text{th block.} \qquad (6.4) $$
In these last two equations, we use $n_{ij}$ as an indicator variable to show if there is a corresponding $y_{ij}$ observation; a zero implies the $j$th treatment did not occur in the $i$th block and a one shows it did. The easiest way to determine the Error SS for the BIB is to calculate the total sum of squares and subtract the block and treatment SS values:
$$ SSE = \sum_{i=1}^{b} \sum_{j=1}^{t} n_{ij} (y_{ij} - \bar{y}_{\cdot\cdot})^2 - SSB - SST \qquad (6.5) $$
These sums of squares calculations contribute to the analysis of variance table for the BIB, which, as for the RCB design seen earlier, will have the blocking factor first, then the treatment factor, and an error term. The degrees of freedom for blocks and treatments are one less than the number of each, meaning that the error degrees of freedom will be $rt - b - t + 1$. Mean squares for treatment and error can be found by dividing the SST and SSE by the corresponding degrees of freedom, and the F-test constructed to test the null hypothesis that all treatment means are equal.

The adjusted treatment means $\bar{y}_j$ for a BIB are found using:
$$ \bar{y}_j = \bar{y}_{\cdot\cdot} + \frac{k(t-1)Q_j}{rt(k-1)} \qquad (6.6) $$
The associated standard error of these adjusted treatment means $s_{\bar{y}_j}$ is found using:
$$ s_{\bar{y}_j} = \sqrt{\frac{MSE}{rt}\left(1 + \frac{k(t-1)^2}{(k-1)t}\right)} \qquad (6.7) $$
which looks rather more complicated than we need because if we are working with the standard error of the difference between two levels of the treatment, we will prefer to use
$$ \text{s.e.d.} = \sqrt{\frac{2k(t-1)MSE}{rt(k-1)}} \qquad (6.8) $$
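
To see how Equations 6.2 and 6.3 work in practice, the following sketch (invented data, not from the DRUGS package) builds a small BIB with $t = 4$ treatments in $b = 4$ blocks of $k = 3$ units, computes the adjusted treatment totals and SST directly, and checks the result against the sequential ANOVA with blocks fitted first.

> y = c(12, 14, 15,  11, 13, 16,  10, 14, 17,  12, 15, 18)
> Block = factor(rep(1:4, each = 3))
> Trt = factor(c(1, 2, 3,  1, 2, 4,  1, 3, 4,  2, 3, 4))
> t.n = 4; b.n = 4; k.n = 3; r.n = 3
> Tj = tapply(y, Trt, sum)                           # raw treatment totals
> Bj = tapply(ave(y, Block, FUN = sum), Trt, sum)    # block totals summed over blocks containing each treatment
> Qj = Tj - Bj/k.n                                   # Equation 6.3
> SST = k.n * (t.n - 1) * sum(Qj^2)/(r.n * t.n * (k.n - 1))   # Equation 6.2
> SST
> anova(aov(y ~ Block + Trt))                        # the Trt line should agree with SST

The residual line of the ANOVA should also show the $rt - b - t + 1 = 5$ degrees of freedom discussed above.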

6.2.3

Other incomplete block designs

We have not provided the equations for the sums of squares etc. for other incomplete block designs here. The fact that there are a constant number of experimental units in the BIB design makes the equations for the BIB simpler than those that are needed for unequal block size scenarios. Let us just say at this point that the ordering of factors in our modelling of block and treatment factors remained the same for the RCB and BIB designs. There is little reason to think they would change for an unequal block size scenario. Therefore, the approach to investigating the significance of the treatment factor is the same as for the BIB. Calculating the adjusted treatment means however, is not so easy. The question is though, are they important enough to warrant being presented? In an incomplete block design with unequal block sizes, we are likely to have fairly small data sets so presenting the whole data set and the associated ANOVA tables might be all that is required. No hard and fast rules can be applied here as the data scenario will drive the requirements placed on reporting the findings. Perhaps the easiest way to find the adjusted treatment means in these situations is to re-model the situation as a linear model with explicit terms (using indicator variables) for each level of the treatment factor so that regression-like coefficients can be extracted from the model.

6.2.4

A model for experiments with two treatment factors


We can fit the model first introduced as Equation 5.1 in Chapter 5,
$$ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} $$
even when the various treatment combinations are not assessed an equal number of times. The assumptions of the two-way analysis of variance model and the hypothesis tests for a two-factor experiment where data is incomplete include those described for the complete data situation in Chapter 5, although we must make some modifications to our hypothesis testing approach, and think carefully about how we present our experimental findings. Recall that if there are $a$ levels of Factor A and $b$ levels of Factor B, then we expect $(a-1)(b-1)$ degrees of freedom to be assigned to the interaction. If all treatment combinations are tested, there will be $ab-1$ degrees of freedom assigned to the main effects and interaction. If a combination is not tested, the main effects will keep their degrees of freedom but the interaction will lose some. It is always a good idea to check that the degrees of freedom in a model are what you expect, but this is especially true if there are not as many observations in total as you expect. Obviously, the fact that a treatment combination might not be tested at all will impact on the ability to estimate the specific interaction effect for that combination, but this is only relevant if we discover that a significant interaction exists for the treatment combinations for which we do have data. When a treatment combination is not tested, the degrees of freedom assigned to the interaction term in our model are reduced.
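
A quick way to see this bookkeeping is to fit the two-factor model to a small simulated data set in which one treatment combination was never observed; the interaction line in the ANOVA table then has fewer degrees of freedom than the complete-data case would give. (The data below are invented purely for illustration.)

> set.seed(2)
> sim = expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:3)
> sim = subset(sim, !(A == "2" & B == "3"))   # the (2,3) combination is never tested
> sim$y = rnorm(nrow(sim))
> summary(aov(y ~ A * B, data = sim))         # A:B now has 1 df rather than (2-1)(3-1) = 2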

6.2.5

All combinations of two treatments are tested

If the number of replicates changes from treatment combination to treatment combination, we can still estimate a mean response $\bar{y}_{ij}$ for every treatment combination. The contribution that each cell makes to the overall sum of squares will depend on the number of replicates but as long as the experimental units have been assigned randomly there is relatively little to be concerned about. Recall that when we had balanced data, see Chapter 5, we could re-order the factors in our analysis without affecting the judgment we made about the significance of each treatment factor. This is not the case when data are unbalanced. When we do not have perfect balance among the treatment factors, they are no longer orthogonal. This means that the effects of the two factors are no longer independent of one another. A portion of the explained sum of squares in our model is attributable to either treatment factor. In other words, the total of the sum of squares for a pair of factors in a single model will not be the sum of the two SS values found in two separate models. To understand this shared effect, we re-order the terms in our model. The term fitted first by R explains as much of the variation among the observed data as possible (what we would find if it was fitted alone in a model) and the second term in the model just explains what it can from the sum of squares left over from the first. This sum of squares is used to determine whether each



treatment factor is significant. This approach is rather conservative as it means we make our judgment on the smallest contribution each factor could have made in explaining the response variable. The use of what is known as Type I SS, or sequential sums of squares, is order dependent. This means that the SS value for the second (unbalanced) factor in the model is (usually) less than what would appear if the order were reversed. As is the case for R, many statistical software applications use the sequential SS approach. As a consequence, we will need to re-order the factors in our models when using R to fully understand the importance of any factor. Type III SS is the alternative presented by some statistical applications. This is also known as the adjusted sums of squares approach. It presents the contribution of any factor, assuming it to have been added to the model after all other effects have been included. In fact, this means that the p-value for a main effect of a factor is dependent on the SS assigned to any interaction involving that factor. This does not make sense if we follow the notion that the presence of an interaction effect shows that a factor's effect is more than just a main effect. The hierarchy of main and interaction effects must be preserved. Always fit the main effect first, then the interaction effect; if the interaction effect is to be included in the model, then so must the main effect. This is one area where the way R presents a model's ANOVA table is quite smart. All interaction effects are added to the model after the main effects, regardless of the order in which we insert the factors. If we wanted to force an interaction term into the model before a main effect, we would need to create a new variable to trick R into doing what we want! Some software will present both Type I and Type III information. Whenever you encounter unbalanced data, you should make a point of discovering what output the software you use will provide. In summary, it is recommended that you fit a two-way model that includes the interaction term. This means our model will have fitted values that are equal to the observed (raw) cell means $\bar{y}_{ij}$. If the interaction term is not significant, remove it from the model and re-order the terms for the unbalanced factors.
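
The order dependence is easy to demonstrate with a tiny simulated unbalanced data set (again, invented purely for illustration); the sum of squares attributed to each factor changes when the order of the terms is reversed.

> A = factor(c(1, 1, 1, 2, 2, 2, 2, 2))      # unequal replication across the A x B cells
> B = factor(c(1, 2, 2, 1, 1, 1, 2, 2))
> set.seed(1)
> y = rnorm(8, mean = as.numeric(A) + as.numeric(B))
> anova(aov(y ~ A + B))   # SS for B adjusted for A
> anova(aov(y ~ B + A))   # SS for A adjusted for B; compare the two tables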

6.2.6

Reporting our findings

In any balanced experiment, the SS that result will not be any different whether the Type I or Type III approach has been used. Remember that in balanced data situations, the interaction term is orthogonal to its corresponding main effect terms. This makes reporting cell means and/or marginal means quite easy; the choice being only dependent on the significance of the interaction effect. In unbalanced data situations, the interaction term is not orthogonal to either main effect, and is only left in the model if it is significant. If it is thought to be significant,



then the re-ordering of factors according to the Type I SS approach is only relevant for the purposes of reporting p-values; that is, we will not choose to alter the model. If the interaction term is not significant, then we need to remove it from the model in order to test the significance of the two main effects in turn using the Type I SS approach. The added complication for unbalanced data is that, unlike the balanced data case, we cannot consider the two main effects independently even if the interaction is not significant. We should not, for example, consider the marginal means of each factor without reference to the other factor's existence. In reporting the outcome from modelling any unbalanced experiment, it is probably best to report all cell means $\bar{y}_{ij}$, and their standard errors $s_{\bar{y}_{ij}}$, so that the reader can draw their own conclusions. The calculation of marginal (row or column) means assumes that we have estimated all of the cell means with sufficient (and equivalent) accuracy and precision. (This may contribute to the number of replicates allocated to the treatment combinations in the first place.) The adjusted marginal means in an unbalanced two-way scenario would be:
$$ \bar{y}_{i\cdot} = \frac{1}{b} \sum_{j=1}^{b} \bar{y}_{ij} \qquad (6.9) $$
for the levels of Factor A, and
$$ \bar{y}_{\cdot j} = \frac{1}{a} \sum_{i=1}^{a} \bar{y}_{ij} \qquad (6.10) $$
for the levels of Factor B. These adjusted marginal means $\bar{y}_{i\cdot}$ and $\bar{y}_{\cdot j}$ are not equal to the simple marginal means $\bar{y}_{i}$ and $\bar{y}_{j}$ found from the raw data in an unbalanced data scenario. Standard error estimators for these unbiased least square estimators of the means can be found using:
$$ s_{\bar{y}_{i\cdot}} = \sqrt{\frac{MSE}{b^2} \sum_{j=1}^{b} \frac{1}{r_{ij}}} \qquad (6.11) $$
and
$$ s_{\bar{y}_{\cdot j}} = \sqrt{\frac{MSE}{a^2} \sum_{i=1}^{a} \frac{1}{r_{ij}}} \qquad (6.12) $$
where the MSE is that obtained from the model including the interaction term, whether it is significant or not. If the replication in our unbalanced experiment reflects the nature of the population that experimental units have been drawn from, the observed marginal means from our raw data $\bar{y}_{i}$ and $\bar{y}_{j}$ are meaningful for reporting purposes. This is more common in surveys and observational studies however. In reporting findings from a designed experiment, the adjusted marginal means $\bar{y}_{i\cdot}$ and $\bar{y}_{\cdot j}$ and their standard errors should be reported.
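
Written as R code, Equations 6.9 to 6.12 only need the table of cell means, the table of cell replicates and the MSE; the toy numbers below are invented purely to show the pattern of the calculation.

> cellmeans = matrix(c(10, 12, 11, 9, 13, 12), nrow = 2, byrow = TRUE)
> cellcounts = matrix(c(4, 3, 5, 2, 6, 4), nrow = 2, byrow = TRUE)
> MSE = 2.5
> rowMeans(cellmeans)                                    # Equation 6.9
> colMeans(cellmeans)                                    # Equation 6.10
> sqrt(MSE/ncol(cellmeans)^2 * rowSums(1/cellcounts))    # Equation 6.11
> sqrt(MSE/nrow(cellmeans)^2 * colSums(1/cellcounts))    # Equation 6.12

This is exactly the pattern used for the listening data in Section 6.3.1 below.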



6.2.7

Not all combinations of the two treatments were tested

If not all combinations of the treatments were tested, we proceed with the hypothesis testing as we would if each combination had been tested. That is, if the interaction of the two factors is significant then use this model as the basis for drawing conclusions. If the interaction is not significant, then re-order the two factors as per the Type I analysis in a model without the interaction term to evaluate the differences among levels of the two factors. When presenting the experiment's findings, present all cell means and associated standard errors. Interaction plots remain feasible, although there will be points missing on the plot for the missing treatment combinations. You will then need to consider how to determine what means might be useful for each level of the two factors. Equations 6.9 and 6.10 are not appropriate as the row and column averages of the cell means are based on differing sets of the other factor. It might be appropriate to calculate an adjusted treatment mean for a reduced set of the levels of the other factor, but whatever you do when creating the adjusted row and column means, you must be able to justify the way you find these statistics. One alternative to working with this kind of incomplete data is to impute the missing cell means. Imputation is beyond the scope of this text because the majority of research into imputation methodology is suitable for application to survey data where respondents commonly answer most but not all questions. The current author's doctoral research was in the area of imputation of missing data in two-way arrays of experimental data, see Godfrey (2004).

6.3
6.3.1

Analysis using R
Analysis of the listening data

Exhibit 6.1 showed how many students had been allocated to the four treatments and we can clearly see that there is a lack of balance, but that all treatment combinations were tested. Make the data available in the current R session using:
> data(Listening, package = "DRUGS")

We can get the means and their standard errors for the eight treatment combinations using:
> attach(Listening)
> Listening.means = tapply(PostTest, list(Medium, Treatment), mean)
> Listening.sd = tapply(PostTest, list(Medium, Treatment), sd)



> Listening.count = tapply(PostTest, list(Medium, Treatment), length)
> Listening.se = Listening.sd/sqrt(Listening.count)
> detach(Listening)
> Listening.means


            1  2     3     4
Braille 91.00 92 83.33 95.00
Print   83.83 87 84.00 82.33
> Listening.se
            1     2     3     4
Braille 1.949 2.799 6.438 2.000
Print   5.850 2.790 3.932 4.961

There is clear evidence that the standard errors of the cell means are not equivalent, and that in particular the variation decreases with increased mean score on the Gilmore Oral Reading Test. We employ the arcsin transformation to the data, after converting the scores to proportions in an attempt to stabilise the variation.
> PPostTest = Listening$PostTest/100
> TPostTest = asin(PPostTest)

and re-evaluate the variation in the standard errors using:


> attach(Listening)
> TPostTest.means = tapply(TPostTest, list(Medium, Treatment), mean)
> TPostTest.sd = tapply(TPostTest, list(Medium, Treatment), sd)
> TPostTest.se = TPostTest.sd/sqrt(Listening.count)
> detach(Listening)
> TPostTest.means

            1     2     3     4
Braille 1.154 1.185 1.015 1.260
Print   1.037 1.078 1.033 1.005
> TPostTest.se
             1       2       3       4
Braille 0.0482 0.06856 0.13706 0.06541
Print   0.1041 0.05998 0.07949 0.08050

The problem is not improved dramatically, nor has it become worse. We use this transformed variable in spite of not being able to fully reduce the variation among the standard errors. We first fit the model that allows for both treatment factors and their interaction using:
> summary(Listening.aov1 <- aov(TPostTest ~ Medium * Treatment, data = Listening))
                 Df Sum Sq Mean Sq F value Pr(>F)
Medium            1  0.118  0.1184    2.79   0.10
Treatment         3  0.036  0.0121    0.28   0.84
Medium:Treatment  3  0.070  0.0234    0.55   0.65
Residuals        37  1.571  0.0425



Even though the interaction term is not significant, we do need to keep the MSE from the model for calculating the adjusted marginal means if we find either or both of the main effects are significant. To do this, we will need to create two ANOVA tables with the two terms in opposite orders:
> summary(Listening.aov2 <- aov(TPostTest ~ Medium + Treatment, data = Listening))
            Df Sum Sq Mean Sq F value Pr(>F)
Medium       1  0.118  0.1184    2.89  0.097 .
Treatment    3  0.036  0.0121    0.29  0.829
Residuals   40  1.642  0.0410
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(Listening.aov3 <- aov(TPostTest ~ Treatment + Medium, data = Listening))
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    3  0.050  0.0167    0.41   0.75
Medium       1  0.104  0.1045    2.55   0.12
Residuals   40  1.642  0.0410

Clearly, there is no strong evidence that either of the main effects is significant. Although there are some serious drawbacks to the model's validity, we will show how to complete the calculations of the adjusted marginal means anyway. We will also use this data again, after seeing how we incorporate the covariate information from the pre-test results as an exercise in Chapter 8. Recall that our adjusted marginal means will be based on the eight cell means calculated earlier. From Equations 6.9 and 6.10, we know that only the simple average of these cell means is required. They are calculated using the rowMeans() and colMeans() commands.
> rowMeans(TPostTest.means)
Braille   Print 
  1.153   1.038 

> colMeans(TPostTest.means)
    1     2     3     4 
1.095 1.132 1.024 1.132 

Finding the standard errors of these means using Equations 6.11 and 6.12 requires the number of replicates per cell of the two-way table (we found these earlier) and the MSE from the model including the interaction term for the two factors.
> TPostTest.MSE = anova(Listening.aov1)[4, 3]
> TPostTest.MSE
[1] 0.04247

The ncol() and nrow() commands have been used here to make the code for these next two calculations transferable to other data sets.



> sqrt((TPostTest.MSE/ncol(Listening.count)^2) * rowSums(1/Listening.count))
Braille   Print 
0.05836 0.03743 
> sqrt((TPostTest.MSE/nrow(Listening.count)^2) * colSums(1/Listening.count))
      1       2       3       4 
0.06239 0.06310 0.06976 0.08055 


So we now see that the standard error for the students who prefer braille is greater than that for the students who prefer print, which, given the numbers of students, should come as no surprise. We could now take these adjusted marginal means and their standard errors and back-transform to the original scale for inclusion in the report on the experiment's findings. Adding this information to the tabulated cell means of the raw data would probably highlight to the reader that in fact the differences between the four treatments were trivial.
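
For example, a minimal sketch of that back-transformation (assuming the objects created above are still in the workspace) is:

> 100 * sin(rowMeans(TPostTest.means))   # undo asin(score/100) for the Medium means
> 100 * sin(colMeans(TPostTest.means))   # and for the Treatment means

Because the transformation was asin(score/100), applying sin() and multiplying by 100 returns the adjusted marginal means to the original test-score scale.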

6.3.2

Analysis of the fouling data

A quick scan of the data given in Exhibit 6.2 immediately shows that the amount of total solids is measured for an entire day. There will be some confounding of effects if we include both the variables for day and total solids in the modelling of these data, so we will need to leave consideration of the total solids information until we have seen how to do this properly in Chapter 10. Looking at the major features of the data then shows us that we have a possible effect for the day, the time of day, and the treatment. Notice however, that the morning runs of this experiment are all done with treatment A being applied. Voila! An incomplete factorial experiment, where at least one treatment combination is untested, requiring a careful Type I analysis. Let's get the data and see how R sees it. Note that the actual data has more variables than were presented in Exhibit 6.2.
> data(Fouling, package = "DRUGS")
> str(Fouling)
'data.frame':   14 obs. of  8 variables:
 $ Time         : Factor w/ 2 levels "Morning","Afternoon": 1 1 1 1 1 1 1 2 2 2 ...
 $ Day          : Factor w/ 7 levels "day 1","day 2",..: 1 2 3 4 5 6 7 1 2 3 ...
 $ TotalSolids  : num  13.7 13.7 14.1 13 14.1 ...
 $ Treatment    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 2 1 2 ...
 $ Temp         : num  79.8 79.5 79.1 79.5 79.4 ...
 $ TempDiff     : num  4.36 4.83 4.65 4.44 4.8 4.1 4.12 4.54 4.76 4.57 ...
 $ FoulingRate  : num  240 234 222 162 271 ...
 $ InductionTime: num  0.34 1.06 0.55 0.57 0.34 0.17 0.46 1.08 0.39 0.65 ...

We use the tapply() command to find the observed cell means and their observed standard errors.



> attach(Fouling)
> tapply(FoulingRate, list(Time, Treatment), mean)
              A     B
Morning   228.1    NA
Afternoon 234.7 200.8
> tapply(FoulingRate, list(Time, Treatment), sd)
              A     B
Morning   34.36    NA
Afternoon 61.03 22.31
> tapply(FoulingRate, list(Time, Treatment), sd)/sqrt(tapply(FoulingRate,
+     list(Time, Treatment), length))
              A     B
Morning   12.99    NA
Afternoon 35.23 11.15
> detach(Fouling)

The least replicated treatment combination has the greatest standard error, but this is more to do with the standard deviation within each cell than the number of replicates. If we fit the models that re-order the two treatments but leave the blocking factor in the model first we have:
> summary(Fouling.aov1 <- aov(FoulingRate ~ Day + Time + Treatment, data = Fouling))
            Df Sum Sq Mean Sq F value Pr(>F)  
Day          6  11129    1855    4.60  0.058 .
Time         1    571     571    1.42  0.288  
Treatment    1   4850    4850   12.03  0.018 *
Residuals    5   2016     403                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(Fouling.aov2 <- aov(FoulingRate ~ Day + Treatment + Time, data = Fouling))
            Df Sum Sq Mean Sq F value Pr(>F)  
Day          6  11129    1855     4.6  0.058 .
Treatment    1   4051    4051    10.1  0.025 *
Time         1   1370    1370     3.4  0.125  
Residuals    5   2016     403                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These models show that the benefit in using Treatment B over Treatment A seen in the tabulated means above is significant, but the aging of the milk is not. We must be very careful here. There are only seven days over which the milk can age, and the ability to truly determine whether the use of Treatment B is the cause, versus some random chance affecting these days compared to the Treatment A days, is limited due to the low replication of the experiment. If we're feeling adventurous, we can turn our analysis into an incomplete block design by creating a single variable to represent all aspects of the treatment combinations. In this instance we use the paste() command

Exhibit 6.3 Tukey HSD method applied to the Fouling Rates.
> Fouling.HSD <- TukeyHSD(Fouling.ibd.aov, "CombTrt")
> Fouling.HSD
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = FoulingRate ~ Day + CombTrt, data = Fouling)

$CombTrt
                           diff    lwr   upr  p adj
A Morning-A Afternoon    -8.724 -53.82 36.37 0.8112
B Afternoon-A Afternoon -37.612 -87.52 12.29 0.1224
B Afternoon-A Morning   -28.888 -69.84 12.07 0.1469


> CombTrt = as.factor(paste(Fouling$Treatment, Fouling$Time))
> str(CombTrt)
 Factor w/ 3 levels "A Afternoon",..: 2 2 2 2 2 2 2 3 1 3 ...

We know that only two of the three treatment combinations were tested on any given day so we do have incomplete blocks. There is certainly no balance as we cannot physically test the two afternoon treatment combinations on a given day.
> summary(Fouling.ibd.aov <- aov(FoulingRate ~ Day + CombTrt, data = Fouling))
            Df Sum Sq Mean Sq F value Pr(>F)  
Day          6  11129    1855    4.60  0.058 .
CombTrt      2   5421    2710    6.72  0.038 *
Residuals    5   2016     403                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Creating this analysis allows us to use the TukeyHSD() command first seen in Chapter 3. Remember, this is a very conservative procedure. In this instance, we see that the output given in Exhibit 6.3 shows that there is little definite ability to separate any pair of the three treatment combinations. We could also use the lm() command to force R to provide the coefficients for the terms in the model.
> summary(Fouling.ibd.lm <- lm(FoulingRate ~ Day + CombTrt, data = Fouling))

Call:
lm(formula = FoulingRate ~ Day + CombTrt, data = Fouling)

Residuals:
      1       2       3       4       5       6       7 
  5.956 -11.617 -12.184  11.843  19.386  -0.227 -13.159 
      8       9      10      11      12      13      14 
 -5.956  11.617  12.184 -11.843 -19.386   0.227  13.159 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          264.14      22.82   11.57  8.4e-05 ***
Dayday 2              12.06      22.82    0.53    0.620    
Dayday 3               0.17      20.08    0.01    0.994    
Dayday 4             -84.26      22.82   -3.69    0.014 *  
Dayday 5              17.30      20.08    0.86    0.428    
Dayday 6             -16.14      22.82   -0.71    0.511    
Dayday 7              30.02      20.08    1.49    0.195    
CombTrtA Morning     -30.22      16.40   -1.84    0.125    
CombTrtB Afternoon   -75.22      21.69   -3.47    0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.1 on 5 degrees of freedom
Multiple R-squared: 0.891,  Adjusted R-squared: 0.718 
F-statistic: 5.13 on 8 and 5 DF,  p-value: 0.0441

This way of fitting what is actually the same linear model, but with a different parameterisation, shows us the differences between the three treatment combinations. The effects of the various days are all fitted first and then the within-day effects are fitted. The base level of the three combinations has turned out to be the afternoon runs of Treatment A. The coefficient for morning is therefore of the opposite sign to what aging should be, but the differences between Treatments A and B are shown.
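
If adjusted means for the three treatment combinations were wanted for reporting, one possible follow-up (a sketch only, assuming the objects above are still available and choosing day 1 arbitrarily as the reference day) is to predict from the fitted model:

> newdat = expand.grid(Day = "day 1", CombTrt = levels(CombTrt))
> predict(Fouling.ibd.lm, newdata = newdat)

The differences between these predictions reproduce the CombTrt coefficients shown above, because the same day is used for all three combinations.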

6.4

Planning an incomplete factorial experiment: an example

It is quite common to need to plan an experiment in the knowledge that the design will not be a full factorial experiment. The problem is working out how bad the resulting design will be! There are many ways to go about creating a structure for the planned experiment but we now illustrate a simple approach to planning a tasting experiment. Imagine you have 16 different recipes of biscuit that you want tested. You have four testers on your panel, but you know that they can't be relied upon to discern among the various biscuits if they are given too many in a single tasting session. For a start, let's plan a set of three days of tasting, where each person will test four biscuits per session. The Latin square design we saw in Chapter 4 can help us here. If we take a 4x4 Latin square, we know that there are three factors that each have four levels, and that the square has 16 cells in it. OK, the numbers have worked out very nicely, but this is just an example of how things can turn out occasionally! So, upon choosing the particular 4x4 Latin square, we can randomly assign the 16 recipe codes to the cells of the table. In session 1 of tasting, the rows indicate which taster gets which recipes. In session 2, we use the columns to allocate recipes to the tasters, and on the third session, the letter codes of the Latin square get allocated to the tasters. This plan will allow a model to be fitted that can incorporate effects for tasters, sessions, and recipes. Every recipe will be tasted in each session, but over the three days a taster will only taste a subset of the recipes. In all, 48 tests will be made over the three days and each taster would only ever get to try at most 12 recipes if there was no repetition of recipe. A Latin square approach will lead to some repetition of tests though, which may prove to be an advantage.
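
A sketch of how such a plan could be generated in R is given below; the recipe codes and the particular 4x4 square are invented for illustration, and in practice the square itself would be chosen at random.

> set.seed(123)
> recipes = sample(LETTERS[1:16])             # 16 recipe codes in random order
> cells = matrix(recipes, nrow = 4)           # recipes assigned to the 16 cells of the square
> LS = matrix(c(1, 2, 3, 4,
+               2, 3, 4, 1,
+               3, 4, 1, 2,
+               4, 1, 2, 3), nrow = 4, byrow = TRUE)
> session1 = split(cells, row(cells))         # session 1: tasters take the rows
> session2 = split(cells, col(cells))         # session 2: tasters take the columns
> session3 = split(cells, LS)                 # session 3: tasters take the Latin square letters
> session1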

6.5

Exercises

Note: The data sets in the exercises are available via the DRUGS package.
Exercise 6.1: Consider the allocation of the 45 students to treatments shown in Exhibit 6.1. If the researcher had come to you before collecting data, what allocation would you recommend?
Exercise 6.2: Rayner (1969) gives the results of a turnip yield experiment which was laid out in a Latin square design. This is a row and column design where each of n treatments appears once in each of the n rows and once in each of the n columns of the experimental area. Row and column variables are therefore useful blocking factors to control for any variation that may exist across the experimental area. In this experiment there are six turnip varieties but three of the 36 plots (from one corner of the experiment) were vandalized and the experimental results were therefore deemed inadmissible. It might be useful to insert specific numbers into the spaces in the file vandal.csv to see how the ANOVA for a Latin square experiment will look when it has complete data before you use the original data to see what impact the missing data has on the ability to determine which of the six varieties has the greatest yield. You can get the original data directly using
> data(Vandal, package = "DRUGS")

Exercise 6.3: Consider what recommendations you would put to the researcher if you had been asked to help design the Fouling Rate experiment. As a starting point, how would you design this experiment if there were eight days available for collecting data? Identify what effects you could estimate if just one more day had been available.
Exercise 6.4: Plan the tests for the biscuit tasting experiment according to the Latin square based planning approach from Section 6.4. Using random response data, fit the appropriate ANOVA model and evaluate the structure that results, including an assessment of the degrees of freedom assigned to each source of variation.
Exercise 6.5: Consider how you would choose which of the 16 recipes of the biscuit



tasting experiment would be allocated to the four tasters on each of three days if you did not use the Latin square approach of Section 6.4. Using random response data, fit the appropriate ANOVA model and compare the model's structure with the one based on the Latin square approach.

Chapter 7 Case Study: An Incomplete Factorial Design
The Impact of Different Diets and a Drug on Rats after a Surgical Procedure
A report on a statistical consulting project
written by

A. Jonathan R. Godfrey1 & Penelope Bilton2

7.1

Introduction

This chapter describes a case study where the authors provided advice, and ultimately the statistical analysis, to a PhD student who was concerned that her experimental work was not going to be useful for meeting her research aims. The experiment took thirty-eight weeks in total so a large amount of human effort and money had been expended during data collection. It was therefore extremely important that we work out exactly what could be retrieved after a few things went awry. The timeline of events shows a reasonably complex scenario, but we will see that it is analysed in a fairly simple fashion. Each rat was weighed at Week 4 of its life and then started a regime of one of three diets. One of the diets was a planned control diet, while the other two were based on milk
1. Jonathan is a lecturer in the Institute of Fundamental Sciences.
2. Penny is a PhD candidate in the Statistics group of the Institute of Fundamental Sciences. Please direct any queries to the other author of this chapter.




from two different animals. At Week 20, the weights of the rats and a number of other measurements were taken on each rat. These measurements included various variables that would ultimately measure the bone density of the rats. In Week 22, each rat was subjected to a surgical procedure that either removed its ovaries, or was meant to replicate the impact such surgery would have except that the ovaries were not removed. This was referred to as sham surgery. At Week 30, the same measurements were taken as were taken at Week 20. Following this, half of the rats that had real surgery were started on a drug regime to see if this drug would counter the damage done by removal of the ovaries. The experiment ended with the set of bone density measurements (as taken previously) being recorded at Week 38. The rats were then euthanased. Upon questioning, it was discovered that an equal number of rats (twenty) were assigned to seven of the possible factorial treatment combinations. The rats given the sham surgery were all given the control diet. This was planned, although it was admitted that if the experiment were to be planned better, more rats would be used and the other treatment combinations would be investigated. The researcher used randomisation at every opportunity to allocate rats to the different treatments. Of note was that the randomisation of the drug allocation in Week 30 was made after the surgery had taken place. Likewise, the type of surgery was randomly assigned after the diet regime was underway.

7.2

The Analysis for the Planned Experiment

The important aspects of the analysis are to work out what is the response variable and what are the explanatory variables or factors in the model. The complication for this experiment is that over time the response measurements and factors in the experiment change. We decided that the response variables collected at four time points led to three phases where change could be observed. We called these Phase I, Phase II, and Phase III. One treatment intervention was added during each of the three phases. In Phase I, only the change in the body weight of the rats could be ascertained. Over this phase, only one treatment factor was applied: the diet. This made for a fairly simple model, but we note that because of what was planned for later time phases, the number of rats given each diet was not constant; 40 rats were given each of the new diets while 60 rats were given the control diet. In Phase II, the surgery treatment was applied so we now had a factorial structure but there was an incomplete factorial design as the sham surgery group only came from



the control diet group. So we know the 60 rats given the control diet were broken into two groups; 40 given real surgery and 20 given the sham surgery. The response variables for Phase II were calculated as the change from Week 20 to Week 30 of each measurement taken at both time points. Each response variable was then put into a model that allowed for a main effect for diet (2 degrees of freedom) and the type of surgery (1 degree of freedom). As there were only four treatment combinations, the interaction effect of the diet and the surgery is not estimable. In Phase III, half of the rats given the real surgery were given the drug. This meant there were now a total of seven treatment groups, but at least each group has the same number of rats now. Some people might think of the treatment structure as being based on three levels of diet, two levels of surgery, and two levels of drug, which is true, but the absence of a number of these twelve treatment combinations makes the three-factor incomplete factorial design more complicated than we need. If we now think of the treatments as either diet-related or based on non-diet effects, we would think of an incomplete 3x3 factorial design; three diets and three non-diet treatments (sham surgery, real surgery without drugs, and real surgery with drugs). The response variables for Phase III are the changes from Week 30 to Week 38 and are in common with the response variables for Phase II. The models that have these response variables and the two treatment factors (diet and non-diet combination) have an interesting structure for the degrees of freedom. There are only seven treatment combination groups so at most six degrees of freedom will be allocated to treatment effects. The diet treatment takes two of the degrees of freedom. The non-diet effects take two more, leaving only two degrees of freedom for other treatment effects. If we had a complete factorial structure, the interaction of the diet and non-diet treatments would lead to a further four degrees of freedom being used in the model. There are only two degrees of freedom available for this interaction effect, but we know from the way the experiment was planned that the interaction between type of surgery and diet was not estimable. The two degrees of freedom and the associated explained sum of squares found when fitting the model are for the interaction of the drug and the diets, but this is conditional on the fact that all of these rats were given the real surgery. It might be easier to see how this works by pretending that the sham surgery group did not exist. In that case, all rats would have been given real surgery, one of the three diets and either the drug or no drug. This is a complete 3x2 factorial experiment where all degrees of freedom are allocated as we would expect. Six treatment combinations leads to two degrees of freedom for diet, one degree of freedom for drug, and two degrees of freedom for their interaction. The single extra treatment group can only add one more piece of information, which in this instance is the impact of the difference between having sham surgery, or real surgery without having the drug.



Exhibit 7.1 Sources of variation and associated degrees of freedom for the Phase III model

Source                                        df
Diet                                                 2
  Surgery                                 1
  Drug                                    1
Total non-diet effects                               2
Diet:Drug interaction, given real surgery            2
Residual                                           133
Total                                              139

Finally, the two degrees of freedom that are for the non-diet treatments during Phase III are separable into two effects; one for the surgery and one for the application of the drug. These effects are not orthogonal however, so the order in which they are added to the model does make a difference to the sum of squares assigned to each of the two effects. Given the fact that the treatments were given at different time points however, we have a natural ordering for the model; putting the surgery factor before the drug factor in our model makes sense. The sources of variation and their associated degrees of freedom for the ANOVA table for the Phase III model are given in Exhibit 7.1. We should, as a matter of good practice, rearrange the non-orthogonal effects in our model so that each main effect is added last, especially if the interaction of diet and drug is not significant.
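
The degrees of freedom in Exhibit 7.1 can be checked by simulating the seven-group structure; the sketch below uses invented data and the factor names of Exhibit 7.3, with twenty rats per group, and is not the study data themselves.

> set.seed(42)
> rats = expand.grid(rat = 1:20, DIET = c("Diet1", "Diet2", "Diet3"),
+                    SURGERY = c("Real", "Sham"), DRUG = c("None", "Drug"))
> keep = with(rats, SURGERY == "Real" | (DIET == "Diet1" & DRUG == "None"))
> rats = droplevels(rats[keep, ])           # the seven groups actually used
> rats$y = rnorm(nrow(rats))                # placeholder response
> summary(aov(y ~ DIET + SURGERY + DRUG + DIET:DRUG, data = rats))

The ANOVA table should show 2, 1, 1 and 2 degrees of freedom for the four treatment terms and 133 for the residual, matching Exhibit 7.1.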

7.3

The Problem

The principal reason for the researcher coming for statistical advice was that it was discovered that the control diet that had been administered was not the product that was originally planned. It turned out that this diet was deficient in an important vitamin, which was only discovered as a consequence of investigating the deaths of some rats that had been given this diet.

7.4

The Solution

Quite clearly, the planned control diet treatment is no longer a control treatment at all. Two paths could be taken at this point; either we discard all data from the failed control diet groups, or we re-consider how we think of the control diet groups. The latter option was taken. The control diet was re-branded as a third diet and the analysis as outlined in Section 7.2 was carried out.

Exhibit 7.2 Body weight for rats given one of seven treatment regimes
[Interaction plot "Bodyweight changes for different treatments": mean of Weight (100 to 400) plotted against Time (start, Wk 20, Wk 30, Wk 38), with one line for each treatment group: Diet1, Diet2, Diet3, Diet1+Drug, Diet2+Drug, Diet3+Drug, and Sham.]

Obviously, we already know that the unplanned diet group is likely to have some unusual results. We will need to watch the impact that the rats that died during the experiment have on events that occurred before their deaths. We may need to remove these rats from the entire analysis. We should also be ready to see changes in the degrees of freedom for the residual term in the various models if these rats do not get removed.

7.5

The Final Analysis

We explored the pattern of results for all response variables over time for each of the seven eventual treatment groups using an interaction plot. An example for the body weight variable is presented in Exhibit 7.2. The points marked in this plot are the means of the rats in each treatment group. We would expect the groups given the same diet to roughly coincide in those time periods when they had been given the same treatment regime up to that point. The variation indicated by the plot is therefore part of the random noise from the differences in the rats within each group as they were yet to be allocated to the different surgery or drug elements of the full treatment regime. We chose to analyze the change in a response variable from one time point to the next in order to allow for an accumulation of effects over time. This should account for the differences in the rats within a treatment group that existed before the next element



Exhibit 7.3 ANOVA tables for gain in body weight over each of the three time phases.

Phase I
Source       df  Sum Sq  Mean Sq  F value       p
DIET          2   57302  28651.1   39.927  0.0000 ***
Residuals   137   98309    717.6

Phase II
Source       df  Sum Sq  Mean Sq  F value       p
DIET          2   68819    34409   65.157  0.0000 ***
SURGERY       1   10022    10022   18.977  0.0000 ***
Residuals   127   67068      528
9 observations deleted due to missingness

Phase III
Source       df  Sum Sq  Mean Sq  F value       p
DIET          2    4178   2088.9   5.3111  0.0062 **
SURGERY       1    3415   3415.0   8.6826  0.0039 **
DRUG          1     141    140.5   0.3573  0.5511
DIET:DRUG     2     617    308.3   0.7838  0.4591
Residuals   117   46018    393.3
16 observations deleted due to missingness

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

of the treatment regime was applied. There are other ways to achieve the same ends, but in this instance we have a change in the treatment structure to deal with over time. If the full set of treatments had been in place from the outset, we would have repeated measurements data that would be analysed differently, perhaps using time as another factor in the analysis of the experiment. We show the ANOVA tables for the gain in body weight over each of the three time phases in Exhibit 7.3. This illustrates the structure of the models used in each of the three time phases. Note that this is the only response variable that can be assessed over Phase I as the other variables were only measured at the beginning and end of the second and third phases. A complete listing of all p-values from the experiment is quite an illustrative way of looking at the overall effects of the various treatments on the entire set of response variables. These are shown in Exhibit 7.4. Scanning the whole set of p-values in a table like this allows us to consider the pattern of results across all response variables. We can see that the diet is affecting the outcome of almost all response variables in all three time phases. To gauge if this is an effect found in Phase I which is carried over to the remaining time periods would require us to have a model that allows for time to be fitted as an explicit term in the model, and for

7.5. THE FINAL ANALYSIS Exhibit 7.4 ANOVA summary for Alendronate study Response Diet Phase I: weeks 4-20 Body weight Phase II: weeks 20-30 Body weight Femur density Lumbar density Whole bone density Fat mass Lean mass Bone mass Body weight Femur density Lumbar density Whole bone density Fat mass Lean mass Bone mass 0.0000 *** 0.0000 *** 0.0007 *** 0.0000 *** 0.0020 ** 0.0000 *** 0.0000 *** 0.0840 . 0.0000 *** 0.5162 0.0235 * 0.0000 *** 0.0000 *** 0.0016 ** 0.0062 ** 0.1535 0.2496 0.0267 * 0.0039 ** 0.1047 0.5766 0.0077 ** 0.5511 0.0630 . 0.3941 0.5318 0.3014 0.9150 0.4591 0.1107 0.5540 0.6355 0.7164 0.9018 0.0000 *** Eects Surgery Drug Drug:Diet


Phase III : weeks 30-38 0.0022 ** 0.9091

0.0002 *** 0.4434 0.0000 *** 0.7113 0.0005 *** 0.0435 *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

an interaction term to be fitted. If that interaction term was fitted, and then found not to be significant, we would say the diet effects were caused in Phase I. If the interaction is significant, we would then need to investigate the way in which the three diets are affecting the rats over the three phases, using an interaction plot for example. This is able to be seen in Exhibit 7.2 for the body weight already. The problem is that the time variable is (or would be) confounded with the interaction of diet and surgery in Phase II and the interaction of diet and drug in Phase III. Given we have used response variables that are changes over each time phase, the differences over time we might observe would reflect an alteration in the rate of change for each variable. An important assumption in our analysis is therefore that we are not concerned with the rate of change increasing or decreasing. The only way to determine these effects would have been to collect data at more points in time where the only change in the treatment regime was time alone. Recall that for Phase III we must consider any interaction the two active diets may have with the drug being administered. Fortunately, the interaction effects are all non-



significant so we can consider the impact of the diet and the drug independently. This now also has the effect of allowing us to confirm the assumption that the rates of change in our response variables are not increasing or decreasing from Phase II to Phase III. Given the diet is affecting the outcome and we need to know more about which diets are better than others, we employed the Tukey HSD multiple comparison procedure on this variable. The results for Phases I and III are given in Exhibit 7.5. The impact of the surgery is also apparent across most response variables in Phases II and III. There is no need to use a multiple comparison procedure here as there are only two possible outcomes. A simple table of means is all that is required to demonstrate the differences. We must note that the design has limited our ability to ascertain any link between diet and surgery when we present our findings as the sham surgery was only performed on the rats having Diet 1. Having considered the interaction of the diet and the drug already, and determined the minimal impact of the interaction, we can now turn our attention to the impact of the drug alone. On the whole, there is little impact on the response variables of administering the drug. The only response variables where this is not true are femur density and the lumbar density. Ultimately, the researcher must determine if these two response variables should be singled out from among the set of responses for special attention. If she wishes to give these response variables any further attention, the researcher could present tables of means for the seven diet, surgery and drug combinations that were tested. This will allow the imbalance of the diets and surgery combinations to be removed from the evaluation of the drug.



Exhibit 7.5 Tukey HSD analysis for weight in Phases I and III.
[Two plots of 95% family-wise confidence intervals for the differences in mean levels of DIET (Diet2-Diet1, Diet3-Diet1, Diet3-Diet2), one for Phase I and one for Phase III.]

Chapter 8 An Introduction to the Analysis of Covariance: Weight Gain in Hogs


An original chapter
written by

A. Jonathan R. Godfrey1

8.1

Introduction

Steel and Torrie (1981) presented an example, originally from Wishart (1938), of the use of the analysis of covariance using a data set with the weight gain of thirty hogs. The weight gained by male and female hogs given three different diets (called ration) is thought to depend (in part) on the initial weight of the hogs. A further factor included in the data set is the pen number which will be used as a blocking factor. The data appear in Exhibit 8.1.

8.2

The analysis of covariance model

The analysis of covariance (ANCOVA) model is the intersection of analysis of variance and regression. There are various reasons why we need to employ an ANCOVA model but for the moment we restrict this discussion to the experimental design context. In many experimental situations we have experimental units that are not as homogeneous as we would like. Blocking might prove useful but there are still measurable differences among experimental units that we may wish to factor into our analysis. At times we may also know that experimental units do differ but not be able to measure
1. Jonathan is a lecturer in the Institute of Fundamental Sciences.




Exhibit 8.1 The Hogs data: Weight gain of thirty hogs given one of three diets. Gender, initial weight and pen number are also provided.
                    Female              Male
Pen  Ration   Initial    Gain    Initial    Gain
 1     A1        48      9.94       38      9.52
       A2        48     10.00       39      8.51
       A3        48      9.75       48      9.11
 2     A1        32      9.48       35      8.21
       A2        32      9.24       38      9.95
       A3        28      8.66       37      8.50
 3     A1        35      9.32       41      9.32
       A2        41      9.34       46      8.43
       A3        33      7.63       42      8.90
 4     A1        46     10.90       48     10.56
       A2        46      9.68       40      8.86
       A3        50     10.37       42      9.51
 5     A1        32      8.82       43     10.42
       A2        37      9.67       40      9.20
       A3        30      8.57       40      8.76

the differences until after the experiment has started. Such measurements might occur concurrently with the observations of the impacts treatments are having. The model in its most simple form is
$$ y_{ij} = \mu + \tau_i + \beta(x_{ij} - \bar{x}_{i\cdot}) + \epsilon_{ij} \qquad (8.1) $$
where $\mu$ is the grand mean, $\tau_i$ is the treatment effect for treatment $i$, and $\epsilon_{ij}$ is the error for the $j$th replicate of the $i$th treatment group. The additional term that distinguishes this model from that of Eq. 2.1 is the inclusion of the regression coefficient $\beta$ which acts on the covariate $x$ which is measured for all observations. Note that it actually acts on the difference between the observed covariate and the group mean for the covariate. There are some additional assumptions for this model over the basic one-way ANOVA model given in Eq. 2.1. These are that the covariate $x$ has a linear relationship with the response variable $y$ and that this relationship is the same for all treatment groups. This second assumption is testable using the interaction of $x$ and the treatment group factor, and should be tested for as the default action because it is done so easily. The difficulty that arises when considering the use of the covariate in the model is to decide whether it is a necessary element in the modelling process. If the covariate is a pseudo-substitute for a blocking factor then there is an argument for its automatic inclusion in the model. There is however another school of thought that would suggest that the value of the covariate might only be assessed after the treatment factor has been included;

this would use the covariate to reduce the experimental error without understating the treatment factor's importance. The solution is fairly easy. If we test the value of the two effects in different orders we will find out how useful the terms are after the other term has been included in the model. We therefore perform our F-tests using adjusted sums of squares, which are found using the sums of squares for each term when it is entered into the model after all other terms. It is almost inevitable that the order of terms in the model will have an impact on the sums of squares attributable to the treatment factor and covariate. The reason for any difference in the adjusted sum of squares from the sum of squares found when the term is entered into the model first is that the treatment factor and the covariate are not orthogonal to one another. In other words they are not independent. Recall that one of our aims in using a randomized complete block design was to have independence of blocks and treatments. When we have a balanced RCB design, we have orthogonal factors being put into the model. It is almost always impossible to have balance for covariates in experimental contexts, especially when human subjects are used.

The most difficult step in analysing data from an experiment with a covariate is determining the treatment means to report. Given the (possibly small) relationship between the treatment means observed and the covariate, we must find a way of adjusting the treatment means to a common value for the covariate. The usual value to adjust treatment means to is the grand mean of the covariate $\bar{x}$, using
$$ \bar{y}_i^{a} = \bar{y}_i - \beta(\bar{x}_i - \bar{x}) \qquad (8.2) $$
These adjusted treatment means can be compared, but to give any such comparison relevance we must know the value of the standard error of the difference. Each treatment mean will have its own standard error, found using:
$$ s_{\bar{y}_i^{a}} = \sqrt{MSE\left(\frac{1}{r_i} + \frac{(\bar{x}_i - \bar{x})^2}{E_{xx}}\right)} \qquad (8.3) $$
The $E_{xx}$ in this expression is the error sum of squares that results from fitting an analysis of variance model using the covariate as the response and the treatment factor from our original model. The standard error of the difference for our adjusted treatment means is found using:
$$ s_{\bar{y}_i^{a} - \bar{y}_j^{a}} = \sqrt{MSE\left(\frac{1}{r_i} + \frac{1}{r_j} + \frac{(\bar{x}_i - \bar{x}_j)^2}{E_{xx}}\right)} \qquad (8.4) $$
We must expect to calculate the standard error of the difference using the above equation for all possible pairs of treatment means. Note that there will be differences among the standard errors for the treatment means in all circumstances, except for the case when a factor has only two levels.

The calculation of the adjusted treatment means as described here is only relevant if the covariate is having the same impact on all treatment groups. If there is any interaction between the covariate and the treatment factor, the adjustment for a single value of the covariate is not an appropriate method of comparing the treatment groups. If we find the covariate does demonstrate an interaction with the treatment factor, then we would need to estimate the expected value of the response variable, for every level of the treatment factor, at several selected values of the covariate that are relevant for the research.
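
As a small illustration of Equation 8.2 (simulated data, not the Hogs example that follows), the adjusted treatment means can be computed from the pooled within-group slope extracted from the fitted ANCOVA model:

> set.seed(3)
> grp = factor(rep(1:3, each = 8))
> x = rnorm(24, mean = 40, sd = 5)                    # covariate
> y = 2 + 0.1 * x + as.numeric(grp) + rnorm(24)       # response
> ancova = aov(y ~ grp + x)
> b = coef(ancova)["x"]                               # common slope for the covariate
> tapply(y, grp, mean) - b * (tapply(x, grp, mean) - mean(x))   # Equation 8.2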

8.3
8.3.1

Analysis using R
Weight gain in hogs

Once the file Hogs.csv is placed in the working directory we can issue the following commands to import and investigate the data.
> Hogs = read.csv("Hogs.csv", row.names = 1)
> str(Hogs)
'data.frame':   30 obs. of  5 variables:
 $ Pen      : int  1 1 1 1 1 1 2 2 2 2 ...
 $ Sex      : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
 $ Ration   : Factor w/ 3 levels "A1","A2","A3": 1 1 2 2 3 3 1 1 2 2 ...
 $ InitialWt: int  38 48 39 48 48 48 35 32 38 32 ...
 $ WtGain   : num  9.52 9.94 8.51 10 9.11 9.75 8.21 9.48 9.95 9.24 ...

Some variables are in the wrong format for use in our analysis. Changes are made using:
> Hogs$Pen = as.factor(Hogs$Pen)
> Hogs$InitialWt = as.numeric(Hogs$InitialWt)
> str(Hogs)
'data.frame':   30 obs. of  5 variables:
 $ Pen      : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ Sex      : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
 $ Ration   : Factor w/ 3 levels "A1","A2","A3": 1 1 2 2 3 3 1 1 2 2 ...
 $ InitialWt: num  38 48 39 48 48 48 35 32 38 32 ...
 $ WtGain   : num  9.52 9.94 8.51 10 9.11 9.75 8.21 9.48 9.95 9.24 ...

Exhibit 8.2 has been constructed to see if a relationship exists between the initial weight of the hogs and their eventual weight gain. The following R commands show the simple linear regression model fitted to these data.
> Hogs.lm = lm(WtGain ~ InitialWt, data = Hogs)
> summary(Hogs.lm)



Exhibit 8.2 Weight gain of thirty hogs plotted against their initial weights.
> attach(Hogs)
> plot(InitialWt, WtGain)
> detach(Hogs)
[Scatter plot of WtGain (7.5 to 11.0) against InitialWt (30 to 50).]

Call:
lm(formula = WtGain ~ InitialWt, data = Hogs)

Residuals:
   Min     1Q Median     3Q    Max 
-1.292 -0.512  0.026  0.374  1.178 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.4649     0.7560    8.55  2.7e-09 ***
InitialWt     0.0708     0.0186    3.80  0.00072 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.621 on 28 degrees of freedom
Multiple R-squared: 0.34,  Adjusted R-squared: 0.317 
F-statistic: 14.4 on 1 and 28 DF,  p-value: 0.000719

From the graph and the regression model, we can see that there is a clear relationship between the weight a hog gains over the experiment and its weight when the experiment started. The initial weight must therefore be included as a covariate in our model in order to reduce the error variance. It is important to note that an investigation of this kind cannot show that the covariate should be excluded from our model; that conclusion can only be drawn once all blocking and treatment factors have been included. Also note that the Pen number is included in our models as a blocking factor, and that Sex and Ration are the treatments of interest. One might think of Sex as a blocking


factor, but if we remember that we expect blocks to be independent of treatments, and that the sex and ration factors may well interact, then we should choose to think of Sex as a treatment factor for the time being. The analysis of variance model based on the randomized complete block design, that is, ignoring the covariate but including the interaction of the two treatment factors, is found using the following R commands:
> Hogs.aov1 = aov(WtGain ~ Pen + Sex * Ration, data = Hogs) > summary(Hogs.aov1) Df Sum Sq Mean Sq F value Pr(>F) Pen 4 4.85 1.213 2.92 0.047 * Sex 1 0.43 0.434 1.04 0.319 Ration 2 2.27 1.134 2.73 0.090 . Sex:Ration 2 0.48 0.238 0.57 0.573 Residuals 20 8.31 0.416 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Including the covariate in the model is done using the following R commands:
> Hogs.aov2 = aov(WtGain ~ InitialWt + Pen + Sex * Ration, + data = Hogs) > summary(Hogs.aov2) Df Sum Sq Mean Sq F value Pr(>F) InitialWt 1 5.56 5.56 21.93 0.00016 *** Pen 4 2.26 0.56 2.23 0.10473 Sex 1 1.28 1.28 5.04 0.03687 * Ration 2 2.34 1.17 4.62 0.02324 * Sex:Ration 2 0.10 0.05 0.19 0.82631 Residuals 19 4.82 0.25 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R oers a method for comparing two linear models. The anova() command takes two models as arguments and compares them using:
> anova(Hogs.aov2, Hogs.aov1) Analysis of Variance Table Model 1: WtGain ~ InitialWt + Pen + Sex * Ration Model 2: WtGain ~ Pen + Sex * Ration Res.Df RSS Df Sum of Sq F Pr(>F) 1 19 4.82 2 20 8.31 -1 -3.5 13.8 0.0015 ** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that this comparison means we do not need to waste time rearranging the terms in the model to get the order of F -tests correct, which would have normally been achieved using the R commands:


> Hogs.aov3 = aov(WtGain ~ Pen + Sex * Ration + InitialWt, + data = Hogs) > summary(Hogs.aov3) Df Sum Sq Mean Sq F value Pr(>F) Pen 4 4.85 1.21 4.79 0.00769 ** Sex 1 0.43 0.43 1.71 0.20609 Ration 2 2.27 1.13 4.48 0.02555 * InitialWt 1 3.88 3.88 15.30 0.00094 *** Sex:Ration 2 0.10 0.05 0.19 0.82631 Residuals 19 4.82 0.25 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that R has forced the two-way interaction of the two treatments into the model last. In many situations this automatic rearrangement is an advantage. In this instance, however, it serves to prove the value of using the anova() command. Recall that the assumptions for the analysis of covariance model included a common effect of the covariate on all treatment groups. This is easily tested by fitting yet another model using the following R commands.
> Hogs.aov4 = aov(WtGain ~ Pen * InitialWt + InitialWt * Sex * + Ration, data = Hogs) > summary(Hogs.aov4) Df Sum Sq Mean Sq F value Pr(>F) Pen 4 4.85 1.213 6.04 0.0097 ** InitialWt 1 2.96 2.963 14.76 0.0033 ** Sex 1 1.28 1.277 6.36 0.0303 * Ration 2 2.34 1.170 5.83 0.0210 * Pen:InitialWt 4 0.42 0.106 0.53 0.7196 InitialWt:Sex 1 0.41 0.410 2.04 0.1834 InitialWt:Ration 2 0.47 0.234 1.17 0.3505 Sex:Ration 2 0.06 0.029 0.14 0.8687 InitialWt:Sex:Ration 2 1.55 0.774 3.86 0.0574 . Residuals 10 2.01 0.201 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is fairly obvious from the above output that the covariate has no interaction effect with the treatment factors in this experiment. Reviewing the other models presented thus far, we might decide to ignore the possibility of an interaction between the two treatment factors and fit a final model using the following R commands.
> Hogs.aov5 = aov(WtGain ~ InitialWt + Pen + Sex + Ration, + data = Hogs) > summary(Hogs.aov5) Df Sum Sq Mean Sq F value Pr(>F) InitialWt 1 5.56 5.56 23.76 8.1e-05 *** Pen 4 2.26 0.56 2.41 0.081 . Sex 1 1.28 1.28 5.46 0.029 * Ration 2 2.34 1.17 5.00 0.017 * Residuals 21 4.91 0.23 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exhibit 8.3 Residual analysis for the analysis of covariance model for the Hogs data.
> par(mfrow = c(2, 2)) > plot(Hogs.aov5)

[The four standard diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

The residual analysis for this final model can be found using the plot() command on the model object and appears in Exhibit 8.3. Having established the usefulness of this final model, we will need to find the adjusted treatment means and their standard errors for the final report. There are several quantities required, which are found using the following R commands.
> Hogs.MSE = anova(Hogs.aov5)[5, 3]
> Hogs.MSE

[1] 0.234

> BetaHat = coef(Hogs.aov5)[2]
> BetaHat

InitialWt 
  0.09127 

We then need to obtain additional quantities for the calculations to find the adjusted treatment means and the standard errors of their differences. There is quite a lot of working to do here, but it does all come together in the end. This is a clear example of where saving our R commands will be an advantage if we were to repeat the task of finding adjusted treatment means.

> attach(Hogs)
> Hogs.Gain.means.Sex = tapply(WtGain, Sex, mean)
> Hogs.Initial.means.Sex = tapply(InitialWt, Sex, mean)
> Hogs.Initial.gmean = mean(InitialWt)
> Hogs.adjusted.means.Sex = Hogs.Gain.means.Sex - BetaHat *
+     (Hogs.Initial.means.Sex - Hogs.Initial.gmean)
> Hogs.adjusted.means.Sex

    F     M 
9.519 9.090 

> Hogs.Gain.means.Ration = tapply(WtGain, Ration, mean)
> Hogs.Initial.means.Ration = tapply(InitialWt, Ration, mean)
> Hogs.adjusted.means.Ration = Hogs.Gain.means.Ration - BetaHat *
+     (Hogs.Initial.means.Ration - Hogs.Initial.gmean)
> detach(Hogs)
> Hogs.adjusted.means.Ration

A1 A2 A3 9.676 9.233 9.003

Well, finding the adjusted treatment means wasn't actually that hard, given we have just done it for both the sex and ration treatment factors! There is more work required for the standard errors of the differences though.
> Hogs.Initial.aov = aov(InitialWt ~ Pen + Sex + Ration, data = Hogs)
> Hogs.Exx = anova(Hogs.Initial.aov)[4, 2]
> Hogs.Exx

[1] 465.4

> attach(Hogs)
> Hogs.count.Sex = tapply(WtGain, Sex, length)
> Hogs.se.Sex = sqrt(Hogs.MSE * (1/Hogs.count.Sex + (Hogs.Initial.means.Sex -
+     Hogs.Initial.gmean)^2/Hogs.Exx))
> Hogs.se.Sex

    F     M 
0.127 0.127 

> Hogs.count.Ration = tapply(WtGain, Ration, length)
> detach(Hogs)
> Hogs.se.Ration = sqrt(Hogs.MSE * (1/Hogs.count.Ration +
+     (Hogs.Initial.means.Ration - Hogs.Initial.gmean)^2/Hogs.Exx))
> Hogs.se.Ration

    A1     A2     A3 
0.1531 0.1535 0.1531 

> Hogs.sed.Sex = sqrt(Hogs.MSE * (1/15 + 1/15 + (Hogs.Initial.means.Sex[1] -
+     Hogs.Initial.means.Sex[2])^2/Hogs.Exx))
> Hogs.sed.Sex

     F 
0.1826 

So there we have it. We have the standard error of the difference between male and female hogs in this experiment, and coupled with the two adjusted means given earlier we can verify the significance of the sex treatment factor. Calculating the standard errors of the differences among the rations is left as an exercise.


Exhibit 8.4 Residuals from the preferred model for weight gain in hogs plotted against the covariate used in the model.
> plot(Hogs$InitialWt, resid(Hogs.aov5), xlab = "Initial Weight", + ylab = "Residuals")


Calculating adjusted treatment means can be done much more simply using the predict() function in R. This is left as an exercise, but note that this approach provides the estimated fitted value of any adjusted mean and its standard error, but not the standard error of the difference between two adjusted treatment means.

8.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package. If you haven't already done so, you can access the data by issuing the library() command
> library(DRUGS)

Exercise 8.1: One of the assumptions of the analysis of covariance model as fitted for the Hogs data is that the covariate is linearly related to the response variable. Exhibit 8.4 shows the residuals from our preferred model plotted against the covariate. Do you think additions to the model for interaction between covariate and treatment factors, or polynomials of the covariate, are warranted? Attempt any models you think warrant investigation.

Exercise 8.2: Find the standard errors of the differences among the adjusted means for the three rations used in the Hogs experiment used above. Note that you will need to


find three standard errors here, and that you should use the model that you think is the best.

Exercise 8.3: The data in the file Human.csv are for the body fat percentage of 14 female and 4 male subjects of varying ages. Mazess et al. (1984) used the data to evaluate a (what was then) new method for measuring body composition. Determine (on the basis of this data anyway) the effect the age of the subject has on their body fat percentage and whether their gender has any influence on results. In the DRUGS package, this data set is called Human and can be obtained using
> data(Human, package = "DRUGS")

Exercise 8.4: The following data set was discovered while investigating the faraway package. Partridge and Farquhar (1981) reported on the following experiment in Nature magazine. 125 fruitflies were divided randomly into 5 groups of 25 each. The response was the longevity of the fruitfly in days. One group was kept solitary, while another was kept individually with a virgin female each day. Another group was given 8 virgin females per day. As an additional control, the fourth and fifth groups were kept with one or eight pregnant females per day. (Apparently, pregnant fruitflies will not mate; amazing what you learn in statistics courses!) The thorax length of each male was measured as this was known to affect longevity. One observation from the 'many' group was lost, so the data in the file Fruitfly.csv has 124 rows and 3 columns. Use this data to determine if the length of the thorax does actually have an impact on the longevity of a fruitfly, and whether the lifestyle each male fruitfly has been subjected to has had any impact on longevity also.

Exercise 8.5: The MASS package contains a data set for 72 patients with anorexia. Note that the three treatment groups are of differing sizes. You can obtain the data directly from the MASS package, where more detail can also be found, or in the file Anorexia.csv. Use the weight of the patients before the treatments as a covariate in determining the relative benefits of the treatments.

Exercise 8.6: In Chapter 6, we investigated the post-test results from 45 severely visually impaired students. Reconsider the analysis given after incorporating the pre-test results as a covariate.

Exercise 8.7: Milliken and Johnson (2002) demonstrate a variety of examples of an analysis of covariance model. The data from Example 3.7 show how to deal with a situation where there is an interaction between the covariate and treatment factor. After showing that this is the case, use the predict() function in R to find a set of adjusted treatment means that are appropriate for demonstrating the difference among the treatment means. You may also wish to create a graph to support any tabulated results you find. The necessary data is called Potatoes in the DRUGS package.

Chapter 9 An Introduction to Split Plot Designs: Testing Hearing Aids


An original chapter
written by

A. Jonathan R. Godfrey1

9.1 Introduction

The vast majority of presentations on the topic of split-plot experimental designs use examples from agriculture. This is somewhat unfortunate as it understates the necessity of the split-plot design, as it is used in many other scenarios. A search for data for this chapter gave plenty of journal articles that purported to have used a split-plot design in their analysis but yielded no data! A data set found in Box et al. (2005) on the number of mistakes made during hearing tests by eight subjects using six different hearing aids will be used in this chapter. The data are presented in Exhibit 9.1. Notice that there are four pairs of subjects in this experiment, and for each subject there is only one observation per hearing aid. If we did not know the amount of hearing loss for each patient, we would be limited to analysing the data as if they had come from a randomized complete block experiment. In that case, we would not be able to determine the existence of an interaction between patients and hearing aids.
1

Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 9.1 Data from testing six hearing aids on eight patients.
Loss     Person    A    B    C    D    E    F
Severe        1    7   16   13   19   14   11
              2    4   15   14   26   12   10
Marked        3    0    9   10   15    2    3
              4    3    6   12   13    3    0
Some          5    4    3    9    8    2    1
              6    7    2    1    4    6    8
Slight        7    2    0    4    3    0    1
              8    3    1    0    5    2    1

9.2 The model for the Split-plot Design

9.2.1 Some examples

Before one embarks on using the appropriate analysis of variance for a split-plot designed experiment, one must know how to identify an experiment that has been designed (on purpose or by accident) with a split-plot. First of all, the experiment must have at least two treatments; blocking factors might also be present. One of the treatment factors will be applied to whole experimental units, while on another level, all levels of the second treatment will be applied to those experimental units. We say that the whole experimental unit is then split up to apply the second treatment factor. The effect in statistical terms is that we are comparing the first treatment factor over the range of whole experimental units, sometimes called the main plot units, but the second treatment is compared to variations within the main plot units, and therefore among the sub-plot units. You might think the use of the word "plot" is confusing, but if you remember that most experimental design work has its roots in agricultural experimentation, you'll probably find the language a little easier to follow. Split-plot experiments came about as it proved impossible to perform all experiments using the now (hopefully) familiar factorial experimental design. Let's look at some examples where a split-plot design exists.

Medicine 1: If patients with a skin condition were given both internal medication (possibly steroids taken orally) and a topical treatment (such as an ointment), it would be possible to have individual patients take one of the internal medication treatments and let them have more than one of the possible topical treatments; you could force them to use topical cream A on the left hand side of their body and the placebo cream on the right hand side of their body. In this scenario the patient is the whole or main plot unit and the effects of the oral steroids would be compared


among patients. Each patient is then split into separate experimental units for the investigation of the effects of the ointments.

Medicine 2: Our hearing aid data. While it is difficult to suggest that we have randomly assigned hearing loss to patients, the patients may have been selected randomly from a bigger population; they may also have been selected and had their hearing tested prior to the experiment so that they could be ordered. In the end, we can assess the differences among patients for their hearing loss status as if it were a treatment. Patients are therefore the main experimental units, who then are tested using all six hearing aids. The only randomness that then links the hearing aids with patients is the order in which the testing was done.

Medicine 3: If a group of patients are allocated to one of a set of treatments and repeated measurements are taken over time, the data are analysed using the split plot approach. This repeated measures analysis is possible using both the split plot methodology and multivariate analysis of variance (see Chapter 12). Note that the patient does not change treatments in this example. If they did change treatments as measurements are recorded over time, the analysis is more likely to follow something like a crossover design as will be discussed in Chapter 11.

Agriculture: Many treatments in agriculture are applied to a large area of land (ploughing, top-dressing, etc.), but some treatments can be applied to smaller areas of land (human interventions such as weeding, covering, different irrigation amounts, etc.).

Manufacturing: Box et al. (2005) gave the following example to justify the use of split-plot experiments in industrial settings. An experiment to improve the corrosion resistance of steel bars used a combination of applying one of four surface coatings to the bars and then baking them in a large furnace at one of three temperatures. A completely random arrangement in which the furnace temperature had to be reset repeatedly would have been impractical, but a split plot experiment was easily run in the following way: a temperature condition was set and the four different kinds of coated bars were randomly positioned in the furnace and baked for the prescribed length of time. The furnace was then reset at a different temperature to bake a second set of coated bars, and so on.

Food Technology: Box et al. (2005) also use an example where nine different cake recipes were tested for their effectiveness under five different cooking conditions. Each of the recipes was made up, and then each one was split into five, so that each recipe could be tested over (hopefully) the range of cooking time and temperature conditions that might be encountered by the consumer. In this instance the point of interest will be in the recipe that has the least variation among the possible cooking conditions.

In each of these examples, the whole units are effectively a blocking factor when it comes time to gauge the effectiveness of the treatment applied to sub-plots. This means that our sub-plot treatment is compared to the variation of the sub-plots, which is usually smaller than the variation of the main plots.

9.2.2 The linear model

The linear model we use for split-plot experiments is not that difficult. In fact it looks like others you will have seen before. It has the grand mean $\mu$, main effects $\alpha_i$ and $\beta_j$ for the two treatments, and the interaction of the two treatments $(\alpha\beta)_{ij}$:

\[ y_{ijk} = \mu + \alpha_i + d_{ik} + \beta_j + (\alpha\beta)_{ij} + e_{ijk} \tag{9.1} \]

In fact the only term that looks different is the $d_{ik}$ term, which is the error term we assign to the main plot units. The last term, $e_{ijk}$, holds the errors assigned to the subplot units. If we were to allow some blocking of main units in our experiment, the model is enhanced through the addition of a blocking factor $\rho_k$ to become

\[ y_{ijk} = \mu + \alpha_i + \rho_k + d_{ik} + \beta_j + (\alpha\beta)_{ij} + e_{ijk} \tag{9.2} \]

In this case, there is usually one experimental unit assigned to each level of the first treatment factor within each block. When using some software (not R), we need to fit the term for the interaction of the block and the first treatment; having this term in the analysis ensures that the correct degrees of freedom are assigned to the final error term. The sum of squares assigned to this interaction is used as that needed for the $d_{ik}$ term above. The difficulty with the split-plot model is getting the correct analysis of variance table. If we do not specify that the first treatment effects ($\alpha_i$) are to be gauged against the correct error term ($d_{ik}$), they will be assessed against the $e_{ijk}$ term in our model. The second treatment is to be gauged against the subplot error term ($e_{ijk}$). We will need to watch the way degrees of freedom are allocated. If we did include blocking in our experiment then we would have blocked our main plot units, so the effect of blocking is also assessed at the main plot level. Examples will help illustrate these points, especially the vast range of incorrect analyses that are possible.
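A generic sketch of how this is requested in R follows; the names FactorA, FactorB, MainUnit and MyData are hypothetical, and the worked example for the hearing aid data appears in Section 9.3. The Error() term tells aov() which stratum each effect should be tested in.

# A sketch with hypothetical names: Factor A is applied to main plot units
# (identified by MainUnit) and Factor B to the subplots within them.
my.split.aov = aov(Response ~ FactorA * FactorB + Error(MainUnit),
                   data = MyData)
# The summary prints one ANOVA table per error stratum: Factor A is tested
# against the main plot error, while Factor B and the interaction are tested
# against the subplot error.
summary(my.split.aov)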

9.2.3 Constructing the ANOVA table

Let's say we have a situation where a pair of treatment factors are being employed, with Factor A being tested on the main plot units, and Factor B on subplot units. This means


every experimental unit that is given one level of Factor A will be split up into smaller parts and each subplot (part) will then receive one level of Factor B. The first concern to think about is "how many main units are there?", and then, "how many observations are there in all?" We often have the added complication of a blocking factor on the main plot units to worry about. In these cases, the experiment (ignoring Factor B and the subplots) is an RCB design with A levels of Factor A and P blocks. This would mean there were A × P experimental units. For the moment, and without worrying about the presence of any blocking factors, let's say there are M main experimental units. Factor A is applied to those, but they are then split apart for the application of Factor B. If there are B levels of Factor B, then we would expect (given complete balance) there to be M × B subplot experimental units.

So there are two parts of the full ANOVA table to construct: the main plot experiment, including the Factor A effects, and the subplot experiment, which looks at the Factor B effects. In the simplest form, the two tables are:

Source            df              SS     MS                          F
Factor A          A - 1           SSA    MSA = SSA/(A - 1)           MSA/MSE1
Error 1           M - A           SSE1   MSE1 = SSE1/(M - A)
Main plot Total   M - 1           SSM

and

Source            df              SS     MS                          F
Main plot Total   M - 1           SSM
Factor B          B - 1           SSB    MSB = SSB/(B - 1)           MSB/MSE2
Error 2           N - M - B + 1   SSE2   MSE2 = SSE2/(N - M - B + 1)
Subplot Total     N - 1           SST

This model does not allow for any interaction between the two treatment factors, nor does it allow for the possibility of a blocking factor. Adding these two elements into the tables gives the full table shown in Exhibit 9.2. Of course, R does not print the last lines of the two tables, nor does it print the first line of the second table, which is the last line of the first table anyway. The fitting of the main plot unit effects (block, treatment and error) means the mean result of each main plot experimental unit is fully explained by the upper panel of the ANOVA table. Also note the hypothesis test for blocking has not been included. Further, the explicit statement of the main plot total in the second table implies the main plots are actually blocking factors for the subplots. No interaction of the blocking factor is possible with Factor A; it is theoretically possible (but not generally advisable) for Factor B.


Exhibit 9.2 Full ANOVA table for a split plot experiment with one blocking factor and two treatments.

(i) for main plot units:

Source            df               SS     MS                             F
Block             P - 1            SSP
Factor A          A - 1            SSA    MSA = SSA/(A - 1)              MSA/MSE1
Error 1           (P - 1)(A - 1)   SSE1   MSE1 = SSE1/[(P - 1)(A - 1)]
Main plot Total   M - 1 = AP - 1   SSM

(ii) for subplot units:

Source            df                SS     MS                             F
Main plot Total   M - 1             SSM
Factor B          B - 1             SSB    MSB = SSB/(B - 1)              MSB/MSE2
Interaction AB    (A - 1)(B - 1)    SSAB   MSAB = SSAB/[(A - 1)(B - 1)]   MSAB/MSE2
Error 2           (P - 1)A(B - 1)   SSE2   MSE2 = SSE2/[(P - 1)A(B - 1)]
Subplot Total     N - 1 = PAB - 1   SST

9.2.4 Standard errors for treatment means

There are two complications for finding the standard errors for the differences between two levels of a treatment factor in a split-plot experiment. First, there are two different sources of error, and second, comparisons might be dependent on the existence of an interaction between the treatment factors. We will need to introduce some additional notation here. Whether blocking is a feature of the experiment or not, Factor A will be the whole plot treatment, and Factor B will be applied to subplot units. We will say that there were r replicates of each combination of the a levels of Factor A and the b levels of Factor B, and thus assume that the experiment is balanced. We will not discuss unbalanced split-plot experiments here. We will use the letters u and v to indicate two different levels of a treatment factor, and use subscripts to show which treatment factor is under consideration.

When we wish to compare two levels of Factor A, we calculate the difference $\bar{y}_{u\cdot} - \bar{y}_{v\cdot}$. This difference has standard error

\[ \sqrt{\frac{2\,MSE(1)}{rb}} \tag{9.3} \]

The MSE(1) in this equation links with the $d_{ik}$ term in Equations 9.1 and 9.2 given earlier. When we wish to compare two levels of Factor B, we calculate the difference $\bar{y}_{\cdot u} - \bar{y}_{\cdot v}$. This difference has standard error

\[ \sqrt{\frac{2\,MSE(2)}{ra}} \tag{9.4} \]

The MSE(2) in this equation links with the $e_{ijk}$ term in Equations 9.1 and 9.2 given earlier. Note that as a consequence, different error mean squares from the analysis of variance will be used in the expressions for the standard error of a difference.

If there is an interaction between the two treatment factors we can find a standard error of the difference for a given level of the other factor. For example, if we wish to consider the difference between two levels of Factor B for the ith level of Factor A, we calculate the difference $\bar{y}_{iu} - \bar{y}_{iv}$. This difference has standard error

\[ \sqrt{2\,MSE(2)/r} \tag{9.5} \]

In similar fashion, if we wish to find the difference between two levels of Factor A for the jth level of Factor B, we calculate $\bar{y}_{uj} - \bar{y}_{vj}$. This difference has standard error

\[ \sqrt{\frac{2\,[(b-1)MSE(2) + MSE(1)]}{rb}} \tag{9.6} \]

Kuehl (2000) notes that this expression is only an approximation to the quantity actually required, and more importantly that this standard error does not behave like other standard errors. Comparisons of Factor A for given levels of Factor B are therefore less advisable than comparisons of Factor B given a particular level of Factor A. This advice is theoretically accurate, but there is a more pragmatic reason for avoiding this comparison; in the split-plot arrangement, there are usually few replicates of Factor A and we must think carefully before comparing the levels of Factor A at any time.

This discussion exposes an interesting feature of the split-plot design. It is common for MSE(1) ≥ MSE(2), and this has an impact on the precision of the estimates for the various treatment means. Estimates of Factor A's means are less precise than those for Factor B or the interaction of the two factors. The advantages of using a split-plot experiment when the alternative of a fully balanced factorial experiment is available must be considered. If the precision of Factor B (and the interaction effects) is more important than the precision of Factor A, then a split-plot design might be a better option. This is true for the cake recipe experiment described earlier, for example. In most circumstances, use of the split-plot analysis (and therefore design) is not an option; it is a feature of the experiment's design forced upon the researcher by the constraints of their scenario.
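As a small numerical illustration of Equations 9.3 and 9.4 (all values below are placeholders chosen for illustration, not results from any experiment in this chapter), the two standard errors of a difference are easily computed once the two error mean squares are known:

# Placeholder values only.
r = 4          # replicates of each treatment combination
a = 3          # levels of Factor A (whole plot treatment)
b = 6          # levels of Factor B (subplot treatment)
MSE1 = 0.90    # main plot error mean square
MSE2 = 0.20    # subplot error mean square
sed.A = sqrt(2 * MSE1 / (r * b))   # Equation 9.3: comparing two Factor A means
sed.B = sqrt(2 * MSE2 / (r * a))   # Equation 9.4: comparing two Factor B means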

9.3 Analysis using R

The Hearing Aids data in the file HearingAids.csv could be read into R using the read.csv() command, but it is easier to obtain it via the DRUGS package. We therefore


Exhibit 9.3 Means and standard deviations for the eight individuals used in testing the hearing aids
> attach(HearingAids)
> HearingAids.mean = tapply(Mistakes, Person, mean)
> HearingAids.sd = tapply(Mistakes, Person, sd)
> detach(HearingAids)
> HearingAids.mean

     1      2      3      4      5      6      7      8 
13.333 13.500  6.500  6.167  4.500  4.667  1.667  2.000 

> HearingAids.sd

    1     2     3     4     5     6     7     8 
4.131 7.259 5.753 5.269 3.271 2.805 1.633 1.789 

use the data() command to get the data, and investigate its structure using the str() command.
> data(HearingAids, package = "DRUGS")
> str(HearingAids)

'data.frame':	48 obs. of  4 variables:
 $ Person  : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ Loss    : Factor w/ 4 levels "Marked","Severe",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Unit    : Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...
 $ Mistakes: int  7 16 13 19 14 11 4 15 14 26 ...

If you imported the data from the file, R would read the variable denoting the person being tested in each row of our data.frame as integer-valued, rather than as a factor. This is rectified using:
> HearingAids$Person = as.factor(HearingAids$Person) > str(HearingAids)

Note that we have not converted the Mistakes variable, which is currently (correctly) deemed to take integer values by R. The mistakes are count data. We will try to model this using a linear model, but there are two problems. The first is that a response that is a count can only take non-negative values, and the second is that such a variable is unlikely to have a normal distribution. We will therefore need to be quite careful about checking the assumptions of the linear model, but if all else fails we will be able to model the data using a generalised linear model as discussed in Chapter 13. It is common for count data like this to show a tendency for the variance of responses to increase as the mean of the responses increases. To investigate this phenomenon we find means and standard deviations for the eight individuals used in this experiment. The R commands, and associated output, are given in Exhibit 9.3. The cor() command and a plot of the standard deviations against the means are given in Exhibit 9.4 to investigate the relationship. Use of the cor() command to extract Pearson's correlation coefficient


Exhibit 9.4 Correlation and scatter plot for the means and standard deviations of the eight individuals used in testing the hearing aids
> cor(HearingAids.mean, HearingAids.sd) [1] 0.757

> plot(HearingAids.mean, HearingAids.sd, xlab = "Means", ylab = "Standard deviations")


shows that (as expected) the mean and standard deviation of the results for individuals in the experiment are related. At this time we will use a square root transformation of the response variable in an attempt to correct this. Exhibit 9.5 shows how we use the sqrt() command to create the new variable and then investigate its potential. While we have managed to get means and standard deviations that are now uncorrelated, there does remain one problem of note. The standard deviations are not homogeneous. This might be because we have not yet identified any interaction between treatment factors. Exhibits 9.6 and 9.7 show two possible ways of investigating the interactions that exist within this data set. These graphs show that there is an interaction between the people and the types of hearing aids, which might be summarized through the interaction of the hearing loss status and the hearing aid type.


Exhibit 9.5 Impact on the response variable after taking a square root transformation.
> HearingAids$sqrtMistakes = sqrt(HearingAids$Mistakes)
> attach(HearingAids)
> sqrtHearingAids.mean = tapply(sqrtMistakes, Person, mean)
> sqrtHearingAids.sd = tapply(sqrtMistakes, Person, sd)
> detach(HearingAids)
> sqrtHearingAids.sd

     1      2      3      4      5      6      7      8 
0.5910 1.0100 1.4171 1.3346 0.7875 0.7252 0.8607 0.7638 

> cor(sqrtHearingAids.mean, sqrtHearingAids.sd)

[1] -0.04416

Exhibit 9.6 First interaction plot for the Hearing Aids data.
> attach(HearingAids) > interaction.plot(x.factor = Unit, trace.factor = Person, + response = sqrtMistakes, ylab = "sqrt(Mistakes)", xlab = "Hearing aid type", + trace.label = "Person") > title("Interaction plot for hearing aids and people tested")


Exhibit 9.7 Second interaction plot for the Hearing Aids data.
> interaction.plot(x.factor = Unit, trace.factor = Loss, response = sqrtMistakes, + ylab = "sqrt(Mistakes)", xlab = "Hearing aid type", + trace.label = "Hearing loss") > title("Interaction plot for hearing aids and hearing loss") > detach(HearingAids)



9.3.1 Incorrect analyses

If we are to fit a model to these data that accounts for the hearing loss status of subjects and its interaction with the type of hearing aid, we could end up with the output given in Exhibit 9.8. This model takes no account of the differences among the individuals used in the experiment. Fitting the model presented in Exhibit 9.9 suggests a randomized complete block design where no interaction is possible. Note the degrees of freedom assigned to the blocking factor (Person) and the treatment factor (Unit). It is possible to get R to generate an ANOVA table that has the correct sums of squares and degrees of freedom, as is presented in Exhibit 9.10. The problem with this analysis is that the hypothesis tests are incorrect; well, one of them anyway. We need to test the significance of the hearing loss status against the variability of the people used in the experiment, not against the residuals. It is possible to do this manually of course, but you don't have to!


Exhibit 9.8 Incorrect model 1: This analysis does not allocate degrees of freedom correctly nor does it perform the correct hypothesis tests.
> HearingAids.bad.aov1 = aov(sqrtMistakes ~ Loss * Unit, data = HearingAids) > summary(HearingAids.bad.aov1) Df Sum Sq Mean Sq F value Pr(>F) Loss 3 37.1 12.37 24.10 2e-07 *** Unit 5 12.5 2.49 4.86 0.0033 ** Loss:Unit 15 13.5 0.90 1.75 0.1066 Residuals 24 12.3 0.51 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exhibit 9.9 Incorrect model 2: This model cannot allow for any interaction between individuals and the type of hearing aids.
> HearingAids.bad.aov2 = aov(sqrtMistakes ~ Person + Unit, + data = HearingAids) > summary(HearingAids.bad.aov2) Df Sum Sq Mean Sq F value Pr(>F) Person 7 37.3 5.32 7.26 2.2e-05 *** Unit 5 12.5 2.49 3.40 0.013 * Residuals 35 25.7 0.73 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exhibit 9.10 Incorrect model 3: This model has correct degrees of freedom and sums of squares but the hypothesis tests are incorrect.
> HearingAids.bad.aov3 = aov(sqrtMistakes ~ Loss + Person + + Unit + Loss:Unit, data = HearingAids) > summary(HearingAids.bad.aov3) Df Sum Sq Mean Sq F value Pr(>F) Loss 3 37.1 12.37 20.33 2.7e-06 *** Person 4 0.2 0.04 0.06 0.99 Unit 5 12.5 2.49 4.10 0.01 * Loss:Unit 15 13.5 0.90 1.48 0.20 Residuals 20 12.2 0.61 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Exhibit 9.11 Correct model for the Hearing Aids data. This model tests treatment factors against the appropriate error terms.
> HearingAids.aov = aov(sqrtMistakes ~ Loss * Unit + Error(Person), + data = HearingAids) > summary(HearingAids.aov) Error: Person Df Sum Sq Mean Sq F value Pr(>F) Loss 3 37.1 12.37 329 3.1e-05 *** Residuals 4 0.2 0.04 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Error: Within Df Sum Sq Mean Sq F value Pr(>F) Unit 5 12.5 2.495 4.10 0.01 * Loss:Unit 15 13.5 0.900 1.48 0.20 Residuals 20 12.2 0.609 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.3.2 Correct analysis

The aov() command allows the user to state which error terms should be used for the terms in the model. In fact R will seem fairly sophisticated, as it will assign elements of the model to be tested against the correct error terms without being explicitly told to do so. We use the Error() term in the formula statement of our model fitting. Exhibit 9.11 shows how the commands are structured and how R displays the information. Note that the line in the ANOVA for Person in Exhibit 9.10 matches the line for the residuals in the top part of the output in Exhibit 9.11. The degrees of freedom and the sums of squares are the same for all other rows of the ANOVA table. The hypothesis tests for the type of hearing aid and its interaction with the hearing loss status are the same in both exhibits.

The treatment means for this experiment that arise from the correct model are shown in Exhibit 9.12. Note that implementing the model.tables() command with the se argument set to TRUE returns a warning. If you want the standard errors of the means you will need to find them manually; you could still use R as a calculator though!

We must now check our assumptions. The split-plot model we have created using the aov() command has created a different type of object. You might like to check this out using the class() command for several of the model objects created in this session. An upshot is that use of the plot() command to investigate the residuals from our model fails. R does not yield a full set of residuals when asked using the resid() command. This is because R stores the model differently and has partitioned the residuals that belong to main plots from those belonging to subplots. In total, R has only stored n - 1 residuals;


Exhibit 9.12 Tables of treatment means for the Hearing Aids data.
> model.tables(HearingAids.aov, type = "means", se = TRUE)

Tables of means
Grand mean
2.229 

 Loss 
Loss
Marked Severe Slight   Some 
 2.180  3.584  1.127  2.026 

 Unit 
Unit
    A     B     C     D     E     F 
1.771 2.184 2.497 3.217 1.954 1.755 

 Loss:Unit 
        Unit
Loss     A     B     C     D     E     F    
  Marked 0.866 2.725 3.313 3.739 1.573 0.866
  Severe 2.323 3.936 3.674 4.729 3.603 3.239
  Slight 1.573 0.500 1.000 1.984 0.707 1.000
  Some   2.323 1.573 2.000 2.414 1.932 1.914

they are linearly independent though. You should remember that the sum of the residuals is zero. This means that if you know all but one residual, you can work out the last. In situations like this we say that the last residual is linearly dependent on the others. So, how does R partition the residuals? In our Hearing Aids data there were 48 observations on six subplot treatments applied to each of eight main plots. The residuals for these eight main plot results will sum to zero; there are therefore only seven linearly independent residuals to investigate. These main plot residuals have an impact on the subplot residuals. Given the main plot residual, only five of the six subplot residuals for each main plot are linearly independent. Multiplying five (subplot residuals per main plot) by eight (main plots) gives forty subplot residuals in total. Forty linearly independent subplot residuals and seven linearly independent main plot residuals give 47, which is one less than our total number of observations.

What can we do to investigate the residuals then? We should look at the two sets of residuals separately. They are fundamentally different; we know that because they have a different variance. We won't get much in the way of meaningful plots for the main plot residuals, so we should look at the subplot residuals here first. As we are investigating a set of 40 residuals that have no order, the analysis of the residuals is best restricted to a test of their normality and for heteroscedasticity. The problem we have is that creating vectors of the residuals and fitted values using the

9.4. EXERCISES resid() and fitted() commands shows a problem in relying on these commands.
> HearingAids.fitted = fitted(HearingAids.aov$Within) > HearingAids.resid = resid(HearingAids.aov$Within) > summary(HearingAids.fitted) Min. 1st Qu. -1.180 -0.267 Median 0.176 Mean 3rd Qu. 0.233 0.615 Max. 2.210

103

These fitted values quite obviously have little to do with the real fitted values shown in the output from the model.tables() statement given in Exhibit 9.12. It is therefore difficult to see if our square root transformation actually did work and that our model assumptions are valid. Finding the subplot residuals manually and checking some assumptions is left to an exercise below.

9.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package. If you haven't already done so, you can access the data by issuing the library() command
> library(DRUGS)

Exercise 9.1: Seafood storage: This exercise is taken from Kuehl (2000). The data appear in the file Seafood.csv and show the count of bacteria growing on two types of seafood (Mussels and Oysters). Three of the nine cold storage units were randomly assigned to each of three temperatures. The seafood was investigated after it had been in the units for two weeks. The counts have been log-transformed in an attempt to obviate any heteroscedasticity.

(a) Is there a significant increase in the amount of bacteria as the temperature increases?

(b) Is there a difference between the amount of bacteria growing on the mussels and oysters? Is this a main effect or an interaction with temperature?

(c) Are the assumptions of the split-plot model met for this data?

(d) If only one space was available in each of the nine units for the seafood, how would this experiment be planned? Why is the split-plot experiment better than your alternative design?

In the DRUGS package, this data set is called Seafood and can be obtained using
> data(Seafood, package = "DRUGS")


Exercise 9.2: Find the appropriate standard errors for the treatment factors in the Hearing Aids experiment. Note, you will not need to employ all of Equations 9.3 to 9.6 to do this.

Exercise 9.3: Find the full set of (linearly dependent) residuals for the 48 observations of the Hearing Aids data. Check for any heteroscedasticity. If you find no evidence of heteroscedasticity our model is valid; if there is evidence then we will need to employ a generalised linear model.

Exercise 9.4: It was observed that the model.tables() command does not work when seeking the standard errors of the means. R gives you a warning in this case, but if you ask for effects not means, R gives you some estimates. This is suspicious behaviour. Are the standard errors for the differences provided by R correct?

Exercise 9.5: Hedges et al. (1971) measured the size of the wheal resulting from an injection in the forearm of six subjects who had also been given an oral treatment. Measurements of the wheal's size were taken after 5, 10, and 20 minutes. The data can be found in the DRUGS package and obtained using:
> data(Wheal, package = "DRUGS")

Which treatment leads to the smallest wheals? Choose suitable graphs to show the impact of the treatments for this experiment.

Chapter 10 Fixed and Random Effects in Experiments: Particle Filtration of Respirators


An original chapter
written by

A. Jonathan R. Godfrey1

10.1 Introduction

In this chapter we concern ourselves with the consideration of effects as being fixed or random when analysing an experiment. If the experiment includes covariates, or is unbalanced, then the contents of this chapter are a start, but Chapter 17 on modelling mixed effects models is required reading. Beckman and Nachtsheim (1987) used an example data set from a study on high-energy particulate cartridge filters used with commercial respirators for protection against particulate matter. The data used in this chapter come from three filters taken from each of two manufacturers, and are based on three replicate readings. The data are shown in Exhibit 10.1.

10.2 Models for Random effects

In medical studies experimental units are usually cases, and as they are usually chosen from the population in a serendipitous manner they are not usually an effect of interest.
1

Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 10.1 Penetration rates for three filters from each of two manufacturers

Manufacturer   Filter   Penetration Rate
     1            1     1.12, 1.10, 1.12
     1            2     0.16, 0.11, 0.26
     1            3     0.15, 0.12, 0.12
     2            1     0.91, 0.83, 0.95
     2            2     0.66, 0.83, 0.61
     2            3     2.17, 1.52, 1.58

In this sense they are almost always considered random effects. While the doctors might worry about how treatments are affecting individual cases, the researcher is rarely interested in those particular cases, but rather in how the treatment is performing in the larger population context. In the end, the random allocation of cases to the study is done in the hope that the results obtained from a sample of individuals can be generalized to the wider population.

Random effects ANOVA is often less powerful than fixed effects ANOVA because different, and frequently larger, error terms are used to gauge the importance of factors. For example, if each of four treatments was to be applied to a group of five hospitals (20 hospitals in all), and a set of fifty patients were chosen from each hospital, we would have a nested set of factors. The split plot design discussed in Section 9.2 forms part of the correct analysis, but the way sampled patients represent all patients from a hospital, and for that matter the way a set of hospitals might represent all hospitals, is often a factor to consider differently.

The primary question to ask of any treatment factor is "does the set of levels for this factor cover all possible levels of interest, or do we want to let these levels represent some larger group of possible treatments?". Answering this question for each factor will determine the fixed vs random element for the factor. If factors are fixed, then the analysis will follow those given in Chapters 3, 5, and 9. Another way of looking at this is to ask, "Are we investigating differences between two or more populations, or do we have one population from which elements are drawn?". If we are investigating the differences between two or more populations we probably have a fixed effect to consider. If we have a single population and recognize that there are differences among the units we use in our investigation, these differences must be random and are therefore included as random effects. Whenever we determine that a factor will have a random effect, we will apportion some random variability to that factor. Identifying the sources of randomness in a process can often lead us to direct effort at reducing the largest source of variation, and thereby the variability of the entire process.


Another criterion for choosing fixed vs random effects is reproducibility. An experiment is usually set up to test specific treatment combinations. If planned well, the results from an experiment should be reproducible. Random effects may often be factors that are not directly reproducible, such as different time slots, or observations taken on a set of days with no known structure (specific weekdays do not count here). The current presentation for dealing with random effects is limited to scenarios with one-way and two-way analyses of variance. (This is supposed to be a gentle introduction after all!) If you plan to conduct any experiments with three or more factors with a mixture of fixed and random effects, you will need to find the appropriate analysis in another reference; see Kuehl (2000) for example.

10.2.1 Models for a single factor having a random effect

We introduced the model for a one factor analysis of variance in Chapter 2. This was

\[ y_{ij} = \mu + \tau_i + \epsilon_{ij} \tag{10.1} \]

where $\mu$ is the grand mean, $\tau_i$ is the $i$th treatment effect, and $\epsilon_{ij}$ is the error term for the $j$th replicate within the $i$th treatment group. These $\epsilon_{ij}$s are said to be random; that is, coming from a normal distribution with mean 0 and variance $\sigma^2$.

When we assume the treatment effects $\tau_i$ are to be considered random effects, we say that they will also come from a normal distribution with mean 0 and variance $\sigma^2_\tau$. Obviously, the only way for the treatments to be different is for $\sigma^2_\tau$ to be significantly greater than zero. In a fixed effect context we are interested in differences among pairs of treatment means, e.g. $\tau_i - \tau_j$, but in the random effects context it is the variance $\sigma^2_\tau$ that tells us about the wider population of treatment effects.

The traditional analysis of variance model partitions the total sum of squares of a response variable $y$ into $SS_{Treatment}$ and $SS_{Error}$. The random effects model partitions the variance of $y$ into the two variance components using

\[ \sigma^2_y = \sigma^2_\tau + \sigma^2 \tag{10.2} \]

The ANOVA table for the random effects model is the same as that for the fixed effects model, including the F-test for the significance of the treatment factor. We augment the standard table by adding a new column for the expected mean squares (denoted Expected MS) to give:

Source             df     SS        MS              Expected MS                     F
Among treatments   T-1    SSA       MSA = SSA/dfA   $\sigma^2 + R\sigma^2_\tau$     MSA/MSE
Error              N-T    SSE       MSE             $\sigma^2$
Total              N-1    SSTotal


We estimate the variance components for the random effects model using the quantities given in the standard fixed effects based ANOVA table and the equations

\[ \hat{\sigma}^2 = MSE \tag{10.3} \]

and

\[ \hat{\sigma}^2_\tau = (MSA - MSE)/R \tag{10.4} \]
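A short sketch of Equations 10.3 and 10.4 in R follows; the object my.aov and the value of R are hypothetical, and this assumes a balanced one-way design fitted with aov().

# Hypothetical one-way model with R replicates per treatment group.
my.table = anova(my.aov)
MSA = my.table[1, "Mean Sq"]       # among-treatments mean square
MSE = my.table[2, "Mean Sq"]       # error mean square
R = 5                              # replicates per group (placeholder)
sigma2.error = MSE                 # Equation 10.3
sigma2.treatment = (MSA - MSE)/R   # Equation 10.4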

10.2.2 Models for two random effects

The standard model for a two-factor experiment is

\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \tag{10.5} \]

originally presented as Equation 5.1 in Chapter 5. The effects for the two factors are said to be fixed and the error term $\epsilon_{ijk}$ is always random. We often note this by describing the distribution of the error term as being normal with mean 0 and variance $\sigma^2$. If the other terms are to be considered random, they too will have a normal distribution with mean 0 and variances $\sigma^2_\alpha$, $\sigma^2_\beta$, and $\sigma^2_{\alpha\beta}$. All terms are considered independent of one another.

The expression for the response $y$ when all treatment effects are random follows that of Eq. 5.1, but the expected value will be different. The distribution of $y$ will be normal with mean $\mu$ and variance

\[ \sigma^2_\alpha + \sigma^2_\beta + \sigma^2_{\alpha\beta} + \sigma^2 \tag{10.6} \]

If a treatment factor is fixed, it will affect both the mean and variance of $y$. For example, if Factor A is fixed then the distribution of $y$ will be normal with mean $\mu + \alpha_i$ and variance

\[ \sigma^2_\beta + \sigma^2_{\alpha\beta} + \sigma^2 \]

Note that the interaction of the two factors is random if either of the main effects is considered random. The converse is not necessarily true, as an interaction can be random even if neither of the factors has a random effect. We leave discussion of the situation where one factor is fixed and the other random to the subsection on mixed models below. The tasks required for altering our analysis to cater for the random effects are based on the need to identify the expected mean squares needed to obtain the quantities in Eq. 10.6. Setting up the standard ANOVA table for the two-way model is the first step. Again, a new column is added to the ANOVA table for the Expected MS; these will be used for generating the quantities needed in the evaluation of the expected values that follow.

Source        df               SS     MS              Expected MS
Factor A      A-1              SSA    MSA = SSA/dfA   $\sigma^2 + R\sigma^2_{\alpha\beta} + RB\sigma^2_\alpha$
Factor B      B-1              SSB    MSB             $\sigma^2 + R\sigma^2_{\alpha\beta} + RA\sigma^2_\beta$
Interaction   (A-1)(B-1)       SSAB   MSAB            $\sigma^2 + R\sigma^2_{\alpha\beta}$
Error         AB(R-1)          SSE    MSE             $\sigma^2$
Total         ABR-1            SST

Note in particular that the MS for the interaction has expected value $\sigma^2 + R\sigma^2_{\alpha\beta}$. This suggests that the observed variance due to the interaction of the two factors is made up of two elements. The first is based on errors arising from differences among replicates, and the second is the variation that arises from the different levels of the two treatment factors in combination.

Using the Expected MS values and the observed values for the MS column of a standard ANOVA table, we can solve for the unknown variance component values using:

\[ \hat{\sigma}^2 = MSE \tag{10.7} \]

\[ \hat{\sigma}^2_{\alpha\beta} = (MS_{\alpha\beta} - MSE)/R \tag{10.8} \]

\[ \hat{\sigma}^2_\beta = (MS_\beta - MS_{\alpha\beta})/RA \tag{10.9} \]

and finally,

\[ \hat{\sigma}^2_\alpha = (MS_\alpha - MS_{\alpha\beta})/RB \tag{10.10} \]

We assess the significance of the two factors and their interaction via F-tests of the form:

Source        F-statistic                      df for test
Interaction   $MS_{\alpha\beta}/MSE$           $df_{\alpha\beta}$, $df_E$
Factor A      $MS_\alpha/MS_{\alpha\beta}$     $df_\alpha$, $df_{\alpha\beta}$
Factor B      $MS_\beta/MS_{\alpha\beta}$      $df_\beta$, $df_{\alpha\beta}$

Note that the F-test for the interaction is the same as that for the two-way model with fixed effects. The two F-tests for the main effects are not the same however, as they use a different denominator for the F-statistic. The (what turn out to be) minor differences in the F-tests make our analysis using R fairly simple for a scenario of this type.

As further justification for the use of the F-tests given above, consider what the ANOVA table would look like if we converted the two factors to a single factor analysis, giving a level of the single factor to each of the A × B treatment combinations. The initial ANOVA would look like this:

Source    df         SS     MS
Factor    AB-1       SSF    MSF = SSF/dfF
Error     AB(R-1)    SSE    MSE
Total     ABR-1      SST


We could attempt to partition the sum of squares (and degrees of freedom) for the single factor using either or both of Factors A and B. If we were to partition the SSF for both factors we would use an ANOVA of the form

Source     df           SS     MS
Factor A   A-1          SSA    MSA = SSA/dfA
Factor B   B-1          SSB    MSB
Error      (A-1)(B-1)   SSAB   MSAB
Total      AB-1         SSF

This should appear reminiscent of the split-plot analyses of Chapter 9. The link is no accident and we shall use this fact when performing the analyses in Section 10.3.

10.2.3 Models for a combination of fixed and random effects

If we know that one of our factors is actually to be analysed as a random effect, while another is to be analysed as a fixed effect, we have what is known as a mixed model. In mixed models we are making assertions about the difference in means for the fixed effect while making assertions about the variation of the random effect. The model fitted in Eq. 5.1 is used again, but we now assume that Factor A is fixed and Factor B is random. In similar fashion to previous sections, we now give detail of the ANOVA table for the mixed model with the Expected MS column added.

Source        df               SS     MS              Expected MS
Factor A      A-1              SSA    MSA = SSA/dfA   $\sigma^2 + R\sigma^2_{\alpha\beta} + RB\sigma^2_\alpha$
Factor B      B-1              SSB    MSB             $\sigma^2 + R\sigma^2_{\alpha\beta} + RA\sigma^2_\beta$
Interaction   (A-1)(B-1)       SSAB   MSAB            $\sigma^2 + R\sigma^2_{\alpha\beta}$
Error         AB(R-1)          SSE    MSE             $\sigma^2$
Total         ABR-1            SST

The appropriate F-tests for the two factor model, where Factor A has a fixed effect and Factor B has a random effect, are as follows:

Source        F-statistic                      df for test
Interaction   $MS_{\alpha\beta}/MSE$           $df_{\alpha\beta}$, $df_E$
Factor A      $MS_\alpha/MS_{\alpha\beta}$     $df_\alpha$, $df_{\alpha\beta}$
Factor B      $MS_\beta/MS_{\alpha\beta}$      $df_\beta$, $df_{\alpha\beta}$

While this set of F-tests is the same as given above for the two random effect scenario, it is important to remember that we are testing for the difference in means of the levels of Factor A, while we are testing for the existence of a significant variance component for the random effect of Factor B and the interaction effect, which is also random.
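A brief sketch of how the mixed-model F-test for the fixed Factor A can be formed by hand from the standard ANOVA table follows; the object my2way.aov is hypothetical, assumed to have been fitted as aov(y ~ A * B, data = ...).

# Hypothetical object names; the standard table tests A against the residual
# mean square, so we rebuild the test against the interaction mean square.
tab = anova(my2way.aov)
F.A = tab["A", "Mean Sq"] / tab["A:B", "Mean Sq"]
p.A = pf(F.A, tab["A", "Df"], tab["A:B", "Df"], lower.tail = FALSE)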


It is crucial to note the inappropriateness of testing the significance of the fixed effect if the random interaction effect is significant. If the interaction effect is not significant, we perform the hypothesis test for Factor A. If it shows that a difference in means does exist for some pair(s) of means, we would measure the standard error of the difference using

\[ \sqrt{2\,MS_{\alpha\beta}/r} \tag{10.11} \]

10.2.4 Models for nested effects

In some scenarios we find that the levels of a factor that will have a random effect are in fact meaningless over the levels of another factor. This random effect is then said to be nested within the second factor. The data set on respiratory filters is of this type, as filters labelled 1 from one manufacturer have no relationship to the filters labelled 1 from the other manufacturer. Nested designs are also called hierarchical designs. The linear model for a nested design is

\[ y_{ijkl} = \mu + \alpha_i + \beta_{j(i)} + \gamma_{k(ij)} + \epsilon_{ijkl} \tag{10.12} \]

where Factor B is nested within Factor A and Factor C is nested within Factor B. If a factor is deemed random, then all effects that are lower in the hierarchy are also random; that is, random effects can be nested within fixed effects, but fixed effects cannot be nested within random ones.

We now present the ANOVA table for a balanced two-factor design, with a random factor nested within another factor, and R replicates:

Source                df          SS     MS              Expected MS
Factor A              A-1         SSA    MSA = SSA/dfA   $\sigma^2 + R\sigma^2_{\beta(\alpha)} + BR\sigma^2_\alpha$
Factor B (within A)   A(B-1)      SSB    MSB             $\sigma^2 + R\sigma^2_{\beta(\alpha)}$
Error                 AB(R-1)     SSE    MSE             $\sigma^2$
Total                 ABR-1       SST

As done previously, but not shown here, estimates for the variance components can be calculated using the observed MS values and the Expected MS formulae given above. It should be obvious that this ANOVA table is the same as that for a split-plot analysis. The only difference is the construction of the Expected MS column and the formulae for the variance components. The F-tests are therefore the same as given in Section 9.2.

10.3 Analysis using R

Once the Filters.csv data file has been placed in your working directory, it can be imported and investigated using the following R commands.


> Filters = read.csv("Filters.csv")

> str(Filters)
'data.frame':   18 obs. of  3 variables:
 $ Manufacturer: int  1 1 1 1 1 1 1 1 1 2 ...
 $ Filter      : int  1 1 1 2 2 2 3 3 3 1 ...
 $ Rate        : num  1.12 1.1 1.12 0.16 0.11 0.26 0.15 0.12 0.12 0.91 ...

The str() command shows a quick summary of the structure of our data.frame. There are several problems of note. First the Manufacturer and Filter variables are coded as integers not factors. Second, the use of the same labels for the levels of the two factors might lead to confusion. These concerns can be worked around by some re-coding as shown below. The last and possibly most important concern is that it might appear to some that in fact there are only three different filters under study. This is wrong as there are three filters from each of the two manufacturers, making six in all. Creation of the FilterID variable will prove useful for the modelling exercises that follow. The R commands to resolve the concerns (and check the changes) are as follows.
> Filters$Manufacturer = as.factor(LETTERS[Filters$Manufacturer])
> Filters$Filter = as.factor(Filters$Filter)
> Filters$FilterID = as.factor(paste(Filters$Manufacturer, Filters$Filter,
+     sep = ""))
> str(Filters)
'data.frame':   18 obs. of  4 variables:
 $ Manufacturer: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 2 ...
 $ Filter      : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
 $ Rate        : num  1.12 1.1 1.12 0.16 0.11 0.26 0.15 0.12 0.12 0.91 ...
 $ FilterID    : Factor w/ 6 levels "A1","A2","A3",..: 1 1 1 2 2 2 3 3 3 4 ...

Note the construction of the filter identifier which creates a unique level for each different filter used in the experiment via the paste() command. We can now produce means and standard deviations for the data using
> attach(Filters)
> Filters.mean = tapply(Rate, list(Manufacturer, Filter), mean)
> Filters.sd = tapply(Rate, list(Manufacturer, Filter), sd)
> detach(Filters)
> Filters.mean
       1      2     3
A 1.1133 0.1767 0.130
B 0.8967 0.7000 1.757
> Filters.sd
        1       2       3
A 0.01155 0.07638 0.01732
B 0.06110 0.11533 0.35921

Alternatively, we could achieve the same ends (but in a slightly different format) using

> attach(Filters)
> Filters.mean = tapply(Rate, FilterID, mean)
> Filters.sd = tapply(Rate, FilterID, sd)
> detach(Filters)
> Filters.mean
    A1     A2     A3     B1     B2     B3
1.1133 0.1767 0.1300 0.8967 0.7000 1.7567
> Filters.sd
     A1      A2      A3      B1      B2      B3
0.01155 0.07638 0.01732 0.06110 0.11533 0.35921

The temptation in this instance is to fit the model using the following code
> Filters.bad.aov = aov(Rate ~ Manufacturer/Filter, data = Filters)
> summary(Filters.bad.aov)
                    Df Sum Sq Mean Sq F value  Pr(>F)    
Manufacturer         1   1.87   1.869    73.6 1.8e-06 ***
Manufacturer:Filter  4   3.74   0.935    36.8 1.2e-06 ***
Residuals           12   0.30   0.025                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-test for manufacturer is incorrect in this analysis as the wrong error term has been used. We need to specify the correct error term using
> Filters.aov1 = aov(Rate ~ Manufacturer + Error(FilterID),
+     data = Filters)
> summary(Filters.aov1)

Error: FilterID
             Df Sum Sq Mean Sq F value Pr(>F)
Manufacturer  1   1.87   1.869       2   0.23
Residuals     4   3.74   0.935               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 12  0.305  0.0254               

This output does not show us how important the random effect of filter is in the context of the experiment. We determine this using
> Filters.aov2 = aov(Rate ~ FilterID, data = Filters)
> summary(Filters.aov2)
            Df Sum Sq Mean Sq F value  Pr(>F)    
FilterID     5   5.61   1.122    44.2 2.6e-07 ***
Residuals   12   0.30   0.025                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The relevant lines from the two ANOVA tables would be brought together after both analyses have been obtained and would result in the following ANOVA table.

Source         Df   Sum Sq   Mean Sq   F value   Pr(>F)
Manufacturer    1   1.8689    1.8689    1.9981   0.2304
Residuals 1     4   3.7413    0.9353
FilterID        5   5.6102    1.1220   44.194    0.0000
Residuals 2    12   0.3047    0.0254

As there is no difference between manufacturers, but there is some difference among the filters, the model is simplified to the one-way random effect model given in Eq.10.1. We would probably wish to identify the sources of the total variation in filters when finalising a report using the variance components. In this instance, σ² is estimated using the MS for Residuals 2, so σ̂²_Reps = 0.0254 and might be labelled the within-filter variation. The variance component for among-filter variation is σ̂²_Filters = (1.1220 - 0.0254)/3 = 0.3655.
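The same variance component estimates can be pulled out of the fitted objects rather than typed in by hand. The lines below are a minimal sketch (not part of the original analysis) that extracts the mean squares from Filters.aov2; the divisor 3 is the number of replicate measurements per filter.

> aov2.ms <- summary(Filters.aov2)[[1]][["Mean Sq"]]
> MS.filters <- aov2.ms[1]    # MS for FilterID
> MS.within <- aov2.ms[2]     # MS for Residuals 2
> c(within = MS.within, filters = (MS.filters - MS.within)/3)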

10.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package. If you haven't already done so, you can access the data by issuing the library() command
> library(DRUGS)

Exercise 10.1: Use the data on water quality discussed in the Exercises for Chapter 3 on page 27. The blocking factor might be thought of as a random effect given we do not know how the times were chosen. Obtain the appropriate analysis of the data and show the model for the expected value of the water quality and its variance components. In the DRUGS package, this data set is called WaterQuality and can be obtained using
> data(WaterQuality)

Exercise 10.2: Consider the data in the file Pastes.csv which come from an experiment on the strength of chemical pastes that appeared in Davies and Goldsmith (1972). Ten randomly selected batches of paste (three randomly chosen casks per batch) are thought to show that the batches of paste have different strengths. Each cask is analysed twice. Determine the various sources of variation and their variance components. In the DRUGS package, this data set is called Pastes and can be obtained using
> data(Pastes)

Exercise 10.3: Investigate the broccoli data given in the faraway package. A full explanation of the experiment is given in the help file for the data which can be obtained by typing:

> library(faraway)
> `?`(broccoli)

It is reasonable to assume that the grocer would be concerned if any of the broccoli clusters were underweight. We might also want to know if any of the different sources of broccoli are worse than others. Consider these notions and then test relevant hypotheses. What do you conclude?

Chapter 11 Crossover Designs and the Analysis of Variance: Treatment of Acute Bronchial Asthma and Systolic Blood Pressures
An original chapter
written by

A. Jonathan R. Godfrey1

11.1 Introduction

In many clinical trials the human subjects are given all of the treatments over a period of time. Each subject is given the treatments in a random order however, so we are able to ascertain the impact of the treatments after application at a different stage of the trial. If the trial is planned well, and the ethical application of treatment allows, there is often a period of time between each treatment called a wash out period. This attempts to eliminate or at least limit the impact of any carry over effects.

The simplest crossover design is known as a 2×2 crossover experiment as two treatments are given in two time periods to a number of subjects. The first data set we use in this chapter is of this type. Patel (1983) used the data to illustrate how to incorporate data from the wash out and run in periods to improve the analysis of the data from the active treatment periods. A sample of the data appears in Exhibit 11.1.
¹ Jonathan is a lecturer in the Institute of Fundamental Sciences.

Exhibit 11.1 Selected observations from the Asthma data set.

                 Run-in    Period 1          Wash-out    Period 2
Group  Subject    FEV1    Treatment  FEV1      FEV1     Treatment  FEV1
 AB       1       1.09        A      1.28      1.24         B      1.33
 AB       2       1.38        A      1.6       1.9          B      2.21
 ...     ...       ...       ...      ...       ...        ...      ...
 AB       8       1.69        A      2.41      1.9          B      2.79
 BA       1       1.74        B      3.06      1.54         A      1.38
 ...     ...       ...       ...      ...       ...        ...      ...
 BA       8       2.41        B      3.35      2.83         A      3.23
 BA       9       0.96        B      1.16      1.01         A      1.25

FEV1 values are the forced expired volume in one second, measured in litres.

The full data set contains the FEV1 values (forced expired volume in one second, measured in litres) for seventeen patients with acute bronchial asthma given two active treatments. Note that there are eight subjects in group AB and nine in group BA. The second example used in this chapter comes from a slightly more interesting crossover experiment. It tests two treatments, but over three time periods. Some patients start in a treatment and switch to the other treatment for two periods, while others start in a treatment and switch twice. The original reference for this work is Ebbutt (1984), but the experiment is used in Jones and Kenward (1989) and Ratowsky et al. (1993), both of which are well regarded texts on the topic of crossover experiments. The systolic blood pressure is recorded for a large number of patients, a selection of which appear in Exhibit 11.2 where results for the three active treatment periods are given. The run-in period measurement also exists, see Jones and Kenward (1989), but is not used in this chapter.

11.2 Analysis of Crossover Designs

11.2.1 Identifying a crossover designed experiment

Exhibit 11.2 Selected observations from the Systolic Blood Pressure data set.

Subject   Period   Sequence   Treatment   Carry-over   Systolic BP
 1.22        1        ABB         A          None          190
 1.22        2        ABB         B          AB            150
 1.22        3        ABB         B          BB            170
 2.27        1        BAA         B          None          126
 2.27        2        BAA         A          BA            140
 2.27        3        BAA         A          AA            138
 3.23        1        ABA         A          None          130
 3.23        2        ABA         B          AB            120
 3.23        3        ABA         A          BA            130
 4.17        1        BAB         B          None          140
 4.17        2        BAB         A          BA            130
 4.17        3        BAB         B          AB            130

If we know that the experimental units are to be given a number of treatments over time and that the order in which the treatments are applied will vary, it is likely that some aspects of crossover designs need to be considered in the analysis.

The interesting aspect of crossover designs is the fact that the treatments are applied to the same experimental units so each experimental unit becomes an identifiable level of a blocking factor. The crossover design takes advantage of this fact as it builds on the knowledge that the repeated measurements on a single individual are likely to be less variable than the set of measurements from different individuals. This means the researcher does not need to establish small sets of homogeneous experimental units (blocks) so that each member can be given one of the treatments. As it happens, finding pairs of homogeneous subjects is usually so difficult that caliper matching is used instead of finding pairs of subjects that are perfectly identical.

Alongside the application of the treatments to the experimental units, the order in which the treatments are applied is also important; the number of possible orders will be small for low numbers of treatments, but much larger for even a moderate number of treatments. For example, if there are only two treatments, say a new treatment and a placebo, then the subjects will either get the new treatment and then the placebo, or the other way around. For each subject, we record a result for each of the two treatments, as well as which order they took the treatments. In this case there are only two possible orders for the two treatments. For three treatments, there are six possible orders: ABC, ACB, BAC, BCA, CAB, and CBA. For four treatments there are 24 possible orderings; five treatments would mean 120 possible orderings etc. Sometimes the explicit recording of the order of treatments will not be given as a distinct variable, but will be inferred from the way other results are presented.
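These counts are easily checked in R. The small helper function below is hypothetical (written only for this aside, not part of the original text); it enumerates all orders of a set of treatment labels, while factorial() gives the counts directly.

> factorial(2:5)
[1]   2   6  24 120
> perms <- function(x) if (length(x) == 1) list(x) else
+     unlist(lapply(seq_along(x), function(i)
+         lapply(perms(x[-i]), function(p) c(x[i], p))), recursive = FALSE)
> sapply(perms(LETTERS[1:3]), paste, collapse = "")
[1] "ABC" "ACB" "BAC" "BCA" "CAB" "CBA"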

11.2.2 Randomized complete crossover experiments

It is important to identify which aspects of the data being collected are based on random assignment and which are not. We can randomly assign the order in which the treatments are given to any individual by creating the possible list of orders and choosing one for each subject. Placing constraints on the random assignment is dependent on the scenario, but


we would normally expect some replication of each order, and some effort being made to cover all different orders. We would not, for example, want to find that the third treatment of a set of three was only ever applied in period 2 or 3 and never in the first period. If this occurs, we have an unbalanced crossover design and our analysis will be made all that much more difficult as we attempt to uncover the real impact each treatment has. It is therefore strongly advisable that the trial is planned well and especially that there are a large number of subjects. It is common to take measurements on patients during the wash out phases of clinical trials, as well as during a run in period where no treatments have yet been applied. These measurements provide the researcher with a better understanding of the experimental units and can therefore assist with their analysis (Patel, 1983). Use of a wash out period between the active treatment application times should help minimise the impact of carry over effects. Note however, that there is always a risk that an effect from the first time period will be carried over to the second and subsequent time periods. A first order carry over effect arises when a treatment affects the results in the time period following its application, and in some crossover experiments we can estimate these effects. Second order and higher order carry over effects can also exist and make the analysis even more challenging. They are not covered in this chapter.

11.2.3 The analysis of variance model

Let us assume for the purposes of the discussion that follows, that there are an equal number R of subjects in each group, and to ensure balance in the experiment that the number of groups G is equal to the number of possible orderings of the A treatments; this means G = A!. Even though we indicate the B different time periods in the following discussion, we know that B = A. The simplest model for a crossover design experiment is

y_ijk = μ + s_ik + π_j + τ_d(i,j) + ε_ijk     (11.1)

where μ is the grand mean; s_ik is the block effect of subject k from the ith group; π_j is the jth time period effect; τ_d(i,j) is the direct effect of the treatment administered in period j for group i; and ε_ijk is the error term for the kth subject of the ith treatment group within the jth time period.

It is possible to add a term into the model for the group variable. This would result in the s_ik term for subjects being partitioned into a group and a non-group effect. If randomisation has been applied, and there were enough experimental units, then the need to test the hypothesis for differences among groups is irrelevant. There may be scenarios where the need to partition this source of variation is warranted.

Exhibit 11.3 An ANOVA table for six groups of four subjects taking three treatments over three time periods.

Source of variation                                      degrees of freedom
Groups                                                   G - 1 = A! - 1 = 5
Among Subjects within groups                             (R - 1)G = 3 × 6 = 18
Total for Subjects                                       RG - 1 = 4 × 6 - 1 = 23
Time periods                                             B - 1 = 3 - 1 = 2
Direct treatment effects                                 A - 1 = 3 - 1 = 2
First order carry over effects (after direct effects)    2
Residual                                                 42
Total                                                    N - 1 = RGA - 1 = 4 × 6 × 3 - 1 = 71

We can add a first order carry over effect λ_d(i,j-1) to our model. This is the effect of the carry over of the treatment administered in period j - 1 for group i; note that λ_d(i,0) = 0.

y_ijk = μ + s_ik + π_j + τ_d(i,j) + λ_d(i,j-1) + ε_ijk     (11.2)

These carry over effects are conditional on the direct treatment effect also included in the model. This means we have a pair of effects that are not orthogonal, and when it is time to use R, we will need to reorder these effects to gauge the overall impact they are having. Lucas (1957) proved that the addition of an extra time period to a crossover design can alleviate this problem. As we will see when analyzing the data from Ebbutt (1984), this only occurs if the treatments applied in the last two periods are the same, as is the case for the first and second treatment sequences. Under these circumstances, we will have orthogonal direct and carry over treatment effects. If the last two periods do not follow this rule, as occurs for the third and fourth sequences in the full Ebbutt (1984) data, we are left with the need to perform an analysis catering for non-orthogonal effects. To develop an example of an ANOVA table for a crossover design, let us assume we have four subjects in each group taking three treatments over three time periods. There are therefore six groups required to get all orders replicated equally. The sources of variation and their associated degrees of freedom are given in Exhibit 11.3.
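Before the carry over term λ_d(i,j-1) can appear in an R model formula, the data need a column holding the treatment given in the previous period (with a level such as "None" for the first period). The SystolicBP data used later already contain such a column (Carry1), and Exercise 11.4 asks you to build one yourself. The sketch below is only one possible way to do it; the function name is hypothetical, and it assumes long-format data sorted by Subject and then Period, with a Treatment column.

> make.carry1 <- function(dat) {
+     # previous period's treatment within each subject, "None" in period 1
+     prev <- ave(as.character(dat$Treatment), dat$Subject,
+                 FUN = function(x) c("None", head(x, -1)))
+     factor(prev)
+ }

It would then be used as, for example, mydata$Carry1 <- make.carry1(mydata), where mydata is a placeholder for a data frame organised as described above.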

11.2.4 Analysis of a 2×2 crossover design

Regardless of the number of subjects included in a crossover design with two treatments tested in two time periods, we will not be able to estimate all possible effects for the treatment groups (= order of treatments), the interaction of the time period with the treatment, and carry over effects.


The reason is that there are four sets of information, where each set has some replication. These arise from the results of applying treatments A or B in time periods 1 or 2. After extracting any information for subjects, we are able to use three degrees of freedom to estimate the effect of time (one degree of freedom), treatment (one degree of freedom), and then one more effect, as long as it has just one degree of freedom. In fact, this last degree of freedom is the total of the remaining treatment effects all merged into one confounded effect. Establishing exactly which of the causes is actually responsible is not possible. We will see this in Section 11.3.1 below when we use R for the analysis of the Asthma data.

The main problem of using a 2×2 crossover design for an experiment is that there are a number of effects that can be confounded with the treatment effect under consideration.

We must be concerned about the chance that the two treatment groups are actually different, but this can be managed by carefully selecting patients for the study, and their allocation to treatment groups after analysis of their pre-experiment attributes. Another concern that is less easy to measure is the effect that arises if the treatments differ in their effectiveness if there are differences in the intensity of the condition being treated. It is possible that both treatments are effective to some extent and that the more effective treatment in the first period is less effective in the second period because the condition being treated has changed due to having already been treated. A very real concern, that can at least be protected against, is the physical carry over effect from having the two treatment periods too close together. The incorporation of a wash-out phase is the protection that can be introduced in many scenarios. It is possible of course for the patients to feel different about the second treatment they are given based on the outcome of the treatment given in the first period. This psychological, but nonetheless real, effect cannot be tested explicitly in a 2×2 design.

For these reasons, the 2×2 design has been embellished in many ways where extra time periods are introduced, with different orders of the two treatments being a common enhancement. A more comprehensive text, such as Jones and Kenward (1989) or Ratowsky et al. (1993), should be consulted if the design is enhanced beyond the scope of this chapter.

11.2.5 Unbalanced crossover designs: large number of experimental units

In this discussion, 'large' simply means we have more experimental units than the possible number of orderings of the treatments over time. In these circumstances we would want to achieve the greatest balance possible for the number of subjects assigned to each treatment ordering. This will lead to the most efficient

use of resources for estimating all carry-over effects. Given the likelihood of unbalance in general, the designer of any unbalanced crossover design should consider investigating how often treatment j follows in the period after treatment i, irrespective of the timing of the two treatments. There are certain Latin squares (the building block of the crossover design) that lead to a cyclic pattern of the order of treatments over time.
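One quick way to investigate this, sketched here as an aside (it is not part of the original text), is to cross-tabulate each period's treatment against the treatment given in the previous period. Using the carry-over coding in the SystolicBP data loaded in Section 11.3.2:

> with(SystolicBP, table(Previous = Carry1, Current = Treatment))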

11.2.6 Unbalanced crossover designs: small number of experimental units

In this discussion, 'small' simply means we have fewer experimental units than the possible number of orderings of the treatments over time. This poses much more difficulty than the opposite problem. We are now unable to estimate all carry over effects. There will be a situation where treatment i is not followed in the next time period by treatment j, for some i, j. Choosing which of these combinations is to be left out is the challenge that is faced by those who are designing small crossover experiments. There are Latin squares that ensure all treatments are tested before and after every other treatment the same number of times. Use of these options as the basis of constructing the crossover experiment is preferable to a Latin square following a cyclic pattern.

11.3 Analysis using R

11.3.1 Analysis of the Asthma data.

We obtain the Asthma data and confirm its structure using:


> data(Asthma, package = "DRUGS")
> str(Asthma)
'data.frame':   17 obs. of  7 variables:
 $ Group  : Factor w/ 2 levels "AB","BA": 1 1 1 1 1 1 1 1 2 2 ...
 $ RunIn  : num  1.09 1.38 2.27 1.34 1.31 0.96 0.66 1.69 1.74 2.41 ...
 $ Treat1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 2 2 ...
 $ Period1: num  1.28 1.6 2.46 1.41 1.4 1.12 0.9 2.41 3.06 2.68 ...
 $ WashOut: num  1.24 1.9 2.19 1.47 0.85 1.12 0.78 1.9 1.54 2.13 ...
 $ Treat2 : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 1 1 ...
 $ Period2: num  1.33 2.21 2.43 1.81 0.85 1.2 0.9 2.79 1.38 2.1 ...

The two response variables of the original data frame need to be converted into an object with a single response variable that includes the necessary explanatory variables. The c(), rep(), and data.frame() commands can be used to rearrange the data as required.

> attach(Asthma)
> FEV1 = c(Period1, Period2)
> Time = rep(1:2, each = 17)
> Subjects = rep(row.names(Asthma), 2)
> Groups = rep(Group, 2)
> Treatment = c(Treat1, Treat2)
> detach(Asthma)
> Asthma2 = data.frame(FEV1, Subjects, Time, Groups, Treatment)
> str(Asthma)
'data.frame':   17 obs. of  7 variables:
 $ Group  : Factor w/ 2 levels "AB","BA": 1 1 1 1 1 1 1 1 2 2 ...
 $ RunIn  : num  1.09 1.38 2.27 1.34 1.31 0.96 0.66 1.69 1.74 2.41 ...
 $ Treat1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 2 2 ...
 $ Period1: num  1.28 1.6 2.46 1.41 1.4 1.12 0.9 2.41 3.06 2.68 ...
 $ WashOut: num  1.24 1.9 2.19 1.47 0.85 1.12 0.78 1.9 1.54 2.13 ...
 $ Treat2 : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 1 1 ...
 $ Period2: num  1.33 2.21 2.43 1.81 0.85 1.2 0.9 2.79 1.38 2.1 ...

We can now fit a model to the data. To apply analysis of variance, we can combine the aov() function and the summary() method to give the ANOVA table.
> summary(Asthma.aov1 <- aov(FEV1 ~ Subjects + Groups + Time +
+     Treatment, data = Asthma2))
            Df Sum Sq Mean Sq F value  Pr(>F)    
Subjects    16  14.86   0.929    7.79 0.00013 ***
Time         1   0.20   0.202    1.69 0.21277    
Treatment    1   0.56   0.557    4.68 0.04716 *  
Residuals   15   1.79   0.119                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(Asthma.bad.aov2 <- aov(FEV1 ~ Groups + Subjects +
+     Time + Treatment, data = Asthma2))
            Df Sum Sq Mean Sq F value  Pr(>F)    
Groups       1   2.22   2.221   18.63 0.00061 ***
Subjects    15  12.64   0.843    7.07 0.00025 ***
Time         1   0.20   0.202    1.69 0.21277    
Treatment    1   0.56   0.557    4.68 0.04716 *  
Residuals   15   1.79   0.119                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the first row of the upper ANOVA table just given is partitioned into the first two rows of the second ANOVA table. Specifying the Group variable after the Subjects variable in the first model has had no effect, as R has been unable to include the Group variable after the totality of subject effects have been removed by fitting the Subjects variable first. Printing both ANOVA tables may prove useful for presenting the findings from this experiment, but the hypothesis test for the Group variable is incorrect. Refer to the discussion of fixed versus random effects in Chapter 10 for greater detail on this matter. The correct ANOVA table is obtained using:



> summary(Asthma.aov3 <- aov(FEV1 ~ Groups + Time + Treatment +
+     Error(Subjects), data = Asthma2))

Error: Subjects
          Df Sum Sq Mean Sq F value Pr(>F)
Groups     1   2.22   2.221    2.64   0.13
Residuals 15  12.64   0.843               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)  
Time       1  0.202   0.202    1.69  0.213  
Treatment  1  0.557   0.557    4.68  0.047 *
Residuals 15  1.788   0.119                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now fit a model that is not useful for the correct analysis, but does illustrate some points made in the theoretical discussion of the 2×2 crossover design. We attempt to fit interaction terms using:

> summary(Asthma.bad.aov4 <- aov(FEV1 ~ Groups * Time * Treatment +
+     Error(Subjects), data = Asthma2))

Error: Subjects
          Df Sum Sq Mean Sq F value Pr(>F)
Groups     1   2.22   2.221    2.64   0.13
Residuals 15  12.64   0.843               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)  
Time       1  0.202   0.202    1.69  0.213  
Treatment  1  0.557   0.557    4.68  0.047 *
Residuals 15  1.788   0.119                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R has been unable to incorporate these terms in the model because each of the two-way interaction terms is confounded with a main effect variable. Of course, if the two-way interaction terms cannot be fitted then there is no hope of fitting a three-way interaction. As a final check on the validity of the model, we can examine the model's residuals. This is given in Exhibit 11.4 using the simplest model (without the group effect partitioned out of the subjects effect), and confirms the suitability of the model.

11.3.2 Analysis of the systolic blood pressure data

After obtaining the data using:


> data(SystolicBP, package = "DRUGS")

and considering its structure using:


> str(SystolicBP)


Exhibit 11.4 Residual plots for the 2×2 crossover design fitted to the Asthma data.
> par(mfrow = c(2, 2))
> plot(Asthma.aov1)

[Figure: the four standard diagnostic plots for Asthma.aov1 - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

'data.frame':   267 obs. of  6 variables:
 $ Subject  : Factor w/ 82 levels "1.1","1.11","1.12",..: 1 1 1 11 11 11 14 14 14 15 ...
 $ Period   : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ...
 $ Sequence : Factor w/ 4 levels "ABA","ABB","BAA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Treatment: Factor w/ 2 levels "A","B": 1 2 2 1 2 2 1 2 2 1 ...
 $ Carry1   : Factor w/ 3 levels "A","B","None": 3 1 2 3 1 2 3 1 2 3 ...
 $ SBP      : num  159 140 137 153 172 155 160 156 140 160 ...

we can calculate the means of the treatment responses for each group of patients in each time period.
> attach(SystolicBP)
> tapply(SBP, list(Period, Sequence), mean)
    ABA   ABB   BAA   BAB
1 158.9 157.1 147.1 149.3
2 146.8 151.4 150.6 156.0
3 153.4 145.9 150.4 142.5
> detach(SystolicBP)

Ratowsky et al. (1993) uses the first two groups to illustrate how a two treatment, three period, two order (= two groups) analysis would be carried out. We replicate that analysis here as the original reference uses different software in its presentation. First, extracting the first two groups is done by:



> SystolicBP2 = SystolicBP[1:147, ]

Let us first examine the ANOVA tables with the direct and carry over treatment effects entered in both possible orders:
> summary(SBP.aov1 <- aov(SBP ~ Sequence + Period + Treatment +
+     Carry1 + Error(Subject), data = SystolicBP2))

Error: Subject
          Df Sum Sq Mean Sq F value Pr(>F)
Sequence   1    158     158    0.24   0.63
Residuals 43  28139     654               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)  
Period     2    276     138    0.73  0.485  
Treatment  1   1134    1134    5.99  0.016 *
Carry1     1    173     173    0.91  0.341  
Residuals 98  18544     189                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(SBP.aov1 <- aov(SBP ~ Sequence + Period + Carry1 +
+     Treatment + Error(Subject), data = SystolicBP2))

Error: Subject
          Df Sum Sq Mean Sq F value Pr(>F)
Sequence   1    158     158    0.24   0.63
Residuals 43  28139     654               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)  
Period     2    276     138    0.73  0.485  
Carry1     1    173     173    0.91  0.341  
Treatment  1   1134    1134    5.99  0.016 *
Residuals 98  18544     189                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the sums of squares for the effects are the same in both orderings, we conclude the two effects are orthogonal, and that either of the tables would be presentable, although perhaps the former ANOVA table makes a little more sense from a logical perspective. Extending the analysis to include all four groups in the data set shows us that just adding an extra time period is not a general solution to the problem of non-orthogonal effects in crossover experiments.
> summary(SBP.aov1 <- aov(SBP ~ Sequence + Period + Treatment +
+     Carry1 + Error(Subject), data = SystolicBP))

Error: Subject
          Df Sum Sq Mean Sq F value Pr(>F)
Sequence   3    654     218    0.22   0.88
Residuals 78  77899     999               

Error: Within
           Df Sum Sq Mean Sq F value  Pr(>F)    
Period      2    890     445    2.43   0.091 .  
Treatment   1   3491    3491   19.05 2.1e-05 ***
Carry1      1     10      10    0.06   0.812    
Residuals 181  33164     183                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(SBP.aov1 <- aov(SBP ~ Sequence + Period + Carry1 +
+     Treatment + Error(Subject), data = SystolicBP))

Error: Subject
          Df Sum Sq Mean Sq F value Pr(>F)
Sequence   3    654     218    0.22   0.88
Residuals 78  77899     999               

Error: Within
           Df Sum Sq Mean Sq F value  Pr(>F)    
Period      2    890     445    2.43   0.091 .  
Carry1      1    401     401    2.19   0.141    
Treatment   1   3100    3100   16.92 5.9e-05 ***
Residuals 181  33164     183                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now see that the impact of the carry over effect is negligible, irrespective of its placement in the ANOVA table.

11.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 11.1: In a second experiment measuring the critical flicker frequency (CFF) on eight subjects, Hedges et al. (1971) used a crossover design. The data can be found in the DRUGS package and obtained using:
> data(Flicker2, package = "DRUGS")

Compare the two treatments used in this follow-up experiment. What conclusions do you make about the difference observed between them?

Exercise 11.2: Create a data set for a three period, three-treatment experiment that has 18 subjects. If it helps, the context is that a small herd of 18 cows will be tested for their stress levels after administration of two alternative drugs by injection. A third injection of a saline solution (placebo) is to be used to offer a baseline comparison. Randomly generate a response variable and see if the resulting ANOVA has the structure you expect.

Exercise 11.3: In the discussion on small unbalanced crossover experiments (Section 11.2.6), an assertion was made about the selection of the Latin square used to create the design. Consider the assertion and link it to a medical experiment with four treatment regimes,

and for argument's sake, only 8 or 12 subjects. You should refer to Exhibit 4.2 which lists the four different 4×4 Latin squares that might be used as the starting points.

Exercise 11.4: Albert et al. (1974) used a four-period crossover experiment to test

the efficacy of four different administrations of a drug used in the management of grand mal and psychomotor seizures:

A: 100mg generic DPH product in solution,
B: 100mg manufacturer DPH in capsule,
C: 100mg generic DPH product in capsule,
D: 300mg manufacturer DPH in capsule.

The data can be obtained using
> data(GrandMal, package = "DRUGS")

Note that the file GrandMal.csv is not coded correctly. The help file for this data is a useful starting point. Use this data to perform an analysis of a four-period crossover design. You will need to construct the necessary variable for the carry-over effects and carefully check the degrees of freedom assigned to all effects. You should check the way this design has been created, and might also consider how the design could have been improved.

Chapter 12 Multivariate Analysis of Variance (MANOVA): Biomass of Marsh Grass


An original chapter
written by

Siva Ganesh1 & A. Jonathan R. Godfrey2

12.1 Introduction

A research study was carried out to identify the important soil characteristics influencing aerial biomass production of the marsh grass (Spartina alterniflora). One phase of research consisted of sampling three types of Spartina vegetation in each of three locations. Samples of the soil substrate from 10 random sites within each location-vegetation type were analysed for soil physico-chemical characteristics and above-ground biomass. The data used in this chapter consist of only the September sampling, and only on five substrates, namely, Salinity (SAL), Acidity as measured in water (pH), Potassium (K), Sodium (Na) and Zinc (Zn), and the aerial biomass (BIO). These data are available in the file biomass.csv with columns Location, Vegetation, BIO, SAL, pH, K, Na, and Zn. The three types of vegetation are, re-vegetated dead areas (RVEG), short Spartina (SHRT) areas, and tall Spartina areas (TALL), while the three locations are, Oak Island (OI), Smith Island (SI), and Snows Marsh (MS). A sample is shown in Exhibit 12.1.
¹ Ganesh is a former colleague in the Statistics group of the Institute of Fundamental Sciences who has moved to AgResearch Ltd. Please contact the editor for any queries relating to this chapter.
² Jonathan is a lecturer in the Institute of Fundamental Sciences.


Exhibit 12.1 Biomass data. Some information for marsh grass samples.
Sample  Location  Vegetation  Biomass  Salinity   pH       K        Na       Zn
  1        OI        RVEG        511      36     4.56   1294.89  28170.5  13.8211
  2        OI        TALL        981      32     4.04    554.5    7895.5   9.855
  3        SI        TALL       1660      28     7.36    458.36   8717.5   0.2648
  4        OI        SHRT        645      39     4.7    1245.6   27581.2  14.1002
  5        MS        RVEG       1199      28     4.59    892.86  12514.5  20.8229
  6        OI        TALL       1735      35     4.21    567.36  11975    10.1201
  7        MS        TALL       1399      28     5.54    777.74  13653.6  19.1695
  8        OI        RVEG        868      30     4.53   1045.25  25088.2  16.414
  9        OI        RVEG       1051      31     4.26   1157.26  26459.6  14.7569
 10        MS        TALL       1402      27     5.45    775     13631    20.3452
 11        OI        TALL       1280      32     3.29    523.65   9834.9  13.266
 12        SI        RVEG        332      27     3.14    348.51   8569.4  28.9857
 13        MS        RVEG       2078      24     4.75   1042.77  22967.3  19.4579
 14        SI        RVEG        241      36     3.33    579.82  14722.8  17.8468
 15        MS        SHRT        416      28     3.96    953.26  16484.4  30.8377
...       ...         ...        ...     ...      ...       ...      ...      ...

The main aim here is to explore whether there are significant differences among vegetation types, among locations, and among location-vegetation combinations based on the collective behaviour of the six response variables (i.e. five substrates and the biomass). In other words, we wish to examine:

- how vegetation types compare in terms of the six response variables collectively (the main effect of, say, the vegetation experimental factor);
- how locations compare in terms of the six response variables collectively (the main effect of, say, the location experimental factor);
- how locations influence vegetation-type differences (or vice versa) with respect to the six response variables collectively (the interaction effect between the vegetation and location factors).

We may define the main effect of an experimental factor as differences between the mean responses for the different levels of that factor, averaging over all levels of all other factors. On the other hand, the interaction effect between two factors can be treated as the variation of the differences between mean responses for different levels of one factor over different levels of the other factor.

It is important to realise that the interaction between two factors is symmetric. For example, we may discuss the interaction between vegetation-types and locations in terms


of the way differences between vegetation-types vary according to which of the three locations is being considered. We could equally well discuss how the differences between the three locations are compared for each of the vegetation-types.

12.2 The Multivariate Analysis of Variance (MANOVA) model

12.2.1 Introduction

Before defining precisely what we mean by multivariate analysis of variance (or MANOVA), let us first remind ourselves what the univariate analysis of variance (usually denoted by ANOVA) does. Since this is only a reminder, the reader is expected to have some knowledge about ANOVA including familiarity with experimental designs such as, completely randomised design, randomised block design and factorial design, and their analyses of variance.

The basic difference between MANOVA and ANOVA is that MANOVA is ANOVA in which a single response variable is replaced by several response variables. Hence, MANOVA examines the differences of the means of several response variables, simultaneously, among the distinct groups or experimental factors. The response variables may be substantially different from each other (for example, height, weight, IQ, test scores etc.); or they may be the same substantive item measured at a number of different times (for example, milk production at different times or days or weeks). The former set of variables leads to the case of multivariate analysis of variance, while the latter forms the essence of repeated measures problems which are usually seen as a special case of ANOVA (hence, are not discussed here).

Many researchers expect MANOVA to be more complicated than ANOVA, naturally! In some ways this expectation is fully justified. So, a natural follow-up question to the question "What is MANOVA?" should be "What is it used for?" or "Why should we use it?" There are, roughly speaking, three answers to this question:

- Often a researcher may have several variables observed, each of interest in its own right, and may wish to explore each of them separately. Each may be analysed at a conventional significance level (say, 5%) but this can mean a very high chance of concluding that some relationship exists when in fact none does. MANOVA, and the associated test procedures, permit one to control this effect.
- MANOVA should be used when interest lies, not in the individual response variables as measured, but in some combination of them. A very simple example of this is the change score, i.e. the difference between two variables (for example, initial weight and final weight, or measurement at time 1 and at time 2).
- MANOVA should be used when interest lies in exploring between-group patterns of differences on a set of variables in toto. This means that the individual variables are of no intrinsic interest; it is their union that matters.

12.2.2 Computations for MANOVA

Computations associated with MANOVA are essentially similar to those of ANOVA. However, the testing of hypotheses and the interpretation of results are more complicated. In a univariate ANOVA, the total sum of squared deviations about the grand mean, denoted SS(Total), is partitioned into a sum of squares due to one or more sources of variation (e.g. SS due to treatments, denoted SS(Treatment)) and a residual sum of squares (denoted SS(Error)); i.e.

SS(Total) = SS(Treatment) + SS(Error).     (12.1)

Associated with each partition are degrees of freedom (d.f. or df), representing the number of independent parameters for the source or alternatively the number of linearly independent contrasts. The partition is then represented in an ANOVA table. In a p-dimensional MANOVA, there are p SS(Total)s to be partitioned, one for each variable measured. In addition, there are measures of covariance between pairs of variables, presented as sums of products, which need attention. The MANOVA computations, therefore, are concerned with the partition of these measures of variance and covariance which are collected in a matrix of sums of squares and products, usually denoted by the SSP or SSCP (sum of squares and cross-products) matrix. The SSP(Total) matrix is partitioned into between group (or treatment) and within group variation (residual/error) sources, i.e.,

SSP(Total) = SSP(Treatment) + SSP(Error).     (12.2)

These matrices are all symmetric, meaning that the element on the ith row and jth column is equal to the element on the jth row and ith column. Note that the sums of products can be negative, while as you should expect, the sums of squares (found on the leading diagonal of the three SSP matrices) will always be positive. It should also be noted that, since the design of the experiment is the same (say, a 2⁴ factorial), the degrees of freedom associated with the various sources of variation would be the same in the MANOVA as in the ANOVA. Given the sum of squares for each variable is contained in the SSP(Total) matrix, the advantage of using MANOVA over ANOVA is that we also consider the inter-relationships among the response variables as measured by their covariances.
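As a small illustration of the SSP idea (an aside, not from the original text), the total SSP matrix is just the cross-product of the mean-centred response matrix. Using the Biomass data as read in Section 12.3, with its six response variables in columns 3 to 8:

> Y <- as.matrix(Biomass[, 3:8])
> SSP.total <- crossprod(scale(Y, center = TRUE, scale = FALSE))
> round(SSP.total[1:3, 1:3], 1)    # a corner of the symmetric 6 x 6 matrix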


12.2.3 Testing Hypotheses in MANOVA

Consider a simple experiment conducted according to a Completely Randomised design with 4 treatments (i.e. one experimental factor, say, A with 4 levels). Suppose the response is tri-variate, i.e. 3 different variables observed, on each of 10 replicated experimental units. Then we may write the general linear model, say Ω, as

Ω : y_ik = μ + τ_i + e_ik     (12.3)

where i = 1, ..., 4; k = 1, ..., 10. Note that: y_ik is the observation vector (of the 3 response variables) measured on the kth experimental unit for the ith treatment; μ is the vector of the overall means of the 3 response variables; τ_i is the effect (vector) due to the ith treatment; and the e_ik s are vectors of random errors assumed to be distributed independently as tri-variate normal, i.e. N₃(0, Σ), with zero means and covariance matrix Σ. Under the null hypothesis H₀ : τ_i = 0, i = 1, ..., 4, (i.e. when H₀ is true), the reduced model is

Ω₀ : y_ik = μ + e_ik     (12.4)

where again, i = 1, ..., 4; k = 1, ..., 10. Hence, the MANOVA table may be written as follows:

Source        d.f.   SSP matrix
Treatments      3        H
Error          36        E
Total          39        T

where H, E and T denote the 3×3 SSP matrices for the sources of variation from treatments, error and total, respectively. The SSP matrix for the reduced model Ω₀, denoted by E₀, can be written as E₀ = E + H, with (36+3) df. Note that E₀ = T in the case of a CRD experiment. The multivariate test procedures are then based on the comparisons of the matrices E and E₀.

Remarks: Using the standard notation, note that in the above CRD experiment (i.e. a one group structure situation), H ≡ B and E ≡ W, where B and W are the between-group and pooled within-group covariance matrices as defined in Canonical Discriminant Analysis.

Suppose that the 10 replicated experimental units in fact were 10 blocks each with 4 homogeneous plots. In other words, the experiment was carried out in a typical Randomised Block Design fashion. Then the full model is

Ω : y_ik = μ + τ_i + β_k + e_ik     (12.5)

where i = 1, ..., 4; k = 1, ..., 10. The corresponding MANOVA table is

Source        d.f.   SSP matrix
Blocks          9        B
Treatments      3        H
Error          27        E
Total          39        T

Test procedures: MANOVA, like many multivariate statistical techniques, turns out to be an eigen analysis problem. The only statistics invariant under choice of origin and scale of the data are the roots λ of the (determinant) equation

|H − λE| = 0     (12.6)

The λs are the eigenvalues of the HE⁻¹ (or E⁻¹H) matrix. It is not immediately obvious that the test statistics given by the roots of Equation 12.6 are intuitively sensible. However, when p = 1, the single λ is equal to H/E, which is proportional to the variance-ratio (or the F-ratio) statistic associated with a univariate ANOVA. It should be pointed out that the roots λᵢ, i = 1, ..., p (p = 3 in the examples above), satisfy λᵢ ≥ 0, but if s = rank(H) < p, then (p − s) of the λᵢs are zero. For this reason, it is advisable that there are more observations within each group than there are variables measured; obviously the greater the number of observations per group the better.

A number of test statistics/procedures have been suggested in the literature based on functions of the λᵢs. These include,

(a) λ_max, the largest eigenvalue of HE⁻¹, called Roy's maximum root,
(b) T = trace(HE⁻¹) = Σᵢ λᵢ, called the Hotelling-Lawley trace,
(c) V = trace[H(H + E)⁻¹] = Σᵢ [λᵢ/(1 + λᵢ)], called Pillai's trace, and
(d) Λ = |E|/|E + H| = Πᵢ [1/(1 + λᵢ)], called Wilks' lambda.

(Here, Σᵢ and Πᵢ denote, respectively, the sum and product over all i; and trace is the sum of the diagonals.)

Although many comparisons of these statistics have been carried out in the literature, the results are still indecisive. However, Wilks' lambda statistic has been preferred by many researchers, partly because of its ease of computation in problems of low dimensionality, but mainly for the existence of distributional approximations which enable critical values to be readily found. Hence, we shall concentrate mainly on Wilks' lambda in this chapter. Note that all four statistics above are equivalent when p = 1.

Remarks:

- The larger the eigenvalues of HE⁻¹, the stronger the significance of Wilks' lambda.
- The number of positive eigenvalues is equal to the smaller of the d.f. of the corresponding experimental effect and the number of response variables analysed.
- Pillai's trace may be regarded as a measure of response variable variation explained by the fitted multivariate linear regression model, and as an analogue to the well-known coefficient of determination (R²) in the context of univariate linear regression.

Although tables of critical values for Wilks' lambda and other test statistics are available, we shall explore the popular F-statistic approximation for Wilks' lambda in particular before seeing how R handles the other three tests.
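The four statistics can be computed directly from the eigenvalues of HE⁻¹. The small function below is a sketch written for this chapter (the function name and the numerical tolerance are our own choices, and in practice R's summary.manova() does this work for us); it simply applies the formulae (a)-(d) to given H and E matrices.

> manova.stats <- function(H, E) {
+     lambda <- Re(eigen(H %*% solve(E), only.values = TRUE)$values)
+     lambda <- lambda[lambda > 1e-08]    # drop the numerically zero roots
+     c(Roy = max(lambda),                # largest eigenvalue
+       HotellingLawley = sum(lambda),
+       Pillai = sum(lambda/(1 + lambda)),
+       Wilks = prod(1/(1 + lambda)))
+ }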

12.3 Analysis Using R

Consider the marsh grass example. Here, we have two experimental factors, namely, Vegetation with 3 levels and Location with 3 levels. Each vegetation-location combination is replicated 10 times. It can be assumed that the experiment carried out follows a typical Completely Randomised Factorial Design. The ANOVA model for a two-factor factorial design with a univariate response, given in Equation 5.1, can be extended to give the full model for a MANOVA taking the form:

y_ijk = μ + α_i + β_j + (αβ)_ij + e_ijk     (12.7)

where in our specific case, i = 1, ..., 3; j = 1, ..., 3; k = 1, ..., 10. Here, y_ijk is the observation vector (of the 6 response variables, i.e. soil physico-chemical characteristics and above-ground biomass) measured on the kth experimental unit for the ith vegetation and jth location; μ is the vector of the overall means of the response variables; α_i is the main effect (vector) due to the ith type of vegetation; β_j is the main effect (vector) due to the jth location; (αβ)_ij is the interaction effect (vector) due to the combination of the ith vegetation and the jth location; and the e_ijk s are vectors of random errors assumed to be distributed independently as multivariate normal, i.e. N₆(0, Σ), with zero means and covariance matrix Σ. The null hypotheses to be tested are, H₀₁ : α_i = 0, H₀₂ : β_j = 0 and H₀₃ : (αβ)_ij = 0, for all i and j.

The data for information on the marsh grass samples can be read, using the function read.csv(), from the file biomass.csv. Note that the factorial model is implemented via the manova() function specifying the terms Location, Vegetation and their interaction using Location*Vegetation.

> Biomass <- read.csv("biomass.csv", header = T)
> head(Biomass)
  Location Vegetation TotalBiomass Salinity   pH      K    Na    Zn
1       MS       RVEG          828       24 4.87  937.7 20439 19.21
2       MS       RVEG         1766       24 5.25  902.0 11687 20.92
3       MS       RVEG         1760       28 5.21  900.0 11719 21.19
4       MS       RVEG          826       28 5.02  934.7 20428 19.18
5       MS       RVEG         2076       27 4.57 1042.7 22982 20.25
6       MS       RVEG         1199       27 4.70  894.6 12507 20.66

Exhibit 12.2 MANOVA with Wilks lambda information.

> Biomass.maov <- manova(cbind(TotalBiomass, Salinity, pH,
+     K, Na, Zn) ~ Location * Vegetation, data = Biomass)
> summary(Biomass.maov, test = "Wilks")
                    Df  Wilks approx F num Df den Df Pr(>F)    
Location             2 0.0323     57.8     12    152 <2e-16 ***
Vegetation           2 0.0358     54.3     12    152 <2e-16 ***
Location:Vegetation  4 0.0106     29.7     24    266 <2e-16 ***
Residuals           81                                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is clear from Exhibit 12.2 that the Wilks lambda statistic for the interaction between Location and Vegetation Type (i.e. the Location:Vegetation effect) is highly significant (p < 0.0001) when all five substrates and biomass are considered collectively in the multivariate sense. The Wilks lambda based test also shows very strong evidence for significant (p < 0.0001) overall differences (main effects) among the Locations and among the Vegetation types when all six response variables are considered together. Since the interaction between location and vegetation type is highly significant, this effect should be further examined before exploring the significant main (or overall) effects. However, depending on the complexity of this interaction effect, the overall behaviour (or main effects) of location and vegetation type may also be examined.


Exhibit 12.3 MANOVA with information for Roy's maximum root, Hotelling-Lawley trace, and Pillai's trace.
> summary(Biomass.maov, test = "Roy")
                    Df   Roy approx F num Df den Df Pr(>F)    
Location             2  5.57     71.4      6     77 <2e-16 ***
Vegetation           2 17.75    227.8      6     77 <2e-16 ***
Location:Vegetation  4 22.31    293.8      6     79 <2e-16 ***
Residuals           81                                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(Biomass.maov, test = "Hotelling-Lawley")
                    Df Hotelling-Lawley approx F num Df den Df Pr(>F)    
Location             2             9.28     58.0     12    150 <2e-16 ***
Vegetation           2            18.24    114.0     12    150 <2e-16 ***
Location:Vegetation  4            24.73     76.8     24    298 <2e-16 ***
Residuals           81                                                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(Biomass.maov, test = "Pillai")
                    Df Pillai approx F num Df den Df Pr(>F)    
Location             2   1.64     57.6     12    154 <2e-16 ***
Vegetation           2   1.28     22.6     12    154 <2e-16 ***
Location:Vegetation  4   1.87     11.6     24    316 <2e-16 ***
Residuals           81                                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results are confirmed when the three other test methods are applied to the data, as seen in Exhibit 12.3. If the four methods did not agree with one another, we would then be concerned with the differences among the four tests in terms of the way they deal with the data, more than the results from the data itself. In such cases, we would need to consider the assumptions of each test method, and investigate the response variables individually to gain deeper insight into the results using univariate ANOVA. We may wish to do the investigation on the individual variables in any case, but we should condition our findings on the outcomes from the multivariate testing. There is a very quick and elegant way to generate the separate ANOVA tables for the six response variables using the summary.aov() command.
> summary.aov(Biomass.maov)
 Response TotalBiomass :
                    Df   Sum Sq  Mean Sq F value  Pr(>F)    
Location             2  1623898   811949    14.1 5.8e-06 ***
Vegetation           2 21797141 10898571   188.6 < 2e-16 ***
Location:Vegetation  4 10253652  2563413    44.4 < 2e-16 ***
Residuals           81  4681713    57799                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Salinity :
                    Df Sum Sq Mean Sq F value  Pr(>F)    
Location             2    710     355   45.98 4.5e-14 ***
Vegetation           2      6       3    0.42    0.66    
Location:Vegetation  4      4       1    0.14    0.97    
Residuals           81    626       8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response pH :
                    Df Sum Sq Mean Sq F value  Pr(>F)    
Location             2    3.7    1.83    24.3 5.5e-09 ***
Vegetation           2   38.0   19.02   252.1 < 2e-16 ***
Location:Vegetation  4   89.7   22.42   297.2 < 2e-16 ***
Residuals           81    6.1    0.08                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response K :
                    Df  Sum Sq Mean Sq F value Pr(>F)    
Location             2 3131705 1565852   101.6 <2e-16 ***
Vegetation           2 1796466  898233    58.3 <2e-16 ***
Location:Vegetation  4 1613572  403393    26.2  6e-14 ***
Residuals           81 1247946   15407                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Na :
                    Df   Sum Sq  Mean Sq F value  Pr(>F)    
Location             2 1.54e+09 7.69e+08    98.7 < 2e-16 ***
Vegetation           2 1.18e+09 5.91e+08    75.8 < 2e-16 ***
Location:Vegetation  4 8.17e+08 2.04e+08    26.2 5.8e-14 ***
Residuals           81 6.31e+08 7.79e+06                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Zn :
                    Df Sum Sq Mean Sq F value  Pr(>F)    
Location             2    947     474    53.6 1.5e-15 ***
Vegetation           2   2097    1049   118.6 < 2e-16 ***
Location:Vegetation  4   2270     568    64.2 < 2e-16 ***
Residuals           81    716       9                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary() command, used frequently throughout this text, looks to see what kind of object we are using it on and then looks for a secondary command of the right kind. As a consequence, it is often referred to as a method. When we used the summary() method on an ANOVA object, it automatically called the summary.aov() method for us; when we use it on a MANOVA object, it uses the summary.manova() method. We could use the summary.aov() or summary.manova() commands explicitly more often but this is seldom warranted. Unfortunately, use of the summary.aov() command explicitly in this instance means


we do not have any model objects stored in the workspace. The code used to create these model objects could be kept as simple as generating the six models in turn as we would have done if we had just one response variable, but this is a great opportunity for showing some fancy R code that could make us just that much more efficient when undertaking larger analyses. We use the aov() command and associated summary method as expected, but also use a for() loop and the assign() command. Note also the use of the paste() command here.
> for (Variable in 3:8) {
+     fm = aov(Biomass[, Variable] ~ Location * Vegetation,
+         data = Biomass)
+     assign(paste("Biomass", names(Biomass)[Variable], "aov1",
+         sep = "."), fm)
+ }

This code uses the numbers of the columns of the data.frame we want to put into an ANOVA. The assign() command ensures that the object created by the aov() command in each iteration of the loop is given a meaningful name. To confirm all these objects exist as planned, we use the ls() command:
> ls()
 [1] "Biomass"                   "Biomass.K.aov1"           
 [3] "Biomass.maov"              "Biomass.Na.aov1"          
 [5] "Biomass.pH.aov1"           "Biomass.Salinity.aov1"    
 [7] "Biomass.TotalBiomass.aov1" "Biomass.Zn.aov1"          
 [9] "fm"                        "Variable"                 

12.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 12.1: Four measurements were made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150 A.D. We wish to analyze the data to determine if there are any differences in the skull sizes between the time periods and if they show any changes with time. The researchers theorized that a change in skull size over time is evidence of the interbreeding of the Egyptians with immigrant populations over the years. Thirty skulls are measured from each time period and the four skull measurements are, Maximal Breadth of Skull (Max), Basibregmatic Height of Skull (BasiHt), Basialveolar Length of Skull (BasiLth) and Nasal Height of Skull (NasalHt). The measurements for the 150 skulls are in the file skulls.csv. The time-period grouping is denoted by Epoch and takes values 1 to 5 with labels also shown. In the DRUGS package, this data set is called Skulls; they can be obtained using


> data(Skulls, package = "DRUGS")

Use MANOVA to determine if there are differences in Egyptian skulls over time, and if there are, determine how individual aspect(s) of the skulls are changing over time. Consider use of suitable graphs to help describe the changes.

Exercise 12.2: Mungomery et al. (1974) presented a way of analysing data collected on a large number of plant varieties (called genotypes) that had been tested over a number of environments (combinations of time and location). The data appear in the file Mungomery1974.csv and are included in the DRUGS package and can therefore be accessed using
> data(Mungomery, package = "DRUGS")

Use the six response variables in a model, or series of models, that considers the impact of genotype, time and location as well as their interactions.

Chapter 13 An Introduction to Generalized Linear Models: Blood Plasma, Women's Role and Colonic Polyps
An original chapter
written by

Geoff Jones1

13.1 Introduction

Collett and Jemain (1985) reported on an investigation of the relationship between the levels of certain blood proteins and the rate at which red blood cells settle out of the blood (erythrocyte sedimentation rate or ESR). The data they collected are displayed in Exhibit 13.1. These blood protein levels tend to rise in persons suffering from certain diseases and infections. ESR is easy to measure, so if it is associated with blood protein levels it might provide a useful screening test for such diseases. It is thought that normal individuals would be expected to have an ESR of less than 20mm/hr, so we want to know if people with high blood protein levels are more likely to have ESR > 20.

Haberman (1973) analyzed some results from a survey carried out in 1974 in which people were asked whether or not they agreed with the statement "Women should take care of their homes and leave running the country up to men". The responses, grouped according to the sex and education level of the respondent, are illustrated in Exhibit 13.2.
1 Geoff is an Associate Professor in the Institute of Fundamental Sciences.


Exhibit 13.1 plasma data. Measurements of blood protein levels and erythrocyte sedimentation rate.
Fibrinogen  Globulin  ESR
      2.52        38  ESR < 20
      2.56        31  ESR < 20
      2.19        33  ESR < 20
      2.18        31  ESR < 20
       ...       ...  ...
      2.38        37  ESR > 20
      3.53        46  ESR > 20
      2.09        44  ESR > 20
      3.93        32  ESR > 20

Exhibit 13.2 womensrole data from a survey on women's role in society.


education  sex   agree  disagree      education  sex     agree  disagree
        0  Male      4         2              0  Female      4         2
        1  Male      2         0              1  Female      1         0
        2  Male      4         0              2  Female      0         0
      ...  ...     ...       ...            ...  ...       ...       ...
       18  Male      1        28             18  Female      0        21
       19  Male      2        13             19  Female      1         2
       20  Male      3        20             20  Female      2         4

We want to know how the opinions on a woman's role in society are influenced by the sex and education level of the respondent.

Giardiello et al. (1993) described a clinical trial of a new drug for the treatment of familial adenomatous polyposis, an inherited condition which causes polyps to form in the large intestine. The data are shown in Exhibit 13.3. The main question here is whether use of the drug reduces the number of polyps, and if so by how much. The age of the patient is expected to be an important predictor of the number of polyps, and may also affect the efficacy of treatment.

13.2 Generalized Linear Models

In the examples of the previous section there is a response variable Y that we want to model as a function of one or more possible predictor variables, say x. The usual way to do this is using the normal linear model:

Y = x^T \beta + \epsilon    (13.1)


Exhibit 13.3 polyps data giving number of colonic polyps after 12 months of treatment.
number  treat    age      number  treat    age
    63  placebo   20           3  drug      23
     2  drug      16          28  placebo   22
    28  placebo   18          10  placebo   30
    17  drug      22          40  placebo   27
    61  placebo   13          33  drug      23
     1  drug      23          46  placebo   22
     7  placebo   34          50  placebo   34
    15  placebo   50           3  drug      23
    44  placebo   19           1  drug      22
    25  drug      17           4  drug      42

where the linear predictor x^T \beta = \beta_0 + x_1 \beta_1 + ... + x_p \beta_p describes how the expected value of the response depends on the predictors, and the error distribution \epsilon ~ Normal(0, \sigma^2) determines the variation in the response. An alternative way of writing this is

Y ~ Normal(\mu, \sigma^2)   where   \mu = x^T \beta.    (13.2)

Assuming that Y has, at least approximately, a normal distribution may be suitable when the response is a measurement, but this assumption is inappropriate for other types of response variable, such as the ones in the examples above. The first, ESR above or below 20, is an example of binary data where the responses are either Yes or No (or True/False; 1/0). The second, the numbers agreeing or disagreeing with a proposition in a group of people, is binomial data where the binary responses of individuals have been grouped together. The third, number of colonic polyps, is count data.

Nelder and Wedderburn (1972) showed how certain assumptions of the normal linear model can be relaxed to give a more flexible, general framework for modelling different kinds of responses including binary and count data. This generalization has three components:

1. The probability distribution describing the response Y. This will typically have one parameter giving the expected value and possibly other parameters. In the case of the normal linear model, the assumed normal distribution has two parameters: mean \mu and variance \sigma^2. [There is a technical restriction on what distributions can be used: they need to be from an exponential family, eg normal, binomial, Poisson and gamma are OK; Weibull and negative binomial are not.]

2. The link function connecting the mean to the linear predictor:

g(\mu) = x^T \beta    (13.3)


so that individual predictors have a linear effect not on the mean itself but, in general, on some suitably chosen function of the mean. In the case of the normal linear model, the link function is g(\mu) = \mu, ie the identity function. The link function is chosen to give sensible behaviour, and this depends in part on the probability distribution used.

3. The variance function V(\mu) that relates the variance of the response to its expected value. For the normal linear model the variance is assumed not to depend on the mean, but for other probability distributions there is a connection between the two.

Once these three components have been chosen, fitting the model to the data (ie estimating \beta) is done by maximum likelihood. Nelder and Wedderburn (1972) gave a general computational method for getting the maximum likelihood estimates. Equivalent to maximising the likelihood L is minimizing the deviance, which is defined as -2 \log L. The deviance of a generalized linear model plays a role similar to that of the Residual

Sum of Squares in the normal linear model, in that it is a measure of the lack-of-fit of the model. Nested models can be compared by testing the change in deviance, this being a form of likelihood ratio test (cf ANOVA). If the simpler model is correct, the reduction in deviance from using the more complex model will have a chi-squared distribution with degrees of freedom given by the number of extra parameters. As with the normal linear model, diagnostic plots of residuals can help in detecting lack-of-fit. Unfortunately however this is not so easy for the generalized linear model: there are a number of different kinds of residuals, and plotting them often shows patterns that do not necessarily imply lack of fit. The deviance residual is based on the contribution of each observation to the deviance:
r_i^D = sign(Y_i - \hat{\mu}_i) \sqrt{d_i^2}    (13.4)

where \hat{\mu}_i is the fitted value for observation i and d_i^2 is its contribution to the deviance. Another possibility is the Pearson residual:
r_i^P = \frac{Y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}    (13.5)

which standardizes the difference between the observed and expected response. For the normal linear model these give the same thing, but for other distributions they are different and neither has entirely satisfactory properties. The deviance residuals seem to be preferred, as the distribution of the Pearson residuals tends to be skewed.

13.2.1 Models for Binary and Binomial Data

If Y is a binary variable (coded 1 for success and 0 for failure) then its distribution is the Bernoulli distribution with parameter \pi denoting the probability of success.

The Bernoulli distribution is

P(y) = \begin{cases} 1 - \pi & \text{if } y = 0 \\ \pi & \text{if } y = 1 \end{cases}    (13.6)

with mean \mu = \pi and variance \pi(1 - \pi), so the variance function is V(\mu) = \mu(1 - \mu).

Because the expected response must be between 0 and 1, a link function should be

chosen so that, no matter what the parameter and covariate values, the inverse link g^{-1}(\cdot) always satisfies this constraint. The most common choice is the logistic link:

g(\pi) = logit(\pi) = \log \frac{\pi}{1 - \pi}    (13.7)

so the effect of covariates on the probability of success is modelled as

logit(\pi) = \beta_0 + x_1 \beta_1 + ... + x_p \beta_p    (13.8)

and the inverse link is

\pi = \frac{\exp(\beta_0 + x_1 \beta_1 + ... + x_p \beta_p)}{1 + \exp(\beta_0 + x_1 \beta_1 + ... + x_p \beta_p)}.    (13.9)

This makes the estimated effects of predictors difficult to interpret. Some would say that an increase of 1 in x_k would increase the log-odds by \beta_k, but not everyone finds this meaningful. Graphical methods might be a better way of illustrating the effects of predictors on the probability of success.

Binomial data consists of pairs (y_i, n_i) where y_i represents the number of successes achieved in n_i attempts. This can be regarded as n_i Bernoulli outcomes where we get success y_i times and failure n_i - y_i times. Thus no new theory is required for this. Note however that we are assuming that the trials are independent, and that they have the same probability of success \pi_i. Modelling binary or binomial data in this way is often called logistic regression.
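As a quick numerical illustration of the logit link and its inverse (the coefficient and covariate values below are made up, not taken from any of the examples), R provides qlogis() for the logit and plogis() for the inverse logit:

> beta0 <- -2; beta1 <- 0.8    # hypothetical coefficients
> x <- 1.5                     # hypothetical covariate value
> eta <- beta0 + beta1 * x     # linear predictor, ie the log-odds
> plogis(eta)                  # inverse logit: probability of success
[1] 0.3100255
> qlogis(plogis(eta))          # the logit recovers the log-odds
[1] -0.8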

13.2.2 Models for Count Data


Count data are usually modelled by a Poisson distribution with probability function:

P(y) = \frac{\lambda^y e^{-\lambda}}{y!},   y = 0, 1, 2, ...    (13.10)

where \lambda is the expected count (sometimes called the rate). This distribution has the property that the variance is equal to the mean, so V(\mu) = \mu. The rate must be positive, so it is usual to use the log link g(\lambda) = \log \lambda with inverse link \lambda = e^{x^T \beta}. Using this link, covariate effects are quite easy to interpret: an increase of 1 in x_j multiplies the rate by e^{\beta_j}. [Numerical problems can occur with very low counts if a fitted value \hat{\lambda}_i gets close to zero: note that \log 0 = -\infty.]


Modelling count data in this way is often called Poisson regression. One additional difficulty sometimes encountered is that the counts may relate to different-sized intervals. For example, different individuals may be observed over different time periods. In this case we should include observation time in the analysis. If the rate is \lambda per unit time, the rate for a period of time t will be \lambda t, and \log(\lambda t) = \log t + \log \lambda; thus with a log link, our linear predictor should include the term \log t to adjust for the observation period. This is accomplished by what is called an offset in the model:

\log(\lambda t) = \log t + \beta_0 + x_1 \beta_1 + ... + x_p \beta_p    (13.11)

Note that \log t is not really a term in the model; its coefficient in the linear predictor is fixed at 1, not estimated from the data.
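In R the offset is most conveniently written into the model formula. The single line below is a sketch only; y, x, obstime and mydata are hypothetical names, not objects used elsewhere in this chapter:

> fit <- glm(y ~ x + offset(log(obstime)), family = poisson(), data = mydata)

The offset(log(obstime)) term enters the linear predictor with its coefficient fixed at 1, exactly as described above.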

13.2.3 Overdispersion

The Bernoulli and Poisson distributions, unlike the normal, only have one parameter: there is no separate variance parameter. This means that the model predicts a specific variance for the responses based on the variance function: \pi(1 - \pi) for the Bernoulli or \lambda for the Poisson. In practice the observed variance in the data may be greater than that predicted by the above models. This overdispersion can be accommodated by introducing a scale parameter \phi, so that the variance is \phi \pi(1 - \pi) or \phi \lambda. In a standard maximum likelihood analysis, \phi is fixed at one; if the resulting residual deviance is much greater than the residual degrees of freedom, this is an indication of overdispersion. An amended fitting procedure called quasi-likelihood estimates \phi and adjusts for it in the calculation of the standard errors. Overdispersion often occurs when there are extra sources of variation that our model is not taking into account. This may be because of missing covariates which are affecting the mean (\pi or \lambda) but which we cannot observe directly. If overdispersion is not allowed for in the analysis, by changing to quasi-likelihood, the standard errors for the parameter estimates will not be valid and inference will not be reliable.
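A rough numerical check for overdispersion is therefore to compare the residual deviance with its residual degrees of freedom. For any fitted model object (here called fit purely for illustration) this is a one-liner:

> deviance(fit) / df.residual(fit)    # values much larger than 1 suggest overdispersion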

13.3 Analysis Using R

13.3.1 Blood Plasma

The data for examining the relationship between ESR and blood plasma levels can be read from the file plasma.csv using the read.csv() command,
> Plasma = read.csv("plasma.csv")

or, directly from the DRUGS package using



> data(Plasma, package = "DRUGS") > head(Plasma) Fibrinogen Globulin ESR 2.52 38 ESR < 20 2.56 31 ESR < 20 2.19 33 ESR < 20 2.18 31 ESR < 20 3.41 37 ESR < 20 2.46 36 ESR < 20

147

1 2 3 4 5 6

The response here is binary since ESR has two levels. To model the probability of "ESR > 20" we want this level to be success. What is the default? The answer lies in the glm() help file: For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures. If you check the levels of ESR:
> levels(Plasma$ESR)
[1] "ESR < 20" "ESR > 20"

you see that the first level is "ESR < 20" so this is failure, and success is "ESR > 20" as desired. To see whether, and how, ESR depends on fibrinogen level, we fit a logistic regression model as shown in Exhibit 13.4. The family argument specifies the distribution of the response; the link function can be specified separately, but here we use the default which is the logit link (see the help file for family). The estimated coefficient for Fibrinogen is 1.83, so a one unit increase in fibrinogen level is associated with an increase of 1.83 in the log-odds of ESR exceeding 20mm/hr. We could examine the covariate effect graphically by plotting the fitted values, which can be obtained by applying fitted() to the fitted object. However a smoother plot can be obtained by defining a fuller set of fibrinogen levels and using the predict() function as in Exhibit 13.5. Note that predictions can be made on the response scale (ie as probabilities) or by default on the linear predictor scale (logit-transformed). We can get an approximate confidence interval for the fibrinogen coefficient by using the standard error given in the output. This gives a symmetrical interval of the form estimate ± 1.96 se. An alternative using the function confint() gives a likelihood-based interval that is not symmetrical. For large datasets they tend to give similar results, but the likelihood-based method has better small-sample properties:
> summary(Plasma.glm1)$coef[2, 1] - 1.96 * summary(Plasma.glm1)$coef[2,
+     2]


Exhibit 13.4 Logistic regression of ESR on Fibrinogen level.


> Plasma.glm1 <- glm(ESR ~ Fibrinogen, data = Plasma, family = binomial())
> summary(Plasma.glm1)

Call:
glm(formula = ESR ~ Fibrinogen, family = binomial(), data = Plasma)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.930  -0.540  -0.438  -0.336   2.479  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -6.845      2.770   -2.47    0.013 *
Fibrinogen     1.827      0.901    2.03    0.043 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.885  on 31  degrees of freedom
Residual deviance: 24.840  on 30  degrees of freedom
AIC: 28.84

Number of Fisher Scoring iterations: 5

Exhibit 13.5 Estimated relationship between ESR and fibrinogen (with data).
> plot(Plasma$Fibrinogen, Plasma$ESR == "ESR > 20", xlab = "Fibrinogen",
+     ylab = "Probability of ERS>20")
> newf <- (20:50)/10
> lines(newf, predict(Plasma.glm1, newdata = list(Fibrinogen = newf),
+     type = "response"))




[1] 0.0614

> summary(Plasma.glm1)$coef[2, 1] + 1.96 * summary(Plasma.glm1)$coef[2,
+     2]
[1] 3.593

> confint(Plasma.glm1, parm = "Fibrinogen")
 2.5 % 97.5 % 
0.3388 3.9985 

Next we consider a model with both fibrinogen and globulin as covariates:


> summary(Plasma.glm2 <- glm(ESR ~ Fibrinogen + Globulin,
+     data = Plasma, family = binomial()))

Call:
glm(formula = ESR ~ Fibrinogen + Globulin, family = binomial(),
    data = Plasma)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.968  -0.612  -0.346  -0.212   2.264  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -12.792      5.796   -2.21    0.027 *
Fibrinogen     1.910      0.971    1.97    0.049 *
Globulin       0.156      0.120    1.30    0.193  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.885  on 31  degrees of freedom
Residual deviance: 22.971  on 29  degrees of freedom
AIC: 28.97

Number of Fisher Scoring iterations: 5

The coefficient for globulin is not significant, with a p-value of 0.19 based on estimate/se. Again, a likelihood-based test has better small-sample properties, so it is better to judge the significance of the globulin term using a likelihood ratio test. This can be done by hand using the residual deviance from each model output, but is conveniently implemented in the anova() function:
> anova(Plasma.glm1, Plasma.glm2, test = "Chisq") Analysis of Deviance Table Model 1: Model 2: Resid. 1 2 ESR ~ Fibrinogen ESR ~ Fibrinogen + Globulin Df Resid. Dev Df Deviance Pr(>Chi) 30 24.8 29 23.0 1 1.87 0.17

The conclusion is the same, although the p-value is a little smaller. Note that there is 1 degree of freedom in the \chi^2-test because there is one extra parameter in the second model.
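For completeness, the by-hand version of this likelihood ratio test can be sketched as follows; it refers the change in residual deviance to a chi-squared distribution and should reproduce the p-value in the anova() table above:

> 1 - pchisq(deviance(Plasma.glm1) - deviance(Plasma.glm2),
+     df.residual(Plasma.glm1) - df.residual(Plasma.glm2))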


Exhibit 13.6 Residual plot for logistic regression of ESR on fibrinogen.


> plot(predict(Plasma.glm1), residuals(Plasma.glm1, type = "deviance"),
+     xlab = "Fitted", ylab = "Deviance residuals")


Returning to the model with just fibrinogen, we now check the residuals. Exhibit 13.6 shows the deviance residuals plotted against the fitted values on the linear predictor scale. There are two bands of residuals, corresponding to the successes and failures. The largest residuals, in the top-left corner of the plot, are for patients who had low levels of fibrinogen but nevertheless ESR > 20. Further diagnostics can be viewed by using the plot() command on the fitted object Plasma.glm1, but they are generally not very useful with small datasets.

13.3.2 Women's Role in Society

The data for examining how opinions about a woman's role in society depend on the sex and education level of the respondent can be read from the file WomensRole.csv using the read.csv() command,
> WomensRole = read.csv("WomensRole.csv")

or, directly from the DRUGS package using


> data(WomensRole, package = "DRUGS") > head(WomensRole) Education 0 1 2 3 4 5 Sex Agree Disagree Male 4 2 Male 2 0 Male 4 0 Male 6 3 Male 5 5 Male 13 7

1 2 3 4 5 6


Note that the version in DRUGS has the first (baseline) level of the factor Sex set to "Male". If you read the data from the csv file, then it automatically sets the levels in alphabetical order, with "Female" first. If you want to change this to get output consistent with that shown here, use:
> WomensRole$Sex <- factor(WomensRole$Sex, levels = c("Male",
+     "Female"))

The data here are binomial, and give the numbers of successes (Agree) and failures (Disagree) for each education-sex combination. For such data the response specified in glm() is no longer a vector but a two-column matrix binding together the number of successes and number of failures.
> summary(WomensRole.glm1 <- glm(cbind(Agree, Disagree) ~
+     Sex + Education, data = WomensRole, family = binomial()))

Call:
glm(formula = cbind(Agree, Disagree) ~ Sex + Education, family = binomial(),
    data = WomensRole)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7254  -0.8630  -0.0652   0.8434   3.1332  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.5094     0.1839   13.65   <2e-16 ***
SexFemale    -0.0114     0.0841   -0.14     0.89    
Education    -0.2706     0.0154  -17.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 451.722  on 40  degrees of freedom
Residual deviance:  64.007  on 38  degrees of freedom
AIC: 208.1

Number of Fisher Scoring iterations: 4

This suggests that education level has a strong effect, those with higher education levels being less likely to agree with the statement that "Women should take care of their homes and leave running the country up to men". However sex is not a significant factor. To investigate further, we include in the model statement an interaction between sex and education.
> summary(WomensRole.glm2 <- glm(cbind(Agree, Disagree) ~
+     Sex * Education, data = WomensRole, family = binomial()))

Call:
glm(formula = cbind(Agree, Disagree) ~ Sex * Education, family = binomial(),
    data = WomensRole)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3910  -0.8806   0.0153   0.7278   2.4526  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)           2.0982     0.2355    8.91   <2e-16 ***
SexFemale             0.9047     0.3601    2.51   0.0120 *  
Education            -0.2340     0.0202  -11.59   <2e-16 ***
SexFemale:Education  -0.0814     0.0311   -2.62   0.0089 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 451.722  on 40  degrees of freedom
Residual deviance:  57.103  on 37  degrees of freedom
AIC: 203.2

Number of Fisher Scoring iterations: 4

We now find that sex is significant, and that there is a significant interaction: the effect of education on the probability of agreeing is different between sexes. We can illustrate this by writing the formula for the estimated linear predictor separately for each sex:

For Males:    \log \frac{\pi}{1 - \pi} = 2.098 - 0.234 Education
For Females:  \log \frac{\pi}{1 - \pi} = 3.003 - 0.315 Education    (13.12)

Thus at education level zero females are more likely to agree than males, but their tendency to disagree increases more quickly with education. It is perhaps easier to see this graphically, as in Exhibit 13.7. We can confirm the improvement in model fit from the inclusion of this interaction using anova(). We could also check the residuals of this model by plotting them against the predicted values. The residual plot shows no suggestion of outliers or lack-of-fit.
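A sketch of those two follow-up checks (output not shown) might look like this:

> anova(WomensRole.glm1, WomensRole.glm2, test = "Chisq")
> plot(predict(WomensRole.glm2), residuals(WomensRole.glm2, type = "deviance"),
+     xlab = "Fitted", ylab = "Deviance residuals")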

13.3.3 Colonic Polyps

The data for evaluating a new treatment for familial adenomatous polyposis can be read from the file polyps.csv or directly from the DRUGS package. If the former, it is better to change the Treat factor to make placebo the baseline. The response is the count of the number of polyps, so we use Poisson regression with a log link function:
> data(Polyps, package = "DRUGS") > summary(Polyps.glm1 <- glm(Number ~ Treat + Age, data = Polyps, + family = poisson())) Call: glm(formula = Number ~ Treat + Age, family = poisson(), data = Polyps) Deviance Residuals:


Exhibit 13.7 Estimated effect of sex and education on opinions of women's role in society.
> attach(WomensRole)
> plot(Education, Agree/(Agree + Disagree), ylab = "P(agree)",
+     pch = ifelse(Sex == "Male", "M", "F"), cex = 0.7)
> fit <- fitted(WomensRole.glm2)
> lines(Education[Sex == "Male"], fit[Sex == "Male"], lty = 1)
> lines(Education[Sex != "Male"], fit[Sex != "Male"], lty = 2)
> legend("topright", c("Male", "Female"), lty = 1:2)
> detach(WomensRole)


   Min      1Q  Median      3Q     Max  
 -4.22   -3.05   -0.18    1.45    5.83  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.52902    0.14687   30.84   <2e-16 ***
Treatdrug   -1.35908    0.11764  -11.55   <2e-16 ***
Age         -0.03883    0.00596   -6.52    7e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.54  on 17  degrees of freedom
AIC: 273.9

Number of Fisher Scoring iterations: 5

The estimated effect of treatment is to lower the incidence of polyps by a factor of e^{-1.359} = 0.257, ie a 74% reduction in incidence. We can get a confidence interval for the reduction factor using:
> exp(confint(Polyps.glm1, "Treatdrug"))
 2.5 % 97.5 % 
0.2028 0.3218 


Note however that the residual deviance is much greater than the residual degrees of freedom. This suggests overdispersion, so inference based on the above model is unreliable. To correct for overdispersion we switch to quasi-likelihood estimation:
> summary(Polyps.glm2 <- glm(Number ~ Treat + Age, data = Polyps,
+     family = quasipoisson()))

Call:
glm(formula = Number ~ Treat + Age, family = quasipoisson(),
    data = Polyps)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -4.22   -3.05   -0.18    1.45    5.83  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.5290     0.4811    9.41  3.7e-08 ***
Treatdrug    -1.3591     0.3853   -3.53   0.0026 ** 
Age          -0.0388     0.0195   -1.99   0.0628 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for quasipoisson family taken to be 10.73)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.54  on 17  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

This reduces considerably the statistical significance of the estimates, although the treatment effect is still significant (p = 0.0026). Note that the estimates themselves stay the same, but the standard errors change, so the confidence interval for the treatment effect is now much wider. Does the treatment effect depend on the age of the patient? To examine this we can fit an interaction term in the model:
> summary(Polyps.glm3 <- glm(Number ~ Treat * Age, data = Polyps,
+     family = quasipoisson()))

Call:
glm(formula = Number ~ Treat * Age, family = quasipoisson(),
    data = Polyps)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
 -4.241   -3.040   -0.086    1.439    5.849  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.51912    0.51726    8.74  1.7e-07 ***
Treatdrug     -1.25726    1.59069   -0.79    0.441    
Age           -0.03840    0.02106   -1.82    0.087 .  
Treatdrug:Age -0.00463    0.07023   -0.07    0.948    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for quasipoisson family taken to be 11.38)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.49  on 16  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

This produces a very confusing result: now the treatment effect is not significant! Note however that in this model the coefficient for Treatdrug represents the effect on patients with age = 0. This is not a very sensible parametrization, and if we centre the age variable first:
> Polyps$AgeC <- Polyps$Age - mean(Polyps$Age)
> summary(Polyps.glm4 <- glm(Number ~ Treat * AgeC, data = Polyps,
+     family = quasipoisson()))

Call:
glm(formula = Number ~ Treat * AgeC, family = quasipoisson(),
    data = Polyps)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
 -4.241   -3.040   -0.086    1.439    5.849  

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.55905    0.17458   20.39  7.1e-13 ***
Treatdrug      -1.37302    0.45166   -3.04   0.0078 ** 
AgeC           -0.03840    0.02106   -1.82   0.0870 .  
Treatdrug:AgeC -0.00463    0.07023   -0.07   0.9482    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for quasipoisson family taken to be 11.38)

    Null deviance: 378.66  on 19  degrees of freedom
Residual deviance: 179.49  on 16  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

we get something much more sensible. It is clear from this that the treatment effect can be assumed to be the same irrespective of the age of the patient. Finally in Exhibit 13.8 we plot the fitted values with the data, and the deviance residuals.


Exhibit 13.8 Fitted values (with data) and deviance residuals for the polyps model.
> > > + > > + > + > + > + > attach(Polyps) par(mfrow = c(1, 2)) plot(Age, Number, pch = ifelse(Treat == "drug", 16, 1), main = "Data and fitted values") Age.n <- 10:50 lines(Age.n, predict(Polyps.glm1, type = "response", newdata = data.frame(Age = Age.n, Treat = "drug"))) lines(Age.n, predict(Polyps.glm1, type = "response", newdata = data.frame(Age = Age.n, Treat = "placebo")), lty = 2) legend("topright", c("placebo", "drug"), lty = 2:1, pch = c(1, 16), cex = 0.7) plot(predict(Polyps.glm1), residuals(Polyps.glm1), main = "Residual plot", xlab = "Fitted", ylab = "Residuals") detach(Polyps)


13.4 Exercises

Exercise 13.1: In the plasma data, the effect of fibrinogen level on logit-transformed probability of ESR>20 may not be linear. Investigate whether adding a quadratic term, or using exp(fibrinogen), gives a better model.

Exercise 13.2: Bladder cancer has the highest recurrence rate of any malignancy. The dataset Bladder in the DRUGS package gives the number of recurrences of bladder tumours after the removal of the initial tumour, the observation period in years, and information relating to the size of the primary tumour (Seber (1989)). See Exhibit 13.9. Is the size of the initial tumour a useful prognostic factor for the rate of recurrence? Illustrate graphically.

Exercise 13.3: The leuk data in the MASS library gives survival times in weeks after diagnosis with leukemia, together with some covariate information recorded on each patient at the time of diagnosis: the number of white blood cells per microlitre (wbc) and the condition of the cells (ag). See Exhibit 13.10. Investigate how the probability of surviving more than 24 weeks is affected by the covariates. [Hint: you may want to

Exhibit 13.9 bladdercancer data giving number of tumour recurrences.
time  tumorsize  number
   2  <=3cm           1
   3  <=3cm           1
 ...  ...           ...
   4  >3cm            3
  19  >3cm            4


Exhibit 13.10 leuk data giving survival times after diagnosis with leukemia.
   wbc  ag       time
  2300  present    65
   750  present   156
   ...  ...       ...
100000  absent      4
100000  absent     43

consider a log transform of wbc and an interaction term in your model].

Chapter 14 An Introduction to Contingency Tables: Spitting Chimps, Hodgkin's Patients and Kidney Stones
An original chapter
written by

Geoff Jones1

14.1 Introduction

An experiment to compare the problem-solving abilities of apes and human children (Hanus et al., 2011) gave each subject a vertical glass tube with a peanut at the bottom, and access to a water dispenser. Subjects who used the water to raise the peanut in the tube were deemed successful. Out of 43 chimps tested, 14 were successful. The researchers found that only two out of 24 four-year-old children could solve the problem. The data can be represented as a contingency table:

             yes   no
chimp         14   29
4-year-old     2   22

The children were given a watering can, whereas the chimps used their mouths and spat the water into the tube. (One chimp reached the peanut by urinating into the tube; this has been classified here as a success.)
1 Geoff is an Associate Professor in the Institute of Fundamental Sciences.



Five gorillas tested in the above experiment all failed to obtain the peanut. In a similar experiment (Taylor and Gray, 2009), rooks were provided with stones to drop into the water and the peanut was replaced with a maggot. Does the table:

          yes   no
rook        4    0
gorilla     0    5

suggest that rooks are more intelligent than gorillas?

An investigation into the relationship between Hodgkin's disease and tonsillitis studied 85 Hodgkin's patients who had a sibling of the same sex who was free of the disease and whose age was within 5 years of the patient's (Johnson and Johnson, 1972). These investigators presented the following table:

             Tonsillectomy
             yes    no
Hodgkins      41    44
control       33    52

An alternative representation of the same data is given below:

                            Sibling Tonsillectomy
                             no    yes
Patient Tonsillectomy  no    37      7
                       yes   15     26

A medical study compared the success rates of two treatments for removing kidney stones (Charig et al., 1986). The table below shows the overall numbers of successes and failures, where Treatment A includes all open surgical procedures and Treatment B is percutaneous nephrolithotomy.

           Treatment A   Treatment B
success            273           289
failure             77            61

For a more in-depth analysis, we can separate patients with small stones from those with large stones. This gives this table of results:

           Small stones       Large stones
             A       B          A      B
success     81     234        192     55
failure      6      36         71     25


14.2 Contingency Table Analysis

14.2.1 Chi-squared test

The standard chi-squared analysis for comparing observed counts with the corresponding expected counts from a model has a long history (Pearson, 1900). In an r × c table the data for n subjects can be represented as

         B_1    B_2   ...   B_c   Total
A_1     O_11   O_12   ...  O_1c     R_1
A_2     O_21   O_22   ...  O_2c     R_2
...      ...    ...   ...   ...     ...
A_r     O_r1   O_r2   ...  O_rc     R_r
Total    C_1    C_2   ...   C_c       n

where O_{ij} represents the number of subjects falling into category A_i for the row variable and category B_j for the column variable, R_i denotes the total for row i and C_j the total for column j. If we assume that the n subjects are independent of each other, then we can test for independence of the row and column variables by comparing the observed values with the expected values for all cells in the contingency table. We calculate the expected values as

E_{ij} = R_i C_j / n    (14.1)

and the test statistic as

X^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}.    (14.2)

Provided the E_{ij}s are not too small (none < 1; no more than 20% < 5) then, under the hypothesis of no relationship between the row and column variables, X^2 will follow a chi-squared distribution with (r - 1)(c - 1) degrees of freedom, denoted \chi^2_{(r-1)(c-1)}. Large values of X^2 lead to rejection of the hypothesis of no relationship between the row and column variables. The p-value for the test thus represents P(\chi^2_{(r-1)(c-1)} > X^2).

For the chimps vs four-year-olds data, the p-value is 0.02575, suggesting a relationship between the species (chimp or human child) and ability to solve the problem. By comparing the observed and expected frequencies, or by examining the residuals (O_{ij} - E_{ij})/\sqrt{E_{ij}}, we can see that the chimps were better.

In a 2 × 2 contingency table

        B_1    B_2
A_1    O_11   O_12
A_2    O_21   O_22


the chi-squared approximation can be improved by using a continuity correction that adjusts the formula for the test statistic by subtracting 0.5 from the difference between each observed value and its expected value:

X^2 = \sum_{i,j} (|O_{ij} - E_{ij}| - 0.5)^2 / E_{ij}.    (14.3)

This reduces the chi-square value obtained and thus increases its p-value (now 0.05347 for the chimps).

If the row and column variables in a 2 × 2 table are found to be significantly related, the relationship is often expressed using the odds ratio. In the above table, the odds of a subject in category A_1 falling into category B_1 (as opposed to B_2) are estimated as O_{11}/O_{12}; for a subject in category A_2 these odds are O_{21}/O_{22}. Thus the odds ratio is

OR = \frac{O_{11}/O_{12}}{O_{21}/O_{22}} = \frac{O_{11} O_{22}}{O_{12} O_{21}}.    (14.4)

If the odds ratio is greater than one, it means that the odds (and therefore the probability) of being A_1 is greater for B_1 subjects than it is for B_2 subjects. For the chimp data the OR is 9.23, suggesting that chimps are much more likely than four-year-old humans to be able to solve the problem.

14.2.2 Fisher's exact test

For contingency tables with small numbers of counts, the chi-squared distribution will not give an adequate approximation to the distribution of the X^2 statistic, so the p-value obtained using the standard chi-squared analysis could be misleading. In this situation an exact p-value can be calculated using the hypergeometric distribution.

In the 2 × 2 contingency table, for fixed row and column totals the probability of getting the observed table is

p = \binom{R_1}{O_{11}} \binom{R_2}{O_{21}} \Big/ \binom{n}{C_1} = \frac{R_1! R_2! C_1! C_2!}{O_{11}! O_{12}! O_{21}! O_{22}! n!}    (14.5)

where R_1, C_1 etc. are the row and column totals. To test for association between the row and column variables, the p-value is the probability of getting the observed table or a more extreme one. So we have to decide what tables would be at least as extreme as the observed table given the null hypothesis. This can be done using the probabilities as defined in Equation 14.5 above, or using the difference |O_{11}/C_1 - O_{12}/C_2|. To get the p-value we sum the probabilities for all such tables.


Usually the test will be two-tailed (H_0: No association vs H_1: Association) but if there is an a priori reason to suppose that the association is in a particular direction (eg OR > 1) then a one-tailed test may be appropriate.

For the rooks vs gorillas data:

          yes   no
rook        4    0     4
gorilla     0    5     5
            4    5     9

p = \frac{4!\,5!\,4!\,5!}{4!\,0!\,0!\,5!\,9!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{9 \cdot 8 \cdot 7 \cdot 6} = \frac{1}{126}

\left| \frac{O_{11}}{C_1} - \frac{O_{12}}{C_2} \right| = \left| \frac{4}{4} - \frac{0}{5} \right| = 1
This is clearly the most extreme result for which the rooks win. Since there is no prior reason to suppose that rooks are more intelligent, we should perform a two-tailed test. The most extreme win for gorillas, keeping the row and column totals fixed, would be:

          yes   no
rook        0    4     4
gorilla     4    1     5
            4    5     9

p = \frac{4!\,5!\,4!\,5!}{0!\,4!\,4!\,1!\,9!} = \frac{5 \cdot 4 \cdot 3 \cdot 2}{9 \cdot 8 \cdot 7 \cdot 6} = \frac{5}{126}

\left| \frac{O_{11}}{C_1} - \frac{O_{12}}{C_2} \right| = \left| \frac{0}{4} - \frac{4}{5} \right| = 0.8

so this is not as extreme as the observed table. The two-sided p-value is therefore 1/126 = 0.0079, so there is strong evidence that rooks are better problem-solvers than gorillas.
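The probability of the observed table can be checked in R with the choose() function; this is just a quick sketch and is not part of the analysis in Section 14.3:

> choose(4, 4) * choose(5, 0) / choose(9, 4)
[1] 0.007936508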

14.2.3 McNemar's test

The tonsillectomy data should not be analysed using the usual chi-squared contingency table test because the data are in pairs: each Hodgkin's patient is paired with a sibling. So we need to perform a matched pairs analysis. If the response were numerical, this would mean doing a paired t-test instead of the two-sample t-test for independent samples. But here the response is categorical (tonsillectomy vs no tonsillectomy). If we choose a pair at random (a Hodgkin's patient with an eligible sibling) there are four possible outcomes with probabilities as labelled in the table below:

                                Sibling Tonsillectomy
                                yes        no         Total
Patient Tonsillectomy  yes      \pi_{11}   \pi_{12}   \pi_{1.}
                       no       \pi_{21}   \pi_{22}   \pi_{2.}
Total                           \pi_{.1}   \pi_{.2}   1

where the marginal probabilities are \pi_{1.} = \pi_{11} + \pi_{12} etc. The null hypothesis of no relationship between tonsillectomy and Hodgkin's disease is equivalent to no difference in the marginal distribution of tonsillectomy for Hodgkin's patients and their non-Hodgkin's siblings, ie \pi_{1.} = \pi_{.1}. This in turn is equivalent to \pi_{12} = \pi_{21}, and this is what McNemar's test is testing. Its test statistic is

X^2 = \frac{(O_{12} - O_{21})^2}{O_{12} + O_{21}}    (14.6)


which, under the null hypothesis, follows a \chi^2_1 distribution. The tonsillectomy data give

X^2 = \frac{(7 - 15)^2}{7 + 15} = 2.91

which, referred to a \chi^2_1 distribution, has a p-value of 0.0881. Some people use a continuity correction with McNemar's test, reducing |O_{12} - O_{21}| by 0.5 before squaring. This can have a considerable impact on the p-value.
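A one-line sketch of this calculation in R, which should reproduce the p-value quoted above, is:

> 1 - pchisq((7 - 15)^2 / (7 + 15), df = 1)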

14.2.4 Mantel-Haenszel test

A chi-squared analysis of a two-way contingency table can be misleading if there are confounding variables. For example, the odds ratio for the kidney stone data is (273 × 61)/(77 × 289) = 0.748, suggesting that the odds of success are lower with Treatment A than Treatment B. However if we take the size of the stone into account and analyze each table separately, the odds ratios are (81 × 36)/(6 × 234) = 2.0778 for small stones and (192 × 25)/(71 × 55) = 1.229 for large stones, both suggesting higher odds of success with Treatment A. Thus B seems to be better overall, but A is better for small stones and for large stones! The reason for this apparent contradiction is that B was used more frequently with small stones, which are easier to treat, and A with large stones which are harder, so combining all the results favours B. Such a reversal of effects when a confounding variable (here, size of stone) is ignored is called Simpson's Paradox.

The purpose of the Mantel-Haenszel test is to take such confounding variables into account by performing a stratified analysis. A separate 2 × 2 table is considered at each level of the possible confounding variable, and the results are pooled assuming a common odds ratio in each stratum. Denoting the observed frequencies in the k-th table by O_{k,11} etc., the test statistic can be written as

X^2_{MH} = \frac{\left[ \sum_k (O_{k,11} - O_{k,1.} O_{k,.1} / n_k) \right]^2}{\sum_k O_{k,1.} O_{k,.1} O_{k,2.} O_{k,.2} / (n_k^3 - n_k^2)}    (14.7)

Note that the quantity being summed in the numerator is the difference between the observed and expected values in the first cell of each table. Again, some people use a continuity correction, reducing the absolute value of the sum by 0.5 before squaring. Under the null hypothesis of no relationship between the row and column variables in
any of the strata, X^2_{MH} will follow a \chi^2_1 distribution; this gives a p-value for the test. An estimate of the common odds ratio and its confidence interval can also be calculated.

The Mantel-Haenszel test is sometimes, and more properly, referred to as the Cochran-Mantel-Haenszel (CMH) test. It is often employed in meta-analysis where the results of a number of clinical trials are combined, or in the analysis of multi-centre trials where a number of different hospitals or regions are involved. In these situations each trial, hospital or region is treated as a separate stratum.


14.3 Analysis using R

14.3.1 Chimps vs 4-year-olds

In R, chi-squared tests are applied to matrices. It is important to remember that when contingency table data are constructed using the matrix() command, the default order is to go down each column in turn.
> Peanuts <- matrix(c(14, 2, 29, 22), ncol = 2, dimnames = list(c("chimp",
+     "4-year-old"), c("yes", "no")))
> Peanuts
           yes no
chimp       14 29
4-year-old   2 22

> chisq.test(Peanuts)

        Pearson's Chi-squared test with Yates' continuity correction

data:  Peanuts 
X-squared = 3.729, df = 1, p-value = 0.05347

If expected values or residuals (O_{ij} - E_{ij})/\sqrt{E_{ij}} are required:


> Peanuts.chsq <- chisq.test(Peanuts)
> Peanuts.chsq$exp
              yes    no
chimp      10.269 32.73
4-year-old  5.731 18.27
> Peanuts.chsq$res
             yes      no
chimp      1.164 -0.6522
4-year-old -1.559  0.8730

The p-value gives some (but not very strong) evidence of the superiority of chimps. Note that the continuity correction for a 2 × 2 table has automatically been applied. To cancel this, use the option correct=F.
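For example, the following sketch re-runs the test without the correction and should return the uncorrected p-value of 0.02575 quoted in Section 14.2.1:

> chisq.test(Peanuts, correct = FALSE)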

14.3.2 Rooks vs gorillas

Because of the small counts in this table, we use Fisher's exact test, using the fisher.test() command:

> Stones <- matrix(c(4, 0, 0, 5), ncol = 2, dimnames = list(c("rook",
+     "gorilla"), c("yes", "no")))
> Stones
        yes no
rook      4  0
gorilla   0  5

> fisher.test(Stones)



        Fisher's Exact Test for Count Data

data:  Stones 
p-value = 0.007937
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 1.75  Inf 
sample estimates:
odds ratio 
       Inf 


giving strong evidence of the superiority of rooks. If the ordinary chi-squared test were applied, we would get a warning:
> chisq.test(Stones)

        Pearson's Chi-squared test with Yates' continuity correction

data:  Stones 
X-squared = 5.406, df = 1, p-value = 0.02007

14.3.3 Tonsillectomy and Hodgkin's disease

First the wrong (chi-squared) analysis:


> Hodgkins1 <- matrix(c(41, 33, 44, 52), ncol = 2, dimnames = list(c("Hodgkins",
+     "Control"), c("Tonsillectomy", "No tonsillectomy")))
> Hodgkins1
         Tonsillectomy No tonsillectomy
Hodgkins            41               44
Control             33               52

> chisq.test(Hodgkins1)

        Pearson's Chi-squared test with Yates' continuity correction

data:  Hodgkins1 
X-squared = 1.173, df = 1, p-value = 0.2789

and now the correct (McNemar's) test, here done using the mcnemar.test() command without continuity correction:
> Hodgkins2 <- matrix(c(37, 15, 7, 26), ncol = 2, dimnames = list(Patient = c("No T",
+     "T"), Sibling = c("No T", "T")))
> Hodgkins2
       Sibling
Patient No T  T
   No T   37  7
   T      15 26

> mcnemar.test(Hodgkins2, correct = F)

        McNemar's Chi-squared test

data:  Hodgkins2 
McNemar's chi-squared = 2.909, df = 1, p-value = 0.08808

There is weak evidence of an association. (Tonsils were thought to be protective against the development of Hodgkin's disease.)


14.3.4 Treatment of kidney stones

For the collapsed table we get:


> Kidney <- matrix(c(273, 77, 289, 61), ncol = 2, dimnames = list(c("success",
+     "failure"), c("Treatment A", "Treatment B")))
> chisq.test(Kidney)

        Pearson's Chi-squared test with Yates' continuity correction

data:  Kidney 
X-squared = 2.031, df = 1, p-value = 0.1541

whereas the stratified analysis is found using the mantelhaen.test() command:


> Kidney2 <- array(c(81, 6, 234, 36, 192, 71, 55, 25), dim = c(2,
+     2, 2), dimnames = list(c("success", "failure"), c("Treatment A",
+     "Treatment B"), Size = c("small", "large")))
> Kidney2
, , Size = small

        Treatment A Treatment B
success          81         234
failure           6          36

, , Size = large

        Treatment A Treatment B
success         192          55
failure          71          25

> mantelhaen.test(Kidney2)

        Mantel-Haenszel chi-squared test with continuity correction

data:  Kidney2 
Mantel-Haenszel X-squared = 2.091, df = 1, p-value = 0.1481
alternative hypothesis: true common odds ratio is not equal to 1 
95 percent confidence interval:
 0.9158 2.2858 
sample estimates:
common odds ratio 
            1.447 

There is no evidence here of a difference in success rates between the treatments.

14.4 Exercises

Exercise 14.1: Gardemann et al. (1998) surveyed genotypes at an insertion/deletion polymorphism of the apolipoprotein B signal peptide in 2259 men. Of men without coronary artery disease, 268 had the ins/ins genotype, 199 had the ins/del genotype, and 42 had the del/del genotype. Of men with coronary artery disease, there were 807 ins/ins,


759 ins/del, and 184 del/del. Test the biological hypothesis that the apolipoprotein polymorphism doesn't affect the likelihood of getting coronary artery disease. Examine and comment on the pattern of residuals.

Exercise 14.2: In the rooks vs gorillas contest, how many gorillas would have had to succeed in the task for the rooks not to have been found better (at the 10% significance level)?

Exercise 14.3: A study quoted by Bland (2000) questioned 1319 schoolchildren on the prevalence of symptoms of severe cold at the age of 12 and again at the age of 14 years, with the following results:

Severe cold    Severe cold at age 14
at age 12        Yes     No    Total
Yes              212    256      468
No               144    707      851
Total            356    963     1319

Was there a significant increase in the prevalence of severe colds?

Exercise 14.4: The data from a multi-centre trial comparing a new drug with a placebo are in the file Dairy.csv. Does the drug appear to have a significant effect? If so, is it beneficial?

Chapter 15 An Introduction to Survival Analysis: Gliomas and Breast Cancer


An original chapter
written by

Geoff Jones1

15.1 Introduction

A glioma is a type of tumour that starts in the brain or spine. A randomised clinical trial reported by Grana et al. (2002) investigated whether treating glioma patients with radioimmunotherapy in addition to the standard treatment increased their survival time, defined as the time from initial therapy to death. The data are shown in Exhibit 15.1.

Sauerbrei and Royston (1999) analyze data from the German Breast Cancer Study Group, which recruited 720 patients with primary node positive breast cancer into a cohort study. Both randomized and non-randomized patients were included. The effectiveness of hormonal treatment with Tamoxifen was investigated in the study. After a median follow-up time of nearly 5 years, 312 patients had had at least one recurrence of the disease or died. The paper analyzes recurrence-free survival time of the 686 patients (with 299 events: recurrence or death) who had complete data for the covariates age, tumour size, number of positive lymph nodes, progesterone and oestrogen receptor status, menopausal status, hormone therapy and tumour grade. Interest here is in the effect, if any, of the covariates on recurrence-free survival. The data are shown in Exhibit 15.2.
1 Geoff is an Associate Professor in the Institute of Fundamental Sciences.



Exhibit 15.1 Glioma data. Patients are given the standard treatment (Control) or radioimmunotherapy (RIT). There are two types of glioma: Grade 3 and GBM. The event recorded is death (TRUE) or survival (FALSE) at the time (in months) shown.
age  sex     histology  group    event  time
 41  Female  Grade3     RIT      TRUE     53
 45  Female  Grade3     RIT      FALSE    28
 48  Male    Grade3     RIT      FALSE    69
 54  Male    Grade3     RIT      FALSE    58
 40  Female  Grade3     RIT      FALSE    54
...  ...     ...        ...      ...     ...
 71  Female  GBM        Control  TRUE      8
 61  Male    GBM        Control  TRUE      6
 65  Male    GBM        Control  TRUE     14
 50  Male    GBM        Control  TRUE     13
 42  Female  GBM        Control  TRUE     25

Exhibit 15.2 GBSG2 data. Times (days), survival status and covariate information for patients in the German Breast Cancer Study Group.
horTh  age  menostat  tsize  tgrade  pnodes  progrec  estrec  time  cens
no      70  Post         21  II           3       48      66  1814     1
yes     56  Post         12  II           7       61      77  2018     1
yes     58  Post         35  II           9       52     271   712     1
yes     59  Post         17  II           4       60      29  1807     1
no      73  Post         35  II           1       26      65   772     1
...    ...  ...         ...  ...        ...      ...     ...   ...   ...
no      49  Pre          30  III          3        1      84   721     0
yes     53  Post         25  III         17        0       0   186     0
no      51  Pre          25  III          5       43       0   769     1
no      52  Post         23  II           3       15      34   727     1
no      55  Post         23  II           9      116      15  1701     1


15.2 Survival Analysis

Survival analysis is used in situations, like those above, where the response variable is the time to the occurrence of an event of interest, sometimes called time-to-event data. In medical contexts the event is often death from a particular cause, eg death from glioma, or it may be the time to recurrence of symptoms. Sometimes there are two groups (Treatment vs. Control) and we want to compare their survival to see which is better. Alternatively, there may be many possible predictors of survival (prognostic factors) and we want to determine which are the important ones. The two main features of time-to-event data that distinguish it from other kinds of statistical analysis are:

1. The times are typically very right-skewed, because there will usually be some individuals who survive much longer than normal; this makes methods based on the assumption of normality inappropriate.

2. There may be some individuals who have still not experienced the event during the period of observation, eg breast cancer patients who are still alive at the end of the study, or participants in a clinical trial who stop turning up at the clinic for checkups (lost to follow-up). In such cases the actual time to the event is unobserved, and is said to be censored. A time is still recorded for such individuals, but all we know is that their actual time-to-event is greater than the recorded time at which they were censored.

For both of the above reasons, means and variances do not give an adequate description of survival data. We now consider alternative statistical summaries.

15.2.1 Survival and Hazard Functions

The survival function gives a complete description of the distribution of survival times. Thinking of the survival time of an individual as a random variable, say T, the survival function at time t is defined as

S(t) = P(T > t)    (15.1)

Because this is a probability, it always lies between 0 and 1, and its definition implies that it is a decreasing (or at least non-increasing) function of t. The typical shape of a survival curve is shown in the first panel of Exhibit 15.3. We can think of the value of S(t) as representing the proportion of the population who have still not experienced the event by time t; if the event is death, this is the proportion of the population still alive at that time.

Exhibit 15.3 Typical survival curve, with density and hazard functions.
[Three panels: the survival function S(t), the probability density f(t), and the hazard function h(t).]

It follows from Equation 15.1 that the survival function is related to the density function f(t) by

S(t) = \int_t^{\infty} f(u) \, du    (15.2)

and this is represented by the area under the density curve as shown in the second panel of Exhibit 15.3. Conversely, the gradient of the survival curve is -f(t), so when the survival curve is steep it indicates that people are dying off quickly. Survival curves always tend to flatten out towards the end, because of the skewness of the event time distribution. Does this mean you are less likely to die as you get older? To examine the risk of death, we need to consider the probability of dying only for those people still alive at time t; this is expressed in the hazard function defined by

h(t) = \lim_{\delta t \to 0} \frac{P(t \le T < t + \delta t \mid T \ge t)}{\delta t} = \frac{f(t)}{S(t)}.    (15.3)

Note that this is a rate, not a probability. Like the density function, it does not have to lie between zero and one. It is sometimes called the instantaneous failure rate. The third panel of Exhibit 15.3 shows the hazard function for this particular distribution, which increases gradually over time. It is hard to tell the shape of the hazard function by looking at the survival curve, but it is perhaps the most important characterization of a lifetime distribution. Usually it increases with age (increasing hazard) but in some situations may remain constant (eg electrical components) or even decrease when there is a high risk of early failure (eg after a heart attack). An individual still alive at time t has managed to survive a cumulative hazard of:
H(t) = \int_0^t h(u) \, du.    (15.4)

This is a useful relationship for comparing hazards in different groups.

Using Equation 15.3 and a little calculus, it is possible to show that H(t) = -\log S(t).


15.2.2 Estimating and Comparing Survival Functions

Survival data, like the examples in the first section, comprise a time variable t for each individual, a death variable c to indicate whether the individual experienced the event at time t (1 = death or 0 = censored) and other covariate information such as age, sex and treatment group. If there are n individuals then, ignoring the covariates, the data can be represented as pairs (t_i, c_i) for i = 1 ... n. Now denote the unique death times by t_{(j)} for j = 1 ... J and the number of deaths at t_{(j)} as d_j. Then the famous Kaplan-Meier estimate of the survival function is

\hat{S}(t) = \prod_{j: t_{(j)} \le t} \left( 1 - \frac{d_j}{r_j} \right)    (15.5)

where r_j denotes the number of individuals still at risk at time t_{(j)}. Here "at risk" means individuals who haven't yet died or been censored. Thus at time t_{(j)} there are r_j individuals still under observation and d_j of them die, so the estimated probability of surviving time t_{(j)} is (1 - d_j/r_j). To survive up to and including time t, you need to survive each of the death times up to and including t, so we multiply together all of these survival probabilities.

The Kaplan-Meier (KM) estimate is consequently a step function that steps downwards every time there is a death. It is closely related to the empirical distribution function, but allows for censored observations by only including at-risk individuals in each calculation. The locations of censoring times are often shown using tick marks. Note that if the last recorded time is a death, \hat{S}(t) reaches zero; if the last time is a censoring event (eg still alive at the end of the study) then it doesn't reach zero. Confidence intervals can be calculated using Greenwood's formula for the variance of the Kaplan-Meier estimator. This is a little complicated and will not be shown here.

If there are two groups (eg treatment vs. control) it is useful to plot both KM curves on one graph; better survival is indicated by one curve being higher than the other. Sometimes the cumulative hazard \hat{H}(t) = -\log \hat{S}(t) is plotted instead to see how the hazard in one group is related to the hazard in the other. Going one step further, we might want to see if the hazard for one group is simply a multiple of the hazard in the other (eg a 50% reduction in risk when using the treatment); this is facilitated by plotting \log \hat{H}(t).

After visual inspection of the estimated survival curves, you may want to carry out a formal test to examine the statistical significance of the difference in survival between the groups. This can be done using the log-rank test. Its test statistic compares the number of deaths observed in each group with the expected number assuming no difference in survival; this is then referred to a chi-squared distribution with 1 df. The test can be extended to k groups, in which case the reference chi-squared distribution has (k - 1) df.

If there is suspected confounding because of another categorical variable, then the comparison can be stratified with respect to this second variable; the test then compares survival within each level of the second variable, pooling the evidence to give a single test statistic. It can happen that an apparently significant effect disappears, or is even reversed, when stratification is used (cf Simpson's Paradox).
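Before moving on, a tiny numerical sketch of the Kaplan-Meier calculation in Equation 15.5, using made-up death counts and risk sets rather than any real data, shows that it is just a cumulative product:

> d <- c(2, 1, 1)     # hypothetical numbers of deaths at the distinct death times
> r <- c(10, 7, 4)    # hypothetical numbers still at risk at those times
> cumprod(1 - d/r)    # Kaplan-Meier estimate just after each death time
[1] 0.8000000 0.6857143 0.5142857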

15.2.3 Cox Proportional Hazards Model

If there are numerical covariates, or a number of different possible predictors of survival, the methods of the previous section are no longer useful. The most common way of modelling the effect of covariates on survival is the Cox proportional hazards model (Cox (1972)). This assumes that each covariate has a multiplicative effect on the hazard. If we start with baseline values for each variable (eg the mean for numerical covariates and the baseline level for factors) then the hazard function for an individual whose covariate values are all at the baseline is denoted h_0(t), known as the baseline hazard. Another individual whose covariate values are given by the vector x is now assumed to have a hazard function of

h(t) = h_0(t) \exp(x^T \beta)    (15.6)

where x^T \beta = \beta_1 x_1 + ... + \beta_p x_p. This means that for a numerical covariate x_j an increase of one unit, keeping all other covariate values fixed, would cause the hazard function to be multiplied by \exp(\beta_j); if \beta_j < 0 this reduces the hazard by a fixed proportion at all time points.

The Cox proportional hazards model is referred to as a semi-parametric model because the covariate effects are modelled parametrically but the baseline survivor function is estimated non-parametrically (as with the KM estimator). The parametric part of the model is used to form the partial likelihood, which is maximized to estimate the parameter vector \beta. This kind of modelling appeals to medical researchers and epidemiologists because it characterizes predictors of survival, or risk factors, in terms of relative risk. Suppose a treatment has a \beta of -2 where the baseline is control (ie no treatment). Then the model says that the risk of death for any individual at any time is reduced by a factor of \exp(-2) = 0.135, ie an 86.5% reduction in risk.

Of course the model may not be right: the effect of a covariate on the hazard may not be simply multiplicative. The assumption of proportional hazards (PH) should be tested. For categorical predictors this can be done by plotting \log \hat{H}(t), ie \log(-\log \hat{S}(t)), for each group; if the PH assumption is valid, the plots should be parallel. Therneau and Grambsch (2000) give a general method of checking the PH assumption for all variables in the model, by testing whether the estimates of \beta_1, ..., \beta_p vary over time. Other diagnostics are available but they can be difficult to interpret; there are many

174

CHAPTER 15. AN INTRODUCTION TO SURVIVAL ANALYSIS

dierent kinds of residuals in survival analysis. One such is called the martingale residual. These can be used in residual plots to look for outliers or systematic lack of t.
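Once a Cox model has been fitted with coxph() (as in Section 15.3 below), the hazard ratios and their confidence intervals are obtained by exponentiating the estimated coefficients. The following sketch assumes a hypothetical data frame dat with a follow-up time time, a death indicator event and a treatment factor group:

> library("survival")
> fit <- coxph(Surv(time, event) ~ group, data = dat)
> exp(coef(fit))      # estimated hazard ratio(s)
> exp(confint(fit))   # 95% confidence intervals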

15.3
15.3.1

Analysis Using R
Gliomas

The survival analysis routines are in the survival package, which should be loaded first. The Glioma data can be read from the file Glioma.csv using the read.csv() command,
> library("survival")

> glioma <- read.csv("glioma.csv")

> head(glioma)
  age    sex histology group event time
1  41 Female    Grade3   RIT  TRUE   53
2  45 Female    Grade3   RIT FALSE   28
3  48   Male    Grade3   RIT FALSE   69
4  54   Male    Grade3   RIT FALSE   58
5  40 Female    Grade3   RIT FALSE   54
6  31   Male    Grade3   RIT  TRUE   25

Individuals with event == TRUE were observed to die during the trial, those with event == FALSE are censored at the time shown. The response in survival analysis is the combination of event time and status, and these are packaged together using the R function Surv(). This is written as part of a model statement, so that the analysis can depend on covariates; if no covariates are used, specify the model as ~1. In Exhibit 15.4 a single KM curve is estimated from the whole dataset. Note that the default is to add 95% confidence limits, and tick marks for censored observations.

To examine the effect of treatment on survival we include group in the model. In Exhibit 15.5 we also subset the data to show the effect separately for each tumour type. This suggests better survival with treatment for both tumour types. The log-rank test for a difference in survival curves is implemented using survdiff():

> survdiff(Surv(time, event) ~ group, data = glioma)
Call:
survdiff(formula = Surv(time, event) ~ group, data = glioma)

               N Observed Expected (O-E)^2/E (O-E)^2/V
group=Control 18       16     7.13     11.03      17.7
group=RIT     19        7    15.87      4.96      17.7

 Chisq= 17.7  on 1 degrees of freedom, p= 2.54e-05


Exhibit 15.4 Marginal KM curve for Glioma data.


> plot(survfit(Surv(time, event) ~ 1, data = glioma), ylab = "Probability", + xlab = "Survival Time in Months")


Exhibit 15.5 KM curves showing treatment effect for Glioma data.

> par(mfrow = c(1, 2))
> plot(survfit(Surv(time, event) ~ group, data = glioma[glioma$histology ==
+     "Grade3", ]), lty = 1:2, ylab = "Probability",
+     xlab = "Survival Time in Months", main = "Grade3")
> legend("bottomleft", legend = c("Control", "Treated"), lty = 1:2, bty = "n")
> plot(survfit(Surv(time, event) ~ group, data = glioma[glioma$histology ==
+     "GBM", ]), lty = 1:2, ylab = "Probability",
+     xlab = "Survival Time in Months", main = "GBM")



Exhibit 15.6 KM curves showing treatment effect for GBSG data.

> GBSG2 <- read.csv("GBSG2.csv")

> par(mfrow = c(1, 2))
> plot(survfit(Surv(time, cens) ~ horTh, data = GBSG2), lty = 1:2,
+     mark.time = FALSE, ylab = "Probability of Survival",
+     xlab = "Survival time (Days)")
> legend("bottomleft", legend = c("no", "yes"), lty = c(1, 2),
+     title = "Hormone Therapy", bty = "n", cex = 0.7)
> plot(survfit(Surv(time, cens) ~ horTh, data = GBSG2), lty = 1:2,
+     mark.time = FALSE, ylab = "Log Cumulative Hazard",
+     xlab = "Survival time (Days)", fun = "cloglog")


which shows strong evidence of a difference; comparing observed and expected, we can see that there are fewer deaths than expected in the treatment group (RIT). However there may be a confounding effect from tumour type, so we also try a stratified analysis:
> survdiff(Surv(time, event) ~ group + strata(histology),
+     data = glioma)
Call:
survdiff(formula = Surv(time, event) ~ group + strata(histology),
    data = glioma)

               N Observed Expected (O-E)^2/E (O-E)^2/V
group=Control 18       16     7.42      9.94      18.6
group=RIT     19        7    15.58      4.73      18.6

 Chisq= 18.6  on 1 degrees of freedom, p= 1.62e-05

which confirms the beneficial effect of treatment for each type.

15.3.2

Breast Cancer

Exhibit 15.6 plots the KM estimates of recurrence-free survival with and without hormone therapy. Survival seems to be better for the hormone therapy group. Also shown is the plot of log Ĥ(t) for each treatment group, and it seems not unreasonable to assume that treatment has a proportional effect on the hazard, since the log cumulative hazard curves are approximately parallel. However this analysis ignores all the other covariates, which if taken into account might change our conclusions. To model the effects of all possible predictors of survival we assume proportional hazards and fit the Cox model:
> GBSG2.coxph <- coxph(Surv(time, cens) ~ ., data = GBSG2)
> summary(GBSG2.coxph)
Call:
coxph(formula = Surv(time, cens) ~ ., data = GBSG2)

  n= 686, number of events= 299

                 coef exp(coef)  se(coef)     z Pr(>|z|)    
horThyes    -0.346278  0.707316  0.129075 -2.68  0.00730 ** 
age         -0.009459  0.990585  0.009301 -1.02  0.30913    
menostatPre -0.258445  0.772252  0.183476 -1.41  0.15895    
tsize        0.007796  1.007827  0.003939  1.98  0.04779 *  
tgradeII     0.636112  1.889121  0.249202  2.55  0.01069 *  
tgradeIII    0.779654  2.180718  0.268480  2.90  0.00368 ** 
pnodes       0.048789  1.049998  0.007447  6.55  5.7e-11 ***
progrec     -0.002217  0.997785  0.000574 -3.87  0.00011 ***
estrec       0.000197  1.000197  0.000450  0.44  0.66131    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

            exp(coef) exp(-coef) lower .95 upper .95
horThyes        0.707      1.414     0.549     0.911
age             0.991      1.010     0.973     1.009
menostatPre     0.772      1.295     0.539     1.106
tsize           1.008      0.992     1.000     1.016
tgradeII        1.889      0.529     1.159     3.079
tgradeIII       2.181      0.459     1.288     3.691
pnodes          1.050      0.952     1.035     1.065
progrec         0.998      1.002     0.997     0.999
estrec          1.000      1.000     0.999     1.001

Concordance= 0.692  (se = 0.018 )
Rsquare= 0.142   (max possible= 0.995 )
Likelihood ratio test= 105  on 9 df,   p=0
Wald test            = 115  on 9 df,   p=0
Score (logrank) test = 121  on 9 df,   p=0

When interpreting the output, remember that the effect of the covariates on the baseline hazard is to multiply it by exp(xᵀβ); this is why the exponentiated values of the parameters are also given in the output. For hormone therapy horTh the default factor level is no therapy, so the estimated effect of the therapy (horThyes) is to multiply the baseline hazard by 0.707 (with confidence interval 0.549 to 0.911); thus we could say that hormone therapy reduces the risk of recurrence by between 9% and 45%. The most significant risk factor (smallest p-value) is the number of positive lymph nodes pnodes; the estimated coefficient is 0.0488 which is positive, indicating an increase in the hazard, and the relative

Exhibit 15.7 Estimated baseline survival curve for German Breast Cancer Study.
> plot(survfit(GBSG2.coxph), xlab = "Survival time (Days)", + ylab = "Baseline Survival")


risk of e^0.0488 = 1.050 indicates that every positive lymph node increases the risk by an estimated 5% (confidence interval 3.5% to 6.5%). Some of the parameters are not significant, so we might want to drop some of the covariates and fit a simpler model. The usual R model statements apply here. We could also try adding interaction terms. Nested models can be compared using the anova() command. If you wanted to stratify the analysis on a particular factor, eg menostat, you would add + strata(menostat) to the model statement; this would allow separate baseline hazards for each level of the factor. The baseline survival function can be examined by applying survfit() to the fitted coxph object, as in Exhibit 15.7. Note that the baseline hazard here uses the average value of all covariates, including factors. If you want the survival function for a particular set of covariate values you can use predict() on the fitted coxph object. The proportional hazards assumption can be checked using cox.zph():
> GBSG2.zph <- cox.zph(GBSG2.coxph)
> GBSG2.zph
                  rho    chisq       p
horThyes    -2.54e-02 1.96e-01 0.65778
age          9.40e-02 2.96e+00 0.08552
menostatPre  1.19e-05 3.75e-08 0.99985
tsize       -2.50e-02 1.88e-01 0.66436
tgradeII    -7.13e-02 1.49e+00 0.22272
tgradeIII   -1.30e-01 4.85e+00 0.02772
pnodes       5.84e-02 5.98e-01 0.43941
progrec      5.65e-02 1.20e+00 0.27351
estrec       5.46e-02 1.03e+00 0.30967
GLOBAL             NA 2.27e+01 0.00695

The p-values indicate a problem with the proportional hazards assumption for tumour grade, and possibly for age. The test output can be plotted over time, as in Exhibit 15.8, showing

Exhibit 15.8 Test of proportional hazards for GBSG data.
> par(mfrow = c(1, 2)) > plot(GBSG2.zph, var = "tgradeII") > plot(GBSG2.zph, var = "age")


[Two panels: the smoothed estimates Beta(t) for tgradeII and Beta(t) for age, plotted against time.]

that the failure of the PH assumption can be regarded as being a time-varying effect. The lines on the plots are obtained by smoothing a particular kind of model residual (see Therneau and Grambsch (2000)).

Exhibit 15.9 shows a plot of the martingale residuals against age. If the model is correct, the martingale residuals should have an expected value of zero, irrespective of the age of the patient. This is difficult to judge though, as the residuals are not symmetrically distributed: they cannot be greater than 1. A martingale residual near 1 occurs if a patient dies when the model says they are not at risk of dying; a patient who survives longer than the model predicts has a negative residual. In the residual plot for age there are a couple of large negative values, probably for patients who have not experienced recurrence despite having covariate values that put them at high risk.
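A predicted survival curve for a particular set of covariate values can also be obtained by giving survfit() a newdata argument. The sketch below is only illustrative: the covariate values are invented, and the factor codings are assumed to match those in the GBSG2 data as read above.

> newpat <- data.frame(horTh = "yes", age = 50, menostat = "Post",
+     tsize = 20, tgrade = "II", pnodes = 3, progrec = 100, estrec = 100)
> plot(survfit(GBSG2.coxph, newdata = newpat),
+     xlab = "Survival time (Days)", ylab = "Predicted Survival")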

15.4

Exercises

Exercise 15.1: Simplify the Cox model for the GBSG2 data by sequentially removing non-significant terms. Use anova() at each stage to test for removal. Try a stratified analysis on tumour grade.

Exercise 15.2: The dataset mastectomy in the DRUGS package gives the survival times in months of breast cancer patients after mastectomy. When the cancers were removed they were classed as having metastasised or not. Use a Kaplan-Meier plot and a log-rank test to compare the survival of the two groups. Is a Cox proportional hazards model appropriate here?

Exercise 15.3: Investigate proportional hazards modelling for the glioma data.


Exhibit 15.9 Martingale residual plot for GBSG data.


> res <- residuals(GBSG2.coxph) > plot(res ~ age, data = GBSG2, ylab = "Martingale Residuals")


Chapter 16 Nonlinear Regression: Enzyme Kinetics, Growth Curves and Bioassay


An original chapter
written by

Geoff Jones1

16.1

Introduction

Enzymes catalyze reactions in physiological systems where one substance, the substrate, is converted into another, the product. The initial speed of the reaction (Y) depends on the concentration of the substrate (x) according to a relationship that biochemists call Michaelis-Menten kinetics:

Y = θ1 x / (θ2 + x)     (16.1)

An experiment by Treloar (1974) measured the initial speed of a reaction (in counts/min2) for a range of substrate concentrations (in parts per million) under two different conditions: with/without Puromycin. The data are given in Exhibit 16.1, and plotted in Exhibit 16.2. The interest here is in the effect of Puromycin on the kinetics of the reaction.

The growth of an organism or system over time is usually nonlinear. Here the response Y is a measurement such as height, weight or length, and x is time. Often there is an upper asymptote representing the adult or mature value of Y, and the curve can be assumed
1

Geoff is an Associate Professor in the Institute of Fundamental Sciences.


Exhibit 16.1 Puromycin data. Substrate concentration and reaction velocity for treated and untreated enzyme reactions.
Substrate                 Velocity (counts/min2)
Concentration (ppm)   (a) Treated   (b) Untreated
       0.02                76             67
       0.02                47             51
       0.06                97             84
       0.06               107             86
       0.11               123             98
       0.11               139            115
       0.22               159            131
       0.22               152            124
       0.56               191            144
       0.56               201            158
       1.10               207            160
       1.10               200              -

Exhibit 16.2 Reaction velocity for treated and untreated groups.

[Scatterplot of Velocity against Concentration, with separate symbols for the Treated and Untreated groups.]

Exhibit 16.3 Kiwi data. Bill length in mm and age at recapture in days.
Age (days)   Bill Length (mm)     Age (days)   Bill Length (mm)
     1             42.4               668           104.1
    38             52.1               809           111.8
   200             70.8               835           112.3
   246             80.0               892           115.6
   283             83.3               982           117.9
   290             83.3              1138           121.4
   306             83.5              1289           123.9
   396             92.6              1514           125.6
   428             93.4              1689           127.4
   481             97.3              1744           127.1
   516             97.9              1906           127.8
   556            100.3              2047           128.3
   614            101.6


Exhibit 16.4 Growth in bill length of a kiwi.

[Scatterplot of bill length (mm) against age in days for the recaptured kiwi.]

to be monotonic increasing (dY/dx > 0). For example, Exhibit 16.3 gives the bill length (mm) of a North Island Brown Kiwi measured each time it is recaptured (Jones et al., 2009). The data are plotted in Exhibit 16.4. There are several standard growth curve models that have been used by researchers to fit such data. Some originated as the solution to a differential equation incorporating assumptions about the rate of growth, but mostly they are used as empirical models that


Exhibit 16.5 Bioassay data. Weight in mg of nasturtium plants grown in various concentrations of an agrochemical.
Concentration in g/ha                Weight in mg
       0.000         920   889   866   930   992  1017
       0.025         919   878   882   854   851   850
       0.075         870   825   953   834   810   875
       0.250         880   834   795   837   834   810
       0.750         693   690   722   738   563   591
       2.000         429   395   435   412   273   257
       4.000         200   244   209   225   128   221

just happen to fit the data at hand. Three of the more common models are:

Gompertz:   Y = θ1 exp(−θ2 e^(−θ3 x))        (16.2)
Logistic:   Y = θ1 / (1 + θ2 e^(−θ3 x))      (16.3)
Weibull:    Y = θ1 (1 − θ2 e^(−θ3 x))        (16.4)

The term bioassay refers to the use of a biological system to determine the amount of a target chemical (called the analyte) in samples. The method requires that the biological system produces a numerical response Y that depends on the concentration C of the analyte. The relationship between Y and C is often nonlinear, and is estimated by measuring Y on a set of predetermined concentrations C; these are called the calibration data. When the relationship has been estimated, it can be used to determine the concentration C for new samples given their measured responses Y. An experiment described by Racine-Poon (1988) measured the weights of nasturtium plants (in mg) for a range of concentrations (in g/ha) of an agrochemical in soil samples. The data are given in Exhibit 16.5. Exhibit 16.6 plots the weight against the natural log of the concentrations x = log C, with the zeros represented by -5. (Why not log 0?). A model suggested for this relationship is

Y = θ1 / (1 + exp(θ2 + θ3 x))     (16.5)

The weights recorded for three soil samples from a field with an unknown concentration C0 of the agrochemical gave weights of 309, 296, 419. We want to estimate the unknown concentration C0.

Exhibit 16.6 Calibration data for nasturtium bioassay.


[Scatterplot of Weight against log Concentration for the calibration data.]

16.2

Nonlinear Regression

Nonlinear regression is used in situations, like those above, where a numerical response variable Y is related to a numerical predictor x by a nonlinear function:

Y = f(x; θ) + ε     (16.6)

where θ represents a vector of parameters and ε is a random error. For example, in Equation 16.1, θ has three components (θ1, θ2, θ3), and ε might represent the error made in measuring Y (a radioactive count). We will assume that the errors ε are independent and identically distributed, with mean zero and constant variance σ². In practice the variability may change at different points on the curve, so that σ² is a function of x; we will not consider this extra complication.

Sometimes the function f(x; θ) represents a scientific law or principle, as in the Michaelis-Menten example. In this case the parameters of the function may have a precise scientific meaning; for example, θ2 in Equation 16.1 is known as the Michaelis constant and measures the affinity of the enzyme for the substrate. In other situations f(x; θ) may just be a convenient function which has the right shape, in which case the parameters may have no particular meaning or interest. Nevertheless, in the growth curve examples of Equations 16.2 to 16.4 we can see that θ1 represents the adult bill length, and in the bioassay example θ1 is the mean weight of nasturtium when grown in soil free of the agrochemical. Such considerations can be useful when choosing starting values for numerical estimation methods.


Exhibit 16.7 Linearized plot of Puromycin data

[Plot of 1/Velocity against 1/Concentration for the Treated and Untreated groups.]

16.2.1

Transformations, Reparametrization and Nonlinearity

Sometimes it is possible to turn a nonlinear relationship into a linear one by transforming Y and/or x. For example in the Michaelis-Menten formula of Equation 16.1 we can invert both sides of the equation to get:

1/Y = 1/θ1 + (θ2/θ1)(1/x)     (16.7)

so if we make the transformations Y* = 1/Y and x* = 1/x, and re-parametrize to β0 = 1/θ1 and β1 = θ2/θ1, then we get the simple linear relationship Y* = β0 + β1 x*. This can be seen by plotting 1/Y against 1/x for the Puromycin data, as in Exhibit 16.7. Note that although the relationship has been made linear, the variability is now very clearly non-constant. We have the choice then of either using nonlinear regression with constant variance or linear regression with an appropriate variance function to model the increasing variability.

The crucial characteristic of nonlinear statistical models is that they are nonlinear in the parameters. The quadratic model Y = β0 + β1 x + β2 x² is a nonlinear function of x but is linear in its parameters β0, β1, β2 and can be fitted using linear regression methods with the transformed predictors x, x². The growth models in Equations 16.2-16.4 cannot be transformed or reparametrized to make them linear in the parameters (Try it!) and here nonlinear regression is necessary. There are however alternative versions for each of these curves, with different parametrizations. For example, some researchers would fit the Gompertz model as:

Y = A exp(−e^(−B(x−C)))     (16.8)

with parameters A, B, C (Work out the correspondence between these and the original parametrization!). The fitted parameter values will be different, but the fitted curve will be the same.
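One practical use of the linearization in Equation 16.7 is to generate rough starting values for a nonlinear fit. A minimal sketch, assuming the Puro data frame that is read in Section 16.3.1 below:

> lin.fit <- lm(I(1/V_t) ~ I(1/Conc), data = Puro)
> b <- coef(lin.fit)                    # b[1] = 1/theta1, b[2] = theta2/theta1
> c(t1 = unname(1/b[1]), t2 = unname(b[2]/b[1]))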

16.2.2

Numerical Methods

When a simple linear regression model is fitted to data, the least squares method is used; this means that the parameters in the model are chosen to minimize the sum of the squares of the errors. Implicit in this approach is the assumption that the errors are independent with the same variance; if the errors are also normally distributed then least squares is equivalent to maximum likelihood. The same approach is used to estimate the parameter vector θ in Equation 16.6, but now it is called nonlinear least squares. Thus given a set of data (x1, Y1), (x2, Y2), ..., (xn, Yn), we choose θ to minimize:

SSE = Σᵢ [Yi − f(xi; θ)]²,  summing over i = 1, ..., n     (16.9)

If f(x; θ) is linear (in θ) then the estimator θ̂ can be found using calculus; this is what is done in the case of simple linear regression to get the formulae for the slope and intercept. In nonlinear regression an exact analytical solution cannot in general be found, so numerical methods are required to solve the minimization of Equation 16.9. Computer programs for doing this are based on the Newton-Raphson method, and require the user to provide starting values for all the parameters to be estimated. The program will then go through several iterations, producing better and better estimates with lower and lower values of SSE, until a convergence criterion is met, at which point the program stops and the final estimate θ̂ is outputted, along with its standard errors. The convergence criterion can be based either on the size of changes in SSE or on the size, or relative size, of changes in θ̂. Most programs allow you to change the convergence criterion, and the maximum number of iterations before stopping, but some care is needed when doing this.

Sometimes there are problems. The program may tell you that it has failed to converge, or else it may crash altogether. This may mean that you have made a mistake in writing the function, or that your starting values are not very good. Sometimes all that is needed is starting values that are not completely ridiculous; at other times it is worthwhile calculating good ones. Some possible strategies for this are given below.

Consider Equation 16.1. Note that as x → ∞, Y → θ1, so θ1 represents the maximum attainable velocity and can be estimated from a plot of the data. Next note that when x = θ2, Y = θ1/2, so θ2, the Michaelis constant, is the concentration at which the velocity is half its maximum value. This too can be estimated from a plot of the data.

Models with three parameters, such as Equations 16.2-16.5, are more difficult, but similar strategies can be used: consider x → ∞, consider x = 0 or other convenient values, consider when Y is halfway between its maximum and minimum values, estimate these from a graph and solve the resulting equations for (θ1, θ2, θ3). Examples are given in the analyses below. Often too the program will work if one or two of the parameters have good starting values with the others as guesses. Finally, some parametrizations work better than others, i.e. two equivalent models might have quite different convergence properties.
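The graphical strategy for the Michaelis-Menten curve can also be automated crudely, as in the following sketch (again assuming the Puro data frame of Section 16.3.1):

> t1.start <- max(Puro$V_t)                                    # maximum velocity
> t2.start <- Puro$Conc[which.min(abs(Puro$V_t - t1.start/2))] # conc at half-maximum
> c(t1.start, t2.start)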

16.3
16.3.1

Analysis Using R
Enzyme Kinetics

The data for the Puromycin experiment can be read from the file Puromycin.csv using the read.csv() command,
> Puro = read.csv("Puromycin.csv")

or, directly from the DRUGS package using


> data(Puro, package = "DRUGS")

> head(Puro)
  Conc V_t V_u
1 0.02  76  67
2 0.02  47  51
3 0.06  97  84
4 0.06 107  86
5 0.11 123  98
6 0.11 139 115

First we fit the model in Equation 16.1 to the treated group only (V_t), using the function nls(). By referring to Exhibit 16.2 we choose starting values of θ1 = 200 for the maximum velocity and θ2 = 0.1 for the concentration at half-maximum velocity.
> PuroT.nls = nls(V_t ~ t1 * Conc/(t2 + Conc), data = Puro, + start = list(t1 = 200, t2 = 0.1)) > summary(PuroT.nls) Formula: V_t ~ t1 * Conc/(t2 + Conc) Parameters: Estimate Std. Error t value Pr(>|t|) t1 2.13e+02 6.95e+00 30.61 3.2e-11 *** t2 6.41e-02 8.28e-03 7.74 1.6e-05 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 10.9 on 10 degrees of freedom Number of iterations to convergence: 6 Achieved convergence tolerance: 6.1e-06

Exhibit 16.8 Fitted Michaelis-Menten model for Puromycin-treated reaction.

> attach(Puro)
> plot(Conc, V_t, xlab = "Concentration", ylab = "Velocity")
> detach(Puro)
> lines((0:120)/100, predict(PuroT.nls, list(Conc = (0:120)/100)))



To examine the fit of the model, use the predict() function to give predicted values from the model for a range of Conc values, as shown in Exhibit 16.8, which suggests a reasonable fit. We can also create the traditional plot of residuals against fitted values to check this (shown in Exhibit 16.9). There is perhaps a slight suggestion of curvature in the plot, and perhaps greater variability for lower velocities, but there is not enough data to draw a firm conclusion. The corresponding fit for the untreated group (V_u) is:
Parameters:
    Estimate Std. Error t value Pr(>|t|)    
t1 1.603e+02  6.480e+00  24.734 1.38e-09 ***
t2 4.771e-02  7.782e-03   6.131 0.000173 ***
---
  (1 observation deleted due to missingness)

It was thought that the effect of Puromycin would be to increase the maximum velocity (θ1) but not change the concentration at half-maximum velocity (θ2). To test this we first stack the treated and untreated groups and then fit a single model for both, with a change in the parameters for the treated group (Treat = 1):


Exhibit 16.9 Residuals versus fits for Puromycin-treated reaction.


> plot(fitted(PuroT.nls), residuals(PuroT.nls), xlab = "fitted values", + ylab = "residuals")


> attach(Puro)
> PuroTU = data.frame(Conc = c(Conc, Conc), V = c(V_t, V_u),
+     Treat = c(rep(1, 12), rep(0, 12)))
> detach(Puro)
> PuroTU.nls = nls(V ~ (t1 + d1 * Treat) * Conc/(t2 + d2 * Treat +
+     Conc), start = list(t1 = 150, d1 = 50, t2 = 0.05, d2 = 0.01),
+     data = PuroTU)
> summary(PuroTU.nls)

Formula: V ~ (t1 + d1 * Treat) * Conc/(t2 + d2 * Treat + Conc)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
t1 1.60e+02   6.90e+00   23.24  2.0e-15 ***
d1 5.24e+01   9.55e+00    5.49  2.7e-05 ***
t2 4.77e-02   8.28e-03    5.76  1.5e-05 ***
d2 1.64e-02   1.14e-02    1.44     0.17    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.4 on 19 degrees of freedom

Number of iterations to convergence: 5 
Achieved convergence tolerance: 1.57e-06 
  (1 observation deleted due to missingness)


Since d1 is significantly non-zero, but d2 is not, this supports the scientific hypothesis concerning the effect of Puromycin.
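One way to confirm this (a sketch; the model name PuroTU.nls2 is ours) is to refit without d2 and compare the two nested fits with anova():

> PuroTU.nls2 = nls(V ~ (t1 + d1 * Treat) * Conc/(t2 + Conc),
+     start = list(t1 = 150, d1 = 50, t2 = 0.05), data = PuroTU)
> anova(PuroTU.nls2, PuroTU.nls)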

16.3.2

Growth Curves

The kiwi bill length data in the file Kiwi.csv can be accessed using
> data(Kiwi, package = "DRUGS") > head(Kiwi) age 1 38 200 246 283 290 bill 42.36 52.08 70.75 80.00 83.30 83.30

1 2 3 4 5 6

To fit the models in Equations 16.2-16.4 to the Kiwi data, we need to choose starting values. From Exhibit 16.4 the adult bill length is around 130 mm, so we take θ1 = 130. When age = 0 the bill length is around 40 mm, so:

Gompertz: θ2 = log(130/40) = 1.2
Logistic: θ2 = 130/40 − 1 = 2.3
Weibull:  θ2 = 1 − 40/130 = 0.7

When bill is half of its maximum value, age is approximately 200, so:

Gompertz: θ3 = log(1.2/log 2)/200 = 0.0027
Logistic: θ3 = log(2.3)/200 = 0.0042
Weibull:  θ3 = log(0.7/0.5)/200 = 0.0017

With these starting values, nls() gives:
> bill.nlsG = nls(bill ~ t1 * exp(-t2 * exp(-t3 * age)), data = Kiwi, + start = c(t1 = 130, t2 = 1.2, t3 = 0.0027)) > summary(bill.nlsG) Formula: bill ~ t1 * exp(-t2 * exp(-t3 * age)) Parameters: Estimate Std. Error t value Pr(>|t|) t1 1.28e+02 1.04e+00 122.3 <2e-16 t2 9.85e-01 2.81e-02 35.1 <2e-16 t3 2.63e-03 1.14e-04 23.2 <2e-16 --Signif. codes: 0 '***' 0.001 '**' 0.01

*** *** *** '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.25 on 22 degrees of freedom Number of iterations to convergence: 5 Achieved convergence tolerance: 1.95e-06 > bill.nlsL = nls(bill ~ t1/(1 + t2 * exp(-t3 * age)), data = Kiwi, + start = c(t1 = 130, t2 = 2.3, t3 = 0.0042)) > summary(bill.nlsL)

Formula: bill ~ t1/(1 + t2 * exp(-t3 * age)) Parameters: Estimate Std. Error t value Pr(>|t|) t1 1.26e+02 1.27e+00 99.3 < 2e-16 t2 1.52e+00 8.26e-02 18.3 8.1e-15 t3 3.29e-03 1.85e-04 17.8 1.5e-14 --Signif. codes: 0 '***' 0.001 '**' 0.01


*** *** *** '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3 on 22 degrees of freedom Number of iterations to convergence: 7 Achieved convergence tolerance: 3.63e-06 > bill.nlsW = nls(bill ~ t1 * (1 - t2 * exp(-t3 * age)), data = Kiwi, + start = c(t1 = 130, t2 = 0.7, t3 = 0.0017)) > summary(bill.nlsW) Formula: bill ~ t1 * (1 - t2 * exp(-t3 * age)) Parameters: Estimate Std. Error t value Pr(>|t|) t1 1.30e+02 8.28e-01 156.8 <2e-16 t2 6.55e-01 7.72e-03 84.8 <2e-16 t3 1.97e-03 6.19e-05 31.8 <2e-16 --Signif. codes: 0 '***' 0.001 '**' 0.01

*** *** *** '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.52 on 22 degrees of freedom Number of iterations to convergence: 4 Achieved convergence tolerance: 3.3e-06

Based on the residual standard errors, it would appear that the Weibull model gives the best fit to the data. We can make a subjective assessment by plotting all three curves with the data, as in Exhibit 16.10; colour is useful here, but cannot be shown in our printed material. Another useful comparison would be to plot the residuals against fitted values for each model.
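Exhibit 16.10, which appears later in this section, was produced with commands along the following lines; this is a sketch rather than the exact code used:

> plot(bill ~ age, data = Kiwi, xlab = "Age in days", ylab = "Bill length (mm)")
> ages <- 0:2050
> lines(ages, predict(bill.nlsG, list(age = ages)), lty = 1)
> lines(ages, predict(bill.nlsL, list(age = ages)), lty = 2)
> lines(ages, predict(bill.nlsW, list(age = ages)), lty = 3)
> legend("bottomright", c("Gompertz", "Logistic", "Weibull"), lty = 1:3)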

16.3.3

Bioassay

The nasturtium bioassay data are in the file Bioassay.csv. We need to add the log-transformed concentrations, with -5 in place of log 0:
> Bioassay = read.csv("Bioassay.csv")

> Bioassay$Conc[1:6] = exp(-5)
> Bioassay$logC = log(Bioassay$Conc)
> Bioassay[1:7, ]
      Conc   Wt   logC
1 0.006738  920 -5.000
2 0.006738  889 -5.000

Exhibit 16.10 Comparison of fitted models for Kiwi bill length.


[Plot of bill length (mm) against age in days, with the fitted Gompertz, Logistic and Weibull curves overlaid.]

3 0.006738  866 -5.000
4 0.006738  930 -5.000
5 0.006738  992 -5.000
6 0.006738 1017 -5.000
7 0.025000  919 -3.689

To fit the model in Equation 16.5 we examine Exhibit 16.6 to get the starting values for the three parameters. The maximum Y is about θ1 = 900. When x = log C = 0, Y is about 600 so take θ2 = log(900/600 − 1) = −0.7. When x = 1, Y is about 300 so take θ3 = log(900/300 − 1) + 0.7 = 1.4. With these starting values, nls() converges quickly:

> Bioassay.nls = nls(Wt ~ t1/(1 + exp(t2 + t3 * logC)), data = Bioassay,
+     start = list(t1 = 900, t2 = -0.7, t3 = 1.4))
> summary(Bioassay.nls)

Formula: Wt ~ t1/(1 + exp(t2 + t3 * logC))

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
t1  897.860     13.789   65.11  < 2e-16 ***
t2   -0.616      0.108   -5.72  1.3e-06 ***
t3    1.353      0.110   12.34  4.8e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.6 on 39 degrees of freedom

Number of iterations to convergence: 3 
Achieved convergence tolerance: 5.91e-06 

Having estimated the relationship between Wt and Conc using the calibration data, we can now estimate the concentration for the new sample with weights 309, 296, 419. We can get an approximate answer graphically using locator(); click at the point where the


Exhibit 16.11 Estimating x0 graphically in the nasturtium bioassay


> plot(logC,Wt,xlab="log Concentration",ylab="Weight") > lines((-50:20)/10,predict(Bioassay.nls,list(logC=(-50:20)/10))) > abline(h=mean(309,296,419))


> locator()
$x
[1] 0.9290798

$y
[1] 309.0842

horizontal line meets the curve (see Exhibit 16.11), then right-click to finish. This gives the x and y coordinates of the point clicked on. The estimated log C0 is now exponentiated to get C0 = e^0.929 = 2.53. To get a more accurate answer we invert Equation 16.5 algebraically and then substitute the mean weight for the three samples:

> t = coef(Bioassay.nls)
> Y0 = mean(309, 296, 419)
> C0 = exp((log(t[1]/Y0 - 1) - t[2])/t[3])
> C0
  t1 
2.54 


16.4

Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 16.1: One way of generalizing the Michaelis-Menten formula is Y = θ1 x/(θ2 + x) + θ3. Does the extra parameter give a significantly better fit to the Puromycin data?

Exercise 16.2: The model Y = / x is an alternative parametrization of one of the three growth models in Equations 16.2-16.4. Identify which one, and fit this alternative form to the Kiwi data. Which parametrization is better, and why? In the DRUGS package, this data set is called Kiwi and can be obtained using

> data(Kiwi, package = "DRUGS")

Exercise 16.3: Exhibit 16.12 from Huet et al. (1996) gives the results of a radioimmunological assay (RIA) for the hormone cortisol. For known dilutions of the hormone, a fixed amount of radioactively-labelled hormone is introduced along with a fixed amount of antibody. Higher concentrations of the unlabelled hormone give lower amounts of radioactivity, measured in counts per minute (cpm). The results for known concentrations are used to establish a calibration curve from which the unknown doses in further samples can be estimated. A common model for immunoassay data is the four-parameter log-logistic (4PLL) curve:

Y = A + (B − A)/(1 + (x/C)^D)     (16.10)

Read in the data from the file Cortisol.csv, change the zero concentrations to 0.01, and plot the responses against the log-dose. Fit the 4PLL model, and compare its fit with that of the model in which D = 1. In the DRUGS package, this data set is called Cortisol and can be obtained using
> data(Cortisol, package = "DRUGS")

Exercise 16.4: For fitting immunoassay data, some researchers use the five-parameter log-logistic (5PLL) curve:

Y = A + (B − A)/[1 + (x/C)^D]^E     (16.11)

Does the extra parameter lead to a better fit for the Cortisol data? Choose the best model to estimate the unknown concentration for which the response (to a triplicate split-sample) is 1102, 1056, 994. Use a residual plot to check the adequacy of your chosen model.


Exhibit 16.12 Cortisol data. Immunoassay response to various concentrations of the hormone.
Dose (ng/.1 ml)          Response (cpm)
      0         2868   2785   2849   2805
      0         2779   2588   2701   2752
      0.02      2615   2651   2506   2498
      0.04      2474   2573   2378   2494
      0.06      2152   2307   2101   2216
      0.08      2114   2052   2016   2030
      0.1       1862   1935   1800   1871
      0.2       1364   1412   1377   1304
      0.4        910    919    855    875
      0.6        702    701    689    696
      0.8        586    596    561    562
      1          501    495    478    493
      1.5        392    358    399    394
      2          330    351    343    333
      4          250    261    244    242
      100        131    135    134    133

Chapter 17 An Introduction to Mixed Effects Models: Rat Pups and Tern Chicks
An original chapter
written by

Geoff Jones1

17.1

Introduction

An experiment analysed by Dempster et al. (1984) investigated the effect of different doses of an experimental compound on the reproductive performance of rats, and in particular on the weights of the individual rat pups in a litter. Thirty rat mothers were randomized into three treatment groups: control, low dose and high dose. In the high-dose group one female failed to conceive, one ate her pups and one had a still-birth, so these rats are excluded from the analysis of pup weights. The data are shown in Exhibit 17.1. Note that there are two sources of uncontrolled (random) variation: mothers and pups.

Jones et al. (2005) analysed data from an observational study on the growth of tern chicks. Here we consider only growth in the length of one wing (in mm). Wing length can only be measured after the primary feathers have emerged, typically after 3-5 days. The weights of the chicks were also recorded but they will not be used here. Many of the chicks were observed on only a small number of occasions, so for this study we will include only the chicks that were measured at least five times. Their wing lengths and ages (in days) are plotted in Exhibit 17.2. Each bird is identified by a unique band number which appears on each subplot in the figure. Again there are two levels of variation: differences between birds and random errors affecting each individual wing length measurement. At
1

Geoff is an Associate Professor in the Institute of Fundamental Sciences.


Exhibit 17.1 Pups data. Weights of rat pups exposed to different doses of a compound.

Mother   Sex (0=male)   Rat   Dose   Weight (g)
   1          0          1      0       6.60
   1          0          2      0       7.40
   1          0          3      0       7.15
   1          0          4      0       7.24
   1          0          5      0       7.10
   1          0          6      0       6.04
   .          .          .      .         .
  27          0          4      2       6.29
  27          0          5      2       5.69
  27          0          6      2       6.36
  27          1          1      2       5.39
  27          1          2      2       5.74
  27          1          3      2       5.74

the time of data collection, some birds were identified visually as 'slow'. We want to describe variations in wing length between birds, and investigate whether the slow birds form a different group with respect to wing growth.

17.2

Models with mixed eects

The standard assumption made when analysing data is that the observations (the rows in a dataset) are iid, meaning that they are independent and identically distributed. A common situation where these assumptions are not met is when the data are grouped into clusters, so that observations in the same group are more similar than observations in different groups. For example cows in the same herd will be more alike in some respects than cows in different herds, because of influences (unmeasured covariates) operating at the herd level. Sometimes repeated measures are made on a number of individuals, for example the milk yields of cows measured weekly for a month. Here the groups or clusters are the cows, and the observations are the milk yield measurements.

Both of the above examples have two levels of variation: group level and individual level. It is possible to have more than two levels in a hierarchical model. If we repeatedly measure the milk yields of cows on different farms, then we have three levels of variation: farm-cow-measurement. In such situations the usual linear model

Yi = xiᵀβ + εi,   εi ~ Normal(0, σ²)     (17.1)

relating the response Yi of observation i to its vector xi of covariates is not appropriate


Exhibit 17.2 Individual plots of tern chicks comparing their wing lengths and their ages.

[A lattice of 50 panels, one per banded chick, each plotting Wing (mm) against Age (days).]



because the errors εi will be correlated for observations in the same group. This means in particular that standard errors calculated from this model, assuming iid errors, will be wrong. In order to make valid inferences about covariate effects, we need to account for the clustering or group structure in the data.

17.2.1

Random intercept model

A simple way to do this is to introduce an extra component into the model that will be common to all the observations in a particular group:

Yij = xijᵀβ + ui + εij     (17.2)

where Yij denotes the jth observation in group i. The effect of the term ui is to give a different intercept (the constant term in the model) for each group. However there are typically a large number of groups, so if the ui are fitted as fixed parameters, the model will have a very large number of parameters, most of which (the group effects ui) will not be of particular interest. To draw inferences about β it is better to regard the ui as random effects, so that the model has both fixed and random effects (hence a mixed model). To complete the model specification we give the variance structure

ui ~ Normal(0, σu²),   εij ~ Normal(0, σ²)     (17.3)

where σu² is an extra parameter in the model representing the variability at group level. (Note that the individual ui are not regarded as parameters and do not appear in the likelihood function). If σu² is small compared to σ², it means that most of the variation is at individual, not group, level: in other words, there is not much difference between the groups. The ratio σu²/(σu² + σ²) is sometimes called the intraclass correlation: it represents the proportion of the total variation that is due to differences between the groups.

In addition to estimates of the parameters (β, σu², σ²), the fitted model can give predictions of the random effects, say ûi. There are now two different fitted values for the ijth observation: one at the population level (xijᵀβ̂, just using the fixed effects) and one at the individual level (xijᵀβ̂ + ûi, incorporating the random group effect). Similarly there are two kinds of residual: individual-level based on εij, and population-level based on rij = ui + εij, representing respectively the difference between this observation and a typical member of its group, and the difference between this observation and a typical member of the population (after adjusting for any fixed covariate effects).

If there are three levels, eg farm-cow-measurement, then there are three components of variance: σ²(farm), σ²(cow), σ², and all of these must be estimated in fitting the model. There are two slightly different methods for estimating the variance components: maximum likelihood (ML) and restricted maximum likelihood (REML). The latter adjusts for the degrees of freedom and gives unbiased estimates of the variance components, whereas ML should be used for likelihood ratio tests of nested fixed effects.
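As a small numerical illustration (with made-up values of a similar size to those in the rat pup analysis of Section 17.3.1), the intraclass correlation is computed directly from the two estimated standard deviations:

> sigma.u <- 0.57   # hypothetical between-group standard deviation
> sigma.e <- 0.41   # hypothetical residual (within-group) standard deviation
> sigma.u^2/(sigma.u^2 + sigma.e^2)   # intraclass correlation, about 0.66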

17.2.2

Random coecient model

It is possible that not just the intercept, but also the effects of the covariates, vary significantly between groups. This is most easily explained in the case when there is a single covariate x in addition to the intercept, so that the random intercept model would be

Yij = β0 + β1 xij + ui + εij,   εij ~ Normal(0, σ²),   ui ~ Normal(0, σu²)     (17.4)

or equivalently

Yij = b0i + β1 xij + εij,   b0i ~ Normal(β0, σu²),   εij ~ Normal(0, σ²)     (17.5)

where b0i represents the random intercept with mean β0. We can extend this by making the coefficient of x random as well, ie allowing it to vary between groups:

Yij = b0i + b1i xij + εij     (17.6)

where b1i ~ Normal(β1, σu1²) is the random slope. Here σu1² represents the variability between groups in the effect of x. We should now consider the possibility that b0i and b1i are correlated, eg it might be that when a group has a higher-than-average intercept, it has a higher-than-average slope. To allow for this, assume that the bivariate vector (b0i, b1i) is normally distributed with mean (β0, β1) and covariance matrix

G = [ σu0²   σu01 ]
    [ σu01   σu1² ]     (17.7)

The off-diagonal element σu01 represents the covariance between b0i and b1i, so the correlation between them is σu01/(σu0 σu1). These components of G need to be estimated, along with the other parameters β0, β1 and σ², when the model is fitted.

The above is perhaps easiest to understand in the case of repeated measures on a number of individuals when x is time, so that b0i is the initial value for individual i and b1i its rate of growth (assumed to be linear). The random intercepts model fits different lines for each individual but makes them all parallel; the random coefficient model allows for differences in the slopes as well. Note that equation 17.6 could also be written, like equation 17.4, as a fixed part and a random part thus:

Yij = β0 + β1 xij + u0i + u1i xij + εij     (17.8)

where the vector of random effects (u0i, u1i) is normally distributed with zero mean and covariance matrix G.
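In the lme() function used in Section 17.3 these two models correspond to different random formulae. The sketch below is schematic only, with y, x and group standing for the response, covariate and grouping factor in some hypothetical data frame dat:

> library(nlme)
> m1 <- lme(fixed = y ~ x, random = ~1 | group, data = dat)      # Equations 17.2 and 17.4
> m2 <- lme(fixed = y ~ x, random = ~1 + x | group, data = dat)  # Equations 17.6 and 17.8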


17.3

Analysis using R

Before fitting any mixed models in R, you need to load the nlme package:
> library("nlme")

and will need to do so when starting each new R session.

17.3.1

Weights of rat pups

Once the Ratpups.csv data file has been placed in your working directory, it can be imported and investigated using the following R commands.
> Pups = read.csv("Ratpups.csv")

> head(Pups)
  dam sex rat dose weight
1   1   0   1    0   6.60
2   1   0   2    0   7.40
3   1   0   3    0   7.15
4   1   0   4    0   7.24
5   1   0   5    0   7.10
6   1   0   6    0   6.04

The variables dam, rat and dose should be factors, not numeric variables:
> Pups$dam <- factor(Pups$dam) > Pups$rat <- factor(Pups$rat) > Pups$dose <- factor(Pups$dose)

The litter sizes range from 2 to 18:


> tabulate(Pups$dam)
 [1] 12 14  4 14 13  9  9  9 18 17 17 13 15  2 12 15 13 13 14 15 10 16 14
[24] 12  8 10  3

Note that the tabulate() command is a short cut for the tapply() command used in other chapters. The data have a two-level structure, with dam as a grouping variable. To fit the random intercept model, with dam as a random effect, we use the lme() function:
> Model.1 <- lme(fixed = weight ~ dose + sex, random = ~1 |
+     dam, data = Pups)
> summary(Model.1)
Linear mixed-effects model fit by REML
 Data: Pups 
    AIC   BIC logLik
  432.1 454.7 -210.1

Random effects:
 Formula: ~1 | dam
        (Intercept) Residual
StdDev:      0.5717   0.4055

Fixed effects: weight ~ dose + sex 
             Value Std.Error  DF t-value p-value
(Intercept)  6.608   0.18589 293   35.55  0.0000
dose1       -0.374   0.26210  24   -1.43  0.1667
dose2       -0.363   0.28971  24   -1.25  0.2223
sex         -0.365   0.04806 293   -7.60  0.0000
 Correlation: 
      (Intr) dose1  dose2 
dose1 -0.698              
dose2 -0.633  0.450       
sex   -0.107 -0.025 -0.017

Standardized Within-Group Residuals:
    Min      Q1     Med      Q3     Max 
-7.4395 -0.4546  0.0169  0.5419  3.1272 

Number of Observations: 321
Number of Groups: 27 

This suggests that females are significantly lighter than males (recall that 0 = male), but the effect of the experimental compound is not significant. Note the differences in the standard errors, and degrees of freedom, for dose as compared to sex. The dose is applied at the group level (dams) whereas most litters have both male and female pups. In the analysis of variance approach using aov() described in an earlier chapter, we would regard this experiment as a split-plot design and add + Error(dam) to the model formula. Note that R automatically calculates the appropriate standard error for each effect, adjusted for the two sources of variation. The estimated standard deviation (σu) of the between-litter component is 0.57, which is larger than the estimated residual (within-litter) component 0.41; the intraclass correlation will be greater than 0.5, indicating a large amount of variation between dams.

However the above analysis fails to take account of an important factor affecting birth weight. We now add to the data.frame an extra variable for the litter size, and add this to the model:
> size <- tabulate(Pups$dam)
> Pups$size <- rep(size, size)
> Model.2 <- lme(fixed = weight ~ dose + sex + size, random = ~1 |
+     dam, data = Pups)
> summary(Model.2)

Linear mixed-effects model fit by REML
 Data: Pups 
    AIC   BIC logLik
  411.6 437.9 -198.8

Random effects:
 Formula: ~1 | dam
        (Intercept) Residual
StdDev:       0.314   0.4045

Fixed effects: weight ~ dose + sex + size 
             Value Std.Error  DF t-value p-value
(Intercept)  8.324   0.27706 293  30.045  0.0000
dose1       -0.442   0.15136  23  -2.918  0.0077
dose2       -0.871   0.18305  23  -4.756  0.0001
sex         -0.363   0.04774 293  -7.597  0.0000
size        -0.130   0.01905  23  -6.823  0.0000
 Correlation: 
      (Intr) dose1  dose2  sex   
dose1 -0.311                     
dose2 -0.582  0.428              
sex   -0.110 -0.038 -0.009       
size  -0.921  0.049  0.393  0.043

Standardized Within-Group Residuals:
     Min       Q1      Med       Q3      Max 
-7.55091 -0.46470  0.01511  0.57252  3.04811 

Number of Observations: 321
Number of Groups: 27 

Now the effect of dose is significant. Since the reduction in weight for dose2 is very close to twice that of dose1, we might be tempted to treat dose as a numerical variable; however we are not told that dose2 was actually twice the dose1 amount of the substance, so we do not pursue this. Note that the estimate for σu has been reduced to 0.31: some of the between-litter variation is explained by the litter sizes. Note too that because litter size is, obviously, a litter-level variable, it reduces the degrees of freedom at this level from 24 to 23.

If we want to compare the above two models using anova(), it produces a warning message (try it!). This is because the default method is REML, and this is not suitable for comparing models with different fixed effects. Changing to maximum likelihood estimation:
> Model.1a <- lme(fixed = weight ~ dose + sex, random = ~1 |
+     dam, data = Pups, method = "ML")
> Model.2a <- lme(fixed = weight ~ dose + sex + size, random = ~1 |
+     dam, data = Pups, method = "ML")
> anova(Model.1a, Model.2a)
         Model df   AIC   BIC logLik   Test L.Ratio p-value
Model.1a     1  6 423.4 446.1 -205.7                       
Model.2a     2  7 393.5 419.9 -189.7 1 vs 2   31.95  <.0001

confirms that the model with litter size is preferred. Because there are two levels of residuals, there are diagnostics at two levels. We can extract the residuals, random effects and fitted values as follows:
> resid <- Model.2$res
> head(resid)
    fixed     dam
1 -0.1646 -0.3414
2  0.6354  0.4586
3  0.3854  0.2086
4  0.4754  0.2986
5  0.3354  0.1586
6 -0.7246 -0.9014
> RE <- ranef(Model.2)
> head(RE)
  (Intercept)
1     0.17685
2    -0.07037
3    -0.18937
4    -0.05360
5     0.34607
6    -0.05743
> fits <- Model.2$fitted
> head(fits)
  fixed   dam
1 6.765 6.941
2 6.765 6.941
3 6.765 6.941
4 6.765 6.941
5 6.765 6.941
6 6.765 6.941

Note that the estimated random effect (ûi) for dam 1 is 0.177, and this gives the difference between the population-level and group-level residuals for the pups in her litter (the two columns of resid), because rij = ui + εij. The two columns of fitted values differ by the same amount, being respectively xijᵀβ̂ and xijᵀβ̂ + ûi.

Exhibit 17.3 plots the within-group residuals against the within-group fitted values, and the predicted random effects for each dam. There seems to be a suggestion of increasing variance, and some large negative outliers (maybe the runts of the litters) in the first plot. Finally, we can use our model to predict for particular covariate values, eg for a female rat in a litter of 8 from dam 15 given dose 2:
> newdat <- data.frame(dam = "15", sex = 1, dose = "2", size = 8) > predict(Model.2, newdata = newdat, level = 0)[[1]] [1] 6.051 > predict(Model.2, newdata = newdat, level = 1)[[1]] [1] 6.096 > predict(Model.2, newdata = newdat, level = 0:1) dam val 15 6.051 15 6.096

fixed dam


Exhibit 17.3 Diagnostic plots for Pups data.


> plot(Model.2)

> plot(ranef(Model.2))

[Left panel: standardized within-group residuals against fitted values. Right panel: predicted random intercepts for each dam.]

17.3.2

Growth of tern chicks

Once the Terns.csv data file has been placed in your working directory, it can be imported and investigated using the following R commands.
> Terns = read.csv("Terns.csv")

> head(Terns)
  id obs Age Wing slow
1  1   1   5   28    0
2  1   2   6   35    0
3  1   3   7   36    0
4  1   4   9   55    0
5  1   5  10   61    0
6  1   6  12   76    0

The group variable is id and this needs to be a factor:


> Terns$id = as.factor(Terns$id)

The plot in Exhibit 17.2 can be produced using the lattice package:
> library(lattice) > xyplot(Wing ~ Age | id, data = Terns)

First we fit the random intercept model:


> Model.1 <- lme(fixed = Wing ~ Age, random = ~1 | id, data = Terns)
> summary(Model.1)
Linear mixed-effects model fit by REML
 Data: Terns 
   AIC  BIC logLik
  2956 2972  -1474

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev:       19.94    7.032

Fixed effects: Wing ~ Age 
             Value Std.Error  DF t-value p-value
(Intercept) -17.13    3.1097 356   -5.51       0
Age           6.21    0.0657 356   94.47       0
 Correlation: 
    (Intr)
Age -0.405

Standardized Within-Group Residuals:
     Min       Q1      Med       Q3      Max 
-2.99056 -0.55031  0.09013  0.59096  3.43931 

Number of Observations: 407
Number of Groups: 50 

The average intercept is -17.1, which seems strange for a wing length, but remember that the wing length cannot be measured until at least day 3, so this parameter β0, the average length at age zero, does not have a sensible interpretation. The Age coefficient β1 represents the growth rate, 6.2 mm/day, assumed to be the same for all birds. So the modelled growth curves are all parallel, but there is considerable variation between birds, as the standard deviation σu of the bird-specific intercepts is about 20 mm. To investigate whether the rate of growth (ie the slope of the growth curve) varies significantly between birds, we next fit the random coefficient model:
> Model.2 <- lme(fixed = Wing ~ Age, random = ~1 + Age | id,
+     data = Terns)
> summary(Model.2)
Linear mixed-effects model fit by REML
 Data: Terns 
   AIC  BIC logLik
  2468 2492  -1228

Random effects:
 Formula: ~1 + Age | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev Corr  
(Intercept)  15.00 (Intr)
Age           1.22 -0.499
Residual      2.94       

Fixed effects: Wing ~ Age 
              Value Std.Error  DF t-value p-value
(Intercept) -16.764    2.2394 356   -7.49       0
Age           6.273    0.1764 356   35.56       0
 Correlation: 
    (Intr)
Age -0.525

Standardized Within-Group Residuals:
     Min       Q1      Med       Q3      Max 
-3.91815 -0.46664  0.09434  0.53983  3.28713 

Number of Observations: 407
Number of Groups: 50 

This gives a similar result for the fixed effects, but also gives an estimate of the standard deviation σu1 of the slope. This model implies that the average growth rate is 6.3 mm/day, but it varies between birds with a standard deviation of 1.2 mm/day. To test whether we need this more complicated model:
> anova(Model.1, Model.2)
        Model df  AIC  BIC logLik   Test L.Ratio p-value
Model.1     1  4 2956 2972  -1474                       
Model.2     2  6 2468 2492  -1228 1 vs 2   492.4  <.0001

suggesting that the random coefficient model gives a significant improvement. Note that we have not changed from REML to ML, but R has not issued a warning message. Here we are comparing models with the same fixed effects, so R is happy to use REML. However some people think this test is not entirely appropriate, and that the p-value should be divided by 2 (because we're testing whether σu1 = 0 and it can't be negative). In our example the p-value is so small that we don't need to worry about this controversy. Note that in making the slopes random we have added two parameters: not just the variability in the slope σu1 but also the covariance between the random slope and random intercept σu01. To extract the estimated G matrix of equation 17.7:
> getVarCov(Model.2)
Random effects variance covariance matrix
            (Intercept)    Age
(Intercept)     225.090 -9.125
Age              -9.125  1.488
  Standard Deviations: 15 1.22 

This should not be confused with the covariance matrix of the estimates of the fixed effects:
> vcov(Model.2)
            (Intercept)      Age
(Intercept)      5.0149 -0.20758
Age             -0.2076  0.03112


The former is describing the population of tern chicks, specifically how their growth patterns vary from one chick to another; the latter is describing the uncertainty in our estimates of the average growth curve. Now we turn to consideration of the differences between the slow chicks and the normal chicks. Since we are now going to compare different fixed-effect structures, we switch to ML estimation and compare a sequence of nested models:
> Model.2a <- lme(fixed = Wing ~ Age, random = ~1 + Age | id,
+     data = Terns, method = "ML")
> Model.3a <- update(Model.2a, fixed = Wing ~ Age + slow)
> Model.4a <- update(Model.3a, fixed = Wing ~ Age * slow)
> anova(Model.2a, Model.3a, Model.4a)
         Model df  AIC  BIC logLik   Test L.Ratio p-value
Model.2a     1  6 2469 2493  -1229                       
Model.3a     2  7 2467 2495  -1227 1 vs 2    4.20  0.0403
Model.4a     3  8 2423 2455  -1203 2 vs 3   46.28  <.0001

This suggests that we adopt the most complex model, in which the slow chicks have a different intercept and a different slope. Now re-fit the model using REML:
> Model.4 <- lme(fixed = Wing ~ Age * slow, random = ~1 + + Age | id, data = Terns) > summary(Model.4) Linear mixed-effects model fit by REML Data: Terns AIC BIC logLik 2418 2450 -1201 Random effects: Formula: ~1 + Age | id Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr (Intercept) 14.8223 (Intr) Age 0.7527 -0.497 Residual 2.9355 Fixed effects: Wing ~ Age * slow Value Std.Error DF t-value p-value (Intercept) -18.971 2.500 355 -7.59 0.0000 Age 6.785 0.127 355 53.37 0.0000 slow 9.830 5.370 48 1.83 0.0734 Age:slow -2.318 0.272 355 -8.52 0.0000 Correlation: (Intr) Age slow Age -0.542 slow -0.466 0.253 Age:slow 0.253 -0.467 -0.546 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -3.98574 -0.45664 0.08044 0.54799 3.28061 Number of Observations: 407 Number of Groups: 50


Exhibit 17.4 Diagnostic plots for Terns data.


> plot(Model.4)

> plot(ranef(Model.4))

[Left panel: standardized within-group residuals against fitted values for Model.4. Right panel: predicted random effects (intercept and Age slope) for each bird.]

Interestingly the slow birds have a higher intercept than normal birds, by about 10 mm, but remember what this parameter means! The slow birds have a lower growth rate, averaging 4.5 mm/day as compared to 6.8 mm/day for normal birds. The variability in growth rate between birds has now dropped to σu1 = 0.75. Why? Exhibit 17.4 plots some diagnostics for this model. Note that there are now two columns of random effects, u0 for intercepts and u1 for slopes. Exhibit 17.5 shows the fitted values for a particular chick at population and individual level, together with a least-squares fit just using the data for that chick. The least-squares fit is almost identical to the individual-level fit from the mixed model; there is some very slight shrinkage of the individual-level model towards the population average.
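The quoted growth rate for the slow chicks is just the sum of the Age coefficient and the Age:slow interaction in Model.4; a quick check (not part of the original output):

> sum(fixef(Model.4)[c("Age", "Age:slow")])   # roughly 6.79 - 2.32 = 4.5 mm/day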

17.4 Exercises

Exercise 17.1: Dohoo et al. (2001) analyzed data on the reproductive performance of cattle on Reunion Island. The data are in the data file Dairy.csv. The variables are:

Exhibit 17.5 Fitted models for one tern chick.
> i <- 16
> plot(Terns$Age[Terns$id == i], Terns$Wing[Terns$id == i],
+     xlab = "Age", ylab = "Wing length")
> fits <- predict(Model.4, level = 0:1)
> lines(Terns$Age[Terns$id == i], fits[Terns$id == i, 2])
> lines(Terns$Age[Terns$id == i], fits[Terns$id == i, 3],
+     lty = 2)
> lines(Terns$Age[Terns$id == i], lm(Wing ~ Age,
+     data = Terns[Terns$id == i, ])$fit, lty = 3)
> legend("bottomright", c("Population-level", "Individual-level",
+     "Least squares fit"), lty = 1:3)


[Plot of Wing length against Age for chick 16, with the population-level, individual-level and least-squares fitted lines overlaid.]

region   geographic region
herd     herd number
cow      unique cow number
obs      unique observation number
lact     lactation number
cfs      calving to first service interval
lncfs    log(cfs)
fscr     first service conception (1=success)
heifer   age category
ai       type of insemination at first service

Investigate the possible relationship between lncfs (the log-transformed time between calving and the attempt to inseminate the cow again) and the variables lact and ai. Consider three possible error structures: two-level (1|cow), three-level (1|herd/cow) and four-level (1|region/herd/cow).

Exercise 17.2: Repeat the analyses of Section 17.3.2 with Wingage = Age - 3 instead of Age. This makes the intercept more meaningful (Why?). What is the relationship between the initial wing length of a chick and its growth rate? Can we assume that the slow chicks start with the same initial wing length as the normal chicks, but grow more slowly?

Chapter 18 An Introduction to Sample Size Determination: Diagnostic Tests, Blood Pressure, Carcinogenicity and Tomatoes Revisited
An original chapter
written by

Geoff Jones1

18.1 Introduction

The accuracy of a diagnostic test for a disease is measured by its sensitivity (the probability of a correct diagnosis on a diseased subject) and its specificity (the probability of a correct diagnosis on a subject free of the disease). These parameters are estimated by applying the test to subjects whose true disease status is known. Often there are prior expectations of test performance based on scientific considerations. Suppose that it is expected that the sensitivity of a newly developed test for Johne's disease in cattle will be around 70%, and its specificity around 90%. How many diseased and disease-free cows are required to estimate the sensitivity and specificity to within 5%, with 95% confidence?

Lenth (2001) gives an example of designing a clinical trial to test the efficacy of a drug for reducing blood pressure. Treatment using the new drug is to be compared with a control using a two-sample pooled t-test at the 5% significance level, where the response is systolic blood pressure (SBP), measured using a standard sphygmomanometer, after one week of treatment.
1 Geoff is an Associate Professor in the Institute of Fundamental Sciences.


Patients will be randomized into two groups, one receiving the drug and the other a placebo. Since the treatment is supposed to reduce blood pressure, we set up a one-sided test of H0: μT = μC versus H1: μT < μC, where μT is the mean SBP for the treatment group and μC is the mean SBP for the control group. The parameter of interest here is δ = μT − μC, the effect size of the drug; so we could write H0: δ = 0 and H1: δ < 0. We want to be able to detect a reduction in SBP of the order of 15 mm Hg with a probability of 80%; i.e. we want the power to be 0.8 for an effect size of δ = 15. Past experience with similar experiments (with similar sphygmomanometers and similar subjects) suggests that the data will be approximately normally distributed with a standard deviation of σ = 20 mm Hg. How many patients should we recruit into the trial?

Dell et al. (2002) point out that to obtain ethical consent for experiments on animals, sample size calculations must be presented to show that the number of animals used is justified by the aims of the experiment. They describe the setting up of an experiment to investigate whether a particular chemical increases the incidence of tumours in rats. Previous data suggest that the spontaneous incidence of tumours in old rats of a particular strain, over the planned study period, is around 20%, and the scientists want to be able to detect an increase to 50% with a probability of 0.8, testing at a significance level of α = 0.05. Here the hypotheses are H0: πT = πC versus H1: πT > πC, where πT is the probability that a rat in the treatment group develops a tumour during the study period and πC the same for a rat in the control group. How many rats should be specified in the research proposal?

In Chapter 2 an experiment on the growth of tomato tissue was used to compare four treatments (three types of sugar and a control) in a completely randomized design. One way ANOVA was used to test the null hypothesis of no treatment effect. In the notation of equation 2.1, H0: τ1 = τ2 = τ3 = τ4 = 0 versus H1: at least one τi ≠ 0. How was the number of replicates for each treatment chosen? Suppose that from previous experience it was expected that the within-treatment (residual) variance of tissue growth would be around σ² = 10. The effect size can be measured by the between-treatment variance σ²A. Suppose the experimenters wanted to be able to detect an effect size of σ²A = 10 with power 0.8, testing at α = 0.05.

18.2 Sample Size Determination

We assume in the following that, before beginning an experiment, the would-be experimenters are able to provide some information about the likely values of the parameters involved and about the level of precision required. In an experiment to estimate a parameter such as a proportion or mean, precision can be specified as the desired width of a confidence interval. For experiments designed to test a hypothesis, the experimenters need to be able to specify the significance level α for the hypothesis test to be carried out on the eventual data (or be content with a default 5% level), the effect size that they hope to be able to detect, and the power 1 − β for detecting it. Recall that the power is the probability that the test rejects the null hypothesis when the null hypothesis is false, and that it depends on the true effect size. (Note β is the probability of a Type II Error: accepting H0 when it is false). They will also be expected to have some prior expectation about the sizes of proportions, means or variances. In practice much work may need to be done to persuade the experimenters to part with this information; see Lenth (2001) for details.

18.2.1 Estimating a single parameter

The half-width of a 100(1 − α)% confidence interval for a proportion π is

    zα/2 √( π(1 − π)/n )    (18.1)

where zα/2 is the (1 − α/2) quantile of the standard normal distribution (e.g. z0.025 = 1.96 for 95% confidence). If we require the half-width to be δ (or smaller), a little algebra gives

    n ≥ (zα/2)² π(1 − π) / δ²    (18.2)

For example, to be 95% confident of being within 5% of the true value we use zα/2 = 1.96 and δ = 0.05. If π is expected to be around 70%, equation 18.2 gives n ≥ 1.96² × 0.7 × 0.3 / 0.05² = 322.7, so 323 subjects are required. If the experimenter is unable to give a trial value for π, we can adopt a worst-case scenario approach by putting π = 0.5 in 18.2.

Similarly, for estimating a mean μ the half-width is tn−1,α/2 σ/√n, where tn−1,α/2 is the (1 − α/2) quantile of the standard t-distribution with n − 1 degrees of freedom. Equating this to a desired precision δ gives

    n ≥ (tn−1,α/2)² σ² / δ²    (18.3)

This is a little awkward to solve since n appears in the right-hand side. A simple approximate answer can be obtained by replacing tn−1,α/2 by zα/2, or (in the case of 95% confidence) by 2.

18.2.2 Two-arm trial with numerical outcome

The usual statistic for testing H0: μ1 = μ2 is

    t = (x̄1 − x̄2) / ( s √(1/n1 + 1/n2) )    (18.4)

where x̄1, x̄2 are the sample means for each group and s is the pooled estimate of the (assumed) common standard deviation σ. It is easy to show that, for a fixed total number of subjects n1 + n2, the denominator is minimized by taking n1 = n2, so the test is most sensitive to differences in the means when we have the same number of subjects n in each group. We will assume this and write

    t = (x̄1 − x̄2) / ( s √(2/n) )    (18.5)

Define the effect size δ = μ1 − μ2. If H0 is true then δ = 0 and t in Equation 18.5 follows a standard t-distribution with 2n − 2 degrees of freedom. This is used to get the critical value(s) tc for a given α. (We're also assuming the data are approximately normal or n is large). If H0 is really false, then the test statistic follows a ghastly thing called the noncentral t-distribution, with noncentrality parameter δ/(σ√(2/n)), which is basically the effect size divided by the standard error of the difference in means. This non-central t-distribution will then tell us the probability of getting a result more extreme than tc, i.e. the power of the test. Knowing δ and σ, we can adjust n to get the required power. It's so simple! Perhaps Exhibit 18.1 might help. If your experimenters don't want to commit themselves to values of δ and σ, all they need is to speculate on the value of δ/σ. This is a standardized effect size: the difference in means divided by the standard deviation.
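The power calculation just described can be carried out directly with R's noncentral t functions. The short sketch below is not part of the original text; it simply illustrates the recipe using the blood-pressure numbers of Section 18.1 (δ = 15, σ = 20, one-sided α = 0.05) with an assumed n = 23 subjects per group. The power.t.test() function used in Section 18.3.2 packages up the same calculation.

n <- 23; delta <- 15; sigma <- 20; alpha <- 0.05
df  <- 2 * n - 2                      # degrees of freedom of the t statistic
ncp <- delta / (sigma * sqrt(2 / n))  # noncentrality parameter
tc  <- qt(1 - alpha, df)              # critical value for an upper-tailed test
1 - pt(tc, df, ncp = ncp)             # power: area beyond tc under the noncentral t

By symmetry it does not matter whether the one-sided alternative is framed as an increase or a decrease, provided δ is measured in the direction of the alternative; the result should be close to 0.8, matching the power.t.test() output shown later.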

18.2.3 Two-arm trial with binary outcome


The usual statistic for testing H0: π1 = π2 is

    Z = (p1 − p2) / √( p̄(1 − p̄)(1/n1 + 1/n2) )    (18.6)

where p1, p2 are the sample proportions and p̄ the pooled estimate of the overall proportion assuming H0. As before we can assume the most efficient allocation uses equal sample sizes n1 = n2 = n, as this makes the denominator as small as possible for a given total n1 + n2.

Define the effect size δ = π1 − π2. If H0 is true then δ = 0 and Z in Equation 18.6 follows a standard normal distribution (approximately, given at least five expected successes and failures in each group). This is used to get the critical value(s) Zc for a given α. If H0 is really false, and we suppose π1, π2 are known, then Z will still be approximately normal. However the difference p1 − p2 will now have a mean of π1 − π2 instead of zero, and a variance of π1(1 − π1)/n + π2(1 − π2)/n, compared with Equation 18.6 which uses an approximation to 2π̄(1 − π̄)/n, where π̄ = (π1 + π2)/2.


Exhibit 18.1 The solid density curve is the standard t-distribution, the dotted (red) curve the non-central t. The vertical line shows the critical value tc for an upper-tailed 5% level test. The area to the right under the solid curve is α = 0.05; the area under the dotted (red) curve is the power. By increasing n we move this second curve to the right, increasing the power.


Thus if the experimenter specifies π1, π2 we can use the implied normal distribution to work out the power, i.e. the probability of getting a Z value more extreme than Zc. The mean of Z divided by its standard deviation is

    (π1 − π2) / √( π1(1 − π1)/n + π2(1 − π2)/n ) = √n S    (18.7)

where S is the standardized effect size

    S = (π1 − π2) / √( π1(1 − π1) + π2(1 − π2) )    (18.8)

The situation is similar to that depicted in Exhibit 18.1, except that the curves are now both normal distributions. Increasing n increases √n S, shifting the H1 distribution further away and increasing the power for a given α. So n is determined by working out how far the curve has to shift to get the required power. If your experimenters are reluctant to commit themselves to values for π1 and π2, showing them the formula for S might encourage them.
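As an added aside (not part of the original text), the normal-approximation power described by equations 18.7 and 18.8 is easy to compute directly. The sketch below uses the carcinogenicity numbers of Section 18.1 (π1 = 0.2, π2 = 0.5, one-sided α = 0.05) with an assumed n = 31 per group. Because it ignores the pooled-variance adjustment made under H0, it is slightly more optimistic than the power.prop.test() calculation of Section 18.3.3, but it shows the mechanics.

pi1 <- 0.2; pi2 <- 0.5; n <- 31; alpha <- 0.05
S  <- (pi2 - pi1) / sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))  # standardized effect size (18.8)
zc <- qnorm(1 - alpha)                                       # one-sided critical value
pnorm(sqrt(n) * S - zc)                                      # approximate power for n per group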

18.2.4 Multiple treatments by one-way ANOVA

Assume that there are g groups and that all comparisons between groups are equally important, so that we will allocate the same number of subjects n to each group. (If all we want to do is to compare the treatment groups with a control group, it might be better to make the control group larger, but most people ignore this and so shall we). The test statistic for one-way ANOVA is

    F = MSA / MSE    (18.9)

where MSA is the between-group mean square and MSE the within-group mean square (see section 2.2). If the null hypothesis H0: τi = 0 for i = 1, ..., g is true then both MSA and MSE are estimates of the variance σ², and F in Equation 18.9 follows an F-distribution with ν1 and ν2 degrees of freedom, where ν1 = g − 1 and ν2 = g(n − 1) are the degrees of freedom for MSA and MSE respectively. This is used to get the critical value Fc for a given α. If H0 is really false, then the test statistic follows a ghastly thing called the noncentral F-distribution, with noncentrality parameter

    λ = n(g − 1) σ²A / σ²    (18.10)

where σ²A represents the variance of the true group means. This non-central F-distribution will then tell us the probability of getting a result more extreme than Fc, i.e. the power of the test. If the experimenters can specify the effect size by giving the ratio σ²A/σ², then n can be chosen to give the required power. Again, Exhibit 18.1 shows the basic idea, but now with F-distributions instead of t-distributions.
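Again the general recipe can be sketched directly in R (this block is an added illustration, not part of the original text). It uses the tomato numbers of Section 18.1 (g = 4, σ²A = σ² = 10) with an assumed n = 5 plants per group and the noncentrality parameter of equation 18.10. Be aware that different sources define the between-group variance with different divisors, so this need not reproduce the power.anova.test() answer of Section 18.3.4 exactly.

g <- 4; n <- 5; sigma2A <- 10; sigma2 <- 10; alpha <- 0.05
df1 <- g - 1                               # numerator degrees of freedom
df2 <- g * (n - 1)                         # denominator degrees of freedom
lambda <- n * (g - 1) * sigma2A / sigma2   # noncentrality parameter (18.10)
Fc <- qf(1 - alpha, df1, df2)              # critical value
1 - pf(Fc, df1, df2, ncp = lambda)         # power of the F-test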

18.2.5 Others

The above cases illustrate a general approach. We need a way of specifying the effect size when the null hypothesis is false, and we need to consider the distribution of the test statistic for a given sample size and effect size. For contingency table analysis the effect size could be the odds ratio, or log-odds ratio. For survival analysis, the hazard ratio or log-hazard ratio. For regression, the slope or R². The distribution of the test statistic given the specified effect size is always difficult. Specialized software is available to do the required calculations:

Russell Lenth has a website with Java applets for power and sample size calculations at http://www.cs.uiowa.edu/~rlenth/Power/ .

The UCLA online calculator is at http://calculators.stat.ucla.edu/ .

G*Power is free software that covers an extensive range of test situations. See: http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/ .

PS is another free program for performing power and sample size calculations. See: http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

18.3 Analysis using R

18.3.1 Estimating sensitivity and specificity

There does not appear to be an R function for sample size determination in estimation problems, but it is easy to write your own based on equation 18.2:
> nsize.p <- function(e, a = 0.05, p = 0.5) {
+     qnorm(1 - a/2)^2 * p * (1 - p)/e^2
+ }

Note that in this new function nsize.p() default values of α = 0.05 and π = 0.5 have been set. In the example regarding a test for Johne's disease we set δ = 0.05 and π = 0.7 (0.9) for the sensitivity (specificity).
> nsize.p(0.05,p=0.7)
[1] 322.7
> nsize.p(0.05,p=0.9)
[1] 138.3

Thus we require 323 diseased and 139 disease-free animals.
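(As an added illustration, not in the original text, the worst-case default π = 0.5 built into nsize.p() can be used when no trial value is available, and the default significance level can also be changed:)

nsize.p(0.05)            # worst-case pi = 0.5: about 384 subjects
nsize.p(0.05, a = 0.10)  # the same precision with only 90% confidence needs fewer subjects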

18.3.2 Blood pressure

In the blood pressure example the experimenters have specified that they want to detect a difference in means of δ = 15 with power 0.8, when the standard deviation of measurements is σ = 20. They will use a one-tailed test with α = 0.05. R gives
> power.t.test(delta=15,sd=20,power=0.8,type="two.sample",alt="one.sided")



     Two-sample t test power calculation

              n = 22.69
          delta = 15
             sd = 20
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

NOTE: n is number in *each* group

so they require a total of 46 subjects. If they think there might be a possibility that the drug actually increases blood pressure:
> power.t.test(delta=15,sd=20,power=0.8,type="two.sample",alt="two.sided")

     Two-sample t test power calculation

              n = 28.9
          delta = 15
             sd = 20
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

then 58 subjects are required.

18.3.3 Carcinogenicity trial

In the proposed carcinogenicity trial using rats, the baseline rate is expected to be π1 = 0.2, and experimenters want to detect an increase to π2 = 0.5 with power 0.8, using a one-tailed test with α = 0.05:
> power.prop.test(p1=0.2,p2=0.5,power=0.8,alt="one")

     Two-sample comparison of proportions power calculation

              n = 30.19
             p1 = 0.2
             p2 = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

NOTE: n is number in *each* group

This requires a total of 62 rats (although 60 might be close enough).


18.3.4 Tomato growth

There are four treatments, and the experimenters want to be able to detect a between-treatment variance of σ²A = 10 with power 0.8, testing at α = 0.05. The within-group variance is expected to be σ² = 10:
> power.anova.test(groups=4,between=10,within=10,power=0.8)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 4.734
    between.var = 10
     within.var = 10
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

This suggests using five plants per group, or 20 in total.

18.4 Exercises

Exercise 18.1: Write your own function in R to determine the sample size for estimating a mean, given the standard deviation σ, based on 18.3 but using zα/2 in place of tn−1,α/2. Use it to determine the sample size required to be within 0.2 of the true value with 95% confidence, when σ = 1. Investigate with the t-distribution how well this sample size approximates the required precision.

Exercise 18.2: Suppose that in the blood pressure drug trial, it was decided instead to measure each patient's SBP at the start of the trial and again after the two-week treatment period, and then to analyze the reduction in baseline SBP for each patient. Suppose that the experimenters expect the standard deviation of the reduction to be about 10 mm Hg. How does this change the sample size requirement? Check your answer using Russell Lenth's online calculator.

Exercise 18.3: In the rooks vs gorillas contest of Chapter 14, suppose we want to carry out an experiment to see whether gorillas are significantly worse at solving the problem than rooks. We expect that the success rate of rooks is about 95% and we want to detect with 80% power if the success rate of gorillas is as low as 50%, testing at the 5% level. How many rooks and gorillas are required? How many successes and failures would you expect in each group? Check with Lenth's online calculator. Also try with Fisher's Exact Test in G*Power. (Note that in G*Power you can change the ratio of rooks to gorillas; try experimenting with this).

Exercise 18.4: In a wheat variety trial, 12 different varieties are to be grown on 2m by 10m plots, and the yield at harvest in tonnes per hectare will be determined by drying


and weighing. It is expected that the yields of plots growing the same variety will vary with a standard deviation of about 150 t/ha. How many plots of each variety should be grown in order to detect a between-variety standard deviation of 100 t/ha with 80% power, testing at the 5% level? Check with Lenth's online calculator by choosing Balanced ANOVA and clicking on the F-test button.

Exercise 18.5: Therneau and Grambsch (2000) give the following formula for the number of events required for a clinical trial comparing the survival of patients on a new treatment and a control:

    d = (z1−α + zpower)² / ( p(1 − p) β² )    (18.11)

where p is the proportion of patients in the treatment group, z is the quantile of the standard normal distribution, and β is the effect size as measured by the log-hazard ratio. Suppose five-year survival under standard treatment is approximately 30% and we anticipate that a new treatment will increase this to about 45%. Use the proportional hazards model

    S1(t) = [S0(t)]^(e^β)    (18.12)

to show that the anticipated effect size is β = log 0.663. Assuming that we want to detect this effect size with 80% power using a two-sided 5% test with equal allocation to the two groups, how many deaths are required? Discuss briefly the difficulty in determining how many patients we should recruit into the study.

Chapter 19 Principal Component Analysis: Weather in New Zealand Towns


An original chapter

written by

Siva Ganesh1

19.1 Introduction

This chapter considers a data set containing some climate related information on a sample of 36 New Zealand towns. This NZclimate data (available in the file NZclimate.csv) contains the following information for each town, with a subset shown in Exhibit 19.1: Name of Town, Latitude (in °S), Longitude (in °E), Mean January temperature (in °C), Mean July temperature (in °C), Average rainfall (in mm), Average sunshine (in hours), Altitude (height above sea level in metres), Location (Coastal or Inland) and Island (North or South). We shall analyse this multi-dimensional data using principal component analysis, with a view to exploring the structure of the data and seeing how such a dimension reduction technique helps to visualise the data in lower (say, 2) dimensions.
1 Ganesh is a former colleague in the Statistics group of the Institute of Fundamental Sciences who has moved to AgResearch Ltd. Please contact the editor for any queries relating to this chapter.


Exhibit 19.1 NZclimate data. Some climate related information for 36 NZ towns.
Town          Lat   Long  JanTemp  JulTemp  Rain   Sun  Alt  Location  Island
Kaitaia      35.1  173.3     19.3     11.7  1418  2113   80   Coastal   North
Kerikeri     35.2  174.0     18.9     10.8  1682  2004   73   Coastal   North
Dargaville   36.0  173.8     18.6     10.7  1248  1956   20   Coastal   North
Whangarei    35.7  174.3     19.7     11.0  1600  1925   29   Coastal   North
Auckland     36.9  174.8     19.4     11.0  1185  2102   49   Coastal   North
Tauranga     37.7  176.2     18.5      9.3  1349  2277    4   Coastal   North
Hamilton     37.8  175.3     17.8      8.3  1201  2006   40    Inland   North
Rotorua      38.2  176.3     17.5      7.3  1439  1948  307    Inland   North
...           ...    ...      ...      ...   ...   ...  ...       ...     ...
Queenstown   45.0  168.7     15.8      3.7   805  1921  329    Inland   South
Alexandra    45.3  169.4     17.0      2.6   343  2064  141    Inland   South
Dunedin      45.9  170.5     15.1      6.4   784  1685    2   Coastal   South
Gore         46.1  168.9     15.1      4.7   836  1698   72    Inland   South
Invercargill 46.4  168.3     13.7      5.1  1037  1621    0   Coastal   South
Haast        43.9  169.0     14.7      7.5  3455  1853    4   Coastal   South

19.2 Principal Components Analysis

19.2.1 Dimensionality Reduction

Often important questions can be answered by an adequate description of data. A graphical display may give data the best visual impact and is usually more informative than a block of numbers. We may continue this reasoning (i.e. graphical representation of data to explore the structure), in theory, for any number of variables or dimensions, but physical limitations prevent practical realisation, and the graphical displays start to lose their advantage when too many variables are included in a single display. We therefore need a method for displaying multivariate (usually high-dimensional) data in a low-dimensional space; i.e. we need a low-dimensional approximation to high-dimensional data. The key to reduce dimensionality is to note that the data can be viewed in various ways. Suppose we have measured height and weight of a group of individuals, and the scatter plot looks as in Exhibit 19.2(a). This scatter plot shows that the variation may be best explained along two new dimensions than along the original dimensions, weight and height. These new dimensions or variables (i.e. Y1 and Y2 ) are shown in Exhibit 19.2(b). It is apparent that most of the variation in the given data is along the new Y1 direction which measures the size of individuals. The shape dimension (Y2 ) which is orthogonal (perpendicular) to Y1 , explains the variation unaccounted for by the size dimension. Note

Exhibit 19.2 Height vs Weight of individuals.


[Panels (a) and (b): scatter plots of Weight (X2) against Height (X1); panel (b) shows the same points with the new axes Y1 and Y2 overlaid.]

that Y1 and Y2 can be written as Y1 = a1 X1 + a2 X2 and Y2 = b1 X1 + b2 X2, where a1, a2, b1 and b2 are constants, computed appropriately. Now suppose our sample of heights and weights looks as in Exhibit 19.3. Then almost all the variation within the sample occurs along the Y1 dimension, while the Y2 values differ very little from zero. Thus, although we think we have 2-dimensional data (i.e. height and weight), we really only have 1-dimensional data (i.e. size), as practically all the information can be recovered by plotting the data on the Y1 axis alone. Consequently, if we can find the coefficients a1 and a2 such that Y1 = a1 X1 + a2 X2 gives the (size) direction indicated in the scatter plot, we can: (i) reduce the dimensionality of the data without losing much information; and (ii) try to interpret the resulting variable (or dimension) Y1 physically, thereby establishing the major source of variation among the individuals. The notion is that the low-dimensional representation will be adequate if it captures enough of the overall scatter of the data.


Exhibit 19.3 Height vs Weight of individuals.

[Scatter plot of Weight (X2) against Height (X1) with the axes Y1 and Y2 overlaid; the points lie almost entirely along the Y1 direction.]

The above principles can be applied to data with more than two variables, to find a low-dimensional representation of high-dimensional data. This procedure is commonly known as principal component analysis, or simply PCA. Principal component analysis aims at adapting a line or a plane to a cloud of points in a hyperspace, and deals with numerical variables all playing the same role in the analysis (Pearson, 1901).

19.2.2 Some Theory

PCA is one of the simplest of all multivariate statistical techniques available for data analysis. It is amongst the oldest and most widely used techniques, originally introduced by Pearson (1901) and later, independently, by Hotelling (1933). The objective is to describe the variation in a set of multivariate data in terms of a set of new, uncorrelated variables, each of which is defined to be a particular linear combination of the original variables. In other words, principal components analysis is a transformation from the observed (correlated) variables X1, X2, ..., Xp to new variables Y1, Y2, ..., Yp, where

    Y1 = a11 X1 + a12 X2 + · · · + a1p Xp
    Y2 = a21 X1 + a22 X2 + · · · + a2p Xp
    ...
    Yp = ap1 X1 + ap2 X2 + · · · + app Xp


The new variables, Yi, are called principal components (abbreviated to PCs). The lack of correlation among the PCs is a useful property, because it means that the new variables are measuring different dimensions in the data. The Ys are ordered so that Y1 displays most of the variation in the data, Y2 displays the second largest amount of variation (i.e. most of the variation unexplained by Y1), and so on. In other words, if V(Yi) denotes the variance of Yi in the data, then V(Y1) ≥ V(Y2) ≥ · · · ≥ V(Yp).

The usual objective of the analysis is to see if the first few components display, or account for, most of the variation in the data under investigation. If this is the case, then it can be argued that the effective dimensionality of the data is less than p. It must be pointed out that a PCA does not always work, in the sense that a large number of original variables is not always reduced to a small number of transformed variables. Note: if the original variables are uncorrelated, then PCA does absolutely nothing! The best results are obtained, i.e. the dimensionality is reduced, only when the original variables are highly (positively or negatively) correlated. However, in many situations with a large number of variables there is a great deal of redundancy in the original variables, with most of them measuring the same thing. Hence, it is worth deciding whether to include all variables measured, and whether any of the variables need to be transformed.

When only a few PCs are required to represent the dimensionality, it is hoped that these few PCs will be intuitively meaningful; will help us understand the data; and will be useful in subsequent analyses where we can work with a smaller number of variables. In practice it is not always possible to give labels to the PCs, and so the main use of the analysis lies in reducing the dimensionality of the data in order to simplify later analyses. For example, plotting the scores of the first two components may reveal clusters or groups of individuals. The principal component scores (or PC scores for short) are the values of the new variables obtained for each individual or unit in the data using the linear combinations

    Yi = ai1 X1 + ai2 X2 + · · · + aip Xp ,   i = 1, 2, . . . , p.    (19.1)

It is also worth noting that PCA is a technique that does not require the user to specify an underlying statistical model to explain the error structure. In particular, no assumption is made about the distribution of the original variables, though more meaning can generally be given to the PCs if the data are normally distributed. The derivation of the PCs is essentially the estimation of the coefficients ai1, ai2, ..., aip subject to some conditions.


Recall that we wish to find

    Y1 = a11 X1 + a12 X2 + · · · + a1p Xp

such that the variance, V(Y1), becomes as large as possible. This is achieved under the constraint that

    a11² + a12² + · · · + a1p² = 1.

This constraint is introduced because, if it were not imposed, V(Y1) could be increased simply by increasing any one of the a1j values.

The second PC, Y2 = a21 X1 + a22 X2 + · · · + a2p Xp, is obtained such that V(Y2) is as large as possible subject to the constraint that a21² + a22² + · · · + a2p² = 1 and also to the condition that Y1 and Y2 are uncorrelated, i.e. the condition a21 a11 + a22 a12 + · · · + a2p a1p = 0 is satisfied.

The third PC, Y3 = a31 X1 + a32 X2 + · · · + a3p Xp, is obtained such that V(Y3) is as large as possible subject to the constraint that a31² + a32² + · · · + a3p² = 1 and also to the condition that Y3 is uncorrelated with Y1 and Y2, i.e. a31 a11 + a32 a12 + · · · + a3p a1p = 0 and a31 a21 + a32 a22 + · · · + a3p a2p = 0. The rest of the PCs are computed in a similar manner until all p components are constructed.

Although computer software packages can be readily used to perform the analysis, it may be useful to understand how the coefficients of the PC equations are obtained. In fact, principal component analysis mainly involves finding the eigenvalues and the corresponding eigenvectors (i.e. an eigen analysis) of the sample covariance (or correlation) matrix. (Note that the covariance matrix is the matrix of the covariances and the variances of the original variables X1, ..., Xp, and is sometimes called the dispersion matrix.) An eigen analysis:

Define the sample (p × p) covariance matrix as S; then the corresponding eigenvalues λ are obtained by solving the equation |S − λI| = 0, where |A| denotes the determinant of matrix A while I denotes the identity matrix.

In other words, find (a p × 1 vector) ak = (ak1, ak2, ..., akp)T to maximize akT S ak (with T denoting transpose) under the conditions akT ak = 1 and akT aj = 0 for k ≠ j = 1, 2, ..., p. The process uses a Lagrange multiplier λk, i.e. maximize akT S ak − λk akT ak, or akT (S − λk I) ak, with respect to ak and λk.

Assuming that S is of full rank (or non-singular), the solution yields p positive eigenvalues, λ1, λ2, ..., λp (> 0), with the corresponding eigenvectors a1, a2, ..., ap. The eigenvalues are also the variances of the principal components, and there are p eigenvalues. Some of these eigenvalues may be zero, but none of them can be negative for a covariance (or correlation) matrix.


Let λi denote the ith eigenvalue, and let these values be ordered as λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Then λi corresponds to the ith PC,

    Yi = ai1 X1 + ai2 X2 + · · · + aip Xp

where the aij are the elements of the corresponding eigenvector, noting that V(Yi) = λi. Note also that, in finding eigenvectors, we are finding a rotation which aligns the axes with the directions of greatest variability in the data.

An important property of the eigenvalues is that they add up to the total variation of the original variables; i.e.

    V(Y1) + V(Y2) + · · · + V(Yp) = λ1 + λ2 + · · · + λp = V(X1) + V(X2) + · · · + V(Xp)    (19.2)

In other words, the sum of the variances of the original variables is equal to the sum of the variances of the PCs. This means that the PCs, together, account for all the variation in the original data. It is therefore convenient to make statements such as: the ith PC accounts for a proportion λi/(λ1 + · · · + λp) of the total variation in the data. We may also say that the first m PCs account for a proportion (λ1 + · · · + λm)/(λ1 + · · · + λp) of the total variation.

Another procedure known as singular value decomposition may also be used to perform principal component analysis, and the details are left as an optional exercise to explore!
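To connect this algebra with the R functions used in Section 19.3, the following added sketch (not part of the original text; X stands for any numeric data matrix, for example the climate.m matrix created later) checks that the variances of the PCs returned by prcomp() on standardised data are the eigenvalues of the correlation matrix, and that the rotation matrix contains the corresponding eigenvectors.

e   <- eigen(cor(X))       # eigen analysis of the correlation matrix
pca <- prcomp(X, scale = TRUE)

e$values                   # eigenvalues: the variances of the PCs
pca$sdev^2                 # the same values, up to rounding

e$vectors[, 1]             # first eigenvector ...
pca$rotation[, 1]          # ... equals the first PC's coefficients (possibly with the sign flipped)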

19.2.3 PCA on the Standardised Data

The variables in a multivariate data set may all be of the same type. For example, they might all be dimensions of an animal, or all percentages of different elements in a compound, or all marks (out of 100) in examinations, or all scores on a 5-point rating scale. Such variables are directly comparable, and often have similar variances. However, a more complicated situation arises when the variables are of different types. Some variables may be discrete and some continuous, while some may have much larger variances than others. Here the choice of scale is crucial. One way of proceeding is to standardise the data, so that each variable has zero mean and unit variance. Thus, techniques such as PCA will be applied to correlations rather than covariances. It should be noted that many multivariate techniques are sensitive to the type of data used, whether raw or standardised, hence care should be taken. Note that standardisation does not alter the shape of the distribution of a variable, and that the distributions of two standardised variables need not have the same shape even though their means are zero and standard deviations are one.

To illustrate the effect of standardisation on PCA, consider the scatter plots in Exhibit 19.4. The plot on the left shows the scatter of two variables X1 and X2 together with their principal component axes Y1 and Y2. The scatter plot on the right shows the same two variables, but with one of them, X1, multiplied by 3.

Exhibit 19.4 Demo: PCA on the scaled Data.


[Two scatter plots with principal axes Y1 and Y2 overlaid: X2 against X1 (left) and X2 against 3*X1 (right).]

(The variables have been centred, i.e. had the mean subtracted from each observation, so that the point being illustrated is easier to see in the graphs.) The PCA was carried out separately for each pair of variables. The effect of multiplying X1 can be seen in the change of direction of the principal axes; the first principal axis Y1 moves closer to the X1 axis (and the second principal axis Y2 moves closer to X2). As we increase the dilation in the X1 direction, the first principal axis tends to become parallel to the axis of variable X1. This behaviour is important, because it means the results of PCA depend upon the measurement units (centimetres, inches, feet, metres, pounds, kilograms etc.). The aim is to identify the structure (of variation) regardless of the particular choice of measurement scale, so we work with standardised data. Recall that standardisation refers to creating a new variable, say Z, such that Z = (X − X̄)/SX for each variable in the data. Here, X̄ and SX are, respectively, the mean and standard deviation of X.

Performing PCA on the standardised variables effectively means finding the PCs from the eigen analysis of the sample correlation matrix instead of the covariance matrix. In this case, the sum of the eigenvalues (i.e. the sum of the variances of the PCs and of the original variables) is equal to p, the number of original variables. Furthermore, the proportion of variation explained by the ith PC is simply λi/p, and the total variation explained by the first m PCs is (λ1 + · · · + λm)/p. It should be noted here that the eigenvalues (and, hence, the eigenvectors) of the correlation matrix are different to those of the covariance matrix, hence the PCs obtained from these are also different. In general, we say PCA is not invariant under linear transformations of the original variables, i.e. PCA is not scale-invariant. It should also be pointed out here that the covariance matrix of the standardised data is simply the correlation matrix of the original data.

The choice between raw and standardised data for PCA:

For many applications, it is more in keeping with the spirit and intent of this procedure to extract principal components from the covariance matrix rather than the correlation matrix, especially if they (the PCs) are destined for use as input to other analyses. However, we may wish to use the correlation matrix in cases where the measurement units are not commensurate or the variances (of the variables in hand) otherwise differ widely.
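The remark above, that the covariance matrix of the standardised data is just the correlation matrix of the original data, is easily checked in R (an added sketch, not from the original text; X again denotes any numeric data matrix):

Z <- scale(X)              # centre and scale each column
all.equal(cov(Z), cor(X))  # TRUE: standardising turns covariances into correlations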

19.2.4 Interpretation of Principal Components

After obtaining the PCs (i.e. eigenvalues and eigenvectors), the usual procedure is to look at the first few PCs which, hopefully, explain a large proportion of the total variation in the data. It is then necessary to decide which eigenvalues are large and which are small. Noting that the number of large eigenvalues indicates the effective dimensionality of the data, it is essential to decide how many of the eigenvalues are large.

How many principal components?

Various principles have been put forward over the years in an attempt to introduce some objectivity into the process. One popular approach is to plot the eigenvalues λi against i (i = 1, ..., p) in a so-called scree plot. The idea is to look at the pattern of the eigenvalues and see if there is a natural breakpoint. Ideally, the first few eigenvalues would show a sharp decline, followed by a much more gradual slope; hence we may consider all eigenvalues above the elbow of the scree plot as large (an example is given later).

A popular alternative requires the computation of the average (or mean) of the eigenvalues, say λ̄. Here, those components with λi ≥ λ̄ are regarded as important, while those with λi < λ̄ can be ignored. When using the correlation matrix for PCA (where the sum of the eigenvalues is p), λ̄ = 1, hence the common belief is that we may regard the eigenvalues that are less than 1 to be small, and therefore unimportant.

It must be stressed that the above methods represent ad hoc suggestions, and have little formal statistical justification. However, some recent studies have used statistically motivated ideas based on error of prediction using techniques such as cross-validation and bootstrapping. Ultimately, the choice of the number of large eigenvalues should be the one that gives a sensible result in a particular problem, providing useful conclusions. In other words, retain just enough principal components to explain some specified large percentage of the total variation of the original variables; values between 70 and 90% are usually suggested.
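For a concrete version of these rules of thumb, the sketch below (added here, not part of the original text; Rmat stands for a correlation matrix, e.g. cor(climate.m) from Section 19.3) flags the eigenvalues above the average and reports the cumulative proportion of variation explained, which can be compared with the 70 to 90% guideline.

ev <- eigen(Rmat)$values   # eigenvalues, largest first
ev >= mean(ev)             # TRUE for components counted as "large" (the mean is 1 for a correlation matrix)
cumsum(ev) / sum(ev)       # cumulative proportion of the total variation explained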


What are the principal components, i.e. could they be labeled?

Once we have identified the important PCs, we try to attach meaningful labels to these components. The usual procedure is to look at the corresponding coefficients (or eigenvectors) and pick out the original variables for which the coefficients are relatively large, either positive or negative. However, there is no hard and fast rule that tells us how small a coefficient must be before its corresponding variable can be ignored. Such judgement is inevitably highly subjective, and furthermore is usually made in relation to all other coefficients in that PC. Often, a coefficient may be deemed to be sufficiently small in one component, whereas one of similar size in another component might not lead to discarding the corresponding variable. Note: some researchers suggest an approach where the eigenvectors or PC coefficients are scaled (normalised) so that the largest (absolute) element in the vector is unity; this is left as an exercise to explore! Having established the dominant original variables for a given PC, we must then pay attention to the sizes and signs of their coefficients in attempting a physical interpretation of that component. We may also try to see what these variables have in common. It should be noted here that it might be difficult or dangerous to try to read too much meaning into the PCs.

A remark:

In a linear combination, Yi = ai1 X1 + ai2 X2 + · · · + aip Xp, the influence of a variable Xj depends both on the size of aj and on the variability of Xj. If Xj is a less variable quantity than Xk, then aj will have to be larger than ak if it is to have the same influence on Yi. This is why the absolute value of the coefficients aij is not a guide to the importance of the variables in Yi. This of course applies to any linear function, including the canonical discriminant functions considered in another chapter. Scaling (e.g. standardising) the variables so that each Xj has unit variance has no effect on the structure of the correlation, but it does make the importance of each variable proportional to the size of its coefficient, which is therefore easier to interpret.

The use of PC scores:

Finally, if the first few components (preferably 2 or 3) account for most of the variation in the original data, it is often a good idea to use the corresponding principal component scores of the individuals in the sample in subsequent analyses. One such analysis would be to plot these scores on a two- or three-dimensional graph. The graphs may be useful in identifying unusual points or outliers, or groups or clusters of individuals.


These groupings may then be examined using other multivariate techniques such as canonical discriminant analysis. Another prominent use of PC scores is in regression modelling. One of the major drawbacks of multiple regression modelling is that it suffers from multicollinearity problems, i.e. the explanatory (or input) variables are highly correlated. In such circumstances, carrying out a principal component analysis on the explanatory variables, followed by using the principal components as inputs in the regression modelling, would alleviate the so-called multicollinearity problem because the principal components are uncorrelated. This approach is known as Principal Component Regression Modelling and is left as an option for self exploration!

19.3 Analysis Using R

The data for the climate related information for some NZ towns are read (using the function read.csv()) from the file NZclimate.csv. Note that row.names=1 tells R that the first value in each row is to be treated as a name for the row, not simply as a variable in the data set. This is necessary for subsequent functions to pick up the names of the towns.
> climate = read.csv("NZclimate.csv", row.names = 1)
> head(climate)
           Latitude Longitude JanTemp JlyTemp Rain  Sun Altitude Location Island
Kaitaia        35.1     173.3    19.3    11.7 1418 2113       80  Coastal  North
Kerikeri       35.2     174.0    18.9    10.8 1682 2004       73  Coastal  North
Dargaville     36.0     173.8    18.6    10.7 1248 1956       20  Coastal  North
Whangarei      35.7     174.3    19.7    11.0 1600 1925       29  Coastal  North
Auckland       36.9     174.8    19.4    11.0 1185 2102       49  Coastal  North
Tauranga       37.7     176.2    18.5     9.3 1349 2277        4  Coastal  North

We shall consider the seven numerical variables (i.e. all variables except for the Name of Towns as well as the corresponding Location and Island information) in the principal component analysis. PCA requires the data to be in the form of a matrix, not a data table, so it must first be converted. To create a matrix containing the seven climate variables, the function as.matrix() is used as below. This matrix inherits the town names from the data.frame, and ignores the Location and Island variables.
> climate.m = as.matrix(climate[1:7])


Exhibit 19.5 Scatterplot matrix of the climate variables.


> pairs(climate.m)


Exhibit 19.6 Correlation matrix of the climate variables.


> print(cor(climate.m), 2)
          Latitude Longitude JanTemp JlyTemp   Rain   Sun Altitude
Latitude     1.000     -0.75   -0.83  -0.763 -0.036 -0.37     0.14
Longitude   -0.754      1.00    0.71   0.591 -0.231  0.46    -0.10
JanTemp     -0.825      0.71    1.00   0.743 -0.304  0.56    -0.43
JlyTemp     -0.763      0.59    0.74   1.000  0.033  0.33    -0.61
Rain        -0.036     -0.23   -0.30   0.033  1.000 -0.44     0.20
Sun         -0.367      0.46    0.56   0.334 -0.440  1.00    -0.24
Altitude     0.136     -0.10   -0.43  -0.613  0.198 -0.24     1.00

First obtain a scatterplot matrix of the seven variables (as shown in Exhibit 19.5) using the pairs() function. The corresponding correlation coefficients, shown in Exhibit 19.6, were found using the cor() function and then rounded to 2 decimal places using an additional argument to the print() command. We see in Exhibit 19.5 that some pairs of variables are positively correlated while others are negatively correlated. January mean temperature is reasonably highly and positively correlated with July mean temperature and Longitude, but has a high negative correlation with Latitude. The numerical values of the correlations shown in Exhibit 19.6 indicate support for this observation. In fact these are the only variables that have


Exhibit 19.7 Variation explained by the PCs.


> summary(climate.pca)
Importance of components:
                         PC1   PC2   PC3   PC4    PC5    PC6     PC7
Standard deviation     1.932 1.139 1.033 0.690 0.5035 0.3564 0.21579
Proportion of Variance 0.533 0.185 0.153 0.068 0.0362 0.0181 0.00665
Cumulative Proportion  0.533 0.718 0.871 0.939 0.9752 0.9933 1.00000


moderate to high correlation between them, with the exception that July temperature also has a moderately negative relationship with Altitude. All other pairwise relationships are reasonably random (i.e. not linear) with low (i.e. between -0.5 and 0.5) correlations. Since the majority of the climate variables have reasonably high correlations between them, a principal component analysis is feasible for exploring the given data.

Noting that the variables considered have been measured in different units (e.g. °S, °E, °C, mm, hours and metres), the PCA should be carried out on the standardised variables, i.e. the correlation matrix should be subjected to an eigen analysis. (Later, we shall briefly discuss the consequences of using the raw data or eigen analysing the covariance matrix.) We could manually standardise all variables with the scale() function. There is however an option in the R function(s) for PCA that does a pre-standardisation, so the manual standardisation step can be omitted. The functions prcomp() and princomp() both perform a principal components analysis on the given numeric data matrix. We shall explore the first function here (but try the second one in your own time):
> climate.pca = prcomp(climate.m, scale = TRUE)

The scale=TRUE option is used to perform the PCA on the standardised data (or the correlation matrix), so if omitted (i.e. scale=FALSE) the covariance matrix is used for PCA. The results are stored in climate.pca, which is a list containing the variation explained by, and the coefficients defining, each component, as well as the principal component scores and a few other details. First, consider the variation explained by the PCs, obtained using the summary() function and printed (via the print() function) with 4 significant digits (shown in Exhibit 19.7). Exhibit 19.7 shows the standard deviations associated with each PC (which are simply the square roots of the eigenvalues of the correlation matrix) and the corresponding proportion and cumulative proportion of variation explained. It is clear that the first principal component accounts for just over half the total variation (53.3%) among the seven climate variables of the 36 towns sampled. The 2nd and 3rd PCs account for about 18.5% and 15.3% of the total variation respectively, and together with the first PC explain


Exhibit 19.8 Scree plot of the variances explained by the PCs.


> screeplot(climate.pca)


about 87% of the total variation. The remaining components explain only small amounts of variation (their eigenvalues are much smaller than 1, the so-called rule-of-thumb threshold), the 4th accounting for less than 7%. A scree plot of each component's variance (using the screeplot() function) is shown in Exhibit 19.8, which confirms how the first, followed by the 2nd and 3rd, components dominate. It is therefore reasonable to say that the data can be summarised by fewer dimensions (i.e. 3 rather than seven), accounting for about 87% of the total variation in the climate variables.

We shall next attempt to interpret the first three PCs only, by examining the corresponding principal component coefficients (or eigenvectors) and the associated PC scores. The coefficients are obtained, as shown in Exhibit 19.9, by just printing the climate.pca object (again, with 4 significant digits). The PC scores for the first three components are graphed (see Exhibits 19.10 and 19.11) using the plot() function in conjunction with the text() function to show the names of towns as labels, with red and blue differentiating the coastal and inland locations and italics indicating the South Island.

It is clear from Exhibit 19.9 that the 1st PC, which accounts for about 53% of the variation, written as

    Y1 = 0.4418 Latitude − 0.4212 Longitude − 0.4866 JanTemp − 0.4424 JulTemp + 0.1498 Rain − 0.3292 Sun + 0.2533 Altitude



Exhibit 19.9 Principal component coefficients.


> climate.pca
Standard deviations:
[1] 1.9317 1.1391 1.0332 0.6902 0.5035 0.3564 0.2158

Rotation:
              PC1      PC2      PC3       PC4      PC5      PC6       PC7
Latitude   0.4418 -0.33474  0.24741 -0.058342  0.36556 -0.17608 -0.680831
Longitude -0.4212  0.05945 -0.38133  0.262481  0.73669 -0.24913 -0.003601
JanTemp   -0.4866 -0.01287 -0.02141  0.097811 -0.50620 -0.53481 -0.459068
JlyTemp   -0.4424  0.28273  0.30552  0.005599  0.10124  0.65915 -0.431613
Rain       0.1498  0.77161  0.06419 -0.506354  0.12930 -0.31758 -0.063897
Sun       -0.3292 -0.44015 -0.15955 -0.813167  0.07581  0.06259  0.038997
Altitude   0.2533  0.12329 -0.81854 -0.021593 -0.18631  0.28575 -0.365813

Exhibit 19.10 Scores of the first and second principal components.


> plot(climate.pca$x[, 2] ~ climate.pca$x[, 1], xlab = "1st PC (53.3%)",
+     ylab = "2nd PC (18.5%)", xlim = c(-3.5, 5), ylim = c(-2.2,
+         3.2), cex = 0.2)
> text(climate.pca$x[, 2] ~ climate.pca$x[, 1], labels = row.names(climate.m),
+     col = c("red", "blue")[as.numeric(climate$Location)],
+     font = c(1, 3)[as.numeric(climate$Island)])

[Scatter of 2nd PC scores against 1st PC scores, each point labelled by town name; coastal and inland towns are shown in different colours and South Island towns in italics.]


Exhibit 19.11 Scores of the first and third principal components.


> plot(climate.pca$x[, 3] ~ climate.pca$x[, 1], xlab = "1st PC (53.3%)",
+     ylab = "3rd PC (15.3%)", xlim = c(-3.5, 5), cex = 0.2)
> text(climate.pca$x[, 3] ~ climate.pca$x[, 1], labels = row.names(climate.m),
+     col = c("red", "blue")[as.numeric(climate$Location)],
+     font = c(1, 3)[as.numeric(climate$Island)])

[Scatter of 3rd PC scores against 1st PC scores, labelled by town name as in Exhibit 19.10.]

may be regarded as a new dimension that contrasts the mean daily temperature in January and July of a town, together with its Longitude and annual Sunshine hours, against its Latitude and Altitude. However, one may regard the influence of Altitude and average rainfall on this first PC to be marginal. Simply, this principal component could be regarded as the warmth and sunshine dimension. Hence, considering the larger coefficients only, we may note that a (large) negative score for the 1st PC would indicate a town situated at low latitude (and altitude) but large longitude, having high/warm mean daily temperatures in January and July with reasonable annual sunshine hours. Indeed, this seems to be the case in Exhibit 19.10, which plots scores of the first two principal components, with points labeled by the names of the corresponding towns. Here, for example, the coastal (and North Island) towns such as Gisborne, Whangarei, Kaitaia, Auckland, Tauranga and Napier fall on the left-hand side of the 1st PC dimension. We should note that these towns are found at lower latitudes and altitudes and have high mean daily temperatures and good sunshine hours per year. Much colder towns (with low annual sunshine hours) such as Mount Cook, Invercargill, Lake Tekapo, Gore and Queenstown, which are South Island towns at high latitudes and altitudes, fall on the right-hand side of the first principal component dimension.


Note also that the North Island town Ohakune behaves like a South Island town (i.e. falling among them) along the 1st PC, while the South Island towns Blenheim and Nelson fall among the group of North Island towns.

Although the variable Rain does not play a primary role in identifying the 1st PC, it accounts for a substantial amount of the 18% or so of the variation explained by the 2nd principal component. In fact, this 2nd PC, in general, represents a contrast between Annual Rainfall and Annual Sunshine hours of the sampled towns. Hence, towns that are mainly wet and gloomy(!) (e.g. Mt. Cook and Haast) would show large positive values for this PC, while towns such as Alexandra, Lake Tekapo and Blenheim, which are predominantly dry but sunny, would have large negative 2nd PC scores. This behaviour of towns is very prominent in Exhibit 19.10.

Finally, the altitude of the towns dominates the 3rd PC (see Exhibit 19.9), which accounts for about 15% of the total variation. Exhibit 19.11 shows this 3rd PC plotted against the 1st PC to highlight the behaviour of towns on the 3rd principal component. As expected, a gradient of inland towns such as Ohakune, Lake Tekapo, Taupo and Mt. Cook, which are at higher altitude than coastal towns such as Invercargill and Haast, can be seen clearly in the graph. Note that, since the 3rd PC has a negative coefficient for altitude, the high-altitude towns appear with negative PC scores in the graph. (Perhaps we should try to plot the negative PC score in this case!) A 3-D graph would be ideal to indicate the relative positions of towns with respect to all three PCs, accounting for about 87% of the total variation in the climate variables. This is left as an exercise to explore using suitable R functions!

PCA on the raw data, i.e. eigen analysis of the covariance matrix

To explain why it may not be sensible to carry out a principal component analysis on the raw (non-standardised) data, first consider the variation in each climate variable, found by using the apply() function on the matrix in conjunction with the sd() function:
> apply(climate.m, 2, sd)
 Latitude Longitude   JanTemp   JlyTemp      Rain       Sun  Altitude
    3.204     2.672     1.613     2.779   790.445   216.228   200.375

It is obvious that the variation among the seven variables is highly heterogeneous, ranging from a very large variance, approx. 624803.3 (= 790.445²) for the variable Rain, to a very small variance of 2.6 (= 1.613²) for the variable JanTemp. In PCA, the new dimensions are obtained such that the first principal component explains most of the variation, the second explains most of the remaining variation, and so on. We can, therefore, expect variables such as Rain, and to a lesser extent Sun and Altitude, to dominate the first two or so components. This implies that we should not perform PCA on the raw data. Another obvious reason


Exhibit 19.12 Principal component coefficients (Raw data).


> climate.pca2 = prcomp(climate.m, scale = FALSE)
> summary(climate.pca2)
Importance of components:
                           PC1      PC2      PC3     PC4  PC5  PC6   PC7
Standard deviation     797.600 209.6736 177.7293 4.05983 1.37 0.94 0.515
Proportion of Variance   0.894   0.0618   0.0444 0.00002 0.00 0.00 0.000
Cumulative Proportion    0.894   0.9556   1.0000 1.00000 1.00 1.00 1.000
> climate.pca2
Standard deviations:
[1] 797.5996 209.6736 177.7293   4.0598   1.3671   0.9402   0.5151

Rotation:
                 PC1       PC2        PC3       PC4        PC5        PC6        PC7
Latitude   8.383e-05  0.005552 -0.0042716  0.688536  0.4300445 -0.3634151 -0.4570124
Longitude  8.176e-04 -0.003509  0.0041577 -0.507298  0.8601845  0.0518271  0.0038348
JanTemp    6.534e-04 -0.004163  0.0008717 -0.265154 -0.1755083  0.3814077 -0.8679861
JlyTemp   -4.488e-05 -0.008981 -0.0023961 -0.445115 -0.2105527 -0.8483747 -0.1942009
Rain      -9.904e-01 -0.125275  0.0584709  0.001082  0.0007552  0.0005348 -0.0003343
Sun        1.272e-01 -0.659896  0.7404410  0.010251 -0.0003152 -0.0005692  0.0006861
Altitude  -5.418e-02  0.740744  0.6695418 -0.005136 -0.0028403 -0.0055904 -0.0032340

Another obvious reason for this is that the units (or scale) used for measuring the various response variables (i.e. the climate of the 36 towns) are not the same, as pointed out earlier. Hence, it is very appropriate to standardise these variables before carrying out a principal component analysis. The effect of the variables Rain, Sun and Altitude can be seen directly by examining the eigenvalues and the corresponding eigenvectors associated with the above PCA on the raw data, obtained using the function prcomp() with scale=FALSE and shown in Exhibit 19.12. The first eigenvalue, and thus the first PC, accounts for about 89.4% of the total variation among the variables. The corresponding eigenvector indicates that the first PC simply represents the variable Rain, as the coefficients associated with all other variables are very small compared to that of this variable. The remainder of the variation is almost entirely explained by the 2nd and 3rd components, which are dominated by the variables Sun and Altitude (the 2nd PC being a contrast and the 3rd a weighted sum of the two variables). Thus, it is not advisable to carry out a PCA on the raw data; instead the



analysis should be performed on the standardised data in order to give an equal footing to all climate variables.

19.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 19.1: This exercise uses a famous study conducted by Hermon Bumpus on house sparrows a century ago. The story behind the data is that, on February 1, 1898, a very severe snowstorm hit New England and a number of English house sparrows (Passer domesticus) were brought to the Anatomical Laboratory of Brown University in Providence, Rhode Island. Out of a total of 136 birds, 72 survived and the rest died. A scientist named Hermon Bumpus measured nine physical characteristics on the birds, including total length in millimeters, weight (in grams) and length of the humerus (arm bone) in inches, that he thought might distinguish the survivors from the birds that perished. He controversially claimed that a differential pattern of survival (due to a greater toll on individuals whose morphometrics deviated most from the ideal type) was evidence of natural selection. He concluded that the birds which perished, perished not through accident, but because they were physically disqualified, and that the birds which survived, survived because they possessed certain physical characters. Bumpus seemed to have concluded that individuals with measurements close to the average survived better than individuals with measurements rather different from the average. We should note that the development of multivariate statistical methods had hardly begun in 1898 and it was about 50 years later that Hotelling described a practical method for carrying out a principal component analysis, one of the simplest multivariate analyses that can be applied to this Sparrow data. The provocative nature of Bumpus's interpretation of the data, coupled with publication of the complete data set on which it is based, has prompted repeated analyses of his study. The Sparrow data used here (available in file sparrow.csv) contains the results for only five of the original variables for female birds only, with a sample shown in Exhibit 19.13: Total length, Alar length, Length of beak and head, Length of humerus, Length of keel and sternum (all in mm). Of the 49 birds, the first 21 (in the data table) survived and the remainder died. In the DRUGS package, this data set is called Sparrow and can be obtained using
> data(Sparrow, package = "DRUGS")

(a) Examine the correlations among the five numerical measurements made on the sparrows. You may also show a scatter plot matrix of these variables.



Exhibit 19.13 Sparrow data. Length measurements in mm and dead or alive status of 49 birds.
Total  Alar  Beak+Head  Humerus  Sternum  Status
 156    245     31.6      18.5     20.5    Alive
 154    240     30.4      17.9     19.6    Alive
 153    240     31.0      18.4     20.6    Alive
 153    236     30.9      17.7     20.2    Alive
 155    243     31.5      18.6     20.3    Alive
 ...    ...      ...       ...      ...     ...
 155    235     30.7      17.7     19.6    Dead
 162    247     31.9      19.1     20.4    Dead
 153    237     30.6      18.6     20.4    Dead
 162    245     32.5      18.5     21.1    Dead
 164    248     32.3      18.8     20.9    Dead

(b) Carry out a PCA on the standardized data (use all five numerical measurements) and comment on your findings. Your comments should include interpretation of: the proportion of total variance explained by the principal components; the most important components utilising the PC coefficients (with label, if possible); and the plots of component scores (to discover patterns among sparrows, e.g. survived or not, and unusual birds, if any).

(c) Explain why it may not be advisable to carry out a PCA on the given raw data. You should utilise the variation in each variable and the eigenvalues and eigenvectors associated with a PCA carried out on the raw data (i.e. PCA based on the covariance matrix) to support your arguments.

Exercise 19.2: Researchers suggest, at least in part, that many people assess their health using their (estimated) percentage body fat. The general indication is that 22% body fat for women and 15% for men are maxima for good health. It is, however, very difficult to measure body fat exactly. A researcher accurately measured the percentage body fat of a sample of 60 men using an underwater weighing method. Several other measurements were also made on these men, in particular, age (yrs), weight (lb), height (inch), neck circumference (cm), chest circumference (cm), abdomen circumference (cm) at the umbilicus and level with the iliac crest, and hip circumference (cm). The data are available in the file bodyfat.csv. The main aim here is to explore the relationship between the percent body fat and the other variables shown above. In the DRUGS package, this data set is called BodyFat and can be obtained using
> data(BodyFat, package = "DRUGS")



(a) Produce a scatter plot matrix of all 8 variables. Comment on your graph - do you see any strong trends, peculiarities etc. among all pairs of variables? Can you suggest, intuitively and looking at the plot, which of the variables have a strong influence (linear or otherwise) on the behaviour of percent body fat - explain your answer.

(b) Carry out a principal component analysis on the seven (standardised) variables other than percent body fat considered in (a), and interpret your findings. Consider the usual features such as correlations among the variables, proportion of total variance explained by the principal components, coefficients associated with the principal components and interpretation of the most important components (if possible with labelling), and plot(s) of component scores for your interpretations.

(c) Explore the relationship between the important principal components in (b) and the percent body fat. Again, use scatter plots for your comments. Note that this is usually the first step towards principal component regression (PCR) modelling.

Exercise 19.3: The data file EUprotein.csv shows protein consumption, classified into nine food groups, in 25 European countries (in the late 1970s). The food groups are: red meat (rm), white meat (wm), eggs (egg), milk (mlk), fish (fsh), cereals (crl), starchy foods (sch), pulses, nuts & oils (pno) and fruits & vegetables (fv). The protein consumption is given in grams per person. Three natural groups existed among these countries when the data were collected. These were: the European Economic Community (EEC); the other western European countries (WEC); and the eastern European communist countries (CCE). The aim here is to try and summarize the data in fewer dimensions than nine by means of a principal component analysis (PCA). In the DRUGS package, this data set is called EUProtein and can be obtained using
> data(EUProtein, package = "DRUGS")

(a) Explain why it may not be advisable to carry out a PCA on the given raw data. You should utilise the variation in each variable and the eigenvalues and eigenvectors associated with a PCA carried out on the raw data (i.e. PCA based on the covariance matrix) to support your arguments.

(b) Carry out a PCA on the standardized data (i.e. using the correlation matrix) and comment on your findings. Your comments should include interpretation of: the proportion of total variance explained by the principal components; the most important components utilising the coefficients/eigenvectors (with label, if possible); and the plots of component scores to discover patterns among countries such as differences among CCE, WEC and EEC countries, and unusual countries, if any (you may attempt 2-D as well as 3-D plots).



Exercise 19.4: Investigate the help files for the prcomp() and princomp() commands. What is the difference between them, and in what circumstances would you use princomp() instead of the prcomp() command?

Chapter 20 Discriminant Analysis: Classification of Drug Treatments


An original chapter

written by

Siva Ganesh1

20.1 Introduction

In this chapter we consider a data set that contains information about a set of patients who suffered from the same illness. Imagine that you are a researcher compiling data for a medical study, and you want to use your data analysis skills to find out which of the five drugs might be appropriate for a future patient with the same illness. In the study, each patient responded to one of five drug treatments (randomly allocated) and the data contains the following information for each patient: Age, Blood pressure (BP), Blood cholesterol, Blood sodium concentration, Blood potassium concentration and Drug to which the patient responded (A, B, C, X or Y). Training data with a sample of 544 patients are in the file Drugtrain.csv with a sample shown in Exhibit 20.1, while a new sample of 15 patients to be scored is in the file Drugnew.csv.
1 Ganesh is a former colleague in the Statistics group of the Institute of Fundamental Sciences who has moved to AgResearch Ltd. Please contact the editor for any queries relating to this chapter.




Exhibit 20.1 Drug data. Some information for 544 patients.


Patient  Age   BP  Cholesterol  Sodium   Potassium  Drug
   1      48  146      4.00     0.69262  0.055369    A
   3      37  146      5.16     0.53819  0.069780    A
   4      35  135      8.43     0.63568  0.068481    A
   5      32  125      5.48     0.68974  0.055571    X
   7      15  125      6.09     0.81352  0.060659    X
   9      54  142      5.26     0.65043  0.044934    B
  12      34  105      6.92     0.73617  0.067911    C
  13      62  146      5.68     0.58367  0.076438    B
  16      39  100      6.09     0.58008  0.025057    Y
 ...     ...  ...       ...        ...       ...    ...

20.2 Discriminant Analysis

20.2.1 Group Separation

When the individuals or units in the multivariate data are grouped (e.g. grouping of patients taking different drug treatments), we may wish to explore the following, given measurements for these individuals on several variables:

Find a low-dimensional representation that highlights as accurately as possible the true differences between the groups of individuals, in particular when the dimensionality of the data is high.

Find a rule for classifying new individuals into one of the existing groups.

Establish whether the group structure is statistically significant, i.e. test whether the multivariate means (of the variables) for the various groups differ significantly.

The above tasks could be achieved by means of techniques such as Discriminant Analysis (Canonical, Linear, Quadratic etc.) and Multivariate Analysis of Variance (MANOVA). Initially, we shall consider the application of discriminant analysis on data with two groups only, and then extend the ideas to multiple-group situations. To begin with, we shall consider the case of linear discriminant analysis, where the functions of variables used for the separation of groupings are linear.

20.2.2 Linear Discriminant Functions (LDF)

Suppose there are p variables, say, X1, X2, ..., Xp, observed on each of n1 and n2 (sample) individuals from groups denoted by G1 and G2. Let X = (X1, X2, ..., Xp)^T denote the vector of variables and x = (x1, x2, ..., xp)^T be any observation (or individual). Also let x̄1 and x̄2 be the mean vectors and S1 and S2 be the covariance matrices of the samples of G1 and G2, respectively. Furthermore, let S be the pooled covariance matrix of S1 and S2. Note that the pooled covariance matrix is used when assuming that the population covariance structures of the two groups are similar, i.e. when assuming a common covariance structure.

Now consider the linear combination Y = a1 X1 + a2 X2 + ... + ap Xp of the original variables. This can be written in vector form as Y = a^T X, where a = (a1, a2, ..., ap)^T. Then the group separation on Y is given by a^T (x̄1 - x̄2), and the within-group variation of Y is a^T S a (assuming a common covariance structure). Hence, we look for the coefficient vector a that maximises the ratio a^T (x̄1 - x̄2) / sqrt(a^T S a). It is easily shown that the maximum value occurs when a takes the form S^{-1} (x̄1 - x̄2). In other words, the linear function

    Y = (x̄1 - x̄2)^T S^{-1} X

provides the maximum separation between the two groups while keeping the groups as compact as possible. This idea was first used by Fisher (1936), hence it is called Fisher's Linear Discriminant Function (LDF). Fisher also showed that this LDF provides the best rule for classifying an individual x into an appropriate group with minimum error. In the case of classification (of a new individual into one of the two groups), X is replaced by x, and the LDF becomes

    Y = (x̄1 - x̄2)^T S^{-1} x

The problem of finding the best linear function which minimises the cost of misclassification can be tackled easily. Let us denote the (prior) probability that an individual belongs to the group Gi by qi (i = 1, 2), and the probability that an individual that belongs to Gj is classified (or more appropriately, misclassified) into Gi by P(i|j) for i, j = 1, 2. Then we know that P(1|1) = 0 = P(2|2), and the corresponding cost of misclassification will be zero. However, P(2|1) and P(1|2) may not be zero, and we denote the corresponding costs of misclassification by c1 and c2 respectively. It can be shown, using the principle of the log-likelihood ratio, that the best rule for classifying individuals is:

    classify individual x into group G1 if L^T [x - (1/2)(x̄1 + x̄2)] >= k; otherwise classify into group G2,

where L = S^{-1} (x̄1 - x̄2) and k = log(c2 q2 / c1 q1). Note that the LHS of the above classification rule can be written as Fisher's LDF plus (or minus) a constant! Hence, without loss of generality, we classify individual x using the new LDF function,

    W = [x - (1/2)(x̄1 + x̄2)]^T S^{-1} (x̄1 - x̄2)
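The quantities L and W are straightforward to compute in R. The following is an illustrative sketch added here (not part of the original text); it assumes the drug training data drugtrn that is read in Section 20.3 and uses just two of its groups, with base R functions only.

> # Fisher's LDF for two groups (drug A vs drug B) from first principles
> g1 = subset(drugtrn, Drug == "A")[, 1:5]
> g2 = subset(drugtrn, Drug == "B")[, 1:5]
> xbar1 = colMeans(g1); xbar2 = colMeans(g2)
> n1 = nrow(g1); n2 = nrow(g2)
> S = ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2))/(n1 + n2 - 2)  # pooled covariance
> L = solve(S) %*% (xbar1 - xbar2)                             # coefficient vector
> x = as.numeric(g1[1, ])                 # one observation to classify
> W = t(x - (xbar1 + xbar2)/2) %*% L      # classify into G1 if W >= 0 (equal priors & costs)
> W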



Usually, the costs of misclassification are assumed to be the same for each group, and so are the prior probabilities. In this case, called equi-priors & costs, the value of k is zero and the classification rule becomes: classify individual x into group G1 if W >= 0; otherwise classify into group G2.

Remark: Unless otherwise stated, we shall assume equal priors and equal costs of misclassification. When equal priors are used, it is assumed that a new individual of unknown origin was equally likely to have come from each of the known groups/populations. Often, though, there are good reasons for believing that one group/category is more likely than another. In such cases, it is standard practice to opt for proportional priors, where the actual proportions of the two groups in the given sample are used as priors.

20.2.3 Probabilities of Misclassification

As seen earlier, there are two types of misclassification probabilities, namely P(2|1) and P(1|2), associated with the LDF. These are also known as the misclassification error rates (or simply, error rates). These error rates provide not only the risk of classifying a new individual into one of the existing groups, but also a basis for assessing the performance of the (linear) discriminant function itself.

Under the assumption of multivariate normality on the variables X1, X2, ..., Xp and a common covariance structure of the observations in the two groups, together with equi-priors & costs, we can show that

    P(2|1) = P(1|2) = Φ(-D/2)

where Φ is the standard normal cumulative probability function and D^2 is the sample Mahalanobis distance given by D^2 = (x̄1 - x̄2)^T S^{-1} (x̄1 - x̄2).
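Under these assumptions the error rate can be evaluated directly with pnorm(). The following sketch is an added illustration (reusing xbar1, xbar2 and S from the hypothetical two-group sketch above):

> D2 = t(xbar1 - xbar2) %*% solve(S) %*% (xbar1 - xbar2)  # sample Mahalanobis D^2
> pnorm(-sqrt(D2)/2)                                      # estimate of P(2|1) = P(1|2)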

However, when the normality assumption is questionable, this procedure would provide very poor estimates of the error rates, though the LDF is a fairly robust classification rule. This leads us to estimating the error rates by means of empirical approaches.

The Resubstitution method: The simplest such approach is to apply the given classification rule to the two groups of observations in hand, and to estimate the error rates by the proportion of individuals that are misclassified by the classification rule. This method is called the Resubstitution



method because the individuals of the given data are used to find the classification rule and are then re-substituted into it to estimate its performance. The steps in this procedure are, first, to compute the classification rule (i.e. the LDF in W form), and then to classify each individual in the given data set into the group determined by the rule. Since we know the origin of each individual observation (i.e. whether it belongs to G1 or G2), we can count the number of misclassifications and hence the proportion of these misclassified individuals to the total number in that group. These proportions for each group would then constitute the estimates of the error rates P(2|1) and P(1|2).

One obvious problem with the resubstitution method is that it tends to have a bias

in favour of classifying individuals into the group that they really come from. After all, the group means (used in the LDF) are determined from the observations in that group. Thus it would not be surprising to find that an observation is closest to the centre (or mean) of the group where the observation helped to determine that centre.

The Cross-validation approach: The bias in the resubstitution method can be reduced by splitting the data (and the groups) into two sets, one for deriving the LDF (i.e. for computing sample means and covariance matrices) and the other set for estimating the error rates. Once again, the origin of the individuals in the second set is known, thus the number of misclassifications can be counted. The most common approach is to divide the data into two halves, a training sample (or calibration sample) and a validation sample (or hold-out sample or test data). For example, when the given data are divided into two equal portions for training and testing, the approach is called half cross-validation. Although this approach overcomes the bias associated with the resubstitution method by not using the same data to both construct and assess the discriminant function, it has the unfortunate effect of reducing the effective sample size; it is therefore suitable only for large data sets. To overcome this difficulty, for small data sets in particular, we may consider the leave-one-out method. Here, the given data set is split into two sets as before, but the validation sample consists of only one observation. In other words, each individual in the data set is classified into its closest group without using that individual to compute the discriminant function. Hence, this process involves fitting the discriminant function repeatedly, each time using the remaining n - 1 observations, n being the total sample size. As one can see, each time an individual is left out from the data, the LDF has a new form, i.e. the means and the pooled covariance matrix are (theoretically) different, thus the computation is tedious. However, there are short-cut formulae available to accelerate the computation of all the LDFs and the classification procedure. Unless otherwise stated, we shall regard the leave-one-out method as the cross-validation process in this material.



A remark on classifying new individuals: As seen earlier, once the discriminant function has been established, classification of individuals becomes a simple problem. It should be noted, however, that this classification is based on the assumption that the new individuals have come from one of the existing groups that are sampled. Obviously, in these cases it will not be known whether the classification is correct (as we do not know the origin of these new observations). However, the error rates estimated from the individuals from the known (or sampled) groups would provide an indication of how accurate the classification process is.

20.2.4 Distance-based Discrimination

The two groups of data, or training sets, in any discrimination problem may be thought of as two swarms of points in p-dimensional space. Here, the greater the difference between the two groups G1 and G2, the greater will be the separation between the two swarms. An individual x to be classified into one of G1 and G2 may then be thought of as a single point in this space, and an intuitively attractive procedure would be to classify x into the group to whose swarm it is nearer. This approach requires a definition of distance between the single observation x and each training group (i.e. the points in each group). One possibility is to define the squared distance as the Mahalanobis quantity,
    Dk^2(x) = (x - x̄k)^T S^{-1} (x - x̄k)

where x̄k is the mean of the k-th training group (k = 1, 2) and S is the pooled covariance matrix of the two training groups. The classification procedure would then be:
    classify x into G1 if D1^2(x) < D2^2(x), or into G2 if D1^2(x) >= D2^2(x).
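Base R provides mahalanobis() for exactly this squared distance. The short sketch below is an added illustration, reusing x, the group means and S from the earlier hypothetical two-group sketch:

> D2.1 = mahalanobis(x, xbar1, S)   # squared distance of x to group G1
> D2.2 = mahalanobis(x, xbar2, S)   # squared distance of x to group G2
> ifelse(D2.1 < D2.2, "G1", "G2")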

The relationship with LDF: Some simple algebraic manipulation would establish that this classification rule is exactly the same as the LDF under the assumption of equal priors and equal costs. Recall that,

    W = [x - (1/2)(x̄1 + x̄2)]^T S^{-1} (x̄1 - x̄2)

This can be re-written as,

    W = [x̄1^T S^{-1} x - (1/2) x̄1^T S^{-1} x̄1] - [x̄2^T S^{-1} x - (1/2) x̄2^T S^{-1} x̄2]

The terms in square brackets are linear functions of x, and these are called the LDFs for groups Gk, k = 1, 2. We may therefore re-write the squared (Mahalanobis-type) distance of x to group Gk as,

    Dk^2(x) = -2 [x̄k^T S^{-1} x - (1/2) x̄k^T S^{-1} x̄k] + x^T S^{-1} x

where the term in square brackets is the LDF for group Gk. For a given x, the group with the smallest squared distance has the largest LDF.

20.2.5 Multiple-group case

The ideas described so far for the two-group case can be easily generalised to the multiple-group (>2) situation, again assuming homogeneous covariance matrices. So, in a multiple-group situation, an observation x is classified into group Gk if the squared (Mahalanobis-type) distance of x to group Gk is the smallest. Notice that Dk^2(x̄j) = Dj^2(x̄k), and this is the Mahalanobis squared distance between groups Gk and Gj.

20.2.6 Posterior Probability of Classification

Although it is easy to find the estimates of the misclassification error rates by means of cross-validation approaches, some interest lies in the estimation of the probabilities that an observation x belongs to one of the existing groups. To elaborate, consider the case of the LDF, or equivalently, using the distance-based rule,

    Dk^2(x) = -2 [x̄k^T S^{-1} x - (1/2) x̄k^T S^{-1} x̄k] + x^T S^{-1} x

to classify observation x to group Gk as described above. We may show easily that the posterior probability of observation x belonging to group Gk is given by,

    P(Gk | x) = qk exp(-Dk^2(x)/2) / Σ_{j=1}^{g} qj exp(-Dj^2(x)/2)

assuming that there are g groups in the training data and with qk (k = 1, ..., g) being the prior probability for group membership. The observation x is then classified into the group that results in the largest posterior probability of classification. Note that this computation of posterior probability relies on the assumption of multivariate normality on the variables X1, X2, ..., Xp. Since the computation also relies on the prior probabilities, these need to be considered appropriately, i.e. assume equal priors or unequal (e.g. proportional) priors suitably.
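For a two-group illustration, the posterior probabilities follow directly from this formula. The sketch below is an addition (not from the original text), assuming equal priors and the squared distances D2.1 and D2.2 computed in the earlier sketch:

> q = c(0.5, 0.5)                  # equal prior probabilities
> num = q * exp(-0.5 * c(D2.1, D2.2))
> round(num/sum(num), 4)           # P(G1|x) and P(G2|x)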

20.2.7 Some Other Forms of Classification Rules

Several alternatives to the LDF have been suggested in the literature. These include:

QDF (Quadratic Discriminant Function), to overcome the non-homogeneity of within-group covariance matrices. This approach requires the data to follow a multivariate normal distribution.


Euclidean-distance Discriminant Function, which ignores the influence of the covariance matrices, so is similar to the Mahalanobis-type distance without the term S^{-1}; and the Regularised Discriminant Function, which in its basic form combines both the LDF and QDF to overcome the non-homogeneity of within-group covariance matrices.

Classification Trees, where an object (or observation) is classified by asking a series of questions about it, and the answer to one question determines what question is asked next, and so on. Again, this method makes no assumptions about the distribution of the response variables, which may be quantitative or categorical or a mixture of both. Basically, the approach is univariate in the sense that each question asked deals with one variable at a time. This method is generally regarded as a non-linear approach to classification problems.

Another non-linear approach is known as Neural Networks. These, in general, use non-linear functions to separate one group from another, and the form of the function does not have to be specified in advance as it is an iterative procedure. Both classification trees and neural networks work well when there is plenty of data, and are usually considered in Data Mining exercises.

These are left for self exploration in the future!
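As a possible starting point for that exploration (a suggestion added here, not a recommendation from the original text), the QDF mentioned above is available as qda() in the MASS package, with the same formula interface as the lda() function used later in Section 20.3:

> library(MASS)
> drugtrn.qda = qda(Drug ~ Age + BP + Cholesterol + Sodium + Potassium,
+     data = drugtrn)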

20.2.8 Canonical Discriminant Functions (CDF)

When the individuals in the multivariate data are grouped, and the overall dimensionality is high, we would rather find a low-dimensional representation that highlights as accurately as possible the true differences existing between the groups of individuals. Here the aim is discrimination rather than classification as in the case of the LDF. However, there is a connection between CDF and LDF which will be explored later. For an illustration, consider the scatter plots of 2-dimensional data shown in Exhibit 20.2, where we have two variables X1 and X2 measured on all individuals of two groups (denoted by two different plotting symbols) and the graphs show two different scenarios. It is evident in Exhibit 20.2(a) that, although the Y1 direction represents the maximum overall

spread of points, there is no indication of any difference between the two groups along this direction. This is because, when projected on to this axis, the points from both groups would be completely intermingled. In fact, Y1 would constitute the 1st principal component in a PCA of the data; thus, PCA would be fairly useless as a dimensionality reduction technique for highlighting the grouping features that interest us in this data set. On the other hand, the best one-dimensional representation from our point of view would be Y2, as it gives the direction in which group separation is best viewed, noting

Exhibit 20.2 CDF Demonstration.


[Two scatter plots, panels (a) and (b): in each, two groups of points are plotted against the variables X1 and X2, with two candidate projection directions Y1 and Y2 marked.]

that the projection of all points on to the Y2 axis would show a clear separation between the two groups. Finding such an axis or dimension is called canonical discriminant analysis (CDA). Note that this example also illustrates the fundamental difference between PCA and CDA. Alternatively, with respect to the data shown in Exhibit 20.2(b), it is evident that there is no indication of any difference between the two groups along the Y2 direction, and the best one-dimensional representation would be Y1 to best view the group separation, i.e. in a CDA. Note that Y1 would also be the best dimension from a PCA point of view.

CDA is a multivariate technique capable of identifying differences among groups of individuals, and improving understanding of relationships among several quantitative variables measured on these individuals. It determines linear functions of these variables that maximally separate the groups of individuals while keeping the variation within groups as small as possible.

The derivation of CDFs: Let the variables measured be X1, X2, ..., Xp. When only two distinct groups of individuals are in the data, a single linear combination of these variables would be sufficient to discriminate between the two groups (as also seen in Exhibit 20.2). However, when the data matrix contains more than two groups of individuals, we would usually need more than one such linear function. If g groups and p variables are in the multivariate data, then m (= the smaller of g - 1 and p) dimensions are needed to fully represent the group differences. As with PCA, however, we can try to find the best r-dimensional (r <= m) approximation, in which the group differences are exhibited as much as possible. Hence,



in the multiple group situation, we look for new variables Y1, Y2, ..., Ym which are linear functions of the Xs (i.e. Yi = ai1 X1 + ai2 X2 + ... + aip Xp, i = 1, ..., m), and which successively maximise between-group differences (or variation) relative to the variation within groups. The form of the solution is very similar to that of PCA, although there are obvious basic differences, such as:

the new variables Y1, Y2, ..., Ym are called canonical discriminant functions (or CDFs) instead of principal components;

the new variables are uncorrelated within groups instead of over the whole sample;

the quoted variances of the CDFs relate to the amount of between-group variability each accounts for, instead of overall variability in the data; and

the values of the Ys for each individual are called discriminant scores instead of component scores.

Finding the coefficients of the CDFs turns out to be an eigenvalue problem (like that of PCA). Let xijk denote the value of the j-th variable Xj on the k-th individual in the i-th group, and denote the observation vector of the k-th individual in the i-th group by xik. Furthermore, let ni be the number of individuals in the i-th group, and the total number of observations be n (= Σ_{i=1}^{g} ni). Now, if x̄i and x̄ respectively denote the vector of means in the i-th group and the vector of overall means (ignoring groups), then we may define the following:

i-th group covariance matrix, Wi = Σ_{k=1}^{ni} (xik - x̄i)(xik - x̄i)^T / (ni - 1)

pooled within-groups covariance matrix, W = Σ_{i=1}^{g} (ni - 1) Wi / (n - g)

between-groups covariance matrix, B = Σ_{i=1}^{g} ni (x̄i - x̄)(x̄i - x̄)^T / (g - 1)

Then, performing CDA is equivalent to finding the eigenvalues and eigenvectors of the matrix W^{-1}B (or BW^{-1}). Suppose the ordered eigenvalues are λ1 > λ2 > ... > λm. Then λi measures the ratio of the between-group variation (B) to the within-group variation (W) for the i-th linear function (or CDF) Yi. So if there is clear separation of groups, then the first few eigenvalues (preferably one or two) will be substantially greater than the others. The elements of the corresponding eigenvectors (denoted by, say, vectors a1, a2, ..., am) provide the coefficients of the linear functions, also known as canonical discriminant coefficients.

Hence, the CDFs are chosen in such a way that the first CDF Y1 reflects group differences as much as possible; Y2 captures as much as possible of the group differences not displayed by Y1, conditional on there being no correlation between Y1 and Y2 within groups; Y3, uncorrelated with Y1 and Y2 within groups, reflects group differences not accounted for by Y1 and Y2; and so on.

Canonical Discriminant Scores: As in principal component analysis, scores (or canonical discriminant scores) can be computed; these are the values of the new canonical discriminant functions/variables obtained for each individual or unit in the data using the linear functions,

    Yi = ai1 X1 + ai2 X2 + ... + aip Xp,   i = 1, ..., m (= min(g - 1, p))
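The eigen analysis described above can be reproduced from first principles. The sketch below is an added illustration (not part of the original text); it builds W and B for the drug training data drugtrn used in Section 20.3 and extracts the eigenvalues of W^{-1}B:

> vars = drugtrn[, 1:5]
> groups = drugtrn$Drug
> n = nrow(vars); g = nlevels(groups); p = ncol(vars)
> xbar = colMeans(vars)
> W = matrix(0, p, p); B = matrix(0, p, p)
> for (lev in levels(groups)) {
+     Xi = vars[groups == lev, ]
+     ni = nrow(Xi)
+     W = W + (ni - 1) * cov(Xi)                       # accumulate within-group SSCP
+     B = B + ni * tcrossprod(colMeans(Xi) - xbar)     # accumulate between-group SSCP
+ }
> W = W/(n - g); B = B/(g - 1)
> eig = eigen(solve(W) %*% B)
> lambda = Re(eig$values)
> round(lambda/sum(lambda), 4)   # compare with the 'Proportion of trace' from lda()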

One particular attraction of these scores is that, if only the first two (or three) CDFs are adequate to account for most of the important differences (variation) between groups, then a simple graphical representation of the relationship among the various groups can be produced. This is achieved by plotting the values of these CDFs (i.e. canonical discriminant scores) for the individuals in the sample. The plots would also be useful in identifying outliers or anomalies in the data. It is usual practice to work with mean-centered canonical scores (similar to using mean-centered principal component scores). Note however that mean-centering of the canonical scores implies that small raw canonical scores may result in large negative mean-centered canonical scores; similarly, large canonical scores may inevitably result in large positive mean-centered canonical scores.

Interpreting the Canonical Discriminant Functions: Coefficients of CDFs are the canonical weights of the original variables and provide information about the power of each variable in separating or discriminating between groups. A distinction can be made between interpreting each CDF, and evaluating the contribution of each original variable to that CDF. Similar to that of PCA, absolute values and signs (positive or negative) of canonical discriminant coefficients can be used to rank variables in order of their contribution and to characterise the associated CDF. The notion is that a high percent of separation between groups explained by a CDF, along with large absolute coefficients between that CDF and some original variables, means that those original variables are important in distinguishing between the groups. One of the aims is to attach meaningful labels to these discriminant functions. Ideally, the canonical scores graphs should be used to highlight the interpretation of the CDFs. However, the matter is more complicated than that of PCA by the fact that between-group variability is being assessed relative to within-group variability. This means that large canonical coefficients may be a reflection either of large between-group variability or



small within-group variability in the corresponding CDF. For interpretation, therefore, it is better to consider modified canonical coefficients that are appropriate when the original variables are standardised to have unit within-group variance.

For example, consider the i-th CDF, Yi = ai1 X1 + ai2 X2 + ... + aip Xp. If wjj is the j-th diagonal element of the pooled within-group covariance matrix W (i.e. the pooled variance of variable Xj), then the modified coefficients aij^w are given by aij^w = aij √wjj. This standardisation puts all original variables on a comparable footing as regards within-group variability (e.g. it removes the effect of different scales of measurement), and allows the modified CDFs to be interpreted in the manner suggested above. These modified coefficients are called pooled within-group standardised canonical coefficients, while the original coefficients are called raw canonical coefficients. It should be pointed out here that the raw canonical coefficients are, in fact, normalised, so that the CDFs are arranged to be uncorrelated and of equal variance within groups. Note also that the CDFs are uncorrelated between groups as well, and they are usually arranged in decreasing order of between-group variance. This process is equivalent to standardising the X variables within each group and then using the overall standardised data in canonical discriminant analysis and interpreting the raw canonical coefficients. The above form of standardisation amounts to writing the i-th CDF as,

    Yi = ai1^w X1^w + ai2^w X2^w + ... + aip^w Xp^w

where aij^w = aij √wjj and Xj^w = Xj / √wjj. Hence, clearly the introduction of the wjj s (i.e. within-group standardisation) does not change the value of the individual components (i.e. aij Xj or aij^w Xj^w) in the CDFs, nor does it affect the information contained in Yi. This means that the requirement for maximal separation (or discrimination) between groups does not place any restrictions on the choice of standardisation method. Alternative choices of such standardisation are available and left as an exercise to explore. One such method is Total-Sample Standardisation, where the group structure in the data is ignored. This is equivalent to standardising the variables ignoring groupings and then using them to obtain CDFs and interpreting the raw canonical coefficients.

Note that the canonical discriminant scores of individuals in the data are not affected by the standardisation, i.e. we may use either aij Xj or aij^w Xj^w in the CDFs when computing the canonical scores. This means we may just use the measurements in terms of their raw scale when interpreting the graphs of canonical discriminant scores.

Some Remarks:

Assumptions: CDA operates under the assumption that the within-group dispersion (variances and covariances) structure is homogeneous for all groups, i.e. the eigen analysis is carried out on the matrix W^{-1}B, where W is the pooled within-group covariance matrix. Failure of this assumption reduces the reliability of any statistical significance tests used, for example, to test the differences between groups and/or to determine the significant number of CDFs adequate to represent the data. These tests also require the data within each group to follow a multivariate normal distribution. However, the normality and homogeneity assumptions are not always considered absolute prerequisites for use of the technique, in particular if a descriptive or empirical interpretation is deemed sufficient to make conclusions.

Scale invariance: Unlike PCA, CDA is invariant under change of origin and scale of measurement. In other words, data for CDA need not be standardised prior to the analysis, i.e. the eigenvalues etc. associated with CDA on the raw data are identical to those associated with the standardised data.

CDA vs LDA: Since both LDA and CDA attempt to discriminate between groups, naturally there is a connection between the two methodologies. Using the definitions of the two approaches, it is easy to show that, in a two-group case, the single CDF required to highlight the differences between, say, groups G1 and G2 is essentially the same linear function as [LDF(G1) - LDF(G2)]. (Explore - left as an exercise.)
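The scale invariance can be checked empirically. The sketch below is an added illustration (not from the original text), assuming the drug training data drugtrn read in the next section; the 'Proportion of trace' reported by lda() is unchanged when the predictors are standardised first:

> lda.raw = lda(Drug ~ ., data = drugtrn)
> lda.std = lda(Drug ~ ., data = data.frame(scale(drugtrn[, 1:5]), Drug = drugtrn$Drug))
> round(rbind(raw = lda.raw$svd^2/sum(lda.raw$svd^2),
+             std = lda.std$svd^2/sum(lda.std$svd^2)), 4)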

20.3 Analysis Using R

The data containing information about a set of patients who suffered from the same illness are read, using the function read.csv(), from the file Drugtrain.csv.
> drugtrn = read.csv("Drugtrain.csv", row.names = 1)
> head(drugtrn)

  Age  BP Cholesterol Sodium Potassium Drug
1  48 146        4.00 0.6926   0.05537    A
3  37 146        5.16 0.5382   0.06978    A
4  35 135        8.43 0.6357   0.06848    A
5  32 125        5.48 0.6897   0.05557    X
7  15 125        6.09 0.8135   0.06066    X
9  54 142        5.26 0.6504   0.04493    B

One of the main aims here is to determine the relative importance of the variables Age, BP, Cholesterol, Sodium and Potassium in highlighting the differences among the five drugs responded to by the patients. Another aim is to determine which of the five drugs each of the 15 new patients in the data file Drugnew.csv would respond to. The first aim can be tackled via a canonical discriminant analysis on the 544 reference patients to determine new linear dimensions that maximise the differences between the five



Exhibit 20.3 Proportion of between-group variation explained, Canonical discriminant function etc.
> print(drugtrn.lda, digits = 4)
Call:
lda(Drug ~ Age + BP + Cholesterol + Sodium + Potassium, data = drugtrn)

Prior probabilities of groups:
     A      B      C      X      Y
0.2040 0.1268 0.1893 0.2518 0.2279

Group means:
    Age    BP Cholesterol Sodium Potassium
A 30.96 139.6       6.049 0.6856   0.06295
B 62.77 139.0       5.569 0.6825   0.06062
C 42.93 105.3       7.454 0.6581   0.06103
X 45.93 119.2       5.694 0.6762   0.06263
Y 43.74 124.3       5.991 0.7490   0.03381

Coefficients of linear discriminants:
                  LD1       LD2        LD3      LD4
Age          0.007149   0.00827  0.0631671  0.01267
BP          -0.109587  -0.01736 -0.0002137  0.02120
Cholesterol  0.158554   0.03123 -0.1080916  0.51943
Sodium      -0.708449   5.27743 -0.9662895 -1.08494
Potassium   16.652527 -97.72879  1.9571826 -6.14924

Proportion of trace:
   LD1    LD2    LD3    LD4
0.4861 0.4201 0.0818 0.0120

drug-groups of patients and to highlight such differences in a low-dimensional graphical representation of the patients. The second aim, on the other hand, can be handled using a linear discriminant analysis. Both these analyses can be carried out using the function lda() available in the MASS package together with other related functions. We use the library() command to ensure access to these functions.
> library(MASS)
> drugtrn.lda = lda(Drug ~ Age + BP + Cholesterol + Sodium +
+     Potassium, data = drugtrn)

The lda() function assumes proportional priors by default (i.e. the class or group proportions for the training set are used as prior probabilities of class membership). The results can be viewed by printing the object drugtrn.lda via the print() function; the output is shown in Exhibit 20.3. Exhibit 20.3 shows the formula of the model, the prior probabilities, the group means, the coefficients of the linear discriminants (which show the raw canonical discriminant coefficients), and the proportion of trace, which gives the proportion of the between-group variation that each canonical discriminant function explains (i.e. based on the eigenvalues of



W^{-1}B). We may access some of these output components of drugtrn.lda separately via drugtrn.lda$prior, drugtrn.lda$means and drugtrn.lda$scaling. Since there are five response variables and five groups of observations, the maximum number of (new) canonical dimensions required for separating the groups is four. It is clear from the results (in Exhibit 20.3) that almost 91% of the group separation between the five drug groups can be accounted for by the first two canonical discriminant functions (CDFs). Note that these two CDFs explain a similar amount of between-group variation, although as expected the first CDF accounts for more variation (48.6%) than the 2nd CDF (42%). Therefore, the major emphasis should be on the interpretation of both these CDFs (i.e. LD1 and LD2). The third CDF accounts for another 8.2% or so, while the 4th one explains only about 1.2% of the between-group variation. The raw canonical discriminant coefficients associated with the first two CDFs indicate that the Potassium variable dominates both new dimensions. However, examination of the given data indicates that the magnitudes of the values for the Potassium and Sodium concentrations (e.g. 0.055369 and 0.69262 for the 1st patient) are very small compared to those of BP and Age (e.g. 146 and 48 for the 1st patient). This means that stating that both LD1 and LD2 are dominated by Potassium would be misleading! Hence, we need to compute the pooled within-group standardised canonical coefficients for further interpretation of the CDFs. R doesn't appear to have a simple/single function that gives these coefficients, thus the steps in Exhibit 20.4 have been utilised.
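For convenience, the components mentioned above can be inspected directly; the small sketch below is an addition (not in the original text) and shows, in particular, how the proportions of trace quoted in Exhibit 20.3 can be recovered from the singular values stored in the fitted object:

> drugtrn.lda$prior
> drugtrn.lda$scaling                          # raw canonical coefficients
> drugtrn.lda$svd^2/sum(drugtrn.lda$svd^2)     # proportions of trace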



Exhibit 20.4 Pooled within-group standardised canonical coefficients.


> attach(drugtrn)
> drug.m = as.matrix(drugtrn[1:5])
> varA = diag(cov(subset(drug.m, Drug == "A")))
> varB = diag(cov(subset(drug.m, Drug == "B")))
> varC = diag(cov(subset(drug.m, Drug == "C")))
> varX = diag(cov(subset(drug.m, Drug == "X")))
> varY = diag(cov(subset(drug.m, Drug == "Y")))
> nA = dim(subset(drug.m, Drug == "A"))[1]
> nB = dim(subset(drug.m, Drug == "B"))[1]
> nC = dim(subset(drug.m, Drug == "C"))[1]
> nX = dim(subset(drug.m, Drug == "X"))[1]
> nY = dim(subset(drug.m, Drug == "Y"))[1]
> nn = nA + nB + nC + nX + nY
> ng = dim(drug.m)[2]
> varW = ((nA - 1) * varA + (nB - 1) * varB + (nC - 1) * varC +
+     (nX - 1) * varX + (nY - 1) * varY)/(nn - ng)
> std.coef = (drugtrn.lda$scaling) * sqrt(varW)
> print(std.coef, digits = 4)

                 LD1      LD2       LD3      LD4
Age          0.11035  0.12765  0.975044  0.19565
BP          -0.98307 -0.15572 -0.001917  0.19015
Cholesterol  0.28492  0.05612 -0.194237  0.93340
Sodium      -0.07793  0.58054 -0.106296 -0.11935
Potassium    0.17716 -1.03969  0.020822 -0.06542

> drug.m = as.matrix(drugtrn[1:5])
> varW = rep(0, dim(drug.m)[2])
> names(varW) = dimnames(drug.m)[[2]]
> for (VARS in names(varW)) {
+     varW[VARS] = anova(aov(drugtrn[, VARS] ~ drugtrn$Drug))[2, 3]
+ }

A plot of the canonical scores of the patients for the first two CDFs, with appropriate labels for the drug groups, may enhance the interpretation of the CDFs given above. Such a plot is shown in Exhibit 20.5, obtained using the plot() function. The canonical scores in Exhibit 20.5 indicate that the Drug Y group is reasonably well separated from all other groups along the 2nd canonical dimension. Furthermore, the Drug Y group appears to have comparatively higher blood sodium concentration than blood potassium concentration, while the reverse seems to be the case for most patients in the other groups. Note also that none of the other groups are separated by this 2nd canonical dimension. Recall that this behaviour accounts for about 42% or so of the between-group variation or differences among the five drug groups. The first canonical dimension not only separates the drug groups C and X well, but also separates these groups from drug groups A and B. This is the major notable behaviour along the 1st CDF and accounts for about 49% of all inter-group variation. Although groups A and B are not separated by this dimension, they score high and negative on this

Exhibit 20.5 Canonical discriminant scores.


> plot(drugtrn.lda, dimen = 2, cex = 0.8, xlab = "First CDF (48.6%)",
+     ylab = "Second CDF (42%)")


[Scatter plot of the canonical discriminant scores of the 544 training patients, plotted as their drug-group letters (A, B, C, X, Y); axes: First CDF (48.6%) and Second CDF (42%).]

axis, indicating that the patients who responded to these drugs have relatively higher blood pressure than their counterparts who responded to drugs X and C, the latter group having the lower BP levels. Note also that patients who responded to drug Y do not appear to have any notable pattern with respect to blood pressure (with their scores spanning almost the entire 1st CDF dimension). Next consider the aim to determine which of the five drugs each of the 15 new patients in the data file Drugnew.csv would respond to. As we should expect, the data for these additional patients has recordings for all but one of the variables in the training data. The variable names could clash if we attach() the new data without first detaching the training data we attached earlier.
> detach(drugtrn)

The main aim here is to create a classification rule based on the reference data in the file Drugtrain.csv, assess its performance by means of (mis)classification rates and use it to classify the 15 new patients. Since we do not have additional information about the population of patients, it is reasonable to assume that the reference (training) data was a true reflection of the population, thus we may utilise proportional priors when building a linear discriminant function (LDF). However, in drug trials, the common practice is to randomly allocate an



Exhibit 20.6 Cross-validation summary using Linear Discriminant Function.


> library(MASS)
> drugtrn.cvlda = lda(Drug ~ Age + BP + Cholesterol + Sodium +
+     Potassium, data = drugtrn, CV = TRUE)
> table(drugtrn$Drug, drugtrn.cvlda$class)

        A   B   C   X   Y
    A 108   1   0   2   0
    B   1  65   0   3   0
    C   0   0  93  10   0
    X   7   2   1 122   5
    Y   3   7  10   0 104

> tab.m = as.matrix(table(drugtrn$Drug, drugtrn.cvlda$class))
> print(prop.table(tab.m, 1), digits = 2)

         A      B      C      X      Y
  A 0.9730 0.0090 0.0000 0.0180 0.0000
  B 0.0145 0.9420 0.0000 0.0435 0.0000
  C 0.0000 0.0000 0.9029 0.0971 0.0000
  X 0.0511 0.0146 0.0073 0.8905 0.0365
  Y 0.0242 0.0565 0.0806 0.0000 0.8387

> (sum(tab.m) - sum(diag(tab.m)))/sum(tab.m)
[1] 0.09559

equal number of patients to each drug, which implies utilising equal priors when building classification models. This is left as an exercise to explore later! A linear discriminant analysis with proportional priors can be carried out on the reference data using the lda() function with appropriate options for obtaining the cross-validation (leave-one-out) based estimates of correct or mis-classifications. Note the use of table() and prop.table() in Exhibit 20.6. First, we shall examine the cross-validation summary (shown in Exhibit 20.6) associated with the reference data for an assessment of the behaviour of the LDA. It is apparent that, overall, about 9.6% of the patients in the reference data have been misclassified. Further break-down of the misclassifications reveals that only about 2.7% of the patients who responded to drug A have been wrongly classified by the LDA into other groups, in particular, one to drug B and 2 (or 1.8%) into the drug X group. The drug group that has the highest misclassification rate is Y, with only about 83.87% correct classifications; 16.13% of patients responding to this drug have been misclassified into other groups, in particular, 10 (or 8.06%) into the drug C group and 7 and 3 respectively into drug groups B and A. We may also see that 5.8%, 9.71% and 10.95% of patients in groups B, C and X have been misclassified by the LDA, with all those misclassified from the drug C group being classified into group X. We may argue from these interpretations that the LDA classification rule, with about a 90.4% correct classification rate, appears to be a reasonably reliable tool for classifying

Exhibit 20.7 Classification results for new patients.
> drugnew = read.csv("Drugnew.csv", header = T)
> head(drugnew)

  Patient Age  BP Cholesterol Sodium Potassium
1       1  46 124        4.89 0.5262   0.07223
2       2  32 147        5.07 0.5298   0.05609
3       3  39 104        3.85 0.6050   0.04340
4       4  39 129        4.48 0.5175   0.05330
5       5  15 127        7.44 0.6424   0.07071
6       6  73 127        5.58 0.8327   0.04332

> drugnew.pred = predict(drugtrn.lda, drugnew, dimen = 4)
> drugnew.pred$class
 [1] X A X X A Y B Y X X X A C Y C
Levels: A B C X Y
> round((drugnew.pred$posterior), 4)
        A      B      C      X      Y
1  0.0973 0.0548 0.0209 0.8270 0.0000
2  0.9141 0.0801 0.0000 0.0057 0.0001
3  0.0005 0.0007 0.3702 0.5472 0.0814
4  0.3660 0.1540 0.0041 0.4719 0.0040
5  0.6944 0.0044 0.0216 0.2795 0.0001
6  0.0005 0.0335 0.0005 0.0180 0.9475
7  0.2138 0.6948 0.0001 0.0530 0.0383
8  0.0038 0.0130 0.0524 0.4080 0.5228
9  0.0816 0.0050 0.1920 0.6627 0.0588
10 0.0505 0.1539 0.0084 0.7267 0.0605
11 0.0529 0.4147 0.0023 0.5301 0.0000
12 0.4990 0.1363 0.0084 0.1438 0.2125
13 0.0000 0.0000 0.9779 0.0221 0.0000
14 0.0000 0.0000 0.0000 0.0001 0.9999
15 0.0000 0.0000 0.8558 0.0120 0.1321

new patients into one of the five drug groups. The next task is to use the LDA classification rule to determine which of the five drugs each of the 15 patients in the test data would respond to. The findings are summarised using the classification summary and posterior probabilities associated with these 15 patients, as shown in Exhibit 20.7. Some additional R commands are required here (e.g. the read.csv() and predict() functions) and these are also shown in Exhibit 20.7. The classification results for the 15 new patients, based on the fitted LDF, are shown in Exhibit 20.7. It is clear that 6 of the 15 patients have been classified as responding to drug X, 3 patients each to drugs A and Y, 2 to drug C and only 1 patient to drug B. Overall, the posterior probabilities of membership associated with the recommendations made fall between about 0.4719 and 0.9999. We may also notice that both patients who



Exhibit 20.8 Canonical discriminant scores with new patients added.


> newid = as.character(drugnew[, "Patient"])
> plot(drugtrn.lda, dimen = 2, cex = 0.8, xlab = "First CDF (48.6%)",
+     ylab = "Second CDF (42%)")
> text(drugnew.pred$x[, 1:2], labels = newid, cex = 1.1, col = 4,
+     font = 2)

[Scatter plot of the first two canonical discriminant scores for the training patients (plotted as their drug-group letters A, B, C, X, Y) with the 15 new patients overlaid as the numbers 1 to 15; axes: First CDF (48.6%) and Second CDF (42%).]

have been classified into drug group C have been done so with high probability (0.8558 & 0.9779), while the sole patient classified into drug group B has only a marginal (0.6948) certainty of membership. Of the three patients classified as responding to drug A, only one has reasonable reliability (0.9141), one has less than adequate reliability (0.4990) and the third has a probability of 0.6944 of membership. On the other hand, two of the three patients classified into group Y were classified with high probability (0.9999 & 0.9475) and the third with a marginal probability of 0.5228. The probabilities of membership range from 0.4719 to 0.8270 for the six patients classified into drug group X. One of the patients (the 4th one) classified as responding to X, and a few others, namely the 3rd, 8th & 12th, have relatively low reliability (0.47 to 0.55) of membership as classified by the LDA. Finally, we may utilise the graphical representation of the canonical scores of the patients, as seen in Exhibit 20.5 for the reference data, to explore a different way of classifying the 15 new patients and perhaps compare the results with those from the LDA in Exhibit 20.7. The required R code and the suggested graph are in Exhibit 20.8. The 15 new patients are identified by large boldface numbers 1 to 15 (in the order they appear in the given data) in Exhibit 20.8. The symbols A, B etc. denote the five



drug-groups in the reference data. We noted earlier that 3 of these 15 patients, namely the 6th, 8th & 14th, were classified as responding to drug Y by the LDA, with the 8th patient having only a marginal probability (0.5228) of membership. This decision appears to be well supported by the graph, with cases 6 and 14, which both have large posterior probabilities (>0.94), falling well within the scatter of the group Y points, while patient 8 falls closer to the border between this group and the others (in particular, group X). In general, the new patients classified into the various groups with high posterior probabilities seem to fall well within the scatter of their corresponding group, noting that groups A and B overlap substantially with respect to their CDF scores. For example, patient 2 into A, patient 13 into C and patients 6 and 14 into Y, as seen before. Similarly, patients with reasonably good posterior probabilities (with respect to the LDA) appear to fall, at least approximately, in the vicinity of their classified groups - e.g. patient 1 to X, 15 to C and 10 to X. We should also notice that some of the patients who were classified by the LDA with marginal probabilities appear to be classified by the CDF graph more convincingly into the same grouping! This includes patient 7 to Drug B and patient 9 to Drug X. On the other hand, patient 4 and, to a lesser extent, patient 11, who were classified into group X by the LDA with minimal probability, fall away from this group (and into the overlapping groups A and B) in the CDF dimensions.

20.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 20.1: This exercise compares CDA and LDA. Since both LDA and CDA attempt to discriminate between groups, there is naturally a connection between the two methodologies. Using a subset of the Drug data given in the file Drugtrain.csv (see Exhibit 20.1), show that, in the two-group case, the single CDF required to highlight the differences between, say, drug groups X and Y is essentially the same linear function as [LDF(X)-LDF(Y)], the difference between the LDF of group X and the LDF of group Y. In the DRUGS package, this data set is called DrugTrain and can be obtained using
> data(DrugTrain, package = "DRUGS")

Exercise 20.2: Compute the Mahalanobis distances among the pairs of the five drug groups of patients of the Drug data given in the file Drugtrain.csv. Use suitable R codes and interpret these distances. Recall that, the greater the difference between the groups, the greater will be the separation between the groups. The Mahalanobis squared distance between groups $G_i$ and $G_j$ is defined as
$$D^2_{ij} = (\bar{x}_i - \bar{x}_j)^T W^{-1} (\bar{x}_i - \bar{x}_j)$$
where $\bar{x}_i$ and $\bar{x}_j$ are the mean vectors of the two groups and $W$ is the within-group covariance matrix pooled among all groups.

Exercise 20.3: Four measurements were made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150 A.D. We wish to analyze the data to determine if there are any differences in the skull sizes between the time periods and if they show any changes with time. The researchers theorize that a change in skull size over time is evidence of the interbreeding of the Egyptians with immigrant populations over the years. Thirty skulls are measured from each time period and the four skull measurements are Maximal Breadth of Skull (Max), Basibregmatic Height of Skull (BasiHt), Basialveolar Length of Skull (BasiLth) and Nasal Height of Skull (NasalHt). The training sample with 150 skulls is in the file skullstrain.csv, while a new sample of 6 skulls, waiting to be classified into one of the existing time-period groups, is in the file skullsnew.csv. The time-period grouping is denoted by Epoch and takes values 1 to 5 with labels also shown. In the DRUGS package, these data sets are called SkullsTrain and SkullsNew; they can be obtained using
> data(SkullsTrain, package = "DRUGS")
> data(SkullsNew, package = "DRUGS")

(a) The major aim here is to determine the relative importance of the four skull measurements in highlighting the differences among the five time-periods. Perform a canonical discriminant analysis on the 150 reference skulls to determine new variables or linear dimensions that maximise the differences between the five time-periods. Interpret your results using the following information:
- Dimensionality reduction (using proportion of group-variation explained).
- Relationship between the response variables and the canonical discriminant functions (main emphasis being on relative information associated with pooled within-group standardisation) - attempt to give an interpretation to as many canonical functions as possible.
- Suitable graphical representation of the skulls on the new canonical axes.
- Mahalanobis squared distances between the time-period groups.

(b) It is suggested that a linear discriminant analysis be carried out on the skull variables in the reference data to discriminate between the five time-period groups, and to determine which of the five time-periods each of the 6 new skulls would belong to.


Carry out the suggested linear discriminant analysis, utilising proportional priors. Comment on the performance of this classification rule using the leave-one-out cross-validation summary (classification matrix or table as well as overall error summary). Using your LDA approach, determine which of the five time-periods each of the 6 new skulls would belong to. Summarise your findings (use the posterior probabilities associated with these skulls).

(c) Suppose we wish to show the 6 new skulls on an appropriate graph together with the 150 skulls in the reference data. The notion behind this is to explore a different way of classifying the 6 cases and perhaps to compare the results with those of using the LDA. One way to do this is to utilise the graphical representation of the canonical scores obtained above for the reference data. Carry out this task and briefly comment on your display.

Chapter 21 Cluster Analysis: Weather in New Zealand Towns


An original chapter
written by

Siva Ganesh [1]

21.1 Introduction

This chapter considers a data set containing some climate-related information on a sample of 36 New Zealand towns. This NZclimate data (available in the file NZclimate.csv) contains the following information for each town, a subset of which was shown in Exhibit 19.1: Name of Town, Latitude (in °S), Longitude (in °E), Mean January temperature (in °C), Mean July temperature (in °C), Average rainfall (in mm), Average sunshine (in hours), Altitude (height above sea level in metres), Location (Coastal or Inland) and Island (North or South).

A cluster analysis may show the similarities or dissimilarities between the towns in the sample. Someone suggests that the towns considered should fall into two or three natural groups based on the climate-related variables Latitude, Longitude, JanTemp, JulTemp, Rain, Sun and Altitude, ignoring the location variables (i.e. whether the towns are in the North or South Island or whether they are coastal or inland towns).
[1] Ganesh is a former colleague in the Statistics group of the Institute of Fundamental Sciences who has moved to AgResearch Ltd. Please contact the editor for any queries relating to this chapter.


21.2 Cluster Analysis

The multivariate techniques considered so far may be regarded as dimensionality reduction techniques. Principal component analysis is concerned with finding the best low dimensional display for explaining the variation among the original variables, ignoring any grouping of individuals or units in the data. Discriminant analysis, on the other hand, attempts to find the best low dimensional representations that maximally separate the groups of individuals in the data. Hence, it is obvious that some techniques assume the existence of groupings of individuals in the data, while the others ignore such group information. It is worth mentioning here that grouping of individuals is one of the basic aims of all scientific research and is something that the human brain does automatically in most aspects of life!

Note that construction of a graphical representation of the multivariate data is one of the main aims associated with most of the methodology mentioned above. For such a graphical configuration to successfully represent the patterns existing among the entities (or units or individuals), similar entities should be represented by points (on the graph) that are closer together, and the more dissimilar the entities are, the more distant should be the points representing them. There is thus a direct connection between the dissimilarity of two entities and the distance between them in the geometrical representation. The above notion leads into the basic aims of two popular multivariate techniques:

1. Multi-dimensional scaling (MDS), and
2. Cluster analysis (CLS).

In simple terms, given the proximities (i.e. similarities or dissimilarities) between the individuals in the raw data (in, say, p dimensions), MDS tries to find a projection of these individuals (from a p-dimensional space) onto a smaller (say, 2 or 3) dimensional space, preserving the inter-object proximities as much as possible. The basic aim of cluster analysis, on the other hand, is to devise a scheme that finds the natural groupings, if any, of individuals in a data set. In other words, the aim is to classify a group of individuals into a set of mutually exclusive, exhaustive (sub)groups such that individuals within a group are similar to each other while individuals in different groups are dissimilar.

Remark: The two main types of matrices encountered in multivariate analysis are correlation (or covariance) matrices between variables, and similarity (or dissimilarity) matrices between units or individuals. The former can be said to contain information on associations between variables, and the latter to contain information on proximities between individuals.

Proximity Measures


In cluster analysis (and in MDS), knowing how to measure the dissimilarity (or similarity) between individuals is evidently of fundamental interest. Since these measures are closely linked to the idea of distance between a pair of individuals, one natural way of measuring them is by the use of a familiar metric such as Euclidean distance. However, there are many other ways of defining a similarity or dissimilarity between individuals. The choice of measure is usually closely tied to the nature and purpose of the study, but mistakes may be avoided by paying careful attention to the type of data that have been collected, and to the nature of the entities between which the proximity measure is to be computed.

Before setting out some of the measures of proximity, it needs to be emphasised that the focus of interest is usually on the dissimilarity between two entities. Frequently, however, a similarity measure is quoted or computed by a software package. If s is the similarity between two entities (usually in the range 0 to 1), then the dissimilarity d is the direct opposite of s and hence may be obtained by using a monotonically decreasing transformation of s. The most common such transformation is d = 1 - s.

Scaling: There are some aspects of the given data that often cause problems when computing dissimilarity measures. One practical question is whether each variable is equally important. In other words, should the variables be scaled in any way? Clearly, if some variables exhibit much larger values or a much greater variation than others, then if left unscaled they may dominate the dissimilarity computation. However, this may be valid in the context of some studies, i.e. variables may not be equally important. (This argument is similar to that associated with performing a principal component analysis - PCA is not scale invariant.)
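As a small illustration of the scaling issue (the two-variable data frame below is made up for this purpose and is not part of the chapter's data), a variable recorded on a large scale can dominate the dissimilarities unless the data are standardised first:

X <- data.frame(rain = c(1200, 1700, 600),   # millimetres: large values, large spread
                temp = c(18.5, 17.9, 15.2))  # degrees Celsius: small values
dist(X)         # distances driven almost entirely by rain
dist(scale(X))  # after standardising, both variables contribute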

Computed dissimilarity measures: In many cases the measures of dissimilarity (or similarity) are not observed directly, but are obtained from a given data matrix. Given observations on p variables, say, $X_1, X_2, \ldots, X_p$ measured on each of n individuals, there are many ways of constructing an $n \times n$ matrix of dissimilarities between pairs of individuals, and the choice depends on the type of variables that are utilised. The most popular measure of dissimilarity suitable for quantitative variables is the well-known Euclidean distance. Assuming that the numerical value $x_{ik}$ is observed for the $k$th variable on the $i$th individual in the data, the Euclidean distance, say, $d_{ij}$ between the $i$th and $j$th individuals is given by
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

A generalised version is known as the Minkowski metric, given by
$$d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$$

The case $\lambda = 1$ is known as the City block metric (or Manhattan distance), while the case $\lambda = 2$ gives the Euclidean metric. The consequence of increasing $\lambda$ is increasingly to exaggerate the more dissimilar units relative to the similar ones.

Euclidean distance is often less than satisfactory, particularly if the variables are measured in different units and have differing variances, and also if the variables are highly correlated. Standardisation (or scaling) of data overcomes the first two problems, and usually the Euclidean distances are calculated using the standardised variables. However, this measure does not take into account the correlation between the variables. A measure that does take the correlations into account is the Mahalanobis-type distance given by
$$d_{ij} = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$$
where $x_i$ and $x_j$ are the $p \times 1$ observation vectors (of p variables) for individuals i and j, and $S$ is the $p \times p$ sample covariance matrix. This distance measure can be compared with the Euclidean distance written in the same form,
$$d_{ij} = \sqrt{(x_i - x_j)^T (x_i - x_j)}.$$
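As a brief sketch of how these measures can be computed in R (the built-in USArrests data set is used only as a convenient stand-in for a numeric data matrix):

X <- scale(USArrests)                           # standardise so no variable dominates
d.euc <- dist(X, method = "euclidean")          # Euclidean distances
d.man <- dist(X, method = "manhattan")          # lambda = 1: city-block metric
d.mink <- dist(X, method = "minkowski", p = 3)  # general Minkowski metric with lambda = 3
# A Mahalanobis-type distance between individuals i and j:
S <- cov(X)
d.maha <- function(i, j) sqrt(t(X[i, ] - X[j, ]) %*% solve(S) %*% (X[i, ] - X[j, ]))
d.maha(1, 2)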

There are a few choices of distance measures available for non-quantitative variables, for example, matching coefficients for binary or qualitative (e.g. nominal) variables. These are left as an exercise for exploration. We shall use Euclidean distances in this chapter, hence dealing with quantitative variables only. Qualitative variables can be converted to a set of binary variables (e.g. one binary variable representing each level of the qualitative variable), and Euclidean distance is fairly robust for a mixture of quantitative and binary variables. Euclidean distances are the starting point for many clustering techniques.

Clustering Methods: Many researchers use the phrase cluster analysis synonymously with classification and with taxonomy, the latter being the biologists' term for classification. However, as we know, the term classification also means the assignment or allocation of individuals to


a pre-existing group in discriminant analysis. However, in cluster analysis, the term classification denotes the approach where the groups are determined from the data, and does not refer to the assignment problem since there are no pre-existing groups.

There are many clustering methods available, and many are more concerned with finding a hierarchical structure of individuals than finding a single partition of the given data. Thus, in this chapter, we shall consider the hierarchical approach to cluster analysis. Hierarchical clustering methods start with the computation of distances (or similarity measures) of each individual to all other individuals in the data. Clusters are then formed by a process of agglomeration or division. With the agglomerative approach, all individuals start by being in clusters of their own (i.e. groups of one). Closer clusters (with small distance or high similarity) are then gradually joined until finally all individuals are in a single group. With the divisive approach, all individuals start in a single group, which is then split into two clusters. The two groups are then split, and so on until all individuals are in clusters of their own. But divisive hierarchic methods are unpopular, so we shall concentrate only on the agglomerative approach. This hierarchical structure is called a hierarchical tree or dendrogram, represented by a diagram (examples are given later).

Determining the number of clusters: Given the cluster solution, e.g. once the dendrogram has been created, the next step is to evaluate the solution and determine the optimum number of clusters. One obvious approach is to view the dendrogram and find a partition by drawing a horizontal line through the (vertical) dendrogram at an appropriate level. Suppose we want a partition of g clusters from n individuals; then we move the horizontal line down (or up) the tree until it cuts g vertical lines, and we simply group the individuals that are connected to the same vertical line. By varying g, we may note that the hierarchic structure implicitly implies a partition for every possible number between 1 and n. Exhibit 21.1 shows a typical example of the use of a horizontal line to find 2 (or 3) clusters. Alternatively, we may simply look for a big change in the criterion used (say, the minimum distance between two clusters in the single linkage method) to determine the number of clusters to segment the given data. Details are shown below; a small code sketch of cutting a dendrogram follows the discussion of Exhibit 21.2.

So far, we have not mentioned how the individuals are joined to form clusters (within the agglomerative approach). There are many hierarchic methods available to do this, though there is no generally accepted best method. Unfortunately, different methods do not necessarily produce the same results on a given set of data. In some cases, difficulties arise because of the shape of the underlying clusters. Consider the case of bivariate data where individuals are plotted according to their values of the two variables measured. Some possible patterns are shown in Exhibit 21.2. The two clusters in case (a) are likely

Exhibit 21.1 Dendrogram with 2 or 3 clusters apparent.


Exhibit 21.2 Some possible patterns of points with two clusters.

to be found by most clustering methods. Note that the two groups of points are not only well-separated, but are also compact within the groups, i.e. every point is closer to every point within the cluster than to any point not in it. Thus, as the agglomeration occurs, the next point will always be within the same cluster. The clusters in case (b) would also be identified by most methods, as they are also well separated, even though an end point might be closer to the nearest point in the other cluster than the furthest in the same cluster. These patterns are usually called elongated clusters (and are parallel here). As can be seen, the groups in case (c) are not as well-separated as those in cases (a) and (b), thus some methods may fail to detect two clusters. Most clustering methods would have trouble handling cases like (d), (e) and (f).
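The dendrogram-cutting idea described above (the sketch promised earlier) can be illustrated as follows; USArrests is again just a convenient stand-in data set, and the cut height is chosen by eye:

X <- scale(USArrests)                     # any standardised numeric data matrix
hc <- hclust(dist(X), method = "single")  # single-linkage hierarchical tree
plot(hc)                                  # draw the dendrogram
abline(h = 2.5, col = "red")              # a horizontal cut at an arbitrary height
groups <- cutree(hc, k = 3)               # or ask directly for g = 3 clusters
table(groups)                             # resulting cluster sizes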

Hierarchical clustering procedures:


The various hierarchical clustering methods differ in the definition of the distance (or similarity or dissimilarity) between clusters of individuals. These distances are used to merge two close clusters (in the agglomerative approach). Suppose we define $D_{ij}$ as the distance between the $i$th and the $j$th clusters. Also assume that there are $n_i$ individuals in the $i$th cluster. Recall that, in an agglomerative approach, the initial clusters are the individuals themselves, thus $D_{ij}$ is simply the Euclidean distance (say) between a pair of individuals.

Single linkage clustering: Perhaps the simplest method of finding a hierarchical tree (or dendrogram) is the single linkage method. This is also known as the nearest neighbour method. Here, $D_{ij}$ is defined as the least (or smallest) of all $n_i n_j$ distances between elements of the $i$th and of the $j$th groups. In other words, the distance between two groups of individuals is effectively defined to be the dissimilarity between their closest members. Hence, the single linkage algorithm tends to absorb new individuals into the existing clusters rather than forming new clusters. This behaviour usually results in elongated clusters, and the effect is known as chaining.

Remark: Ties are a common problem in cluster analysis. Ties occur when there is more than one minimum distance (between clusters). In this situation a choice must be made as to which of the two (or more) closest pairs (of clusters) should be combined. When the single linkage method is used, an arbitrary choice is made as to which of the tied clusters should be combined, and the resulting dendrogram is not affected by which pair is selected.

Complete linkage clustering (Furthest-neighbour method): Here, $D_{ij}$ is defined as the greatest of all the $n_i n_j$ distances between the elements of the $i$th and of the $j$th clusters. In other words, the distance between two groups is defined as the dissimilarity between their most remote pair of individuals. In a sense this is the exact opposite of the single-link definition. The complete linkage algorithm tends to find compact clusters with high internal affinity. In other words, the clusters are built in parallel because as a cluster grows the most distant element of the cluster becomes even further from any individual not yet in the cluster. This implies that new elements form their own clusters.
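The between-cluster distances used by the single-link and complete-link rules (and by the average-link rule described below) can be illustrated with a toy example; the two small clusters of random points here are made up purely for illustration:

set.seed(1)
A <- matrix(rnorm(10, mean = 0), ncol = 2)       # cluster A: 5 points in 2 dimensions
B <- matrix(rnorm(8, mean = 3), ncol = 2)        # cluster B: 4 points in 2 dimensions
d.AB <- as.matrix(dist(rbind(A, B)))[1:5, 6:9]   # all 5 x 4 between-cluster distances
min(d.AB)    # single-link (nearest-neighbour) distance
max(d.AB)    # complete-link (furthest-neighbour) distance
mean(d.AB)   # average-link distance (see below)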

Some other hierarchical clustering procedures:


The average linkage method starts out the same as the single linkage or complete linkage methods, but the distance between two groups is defined to be the average of the dissimilarities (or the distances) between all pairs of individuals. In other words, $D_{ij}$ is the average of the $n_i n_j$ distances between the elements of the $i$th and of the $j$th clusters. So it does not depend on extreme values, as do the single or complete linkage methods, but partitioning is based on all members of the clusters rather than on a single pair of extreme individuals.

Some clustering methods implicitly regard the individuals in the given data as points in space, and rely on various geometrical criteria for classification. One such method is centroid clustering. Here, the distance between two groups is defined as the distance between the group centroids (e.g. group mean vectors). In other words, $D_{ij}$ is the distance between the centroid of the $i$th cluster and that of the $j$th cluster. When the medians are used instead of the means to define the centroids, the corresponding method is called median clustering.

Another method that uses the ideas of geometry is minimum variance clustering, also known as Ward's minimum variance method. This method is based on the within-group sums of squares rather than the distances between clusters. At each stage of an agglomerative scheme, the number of groups is reduced by one, by combining the two groups which give the smallest possible increase in the total within-group sum of squares. Of course, at the start when there are n clusters of just one individual each, the total within-group sum of squares is zero.

A comparison of hierarchical clustering procedures: Reconsider the various cases in Exhibit 21.2. As noted earlier, the two clusters in cases (a) & (b) are likely to be found by most, if not all, of the clustering methods described above, as they are well separated. The groups in case (c), on the other hand, are not as well-separated, thus some methods may fail to detect two clusters. The single-linkage method would join points across the bridge before forming the main clusters, while the complete-linkage method would probably work better. The single-linkage method would work satisfactorily in finding the clusters in (d), and it would be the best choice for case (f). Most clustering methods would fail with cases like (e).

When attempting to evaluate clustering methods, it is important to realise that most methods are biased toward finding clusters possessing certain characteristics such as clusters of equal size (number of members or individuals), shape, or dispersion. Methods based on least-squares, such as Ward's minimum-variance method, tend to combine clusters with a small number of individuals, and are biased toward the production of clusters with


roughly the same number of individuals. The average-link method is somewhat biased towards combining clusters with small within-cluster variation and tends to find clusters of approximately equal variation. In general, different results produced by different clustering methods may be regarded as looking at the data from different angles!

Interpretation of New Clusters: Once the new natural clusters (of individuals) are formed, we may wish to examine each cluster to name or assign a label accurately describing the nature of it, and explain how the clusters may differ on relevant dimensions. One attribute often used when assigning a label is the cluster's centroid. If the clustering procedure was based on the raw data, this is a natural choice. However, if the standardised data were used, then we need to go back to the raw scale of the original variables and compute the average profile of the clusters. As noted earlier, differences between clusters may be examined using techniques such as (canonical) discriminant analysis, in particular, to assess the influence of the original variables in maximally separating the newly formed clusters.

Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is a general term for a class of techniques which permit the relationships among a set of elements (or individuals or objects) to be represented as inter-element distances (or proximities) in low dimensional spaces. A classic example of MDS is the construction of maps of countries or regions based on the distances between cities/towns/significant points. Other common applications are respondents' evaluations of objects including perceptions of political candidates or issues, assessment of cultural differences between distinct groups and comparison of physical qualities (e.g. food tastes). In many of these situations, the data consists of an n x n matrix of dissimilarities (or similarities or proximities), and not of an n x p raw data matrix in which there are data for p variables measured on each of n objects or individuals.

In simple terms, MDS is concerned with the problem of constructing a configuration of n points using information about the proximities (or similarities or dissimilarities) between the n objects, preferably in a small number of dimensions. When given a set of raw data, it is easy to compute the proximity (e.g. Euclidean distance) between each pair of individuals. MDS works the other way around - given the proximities between the individuals, MDS tries to find the co-ordinates of the individuals (in a small dimensional space). Even if we begin with the raw data, we may regard MDS as a projection of the n individuals from a p-dimensional space onto a smaller (say, 2 or 3) dimensional space, preserving the inter-object proximities as much as possible.

Assuming that the data matrix is of size $n \times p$, i.e. p variables measured on each of n objects, the main aim of MDS is to reproduce the proximities (or similarities or dissimilarities) between the given n objects in a much smaller number of dimensions, i.e. if $d_{ij}$ is the proximity between the $i$th and $j$th objects in the p-dimensional space, then the aim is to obtain a new proximity $d^*_{ij}$ in $r (< p)$ dimensions, as close to the original $d_{ij}$ as possible. We need $(n-1)$ dimensions in order to reproduce the proximities between n points exactly. However, we would prefer to have r = 2 or 3 so that the proximities can be graphed to show patterns of objects. In particular, if the objects fall into groups or clusters, then this would be evident from a visual inspection of the graph - a lean into Cluster Analysis! Note that, in cluster analysis, the main aim is to form groups or clusters of objects such that individuals within a group are similar to each other while individuals in different groups are as dissimilar as possible.

A variety of MDS models can be used, involving different ways of computing dissimilarities and various functions relating the dissimilarities to the actual data. The two major approaches are metric scaling and ordinal scaling. In the case of metric scaling, also referred to as classical scaling, Euclidean distances are used. Metric scaling can be essentially an algebraic (non-iterative) method of reconstructing the point co-ordinates assuming that the dissimilarities are Euclidean distances (e.g. principal co-ordinate analysis). Alternatively, an iterative method can be used. When the distance measure is not Euclidean, ordinal scaling (also known as non-metric scaling) may be used. In this case, the observed dissimilarity (numerical) values are of little interest and the rank order of the dissimilarities is thought to be the only relevant information. In such cases, methods based on ordinal properties of dissimilarities need to be utilised. These are iterative methods that become metric scaling methods when applied to Euclidean distances.

Only the simplest of the classical (metric) scaling approaches, the principal co-ordinate analysis, will be considered in this course, and it uses the relationship between MDS and principal component analysis. In PCA, an eigen-analysis of the $X^T X$ matrix (equivalent to the $p \times p$ variance-covariance matrix) is performed, while in MDS, the matrix $X X^T$ is eigen-analysed! $X^T X$ and $X X^T$ have the same eigenvalues, but different sets of eigenvectors (of sizes $p \times 1$ and $n \times 1$ respectively). It should be mentioned that, just as we perform a standardised PCA using the correlation matrix, we may perform a standardised MDS using the standardised Euclidean distances (where each variable is transformed to have unit variance). As in PCA, we may plot the data (co-ordinates) on the first few (preferably 2 or 3) principal co-ordinate axes corresponding to the largest eigenvalues.
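The PCA/MDS duality described above can be sketched briefly (once more using a built-in data set purely for illustration): classical scaling of Euclidean distances recovers the principal component scores up to reflection.

X <- scale(USArrests)              # standardised n x p data matrix
mds <- cmdscale(dist(X), k = 2)    # principal co-ordinates from the n x n distance matrix
pca <- prcomp(X)$x[, 1:2]          # PC scores from the p x p eigen-analysis
cor(mds[, 1], pca[, 1])            # numerically +1 or -1: the same configuration up to sign
plot(mds, xlab = "1st principal co-ordinate", ylab = "2nd principal co-ordinate")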


21.3 Analysis Using R

The climate-related data for some NZ towns are read (using the function read.csv()) from the file NZclimate.csv. Note that row.names=1 tells R that the first value in each row is to be treated as a name for the row, not simply as a variable in the data set. This is necessary for subsequent functions to pick up the names of the towns.
> climate = read.csv("NZclimate.csv", row.names = 1)
> head(climate)

           Latitude Longitude JanTemp JlyTemp Rain  Sun Altitude Location Island
Kaitaia        35.1     173.3    19.3    11.7 1418 2113       80  Coastal  North
Kerikeri       35.2     174.0    18.9    10.8 1682 2004       73  Coastal  North
Dargaville     36.0     173.8    18.6    10.7 1248 1956       20  Coastal  North
Whangarei      35.7     174.3    19.7    11.0 1600 1925       29  Coastal  North
Auckland       36.9     174.8    19.4    11.0 1185 2102       49  Coastal  North
Tauranga       37.7     176.2    18.5     9.3 1349 2277        4  Coastal  North

Since the climate-related variables were measured in different units, we shall perform a set of cluster analyses on the standardised data (i.e. the standardised variables Latitude, Longitude, JanTemp, JulTemp, Rain, Sun and Altitude), using the single-linkage, complete-linkage, average-linkage and Ward-minimum-variance methods. Considering the dendrograms only, we shall compare the four clustering methods to describe the differences in the dendrograms obtained, and to suggest which of these methods show the expected two or three natural groupings of towns in the sample. We may also see whether any of these methods re-discover the Island and/or Coastal or Inland groupings, at least approximately, and check whether any peculiarities are highlighted.

The various cluster analyses were carried out on the standardised data and the dendrograms associated with the single-linkage, complete-linkage, average-linkage and Ward's minimum variance clustering techniques are shown in Exhibit 21.3, which also shows the R commands used. Note that we use the dist() command to find distances between the standardised observations, which were found using the scale() command. The hclust() command creates the numeric details needed to determine the clustering, while the plclust() command is how we plot the dendrograms.

It is fairly clear from the dendrograms in Exhibit 21.3 that the single-linkage method suffers from chaining. This method does not suggest the expected two or three natural

Exhibit 21.3 Dendrograms.

> climate.m = as.matrix(climate[1:7])
> clmscaled.m = scale(climate.m, center = T, scale = T)
> clm.dist = dist(clmscaled.m, method = "euclidean")

> clm.hclust = hclust(clm.dist, method = "single")
> plclust(clm.hclust, labels = row.names(climate), xlab = "Town Names",
+     main = "Single-link")

> clm.hclust = hclust(clm.dist, method = "complete")
> plclust(clm.hclust, labels = row.names(climate), xlab = "Town Names",
+     main = "Complete-link")

> clm.hclust = hclust(clm.dist, method = "average")
> plclust(clm.hclust, labels = row.names(climate), xlab = "Town Names",
+     main = "Average-link")

> clm.hclust = hclust(clm.dist, method = "ward")
> plclust(clm.hclust, labels = row.names(climate), xlab = "Town Names",
+     main = "Ward")

[Figure for Exhibit 21.3: four dendrogram panels (Single-link, Complete-link, Average-link, Ward), each plotting Height against the Town Names.]

groupings of NZ towns. Although it could not produce reasonably distinct clusters, it suggests that two or three towns, namely MtCook, Lake Tekapo and perhaps Ohakune, stand out as outliers as they appear to be most dissimilar to the others in the data. The complete-linkage and Ward-minimum-variance methods identically produce two distinct clusters of sizes 21 and 15. In fact, the latter cluster contains a small but distinct subgroup consisting of the three outlier towns identified by single-linkage, namely MtCook, Lake Tekapo and Ohakune, and Hanmer. Considering the proximity measures, we may suggest that this small subgroup fits better within its parent cluster under Ward's method than under the complete-link method. The average-linkage method produces almost identical results to those of the complete-link and Ward-minimum-variance methods, except for MtCook. In other words, the small subgroup referred to above contains only Lake Tekapo, Ohakune and Hanmer; MtCook then becomes an obvious outlier.

Complete-linkage, average-linkage and Ward's methods appear to derive the suggested number of (2 or 3) natural groupings of NZ towns, although MtCook behaves like a unique town with respect to these climate-related variables. The smallest grouping identifies the towns with extreme climate (i.e. MtCook, Lake Tekapo, Ohakune and Hanmer)! When considering the two large clusters from these methods, the largest cluster predominantly consists of North Island towns while the smaller one mainly represents the South Island towns. The exceptions are the South Island towns Nelson, Blenheim and Kaikoura, which fall into the cluster with North Island towns, while the North Island town Ohakune behaves like a South Island town. Note that the towns have been ordered, at least approximately, according to their geographic positioning. None of the clustering methods appears to identify the coastal or inland nature of towns!

To illustrate further, consider the three main clusters suggested by the complete-link method. We shall compare the three clusters of towns, using the means of the seven variables considered and suitable graphics (e.g. boxplots). We may suggest suitable names or labels for each cluster. As indicated earlier, we need to use the raw data for computations here. R codes and the (edited) results are shown in Exhibit 21.4.

Both the means and the boxplots presented in Exhibit 21.4 clearly indicate that the smallest cluster (i.e. cluster 2) represents the four NZ towns with much higher altitude than the other towns in the sample. Clusters 2 and 3 behave very similarly to each other with respect to latitude, longitude, mean January temperature and sunshine hours. Mean July temperature appears to be marginally lower for Cluster 2 than Cluster 3, but the reverse is true for rainfall. On the other hand, Cluster 1 appears to represent, on average, the towns that have lower latitude but higher longitude than those in the other clusters. Furthermore, this cluster consists of towns with higher mean January and July temperatures, higher sunshine hours, but lower rainfall than the others.


Exhibit 21.4 Complete-link clusters: frequencies, means and boxplots.


> clm.hclust = hclust(clm.dist, method = "complete")
> ClusterID = as.factor(cutree(clm.hclust, k = 3))
> clmcls = as.matrix(cbind(climate.m, ClusterID))
> table(clmcls[, "ClusterID"])

 1  2  3 
21  4 11 

> clsmeans <- aggregate(clmcls, list(ClusterID), mean)
> round(clsmeans[, -1], digits = 4)
  Latitude Longitude JanTemp JlyTemp Rain  Sun Altitude ClusterID
1    38.82     175.1   18.00   8.943 1173 2079    83.33         1
2    42.40     172.2   14.78   3.150 1794 1860   611.25         2
3    44.31     170.2   15.62   5.846 1447 1835    58.18         3

> windows(width = 18, height = 14.5)
> par(mfrow = c(2, 4))
> par(pch = 1, cex = 1, cex.lab = 1, cex.axis = 1, font.axis = 1,
+     font.lab = 1)
> clmcls2 = clmcls[, 1:7]
> namesclmcls2 = dimnames(clmcls2)[[2]]
> for (VARS in namesclmcls2) {
+     boxplot(clmcls2[, VARS] ~ ClusterID, data = clmcls, main = VARS)
+ }
[Figure for Exhibit 21.4: boxplots of Latitude, Longitude, JanTemp, JlyTemp, Rain, Sun and Altitude by ClusterID.]

Hence, Cluster 1, in general, may be regarded as warm, dry, sunny, low-altitude towns, while Cluster 2, on average, represents cold, wet, not so sunny, high-altitude towns, and Cluster 3 consists of low-altitude towns with average climate.

Next, we shall consider the two main clusters (i.e. from a 2-cluster solution) suggested by the complete-link (or Ward-minimum-variance) method and carry out a canonical discriminant analysis on the seven climate-related variables considered to highlight the differences between these two clusters. As there are only two groups (clusters), the maximum number of canonical discriminant functions (CDFs) required is just one. The corresponding within-group standardised canonical coefficients and the graph of canonical scores are shown in Exhibit 21.5 together with the R codes required to obtain these results. Since there is only one CDF, we may just create boxplots of the canonical scores of towns for each cluster, or create a scatter plot of towns with the second axis representing just random jitter. The advantage of the latter graph is that we may show the labels of the towns for additional interpretation. Note that the lda() function is part of the MASS package. To use this function we must gain access to the package using the library() command.

It is clear from the plot of canonical scores in Exhibit 21.5 that the two clusters are fairly well separated by the canonical dimension. The (standardised) canonical coefficients reveal that this canonical dimension, although dominated by Longitude, may be regarded as contrasting Latitude and perhaps Altitude against Longitude and, to a lesser extent, Sunshine hours and January temperature. Giving equal importance to the variables considered, we may regard Cluster 2 (in blue italics) as the group of towns that have high latitude and altitude but low longitude and are not so sunny with a cooler January, while Cluster 1 (in black) towns are sunny, in high longitude, with a warm January but low altitude and latitude! Also note that all of the North Island towns plus some South Island ones (Kaikoura, Blenheim and Nelson) form the 1st cluster, while the remaining South Island towns form cluster 2.

Finally, we may attempt to examine the similarities between the NZ towns with respect to their climate-related variables using low-dimensional scatter plots from a multidimensional scaling (MDS) analysis. Again, the standardised data have been utilised. A 2-dimensional MDS graph of towns is shown in Exhibit 21.6 together with the R codes required to obtain the plot. The cluster numbers from the complete-link 3-cluster solution are used to highlight the graph. The MDS graph of towns in Exhibit 21.6 indicates clearly that the 1st cluster (in black) is fairly well distinguished from the others along the 2nd MDS dimension. Cluster 2 towns (in red boldface) fall within cluster 3 (in blue italics), but MtCook obviously appears to be an outlier! This graph certainly highlights the similarity between towns within the clusters, in particular, within the 1st cluster.


Exhibit 21.5 Canonical discriminant analysis of clusters.


> ClusterID = as.factor(cutree(clm.hclust, k = 2))
> clmcls2 = as.data.frame(cbind(climate.m, ClusterID))
> library(MASS)
> clmcls.lda = lda(ClusterID ~ ., data = clmcls2)
> clmcls2.m <- as.matrix(clmcls2[1:7])
> varW = NULL
> namesvarW = dimnames(clmcls2.m)[[2]]
> for (VARS in namesvarW) {
+     varW[VARS] = anova(aov(clmcls2[, VARS] ~ clmcls2$ClusterID))[2, 3]
+ }
> std.coef <- (clmcls.lda$scaling) * sqrt(varW)
> round(std.coef, digits = 4)

              LD1
Latitude   0.3868
Longitude -0.6348
JanTemp   -0.2421
JlyTemp    0.0256
Rain       0.0058
Sun       -0.2774
Altitude   0.3088

> clmclslda <- predict(clmcls.lda, clmcls2, dimen = 1)
> dummy <- rnorm(nrow(clmcls), mean = 0, sd = 1)
> plot(clmclslda$x, dummy, xlab = "Canonical Discriminant Function",
+     ylab = "Jittered", cex = 0.2, xlim = c(-4, 5),
+     main = "Canonical Scores of Towns")
> text(clmclslda$x, dummy, labels = row.names(climate.m),
+     col = c("black", "blue")[as.numeric(ClusterID)],
+     font = c(1, 3)[as.numeric(ClusterID)], cex = 1.1, cex.lab = 1.1,
+     cex.axis = 1.1)

[Figure for Exhibit 21.5: "Canonical Scores of Towns" - town labels plotted against the Canonical Discriminant Function, with random jitter on the vertical axis.]
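The boxplot alternative mentioned in the text, rather than the jittered scatter plot, could be produced from the same objects; this is only a sketch reusing clmclslda and ClusterID from Exhibit 21.5:

boxplot(clmclslda$x[, 1] ~ ClusterID, xlab = "Cluster",
        ylab = "Canonical discriminant score",
        main = "Canonical Scores by Cluster")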

Exhibit 21.6 A 2-dimensional MDS plot.


> clm.cmd <- cmdscale(clm.dist, k = 2)
> ClusterID <- as.factor(cutree(clm.hclust, k = 3))
> plot(clm.cmd[, 1], clm.cmd[, 2], xlab = "1st MDS", ylab = "2nd MDS",
+     cex = 0.2, xlim = c(-3.5, 5),
+     main = "Multi-dimensional Scaling of Towns")
> text(clm.cmd[, 1], clm.cmd[, 2], labels = row.names(climate.m),
+     col = c("black", "red", "blue")[as.numeric(ClusterID)],
+     font = c(1, 2, 3)[as.numeric(ClusterID)], cex = 1.1, cex.lab = 1.1,
+     cex.axis = 1.1)

[Figure for Exhibit 21.6: "Multi-dimensional Scaling of Towns" - towns plotted on the 1st MDS and 2nd MDS axes, labelled by cluster.]

21.4 Exercises

Note: The data sets in the exercises are available via the DRUGS package.

Exercise 21.1: The data file Cigarettes.csv contains measurements of weight (in g) and tar, nicotine and carbon monoxide (CO) content (in mg) for 25 brands of cigarettes. In the DRUGS package, this data set is called Cigarettes and can be obtained using
> data(Cigarettes, package = "DRUGS")

(a) A cluster analysis may show the similarities or dissimilarities of the various cigarette brands. Someone suggests that the cigarette brands considered should fall into two natural groups. It is also suggested that analyses should be based on the standardised data. Perform a set of cluster analyses using the single-linkage, complete-linkage, average-linkage and Ward's minimum-variance methods. Considering the dendrograms


only, compare the four clustering methods. Explain any differences in the dendrograms obtained. Do you notice any peculiarities?

Consider the two main clusters suggested by Ward's minimum variance method. Calculate simple statistics (i.e. mean, standard deviation etc.) of weight and the tar, nicotine and carbon monoxide contents for each of the two clusters. Suggest suitable names/labels for each cluster. (You need to use the raw data here!)

(b) Suppose we wish to examine the dis/similarities between the cigarette brands using simple low-dimensional scatter graphs. We may do this in two ways:
- Perform a principal component analysis on the four variables (weight and the tar, nicotine and carbon monoxide contents), and graph the component scores. Briefly interpret your findings, relating your comments to the results of Ward's clustering method discussed in (a) above. (Present only the eigen-analysis information and the plot of PC-scores.)
- Perform a multi-dimensional scaling (MDS) analysis on the four variables (weight and the tar, nicotine and carbon monoxide contents), and show a two-dimensional mapping of the various cigarette brands. (Use a classical scaling approach on Euclidean distances.) Briefly interpret your findings, again relating your comments to the results of Ward's clustering method, in particular, the dendrogram obtained in (a). (Use suitable labels, e.g. brand names and/or cluster groups, in your graphs.)

Exercise 21.2: The data file EUprotein.csv shows protein consumption, classified into nine food groups, in 25 European countries (in the late 1970s). The food groups are red meat (rm), white meat (wm), eggs (egg), milk (mlk), fish (fsh), cereals (crl), starchy foods (sch), pulses, nuts & oils (pno) and fruits & vegetables (fv). The protein consumption is given in grams per person. Three natural groups existed among these countries when the data were collected. These were the European Economic Community (EEC); the other western European countries (WEC); and the eastern European communist countries (CCE). In the DRUGS package, this data set is called EUProtein and can be obtained using
> data(EUProtein, package = "DRUGS")

A cluster analysis may show the similarities or dissimilarities between the countries in the sample. It is suggested that the countries considered should fall into two or three natural groupings.

(a) It is also suggested that cluster analyses should be based on the standardised data. Do you agree? Explain your answer using summary statistics of the protein variables and/or other information you consider appropriate.


(b) Perform a set of cluster analyses on both the raw and the standardised data (protein variables), using the complete-linkage and Ward-minimum-variance methods. Considering the (four) dendrograms only, compare the four clustering approaches: Are there any differences in the dendrograms obtained? Do you notice any peculiarities? Which approaches suggest the expected two or three natural groupings of the countries considered here?

(c) Consider the two main clusters suggested by the Ward-minimum-variance method applied to the raw data. Using means, standard deviations etc. and suitable graphs associated with the protein variables, compare and contrast the two clusters of countries. Furthermore, carry out a canonical discriminant analysis on the protein variables to separate the two main clusters, and briefly interpret your findings (use the graph of canonical scores with suitable labels and the standardised canonical coefficients).

(d) To examine the dis/similarities between the countries further, use a suitable low-dimensional scatter graph. A multi-dimensional scaling (MDS) analysis on the protein variables may be of some use here. Show the resulting two-dimensional mapping of the various countries. Briefly interpret your findings. Use suitable labels in the graph for interesting exploration.


Index
adjusted marginal means, 59, 62–63
adjusted sums of squares, 58, 80
adjusted treatment means, 56, 80, 85
agglomeration, 272
anova function, 83, 84, 149, 152, 178, 179, 204
aov function, 11, 12, 23, 34, 101, 123, 139, 203
apply function, 239
arcsin transformation, 61
as.factor function, 15
as.matrix function, 233
assign function, 139
attach function, 11, 23, 43, 261
balance, 6
balanced incomplete block design, 54
baseline, 152
baseline hazard, 173
Bernoulli distribution, 144
c function, 2, 122
canonical coefficients, 256
canonical discriminant analysis, 233, 253, 257, 266, 276, 282, 286
canonical discriminant functions, 252–257, 259, 266
cell means, 46, 58, 59
class function, 101
cluster analysis, 268
Cochran-Mantel-Haenszel (CMH) test, 163
colMeans function, 62
completely randomised design, 6, 7
component scores, 254
concurrence matrix, 54
confint function, 147
confounding, 6
confounding variables, 163
continuity correction, 161
convergence criterion, 187
cor function, 96, 234
correlation matrix, 230, 277
cost of misclassification, 247
count, 96
covariance matrix, 228, 230, 254
cox.zph function, 178
Cross-validation approach, 249
cumulative hazard, 171
data function, 2, 96
data.frame class, 2, 11, 17, 23, 38, 43, 48, 50, 96, 112, 139, 203, 233
data.frame function, 2, 17, 122
datasets package, 14
degrees of freedom, 7, 8, 92, 99–101, 110, 132, 201, 203, 204
detach function, 11
dev.off function, 46
deviance, 144
deviance residual, 144
dimensionality reduction, 269
discriminant analysis, 246–267, 269, 272
discriminant scores, 254
dispersion matrix, 228
dist function, 278
DRUGS package, 1, 14, 15, 27, 35–37, 43, 48, 67, 87, 88, 95, 103, 104, 114, 127, 139, 140, 146, 150–152, 156, 179, 188, 195, 241–243, 265, 266, 284, 285
Dunnett's procedure, 9–10, 13–14
DunnettStat function, 13
effect size, 214
eigen analysis, 134, 228, 235, 239, 256
eigenvalue, 134, 135
Euclidean distance, 270, 271, 274, 276, 277, 285
expand.grid function, 50
expected mean squares, 107, 108
experimental unit, 5
exponential family, 143
faraway package, 88, 114
Fisher's exact test, 161, 164
Fisher's Linear Discriminant Function, 247
fisher.test function, 164
fitted function, 103, 147
fixed effects, 200, 201, 204, 208
for function, 139
foreign package, 3
formula class, 23
general public licence, v
generalised linear model, 96
glm function, 147, 151
Greenwood's formula, 172
growth curve models, 183
half cross-validation, 249
hazard function, 171
hclust function, 278
head function, 22, 43
heterogeneity of variance, 11
hierarchical designs, 111
hierarchical tree, 272
Hotelling-Lawley trace, 134, 137
hypergeometric distribution, 161
instantaneous failure rate, 171
interaction, 23, 39–42, 46, 48
interaction plot, 40, 46
inverse link, 145
is.data.frame function, 23
jpeg function, 46
Kaplan-Meier estimate, 172
Lagrange multiplier, 228
Latin square, 67
lattice package, 206
lda function, 258, 262, 282
leave-one-out method, 249
levels, 5
levels function, 15
Levene's test, 44
leveneTest function, 44
library function, 87, 103, 114, 258, 282
likelihood ratio test, 144
linear discriminant analysis, 246, 258, 262, 266, 267
linear predictor, 143
link function, 143
lm function, 23, 65
lme function, 202
locator function, 193
log link, 145
log-odds, 145
log-rank test, 172
logistic link, 145
logistic regression, 145
logit link, 147
ls function, 139
main effects, 40, 136
MANOVA, 131–140
manova function, 136
Mantel-Haenszel test, 163
mantelhaen.test function, 166
marginal means, 58–59
martingale residual, 173
MASS package, 88, 156, 258, 282
Mat2List function, 22
matched pairs analysis, 162
matrix class, 2, 233, 239
matrix function, 164
McNemar's test, 162, 165
mcnemar.test function, 165
mean function, 4
median clustering, 275
meta-analysis, 163
method, 138
metric scaling, 277
Michaelis constant, 185
misclassification error rates, 248
mixed model, 110, 197–212
model.tables function, 24, 101, 103, 104
multi-centre trials, 163
multidimensional scaling (MDS), 282
multiple comparison procedures, 10, 21
multivariate normal distribution, 257
multivariate normality, 248
mvtnorm package, 13
ncol function, 62
nested designs, 111
Newton-Raphson method, 187
nlme package, 202
nls function, 188, 191, 193
non-metric scaling, 277
noncentral F-distribution, 218
noncentral t-distribution, 216
normal linear model, 142
nrow function, 62
nsize.p function, 219
observed marginal means, 59
odds ratio, 161
offset, 146
ordinal scaling, 277
orthogonal, 32, 40, 57
overdispersion, 146
pairs function, 234
partial likelihood, 173
paste function, 64, 112, 139
Pearson residual, 144
Pillai's trace, 134, 135, 137
plclust function, 278
plot function, 12, 24, 25, 85, 101, 150, 236, 260
Poisson regression, 146, 152
posterior probability, 251
power, 214
prcomp function, 235, 240, 244
predict function, 87, 88, 147, 178, 189, 263
principal co-ordinate analysis, 277
principal co-ordinate axes, 277
principal component analysis, 224–241, 255, 269, 270, 277, 285
principal component regression (PCR) modelling, 243
principal component scores, 227, 232
princomp function, 235, 244
print function, 234, 235, 258
prior probability, 251
prop.table function, 262
proportional hazards model, 173
pure replication, 39
quasi-likelihood, 146
random assignment, 6
random effects, 106–108, 200, 201, 204, 205, 210
randomized complete block (RCB) design, 19, 80, 83, 89, 99
read.csv function, 3, 95, 136, 146, 150, 174, 188, 233, 257, 263, 278
read.table function, 3
relative risk, 173
rep function, 15, 16, 122
repeated measures, 91, 131
replicates, 6, 7, 39
resid function, 101, 103
Resubstitution method, 248–249
rnorm function, 16
rotation, 229
rowMeans function, 62
Roy's maximum root, 134, 137
runif function, 16
sample function, 16, 17
scale function, 235, 278
scale-invariant, 230
scree plot, 231
screeplot function, 236
sd function, 239
semi-parametric model, 173
sensitivity, 213
sequential sums of squares, 58
Simpson's Paradox, 163, 173
simultaneous confidence intervals, 9, 10, 13, 21, 25
singular value decomposition, 229
source function, 13
specificity, 213
sqrt function, 97
standard error of the difference, 9, 21, 80
str function, 10, 15, 96, 112
summary function, 11, 12, 23, 34, 123, 138, 235
summary.aov function, 137, 138
summary.manova function, 138
Surv function, 174
survdiff function, 174
survfit function, 178
survival curve, 170
survival function, 170
survival package, 174
table function, 262
tabulate function, 202
tapply function, 46, 49, 63, 202
text function, 236
trace, 135
transformations, 46, 186
treatment factor, 5
Tukey honest significant differences, 21
TukeyHSD function, 25, 65
Type I SS, 58
Type III SS, 58
variance components, 107, 108, 111, 114, 200, 201
variance function, 144
vector class, 2
Wilks' lambda, 134–136
write.csv function, 49
