ENVIRONMENTAL

AND ECOLOGICAL
STATISTICS WITH R
Second Edition
CHAPMAN & HALL/CRC
APPLIED ENVIRONMENTAL STATISTICS
Series Editors

Doug Nychka, Institute for Mathematics Applied to Geosciences, National Center for Atmospheric Research, Boulder, CO, USA
Richard L. Smith, Department of Statistics & Operations Research, University of North Carolina, Chapel Hill, USA
Lance Waller, Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, GA, USA

Published Titles

Michael E. Ginevan and Douglas E. Splitstone, Statistical Tools for
Environmental Quality
Timothy G. Gregoire and Harry T. Valentine, Sampling Strategies for Natural
Resources and the Environment
Daniel Mandallaz, Sampling Techniques for Forest Inventory
Bryan F. J. Manly, Statistics for Environmental Science and Management,
Second Edition
Bryan F. J. Manly and Jorge A. Navarro Alberto, Introduction to Ecological
Sampling
Steven P. Millard and Nagaraj K. Neerchal, Environmental Statistics with
S-PLUS
Wayne L. Myers and Ganapati P. Patil, Statistical Geoinformatics for Human
Environment Interface
Nathaniel K. Newlands, Future Sustainable Ecosystems: Complexity, Risk
and Uncertainty
Éric Parent and Étienne Rivot, Introduction to Hierarchical Bayesian
Modeling for Ecological Data
Song S. Qian, Environmental and Ecological Statistics with R,
Second Edition
Thorsten Wiegand and Kirk A. Moloney, Handbook of Spatial Point-Pattern
Analysis in Ecology
ENVIRONMENTAL
AND ECOLOGICAL
STATISTICS WITH R
Second Edition

Song S. Qian
The University of Toledo
Ohio, USA

Boca Raton London New York

CRC Press is an imprint of the
Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20160825

International Standard Book Number-13: 978-1-4987-2872-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
In memory of my grandmother 张一贯, mother 仲泽庆, and father 钱拙.
Contents

Preface

List of Figures

List of Tables

I Basic Concepts

1 Introduction
1.1 Tool for Inductive Reasoning
1.2 The Everglades Example
1.2.1 Statistical Issues
1.3 Effects of Urbanization on Stream Ecosystems
1.3.1 Statistical Issues
1.4 PCB in Fish from Lake Michigan
1.4.1 Statistical Issues
1.5 Measuring Harmful Algal Bloom Toxin
1.6 Bibliography Notes
1.7 Exercise

2 A Crash Course on R
2.1 What is R?
2.2 Getting Started with R
2.2.1 R Commands and Scripts
2.2.2 R Packages
2.2.3 R Working Directory
2.2.4 Data Types
2.2.5 R Functions
2.3 Getting Data into R
2.3.1 Functions for Creating Data
2.3.2 A Simulation Example
2.4 Data Preparation
2.4.1 Data Cleaning
2.4.1.1 Missing Values
2.4.2 Subsetting and Combining Data
2.4.3 Data Transformation
2.4.4 Data Aggregation and Reshaping
2.4.5 Dates
2.5 Exercises

3 Statistical Assumptions
3.1 The Normality Assumption
3.2 The Independence Assumption
3.3 The Constant Variance Assumption
3.4 Exploratory Data Analysis
3.4.1 Graphs for Displaying Distributions
3.4.2 Graphs for Comparing Distributions
3.4.3 Graphs for Exploring Dependency among Variables
3.5 From Graphs to Statistical Thinking
3.6 Bibliography Notes
3.7 Exercises

4 Statistical Inference
4.1 Introduction
4.2 Estimation of Population Mean and Confidence Interval
4.2.1 Bootstrap Method for Estimating Standard Error
4.3 Hypothesis Testing
4.3.1 t-Test
4.3.2 Two-Sided Alternatives
4.3.3 Hypothesis Testing Using the Confidence Interval
4.4 A General Procedure
4.5 Nonparametric Methods for Hypothesis Testing
4.5.1 Rank Transformation
4.5.2 Wilcoxon Signed Rank Test
4.5.3 Wilcoxon Rank Sum Test
4.5.4 A Comment on Distribution-Free Methods
4.6 Significance Level α, Power 1 − β, and p-Value
4.7 One-Way Analysis of Variance
4.7.1 Analysis of Variance
4.7.2 Statistical Inference
4.7.3 Multiple Comparisons
4.8 Examples
4.8.1 The Everglades Example
4.8.2 Kemp’s Ridley Turtles
4.8.3 Assessing Water Quality Standard Compliance
4.8.4 Interaction between Red Mangrove and Sponges
4.9 Bibliography Notes
4.10 Exercises

II Statistical Modeling

5 Linear Models
5.1 Introduction
5.2 From t-test to Linear Models
5.3 Simple and Multiple Linear Regression Models
5.3.1 The Least Squares
5.3.2 Regression with One Predictor
5.3.3 Multiple Regression
5.3.4 Interaction
5.3.5 Residuals and Model Assessment
5.3.6 Categorical Predictors
5.3.7 Collinearity and the Finnish Lakes Example
5.4 General Considerations in Building a Predictive Model
5.5 Uncertainty in Model Predictions
5.5.1 Example: Uncertainty in Water Quality Measurements
5.6 Two-Way ANOVA
5.6.1 ANOVA as a Linear Model
5.6.2 More Than One Categorical Predictor
5.6.3 Interaction
5.7 Bibliography Notes
5.8 Exercises

6 Nonlinear Models
6.1 Nonlinear Regression
6.1.1 Piecewise Linear Models
6.1.2 Example: U.S. Lilac First Bloom Dates
6.1.3 Selecting Starting Values
6.2 Smoothing
6.2.1 Scatter Plot Smoothing
6.2.2 Fitting a Local Regression Model
6.3 Smoothing and Additive Models
6.3.1 Additive Models
6.3.2 Fitting an Additive Model
6.3.3 Example: The North American Wetlands Database
6.3.4 Discussion: The Role of Nonparametric Regression Models in Science
6.3.5 Seasonal Decomposition of Time Series
6.3.5.1 The Neuse River Example
6.4 Bibliographic Notes
6.5 Exercises

7 Classification and Regression Tree
7.1 The Willamette River Example
7.2 Statistical Methods
7.2.1 Growing and Pruning a Regression Tree
7.2.2 Growing and Pruning a Classification Tree
7.2.3 Plotting Options
7.3 Comments
7.3.1 CART as a Model Building Tool
7.3.2 Deviance and Probabilistic Assumptions
7.3.3 CART and Ecological Threshold
7.4 Bibliography Notes
7.5 Exercises

8 Generalized Linear Model
8.1 Logistic Regression
8.1.1 Example: Evaluating the Effectiveness of UV as a Drinking Water Disinfectant
8.1.2 Statistical Issues
8.1.3 Fitting the Model in R
8.2 Model Interpretation
8.2.1 Logit Transformation
8.2.2 Intercept
8.2.3 Slope
8.2.4 Additional Predictors
8.2.5 Interaction
8.2.6 Comments on the Crypto Example
8.3 Diagnostics
8.3.1 Binned Residuals Plot
8.3.2 Overdispersion
8.3.3 Seed Predation by Rodents: A Second Example of Logistic Regression
8.4 Poisson Regression Model
8.4.1 Arsenic Data from Southwestern Taiwan
8.4.2 Poisson Regression
8.4.3 Exposure and Offset
8.4.4 Overdispersion
8.4.5 Interactions
8.4.6 Negative Binomial
8.5 Multinomial Regression
8.5.1 Fitting a Multinomial Regression Model in R
8.5.2 Model Evaluation
8.6 The Poisson-Multinomial Connection
8.7 Generalized Additive Models
8.7.1 Example: Whales in the Western Antarctic Peninsula
8.7.1.1 The Data
8.7.1.2 Variable Selection Using CART
8.7.1.3 Fitting GAM
8.7.1.4 Summary
8.8 Bibliography Notes
8.9 Exercises

III Advanced Statistical Modeling

9 Simulation for Model Checking and Statistical Inference
9.1 Simulation
9.2 Summarizing Regression Models Using Simulation
9.2.1 An Introductory Example
9.2.2 Summarizing a Linear Regression Model
9.2.2.1 Re-transformation Bias
9.2.3 Simulation for Model Evaluation
9.2.4 Predictive Uncertainty
9.3 Simulation Based on Re-sampling
9.3.1 Bootstrap Aggregation
9.3.2 Example: Confidence Interval of the CART-Based Threshold
9.4 Bibliography Notes
9.5 Exercises

10 Multilevel Regression
10.1 From Stein’s Paradox to Multilevel Models
10.2 Multilevel Structure and Exchangeability
10.3 Multilevel ANOVA
10.3.1 Intertidal Seaweed Grazers
10.3.2 Background N2O Emission from Agriculture Fields
10.3.3 When to Use the Multilevel Model?
10.4 Multilevel Linear Regression
10.4.1 Nonnested Groups
10.4.2 Multiple Regression Problems
10.4.3 The ELISA Example: An Unintended Multilevel Modeling Problem
10.5 Nonlinear Multilevel Models
10.6 Generalized Multilevel Models
10.6.1 Exploited Plant Monitoring: Galax
10.6.1.1 A Multilevel Poisson Model
10.6.1.2 A Multilevel Logistic Regression Model
10.6.2 Cryptosporidium in U.S. Drinking Water: A Poisson Regression Example
10.6.3 Model Checking Using Simulation
10.7 Concluding Remarks
10.8 Bibliography Notes
10.9 Exercises

11 Evaluating Models Based on Statistical Significance Testing
11.1 Introduction
11.2 Evaluating TITAN
11.2.1 A Brief Description of TITAN
11.2.2 Hypothesis Testing in TITAN
11.2.3 Type I Error Probability
11.2.4 Statistical Power
11.2.5 Bootstrapping
11.2.6 Community Threshold
11.2.7 Conclusions
11.3 Exercises

Bibliography

Index
Preface

I learned statistics from Bayesian statisticians. As a result, I do not pay
attention to hypothesis testing and p-values in my work. Likewise, I do
not emphasize the use of them in my teaching. However, most students
from my classes remember the term “statistically significant” (or p < 0.05)
better than anything and check the R² value when evaluating a regression
model. I have talked to many of them about their experiences in learning
and using statistics to understand why they seem to be naturally drawn
to these numbers that few can explain clearly in plain language. I came to
a satisfactory explanation around 2007 when I read slides of a presentation
given by Dick De Veaux of Williams College entitled “Math is Music; Statistics
is Literature.” (This presentation is now available on YouTube.) According
to Dr. De Veaux, statistics is challenging to both students and instructors
alike, because we want to teach not only the mechanical part of statistics,
but also the process of making a judgment. As a statistics course is always
counted as a quantitative methods class, students naturally view statistics as
a mathematics class. But statistics is not mathematics. In a typical statistics
class for environmental/ecological graduate students, we use very
simple (but often tedious) mathematics. Students expect to learn statistics as
they learn mathematics. However, the mode of inference in mathematics is
deduction while the mode of inference of statistics is induction. As a result,
statistics cannot be learned by remembering rules and formulae. The process
of making a judgment requires putting the analysis in context, combining
information from multiple sources, and using logic and common sense. Learning
statistics is not about learning rules (as in mathematics) but more about
interpretation and synthesis, which requires experience (as in literature).
When deciding to write this book, I wanted to put together some examples
to illustrate the process of making a judgment and integrate these examples
to illustrate the iterative process of statistical inference. This process will
inevitably include more than one statistical topic. As a result, many examples
included in this book are used in multiple chapters. For example, I used the
PCB in fish example to illustrate a two-sample t-test in Chapter 4, simple
and multiple regression in Chapter 5, and nonlinear regression
in Chapter 6. With these examples, I try to illuminate the difference between
how we learn statistics and how we use statistics. In learning statistics, we
learn by topics (e.g., from t-test to ANOVA to linear regression, and so on).
By the end of the class, students often see statistics as a collection of unrelated
methods. When using statistics, we must first determine the nature of the
problem before deciding what statistical tools to use. This first step is not
always taught in a statistics class.
Using the PCB in fish example, I want to illustrate the iterative nature
of a statistical inference problem. We may not be able to identify the most
appropriate model at first. By repeatedly proposing a model,
identifying its flaws, and revising it, we hope to
reach a sensible conclusion. As a result, a statistical analysis must have subject
matter context. It is a process of sifting through data to find useful information
to achieve a specific objective. The basic problem of the PCB in fish example
is the risk of PCB exposure from consuming fish from Lake Michigan. The
initial use of the data showed a large difference between large and small fish
PCB concentrations. However, Figure 5.1 suggests that the difference between
small and large fish PCB concentrations cannot be adequately described by the
simple two-sample t-test model. Throughout Chapter 5, I used this example
to discuss how a linear regression model should be evaluated and updated. In
Chapter 6, some alternative models are presented to summarize the attempts
made in the literature to correct the inadequacies of the linear models. But I
left Chapter 6 without a satisfactory model. In Chapter 9, I used this example
again to illustrate the use of simulation for model evaluation. While writing
Chapter 9, I discovered the length imbalance. In a way, this example shows
the typical outcome of a statistical analysis: no matter how hard we try, the
outcome is never completely satisfactory. There are always more “what
if”s. However, the ability to ask “what if” is not easy to teach or learn,
because of the “seven unnatural acts of statistical thinking” required by a
statistical analysis: think critically, be skeptical, think about variation (rather
than about center), focus on what we don’t know, perfect the process, and
think about conditional probabilities and rare events [De Veaux and Velleman,
2008]. By examining the same problem from different angles, I hope to bring
home the essential message: statistical analysis is more than reporting a
p-value.
Since the publication of the first edition, I have learned more about the
problems of using statistical hypothesis testing. Part of the problem
lies in the terminology we use in statistical hypothesis testing. The term
“statistically significant” is particularly corruptive. The term has a specific
meaning with respect to the null hypothesis. But by declaring our result
to be “significant” without further explanation, we often mislead not only
the consumer of the result but also ourselves. In this edition, I removed the
term “statistically significant” whenever possible. Instead, I try to use plain
language to describe the meaning of a “significant” result. As I explained in
a guest editorial for the journal Landscape Ecology, a statistical result should
be measured by the MAGIC criteria of Abelson [1995]: a statistical inference
should be a principled argument and the strength of the inference should
be measured by Magnitude, Articulation, Generality, Interestingness, and
Credibility, not just a p-value or R² or any other single statistic. Throughout
the book, I emphasize interpreting a fitted model and drawing
conclusions based on the context of the problem. I have followed these
rules in all examples:
• Verbal description of a model – a clear description of the model using
nonstatistical terms should be a first step. When describing the model in
clear scientific terms, we can better judge whether the model is sensible
and whether the real world can be reasonably represented by the model.
Even for a simple model such as a t-test or ANOVA, a verbal description
can be helpful.
• Verifying model assumptions – plots, plots, and more plots.
• Verbal description of estimated model coefficients – before finalizing the
model, we should describe the estimated model coefficients in words.
This should be done even in a simple two-sample t-test.
The American Statistical Association issued a statement on p-values
[Wasserstein and Lazar, 2016]. The statement emphasizes that the use of
statistics should include the context of the problem, the process of data
collection and model formulation, and the purpose of the analysis. I will use
the statement as required reading in my class during the first and last weeks
of the semester.
Major changes made in this edition include:
• New and revised Chapters and Sections

– Sections 1.2–1.5 describe main examples used in more than one
chapter.
– Chapter 2 is rewritten with a brief introduction to R and the use
of R for data manipulation.
– Section 5.1 is rewritten to use the PCB in fish example as the lead-in
to the linear regression model.
– New section 5.3.1 introduces the ELISA data collected during the
Toledo water crisis in 2014.
– New section 6.1.3 presents the use of a self-starter function for
nonlinear regression.
– Sections 8.5–8.6 present the multinomial regression and the
connection between multinomial and Poisson models.
– Section 9.2 is revised to include nonlinear regression simulation.
– Two-way ANOVA is removed from section 10.3.
– Section 10.4.3 is added to introduce the ELISA example as a
multilevel modeling problem.
– Section 10.5 is added to introduce nonlinear multilevel models.
– Section 10.6.1 uses new examples for generalized multilevel models.
– Chapter 11 is added to discuss the use of simulation in evaluating
hypothesis-testing-based methods. This chapter demonstrates the
importance of putting a statistical test in the context of a real-world
problem. We should ask: what is the scientific problem at hand, what
is the null hypothesis in the context of the problem, and what
alternatives are supported when the null is rejected? Once these
questions are answered, we often have a better understanding of the
problem and can be better prepared for making a sound judgment.
• Exercises are added to the end of each chapter.
• Online materials (data and R code) are at GitHub
(https://github.com/songsqian/eesR).

Song S. Qian
Sylvania, Ohio, USA
July 2016
List of Figures

1.1 Map of WCA2A . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 RStudio screenshot when opened for the first time. . . . . . 20


2.2 RStudio screenshot with the R script file of this book open. 22
2.3 An example stream networks . . . . . . . . . . . . . . . . . 40

3.1 The standard normal distribution . . . . . . . . . . . . . . . 49


3.2 Everglades background TP concentration distribution . . . . 50
3.3 Normal Q-Q plot of the Everglades background TP
concentration . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 TP in Lake Erie as a function of distance to Maumee . . . . 55
3.5 Comparing standard deviations using S-L plot . . . . . . . . 56
3.6 Histograms of Everglades TP concentrations . . . . . . . . . 57
3.7 An example quantile plot . . . . . . . . . . . . . . . . . . . . 58
3.8 Explaining the boxplot . . . . . . . . . . . . . . . . . . . . . 59
3.9 Additive versus multiplicative shift in Q-Q plot . . . . . . . 60
3.10 Bivariate scatter plot . . . . . . . . . . . . . . . . . . . . . . 62
3.11 Scatter plot matrix . . . . . . . . . . . . . . . . . . . . . . . 63
3.12 Scatter plot of North American Wetland Database . . . . . 64
3.13 Power transformation for normality . . . . . . . . . . . . . . 65
3.14 Daily PM2.5 concentrations in Baltimore . . . . . . . . . . . 67
3.15 Seasonal patterns of daily PM2.5 in Baltimore . . . . . . . . 68
3.16 Conditional plot of the air quality data . . . . . . . . . . . . 68
3.17 The iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.1 Simulating the Central Limit Theorem . . . . . . . . . . . . 83


4.2 Distribution of sample standard deviation . . . . . . . . . . 84
4.3 Distribution of the 75th percentile of Everglades background
TP concentration . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 The t-distribution . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Relationships between α, β, and p-value . . . . . . . . . . . 94
4.6 A two-sided test . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7 Factors affecting statistical power . . . . . . . . . . . . . . . 110
4.8 Residuals from an ANOVA model . . . . . . . . . . . . . . . 120
4.9 S-L plot of residuals from an ANOVA model . . . . . . . . . 121
4.10 ANOVA residuals . . . . . . . . . . . . . . . . . . . . . . . . 122

xvii
xviii List of Figures

4.11 Normal quantile plot of ANOVA residuals . . . . . . . . . . 123


4.12 Annual precipitation in the Everglades National Park . . . . 128
4.13 Yearly variation in Everglades TP concentrations . . . . . . 129
4.14 Statistical power is a function of sample size. . . . . . . . . 136
4.15 Boxplots of the mangrove-sponge interaction data . . . . . . 138
4.16 Normal Q-Q plots of the mangrove-sponge interaction data 139
4.17 Pairwise comparison of the mangrove-sponge data . . . . . . 140

5.1 Q-Q plot comparing PCB in large and small fish . . . . . . 153
5.2 PCB in fish versus fish length . . . . . . . . . . . . . . . . . 154
5.3 Temporal trend of fish tissue PCB concentrations . . . . . . 157
5.4 Simple linear regression of the PCB example . . . . . . . . . 159
5.5 Multiple linear regression of the PCB example . . . . . . . . 160
5.6 Normal Q-Q plot of PCB model residuals . . . . . . . . . . 166
5.7 PCB model residuals vs. fitted . . . . . . . . . . . . . . . . . 167
5.8 S-L plot of PCB model residuals . . . . . . . . . . . . . . . . 168
5.9 Cook’s distance of the PCB model . . . . . . . . . . . . . . 169
5.10 The rfs plot of the PCB model . . . . . . . . . . . . . . . . . 170
5.11 Modified PCB model residuals vs. fitted . . . . . . . . . . . 173
5.12 Finnish lakes example: bivariate scatter plots . . . . . . . . 175
5.13 Conditional plot: chlorophyll a against TP conditional on TN
(no interaction) . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.14 Conditional plot: chlorophyll a against TN conditional on TP
(no interaction) . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.15 Finnish lakes example: interaction plots (no interaction) . . 180
5.16 Conditional plot: chlorophyll a against TP conditional on TN
(positive interaction) . . . . . . . . . . . . . . . . . . . . . . 182
5.17 Conditional plot: chlorophyll a against TN conditional on TP
(positive interaction) . . . . . . . . . . . . . . . . . . . . . . 183
5.18 Finnish lakes example: interaction plots (positive interaction) 184
5.19 Finnish lakes example: interaction plots (negative interaction) 184
5.20 Box–Cox likelihood plot for response variable transformation 188
5.21 ELISA standard curve and prediction uncertainty . . . . . . 193

6.1 Nonlinear PCB model . . . . . . . . . . . . . . . . . . . . . 211


6.2 Nonlinear PCB model residuals normal Q-Q plot . . . . . . 212
6.3 Nonlinear PCB model residuals vs. fitted PCB . . . . . . . . 213
6.4 Nonlinear PCB model residuals S-L plot . . . . . . . . . . . 214
6.5 Nonlinear PCB model residuals distribution . . . . . . . . . 214
6.6 Four nonlinear PCB models . . . . . . . . . . . . . . . . . . 219
6.7 Simulated % PCB reduction from 2000 to 2007 . . . . . . . 219
6.8 The hockey stick model . . . . . . . . . . . . . . . . . . . . . 222
6.9 The piecewise linear regression model . . . . . . . . . . . . . 223
6.10 The estimated piecewise linear regression model for selected
years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.11 First bloom dates of lilacs in North America . . . . . . . . . 227
6.12 All first bloom dates of lilacs in North America . . . . . . . 230
6.13 Toledo ELISA standard curve data . . . . . . . . . . . . . . 231
6.14 Toledo ELISA model diagnostics 1 . . . . . . . . . . . . . . 238
6.15 Toledo ELISA model diagnostics 2 . . . . . . . . . . . . . . 239
6.16 A moving average smoother . . . . . . . . . . . . . . . . . . 242
6.17 A loess smoother . . . . . . . . . . . . . . . . . . . . . . . . 244
6.18 Graphical presentation of a multiple linear regression model 246
6.19 Graphical presentation of a multiple linear regression model
with log-transformation . . . . . . . . . . . . . . . . . . . . . 247
6.20 Graphical presentation of a multiple linear regression model
with log-transformation . . . . . . . . . . . . . . . . . . . . . 247
6.21 Additive model of PCB in the fish . . . . . . . . . . . . . . . 248
6.22 Effects of smoothing parameter . . . . . . . . . . . . . . . . 250
6.23 The North American Wetlands Database . . . . . . . . . . . 252
6.24 The effluent concentration–loading rate relationship . . . . . 253
6.25 Fitted additive model using mgcv default . . . . . . . . . . . 253
6.26 Contour plot of a two-variable smoother fitted using gam . . 256
6.27 Three-dimensional perspective plot of a two variable smoother
fitted using gam . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.28 The one-gram rule model . . . . . . . . . . . . . . . . . . . . 258
6.29 Fitted additive model using user-selected smoothness parameter
value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.30 CO2 time series from Mauna Loa, Hawaii . . . . . . . . . . 259
6.31 Fecal coliform time series from the Neuse River . . . . . . . 264
6.32 STL model of fecal coliform time series from the Neuse River 265
6.33 STL model of total phosphorus time series from the Neuse
River . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.34 Long-term trend of TKN in the Neuse River . . . . . . . . . 268

7.1 A classification tree of the iris data . . . . . . . . . . . . . . 274
7.2 Classification rules for the iris data . . . . . . . . . . . . . . 275
7.3 Diuron concentrations in the Willamette River Basin . . . . 278
7.4 First diuron CART model . . . . . . . . . . . . . . . . . . . 280
7.5 Cp-plot of the diuron CART model . . . . . . . . . . . . . . 282
7.6 Pruned diuron CART model 1 . . . . . . . . . . . . . . . . . 283
7.7 Pruned diuron CART model 2 . . . . . . . . . . . . . . . . . 284
7.8 Quantile plot of diuron data . . . . . . . . . . . . . . . . . . 286
7.9 First diuron CART classification model . . . . . . . . . . . . 288
7.10 Cp-plot of the diuron classification model . . . . . . . . . . 289
7.11 Pruned diuron classification model . . . . . . . . . . . . . . 290
7.12 CART plot option 1 . . . . . . . . . . . . . . . . . . . . . . 291
7.13 CART plot option 2 . . . . . . . . . . . . . . . . . . . . . . 292
7.14 CART plot option 3 . . . . . . . . . . . . . . . . . . . . . . 294
7.15 Alternative diuron classification models . . . . . . . . . . . . 296

8.1 A dose-response curve . . . . . . . . . . . . . . . . . . . . . 310
8.2 Logit transformation . . . . . . . . . . . . . . . . . . . . . . 311
8.3 Mice infectivity data . . . . . . . . . . . . . . . . . . . . . . 313
8.4 Logistic regression residuals . . . . . . . . . . . . . . . . . . 317
8.5 The binned residual plot . . . . . . . . . . . . . . . . . . . . 317
8.6 Seed predation versus seed weight . . . . . . . . . . . . . . . 320
8.7 Seed predation over time . . . . . . . . . . . . . . . . . . . . 323
8.8 Time varying seed predation rate . . . . . . . . . . . . . . . 324
8.9 Probability of predation by time and seed weight . . . . . . 325
8.10 Probability of seed predation as a function of seed weight . 328
8.11 Seed weight and topographic class interaction . . . . . . . . 330
8.12 Binned residual plot of the seed predation model . . . . . . 331
8.13 Arsenic in drinking water data 1 . . . . . . . . . . . . . . . . 336
8.14 Arsenic in drinking water data 2 . . . . . . . . . . . . . . . . 337
8.15 Arsenic in drinking water data 3 . . . . . . . . . . . . . . . . 338
8.16 Arsenic in drinking water data 4 . . . . . . . . . . . . . . . . 339
8.17 Raw versus standardized residuals of an additive Poisson
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
8.18 Fitted overdispersed Poisson model . . . . . . . . . . . . . . 347
8.19 Fitted overdispersed Poisson model with age as a covariate . 350
8.20 Residuals of a Poisson model . . . . . . . . . . . . . . . . . . 351
8.21 Tolerance group multinomial model 1 . . . . . . . . . . . . . 357
8.22 Tolerance group multinomial model 2 . . . . . . . . . . . . . 359
8.23 Multinomial residual plot . . . . . . . . . . . . . . . . . . . . 361
8.24 The Poisson-multinomial connection . . . . . . . . . . . . . 364
8.25 Independent Poisson models for tolerance groups . . . . . . 365
8.26 Independent Poisson models of mayfly taxa . . . . . . . . . 366
8.27 Comparing mayfly taxa models . . . . . . . . . . . . . . . . 367
8.28 Antarctic whale survey locations . . . . . . . . . . . . . . . 370
8.29 Antarctic whale survey data scatter plots . . . . . . . . . . . 372
8.30 Antarctic whale survey CART model Cp plot . . . . . . . . 373
8.31 Antarctic whale survey CART (regression) model . . . . . . 373
8.32 Antarctic whale survey CART (classification) model . . . . 374
8.33 Antarctic whale survey Poisson GAM . . . . . . . . . . . . . 376
8.34 Residuals from GAM show overdispersion . . . . . . . . . . 378
8.35 Antarctic whale survey logistic GAM . . . . . . . . . . . . . 379

9.1 Fish tissue PCB reduction from 2002 to 2007 . . . . . . . . 398
9.2 Fish size versus year . . . . . . . . . . . . . . . . . . . . . . 398
9.3 Residuals as a measure of goodness of fit . . . . . . . . . . . 400
9.4 Simulation for model evaluation . . . . . . . . . . . . . . . . 401
9.5 Tail areas of selected PCB statistics . . . . . . . . . . . . . . 402
9.6 Cape Sable seaside sparrow population temporal trend . . . 403
9.7 Cape Sable seaside sparrow model simulation . . . . . . . . 404
9.8 ELISA test uncertainty . . . . . . . . . . . . . . . . . . . . . 409
9.9 Bootstrapping for threshold confidence interval . . . . . . . 412

10.1 Seaweed grazer example comparing lm and lmer . . . . . . . 430
10.2 Comparisons of three data pooling methods in the N2O
emission example . . . . . . . . . . . . . . . . . . . . . . . . 432
10.3 Logit transformation of soil carbon . . . . . . . . . . . . . . 434
10.4 N2O emission as a function of soil carbon . . . . . . . . . . 435
10.5 The EUSE example data . . . . . . . . . . . . . . . . . . . . 437
10.6 EUSE example linear model coefficients . . . . . . . . . . . 440
10.7 Comparison of linear and multilevel regression . . . . . . . . 443
10.8 Multilevel model with a group level predictor . . . . . . . . 446
10.9 Antecedent agriculture land-use as a group level predictor . 448
10.10 Antecedent agriculture land-use and temperature as
group-level predictors . . . . . . . . . . . . . . . . . . . . . 450
10.11 Antecedent agriculture land-use and temperature interaction 452
10.12 Lake type-level multilevel model coefficients . . . . . . . . . 455
10.13 Conditional plots of oligotrophic lakes (TP) . . . . . . . . . 456
10.14 Conditional plots of oligotrophic lakes (TN) . . . . . . . . . 457
10.15 Conditional plots of eutrophic lakes (TP) . . . . . . . . . . . 458
10.16 Conditional plots of eutrophic lakes (TN) . . . . . . . . . . 459
10.17 Conditional plots of oligotrophic (P limited) lakes (TP) . . . 460
10.18 Conditional plots of oligotrophic (P limited) lakes (TN) . . 461
10.19 Conditional plots of oligotrophic/mesotrophic lakes (TP) . . 462
10.20 Conditional plots of oligotrophic/mesotrophic lakes (TN) . . 463
10.21 Random effects of ELISA model coefficients using SSfpl2 . 467
10.22 Random effects of ELISA model coefficients using SSfpl . . 469
10.23 Random effects (sites) of the Galax model . . . . . . . . . . 473
10.24 Large leaf density of the Galax model . . . . . . . . . . . . . 474
10.25 Large leaf proportion random effects . . . . . . . . . . . . . 476
10.26 Large leaf proportions . . . . . . . . . . . . . . . . . . . . . 477
10.27 System means of cryptosporidium in U.S. drinking water
systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
10.28 System mean distribution of cryptosporidium in the United
States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
10.29 Simulating cryptosporidium in U.S. drinking water systems 485

11.1 IV and z-score under the null model . . . . . . . . . . . . . 502
11.2 Permutation µ and σ under the null model . . . . . . . . . . 502
11.3 TITAN’s underlying models . . . . . . . . . . . . . . . . . . 505
11.4 IV and z-score under a linear model . . . . . . . . . . . . . 507
11.5 IV and z-score under a hockey stick model . . . . . . . . . . 508
11.6 IV and z-score under a step function model . . . . . . . . . 509
11.7 IV and z-score under a sigmoidal model . . . . . . . . . . . 509
11.8 IV and z-score under a sigmoidal model . . . . . . . . . . . 510
11.9 IV and z-score under a sigmoidal model . . . . . . . . . . . 510
