Statistical Methods in
HYDROLOGY
Second Edition
CHARLES T. HAAN
CHARLES T. HAAN is Regents Professor and Sarkeys Distinguished Professor, Emeritus, in
the Department of Biosystems and Agricultural Engineering, Oklahoma State University, Stillwater.
© 1974 Iowa State University Press
© 2002 Iowa State Press
1-800-862-6657
1-515-292-0140
1-515-292-3348
www.iowastatepress.com
Authorization to photocopy items for internal or personal use, or the internal or personal use of
specific clients, is granted by Iowa State Press, provided that the base fee of $.10 per copy is paid
directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For those
organizations that have been granted a photocopy license by CCC, a separate system of payments
has been arranged. The fee code for users of the Transactional Reporting Service is 0-8138-1503-7/2002 $.10.
Printed on acid-free paper in the United States of America
First edition, 1974
Second edition, 2002
Library of Congress Cataloging-in-Publication Data
Haan, C. T. (Charles Thomas)
Statistical methods in hydrology / Charles T. Haan.-2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8138-1503-7 (acid-free paper)
1. Hydrology-Statistical methods. I. Title.
GB656.2.S7 H3 2002
551.48'07'27-dc21
2002000060
The last digit is the print number: 9 8 7 6 5 4 3 2 1
Contents
PREFACE TO SECOND EDITION
PREFACE TO FIRST EDITION
INTRODUCTION
Hydrologic data
Independence
Derived distributions
Mixed distributions
Exercises
Poisson process
Poisson distribution
Exponential distribution
Gamma distribution
Summary of Poisson process
Multinomial distribution
Exercises
NORMAL DISTRIBUTION
General normal distribution
Reproductive properties
Standard normal distribution
Approximations for standard normal distribution
Central limit theorem
Constructing pdf curves for data
Normal approximations for other distributions
Binomial distribution
Negative binomial distribution
Poisson distribution
Continuous distributions
Exercises
FREQUENCY ANALYSIS
Probability plotting
Historical data
Outliers
Analytical hydrologic frequency analysis
Normal distribution
Lognormal distribution
Log Pearson type III distribution
Extreme value type I distribution (Gumbel distribution)
Other distributions
General considerations
Confidence intervals
Treatment of zeros
Truncation of low flows
Use of paleohydrologic data
Probable maximum flood
Discussion of flood frequency determinations
Regional frequency analysis
Delineation of homogeneous regions
Historical development
Statistical methods
Frequency distributions
Regression-based procedures
Index-flood method
Regional index-flood relationship
Regionalization using L-moments and the GEV distribution
Regionalization using modeling
Frequency analysis of precipitation data
Frequency analysis of other hydrologic variables
Exercises
11 CORRELATION
Inferences about population correlation coefficients
Serial correlation
Correlation and regional analysis
Correlation and cause and effect
Spurious correlation
Exercises
12 MULTIVARIATE ANALYSIS
Notation
Principal components
Regression on principal components
Multivariate multiple regression
Canonical correlation
Cluster analysis
Exercises
13 DATA GENERATION
Univariate data generation
Multivariate data generation
Multivariate, correlated, normal random variables
Multivariate, correlated, nonnormal random variables
Applications of data generation
Exercises
14 ANALYSIS OF HYDROLOGIC TIME SERIES
Definitions
Trend analysis
Jumps
Autocorrelation
Periodicity
Autoregressive integrated moving average models (ARIMA)
Moving Average Processes (MA)
Autoregressive processes
Autoregressive Moving Average Models ARMA (p, q)
Autoregressive Integrated Moving Average ARIMA (p, d, q)
Estimate of noise variance σ²
Parameter estimation via least squares
AR models
MA models
Parameter estimation via maximum likelihood
Exercises
Estimation of cumulative distributions
Uncertainty
Modeling using geostatistics
APPENDIXES
A.1. Common distributions
Hydrologic data
A.2. Monthly runoff (in.), Cave Creek near Fort Spring, Kentucky
A.3. Peak discharge (cfs), Cumberland River at Cumberland Falls, Kentucky
A.4. Peak discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
A.5. Total Precipitation (in.) for week of March 1 to March 7, Ashland, Kentucky
A.6. Flow and sediment load, Green River at Munfordville, Kentucky
A.7. Streamflow (in.), Walnut Gulch near Tombstone, Arizona
A.8. Monthly Rainfall (in.), Walnut Gulch near Tombstone, Arizona
A.9. Annual discharge (cfs), Spray River, Banff, Canada
A.10. Annual discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
A.11. Annual discharge (cfs), Llano River, Junction, Texas
Statistical tables
A.12. Standard normal distribution
A.13. Percentile values for the t distribution
A.14. Percentile values for the chi square distribution
A.15. Percentile values for the F distribution
A.16. Critical values for the Kolmogorov-Smirnov test statistic
A.17. Durbin-Watson test bounds
BIBLIOGRAPHY
INDEX
Preface to the
Second Edition
SINCE THE publication of the first edition of this book, statistics has come to play an
increasingly important role in hydrology. The advancements in computing technology and data
management have made the application of statistical techniques that were previously known but
difficult to implement almost routine. User-friendly software for personal computers has made
powerful statistical routines available to nearly all hydrologists. Generally, this software comes
with user manuals or help files that lead a new user through the steps needed to use the programs.
Unfortunately, these aids rarely indicate the assumptions inherent in the techniques, the limitations of the techniques, and the situations in which the techniques should or should not be used.
They are generally weak in instructing one on the interpretation of the results of the analysis as
well. This software is a tool that is available for use in hydrology but does not replace sound
hydrologic understanding of the problem at hand nor does it replace a basic understanding of the
statistical technique being used.
This current edition should serve as a companion to many of the software programs
available: not to explain how to use the software, but to provide guidance as to the proper routines to use for a particular problem and the interpretation of the results of the analysis.
The basic philosophy of the current edition is the same as that of the first edition. Enough
detail on particular statistical methods is presented to gain a working understanding of the technique. Certainly the treatment on any particular statistical technique is not exhaustive. Much
theory and derivation are omitted and left to more in-depth treatments found in books dealing
specifically with the various topics.
Two chapters have been added to the book. One of these chapters deals with uncertainty
analysis and the other with geostatistics. Both of these topics have received great emphasis in
the past decade. Uncertainty analysis is a growing concern as it is increasingly recognized that
both statistical and deterministic analyses result in estimates that are far from absolute answers.
Increasingly, attempts are made to evaluate how much uncertainty should be associated with
various types of analyses. Rather than providing a point estimate of some quantity, confidence
limits are sought, such that one can assert with various degrees of confidence bounds within
which the sought-after quantity is thought to be. Geostatistics has become of increasing importance as geographically referenced information becomes available and is used in geographical information systems (GISs) to produce hydrologic estimates.
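The idea of replacing a point estimate with an interval can be sketched in a few lines of Python; the flow values, the 95% level, and the tabulated t value below are illustrative assumptions, not data or procedures taken from the book:

```python
# Hypothetical illustration (values invented, not taken from the text):
# replace a point estimate of a mean annual flow with a 95% confidence
# interval, using the t distribution for a small sample.
import statistics

flows = [412.0, 380.5, 455.2, 398.7, 430.1, 372.9, 441.6, 405.3]  # cfs
n = len(flows)
mean = statistics.mean(flows)
s = statistics.stdev(flows)   # sample standard deviation
t = 2.365                     # tabulated t (0.975 level, n - 1 = 7 df)
half_width = t * s / n ** 0.5

lower, upper = mean - half_width, mean + half_width
print(f"mean = {mean:.1f} cfs, 95% CI = ({lower:.1f}, {upper:.1f}) cfs")
```

Rather than asserting that the mean flow is one number, the analyst reports the interval, which widens as the sample shrinks or the scatter grows.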
The chapter on uncertainty was written by Aditya Tyagi, a former PhD candidate at
Oklahoma State University and currently a water resources engineer with CH2M Hill. Jason
Vogel, a research engineer and PhD candidate at Oklahoma State University, was a coauthor of
the chapter on geostatistics.
Preface to the
First Edition
THE RANDOM variability of such hydrologic variables as streamflow and precipitation has
been recognized for centuries. The general field of hydrology was one of the first areas of science
and engineering to use statistical concepts in an effort to analyze natural phenomena. Many papers have been published that amply demonstrate the value of statistical tools in analyzing and
solving hydrologic problems. In spite of the long history and proven utility of statistical techniques in hydrology, relatively few comprehensive and basic treatments of statistical methods in
hydrology have been published.
This book has been prepared to assist engineers and hydrologists in developing an elementary
knowledge of some statistical tools that have been successfully applied to hydrologic problems.
The intent of the book is to familiarize the reader with various statistical techniques, point out
their strengths and weaknesses and demonstrate their usefulness. The serious reader will want to
supplement the material with formal courses or independent study of those individual topics that
are major interests. No single topic has been developed completely. Books have been written covering many of the topics discussed as single chapters in this presentation. Again the purpose here
is to develop understanding and illustrate the usefulness of the techniques. Most of the techniques
are discussed in sufficient detail for a thorough understanding and application to problem situations. The philosophy of the presentation has been that one does not have to understand hydrodynamics to swim even though it could help one to become a more proficient swimmer.
The book has not been written for statisticians or for those primarily interested in statistical
theory. Rather it has been prepared for hydrologists and engineers interested in learning how
statistical models and methods can be valuable tools in the analysis and solution of many
hydrologic and engineering problems. The basic premise has been taken (and justifiably so) that
statisticians are competent so that many statistical results are presented without developing a
rigorous proof of their validity. Proofs for most results can be found in mathematical statistics
books many of which are listed in the bibliography.
No prior knowledge of statistics is required if one starts with Chapter 2. Those with varying
degrees of statistical knowledge may choose to start with later chapters. A knowledge of calculus
is required throughout and some familiarity with matrices is needed for material in later chapters.
Appendix D is a review of the basic matrix manipulation used in the book (not in this new
edition).
This is not a statistical "cookbook" for hydrologists. It does not contain step-by-step calculation procedures for "standard" hydrologic problems. Basic statistical concepts are discussed
and illustrated in enough detail so that one can develop his own computational procedures or
methods.
Most of the computations in actual work situations would be done on digital computers.
Computer programs have not been included because it is felt that most computer centers will
have programs or programmers available. Likewise computational techniques are not emphasized. For example, in the chapter on multiple regression, efficient techniques for matrix inversion are not presented as it is felt that these techniques are readily available at most computer centers. The emphasis is thus retained on the statistical technique being used and not on the
computational aspects of the problem.
Some liberties have been taken in that many terms are not precisely defined in a mathematical sense unless such a definition is warranted. Where terms are loosely defined, it is hoped that
the meticulous reader will accept the general connotation of the terms for purposes of simplicity
and to avoid placing emphasis on terms rather than concepts.
Many of the problems require sets of data. Those data may be supplied by the reader or selected from the data in Appendix C.
I am grateful to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank
Yates, F.R.S., and to Longman Group Ltd., London, for permission to reprint Table E.5 from their
book Statistical Tables for Biological, Agricultural and Medical Research, 6th Edition (1974) (not
in this new edition).
Acknowledgments for
the Second Edition
IT HAS been nearly a quarter century since I wrote the first edition of this book. During
that time I have become indebted to many people. I have spent nearly this entire period with
the Biosystems and Agricultural Engineering Department at Oklahoma State University. This
Department has provided a wonderful atmosphere for intellectual growth and accomplishment.
The faculty, staff, and students that I have been associated with have helped to create a working
environment that was challenging, friendly, and one in which my only limitation was myself.
I am grateful to many individuals. Bill Barfield has continued to be a valued friend and
coworker. Dan Storm, Bruce Wilson, and many graduate students have been especially instrumental in much of my research and teaching in the field of statistical hydrology.
My daughter, Dr. Patricia Haan, assistant professor in the Biological and Agricultural
Engineering Department at Texas A&M University, has been very helpful in clarifying some
points in the text and correcting errors.
Certainly my wife of 34 years, Jan, has been most supportive and forgiving as I have devoted
far too much time to work.
As is true of all of us, I owe whatever I have accomplished to my Creator without Whom
I could accomplish nothing.
Acknowledgments for
the First Edition
MUCH OF the material presented in this book was developed for a course taught to students
in the Agricultural Engineering and Civil Engineering Departments at the University of Kentucky. The suggestions and clarifications made by the students in this course over the past 8 years
have been a great aid in attempting to make this book more understandable.
Special acknowledgment must be given to Dan Carey for his careful readings of the entire
manuscript. These readings resulted in several corrections and clarifications. Several individuals have read parts of the book and made valuable suggestions for its improvement. Among
those reviewing parts of the manuscript were Donn DeCoursey, David Allen, David Culver,
and personnel of the U.S. Soil Conservation Service under the direction of Neil Bogner.
Several individuals in the Agricultural Engineering Department at the University of Kentucky
offered valuable suggestions and considerable encouragement. Deserving special mention are Billy
Barfield, Blaine Parker, and John Walker.
This undertaking has required sacrifice on the part of my family and especially my wife Janice.
She not only typed the early drafts of the book but offered continued encouragement over the years
as work and revisions were done on the book.
This manuscript was reproduced from photo-ready copy. The excellent typing involved in
preparing this final draft as well as an earlier draft was done by Pat Owens. Buren Plaster drafted
all of the figures.
Of course any failings and shortcomings of this book must be credited to me. My hope is that it
will be found useful in at least partially meeting the need for an elementary treatment of statistical
methods in hydrology. Whatever is accomplished along these lines I owe to our Father for giving me
the will to see this project through and the ability to withstand the setbacks experienced along the way.
Finally I express my appreciation to all of the members of the Agricultural Engineering
Department at the University of Kentucky for their understanding during the preparation of this
manuscript.
Statistical Methods in
HYDROLOGY
1. Introduction
MORE THAN 25 years ago I set about writing a book on the application of statistical techniques to hydrology. That book, published in 1977, became the first edition of this current work
and was appropriately titled Statistical Methods in Hydrology. Although soundly criticized for
producing a book of the general type "Statistics for ____," that was little more than a "relevant
Schaum's Outline series" on statistics with a little hydrology thrown in (Burges 1978), the book
has had a very wide reception, has gone through several printings, and has been widely quoted in
the literature. However, as I have reflected on this critique over the years, and as I have used statistics to address problems in hydrology and observed others doing the same, I have come to the
conclusion that this critique contained a large element of truth.
There is no shortage of very fine books at many levels of complexity on statistics. The
theory of statistical procedures and the assumptions in statistical procedures are well explained
and widely available. The same statistical techniques might be applied to hydrologic data or to
the comparison of the value of the Japanese yen to the U.S. dollar. Statistical techniques are based
in mathematics and probability. The units attached to the data being studied are immaterial from
a statistical standpoint. What is important is the degree to which the data agree with the assumptions inherent in the statistical procedure being applied.
Similarly, there are many books on hydrology. Some of these books are quite general, some
are quite theoretical, some are quite empirical, and none are really exhaustive. The problem with
hydrology is that it is, in practice, very messy. For example, we can present in great detail the
mathematical development of equations describing the overland flow of water on planes of various types and how flow profiles develop and how runoff hydrographs result at the lower end of
these planes. There exist very elegant solutions for these problems, albeit often numerical
procedures are required to arrive at these solutions. With rapid advances in computing technology, this presents a rapidly diminishing problem.
The real problem as I see it is that we have developed an elegant solution to a nonexistent
problem. In my lifetime I have observed many rainfall-runoff events and have rarely seen the
type of flow described above except in artificial situations such as parts of parking lots or streets
covering a tiny fraction of a drainage basin. If there is any overland flow, before it goes very far
flow concentration develops and the overland flow "planes" become very nonuniform.
Does that mean it is wrong to develop and present these idealized equations? Does that
mean it is wrong to use models that contain these equations to develop runoff hydrographs? No!
It simply means that one must be aware of the relationships between the mathematics of the
model and the actual hydrology that is occurring. Through proper selection of roughness coefficients and other coefficients in such models, good estimates of runoff hydrographs may result.
Yet that does not mean that the model actually describes in exact detail the hydrologic processes
that are occurring. We must not confuse actual hydrologic processes with models of these
processes.
On numerous occasions I have seen those practicing hydrology confusing hydrologic models with actual hydrologic systems. The complexity, the nonhomogeneity, the dynamic nature of
actual hydrologic systems are not recognized. The uncertainty inherent in parameters used by hydrologic models to particularize the model to a specific catchment or hydrologic problem are not
recognized. The numbers produced by the model are taken as the true hydrologic response of the
actual hydrologic system. More disturbing, the algorithms that make up the model are taken as
true and exact representations of the hydrologic systems they purport to represent. Quite likely
the one using the hydrologic model has great skills in modeling and in computers but little
understanding of the complexity of hydrologic systems.
At this point one might be wondering why I have jumped on mathematical models when this
book is about statistics. The answer lies in my experience over the years that statistical methods
are often criticized for not being physically based and not representing what is actually occurring
in the field. Yet all hydrologic models, not just statistical models, are susceptible to this criticism.
Statistical models are often applied just as are mathematical models with little regard to the
assumptions in the models. Some take model results as truth, especially if the statistical or mathematical technique is complex. Others will reject model results on the basis that all assumptions
are not met. So basically, in hydrology, we face the same dilemmas whether we use mathematical or statistical models.
No model describes the actual and complete hydrology of anything but the simplest of
settings. Regardless of what approach we use toward solving an actual hydrologic problem,
compromise must be made with the methodology employed. One can never turn professional
judgment over to any particular hydrologic model whether the model is mathematical, statistical,
or some combination of the two. Any model must be seen as an aid to judgment and not as a
replacement for it.
There are no completely theoretical models and no completely statistical models. All models have components of both theory and statistics. Both are techniques for quantifying our
understanding and our observations of hydrologic processes. The presence of theory or statistics
may not be a formal presence, but it is there. This leads to the conclusion that all models have
statistical components to some degree. Any constants that are estimated based on observations,
even observations formalized into tables like Manning's n values, have been determined by formal or informal application of statistics. Any statistical model should be formulated based on
some understanding of the system being modeled. This understanding may be brought into the
model through a conceptual structure of the model. These conceptual components are what bring
hydrology into the model as opposed to having a purely statistical model. In my view, one should
not ignore hydrology when developing models for use in hydrology no matter how sophisticated
the statistical techniques that are being used. To the extent that hydrologic knowledge is used in
structuring a statistical model, the model may be said to contain conceptual components. Statistical models should not be developed by simply throwing data on every conceivable variable into
some computerized statistical routine and hoping for the best.
As far as the hydrologist is concerned, statistics is not an end in itself. Statistics is a tool that
may help one to understand hydrological data. The fact that to hydrology, statistics is a tool must
be kept foremost in mind. It must also be kept in mind that statistics is just one of several tools
available for application in hydrology.
Hydrologic processes are not driven by principles of statistics but by physical, chemical,
and biological principles, the so-called "Laws of Nature". Often the hydrologic setting is of such
complexity that the underlying component hydrologic processes cannot be expressed in such a
way as to yield a suitable computational framework for describing the system. Perhaps the
mix of surface soil properties, land uses, topography, and so forth are such that the setting of a
particular hydrologic problem cannot be adequately described. Perhaps the complexity and
heterogeneity of the system is such as to preclude deterministic modeling. Perhaps data are available on a response variable such as stream flow, water quality, or ground water level, but not on
the causative variables of rainfall, evaporation, infiltration, and so on. In such a case statistical
techniques may be needed in an effort to uncover descriptive behavioral relationships among the
data. Such relationships are not cause-effect relationships but descriptive relationships. The
relationships may support hypotheses concerning cause and effect but do not conclusively establish such relationships.
Over the past 20 years I have seen many inappropriate applications of statistics in hydrology. I have seen hydrologists stake their reputation as hydrologists on statements made based on
poor knowledge of statistics. I have also seen statisticians make far-reaching conclusions with a
very elementary knowledge of hydrology; here the argument goes "the data show . . .". The data
are separated from their hydrologic reality and analyzed as pure numbers!
One thing that has compounded the problem of inappropriate use of statistics in hydrology
(or any other field, I suspect) is the ready availability of powerful statistical software that is easy
to use. I applaud the availability of this software but shudder at some of the applications that are
made with it.
Sometimes a statistical procedure is improperly applied or applied in inappropriate circumstances. The numbers generated by a statistical analysis are then venerated as absolute truth. It
would be better to apply a technique recognizing and admitting its shortcomings and then using the
results as a guide rather than religiously adopting the results and claiming they represent reality.
This long introduction has been composed to impart some of my hydrologic-statistical
modeling philosophy and to alert the reader that this book will emphasize the assumptions
inherent in statistical techniques and the consequences of violating these assumptions. Statistical
techniques will be explained at the practical level without many derivations and proofs. References to these will be given. The book will be most useful to someone having at least an elementary knowledge of mathematical statistics and hydrology. This book addresses the interface of
these two disciplines.
The question naturally arises as to what is meant by hydrology in this book. Hydrology
broadly defined is the study of water. The Federal Council for Science and Technology (1962)
defined hydrology as
the science that treats of the waters of the Earth, their occurrence, circulation, and
distribution, their chemical and physical properties, and their reaction with their environment, including their relation to living things. The domain of hydrology embraces
the full life history of water on the earth.
This definition is more or less used in this book. The definition is broad and includes topics
some may consider to be more proper to geology, engineering, environmental science, biology,
chemistry, paleontology, or some other science. Some may even feel it includes aspects which are
nonscientific. By using this definition, when the word "hydrology" is used, it includes these other
areas as well.
Statistics will be considered in a limited sense in the context of this book. Statistics will be
defined as
a science devoted to developing an understanding of a system that can be used to make
inferences about the system based on observation relative to that system.
Models are often used in developing this understanding and in making inferences. Model is
a general term that will be taken to mean
a collection of physical laws and empirical observations written in mathematical terms
and combined in such a way as to produce estimates based on a set of known and/or assumed conditions.
There are many ways of collecting physical laws and empirical observations and of combining them to produce a model. Models can generally be represented as

O = f(I, P) + e

where O represents the outputs or quantities to be estimated; f(...) represents the mathematical structure of the model; I represents inputs to the model, boundary conditions, and initial conditions; P represents parameters that help particularize the model to a specific situation; and e represents differences between what actually occurs, O_a, and what the model predicts, O_p.
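This general model form can be put into a minimal code sketch. The linear function f, the parameter value k, and the input and observed values below are hypothetical placeholders, not taken from the text:

```python
# Sketch of the general model form O = f(I, P) + e.
# The function f and all numeric values are hypothetical illustrations.

def f(inputs, params):
    """A simple linear model: each output is k times the input."""
    k = params["k"]
    return [k * i for i in inputs]

inputs = [10.0, 20.0, 15.0]   # I: e.g., rainfall depths (assumed values)
params = {"k": 0.4}           # P: a runoff coefficient (assumed value)
observed = [4.5, 7.5, 6.2]    # O_a: what actually occurred (assumed values)

predicted = f(inputs, params)                              # O_p: model output
errors = [oa - op for oa, op in zip(observed, predicted)]  # e = O_a - O_p
print(predicted)  # [4.0, 8.0, 6.0]
print(errors)
```

The errors e collect everything the structure f and parameters P fail to capture about the real system.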
There are many ways of classifying models. Some people draw sharp distinctions between
statistical models and other models. In practice one cannot do a thorough modeling exercise
without drawing on statistics in some way. Often some type of statistical work has to be done to
come up with values for parameters for a model that might otherwise be considered a nonstatistical model. Thus, the parameters of the model become some function of observations. If another
set of observations were used presumably different parameter values would result. Since observations (data) in hydrology are generally thought of as random variables and any function of a
random variable is a random variable, the parameters for the model effectively become random
variables and thus a statistical element enters a model that might otherwise not be considered as
a statistical model.
Broadly speaking, quantitative hydrologic models fall on a continuous spectrum of model
"types" ranging from completely deterministic on the one hand to completely stochastic on the
other. A completely deterministic model would be one arrived at through consideration of the underlying physical relationships and would require no experimental data for its application.
Statistical models range in complexity from estimating the most likely outcome or result of an
experiment to describing in detail a sequence (time series) of outcomes that mimic actual outcomes. All statistical approaches rely on observations. The mathematical techniques used to
extract the information contained in the observations may be as simple as computing an average
or so complex as to require thousands of stochastic simulations.
Most hydrologic models fall somewhere between the extremities of this model spectrum.
Often such models are termed parametric models. A parametric model may be thought of as deterministic in the sense that once model parameters are determined, the model always produces
the same output from a given input. On the other hand, a parametric model is stochastic in the
sense that parameter estimates depend on observed data and will change as the observed data
changes. A stochastic model is one whose outputs are predictable only in a probabilistic sense.
With a stochastic model, repeated use of a given set of model inputs produces outputs that are not
the same but follow certain statistical patterns. A statistical model is one arrived at by applying
statistical methods to a set of data to produce an estimation procedure. Multiple regression models are examples of statistical models. In this sense, all stochastic models are statistical models
but all statistical models are not stochastic models.
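The distinction between a parametric model that is deterministic in use and a stochastic model can be sketched as follows; the model forms and parameter values are invented for illustration:

```python
# A parametric model returns the same output for the same input once its
# parameters are fixed; a stochastic model's outputs vary from run to run.
import random

def parametric_model(x, k=0.4):
    # k would in practice be estimated from observed data
    return k * x

def stochastic_model(x, rng, k=0.4, sigma=1.0):
    # same structure, plus a random component drawn each call
    return k * x + rng.gauss(0.0, sigma)

assert parametric_model(25.0) == parametric_model(25.0)  # always repeatable

rng = random.Random(1)
a = stochastic_model(25.0, rng)
b = stochastic_model(25.0, rng)
print(a, b)  # same input, two different outputs
```

Repeated runs of the stochastic model trace out a statistical pattern rather than a single value.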
No matter how simple the hydrologic system or how complex the hydrologic model, the model
is always an approximation to the system. There are no hydrologic models-deterministic, stochastic, or combined-that represent exactly anything but the most trivial of hydrologic systems.
The digital computer has made possible great advances in all types of hydrologic models.
These advancements are noteworthy for both stochastic and deterministic models and have led
some hydrologists to vigorously adopt the philosophy that all hydrologic problems should be
attacked stochastically and some the philosophy that they should be attacked deterministically. The
purpose of this book is not to promote statistical or stochastic models but to present some basic
statistical concepts that have been found useful as aids for the solution of hydrologic problems.
Many hydrologic problems can best be solved through the joint application of the various
modeling methods. For instance, it may be possible to adequately predict the runoff hydrograph
from a simple watershed deterministically given the rainfall input. It is unlikely, however, that
rainfalls that will occur during the life of a water resources project will be deterministically predictable. Thus, one approach to project evaluation would be a stochastic simulation of rainfall,
deterministic conversion of the rainfall to streamflow, and a statistical analysis of the resulting
streamflows.
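The combined approach just described can be sketched in code: stochastic simulation of rainfall, a deterministic conversion of rainfall to flow, then a statistical summary of the resulting flows. The distributions, abstraction, and coefficient values are all illustrative assumptions:

```python
# Stochastic rainfall -> deterministic runoff -> statistical analysis.
import random
import statistics

rng = random.Random(42)

# 1. Stochastic simulation of 1000 annual rainfalls (exponential, mean 40).
rainfall = [rng.expovariate(1 / 40.0) for _ in range(1000)]

# 2. Deterministic conversion of each rainfall to runoff
#    (a simple abstraction-and-coefficient model, assumed for illustration).
def runoff(p, abstraction=10.0, coeff=0.5):
    return max(0.0, coeff * (p - abstraction))

flows = [runoff(p) for p in rainfall]

# 3. Statistical analysis of the simulated flows.
print(round(statistics.mean(flows), 1), round(statistics.stdev(flows), 1))
```

Each stage uses the kind of model best suited to it, which is the point of the joint approach.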
Regardless of the type of model that is used, model parameters must be determined in some
way from observed hydrologic data. The validity and applicability of a model depend directly on
the characteristics of the data used to estimate model parameters. A model can be no better than
the data available for parameter estimation. The data used for parameter estimation must be representative of the situation in which the model is going to be used. Obviously, if one is attempting to model streamflow from an urban area, model parameters cannot be estimated from forested
watersheds. Similarly, future hydrologic behavior of a watershed can be modeled based on past
observations only if available historical data are representative of future conditions. If drastic
land use changes are to be made, then the model parameters must be adjusted accordingly.
All techniques used for hydrologic analysis rely on assumptions. Often the strict validity of
the analysis depends on how well the true system meets these assumptions. This is certainly true
of statistical models and statistical methods applied to hydrologic systems.
There are no statistical procedures whose assumptions exactly match particular hydrologic
systems. Likewise there are no hydrologic systems that exactly meet the assumptions made in
any particular hydrologic model.
With this in mind one is forced to the conclusion that models cannot yield an exact solution
to any realistic hydrologic problem. Models must be treated as a tool that can be used to gain
insight and to arrive at potential outcomes in a given hydrologic setting, but the final decision regarding any hydrologic process rests with the hydrologist, not the models. The hydrologist may
choose to adopt a solution generated from modeling considerations, but this decision must be
based on the hydrologist's convictions that the solution is hydrologically sound and not simply
on how well the model describes the data. How close the final real solution is to the model
solution will certainly depend on how well the physical setting matches the assumptions of the
modeling techniques employed. It is the hydrologist who must make the determination as to
the relationship between the model result and hydrologic reality.
The fact that a statistical modeling procedure requires assumptions that are not strictly met
in a particular hydrologic setting does not mean that statistically derived results are of no value.
Again, the statistical modeling technique is used to provide insight into the problem at hand and
not the final result. Even when it is known that certain assumptions are violated, useful information can often be obtained from a statistical modeling effort.
Throughout this book, assumptions that accompany the statistical technique being discussed
will be set forth and discussed from a hydrologic standpoint. The potential problems associated
with violating the assumptions will be discussed. One of the frustrations that is constantly faced
in using statistical models to represent hydrologic systems is trying to determine if assumptions
are met or to what extent assumptions are not met for a particular set of data and the effect of not
meeting assumptions on conclusions reached using the method.
One might come away feeling that it is inappropriate to use statistics in hydrology. That is not
the case at all. What is inappropriate is for an analyst to relegate absolute hydrologic authority to
a statistical analysis at the expense of hydrologic knowledge of the system and to give no weight
to other tools available, such as mathematical models and common sense.
Deterministic hydrologic models, whether numerical or conceptual, suffer the same problems in terms of assumptions as do statistical models. Rarely are hydrologic models adequately
tested over the full range of conditions for which they will be applied. Rarely are all of the assumptions associated with hydrologic models actually set forth. For instance, one assumption
inherent in hydrologic models is that a basin's hydrologic response to a rare or extreme event can
be modeled with the same algorithms used to model common or predominate events.
In hydrologic frequency analysis, the criticism is often justifiably leveled that estimating a
rare flood (say, a 500-year flood) from a record of 20 or 30 years, none of which is extraordinarily large, is fraught with the possibility of error. The question is asked: how could
relatively common flow levels have information embedded in them that would determine the
magnitude of a 500-year event? Said in another way by example, in Oklahoma most annual peak
flows from smaller watersheds are generated from thunderstorms that arise over the Great Plains
of the central United States. The really big floods may be the result of a hurricane sweeping in
from the Gulf of Mexico and traveling over Oklahoma. How can flow data from thunderstorms
predict flow magnitudes of hurricane-related floods?
But the same questions apply to deterministic hydrologic models. If a model is formulated
and parameters estimated based on common flow levels, how can one be sure these same parameter values and algorithms apply to extreme events?
In both cases, flood frequency analysis and modeling, information is gained about the possible magnitude of the 500-year event. Certainly neither estimate is exact! In addition to these
estimates the hydrologist should do some field work, look at channel capacities, possibly look
for evidence of extreme floods in the geologic past (paleohydrology), and rely on as much
hydrologic reasoning as possible to arrive at the final estimate of the 500-year event. One should
additionally attempt to place some type of uncertainty bands on the estimate.
What is being suggested is that responsibility for a hydrologic estimate rests squarely on the
hydrologist rather than on some analytic technique. One cannot blame the log-Pearson Type III
distribution for making a bad flood frequency estimate. The problem is not the distribution itself
(after all the distribution is just a mathematical equation) but the inappropriate application of the
distribution in making the estimate. One cannot blame a hydrologic model if a hydraulic structure fails because the flow estimated by the model was in error. One may conclude that the model
was inappropriate but it was the hydrologist that made the estimate using the model as a tool.
HYDROLOGIC DATA
Hydrologic data seem to be simultaneously abundant and scarce. We are deluged with data
on rainfall, temperature, snowfall, and relative humidity from around the world on a daily basis
in newspapers, radio and television reports, and on world-wide computer information networks.
Many agencies worldwide collect and archive hydrologic data on streamflow, lake and reservoir
levels, ground water elevations, water quality measures, and other aspects of the hydrologic
cycle. These data are available in many different forms. Currently access to hydrologic data is
being rapidly improved as the data is made available over electronic networks.
Yet in the face of this apparent abundance, data on a particular aspect of the hydrologic cycle
at a particular location for a particular time period are often inadequate or completely lacking. It
is often the task of the hydrologist to use any data that can be found having some application to
the problem at hand, hydrologic models of various kinds, plus their own hydrologic knowledge
to explain past, present, or anticipated hydrologic behavior of the system under study. Statistical
procedures are used to evaluate the data, transfer the data to the problem at hand, select models
and model parameters, evaluate model predictions, organize one's personal conception of how
available data and knowledge come to bear on the problem, make predictions of future behavior
of the system, and many other aspects of hydrologic problem-solving.
Hydrologic data are generally presented as values at particular times, such as a river stage at
a particular time, or values averaged over time, such as the annual flow for a stream for a particular year. Aggregating data into averages over time intervals may cause a loss of information if
the variability of the process within the time period is of interest. Conversely, aggregation may
make it possible to more clearly visualize long-term trends because short-term variations about
the trend may be removed. The variability from observation to observation in a time series of hydrologic data may be very rapid and significant or very minor. Generally systems having a lot of
storage vary more slowly than systems lacking that storage. Figure 1.1 is a plot of the water surface elevation of the Great Salt Lake near Salt Lake City, Utah. This figure shows that during the
period of this record, water level changes of about 20 feet have occurred but year-to-year change
is relatively slow, with the exception of 1982-1984, when a rise of about 4 feet per year occurred
and in the late 1980s when the level dropped rather quickly.
Figure 1.2 shows the annual peak discharge for the Kentucky River near Salvisa, Kentucky.
There is little year-to-year carry-over or storage in this river system, so the flows vary more or
less randomly from one year to the next.
Figure 1.3 shows the water surface elevation of Devils Lake in North Dakota. The behavior of this lake is puzzling in that it has gone from nearly 1440 feet in elevation in 1867 to 1401
feet in 1940 in an almost continuous decline, at which point an erratic but steady increase in elevation began until it reached 1447 feet in 1999.
Fig. 1.1. Water surface elevation of the Great Salt Lake near Salt Lake City, Utah.
Fig. 1.2. Annual peak flows on the Kentucky River near Salvisa, Kentucky.
Fig. 1.3. Water surface elevation of Devils Lake, North Dakota.
In the case of the Salt Lake data, a model that estimated the water level in one year based solely on the level the previous year might produce reasonable estimates. The form of such a model would be y_t = y_(t-1), where y_t is the water level at time t and y_(t-1) is the water level at the previous time t - 1. Such a model may give a better prediction of the lake level in year t than would a model y_t = ȳ, where ȳ is the average lake level. The opposite is the case for the Kentucky River peak flow data. Here y_t = ȳ would be better than y_t = y_(t-1). The previous year's flow is of little value in predicting the current year's flow.
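The contrast between the two naive models can be demonstrated numerically. The two series below are synthetic stand-ins, not the actual lake and river records: a small-step random walk for a slowly varying, high-storage system and an independent sequence for a year-to-year random one:

```python
# Comparing y_t = y_(t-1) (persistence) with y_t = ybar (long-term mean)
# on two synthetic series. Series parameters are illustrative assumptions.
import random

def sum_sq_error(series, predictor):
    ybar = sum(series) / len(series)
    return sum((series[t] - predictor(series, t, ybar)) ** 2
               for t in range(1, len(series)))

persistence = lambda s, t, ybar: s[t - 1]
mean_model = lambda s, t, ybar: ybar

rng = random.Random(0)
lake = [0.0]                      # slowly varying: a small-step random walk
for _ in range(200):
    lake.append(lake[-1] + rng.gauss(0, 0.2))
river = [rng.gauss(100, 30) for _ in range(200)]  # independent year to year

print(sum_sq_error(lake, persistence) < sum_sq_error(lake, mean_model))    # True
print(sum_sq_error(river, mean_model) < sum_sq_error(river, persistence))  # True
```

Persistence wins on the slowly varying series; the long-term mean wins when successive values are independent.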
A model for Devils Lake would be difficult to surmise based simply on lake level data,
because even a reasonable estimate for the long-term average lake level could not be determined
on this record of over 100 years. Simply based on the data, one cannot determine the maximum
elevation reached prior to 1867 or what elevation the lake might achieve in the absence of
human interference after 1999. Presumably, physical and hydrologic information would shed
some light on this problem. These considerations will be discussed in detail and quantified later
in the book.
In selecting data for model parameter estimation, it is important to establish that the data are
representative and homogeneous over time or can be adjusted for any nonhomogeneities that
may be present. If anything has occurred to cause a change in the characteristic being analyzed,
the data must either be adjusted to account for the change or analyzed in two sections: one before
the change and one after.
Some common causes of nonhomogeneities are relocating gages (especially rain gages),
diverting streamflows, constructing dams, watershed changes such as urbanization or deforestation, stream channel alterations and possibly weather modification, as well as natural events of a
catastrophic nature such as earthquakes, hurricane floods, and so forth. In some instances the data
can be corrected for changes. One possible adjustment would be by reverse reservoir routing to
determine what streamflows would have been had a reservoir not been constructed. Some
changes such as gradual urbanization of a watershed are difficult to correct.
The statement that the data must be representative means, for example, that data from only
unusually wet or dry periods should not be used alone as this will bias the results of the analysis.
If there are only a few years of record available for analysis, the chances are good that the data
are not representative of the long-term variability that actually exists. Most stochastic models assume that the data being considered are homogeneous and representative.
The concept of the return period of hydrologic events plays an important role in hydrology.
The return period of an event is defined as the average elapsed time between occurrences of an
event with a certain magnitude or greater. For example, a 25-year peak discharge is a discharge
that is equaled or exceeded on average once every 25 years over a long period of time. It does not
mean that an exceedance occurs every 25 years, but that the average time between exceedances
is 25 years. An exceedance is an event with a magnitude equal to or greater than a certain value.
Sometimes the actual time between exceedances is called the recurrence interval. With this
definition for recurrence interval, the average recurrence interval for a certain event is equal to
the return period of that event. In this book, recurrence interval is used in the same sense as return
period.
Of course, the concept of return period can also be applied to low flows, droughts, shortages,
and so on. In this case the return period would be the average time between events with a certain
magnitude or less. Such an event might still be called an exceedance in the sense that the severity of a drought exceeds some preset level.
Regardless of whether the return period is referring to an event greater than some value or
to an event less than some value, the return period can be related to a probability of an
exceedance. If an exceedance occurs on the average once every 25 years, then the probability or
chance that the event occurs in any given year is 1/25 = 0.04, or 4%. Probability, p, of an event occurring in any one year and return period, T, in years, are thus related by

p = 1/T
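The relation p = 1/T works in both directions, from return period to annual exceedance probability and back; a minimal sketch:

```python
# Return period T (years) and annual exceedance probability p: p = 1/T.

def exceedance_probability(T):
    """Annual probability of equaling or exceeding the T-year event."""
    return 1.0 / T

def return_period(p):
    """Return period of an event with annual exceedance probability p."""
    return 1.0 / p

print(exceedance_probability(25))  # 0.04, a 4% chance in any given year
print(return_period(0.01))         # 100.0: the 1%-chance event is the 100-year event
```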
A collection of objects, if it contains all of the objects possible, is called the population. For example, 20 years of peak flow data from a certain river is a sample of the possible peak flows on the
river. A random sample is one that is selected in such a fashion that any other sample could have
resulted with equal likelihood. If the 20 years of peak flow data are considered a random sample,
then one is assuming that these 20 years of data are just as likely as any other possible 20 years
of data and vice versa.
In some types of analysis it is assumed that the order of occurrence of the data is not important, only the data values are important. The traditional hydrologic frequency analysis is an example of this. If a sample contains elements that are independent of each other, then the order of
occurrence of the data is not important. This is the same as saying that the magnitude of an element in the sample is not affected by the temporal pattern of the other elements in the sample.
Each element in the sample might be thought of as a random sample of size 1.
On the other hand, there are situations where the order of occurrence of the events is important. In designing a storage reservoir to meet projected water demands, the fact that low flows
tend to follow low flows makes it necessary to have a larger reservoir than would be required if
the low flows occurred randomly throughout time. This is known as persistence and indicates the
elements of the sample are not independent of each other. In this case the entire sequence of data
values must be considered the random sample. That is, the sequence contained in the sample
is assumed to be as likely as any other sequence. The individual events in the sample are not
independent.
If one wanted a random sample consisting of 7 observations of daily flows on a river during
a particular year, the daily flows in a particular week of that year could not be used. This is because
the flow on the second, third, and so on, day of the week would be dependent on the flows on the
preceding days. The flow on day 2, for example, would not represent all possible daily flows but
would be highly dependent on the flow during day 1. To get a random sample of daily flows, each
of the 365 daily flows would have to have an equal chance of being selected. The sample of flows
during the 7 consecutive days could be considered as a random sample of size 1 of weekly flows
(if the week was randomly selected) but not a random sample of size 7 of daily flows.
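The distinction between the two sampling schemes can be sketched as follows; the daily flows are synthetic values, not an actual river record:

```python
# A valid random sample of 7 daily flows gives every day of the year an
# equal chance of selection; 7 consecutive days do not, because successive
# daily flows are dependent. Flow values are synthetic illustrations.
import random

rng = random.Random(3)
daily_flows = [rng.lognormvariate(3.0, 0.5) for _ in range(365)]

random_sample = rng.sample(daily_flows, 7)        # random sample of size 7
start = rng.randrange(0, 359)
consecutive_week = daily_flows[start:start + 7]   # NOT a random sample of size 7
print(len(random_sample), len(consecutive_week))
```

The consecutive week is, at best, one random observation of weekly flow if the week itself was chosen at random.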
In any hydrologic data there are errors of various kinds. The errors include measurement
errors, data transmittal errors, processing errors, and others. The errors may be systematic errors
and show up as a bias in the data or they may be random errors. In most error analysis it is
assumed that the errors are random errors and follow the normal distribution. The treatment of
hydrologic data contained in this book is not concerned so much with these types of errors as it
is with sampling errors.
Sampling error is a misnomer in that there are no errors in the usual sense involved. Sampling errors should more properly be called sampling variability, sampling fluctuation, or sample
uncertainty. What is meant by sampling error is simply that a random sample has statistical properties that are similar to the population parameters but only equal to the population properties as
the sample size gets very large (or the entire population is sampled). If two samples are selected
from the same population, their statistical properties will again be similar but equal to each other
only as the sample size gets very large.
For example, we may desire to know the average annual rainfall at a given location. Assume
we can measure exactly, that is without any measurement error, the rainfall at the desired
location. Measurements are collected over a 5-year period and the average annual rainfall is
calculated without error in the calculations. A second 5-year period elapses and data from this
period is used to calculate the average annual rainfall. The two estimates will be different.
Neither will equal the true average annual rainfall. The differences between the estimated values and the true value are the sampling errors. Note that we cannot exactly determine the sampling error
since the true average is not known.
Thus, variability or uncertainty in the statistical properties of a population based on estimates of the properties from samples is called sampling error. It is clear that errors in the sense of
mistakes, faulty data, or carelessness are not involved in sampling errors. Sampling error is simply an inherent property of random samples. If it weren't for sampling errors, this book or hundreds of others on statistics would not be needed since populations would then be completely
specified by any sample from that population.
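The rainfall illustration above can be sketched numerically. The population here is an assumed normal distribution with a known mean, standing in for the unknowable true rainfall process:

```python
# Sampling error: two samples from the same population yield different
# estimates of the mean; larger samples yield estimates closer to truth.
import random
import statistics

rng = random.Random(7)
true_mean = 40.0  # assumed "true" average annual rainfall, for illustration

def sample_mean(n):
    return statistics.mean(rng.gauss(true_mean, 10.0) for _ in range(n))

m5a = sample_mean(5)    # first 5-year record
m5b = sample_mean(5)    # second 5-year record
m500 = sample_mean(500)

print(m5a, m5b)  # two 5-year estimates: similar, but not equal
print(m500)      # a much larger sample: close to the true mean
```

Neither 5-year estimate equals the true mean, and the difference between each estimate and the truth is the sampling error.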
Example 1.1. The mean annual suspended sediment load for the Green River near Munfordville, Kentucky, can be estimated from the data contained in Appendix B. This data and the resulting
estimated mean annual suspended sediment load may contain many types of errors. Systematic
errors could result if the flow was sampled for sediment only when the depth of flow exceeded a
preset stage. This is because low flows would not be sampled.
Generally, the sediment concentration in low flows is less than that in higher flows. Thus a
built-in bias or systematic error is produced. Measurement errors could result from plugged samplers, samplers not properly aligned with the direction of flow, allowing the sampler to pick up
some bed load, and a number of other reasons. Data transmittal errors and processing errors can
result from mistakes in transcribing data from data forms, placing data in the wrong columns on
spreadsheets or data entry forms, illegibly written data, and other sources.
Sampling error can be illustrated by assuming that the tabulated data are exactly correct
(contain no systematic, measurement, transmittal, or processing errors). If the mean annual suspended sediment load is calculated for each successive 5-year period, the results are 640,827;
484,739; 497,604; and 460,392 tons per year. Under the no error assumption, 4 different values
of the mean annual suspended sediment discharge have been calculated, each of which contains no errors, yet none of which is the same. The difference in the 4 estimates is caused by natural
variability in the phenomena (sediment) being sampled. This difference is called sampling error.
If conditions on the watershed contributing to the Green River near Munfordville never changed and if climatic conditions never changed, then theoretically the sampling error could be made as small as desired by increasing the sample size above the 5 years used in this illustration. A practical limitation is imposed by the length of the available sediment load data record.
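The successive 5-year means in Example 1.1 can be sketched as follows. The annual loads below are hypothetical stand-ins (in thousands of tons per year) chosen to approximately match the means quoted above, since the Appendix B record is not reproduced here:

```python
# Mean annual load for each successive 5-year block of a 20-year record.
# Annual values are hypothetical stand-ins for the Appendix B data.
loads = [520, 810, 640, 455, 779,   # years 1-5
         390, 612, 501, 488, 433,   # years 6-10
         702, 398, 515, 460, 413,   # years 11-15
         377, 566, 449, 502, 408]   # years 16-20 (thousand tons/yr)

block_means = [sum(loads[i:i + 5]) / 5 for i in range(0, len(loads), 5)]
print(block_means)  # four differing 5-year means: sampling error at work
```

Each block mean is an error-free calculation on error-free data, yet no two agree; the spread among them is the sampling error.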
Much of the statistical machinery discussed in this book is concerned with sampling errors
and the estimation of population characteristics from samples of data. The fact that sampling
errors are inherent in random data does not mean, however, that statistical manipulations and
sophistication can in any way overcome faulty data. The quality of any statistical analysis is no
better than the quality of the data used. It can be worse but no better. Furthermore, statistical
considerations should not be used to replace judgment and careful thought in analyzing hydrologic data. In many instances some intelligent thought is worth reams of computer output based on a statistical analysis of some data. Statistics should be regarded as a tool, an aid to understanding, but never as a replacement for useful thought.
Rarely will one find a hydrologic problem that exactly fulfills all of the requirements for the
application of one statistical technique or another. Two choices are thus available. One can redefine the problem so that it meets the requirements of the statistical theory and thus produce an
"exact" answer to the artificial problem. The second approach is to alter the statistical technique
where possible and then apply it to the real problem realizing that the results will be an approximate answer to the real problem. In this case the degree of the approximation depends on the
severity of the violated assumptions. This latter approach is preferable and requires knowledge
of available statistical techniques, of assumptions and theory underlying the techniques, and of
the consequences of violating the assumptions. It is toward this latter approach that this book is
oriented.
Most of the examples and exercises used in this book were selected for pedagogical reasons,
not to promote a particular technique. Thus, when a problem involves fitting a normal distribution to annual peak flow, the purpose of the problem revolves around learning about the normal
distribution and is not to demonstrate that a normal distribution is applicable to peak flows. Similarly, many examples and problems had to be simplified so that they could be realistically solved
with attention focused on the statistical technique and not the many fascinating intricacies of
most real problems. That is not to say the techniques do not apply to real problems; quite the
contrary. However, most real problems involve multiple aspects, lots of data, and many considerations other than statistical ones. Rather than get involved in these other important aspects,
many of the examples and problems are idealizations of real situations.
Because the exercises were selected as a learning aid, it will be instructive to at least read
the problems at the end of each chapter. Many of the problems present useful results that supplement the material in that chapter.
Many actual problems in hydrology require considerable computation. Digital computers are
used for this purpose. Special statistical-numerical procedures have been developed to simplify
the computations involved and improve the accuracy of the results obtained from many of the
analyses presented in this book. These procedures are not presented here. Rather the emphasis is
on the principles involved. Some statistical techniques such as geostatistics and multivariate techniques often require extensive calculation and considerable efficiency is gained by using specialpurpose programs incorporating numerical shortcuts and safeguards against roundoff errors.
Finally, there are many important areas of statistical analysis applicable to hydrology that
are not included in this book. These omitted techniques for the most part require knowledge of
the material contained in this book before they can be applied. Thus, this book is an introduction
to statistical methods in hydrology. Furthermore, the book is not intended as a handbook or
statistical "cookbook" for hydrologists. The purpose of this book is to enable the reader to better
apply statistical methods to hydrologic problems through a knowledge of the methods, their
foundations and limitations.
2. Probability and Probability Distributions: Basic Concepts
HYDROLOGIC PROCESSES may be thought of as stochastic processes. Stochastic in this
sense means involving an element of chance or uncertainty where the process may take on any of the
values of a specified set with a certain probability. An example of a stochastic hydrologic process
is the annual maximum daily rainfall over a period of several years. Here the variable would be the
maximum daily rainfall for each year and the specified set would be the set of positive numbers.
The instantaneous maximum peak flow observed during a year would be another example
of a stochastic hydrologic process. Table 2.1 contains such a listing for the Kentucky River near
Salvisa, Kentucky. By examining this table it can be seen that there is some order to the values
yet a great deal of randomness exists as well. Even though the peak flow for each of the 99 years
is listed, one cannot estimate with certainty what the peak flow for 1998 was. From the tabulated
data one could surmise that the 1998 peak flow was "probably" between 20,600 cfs and 144,000
cfs. We would like to be able to estimate the magnitude of this "probably". The stochastic nature
of the process, however, means that one can never estimate with certainty the exact value for the
process (peak discharge) based solely on past observations.
The definition of stochastic given above has some theoretical drawbacks, as we shall see.
Hydrologic processes are continuous processes. The probability of realizing a given value from
a continuous probability distribution is zero. Thus, the probability that a variable will take on a
certain value from a specified set is zero, if the variable is continuous. Practically this presents no
problem because we are generally interested in the probabilities that the variate will be in some
range of values. For instance, we are generally not interested in the probability that the flow rate
will be exactly 100,000 cfs but may desire to estimate the probability that the flow will exceed
100,000 cfs, or be less than 100,000 cfs, or be between 90,000 and 120,000 cfs.
Table 2.1. Peak discharge (cfs), Kentucky River near Salvisa, Kentucky

Year    Flow    Year    Flow    Year    Flow
With this introduction, several concepts such as probability, continuous, and probability
distribution have been introduced. We will now define these concepts and others as a basis for
considering statistical methods in hydrology.
PROBABILITY
In the mathematical development of probability theory, the concern has been not so much
how to assign probability to events, but what can be done with probability once these assignments are made. In most applied problems in hydrology, one of the most important and difficult
tasks is the initial assignment of probability. We may be interested in the probability that a certain flood level will be exceeded in any year or that the elevation of a piezometric head may be
more than 30 feet below the ground surface for 20 consecutive months. We may want to determine the capacity required in a storage reservoir so that the probability of being able to meet the
projected water demand is 0.97. To address these problems we must understand what probability
means and how to relate magnitude to probabilities.
The definition of probability has been labored over for many years. One definition that is
easy to grasp is the classical or a priori definition:
If a random event can occur in n equally likely and mutually exclusive ways, and if na
of these ways have an attribute A, then the probability of the occurrence of the event
having attribute A is na/n, written as

prob(A) = na/n          (2.1)
This definition is an a priori definition because it assumes that one can determine before the
fact all of the equally likely and mutually exclusive ways that an event can occur and all of the
ways that an event with attribute A can occur. The definition is somewhat circular in that
"equally likely" is another way of saying "equally probable" and we end up using the word
"probable" to define probability. This classical definition is widely used in games of chance
such as card games and dice and in selecting objects with certain characteristics from a larger
group of objects. This definition is difficult to apply in hydrology because we generally cannot
divide hydrologic events into equally likely categories. To do that would require knowledge of
the likelihood or probability of the events, which is generally the objective of our analysis and
not known before the analysis.
The classical definition of probability takes on more utility in hydrology in terms of relative
frequencies and limits.
If a random event occurs a large number of times n and the event has attribute A in
na of these occurrences, then the probability of the occurrence of the event having
attribute A is
prob(A) = limit of na/n as n → ∞          (2.2)
The relative frequency approach to estimating probabilities is empirical in that it is based on
observations. Obviously, we will not have an infinite number of observations. For this probability estimate to be very accurate, n may have to be quite large. This is frequently a limitation in
hydrology.
The relative frequency concept of probability is the source of the relationship given in chapter 1 between the return period, T, of an event and its probability of occurrence, p:

p = 1/T          (2.3)
These two definitions of probability can be illustrated by considering the probability of getting heads in a single flip of a coin. If we know a priori that the coin is balanced and not biased
toward heads or tails, we can apply the first definition. There are two possible and equally likely
Fig. 2.1. Coin flipping experiment: estimated probability of a head versus number of trials.
outcomes, heads or tails, so n is 2. There is one outcome with heads, so na is 1. Thus the probability of a head is 1/2. If the coin is not balanced so that the two outcomes are not equally likely, we could not use the a priori definition. We would have to know the answer to our question before we could apply the a priori definition.
This is not the case when the relative frequency definition is used. Obviously we cannot flip
the coin an infinite number of times. We have to resort to a finite sample of flips. Figure 2.1 shows
how the estimate of the probability of a head changes as the number of trials (flips) changes. A
trend toward 1/2 is noted. This is called stochastic convergence toward 1/2. One question that might
be asked is, "Is the coin unbiased?" One's initial reaction is that more trials are needed. It can be
seen that the probability is slowly converging toward 1/2 but after 250 trials is not exactly equal to
1/2. This is the plight of the hydrologist. Many times more trials or observations are needed but are
not available. Still, the data do not clearly indicate a single answer. This is where probability
and statistics come into play.
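The coin-flipping experiment of figure 2.1 is easy to reproduce by simulation. The flip sequence below is generated pseudo-randomly, so the exact trace differs from the figure, but the stochastic convergence toward 1/2 behaves the same way:

```python
import random

random.seed(42)

# Estimate prob(heads) after each flip of a fair coin, as in figure 2.1.
flips = [random.randint(0, 1) for _ in range(10_000)]
heads = 0
estimates = []
for n, flip in enumerate(flips, start=1):
    heads += flip
    estimates.append(heads / n)

# The estimate wanders early but stochastically converges toward 0.5;
# even after many trials it is close to, not exactly, 1/2.
print(estimates[9], estimates[249], estimates[-1])
```

Plotting `estimates` against the trial number reproduces the general character of figure 2.1.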
Equation 2.2 allows us to estimate probabilities based on observations and does not require
that outcomes be equally likely or that they all be enumerated. This advantage is somewhat offset in that estimates of probability based on observations are empirical and will only stochastically converge to the true probability as the number of observations becomes large. For example,
in the case of annual flood peaks, only one value per year is realized. Figure 2.2 shows the probability of an annual peak flow on the Kentucky River exceeding the mean annual flow as a function of time starting in 1895. Note that each year additional data becomes available to determine
both the mean annual flow and the probability of exceeding that value. Here again, a convergence
toward 1/2 is noted yet not assured. In fact, there is no reason to believe that 1/2 is the "correct"
Fig. 2.2. Probability that the annual peak flow on the Kentucky River exceeds the mean annual peak flow.
probability since the probability distribution of annual peak flows is likely not symmetrical about
the mean.
If two independent sets of observations are available (samples), an estimate of the probability of an event A could be determined from each set of observations. These two estimates of
prob(A) would not necessarily equal each other nor would either estimate necessarily equal the
true (population) prob(A) based on an infinitely large sample. This dilemma raises an important concern for hydrologists: how many observations are required to produce "acceptable" estimates for the probabilities of events?
From either equation 2.1 or 2.2 it can be seen that the probability scale ranges from zero to
one. An event having a probability of zero is impossible, whereas one having a probability of one
will happen with certainty. Many hydrologists like to avoid the endpoints of the probability scale,
zero and one, because they cannot be absolutely certain regarding the occurrence or nonoccurrence of an event. Sometimes probability is expressed as a percent chance with a scale ranging
from 0% to 100%. Care must be taken to not confuse the percent chance values with true probabilities. A probability of one is very different from a 1% chance of occurrence as the former implies the event will certainly happen while the latter means it will happen only one time in 100.
In mathematical statistics and probability, set and measure theory are used in defining and
manipulating probabilities. An experiment is any process that generates values of random variables. All possible outcomes of an experiment constitute the total sample space known as the
population. Any particular point in the sample space is a sample point or element. An event is a
collection of elements known as a subset.
To each element in the sample space of an experiment a non-negative weight is assigned
such that the sum of the weights on all of the elements is 1. The magnitude of the weight is proportional to the likelihood that the experiment will result in a particular element. If an element is
quite likely to occur, that element would have a weight of near 1. If an element was quite unlikely
to occur, that element would have a weight of near zero. For elements outside the sample space,
a weight of zero is assigned. The weights assigned to the elements of the sample space are known
as probabilities. Here again, the word likelihood is used to define probability so that the definition becomes circular.
Letting S represent the sample space; Ei for i = 1, 2, ..., represent elements in S; A and B represent events in S; and prob(Ei) represent the probability of Ei, it follows that

prob(S) = Σ prob(Ei) = 1          (2.4)

and

prob(A) = Σ prob(Ei), the sum taken over all Ei in A          (2.5)
Fig. 2.3. Venn diagram illustrating a sample space, elements, and events.
Using notation from set theory and Venn diagrams, several probabilistic relationships can be illustrated. If A and B are two events in S, then the probability of A or B, shown as the shaded areas of figure 2.3, is given by

prob(A ∪ B) = prob(A) + prob(B) − prob(A ∩ B)          (2.6)

Note that in probability the word "or" means "either or both". The notation ∪ represents a union, so that A ∪ B represents all elements in A or B or both. The notation ∩ represents an intersection, so that A ∩ B represents all elements in both A and B. The last term of equation 2.6 is needed since prob(A) and prob(B) both include prob(A ∩ B). Thus, prob(A ∩ B) must be subtracted once so the net result is only one inclusion of prob(A ∩ B) on the right-hand side of the equation.
If A and B are mutually exclusive, then both cannot occur and prob(A ∩ B) = 0. In this case

prob(A ∪ B) = prob(A) + prob(B)
Figure 2.3 illustrates the case where events A and B are mutually exclusive, and figure 2.4 shows A and B when they are not mutually exclusive.
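Equation 2.6 can be checked directly on a small sample space of equally likely elements. The sample space and the events A and B below are arbitrary choices made for illustration:

```python
from fractions import Fraction

# Sample space: the integers 1..10, assumed equally likely (equation 2.1).
# Event A: even numbers; event B: numbers greater than 6.
S = set(range(1, 11))
A = {x for x in S if x % 2 == 0}
B = {x for x in S if x > 6}

def prob(event):
    # prob = (favorable elements) / (total elements), each element
    # carrying equal weight 1/|S|
    return Fraction(len(event), len(S))

# Equation 2.6: prob(A or B) = prob(A) + prob(B) - prob(A and B).
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)
```

Because elements 8 and 10 belong to both events, prob(A ∩ B) must be subtracted once, exactly as the text explains.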
If Ac represents all elements in the sample space S that are not in A, then

prob(A ∪ Ac) = 1

This statement says that the probability of A or Ac is certainty, since one or the other must occur. All of the possibilities have been exhausted. Since A and Ac are mutually exclusive,

prob(A) + prob(Ac) = 1   or   prob(A) = 1 − prob(Ac)          (2.7)

Equation 2.7 often makes it easy to evaluate a probability by first evaluating the probability that an outcome will not occur.
An example is evaluating the probability that a peak flow q exceeds some particular flow qo. A would be all q's greater than qo and Ac would be all q's less than qo. Because q must be either greater than or less than qo, prob(q > qo) = 1 − prob(q < qo). We show later that for continuous random variables prob(q = qo) = 0.
If the probability of an event B depends on the occurrence of an event A, then we write prob(B|A), read as the probability of B given A, or the conditional probability of B given A has occurred. The prob(B) is conditioned on the fact that A has occurred. Referring to figure 2.4 it is apparent that conditioning on the occurrence of A restricts consideration to A. Our total sample space is reduced to A, and

prob(B|A) = prob(A ∩ B)/prob(A)          (2.8)

assuming of course that prob(A) ≠ 0. Equation 2.8 can be rearranged to give the probability of A and B as

prob(A ∩ B) = prob(A) prob(B|A)          (2.9)

Now if prob(B|A) = prob(B), we say that B is independent of A. In that case prob(A ∩ B) = prob(A) prob(B). Thus the joint probability of two independent events is the product of their individual probabilities.
Example 2.1. Using the data shown in table 2.1, estimate the probability that a peak flow in excess
of 100,000 cfs will occur in 2 consecutive years on the Kentucky River near Salvisa, Kentucky.
Solution: From table 2.1 it can be seen that a peak flow of 100,000 cfs was exceeded 7 times in
the 99-year record. If it is assumed that the peak flows from year to year are independent, then the
probability of exceeding 100,000 cfs in any one year is approximately 7/99 or 0.0707. Applying
equation 2.9, the probability of exceeding 100,000 cfs in two successive years is found to be
0.0707 × 0.0707, or 0.0050.
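The calculation in example 2.1 can be written out as a short script. The counts 7 and 99 come from the example itself; the individual values of table 2.1 are not repeated here:

```python
# Example 2.1 restated: 7 of the 99 recorded annual peaks exceeded
# 100,000 cfs, so the annual exceedance probability is estimated by the
# relative frequency (equation 2.2 with finite n).
n_years = 99
n_exceed = 7

p_annual = n_exceed / n_years          # about 0.0707

# Equation 2.9 with independence assumed: the joint probability of
# exceedance in two successive years is the product of the annual
# probabilities.
p_two_years = p_annual ** 2            # about 0.0050

print(round(p_annual, 4), round(p_two_years, 4))
```

The independence assumption is the key step; the next example shows a case where it fails.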
Example 2.2. A study of daily rainfall at Ashland, Kentucky, has shown that in July the probability of a rainy day following a rainy day is 0.444, a dry day following a dry day is 0.724, a rainy
day following a dry day is 0.276, and a dry day following a rainy day is 0.556. If it is observed
that a certain July day is rainy, what is the probability that the next two days will also be rainy?
Solution: Let A be a rainy day 1 and B be a rainy day 2 following the initial rainy day. The probability of A is 0.444 since this is the probability of a rainy day following a rainy day.
prob(A ∩ B) = prob(A) prob(B|A)

Now, the prob(B|A) is also 0.444 since this is the probability of a rainy day following a rainy day. Therefore

prob(A ∩ B) = 0.444 × 0.444 = 0.197
The probability of two rainy days following a dry day would be 0.276 × 0.444, or 0.122.
Note that the probabilities of wet and dry days are dependent on the previous day. Independence does not exist. It can be shown that over a long period of time, 67% of the days will be dry and 33% will be rainy with the conditional probabilities as stated. If one had assumed independence, then the probability of two consecutive rainy days would have been 0.33 × 0.33 = 0.1089, regardless of whether the preceding day had been rainy or dry. Since the probability of a rainy day following a rainy day is much greater than following a dry day, persistence is said to exist.
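The arithmetic of example 2.2, together with the long-run wet-day fraction quoted above, can be verified as follows:

```python
# Two-state wet/dry persistence from example 2.2 (July, Ashland, Kentucky).
p_rain_after_rain = 0.444
p_rain_after_dry = 0.276

# Probability that the two days after a rainy day are both rainy
# (equation 2.9 applied along the chain of conditional probabilities).
p_two_wet = p_rain_after_rain * p_rain_after_rain   # about 0.197

# Long-run fraction of rainy days: solve the balance equation
#   p = p * 0.444 + (1 - p) * 0.276   for p.
p_wet = p_rain_after_dry / (1 - p_rain_after_rain + p_rain_after_dry)
print(round(p_two_wet, 3), round(p_wet, 2))
```

The balance equation simply requires that the long-run wet fraction reproduce itself from one day to the next, which yields the 33% wet / 67% dry split stated in the example.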
If B1, B2, ..., Bn are a set of mutually exclusive and exhaustive events, then for any event A

prob(A) = Σ prob(A|Bi) prob(Bi)          (2.10)

This is called the theorem of total probability. Equation 2.10 is illustrated by figure 2.5.
Example 2.3. It is known that the probability that the solar radiation intensity will reach a
threshold value is 0.25 for rainy days and 0.80 for nonrainy days. It is also known that for this
particular location the probability that a day picked at random will be rainy is 0.36. What is the
probability the threshold intensity of solar radiation will be reached on a day picked at
random?
Solution: Let A represent the threshold solar radiation intensity, B1 represent a rainy day, and B2 a nonrainy day. From equation 2.10, we know that

prob(A) = prob(A|B1) prob(B1) + prob(A|B2) prob(B2) = 0.25(0.36) + 0.80(0.64) = 0.602
BAYES THEOREM
By rewriting equation 2.8 in the form

prob(Bj|A) = prob(A ∩ Bj)/prob(A)

and then substituting from equation 2.10 for prob(A), we get what is called Bayes Theorem:

prob(Bj|A) = prob(A|Bj) prob(Bj) / Σ prob(A|Bi) prob(Bi)          (2.11)
As pointed out by Benjamin and Cornell (1970), this simple derivation of Bayes Theorem
belies its importance. It provides a method for incorporating new information with previous or
so-called prior probability assessments to yield new values for the relative likelihood of events
of interest. These new (conditional) probabilities are called posterior probabilities. Equation
2.11 is the basis of Bayesian Decision Theory. Bayes theorem provides a means of estimating
probabilities of one event by observing a second event. Such an application is illustrated in
example 2.4.
Example 2.4. The manager of a recreational facility has determined that the probability of 1000
or more visitors on any Sunday in July depends upon the maximum temperature for that Sunday
as shown in the following table. The table also gives the probabilities that the maximum temperature will fall in the indicated ranges. On a certain Sunday in July, the facility has more than 1000
visitors. What is the probability that the maximum temperature was in the various temperature
classes?
Temp class,    Prob of 1000        Prob of being     Prob of Ti given
Ti (°F)        or more visitors    in temp class     1000 or more visitors
<60            0.05                0.05              0.005
60-70          0.20                0.15              0.059
70-80          0.50                0.20              0.197
80-90          0.75                0.35              0.517
90-100         0.50                0.20              0.197
>100           0.25                0.05              0.025
                                   Total             1.000
Solution: Let Tj for j = 1, 2, ..., 6 represent the 6 intervals of temperature. Then from equation 2.11

prob(Tj | 1000 or more) = prob(1000 or more | Tj) prob(Tj) / Σ prob(1000 or more | Ti) prob(Ti)

For example

prob(T1 | 1000 or more) = 0.05(0.05)/0.507 = 0.005

Similar calculations yield the last column in the above table. Note that Σ prob(Tj | 1000 or more) is equal to 1.
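The posterior column of the table in example 2.4 can be reproduced directly from equations 2.10 and 2.11:

```python
# Example 2.4 recomputed: posterior probability of each temperature
# class given 1000 or more visitors (equation 2.11).
classes = ["<60", "60-70", "70-80", "80-90", "90-100", ">100"]
p_visitors_given_T = [0.05, 0.20, 0.50, 0.75, 0.50, 0.25]
p_T = [0.05, 0.15, 0.20, 0.35, 0.20, 0.05]

# Denominator: theorem of total probability (equation 2.10).
p_visitors = sum(pv * pt for pv, pt in zip(p_visitors_given_T, p_T))

posterior = [pv * pt / p_visitors for pv, pt in zip(p_visitors_given_T, p_T)]
for label, p in zip(classes, posterior):
    print(label, round(p, 3))
```

The denominator comes out to about 0.507, and the posteriors necessarily sum to 1, as noted in the solution.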
COUNTING
In applying equation 2.1 to determine probabilities, one often encounters situations where it
is impractical to actually enumerate all of the possible ways that an event can occur. To assist in
this matter certain general mathematical formulas have been developed.
If E1, E2, ..., En are mutually exclusive events such that Ei ∩ Ej = ∅ for all i ≠ j, where ∅ represents an empty set, and Ei can occur in ni ways, then the compound event E made up of outcomes E1, E2, ..., En can occur in n1 n2 ... nn ways.
The problem of sampling or selecting a sample of r items from n items is commonly encountered. Sampling can be done either with replacement, so that the item selected is immediately returned to the population, or without replacement, so that the item is not returned. The order of sampling may be important in some situations and not in others. Thus, we may have four different sampling situations. The number of ordered samples of r items that can be selected from n items without replacement is

(n)r = n(n − 1)(n − 2) ... (n − r + 1) = n!/(n − r)!          (2.12)

The notation n! is read "n factorial"; n! = n(n − 1)(n − 2) ... (2)(1). By definition 0! = 1.
Unordered sampling without replacement is similar to ordered sampling without replacement except that in the case of ordered sampling the r items selected can be arranged in r! ways. That is, an ordered sample of r items can be selected from r items in (r)r, or r!, ways. Thus r! of the ordered samples will contain the same elements. The number of different unordered samples is therefore (n)r/r!, commonly written as C(n, r) and called the binomial coefficient. The binomial coefficient gives the number of combinations possible when selecting r items from n without replacement.
The number of ways of selecting samples under the four above conditions is summarized in the following table:

               With replacement              Without replacement
Ordered        n^r                           (n)r = n!/(n − r)!
Unordered      (n + r − 1)!/((n − 1)! r!)    n!/((n − r)! r!)
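The four counting formulas can be checked against explicit enumeration for small n and r; Python's itertools generators correspond one-to-one to the four sampling situations:

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import factorial

n, r = 5, 3

counts = {
    # ordered, with replacement: n**r
    ("ordered", "with"): len(list(product(range(n), repeat=r))),
    # ordered, without replacement: n!/(n - r)!
    ("ordered", "without"): len(list(permutations(range(n), r))),
    # unordered, without replacement: n!/((n - r)! r!)
    ("unordered", "without"): len(list(combinations(range(n), r))),
    # unordered, with replacement: (n + r - 1)!/((n - 1)! r!)
    ("unordered", "with"): len(list(combinations_with_replacement(range(n), r))),
}

assert counts[("ordered", "with")] == n ** r
assert counts[("ordered", "without")] == factorial(n) // factorial(n - r)
assert counts[("unordered", "without")] == factorial(n) // (factorial(n - r) * factorial(r))
assert counts[("unordered", "with")] == factorial(n + r - 1) // (factorial(n - 1) * factorial(r))
print(counts)
```

For n = 5 and r = 3 the four counts are 125, 60, 10, and 35, respectively.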
Example 2.5. For a particular watershed, records from 10 rain gages are available. Records from
3 of the gages are known to be bad. If 4 records are selected at random from the 10 records, (a)
What is the probability that 1 bad record will be selected? (b) What is the probability that 3 bad
records will be selected? (c) What is the probability that at least 1 bad record will be selected?
Solution: The total number of ways of selecting 4 records from the 10 available records (order is not important) is

C(10, 4) = 10!/(6! 4!) = 210

(a) The number of ways of selecting 1 bad record from 3 bad records and 3 good records from 7 good records is

C(3, 1) C(7, 3) = 3 × 35 = 105

Applying equation 2.1 and letting a = 1 bad and 3 good records, the probability of a is 105/210 or 0.500.

(b) The number of ways of selecting 3 bad records and 1 good record is

C(3, 3) C(7, 1) = 1 × 7 = 7

so the probability of 3 bad records is 7/210 or 0.033. Similarly, the number of ways of selecting 2 bad records and 2 good records is C(3, 2) C(7, 2) = 3 × 21 = 63, giving a probability of 63/210 or 0.300.

(c) Thus, the probability of at least 1 bad record is 0.500 + 0.300 + 0.033 = 0.833. This latter result could also be determined from the fact that the probability of 0 or 1 or 2 or 3 bad records must equal one. The probability of at least 1 bad record thus equals

1 − C(7, 4)/210 = 1 − 35/210 = 0.833
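The binomial-coefficient arithmetic of example 2.5 can be carried out with the standard-library comb function:

```python
from math import comb

# Example 2.5: 10 gage records, 3 of them bad; 4 records are selected at
# random without replacement.  Probabilities follow equation 2.1, with
# binomial coefficients counting the equally likely samples.
total = comb(10, 4)                                  # 210 possible samples

def p_bad(k):
    # probability of exactly k bad records (and 4 - k good) in the sample
    return comb(3, k) * comb(7, 4 - k) / total

p_at_least_one = 1 - p_bad(0)
print(round(p_bad(1), 3), round(p_bad(3), 3), round(p_at_least_one, 3))
```

This is the hypergeometric distribution, which reappears in a later chapter on discrete distributions.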
Example 2.6. For the situation described in Example 2.5, what is the probability of selecting at
least 2 bad records given that one of the records selected is bad?
Solution: Since "at least 2 bad" implies "at least 1 bad," equation 2.8 gives

prob(at least 2 bad out of 4 | at least 1 is bad) = prob(at least 2 bad)/prob(at least 1 bad) = (0.300 + 0.033)/0.833 = 0.400
GRAPHICAL PRESENTATION
Hydrologists are often faced with large quantities of data. Since it is difficult to grasp the
total data picture from tabulations such as table 2.1, a useful first step in data analysis is to use
graphical procedures. Throughout this book various graphical presentations will be illustrated.
Helsel and Hirsch (1992) contains a very good treatment of graphical presentations of hydrologic
data. The appropriate graphical technique depends on the purpose of the analysis. As various
analytic procedures are discussed throughout this book, graphical techniques that supplement the
procedures will be illustrated. Undoubtedly the most common graphical representation of hydrologic data is a time series plot of magnitude versus time. Figure 1.2 is such a plot for the peak
flow data for the Kentucky River. A plot such as this is useful for detecting obvious trends in the
data in terms of means and variances or serial dependence of data values next to each other in
time. Figure 1.2 reveals no obvious trends in the mean annual peak flow or variances of annual
peak flows. It also shows no strong dependence of peak flow in one year on the peak flow of the
previous year. We previously indicated that figure 1.1, showing the water level in the Great Salt
Lake of Utah, indicated an apparent year-to-year relationship. Such a relationship is reflected in
the serial correlation coefficient. Serial correlation will be discussed in detail later in the book.
For now suffice it to say that the serial correlation for the annual lake level for the Great Salt Lake
is 0.969 and for the annual peak flows on the Kentucky River is −0.067. This indicates a very
strong year-to-year correlation for the Salt Lake data and insignificant correlation for the peak
flow on the Kentucky River. Serial correlation indicates dependence from one observation to the
next. It also indicates persistence. In the Salt Lake data, lake levels tend to change slowly with
high levels following high levels. In the Kentucky River data, no such pattern is evident.
In conducting a probabilistic analysis, a graphical presentation that is often quite useful is
a plot of the data as a frequency histogram. This is done by grouping the data into classes and
then plotting a bar graph with the number or the relative frequency (proportion) of observations
in a class versus the midpoint of the class interval. The midpoint of a class is called the class
mark. The class interval is the difference between the upper and lower class boundaries.
Figure 2.6 is such a plot for the Kentucky River peak flow data. Frequency histograms are of
most value for data that are independent from observation to observation. A frequency histogram for the Kentucky River data would have the same general shape and location regardless
of what period of record was selected. The Salt Lake data would have a different location if the
period 1890 to 1910 was used in comparison to the period 1870 to 1890 because the levels were
higher over the later period. Certainly no satisfactory histogram of levels could be developed for
Devils Lake, North Dakota (figure 1.3).
Fig. 2.6. Frequency histogram for the Kentucky River peak flow data.
The selection of the class interval and the location of the first class mark can appreciably affect the appearance of a frequency histogram. The appropriate width for a class interval depends
upon the range of the data, the number of observations, and the behavior of the data. Several suggestions have been put forth for forming frequency histograms. Spiegel (1961) suggests that
there should be 5 to 20 classes. Steel and Torrie (1960) state that the class interval should not exceed one-fourth to one-half of the standard deviation of the data. Sturges (1926) recommends that
the number of classes be determined from
m = 1 + 3.3 log n
(2.14)
where m is the number of classes, n is the number of data values, and the logarithm to the base
10 is used.
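A sketch of Sturges' rule in use follows. The flow values below are hypothetical stand-ins for table 2.1 (only the class counts matter for the illustration), and rounding the class number up is one common convention:

```python
from math import log10, ceil

# Sturges' rule (equation 2.14) for a record of n = 99 annual peaks.
n = 99
m = ceil(1 + 3.3 * log10(n))        # about 7.6, rounded up to 8 classes

# Hypothetical peak flows (1000s cfs) standing in for table 2.1.
flows = [21, 35, 47, 52, 58, 63, 66, 71, 74, 79, 83, 88, 95, 102, 117, 144]
lo, hi = min(flows), max(flows)
width = (hi - lo) / m               # uniform class interval

counts = [0] * m
for q in flows:
    i = min(int((q - lo) / width), m - 1)   # clamp the maximum into the last class
    counts[i] += 1
print(m, counts, sum(counts))
```

Plotting `counts` (or `counts` divided by `n`) against the class marks gives the frequency histogram; in practice the class boundaries would then be rounded to convenient figures, as the text advises.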
Whatever criterion is used, it should be kept in mind that sensitivity is lost if too few or too
many classes are used. Too few classes will eliminate detail and obscure the basic pattern of the
data. Too many classes result in erratic patterns of alternating high and low frequencies. Figure 2.7
represents the Kentucky River data with too many class intervals. If possible, the class intervals
and class marks should be round figures. This is not a computational or theoretical consideration,
but one aimed at making it easier for those viewing the histogram to grasp its full meaning.
In some situations it may be desirable to use nonuniform class intervals. In chapter 8 a situation is presented where the intervals are such that the expected relative frequencies are the same
in each class.
Another common method of presenting data is in the form of a cumulative frequency distribution. Cumulative frequency distributions show the frequency of events less than (greater than)
some given value. They are formed by ranking the data from the smallest (largest) to the largest
(smallest), dividing the rank by the number of data points and plotting this ratio against the
corresponding data value. If the data are ranked from the smaller (larger) data values to the larger
(smaller) values, the resulting cumulative frequency refers to the frequency of observations less
Fig. 2.7. Frequency histogram for the Kentucky River data with too many classes.
(more) than or equal to the corresponding data value. Figure 2.8 is a cumulative frequency plot
based on the Kentucky River peak flow data. Again, ranking and plotting data in this fashion is
most meaningful if the data are not serially correlated.
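Forming a cumulative frequency distribution as described above can be sketched as follows; the flows are hypothetical stand-ins for the Kentucky River record:

```python
# Empirical cumulative frequency: rank the data from smallest to largest
# and pair rank/n with each value, as in figure 2.8.
flows = [66, 21, 95, 47, 144, 58, 83, 35, 117, 74]   # hypothetical, 1000s cfs

n = len(flows)
pairs = [(q, (rank + 1) / n) for rank, q in enumerate(sorted(flows))]

# The cumulative frequency of the largest value is 1; roughly half of the
# observations fall at or below the middle-ranked value.
for q, f in pairs:
    print(q, round(f, 2))
```

Plotting the second element of each pair against the first gives the cumulative frequency curve; ranking from largest to smallest instead yields exceedance frequencies.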
RANDOM VARIABLES
Simply stated, a random variable is a real-valued function defined on a sample space. If the
outcome of an experiment, process, or measurement has an element of uncertainty associated
with it such that its value can only be stated probabilistically, the outcome is a random variable.
This means that nearly all quantities in hydrology (flows, precipitation depths, water levels, storages, roughness coefficients, aquifer properties, water quality parameters, sediment loads, number of rainy days, drought duration, and so forth) are random variables.
Random variables may be discrete or continuous. If the set of values a random variable can assume is finite (or countably infinite), the random variable is said to be a discrete random variable. If the set of values a random variable can assume is uncountably infinite (a continuum of values), the random variable is said to be a continuous random variable. An example of a discrete random variable would be the number of rainy days experienced at a particular location over a period of 1 year. The amount of rain
received over the year would be a continuous random variable. For the most part in this text, capital letters will be used to denote random variables and the corresponding lower case letter will
represent values of the random variable.
It is important to note that any function of a random variable is also a random variable. That is, if X is a random variable, then Z = g(X) is a random variable as well. This follows from the fact that if X is uncertain, then any function of X must be uncertain as well. Physically, this means
that any hydrologic quantity that is dependent on a random variable is also a random variable. If
runoff is a random variable, then erosion and sediment delivery to a stream is a random variable.
If sediment has absorbed chemicals, then water quality is a random variable.
Example 2.7. Nearly every hydrologic variable can be taken as a random variable. Rainfall for
any duration, streamflow, soil hydraulic properties, time between hydrologic events such as flows
above a certain base or daily rainfalls in excess of some amount, the number of times a streamflow rate exceeds a given base over a period of a year, and daily pan evaporation are all random
variables. Quantities derived from random hydrologic variables are also random variables. The
storage required in a water supply reservoir to meet a given demand is a function of the demand
and the inflow to the reservoir. Since reservoir inflow is a random variable, required storage is
also a random variable. As a matter of fact, the demand that is placed on the reservoir would be
a random variable as well. The velocity of flow through a porous media is a function of the hydraulic conductivity and hydraulic gradient which are both random variables. Therefore, the flow
velocity is also a random variable.
The cumulative distribution has jumps in it at each xi equal in magnitude to fx(xi) or the probability that X = xi. The probability that X = xi can be determined from

fx(xi) = Fx(xi) − Fx(xi−1)

where the xi are indexed in increasing order. The notation fx(x) and Fx(x) denotes the pdf and cdf of the discrete random variable X evaluated at X = x.
Often, continuous data are treated as though they were discrete. Looking again at the Kentucky River peak flow data, we can define the event A as having a peak flow in the ith class. Letting ni be the number of observed peak flows in the ith interval and n be the total number of observed peak flows, the probability that a peak flow is in the ith class is given by

prob(A) = fxi = ni/n
Thus, the relative frequency, fxi, can be interpreted as a probability estimate, the frequency
histogram can be interpreted as an approximation for a pdf, and the cumulative frequency can be
interpreted as an approximation for a cdf.
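The relative-frequency estimates described above take only a few lines to compute. A minimal Python sketch follows; the peak-flow values are hypothetical stand-ins, not the actual Kentucky River record:

```python
import numpy as np

# Hypothetical peak-flow sample (cfs); illustrative only.
flows = np.array([21, 34, 45, 18, 52, 39, 27, 61, 33, 48, 29, 41]) * 1000.0

# Group the continuous data into classes and estimate probabilities
# by relative frequency, f_i = n_i / n.
counts, edges = np.histogram(flows, bins=5)
n = flows.size
rel_freq = counts / n            # estimate of prob(flow in class i) -- approximate pdf
cum_freq = np.cumsum(rel_freq)   # cumulative frequency -- approximate cdf

print(rel_freq)
print(cum_freq[-1])              # the cumulative frequency must end at 1
```

The cumulative frequency necessarily reaches 1 at the last class, mirroring the requirement Px(xu) = 1 discussed below.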
Many times it is desirable to treat continuous random variables directly. Continuous random
variables can take on any value in a range of values permitted by the physical processes involved.
Probability distribution functions of continuous random variables are smooth curves. The pdf of
a continuous random variable X is denoted by px(x). The cdf is denoted by Px(x). Px(x) represents the probability that X is less than or equal to x.
The notation px(x) and Px(x) denote the pdf and cdf, respectively, of the continuous random
variable X evaluated at X = x. Thus, py(a) represents the pdf of the random variable Y evaluated at Y = a. Py(a) represents the cdf of the random variable Y and gives prob(Y ≤ a).
A function, px(x), defined on the real line can be a pdf if and only if

px(x) ≥ 0 for all x  and  ∫_{−∞}^{∞} px(x) dx = 1     (2.21)

By definition px(x) is zero for X outside R. Also Px(xl) = 0 and Px(xu) = 1 where xl and xu are the lower and upper limits of X in R. For many distributions these limits are −∞ to ∞ or 0 to ∞. It is also apparent that the probability that X takes on a value between a and b is given by

prob(a ≤ X ≤ b) = ∫_a^b px(x) dx = Px(b) − Px(a)

The prob(a ≤ X ≤ b) is the area under the pdf between a and b. The probability that a random variable takes on any particular value from a continuous distribution is zero. This can be seen from

prob(X = a) = ∫_a^a px(x) dx = 0

Because the probability that a continuous random variable takes on a specified value is zero, the expressions prob(a ≤ X ≤ b), prob(a < X ≤ b), prob(a ≤ X < b), and prob(a < X < b) are all
equivalent. It is also apparent that Px(x) can be interpreted as the probability that X is strictly less
than x since prob(X = x) = 0.
Figures 2.11 and 2.12 illustrate a possible pdf and its corresponding cdf. In addition to density functions that are symmetrical and bell-shaped, densities may take on a number of different shapes including distributions that are skewed to the right or left, rectangular, triangular, exponential, "J"-shaped, and reverse "J"-shaped (Fig. 2.13).
At this point a cautionary note is added to the effect that the probability density function, px(x), is not a probability and can have values exceeding one. The cumulative probability distribution, Px(x), is a probability [prob(X ≤ x) = Px(x)] and must have values ranging from 0 to 1. Of course, px(x) and Px(x) are related as indicated by equation 2.20 and knowledge of one specifies the other.
Example 2.8. Evaluate the constant a for the following expression to be considered a probability density function:

px(x) = ax²  for 0 ≤ x ≤ 5;  px(x) = 0 elsewhere

What is the probability that a value selected at random from this distribution will (a) be less than 2? (b) fall between 1 and 3? (c) be larger than 4? (d) be larger than or equal to 4? (e) exceed 6?
Solution:
From equation 2.21 we must have

∫_0^5 ax² dx = 1  or  a(5³/3) = 1  so that  a = 3/125

and

Px(x) = x³/125 for 0 ≤ x ≤ 5, with Px(x) = 0 for x < 0 and Px(x) = 1 for x > 5

(a) prob(X ≤ 2) = Px(2) = 8/125
(b) prob(1 ≤ X ≤ 3) = Px(3) − Px(1) = 26/125
(c) prob(X > 4) = 1 − Px(4) = 61/125
(d) prob(X ≥ 4) = 1 − Px(4) = 61/125, since prob(X = 4) = 0
(e) prob(X > 6) = 0, since px(x) = 0 for x > 5
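The arithmetic of Example 2.8 can be checked numerically. The sketch below assumes the density px(x) = 3x²/125 on 0 ≤ x ≤ 5 implied by Px(x) = x³/125 and integrates it with a simple trapezoidal rule:

```python
import numpy as np

# Density of Example 2.8 (assumed form: p_X(x) = 3x^2/125 on [0, 5]).
def pdf(x):
    return np.where((x >= 0) & (x <= 5), 3 * x**2 / 125, 0.0)

# Trapezoidal-rule approximation of P_X(b) = integral of pdf from 0 to b.
def cdf(b, n=20001):
    x = np.linspace(0.0, b, n)
    y = pdf(x)
    return float(((y[1:] + y[:-1]) / 2 * np.diff(x)).sum())

print(round(cdf(2), 3))           # prob(X <= 2)     = 8/125  = 0.064
print(round(cdf(3) - cdf(1), 3))  # prob(1 <= X <= 3) = 26/125 = 0.208
print(round(1 - cdf(4), 3))       # prob(X > 4)      = 61/125 = 0.488
```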
A cumulative distribution function need not be continuous everywhere; it may contain a finite jump at some point X = d:

Px(x) = P1(x)  for X < d
Px(x) = P2(x)  for X ≥ d     (2.24)
where P2(d) > P1(d), P1(xl) = 0, P2(xu) = 1, and P1(x) and P2(x) are nondecreasing functions of X. Figure 2.14 is a plot of such a distribution. For this situation the prob(X = d) equals the magnitude of the jump ΔP at X = d or is equal to P2(d) − P1(d). Any finite number of discontinuities of this type are possible.
An example of a distribution as shown in figure 2.14 is the distribution of daily rainfall
amounts. The probability that no rainfall is received, prob(X = 0), is finite, whereas the probability distribution of rain on rainy days would form a continuous distribution. A second example
would be the probability distribution of the water level in some reservoir. The water level may be
maintained at a constant level d as much as possible but may fluctuate below or above d at times.
The distribution shown in figure 2.14 could represent this situation.
The relationship between relative frequency and probability can be envisioned by considering an experiment whose outcome is a value of the random variable X. Let px(x) be the probability density function of X. The probability that a single trial of the experiment will result in an outcome between X = a and X = b is given by

prob(a ≤ X ≤ b) = ∫_a^b px(x) dx
Fig. 2.14. A possible piecewise continuous pdf for the case prob(X = d) ≠ 0.
In N independent trials of the experiment, the expected number of outcomes in the interval a to b would be

N ∫_a^b px(x) dx

The expected relative frequency of outcomes in an interval of width Δxi centered on xi is

fxi = ∫_{xi−Δxi/2}^{xi+Δxi/2} px(x) dx

Because the right-hand side of this equation represents the area under px(x) between xi − Δxi/2 and xi + Δxi/2, it can be approximated by

fxi = Δxi px(xi)     (2.25)
Equation 2.25 can be used to determine the expected relative frequency of repeated, independent
outcomes of a random experiment whose outcome is a value of the random variable X.
If N independent observations of X are available, the actual relative frequency of outcomes in an interval of width Δxi centered on xi may not equal fxi as given by equation 2.25 because X
is a random variable whose behavior can only be described probabilistically. The most probable outcome or the expected outcome will equal the observed outcome only if px(x) is truly the probability density function for X and then only for an infinitely large number of observations. Even if the true probability density function is being used, the actual frequency of outcomes in the interval Δxi approaches the expected number only as the number of trials or observations becomes very large.
Example 2.9. Plot the expected frequency histogram using the probability density function of example 2.8 and a class interval of 1/2.
Solution: fxi = Δxi px(xi)

xi      fxi
0.25   .00075
0.75   .00675
1.25   .01875
1.75   .03675
2.25   .06075
2.75   .09075
3.25   .12675
3.75   .16875
4.25   .21675
4.75   .27075
Sum    .99750
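The table of Example 2.9 is a direct application of equation 2.25 and can be reproduced with a short script (the density 3x²/125 is that of example 2.8):

```python
import numpy as np

# Expected frequencies f_xi = dx * p_X(x_i) with p_X(x) = 3x^2/125
# and class marks 0.25, 0.75, ..., 4.75 (class interval dx = 0.5).
dx = 0.5
marks = np.arange(0.25, 5.0, 0.5)
f = dx * 3 * marks**2 / 125

for x, fx in zip(marks, f):
    print(f"{x:4.2f}  {fx:.5f}")
print("Sum", round(f.sum(), 5))   # 0.99750
```

The sum falls slightly short of 1 because equation 2.25 evaluates the density at the class midpoint rather than integrating it exactly over each class.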
BIVARIATE DISTRIBUTIONS
The situation frequently arises where one is interested in the simultaneous behavior of two or
more random variables. An example might be the flow rates on two streams near their confluence.
One might like to know the probability of both streams having peak flows exceeding a given value.
A second example might be the probability of a rainfall exceeding 2.5 inches at the same time the
soil is nearly saturated. Rainfall depth and soil water content would be two random variables.
Example 2.10. The magnitude of peak flows from small watersheds is often estimated from the
"Rational Equation" given by Q = CIA where Q is the estimated flow, C is a coefficient, I is a
rainfall intensity, and A is the watershed area. The assumption is made that the return period of
flow will be the same as the return period of the rainfall that is used. To verify this assumption it
is necessary to study the joint probabilities of the two random variables Q and I.
If X and Y are continuous random variables, their joint probability density function is pX,Y(x, y) and the corresponding cumulative probability distribution is PX,Y(x, y). These two are related by

pX,Y(x, y) = ∂²PX,Y(x, y)/∂x ∂y

and

PX,Y(x, y) = prob(X ≤ x and Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y pX,Y(t, s) ds dt
The corresponding relationships for X and Y being discrete random variables are

fX,Y(xi, yj) = prob(X = xi and Y = yj)     (2.28)

FX,Y(x, y) = prob(X ≤ x and Y ≤ y) = Σ_{xi≤x} Σ_{yj≤y} fX,Y(xi, yj)
MARGINAL DISTRIBUTIONS
If one is interested in the behavior of one of a pair of random variables regardless of the value of the second random variable, the marginal distribution may be used. For instance, the marginal density of X, px(x), is obtained by integrating pX,Y(x, y) over all possible values of Y:

px(x) = ∫_{−∞}^{∞} pX,Y(x, s) ds

The marginal cumulative distribution of X follows as

Px(x) = prob(X ≤ x and Y ≤ ∞) = PX,Y(x, ∞)     (2.32a)

and similarly

Py(y) = prob(X ≤ ∞ and Y ≤ y) = PX,Y(∞, y)     (2.32b)
CONDITIONAL DISTRIBUTIONS
A marginal distribution is the distribution of one variable regardless of the value of the second variable. The distribution of one variable with restrictions or conditions placed on the second variable is called a conditional distribution. Such a distribution might be the distribution of X given that Y equals y0 or the distribution of Y given that x1 ≤ X ≤ x2.
If R is a region of the y space having nonzero probability, the conditional density of X given that Y is in R is

pX|Y(x | Y is in R) = ∫_R pX,Y(x, s) ds / ∫_R py(s) ds     (2.37)

for X and Y continuous. Similarly, the conditional distribution of (X | Y is in R) for X and Y discrete is

fX|Y(x | Y is in R) = Σ_{yj in R} fX,Y(x, yj) / Σ_{yj in R} fy(yj)     (2.38)

The determination of conditional probabilities from equations 2.37 and 2.38 is done in the usual way.
The proof of this may be found in Neuts (1973). In most statistics books pX|Y(x | Y = y0) is simply written as

pX|Y(x | y) = pX,Y(x, y)/py(y)

All of the above results are symmetrical with respect to X and Y. For example

pY|X(y | x) = pX,Y(x, y)/px(x)
If the region R of equation 2.37 is the entire region of definition with respect to Y, then

∫_R py(s) ds = 1

and

∫_R pX,Y(x, s) ds = px(x)

so that

pX|Y(x | Y is in R) = px(x)

This results from the fact that the condition that Y is in R when R encompasses the entire region of definition of Y is really no restriction but simply a condition stating that Y may take on any value in its range. In this case, pX|Y(x | Y is in R) is identical to the marginal density of X.
INDEPENDENCE
From equation 2.37 or 2.38 it can be seen that in general the conditional density of X given Y is a function of y. If the random variables X and Y are independent, this functional relationship disappears (i.e., pX|Y(x | Y = y) is not a function of y). In fact, in this case

pX|Y(x | y) = px(x)

or the conditional density equals the marginal density. Furthermore, if X and Y are independent (continuous or discrete) random variables, their joint density is equal to the product of their marginal densities:

pX,Y(x, y) = px(x) py(y)
The random variables X and Y are independent in the probabilistic sense (stochastically
independent) if and only if their joint density is equal to the product of their marginal densities.
Independence is an extremely important property. A bivariate distribution is much more
difficult to define and to work with than is a univariate distribution. If independence exists and
the proper pdf for X and for Y can be determined, the bivariate distribution for X and Y is given
as the product of these two univariate distributions.
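As a sketch of this product rule, take two independent exponential variables (an assumed illustration, not a specific hydrologic model). The joint probability over a rectangle then factors into the product of the marginal probabilities:

```python
import numpy as np

# Assumed marginals: X ~ exponential(1), Y ~ exponential(2).
def p_x(x): return np.exp(-x)
def p_y(y): return 2.0 * np.exp(-2.0 * y)
def p_xy(x, y): return p_x(x) * p_y(y)   # valid joint pdf under independence

# prob(X <= 1 and Y <= 1) from the marginal cdfs:
px1 = 1 - np.exp(-1.0)
py1 = 1 - np.exp(-2.0)
print(round(px1 * py1, 4))               # → 0.5466

# Numerical check: midpoint-rule integration of the joint density
# over the unit square gives the same probability.
xs = (np.arange(2000) + 0.5) / 2000
joint = sum(p_xy(xi, xs).sum() / 2000 for xi in xs) / 2000
print(round(joint, 4))                   # → 0.5466
```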
DERIVED DISTRIBUTIONS
Situations often arise where the joint probability distribution of a set of random variables is
known and the distribution of some function or transformation of these variables is desired. For
example, the joint probability distribution of the flows in two tributaries of a stream may be
known whereas the item of interest may be the sum of the flows in the two tributaries. Some
commonly used transformations are translation or rotation of axes, logarithmic transformations, nth root transformations for n equal 2 and 3, and certain trigonometric transformations.
Thomas (1971) presents the developments that lead to the results presented here concerning transformations and derived distributions for continuous random variables. The procedure for discrete random variables is simply one of accounting.
Example 2.11. Let X have the distribution function

fx(x) = c/x  for X = 2, 3, 4, 5

so that c = 60/77. Let Y = X² − 7X + 12. The probability distribution and possible values of Y can be determined from the following table.

x    fx(x)    y = x² − 7x + 12
2    30/77    2
3    20/77    0
4    15/77    0
5    12/77    2

Thus Y takes on the value 0 with probability 35/77 = 5/11 and the value 2 with probability 42/77 = 6/11.
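The accounting procedure for a discrete transformation can be carried out exactly with rational arithmetic. For Example 2.11, with c chosen so that the probabilities sum to one:

```python
from fractions import Fraction

# f_X(x) = c/x for x = 2, 3, 4, 5; normalization gives c = 60/77.
c = 1 / sum(Fraction(1, x) for x in (2, 3, 4, 5))

# Accounting step: accumulate probability onto each attainable value of Y.
f_y = {}
for x in (2, 3, 4, 5):
    y = x * x - 7 * x + 12                 # Y = X^2 - 7X + 12
    f_y[y] = f_y.get(y, Fraction(0)) + c * Fraction(1, x)

print(c)                                   # → 60/77
print(dict(sorted(f_y.items())))           # Y = 0 and Y = 2 with probs 5/11 and 6/11
```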
The pdf of U = u(X), where u(X) is a monotonic function (u(X) is monotonically increasing if u(x2) ≥ u(x1) for x2 > x1 and monotonically decreasing if u(x2) ≤ u(x1) for x2 > x1), can be found from

pU(u) = px(x) |dx/du|
Example 2.12. Find the probability of 0 < U < 10 if U = X² and X is a continuous random variable with

px(x) = 3x²/125  for 0 ≤ x ≤ 5

Solution:
With X = U^(1/2) and dx/du = u^(−1/2)/2,

pU(u) = px(u^(1/2)) |dx/du| = (3u/125)(u^(−1/2)/2) = 3u^(1/2)/250  for 0 < u < 25

A check to see that pU(u) is a probability density can be made by integrating pU(u) from 0 to 25:

∫_0^25 (3u^(1/2)/250) du = 25^(3/2)/125 = 1

Now

prob(0 < U < 10) = ∫_0^10 (3u^(1/2)/250) du = 10^(3/2)/125 = 0.253
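A Monte Carlo check of this result: sample X by inverting its cdf Px(x) = x³/125, square the samples, and compare the empirical frequency with the derived probability. The seed and sample size are arbitrary choices:

```python
import numpy as np

# Inverse-cdf sampling from p_X(x) = 3x^2/125 on [0, 5]:
# P_X(x) = x^3/125 = u  =>  x = (125 u)^(1/3).
rng = np.random.default_rng(1)
x = (125 * rng.random(500_000)) ** (1 / 3)
u = x ** 2                                  # the transformed variable U = X^2

empirical = np.mean(u < 10)                 # relative frequency of 0 < U < 10
analytic = 10 ** 1.5 / 125                  # derived result, about 0.253
print(round(empirical, 3), round(analytic, 3))   # both near 0.253
```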
In the case of a continuous bivariate density, the transformation from pX,Y(x, y) to pU,V(u, v), where U = u(X, Y) and V = v(X, Y) are one-to-one continuously differentiable transformations, can be made by the relationship

pU,V(u, v) = pX,Y(x, y) |J|

where J is the Jacobian of the transformation computed as the determinant of the matrix of partial derivatives

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ]

The limits on U and V must be determined from the individual problem at hand.
Example 2.13. Given that pX,Y(x, y) = (5 − y/2 − x)/14 for 0 < X < 2 and 0 < Y < 2. If U = X + Y and V = Y/2, what is the joint probability density function for U and V? What are the proper limits on U and V?
Solution:
With X = U − 2V and Y = 2V, the Jacobian is

J = det [ 1  −2 ; 0  2 ] = 2

so that

pU,V(u, v) = pX,Y(u − 2v, 2v) |J| = 2(5 − v − u + 2v)/14 = (5 + v − u)/7

The limits on U and V can be determined by noting that Y = 2V and X = U − 2V. Therefore, the limit of Y = 0 maps to V = 0, Y = 2 maps to V = 1, X = 0 maps to U = 2V, and X = 2 maps to U = 2V + 2. These limits are shown in figure 2.16. A check can be made by integrating pU,V(u, v) over the region 0 < V < 1, 2V < U < 2V + 2.
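The suggested check that the transformed density integrates to one over the new region can be done numerically. The sketch below uses a midpoint rule with the density (5 + v − u)/7 obtained from |J| = 2:

```python
import numpy as np

# Integrate p_{U,V}(u, v) = (5 + v - u)/7 over 0 < v < 1, 2v < u < 2v + 2
# with a midpoint rule (exact here, since the integrand is linear).
n = 400
v_mid = (np.arange(n) + 0.5) / n              # midpoints in (0, 1)
total = 0.0
for vi in v_mid:
    # midpoints of the u-interval (2v, 2v + 2), which has width 2
    u_mid = 2 * vi + (np.arange(n) + 0.5) / n * 2
    total += np.sum((5 + vi - u_mid) / 7) * (2 / n) * (1 / n)

print(round(total, 6))   # → 1.0
```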
In some cases, the function U = u(X) may be such that it is difficult to analytically determine
the distribution of U from the distribution of X. In this case it may be possible to generate a large
sample of X's (chapter 13), calculate the corresponding U's and then fit a probability distribution
to the U's (chapter 6). It should be noted, however, that this empirical method will not in general
satisfy equations 2.47 or 2.48.
MIXED DISTRIBUTIONS
If pi(x) for i = 1, 2, ..., m represent probability density functions and λi for i = 1, 2, ..., m represent parameters satisfying λi ≥ 0 and Σ_{i=1}^m λi = 1, then

px(x) = Σ_{i=1}^m λi pi(x)     (2.54)

is also a probability density function, called a mixed distribution, with the corresponding cumulative distribution

Px(x) = Σ_{i=1}^m λi Pi(x)     (2.55)

Mixed distributions in hydrology may be applicable in situations where more than one distinct cause for an event may exist. For example, flood peaks from convective storms might be described by p1(x) and from hurricane storms by p2(x). If λ1 is the proportion of flood peaks generated by convective storms and λ2 = (1 − λ1) is the proportion generated by hurricane storms, then equations 2.54 and 2.55 would describe the probability distribution of flood peaks.
Singh (1974), Hawkins (1974), Singh (1987a, 1987b), Hirschboeck (1987), and Diehl and Potter (1987) discuss procedures for applying mixed distributions in the form of equation 2.54 to flood frequency determinations. Two general approaches are used. One is to allow the data and statistical estimation procedures to determine the mixing parameter, λ, and the parameters of the distributions, pi(x). The other is to use physical information on the actual events to classify them and thus determine λ. Once classified, the two sets of data can be used independently to determine the parameters of the pdfs.
Example 2.14. A certain event has probability 0.3 of being from the distribution p1(x) = e^(−x), x > 0. The event may also be from the distribution p2(x) = 2e^(−2x), x > 0. What is the probability that a random observation will be less than 1?
Solution:

prob(X < 1) = 0.3 ∫_0^1 e^(−x) dx + 0.7 ∫_0^1 2e^(−2x) dx = 0.3(1 − e^(−1)) + 0.7(1 − e^(−2)) = 0.795
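Example 2.14 amounts to mixing the two component cdfs with weights λ1 = 0.3 and λ2 = 0.7:

```python
import math

# Mixture p(x) = 0.3 e^{-x} + 0.7 (2 e^{-2x}), x > 0.
# prob(X < 1) is the same mixture of the component cdfs evaluated at x = 1.
lam = (0.3, 0.7)
cdfs = (1 - math.exp(-1.0), 1 - math.exp(-2.0))
prob = sum(l * c for l, c in zip(lam, cdfs))
print(round(prob, 3))   # → 0.795
```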
Exercises
2.1. (a) Construct the theoretical relative frequency histogram for the sum of values obtained in
tossing two dice. (b) Toss two dice 100 times and tabulate the frequency of occurrence of the
sums of the two dice. Plot the results on the histogram of part a. (c) Why do the results of part b
not equal the theoretical results of part a? What possible kinds of errors are involved? Which kind
of error was the largest in your case?
2.2. Select a set of data consisting of 50 or more observations. Construct a relative frequency
plot using at least two different groupings of the data. Which of the two groupings do you prefer?
Why?
2.3. In a period of one week, 3 rainy days were observed. If the occurrence of a rainy day is an independent event, how many ways could the sequence consisting of 4 dry and 3 wet days be arranged?
2.4. If the occurrence of a rainy day is an independent event with probability equal to 0.3, what
is the probability of (a) exactly 3 rainy days in one week? (b) the next 3 days will be rain? (c) 3
rainy days in a row during any week with the other 4 days dry?
2.5. Consider a coin with the probability of a head equal to p and the probability of a tail equal
to q = 1 - p. (a) What is the probability of the sequence HHTHTTH in 7 flips of the coin? (b)
What is the probability of a specified sequence resulting in r H's and s T's? (c) How many ways
can r H's and s T's be arranged? (d) What is the probability of r H's and s T's without regard to
the order of the sequence?
2.6. The distribution given by fx(x) = 1/N for X = 1, 2, 3, ..., N is known as the discrete uniform distribution. In the following consider N ≥ 5. (a) What is the probability that a random value from fx(x) will be equal to 5? (b) What is the probability that a random value from fx(x) will be between 3 and 5 inclusive? (c) What is the probability that in a random sample of 3 values from fx(x) all will be less than 5? (d) What is the probability that the 3 random values from fx(x) will all be less than 5 given that 1 of the values is less than 5? (e) If 2 random values are selected from fx(x), what is the probability that one will be less than 5 and the other greater than 5? (f) For what x from fx(x) is prob(X ≤ x) = 0.5?
2.7. Consider the continuous probability density function px(x) = a sin² mx for 0 < X < π. (a) What must be the value of a and m? (b) What is Px(x)? (c) What is prob(0 < X < π/2)? (d) What is prob(X > π/2 | X < π/4)?
2.8. Consider the continuous probability density function given by px(x) = 0.25 for 0 < X < a. (a) What is a? (b) What is prob(X > a/2)? (c) What is prob(X > a/2 | X > a/4)? (d) What is prob(X > a/2 | X < a/4)?
2.9. Let px(x) = 0.25 for 0 < X < a as in exercise 2.8. What is the distribution of Y = ln X? Sketch py(y).
2.10. Many probability distributions can be defined simply by consulting a table of definite integrals. For example, ∫_0^∞ x^(n−1) e^(−x) dx is equal to Γ(n), where Γ(n) is defined as the gamma function (see chapter 6). Therefore one can define px(x) = x^(n−1) e^(−x)/Γ(n) to be a probability density function for n > 0 and 0 < X < ∞. This distribution is known as the 1-parameter gamma distribution. Using a table of definite integrals, define several possible continuous probability distributions. Give the appropriate range on X and any parameters.
2.11. The annual inflow X into a reservoir (acre-feet) follows a probability density given by px(x) = 1/(β1 − α1) for α1 < x < β1. The total annual outflow Y in acre-feet follows a probability distribution given by py(y) = 1/(β2 − α2) for α2 < y < β2. Consider that β1 > β2 and α1 < α2. (a) Calculate the expression for the probability distribution of the annual change in storage. (b) Plot the probability distribution of the annual change in storage. (c) If β1 = 100,000, α1 = 20,000, β2 = 70,000, and α2 = 50,000, what is the probability that the change in storage will be i) negative and ii) greater than 15,000 acre-feet?
2.12. The probability of receiving more than 1 inch of rain in each month is given in the following table. If a monthly rainfall record selected at random is found to have more than 1 inch of
rain, what is the probability the record is for July? April?
Jan .25   Feb .30   Mar .35   Apr .40   May .20   Jun .10
Jul .05   Aug .05   Sep .05   Oct .05   Nov .10   Dec .20
2.13. It is known that the discharge from a certain plant has a probability of 0.001 of containing a fish-killing pollutant. An instrument used to monitor the discharge will indicate the presence of the pollutant with probability 0.999 if the pollutant is present and with probability 0.01 if the pollutant is not present. If the instrument indicates the presence of the pollutant, what is the probability that the pollutant is really present?
2.14. A potential purchaser of a ferry across a river knows that if a flow of 100,000 cfs or more occurs, the ferry will be washed downstream, go over a low dam, and be destroyed. He knows that the probability of a flow of this kind in any year is 0.05. He also knows that for each year that the ferry operates a net profit of $10,000 is realized. The purchase price of the ferry is $50,000. Sketch the probability distribution of the potential net profit over a period of years, neglecting interest rates and other complications. Assume that if a flow of 100,000 cfs or more occurs in a year, the profit for that year is zero.
2.15. Assume that the probability density function of daily rainfall is given by
(a) Is this a proper probability density function? (b) What is prob(X > 0.5)? (c) What is prob(X > 0.5 | X ≠ 0)?
2.16. Consider a density px(x) that is a mixture of two uniform distributions with mixing parameter λ1. (a) Sketch px(x) for λ1 = 0.5. (b) Sketch px(x) for λ1 = 0.1. (c) Sketch px(x) for λ1 = 0.333. (d) In a random sample from px(x), 60% of the values were between 0 and 2. What would be an estimate for the value of λ1?
2.17. Show that equations 2.50 through 2.53 are valid.
3. Properties of
Random Variables
IN CHAPTER 2 random variables and their probability density functions were discussed in
general and somewhat abstract terms. Actually, nearly every hydrologic variable is a random
variable. This includes rainfall, streamflow, infiltration rates, evaporation, reservoir storage, and
so on. Any process whose outcome is a random variable can be thought of as an experiment. A
single outcome from an experiment is a realization of the experiment or an observation from the
experiment. Thus, daily rainfall values are observations generated by a set of meteorologic conditions that comprise the experiment.
The terms realization and observation can be used interchangeably; however, an observation
is generally taken to be a single value of a random variable and a realization is generally taken as
a time series of random variables generated by a random experiment. A 10-year record of daily
rainfall might be considered as a single realization of a stochastic process (daily rainfall). A
second 10-year record of daily rainfall from the same location would then be a second realization
of the process.
In this chapter we will be concerned mainly with observations of random variables and with
the collection of possible values that these observations may take on. The complete assemblage
of all of the values representative of a particular random process is called a population. Any subset of these values would be a sample from the population. For example, the pages of this book
could represent a population while the pages of this chapter are a sample of that population. All
of the books in a library might be taken as a population and should this book be found in the
library, it would be a sample from the total population.
Generally, one has at hand a sample of observations or data from which inferences about the
originating population are to be made, and then possibly inferences about another sample from
this population. Streamflow records for the past 50 years on a particular stream would be a
sample from which inferences about the behavior of the stream for all time (the population) could
be made. This information could also be used to estimate the behavior of the stream during some
future period of years (another but yet unrealized sample) so that a structure could be properly
designed for the stream. Thus, one might use information gleaned from one sample to make
decisions regarding another sample.
Quantities that are descriptive of a population are called parameters. In most situations these
parameters must be estimated from samples of data. Sample statistics are estimates for population parameters. Sample statistics are estimated from samples of data and as such are functions
of random variables (the sample values) and thus are themselves random variables. The average
number of pages in all of the books in a particular library would be a parameter representing
the population (the books in the library). This parameter could be estimated by determining the
average number of pages in all of the books on a particular shelf in the library (a sample of the
population). This estimate of the parameter would be a statistic.
As pointed out in chapter 1, for a decision based on a sample to be valid in terms of the
population, the sample statistics must be representative of the population parameters. This in
turn requires that the sample itself be representative of the population and that "good" parameter estimation procedures are used. One could not get a "good" estimate of the average number
of pages per book in a library by sampling a shelf that contained only fat engineering
handbooks. By the same token, one cannot get "good" estimates for the parameters of a streamflow synthesis model if the estimates are based on a short period of record during which an
extreme drought occurred.
One rarely, if ever, has available a population of observations on a hydrologic variable.
What is generally available is a sample (of observations) from the population. Thus, population
parameters are rarely, if ever, known and must be estimated by sample statistics. By the same
token, the true probability density function that generated the available sample of data is not
known. Thus, it is necessary to not only estimate population parameters, but it is also necessary
to estimate the form of the random process (experiment) that generated the data.
This chapter is devoted to a discussion of parameters descriptive of populations and
how estimates (statistics) for these parameters can be obtained from samples drawn from
populations.
and the first moment of the total area about the origin is

∫ x dA

Fig. 3.1. Moment of arbitrary area.

In the case of a random variable and its associated probability density function such as shown in figure 3.2, the first moment about the origin is again given by

μ = ∫_{−∞}^{∞} x px(x) dx
Generalizing the situation, the ith moment about the origin is

μi′ = ∫_{−∞}^{∞} x^i px(x) dx

The ith central moment is defined as the ith moment about the mean, μ, of a distribution and is given by

μi = ∫_{−∞}^{∞} (x − μ)^i px(x) dx
E(X) = ∫_{−∞}^{∞} x px(x) dx     X continuous     (3.8)
E(X) = Σ_j xj fx(xj)     X discrete

E[g(X)] = ∫_{−∞}^{∞} g(x) px(x) dx     X continuous     (3.9)
E[g(X)] = Σ_j g(xj) fx(xj)     X discrete
It is apparent that the expected value of (X − μ)^i is equal to the ith central moment:

μi = E[(X − μ)^i]

A sample estimate of the population mean is the arithmetic average, X̄, calculated from

X̄ = (1/n) Σ_{i=1}^n xi

where n is the number of observations or items in the sample. The arithmetic mean can be estimated from grouped data by

X̄ = (1/n) Σ_{i=1}^k ni xi

where k is the number of groups, n is the number of observations, ni is the number of observations in the ith group, and xi is the class mark of the ith group.
Geometric Mean
The sample geometric mean, X̄g, is defined as

X̄g = (x1 x2 ... xn)^(1/n)
Median
The median, xmd, is the value of the random variable such that half of the probability lies below it:

∫_{−∞}^{xmd} px(x) dx = 0.5     X continuous     (3.18)

Σ_{xj ≤ xmd} fx(xj) = 0.5     X discrete     (3.19)

Mode
The mode, xmo, is the value of X for which the density is a maximum, that is, the value satisfying

dpx(x)/dx = 0     X continuous

and

d²px(x)/dx² < 0     X continuous     (3.21)

For a discrete random variable, the mode is the value of X having the greatest probability.
The sample mode, xmo, would simply be the most frequently occurring value in the sample.
A sample or a population may have none, one, or more than one mode.
Weighted Mean
The calculation of the arithmetic mean of grouped data is an example of calculating a weighted mean where ni/n is the weighting factor. In general, the weighted mean is

X̄w = Σ_{i=1}^k wi xi / Σ_{i=1}^k wi

where wi is the weight associated with the ith observation or group and k is the number of observations or groups.
MEASURES OF DISPERSION
Range
The two most common measures of dispersion are the range and the variance. The range of a sample is simply the difference between the largest and smallest sample values. The range of a population is many times the interval from −∞ to ∞ or from 0 to ∞. The sample range is a function of only two of the sample values but does convey some idea of the spread of the data. The population range of many continuous hydrologic variables would be 0 to ∞ and would convey little information. The range has the disadvantage of not reflecting the frequency or magnitude of values that deviate either positively or negatively from the mean because only the largest and smallest values are used in its determination. Occasionally, the relative range (the range divided by the mean) is used.
Variance
By far the most common measure of dispersion is the variance, or its positive square rootthe standard deviation. The variance of the random variable X is defined as the second moment
about the mean and is denoted by 0;.
Thus, the variance is the average squared deviation from the mean. For a discrete population of
size n, equation 3.23 becomes
Two basic differences should be noted between equations 3.24 and 3.25. First, in 3.25 F is used
instead of p. This is because in dealing with a sample, the population mean would not be known.
Second, n - 1 is used as the denominator in determining S; rather than n when calculating 0;.
Ci(xi - x ) ~
would result in a biased estimate for 0;.
The reason for this is that
n
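The effect of the n − 1 divisor can be demonstrated by simulation: repeatedly drawing small samples from a population of known variance (here a normal population with σ² = 4, an arbitrary choice) shows that dividing by n underestimates σ² on average while dividing by n − 1 does not:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n = 10
# 100,000 independent samples of size n from a population with variance 4.
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)   # sum of squared deviations from X-bar

print(round((ss / n).mean(), 2))           # biased: averages near (n-1)/n * 4 = 3.6
print(round((ss / (n - 1)).mean(), 2))     # unbiased: averages near 4.0
```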
The variance for grouped data can be estimated from

SX² = Σ_{i=1}^k ni (xi − X̄)²/(n − 1)

where k is the number of groups, n is the number of observations, xi is the class mark, and ni the number of observations in the ith group.
The variance of some functions of the random variable X can be determined from the following relationships, where c is a constant:

Var(c) = 0
Var(cX) = c² Var(X)
Var(X + c) = Var(X)

The units on the variance are the same as the units on X². The units on the standard deviation are the same as the units on the random variable. A dimensionless measure of dispersion is the coefficient of variation, defined as the standard deviation divided by the mean. The coefficient of variation is estimated from

Cv = SX/X̄
MEASURES OF SYMMETRY
As is apparent from figure 2.13, many distributions are not symmetrical. They may tail off
to the right or to the left and as such are said to be skewed. A distribution tailing to the right is
said to be positively skewed and one tailing to the left is negatively skewed. The skewness is the
third moment about the mean and is given by
skewness = ∫_{−∞}^{∞} (x − μ)³ px(x) dx
One measure of absolute skewness would be the difference in the mean and the mode. A measure such as this would not be too meaningful, however, because it would depend on the units of
measurement. A relative measure of skewness, known as Pearson's first coefficient of skewness,
can be obtained by dividing the difference in the mean and the mode by the standard deviation.
population measure of skewness = (μ − xmo)/σ     (3.32)
Fig. 3.3. Symmetrical (mean = mode = median), positively skewed, and negatively skewed distributions.

The corresponding sample measure is

sample measure of skewness = (X̄ − xmo)/SX

The mode of moderately skewed distributions can be estimated from (Parl 1967)

xmo = X̄ − 3(X̄ − xmd)

so that

sample measure of skewness = 3(X̄ − xmd)/SX     (3.35)
If sample estimates are replaced by population values in equation 3.35, Pearson's second coefficient of skewness results.
The most commonly used measure of skewness is the coefficient of skew given by

γ = μ3/σ³     population
Cs = M3/SX³     sample

where M3 is the sample estimate for μ3. The sample coefficient of skew has the advantage of being a function of all of the observations in the sample. Figure 3.3 shows symmetrical, positively and negatively skewed distributions.
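A moment-based sample coefficient of skew can be sketched as below. Texts differ on bias-correction factors, so this simple M3/S³ version is only illustrative:

```python
import numpy as np

def coef_skew(x):
    """Sample coefficient of skew C_s = M3 / S^3 (no bias correction)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m3 = np.sum((x - x.mean()) ** 3) / n   # sample third central moment
    s = x.std(ddof=1)                      # sample standard deviation (n - 1 divisor)
    return m3 / s ** 3

right_skewed = [1, 1, 2, 2, 3, 10]                 # long tail to the right
print(coef_skew(right_skewed) > 0)                 # → True  (positive skew)
print(coef_skew([-v for v in right_skewed]) < 0)   # → True  (mirrored: negative skew)
```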
MEASURES OF PEAKEDNESS
A fourth property of random variables based on moments is the kurtosis. Kurtosis refers to
the extent of peakedness or flatness of a probability distribution in comparison with the normal
Fig. 3.4. Leptokurtic (k > 3, E > 0) and normal (k = 3, E = 0) distributions.
probability distribution. Kurtosis is the fourth moment about the mean. A coefficient of kurtosis is defined as

k = M4/SX⁴     (3.39)

where M4 is the sample estimate for μ4. According to Yevjevich (1972a), a less biased estimate for the coefficient of kurtosis is obtained by multiplying equation 3.39 by n³/[(n − 1)(n − 2)(n − 3)].
A much simpler and more direct method of finding E[g(X, Y)] would be to use the
relationship
In either case, the result is the average value of the function g(X, Y) weighted by the probability
that X = x and Y = y or more simply the mean of the random variable U.
In the discrete case

E[g(X, Y)] = Σ_x Σ_y g(x, y) f_{X,Y}(x, y)
A general expression for the r, s moment about the origin of the jointly distributed random variables X and Y is

μ′_{r,s} = E(X^r Y^s) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} x^r y^s p_{X,Y}(x, y) dx dy
The most useful central moments are for (r = 2, s = 0), (r = 1, s = 1) and (r = 0, s = 2).
For the case (r = 2, s = 0) we have

E[(X − μ_X)²] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x − μ_X)² p_{X,Y}(x, y) dy dx
             = ∫_{-∞}^{∞} (x − μ_X)² [∫_{-∞}^{∞} p_{X,Y}(x, y) dy] dx
             = ∫_{-∞}^{∞} (x − μ_X)² p_X(x) dx
             = Var(X)
The analogous result holds for (r = 0, s = 2). The comparable results for discrete random variables
are easily obtained.
Covariance
The covariance of X and Y is defined as the 1, 1 central moment

Cov(X, Y) = σ_{X,Y} = E[(X − μ_X)(Y − μ_Y)]    (3.49)
For the case where X and Y are independent, equation 3.49 can be written

Cov(X, Y) = ∫_{-∞}^{∞} (x − μ_X) p_X(x) dx ∫_{-∞}^{∞} (y − μ_Y) p_Y(y) dy    (3.50)

since p_{X,Y}(x, y) would equal p_X(x) p_Y(y). Furthermore, both of the integrals in equation 3.50 are equal to zero so that

Cov(X, Y) = 0

if X and Y are independent. The converse of this is not necessarily true, however.

The sample estimate for the population covariance σ_{X,Y} is s_{X,Y}, computed from

s_{X,Y} = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)    (3.52)
Correlation Coefficient
The covariance has units equal to the units of X times the units of Y. A normalized covariance called the correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of X and Y

ρ_{X,Y} = σ_{X,Y}/(σ_X σ_Y)    (3.53)
It can be shown (Thomas 1971) that −1 ≤ ρ_{X,Y} ≤ 1. Obviously, if X and Y are independent, ρ_{X,Y} = 0. Again, the converse is not necessarily true. X and Y can be functionally related and still have ρ_{X,Y} (and σ_{X,Y}) equal to zero. Actually ρ_{X,Y} is a measure of the linear dependence between X and Y. If ρ_{X,Y} = 0, then X and Y are linearly independent; however, they may be related by some other functional form. A value of ρ_{X,Y} equal to ±1 implies that X and Y are perfectly related by Y = a + bX. If ρ_{X,Y} = 0, X and Y are said to be uncorrelated. Any nonzero value of ρ_{X,Y} means X and Y are correlated.
The covariance and the correlation coefficient are a measure of how the two variables X and Y vary together. If ρ_{X,Y} and σ_{X,Y} are positive, large values of X tend to be paired with large values of Y and vice versa. If ρ_{X,Y} and σ_{X,Y} are negative, large values of X tend to be paired with small values of Y and vice versa.
The population correlation coefficient ρ_{X,Y} can be estimated by the sample correlation coefficient as

r_{X,Y} = s_{X,Y}/(s_X s_Y)    (3.54)

where s_X and s_Y are the sample estimates for σ_X and σ_Y given by equation 3.25 and s_{X,Y} is the sample covariance given by equation 3.52.
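The sample covariance and correlation coefficient of equations 3.52 and 3.54 can be sketched in a few lines of Python; the functions and the perfectly linear test data are illustrative, not from the text.

```python
import math

def sample_cov(x, y):
    """Sample covariance, equation 3.52: divisor n - 1 (illustrative sketch)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation coefficient, equation 3.54: r = s_xy / (s_x s_y)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Points lying exactly on Y = X - 1 (as in figure 3.5a) give r = 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi - 1.0 for xi in x]
r = sample_corr(x, y)
print(round(r, 6))   # 1.0
```

Since every point lies on the straight line Y = X − 1, the computed correlation is exactly unity, matching the discussion of figure 3.5a below.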
Figure 3.5 demonstrates some typical values for r_{X,Y}. In figure 3.5a all of the points lie on the line Y = X − 1; consequently, there is perfect linear dependence between X and Y and the correlation coefficient is unity. In figure 3.5b the points are either on or slightly off the line Y = X − 1, and r_{X,Y} = 0.986. Perfect linear dependence does not exist in this case because some of the points deviate slightly from the straight line. In measuring and relating naturally occurring hydrologic variables, a correlation coefficient of 0.986 would be considered quite good, and the resulting straight line, Y = X − 1 in this case, would usually be judged a good usable relationship between X and Y.
In figure 3.5c the correlation coefficient has dropped to −0.671. The points in this case are scattered about the line Y = 1.264 − 1.571X. The scatter of the points is much greater than in the previous case, although the existence of some (stochastic) dependence is still in evidence.

In figure 3.5d the scatter of the points is great, with a corresponding lack of a strong (stochastic) dependence. Generally, a correlation coefficient of 0.211 is considered too small to indicate a useful stochastic dependence, as knowledge about X gives very little information about Y.
In the last two paragraphs the modifier "stochastic" has appeared with the word dependence. This is because in reality there are two kinds of dependence: stochastic and functional. Generally, throughout this book the word dependence alone should be taken to mean stochastic (or statistical) dependence.
Figures 3.5e and 3.5f contain examples of functionally dependent variables. In figure 3.5e the relationship is Y = X²/4 for X > 0 and in figure 3.5f the relationship is Y = (9 − X²)^{1/2} for −3 < X < 3. The correlation coefficient for figure 3.5e is 0.963, indicating a high degree of stochastic (linear) dependence. This illustrates that even though the dependence between X and Y is nonlinear, a high correlation coefficient can result. If the plot of figure 3.5e were to cover a different range of X, the correlation coefficient would change as well.
Figure 3.5f illustrates a situation where Y and X are perfectly functionally related even though the correlation coefficient is zero. The functional relationship is not linear, however. This figure demonstrates that one cannot conclude that X and Y are unrelated based on the fact that their correlation coefficient is small.
The fact that two variables have a high degree of linear correlation should not be interpreted
as indicating a functional or cause-and-effect relationship exists between the two variables. The
annual water yield on two adjacent watersheds may be highly positively correlated even though
a high yield from one watershed does not cause a high yield from the second watershed. More
likely the same climatic factors and geomorphic factors are operating on the two watersheds,
causing their water yields to be similar. The fact is often overlooked that high correlation does not
necessarily mean a cause-and-effect relationship exists between the correlated variables.
Further Properties of Moments

If Z is a linear function of two random variables X and Y, Z = aX + bY, then

E(Z) = aE(X) + bE(Y)    (3.55)

and

Var(Z) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)    (3.56)

Equations 3.55 and 3.56 can be generalized when Y is a linear function of n random variables as follows. If

Y = Σ_{i=1}^{n} a_i X_i    (3.57)

then

E(Y) = Σ_{i=1}^{n} a_i E(X_i)

and

Var(Y) = Σ_{i=1}^{n} a_i² Var(X_i) + 2 Σ Σ_{i<j} a_i a_j Cov(X_i, X_j)    (3.58)
A noteworthy result of equation 3.56 or 3.58 is that for uncorrelated random variables, the
variance of a sum or difference is equal to the sum of the variances. This is because the variation
in each of the random variables contributes to the variation of their sum or difference.
As a special case of a linear function, consider the X_i to be a random sample of size n. Let the a_i all be equal to 1/n. Then Y is equal to x̄, the mean of the sample. The Var(Y) is the Var(x̄) and can be found from equation 3.58. Since the X_i form a random sample, Cov(X_i, X_j) = 0 for i ≠ j and Var(X_i) = Var(X). We now have

Var(x̄) = Var(X)/n = σ_X²/n    (3.59)
Equation 3.59 states that the variance of the mean of a random sample is equal to the variance of the sample divided by the number of observations used to estimate the mean of the sample. If X and Y are independent random variables, then equation 3.49 shows that the expectation of their product is equal to the product of their expectations.

E(XY) = E(X)E(Y)  if X and Y are independent    (3.60)
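Equations 3.59 and 3.60 are easy to confirm by simulation. The sketch below is only illustrative: it assumes a uniform population on [0, 1), for which μ = 0.5 and σ² = 1/12, rather than any data from the text.

```python
import random

random.seed(1)

n, reps = 25, 20000   # sample size and number of repeated samples (illustrative)

# Variance of the sample mean (equation 3.59): Var(xbar) = sigma^2 / n
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
mbar = sum(means) / reps
var_of_mean = sum((m - mbar) ** 2 for m in means) / reps
print(var_of_mean)        # close to (1/12)/25 = 0.00333...

# E(XY) = E(X)E(Y) for independent X and Y (equation 3.60)
xy_mean = sum(random.random() * random.random() for _ in range(reps)) / reps
print(xy_mean)            # close to 0.5 * 0.5 = 0.25
```

The empirical variance of the 20,000 sample means falls near σ²/n, and the average product of two independent draws falls near the product of the two means.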
The variance of the product XY for X and Y independent can be obtained from

Var(XY) = E[(XY)²] − E²(XY)

Because X and Y are independent, p_{X,Y}(x, y) = p_X(x) p_Y(y) and E[(XY)²] becomes E(X²)E(Y²), or E[(XY)²] = (μ_X² + σ_X²)(μ_Y² + σ_Y²). Also from equation 3.60, E²(XY) = E²(X)E²(Y) = μ_X²μ_Y². Thus

Var(XY) = (μ_X² + σ_X²)(μ_Y² + σ_Y²) − μ_X²μ_Y²

which reduces to

Var(XY) = μ_X²σ_Y² + μ_Y²σ_X² + σ_X²σ_Y²

In general, E[g(X)] is not equal to g[E(X)]. That this is true is obvious from the example of g(X) = X². From equation 3.23 it can be seen that E(X²) = σ_X² + μ_X², which differs from g[E(X)] = μ_X² whenever σ_X² > 0.
SAMPLE MOMENTS

If x_i for i = 1 to n is a random sample, then the r-th sample moment about the origin is

m′_r = (1/n) Σ_{i=1}^{n} x_i^r

and the r-th sample moment about the mean is

m_r = (1/n) Σ_{i=1}^{n} (x_i − x̄)^r
For the bivariate case involving a random sample of x_i and y_i, the r, s sample moment about the origin is

m′_{r,s} = (1/n) Σ_{i=1}^{n} x_i^r y_i^s
The expected value of sample moments is equal to the population moments (Mood et al. 1974).
Two important properties of moments worthy of repeating are:

1. The first moment about the mean is zero:

   E(X − μ_X) = E(X) − μ_X = μ_X − μ_X = 0

2. The second moment about the origin is equal to the variance plus the square of the mean:

   E(X²) = σ_X² + μ_X²
The moments about the mean are related to the moments about the origin by the following general equation (Thomas 1971)

m_r = Σ_{j=0}^{r} C(r, j)(−x̄)^j m′_{r−j}    (3.66)

For the computation of sample moments it is often convenient to use equation 3.66. The results of equation 3.66 for the first four sample moments are

m₁ = 0
m₂ = m′₂ − x̄²
m₃ = m′₃ − 3x̄ m′₂ + 2x̄³
m₄ = m′₄ − 4x̄ m′₃ + 6x̄² m′₂ − 3x̄⁴
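Equation 3.66 lends itself to direct computation. The sketch below, with an illustrative function name and hypothetical data, builds the central moments from the moments about the origin and checks them against a direct calculation.

```python
from math import comb

def central_moments(data, r_max=4):
    """Sample moments about the mean from moments about the origin,
    using the general relation of equation 3.66 (illustrative sketch)."""
    n = len(data)
    xbar = sum(data) / n
    m_prime = [sum(x ** r for x in data) / n for r in range(r_max + 1)]
    return [sum(comb(r, j) * (-xbar) ** j * m_prime[r - j] for j in range(r + 1))
            for r in range(r_max + 1)]

data = [1.0, 2.0, 4.0, 4.0, 9.0]          # hypothetical sample, mean = 4.0
m = central_moments(data)
direct = [sum((x - 4.0) ** r for x in data) / 5 for r in range(5)]
print(all(abs(a - b) < 1e-9 for a, b in zip(m, direct)))   # True
```

Both routes give the same central moments, and m₁ comes out zero as the first property above requires.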
Sample moments can be computed from grouped data by using the equations

m′_r = (1/n) Σ_{j=1}^{k} n_j x_j^r

and

m_r = (1/n) Σ_{j=1}^{k} n_j (x_j − x̄)^r

where x_j and n_j are the class mark and number of observations, respectively, in the j-th group, n is the total number of observations, and k is the number of groups.
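The grouped-data equations can be sketched directly; the class marks and frequencies below are hypothetical, chosen only to illustrate the arithmetic.

```python
def grouped_moment(class_marks, counts, r):
    """r-th sample moment about the mean from grouped data (sketch):
    class_marks are the x_j, counts the n_j, as in the equations above."""
    n = sum(counts)
    xbar = sum(nj * xj for xj, nj in zip(class_marks, counts)) / n
    return sum(nj * (xj - xbar) ** r for xj, nj in zip(class_marks, counts)) / n

marks = [5.0, 15.0, 25.0, 35.0]    # hypothetical class marks
counts = [2, 5, 2, 1]              # hypothetical class frequencies
m2 = grouped_moment(marks, counts, 2)
print(m2)   # 76.0, the grouped estimate of the variance
```

With these numbers the grouped mean is 17.0 and the grouped second central moment is 76.0.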
Moments of greater than third order are generally not computed for hydrologic variables because of the small sample size. Higher-order moments are very unreliable (have a high variance) for small samples. For example, the variance of s² (the variance of the sample variance) is (Mood et al. 1974)

Var(s²) = (1/n)[μ₄ − ((n − 3)/(n − 1)) μ₂²]
Yevjevich (1972a) presents general expressions for the variance of the variance, coefficient of
skew, and kurtosis.
where 1 − (j − 0.35)/n are estimators for P_X(x_(j)). Stedinger et al. (1994) recommend this estimator for single site estimation despite its bias because it generally results in a smaller mean square error than the unbiased estimator given below.
L-moment estimates play the roles of the mean, standard deviation, skewness, and kurtosis: l₁ estimates the mean, l₂ is a measure of spread, and the L-moment ratios t₃ = l₃/l₂ and t₄ = l₄/l₂ serve as L-skewness and L-kurtosis. Because L-moments do not involve squares and cubes of observations, they tend to produce less variable estimates for higher moments, especially when an unusually large or small observation happens to be present in a sample.
L-moments and probability weighted moments are related by

λ₁ = β₀
λ₂ = 2β₁ − β₀
λ₃ = 6β₂ − 6β₁ + β₀
λ₄ = 20β₃ − 30β₂ + 12β₁ − β₀
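A minimal sketch of the computation, assuming the plotting-position estimator (j − 0.35)/n quoted above for F(x_(j)) with the sample sorted ascending, and the standard PWM-to-L-moment relations; the function name and data are illustrative.

```python
def l_moments(data):
    """Sample L-moments from probability weighted moments (illustrative sketch)."""
    x = sorted(data)
    n = len(x)
    # b_r estimates beta_r using the plotting position (j - 0.35)/n for F(x_(j))
    b = [sum(((j - 0.35) / n) ** r * x[j - 1] for j in range(1, n + 1)) / n
         for r in range(4)]
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4

l1, l2, l3, l4 = l_moments([1.0, 2.0, 3.0, 4.0, 100.0])
print(l1, l2)   # l1 is the sample mean; l2 the L-scale
```

Note that b₀ is simply the sample mean, so l₁ reproduces x̄ exactly even with the outlier of 100 present, while l₂ remains positive and far less inflated than the ordinary standard deviation would be.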
PARAMETER ESTIMATION
Thus far, probability distribution functions have been written p_X(x) or f_X(x), depending on whether they were continuous or discrete. More correctly, they should be written p_X(x; θ₁, θ₂, ..., θ_m) or f_X(x; θ₁, θ₂, ..., θ_m), indicating that in general the distributions are a function of a set of parameters as well as of random variables. To use probability distributions to estimate probabilities, values for the parameters must be available. This section discusses methods for estimating the parameter values for probability distributions. Certain properties of these parameter estimates or statistics are also discussed. Rather than carry a dual set of relationships, one for continuous and one for discrete random variables, only the expressions for the continuous random variables will be displayed. The results are equally applicable to discrete distributions.
The usual procedure for estimating a parameter is to obtain a random sample x₁, x₂, ..., x_n from the population X. This random sample is then used to estimate the parameters. Thus θ̂_i, an estimate for the parameter θ_i, is a function of the observations or random variables. Since θ̂_i is a function of random variables, θ̂_i is itself a random variable possessing a mean, variance, and probability distribution.
Intuitively, one would feel that the more observations of the random variable that were available for parameter estimation, the closer θ̂ should be to θ. Also, if many samples were used for obtaining θ̂, one would feel that the average value of θ̂ should equal θ. These two statements deal with two properties of estimators known as consistency and unbiasedness.
Unbiasedness
An estimate θ̂ of a parameter θ is said to be unbiased if E(θ̂) = θ. The bias, if any, is given by

bias = E(θ̂) − θ    (3.75)

The fact that an estimator is unbiased does not guarantee that an individual θ̂ is equal to θ or even close to θ; it simply means that the average of many independent estimates for θ will equal θ.
Consistency
An estimator θ̂ of a parameter θ is said to be consistent if the probability that θ̂ differs from θ by more than an arbitrary constant ε approaches zero as the sample size approaches infinity.

Consistency is an asymptotic property because it states that by selecting an n sufficiently large, the prob(|θ̂ − θ| > ε) can be made as small as desired. For small samples (as are many times used in practice) consistency does not guarantee that a small error will be made. In spite of this, one feels more comfortable knowing that θ̂ would converge to θ if a larger sample were used.

A single estimate of θ from a small sample is a problem because neither unbiasedness nor consistency gives us much comfort. In choosing between several methods for estimating θ, in addition to being unbiased and consistent it would be desirable if the Var(θ̂) were as small as possible. This would mean that the probability distribution of θ̂ would be more concentrated about θ.
Efficiency
An estimator θ̂ is said to be the most efficient estimator for θ if it is unbiased and its variance is at least as small as that of any other unbiased estimator for θ. The relative efficiency of θ̂₁ with respect to θ̂₂ for estimating θ is the ratio of Var(θ̂₂) to Var(θ̂₁).
Sufficiency
Finally, it is desirable that θ̂ use all of the information contained in the sample relative to θ. If only a fraction of the observations in a sample are used for estimating θ, then some information about θ is lost. An estimator θ̂ is said to be a sufficient estimator for θ if θ̂ uses all of the information relevant to θ that is contained in the sample.
More formal statements of the above four properties of estimators and procedures for determining if an estimator has these properties can be found in books on mathematical statistics
(Lindgren 1968; Freund 1962; Mood et al. 1974).
There are many ways for estimating population parameters from samples of data. A few of
these are graphical procedures, matching selected points, method of moments, maximum likelihood, and minimum chi-square. The graphical procedure consists of drawing a line through plotted points and then using certain points on the line to calculate the parameters. This procedure is
very arbitrary and is dependent upon the individual doing the analysis. Frequently, the method is employed when few observations are available, with the thought that few observations will not produce good parameter estimates anyway. Yet when few points are available is precisely the time when the best methods of parameter estimation should be used.
The method of matching points is not a commonly used method but can produce reasonable
first approximations to the parameters. The procedure can be valuable in getting initial estimates
for the parameters to be employed in iterative solutions that can arise when the method of moments or maximum likelihood are used.
Example 3.1. A certain set of data is thought to follow the distribution p_X(x) = λe^{−λx} for x ≥ 0. In this particular data set, 75% of the values are less than 3.0. Estimate the parameter λ.

Solution:

p_X(x) = λe^{−λx}

P_X(x) = ∫₀^x λe^{−λt} dt = 1 − e^{−λx}

1 − P_X(x) = e^{−λx}

λx = −ln[1 − P_X(x)]

λ̂ = −ln(1 − 0.75)/3.0 = 0.46
Comment: If a sample of size n is available, this procedure could be used to obtain n estimates for λ. These n estimates could then be averaged to obtain λ̂. If the probability distribution of interest had m parameters, then the values of P_X(x) and x at m points would be used to obtain m equations in the m unknown parameters. The method of matching points is not recommended for general use in getting final parameter estimates. Certainly this method would not use all of the information in the sample. Also, several different estimates for the parameters could be obtained from the same sample depending on which observations were used in the estimation process.
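The matching-points calculation of example 3.1 takes one line; the sketch below is illustrative (the function name is not from the text).

```python
import math

def match_point_lambda(x, prob):
    """Matching-points estimate of lambda in p(x) = lambda*exp(-lambda*x):
    from P(x) = 1 - exp(-lambda*x), lambda = -ln(1 - P(x))/x (example 3.1)."""
    return -math.log(1.0 - prob) / x

# 75% of the values are less than 3.0, as in example 3.1
lam = match_point_lambda(3.0, 0.75)
print(round(lam, 3))   # 0.462
```

One such estimate could be formed from each observation and its plotting position, then averaged, as the comment above suggests.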
Method of Moments
One of the most commonly used methods for estimating the parameters of a probability distribution is the method of moments. For a distribution with m parameters, the procedure is to
equate the first m moments of the distribution to the first m sample moments. This results in m
equations which can be solved for the m unknown parameters. Moments about the origin, the
mean, or any other point can be used. Generally, for 1-parameter distributions the first moment
about the origin, the mean, is used. For 2-parameter distributions the mean and the variance are
generally used. If a third parameter is required, the skewness may be used.
Similarly, L-moments may be used in parameter estimation by equating sample estimates of
the L-moments to the population expression for the corresponding L-moment depending on the
particular pdf being used. Again, for m parameters, m L-moments would be required. This technique will be illustrated in chapter 6 for some particular pdfs.
Example 3.2. Use the method of moments to estimate the parameter λ of the distribution p_X(x) = λe^{−λx} for x ≥ 0.

Solution:

E(X) = ∫₀^∞ x λe^{−λx} dx = 1/λ

Thus, the mean of p_X(x) is 1/λ, so that λ can be estimated by λ̂ = 1/x̄.
Example 3.3. Use the method of moments to estimate the parameters of

p_X(x) = [1/(√(2π) θ₂)] exp[−(x − θ₁)²/(2θ₂²)]

Solution: let

y = (x − θ₁)/θ₂

so that dx = θ₂ dy and

E(X) = ∫_{-∞}^{∞} x p_X(x) dx = (θ₂/√(2π)) ∫_{-∞}^{∞} y e^{−y²/2} dy + (θ₁/√(2π)) ∫_{-∞}^{∞} e^{−y²/2} dy

The first integral has an integrand h(y) such that h(−y) = −h(y) and is therefore zero. The second integral can be written as

(θ₁/√(2π)) √(2π) = θ₁

Therefore μ_X = θ₁, or the parameter θ₁ of this distribution is equal to the mean of the distribution and can be estimated by

θ̂₁ = x̄

For the second central moment

Var(X) = ∫_{-∞}^{∞} (x − θ₁)² p_X(x) dx

let y = (x − θ₁)/(√2 θ₂) so that

dx = √2 θ₂ dy

and

Var(X) = (2θ₂²/√π) ∫_{-∞}^{∞} y² e^{−y²} dy = θ₂²

Thus, the parameter θ₂² is equal to the variance and can be estimated by s_x² (the sample variance).

θ̂₂² = s_x²
Substituting the parameter estimates in terms of their population values into the expression for p_X(x), the result is the normal distribution with mean μ_X and variance σ_X².

Maximum Likelihood

For a random sample x₁, x₂, ..., x_n from p_X(x; θ₁, θ₂, ..., θ_m), the joint density of the sample, considered as a function of the parameters, is

L(θ₁, θ₂, ..., θ_m) = Π_{i=1}^{n} p_X(x_i; θ₁, θ₂, ..., θ_m)

Now, this latter expression is proportional to the probability that the particular random sample would be obtained from the population and is known as the likelihood function.
The m parameters are unknown. The values of these m parameters that maximize the likelihood that the particular sample in hand is the one that would be obtained if n random observations were selected from p_X(x; θ₁, θ₂, ..., θ_m) are known as the maximum likelihood estimators. The parameter estimation procedure becomes one of finding the values of θ₁, θ₂, ..., θ_m that maximize the likelihood function. This can be done by taking the partial derivative of L(θ₁, θ₂, ..., θ_m) with respect to each of the θ_i's and setting the resulting expressions equal to zero. These m equations in m unknowns are then solved for the m unknown parameters.

Because many probability distributions involve the exponential function, it is many times easier to maximize the natural logarithm of the likelihood function. The logarithmic function is monotonic; thus the values of the θ's that maximize the logarithm of the likelihood function also maximize the likelihood function.
Example 3.4. Find the maximum likelihood estimator for the parameter λ of the distribution p_X(x) = λe^{−λx} for x > 0.

Solution:

L(λ) = Π_{i=1}^{n} λe^{−λx_i} = λⁿ exp(−λ Σ x_i)

ln L(λ) = n ln λ − λ Σ x_i

d[ln L(λ)]/dλ = n/λ − Σ x_i = 0

λ̂ = n/Σ x_i = 1/x̄
Note that this is the same estimate as obtained in example 3.2 using the method of moments. The
two methods do not always produce the same estimates.
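The closed form λ̂ = 1/x̄ from example 3.4 can be checked numerically: a coarse grid search over the log-likelihood should not find a better value. The data below are hypothetical and the names illustrative.

```python
import math

data = [0.2, 0.5, 1.1, 1.7, 2.4, 0.9]    # hypothetical positive observations
xbar = sum(data) / len(data)
lam_hat = 1.0 / xbar                      # closed-form MLE from example 3.4

def log_like(lam):
    """Log-likelihood of the exponential sample: n*ln(lam) - lam*sum(x)."""
    return len(data) * math.log(lam) - lam * sum(data)

# A coarse grid search should land within one grid step of the closed form
grid = [0.01 * k for k in range(1, 500)]
best = max(grid, key=log_like)
print(lam_hat, best)   # best is within one grid step of lam_hat
```

For this sample both the moment estimate and the maximum likelihood estimate are 1/x̄, illustrating the remark above that the two methods can, but need not, coincide.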
Example 3.5. Find the maximum likelihood estimators for the parameters θ₁ and θ₂² of the distribution of example 3.3.

Solution:

ln L(θ₁, θ₂²) = −(n/2) ln(2πθ₂²) − Σ(x_i − θ₁)²/(2θ₂²)

∂ ln L/∂θ₁ = Σ(x_i − θ₁)/θ₂² = 0

Therefore Σ(x_i − θ₁) = 0, so that θ̂₁ = x̄.

∂ ln L/∂θ₂² = −n/(2θ₂²) + Σ(x_i − θ₁)²/(2θ₂⁴) = 0

so that

θ̂₂² = Σ(x_i − x̄)²/n

Note that θ̂₂² uses a divisor of n rather than n − 1 and is therefore biased.
Example 3.5 shows that the maximum likelihood estimators are not unbiased. It can be shown, however, that the maximum likelihood estimators are asymptotically (as n → ∞) unbiased. Maximum likelihood estimators are sufficient and consistent. If an efficient estimator exists, maximum likelihood estimators, adjusted for bias, will be efficient. In addition to these four properties, maximum likelihood estimators are said to be invariant; that is, if θ̂ is a maximum likelihood estimator of θ and the function h(θ) is continuous, then h(θ̂) is a maximum likelihood estimator of h(θ).
The method of moments and the method of maximum likelihood do not always produce the same estimates for the parameters. In view of the properties of the maximum likelihood estimators, this method is generally preferred over the method of moments. Cases arise, however, where maximum likelihood estimators can be obtained only by iterative numerical solutions (if at all), leaving room for more readily obtainable estimates, possibly by the method of moments. The accuracy of the method of moments is severely affected if the data contain errors in the tails of the distribution, where the moment arms are long (Chow 1954). This is especially troublesome with highly skewed distributions.
Finally, it should be kept in mind that the properties of maximum likelihood estimators are
asymptotic properties (for large n) and there well may exist better estimation procedures for
small samples for particular distributions.
CHEBYSHEV INEQUALITY
Certain general statements about random variables can be made without placing restrictions
on their distributions. More precise probabilistic statements require more restrictions on the distribution of the random variables. Exact probabilistic statements require complete knowledge of
the probability distribution of the random variable.
One general result that applies to random variables is known as the Chebyshev inequality. This inequality states that a single observation selected at random from any probability distribution will deviate more than kσ from the mean, μ, of the distribution with probability less than or equal to 1/k².

prob(|X − μ| ≥ kσ) ≤ 1/k²    (3.77)
For most situations this is a very conservative statement. The Chebyshev inequality produces an
upper bound on the probability of a deviation of a given magnitude from the mean.
Example 3.6. The data of table 2.1 has a mean of 66,540 cfs and a standard deviation of 22,322
cfs. Without making any distributional assumptions regarding the data, what can be said of the
probability that the peak flow in a year selected at random will deviate more than 40,000 cfs from
the mean?
Solution: Applying Chebyshev's inequality we have kσ = 40,000 cfs. Using 22,322 cfs as an estimate for σ we obtain k = 40,000/22,322 = 1.79 and

prob(|Q − μ| ≥ 40,000) ≤ 1/k² = 1/(1.79)² = 0.311
The probability that the peak flow in any year will deviate more than 40,000 cfs from the mean is thus less than or equal to 0.311.
Comment: One can see that this is a very conservative figure by noting that only 6 values out of 99 (6/99 = 0.061) lie outside the interval 66,540 ± 40,000. By not making any distributional
assumptions, we are forced to accept very conservative probability estimates. In later chapters we
will again look at this problem making use of selected probability distributions.
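The conservatism of the bound is easy to see by simulation. Since the table 2.1 record is not reproduced here, the sketch below uses a synthetic normal sample with the same mean and standard deviation; it only illustrates the bound, it is not the Kentucky River data.

```python
import random

random.seed(7)

# Synthetic record standing in for table 2.1 (hypothetical, normal draws)
flows = [random.gauss(66540, 22322) for _ in range(99)]
mu = sum(flows) / len(flows)
sd = (sum((q - mu) ** 2 for q in flows) / len(flows)) ** 0.5

k = 40000 / sd                     # deviation expressed in standard deviations
bound = 1 / k ** 2                 # Chebyshev upper bound on P(|Q - mu| >= 40,000)
observed = sum(abs(q - mu) > 40000 for q in flows) / len(flows)
print(bound, observed)
```

For mound-shaped data the observed exceedance frequency falls far below the Chebyshev bound of roughly 0.31, which is exactly the point made in the comment to example 3.6.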
LAW OF LARGE NUMBERS
Chebyshev's inequality is sometimes written in terms of the mean x̄ of a random sample of size n. In such a case equation 3.77 becomes

prob(|x̄ − μ_X| ≥ ε) ≤ σ_X²/(nε²)    (3.78)

If we now let δ = 1/k² and choose n so that n ≥ σ_X²/(ε²δ), we have the (weak) Law of Large Numbers (Mood and Graybill 1963), which states:

Let p_X(x) be a probability density function with mean μ_X and finite variance σ_X². Let x̄_n be the mean of a random sample of size n from p_X(x). Let ε and δ be any two specified small numbers such that ε > 0 and 0 < δ < 1. Then for n any integer greater than σ_X²/(ε²δ),

prob(|x̄_n − μ_X| < ε) ≥ 1 − δ    (3.79)
This statement assures us that we can estimate the population mean with whatever accuracy
we desire by selecting a large enough sample. The actual application of equation 3.79 requires
knowledge of population parameters and is thus of limited usefulness.
Example 3.7. Assume that the standard deviation of peak flows on the Kentucky River near
Salvisa, Kentucky, is 22,322 cfs. How many observations would be required to be at least 95%
sure that the estimated mean peak flow was within 10,000 cfs of its true value if we know nothing of the distribution of peak flows?
Solution: Applying equation 3.79 we have

n > σ²/(ε²δ) = (22,322)²/[(10,000)²(0.05)] = 99.7
We must have at least 100 observations to be 95% sure that the sample mean is within 10,000
cfs of the population mean if we know nothing of the population distribution except its standard
deviation. This happens to be very close to the number of observations in the sample (99).
Comment: We will look at this problem again later making certain distributional assumptions.
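The sample-size requirement of equation 3.79 is a one-line calculation; the sketch below uses the numbers of example 3.7 (the function name is illustrative).

```python
import math

def lln_sample_size(sigma, eps, delta):
    """Smallest integer n with n > sigma^2/(eps^2 * delta), per equation 3.79."""
    return math.floor(sigma ** 2 / (eps ** 2 * delta)) + 1

# Example 3.7: sigma = 22,322 cfs, eps = 10,000 cfs, 95% confidence (delta = 0.05)
n = lln_sample_size(22322, 10000, 0.05)
print(n)   # 100
```

The requirement comes out to 100 observations, matching the solution above.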
Exercises
3.1. What is the expected mean and variance of the sum of values obtained by tossing two dice?
What is the coefficient of skew and kurtosis?
3.2. Modular coefficients, defined as K_t = x_t/x̄, are occasionally used in hydrology. What is the mean, variance, and coefficient of variation of modular coefficients in terms of the original data?
3.3. What effect does the addition of a constant to each observation from a random sample have
on the mean, variance, and coefficient of variation?
3.4. What effect does multiplying each observation in a random sample by a constant have on
the mean, variance, and coefficient of variation?
3.5. Without any knowledge of the probability distribution of peak flows on the Kentucky River (table 2.1), what can be said about the probability that |Q̄ − μ_Q| is greater than 10,000 cfs?
3.6. Without any knowledge of the probability distribution of peak flows on the Kentucky River
(table 2.1), what can be said about the probability that a single random observation will deviate
more than 10,000 cfs from μ_Q?
3.7. Using the data of exercise 2.2 calculate the mean and variance from the grouped data. How
do the grouped data mean and variance compare to the ungrouped mean and variance? Which
estimate do you prefer?
3.8. Calculate the covariance between the peak discharge Q in thousands of cfs and the area A in
thousands of square miles for the following data.
3.9. Calculate the correlation coefficient between Q and A for the data in exercise 3.8.
3.10. Calculate the coefficient of skew for Q in exercise 3.8. Note that this estimate is relatively
unreliable because of the small sample.
3.11. Calculate the kurtosis and the coefficient of excess for Q in exercise 3.8. Note that these
estimates are unreliable because of the small sample size.
3.12. Complete the steps necessary to arrive at equation 3.56 from 3.55.
3.13. Show that σ_X σ_Y ≥ |σ_{X,Y}|.
3.14. A convenient relationship for calculating the estimated variance of a sample of data is

s_x² = (Σ x_i² − n x̄²)/(n − 1) = [Σ x_i² − (Σ x_i)²/n]/(n − 1)

Derive this expression from equation 3.49. Note that this estimated covariance is biased. In practice, the final divisor of n is replaced by n − 1 to correct for bias.
3.16. In exercise 2.14, if the future maximum life of the ferry is 15 years, what is the expected
net profit? Neglect the interest or discount rate.
3.17. What are the maximum likelihood estimates for the parameters of the two parameter
exponential distribution? This distribution is given by
3.18. What are the moment estimates for the parameters of the exponential distribution given in
exercise 3.17?
3.19. For the following data, what are the moment and maximum likelihood estimates for the
parameters of the distribution given in exercise 3.17? x = 15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5.
3.20. Calculate the coefficient of skew for the Kentucky River data of table 2.1.
3.21. Calculate the kurtosis of the Kentucky River data of table 2.1.
3.22. Using the data of exercise 2.2, calculate the coefficient of skew from the grouped data.
3.23. Using the data of exercise 2.2, calculate the kurtosis from the grouped data.
3.25. What are the mean and variance of p_X(x) = 1/N for x = 1, 2, ..., N?
3.26. What are the mean and variance of p_X(x) = a sin² x for 0 < x < π?
3.27. Use the method of moments to estimate a in p_X(x) = a sin² x for 0 < x < π based on the random sample given by x = 0.5, 2.0, 3.0, 2.5, 1.5, 1.8, 1.0, 0.8, 2.5, 2.2.
3.28. The r-th moment about x₀ can be written as E[(X − x₀)^r]. Show that the variance is the smallest possible second moment.
4. Some Discrete
Probability
Distributions and
Their Applications
THUS FAR, probability distributions have been considered in general terms. This chapter is
devoted to some particular discrete distributions and their applications. The following two chapters
are devoted to selected continuous distributions. These chapters are by no means exhaustive treatments of probability distributions; only some of the more common distributions are considered.
HYPERGEOMETRIC DISTRIBUTION
Drawing a random sample of size n (without replacement) from a finite population of size
N, with the elements of the population divided into two groups with k elements belonging to one
group, is an example of sampling from a hypergeometric distribution. The two groups may be defective or nondefective objects, rainy or nonrainy days, success or failure of a project, and so
forth. For discussion purposes we will consider that an element (or outcome) from the population
is either a success or a failure. The probability of x successes in a sample of size n selected from
a population of size N containing k successes can be determined by applying equation 2.1.
The total number of possible outcomes or ways of selecting a sample of size n from N objects is C(N, n). The number of ways of selecting x successes and n − x failures from the population containing k successes and N − k failures is C(k, x) C(N − k, n − x). Thus the probability is

f_X(x; N, n, k) = C(k, x) C(N − k, n − x)/C(N, n)    (4.1)
The distribution given by equation 4.1 is known as the hypergeometric distribution where
fx(x; N, n, k) is the probability of obtaining x success in a sample of size n drawn from a population of size N containing k successes.
The cumulative hypergeometric distribution giving the probability of x or fewer successes is

F_X(x; N, n, k) = Σ_{i=0}^{x} C(k, i) C(N − k, n − i)/C(N, n)

There are certain natural restrictions on this distribution. For example: x cannot exceed k, x cannot exceed n, k cannot exceed N, and n cannot exceed N. N, n, k, and x are all nonnegative integers. Furthermore, the outcomes must be random and equally likely.
The mean of the hypergeometric distribution is

E(X) = nk/N
Example 4.1. The hypergeometric applies in example 2.5. In this example, a success is selecting a bad record and N = 10, k = 3, n = 4. The solutions can be written in terms of the hypergeometric as

f_X(1; 10, 4, 3) = C(3, 1) C(7, 3)/C(10, 4) = (3)(35)/210 = 0.500
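Equation 4.1 is straightforward to evaluate with exact binomial coefficients; the sketch below (function name illustrative) reproduces the numbers of examples 4.1 and 4.2.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """Hypergeometric probability, equation 4.1: x successes in a sample of
    size n drawn without replacement from N items containing k successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Example 4.1: N = 10 records, k = 3 bad, n = 4 sampled
p_one_bad = hypergeom_pmf(1, 10, 4, 3)
print(p_one_bad)   # 0.5

# Example 4.2: N = 30 September days, k = 10 rainy, n = 10 sampled
p4 = hypergeom_pmf(4, 30, 10, 10)
p_lt4 = sum(hypergeom_pmf(x, 30, 10, 10) for x in range(4))
print(round(p4, 3), round(p_lt4, 3))   # 0.271 0.56
```

The cumulative form (x or fewer successes) is simply the running sum of the probability function, as in the second computation.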
DISCRETE DISTRIBUTIONS
Example 4.2. Assume that during a certain September, 10 rainy days occurred. Also assume that
at this particular location the occurrence of rain on any day is independent of whether or not it
rained on any previous day. (This is often not a good assumption).
A sample of 10 September days is selected at random. (a) What is the probability that 4 of
these days will have been rainy? (b) What is the probability that less than 4 of these days were
rainy?
Solution: Use the hypergeometric distribution with N = 30, k = 10, and n = 10.

(a) f_X(4; 30, 10, 10) = C(10, 4) C(20, 6)/C(30, 10) = (210)(38,760)/30,045,015 = 0.271

(b) F_X(3; 30, 10, 10) = Σ_{x=0}^{3} C(10, x) C(20, 10 − x)/C(30, 10) = 0.560
Example 4.3. Examples of the hypergeometric distribution commonly found in statistics books
include card sampling problems (What is the probability of exactly 2 aces in a 5-card hand
selected at random from a 52-card deck?) and acceptance sampling problems (What is the probability of selecting 5 defective items from a lot of 50 items if 20 items are selected and the lot
actually contains 12 defectives?)
Solution: Card problem

f_X(2; 52, 5, 4) = C(4, 2) C(48, 3)/C(52, 5) = (6)(17,296)/2,598,960 = 0.040
BERNOULLI PROCESSES
Binomial Distribution
Consider a discrete time scale. At each point on this time scale an event may either occur or not occur. Let the probability of the event occurring be p for every point on the time scale; thus, the occurrence of the event at any point on the time scale is independent of the history of any prior occurrences or nonoccurrences. The probability of an occurrence at the i-th point on the time scale is p for i = 1, 2, .... A process having these properties is said to be a Bernoulli process.
An example of a Bernoulli process might be the occurrence of rainy days. The time scale has
units of days. On any particular day, rainfall may or may not occur. If the occurrence of rainfall
on any given day is independent of the past history of rainfall occurrences, the sequence of rainy
and dry days can be considered a Bernoulli process.
As an example of another Bernoulli process, consider that during any year the probability of
the maximum flow exceeding 10,000 cfs on a particular stream is p. Common terminology for a
flow exceeding a given value is an exceedance. Further consider that the peak flow in any year is
independent from year to year (a necessary condition for the process to be a Bernoulli process).
Let q = 1 - p be the probability of not exceeding 10,000 cfs. We can neglect the probability of
a peak of exactly 10,000 cfs since the peak flow rates would be a continuous process. In this example the time scale is discrete with the points being nominally 1 year in time apart. We can now
make certain probabilistic statements about the occurrence of a peak flow in excess of 10,000 cfs
(an exceedance).
For example, the probability of an exceedance occurring in year 3 and not in years 1 or 2 can be evaluated from equation 2.9 as qqp since the process is independent from year to year. The probability of (exactly) one exceedance in any 3-year period is pqq + qpq + qqp since the exceedance could occur in either the first, second, or third year. Thus, the probability of (exactly) one exceedance in three years is 3pq².
In a similar manner, the probability of 2 exceedances in 5 years can be found from the summation of the terms ppqqq, pqpqq, pqqpq, ..., qqqpp. It can be seen that each of these terms is equivalent to p²q³ and that the number of terms is equal to the number of ways of arranging 2 items (the p's) among 5 items (the p's and q's). Therefore, the total number of terms is C(5, 2), or 10, so that the probability of exactly 2 exceedances in 5 years is

C(5, 2) p²q³

This result can be generalized so that the probability of X exceedances in n years is C(n, X) p^X q^(n−X). The result is applicable to any Bernoulli process, so the probability of X occurrences of an event in n independent trials, if p is the probability of an occurrence in a single trial, is given by

f_X(x; n, p) = C(n, x) p^x q^(n−x)
DISCRETE DISTRIBUTIONS
The cumulative binomial distribution is F_X(x; n, p) = Σ_{i=0}^{x} C(n, i) p^i q^(n-i) and gives the probability of X or fewer occurrences of an event in n independent trials if the probability of an occurrence in any trial is p.
Continuing the above example, the probability of fewer than 3 exceedances in 5 years is F_X(2; 5, p) = q^5 + 5pq^4 + 10p^2q^3.
The mean, variance, and coefficient of skew of the binomial distribution are

E(X) = np
Var(X) = npq    (4.8)
γ = (q - p)/(npq)^(1/2)
The distribution is symmetrical for p = q, skewed to the right for q > p and skewed to the left
for q < p.
Because the probability of a success on any trial is independent of past history, the origin of
the time scale of a Bernoulli process can be taken at any time point. Thus the probability of any
combination of successes or failures is the same for any sequence of n points regardless of their
location with respect to the origin.
Example 4.4. On the average, how many times will a 10-year flood occur in a 40-year period?
What is the probability that exactly this number of 10-year floods will occur in a 40-year period?
Solution: A 10-year flood has p = 1/10 = 0.1. The expected number of occurrences in 40 years is E(X) = np = 40(0.1) = 4. The probability of exactly 4 occurrences is f_X(4; 40, 0.1) = C(40, 4)(0.1)^4(0.9)^36 = 0.2059.
Comment: This problem illustrates the difficulty of explaining the concept of return period. On
the average a 10-year event occurs once every 10 years and in a 40-year period is expected to
occur 4 times. Yet in about 80% (100[1 - 0.2059]) of all possible independent 40-year periods,
the 10-year event will not occur exactly 4 times. As a matter of fact the probability that it will
occur 3 times is nearly identical to the probability it will occur 4 times (0.2003 vs. 0.2059). The
number of occurrences, X, is truly a random variable (with a binomial distribution).
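The arithmetic of example 4.4 can be checked with a short Python sketch (not part of the original text; the function name binomial_pmf is my own):

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x occurrences in n independent trials): C(n, x) p^x q^(n-x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example 4.4: a 10-year flood (p = 0.1) in a 40-year period.
expected = 40 * 0.1                      # E(X) = np = 4
p4 = binomial_pmf(4, 40, 0.1)            # probability of exactly 4 occurrences
p3 = binomial_pmf(3, 40, 0.1)            # nearly the same probability
print(expected, round(p4, 4), round(p3, 4))  # 4.0 0.2059 0.2003
```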
The binomial distribution has an additive property (Gibra 1973). That is, if X has a binomial
distribution with parameters n_1 and p and Y has a binomial distribution with parameters n_2 and p,
then Z = X + Y has a binomial distribution with parameters n = n_1 + n_2 and p.
A useful property of the binomial distribution is that f_X(x; n, p) = f_X(n - x; n, q).
The binomial distribution can be used to approximate the hypergeometric distribution if the
sample selected is small in comparison to the number of items N from which the sample is drawn.
In this case, the probability of a success would be about the same for each trial, and sampling
without replacement (hypergeometric) would be very similar to sampling with replacement
(binomial).
Example 4.5. Compare the hypergeometric and binomial distributions for N = 40, n = 5, k = 10 and x = 0,
1, 2, 3, 4, 5.
Solution:
Hypergeometric: f_X(x; N, n, k) = f_X(x; 40, 5, 10)
Binomial: f_X(x; n, p) = f_X(x; 5, 10/40)
Comment: This merely indicates that drawing a small sample without replacement from a large
population and drawing the same sample with replacement (so probabilities in each trial are constant) are nearly equivalent.
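This comparison can be sketched in Python (my own illustration; function names are mine):

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    # Sampling without replacement: x successes in a sample of n drawn
    # from N items of which k are successes.
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    # Sampling with replacement: constant success probability p each trial.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example 4.5: N = 40, n = 5, k = 10, so p = k/N = 0.25.
for x in range(6):
    print(x, round(hypergeom_pmf(x, 40, 5, 10), 4), round(binom_pmf(x, 5, 0.25), 4))
```

The printed pairs agree to within a few hundredths, as the comment in the text indicates.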
Example 4.6. The operator of a boat dock has decided to put in a new facility along a certain
river. In an economic analysis of the situation it was decided to have the facility designed to withstand floods up to 75,000 cfs. Furthermore, it was determined that if one flood greater than this
occurs in a 5-year period, repairs can be made and the operator will still break even on its operation during the 5-year period. If more than one flow in excess of 75,000 cfs occurs, money will
be lost. If the probability of exceeding 75,000 cfs is 0.15, what is the probability the operator will
make money?
Solution: Money will be made if no floods exceeding 75,000 cfs occur during the 5-year period.
Let X be the number of floods. From the binomial distribution, prob(X = 0) = f_X(0; 5, 0.15) = (0.85)^5 = 0.4437.
Comment: The probability that the operator will make the investment, work for 5 years, and just
break even is very high: prob(X = 1) = 5(0.15)(0.85)^4 = 0.3915.
Thus, even though the risk or probability of losing money is low (1 - 0.4437 - 0.3915 = 0.1648),
the investment may not be an attractive one.
T = 1/p = 95 years
Comment: To be 90% sure that a design storm is not exceeded in a 10-year period, a 95-year
return period storm must be used. If a 10-year return period storm is used, the chances of it being
exceeded are 1 - (1 - 1/10)^10 = 0.65.
It can be shown that as T gets large, this expression approaches 1 - 1/e or 0.632. For T = 5,
10, and 25, the probability is 0.67, 0.65, and 0.64, respectively. Thus, if the design life of a structure
and its design return period are the same, the chances are very great that the capacity of the structure will be exceeded during its design life. The risk associated with a return period of T years over a design life of n years is

risk = 1 - (1 - 1/T)^n
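The risk formula and its inverse can be sketched in Python (an illustration, not the book's code; function names are mine):

```python
def risk(T: float, n: float) -> float:
    """Probability of at least one T-year exceedance in n years: 1 - (1 - 1/T)^n."""
    return 1 - (1 - 1 / T) ** n

def design_return_period(target_risk: float, n: float) -> float:
    """Return period T giving the stated risk over an n-year design life (inverse of risk)."""
    return 1 / (1 - (1 - target_risk) ** (1 / n))

print(round(design_return_period(0.10, 10)))   # 95-year storm for 90% confidence over 10 years
print(round(risk(10, 10), 2))                  # 0.65 chance a 10-year storm is exceeded in 10 years
```

The first call reproduces the 95-year result of example 4.7.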
The procedure outlined in example 4.7 can be used to determine a design return period when
the allowable risk is stated. Note that the design return period must be much greater than the life of
the project to be reasonably sure that an exceedance will not occur. No matter what design return period is selected, there is still a chance that an exceedance will occur. Some may argue that there is an
upper limit to the magnitude of natural events, such as flood peaks. They would argue that a peak of
100,000 cfs from a 1-acre watershed would be impossible. In practice the probability that would be
assigned to an event of this sort is so small that it can be neglected for most practical purposes.
Figure 4.1 shows the design return period that must be used to be a certain percent confident
that the design will not be exceeded during the design life of the project. The parameters on the
curves are the percent chance of no exceedance during the design life. For example, to be 90%
sure that a design condition will not be exceeded during a project whose design life is 100 years,
the project would have to be designed on the basis of a 900-year event. Figure 4.1 is derived from
calculations like those contained in example 4.7.
Figure 4.1 can also be used to evaluate the risk or percent chance of an event in excess of the
design event during the design life. For example, if a project is designed on the basis of a 50-year
event and the design life of the project is 10 years, the designer is taking a 19% chance (100 - 81)
that the design condition will be exceeded.

Fig. 4.1. Design return period required as a function of design life to be a given percent confident
(curve parameter) that the design condition is not exceeded.

Comment: What has occurred prior to the trials of interest is of no concern since the Bernoulli
process is based on the assumption of independence from trial to trial.
Geometric Distribution
The probability that the first exceedance (or success) of a Bernoulli trial occurs on the
xth trial can be found by noting that for the first exceedance to be on the xth trial there must be
x - 1 preceding trials without an exceedance followed by 1 trial with an exceedance. Thus
the desired probability is pq^(x-1). This is known as the geometric distribution

f_X(x; p) = pq^(x-1)    x = 1, 2, ...

The mean of the geometric distribution, E(X) = 1/p, means that on the average a T-year event occurs on the Tth year, which agrees
with our intuitive concept of a return period.
Example 4.9. What is the probability that a 10-year flood will occur for the first time during the
fifth year after the completion of a project? What is the probability it will be at least the fifth year
before a 10-year flood occurs?
Solution: The probability that the first exceedance is in year 5 is f_X(5; 0.1) = (0.1)(0.9)^4 = 0.0656.
The probability that it will be at least the fifth year before the first occurrence is not the same as
the probability of the first occurrence in the fifth year. The expression "at least" implies the first
occurrence might be in the fifth year or some later year. The desired probability is equal to the
probability of no occurrences in the first 4 years, which is (0.9)^4 = 0.6561.
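Example 4.9 can be checked numerically (a sketch, not from the text; the function name is mine):

```python
def geometric_pmf(x: int, p: float) -> float:
    """First exceedance on trial x: x - 1 failures followed by a success, p * q^(x-1)."""
    return p * (1 - p) ** (x - 1)

p = 0.1                                   # 10-year flood
first_in_year_5 = geometric_pmf(5, p)     # first occurrence exactly in year 5
at_least_year_5 = (1 - p) ** 4            # no occurrence in the first 4 years
print(round(first_in_year_5, 4), round(at_least_year_5, 4))  # 0.0656 0.6561
```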
Solution: This is the same as the probability of the first occurrence on the tenth year, or f_X(10; 0.1) = (0.1)(0.9)^9 = 0.0387.
As might be expected because the negative binomial is based on the binomial, the additive
feature holds. Thus, if X and Y are described by f_X(x; k_1, p) and f_Y(y; k_2, p) respectively, then
Z = X + Y follows the negative binomial f_Z(z; k_1 + k_2, p).
Example 4.11. What is the probability that the fourth occurrence of a 10-year flood will be on the
fortieth year?
Solution: f_X(40; 4, 0.1) = C(39, 3)(0.1)^4(0.9)^36 = 0.0206
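A numerical check of this example (my code, not the book's; the function name is mine, using the pmf C(x-1, k-1) p^k q^(x-k)):

```python
from math import comb

def negative_binomial_pmf(x: int, k: int, p: float) -> float:
    """Probability that the k-th success occurs on trial x: C(x-1, k-1) p^k q^(x-k)."""
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

# Example 4.11: fourth 10-year flood (p = 0.1) on the fortieth year.
print(round(negative_binomial_pmf(40, 4, 0.1), 4))  # 0.0206
```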
In summary, the probability that the first occurrence of an event is at the xth trial is described by the
geometric distribution. The probability that the kth occurrence is at the xth trial is described by
the negative binomial distribution. It was also found that the probability distribution of the length
of time between occurrences can be found from the geometric distribution by noting that the
probability that x trials elapse between occurrences is the same as the probability that the first
occurrence is at the (x + 1)st trial, or f_X(x + 1; p) = pq^x.
POISSON PROCESS
Poisson Distribution
Consider a Bernoulli process defined over an interval of time (or space) so that p is the probability that an event may occur during the time interval. If the time interval is allowed to become
shorter and shorter so that the probability, p, of an event occurring in the interval gets smaller and
the number of trials, n, increases in such a fashion that np remains constant, then the expected
number of occurrences in any total time interval remains the same. It can be shown that as n gets
large and p gets small so that np remains a constant, λ, the binomial distribution approaches the
Poisson distribution given by

f_X(x; λ) = λ^x e^(-λ)/x!    x = 0, 1, 2, ...
The mean, variance, and coefficient of skew of the Poisson distribution are

E(X) = λ
Var(X) = λ
γ = 1/λ^(1/2)

As λ gets large, the distribution goes from a positively skewed distribution to a nearly symmetrical distribution. The cumulative Poisson distribution is

F_X(x; λ) = Σ_{i=0}^{x} λ^i e^(-λ)/i!
Example 4.12. What is the probability that a storm with a return period of 20 years will occur
once in a 10-year period?
Solution: Using the binomial distribution the exact answer is f_X(1; 10, 0.05) = 10(0.05)(0.95)^9 = 0.3151. Using the Poisson approximation with λ = np = 10(0.05) = 0.5, f_X(1; 0.5) = 0.5e^(-0.5) = 0.3033.
Thus the solutions are not identical but are quite close to each other.
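The two computations can be compared in Python (illustrative only; function names are mine):

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

# Example 4.12: one 20-year storm (p = 0.05) in 10 years; Poisson uses lam = np = 0.5.
print(round(binom_pmf(1, 10, 0.05), 4))   # 0.3151
print(round(poisson_pmf(1, 0.5), 4))      # 0.3033
```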
Example 4.13. What is the probability of 5 occurrences of a 2-year storm in a 10-year period?
Solution: Using the binomial, f_X(5; 10, 0.5) = C(10, 5)(0.5)^10 = 0.2461. The Poisson approximation with λ = np = 5 gives f_X(5; 5) = 5^5 e^(-5)/5! = 0.1755.
Comment: For this situation n is not large enough and p not small enough for a good approximation.
Example 4.14. What is the probability of fewer than 5 occurrences of a 20-year storm in a 100-year period?
Solution: n is relatively large and p small, so the Poisson will be used with λ = np = 100(0.05) = 5. The desired probability is F_X(4; 5) = Σ_{i=0}^{4} 5^i e^(-5)/i! = 0.4405.
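The cumulative Poisson computation can be sketched as (my code, not the book's):

```python
from math import exp, factorial

def poisson_cdf(x, lam):
    """P(X <= x) for a Poisson random variable with mean lam."""
    return sum(lam**i * exp(-lam) / factorial(i) for i in range(x + 1))

# Example 4.14: fewer than 5 occurrences of a 20-year storm in 100 years; lam = 100/20 = 5.
print(round(poisson_cdf(4, 5), 4))  # 0.4405
```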
The Poisson distribution possesses the additive property that the sum of two Poisson
random variables with parameters λ_1 and λ_2 is a Poisson random variable with parameter
λ = λ_1 + λ_2. A Poisson process for a continuous time scale can be defined analogous to a
Bernoulli process on a discrete time scale. The Poisson process refers to the occurrence of
events along a continuous time (or location) scale. The assumptions underlying the process
are:
1. The probability of an event in any short interval t to t + Δt is λΔt (proportional to the length
of the interval) for all values of t. This property is known as stationarity.
2. The probability of more than one event in any short interval t to t + Δt is negligible in comparison to λΔt.
3. The number of events in any interval of time is independent of the number of events in any
other non-overlapping interval of time.
The probability distribution of the number of events X in time t for a Poisson process is
given by

f_X(x; λt) = (λt)^x e^(-λt)/x!    λ > 0; t > 0; x = 0, 1, 2, ...    (4.20)

where f_X(x; λt) is the probability of x events in time t. Equation 4.20 is a Poisson distribution
with parameter λt. The mean and variance of f_X(x; λt) are E(X) = λt and Var(X) = λt. The
parameter λ is the average rate of occurrence of the event.
Exponential Distribution
The probability distribution of the time, T, between occurrences of the event can be found
by noting that the prob(T < t) is equal to 1 - prob(T > t). The prob(T > t) is equal to the probability of no occurrences in time t, which is f_X(0; λt) or e^(-λt). Thus

P_T(t) = 1 - e^(-λt)

which is a cumulative distribution known as the exponential distribution. The probability density
function is

p_T(t) = λe^(-λt)

and is the probability distribution of the length of the time interval between occurrences of the
event. The mean and variance of the exponential distribution are 1/λ and 1/λ², respectively.
Gamma Distribution
The probability distribution of the time to the nth occurrence can be found by noting that the
time to the nth occurrence is the sum of n independent random variables, T_1 + T_2 + ... + T_n, from
the exponential distribution. The method of derived distributions can be used with the result that
the probability density function of the time to the nth occurrence is

p_T(t) = λ^n t^(n-1) e^(-λt)/(n - 1)!

which is the gamma distribution for integer values of the parameter n. The gamma distribution
has E(T) = n/λ and Var(T) = n/λ².
Example 4.15. Barges arrive at a lock an average of 4 each hour. (a) If the arrival of barges at the
lock can be considered to follow a Poisson process, what is the probability that 6 barges will
arrive in 2 hours? (b) If the lock master has just locked through all of the barges at the lock, what
is the probability she can take a 15-minute break without another barge arriving? (c) If the operation of the lock is such that 4 barges can be locked through at once and the lock master insists
that this always be the case, what is the probability that the first barge to arrive after 4 previous
barges have been locked through will have to wait at least 1 hour before being locked through?
Solution:
(a) For this problem the rate constant is λ = 4 hr⁻¹. The probability of 6 arrivals in 2 hours
can be determined from the Poisson distribution with λt = 8:

f_X(x; λt) = f_X(6; 8) = 8^6 e^(-8)/6! = 0.1221
(b) The probability of no arrivals in 15 minutes is also from the Poisson, with λt = 4(0.25) = 1:

f_X(0; 1) = e^(-1) = 0.3679

Note that this is not the same as the probability that it will be 15 minutes until the next arrival.
The time scale is continuous so the probability that it will be exactly 15 minutes until the next
arrival is zero. We can only talk of probabilities associated with time intervals, not specific
times.
(c) The barge must wait for the arrival of 3 additional barges. The probability that the time
T_3 for 3 barges to arrive is greater than 1 hour, prob(T_3 > 1), is 1 - prob(T_3 ≤ 1).
The probability that T_3 ≤ 1 comes from the gamma distribution with n = 3 and λ = 4:

prob(T_3 ≤ 1) = ∫ from 0 to 1 of λ^3 t^2 e^(-λt)/2! dt = 1 - e^(-4)(1 + 4 + 8) = 0.762

so prob(T_3 > 1) = 1 - 0.762 = 0.238.
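The three parts of example 4.15 can be verified numerically; note that the code below evaluates part (c) through the equivalence between the gamma tail and the cumulative Poisson rather than integrating the density directly (code and names are mine):

```python
from math import exp, factorial

def poisson_pmf(x, lam_t):
    return lam_t**x * exp(-lam_t) / factorial(x)

lam = 4.0  # barges per hour

# (a) Six arrivals in 2 hours: Poisson with lam*t = 8.
a = poisson_pmf(6, lam * 2)
# (b) No arrivals in 15 minutes: Poisson with lam*t = 1.
b = poisson_pmf(0, lam * 0.25)
# (c) P(time to 3rd arrival > 1 hr) = P(fewer than 3 arrivals in 1 hr),
#     equivalent to integrating the gamma density from 1 to infinity.
c = sum(poisson_pmf(x, lam * 1) for x in range(3))
print(round(a, 4), round(b, 4), round(c, 4))  # 0.1221 0.3679 0.2381
```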
For a Poisson process, the probability that an event will occur in a short time interval t to t + Δt
is λΔt for all t. The probability that more than one event occurs in Δt is negligible. The probability
distribution of the number of events in a given time t is the Poisson distribution. The exponential
distribution describes the time between events and the gamma distribution the time to the nth event.
Example 4.16. It has been proposed that an event-based rainfall simulation model can be
constructed by modeling the occurrence of rainstorms by a Poisson process and the amount of
rain in each storm by some continuous probability distribution. In this way, the time between
rainstorms would follow an exponential distribution, the time for X rainstorms would follow a
gamma distribution, and the number of rainstorms in a time interval would follow a Poisson
distribution. Duckstein et al. (1975) and Fogel et al. (1974) used a modification of this approach.
Part of Fogel et al.'s results are shown as figure 4.2.
Fig. 4.2. Distribution of occurrences of warm season rainfall in which the areal mean of five
gages in New Orleans, Louisiana, exceeded 0.50 inches and at least one gage recorded
more than 1.0 inch. (Fogel et al. 1974).
MULTINOMIAL DISTRIBUTION
The binomial distribution can be generalized to include the probabilities of outcomes of several types rather than the two possible outcomes of the binomial. If the probabilities associated
with each of k distinct outcomes are p_1, p_2, ..., p_k, then in n independent trials the probability of X_1
outcomes of type 1, X_2 outcomes of type 2, ..., X_k outcomes of type k is given by the multinomial
distribution as

f(x_1, x_2, ..., x_k) = [n!/(x_1! x_2! ... x_k!)] p_1^(x_1) p_2^(x_2) ... p_k^(x_k)    (4.25)

where Σ_{i=1}^{k} p_i = 1 and Σ_{i=1}^{k} x_i = n.    (4.26)
Example 4.17. On a certain stream the probability that the maximum peak flow during a 1-year
period will be less than 5,000 cfs is 0.2 and the probability that it will be between 5,000 cfs and
10,000 cfs is 0.4. In a 20-year period, what is the probability of 4 peak flows less than 5,000 cfs
and 8 peak flows between 5,000 and 10,000 cfs?
Solution: To apply the multinomial distribution we define the third event as a peak flow in excess of 10,000 cfs. This event has probability 1 - 0.2 - 0.4 = 0.4. The event of a peak flow
greater than 10,000 cfs must occur 20 - 4 - 8 = 8 times. The desired probability is

[20!/(4! 8! 8!)] (0.2)^4 (0.4)^8 (0.4)^8 = 0.0429
Comment: The expected result from 20 years of flood peak data would be
E(X_1) = np_1 = 20(0.2) = 4
E(X_2) = np_2 = 20(0.4) = 8
E(X_3) = np_3 = 20(0.4) = 8
This problem demonstrates that even though the expected results are 4, 8, and 8, the probability
of this happening is very low.
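The multinomial computation above can be sketched as (my code; the function name is mine):

```python
from math import factorial, prod

def multinomial_pmf(xs, ps):
    """Eq. 4.25: n!/(x_1! ... x_k!) * p_1^x_1 ... p_k^x_k, with sum(xs) = n."""
    n = sum(xs)
    coef = factorial(n) // prod(factorial(x) for x in xs)
    return coef * prod(p**x for p, x in zip(ps, xs))

# Example 4.17: 4 peaks < 5,000 cfs, 8 between 5,000 and 10,000 cfs, 8 above, in 20 years.
# The probability that the expected outcome actually occurs is only about 0.043.
print(round(multinomial_pmf([4, 8, 8], [0.2, 0.4, 0.4]), 3))
```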
Exercises
4.1. Compute the terms of the binomial distribution with n = 10 and p = 0.2. Plot in the form
of a histogram.
4.2. Compute the terms of the cumulative binomial with n = 10 and p = 0.2. Plot the terms.
4.3. If a project is designed on a 10-year return period, what is the probability of at least 1
exceedance during the 10-year life of the project?
4.4. What design return period should be used to ensure a 95% chance that the design will not be
exceeded in a 25-year period?
4.5. Construct a curve relating the design return period to the life of a project when a 90 percent
chance of no exceedance is used.
4.6. What design return period should be used to ensure a 50% chance of no exceedance in a
10-year period?
4.7. What design return period should be used to ensure a 75% chance of no more than 1
exceedance in 10 years?
4.8. Construct an example where the Poisson is not a good approximation for the binomial.
4.9. In a certain locality contractors A, B, and C get about 50%, 25% and 25% respectively of
all water resources projects. Five contracts are coming up for bid. What is the probability that
contractor A will get all 5 jobs? What is the probability that A will get 2 jobs and B will get
2 jobs?
4.10. In 100 years the following number of floods were recorded at a specific location. Draw a
relative frequency histogram of the data. Fit a Poisson distribution to the data and plot the relative
frequencies according to the Poisson distribution on the histogram. Is the Poisson a good
approximation for the data?
No. of floods
No. of occurrences
4.11. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of 5
successive years without a flood?
4.12. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of
exactly five years between floods?
4.13. Compute the probability of at least 1 n-year event in a k-year period using (a) n = 100,
k = 20; (b) n = 500, k = 50.
4.14. Using the Poisson approximation to the binomial distribution show that the probability of
at least one occurrence of a T-year event in T years is 0.632.
4.22. For the binomial distribution show that f_X(x; n, p) = f_X(x - 1; n - 1, p) f_X(1; 1, p) +
f_X(x; n - 1, p) f_X(0; 1, p). Write out a narrative description of the meaning of this equation.
4.23. Work exercise 4.21 using the Poisson distribution to approximate the binomial.
4.24. Pool the data of exercise 4.21 so that a single estimate is obtained for p of the binomial distribution. Compute the probability of 20 rainy days in the 2-month period of July-August. Compare
this probability to the one computed in part b of exercise 4.21. Which answer would you prefer?
4.25. Using the data of exercise 4.21, what is the probability that the sixth wet day of August
occurs on August 29, 30, or 31?
4.26. Show that for the Poisson process the time for n occurrences follows the gamma distribution.
(Hint: Use the method of derived distributions to find the distribution of the time to 2 occurrences.
Using the distribution of the time to 2 occurrences, the method of derived distributions can be used
to get the time to 3 occurrences. This process can then be repeated until a pattern emerges. Induction could also be used by showing that if the time for n - 1 occurrences is given by equation 4.20
by substituting n - 1 for n, then the time for n occurrences is given by equation 4.20. Also, the time
for 1 occurrence is given by equation 4.19, which is the same as equation 4.20 with n = 1.)
5. Normal Distribution
THE MOST widely used and most important continuous probability distribution is the
Gaussian, or normal distribution. The normal distribution has been widely used because of its
early connection with the "Theory of Errors" and because it has certain useful mathematical
properties. Many statistical techniques such as analysis of variance and the testing of certain
hypotheses rely on the assumption of normality. The errors involved in incorrectly assuming
normality (purposely or unknowingly) depend on the use under consideration. Many statistical
methods derived under the assumption of normality remain approximately valid when moderate
departures from normality are present and as such are said to be robust.
The very name "normal" distribution is misleading in that it implies that random variables
that are not normally distributed are abnormal in some sense. The Central Limit Theorem indicates
the conditions under which a random variable can be expected to be normally distributed. In a
strict theoretical sense, most hydrologic variables cannot be normally distributed because the
range on any random variable that is normally distributed is the entire real line (-∞ to +∞). Thus
non-negative variables such as rainfall, streamflow, reservoir storage, and so on, cannot be strictly
normally distributed. However, if the mean of a random variable is 3 or 4 times greater than its
standard deviation, the probability of a normal random variable being less than zero is very small
and can in many cases be neglected.
GENERAL NORMAL DISTRIBUTION
The normal distribution is a 2-parameter distribution whose density function is

p_X(x) = [1/(σ(2π)^(1/2))] exp[-(x - μ)²/(2σ²)]    -∞ < x < ∞
Fig. 5.1. Normal distributions with same mean and different variances.
Fig. 5.2. Normal distributions with same variance and different means.
In examples 3.3 and 3.5 it was shown that if either the method of moments or the method of maximum likelihood is used to estimate the two parameters of this distribution, the result is θ̂_1 = μ
and θ̂_2 = σ², where μ and σ² are the mean and variance of X, respectively. For this reason the
normal distribution is generally written as N(μ, σ²).
REPRODUCTIVE PROPERTIES
If a random variable X is N(μ, σ²) and Y = a + bX, the distribution of Y can be shown to
be N(a + bμ, b²σ²). Furthermore, if X_i for i = 1, 2, ..., n, are independently and normally
distributed with mean μ_i and variance σ_i², then Y = a + b_1X_1 + b_2X_2 + ... + b_nX_n is normally
distributed with

μ_Y = a + Σ_{i=1}^{n} b_i μ_i    (5.2)

and

σ_Y² = Σ_{i=1}^{n} b_i² σ_i²    (5.3)
Any linear function of independent normal random variables is also a normal random variable.
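Equations 5.2 and 5.3 can be sketched in a few lines (illustrative code; names are mine):

```python
# Mean and variance of Y = a + b_1*X_1 + ... + b_n*X_n for independent
# normal X_i, following equations 5.2 and 5.3.
def linear_combo_params(a, bs, mus, sigmas):
    mu_y = a + sum(b * m for b, m in zip(bs, mus))
    var_y = sum(b**2 * s**2 for b, s in zip(bs, sigmas))
    return mu_y, var_y

# Sample mean of n = 4 observations from N(10, 2^2): a = 0, each b = 1/4.
mu, var = linear_combo_params(0.0, [0.25] * 4, [10.0] * 4, [2.0] * 4)
print(mu, var)  # 10.0 1.0
```

The result illustrates the familiar N(μ, σ²/n) distribution of a sample mean.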
Example 5.1. If X_i is a random observation from the distribution N(μ, σ²), what is the distribution of X̄ = Σ_{i=1}^{n} X_i/n?
Solution: X̄ is a linear function of the X_i given by X̄ = (X_1 + X_2 + ... + X_n)/n. From equations 5.2 and
5.3 and the reproductive properties of the normal distribution, X̄ is normally distributed with mean
μ_X̄ = μ and variance σ_X̄² = σ²/n.
Unfortunately, equation 5.4 cannot be evaluated analytically. Approximate methods of integration are required. If a tabulation of the integral were made, a separate table would be required
for each value of μ and σ². By using the linear transformation

Z = (X - μ)/σ

the random variable Z will be N(0, 1). This is a special case of a + bX with a = -μ/σ and b =
1/σ. The random variable Z is said to be standardized (has μ = 0 and σ² = 1) and N(0, 1) is said
to be the standard normal distribution. The standard normal distribution is given by

p_Z(z) = [1/(2π)^(1/2)] e^(-z²/2)
Figure 5.3 shows the standard normal distribution, which along with the transformation Z =
(X - μ)/σ contains all of the information shown in figures 5.1 and 5.2. Both p_Z(z) and P_Z(z) are
widely tabulated. Most tables utilize the symmetry of the normal distribution so that only positive values of Z are shown. Tables of P_Z(z) may show prob(Z < z), prob(0 < Z < z), or prob(-z
< Z < z). Care must be exercised when using normal probability tables to see what values are
tabulated. The table of P_Z(z) in the appendix gives prob(Z < z). There are many routines programmed into computer software to evaluate the normal pdf and cdf. Some approximations for
the standard normal distribution are given below.
A table of P_Z(z) shows that 68.26% of the normal distribution is within 1 standard deviation
of the mean, 95.44% within 2 standard deviations of the mean, and 99.74% within 3 standard
deviations of the mean. These are called the 1, 2, and 3 sigma bounds of the normal distribution.
The fact that only 0.26% of the area of the normal distribution lies outside the 3 sigma bounds
demonstrates that the probability of a value less than μ - 3σ is only 0.0013 and is the justification for using the normal distribution in some instances even though the random variable under
consideration may be bounded by X = 0. If μ is greater than 3σ, the chance that X is less than
zero is many times negligible (this is not always true, however).
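The sigma bounds can be computed from the error function; small differences from the tabled values above (e.g., 99.73 vs. 99.74%) reflect rounding in the tables (the code is my sketch):

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """P(Z < z) for the standard normal, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Probability within k standard deviations of the mean (the k-sigma bounds).
for k in (1, 2, 3):
    inside = std_normal_cdf(k) - std_normal_cdf(-k)
    print(k, round(100 * inside, 2))  # 68.27, 95.45, 99.73
```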
Example 5.2. Compare the 1, 2, and 3 sigma bounds under the assumption of normality and
under no distributional assumptions using Chebyshev's inequality.
Solution: The 1, 2, and 3 sigma bounds of N(μ, σ²) contain 68.26, 95.44, and 99.74% of the
distribution. Thus, the probability that X deviates more than σ, 2σ, and 3σ from μ is 0.3174,
0.0456, and 0.0026, respectively.
Chebyshev's inequality states that the prob(|X - μ| > kσ) ≤ 1/k². This corresponds to a
probability that X deviates more than σ, 2σ, and 3σ from μ of less than 1.00, less than 0.25, and
less than 0.11, respectively.
Comment: By making no distributional assumptions, we are forced to make very conservative
probability statements. It is emphasized that Chebyshev's inequality gives an upper bound to the
probability and not the probability itself.
Example 5.3. As an example of using tables of the normal distribution consider a sample drawn
from an N(15,25). What is the prob(15.6 < X < 20.4)?
Solution: The desired probability could be evaluated from

prob(15.6 < X < 20.4) = ∫ from 15.6 to 20.4 of [1/(5(2π)^(1/2))] exp[-(x - 15)²/50] dx

However, this integral is difficult to evaluate. Making use of the standard normal distribution, we
can transform the limits on X to limits on Z and then use standard normal tables.
x = 15.6 transforms to z = (15.6 - 15.0)/5 = 0.12
x = 20.4 transforms to z = (20.4 - 15.0)/5 = 1.08

The desired probability is prob(0.12 < Z < 1.08).
From the standard normal table P_Z(1.08) = 0.860 and P_Z(0.12) = 0.548. The desired probability is 0.860 - 0.548, or 0.312.
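The same probability can be computed with the error function instead of tables (my sketch; names are mine):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Example 5.3: X ~ N(15, 25), prob(15.6 < X < 20.4).
p = normal_cdf(20.4, 15, 5) - normal_cdf(15.6, 15, 5)
print(round(p, 3))  # 0.312
```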
Let y = -ln(2p). For 0.005 < P_Z(z) < 0.5, an approximation for z can be expressed in terms of y. Conversely, for z > 0, P_Z(z) can be approximated by

P_Z(z) = 1 - 0.5 exp[-((83z + 351)z + 562)/(703/z + 165)]

Of course, for negative values of z, P_Z(z) for the absolute value of z can be obtained and then
P_Z(z) = 1 - P_Z(|z|).
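This approximation can be checked against an erf-based cdf (my sketch; function names are mine):

```python
from math import erf, exp, sqrt

def pz_approx(z: float) -> float:
    """Approximate P(Z < z) using the rational-exponential formula in the text (z > 0)."""
    return 1 - 0.5 * exp(-((83 * z + 351) * z + 562) / (703 / z + 165))

def pz_exact(z: float) -> float:
    return 0.5 * (1 + erf(z / sqrt(2)))

# The two agree to within about a unit in the fourth decimal place.
for z in (0.5, 1.08, 2.0, 3.0):
    print(z, round(pz_approx(z), 5), round(pz_exact(z), 5))
```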
Example 5.4. Use a normal approximation to determine prob(10.5 < X < 20.4) if X is distributed N(15, 25).
Solution: The limits transform to z = (10.5 - 15)/5 = -0.9 and z = (20.4 - 15)/5 = 1.08. Using the approximation for P_Z(z),

P_Z(1.08) = 1 - 0.5 exp[-((83(1.08) + 351)(1.08) + 562)/(703/1.08 + 165)] = 0.85987

so that prob(0 < Z < 1.08) = 0.85987 - 0.50000 = 0.360.
Similarly, P_Z(0.9) = 0.816 so that prob(0 < Z < 0.9) = 0.816 - 0.50000 = 0.316. The desired probability is

prob(-0.9 < Z < 1.08) = 0.316 + 0.360 = 0.676
Comment: Often, in solving problems of this type, it is useful to sketch a normal distribution and
then shade in the area corresponding to the desired probability. For this problem the sketch would
be as in figure 5.4.
Example 5.5. Repeat example 3.7 assuming the Kentucky River data is normally distributed.
Solution: If X is N(μ, 22,322²), then

Z = (X̄ - μ)/(22,322/√n)

is N(0, 1).
From the standard normal table it is seen that 95% of the normal distribution is enclosed by
-1.96 < Z < 1.96. From this n is calculated as

n = [1.96(22,322)/10,000]² = 19.1

or at least 19 observations are required to be 95% sure that X̄ is within 10,000 cfs of μ if X is
N(μ, 22,322²).
Comment: By assuming normality, the required minimum number of observations has been
reduced from 100 to 19. The Law of Large Numbers has placed a lower limit on n without knowledge of the distribution of X. The price for this ignorance of the distribution of X is seen to be
very great if in fact X is normally distributed.
In practice, if the X_i are identically and independently distributed, n does not have to be very
large for S_n to be approximated by a normal distribution. If interest lies in the central part of the
distribution of S_n, values of n as small as 5 or 6 will result in the normal distribution producing
reasonable approximations to the true distribution of S_n. If interest lies in the tails of the
distribution of S_n, as it often does in hydrology, larger values of n may be required.
As stated above, the Central Limit Theorem is of limited value in hydrology since most
hydrologic variables are not the sum of a large number of independently and identically distributed random variables. Fortunately, under some very general conditions it can be shown that if X_i
for i = 1, 2, ..., n is a random variable independent of X_j for j ≠ i and E(X_i) = μ_i and Var(X_i)
= σ_i², then the sum S_n = X_1 + X_2 + ... + X_n approaches a normal distribution with E(S_n) =
Σ_{i=1}^{n} μ_i and Var(S_n) = Σ_{i=1}^{n} σ_i² as n approaches infinity (Thomas 1971). One condition for this
generalized Central Limit Theorem is that each X_i has a negligible effect on the distribution of S_n
(i.e., there cannot be one or two dominating X_i's).
NORlMAL DISTRIBUTION
107
This general theorem is very useful in that it says that if a hydrologic random variable is the
sum of n independent effects and n is relatively large, the distribution of the variable will be approximately normal. Again, how large n must be depends on the area of interest (central part or
tail of the distribution) and on how good an approximation is needed.
Example 5.6. In the last chapter the gamma distribution for integer values of n was derived
as the sum of n exponentially distributed random variables. The mean and variance of the exponential distribution are given as 1/λ and 1/λ², respectively. The Central Limit Theorem
gives the mean and variance of the sum of n values from the exponential distribution as n/λ
and n/λ² for large n. This agrees with the mean and variance of the gamma distribution.
In chapter 6, the coefficient of skew of the gamma distribution is given as 2/√n, which
approaches zero as n gets large. Thus, the sum of n random variables from an exponential distribution is a gamma distribution which approaches a normal distribution (with γ
approaching 0) as n gets large.
because the mean of the data is 66,540 cfs and the standard deviation is 22,322 cfs. This integral
is easily evaluated using standard normal tables as 0.0322.
An approximation to the relative frequency in a class interval can also be made by using
equation 2.25b.
Table 5.1. Expected relative frequencies according to the normal distribution for the Kentucky River data
(columns: class mark x_i, z_i, P_Z(z_i), expected relative frequency, observed relative frequency)

Expected relative frequencies: 0.0316, 0.0659, 0.1122, 0.1564, 0.1783, 0.1663, 0.1270, 0.0793,
0.0405, 0.0169. Sum: 0.9744.

For the first class boundary, z_i = (25,000 - 66,540)/22,322 = -1.8609.
Similar calculations for each of the class intervals are shown in table 5.1, with the results plotted
in figure 5.5. The sum of the expected relative frequencies is not 1 because the entire range of the
normal distribution was not covered.
Fig. 5.5. Comparison of normal distribution with the observed distribution, Kentucky River
peak flows (abscissa: peak flow, 1,000 cfs).
The procedure of integrating p_X(x) over each class interval or of using equation 2.25b can
be used for any continuous probability distribution to get the expected relative frequencies for
that distribution.
Table 5.2. Corrections for approximating a discrete random variable by a continuous random variable

Discrete            Continuous
prob(X = a)         prob(a − 1/2 < X < a + 1/2)
prob(X ≤ a)         prob(X < a + 1/2)
prob(X < a)         prob(X < a − 1/2)
prob(X ≥ a)         prob(X > a − 1/2)
prob(X > a)         prob(X > a + 1/2)
distribution approximates the binomial distribution if n is large. Thus, as n gets large the distribution of

Z = (X − np)/√(np(1 − p))

approaches a N(0, 1). This is sometimes known as the DeMoivre-Laplace limit theorem (Mood et al. 1974).
Example 5.7. X is a binomial random variable with n = 25 and p = 0.3. Compare the binomial and normal approximation to the binomial for evaluating the prob(5 < X ≤ 8).

Solution: Using the binomial distribution this is equivalent to

prob(5 < X ≤ 8) = Σ from x=6 to 8 of (25 choose x)(0.3)^x (0.7)^(25−x) = 0.483

Using the normal approximation, the probability is determined as prob(5.5 < X < 8.5), which is 0.476. Therefore, the exact probability of 0.483 is approximated by the normal to be 0.476 for an n of 25.
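These numbers can be reproduced with a short script (a sketch using only standard-library functions; the helper names are mine):

```python
from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    # Binomial probability of exactly k successes in n trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def norm_cdf(x, mu, sigma):
    # Normal cumulative distribution evaluated through the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 25, 0.3
exact = sum(binom_pmf(k, n, p) for k in range(6, 9))   # prob(5 < X <= 8)
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = norm_cdf(8.5, mu, sigma) - norm_cdf(5.5, mu, sigma)  # continuity correction
```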
Negative Binomial Distribution
Following reasoning similar to that given for the binomial distribution, the negative binomial distribution with large k can be approximated by a normal distribution. In the case of the negative binomial, the distribution of

Z = (X − k/p)/√(k(1 − p)/p²)

approaches a N(0, 1) for large k.
This compares favorably with the 0.0206 computed using the negative binomial.
Poisson Distribution
The sum of two Poisson random variables with parameters λ1 and λ2 is also a Poisson random variable with parameter λ = λ1 + λ2. Extending this to the sum of a large number of Poisson random variables, the Central Limit Theorem indicates that for large λ, the Poisson may be approximated by a normal distribution. In this case the distribution of

Z = (X − λ)/√λ
approaches an N(0, 1). Since the Poisson is the limiting form of the binomial and the binomial
can be approximated by the normal, it is no surprise that the Poisson can also be approximated by
the normal.
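As an illustration of this approximation (a sketch; λ = 100 and the evaluation point are arbitrary choices of mine), the Poisson cumulative probability can be compared with the continuity-corrected normal value:

```python
from math import exp, erf, sqrt

def poisson_cdf(k, lam):
    # P(X <= k) by direct summation of the Poisson probability terms
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

lam = 100.0
exact = poisson_cdf(90, lam)             # Poisson prob(X <= 90)
approx = norm_cdf(90.5, lam, sqrt(lam))  # N(lam, lam) with continuity correction
```

The two values agree to about three decimal places at this λ, consistent with the discussion above.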
Continuous Distributions
Many continuous distributions can be approximated by the normal distribution for certain
values of their parameters. For instance, in example 5.6, it was shown that for large n the gamma
distribution approaches the normal distribution. To make these approximations one merely
equates the mean and variance of the distribution to be approximated to the mean and variance of
the normal and then uses the fact that

Z = (X − μ)/σ

is N(0, 1) if X is N(μ, σ²). Not all continuous distributions can be approximated by the normal, and for those that can, the approximation is only valid for certain parameter values. Things to look for are parameters that produce near zero skew, symmetry, and tails that asymptotically approach p_X(x) = 0 as X approaches large and small values. Again, it is emphasized that approximations in the tails of the distributions may not be as good as in the central region of the distribution.
Exercises
5.1. Consider sampling from a normal distribution with a mean of 0 and a variance of 1. What is
the probability of selecting (a) an observation between 0.5 and 1.5? (b) an observation outside the
interval -0.5 to +0.5? (c) 3 observations inside and 2 observations outside the interval of 0.5 and
1.5? (d) 4 observations inside the interval 0.5 to 1.5 exactly two of which are not in the interval
−0.5 to 1.0?
5.2. What is the probability of selecting an observation at random from an N(100,2500) that is
(a) less than 75? (b) equal to 75?
5.3. For the Kentucky River data of table 2.1, what is the probability of a peak flow exceeding
100,000 cfs if the peaks are assumed to be normally distributed?
5.4. Construct the theoretical distribution for the data of exercise 2.2 if it is assumed that the data
are normally distributed. From a visual comparison with the data histogram, would you say the
data are normally distributed?
5.5. Work exercise 4.1 using the normal approximation to the binomial and plot the results on the
histogram developed for exercise 4.1.
5.6. Show that if X is N(μ, σ²) then Y = a + bX is N(a + bμ, b²σ²).
5.7. For a particular set of data the coefficient of variation is 0.4. If the data are normally distributed, what percent of the data will be less than 0.0?
5.8. A sample of 150 observations has a mean of 10,000, a standard deviation of 2,500 and is
normally distributed. Plot a frequency histogram showing the number of observations expected
in each interval.
5.9. The appendix contains a listing of the annual runoff from Cave Creek watershed near Fort
Spring, Kentucky. What is the probability that the true mean annual runoff is less than 14.0 in. if
one can assume the true variance is 22.56 in.²? What other assumptions are needed?
5.10. Random digits are the numbers 0, 1, 2, ..., 9 selected in such a fashion that each is equally
likely (i.e., has probability 1/10 of being selected). An experiment is performed by selecting 5
random digits, adding them together and calling their sum X. The experiment is repeated 10
times and X is calculated. What is the probability that X is less than 21.5? (Exercise 13.9 requires
that this experiment be carried out.)
5.11. Plot the individual terms of the Poisson distribution for λ = 2. Approximate the Poisson
by the normal and plot the normal approximations on the same graph.
5.12. Repeat exercise 5.11 for λ = 9.
5.13. Assume the data of exercise 4.21 are normally distributed. (a) Within each month what is
the probability of 10 or more rainy days? (b) What is the probability of 20 or more rainy days in
the July-August period? (c) What is the difference in assuming the data are normally distributed, and in assuming the data are binomially distributed and approximating the binomial with
the normal?
5.14. Plot the observed frequency histogram and the frequency histogram expected from the normal distribution for the annual peak flows for the following rivers. Discuss how well the normal
approximates the data in terms of the coefficient of variation and skewness. (Note: data are in the
appendix or may be obtained from the Internet).
a) North Llano River near Junction, Texas
Month    No. of days        Month    No. of days
Jan.     2                  July     7
Feb.     2                  Aug.     7
Mar.     2                  Sept.    3
Apr.     1                  Oct.     2
May      0                  Nov.     2
June     2                  Dec.     2
                            Total    32
5.17. An experimenter is measuring the water level in an experimental towing channel. Because
of waves and surges, a single measurement of the water level is known to be inaccurate. Past
experience indicates the variance of these measurements is 0.0025 ft2. How many independent
observations are required to be 90% confident that the mean of all the measurements will be
within 0.02 feet of the true water level?
5.18. At a certain location the annual precipitation is approximately normally distributed with
a mean of 45 in. and a standard deviation of 15 in. Annual runoff can be approximated by
R = -7.5 + 0.5P where R is annual runoff and P is annual precipitation. What is the mean and
variance of annual runoff? What is the probability that the annual runoff will exceed 20 in.?
5.19. Plot a frequency distribution for a mixture of two normal distributions. Use as the first
distribution an N(0, 1) and as the second an N(l, 1). Use as values for the mixing parameter 0.2,
0.5, and 0.8.
6. Continuous Probability
Distributions
THERE ARE many continuous probability distributions in addition to the normal distribution. This chapter covers some of these distributions, methods for estimating their parameters,
properties of the distributions, and potential applications for them. Further discussion on distribution selection is contained in chapter 7. Other books may be consulted for more detailed treatment
of the various distributions (Kececioglu, 1991). Rao and Hamed (2000) is particularly applicable
to hydrology.
UNIFORM DISTRIBUTION
If a continuous random process is defined over an interval α to β and the probability of an outcome of this process being in a subinterval of α to β is proportional to the length of the subinterval, the process is said to be uniformly distributed over the interval α to β (figure 6.1). The probability density function for the continuous uniform distribution is

p_X(x) = 1/(β − α)    α ≤ x ≤ β
CONTINUOUS DISTRIBUTIONS
115
The skewness is zero since the distribution is symmetrical about the mean. The method of moments yields the following estimators for the parameters α and β:

α̂ = x̄ − √(3s²)    β̂ = x̄ + √(3s²)

The method of maximum likelihood when applied to the uniform distribution results in the estimators for α and β being the smallest and largest sample values respectively. That this is the case can be seen by writing out the likelihood function and then selecting those values of α and β (within the constraints that α < x < β for all x) that maximize the function.

The uniform distribution finds its greatest application as the distribution of P_X(x) for all probability density functions. That is, the prob(P_X(X) ≤ y) is uniformly distributed over the interval 0 < y < 1 for any continuous probability distribution. This fact is used in generating random observations from some probability distributions.
Example 6.1. Use the method of moments to estimate the parameters of the uniform distribution
based on the following sample: 1, 4, 3, 4, 5, 6, 7, 6, 9, 5. What are the maximum likelihood
estimators for this sample?
116
CHAPTER 6
By maximum likelihood, α̂ = 1 and β̂ = 9, the smallest and largest sample values.
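Example 6.1 can be scripted as below (a sketch; the moment estimates use the n-divisor form of the sample variance, so values computed with the n − 1 form will differ slightly):

```python
from math import sqrt
from statistics import mean, pvariance

sample = [1, 4, 3, 4, 5, 6, 7, 6, 9, 5]
xbar, s2 = mean(sample), pvariance(sample)

# Method of moments: match xbar = (alpha + beta)/2 and s2 = (beta - alpha)^2/12
half_width = sqrt(3 * s2)
a_mom, b_mom = xbar - half_width, xbar + half_width

# Maximum likelihood: the smallest and largest sample values
a_mle, b_mle = min(sample), max(sample)
```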
TRIANGULAR DISTRIBUTION
The triangular distribution shown in Figure 6.2 is given by

p_X(x) = 2(x − α)/[(β − α)(δ − α)]    α ≤ x ≤ δ
p_X(x) = 2(β − x)/[(β − α)(β − δ)]    δ < x ≤ β
It is unlikely that any natural hydrologic process would exactly follow a triangular distribution.
The distribution may be a reasonable approximation to the actual but unknown distribution of
some hydrologic quantities. The triangular distribution has been used in simulation studies
involving bounded random variables whose central tendencies are known.
The mean, variance, and coefficient of skew of the triangular distribution are

E(X) = (α + β + δ)/3

Var(X) = (α² + β² + δ² − αβ − αδ − βδ)/18

γ = √2 (α + β − 2δ)(2α − β − δ)(α − 2β + δ) / [5(α² + β² + δ² − αβ − αδ − βδ)^(3/2)]
Fig. 6.2. Triangular distribution (here γ is the δ of equation 6.6).
The parameter δ gives the mode of the triangular distribution. If δ is known, the parameters α and β may be estimated by the method of moments from the sample mean x̄ and variance s². Equating moments gives α̂ + β̂ = 3x̄ − δ and α̂β̂ = [(α̂ + β̂)² − δ(α̂ + β̂) + δ² − 18s²]/3, so that α̂ and β̂ are the roots of the corresponding quadratic.
Some special cases of the triangular distribution yield the following estimators:
EXPONENTIAL DISTRIBUTION

The exponential distribution has the probability density function

p_X(x) = λe^(−λx)    x > 0, λ > 0
The coefficient of skew is a constant, 2, indicating the exponential is skewed to the right for all values of λ. The curve labeled η = 1 in figure 6.4 is an exponential distribution with λ = 1. Examples 3.2 and 3.4 demonstrated that when either the method of moments or maximum likelihood is used for parameter estimation, the result is λ̂ = 1/x̄.
Example 6.2. Fit an exponential distribution to the following data on the number of surface depressions in various area classes.

Area class midpoint (acres)    No. of depressions
0.25                           106
0.75                           36
1.25                           18
1.75                           9
2.25                           12
2.75                           2
3.25                           5
3.75                           1
4.25                           4
4.75                           5
5.25                           2
5.75                           6
6.25                           3
6.75                           1
7.25                           1
7.75                           1
Total                          212
Solution: The relative frequencies are computed by dividing the number of depressions in each class by the total number of depressions. The best fitting exponential is estimated by using equation 6.15 to estimate the exponential parameter λ. x̄ is calculated from equation 3.16 as 1.27 acres. Then λ̂ = 1/x̄ = 0.787. The expected relative frequency in each class is then calculated from equation 2.25b as

f(x_i) = Δx p_A(x_i)

where x_i is the midpoint of the class interval, Δx = 1/2, and p_A(x_i) is the exponential distribution of area given by

p_A(x_i) = λ̂e^(−λ̂x_i)

Therefore

f(x_i) = (1/2)(0.787)e^(−0.787x_i)

For example, for the second class interval

f(0.75) = (0.393)e^(−0.787(0.75)) = 0.22
The observed fraction of depressions with areas in excess of 2.25 acres is 31/212, or 0.146.
Fig. 6.3. Observed and expected (according to the exponential distribution) number of depressions in various size categories for example 6.2 (x-axis: area class midpoints, 0.25 to 7.75 acres).
GAMMA DISTRIBUTION
The distribution of the sum of n exponentially distributed random variables each with parameter λ is a gamma distribution with parameters η = n and λ. In general, η does not have to be an integer. A comprehensive treatment of the gamma distribution and other distributions in the gamma family of distributions is given by Bobee and Ashkar (1991). The gamma density function is given by

p_X(x) = λ^η x^(η−1) e^(−λx) / Γ(η)    x > 0; η, λ > 0

where Γ(η) is the gamma function, with

Γ(η) = (η − 1)!    for η = 1, 2, 3, ...
Γ(η + 1) = ηΓ(η)    for η > 0
Γ(η) = ∫ from 0 to ∞ of t^(η−1) e^(−t) dt    for η > 0
The mean, variance, and coefficient of skew for the gamma distribution are

E(X) = η/λ    Var(X) = η/λ²    γ = 2/√η

The gamma distribution is positively skewed with γ decreasing as η increases. Plots of the distribution for various values of η and λ are shown in figure 6.4. A wide variety of shapes ranging from reverse J-shaped for η < 1 to single peaked with the peak (mode) at x = (η − 1)/λ for η > 1 can be produced by the gamma density function. Changing λ and holding η constant changes the scale of the distribution, whereas changing η and holding λ constant changes the shape of the distribution. Thus, λ and η are sometimes known as scale and shape parameters.

The cumulative gamma distribution is

P_X(x) = ∫ from 0 to x of λ^η t^(η−1) e^(−λt) / Γ(η) dt

Some computer spreadsheets will evaluate P_X(x) for the gamma distribution.
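Where a spreadsheet function is not available, P_X(x) can be evaluated directly. The routine below is a sketch (my own implementation, not from the text) of the regularized lower incomplete gamma function computed from its power series:

```python
from math import exp, log, lgamma

def gamma_cdf(x, eta, lam):
    # P_X(x) for a gamma distribution with shape eta and the book's scale
    # parameter lam, via the power series of the regularized lower
    # incomplete gamma function evaluated at t = lam * x.
    if x <= 0:
        return 0.0
    t = lam * x
    term = 1.0 / eta
    total = term
    k = 1
    while term > 1e-14 * total:
        term *= t / (eta + k)
        total += term
        k += 1
    return total * exp(eta * log(t) - t - lgamma(eta))
```

For η = 1 the routine reduces to the exponential cdf 1 − e^(−λx), and with η = 5.922 and λ = 0.404 it reproduces the exceedance probability of about 0.176 found in example 6.3.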
Fig. 6.4. Gamma probability density functions for various values of η and λ.
where x_g is the sample geometric mean and ψ(x) = d ln Γ(x)/dx is the psi-function. Thom (1958) has proposed an approximate relationship based on the truncation of a series expansion of the maximum likelihood estimator for η given by

η̂ = (1 + √(1 + 4y/3)) / (4y) + Δη̂    (6.24)

Table 6.1. Correction factor for the maximum likelihood estimator for the parameter η of the gamma distribution

where y is ln x̄ minus the mean of the natural logarithms of the observations, and Δη̂ is a correction term arising because of the truncation. Table 6.1 contains the values of Δη̂ for η̂ ranging from 0.2 to 5.6. For η̂ > 5.6 the correction is negligible (as it is anyway for many practical situations regardless of the value of η̂). The procedure for finding the correction factor is to assume that η̂ is equal to the first term of equation 6.24 and use the Δη̂ from table 6.1 corresponding to that η̂.
Thom (1958) states that for η < 10 the method of moments produces unacceptable estimates for both λ and η. For η near 1 the method of moments uses only 50% of the sample information for estimating λ and only 40% for η. This means the maximum likelihood estimators would do as well with one half the number of observations.
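Thom's first approximation (the leading term of equation 6.24, without the table 6.1 correction) is simple to apply. The sketch below checks it on simulated gamma data; the true shape value, sample size, and function name are choices of mine.

```python
import random
from math import log, sqrt

def thom_eta(data):
    # Leading term of Thom's (1958) estimator:
    #   y = ln(xbar) - mean(ln x),  eta_hat = (1 + sqrt(1 + 4y/3)) / (4y)
    n = len(data)
    xbar = sum(data) / n
    y = log(xbar) - sum(log(x) for x in data) / n
    return (1 + sqrt(1 + 4 * y / 3)) / (4 * y)

random.seed(7)
data = [random.gammavariate(5.0, 2.0) for _ in range(5000)]  # true eta = 5
eta_hat = thom_eta(data)
```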
Greenwood and Durand (1960) present rational fraction approximations for the maximum likelihood estimators, equations 6.26 and 6.27, each applying over a different range of y.
λ is then estimated from equation 6.25. Greenwood and Durand (1960) state that the maximum error in equation 6.26 is 0.0088% and in equation 6.27 is 0.0054%.
Equations 6.24-6.27 produce estimates for η and λ that have a slight asymptotic bias. For small samples the bias may be appreciable (Shenton and Bowman 1970). Bowman and Shenton (1968) present the following approximate relationship for estimating the bias in the parameter η when equations 6.24-6.27 are used:

E(η̂ − η) = (3η − 0.677 + 0.111/η + 0.032/η²) / (n − 3)    for n ≥ 4 and η ≥ 1    (6.29)
where E(η̂ − η) is the bias in η̂ with error of less than 1.4%. The result of using this relationship for estimating the bias in η̂ for a sample of size n from a gamma distribution having a population parameter of η = 2 is shown in figure 6.5. In practice, equation 6.29 can be used to correct η̂ for bias. If the population η were known, there would of course be no need for estimating η.

Bowman and Shenton (1968) suggest that the bias in η̂ can be approximated from

E(η̂) ≈ ηn/(n − 3)    (6.30)

which yields the corrected estimate η ≈ η̂(n − 3)/n.
The gamma distribution has been widely used in hydrology (Bobee and Ashkar 1991). Rainfall probabilities for durations of days, weeks, months, and years have been estimated by the gamma distribution (Barger and Thom 1949; Barger, Shaw and Dale 1959; Friedman and Janes 1957; Mooley and Crutcher 1968). Annual runoff (Markovic 1965) has been described by the gamma distribution.

Fig. 6.5. Bias in η̂ for samples of size n from a gamma distribution with population parameter η = 2 (x-axis: sample size n, 10 to 100).
Example 6.3. The annual water yield for Cave Creek near Fort Spring, Kentucky (USGS #
03288500) is shown in the following table. Estimate the parameters of the gamma distribution for
this data using both the method of moments and the method of maximum likelihood. Assuming
the data follows a gamma distribution, estimate the probability of an annual water yield exceeding 20.00 inches.
(The table of annual runoff in inches by year for Cave Creek is given in the appendix.)
Thus, the maximum likelihood estimators are λ̂ = 0.485 and η̂ = 7.107. These estimates may be corrected for bias using either equation 6.29 or 6.30. If 6.30 is used

E(η̂ − η) = E(η̂) − E(η) = 7.107 − 5.922 = 1.185

If η = 5.922 is substituted into equation 6.29, the result is E(η̂ − η) = 1.141, which is in good agreement with the 1.185 produced by equation 6.30. The final estimate for η is now η̂ = 5.922 and λ̂ = η̂/x̄ = 0.404. Using the method of moments the parameter estimates are η̂ = 9.513 and λ̂ = 0.649, whereas the maximum likelihood estimates are η̂ = 5.922 and λ̂ = 0.404. Following the recommendation of Thom (1958), the latter estimates will be used in estimating the probability of an annual water yield in excess of 20.00 inches.
Thus 1 − P_X(20.00) is 0.176, which is the desired probability. The prob(yield > 20.00) = 0.176 if the annual water yield follows a gamma distribution with parameters η = 5.922 and λ = 0.404. In these calculations Microsoft Excel 97 was used to evaluate the gamma distribution.
Comment: If the moment parameter estimates had been used, the resulting probability would have been 0.132, which is reasonably close to 0.176. This is because η is reasonably close to the 10.00 that Thom (1958) suggested is the smallest value of η for which the method of moments results in good parameter estimates. For this data C_s = 2/√5.922 = 0.82, so that the distribution is moderately skewed to the right. If the normal distribution had been used to estimate prob(X > 20.00), the result would have been 0.126, which again is a reasonable approximation. However, if the annual water yield with a return period of 100 years or a 1% chance of being exceeded is evaluated by the gamma with η = 5.922 and λ = 0.404 and by the normal with μ = 14.56 and σ = 4.75, the results are 32.2 inches and 25.6 inches, again showing the sensitivity of estimates of rare events to the distributional assumption even though in the main body of the distribution the agreement is good.
Generally, 18 observations are not enough to make reliable probability estimates or to
determine the proper probability distribution to use. It is a small enough number that one can follow through all of the needed calculations for this example in a short time on a desk calculator,
however. The fact that the gamma and normal estimates differ greatly for this data at large return
periods does not mean the gamma (or the normal) is a better approximation for the data. This
question will be taken up later. Exercise 6.21 should be consulted for another approximate solution to this example.
LOGNORMAL DISTRIBUTION
The Central Limit Theorem was used in deriving the general result that if a random variable X is made up of the sum of many small effects, then X might be expected to be normally distributed. Similarly, if X is equal to the product of many small effects, that is if X = X_1 X_2 ... X_n, then the logarithm of X, ln X, can be expected to be normally distributed. This can be seen by letting Y = ln X so that Y = ln(X_1 X_2 ... X_n) = ln X_1 + ln X_2 + ... + ln X_n. Because the X_i are random variables, the ln X_i are also random variables and Y = ln X is a random variable made up from the sum of many other random variables. From the Central Limit Theorem, Y can be expected to be normally distributed with mean μ_y and variance σ_y².
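This argument can be illustrated by simulation (a sketch; the uniform factors and counts are arbitrary choices of mine): the logarithm of a product of many positive random effects shows nearly zero skew.

```python
import random
from math import log
from statistics import mean, stdev

random.seed(5)

def product_variable(m=100):
    # X formed as the product of m small positive multiplicative effects
    x = 1.0
    for _ in range(m):
        x *= random.uniform(0.5, 1.5)
    return x

logs = [log(product_variable()) for _ in range(4000)]
mu, sd = mean(logs), stdev(logs)
skew = mean(((v - mu) / sd) ** 3 for v in logs)   # near 0 if ln X is normal
```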
Because Y = ln X

p_Y(y) = [1/(σ_y √(2π))] exp[−(y − μ_y)²/(2σ_y²)]    −∞ < y < ∞    (6.31)

and

p_X(x) = [1/(x σ_y √(2π))] exp[−(ln x − μ_y)²/(2σ_y²)]    x > 0    (6.32)

Note that equation 6.31 gives the distribution of Y as a normal distribution with mean μ_y and variance σ_y². Equation 6.32 gives the distribution of X as the lognormal distribution with parameters μ_y and σ_y². Y = ln X is normally distributed while X is lognormally distributed.
The parameters μ_y and σ_y² can be estimated by first transforming all of the x_i's to y_i's by

y_i = ln x_i

then

ȳ = Σ y_i / n

and

s_y² = Σ (y_i − ȳ)² / (n − 1)

with all of the summations from 1 to n. If a digital computer is used the above equations are easily applied. ȳ and s_y² may be determined without taking the logarithms of all of the data from

s_y² = ln(C_v² + 1)    and    ȳ = ln x̄ − s_y²/2

where C_v is the coefficient of variation of the original data (C_v = S_x/x̄). These relationships are not general results but depend on the data being lognormally distributed.
The mean, variance, and coefficient of variation of the lognormal distribution are

μ_x = exp(μ_y + σ_y²/2)

σ_x² = μ_x²[exp(σ_y²) − 1]

C_v = [exp(σ_y²) − 1]^(1/2)

Thus, the lognormal distribution is positively skewed with the skew decreasing as the coefficient of variation decreases. Based on the properties of the normal distribution, the skewness of the logarithms of lognormal data is zero.
Tables of the standard normal distribution can be used to evaluate the lognormal distribution. From equation 6.32 we have p_X(x) = p_Y(y)/x. But p_Y(y) is a normal density function. From equation 5.7, p_Y(y) = p_Z(z)/s_y, or

p_X(x) = p_Z(z)/(x s_y)    with z = (ln x − ȳ)/s_y

Therefore, standard normal tables can be used with the proper transformations to evaluate p_X(x) and P_X(x) for the lognormal distribution.
Certain reproductive properties of the lognormal follow directly from the reproductive properties of the normal distribution. For example, if X is lognormally distributed then Y = aX^b is lognormally distributed with

μ_ln Y = ln a + b μ_ln X    (6.43)

and

σ²_ln Y = b² σ²_ln X    (6.44)

Two special cases of the above are if Z = XY and Z = X/Y with X and Y being independently and lognormally distributed, then Z is lognormally distributed with its mean and variance easily determined from equations 6.43 and 6.44.
Because of its simplicity, its ready availability in tables for its evaluation, and the fact that
many hydrologic variables are bounded by zero on the left and positively skewed, the lognormal
distribution has received wide usage in hydrology.
Example 6.4. Use the lognormal distribution and calculate the expected relative frequency for
the third class interval of the data in table 5.1.
Solution: The expected relative frequency according to the lognormal distribution is
The evaluation of p_X(x) from equation 6.41 requires estimates for μ_y and σ_y. These are estimated from equations 6.35 and 6.36.
or the expected relative frequency in the interval 40,000 to 50,000 according to the lognormal
distribution is
Example 6.5. Assume the data of table 5.1 follow the lognormal distribution. Calculate the
magnitude of the 100-year peak flow.
Solution: The 100-year peak flow corresponds to a prob(X > x) of 0.01. X must be evaluated such that P_X(x) = 0.99. This can be accomplished by evaluating Z such that P_Z(z) = 0.99 and then transforming to X. From the standard normal tables the value of Z corresponding to P_Z(z) of 0.99 is 2.326. From equation 6.37

y = ȳ + 2.326 s_y = 0.760 + 11.0524 = 11.812

so that the 100-year peak flow is x = e^11.812, or approximately 135,000 cfs.
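The calculation can be scripted (a sketch: ȳ = 11.0524 is the example's value, while s_y ≈ 0.327 is back-calculated here from the example's result rather than quoted from the text):

```python
from math import exp
from statistics import NormalDist

ybar, s_y = 11.0524, 0.327        # moments of ln(peak flow); s_y inferred
z = NormalDist().inv_cdf(0.99)    # standard normal value for P_Z(z) = 0.99
y100 = ybar + z * s_y
x100 = exp(y100)                  # 100-year peak flow in cfs
```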
Consider a random sample of size n consisting of x_1, x_2, ..., x_n. Let Y be the largest of the sample values. Let P_Y(y) be the prob(Y ≤ y) and P_Xi(x) be the prob(X_i ≤ x). Let p_Y(y) and p_Xi(x) be the corresponding probability density functions. P_Y(y) = prob(Y ≤ y) = prob(all of the x's ≤ y). If the x's are independently and identically distributed we have

P_Y(y) = [P_X(y)]^n    (6.45)
Therefore, the probability that the maximum interrain time will be greater than 8 is 1 − 0.271 = 0.729.
Comment: The probability density function for the maximum interrain time is, from equation 6.46,

p_Y(y) = n[1 − e^(−λy)]^(n−1) λe^(−λy)

This distribution is plotted in figure 6.6 for various values of n. Note that for even moderately large n, the probability is very high that the extreme value (longest interrain time) will be from the tail of the parent (exponential) distribution.
Frequently the parent distribution from which the extreme is an observation is not known
and cannot be determined. If the sample size is large, use can be made of certain general asymptotic results that depend on limited assumptions concerning the parent distribution to find the
Fig. 6.6. Distribution of the largest sample value from a sample of size n from an exponential
distribution.
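The result P_Y(y) = [P_X(y)]^n is easy to verify by simulation for an exponential parent (a sketch; λ, n, and the evaluation point are arbitrary choices of mine):

```python
import random
from math import exp

random.seed(3)
lam, n, y, reps = 1.0, 20, 4.0, 20000
analytic = (1 - exp(-lam * y)) ** n   # prob(largest of n exponentials <= y)
hits = sum(
    max(random.expovariate(lam) for _ in range(n)) <= y for _ in range(reps)
)
simulated = hits / reps
```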
distribution of extreme values. Much of the work on extreme value distributions is due to Gumbel (1954, 1958). Three types of asymptotic distributions have been developed based on different (but not all) parent distributions. The types are:
a. Type I-parent distribution unbounded in direction of the desired extreme and all moments of the distribution exist (exponential type distributions).
b. Type II-parent distribution unbounded in direction of the desired extreme and all moments of the distribution do not exist (Cauchy type distributions).
c. Type III-parent distribution bounded in the direction of the desired extreme (limited type distributions).
Interest may exist in either the distribution of the largest or smallest extreme values. Examples of parent distributions falling under the various types are:
a. Type I-extreme value largest: normal, lognormal, exponential, gamma
b. Type I-extreme value smallest: normal
c. Type II-extreme value largest or smallest: Cauchy distribution (Hahn and Shapiro
1967; Thomas 1971)
d. Type III-extreme value largest: beta distribution (Hahn and Shapiro 1967; Gibra 1973;
Benjamin and Cornell 1970)
e. Type III-extreme value smallest: beta, lognormal, gamma, exponential
The type II or Cauchy type extreme value distributions have found little application in hydrology. The distribution of the largest extreme value in hydrology generally arises as a type I extreme value largest distribution because most hydrologic variables are unbounded on the right. (See Van Montfort [1970] for a test to determine whether a type I or type II extreme value largest best fits the observed data.) The distribution of extreme value smallest commonly found in hydrologic work is the type III extreme value smallest since many hydrologic variables are bounded on the left by zero. The following is a treatment of these two (type I largest and type III smallest) extreme value distributions plus the type I smallest because of its symmetry with the type I largest.
Extreme Value Type I
The type I extreme value has been referred to as Gumbel's extreme value distribution, the extreme value distribution, the Fisher-Tippett type I distribution, and the double exponential distribution. The type I asymptotic distribution for maximum (minimum) values is the limiting model as n approaches infinity for the distribution of the maximum (minimum) of n independent values from an initial distribution whose right (left) tail is unbounded and which is an exponential
type; that is, the initial cumulative distribution approaches unity (zero) with increasing
(decreasing) values of the random variable at least as fast as the exponential distribution approaches
unity. The normal, lognormal, exponential, and gamma distributions all meet this requirement for
maximum values while the normal distribution satisfies the requirement for minimum values.
The type I extreme value distribution has been used for rainfall depth-duration-frequency
studies (Hershfield 1961) and as the distribution of the yearly maximum of daily and peak river
flows. Gumbel (1958) states that this latter application assumes 1) the distribution of daily
discharges (the parent distribution) is of the exponential type, 2) n = 365 is a sufficiently large
sample and 3) the daily discharges are independent. Gumbel states that the first and second
assumptions cannot be checked because the analytical form of the distribution of discharges is
unknown and that the third assumption is clearly not true so that the number of independent
observations is something less than 365. In spite of violating the last assumption, experience with
the type I for the maximum of daily discharges has been reasonably good. Maximum annual
flood peaks would more nearly fulfill assumption 3 although the effective sample size would be
much less than 365.
The probability density function for the type I extreme value distribution is

p_X(x) = (1/α) exp[∓(x − β)/α − exp(∓(x − β)/α)]    −∞ < x < ∞    (6.47)

where the − applies for maximum values and the + for minimum values. The parameters α and β are scale and location parameters with β being the mode of the distribution. The type I for maximum and minimum values are symmetrical with each other about β. Figure 6.7 is a plot of the distributions for α = 3,897 and β = 7,750.
The mean and variance of the extreme value type I distribution are

E(X) = β + γ_e α    (maximum)    (6.48)
E(X) = β − γ_e α    (minimum)

where γ_e = 0.5772... is Euler's constant.
Fig. 6.7. Extreme value type I distributions for largest and smallest values, α = 3,897 and β = 7,750 (x-axis: x in 1000s).

Var(X) = π²α²/6    (both)    (6.49)

γ = 1.1396    (maximum)    (6.50)
γ = −1.1396    (minimum)    (6.51)

In terms of the reduced variate

y = (x − β)/α    (6.52)

the cumulative distributions are

P_Y(y) = exp[−exp(−y)]    −∞ < y < ∞    (maximum)    (6.53)
P_Y(y) = 1 − exp[−exp(y)]    −∞ < y < ∞    (minimum)    (6.54)

The designation "double exponential" distribution follows from these equations. The cumulative distributions for maximum and minimum values are related by

P_Y,max(y) = 1 − P_Y,min(−y)    (6.55)
The parameters of the type I extreme value distribution can be estimated in a number of ways. Lowery and Nash (1970) compared several methods and concluded that the method of moments was as satisfactory as other methods. If the method of moments is used, the estimators are

α̂ = √6 S/π    (6.56)

and

β̂ = X̄ − γ_e √6 S/π    (maximum)
β̂ = X̄ + γ_e √6 S/π    (minimum)    (6.57)
The maximum likelihood estimators (Lowery and Nash 1970) can be determined by a simultaneous solution to the equations (shown here for the maximum case)

α̂ = X̄ − Σ x_i exp(−x_i/α̂) / Σ exp(−x_i/α̂)

β̂ = −α̂ ln[(1/n) Σ exp(−x_i/α̂)]

Unfortunately, these equations cannot be easily solved explicitly for α̂ and β̂, so that a numerical solution is required.
The type I extreme value distribution for maximums has been used to define the "mean annual flood." The probability that an observation from this distribution will exceed the mean of the distribution is 1 − P_Y(y) where P_Y(y) is evaluated from equation 6.53 for y = (μ − β)/α. Since μ = E(X) = β + γ_e α (equation 6.48), we simply have that y = γ_e and P_Y(y) = 0.5703. The probability of a value in excess of the mean is 1 − P_Y(y) = 0.4297. The return period of a flood equal in magnitude to the mean is

T = 1/[1 − P_Y(y)] = 1/0.4297 = 2.33 years

Often the "mean annual flood" refers to a flood with a return period of 2.33 years.
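The 2.33-year figure follows directly from equations 6.48 and 6.53, as this short sketch confirms:

```python
from math import exp

gamma_e = 0.5772157        # Euler's constant
P = exp(-exp(-gamma_e))    # P_Y(y) evaluated at y = gamma_e (equation 6.53)
T = 1 / (1 - P)            # return period of the mean annual flood
```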
Extreme Value Type III Minimum (Weibull)
The extreme value type III distribution arises when the extreme is from a parent distribution that is limited in the direction of interest. This distribution has found use in hydrology as the distribution of low stream flows. Naturally, low flows are bounded by zero on the left. The type III for minimum values is also known as the Weibull distribution and is defined as

p_X(x) = (α/β)(x/β)^(α−1) exp[−(x/β)^α]    x ≥ 0; α, β > 0

with cumulative distribution

P_X(x) = 1 − exp[−(x/β)^α]
The parameters of the Weibull distribution can be estimated by the method of moments by substituting the sample mean and variance for the population mean and variance respectively in equations 6.62 and 6.63 and then solving the two equations simultaneously for α̂ and β̂. The maximum likelihood estimate α̂ can be determined by solving

Σ x_i^α̂ ln x_i / Σ x_i^α̂ − 1/α̂ = (1/n) Σ ln x_i

and β̂ is then given by

β̂ = [(1/n) Σ x_i^α̂]^(1/α̂)

Either method of parameter estimation is difficult. Exercise 6.18 provides a method for simplifying the solution of the moment equations.
Fig. 6.8. Examples of extreme value type III minimum (Weibull) density curves.
The Weibull probability density function can range from a reverse-J with α < 1, to an exponential with α = 1, to a nearly symmetrical distribution (figure 6.8) as α increases. If the lower bound on the parent distribution is not zero, a displacement parameter ε must be added to the type III extreme value distribution for minimums so that the density function becomes

p_X(x) = [α/(β − ε)][(x − ε)/(β − ε)]^(α−1) exp{−[(x − ε)/(β − ε)]^α}    x ≥ ε    (6.68)

With the transformation y = [(x − ε)/(β − ε)]^α, tables of e^(−y) can be used to determine P_X(x). Equation 6.68 is sometimes known as the 3-parameter Weibull distribution, or as the bounded exponential distribution.
The mean and variance of the three-parameter Weibull distribution are

E(X) = ε + (β − ε)Γ(1 + 1/α)    (6.70)

and

Var(X) = (β − ε)²[Γ(1 + 2/α) − Γ²(1 + 1/α)]    (6.71)
The coefficient of skew is again given by equation 6.64. Through algebraic manipulation, equations 6.70 and 6.71 can be put in the form (Gumbel 1958)

μ = ε + (β − ε)A(α)    (6.72)

σ = (β − ε)B(α)    (6.73)

where

A(α) = Γ(1 + 1/α)    (6.74)
B(α) = [Γ(1 + 2/α) − Γ²(1 + 1/α)]^(1/2)    (6.75)

The moment estimates for α, β, and ε can now be obtained by 1) solving equation 6.64 for α̂, 2) solving 6.74 and 6.75 for A(α̂) and B(α̂), 3) solving 6.73 for (β̂ − ε̂), and 4) solving 6.72 for ε̂. Table 6.2 can be used to simplify the calculations.
Example 6.7. The minimum annual daily discharges on a stream are found to have an average of 125 cfs, a standard deviation of 50 cfs, and a coefficient of skew of 1.4. Using both the type III minimum and the type I minimum extreme value distributions, evaluate the probability of an annual minimum flow being less than 100 cfs.
Solution: Type III minimum. The parameters are estimated by interpolation in table 6.2 using equations 6.72 and 6.73, and prob(X < 100) is then evaluated from the fitted type III distribution.
Type I minimum

α̂ = √6 s/π = 0.78(50) = 39    (eq. 6.56)

β̂ = x̄ + γ_e √6 s/π = 125 + 0.45(50) = 147.5    (eq. 6.57)

y = (100 − 147.5)/39 = −1.22

P_X(100) = 1 − exp(−e^(−1.22)) = 1 − 0.744 = 0.256    (eq. 6.54)
Comment: The results of applying these two distributions to this problem are very different. This
should be expected as it is a situation where the type I for minimums would not be expected to
apply because there would be a lower bound and because the coefficient of skew was given as 1.4
whereas the coefficient of skew for the type I minimum is −1.1396.
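The type I minimum portion of example 6.7 can be reproduced numerically; this is a sketch of the computation above with assumed variable names:

```python
import math

# Type I extreme value distribution for minimums fitted by moments,
# reproducing the type I portion of example 6.7.
xbar, s = 125.0, 50.0                 # mean and standard deviation, cfs
euler = 0.577216                      # Euler's constant, gamma_e

alpha = math.sqrt(6.0) * s / math.pi  # eq. 6.56, about 39
beta = xbar + euler * alpha           # eq. 6.57 (0.45 s), about 147.5
y = (100.0 - beta) / alpha
p = 1.0 - math.exp(-math.exp(y))      # eq. 6.54: prob(X < 100)
print(round(p, 3))                    # -> 0.256
```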
Discussion
The theory on which the extreme value distributions depend is not as strong as the Central
Limit Theorem for the normal distribution. More assumptions concerning the underlying or
parent distribution must be made and the rate of convergence to an asymptotic extreme value
distribution may be rather slow. However, the extreme value distributions do provide a connection between observed extreme events and models that may be used to evaluate the probabilities
of future extreme events.
The conditions under which the various extreme value distributions arise are such that for
many parent distributions (lognormal, gamma) the distribution of maximum values and the distribution of minimum values are not of the same type. The minimum values from a lognormal
would be expected to follow the type III distribution while the maximum values would follow a
type I distribution.
Various types of extreme value distributions are related. The logarithms of a random variable that follows a type III minimum are distributed as the type I minimum extreme value distribution. Chow (1954) has shown that if the coefficient of variation of the type I maximum extreme
value distribution is 0.364, the distribution is practically the same as the lognormal distribution
with the same coefficient of variation and coefficient of skew (1.139).
The generalized extreme value (GEV) distribution has the cumulative distribution function
P_X(x) = exp{−[1 − κ(x − ξ)/α]^(1/κ)}   for κ ≠ 0
The Gumbel distribution is obtained when κ = 0. For |κ| < 0.3, the general shape of the GEV is
similar to the Gumbel extreme value distribution with some differences in the right tail. The
parameters ξ, α, and κ are location, scale, and shape parameters. For κ > 0, the distribution has
a finite upper bound at ξ + α/κ and corresponds to the extreme value type III distribution for
maximums that are bounded on the right. The range of X depends on the sign of κ: for κ > 0,
x < ξ + α/κ, while for κ < 0, x > ξ + α/κ.
The probability weighted moments of the GEV may be estimated by equations 3.71. The L-moments, λ_i, may then be estimated by equations 3.74.
The parameters of the GEV in terms of L-moments are:
κ = 7.8590c + 2.9554c²,   where c = 2/(3 + τ₃) − ln 2/ln 3
α = λ₂κ/[Γ(1 + κ)(1 − 2^(−κ))]
ξ = λ₁ − α[1 − Γ(1 + κ)]/κ
The quantile of the distribution is then
x_p = ξ + (α/κ){1 − [−ln P_X(x_p)]^κ}
where P_X(x_p) is the cdf of X. In chapter 7 an example of the use of the GEV for flood frequency
analysis is given.
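The L-moment fit of the GEV can be sketched with the standard rational approximation for κ (the coefficients 7.8590 and 2.9554 above); the function names below are illustrative, and the quantile and cdf expressions assume κ ≠ 0:

```python
import math

def gev_params_from_lmoments(l1, l2, t3):
    """GEV parameters (xi, alpha, kappa) from the first two L-moments
    and the L-skewness t3, using the standard approximation for kappa."""
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    kappa = 7.8590 * c + 2.9554 * c * c
    alpha = l2 * kappa / (math.gamma(1.0 + kappa) * (1.0 - 2.0 ** (-kappa)))
    xi = l1 - alpha * (1.0 - math.gamma(1.0 + kappa)) / kappa
    return xi, alpha, kappa

def gev_quantile(p, xi, alpha, kappa):
    # x_p = xi + (alpha/kappa) * {1 - [-ln p]^kappa},  kappa != 0
    return xi + (alpha / kappa) * (1.0 - (-math.log(p)) ** kappa)

def gev_cdf(x, xi, alpha, kappa):
    # P_X(x) = exp{-[1 - kappa (x - xi)/alpha]^(1/kappa)},  kappa != 0
    return math.exp(-(1.0 - kappa * (x - xi) / alpha) ** (1.0 / kappa))
```

As a self-consistency check, the cdf evaluated at a computed quantile returns the original probability: `gev_cdf(gev_quantile(0.99, xi, a, k), xi, a, k)` is 0.99.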
BETA DISTRIBUTION
A distribution that has both an upper and lower bound is the beta distribution. Generally, the
beta distribution is defined over the interval 0 to 1. It can, however, be transformed to any interval α
to β. If the limits of the distribution are unknown, they become parameters of the distribution, making it a 4-parameter rather than a 2-parameter distribution. The beta density function is given by
p_X(x) = x^(α−1)(1 − x)^(β−1)/B(α, β)   0 ≤ x ≤ 1
The function B(α, β) = ∫₀¹ x^(α−1)(1 − x)^(β−1) dx is called the beta function. The beta function is
related to the gamma function by
B(α, β) = Γ(α)Γ(β)/Γ(α + β)
The beta function is tabulated. The mean and variance of the beta distribution are
μ = α/(α + β)
and
σ² = αβ/[(α + β)²(α + β + 1)]
The mean and variance can be used to get the moment estimators for α and β.
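Solving the mean and variance expressions for the parameters gives the moment estimators directly; a minimal sketch (function name assumed) for the standard beta on (0, 1):

```python
def beta_moment_estimates(mean, var):
    """Moment estimators for the beta distribution on (0, 1).

    Inverts mu = a/(a+b) and var = ab/[(a+b)^2 (a+b+1)]:
    a + b = mu(1 - mu)/var - 1, then split by the mean."""
    s = mean * (1.0 - mean) / var - 1.0   # alpha + beta
    alpha = mean * s
    beta = (1.0 - mean) * s
    return alpha, beta
```

As a round-trip check, a beta with α = 2 and β = 5 has mean 2/7 and variance 10/392; feeding those moments back in recovers α = 2 and β = 5.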
PEARSON DISTRIBUTIONS
Karl Pearson (Elderton 1953) proposed that frequency distributions can be represented by
By choosing appropriate values for the parameters, equation 6.90 becomes a large number of
families of distributions including the normal, beta, and gamma distributions.
The Pearson type III has found application in hydrology, especially as the distribution of logarithms of flood peaks. This distribution can be written
with the mode at X = 0. The lower bound of the distribution is X = −a. The difference in the
mean and mode is δ, and the value of p_X(x) at the mode is p₀. It can be shown that the Pearson type
III is the same as the 3-parameter gamma distribution. By shifting equation 6.91 so that the mode
is at X = a and the lower bound is at X = 0, we have the gamma form.
The gamma distribution has the mode at (η − 1)/λ and the mean at η/λ. Thus a = (η − 1)/λ
and δ = η/λ − (η − 1)/λ = 1/λ. The value of p_X(x) at the mode for the gamma distribution is
where Y is the sum of squares of n random values of Z and has a chi-square distribution with n
degrees of freedom. The chi-square distribution is a special case of the gamma distribution with
λ = 1/2 and η a multiple of 1/2. The distribution thus has a single parameter ν = 2η known as the
degrees of freedom. The expression for the distribution is
The parameter ν is usually known in any application of the chi-square to statistical testing.
Equation 6.95 produces the moment estimator for ν as ν̂ = X̄. In figure 6.4, the curve labeled
λ = 1/2 is a chi-square distribution with ν = 6. The coefficient of skew for the chi-square
distribution is 2√(2/ν).
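The identification of the chi-square as a gamma with λ = 1/2 can be checked numerically; the sketch below (names illustrative) integrates the ν = 6 density with Simpson's rule and confirms that the mean equals ν:

```python
import math

def chi2_pdf(x, nu):
    """Chi-square density: a gamma density with lambda = 1/2, eta = nu/2."""
    return (x ** (nu / 2.0 - 1.0) * math.exp(-x / 2.0)
            / (2.0 ** (nu / 2.0) * math.gamma(nu / 2.0)))

def simpson(f, a, b, n=20000):
    # Composite Simpson's rule; n must be even
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

nu = 6
# E[X] = integral of x * f(x); the tail beyond 300 is negligible for nu = 6
mean = simpson(lambda x: x * chi2_pdf(x, nu), 1e-9, 300.0)
print(round(mean, 3))
```

The same integration applied to (x − ν)² would confirm the variance 2ν, and the skew 2√(2/ν) follows from the third central moment.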
The cumulative chi-square distribution is contained in the appendix in the form of a table of
values of χ² for selected probability levels and degrees of freedom ν.
The t Distribution
If Y is a standardized normal variate and U is a chi-square variate with ν degrees of freedom
and Y and U are independent, then
T = Y/√(U/ν)
has a t distribution with ν degrees of freedom. The mean of T is zero and
Var(T) = ν/(ν − 2)   for ν > 2
As ν becomes large, the t distribution approaches the standard normal
distribution. One can reason that as the sample size increases, the estimate for the variance improves to the point where the sampling distribution of the mean of a normal distribution with
unknown variance can be approximated by the sampling distribution of the mean of a normal
distribution with a known variance, which is itself a normal distribution. In practice one rarely
knows the variance of the distribution from which a sample is obtained.
Example 6.8. A sample of size 8 from a normal distribution results in X̄ = 12.7 and s² = 9.8.
What is the probability that X̄ is in error by more than 1.0?
Solution: (X̄ − μ)/(s/√n) has a t distribution with n − 1 degrees of freedom. To be in error by
more than 1.0 units we must have |X̄ − μ| > 1.0. The corresponding value of t is
1.0/√(9.8/8) = 0.904.
The desired probability is the area to the right of t = 0.904. By interpolation in the t table,
this value is found to be 0.198. By symmetry, the area to the left of -0.904 is 0.198. The desired
probability is 0.198 + 0.198 = 0.396.
If a standard normal distribution had been used rather than a t distribution, it would have been
necessary to find prob(lZ1 > 0.904). This probability can be found from the standard normal table
to be 0.366. Thus, even for a sample as small as 8, the normal is a reasonable approximation.
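The comparison in example 6.8 can be reproduced numerically. Since the standard library has no t cdf, the sketch below (names assumed) integrates the t density directly with Simpson's rule and uses `statistics.NormalDist` for the normal tail:

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    """Density of the t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2.0) / (math.sqrt(nu * math.pi)
                                      * math.gamma(nu / 2.0))
    return c * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)

def simpson(f, a, b, n=20000):
    # Composite Simpson's rule; n must be even
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

t0 = 1.0 / math.sqrt(9.8 / 8.0)     # t = 0.904 from example 6.8
# Two-tailed probabilities; the t tail beyond 60 is negligible
p_t = 2.0 * simpson(lambda t: t_pdf(t, 7), t0, 60.0)
p_z = 2.0 * (1.0 - NormalDist().cdf(t0))
print(round(p_t, 3), round(p_z, 3))
```

The printed values match the example: about 0.396 for the t distribution and 0.366 for the normal approximation.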
The F Distribution
If U is a chi-square variate with ν = m degrees of freedom and V is a chi-square variate
with ν = n degrees of freedom and U and V are independent, then
F = (U/m)/(V/n)
has an F distribution with ν₁ = m and ν₂ = n degrees of freedom (m and n are known as the
numerator and denominator degrees of freedom, respectively). The F distribution is given by
TRANSFORMATIONS
Often a transformation can be made in an attempt to arrive at a probability distribution that
will describe the data. Common transformations are logarithmic transformations, translations
along the x axis, and nth power transformations for n = 1/3, 1/2, 2, and 3.
We have already made one application of the logarithmic transformation to get the lognormal distribution from the normal distribution. Other distributions can be transformed by means
of this transformation as well. Benson (1968) and an Interagency Subcommittee on Water Data
(1982) discuss the use of the log-Pearson type III distribution for flood frequencies.
Translations are especially useful in the case of bounded distributions. We made use of a
translation in deriving the 3-parameter extreme value type III for minimums from the corresponding 2-parameter distribution. In general, a translation is accomplished by subtracting a location
parameter, E, from the random variable. For example
The fact that the sample coefficient of skew must now be used in the estimation means that for small samples accuracy is lost, because the sample skew is based on the third sample moment. As shown earlier, the 3-parameter gamma is the
same as the Pearson type III distribution.
Sangal and Biswas (1970) have used the 3-parameter lognormal distribution obtained by fitting a normal distribution to the logarithms of (X - E) where E is a parameter that must be estimated from the data. They found for 10 Canadian rivers that the 3-parameter lognormal
distribution fit the observed distribution of peak flows. They also state that the Gumbel extreme
value distribution is a special case of the 3-parameter lognormal distribution.
The three parameter lognormal is given by
p_X(x) = exp{−[ln(x − ε) − μ_y]²/(2σ_y²)}/[(x − ε)σ_y√(2π)]   for x > ε
where μ_y and σ_y are the mean and standard deviation of y = ln(x − ε).
Stidd (1953) and Kendall (1967) discuss transforming variables by Y = Xⁿ and then fitting a
normal distribution to Y. They discuss this transformation in terms of precipitation probabilities.
Exercises
6.1. Show that the mean of the uniform distribution is (β + α)/2 and that the variance is
(β − α)²/12.
6.8. A set of data having a mean of 4.5 and a standard deviation of 2.0 is thought to follow the
type I extreme value distribution for maximums. What proportion of the observations from this
distribution exceed 6.0? Plot the probability density function.
6.9. Repeat exercise 6.8 using the type I extreme value distribution for minimums.
6.10. Repeat exercise 6.8 using the Weibull distribution.
6.11. Repeat exercise 6.8 using the lognormal distribution.
6.12. Show that the exponential distribution is memoryless [i.e., show that prob(X > t + τ |
X > τ) = prob(X > t)].
6.13. Plot the probability density function and the cumulative probability distribution for the
lognormal distribution with μ_x = 50,000 and σ_x = 25,000.
6.14. Plot the theoretical distribution of the largest value selected from a normal distribution
with μ = 4 and σ = 4 for sample sizes of n = 2, 5, 9, and 33. Compare the results with those of
example 6.6.
6.15. Derive expressions analogous to equations 6.45 and 6.46 for the smallest of n independently and identically distributed random variables.
6.16. Verify equation 6.52 from equations 6.47 and 6.51.
6.17. Assume that during month 1 the mean and standard deviation of the monthly rainfall are
0.750 and 0.433 inches, respectively. Similarly, during month 2 the mean and standard deviations
of monthly rainfall are 3.000 and 0.866 inches, respectively. Assume monthly rainfall amounts
can be approximated by the gamma distribution and that rainfall in month 2 is independent of
rainfall in month 1. What is the probability of receiving more than 3 inches of rain during the
two-month period?
6.18. Show that for the 2-parameter Weibull distribution the parameter α is a function only of the
coefficient of variation. Using this fact, describe a procedure for estimating α and β of the distribution.
6.19. If peak discharge, q, is lognormally distributed with mean μ_q and variance σ_q², what is the
probability distribution of stage S? Assume stage and discharge are related by q = asb.
6.20. Work exercise 6.19 assuming the peak discharges are distributed as the type I extreme
value distribution.
6.21. In example 6.3, let the parameter be approximated by 6.0. Calculate the other parameter from
equation 6.25 and then evaluate prob(yield > 20.0) by using the equation following equation 6.21. Compare the
results with those of example 6.3.
6.22. Use the method of moments to estimate the parameters of the 3-parameter lognormal
distribution for the North Llano River near Junction, Texas. What is the return period of a mean
annual flow of 273 cfs or more?
6.23. Calculate the return period associated with an annual runoff of 0.500 inches for Walnut
Gulch near Tombstone, Arizona (Data in Appendix C). Assume (a) lognormal distribution, (b)
gamma distribution, (c) extreme value type I, (d) normal distribution.
6.24. Assume the data of exercise 4.10 are distributed as a 2-parameter exponential distribution.
Estimate the parameters of this distribution and prepare a table comparing the observed and
expected number of floods over the 100-year period.
7. Frequency Analysis
ONE OF the earliest and most frequent uses of statistics in hydrology has been that of
frequency analysis. Early applications of frequency analysis were largely in the area of flood flow
estimation. Today nearly every phase of hydrology is subjected to frequency analysis. Although
most of the discussion in this chapter centers on flood flows or peak flows, the techniques are
generally applicable to a wide range of problems including runoff volumes, low flows, rainfall
events of various kinds, water quality parameters, measures of ground water levels and flows,
and many other environmental variables. The statistical and mathematical manipulations discussed in this chapter do not depend on the units of measurement or the quantity measured. The
assumptions that are made, however, must be carefully compared to the situation under study.
The goal of a frequency analysis is to estimate the magnitude of an event having a given
frequency of occurrence or to estimate the frequency of occurrence of an event having a given
magnitude. The frequency is often stated in terms of a return period, T, in years, or a probability
of occurrence in any one year, p. Other terminology commonly used includes the estimation of a
"quantile" or "percentile" of the probability distribution of the quantity of interest. The loopth
percentile is simply the event having a probability, p, of occurring. The term "quantile" is used in
a similar manner. The 9othquantile is the same as the 9othpercentile. The loopthpercentile or the
loopthquantile is the value, xp, of the random variable X satisfying
There have been and continue to be volumes of material written on the proper probability
distribution to use in various situations. One cannot, in most instances, analytically determine
which probability distribution should be used. Certain limit theorems such as the Central Limit
Theorem and Extreme Value Theorems might provide guidance. One should also evaluate the experience that has been accumulated with the various distributions and how well they describe the
phenomena of interest. Certain properties of the distributions can be used in screening distributions for possible application in a particular situation. For example, the range of the distribution,
the general shape of the distribution, and the skewness of the distribution often indicate that a
particular distribution may or may not be applicable in a given situation. When two or more distributions appear to describe a given set of data equally well, the distribution that has been traditionally used should be selected unless there are contrary overriding reasons for selecting another
distribution. However, if a traditionally used distribution is inferior, its use should not be continued just because "that's the way it's always been done".
The first part of this chapter discusses empirical frequency analysis by plotting data in the
form of a cumulative probability distribution. The second topic covered is analytical frequency
analysis based on probability distributions. A simplified technique based on frequency factors is
shown for determining the magnitude of an event with a given return period. In general, the
frequency factor is a function of the distributional assumption that is made and of the mean,
variance, and, for some distributions, the coefficient of skew of the data. Regional frequency
analysis is then discussed. Regional frequency analysis attempts to use data from several
locations in a "homogeneous" region to determine the frequency relationship for a point having
limited data. The chapter closes with a discussion of the frequency analysis of precipitation data
and other forms of hydrologic data.
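The relationship between return period and probability described above can be sketched as a small computation; the distribution, mean, and standard deviation below are purely illustrative assumptions:

```python
from statistics import NormalDist

# Return period T and annual exceedance probability p are related by p = 1/T,
# so the T-year event is the 100(1 - p)th percentile of the distribution.
# Illustrative assumption: a normal distribution with mean 100 and sd 30.
T = 100
p = 1.0 / T                                      # exceedance probability 0.01
x_T = NormalDist(100.0, 30.0).inv_cdf(1.0 - p)   # the 99th percentile
print(round(x_T, 1))
```

Here `x_T` is about 169.8, since the standard normal 99th percentile is 2.326.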
Frequency analysis of hydrologic data requires that the data be homogeneous and independent.
The restriction of homogeneity ensures that all the observations are from the same population. Nonhomogeneity may result from a stream gaging station being moved, a watershed becoming
urbanized, or structures being placed in the stream or its major tributaries. Different types of storms,
such as frontal storms and storms associated with hurricanes, may introduce nonhomogeneity. In
this latter situation a mixed population model may be required for the frequency analysis.
The restriction of independence ensures that a hydrologic event such as a single large storm
does not enter the data set more than once. For example, a single storm system may produce two
or more large runoff peaks only one of which (the largest) should enter the data set. Dependence
may also result when a major rainfall occurs, producing very wet antecedent conditions on a
catchment. A subsequent rainfall may then produce much larger flows than would have occurred
had a more normal antecedent condition existed. The flow from the second storm is then dependent on the fact that the first storm had occurred. Runoff from only one of these events, the
largest one, should enter the analysis.
For the prediction of the frequency of future events, the restriction of homogeneity requires
that the data on hand be representative of future flows (i.e., there will be no new structures,
diversions, land use changes, etc., in the case of stream flow data). Recently, the possibility of climate change has been raised as a factor contributing to nonhomogeneity of a hydrologic record.
If climate change is occurring at a rate rapid enough to affect the usefulness of a particular
hydrologic analysis, this change must be reckoned with in the analysis.
Hydrologic frequency analysis can be made with or without making any distributional
assumptions. The procedure to be followed in either case is much the same. If no distributional
assumptions are made, the observed data are plotted on any kind of paper (not necessarily probability paper) and judgment used to determine the magnitude of past or future events for various
return periods. If a distributional assumption is made, the magnitude of events for various return
periods is selected from the theoretical "best-fit" line according to the assumed distribution. If an
analytical technique is used, the data should still be plotted so that one can get an idea of how
well the data fit the assumed analytical form and to spot potential problems.
PROBABILITY PLOTTING
Once data for a frequency analysis have been selected, they must be carefully scrutinized to
ensure that all of the observations are valid representations of the hydrologic characteristic under
consideration. For example, in a flood frequency data set consisting of the annual maximum flow,
it is possible that the lower values are merely flows somewhat above the flows for the remainder
of the year but do not truly represent high flows or flood flows. In such a case, some truncation
of low flows might be instituted with the analysis done on the truncated data set and adjusted to
the full record length.
After accepting the data as valid, basic statistics (mean, variance, skewness) of the data
should be computed and the data plotted as a probability plot. Plotting probability density functions and cumulative probability distributions on arithmetic paper has already been discussed. In
general, when the cumulative distribution function, Px(x), is plotted on arithmetic paper versus
the value of X, a straight line does not result. To get a straight line on arithmetic paper, Px(x)
would have to be given by the expression P_X(x) = ax + b, or p_X(x) = a, the uniform distribution.
Thus, if the cumulative distribution of a set of data plots as a straight line on arithmetic paper, the
data follows a uniform distribution. Probability paper can be developed so that any cumulative
distribution can be plotted as a straight line. Generally, the scaling of the probability axes is
unique for each of the different probability distributions to plot as a straight line. The scaling of
the probability axis may even have to change as the parameters of a particular distribution
change. Constructing probability paper is a process of transforming the probability scale so that
the resulting cumulative curve is a straight line. Many types of probability paper are commercially available, including paper for the normal, lognormal, exponential, certain cases of the
gamma, extreme value (type I), Weibull, and chi-square distributions.
A few computer software packages provide for plotting using a normal distribution probability scale. Some of the packages will plot probability directly whereas others use the Z transformation of the normal distribution. The resulting plots are similar. When the Z transformation
is used, the probability associated with the plotted Z values must be independently determined.
The most common probability paper has a normal probability scale and either an arithmetic
(normal probability paper) or logarithmic (lognormal probability paper) scale. Normally distributed data will plot as a straight line on normal probability paper and lognormally distributed data
will plot as a straight line on lognormal probability paper. One way to determine if data might be
from a normal or lognormal distribution is to plot the data on normal and lognormal probability
paper and visually determine if a straight line is obtained.
Regardless of the type of sample data used, the plotting position can be determined in the
same manner. Gumbel (1958) states the following criteria for plotting position relationships:
1. The plotting position must be such that all observations can be plotted.
2. The plotting position should lie between the observed frequencies of (m − 1)/n and m/n
where m is the rank of the observation beginning with m = 1 for the largest (smallest) value
and n is the number of years of record (if applicable) or the number of observations.
3. The return period of a value equal to or larger than the largest observation and the return
period of a value equal to or smaller than the smallest observation should converge toward n.
4. The observations should be equally spaced on the frequency scale.
5. The plotting position should have an intuitive meaning, be analytically simple, and be easy to use.
Several plotting position relationships are presented in Chow (1964) and Singh (1992). A general
plotting position relationship is given by
where a and b are constants (Adamowski 1981). Some of the most common relationships for
plotting positions are shown in Table 7.1. Unless specifically stated to the contrary, the Weibull
relationship is used in the remainder of this book. Benson (1962a), in a comparative study of
several plotting position relationships, found on the basis of theoretical sampling from extreme
Source                 Relationship
California (1923)      m/n
Hazen (1930)           (2m − 1)/2n
Weibull (1939)         m/(n + 1)
Cunnane (1978)         (m − 0.4)/(n + 0.2)
Gringorten             (m − 0.44)/(n + 0.12)
Adamowski (1981)       (m − 0.25)/(n + 0.5)
CHAPTER 7
154
value and the normal distributions that the Weibull relationship provided estimates that were
consistent with experience.
The Weibull plotting position formula meets all 5 of the above criteria: 1) All of the
observations can be plotted since the plotting positions range from 1/(n + 1), which is greater than
zero, to n/(n + 1), which is less than one. Probability paper for distributions with infinitely long tails
does not contain the points zero and one; 2) The relationship m/(n + 1) lies between (m − 1)/n and
m/n for all values of m and n; 3) The return period of the largest value is (n + 1)/1, which
approaches n as n gets large, and the return period of the smallest value is (n + 1)/n = 1 + 1/n,
which approaches 1 as n gets large; 4) The difference between the plotting position of the (m + 1)st
and mth value is 1/(n + 1) for all values of m and n; and 5) The fact that condition 3 is met plus the
simplicity of the Weibull relationship fulfills condition 5.
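The plotting position relationships of Table 7.1 can be sketched with a single helper written in the general form of equation 7.2 (the constant names `a` and `b` are assumed):

```python
def plotting_positions(n, a=0.0, b=0.0):
    """Plotting positions (m - a)/(n + b) for ranks m = 1..n."""
    return [(m - a) / (n + b) for m in range(1, n + 1)]

n = 20
weibull = plotting_positions(n, a=0.0, b=1.0)    # m/(n + 1)
hazen = plotting_positions(n, a=0.5, b=0.0)      # (m - 0.5)/n
cunnane = plotting_positions(n, a=0.4, b=0.2)    # (m - 0.4)/(n + 0.2)

# The Weibull positions illustrate Gumbel's criteria: all values lie
# strictly inside (0, 1) and the spacing is a constant 1/(n + 1).
print(weibull[0], weibull[-1])   # 1/21 and 20/21
```

Comparing the three lists near the ends of the sample shows the point made in the text: the formulas agree closely in the center of the distribution and differ mainly in the tails.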
One objection to the Hazen plotting position is that the return period for the largest
(m = 1) event is 2n, or twice the record length. An objection to the California plotting position
is that the smallest value (m = n) has a plotting position of 1, which implies that the smallest
sample value is the smallest possible value. A value of 1 cannot be plotted on many types of
probability paper.
It should be noted that all of the relationships give similar values near the center of the
distribution but may vary considerably in the tails. Predicting extreme events depends on the tails
Table 7.2. Determination of plotting position for Kentucky River data [columns: Flow, Rank, pp]
of the distribution, so care must be exercised. The quantity 1 - Px(x) represents the probability
of an event with a magnitude equal to or greater than the event in question. When the data are
ranked from the largest (m = 1) to the smallest (m = n), the plotting positions correspond to
1 - Px(x). If the data are ranked from the smallest (m = 1) to the largest (m = n), the plotting
position formulas are still valid; however, the plotting position now corresponds to the
probability of an event equal to or smaller than the event in question, which is Px(x). Probability
paper may contain scales of P_X(x), 1 − P_X(x), T_X(x), or a combination of these.
Plotting data on probability paper results in an empirical distribution of the data. As an
example of probability plotting, consider the data in table 2.1. The steps in plotting this data are:
1. Rank the data from the largest (smallest) to the smallest (largest) value. If two or more observations have the same value, several procedures can be used for assigning a plotting position.
The procedure adopted here is to assume they have different values and assign each a unique
rank. For example, in the data of Table 7.2, the value of 82,900 is assigned a rank of both 22
and 23 since it occurs twice in the data set.
2. Calculate the plotting position.
3. Select the type of probability paper to be used. Normal probability paper is used in this
example.
Fig. 7.1. Probability plot of the data of table 2.1 on normal probability paper (exceedance probability scale, percent).
When probability plots are made and a line drawn through the data, the tendency to extrapolate the data to high return periods is great. The distance on the probability paper from a return
period of 20 years to a return period of 200 years is not very much; however, it represents a
10-fold extrapolation of the data. If the data do not truly follow the assumed distribution with
population parameters equal to the sample statistics, the error in this extrapolation can be quite
large. This fact has already been referred to when it was stated that the estimation of probabilities
in the tails of distributions is very sensitive to distributional assumptions. Because one of the
usual purposes of probability plotting is to estimate events with longer return periods, Blench
(1959) and Dalrymple (1960) have criticized the blind use of analytical flood frequency methods
because of this tendency toward extrapolation.
If a set of data plots as a straight line on probability paper, the data can be said to be distributed as the distribution corresponding to the probability paper. Because it would be rare for a set
of data to plot exactly on a line, a decision must be made as to whether or not the deviations from
the line are random deviations or represent true deviations, indicating that the data does not follow the given probability distribution. Examining figure 7.1, it is apparent that, with the exception of the largest value, the deviations from a straight line are small. It might be assumed that the
data can be approximated by the normal distribution.
So far two tests, both based on judgment, have been described for determining if a set of
data follows a certain distribution. The first method was to visually compare observed and theoretical frequency histograms and the second to visually compare observed and theoretical
cumulative frequency curves in the form of probability plots. In chapter 8, statistical tests based
on these two visual tests will be presented.
Historical Data
Occasionally, flood information outside of the systematic flow record is available from historical sources such as newspaper reports, earlier flood investigations, or from paleohydrologic
investigations. Such data contain valuable information that should not be ignored in a frequency
analysis. Bulletin 17B of the United States Water Resources Council (1981) demonstrates computing the plotting position of the historical observations on the basis of the historical record
length. Likewise, the plotting position of the systematic data is computed on the basis of the
historic record length, except that the rank used in the calculation is adjusted by a factor, W,
depending on the historic record length, H, the number of historic flows, Z, and the length of the
systematic record, N. These are related by
W = (H − Z)/N
with m being the unadjusted rank of the total record (systematic plus historic).
Thus, if 20 years of systematic data and 2 historic observations larger than any values in the
systematic record are available from a 50-year period preceding the systematic record, the plotting position for the 2 largest values would be 1/71 = 0.014 and 2/71 = 0.028. The weighting
factor would be W = (70 − 2)/20 = 3.40.
The remaining plotting positions would be calculated from the adjusted rank given by m_a = Wm − (W − 1)(Z + 0.5).
The adjusted rank is then used in the plotting position relationship (equation 7.2). Thus, for
m = 3 (the largest systematic flow observation), the plotting position using the Weibull plotting
position relationship would be [3.40(3) − 6]/71, or 0.0592, and for m = 22 (the smallest value)
the plotting position would be [3.40(22) − 6]/71, or 0.9690. This compares to plotting positions
of 1/21, or 0.0476, and 20/21, or 0.9523, respectively, if the historic data had been ignored. If
the historic data had simply been used to augment the systematic record without using the
weighting factor, the plotting positions for these two events would have been 1/23, or 0.0435,
and 2/23, or 0.0870, respectively. Clearly, a plotting position of 0.0435 assigns too high a
probability of occurrence to the largest systematic value. Knowledge that there were 48 years
with no flows larger than the two historical events has been ignored in this latter case. It is also
apparent that the weighting procedure adjusts the plotting position toward a more frequent
occurrence for the largest systematic value thus taking into account the fact that two flows greater
in magnitude than the largest systematic flow occurred.
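The weighting procedure above can be reproduced in a few lines; this is a sketch of the worked example with assumed variable names:

```python
# Historically weighted plotting positions (Bulletin 17B procedure)
# for the worked example: 20 years systematic, 2 historic flows, 50-year
# historic period preceding the systematic record.
H, Z, N = 70, 2, 20           # historic length, historic flows, systematic length
W = (H - Z) / N               # weighting factor, 3.40

def weibull_pp_historic(m):
    if m <= Z:                # historic events use their unadjusted rank
        return m / (H + 1)
    m_adj = W * m - (W - 1.0) * (Z + 0.5)   # adjusted rank
    return m_adj / (H + 1)

print(round(weibull_pp_historic(1), 3))    # 0.014
print(round(weibull_pp_historic(3), 4))    # 0.0592
print(round(weibull_pp_historic(22), 4))   # 0.969
```

The computed values match the text: 0.014 for the largest historic flow, 0.0592 for the largest systematic flow (m = 3), and 0.9690 for the smallest (m = 22).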
Bulletin 17B also suggests the flow statistics be computed by weighting the contribution of
the systematic record to the various statistics by the factor W. Thus the adjusted mean is
X̃ = (W ΣX + ΣX_z)/H
where the X represents the systematic record and X_z the historic data. Similarly, the variance and
skew can be determined from
If a log-based distribution such as the lognormal or log Pearson III is being used, the X's and X_z's
would be based on logarithms.
Outliers
When probability plots of hydrologic data are made, frequently one or two extreme events
are present that appear to be from a different population because they plot far off of the line
defined by the other points. For example, it is entirely possible that a 100-year event is contained
in 10 years of record. If this is the case, assigning a normal plotting position of 1/11 to this value
would not be reflective of its true return period. Unfortunately, the true return period is not
known. The treatment of these "outliers" is an unresolved and controversial question. The fact
that this occurs frequently in hydrologic data should not be surprising.
Using methods discussed in chapter 4, the probability of at least one occurrence of an n-year
event in a k-year record can be calculated as 1 − (1 − 1/n)^k. For example, the probability of at
least one occurrence of a 100-year event in a 32-year record is 1 − 0.99³², or 0.275. If we have
four independent 32-year records, we expect one to contain at least one 100-year event. This is
the case even though the 100-year event is from the same population as the other 31 events in the
32-year record.
Bulletin 17B suggests that outliers can be identified from

    X_H = X̄ + K_N s_x        X_L = X̄ - K_N s_x

where X_H and X_L are threshold values for high and low outliers and K_N can be approximated from

    K_N = -0.9043 + 3.345(log10 n)^(1/2) - 0.4046 log10 n        (7.8)
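A sketch of the outlier screening. The threshold form X̄ ± K_N s and the polynomial approximation to K_N used below are assumptions made for this sketch; the tabulated Bulletin 17B K_N values are authoritative:

```python
import math

def kn_approx(n):
    # Polynomial approximation to the one-sided 10% K_N values
    # (assumed form of eq. 7.8).
    lg = math.log10(n)
    return -0.9043 + 3.345 * math.sqrt(lg) - 0.4046 * lg

def outlier_thresholds(mean, std, n):
    # X_H and X_L; apply to the logarithms of the data when a
    # log-based distribution is being fitted.
    kn = kn_approx(n)
    return mean + kn * std, mean - kn * std
```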
FREQUENCY ANALYSIS
159
estimated based on either the actual observations or their logarithms. Then the magnitude of a
flow having a particular exceedance probability or return period would be based on the lognormal
distribution and the estimated parameters.
Fitting probability distributions to data and estimating quantiles or probabilities from these
distributions has the advantage of smoothing the data and of standardizing frequency estimation
procedures. It also provides a consistent way of extrapolating short records to obtain estimates
corresponding to 50- to 200-year flows. Of course, such extrapolations are fraught with
ambiguities. The selection of an appropriate probability density function is critical, as is having
an adequate sample from which to estimate the parameters of the selected distribution.
Rao and Hamed (2000) have an extensive discussion of the mathematical properties of most
of the probability distributions that are used in hydrologic frequency analysis.
Chow (1951) has shown that many frequency analyses can be reduced to the form

    X_T = X̄(1 + C_v K_T)        (7.9)

where X_T is the magnitude of the event having a return period T and K_T is a frequency factor. This
relationship comes about by writing any X as

    X = X̄ + ΔX        (7.10)

and then stating that ΔX, the deviation from the mean, is the product of the standard deviation s
and a frequency factor K:

    ΔX = sK        (7.11)

K_T depends on the probability distribution being used and the return period.
Recalling that C_v = s/X̄, equation 7.11 takes on the form of equation 7.9. Chow (1951,
1964) presents the frequency factors for many different types of frequency distributions.
Equation 7.9 can also be used to construct the probability scale on plotting paper so that the
distribution corresponding to K_T plots as a straight line. The use of frequency factors is equivalent to using the method of moments for estimating the parameters of a pdf.
Normal Distribution
For the normal distribution it can easily be shown that K_T is the standardized normal
variate Z. The standard normal distribution, along with equation 7.9, can be used to determine
the magnitude of normally distributed events corresponding to various probabilities. For
example, the magnitude of a 20-year peak flow for the data of table 2.1 can be determined by
calculating X̄ and C_v. The 20-year event corresponds to a prob(X > x) of 0.05, so the
probability of an event less than the 20-year event is 0.95. The value of Z corresponding to a
probability of 0.95 is found from standard normal tables to be 1.645. Thus

    X_20 = X̄(1 + C_v K_20) = 66,540(1 + 0.335 × 1.645) = 103,209 cfs
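The 20-year calculation can be reproduced in a few lines; this sketch uses Python's standard normal inverse CDF in place of the table lookup:

```python
from statistics import NormalDist

# X_T = Xbar(1 + Cv * K_T), with K_T = z for the normal distribution (eq. 7.9).
def normal_flow(mean, cv, T):
    z = NormalDist().inv_cdf(1.0 - 1.0 / T)  # e.g., z = 1.645 for T = 20
    return mean * (1.0 + cv * z)

x20 = normal_flow(66540, 0.335, 20)  # about 103,200 cfs
```

The small difference from the 103,209 cfs in the text comes from rounding z to 1.645 in the hand calculation.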
For the lognormal distribution, the flow estimate is

    X_T = exp(Ȳ + s_y K_N)        (7.12)

where Ȳ and s_y are based on the natural logarithms of X, and K_N is from the standard normal
distribution.
Log Pearson Type III Distribution
Benson (1968) reported on a method of flood frequency analysis based on the log Pearson
type III distribution, which is obtained when the base 10 logarithms of observed data are used
along with the Pearson type III distribution (equation 6.91). This method is applied as follows:
1. Transform the n annual flood magnitudes, X_i, to their logarithmic values, Y_i (i.e., Y_i = log10 X_i
for i = 1, 2, ..., n).
2. Compute the mean, Ȳ, of the Y_i.
3. Compute the standard deviation, s_Y, of the Y_i.
4. Compute the coefficient of skew, C_s, of the Y_i.
5. Compute

    Y_T = Ȳ(1 + C_vY K_T)
where K_T is obtained from table 7.3. Note that this relationship is identical to equation 7.9 except
that the logarithms are used.
Table 7.3a. K_T values for positive skew coefficients, Pearson type III distribution. [Table of K_T
by skew coefficient and return period (1.0101 to 200 years) not reproduced.]
Table 7.3b. K_T values for negative skew coefficients, Pearson type III distribution. [Table of K_T
by skew coefficient and return period (1.0101 to 200 years) not reproduced.]
(Beard 1962, 1974; Benson 1968). Figure 7.2 contains regionalized skew coefficients of annual
streamflow maximum logarithms computed by the U.S. Geological Survey.
The frequency factors of table 7.3 can be used for the Pearson type III distribution in the
same manner as for the log Pearson type III. The actual data values rather than their logarithms
would then be used.
Approximate values of K_T for the Pearson type III distribution are given by

    K_T = K_N + (K_N² - 1)k + (1/3)(K_N³ - 6K_N)k² - (K_N² - 1)k³ + K_N k⁴ + (1/3)k⁵        (7.14)

in which k = C_s/6 and
Fig. 7.2. Generalized skew coefficients of annual maximum stream flow logarithms.
where K_N is the standard normal deviate (Interagency Advisory Committee on Water Data
1982). Because of certain limitations on this approximation, the use of the table for K_T is recommended. Obviously, the use of analytic approximations for K_T for any of the distributions
makes the calculations for flows of various return periods quite easy using spreadsheets or other
computer software. Table 7.4 contains the maximum percent error in equation 7.14 as compared
to table 7.3. Note that a 1% error in K_T does not translate directly to a 1% error in flow. For
example, when the log Pearson type III is used in example problem 7.2, the 100-year flow is
estimated at 29,719 cfs. The skewness of the logarithms was 0.296, so use of equation 7.14 has
a maximum error of 0.09%. With such an error, K_T would be 1.0009 × 2.542, or 2.544, and
the resulting flow estimate would be 29,752 cfs, which represents a difference of 0.11% from
Table 7.4. Errors in the use of equation 7.14 for estimating K_T, log Pearson distribution. [Table
not reproduced.]
the estimate using the table value. This is a very small error when one considers the uncertainties present in estimates of this kind. Often interpolation has to be done in table 7.3, which may
introduce more error than the use of equation 7.14. Only for C_s < -2.5 is the error in K_T greater
than 2% for T of 50, 100, and 200 years.
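Equation 7.14 can be evaluated in code. The frequency-factor approximation below, with k = C_s/6, is assumed to be its standard form; it reproduces the K_T of about 2.542 cited in the text for a skew of 0.296 at T = 100:

```python
from statistics import NormalDist

def pearson3_kt(cs, T):
    # Frequency-factor approximation for the Pearson type III
    # distribution (assumed form of eq. 7.14), with k = Cs/6 and
    # z the standard normal deviate for probability 1 - 1/T.
    z = NormalDist().inv_cdf(1.0 - 1.0 / T)
    k = cs / 6.0
    return (z + (z * z - 1) * k + (z**3 - 6 * z) * k * k / 3
            - (z * z - 1) * k**3 + z * k**4 + k**5 / 3)

kt100 = pearson3_kt(0.296, 100)  # about 2.542, as in the text's example
```

Note that with zero skew the expression collapses to the standard normal deviate, as it should.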
Extreme Value Type I Distribution (Gumbel Distribution)
Chow (1951) presents the following relationship for the frequency factor for the extreme
value type I maximum distribution:

    K_T = -(√6/π){γ_e + ln[ln(T_X(x)/(T_X(x) - 1))]}        (7.15)

where γ_e is the Euler number (0.577216) and T_X(x) is the desired return period of the quantity
being calculated. Potter (1949) presents some curves that simplify the application of the
extreme value type I. Kendall (1959) presents the frequency factors shown in table 7.5 for the
extreme value type I distribution. The values computed from equation 7.15 are equivalent to an
infinite sample size in table 7.5.
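Equation 7.15 is straightforward to evaluate; a sketch:

```python
import math

def ev1_kt(T):
    # Frequency factor for the extreme value type I (Gumbel) maximum
    # distribution, eq. 7.15; 0.577216 is the Euler number.
    return -(math.sqrt(6.0) / math.pi) * (
        0.577216 + math.log(math.log(T / (T - 1.0))))

k100 = ev1_kt(100)  # about 3.14 for the 100-year event
```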
[Table 7.5, frequency factors for the extreme value type I distribution by return period and
sample size, not reproduced.]
Other Distributions
Any of the distributions discussed in chapter 6 can be fit to data by using the methods
discussed in that chapter. Frequency factors for some of the other distributions are given by Chow
(1951, 1964).
GENERAL CONSIDERATIONS
Many proponents (and opponents) of one analytical form or another for flood flow frequencies have come to the fore over the past few decades. The proponents claim that some particular
method is superior to some other method and "prove" their claim by a few rationalizations and
some case studies. The fact remains that these rationalizations involve questionable assumptions.
There is no direct theoretical connection between any analytical form of the frequency distribution and the underlying mechanisms governing flood flows except through the limit theorems.
The primary consideration in selecting a particular analytical form for the frequency distribution
is that the distribution "fit" the observed data (Anderson 1967; Benson 1968).
Benson (1968) reported on the results of a study by a work group consisting of 18 representatives from 12 federal agencies of the U.S. government. This group studied 6 methods of flood
frequency analysis on 10 streams located throughout the United States. The records on these
streams ranged in length from 40 to 97 years with an average of 55 years. The drainage areas
ranged from 16.4 to 36,800 square miles. The six methods of analysis consisted of 1) the gamma
distribution, 2) Gumbel distribution, 3) Gumbel distribution using the logarithms of the data, 4)
lognormal distribution, 5) log Pearson type III distribution, and 6) Hazen's method. The computational procedures used were much like those presented in this book. The Hazen method consists
of using an equation like equation 7.9 along with a table of empirically derived frequency factors
that are a function of the return period and the coefficient of skew (Hazen 1930). Large differences were produced by the 6 different methods, especially at long return periods. The results
showed that the lognormal, log Pearson type III, and Hazen methods were about equally good.
The group suggested that the log Pearson type III be used unless there was a good reason to use
some other method. This recommendation was made even though the group realized that "there
are no rigorous statistical criteria on which to base a choice of method". Benson's (1968) report
states that the study showed that "the range of uncertainty in flood analysis, regardless of the
method used, is still quite large" and that many questions concerning it remain unresolved.
In a follow-up study, Beard (1974) examined flood peaks from 300 stations scattered throughout the United States. Several probability distributions were tried, including the log Pearson type III,
lognormal, Gumbel's extreme value distribution, and the 2- and 3-parameter gamma distributions.
Beard concluded that only the lognormal and log Pearson type III with a regionalized skew
coefficient were not greatly biased in estimating future flood frequencies. He stated that the latter
distribution produced somewhat more consistent results but that "... regardless of the methodology
employed, substantial uncertainty in frequency estimates from station data will exist ...".
In selecting a particular analytical form for a frequency curve, one may be tempted to select
a distribution with a large number of parameters. Generally, the more parameters a distribution
has, the better it will adapt to a set of data. However, for the sample size usually available in
hydrology, the reliability in estimating more than 2 or 3 parameters may be quite low. Thus, a
compromise must be made between flexibility of the distribution and reliability of the parameters.
"
Recognizing the short record lengths often available for frequency analysis, methods of augmenting natural data with synthetic data are being developed. In some cases the rainfall record pertaining to a watershed is much longer than its streamflow record. In this event it may be possible to
calibrate a deterministic streamflow model to the watershed and then use the long rainfall record to
generate a long synthetic streamflow hydrograph. This synthetic hydrograph can then be combined
with existing data into a single frequency analysis. In the absence of rainfall records, it may be possible to transfer records from a nearby station or to stochastically generate a series of rainfall data.
These data could then be used with the calibrated deterministic model to augment natural streamflow
data. One might consider weighting the natural data more than the augmented data in the final frequency analysis. Regression and correlation techniques might also be used to relate peak flows to rainfall or to peaks from nearby gages, and this relationship used to extend the available record.
It was because of the many factors and uncertainties involved in the selection of a
probability distribution for flood frequency determinations that several agencies of the U.S.
federal government developed the guidelines published as "Guidelines for Determining Flood
Flow Frequency," commonly known as Bulletin 17B (Interagency Advisory Committee on Water
Data 1982). Bulletin 17B has become a standard for flood frequency analysis of annual flood
peak discharges.
The developers of Bulletin 17B recognized that "there is no procedure or set of procedures
that can be adopted which, when rigidly applied to the available data, will accurately define the
flood potential of any given watershed. Statistical analysis alone will not resolve all flood frequency problems." The basic Bulletin 17B approach is to use the log Pearson type III distribution
as explained above. Because this distribution is a 3-parameter distribution, the coefficient of
skew is used when estimating the parameters by the method of moments.
The skew coefficient is sensitive to extreme flood values and thus difficult to estimate from
the small samples typically available for many hydrologic studies. Figure 7.2 presents a map of generalized skew coefficients for the logs of peak flows taken from Bulletin 17B of the Interagency
Committee. The station skew coefficient calculated from observed data and the generalized skew coefficient can be combined to improve the overall estimate of the skew coefficient. Under the assumption that the generalized skew is unbiased and independent of the station skew, the mean
square error (MSE) of the weighted estimate is minimized by weighting the station and generalized skews in inverse proportion to their individual mean square errors according to the equation
(Tasker 1978):

    G_W = (MSE_Ḡ G + MSE_G Ḡ)/(MSE_Ḡ + MSE_G)        (7.16)

where G_W is the weighted skew coefficient, G is the station skew (from equation 7.13), Ḡ is the
generalized skew (from figure 7.2), MSE_Ḡ is the mean square error of the generalized skew, and
MSE_G is the mean square error of the station skew. MSE_Ḡ is taken as a constant, 0.302, when the
generalized skew is estimated from figure 7.2. MSE_G can be estimated from (Wallis, Matalas, and
Slack 1974):
    MSE_G = 10^[A - B log10(n/10)]        (7.17)

where

    A = -0.33 + 0.08|G|    if |G| ≤ 0.90
      = -0.52 + 0.30|G|    if |G| > 0.90        (7.18)

    B = 0.94 - 0.26|G|    if |G| ≤ 1.50
      = 0.55              if |G| > 1.50

and n is the record length in years.
It is recommended that if the generalized and station skews differ by more than 0.5, the data
and flood-producing characteristics of the watershed should be examined and possibly greater
weight given to the station skew.
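The skew weighting of equations 7.16 through 7.18 can be combined in a short routine. The piecewise A and B coefficients below follow the Wallis, Matalas, and Slack approximation and are assumptions where the scan is incomplete:

```python
import math

MSE_GBAR = 0.302  # MSE of the generalized skew read from figure 7.2

def mse_station_skew(G, n):
    # Approximation to the MSE of the station skew (eqs. 7.17-7.18).
    g = abs(G)
    A = -0.33 + 0.08 * g if g <= 0.90 else -0.52 + 0.30 * g
    B = 0.94 - 0.26 * g if g <= 1.50 else 0.55
    return 10.0 ** (A - B * math.log10(n / 10.0))

def weighted_skew(G, G_bar, n):
    # Eq. 7.16: each skew is weighted by the other's mean square error.
    mse_g = mse_station_skew(G, n)
    return (MSE_GBAR * G + mse_g * G_bar) / (MSE_GBAR + mse_g)

gw = weighted_skew(0.296, -0.22, 53)  # lies between the two skews
```

As the record length n grows, MSE_G shrinks and the weighted value moves toward the station skew, which is the intended behavior of the weighting.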
CONFIDENCE INTERVALS
Any stream flow record is but a sample of all possible such records. How well the sample
represents the population depends on the sample size and the underlying population probability
distribution, which is unknown. Both the form and the parameters of the underlying distribution
must be estimated. If a second sample of data were available, certainly different estimates would
result for the parameters of the distribution even if the same distribution were selected. Different
parameter estimates will obviously result in different return period flow estimates. If many
samples were available, many estimates could be made of the distribution parameters and
consequently many estimates could be made of return period flows, say Q_100. One could then
examine the probabilistic behavior of these estimates of Q_100. The fraction of the Q_100's that fell
between certain limits could be determined.
In actuality, we have just one sample of data from which to make estimates of Q_T. Statistical procedures are available for estimating confidence intervals about estimated values of Q_T that
will give a measure of the uncertainty associated with Q_T. Confidence limits give a probability that
the confidence limits contain the true value of Q_T. A 90% confidence limit indicates that 90% of
the intervals so calculated will contain the true value of Q_T.
Letting L_T and U_T be the lower and upper confidence limits,

    L_T = X̄ + K_T,L s_x        U_T = X̄ + K_T,U s_x        (7.20)

where X̄ and s_x are the sample mean and standard deviation and K_T,L and K_T,U are the lower and
upper confidence coefficients. If a distribution like the log Pearson type III distribution is used, X̄
and s_x are based on the logarithms of the data and L_T and U_T are the logarithms of the confidence
limits.
Approximations for K_T,L and K_T,U based on large samples and the noncentral t-distribution are

    K_T,L = [K_T - (K_T² - ab)^(1/2)]/a        K_T,U = [K_T + (K_T² - ab)^(1/2)]/a        (7.21)

where

    a = 1 - Z_c²/[2(n - 1)]        (7.22a)

    b = K_T² - Z_c²/n        (7.22b)

In these relationships, K_T is the frequency factor of equation 7.9 and Z_c is the standard normal
deviate with cumulative probability c = 50 + α/2 if α is expressed as a percent. If α is 90%,
then c is 95%. The sample size is n. Confidence limits can be placed on frequency curves plotted
on probability paper by making calculations such as the above for several values of T.
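The confidence coefficients can be scripted as follows. The a, b, and K_T,L, K_T,U expressions below are the standard noncentral-t approximations and are assumed to match equations 7.21 and 7.22:

```python
from statistics import NormalDist

def confidence_coefficients(kt, n, conf=0.90):
    # Noncentral-t approximation (assumed forms of eqs. 7.21-7.22):
    #   a = 1 - Zc^2 / (2(n-1)),  b = Kt^2 - Zc^2 / n
    #   K_TL, K_TU = (Kt -/+ sqrt(Kt^2 - a*b)) / a
    zc = NormalDist().inv_cdf(0.5 + conf / 2.0)
    a = 1.0 - zc * zc / (2.0 * (n - 1.0))
    b = kt * kt - zc * zc / n
    root = (kt * kt - a * b) ** 0.5
    return (kt - root) / a, (kt + root) / a

k_lo, k_up = confidence_coefficients(2.542, 53)  # 90% limits for T = 100
```

The coefficients are then substituted into equation 7.20 (applied to the logarithms when a log-based distribution is used) to obtain the limits themselves.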
TREATMENT OF ZEROS
Most hydrologic variables are bounded on the left by zero. A zero in a set of data that is being logarithmically transformed requires special handling. One solution is to add a small constant
to all of the observations. Another method is to analyze the non-zero values and then adjust the
relation to the full period of record. This method biases the results as the zero values are essentially ignored. A third and theoretically more sound method would be to use the theorem of total
probability (equation 2.10).
In this relationship, prob (X # 0) would be estimated by the fraction of non-zero values and
prob(X 1 xlX # 0) would be estimated by a standard analysis of the non-zero values with the
sample size taken to be equal to the number of non-zero values. This relation can be written as a
function of cumulative probability distributions.
    P_X(x) = (1 - k) + k P_X*(x)        (7.23)

or

    P_X*(x) = [P_X(x) - 1 + k]/k

where P_X(x) is the cumulative probability distribution of all X (i.e., prob(X ≤ x|X ≥ 0)), k is the
probability that X is not zero, and P_X*(x) is the cumulative probability distribution of the non-zero values of X (i.e., prob(X ≤ x|X ≠ 0)). This type of mixed distribution with a finite
probability that X = 0 and a continuous distribution of probability for X > 0 was discussed in
chapter 2. Jennings and Benson (1969) have demonstrated the applicability of this approach to
analyzing flood flow frequencies with zeros present.
Equation 7.23 can be used to estimate the magnitude of an event with return period T_X(x) by
solving first for P_X*(x) and then using the inverse transformation of P_X*(x) to get the value of X.
For example, the 10-year event with k = 0.95 is found to be the value of X satisfying

    P_X*(x) = [P_X(x) - 1 + k]/k = (0.90 - 1 + 0.95)/0.95 = 0.895
Note that it is possible to generate negative estimates for P_X*(x) from equation 7.23. For
example, if k = 0.25 and P_X(x) = 0.50, the estimated P_X*(x) is

    P_X*(x) = (0.50 - 1 + 0.25)/0.25 = -1.0

This merely means that the value of X corresponding to P_X(x) = 0.50 is zero. This makes sense
because P_X(x) = 0.50 corresponds to the 2-year flow, or the flow equaled or exceeded every
other year. If only 25%, or 1/4, of the annual flows are greater than zero, then the flow exceeded
every other year must be zero.
Example 7.1. Seventy-five years of peak flow data are available from an annual series; 20 of
the values are zero; and the remaining 55 values have a mean of 100 cfs, a standard deviation
of 35.1 cfs, and are lognormally distributed. (a) Estimate the probability of a peak exceeding
125 cfs. (b) Estimate the magnitude of the 25-year peak flow.
Solution:
(a) prob(X > 125) = 1 - prob(X ≤ 125) = 1 - P_X(125)
P_X*(125) can be evaluated by solving equation 7.12 for K_N and then using the table for the
normal distribution to get the desired probability.
From a table of the standard normal distribution, this K_N for a C_v of 0.351 corresponds to a
prob(X* ≤ x) of 0.795. Then, from equation 7.23,

    P_X(125) = 1 - 0.733 + 0.733(0.795) = 0.850
    prob(X > 125) = 1 - 0.850 = 0.150

The probability of a peak flow in any year exceeding 125 cfs is 0.15. The conditional probability
of a peak exceeding 125 cfs given that the peak is not zero is 1 - 0.795 = 0.205.
(b)

    P_X*(x) = [P_X(x) - 1 + k]/k = [1 - (1/T) - 1 + k]/k = (1 - 0.04 - 1 + 0.733)/0.733 = 0.945

The value of X corresponding to P_X*(x) = 0.945 can be obtained from equation 7.12. Z for
P_X*(x) = 0.945 is 1.60. Therefore, X_25 = exp(4.547 + 0.341 × 1.60) = 163 cfs.
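Example 7.1 can be reproduced numerically; this sketch assumes the standard moment relations for deriving the lognormal parameters from the given mean and coefficient of variation:

```python
import math
from statistics import NormalDist

k = 55 / 75                    # fraction of non-zero peaks
cv = 35.1 / 100                # coefficient of variation of non-zero flows
s_y = math.sqrt(math.log(1 + cv * cv))   # about 0.341
mu_y = math.log(100) - 0.5 * s_y * s_y   # about 4.547

logn = NormalDist(mu_y, s_y)   # distribution of ln(X) for non-zero flows

# (a) probability of a peak exceeding 125 cfs, via eq. 7.23
p_star = logn.cdf(math.log(125))          # about 0.795
p_exceed = 1.0 - ((1 - k) + k * p_star)   # about 0.15

# (b) 25-year peak: solve eq. 7.23 for Px*, then invert the lognormal
p_star_25 = (1 - 1 / 25 - 1 + k) / k      # about 0.945
x25 = math.exp(logn.inv_cdf(p_star_25))   # about 163 cfs
```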
Example 7.2. Table 7.6 contains annual peak flow data for Black Bear Creek near Pawnee,
Oklahoma, for the years 1945 through 1997.
(a) Plot the data on normal and lognormal probability paper.
(b) Plot the "best" fitting normal, lognormal, extreme value type I, and log Pearson type III
distributions on the plot of part a.
(c) Estimate the 100-year peak flow based on the four distributions of part b.
(d) Estimate the 90% confidence intervals on the log Pearson type III estimates.
(e) Estimate the 100-year peak flow using the log Pearson type III with a weighted skew
coefficient based on the station skew and the generalized skew coefficients.
Table 7.6. Annual peak flow data for Black Bear Creek near Pawnee, Oklahoma. [Columns of
year and flow (cfs) not reproduced.]
Solution:
(a) The plotting positions are calculated by ranking the data from largest to smallest and
then using the relationship pp = m/(n + 1), where m is the rank and n is the number of
observations (53). Since the largest observation is 30,200 cfs, it is assigned a pp of 1/54,
or 0.0185. The second largest value is 19,200 cfs with a pp of 2/54, or 0.0370, and so
forth until the smallest value of 1,560 cfs with a pp of 53/54, or 0.9815. The data are
plotted in figure 7.3.
(b) The best fitting lines for the various distributions can be obtained by calculating several
points from equation 7.9. The basic statistics of the data are found to be

                 Data      ln of data
    Mean         6683      8.568
    Std dev      5337      0.681
    Skewness     2.262     0.296
The next step is to determine the appropriate frequency factors for various return periods for
the four distributions. The frequency factor for the normal and lognormal distributions comes
from the standard normal distribution. K_T for the extreme value and log Pearson distributions
comes from equations 7.15 and 7.14, respectively. Sample calculations follow for a return period
of 20 years.
[Table of computed flows and probabilities for the normal (N), lognormal (LN), log Pearson
type III (LP3), and extreme value type I (EV1) distributions at standard normal z values from
-3.000 to 3.000 not reproduced.]
Fig. 7.3a. Flood frequency curves for Black Bear Creek using the standard normal z and arithmetic flow scales.
Fig. 7.3b. Flood frequency curves for Black Bear Creek using the standard normal z and logarithmic flow scales.
Fig. 7.3c. Flood frequency curves for Black Bear Creek using normal probability paper.
Fig. 7.3d. Flood frequency curves for Black Bear Creek using lognormal probability paper.
Normal distribution:

    X_T = X̄(1 + C_v K_T) = 6683(1 + (5337/6683)(1.645)) = 15,462 cfs

Lognormal distribution:

    X_T = exp(Ȳ(1 + C_vY K_T)) = exp(8.568(1 + (0.681/8.568)(1.645))) = 16,132 cfs

Log Pearson type III distribution:

    X_T = exp(Ȳ(1 + C_vY K_T)) = exp(8.568(1 + (0.681/8.568)(1.725))) = 17,029 cfs
Figures 7.3a-d show the resulting plot of the data and the best fitting distributions. The four
plots all contain the same information but show different formats. The first two plots use the z
transformation and the second two plots use normal probability scales. Both arithmetic and logarithmic scales are shown for flow. Note that the normal distribution plots as a straight line when
the arithmetic scale is used and the lognormal distribution plots as a straight line when the logarithmic scale is used.
(c) The 100-year flow estimates are contained in the last line of the above table.
(d) The calculations of the confidence intervals are contained in the following table:
[Table of confidence interval calculations (columns 1 through 10) not reproduced.]
Fig. 7.4. Flood frequency curves for Black Bear Creek with confidence intervals.
Explanation of columns in above table: (1) Return period; (2) From equation 7.14; (3)
From equation 7.22b; (4) and (5) From equation 7.21; (6) and (7) From equation 7.20; (8)
Exp(col(6)); (9) Exp(col(7)); (10) Last column of previous table.
The results are plotted in figure 7.4.
(e) The station skew, G, is 0.296. The generalized skew from figure 7.2 is -0.22. From equations 7.16 to 7.18 we get a weighted skew coefficient with a corresponding frequency factor
K_T = 2.3082, so that

    Q_100 = exp(Ȳ(1 + C_vY K_T)) = exp(8.568(1 + (0.681/8.568)(2.3082))) = 25,333 cfs
Example 7.3. Estimate the 100-year flow for Black Bear Creek using the GEV distribution.
Solution:
From example problem 7.2, X̄ = 6683, s_x = 5337, and C_s = 0.296.
From equations 3.71,

    ξ = X̄ + α[Γ(1 + κ) - 1]/κ = 4108

    X_T = ξ + (α/κ)(1 - [-ln(P_X(x_T))]^κ) = 29,803 cfs
Fig. 7.5. Estimated 100-year flow as a function of truncation level for Black Bear Creek.
Treatment of Zeros section. Figure 7.5 shows the estimated 100-year peak flow for Black Bear
Creek using data from example problem 7.2, the log normal distribution, and various truncation
levels. For example, if a truncation level of 3000 cfs is selected, the 11 values less than 3000 cfs
would be truncated and the k in equation 7.23 would be (53 - 11)/53, or 0.79.
USE OF PALEOHYDROLOGIC DATA
Baker (1987) has defined paleohydrology as the study of past or ancient flood events that
occurred prior to the time of human observation or direct measurement. Paleohydrologic
techniques provide means of obtaining data over periods of time much longer than are available
from systematic records or even historical data. Paleohydrologic data may enable the evaluation
of long-term hydrologic conditions by complementing existing short-term systematic and historical records, providing information at ungaged locations, and helping reduce uncertainty in flow
estimates. Paleohydrology is discussed by Baker (1987), Kochel and Baker (1982), Costa (1987),
Jarrett (1991), and Stedinger and Baker (1987).
Once the magnitude and year of occurrence of a paleoflood are determined, that flow value can
be assigned a return period. For example, if it is determined that 3000 years ago there was a flood
in excess of any flow since that time, the flow could be assigned a return period of 3000 years.
Questions of the stationarity of flood flows, the dating of paleofloods, and the difficulty of estimating the magnitude of paleofloods must be addressed in any paleoflood study.
PROBABLE MAXIMUM FLOOD
The probable maximum flood (PMF) is the flow that can reasonably be expected from the most
severe combination of meteorologic and hydrologic conditions for the drainage basin in question.
A PMF does not directly enter into a flood frequency analysis since the probability of such a
flood is unknown. The PMF may, however, provide an upper bound to a frequency analysis. The
concept of a PMF has been criticized (Yevjevich 1968) as being neither probable nor maximum,
yet it has found wide use for hydrologic designs for facilities whose failure would endanger
human life or cause great economic loss.
3. Are there ponds and reservoirs that may discharge at high rates during rare floods and not during smaller flows? What is the possibility of a dam breach and what would be the resulting
flow?
4. Are the channel flow and storage characteristics the same for extreme flows as they are for
smaller flows?
5. Are land use and soil characteristics such that flows from rare storms may relate to precipitation in a manner different from more common storms?
6. Are there seasonal effects such that rare floods are more likely to occur in a different season
than the more common floods?
7. Is the rare flood represented in the sample of data? If so, how is it treated? Is it assigned a return period of 15 years when in fact its return period may be much greater than that?
8. Are changes going on within the basin that may cause change in the hydrologic response of
the basin to rainstorms?
9. Are there climatic changes occurring that may influence flood flow frequencies?
These last few paragraphs paint a discouraging picture for flood frequency analysis. That
need not be the case as long as one does not discard hydrologic knowledge in the process.
Often, the questions posed can be answered in such a way as to make the statistical analysis
valid. At other times, when problems with the statistical procedures are recognized, adjustments can be made in the resulting flow estimates to more accurately reflect the hydrology of
the situation.
Hydrologic frequency analysis should be used as an aid in estimating rare floods. Sometimes the estimates made on the basis of the statistical frequency analysis can be taken as the
final estimate. Sometimes the statistical estimate may need to be adjusted to better reflect the
hydrology of the situation.
It should be kept in mind that other hydrologic estimation techniques suffer from some of
the same difficulties as do the statistical techniques. For example, if a hydrologic model is being
employed, the parameters of the model must be estimated in some way. This is generally done on
the basis of observed data from the basin in question, observed data from a similar basin, or
so-called physical relationships such as Manning's equation, infiltration parameters, and so
forth, together with a set of accompanying tables. Regardless of how the parameters are estimated, the
same type of questions regarding these estimates and the nature of the hydrologic model itself
must be answered as outlined above for frequency analysis estimates. We cannot substitute mathematical and empirical relationships for hydrologic knowledge any more than we can substitute
statistics for hydrologic knowledge.
Based on this discussion, one might conclude that the magnitude of rare events should not
be estimated because the estimates may be so uncertain. Generally, however, this is not one of the
options available. An estimate must be made. Hydrology must not be ignored in making this
estimate. Statistical, modeling, or empirical flow estimates should be made and then adjusted, if
required, to reflect the hydrologic situation. This is not to say a factor of safety is to be applied.
Adjustments should be based on hydrology, not rules of thumb.
De Coursey (1973) applied discriminant analysis, a multivariate procedure, to flood data from
Oklahoma to form groups of basins having a similar flood response. Burn (1988, 1989, 1990)
described techniques for identifying homogeneous regions based on the correlation structure of
the observed data, cluster analysis, and the Region of Influence (ROI) approach, respectively.
The importance of identifying hydrologically homogeneous regions was further demonstrated by
Lettenmaier et al. (1987) in a study that showed the effect on extreme flow estimation of regions
containing heterogeneity.
Of the many approaches that have been used to identify homogeneous regions, cluster
analysis, a multivariate technique, has gained prominence in this field. This is primarily because, although cluster analysis does not entirely eliminate the subjective decisions
associated with the other methods, it greatly facilitates interpretation of a data set. The objective
of cluster analysis is to group gaging stations that have similar hydrologic or basin characteristics. The most common similarity measure in cluster analysis is the Euclidean distance.
Historical Development
Weldu (1995) reviewed several articles on regional flow estimation. The earliest approach
to the regionalization problem was to use empirical equations relating flood flow to drainage area
within a particular region (Benson 1962c). The formulas were based on few data for a particular
region and contain one or more constants whose values are empirically determined. Such a formula, in generalized form, is

    Q = C A^n

where Q is the flow, C is a coefficient related to the region, A is the drainage area, and n is an
empirically determined exponent. The above equation, although simple to derive and apply, does
not address the frequency of the flow or account for the effect of variations in precipitation or
topography on the flows. The various "culvert formulas" used by railroad and highway engineers, such as the Talbot formula
(AISI 1967), are of this general type. The Talbot formula is widely used and takes the form

    a = C A^(3/4)

where a is the required waterway area.
One major weakness in this type of empirical formula is that the coefficients will remain
constant only within regions in which other hydrologic factors vary little, which implies that the
regions must of necessity be fairly small.
Statistical Methods
Other methods of regionalization include the application of statistical techniques to hydrologic
data. Statistics provides a means of reducing a mass of data to a few useful and meaningful figures.
The distribution of the data could be represented by a probability density function or a curve that
defines the frequency of values of the variable. Statistical procedures may also provide methods of
relating dependent variables to one or more independent variables through regression analysis.
Most applications of statistical techniques require a considerable amount of data. The value
of the analysis is directly related to the quantity and quality of the data that are available. Often,
hydrologic estimates are required at locations where there is little or no data. The design of a
bridge opening or culvert, for example, on one of the many streams for which there is no data
may be required. Regionalization is an attempt to use data from locations in the same region as
the point of interest to make hydrologic estimates at the point of interest.
Regional flood frequency models have been used extensively in hydrology for transferring data from gaged to ungaged sites. Two such regionalization procedures, namely the index-flood and regression-based methods, have evolved over the years and have been used extensively in regional flood frequency analysis.
This treatment will focus on flood frequency analysis. The goal is to estimate flood flows of
various return periods for streams and locations where there is little or no data.
Frequency Distributions
After a homogeneous region has been identified, the next stage in the specification of the
regional flood frequency model is the choice of appropriate frequency distribution(s) to represent
the observed data. The distributions most commonly used in hydrology are the normal, lognormal, Gumbel extreme value (type I), and log Pearson type III distributions. The U.S. Water Resources Council (1982) conducted studies involving comparisons among different probability distribution functions, and its recommendation was to use the log Pearson type III as the basic distribution for defining the annual flood series. The Council also recommended that this distribution be fitted to sample data using the method of moments. In a more detailed study, the U.K. Natural Environment Research Council (1975) found that 3-parameter distributions such as the log Pearson type III and the generalized extreme value (GEV) distribution fit data from 35 annual flood series better than the 2-parameter distribution functions.
The log Pearson type III (LP III) distribution has been used extensively in flood frequency analysis since its favorable recommendation by the Water Resources Council in 1976. The frequent use of the LP III attracted a number of detailed mathematical and statistical studies regarding its role in flood frequency analysis. Various alternative fitting techniques for the LP III distribution have been suggested by Matalas and Wallis (1973) and Condie (1977). These researchers carried out comparisons between the method of moments and the method of maximum likelihood and concluded that the latter method yielded solutions that are less biased than the method of moments estimates. Bobee (1975) and Bobee and Robitaille (1977) suggested using moments of the original data instead of moments of the logarithmic values. Nozdryn-Plotnicki and Watt (1979) studied the method of moments, the method of maximum likelihood, and the procedure proposed by Bobee (1975), found that none of the methods was clearly superior to the others, and concluded that the method of moments was preferable because of its computational ease.
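To make the method-of-moments procedure concrete, the sketch below fits an LP III to a set of invented annual peaks by computing the mean, standard deviation, and skew of the log-transformed flows; scipy's pearson3 distribution supplies the Pearson type III quantile. All flow values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical annual peak flows (cfs); illustrative only.
peaks = np.array([12000, 9500, 22000, 15500, 8700, 31000, 18200,
                  11400, 26500, 14100, 9900, 20800, 16700, 13200])

logs = np.log10(peaks)
mean, sd = logs.mean(), logs.std(ddof=1)
skew = stats.skew(logs, bias=False)   # sample (adjusted) skew of the logs

# Method-of-moments LP III: fit a Pearson type III to the logarithms
# and transform back.  The 100-year flood is the 0.99 quantile.
T = 100
y100 = stats.pearson3.ppf(1 - 1 / T, skew, loc=mean, scale=sd)
q100 = 10 ** y100
print(round(q100))
```

Maximum likelihood fitting would instead maximize the LP III log-likelihood numerically; the moments approach shown here is the computationally simpler route the studies above refer to.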
An important step in a regional flood frequency analysis is to ensure that the data that are
being used are of good quality. The data must be representative of the region and they must be
representative of the long-term flood characteristics of the region. Data on the physical characteristics of the catchments and any other data that are used must be of good quality. There are no
regional flood frequency techniques that can overcome faulty data.
After collecting and screening the data, the first step is to fit various pdfs to the observed
peak flow data at locations where sufficient data exist. Once all of the available data are fit to the
candidate distributions, assumptions and statistical tests must be made in an effort to select the
distribution that best describes each data set. This selection of pdfs is based on probability plots
of observed data along with the fitted distributions. Statistical tests such as the chi-square test and
the Kolmogorov-Smirnov test discussed in chapter 8 may be made. Personal judgment based on
the probability plots is also used.
Once the best fitting pdf is selected for each data set, that pdf can be used to estimate the
peak flow for various return periods. The pdf which best fits the data for the majority of the stations or locations included in the study is generally used for all locations.
Several options are now available for the next phase of the analysis:
1. Develop a relationship between the peak flows of various return periods and measurable characteristics of the catchments producing the flows (QT = f(X)).
2. Develop a relationship between parameters of the pdf that best fits a majority of the flow data and measurable characteristics of the catchments producing the flows (θ = f(X)).
3. Develop a dimensionless flood frequency curve for the region plus a relationship between some index flood for each catchment and measurable characteristics of the catchments (QT/QI vs. T and QI = f(X)).
Regression-Based Procedures
All three of the options mentioned above require relationships with measurable characteristics from the catchments for which flow data are available. Characteristics that might be included
in the analysis include precipitation variables, such as mean annual rainfall and 24-hour rainfalls
for various return periods. Physical characteristics such as catchment area, land slopes, stream
lengths, stream slopes, and land use might be included. Soils information such as permeability
and water holding capacities can be used. There are also a large number of geomorphic parameters such as drainage density, catchment shape factors, and measures of elevation changes that
might be included.
The result of this data collection effort will be a matrix of data having n observations on m catchments. Therefore X is an m × n matrix. The n observations on each catchment come about from making a single measurement or observation on each of the n characteristics included in the analysis. The m represents the number of catchments in the study. Thus, a study that involved 30 catchments and 12 characteristics on each catchment would produce a data matrix having 30 rows (one for each catchment) and 12 columns (one for each characteristic).
A regional flood frequency approach, in addition to the m × n data matrix of independent variables, will include an m × p data matrix of dependent variables, which are the peak flow estimates for the various return periods. With return periods of 2, 5, 10, 25, 50, and 100 years, p will be 6. With 30 catchments, a 30 × 6 matrix of dependent variables, where the rows are the catchments and the columns correspond to the various return periods, will result.
Multiple regression techniques can now be used to relate the dependent variables to the independent variables based on the 30 observations on hand. Regression based on the regional data
and based on logarithms of the regional data can be investigated. Through the estimation process
based on multiple regression, the independent variables that are not useful in predicting the
dependent variables can be eliminated. The goal is to find if the peak flow for the various return
periods can be estimated based on a small subset of the original n catchment characteristics.
Although not always possible, it is desirable to use the same subset of independent variables for
predicting each of the p dependent variables. This will help to ensure a consistent set of predictions for various return periods on a particular catchment.
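A minimal sketch of this regression step, using invented data for eight catchments and only two characteristics (drainage area and channel slope). The flows and characteristics are regressed in logarithmic form so that the fitted model is multiplicative, which is the usual practice for peak-flow regressions:

```python
import numpy as np

# Hypothetical regional data: for each of 8 gaged catchments, drainage
# area A (mi^2), main-channel slope S (ft/mi), and the 100-year peak
# flow Q100 (cfs) estimated from the at-site frequency analyses.
A = np.array([ 50, 120, 300,  75, 510, 210,  95, 400], float)
S = np.array([ 20,  15,   8,  25,   5,  10,  18,   6], float)
Q100 = np.array([3200, 6100, 11500, 4500, 16000, 8900, 5200, 13800], float)

# Log-transform so Q = b0 * A^b1 * S^b2 becomes linear:
# log Q = log b0 + b1 log A + b2 log S
X = np.column_stack([np.ones(len(A)), np.log10(A), np.log10(S)])
y = np.log10(Q100)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0 = 10 ** coef[0]

# Predict Q100 for an ungaged catchment with A = 150 mi^2, S = 12 ft/mi.
q_pred = b0 * 150 ** coef[1] * 12 ** coef[2]
print(round(q_pred))
```

In a real study each of the p dependent variables (one per return period) would get its own equation, ideally sharing the same subset of predictors as discussed above.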
Using multiple regression to estimate the magnitude of a flood event that will occur on
average once in T years, denoted by QT, by using physical and climatic characteristics of the
watershed has a long history (Benson 1962c, 1964; Benson and Matalas 1967; Thomas and
Benson 1970). Sauer (1974) developed regional equations relating flood frequency data for
unregulated streams in Oklahoma to basin characteristics through multiple linear regression
techniques. Similar studies have been done throughout the United States (Jennings et al. 1993).
The Hydrology Committee of the U.S. Water Resources Council (1981) investigated numerous methods of estimating peak flows from ungaged watersheds and found that the results obtained using regional regression compared favorably with those from more complex watershed models.
A logarithmic transformation of the QT, physiographic, and climatic data may be required to linearize the regression model and to satisfy other assumptions of regression analysis. The relationship most commonly used is of the form

QT = b0 X1^b1 X2^b2 ... Xn^bn

where X1, X2, ..., Xn represent the basin and climatic data, and b0, b1, b2, ..., bn are the regression parameters. Regression parameters may be estimated using ordinary least squares (OLS), weighted least squares (WLS), or generalized least squares (GLS). OLS does not account for unequal variances in flood characteristics or any correlations that may exist between streamflows from nearby stations. To overcome these deficiencies in the OLS method, Tasker (1980) proposed the use of WLS regression with the variance of the errors of the observed flow characteristics estimated as an inverse function of the record length. Using a weighting function of
where N is the number of stations, t0 and t1 are constants, and ni is the record length of station i,
Tasker (1980) reported that the WLS produced a smaller expected standard error of predictions
than the OLS. Using Monte Carlo simulation, Stedinger and Tasker (1985) demonstrated that
the WLS and GLS provide more accurate estimates of regression parameters than the OLS. A
major drawback of the WLS and GLS is the need to estimate the covariance matrix of the residual errors. The covariance matrix of the residual errors is a function of the precision with which
the model can predict the streamflow values.
Estimating a peak flow for some return period on an ungaged catchment now becomes an
exercise in applying the appropriate regression equation to the ungaged catchment. The required
catchment characteristics are used in the appropriate prediction equations to estimate the peak
flow.
Regional frequency analysis using the second option is very similar to the first option except the dependent variables in the regression analysis are the parameters, or some function of the parameters, of the pdf selected to represent the flood peak flows. If a lognormal distribution is used, there will be 2 dependent variables, the mean and standard deviation of the logarithms of the flows. If the log Pearson type III is used, there will be 3 dependent variables.
[Figure 7.6. Dimensionless regional flood frequency curve: ratio of the T-year flood to the index flood versus return period, median of 18 stations.]
Again, it is desirable to use the same set of independent variables to predict all of the parameters of the selected pdf. This is because the parameters will most likely be correlated. Using
the same set of independent variables helps ensure that one maintains a consistent relationship
among the parameters of the pdf.
Estimating peak flows for an ungaged catchment consists of using the derived prediction
equations to estimate the parameters of the flow frequency pdf. These parameters are then used
in the pdf to estimate flow magnitude with the desired return periods.
Index-Flood Method
Another widely used statistical procedure in regional flood frequency analysis is the index-flood method. This method, first described by Dalrymple (1960), involves the derivation and use of a dimensionless flood frequency distribution applicable to all basins within a homogeneous region.
Regional-Index Flood Relationship
The next step in the index-flood method is to define the index flood. The ratios of peak flows
of various return periods to this index flood are then computed. The ratios are of the form QT/QI
where QT is the flood with return period T, and QI is the index flood. The index flood is often
taken as the mean annual flood or the 2-year flood.
A plot is made of QT/QI versus T containing data for all of the watersheds. A line is drawn
through the median of the data in this plot. The resulting line is the regional flood frequency line.
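The normalization and median steps above can be sketched as follows, with invented QT estimates for three watersheds and the 2-year flood serving as the index flood QI:

```python
import statistics

# Hypothetical Q_T estimates (cfs) for three gaged watersheds at several
# return periods; the 2-year flood (first entry) is the index flood Q_I.
T = [2, 5, 10, 25, 50, 100]
qt = {
    "site1": [1000, 1500, 1900, 2400, 2800, 3200],
    "site2": [2500, 3600, 4500, 5700, 6600, 7500],
    "site3": [ 600,  950, 1250, 1650, 1950, 2300],
}

# Normalize each site by its index flood ...
ratios = {s: [q / flows[0] for q in flows] for s, flows in qt.items()}

# ... then take the median ratio across sites at each return period to
# define the dimensionless regional flood frequency curve.
regional = [statistics.median(r[i] for r in ratios.values())
            for i in range(len(T))]
print([round(x, 2) for x in regional])  # → [1.0, 1.5, 1.9, 2.4, 2.8, 3.2]
```

Multiplying this curve by an estimate of QI for an ungaged site, obtained from a regression on basin characteristics, rescales it to that site's flood frequency curve.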
In the past, the index-flood method was widely used to perform regional frequency analysis (Dalrymple 1960; Benson 1962). The basic premise of the method is that a combination of
streamflow records maintained at a number of gaging stations will produce a more reliable record
than that of a single station and thus will increase the reliability of frequency analysis within a
region. The index flood method consists of two major steps. The first involves the development
of dimensionless ratios by dividing the floods at various frequencies by an index flood, such as
the mean annual flood for each gaging station (Stedinger 1983; Lettenmaier and Potter 1985;
Lettenmaier et al. 1987). The averages or medians of the ratios are then determined for each
return period to estimate a dimensionless regional frequency curve. The second step consists of
the development of a relationship between the index-flood and physiographic and climatic characteristics of the basin. Flood magnitudes and frequencies at required locations within the region
can then be estimated by rescaling the corresponding dimensionless quantile by the index flood.
The index-flood method, once the standard U.S. Geological Survey (USGS) approach, is based
on the assumption that the floods at every station in the region arise from the same or similar
distributions (Chowdhury et al. 1991). At some stage this procedure fell out of favor, primarily
due to the fact that the coefficient of variation of the flows, which is assumed to be constant in an
index-flood method, was found to be inversely related to the watershed area (Stedinger 1983).
This implies that the standard deviations of the normalized data do not remain constant for various values of basin areas, because the coefficient of variation of the observed data is equal to the
standard deviation of the normalized flows. This can be demonstrated as follows. Let Yi be the normalized flows given by

Yi = xi / x̄

where xi represents the ordered observed flows (with x1 being the largest observation and xn the smallest) and x̄ is the mean observed flow. The coefficient of variation of the observed data, CVx, is then given by

CVx = sx / x̄ = [Σ (Yi − 1)² / (n − 1)]^(1/2)

The right-hand side of this equation is nothing but the standard deviation of the normalized flows, since the Yi have mean 1.
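The identity is easy to verify numerically; any hypothetical flow sample will do:

```python
import statistics

# Invented flow sample (cfs) to check that CV_x equals sd(Y).
x = [420.0, 310.0, 560.0, 275.0, 480.0, 390.0]
xbar = statistics.mean(x)

cv_x = statistics.stdev(x) / xbar        # coefficient of variation of x
y = [xi / xbar for xi in x]              # normalized flows Y_i = x_i / x-bar
sd_y = statistics.stdev(y)               # standard deviation of the Y_i

print(round(cv_x, 6) == round(sd_y, 6))  # → True
```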
The index-flood method became popular once again in the late 1970s and early 1980s with the introduction of probability weighted moments (PWM), a generalization of the usual moments of a probability distribution (Greenwood et al. 1979). Greis and Wood (1983) reported that improved regional estimates of flood quantiles were obtained by applying the PWM approach rather than conventional methods such as the method of moments and maximum likelihood estimation.
Parameter estimation by PWM requires the calculation of moments Mi,j,k defined as

Mi,j,k = E[X^i F^j (1 − F)^k] = ∫₀¹ [x(F)]^i F^j (1 − F)^k dF

where i, j, and k are real numbers and X is a random variable with distribution function F(x), where F(x) = Prob(X ≤ x). M1,0,0 is identical to the conventional moment about the origin, and the probability weighted moments corresponding to M1,0,k, or Mk, are denoted as

Mk = M1,0,k = ∫₀¹ x(F) (1 − F)^k dF
All higher-order PWMs are linear combinations of the ranked observations x(1) ≤ ... ≤ x(n), which is an indication that PWM estimators are subject to less bias than ordinary moments. Ordinary moment estimators such as the variance (s²) and the coefficient of skewness (Cs) involve squaring and cubing of observations, respectively, with a potential to give greater weight to outliers, resulting in substantial bias and variance. However, one major weakness of the PWM approach is that it cannot be used to estimate parameters for those distributions that cannot be expressed in inverse form, such as the LP III.
Regionalization Using L-Moments and the GEV Distribution
Hosking et al. (1985) and Stedinger et al. (1994) discuss regional flood frequency analysis
using L-moments and the generalized extreme value distribution. The following is adapted from
their work.
Consider K sites with flood records Xi(k) for i = 1, 2, ..., nk and k = 1, 2, ..., K. Normalize the Xi(k) by dividing the observations at a site by the mean of the observations at that site.
1. At each site compute the three L-moments λ̂1(k), λ̂2(k), and λ̂3(k) of the normalized observations using the probability weighted moments (PWM) estimators. The L-moments are linear combinations of the ranked observations

λ̂1 = b0,  λ̂2 = 2b1 − b0,  λ̂3 = 6b2 − 6b1 + b0

where br = (1/n) Σ from j = r+1 to n of [(j − 1)(j − 2)...(j − r) / ((n − 1)(n − 2)...(n − r))] x(j), and x(j) is the jth order statistic of the normalized observations with x(1) the smallest and x(n) the largest.
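Step 1 can be sketched directly from the unbiased PWM estimators; the flow values below are invented, and the b_r formulas are the standard unbiased sample PWMs (Hosking's notation), stated here as an assumption since the book's own expressions are not reproduced at this point:

```python
def sample_l_moments(data):
    """First three sample L-moments from the unbiased PWM estimators
    b_r = (1/n) * sum_{j=r+1..n} [(j-1)...(j-r)/((n-1)...(n-r))] x_(j),
    with x_(1) <= ... <= x_(n) the ordered sample."""
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum((j - 1) / (n - 1) * x[j - 1] for j in range(2, n + 1)) / n
    b2 = sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x[j - 1]
             for j in range(3, n + 1)) / n
    l1 = b0                       # L-mean
    l2 = 2 * b1 - b0              # L-scale
    l3 = 6 * b2 - 6 * b1 + b0     # third L-moment
    return l1, l2, l3

# Hypothetical normalized annual peaks at one site (mean already 1).
flows = [0.55, 0.72, 0.81, 0.94, 1.02, 1.10, 1.23, 1.63]
l1, l2, l3 = sample_l_moments(flows)
print(round(l1, 3), round(l2, 4), round(l3, 4))
```

Because the observations enter only linearly, no squaring or cubing amplifies the largest peak, which is the bias advantage noted above.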
2. To get a normalized frequency distribution, compute the weighted average of the normalized L-moments of order r = 2 and r = 3:

λ̄r = [Σ from k = 1 to K of wk λ̂r(k)/λ̂1(k)] / [Σ from k = 1 to K of wk]    for r = 2, 3
where wk is the weight given to site k. For sites without flow records on which to estimate the at-site mean λ̂1, a regional regression could be used to develop an equation of the form
Table 7.7. Empirical factors for converting partial duration series to annual series (Hershfield 1961)

Return period (years)    2       5       10
Conversion factor        0.88    0.96    0.99
and 24-hour rainfall depths converted to a partial duration series by using the factors shown in
table 7.7. For example, if the 5-year partial duration series value estimated from the maps is 2.00
inches, the corresponding annual series depth would be 0.96(2.00) or 1.92 inches. For return
periods greater than 10 years, the conversion factor is essentially unity.
The 2-year rainfall amounts were determined by plotting on log-log paper the return period
versus the rainfall depth using the California plotting position formula (Table 7.1), drawing a
smooth curve through the points, and reading the 2-year value.
The 100-year rainfall amounts were determined by using the type I extreme value distribution for selected stations with long rainfall records. The ratio of the 100-year to the 2-year rainfall amount was then determined for these stations and a map prepared showing the value of this ratio. The 100-year rainfall amounts for the stations with short records were estimated by applying this 100-year to 2-year ratio to their 2-year amounts.
The rainfall depths for other return periods were determined by plotting the 2-year and
100-year depths on special paper, connecting the points by a straight line, and reading off the
desired rainfall depths. The spacing of the return periods along the abscissa of this special paper
was empirical from 1 to 10 years based on free-hand plotting of partial duration series data and
theoretical according to the type I extreme value distribution from 20 to 100 years. The transition
between 10 and 20 years is smoothed by hand from the type I values.
The rainfall depths for durations other than 1 hour or 24 hours were obtained by plotting the
1-hour and 24-hour values on a second special paper and connecting the points with a straight
line. This diagram was obtained empirically from an analysis of records from 200 first-order U.S.
Weather Bureau stations. The depth of rainfall for the 30-minute duration is obtained by multiplying the 1-hour value by 0.79.
From these analyses, curves called depth (or intensity)-duration-frequency curves can be
prepared. Data from the maps in TP 40 can be used to determine depth-duration-frequency
(DDF) relationships for locations where actual data does not exist. Often, in developing DDF
curves, the interpolation from the maps of TP40 may result in rather rough plots. The curves can
be smoothed by using an empirical smoothing equation. One such equation is
D = KTF^x / (T + b)^n
where D is the depth, T is the duration, and F is the frequency of the rainfall. The coefficients K,
x, b, and n may be estimated using nonlinear regression techniques. Figure 7.7 shows the results
of such an analysis for Stillwater, Oklahoma, based on TP40 data.
[Figure 7.7. Depth-duration-frequency curves for Stillwater, Oklahoma, based on TP 40 data; duration (hrs) on the abscissa.]
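A sketch of the nonlinear estimation step, assuming scipy's curve_fit as the regression tool. The depth values here are generated from known coefficients purely to show that the fit recovers them; they are not taken from TP 40:

```python
import numpy as np
from scipy.optimize import curve_fit

def ddf(TF, K, x, b, n):
    """Smoothing equation D = K*T*F^x / (T + b)^n, with T the duration
    in hours and F the frequency (return period) in years."""
    T, F = TF
    return K * T * F**x / (T + b)**n

# Hypothetical depth-duration-frequency values (inches), generated from
# known coefficients so the recovered parameters can be checked.
T = np.array([0.5, 1, 2, 3, 6, 12, 24, 0.5, 1, 2, 3, 6, 12, 24], float)
F = np.array([2] * 7 + [100] * 7, float)
true = (2.2, 0.18, 0.35, 0.75)
D = ddf((T, F), *true)

# Nonlinear least squares estimate of K, x, b, n from the (T, F, D) data.
popt, _ = curve_fit(ddf, (T, F), D, p0=(1.0, 0.1, 0.5, 0.8))
print(np.round(popt, 3))
```

With map-interpolated depths the fit would of course not be exact; the fitted curve then serves as the smoothed DDF relationship.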
where k is the probability of rain, or the proportion of time intervals with rainfall, and P*(x) is the cumulative probability distribution of rain given that R > 0. Often the gamma distribution is used for rainfall data. The parameters of the gamma distribution generally are determined by using equations 6.18 and 6.19. Bridges and Haan (1972) have presented a technique for determining
the reliability of rainfall estimates from the gamma distribution based on simulation studies.
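A sketch of how the mixed distribution is applied, with hypothetical values for k and for the gamma parameters (in practice these would come from equations 6.18 and 6.19 fitted to wet-day data):

```python
from scipy import stats

# Mixed distribution for daily rainfall: with probability (1 - k) a day
# is dry; wet-day amounts follow a gamma distribution.  Values invented.
k = 0.30                      # proportion of days with rain
shape, scale = 0.8, 0.45      # assumed gamma parameters (depths in inches)

def prob_exceed(x):
    """P(daily rainfall > x) = k * (1 - P*(x)) for x > 0, where P* is
    the gamma cdf of wet-day amounts."""
    return k * stats.gamma.sf(x, shape, scale=scale)

p = prob_exceed(1.0)          # chance an arbitrary day exceeds 1 inch
print(round(p, 4))
```

The factor k simply deflates the wet-day exceedance probability by the chance that the day is wet at all.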
FREQUENCY ANALYSIS OF OTHER HYDROLOGIC VARIABLES
The principles set forth on flood frequencies and rainfall frequencies also apply to frequencies of other hydrologic variables. Basically, the quantity to be analyzed must be defined, the data
tabulated, and then a frequency analysis made. For instance, in the case of flow volume-frequency
studies, the duration(s) of interest must be specified and then the maximum or minimum flow
volumes for each year having the specified duration are tabulated. The maximum flow volumes
would be used in the case of flood-flow volumes and the minimum volumes would be used in the
case of low-flow studies.
Frequency analysis can be applied to water quality parameters such as dissolved oxygen,
biological oxygen demand, sediment loads, and many other quantities. Care must be taken to see
that the data used meet the necessary requirements of homogeneity, independence, and representativeness. For example, if sediment concentration frequencies are being studied and part of the
data are collected during low flows and part during high flows, the data may not be homogeneous
because of the relationship between sediment concentration and flow rate.
Exercises
7.1. Assume that daily rainfall on rainy days follows an exponential distribution. The average daily rainfall on rainy days is 0.3 inches. If 30% of all days are rainy, what is the probability that on some future day, the amount of rainfall received will exceed 1.00 inch? Assume daily rainfalls
are independent.
7.2. Derive a table of frequency factors for the exponential distribution corresponding to T = 2, 5, 10, 20, 50, and 100 years.
7.3. Select several streams in a single locality and prepare a plot of the ratio of the T-year flood
to the mean annual flood (as in figure 7.6).
7.4. An analysis of 50 years of data showed that the probability of a flood peak exceeding 90,000 cfs on a certain river was 0.02. During a 10-year period 2 such peaks occurred. If the original
estimate of the probability of this exceedance was correct, what is the probability of getting 2
such exceedances in 10 years?
7.5. Forty years of peak streamflow data are available. All but one of the data points indicate that a lognormal distribution with x̄ = 125,000 cfs and sx = 50,000 cfs describes the data very nicely. The one outlier is equal to 285,000 cfs. What is the probability that an event of 285,000 cfs or greater could occur in the 40-year period if the flood peaks truly follow the lognormal distribution with x̄ and sx as given?
7.6. Select a set of data consisting of 20 or more independent observations. Plot these data on normal probability paper using several of the plotting position relationships contained in table 7.1.
7.7. Compute the 100-year peak flow for the annual series data of example 7.2 assuming the data
follow the gamma distribution.
7.8. Prepare a plot on log-log paper of low flow frequency-volume-duration for Cave Creek near
Fort Spring, Kentucky. Plot volume in inches as the ordinate, duration in months (use 1, 2, 3, 6,
and 12 months) as the abscissa and use as curve parameters frequency (use 2, 5, 10, and 25
years).
7.9. Work exercise 7.8 for maximum flow frequency-volume-duration on Cave Creek.
7.10. Plot the annual runoff data for Walnut Gulch near Tombstone, Arizona, on normal and
lognormal probability paper. Does either of these distributions appear to "fit" the data?
7.11. Plot on normal probability paper the annual runoff data for (a) Piscataquis River near Dover-Foxcroft, Maine, (b) North Llano River near Junction, Texas, and (c) Spray River, Banff,
Canada. Is there any apparent relationship between the curvature (or lack of it) and the skewness?
7.12. Work exercise 7.11, only plot the data on lognormal probability paper.
7.13. For the Piscataquis River near Dover-Foxcroft, Maine, estimate the 100-year annual flow
assuming the data follow the (a) normal distribution, (b) lognormal distribution, (c) Pearson type
III distribution, (d) log Pearson type III distribution, (e) extreme value distribution.
7.14. Work exercise 7.13 for the 100-year annual flow on the North Llano River near Junction,
Texas.
7.15. Work exercise 7.13 for the 100-year annual flow on the Spray River, Banff, Canada.
7.16. In reference to exercises 7.13, 7.14, and 7.15, which distribution would you expect to give
the "best" estimate for the 100-year flow on each of the three rivers? Discuss in terms of the
means, variances, coefficient of variation, and skewness.
7.17. Plot the annual peak discharge of Walnut Gulch near Tombstone, Arizona, on lognormal
probability paper. Draw in what you consider the best fitting straight line. Estimate the mean and
variance of the data from this plot.
7.18. Plot the suspended sediment load data for the Green River at Munfordville, Kentucky, on
normal and lognormal probability paper. Draw in the best fitting straight line.
7.19. Use the lognormal distribution to estimate the 25-year runoff volume for July on Walnut
Gulch near Tombstone, Arizona. Plot the data on lognormal probability paper and draw in the
theoretical best fitting straight line.
8. Confidence Intervals
and Hypothesis Testing
IN CHAPTER 3, parameter estimation was discussed in general terms. In chapters 4, 5, and
6 specific methods for estimating the parameters of certain probability distributions were
discussed. Again, it should be recalled that parameter estimates are called statistics, are functions
of the sample (random) values, and are themselves random variables. Parameter estimates have
associated with them probability distributions.
Thus far we have discussed methods of getting point estimates for parameters and certain
properties of these point estimates. The possible errors in these point estimates due to inherent
variability in random samples of data have not been discussed. This chapter considers the reliability of parameter estimates and the testing of hypotheses regarding population parameters.
Hypothesis testing and confidence interval estimation may be classed as parametric or
nonparametric depending on whether or not assumptions are made regarding the probability
distribution of the observations and/or the parameters under consideration. Parametric and
nonparametric tests have certain assumptions in common. They both rely on independence in
the observations and randomness of the sample. They both require samples of data to be
representative of the situation under analysis. Parametric statistics deal with actual values of
observations while nonparametric methods often rely on the ranking or relative position of
data values.
The use of parametric statistics is frequently criticized because of deviations from the
distributions assumed by a particular test. One of the consequences of deviating from the
assumed distribution is that the level of significance of the test is no longer exact. This may be a
serious problem, but in most cases is not. Generally, the selection of the level of significance is
somewhat arbitrary. Early statisticians used 5 and 10%, so everybody uses 5 and 10%! If one
doesn't know how to select a level of significance, it makes little sense to be overly concerned if
the level of significance is unknown due to deviations from distributional assumptions. What is
purported to be an exact test becomes an approximate test, but that is often the nature of hydrologic analysis. Uncertainty abounds! An approximate test provides information to the decision maker just as does a so-called "exact" test and is certainly better than no test at all. Several papers are available indicating that nonparametric procedures are nearly as good as parametric procedures for some tests when distributional assumptions are met and are superior when distributional assumptions are not met (Helsel and Hirsch 1992).
In any application of hypothesis testing or confidence interval estimation, it must be kept in
mind that assumptions must be made concerning the data and the process under study. It is
unlikely that in an actual application the assumptions will be exactly met. Again, if the assumptions are not fully met, then the tests or confidence intervals become approximate.
If we reject the hypothesis that two streams have different BOD loadings, we do not necessarily believe their BOD loadings are exactly the same. It would be rare indeed to have two natural streams that have identical BOD loadings or any other quantifiable characteristic. We know
before we run the test, indeed before we collect any data, that the BOD loadings are not precisely
the same on two streams.
What we are really concerned with is whether the BOD loadings are "significantly" different. In statistical jargon, we are assessing whether the difference we detect in BOD is of such a
magnitude that it cannot be attributed to chance if the BOD loadings in the two streams are in fact
the same and meet the conditions of the test.
For example, consider a situation where the BOD level on two streams is sampled. Assume
that on each of the streams the true distribution of BOD is N(4, 1) and the BOD in the two streams
is uncorrelated. These are strong assumptions that we can never verify completely. If we could,
then statistical testing would be superfluous. It is hypothesized that the BOD levels are the same
in the two streams. The investigator decides to sample each of the streams and declare the BOD
levels different if the samples from the two streams differ by more than 1 mg/l. What is the probability an error will be made?
The error that might be made is to declare the BOD in the two streams different when, in fact, they are, unknown to the investigator, the same. Since the BOD level is actually N(4, 1), the difference in two independent samples is N(0, 2). The probability of selecting a random number from an N(0, 2) that is larger in absolute value than one is the probability of making an error with the test. Since the test statistic, the observed difference, has an N(0, 2) pdf, the standardized Z value corresponding to a difference in excess of the absolute value of one is (1 − 0)/√2 = 0.707. The probability of Z exceeding 0.707 in absolute value for a standard normal distribution is 0.48. There is a 48% chance of rejecting the hypothesis even though it is true.
If the investigator thinks this probability of an error is too great, the appropriate value for the
test statistic consistent with the acceptable error probability can be determined. For example, if
the investigator wants to be 90% confident of not concluding the streams are different when in
fact they are not, the cutoff value for Z is such that prob(Z > zc) = 0.05, which corresponds to Z = 1.645. Then the actual difference is computed from (d − 0)/√2 = 1.645, or d = 2.33. Therefore, the streams would be considered not significantly different unless the absolute value of the difference in the samples from the streams exceeded 2.33 mg/l.
If the BOD distribution on one stream was N(3, 1) and on the other N(4, 1), the distribution of BOD would have been truly different on the two streams. The distribution of the difference in BOD would be N(1, 2). The probability of getting a difference in excess of ±1 would be the probability of a value <0 or >1 from an N(1, 2). Again, using the standard normal distribution, this probability can be found to be 0.74. In this case, the BOD distributions are different yet there is a 26% chance of erroneously concluding they are not different.
What becomes apparent is that there is always a chance of making an error in statistical
tests of hypotheses. The first part of the example demonstrates how one could wrongly conclude
a difference when none existed, and the second part shows how one could fail to detect a difference when one does exist. These two errors are rejecting a true hypothesis, known as a Type I
error, and accepting a false hypothesis, known as a Type II error.
The probabilities of a Type I and a Type II error are usually denoted by α and β, respectively.
In this example, when the true situation was no difference, α was 0.48. In the situation where there
was a difference, β was 0.26.
CONFIDENCE INTERVALS
A parameter θ is estimated by θ̂. The statistic θ̂ is a random variable having a probability
distribution. If θ̂ can take on any value in some continuous range, then prob(θ = θ̂) is zero.
Rather than a point estimate for θ, it may be more desirable to get an interval estimate such that
the probability that this interval contains θ can be specified. Such an interval is known as a confidence interval. This statement may be written

prob(L < θ < U) = 1 − α   (8.1)

where L and U are the lower and upper confidence limits, so that the interval from L to U is the
confidence interval and 1 − α is the confidence level, or confidence coefficient. Note that in
equation 8.1, θ is not a random variable. One does not say that the probability that θ is between
L and U is 1 − α but that the probability is 1 − α that the interval L to U contains θ. The difference in these two interpretations is subtle but based on the fact that θ is a constant while L and U
are random variables.
Mood et al. (1974) discuss a general method for determining confidence intervals. Ostle
(1963) presents expressions for the confidence intervals for many different statistics. In the
discussion to follow, a procedure known as the method of pivotal quantities for determining confidence limits will be illustrated. This method consists of finding a random variable V that is a
function of the parameter θ but whose distribution does not involve any other unknown parameters. Then v₁ and v₂ are determined such that

prob(v₁ < V < v₂) = 1 − α   (8.2)

This inequality is then manipulated so that it is in the form of equation 8.1, where U and L are random variables depending on V but not on θ.
HYPOTHESIS TESTING
197

Mean of a Normal Distribution
The quantity

t = (X̄ − μ)/(s_X/√n)   (8.3)

has a t distribution with n − 1 degrees of freedom. Letting this quantity equal V in equation 8.2 results in

prob(t_{α/2,n−1} < (X̄ − μ)/(s_X/√n) < t_{1−α/2,n−1}) = 1 − α

Therefore

prob(X̄ − t_{1−α/2,n−1} s_X/√n < μ < X̄ + t_{1−α/2,n−1} s_X/√n) = 1 − α

This latter equation is in the form of equation 8.1, so the confidence limits are

L = X̄ − t_{1−α/2,n−1} s_X/√n and U = X̄ + t_{1−α/2,n−1} s_X/√n   (8.4)
Because X̄ and s_X are both random variables, L and U are random variables as well, with
estimates l and u given by equation 8.4. Note that the assumption that the observations are
normally distributed was made.
Example 8.1. The sample mean and standard deviation of the Kentucky River data contained in table 2.1
have been calculated as x̄ = 66,540 and s_X = 22,322. What are the 95% confidence limits on the
mean assuming the sample is from a normal population?
Solution: With n = 99 and t_{0.975,98} = 1.99, equation 8.4 gives

l = 66,540 − 1.99(22,322)/√99 = 62,076
u = 66,540 + 1.99(22,322)/√99 = 71,004

Thus, we can say that we are 95% confident that the interval 62,076 to 71,004 contains the true
population mean.
Comment: If a 90% confidence interval is calculated, it is found to be 62,817 to 70,263. Thus, the
90% confidence interval is shorter than the 95% confidence interval, but our degree of confidence
that the interval contains μ has decreased from 95% to 90%.
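The limits of example 8.1 can be reproduced in a few lines; a sketch assuming SciPy (the exact t value, t_{0.975,98} = 1.984, differs slightly from the tabled 1.99 used in the text, so the computed limits are within a few units of those above):

```python
from math import sqrt
from scipy.stats import t

n, xbar, s = 99, 66540.0, 22322.0
half_width = t.ppf(0.975, n - 1) * s / sqrt(n)   # exact t in place of the table value
lower, upper = xbar - half_width, xbar + half_width
print(round(lower), round(upper))   # near 62,076 and 71,004
```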
If a second independent sample of peak flows on the Kentucky River near Salvisa were available, this sample would have a different mean and variance. In this case, the 95% confidence
intervals would be different as well. If many samples were available and the 95% confidence
limits were calculated for each, 95% of the confidence limits would contain the true population
mean and 5% would not, if the data were actually from a normal distribution. The 100(1 − α)%
confidence interval on the mean can be made as small as desired by increasing the sample size.
This is because s_X̄ decreases as the sample size is increased. An increase in the reliability of the
sample mean comes at the expense of an increase in the sample size. Unfortunately, in many
hydrologic problems the sample size is fixed. For a normal distribution, equation 8.4 provides a
means for determining the sample size required to estimate μ within a given reliability.
If the population variance of the normal distribution is known, then the pivotal quantity in
equation 8.3 becomes (X̄ − μ)/σ_X̄, which has a standard normal distribution. The confidence
limits then become

L = X̄ − z_{1−α/2} σ_X/√n and U = X̄ + z_{1−α/2} σ_X/√n   (8.5)

where z_{1−α/2} is the value of z from the standard normal distribution such that the area to the right
of z is α/2.
Equations 8.4 and 8.5 are based on the assumption that the underlying population of the
random variable X has a normal distribution. Only through the Central Limit Theorem can these
relations be applied to non-normal distributions. Confidence limits calculated by these relationships for the means of random samples from non-normal populations are only approximate with
the approximation improving as the sample size increases. If these approximations are not satisfactory, other methods are available (Ostle 1963; Mood et al. 1974).
Variance of a Normal Distribution
The quantity (n − 1)s²/σ² has a chi-square distribution with n − 1 degrees of freedom. Letting this quantity equal V in equation 8.2 results in

prob(χ²_{α/2,n−1} < (n − 1)s²/σ² < χ²_{1−α/2,n−1}) = 1 − α

Then

prob((n − 1)s²/χ²_{1−α/2,n−1} < σ² < (n − 1)s²/χ²_{α/2,n−1}) = 1 − α

which is in the form of equation 8.1. Thus, the confidence limits on σ² are

L = (n − 1)s²/χ²_{1−α/2,n−1} and U = (n − 1)s²/χ²_{α/2,n−1}   (8.6)

Again, equations 8.6 are strictly valid only if X is from a normal distribution and approximate for X from a non-normal distribution, with the approximation improving as the sample
size increases.
The 90% confidence intervals on the standard deviation are found (by taking the square
roots of the above limits) to be 20,001 to 25,331 cfs.
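These limits follow from equations 8.6 applied to the Kentucky River sample; a sketch assuming SciPy (exact chi-square percentiles give values within a fraction of a percent of the tabled results above):

```python
from math import sqrt
from scipy.stats import chi2

n, s = 99, 22322.0
df = n - 1
var_low = df * s**2 / chi2.ppf(0.95, df)   # lower 90% limit on sigma squared
var_up = df * s**2 / chi2.ppf(0.05, df)    # upper 90% limit on sigma squared
sd_low, sd_up = sqrt(var_low), sqrt(var_up)
print(round(sd_low), round(sd_up))   # near 20,001 and 25,331
```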
Comment: In the preceding two examples the confidence limits on the mean and variance of a
normal distribution were calculated. If joint confidence limits on μ_X and σ_X² are desired, they
cannot be computed separately as was done in these examples. Mood et al. (1974) discuss the
estimation of joint confidence intervals.
One-Sided Confidence Intervals
Situations may arise where one is only interested in an interval estimate on one side of a parameter. For instance, it may be desired to find only a lower confidence limit. In this situation
equation 8.1 becomes

prob(L < θ) = 1 − α
The same procedure for finding L would be followed as was used in the two-sided case,
except now all of the probability α will be in one tail. For instance, the one-sided lower limit on
the mean of a normal distribution with an unknown variance would be

l = x̄ − t_{1−α,n−1} s_X/√n
The analogous results would hold for any one-sided, lower or upper confidence limit.
Parameters of Probability Distributions
For a wide class of distributions and for large samples, the maximum likelihood estimators for
the parameters of the distribution are asymptotically normally distributed with means equal to
the true parameter values.
Using this information, it is possible to construct confidence intervals and joint confidence
intervals for the parameters of these distributions. The book by Mood et al. (1974) should be consulted for the procedures to be used.
HYPOTHESIS TESTING
Often the acceptability of statistical models can be judged without actually making any
statistical tests. This would be the case when observed data are predicted very closely by the model
or when observed data deviate very greatly from the model. On the other hand, a common
occurrence is for the observed data to deviate somewhat from the model but not enough for one to
state that the model is obviously inadequate. In this latter situation one must determine whether
the deviations represent true inadequacies in the model or whether the deviations are chance
variations from the true model.
The general procedure to be followed in making statistical tests is
1. Formulate the hypothesis to be tested.
2. Formulate an alternative hypothesis.
3. Determine a test statistic and its distribution when the hypothesis is true.
4. Select a level of significance and determine the rejection region for the test statistic.
5. Calculate the value of the test statistic from the sample and accept or reject the hypothesis.

Table 8.1. Errors in hypothesis testing

                                True situation
  Decision              Hypothesis true     Hypothesis false
  Accept hypothesis     No error            Type II error
  Reject hypothesis     Type I error        No error
For many statistical tests, steps 2-4 have been completed and may be found in a wide variety of statistics books. For many of the tests that a hydrologist might like to make, adequate test
statistics and their distributions have not been determined, largely because of restrictive assumptions. Nonparametric tests relieve this problem to some extent.
It is not possible to develop tests that are absolutely conclusive. All of the tests have a
possibility of two kinds of error: rejecting a true hypothesis (Type I error) or accepting a false
hypothesis (Type II error). Table 8.1 depicts the two types of errors. The probability of a Type I
error is denoted by α and the probability of a Type II error by β. The significance level is defined
as 100(1 − α) (in percent). In testing hypotheses, the probability of a Type I error can be specified; however, the probability of a Type II error is not known unless the true parameter values
being tested are known. In general, as the value of α decreases, the magnitude of β increases.
As an example, assume we select an observation x₀ at random from a normal distribution
with variance σ₀² and hypothesize that the distribution has a mean μ₀. The test statistic could be x₀
itself, which has a normal distribution with unknown mean and variance σ₀². If the hypothesis is
true (something that is not known or the test would not be made), the distribution of the test
statistic would be a normal distribution with mean μ₀ and variance σ₀² and would appear as in
Figure 8.3. If it is decided to accept the hypothesis if x₀ is within 2 standard deviations of μ₀ and
reject the hypothesis otherwise, the critical region or rejection region would be the shaded area in
Figure 8.3. From the properties of the normal distribution, it is known that 95.44% of the area of
the normal curve is within 2 standard deviations of the mean, so the critical region occupies 4.56%
of the area. It is also apparent that there is a 4.56% chance that x₀ will be in the critical region and
the hypothesis rejected even though it is true. Thus, by definition α = 0.0456, or there is a 4.56%
chance of making a Type I error due to random variation in the x₀ selected. It is more common to
specify α and from this information determine the critical region. For example, if one wanted α to
be 0.10, then the critical region would be |(x₀ − μ₀)/σ₀| > 1.645, since 1.645 is the value of the standard normal distribution such that the area outside the limits −1.645 to 1.645 is 0.10.
[Fig. 8.3: distribution of the test statistic under the hypothesis, with boundaries at μ₀ − 2σ₀, μ₀, and μ₀ + 2σ₀ and the rejection region shaded in the tails.]
In order to evaluate β, the true parameter values must be known. Again, consider selecting
a single value x₀ from a normal population with variance σ₀² and an unknown mean. Let the
hypothesis be that μ = μ₀ and the alternative be μ ≠ μ₀. If μ actually equals μ₁, then the
situation depicted in figure 8.4 would exist and there is a 100β% chance that x₀ will fall in the
acceptance region of N(μ₀, σ₀²) and thus a Type II error committed. From figure 8.4 it can be seen
that as α is increased, β will decrease. It can also be seen that the nearer μ₁ is to μ₀, the greater
will be β. This is because it is increasingly difficult to tell the difference between the two distributions. It is not possible to determine the magnitude of β because it is a function of the unknown
population mean μ₁. Example 8.3 shows how β can be evaluated if μ₁ is known. Of course, μ₁
would not be known or else one would not hypothesize μ = μ₀.
Example 8.3. Assume a single observation is selected from a normal distribution with mean
μ₁ = 7 and variance σ₀² = 9. It is hypothesized that μ = μ₀ = 5. If the test is conducted at the
10% significance level, what is β?
Solution:
Reference should be made to figure 8.5.
α = 0.10
The boundary of the upper critical region, x_u, satisfies (x_u − 5)/3 = 1.645, or x_u = 9.935. A_u is
the area of a normal distribution with mean 7 and variance 9 to the left of 9.935, so z_u = (9.935 − 7)/3
or z_u = 0.978. The area to the left of z_u = 0.978 from a standard normal distribution is 0.8365. Similarly, if x_l
is the boundary of the lower critical region, we have (x_l − 5)/3 = −1.645, or x_l = 0.065. A_l is the
area of a normal distribution with mean 7 and variance 9 to the left of 0.065. z_l = (0.065 − 7)/3
or z_l = −2.31. A_l = 0.0104. Now β = A_u − A_l or β = 0.8365 − 0.0104 = 0.8261. Thus, the
probability of accepting the hypothesis that μ = 5 when in fact μ = 7 is 0.8261 when α is 0.10. The
probability of a Type II error is 0.8261.
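The β calculation generalizes directly; a sketch assuming SciPy, which also makes it easy to trace an operating characteristic curve by varying mu1:

```python
from scipy.stats import norm

def beta(mu0, mu1, sigma, alpha):
    """Probability of a Type II error: the chance a single observation falls
    in the acceptance region of N(mu0, sigma^2) when it really comes from
    N(mu1, sigma^2)."""
    z = norm.ppf(1.0 - alpha / 2.0)
    x_l, x_u = mu0 - z * sigma, mu0 + z * sigma    # acceptance region
    return norm.cdf((x_u - mu1) / sigma) - norm.cdf((x_l - mu1) / sigma)

print(round(beta(5.0, 7.0, 3.0, 0.10), 3))   # close to 0.826, as in example 8.3
```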
If calculations such as those contained in example 8.3 are carried out for various values of
μ₁, a curve relating β to μ₁ can be constructed. Such a curve is shown in figure 8.6, which
shows the β curve for α = 0.05 and α = 0.10. Curves such as shown in figure 8.6 are often
called operating characteristic (OC) curves.

Fig. 8.6. Probability of a Type II error as a function of the true mean for example 8.3.

Figure 8.6 verifies the earlier statements that β increases as α decreases and that β increases as
the true mean, μ₁, approaches the hypothesized mean, μ₀. In fact, as μ₁ gets close to μ₀, β
approaches 1 − α.
[Figure: power, 1 − β, as a function of the true mean μ₁ for example 8.3.]

Accepting a hypothesis does not prove that the parameter equals the hypothesized value; it indicates only that the estimate is not
significantly different from it. For example, if we calculate the mean of a random sample and then
accept the hypothesis that the true mean is 5, we may not believe that the true mean is exactly 5 but
rather the true mean is not significantly different from 5. What constitutes a significant difference
has been defined by the type of test used and the level of significance. Furthermore, a statistically
significant difference and a physically significant difference are not the same. For example, if
θ̂ = 4.0 is an estimate for θ and a test of hypothesis shows θ̂ is not significantly different from
zero, it does not mean θ = 0 should be used in some physical analysis if this physical analysis is
sensitive to differences in θ of this order of magnitude. A physically significant difference depends
on the problem being studied.
The following is a discussion of several common tests of hypotheses. The hypothesis to be
tested is denoted by H₀ and the alternative hypothesis by Hₐ. For the tests that follow to be correct statistical tests, the assumptions involved in developing the test statistic must not be violated.
A primary assumption is that the statistics are estimated based on a random sample. In practice,
at least some of the assumptions are generally violated, with the result that the tests are only
approximate tests. This approximation is manifest in the fact that the actual level of significance
will not equal 100α%. That the tests are often approximate because of assumption violations
does not render them valueless. It is the analyst who must make the decision, not a
statistical test following some prescribed procedure. The analyst may put less weight on a statistical test in arriving at a decision, however, if the violations of the assumptions of the statistical test are of
concern.
H₀: μ = μ₁, Hₐ: μ = μ₂, Normal Distribution, Known Variance
In this case, H₀ is a simple hypothesis and Hₐ is a simple alternative hypothesis. The test
statistic is developed by considering that

z = (x̄ − μ₁)/(σ_X/√n)

has a standard normal distribution when H₀ is true. When the variance must be estimated by s_X², H₀ is rejected if

x̄ ≥ μ₁ + t_{1−α,n−1} s_X/√n   for μ₁ < μ₂   (8.12)

and by the analogous one-sided rule for μ₁ > μ₂. For testing H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀ with known variance, H₀ is rejected if

|z| = |x̄ − μ₀|/(σ_X/√n) > z_{1−α/2}

With the variance estimated from the sample, H₀ is rejected if

|t| = |x̄ − μ₀|/(s_X/√n) > t_{1−α/2,n−1}   (8.14)
This test cannot be applied to every set of data. The assumption has been made that the
observations are from a normal distribution.
Example 8.4. The annual runoff for Cave Creek near Fort Spring, Kentucky, for the period 1953
to 1970, has a mean of 14.65 inches and a standard deviation of 4.75 inches. Test the hypothesis
that the mean annual runoff is 16.5 inches.
Solution: The testing procedures we have available to us are all based on the assumption of
normality. If we assume the annual runoff is normally distributed, we can use equation 8.14 to
test H₀: μ = 16.5 versus Hₐ: μ ≠ 16.5.
There are 18 observations. The test statistic is

t = (x̄ − μ₀)/(s_X/√n) = (14.65 − 16.5)/(4.75/√18) = −1.65

Using a 95% level of significance, α = 0.05 and t_{0.975,17} = 2.11. Because |t| =
1.65 < 2.11, we do not reject the hypothesis that the mean is 16.5.
Comment: Some statisticians do not like to "accept" H₀. Their reasoning is that we have not
proven H₀, only found strong evidence to support it. As a result of a statistical test, their conclusion would be either to reject H₀ or to fail to reject H₀. It should be kept in mind, however, that we
have not proven H₀.
For instance, in this example, we have calculated the sample mean to be 14.65 and accepted
the hypothesis that the population mean is 16.5. This illustrates two points. First, the data and the
test obviously do not prove that μ = 16.5. Second, what we really have accepted is not that the
mean is 16.5 but that, when sampling from this distribution using a sample of size 18, the difference between the sample mean of 14.65 and the hypothesized mean of 16.5 can reasonably be
ascribed to chance variations due to the random sample. Our conclusion is that, based on this
sample, we cannot say that the population mean is not 16.5, or that based on this sample the population mean is not (statistically) significantly different from 16.5.
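Example 8.4 can be verified numerically; a sketch of the summary-statistic form of the t test, assuming SciPy:

```python
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 18, 14.65, 4.75, 16.5
t_stat = (xbar - mu0) / (s / sqrt(n))   # equation 8.14 numerator over its denominator
t_crit = t.ppf(0.975, n - 1)
reject = abs(t_stat) > t_crit           # False: do not reject H0
print(round(t_stat, 2), round(t_crit, 2))   # -1.65 and 2.11
```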
Test for Differences in Means of Two Normal Distributions
If the variances of the two normal distributions are known, then H₀: μ₁ − μ₂ = δ versus
Hₐ: μ₁ − μ₂ ≠ δ can be tested by calculating the test statistic

z = (x̄₁ − x̄₂ − δ)/√(σ₁²/n₁ + σ₂²/n₂)

In this case, z has a standard normal distribution, so the rejection region is |z| > z_{1−α/2}.
If the variances of the two normal distributions are equal but unknown, H₀: μ₁ − μ₂ = δ
versus Hₐ: μ₁ − μ₂ ≠ δ is tested by calculating the statistic

t = (x̄₁ − x̄₂ − δ)/(s_p √(1/n₁ + 1/n₂))   where   s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²]/(n₁ + n₂ − 2)

which has a t distribution with n₁ + n₂ − 2 degrees of freedom when H₀ is true.
Again, note that these two tests are based on the assumption of normality. For large samples, the Central
Limit Theorem may enable us to use these tests as approximate tests for non-normal samples.
Gibra (1973), Ostle (1963), and others discuss testing H₀: μ₁ − μ₂ = δ versus Hₐ: μ₁ − μ₂ ≠ δ when sampling from two normal populations with unknown and unequal variances. Ostle
recommends the following approximate procedure. Compute the test statistic

t′ = (x̄₁ − x̄₂ − δ)/√(w₁ + w₂)

where

w₁ = s₁²/n₁ and w₂ = s₂²/n₂

H₀ is accepted if

|t′| < (w₁ t_{1−α/2,n₁−1} + w₂ t_{1−α/2,n₂−1})/(w₁ + w₂)

Otherwise H₀ is rejected.
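This approximate unequal-variance procedure can be sketched as a small function; the summary statistics below are hypothetical, and SciPy is assumed:

```python
from math import sqrt
from scipy.stats import t

def unequal_variance_test(x1, s1sq, n1, x2, s2sq, n2, delta=0.0, alpha=0.05):
    """Approximate test of H0: mu1 - mu2 = delta with unequal, unknown variances."""
    w1, w2 = s1sq / n1, s2sq / n2
    t_prime = (x1 - x2 - delta) / sqrt(w1 + w2)
    # weighted critical value built from the two single-sample t values
    t1 = t.ppf(1.0 - alpha / 2.0, n1 - 1)
    t2 = t.ppf(1.0 - alpha / 2.0, n2 - 1)
    t_crit = (w1 * t1 + w2 * t2) / (w1 + w2)
    return t_prime, t_crit

# hypothetical samples: means 10 and 8, variances 4 and 9, sizes 10 and 15
tp, tc = unequal_variance_test(10.0, 4.0, 10, 8.0, 9.0, 15)
print(round(tp, 2), round(tc, 2))   # 2.0 and 2.19
```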
Test of H₀: σ₁² = σ₂² versus Hₐ: σ₁² ≠ σ₂² for Two Normal Populations
To test the hypothesis that the variances of two normal populations are equal, the
sample test statistic is

F = s₁²/s₂²

which, when H₀ is true, has an F distribution with n₁ − 1 and n₂ − 1 degrees of freedom. H₀ is
rejected if the computed F falls outside the interval from F_{α/2,n₁−1,n₂−1} to F_{1−α/2,n₁−1,n₂−1}.
The hypothesis that k normal populations have a common variance, H₀: σ₁² = σ₂² = ⋯ = σ_k²,
can also be tested. In this test, Hₐ is σᵢ² that are not all equal. This means that at least one σᵢ²
is different from the others, and H₀ is rejected if the computed test statistic exceeds the critical
chi-square value. The test is known as Bartlett's test for homogeneity of variances. Homogeneity of
variance is also known as homoscedasticity.
Example 8.5. For the preceding example, test the hypothesis that the variance is 36.00.
Solution: The assumption of normality is used. The test is based on equation 8.18 using
α = 0.05.
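A sketch of this chi-square test on the variance, using the values of example 8.4 (n = 18, s = 4.75) and σ₀² = 36.00, with SciPy assumed:

```python
from scipy.stats import chi2

n, s, sigma0_sq = 18, 4.75, 36.0
stat = (n - 1) * s**2 / sigma0_sq          # chi-square test statistic
lo = chi2.ppf(0.025, n - 1)
hi = chi2.ppf(0.975, n - 1)
reject = stat < lo or stat > hi
print(round(stat, 2), reject)   # 10.65 False: the hypothesis is not rejected
```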
Two visual methods for judging whether data follow a hypothesized distribution have been
presented earlier. The first was to compare a relative frequency histogram of the data with the
hypothesized relative frequency curve. The second method was to plot the data and the hypothesized distribution as a cumulative probability distribution on appropriate paper and judge
whether or not the hypothesized distribution adequately describes the plotted points. Statistical
tests corresponding to these visual tests will now be discussed. In the following discussion, the
hypothesis being tested is that the data are from a specified probability distribution.
Chi-square Goodness of Fit Test
One of the most commonly used tests for goodness of fit of empirical data to specified theoretical frequency distributions is the chi-square test. This test makes a comparison between the
actual number of observations and the expected number of observations (expected according to
the distribution under test) that fall in the class intervals. The expected numbers are calculated by
multiplying the expected relative frequency by the total number of observations. The test statistic is calculated from the relationship

χ²_c = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)²/Eᵢ   (8.21)

where k is the number of class intervals, and Oᵢ is the observed and Eᵢ the expected (according to
the distribution being tested) number of observations in the ith class interval. The distribution
of χ²_c is a chi-square distribution with k − p − 1 degrees of freedom, where p is the number of
parameters estimated from the data. The hypothesis that the data are from the specified distribution is rejected if

χ²_c > χ²_{1−α,k−p−1}
Example 8.6. As an example of using the chi-square test, consider the Kentucky River data of
table 2.1 and test the hypothesis that the data are from a normal distribution. The expected
number in each class interval is obtained by multiplying the expected relative frequency by 99,
which is the number of observations. Table 8.2 shows the calculation of χ²_c. The degrees of
Table 8.2. Chi-square test on Kentucky River data

  Class mark   Observed number   Expected number   (O − E)²/E
  25,000       3                 5.03              0.820
  35,000       6                 6.57              0.050
  45,000       16                11.10             2.162
  55,000       16                15.39             0.025
  65,000       18                17.51             0.014
  75,000       13                16.35             0.686
  85,000       13                12.54             0.017
  95,000       7                 7.89              0.100
  105,000      3                 4.08              0.284
  115,000      4                 2.55              0.823
  Total        99                99                4.982
[Table 8.3. Chi-square test on Kentucky River data (modified): the calculation of table 8.2 repeated with the first two and the last two class intervals combined; total n = 99 and χ²_c = 3.62.]
freedom are k − 3, or 7, since two parameters (μ_X and σ_X²) were estimated for the normal distribution. Comparing χ²_c of 4.98 with χ²_{0.90,7} = 12.0, it is concluded that the normal distribution cannot be rejected for these data at α = 0.10. If χ²_c had exceeded χ²_{1−α,k−p−1}, the hypothesis that the
normal distribution describes the data would be rejected.
In constructing table 8.2 the expected number in a class interval is based on n[P_X(xᵢ) − P_X(xᵢ₋₁)]
for all intervals except the first and last ones. For the first interval the expected number
is nP_X(x₁), and for the last interval it is n[P_X(∞) − P_X(x_{k−1})]. In these expressions xᵢ represents the
right boundary of the ith class.
Comment: By examining table 8.2 and equation 8.21, it is apparent that the chi-square goodness
of fit test is quite sensitive in the tails of the assumed distribution. Because of this, many statisticians recommend that classes be combined if the expected number in a class is less than 3 (or 5).
If the criterion of 5 is used, the first two classes and the last two classes must be combined. This
calculation of χ²_c is shown in table 8.3, and the χ²_c value is reduced to 3.62. The degrees of
freedom are reduced to 5.
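The χ²_c of table 8.2 can be reproduced in a few lines; SciPy is assumed for the critical value:

```python
from scipy.stats import chi2

observed = [3, 6, 16, 16, 18, 13, 13, 7, 3, 4]
expected = [5.03, 6.57, 11.10, 15.39, 17.51, 16.35, 12.54, 7.89, 4.08, 2.55]
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
df = len(observed) - 2 - 1          # k - p - 1 with two estimated parameters
crit = chi2.ppf(0.90, df)
print(round(chi_sq, 2), round(crit, 2))   # 4.98 and 12.02: not rejected
```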
Perhaps a better way of conducting the chi-square goodness of fit test is to define the class
intervals so that under the hypothesis being tested the expected number of observations in each
class interval is the same. This means that the class intervals will be of unequal width and that the
interval widths will be a function of the distribution being tested.
Example 8.7. A chi-square test for normality of the Kentucky River data using 10 class intervals,
each having the same expected frequency, can be conducted as follows.
Ten class intervals means that the expected relative frequency or probability in each interval
is 0.1. The class boundaries can be determined by inverting the cumulative distribution. For instance, the boundaries of the 4th class interval are given by the values of x satisfying
P_X(x) = 0.3 and P_X(x) = 0.4.
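With x̄ = 66,540 and s_X = 22,322 from table 2.1, these boundaries follow from the inverse normal distribution; a sketch assuming SciPy:

```python
from scipy.stats import norm

mean, sd = 66540.0, 22322.0
# interior boundaries at cumulative probabilities 0.1, 0.2, ..., 0.9
boundaries = [mean + sd * norm.ppf(p / 10.0) for p in range(1, 10)]
print([round(b) for b in boundaries])
# close to 37932, 47753, 54834, 60885, 66540, 72195, 78246, 85327, 95148
```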
Table 8.4. Chi-square test based on equal expected numbers per class interval

  Class number   Lower boundary   Upper boundary   Observed number   Expected number   (O − E)²/E
  1              −∞               37,933           8                 9.9               0.365
  2              37,933           47,753           15                9.9               2.627
  3              47,753           54,834           13                9.9               0.971
  4              54,834           60,885           7                 9.9               0.849
  5              60,885           66,540           7                 9.9               0.849
  6              66,540           72,195           14                9.9               1.698
  7              72,195           78,246           5                 9.9               2.425
  8              78,246           85,327           12                9.9               0.445
  9              85,327           95,147           8                 9.9               0.365
  10             95,147           ∞                10                9.9               0.001
  Total                                            99                99                10.596
Table 8.4 contains the data for conducting the chi-square test based on 10 class intervals
having equal expected numbers of observations (99/10, or 9.9) in each interval. In this case, χ²_c
is 10.60, which is less than χ²_{0.90,7} of 12.02. The hypothesis is, again, not rejected.
Distributional Tests Based on Cumulative Distributions
Conover (1980) presents a good discussion of statistical tests based on cumulative distributions. The most commonly used of these tests is the Kolmogorov-Smirnov one-sample test (also
known as the Kolmogorov test). The hypothesis being tested is that a set of empirical observations comes from a particular, known, and completely specified cumulative distribution. This test
is conducted as follows:
1. Let P_X(x) be the completely specified theoretical cumulative distribution function under the
null hypothesis.
2. Let S_n(x) be the sample cumulative distribution function based on n observations. For any
observed x, S_n(x) = k/n, where k is the number of observations less than or equal to x.
3. Determine the maximum deviation, D, defined by

   D = max |P_X(x) − S_n(x)|

4. If, for the chosen significance level, the observed value of D is greater than or equal to the critical tabulated value of the Kolmogorov-Smirnov (K-S) statistic, the hypothesis is rejected.
The Kolmogorov-Smirnov test statistic is included in the appendix.
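The computation of D can be sketched for a small, hypothetical sample and checked against scipy.stats.kstest:

```python
from scipy.stats import norm, kstest

data = sorted([2.1, 3.4, 1.7, 4.0, 2.9])           # hypothetical observations
n = len(data)
cdf = lambda x: norm.cdf(x, loc=3.0, scale=1.0)    # completely specified N(3, 1)

# D is the largest gap between the step function S_n(x) and P_X(x),
# checked just after and just before each jump of S_n(x):
D = max(max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(data))

result = kstest(data, cdf)
print(abs(D - result.statistic) < 1e-12)   # True: both give the same D
```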
This test can be conducted by calculating the quantities P_X(x) and S_n(x) at each observed
point, or by plotting the data as in figure 7.3 and selecting the greatest deviation on the
probability scale of a point from the theoretical line. If the latter approach is used, care must be
taken to measure the deviations on the probability scale rather than on the scale of the variable.
[Figure and table for the example: the ranked data with S_n(x), P_X(x), |S_n(x) − P_X(x)|, and |S_n(xᵢ₋₁) − P_X(xᵢ)| plotted against probability from 0 to 0.9.]
The critical value is 0.411 for n = 8 and α = 0.10. The hypothesis cannot be rejected.
Note that for the Kolmogorov-Smirnov test, P_X(x) is a completely specified cumulative
probability distribution. That is, no parameters for the distribution may be estimated from the
observed data. Crutcher (1975) points out that when parameters must be estimated to specify
P_X(x), the Kolmogorov-Smirnov test is conservative with respect to the Type I error. That is, if
the critical value is exceeded by the test statistic obtained from the observed values, the hypothesis is rejected with considerable confidence. Crutcher (1975) presents a table of critical values
for sample sizes of 25 and 30, as well as infinitely large samples, for the exponential, gamma,
normal, and extreme value distributions when parameters of these distributions must be estimated.
In general, these critical values are smaller than the values given in the Kolmogorov-Smirnov
table in the appendix.
Conover (1980) discusses Lilliefors's extension of the K-S test to the normal distribution
with mean and variance estimated from the data (Lilliefors, 1967) and the exponential distribution with mean estimated from the data (Lilliefors, 1969). The tests are conducted as with the
K-S test except that the critical values are smaller. Conover (1980) presents tables for the required
critical values. Based on data in Conover, letting KS represent the critical value of the
Kolmogorov-Smirnov statistic and L represent the critical value for the Lilliefors test, the
approximation L = a + b·KS can be used, where a and b are given in the following table for 4 to
30 observations. For n greater than 30, the approximation L = c/√n from Conover (1980) yields
reasonable estimates for the critical values.
  Distribution    α      a       b       c
  Normal          0.10   0.021   0.586   0.805
  Normal          0.05   0.027   0.565   0.886
  Normal          0.01   0.040   0.528   1.031
  Exponential     0.10   0.003   0.780   0.977
  Exponential     0.05   0.009   0.767   1.075
  Exponential     0.01   0.016   0.744   1.274
with α = 0.10. The tabled value for KS is 0.411. Therefore L is found to be 0.003 +
0.780(0.411), or 0.324. The hypothesis cannot be rejected.
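The table lends itself to a small helper; the coefficients are transcribed from above, and the K-S critical value for small n still comes from the appendix table:

```python
from math import sqrt

# (distribution, alpha) -> (a, b, c), transcribed from the table above
LILLIEFORS = {
    ("normal", 0.10): (0.021, 0.586, 0.805),
    ("normal", 0.05): (0.027, 0.565, 0.886),
    ("normal", 0.01): (0.040, 0.528, 1.031),
    ("exponential", 0.10): (0.003, 0.780, 0.977),
    ("exponential", 0.05): (0.009, 0.767, 1.075),
    ("exponential", 0.01): (0.016, 0.744, 1.274),
}

def lilliefors_critical(dist, alpha, n, ks_critical=None):
    a, b, c = LILLIEFORS[(dist, alpha)]
    if n > 30:
        return c / sqrt(n)
    return a + b * ks_critical    # needs the tabled K-S value for this n

print(round(lilliefors_critical("exponential", 0.10, 8, ks_critical=0.411), 3))  # 0.324
print(round(lilliefors_critical("normal", 0.10, 99), 3))                         # 0.081
```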
Example 8.10. Test the hypothesis that the Kentucky River peak flow data are normally distributed. Use the Kolmogorov-Smirnov test.
Solution: The data are plotted in figure 8.9. The maximum deviation between the best fitting
line, P_X(x), and the plotted points, S_n(x), on the probability scale is about 0.074 at x = 55,200
cfs (table 8.5). Because the test is for normality and the mean and variance are estimated
from the data, Lilliefors's approach is used. For α = 0.10 and n = 99, the critical value
for the Lilliefors statistic is 0.805/√99, or 0.081. Table 8.5 shows the calculations needed to find
the maximum deviation, which is the maximum value in the columns under |S_n(x) − P_X(x)| and
|S_n(xᵢ₋₁) − P_X(xᵢ)|. Because 0.074 is less than 0.081, the hypothesis of normality cannot be rejected.
Fig. 8.9a. Normal probability plot of Kentucky River data on annual flow.
Fig. 8.9b. Lognormal probability plot of Kentucky River data on annual flow.
[Table 8.5 (extract): the ranked data from 87,100 to 144,000 cfs with S_n(x) from 0.82 to 0.99, together with P_X(x), |S_n(x) − P_X(x)|, and |S_n(xᵢ₋₁) − P_X(xᵢ)| for each observation; the final entry is the maximum deviation.]
[Figure: sample cumulative distributions of the two flow samples plotted against flow; listed sample values include 5.37, 5.60, 6.33, and 8.90.]
The maximum deviation is 0.70. From Conover (1980), with α = 0.10, the critical test
value is 13/20, or 0.65. Thus, one can reject the hypothesis that the two samples are from the
same distribution. Conover (1980) can be consulted for more details on this test and for a companion one-sided test.
Exercises
8.1. A sample of 20 random observations produced a mean of 145 and a variance of 30. What
are the 95% confidence intervals on the mean assuming a normal distribution if (a) the true
variance is estimated by 30 and (b) the true variance is 30? Discuss why the confidence
intervals computed for part (a) are wider than those for part (b).
8.2. What are the 95% confidence intervals on the variance for the samples of exercise 8.1?
8.3. Test the hypothesis that the true mean of the data producing the sample whose properties are
given in exercise 8.1 is 165.
8.4. Discuss any connection between hypothesis testing and confidence intervals that you can
discern. What are the differences?
8.5. Assuming the data are normally distributed, test the hypothesis that the mean peak discharge
on the Kentucky River near Salvisa (table 2.1) for the period 1895-1916 is different from that for
the period 1939-1960.
8.6. Repeat exercise 8.5, except test for equality of variances.
8.7. Using the data of table 2.1, test the hypothesis that the variances of the peak discharges are
the same for the three periods 1895-1916, 1917-1938, and 1939-1960.
8.8. Test the hypothesis that the mean monthly rainfall for September and October are the same
on the Walnut Gulch watershed near Tombstone, Arizona. What assumptions did you make? Are
these assumptions reasonable?
8.9. Repeat exercise 8.8 for equality of variances.
8.10. Test the hypothesis that the difference in the mean monthly rainfall on Walnut Gulch near
Tombstone, Arizona, for September and October is 0.50 inches. Discuss the validity of the
assumptions that are made.
8.11. Test the hypothesis that monthly rainfall in October on the Walnut Gulch watershed near
Tombstone, Arizona, is normally distributed.
8.12. Test the hypothesis that annual rainfall on the Walnut Gulch watershed near Tombstone,
Arizona, is normally distributed.
8.13. Comment on the results of exercises 8.11 and 8.12 in terms of the Central Limit Theorem.
8.14. Would the plotting position relationship used in exercise 7.6 have any effect on the results
of a test for normality on the data set you selected?
8.15. Use the Kolmogorov-Smirnov test to answer exercise 7.10.
8.16. Use the Kolmogorov-Smirnov test to test for normality the three sets of data plotted in
exercise 7.11.
8.17. Use the Kolmogorov-Smirnov test to test for lognormality the three sets of data plotted in
exercise 7.12.
8.18. Work exercise 8.16 using the chi-square test.
8.19. Work exercise 8.17 using the chi-square test.
8.20. What distribution do you think would fit the data of exercise 2.2? Use the chi-square test
to evaluate your assertion.
8.21. The following are experimentally determined values of Manning's n for plastic pipe as
determined by Haan (1965). Test the hypothesis that the mean value of n is different from the
recommended design value of 0.0090.
9. Simple Linear
Regression
NOTATION
IN THIS chapter an upper case letter will represent a variable, a lower case letter will represent
the difference between a variable and its mean, and a subscript will be used to denote a particular
value for the variable. Thus Y represents a variable which may take on values Y₁, Y₂, Y₃, and so on.
Ȳ is the mean of Y, y = Y − Ȳ, and yᵢ = Yᵢ − Ȳ. Parameters are denoted by Greek letters, and the
corresponding English letter is used to denote an estimate for the parameter. Thus α is a parameter
estimated by a (α̂ = a). The lower case letter e will be used to denote the difference between an
observed value of Y and its predicted value Ŷ. Thus Y − Ŷ = e and Yᵢ − Ŷᵢ = eᵢ. All summations
in this chapter will run from 1 to n unless otherwise specified, where n is the number of observations
on Y and X.
SIMPLE REGRESSION
Possibly the most common model used in hydrology is based on the assumption of a linear
relationship between two variables. Generally, the objective of such a model is to provide a
means of predicting or estimating one variable, the dependent variable, from knowledge of a
second variable, the independent variable. The statistical procedure used for determining a linear
relationship between two variables is known as regression. Often the term regression is reserved
for use when all of the X variables being considered are random variables. In this book liberties
will be taken and the term applied whether or not the X variables are random variables. As used
in this chapter, dependent and independent are not the same as dependence or independence of
random variables. Here, dependent means that the variable may be expressed as a (linear)
function of the independent variable.
[Figure 9.1. Annual precipitation (inches) versus annual runoff (inches), showing the data, the mean, the fitted line Ŷ = a + bX, and the 95% confidence intervals on an individual predicted value.]

[Table 9.1. Annual precipitation (inches) and runoff (inches) by year.]
Two questions are of immediate concern. Can a model of the form

Y = α + βX + ε    (9.1)

adequately represent the relationship between Y and X? For what values of α and β is the representation the best? Here ε is the difference between Y and α + βX.
In looking at the question of the "best" straight line, a criterion for judging "bestness" is
needed. One intuitive criterion would be to estimate α and β by a and b so as to minimize the
deviations eᵢ between the observed values of Y, the Yᵢ, and the predicted values of Y, the Ŷᵢ. In this
way, values for a and b would be sought that minimize the sum

Σ eᵢ = Σ (Yᵢ − a − bXᵢ)    (9.2)

Closer scrutiny of equation 9.2 reveals that it is not desirable to minimize the sum in an algebraic
sense because that would be equivalent to finding an a and b such that Σ eᵢ is −∞.
Another criterion might be to find an a and b such that Σ eᵢ is zero. The fallacy with this can
be seen by considering two points. If the line Y = a + bX goes through the two points, then Σ eᵢ
would be zero; however, the sum is also zero for any line that over-predicts one point by the same
amount that it under-predicts the second point. Thus, there is an infinity of lines such that
Σ eᵢ = 0, and an additional restriction or criterion is needed to select a single line.
The Σ eᵢ may be positive or negative. A criterion that is not sign dependent is needed. Such
a criterion might be to minimize Σ |eᵢ| or to minimize Σ eᵢ². Since absolute values are difficult to
work with mathematically, the second criterion is generally selected. Thus it is desired to
estimate α and β by a and b such that Σ eᵢ² is a minimum. Denoting this sum by M, we have

M = Σ eᵢ² = Σ (Yᵢ − a − bXᵢ)²    (9.3)
This sum can be minimized with respect to a and b by taking the partial derivatives of M
with respect to a and b and setting the resulting equations equal to zero:

∂M/∂a = −2 Σ (Yᵢ − a − bXᵢ) = 0    (9.4)

∂M/∂b = −2 Σ Xᵢ(Yᵢ − a − bXᵢ) = 0    (9.5)

These equations can then be written in the following form, known as the normal equations:

na + b Σ Xᵢ = Σ Yᵢ

a Σ Xᵢ + b Σ Xᵢ² = Σ XᵢYᵢ

Solving for b and a gives

b = (Σ XᵢYᵢ − n X̄ Ȳ)/(Σ Xᵢ² − n X̄²) = Σ xᵢyᵢ/Σ xᵢ²    (9.6)

a = Ȳ − b X̄    (9.7)
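As a concrete sketch of equations 9.6 and 9.7, the least squares estimates can be computed directly from the deviation form; the data below are illustrative only, not from table 9.1:

```python
# Least squares estimates for a and b in Y = a + bX, using the
# deviation form of equation 9.6: b = sum(x_i*y_i)/sum(x_i**2)
# with x_i = X_i - Xbar, y_i = Y_i - Ybar, and a = Ybar - b*Xbar.

def least_squares(X, Y):
    n = len(X)
    Xbar = sum(X) / n
    Ybar = sum(Y) / n
    x = [xi - Xbar for xi in X]          # deviations from the means
    y = [yi - Ybar for yi in Y]
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    a = Ybar - b * Xbar                  # line passes through (Xbar, Ybar)
    return a, b

# Hypothetical data, loosely precipitation (in) vs. runoff (in)
X = [30.0, 35.0, 40.0, 45.0, 50.0]
Y = [5.0, 8.0, 13.0, 16.0, 20.0]
a, b = least_squares(X, Y)
e = [yi - (a + b * xi) for xi, yi in zip(X, Y)]
print(a, b)      # fitted intercept and slope
print(sum(e))    # least squares forces sum(e_i) to (numerically) zero
```

Note that the fitted line reproduces the two properties stated in the text: Σ eᵢ = 0 and passage through (X̄, Ȳ).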
Equations 9.6 and 9.7 provide estimates for a and b such that Σ eᵢ² is a minimum. Because
the procedure is based on minimizing the error sum of squares, Σ eᵢ², the estimates a and b are
commonly called least squares estimates. Equation 9.4 indicates that this solution also satisfies
Σ eᵢ = 0. Equation 9.7 indicates that the line Y = a + bX goes through the point Y = Ȳ and X = X̄.
The line Y = a + bX is commonly known as the regression line of Y on X. The procedure
of determining a and b is known as simple regression. The term "simple" regression is used when
only one independent variable is involved, as opposed to multiple regression when several independent variables are involved. The parameter estimates, a and b, are known as the regression
coefficients.
Equations 9.6 and 9.7 show that a and b are functions of the sample values of Y and X. If
another sample of observations were obtained and a and b were estimated from this sample,
different estimates would result. We have already seen that

eᵢ = Yᵢ − Ŷᵢ = Yᵢ − a − bXᵢ    (9.8)

Similarly

εᵢ = Yᵢ − α − βXᵢ    (9.9)

Thus, eᵢ represents the deviation between an observed Yᵢ and its predicted value Ŷᵢ based on the
regression equation estimated from the particular sample of data at hand. εᵢ represents the deviation between an observed Yᵢ and the assumed true but unknown relation between Y and X given
by Y = α + βX.
Example 9.1. Determine the regression coefficients for the data plotted in figure 9.1.
Solution: The data required for solving equations 9.6 and 9.7 are contained in table 9.2. The
equation used to calculate b would depend on the method of calculation. If a small desk calculator is used, the first of equations 9.6 might be employed. If an electronic calculator or computer
is used, the latter of equations 9.6 might be employed. Generally, less roundoff error will result
if the latter form of equation 9.6 is used. In practice, readily available software would be used.
Therefore Ŷ = −13.1951 + 0.648X.
[Table 9.2. Computations for example 9.1 (column total 234.04, average 14.63), including the columns Ŷᵢ and Yᵢ − Ŷᵢ.]
Comment: The last two columns of table 9.2 contain Ŷᵢ and Yᵢ − Ŷᵢ. Note that except for
rounding errors, the mean of the Ŷᵢ equals Ȳ, Σ (Yᵢ − Ŷᵢ) = Σ eᵢ = 0, and ē = 0.
The partitioning of the total sum of squares follows by writing

Yᵢ − Ŷᵢ = (Yᵢ − Ȳ) − (Ŷᵢ − Ȳ)

Squaring and summing over the observations, the cross-product term reduces so that

Σ (Yᵢ − Ŷᵢ)² = Σ (Yᵢ − Ȳ)² − Σ (Ŷᵢ − Ȳ)²    (9.10)

However, Σ (Yᵢ − Ȳ)² = Σ Yᵢ² − nȲ², so we have

Σ Yᵢ² = nȲ² + Σ (Yᵢ − Ŷᵢ)² + Σ (Ŷᵢ − Ȳ)²    (9.11)

The total sum of squares, Σ Yᵢ², has been partitioned into three components. These three
components are:
1. nȲ², the sum of squares due to the mean
2. Σ (Yᵢ − Ŷᵢ)² = Σ eᵢ², the sum of squares of deviations from regression (the residual sum of
squares)
3. Σ (Ŷᵢ − Ȳ)² = b Σ xᵢyᵢ, the sum of squares due to regression
In deviation form, equation 9.11 may be written Σ yᵢ² = Σ eᵢ² + b Σ xᵢyᵢ.
Therefore, the total sum of squares corrected for the mean is made up of two components:
the sum of squares of deviations from regression (also known as the error or residual sum of
squares) and the sum of squares due to regression. The larger the sum of squares due to regression in comparison to the residual sum of squares, the more of the total sum of squares corrected
for the mean is explained by the regression equation. The ratio of the sum of squares due to
regression to the total sum of squares corrected for the mean can be used as a measure of the
ability of the regression line to explain variations in the dependent variable. This ratio is
commonly denoted by r² and may be written in a number of ways.
r² = Σ (Ŷᵢ − Ȳ)²/Σ (Yᵢ − Ȳ)² = b Σ xᵢyᵢ/Σ yᵢ²    (9.13)

 = b² Σ xᵢ²/Σ yᵢ² = b² s_x²/s_y²    (9.14)

Because 0 ≤ r² ≤ 1, we have −1 ≤ r ≤ 1. The sign on r is identical to the sign on b because s_x
and s_y are always positive. From equation 9.14 it can be seen that r may also be written as

r = b s_x/s_y = Σ xᵢyᵢ/(Σ xᵢ² Σ yᵢ²)^(1/2)    (9.15)

which would be equal to the sample correlation coefficient if X and Y were both random
variables. In fact, r is commonly called the correlation coefficient and can be shown to
be equal to the correlation between Y and Ŷ. Correlation is discussed in more detail in
chapter 11.
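The equivalence of equation 9.13 with the squared correlation between Y and Ŷ can be checked numerically; the data below are hypothetical:

```python
# Coefficient of determination computed two equivalent ways:
# r^2 = b * sum(x_i*y_i) / sum(y_i^2)      (equation 9.13)
# and as the squared correlation between Y and Yhat.

X = [30.0, 35.0, 40.0, 45.0, 50.0]
Y = [5.0, 8.0, 13.0, 16.0, 20.0]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n
x = [xi - Xbar for xi in X]               # deviations from the means
y = [yi - Ybar for yi in Y]
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
a = Ybar - b * Xbar
Yhat = [a + b * xi for xi in X]

r2 = b * sum(xi * yi for xi, yi in zip(x, y)) / sum(yi * yi for yi in y)

# Squared correlation between Y and Yhat gives the same number
cov = sum((yh - Ybar) * yi for yh, yi in zip(Yhat, y))
r2_corr = cov ** 2 / (sum((yh - Ybar) ** 2 for yh in Yhat)
                      * sum(yi * yi for yi in y))
print(r2, r2_corr)
```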
Example 9.2. What percent of the variation in Y is accounted for by the regression of example 9.1?
Solution:

r² = 0.66

Thus, 66% of the variation in Y is explained by the regression equation. The remaining 34% of
the variation is due to unexplained causes.
CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES
Thus far in the discussion of simple regression no assumptions have been made conceming
the model. In order to use some well-developed theorems conceming hypothesis testing and
confidence interval estimation, it is necessary to make the assumption that the E~ are identically
and independently distributed as a normal distribution with a mean of zero and a variance of 2 .
(A shorthand way of writing this is ei is i.i.d. N(0,d)). For further discussion of the assumptions
involved in regression analysis, see the closing section of this chapter, General Considerations.
Also see Johnston (1963) and Graybill (1961).
This assumption contains many implications. The fact that the E(E,) = 0 has been guaranteed by our estimation procedures. The assumption of independence means that the correlation
between E~ and ej for any i # j must be zero. The assumption that the ei are identically distributed
with variance a2means that the variance of ei must equal the variance of E~ for all i and j. That is,
the variance of ei cannot change as Xi changes. This is known as homoscedasticity. Finally we
must have the ei normally distributed.
The assumption of normality of the E~can be checked by the procedures of chapter 8. A rough
check would be to note that, for the normal distribution, 95% of the values of E~ should be within
2 standard deviations of the mean or only about 5% of the residuals should lie outside the interval -20 to 20. For a further discussion of examining the ei, reference should be made to Draper
and Smith (1966).
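The rough ±2s screen described above can be sketched as follows; the residuals are hypothetical and the function name is mine, not from the text:

```python
# Rough residual screen: for normal errors, only about 5% of residuals
# should fall outside (-2s, +2s), where s is the standard error of the
# regression (sigma estimated from the residuals themselves).
import math

def residual_screen(residuals, n_params=2):
    n = len(residuals)
    # Unbiased estimate of the error variance, dividing by n - 2
    # for simple regression (two estimated parameters)
    s2 = sum(e * e for e in residuals) / (n - n_params)
    s = math.sqrt(s2)
    outside = [e for e in residuals if abs(e) > 2 * s]
    return s, len(outside) / n

# Hypothetical residuals from a fitted line
residuals = [0.4, -1.1, 0.6, 0.9, -0.8, 0.2, -0.3, 1.0, -0.5, -0.4]
s, frac_outside = residual_screen(residuals)
print(s, frac_outside)
```

With only ten residuals this is a coarse check, as the text cautions for example 9.3; a formal test from chapter 8 is needed for any firm conclusion.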
Under the normality assumption, we have E(ε) = 0 and Var(ε) = σ².
The positive square root of Var(ε) is known as the standard error of the regression equation.
An unbiased estimate (Graybill 1961) for Var(ε) is s², calculated from

s² = Σ eᵢ²/(n − 2)    (9.17)

The least squares estimation procedure produces estimates for a and b such that the standard
error of the regression equation is a minimum.
Another way to look at the coefficient of determination is to write equation 9.13 as

r² = (Σ yᵢ² − Σ eᵢ²)/Σ yᵢ² = 1 − Σ eᵢ²/Σ yᵢ²    (9.18)
Therefore, if the estimated standard error of the regression equation is nearly equal to the
standard deviation of Y, r² will be close to zero and the regression equation is of little value in
explaining variation in Y.
Figure 9.3 depicts the relationships among the pdfs of X, Y, and e in a linear regression.
What is of interest is the spread or variance in the pdf of e, s², in comparison to that of Y, s_y².
The smaller is s² in comparison to s_y², the greater is r² and the stronger is the linear relationship
between Y and X. This is stated mathematically by equation 9.18.
Example 9.3. Is there reason to believe the residuals of example 9.1 are not normally
distributed?
Solution:
95% of the eᵢ should be between −2s and 2s, or between −5.94 and +5.94. An inspection of
table 9.2 shows that none of the 16 observations are outside this interval. The number of
observations is not sufficient to determine if the eᵢ are N(0, σ²); however, there is not sufficient
evidence to reject this possibility.
Confidence limits on β are given by

b ± t₁₋α/₂,n₋₂ s_b

where s_b = s/(Σ xᵢ²)^(1/2), and the hypothesis H₀: β = β₀ is tested by computing

t = (b − β₀)/s_b

which has a t distribution with n − 2 degrees of freedom.
The significance of the overall regression equation can be evaluated by testing the
hypothesis that β = 0. The H₀: β = 0 is equivalent to H₀: r = 0. If this hypothesis is accepted,
then Ŷ may be estimated by Ȳ. Note that if r = 0, equation 9.18 shows that s² ≈ s_y², or the
regression line does not explain a significant amount of the variation in Y. In this situation one
would be as well off using Ȳ as an estimator for Y regardless of the value of X.
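The t statistic for a hypothesis on the slope can be sketched with the quantities of examples 9.4 and 9.5 (s = 2.97, Σ xᵢ² = 570.0559, n = 16); the slope value 0.648 used here is inferred from example 9.1's intercept and the stated means, so treat the numbers as illustrative:

```python
# t statistic for H0: beta = beta0 in simple regression:
# t = (b - beta0) / s_b, with s_b = s / sqrt(sum(x_i**2)),
# compared against the two-sided critical value t_{1-alpha/2, n-2}.
import math

def slope_t(b, beta0, s, sum_x2):
    s_b = s / math.sqrt(sum_x2)      # standard error of the slope
    return (b - beta0) / s_b

# Quantities loosely matching examples 9.4-9.5 (b is illustrative)
t = slope_t(b=0.648, beta0=0.0, s=2.97, sum_x2=570.0559)
print(t)
```

A |t| this far above t₀.₉₇₅,₁₄ = 2.145 is consistent with the text's rejection of H₀: β = 0 for that regression.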
Example 9.4. Compute the 95% confidence intervals on α and β and test the hypothesis that
α = 0 and the hypothesis that β = 0.500 for the regression of example 9.1.
Solution:

s_a = s [1/n + X̄²/Σ xᵢ²]^(1/2)

Because |t| > t₀.₉₇₅,₁₄, we reject H₀: α = 0.
Since |t| < t₀.₉₇₅,₁₄, we cannot reject H₀. The slope is not significantly different from 0.5.
Comment: The significance of the overall regression can be evaluated by testing H₀: β = 0.
Under this hypothesis, t = b/s_b. Because |t| > t₀.₉₇₅,₁₄, we reject H₀. The regression equation
explains a significant amount of the variation in Y.
Confidence Intervals on Regression Line
Confidence intervals on the regression line can be determined by first calculating the
variance of

Ŷₖ = a + bXₖ

where Ŷₖ represents the predicted mean value of Y for a given Xₖ. Therefore

Var(Ŷₖ) = σ² [1/n + (Xₖ − X̄)²/Σ xᵢ²]    (9.25)

Equation 9.25 indicates that the variance of Ŷₖ depends on the particular value of X at which
the variance is being determined. The Var(Ŷₖ) is a minimum when Xₖ = X̄ and increases as Xₖ
deviates from X̄.
Confidence limits on the regression line are now given by

Ŷₖ ± t₁₋α/₂,n₋₂ s [1/n + (Xₖ − X̄)²/Σ xᵢ²]^(1/2)    (9.27)

The variance of an individual predicted value of Y is Var(Ŷₖ) + σ². Confidence intervals on an
individual predicted value of Y could then be estimated from equations 9.27 where the expression

s [1 + 1/n + (Xₖ − X̄)²/Σ xᵢ²]^(1/2)

would be substituted for s [1/n + (Xₖ − X̄)²/Σ xᵢ²]^(1/2). The confidence limits on a future predicted value of Y are the same
as those for an individual predicted value of Y.
Example 9.5. Calculate the 95% confidence limits for the regression line of example 9.1.
Calculate the 95% confidence interval for an individual predicted value of Y for the same
problem.
Solution: s = 2.97, n = 16, Σ xᵢ² = 570.0559, t₀.₉₇₅,₁₄ = 2.145, and X̄ = 42.94. Therefore, from
equations 9.27 we have for the 95% confidence intervals on the regression line
where the − applies to the lower limit, l, and the + to the upper limit, u. Similarly, the 95%
By substituting various values of Xk into these equations, the desired confidence limits are
obtained. These intervals are plotted in figure 9.1.
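Using the quantities stated in example 9.5, the interval half-widths of equations 9.27 can be sketched as follows (the function names are mine):

```python
# Half-widths of the 95% confidence intervals about the regression line
# and about an individual predicted value, from the quantities given in
# example 9.5: s = 2.97, n = 16, sum(x^2) = 570.0559, t = 2.145.
import math

s, n, sum_x2, t, Xbar = 2.97, 16, 570.0559, 2.145, 42.94

def halfwidth_line(Xk):
    # half-width of the CI on the mean response at X = Xk (eq. 9.27)
    return t * s * math.sqrt(1.0 / n + (Xk - Xbar) ** 2 / sum_x2)

def halfwidth_individual(Xk):
    # the individual-prediction interval adds the unit error variance
    return t * s * math.sqrt(1.0 + 1.0 / n + (Xk - Xbar) ** 2 / sum_x2)

# Intervals are narrowest at Xk = Xbar and widen away from it
print(halfwidth_line(Xbar))
print(halfwidth_line(Xbar + 10))
print(halfwidth_individual(Xbar))
```

Plotting these half-widths above and below Ŷₖ for a range of Xₖ reproduces the flaring interval bands of figure 9.1.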
EXTRAPOLATION
The extrapolation of a regression equation beyond the range of X used in estimating α and
β is discouraged for two reasons. First, as can be seen from figure 9.1 and equation 9.27, the
confidence intervals on the regression line become very wide as the distance from X̄ is increased.
Second, the relation between Y and X may be nonlinear over the entire range of X and only
approximately linear for the range of X investigated. A typical example of this is shown in
figure 9.4.
GENERAL CONSIDERATIONS
Many authors discuss several different linear models depending on the assumptions made
concerning Y, X, and E (Graybill 1961; Benjamin and Cornell 1970; Mood and Graybill 1963).
These different models revolve around whether X (or X in multiple regression) is a random or
nonrandom variable, whether measurement errors are made on Y and/or X, the distribution of X
if X is a random variable, and the joint distribution of Y and X if X is a random variable.
[Figure 9.4. A relation that is nonlinear over the full range of X (the true relation) but approximately linear over the range of X sampled.]
Suppose the measured values are Y* = Y + e_Y and X* = X + e_X,
where e_Y and e_X are the measurement errors on Y and X. Thus, the normal equations are solved
in terms of Y* = α + βX* + ε, or Y + e_Y = α + β(X + e_X) + ε = α + βX + βe_X + ε.
Now if e_X is small in comparison to X, this latter equation becomes Y = α + βX + ε − e_Y, or
Y = α + βX + ε′, which can be handled by the methods outlined in this chapter.
Recall that no distributional assumptions are required to get the least squares estimates for
a and b. The assumptions are involved when confidence intervals and tests of hypotheses are of
concern, or when it is desired to state that the least squares estimates for α and β are also maximum likelihood estimates. Johnston (1963) points out that the least squares estimates for α and
β are biased if significant measurement errors are present on X.
One of the assumptions used in developing confidence intervals and tests of hypotheses was
that the εᵢ are independent. If εᵢ is correlated with εᵢ₊₁, the least squares estimates of α and β are
unbiased; however, the sampling variances of a and b will be unduly large and will be underestimated by the least squares formulas for variances, rendering the level of significance of tests of
hypotheses unknown. Also, the sampling variances on predictions made with the resulting equation will be needlessly large. Correlation between εᵢ and εᵢ₊₁ frequently arises when time series
data are being analyzed. This type of correlation is known as autocorrelation or serial correlation.
Johnston (1963) discusses least squares estimation procedures in the presence of autocorrelation.
Autocorrelation of errors is discussed in more detail in the next chapter of this book.
In some situations the assumption of homoscedasticity [Var(εᵢ) = σ² for all i] is violated.
Quite commonly, Var(εᵢ) increases as X increases. Such a situation is depicted in figure 9.5. Draper
and Smith (1966) and Johnston (1963) discuss least squares estimation under this condition.
Another point to be made concerning hypothesis testing in general is that a statistically
significant difference and a physically significant difference are two entirely different quantities.
For example, when the H₀: β = 0 was tested in example 9.4, the conclusion was that the
regression line explained a significant amount of the variation in Y. This refers to a statistically
significant amount of the variation at the chosen level of significance. It means that, recognizing
an α% chance of an error, the relationship Y = a + bX cannot be attributed to chance. It does
not imply a cause and effect relationship between Y and X.
Looking at the confidence limits on the regression as plotted in figure 9.1 and the scatter of
the data, it can be seen that this simple relationship Y = a + bX leaves a lot to be desired in
terms of predicting annual runoff. Whether or not the derived relationship is usable depends on
the use to be made of the predicted values of Y and not on the fact that the H₀: β = 0 is rejected.
It may be that the standard error of the equation, s, is so large as to render the estimate made with
the equation in some particular application too uncertain to be used even though the equation is
explaining a statistically significant portion of the variability in the dependent variable.
Exercises
9.1. The following data are the maximum air and soil temperatures (bare soil at 2-inch depth)
recorded for the first 30 days of July 1973, at Lexington, Kentucky. Derive a linear relationship
via simple regression for predicting the maximum soil temperature from the maximum air
temperature.

[Table of daily maximum air and soil temperatures (°F), July 1973, Lexington, Kentucky; asterisks mark rainfall days.]

Estimate α and β for the resulting regression. Test the hypothesis that (a) the intercept is 0, (b) the slope is 1, and (c) the regression explains a significant amount of the variation in the
maximum soil temperature. Would you recommend using this relationship for predicting maximum soil temperature?
9.2. The asterisks following the soil data in exercise 9.1 indicate days on which rainfall occurred.
Using only these rainfall days, work exercise 9.1.
9.3. Calculate the regression coefficients in the relationship Qₛ = a + bQ where Qₛ is the annual
suspended sediment load and Q is the annual water discharge for the Green River at Munfordville,
Kentucky. Calculate the standard error of the regression equation and the correlation coefficient.
Plot the data along with the 95% confidence intervals on the regression line. Is this a usable
prediction equation?
9.4. Show that the correlation coefficient in simple regression is equivalent to the correlation
between Y and Ŷ.
9.5. Calculate the regression equation for the data of table 9.1 considering the runoff as the independent variable and the precipitation as the dependent variable. Rearrange the resulting
equation to be in the form of the prediction equation of example 9.1. Does the resulting
regression equation agree with the regression equation in example 9.1? Should it agree? Why?
Which equation should be used?
9.6. A technique used by hydrologists to detect changes in the hydrologic response of a watershed
is to examine mass curves for changes in slope. A mass curve is a plot of the accumulation over
time of one variable versus the accumulation over time of a second variable. The data below are
the annual runoff and precipitation for Thorne Creek experimental watershed in Pulaski County,
Virginia. It is thought that there was a change in the hydrologic characteristics of this watershed
during the 11-year period of study. Plot the accumulated precipitation as the abscissa and the
accumulated runoff as the ordinate. Does there appear to be a change in the rainfall-runoff
relationship? During what year? Calculate the slope of the regression lines describing the data
both before and after the apparent change. Test the hypothesis that these slopes are not significantly different.
[Table of annual precipitation and runoff, by year, for the Thorne Creek watershed.]
9.7. Occasionally it is desirable to restrict the intercept of a simple regression to 0, thus requiring the regression line to pass through the origin. Derive the normal equation for the slope in this
case. Use the resulting equation to calculate the slope of the line describing the data plotted for
exercise 9.6. Neglect the apparent change in the slope for this problem (i.e., use all of the data to
estimate b in the equation accumulated runoff = b [accumulated precipitation]).
9.8. Hydrologists frequently use watershed physical characteristics as an aid in studying
watershed hydrology. The data below are the area (square miles) and length (miles) of several
Colorado mountain watersheds (Julian et al. 1967). Derive a linear regression equation for
predicting the area of similar watersheds as a function of the watershed length. Plot the data and
the derived regression line. Plot the 95% confidence intervals on the regression line.
[Table of watershed areas (square miles) and lengths (miles) for the Colorado mountain watersheds.]
10. Multiple Regression

A model for predicting peak runoff would then contain all of these variables. This is an extension of the linear
model discussed in chapter 9 to include several independent variables.
A general linear model is of the form

Y = β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε    (10.1)

where Y is a dependent variable, X₁, X₂, ..., Xₚ are independent variables, β₁, β₂, ..., βₚ are
unknown parameters, and ε is an error component. This model is linear in the parameters, βⱼ,
and for a sample of n observations

Yᵢ = β₁Xᵢ,₁ + β₂Xᵢ,₂ + ⋯ + βₚXᵢ,ₚ + εᵢ    i = 1, 2, ..., n    (10.2)

where Yᵢ is the ith observation on Y and Xᵢ,ⱼ is the ith observation on the jth independent variable.
Equations 10.2 can be written

Y = Xβ + ε    (10.4)

When the model is written in the form of equation 10.5, it is easy to see that Y is an n × 1
vector of observations on the dependent variable, X is an n × p matrix made up of n observations
on each of p independent variables, and β is a p × 1 vector of unknown parameters. For equation
10.4 to have an intercept term, it is necessary that Xᵢ,₁ = 1 for all i. β₁ is then the intercept. In the
following development, it is assumed that Xᵢ,₁ = 1 for i = 1 to n.
The model discussed in chapter 9 is a special case of the general linear model with p = 2,
Xᵢ,₁ = 1, and Xᵢ,₂ = Xᵢ. In matrix notation, the least squares criterion leads to

X'X β̂ = X'Y    (10.7)

which represents the normal equations. The solution of equation 10.7 is obtained by premultiplying by (X'X)⁻¹:

β̂ = (X'X)⁻¹X'Y    (10.8)
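A minimal matrix sketch of the normal equations, using hypothetical data (none of the numbers are from the text); solving the system directly rather than explicitly inverting X'X is the numerically safer route:

```python
# Least squares solution of the normal equations X'X b = X'Y.
# The first column of X is all ones, so b[0] plays the role of
# the intercept beta_1.
import numpy as np

# Hypothetical data: n = 5 observations, p = 3 (intercept + 2 variables)
X = np.array([
    [1.0,  2.0, 1.0],
    [1.0,  4.0, 3.0],
    [1.0,  6.0, 2.0],
    [1.0,  8.0, 5.0],
    [1.0, 10.0, 4.0],
])
Y = np.array([3.0, 7.0, 8.0, 13.0, 14.0])

XtX = X.T @ X
XtY = X.T @ Y
b = np.linalg.solve(XtX, XtY)   # solve, rather than invert, for stability

Yhat = X @ b
e = Y - Yhat
print(b)           # parameter estimates b1, b2, b3
print(e.sum())     # residuals sum to (numerically) zero with an intercept
```

The residual vector is orthogonal to every column of X, which is just the matrix statement of the normal equations.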
The X'X matrix plays an important role in estimating β and in the variance of the b̂ᵢ's.
The X'X matrix is made up of the sums of squares and cross products of the independent variables.
For the p × p matrix X'X to be inverted, its rank must be p. That is, no row or column can be a
linear function of any combination of the other rows and columns. If this occurs, it is known as
multicollinearity.
If we define zᵢⱼ to be (Xᵢⱼ − X̄ⱼ)/sⱼ and let Z = [zᵢⱼ], then Z'Z/(n − 1) is a p × p correlation
matrix, R = [Rᵢⱼ], where Rᵢⱼ is the correlation coefficient between the ith and jth independent
variables. By definition, Rᵢⱼ = 1 for i = j. If |Rᵢⱼ| = 1 for some i ≠ j, then the ith independent
variable is a linear function of the jth independent variable and the rank of X'X will be less than
p. This means that an independent variable cannot be a (perfect) linear function of any other
independent variable. Furthermore, for the rank of X'X to be p, an independent variable cannot
be linearly dependent on any linear function of the remaining independent variables. For example, if p is 4 and X₂ = aX₁ + bX₃ + c, then X₂ is a linear function of X₁ and X₃ so that the rank
of X'X would be at most 3. If there is near linear dependence in X, the calculation of (X'X)⁻¹
may involve roundoff errors and loss of significance leading to nonsensical estimates for β
(Draper and Smith 1966).
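The standardized-variable construction of R can be sketched as follows; the two hypothetical columns are nearly proportional, so the off-diagonal correlation approaches 1, the warning sign just described:

```python
# Correlation matrix of the independent variables, built from the
# standardized scores z_ij = (X_ij - Xbar_j) / s_j as in the text,
# with R = Z'Z / (n - 1).  Off-diagonal entries near +/-1 warn of
# multicollinearity.
import math

def correlation_matrix(cols):
    n = len(cols[0])
    z = []
    for col in cols:
        m = sum(col) / n
        s = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        z.append([(v - m) / s for v in col])
    p = len(cols)
    return [[sum(z[i][k] * z[j][k] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

# Hypothetical watershed variables; the second is nearly twice the first
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [2.1, 3.9, 6.2, 7.8, 10.1]
R = correlation_matrix([X1, X2])
print(R[0][0], R[0][1])   # diagonal is 1; off-diagonal is near 1
```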
As in the case of simple regression, the total sum of squares can be partitioned into three parts.
Draper and Smith (1966) demonstrate that equation 9.10 can be written in matrix notation as

Y'Y = nȲ² + (Y'Y − β̂'X'Y) + (β̂'X'Y − nȲ²)

so that the three components of the total sum of squares, Σ Yᵢ², are:
1. nȲ², the sum of squares due to the mean
2. Y'Y − β̂'X'Y = Σ (Yᵢ − Ŷᵢ)², the residual sum of squares
3. β̂'X'Y − nȲ² = Σ (Ŷᵢ − Ȳ)², the sum of squares due to regression
These components are summarized in the analysis of variance:

Source       Degrees of freedom    Sum of squares                              Expected mean square
Mean         1                     nȲ²
Regression   p − 1                 β̂'X'Y − nȲ²
Residual     n − p                 e'e = (Y − Xβ̂)'(Y − Xβ̂) = Y'Y − β̂'X'Y
Total        n                     Y'Y

The residual mean square, s² = (Y'Y − β̂'X'Y)/(n − p), has expected value σ².
The standard error of the regression equation, σ, is estimated by s. An expression for R² that
is analogous to equation 9.18 is

R² = 1 − Σ eᵢ²/Σ yᵢ²    (10.10)

Again, this shows that if the regression equation is explaining a large part of the variation in Y,
the standard error of the equation will be significantly less than the standard deviation of Y.
Example 10.1. Benson (1962) studied flood frequencies on many streams in the northeastern
United States. The following table contains a partial listing of some of Benson's data. Using these
data: (a) Estimate the regression coefficients for the model

Q = β₁ + β₂A + β₃I + ε

where Q is the mean annual flood in thousands of cfs, A is the watershed area in thousands
of square miles, and I is the average annual maximum 24-hour rainfall depth in inches.
(b) Calculate R². (c) Calculate Q̂ᵢ for each observation on the independent variables. (d) Calculate eᵢ for each Qᵢ.
[Table of Benson's data: station number, Q, A, and I for the 14 stations, together with the computed Q̂ᵢ.]
Solution: To maintain consistency in notation, let Yᵢ = Qᵢ, Xᵢ,₁ = 1, Xᵢ,₂ = Aᵢ, and Xᵢ,₃ = Iᵢ. For this
problem n = 14 and p = 3. The column of data under Q is the 14 × 1 vector Y, a column of 1's
along with the data under A and I is the 14 × 3 matrix X, and the 3 × 1 vector β̂ is made up of
b₁, b₂, and b₃. From equation 10.8, we have
(X'X)⁻¹ is found to be

     3.71678   −0.18094   −1.37537
    −0.18094    0.02028    0.06124
    −1.37537    0.06124    0.52329
The parameter estimates are b₁ = 1.6570, b₂ = 13.1510, and b₃ = 0.0112. From equation
10.10, we get R² = 0.99. This means that 99% of the variation in Y is explained by the regression
equation

Q̂ = 1.6570 + 13.1510A + 0.0112I

Values for Q̂ contained in the above table were calculated from this relationship. Values for eᵢ
were computed from eᵢ = Qᵢ − Q̂ᵢ.
The analysis of variance is:

Source       d.f.    Sum of squares    Mean square
Mean          1         6,606.381
Regression    2        13,182.600       6,591.300
Residual     11           171.090          15.554
Total        14        19,960.071
If a large number of significant figures are not carried in computing the (X'X)⁻¹ matrix,
significant errors can result. To demonstrate this, the elements of the X'X and X'Y matrices were
rounded to two decimal places, resulting in estimates for b of b₁ = 1.10, b₂ = 12.24, and b₃ =
5.28. Computational problems of this type are rarely a problem when using well-established
computer routines unless there is near collinearity in the X matrix.
Because (n − p)s²/σ² has a chi-square distribution with n − p degrees of freedom, confidence
intervals on σ² can be constructed (equation 10.13). Letting cᵢᵢ denote the ith diagonal element
of (X'X)⁻¹, the variance of b̂ᵢ is cᵢᵢσ², and confidence intervals on βᵢ are given by

bᵢ ± t₁₋α/₂,n₋ₚ s_bᵢ    where s_bᵢ = s(cᵢᵢ)^(1/2)

A test of the hypothesis βᵢ = β₀ where β₀ is a known constant can be made by noting that
(bᵢ − β₀)/s_bᵢ has a t distribution. Thus, to test H₀: βᵢ = β₀ versus Hₐ: βᵢ ≠ β₀, the test statistic

t = (bᵢ − β₀)/s_bᵢ    (10.17)

is computed. H₀ is rejected if |t| > t₁₋α/₂,n₋ₚ.
Because in general b̂ᵢ is not independent of b̂ⱼ (their covariance is given by cᵢⱼσ²),
repeated application of equation 10.17 to test H₀: βᵢ = β₀ᵢ and H₀: βⱼ = β₀ⱼ are not independent
tests.
A test of H₀: βᵢ = 0 versus Hₐ: βᵢ ≠ 0 is equivalent to testing the hypothesis that the ith
independent variable is not contributing significantly to explaining the variation in the dependent
variable. If H₀: βᵢ = 0 is not rejected, it is often advisable to delete the ith independent variable
from the model and recalculate the regression.
A test of the hypothesis that the entire regression equation is not explaining a significant
amount of the variation in Y is equivalent to H₀: β₂ = β₃ = ⋯ = βₚ = 0 versus Hₐ: at least one
of these β's is not zero. Since b̂ᵢ is not independent of b̂ⱼ, repeated application of equation 10.17
is not a valid way to test this hypothesis. Use can be made of the fact that the ratio of the mean
square due to regression to the residual mean square has an F distribution with p − 1 and n − p
degrees of freedom. To test H₀: β₂ = β₃ = ⋯ = βₚ = 0, calculate the test statistic

F = [(β̂'X'Y − nȲ²)/(p − 1)]/s²

and reject H₀ if F > F₁₋α,p₋₁,n₋ₚ.
Note that Q² − Q²* is the reduction in the sum of squares due to regression brought about
by deleting k independent variables. If Q²* nearly equals Q², then the deletion of the k variables
has not greatly changed the ability of the model to explain the linear variation in Y. Under these
conditions F will be small and H₀ will not be rejected, indicating that one might eliminate the last
k variables from further consideration. Rejection of H₀ does not imply that all of the last k variables are important; it only implies that at least one of these variables is explaining a significant
amount of the variation in Y.
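The partial F statistic of equation 10.19 can be sketched with the sums of squares reported in examples 10.1 and 10.2 (the function signature is mine); with the figures carried, the two regression sums of squares agree, so the numerator is zero:

```python
# Partial F test (equation 10.19 form) for deleting k independent
# variables: F = ((SSR_full - SSR_reduced) / k) / (SSE_full / (n - p)),
# compared against F_{1-alpha, k, n-p}.

def partial_F(ssr_full, ssr_reduced, sse_full, n, p, k):
    return ((ssr_full - ssr_reduced) / k) / (sse_full / (n - p))

# Sums of squares from the ANOVA of example 10.1 (n = 14, p = 3) and
# the reduced-model regression sum of squares quoted in the text,
# 13,182.60, for dropping the rainfall variable (k = 1).
F = partial_F(ssr_full=13182.600, ssr_reduced=13182.60, sse_full=171.090,
              n=14, p=3, k=1)
print(F)   # essentially zero, so H0: beta_3 = 0 is not rejected
```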
Confidence Intervals on the Regression Line
To place confidence limits on Ŷₕ, where Ŷₕ = Xₕβ̂, it is necessary to have an estimate for the
variance of Ŷₕ:

Var(Ŷₕ) = σ² Xₕ(X'X)⁻¹X'ₕ    (10.20)

which can be estimated by replacing σ² with s². The confidence limits on Ŷₕ are given by

Ŷₕ ± t₁₋α/₂,n₋ₚ [var(Ŷₕ)]^(1/2)    (10.21)

The confidence intervals on an individual predicted value of Y are given by equations 10.21
where var(Ŷₕ) is replaced by the variance of an individual predicted value of Y at Xₕ, which is
given by σ²(1 + Xₕ(X'X)⁻¹X'ₕ).
Other Inferences in Regression
Many other tests of hypotheses can be made and confidence intervals constructed relative to
multiple regression. For example, one might make tests concerning linear relationships among
the β's, or that the β's obtained from one situation are equal to those obtained from another
situation. Reference can be made to Graybill (1961), Johnston (1963), Draper and Smith (1966),
or Neter et al. (1996) for these and other tests.
Example 10.2. For the regression equation of example 10.1: (a) Test the hypothesis that the
regression equation is not explaining a significant amount of the variation of Y. (b) Test the H₀:
β₂ = 0. (c) Test the H₀: β₃ = 0. (d) Calculate the 95% confidence limits on β₂. (e) Calculate the
95% confidence limits on the regression line at the point A = 4,000 square miles and I =
2.0 inches. (f) Calculate the 95% confidence intervals on σ².
Solution:
(a) The tabulated F₀.₉₅,₂,₁₁ is 3.98. Therefore, H₀ is rejected. The regression equation does
explain a significant amount of the variation in Y.
(b) H₀: β₂ = 0    Hₐ: β₂ ≠ 0
(c) H₀: β₃ = 0    Hₐ: β₃ ≠ 0
(e) The 95% confidence limits on the regression line at X₂,ₕ = 4.00 and X₃,ₕ = 2.0 are
determined from equation 10.21. The var(Ŷₕ) is from equation 10.20:

var(Ŷₕ) = 15.554 Xₕ(X'X)⁻¹X'ₕ
(f) The 95% confidence intervals on σ² are calculated from equation 10.13.
The 95% confidence intervals on σ can be obtained by taking the square root of these limits to
obtain 2.80 to 6.69.
Comment: The hypotheses H₀: β₂ = 0 and H₀: β₃ = 0 were both tested in this example as
though the tests were independent. In fact, b̂₂ and b̂₃ are not independent. The cov(b̂₂, b̂₃) can
be determined from c₂₃s² as 0.0612(15.554) = 0.9519. The correlation between b̂₂ and b̂₃ can
be estimated from cov(b̂₂, b̂₃)/(s_b₂ s_b₃) as 0.9519/(0.562 × 2.85) = 0.59. The test of H₀: β₃ =
0 is made relative to the full model that includes all of the β's. The acceptance of H₀ implies that
β₃ = 0 given that β₁ and β₂ are in the model. In general, if there are p β's and H₀: βᵢ = 0 is
tested for each of them, with the result that k of the hypotheses can be accepted, one cannot
eliminate these k variables from the model on the basis of this test alone because each of the
individual H₀: βᵢ = 0 assumes all of the other p − 1 β's are still in the model. To eliminate k
variables at once, the test must be based on equation 10.19.
As an example of the application of equation 10.19, the H₀: β₃ = 0 will be tested. The
ANOVA for the full model is contained in example 10.1. The reduced model is simply Y = b₁ +
b₂X where X is the watershed area in thousands of square miles. Because this is a simple regression situation, we can compute the sum of squares due to regression from b Σ xᵢyᵢ where
b = Σ xᵢyᵢ/Σ xᵢ². The result of this calculation is the sum of squares due to regression for the
reduced model, which is 13,182.60.
The test statistic from equation 10.19 is then essentially zero, since the regression sums of squares for the full and reduced models agree to the figures carried, confirming that H₀: β₃ = 0 cannot be rejected.
Note that F₁₋α,₁,n₋ₚ = t²₁₋α/₂,n₋ₚ, so for the special case where k = 1 variable is being tested, equations 10.17
and 10.19 produce identical results.
Because H₀: β₃ = 0 was not rejected, the next logical step is to eliminate I from the model and consider only A. In so doing the resulting regression equation is
The dependence of the β̂'s again is evident because the intercept is not the same as was obtained when rainfall depth was included in the model. This is a somewhat special example in that β₂ accounts for nearly all of the variation in Y, leaving virtually none of the variation to be explained by β₃. Again, one reason for this unusual situation is the units on Y and A and the proximity of all of the watersheds to each other, resulting in similar rainfalls on all of the watersheds. Unless the relationship between the dependent variable and an independent variable is quite strong, variability in the dependent variable due to variability in the independent variable cannot be detected if there is little variability in the independent variable.
Define the standardized variables

z_ij = (X_ij − X̄_j)/s_j

where X̄_j and s_j are the mean and standard deviation of the jth independent variable. Then define Z = [z_ij] so that the correlation matrix is R = Z'Z/(n − 1).
Here Rᵢⱼ is the correlation between the ith and jth independent variables. R is a symmetric matrix because Rᵢⱼ = Rⱼᵢ. We have already seen that if Rᵢⱼ = 1 for i ≠ j, then either variable i or variable j must be omitted from the model or else the X'X matrix cannot be inverted. If Rᵢⱼ is close to unity (but not equal to unity), then X'X can be inverted and β̂ estimated; however, the var(b̂ᵢ) or var(b̂ⱼ) may be very large. Tests of hypotheses on βᵢ and βⱼ may indicate that neither is significantly different from zero when in fact either βᵢ or βⱼ when used alone may be significantly different from zero. The problem here is that since Xᵢ and Xⱼ are nearly linearly
related, they both are attempting to explain the same thing in the linear model. By having both Xi
and Xj in the model, the part of the variation in Y that either would explain if used alone may be
split between them in such a fashion that neither is significant. In other words, the effect of one
explanatory factor (which may be reflected in either Xi or Xj) is being divided between two
correlated variables.
Retaining variables in a regression equation that are highly correlated (multicolinearity)
makes the interpretation of the regression coefficients difficult. Many times the sign of the
regression coefficient may be the opposite of what is expected if the corresponding variable is
highly correlated with another independent variable in the equation. Multicolinearity is discussed
below.
A common practice in selecting a multiple regression model (and one that is not necessarily
being advocated) is to perform several regressions on a given set of data using different
combinations of the independent variables. The regression that "best" fits the data is then
selected. A commonly used criterion for the "best" fit is to select the equation yielding the largest
value of R2.
Looking at equation 10.21, another and perhaps better criterion is apparent. The confidence
intervals on the regression line are a function of s, the estimated standard error. The line with the
smallest standard error will have the narrowest confidence intervals.
Often the two criteria of the largest R2 and the smallest s give the same results-but not
always. As more variables are added to a regression equation, the R2 value can never decrease.
Thus, from the standpoint of the R2 criterion, one should use all of the available variables. This,
however, makes a clumsy equation and one in which it is extremely difficult to place a meaningful interpretation on the coefficients.
As more variables are added to a regression equation, the standard error may get larger. This can be seen from equation 10.11. Every time a variable is added, n − p gets smaller, as does the numerator Y'Y − β̂'X'Y. However, the numerator may not, and often does not, decrease proportionally to n − p, so that as variables are added s may actually increase. This is a tip-off that the added variables are not contributing significantly to the regression and can just as well be left out.
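The contrast between the two criteria is easy to demonstrate. The Python sketch below (synthetic data; the pure-noise predictor is hypothetical) shows that R² cannot decrease when a variable is added, while s = sqrt(SSE/(n − p)) is free to increase.

```python
import numpy as np

# R^2 vs. standard error s when an unrelated variable is added.
rng = np.random.default_rng(42)
n = 15
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, n)
junk = rng.normal(size=n)          # a predictor unrelated to y

def r2_and_s(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    sse = float(e @ e)
    sst = float(((y - y.mean()) ** 2).sum())
    s = np.sqrt(sse / (len(y) - X.shape[1]))   # s^2 = SSE/(n - p), eq. 10.11
    return 1.0 - sse / sst, s

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([np.ones(n), x, junk])
r2_small, s_small = r2_and_s(X_small, y)
r2_big, s_big = r2_and_s(X_big, y)
# r2_big >= r2_small always; s_big can exceed s_small when the added
# variable contributes essentially nothing to the regression.
```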
All of the variables retained in a regression should make a significant contribution to the
regression unless there is an overriding reason (theoretical or intuitive) for retaining a nonsignificant variable. The variables retained should have physical significance. If two variables are
equally significant when used alone but are not both needed, the one that is easiest to obtain or
easiest to interpret should be used.
The number of coefficients estimated should not exceed 25-35% of the number of observations. This is a rule of thumb used to avoid "over-fitting", whereby oscillations in the equation
may occur between observations on the independent variables.
Thus far all decisions on which regression equation to use have been made by the
investigator. In many cases this is the most reliable method of selecting a regression equation. Using computers, it is possible to perform many regressions on large sets of data. This
has led to several formal procedures for selecting a regression equation. Two methods will be discussed here: all-possible-regressions and stepwise regression. For a discussion of some other techniques, reference should be made to Draper and Smith (1966) and Neter et al. (1996).
EXTRAPOLATION
The comments on extrapolation contained in chapter 9 relative to simple regression are
equally applicable to multiple regression. In multiple regression an additional problem arises. It
is sometimes difficult to tell the range of the data. In example 10.1, A ranges from 0.091 to 8.27
and I ranges from 1.7 to 3.2. Is the point A = 6.0 and I = 2.7 in the range of the data?
A plot of A and I is shown in figure 10.1. From this plot it is apparent that A and I do not
cover the entire range defined by 0.091< A < 8.27 and 1.7 < I < 3.2. The point A = 6.0 and
I = 2.7 does not appear to be in the range of the data. In more than 2 dimensions it is much more
difficult to visualize the range of the data.
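One rough numerical check for hidden extrapolation (a sketch, not a procedure from the text) is to compare the leverage x₀'(X'X)⁻¹x₀ of a candidate point with the largest leverage among the observations used in the fit. The data below are synthetic stand-ins for A and I, constructed so the two variables are correlated as in figure 10.1.

```python
import numpy as np

# Leverage-based check for hidden extrapolation (illustrative only).
rng = np.random.default_rng(7)
n = 30
A = rng.uniform(0.091, 8.27, n)                 # watershed area
I = 1.7 + 0.18 * A + rng.normal(0, 0.1, n)      # rainfall, correlated with A
X = np.column_stack([np.ones(n), A, I])

XtX_inv = np.linalg.inv(X.T @ X)
lev = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii of the data

x0 = np.array([1.0, 6.0, 2.7])                  # candidate point A = 6.0, I = 2.7
h0 = float(x0 @ XtX_inv @ x0)
# h0 noticeably above lev.max() suggests the point lies outside the region
# covered by the data even though it falls inside both marginal ranges.
```

A handy check on the computation: the leverages of the fitted points must sum to p, the number of coefficients.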
AUTOCORRELATED ERRORS
One of the assumptions that is made in linear regression is that the errors are independent.
This means that there should be no correlation between the errors at successive observations.
Correlation in the errors from one observation to the next is common in time series data, especially if the hydrologic system involves considerable storage. For example, if the dependent
variable is the elevation of the ground water in a particular observation well on a monthly basis,
it would not be uncommon that if this water level were under-predicted at a particular time step,
it would tend to be under-predicted in the next time step. Correlation of this type is often called
autocorrelation or serial correlation. The chapters in this book on Correlation and on Time Series
Analysis deal with this topic as well. Neter et al. (1996) has a good treatment of regression when
serial correlation is present.
It is important to note that what is of concern is autocorrelation in the error term of the
regression model, not in the dependent or the independent variables. Often, but not always,
autocorrelation in the dependent variable leads to autocorrelation in the error term of the regression model. Time series data such as daily or monthly streamflow, monthly ground water levels,
and monthly reservoir levels generally have significant serial correlation, and regressions using
these as dependent variables often have serial correlation in the error terms. The error term represents deviations between the predicted and observed values of the dependent variable. Serial
correlation in the predicted variables can arise because the model predicts similar values from
one time step to the next. It is only when over-predictions at one time step tend to follow over-predictions at the previous time step and under-predictions tend to follow under-predictions that
serial correlation in the error term exists.
Serial correlation in the errors can be detected by examining a time series plot of the errors
and noting any patterns. Random scattering of the errors indicates a lack of serial correlation or
independence of the errors. Any pattern in the errors may be indicative of serial correlation. The
correlogram (chapter 14) of the errors can also be computed. A large first order serial correlation
indicates correlated errors.
Estimated regression coefficients in the presence of serial correlation in the errors are unbiased but their variances are incorrectly estimated, and thus the level of significance of hypothesis tests regarding these coefficients is unknown. The standard error of the regression equation is
also affected so that hypothesis tests involving the standard error are also at an unknown level of
significance.
Serial correlation may indicate that one or more important explanatory variables are missing from the regression equation. Serial correlation implies that

ε_t = ρ ε_(t−1) + u_t

where ε_t is the error at time t, ρ is the serial correlation, and u_t is independent with mean zero. Neter et al. (1996) indicate that if u_t is iid N(0, σ²), then ε_t has a mean of 0 and a variance of σ²/(1 − ρ²), where ρ is the first order serial correlation between ε_t and ε_(t−1). This in turn implies that if we form the transformed variables

Y'_t = Y_t − ρY_(t−1)    and    X'_(i,t) = X_(i,t) − ρX_(i,t−1)                (10.25)

we can perform a regression of Y'_t versus the X'_(i,t) and eliminate the problem of serial correlation in the errors. Equation (10.25) requires that ρ be known. It can be estimated by computing the first order serial correlation of the errors from the original equation involving Y_t and the X_(i,t)'s. An alternative is to include Y_(t−1) and the X_(i,t−1) as additional predictors; if the resulting error term turns out to be iid N(0, σ²), standard tests of hypotheses can then be used to eliminate nonsignificant β's. In equation (10.28) the Y_(t−1) and X_(i,t−1) are known as lagged variables.
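A minimal Python sketch of the transformation in equation 10.25, on synthetic data with AR(1) errors (an assumed error structure, not the text's example): estimate ρ from the lag-one correlation of the ordinary residuals, then regress Y_t − ρY_(t−1) on X_t − ρX_(t−1).

```python
import numpy as np

# Remove AR(1) serial correlation via the transformation of eq. 10.25.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1.0, n)          # independent disturbances
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + u[t]   # serially correlated errors, rho = 0.7
y = 1.0 + 2.0 * x + e

# Ordinary regression, then lag-one correlation of its residuals.
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ b
rho_hat = float(r[1:] @ r[:-1]) / float(r @ r)

# Transformed variables (one observation is lost).
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]
Xs = np.column_stack([np.ones(n - 1), x_star])
bs = np.linalg.lstsq(Xs, y_star, rcond=None)[0]
slope = bs[1]                      # estimate of the slope after transforming
```

Note that the transformed regression estimates an intercept of β₀(1 − ρ), so the original intercept is recovered as bs[0]/(1 − rho_hat).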
Lagged variables can often represent changes in storage. We know from continuity that

I − O = ΔS/Δt

where I is inflow, O is outflow, and ΔS is the change in storage for a particular hydrologic system. In many systems of areal extent A, (Y_t − Y_(t−1))A may be proportional to the change in storage from time t − 1 to t, so that

[(I_t + I_(t−1))/2] Δt − [(O_t + O_(t−1))/2] Δt = (Y_t − Y_(t−1))A

A prediction of Y_t might be based on the difference in inflow and outflow from t − 1 to t and on Y_(t−1).
An exact test is not available, but Durbin and Watson have obtained lower and upper bounds d_L and d_U such that if D > d_U the hypothesis cannot be rejected and if D < d_L the hypothesis is rejected. If d_L < D < d_U, the test is inconclusive. Tables of d_L and d_U are contained in the appendix for various values of n and p and for levels of significance equal to 0.05 and 0.01.
Neter et al. (1996) indicate that a test for negative serial correlation can be done by using as
a test statistic 4 - D. The test is then the same as for positive serial correlation. Helsel and Hirsch
(1992) indicate that the Durbin-Watson statistic requires the data to be evenly spaced in time.
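The statistic itself is simple to compute; it is commonly written D = Σ(e_t − e_(t−1))²/Σe_t². A Python sketch on synthetic error series (purely illustrative):

```python
import numpy as np

# Durbin-Watson statistic D = sum((e_t - e_(t-1))^2) / sum(e_t^2).
# D near 2 suggests uncorrelated errors; D well below 2 suggests
# positive serial correlation (compare with the tabled d_L and d_U).
def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(11)
n = 500
white = rng.normal(size=n)         # independent errors
ar = np.zeros(n)                   # positively autocorrelated errors
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + white[t]

d_white = durbin_watson(white)     # expected to be near 2
d_ar = durbin_watson(ar)           # expected to be well below 2
```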
Corrective Action
When serial correlation in the errors is detected, the first step should be to determine if some
important explanatory variable is missing from the regression equation. Often in hydrology the
serial correlation is a result of storage in the system. In this case, a measure of this storage may
need to be included as a predictor variable. In other cases some function of time may correct the
problem.
Aggregating data over longer time periods may reduce or eliminate serial correlation. As the
time between observations increases, the dependence of one observation on another can be
expected to decrease. At large enough time intervals, independence may be achieved.
As indicated earlier, the inclusion of lagged variables, both on the dependent and the independent variables, may help reduce serial correlation. The chapter on time series modeling
should be consulted for more on this topic.
MULTICOLINEARITY
In multiple linear regression it is unfortunate that the predictor variables in X are called "independent" variables. This terminology reflects that Y is being predicted as a function of X. Thus Y has been termed the "dependent" variable because Y is thought to depend on X. By extension, the X's have become known as the "independent" variables because they are what Y is dependent upon. Independence has a special meaning in statistics that differs from the above, as we have seen. We know that if all of the X's in X are mutually independent, then the correlation matrix computed from X will be a diagonal matrix with ones on the diagonal and zeros elsewhere.
In most natural sciences, where "independent" variables are measured values from uncontrolled experimentation, it is rare to achieve true statistical independence. Some level of correlation almost always exists among the predictor variables. These correlations among the independent variables are often called multicolinearities. Much of the discussion on multicolinearity comes from Neter et al. (1996). Generally the term multicolinearity is reserved for the case when rather strong
correlations exist within the X matrix.
As the name implies, multiple linear regression attempts to exploit linear relationships between Y and X to develop a prediction or descriptive equation for Y. If two X variables, say X₁ and X₂, are perfectly linearly related, then r₁,₂ = 1. Furthermore, all of the information relative to a linear relationship between Y and X₂ will be contained in the relationship between Y and X₁. In other words, nothing is gained by including both X₁ and X₂ in a linear regression with Y. As a matter of fact, there is not a unique linear relationship between Y and X₁ and X₂ if X₁ and X₂ are perfectly correlated. Each of the relationships will predict the same value of Y for all X₁ and X₂ pairs that follow the linear relationship between X₁ and X₂.

When X₁ and X₂ are perfectly correlated, the residuals of the regressions of Y on X₁, Y on X₂, and Y on X₁ and X₂ will all be exactly the same since the same information, in a linear sense, will be contained in all three regressions. (Note the brief mention of multicolinearity following equation 10.8.)
If we now relax the requirement that X₁ and X₂ be perfectly correlated to requiring that they be "highly" correlated, an approximation to what is discussed above results. Now the residual sums of squares of regressions of Y on X₁, Y on X₂, and Y on X₁ and X₂ will be nearly the same, depending on the strength of the linear relationship between X₁ and X₂. Thus, if a regression of Y on X₁ is performed followed by a regression of Y on X₁ and X₂, the reduction in the residual sum of squares brought about by the addition of X₂ will be small because very little information in a linear sense is added to the regression.

What may happen if both X₁ and X₂ are included is that the linear effects between Y and X₁ or X₂ may be split between X₁ and X₂ in such a fashion that the regression coefficients do not make physical sense. For example, they may have the wrong sign. Furthermore, the individual regression coefficients may test nonsignificant on both X₁ and X₂ even though the overall regression is significant.
By splitting the importance of either X₁ or X₂ among both X₁ and X₂, the variances of the regression coefficients on X₁ and X₂ become larger, indicating increased sampling variability in these coefficients. Again, this is brought about by splitting the effect of one important linear relationship among two (or more) variables that are closely linearly related. Substantial changes in the values of the regression coefficients upon the addition or removal of a variable from a regression equation are an indication that multicolinearity may be present.

Having both X₁ and X₂ in the regression equation will not cause prediction problems as long as the predictions are confined to the region of X₁ and X₂ defined by the original data sets. This means that values of X₁ and X₂ used for prediction must exhibit the same near linear relationship as did the original values used in estimating the regression coefficients.
Multicolinearity is not restricted to correlations between pairs of X variables. It also includes correlation between any one of the X's and any linear combination of any of the remaining X's. Obviously, correlations between pairs of X's are easily detected from the correlation
matrix of X. Correlations with linear functions of several X's are not always easily detected. One
way to identify the possibility of an X being correlated with a linear combination of the other X's
is to compute the regression of Xᵢ on X*, where X* is X with Xᵢ removed. The multiple Rᵢ² can be examined and used as an indication of multicolinearity. This procedure can be carried out for all of the Xᵢ's, i = 2, . . . , p.
A summary of what has been indicated about the effect of multicolinearity is:
1. Multicolinearity in itself does not inhibit the predictive ability of a regression model provided
the prediction is made within the regions of the independent variables used in deriving the
regression coefficients.
2. Multicolinearity may contribute to an inflated variance in the estimated regression coefficients. The sampling error of the coefficients may be large resulting in individual coefficients
being nonsignificant even though the overall regression is indicating a definite linear relationship exists between Y and X.
3. Individual regression coefficients may be hard to interpret in terms of their impact on Y. They
may even have the wrong sign. Thus, even though the overall equation makes a valid prediction, the contributions of the individual X variables may not be decipherable.
4. The values for individual regression coefficients may change substantially upon the addition
or deletion of an X variable that involves multicolinearity.
Detection of Multicolinearity
Some general indications of the possible presence of multicolinearity that have been identified are:
1. Large correlations in the correlation matrix of X.
2. Large variance inflation factors (VIF), defined for the ith regression coefficient as

VIF_i = 1/(1 − Rᵢ²)

where Rᵢ² is the multiple coefficient of determination between Xᵢ and all of the other X's in the regression equation. When Rᵢ² is zero, Xᵢ is linearly independent of the other X's and the VIF is one. If Rᵢ² = 1, then the var(β̂ᵢ) and the VIF are unbounded. Large values of VIF indicate the presence of multicolinearity. The exact value of VIF at which multicolinearity is declared depends on the individual investigator. Some use a value of 5 and others 10. A VIF of 10 corresponds to an Rᵢ² of 0.90 and a VIF of 5 corresponds to an Rᵢ² equal to 0.80.
Some will compute an average VIF over all p - 1 regression coefficients and declare that if
this average VIF is "considerably" larger than one, multicolinearity is indicated.
Some statistical packages will compute the VIF. Some statistical packages use an indicator called the tolerance, which is 1/VIF. Thus, a VIF of 10 corresponds to a tolerance of 0.1 and a VIF of 5 corresponds to a tolerance of 0.2.
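A sketch of the VIF computation in Python, regressing each Xᵢ on the remaining X's exactly as described above (the near-collinear data are synthetic):

```python
import numpy as np

# Variance inflation factors: VIF_i = 1/(1 - R_i^2), where R_i^2 is the
# coefficient of determination of X_i regressed on the other X's.
rng = np.random.default_rng(5)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = X1 + 2.0 * X2 + rng.normal(0, 0.05, n)   # nearly a linear combination
X = np.column_stack([X1, X2, X3])

def vif(X, i):
    xi = X[:, i]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    b = np.linalg.lstsq(Z, xi, rcond=None)[0]
    e = xi - Z @ b
    r2 = 1.0 - float(e @ e) / float(((xi - xi.mean()) ** 2).sum())
    return 1.0 / (1.0 - r2)

vifs = [vif(X, i) for i in range(X.shape[1])]
tolerances = [1.0 / v for v in vifs]           # tolerance = 1/VIF
```

Because X3 is nearly a linear combination of X1 and X2, its VIF (and, by symmetry, those of X1 and X2) is far above the usual thresholds of 5 or 10.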
Table 10.4. Correlation matrix for data of Haan and Read (1970). (The matrix entries, for runoff, precipitation, A, S, L, P, di, Rs, F, and Rr, are not reproduced here.)
Table 10.5. Regression analysis of data of Haan and Read (1970) (10 independent variables). (The ANOVA entries and the estimated coefficients for the constant, precipitation, A, S, L, P, di, Rs, F, and Rr are not reproduced here; R = 0.98 and Std. Error = 0.69.)
Since the correlation matrix is symmetrical, it is customary to show only the diagonal elements
and the elements either above or below the diagonal.
The mean and standard deviation of runoff are 16.55 and 1.93 inches, respectively. Table 10.5 contains the results of the multiple regression of runoff on all 9 of the independent variables. Because an intercept term was included, p is equal to 10. In the ANOVA table, the sum of squares for the mean and the total sum of squares are not shown. Instead the total sum of squares corrected for the mean is given. The F that is given is the calculated F for the overall regression equation (from equation 10.18) used in testing the hypothesis that the regression does not explain a significant amount of the variation in Y. Because F.95,9,3 is 8.81, this hypothesis is rejected.
The lower part of table 10.5 contains the estimated regression coefficients, the standard errors of the regression coefficients, and the calculated t (equation 10.17) used in testing H₀: βᵢ = 0. The only b's with calculated t's greater than 2.0 are those based on precipitation, P, and Rs. If all of the variables except these three and the intercept are eliminated at one time, the regression shown in table 10.6 results. In going to the second regression, R² has been reduced from 0.97 to 0.91, the F increased to 28.7, and the standard error has remained unchanged. All of the regression coefficients with the exception of the intercept are now significantly different from zero at the one percent level of significance since t.995,9 is 3.25.
Table 10.6. Regression analysis of data of Haan and Read (1970) (4 independent variables)

Analysis of Variance

Source        Degrees of freedom    Sum of squares    Mean square
Regression            3                 40.64            13.55
Residual              9                  4.25              .47
Total                12                 44.89

R = 0.95    R² = 0.91    Std. Error = 0.69    F = 28.7

Variable         Coefficient    Standard error      t
Constant            -9.65           4.440         -2.17
Precipitation        0.430          0.093          4.62
P                    0.620          0.075          8.25
Rs                   0.010          0.002          5.19

(The observed and predicted runoff values in the lower half of the table are not reproduced here.)
The t test used to test the hypothesis that βᵢ = 0 makes the test assuming that all of the other β's are still in the equation. Thus, when a decision is made to eliminate more than one variable, the t's are unreliable and the F test using equation 10.19 should be used. This test determines if several variables are simultaneously making a significant contribution to explaining the variation in the dependent variable. As an illustration of the use of equation 10.19, the hypothesis that β_A = β_S = β_L = β_di = β_Rr = β_F = 0 will be tested. For this example n = 13, p = 10, k = 6, Q₂ = 43.45, Q₂* = 40.64, and Q₁ = 1.44. The F calculated from equation 10.19 is 0.98. Since F.95,6,3 = 8.94, it is concluded that the variables A, S, L, di, Rr, and F are not significant.
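The arithmetic can be checked directly from the quantities quoted above:

```python
# Check of the F statistic from equation 10.19 for the quoted values.
n, p, k = 13, 10, 6
Q2, Q2_star, Q1 = 43.45, 40.64, 1.44   # regression SS (full, reduced); residual SS

F = ((Q2 - Q2_star) / k) / (Q1 / (n - p))
# F is approximately 0.98, far below F(.95; 6, 3) = 8.94, so the six
# variables are not jointly significant.
```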
The resulting prediction model is

Runoff = -9.65 + 0.430(Precipitation) + 0.620(P) + 0.010(Rs)
The observed values of runoff and values predicted from the above equation are shown in
the lower half of table 10.6.
To demonstrate the behavior of s, R², and F, several regressions were run using various combinations of the data in table 10.2. The results of these regressions are summarized in table 10.7 and figure 10.2. This table illustrates that R² never increases as variables are removed from the equation, whereas s may decrease as some variables are removed and then increase as more variables are removed. R² approaches unity as the number of variables is increased. If the number of variables were increased to 12, then p would be 13 (because the model has an intercept) and R² would be unity. In figure 10.2 the lines connect the best values of the quantities s, R², and F contained in table 10.7. This is because it is possible, for example, to have many combinations
For the model

Y = aX^β                                          (10.31)

taking logarithms of both sides gives

ln Y = ln a + β ln X                              (10.32)

Letting Y' = ln Y, X' = ln X, a' = ln a, and β' = β, equation 10.32 becomes

Y' = a' + β'X'                                    (10.33)

with

a = e^(a')  and  β = β'                           (10.34)
Standard regression techniques can now be used to estimate a' and β' for equation 10.33, and a and β are then estimated from equations 10.34. Two important points should be noted. First, the estimates of a and β obtained in this way will be such that Σ(Yᵢ' − Ŷᵢ')² is a minimum and not such that Σ(Yᵢ − Ŷᵢ)² is a minimum. Second, the error term on equation 10.33 is additive (Y' = a' + β'X' + ε'), implying that it is multiplicative on equation 10.31 (Y = aX^β ε). These errors are related by ε' = ln ε. The assumptions used in hypothesis testing and confidence intervals must now be valid for ε', and the tests and confidence intervals made relative to the transformed model.
In some situations the logarithmic transformation makes the data conform more closely to
the regression assumptions. For example, if the data plot as in figure 10.3, a logarithmic transformation may make the assumption of constant variance on the error more realistic.
The normal equations for a logarithmic transformation are based on a constant percentage
error along the regression line, whereas the standard regression is based on a constant absolute
error along the regression line. For example, the difference between Yi = 200 and Yi = 100 on
an arithmetic scale is 100 times as large as the difference between Yi = 2 and Yi = 1. However,
on a logarithmic scale ln 200 − ln 100 = 5.29832 − 4.60517 = 0.69315, which is the same as ln 2 − ln 1 = 0.69315 − 0.00000 = 0.69315. In a situation of this type, the standard regression procedure would attempt to fit the point at Yᵢ = 100 in order to minimize Σ(Yᵢ − Ŷᵢ)² at the expense of the point at Yᵢ = 1 because its contribution to Σ(Yᵢ − Ŷᵢ)² is small. The logarithmically transformed model would give equal percentage weight to both points.

Fig. 10.3. Example of the effect of a logarithmic transformation on the error variance.
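A sketch of the transformed fit in Python, with synthetic data generated from Y = aX^β with multiplicative error (the true values a = 3 and β = 0.8 are assumptions of the example):

```python
import numpy as np

# Fit Y = a * X^beta by regressing ln Y on ln X (eq. 10.32). The fit
# minimizes squared errors in ln Y, i.e., constant percentage error.
rng = np.random.default_rng(9)
n = 100
X = rng.uniform(1, 50, n)
Y = 3.0 * X ** 0.8 * np.exp(rng.normal(0, 0.1, n))   # multiplicative error

Z = np.column_stack([np.ones(n), np.log(X)])
b = np.linalg.lstsq(Z, np.log(Y), rcond=None)[0]
a_hat = float(np.exp(b[0]))    # a = e^(a'), from equations 10.34
beta_hat = float(b[1])         # beta = beta'
```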
The above discussion can be extended to the model Y = a e^(βX), which can be transformed to

ln Y = ln a + βX
the line labelled "overall" results. If two regressions are done, one on the 1991 data and one on
the 1992 data, the two individually labeled lines result. It is possible using indicator variables to
obtain the two individual lines with a single regression using the model

Y = a + bX + cI

where I is an indicator variable. Using this approach, the data would be coded such that I would
be 0 for one of the years (say 1991) and 1 for the other year. The resulting equation would then
effectively be

Y = a + bX            for 1991
Y = (a + c) + bX      for 1992
Thus, the slopes for the two regressions are the same, but the intercepts are a function of
year. The advantage of using the indicator variable is that all of the data are used to estimate a
common slope for the two lines. If two independent regressions were done, the slopes would
likely be different.
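A sketch of the common-slope indicator model Y = a + bX + cI in Python, with synthetic 1991 and 1992 data generated from two parallel lines:

```python
import numpy as np

# Indicator-variable regression: one fit, two parallel lines.
rng = np.random.default_rng(2)
m = 12
x91 = rng.uniform(0, 10, m)
y91 = 1.0 + 0.5 * x91 + rng.normal(0, 0.2, m)        # 1991 line
x92 = rng.uniform(0, 10, m)
y92 = 3.0 + 0.5 * x92 + rng.normal(0, 0.2, m)        # 1992: same slope, shifted

x = np.concatenate([x91, x92])
y = np.concatenate([y91, y92])
I = np.concatenate([np.zeros(m), np.ones(m)])         # 0 for 1991, 1 for 1992

X = np.column_stack([np.ones(2 * m), x, I])
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]
# 1991 line: Y = a + bX;  1992 line: Y = (a + c) + bX
```

Because both years contribute to the single slope estimate b, it is generally more stable than the slopes from two separate regressions.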
Indicator variables can also be used to generate two lines having different slopes using the model

Y = a + bX + cI + dIX

which gives

Y = a + bX                  for 1991
Y = (a + c) + (b + d)X      for 1992
Obviously, the use of indicators uses extra degrees of freedom and thus requires more data for
parameter estimation.
The use of indicator variables can be extended to produce three lines. For three equally spaced lines having a common slope, the appropriate model is

Y = a + bX + cI

where values of −1, 0, and 1 are used for I. The resulting models are

Y = (a − c) + bX      for I = −1
Y = a + bX            for I = 0
Y = (a + c) + bX      for I = 1
Three unequally spaced lines can be generated using the model

Y = a + bX + cI₁ + dI₂                                 (10.38)

with the indicator variables taking the values

Line      I₁    I₂
Line 1     0     0
Line 2     0     1
Line 3     1     0

The resulting lines are

Y = a + bX            for line 1
Y = (a + d) + bX      for line 2
Y = (a + c) + bX      for line 3

Three lines with different slopes can be generated by adding the terms eI₁X and fI₂X to the model, using the same values for the indicator variables as above, with the result

Y = a + bX                  for line 1
Y = (a + d) + (b + f)X      for line 2
Y = (a + c) + (b + e)X      for line 3
Occasionally, it is desirable to fit a line through a set of data such that the line has a definite break in its slope at some fixed point X = C. Figure 10.5 shows such a situation. A regression of the form

Y = a + bX + c(X − C)I

where I = 0 for X ≤ C and I = 1 for X > C, produces

Y = a + bX                     for X ≤ C
Y = (a − cC) + (b + c)X        for X > C
A line with two breaks in slope can be obtained from the model

Y = a + bX + c(X − C₁)I₁ + d(X − C₂)I₂

where C₁ and C₂ (C₁ < C₂) are the values of X at which the slope changes and the indicator variables have values given by

I₁ = 0    for X < C₁
I₁ = 1    for X ≥ C₁
I₂ = 0    for X < C₂
I₂ = 1    for X ≥ C₂

The resulting lines are

Y = a + bX                                  for X < C₁
Y = (a − cC₁) + (b + c)X                    for C₁ ≤ X < C₂
Y = (a − cC₁ − dC₂) + (b + c + d)X          for C₂ ≤ X
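A sketch of the single-break model Y = a + bX + c(X − C)I in Python, with the break point C assumed known and synthetic data that bend at X = 5:

```python
import numpy as np

# Continuous broken-slope regression with a known break at X = C.
rng = np.random.default_rng(4)
n = 80
C = 5.0
x = rng.uniform(0, 10, n)
I = (x > C).astype(float)                    # I = 0 for X <= C, 1 for X > C
y = 1.0 + 0.5 * x + 2.0 * (x - C) * I + rng.normal(0, 0.2, n)

X = np.column_stack([np.ones(n), x, (x - C) * I])
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]
# Slope is b for X <= C and (b + c) for X > C; the upper segment's
# intercept is (a - c*C), matching the equations in the text.
```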
Finally, a line with a change in slope and a jump or discontinuity as shown in figure 10.6 can
be estimated using the model
GENERAL COMMENTS
Regression analysis should be regarded simply as a tool for exploiting linear tendencies that
may exist between a dependent variable and a set of independent variables. It is also a useful
device for estimating the parameters of a model that is linear or can be transformed to a linear model.
Any regression analysis should be preceded by a great deal of thought devoted to what
variables should be included in the analysis, how these variables might influence the dependent
variable, the correlations among the independent variables, and the ease of using a predictive
model based on the selected independent variables. The ready availability of digital computers
and library regression programs has led many to collect data with little thought, throw it into the
computer, and hope for a model. This temptation must be avoided.
Not infrequently, an investigator finds that a satisfactory regression equation cannot be
developed from the data at hand. This should not be surprising because if a relationship exists, it
may be much more complex than is indicated by a linear model. Commonly, factors that are
important in determining the behavior of a dependent variable are omitted from a regression
equation. In this case a good predictive model cannot be expected.
In some regression problems, it is possible to improve the model by including cross-product terms (called interactions) formed by multiplying together two independent variables to form a new variable. Thus, a variable X_t may be defined as X_r X_s. Ratios may be used, such as X_t = X_r/X_s. Powers of variables may also improve the model, X_t = X_r^n, where n is a known constant. If any of these procedures are used, care must be exercised to see that large correlations (see chapter 11) are not built into X'X.
One may frequently know (or think they do, anyway) the factors affecting a particular phenomenon. They cannot, however, easily measure these factors and are forced to use either another related factor or a rough measure of the important factor. For instance, flood peaks depend among
other things on how rapidly the surface runoff reaches a particular point on a stream. This, in turn,
depends on surface flow characteristics such as the steepness and roughness of the flow surface
and the distance the flow must travel to a stream channel, plus the stream flow characteristics
such as roughness, hydraulic radius, slope, tortuosity, length, and so forth. Not all of the factors are linearly related to flood peaks, and they could not all be included in the model if they were. Indices or
summaries are used, such as the average land slope and the average channel slope. It is hoped that
the real causative factors are correlated with these indices sufficiently to reflect this true importance and that the dependent variable is linearly related to the indices. These are large and
important assumptions. They point out that there is a limit to how well one can predict a dependent variable with a regression model.
If it is at all possible, the first step in a regression analysis should be the development of the
form of the predictive model based on a rational analysis of the problem. Regression analysis can
then be used to develop the parameters of the model, test the importance of the variables
included, and develop confidence intervals for the predictions.
LOGISTIC REGRESSION
Frequently, one must be able to classify a variable into one of two possible classes. For
example, in looking at ground water for drinking one might want to class the water as acceptable
(Y = 1) or unacceptable (Y = 0). Based on a set of independent variables, it may be desired to
determine the probability, p, that water from a particular well is acceptable for drinking. This is
equivalent to determining prob (Y = 1).
A regression model for classifying the binary variable Y as a 0 or 1 might be written

Yᵢ = β'Xᵢ + eᵢ                                         (10.45)

Since Y can take on only the values 0 and 1, the expected value of Yᵢ is

E(Yᵢ) = 1(pᵢ) + 0(1 − pᵢ) = pᵢ

so that

pᵢ = β'Xᵢ
A major difficulty with this regression model is that the assumptions of ordinary least squares regression are violated in that the error term is not iid N(0, σ²). Neter et al. (1996) show that the eᵢ are not normal since they can take on only the values 1 − β'Xᵢ when Yᵢ = 1 and −β'Xᵢ when Yᵢ = 0. They also show that σ²ₑᵢ is (β'Xᵢ)(1 − β'Xᵢ), indicating a nonconstant variance.
Experience has shown that often p or E(Y) is related to β'Xᵢ in a sigmoidal fashion as in figure 10.7. Such a function can be expressed as

p = e^(β'X)/(1 + e^(β'X))
or equivalently

p = 1/(1 + e^(−β'X))
Defining the odds, Od, as the ratio of the probability that Y = 1 to the probability that Y = 0, one obtains

Od = p/(1 − p) = e^(β'X)

so that

p' = ln(Od) = β'X

The transformation to p' is sometimes called a logit transformation. As p goes from 0 to 1, p' goes from −∞ to +∞.
Equation 10.51 provides an alternative to equation 10.43. Neter et al. (1996) present details
of maximum likelihood estimation of the β's for equation 10.51. The procedure is known as
logistic regression, with equation 10.51 being the logistic model. Some statistical packages contain routines for carrying out the computations involved in logistic regression. The programs
result in estimates for the β's and the standard error of the estimate for β_i, s_bi.
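The maximum likelihood computation can be sketched in a few lines. The following Python function is an illustration of ours, not taken from any particular package; it fits the logistic model by iteratively reweighted least squares, the Newton-Raphson scheme most packages use, and returns the estimates b_i and their standard errors s_bi:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit p = 1/(1 + exp(-Xb)) by iteratively reweighted least squares
    (Newton-Raphson on the log likelihood).  X must include a column of
    ones for the intercept; y holds the 0/1 responses."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))        # fitted probabilities
        W = p * (1.0 - p)                       # Var(Y_i) = p_i (1 - p_i)
        H = X.T @ (W[:, None] * X)              # information matrix
        b = b + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))     # standard errors s_bi
    return b, se
```

Each z statistic of the following test is then simply b[i]/se[i].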
A test of the hypothesis that β_i = 0 versus β_i ≠ 0 is made by computing

z = b_i/s_bi

and comparing z with the standard normal distribution.
The model may be evaluated either on an independent data set or on the observations used to develop the model. Obviously, this latter procedure does not independently evaluate the model because the same observations are used to evaluate the model as were used to
develop it.
Example 10.3. In a certain locality wetland areas are thought to be impacted by groundwater
pumpage. By examining a wetland, ecologists can determine if a wetland is impacted or not. By
looking at certain bio-indicators such as fungi lines on trees, the normal water level for a wetland
may be determined. Water level records can be used to estimate the median water level. The
distance to the nearest pumping water well is also known. It is desired to develop a model for
classifying a wetland as impacted (Y = 0) or not impacted (Y = 1) based on the difference in the
median water level and the normal water level, X2, and the distance to the nearest pumping well, X3.
Solution: A logistic regression model of the form of equation 10.49 is fit to the data shown in
table 10.8. The results of the logistic regression are shown in table 10.9. Table 10.9 shows that
the overall regression is significant (model chi-square = 40.14) but that the coefficient β_3 on X_3 is not significant
(z = 0.022/0.180 = 0.12). A second logistic regression was then computed eliminating X_3, with the results shown in the following tables.
Table 10.8. Wetland data and fitted values: for each observation, L50 (difference between the median and normal water levels), Impact, Dist (distance to the nearest pumping well), E(Y), predicted impact, and residual. (The tabulated values are not legible in this reproduction.)
Table 10.9. Results of the logistic regression on L50 and Dist

Variable     Regression coefficient     Standard error     z (Beta = 0)     Problem level
Intercept    7.014575                   2.638804           2.66             0.00
L50          3.489421                   1.355271           2.57             0.01
Dist         2.245529 × 10⁻²            0.1801355          0.12             0.90

Model chi-square = 40.14.
Classification Table

                            Predicted 0     Predicted 1     Total
Actual 0   Count            16.00           2.00            18.00
           Row percent      88.89           11.11           100.00
           Column percent   88.89           9.52            46.15
Actual 1   Count            2.00            19.00           21.00
           Row percent      9.52            90.48           100.00
           Column percent   11.11           90.48           53.85
Total      Count            18.00           21.00           39.00
           Row percent      46.15           53.85           100.00
           Column percent   100.00          100.00
Results of the second logistic regression, with Dist eliminated

Variable     Regression coefficient     Standard error     z (Beta = 0)     Problem level
Intercept    7.094155                   2.553149           2.78             0.005460
L50          3.43749                    1.283279           2.68             0.007284

Note: E(Y) = 1/(1 + exp(−Xβ)).
Classification Table (counts, row percents, and column percents of actual versus predicted groups, with scores and residuals for each observation; the values are not legible in this reproduction).
If an E(Y) = 0.78 is used as a cutoff for impact evaluation, only 2 of the wetlands would be
misclassified. The true test of the model would be how it performs on an independent data set.
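A classification table of this kind can be reproduced directly from the fitted probabilities. The following is a minimal sketch (the function name is ours), with an adjustable cutoff such as the E(Y) = 0.78 used here:

```python
import numpy as np

def classification_table(y_actual, p_hat, cutoff=0.5):
    """Cross-tabulate actual 0/1 outcomes against outcomes predicted by
    thresholding the fitted probabilities E(Y) at `cutoff`.
    Rows are actual 0/1; columns are predicted 0/1."""
    y_pred = (np.asarray(p_hat) >= cutoff).astype(int)
    table = np.zeros((2, 2), dtype=int)
    for a, p in zip(np.asarray(y_actual).astype(int), y_pred):
        table[a, p] += 1
    return table
```

The off-diagonal cells count the misclassified observations.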
Exercises
10.1. Use the matrix methods of this chapter to work example 9.1.
10.2. Compute R for example 10.1.
10.3. Use the matrix methods of this chapter to work example 9.4.
10.4. Use the matrix methods of this chapter to work example 9.5. Calculate the confidence
interval for the point X equals 50.0 inches of rainfall.
10.5. Show that Σ(Y_i − Ŷ_i)² = Y'Y − β̂'X'Y, where Ŷ = Xβ̂. (Exercises 10.6 through 10.8 are not legible in this reproduction.)
10.9. The relationship between stage and discharge (rating curve) for many streams has been
found to follow an equation of the type Q = aS^b, where Q is the discharge and S is the stage.
Using the following data from the Cumberland River at Cumberland Falls, Kentucky, derive such
a rating curve. Test the hypothesis that b = 1.5.
10.10. The data in table 10.11 are a partial listing of the data used by Benson (1964) in a study of
floods in the Southwest. Derive a prediction equation for Q, the mean annual flood, in terms of
the remaining variables. Consider both the models given by equation 10.1 and by the multiple
regression extension of equation 10.24.
11. Correlation
IN CHAPTER 3 the population correlation coefficient between two random variables X and
Y was defined in terms of the covariance of X and Y and the variances of X and Y as

ρ_xy = Cov(X, Y)/(σ_x σ_y)     (11.1)

The sample correlation coefficient is

r_xy = s_xy/(s_x s_y)     (11.2)

where s_xy is the sample covariance between X and Y, and s_x and s_y are the sample standard
deviations of X and Y, respectively. Figure 3.5 and the accompanying description discussed some
typical values for r_xy and their meaning. There it was emphasized that 1) r_xy can range from −1
to 1; 2) r_xy = ±1 implies a perfect linear relationship between X and Y; 3) r_xy = 0 implies
linear independence but leaves room for other types of dependence; and 4) if X and Y are
independent, then r_xy = 0.
In chapters 9 and 10 the concept of correlation was extended to give a measure of the
strength of the linear relationship between a random variable Y and a second variable which was
a linear function of one or more X variables, each of which may or may not be a random variable.
Throughout the text many of the results that have been developed have included the
assumption that the random variables were independent or that the sample being analyzed was
composed of random observations. A random observation simply means that every possible
element in the sample space has an equal chance of being selected during any trial.
Random variables may be either uncorrelated (r_xy = 0) or correlated (r_xy ≠ 0). Even when
sampling from uncorrelated populations, it would be rare for the sample correlation coefficient to
be exactly zero. More likely it will deviate from zero due to chance. Thus, statistical tests are
needed to evaluate whether the deviation of the sample correlation coefficient from zero may be
ascribed to chance or whether the deviation is too large to attribute to chance.
If successive observations in a time series of hydrologic data are correlated, this must be
taken into account in any inferences made about the data or in attempts to model the process that
produced the data. Again, a procedure is required for determining if the sampled elements from
a time series can be considered as random. These and other properties of correlation are the subject of this chapter.
If X and Y follow a bivariate normal distribution and ρ = 0, the quantity

t = r√(n − 2)/√(1 − r²)     (11.3)

has a t distribution with n − 2 degrees of freedom, where n is the sample size. Thus, to test
H0: ρ = 0, the test statistic is calculated from equation 11.3 and H0 is rejected if |t| > t_{1−α/2,n−2}.

If n is moderately large (n > 25), then the quantity W is approximately normally distributed
with mean w and variance (n − 3)⁻¹, where

W = (1/2) ln[(1 + r)/(1 − r)] = arctanh r     (11.4)

and

w = (1/2) ln[(1 + ρ)/(1 − ρ)] = arctanh ρ     (11.5)

To test the hypothesis H0: ρ = ρ* against the alternative Ha: ρ ≠ ρ* for ρ*, a known
constant, the quantity

z = (W − w*)√(n − 3)     (11.6)

can be considered to be normally distributed with a mean of zero and a variance of one, where
w* = arctanh ρ*. If |z| > z_{1−α/2} (Z is the standard normal variable), H0 is rejected.
Confidence limits on ρ can be estimated from

tanh(W − z_{1−α/2}/√(n − 3)) ≤ ρ ≤ tanh(W + z_{1−α/2}/√(n − 3))     (11.7)

To test the hypothesis that each of k sample correlation coefficients, based on samples of
size n_i, estimates the same known population value ρ*, note that

X² = Σ_{i=1}^k (n_i − 3)(W_i − w*)²     (11.8)
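The arctanh-based test and confidence limits translate directly into code; the following is a small Python sketch with illustrative function names:

```python
import math

def fisher_z_test(r, rho0, n):
    """z statistic for H0: rho = rho0 using W = arctanh r with
    approximate variance 1/(n - 3)."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

def rho_conf_limits(r, n, z_crit=1.96):
    """Approximate confidence limits on rho: tanh(W +/- z/sqrt(n - 3))."""
    W, h = math.atanh(r), z_crit / math.sqrt(n - 3)
    return math.tanh(W - h), math.tanh(W + h)
```

For example, fisher_z_test(0.34, 0.50, 30) returns about −1.01, the statistic needed in part 1 of example 11.1 below.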
has a chi-square distribution with k degrees of freedom. H0 is rejected if X² > χ²_{1−α,k}. Rejection
of the hypothesis infers that at least one of the ρ_i's is not equal to ρ*.
The hypothesis H0: ρ_1 = ρ_2 = ⋯ = ρ_k (all correlation coefficients are equal) is tested by
noting that

X² = Σ_{i=1}^k (n_i − 3)(W_i − W̄)²     (11.9)

has approximately a chi-square distribution with k − 1 degrees of freedom, where

W̄ = Σ_{i=1}^k (n_i − 3)W_i / Σ_{i=1}^k (n_i − 3)     (11.10)

H0 is rejected if X² > χ²_{1−α,k−1}. Rejection of this hypothesis infers that at least one of the ρ_i's is
not equal to the others.
Example 11.1. Burges and Johnson (1973) present the following sample correlation coefficients
for monthly flow volumes for the Sauk River in Washington and Arroyo Seco in California. In the
following table r_j represents the sample correlation coefficient between the monthly flow
volumes in months j and j − 1. Assume the coefficients are based on 30 observations each and
that the parent populations are all bivariate normal (Burges and Johnson actually used the
lognormal distribution in their study). 1) Test the hypothesis that ρ_8 for the Sauk River is equal to
0.50. 2) Compute the 95% confidence limits for ρ_8 of the Sauk River. 3) Test the hypothesis that
the corresponding coefficient for Arroyo Seco is zero. 4) Test the hypothesis that on each of the streams all of the monthly
correlation coefficients are equal. 5) Assume the hypothesis in part 4 is accepted for the Sauk
River and estimate an average correlation coefficient for the Sauk River.
Month        Sauk River r_j     Arroyo Seco r_j
October      0.61               0.00
November     0.58               0.00
December     0.50               0.00
January      0.31               0.45
February     0.38               0.21
March        0.37               0.70
April        0.44               0.60
May          0.34               0.75
June         0.17               0.98
July         0.65               0.97
August       0.93               0.96
September    0.51               0.00
Solution:
1) H0: ρ_8 = 0.5 for the Sauk River. From equation 11.6,

z = (W − w*)√(n − 3)

where

W = arctanh r_8 = arctanh(0.34) = 0.35409
w* = arctanh(0.50) = 0.54931

so that

z = (0.35409 − 0.54931)√27 = −1.01

Since |z| < 1.96, we cannot reject H0: ρ_8 = 0.5 for the Sauk River.

2) The 95% confidence limits on ρ_8 for the Sauk River are calculated from equation 11.7 as

tanh(0.35409 − 1.96/√27) ≤ ρ_8 ≤ tanh(0.35409 + 1.96/√27)

−0.02 ≤ ρ_8 ≤ 0.62
             Sauk River                   Arroyo Seco
Month i      r_i     W_i = arctanh r_i    r_i     W_i = arctanh r_i
1            0.61    0.71                 0.00    0.00
2            0.58    0.66                 0.00    0.00
3            0.50    0.55                 0.00    0.00
4            0.31    0.32                 0.45    0.49
5            0.38    0.40                 0.21    0.21
6            0.37    0.39                 0.70    0.87
7            0.44    0.47                 0.60    0.69
8            0.34    0.35                 0.75    0.97
9            0.17    0.17                 0.98    2.30
10           0.65    0.78                 0.97    2.09
11           0.93    1.66                 0.96    1.95
12           0.51    0.56                 0.00    0.00
5) For the Sauk River, with n_i = 30 for all i,

W̄ = Σ(n_i − 3)W_i / Σ(n_i − 3) = ΣW_i/12 = 7.02/12 = 0.585

so that the average correlation coefficient is estimated as

r̄ = tanh(0.585) = 0.53
Comment: In parts 4 and 5 of this problem several simplifications were made in the summations
since ni was equal to 30 for all i; in general this cannot be done. In part 5 an overall average
correlation coefficient was calculated. Since in part 4 it was shown that the correlations for the
various months are significantly different, the utility of an overall average correlation is suspect.
Graybill (1961) presents the exact probability distribution of r and states that for small
samples, the exact distribution should be used in hypothesis testing. References to tables that aid
in hypothesis testing for small samples and examples of their use are also given.
Again, it is emphasized that the above tests are based on a random sample from multivariate
normal distributions. Even under these conditions, only the test of H0: ρ = 0 conducted using
equation 11.3 is "exact". The other tests are approximate, with the approximation improving as
the sample size increases.
For non-normal populations, it may be possible to transform the variables to a normal
situation and then apply the above tests to the transformed data. If a transformation of a non-normal random variable is not possible or not desired, then the above tests must be considered as
approximate, with the approximation becoming poorer as the coefficient of skew of the random
variables increases.
SERIAL CORRELATION
It is not uncommon to find in a time series of hydrologic data that an observation at one time
period is correlated with the observation in the preceding time period. Such correlation is termed
serial correlation or autocorrelation. By definition, the elements of a sample of data possessing
serial correlation are not random elements. A serially correlated sample of size n contains less
information about a process than a completely random sample of size n. In a serially correlated
sample, part of the information contained in each observation is already known through its
correlation with the preceding observation.
Such correlation can also exist between an observation at one time period and an
observation k time periods earlier for k = 1, 2, . . . . In this discussion of serial correlation, it is assumed that observations are equally spaced in time and that the statistical properties of the
process do not change with time (a stationary process). The population serial correlation coefficient is denoted by ρ(k) (and frequently called the autocorrelation coefficient), where k is the lag
or number of time intervals between the observations being considered. The sample serial correlation coefficient will be denoted by r(k). The sample serial correlation coefficient for a sample of
size n is given by
r(k) = [Σ_{i=1}^{n−k} x_i x_{i+k} − (Σ_{i=1}^{n−k} x_i)(Σ_{i=1}^{n−k} x_{i+k})/(n − k)] /
       {[Σ_{i=1}^{n−k} x_i² − (Σ_{i=1}^{n−k} x_i)²/(n − k)]^{1/2} [Σ_{i=1}^{n−k} x_{i+k}² − (Σ_{i=1}^{n−k} x_{i+k})²/(n − k)]^{1/2}}     (11.12)
From equation 11.12 it is seen that r(0) is unity. That is, the correlation of an observation
with itself is 1. Equation 11.12 also demonstrates that as k increases, the number of pairs of
observations used in estimating r(k) decreases because all of the summations contain n - k
terms. Serial correlation should only be estimated for k considerably less than n.
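Equation 11.12 can be evaluated directly; the following is a short Python sketch (the function name is ours):

```python
import numpy as np

def serial_corr(x, k):
    """Sample lag-k serial correlation r(k) of equation 11.12: every sum
    runs over the n - k overlapping pairs (x_i, x_{i+k}), for k >= 1."""
    x = np.asarray(x, dtype=float)
    m = len(x) - k                       # number of overlapping pairs
    a, b = x[:m], x[k:]
    num = np.sum(a * b) - np.sum(a) * np.sum(b) / m
    den = np.sqrt((np.sum(a**2) - np.sum(a)**2 / m)
                  * (np.sum(b**2) - np.sum(b)**2 / m))
    return num / den
```

A strictly increasing series gives r(1) = 1 and an alternating series gives r(1) = −1, as expected.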
If ρ(k) = 0 for all k ≠ 0, the process is said to be a purely random process. This indicates that
all of the observations in a sample will be independent of each other. Hydrologic time series are
treated in more detail in chapter 14 and by Yevjevich (1972b), Matalas (1966, 1967b), Julian (1967), and others.
Anderson (1942) has proposed a test of significance for the serial correlation coefficient for
a circular, normal, stationary time series. A circular series is one that closes on itself so that x_n is
followed by x_1. Under these assumptions

r(k) = [Σ_{i=1}^n x_i x_{i+k} − (Σ_{i=1}^n x_i)²/n] / [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n]     (11.13)

with x_{n+j} = x_j. Although the assumption of a circular series is unrealistic, values of r(k) from equation
11.13 will not differ greatly from those calculated from equation 11.12 if n is large in comparison
to k. Under these conditions r(k) will be approximately normally distributed with mean
−1/(n − 1) and variance (n − 2)/(n − 1)² if ρ(k) = 0. The confidence limits on ρ(k) are then
estimated by

[−1 − z_{1−α/2}√(n − 2)]/(n − 1) ≤ ρ(k) ≤ [−1 + z_{1−α/2}√(n − 2)]/(n − 1)     (11.14)

If the calculated r(k) falls outside these confidence limits, the hypothesis that ρ(k) is zero
[H0: ρ(k) = 0 versus Ha: ρ(k) ≠ 0] is rejected.
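The limits of equation 11.14 are easily evaluated; a sketch:

```python
import math

def anderson_limits(n, z_crit=1.96):
    """Approximate confidence limits on r(k) under H0: rho(k) = 0 for a
    circular normal series (Anderson 1942): the statistic has mean
    -1/(n - 1) and variance (n - 2)/(n - 1)**2."""
    lo = (-1 - z_crit * math.sqrt(n - 2)) / (n - 1)
    hi = (-1 + z_crit * math.sqrt(n - 2)) / (n - 1)
    return lo, hi
```

For instance, anderson_limits(18) gives limits of about −0.520 and 0.402; note the asymmetry about zero induced by the mean −1/(n − 1).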
Example 11.2. Frequently, in the analysis of runoff volumes, one finds there is significant serial
correlation caused by storages on the watershed. Appendix C contains a listing of the monthly
and annual runoff volumes for Cave Creek near Lexington, Kentucky. Test the hypothesis that
p(1) = 0 for the annual runoff volumes.
Solution: This solution assumes α = 0.05 and is based on equation 11.14, and therefore assumes
that the annual runoff is normally distributed and is a stationary time series. Furthermore, ρ(1) is
estimated from equation 11.13 assuming that the series is circular [in this case this is equivalent
to assuming x_{n+1} = x_1 in calculating r(1)]. The confidence limits from equation 11.14 are
approximately −0.520 ≤ ρ(1) ≤ 0.402. Since the calculated r(1) falls within these limits,
H0: ρ(1) = 0 is not rejected.
Comment: From the width of the confidence interval, it is apparent that the above test is not very
powerful for small samples. A sample of around 400 observations would be required to reject H0:
ρ(k) = 0 if r(k) = 0.1.
Matalas (1967b) has suggested that for hydrologic data r(1) tends to be greater than zero due
to persistence caused by storage. If r(1) is found to be less than zero, it is in many cases difficult
to explain hydrologically. In this case one might take r(1) as equal to zero.
Matalas and Langbein (1962) state that in an autocorrelated series, each observation
represents part of the information contained in the previous observation. They discuss stationary
time series having r(1) ≠ 0 and r(i) = 0 for i = 2, 3, . . . . They state that n observations of a
nonrandom series having r(1) > 0 give only as much information (measured in terms of a
variance) about the mean as some lesser number, n_e, of observations in a purely random time
series.
This lesser number of observations is called the effective number of observations and is
given by

n_e = n / {1 + 2r(1)/[1 − r(1)] − 2r(1)[1 − r(1)^n]/[n(1 − r(1))²]}     (11.15)

If r(1) = 0, then n_e = n. If r(1) > 0, then n_e < n. Equation 11.15 is expressed graphically as
figure 11.1. As an example, a 50-year record for which r(1) = 0.2 contains only as much
information about the mean as a 33-year record with r(1) = 0. Note that if n is large or r(1) is
small, the last term in the denominator of equation 11.15 can be neglected with little loss
in accuracy.
Fig. 11.1. Relation between n and n, for various values of p(1) (after Matalas and Langbein 1962).
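The effective record length for a series with lag-one serial correlation can be evaluated as follows; this sketch uses the Matalas-Langbein closed form, which reproduces the 50-year/33-year example and figure 11.1:

```python
def effective_n(n, r1):
    """Effective number of independent observations n_e for a series with
    lag-one serial correlation r1 (after Matalas and Langbein 1962)."""
    if r1 == 0.0:
        return float(n)            # a purely random series loses nothing
    denom = (1 + 2 * r1 / (1 - r1)
             - 2 * r1 * (1 - r1**n) / (n * (1 - r1) ** 2))
    return n / denom
```

effective_n(50, 0.2) is about 33.6, matching the example in the text.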
If observations at n stations in a region are cross-correlated with a common correlation
coefficient ρ, the effective number of independent stations is

n' = n/[1 + (n − 1)ρ]     (11.16)

As n gets large, n' approaches 1/ρ. For a ρ of 0.2, the maximum information about the
regional mean contained in n stations could not exceed the information contained in 5 uncorrelated stations.
From a consideration of equation 11.16, it seems it would be logical to establish relatively
few independent hydrologic stations in a region rather than several correlated stations. However,
by the very concept of a hydrologic region, the hydrologic characteristics may be correlated.
Correlation within a region can be exploited to yield improved estimates of a particular
hydrologic variable at a point through correlation with another hydrologic variable at that point
or a similar characteristic at another point. For instance, let Y and X represent two random
hydrologic variables having no serial correlation for which n_1 and n_1 + n_2 observations, respectively, are available. Also consider that Y and X are correlated with a correlation coefficient of
ρ_yx. Now, the record on Y can be extended by using the correlation between Y and X. This
relation is merely a simple regression considering Y as the dependent and X the independent variable. The relation is developed based on the n_1 common observations. From equation 9.15 it can
be shown that the regression between Y and X is given by

ŷ = r_yx (s_y/s_x) x     (11.17)

where r_yx is the estimate for ρ_yx and y and x are deviations from their respective means. Now n_2
estimates of Y can be computed from equation 11.17 based on the n_2 observations on X not
common to the observations on Y. Let Ȳ_1 and Ȳ_2 represent the mean of Y based on the original
n_1 observations and the n_2 estimated observations, respectively. A new weighted mean for Y
based on n_1 + n_2 observations can now be computed from

Ȳ_w = (n_1 Ȳ_1 + n_2 Ȳ_2)/(n_1 + n_2)     (11.18)

For the n_2 additional observations to improve the estimate of Ȳ, it is necessary that r_yx be greater
than 1/(n_1 − 2) (Matalas and Langbein 1962).
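The regression-plus-weighted-mean procedure just described can be sketched as follows (an illustrative function of ours; serial correlation in X and Y is assumed negligible):

```python
import numpy as np

def extend_record(y1, x1, x2):
    """Extend a short record Y using a longer correlated record X.
    y1, x1: the n1 concurrent observations; x2: the n2 extra X values.
    Returns the weighted mean of Y based on n1 + n2 values."""
    y1, x1, x2 = (np.asarray(v, dtype=float) for v in (y1, x1, x2))
    n1, n2 = len(y1), len(x2)
    r = np.corrcoef(y1, x1)[0, 1]
    slope = r * y1.std(ddof=1) / x1.std(ddof=1)     # r_yx * s_y / s_x
    y2_hat = y1.mean() + slope * (x2 - x1.mean())   # regression estimates
    return (n1 * y1.mean() + n2 * y2_hat.mean()) / (n1 + n2)
```

With perfectly correlated records the weighted mean simply averages the observed and regression-estimated values.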
If the random variables Y and X contain significant serial correlation, the situation is
somewhat more complex. Matalas and Langbein (1962), Matalas and Rosenblatt (1962), and
Yevjevich (1972a) contain treatments of this case. In general, serial correlation serves to decrease
the information relative to the mean while cross-correlation tends to improve information relative to the mean.
SPURIOUS CORRELATION
Spurious correlation is any apparent correlation between variables that are in fact
uncorrelated. Spurious correlation can arise due to clustering of data. For example, in figure 11.2,
the correlation of Y with X within either of the data clusters is near zero. When the data from both
clusters are used to calculate a single correlation coefficient, this correlation is found to be quite
high. This is spurious correlation. Figure 11.3 shows a plot of Y versus X where both Y_i and X_i are
random variables obtained by adding 11 to a random observation from a standard normal
distribution. For a sufficiently large sample r_xy would be zero. If both Y_i and X_i are divided by yet
a third random observation Z_i, obtained in the same manner as X_i and Y_i, and the correlation between Y_i/Z_i and X_i/Z_i is computed, for a sufficiently large sample the correlation will be near 0.5.
Figure 11.4 is a plot of Y_i/Z_i versus X_i/Z_i. Figure 11.4 indicates that X_i furnishes information useful in estimating Y_i when in fact Y_i and X_i are uncorrelated. The correlation between Y_i/Z_i and
X_i/Z_i is spurious.
Fig. 11.2. Spurious correlation due to data clustering.
Fig. 11.4. Spurious correlation introduced by dividing two random variables by a common third
random variable.
Pearson (1896-1897) investigated the spurious correlation that can arise between ratios. Let
Y = X_1/X_2 and Z = X_3/X_4. The correlation between Y and Z, r_yz, was found to be a function of
the variances, covariances, and means of the X's. Pearson's derivation assumed that the X's were
normally distributed and that the coefficient of variation of each X was small enough so that its
third and higher powers could be neglected. Reed (1921) arrived at the same results without
specifying the parent distribution of the X's. Pearson's general formula is

r_yz = (r_13 C_1 C_3 − r_14 C_1 C_4 − r_23 C_2 C_3 + r_24 C_2 C_4) /
       [(C_1² + C_2² − 2r_12 C_1 C_2)^{1/2} (C_3² + C_4² − 2r_34 C_3 C_4)^{1/2}]     (11.19)

where r_ij is the correlation between X_i and X_j, and C_i is the coefficient of variation of X_i.
Chayes (1949) and Benson (1965) considered many special cases of equation 11.19. For
example, if X_2 = X_4, r_12 = r_13 = r_34 = 0, r_24 = 1, and C_1 = C_2 = C_3 = C_4, equation 11.19
reduces to r_yz = 0.5, which is the case shown in figure 11.4. Benson (1965) produced a table
(Table 11.1) showing many special cases of ratio and product correlations.
Spurious correlation can arise in hydrology when dimensionless terms or standardized
variables are used. Benson (1965) presents several examples of possible spurious correlation in
hydrology.
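The 0.5 special case is easy to verify by Monte Carlo; here is a sketch following the construction of figures 11.3 and 11.4 (three independent variates with mean 11, so each coefficient of variation is small):

```python
import numpy as np

# Three mutually independent variables; X/Z and Y/Z become correlated
# only through the common denominator Z (spurious correlation).
rng = np.random.default_rng(42)
n = 200_000
x, y, z = (11 + rng.standard_normal(n) for _ in range(3))
r = np.corrcoef(x / z, y / z)[0, 1]   # close to 0.5, per equation 11.19
```

With this sample size r lands very near the theoretical value of 0.5 even though x and y are independent.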
Exercises
11.1. Calculate the first-order serial correlation coefficients for the sediment load and annual discharge data for the Green River at Munfordville, Kentucky. Test the hypothesis that these two
correlations are equal. Discuss the assumptions you have made and how they affect the validity
of the tests you have made.
11.2. Calculate the correlation between the sediment load and annual discharge for the Green
River at Munfordville, Kentucky. Test the hypothesis that this correlation is equal to 0.50.
11.3. Verify the "comment" of example 11.2.
11.4. Calculate the first-order serial correlation coefficient for the Spray River, Banff, Canada.
Test the hypothesis that the first order serial correlation is zero.
11.5. Work exercise 11.4 for the Piscataquis River near Dover-Foxcroft, Maine.
11.6. If the annual runoff from the Spray River, Banff, Canada, is normally distributed, how
many independent observations would provide as much information relative to estimating the
mean annual runoff as does the 45 years of actual record?
11.7. Work exercise 11.6 for the Piscataquis River, near Dover-Foxcroft, Maine and its 54 years
of record.
Table 11.1. Special cases of correlations between ratios and products of random variables (after Benson 1965). (The table is not legible in this reproduction.)
11.8. The following data were collected on two streams in southeastern Kentucky. Use the data
to extend the peak flow record of Cave Branch through 1972. Estimate the average peak flow for
the entire record plus estimated record for Cave Branch. Is this estimated average an
improvement over an estimate based on the actual observed record of Cave Branch?
Peak Flow Data

Year    Cave Branch    Helton Branch        Year    Helton Branch

(The data values are not legible in this reproduction.)
12. Multivariate Analysis

NOTATION
In this chapter an uppercase underlined letter will denote a matrix and a lowercase underlined letter will denote a column vector. Thus Z could be an n × p matrix made up of p n × 1
column vectors z_j for j = 1, 2, . . . , p.
PRINCIPAL COMPONENTS
Often, when data are collected on p variables, these p variables are correlated. This correlation indicates that some of the information contained in one variable is also contained in some of
the other p − 1 variables. The objective of principal components analysis is to transform the p
original correlated variables into p uncorrelated, or orthogonal, components. These components
are linear functions of the original variables. Such a transformation can be written

Z = XA     (12.1)

where X is the n × p matrix of observations, A is a p × p matrix of coefficients, and Z is the
n × p matrix of principal components. With the X's expressed as deviations from their means,
the sample covariance matrix is

S = X'X/(n − 1)     (12.2)

The total system variance, V, is defined as the sum of the variances of the original variables
and can be estimated as

V = Trace S = Σ_{i=1}^p s_{i,i}     (12.4)

The jth principal component is

z_j = Σ_{k=1}^p a_{k,j} x_k = Xa_j     (12.5)

where a_j is a p × 1 column vector of coefficients.
The variance of z_1 is

Var(z_1) = a_1'Sa_1     (12.7)

The first principal component is defined by the vector a_1 that maximizes the variance of z_1 subject to the constraint that
a_1'a_1 = 1. This is a normalizing constraint without which there would be no unique solution.
Equation 12.7 can be maximized by using the Lagrangian multiplier λ_1 to introduce the
constraint a_1'a_1 = 1. Let

Q = a_1'Sa_1 − λ_1(a_1'a_1 − 1)     (12.8)

Q is maximized by differentiating with respect to a_1 and setting the result equal to zero:

∂Q/∂a_1 = 2Sa_1 − 2λ_1 a_1 = 0  or  (S − λ_1 I)a_1 = 0

For the solution of equation 12.8 to be other than the trivial solution a_1 = 0 we must have

|S − λ_1 I| = 0     (12.9)

This is a classical characteristic value problem. λ_1 is called the characteristic root and a_1 the characteristic vector of S. Equation 12.9 has p solutions for λ_1. This is easily seen by considering the
special case of S to be a 2 × 2 matrix, in which case equation 12.9 becomes a polynomial of
degree 2 in λ_1.
Having found the characteristic root, λ_1, of S (the largest of the p roots), the characteristic vector, a_1, is found from
equation 12.8 using the constraint that a_1'a_1 = 1, which is equivalent to Σ_{i=1}^p a_{i,1}² = 1.
The second principal component is found in a similar manner. Now it is desired to find a_2
such that Var(z_2) = a_2'Sa_2 is maximized subject to the constraints that a_2'a_2 = 1 and
a_1'a_2 = a_2'a_1 = 0. This latter constraint guarantees that z_1 and z_2 are orthogonal (uncorrelated).
Using a procedure similar to the above for a_1, let Q be

Q = a_2'Sa_2 − λ_2(a_2'a_2 − 1) − μ a_1'a_2     (12.13)

Differentiating with respect to a_2 and setting the result equal to zero gives
2Sa_2 − 2λ_2 a_2 − μa_1 = 0. Premultiplication by a_1' results in μ = 0, so that

(S − λ_2 I)a_2 = 0     (12.14)

from which it follows that a_2, the coefficients of the second principal component, are the
coefficients of the characteristic vector associated with the second largest characteristic root of
S. Premultiplying equation 12.14 by a_2' also results in λ_2 = a_2'Sa_2 = Var(z_2).
In general, the jth principal component of the p-variate sample X is the linear function z_j = Xa_j, where the elements of a_j are the
elements of the characteristic vector associated with the jth largest characteristic root of S.
From equation 12.1 we can find Z'Z as Z'Z = (XA)'(XA) = A'X'XA = (n − 1)A'SA. It can
be easily shown (see equations 12.24-12.28) that A'SA is a diagonal p × p matrix with the ith
diagonal element equal to λ_i. This matrix may also be written as A'SA = D_λ, where D_λ is the
diagonal matrix whose diagonal elements are the characteristic roots of S.
One property of matrices is that if E is an orthogonal matrix, then the trace of E'FE equals
the trace of F. Therefore

Trace(D_λ) = Trace(A'SA) = Trace(S) = V

However,

Trace(D_λ) = Σ_{i=1}^p λ_i

The sum of the characteristic roots, which equals the sum of the variances of the principal components, also equals the total system variance.
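Computationally, the whole derivation reduces to an eigendecomposition of S; a Python sketch (the function name is ours):

```python
import numpy as np

def principal_components(X):
    """Characteristic roots and vectors of the covariance matrix S,
    sorted largest root first, plus the component scores Z = XA
    (X centered).  Returns (roots, A, Z)."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)        # S = X'X / (n - 1)
    lam, A = np.linalg.eigh(S)          # eigh: S is symmetric
    order = np.argsort(lam)[::-1]       # largest characteristic root first
    lam, A = lam[order], A[:, order]
    return lam, A, Xc @ A
```

The returned roots sum to Trace S (the total system variance), and the columns of Z are uncorrelated with variances λ_j.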
In summary, the principal components have the following properties:
1. z_i and z_j are orthogonal (uncorrelated) for i ≠ j.
2. Var(z_j) = λ_j.
3. λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p.
4. Σ_{i=1}^p Var(z_i) = Σ_{i=1}^p λ_i = Trace S = V.
5. Z = XA, where the jth column of A is the characteristic vector associated with λ_j.
From item 4 above, it can be seen that the fraction of the total variance accounted for by the
jth principal component is λ_j/V. In many situations the first q components account for a large
fraction (say 90% or more) of the system variance, indicating that the last p − q components are not
needed in terms of explaining variance. Many times these last p − q components are discarded, with
the effect that the problem has been reduced from one of dealing with an n × p X matrix containing
correlations to one of dealing with an n × q (q < p) Z matrix that is orthogonal.
The question of how many components are needed to satisfactorily explain the system variance, or what part of the total system variance should be explained, is an unresolved one. Morrison
(1967) suggests that only the first 4 or 5 components should be extracted, since later components
will be difficult to physically interpret in terms of the problem at hand. Unfortunately, there are
no statistical tests that can be used to determine the significance of a component. The sampling
theory of principal components is not well developed, especially when the components are extracted from the correlation matrix rather than the covariance matrix, as in later examples.
The covariance between the original variables, X, and the principal components, Z, is given by

Cov(X, Z) = Cov(X, XA) = SA     (12.16)

The covariance between the variables and the jth component is given by Cov(X, z_j) = Sa_j.
From (S − λ_j I)a_j = 0 we have Sa_j = λ_j a_j. Therefore, Cov(X, z_j) = λ_j a_j, and the covariance
between the ith variable and the jth component is λ_j a_{i,j}. Since Var(x_i) = s_i² and
Var(z_j) = λ_j, the correlation between the ith variable and the jth component is

r(x_i, z_j) = λ_j a_{i,j}/(s_i √λ_j) = a_{i,j} √λ_j/s_i     (12.17)
Equation 12.17 can be used to transform A into a p × p matrix of correlations between the ith
observed variable and the jth computed component. These correlations can then be used in an
attempt to assess the physical meaning of the components.
In some situations, some of the p variables can be eliminated from further consideration
by examining the correlations defined by equation 12.17. If a variable has no significant
correlation with a component, then that variable is not contributing much to the variance of the
component. By eliminating the variable from the component, the fraction of the system
variance explained by the component would be changed very little. The difficulty here is that
this same variable may be correlated with a second component, in which case its
elimination would decrease the variance explained by the second component. For these
reasons, variables are generally eliminated only if they are not correlated with any of the q
components retained for analysis.
Example 12.1. Consider the data in table 10.2. Let X be a 13 × 3 matrix made up of 13
observations on A (mi²), S (%), and L (ft). Compute the principal components of X based on the
covariance matrix. Compute the correlation between the variables and the components.
Solution: S is computed from equation 12.2. (The numerical matrices are not legible in this reproduction.)
Solving these three equations simultaneously for a_{1,1}, a_{2,1}, and a_{3,1} results in

a_{2,1} = −51.43 a_{3,1}  and  a_{1,1} = 1.5503 a_{3,1}

Thus,
The values for the principal components can now be calculated from
The correlation matrix between the variables and the components can be computed from
equation 12.17. For example, the correlation between x_2 and z_1 is

Cor(x_2, z_1) = √λ_1 a_{2,1}/s_2 = (155.963)^{1/2}(−0.999)/(155.769)^{1/2} = −0.9995
The resulting correlation matrix is
Example 12.1 illustrates that using the S matrix in a principal component analysis presents some
problems if the units of the X variables differ greatly. In example 12.1, the magnitude of the
observations associated with the second variable was much greater than that associated with
the other two variables. Consequently, the variance of x_2 was much greater than either Var(x_1) or
Var(x_3). This means that x_2 accounted for 100 Var(x_2)/Trace S, or 96.4%, of the system variance, and
the first principal component is merely a restatement of x_2. This can also be seen from the fact
that the correlation between x_2 and z_1 is −1.000.
In most hydrologic studies the problem of noncommensurate units on the X's has been
handled by standardizing the X's through the transformation (x_{i,j} − X̄_j)/s_j. The covariance matrix
of the standardized variables becomes the correlation matrix, S = R, as can be seen from
equation 12.2. The principal components analysis is then done on R. The total system "variance"
now becomes Trace R = p because R has 1's on the diagonal.
The characteristic roots and vectors are determined from

|R − λI| = 0  and  (R − λ_j I)a_j = 0     (12.18)
The correlation between the ith standardized variable and the jth component (equation 12.17)
reduces to

r(x_i, z_j) = a_{i,j} √λ_j     (12.20)
These correlations are sometimes called factor loadings. The factor loadings can be used to attach
physical significance to the components. If a particular component is highly correlated with 1,2,
or 3 variables, then the component is a reflection of these variables. For example, in a study of
watershed geomorphic factors, it might be found that a component is highly correlated with the
average stream slope and the basin relief ratio. This being the case, that particular component
might be termed a measure of watershed steepness.
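Factor loadings from the correlation matrix can be computed as follows (a sketch; the function name is ours):

```python
import numpy as np

def factor_loadings(X):
    """Loadings a_ij * sqrt(lambda_j): the correlations between each
    standardized variable x_i and each component z_j, computed from
    the correlation matrix R of the data."""
    R = np.corrcoef(X, rowvar=False)
    lam, A = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]          # largest root first
    lam, A = lam[order], A[:, order]
    return A * np.sqrt(np.clip(lam, 0.0, None))
```

Each row of the result gives one variable's correlations with the p components; because the full set of components reproduces R, the squared entries in any row sum to 1.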
The characteristic roots of R are

λ_1 = 1.9692, λ_2 = 0.9273, λ_3 = 0.1035

In this formulation, z_1 accounts for 100(1.9692)/3, or 65.64%, of the system "variance,"
whereas z_2 and z_3 account for 30.91% and 3.45%, respectively. The corresponding characteristic vectors are
Since component 1 is highly correlated with both area and length, this component might be
called a "size" component. Likewise, component 2 might be called a slope component. In terms of
explaining the "variance" of R, component 3 could be eliminated because it explains only 3.45%
of the variance and is not correlated with any of the variables. We cannot eliminate any variables,
however, because component 1 is strongly dependent on x_1 and x_3, whereas component 2 depends
on x_2.
In terms of explaining the variance of R, we have reduced our problem from one of considering
X matrix with correlations to a 13 X 2 Z matrix without correlations (assuming Z3 is
a 13 X 3 discarded).
The values for the components are computed from Z = XA, where X is the matrix of standardized observations and A is the matrix of characteristic vectors.
xij = (Xij − X̄j)/sj and yi = Yi − Ȳ (12.21)
where Yi is the ith observation on Y, Ȳ is the mean of Y, Xij is the ith observation on the jth variable,
and X̄j and sj are the mean and standard deviation of the jth variable. Centering Y is not necessary.
It eliminates the need for an intercept and simplifies notation. The matrix of principal
components, Z, is determined from Z = XA, with A being a p × p matrix whose jth column is
the characteristic vector computed from equation 12.18 with R = X'X/(n − 1).
The regression model is
Y = Zβ + ε (12.22)
where Y is an n × 1 vector whose elements are the n observations of the centered dependent
variable, and Z is an n × p matrix whose elements, Zij, represent the ith value of the jth principal
component.
β is estimated from equation 10.8 as
β̂ = (Z'Z)⁻¹Z'Y
where zj is an n × 1 vector whose elements are the n values of the jth principal component,
so that
Z'Z = [zi'zj]
Now zi'zj is 0 for i ≠ j and is (n − 1)λj for i = j. Thus, Z'Z is a p × p matrix whose off-diagonal
elements (i ≠ j) are all zeros and whose jth diagonal element (i = j) is (n − 1)λj.
(Z'Z)⁻¹ is therefore a diagonal matrix whose elements are
0 for i ≠ j
1/[(n − 1)λj] for i = j
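The diagonal structure of Z'Z can be verified numerically. The sketch below (in Python, with made-up illustrative data) standardizes a small data matrix, forms the components, and checks that Z'Z = (n − 1) diag(λ1, ..., λp):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: n observations on p correlated variables (illustration only).
n, p = 13, 3
X_raw = rng.normal(size=(n, p)) @ np.array([[1.0, 0.4, 0.7],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 1.0]])

# Standardize each column: x_ij = (X_ij - mean_j) / s_j (equation 12.21).
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

R = X.T @ X / (n - 1)               # correlation matrix
lam, A = np.linalg.eigh(R)          # characteristic roots and vectors
Z = X @ A                           # principal components, Z = XA

ZtZ = Z.T @ Z
# Off-diagonal elements are zero; the jth diagonal element is (n - 1)*lambda_j.
print(np.round(ZtZ - (n - 1) * np.diag(lam), 10))
```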
Thus β̂i is independent of β̂j for i ≠ j. The independence of the β̂'s is a result of the orthogonality of the principal components. Since the β̂'s are independent, the t-test given by equation
10.17 can be repeatedly applied to test hypotheses on the β̂'s from a single regression equation.
Furthermore, the numerical values of the β̂'s retained in the regression will not be altered by
eliminating any number of the other β̂'s. This is the distinct advantage of having an orthogonal
matrix of independent variables.
A second advantage of having independent β̂'s is that the interpretation of the β̂'s in terms
of the independent variables is greatly simplified. Thus, if some hydrologic meaning can be
attached to a component through an examination of the factor loadings, hydrologic significance
can also be attached to the β̂'s. Unfortunately, in most hydrologic applications of principal
components analysis, a clear and distinct interpretation of the principal components has not been
possible. This, in turn, means the hydrologic significance of the β̂'s is unclear as well.
Some authors (DeCoursey and Deal 1974) state that yet another advantage for using regression on principal components as compared to normal multiple regression is that the resulting regression coefficients are more stable when applied to a new set of data because the coefficients
are fitted on the basis of only statistically significant orthogonal components. This could imply
that using an equation based on regression on principal components for prediction on a sample
not included in the equation development would have a smaller standard error on this sample
than would a normal multiple regression equation. If this is the case, it would be an important
advantage for the regression on principal components technique. An adequate demonstration of
this hypothesis needs to be developed, however.
A disadvantage of using principal components in a regression analysis is that even if all but
one of the components is eliminated, all of the original variables (the X's) must still be measured
because each component is a function of all of the X's (equation 12.4).
In reporting the results of a regression on principal components, it is generally desirable to
transform the resulting regression equation into an equation in terms of the original X variables.
This can be done since yi = Yi − Ȳ, the β̂'s are known constants, zik = Σj ajk xij, and
xij = (Xij − X̄j)/sj. Thus equation 12.22 becomes
where the β*'s are constants. If only q (q < p) components are retained in the final regression
equation, and the components are rearranged so that the first q components are retained, the first
summation in equation 12.33 would run from 1 to q; however, the second summation would still
run from 1 to p. This means the summation in equation 12.34 would run from 1 to p. It also means
that even though the equation contains only q components, all p of the original variables must be
measured to predict Y.
Some of the original X variables can be eliminated from the analysis before any regressions are performed by examining the factor loadings and eliminating variables that are not
highly correlated with any of the components. The remaining X variables are then resubmitted
310
CHAPTER 12
to a principal components analysis with the multiple regression being performed on the new
components. This procedure has the advantage of reducing the number of variables that must be
measured to use the resulting regression equation. It has the disadvantage of eliminating X variables rather arbitrarily (there is no statistical test for the significance of the factor loadings)
without ever having them in a position to determine their usefulness in explaining the variation
in the dependent variable, Y.
In many applications of regression on principal components, the last p − q components are
discarded before the regression is performed. The number of retained components, q, is selected
so that a large proportion of the variance of X is accounted for. This procedure reduces the
number of coefficients that must be estimated but runs the risk of eliminating a component that
may explain a significant amount of the variation in Y even though it explains little of the
variance of X.
Equation 12.31 gives β̂j = aj'X'Y/[(n − 1)λj], whereas equation 12.32 gives Var(β̂j) =
σ²/[(n − 1)λj]. The statistical significance of β̂j is tested using equation 10.17 with β0 = 0. Thus
the test statistic is
There is no reason to believe before the regression is performed that this test statistic will be
nonsignificant for small values of λj (i.e., for the last p − q components). Therefore, the
regression should be performed on all of the components, and then the components that prove to
be nonsignificant can be eliminated.
The value of the test statistic given by equation 12.35 can be shown to be proportional to the
correlation between Y and zj as follows:
Therefore
or the significance of the jth component is directly proportional to its correlation with the dependent variable. Equation 12.38 can be used to test the significance of the jth component.
At this point, it should be noted that if a dependent variable Y is regressed on p principal
components extracted from a p X p correlation matrix and then transformed via equation 12.33,
the results are identical to those that would be obtained by a direct regression of Y on the original
p variables. This is because multiple regression is a linear operation and the principal
components are independent linear functions of the original variables that explain all of the variance of the variables.
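This equivalence is easy to check numerically. The sketch below (in Python, with made-up data) regresses centered Y on all p components, transforms the coefficients back to the standardized X's, and compares them with a direct least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3

# Made-up correlated predictors and a dependent variable.
Xr = rng.normal(size=(n, p))
Xr[:, 2] = 0.8 * Xr[:, 0] + 0.2 * Xr[:, 2]     # induce correlation
Yr = 2.0 + 1.5 * Xr[:, 0] - 0.7 * Xr[:, 1] + rng.normal(scale=0.5, size=n)

# Center Y and standardize X (equation 12.21).
y = Yr - Yr.mean()
X = (Xr - Xr.mean(axis=0)) / Xr.std(axis=0, ddof=1)

lam, A = np.linalg.eigh(X.T @ X / (n - 1))
Z = X @ A

# Regression on principal components: beta_j = z_j'y / ((n - 1)*lambda_j).
beta = (Z.T @ y) / ((n - 1) * lam)

# Transform back to coefficients on the standardized X's (as in equation 12.33)...
b_from_pc = A @ beta

# ...and compare with a direct regression of y on the standardized X's.
b_direct = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b_from_pc - b_direct, 10))
```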
MULTIVARIATE MULTIPLE REGRESSION
Occasionally, it is desirable to predict several dependent variables from the same set of
independent variables. Such a situation might be predicting the mean annual flood, 10-year peak
flow, and 25-year peak flow for a setting where it is desirable to maintain the correlation among
the dependent variables. This can be accomplished using a multivariate extension of multiple
regression. The prediction model would be
where each dependent variable is an n × 1 vector. Furthermore,
demonstrating that the solution to equations 12.40 is equivalent to q multiple regressions, each
involving the same X but a different vector of dependent variables. Tests of hypothesis
concerning βj can be made using the procedures set forth in chapter 10.
In multivariate regression, as in multiple regression, one commonly has a large number of
independent variables, not all of which are important in predicting the q dependent variables. If q
separate multiple regressions are performed and independent variables eliminated using the procedures of chapter 10, it would be unlikely that the resulting equations would contain the same
set of independent variables.
If the multivariate regression model is used, all q of the prediction equations will contain the
same set of independent variables. Press (1972) presents a procedure for testing the hypothesis that
βi = βi* where βi is a 1 × q vector made up of the coefficients associated with the ith independent
variable for each of the q dependent variables and βi* is a 1 × q vector of constants. To test that
the ith independent variable was not significant would be equivalent to the test that βi = 0. Thus,
a procedure is available for eliminating variables from the regression to produce a usable model.
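A minimal sketch of the idea (in Python, with made-up data): solving the multivariate model in one least-squares call yields the same coefficient matrix as q separate multiple regressions on the same X:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 40, 3, 2                 # n sites, p basin variables, q flow statistics

# Made-up design matrix (intercept column plus predictors) and q responses.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B_true = rng.normal(size=(p + 1, q))
Y = X @ B_true + rng.normal(scale=0.3, size=(n, q))

# Multivariate solution: one (p + 1) x q matrix of coefficients for all q responses.
B_multi = np.linalg.lstsq(X, Y, rcond=None)[0]

# Equivalent to q separate multiple regressions, each on the same X.
B_sep = np.column_stack([np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
                         for j in range(q)])
print(np.round(B_multi - B_sep, 10))
```

The advantage of the multivariate formulation is therefore not numerical but structural: variables are kept or dropped for all q equations at once, so the fitted equations share one set of predictors.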
One distinct advantage in using the same independent variables for estimating several
dependent variables is that the correlation structure of the dependent variables is preserved.
DeCoursey (1973) used such an approach to derive prediction equations for the 2-, 5-, 10-, and
25-year peak flows on watersheds in Oklahoma. In situations like this it is highly desirable to
retain the observed correlations among the dependent variables in the resulting prediction
equations. In the case of flood flows, if this is not done it might be possible to have equations that
are inconsistent and predict, say, a 10-year peak to be greater than the 25-year peak flow.
Another place where retention of the correlation structure among a set of dependent variables
is important is in estimating the parameters descriptive of runoff hydrographs. Rice (1967) discusses this application of multivariate, multiple regression in simultaneously estimating the runoff
volume, peak discharge, and a base time parameter for runoff hydrographs based on data presented
by Reich (1962). Rice states that even though three separate regressions produce slightly better fits
to the original pool of data, the multivariate solution might be more effective in predicting hydrographs for storms on watersheds not included in the original data sample.
CANONICAL CORRELATION
Canonical correlation examines the relationship between two sets of variables. Consider the
n × p matrix X with covariance matrix Σ. Partition X and Σ so that
X = [Y Z]
where Y is n × p1 and Z is n × p2, and
Σ = [Σ11 Σ12; Σ21 Σ22]
where Σ is p × p, Σ11 is p1 × p1, Σ12 is p1 × p2, Σ21 is p2 × p1, and Σ22 is p2 × p2, with p1 +
p2 = p and p1 ≤ p2. In this formulation Σ11 = Var(Y), Σ22 = Var(Z), and Σ12 = Σ21' =
Cov(Y, Z).
Canonical correlation investigates the correlation between Y and Z. Linear functions of Y
and Z are formed and then the correlation between these linear functions determined. Define
U1 = a1'Y' and V1 = a2'Z'
The variances of U1 and V1 are
Var(U1) = Var(a1'Y') = a1'Σ11a1 and Var(V1) = Var(a2'Z') = a2'Σ22a2
Therefore
The partitioning of X into Y and Z has to be done up front by the investigator and is not a result
of the analysis.
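A sketch of the computation (in Python; the variable names and the whitening route used here are implementation choices, not the book's notation): the canonical correlations can be obtained as the singular values of Σ11^(−1/2) Σ12 Σ22^(−1/2), estimated from sample covariances.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2 = 200, 2, 3

# Made-up data: one Z variable is an exact linear function of the Y's, so the
# first canonical correlation must be 1.
Y = rng.normal(size=(n, p1))
Z = rng.normal(size=(n, p2))
Z[:, 0] = Y[:, 0] + Y[:, 1]

Yc, Zc = Y - Y.mean(axis=0), Z - Z.mean(axis=0)
S11 = Yc.T @ Yc / (n - 1)          # sample Var(Y)
S22 = Zc.T @ Zc / (n - 1)          # sample Var(Z)
S12 = Yc.T @ Zc / (n - 1)          # sample Cov(Y, Z)

def inv_sqrt(S):
    """Symmetric inverse square root via the eigendecomposition of S."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# Canonical correlations between U = a1'Y' and V = a2'Z'.
rho = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22), compute_uv=False)
print(np.round(rho, 4))            # first value is 1 by construction
```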
Interpretations and use of canonical correlation in hydrology appear to have some of the
same drawbacks associated with principal components. The problem might be reduced from considering a p1-variate Y and a p2-variate Z to single variates U and V, yet U is a function of all p1
Y's and V is a function of all p2 Z's. Some investigators eliminate some of the Y and Z variables
if the coefficients in a1 or a2 are "small" on those particular variables.
CLUSTER ANALYSIS
The main objective of a regional flood frequency analysis is to develop regional regression
models which can be used to estimate flow characteristics at ungaged stream sites. Hydrologic
data from several gaging stations in hydrologically homogeneous regions are collected and analyzed to obtain estimates of the regression parameters. Identification of these hydrologically homogeneous regions is a vital component in any regional frequency analysis. One method used to
identify these regions is a multivariate statistical procedure known as cluster analysis.
Cluster analysis is a method used to group objects with similar characteristics. Two clustering
methods are used for this purpose. The first type of procedures is known as hierarchical methods, and
they attempt to group objects by a series of successive mergers. The most similar objects are first
grouped and as the similarity decreases, all subgroups are progressively merged into a single cluster.
The second type of procedures is collectively referred to as nonhierarchical clustering techniques
and, if required, can be used to group objects into a specified number of clusters. The clustering
process starts from an initial set of seed points, which will form the nuclei of the final clusters.
The most commonly used similarity measure in cluster analysis is the Euclidean distance,
defined by:
where Dij is the Euclidean distance from site i to site j, p is the number of variables included in
the computation of the distance (i.e., the basin and climatic variables), and zik is a standardized
value for variable k at site i.
In many applications the variables describing the objects to be clustered (discharges, watershed areas, stream lengths, etc.) will not be measured in the same units. It is reasonable to assume
that it would not be sensible to treat, say, discharge measured in cubic meters per second, area in
square kilometers, and stream length in kilometers as equivalent in determining a measure of
similarity. The solution suggested most often is to standardize each variable to unit variance prior
to analysis. This is done by dividing the variables by the standard deviations calculated from the
complete set of objects to be clustered. The standardization process eliminates the units from
each variable and reduces any differences in the range of values among the variables.
To get a feel for how cluster analysis works, consider six precipitation stations and their
associated annual precipitation in mm:
Station         1     2     3     4     5     6
Precipitation  1000  1200   600   700   500  1100
It is desired to see if these stations can be grouped into homogeneous groups based on the average annual precipitation.
The first thing that is done is to standardize the precipitation values. For this set of data, the
mean is 850 and the standard deviation is 288. Table 12.1 contains the data and results. Equation
12.47 is used to calculate Dij. For example, D1,2 is √((0.52 − 1.21)²), which equals |0.52 − 1.21|,
or 0.69. The results for all of the Dij are shown in Section A of table 12.1.
The next step is to find the minimum value of the similarity measure, Dij. This value is seen
to be 0.35. The value 0.35 appears several times. The pair (3, 4) was arbitrarily chosen as the first
similar pair. Section B of table 12.1 contains the Dij values from Section A except for the (3, 4)
row. This row contains the minimum of D3,j and D4,j for j = 1, 2, 5, and 6. For example, D3,1 is
1.39 and D4,1 is 1.04. Therefore, the (3, 4), 1 entry in Section B is 1.04. Other values in the (3, 4)
row are similarly determined.
Again, the minimum entry in Section B is found to be 0.35, corresponding to the (1, 6) pair. Thus
(1, 6) is clustered as in Section C, and entries for Section C are determined from Section B in the same
manner as entries in Section B were determined from Section A. The next step results in (1, 6) and 2
being clustered to form (1, 2, 6). This is followed by (3, 4) being clustered with 5 to form (3, 4, 5).
Table 12.2 is similar to table 12.1 except that the value of precipitation for the third station
is changed from 600 to 1050 mm. Carrying through the analysis as was done for table 12.1 results in forming the clusters (4, 5) and (1, 2, 3, 6).
In table 12.3, the third station value is changed to 1800 mm. The cluster results are (1, 2, 4,
5, 6) and 3. In all of these analyses, the Dij entry is a measure of the similarity that exists. For
Table 12.1 (data portion). Standardized precipitation for the six stations:

Station          1     2      3      4      5     6   Mean  St dev
Precipitation  1000  1200    600    700    500  1100    850     288
z              0.52  1.21  -0.87  -0.52  -1.21  0.87      0       1

(Sections A through E of table 12.1, the successive Dij matrices, are not reproduced here.)
Table 12.3 (data portion). Station 3 changed to 1800 mm:

Station          1     2      3      4      5     6   Mean  St dev
Precipitation  1000  1200   1800    700    500  1100   1050     451
z             -0.11  0.33   1.66  -0.78  -1.22  0.11      0       1

(Sections A through E of table 12.3 are not reproduced here.)
example, in table 12.3, the Dij values of 0.22 indicate strong similarity. The value of 0.44 shows
that stations 4 and 5 are not as similar as are stations 1, 2, and 6. The value 0.67 shows that the
clusters (4, 5) and (1, 2, 6) are less similar than either 4 and 5 or 1, 2, and 6. Finally, the value 1.33
shows that 3 is not very similar to the cluster (1, 2, 4, 5, 6).
Clustering may stop when there is a significant jump in the similarity measure. In table 12.3 one
might conclude with three clusters, (1, 2, 6), (4, 5), and (3), or with two clusters, (1, 2, 4, 5, 6) and 3.
Table 12.4 extends the analysis to consideration of two measures of the stations being considered, precipitation and potential evapotranspiration. Again, Section A was constructed from
equation 12.47. For example, the D1,2 entry is calculated from standardized values as
D1,2 = √((−0.11 − 0.33)² + (−1.21 − 1.21)²), or 2.47. The analysis is completed based on
Section A in the same manner as for tables 12.1–12.3. Here a satisfactory clustering doesn't
appear to exist. It looks as though 2 and 6 might be clustered, but possibly the other stations cannot be clustered.
Table 12.5 is based on the ratio of precipitation to potential evapotranspiration. Using this
system measure, 2, 4, and 6 certainly form a cluster. Depending on the purpose of the analysis,
one might conclude that (1, 3) and (2, 4, 5, 6) represent the final clustering.
Exercises
12.1. Calculate the correlation matrix for the first two variables contained in the tables of exercise 10.8.
12.2. Calculate the characteristic values and characteristic vectors associated with the correlation matrix of exercise 12.1.
12.3. Compute the numerical values of the principal components of the data in the first two
columns of the table in exercise 10.8 (based on the correlation matrix).
12.4. (a) Work exercise 12.1 using the first three variables. (b) Work exercise 12.2 based on the
first three variables. (c) Work exercise 12.3 based on the first three variables.
12.5. (a) Work exercises 12.1, 12.2, and 12.3 based on the covariance matrix. (b) Work exercise
12.4 based on the covariance matrix.
12.6. Work exercise 12.4 using all of the variables in the table of exercise 10.8 except Q,. (Note:
Don't try this without a computer; life is too short!)
12.7. Calculate the factor loadings for the data of (a) exercise 12.2, (b) exercise 12.4, or (c) exercise 12.5.
12.8. Show that Z'Z = (n − 1)D by using as an example the data of exercise (a) 12.2, (b) 12.4,
or (c) 12.5.
Table 12.4 (data portion). Precipitation and potential evapotranspiration (PET) for the six stations:

Station   Precipitation     z1    PET     z2
1                  1000  -0.11    500  -1.21
2                  1200   0.33   1200   1.21
3                  1800   1.66    600  -0.87
4                   700  -0.78    700  -0.52
5                   500  -1.22   1000   0.52
6                  1100   0.11   1100   0.87
Mean               1050      0    850      0
St dev              451      1    288      1

(Sections A through E of table 12.4 are not reproduced here.)
(Table 12.5, the corresponding analysis based on the standardized ratio of precipitation to potential evapotranspiration, is not reproduced here.)
of the interval, the pdf is a uniform distribution. A random digit would be one of the numbers
0, 1, ..., 9 selected in such a fashion that any one of these numbers would have an equal probability
of being selected. In a sample of 100 random digits, the expected result (but with a very low probability of occurrence) would be ten each of the digits 0, 1, ..., 9.
Tables of random numbers are generally available in many statistics books. Computer routines
for generating random numbers are included as a part of the program libraries for most computers.
Care must be exercised when using computer routines in that some generate biased samples.
Many computer routines generate uniformly distributed random numbers in the interval (0, 1).
A uniform random number, Y, in the interval (a, b) can be generated from a uniform random number in the interval (0, 1), Ru, by the relationship Y = (b − a)Ru + a.
Random observations may be generated from probability distributions by making use of the fact
that the cumulative probability function for any continuous variate is uniformly distributed over the
interval 0 to 1. Thus, for any random variable Y with probability density function pY(y), the variate
PY(y) is uniformly distributed on (0, 1). The procedure is:
1. Generate a uniform random number, Ru, on the interval (0, 1).
2. Set PY(y) = Ru.
3. Solve for y.
Step 3 in this procedure is known as obtaining the inverse transform of the probability
distribution.
Fig. 13.1. Procedure for generating a random observation from a probability distribution.
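For the exponential distribution, whose cdf PY(y) = 1 − e^(−λy) inverts analytically, the three-step procedure can be sketched as follows (in Python):

```python
import math
import random

def exponential_variate(lam, rng):
    """Inverse-transform sampling: set P_Y(y) = 1 - exp(-lam*y) equal to a
    uniform random number R_u and solve for y."""
    r_u = rng.random()                  # step 1: uniform on (0, 1)
    return -math.log(1.0 - r_u) / lam   # steps 2-3: invert the cdf

rng = random.Random(7)
sample = [exponential_variate(2.0, rng) for _ in range(20000)]
print(sum(sample) / len(sample))        # close to the population mean 1/lambda = 0.5
```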
DATA GENERATION
323
By substituting Ru for Py(y), random values of Y from the 3-parameter Weibull distribution can
be generated from
For some distributions it is not possible to solve equation 13.1 explicitly for y. That is, an
analytic inverse transform cannot be found. The normal and gamma distributions are examples
of this. Fortunately, in the case of the normal distribution, numerically generated tables of
standard random normal deviates are widely available. A standard random normal deviate is a
random observation from a standard normal distribution. Random observations for any normal
distribution can be generated from the relationship
where RN is a standard random normal deviate and μ and σ are the parameters of the desired normal
distribution of Y. Computer routines are available for generating standard random normal deviates.
For some distributions, relationships with other distributions can be used in the generating process. For example, a gamma variate with integer values for η has been shown to be
the sum of η exponential variates, each with parameter λ. Therefore, gamma variates with
integer values for η can be generated by summing η values generated from an exponential
distribution.
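A sketch of this sum-of-exponentials construction (in Python; the comparison with the population mean η/λ is only a rough sanity check on a large sample):

```python
import math
import random

def gamma_integer_shape(eta, lam, rng):
    """Gamma variate with integer shape eta and scale lam, formed as the sum
    of eta independent exponential variates, each with parameter lam."""
    return sum(-math.log(1.0 - rng.random()) / lam for _ in range(eta))

rng = random.Random(11)
sample = [gamma_integer_shape(3, 2.0, rng) for _ in range(20000)]
print(sum(sample) / len(sample))   # near the population mean eta/lambda = 1.5
```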
Whittaker (1973) discusses a method for generating random gamma variates with any
shape parameter η. Because the gamma distribution is closed under addition, a gamma random
variable with any shape parameter can be constructed if one with a shape parameter in the
interval 0 < η < 1 can be constructed. Let Ru1, Ru2, and Ru3 be independent uniform random
variables on (0, 1). Define S1 and S2 by
S1 = Ru1^(1/η) and S2 = Ru2^(1/(1−η))
If S1 + S2 ≤ 1, define
Y = −ln(Ru3) S1/[λ(S1 + S2)]
Then Y has a gamma distribution with shape parameter η and scale parameter λ.
This procedure requires the generation of at least 3 uniform random variables. If S1 + S2
> 1, then Ru1 and Ru2 are rejected and new values generated. The probability that S1 + S2 ≤ 1
is given as πη(1 − η) cosec(πη); it has a minimum of π/4 at η = 1/2 and is symmetric about
this value.
To generate a gamma variate with η > 1, a gamma variate with an integer shape parameter
and a gamma variate with a shape parameter < 1 can be added, as long as the scale parameter, λ, is held constant. For
example, to generate a gamma random variate with η = 3.6 and any λ, a gamma variate with
η = 3 and λ can be added to a gamma variate with η = 0.6 and λ.
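The scheme can be sketched as follows (in Python); the expressions for S1 and S2 follow the reconstruction above and should be checked against Whittaker (1973):

```python
import math
import random

def gamma_fractional(eta, lam, rng):
    """Gamma variate with 0 < eta < 1 by the rejection scheme described in the
    text (a reconstruction; verify the expressions against Whittaker 1973)."""
    while True:
        s1 = rng.random() ** (1.0 / eta)
        s2 = rng.random() ** (1.0 / (1.0 - eta))
        if s1 + s2 <= 1.0:     # accept; otherwise reject and draw again
            return -math.log(1.0 - rng.random()) * s1 / ((s1 + s2) * lam)

def gamma_variate(eta, lam, rng):
    """Any shape eta > 0: add an integer-shape variate (sum of exponentials)
    and a fractional-shape variate, holding the scale parameter lam constant."""
    k, frac = int(eta), eta - int(eta)
    y = sum(-math.log(1.0 - rng.random()) / lam for _ in range(k))
    if frac > 0.0:
        y += gamma_fractional(frac, lam, rng)
    return y

rng = random.Random(13)
sample = [gamma_variate(3.6, 2.0, rng) for _ in range(20000)]
print(sum(sample) / len(sample))   # near the population mean eta/lambda = 1.8
```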
Table 13.1 presents a summary of some analytical methods for generating observations from
selected common probability distributions. The table is modified from Hahn and Shapiro (1967).
Computer routines are available for generating random numbers from many different probability
distributions.
Where analytical inverse transforms cannot be found, numerical procedures can be
employed. One numerical method is to select a random number between 0 and 1 and then
numerically integrate equation 13.1 along the x-axis until the accumulated integral equals the
selected random number. At this point y would be equal to the value of x that had been reached.
A second numerical method, and one that would be faster if a large number of random
observations were needed, would be to numerically integrate equation 13.1 starting at the
extreme left of the distribution. The integration would proceed to the right in small increments
along the x-axis until the accumulated integral was sufficiently close to 1. At each step of the
integration, the value of x and the accumulated integral would be saved in the form of a table. The
generation process would then consist of selecting a random number in the interval (0, 1),
entering the table with this random number considered as an accumulated integral, and finding
the corresponding value of x. This value of x would then be set equal to the desired random
variate y.
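A sketch of this table-lookup method (in Python), using the standard normal, which lacks an analytic inverse, as the example distribution; the grid spacing and integration range are arbitrary choices:

```python
import bisect
import math

# Tabulate the cdf by numerically integrating the pdf along the x-axis.
def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

dx = 0.001
xs, cdf = [-8.0], [0.0]
while xs[-1] < 8.0:
    x = xs[-1] + dx
    cdf.append(cdf[-1] + 0.5 * (pdf(xs[-1]) + pdf(x)) * dx)  # trapezoidal step
    xs.append(x)

def inverse_lookup(p):
    """Enter the table with a random number treated as an accumulated
    integral and return the corresponding value of x."""
    i = bisect.bisect_left(cdf, p)
    return xs[min(i, len(xs) - 1)]

print(round(inverse_lookup(0.5), 3))   # about 0.0 for the standard normal
```

Once the table is built, each generated observation costs only one uniform draw and one lookup, which is why this method is faster when many observations are needed.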
Example 13.1. Generate 22 observations from an exponential distribution with λ = 2. Plot the
observations on semilogarithmic (probability) paper. Estimate λ from the observations.
Solution: The 22 observations are generated from the relationship y = −ln(Ru)/λ, where
Ru is a randomly selected number in the interval 0 to 1. The values of Y so generated are
shown below. Figure 13.2 is a plot of the resulting numbers along with the lines describing
the exponential distribution with parameter λ = 2 and with parameter λ̂ = 1.718 calculated
as λ̂ = 1/Ȳ.
Comment: This problem illustrates the random variations possible when sampling from a probability distribution. As the sample size increases, λ̂ should approach λ and the plotted points will
lie more nearly on the line describing the exponential distribution with λ = 2.
In hydrologic frequency analysis, the data represent a sample from an unknown population.
Thus, uncertainty as to the proper frequency distribution exists as well as uncertainty in the
values for population parameters.
326
CHAPTER 13
(The table of generated Ru values, their ranks, and the corresponding plotting positions is not
reproduced here; the 22 generated values sum to 12.806.)
Several exercises at the end of this chapter are designed to help develop a "feel" for the
scatter that can be expected when sampling from various frequency distributions. Problems
dealing with testing distributional assumptions are also included. These problems demonstrate
that for small samples a single set of data is not a reliable indicator of the distribution that generated the sample.
where z is a 1 × p vector consisting of a single value for each of the p uncorrelated components.
The mean and variance of the jth principal component are 0 and λj, respectively. This equation
can be used to generate standardized normally distributed random variables that preserve their
correlation. The components are uncorrelated, so a random value of z is generated as
z = (z1, z2, ..., zp), where zk is a random observation from a normal distribution with mean zero
and variance λk. Postmultiplying by A' then produces x. n values of x can be generated by
repeating this process n times.
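Equation 13.8 can be sketched numerically (in Python) with the correlation matrix of example 12.2: generate z with independent N(0, λk) entries, postmultiply by A', and check the sample correlations of a large generated sample:

```python
import numpy as np

# Correlation matrix from example 12.2 and its characteristic roots and vectors.
R = np.array([[ 1.0000, -0.1713,  0.8958],
              [-0.1713,  1.0000, -0.2059],
              [ 0.8958, -0.2059,  1.0000]])
lam, A = np.linalg.eigh(R)

rng = np.random.default_rng(17)
n = 50000

# Each row of z has independent N(0, lambda_k) entries; postmultiplying by A'
# yields standardized variables whose correlation matrix is R.
z = rng.normal(size=(n, 3)) * np.sqrt(lam)
x = z @ A.T

print(np.round(np.corrcoef(x, rowvar=False), 3))   # close to R
```

With n = 20 instead of 50,000, the sample correlations scatter noticeably around R, which is exactly the behavior example 13.2 illustrates.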
Example 13.2. Generate a sample of 20 observations from the 3-variate normal distribution
having the properties μ1 = 3.173, μ2 = 16.462, μ3 = 2.566, σ1 = 2.113, σ2 = 12.481, σ3 =
1.150, ρ1,2 = −0.1713, ρ1,3 = 0.8958, and ρ2,3 = −0.2059.
Solution: This correlation structure corresponds to the correlation matrix in example 12.2. The
procedure is to first generate 20 observations from a 3-variate standard normal distribution
having the desired correlation structure by 20 applications of equation 13.8. The matrix A is
contained in example 12.2. A 1 × 3 vector z is generated as (z1, z2, z3), where zi is a random
observation from a normal distribution with mean 0 and variance λi. The λi are obtained from
example 12.2 as 1.9692, 0.9273, and 0.1035. The 1 × 3 vector x = (x1, x2, x3) is then computed
from equation 13.8. Finally, a 1 × 3 vector y = (y1, y2, y3) is computed as yi = xiσi + μi. This
process is repeated 20 times, generating the required 20 values for y. The following matrix
contains the resulting 20 observations on Y.
The means, standard deviations, and correlations of this Y are shown below. These statistics
are not the same as the desired population parameters (as expected) because they are based on a
random sample of size 20.
The above procedure was also carried out for samples of 200 and 999 observations with the
results shown below. Again, it should be kept in mind that these results are based on random samples. A second random sample of the same size would result in different estimates for the
population parameters.
         Population   n = 20   n = 200   n = 999
μ1           3.17       3.79      3.12      3.13
μ2          16.46      12.83     17.10     15.48
μ3           2.57       2.90      2.58      2.54
σ1           2.11       1.52      2.22      2.08
σ2          12.48      13.14     11.39     12.38
σ3           1.15       0.94      1.20      1.12
ρ1,2        -0.17      -0.13     -0.13     -0.20
ρ1,3         0.90       0.89      0.91      0.90
ρ2,3        -0.21       0.03     -0.15     -0.23
Example 13.3. Generate a sample of 20 observations having the properties given in example
13.2 except assume that the distributions of the random variables X1, X2, and X3 are normal,
exponential, and lognormal, respectively. Note that X2 can only be approximately exponential
because μ2 is not exactly equal to σ2.
Solution: The first steps are the same as for example 13.2. The Y matrix is then transformed
to a cumulative probability matrix P, and P is transformed to the required X by finding the
value of X having cumulative probability pij under the appropriate distribution.
For X3, the mean and standard deviation of the logarithms are computed from equations
6.30–6.31. The ln(X3) is then obtained as the inverse of the normal distribution having the determined logarithmic mean and standard deviation. Finally, X3 is the antilogarithm of the
resulting normally generated value. A part of the indicated matrices are shown here:
As an example, consider p1,2. This is the probability that Y2 < −5.56 if Y2 is N(16.462,
12.481²). The value of this probability is 0.038829. In general, pij is the probability that Yj < yij if Yj
is N(μj, σj²). To transform pij to the actual xij, one finds the value of xij satisfying prob(Xj < xij)
= pij, where the probability is based on the appropriate pdf. For example, x1,2 is from the
cumulative exponential distribution whose value is p1,2.
The value of x1,3 is generated from a lognormal distribution. Using equations 6.35 and 6.36,
the mean and standard deviation of the logs of Y3 are found to be 0.85083 and 0.42782. The value
of the standard normal distribution having a cumulative probability of 0.610928 is 0.281739.
Thus, x1,3 is given as
exp[0.85083 + 0.42782(0.281739)] = 2.64
This procedure was also carried out generating 1000 observations. (The resulting summary
statistics are not reproduced here.)
Figure 13.3 shows histograms of the resulting simulations of the 1000 observations. The
histograms indicate the data generally follow the desired distributions.
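The marginal transformation of example 13.3 can be sketched as follows (in Python, using the standard library's NormalDist). For brevity, the cumulative probabilities p are drawn directly as uniforms rather than carried over from generated normal Y's, which reproduces the marginal behavior but not the cross-correlations; the parameter values are those of the example:

```python
import math
import random
from statistics import NormalDist

rng = random.Random(19)
mu2, mu3, sigma3 = 16.462, 2.566, 1.150

# Lognormal parameters of ln(X3) (equations 6.35-6.36).
s_ln = math.sqrt(math.log(1.0 + (sigma3 / mu3) ** 2))
m_ln = math.log(mu3) - 0.5 * s_ln ** 2

x2, x3 = [], []
for _ in range(20000):
    # A cumulative probability p (standing in for p_ij of the example) is
    # inverted through each target marginal distribution.
    p = rng.random()
    x2.append(-mu2 * math.log(1.0 - p))                     # exponential, mean mu2
    x3.append(math.exp(NormalDist(m_ln, s_ln).inv_cdf(p)))  # lognormal

print(round(sum(x2) / len(x2), 1), round(sum(x3) / len(x3), 2))
```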
Fig. 13.3. Frequency histograms for data generated for example 13.3.
APPLICATIONS OF DATA GENERATION
Data generation techniques or Monte Carlo simulation have been widely used in hydrology.
These uses range from generating large samples of data from known probability distributions to
studying the probabilistic behavior of complex water resources systems. Chapter 15 treats stochastic hydrologic models in more detail. The use of simulation in hydrology is certainly not a
recent development. In 1927 Sudler (1927) generated a 1000-year record of annual runoff values
to develop probability distributions of reservoir capacities. Chow (1964) has indicated that the
risk and uncertainty associated with a proposed investment can be estimated by the use of
multiple sequences of generated data.
Fiering (1966) has discussed the stochastic simulation of water resources systems. In his
paper, he makes the following points:
1. Synthetic hydrologic traces do not provide a mechanism for overcoming biased or faulty data.
2. Simulation is not a substitute for analytical solution.
3. When system simulation appears necessary, it is statistically unjustifiable to rely solely on the
observed sequence of hydrologic events.
McMahon et al. (1972) discuss the use of simulated streamflow in reservoir design. Burges
and Linsley (1971) investigated the influence of the number of traces used in determining the
frequency distribution of reservoir stage. In their study, they generated inflows from both an
annual and a monthly, normal, Markov model. They found that, in general, fewer traces were
required to define the storage distribution when the monthly model was used than when the annual
model was used. They also found that about 1000 traces should be used to determine the storage
frequency distribution when the annual model is used.
Hahn and Shapiro (1967) discuss evaluating system performance by Monte Carlo simulation. Benjamin and Cornell (1970) discuss using simulation to derive the probability distribution
of a random variable that is a function of other random variables. Smart (1973) discusses the use
of simulation to determine relationships between certain parameters of random geomorphological
models. Shreve (1970) used simulation to generate a sample of topologically random channel
networks. Fiering (1961) discusses simulation in reservoir design. Fiering and Jackson (1971)
develop models for simulating streamflow.
A widely used application of data generation has been in the general area of uncertainty,
reliability, and risk analysis. Data generation is used to examine a large number of outcomes or
possibilities from a system from which probabilistic statements can be made. Chapter 16 should
be consulted for more detail on this topic.
The stochastic nature of quantities estimated from stochastic models can be investigated
using data generation techniques. The design of any water resources system is dependent upon
estimates of hydrologic quantities. These estimates are based on some type of stochastic model, whether it be a flood frequency curve or a comprehensive river basin simulation model. One of
the first steps in developing design estimates is the selection of the stochastic model to be used.
Regardless of what stochastic model is finally selected, the parameters of this model must be
estimated from historical data. Because the parameters are functions of random variables (the historical data), the parameters themselves are random variables. Furthermore, the design estimate that
is arrived at using the model is a random variable because it is dependent on the model parameters.
As an example, consider the design capacity of a reservoir required to meet a given criterion. This capacity might be determined based on an available historical streamflow record. If a
different historical streamflow record were available and was used to determine the required
capacity, the estimate based on this historical record would differ from the estimate based on the
original historical record. The estimated design would be a random variable because it is a function of the available streamflow record and streamflow is a random variable. Intuitively, if two
extremely long streamflow records were used, one would expect less difference in the estimated
reservoir capacity than if two short streamflow records were used. Furthermore, one would expect
the estimated capacity based on the long record to more closely approximate the "true" capacity
than the estimate based on a short record.
In general, the variance of a parameter estimate is a decreasing function of the sample size.
The larger the sample, the smaller the variance of the parameter estimate. This, in turn, implies
that the variance of the design estimate will decrease as the sample size increases. The difference
in a design estimate and its true population value may be thought of as a prediction error.
A general procedure for determining the probability distribution of prediction errors as a
function of sample size is presented in Haan (1972b). The procedure assumes that the correct
stochastic model is being employed. The procedure is as follows:
1. Estimate the parameters of the stochastic model and assume these estimates are equal to the
population values.
2. Simulate k independent sets of data of the type being studied with the model using the assumed population parameters. Each set of data consists of n observations or years of record.
3. Reestimate the parameters of the model being used from the n simulated observations for each
of the k data sets. This results in k parameter sets.
4. Estimate the desired quantity, Q (mean annual runoff, 50-year peak flow, 90-day low flow, etc.) with the model using each of the k parameter sets. This will result in k estimates for Q.
5. Look at the probability distribution, P_Q(q), of the k estimates for Q and determine the probability of an individual estimate being outside some acceptable limits. If Q* represents the estimate of Q and Q_L and Q_U are the lower and upper limits, then the probability that Q* will be outside the desired interval is given by

prob(Q* < Q_L) + prob(Q* > Q_U) = 1 − [P_Q(q_U) − P_Q(q_L)]

where P_Q here denotes the cumulative distribution of the k estimates.
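The five steps above can be sketched numerically. This is a minimal illustration, assuming a simple normal model for annual runoff with hypothetical parameter values; the design quantity Q is taken to be the mean annual runoff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: parameters estimated from the historical record, treated here
# as population values (both numbers are hypothetical).
mu_hat, sigma_hat = 750.0, 180.0      # mean and std. dev. of annual runoff, mm
n, k = 10, 5000                       # record length and number of simulated sets

# Steps 2-3: simulate k records of n years each; re-estimating the model
# parameters from each record amounts to computing sample statistics.
records = rng.normal(mu_hat, sigma_hat, size=(k, n))

# Step 4: compute the design quantity Q from each simulated record; here Q
# is simply the estimated mean annual runoff.
Q = records.mean(axis=1)

# Step 5: probability that an estimate misses the assumed true value by
# more than d = 50.8 mm.
d = 50.8
p_error = np.mean(np.abs(Q - mu_hat) > d)
print(f"P(|Q* - Q| > {d} mm) for n = {n} years: {p_error:.2f}")
```

Repeating the last step for several values of n and d would produce a table of error probabilities of the same form as table 13.2.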
Table 13.2. Probability of error greater than d in mean annual runoff for problem described by Haan (1972a). Rows are record lengths n in years; columns are error magnitudes d of 6.40, 12.70, 25.40, 38.10, 50.80, 76.20, and 101.60 millimeters. (The tabulated probabilities are not reproduced here.)
The results of this analysis, presented in table 13.2, show the expected result that as the
number of years of record increases, the probability of making an error greater than a given
value decreases. For example, for this particular stream, there is a probability of 0.22 of missing
the true mean annual runoff by more than 50.8 mm if 10 years of record are available, whereas
the probability is only 0.03 if 50 years of data are available for parameter estimation.
This procedure for estimating prediction error probabilities requires that the population
parameters for the stochastic model be known. Because this is rarely the case in hydrology, these
parameters must be estimated from all of the available information. Obviously, these estimated
parameters will not equal the population parameters, but when used as population values along
with the above simulation technique, they will yield estimates of error probabilities that can serve
as a guide in determining how much data is needed to ensure an acceptably low probability of
making an unacceptable error with the stochastic model.
Exercises
13.1. Without using a table of standard normal deviates, generate 20 observations from a normal
distribution with a mean of 100 and a variance of 100. What is the mean and variance of the 20
observations?
13.2. Select 100 observations from a normal distribution with mean 0 and variance 1 (use a table
of standard normal deviates). Plot a histogram of these observations. Test the hypothesis that
these are from a normal distribution using the χ² test and the Kolmogorov-Smirnov test. Why do
the mean and variance of the data not equal 0 and 1, respectively?
13.3. Generate 20 observations from an exponential distribution with λ = 0.5. (a) Test the hypothesis that the observations are normally distributed. (b) Test the hypothesis that the observations are exponentially distributed.
13.4. Generate independent samples of size 10, 20, 30, 50, and 100 from an N(0, 1). Plot the observations on probability paper using one plot for each sample. Repeat this entire process 5 times.
Study the resulting probability plots in an attempt to develop a "feel" for the scatter that one can expect when sampling from a frequency distribution. (This might be undertaken as a class project with
each student working through a sample of 10, 20, 30, 50, and 100. The results can then be shared.)
13.5. Any number of variations of exercise 13.3 can be worked using different initial
distributions, test distributions, parameter values, and sample sizes. Some variations should be
used to assist in developing a "feel" for the scatter present in random samples from frequency
distributions and for the discriminatory power of the chi-square and Kolmogorov-Smirnov tests.
13.6. Write a computer program for generating random observations from a gamma distribution for integer values of η. Generate independent sets of size 10, 20, 30, 40, 50, and 100 observations using η = 2 and λ = 1.5. Test the hypothesis that these generated values are from a (a) gamma distribution, (b) normal distribution, (c) exponential distribution.
13.7. Repeat example 13.2 for samples of size 20, 200, and 999. Why are your results not identical to those of example 13.2?
13.8. Weekly rainfall during a particular week of the year at a weather station is thought to follow a gamma distribution with η = 2 and λ = 1.5. If 25 years of data are available for estimating the parameters of the gamma distribution, what is the probability that the estimated 50-year weekly rainfall based on the estimated gamma parameters will be in error by more than 0.5 inches?
13.9. See exercise 5.10.
14. Analysis of
Hydrologic Time Series
THIS IS an introduction to the analysis of time series of hydrologic data. There have been
many books and articles written on the subject of time series and stochastic processes. The
statistical literature contains a wide array of such books covering a wide range of complexity.
Bendat and Piersol (1966, 1971) have prepared very readable books on the analysis of random
data. Books and articles dealing with time series analysis and stochastic models in hydrology
include Yevjevich (1972b,c), Kisiel (1969), Matalas (1966, 1967b), Julian (1967), Salas et al.
(1980), Bras and Rodriguez-Iturbe (1985), Salas (1993), and Clarke (1998). The classic text of
Box and Jenkins (1976) lays the foundation for a particular type of time series analysis and
models now known as Box-Jenkins models. Pankratz (1983) and others have presented
examples, details and clarifications of the Box-Jenkins approach. Many books on time series
analysis are available. In the treatment here, considerable reliance has been placed on Cryer (1986). These references and others contain much more information than is presented here and should be consulted by those requiring more than an introductory knowledge of time series
analysis. In addition to books and articles, several software packages for personal computers are
available that make time series analysis practical.
DEFINITIONS
A sequence of observations collected over time on a particular variable is a time series. A time
series can be composed of a quantity either observed at discrete times, averaged over a time interval, or recorded continuously with time. An ensemble of time series is a set of several time series
measuring the same variable. A single time series is called a realization. Thus, an ensemble is made up of several realizations.

Fig. 14.1. Time series containing stochastic and several types of deterministic components: (a) stochastic; (b) stochastic + trend; (c) stochastic + periodic; (d) stochastic + jump.

A time series may be composed of only deterministic events,
only stochastic events, or a combination of the two. Most generally, a hydrologic time series will
be composed of a stochastic component superimposed on a deterministic component. For
example, the series composed of average daily temperature at some point would contain seasonal variation (the deterministic component) plus random deviations from the seasonal values (the stochastic component). The deterministic components may be classified as a periodic component,
a trend, a jump, or a combination of these. Figure 14.1 shows typical stochastic time series with
various types of deterministic components.
Trends in a hydrologic time series can result from gradual natural or human-induced changes in the hydrologic environment producing the time series. Changes in watershed
conditions over a period of several years can result in corresponding changes in streamflow
characteristics that show up as trends in time series of streamflow data. Urbanization on a large
scale may result in changes in precipitation amounts that show up as trends in precipitation
(Huff and Changnon 1973). Climatic changes or shifts may introduce trends into hydrologic
time series.
Jumps in time series may result from catastrophic natural events such as earthquakes or
large forest fires that may quickly and significantly alter the hydrologic regime of an area.
Anthropogenic changes such as the closure of a new dam or the beginning or cessation of
pumping of ground water may also cause jumps in certain hydrologic time series. Astronomic
cycles are generally responsible for periodicities in natural hydrologic time series. Annual cycles
are many times apparent in streamflow, precipitation, evapotranspiration, groundwater level, soil
moisture and other types of hydrologic data. Weekly cycles may be present in water use data such
as industrial, domestic, or irrigation demands. Many times the latter time series will contain both annual and weekly periodicities. Salas-LaCruz and Yevjevich (1972) and Yevjevich (1972c) discuss periodicities and trends in hydrologic data in more detail.
The time scale of time series may be either discrete or continuous. A discrete time scale
would result from observations at specific times with the times of the observations separated
by Δt or from observations that are some function of the values that actually occurred during Δt. Most hydrologic time series fall in this latter category. Examples would be the average monthly flow in a stream (Δt = 1 month), annual peak discharge (Δt = 1 year), and daily rainfall (Δt = 1 day).
A continuous time scale results when data are recorded continuously with time such as the
stage at a stream gaging location. Even when a continuous time scale is used for collecting the
data, the analysis is usually done by selecting values at specific time intervals. For example, raingage charts are usually analyzed by reading the data at selected times (e.g., every 5 minutes) or at "break points" (here Δt is not a constant).
In this chapter it will be assumed that the data are available at discrete times evenly spaced Δt time units apart. Even though the discussion centers around a time scale concept, a distance or
space scale can be used as well. For example, the width of a stream along a certain reach might
be a stochastic process where the width would be the random variable and distance along the
reach the "time".
The random variable described by the time series may be discrete or continuous. A sequence
of 0's and 1's denoting rainless and rainy days would be a discrete stochastic process with a
discrete time scale. The amount of daily rainfall would be a continuous stochastic process with a discrete time scale (Δt = 1 day). Thus a time series may be composed of either discrete or
continuous random variables on discrete or continuous time scales.
A stochastic process can be represented by X(t). The probability density function of X(t) is denoted by p_X(x; t), which describes the probabilistic behavior of X(t) at the specified time, t. If the properties of a time series do not change with time, the series is called stationary. For a stationary series p_X(x; t₁) equals p_X(x; t₂), where t₁ and t₂ represent any two different possible times. If p_X(x; t₁) and p_X(x; t₂) are not equal, the series is termed nonstationary. Of the series shown in figure 14.1, only that given in 14.1a can possibly be stationary. If the deterministic component is removed from 14.1b, c, and d, they too might be stationary.
The properties of a time series can be obtained based on a single realization over a time interval or based on several realizations at a particular time. The properties based on a time interval of a single realization are known as time average properties. The properties based on several
realizations at a given time are known as the ensemble properties. If the time average properties
and the ensemble properties are the same, the time series is said to be ergodic.
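The time-average versus ensemble-average distinction can be illustrated numerically. The sketch below, assuming a stationary first-order autoregressive process with arbitrary parameters, computes both averages; for this ergodic process they converge to the same (zero) mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ensemble of m realizations of a stationary AR(1) process
# X(t) = phi*X(t-1) + e(t); phi, m, and T are arbitrary choices.
phi, m, T = 0.5, 500, 600
X = np.zeros((m, T))
for t in range(1, T):
    X[:, t] = phi * X[:, t - 1] + rng.normal(size=m)

time_average = X[0, 200:].mean()    # time average of one realization (warm-up dropped)
ensemble_average = X[:, -1].mean()  # ensemble average at a fixed time

print(round(time_average, 2), round(ensemble_average, 2))
```

Both averages are close to the process mean of zero; for a nonstationary or nonergodic process the two need not agree.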
Figure 14.2 shows several possible realizations for a continuous stochastic process on a continuous time scale. The time average over the time interval 0 to T of the i-th realization is given by

$$\bar{X}_i = \frac{1}{T}\int_0^T X_i(t)\,dt \qquad (14.1)$$

or, for observations at discrete times,

$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i(t_j) \qquad (14.2)$$

where n is the total number of equally spaced points at which X_i(t) was observed.
The ensemble average at time t over m realizations is given by

$$\bar{X}(t) = \frac{1}{m}\sum_{i=1}^{m} X_i(t) \qquad (14.4)$$

If the process is such that $\bar{X}(t) = \bar{X}(t + \tau)$ for all values of t and τ, the process is said to be stationary in the mean, or first-order stationary.

The ensemble covariance of X(t) and X(t + τ) is given by

$$\mathrm{Cov}(X(t), X(t+\tau)) = E\{[X(t) - \bar{X}(t)][X(t+\tau) - \bar{X}(t+\tau)]\} \qquad (14.5)$$

If the covariance given by equation 14.5 is independent of t but dependent on τ (the lag), the time series is stationary in the covariance. If τ = 0, equation 14.5 gives the variance of the series. Stationarity in the covariance implies stationarity of the variance. If a series is stationary in the mean and in the covariance, the series is said to be second-order stationary, or weakly stationary.
If a series is stationary in the covariance but not in the mean, the term weakly stationary or second-order stationary should not be used. For many hydrologic applications, one is satisfied with second-order stationarity. If a process is second-order stationary and p_X(x; t) is a normal distribution, the process can be shown to be stationary.
Bendat and Piersol (1966) state that in actual practice, random data representing stationary
physical phenomena are generally ergodic. For ergodic random processes, the time average
mean, as well as all other time average properties, equals the ensemble averaged value. Thus the
properties of a stationary random phenomenon can be measured properly, in most cases, from a
single observed time history record.
Generally, only one realization of a stochastic process is available. More than one realization can be obtained by breaking the single realization into several shorter series. Unfortunately,
most hydrologic records are so short that breaking them into even shorter series may not be practical. If the statistical properties of the parts of a time series are not significantly different from
one another, the series is said to be self-stationary.
For a single realization the mean is determined from equation 14.1 or 14.2. The covariance can be determined by

$$\mathrm{Cov}(X(t), X(t + k\,\Delta t)) = \frac{1}{n-k}\sum_{t=1}^{n-k}\left[X(t) - \bar{X}\right]\left[X(t + k\,\Delta t) - \bar{X}\right]$$
TREND ANALYSIS
A common deterministic component in a time series is a trend. A trend is a tendency for
successive values to be increasing (or decreasing) over time. Changing hydrologic conditions can
introduce trends into a hydrologic time series. Urbanization may contribute to increased peak flows
or runoff volumes. Increased demands on groundwater may result in declining groundwater levels
or declining base flows in streams. Climate change may result in changes over time in rainfall, temperature, and other climatic variables, which in turn may alter streamflows and groundwater levels.
If the data meet the assumptions of regression, simple linear regression can be used to test
for the presence of a linear trend and multiple linear regression may be used to test for trends
Fig. 14.3. Annual rainfall at Stillwater, Oklahoma, 1890–1990.

As an example, a linear regression of the form X̂(t) = a + bt was fit to the Stillwater, Oklahoma, annual rainfall record shown in figure 14.3, where X(t) is the total annual rainfall in year t. The slope of 0.028 has a standard error of 0.0267.
The calculated "t" statistic for testing the hypothesis that the slope is zero is 1.05, which is clearly not significant, indicating that the hypothesis cannot be rejected. Figure 14.4 is a normal probability plot of the residuals.

Fig. 14.4. Normal probability plot of residuals of time series regression of Stillwater, Oklahoma, annual rainfall.
Fig. 14.5. A portion (1977–1988) of the Stillwater, Oklahoma, annual rainfall record.
The standard error on the slope was 0.489 and the calculated t for testing for zero slope was 3.36, an indication one must reject the hypothesis of no trend. Figure 14.6 shows a normal probability plot of the residuals for this regression. The first-order serial correlation coefficient for the residuals was 0.03. Again, the assumptions of regression are not violated.

Fig. 14.6. Normal probability plot of residuals of a portion of the Stillwater, Oklahoma, annual rainfall.
The Stillwater rainfall example illustrates that short periods of apparently nonstationary
data may be embedded in a longer stationary data series. Concluding nonstationarity from the
short series and projecting data either forward or backward in time based on this conclusion can
clearly lead to erroneous projections.
If the data under consideration do not meet the assumptions of regression as set forth in
chapters 9 and 10, conclusions based on regression are approximate with an unknown level of
confidence associated with statistical tests. Helsel and Hirsch (1992), Salas (1993), and others discuss the use of nonparametric tests for trends. Nonparametric tests do not depend on distributional assumptions regarding the data and residuals but are generally based on relative ranks of data points. Conover (1971) presents many nonparametric statistical procedures.
Salas describes the Mann-Kendall nonparametric test for trends in the series X(t) for t = 1, 2, ..., N. Each value in the series X(t′) for t′ = t + 1, t + 2, ..., N is compared to X(t) and assigned a score z(k) given by

z(k) = 1 if X(t′) > X(t)
z(k) = 0 if X(t′) = X(t)        (14.8)
z(k) = −1 if X(t′) < X(t)

The statistic S is computed as the sum of the N(N − 1)/2 values of z(k), and the test statistic is

$$u_c = \frac{S + m}{\left[N(N-1)(2N+5)/18\right]^{1/2}}$$

where m = 1 if S < 0 and m = −1 if S > 0. The statistic u_c is approximately standard normal if there are few values of z(k) = 0 (ties). In hydrologic data such as rainfall totals, streamflow rates or volumes, groundwater levels, and so forth, one would expect very few exact ties. If the data series is on a discrete time scale, ties might be more common. In that event, Salas (1993) or Helsel and Hirsch (1992) should be consulted.

The hypothesis of no trend is rejected if |u_c| > z_{1−α/2}, where z is from the standard normal distribution and α is the level of significance.
Example 14.1. The annual rainfall for Stillwater for the period 1978–1987 is given below. Use the Mann-Kendall test for a significant trend in the data.

(Table of annual rainfall values for 1978–1987, with column sums, not reproduced.)
Values of z(k) are determined by constructing N − 1 columns of z(k) values, with the first value of z(k) in column j occupying the j + 1st position. Thus, in column 3 the first value of z(k) occupies the 4th position. The value of z(k) is determined by the assignment rules given as equations 14.8. Consider the third column. By applying equations 14.8, t = 3 and X(3) = 34.03. The first entry in the third column is in the 4th row and is −1 because t′ = t + 1 = 4, and X(3) = 34.03 is less than X(4) = 35.72.

S is simply the sum of all N(N − 1)/2 or 45 values of z(k) and is equal to −25. The value of m is 1 since S < 0. The test statistic is u_c = (−25 + 1)/[10(9)(25)/18]^{1/2} = −2.15. For α = 0.10, z_{1−α/2} = 1.64. Because |u_c| > z_{1−α/2}, the hypothesis of no trend is rejected.
Comment: This is the same result as obtained using the parametric regression test. For this data
the regression assumptions were satisfied. Most nonparametric tests have been found to be nearly
as powerful as parametric tests when the assumptions of the parametric test are met and more
powerful when they are not met. This leads many to adopt the nonparametric approach if any
doubt concerning distributional assumptions exists.
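The Mann-Kendall computation can be sketched in code. The function name and the illustrative data below are hypothetical; the score assignment follows equations 14.8, and the variance expression assumes no ties.

```python
import math

def mann_kendall(x):
    """Mann-Kendall trend statistic u_c (no-ties variance assumed)."""
    n = len(x)
    # Score each pair (t, t') with t' > t as +1, 0, or -1 (equations 14.8).
    s = sum((xj > xi) - (xj < xi)
            for i, xi in enumerate(x) for xj in x[i + 1:])
    m = 1 if s < 0 else (-1 if s > 0 else 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    return s, (s + m) / math.sqrt(var_s)

# A steadily increasing (hypothetical) 10-value series: every pair scores +1,
# so S = N(N-1)/2 = 45.
s, u_c = mann_kendall([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(s, round(u_c, 2))   # S = 45; u_c = 44/sqrt(125), about 3.94
```

Applying the same function to the 10 Stillwater values would reproduce S = −25 and u_c = −2.15 from example 14.1.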
The trend in a set of data may be removed by subtracting the trend line from each data point. If the data follow the assumptions of linear regression, the detrended data X′(t) would be given by

$$X'(t) = X(t) - (a + bt) \qquad (14.12)$$

Nonparametric estimates of a and b may also be obtained. Helsel and Hirsch (1992) suggest that the slope may be estimated from

$$\hat{b} = \mathrm{median}\left[\frac{X(t) - X(t')}{t - t'}\right] \qquad (14.13)$$

for t′ = 1, 2, ..., n − 1; t = 2, 3, ..., n. The intercept is estimated from

$$\hat{a} = X(t)_{med} - \hat{b}\,t_{med} \qquad (14.14)$$
Values for a and b are estimated from equations 14.13 and 14.14. The median of the values in the above table is 1.70667, which is the estimate for b. The median of the values in the column labeled t is 1982.5 and in the column labeled X(t) is 34.88. Therefore

â = 34.88 − 1.70667(1982.5) = −3348.59

For example, for t = 1980 the estimate is −3348.59 + 1.70667(1980), or 30.61. Figure 14.7 shows the resulting nonparametric regression line. The nonparametric estimate of the detrended data, X′(t), is again given by equation 14.12 using the nonparametric estimates for the slope and intercept.
Fig. 14.7. Nonparametric regression line on part of the Stillwater, Oklahoma, annual rainfall.
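The nonparametric slope and intercept estimates can be sketched in code. This is a minimal implementation of the median-of-pairwise-slopes idea attributed above to Helsel and Hirsch (1992); the function name and the exactly linear test data are hypothetical.

```python
import statistics

def nonparametric_trend(t, x):
    """Median-of-pairwise-slopes estimate of slope b and intercept a."""
    slopes = [(x[j] - x[i]) / (t[j] - t[i])
              for i in range(len(x)) for j in range(i + 1, len(x))]
    b = statistics.median(slopes)
    a = statistics.median(x) - b * statistics.median(t)
    return a, b

# Exactly linear (hypothetical) data: slope 1.7, intercept 4.0.
years = [1978, 1979, 1980, 1981, 1982]
x = [4.0 + 1.7 * yr for yr in years]
a, b = nonparametric_trend(years, x)

# Detrended series X'(t) = X(t) - (a + b*t); all values are essentially zero here.
detrended = [xi - (a + b * yr) for xi, yr in zip(x, years)]
print(round(b, 3), round(a, 3))
```

Because medians rather than means are used, one or two wild values in the record shift the estimated line far less than they would shift a least-squares fit.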
JUMPS
Jumps or abrupt changes in the mean of a time series may be detected using equations 8.15
through 8.17 if the time point at which a jump is suspected is known and the necessary distributional
assumptions (normality) are valid. If the distributional assumptions are not met, these tests become
approximate with an unknown level of significance. The degree of approximation depends on the
severity of the deviation from normality. For highly skewed data, the approximation could be quite
poor. The procedure in making the test is to divide the time series into two subseries at the point of the suspected jump, with n₁ and n₂ observations in the subseries, where n₁ + n₂ = n, the total number of observations. A test of the hypothesis that μ₁ = μ₂ for the two subseries is then made.
For data that do not meet the assumptions associated with the parametric tests, a nonparametric test for the hypothesis μ₁ = μ₂ is available. Conover (1971, 1980) and Salas (1993) present the Mann-Whitney test for the equality of means.

The entire sample is ranked, with R_i being the rank of the i-th observation in the series for i = 1 to n. The quantities

$$S = \sum_{i=1}^{n_1} R_i \qquad (14.15)$$

and

$$T = \frac{S - n_1(n+1)/2}{\left[n_1 n_2 (n+1)/12\right]^{1/2}} \qquad (14.16)$$

are computed. If |T| > z_{1−α/2}, where z_{1−α/2} is the standardized normal z value with probability z > z_{1−α/2} equal to α/2, the hypothesis of equality of means is rejected. Conover (1971, 1980) should be consulted if the data have many ties or groupings of equal values.
Example 14.3. Below are annual flow data for Beaver Creek in western Oklahoma. The data are plotted in figure 14.8. It has been hypothesized that after the 28th year, the flow regime has changed.

Fig. 14.8. Annual flows for Beaver Creek, western Oklahoma.

(a) The parametric test leads to rejection of the hypothesis of equal means, which would indicate the two parts of the record have different means.
(Table of year, flow, and rank for the Beaver Creek record not reproduced.)
(b) Based on the nonparametric approach, equations 14.15 and 14.16 are used. The sum S is computed as the sum of the ranks of the first n₁ = 28 observations, and T then follows from equation 14.16.
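The rank computation for the Mann-Whitney jump test can be sketched as follows. The function name and the short illustrative series are hypothetical; the statistic is formed as in equations 14.15 and 14.16, assuming no tied values.

```python
import math

def mann_whitney_T(x, n1):
    """Normal-approximation Mann-Whitney statistic for a suspected jump
    after observation n1 (assumes no tied values)."""
    n = len(x)
    # Rank the entire sample: rank[i] is the rank of the i-th observation.
    order = sorted(range(n), key=lambda i: x[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    S = sum(rank[:n1])                              # equation 14.15
    n2 = n - n1
    return (S - n1 * (n + 1) / 2.0) / math.sqrt(n1 * n2 * (n + 1) / 12.0)

# Hypothetical series with an upward jump after the 3rd value: the low
# ranks concentrate in the first subseries, giving T < 0.
T = mann_whitney_T([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], 3)
print(round(T, 2))
```

With a real record such as the Beaver Creek flows, |T| would be compared with z at the 1 − α/2 level to decide whether the two parts of the record have different means.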
AUTOCORRELATION
One method of characterizing correlation within a time series over time is the autocorrelation function, P(T),given by
For T = 0 equation 14.17 indicates that p(0) = 1 because Cov(X(t), X(t + T)) = Cov(X(t), X(t)) =
Var(X(t>>
From figure 14.2 it can be seen that for small values of T the covariance term would be
positive because for the most part like signs are being multiplied (X(t) - X and X(t + T) - X
have the same sign for small T).As T increases, a point is reached where the covariance, and thus
P(T),may become negative. Some authors call Cov(X(t), X(t + T)) the autocorrelation function.
In keeping with the terminology established earlier in this book, the Cov(X(t), X(t + T)) will be
called the autocovariance.
A plot of the autocorrelation function against the lag τ is called a correlogram. For random data such as shown in figure 14.1a, the correlogram would appear as in figure 14.9a. In the case of data containing a cyclic and stochastic component such as shown in figure 14.1c, the correlogram would be cyclic as in figure 14.9b, where p is the period of the cycle.
Correlograms are useful in determining if successive observations are independent. If the correlogram indicates a correlation between X(t) and X(t + τ), the observations cannot be assumed to be independent. The autocorrelation function may thus be said to indicate the "memory" of a stochastic process. When ρ(τ) becomes zero, the process is said to have no memory for what occurred prior to time t − τ. In practice, ρ(τ) should be zero for large τ for most random processes.

Fig. 14.9. Typical correlograms: (a) Random process. (b) Random process superimposed on a periodic process.

If ρ(τ) for large τ exhibits a pattern that is not zero, it may be an indication of a deterministic component. For example, if the correlogram appears as in figure 14.9b, it indicates the data contain a periodic component.
A hydrologic time series representing a process involving significant storage is likely to have values at time t + 1 that are correlated with values at the previous time t. The correlation may extend over several time increments so that X(t + k) is correlated with X(t), k time units earlier. Daily flows in a stream and daily, monthly, and possibly annual groundwater levels are examples of hydrologic time series that often exhibit correlation over time. Annual maximum peak flow is an example of a time series that is unlikely to exhibit correlation over time.
For a discrete time scale, the autocorrelation function becomes ρ(k), where k is the lag or number of time intervals separating X(t) and X(t + τ). The relationship between τ and k is given by

$$\tau = k\,\Delta t$$

where Δt is the length of the time interval (e.g., 1 day, 1 month, 1 year). If ρ(k) = 0 for all k ≠ 0, the process is said to be a purely random one. This indicates that the observations are linearly independent of each other. If ρ(k) ≠ 0 for some k ≠ 0, the observations k time increments apart are dependent in a statistical sense and the process is referred to as simply a random one. If a time series is nonstationary, ρ(k) will not be zero for all k ≠ 0 because of the deterministic element, even if the random element is itself a purely random time series (Matalas 1967b). Unless the deterministic element is removed, one cannot determine to what extent nonzero values of ρ(k) are affected by the deterministic element.
The population autocorrelation function, ρ(k), may be estimated by r(k), which is given by

$$r(k) = \frac{\displaystyle\sum_{i=1}^{n-k} X_i X_{i+k} - \frac{1}{n-k}\sum_{i=1}^{n-k} X_i \sum_{i=1}^{n-k} X_{i+k}}{\left[\left(\displaystyle\sum_{i=1}^{n-k} X_i^2 - \frac{\left(\sum_{i=1}^{n-k} X_i\right)^2}{n-k}\right)\left(\displaystyle\sum_{i=1}^{n-k} X_{i+k}^2 - \frac{\left(\sum_{i=1}^{n-k} X_{i+k}\right)^2}{n-k}\right)\right]^{1/2}}$$

with X_i = X(t_i), X_{i+k} = X(t_i + kΔt), and n the total number of observations. Some authors use the terminology autocorrelation function for ρ(k) or ρ(τ) and serial correlation function for r(k) or r(τ). This distinction is not made in this text.
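The estimator r(k) can be sketched in code; the function name is hypothetical and the sums are taken over the n − k overlapping pairs (X_i, X_{i+k}).

```python
import numpy as np

def r(x, k):
    """Sample autocorrelation r(k) for lag k >= 1, using sums over the
    n - k pairs (X_i, X_{i+k})."""
    x = np.asarray(x, dtype=float)
    m = len(x) - k
    a, b = x[:m], x[k:]
    num = (a * b).sum() - a.sum() * b.sum() / m
    den = np.sqrt(((a**2).sum() - a.sum()**2 / m) *
                  ((b**2).sum() - b.sum()**2 / m))
    return num / den

# A strictly periodic series with period 4: perfect positive correlation at
# the period and perfect negative correlation at half the period.
x = [0.0, 1.0, 0.0, -1.0] * 6
print(round(r(x, 4), 3), round(r(x, 2), 3))
```

A plot of r(k) against k for k = 1, 2, ... is the sample correlogram discussed above.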
For any observed series, it is unlikely that r(k) will be exactly zero. If r(k) differs from zero
by more than is expected by chance, then the observations k time periods apart cannot be assumed
independent. Procedures are available for testing the hypothesis that ρ(k) = 0 and for placing confidence intervals on ρ(k) (see chapter 11). Often, computer programs that generate correlograms also compute confidence intervals such that if the computed autocorrelation falls outside the confidence interval for a particular value of k, the hypothesis ρ(k) = 0 would be rejected.
PERIODICITY
Autocorrelation analyzes a time series in the time domain. It provides information on the
behavior of the series over time, especially with regard to the memory of the process or how the
process at one instance of time is dependent on, or related to, the process at some prior time. An
alternate analytic approach is to examine the series in the frequency domain. With this approach,
an attempt is made to quantify the variability in the series in terms of repeating patterns having
fixed periods or, what is equivalent, fixed frequencies. The variance of the process is partitioned
among all possible frequencies so that the predominant frequencies can be identified. Let X_t = X_i = X(t = iΔt) for i = 1 to n. That is, the X's are equally spaced in the time domain. We can express X_t as a Fourier series

$$X_t = a_0 + \sum_{i=1}^{q}\left(a_i \cos 2\pi f_i t + b_i \sin 2\pi f_i t\right)$$

The maximum value for the number of terms in the series, q, is given by q = (n − 1)/2 if n is odd and q = n/2 if n is even. The frequency, f_i = i/n, represents the i-th harmonic of the fundamental frequency 1/n.
The coefficients a and b can be estimated from

$$a_0 = \frac{1}{n}\sum_{t=1}^{n} X_t = \bar{X}$$

$$a_i = \frac{2}{n}\sum_{t=1}^{n} X_t \cos 2\pi f_i t$$

$$b_i = \frac{2}{n}\sum_{t=1}^{n} X_t \sin 2\pi f_i t$$

for i = 1, 2, ..., q, except that when n is even

$$a_{n/2} = \frac{1}{n}\sum_{t=1}^{n} (-1)^t X_t \quad \text{and} \quad b_{n/2} = 0$$
The periodogram, I(f_i), is defined as

$$I(f_i) = \frac{n}{2}\left(a_i^2 + b_i^2\right)$$

For a discrete time series, the angular frequency, ω_i, is equal to 2πf_i or 2πi/n.

The variance of X_t is given by

$$\mathrm{Var}(X_t) = \mathrm{Var}\left(a_0 + \sum_{i=1}^{q}\left[a_i \cos \omega_i t + b_i \sin \omega_i t\right]\right)$$

Because a_0 is a constant,

$$\mathrm{Var}(X_t) = \sum_{i=1}^{q}\left[a_i^2\,\mathrm{Var}(\cos \omega_i t) + b_i^2\,\mathrm{Var}(\sin \omega_i t)\right]$$

where Var(cos ω_i t) = 1/2 for i ≠ n/2 and 1 otherwise, and Var(sin ω_i t) = 1/2 for i ≠ n/2 and 0 otherwise. Therefore

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q}\left(a_i^2 + b_i^2\right) \qquad n \text{ odd}$$

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q-1}\left(a_i^2 + b_i^2\right) + a_{n/2}^2 \qquad n \text{ even}$$

or, the variance of X_t has been partitioned among the frequencies so that the variance associated with f_i is ½(a_i² + b_i²). I(f_i) is n times the variance associated with f_i. By definition, Var(X_t) = σ². Letting

$$g(f_i) = \frac{I(f_i)}{n\sigma^2}$$

the g(f_i) sum to 1 over all frequencies. The function g(f_i) is the spectral density function representing the fraction of the variance of X_t associated with f_i.
Some plot the spectral density function, g(f_i), versus f_i and some plot the periodogram, I(i), versus i. Because p = 1/f = 1/(i/n) = n/i, the period associated with any i can be easily determined as n/i. Peaks or spikes in g(f_i) or I(i) indicate frequencies that predominate in determining the variance of X_t.
Figures 14.10 and 14.11 show some correlograms and spectral density functions for well-behaved functions. In figure 14.10 the function is a cosine,

$$X_t = \cos(2\pi t/12)$$

For this function f is 1/12 cycles per time unit and the wave length is 12 time units. A software package, NCSS 2000 (1998), was used to make the calculations used in generating the plots of figure 14.10. The frequency axis of the periodogram is actually 2π/f in this plot. Note that for this deterministic function the correlogram reflects the function exactly.

Fig. 14.11. Sum of 3 cosines (top) and its correlogram (middle) and spectral density (bottom).
The function for figure 14.11 is

X(t) = cos(2πt/6) + cos(2πt/12) + cos(2πt/24)

[Figure panels: series versus Month (0 to 250); Autocorrelation Plot; Periodogram versus Frequency (0 to 0.2).]
Here again, the correlogram reflects the deterministic function. The three frequencies of 1/6,
1/12, and 1/24 can easily be seen in the periodogram.
Figure 14.12 is a similar analysis of the monthly stream flow on Cave Creek near Lexington,
Kentucky. The correlogram reflects a cycle of 12 months but does not reproduce the flow record. The
maximum correlations of ±0.4 are considerably less than the ±1.0 for the deterministic functions,
reflecting a combination of deterministic and random components in the data. The periodogram
clearly shows the periodic nature of monthly flow at this location with a period of 12 months.
If z_t = a + bt + ct^2 and w_t = z_t - z_{t-1},
then

w_t = a + bt + ct^2 - a - b(t - 1) - c(t - 1)^2 = b - c + 2ct

so differencing once removes the linear portion of a trend but not the quadratic portion; a second difference leaves only the constant 2c.
where a_t represents an unobserved white noise series. The a_t are identically and independently
distributed random variables (iid rvs) with a mean μ_a = 0 and variance σ_a^2, and z_t is a stationary
time series with zero mean. A mean term can be added later if necessary.
MA(1)
A first-order moving average process, MA(1), is given by

z_t = a_t - θ_1 a_{t-1}

For notational convenience let γ_k = Cov(z_t, z_{t-k}). Some properties of a MA(1) series are:

γ_0 = (1 + θ_1^2)σ_a^2
γ_1 = -θ_1 σ_a^2
γ_k = 0 for k > 1
ρ_1 = γ_1/γ_0 = -θ_1/(1 + θ_1^2)

Note that this last equation presents a way of estimating θ_1. We can estimate ρ_1 by r_1 and
equate

r_1 = -θ_1/(1 + θ_1^2)

and solve for θ_1. This is a moment estimator and is not very efficient.
Solving the quadratic gives θ_1 = [-1 ± (1 - 4r_1^2)^(1/2)]/(2r_1). For θ_1 to be real, 4r_1^2 must be less than 1. This implies that r_1^2 must be less than 1/4, or -1/2 ≤ r_1 ≤ 1/2.
For a MA(1) process, ρ_1 must lie between ±1/2.
When ρ_1 = -0.5, θ_1 = 1. When ρ_1 = +0.5, θ_1 = -1.
Therefore, -1 ≤ θ_1 ≤ 1. Because of the randomness of a sample, it is possible for |r_1| > 1/2, but if that
occurs it brings into question the appropriateness of a MA(1) as a descriptor of the process.
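The moment estimator above reduces to solving a quadratic in θ_1; a short sketch (the function name is illustrative), which returns the invertible root:

```python
import math

def theta1_from_r1(r1):
    """Moment estimate of theta_1 for a MA(1) from the lag-one sample
    correlation r1, solving r1 = -theta_1/(1 + theta_1^2).
    Requires |r1| < 0.5; returns the invertible root (|theta_1| <= 1)."""
    if r1 == 0.0:
        return 0.0
    if abs(r1) >= 0.5:
        raise ValueError("|r1| must be < 0.5 for a real MA(1) solution")
    return (-1.0 + math.sqrt(1.0 - 4.0 * r1 * r1)) / (2.0 * r1)

theta = theta1_from_r1(-0.4)
```

With r_1 = -0.4 the estimate is θ_1 = 0.5, which substituted back gives ρ_1 = -0.5/1.25 = -0.4, as it should.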
MA(2)
A second-order moving average process, MA(2), is z_t = a_t - θ_1 a_{t-1} - θ_2 a_{t-2}, with autocorrelations

ρ_1 = (-θ_1 + θ_1 θ_2)/(1 + θ_1^2 + θ_2^2)
ρ_2 = -θ_2/(1 + θ_1^2 + θ_2^2)

Moment estimates can be obtained by solving these equations for θ_1 and θ_2 with ρ_1 and ρ_2 replaced by r_1 and r_2. Once again, the expressions for ρ_1 and ρ_2 provide a means for estimating θ_1 and θ_2; however,
the two simultaneous equations may be difficult to solve.
MA(q)
The general result for ρ_k for a MA(q) can now be written as

ρ_k = (-θ_k + θ_1 θ_{k+1} + θ_2 θ_{k+2} + ... + θ_{q-k} θ_q)/(1 + θ_1^2 + θ_2^2 + ... + θ_q^2)    k = 1, 2, ..., q

Note the numerator for ρ_q is -θ_q, and ρ_k = 0 for k > q. This can be used to identify the order
of a MA(q) process.
Autoregressive Processes
An autoregressive process of order p is given by

z_t = φ_1 z_{t-1} + φ_2 z_{t-2} + ... + φ_p z_{t-p} + a_t

AR(1)
A first-order autoregressive process, z_t = φ_1 z_{t-1} + a_t, is stationary provided |φ_1| < 1 and has the properties that

For k = 1    ρ_1 = φ_1
For k = 2    ρ_2 = φ_1^2
For k = k    ρ_k = φ_1^k

Because |φ_1| < 1, ρ_k exponentially decays toward zero. For φ_1 positive, ρ_k is positive. For φ_1
negative, ρ_k alternates between positive and negative values.
For k = 1, φ_1 = γ_1/γ_0 = ρ_1 and can be estimated by r_1. Another way to estimate φ_1 is
through linear regression of z_t on z_{t-1} using the model z_t = φ_1 z_{t-1} + a_t.
For an AR(2) process,

γ_k = E(z_t z_{t-k}) = E(φ_1 z_{t-1} z_{t-k} + φ_2 z_{t-2} z_{t-k} + a_t z_{t-k})

so that

γ_k = φ_1 γ_{k-1} + φ_2 γ_{k-2}

Dividing by γ_0,

ρ_k = φ_1 ρ_{k-1} + φ_2 ρ_{k-2}
AR(p)
A general pth-order autoregressive process is given by

z_t = φ_1 z_{t-1} + φ_2 z_{t-2} + ... + φ_p z_{t-p} + a_t

so that

γ_k = E(z_t z_{t-k})
    = E(φ_1 z_{t-1} z_{t-k} + φ_2 z_{t-2} z_{t-k} + ... + φ_p z_{t-p} z_{t-k} + a_t z_{t-k})
    = φ_1 E(z_{t-1} z_{t-k}) + φ_2 E(z_{t-2} z_{t-k}) + ... + φ_p E(z_{t-p} z_{t-k})
    = φ_1 γ_{k-1} + φ_2 γ_{k-2} + ... + φ_p γ_{k-p}

for k ≥ 1. Dividing by γ_0 gives the Yule-Walker equations

ρ_k = φ_1 ρ_{k-1} + φ_2 ρ_{k-2} + ... + φ_p ρ_{k-p}    k = 1, 2, ..., p
For an AR(1) process, the Yule-Walker solution is ρ_1 = φ_1. For an AR(2) process, the
Yule-Walker relationships are

ρ_1 = φ_1 + φ_2 ρ_1
ρ_2 = φ_1 ρ_1 + φ_2

To get estimates of φ_1 and φ_2, the ρ_1 and ρ_2 are replaced by r_1 and r_2. In general, the
Yule-Walker equations are solved to estimate the φ's.
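For p = 2 the Yule-Walker system can be solved in closed form; a minimal sketch (the function name is mine):

```python
def yule_walker_ar2(r1, r2):
    """Solve the AR(2) Yule-Walker equations
         rho_1 = phi_1 + phi_2*rho_1
         rho_2 = phi_1*rho_1 + phi_2
    for phi_1 and phi_2, with sample correlations r1, r2 in place of the rho's."""
    denom = 1.0 - r1 * r1
    phi1 = r1 * (1.0 - r2) / denom
    phi2 = (r2 - r1 * r1) / denom
    return phi1, phi2

# Round trip: an AR(2) with phi_1 = 0.5, phi_2 = -0.3 has
# rho_1 = phi_1/(1 - phi_2) and rho_2 = phi_1*rho_1 + phi_2.
rho1 = 0.5 / 1.3
rho2 = 0.5 * rho1 - 0.3
phi1, phi2 = yule_walker_ar2(rho1, rho2)
```

Feeding the population correlations back through the solver recovers φ_1 = 0.5 and φ_2 = -0.3 exactly.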
ARMA(1, 1)
An ARMA(1, 1) model is given by

z_t = φ_1 z_{t-1} + a_t - θ_1 a_{t-1}

Then

γ_k = E(z_t z_{t-k}) = E(φ_1 z_{t-1} z_{t-k} + a_t z_{t-k} - θ_1 a_{t-1} z_{t-k})

which gives

γ_0 = φ_1 γ_1 + σ_a^2 - θ_1(φ_1 - θ_1)σ_a^2
γ_1 = φ_1 γ_0 - θ_1 σ_a^2    for k = 1
γ_k = φ_1 γ_{k-1}    for k > 1

Because γ_k = φ_1 γ_{k-1} for k > 1,

ρ_1 = γ_1/γ_0 = (φ_1 - θ_1)(1 - θ_1 φ_1)/(1 - 2θ_1 φ_1 + θ_1^2)
ρ_k = φ_1 ρ_{k-1}    for k > 1
w_t = ∇z_t = z_t - z_{t-1}    first difference
w_t = ∇^2 z_t = z_t - 2z_{t-1} + z_{t-2}    second difference

In practice, rarely is it necessary to consider d > 2. The purpose of forming w_t is to define a stationary time series from a nonstationary one. Thus, if z_t has a linear-appearing trend, the first difference may well be stationary. An ARMA model may then be fit to w_t. Such
a model is called an ARIMA(p, 1, q) model. The "1" indicates that an ARMA(p, q) model has been
applied to the first difference of z_t.
Some obvious special cases of ARIMA models are ARIMA(p, 0, 0), which is simply an AR(p); ARIMA(0, 0, q), which is a MA(q); and ARIMA(1, 1, 0), an AR(1) applied to the first difference of z_t.
Estimates of the white noise variance in terms of the sample variance s^2 of the series are:

For AR(p):    σ̂_a^2 = (1 - φ̂_1 r_1 - φ̂_2 r_2 - ... - φ̂_p r_p)s^2

For AR(1):    σ̂_a^2 = (1 - r_1^2)s^2    since φ̂_1 = r_1

For MA(q):    σ̂_a^2 = s^2/(1 + θ̂_1^2 + θ̂_2^2 + ... + θ̂_q^2)

For ARMA(1, 1):    σ̂_a^2 = s^2(1 - φ̂_1^2)/(1 - 2φ̂_1 θ̂_1 + θ̂_1^2)
MA(1)
For least-squares estimation of a MA(1), write

a_t = z_t + θa_{t-1}    (Eq. A)

Substituting repeatedly for a_{t-1}, a_{t-2}, ...,

a_t = z_t + θz_{t-1} + θ^2 z_{t-2} + ...

or

z_t = (-θz_{t-1} - θ^2 z_{t-2} - ...) + a_t    (Eq. B)
From equation A, calculate a_i for i = 1 to n by taking a_0 equal to its expected value of zero. Then

S(θ) = Σ_{t=1}^{n} a_t^2

The procedure is to search the interval (-1 < θ < 1) for the θ that minimizes S(θ). Cryer (p. 133)
discusses a Gauss-Newton procedure based on a linear approximation and recursive differentiation.
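The simple interval search can be sketched directly; a minimal grid-search version (function names are mine, and the grid step is an arbitrary choice):

```python
import random

def ma1_sum_of_squares(z, theta):
    """S(theta) = sum of a_t^2 with a_t = z_t + theta*a_{t-1}, a_0 = 0 (Eq. A)."""
    a_prev, s = 0.0, 0.0
    for zt in z:
        a = zt + theta * a_prev
        s += a * a
        a_prev = a
    return s

def ma1_ls_estimate(z, step=0.005):
    """Search the interval (-1, 1) for the theta minimizing S(theta)."""
    grid = [-1.0 + step * k for k in range(1, int(2.0 / step))]
    return min(grid, key=lambda th: ma1_sum_of_squares(z, th))

# Simulate a MA(1) with theta = 0.5 and recover it (seeded for repeatability).
rng = random.Random(7)
a_prev, z = 0.0, []
for _ in range(2000):
    a = rng.gauss(0.0, 1.0)
    z.append(a - 0.5 * a_prev)
    a_prev = a
theta_hat = ma1_ls_estimate(z)
```

A Gauss-Newton search converges far faster, but for a one-parameter problem the grid search is adequate and transparent.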
MA(q)
We simply generalize the above procedure. Now we have a multivariate search for the values (θ_1, θ_2, ..., θ_q) minimizing S(θ_1, ..., θ_q) = Σ a_t^2, with a_0 = a_{-1} = a_{-2} = ... = 0.

AR(1)
For an AR(1), a_t = z_t - φ_1 z_{t-1}. To get a_1 we have a start-up problem, namely z_0. We could set z_0 = E(z_t). Another way is to
simply start the sum at t = 2 and thus avoid the z_0 problem.
ARMA(p, q)
For the general ARMA(p, q), the residuals are computed recursively from

a_t = z_t - φ_1 z_{t-1} - ... - φ_p z_{t-p} + θ_1 a_{t-1} + ... + θ_q a_{t-q}

with a_0 = a_{-1} = ... = 0, and the sum of squares Σ a_t^2 is minimized over all p + q parameters.
Exercises
14.1. Let X(k) = cos(2πk/12) + ε where ε is an independent random observation from a normal
distribution with a mean of zero and a variance of 0.5. Compute and plot the correlogram and
spectral density function for this process by generating values for X(k) for k = 0 to 200. Compare
the results with figure 14.10.
14.3. The following data represent the years of eruptions of the volcano Aso for the period 1229
to 1962. Let k equal the eruption number beginning with 1229 so that k = 1 for 1239, k = 2 for
1240, and so on. Let X(k) be the number of years since the last eruption so that X(l) = 10, X(2)
= 1, X(3) = 25, and so forth. Compute the correlogram and spectral density function for X(k).
Is there any apparent pattern to the eruptions?
Years of eruptions of the volcano Aso for the period 1229 to 1962
14.4. The following data represent an unusual phenomenon in that they are observations of a true
time series from the geologic past. The Eocene lake deposits of the Rocky Mountains consist
of thinly laminated dolomitic oil shales hundreds of feet thick. It has been established that the
laminations are varves, or layered deposits caused by seasonal climatic changes in the lake
basins. By measuring the thickness of these laminations, a record of the annual change in the rate
of deposition through the lake's history is obtained (Davis 1973). Compute and plot the correlogram and spectral density function for these data. Discuss any apparent patterns. The data are presented column-wise.
Thickness (mm) of successive varves of a section through the Green River Oil Shale
14.6. Work exercise 14.5 using the monthly precipitation for Walnut Gulch near Tombstone,
Arizona. (Data in the appendix.)
14.7. For a MA(1) process with a_t iid N(0, 1) and values of θ_1 = 0.9, 0.4, and -0.9:
(a) Generate 100 values for z_t.
(b) Plot z_t versus t.
(c) Compute and plot the autocorrelation function.
(d) Compute and plot the periodogram.
(e) Estimate θ_1.
14.8. For the following MA(2) processes, assume a_t is iid N(0, 1). For each case:
(a) Generate and plot 100 values of z_t.
(b) Compute and plot r(k) for k = 1 to 10.
(c) Compute and plot the periodogram.
(d) Based on the population θ's, calculate ρ_1 and ρ_2. Compare to r(1) and r(2).
14.10. Assume an AR(2) process with a_t iid N(0, 1), φ_1 = 0.5, and φ_2 = -0.3.
(a) Generate and plot 100 values of z_t.
(b) Compute and plot r(k) for k = 1 to 10.
(c) Compute and plot the periodogram.
(d) Estimate the φ's.
15. Some Stochastic Hydrologic Models
EVERY DESIGN decision requiring hydrologic knowledge is based on a hydrologic
model of some type. This model might be one that gives the peak discharge from a small
watershed as some function of the watershed area, it might be a flood frequency curve, it might
be a comprehensive "deterministic" model capable of generating synthetic streamflow records,
or it might be a stochastic model for generating a time series of hydrologic data. "Deterministic" is used here in the sense that once the model parameters are known, the same inputs to the
model always produce the same outputs. Thus parametric models are included under "deterministic" even though the model parameters may be functions of observed hydrologic records,
and thus be random variables. Ultimately, design decisions must be based on a stochastic
model or a combination of stochastic and deterministic models. This is because any system
must be designed to operate in the future. Deterministic models are not available for generating future watershed inputs in the form of precipitation, solar radiation, and so forth, nor is it
likely that deterministic models for these inputs will be available in the near future. Stochastic
models must be used for these inputs. If a design is based solely on a historical
record of rainfall or streamflow, the stochastic model employed is simply the historical record
itself. It should be kept in mind that any historical record is but one realization of a stochastic
time series and that future realizations will resemble the historical record only in a statistical
sense even if the process is stationary.
Designers of water resources systems have realized for years that evaluating their designs
using past or historical records provided no guarantee that the design would perform satisfactorily in the future because future flow sequences will not be the same as past flow sequences.
Typically, historical flow sequences are quite short, generally less than 25 years in length.
STOCHASTIC MODELS
371
Even during the 100-year life of a project, an observed historical flow sequence of 10 to 25
years in length will not repeat itself. In all cases the designer would agree that the worst flood
(or drought) on record is not the worst possible flood (or drought). The use of historical records
alone can only approximate the risk involved. That is, if a design is made based on a historical
record, chances are the design would be adequate if the historical record repeated itself.
However, we know that the historical record will not repeat itself. There is, thus, a certain risk
that the design will be inadequate for the unknown flow sequence that the system will actually
experience. This latter point can be illustrated by considering the design of a facility that might
have a 5-year life-say a small, temporary boat dock. For this design assume that 100 years of
flow records are available. The 100 years of record would provide 20 independent 5-year flow
sequences. A proposed design could then be evaluated on 20 independent flow sequences equal
in length to the design life of the facility. If it is found that the design is adequate in 15 of the sequences
and inadequate in the remaining 5, then one would estimate that for some future
5-year period there would be a probability of 0.25 (5/20) that the design would be inadequate.
If the design is adopted, a risk of 0.25 exists. A risk of 0.25 may be unacceptable. For this case,
consider that an acceptable risk is 0.05. This means the design should be increased and reevaluated until it proves inadequate in only 1 of the 20, 5-year observed sequences. Generally, the
design life of a water resources project exceeds the length of available record so that a risk
evaluation using this procedure is not possible.
In other cases it may be desirable to know the severity of a shortage. For instance, one might
be looking at a system to control the thermal pollution from a power plant. It may be that the design requirement is to affect the natural water temperature by less than 5°C. During low flows, the
ratio of the volume of heated water discharged to the volume of natural flow may be such that it is
difficult to keep the overall temperature rise to less than 5.5°C. In this case, it would be desirable
to know the magnitude as well as the frequency of failing to meet the design standard, since a 6°C
temperature rise would be less damaging than a 15°C temperature rise.
The approach outlined above assumes that there is some probabilistic mechanism underlying the generation of streamflows and that this mechanism is sufficiently stable that it can be
considered stationary. It is also assumed that the sample in hand is a representative one.
An even better approach to determining risk probabilities would be through operations
(analytic or Monte Carlo) on the underlying, exact probability distribution or distributions that
the natural hydrologic process follows. Of course, this type of information is never available and
in practice must be approximated. Even if the exact distribution was known, its parameters would
have to be estimated from an observed record (sample) and would not equal the population parameters. Thus, to overcome the objections of design evaluation based on a single (and many
times short) flow record, a data generation scheme or stochastic model is needed.
A stochastic model is a probabilistic model having parameters that must be obtained from
observed data. Stochastic streamflow models, for example, do not convert rainfall to runoff
through theoretical or empirical relationships as do deterministic models, but use the information
in past or historical streamflows. Stochastic streamflows are neither historical flows nor predictions of future flows, but they are representative of possible future flows in a statistical sense.
Stochastically generated data can be used in evaluating risk probabilities, providing a "satisfactory" stochastic model is available.
372
CHAPTER 15
It should be noted that a stochastic model depends heavily on the assumptions of stationarity
and representativeness. The effects of watershed changes, for example, cannot be evaluated. On the
other hand, deterministic models might be able to simulate a changing hydrology as the basin
changes, but remember the future rainfall problem. Thus, one approach to watershed modeling
is to stochastically generate rainfall and use a deterministic model to convert the rainfall to
streamflow.
In developing a stochastic model, it is assumed that the data are the result of a random
process or one that involves chance. One cannot precisely state what the data values will be at
any particular future time, but one will be able to make statements of probability concerning
future data values. In looking over past data records it is apparent that streamflows are not completely random with no constraints, but do possess certain recognizable features. For instance, if
the average annual flow has been around 15 inches for a long period of time, it is unlikely that it
will suddenly change to 25 inches unless the watershed is altered in some fashion. If the flows
have tended to be between 10 and 20 inches per year with only an occasional yearly total outside
these limits, the model should not produce a large number of flows outside these limits. Thus, the
model should preserve the overall mean and spread or variance of the data.
Further, it may be noted that there is some degree of persistence in that low flows tend to follow low flows and high flows tend to follow high flows. A streamflow model should retain this
property. From this it can be seen that historical records certainly guide us in model development.
It is not the purpose of this chapter to promote stochastic hydrologic models. Rather, some
of the most prominent models are discussed. There is a very rapidly expanding literature on
stochastic hydrologic models. No attempt is made to cover all of the models currently in the
literature or to discuss all of the features of the models that are covered in the chapter.
In selecting a stochastic model it is important to be able to state what characteristics of the
phenomena being modeled are important and what characteristics are unimportant. For example,
if streamflow is being modeled, the following is a partial listing of the questions that must be
considered.
1. Is it necessary to model the peak flows?
2. Are annual peaks sufficient or will other peaks occurring during the year be important?
3. Is the time during the year when the peak occurs important?
4. Is the sequence or order of occurrence of the peaks important?
5. Is the simultaneous occurrence of a peak flow and some other event important?
6. Is the volume of flow important?
10. Is the dependence of the flow in one time period on the flow in previous time periods
important?
11. Is it sufficient to model the mean flow for a period? Is the variance important too? What
about the skewness?
12. Is the relationship of the flow on one stream and that on nearby streams of concern?
There is no substitute for a thorough knowledge of the problem to be solved and the features of the problem that must be reproduced by the simulation model. It is relatively easy to
develop a simulation model for a problem by making unrealistic simplifying assumptions. It is
difficult to develop a model for use in solving a problem as it really exists. It is generally better to develop an approximate solution to the real problem than an exact solution to an unreal
problem.
In chapter 14 it was stated that a hydrologic time series may contain trends or jumps. If a
historical record contains trends or jumps and it is desired to use this record for the estimation of
the parameters of a stochastic model, it is necessary to be able to separate the deterministic and
stochastic components of the historical record. Once the deterministic component is removed, the
stochastic component can be used for parameter estimation. The model that is developed may
have to incorporate both deterministic and stochastic components. This is especially apparent in
cases where trends are present. If the trend is expected to continue into the period being modeled,
the trend component must be present in the model that is developed. Trends can generally be
modeled by a polynomial equation of the type

T_p(t) = β_0 + β_1 t + β_2 t^2 + ...

where T_p(t) represents the trend in the parameter p as a function of time t and β_0, β_1, β_2, ... are
coefficients that may be estimated by multiple regression. The order of the polynomial can also
be tested by determining the highest-order term having a regression coefficient that is significantly different from zero.
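The first-order case of this trend model reduces to simple linear regression; a minimal sketch (the function name and example trend values are illustrative), after which subtracting the fitted trend leaves the stochastic component:

```python
def fit_linear_trend(x):
    """Least-squares estimates of beta_0, beta_1 in T(t) = beta_0 + beta_1*t
    for observations x[0..n-1] taken at t = 0..n-1; higher-order polynomials
    would be handled by multiple regression in the same way."""
    n = len(x)
    t_bar = (n - 1) / 2.0
    x_bar = sum(x) / n
    sxy = sum((t - t_bar) * (x[t] - x_bar) for t in range(n))
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    b1 = sxy / sxx
    b0 = x_bar - b1 * t_bar
    return b0, b1

# A noiseless series with trend 3 + 0.25t is recovered exactly and detrends to zero.
x = [3.0 + 0.25 * t for t in range(30)]
b0, b1 = fit_linear_trend(x)
detrended = [x[t] - (b0 + b1 * t) for t in range(30)]
```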
Jumps in a hydrologic time series may be identified by computing the mean value of the parameter of interest during the two time periods on either side of the jump. These two means are
then tested to see if they are significantly different from each other. The exact time at which a
jump occurs cannot be easily identified from the data alone because of the presence of stochastic
variation. In this case, a review of the data gathering procedure and factors affecting the variable
under study should be undertaken in an attempt to identify possible causes for the jump and the
time that these factors became important.
Many stochastic models require the estimation of a large number of parameters. Again, the
limited hydrologic data that is available at a point may be inadequate to estimate these parameters. A regional approach to parameter estimation may help in this situation, provided regional
data are available (Benson and Matalas, 1967; Stedinger et al., 1994; Helsel and Hirsch, 1992; also
chapters 7 and 10).
Two classical stochastic models, the Bernoulli process and the Poisson process, and some
of their potential applications in hydrology, were discussed in chapter 4. The remainder of this
chapter is devoted to other selected models that appear frequently in the hydrologic literature.
known. Stochastic generation from a model of this type merely amounts to generating a sample
of random observations from a univariate probability distribution.
This type of model might be appropriate for generating a synthetic record of flood peaks.
Problems with the method are the uncertainty as to the proper probability distribution to use and
the uncertainty in the parameter values of the probability distribution. These two types of
uncertainty exist in all stochastic models to some extent. The larger the sample for estimating the
model parameters and testing the derived model, the less will be these uncertainties. Regional data
can also be used in some situations to assist in distribution selection and parameter estimation.
Another slightly more advanced application of a purely random model might be in generating sequences of point storm rainfall amounts. The time between storms might be modeled as an
independent Poisson or Bernoulli process (Lane and Osborn 1973) and the amount of rain as a
gamma variable. The model could be made more complex by assuming the distribution parameters are a function of the time of year or that the parameters of the gamma distribution depend on
the generated time since the last storm.
Whether or not a process can be considered as a purely random process may be indicated by
its correlogram, or spectral density. If r(k) is not significantly different from zero for k greater
than zero, or if the spectral density function oscillates randomly with no apparent peaks, the
process may be a purely random process. The difficulties in selecting the proper probability density function and in parameter estimation remain, however.
X_{i+1} = μ_X + ρ_X(1)[X_i - μ_X] + ε_{i+1}    (15.1)

where X_i is the value of the process at time i, μ_X is the mean of X, ρ_X(1) is the first-order serial
correlation, and ε_{i+1} is a random component with E(ε) = 0 and Var(ε) = σ_ε^2. This model states
that the value of X in one time period is dependent only on the value of X in the preceding time
period plus a random component. It is also assumed that ε_{i+1} is independent of X_i. The variance
of X is given by σ_X^2 and can be shown to be related to σ_ε^2 by

σ_ε^2 = σ_X^2[1 - ρ_X^2(1)]    (15.2)

For normally distributed X, the generation form of the model is

X_{i+1} = μ_X + ρ_X(1)[X_i - μ_X] + t_{i+1} σ_X [1 - ρ_X^2(1)]^(1/2)    (15.3)

where t_{i+1} is a random observation from a N(0, 1). The correlogram of the process is

ρ_X(k) = ρ_X^k(1)    (15.4)

Thus, the correlogram exponentially decays from ρ_X(0) = 1 to ρ_X(∞) = 0 according to equation
15.4. If an observed correlogram has this property, the Markov model may be an appropriate
generating model.
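Generation from a first-order Markov model of this form can be sketched in a few lines (function name and the illustrative parameter values are my own; the model form assumed is X_{i+1} = μ + ρ(X_i - μ) + ε with normal ε):

```python
import random

def markov1_generate(n, mu, sigma, rho, seed=0):
    """Generate n values from a first-order Markov (lag-one autoregressive)
    model X_{i+1} = mu + rho*(X_i - mu) + eps, where eps is normal with mean
    zero and variance sigma^2*(1 - rho^2) so that Var(X) = sigma^2."""
    rng = random.Random(seed)
    s_eps = sigma * (1.0 - rho * rho) ** 0.5
    x, out = mu, []
    for _ in range(n):
        x = mu + rho * (x - mu) + rng.gauss(0.0, s_eps)
        out.append(x)
    return out

# e.g. hypothetical annual flows: mean 15 inches, standard deviation 4, rho = 0.6
flows = markov1_generate(5000, mu=15.0, sigma=4.0, rho=0.6)
```

A long generated record should reproduce the mean, standard deviation, and lag-one correlation used to drive it, which is exactly the "preservation" property discussed above.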
Equation 15.3 can be applied to the logarithms of data through the transformation
Y_i = ln(X_i). The generation model is given by

Y_{i+1} = μ_Y + ρ_Y(1)[Y_i - μ_Y] + t_{i+1} σ_Y [1 - ρ_Y^2(1)]^(1/2)    (15.5)

where μ_Y, σ_Y, and ρ_Y(1) refer to the mean, standard deviation, and first-order serial correlation of
the logarithms of the original data. Generation by equation 15.5 preserves the mean, variance,
coefficient of skew, and first-order serial correlation of the logarithms of the original data, but not
of the data itself. Matalas (1967) suggests a procedure for using a first-order Markov model on
the logarithms that preserves the mean, variance, skewness, and first-order serial correlation of the
original data. The procedure is based on the transformation Y_i = ln(X_i - a) with the parameters
of equation 15.5 related to the parameters of X through the following equations:

μ_X = a + exp(σ_Y^2/2 + μ_Y)    (15.6)

σ_X^2 = exp(σ_Y^2 + 2μ_Y)[exp(σ_Y^2) - 1]    (15.7)

γ_X = [exp(3σ_Y^2) - 3 exp(σ_Y^2) + 2]/[exp(σ_Y^2) - 1]^(3/2)    (15.8)

ρ_X(1) = [exp(σ_Y^2 ρ_Y(1)) - 1]/[exp(σ_Y^2) - 1]    (15.9)
In these equations, μ_X, σ_X^2, γ_X, and ρ_X(1) refer to the mean, variance, coefficient of skew, and
first-order serial correlation of the original data and are estimated by X̄, s_X^2, Cs_X, and r_X(1),
respectively. The quantities μ_Y, σ_Y, ρ_Y(1), and a are estimated from equations 15.6-15.9 and then
used in equation 15.5 to generate values for Y_{i+1}. X_{i+1} is then calculated from

X_{i+1} = a + exp(Y_{i+1})

The X's generated in this fashion have the same mean, variance, skewness, and first-order serial
correlation as the sample used to estimate μ_X, σ_X^2, γ_X, and ρ_X(1).
The procedure that is recommended for estimating μ_Y, σ_Y, ρ_Y(1), and a is to solve equation
15.8 for s_Y. Equation 15.9 then yields r_Y(1), equation 15.7 yields Ȳ, and equation 15.6 yields a.
Equation 15.1 can be used to generate X's that are distributed approximately gamma with
mean X̄, variance s_X^2, and skewness Cs_X (Thomas and Fiering 1963). The procedure is to define γ_ε
as the skewness of the random component, ε. γ_ε is estimated by

g_ε = Cs_X [1 - r_X^3(1)]/[1 - r_X^2(1)]^(3/2)

and the random component is computed from the Wilson-Hilferty approximation

ε_{i+1} = s_ε {(2/g_ε)[1 + g_ε t_{i+1}/6 - g_ε^2/36]^3 - 2/g_ε}

where t_{i+1} is a random value from a N(0, 1) and s_ε = s_X[1 - r_X^2(1)]^(1/2). X_{i+1} is then generated by equation 15.1,
with the resulting generated X's being approximately gamma distributed with mean X̄, variance
s_X^2, first-order serial correlation r_X(1), and skewness Cs_X.
The first-order Markov model (equation 15.1) is also known as the first-order autoregressive
model because ρ_X(1) is equal to the regression coefficient β that would be obtained from a
regression with X_{i+1} as the dependent variable and X_i as the independent variable.
The seasonal mean μ_{X_j} is estimated by

X̄_j = (1/n) Σ_{i=1}^{n} X_{i,j}

with n equal to the number of years of data and X_{i,j} the data value in the jth season of the ith year.
Similarly, σ_{X_j}^2 is estimated by s_{X_j}^2, γ_{X_j} is estimated by Cs_{X_j}, and ρ_{X_j}(1) is estimated by r_{X_j}(1).
Note that ρ_{X_j}(1) is the first-order serial correlation between values in successive seasons. If
monthly streamflow is being considered, ρ_{X_4}(1) would be the first-order serial correlation
between flows in months 4 and 5. ρ_{X_j}(1) would be estimated by

r_{X_j}(1) = Σ_{i=1}^{n} (X_{i,j} - X̄_j)(X_{i,j+1} - X̄_{j+1})/[(n - 1) s_{X_j} s_{X_{j+1}}]

where s_{X_j} and s_{X_{j+1}} are the sample standard deviations for seasons j and j + 1.
In equation 15.15 there are some notational problems when j = m. In this case, j + 1 should be
taken as 1 because the first season follows the mth season (January follows December, for example).
With this notation, the multiseason, first-order Markov model for normally distributed flows
becomes

X_{i,j+1} = μ_{X_{j+1}} + ρ_{X_j}(1)(σ_{X_{j+1}}/σ_{X_j})[X_{i,j} - μ_{X_j}] + t_{i,j+1} σ_{X_{j+1}} [1 - ρ_{X_j}^2(1)]^(1/2)    (15.17)
In any application, the population parameters are estimated by the corresponding sample
statistics. The subscript notation of equation 15.17 again has problems in that X_{i,j+1} is really
equal to X_{i+1,1} when j = m. For instance, if a monthly model is considered, then X_{i,13} (or the 13th
month of year i) is really X_{i+1,1}, the first month of year i + 1.
where t_{i,j} is a random value from a N(0, 1). Equation 15.13 becomes identical to equation 15.17,
with population parameters replaced by their estimates, except that ε_{i,j+1} is used in place of t_{i,j+1}. The
resulting X_{i,j} will be distributed almost gamma. Because skewness varies from season to season,
the representation is not statistically pure (Fiering and Jackson 1971). This is because the sum of
gamma variates is not gamma unless the scale parameter, λ, is the same.
Equation 15.17 can be applied to the logarithms of the original data. In this case, X_{i,j} would
refer to the logarithm of the value in the ith year and jth season. The parameters of the model
would also be based on the logarithms. The model used in this way would preserve the mean,
variance, skewness, and first-order serial correlation of the logarithms of the data, but not of the
data itself. Equations 15.3 and 15.17 have been widely used in hydrology. Equation 15.17 is
sometimes known as the Thomas-Fiering model because of the early work of these two
researchers with the model (Thomas and Fiering 1962, 1963; Fiering 1967). The model in the
form of equation 15.17 requires that many parameters be estimated. For each season the mean,
variance, and first-order serial correlation must be estimated. This results in estimating 3m
parameters (a monthly model requires one to estimate 36 parameters). This large number of
parameters requires considerable data. The technique based on data generation given in chapter
13 can be used for evaluating the effect of the length of record available for parameter estimation
on the reliability of the Thomas-Fiering model or other stochastic models.
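A season-by-season generator of this type can be sketched compactly; the following assumes normal flows and uses hypothetical seasonal parameter values (the function name and parameters are illustrative, not from the text):

```python
import random

def thomas_fiering(nyears, mu, sd, r, seed=0):
    """One realization of a multiseason first-order Markov (Thomas-Fiering
    type) model with normal flows.  mu, sd, r are length-m lists of seasonal
    means, standard deviations, and lag-one correlations between seasons j
    and j+1; season m wraps around to season 1.  3m parameters in all."""
    m = len(mu)
    rng = random.Random(seed)
    x, j, flows = mu[0], 0, []     # start the recursion at the season-1 mean
    for _ in range(nyears * m):
        k = (j + 1) % m            # the season being generated
        x = (mu[k] + r[j] * (sd[k] / sd[j]) * (x - mu[j])
             + rng.gauss(0.0, sd[k] * (1.0 - r[j] ** 2) ** 0.5))
        flows.append(x)
        j = k
    return flows

# Hypothetical parameters for a 12-season (monthly) model: 36 values in all.
mu = [float(5 + j) for j in range(12)]
sd = [2.0] * 12
r = [0.5] * 12
flows = thomas_fiering(300, mu, sd, r, seed=3)
```

A long run should reproduce each seasonal mean, which is the preservation property the 3m parameters buy.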
A multilag (mth-order Markov) model extends equation 15.1 to include m preceding periods:

X_{i+1} = μ_X + β_1[X_i - μ_X] + β_2[X_{i-1} - μ_X] + ... + β_m[X_{i-m+1} - μ_X] + ε_{i+1}

The X's might represent actual data values or their natural logarithms. In the case of a normal
model, the random element becomes

ε_{i+1} = t_{i+1} σ_X (1 - R^2)^(1/2)

where σ_X^2 is the variance of X; R^2 is the multiple coefficient of determination between X_{i+1} and X_i,
X_{i-1}, ..., X_{i-m+1}; t_{i+1} is a random observation from N(0, 1); and the β's are multiple regression
coefficients.
The multilag model permits one to incorporate linear influences on data in one period from data in several preceding periods. The regression coefficients, β, can be estimated by
normal multiple regression means. The question of how many lags to include can also be analyzed by the methods of multiple regression for determining whether or not a particular
"independent" variable is important.
One difference between this model and the multiple regression procedures is that the
number of observations available for parameter estimation, n*, changes as the number of lags
changes. If there are n total observations, then there are n - 1 observations available for
estimating px(l) of the first-order Markov model. If two lags are considered (m = 2), then there
are only n - 2 observations available for parameter estimation. In general, for an mth-order
Markov model, there are n* = n - m observations for parameter estimation. What this means is
that multiple regression techniques are not strictly applicable because the sample size and
variables involved in the regressions change as the number of lags changes. For instance, R^2 may
actually increase if the number of lags included decreases, because the data set involved in the
regression has changed. Generally, it is recommended that if a kth-order lag is included, then all
lags up to k also be included. For example, if a third-order lag is included, then the first- and
second-order lags should be included as well.
prob(X_t = a_j | X_{t-1} = a_i, X_{t-2} = a_k, ..., X_0 = a_r) = prob(X_t = a_j | X_{t-1} = a_i)    (15.22)

The conditional probability, prob(X_t = a_j | X_{t-1} = a_i), gives the probability that the process at time
t will be in "state" j given that at time t - 1 the process was in "state" i. Equation 15.22 says that
this conditional probability is independent of the "states" occupied at times prior to t - 1. A state
is simply a subdivision of the process X_t into some interval. Thus, if X_t represents the depth of
rainfall on day t, one state might be defined as no rainfall, another as between 0.00 and 0.05
inches of rainfall, and so forth.
The prob(X_t = a_j | X_{t-1} = a_i) is commonly called the one-step transition probability. That is,
it is the probability that the process makes the transition from state a_i to state a_j in one time period
or one step. The prob(X_t = a_j | X_{t-1} = a_i) is usually written as p_i,j(t), indicating the probability of
a step from a_i to a_j at time t. If p_i,j(t) is independent of t (p_i,j(t) = p_i,j(t + τ) for all t and τ), then
the Markov chain is said to be homogeneous. In this event

prob(X_t = a_j | X_{t-1} = a_i) = p_i,j    (15.23)
Higher-order Markov chains can be defined to represent stochastic processes such that the
value of the process at time t is dependent on its value in several immediately preceding time periods. Thus, an nth-order Markov chain is one in which

prob(X_t = a_j | X_{t-1} = a_i, X_{t-2} = a_k, ..., X_0 = a_r) = prob(X_t = a_j | X_{t-1} = a_i, ..., X_{t-n} = a_s)
Because the process must occupy some state at every time step,

Σ_{j=1}^{m} p_i,j = 1    (15.25)

With this restriction, an m-state Markov chain requires that m(m - 1) transition probabilities
(parameters) be estimated. The remaining m p_i,j's can be determined from equation 15.25. The m^2
transition probabilities can be represented by the m × m matrix P given by

P = | p_1,1  p_1,2  ...  p_1,m |
    | p_2,1  p_2,2  ...  p_2,m |
    | ...                      |
    | p_m,1  p_m,2  ...  p_m,m |

Equation 15.25 states that the elements in any row of P must sum to unity. A matrix having
this property is said to be a stochastic matrix. Some authors define P as the transpose, P = [p_i,j]' = [p_j,i]. Under
that definition the columns of P sum to unity. The definition given by equation 15.23 will be used
in this treatment.
The transition probability matrix P can be estimated from observed data by tabulating the
number of times, n_i,j, the observed data went from state i to state j. Then an estimate for p_i,j
would be

p̂_i,j = n_i,j / Σ_{j=1}^{m} n_i,j

Considerable data may be required to get accurate estimates of p_i,j if p_i,j is small. This is because
in an observed set of data, n_i,j may be uncharacteristically high or low if p_i,j is close to zero and
the sample is small.
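The tabulation-and-normalization estimate can be sketched directly (function name and the small wet/dry example sequence are illustrative):

```python
def estimate_transition_matrix(states, m):
    """Estimate p_ij = n_ij / (sum over j of n_ij) from a sequence of
    observed states coded 0..m-1.  A row with no observations is left
    uniform so the estimated matrix stays stochastic."""
    n = [[0] * m for _ in range(m)]
    for a, b in zip(states, states[1:]):
        n[a][b] += 1
    P = []
    for i in range(m):
        total = sum(n[i])
        P.append([n[i][j] / total if total else 1.0 / m for j in range(m)])
    return P

# Dry = 0, wet = 1: nine transitions in a ten-day record.
seq = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
P = estimate_transition_matrix(seq, 2)
```

Here dry is followed by dry 2 times out of 5 and wet by dry 3 times out of 4, so the estimated rows are [0.4, 0.6] and [0.75, 0.25]; with so few transitions the estimates are, as the text warns, quite rough.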
The more states that a process is divided into, the less accurate will be the estimates for p_i,j.
For example, if a daily rainfall model is being considered, one might like to have 10 states to adequately represent the possible amounts of rainfall. However, 10 states require the estimation of
90 transition probabilities. This, in turn, requires a large amount of data.
Once P is known, all that is required to determine the probabilistic behavior of the
Markov chain is the initial state of the chain. In the following, the notation p_j^(n) means the probability that the chain is in state j at step or time n. The 1 × m vector p^(n) has elements p_j^(n).
Thus

p^(n) = (p_1^(n), p_2^(n), ..., p_m^(n))

Under this definition p^(0) is the initial probability vector. p^(1) is then given by

p^(1) = p^(0) P

and, in general,

p^(n) = p^(n-1) P = p^(0) P^n    (15.31)
For a proof of these relationships, reference should be made to any number of books on probability or stochastic processes (see for instance Bailey 1964; Feller 1957; Breiman 1969).
As the Markov chain advances in time, p_j^(n) becomes less and less dependent on p^(0).
That is to say, the probability of being in state j after a large number of steps becomes independent of the initial state of the chain. A point is reached where p^(n) = p^(n+m) for a sufficiently
large n. From equation 15.31 we then get, for a sufficiently large n, that P^n = P^(n+m). When this
occurs the chain is said to have reached a steady state. Under steady state conditions,
p^(n) = p^(n+1) and can thus be denoted simply as p. The 1 × m vector p can be thought of as
giving the probabilities of being in the various states after a large number of steps. Under
steady state conditions

p P = p     (15.33)
The solution of equation 15.33 thus provides p. P^n is called the n-step transitional probability matrix. That is, P^n = [p_ij^(n)] has elements which
give the probability of going from state i to state j in n steps. Since for large n, p_ij^(n) is independent
of the initial state, we must have p_ij^(n) = p_j^(n). Thus, P^n is made up of m 1 × m vectors all equal to
p. That is, for large n

P^n = (p, p, ..., p)'

One can therefore calculate the steady state probabilities simply by computing P^n for a large
enough n. In practice, one would compute P^n and P^2n. If the two differed by only an acceptably
small amount, p would be taken as one of the rows of P^2n. Bailey (1964) gives a procedure for calculating P^n based on characteristic roots. On a digital computer, P^n can be easily evaluated by multiplication. This method for finding p may require n to be very large. The steady state probabilities
p can also be determined directly from equation 15.33. Example 15.2 illustrates this approach.
Example 15.1. Consider a 2-state, first-order Markov chain for a sequence of wet and dry days.
Let state 1 be a dry day and state 2 be a wet day. Assume the transitional probability matrix to be

P = | 0.9  0.1 |
    | 0.5  0.5 |

Thus the probability of a dry day following a wet day is given by p_21 = 0.5. Evaluate:
(a) prob(day 1 wet | day 0 dry)
(b) prob(day 2 wet | day 0 dry)
(c) prob(day 100 wet | day 0 dry)
Solution:
(a) prob(day 1 wet | day 0 dry) = p_12 = p_12^(1) = 0.1
(b) prob(day 2 wet | day 0 dry) = p_12^(2) = (0.9)(0.1) + (0.1)(0.5) = 0.14
(c) However, the fact that day 0 was dry would not significantly affect the probability of rain on day
100. Therefore, it can be assumed that n is large and the solution based on the steady state probabilities contained in P^n for large n.

P^16 = | 0.8333  0.1667 |
       | 0.8333  0.1667 |

P^16 is assumed to be the steady state n-step transitional probability matrix because the elements are not
changing much and the two rows are identical. Thus, p_12^(100) = p_2^(100) = 0.1667. The probability of
rain on any day in the distant future is 0.1667. For this to be true, an analysis of rainfall records
should show that 16.67% of the days are wet. This serves as a check on P.
Comment: Another check on the steady state probabilities is to see if equation 15.33 is valid; here
(0.8333, 0.1667) P = (0.8333, 0.1667). This demonstrates that p = (0.8333, 0.1667) is the steady state
probability vector. See example 15.2 for further comment on this.
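The steady-state check in this example can be reproduced by repeated squaring of P; the 2 × 2 matrix below is the one implied by p_12 = 0.1 and p_21 = 0.5:

```python
# Steady-state check for example 15.1: raise P to a high power by repeated
# squaring and observe that both rows converge to the same vector p.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P = [[0.9, 0.1], [0.5, 0.5]]
Pn = P
for _ in range(5):          # P^2, P^4, ..., P^32
    Pn = matmul(Pn, Pn)

# Both rows converge to the steady-state vector p = (0.8333, 0.1667)
print([round(x, 4) for x in Pn[0]])   # -> [0.8333, 0.1667]
```

The second eigenvalue of this P is 0.4, so the rows agree to four decimals well before P^32; a slowly mixing chain would need a much larger n, which is why the direct solution of equation 15.33 is often preferred.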
Example 15.2. Consider a Markov chain model for the amount of water in storage in a reservoir.
Let state 1 represent the nearly full condition, state 2 an intermediate condition, and state 3 the
nearly empty condition. Assume that the transition probability matrix is given by

P = | 0.4  0.6  0.0 |
    | 0.2  0.6  0.2 |
    | 0.0  0.7  0.3 |

Note that it is not possible to pass directly from state 1 to state 3 or from state 3 to state 1 without going through state 2. Over the long run, what fraction of the time is the reservoir level in
each of the states?
Solution: The fraction of time spent in each state is given by p. Equation 15.33 can be used to
determine p. Examination of equation 15.33 shows that if p is a solution, so is λp for any scalar λ. Therefore,
a solution to 15.33 is unique only up to a scalar multiplication. However, because p is a
probability vector, the sum of its elements must be 1. Therefore, our solution technique is to find
an arbitrary solution to 15.33 and then scale it so that Σ p_i = 1. Writing out p P = p gives

0.4p_1 + 0.2p_2 = p_1
0.6p_1 + 0.6p_2 + 0.7p_3 = p_2
0.2p_2 + 0.3p_3 = p_3

If we let p_1 = 1, the first of these equations gives p_2 = 3. With p_2 = 3, the last of the equations
gives p_3 = 6/7. Therefore, one solution of p P = p is p = (1, 3, 6/7). Since Σ p_i must equal 1, p
can be scaled so that p = (0.2059, 0.6176, 0.1765). This solution can be substituted into equation
15.33 to verify that it is, in fact, a solution. Another check would be to compute P^n for large n and
show that P^n = (p, p, p)'. Thus, over the long run the reservoir is nearly full 20.59% of the time,
nearly empty 17.65% of the time, and in the intermediate state 61.76% of the time.
Comment: This problem illustrates the direct solution via equation 15.33 for p. The P^n approach
for determining p can, however, also be used. For this example, P^8 and P^16 can be found (to 4
significant figures); the rows of P^16 are each equal to p = (0.2059, 0.6176, 0.1765).
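The direct solution of equation 15.33 can be sketched as follows, using the three-state matrix assumed in this example and following the text's device of fixing p_1 = 1 and rescaling:

```python
# Direct steady-state solution of p P = p (equation 15.33) for example 15.2.

P = [[0.4, 0.6, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.7, 0.3]]

# Fix p1 = 1, use the first equation of p P = p for p2 and the last for p3,
# then rescale so the probabilities sum to one.
p1 = 1.0
p2 = p1 * (1.0 - P[0][0]) / P[1][0]     # 0.4 p1 + 0.2 p2 = p1  ->  p2 = 3
p3 = p2 * P[1][2] / (1.0 - P[2][2])     # 0.2 p2 + 0.3 p3 = p3  ->  p3 = 6/7
total = p1 + p2 + p3
p = [p1 / total, p2 / total, p3 / total]
print([round(x, 4) for x in p])         # -> [0.2059, 0.6176, 0.1765]
```

The closed-form shortcut above works because p_13 = p_31 = 0 decouples the first and last equations; for a general matrix one would solve the full linear system with the Σ p_i = 1 constraint.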
Example 15.3. Assume that the reservoir of example 15.2 is nearly full at t = 0. Generate a
sequence of 10 possible reservoir levels corresponding to t = 1,2, ..., 10.
Solution: The matrix P can be written in the form of a cumulative transition probability matrix P* where

P*_ij = Σ_{k=1}^{j} p_ik

and P*_im = 1 for every state i. For this example

P* = | 0.4  1.0  1.0 |
     | 0.2  0.8  1.0 |
     | 0.0  0.7  1.0 |

At each step a uniform random number r on (0, 1) is drawn; if the chain is in state i at time t, the state at time t + 1 is the smallest j such that r ≤ P*_ij.
Time    State   Random   State       Reservoir
t       at t    no.      at t + 1    level at t
0       1       0.48     2           Nearly full
1       2       0.52     2           Intermediate
2       2       0.74     2           Intermediate
3       2       0.15     1           Intermediate
4       1       0.27     1           Nearly full
5       1       0.03     1           Nearly full
6       1       0.49     2           Nearly full
7       2       0.02     1           Intermediate
8       1       0.97     2           Nearly full
9       2       0.96     3           Intermediate
10      3       —        —           Nearly empty
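The table above can be reproduced in a few lines; the transition matrix is the one assumed in example 15.2, and the random numbers are those listed in the table:

```python
# Reproduction of the example 15.3 simulation: the next state is the first
# state j whose cumulative transition probability P*_ij covers the random number.

P = [[0.4, 0.6, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.7, 0.3]]

def cumulative(row):
    out, s = [], 0.0
    for prob in row:
        s += prob
        out.append(s)
    return out

Pstar = [cumulative(row) for row in P]     # cumulative transition matrix

def next_state(state, r):
    """Return the first state j (1-based) whose cumulative probability covers r."""
    return next(j + 1 for j, c in enumerate(Pstar[state - 1]) if r <= c)

# Random numbers from the table; the reservoir starts nearly full (state 1).
rands = [0.48, 0.52, 0.74, 0.15, 0.27, 0.03, 0.49, 0.02, 0.97, 0.96]
states = [1]
for r in rands:
    states.append(next_state(states[-1], r))
print(states[1:])   # -> [2, 2, 2, 1, 1, 1, 2, 1, 2, 3]
```

In practice the random numbers would come from a uniform generator rather than a fixed list; using the table's values makes the run reproduce the text exactly.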
Markov chains have been used in hydrology for modeling rainfall (Gabriel and Neumann
1962; Pattison 1964; Bagley 1964; Grace and Eagleson 1966; Hudlow 1967). Lloyd (1967) presents a discussion of the application of Markov chains to reservoir theory.
Some of the difficulties in using Markov chains in hydrology are:
3. Assigning a number to the magnitude of an event once the state is determined (i.e., how much
rainfall should be assigned given that the chain moved to state 3 and that state 3 encompasses all
rainfalls between 1 and 2 inches).
4. Estimating the large number of parameters involved in even a moderate size Markov chain
model. A chain with 5 states has 20 parameters to estimate. If seasonality is encountered and
4 seasons are needed, 80 parameters are required.
5. Handling situations where some transitions are dependent on several previous time periods
while others are dependent on only one prior time period. Hudlow (1967) found the dry-dry
transition for hourly rainfall showed a sixth-order Markov dependence while a first-order
dependence was adequate for the other transitions.
Woolhiser, Rovey, and Todorovic (1973) discuss an n-day rainfall model in which the transition from wet to dry days is based on a 2-state Markov chain and the amount of rain on rainy
days is exponentially distributed. Haan et al. (1976) describe a 7-state Markov chain model of
daily rainfall in which the amount of rain in each state is assumed uniformly distributed except
for the last state, in which a shifted exponential distribution is used.
Carey and Haan (1976) present a modified Markov chain daily rainfall simulation model in
which the transitional probabilities are replaced by a continuous probability distribution. That is,
given that the system is in state i on day n in season k, the probability distribution of the
amount of rain on day n + 1, prob(X_{n+1} ≤ x | X_n in state i, season k), is given by a continuous
distribution function (equation 15.35). A full 7-state, 12-season Markov chain model would require
the estimation of 12(7 × 6) + 1 = 505 parameters. When simulated rainfall for these two models was
compared to historical rainfall at 7 Kentucky locations, the Carey-Haan model proved superior.
Exercises
15.1. Develop a stochastic model for generating a sequence of numbers that could represent the
years between eruptions of the Volcano Aso (see exercise 14.3). Use the model to generate a
series of 100 possible times (years) between eruptions. Compare the correlogram and spectral
density functions for the generated and observed sequences.
15.2. Assume that the time (days) between rains follows a Poisson distribution with a mean of
2 days. Further assume that the amount of rain (inches) on rainy days follows a gamma distribution
with a mean of 1 inch and a variance of 0.50 inch². Simulate 1 year of rainfall using this model.
15.3. Use the first-order Markov model to generate 100 years of annual runoff (inches) for Cave
Creek near Fort Spring, Kentucky. (Basic data in Appendix.)
15.4. Generate a random sample of size 100 from a gamma distribution with η = 3.5 and λ =
2.5. Plot the observed and expected relative frequencies.
15.5. The following data are presented by Burges and Johnson (1973) for the Sauk River in
Washington. Based on this data and the first-order, seasonal, lognormal Markov model, generate
50 years of streamflow data. Compute and plot the correlogram and spectral density function for
the generated data.
Month   X̄_j     S_X,j   Y_X,j
Oct.    5.02    2.31    0.61
Nov.    6.50    3.38    0.58
Dec.    7.33    3.23    0.50
Jan.    6.42    2.95    0.31
Feb.    5.35    2.62    0.38
Mar.    5.02    1.66    0.37
Apr.    6.42    1.80    0.44
May     10.70   2.89    0.34
June    12.76   3.32    0.17
July    9.05    3.26    0.65
Aug.    4.44    1.47    0.93
Sept.   3.29    1.22    0.51
15.6. Generate 100 years of monthly streamflow data for Cave Creek near Fort Spring, Kentucky, using the seasonal first-order normal Markov model. Compare the correlogram and spectral density function of the simulated and observed data. (Basic data are in Appendix.)
15.7. Write out and explain how a model such as described by equations 15.20 and 15.21 can be
used as a higher-order, multiseason Markov model. Apply the model to Cave Creek near Fort
Spring, Kentucky, (Appendix for data), using a second-order, monthly Markov model.
15.8. Use the first-order, normal Markov model to generate 100 years of annual runoff for the
Spray River near Banff, Canada. Compare the correlogram and spectral density functions for the
observed and simulated data.
15.9. Use equation 15.29 to show the individual generating equations for a 2-site model in terms
of ρ_1,2(0), ρ_1(1), ρ_2(1), and x_i.

15.10. What is ρ_j,j(1) for the model given by equation 15.25?
15.11. Generate 1 year of rainfall letting the sequence of wet and dry days be defined by the
Markov chain of example 15.1 and the amount of rainfall on a rainy day by a gamma distribution
with a mean of 1 inch and a variance of 0.50 inch².
15.12. Generate a succession of 200 water level states for the situation described in examples
15.2 and 15.3. What fraction of the time was the reservoir level in each of the three states? How
does this compare to the predicted results of example 15.2?
A hydrologic model may be written in the general form O = f(I, P, t) + e, where O represents the
outputs being modeled, I represents the inputs to the model such as rainfall, temperature, and so on,
P represents the parameters required by the model, t represents time, and e represents the errors
associated with the modeling process.
One axiom of stochastic processes is that any function of a random variable is itself a
random variable. Thus, if any of the variables in I or P are uncertain and known only in a probabilistic sense, or if the nature of the functional relationships in the model is uncertain, then O
is also uncertain and can be known only in a probabilistic sense. The design and analysis of hydrologic, hydraulic, and environmental projects are subject to uncertainty because of inherent
uncertainty in natural systems, a lack of understanding of the causes and effects in various
physical, chemical, and biological processes occurring in natural systems, and insufficient data.
This chapter was written by Dr. Aditya Tyagi, formerly a graduate research assistant in the
Biosystems and Agricultural Engineering Department of Oklahoma State University, Stillwater, Oklahoma, and currently a water resources engineer with CH2M Hill, Austin, Texas.
As a result of these uncertainties, the performance of a project will also be uncertain. The presence of uncertainties brings into question conventional deterministic design practices due to
their inability to account for possible variations of system responses. The issues involved in the
design and analysis of water resources and environmental engineering systems under uncertainty are multidimensional. Therefore, quantification of system uncertainties is imperative in
order to design or operate a project successfully. Reliability, risk, and uncertainty analysis are
therefore becoming increasingly important in modeling and designing water resources infrastructure and decision support systems. In some cases, uncertainty analysis is mandatory, particularly when critical decisions involve potentially high levels of risk. A systematic quantitative uncertainty analysis provides insight into the level of confidence warranted in model
estimates and in understanding judgements associated with modeling processes. It may also
play an illuminating role in identifying how robust the conclusions about model results are and
help target data gathering efforts.
It is apparent that considerable work may be involved in gathering the data required to characterize the uncertainty in each parameter and the parameters as a whole. Before making any data
collection effort, it would be wise to investigate the importance of various parameters to the
process being modeled. If a parameter has little impact on the output of a model, there is no need
to spend a great deal of time and money estimating that parameter or worrying about the uncertainty
in that parameter. Sensitivity analysis is used to measure the importance of a parameter.
SENSITIVITY ANALYSIS
Sensitivity analysis is the study of how the variation in the output of a model can be apportioned, qualitatively or quantitatively, to different sources of variation, and how the given model
depends upon the information fed into it. It ranks model parameters based on their contribution
to overall error in model predictions. While carrying out sensitivity analysis, selection of an efficient sensitivity analysis method is critical.
Traditional or Local Sensitivity Analysis
This method is also known as a one-parameter-at-a-time sensitivity analysis, in which the
effect of the variation in each uncertain input parameter is determined by keeping the other uncertain
parameters at a constant level (generally at their expected values). The result is a series of partial
derivatives, one for each parameter, that defines the rate of change of the output function relative
to the rate of change of the input parameter. Two types of sensitivity coefficients are used. One is
called an absolute sensitivity coefficient, or simply the sensitivity coefficient, S, and the other is
called a relative sensitivity coefficient, S_r. These coefficients are defined as

S = ∂O/∂P     (16.1)

S_r = (∂O/∂P)(P/O)     (16.2)

where S is the absolute sensitivity (output units/input units), S_r is the relative sensitivity (dimensionless), O represents a particular output, and P represents a particular input parameter. Graphically, the terms in these relationships are shown in figure 16.1.
CHAPTER 16
Fig. 16.1. Definitions for numerical derivatives.
Most hydrologic and water quality models are a collection of algorithms and not a continuous function of the parameters in the usual sense, so numerical derivatives are used
to approximate the partial derivatives of equation 16.2. The numerical derivatives may be
approximated as

∂O/∂P ≈ (O_{P+ΔP} − O_{P−ΔP}) / (2ΔP)     (16.3)

where ΔP is the amount a parameter is perturbed from its base value (this is generally taken as
10% or 15% of P). The numerical derivatives are calculated about base parameter values. The relative sensitivity coefficients are dimensionless and thus can be compared across parameters,
whereas the absolute sensitivity coefficients are affected by the units of output and input and cannot
be directly compared across non-commensurate parameters.
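A minimal sketch of central-difference sensitivity coefficients in the style of equation 16.3; the model and base values below are hypothetical, chosen only to illustrate the calculation:

```python
# Numerical sensitivity coefficients: central differences about the base value,
# with S_r = S * P / O making the result dimensionless and comparable.

def sensitivities(model, base, name, frac=0.10):
    """Absolute (S) and relative (S_r) sensitivity of the model output to one parameter."""
    P = base[name]
    dP = frac * P                               # perturbation, e.g. 10% of P
    hi = dict(base, **{name: P + dP})
    lo = dict(base, **{name: P - dP})
    S = (model(hi) - model(lo)) / (2.0 * dP)    # central difference
    Sr = S * P / model(base)
    return S, Sr

# Hypothetical model for illustration: O = a * b**2
model = lambda p: p["a"] * p["b"] ** 2
base = {"a": 3.0, "b": 2.0}
S, Sr = sensitivities(model, base, "b", frac=0.01)
print(round(Sr, 3))   # -> 2.0
```

For this quadratic the central difference is exact, so S_r equals the power of b; for a general nonlinear model the result depends on the perturbation size, as the text discusses below.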
Most hydrological and environmental engineering models are complex and contain a large
number of parameters. The disadvantage of determining model response to one parameter at a
time (performing the traditional sensitivity analysis) is that it requires considerable computation
and provides information about only one point in the parameter space. In some cases, this type of
sensitivity analysis may be misleading, as the parameter combinations explored would be unlikely in the
real world. To overcome this problem, global sensitivity analysis may be used.
Global Sensitivity Analysis
This method is also known as a variance-based method. The effect of variation in the
inputs, as all inputs are allowed to vary over their ranges, taking into account the shape of their
probability density functions, is determined. This usually requires some procedure for sampling
the parameters, perhaps in a Monte Carlo simulation (MCS) form. The MCS process is illustrated in figure 16.2 and discussed in detail in the subsequent section, Uncertainty Methods.
If several parameters are simultaneously and independently varied, then the multiple
regression of output O on all parameters, P_i, is

O = b_0 + b_1 P_1 + b_2 P_2 + ... + b_n P_n     (16.4)

where the b_i represent regression coefficients. Normalized sensitivity indices can be obtained for
each variable in equation 16.4 by subtracting its mean and dividing by its estimated standard
deviation. The normalized regression model is

(O − Ō)/s_O = β_1 (P_1 − P̄_1)/s_P1 + β_2 (P_2 − P̄_2)/s_P2 + ... + β_n (P_n − P̄_n)/s_Pn     (16.5)

where Ō and s_O are the mean and standard deviation of the simulated output O, and P̄_i and s_Pi are the mean
and standard deviation of the ith parameter. By equating equation 16.4 with equation 16.5, the
relationship between the standardized coefficient β_i and the un-normalized multiple regression
coefficient b_i is

β_i = b_i s_Pi / s_O     (16.6)

For independent parameters, β_i is the correlation coefficient between P_i and the output O.
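A sketch of equation 16.6 under hypothetical numbers: for a linear model with independent inputs sampled by Monte Carlo, the standardized coefficients follow from the un-normalized coefficients, and their squares sum to approximately one:

```python
# Standardized regression (global sensitivity) sketch for a linear model
# O = b1*P1 + b2*P2 with independent, hypothetical normal inputs.

import random
import statistics

random.seed(1)
b1, b2 = 2.0, 5.0
P1 = [random.gauss(10.0, 1.0) for _ in range(5000)]
P2 = [random.gauss(4.0, 0.5) for _ in range(5000)]
O  = [b1 * x + b2 * y for x, y in zip(P1, P2)]

s1, s2, sO = statistics.stdev(P1), statistics.stdev(P2), statistics.stdev(O)
beta1 = b1 * s1 / sO        # equation 16.6
beta2 = b2 * s2 / sO
print(round(beta1**2 + beta2**2, 2))   # close to 1.0 for independent inputs
```

Here b2·s_P2 exceeds b1·s_P1, so P2 contributes more to the output variance even though its nominal range is smaller; this is exactly the probabilistic weighting that local sensitivity analysis ignores.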
Example 16.1. The head loss, h_f (m), in a pipe is given by the Hazen-Williams equation as

h_f = A_m (10.654 L Q^1.852) / (C^1.852 D^4.87)

Compute the sensitivity coefficients (equation 16.2) assuming L is constant (1500 m) and the mean
values of A_m, Q, C, and D are 1.0 (dimensionless), 0.915 (m³/s), 130 (SI units), and 0.305 (m),
respectively.
Solution: The output, h_f, is calculated by substituting the base values (mean values) into the given
functional relation as h_f = 535.29 m.
Analytical partial derivatives of h_f with respect to the various uncertain parameters are determined, and the
absolute sensitivity coefficients are evaluated by substituting mean values into the resulting expressions.
The calculation is presented in table 16.1, which indicates that D is the most sensitive parameter.
Table 16.1. Sensitivity analysis using analytical derivatives (columns: parameter P_i, symbol, base value, S, S_r)
Example 16.2. For the preceding example, determine S and S_r using numerical approximation.

Solution: First, the output value O at the base values of the input parameters is found to be
535.29. Then, assuming ΔP = 10%, the parameters are perturbed about their base values. The calculation is presented in table 16.2.
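The numerical sensitivities of examples 16.1–16.2 can be sketched as follows; the Hazen-Williams constant 10.654 is an assumption chosen to reproduce the base output of 535.29 m quoted in the text:

```python
# Relative sensitivities of the Hazen-Williams head loss by central differences.
# For a pure power law the relative sensitivity equals the exponent, so the
# numerical values should land near 1.852 (Q), -1.852 (C), and -4.87 (D).

L = 1500.0                                    # deterministic pipe length (m)
base = {"Am": 1.0, "Q": 0.915, "C": 130.0, "D": 0.305}

def hf(p):
    return p["Am"] * 10.654 * L * p["Q"] ** 1.852 / (p["C"] ** 1.852 * p["D"] ** 4.87)

def rel_sens(name, frac=0.01):
    P = base[name]
    hi = dict(base, **{name: P * (1 + frac)})
    lo = dict(base, **{name: P * (1 - frac)})
    return (hf(hi) - hf(lo)) / (2 * frac * P) * P / hf(base)

print(round(hf(base), 1))                     # -> 535.3
for name in base:
    print(name, round(rel_sens(name), 2))     # Am 1.0, Q 1.85, C -1.85, D -4.87
```

The largest magnitude belongs to D, matching the text's conclusion that D is the most sensitive parameter in the local analysis.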
For nonlinear models, the error in the sensitivity coefficients depends upon the magnitude of the
perturbation (ΔP) and the non-linearity of the model response with respect to different parameters.
Table 16.3 presents the effect of the magnitude of perturbation on the relative sensitivity coefficients.
Table 16.3 demonstrates that when the functional relationship is linear with respect to a
parameter, there is no impact of the magnitude of ΔP on S_r. The inexactness of S_r increases with
the magnitude of ΔP as the nonlinearity of the response increases.
Tables 16.2 and 16.3 (columns: P − ΔP, P + ΔP, O_{P−ΔP}, O_{P+ΔP}, and the sensitivity coefficients S and S_r for ΔP = 1%, 5%, 10%, and 15%)
... + 0.539 D̂ + constant

The coefficient of each parameter is its normalized sensitivity index. For example, if the normalized
sensitivity coefficient of a parameter is 0.268, a one standard deviation change in that model
parameter will lead to a 0.268 standard deviation change in the model prediction. As mentioned
earlier, the normalized sensitivity coefficient for an input parameter is its correlation coefficient
with the output random variable.
The difference between the local and global sensitivities for the head loss should be noted. The
local sensitivity coefficient does not require the probabilistic properties of the uncertain
parameters; it is based on the functional characteristics alone. The global sensitivity
coefficient, on the other hand, is based on both the functional and the probabilistic characteristics of the input random variables. Local sensitivity analysis indicates D as the most sensitive parameter, while global
sensitivity analysis indicates that C contributes most to the output uncertainty. To investigate which
sensitivity coefficient is the most useful, the contribution due to each component function should be
determined. This is discussed in the next section.
Uncertainty Analysis
The main objective of uncertainty analysis is to assess the statistical properties of model outputs as a function of stochastic input parameters. In water resources engineering projects, design
quantities and model outputs are functions of several parameters, not all of which can be quantified
with absolute accuracy. The task of uncertainty analysis is to determine the uncertainty features of
the model outputs as a function of uncertainties in the model itself and in the stochastic parameters
involved. It provides a formal and systematic framework to quantify the uncertainty associated with
the model outputs. Furthermore, it offers the designer useful insights regarding the contribution of
each stochastic parameter to the overall uncertainty of the model outputs. Such knowledge is essential in identifying the important parameters to which more attention should be given to improve
assessment of their values and then reduce the overall uncertainty in the model output. Quantitative
characterization of uncertainty provides an estimate of the degree of confidence that can be placed
on the analysis and findings.
As an example, water quality models are formulated to describe both observed conditions
and predict planning scenarios that may be substantially different from observed conditions.
Planning and management activities such as checking basin-wide water quality for regulatory
compliance, waste load allocation, and so forth, require the assessment of hydrologic, hydraulic,
and water quality conditions beyond the range of observed data. These inadequacies in model
parameters or inputs force water quality modelers to characterize the impacts of parameter
uncertainties quantitatively so that appropriate decisions regarding water pollution abatement
programs can be made. The most complete and ideal description of uncertainty is the pdf of the
quantity subject to uncertainty. However, in most practical problems, a pdf is very difficult, if
not impossible, to derive precisely. In most situations, the main objective of uncertainty analysis
is to evaluate the first and second moments of a model output in terms of input random
variables.
RELIABILITY AND RISK ANALYSIS
Reliability and risk analysis is a technique for identifying, characterizing, quantifying, and
evaluating the probability of a pre-identified hazard. It is widely used by private and government
agencies to support regulatory and resource allocation decisions. In most hydrologic, hydraulic,
and environmental engineering problems, empirically developed or theoretically derived mathematical models are used to evaluate a system's performance. These models involve several
uncertain parameters that are difficult to accurately quantify. An accurate reliability assessment
of such models would help the designer build more reliable systems and aid the operator in making better maintenance and scheduling decisions.
The reliability of a system can be most realistically measured in terms of probability. The
failure of a system can be considered as an event in which the demand or loading, L, on the system exceeds the capacity or resistance, R, of the system, so that the system fails to perform satisfactorily for its intended use. The objective of reliability analysis is to ensure that the probability
of the event (R < L) throughout the specified useful life is acceptably small. The risk, P_f, defined
as the probability of failure, can be expressed as (Ang and Tang 1984; Yen et al. 1986)

P_f = P(R < L)     (16.7)

where P denotes the probability function. Equation 16.7 can be rewritten in terms of the performance function Z as

P_f = P(Z < 0)     (16.8)

where

Z = R − L     (16.9)
In terms of the joint distribution of R and L, the risk is

P_f = ∫_a^b ∫_c^l p_R,L(r, l) dr dl

where p_R,L(r, l) is the joint pdf of R and L; c is the lower bound of R; and a and b are the lower and
upper bounds of L, respectively. The resistance, R, and load, L, are random variables given as

R = g(U)     L = h(V)

where U is the vector representing the input parameters of the model representing R, and V is the
vector representing the input parameters of the model representing L. In some problems, L may be a
constant or deterministic quantity.
The risk can also be written in terms of the pdf of Z alone as

P_f = ∫ from −∞ to 0 of p_Z(z) dz

where p_Z(z) is the pdf of Z. The pdf of Z is unknown, or difficult to obtain. In most cases the
exact distribution of Z may not be required, as any of several distributions can be used to make a
decision if correct information about the moments of p_Z(z) is available.
Uncertainty, Risk, and Reliability Analysis Methods
Ideally, a pdf should be obtained for a complete assessment of the uncertainty, risk, and
reliability analysis of a given system. This requires determination of the joint pdf for all the significant sources of uncertainty affecting the output of the system. However, the determination of
probability distributions for the basic variables is quite difficult and involves several assumptions. Furthermore, the multivariate combination and integration of the input variable distributions is a daunting task. The aggregation of uncertainties in the basic variables of a model into
measures of overall model output uncertainty/reliability is done in only an approximate manner.
Several methods that have been used in water resources and environmental engineering will be
discussed.
First-Order Approximation Method
The first-order approximation (FOA) method can be used to estimate the amount of uncertainty, or scatter, of a dependent variable due to uncertainty in the independent variables included
in a functional relationship. Benjamin and Cornell (1970) have described the first-order approximation technique in detail.
Consider an output random variable, Y, which is a function of n random variables. Mathematically, Y can be expressed as

Y = f(X)

where X = (X_1, X_2, ..., X_n), a vector containing n random variables. In FOA, a Taylor series
expansion of the model output is truncated after the first-order term:

Y ≈ f(X_e) + Σ_{i=1}^{n} (∂f/∂X_i)|_X_e (X_i − X_e,i)

where X_e = (X_e,1, X_e,2, ..., X_e,n), a vector representing the expansion points. In FOA applications
to water resources and environmental engineering, the expansion point is commonly the mean
value of the basic variables. Thus, the expected value and variance of Y are

E[Y] ≈ f(X̄)

Var(Y) ≈ Σ_i Σ_j (∂f/∂X_i)(∂f/∂X_j) Cov(X_i, X_j)

with the derivatives evaluated at X̄.
where σ_Y is the standard deviation of Y and X̄ = (X̄_1, X̄_2, ..., X̄_n) is the vector of mean values of the
input basic variables. If the basic variables are statistically independent, the expression for
Var(Y) becomes

Var(Y) ≈ Σ_{i=1}^{n} (∂f/∂X_i)² Var(X_i)
A functional form frequently encountered is the multiplicative form

Y = C_0 X_1^r1 X_2^r2 ... X_n^rn

where C_0 and the r_i are constants and the X_i are independent stochastic input random variables. The
first-order mean of the model output, μ̂_Y, can be written as

μ̂_Y = C_0 μ_X1^r1 μ_X2^r2 ... μ_Xn^rn     (16.23)

where μ_Xi is the mean of X_i. The first-order variance of the multiplicative form, σ̂_Y², can be
approximated as

σ̂_Y² ≈ μ̂_Y² Σ_{i=1}^{n} r_i² CV_Xi²     (16.24)

where CV_Xi is the coefficient of variation of X_i. Dividing equation 16.24 by the square of equation 16.23, the approximate coefficient of variation of Y, CV_Y, can be evaluated as

CV_Y ≈ (Σ_{i=1}^{n} r_i² CV_Xi²)^{1/2}     (16.25)
Another form of interest is the additive form, obtained when two or more power functions
are added. The general additive form is written as

Y = Σ_{i=1}^{n} C_i X_i^ri     (16.26)

with first-order mean and variance

μ̂_Y = Σ_{i=1}^{n} C_i μ_Xi^ri     (16.27)

σ̂_Y² ≈ Σ_{i=1}^{n} C_i² r_i² μ_Xi^{2ri} CV_Xi²     (16.28)

So CV_Y can be evaluated by

CV_Y = σ̂_Y / μ̂_Y     (16.29)
A third functional form is the combination of the multiplicative and additive forms. This form
is obtained when two or more multiplicative forms having a common power function are added.
The general form can be represented as

Y = C_0 X_1^r1 ... X_k^rk (C_{k+1} X_{k+1}^r(k+1) + ... + C_n X_n^rn)     (16.30)
For evaluating the mean and variance of combined forms of Y such as equation 16.30, the mean
and variance of the additive part must be determined first using equations 16.27 and
16.28. Next, equations 16.23, 16.24, and 16.25 are used to determine the mean, variance, and CV
of Y by treating the combined form as a multiplicative form, with the additive part treated as a multiplicative component with known mean and variance.
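A sketch of equations 16.23–16.25 for a multiplicative form with hypothetical constants, checked against Monte Carlo sampling with lognormal inputs of matching mean and CV:

```python
# FOA moments for a multiplicative form Y = C0 * X1**r1 * X2**r2 versus
# Monte Carlo; all constants here are hypothetical.

import math
import random

C0, r1, r2 = 2.0, 2.0, -1.0
mu = {"X1": 2.0, "X2": 3.0}
cv = {"X1": 0.10, "X2": 0.15}

mean_foa = C0 * mu["X1"] ** r1 * mu["X2"] ** r2                      # eq. 16.23
cv_foa = math.sqrt((r1 * cv["X1"]) ** 2 + (r2 * cv["X2"]) ** 2)      # eq. 16.25

# Monte Carlo check with independent lognormal inputs of matching mean and CV
def lognorm(mean, cvx):
    s2 = math.log(1.0 + cvx ** 2)
    return random.lognormvariate(math.log(mean) - 0.5 * s2, math.sqrt(s2))

random.seed(7)
ys = [C0 * lognorm(mu["X1"], cv["X1"]) ** r1 * lognorm(mu["X2"], cv["X2"]) ** r2
      for _ in range(20000)]
m = sum(ys) / len(ys)
cv_mc = math.sqrt(sum((y - m) ** 2 for y in ys) / (len(ys) - 1)) / m
print(round(cv_foa, 3), round(cv_mc, 3))
```

The sampled CV comes out slightly above the first-order value, and the sampled mean slightly above the FOA mean: the truncated higher-order Taylor terms are what FOA discards, which is the weakness discussed above.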
To estimate the reliability of a system, it is typically assumed that Z is normally distributed. Taking p_Z(z) to be a normal distribution with its parameters E[Z] and σ_Z determined by FOA,
equations 16.8 and 16.12 are used to determine the risk and reliability of a given system.
An alternative way to characterize system reliability is the reliability index, β, which is defined as the reciprocal of the coefficient of variation of Z:

β = E[Z] / σ_Z     (16.31)
The great advantage of FOA is its simplicity, requiring knowledge of only the first two statistical moments of the basic variables and simple sensitivity calculations about selected central
values. FOA is an approximate method that may suffice for many applications (Ku 1966), but the
method does have several theoretical and conceptual shortcomings (Melching 1992a; Cheng
1982). The main weakness of the FOA method is the assumption that a single linearization of
the system performance function at the central values of the basic variables is representative of
the statistical properties of system performance over the complete range of basic input variables.
The accuracy of the estimates is influenced in part by the degree of nonlinearity in the functional
relationship and the importance of the higher-order terms that are truncated in the Taylor series
expansion (Burn and McBean, 1985). In applying FOA in risk and reliability analyses, it is generally assumed that the performance function is normally distributed, which is seldom true. Any
attempt to characterize the tails of the actual distribution based on an assumption of normality is
likely to result in an inexact answer (Burn and McBean, 1985).
Example 16.4. Determine the first-order mean and standard deviation of the head loss function
given in example 16.1. Use the same mean and CV values as given in the preceding examples.

Solution: The FOA estimate for the mean, μ̂_hf, is calculated using equation 16.23 as μ̂_hf = 535.29 m.

Example 16.5. The flow, Q (m³/s), in a compound channel is given by Manning's equation, where
A_m is the model correction factor to account for model uncertainty with mean and CV
values of 1.0 and 0.15, respectively, and Y_m and Y_b represent the section factors (Y_i = A_i R_i^{2/3}) for the
main channel and overbank sections, respectively. Consider the section factors to be deterministic
(Y_m = 296.9 m^{8/3} and Y_b = 0.6 m^{8/3}), and n_m, n_b, and S to be random variables with mean values
of 0.034, 0.068, and 0.005 and CV values of 0.17, 0.38, and 0.25, respectively. Determine the
mean and standard deviation of Q.

Solution: Substituting the values of Y_m and Y_b, the expression for flow is rewritten as

Q = A_m ψ S^{0.5}

where ψ is a dummy variable representing the additive form ψ = 296.9 n_m^{−1} + 1.2 n_b^{−1}. The first-order mean and standard deviation of ψ are calculated from equations 16.27 and 16.28 as
μ̂_ψ = Σ_{i=1}^{2} C_i μ_i^ri = 296.9(0.034)^{−1} + 1.2(0.068)^{−1} = 8750

and

σ̂_ψ² = Σ_{i=1}^{2} C_i² r_i² μ_i^{2ri} CV_i² = (296.9)²(−1)²(0.034)^{2(−1)}(0.17)² + (1.2)²(−1)²(0.068)^{2(−1)}(0.38)² = 2203785

so

CV_ψ = √2203785 / 8750 = 0.17
Now, Q = A_m ψ S^{0.5} can be considered as a multiplicative form, and equations 16.23 and 16.25 can
be used to determine the overall FOA mean and CV of Q. The FOA estimate for the mean, μ̂_Q, is

μ̂_Q = 1.0(8750)(0.005)^{0.5} = 618.7
Example 16.6. For a storm sewer, the peak runoff, Q_L, is given by the rational formula as
Q_L = CiA, where C is the runoff coefficient, i is the rainfall intensity, and A is the drainage area.
The definitions and statistical characteristics of the uncertain variables are listed in table 16.4. Determine the risk.

Solution: Using equation 16.9, the performance function can be defined as Z = Q_C − Q_L, where Q_C is the sewer capacity.
To determine the first-order mean and standard deviation of Z, first the means and CVs of Q_C and
Q_L are determined using the multiplicative formulas.
Using these estimates, the mean and standard deviation of Z are determined considering it as an
additive form, using equation 16.27.
Table 16.4. Statistical characteristics of the uncertain variables

Mean     Distribution
1.100    Triangular
0.015    Gamma
3.000    Triangular
0.005    Triangular
1.000    Triangular
0.825    Triangular
4.000    Triangular
10.00    Triangular
Using equation 16.31, the reliability index, β, is 0.79, and the corresponding risk (equation 16.16,
assuming a normal distribution) is

P_f = Φ(−β) = Φ(−0.79) = 0.215

where z is the standard normal variate defined as z = (X − μ_X)/σ_X, and Φ(z) is the standard
normal cumulative distribution function.
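The risk computation can be sketched directly from the reliability index, assuming normality of Z as in the text:

```python
# Risk corresponding to a reliability index beta under the normality
# assumption: P_f = Phi(-beta), with Phi the standard normal CDF.

import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta = 0.79                    # reliability index from example 16.6
risk = std_normal_cdf(-beta)   # probability that Z < 0
print(round(risk, 3))          # -> 0.215
```

Expressing the normal CDF through `math.erf` avoids any table lookup; the same two lines serve example 16.7 with its different β.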
Example 16.7. For the preceding example, determine the risk using the following definition of
the performance function, Z = Q_C/Q_L − 1.

Solution: Substituting the expressions for Q_C and Q_L, Z is expressed in terms of SF = Q_C/Q_L,
also known as the safety factor. The first-order mean and standard deviation of SF
are determined using the formulas corresponding to multiplicative forms.
The results of the above examples (16.6 and 16.7) clearly show that the risk estimates differ for the two mechanically equivalent formulations under the same underlying assumption of a normal distribution for the performance function. This indicates that the probability of failure depends upon the formulation of the performance function. This is known as the lack of invariance problem.
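The lack of invariance is easy to reproduce numerically. The sketch below uses hypothetical FOA moments for Q_c and Q_L (illustrative values, not those of examples 16.6 and 16.7) and computes the reliability index for both formulations; the two β values differ even though the formulations are mechanically equivalent.

```python
import math

# Hypothetical FOA moments for capacity Qc and load QL (illustrative)
mu_c, cv_c = 40.0, 0.20
mu_l, cv_l = 25.0, 0.30
s_c, s_l = mu_c * cv_c, mu_l * cv_l

# Formulation 1: Z = Qc - QL (additive form)
beta1 = (mu_c - mu_l) / math.sqrt(s_c ** 2 + s_l ** 2)

# Formulation 2: Z = Qc/QL - 1 (safety-factor form), FOA for the ratio
mu_ratio = mu_c / mu_l
cv_ratio = math.sqrt(cv_c ** 2 + cv_l ** 2)
beta2 = (mu_ratio - 1.0) / (mu_ratio * cv_ratio)

print(round(beta1, 3), round(beta2, 3))  # the two indices differ
```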
Monte Carlo Simulation
In Monte Carlo simulation (MCS), probability distributions are assumed for the uncertain
input variables for the system to be studied. Random values of each of the uncertain variables are
generated according to their respective probability distributions and the model describing the
system is executed. By repeating the random generation of variable values and model execution
steps many times, the statistics and an empirical probability distribution of the model output can
be determined. A schematic of MCS is illustrated in figure 16.2. The accuracy of the statistics and
probability distribution obtained from MCS is a function of the number of simulations performed
and the adequacy of the assumed parameter distributions. It requires judgement on the part of the
modeler to create theoretical input sample distributions that are representative of the populations
and to estimate the number of trials needed to generate the input and output density functions.
There is no strictly defined answer to either of these questions. Further, if the input parameters
are correlated, a multivariate simulation of the input parameters must be used.
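A minimal MCS loop, sketched below for a hypothetical two-input power-law model (the model and its input distributions are illustrative assumptions, not from the text), shows the generate-execute-summarize cycle just described.

```python
import random
import statistics

random.seed(1)

def model(a, b):
    """Hypothetical model: a power-law combination of two inputs."""
    return a * b ** 0.5

N = 20000
outputs = []
for _ in range(N):
    # step 1: draw each uncertain input from its assumed distribution
    a = random.normalvariate(2.0, 0.2)    # normal(mean, sd)
    b = random.lognormvariate(0.0, 0.25)  # lognormal(log-mean, log-sd)
    # step 2: execute the model for this realization
    outputs.append(model(a, b))

# step 3: summarize the output sample
mean_out = statistics.fmean(outputs)
sd_out = statistics.stdev(outputs)
p_exceed = sum(y > 3.0 for y in outputs) / N  # empirical P(output > 3)
print(round(mean_out, 3), round(sd_out, 3), round(p_exceed, 4))
```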
A key problem in applying the MCS method is estimating the necessary sample size. One
empirical test consists of iterating the sample program with increasingly greater sample sizes and
estimating the convergence rate of the sample mean value towards the population mean (Burges
and Lettenmaier, 1975). The error in the estimation of the population mean is inversely proportional to the square root of the number of trials. To improve the estimate by a factor of two, the sample size must increase by a factor of four. If the sample size is n, the standard deviation of the mean is 1/√n times the standard deviation of the population. This indicates that the sample size
must be large (Siddal 1983). As the sample size increases, the precision of the empirical
percentile estimates of a model output improves. However, Martz (1983) noted that the rate of
convergence to the true distribution decreases as the size of sample increases. The method often
entails sample sizes that are in the range of 5,000 to 20,000 members. Generally, the number of
required samples increases with the variances and the coefficient of skewness of the input distributions (Burges and Lettenmaier 1975).
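The 1/√n behavior of the error in the mean can be demonstrated directly; in the sketch below (illustrative, using a standard normal population), quadrupling the sample size roughly halves the empirical standard error.

```python
import random
import statistics

random.seed(7)

def se_of_mean(n, reps=2000):
    """Empirical standard deviation of the sample mean at sample size n,
    for a standard normal population (illustrative)."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

se_100 = se_of_mean(100)   # theory: 1/sqrt(100) = 0.100
se_400 = se_of_mean(400)   # theory: 1/sqrt(400) = 0.050
print(round(se_100, 3), round(se_400, 3))
```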
The fraction, F_i, of the total variance in model output attributable to the ith parameter based on an MCS can be estimated by computing
where ρ_i is the correlation coefficient between the ith parameter and the output as defined in equation 16.6.
Another simulation technique similar to MCS is the Latin hypercube sampling (LHS), in
which a stratified sampling approach is used. In LHS the probability distribution of each basic variable is subdivided into m non-overlapping intervals, each with equal probability (1/m).
Random values of the basic variables are simulated such that each range is sampled only once. The
order of the selection of the ranges is randomized and the model is executed m times with the ran-
dom combination of basic variables from each range for each basic variable. The output statistics
and distributions may then be approximated from the sample of m output values. McKay et al. (1979) have shown that the stratified sampling procedure of LHS converges more quickly than the equidistribution sampling employed in MCS. The main shortcoming of this stratification scheme is that it is one-dimensional and does not provide good uniformity properties on a k-dimensional unit hypercube (Diwekar and Kalagnanam 1997). Except for reducing computational effort to some extent, LHS has the same problems that are associated with MCS.
Example 16.8. Using MCS, determine the mean and variance of head loss for example 16.3.
Also determine the contribution due to each input parameter.
Solution: Using MCS, means and standard deviations of head loss are determined for different
numbers of simulations as shown in figure 16.3. The MCS estimates for the mean and standard deviation of head loss with 20,000 simulations were obtained as 595.5 m and 270.2 m, respectively.
Equation 16.32 and the regression of example 16.3 show that λ, Q, C, and D contributed 7.9, 20.8, 39.3, and 32.0 percent, respectively, of the overall variance.
Fig. 16.3. Mean and standard deviation of head loss as a function of the number of simulations (0 to 20,000).
where E[·] is the expectation operator, and μ_Yi is the mean of the ith power function.
Equation 16.35 shows that the output uncertainty of a multiplicative model is governed by the most uncertain component function. Using the additive form (equation 16.26), the mean of Z, μ_Z, is given by
where c and r are constants. The FOA estimate for the mean (Benjamin and Cornell 1970), μ̂_Y, is
These estimates for μ_Y and σ_Y contain errors. The exact value of any moment can be computed as

Exact value = FOA estimate / [1 − ε(·)]
where ε(·) is the relative error in a moment estimated using FOA. Analytical relationships for ε(·) in FOA estimates for the means and the variances of component functions were developed (Tyagi 2000) for generic power and exponential functions using five common distributions. These analytical expressions can be used as a guide for judging the suitability of the FOA by determining the relative errors in the most sensitive parameters. Further, when the relative error is more than the acceptable error, these analytical relationships enable one to correct FOA estimates for the means and variances of model components to their true values. Using these corrected values of the means and variances of the model components, one can determine the exact values of the mean and variance of an overall model output. Tables 16.5 and 16.6 present the developed expressions for ε(μ̂_Y) and ε(σ̂_Y²) for a generic power function (Y = cX^r). Similarly, tables 16.7 and 16.8 present the developed expressions for ε(μ̂_Y) and ε(σ̂_Y²) for a generic exponential function (Y = be^cX).
To further simplify the correction procedure, these analytical relationships have been presented graphically by Tyagi (2000). The relative error plots show where FOA estimates are
acceptable and where they are unacceptable and need to be corrected. In specific situations, a given
Table 16.5. Generalized relative error in FOA predicted mean of a power function (Y = cX^r)

Uniform: ε(μ̂_Y) = 1 − 2√3(r + 1)CV_X / [(1 + √3CV_X)^(r+1) − (1 − √3CV_X)^(r+1)]

Symmetrical triangular: ε(μ̂_Y) = 1 − 6(r + 1)(r + 2)CV_X² / [(1 + √6CV_X)^(r+2) + (1 − √6CV_X)^(r+2) − 2]

Lognormal: ε(μ̂_Y) = 1 − (1 + CV_X²)^(−r(r−1)/2)

Gamma: ε(μ̂_Y) = 1 − Γ(CV_X⁻²) / [CV_X^2r Γ(CV_X⁻² + r)]

Exponential: ε(μ̂_Y) = 1 − 1/Γ(r + 1)

Note:
(1) To avoid the singularity at r = −1, r should be taken as −0.9999.
(2) To avoid the singularity at r = −2, r should be taken as −1.9999.
Table 16.6. Generalized relative error in FOA predicted variance of a power function (Y = cX^r)

Uniform: ε(σ̂_Y²) = 1 − 12(2r + 1)r²(r + 1)²CV_X⁴ / {2√3CV_X(r + 1)²[(1 + √3CV_X)^(2r+1) − (1 − √3CV_X)^(2r+1)] − (2r + 1)[(1 + √3CV_X)^(r+1) − (1 − √3CV_X)^(r+1)]²}

Symmetrical triangular: ε(σ̂_Y²) = 1 − 36(2r + 1)r²(r + 1)²(r + 2)²CV_X⁶ / {3(r + 1)(r + 2)²CV_X²[(1 + √6CV_X)^(2r+2) + (1 − √6CV_X)^(2r+2) − 2] − (2r + 1)[(1 + √6CV_X)^(r+2) + (1 − √6CV_X)^(r+2) − 2]²}

Lognormal: ε(σ̂_Y²) = 1 − r²CV_X² / [(1 + CV_X²)^(r(2r−1)) − (1 + CV_X²)^(r(r−1))]

Gamma: ε(σ̂_Y²) = 1 − r²CV_X²[Γ(CV_X⁻²)]² / {CV_X^4r [Γ(CV_X⁻²)Γ(CV_X⁻² + 2r) − Γ(CV_X⁻² + r)²]}

Exponential: ε(σ̂_Y²) = 1 − r² / [Γ(2r + 1) − Γ²(r + 1)]

Note:
(1) To avoid the singularity at r = −1, r should be taken as −0.9999.
(2) To avoid the singularity at r = −2, r should be taken as −1.9999.
Table 16.7. Generalized relative error in FOA predicted mean of an exponential function (Y = be^cX)

Uniform: ε(μ̂_Y) = 1 − √3cμ_XCV_X / sinh(√3cμ_XCV_X)

Symmetrical triangular: ε(μ̂_Y) = 1 − 3c²μ_X²CV_X² / [cosh(√6cμ_XCV_X) − 1]

Normal: ε(μ̂_Y) = 1 − exp(−c²μ_X²CV_X²/2)

Gamma: ε(μ̂_Y) = 1 − (1 − cμ_XCV_X²)^(1/CV_X²) exp(cμ_X)

Exponential: ε(μ̂_Y) = 1 − (1 − cμ_X)exp(cμ_X)
Table 16.8. Generalized relative error in FOA predicted variance of an exponential function (Y = be^cX)

Uniform: ε(σ̂_Y²) = 1 − 4u⁴ / {3[(u − 1)e^2u − (u + 1)e^−2u + 2]}, where u = √3cμ_XCV_X

Symmetrical triangular: ε(σ̂_Y²) = 1 − v⁶ / {6[v² sinh²v − 4(cosh v − 1)²]}, where v = √6cμ_XCV_X

Normal: ε(σ̂_Y²) = 1 − c²σ_X² / {exp(c²σ_X²)[exp(c²σ_X²) − 1]}

Gamma: ε(σ̂_Y²) = 1 − c²μ_X²CV_X² exp(2cμ_X) / [(1 − 2cμ_XCV_X²)^(−1/CV_X²) − (1 − cμ_XCV_X²)^(−2/CV_X²)]

Exponential: ε(σ̂_Y²) = 1 − c²μ_X² exp(2cμ_X) / [(1 − 2cμ_X)⁻¹ − (1 − cμ_X)⁻²]
function may be very nonlinear (represented either by a very large or a very small exponent of a power function). These situations can be identified and dealt with by using the relative error plots.
In the absence of knowledge of the complete pdf, but knowing the mean and variance exactly, certain exact statements on the probability of an output random variable lying within given bounds can be made using the Chebyshev inequality (equation 3.77), which states that

P(|X − μ_X| < tσ_X) ≥ 1 − 1/t²

where t is a constant. In example 16.7, the safety factor is defined as ψ = R/L = Q_c/Q_L. Then the greatest lower bound of the system reliability [here, P(ψ ≥ 1)] is given (Huang 1986) as
Considering example 16.7, the greatest lower bound of the probability of safety is 0.347. This result may be used as a reference value.
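A sketch of the Chebyshev bound on reliability. The safety-factor moments below are hypothetical values chosen only to reproduce the order of magnitude of the 0.347 bound quoted above; they are not given in the text.

```python
# Chebyshev lower bound on reliability P(psi >= 1); the moments of the
# safety factor psi below are assumed, illustrative values.
mu_psi = 1.60    # assumed mean of the safety factor
sd_psi = 0.485   # assumed standard deviation

t = (mu_psi - 1.0) / sd_psi        # distance from psi = 1 in sd units
lower_bound = 1.0 - 1.0 / t ** 2   # from P(|psi - mu| < t*sd) >= 1 - 1/t^2
print(round(lower_bound, 3))
```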
Example 16.9. Estimate the mean and standard deviation of Q given in example 16.5 assuming a normal distribution for λ, a uniform distribution for n_c and n_b, and a lognormal distribution for S.
Solution: To determine the exact values of the mean and standard deviation of Q, FOA estimates for the means and variances of the component power functions are first obtained using equations 16.38 and 16.39. Then, using the relative error functions corresponding to the given distributions from tables 16.5 and 16.6, the FOA estimates are corrected. The calculation procedure is presented in table 16.9.
Using equation 16.35 and the corrected means of the individual power functions from table 16.9, the exact mean of the additive form, μ_φ, is 9019.9 m³/s. Similarly, using equation 16.36 and the corrected variances of the component power functions, σ_φ is 1586.2 m³/s. The corresponding CV_φ is 0.176. Now, treating Q as a multiplicative form with λ, φ, and S^0.5 as its components with known means and CV values, μ_Q is 632.99 m³/s from equation 16.31, and CV_Q is 0.265 from equation 16.33. Thus, σ_Q is 167.73 m³/s. Note that, in this example, n_b is the most uncertain parameter, but its contribution to the uncertainty of Q is negligible because the additive form is governed mainly by n_c, which has a very large coefficient in comparison to that of n_b. The values of μ_Q and σ_Q corresponding to 20,000 MCS simulations are 632.72 m³/s and 167.27 m³/s, respectively.
Figure 16.4 compares the mean and standard deviation of flood levee capacity obtained by the FOA, MCS (with 20,000 runs), and corrected FOA at different levels of the coefficient of variation of n_b and with different distribution types for n_b (lognormal, triangular, and uniform). The mean flood levee capacity predicted by the FOA is constant regardless of the level of uncertainty in n_b. The standard deviation of Q increases with the CV of n_b, but at a smaller rate. Further, figure 16.4(b) shows that the distribution type is immaterial at smaller CV values of n_b, but it may have a significant impact at higher CV values of n_b.
Table 16.9. Computation of the mean and variance of flood levee capacity using the corrected FOA method (columns: power function; FOA estimate, relative error, and corrected estimate of the mean; FOA estimate, relative error, and corrected estimate of the variance; CV).
Fig. 16.4. Flood levee capacity (a) mean and (b) standard deviation.
If sufficiently large numbers of simulations are not used, the MCS results may be erroneous.
In the present example, MCS does not give unique results even after an extensive computation
effort. Theoretically, an infinite number of simulations are required in MCS to produce results
identical to those obtained using the corrected FOA method. Finally, compared to MCS, the corrected FOA method is simple, efficient, and accurate for determining the first two moments of an
output variable when dealing with analytical models. Further, the difficulties of implementing
MCS, the possibility of unseemly generated parameter values, and the ever-present concern
about the number of simulations are avoided.
Second-Order Approximation Method
In the second-order approximation (SOA) method, a Taylor series expansion of a model is truncated after the second-order term. For a model represented by equation 16.17, the second-order Taylor series expansion of Y is given as
In SOA, the expansion point is commonly the mean value of the basic variables. Considering that all input variables are statistically independent and taking the expectation of equation 16.45, the expected value of Y is given as

E[Y] ≈ g(μ_X) + (1/2) Σᵢ₌₁ⁿ (∂²g/∂Xᵢ²) Var(Xᵢ)    (16.46)

Similarly, the second-order approximation of the variance is

Var(Y) ≈ Σᵢ₌₁ⁿ (∂g/∂Xᵢ)² Var(Xᵢ) − (1/4) Σᵢ₌₁ⁿ (∂²g/∂Xᵢ²)² Var²(Xᵢ) + Σᵢ₌₁ⁿ (∂g/∂Xᵢ)(∂²g/∂Xᵢ²) E[(Xᵢ − μ_Xᵢ)³] + (1/4) Σᵢ₌₁ⁿ (∂²g/∂Xᵢ²)² E[(Xᵢ − μ_Xᵢ)⁴]    (16.47)

with all derivatives evaluated at the means.
Equation 16.47 indicates that determination of the second-order variance is not only complicated
but also requires the third- and fourth-order moments of Xi's, which are generally not available.
This is the reason that SOA has not been used for variance evaluation in the literature.
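The second-order mean correction, by contrast, is straightforward to illustrate. For the quadratic g(X) = X², the SOA mean recovers the exact E[X²] = μ² + σ², while FOA misses the σ² term (the numbers below are illustrative).

```python
# SOA vs FOA mean for Y = g(X) = X**2 with X having mean mu and sd sigma.
# For a quadratic, SOA reproduces the exact mean E[X^2] = mu^2 + sigma^2.
mu, sigma = 2.0, 0.5

def g(x):
    return x * x

h = 1.0e-4
d2g = (g(mu + h) - 2.0 * g(mu) + g(mu - h)) / h ** 2  # numerical 2nd derivative

foa_mean = g(mu)                           # first-order: ignores curvature
soa_mean = g(mu) + 0.5 * d2g * sigma ** 2  # second-order correction term
print(foa_mean, round(soa_mean, 4))
```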
FORM is applicable when all the input variables are normally distributed. Ang and Tang (1984) have presented an excellent description of FORM for both correlated and uncorrelated input random variables.
Assume the Xᵢ's are normally distributed and statistically independent, and let X* be the failure point. From equation 16.18 the first-order Taylor series expansion of Z is

Z ≈ Z(X*) + Σᵢ₌₁ⁿ (Xᵢ − xᵢ*)(∂Z/∂Xᵢ)*    (16.49)
Substituting equation 16.49 into 16.48 and taking its first- and second-order expectations, the mean and standard deviation of Z are
where Xᵢ′ is a standard normal variable with zero mean and unit standard deviation defined as Xᵢ′ = (Xᵢ − μ_Xᵢ)/σ_Xᵢ. Substituting the values of the mean and standard deviation of Z into equation 16.22, β is expressed as
Now, equations 16.8 and 16.12 can be used to determine P_f and α. For models having a linear failure surface and all the basic variables normally distributed, the estimates of P_f and α are exact. The most probable failure point in the reduced space is xᵢ′* = −αᵢ*β (Ang and Tang 1984), which can be represented in the original coordinate system as
where αᵢ* is given as
For the given set of μ_Xᵢ and σ_Xᵢ of the basic variables X, the location of the failure point X* and the reliability index β can be found by solving the n + 1 simultaneous equations of 16.49 and 16.53, with α* defined by equation 16.54. However, since X* is initially unknown, the solution is achieved successively through iterations with improving values of α* and β.
For most modeling problems, it is very unlikely that all basic input variables will be normally distributed. Rackwitz (1976) proposed a transformation technique in which the values of the cdf and pdf of the non-normal distributions are the same as those of the equivalent normal distributions at the failure point X*. Consider an input random variable Xᵢ for which the pdf and cdf are given as p_Xᵢ(xᵢ) and P_Xᵢ(xᵢ), respectively. Equating the cumulative probabilities at the failure point,
where μ_Xᵢ^N and σ_Xᵢ^N are the mean value and standard deviation of the equivalent normal distribution for Xᵢ; P_Xᵢ(xᵢ*) is the original cdf of Xᵢ; and Φ(·) is the cdf of the standard normal distribution. Using equation 16.55, the mean of the equivalent normal distribution can be written as
where φ(·) is the pdf of the standard normal distribution. Based on equation 16.57, the standard deviation of the equivalent normal distribution can be written as
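The equivalent-normal (normal tail) transformation of equations 16.56 and 16.58 can be sketched as a small function; the exponential distribution used to exercise it below is an illustrative assumption.

```python
import math
from statistics import NormalDist

def equivalent_normal(pdf, cdf, x_star):
    """Rackwitz normal-tail approximation: mean and standard deviation of
    the normal distribution whose pdf and cdf match those of X at x*."""
    z = NormalDist().inv_cdf(cdf(x_star))      # Phi^-1[P_X(x*)]
    sd_n = NormalDist().pdf(z) / pdf(x_star)   # phi(z)/p_X(x*)
    mu_n = x_star - z * sd_n
    return mu_n, sd_n

# Illustrative check with an exponential variable, mean 2, at x* = 2
lam = 0.5
mu_n, sd_n = equivalent_normal(lambda x: lam * math.exp(-lam * x),
                               lambda x: 1.0 - math.exp(-lam * x),
                               2.0)
print(round(mu_n, 3), round(sd_n, 3))
```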
The key to FORM is the determination of the failure point for the Taylor series expansion. Shinozuka (1983) has shown that for FORM the reliability index, β, is the shortest distance in the standardized space between the system mean state and the failure surface. Thus, if the failure point is determined correctly, it represents the most likely combination of input variable values that produce the critical target level. The determination of β requires application of a constrained nonlinear optimization such as the generalized reduced-gradient algorithm used by Cheng (1982), a Lagrange multiplier approach used by Shinozuka (1983), or an iterative optimization method suggested by Rackwitz (1976).
FORM is quite accurate because it is able to overcome model non-linearity problems, and
no additional assumption about the distribution type of the performance function is required. It
is still an approximation method because the performance function is approximated by a linear
function at the design point, and accuracy problems may arise when the performance function
is strongly nonlinear (Cawlfield and Wu 1993; Zhao and Ono 1999). Another disadvantage of
FORM is that determination of the linearization point is generally not easy, depending
upon the nature and complexity of the system for which the reliability, risk, or uncertainty
analysis is being studied (Melching and Anmangandla 1992). Further, the magnitude of
acceptable convergence may affect the accuracy of the reliability estimates. In some cases, the
magnitude of the convergence error cannot be reduced beyond a certain level. For reliability
analysis with correlated variables, Ang and Tang (1984) and Haldar and Mahadevan (2000)
should be consulted.
Example 16.10. Solve example 16.6 using FORM.
Solution: To explain the working procedure, the following steps are followed:
Step 1: Assume a failure point xᵢ*. Generally, the initial failure point is assumed to be the vector of means, which is (1.1, 0.015, 3.0, 0.005, 1.0, 0.825, 4.0, 10.0) for the given problem (table 16.4).
Step 2: Transform the non-normal distributions to the equivalent normal distributions so that the
values of the pdf and cdf of the non-normal distributions are equal to the equivalent normal distributions at the failure point.
Table 16.4 shows that several variables have a triangular distribution. The pdf, p_Xᵢ(xᵢ*), for any triangular distribution is

p_Xᵢ(xᵢ*) = 2(xᵢ* − a)/[(b − a)(c − a)]   when a ≤ xᵢ* ≤ c
p_Xᵢ(xᵢ*) = 2(b − xᵢ*)/[(b − a)(b − c)]   when c ≤ xᵢ* ≤ b    (16.59)

where a, b, and c are the minimum, maximum, and mode values of Xᵢ. For a symmetrical triangular distribution, these parameters can be obtained from (Tyagi 2000)

a = μ_Xᵢ(1 − √6 CV_Xᵢ)    (16.61a)
b = μ_Xᵢ(1 + √6 CV_Xᵢ)    (16.61b)

The corresponding cdf is

P_Xᵢ(xᵢ*) = (xᵢ* − a)²/[(b − a)(c − a)]   for xᵢ* ≤ c
For λ_c, substituting the values of the mean and CV into equation 16.61, the parameters of the triangular distribution are a = 0.86 and b = 1.34. Using equations 16.59 and 16.61, the corresponding values of the pdf and cdf at λ_c* = 1.1 are p_λc(λ_c*) = 4.167 and P_λc(λ_c*) = 0.5. Using the normal distribution results in Φ⁻¹[P_λc(λ_c*)] = Φ⁻¹[0.5] = 0 and φ{Φ⁻¹[P_λc(λ_c*)]} = φ(0) = 0.399. Now, using equations 16.56 and 16.58, the parameters of the equivalent normal distribution are calculated as σ^N = 0.096 and μ^N = 1.1. Similarly, the parameters of the equivalent normal distributions corresponding to the other input variables having triangular distributions are obtained.
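The λ_c numbers above can be verified directly (a = 0.86, b = 1.34, with the mode c at the mean and failure point x* = 1.1):

```python
from statistics import NormalDist

# Symmetrical triangular distribution for lambda_c: a = 0.86, b = 1.34,
# mode c equal to the mean 1.10; failure point x* = 1.10.
a, b, c, x = 0.86, 1.34, 1.10, 1.10

pdf_x = 2.0 * (b - x) / ((b - a) * (b - c))   # equals 2/(b - a) at the mode
cdf_x = (x - a) ** 2 / ((b - a) * (c - a))    # equals 0.5 at the mode

z = NormalDist().inv_cdf(cdf_x)               # 0, since cdf_x = 0.5
sd_n = NormalDist().pdf(z) / pdf_x            # phi(0)/4.167 ~ 0.096
mu_n = x - z * sd_n                           # 1.10
print(round(pdf_x, 3), round(cdf_x, 3), round(sd_n, 3), round(mu_n, 3))
```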
For a parameter Xᵢ having a gamma distribution, the pdf is given by

p_Xᵢ(xᵢ) = λ(λxᵢ)^(α−1) e^(−λxᵢ)/Γ(α)

where α and λ are the distribution parameters, obtained as α = 1/CV_Xᵢ² and λ = α/μ_Xᵢ. The corresponding pdf and cdf at n* = 0.015 are p_n(n*) = 105.83 and P_n(n*) = 0.53. The cdf and pdf for an equivalent normal distribution are Φ⁻¹[P_n(n*)] = Φ⁻¹[0.53] = 0.083 and φ{Φ⁻¹[P_n(n*)]} = φ(0.083) = 0.398. Now, using equations 16.56 and 16.58, the parameters of the equivalent normal distribution are σ_n^N = 0.004 and μ_n^N = 0.015. This calculation has to be repeated for each set of failure points. A spreadsheet can be used to calculate the parameters of the equivalent normal distributions as given in table 16.10.
Step 3: Corresponding to each parameter, the reduced standard normal variate xᵢ′* = (xᵢ* − μ_Xᵢ^N)/σ_Xᵢ^N is obtained at the failure point.
Step 4: Values of (∂Z/∂Xᵢ′)* and αᵢ* are evaluated at the failure point X*.
Step 5: Using equation 16.53, the new failure point is represented by Xᵢ* = μ_Xᵢ − αᵢ*βσ_Xᵢ.
Step 6: Substituting the Xᵢ*'s into equation 16.49, set Z(x₁*, x₂*, ..., xₙ*) = 0 and solve for β.
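The iteration of steps 1-6 can be sketched for the simplest case, a linear performance function Z = R − L with independent normal variables, where β has the closed form (μ_R − μ_L)/√(σ_R² + σ_L²). The means and standard deviations below are illustrative; for a nonlinear Z, the line solving for β would become a numerical root solve.

```python
import math

# FORM iteration sketch (steps 1-6) for a linear performance function
# Z = R - L with independent normals R ~ N(40, 8) and L ~ N(25, 7.5).
mu = [40.0, 25.0]   # illustrative means
sd = [8.0, 7.5]     # illustrative standard deviations

def Z(x):
    return x[0] - x[1]

def grad(x):            # dZ/dX_i (constant for this linear Z)
    return [1.0, -1.0]

x_star = mu[:]          # step 1: start the failure point at the means
beta = 0.0
for _ in range(10):
    g = [gi * si for gi, si in zip(grad(x_star), sd)]  # reduced-space gradient
    norm = math.sqrt(sum(gi * gi for gi in g))
    alpha = [gi / norm for gi in g]                    # direction cosines
    # Steps 5-6: with x_i* = mu_i - alpha_i*beta*sd_i, Z(x*) = 0 gives beta;
    # for a nonlinear Z this line becomes a numerical root solve.
    beta = Z(mu) / norm
    x_star = [m - a * beta * s for m, a, s in zip(mu, alpha, sd)]

beta_exact = (mu[0] - mu[1]) / math.sqrt(sd[0] ** 2 + sd[1] ** 2)
print(round(beta, 4), round(beta_exact, 4))
```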
Tables 16.10 and 16.11 list, for each parameter, the old x*, μ_X^N, σ_X^N, the reduced variate x′*, (∂z/∂x′)*, αᵢ*, and the new failure point xᵢ* = μ_Xᵢ − αᵢ*βσ_Xᵢ; at convergence, z(x*) = −1.54E−04.
The calculations of steps 3 to 6 can be done using a Microsoft Excel spreadsheet's Goal Seek tool to solve for β. The equation for β need not be formulated explicitly; one can enter the equation for the performance function based on the new failure point x′*. The calculation is presented in table 16.11.
It should be mentioned that column (2) of tables 16.10 and 16.11 are the same. The new x* values from column (7) of table 16.11 become the values in the 2nd column of both tables, and columns (12) and (11) of table 16.10 become columns (3) and (4) of table 16.11. Steps 3 to 6 are then repeated until β converges. The iterations can be automated using a macro on a spreadsheet. The final result is given in table 16.11.
Table 16.12. Generic expectation functions for some commonly used probability density functions (power function, Y = cX^r)

Uniform: E[X^r] = μ_X^r [(1 + √3CV_X)^(r+1) − (1 − √3CV_X)^(r+1)] / [2√3(r + 1)CV_X]

Symmetrical triangular: E[X^r] = μ_X^r [(1 + √6CV_X)^(r+2) + (1 − √6CV_X)^(r+2) − 2] / [6(r + 1)(r + 2)CV_X²]

Unsymmetrical triangularᵃ: E[X^r] = 2[(b − c)a^(r+2) + (c − a)b^(r+2) + (a − b)c^(r+2)] / [(r + 1)(r + 2)(b − a)(b − c)(c − a)]

Lognormal: E[X^r] = μ_X^r (1 + CV_X²)^(r(r−1)/2)

Gamma: E[X^r] = μ_X^r CV_X^2r Γ(CV_X⁻² + r) / Γ(CV_X⁻²)

Exponential: E[X^r] = μ_X^r Γ(r + 1)

Normal: E[X^r] = μ_X^r Σₙ₌₀^⌊r/2⌋ (r choose 2n)(2n − 1)!! CV_X^2n for integer r, where (2n − 1)!! = (2n)!/(2ⁿn!)

ᵃThe parameters a, b, and c are given by equations 16.59a, b.
Table 16.13. Generic expectation functions for some commonly used probability density functions (exponential function, Y = be^cX)

Uniform: E[e^rcX] = [exp(rcμ_X(1 + √3CV_X)) − exp(rcμ_X(1 − √3CV_X))] / (2√3 rcμ_XCV_X)

Symmetrical triangular: E[e^rcX] = [exp(rcμ_X(1 + √6CV_X)) + exp(rcμ_X(1 − √6CV_X)) − 2exp(rcμ_X)] / (6r²c²μ_X²CV_X²)

Unsymmetrical triangularᵃ: E[e^rcX] = 2[(b − c)exp(rca) + (c − a)exp(rcb) + (a − b)exp(rcc)] / [r²c²(b − a)(b − c)(c − a)]

Normal: E[e^rcX] = exp(rcμ_X + r²c²μ_X²CV_X²/2)

Gamma: E[e^rcX] = (1 − rcμ_XCV_X²)^(−1/CV_X²)

Exponential: E[e^rcX] = (1 − rcμ_X)⁻¹

ᵃThe parameters a, b, and c are given by equations 6.6a, b in chapter 6.
The kth-order moment about the origin of a function defined by Y = g(X) can be obtained as

μ_k′ = E[Y^k] = E[g(X)^k] = ∫₋∞^∞ [g(x)]^k p_X(x) dx    (16.64)

where p_X(x) is the probability density function of X. Substituting the function Y = g(X) and the density function p_X(x) and solving the integral for a generic value of k, the generic expectation function for a given function with given distributional characteristics of its random input variable can be obtained. Generic expectation functions for a power function and an exponential function are presented in tables 16.12 and 16.13.
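The table entries can be spot-checked against equation 16.64 by numerical integration. The sketch below checks the gamma-distribution expectation E[X^r] = μ^r CV^2r Γ(CV⁻² + r)/Γ(CV⁻²) for illustrative values of μ, CV, and r.

```python
import math

# Spot-check of the gamma entry of table 16.12
mu, cv, r = 2.0, 0.5, 1.5
k = 1.0 / cv ** 2          # gamma shape parameter
theta = mu / k             # gamma scale parameter (theta = mu * CV^2)

formula = mu ** r * cv ** (2 * r) * math.gamma(k + r) / math.gamma(k)

def pdf(x):
    return x ** (k - 1.0) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

# Simple rectangle-rule integration of x^r * pdf(x) over a wide range
n, hi = 200000, 40.0
h = hi / n
numeric = 0.0
for i in range(1, n + 1):
    x = i * h
    numeric += x ** r * pdf(x) * h

print(round(formula, 4), round(numeric, 4))
```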
Example 16.11. To demonstrate the use of the developed generic expectation functions for uncertainty and risk analysis, example 16.6 is considered.
Solution: To explain the working process using generic expectation functions, the solution is presented in the following steps:
Step 1: Define a performance function for the given problem. As presented in example 16.6, the
performance function is given as
where Q_c = (0.463/n)λ_c D^(8/3) S^(1/2) and Q_L = λ_L CiA.
Step 2: From table 16.12, select the generic expectation functions corresponding to a power function and the distribution type of each input variable to calculate the moments of the component functions of Q_c and Q_L. Substituting the given statistical data (table 16.4) into the selected generic expectation functions, the first-, second-, third-, and fourth-order moments about the origin are determined for all the component functions of Q_c and Q_L as presented in table 16.14.
Table 16.14. Calculation of expectations for storm sewer design (example 16.10)
Step 3: Using the computed expectations of the various component power functions, different orders of moments of Q_L and Q_c about the origin are calculated. Using these moments of Q_L and Q_c about the origin, various orders of central moments of Q_L and Q_c are determined.
Step 4: Determine the moments of the performance function, Z. Taking the expectations of the first, second, third, and fourth powers of Z, equations for the first four moments of Z about the origin are obtained. Substituting the moments of Q_L and Q_c about the origin into these equations, the moments of Z about the origin are obtained. Using these moments, the central moments are also calculated as presented in table 16.15.
Step 5: For ease of identification of the distributions of Q_L, Q_c, and Z by intuition, the distributional characteristics (mean, CV, skewness, and kurtosis) of Q_L, Q_c, and Z are determined using the above calculated moments, as presented in table 16.16 along with those determined using MCS with 20,000 simulations.
Step 6: Using knowledge of the distributional characteristics and higher-order moments of Z, identify its distribution and estimate the risk or reliability of the given system.
Table 16.16. Distributional characteristics of Qc, QL, and Z (example 16.10)
(Columns: for each output variable, the mean μ_Z, CV_Z, skewness γ_Z, and kurtosis κ_Z obtained from MCS and from the computed moments.)
For the present problem, exact moments and other distributional characteristics of Z are
available. Using this information, several suitable probability distributions can be selected for Z,
and risk corresponding to each of these assumed distributions can be calculated. For estimating
the range, risk corresponding to the normal and uniform distributions can be estimated. The risk
obtained assuming these two distributions may be regarded as extremes, because in reality the Z
distribution of most cases probably falls between the normal and uniform distributions (Yen et al.
1986). Using the extremes and risk calculated assuming other distributions, an appropriate decision can be made about the system risk.
Examining the distributional characteristics of Z, the normal distribution may be a good choice as Z has small skewness and kurtosis. The CV of Z is quite high, indicating that negative values of Z will occur when Q_L is more than Q_c. Using equation 16.16, the risk corresponding to a normal distribution for Z is
where y = ln(Z − ε), and ε is a location parameter. The relationships between ε, y, and Z are given as
To solve, substitute γ_z = 0.851 (table 16.16) into equation 16.69 and find [exp(σ_y²) − 1]^(1/2). This cubic equation has one real and two imaginary roots. The real root gives σ_y = 0.183. Substituting the values of σ_y and σ_z into equation 16.68, μ_y = 4.49 was obtained. Substituting σ_y, μ_y, and μ_z into equation 16.67, ε = −75.75 was determined. Using σ_y, μ_y, and ε, the standard normal variate corresponding to Z = 0 was found as z = −0.895. The corresponding risk is obtained as
It can be noted that the risk estimates based on the normal distribution and the 3-parameter lognormal distribution are the same and very close to that obtained using FORM (example 16.10).
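The standard normal variate quoted above follows directly from the fitted parameters (small differences arise from rounding of σ_y, μ_y, and ε):

```python
import math

sigma_y, mu_y, eps = 0.183, 4.49, -75.75

# Standard normal variate at Z = 0: z = (ln(0 - eps) - mu_y)/sigma_y
z0 = (math.log(0.0 - eps) - mu_y) / sigma_y
print(round(z0, 3))  # close to the book's -0.895 (difference is rounding)
```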
Now, to see the upper bound of the risk, Z is assumed to be uniformly distributed. P_f is determined from
OTHER METHODS
Second-Order Reliability Methods
The second-order reliability method (SORM) has been used extensively in structural reliability analyses. It has been established as an attempt to improve the accuracy of FORM. SORM
is obtained by approximating the limit state surface function at the design point by a second-order
surface, and the failure probability is given as the probability content outside the second-order
surface. There are two kinds of second-order reliability approximations: curvature-fitting SORM
(Tvedt 1990) and point-fitting (Zhao and Ono 1999). Both methods involve complex numerical
algorithms and extensive computational efforts.
Hamed et al. (1995) compared risk assessments due to groundwater contamination based on
FORM and SORM and reported that their results were in good agreement when the limit-state
surface at the design point in the standard normal space is nearly flat. On the other hand, when
the limit-state function contains highly nonlinear terms, or when the input random variables have
an accentuated non-normal character, SORM tends to produce more accurate results than FORM.
But computational requirements of SORM are much higher than FORM.
Point Estimation Methods
The point estimation (PE) method was originally proposed by Rosenblueth (1975) to deal
with symmetric, correlated, stochastic input parameters. The method was later extended to the
case involving asymmetric random variables (Rosenblueth 1981). The idea is to approximate the
given pdf of an input random variable by discrete probability masses concentrated at two points
in such a way that its first three moments are preserved.
Consider a model represented by n stochastic input parameters. Rosenblueth (1975, 1981)
demonstrated that the rth-order moment of the output random variable Y about the origin could be approximated via a point-probability estimate of the first-order Taylor series expansion. This method requires 2ⁿ model evaluations to estimate a single statistical moment of the model output.
For a large model with a large number of parameters, Rosenblueth's PE method is computationally impractical. Further, a reliability analysis requires knowledge of higher-order moments in order to approximate the distribution of the output random variable. This makes the method even more computationally extensive. Thus, while Rosenblueth's method is quite efficient for problems with a small number of uncertain basic variables, its computational requirements are similar to those of MCS for a model having a large number of parameters. For example, a model having between 10 and 15 parameters will require 1024 to 32,768 model evaluations (Melching 1995). Examples of applying Rosenblueth's method to watershed hydrology include Rogers
et al. (1985) and Melching (1992b).
Harr (1989) modified Rosenblueth's method to reduce its computational requirements from 2ⁿ to 2n for an n-parameter model by using the first two moments of the random variables.
This method does not provide the flexibility to incorporate known higher-order moments of input random variables. Chang et al. (1995) showed that the estimated uncertainty feature of model
output could be inaccurate if the skewness of a random variable is not accounted for. Yeh and
Tung (1993) and Chang et al. (1992) are some of the examples of applying Harr's method in
hydraulic engineering.
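Rosenblueth's two-point idea for symmetric, independent inputs can be sketched in a few lines: evaluate the model at μᵢ ± σᵢ for all 2ⁿ sign combinations and weight each evaluation equally. The product model below is illustrative; for it, the two-point estimates of the mean and variance happen to be exact.

```python
from itertools import product

def rosenblueth_moment(model, mus, sigmas, power=1):
    """Two-point estimate of E[Y^power] for symmetric, independent inputs:
    evaluate the model at mu_i +/- sigma_i for every sign combination
    (2^n runs) and weight each evaluation by 1/2^n."""
    n = len(mus)
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        x = [m + s * sig for m, s, sig in zip(mus, signs, sigmas)]
        total += model(x) ** power
    return total / 2 ** n

model = lambda x: x[0] * x[1]    # illustrative model Y = X1*X2
mus, sigmas = [3.0, 5.0], [0.6, 1.0]

m1 = rosenblueth_moment(model, mus, sigmas)            # E[Y] = mu1*mu2 = 15
m2 = rosenblueth_moment(model, mus, sigmas, power=2)   # E[Y^2]
var = m2 - m1 ** 2                                     # exact here: 18.36
print(m1, round(var, 3))
```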
Transform Methods
Tung (1990) used the Mellin transform to calculate the higher-order moments of a model output. The application of the Mellin transform is not only cumbersome, but it also cannot be universally applied. As pointed out by Tung, the Mellin transform may not be analytic under certain combinations of distribution and functional forms. In particular, problems may arise when a functional relationship consists of input variable(s) with negative exponent(s). When the component functions of a given model have forms other than power functions, it cannot be applied. Further, no formulation was suggested to obtain the moments of a model output having non-normally distributed input variable(s).
17. Geostatistics
GEOSTATISTICS IS a term used to refer to a collection of statistical techniques applicable
to spatially referenced data. Generally, by "spatially referenced data" we mean the data are represented by a triple (x, y, z) where x and y are the spatial coordinates and z is the value of the variable of interest at location (x, y). Examples of spatially referenced data are the elevations of the
piezometric surface over the areal extent of an aquifer, the soil nitrogen content throughout a
particular field, the depth of rainfall from a given storm over a watershed, and the saturated
hydraulic conductivity of the surface soils within a county. One thing that is apparent from this
listing of examples of spatially referenced data is that if the sampling locations are close together,
one would expect the z values for neighboring points to be similar. Typically, groundwater elevations change slowly with distance so that observation wells close together may produce observed water level data that are correlated. As the distance between sampling points increases, the
correlation among the observations can be expected to decrease. At some separation distance, the
observations will become independent of each other.
The problems addressed by geostatistics are the same types of problems that are discussed
throughout this book: estimation and characterization. Some problems that may be addressed are:
1. Point estimates at locations where measurements are not available.
2. Estimates for the areal averages over the entire area or some part of the area.
This chapter is coauthored by Jason R. Vogel, research engineer, in the Biosystems and
Agricultural Engineering Department, Oklahoma State University, Stillwater, Oklahoma.
If a correlation structure among the observations exists, one would like to take advantage of it in
making estimates. Geostatistics is a term used to refer to a collection of techniques that might be
used to address questions as listed above and other questions relative to spatially referenced data.
Typically, one may have a set of georeferenced data reflecting the values of some quantities
at a number of spatially referenced locations. Techniques that might be used to estimate the
values of georeferenced data at points where measurements are not available include contouring
and interpolation schemes of various kinds.
The coverage of geostatistics contained here is limited to basic concepts and approaches.
Isaaks and Srivastava (1989) present a very readable and basic coverage of the topic which is
both insightful and comprehensive. Cressie (1991) is a quite complete coverage of the theory and
application of geostatistics. Goovaerts (1997) lies between these two books in its coverage of
theory and application. Deutsch and Journel (1998) is a collection of geostatistical software written in FORTRAN.
DESCRIPTIVE STATISTICS
In this chapter a location will be designated by a lower case letter such as s or t. A location
might refer to a coordinate system in one, two, or more dimensions. In one dimension, s might
simply be the x-coordinate. In two dimensions, s could represent the x, y coordinate pair. The
distance, h, between locations s = (x_s, y_s) and t = (x_t, y_t) in two dimensions is the Euclidean distance given by

h = [(x_s − x_t)² + (y_s − y_t)²]^(1/2)      (17.1)
The value of the variable of interest will be denoted by Z(s). For example, Z(s) might
represent the elevation of the piezometric surface at location s.
If we consider a process, Z(s), that is second-order stationary, then E(Z(s)) = μ and
Var(Z(s)) = σ² for all s. Z(s) may be thought of as having two components: one component is
the mean and the other is a random component, ε. Therefore

Z(s) = μ + ε

We will use z(s) to represent the deviation of Z(s) from the mean:

z(s) = Z(s) − μ

and

Var(z(s)) = Var(Z(s)) = Var(ε) = σ_ε²
A milder assumption is that for every vector h the increment Y_h(s) = Z(s + h) − Z(s) is second-order stationary. Then Z(s) is called an intrinsic random function and is
characterized by

E[Z(s + h) − Z(s)] = ⟨a, h⟩      (17.6)

and

Var[Z(s + h) − Z(s)] = 2γ(h)      (17.7)

where ⟨a, h⟩ is the linear drift of the increment and γ(h) is its semivariogram function (see equation 17.9). If the linear drift is zero (i.e., if the mean is constant), we have the usual form of the
intrinsic model

E[Z(s + h) − Z(s)] = 0      (17.8)

and

E[Z(s + h) − Z(s)]² = 2γ(h)      (17.9)
As a first example, consider the data shown in figure 17.1. These data represent the elevation
at 0.25-mile intervals along a transect. Obviously, elevations near each other tend to be alike, and
as the distance between measurements increases, the similarity in elevations is reduced.
Figure 17.2 is a correlogram of the elevation data, clearly showing that as the separation distance
between points increases, the correlation decreases. The correlogram is a plot of the autocorrelation function versus separation distance. The autocorrelation function is given by

ρ_h = σ_h / σ²      (17.10)
Fig. 17.1. Elevation at 0.25-mile intervals along the example transect.

Fig. 17.2. Correlogram of the elevation data for the example transect.
where σ² and σ_h are the variance and covariance estimated by s² and s_h given by

s_h = (Σ z_s z_t)/n      for h = t − s
Here h is the distance between locations s and t. Equation 17.10 indicates that the covariance is
equal to the product of the variance and the autocorrelation. Thus, a plot of the covariance against
the separation distance would have exactly the shape of the correlogram with different scaling on
the ordinate.
Figure 17.2 indicates that points closer than about 1.75 miles have some degree of correlation in terms of elevation with this correlation decreasing as the separation distance increases
from 0 to 1.75 miles. At 1.75 miles and beyond, the elevations appear to be independent along
this transect.
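The correlogram computation described above can be sketched in a short program. The following Python fragment is illustrative only: the elevation values are invented, not the transect data of figure 17.1, and the lag-k autocovariance is divided by the sample variance as in equation 17.10.

```python
import numpy as np

# Invented elevations at 0.25-mile spacing (NOT the figure 17.1 data).
z = np.array([1490., 1482., 1475., 1470., 1462., 1450., 1441., 1448.,
              1455., 1463., 1470., 1479., 1485., 1480., 1472., 1466.])
dx = 0.25  # sampling interval, miles

def correlogram(z, max_lag):
    """Sample autocorrelation r(h) for lags h = 0, 1, ..., max_lag sampling intervals."""
    zbar, s2 = z.mean(), z.var()
    r = []
    for k in range(max_lag + 1):
        # Lag-k autocovariance: average product of deviations k steps apart.
        r.append(np.mean((z[:len(z) - k] - zbar) * (z[k:] - zbar)) / s2)
    return np.array(r)

r = correlogram(z, 8)
for k, rk in enumerate(r):
    print(f"h = {k * dx:4.2f} mi  r = {rk:+.3f}")
```

For a smooth series like this one, the lag-one correlation is strongly positive and the correlation generally falls off with distance, the behavior seen in figure 17.2.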
In geostatistics, the semivariogram, 2γ_h, is defined as Var(Z_s − Z_t).
For a second-order stationary process, Var(Z_s) = Var(Z_t) = σ². Also Cov(Z_s, Z_t) = σ_h. The semivariogram, γ_h, is defined as ½ Var(Z_s − Z_t). Therefore

γ_h = ½ Var(Z_s − Z_t) = ½ [Var(Z_s) + Var(Z_t) − 2 Cov(Z_s, Z_t)] = σ² − σ_h

The net result is that the semivariogram, the variance, and the covariance are related by

γ_h = σ² − σ_h      (17.15)
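The relation between the semivariogram, the variance, and the covariance can be checked numerically with a short sketch (invented values again). The sample semivariogram is half the mean squared difference between points lag h apart; with finite samples the identity holds only approximately.

```python
import numpy as np

z = np.array([1490., 1482., 1475., 1470., 1462., 1450., 1441., 1448.,
              1455., 1463., 1470., 1479., 1485., 1480., 1472., 1466.])

def semivariogram(z, max_lag):
    """gamma(k) = 0.5 * mean[(z(s+k) - z(s))^2] for lag k sampling intervals."""
    return np.array([0.5 * np.mean((z[k:] - z[:-k]) ** 2)
                     for k in range(1, max_lag + 1)])

def autocovariance(z, max_lag):
    zbar = z.mean()
    return np.array([np.mean((z[:len(z) - k] - zbar) * (z[k:] - zbar))
                     for k in range(1, max_lag + 1)])

gam = semivariogram(z, 5)
cov = autocovariance(z, 5)
print("gamma(h):     ", np.round(gam, 1))
print("var - cov(h): ", np.round(z.var() - cov, 1))  # approximately equal to gamma(h)
```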
Fig. 17.3. Semivariogram and semivariogram cloud for the example elevation transect.
The semivariogram for the example elevation transect is based on calculations in
one dimension along the transect; the example data are cross-sectional data. An expansion of the
semivariogram concept to two-dimensional surfaces can be used to represent surfaces of data. For
example, Z(s) = Z(x, y) might be the elevation of the piezometric surface of a particular well
field at coordinate location (x, y). If γ_h is a function of |h| only and not of direction, the process is
isotropic. If γ_h depends on direction, the process is anisotropic.
In calculating a semivariogram, one often finds that as h → 0, γ_h → σ₀² where σ₀² > 0. This
positive value, σ₀², is called the nugget effect. Cressie (1991) discusses this nugget effect and
its possible relation to variations that occur at intervals closer than the intervals on which data
are measured and to measurement errors. Unless the data include more than one observation at
the same location, the nugget effect and the behavior of the semivariogram near the origin cannot be determined from a sample semivariogram. Isaaks and Srivastava (1989) indicate that the
nugget effect and the behavior near the origin are the two most important attributes when using
a semivariogram for estimation via kriging, as discussed later.
SEMIVARIOGRAM MODELS
The semivariogram is calculated from a sample of data. The results of semivariogram
calculation are semivariances at particular values of h. Semivariogram models are used to smooth
the resulting sample semivariograms and to provide estimates at points intermediate between
calculated estimates. To ensure that a unique solution to the ordinary kriging equations (discussed later in this chapter) exists, the semivariogram must be positive definite. The explanation
of why this condition guarantees existence of a unique solution can be found in Cressie (1991).
One way to satisfy the positive definiteness condition is by using one of a series of models that
are known to be positive definite.
Most of these semivariogram models are nondecreasing functions
of h, starting at γ₀ = 0 and becoming asymptotic to the sill, σ², at h equal to the range, R, of the semivariogram. The most common models are the spherical, exponential, Gaussian, and linear models.
The spherical model is given by

γ_h = σ²[1.5(|h|/R) − 0.5(|h|/R)³]      for |h| < R      (17.17)

with γ_h = σ² for |h| ≥ R.
The exponential model is

γ_h = σ²(1 − e^(−3|h|/R))      for |h| > 0      (17.18)

The Gaussian model is

γ_h = σ²(1 − e^(−3h²/R²))      for |h| > 0      (17.19)

The linear model is

γ_h = (σ²/R)|h|      for |h| < R      (17.20)

and γ_h = σ² for |h| ≥ R. The linear model is not strictly positive definite, and positive definiteness should be checked when using this equation. Isaaks and Srivastava (1989) discuss these
checks.
checks.
A nugget model may also be used, which is given by

γ_h = 0 for h = 0;      γ_h = σ² for |h| > 0
The nugget model may be used when Z(s) and Z(t) are independent for all spacings greater than
or equal to the smallest available spacing.
The choice of models is usually dependent on the behavior of the sample semivariogram
near the origin. If the sample semivariogram shows a parabolic behavior near the origin, the
Gaussian model may be the most appropriate. If the semivariogram behaves linearly near the
origin, the exponential or spherical model will be the best choice. If a straight line through the
first few points on the sample semivariogram intersects the sill at approximately one-fifth the
range, the exponential model will be a better fit. If the line intercepts at about two-thirds the
range, the spherical model will likely fit better.
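The common model forms can be written as short functions, as in the Python sketch below. The constants used here (for example, the factor 3 in the exponential and Gaussian exponents, so that γ reaches about 95% of the sill at h = R) follow one common convention and should be treated as an assumption when comparing against other texts.

```python
import numpy as np

# Semivariogram models with sill s2 and range R (one common parameterization).
def spherical(h, s2, R):
    h = np.minimum(np.abs(h), R)                       # stays at the sill beyond R
    return s2 * (1.5 * h / R - 0.5 * (h / R) ** 3)

def exponential(h, s2, R):
    return s2 * (1.0 - np.exp(-3.0 * np.abs(h) / R))   # ~95% of sill at h = R

def gaussian(h, s2, R):
    return s2 * (1.0 - np.exp(-3.0 * (np.asarray(h) / R) ** 2))

def linear(h, s2, R):
    return np.minimum(s2 * np.abs(h) / R, s2)

h = np.linspace(0.0, 5.0, 6)
for name, f in [("spherical", spherical), ("exponential", exponential),
                ("gaussian", gaussian), ("linear", linear)]:
    print(f"{name:12s}", np.round(f(h, s2=2755.0, R=3.1), 1))
```

Note the different behavior near the origin: the Gaussian model is parabolic, while the exponential and spherical models rise linearly, which is the basis of the selection rules above.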
Figure 17.4 illustrates these various models. Moser and Macchiavelli (1996) and Cressie
(1991) discuss the estimation of the parameters of semivariogram models. Least squares,
weighted least squares, and maximum likelihood are three methods that can be used. A variety of
statistical software is available for carrying out the actual computations for most geostatistical
procedures discussed in this book (Deutsch and Journel 1998).
Fig. 17.4. Common semivariogram model forms (spherical and exponential shown).

Fig. 17.5. Spherical and exponential semivariogram models fitted to the example transect data.
The example elevation transect data is modeled with the spherical and exponential models
in figure 17.5.
One may also use a linear combination of n individual semivariogram models,

γ_h = Σ_{i=1}^{n} w_i γ_h^(i),      w_i > 0 and Σ_{i=1}^{n} w_i = 1      (17.22)
that will be positive definite as long as the n individual models are positive definite. This linear
combination forms a model of nested structures, where each nested structure corresponds to a
term in the linear combination in equation 17.22. Different model forms can be included in a single combination.
When selecting a model to represent the sample semivariogram, it is generally important
to remember that simpler is better. That is to say, if the major features of the semivariogram
can be satisfactorily characterized by both a basic model and a combination of models, use the basic model.
Figure 17.6 shows the best-fit combination model, which is a combination of two
spherical models with range values of 0.8 and 3.5 and sill values of 980
and 1730. The coefficient of determination (r²) for this combination model is 0.978, as opposed
to 0.975 for the basic exponential model. Because both semivariograms satisfactorily characterize the data, the exponential model is the better choice to represent the structure in the
semivariogram for estimation purposes because of its simplicity compared to the combination
model.
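Fitting a model to sample semivariances by least squares can be sketched as follows. The sample semivariances here are synthetic, generated from an exponential model with the chapter's fitted sill of 2755 and range of 3.1 plus noise, and the crude grid search stands in for the least-squares, weighted least-squares, or maximum likelihood procedures cited above.

```python
import numpy as np

# Synthetic sample semivariances from an exponential model (sill 2755, range 3.1)
# with added noise; in practice these come from the data, as in figure 17.5.
h = np.array([0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
rng = np.random.default_rng(1)
g = 2755.0 * (1 - np.exp(-3 * h / 3.1)) + rng.normal(0.0, 60.0, h.size)

def fit_exponential(h, g):
    """Grid search for the (sill, range) pair minimizing the sum of squared errors."""
    best = (np.inf, None, None)
    for s2 in np.linspace(0.5 * g.max(), 2.0 * g.max(), 150):
        for R in np.linspace(0.1, 2.0 * h.max(), 150):
            sse = np.sum((g - s2 * (1 - np.exp(-3 * h / R))) ** 2)
            if sse < best[0]:
                best = (sse, s2, R)
    sse, s2, R = best
    r2 = 1.0 - sse / np.sum((g - g.mean()) ** 2)   # coefficient of determination
    return s2, R, r2

s2, R, r2 = fit_exponential(h, g)
print(f"sill = {s2:.0f}, range = {R:.2f}, r^2 = {r2:.3f}")
```

The r² computed this way is the quantity compared above (0.978 versus 0.975) when judging a combination model against a basic one.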
Fig. 17.6. Fitted semivariogram to example transect data showing the combination of two
spherical models.
ESTIMATION
Isaaks and Srivastava (1989) present an excellent introduction to estimation using geostatistics. An estimate, V̂(x₀), for the random variable, V, at a point x₀ can be made as a linear
combination of observed values of V at a number, n, of nearby points, x_i, for i = 1 to n:

V̂(x₀) = Σ_{i=1}^{n} w_i V(x_i)      (17.23)

Therefore

E[V̂(x₀)] = Σ_{i=1}^{n} w_i E[V(x_i)] = μ Σ_{i=1}^{n} w_i

For a stationary process and an unbiased estimate, the sum of the weights must be 1. Our goal is
to find values for the weights subject to the constraint that they sum to 1.
There are several estimation procedures that can be used. All estimation procedures require
one to select a pattern of spatial continuity. The arithmetic average of all relevant points is possibly the simplest method:

V̂(x₀) = (1/n) Σ_{i=1}^{n} V(x_i)      (17.26)

This method requires one to define the n points that influence the value of V at x₀ and then assign
equal importance to each of these n points. For the arithmetic average, all of the weights are equal
to 1/n.
A second common approach is to use a weighting scheme based on the inverse of the
distance:

w_i = (1/d_i) / Σ_{j=1}^{n} (1/d_j)      (17.27)

Again, the points to be included in the calculations must be defined. Points closer to x₀ have more
influence on the estimate of V(x₀) than more distant points.
The inverse distance weighting can be generalized using a power, p, on the distance:

w_i = (1/d_i^p) / Σ_{j=1}^{n} (1/d_j^p)

For inverse distance squared weighting, p = 2. For p > 1, more emphasis is placed on points closer to x₀ and the importance of more distant
points is diminished.
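A minimal inverse distance weighting routine can be sketched as below; the sample points are invented and the function name is our own.

```python
import numpy as np

def idw_estimate(x0, xs, vs, p=2.0):
    """Inverse-distance-power estimate of V at x0; weights 1/d^p normalized to sum to 1."""
    xs, vs = np.asarray(xs, float), np.asarray(vs, float)
    d = np.abs(xs - x0)
    if np.any(d == 0):                  # x0 coincides with a sample: return that value
        return float(vs[d == 0][0])
    w = 1.0 / d ** p
    w = w / w.sum()                     # normalization makes the weights sum to 1
    return float(np.dot(w, vs))

xs = [0.25, 0.50, 0.75, 1.00]
vs = [1490., 1482., 1475., 1470.]
print(idw_estimate(0.625, xs, vs, p=2))   # nearer samples (1482, 1475) dominate
```

Raising p concentrates still more of the weight on the nearest samples, as described above.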
An estimation process known as kriging (after D. G. Krige, a South African mining engineer
and pioneer in the use of statistical methods in mineral evaluation) produces estimation weights
that minimize the variance of the estimation errors. The variance of the errors, σ_R², is estimated by
s_R², which is given by

s_R² = (1/n) Σ_{i=1}^{n} [V̂(x_i) − V(x_i)]²

This approach requires knowledge of the true values of V at all of the x_i, but we have only sample values of V at the x_i. Because we want to estimate points at locations where we have no measurements,
we will assume a model for the variance-covariance structure. We do this by using a semivariogram. Since we cannot minimize s_R² directly, we will minimize the model error variance by setting
the partial derivatives of the model error variance with respect to the weights to zero.
Equation 17.24 indicates that R(x₀) = V̂(x₀) − V(x₀). In general, the variance of a
weighted linear combination Σ_{i=1}^{n} a_i Y_i is given by

Var[Σ_{i=1}^{n} a_i Y_i] = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j Cov(Y_i, Y_j)      (17.30)

Applying this result to R(x₀) with a_i = w_i for i = 1 to n and a_{n+1} = −1 results in

s_R² = Var(V̂(x₀)) + Var(V(x₀)) − 2 Cov(V̂(x₀), V(x₀))

We also have

Var(V(x₀)) = s²

and

2 Cov(V̂(x₀), V(x₀)) = 2 Cov((Σ w_i V_i)V₀)
= 2 E(Σ w_i V_i V₀) − 2 E(Σ w_i V_i)E(V₀)
= 2 Σ w_i E(V_i V₀) − 2 Σ w_i E(V_i)E(V₀)
= 2 Σ w_i Cov(V_i, V₀)
= 2 Σ w_i C_i0
We want to find the n w_i's that minimize s_R². Normally, we would do this by taking n partial
derivatives with respect to the n weights and setting them equal to zero. However, in this case we
have imposed the constraint that Σ w_i = 1. Therefore, we have a constrained optimization, which
we will solve by using Lagrangian multipliers.
We now have n + 1 unknowns, the n w_i's and λ, the Lagrangian parameter. Taking the partial derivative with respect to λ and setting it equal to zero results in 0 = 2(Σ w_i − 1), or Σ w_i = 1, which
is the desired unbiasedness condition. We now take partial derivatives with respect to the w_i's.
Only the derivative with respect to w₁ is shown; the others are similar. The third term of
s_R² has no w₁. For the first term, dropping all terms without w₁ and differentiating, then setting each such derivative to zero, results in the set of equations

Σ_{j=1}^{n} w_j C_ij + λ = C_i0      for i = 1 to n      (17.35)
So we arrive at a solution for the weights of our estimation equation (equation 17.23). Note that
the weights are a function of the C_ij, which are covariances. In practice, the C_ij are generally
not known and must be estimated from an assumed model giving C(h).
The set of weights we have determined minimizes the error variance s_R². The error variance
can be determined by multiplying equation 17.35 by w_i and adding the n equations to get

Σ_{i=1}^{n} Σ_{j=1}^{n} w_i w_j C_ij + λ = Σ_{i=1}^{n} w_i C_i0

Substituting this result into the expression for s_R² gives

σ_R² = σ² − Σ_{i=1}^{n} w_i C_i0 − λ

Also, because Σ w_i = 1 and C_i0 = σ² − γ_i0, we have, in terms of the semivariogram,

σ_R² = Σ_{i=1}^{n} w_i γ_i0 − λ      (17.40)
The estimates obtained by kriging have the advantage that an estimate for the variance of
V̂(x₀) is given as σ_R². This estimate takes into account the covariance structure of the points used in
the estimation process through the semivariogram. Kriging produces minimum variance
estimates.
Equation 17.40 shows that the variance of an estimate made from equation 17.23 depends
on the weights assigned and the covariance structure of the system. The averaging and weighted
averaging methods of equations 17.26 and 17.27 do not depend on this covariance structure, and an estimate of their error variance
is generally not available.
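The ordinary kriging system can be assembled and solved directly, as in the following sketch. It uses the chapter's exponential model (sill 2755, range 3.1) with C(h) = σ² − γ(h); the transect values are invented, not those of table 17.1, and the augmented-matrix formulation is one standard way to impose the Σ w_i = 1 constraint.

```python
import numpy as np

SILL, RANGE = 2755.0, 3.1

def cov(h):
    """C(h) = sigma^2 - gamma(h) for the exponential semivariogram model."""
    return SILL * np.exp(-3.0 * np.abs(h) / RANGE)

def ordinary_krige(x0, xs, vs):
    xs, vs = np.asarray(xs, float), np.asarray(vs, float)
    n = xs.size
    # Augmented system [C 1; 1' 0][w; lam] = [C_i0; 1] enforces sum(w) = 1.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = cov(xs[:, None] - xs[None, :])
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    b = np.append(cov(xs - x0), 1.0)
    sol = np.linalg.solve(A, b)
    w, lam = sol[:n], sol[n]
    est = float(np.dot(w, vs))
    var = float(SILL - np.dot(w, cov(xs - x0)) - lam)   # kriging (error) variance
    return est, var, w

xs = [0.25, 0.50, 0.75, 1.00, 1.25, 1.50]          # invented transect locations, miles
vs = [1490., 1482., 1475., 1470., 1462., 1450.]
est, var, w = ordinary_krige(0.625, xs, vs)
print(f"estimate = {est:.1f} ft, std error = {np.sqrt(var):.2f} ft, sum(w) = {w.sum():.3f}")
```

Asking for an estimate exactly at a measured location returns the measured value with zero variance, the exactness property of ordinary kriging discussed in the example that follows.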
AN EXAMPLE
Let us consider the example transect presented earlier in this chapter. The actual data are
contained in table 17.1. Figure 17.3 is the semivariogram. Figure 17.5 shows how the spherical and
exponential semivariogram models fit the example data. For the spherical model (equation 17.17),
the best-fit model has a sill of 2620 and a range of 2.6. For the exponential model (equation 17.18),
the sill (σ²) is 2755 and the range (R) is 3.1. The exponential model displayed the best fit and was
chosen to represent the semivariogram.
The first thing we will do is use ordinary kriging to estimate the elevation at the midpoints
between the measured points, beginning with a point at X = 0.625 miles and ending at X = 7.125
miles. We will use the exponential model to estimate the covariances required for equation 17.38.
The covariances actually come from equation 17.15 as

C(h) = σ² − γ(h) = σ²e^(−3h/R)
Table 17.1. Example transect data: distance (mi) and elevation (ft) at 0.25-mile intervals along the transect.
To illustrate the calculation of an element of the C or D matrix, the value for C₂,₃ will be determined. The separation distance for these points is 0.25 miles since they are adjacent points.
Taking the inverse of C, the weights, w, are then obtained from w = C⁻¹D.
The elevations at the midpoints can now be determined from equation 17.23.
The standard error of the elevation, σ_e, can be determined from equation 17.40 as

σ_e = 18.22 feet
For a point that is not midway between measurements, the weights are again obtained from w = C⁻¹D.
The estimate for the elevation is still largely dependent on the elevations immediately on
either side of the point but more weight is given to the closer known elevation.
If we use this method to estimate the elevation right on a known point, the C matrix remains
unchanged. The distances in the D matrix would be (0.50, 0.25, 0, 0.25, 0.50, 0.75), resulting in
a weight of 1 on the known point and weights of 0 on all of the others, or the elevation is predicted to be exactly itself. Ordinary kriging predicts exact values at measured points.
The variance at the estimated point is zero if the estimated point is also a point of measurement.
Finally, consider a three-dimensional situation with the three points shown in figure 17.8, each having a location and an elevation. The distance between points is determined from equation 17.1, resulting in a symmetric distance matrix whose off-diagonal separations are 1.03, 1.12, and 0.90 miles, along with a vector of distances from the sample points to the estimation point. Continuing to use the semivariogram from the example transect data, the C and D matrices are formed from these distances, the weights follow from w = C⁻¹D, and the estimate is the weighted sum of the measured elevations (a weight of 0.497, for example, is applied to the elevation 1610). The standard error of the estimate from equation 17.40 is 37.0 feet.
These examples indicate that point estimation via kriging requires the use of a semivariogram
model. The estimates obtained are dependent on the model chosen. Earlier it was indicated that the
behavior of the semivariogram model near the estimation point has a large influence on the
estimates that are made. The weights resulting from the kriging process can also be seen to be a
function of distance from x₀. Also, if a basic model and a combination model both satisfactorily
characterize the structure of the semivariogram, use the basic model.
ANISOTROPY
Thus far we have considered only isotropic situations so that the semivariogram is a function of distance but not direction. For an anisotropic situation, both the distance between points
and the angular direction from one point to another are of concern. Thus only points within the
sector h ± Δh and θ ± Δθ would be included, where h is the separation distance and θ the angular direction between points.

Fig. 17.9. Semivariograms demonstrating (a) geometric anisotropy, (b) zonal anisotropy due to
stratification, and (c) zonal anisotropy due to areal trends.

The semivariogram should then be indicated as γ(h, θ). It is possible
that the nugget, sill, or range of a semivariogram model are directionally dependent. Isaaks and
Srivastava (1989) discuss this situation. If the range changes with direction but not the sill or
nugget, and a plot of range versus directional angle is an ellipse, the anisotropy is known as geometric anisotropy. If only the sill changes with direction, zonal anisotropy is said to exist
(Goovaerts, 1997). Kupfersberger and Deutsch (1999) describe two types of zonal anisotropy. If
the sill is greater in the vertical direction, the anisotropy is characterized as zonal due to stratification. If the sill is greater in the horizontal direction, the semivariogram is said to be zonal due
to areal trends. Figure 17.9 shows the semivariograms for each of these types of anisotropies.
If anisotropy is present, the orientation of the axes of anisotropy must be identified. Physical
characteristics of the area should be a great aid in this step. Geologic features, prevailing winds,
and the predominant direction of storm movement may contribute to anisotropy of different dependent variables and may be an aid in determining the axes of anisotropy. Anisotropy can often be identified from contour plots. For an anisotropic situation, the contour lines tend to be ellipses, whereas
for isotropic cases, the contours are circular. With sufficient data, one might compute directional
semivariograms for several different directions in an effort to identify the required axes. If
anisotropy is identified from semivariograms, supporting evidence from the field should be sought.
Isaaks and Srivastava (1989) and Goovaerts (1997) suggest that one might combine directional semivariograms into a single semivariogram by using a transformation that standardizes
the range to one. To do this the separation distances must be transformed so that the standardized
model will provide a semivariogram value that is identical to any of the directional models for the
given separation distance.
Davis (1986) and Goovaerts (1997) discuss kriging in situations where there is nonstationarity in the form of a drift or trend in the mean value of the variable; that is, the mean changes with location. One method for handling this is to remove the drift to produce a variable that is stationary in the mean and then apply kriging to this stationary variable. For estimation, the drift must
be added back to the estimate based on the stationary semivariogram.
Davis (1986) illustrates the treatment of drift by considering points in the near vicinity of the
point where an estimate is to be made. In this way a "local" drift is considered.
COKRIGING
Estimates based on kriging use only information contained in the variable of interest. If this
variable is correlated to a second variable, information on this second variable, if incorporated
into the estimation process, may improve the estimates. For example, in estimating soil nitrogen
levels over a field, it may be found that soil nitrogen is correlated to soil phosphorus. It would be
reasonable to use this correlation along with sampled soil phosphorus values in addition to the
soil nitrogen values that have been measured to estimate soil nitrogen at unsampled locations.
The estimation equation becomes

V̂(x₀) = Σ_{i=1}^{n} a_i V(x_i) + Σ_{j=1}^{m} b_j U(x_j)
Here V(.) might represent soil nitrogen and U(.) soil phosphorus.
As with kriging, in cokriging the weights a_i and b_j that minimize the error variance of the estimate
are sought. Isaaks and Srivastava (1989) should be consulted for additional explanation and details.
LOCAL AND GLOBAL ESTIMATION
Often, we need an estimate of some physical quantity that is representative of an area such
as a field or catchment rather than at a point. Such an estimate is termed a global estimate if the
area encompasses a large part of the data field. If only a part of the data field is under consideration, the estimate might be termed a local area estimate or a local estimate. One simple global estimate is the average of all the available data concerning the quantity of interest. Such an estimate
might be very good if data from several points were available and each sample point represented
approximately the same fraction of the total area. For this to be the case, the sample locations
would have to have been carefully selected most likely on a regular, gridded pattern. In many sets
of data, sample locations tend to be clustered in some areas and very sparse in others. In such a
situation, a simple average would place more weight on the value of the property of interest
where data clustering existed and little weight on the value(s) where only isolated data existed. If
some degree of spatial continuity existed in the data, the area where clustering occurred would be
over-represented in the resulting estimate of the average. Of course if absolutely no spatial
continuity existed-that is, the data were completely independent from each other with respect
to location-then a straight average would be as good as any other averaging process because information at one location would be independent of that at any other.
In keeping with the purpose of this chapter, we will assume some spatial continuity exists in
the data. This is equivalent to saying that the semivariogram is not purely a nugget model. We will
develop a global estimate as a weighted linear combination of the available samples.
Global estimation generally refers to estimation over a large part, possibly all of, the sample
space. Local estimation refers to estimation over an area that is a part of the sample space. Global
estimation is often done by polygon declustering or cell declustering. Local area estimation is generally done by using averages of point kriging or by using block kriging. Having said this, there is
no hard and fast rules as to what is global and what is local in the sense used here. Thus, local area
estimates may be done by defining the "local" area to be the desired global area of interest. In this
case, large covariance matrices with many small or zero covariances may be involved. If there are
many observations in the area of interest, global estimation techniques are often appropriate. If
there are few data in the area of interest, local estimation techniques may be used.
Polygon Declustering
The first method we will consider for global estimation is familiar to hydrologists and climatologists who have estimated total rainfall over an area based on point rainfall measurements.
Hydrologists call the method the Thiessen polygon method. In geostatistics it is often simply
called polygon declustering. The global estimate is given by

V̂ = Σ_{i=1}^{n} w_i V(x_i) / Σ_{i=1}^{n} w_i

where the weight, w_i, is the area closest to the point x_i. This area is determined by drawing polygons whose sides are the perpendicular bisectors of lines connecting the various sample points.
With this method, in areas where samples are closely spaced, the corresponding weights are
small. In locations where data is sparse, the weights are large.
Cell Declustering
Cell declustering is a two-step process. The total area is divided into regular cells all of the
same size. The average value of sample points within a cell is taken as the cell value. The average of the cell values is then taken as the global estimate of the property of interest. Thus, in cells
where samples are clustered, each sample value enters the calculations for the cell mean and has
a weight inversely proportional to the number of samples in the cell. For example, in a cell with
10 samples, each sample value would have a weight of 0.1 with respect to the cell average. In a
cell with only one point, a weight of 1 would be assigned to that data point in computing the cell
average. Because the global average is a simple average of cell averages, the relationship among
the data point weights would carry over into the global average.
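Cell declustering can be sketched in a few lines. The one-dimensional coordinates and values below are invented to show the effect of a cluster of samples near a single isolated one.

```python
import numpy as np

def cell_decluster(xs, vs, cell_size):
    """Global estimate as the average of cell averages (1-D cells of equal size)."""
    xs, vs = np.asarray(xs, float), np.asarray(vs, float)
    cells = np.floor(xs / cell_size).astype(int)
    cell_means = [vs[cells == c].mean() for c in np.unique(cells)]
    return float(np.mean(cell_means))

# Clustered sampling: four samples near x = 0, one isolated sample near x = 9.
xs = [0.1, 0.2, 0.3, 0.4, 9.0]
vs = [10., 11., 9., 10., 50.]
print(np.mean(vs))                      # naive mean, dominated by the cluster: 18.0
print(cell_decluster(xs, vs, 1.0))      # declustered estimate: 30.0
```

Each clustered sample contributes only a quarter of its cell's mean, so the isolated sample receives the larger effective weight, exactly as described above.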
Point Kriging
A straightforward procedure that uses the covariance structure of the sample points but may
require considerable calculation is to define many points within the global area, likely on a regular grid to ensure equal coverage; estimate the value of the quantity of interest by ordinary kriging at each point; and then compute the simple average of these kriged estimates.
Block Kriging
Block kriging is similar to point kriging except that the matrix D in equation 17.36 contains the covariance values between the random variables V_i at the sample locations and the blocks. These
point-to-block covariances are given by (Isaaks and Srivastava, 1989)

C_iB = (1/n_j) Σ_j Cov(V_i, V_j)      (17.37)

where the V_j's are the sample values in the block and n_j is the number of such points. Thus, the
point-to-block covariance is the average of the n_j covariances between the V_j in the block and the sample point in question. Recall that Cov(V_j, V_i) is estimated using a semivariogram model. Block
estimation is then done using

V̂_B = Σ_{i=1}^{n} w_i V_i

where the w_i are given by equation 17.39 using C_iB from equation 17.37 rather than C_i0.
Isaaks and Srivastava (1989) show that block estimation via averaging point kriged estimates as explained above and block kriging are the same.
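The point-to-block covariance averaging can be sketched directly. The exponential model below uses the chapter's sill and range; the block discretization points are invented.

```python
import numpy as np

SILL, RANGE = 2755.0, 3.1

def cov(h):
    return SILL * np.exp(-3.0 * np.abs(h) / RANGE)   # C(h) = sigma^2 - gamma(h)

def point_to_block_cov(xi, block_points):
    """C_iB: average covariance between sample point xi and the points in the block."""
    d = np.abs(np.asarray(block_points, float) - xi)
    return float(np.mean(cov(d)))

block = np.linspace(2.0, 3.0, 5)    # 5 points discretizing a block from 2 to 3 miles
c_ib = point_to_block_cov(0.5, block)
print(round(c_ib, 1))
```

Substituting these averaged covariances into the D vector of the kriging system yields the block kriging weights.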
Recall that P_V(x₀) = prob(V(x) ≤ x₀). Further, consider that we are looking for a global estimate of
this quantity. The procedure is to replace each actual V(x_i) with an indicator variable I(x_i) such that

I(x_i) = 1 if V(x_i) ≤ x₀;      I(x_i) = 0 otherwise

Using this transformation, the mean of I is the fraction of the values in the sample that are less
than or equal to x₀. This estimate does not take into account clustering and is a rather crude estimate of P_V(x₀), just as the simple mean of the sample values is a crude estimate of the global
mean, because sample location and clustering are not considered.
We can decluster the estimate for P_V(x₀) just as we used declustering to estimate a global
mean. We do this by letting

P̂_V(x₀) = Σ_{i=1}^{n} w_i I(x_i)

where the weights are determined as before using the polygon or cell declustering technique.
Similarly, block kriging can be used to get local area estimates using

P̂_V(x₀) = Σ_i w_i I(x_i)

where the w_i are the kriging weights and the summation is over all appropriate points for the block
of interest.
By repeatedly using this approach while letting x₀ range incrementally from the minimum
to the maximum sample value, the entire empirical cdf can be estimated. From the definitions of
the pdf and the cdf, the pdf can then be estimated, as can the probabilities of estimates falling in
various ranges.
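The indicator approach can be combined with declustering weights as in the following sketch (cell declustering in one dimension, invented data; the helper names are our own).

```python
import numpy as np

def decluster_weights(xs, cell_size):
    """1-D cell declustering weights: each sample gets 1/(n_cells * samples_in_its_cell)."""
    xs = np.asarray(xs, float)
    cells = np.floor(xs / cell_size).astype(int)
    counts = {c: np.sum(cells == c) for c in np.unique(cells)}
    n_cells = len(counts)
    return np.array([1.0 / (n_cells * counts[c]) for c in cells])   # sums to 1

def declustered_cdf(vs, w, thresholds):
    """Weighted fraction of samples with V <= x0, for each threshold x0."""
    vs = np.asarray(vs, float)
    return np.array([np.sum(w * (vs <= x0)) for x0 in thresholds])

xs = [0.1, 0.2, 0.3, 0.4, 9.0]
vs = [10., 11., 9., 10., 50.]
w = decluster_weights(xs, 1.0)
F = declustered_cdf(vs, w, thresholds=[11.0, 50.0])
print(w)   # clustered samples get weight 1/8 each; the isolated one gets 1/2
print(F)   # F(11) = 0.5, F(50) = 1.0
```

Sweeping the threshold over the full range of the sample values traces out the declustered empirical cdf described above.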
Similarly, we can define moment estimates other than the mean. For example, the variance is the average squared deviation from the mean, so a global estimate for the variance is

s² = Σ_{i=1}^{n} w_i [V(x_i) − V̄]²

where V̄ is the global mean estimated as above. The skewness might be estimated from the analogous weighted average of cubed deviations divided by (s²)^(3/2).
Analogous results can be used for the estimation of local area moments using weights based on
kriging.
UNCERTAINTY
We have discussed the estimation of point values, mean values, empirical probability distributions, and various sample moments. One might logically inquire as to how good these various
estimates are. In statistics, we often quantify uncertainty in estimates in terms of confidence in-
tervals, as previously discussed in this book. The width of a confidence interval is a measure of
the uncertainty associated with a particular estimate. In a theoretical sense we need to know the
underlying pdf of a process to determine confidence intervals. In a practical sense we never know
that so we have to use an estimation procedure.
We have already seen how we can estimate an empirical cdf for points, blocks, or local
areas. If we desire a 100(1 − α)% confidence interval, we can determine it by finding the C_L and C_U
satisfying

prob(Q ≤ C_L) = α/2
prob(Q ≤ C_U) = 1 − α/2

where Q is the quantity of interest. The values of C_L and C_U represent the lower and upper confidence limits such that

prob(C_L ≤ Q ≤ C_U) = 1 − α      (17.55)
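Given an estimated cdf, the limits can be read off numerically. This sketch builds the cdf from equally weighted samples; declustering or kriging-based weights could be substituted. The data are synthetic draws from a normal distribution, so the 90% limits should fall near μ ± 1.645σ.

```python
import numpy as np

def empirical_limits(samples, weights, alpha):
    """Lower and upper confidence limits from a weighted empirical cdf."""
    order = np.argsort(samples)
    v = np.asarray(samples, float)[order]
    F = np.cumsum(np.asarray(weights, float)[order])   # estimated cdf at each sample
    c_lo = v[np.searchsorted(F, alpha / 2)]            # first value with F >= alpha/2
    c_hi = v[np.searchsorted(F, 1 - alpha / 2)]        # first value with F >= 1 - alpha/2
    return float(c_lo), float(c_hi)

rng = np.random.default_rng(0)
x = rng.normal(100.0, 10.0, 2000)
w = np.full(x.size, 1.0 / x.size)       # equal weights; declustering weights also work
lo, hi = empirical_limits(x, w, alpha=0.10)
print(lo, hi)   # roughly 100 - 16.45 and 100 + 16.45
```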
Appendixes
A. 1. COMMON DISTRIBUTIONS
Hypergeometric Distribution
Binomial Distribution
Geometric Distribution
f_X(x) = p(1 − p)^(x−1)      x = 1, 2, 3, ...
Poisson Distribution
Uniform Distribution
Triangular Distribution

f_X(x) = 2(x − α)/[(β − α)(δ − α)]      α ≤ x ≤ δ
f_X(x) = 2(β − x)/[(β − α)(β − δ)]      δ ≤ x ≤ β

μ = (α + β + δ)/3
σ² = (α² + β² + δ² − αβ − βδ − δα)/18

Exponential Distribution
Normal Distribution

Lognormal Distribution

γ = 3η + η³      where η = [exp(σ_y²) − 1]^(1/2)

Gamma Distribution

Extreme Value Distribution (largest value)

f_X(x) = (1/α) exp(−y − e^(−y))      for y = (x − β)/α,  −∞ < x < ∞,  α > 0
μ = β + 0.5772α
σ² = 1.645α²
γ = 1.1396

Extreme Value Distribution (smallest value)

f_X(x) = (1/α) exp(y − e^(y))      for y = (x − β)/α,  −∞ < x < ∞,  α > 0
γ = −1.1396
HYDROLOGIC DATA
A.2. Monthly runoff (in.), Cave Creek near Fort Spring, Kentucky
Year    Oct    Nov    Dec    Jan    Feb    Mar    Apr    May
1953    0.02   0.05   0.19   2.40   0.86   4.16   1.47   3.54
1954    0.00   0.02   0.04   0.54   0.22   0.04   1.39   0.35
1955    0.02   0.04   0.30   0.73   4.63   5.79   0.59   1.97
1956    0.04   0.06   0.13   0.59   6.37   4.69   1.92   0.28
1957    0.07   0.10   1.72   3.08   3.25   1.03   3.92   0.68
1958    0.03   1.06   4.32   2.00   2.21   1.17   2.35   2.36
1959    0.06   0.09   0.17   2.70   1.95   1.12   1.02   0.24
1960    0.03   0.36   2.69   2.19   3.13   2.91   0.68   0.19
1961    0.12   0.52   0.79   2.04   2.95   5.32   4.76   4.14
1962    0.02   0.06   0.76   3.46   4.01   5.08   3.30   0.79
1963    0.39   1.41   1.24   1.50   1.46   5.48   0.52   0.25
1964    0.01   0.04   0.03   0.87   1.73   7.88   0.45   0.21
1965    0.15   0.07   3.47   2.76   2.30   4.49   1.46   0.31
1966    0.04   0.02   0.02   0.48   2.81   0.79   2.02   3.32
1967    0.07   1.19   3.57   0.97   1.61   4.66   0.50   4.76
1968    0.09   0.38   2.71   1.35   0.98   4.25   2.38   1.99
1969    0.14   0.22   1.12   2.78   2.16   0.73   2.37   0.74
1970    0.07   0.25   0.91   1.30   3.89   2.91   5.68   2.06
Mean    0.08   0.33   1.34   1.76   2.58   3.47   2.04   1.57
StDev   0.09   0.44   1.40   0.96   1.50   2.22   1.52   1.52
Source: U.S. Geological Survey.
Total
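The summary rows of table A.2 can be reproduced directly from the monthly columns; as a check, here is the October column as printed in the table:

```python
# October monthly runoff (in.), 1953-1970, from table A.2.
oct_runoff = [0.02, 0.00, 0.02, 0.04, 0.07, 0.03, 0.06, 0.03, 0.12,
              0.02, 0.39, 0.01, 0.15, 0.04, 0.07, 0.09, 0.14, 0.07]

n = len(oct_runoff)
mean = sum(oct_runoff) / n
std = (sum((x - mean) ** 2 for x in oct_runoff) / (n - 1)) ** 0.5

print(round(mean, 2), round(std, 2))  # 0.08 0.09, matching the Mean and StDev rows
```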
A.3. (continued)

Date        Discharge (cfs)    Gage height (ft)
6/14/03     5140               7.53
5/12/04     7420               9.23
4/15/05     2410               5.28
4/17/06     10400              11.2
4/24/07     8040               9.67
11/7/07     10100              11.01
9/29/09     17400              14.93
4/1/10      4010               6.62
4/29/11     4110               6.7
4/23/12     7380               9.2
4/1/13      7130               9.02
10/21/13    6930               8.88
5/1/15      6100               8.27
5/18/16     6200               8.35
6/18/17     14600              13.5
10/31/17    5960               8.16
3/29/19     4710               7.3
4/14/20     8650               10.1
12/15/20    7600               9.4
6/30/22     8350               9.9
4/29/23     21500              17.67
5/2/24      8690               10.3
4/1/25      4570               7.1
11/16/25    8040               9.8
11/20/26    7780               9.6
10/20/27    10400              11.6
5/3/29      9600               11
4/8/30      8040               9.8
6/10/31     6870               8.93
9/17/32     12900              13.5
4/19/33     5960               8.15
4/21/34     8040               9.67
A.5. Total Precipitation (in.) for week of March 1 to March 7, Ashland, Kentucky
[Table values, including the Mean, St Dev, and Skewness rows, not recovered.]
Source: Kentucky Division of Water, Dept. of Natural Resources and Environmental Protection,
Frankfort, Kentucky.
[Table of annual discharge (cfs-days) by year, including Mean and St Dev rows, not recovered.]
Source: U.S. Geological Survey.
[Table caption not recovered; values are peak runoff rates, Peak (iph).]

Year     June      July      Sept      Oct
1957     0.0000    0.0000    0.0000    0.30737
1958     0.0025    0.0089    0.0028    0.09331
1959     0.1533    0.0000    0.0005    0.07438
1960     0.0000    0.0002    0.0000    0.00145
1961     0.0014    0.0272    0.0000    0.10555
1962     0.1146    0.0620    0.0004    0.02286
1963     0.0223    0.0871    0.0000    0.07280
1964     0.1277    0.3143    0.0000    0.11526
1965     0.0012    0.0332    0.0000    0.02261
1966     0.0558    0.0243    0.0000    0.04231
1967     0.0026    0.1191    0.0000    0.12579
1968     0.0003    0.0000    0.0000    0.02196
1969     0.0255    0.0029    0.0000    0.04512
1970     0.0210    0.0137    0.0000    0.01909
1971     0.0301    0.0010    0.0013    0.09717
Mean     0.0372    0.0463    0.0003    0.0778
St Dev   0.0520    0.0824    0.0008    0.074937
Skewness 1.4316    2.8070    2.8137    2.165884
Source: L. Lane, USDA, ARS, Southwest Watershed Research Center, Tucson, Arizona.
No flow in other months.
iph = inches per hour.
[Annual column not recovered.]
[Table caption not recovered.] Monthly precipitation (in.), 1955-1972

Year    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
1955    0.62   0.12   0.35   0.00   0.10   0.50   7.57   4.08   0.17   0.18
1956    0.61   0.06   0.33   0.00   0.00   0.38   2.08   1.31   0.00   0.05
1957    0.87   0.21   0.53   0.00   0.04   0.66   2.75   1.75   0.00   0.83
1958    0.00   0.00   0.12   0.62   0.18   1.06   2.22   3.74   2.33   1.11
1959    0.00   0.96   0.18   0.08   0.00   0.51   4.74   3.83   0.41   1.80
1960    1.01   0.38   1.79   0.00   0.00   0.11   1.04   1.00   1.55   0.69
1961    0.44   0.54   0.04   0.00   0.00   0.24   2.07   4.11   0.83   1.32
1962    0.74   0.91   0.07   0.00   0.00   0.08   2.78   0.09   0.93   0.36
1963    0.17   0.19   1.57   0.05   0.00   0.00   2.83   3.02   0.85   0.62
1964    0.32   0.09   0.28   0.28   0.00   0.00   3.03   1.56   2.47   0.42
1965    0.28   2.91   0.32   0.00   0.00   0.04   3.06   0.77   0.78   0.00
1966    0.48   0.14   1.29   0.31   0.00   0.05   4.39   4.89   2.41   0.00
1967    0.00   3.44   2.87   0.14   0.16   0.19   1.75   1.31   2.15   0.37
1968    0.55   0.35   1.56   0.11   0.00   0.00   2.13   5.64   0.43   0.00
1969    0.10   0.51   1.78   0.00   0.20   0.00   4.26   3.78   2.49   0.00
1970    0.00   0.42   2.74   0.04   0.00   0.18   1.48   4.06   0.83   0.16
1971    0.05   1.21   1.62   0.45   0.00   0.07   1.67   3.11   2.16   1.66
1972    0.00   0.06   0.03   0.00   0.04   0.89   1.81   1.30   0.48   2.27
Source: L. Lane, USDA, ARS, Southwest Watershed Research Center, Tucson, Arizona.
Rain gage 63.001.
[Nov, Dec, and Annual columns and the Mean, St Dev, and Skewness rows not recovered.]
[Two annual-flow summary tables not fully recovered; each had Mean, St Dev, and Skewness rows. Recovered values: Mean 579.8, St Dev 124.5, Skewness 0.0.]
Source: Yevjevich, 1963. Basin area 289 square miles. Data for 1910-1955.
Source: Yevjevich, 1963. Basin area 914 square miles. Data for 1915-1957.
STATISTICAL TABLES

A.12. Standard normal distribution
[Table values not recovered.]
Table values are prob(Z < z); i.e., prob(Z < 0.75) = 0.773 and prob(Z < -0.75) = 0.227.
Values generated using the Microsoft Excel spreadsheet program.
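The tabulated values can also be reproduced without a spreadsheet; the standard normal cdf follows from the error function (a standard identity, not specific to this book):

```python
import math

def std_normal_cdf(z):
    # prob(Z < z) for the standard normal: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(std_normal_cdf(0.75), 3))   # 0.773, as in the table
print(round(std_normal_cdf(-0.75), 3))  # 0.227
```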
alpha
10
15
20
30
60
120
0.9
0.96
0.975
0.99
0.995
39.9
253
648
4052
16212
49.5
312
799
4999
19997
53.6
337
864
5404
21614
55.8
351
900
5624
22501
57.2
360
922
5764
23056
58.2
366
937
5859
23440
59.4
373
957
5981
23924
60.2
378
969
6056
24222
61.2
384
985
6157
24632
61.7
388
993
6209
24837
62.3
391
1001
6260
25041
62.8
394
1010
6313
25254
63.1
396
1014
6340
25358
0.9
0.96
0.975
0.99
0.995
8.5
23.5
38.5
98.5
199
9.0
24.0
39.0
99.0
199
9.2
24.2
39.2
99.2
199
9.2
24.2
39.2
99.3
199
9.3
24.3
39.3
99.3
199
9.3
24.3
39.3
99.3
199
9.4
24.4
39.4
99.4
199
9.4
24.4
39.4
99.4
199
9.4
24.4
39.4
99.4
199
9.4
24.4
39.4
99.4
199
9.5
24.5
39.5
99.5
199
9.5
24.5
39.5
99.5
199
9.5
24.5
39.5
99.5
199
0.9
0.96
0.975
0.99
0.995
5.5
12.1
17.4
34.1
55.6
5.5
11.3
16.0
30.8
49.8
5.4
11.0
15.4
29.5
47.5
5.3
10.8
15.1
28.7
46.2
5.3
10.6
14.9
28.2
45.4
5.3
10.5
14.7
27.9
44.8
5.3
10.4
14.5
27.5
44.1
5.2
10.3
14.4
27.2
43.7
5.2
10.2
14.3
26.9
43.1
5.2
10.2
14.2
26.7
42.8
5.2
10.1
14.1
26.5
42.5
5.2
10.1
14.0
26.3
42.1
5.1
10.0
13.9
26.2
42.0
0.9
0.96
0.975
0.99
0.995
4.5
9.0
12.2
21.2
31.3
4.3
8.0
10.6
18.0
26.3
4.2
7.6
10.0
16.7
24.3
4.1
7.3
9.6
16.0
23.2
4.1
7.1
9.4
15.5
22.5
4.0
7.0
9.2
15.2
22.0
4.0
6.9
9.0
14.8
21.4
3.9
6.8
8.8
14.5
21.0
3.9
6.7
8.7
14.2
20.4
3.8
6.6
8.6
14.0
20.2
3.8
6.5
8.5
13.8
19.9
3.8
6.5
8.4
13.7
19.6
3.8
6.4
8.3
13.6
19.5
0.9
0.96
0.975
0.99
0.995
4.1
7.6
10.0
16.3
22.8
3.8
6.6
8.4
13.3
18.3
3.6
6.1
7.8
12.1
16.5
3.5
5.8
7.4
11.4
15.6
3.5
5.7
7.1
11.0
14.9
3.4
5.5
7.0
10.7
14.5
3.3
5.4
6.8
10.3
14.0
3.3
5.3
6.6
10.1
13.6
3.2
5.1
6.4
9.7
13.1
3.2
5.1
6.3
9.6
12.9
3.2
5.0
6.2
9.4
12.7
0.9
0.96
0.975
0.99
0.995
3.8
6.8
8.8
13.7
18.6
3.5
5.8
7.3
10.9
14.5
3.3
5.3
6.6
9.8
12.9
3.2
5.0
6.2
9.1
12.0
3.1
4.9
6.0
8.7
11.5
3.0
4.6
5.6
8.1
10.6
2.9
4.5
5.5
7.9
10.3
2.9
4.3
5.3
7.6
9.8
2.8
4.3
5.2
7.4
9.6
2.8
4.2
5.1
7.2
9.4
3.1
4.9
6.1
9.1
12.3
2.7
4.1
4.9
7.0
9.0
0.9
0.96
0.975
0.99
0.995
0.9
0.96
0.975
0.99
0.995
3.6
6.3
8.1
12.2
16.2
3.3
5.3
6.5
9.5
12.4
3.1
4.8
5.9
8.5
10.9
3.0
4.5
5.5
7.8
10.1
2.9
4.4
5.3
7.5
9.5
3.1
4.7
5.8
8.5
11.1
2.8
4.2
5.1
7.2
9.2
3.1
4.9
6.1
9.2
12.4
2.8
4.1
5.0
7.1
9.1
2.6
3.8
4.5
6.2
7.8
2.6
3.7
4.4
6.0
7.5
2.5
3.6
4.3
5.8
7.3
2.5
3.5
4.2
5.7
7.2
3.5
6.0
7.6
11.3
14.7
3.1
4.9
6.1
8.6
11.0
2.9
4.5
5.4
7.6
9.6
2.8
4.2
5.1
7.0
8.8
2.7
4.0
4.8
6.6
8.3
2.7
3.9
4.7
6.4
8.0
2.7
4.0
4.8
6.6
8.4
2.5
3.6
4.3
5.8
7.2
2.6
3.8
4.6
6.3
8.0
2.8
4.1
4.9
6.8
8.7
2.6
3.7
4.4
6.0
7.5
2.5
3.5
4.1
5.5
6.8
2.4
3.4
4.0
5.4
6.6
2.4
3.3
3.9
5.2
6.4
2.3
3.2
3.8
5.0
6.2
2.3
3.2
3.7
4.9
6.1
0.9
0.96
3.4
5.8
3.0
4.7
2.8
4.2
2.7
4.0
2.6
3.8
2.6
3.7
2.5
3.5
2.4
3.4
2.3
3.2
2.3
3.2
2.3
3.1
2.2
3.0
2.2
2.9
(Continued)
A.15. (continued)
alpha
0.975
0.99
0.995
7.2
10.6
13.6
5.7
8.0
10.1
5.1
7.0
8.7
10
4.7
6.4
8.0
4.5
6.1
7.5
4.3
5.8
7.1
4.0
5.3
6.4
2.3
3.2
3.7
4.8
5.8
3.8
5.0
6.0
3.7
4.8
5.8
3.6
4.6
5.6
3.4
4.5
5.4
3.4
4.4
5.3
2.2
3.1
3.5
4.6
5.5
2.2
3.0
3.4
4.4
5.3
2.2
2.9
3.3
4.2
5.1
2.1
2.8
3.2
4.1
4.9
2.1
2.8
3.1
4.0
4.8
15
20
30
60
120
0.9
0.96
0.975
0.99
0.995
10
3.3
5.6
6.9
10.0
12.8
2.9
4.5
5.5
7.6
9.4
2.7
4.1
4.8
6.6
8.1
2.6
3.8
4.5
6.0
7.3
2.5
3.6
4.2
5.6
6.9
2.5
3.5
4.1
5.4
6.5
4.1
5.5
6.7
2.4
3.3
3.9
5.1
6.1
0.9
0.96
0.975
0.99
0.995
12
3.2
5.3
6.6
9.3
11.8
2.8
4.3
5.1
6.9
8.5
2.6
3.8
4.5
6.0
7.2
2.5
3.5
4.1
5.4
6.5
2.4
3.4
3.9
5.1
6.1
2.3
3.2
3.7
4.8
5.8
2.2
3.1
3.5
4.5
5.3
2.2
2.9
3.4
4.3
5.1
2.1
2.8
3.2
4.0
4.7
2.1
2.7
3.1
3.9
4.5
2.0
2.6
3.0
3.7
4.3
2.0
2.5
2.8
3.5
4.1
1.9
2.5
2.8
3.4
4.0
0.9
0.96
0.975
0.99
0.995
15
3.1
5.1
6.2
8.7
10.8
2.7
4.0
4.8
6.4
7.7
2.5
3.6
4.2
5.4
6.5
2.4
3.3
3.8
4.9
5.8
2.3
3.1
3.6
4.6
5.4
2.2
3.0
3.4
4.3
5.1
2.1
2.8
3.2
4.0
4.7
2.1
2.7
3.1
3.8
4.4
2.0
2.5
2.9
3.5
4.1
1.9
2.5
2.8
3.4
3.9
1.9
2.4
2.6
3.2
3.7
1.8
2.3
2.5
3.0
3.5
1.8
2.2
2.5
3.0
3.4
0.9
0.96
0.975
0.99
0.995
20
3.0
4.8
5.9
8.1
9.9
2.6
3.8
4.5
5.8
7.0
2.4
3.3
3.9
4.9
5.8
2.2
3.1
3.5
4.4
5.2
2.2
2.9
3.3
4.1
4.8
2.1
2.8
3.1
3.9
4.5
2.0
2.6
2.9
3.6
4.1
1.9
2.5
2.8
3.4
3.8
1.8
2.3
2.6
3.1
3.5
1.8
2.2
2.5
2.9
3.3
1.7
2.1
2.3
2.8
3.1
1.7
2.0
2.2
2.6
2.9
1.6
2.0
2.2
2.5
2.8
0.9
0.96
0.975
0.99
0.995
30
2.3
3.1
3.6
4.5
5.2
2.2
2.9
3.3
4.1
4.7
2.1
2.9
3.2
4.0
4.6
2.0
2.7
3.0
3.6
4.1
2.0
2.7
3.0
3.7
4.2
1.9
2.5
2.8
3.3
3.8
2.0
2.6
2.9
3.5
3.9
1.9
2.4
2.6
3.1
3.5
1.8
2.3
2.5
3.0
3.3
1.7
2.1
2.3
2.7
3.0
1.7
2.0
2.2
2.5
2.8
1.6
1.9
2.1
2.4
2.6
1.5
1.8
1.9
2.2
2.4
1.5
1.7
1.9
2.1
2.3
60
2.5
3.6
4.2
5.4
6.4
2.4
3.4
3.9
5.0
5.8
1.9
2.4
2.7
3.2
3.6
0.9
0.96
0.975
0.99
0.995
2.9
4.6
5.6
7.6
9.2
2.8
4.4
5.3
7.1
8.5
1.8
2.2
2.4
2.8
3.1
1.7
2.1
2.3
2.6
2.9
1.6
1.9
2.1
2.4
2.6
1.5
1.8
1.9
2.2
2.4
1.5
1.7
1.8
2.0
2.2
1.4
1.6
1.7
1.8
2.0
1.3
1.5
1.6
1.7
1.8
0.9
0.96
0.975
0.99
0.995
120
2.7
4.3
5.2
6.9
8.2
2.3
3.3
3.8
4.8
5.5
2.1
2.9
3.2
3.9
4.5
2.0
2.6
2.9
3.5
3.9
1.9
2.4
2.7
3.2
3.5
1.8
2.3
2.5
3.0
3.3
1.7
2.1
2.3
2.7
2.9
1.7
2.0
2.2
2.5
2.7
1.5
1.8
1.9
2.2
2.4
1.5
1.7
1.8
2.0
2.2
1.4
1.6
1.7
1.9
2.0
1.3
1.5
1.5
1.7
1.7
1.3
1.4
1.4
1.5
1.6
[Critical values of the Kolmogorov-Smirnov test statistic]

                Significance level
Sample size n   .20      .15      .10      .05      .01
 1              .900     .925     .950     .975     .995
 2              .684     .726     .776     .842     .929
 3              .565     .597     .642     .708     .829
 4              .494     .525     .564     .624     .734
 5              .446     .474     .510     .563     .669
 6              .410     .436     .470     .521     .618
 7              .381     .405     .438     .486     .577
 8              .358     .381     .411     .457     .543
 9              .339     .360     .388     .432     .514
10              .322     .342     .368     .409     .486
11              .307     .326     .352     .391     .468
12              .295     .313     .338     .375     .450
13              .284     .302     .325     .361     .433
14              .274     .292     .314     .349     .418
15              .266     .283     .304     .338     .404
16              .258     .274     .295     .328     .391
17              .250     .266     .286     .318     .380
18              .244     .259     .278     .309     .370
19              .237     .252     .272     .301     .361
20              .231     .246     .264     .294     .352
25              .21      .22      .24      .264     .32
30              .19      .20      .22      .242     .29
35              .18      .19      .21      .23      .27
40                                         .21      .25
50                                         .19      .23
60                                         .17      .21
70                                         .16      .19
90                                         .14
100                                        .14
Asymptotic      1.07/√n  1.14/√n  1.22/√n  1.36/√n  1.63/√n
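The asymptotic row of the Kolmogorov-Smirnov table is c/√n, with c = 1.07, 1.14, 1.22, 1.36, and 1.63 at significance levels .20, .15, .10, .05, and .01. A small helper (illustrative, not from the text) evaluates it for any sample size:

```python
import math

# Asymptotic (large-n) Kolmogorov-Smirnov critical values, c(alpha) / sqrt(n).
ASYMPTOTIC_C = {0.20: 1.07, 0.15: 1.14, 0.10: 1.22, 0.05: 1.36, 0.01: 1.63}

def ks_critical(alpha, n):
    return ASYMPTOTIC_C[alpha] / math.sqrt(n)

print(round(ks_critical(0.05, 100), 3))  # 0.136
```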
0.59
0.77
0.90
1.01
1.08
1.15
1.20
1.24
1.28
1.32
1.37
1.42
1.45
1.48
1.46
1.41
1.41
1.42
1.44
1.46
1.48
1.49
1.5 1
1.52
1.55
1.57
1.59
1.60
0.49
0.68
0.83
0.94
1.03
1.10
1.16
1.20
1.25
1.28
1.34
1.39
1.43
1.46
Bibliography
Adamowski, K. 1981. Plotting Formula for Flood Frequency. Water Resources Bulletin 17(2):197-202.
AISI. 1967. Handbook of Steel Drainage and Highway Construction Products. American Iron and Steel
Institute, New York.
Alexander, G. N. 1954. Some Aspects of Time Series in Hydrology. Journal Institute of Engineers (Australia), September, pp. 196.
Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a Moment! Water Resources Research 10(2):211-219.
Anderson, D. V. 1967. Review of Basic Statistical Concepts in Hydrology, in Statistical Methods in
Hydrology. Proceedings of Hydrology Symposium 5, McGill University, Feb. 24 and 25, 1966,
Queen's Printer, Ottawa, Canada.
Anderson, H. W., et al. 1967. Discussions of Multivariate Methods. Volume 2, Proceedings of International
Hydrology Symposium, September 6-8, 1967, Fort Collins, Colorado.
Anderson, R. L. 1942. Distribution of the Serial Correlation Coefficient. Annals of Mathematical Statistics
13:1-13.
Anderson, T. W. 1958. An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, Inc.,
New York.
Ang, A. H-S. and W. H. Tang. 1984. Probability Concepts in Engineering, Planning and Design, vol. II.
John Wiley and Sons, Inc., New York.
Bagley, J. M. 1964. An Application of Stochastic Process Theory to the Rainfall-Runoff Process. Tech.
Report 35, Department of Civil Engineering, Stanford University, Stanford, California.
Bailey, N. T. J. 1964. The Elements of Stochastic Processes. Wiley, New York.
Baker, V. R. 1987. Paleoflood Hydrology and Extraordinary Flood Events. Journal of Hydrology
96(1-4):79-99.
Barger, G. L. and H. C. S. Thom. 1949. Evaluation of Drought Hazard. Agronomy Journal 41(11):519-526.
Barger, G. L., R. H. Shaw, and R. F. Dale. 1959. Gamma Distribution Parameters from 2 and 3-week Precipitation Totals in The North Central Region of the United States. Agricultural and Home Economics
Experiment Station, Iowa State University, Ames, Iowa.
Beard, L. R. 1974. Flood Flow Frequency Techniques. Technical Report 119, Center for Research in Water
Resources, University of Texas, Austin, TX.
Beard, L. R. 1967. Optimization Techniques for Hydrologic Engineering. Water Resources Research
3:809-815.
Beard, L. R. 1962. Statistical Methods in Hydrology. U. S. Army Engineer District, Corps of Engineers,
Sacramento, CA.
Bendat, J. S. and A. G. Piersol. 1971. Random Data: Analysis and Measurement Procedures. Wiley-Interscience, New York.
Bendat, J. S. and A. G. Piersol. 1966. Measurement and Analysis of Random Data. John Wiley and Sons,
New York.
Benjamin, J. R. and C. A. Cornell. 1970. Probability, Statistics, and Decision Theory for Civil Engineers.
McGraw-Hill, New York.
Benson, M. A. 1968. Uniform Flood-Frequency Estimating Methods for Federal Agencies. Water Resources
Research 4(5):891-908.
Benson, M. A. 1965. Spurious Correlation in Hydraulics and Hydrology. Proceedings American Society
Civil Engineers, Journal of the Hydraulics Division, HY4, July 1965.
Benson, M. A. 1964. Factor Affecting the Occurrence of Floods in the Southwest. U. S. Geological Survey
Water Supply Paper 1580-D. Washington, D. C.
Benson, M. A. 1962a. Plotting Positions and Economics of Engineering Planning. Proceedings of American
Society of Civil Engineers 88(HY6) pt. 1, November, 1962.
Benson, M. A. 1962b. Factors Influencing the Occurrence of Floods in a Humid Region of Diverse Terrain.
Water Supply Paper 1580-B, U. S. Geological Survey, Washington, D. C.
Benson, M. A. 1962c. Evolution of Methods for Evaluating the Occurrence of Floods. Water Supply Paper
1580-A, U. S. Geological Survey, Washington, D. C.
Benson, M. A. and N. C. Matalas. 1967. Synthetic Hydrology Based on Regional Statistical Parameters.
Water Resources Research 3(4):931-935.
Beyer, W. H. (Ed.) 1968. Handbook of Tables for Probability and Statistics. Second Edition. Chemical
Rubber Company, Cleveland, OH.
Blench, T. 1959. Empirical Methods, in Spillway Design Floods. Proceedings of Hydrology Symposium 1,
Ottawa. Sponsored by National Research Council of Canada, November 4 and 5, 1959.
Bobee, B. 1975. The Log Pearson Type 3 Distribution and Its Application in Hydrology. Water Resources
Research 11(5):681-689.
Bobee, B. and F. Ashkar. 1991. The Gamma Family and Derived Distributions Applied in Hydrology. Water
Resources Publications, Littleton, CO.
Bobee, B. and R. Robitaille. 1977. The Use of the Pearson Type 3 and Log Pearson Type 3 Distribution
Revisited. Water Resources Research 13(2):427-442.
Bowman, K. O. and L. R. Shenton. 1968. Properties of Estimators for the Gamma Distribution. Report
CTC-1, Union Carbide Corporation, Nuclear Division, Oak Ridge, TN.
Box, G. E. P., and G. M. Jenkins. 1976. Time Series Analysis: Forecasting and Control. Holden-Day, Oakland, CA.
Brakensiek, D. L. 1958. Fitting Generalized Lognormal Distribution to Hydrologic Data. Transactions
American Geophysical Union 39(3):469-473.
Bras, R. L. and I. Rodriguez-Iturbe. 1985. Random Functions and Hydrology. Addison-Wesley, Reading,
MA.
Bridges, T. C. and C. T. Haan. 1972. Reliability of Precipitation Probabilities Estimated from the Gamma
Distribution. Monthly Weather Review 100(8):607-611.
Breiman, L. 1969. Probability and Stochastic Processes. Houghton Mifflin Company, Boston, MA.
Burges, S. J. 1978. Review of "Statistical Methods in Hydrology." EOS, Transactions, American Geophysical Union 59(12):1005-1006.
Burges, S. J. and A. E. Johnson. 1973. Probabilistic Short-Term River Yield Forecasts. Proceedings American Society of Civil Engineers (IR2): 143-155.
Burges, S. J. and Lettenmaier, D. P. 1975. Probability Methods in Stream Quality Management. Water
Resources Bulletin 11(1):115-130.
Burges, S.J. and R.K. Linsley. 1971. Some factors influencing required reservoir storage. Journal of the Hydraulics Division, American Society of Civil Engineers 97(HY7):977-991.
Burn, D. H. 1990. Evaluation of Regional Flood Frequency Analysis With a Region of Influence Approach.
Water Resources Research 26(10):2257-2265.
Burn, D. H. 1989a. An Appraisal of the "Region of Influence" Approach to Flood Frequency Analysis.
Hydrological Sciences Journal 35(2-4):149-165.
Burn, D. H. 1989b. Cluster Analysis as Applied to Regional Flood Frequency, Journal of Water Resources
Planning and Management 115(5):567-582.
Burn, D. H. 1988. Delineation of Groups for Regional Flood Frequency Analysis. Journal of Hydrology
104:345-361.
Burn, D. H. and McBean, E. A. 1985. Optimization Modeling of Water Quality in an Uncertain Environment. Water Resources Research 21(7):934-940.
California State Department of Public Works. 1923. Flow in California Streams. Bulletin 5. Chapter 5.
(Original not seen, cited in Chow 1964).
Carey, D. I. and C. T. Haan. 1976. Supply and Demand in Water Planning: Streamflow Estimation and Conservational Water-Pricing. Water Resources Institute Report 92, University of Kentucky, Lexington, KY.
Carey, D. I. and C. T. Haan. 1975. Using Parametric Models of Runoff to Improve Parameter Estimates for
Stochastic Models. Water Resources Research 11(6):874-878.
Cawlfield, J. D. and Wu, M. C. 1993. Probabilistic Sensitivity Analysis for One-Dimensional Reactive
Transport in Porous Media. Water Resources Research 29(3):661-672.
Chang, C. H., Tung, Y. K., and Yang, J. C. 1995. Evaluation of Probabilistic Point Estimate Methods.
Applied Mathematical Modelling 19(2):95-105.
Chang, J. H., Tung, Y. K., Yang, J. C., and Yeh, K. C. 1992. Uncertainty Analysis of a Computerized Sediment Transport Model, in Stochastic Hydraulics 1992. Proceedings Sixth International Association for
Hydraulic Research Symposium on Stochastic Hydraulics, Taipei, Taiwan. J. T. Kuo and G. F. Lin
(Eds.), Water Resources Publications, Littleton, Colorado, pp. 123-130.
Chayes, F. 1949. On Ratio Correlation in Petrography. Journal of Geology 57(3):239-254.
Cheng, S. T. (1982). Overtopping Risk Evaluation for an Existing Dam. Ph.D. Thesis, Department of Civil
Engineering, University of Illinois at Urbana-Champaign.
Chilès, J.-P. and P. Delfiner. 1999. Geostatistics: Modeling Spatial Uncertainty. John Wiley and Sons, New
York.
Chilès, J.-P. and S. Gentier. 1993. Geostatistical Modelling of a Single Fracture, in Geostatistics Troia '92.
Kluwer Academic, Dordrecht, The Netherlands, pp. 95-108.
Chow, V. T. (Ed.) 1964. Handbook of Applied Hydrology. McGraw-Hill, New York.
Chow, V. T. 1954. The Log-Probability Law and Its Engineering Applications. Proceedings American
Society of Civil Engineers 80(536):1-25.
Chow, V. T. 1951. A Generalized Formula for Hydrologic Frequency Analysis. Transactions American
Geophysical Union 32(2):231-237.
Chowdhury, J. U., J. R. Stedinger, and L. H. Lu. 1991. Goodness-of-Fit Tests for Regional Generalized
Extreme Value Flood Distribution. Water Resources Research 27(7):1765-1776.
Clarke, R. T. 1998. Stochastic Processes for Water Scientists-Development and Applications. John Wiley
and Sons, New York.
Condie, R. 1977. The log Pearson type 3 Distribution: The T-year Event and its Asymptotic Standard Error
by Maximum Likelihood Theory. Water Resources Research 13(6):987-991.
Conover, W. J. 1980. Practical Nonparametric Statistics, Second Edition. John Wiley and Sons, New York.
Conover, W. J. 1971. Practical Nonparametric Statistics. John Wiley and Sons, New York.
Cooley, W. W. and P. R. Lohnes. 1971. Multivariate Data Analysis. John Wiley and Sons, Inc., New York.
Costa, J. E. 1987. A History of Paleohydrology in the United States 1800-1970. In The History of
Hydrology, E. R. Landa and S. Ince (Eds.). American Geophysical Union, Washington, DC.
Couto, E. G., A. Stein, and E. Klamt. 1997. Large Area Spatial Variability of Soil Chemical Properties in
Central Brazil. Agriculture, Ecosystems, and the Environment, 66(2):139-152.
Cressie, N. A. C. 1991. Statistics for Spatial Data. John Wiley and Sons, New York.
Crutcher, H. L. 1975. A Note on the Possible Misuse of the Kolmogorov-Smirnov Test. Journal of Applied
Meteorology 14:1600-1603.
Cryer, J. D. 1986. Time Series Analysis. PWS Publishers, Duxbury Press, Boston, MA.
Cudworth, A. G., Jr. 1987. The Deterministic Approach to Inflow Design Rainflood Development as
Applied by the U. S. Bureau of Reclamation. Journal of Hydrology 96(1-4):293-304.
Cunnane, C. 1978. Unbiased Plotting Positions-A Review. Journal of Hydrology 37(3/4):205-222.
Dalrymple, T. 1960. Flood Frequency Analysis. U. S. Geological Survey Water Supply Paper 1543-A, Manual of Hydrology. Part 3, Flood Flow Techniques. U. S. Government Printing Office, Washington, DC.
Davis, J. C. 1986. Statistics and Data Analysis in Geology, Second edition. John Wiley and Sons, New York.
Davis, J. C. 1973. Statistics and Data Analysis in Geology. John Wiley and Sons, Inc., New York.
DeCoursey, D. G. 1973. Objective Regionalization of Peak Flow Rates. Floods and Droughts. Proceedings
of the Second International Symposium in Hydrology. Fort Collins, Colorado. September 11-13,
1972. Water Resources Publications, Fort Collins, CO.
DeCoursey, D. G. 1971. The Stochastic Approach to Watershed Modeling. Nordic Hydrology pp. 186-216.
DeCoursey, D. G. and R. B. Deal. 1974. General Aspects of Multivariate Analysis with Application to Some
Problems in Hydrology. Proceedings Symposium on Statistical Hydrology. U. S. Department of Agriculture Miscellaneous Publication No. 1275, Washington, D. C. pp. 47-68.
Deutsch, C. V. and A. G. Journel. 1998. GSLIB-Geostatistical Software Library and User's Guide. Oxford
University Press, New York.
Diehl, J. and K. W. Potter. 1987. Mixed Flood Distributions in Wisconsin, in Hydrologic Frequency Modeling. Proceedings of the International Symposium on Flood Frequency and Risk Analyses, 14-17 May,
1986. Louisiana State University, Baton Rouge, edited by V. P. Singh. D. Reidel Publishing Company,
Dordrecht, Holland.
Diwekar, U. M., and Kalagnanam, J. R. 1997. Efficient Sampling Technique for Optimization Under
Uncertainty, AIChE Journal 43(2):440-447.
Draper, N. R. and H. Smith. 1966. Applied Regression Analysis. John Wiley and Sons, New York.
Duckstein, L., M. Fogel, and D. Davis. 1975. Mountainous Winter Precipitation: A Stochastic Event-based
Approach. Proceedings National Symposium on Precipitation Analysis for Hydrologic Modeling
sponsored by the American Geophysical Union. June 26-28, 1975, Davis, CA, pp. 172-188.
Durant, E. F. and S. R. Blackwell. 1959. The Magnitude and Frequency of Floods on The Canadian Prairies,
in Spillway Design Floods, Proceedings of Hydrology Symposium 1, Ottawa, sponsored by National
Research Council of Canada, November 4 and 5, 1959.
Elderton, W. P. 1953. Frequency Curves and Correlation, Fourth Edition. Harren Press, Washington, D. C.
Federal Council for Science and Technology. Ad Hoc Panel on Hydrology. 1962. Scientific Hydrology.
Washington, D. C.
Feller, W. 1957. An Introduction to Probability Theory and its Application, Volume I. John Wiley and Sons,
New York.
Fiering, M. B. 1967. Streamflow Synthesis. Harvard University Press, Cambridge, MA.
Fiering, M. B. 1966. Synthetic Hydrology: An Assessment, in Water Research, edited by A. V. Kneese and
S. C. Smith. Resources for the Future, Inc., Washington, D. C. pp. 331-341.
Fiering, M. B. 1961. Queueing Theory and Simulation in Reservoir Design. Proceedings American Society
of Civil Engineers 87(HY6):39-69.
Fiering, M. B. and B. B. Jackson. 1971. Synthetic Streamflows. Water Resources Monograph I, American
Geophysical Union, Washington, D. C.
Fogel, M. M., L. Duckstein, and J. L. Sanders. 1974. An Event-based Stochastic Model of Areal Rainfall
and Runoff. Proceedings Symposium on Statistical Hydrology. U. S. Department of Agriculture Miscellaneous Publication No. 1275, Washington, D. C. pp. 247-261.
Freund, J. E. 1962. Mathematical Statistics. Prentice Hall, Inc., Englewood Cliffs, NJ.
Hall, W. A. and J. A. Dracup. 1970. Water Resources Systems Engineering. McGraw-Hill, New York.
Hamed, M. M., J. P. Conte, and P. B. Bedient. 1995. Probabilistic Screening Tool for Groundwater Contamination Assessment. Journal of Environmental Engineering, ASCE 121(11):767-775.
Harman, H. H. 1967. Modern Factor Analysis. University of Chicago Press, Chicago, IL.
Harr, M. E. (1989). Probability estimates for multivariate analyses. Applied Mathematical Modelling
13:313-318.
Harris, R. J. 1975. A Primer of Multivariate Statistics. Academic Press, New York.
Hasofer, A. M., and Lind, N. C. (1974). Exact and invariant second-moment code format. Journal of Engineering Mechanics Division, ASCE 100(EM1):111-121.
Hawkins, R. H. 1974. A Note on Mixed Distributions in Hydrology. Proceedings Symposium on Statistical
Hydrology. U. S. Department of Agriculture Miscellaneous Publication No. 1275, Washington, D. C.,
pp. 336-344.
Hazen, A. 1930. Flood Flows, A Study of Frequencies and Magnitudes. John Wiley and Sons, Inc. New York.
Helsel, D. R. and R. M. Hirsch. 1992. Statistical Methods in Water Resources. Elsevier, Amsterdam, The
Netherlands.
Hershfield, D. M. 1961. Rainfall Frequency Atlas of the United States. U. S. Weather Bureau Technical
Paper 40. U. S. Department of Commerce, Washington, D. C.
Hirschboeck, K. K. 1987. Hydroclimatically-defined Mixed Distributions in Partial Duration Flood Series,
in Hydrologic Frequency Modeling, edited by V. P. Singh. 1987. Proceedings of the International Symposium on Flood Frequency and Risk Analyses, 14-17 May, 1986. Louisiana State University, Baton
Rouge. D. Reidel Publishing Company, Dordrecht, Holland.
Hosking, J. R., J. R. Wallis, and E. F. Wood. 1985a. An Appraisal of the Regional Flood Frequency
Procedure in the UK Flood Studies Report. Hydrological Science Journal 30(1):85-109.
Hosking, J. R., J. R. Wallis, and E. F. Wood. 1985b. Estimation of the Generalized Extreme Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3):251-261.
Huang, K. Z. 1986. Reliability Analysis on Hydraulic Design of Open Channel, in Stochastic and Risk
Analyses in Hydraulic Engineering, edited by B. C. Yen. Water Resources Publications, Littleton, CO,
59-65.
Hudlow, M. D. 1967. Streamflow Forecasting Based on Statistical Applications and Measurements Made
with Rain Gage and Weather Radar. Tech. Report 7, Water Resources Institute, Texas A&M University, College Station, Texas.
Huff, F. A. and S. A. Changnon. 1973. Precipitation Modification by Major Urban Areas. Bulletin American
Meteorological Society 54(12):1220-1232.
Interagency Advisory Committee on Water Data. 1982. Guidelines for Determining Flood Flow Frequency.
Bulletin 17B. U. S. Department of the Interior, Geological Survey, Office of Water Data Coordination,
Reston, VA.
Ioannidis, M. A., I. Chatzis, and M. J. Kwiecien. 1999. Computer Enhanced Core Analysis for Petrophysical
Properties. Journal of Canadian Petroleum Technology 38(3):18-24.
Isaaks, E. H. and R. M. Srivastava. 1989. An Introduction to Applied Geostatistics. Oxford University
Press, New York.
Jarrett, R. D. 1991. Paleohydrology and its Value in Analyzing Floods and Droughts. US Geological Survey
Water Supply Paper 2375:105-116.
Jennings, M. E. and M. A. Benson. 1969. Frequency Curves for Annual Series with Some Zero Events or
Incomplete Data. Water Resources Research 5(1):276-280.
Jennings, M. E., W. O. Thomas, Jr., and H. C. Riggs. 1993. National Summary of US Geological Survey
Regional Regression Equations for Estimating Magnitude and Frequency of Floods for Ungaged
Sites, 1993. US Geological Survey, WRI 94-4002, Reston, VA.
Johnston, J. 1963. Econometric Methods. McGraw-Hill, New York.
Judge, G. G., R. C. Hill, W. E. Griffiths, H. Lutkepohl, and T-C. Lee. 1988. Introduction to the Theory and
Practice of Econometrics, Second Edition. John Wiley and Sons, New York.
NCSS 2000. 1998. NCSS2000 Statistical System for Windows. Dataxiom Software Inc., 3700 Wilshire
Blvd., Suite 1000, Los Angeles, CA 90010. Available at http://www.dataxiom.com.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Regression Models,
Third Edition. Irwin, Chicago, IL.
Neuts, M. F. 1973. Probability. Allyn and Bacon, Inc., Boston, MA.
Nozdryn-Plotnicki, M. J. and W. E. Watt. 1979. Assessment of Fitting Techniques for the Log Pearson Type
3 Distribution Using Monte Carlo Simulation. Water Resources Research 15(3):714-718.
Oliver, Margaret A. and R. Webster, 1986. Semi-Variograms for Modelling the Spatial Pattern of Landform
and Soil Properties. Earth Surface Processes and Landforms 11:491-504.
Ostle, B. 1963. Statistics in Research, Second Edition. Iowa State University Press, Ames, Iowa.
Pankratz, A. 1983. Forecasting with Univariate Box-Jenkins Models. John Wiley and Sons, New York.
Panofsky, H. A. and G. W. Brier. 1958. Some Applications of Statistics to Meteorology. The Pennsylvania
State University, University Park, PA.
Parl, B. 1967. Basic Statistics. Doubleday and Company, Inc., Garden City, NY.
Pattison, A. 1964. Synthesis of Rainfall Data. Technical Report 40, Department of Civil Engineering, Stanford
University, Stanford, CA.
Pearson, K. 1957. Tables of the Incomplete Γ-Function. Biometrika Office, University College, London.
Pearson, K. 1896-1897. On the Form of Spurious Correlation which may Arise when Indices are Used in
the Measurement of Organs. Royal Society of London, Proceedings 60:489-502.
Potter, W. D. 1949. Simplification of the Gumbel Method for Computing Probability Curves. SCS-TP-78.
U. S. Department of Agriculture, Soil Conservation Service, Washington, D. C., May 1949.
Press, S. J. 1972. Applied Multivariate Analysis. Holt-Rinehart and Winston, Inc., New York.
Rackwitz, R. 1976. Practical Probabilistic Approach to Design. Comite European du Beton, Paris, Bulletin
No. 112.
Ralston, A. 1965. A First Course in Numerical Analysis. McGraw-Hill, New York.
Rao, A. R. and Khaled H. Hamed. 2000. Flood Frequency Analysis. CRC Press, Boca Raton, FL.
Reed, L. J. 1921. On the Correlation Between any Two Functions and Its Application to the General Case
of Spurious Correlation. Washington Academy of Science Journal 11:449-455.
Reich, B. M. 1962. Design Hydrographs for Very Small Watersheds from Rainfall. Civil Engineering
Section, Colorado State University. Ft. Collins.
Rice, R. M. 1967. Multivariate Methods Useful in Hydrology. Volume 1. Proceedings of International
Hydrology Symposium. September 6-8, 1967. Fort Collins, CO.
Rogers, C. C. M., Beven, K. J., Morris, E. M., and Anderson, M. G. 1985. Sensitivity Analysis, Calibration and
Predictive Uncertainty of the Institute of Hydrology Distributed Model. Journal of Hydrology 81:179-191.
Rosenblueth, E. 1981. Point Estimates in Probabilities. Applied Mathematical Modelling 5(5):329-335.
Rosenblueth, E. 1975. Point Estimates for Probability Moments. Proceedings of the National Academy of
Sciences, U.S.A., 72(10):3812-3814.
Salas, J. D. 1993. Analysis and Modeling of Hydrologic Time Series, Chapter 19, in Handbook of Hydrology, edited by D. R. Maidment. McGraw-Hill, New York.
Salas, J. D., J. W. Delleur, V. Yevjevich, and W. L. Lane. 1980. Applied Modeling of Hydrologic Time Series. Water Resources Publications, Littleton, CO.
Salas-La Cruz, J. D. and V. Yevjevich. 1972. Stochastic Structure of Water Use Time Series, Hydrology
Paper 52, Colorado State University, Fort Collins, CO.
Sangal, B. P. and A. K. Biswas. 1970. The 3-Parameter Lognormal Distribution and Its Applications in
Hydrology. Water Resources Research 6(2):505-515.
Sauer, V. B. 1974. Flood Characteristics of Oklahoma Streams. U.S. Geological Survey, Water Resources
Investigation 52-73.
Schafmeister, M.-T. and A. Pekdeger, 1993. Spatial Structure of Hydraulic Conductivity in Various Porous
Media-Problems and Experiences, in Geostatistics Troia '92. Kluwer Academic, Dordrecht, The
Netherlands, pp. 733-744.
Tasker, G. D. 1987. Regional Analyses of Flood Frequencies. In Regional Flood Frequency Analysis, K. P.
Singh (Ed.). D. Reidel Publishing Company, Dordrecht, Holland.
Tasker, G. D. 1982. Comparing Methods of Hydrologic Regionalization. Water Resources Bulletin
18(6):965-970.
Tasker, G. D. 1980. Hydrologic Regression With Weighted Least Squares. Water Resources Research
16(6):1107-1113.
Tasker, G. D. 1978. Flood Frequency Analysis With a Generalized Skew Coefficient. Water Resources Research
14(2):373-376.
Tasker, G. D. and J. R. Stedinger. 1989. An Operational GLS Model for Hydrologic Regression. Journal of
Hydrology 111:361-375.
Tasker, G. D. and J. R. Stedinger. 1987. Regional Regression of Flood Characteristics Employing Historical
Information. Journal of Hydrology 96:255-264.
Tavchandjian, O., A. Rouleau, and D. Marcotte, 1993. Indicator Approach to Characterize Fracture Spatial
Distribution in Shear Zones, in Geostatistics Troia '92. Kluwer Academic, Dordrecht, The Netherlands,
pp. 965-976.
Taylor, S. E., D. A. Bender, M. H. Triche, and F. E. Woeste. 1993. Monte Carlo Simulation Methods for
Wood Systems. Presented at 1993 ASAE International Winter Meeting. Paper 935501. ASAE, 2950
Niles Road, St. Joseph, MI.
Tercier, P., R. Knight, and H. Jol. 2000. A Comparison of the Correlation Structure of GPR Images of
Deltaic and Barrier-Spit Depositional Environments. Geophysics 65(4):1142-1153.
Thom, H. C. S. 1958. A Note on the Gamma Distribution. Monthly Weather Review 86(4):117-122.
Thomas, D. M. and M. A. Benson. 1970. Generalization of Streamflow Characteristics. U.S. Geological Survey
Water Supply Paper 1975.
Thomas, H. A. and M. B. Fiering. 1963. The Nature of the Storage Yield Function, in Operations Research
in Water Quality Management. Harvard University Water Program, Cambridge, MA.
Thomas, H. A. and M. B. Fiering. 1962. Mathematical Synthesis of Streamflow Sequences for the Analysis
of River Basins by Simulation, Chapter 12, in Design of Water Resources Systems. Harvard University
Press, Cambridge, MA.
Thomas, J. B. 1971. An Introduction to Applied Probability and Random Processes. John Wiley and Sons,
Inc., New York.
Tung, Y. K. 1990. Mellin Transformation Applied to Uncertainty Analysis in Hydrology/Hydraulics. Journal
of Hydraulic Engineering, ASCE 116(5):659-674.
Tvedt, L. 1990. Distribution of Quadratic Forms in Normal Space-Application to Structural Reliability.
Journal of Engineering Mechanics, ASCE 119(6):1183-1197.
Tyagi, A. 2000. A Simple Approach to Reliability, Risk, and Uncertainty Analysis of Hydrologic, Hydraulic,
and Environmental Engineering Systems. Ph.D. Thesis, Oklahoma State University, Stillwater, OK.
UK Natural Environmental Research Council (NERC). 1975. Flood Studies Report, Vol. I-V. Natural
Environmental Research Council (UK), London.
United States Water Resources Council. 1981. Estimating Peak Flow Frequencies for Natural Ungaged
Watersheds: A Proposed Nationwide Test. Hydrological Committee, Washington, DC.
United States Water Resources Council. 1982. Guidelines for Determining Flood Flow Frequency.
Bulletin 17B. U.S. Department of Interior, Geological Survey, Office of Water Data Coordination,
Reston, VA.
Van Montfort, M. A. J. 1970. On Testing That the Distribution of Extremes is of Type I when Type II is the
Alternative. Journal of Hydrology 11(4):421-427.
Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a Moment. Water Resources Research 10(2):211-219.
Water Resources Council. 1967. A Uniform Technique for Determining Flood Flow Frequencies. Bulletin
15. Water Resources Council, Washington, D. C.
Weibull, W. 1939. A Statistical Study of the Strength of Materials. Ing. Vetenskaps Akad. Handl. (Stockholm) 151:15. (Original not seen, cited in Chow 1964).
Weldu, F. 1995. Regional Flood Frequency Analysis for Zambian River Basins. PhD Dissertation, Library,
Oklahoma State University, Stillwater, Oklahoma, USA.
Whittaker, J. 1973. A Note on the Generation of Gamma Random Variables with Non-Integral Shape
Parameter. Proceedings of the Second International Symposium on Hydrology. Water Resources
Publications, Fort Collins, CO, pp. 591-594.
Wilkinson, L. 1990. SYSTAT: The System for Statistics. SYSTAT, Inc., Evanston, IL.
Wong, S. T. 1963. A Multivariate Statistical Model for Predicting Mean Annual Flood in New England.
Annals of Association of American Geographers 53(3):298-311.
Woolhiser, D. A., E. Rovey, and P. Todorovic. 1973. Temporal and Spatial Variation of Parameters for the
Distribution of N-Day Precipitation. Proceedings of the Second International Symposium on Hydrology. Water Resources Publications, Fort Collins, CO. pp. 605-614.
Yeh, K. C., and Tung, Y. K. 1993. Uncertainty and Sensitivity Analyses of Pit-Migration Model. Journal of
Hydraulic Engineering, ASCE 119(2):262-283.
Yen, B. C., Cheng, S. T., and Melching, C. S. 1986. First Order Reliability Analysis, in Stochastic and Risk
Analyses in Hydraulic Engineering, edited by B. C. Yen. Water Resources Publications, Littleton, CO,
pp. 79-89.
Yevjevich, V. 1968. Misconceptions in Hydrology and their Consequences. Water Resources Research
4(2):225-232.
Yevjevich, V. M. 1963. Fluctuations of Wet and Dry Years, Part I. Research Data Assembly and Mathematical Models. Hydrology Paper 1, Colorado State University, Fort Collins, CO.
Yevjevich, V. M. 1972a. Probability and Statistics in Hydrology. Water Resources Publications, Fort
Collins, CO.
Yevjevich, V. M. 1972b. Stochastic Processes in Hydrology. Water Resources Publications, Fort Collins,
CO.
Yevjevich, V. M. 1972c. Structural Analysis of Hydrologic Time Series. Hydrology Paper 56, Colorado
State University, Fort Collins, CO.
Zhao, Y. G. and Ono, T. 1999. New Approximation for SORM: Part 2. Journal of Engineering Mechanics,
ASCE 125(1):86-93.
Index
Absolute sensitivity, 391-392
Acceptable error, 407
Acceptance region, 203
Advanced first order approximation (AFOA)
method, 412
All-possible-regressions, multiple linear
regression, 256
Analysis of variance (ANOVA), 246-248
Anisotropy, 443-445
Annual series, 152
ANOVA, 246-248
ARIMA, 355-361,363-364
ARMA, 355,362-363
Assumptions, 8-9
Autocorrelation, 287-289
of errors, 239
function, 348
multiple linear regression, 257-260
time series, 348-350
Autocovariance, 348
Autoregressive integrated moving average
(ARIMA) models, 355-361, 363-364
Autoregressive models, higher-order,
379-380
Confidence intervals
on correlation coefficient, 282
frequency analysis and, 167-168
geostatistic estimations, 449
mean of normal distribution, 197-199
multiple regression, 249-254
one-sided, 200-201
on parameters of probability distribution, 201
pivotal quantities method, 196
on regression coefficients, 233-235,249-251
on regression line, 235-236,251
and sample size, 198
on serial correlation coefficient, 287
simple linear regression, 231-239
on standard error, 237,249
variance of normal distribution, 199-200
Confidence level, 196
Confidence limits. See Confidence intervals
Consistency, 70
Continuous probability distributions, 33-35
beta, 140-141
exponential, 117-119
extreme value, 129-140
gamma, 120-126
generalized extreme value, 139-140,187-189
lognormal, 126-129
normal approximation, 111
normal distributions, 100-113
Pearson, 141
triangular, 116-117
uniform, 114-116
Contour plots, 445
Convergence rate, sample mean, 404
Correlated random variables, 281-282
Correlation, 281
canonical, 312-313
and cause and effect, 291
cokriging, 445-446
cross-, 290
between linear functions, 312
and multivariate analysis, 298
and regional frequency analysis, 290
sampling points, 425,428
serial, 29,257-260,287-289,375
spurious, 291-293
time series, 348-350
between variables and components,
301-306,310
Correlation coefficient
multiple regression, 246
population, 63,282-287
of random variables, 62-64
sample, 63,64,282
simple regression, 230
Correlation matrix
multiple linear regression, 254
multivariate analysis, 304, 306
Correlogram, 348-350,353-355
transect elevation data, 427,428
Counting, 26-29
combinations, 27
permutations, 27
Covariance
ensemble, 339
in kriging, 435-438
matrix, 249,298,302
multiple regression, 249
point estimates, 428,429
point-to-block, 447
principal components, 301-302
random variables, 62,63
sample, 282
time series, 339-340
Critical region, 202,204,205
Cross-correlation, 290
Cross-products, 272
Cross-products matrix, 245,249,254,272
Culvert formulas, 181
Cumulative distribution function
bivariate, 40-41
continuous, 34-38
discrete, 33
Cumulative frequency histogram, 30-31
Cumulative probability distribution, 32-39,
213-221
Cumulative probability function. See Cumulative
distribution function
Curvature-fitting second-order reliability
methods, 423
Cycles, 337-338
Data
aggregating, 10
Beaver Creek (Oklahoma), 346-347
Black Bear Creek (Oklahoma), 170-176
Cave Creek (Kentucky), 225-226
clustering, 291
Devils Lake (North Dakota), 10-11
Great Salt Lake (Utah), 10, 11
hydrologic, 9-15
Kentucky River, 10, 11, 17, 19
order of occurrence, 13
paleohydrologic, 177
representative, 12
selecting, 12
series, 152
spatially referenced, 425,426
Stillwater (Oklahoma), 341-344
Data generation
applications of, 331-334
gamma, 323,325
from Markov chain, 385-386
multivariate, 327-330
normal, 323
numerical procedure, 325
simulation of random observations, 324
univariate, 321-327
Weibull, 323
Declustering, 446-447, 448
Degrees of freedom
chi-square distribution, 142
F distribution, 144
goodness of fit tests, 211
multiple regression, 246, 248,250
simple regression, 233,237
t distribution, 143
Demand, 397
DeMoivre-Laplace limit theorem, 110
Density functions
bivariate, 40-48
continuous, 32-39
discrete, 32,33, 81-96
Dependence
functional, 63
linear, 63
stochastic, 63
Dependent variables, 224,228,242,257,260
in multivariate multiple regression,
311-312
Depth-duration frequency curves, 190
Derived distributions, 44-48
one variable, 45
two variable, 46
Descriptive statistics, 426-430
Design life, 87-89, 371
Design return period, 87-88
Design storm, 87, 178
Determination, coefficient of, 230,245,280
Deterministic model, 7,9,370
Devils Lake (North Dakota), data, 10-11
Diagonal matrix, 300
Differencing, 355,363
Digit, random, 322
Direction, between points, 443-445
Discrete distributions, 81-96
Bernoulli, 84-91, 98
binomial, 84-89,90,95-96
geometric, 89-90
hypergeometric, 81-83, 86
multinomial, 95-96
negative binomial, 90-91
Poisson, 91-93
uniform, 321
Discrete random variable, 32-33
Dispersion, measures of, 57-58
Distance, between points, 426,445
Distribution(s)
Bernoulli, 84-91, 98
beta, 140-141
binomial, 84-89,90,95-96,109-110
bivariate, 40,41,44
bounded exponential, 136
chi-square, 142
conditional, 41-43
continuous. See Continuous probability
distributions
of correlation coefficient, 282,287
cumulative, 32-39, 213-221
estimation of, 447-448
derived, 44-48
discrete, 81-96
double exponential, 132
exponential, 93, 117-119
extreme value, 129-140
generalized, 139-140,187-189
type I, 164,165
F, 144-145
Fisher-Tippett, 132
frequency, 30-3 1
gamma, 93-94,120-126
geometric, 89-90
Gumbel, 164
hypergeometric, 81-83, 86
joint, 40-44
lognormal, 126-129
frequency analysis and, 160, 165
log Pearson type III, frequency analysis and,
160-164,165
marginal, 41
of maximums, 129-130
of minimums, 129
mixed, 48
multinomial, 95-96
multivariate, 40-43, 46, 47, 61, 62, 95-96
negative binomial, 90-91, 110
normal, 100-113,422
frequency analysis and, 159-160
of order statistics. See Extreme value
distributions
Pearson, 141
Poisson, 91-93
of prediction errors, 333-334
recommendations, 181
of sample statistics, 142-145
selecting, 150
t, 143-144
3-parameter lognormal, 422
3-parameter Weibull, 136
translations, 145
triangular, 116-117
uncertainty analysis, 398
uniform, 114-116, 321, 422
univariate, 32-39,44,53-55
Weibull, 131, 134-138
Double exponential distribution, 132-134
Drift, 427,445
Durbin-Watson test, 259-260
Type I, 196
Type II, 196
Error variance, minimizing, 434-438
Estimates
errors in, 433-435
unbiased, 433,434
Estimation
global, 446-447, 448
interval, 194-201
least squares, 227, 244, 364-366
local, 446, 447, 448
maximum likelihood, 71, 74-76, 134
moments, 72-74
point, 70-74
single site, 68
using geostatistics, 433-438
weights, 434-438
Estimators
biased, 68, 70
consistent, 70
efficient, 71
interval, 194-201
least squares, 227, 244, 364-366
maximum likelihood, 71, 74-76, 134
moment, 72-74
point, 70-74
properties of, 70-72
sufficient, 71-72
unbiased, 69, 70, 71
Euclidean distance, 314, 426
Events, 17,26
certain, 20
complements of, 22
intersection of, 22
mutually exclusive, 22,26
percent chance of, 88-89
probability of, 22-24
union of, 21
Exceedance, 12
definition of, 84
and design return period, 88
probability, 149
Excess, coefficient of, 60
Exclusivity, of events, 22,26
Expansion point, 398,412,414
Expectation
of a function, 55
joint distribution, 60
and the mean, 55-56
product, 66
properties, 65-66
of random variables, 54,55,60-66
of sample moments, 67
and variance, 57-58
Expected value. See Expectation
Exponential distribution, 93, 117-119
probability paper, 151
Exponential model, 430, 431, 438
Extrapolation, in regression, 237,256
Extreme value distributions, 129-140
generalized, 139-140,187-189
type I, 131,132-134,164, 165
parameter estimation, 134
type II, 131
type III, 131, 134-138
parameter estimation, 135
Extreme value series, 152
Factorial, 27-28
Factor loadings, 305,309, 310
Failure
point, 413,415
probability of, 397,404,413,423
surface, 412-415
F distribution, 144-145
Federal Council for Science and
Technology, 6
Finite population, 32, 81
First-order approximation method, 398-399
corrected, 406-411
simplified, 399-400
First-order reliability method (FORM),
412-418
First-order stationary, 339
Fisher-Tippett type I distribution, 132-134
Floods
frequency analysis, 9, 158-189
rare, 178-179
Flow volume frequency studies, 191-192
Frequency
definition of, 149
relative. See Relative frequency
Frequency analysis, 9, 149-151
analytical, 158-179
and confidence intervals, 167-168
of precipitation data, 189-191
probability plotting, 151-158
Gumbel distribution
extreme value, 132-134,139
frequency analysis, 164,165
I
Independence
Bernoulli process, 85,90
in probability, 23, 43-44
of random variables, 43-44
in regression analysis, 254,255,309
stochastic, 44
of time series data, 348-349
Independent variable, 260
multiple regression, 245
Index-flood method, regional frequency
analysis, 186
Indicator variables, in regression, 268-271
Initial probability vector, 382
Input random variable, 399
Integrated autoregressive moving average
(ARIMA), 363
Intensity-duration-frequency curves, 190
Interactions, 272
Intersection, of events, 22
Interval estimation. See Confidence intervals
Intrinsic random function, 427
Inverse distance weighting, 434
Inverse transform, of probability distribution,
322-323
MA, 356-358
Manning's roughness coefficient, 422
Mann-Kendall nonparametric test, 343-345
Mann-Whitney tests, 346
Marginal distribution, 41
Markov chain, 380
data generation from, 385-386
difficulties in using, 386-387
homogeneous, 381
m-state, 381
nth order, 381
parameter estimation, 381, 387
steady state probabilities, 382-383
transition probabilities, 381, 387
Markov model
first-order, 375-378
with periodicity, 378-379
gamma model, 377,379
of higher-order, 379-380
logarithmic model, 376, 379
Matching points method, parameter
estimation, 71, 72
Matrix
diagonal, 300
orthogonal, 300
stochastic, 381
transitional probability, 381
n step, 382
Maximum likelihood estimation, 71, 74-76,
366-367, 431
Mean
arithmetic, 55-56
confidence interval and, 197-199
distribution of, 102
first-order approximation estimate, 406-410
geometric, 56
global, 448
output random variable, 400
point estimates, 426
population, 56,77,404
of random variables, 55, 61
sample, 56,404
testing hypotheses, 206-209
of time series, 339
variance, 65
weighted, 57
Mean annual flood, 134
Means, equality of, 346
Mean square, 246
Median, 56
Mellin transform, 424
Mesokurtic distribution, 60
Method of moments, 72-74,76
Midpoints, estimating elevation, 438-443
Mixed distribution, 48
Mixture, 48
Mode, 56-57
Model(s)
ARIMA, 355-361
ARMA, 355,361-363
autoregressive, 379-380
Bernoulli, 84-91
Box-Jenkins, 336
classifying, 7
definition of, 6
deterministic, 7, 9, 370
hydrologic, 5-6, 390, 392
linear. See Linear model
multivariate regression, 31 1
nonlinear, transforming, 266-268
output, 396,418,421
parametric, 7,370
Poisson, 91-95
as regionalization tool, 189
reliability assessment of, 397
semivariogram, 430-433
statistical, developing, 5
stochastic, 7, 370-388
water quality, 390, 396
Modeling, using geostatistics, 449
Moments
about the mean, 53-60
central, 55, 421
of composite function, 418
exact value of, 407
higher order, 418,421,423-424
L-, 68,69,72, 187-189
of model output, 418,421
and parameter estimation, 72-74,76
of performance function, 421
of probability distributions, 53-54
probability-weighted, 68-69
properties of, 65-66
of samples, 66-68
variance, 57
Monte Carlo simulation, 331,332,393
reliability/risk analysis, 404-405, 411
Moving Average (MA) Process, 356-358
Multicolinearity, 260-262
Multilag model, 379
Multinomial distribution, 95-96
Multiple coefficient of determination,
245-246,255
Multiple correlation coefficient, 246
Multiple linear regression, 272
all-possible-regressions, 256
an application of, 262-266
assumptions, 257
autocorrelated errors, 257-260
coefficient of determination, 245-246,255
coefficients, estimation of, 243-245, 248
confidence intervals, 249-254
correlation coefficient, 246
correlation matrix, 254
extrapolation, 256
general linear model, 242-248
hypothesis testing, 249-254
indicator variables, 268-271
logistic regression, 272-278
model selection, 254-256
normal equation, 244
regression coefficients, 249-251, 255
sensitivity analysis, 393
stepwise regression, 256
transformation, 266-268
Multivariate analysis, 297
canonical correlation, 312-313
cluster analysis, 313-318
multivariate multiple regression, 311-312
principal components, 298-310
Multivariate distributions
bivariate, 40-43,46,47, 61, 62
multinomial, 95-96
Multivariate multiple regression, multivariate
analysis, 311-312
Mutually exclusive, 22, 26
for approximating continuous
distributions, 111
for approximating negative binomial, 110
for approximating Poisson, 111
central limit theorem, 106-107
and frequency analysis, 159-160
general, 100-101
plotting, 107-109
probability plotting, 155,159-160
reproductive properties, 101-102
standard, 102-106
Normal equations, 226, 244
n step transitional probability matrix, 382-383
Nugget, 443
Nugget effect, 430
Nugget model, 431
Observations
definition of, 52
effective number of, 289
and probability, 19
random, 281
One-parameter-at-a-time sensitivity
analysis, 391-392
One step transition probability, 380-381
Operating characteristic curve, 204
Ordered sampling, 26,27
Order of occurrence, of data, 13
Ordinary kriging
for midpoint elevation estimation, 438-443
variance, 437
Orthogonality, 298-301, 309
Outcome, adverse, 87
Outliers, 158
Output
model, 396
random variable, 398,400
sensitivity analysis, 393-396
Paleohydrology, 177
Paper, probability, 151
Parameter, definition of, 53
Parameter estimation, 70-72, 194
least squares, 227,244, 364-366
Markov chain, 381, 387
Parameter estimation (continued)
maximum likelihood, 71, 74-76, 366-367
methods of, 71
moments, 72-74
regional, 374
semivariogram models, 431
stochastic, 374
Parametric model, 7,370
Partial duration series, 152
Peakedness, measures of, 59-60
Peaks over threshold series, 152
Pearson, Karl, 141
Pearson correlation formula, 293
Pearson distribution, 141
Pearson skew coefficient, 58-59
Pearson type III distribution, 160-164,165
Percent chance, 20
of an event, 88-89
Percentile, 149
Performance function, 397,398,403
linear approximation to, 412
moments of, 421
Period, 348
Periodic component, 337,349
Periodicity, time series, 337,350-355
Periodogram, 353-355
Permutations, 27
Persistence, 13
Pivotal quantities, 196
Platykurtic distribution, 60
Plotting position, 152-156
Point estimation
errors in, 433-435
via kriging, 438-443
methods, 424
Point-fitting second-order reliability
methods, 423
Point kriging, 447,448
Points
direction between, 443-445
sampling, 425
separation distance, 426,445
Point-to-block covariances, 447
Poisson distribution, 91-93
to approximate binomial, 91-92
normal approximation, 111
Poisson process, 91-95,375
assumptions, 92-93
Polygon declustering, 446-447
Population
correlation coefficient, 282-287
definition of, 13, 52, 53
finite, 32, 81
mean, 56,77
variance, 57-58
Positive definiteness, 430-431, 432
Posterior probabilities, 25
Power, of a test, 205
Precipitation data
cluster analysis of, 314-318
frequency analysis of, 189-191
Prediction error, 333-334
Principal components, 298-306
regression on, 307-310
Probabilistic model. See Stochastic models
Probabilities
posterior, 25
steady state, 382-383
Probability, 17
conditional, 22-24
cumulative distribution function,
32-37, 40-41
definition of, 12, 18-19,21
density function, 35-37
distribution. See Distribution(s)
estimation, 18, 19
of events, 17, 18
of failure, 413,423
of independent events, 23-24
joint, 23-24
and observations, 19
one step transition, 380-381
paper, 151
plotting, 151-158
of rainfall, 189-191
and relative frequency, 18-20,37-39
in reliability analysis, 397
steady state, 382-383
total, 24-25
transition, 381-383
vector, initial, 382
Venn diagram, 21-22,23
weighted, 25
Probability distribution function, 32-39
Probability-weighted moments, 68-69
Probable maximum flood, 177
Purely random process, 287,374-375
Quantile, 149
Rainfall
cluster analysis of data, 314-318
frequency analysis of, 189-192
modeling, 386-388
Random component, 426
Random data generation
applications of, 331-334
gamma, 323,325
from Markov chain, 385-386
multivariate, 327-330
normal, 323
numerical procedure, 325
simulation of random observations, 324
univariate, 321-327
Weibull, 323
Random digit, 322
Random element, 321
Random numbers
generating, 322
uniform, 322
Random observation, 281
Random processes, ergodic, 340
Random samples, 12-13,322
and parameter estimation, 70
Random standard normal deviate, 323
Random variables, 31-32
asymmetric, 424
bivariate, 40-44, 60-66
continuous, 32-39, 40
transformation of, 44-48
correlated/uncorrelated, 281-282
discrete, 32-33
expectation and, 54,55, 60-66
independence of, 43-44
input, 399
and joint probabilities, 40
maximum of, 130
nonnormal, 328-330
normal, 327-328
output, 398,400
properties of, 52-77
in a time series, 338
Range, 57,422,429,443
Rank of matrix, 245
Realization, 336, 340
definition of, 52
Record
historical, 156-157
systematic, 156-157
Recurrence interval, 12
Regional analysis, and correlation, 290
Regional frequency analysis, 180-189
index-flood method, 186
regression-based procedures, 183-186
Regions, homogeneous, 180-181
Regression
definition of, 224
logistic, 272-278
multiple linear. See Multiple linear regression
on principal components, 307-310
simple linear. See Simple linear regression
variable selection, 254-255
Regression-based procedures, regional frequency
analysis, 183-186
Regression coefficients, 227,233-235
and multicolinearity, 261-262
multiple linear regression, 249-251, 255
in regression on principal components, 309
sensitivity analysis, 393
Regression line, 227,235-236
multiple linear regression, 251
Rejection region, 202,204,205
Relative error, 407-410
Relative frequency, 33
expected, 38-39
and probability, 18-20, 37-39
Relative sensitivity, 391-392
Reliability index, 400,403,413,414
Reliability/risk analysis, 396-398
first-order approximation
method, 398-399
corrected, 406-411
simplified, 399-400
first-order reliability method (FORM),
412-418
generic expectation function, 418-423
Monte Carlo simulation, 404-405, 411
point estimation methods, 424
second-order approximation method, 411-412
second-order reliability methods (SORM), 423
transform methods, 424
Representativeness, 372
Reservoir storage, 13,32
Residual sum of squares, 229
Resistance, 397
Return period, 12, 159
design, 87-88
Risk
allowable, 88
definition of, 87, 397
Risk (continued)
design, 371
system, 422
Risk analysis. See Reliability/risk analysis
Risk estimates
from first order approximation method,
398-399,423
from generic expectation functions,
422-423
Sample(s)
correlation coefficient, 282
covariance, 281
definition of, 12-13,52,53
locations, 446
mean, 56,404
moments, 66-68
observations and probability, 19-20
points, 20
arrangement of, 429
random, 12-13,70,322
size, 81
estimating, 404
space, 20,446
standard deviations, 281
statistics, 53
distributions of, 142-145
types of, 27
variance, 57-58
Sampling
error, 13-15
locations, 425
ordered, 26,27
points, 425
with replacement, 26,27
unordered, 26, 27
without replacement, 26,27
Sampling points, correlation between, 425,428
Scatter, 398
Second-order approximation method, 411-412
Second-order reliability methods (SORM), 423
Second-order stationary, 339-340
Semivariogram
anisotropic, 443-446
cloud, 429
transect elevation data, 428-430
Semivariogram models, 430-433
combination, 432
Sensitivity analysis
definition of, 391
global, 392-396
local, 391-392
one-parameter-at-a-time, 391-392
output, 393-396
traditional, 391-392
uncertainty analysis, 396
variance-based method, 392-396
Sensitivity coefficients, 394
absolute, 391-392
global, 396
local, 396
relative, 391-392
Sensitivity indices, 393,395-396
Serial correlation, 29,257-260, 287-289, 375
Shapiro-Wilk test, 219
Sigma bounds, 103
Significance level, hypothesis testing, 202-206
Significance tests. See Hypothesis testing
Significant difference
physical, 206
statistical, 206
Sill, 429, 443
Simple linear regression, 224-228
assumptions, 231, 238-239
coefficients, 227, 233-235
confidence intervals, 231-239
correlation coefficient, 230
effect of measurement errors, 238
evaluation of, 228-231
standard error, 232, 237
tests of hypotheses, 231-235, 238-239
Simulation
in hydrology, 33 1-334
Latin hypercube sampling, 404-405
Monte Carlo, 331, 332, 393
in risk analysis, 404-405, 411
multivariate, 327-330
time series, 374-385
univariate, 321-327, 374-385
Skew coefficients, 59
frequency analysis, 161, 162, 166-167
Skewness, 58-59,422,448
Smoothing, semivariograms, 430
Software, statistical, 5
Space scale, 338
Span, 429
Spatial continuity, 446
Spatially referenced data, 425
Spectral density function, 352-355
Spherical model, 430, 431, 438
Spurious correlation, 291-293
Standard deviation, 57, 281
Standard error
multiple linear regression, 255
simple regression, 232,237
Standardized variable, 102,304
Standard normal distribution, 102-104
approximations for, 104-106
Standard random normal deviate, 323
State, 380
Stationarity, 92, 372
in a time series, 338-340
Statistical methods, applying, 4
Statistical tests. See Hypothesis testing
Statistics
definition of, 6
descriptive, 426-430
nonparametric, 194-195
parametric, 194-195
Steady state probabilities, 382-383
Stepwise multiple regression, 256
Stillwater (Oklahoma), rainfall data,
341-344
Stochastic component, 337,374
Stochastic convergence, 19
Stochastic matrix, 381
Stochastic models, 7,370-374
Markov models, 375-388
purely random, 374-375
selecting, 372-373
Stochastic process, 16, 336-366, 370-388
continuous, 338
uncertainty in, 390-391
Streamflow models, stochastic, 371
Student's t distribution. See t Distributions
Subset, 20
Sufficiency, 71-72
Sum of squares, 229-230
multiple regression, 250
Symmetry, measures of, 58-59
Systematic record, 156-157
System reliability, estimating, 397, 400, 418.
See also Reliability/risk analysis
t Distributions, 143-144
Tests of hypothesis. See Hypothesis testing
Tests of significance. See Hypothesis testing
Theory of errors, 100
Thiessen polygon method, 446-447
3-parameter lognormal distribution, 422
3-parameter Weibull distribution, 136
Time average properties, 338
Time scale
continuous, 338
discrete, 84,338,349
Time series
ARIMA, 355-361,363-364
autocorrelation, 348-350
definitions, 336-340
independence of data, 348-349
jumps in, 337, 346-348, 374
parameter estimation
least squares, 364-366
maximum likelihood, 366-367
periodicity, 350-355
plot, 29
trend analysis, 337,340-346,374
variance, 351-353
Total probability theorem, 24-25
Total system variance, 298
TP 40 (United States Weather Bureau),
189,190
Trace of matrix, 298
Transformations, 145-146
bivariate, 47-48
logit, 274
multiple linear regression, 266-268
Z, 151
Transform methods, reliability/risk
analysis, 424
Transition probability, 381-383, 387
matrix, 381
n step, 382-383
one step, 380-381
Translations, along the x axis, 145
Trends, in a time series, 337,340-346,374
Triangular distributions, continuous, 116-117
Type I error
hypothesis testing, 196,202
and Kolmogorov-Smirnov test, 215
Type II error
hypothesis testing, 196,202-204
and Kolmogorov-Smirnov test, 214
Type I, II, and III extreme value distributions.
See Extreme value distributions
Uncertainty
in geostatistics, 448-449
in stochastic processes, 390-391
Uncorrelated random variables, 281-282
Uniform distributions, continuous, 114-116
Uniformly most powerful test, 205
Uniform random number, 322
Union, of events, 21
United States Water Resources Council, 156, 182
United States Weather Bureau TP 40, 189, 190
Univariate distribution, 32-39,44,53-55
Unordered sampling, 26,27
Variables
dependent, 224,228,242
independent, 245,260
indicator, 268-271
lagged, 258,260
random. See Random variables
selection of in regression, 254-255
standardized, 102,304
uncertain, 404
Variance, 57-58
confidence interval and, 199-200
of design estimate, 33
of errors, minimizing, 434-438
first-order approximation estimate, 407-410
global estimate for, 448
grouped data, 58
hypothesis tests concerning, 209, 210
of linear function, 65,298
multiple regression, 245, 249
noise, 364
output random variable, 399-400
of parameter estimate, 332
point estimates, 428,429
population, 57-58
of predicted value in regression, 236
of principal components, 298-300,304,306
Yule-Walker equation, 361, 364