You are on page 1of 14

Introduction

This module will introduce some basic probability concepts and procedures used in Hydrological
Frequency Analysis. The methods that have been used historically for flood frequency analyses were
developed prior to the advent of the computer power that we now have access to. Subsequently the
methods employed to analyse flow data are becoming more sophisticated. The R environment is
particularly useful for understanding this increased perplexity.

2 Objectives
When you have completed this module of the course you should be able to:

Describe the statistical parameters that pertain to hydrological data

Apply the key probability concepts and methods to hydrological data

Summarise hydrological data using a descriptive statistics

Construct and analyse flow duration curves

Separate baseflow from a measured hydrograph

Conduct recession and spell analyses on hydrological data

3 Resources
The revision of AR&R, particularly Book 3, Chapter 2, Flood Frequency, has produced material that
is particularly valuable for understanding the new methods.

William King from Coastal Carolina University has produced a handy tutorial on Probability functions
in R.

The subject is also covered well by Ladson (2011) in Chapter 6: Floods and Flood Frequency
Analysis.

For some basic statistics and probability theory, see the book by G Jay Kearns. This book can be
accessed through R by three simple commands:

install.packages("IPSUR")

library(IPSUR)

read(IPSUR)
4 Probability Theory
In flood frequency analysis we are primarily interested in analysing the probability of floods of certain
sizes, therefore we are generally only interested in the flood peaks. Given the stochastic nature of
rainfall, and many of the processes that contribute to flooding, a probability framework is required in
order to assess risk, as the flood peaks are treated as random variables.

One of the primary assumptions of applying the flood probability framework is that each peak is
statistically independent of all other flood peaks. This has generally be shown to be the case.

To understand probabilities, and hence to be able to work with the probability of the occurrence of
hydrological events, we need to be able to describe and compute probabilities. This is usually done
using a probability density function, or pdf.

The flood probability model is generally described by its probability density function (pdf). For sample
data, the pdf is a smoothed curve applied to a histogram of measurements. There are many forms of
a pdf, depending on the shape of this curve. When the underlying data fits a pdf that resembles a
bell-like shape it is called a normal (or Gaussian) distribution. This is an excellent resource for
explaining probability distributions in R.

There are several distributions that are used in hydrology. Each of these curves can be described by
a mathematical equation, which are characterised by one or more parameters (e.g location, scale
and skew) which usually have some physical meaning in hydrology. For example the normal
distribution is characterised by the mean (\(\mu\)) and standard deviation (\(\sigma\)). Though there
are very few applications of a normal distribution in hydrology, it is instructive to introduce probability
concepts through the use of a normal distribution.

The values of the standard normal distribution (the special normal distribution which has \(\mu =
0\) and \(\sigma = 1\)) can be generated by using the dnorm function, which returns the height of the
normal curve at values along the \(x\)-axis.

Variables which follow a standard normal distribution are usually designated \(z\). In general, capital
letter represent the random variable (in this case \(Z\)), and small letter represent an outcome or
realization of that random variable (eg. \(z\)).

Wrapping the curve function around this result produces the familiar bell-shaped curve. The point
on the curve where z = 1 is shown by the green circle and line.

curve(dnorm(x), xlim=c(-3,3), col='red', lwd=3, xlab="Z", ylab="density",


main="Probability Density Function")

points(c(1,1), c(0,dnorm(1)), col='green', pch=19, cex=2, type="b")

#lines(c(1,1), c(0,dnorm(1)) ,col='green', lwd=2) #changed to type="b" in above


function, which plots both points and line.
To find the density value of particular value along the \(x\)-axis (ie. the height of the curve at that
point), a value is passed to the dnorm() function:

dnorm(1)

## [1] 0.242

The probability that \(z\) will be greater than a particular value is given by the area under the pdf to
the right of the \(z\)-value on the \(x\)-axis. You may remember looking up these \(Z\)-scores (values
on the \(x\)-axis) in a table to return the \(p\)-value (region under the normal curve) when conducting
hypothesis tests. The good news is that you no longer have to do that; you only need to enter the
correct command into R!

The area under the curve (shaded in blue below) from a specified \(z\) value on the \(x\)-axis to the
maximum value on the \(x\)-axis (which is \(\infty\) for the normal distribution), gives the probability of
the value of \(Z\) being greater than the particular value of \(z\) being examined: \(z\geq\) Z; ie. it
gives \(P(Z \geq z)\). The total area under the pdf curve is \(1\).

You can also specify a different values for \(\mu\) and \(\sigma\) in the dnorm function (rather than 0
and 1 respectively). The following pdf is for the case with \(\mu = 10\) and \(\sigma = 2\):

The cumulative distribution function (cdf) is the integral of the pdf (eg. area under the curve), or
the total accumulated probability up to a certain value. It is represented as \(P(z)= \int_{-\infty}^{z}
{p(z)dz}\). Therefore, like the pdf, there is a mathematical curve that describes the pdf. The pdf of
normally distributed data takes on the familiar s-shape and can be generated for the standard
normal distribution (\(\mu = 0\) and \(\sigma = 1\)) using the pnorm function inside
the curve function.

It stands to reason that if a value (\(z\)) is passed to the pnorm function, it will return the value on the
cdf curve, which is equivalent to the area under the pdf curve to the left of \(z\).

4.1 Probability Functions


The probability theory that is used to estimate the probability of floods of certain magnitudes
occurring is taken from theSampling Theory (see the book by G Jay Kearns) and the theory of
Bernoulli Trials (see page D-21 in this resource).

The binomial distribution represents the successes or failures or trials that are repetitive (eg. tossing
a coin 100 times and recording the occurrence of Heads or Tails). As we are dealing with discrete
events (that is, one Head or two Heads, but not 2.8034 Heads), the representation of the probability
function is called a probability mass function. The graphical representation of a mass function is
usually through a column chart (as opposed to a curve for the cdf), Essentially these theories result
in the equations that allow the probability of k successes in n trials to be calculated (using the
binomial formula):

\(P(k \text{ successes in } n \text{ trials})=\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}\)

The Annual Exceedance Probability (AEP) is the probability (p) that is of interest to us in the previous
equation. We can apply the above equation to determine the probability of occurrence of various
scenarios. For example, if we consider peaks that occur in an annual time period, we can calculate
the probability that exactly three 50 year floods will occur in a 100 year period.

If we consider a series that represents the maximum floods over a certain period (eg. annual) or
threshold (eg. Peak over Threshold or PoT), the magnitude of these floods can be related to the
probability of exceedance (in the case of annual peaks) or their return interval (in the case of PoT).
Fortunately there is a relationship between the probability of exceedance (AEP, or \(p\)) and the
return interval (ARI): \(ARI=\frac{1}{p}\).

Therefore we need to develop skills in being able to calculate these exceedance probabilities.

4.1.1 Exceedance Probabilities


If we wanted to work out the probability that exactly three 50-year floods will occur in a 100-year
period, we could of course apply the above equations. For example, in this situation, our k=3
and n=100. Our ARI = 50, therefore \(p=\frac{1}{50}\). Plugging these values into our first equation
yields:

\(P(3 \text{ in } 100)=\frac{100!}{3!(100-3)!}(\frac{1}{50})^{3}(\frac{49}{50})^{97} = 0.182\), so therefore


there is an 18.2% chance that three 50-year floods will occur in any 100 year period.

At this stage we dont know the magnitude of that flood for a given area. This will come through the
analysis of actual streamflow data.

Applying this equation every time we wish to calculate these probabilities seems onerous; fortunately
R has a built in function to enable the calculation of the probability of k successes in n trials to be
calculated using dbinom(k,n,p) \(= \)P(3 100) =$ 18.2276.

But what if we wanted to know the probability of 3 or more 50-year floods occurring in a 100 year
period? We know that the sum of the probabilities of all possible numbers of 50-year floods occurring
in 100 years must equal unity (this includes the probability of there being no 50-year events in a 100-
year period), so \(P(3 \text{ or more in } 100) = P(3 \text{ in } 100) + P(4 \text{ in } 100) + ... P(100
\text{ or more in }100)\).

However, it would take a long time to apply each of these combinations into the first equation. We
could take a different tack and just calculate the Probabilities of 0 in 100, 1 in 100 and 2 in 100 and
take the sum of these probabilities away from 1: \(P(3 \text{ or more in } 100) = 1 - P(0 \text{ in } 100)
+ P(1 \text{ in } 100) + P(2 \text{ in } 100)\). However, this also seems a bit onerous, but fortunately
we can again apply the dbinom(k,n,p) function and wrap a sum function around it: 1-
sum(dbinom(0:2,100,1/50)) = 0.3233. In R we could also easily apply the first option directly,
eg. sum(dbinom(3:100,100,1/50)) . You will see this gives the same answer.

To understand the shape of this distribution you can wrap a plot function around
the binom function for the points from 0:100. (The binomial distribution is computed at disctrete
values, such as 0, 1, 2, 100; for example, in 100 years, we can have 0 years or 1 year or two
years and so on with a flood, but it makes no sense to talk about 2.45604 years having a flood. This
means the pdf will not be continuous.) The blue dot in the following graph corresponds to the
probability of 2 occurrences of 50 -year floods in 100 years. It makes sense that this has the highest
probability, given that, on average, we could expect two 50-year floods in any 100 year period.

x <- seq(0, 100, 1)

plot(x,dbinom(x,100,1/50),col="red",lwd=3,type="h", las=1, xlab="Number of Occurrences


of 50 Year Floods in 100 Years", ylab="Probability of Occurrence", main="Probability
Distribution for number of \n Occurrences of 50 Year Floods in 100 Years")

points(2,dbinom(2,100,1/50), pch=19, col ="blue")

Exercise: If you were asked to design a bridge to be safe during a 50 Year flood, what is the chance
that the design flood will be exceeded four or more times in a 100 year period?

In design however, we are most interested in the probability of occurence of 1 or more failures
in n years. Using the previous logic, this equates to:

\(P(1 \text{ or more in } 100) = 1 - P(0 \text{ in } 100)\)

Applying the previous relationships yields:

\(P(1 \text{ or more in } 100) = 1-\frac{n!}{0!(n-0)!}p^0(1-p)^{n}=1-(1-p)^{n}\) (since \(0! = 1\) and \(p^0
= 1\)). Therefore the probability of one or more 2000 year floods in a 30-year period is given by: \(P(1
\text{ or more in } 30) = 1- (1-\frac{1}{2000})^{30} = 0.0149\) or 1.49%

Again we can use the dbinom function to calculate this: 1-dbinom(0,30,1/2000) = 1.49%.

Exercise: What is the probability of one or more 200 year floods in a 20 year period?

Now consider the case of a project with a design lifetime of 100 years. The design discharge we are
interested in is a discharge with a 100 year return period. In this case what is the probability of
failure in the lifetime of the project? In this case n=100, and the return period is also 100. Therefore
the probability is 0.01: 1- dbinom(0,100,0.01) . This equates to a 63.4% probability of a 100 year
flood occuring in any 100 year period!

What is the probability of 100 year floods occurring in two consecutive years? In this case our \
(n=2\), and we can either directly apply dbinom(2,2,0.1) or multiply the probability of one 100 year
flood in 1 year (ie. \(n\) and \(k =1\)): dbinom(1,1,0.01) by the same probability (since the events
are independent): dbinom(1,1,0.01)*dbinom(1,1,0.01) . Therefore the probability (expressed as a
percentage) of 100 year floods occurring in two consecutive years is 0.01%: It is very unlikely.

We can also turn these relationships around to determine the design return period for an acceptable
risk of failure. For example: An hydraulic construction site on a stream needs the projection of a
coffer dam/tunnel for a period of three consecutive years. If the total acceptable risk of failure during
this three year period is estimated to be 0.5%, what must be the design return period for the
protective structure? In this case, we have the answer to P = 1 - dbinom , which is the same as the
equation: \(P = 1 - (1-p)^{n}\). Given that P=0.005 (ie. 0.5% acceptable risk of failure), and \(n=3\),
the equation can be rerranged to calculate the exceedance probability (p)= 0.167%. The ARI is the
reciprocal of this: \(ARI=\frac{1}{p}\) = 599 years. There is no doubt an inverse bionomial function in
R to calculate this: can you find one?

Exercise: A total acceptable risk of failure for a bridge is 5%. Given that this bridge has a design life
of 50 years, What is the ARI that the bridge should be design for?

There are several other statistical distributions that are important for use in hydrology. These can be
broken down into different distribution families. Each of these distributions has a corresponding
function in R that enables you to extract a value of the density curve (d), cumulative distribution
curve (p) and a range of other statistical properties. The values in brackets proceed the function
name shown in Table 4.1.

Table 4.1: Distributions common in hydrology with the appropriate functionn name. Note each
function is proceeded by the letter d, but this can be changed to p (and a few other options)
depending on whether the value on the density or cumulative distribution curve is required.

Family Distribution Name Related R Function

Normal Family Normal dnorm

Log-normal (that is, \(\log X\) has a normal dlnorm


distribution)

Discrete binomial dbinom

Gamma Pearson type III dpearson available in Library


Family (SuppDists)

Log-Pearson type III dlgamma3 available in Library (FAdis


Family Distribution Name Related R Function

Extreme Value Type I - Gumbel (for max or min) dgumbel available in Library (FAdist

Type II- Weibull (for min) dweibull

5 Streamflow Data
The starting point for understanding the driving hydrological processes in a river system or
catchment is to extract flow data for the location(s) of interest. In Queensland, flow data is available
through the Queensland Government Water Monitoring Data Portal. For current streamflow data,
click on Open Stations, and select the relevant region, through the map, or basin, then gauging
station through the links in the frame to the left.

Data can be downloaded for any available timestep up to the length of the data availability, under the
Custom Outputs tab. For the example below, daily stream discharge (Megalitres/Day) from the the
Mary River at Bellbird Creek gauging station (Station Number: 138110A) have been used. These
data are available from 01/10/1959.

Data are available from the Qld Water Monitoring Data Portal in a structure that requires some
manipulation prior to being able to analyse it. Several variables can be downloaded at a time, but the
structure is always the same. The first column contains the Date and Time data, then the next six
columns contain the Mean, Min and Max values and corresponding quality codes for the first
variable. If there is more than one variable, this pattern is repeated again. The last column always
contains the metadata about the station and a description of the quality codes. The structure for one
variable is shown in the table below.

station station station


Time number number number

and Variable Variable Variable


Code Code Code

Date Variable Variable Variable


Name and Name and Name and
station station station
Time number number number

Units Units Units

Mean Qual Min Qual Max Qaul

1 Date
st
value valu value value value valu Sites:
e e

2nd
Date value valu value value value valu Station Number - Stat
e e Name Lat Long Elev

From row 8 onwards,


there is a description o
the Qual codes

Last Date value valu value value value valu


e e

Miscellaneous
text

Miscellaneous
text

The following code is commented to assist your understanding of one way to import the data into R.

# The station number is represented by the variable stn.

# To bring in data for a different location, you only need to change stn,

# provided the csv file is placed in the same directory.

stn<- "138110A"
#fn is a variable that is created by pasting together the directory (change to point
to your local directory), the station number (variable stn) and the file extension
(.csv).

fn<-
paste("/Users/helenfairweather/Documents/SunshineCoast/USC/Teaching/ENG330/2014/Data/"
,stn,".csv",sep="")

#fn<-
paste("//usc.internal/usc/home/hfairwea/Desktop/Teaching/ENG3302014Sem2/RHTML/data/Dai
ly/",stn,".csv",sep="")

#extract the first part of the file to get the variable name.

#before reading in the data determine the number of columns in the csv file. This is
stored in the ncol variable.

ncol <- max(count.fields(fn, sep = ","))

nrows<-length(count.fields(fn, sep = ","))

# The first column always contains the date, and the last column the metadata.

# For every variable of interest, the file contains six values:

# the mean value, min value and max value, plus the corresponding quality code for
each of these.

# Therefore the ncol minus 2 (to remove the date and metadata columns) divided by 6
will give the number of variables,

# which is stored in nv.

nv<-as.integer((ncol-2)/6)

#To extract the variable names, the first 4 rows of the data file are read into the hd
variable
hd<-read.csv(fn,header=FALSE,nrows=4,sep=",")

#An empty vector to store the variable names is created and a loop used to capture
these names for the number of variables (nv). The first variable name is sourced from
row 3, column 2, and each subsequent variable name from the same row (3), but
incremented by 6 columns.

varn<-character(0)

cn<-2

for(i in 1:nv)

varn[i]<-as.character(hd[3,cn])

cn=cn+6

nrws<-length(count.fields(fn, sep = ","))

#The data are extracted from the file by using the read.csv function and skipping the
first four rows.

ddtemp<-read.csv(fn,header=FALSE,skip=4,sep=",",nrows=nrws-6)

#The extra column at the end of the file is removed in the next line of code.

dd<-ddtemp[1:(nrow(ddtemp)),1:(ncol-1)]

#If the two extra lines at the end of the file are not removed, the following code is
used in lieu of above.

#dd<-ddtemp[1:(nrow(ddtemp)-2),1:(ncol-1)]

#Names are assigned to the columns using the concatenate function and in a loop given
by the number of variables.

cnames<-NULL
for(c in 1:nv){cnames<-
c(cnames,paste(c("Mean","Qual","Min","Qual","Max","Qual"),rep(c,6)))}

names(dd)<-c("Date",cnames)

Once the data are entered into R, they should be first be interrogated to ascertain the number of
missing values. The summary command produces the following result for this dataset.

library(xtable)

tab<-xtable(summary(dd))

print(tab, type="html")

Date Mean 1 Qual 1 Min 1 Qual 1 Max 1 Qual 1


1 1/01/00: 1 Min. : 0 Min. : 9.0 Min. : 0 Min. : 9.0 Min. : 0 Min. : 9.0
2 1/01/01: 1 1st Qu.: 27 1st Qu.: 9.0 1st Qu.: 24 1st Qu.: 9.0 1st Qu.: 30 1st Qu.: 9.0
Median : Median : Median :
3 1/01/02: 1 Median : 79 Median : 73 Median : 86
9.0 9.0 9.0
4 1/01/03: 1 Mean : 452 Mean : 14.6 Mean : 239 Mean : 14.6 Mean : 873 Mean : 14.6
3rd Qu.:
5 1/01/04: 1 3rd Qu.: 213 3rd Qu.: 9.0 3rd Qu.: 9.0 3rd Qu.: 238 3rd Qu.: 9.0
187
Max. : Max. : Max. :
6 1/01/05: 1 Max. :200.0 Max. :200.0 Max. :200.0
156586 30057 329157
(Other):2006
7 NAs :216 NAs :216 NAs :216
2
There are libraries to handle these missing data, but we will not cover this at this time.
The na.rm=TRUE option (which instructs R to remove the missing values, or NA s, before
computation) will be included in functions where these missing data would cause a problem.

Several libraries are available to facilitate the relatively easy plotting of hydrological timeseries data.
These are zoo, hydroTSM, lubridate and lattice . The zoo package is designed to create
timeseries data, which is important for hydrological analyses. The first requirement to create
a zoo object is to create an index of dates (and time, if relevant). Even though the dates are
included in the imported file, they are not necessarily read as such. They may be coerced into dates
using date_ind<-as.Date(dd$Date,format="%d/%m/%Y",origin="1970-01-01") . However, if this is
not successful, you can read the first and last dates using a similar function, and create a sequence
from this first to last date in increments of one day.

In the data frame previously created for discharge, the Mean, Min and Max discharge values are in
columns 2, 4 and 6.
(Columns 3, 5 and 7 contain the quality codes for the mean, min and max respectively.) If other
variables are included, they are accessed by adding 6 to these column numbers. Using these
columns, the zoo command creates a timeseries object with the associated date index.
The hydroTSM librarys plot command will produce a line graph of all the time series in
the zoo object.

Further investigation of these data can be conducted using the lattice library, particularly
the splom function, which produces a matrix plot of the three parameters (Mean, Min and Max) for
each variable (in this case, Discharge (ML/Day)) in our zoo object. The matrix that is produced
consists of two plots per parameter, each with the x and y axes reversed. This is instructive for
investigating the relationship between the variables and the parameters. In this case, the plot that
demonstrates the strongest relationship is between the Mean and Maximum flows, indicating the
Maximum flows have a large effect on the Mean values.

A boxplot is useful for investigating the behaviour of the stream over certain time periods, for
example each month of the year. The daily2monthly function in the hydroTSM library is used to
first create monthly sums from the daily data, using the m <- daily2monthly(zd[,3], FUN=sum,
na.rm=TRUE) command. A vector with the three-letter abbreviations of the month names is extracted
using the cmonth <- format(time(m), "%b") command, which are then converted to factors using
the command: months <- factor(cmonth, levels=unique(cmonth), ordered=TRUE) . Factors are
variables in R which take on a limited number of different values.

(The use of ordered tells R that the levels of the factor have a natural order in which they appear.)
These data are then used to create the boxplot using boxplot(coredata(m) ~ months,
col="lightblue", main="ML/month") . The m' object is a timeseries, and the coredata(m)`
extracts the values of interest, without the related dates.

The boxplot is a representation of all available data, showing the median (think black line inside the
blue box) and the interquartile range (represented by the blue box). The whiskers extend to the most
extreme data point which is no more than [1.5] times the length of the box away from the box.
The extreme points are indicated by the open circle points.

The boxplot graphically shows that the wettest months are generally from December through to
March, though it is evident that extreme flows have occured in April, July and October. It is also
evident that the highest flow occured in April. Applying the
command index(zd[which.max(zd[,3]),]) gives the date of this occurence as 1989-04-25. This
command finds the maximum value, which.max from the maximum daily values (column 3 in the
variable zd ). This maximum value is the index of the location of the value in column 3 of zd :
zd[10800]. Wrapping index around this will return the corresponding date.
6 Flow Duration Curves
Like the previous boxplots, flow duration curves are a summary of the hydrology of a stream, but
provide added information in terms of the percentage of time that flows of given magnitudes are
exceeded. The following code produces a plot showing the flow duration curve for all the data (blue
curve). This curve is produced by the fdc command from the hydroTSM library. In the example
provided, all available daily data has been included in the construction of this curve. Note the log
scale on the y-axis; as noted earlier, hydrology data rarely fits a normal distribution.

Any time step can be used to construct the flow duration curve, but a daily time step will provide the
most information. The graph can be used in two ways:

1. determine the percentage of time that a given flow is exceeded, and

2. find the magnitude of flow that is exceeded a certain percentage of time.

An estimate can be determined through a visual inspection, or through the application


of R functions. For example, the quantile function will return the flow that is exceeded a certain
percentage of time: quantile(zd[,3],1-tfe,na.rm=T) . In this case, the tfe variable is the
percentage of time that flow is equalled or exceeded. The flow duration curve is the cumulative
distribution function with the axis swapped around, with the x-axis (% time equalled or exceed on the
x-axis) representing 1 - the cdf probabilities. Therefore input into the quantile function is 1-tfe .
So for the flow that is equalled or exceeded 20% of the time
= ceiling(quantile(zd[,3],0.8,na.rm=T)) , or 315.

To find the % of time a certain flow is equalled or exceeded the Empirical Cumulative Distribution
Function ( ecdf ) function can be used. For example, a flow of 315 is equalled or exceeded 20% of
the time. The blue arrows demonstrate this on the figure below.

The flow duration curve can also be applied to a subset of data, eg. data from months with high
flows. The example below shows this for the high flow months (Jan - Feb) and low flow months (Aug-
Sep). Examples of applying the quantile function to these subsets of data are also shown,
demonstrating the changing statistical characteristics of this stream in wet versus dry times.

A steep section at the top end of the flow duration curve indicates a relatively small number of very
high flows, which is confirmed by the boxplots shown previously. According to Ladson (2011, p.143)
a steep section at the low flow end of the curve indicates that baseflow is insignificant. Baseflow will
be investigated in more detail in a later module.
Exercise: Bring in data from the Victorian Water Monitoring Portal for Joyces Creek at Strathlea
Gauge no. 407320 and compare with the graphs in Ladson, p.142. The format of these data are very
similar to the Qld water monitoring data.
https://rstudio-pubs-
static.s3.amazonaws.com/30850_e9f98501a0c3420897eb59de6abae6ab.html