
Introduction to

Geological Data Analysis


GS-134
By William A. Prothero, Jr.
Winter, 2002
Table of Contents:
Chapter 1. Kinds of Data
Chapter 2, Plotting Data with 2 variables
Chapter 3, Correlation and Regression
Chapter 4, The Statistics of Discrete Value Random Variables
Chapter 5, Probability Distributions and Statistical Inference
Chapter 6, Statistical Inference and t, χ² and F distributions
Chapter 7, Propagation of Errors
This material is extracted from an unpublished book written by William A.
Prothero on Geological Statistics. It is an in-progress work. Please do not
copy or reproduce any of this work without my permission. Please also note
that the course syllabus, homework, and lab activities are available at
http://oceanography.geol.ucsb.edu/ (click on "Classes").
Thank you,
William A. Prothero, Jr.

CHAPTER 1
By William A. Prothero, Jr.

Kinds of Data
A measurement can come in many forms. It may be the color of a rock, the number
that comes up on a die, or a measurement from an instrument.
We define several types of measurement scales.
nominal:

Classified as belonging to one of a number of defined categories. For example,
a rock may be 'measured' as igneous, metamorphic or sedimentary. The
simplest type of scale is a nominal scale with only two categories, that is, a scale
where an object can have one of two possible states. For example, we might
'measure' a rock as either containing a particular mineral or not containing that
mineral.

ordinal:

Classified as belonging to one of a number of defined categories where the
different categories have a definite rank or order. An example of the ordinal
scale of measurement is the Mohs hardness scale for minerals. A mineral is
classified as 1, 2, 3, ..., 10 where a mineral with a hardness of 10 is harder than a
mineral with a hardness of 9, which is harder than a mineral with a hardness of 8
and so forth. A mineral with a hardness of 4, however, is not necessarily twice
as hard as a mineral with a hardness of 2.

counting:

The counting measurement scale also has discrete values. An example of data measured on a
counting scale is the number of earthquakes above a certain magnitude
recorded in a particular location in one year.

interval:

Measurements made on this scale have a continuous range of values.
Temperature is measured on an interval scale. Although the centigrade and
Kelvin scales have different zeros, the difference between the boiling point of
water and the freezing point of water is 100 degrees on both scales.

ratio:

The ratio scale is the same as the interval scale, but it has a true zero. Length,
mass, velocity and time are examples of measurements made on a ratio scale.

angular:

Measurements of strikes and dips are examples of data measured on an
angular scale. An angular scale is a continuous scale between 0° and 360°.

Each of the above data types may require variations in plotting strategies. The following
sections show how to construct histograms for these data types and later chapters show how to
plot these data when more than a single variable is associated with the data.
Version: December 19, 2001, University of California
University of California, 2001

Plotting Data - Kinds of Data

1-1

Parametric and non-parametric statistics


There are two important types of statistics. The first type is parametric statistics.
Parametric statistics concern the use of sample parameters to estimate population
parameters. For example, if we were interested in the porosity of a particular sandstone bed,
we might take a sample of 10 porosity measurements and estimate the mean and standard
deviation of the bed based on the mean and standard deviation of our sample using parametric
statistics.
Parametric statistics are limited to data measured on a continuous scale of values and
require that a number of assumptions be satisfied, including that the individuals in the population
are independent and that the population is normally distributed. As we will see later in this
chapter, the Central Limit Theorem greatly increases the number of problems that can be
addressed with parametric statistics.
Non-parametric statistics do not involve the parameters of the population from which the
sample was taken and may be used when data are measured on a discrete scale of values or
when the assumptions required by parametric statistics cannot be satisfied. When the number of
samples is large, we can sometimes treat data measured on a discrete scale as if they were
continuous.
We stress that when using any statistical method, it is very important to make sure that
the assumptions on which the method is based are appropriate to the problem.

Measurement Errors
If data were error free, it would not be necessary to read the remainder of this text.
Errors come from many sources. In fact, they are physically required through the Heisenberg
Uncertainty Principle, which states that there is an inherent uncertainty in any measurement
that can be made in a finite length of time. On a practical level, errors occur because the
instruments that we use have noise and because of naturally occurring variations in the earth.
For example, suppose you are measuring the composition of rocks sampled from a particular
region. You would expect composition to vary because of (hopefully) small variations in the
history of the rock, variations in chemical composition of the source, and varying contamination
from other sources (crustal rock contamination of igneous intrusive rocks, weathering, leaching,
etc). In seismology, the earthquake signals will vary from site to site because of varying surface
soil conditions, scattering of the seismic waves on the way from the source, and instrument errors.
However, one person's noise may become another person's signal. If the problem is to
determine the magnitude of the quake, the variations in signal due to scattering and surface
structure variations will be noise. But, if the problem is to study scattering or site response,
the variations due to these effects are signal to be studied and explained.
Accuracy and Precision:
Accuracy is the closeness of a measurement to the "true" value of the quantity being
measured. Precision is the repeatability of a measurement. If our measurements are very
precise, then all our values will cluster about the same value. To make the distinction between
these two terms clear, consider the data plotted in Figure 1.1. Five measurements of the
concentration of chemical X in a given water sample are made with each of four different
instruments. The "true" concentration of the sample is 50 mg/l.
Figure 1.1 Plot of concentration of chemical X (vertical axis, in mg/l). The correct value of the concentration is 50 mg/l. A shows
a precise and accurate measurement, B is accurate, but not precise, C is precise, but not accurate, and D is
neither precise nor accurate.

Instrument A is both precise and accurate. Instrument B is accurate but not precise.
Instrument C is precise but not accurate. Instrument D is neither precise nor accurate.
Bias:
Bias will also be discussed in more detail in Chapter 7. Cases C and D in Figure 1.1
(above) demonstrate bias in the data. In this case, the bias is caused by an inaccurate
measurement device. A good example is when you measure your weight on the bathroom
scale. You may consistently get the same weight, but if the zero of the scale is not set properly,
the result will consistently be high (or low), or biased.
Significant figures and rounding:
Another important concern with respect to data measurement is the correct use of
significant figures. This has become a problem as hand calculators have come into universal
use. For example, suppose you measure the length of a fossil with a ruler and find it to be 5 cm.
Now suppose that you decide to divide that length by 3, for whatever reason. The answer is
1.666666666.... Obviously, since the ruler measurement is probably accurate to no better than 0.1
cm, there is no reason to carry all of the sixes after the decimal point. Significant figures are the
accurate digits, not counting leading zeros, in a number. When the number of digits on the right
hand side of the decimal point of a number is reduced, that is called rounding off. If the
portion that is being eliminated is greater than half a unit of the last digit kept, then you round up,
but if it is less than half, you round down. So, 5.667 would round to 5.67, 5.7, or 6 while 5.462
would round to 5.46, 5.5, or 5. You set your own consistent convention for when the eliminated
portion is exactly half (a lone trailing 5). Generally, it is rounded up, but sometimes it is
alternately rounded up, then down.
Some conventions and rules exist regarding the number of significant figures to carry in your
answer. When a number is put into a formula, the answer need have no more precision than the
original number. Precision is also implied by how many digits are written. Writing 16.0
implies 16.0 ± 0.05, so that the number is known to within 0.1 accuracy whereas writing
'16.000' would indicate that the number is known to within 0.001 accuracy. The number
41.653 has 5 significant figures, 32.0 has 3, 0.0005 has 1, and so forth.
In calculations involving addition and subtraction, the final result should have no more
significant figures after the decimal point than the number with the least number of significant
figures after the decimal point used in the calculation. For example, 6.03 + 7.2 = 13.2. In
calculations involving multiplication and division, the final result has no more significant figures
than the number with the least number of significant figures used in the calculation. For example,
$1.53 \times 10^{1} \times 7.21 = 1.10 \times 10^{2}$. Note that it may be clearer if you use scientific notation in
these calculations. Consider the number 1,000; $1.000 \times 10^{3}$ has 4 significant figures whereas
$1 \times 10^{3}$ only has 1. If your calculation involves multiple operations, it is best to carry additional
significant figures until the final result so round-off errors don't accumulate.
It is extremely important to maintain precision when performing repetitive numeric
operations. Keep as much precision as possible during intermediate calculations, but
show the answer with the correct number of significant figures.
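
The advice above (carry full precision through intermediate steps, round only the final answer) can be sketched in Python. The helper name `round_sig` is an assumption made for illustration and is not part of the course materials. Note also that Python's built-in `round` works on the binary value of a number and sends exact ties to the even digit, which is one possible "consistent convention" for a trailing 5, not the round-up rule.

```python
import math

def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures (hypothetical helper)."""
    if x == 0:
        return 0.0
    # Power of ten of the leading digit: 5.667 -> 0, 110.3 -> 2, 0.0005 -> -4
    exponent = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - exponent)

# Keep full precision in the intermediate result, round only when reporting.
length_cm = 5.0
third = length_cm / 3               # stored with about 16 digits
print(round_sig(third, 2))          # 1.7
print(round_sig(1.53e1 * 7.21, 3))  # 110.0, i.e. 1.10 x 10^2
```

The rounding is applied once, at the point where the number is reported, so round-off errors from earlier steps do not accumulate.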

Data distributions
It is very difficult to extract meaning from a large compilation of numbers, but graphical
methods help us extract meaning from data because we see it visually. When the data values
consist of single numbers, such as porosity, density, composition, amplitude, etc., the histogram
is most convenient. Data are divided into ranges of values, called classes, and the number of
data points within each class are then plotted as a bar chart. This bar chart represents the data
distribution. The shape of the distribution can tell us about errors in the data and underlying
processes that influence it.
A histogram can be described as a plot of measurement values versus frequency of
occurrence and for that reason, histograms are sometimes referred to as frequency diagrams.
Some examples of histograms are shown in Figure 1.2(a-f).
The histogram in Figure 1.2 (b) is called a cumulative histogram or a cumulative
frequency diagram. Note that in this figure, the values for weight % always increase as φ
increases. The weight % at any point on this graph is the weight % for all values equal to or
less than the value of φ at that point. (φ is a measure of grain size, $\phi = -\log_2(\text{grain diameter in mm})$.)
The histogram in Figure 1.2 (d) is called a circular histogram and is a more meaningful
way of displaying directional data than a simple x,y histogram.

Figure 1.2a and b: Grain size plots (vertical axis: weight %). The plot on the right is the cumulative histogram, which is just the
integral of the curve on the left.

Figure 1.2c and d: Examples of a nominal histogram (% map area for igneous, metamorphic, and sedimentary rock) and an angular histogram.

Figure 1.2e: Example of a histogram with bars running horizontally (altitude in km, in 1 km classes from -7 to +5, versus % of Earth).

Figure 1.2f: Example of a histogram of interval data. Each bar is the number of students (vertical axis, 0 to 100) achieving scores
between specific values (classes 31-40 through 91-100).

The general shape of a distribution can be described by terms such as symmetrical,
bimodal and skewed (Figure 1.3). The central tendency of a distribution is described by the
mean, mode and median value.

Figure 1.3 Symmetric, bimodal, and skewed distributions. Here there is sufficient data in the sample so that
the histogram bars follow a relatively smooth curve.

The mean is the sum of all measurements divided by the total number of measurements. The
mean of N data values is:

$$ m = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{(mean of the data)} $$

This value is also referred to as the arithmetic mean. There is also a geometric mean,
defined as:

$$ g = \sqrt[N]{x_1 x_2 x_3 \cdots x_N} $$

The formula for the sample mean, m, given above, uses the Greek summation
symbol, $\sum_{i=1}^{N}$. This means to add all values of x (your data values). This is
$x_1 + x_2 + x_3 + x_4 + \ldots$ This is a common symbol in statistical formulas. The
summation symbol may also be shown as $\sum_i$, which means to sum over all
values of i. To test your understanding of this notation, imagine a very simple
data set with values of x = 1, 2 and 3. Then, N = 3. Do the calculation by hand. If
you can do it, you probably understand the notation. The answer is 2, for the
mean of the data.

It is interesting to note that the logarithm of the geometric mean is equal to the
arithmetic mean of the logs of the individual numbers. For four numbers a, b, c and d (N = 4):
$\log[(abcd)^{1/N}] = \{\log(a) + \log(b) + \log(c) + \log(d)\}/N$
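
As a check on the two definitions, here is a minimal Python sketch using the same three-value data set as the notation exercise. The function names are chosen for illustration only. It also confirms numerically that the log of the geometric mean equals the arithmetic mean of the logs.

```python
import math

def arithmetic_mean(values):
    # Sum of all values divided by the number of values
    return sum(values) / len(values)

def geometric_mean(values):
    # N-th root of the product of the values (all values must be positive)
    return math.prod(values) ** (1.0 / len(values))

data = [1.0, 2.0, 3.0]
m = arithmetic_mean(data)   # 2.0, as in the hand calculation
g = geometric_mean(data)    # cube root of 6
mean_of_logs = arithmetic_mean([math.log(x) for x in data])
print(m, g, math.log(g), mean_of_logs)
```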

For a data distribution, important definitions are:


mode:

the most frequently occurring value.

median:

the value one-half of the measured values lie above and one-half the measured
values lie below.

For a symmetrical distribution, the mean, mode and median values are the same.
Figure 1.4 For a symmetric distribution (bottom), the mode, median, and mean are at the same place, but for a skewed
distribution, they are not.

The dispersion or variation of a data distribution can be described by the variance and
standard deviation. The standard deviation is defined below.
$$ \text{sample variance} = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N} $$

$$ \text{unbiased sample variance} = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N - 1} $$

$$ s_{bx} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - m)^2}{N}} \qquad \text{(standard deviation of the sampled data)} $$

$$ s_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - m)^2}{N - 1}} \qquad \text{(unbiased standard deviation of x; use this)} $$

where the mean, m was defined previously. The larger the standard deviation, the larger the
spread of values about the mean. The variance is the square of the standard deviation. The
meaning of "unbiased", used above, will be discussed in Chapter 5.
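
The two variance definitions differ only in the divisor, which a short Python sketch makes concrete. The function name and the `unbiased` flag are illustrative choices, not notation from this text.

```python
import math

def sample_variance(values, unbiased=True):
    """Sum of squared deviations from the mean, divided by N - 1 or by N."""
    n = len(values)
    m = sum(values) / n
    sum_sq = sum((x - m) ** 2 for x in values)
    return sum_sq / (n - 1) if unbiased else sum_sq / n

data = [1.0, 2.0, 3.0]
s_b = math.sqrt(sample_variance(data, unbiased=False))  # divides by N
s = math.sqrt(sample_variance(data))                    # divides by N - 1
print(s_b, s)
```

For small N the two results differ noticeably (here the variances are 2/3 and 1); as N grows the difference becomes negligible.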
Figure 1.5. This figure shows how the standard deviation is a measure of the spread of values about the
mean.
Symbols to be used throughout this book:
$x_i$ : ith data value
$N$ : number of data in the experiment
$m$, or $\bar{x}$ : mean of data values
$s$ : unbiased standard deviation of the data
$s^2$ : variance of the data
A data distribution can also be described in terms of percentiles. The median value is
the 50th percentile. The value which 75% of the measured values lie below is the 75th
percentile. The value which 25% of the measured values lie below is the 25th percentile, and so
forth. The 25th, 50th and 75th percentiles are also known as the quartiles. Given the range of
data values and the quartile values for a distribution, we can tell if the distribution is symmetrical
or highly skewed.

To illustrate the definitions of the terms defined above, we consider the set of
measurements shown in Table 1.1.
 5.0   5.8   6.1   6.3   6.9
 7.4   7.5   7.5   7.6   7.8
 8.3   8.3   8.8   8.8   8.9
 9.0   9.1   9.2   9.4   9.4
 9.4   9.4   9.5   9.6  10.0
10.0  10.3  10.3  10.5  10.5
10.6  10.7  10.8  10.8  10.9
11.0  11.4  11.6  11.9  11.9
12.2  12.3  12.4  12.8  12.8
13.1  13.4  13.5  14.0  14.1
Table 1.1. Length of fossil A (cm)

There are 50 data points in this data set, ranging in value from 5.0 cm to 14.1 cm. The
mean, median and mode are 10.0 cm, 10.0 cm and 9.4 cm, respectively. The standard
deviation is 2.2 cm. The 25th and 75th percentiles are 8.8 cm and 11.9 cm, respectively. See
if you can find the median and mode by inspection.
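
Once you have tried it by inspection, the values quoted for Table 1.1 can be checked with Python's standard `statistics` module. This is only a sketch: the variable names are arbitrary, and the percentiles are left out because different software uses different interpolation conventions and may not reproduce the quoted quartiles exactly.

```python
import statistics

# Table 1.1, length of fossil A (cm)
lengths_cm = [
    5.0, 5.8, 6.1, 6.3, 6.9, 7.4, 7.5, 7.5, 7.6, 7.8,
    8.3, 8.3, 8.8, 8.8, 8.9, 9.0, 9.1, 9.2, 9.4, 9.4,
    9.4, 9.4, 9.5, 9.6, 10.0, 10.0, 10.3, 10.3, 10.5, 10.5,
    10.6, 10.7, 10.8, 10.8, 10.9, 11.0, 11.4, 11.6, 11.9, 11.9,
    12.2, 12.3, 12.4, 12.8, 12.8, 13.1, 13.4, 13.5, 14.0, 14.1,
]

n = len(lengths_cm)
mean = statistics.mean(lengths_cm)
median = statistics.median(lengths_cm)   # average of the 25th and 26th sorted values
mode = statistics.mode(lengths_cm)       # most frequently occurring value
s = statistics.stdev(lengths_cm)         # unbiased (N - 1) standard deviation

print(n, round(mean, 1), median, mode, round(s, 1))  # 50 10.0 10.0 9.4 2.2
```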

Height of the Histogram Bars


Now you know how, using scripting, to assign a data value to a particular class. You
also need to know how to compute the height of the histogram bar. There are several ways that
this may be done.
1. Raw:

$$ \text{Bar Height} = \text{number in class} $$

This method may be used for any kind of data, but it is the only way that nominal data can be
plotted.

2. Frequency:

$$ \text{Bar Height} = \frac{\text{number in class}}{\text{width of class}} $$
This method may only be used for interval, ratio, and a modified version of angular data. It
requires that the data can be expressed on an interval number scale (it also works for integers).
This scale is most common when a comparison with expected values is wanted. The number
of values within the class is then equal to the area of the bar (height*width). This normalization
has the advantage that the width of the bars may vary without the undesirable effect that wider
bars also get taller. Figure 1.6 demonstrates the appearance of the plot of dice tossing
experiment when bar heights are calculated according to number in class and according to
number in class/class width, when faces "5" and "6" are combined into a single class.
Dice tossing experiment: A die is tossed N times. The number of times each face is expected to come
up is N/6, since there are six faces and it is equally likely that any face comes up. When some number of
tosses are made, the number for each face will normally NOT be N/6. This is due to a natural
randomness that will be discussed further in later chapters.

Figure 1.6 Comparison of two methods of plotting a histogram when it is modified so that the numbers 5
and 6 are combined into a single class (vertical axis: number of times a die number comes up for 60 tosses;
horizontal axis: number showing on the die; expected and observed values are both shown). The plot on
the left shows the effect of simply adding the 5 and 6 values into a single bar. The plot on the right shows
what happens when the bar height is the number of 5 and 6 occurrences divided by the number of faces
included in the bar, or class interval.

3. Probability:

$$ \text{Bar Height} = \frac{\text{number in class}}{(\text{width of class}) \times (\text{number of data})} $$

Here, the bar height for option 2 is divided by the total number of data, N. In this case, the
histogram bars can be directly compared to the "probability density distribution", which will be
introduced in Chapter 5. This has the advantage, for the dice toss experiment, of allowing us to
plot any number of dice tosses without resetting the maximum Y value on the plot and makes it
easy to compare actual results to "expected" results.
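
The three normalizations can be compared side by side in a few lines of Python. This is a sketch under stated assumptions: the function `bar_heights` and its `normalization` argument are invented for illustration, and the counts are the expected values for 60 tosses with faces 5 and 6 merged into one class of width 2.

```python
def bar_heights(counts, widths, normalization="raw"):
    """Return bar heights for the raw, frequency, or probability normalization."""
    n_total = sum(counts)
    if normalization == "raw":
        return [float(c) for c in counts]
    if normalization == "frequency":
        return [c / w for c, w in zip(counts, widths)]
    if normalization == "probability":
        return [c / (w * n_total) for c, w in zip(counts, widths)]
    raise ValueError("normalization must be 'raw', 'frequency' or 'probability'")

# Expected counts for 60 tosses: faces 1 to 4 separately, 5 and 6 combined
counts = [10, 10, 10, 10, 20]
widths = [1, 1, 1, 1, 2]

raw = bar_heights(counts, widths, "raw")                  # combined bar is twice as tall
frequency = bar_heights(counts, widths, "frequency")      # all bars level again
probability = bar_heights(counts, widths, "probability")  # bar areas sum to 1
print(raw, frequency, probability)
```

With the frequency and probability normalizations the merged class no longer stands above the others, which is the effect shown in the right-hand panel of Figure 1.6.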
Practice: It is important that you know how to process data by hand before entrusting it to the
computer. This exercise gives you some practice in this. The problem is to plot a histogram of
the data in the table below. You decide to sort the data into 5 equal width classes divided
between 0 and 10.
First, enter the upper and lower boundaries of each class into the table below. There should be
5 equally spaced intervals with the upper boundary of class n equal to the lower boundary of
class n+1.
Class #    Lower Boundary    Upper Boundary
1          0
2
3
4
5                            10

Next, enter the class # for each of the numbers in the table of numbers below. You can do this
by inspection. Verify also that equation 4 gives the correct class number. This formula is
needed when the script is written to find the class.
Value    Class #
7.64
6.22
8.75
1.61
4.17
6.91
1.88
1.96
8.23
4.31
8.84
5.59
1.5
5.94
5.78
2.66
2.8
8.34
3.68
4.91
Now, count the number of data values in each class and enter them into the column labeled (1) of the
table below. Then enter the class frequency and class probability density values according to
normalizations 1, 2, and 3 described under "Height of the Histogram Bars".

Class #   (1) # in Class   (2) Class Frequency   (3) Class Probability Density   Area of Class
1
2
3
4
5
For (1) you should have 4,3,6,3,4 and for (2), you should have 2,1.5,3,1.5,2 and for (3) you
should have 0.1,0.075,0.15,0.075,0.1 and for the area of each class, you should have
0.2,0.15,0.3,0.15,0.2.

Suppose the above data were sampled from a continuous uniform distribution, where every number
between 0 and 10 is equally likely. The probability of getting a number in any class is then 1/5 = 0.2.
Notice that the area of each class scatters around 0.2 and that the sum of all of the areas is 1.0.
Now, draw up the histogram on some scratch paper.
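
After working the exercise by hand, you can check your table with a short script. The class-number rule used here, floor((x - lower) / width) + 1, is an assumption made for this sketch; it may or may not be written the same way as the class formula referred to in the exercise.

```python
values = [7.64, 6.22, 8.75, 1.61, 4.17, 6.91, 1.88, 1.96, 8.23, 4.31,
          8.84, 5.59, 1.5, 5.94, 5.78, 2.66, 2.8, 8.34, 3.68, 4.91]

lower, upper, n_classes = 0.0, 10.0, 5
width = (upper - lower) / n_classes   # 2.0

def class_number(x):
    # Assumed rule: 1-based class index; a value on a boundary goes to the upper class
    return int((x - lower) // width) + 1

counts = [0] * n_classes
for x in values:
    counts[class_number(x) - 1] += 1

frequency = [c / width for c in counts]               # normalization 2
density = [c / (width * len(values)) for c in counts] # normalization 3
area = [d * width for d in density]                   # fraction of data in each class

print(counts)     # [4, 3, 6, 3, 4]
print(frequency)  # [2.0, 1.5, 3.0, 1.5, 2.0]
```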

Circular Histograms
Circular histograms must also be constructed in a particular way. The important point is
that the area of the circular histogram element must be proportional to the number of data points
it contains. Figure 1.7 shows a portion of an angular histogram. The area of a circle of radius R
is $\pi R^2$. Since a complete circle represents $2\pi$ radians of angular rotation (360°), the area of a
pie shaped segment of a circle that is W radians wide is $A_{class} = \pi R^2 (W/2\pi) = R^2 W/2$. W is
analogous to the width of the histogram bar from before, but its units are radians. What remains is to normalize
the area. We can also say that the area of a segment divided by the total area of the histogram
should be given by the number of data in the class divided by the total number of data. That is,
the fractional area represented by a class should equal the fraction of the total that is contained
within the class. This gives us the relation: $A_{class}/A = f/N$, where $A_{class}$ is the
area of the pie-shaped histogram segment, A is the total area of the histogram, f is the number of
data points in the class, and N is the total number of data points.

Figure 1.7 Histogram for directional data.

The equation is, then:

$$ \frac{A_{class}}{A} = \frac{R^2 W}{2A} = \frac{f}{N} $$

and

$$ R^2 = \frac{2Af}{WN} \qquad \text{(equation for radius of pie segment in circular histogram)} $$

So, you first decide how many segments you want for the histogram. If you want 10 segments,
you have $W = 2\pi/10$ radians. All that is needed is to decide the total area of the histogram
based on how physically large you want the plot to be, determine the class frequencies (how
many are in each class), compute R, and make the plot.
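
The recipe above can be sketched in Python as follows. The function name, the total area of 100 (in arbitrary plot units), and the class counts are all made up for illustration; only the formula for R comes from the text.

```python
import math

def segment_radii(counts, total_area):
    """Radius of each pie segment so that segment area is proportional to its count."""
    n_classes = len(counts)
    w = 2.0 * math.pi / n_classes   # angular width W of each class, in radians
    n_total = sum(counts)
    # R^2 = 2 A f / (W N)
    return [math.sqrt(2.0 * total_area * f / (w * n_total)) for f in counts]

# Hypothetical class frequencies for 10 segments of 36 degrees each
counts = [3, 5, 9, 12, 7, 4, 2, 1, 3, 4]
radii = segment_radii(counts, total_area=100.0)

# Check: each segment area R^2 W / 2 should be the fraction f/N of the total area
w = 2.0 * math.pi / len(counts)
areas = [r * r * w / 2.0 for r in radii]
print([round(r, 2) for r in radii])
```

Because the radius scales with the square root of the class count, a class with four times as many data points is drawn with only twice the radius.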

Thinking in "Statistics"
Some of the terms of statistics were defined for studies of a population of people. A
pollster wants to predict the outcome of an upcoming election. He/she can't poll every person. It
would be too expensive. So, a sample of individual members of the population is called up on
the telephone and asked their opinion. It is the responsibility of the pollster to sample the
population in such a way that the results will reflect the opinions of the population as a whole.
Data taken from a sample are used to infer the opinions of the population. In fact, this is the
central problem of statistics. We take a sample of measurements of our study area, then infer
the properties of the entire study area from that sample.
But, it isn't enough just to produce a single number that is the "answer". Some measure
of the accuracy of that number is required. This measure is usually expressed as a "confidence
limit". We ask: if this experiment were repeated many times, what are the upper and lower
limits within which 95% of the results lie? If we wanted to be safe, we could specify 99%, or
some other percentage. But, the critical piece of information is the statement that yes, there are
errors, but there is an XX% chance that our result lies between the two specified limits.
The ideas of a sample and a population apply to geological statistics, as well as
opinion polling. The term population as it is used in statistics refers to the set of all possible
measurements of a particular quantity. For example, if we are interested in the nature of the
pebbles in a given conglomerate, the population will consist of all the pebbles in the
conglomerate. For a dice toss experiment, the population can be considered the infinity of
possible dice toss outcomes. When the experiment is repeated, the population of all possible
dice throws is being sampled repeatedly. In other words, we visualize an abstract collection of
all possible values of the quantity of interest. Then, when we make a measurement, toss the die,
throw the coin, whatever, we are sampling this abstract population. I like to think of it as a
giant grab bag full of small pieces of paper with a number written on each one. A sample is
taken by reaching in and grabbing out N pieces of paper and reading the numbers.
Suppose, for example, that you drill some cores of a rock to determine the orientation
of the remanent magnetism for a paleomagnetic study. The rock most likely does not have a
constant magnetic direction throughout its body, so a single core would give a very uncertain
result. The result can be improved by taking a number of cores and averaging the results. The
level of uncertainty will be affected by how many cores are averaged together and the amount of
variation of magnetic direction within the rock itself. The entire rock formation would be
considered the population and the collection of cores would be the sample. It is then the
statistician's task to infer the properties of the population from the measurements on the
sample.

Sampling Methods
Simply going out and making some measurements sounds easy. In practice, the process
is prone to errors and misjudgements. It can be very expensive to launch a field program, gather
data at great expense, then find in the analysis that there are not enough data or the data are
sampled incorrectly. You can usually improve the sampling strategy if you understand as much
as possible about the process or system you are sampling. When this is impossible, small test
experiments can be useful. The following paragraphs discuss some of the issues involved in
designing a sampling strategy.
It is important to take a truly random sample so that errors tend to average out. But,
getting truly random sampling is not always straightforward, especially in the earth sciences,
where values of interest may not be randomly accessible or where, for example, only certain
kinds of rock formations are exposed.
Suppose you want to sample the soil properties in a 1 km2 area. You might be
measuring a soil contaminant, nutrients, porosity, moisture, or any other property appropriate to
the study. You must first answer some very basic questions. Is the parameter you want to
measure distributed randomly within the area, or is there a systematic variation? You must also
determine the source of noise in the measurement. Is the noise due to error in the measurement
instrument, or is it due to natural variations in the properties of the soil? An example of a
systematic variation is a slow change of the parameter across the area you are sampling. For
example, if you are measuring nutrients but portions of the study area have large trees and some
have low plants, you would expect a dependence on this. If you want to study the properties
averaged over a large area, you may want to consider natural variations as noise to be averaged
out. On the other hand, variations of the nutrients that are caused by vegetation differences may
be of interest. It all depends on the goals of your study.
One sampling option is to adopt simple random sampling. The area is divided into a
grid and sampling takes place at randomly selected grid points. Grid points may be selected
using random number tables or a computer's random number generator. In the field, you could
toss a die. One method (Cheeney, 1983) is to divide the length to be sampled into 6 intervals and select one
by tossing the die. The chosen interval is divided into 6 subintervals and one of these is chosen
by die toss. This subdivision can continue as far as needed.
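
Both selection procedures just described can be sketched in Python. The function names, the 10 x 10 grid, the 100 m traverse and the number of samples are assumptions for illustration; the random generator is seeded only so that a run can be repeated.

```python
import random

def random_grid_points(n_rows, n_cols, n_samples, rng):
    """Simple random sampling: pick distinct grid points at random."""
    all_points = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return rng.sample(all_points, n_samples)

def die_subdivision(length, n_tosses, rng):
    """Repeatedly split an interval into 6 parts and keep the part chosen by a die toss."""
    start, size = 0.0, float(length)
    for _ in range(n_tosses):
        size /= 6.0
        face = rng.randint(1, 6)        # simulated die toss, 1 to 6
        start += (face - 1) * size
    return start, start + size

rng = random.Random(0)
grid_sample = random_grid_points(10, 10, 8, rng)
lo, hi = die_subdivision(100.0, 3, rng)   # 100 m traverse, three tosses
print(grid_sample)
print(f"sample between {lo:.2f} m and {hi:.2f} m")
```

Three tosses narrow a 100 m traverse down to an interval of 100/216 m, a little under half a meter.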
Another sampling method is stratified sampling . This method prevents the bunching
of data points that may occur with simple random sampling. For example, we might lay out a
grid of 10 x 10 squares and take a number of randomly located samples within each of the 100
squares. If you are measuring the magnetization direction of a rock outcrop by taking cores
and the random selection system bunched all of the samples in a small portion of the rock, it
would be wasteful to blindly take these data. However, if this selection of random data points
were rejected until a more satisfactory distribution was determined, the statistical assumptions
of randomness would be violated and conclusions based on statistics would be suspect. If you
would otherwise reject bunched sample locations, stratified sampling is the method to choose. Methods of
identifying systematic variations will be discussed in later chapters, when correlation is
discussed.
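
A minimal Python sketch of stratified sampling on the 10 x 10 grid described above follows. The cell size of 100 m (so that the grid covers a 1 km square), the two samples per cell, and the function name are assumptions for illustration.

```python
import random
from collections import Counter

def stratified_points(n_side, per_cell, cell_size, rng):
    """Place `per_cell` uniformly random points inside every cell of an n_side x n_side grid."""
    points = []
    for i in range(n_side):
        for j in range(n_side):
            for _ in range(per_cell):
                x = (i + rng.random()) * cell_size
                y = (j + rng.random()) * cell_size
                points.append((x, y))
    return points

rng = random.Random(1)
points = stratified_points(n_side=10, per_cell=2, cell_size=100.0, rng=rng)

# Count how many points landed in each cell; no cell is empty and none is crowded
cells = Counter((int(x // 100.0), int(y // 100.0)) for x, y in points)
print(len(points), len(cells), set(cells.values()))
```

Each location is still random within its cell, but the samples cannot bunch up in one corner of the study area.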
Another approach would be a systematic sampling scheme, in which you would pick a
location at random and distribute the data points at even intervals from the start point over the
remainder of the field area. You must be careful that this approach doesn't introduce any bias.
If there is any reason to believe that the property being measured varies systematically, this
approach may not work. This method generally reduces the number of data points needed in a
sample, but produces somewhat less precise results than other methods. A systematic sampling
method is used in point counting work in petrography.
How many samples should be taken? As we will see, the required sample size
depends on two major factors. The first is the precision required by the study. The more
precision you want in your results, the more samples you will need to take. The second is the
inherent variability in the population you are sampling. The greater this variability, the greater the
necessary sample size. Of course there are practical limits which must be considered. These
may include the availability of possible samples and the costs involved in sampling. More
complete discussions of sampling theory and problems are given in Chapters 5 and 6. Also, for a
discussion of sampling methods, see Cheeney (1983) and Cochran, W.G. (1977. Sampling
Techniques, 5. New York: Wiley).

Modeling Statistical Interpretations


If you do not aspire to be a mathematician, and most geologists don't, there is a very
easy way to test your statistical inferences. This is by simulating the experiment on the computer.
It is a great way to prove, without mathematics, that your results are valid. Even more useful, the
use of random numbers can also help us understand the principles of statistics. Statistical
simulations will be an important component of the lab exercises.

Generating random numbers in Excel:


There are a couple of ways of generating random numbers in Excel. The first is to use the RAND()
function. It generates a number between 0 and 1. To generate a number between a and b, use
=RAND()*(b-a)+a. RAND() generates numbers with an equal chance of taking on any value between
0 and 1. To get another distribution, use the "Data Analysis" tool, which can be accessed under the
"Tools" menu. When the dialog box comes up, scroll down the list of tools and select "Random
numbers". You will be able to select the distribution you want. Note that these numbers will only be
computed once. If you use the RAND() function instead, the numbers will change every time you do a
"recalculate" operation ("Apple =" on the Mac and "Ctrl =" on the PC).
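
For readers who prefer a scripting language to a spreadsheet, the same scaling of a uniform random
number can be sketched in Python. This is only an illustration, not part of the Excel procedure: the
function name uniform_between, the seed value, and the interval 2 to 5 are arbitrary choices made for
this sketch.

```python
import random


def uniform_between(a, b, rng):
    """Scale a uniform draw on [0, 1) to the interval [a, b), like =RAND()*(b-a)+a."""
    return rng.random() * (b - a) + a


# A fixed seed (arbitrary value) makes the draws repeat on every run,
# much like random numbers that are "computed once".
rng = random.Random(134)
draws = [uniform_between(2.0, 5.0, rng) for _ in range(10000)]

sample_mean = sum(draws) / len(draws)
print(min(draws), max(draws), sample_mean)
```

With many draws, the sample mean should settle near the midpoint of the interval, (a+b)/2, which is a
quick check that the scaling is right.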


Review
After reading this chapter, you should:

Be able to discuss the types of measurement scales discussed in this chapter: discrete,
continuous, nominal, ordinal, counting, interval, ratio and angular.

Understand the difference between accuracy and precision and know how to use significant
figures correctly.

Be able to describe a data distribution in terms of overall shape, central tendency and
dispersion. Know how to find the mean, mode, median, standard deviation and percentiles
for a data distribution.

Be able to construct a histogram of various kinds of data and compute the correct bar
heights.

Be able to describe the important considerations in designing a sampling strategy.

Vocabulary:
sample
population
sample mean
sample variance
sample standard deviation
histogram
bias


Problems
1. Describe the following data as discrete, continuous, nominal, ordinal, counting, interval, ratio
and/or angular.
a. the mineral phases present in a rock
b. the concentration of iron in a rock
c. the age of a rock as determined by U-Pb dating
d. the age of a rock as determined from fossils
e. the size of earthquakes measured on the Richter scale
f. the daily high and daily low temperatures in an area
g. the amount of rainfall in a given locality
h. paleocurrent directions determined from ripple marks
i. O18 values relative to SMOW (standard mean ocean water)

2. Define the terms accuracy and precision by way of a dartboard analogy. That is, draw a
dartboard with 5 darts on it for each of four throwers: one who is accurate but not precise, one who is
precise but not accurate, one who is both precise and accurate, and one who is neither.
3. Give the answers to the following problems to the correct number of significant figures.
a. 13.67 + 4.2 =
b. 2.4 * 4.11 =
4. Using Excel, plot a regular histogram and a cumulative frequency histogram for the following
data set. Be sure to indicate the mean and standard deviation of the data on the plot of the
histogram. Note: you can plot a histogram with any desired bar heights by directly entering the
bar heights in the field labeled class frequencies and clicking on the Plot Data button.
43    47    48    49    49
52    53    54    55    55
56    57    57    58    58
59    60    61    62    63
64    64    64    65    65
65    65    65    65    65
67    68    68    69    69
69    70    70    70    70
72    72    73    74    74
78    78    79    79    83

5. In designing a water well, you need to select a screen slot size that will retain about 90% of
the filter pack material surrounding the well hole. Data from a sieve analysis of this filter pack
material are shown below. Construct a cumulative frequency diagram, using Excel, and
determine the necessary slot size by plotting the cumulative % caught on the sieves on the y-axis
and the sieve slot size on the x-axis.

weight % caught on sieve    sieve slot size (mm)
2                           0.0
8                           0.4
20                          0.6
30                          0.8
30                          1.1
10                          1.7

6. What is the difference between parametric and non-parametric statistics?


Chapter 2

Plotting Data With Two Variables


Often each data item has more than a single number associated with it. Porosity may change
with height, earthquake signal amplitude changes with distance from the source, radiogenic composition
changes with age, oxygen isotope ratios change with temperature, etc. It is these relationships that tell
us the story we want to extract from the data. There may be a large number of variables associated
with each data item. This is the topic of Multi-Variate Analysis, and is beyond the scope of this
book. However, the geologist will often face the problem of processing data with only two variables.
This chapter treats the scaling and plotting of x-y data, fitting of the basic equations to data, and
how noise in the data can affect its interpretation.

Plotting X - Y Data
Most of your X-Y data plots will be created in a charting program. A simple x-y chart created
in Microsoft Excel is shown below. The chart has a title, a label for each axis, and a legend that
describes the symbols that represent the two data sets that are plotted. The importance of making clear
data plots cannot be overemphasized. The reader should be able to understand the content of the plot
by looking at the plot and its caption.

[Figure: "Plot of atmospheric gas concentrations at Mauna Loa Observatory". X axis: Time (years before
the present). Y axis: Concentration (ppm). Legend: Data #1, Data #2.]

Figure 2.1a. This is a sample data plot showing correct axis and data labeling. When plotting data, error
bars should also be shown on the plot.

Logarithmic Scaling
Often, it is useful to plot values on a logarithmic scale, in which the logarithm of the X values, the Y
values, or both is plotted. The most common reason for plotting on a logarithmic scale is that the data
values span many orders of magnitude. This is true for the earthquake magnitude scale, where ground
motion induced by quakes varies from sub-micron to meter amplitudes, a range of six decades or more.

Version December 20, 2001
University of California, 2001

Properties of logarithms reviewed:

We ask, what is the value of X in the formula $B^X = N$, where N is the number of interest, B is the base,
and X is the logarithm of N. For example, suppose we are interested in base = 10. This is the base we
will use almost exclusively in this chapter. If N = 100, then we ask what is the value of X in $10^X = 100$?
It is easy to see that $10^2 = 100$, so log(100) = 2. It is simple to get orders of magnitude from log
values. From this, it is simple to derive other properties of logarithms. Some of the important properties
of logarithms are:
$\log(ab) = \log(a) + \log(b)$
$\log(a^b) = b \log(a)$
$\log(a/b) = \log(a) - \log(b)$
** Note: The logarithm of a negative number does not exist. If you try to take the log of a negative
number in Excel, the value that is returned will be #NUM!, which means Not A Number.
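
The logarithm properties reviewed in this section can be checked numerically. The following is a
minimal Python sketch, given only as an illustration alongside the Excel approach used in this book;
the test values a = 250 and b = 4 are arbitrary assumptions.

```python
import math

a, b = 250.0, 4.0  # arbitrary positive test values

# log(ab) = log(a) + log(b)
product_rule = math.isclose(math.log10(a * b), math.log10(a) + math.log10(b))
# log(a^b) = b log(a)
power_rule = math.isclose(math.log10(a ** b), b * math.log10(a))
# log(a/b) = log(a) - log(b)
quotient_rule = math.isclose(math.log10(a / b), math.log10(a) - math.log10(b))
print(product_rule, power_rule, quotient_rule)

# Orders of magnitude: the base-10 log of 100 is 2 because 10**2 = 100.
log_of_100 = math.log10(100)
print(log_of_100)

# The log of a negative number does not exist. Where Excel returns #NUM!,
# Python's math.log10 raises a ValueError.
try:
    math.log10(-1.0)
except ValueError:
    negative_log_fails = True
else:
    negative_log_fails = False
print(negative_log_fails)
```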

[Figure 2.1b: "X-Y plot with linear axes". The Y axis runs from 0 to 500000.]
Figure 2.1b. When data vary over many decades, a logarithmic scale is used.

Figure 2.1b illustrates the need for log plots. Very little detail is shown for most of the data points. The
last data value determines the plot scaling, and the other points lie along the X-axis. The plots of Figure
2.1c show two ways of making an X vs log(Y) plot. The left plot is conventionally labeled. This is most
commonly used because it is easy to read the original data values from the Y-axis. In the right plot, we
numerically take the log of the Y data values, then make the plot using the transformed Y values. Thus,
what we see on the Y-axis is the true logarithm of the Y data. This is the simplest method to use when
determining the best-fit coefficients of the equation that describes the data (the method is described in
the next section). The reason for this is that the fitting equations require the slope of the line, and this
slope is best calculated from the log(Y) axis values. Students generally get confused when trying to use
the left plot to do this. Excel labels log scales as in the left figure.

[Figure 2.1c: two panels, each titled "X-Y plot with log(Y) axis". Left panel: Y axis labeled with the
original data values, 1 to 1000000. Right panel: Y axis labeled with the log(Y) values, up to 6.]
Figure 2.1c. These two figures illustrate 2 ways of labelling the Y axis for a log plot. The left plot is the most
conventional and is what Excel produces. The right plot is most useful for calculating the coefficients of the equation
of the line that fits the data. Notice, on the right, that the log of the y values is taken, then a linear y axis is used to
plot the values.

Data Plots and Determining Functional Dependence


The use of logarithmic plot scales can both illuminate and obscure important facts about your data. It is
possible to determine the functional form of the underlying equation followed by the data, by selecting
the correct kind of plot.

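[Figure: a straight line with its intercept marked and a slope triangle with a rise of 0.75 over a run
of 0.5, so that slope = 0.75/0.5.]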

Figure 2.2 Plot of a straight line, showing computation of slope.


The following functional forms commonly occur in problems of interest to earth scientists:

$y = mx + b$    (linear dependence)    (2-1)

$y = A x^n + b$    (power law dependence)    (2-2)

$y = A e^{nx} + b$    (exponential dependence)    (2-3)

Equation 2-1 is the familiar equation of a straight line. It is characterized by its slope and its intercept,
which is the value of y at x = 0. A diagram is shown in figure 2.2. The slope, m, is 1.5 and the intercept,
b, is 2. It is possible to determine, using graphical methods, the unknown constants in equations 2-2
and 2-3. The following operations demonstrate how this is done.
Power Law Equation 2-2
Rearranging equation 2-2 slightly, it becomes:
$y - b = A x^n$
Taking the log of both sides, we have:
$\log(y - b) = \log(A) + \log(x^n)$
$\log(y - b) = \log(A) + n \log(x)$
So, if we define new variables,
$Y_l = \log(y - b)$
and
$X_l = \log(x)$
the equation becomes:
$Y_l = n X_l + \log(A)$
We can plot $Y_l$ vs $X_l$ and the slope will be equal to n, the power of x in equation 2-2. The intercept will
be the value of log(A).

Method: If the data follows the power law dependence (eq. 2-2), a plot of the log(x) vs log(y-b) will produce a
straight line. The b is problematic. For many power law dependencies, it is zero. If it is not zero, you will
need to use a computer to fit the best line to the data. For the purposes of this class, always try to get a fit
with b=0 first.
If you get a straight line with log(x) vs log(y), then find the slope of the line. You should use the calculated
values of log(x) and log(y) to get the slope. This slope is then equal to n in the above equation. The intercept
is equal to log(A), so you can solve for A. The important thing to remember is to plot the calculated values to
determine the slope and intercept. Also, verify your answer by putting in one or two values for x and see if
they agree with the y values you are trying to fit. Don't omit this important self-test check!
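
The power law procedure can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the data are hypothetical and noise free, generated from y = 3x^1.5 with b = 0,
and the slope is taken from two points on the line. With real, noisy data a least squares fit is the
better choice.

```python
import math

# Hypothetical noise-free data generated from y = 3 * x**1.5 (so b = 0)
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 1.5 for x in xs]

# New variables: X_l = log(x) and Y_l = log(y)
X = [math.log10(x) for x in xs]
Y = [math.log10(y) for y in ys]

# Slope from the calculated log values at the two end points of the line
slope = (Y[-1] - Y[0]) / (X[-1] - X[0])
# Intercept from any point on the line: intercept = Y_l - slope * X_l
intercept = Y[0] - slope * X[0]

n = slope            # the power of x in equation 2-2
A = 10 ** intercept  # because the intercept equals log(A)
print(n, A)

# Self-test check: substitute the x values and compare with the original y values
predicted = [A * x ** n for x in xs]
```

The last line is the self-test recommended in the text: the predicted values should agree with the y
values that were fitted.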
Exponential Equation 2-3
Similarly, for equation 2-3, we have:
$y - b = A e^{nx}$
$\log(y - b) = \log(A) + n x \log(e)$
We let:
$Y_l = \log(y - b)$
So,
$Y_l = n \log(e)\, x + \log(A)$    (form of linear equation)
We can see that the Y axis should be plotted on a logarithmic scale as log(y-b) and the X axis on a
linear scale. The slope will be the value of n log(e). There is a complication in this procedure for these
two functional forms. We do not know the value of the constant b. Often we expect that b = 0, as in
the case of radioactive decay. If it is strongly suspected that the data follow a power law with a
nonzero b, then b could be varied in the plot until the best straight line is achieved.
So far, appeals to intuition are being made so that you obtain an understanding of the underlying
principles. However, the fitting of straight lines in the presence of noisy data is fraught with dangerous
traps in interpretation. Questions that must be asked of any data fit are:
a) What other values of the parameters produce an equally good fit?
b) Do other functional dependencies produce an equally good fit?
A more quantitative definition of a good fit will be given when computer curve fitting is discussed
using Excel.

Helpful hints:
It is not necessary to have the X = 0 value plotted to determine the intercept, which produces the
b value in equations 2-1 to 2-3. Once it has been determined that the data follow a straight line
dependence, any X,Y value (from the straight line) may be used to solve for b. Just read an X,Y
value from the graph, substitute it into the equation (slope is known, but b is not), and solve for b.


Example of finding the function's constants:

Suppose that we have the following data. This data was calculated using the formula y = exp(2*x).
Let's plot it in several different ways, then see how to recover the original constants of this equation. But
first, note that this equation is of the form:
$y - b = A e^{nx}$
This form is the same as that in equation 2-3, but with the b on the left hand side. For now, don't pay
attention to the column labeled log(y).

x     y           log(y)
1     7.389056    0.868589
4     2980.958    3.474356
5     22026.47    4.342945
10    4.85E+08    8.68589
20    2.35E+17    17.37178

The first thing to notice about the numbers in the y column is that they range from about 7.39 to
$2.35 \times 10^{17}$, an extremely large range. An x-y plot of this data is shown in figure 2.3 below.

[Figure: "y = exp(2*x)" plotted on linear axes. X axis: 0 to 30. Y axis: y, 0 to 2.5E+17.]

Figure 2.3. x-y plot of the data in the example.

Notice that the extreme range of the data causes all of the data except the largest to be plotted on the x
axis. We suspect that we should make the y axis into a log axis. Figure 2.4 shows this.

[Figure: "y = exp(2*x)". X axis: 0 to 30. Y axis: log(y), 0 to 20.]

Figure 2.4. The example data is plotted on a log y axis. Note: exp(x) means $e^x$.

Notice that the data plots as a straight line. We can measure the slope and intercept of this straight line
and find the "A" and "n" coefficients of the equation, to make sure they agree with what we already
know. But first, we note that the labels on the Y axis are still reflecting the original data values. If we use
these numbers to calculate the slope and intercept of the straight line, we will get the wrong answer. This
is because the Excel plot routine, just to be nice and convenient for those who want to read the original
numbers from the Y scale, did not really label the log(y) values. The easiest way to get the log values is
to make a third column that is log(y), then do a new plot of x vs log(y). This plot is shown in figure 2.5
below.
We can see that the slope of this line is 0.869. It can be measured from the plot itself, or calculated
from the table of numbers (don't do this with real data; it's best to do a least squares fit when data have
errors):
$\text{slope} = \frac{17.37 - 4.34}{20 - 5} = 0.87$
See if you can get these numbers yourself. The plot above also shows the y intercept to be 0. So,
referring back to our equation:
$Y_l = n \log(e)\, x + \log(A)$
we can see that it has the form
$Y = mx + b$
where m = n log(e) and b = log(A).
In Excel, log(e) is computed as =LOG(EXP(1)), which equals about 0.434. So, solving for n, we
have:
0.87 = n*0.434
or n = 2.00. Hurray, that was our value!


Also, since b = 0 (y intercept), then we solve log(A) = 0. The log(1) = 0, and our original value for A is
1. So, we have created some data artificially, pretended we didn't know where it came from, then
worked backwards to get our initial equation. This is the procedure for all of the other functional forms.
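
The same recovery of A and n can be sketched in Python as a cross-check on the spreadsheet work. This
is an illustration only: it regenerates the example table from y = exp(2*x) and uses the same two table
rows (x = 5 and x = 20) for the slope. The variable names are choices made for this sketch.

```python
import math

xs = [1, 4, 5, 10, 20]
ys = [math.exp(2 * x) for x in xs]      # same formula as the example table
log_y = [math.log10(y) for y in ys]     # the third column of the table, log(y)

# Slope from the table rows at x = 5 and x = 20, as in the text
slope = (log_y[4] - log_y[2]) / (xs[4] - xs[2])
# Intercept from any point on the line: b = Y - m*x
intercept = log_y[0] - slope * xs[0]

log_e = math.log10(math.e)              # Excel: =LOG(EXP(1)), about 0.434
n = slope / log_e                       # because m = n log(e)
A = 10 ** intercept                     # because b = log(A)
print(round(slope, 3), round(n, 2), round(A, 2))
```

Because the data are noise free, the recovered constants match the formula that generated them. With
real data the slope should come from a least squares fit rather than from two points.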
Complications:
If the values of one of the variables are negative, you can't take the log, because the logarithm of a
negative number has no meaning. But you can make the substitution x' = -x in the equations. Since x' is
positive for all values, you can take log(x') = log(-x). You then need to adjust the coefficients of the
fitted equation to account for the substitution.
Also, most data have errors. We did the example with noise free data, so it worked out perfectly. In the
presence of errors, the coefficients that you solve for will have errors too. Also, sometimes you cannot
tell whether a log or linear axis gives the best fit. You have to use what you know about the process that
created the data and use your best judgement. It is never wise to blindly apply mathematical
techniques without knowing something about the processes that created the data.

Review
After reading this chapter, you should:

Know how to find the functional dependence of common forms and find the unknown
parameters in the function, from the plot. Be sure that you can derive the equations for slope
and intercept in all three cases.

Understand how to make linear and logarithmic plots.


Problems:
Problem 1: This problem is designed to support your understanding of the simple derivations of the
slopes and intercepts for the 3 functional forms of equations that have been discussed.
1a) If all values of x are negative, you cannot take the log of these numbers to test for a power law or
exponential dependence. Derive the equation for slope and intercept for power law and exponential
dependencies when all values of x are negative.
Problem 2: Do problem 1, but when all values of y are negative.
Problem 3. Seismologists have noticed that the relationship between the magnitude of earthquakes and
their frequency is described by the equation
y* = a - bx, where y* is the log of the number of earthquakes and x is the magnitude of the earthquakes.
For the following data, find 'b'. Use the mid-point of the range of values given for x.

magnitude of earthquakes    number of earthquakes
6.5-7.0                     2
6.0-6.5                     3
5.5-6.0                     10
5.0-5.5                     16
4.5-5.0                     72
4.0-4.5                     181
3.5-4.0                     483
3.0-3.5                     846
2.5-3.0                     302
2.0-2.5                     73

Problem 4. Determine the half-life of chemical B based on the following experimental data. (The
half-life is the time at which one-half of the chemical remains.)

fraction chemical left    time (days)

0.97 0.2
0.92 0.5
0.84 1.0
0.71 2.0
0.42 5.0
0.18 10.0
0.03 20.0
0.006 30.0


Problem 5.
The following data were collected during an experiment to determine the relationship
between temperature and vapor pressure for an organic chemical. From previous experience you know
that the general form of the equation that describes this relationship is given below.
$\ln P = \frac{A}{T} + B$

Find A and B. What is the vapor pressure at 37°C (310 K)?

P (atm)    T (K)
0.059      283
0.13       298
0.36       323
0.75       343

Hint: make the 1/T dependence linear with a substitution of variables.


Problem 6: For the following datasets, determine the functional form of the underlying equation and
its unknown parameters. You can assume that b = 0 for equations that are not linear.
Dataset 1
0,3
1,5
2,7
3,9
4,11
5,13
6,15
7,17
8,19
9,21

Dataset 2
0,2
1,1.1
2,0.6
3,0.33
4,0.18
5,0.1
6,0.05
7,0.03
8,0.02
9,0.01


Dataset 3
0,0
1,1
2,5.66
3,15.59
4,32
5,55.9
6,88.18
7,129.64
8,181.02
9,243

Dataset 4
0,2.02
0.83,1.23
1.67,0.71
2.5,0.45
3.33,0.26
4.17,0.15
5,0.16
5.83,0.08
6.67,0.02
7.5,0.03
8.33,0.02
9.17,0.04


CHAPTER 3
Correlation and Regression
In this chapter, we discuss correlation and regression for two sets of data measured on a
continuous scale. We begin with a discussion of scatter diagrams.

Scatter Diagrams
A scatter diagram is simply an x,y plot of the two variables of concern. For example, figure 3.1
shows a scatter diagram of length and width of fossil A. These data are listed in Table 3.1.

_______________________
length    width
18.4      15.4
16.9      15.1
13.6      10.9
11.4      9.7
7.8       7.4
6.3       5.3
_______________________
Table 3.1

[Figure 3.1: scatter plot. X axis: Length of fossil A. Y axis: Width (mm) of fossil A.]
Figure 3.1. Plot of data in table 3.1.

Powerful data analysis software has made it easy to perform complex statistical analyses on your data.
This is very good, but there are pitfalls in relying too much on sophisticated computer calculations when
you do not completely understand how to do the calculations yourself. It is important to develop intuition
about the data and the expected results from a particular analysis. This intuition will help you avoid stupid
mistakes in interpretation and also catch numerical errors in data entry. Before you do a computer
calculation, you should always estimate a range of reasonable output values. Then, when/if the result of the
computer calculation is quite different from what you expected, you have either made an error in
specifying the analysis to the computer software, or you don't understand what you are computing.
Either situation requires careful investigation.

A good example of the need to understand the calculation at more than a superficial level is the
computation of the correlation between two variables, x and y. An x-y scatter plot is always done first.
Then you can visually determine whether there might be a correlation and whether it is reasonable to
calculate a correlation coefficient. Some interesting misinterpretations of the correlation coefficient will
be illustrated in the following pages. Even though the computer is a great tool for doing extensive
computation, you should do the calculation by hand, at least once, to make sure you understand the
process.
Version: December 20, 2001
University of California, 2001

Correlation and Regression

3-1

Variance and Covariance


The variance and covariance are important quantities, and are introduced here so we can use
them in the next section. The variance is given by:

$$\mathrm{var}(x) = s_x^2 = \frac{\sum_n (x_i - \bar{x})^2}{n - 1} = \frac{\sum_n x_i^2 - \frac{1}{n}\left(\sum_n x_i\right)^2}{n - 1} \qquad \text{(A)}$$

Notice that the variance, in the above equation, is the standard deviation of the data squared. The
standard deviation was defined in chapter 1. The second form of the variance (right hand side of the
equation) is exactly equivalent to the standard definition, but is sometimes convenient to use when
calculating with a calculator, or when deriving equations. In general, the variance will increase as scatter
in the data values increases.
Another important quantity is the "covariance" between two variables, x and y. The formula for
the covariance is given below. It is very analogous to the variance, but includes both x and y values.
Notice the similarities between the two equations. Instead of squaring x, we have x times y values. This
keeps the dimensions the same.

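$$\mathrm{cov}(x, y) = s_{xy}^2 = \frac{\sum_n (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{\sum_n x_i y_i - \frac{1}{n}\left(\sum_n x_i\right)\left(\sum_n y_i\right)}{n - 1} \qquad \text{(B)}$$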
The covariance is an expression of the relationship between the x and y data points. Notice that it is
similar to the standard deviation of a single variable squared, but instead of squaring values, x and y
values are multiplied.
Hints on understanding these formulas: It is very important to become familiar with the summation
notation. A few minutes to focus on this notation will be very worth your while when you try to understand
more complex concepts and formulas later in this chapter. Suppose there are n values of x. Suppose these
values are: 1, 2 and 3 (for simplicity). The formula $\sum_n x_i$ means to add all values of x together. Since
there are 3 values, n = 3. The n on the $\sum$ sign means to sum over all of the n values of x. So, for the
simple data set that means we do: $\sum_n x_i = 1 + 2 + 3 = 6$. Now find the variance of this simple dataset
using formula A above. Use both forms of the formula to convince yourself that they are equivalent. After
you do this, assume a y dataset to be 2, 3, and 4. Now do the covariance formula (B) and see what you get.
Do both forms. They should agree. If they do, you will have mastered the summation notation.
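
Once you have worked the hint by hand, a short Python sketch like the one below can confirm that the
two forms of formulas A and B agree. It is an illustration only; the function names variance_forms and
covariance_forms are hypothetical names chosen for this sketch.

```python
def variance_forms(x):
    """Return the variance from the definition and from the second form of formula A."""
    n = len(x)
    mean = sum(x) / n
    definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    second_form = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)
    return definition, second_form


def covariance_forms(x, y):
    """Return the covariance from the definition and from the second form of formula B."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    definition = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    second_form = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)
    return definition, second_form


x = [1, 2, 3]
y = [2, 3, 4]
var_a, var_b = variance_forms(x)
cov_a, cov_b = covariance_forms(x, y)
print(var_a, var_b, cov_a, cov_b)
```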


More details: Since we can write the mean of x as $\bar{x} = \frac{1}{n}\sum_n x_i$, the second form of equation A
above can be written as:

$$s_x^2 = \frac{\sum_n x_i^2 - n\bar{x}^2}{n - 1}$$

We will use this form later in this chapter. Find an equivalent simplification for formula B above.

Correlation coefficients
The problem with the previous formula for covariance is that its actual value is not simply related
to the relationship between x and y. It would be more elegant if we had a scale where 0 implied no
relationship and 1 (or -1) implied the maximum relationship. This can be achieved by defining the
Pearson's correlation coefficient. The Pearson's correlation coefficient, denoted by rxy is a linear
correlation coefficient; that is, it is used to assess the linear relationship between two variables. It is
used for data that are random and normally distributed. The xy subscripts are used to emphasize the fact
that the correlation is between the variables, x and y. This coefficient is very important in the least
squares fit of a straight line to x and y data.
The value of rxy can vary from -1 to +1. When the two variables covary exactly in a linear
manner and one variable increases as the other increases, rxy =+1. When one variable increases as the
other decreases, rxy =-1. When there is no linear correlation between the two variables, rxy =0. Figure
3.2 shows some scatter diagrams for various values of rxy.

[Figure 3.2: four scatter plots, with panels labeled r=-1, r=+1, r=0 and r=+0.8.]
Figure 3.2. Scatter plots for different values of the Pearson correlation coefficient, rxy. Notice that the
plots show increasing scatter as the r value decreases toward 0.


The correlation coefficient, rxy, is calculated as

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sqrt{\dfrac{\sum_i (x_i - \bar{x})^2}{n - 1}}\,\sqrt{\dfrac{\sum_i (y_i - \bar{y})^2}{n - 1}}} = \frac{s_{xy}^2}{s_x s_y} \qquad \text{(3-1)}$$

Another form is:

$$r_{xy} = \frac{\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)/n}{\sqrt{\left[\sum_i x_i^2 - \left(\sum_i x_i\right)^2/n\right]\left[\sum_i y_i^2 - \left(\sum_i y_i\right)^2/n\right]}} \qquad \text{(3-2)}$$

where x is one variable, y is the second variable and n is the sample size. It doesn't matter which
variable we call 'x' and which we call 'y' in this case. Notice that all we had to do to convert from the
covariance was to divide by $s_x s_y$. This division "normalizes" the value of the covariance so that it varies
between -1 and +1.
For the data in Table 3.1 (to make sure you understand the calculation, see if you can
duplicate the numbers given below; they correspond to eq. 3-2):

$$r = \frac{888.48 - \dfrac{(74.4)(63.8)}{6}}{\sqrt{\left(1039.62 - \dfrac{74.4^2}{6}\right)\left(760.92 - \dfrac{63.8^2}{6}\right)}} = 0.99$$

Extra help: When calculating the above values, it is very important to do the calculations in the correct order.
For example, suppose you have 3 numbers, say 3, 2, and 5. The sum of them is 10. In our summation
notation, this is:

$$\sum_{i=1}^{N} x_i = (3 + 2 + 5) = 10$$

Now if we square the result, we get:

$$\left(\sum_{i=1}^{N} x_i\right)^2 = (3 + 2 + 5)^2 = 10^2 = 100$$

But, suppose we square the values before adding them together. This is indicated as:

$$\sum_{i=1}^{N} x_i^2 = (3)^2 + (2)^2 + (5)^2 = 9 + 4 + 25 = 38$$

So, we got a value of 100 by adding the numbers first, then squaring, but a value of 38 by squaring the
numbers first, then adding. Clearly, this is an important effect and you should be very careful which of the
procedures is indicated. It is very important that you become familiar with the summation notation. This is
best done by substituting numbers into the examples in this book until you are comfortable and get answers
that agree with the book's.
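
After doing the calculation by hand, the sums in equation 3-2 can be checked with a short Python
sketch such as the one below. It is an illustration only, using the Table 3.1 data; the variable names
are choices made for this sketch. Note how the code keeps "multiply or square first, then add" separate
from "add first, then square".

```python
import math

# Table 3.1: length and width of fossil A
length = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]
width = [15.4, 15.1, 10.9, 9.7, 7.4, 5.3]
n = len(length)

sum_x = sum(length)
sum_y = sum(width)
sum_xy = sum(x * y for x, y in zip(length, width))  # multiply first, then add
sum_x2 = sum(x ** 2 for x in length)                # square first, then add
sum_y2 = sum(y ** 2 for y in width)

# Equation 3-2; sum_x ** 2 is "add first, then square"
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator
print(round(sum_x, 2), round(sum_y, 2), round(sum_xy, 2), round(r, 2))
```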

We have calculated rxy, but as yet, we have said nothing about the significance of the
correlation. The term "significance" has meaning both in real life and in statistics. An experimental result
may give us a number, but is that number "significant"? In statistics, we ask whether this number is highly
probable, given the errors (or randomness) in the data. For example, suppose you are a psychic
studying psycho-kinesis, which is the use of the mind to influence matter. You concentrate on
"heads". A coin is tossed once and the side that comes up is "heads". Wow! Is this significant? Does it
mean anything, or could the side just as easily have been "tails"? The probability of heads coming up is
1/2. Most would agree that a 50-50 probability is pretty "insignificant" and psycho-kinetic powers
remain unproven. But, suppose, after 100 tries, the coin toss favors heads 75% of the time. This result is
highly unlikely to be due to randomness alone. Therefore the "significance" of the result is much greater.
This is an important point, and applies to correlation as well. Intuitively, the significance of a correlation
of 0.9 is much greater when the sample size is very large than when the sample size is very small. For a
small sample size, the alignment of data points along a straight line may be fortuitous, but when many data
points lie along a straight line, the case becomes much more convincing, or "significant". The calculation
of "significance" will be discussed in greater depth in later chapters.
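
The coin toss argument can be made concrete with a short calculation. The Python sketch below assumes
a fair coin and independent tosses, and adds up the chance of every outcome with at least the stated
number of heads. The counting formula it uses is an assumption of this sketch, not something derived in
this chapter, and the function name prob_at_least is hypothetical.

```python
from math import comb


def prob_at_least(k_min, n_tosses, p_heads=0.5):
    """Probability of k_min or more heads in n_tosses independent tosses."""
    return sum(
        comb(n_tosses, k) * p_heads ** k * (1 - p_heads) ** (n_tosses - k)
        for k in range(k_min, n_tosses + 1)
    )


one_toss = prob_at_least(1, 1)         # "heads" on a single toss
many_tosses = prob_at_least(75, 100)   # 75 or more heads in 100 tosses
print(one_toss, many_tosses)
```

The single toss has a probability of one half, while 75 or more heads in 100 tosses of a fair coin is
far rarer, which is why the second result carries so much more "significance".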

[Figure 3.3: two scatter plots. Panel (a) is labeled r=0; panel (b) is labeled r=0.8.]

Figure 3.3. Data which show a low correlation coefficient, yet are obviously correlated. These kinds of data illustrate
inappropriate applications of the Pearson correlation coefficient.

A few words of warning


There are several factors of which you should be aware when interpreting the significance of
correlation coefficients. First, the Pearson's correlation coefficient, as we have said, is a measure of
linear correlation. Data may be highly correlated but have a zero linear correlation coefficient. Also, an
outlier in the data set can have a large effect on the value of rxy and lead to erroneous conclusions. This
does not mean that you should ignore outliers, but you should be aware of their effect on rxy. Figure 3.3
illustrates these points.

Obviously, these data are not randomly distributed, and a quick look at the scatter plot verifies this.
A problem also occurs when the data are acquired from a 'closed system'. A closed system is
one in which the values of x and y are not completely independent because the fixed total of all
measurements must add to 100% or some other fixed sum. Closed system data occur frequently in
geologic studies. For example, closed systems exist in measurements of percentage compositions in
studies of whole rock chemistry and in work with ternary plots of rock composition. Because the sum
of the various measurements must add to a fixed sum, an increase in the proportion of one variable can
only occur at the expense of one of the other variables. Therefore, negative correlations are artificially
induced.
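
The closed system effect can be demonstrated with simulated data. In the Python sketch below, three
hypothetical components are drawn independently, then each "rock" is converted to percentages that sum
to 100. The seed, the number of samples, and the range of the raw values are arbitrary assumptions made
for this illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient computed from sums of deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2001)  # arbitrary fixed seed so the run is repeatable

# Three independent raw amounts per sample (hypothetical, arbitrary units)
raw = [[rng.uniform(1.0, 10.0) for _ in range(3)] for _ in range(2000)]
# Closing the system: express each sample as percentages that sum to 100
percent = [[100.0 * v / sum(row) for v in row] for row in raw]

r_raw = pearson_r([row[0] for row in raw], [row[1] for row in raw])
r_closed = pearson_r([row[0] for row in percent], [row[1] for row in percent])
print(round(r_raw, 2), round(r_closed, 2))
```

The raw amounts are essentially uncorrelated, but the percentages of the first two components show a
clear negative correlation that comes only from forcing the total to 100%.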
One final point is a reminder that a significant correlation between two variables does not imply
a cause and effect relationship. We may notice that at the end of the month, our bank account balance is
at its lowest level. Does this mean our bank account is somehow linked to the calendar? No, it's the fact
our paycheck is deposited on the first of the month. The day of the month doesn't CAUSE our bank
account to go down, it is just a variable that varies in the same way.

[Figure 3.4: scatter plot. X axis: depth. Y axis: Temperature, 0 to 150.]

Figure 3.4. A plot of depth vs temperature that will be used in the least squares fit example.

Least squares regression


Often, we wish to quantify the relation between two variables and we do this by fitting a line or
curve to the data. Fitting a curve to data is called regression. In this section, we will discuss linear
regression by the method of least squares. This method assumes that a linear correlation exists between
the two variables of concern and that the variables are normally distributed. In this section, we will use
the data listed in Table 3.2 and plotted in Figure 3.4 as an example.
The purpose of linear regression, of course, is to find the best straight line fit through data that
has errors, or some kind of natural variation. The data may have an underlying physical basis for lying on
a straight line, or may just plot in a linear way and the regression just allows us a more convenient
description of the behavior of the data.

There are two situations that will affect our approach to the regression:
1. The error or variation is almost exclusively in one of the two variables. This situation would occur,
for example if one was measuring fault offset vs time. The time measurement would be very precise,
but the offset measurement would be subject to measurement errors and natural variations in
distance due to shifts in monuments.
2. The error or variation is inherent in both variables. In this case, we compute the reduced major axis
line.

_____________________________________________________________
depth          temperature (C)
_____________________________________________________________
0.25           25
0.5            35
1.0            60
2.0            80
3.0            105
_____________________________________________________________
Table 3.2

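[Figure 3.5: scatter plot titled "Temperature vs Depth", with temperature (deg C, 0 to 120) on the vertical axis and depth on the horizontal axis. The line y = ax + b is drawn through the points; one data point is labeled $(x_i, y_i)$ and its misfit from the line is labeled $e_i$.]
Figure 3.5. Plot of temperature/depth data. $(x_i, y_i)$ are the coordinates of the ith data point. $e_i$ is the
difference between the value predicted by the straight line and the actual data value.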

With a least squares regression, we fit a straight line (y = ax + b) to the data by minimizing the sum
of the squares of the distances between each data point and the best-fit line. This distance is measured
in the y-direction. In calculating a correlation coefficient, it did not matter which variable we called 'x'
and which we called 'y'. In least squares regression, it matters. If we are regressing y on x, x is the
independent variable and y is the dependent variable. We assume that the error involved in the
measurement of x is negligible compared to the error involved in the measurement of y.

So, our job now is to find the "best" a and "best" b constants for the y = ax + b equation, so that
the data are fit as well as possible. There are many ways to do this, but the most common way is to do
a "Least Squares Fit". We want all of the $e_i$ fit errors (see figure 3.5) to be as small as possible.
Suppose we calculate the sum of squares of all of the fit errors.
All data are referenced to an arbitrary straight line with slope = a and intercept = b. We don't
expect the line to pass through each data point. At each x data point, there will be a difference, or
"error", between the line and the data point's y value. Here is the equation:
$$y_i = a x_i + b + e_i \tag{3-3}$$

$x_i$ and $y_i$ are the x,y values of the ith data point. a is the slope of the straight line and b is its
intercept. $e_i$ is the error, or misfit. In order to get the "best" values for a and b, we want the sum of
squares of all of the errors to be as small as possible, that is, we minimize:

$$R_d = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ is the predicted y (from the straight line) at point i:

$$\hat{y}_i = b + a x_i \qquad \text{(equation for predicted y)}$$
To find the "best" values for a and b, we can differentiate $R_d$ with respect to a and set the result to zero,
then do the same for b. Then we solve the two equations for a and b.
We do:

$$\frac{\partial R_d}{\partial a} = 0 \quad \text{and} \quad \frac{\partial R_d}{\partial b} = 0$$
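Written out, the differentiation step looks as follows. This is a sketch that uses only the definitions of $R_d$ and $\hat{y}_i$ given above; the intermediate form is not part of the original text.

```latex
\frac{\partial R_d}{\partial a} = -2\sum_{i=1}^{n} x_i\,\bigl(y_i - a x_i - b\bigr) = 0
\qquad
\frac{\partial R_d}{\partial b} = -2\sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr) = 0
```

Dividing each expression by -2 and separating the sums leaves two simultaneous equations in a and b.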

When we do the operations indicated in the two equations, we have 2 equations and 2 unknowns, and
can then solve for a and b. The 2 equations are, after differentiating and simplifying:

$$\sum_{i=1}^{n} x_i y_i = b \sum_{i=1}^{n} x_i + a \sum_{i=1}^{n} x_i^2 \tag{3-4}$$

and

$$\sum_{i=1}^{n} y_i = nb + a \sum_{i=1}^{n} x_i \tag{3-5}$$

Working on eq. 3-5, we divide each side by n, and use $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Then we get:

$$\bar{y} = b + a\bar{x} \tag{C}$$

Now, we can substitute the above equation into eq 3-4, where we get:

$$\sum x_i y_i = (\bar{y} - a\bar{x}) \sum x_i + a \sum x_i^2$$

We multiply out the terms and get:

$$\begin{aligned}
\sum x_i y_i &= \bar{y}\sum x_i - a\bar{x}\sum x_i + a\sum x_i^2 = n\bar{y}\bar{x} - a\bar{x}^2 n + a\sum x_i^2 \\
&= n\bar{y}\bar{x} + a\left(\sum x_i^2 - n\bar{x}^2\right) \\
&= n\bar{y}\bar{x} + a(n-1)s_x^2
\end{aligned}$$

Ok, now rearrange the last line of the above equation:

$$\frac{\sum x_i y_i - n\bar{x}\bar{y}}{n-1} = \frac{a(n-1)}{n-1}\, s_x^2$$

Notice that the left side of the equation is $s_{xy}^2$ and that the n-1 cancels on the right side. This leaves us
with the simple formula:

$$s_{xy}^2 = a s_x^2$$

or:

$$a = \frac{s_{xy}^2}{s_x^2}$$

Remembering our definition of $r_{xy}$, we get equation 3-6 below. Putting this value for a into eq. (C),
above, we get equation 3-7 below.

$$a = r_{xy}\frac{s_y}{s_x} \tag{3-6}$$

$$b = \bar{y} - r_{xy}\frac{s_y}{s_x}\bar{x} \tag{3-7}$$
The sy and sx in the equation are the standard deviation of the y values and the standard deviation of the
x values. rxy is the Pearson correlation coefficient, which was defined earlier. Notice the relationship, in
equation 3-6 between the slope and correlation coefficient. As rxy gets larger, the slope, a gets larger
also, and if rxy = 0, then the slope of the best fit line is zero too.
Finally, we can write the equation for the best fit line as:

$$\hat{y}_i = \left(\bar{y} - r_{xy}\frac{s_y}{s_x}\bar{x}\right) + r_{xy}\frac{s_y}{s_x}x_i \tag{3-8}$$

another useful form, easier to remember, for the above equation is:

$$\frac{\hat{y}_i - \bar{y}}{s_y} = r_{xy}\frac{x_i - \bar{x}}{s_x} \tag{3-9}$$

Discussion: It is important to remember that the best fit line will not go through each data point. From
algebra, we remember that we need exactly two equations to solve for two unknowns. A straight line
has only two unknowns, the slope and the intercept, so two data points determine it exactly. When we have
more than two values for x and y, we have more equations than unknowns.
In fact, if the data have errors, the straight line slope and intercept will be different for each pair of data
points. The problem is that we have too many data points to exactly fit the line to all of them. This is called
an over-determined problem. In fact, it would be meaningless to try to exactly fit each data point, since there
are errors in real data. That is why we only try to find the "best fit" line for the data.

For the data in Table 3.2, the slope a (written in the equivalent sums form of $s_{xy}^2 / s_x^2$) and the
intercept b are:

$$a = \frac{558.75 - \dfrac{(6.75)(305)}{5}}{14.3125 - \dfrac{(6.75)^2}{5}} = 28.27$$

$$b = 61 - (28.27)(1.35) = 22.84$$

and y = 28.3x + 22.8. As a check on the calculation, the best-fit line should pass through the point
$(\bar{x}, \bar{y})$.
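The worked example can be reproduced in a few lines. The sketch below applies the slope formula (sample covariance divided by the variance of x) and eq. (C) to the Table 3.2 data; the function name `least_squares_line` is introduced here only for illustration.

```python
depth = [0.25, 0.5, 1.0, 2.0, 3.0]          # x values from Table 3.2
temperature = [25.0, 35.0, 60.0, 80.0, 105.0]  # y values from Table 3.2

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # eq. (C): the line passes through the means
    return slope, intercept

slope, intercept = least_squares_line(depth, temperature)
mean_depth = sum(depth) / len(depth)
mean_temperature = sum(temperature) / len(temperature)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"line at mean depth: {slope * mean_depth + intercept:.2f} (mean temperature {mean_temperature:.2f})")
```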

Reduced major axis line


In the least squares regression line, we assumed that we knew one of the variables much better
than the other. If this is not the case, then a least squares regression is not appropriate. A reduced
major axis line is another type of linear regression line. Here, the sum of the areas of the triangles between
the data points and the best-fit line is minimized, as shown in Figure 3.5.

[Figure 3.5: two panels, labeled "least squares regression" and "reduced major axis".]

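The equations for a and b for a reduced major axis line (written here as y = a + bx, so that b is the
slope and a is the intercept) are:

$$b = \sqrt{\frac{\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

For the data in Table 3.1,

$$b = \sqrt{\frac{760.92 - \dfrac{(63.8)^2}{6}}{1039.62 - \dfrac{(74.4)^2}{6}}} = 0.84 \quad \text{and} \quad a = 10.63 - (0.84)(12.4) = 0.22$$

so

y = 0.22 + 0.84x.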
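The reduced major axis numbers can be checked from the sums quoted in the example. The sketch below is an illustration only: the function name is introduced here, the value n = 6 is inferred from the quoted sums and means (63.8 / 10.63), and the positive square root is taken, which assumes the two variables are positively correlated.

```python
import math

def reduced_major_axis_from_sums(n, sum_x, sum_y, sum_x2, sum_y2):
    """Return (slope, intercept) of the reduced major axis line from raw sums.

    The positive root is returned; for negatively correlated data the
    slope would carry a minus sign.
    """
    ss_x = sum_x2 - sum_x ** 2 / n   # sum of squared deviations of x
    ss_y = sum_y2 - sum_y ** 2 / n   # sum of squared deviations of y
    slope = math.sqrt(ss_y / ss_x)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept

# Sums quoted in the text for Table 3.1; n = 6 is an inferred value.
rma_slope, rma_intercept = reduced_major_axis_from_sums(
    n=6, sum_x=74.4, sum_y=63.8, sum_x2=1039.62, sum_y2=760.92
)
print(f"slope = {rma_slope:.2f}, intercept = {rma_intercept:.2f}")
```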

Transformations and weighted regression


Often, x and y will not show a linear relationship, but a linear relationship can be found by
transforming one or both of the variables. We discussed data transformations at some length in Chapter
4. Transforming the data changes the weighting of the individual data points. Even if you do not
transform the data, if some data points are known with more precision than others, it may be desirable to
give those points more weight.
Residual analysis
Residual analysis is a good way to decide if a linear fit was a good choice. In a residual
analysis, the differences for each data point between the true y-value and the y-value as determined
from the best-fit line are plotted for each x-value of the data points. If the linear fit was a good choice,
then the scatter above and below the zero line should be about the same. If this analysis shows a bias
(for example the residual grows larger as x increases), another curve might be a better choice. We can
compute the quality of the fit using the standard deviation of the errors, ei.
$$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = s_y^2\,(1 - r_{xy}^2)$$
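As an illustration, the residuals and the two forms of the residual variance can be computed for the Table 3.2 data. This is a sketch using equations 3-6 and 3-7 for the line; the variable names are chosen here and are not from the text.

```python
import math

depth = [0.25, 0.5, 1.0, 2.0, 3.0]             # x values from Table 3.2
temperature = [25.0, 35.0, 60.0, 80.0, 105.0]  # y values from Table 3.2
n = len(depth)

mean_x = sum(depth) / n
mean_y = sum(temperature) / n
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in depth) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in temperature) / (n - 1))
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depth, temperature)) / (n - 1)
r_xy = cov_xy / (s_x * s_y)

a = r_xy * s_y / s_x     # slope, eq. 3-6
b = mean_y - a * mean_x  # intercept, eq. 3-7

# Residual = observed y minus the y predicted by the best-fit line.
residuals = [y - (a * x + b) for x, y in zip(depth, temperature)]

s_e2_direct = sum(e ** 2 for e in residuals) / (n - 1)
s_e2_formula = s_y ** 2 * (1 - r_xy ** 2)

for x, e in zip(depth, residuals):
    print(f"depth {x:4.2f}: residual {e:+6.2f}")
print(f"s_e^2 from residuals = {s_e2_direct:.3f}, from s_y^2 (1 - r^2) = {s_e2_formula:.3f}")
```

Plotting `residuals` against `depth` gives the residual plot described above; the two values of the residual variance agree to within rounding error.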

Calculating in Excel using the "Regression" tool


In Excel, there is a data analysis tool called "regression". Unfortunately, this tool does not
produce the same result for slope and intercept as when the formulae derived above are used. Which
one is best? Has a mistake been made in the calculations and shouldn't we just trust Excel, which has
been around for quite a while? After all, the programming gods made it for us.
It is easy to test which one is "best". Just compute the "residual", $s_e^2$, using the formula above
for each pair of slope and intercept values and notice which pair results in the smaller value. It turns out
that the smaller value is obtained from the values computed from the formulas derived in this text. The Excel
regression tool computes a slope and intercept that result in a higher residual. Since we are looking for the
"best fit", we choose the formulas in this text. Clearly, Excel (Office 95) has computed a value under different
assumptions than we are making here. The Excel help files do not provide the answer. Therefore, we
cannot trust the value that Excel produces unless we can determine why Excel's answer is different from
ours. This illustrates the importance of knowing how to do the calculation by "hand" before trusting a
powerful computer program to give you a number, which could be wrong, or could be assuming a
different set of conditions than you expected.

Review
After reading this chapter, you should:

Know what a scatter diagram is.

Know what a correlation coefficient is and how to calculate Pearson's correlation coefficient, r.

Understand what the terms 'correlation' and 'regression' mean.

Be aware of some of the pitfalls of correlation and regression statistics.

Know what least squares regression and reduced major axis lines are and how to calculate
them.

Know what residual analysis is and how it is used.

Exercises
1. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following
data. Show all your work!
island age (million years)     distance of island from current hot spot
0                              0
0.5                            200
2.8                            400
7.8                            800
11.2                           1050

2. Determine the least squares regression line and 90% confidence interval for the data in Exercise
#1 above. Which variable should be called 'x' and which should be called 'y'? Does it matter?
Show all your work!
3. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following
data. Show all your work!

Na2O and K2O (weight %)     SiO2 (weight %)
2                           45
5                           50
7                           55
1.8                         44
6                           53
3.7                         48

4. Determine the reduced major axis line for the data in Exercise #3 above. Which variable should
be called 'x' and which should be called 'y'? Does it matter? Why is a reduced major axis line
more appropriate than a least squares regression line, assuming the error in the analytical
techniques used for all analyses is the same? Show all your work!
5. During a Journal Club talk, a student states that the correlation between two variables is 98%.
Should you be impressed by this statistic or do you need more information? Explain.
6. List four pitfalls to watch out for when working with correlation and regression statistics.

CHAPTER 4
The Statistics of Discrete Value Random Variables
In the study of statistics, it is useful to first study variables having discrete values. Familiar examples are
coin and dice tosses. This gives us a chance to better understand beginning statistical principles and
leads naturally to the study of continuous variables and statistical inference.

Combinations
An understanding of combinations is the first step in learning about probability and statistical inference.
Let's begin with an analysis of the coin toss. When you toss a coin 10 times, how many heads and tails
do you expect? Right now, it would be a good idea for you to toss a coin 10 times and see how many
heads you get. Did you expect to get that number?
Simulating coin tosses using Excel:
The random number function (rand()) generates a random number between 0 and 1. You can use
Excel's IF statement to test whether the random number is greater or less than 0.5 to give it a two
state value. To do this, make a column of random numbers in B2 to B12 using =rand(). In C2, enter
=IF(B2<0.5,0,1). Extend the formula to C12. Notice that the value in column C is 0 or 1, depending
on whether the random number is <0.5 or >=0.5. You can sum up the number of heads by putting
=sum(C2:C12) in cell C13. Press the Apple and "=" keys simultaneously (or "Ctrl" and "=" on a PC) to
recalculate and get new simulated toss results.
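The same simulation can be sketched outside a spreadsheet. The following is an illustrative Python version of the recipe above, not part of the original course materials: a uniform random number below 0.5 is recorded as 0 and anything else as 1, and the 1's are counted as heads. The seed value and function name are arbitrary choices.

```python
import random

def simulate_tosses(n_tosses, rng):
    """Return a list of 0/1 values, one per toss, mirroring =IF(B2<0.5,0,1)."""
    return [0 if rng.random() < 0.5 else 1 for _ in range(n_tosses)]

rng = random.Random(134)  # fixed seed so the run is repeatable; remove it for fresh results
tosses = simulate_tosses(10, rng)
heads = sum(tosses)

print("tosses:", tosses)
print("number of heads:", heads)
```

Calling `simulate_tosses(10, rng)` again plays the role of recalculating the spreadsheet.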
You don't need to use Excel to simulate coin tossing. Go ahead and toss 4 coins right now. Do it
several times. You can either toss one coin four times, or 4 coins once. The statistics are the same.
When you toss a coin, you expect to get heads half the time and tails the other half. There are 2
possible outcomes in a single coin toss. These are a) heads and b) tails. There is only one outcome
that we are interested in (heads), so we define a "heads" as a "success", and we have 1 of the outcomes
that is a success. To get the ratio of heads to tosses, you do:

$$\text{Ratio} = \frac{\text{\# of possible outcomes that you define as successes}}{\text{\# of possible outcomes}} = \frac{1}{2} = P$$

So, this shows how we find that half of the tosses are expected to be heads. "Ratio" is the probability
that a single toss will come out to be a head.
Suppose you are performing an experiment that consists of tossing a coin 4 times. On average, you should
get 4*P = 4*(1/2) = 2 heads. Of course, sometimes you get 0 heads, 1 head, 3 heads, or 4 heads. But, if you
toss the coin many, many times, you expect the ratio of heads to tosses to become closer and closer to
0.5.
For the 4 coin toss experiment, is it possible to predict the number of times we expect to get some
number of heads different from 2? It has already been shown that it is possible to predict the
probability of getting a head in a single toss using the number of possible outcomes of a toss. Let's
write down all of the possible outcomes when we toss 4 coins. Each outcome is equally likely. The
possibilities are shown below, with the first letter representing the outcome of the first toss, the second
letter representing the outcome of the second toss, so TTHH would be tails for the first toss, tails for the
second, heads for the third, and heads for the fourth.
TTTT, TTTH, TTHT, TTHH, THTT, THTH, THHT, THHH
HTTT, HTTH, HTHT, HTHH, HHTT, HHTH, HHHT, HHHH
There are several facts to notice. The first is that the counting started with all T's and progressed as if
counting in binary, where a T was a "0" and an H was a "1". There are other ways to do this, but
binary counting will come in handy later. The 4 tosses become analogous to a 4 bit number, which
has 2^4 = 16 possible values. Notice that there are 16 combinations of heads and tails that can occur for
the 4 coin toss sample we are discussing. So, how many outcomes are there with 0 heads? Count
'em. The answer is 1. There are 4 outcomes with 1 head, 6 outcomes with 2 heads, 4 outcomes with
3 heads, and 1 outcome with 4 heads. Of course, the total of all the outcomes is 16, as it has to be.
We can't just apply the formula that we used before. The number of possible outcomes is 16, since
there are that many combinations of heads and tails when a coin is tossed 4 times. So, the probability
for a single outcome must be 1/16. That is the probability for any one of the above combinations
happening in the sample. But, when we are going to say we have a success when several of the above
outcomes occur, we then add the probabilities for all successful outcomes. Said another way, if we
defined success such that half of the 16 combinations were successes, then the probability would have
to be 1/2, wouldn't it? So, the formula becomes:
$$P(\text{1 head}) = \frac{\text{\# Successes}}{\text{\# Outcomes}} = \frac{4}{16}$$

So, the probability of getting 1 head is 1/4, which means that if we conduct the 4 coin toss sample 12
times, we expect to get 0 heads 12*(1/16) times, 1 head 12*(4/16) times, 2 heads 12*(6/16) times, 3
heads 12*(4/16) times, and 4 heads 12*(1/16) times.
When asked to determine the probability of a particular random combination occurring, you
can "Brute Force" the result by writing down all possible combinations, then counting the
number of combinations that you consider successes and dividing that number by the total
possible combinations.
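The brute force approach is easy to automate. The sketch below (an illustration, with variable names chosen here) lists all 16 outcomes of 4 tosses in the same binary-counting order used above, counts the outcomes for each number of heads, and divides by the total.

```python
from collections import Counter
from itertools import product

# All sequences of 4 tosses; "TH" ordering makes T play the role of 0 and H of 1.
outcomes = ["".join(tosses) for tosses in product("TH", repeat=4)]

# Number of outcomes for each possible number of heads.
heads_count = Counter(outcome.count("H") for outcome in outcomes)

# Probability = (# outcomes counted as successes) / (# possible outcomes).
probabilities = {k: heads_count[k] / len(outcomes) for k in sorted(heads_count)}

print(outcomes)
for k, p in probabilities.items():
    print(f"{k} heads: {heads_count[k]} outcomes, probability {p}")
```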

Rules of Probability
The probability of rolling a "5" with a single die is 1/6, but what is the probability of rolling a "5" or a
"6"? What is the probability of rolling two "6"s with two dice or that the sum of the faces of two dice
will add up to "5"? We know that the probability of any one coin toss being "heads" is 0.5, but what is
the probability that if we toss 4 coins, all will be "heads" or that 3 of the 4 coins will be "heads"?

There are two basic rules of probability which you need to know. Rule #1 is that the probability that
any one of several mutually exclusive events occurs is the sum of the probabilities of the
separate events. Mutually exclusive means that only one of the possible events can occur at one time (e.g., a
coin toss is either "heads" or "tails", it cannot be both). The probability of rolling a "5" or a "6" with a
single throw of a die is 1/6 + 1/6 which equals 1/3. The probability of throwing either a "heads" or
"tails" is 1/2 + 1/2 which equals 1, which makes sense since there are no other choices.
Rule #2 is that the probability that all of a number of independent events occur is the
product of the separate probabilities. Independent means that the occurrence of one event does not
affect the probability of the occurrence of any other event. The probability of rolling two "6"s with two
dice is therefore 1/6 * 1/6 which equals 1/36 and the probability of tossing 4 "heads" in a row is 1/2 *
1/2 * 1/2 * 1/2 which equals 1/16.
In some problems, the probability of a particular event may not be constant. For example, consider the
following problem. Two cards are dealt from a deck of 52 cards. What is the probability that both
cards are aces? Since there are 4 aces in the deck and there is equal probability of receiving any card,
the probability of being dealt an ace with a single card is 4/52. The probability of being dealt a second
ace is 3/51 since one ace has been removed from the deck. So the probability of being dealt 2 aces
with 2 cards is 4/52 * 3/51 which equals 0.0045 or about 0.5%.
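The arithmetic in these examples can be checked with exact fractions. The sketch below simply restates the calculations above using Python's standard `fractions` module; the variable names are chosen here for illustration.

```python
from fractions import Fraction

p_face = Fraction(1, 6)   # probability of any one face of a die
p_head = Fraction(1, 2)   # probability of heads in one coin toss

# Rule #1: mutually exclusive events, so the probabilities add.
p_five_or_six = p_face + p_face

# Rule #2: independent events, so the probabilities multiply.
p_two_sixes = p_face * p_face
p_four_heads = p_head ** 4

# Cards: the second probability changes because one ace has left the deck.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print("5 or 6 on one die:", p_five_or_six)
print("two 6s with two dice:", p_two_sixes)
print("4 heads in a row:", p_four_heads)
print("two aces:", p_two_aces, "=", round(float(p_two_aces), 4))
```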

Probability distributions
A probability distribution is a plot of the expected frequency of an event occurring. Remember that
you have already been exposed to sample distributions, where the actual data sampled is being
plotted. Much of statistics involves the comparison of probability distributions with sample
distributions. Sometimes we use the term "expected frequency" rather than probability. There are
several important probability distributions in the field of statistics. In this chapter, we will be concerned
with the Binomial Distribution.

Expectation Values and Ensembles


It is important to understand that the average result of an experiment, repeated many, many times, in the
limit of an infinity of times, will approach the expected result. The imagined infinity of experiments is
called an "ensemble" and the average of this infinity of experiments is called an "ensemble average".
Now that this has been presented at an intuitive level, it is time to put this idea on a firmer mathematical
basis. We express the expectation value of a variable as:

$$E[x] = \mu \tag{4-1}$$

where E[x] is the expectation value and is the average of an infinity of experiments (ensemble
average). $\mu$ is the population mean because it samples the entire population. Suppose we apply this
to the experimental result: the difference between the number of heads and the number of tails in a coin
toss experiment.
We define:

$$d = \frac{\text{\# of heads} - \text{\# of tails}}{N}$$


We know that for any particular experiment, d will generally not be zero. We express the average of d,
for an infinite number of experiments as E[d]. We get:

$$E[d] = E\left[\frac{\text{\# of heads} - \text{\# of tails}}{N}\right] = 0$$

The above equation states that, even though d may only rarely be exactly zero, when many coin toss
experiments are averaged, the positive and negative d's ultimately cancel and the average of d asymptotically
approaches zero. It is exactly zero in the limit of the average of an infinity of experiments. There are
some general properties of E[] that are properties of the normal average also. Several useful relations
are:
a) $E[x] = \mu$, where x is a random variable and $\mu$ is the true average of x
b) $E[ax] = a\mu$, where x is a random variable and a is a constant
c) $E[ax + b] = a\mu + b$, where x is a random variable and a and b are constants
d) $E\left[\frac{\sum (x - \mu)^2}{N}\right] = \sigma^2$, where x is a random variable and N is the number of data values
   in a sample. $\sigma^2$ is the population variance.

You can verify the above by considering the way a normal average behaves when it is multiplied by a
constant, or a constant is added to it. The above formulae only change this conceptualization by using
an infinite number of terms in the average. Note that a new symbol was introduced, $\mu$. This is the
population mean of x, which is the average of x in the limit of an infinite number of measurements. We
also introduced the population variance of x, $\sigma^2$, which is the variance averaged over an infinity of
experiments.
So for the coin toss case discussed previously, if c is a constant, the quantities:

$$E[cd] = 0 \tag{4-2}$$

and

$$E[c + d] = c \quad \text{since d averages to 0.} \tag{4-3}$$

Also:

$$E[(x - \mu)^2] = \sigma^2 \quad \text{which is the expected variance}$$

A more rigorous and general derivation of the expectation value will be given in a later section.
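An infinite ensemble cannot be simulated, but a large finite one shows the trend. In the sketch below (illustrative only; the seed, the number of tosses per experiment and the ensemble sizes are arbitrary choices), d is computed for many simulated experiments and averaged. The average of d drifts toward zero as the ensemble grows, and the average of c + d drifts toward c.

```python
import random

def heads_minus_tails_fraction(n_tosses, rng):
    """d = (# of heads - # of tails) / N for one simulated experiment."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    tails = n_tosses - heads
    return (heads - tails) / n_tosses

rng = random.Random(2001)  # fixed seed so the run is repeatable
n_tosses = 10
c = 3.0                    # an arbitrary constant for checking E[c + d] = c

averages = {}
for n_experiments in (10, 1000, 20000):
    d_values = [heads_minus_tails_fraction(n_tosses, rng) for _ in range(n_experiments)]
    mean_d = sum(d_values) / n_experiments
    mean_c_plus_d = sum(c + d for d in d_values) / n_experiments
    averages[n_experiments] = (mean_d, mean_c_plus_d)
    print(f"{n_experiments:6d} experiments: average d = {mean_d:+.4f}, average c + d = {mean_c_plus_d:.4f}")
```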

Expectation values and Ensemble Averages: This seems to be difficult for students to grasp, yet it
is very simple in concept. An ensemble is defined as the results of a number of experiments. If the
experiment is a single coin toss, then the ensemble is the results of some number of coin tosses. An
infinite ensemble would be the result of an infinite number of coin tosses. The "Expectation Value" of a
quantity is just the average value of the ensemble results. So, we do an experiment and get a result, say
x. The value of x is determined an infinite number of times in an experiment. We then can take the
expectation value of x, x^2, etc, as discussed above. Why is this useful? It's useful because it gives us a
way to think about random processes. If we had the perfect experiment, with an infinite amount of data,
we would expect to get the "Expectation Value" of whatever variable we were measuring. However, if we
have a less than perfect experiment, the result will be more or less close to the expectation value,
depending on the variance. If we can theoretically calculate the variance, we would use the expectation
value of the variance (above). If not, we have to estimate it from the data.

Binomial distribution
The discrete probability distribution for a measurement, observation or event which has two possible
outcomes is described by the binomial distribution. Examples of such observations include a coin
toss ("heads" or "tails"), a quality control laboratory test (defective or not defective) and the drilling of an
oil well (a strike or a dry well).
Another important example where the binomial distribution applies is fluctuations in the values of a
particular class in a histogram. The binomial distribution will be derived for this example. Figure 4.1
shows the continuous probability distribution where any value is equally likely between 0 and xMax.
The uniform distribution is used here for simplicity, but any distribution could be used.

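[Figure 4.1: uniform distribution p(x) of height 1/xMax plotted against x values from 0 to xMax, with the class interval between xl and xu cross-hatched.]
Figure 4.1. Description of variables for the derivation of the binomial theorem.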

Suppose an experiment is run where X is measured N times. The probability that X is inside the class
interval is:

$$P = \frac{X_u - X_l}{x_{Max}}$$

Note: For continuous data, the probability that any particular value will occur is zero. Why do you think this
is? Think about how many real numbers there are in any finite interval. Is it intuitive that because there are
so many possibilities, the probability of getting any one of them is very small? To fix this problem, for
continuous numbers, it is necessary to consider the probability that a value will lie between two other
values. This is expressed as $p(x_l < x < x_u)$, which is the probability that x lies between $x_l$ and $x_u$. The probability
is the area under the curve (cross-hatched in figure 4.1 above). For discrete data (like the dice toss and the
binomial distribution), x values are not continuous, and there are only a finite number of possible values, so it
is possible to evaluate the probability of getting a particular value.

Continuous distributions will be discussed later in the chapter. The important element of this discussion is
that the data point can be either a "hit" (with probability P) or a "miss" (with probability 1-P). For N data,
the expected number of "hits" will be NP. Conversely, the expected number of "misses" will be N(1-P). We define:

$$Q = 1 - P, \quad \text{so} \quad (1 - P)N = QN$$

We now plot the histogram of the expected results of this experiment. Figure 4.2 shows the expected
number of "misses" (0% on the x axis) and "hits" (100% on the x axis). Let's put this in terms of a dice
toss. If we are watching the number "2", then a toss with "2" showing will be a "hit" and a toss where
any other number is showing will be a "miss". P will be 1/6, and Q will be 5/6. Our experiment has N=1,
because we toss the die once, then count the result as a "hit" or "miss".

[Figure 4.2: histogram of P(R) against outcomes (% of data within the class), with a bar of height Q at 0% and a bar of height P at 100%.]
Figure 4.2. The acquisition of a single data point is the experiment; the probabilities of 0% inside the class and 100%
inside the class are plotted.

When the first data point is taken, there are only 2 possible outcomes, with probability P and Q. These
are that 0 data are in the class (0%, or 0 "hits") and that 1 data point is in the class (100% or 1 "hit"). Suppose
that we now have 2 data points in our experiment. In our dice analogy, this would correspond to
tossing the die twice, then counting "hits" or "misses". But now we have 3 possibilities: 0 "hits", 1 "hit", or
2 "hits". Figure 4.3 illustrates the probabilities. Keep in mind that when the data is "in the class", we
count it as a "hit".
Examining figure 4.3, we can see that each of the first two outcomes generates 2 more possible
outcomes. This is analogous to the toss of 2 coins. If the first toss produces a head, the second toss
can produce a head, or a tail (HH or HT). If the first toss produces a tail, the second toss can produce
a head or a tail (TH or TT). So, there are 4 possible outcomes. For the coin toss example, P = Q =
0.5.
[Figure 4.3: tree diagram. The first value is either "in" or "out"; each branch splits again for the second value, giving the four outcomes in-in (PP), in-out (PQ), out-in (QP) and out-out (QQ).]
Figure 4.3. Expected numbers of data points within each of the 4 different outcomes that are possible with 2 data
values in the experiment.

Figure 4.4 is a histogram of the expected outcomes when 2 data points are taken. There are 3 possible
values for the probabilities. None, half, or all of the 2 data values are between Xu and Xl. So the
probabilities of the 3 outcomes (2 in, 1 in, and 0 in) are $P^2$, $2QP$, and $Q^2$.

[Figure 4.4: histogram of P(R) against outcomes (% of data within the class), with bars $Q^2$ at 0%, $2PQ$ at 50% and $P^2$ at 100%.]
Figure 4.4. Histogram of outcomes when 2 data points have been taken. Notice that the R value is also shown along
the x axis.

The next step in the derivation is to take 3 data values. Notice how this is reminiscent of counting all of
the possible outcomes of coin tossing. The only difference is that we are using P and Q instead of H
and T. We are also counting the possible outcomes in a slightly different way. For 3 data values, the
outcomes are, counting the same way we did with the coin tosses:
PPP     PPQ     PQP     PQQ
QPP     QPQ     QQP     QQQ


These can be evaluated, and are:

$$\begin{array}{cccc}
P^3 & P^2Q & P^2Q & PQ^2 \\
QP^2 & Q^2P & Q^2P & Q^3
\end{array}$$

To get the probabilities for each outcome, we add the probabilities where the outcome is the same.
Thus, we have:

$$P^3 \qquad 3P^2Q \qquad 3PQ^2 \qquad Q^3$$

for the probabilities for each class.

[Figure 4.5: histogram of P(R) against outcomes (% of data within the class), with bars $Q^3$ at 0%, $3PQ^2$ at 33%, $3P^2Q$ at 67% and $P^3$ at 100%.]
Figure 4.5. Probabilities of outcomes when 3 data values have been taken.

The following table summarizes the results so far. The probabilities of the individual classes are the
terms obtained when each expression is expanded.

Number of data points     Expression
N=1                       $(Q + P)$
N=2                       $(Q + P)^2$
N=3                       $(Q + P)^3$

By inference, we expect that for any value of N, the probabilities are the terms of the expansion of
$(Q+P)^N$. Using the binomial expansion, the terms for arbitrary N are:

$$Q^N;\quad NQ^{N-1}P;\quad \frac{N(N-1)}{1\cdot 2}Q^{N-2}P^2;\quad \frac{N(N-1)(N-2)}{1\cdot 2\cdot 3}Q^{N-3}P^3;\ \ldots\ ;\quad \frac{N(N-1)(N-2)\cdots 2}{1\cdot 2\cdot 3\cdots(N-1)}QP^{N-1};\quad P^N$$

The Rth term is, where 1-P is substituted for Q:

$$P(R) = \frac{N!}{R!\,(N-R)!}\,P^R\,(1-P)^{N-R} \tag{4-4}$$

Important information: Do you know what the "!" sign means? It is the "Factorial" symbol. This is just a
shorthand that allows us to write down a sequence of numbers more concisely. Its definition is:
N! = N*(N-1)*(N-2)*...*1. So, 4! would be: 4*3*2*1. There is an interesting property that isn't obvious.
That is that 0! = 1. Odd, but you will see that this definition of 0! is the most useful for formulas like equation 4-4.

This is the answer. To apply this to an example, suppose that we have a 5-sided die and will throw it
10 times. P = 1/5 = 0.2 and N = 10. The expression for a particular side coming up R times is:

$$P(R) = \frac{10!}{R!\,(10-R)!}\left(\frac{1}{5}\right)^{R}\left(\frac{4}{5}\right)^{10-R}$$

Figure 4.6 shows the histogram of the probabilities of getting a particular die value R times in 10 throws.
R = 2 times shows the greatest probability, since 10(1/5) is the expected value.

[Figure 4.6: histogram of P(R) against the number of times a given die face occurs (0 to 10).]
Figure 4.6. Histogram of probabilities of a particular die face showing R times out of N throws.
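Equation 4-4 is straightforward to evaluate numerically. The sketch below (the function name `binomial_pmf` is introduced here for illustration) tabulates P(R) for the 5-sided die example with N = 10 and P = 0.2, and finds the most likely value of R.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Equation 4-4: probability of exactly r hits in n trials with hit probability p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n_throws = 10
p_face = 0.2  # one face of a 5-sided die

pmf = [binomial_pmf(r, n_throws, p_face) for r in range(n_throws + 1)]
most_likely = max(range(n_throws + 1), key=lambda r: pmf[r])

for r, prob in enumerate(pmf):
    print(f"R = {r:2d}: P(R) = {prob:.4f}")
print("most likely R:", most_likely)
print("sum of all probabilities:", round(sum(pmf), 12))
```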

Next, we consider a more interesting example. Suppose that the probability of striking oil with a wildcat
well is 10%. If 10 wells are to be drilled, what is the probability that all 10 will be dry? If we assume
that the probability of finding oil at any one well is independent of finding oil at any other and that a well
is either a strike or dry, nothing in between, then we can apply the binomial probability distribution to
this problem. Here, P = 0.1, R = 0 and N = 10, so

$$P(R=0) = \frac{10!}{0!\,(10!)}\,(0.9)^{10}\,(0.1)^{0}$$

which equals 0.35 or 35%. Note that we arrive at the same result if we simply follow the rules of
probability discussed in the beginning of this chapter. Since the probability of drilling a dry well is 0.9,
the probability of drilling 10 dry wells is 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 = (0.9)^10,
which equals 0.35.
What is the probability of drilling one successful well? Now, P = 0.1, R = 1 and N = 10, so

$$P(R=1) = \frac{10!}{1!\,(9!)}\,(0.9)^{9}\,(0.1)^{1}$$

which equals ~0.39 or ~39%. To find the probability of drilling 2 successful wells, let P = 0.1, R = 2
and N = 10. In this case, the probability is 0.19 or 19%.
[Figure 4.7. Binomial distribution for N = 7 and two values of P. Panel (a): N=7, P=0.25. Panel (b): N=7, P=0.5. Vertical axis: probability, from 0.0 to 0.4.]

What if we want to know the probability of drilling at least 1 successful well? In this case, we must add
the probability of drilling 1 successful well to the probability of drilling 2, 3, 4 or more successful wells.
Note that often a problem such as this can be simplified by rewording the original question. For example, to find the probability of at least
1 successful well, we can find the probability of 0 successful wells. The probability of more than 0
successful wells is just 1 minus this number. As an exercise, do these calculations. The sum of the
probabilities for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 successful wells should be 1.
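
The wildcat well calculations can be checked with a few lines of Python. This is only a sketch; the helper binomial_pmf and the variable names are ours, and the numbers are the ones used in the example (N = 10, P = 0.1).

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials (equation 4-4)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

N, P = 10, 0.1  # 10 wells, 10% chance of a strike at each one

p_dry = binomial_pmf(0, N, P)   # all 10 wells dry
p_one = binomial_pmf(1, N, P)   # exactly 1 successful well
p_two = binomial_pmf(2, N, P)   # exactly 2 successful wells

# "At least 1 success" is easier as 1 minus "no successes"
p_at_least_one = 1 - p_dry

# The probabilities for R = 0..10 should add up to 1
total = sum(binomial_pmf(r, N, P) for r in range(N + 1))

print(f"P(0) = {p_dry:.4f}, P(1) = {p_one:.4f}, P(2) = {p_two:.4f}")
print(f"P(at least 1) = {p_at_least_one:.4f}, sum = {total:.4f}")
```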
The binomial distribution for N = 7 is given in Figure 4.7 for different values of P. Note that the
distribution is symmetric for P = 0.5 and asymmetric for P ≠ 0.5. Important properties of the binomial
distribution are its approximate mean and variance, if 0.1 ≤ P ≤ 0.9 and NP > 5:

$$E[x] = NP$$

and

$$E[s^2] = NP(1-P)$$


From the above two equations, the ratio of the standard deviation of binomial distributed data to the
mean is given by:

$$f = \frac{\sqrt{NP(1-P)}}{NP} = \sqrt{\frac{(1-P)}{PN}} \tag{4-5}$$

So, as N increases, the width of the distribution expressed as a fraction of the mean decreases.
Suppose N=10 and P=0.2. Then:

$$f = \sqrt{\frac{(1-0.2)}{0.2(10)}} = 0.632$$

which means that the standard deviation of the distribution is a bit more than half of the mean. But if we
let N=1000, then

$$f = \sqrt{\frac{(1-0.2)}{0.2(1000)}} = 0.0632$$

and the standard deviation is only 6.3% of the mean value, a factor of 10 lower. The standard deviation
relative to the mean varies as:

$$f \propto \frac{1}{\sqrt{N}}$$

This very important result will be used many times.


Stirling's factorial approximation for large N:

$$N! \approx N^{N} e^{-N} \sqrt{2\pi N}$$

This is extremely useful when large factorials will eventually cancel, but the computer's number
range is too small to evaluate them prior to cancellation.
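
One way to use Stirling's approximation without overflowing is to work with logarithms, so that large factorials cancel as differences of moderate numbers. The Python sketch below illustrates the idea; the function names are ours, and the comparison against math.lgamma and math.comb is only there to show how close the approximation is.

```python
import math

def log_factorial_stirling(n):
    """ln(n!) from Stirling's approximation: n ln n - n + 0.5 ln(2 pi n)."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def log_binomial_coefficient(n, r):
    """ln of N!/(R!(N-R)!) with Stirling used for each factorial (needs 0 < r < n)."""
    return (log_factorial_stirling(n)
            - log_factorial_stirling(r)
            - log_factorial_stirling(n - r))

# ln(100!) compared with the library value (lgamma(n + 1) equals ln(n!))
diff_log = abs(log_factorial_stirling(100) - math.lgamma(101))

# The coefficient 100!/(50! 50!) from the logarithms, compared with the exact value
approx = math.exp(log_binomial_coefficient(100, 50))
exact = math.comb(100, 50)
rel_err = abs(approx - exact) / exact

print(f"difference in ln(100!): {diff_log:.6f}")
print(f"relative error of 100!/(50! 50!): {rel_err:.4%}")
```

The factorials are never formed directly; only their logarithms are added and subtracted, and the exponential is taken at the end.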

Problem 1:

Compute and plot the distribution of the number of heads in 100 coin tosses, using the
binomial theorem. To compute the large factorials, use Excel and Stirling's
approximation.


Review
After reading this chapter, you should:

Have a basic understanding of the concept of probability.

Be able to do simple probability problems involving coins, dice and cards.

Understand the definition of an "expectation value". What is it, in statistical terms? How can you
define it in words?

Understand how the binomial distribution is derived.

Know when the binomial distribution applies and how to use it to solve problems.


Exercises
1. If you select one card at random from a deck of 52 cards, what is the probability that the card will
be
a. the queen of spades?
b. of the suit clubs?
c. an ace?
d. the ace of spades or the queen of hearts?
e. either a jack, queen or king?

2. If you select two cards at random from a deck of 52 cards, putting the first card back before you
select the second card, what is the probability of
a. the queen of spades both times?
b. an ace both times?
c. the same card both times?
d. a card of the suit hearts or diamonds both times?

3. If you select two cards at random from a deck of 52 cards, and do not put the first card back
before you select the second card, what is the probability of
a. the cards both being aces?
b. the cards belonging to the same suit?
c. at least one of the cards being an ace?
4. Assume that when you drill a wildcat oil well, there are only two possible outcomes - either the well
is dry or it strikes oil. Assume that the chance of success at any one well is independent of the
success at any other. Suppose that 8 wells are to be drilled and P = 0.1. Using the equation that
describes the binomial probability distribution,
a. what is the probability that they will all be dry?
b. what is the probability that 1 out of the 8 will be a success?
c. what is the probability that 2 out of the 8 will be a success?
d. what is the probability that at least 1 out of the 8 wells will be a success?
5. Given the assumptions in Exercise 4 above, how many wells must be drilled to guarantee a 75%
chance of at least 1 success?
6. Write an Excel sheet to calculate the binomial probability, P(R), given input by the user for N, R and P.

7. Modify the program from exercise 6 to calculate the cumulative binomial probability. That is,
calculate the probability of at least R successes in N events.
8. Create a game based on probability. It could involve cards, dice or whatever. Use
your imagination!


CHAPTER 5

Probability Distributions and Statistical Inference
In the previous chapter, we discussed probability, probability distributions for discrete variables,
and expectation values. In this chapter, we introduce continuous probability distributions and sampling
distributions and begin our discussion of statistical inference. At this point, you might want to review
page 1-9, where the height of the histogram bars is discussed.

Continuous Distributions and Expectation Values


When a variable is continuous, the distribution of sampled values will be continuous. This will
remind you of the differences in histograms between discrete and continuous data.
Referring to figure 5.1 below, imagine that we have done an experiment and plotted the
histogram. The class boundaries are at X1, X2 etc. The area of the far left bar of the histogram is equal
to (X2 - X1)F1, where F1 is the height of bar 1. The frequency (vertical height of the bar) is the number
of occurrences within the class divided by the width of the class, (X2-X1), so the area of the bar is the
number of occurrences within the class. The sum of all the areas of the histogram bars is the total
number of data values in the experiment, which we have been defining as N. Remember, in Chapter 1
(p 1-9) it was suggested that the scaling of the histogram plots would be more convenient if the
frequencies were divided by N. This makes the total area of the histogram exactly equal to 1.
Suppose we now want to compute the expected mean of a data sample. The expected mean
will not depend on any particular experiment. According to the intuitive explanation of expectation that
has been already presented, we imagine calculating the individual means for the experiment conducted
an infinite number of times and averaging the individual means to get the expected mean. We can't do this in
reality, of course. But there must be a way. We know from probability theory that the expected mean
of the number of heads in a coin toss is half the number of tosses. There must be a way to use the
probability to calculate the expected mean. In fact, there is.
First we need to figure out how to calculate the mean using only the histogram bar height values.
Here we go! Suppose, in a histogram for a discrete random variable, there are 5 data values equal to 1,
2 values equal to 3, and 6 values equal to 4. The mean would be:
$$m = \frac{5\cdot 1 + 2\cdot 3 + 6\cdot 4}{13} = 2.7$$

If you don't believe this formula, work it out for yourself on paper.

[Figure 5.1. Plot of a histogram of a continuous distribution whose area has been approximated by 8 bars. Vertical axis: frequency of occurrence. Horizontal axis: X, with class boundaries X1, X2, etc.]

Notice that we computed m by grouping equal values. Similarly, we compute the mean of the
distribution of figure 5.1 by making the approximation that all data values within each class are the
same, and are at the center of the class. The number of data within the i'th class is the area of the class,
which is Fi*(Xi+1 - Xi). So, the mean will be:

$$m = \frac{F_1 (X_2 - X_1)\frac{(X_1 + X_2)}{2} + F_2 (X_3 - X_2)\frac{(X_2 + X_3)}{2} + \dots + F_8 (X_9 - X_8)\frac{(X_8 + X_9)}{2}}{N}$$

Fi are the heights of the individual bars, and the Xi's are the values of X at the class boundaries. If we
define:

$$X_{ci} = \frac{(X_i + X_{i+1})}{2}$$

which is the center of the bar, we have:

$$m = \frac{F_1 (X_2 - X_1) X_{c1} + F_2 (X_3 - X_2) X_{c2} + \dots + F_8 (X_9 - X_8) X_{c8}}{N}$$

So, the mean is computed by multiplying the area of each bar (its height times its width) by its center value of X, summing over
all bars, and then dividing by N. Now suppose that N gets very large and the bars also become very
narrow. Defining ΔX = Xi+1 - Xi, we have:

$$\mu = \lim_{N\to\infty} \frac{1}{N}\sum_{i} F_i X_{ci}\,\Delta X = \frac{1}{N}\int_{-\infty}^{+\infty} x F(x)\,dx = \int_{-\infty}^{+\infty} x\,p(x)\,dx \tag{5-1}$$

where p(x) is given by F(x)/N. Note that E[x] = m from equation 4-1.
Notation: m will be used to indicate the mean of observed data.
μ will be used to indicate the expected mean, or population mean obtained by averaging many
repeated experiments. So, E[m] = μ.
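
The bar-by-bar calculation of the mean can be tried on simulated data. The Python sketch below is only an illustration under our own assumptions: the data are 1000 random values, the 8 equal-width bars span the range of the data, and all variable names are ours.

```python
import random

random.seed(0)
data = [random.gauss(5.0, 1.25) for _ in range(1000)]  # simulated measurements
N = len(data)

n_bars = 8
lo, hi = min(data), max(data)
width = (hi - lo) / n_bars  # X_{i+1} - X_i, the same for every bar

# Count the data values falling in each class (the largest value goes in the last bar)
counts = [0] * n_bars
for x in data:
    i = min(int((x - lo) / width), n_bars - 1)
    counts[i] += 1

heights = [c / width for c in counts]                       # F_i
centers = [lo + (i + 0.5) * width for i in range(n_bars)]   # X_ci

# Mean from the histogram: sum of (bar area * bar center), divided by N
m_hist = sum(F * width * xc for F, xc in zip(heights, centers)) / N
m_data = sum(data) / N

print(f"mean from histogram bars: {m_hist:.3f}")
print(f"mean from raw data:       {m_data:.3f}")
```

Because every value is treated as if it sat at the center of its bar, the two means can differ by at most half a bar width, and the difference shrinks as the bars become narrower.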

So, in the limit of infinite data, the distribution becomes continuous and we compute the mean as shown
in the previous equation. In general, the expectation value of an arbitrary function, f(x) is
$$E[f(x)] = \int_{-\infty}^{+\infty} f(x)\,p(x)\,dx \tag{5-2}$$

The above equation is the general formula for computing an expectation value of a general function of a
random variable x, which is distributed according to the probability distribution p(x). It is now possible
to derive several extremely important algebraic properties of expectation values.
Multiplication of the function f(x) by a constant results in the following:

$$E[a f(x)] = \int a f(x)\,p(x)\,dx = a \int f(x)\,p(x)\,dx$$

So,

$$E[a f(x)] = a E[f(x)] \tag{5-3}$$

Similarly, the following relationships can be proven, where x and y are random variables and a and b
are constants.

$$E[a f(x) + b f(y)] = a E[f(x)] + b E[f(y)] \tag{5-4}$$

For example, the expected value of the squared data is computed as:

$$E[x^2] = \int_{-\infty}^{+\infty} x^2\,p(x)\,dx \tag{5-6}$$

The expectation value of the data variance is much more interesting. We compute

$$E[(x-\mu)^2] = E[x^2 - 2\mu x + \mu^2] = E[x^2] - E[2\mu x] + E[\mu^2]$$

From equations 4-1 and 4-2, and considering that μ is a constant,

$$E[(x-\mu)^2] = E[x^2] - 2\mu E[x] + \mu^2$$

Since E[x] = μ,

$$\sigma^2 = E[(x-\mu)^2] = E[x^2] - \mu^2 = \int_{-\infty}^{+\infty} x^2\,p(x)\,dx - \mu^2 \tag{5-7}$$

σ² is the variance of the continuous distribution. This will be called the population variance in the next
chapter.
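
The expectation integrals can be checked numerically for any simple density. The Python sketch below uses a hypothetical density p(x) = 2x on the interval 0 to 1 (our choice, picked only because its mean 2/3 and variance 1/18 are easy to work out by hand) and a midpoint-rule sum in place of the integral.

```python
def p(x):
    """Hypothetical probability density: p(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x

def expectation(f, a=0.0, b=1.0, steps=20000):
    """Midpoint-rule approximation of the integral of f(x) p(x) dx from a to b."""
    dx = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * dx
        total += f(x) * p(x)
    return total * dx

mean = expectation(lambda x: x)                        # E[x]
mean_sq = expectation(lambda x: x * x)                 # E[x^2]
var_direct = expectation(lambda x: (x - mean) ** 2)    # E[(x - mu)^2]
var_identity = mean_sq - mean ** 2                     # E[x^2] - mu^2

print(f"E[x] = {mean:.6f}")
print(f"variance (direct)   = {var_direct:.6f}")
print(f"variance (identity) = {var_identity:.6f}")
```

Both ways of computing the variance agree, as the derivation above requires.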
It is important to note the basic difference between the continuous and discrete distributions when
data with continuous values are considered. When data values are continuous, the only time that
discrete histograms are used is in the processing of actual data. Continuous distributions apply only
when we are considering the limit of an infinity of data. Continuous distributions are used when
performing computations to find expected values. When working with real data, the observed
values may vary considerably from expected values, as has been demonstrated by the simulations in
the previous chapters.

Gaussian and Normal Distributions


The Gaussian and Normal distributions are extremely important in the field of statistics. Many
populations follow a normal distribution. Moreover, as we will see in a later chapter, the sample means
of arbitrary populations approximately follow a normal distribution when the sample size is large. The central limit theorem (discussed later) tells us
this.
The gaussian distribution is familiar to you as the bell curve upon which students' grades are
often based. The equation that describes the gaussian distribution is

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{5-8}$$

where μ is the mean and σ is the standard deviation.


[Figure 5.2. Plots of the gaussian distribution for μ=0 and σ = 1 and σ=2. The area between X = -2 and X = +2 for the
curve with σ = 2 is the same as the area between X = -1 and X = +1 for the curve with σ = 1. These areas are filled in
for each curve. Vertical axis: P(x). Horizontal axis: X value.]

It can be proven mathematically that the continuous gaussian distribution describes the discrete
binomial distribution when N approaches infinity. The gaussian distribution stretches from -∞ to
+∞ and is described completely by two parameters, μ and σ. Figure 5.2 shows two examples of the
gaussian distribution for different values of μ and σ.
By definition, the total area under the normal curve is one square unit. Therefore the area under
the curve between any two x values, x1 and x2, gives us the fraction of the total number of x values
that lie between x1 and x2. This means that simply by knowing μ and σ for a gaussian distribution, we
can determine the probability of the occurrence of an x value between any given x1 and x2 values.
Figure 5.3 divides the area under the curve into percentages. As you can see from this figure, for any
gaussian curve, ~68.3% of the x values lie within μ ± 1σ and ~95.5% of the values lie within
μ ± 2σ.
To illustrate how we use this information, we consider the following example. Assume that we
have a list of the mid-term exam scores for a class of 1,000 Geology 4 students. The mean of the exam
is 80 and the standard deviation is 5. Assume that the exam scores follow a perfect normal distribution.
Based on Figure 5.3, we know that the percentage of students who scored between 80 and 85 is
34.13% of the class, the percentage of students who scored between 75 and 85 is 68.26% and the
percentage of students who scored above 90 was 2.27%. We can see that 15.87% of the class scored
below 75 and that 2.27% scored below 70.
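
The exam score percentages can be reproduced from the error function, which is related to the area under the gaussian curve. The Python sketch below assumes the mean of 80 and standard deviation of 5 from the example; the helper normal_cdf is our own name.

```python
import math

def normal_cdf(x, mu, sigma):
    """Area under the gaussian curve from -infinity to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 80.0, 5.0  # exam mean and standard deviation

between_80_85 = normal_cdf(85, mu, sigma) - normal_cdf(80, mu, sigma)
between_75_85 = normal_cdf(85, mu, sigma) - normal_cdf(75, mu, sigma)
above_90 = 1.0 - normal_cdf(90, mu, sigma)
below_75 = normal_cdf(75, mu, sigma)

print(f"between 80 and 85: {between_80_85:.2%}")
print(f"between 75 and 85: {between_75_85:.2%}")
print(f"above 90:          {above_90:.2%}")
print(f"below 75:          {below_75:.2%}")
```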
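[Figure 5.3. Areas beneath various segments of the Gaussian curve. Vertical axis: P(x). Horizontal axis: x value, marked from μ - 3σ to μ + 3σ. Areas: 2.27% below μ - 2σ, 13.60% between μ - 2σ and μ - 1σ, 34.13% between μ - 1σ and μ, 34.13% between μ and μ + 1σ, 13.60% between μ + 1σ and μ + 2σ, and 2.27% above μ + 2σ.]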

We can calculate the area under this curve for any two x values by numerical integration, but in
practice, we rely on existing tables such as Table A1 in the Appendix. Take a look at this table now.
This table is organized according to z-values, which describe the normal distribution, where

$$Z = \frac{(x-\mu)}{\sigma} \tag{5-9}$$

and z = 1 corresponds to μ + 1σ and z = -1 corresponds to μ - 1σ. This table gives the percentage of
the area under the normal curve between positive infinity and the z-value of interest. Armed with the
fact that the normal curve is symmetric and a few mathematical manipulations, we can easily find the
percentage of the area under the curve that lies between any z-values, which we can translate back to x
values.
Let us continue with our example of Geology 4 exam grades to illustrate the use of Table A1.
Suppose we are interested in finding the percentage of students who scored above 88. First we need to
determine the appropriate z-value. Here, z = (88-80)/5 = 1.6. We look up z =1.6 in Table A1 and
find that the area under the curve above 88 is 0.0548 or 5.48%. If we wanted to find the percentage of
students who scored below 88, we would look up the percentage of students who scored above 88 (z
= 1.6) and subtract that value from 1, since the total area under the curve must add to 1. So the
percentage of students who scored lower than 88 is 94.52%. If we wanted to find the percentage of
students who scored between 80 and 88, we would find the percentage of students who scored above
80 (z = 0), 0.50, and subtract from this value the percentage of students who scored above 88 (z = 1.6),
0.0548, leaving 0.4452. It is best to draw yourself a sketch of the area of the curve in which you are
interested.
To test your understanding, find the area under the normal curve between the z-values listed
below and see if your answers agree with the ones given. Note that P(-Z) = 1-P(Z).

Z                   area        Z                   area
0.00 and +∞         0.500       1.96 and +∞         0.025
0.55 and +∞         0.2912      -∞ and -1.27        0.1020
1.23 and +∞         0.1093      -∞ and -0.88        0.1894
2.34 and +∞         0.0096      -0.70 and 0.70      0.516
1.00 and +∞         0.1587      -1.00 and 1.00      0.6826

Practice reading the Z tables in Appendix A1

Verify that you can read the table to get the values for the following situations:
1. What is the area beneath the Z distribution curve for Z > 1.5? (Ans = 0.0668)
2. What is the probability that Z > 2? (Ans = 0.0228)
3. What is the probability that 0 < Z < 1? (Ans = 0.3413)
4. If μ = 2 and σ = 4, what is the probability that a data value > 4? (Ans = 0.3085)
5. If μ = 5 and σ = 2, what is the probability that x < 3? (Ans = 0.1587)
6. If μ = 5 and σ = 2, what is the probability that 3 < x < 7? (Ans = 0.6827)

You should also know how to use Table A1 to find the z-value that a given percent of the curve
lies above. For example, what z-value does 2.5% of the curve lie above? The answer is +1.96. Find
the z-values that 10%, 5%, 1% and 0.25% of the curve lies above. Make sure that your answers are
+1.285, +1.645, +2.325 and +2.81.
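
Table lookups like these can be cross-checked with the NormalDist class in Python's standard statistics module (Python 3.8 or later). The sketch below is not a replacement for learning to read Table A1; the helper names upper_tail and z_for_upper_tail are ours.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def upper_tail(z):
    """Area under the normal curve between z and positive infinity."""
    return 1.0 - std_normal.cdf(z)

def z_for_upper_tail(area):
    """z-value that the given fraction of the curve lies above."""
    return std_normal.inv_cdf(1.0 - area)

# Exam example: fraction of students scoring above 88 (mean 80, standard deviation 5)
z = (88 - 80) / 5
print(f"z = {z:.1f}, area above = {upper_tail(z):.4f}")

# z-values with 10%, 5%, 2.5%, 1% and 0.25% of the curve above them
for area in (0.10, 0.05, 0.025, 0.01, 0.0025):
    print(f"{area:.2%} of the curve lies above z = {z_for_upper_tail(area):.3f}")
```

The computed z-values may differ in the last digit from values read off a printed table, because table lookups involve rounding and interpolation.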

[Figure 5.4. 95% of the area under the normal curve is contained between Z values of +1.96 and -1.96. This
corresponds to μ ± 1.96σ in the general case. Vertical axis: P(Z). Horizontal axis: Z value.]

Poisson distribution
The poisson distribution describes the probability distribution of discrete events occurring
within a given finite interval or object, such as time, length, area, volume, body of water, host specimen,
etc. For example, the poisson distribution may be used to describe radioactive decay, where the
number of decay particles is counted for a specified length of time. Conditions for a process obeying a
poisson distribution are:
The probability of a single occurrence of the event is proportional to the interval size.
The probability of 2 or more events occurring within a sufficiently small interval is negligible.
Events occur in non-overlapping intervals independently. That is, the occurrence of one event does not influence the occurrence of the other event.
The form of the poisson distribution is:

$$p(y;\lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}$$

for y = 0, 1, 2, 3, etc. y is the value of the random variable (# of occurrences within the interval) and λ is
the expected number of events. For example, suppose you are observing radioactive decay and expect
10 events/second. The probability of getting y events during any particular second is:

$$p(y;10) = \frac{e^{-10}\,10^{y}}{y!}$$
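
The radioactive decay example can be tabulated with a few lines of Python. This is a sketch only; the function name poisson_pmf is ours, and the table is cut off at y = 40 because the probabilities beyond that are negligible for an expected count of 10.

```python
import math

def poisson_pmf(y, lam):
    """Probability of exactly y events when lam events are expected."""
    return math.exp(-lam) * lam**y / math.factorial(y)

lam = 10  # expected events per second
probs = [poisson_pmf(y, lam) for y in range(41)]

for y in (0, 5, 10, 15, 20):
    print(f"p({y:2d}; 10) = {probs[y]:.5f}")

print(f"sum for y = 0..40: {sum(probs):.6f}")
```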

Uniform distribution
In a uniform distribution, the probability of the occurrence of any x value is the same as the
probability of the occurrence of any other x value, on the interval between Xl and Xu. Its probability
density distribution is given by:

$$p(x) = \frac{1}{X_u - X_l}$$

The denominator maintains the normalization, which requires that the area of p(x) is equal to 1.

Von Mises distribution


The von Mises distribution is also called the circular normal distribution. It is the equivalent
of the normal distribution for directional data, such as paleocurrent directions or grain orientation data.
It is given by:

$$M(\theta;\,\theta_0, k) = \frac{1}{2\pi I_0(k)}\, e^{k\cos(\theta - \theta_0)}$$

where θ is the direction angle, k (which is always > 0) is called the concentration parameter (it plays a role analogous to
σ, except that a larger k gives a narrower distribution), I0(k) is a modified Bessel function of the first kind of order zero, and θ0 is the mean direction of the distribution. Values of I0(k)
are tabulated in math tables texts.

Log-normal distribution
In a log-normal distribution, the logarithms of a set of values form a normal distribution. For
example, grain sizes, trace element concentrations and the sizes of oil fields all follow a log-normal
distribution. If a distribution of data values is skewed toward the low end, try taking the log of each
data value and plotting the distribution. If it looks like a normal distribution, the data most likely follow
the log-normal distribution. The form of the log-normal distribution is:

$$p(x) = \frac{1}{x\,\sigma_n\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln x - \mu_n}{\sigma_n}\right)^2}$$

where μn and σn are the mean and standard deviation of the ln(x)'s. Note that p(x) is defined only for x > 0. As x approaches 0,
p(x) goes to 0, because the exponential factor falls off faster than 1/x grows.

[Figure 5.5. This is a plot of the gaussian distribution for μ=5 and σ=1.25. The tails of the distribution are at ±1.96σ from the mean,
and we are defining any data point in the tails as a rare event.]

Sample Distribution of a Single Data Value when σ is Known

Suppose that a sample consisting of a single data point is randomly taken from a population with
a Gaussian distribution. We want to find the lowest and highest value of the population mean, μ, of a
population that would produce that sample 5% or less of the time. We will call events in this range a
"rare event". Given a particular sample value, x, figure 5.6 shows the highest and lowest values of μ that
are possible, assuming that x is within the 95% limits.
Suppose that μ = xi - 1.96σ. This is the case in the left hand plot of figure 5.6 and is the
dividing line between a "rare event" and a non-"rare event", at the upper end of the distribution. We can
quantitatively define the non-"rare event" by the inequality xi < μ + 1.96σ; the opposite case, xi > μ + 1.96σ, occurs only 2.5% of the
time. So, we can say that, under the conditions we have specified, μ > xi - 1.96σ. This can be seen in
the plot, or found from the first inequality by subtracting 1.96σ from each side of the inequality. The right
hand plot of figure 5.6 shows that for a non-"rare event", xi > μ - 1.96σ; the opposite case, xi < μ - 1.96σ, occurs only 2.5% of the
time. So, we could say that μ < xi + 1.96σ. More succinctly, we could say that we have the following
result 95% of the time:

$$x_i - 1.96\sigma < \mu < x_i + 1.96\sigma \tag{5-3}$$

Now, here is where we have to be extremely careful in how we think about this result. Obviously, for a
single value of xi, if we repeated the experiment, the value of xi, and with it the inferred limits on μ, would jump around all over the place.
What we can say is that: If the value for μ is outside the two limits of equation 5-3, then we will get
a sampled value as extreme as xi only 5% of the time or less, on the average. Suppose the true
value of μ is exactly xi. You would really get values near xi much more often than 5% of the time. However, that is
not the point. All you have is a single value of xi and you have to draw whatever conclusion that you
can from the information you have. Note also that you don't know σ from the data, but must know its
value independently. This presents a fatal complication if all you really know about the data is a single
value. In the absence of more information than that provided by a single data point, it would be
extremely optimistic to make any conclusion at all. The next section shows how to make inferences
about the population mean and variance when more than a single data point is available.

[Figure 5.6. Smallest and largest values of μ that are consistent with a particular sample xi if we require that at least
95% of repeated experiments produce an xi within this range. Notice that the left figure shows the
distribution offset to the left by the maximum amount and the right hand figure shows the distribution offset to the
right by the maximum amount.]


Sampling Distributions
Sampling distributions are probability distributions of a sample statistic (e.g. m and s2 )
calculated for all possible samples of size N from a given population, as discussed in the previous
paragraph. For example, if our population consists of the numbers 5, 8, 10 and 6 and we are interested
in the sampling distribution of the sample mean, m for a sample size of 2 (N=2), our sampling
distribution includes the values 6.5, 7.5, 5.5, 9, 7, and 8 where 6.5 is the mean of 5 and 8, 7.5 is the
mean of 5 and 10 and so forth. For this same population and a sample size of 3, our sampling
distribution is 7.67, 6.33, 8 and 7 where 7.67 is the mean of 5, 8 and 10, 6.33 is the mean of 5, 8 and 6
and so forth.
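
The small population in this example makes it easy to list every possible sample by computer. The Python sketch below does so with itertools.combinations, treating a sample as a set of distinct population members drawn without regard to order, as in the text; the function name sample_means is ours.

```python
from itertools import combinations

population = [5, 8, 10, 6]

def sample_means(pop, n):
    """Means of all possible samples of size n drawn from the population."""
    return [sum(sample) / n for sample in combinations(pop, n)]

means_2 = sample_means(population, 2)
means_3 = sample_means(population, 3)

print("N = 2:", means_2)
print("N = 3:", [round(m, 2) for m in means_3])

# The average of the sample means equals the population mean
population_mean = sum(population) / len(population)
print("population mean:", population_mean)
print("mean of N = 2 sample means:", sum(means_2) / len(means_2))
```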
Consider the sampling distribution for the sample mean, m, for a population with a normal
distribution. The distribution of the sample means is normally distributed. The mean of the sample
means is equal to the population mean,

$$\mu_m = \mu \tag{5-4}$$

and the standard deviation of the sample means is related to the standard deviation of the population by

$$\sigma_m = \frac{\sigma}{\sqrt{N}} \tag{5-5}$$

where N is the sample size.


[Figure 5.7. This shows a possible distribution of sample means, m. The peak is at the population mean, μ. The total
area under the distribution is 1, so if the shaded area is 0.95, 95% of the times the experiment is repeated (for a large
number of repeats), the sample mean will lie between the upper and lower confidence limits. The standard deviation
of the distribution of sample means is expected to be the standard deviation of the population divided by the
square root of the number of data points in the sample, N. Vertical axis: p(x). Horizontal axis: m, the sample mean.]

Distribution of Sample Means: Determining Optimum Sample Size


Since the distribution of sample means follows a normal distribution, we can express this
distribution in terms of Z-values,

$$Z = \frac{m - \mu}{\sigma/\sqrt{N}} \tag{5-6}$$

where we have used the relationship between the variance of the sample means and the population
variance that we discussed above. We can use the standard normal distribution to determine the sample
size necessary to estimate the population mean, μ, to a required confidence level. The confidence
level is the percentage of times the experiment is expected to produce the result lying between some
upper and lower bound (figure 5.7).
Referring to figure 5.6 and substituting the value of σm given by equation 5-5 for σ in equation 5-3, and m for xi (because now we are working with the sample mean) we have:

$$m - 1.96\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\frac{\sigma}{\sqrt{N}} \tag{5-7}$$

Remember that the above equation means that if the experiment were repeated a large number of times,
and μ were beyond either one of the extremes shown, we would arrive at the measured value of m, for
a sample size of N, 5% (or less) of the time. Notice also that we must know σ. The next chapter will
discuss how we find the limits to μ when σ is not known.
For example, suppose an experimenter needs to measure the ratio of two isotopes to within
±0.06 to a 95% confidence level. That is, if this experiment were repeated many times and the 'true'
isotopic ratio were known, then this 'true' value would lie within the specified range for 95% of the
experiments. Suppose that the errors in the measurements follow a normal distribution and the standard
deviation for the technique is known to be 0.1.
How many measurements should be made? Are 5 measurements enough? From Appendix
A1, we see that the z-value for which 2.5% of the curve lies above is 1.96. Since 2.5% of the curve
also lies below z = -1.96, that leaves 95% of the curve between z = +1.96 and -1.96. We refer to the
area above the positive z-value as the upper tail and the area below the negative z-value as the lower
tail. By using the above expression for z using z=1.96, we can find the maximum difference between m
and μ that will allow us to be in the center 95% of the curve.
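Subtracting m from each side of equation 5-7 and changing signs, we can determine the limits of m - μ as:

$$-1.96\frac{\sigma}{\sqrt{N}} < m - \mu < 1.96\frac{\sigma}{\sqrt{N}} \tag{5-8}$$

or at the extreme limits:

$$m - \mu = \pm 1.96\frac{\sigma}{\sqrt{N}} \tag{5-9}$$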

In the example discussed above, with N = 5,

$$m - \mu = \frac{(1.96)(0.1)}{\sqrt{5}} = 0.09$$

This value is too large. We try N=15. m - μ is:

$$m - \mu = \frac{z\sigma}{\sqrt{N}} = \frac{(1.96)(0.1)}{\sqrt{15}} = 0.05$$

which is lower than our required value of 0.06 for m - μ, so we may be taking more samples than
necessary. We can determine the minimum number of samples by solving for N in equation 5-9,

$$N = \left(\frac{z\sigma}{m - \mu}\right)^{2} \tag{5-10}$$

and in this case, N = 11.
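
The sample size calculation for the isotope example is short enough to script. The Python sketch below uses the numbers from the text (z = 1.96, σ = 0.1, required half-width 0.06); the function names are ours, and the result is rounded up because the number of measurements must be a whole number.

```python
import math

def half_width(z, sigma, n):
    """Largest |m - mu| at the chosen confidence level for a sample of size n (equation 5-9)."""
    return z * sigma / math.sqrt(n)

def min_sample_size(z, sigma, max_error):
    """Smallest whole number of measurements that meets the required half-width (equation 5-10)."""
    return math.ceil((z * sigma / max_error) ** 2)

z, sigma, required = 1.96, 0.1, 0.06

for n in (5, 15):
    print(f"N = {n:2d}: m - mu = +/- {half_width(z, sigma, n):.3f}")

n_min = min_sample_size(z, sigma, required)
print("minimum N:", n_min)
```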

Central Limit Theorem


Many kinds of randomness can be interpreted in terms of a normal distribution. This is not an
accident. In fact, when the measurement is derived from a process that has some kind of intrinsic
averaging effect, the distribution will tend to be normal. There is even a theorem that states this fundamental
property. The Central Limit Theorem states that the distribution of sample means tends
toward a normal distribution as sample size becomes large, even if the population from which
those samples were taken is not normally distributed. This theorem is very important in statistics
because it tells us that even if a population does not follow a normal distribution, we can still assume
approximately normally distributed data if we are studying the means of sufficiently large samples.
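
A simulation gives a feel for the theorem. The Python sketch below draws samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen by us for illustration) and looks at the distribution of the sample means. The sample size of 30 and the 2000 repeats are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)

N = 30         # sample size
trials = 2000  # number of repeated experiments

# Each experiment: draw N values from the skewed population and record the sample mean
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(N))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
expected_sd = 1.0 / math.sqrt(N)  # population standard deviation divided by sqrt(N)

# Fraction of sample means within 1.96 standard deviations of the population mean
inside = sum(abs(m - 1.0) < 1.96 * expected_sd for m in means) / trials

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"std dev of sample means: {sd_of_means:.3f} (sigma/sqrt(N) = {expected_sd:.3f})")
print(f"fraction within +/- 1.96 sigma_m: {inside:.3f}")
```

Although the population itself is far from normal, the fraction of sample means inside the ±1.96 limits comes out close to the 95% expected for a normal distribution.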

Statistical Inference
We introduce the concepts of statistical inference and begin our discussion of hypothesis testing
and estimation here. We will present these concepts in more detail in the next chapter when we discuss
the t, F and χ² sampling distributions.
In general, statistical inference involves either the formulation of a hypothesis about a
population and the testing of that hypothesis or the estimation of a confidence interval for a population
parameter as given by equation 5-10. Both hypothesis testing and parameter estimation are based
on a sample and a sampling distribution (figure 5.7). For example, we might state as a hypothesis that
"the mean of this population is not significantly different than 10 at the 95% confidence level" or "the
means of these two populations are not statistically different at the 95% confidence level". If we are
interested in estimation, our question might be 'between what two values can we be 95% confident that
the mean of this population lies?'. We imagine the results of an experiment repeated many times. The
confidence intervals are the upper and lower values between which a particular statistic (e.g. sample
mean) lies some percentage of the time (often 95%).


In statistics, we never prove anything! We simply state the probability that a given
hypothesis is true or that a population parameter lies within a particular interval. There is nothing
magical about the value of 95%. We could have chosen to consider the 80% confidence level or the
99% confidence level, but 95% is a common value chosen.

Estimating population parameters from sample parameters


One of the most important values we compute is the sample mean. The value of the population
mean, μ, must be estimated by taking samples. In chapter 1, we defined the sample mean, m, as the
arithmetic mean of the sample values, x,

$$m = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{5-11}$$

where N is the sample size (the number of individuals in the sample). The sample mean, m, is the best,
unbiased estimator of the population mean, μ. By unbiased, we mean that if we conducted the same
experiment many times (infinity, in the limit) then m will tend towards μ exactly. This result was given in
equation 4-1, where it was stated that E[x] = m.
The other property of the population that is of interest is its standard deviation. This parameter
defines how much the individual population values vary from the mean. Again, this must be
estimated from the sample. We define the sample variance, s²:

$$ s^2 = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N - 1} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2}{N - 1} \tag{5-12} $$

where x_i is the sample value and N is the sample size. Using the above equation for the sample
variance, s², we arrive at the best, unbiased estimator (see next section) of the population variance, σ².
The standard deviation is the square root of the variance. The two equations for variance given above
are equivalent. The first is a simpler expression; the second may be more convenient under some
circumstances.
When computing s² for large N, you must be careful as you accumulate the sum of squared data values,
so that the sum of the squared numbers does not overflow the capacity of the number system used by
the computer. The first form of the equation will be superior in this respect, because the mean is
subtracted before the number is squared. When this is not enough, the sums must be computed
separately for sub-blocks of numbers, then divided by N-1, then added together.
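As an illustration of the two forms of equation 5-12, the following Python sketch computes the sample
variance both ways. The function names and the data values are hypothetical, chosen only for this
example. In floating-point arithmetic the difficulty usually appears as a loss of precision rather than
an outright overflow, but the remedy is the same: subtract the mean before squaring.

```python
def variance_two_pass(values):
    """First form of equation 5-12: subtract the mean, then square."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def variance_sum_of_squares(values):
    """Second form: accumulate sum(x) and sum(x^2) in a single pass."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for x in values:
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical data: mean 10, squared deviations 36 + 9 + 9 + 36 = 90, s^2 = 90 / 3 = 30.
data = [4.0, 7.0, 13.0, 16.0]

# The same data with a large constant added. The spread, and so s^2, is unchanged.
shifted = [x + 1.0e9 for x in data]

print(variance_two_pass(data), variance_sum_of_squares(data))
# For the shifted data the first form still returns 30, while the second form
# subtracts two numbers near 4e18 and can lose the answer entirely.
print(variance_two_pass(shifted), variance_sum_of_squares(shifted))
```

For small, well-scaled data the two forms agree; the difference only matters when the values are large
compared with their spread.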


Bias and Unbiased Estimators


An Estimator is the equation you use to compute the desired quantity from the data. The goal is to
make the best possible estimate of the population value from the data resulting from an experiment.
Some estimators may consistently give you a value that is lower (or higher) than the actual
population value. This means that although you may get more data and repeat the experiment many
times, the answer may still be off. You may wonder about the N-1 term in the denominator of equation
5-12 above. It would intuitively seem that N is the proper denominator to use, since we are essentially
taking an average of the squared difference between the sample value and the sample mean. However,
if we do this, the sample standard deviation will come out slightly under the true population standard
deviation, on the average. You can demonstrate this using a computer simulation (see problem 1 below).

Problem 1.

Show that equation 5-12 does produce an unbiased estimate of the population
variance. Do this by generating random numbers in Excel using the Data Analysis
selection from the Tools menu. Generate a large number of random numbers with a
normal distribution (with μ = 0 and σ² = 1). Then, repeatedly sample and average the
computed values for s². Show that dividing by N-1 instead of N in the formula
produces the correct variance, in the limit of many samples. Hint: start with small values
of N.

There is another way to intuitively understand this biasing effect. It is caused by the use of m (the mean
computed from the data) rather than μ (the true population mean) in the variance formula. Consider the
computation of the variance of a sample consisting of a single value. The sample mean is exactly equal
to the sample value. In the variance formula, the numerator is zero, and the denominator is 0, so we get
0/0, which is undetermined. It's a good thing, because: how can we get a variance with only one
sample? There is not enough data! If we used N in the denominator for a sample size of 1, s² would be
0, which would be wrong. Undetermined is a better answer. However, if we somehow knew the
population mean, μ, we would have some information on the variance, even from one sample. When
the population mean is known, the correct denominator is N, which is 1. Now, if we take 2 data points
for our sample, we can get a somewhat better value for s². The denominator (N-1) is 1. The sample
mean is computed as (x₁ + x₂)/2 and the variance is {(x₁ - m)² + (x₂ - m)²}/1. It is important to note
that m is halfway between x₁ and x₂. No matter what the population mean (for 2 data points), m is
always halfway between x₁ and x₂. The values of (x₁ - m)² and (x₂ - m)² will be slightly reduced, on
the average, from that which would have been obtained using μ. We speak of this as the statistics of the
variance having N-1 degrees of freedom.
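The biasing effect can also be checked numerically, in the spirit of problem 1. The sketch below uses
Python's standard random module rather than Excel (an assumption made for this illustration); the
function name, the number of trials and the seed are hypothetical choices. It draws many small samples
from a normal population with μ = 0 and σ² = 1 and averages the variance computed with N-1 and
with N in the denominator.

```python
import random


def average_variances(n, trials=20000, seed=1):
    """Average s^2 over many samples of size n drawn from a normal
    population with mean 0 and variance 1.

    Returns (average using N-1, average using N)."""
    rng = random.Random(seed)
    sum_n_minus_1 = 0.0
    sum_n = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(sample) / n                       # sample mean
        ss = sum((x - m) ** 2 for x in sample)    # sum of squared deviations
        sum_n_minus_1 += ss / (n - 1)
        sum_n += ss / n
    return sum_n_minus_1 / trials, sum_n / trials


unbiased, biased = average_variances(n=3)
print("N-1 denominator:", unbiased)   # should be close to the true variance, 1
print("N denominator:  ", biased)     # should be close to (N-1)/N = 2/3
```

With N = 3 the effect is large. Repeating the call with larger n shows the two averages converging,
which is why the hint in problem 1 suggests starting with small values of N.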


When the population mean, μ, is known, the correct formula for variance (the unbiased estimator of
the population variance) is:

$$ s^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$

When only the sample mean, m, is known, the correct formula for variance (the unbiased estimator of
the population variance) is:

$$ s^2 = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N - 1} $$

An estimator whose value, averaged over a large number of samples, tends toward the value of the
corresponding property of the population is called an Unbiased Estimator. The above formula for s² is
an Unbiased Estimator.

Review
After reading this chapter, you should:

Be able to calculate the sample mean and the sample variance. Know these formulas.

Understand what is meant when we say that the sample mean and the sample variance are the
'best, unbiased estimators' of the population mean and population variance.

Understand the term 'sample distribution' as it is used in this chapter.

Be able to describe the distribution of sample means for a normal population and the nature of
the mean and variance of this distribution and the relationship between these parameters and
sample size.

Be able to determine the sample size necessary to estimate the population mean from a sample
based on the distribution of the sample means. Assume that σ is known.

Be able to state and understand the Central Limit Theorem and appreciate why this theorem is
so important in statistics.

Be familiar in a general way with the concepts of statistical inference, parameter estimation and
hypothesis testing.

Understand what an unbiased estimator is, and why it is important to use unbiased estimators.


Exercises
1. Calculate m and s for the samples given below.

a. 1, 2, 2, 3, 3, 3, 4, 4, 5, 5

b. 4, 6, 6, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 12, 13, 14, 15, 17

2. Identify the population, the individual and a sample for each of the following problems. Note
that there is no one correct answer for many of these.
a. a study of the minerals and rock fragments in a thin section
b. a study of the porosity of a sandstone unit
c. a study of the opinions of professional geologists in California on offshore oil issues
d. a study of the occurrence of a particular fossil in the Jurassic
e. a study of the isotopic composition of lavas from hot spots
f. a study of the average number of hours spent by UCSB students doing homework per week
g. a survey of the extent of chemical contamination on a 1 km² industrial lot
h. a study of fault orientation in a map area
i. a study of minerals present in drill core

3. Suggest appropriate sampling schemes for each of the problems outlined in exercise #2 above.
Note that there is no one correct answer for these questions.

4. What is a random sample? What is a biased sample?

5. What factors influence the choice of sample size?


6. Define the term 'sampling distribution' in your own words.

7. Suppose you wish to measure the concentration of chemical X with an accuracy of 0.1 g, to a
95% confidence level. The measurement process you will be using has a known variance of
0.01 g². How many measurements must you make?

8. What is the Central Limit Theorem and why is it important?

9. Use the Central Limit Theorem to make a program that will compute a gaussian distributed
random number using the rand() function. This function, as specified, returns a number with a
uniform distribution. Show that this number is gaussian distributed by computing the number of
times the random number is outside of the 1σ bounds. Of course, you will have to make a
pretty good computation of what σ is before you can do this (Hint: use your simulation to do
this).

10. Assume that the final exams for a Geology 4 class follow perfectly a normal distribution. The
mean of the exam was 55 and the standard deviation is 20. Using Figure 4.10 from this
chapter, determine how many students received a score
a. above 75?
b. below 35?
c. between 35 and 55?
d. above 95?
e. between 35 and 95?

Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.

11. Use Table A1 to answer the following questions pertaining to the problem described in exercise
#10. How many students received a score
a. above 70?
b. between 70 and 90?
c. between 50 and 80?
d. above 85?
e. below 50?

Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.

12. For the problem described in exercise #10, between what two scores did 95% of the class score?


CHAPTER 6

Statistical Inference and the t, F and χ² distributions


In this chapter, we introduce the concepts of statistical inference and the Student's-t, F and χ²
(Chi-squared) sampling distributions, which will apply to Gaussian distributed populations. We discuss
the use of these distributions in a variety of parametric statistical tests. Throughout this chapter, we
stress the basic principles of statistical inference that underlie each of these tests - hypothesis testing or
parameter estimation at a specified level of confidence or significance.

The Student's-t distribution


This distribution was introduced by a statistician named William S. Gosset, who wrote under the pen
name of "Student". The distribution acquired the name Student's-t, after Gosset's pen name. The
Student's-t distribution solves a problem that is created when we want to compute the z value, but don't
know σ, the population standard deviation. We had, from Chapter 5:

$$ Z = \frac{(m - \mu)}{\sigma / \sqrt{N}} \tag{6-1} $$

If σ is unknown, the z tables cannot be used. It is logical to substitute the sample standard deviation, s,
for the population standard deviation, σ. If we take all possible (infinity in the limit) samples of size N
from a normal distribution with mean μ and variance σ² and calculate a t-statistic defined as:

$$ t = \frac{(m - \mu)}{s / \sqrt{N}} \tag{6-2} $$

for each sample, we will have a Student's-t distribution, which we shall call the t distribution from
now on. For large N, the t-distribution approaches the normal distribution. For practical purposes,
when N>30, the t-distribution and the normal distribution are nearly identical. For N<30 the t-distribution
curve will be symmetric as is the normal curve; however, as N decreases the curve will flatten. This
makes intuitive sense as with a smaller sample size, m is less likely to be close to μ than for a larger
sample size and therefore there will be fewer values for small t. Curves for a normal distribution and for
a t-distribution with N=4 are shown in Figure 6.1.
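The flattening of the curve for small N can be seen in a short simulation. The Python sketch below is
only an illustration: the function name, the number of trials and the seed are hypothetical choices. It
draws repeated samples of size N from a normal population with μ = 0 and σ = 1, computes t from
equation 6-2 for each, and counts how often t falls outside ±1.96, the limits that enclose 95% of a
normal distribution.

```python
import math
import random


def fraction_beyond(n, limit=1.96, trials=20000, seed=2):
    """Fraction of simulated t-values (equation 6-2) with |t| > limit,
    for samples of size n from a normal population with mean 0."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(sample) / n
        s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        t = (m - 0.0) / (s / math.sqrt(n))   # population mean is 0 here
        if abs(t) > limit:
            count += 1
    return count / trials


print("N = 4: ", fraction_beyond(4))    # noticeably more than 5% in the tails
print("N = 30:", fraction_beyond(30))   # close to the normal value of 5%
```

The larger fraction in the tails for N = 4 is the reason the critical t-values in the tables are larger than
the corresponding z-values when the sample is small.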


Parametric Statistics

6-1

Tables of t-values
In Chapter 5, we used tables with z-values for a standard normal distribution to find the
proportion of the area under the normal distribution curve between a given z-value and z equal to
infinity. A table for t-values is contained in Appendix A2.

Figure 6.1, Distribution of t values

We are interested in finding the t-value that a certain proportion of the area under the t-distribution
curve lies above. In some cases we want to find a t-value such that 5% of the area under the curve is in
the upper tail. In other cases, we want 5% of the area under the curve to be contained in both tails, so
we want to find a t-value such that 2.5% of the area under the curve is in the upper tail. Appendix A2
gives the t-values for which 2.5% and 5% of the area under the curve is in the upper tail. These values
are given for various degrees of freedom. For a t-distribution, the number of degrees of freedom
is N-1, where N is the sample size.
A table of t-values is given as Table A2 in the Appendix. It is important that you feel comfortable with
reading this table and that you understand the relationship between the numbers on this table and the
t-distribution curve. For example, suppose we want to find the t-value above which 5% of the area
beneath the t-distribution curve lies for a sample of size 11. In this case, we have 10 degrees of
freedom. Since we are interested in the upper tail only, we find the t-value for which we are looking,
1.812, in the row corresponding to 10 degrees of freedom and the column for 0.05. Next, suppose we
want to know between what two t-values 95% of the distribution curve lies, for 10 degrees of freedom.
In this case we want to find the t-value for which 2.5% of the area lies in the upper tail and 2.5% of the
area lies in the lower tail. We look in the 0.025 column for 10 degrees of freedom and read a value of
2.228 from the table. This means that 95% of the area of the curve lies between t=-2.228 and t=+2.228.
Often we have a t-value and want to know whether that value lies in the tail or tails of the curve. For
example, suppose we calculate a t-value of 1.900 for a sample with a sample size of 16. Is this t-value
in the 5% of the area beneath the distribution curve contained in the two tails of the curve? We look in
the column 0.025 and the row for degrees of freedom equal to 15 and find the t-value 2.131. It helps a
great deal to draw a sketch of the curve and at this stage you should always do so. From Figure 6.2a,
it is clear that our t-value of 1.900 is not contained in the two tails of the curve. What if instead we
asked whether this t-value was contained in the upper tail of the curve that contained 5% of the area
under the curve? The answer this time is yes it is, as Figure 6.2b shows: the value in the 0.05 column
for 15 degrees of freedom is 1.753, which is less than 1.900.

Figure 6.2. Curves showing t distribution. The areas in black show the 5% region of unlikely event for a two tail
test (a) and for a single tail test (b).


Practice reading the Student's-t Tables in Appendix A2.


Verify that you can read the table to get the t value for the following situations:
1. N = 5, find the t value that bounds the upper 10% of the distribution (ans: t=1.533).
2. N = 10, find the t value that bounds the upper 2.5% of the distribution (ans: t=2.262).
3. N = 12, t = 1.785. Is the t value within the upper 10% range? (ans: Yes)
4. For example 3, is the t value within the upper 5% range? (ans: No).
5. N = 11. We want the t values for the 95%, two-tailed confidence limit.
(ans: -2.228 < t < 2.228)
6. N = 4. Find the t values representing the two-tailed 90% confidence limit.
(ans: -2.353 < t < 2.353)
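The table-reading rule used above (degrees of freedom N-1, and an upper-tail area equal to the total
tail area divided by the number of tails) can be sketched in Python. The dictionary below is a
hypothetical stand-in for Table A2 and holds only a handful of entries: those quoted in this chapter,
plus the standard value 1.753 for 15 degrees of freedom at 0.05. The function names are likewise
assumptions made for this illustration.

```python
# Upper-tail critical t-values, keyed by (degrees of freedom, upper-tail area).
# Only a few entries are included; a real table has many more rows and columns.
T_TABLE = {
    (3, 0.05): 2.353,
    (4, 0.10): 1.533,
    (9, 0.025): 2.262,
    (10, 0.05): 1.812,
    (10, 0.025): 2.228,
    (15, 0.05): 1.753,
    (15, 0.025): 2.131,
}


def critical_t(n, confidence, tails):
    """Look up the critical t-value for sample size n.

    The column is the total tail area (1 - confidence) divided by the
    number of tails; the row is the number of degrees of freedom, n - 1."""
    df = n - 1
    area = round((1.0 - confidence) / tails, 4)
    return T_TABLE[(df, area)]


def in_tail(t, n, confidence, tails):
    """True if t lies in the tail (one-tailed) or tails (two-tailed)."""
    crit = critical_t(n, confidence, tails)
    return abs(t) > crit if tails == 2 else t > crit


print(critical_t(11, 0.95, tails=2))      # 2.228
print(critical_t(11, 0.95, tails=1))      # 1.812
print(in_tail(1.900, 16, 0.95, tails=2))  # False
print(in_tail(1.900, 16, 0.95, tails=1))  # True
```

A lookup for a combination that is not in the dictionary raises a KeyError, which is a reminder to
consult the full table.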

t-test: Estimating the Population Mean


We can use a t-test based on the t-distribution to estimate the mean of a population within a specified
confidence level from a sample. The procedure is identical to that followed in Chapter 5, except that
we use the sample standard deviation rather than the population standard deviation. Using equation 6-2
instead of equation 6-6, we find, in analogy to the derivation of equation 6-7, the following:

$$ m - t\,\frac{s}{\sqrt{N}} < \mu < m + t\,\frac{s}{\sqrt{N}} \tag{6-3} $$

The application of this equation is demonstrated in figure 6.3. The figure demonstrates several important
properties. Given a particular value for t, the allowable range of μ will decrease as N increases. This is
because the standard deviation of the distribution of the means decreases with increasing sample size
(refer to equation 6-5 and the associated discussion). Also, the larger the value of t, the wider the limits
on μ. The value of t is selected from the t distribution table according to the number of degrees of
freedom (N - 1) and the desired confidence intervals.

[Figure 6.3 panels: "Smallest μ consistent with m" and "Largest μ consistent with m"; each shows P(x)
with the sample mean marked and the offsets -t s/√N and +t s/√N.]

Figure 6.3. Largest and smallest population mean, μ, that is allowed by a specified t value. Note on the above plots
that the sample mean, m, is at the same place on the plot and the probability distribution curve is shifted to the right
and to the left to reflect its most extreme allowed positions. See figure 5.6.
Suppose we want to find μ for the porosity of a sandstone unit based on a sample of 11 measurements
for which m is 9.4 and s is 3.1. We won't worry about units of porosity here. We assume that the
porosity of the sandstone follows a normal distribution. Specifically, we want to find a range of values
which we can be 95% confident contains μ. This means that if the experiment were repeated many
times, 95% of the ranges computed in this way would contain μ. Since our sample size is 11, our
t-statistic will have 10 degrees of freedom. Since we want the 95% confidence interval and we have no
reason to be interested in only the upper tail or only the lower tail of the t-curve, we use a two-tailed
test. That is, we wish to find the positive t-value such that 2.5% of the area under the curve lies above
that value and the negative t-value such that 2.5% of the area under the curve lies below that value.
Since the curve is symmetric and most tables give values only for positive t-values, we consider the
upper tail only. We look up the value for the t-statistic in column 0.025 for 10 degrees of freedom in
Table A2. This value is +2.228. This means that when samples of size 11 are taken from a normal
population, only 5% of the samples will have values for m and s so extreme that t falls outside this
range; t will satisfy the relation

$$ -2.228 \le t \le 2.228 $$

for 95% of all possible samples of size 11.
From equation 6-2, we can express this relation as

$$ -2.228 \le \frac{m - \mu}{s / \sqrt{N}} \le 2.228 $$

If we rewrite this expression in terms of μ (equation 6-3), we obtain

$$ m - t\,\frac{s}{\sqrt{N}} \le \mu \le m + t\,\frac{s}{\sqrt{N}} \tag{6-4} $$

$$ m - \frac{2.228\,(3.1)}{\sqrt{11}} \le \mu \le m + \frac{2.228\,(3.1)}{\sqrt{11}} $$

which, if we substitute our values for m and s, gives:

$$ 7.3 \le \mu \le 11.5 $$

which says that the 95% confidence limits on μ are between 7.3 and 11.5. If we specify a confidence
level greater than 95%, our range of values would be larger. Conversely, if we specify a confidence
level of less than 95%, our range of values would be smaller.


In the example above we used a two-tailed approach. In other problems, we might be interested in
estimating μ so that we are 95% confident that μ exceeds a certain value or that μ is below a specific
value. In such cases, a one-tailed approach would be appropriate.
Using the Student's t-test:
1. Take the sample.
2. Compute the mean, m, and the sample standard deviation, s.
3. Calculate the standard deviation of the mean: $s_{\bar{x}} = s_x / \sqrt{N}$
4. Find t from the t tables for the desired confidence level and degrees of freedom (N-1).
5. Calculate the expected error limits (eq 6-4).
6. Interpret results: limits computed this way will contain the population mean in 95% of experiments
(if you used t for 95%).
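The six steps above can be sketched in Python as follows. This is a minimal illustration under stated
assumptions: the function names are hypothetical, the list of porosity values is invented for the
example (it is not the data set behind the worked example), and the critical value 2.228 is the Table A2
entry for 10 degrees of freedom, so it is only valid for a sample of 11 values.

```python
import math


def summarize(sample):
    """Steps 1-2: sample size, sample mean and sample standard deviation."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return n, m, s


def t_interval(m, s, n, t_crit):
    """Steps 3-5: error limits m -/+ t * s / sqrt(N) from equation 6-4."""
    half_width = t_crit * s / math.sqrt(n)
    return m - half_width, m + half_width


# Hypothetical porosity measurements (11 values, so 10 degrees of freedom).
porosity = [5.1, 9.8, 3.2, 7.4, 6.0, 11.3, 4.5, 8.1, 2.9, 6.6, 5.5]

n, m, s = summarize(porosity)
low, high = t_interval(m, s, n, t_crit=2.228)   # step 4: t read from the table
print("m = %.2f, s = %.2f, N = %d" % (m, s, n))
print("95%% limits on the population mean: %.2f to %.2f" % (low, high))

# With s = 3.1 and N = 11, as in the worked example, the half-width of the
# interval is 2.228 * 3.1 / sqrt(11), about 2.08, whatever the value of m.
print(t_interval(0.0, 3.1, 11, 2.228))
```

Keeping the table lookup outside the function makes the dependence on the degrees of freedom explicit:
a different sample size requires a different value of t_crit.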

Single tailed test: Consider again the problem of determining the mean porosity of a sandstone unit
discussed above. Suppose we want to know the value, μmax, that we could be 95% confident μ exceeded.
Restated, we could say: if we repeat this experiment an infinity of times, the value we would get
for the sample mean would exceed μmax in 95% of the experiments. We look up the value for the
t-statistic in column 0.05 for 10 degrees of freedom in Table A2. This value is +1.812. This means that
when samples of size 11 are taken from a normal population, only 5% of the samples will have extreme
values for m and s such that t will be greater than 1.812. Another way of saying this is that for 95% of
all possible samples of size 11, t will be less than 1.812. We write

$$ t \le 1.812 $$

$$ \frac{m - \mu}{s / \sqrt{N}} \le 1.812 $$

which gives μ ≥ 7.7.


Decreasing the error limits:
The error limits for an experiment can be reduced by increasing the sample size, N (see eqn 6-4). The
error approaches zero as N approaches infinity. This is a great help in designing an experiment where
you don't want to over-sample, because taking samples is time-consuming and often expensive. You
will first need an estimate of the population variance from a small trial experiment, existing data, or
theory. If the estimate of the population variance is close, you will be able to accurately estimate the
number of samples you need to estimate the population value with a particular accuracy. Of course,
there is also the possibility that your estimate of the population variance is poor, in which case you may
over or under sample by a considerable margin.
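To make the planning step concrete, equation 6-4 can be rearranged for N. Writing E for the largest
acceptable half-width of the error limits (E is a symbol introduced here for illustration and is not
used elsewhere in the text), a sketch of the estimate is:

$$ t\,\frac{s}{\sqrt{N}} \le E \quad\Longrightarrow\quad N \ge \left(\frac{t\,s}{E}\right)^{2} $$

Because t itself depends on N through the degrees of freedom, the estimate may need to be repeated
once or twice. As a rough, hypothetical illustration, a trial value of s = 3.1, a target of E = 1 and
t ≈ 2 (appropriate for 95% two-tailed limits when N is large) give N ≥ (2 × 3.1 / 1)² ≈ 38, so on the
order of 40 measurements would be planned.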

Introduction to Hypothesis Testing


Since it is so important to state statistical tests precisely, the formalism of Hypothesis Testing has
been developed. In hypothesis testing, we state the hypothesis we are testing (the null hypothesis, or
Ho), and also the alternative hypothesis that is true if the original hypothesis is refuted (the alternative
hypothesis, or Ha).
For example, we might want to ask the question: "Can we state, at a 95% confidence level, that
the mean of the population from which this sample was taken is 15?" In this case, we would pose the
null hypothesis, Ho, that the mean of our population is 15. Our alternative hypothesis, Ha, is that the
mean of our population is not 15. We state this formally as

$$ H_o:\ \mu_1 = 15 \quad \text{and} \quad H_a:\ \mu_1 \ne 15 $$

and we set our confidence level at 95% (a significance level of 0.05). This means that we will reject Ho
if a t-value as extreme as the one we calculate would occur less than 5% of the time when Ho is true.
We calculate a t-statistic based on the expression for t given above. For the sandstone sample, with
m = 9.4, s = 3.1 and a sample size of 11, the magnitude of our t-value is 6.0 as calculated below.

$$ |t| = \frac{|m - 15|}{s / \sqrt{N}} = \frac{|9.4 - 15|}{3.1 / \sqrt{11}} = 6.0 $$

In this case, we have 10 degrees of freedom. A two-tailed test is appropriate here since we do
not care whether the porosity of our sandstone is greater than 15 or less than 15, only that it is
significantly different from 15. Therefore, we look up the critical value for t in a t-table with 10 degrees
of freedom at the 0.025 level. One way to remember the appropriate column of the t-table to look at is
to take the desired significance level for rejection (here 0.05) and divide this number by '1' if we have a
one-tailed test or '2' if we are using a two-tailed test. In this case, the critical t-value is 2.228.
Now we have all the information we need to either accept or reject our null hypothesis at the
95% significance level. Figure 6.3 shows the critical and calculated t-values for this problem. Since our
t-value of 6.0 lies in the curve's tail, we can say that there is less than a 5% chance that the mean
porosity of the sandstone is 15. Therefore, we reject Ho at the 95% confidence level and conclude that
the mean porosity of the sandstone is not 15.
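The arithmetic of this test can be checked with a few lines of Python. The sketch below takes the
summary values that appear in the calculation above (a sample mean of 9.4, s = 3.1, N = 11) and the
critical value 2.228 read from the table; the function name is a hypothetical choice for this example.

```python
import math


def one_sample_t(m, s, n, mu0):
    """t-statistic of equation 6-2 with the hypothesized mean mu0 in place of mu."""
    return (m - mu0) / (s / math.sqrt(n))


t = one_sample_t(m=9.4, s=3.1, n=11, mu0=15.0)
t_crit = 2.228                       # two-tailed, 10 degrees of freedom, 0.025 column

# Two-tailed test: only the magnitude of t matters.
reject_h0 = abs(t) > t_crit
print("t = %.1f, reject Ho: %s" % (t, reject_h0))
```

The sign of t is negative because the sample mean lies below the hypothesized value of 15. Since the
test is two-tailed, it is the magnitude, 6.0, that is compared with the critical value.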


Figure 6.3. t distribution showing 95% significance levels and example t value of 6.0.

We could also ask a question about the sample size necessary for a given study. In the last
chapter, when we asked this question, we knew σ². Usually, we only know s² and we must use the
t-distribution. If we have no idea what the value for s is in a particular case, we might perform a pilot
study to obtain this information before going ahead with the main study.
Discussion:
In general, the critical question is whether the data are consistent with the null hypothesis at the
required significance level. What is required is the sample distribution and confidence levels. An
important issue in this process is the selection of the correct sample distribution. In this text we
emphasize the gaussian distribution, where the Central Limit Theorem assures us that the distribution of
sample means will be close to gaussian.
Siegel (1956) suggests six steps that should be carried out in hypothesis testing. These are:
1. State the null and alternative hypotheses.
The null hypothesis is usually called H0. The alternative hypothesis is called Ha here. In chapter 5,
problem 6, where the problem was to decide whether dice were weighted or not, H0 might be stated as:
"there is no difference between the probability of each die face showing" and Ha might be stated as:
"there is a difference between the probabilities of the die faces showing".
2. Choose the statistical test.
There are many statistical tests to choose from. Each test requires that the data conform to certain
assumptions, for example that the population is normally distributed. Tests vary in their power to
discriminate. That is, the limits set on the parameters to be tested may vary from very wide when a test
has little power to narrow when a test has a high power. Tests with very few assumptions often
have little power relative to tests which have many assumptions. It is important to make efficient use of
the data. For example, if there is good reason to assume that the data are normally distributed, gaussian
statistics should be used. If the data are not normally distributed, less powerful tests will be required.
The power of a test will be discussed more later.
3. Choose the size of the sample, N, and the size of a small quantity, α.
The choice of α determines the confidence limit that you require for your data. α is the probability
that the result falls outside of the confidence limits. An α of 0.05 would correspond to a confidence
level of 95%. It is expected that as N increases, the result will be more accurate. This is reflected in the
limits set by equation 5-7 for gaussian distributions. It is important to choose neither too small nor too
large a value of N. Too large a value is a waste of effort and too small means that the results will not be
reliable.
4. Evaluate or determine the frequency distribution of the test statistic.
The test statistic is the value that we are going to test. In the dice toss problem, the test statistic
was the number of 3's (or the face you were testing). In that problem, the frequency distribution of
the number of times a particular die face occurred was given by the binomial distribution. The test
statistic might be the sample mean or sample variance. If the data are gaussian distributed, the correct
distribution might be the t, Chi-squared, or F distributions (coming later). If there is not enough
fundamental knowledge about the process that produces variations in the data, it may be necessary to
prove that the data follow a particular frequency distribution. If measured values are put into an
equation (e.g. to compute an age date), the distribution of the statistic may be influenced by the
operations in that equation.
5. Define the critical region or region of rejection of the null hypothesis.
This is the range of values of the test statistic that lie outside of the confidence limits that were set
in step 3. For example, if the frequency distribution of the test statistic is normal and its standard
deviation is 1, an experimental result greater than 1.96 or less than -1.96 would be within the region of
rejection at the α = 0.05 level. So, if the test statistic is outside of the critical region, we can accept
the null hypothesis at the specified confidence level, but if it is inside the critical region, we can reject
the null hypothesis at the specified confidence level.
6. Make the decision.
Suppose the test statistic is within the critical region. There are two possible conclusions. They are
either that H0 is false, or that an unlikely event has occurred and H0 is actually true.

Example of t-test: Comparing Two Means


The t-distribution can also be used to test the probability that two samples were taken from
identical populations. In this case we calculate a t-statistic according to the following expression

$$ t = \frac{m_1 - m_2}{S \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}} \tag{6-5} $$

where m1 and m2 are the sample means and N1 and N2 are the sample sizes of the first and second
samples, respectively, and where

$$ S = \sqrt{\frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2}} \tag{6-6} $$

and S² is called the combined variance. Because our t-distribution was determined by sampling from a
single population, one of the requirements for performing this t-test is that the variance estimates not be
significantly different. This can be tested using the F distribution discussed in the next section.
If we take all possible samples of size N1 and N2 from the same population and calculate a
t-statistic as defined here for each sample, the result will be a t-distribution with N1+N2-2 degrees of
freedom. We illustrate how we use this t-test with an example.
Consider the following problem concerning ore deposits. Data on the concentrations of ore on
opposite sides of a fault suggest that the ores are significantly different. If so, then it is likely that the fault
was formed before the ore was emplaced. If not, then it is likely that the fault formed after ore
deposition. Based on the data provided below, is there a difference, at the 95% confidence level,
between the concentration of ore on either side of the fault? In other words, is the probability less than
5% that the sample from north of the fault and the sample from south of the fault are from identical
populations? We assume that the distribution of ore concentrations follows a normal distribution and that σ²
from the sample north of the fault is not statistically different from σ² from the sample south of the fault.
The data for this problem are given below in Table 6.1.

                  Mean ore concentration   Variance   Number of data in sample
North of fault    33 mg/kg                 13         10
South of fault    23 mg/kg                 12         15

Table 6.1. Data from example discussed in text.

Our null hypothesis, Ho, is that the mean of the ore concentration north of the fault is not
statistically different from the mean of the ore concentration south of the fault, assuming that the
variances for the two populations are the same. We state this formally as

$$ H_0\,(\mu_1 = \mu_2,\ \sigma_1^2 = \sigma_2^2) $$

where the subscript '1' refers here to the ore north of the fault and the subscript '2' refers to the ore
south of the fault. If Ho is true then we can state that our two samples come from identical populations
within the confidence level of the test.
Our alternative hypothesis, Ha, is that the mean of the ore concentration north of the fault is
statistically different from the mean of the ore concentration south of the fault. We state this formally as

$$ H_a\,(\mu_1 \ne \mu_2) $$

Next we must state our desired significance level. Remember, with statistics we never prove
anything. We can only state that our null hypothesis is supported at a stated level of confidence or
significance. For this problem, we set the required confidence level at 95%. Another way of stating this
is that, if Ho were true and the experiment were repeated a large number of times, we would wrongly
reject Ho in only 5% of those experiments.
Next, we calculate the t-statistic for the comparison of two means with equations 6-5 and 6-6.
S =

t =

(10 1)(13) + (15 1)(12 )


10 + 15 2

= 3.5

33 23
= 7. 0
1
1
3.5
+
10 15

Here, our t-statistic is 7.0 and we have 23 degrees of freedom. A two-tailed test is appropriate here
since we do not care whether the concentration of ore on one side of the fault is higher or lower than the
other, only that one side is significantly different. Therefore, we look up the critical value for t for this
problem in a t-table with 23 degrees of freedom at the 0.025 level. Here, the critical t-value is 2.069.
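
The arithmetic above can be checked with a short script. This is only a sketch: the function name
pooled_t is an illustrative choice, and the critical value is copied from the t-table lookup quoted in
the text rather than computed.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Return (pooled std dev S, t statistic, degrees of freedom) from summary statistics."""
    dof = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / dof)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2))
    return pooled_sd, t, dof

# Values from Table 6.1 (north of fault first, then south of fault).
s, t, dof = pooled_t(33.0, 13.0, 10, 23.0, 12.0, 15)

T_CRITICAL = 2.069  # two-tailed 5% critical value for 23 dof, as quoted in the text
reject_h0 = abs(t) > T_CRITICAL

print(f"S = {s:.2f}, t = {t:.2f}, dof = {dof}, reject H0: {reject_h0}")
```

Working from the summary statistics (mean, variance, N) rather than the raw data mirrors how the example
is presented in Table 6.1.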

Figure 6.4. t distribution for 23 degrees of freedom, with 95% confidence limits marked.

Now we have all the information we need to either accept or reject our null hypothesis at the
95% significance level, as shown in Figure 6.4. Since our t-value of 7.0 lies in the curve's tail, we can
say that there is less than a 5% chance that these two samples are from identical populations.
Therefore, we reject Ho at the 95% confidence level and conclude that we can be 95% confident that
the fault was formed before the ore was emplaced.
Figure 5.4. Illustration of type I and type II errors. Curve A
corresponds to the distribution specified in H0. Curve B is another
distribution, which might also produce the experiment's outcome.

Coming to the Wrong Conclusion


Just as it is possible for a gambler to win against the odds at the roulette table, or to toss 10
heads in a row, it is possible to come to the wrong answer in statistics. This is because even rare
events have a probability of occurring.
Figure 5.4 shows the situation where there are two populations. Population A has a mean of
10 and population B has a mean of 20. The overlap between the two distributions allows for a finite
probability of making an erroneous conclusion. Let's state the null hypothesis, Ho, that the population
we are sampling is represented by curve A. The alternative hypothesis, Ha, is that the population we are
sampling is represented by curve B. Suppose the population we are sampling is really population A.
Suppose also that the sample lies in the tail of curve A. We reject the null hypothesis, which was
correct, and conclude that the sample came from population B. What happened is that we got a rare event
and came to the wrong conclusion. This is called a Type I error. At the 95% confidence level, there is a
5% chance of getting this type of error. You must be careful about one-tail or two-tail significance levels
here.
Suppose, however, that the correct answer is population B. The chance that a sample taken
from population B lies within the accept range of our test is the area $\beta$ shown in Figure 5.4
(remember: Ho stated that the sample was from population A). When the sample lies within this region
of the probability curve, we would accept Ho, incorrectly concluding that the sample came from
population A. According to the curves shown in the figure, there is a pretty high probability that a
sample from population B will be interpreted as a sample from population A. The smaller the $\beta$ value,
the smaller the probability of getting a type II error. The power of a test is defined as $1 - \beta$.
How can we improve the situation in a real experiment? The best way is to reduce the standard
deviation of the distribution of sample means by increasing the number of data values in a sample.
Increasing the level of significance (smaller $\alpha$) only increases the overlap. Reducing the level of
significance (larger $\alpha$) decreases the area of overlap, and this is one of the ways to increase the
power of a test, but it does compromise the overall significance of the test. Since the width of the
distribution of the means decreases as $1/\sqrt{N}$, sampling more data is one of the best ways to increase
the power of a test.

                              Reality
Decision:          H0 true                 H0 false
accept H0          $1 - \alpha$            $\beta$ (type II)
accept H1          $\alpha$ (type I)       $1 - \beta$

Table 5.1. Matrix of possible interpretations of an experiment based on the true population value and the sampled
outcome.

Example:
Figure 5.4 shows two distributions that might produce the same experimental result. Suppose
we define H0 as $\mu = 15$ and Ha as $\mu \neq 15$. If X is greater than 17 (in the region defined by $\alpha$),
we reject H0. The probability that a type I error occurred (we falsely rejected H0) is given by $\alpha$ (0.05
for a 95% confidence level). However, suppose that X is 15, well within the accept zone of H0. The
probability that $\mu$ is really 20, but X is within the accept zone of H0, is given by the area indicated by
$\beta$, which is the probability of a type II error. The type II error occurs when we erroneously accept
H0. It is simple to compute the probability, $\beta$, given the value of $\alpha$ chosen for the test and $\sigma$ of the
population.
Suppose that $\mu_A = 10$ and $\mu_B = 20$. Suppose also that $\sigma = 4$. We choose $\alpha = 0.05$,
so that the upper reject region begins at 10 + 1.96*4 = 17.84. Since the mean of Curve B is 20, this
boundary lies 20 - 17.84 = 2.16 below the mean of Curve B, which is 2.16/4 = 0.54 standard deviations.
From Appendix A1, the tail area of the normal curve beyond Z = 0.54 is 0.295. Be sure to check this in
Appendix A1 to be sure you understand where this number came from. Use the graphic at the top of the
figure to be sure what area the table produces. The result means that the probability $\beta$ of a type II
error in this case is 0.295. 29.5% of the time, under the given circumstances, if the distribution of
Curve B was the real distribution, one would falsely accept H0.
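
The calculation above, and the effect of averaging N values, can be sketched in a few lines. The function
name type_ii_error and its parameters are illustrative, not part of the course materials. The sketch
assumes a two-tailed test and normally distributed sample means with standard deviation sigma/sqrt(N).

```python
from statistics import NormalDist

def type_ii_error(mu_a, mu_b, sigma, alpha=0.05, n=1):
    """Probability of accepting H0 (population A) when the data really come from B."""
    se = sigma / n ** 0.5              # standard deviation of the mean of n values
    h0 = NormalDist(mu_a, se)
    lower = h0.inv_cdf(alpha / 2)      # two-tailed accept zone under H0
    upper = h0.inv_cdf(1 - alpha / 2)
    alt = NormalDist(mu_b, se)
    return alt.cdf(upper) - alt.cdf(lower)

# The worked example: a single value, mu_A = 10, mu_B = 20, sigma = 4.
beta = type_ii_error(10.0, 20.0, 4.0)

# Power improves quickly as more values are averaged.
for n in (1, 2, 4, 8):
    b = type_ii_error(10.0, 20.0, 4.0, n=n)
    print(f"N = {n}: beta = {b:.4f}, power = {1 - b:.4f}")
```

The lower reject boundary is included for completeness, although in this example it contributes almost
nothing to the result.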

The F-distribution
The F-distribution describes the distribution of the ratio of the variances of two independent
samples from the same normal population. If we take all possible samples of size N1 from one normal
population and size N2 from a second normal population, where both populations have the same
variance $\sigma^2$, and calculate an F-statistic defined as
$$F = \frac{s_1^2}{s_2^2}$$
for each pair of samples, we will have an F-distribution, where $s_1^2$ is the variance of the first sample
and $s_2^2$ is the variance of the second sample. The sample with the larger variance is always put on top
in the equation. The F-distribution has N1-1 and N2-1 degrees of freedom.
An example of an F-distribution is shown below in Figure 6.6. The choice of which sample
should be sample 1 and which should be sample 2 should be made so that F > 1 in order to use most
tables of F-values. Note that there are no negative values for F.

Figure 6.6. The F distribution, which is the ratio of variances of two samples from a Gaussian distributed population,
for 4 degrees of freedom for both samples. (Horizontal axis: F-value; vertical axis: frequency of F.)

A table of F-values is given as Table A3 in the Appendix. As with the standard normal and t
distributions, we are interested in finding the value above which a certain proportion of the area under
the distribution curve lies. It is necessary to specify the degrees of freedom for both the numerator and
the denominator to use a table of F-values. F-values corresponding to 5% probabilities are given in
Table A3. For example, where 5% of the area under the curve lies in the upper tail for 4 degrees of
freedom in both the numerator and the denominator, the F-value is 6.39.

F-test
An F-test based on the F-distribution can be used to test the probability that two samples were
taken from populations with equal variances. For example, consider the problem of ore concentrations
across a fault that we discussed earlier. In performing a t-test, we assumed that the variances of the two
populations represented by the two samples were not statistically different. Let us now test whether or
not this was a good assumption.
Our null hypothesis is that the two variances are equal and our alternative hypothesis is that they
are not. We state
$$H_o: (\sigma_1^2 = \sigma_2^2) \quad \text{and} \quad H_a: (\sigma_1^2 \neq \sigma_2^2)$$

and this time let us set our confidence level at 95%. We look up the F-value in this case where N1=10
and N2=15 corresponding to a 5% area in the upper tail. This value is 2.65. The variance of the first
sample is 13 and the variance of the second sample is 12, so F = 1.08. Our F-value is not in the tail of
the curve, so we accept our null hypothesis. We were justified in assuming that the two samples came
from populations with the same variances at the 95% confidence level.
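
As a quick check of this F-test, the sketch below forms the variance ratio with the larger variance on
top and compares it with the table value quoted above. The function name f_statistic is an illustrative
choice, and the critical value 2.65 is copied from the text, not computed.

```python
def f_statistic(var_a, n_a, var_b, n_b):
    """Return (F, numerator dof, denominator dof) with the larger variance on top."""
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Sample variances and sizes from Table 6.1.
F, dof_num, dof_den = f_statistic(13.0, 10, 12.0, 15)

F_CRITICAL = 2.65  # upper 5% point for (9, 14) degrees of freedom, as quoted in the text
accept_h0 = F < F_CRITICAL

print(f"F = {F:.2f} with ({dof_num}, {dof_den}) dof, accept H0: {accept_h0}")
```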
DO THIS NOW! Practice reading the F Tables in Appendix A3
Verify that you can read the table to get the F value for the following situations:
1. N1 = 5, N2 = 8. Find the F value that is in the upper 5% of the range. Assume
sample 1 has the smallest variance. (ans: F=6.09).
2. N1 = 10, N2 = 20. Find the F value that is in the upper 5% of the range. Assume sample
2 has the smallest variance. (ans: F=2.42).
3. N1 = 12, N2 = 11, $s_1^2 = 4.5$, $s_2^2 = 1.2$. Are the variances from the same population, to
a 95% significance? (ans: No, F = 3.75, the F limit = 2.85)
4. N1 = 4, N2 = 6, $s_1^2 = 4.6$, $s_2^2 = 2.0$. Are the variances from the same population, to a
95% confidence level? (ans: Yes. F=2.3, F limit = 5.41)

$\chi^2$-distribution
If Gaussian distributed variables are standardized and squared, they follow the $\chi^2$-distribution. For
example, if Y is a single Gaussian distributed variable with mean $\mu$ and standard deviation $\sigma$, then
$$\chi^2 = \left(\frac{Y - \mu}{\sigma}\right)^2$$
follows a $\chi^2$ distribution with 1 degree of freedom. If we form a sum of N terms
$$\chi^2 = \sum_{i=1}^{N} \left(\frac{Y_i - \mu}{\sigma}\right)^2$$
we have a $\chi^2$ distribution with N degrees of freedom.

Figure 6.7 shows a plot of the $\chi^2$-distribution for 4 degrees of freedom. As the number of degrees of
freedom becomes large, the $\chi^2$-distribution approaches a normal distribution. As with the other
distributions we have discussed so far, the total area under the curve is one. Note that the value of $\chi^2$
is always positive.

Figure 6.7. Chi-squared distribution for 4 degrees of freedom. (Horizontal axis: chi-squared value;
vertical axis: frequency of chi-squared.)

Random variables with a Gaussian distribution become $\chi^2$ distributed when they are squared.
The mean of a $\chi^2$ distributed variable with N degrees of freedom is $E[\chi^2] = N$.
The variance of a $\chi^2$ distributed variable with N degrees of freedom is $var[\chi^2] = 2N$.

Table A4 in the Appendix gives the values of $\chi^2$ which define the upper tail of the curve for
various degrees of freedom. Critical $\chi^2$-values are given corresponding to various areas under the curve
in the upper tail.

$\chi^2$-tests
A very useful application of the $\chi^2$-test is in testing whether a sample came from a Gaussian
distribution. To do this, we form a statistic which is related to the difference between the expected
and observed number of data values within each class. The $\chi^2$-statistic for this situation is:
$$\chi^2 = \sum_{i=1}^{\#\text{ of classes}} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency in the ith class of the distribution and $E_i$ is the expected
frequency in the ith class according to some probability distribution. The number of degrees of
freedom is $c - k - 1$, where c is the number of classes and k is the number of estimated parameters (k = 2
if m and $s^2$ are used as estimates for $\mu$ and $\sigma^2$). So, if an analysis used 6 histogram bars, and $\mu$ was
estimated from the data by $\bar{x}$, and $\sigma$ was also estimated from the data, the number of degrees of
freedom would be 6 - 2 - 1 = 3.
The $\chi^2$-distribution is important because it can be used in many parametric and non-parametric tests.
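
The statistic can be sketched as follows. The class edges, observed counts, and the values of m and s
below are hypothetical, chosen only to illustrate the steps; the helper names expected_counts and
chi_squared are likewise illustrative. In a real test, m and s would be estimated from the data and the
result compared with the critical value from Table A4.

```python
from statistics import NormalDist

def expected_counts(edges, mu, sigma, n_total):
    """Expected number of data in each class for a Gaussian with the given mu and sigma."""
    dist = NormalDist(mu, sigma)
    return [n_total * (dist.cdf(hi) - dist.cdf(lo))
            for lo, hi in zip(edges[:-1], edges[1:])]

def chi_squared(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical histogram: 5 classes between 0 and 10, 50 data values in total.
edges = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
observed = [3, 12, 20, 11, 4]
m, s = 5.0, 2.0                      # assumed sample mean and standard deviation

expected = expected_counts(edges, m, s, sum(observed))
chi2 = chi_squared(observed, expected)
dof = len(observed) - 2 - 1          # c - k - 1 with k = 2 estimated parameters

print(f"chi-squared = {chi2:.3f} with {dof} degrees of freedom")
```

Note that the expected counts do not sum exactly to the sample size here, because a small part of the
Gaussian lies outside the outermost class edges.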

Concept Review
It is important to understand the similarities in how the various distributions are used to test a hypothesis.
All of the distributions discussed in this chapter are derived from Gaussian distributed data. When the
data are transformed in specific ways (e.g. we may be interested in a squared parameter: chi-squared, or
a ratio of variances: F test), a certain distribution results. This is the distribution of a Gaussian
distributed variable that has been squared or ratioed, or has had some other operation performed on it.
For example, the Normal distribution results if we transform the Gaussian distributed data according to:
$$Z = \frac{x_i - \mu}{\sigma}$$
The t distribution results if we transform the Gaussian distributed data according to:
$$t = \frac{x_i - m}{s}$$
The t test is used for putting confidence limits on the distribution of sample means. It is important that the
sample means follow a normal distribution. Use the $\chi^2$ test to check this.
The $\chi^2$ distribution results if we square Gaussian distributed variables. Use the $\chi^2$ test to test the
confidence with which a distribution is normal (see the section on $\chi^2$-tests).
The F distribution results if we compute the sample variances and divide the largest variance by the
smallest variance. It is used to test the confidence with which the sample variances of two samples are
from populations with equal population variances (see the section on the F-test).
$$F = \frac{s_1^2}{s_2^2}$$

Of course, you should remember that the distribution comes from visualizing the repeating of the
experiment many times and plotting the histogram that is the average of all of the histograms, in the limit
where the class interval gets very small.
Reading each of the tables is similar. You figure out the degrees of freedom and the confidence limits,
read the value, then see if the computed sample statistic is within the Accept or reject range.

Encouragement

While statistical thinking represents a radical departure from the way you normally think, it is
really not so hard if you concentrate on a few facts. When making statistical inferences, it is helpful to
remember the sampling paradigm discussed earlier. There exists a population of values and you have
taken a sample from that population. The test statistic follows some kind of distribution, based on the
population statistics (for Gaussian population distributions the mean and variance are enough). Once
you know that distribution, the confidence limits follow immediately by considering the area underneath
the distribution curve. After that, it is a simple matter to test whether your sample value falls within those
limits.
Problem 2. Suppose you are performing an experiment to determine whether a sample of seawater
is derived from deep bottom water or from surface water. Deep bottom water (DBW)
has a mean concentration of constituent A of 100 parts per million and surface water
(SW) has a mean concentration of constituent A of 120 ppm. Assume also that you
have found out by independent means that the standard deviation of the population of
both DBW and SW is 20 ppm. You take a sample consisting of N analyses of the
water. The problem is to choose whether the sample of water is from DBW or SW.
Make an analysis of this problem using the principles discussed above.
a) Discuss and perform the six steps required of hypothesis testing.
b) Analyze the problem in terms of type I and type II errors. Plot $\alpha$ and $\beta$ vs N and
determine the optimum number of samples for a 95% confidence that you can
discriminate between DBW and SW.

Review
After reading this chapter, you should:

Know what a t-distribution is and how it is generated.

Know what an F-distribution is and how it is generated.

Know what a $\chi^2$-distribution is and how it is generated.

Be able to read tables of t-values, F-values and $\chi^2$-values and understand the relationship between
these values and the t, F and $\chi^2$-distribution curves.

Be able to perform a t-test to:

- estimate the population mean from a sample;

- determine whether or not the mean of a population is different from (or higher or lower than) a
specified value; and

- compare two samples to test if they are from identical populations to a certain level of
confidence.

Be able to perform an F-test to determine whether two samples come from populations with equal
variances.

Be thoroughly familiar with the method of hypothesis testing.

Understand Type I and Type II errors, and the power of a test, and be able to calculate the
probability of each.

Exercises
State the null hypothesis and alternative hypothesis for all problems where you are asked to perform a
statistical test involving hypothesis testing.
1. For 9 degrees of freedom,
a. what is the t-value above which 5% of the area beneath the t-distribution curve lies?
b. what is the t-value above which 2.5% of the area beneath the t-distribution curve lies?
c. between what two t-values do 95% of the t-values lie?

2. A t-value of 1.8 is calculated for 20 degrees of freedom.
a. Is this value in the 5% of the area beneath the t-distribution curve in both tails?
b. Is this value in the 5% of the area beneath the t-distribution curve in the upper tail?

3a. For the purpose of using a t-test to estimate the population mean from a sample, how many
degrees of freedom are there?

3b. For the purpose of using a t-test to compare two sample means, how many degrees of freedom
are there?

4. List the basic steps in the hypothesis testing procedure.

5. What does it mean to say "I am 95% confident that the population mean lies between 140 and
150."?

6. What is meant by the phrase 'the power of the test'?

7. In using a t-distribution to estimate the population mean from a sample, does the size of the
range of values specified for the population mean increase or decrease with
a. greater required precision in the estimate?
b. increased sample size?
c. increased variability in the population?

8. To determine the concentration of chemical X in a given liquid, 12 measurements were made.
The error in the measurements is normally distributed. Given the data below, what two values
can we be 95% confident the true concentration lies between?
15

17

15

25

21
19
23
18
25
26

9. The recommended safe limit for chemical Y in drinking water is 10 mg/l. Water samples are
taken once a month to monitor this chemical. The data for the first 6 months of testing are given
below. Can we be 95% confident that the concentration of Y is less than 10 mg/l?
11
14
23

10. After reviewing some measurements made in the lab, the lab supervisor notices a seemingly
systematic bias in the data. The supervisor suspects that the two lab assistants who made
the measurements are using slightly different measurement techniques and that this is the root of
the problem. One day, both assistants are given the same 10 materials to measure. Based on
the following data, can we be 95% confident that the technique of the two assistants is different?
Assistant A

Assistant B

52
58
57
70
65

57
59
65
68
60

11. For 10 degrees of freedom in the numerator and 10 degrees of freedom in the denominator,
what is the F-value above which 5% of the area beneath the F-distribution curve lies?

12. The variance of errors in measurements made by two different labs are given below. Are these
differences in variances statistically significant at the 95% significance level?

             sample size    standard deviation
Lab A        11             66
Lab B        21             40

13. This very important problem demonstrates the use of the Chi-squared distribution to test
whether a sample could have come from a Gaussian distributed population. 20 random data
points are taken. m = 2.995, s = 1.049. The data were plotted on a histogram consisting of 10
equal classes beginning at a value of 0 and ending at a value of 6. The number of data within the
classes is: 0,1,1,4,4,5,2,2,1,0.
a) Assuming that the data are sampled from a Gaussian distribution, compute the expected
number of data in each class. Approximate $\mu$ and $\sigma$ with m and s.
b) Compute the $\chi^2$ statistic for these data.
c) Within a 95% confidence level, could you reject the null hypothesis that the data are sampled
from a Gaussian distribution with $\mu = 2.995$ and $\sigma = 1.049$?

Chapter 7

Propagation of Errors and Noise


In some cases, the data values are sampled directly in the form that is needed. An example is
the length between two markers. The length is measured directly. The most common case arises when
the measurements are put into a formula to produce another quantity. In the case of surveying, the distance
between elements of an array of monuments across an earthquake fault might be used to compute the
surface strain. The amount of radioactive products in a rock may be measured and put into a formula to
compute its age. The volume and weight of a rock may be put into a formula to determine its density, or
the amplitude of a seismic wave may be put into a formula to determine the magnitude of an earthquake.
The distribution of the data errors and the kind of formula will affect the interpretation of the answer. This
chapter will show you how to determine the accuracy of the answer and identify some pitfalls in
interpreting results from noisy data.

Errors When Data Values are Added or Subtracted


A common situation occurs when data values are added. An example of this is the
measurement of the distance between two widely separated points. Suppose that the distance is
sufficiently great and the topography sufficiently rough that you must make a series of end to end length
measurements. Each length measurement is subject to a certain error, which we will assume to be
Gaussian distributed with a zero mean and standard deviation $\sigma_l$. Assume that N length measurements
are required. The total length plus an error is, assuming that all of the individual length measurements are
corrected exactly to horizontal distance:
$$L + \varepsilon = \sum_{i=1}^{N} (l_i + \varepsilon_i)$$
The total length, L, is the sum of all the individual lengths, $l_i$. The error in the ith length is given by $\varepsilon_i$.
This results in an overall length error of $\varepsilon$. The individual errors would be expected to both add and
cancel randomly, so it would be incorrect to simply add the errors. Since the total length will be a
random variable, we compute its "expected" value. Since the mean value of the individual errors is
zero, we have:
$$E[L + \varepsilon] = E\left[\sum_{i=1}^{N} (l_i + \varepsilon_i)\right] = \sum_{i=1}^{N} E[(l_i + \varepsilon_i)]$$
$$E[L] + 0 = \sum_{i=1}^{N} \left(E[l_i] + E[\varepsilon_i]\right) = \sum_{i=1}^{N} (l_i + 0) = L$$
So, the expectation value of the measured length is just L, which is equal to the sum of all of the
individual distances, which agrees with our intuition. This only tells us that for repeated experiments,
the errors average to zero.

But, for an individual experiment, we need the standard deviation of the error. We get this by
computing the expectation value of the variance using equation 7-9. We have:
$$\sigma_L^2 = E\left[(L_e - L)^2\right] = E\left[\left(\sum_i (l_i + \varepsilon_i) - \sum_i l_i\right)^2\right] = E\left[\left(\sum_i \varepsilon_i\right)^2\right]$$
L is the error free length and $L_e$ is the length measured as a result of a single experiment. Notice that the
term on the right is the square of a sum of terms. Multiplying out some of the terms, this will look like:

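$$\sigma_L^2 = E\left[\sum_i \varepsilon_i \sum_k \varepsilon_k\right] = E\left[\sum_{i=1}^{N} \sum_{k=1}^{N} (\varepsilon_i \varepsilon_k)\right]$$
There are terms that are sums over $\varepsilon_i \varepsilon_k$. If N=2 we can multiply the terms by hand, resulting in
$E[(\varepsilon_1 + \varepsilon_2)(\varepsilon_1 + \varepsilon_2)] = E[\varepsilon_1^2 + 2\varepsilon_1\varepsilon_2 + \varepsilon_2^2]$. The expectation of all $\varepsilon_1\varepsilon_2$ terms will be zero, since
$\varepsilon_1$ and $\varepsilon_2$ are independent and will average to zero. The $\varepsilon_1^2$ and $\varepsilon_2^2$ terms will not, since they are
squared and will never have negative numbers to cancel with the positive ones. So, we have:
$$\sigma_L^2 = E\left[\sum_i \varepsilon_i^2\right]$$
In the general case, when i = k cancellation will not occur and $E[\varepsilon_i\varepsilon_i] \neq 0$, but when $i \neq k$,
$E[\varepsilon_i\varepsilon_k] = 0$. The expectation is the variance of the population (the population of errors of each of
the individual length measurements), which we will call $\sigma_l^2$. The final answer is: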
$$\sigma_L^2 = E\left[\left(\sum_i (l_i + \varepsilon_i) - L\right)^2\right] = E\left[\sum_i \varepsilon_i^2\right] = N\sigma_l^2$$
So, the variance of the total length is given by the sum of the variances of each of the individual length
measurements. If the variances of each of the terms in the sum are different, the individual variances are
summed to get the final answer as shown in equation 7-1 below.
$$\sigma_L^2 = \sum_{i=1}^{N} \sigma_i^2 \qquad (7-1)$$

Interestingly, the above formula also applies to the case when measurements are subtracted. This is
because the minus sign is eliminated by the variance computation, which squares the error.
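
A small simulation can make this result concrete. The sketch below is an illustration under assumed
values (10 segments, an error standard deviation of 0.5 per segment); the function name and parameters
are hypothetical. It adds N independent Gaussian errors many times and compares the variance of the
total error with N times the single-measurement variance.

```python
import random

def simulated_total_variance(n_segments, sigma, trials=20000, seed=1):
    """Variance of the summed error of n_segments measurements, estimated by simulation."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(0.0, sigma) for _ in range(n_segments))
              for _ in range(trials)]
    mean = sum(totals) / trials
    return sum((t - mean) ** 2 for t in totals) / trials

N, SIGMA = 10, 0.5                     # assumed number of segments and error per segment
var_sim = simulated_total_variance(N, SIGMA)
var_theory = N * SIGMA ** 2            # sum of the individual variances

print(f"simulated variance = {var_sim:.3f}, N * sigma^2 = {var_theory:.3f}")
```

Simply adding the standard deviations would give 10 * 0.5 = 5.0 for the total error, far larger than the
standard deviation of about 1.58 implied by the summed variances.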

Problem 1: Prove equation 7-1 for the case when 3 lengths are added to get the total length. Let
each of the individual length measurements have random errors with standard deviation
of $\sigma_e$.

Problem 2: Suppose that measurement A has a Gaussian distributed error with variance $\sigma_a^2$ and
measurement B has a Gaussian distributed error with variance $\sigma_b^2$. Prove that the
variance of the difference, A - B, is given by $\sigma_a^2 + \sigma_b^2$.

Errors When Data are Multiplied or Divided


Data values with random errors are often multiplied or divided. Suppose the density is being
measured by computing the volume and mass of an object. Then, the density is given by:
$$\rho = \frac{M}{V}$$
If the mass and volume each have errors, how will these combine to produce the error for the density?
To approximate the effect of a small change in M and V, we use the chain rule of differentiation,
which says:
$$df(x, y) = \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy$$
The above formula gives us a relationship that can be used to compute a small change in the function,
f(x,y), caused by small changes in either x or y. It only applies exactly to infinitesimally small changes in
x and y. Here we don't need an exact result, so we can extend it to larger changes (we say the result is
accurate to first order). We indicate that the changes are finite by using the notation $\Delta x$ and $\Delta y$
instead of dx and dy. So, the chain rule takes the form:

$$\Delta f(x, y) = \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y + \text{small error}$$
This equation is the first order term of the Taylor's expansion for a function of two variables. The small
error will become important when bias is treated. For the density formula, the change in density due
to a small change in mass and volume is given by:
$$\Delta\rho(M, V) \approx \frac{\partial \rho}{\partial M} \Delta M + \frac{\partial \rho}{\partial V} \Delta V$$
and since:
$$\frac{\partial \rho}{\partial M} = \frac{1}{V}; \qquad \frac{\partial \rho}{\partial V} = -\frac{M}{V^2}$$
Then:
$$\Delta\rho = \frac{1}{V} \Delta M - \frac{M}{V^2} \Delta V$$
Expressing the above equation as the fraction of the total density (note that we are dropping the $\approx$
symbol, so must remember that the equations are only accurate to first order):
$$\frac{\Delta\rho}{\rho} = \frac{\Delta M}{M} - \frac{\Delta V}{V}$$

We can compute the variance of the fractional density changes as:
$$\frac{\sigma_\rho^2}{\rho^2} = E\left[\left(\frac{\Delta\rho}{\rho}\right)^2\right] = E\left[\left(\frac{\Delta M}{M} - \frac{\Delta V}{V}\right)^2\right] = \frac{\sigma_M^2}{M^2} + \frac{\sigma_V^2}{V^2}$$
Note that once the chain rule is used, the results follow those derived for sums and differences of
random variables. If we define c as the ratio of the standard deviation of the parameter to the value of
the parameter, according to the above equation, we have:
$$c_\rho^2 = c_M^2 + c_V^2$$
where
$$c_\rho^2 = \frac{\sigma_\rho^2}{\rho^2}; \qquad c_V^2 = \frac{\sigma_V^2}{V^2} \qquad \text{and} \qquad c_M^2 = \frac{\sigma_M^2}{M^2}$$
We can then write a general law of propagation of errors, which states that if:
$$f(x, y, z, \dots, p, q, r, \dots) = \frac{x \, y \, z \dots}{p \, q \, r \dots}$$
then the total error, expressed in terms of the fractional variation defined above, is
$$c_f^2 = c_x^2 + c_y^2 + c_z^2 + \dots + c_p^2 + c_q^2 + c_r^2 + \dots \qquad (7-2)$$
So, equation 7-1 expresses the total variance of the result when data are summed and equation 7-2
above expresses the total variance of the results when data are multiplied and divided.
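
The first-order rule for a quotient can be checked numerically. The sketch below uses assumed values
(a mass of 100 with a 1% error and a volume of 50 with a 1% error); the function name is illustrative.
It compares the simulated fractional standard deviation of the density with the value obtained by
adding the squared fractional errors.

```python
import math
import random

def fractional_sd_of_quotient(m, sd_m, v, sd_v, trials=50000, seed=2):
    """Simulated standard deviation of M/V, expressed as a fraction of the true M/V."""
    rng = random.Random(seed)
    rho = [rng.gauss(m, sd_m) / rng.gauss(v, sd_v) for _ in range(trials)]
    mean = sum(rho) / trials
    var = sum((r - mean) ** 2 for r in rho) / trials
    return math.sqrt(var) / (m / v)

M, SD_M = 100.0, 1.0                   # assumed mass and its standard deviation
V, SD_V = 50.0, 0.5                    # assumed volume and its standard deviation

c_sim = fractional_sd_of_quotient(M, SD_M, V, SD_V)
c_theory = math.sqrt((SD_M / M) ** 2 + (SD_V / V) ** 2)

print(f"simulated c = {c_sim:.5f}, first-order c = {c_theory:.5f}")
```

The agreement is close here because the fractional errors are small; with large fractional errors the
first-order approximation would be expected to degrade.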

Induced Bias
Mathematical operations on noisy data can affect the result in unexpected ways. A simple case
occurs when noisy data values are squared. The randomness which previously averaged to zero
because of cancellation of positive and negative will no longer average to zero because all of the
squared numbers have a positive sign. There will be a non-zero average, or bias added by this effect.
For example, suppose data follow the form of equation 7-7, where $Y = y + aY_{noise}$. $Y_{noise}$ is a
Gaussian distributed random quantity with mean $\mu_{noise} = 0$ and standard deviation $\sigma_{noise}$. Suppose Y is
squared. We have:
$$Y^2 = (y + aY_{noise})^2 = y^2 + a^2 Y_{noise}^2 + 2yaY_{noise}$$
Now taking the expectation of each side of the above equation and using equation 7-9, we have:
$$E[Y^2] = E[y^2 + a^2 Y_{noise}^2 + 2yaY_{noise}] = E[y^2] + E[a^2 Y_{noise}^2] + E[2yaY_{noise}] = y^2 + a^2(\sigma_{noise}^2 + \mu_{noise}^2) + 0 \qquad (7-3)$$
So, when Y is squared, its average value (which is the expectation) is biased by the variance of the
noise. If the mean of the noise is zero, as it has been defined here, then $\mu_{noise} = 0$. So, if Gaussian
distributed data will be used in a formula which squares the values, it is much better to find the average
of the values in the sample prior to squaring, as opposed to squaring each sample value and then
taking the average.
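
The difference between the two orders of operation can be seen in a short simulation. The values
y = 5, a = 2 and a noise standard deviation of 1 are assumptions made for illustration, as is the
function name. With these values the expected bias from squaring first is 4.

```python
import random

def square_then_average(y, a, sigma_noise, trials=50000, seed=3):
    """Return (mean of the squared values, square of the mean value) for noisy data."""
    rng = random.Random(seed)
    samples = [y + a * rng.gauss(0.0, sigma_noise) for _ in range(trials)]
    mean_of_squares = sum(v * v for v in samples) / trials
    square_of_mean = (sum(samples) / trials) ** 2
    return mean_of_squares, square_of_mean

Y_TRUE, A, SIGMA_NOISE = 5.0, 2.0, 1.0     # assumed signal, noise scale, noise std dev
mean_sq, sq_mean = square_then_average(Y_TRUE, A, SIGMA_NOISE)
bias_theory = A ** 2 * SIGMA_NOISE ** 2    # a^2 * sigma_noise^2 when the noise mean is zero

print(f"average of squares = {mean_sq:.2f} (expected {Y_TRUE**2 + bias_theory:.2f})")
print(f"square of average  = {sq_mean:.2f} (expected {Y_TRUE**2:.2f})")
```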

Figure 7.1. Plot of the result of squaring noisy data (equation 7-1). The dotted line shows how the expected value of
$Y^2$ without noise is increased by the bias, which is $a^2\sigma^2$, where $\sigma$ is the standard deviation of the noise. In this
case, $a^2\sigma^2 = 4$. This would lead the experimenter to estimate too high a value for the quantity represented by Y.

It is very common to put noisy data values into a formula, so it is important to understand the
effect that the formula will have on the answer. Will the noise bias the answer? Is the distribution of the
answer Gaussian if the data are Gaussian? It is important to answer these questions if we are to apply
statistical tests based on the assumption that errors are Gaussian distributed. Are the statistical tests
applied to the data first, or should they be applied to the answer? This section will give guidance on this
question and follow with an example in age dating.
Assume that the data, x, will be entered into a general formula, given by:
$$Y = f(x) \qquad (7-4)$$
Y is the value computed from the data. Generally, x will also have a variation due to noise. The
experimenter would hope that this variation would be small relative to the data value (high signal to noise
ratio). This variation can be expressed as:
$$Y = f(x + \varepsilon) \qquad (7-5)$$
We are interested in the case where $\varepsilon/x$ is small (relatively high signal to noise), so we use a Taylor's
expansion of f(x), which is given by:
$$f(x + \varepsilon) = f(x) + \varepsilon \frac{\partial f(x)}{\partial x} + \frac{\varepsilon^2}{2!} \frac{\partial^2 f(x)}{\partial x^2} + \dots + \frac{\varepsilon^{n-1}}{(n-1)!} \frac{\partial^{n-1} f(x)}{\partial x^{n-1}} + \text{error} \qquad (7-6)$$
The Taylor's series expansion for several functional forms is given below. The expansion is carried only
to second order. This is good enough to show the effect of bias. If the bias in a result is large, one
should also look at the higher order terms or take a different approach to the noise analysis.
If the function has an exponential dependence,
Y = f (x ) = Ae nx +b

f (x )
= Ane nx
x
f 2 ( x )
= An 2 e nx
x 2
So
Y = f (x ) + ( Ane

nx

) + 2 (An 2 e )+. .. ... ... .


nx

(7-7)

Here, x is the value of the data and is the random variation or noise in the data. The ( Ane nx ) term
is the first order randomness in Y (the result of the calculation) which is caused by randomness in x (the
data). The last term is also random and causes the bias in Y, since it will not average to zero. To get
the expected bias in Y, we take the expectation value of Y:

] [

E [Y ] = E [ f ( x )] + E Ane nx + E 2 An 2 e nx

= f ( x ) + Ane nx E [ ] + An 2 e nx E [2 ]

Version December 20, 2001


University of California, 2001

Propagation of Errors

7-7

The following derivations all assume that the random variable that is being input to the equation is
gaussian distributed.
2
Now, E[]=0, since the average of the noise is taken to be zero, and E [2 ]= noise
. Remember that
we are evaluating the noise effect at a particular value of x, so f(x) is unvarying in the above derivation,
so E[f(x)]=f(x). So, the result is:
2
2
E [Y ] = f (x ) + An e noise

nx

(7-8)

The second term is the bias effect, which gets larger as the square of n. The ratio of the bias to the
actual value is given by:
R=

bias
f ( x)

2
An 2 e nx noise

2
= n 2 noise

Ae nx

(7-9)

So, the bias (relative to the signal) gets larger as $n$ and $\sigma_{noise}$ increase.
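The relative bias of the exponential form can be checked numerically. The following is a minimal Python sketch, not the book's button script: the function name, the parameter values (A, n, x, sigma), the trial count, and the seed are all illustrative assumptions, and the additive constant b is taken as zero. It adds gaussian noise to x many times, averages the result, and compares the simulated relative bias with the second order Taylor term, $\tfrac{1}{2} n^2 \sigma_{noise}^2$, which includes the $1/2!$ factor from equation 7-6.

```python
import math
import random

def exponential_bias_check(A=2.0, n=1.5, x=1.0, sigma=0.1,
                           trials=200_000, seed=1):
    """Compare simulated and second order predicted relative bias
    for f(x) = A*exp(n*x) with gaussian noise added to x."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    exact = A * math.exp(n * x)        # noise-free value f(x)
    total = 0.0
    for _ in range(trials):
        eps = rng.gauss(0.0, sigma)    # noise with mean 0 and std sigma
        total += A * math.exp(n * (x + eps))
    simulated_ratio = (total / trials - exact) / exact
    predicted_ratio = 0.5 * n ** 2 * sigma ** 2   # second order Taylor term
    return simulated_ratio, predicted_ratio

simulated, predicted = exponential_bias_check()
print("simulated R:", simulated, " predicted R:", predicted)
```

The two numbers should agree to within the statistical scatter of the simulation; any remaining difference reflects the higher order terms that the second order expansion drops.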
Practice: Using the above techniques, prove that the expansion to second order and the bias ratio, R, are
correct for the following useful functional forms:
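1. Linear:

$$\begin{aligned}
f(x) &= mx + b \\
Y = f(x+\varepsilon) &= mx + b + m\varepsilon \\
E[Y] &= mx + b + 0 \\
R &= 0
\end{aligned} \tag{7-10}$$

2. Variable in denominator:

$$\begin{aligned}
f(x) &= \frac{A}{x} \\
Y = f(x+\varepsilon) &= \frac{A}{x} - \varepsilon\,\frac{A}{x^2} + \frac{\varepsilon^2}{2}\,\frac{2A}{x^3} + \cdots \\
E[Y] &\approx \frac{A}{x} + \frac{\sigma_{noise}^2}{2}\,\frac{2A}{x^3} = \frac{A}{x} + \sigma_{noise}^2\,\frac{A}{x^3} \\
R &\approx \frac{\sigma_{noise}^2}{x^2}
\end{aligned} \tag{7-11}$$

3. Power law form (we assume n > 1):

$$\begin{aligned}
f(x) &= A x^n \\
Y = f(x+\varepsilon) &= A x^n + \varepsilon\,A n x^{n-1} + \frac{\varepsilon^2}{2}\,A n (n-1)\, x^{n-2} + \cdots \\
E[Y] &\approx A x^n + \frac{\sigma_{noise}^2}{2}\,A n (n-1)\, x^{n-2} \\
R &\approx \frac{\sigma_{noise}^2}{2 x^2}\, n (n-1)
\end{aligned} \tag{7-12}$$

4. Logarithmic:

$$\begin{aligned}
f(x) &= A \ln(x) \\
Y = f(x+\varepsilon) &= A \ln(x) + \varepsilon\,\frac{A}{x} - \frac{\varepsilon^2 A}{2 x^2} + \cdots \\
E[Y] &\approx A \ln(x) + 0 - \frac{\sigma_{noise}^2 A}{2 x^2} \\
R &\approx -\frac{\sigma_{noise}^2}{2 x^2 \ln(x)}
\end{aligned} \tag{7-13}$$

5. Exponential:

$$\begin{aligned}
f(x) &= A e^{bx} + c \\
Y = f(x+\varepsilon) &= A e^{b(x+\varepsilon)} + c = A e^{bx}\left(1 + b\varepsilon + \frac{(b\varepsilon)^2}{2!} + \cdots\right) + c \\
E[Y] &\approx A e^{bx}\left(1 + \frac{b^2 \sigma_{noise}^2}{2}\right) + c \\
R &\approx \frac{b^2 \sigma_{noise}^2}{2} \quad (\text{for } c = 0)
\end{aligned} \tag{7-14}$$

Problem 3:

Write and implement a button script that shows that equation 7-13 is true by
repeatedly adding random values to x and computing the running average of the value of
f(x), as was done in chapter 5 for coin tossing. Show that the value computed from the
equation for R agrees with the value found from the simulation.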

Case Study - Errors in Age Dating Using U-Pb Analyses



Age dating based on radioactive decay relies on the fact that radioactive elements decay at a
known rate over time. In general, we can represent the concentration of the radioactive (parent)
element by:
$$A = A_0 e^{-\lambda t} \tag{7-14}$$

where $A_0$ is the original concentration of the parent element at time t=0 and $\lambda$ is the decay constant.
The time at which A is equal to half of the original concentration is called the half-life and is found from:
$$\frac{A_0}{2} = A_0 e^{-\lambda T_{1/2}}$$
or
$$T_{1/2} = \frac{\log_e 2}{\lambda} = \frac{0.693}{\lambda}$$

If the parent element decays to the daughter element, after a time t the concentration of daughter
atoms will be:
$$D = A_0 - A = A_0 - A_0 e^{-\lambda t} = A_0 \left(1 - e^{-\lambda t}\right) \tag{7-15}$$

If we take the ratio D/N, where N = A is the concentration of parent atoms remaining, and solve for t, the result is:
$$t = \frac{1}{\lambda} \log_e\left(1 + \frac{D}{N}\right)$$

So, if it is known that the daughter atoms were the result only of radioactive decay of the parent
atom, the age can be computed. But, it is often the case that there is an initial concentration of the
daughter element. When more than one age dating method is used, the results (if they agree) are said
to be concordant.
For this case study, we treat the 207Pb/206Pb isotope system. 238U decays to 206Pb and 235U
decays to 207Pb. The decay equations (from equation 7-15) are:

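$$[^{206}\mathrm{Pb}]_{now} = [^{238}\mathrm{U}]_{now} \left(e^{\lambda_{238} t} - 1\right)$$
and
$$[^{207}\mathrm{Pb}]_{now} = [^{235}\mathrm{U}]_{now} \left(e^{\lambda_{235} t} - 1\right)$$

Dividing the two equations, we obtain:
$$\frac{[^{207}\mathrm{Pb}]_{now}}{[^{206}\mathrm{Pb}]_{now}} = \frac{[^{235}\mathrm{U}]_{now}}{[^{238}\mathrm{U}]_{now}}\;\frac{e^{\lambda_{235} t} - 1}{e^{\lambda_{238} t} - 1} = \frac{e^{\lambda_{235} t} - 1}{137.88 \left(e^{\lambda_{238} t} - 1\right)} \tag{7-16}$$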

[207Pb]/[206Pb] is the measured present day lead isotope ratio, and the present day [235U]/[238U]
ratio is 1/137.88, which is assumed to be a constant that does not depend on the age and history of the
sample. So, it is possible to compute an age from a single analysis. The best mineral for use of this
system is zircon, because it retains uranium and its decay products, crystallizes with almost no lead, and
is widely distributed.
Equation 7-16 cannot be solved explicitly for age (t). The Simulations stack included with
this book provides a button whose script solves this equation numerically. An important question to be
asked is: how sensitive is the age determination to errors in the various constants that are in the
equation? Currently, the best available measurement accuracy in the 207Pb/206Pb ratio causes about 1/5 of
the uncertainty in age that is caused by uncertainties in the decay constants. The decay constants of
the uranium to lead systems are $9.85 \times 10^{-10}\ \mathrm{yr}^{-1}$ (±0.10%) for 235U and $1.55 \times 10^{-10}\ \mathrm{yr}^{-1}$ (±0.08%) for
238U and have been defined by international convention. The 235U/238U ratio is also uncertain to
about 0.15%.
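A numerical solution of equation 7-16 can be sketched in Python. This is not the script from the Simulations stack, only an illustration under stated assumptions: the function names, the bisection method, and the search bracket of 1 to 10^10 years are choices made here, while the decay constants and the 137.88 ratio are the values quoted in the text. Bisection works because the right hand side of 7-16 increases steadily with t when the 235U decay constant is the larger of the two.

```python
import math

LAMBDA_235 = 9.85e-10   # decay constant of 235U in 1/yr (value from the text)
LAMBDA_238 = 1.55e-10   # decay constant of 238U in 1/yr (value from the text)
U238_OVER_U235 = 137.88 # present day 238U/235U ratio (value from the text)

def pb_ratio(t):
    """Right hand side of equation 7-16: the 207Pb/206Pb ratio at age t (years)."""
    # math.expm1(u) computes exp(u) - 1 accurately for small u
    return math.expm1(LAMBDA_235 * t) / (U238_OVER_U235 * math.expm1(LAMBDA_238 * t))

def age_from_ratio(ratio, lo=1.0, hi=1.0e10, iterations=200):
    """Solve pb_ratio(t) = ratio for t by bisection (bracket is an assumption)."""
    if not pb_ratio(lo) < ratio < pb_ratio(hi):
        raise ValueError("ratio is outside the range covered by the bracket")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if pb_ratio(mid) < ratio:
            lo = mid        # true age is older than mid
        else:
            hi = mid        # true age is younger than mid
    return 0.5 * (lo + hi)

# Sensitivity check: perturb the ratio for a 100 Ma sample by 1 percent.
true_age = 100.0e6
perturbed_age = age_from_ratio(1.01 * pb_ratio(true_age))
print("relative age error:", (perturbed_age - true_age) / true_age)
```

The printed relative age error can be compared with the 23.7% figure quoted in Problem 4 for a 1% ratio error at 100 Ma. The same two functions can be reused to perturb either decay constant or the uranium ratio instead of the lead ratio.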
The measurement of the 207Pb/206Pb ratio requires complex instrumentation and precise
analytical techniques. This ratio can be measured to accuracies as great as 0.1% to 0.03%. Another
important source of error is the correction for common lead, which is lead that is present in the sample
from sources other than decay of the parent isotopes of uranium. The source of common lead is
original lead in the sample as it crystallized, lead introduced by exchange with external sources, and lead
added during handling prior to analysis. A complete analysis of common lead errors is beyond the
scope of this text.

Problem 4:

Determine the error of the age determination of a zircon when the 207Pb/206Pb ratio
changes by 0.1% for ages of 100 Ma, 200 Ma, 300 Ma, and 500 Ma (1 Ma = $1 \times 10^6$
years; at 100 Ma and a ratio error of 1%, the age error is 23.7%). Draw a graph of the
age error vs age of the sample.

Problem 5:

Make graphs of the error in the age determination vs age of the sample caused by the
errors in each of the two decay constants.

Problem 6:

Determine the error of the age determination due to the error in the 235U/238U ratio.


Problem 7:

Study the problem of bias in the result due to random errors in the
measured 207Pb/206Pb ratio. Because the age equation cannot be solved analytically,
this simulation will need to be implemented using the computer. Repeatedly compute
the age, each time with a random error {e xRandom("g", -1)} added to the 235U/238U
ratio. Keep a running average of the age determination and put the current average
value into a card field, as was done in Chapter 5. Bias may show up as a higher age, on
the average, than the actual age. See if you can think of any way to determine bias
without doing the repeated sampling simulation.

The Distribution of the Errors

If $y + e = f(x + \delta x)$, and $\delta x$ is a random error in $x$, then $e$ will be a random error in $y$. In general, if
$\delta x$ is gaussian distributed, $e$ will not be gaussian distributed. This will affect the validity of the statistical
test that is applied to determine our confidence limits of y. Equation 7-1 shows that when data values
are added, the expectation value of the variance of the answer is just the sum of the expectation values of
the individual variances of the data points. So, if N data values, each with standard deviation $\sigma_{individual}$, are
added together, the standard deviation of the sum is given by $\sigma_{sum}^2 = N \sigma_{individual}^2$. The mean of the
answer is just the mean of the sum of the individual values. Since the answer is the result of linear,
additive operations, the distribution of the errors remains gaussian, with $\mu = 0$ and standard deviation as given
above. This result holds, to first order, when data are divided or multiplied. Equation 7-2 is the
standard deviation of the answer, in this case. The answer remains gaussian distributed. This is
because we only took the first term in the expansion for M/V. When the errors are so large that the
second order term is required, bias results and the distribution is no longer gaussian.

Example: Determination of Density

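Let's look at the expansion of the equation $\rho = M/V$ to higher orders. The density, $\rho$, has a small
change, $\delta\rho$, caused by small changes in M and V. We can write this below as:
$$\rho + \delta\rho = \frac{M + \delta M}{V + \delta V}$$
Rewrite this as:
$$\rho + \delta\rho = \frac{M + \delta M}{V}\;\frac{1}{1 + \dfrac{\delta V}{V}}$$

We can make the following simple expansion:
$$\frac{1}{1 + \dfrac{\delta V}{V}} = 1 - \frac{\delta V}{V} + \left(\frac{\delta V}{V}\right)^2 - \left(\frac{\delta V}{V}\right)^3 + \cdots$$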


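So, after simplifying by subtracting out the $\rho$ on the left and the M/V on the right, we can
rearrange the density equation as:
$$\delta\rho = \frac{M + \delta M}{V}\left[1 - \frac{\delta V}{V} + \left(\frac{\delta V}{V}\right)^2 - \left(\frac{\delta V}{V}\right)^3 + \cdots\right] - \frac{M}{V}$$

Multiply out so we can better see the small terms multiplied together.
$$\delta\rho = \frac{1}{V}\left[\delta M - M\frac{\delta V}{V} - \delta M\frac{\delta V}{V} + M\left(\frac{\delta V}{V}\right)^2 + \delta M\left(\frac{\delta V}{V}\right)^2 - M\left(\frac{\delta V}{V}\right)^3 + \cdots\right]$$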

The first and second terms have only one variable with a $\delta$. This is why they are called "first order". The
third and fourth terms have two such variables multiplied, and are called "second order". Terms 5 and 6
have three error factors multiplied, and are called "third order", etc. First order terms are linear in the gaussian
distributed random errors for mass and volume ($\delta M$ and $\delta V$), so their distributions will be gaussian.
However, second order terms are squared. The $\delta V^2$ term is $\chi^2$ distributed, since the $\chi^2$ distribution is
the one that describes the distribution for squared gaussian variables (Ch 9).
But, what is the distribution for the $\delta M\,\delta V$ term? We know that $\delta V^2$ will always be positive.
However, in this case it is possible that $\delta M$ will be positive while $\delta V$ is negative. So, right off, we know
that it will not be the same distribution as the one for $\delta V^2$. If $\delta M$ and $\delta V$ are completely independent
of each other, the product will average to zero and the contribution of this term to the standard deviation
of $\delta\rho$ will be the product of the standard deviations of the volume and mass errors. The concept of
"independence" will be discussed further in a later chapter. It is sufficient to say, for now, that when two
random variables are independent of one another, the standard deviation of their product is just the
product of the standard deviations of each of the individual random variables. The third order terms
begin to get even more complicated. Term 5 is ok, because it has $\delta V^2$ and $\delta M$. The $\delta V^2$ portion will
be $\chi^2$ distributed, as before, and $\delta M$ will be gaussian distributed, so we will have the product of a $\chi^2$
distributed variable and a gaussian distributed variable. The term with $\delta V^3$ will have another unique
distribution. This is best modeled on the computer using a simulation. It is rarely necessary to go
beyond second order.
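It is the second order terms in the error expansion that produce "bias" in the result. This bias
cannot be eliminated by increasing the sample size. It exists even for an infinite number of data. This
can be easily seen by taking the expectation of $\delta\rho$, as follows:
$$E[\delta\rho] = E\left[\frac{1}{V}\left(\delta M - M\frac{\delta V}{V} - \delta M\frac{\delta V}{V} + M\left(\frac{\delta V}{V}\right)^2 + \cdots\right)\right]$$

Simplifying,
$$E[\delta\rho] = \frac{1}{V}E[\delta M] - \frac{M}{V^2}E[\delta V] - \frac{1}{V^2}E[\delta M\,\delta V] + \frac{M}{V^3}E\left[(\delta V)^2\right] + \cdots$$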


Since the average of $\delta M$, $\delta V$ and $\delta M\,\delta V$ will be zero, over many repetitions of the experiment, the
only term that will remain is the $\delta V^2$ term, which is the cause of the bias. So, the second order bias in
$\rho$ is given by:
$$E[\delta\rho] = \frac{M}{V^3}E\left[(\delta V)^2\right] = \frac{M}{V^3}\,\sigma_v^2$$

where $\sigma_v$ is the standard deviation of the volume measurement. It is interesting that the bias is
controlled by errors in the volume alone. The mass is in the numerator of the equation, so its effect is
linear and will average to zero at all orders.
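The second order bias in the density can be checked with a short simulation. The Python sketch below is only an illustration: the function name, the error sizes (a 5% volume error and a 2% mass error, much smaller than those of Problem 8 so that the second order expansion is a good approximation), the trial count, and the seed are all assumptions made here. It draws independent gaussian errors for M and V, averages the computed densities, and compares the resulting bias with $M \sigma_v^2 / V^3$.

```python
import random

def density_bias(M=2.0, V=10.0, sigma_m=0.04, sigma_v=0.5,
                 trials=400_000, seed=7):
    """Return (simulated bias, predicted second order bias) for rho = M/V."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    rho = M / V                            # error-free density
    total = 0.0
    for _ in range(trials):
        dM = rng.gauss(0.0, sigma_m)       # independent gaussian mass error
        dV = rng.gauss(0.0, sigma_v)       # independent gaussian volume error
        total += (M + dM) / (V + dV)
    simulated = total / trials - rho       # average density minus true density
    predicted = M * sigma_v ** 2 / V ** 3  # second order bias from the text
    return simulated, predicted

simulated, predicted = density_bias()
print("simulated bias:", simulated, " predicted bias:", predicted)
```

Changing sigma_m in this sketch should leave the bias essentially unchanged, which is the point made above: only the volume error controls the bias.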

Problem 8.

Suppose $\sigma_V/V = 0.5$ and $\sigma_M/M = 0.5$. Find the distribution(s) of the relative error in
density $\delta\rho/\rho$ up to second order. Find the expected mean value of $\rho$ and its standard
deviation if V = 10 m^3 and M = 2 kg.

Problem 9:

Create a button that simulates problem 8. Show that the values for standard deviation
of each of the "orders" of the error expansion that result from your simulation agree with
the values you expect from problem 8.

The General Case


In practice, the equations describing the relationship between the data and the answer may be
quite complex. Certain problems, such as determining velocity structure from earthquake arrival time
measurements, are nonlinear and require extensive computation to determine the correct uncertainties in
the computed velocity structure. Other kinds of problems, such as trying to predict the weather, have a
result that is so strongly affected by small perturbations and errors that a meaningful error analysis is
impossible. If you are fortunate, you will encounter simpler equations of the form 7-10 to 7-14. If you
can perform a Taylor series expansion on the equation, you can do an error analysis. You can also approach the
problem as a simulation that is done on the computer. This is best for those who are uncertain of their
math skills and provides a meaningful check when strong nonlinearities or complex equations are
involved.

When Should Averaging Be Done?


When measuring the density of an object by finding its mass and volume, it is possible to
approach the data analysis in two ways. Suppose there are N measurements of mass and volume. One
could first find the value for the volume by computing the average of all of the volume measurements.
The standard deviation of the volume errors would then be $\sigma_V = \sigma_v/\sqrt{N}$. Applying the same
procedure to the mass measurement, $\sigma_M = \sigma_m/\sqrt{N}$. The $\sigma_v$ and $\sigma_m$ are the standard deviations of the
population distribution of the volume and mass measurements. After the averaging, the standard
deviation of the errors is reduced by a factor of $\sqrt{N}$. When this is put in the $\rho = M/V$ formula, the bias caused by the
second order error terms is reduced by a factor of 1/N. The first order error terms are reduced by a factor of $1/\sqrt{N}$, as
expected. The other option for performing the analysis is to compute the value of $\rho$ for each of the M
and V values. Then, after all of the values of $\rho$ are computed, take the average of the $\rho$'s. This has the
extreme disadvantage of leaving the second order error terms, which cause bias in the final
result, at their full single-measurement size, N times larger than when the averaging is done first.
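The two orders of averaging can be compared side by side in a simulation. The following Python sketch is an illustration under assumed values, not the button script that Problem 10 asks for: the function name, the 10% volume error, the 2% mass error, N = 10, the number of repeated experiments, and the seed are all choices made here. Each simulated experiment produces N mass and volume measurements, and both analysis options are applied to the same measurements.

```python
import random

def compare_averaging(M=2.0, V=10.0, sigma_m=0.04, sigma_v=1.0,
                      n=10, experiments=40_000, seed=11):
    """Return the bias in density for (average first, divide first)."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    rho = M / V                            # error-free density
    sum_avg_first = 0.0
    sum_ratio_first = 0.0
    for _ in range(experiments):
        masses = [M + rng.gauss(0.0, sigma_m) for _ in range(n)]
        volumes = [V + rng.gauss(0.0, sigma_v) for _ in range(n)]
        # Option 1: average the masses and volumes, then divide once.
        sum_avg_first += (sum(masses) / n) / (sum(volumes) / n)
        # Option 2: compute a density for each pair, then average the densities.
        sum_ratio_first += sum(m / v for m, v in zip(masses, volumes)) / n
    bias_avg_first = sum_avg_first / experiments - rho
    bias_ratio_first = sum_ratio_first / experiments - rho
    return bias_avg_first, bias_ratio_first

avg_first, ratio_first = compare_averaging()
print("bias, averaging first:", avg_first)
print("bias, dividing first: ", ratio_first)
```

With these assumed values the second order prediction is a bias of about $M\sigma_v^2/(N V^3)$ when averaging first and about $M\sigma_v^2/V^3$ when dividing first, so the second number should come out roughly N times the first.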
Problem 10:

Write a button script to simulate the effect described in the above paragraph and show
quantitatively that the results of the simulation agree with the results of the above
analysis. Generate N random values of the mass and volume using the xRandom
function. Use the parameters of problem 8, where $\sigma_V/V = 0.5$ and $\sigma_M/M = 0.5$. Then
compare the results (std deviation and bias of the answer) when the mass and volume
values are averaged first to the results when the densities are computed first and
averaged to get a final density.

Noise in Data Revisited


Chapter 6 treated the case where the noise level is constant. Here we expand the treatment to
include the case when the standard deviation of the noise depends on the signal level. We also show
how the choice of plot scale affects the appearance of the plotted data. When making logarithmic plots,
it is important to be aware that the plot scale is expanded at low values and compressed at high values.
For example, if variability of 0.5 exists in all data values, this will show up as large variations in the 0.1
range, but very small at the $10^3$ range. Logarithmic axes most accurately reflect data variation when that
variation is proportional to the value of the data. The data variation (noise) is proportional to the signal
in earthquake magnitude determinations. This is true because most of the variation in seismic signal
levels is due to scattering of the seismic wave caused by heterogeneities in the earth, which is
proportional to the signal amplitude. On the other hand, when most of the noise in data is from the
measuring instrument such as a voltmeter or mass spectrometer, the variation in the data will be more or
less constant. A log plot of this kind of data would show large variations in plotted data at small values
and small variations at large values.
We saw in Chapter 6 that noise in data is also called random error. A measurement may be
expressed as the sum of the noise free, or exact value and an added noise component, in the following
way:
$$Y = y + a\,Y_{noise}$$

where $y$ is the exact value and $Y_{noise}$ is the random error. Here we will consider $Y_{noise}$ to have an average
of zero and a standard deviation of 1. The constant, a, is the standard deviation of the added noise. If
the average value of the noise was not zero, we would say it was biased. As discussed in Chapter 6,
two important cases are 1) when the amplitude of the noise is a constant and 2) when the amplitude of
the noise is proportional to the signal, y. Below are the two forms:
Noise constant:
$$Y = y + a\,Y_{noise} \tag{7-7}$$

Noise proportional to y:
$$Y = y + a\,Y_{noise}\,y = y\,(1 + a\,Y_{noise}) \tag{7-8}$$

The distinction between these two cases is shown in the two log-log plots of figure 7.5. Data are
generated according to each of equations 7-7 and 7-8.

Figure 7.5 The left plot is a log-log plot of $y = x^2 + Y_{noise}$, where noise is constant. The right plot is a log-log plot of
$y = x^2 + 0.2\,x^2\,Y_{noise}$, where noise is proportional to the noise free signal, y.

The left hand plot in figure 7.5 shows a log-log plot with signal (y) plus constant noise. The right hand
plot shows signal (y) plus noise proportional to the value of y. The important feature here is that the
randomness in the left plot decreases at larger x and y, and in the right plot the randomness remains
relatively constant. This has important consequences when fitting straight lines to log-log plotted data.
Obviously, in the first case, one would not want to fit a straight line to the lower values of the data where
the noise is high. In the second case, the noise is relatively uniform over the range and a fit will be forced
to take into account the full range of the data.
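The behavior of the two plots can be put into numbers by measuring the scatter of log10(Y) at a small and a large value of x. The Python sketch below is an illustration only: the function name, the sample count, the seed, and the two x values are assumptions made here, while the signal y = x^2 and the noise amplitudes (1 for constant noise, 0.2 for proportional noise) follow the caption of figure 7.5. Values that come out zero or negative are skipped, since a logarithmic axis cannot show them.

```python
import math
import random
import statistics

def log_scatter(x, a, proportional, samples=20_000, seed=3):
    """Standard deviation of log10(Y) for noisy values of y = x**2."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    y = x ** 2                            # noise-free signal
    logs = []
    for _ in range(samples):
        noise = rng.gauss(0.0, 1.0)       # Y_noise: mean 0, standard deviation 1
        if proportional:
            value = y * (1.0 + a * noise) # noise proportional to y
        else:
            value = y + a * noise         # noise constant
        if value > 0.0:                   # a log axis cannot show values <= 0
            logs.append(math.log10(value))
    return statistics.pstdev(logs)

for x in (3.0, 100.0):
    print("x =", x,
          " constant noise:", log_scatter(x, 1.0, False),
          " proportional noise:", log_scatter(x, 0.2, True))
```

The scatter for constant noise should shrink sharply as x grows, while the scatter for proportional noise should be the same at both x values, which is the pattern described for the left and right plots.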

Non Gaussian Data Distributions


Note that the previous discussions all assumed that the data were distributed according to a
gaussian distribution. This is often not the case and will affect which analysis procedures produce the
optimum result. For example, it is not uncommon for data to follow a log normal distribution. A log
normal distribution is suspected for data that cannot go negative or whose distribution is skewed to
higher values (Ch 5). If the data are log-normal, but the result involves an equation that takes the log
(which transforms the distribution back to gaussian), it is better to compute individual values of the
result, then average. If you average before taking the log, you introduce bias.
*If the data follow a log normal distribution, and the result we want requires we apply the equation
y=log(data), can you figure out why bias is introduced if the data are averaged first?
*It is always necessary to be aware of the distribution that the data are following. What tests can
you apply to determine if data follow a particular distribution?
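The effect of the order of operations on log-normal data can be seen in a small numerical experiment. The Python sketch below is an illustration under assumptions made here: the function name, the use of the natural log, the parameters mu and s of the underlying gaussian, the sample count, and the seed. It generates log-normal data, then compares the average of the logs with the log of the average.

```python
import math
import random

def log_average_comparison(mu=1.0, s=0.5, samples=200_000, seed=5):
    """Return (mean of the logs, log of the mean) for log-normal data."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    # Log-normal data: the exponential of a gaussian with mean mu and std s.
    data = [math.exp(rng.gauss(mu, s)) for _ in range(samples)]
    mean_of_logs = sum(math.log(d) for d in data) / samples
    log_of_mean = math.log(sum(data) / samples)
    return mean_of_logs, log_of_mean

mean_of_logs, log_of_mean = log_average_comparison()
print("log taken first, then averaged:", mean_of_logs)
print("averaged first, then log taken:", log_of_mean)
```

Taking the log first recovers mu, while averaging first gives a larger value. For a log-normal distribution the offset is s^2/2, a standard property of that distribution, so the gap grows with the spread of the data.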


Review:
After reading this chapter and working the problems, you should:

Understand the relationship between the mean and variance of the data and the mean and variance of
the answer, after putting data into an equation.

Understand what bias is and how to compute it analytically for simple functional forms and how to
model it on the computer using simulations.

Be able to determine the distribution of errors in the answer that is a function of the equation used to
get the answer and the distribution of the data.
