The mean, median and mode are three kinds of average. If you have a data set (say
the angles of impact calculated from 5 blood stains that were all dropped at the same
angle) you can find an average or typical value to represent those 5 results. The
average can then be used in subsequent calculations. The phrase measure of central
tendency is used as an alternative to the word average; more about this later.
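As a minimal sketch of how the three averages differ, the Python below uses the standard library's `statistics` module. The five angles are made-up values for illustration only; they are not measurements from this course.

```python
import statistics

# Hypothetical impact angles in degrees (invented for illustration only).
angles = [38, 40, 40, 41, 43]

mean_angle = statistics.mean(angles)      # total divided by the number of items
median_angle = statistics.median(angles)  # middle value of the sorted list
modes = statistics.multimode(angles)      # most frequent value(s), as a list

print("mean:", mean_angle)
print("median:", median_angle)
print("mode(s):", modes)
```

`multimode` returns a list because a data set can have more than one mode, or every value can occur equally often.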
Suppose you have the following set of data (perhaps lengths of 6 bloodstains in
millimetres)
Your turn
Work out the three averages for each of the following data sets…
• The annual gross salaries paid by Company A are as follows
£12 500 £13 500 £13 500 £17 250 £87 500
£29 850 £23 900 £28 950 £30 750 £30 750
If you were trying to make a definite statement about (say) the average height of adult
males in England and Wales, you would have two main alternatives
• Conduct a census: measure the height of every adult male in England and Wales
on a certain day
• Take a sample: pick at random a sample of (say) 1 000 adult males from England
and Wales and measure their heights
The first alternative would be expensive, a logistical nightmare, and might not
provide much more information than the second alternative.
The key word here is random. One way of picking a sample of adult males from
England and Wales would be to
It is very important to avoid bias in the sampling process. People asked to pick
numbers at random between 1 and 10 will tend to pick numbers near the middle of the
range – unless they are deliberately compensating in a statistics lesson!
When you calculate the mean height of the sample of 1 000 adult males from
England and Wales, the result is called an estimate of the mean height of the
population. The estimate is probably very close to the actual mean height of the
population, but there may be a small difference called sampling error.
One way to quantify the sampling error is to pick another sample of 1 000 adult males
and measure their heights… You can treat the means of samples as data items. More
on this later…
The aliens have landed in the Bull Ring and they kidnap the next 5 people that walk
past the statue of the bull. List as many different ways in which this sample of Homo
sapiens could be biased as you can think up in two minutes…
Classification of data
What is the ‘average’ eye colour for the following data set?
There is not any kind of ‘typical’ eye colour for this group. If there had been two
people with (say) brown eyes, you could assign a modal eye colour to the group.
For each of the bloodstains in Scenario 1, you measured the width and the length of
the stain. The width and length are referred to as the variables you are measuring, and
a third variable was the height of the dissection board for that particular bloodstain.
Your table of results constitutes the data set.
Variables (data) are either Qualitative or Quantitative.
Quantitative variables are either Discrete or Continuous.
The distinction between discrete and continuous variables is often a bit fuzzy. For
instance, we often treat money as a continuous variable but strictly speaking it is
discrete and quantized in units of 1p.
By the same token, height and weight of people are classic examples of continuous
variables but we often round heights to the nearest 1 cm and weights to the nearest
1 kg – effectively making them discrete!
Your turn
Try to classify the following variables – if continuous state the quantization level or
typical rounding precision.
Variable Classification
Mass of a suspected bag of
cocaine
Can you think of any qualitative variables that it might be possible to record for the
bloodstains?
Can you classify the appearance of bloodstains somehow?
Frequency Distributions
Finding an average gives you a ‘typical value’ that can represent your data. Part of
choosing the best average for your data is to see how spread out the data is. One way
of doing this is to draw a histogram or frequency bar chart.
If you have a list of 100 numbers, say heights of people, you have to ‘bin’ the
numbers into a series of intervals of height. The tally chart is a common way of doing
this.
Below are the heights in centimetres (rounded to the nearest centimetre) of 100
people.
138 149 152 153 155 160 162 165 167 170
139 149 152 154 156 160 162 165 167 171
139 149 153 154 156 160 162 166 167 171
144 149 153 154 156 160 162 166 168 172
145 149 153 154 156 160 162 166 168 172
145 150 153 154 157 160 163 166 168 172
145 150 153 154 157 161 164 166 168 173
145 150 153 154 158 161 164 167 169 173
146 150 153 154 159 161 164 167 169 173
147 151 153 155 159 161 164 167 169 176
The smallest height is 138 cm and the largest height is 176 cm, so the range is 176 –
138 = 38cm. We can use 9 intervals as follows…
Interval Frequency
135 ≤ X < 140 3
140 ≤ X < 145 1
145 ≤ X < 150 11
150 ≤ X < 155 24
155 ≤ X < 160 11
160 ≤ X < 165 20
165 ≤ X < 170 20
170 ≤ X < 175 9
175 ≤ X < 180 1
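The tally can also be done in code. The sketch below is one possible way to bin the 100 heights into the nine 5 cm intervals used in the table; the names `LOW`, `WIDTH` and `N_BINS` are my own choices, not part of the course material.

```python
from collections import Counter

# The 100 heights in cm, typed row by row from the table above.
heights = [
    138, 149, 152, 153, 155, 160, 162, 165, 167, 170,
    139, 149, 152, 154, 156, 160, 162, 165, 167, 171,
    139, 149, 153, 154, 156, 160, 162, 166, 167, 171,
    144, 149, 153, 154, 156, 160, 162, 166, 168, 172,
    145, 149, 153, 154, 156, 160, 162, 166, 168, 172,
    145, 150, 153, 154, 157, 160, 163, 166, 168, 172,
    145, 150, 153, 154, 157, 161, 164, 166, 168, 173,
    145, 150, 153, 154, 158, 161, 164, 167, 169, 173,
    146, 150, 153, 154, 159, 161, 164, 167, 169, 173,
    147, 151, 153, 155, 159, 161, 164, 167, 169, 176,
]

LOW, WIDTH, N_BINS = 135, 5, 9  # intervals 135 ≤ X < 140 up to 175 ≤ X < 180

# Integer division gives the interval number (0 to 8) for each height.
tally = Counter((h - LOW) // WIDTH for h in heights)
frequencies = [tally[i] for i in range(N_BINS)]

for i, freq in enumerate(frequencies):
    start = LOW + i * WIDTH
    print(f"{start} ≤ X < {start + WIDTH}: {freq}")
```

Changing `WIDTH` is a quick way to see how the shape of the histogram depends on the choice of interval.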
[Figure: histogram of the 100 heights. Horizontal axis: Height / cm, with bars centred on the interval midpoints 137.5 to 177.5; vertical axis: frequency, 0 to 25.]
Your turn
Below is some data on the heights in mm of watercress seedlings grown on blotting
paper…
Histograms are often plotted from small numbers of results, but this is not really
appropriate. A histogram of a sample of 100 heights changes a lot each time a new
sample is drawn, whereas a histogram of a sample of 1000 heights ‘bounces around’
much less.
A better kind of diagram for small data sets is the dot plot. The dot plot is best
illustrated by example…
Suppose two people (SB and MB for the sake of argument) performed a titration 10
times each and recorded the amount of acid needed to neutralise 100 cm³ of alkali.
Data was recorded to the nearest 0.1 cm³.
SB 48.7 41.9 45.8 49.0 46.3 49.0 55.6 44.7 54.9 48.4
MB 48.8 51.2 50.5 51.3 50.7 51.2 50.5 50.5 49.1 49.6
[Figure: dot plot of the MB and SB results on a common axis. Horizontal axis: Acid volume / cm³, from 41 to 55.]
Your turn
Discussion point: Describe the differences between MB and SB’s results as displayed
on the dot plot. Find two ways in which the distribution of dots is different for MB
and SB.
Plot a dot plot on graph paper: Try plotting a dot plot for two more people (KB and
PS) repeating this titration.
Their results are shown below…
KB 45.9 48.1 51.9 50.2 52.1 50.6 50.0 48.8 49.8 50.0
PS 46.0 42.9 47.7 43.8 45.9 43.0 47.1 43.3 45.9 47.3
Write a few sentences about what the dot plot shows you.
• Compare the spread of each data set.
• Compare the middle of each data set (the ‘central tendency’)
• Plot a special point with a different symbol (perhaps ↓ ) marking the mean of
each set of data.
Low accuracy
Dot plots that show a low spread (tightly bunched dots) might be the result of more
precise measurements than dot plots that show a wide spread. A dot plot can help you
gauge the amount of random error in a measurement. Systematic error is harder to
deal with and may not show in statistical analysis.
Recall the errors work we looked at in the first session: we divided the error up into
systematic and random components…
• Systematic error (unknown)
• Random error (knowable)
The standard deviation is a measure of how spread out your data is – the dispersion
of your data - and can be used to calculate the size of random error. As we are
working mostly with samples that are drawn from a larger distribution, we use a
formula for the sample standard deviation…
$$ S = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} $$
Where
• the large Σ symbol means ‘find the total of all the following’
• x̄ (x-bar) stands for the mean
• x stands for each data value in turn
• N stands for the number of data items
The best way to learn to apply this formula is to work with an example and to work in
columns to represent each of the stages in the calculation…
Typical question
Calculate the mean and sample standard deviation of the 10 acid volumes obtained by
SB in the titration experiment: here are the results again in cm3…
SB 48.7 41.9 45.8 49.0 46.3 49.0 55.6 44.7 54.9 48.4
The first thing to do is to calculate the mean for this data. I get the total to be 484.3,
and so the mean is 48.43 cm³, which we can conveniently round to 48.4.
The next thing to do is arrange the data in the first column of a three column table –
we will use the other two columns to record results as we go on…
x             x − x̄                  (x − x̄)²
Data value    Deviation              Square deviation
48.7          48.7 – 48.4 = 0.3      0.3² = 0.09
41.9
45.8
49.0
46.3
49.0
55.6
44.7
54.9
48.4
Once you have your value for the total square deviation, you can complete the
calculation as follows…
$$ S = \sqrt{\frac{162.01}{10 - 1}} = \sqrt{18.0} = 4.24 \approx 4.2 $$
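The column-by-column method can be checked with a short script. This is only a sketch of the table layout described above, using SB's results and the mean rounded to 48.4 as in the text; the variable names are my own.

```python
import math
import statistics

sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]

# Column 1 is the data. The mean 48.43 is rounded to 48.4, as in the text.
mean = round(sum(sb) / len(sb), 1)

# Column 2: deviation of each value from the mean.
deviations = [x - mean for x in sb]

# Column 3: square of each deviation.
squares = [d ** 2 for d in deviations]

total_square_deviation = sum(squares)
s = math.sqrt(total_square_deviation / (len(sb) - 1))

for x, d, sq in zip(sb, deviations, squares):
    print(f"{x:6.1f} {d:7.1f} {sq:8.2f}")
print(f"total square deviation = {total_square_deviation:.2f}")
print(f"S = {s:.2f}")

# Cross-check with the library routine, which uses the unrounded mean.
s_check = statistics.stdev(sb)
print(f"statistics.stdev gives {s_check:.2f}")
```

Rounding the mean before subtracting changes the total square deviation only very slightly, which is why the hand calculation and the library routine agree to two decimal places here.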
Your turn
Calculate the mean and sample standard deviation of the 10 acid volumes obtained by
KB in the titration experiment: here are the results again in cm³…
KB 45.9 48.1 51.9 50.2 52.1 50.6 50.0 48.8 49.8 50.0
x             x − x̄                  (x − x̄)²
Data value    Deviation              Square deviation
Remember to complete the calculation by dividing the total square deviation by one
less than the number of data items and then taking the square root of the result. I
found the answer to be 1.8 cm3 rounded to two significant figures.
You might want to put marks on your dot plot for this data at ±S either side of the
mean, and then ±2S and then ±3S where S is the sample standard deviation.
Excel has functions built in that will calculate a wide range of statistics for you.
To calculate the sample standard deviation, you just
• Type your data into a column (say the first number is typed in cell A1 and the 10th
in cell A10)
• Click in cell A12 (or any cell outside the data range) and type
=stdev(A1:A10) and press enter
• You should see the standard deviation figure appear
The command =stdev(A1:A10) is called a formula; the ‘=’ sign tells Excel that a
calculation is needed. The cell range A1:A10 tells Excel which figures to include in a
calculation.
All scientific calculators have a routine for calculating the standard deviation.
• Learning how to use your particular calculator’s routine is best done as a ‘simon
says’ exercise in the lesson as each make of calculator (and sometimes different
models within a manufacturer’s range) will have a different logic.
• Make absolutely sure that you can get the same answers for mean and standard
deviation using your calculator’s statistics mode as you do when using the table
layout.
• Calculators often use the symbol σn-1 for what we have called the sample standard
deviation. The symbol σn-1 will often be found as a legend above a key.
The sample standard deviation of (say) the diameter of 30 airgun pellets taken at
random from a box of 500 pellets provides you with an estimate of the standard
deviation of the population in the box (in turn a sample of the whole production of
pellets!). There will be some error on the sample standard deviation estimate.
We use different symbols for statistics calculated from samples and the values we are
trying to estimate for the population…
                      Sample       Population
Mean                  x̄            µ
Standard deviation    S or S.D.    σ
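The difference between the two standard deviations is just the divisor: N − 1 for a sample, N when the data are the whole population. As an illustration (not part of the original notes), Python's `statistics` module provides both versions, shown here on SB's titration results.

```python
import statistics

sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]

# Sample standard deviation S: divides the total square deviation by N - 1.
sample_sd = statistics.stdev(sb)

# Population standard deviation: divides by N, so it is always a little smaller.
population_sd = statistics.pstdev(sb)

print(f"divide by N - 1: {sample_sd:.2f} cm³")
print(f"divide by N:     {population_sd:.2f} cm³")
```

For the work in these notes the N − 1 version is the one to use, because the data are samples drawn from a larger population.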
Homework: Week 4
If you work through the questions below in time for the next lesson, you will have all
the mathematical techniques at your fingertips. Next week we look at the Normal
Distribution and the standard error of the mean, and then go on to calculate
confidence intervals of the mean.
Skill questions
B 476 524 511 527 514 524 510 510 482 491
A 839 785 816 842 820 849 895 808 890 837
b) Calculate the mean and standard deviation for both sets of results
346 303 371 417 414 438 262 349 409 311 329
284 277 316 325 265 334 342 366 1500
b) Which weevil has a weight that is very far from the mean?
c) Calculate the value of $\frac{x - \bar{x}}{S}$ where $(x - \bar{x})$ is the difference between
the extreme weevil and the mean and S is the standard deviation.
Draw diagrams and make calculations to compare the results of the two
technicians. Write a short paragraph explaining which of the two technicians is
generating the most consistent results. Explain your reasoning.
Hint: a dot plot might be a good starting point
2.3 2.4 2.4 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7
2.7 2.7 2.8 2.8 2.9 3.0 3.0 3.1 3.2 3.2 3.3
There is an alternative to the mean and standard deviation for describing distributions.
The five number summary consists of the following set of numbers
• The minimum value in the data set
• The lower quartile (LQ) value (where 25% of the data are lower than the
quartile)
• The median
• The upper quartile (UQ) (where 75% of the data are lower than the upper
quartile)
• The maximum value in the data set
The median is found (as explained earlier) by finding the $\frac{1}{2}(n + 1)$th number in a list
of the n data items sorted into order of size. The maximum and minimum values are
easily found.
There are a number of slightly different ways of finding the upper and lower quartiles.
The method here is quoted in Armitage, 1994, and is best illustrated by an example.
The numbers below could be the widths of 25 samples of steel bar in mm measured
with a micrometer...
4.26 4.70 4.78 4.85 4.86 4.89 4.96 5.06 5.08 5.09 5.12 5.13 5.13
5.18 5.20 5.22 5.30 5.30 5.31 5.34 5.37 5.40 5.41 5.53 5.68
The list is sorted in order of size, and the median value is 5.13 mm. Use the formula
$q_l = \frac{1}{4}(n + 1)$ to find the position of the lower quartile, and $q_u = \frac{3}{4}(n + 1)$ to find the
position in the sorted list of the upper quartile. In this case, n = 25 so the positions are
$q_l = \frac{1}{4}(25 + 1) = 0.25 \times 26 = 6.5$ and $q_u = \frac{3}{4}(25 + 1) = 19.5$. These figures suggest that
we pick the mean of the 6th and 7th figures for the LQ and the mean of the 19th and
20th figures for the UQ. My five number summary for this data is (all in mm)...
Min 4.26
LQ 4.93 (rounded up)
Median 5.13
UQ 5.33 (rounded up)
Max 5.68
If you had 10 data items, then the LQ has position $q_l = \frac{1}{4}(10 + 1) = 0.25 \times 11 = 2.75$,
which we round up to be the 3rd data item in the list. The UQ rounds down to 8th.
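The position method can be sketched in Python as follows. The rounding rule in `value_at_position` is an assumption inferred from the two worked examples (a position ending in .5 takes the mean of its two neighbours, any other fraction is rounded to the nearest whole position); the function names are my own and other textbooks use slightly different rules.

```python
import math

def value_at_position(sorted_data, position):
    """Return the value at a 1-based position that may be fractional.

    Assumed rule: a position ending in .5 averages the two neighbouring
    items; any other fraction is rounded to the nearest whole position.
    """
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.5:
        return (sorted_data[lower - 1] + sorted_data[lower]) / 2
    index = lower if fraction < 0.5 else lower + 1
    return sorted_data[index - 1]

def five_number_summary(data):
    """Return (min, LQ, median, UQ, max) using the (n + 1) position formulas."""
    values = sorted(data)
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 data items")
    return (
        values[0],
        value_at_position(values, (n + 1) / 4),
        value_at_position(values, (n + 1) / 2),
        value_at_position(values, 3 * (n + 1) / 4),
        values[-1],
    )

steel = [
    4.26, 4.70, 4.78, 4.85, 4.86, 4.89, 4.96, 5.06, 5.08, 5.09, 5.12, 5.13, 5.13,
    5.18, 5.20, 5.22, 5.30, 5.30, 5.31, 5.34, 5.37, 5.40, 5.41, 5.53, 5.68,
]

summary = five_number_summary(steel)
for label, value in zip(("Min", "LQ", "Median", "UQ", "Max"), summary):
    print(f"{label:7s}{value:.3f}")
```

The quartiles are printed to three decimal places so that the unrounded means of the neighbouring items are visible before any rounding to two places.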
Your turn
The airgun data from Homework Week 4 question 4 is reproduced below (all
measurements in mm)...
2.3 2.4 2.4 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7
2.7 2.7 2.8 2.8 2.9 3.0 3.0 3.1 3.2 3.2 3.3
Min
LQ
Median
UQ
Max
Another kind of diagram can be drawn from the 5 number summary. The Box and
Whisker Plot is very useful for comparing two or more small samples and trying to
find similarities or differences in their distributions. You can include the Box and
Whisker plot on the same axes as a dot plot.
The data set for the steel bars from the previous page is shown below on a scale that
is a little wider each end than the maximum and minimum data points...
[Figure: box and whisker plot of the steel bar data, with MIN, LQ, MED, UQ and MAX marked.]
The central box tells you the range of the central 50% of the distribution.
In statistics the range of a set of data is simply the difference between the largest and
smallest item in the data set...
The range is handy for scaling graphs, but as a measure of spread or dispersion, the
range suffers from some problems...
• The range depends on only two values
• Sensitive to extreme outliers
• The range takes no account of the distribution of values between the maximum
and minimum
The Semi-interquartile range (SIQR) is just half the difference between the upper
and lower quartiles for a set of data...
$$ \text{SIQR} = \frac{1}{2}(\text{UQ} - \text{LQ}) $$
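As a quick worked check, the sketch below applies the definition (half the difference between the quartiles) to the steel bar data. The function name is my own, and the quartiles are taken before rounding, as the means of the 6th/7th and 19th/20th sorted values.

```python
def semi_interquartile_range(lower_quartile, upper_quartile):
    """Half the difference between the upper and lower quartiles."""
    return (upper_quartile - lower_quartile) / 2

# Steel bar quartiles in mm, before rounding.
lq = (4.89 + 4.96) / 2
uq = (5.31 + 5.34) / 2

siqr = semi_interquartile_range(lq, uq)
print(f"SIQR = {siqr:.2f} mm")
```

Unlike the range, this figure is unaffected by the extreme values 4.26 and 5.68.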
Your turn
The table below contains the five number summaries of two samples of the length of
butterfly wings in mm. The sample sizes are also shown in the table as N.
Sample A Sample B
Min 10.5 11.3
LQ 12.2 12.7
Median 12.8 13.4
UQ 13.1 13.8
Max 13.4 14.3
N 36 40
Draw box and whisker plots representing these two samples on the same axes.
Discussion: What are the differences between the two samples?
The distribution of heights of a sample of 1000 adults resident in England might look
a bit like this;
[Figure: histogram of the heights of a sample of 1000 adults. Horizontal axis: Height / cm, interval midpoints 132.5 to 197.5; vertical axis: frequency, 0 to 250.]
Biological variables (height, weight, blood pressure, and so on) tend to have a
distribution shape like this as do genuine random errors in measurement. This
distribution shape became so common, it was referred to as the normal distribution.
Other distribution shapes have become more common as statistics developed as a
useful tool in industry and science - but the name has stuck.
For a sample of data that is drawn from a population that has a normal distribution for
the variable of interest, the three averages will give the same value.
If the median and mean are appreciably different for a given sample, then this may
indicate that the distribution is skewed one way or another, and the histogram of the
data will look ‘pushed over’. If the median is less than the mean, you have positive
skew and if the median is higher than the mean, you have negative skew.
[Figure: the Normal curve. Horizontal axis: SDs either side of mean, from -3.5 to 3.5; vertical axis: 0 to 0.45. Annotation: ±1σ contains 68% of data.]
About 99.7% of a Normal distribution lies within ±3 standard deviations of the mean.
This means that if (say) you had a batch of steel washers with mean thickness
0.35 mm and a standard deviation of 0.05 mm, then you would expect 99.7% of the
washers to have a thickness in the range 0.35 ± 0.15 mm, that is, between 0.20 mm
and 0.50 mm.
If you did find a washer that was 0.2 mm thick or thinner, then such a washer had a
chance of only about 1.5 in 1000 (half of 0.3%) of being produced.
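These proportions can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch assumes, as the text does, that washer thickness follows a Normal distribution with mean 0.35 mm and standard deviation 0.05 mm.

```python
from statistics import NormalDist

# Washer thickness in mm, using the mean and standard deviation from the text.
washers = NormalDist(mu=0.35, sigma=0.05)

# Proportion expected between 0.20 mm and 0.50 mm (three SDs either side).
within_3sd = washers.cdf(0.50) - washers.cdf(0.20)

# Proportion expected to be 0.20 mm or thinner (the lower tail only).
below_020 = washers.cdf(0.20)

print(f"within ±3 SD:       {within_3sd:.4f}")
print(f"0.20 mm or thinner: {below_020:.5f}")
```

The exact lower tail is about 1.35 in 1000, slightly less than the 1.5 in 1000 obtained by halving the rounded figure of 0.3%.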
Your turn
Suppose a batch of airgun pellets has a mean diameter of 0.177 inches with a standard
deviation of 0.008 inches. What range of diameters could 99.7% of the pellets be
expected to lie within?
Discussion point: Would you send someone to prison if there was a 3 out of 1000
chance that they might really be innocent?
Suppose you took a large series of samples (say the heights of 10 adult males resident
in England). The means of the samples would show variation - the mean of the means
would be a very close approximation to the mean height of the population, and the
standard deviation of the sample means would tell you something about the sampling
error.
You can get an idea of the sampling error from a single sample by calculating
something called the standard error of the mean (SEM).
The standard error of the mean is calculated by dividing the sample standard deviation
by the square root of the size of the sample…
$$ \text{SEM} = \frac{S}{\sqrt{N}} $$
Your turn
As you can see, as the sample size increases, the SEM decreases. This shows you that
a larger sample has a smaller chance of having a mean that is different from the mean
of the population than a small sample.
• The SEM does NOT say anything about an individual value for a given data
item
• Samples must always be randomly selected from the population and must be free
from bias
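A few lines of Python show how the SEM shrinks as the sample grows. This is an illustrative sketch only: it takes the sample standard deviation of SB's titration results (4.24 cm³) and assumes, hypothetically, that larger samples would have the same standard deviation.

```python
import math

def standard_error_of_mean(s, n):
    """Sample standard deviation divided by the square root of the sample size."""
    return s / math.sqrt(n)

s = 4.24  # sample standard deviation of SB's results, in cm³

# Hypothetical sample sizes; S is assumed to stay the same for each.
for n in (10, 40, 160):
    print(f"N = {n:3d}  SEM = {standard_error_of_mean(s, n):.2f} cm³")
```

Because of the square root, the sample has to be four times larger to halve the SEM.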
“The distribution of an average tends to be Normal, even when the distribution from
which the average is computed is decidedly non-Normal.” - Charles Annis
I built a spreadsheet to pick 1000 samples of 100 random numbers each, and the
distribution of the means of the samples is shown below…
[Figure: histogram of the means of the 1000 samples. Horizontal axis labelled UCB, from 0.38 to 0.61; vertical axis: Frequency (1000 samples), 0 to 180.]
• A typical mean for the means of 1000 samples was 0.500 and the standard
deviation of the means was 0.029.
• A typical mean for a single sample is 0.478 and a sample standard deviation is
0.3
• As you can see, the standard deviation of the sample means is approximately the
same as the standard deviation of a single sample divided by the square root
of the sample size
• The standard error of the mean calculated from a single sample is a guide to the
standard deviation of the distribution of the means.
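The spreadsheet experiment can be imitated in Python. This is a sketch, not the original spreadsheet: it assumes the random numbers are drawn uniformly between 0 and 1 (which is consistent with the mean of 0.5 and standard deviation of about 0.3 quoted above), and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run can be repeated

N_SAMPLES, SAMPLE_SIZE = 1000, 100

# 1000 samples of 100 random numbers, each assumed uniform between 0 and 1.
samples = [[random.random() for _ in range(SAMPLE_SIZE)] for _ in range(N_SAMPLES)]
means = [statistics.mean(sample) for sample in samples]

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)

# Standard error of the mean estimated from the first sample alone.
sem_first = statistics.stdev(samples[0]) / SAMPLE_SIZE ** 0.5

print(f"mean of the sample means: {mean_of_means:.3f}")
print(f"SD of the sample means:   {sd_of_means:.3f}")
print(f"SEM from one sample:      {sem_first:.3f}")
```

The last two figures should be close to each other, which is the point of the bullet list above: one sample is enough to estimate how much the sample means vary.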
Confidence intervals
Because the Standard Error of the Mean is essentially an estimate of the standard
deviation of the sample means, we can use the SEM to estimate the chance that the
population mean falls within a certain range.
Recall that the area under the Normal curve within limits set by the standard deviation
is as follows: about 68% within ±1 standard deviation of the mean, about 95.5% within
±2 and about 99.7% within ±3.
Turning these facts round and applying the concept to the standard error of the mean
results in the following rule for calculating a confidence interval of the mean for a
sample...
“There is a 95.5% probability that the population mean falls within ± 2 standard
errors of the sample mean.”
We can modify this rule slightly to give us a formula for the 95% confidence interval
of the mean as follows:-
$$ 95\%\ \text{CI} = \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}} $$
In words, the 95% confidence interval of the mean is given by the mean plus and
minus 1.96 multiplied by the standard error of the mean.
The factor 1.96 is used instead of 2 so that the probability is 95% (leaving a 1 in 20
chance of falling outside the interval) rather than 95.5%.
Example calculation
Suppose you have the following information for the weight of a sample of sturgeon
roe caught in an area of the Caspian Sea - weights were recorded in grammes...
N 30
Mean 1.56 g
Standard deviation 0.2 g
The 95% confidence interval of the mean for this sample is given by...
$$ 95\%\ \text{CI} = \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}} = 1.56 \pm 1.96 \times \frac{0.2}{\sqrt{30}} = 1.56 \pm 0.072 $$
= 1.49 to 1.63 g
And so we can say that there is a 1 in 20 chance that the sample of sturgeon roe is
actually an extremely unrepresentative one and that the population mean lies outside
the range 1.49 to 1.63 g.
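The arithmetic of the confidence interval can be checked with a short function. This is a minimal sketch of the formula above; the function name is my own, and the inputs are the sample mean 1.56 g, the standard deviation 0.2 g used in the calculation, and N = 30.

```python
import math

def confidence_interval_95(mean, s, n):
    """Return (lower, upper) limits: mean ± 1.96 standard errors."""
    margin = 1.96 * s / math.sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval_95(mean=1.56, s=0.2, n=30)
print(f"margin = {1.96 * 0.2 / math.sqrt(30):.3f} g")
print(f"95% CI = {low:.2f} to {high:.2f} g")
```

The margin of 0.072 g is subtracted from and added to the mean, so the lower limit is 1.56 − 0.072 and the upper limit is 1.56 + 0.072.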
Notice that if you took repeated samples in a perfectly random way and without any
kind of bias, you would expect, 5% of the time, to get a sample with a mean lying
outside this range.
Try the MS Excel spreadsheet simulation at this point - repeated pressing of the F9
function key will draw a new sample. Try pressing F9 50 times. Note each time the
sample mean falls outside the range.
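If the spreadsheet is not to hand, the same experiment can be sketched in Python. This is not the spreadsheet itself: it assumes a hypothetical Normal population with mean 1.56 g and standard deviation 0.2 g (the sample figures treated as if they were the population values), draws many samples of 30, and counts how often the sample mean falls outside 1.56 ± 1.96 standard errors.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so that the run can be repeated

# Assumed population: Normal, mean 1.56 g, SD 0.2 g (hypothetical).
POP_MEAN, POP_SD, N, TRIALS = 1.56, 0.2, 30, 5000

margin = 1.96 * POP_SD / math.sqrt(N)
low, high = POP_MEAN - margin, POP_MEAN + margin

outside = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    sample_mean = statistics.mean(sample)
    if sample_mean < low or sample_mean > high:
        outside += 1  # this sample's mean missed the range

fraction_outside = outside / TRIALS
print(f"sample means outside the range: {fraction_outside:.1%}")
```

Each pass through the loop plays the part of one press of F9; over many passes the fraction outside the range should settle near 5%.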
The 5% probability splits symmetrically so there is a 2.5% probability that the sample
mean is below 1.49 g and a 2.5% probability that the sample mean is above 1.63 g.
Your turn
Here is an extended example designed to take you from raw data to the 95%
confidence intervals for two sets of data. We will use this data in a later section so
please do complete the calculation - ask your tutor if you need help.
A calculator with a standard deviation routine or access to MS Excel will certainly help!
The data
The weight in kg of sacks of flour was checked with an accurate weighing machine.
Random samples were obtained for the check that was carried out in two mills.
Samples of 20 sacks were taken in each case.
Mill A 9.00 9.05 9.05 9.10 9.10 9.10 9.15 9.15 9.20 9.20
       9.20 9.25 9.25 9.30 9.40 9.50 9.60 9.90 9.95 10.00
Mill B 9.10 9.30 9.35 9.35 9.40 9.40 9.40 9.45 9.45 9.45
       9.45 9.50 9.50 9.50 9.55 9.55 9.60 9.70 9.70 9.90
a) Draw dot plots for each of the samples from the two mills and describe the
appearance of the data - use the same axes for both dot plots and include a key
b) Calculate the 5 number summary for each of the data sets and organise your
results into a table
c) Draw box and whisker plots for each sample and comment on the appearance
of the diagrams
d) Calculate the mean and the sample standard deviation for each of the
samples
e) Calculate the 95% confidence interval of the sample mean for each of the
samples
f) Plot the 95% confidence intervals for each sample on your dot plot. Do the
intervals overlap?
g) Discussion: Is Mill A producing sacks that have a different weight from Mill
B? Try to make a statement based on probabilities.
• In statistics, it is normal to lay out the precise criteria that you will use to
decide a question before designing an experiment, choosing samples and so
on. This pre-defined way of working avoids any unconscious bias creeping in
• A standard set of steps where you are clear what you are trying to decide is
called a significance test
• The table below shows the main steps involved in a significance test
• You can find much more detail by looking up the terms involved in a statistics
text book or on the Web
• Below are two sets of data on the girth of Betula pendula as sampled from two
parts of a wood
• You suspect that the sample taken from the North segment may represent a
different population from the sample taken from the South
• Perform a significance test to verify your experimental hypothesis
• Use a 5% significance level
Data
Note: the experimental hypothesis is what you are trying to find out. You then frame
a more specific (usually negative) null hypothesis as part of the significance test.