Scenario 2 Statistics

Unit 6: Maths and Science for Technicians – statistics part 1
Mean, median and mode: measures of ‘central tendency’
The mean, median and mode are three kinds of average. If you have a data set (say
the angles of impact calculated from 5 blood stains that were all dropped at the same
angle) you can find an average or typical value to represent those 5 results. The
average can then be used in subsequent calculations. The phrase ‘measure of central
tendency’ is used as an alternative to the word average; more on this later.
Suppose you have the following set of data (perhaps lengths of 6 bloodstains in
millimetres)
14.3 9.2 18.6 7.4 7.4 13.1
Average   How to find                                    Example

Mean      Find the total of the data and then divide     Total is 70.0
          the total by the number of results             70 ÷ 6 = 11.67 = 11.7 (3 s.f.)

Median    Put the numbers in order of size               7.4 7.4 9.2 13.1 14.3 18.6
          Odd number of results: pick the middle value
          Even number of results: take the mean          (9.2 + 13.1) ÷ 2 = 11.15 = 11.2 (3 s.f.)
          of the middle two values

Mode      Find the value that appears most               Mode is 7.4
          commonly in the list of data
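The three averages for the bloodstain lengths can be checked with a short Python sketch. It assumes Python's standard `statistics` module is available; the variable name `lengths_mm` is only illustrative.

```python
import statistics

# Lengths of the 6 bloodstains in millimetres (the data set given above)
lengths_mm = [14.3, 9.2, 18.6, 7.4, 7.4, 13.1]

mean = statistics.mean(lengths_mm)      # total divided by the number of results
median = statistics.median(lengths_mm)  # even count: mean of the middle two values
mode = statistics.mode(lengths_mm)      # value that appears most often

print(f"Mean   = {mean:.1f} mm")
print(f"Median = {median:.2f} mm")
print(f"Mode   = {mode} mm")
```

The same three calls can be used to check your answers to the salary questions.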
Your turn
Work out the three averages for each of the following data sets…
• The annual gross salaries paid by Company A are as follows
£12 500 £13 500 £13 500 £17 250 £87 500
• Company B pays the following salaries
£29 850 £23 900 £28 950 £30 750 £30 750
Discussion: Compare the mean salaries for the two companies

Compare the median salaries for the two companies
How do the two kinds of average compare?
KPB 2004 http://bodmas.org/bnc/ page 1 of 26

Populations, samples and bias
If you were trying to make a definite statement about (say) the average height of adult
males in England and Wales, you would have two main alternatives
• Conduct a census: measure the height of every adult male in England and Wales
on a certain day
• Take a sample: pick at random a sample of (say) 1 000 adult males from England
and Wales and measure their heights
The first alternative would be expensive, a logistical nightmare, and might not
provide much more information than the second alternative.
The key word here is random. One way of picking a sample of adult males from
England and Wales would be to
• Define the population very carefully by a unique identifier – perhaps National
Insurance number.
• The NI number provides the sampling frame for the population.
• Then a sample could be picked at random using a computer program to select NI
numbers or by picking the digits of the NI numbers from a hat.
It is very important to avoid bias in the sampling process. People asked to pick
numbers at random between 1 and 10 will tend to pick numbers near the middle of the
range – unless they are deliberately compensating in a statistics lesson!
Estimating population parameters
When you calculate the mean height of the sample of 1 000 adult males from
England and Wales, the result is called an estimate of the mean height of the
population. The estimate is probably very close to the actual mean height of the
population, but there may be a small difference called sampling error.
One way to quantify the sampling error is to pick another sample of 1 000 adult males
and measure their heights… You can treat the means of samples as data items. More
on this later…
Your turn: discussion
The aliens have landed in the Bull Ring and they kidnap the next 5 people that walk
past the statue of the bull. List as many different ways in which this sample of Homo
Sapiens could be biased as you can think up in two minutes…

Classification of data
What is the ‘average’ eye colour for the following data set?
Hazel Blue Grey Brown Green
There is not any kind of ‘typical’ eye colour for this group. If there had been two
people with (say) brown eyes, you could assign a modal eye colour to the group.
Variables and data
For each of the bloodstains in Scenario 1, you measured the width and the length of
the stain. The width and length are referred to as the variables you are measuring, and
a third variable was the height of the dissection board for that particular bloodstain.
Your table of results constitutes the data set.
Variables can be classified as follows…
Variables / data
    Qualitative
    Quantitative
        Discrete
        Continuous
Examples of each of the three types of variable include
Type of variable          Example

Qualitative               Ethnic self-classification
Quantitative discrete     Number of people on the 97 bus at 0730 each day
Quantitative continuous   Heights of people picked at random in a football crowd
The distinction between discrete and continuous variables is often a bit fuzzy. For
instance, we often treat money as a continuous variable but strictly speaking it is
discrete and quantized in units of 1p.
By the same token, height and weight of people are classic examples of continuous
variables but we often round heights to the nearest 1 cm and weights to the nearest
1 kg – effectively making them discrete!

Your turn
Discussion Exercise: classify the following variables…
Try to classify the following variables – if continuous state the quantization level or
typical rounding precision.
Variable                                                Classification

Mass of a suspected bag of cocaine
Length of a skid mark in a road accident
Number of people visiting a shop each hour of the day
Shirt collar/dress/shoe size
Discussion exercise: Bloodstain shape
Can you think of any qualitative variables that it might be possible to record for the
bloodstains?
Can you classify the appearance of bloodstains somehow?

Frequency Distributions
Finding an average gives you a ‘typical value’ that can represent your data. Part of
choosing the best average for your data is to see how spread out the data is. One way
of doing this is to draw a histogram or frequency bar chart.
If you have a list of 100 numbers, say heights of people, you have to ‘bin’ the
numbers into a series of intervals of height. The tally chart is a common way of doing
this.
Below are the heights in centimetres (rounded to the nearest centimetre) of 100
people.
138 149 152 153 155 160 162 165 167 170
139 149 152 154 156 160 162 165 167 171
139 149 153 154 156 160 162 166 167 171
144 149 153 154 156 160 162 166 168 172
145 149 153 154 156 160 162 166 168 172
145 150 153 154 157 160 163 166 168 172
145 150 153 154 157 161 164 166 168 173
145 150 153 154 158 161 164 167 169 173
146 150 153 154 159 161 164 167 169 173
147 151 153 155 159 161 164 167 169 176
The smallest height is 138 cm and the largest height is 176 cm, so the range is
176 – 138 = 38 cm. We can use 9 intervals, each 5 cm wide, as follows…
Interval Tallies Frequency

135 ≤ X < 140
140 ≤ X < 145
145 ≤ X < 150
150 ≤ X < 155
155 ≤ X < 160
160 ≤ X < 165
165 ≤ X < 170
170 ≤ X < 175
175 ≤ X < 180
Total of the frequencies 100

My frequency distribution for the data set above was as follows
Interval Frequency
135 ≤ X < 140 3
140 ≤ X < 145 1
145 ≤ X < 150 11
150 ≤ X < 155 24
155 ≤ X < 160 11
160 ≤ X < 165 20
165 ≤ X < 170 20
170 ≤ X < 175 9
175 ≤ X < 180 1
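If you would rather let a computer do the tallying, the sketch below bins the 100 heights into the same 9 intervals. It is only one way of doing it: the names `BIN_WIDTH`, `FIRST_BIN` and `tally` are my own, and it assumes the heights are typed in exactly as listed above.

```python
from collections import Counter

# The 100 heights in cm, copied row by row from the list above
heights = [
    138, 149, 152, 153, 155, 160, 162, 165, 167, 170,
    139, 149, 152, 154, 156, 160, 162, 165, 167, 171,
    139, 149, 153, 154, 156, 160, 162, 166, 167, 171,
    144, 149, 153, 154, 156, 160, 162, 166, 168, 172,
    145, 149, 153, 154, 156, 160, 162, 166, 168, 172,
    145, 150, 153, 154, 157, 160, 163, 166, 168, 172,
    145, 150, 153, 154, 157, 161, 164, 166, 168, 173,
    145, 150, 153, 154, 158, 161, 164, 167, 169, 173,
    146, 150, 153, 154, 159, 161, 164, 167, 169, 173,
    147, 151, 153, 155, 159, 161, 164, 167, 169, 176,
]

BIN_WIDTH = 5    # width of each interval in cm
FIRST_BIN = 135  # lower edge of the first interval

# Integer division gives the interval number, so a height of 140 lands in
# 140 <= X < 145 and not in 135 <= X < 140
tally = Counter((h - FIRST_BIN) // BIN_WIDTH for h in heights)

frequencies = [tally[i] for i in range(9)]
for i, f in enumerate(frequencies):
    lower = FIRST_BIN + i * BIN_WIDTH
    print(f"{lower} <= X < {lower + BIN_WIDTH}: {f}")
print("Total of the frequencies", sum(frequencies))
```

Checking that the frequencies add up to 100 is a quick way to spot a typing slip.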
Below is a histogram produced in MS Excel by means of a subterfuge – I used a bar
chart with customised category axis labels. I used the midpoints of the intervals to
label each bar.
[Figure: Histogram of heights. Horizontal axis: Height /cm, bars labelled with the interval midpoints 137.5 to 177.5. Vertical axis: frequency, 0 to 25.]
Your turn
Below is some data on the heights in mm of watercress seedlings grown on blotting
paper…
h (mm)      5 < h ≤ 10   10 < h ≤ 15   15 < h ≤ 20   20 < h ≤ 25   25 < h ≤ 30   30 < h ≤ 35
Frequency   12           22            10            6             4             1
Plot a histogram of this frequency distribution on graph paper in ‘landscape’ format
using suitable scales.
Write a sentence describing the shape of the histogram. Where does the modal
interval fall?

Illustrating small data sets: the dot plot
Histograms are often plotted from small numbers of results – really this is not
appropriate. A histogram of a sample of 100 heights changes a lot each time a new
sample is drawn. A histogram of a sample of 1000 heights ‘bounces around’ much less.
A better kind of diagram for small data sets is the dot plot. The dot plot is best
illustrated by example…
Suppose two people (SB and MB for the sake of argument) performed a titration 10
times each and recorded the amount of acid needed to neutralise 100 cm³ of alkali.
Data was recorded to the nearest 0.1 cm³.
SB 48.7 41.9 45.8 49.0 46.3 49.0 55.6 44.7 54.9 48.4
MB 48.8 51.2 50.5 51.3 50.7 51.2 50.5 50.5 49.1 49.6
These results could be compared using a dot plot as drawn below…
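[Figure: dot plot with the MB and SB results plotted as separate rows on a common horizontal axis, Acid volume / cm³, running from 41 to 55.]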
As you can see, in a dot plot
• Each data value is represented by a dot

• Dots are plotted on a horizontal scale showing an appropriate range of values of
the variable
• If two (or more) dots have the same value, then they are plotted at the correct
value of variable but vertically separated
• Two different samples can be plotted on the same axes to show differences
between the samples. The samples are separated by a larger vertical space, and
plotted using different symbols for each sample and identified with a key.

Your turn
Discussion point: Describe the differences between MB and SB’s results as displayed
on the dot plot. Find two ways in which the distribution of dots is different for MB
and SB.
Plot a dot plot on graph paper: Try plotting a dot plot for two more people (KB and
PS) repeating this titration.
Their results are shown below…
KB 45.9 48.1 51.9 50.2 52.1 50.6 50.0 48.8 49.8 50.0
PS 46.0 42.9 47.7 43.8 45.9 43.0 47.1 43.3 45.9 47.3
Write a few sentences about what the dot plot shows you.
• Compare the spread of each data set.
• Compare the middle of each data set (the ‘central tendency’)
• Plot a special point with a different symbol (perhaps ↓ ) marking the mean of
each set of data.
Link with errors work

Recall that when we were discussing the errors work, we looked at precision and
accuracy…
[Figure: 2 × 2 grid. Columns: high precision, low precision. Rows: high accuracy, low accuracy.]
Dot plots that show a low spread (tightly bunched dots) might be the result of more
precise measurements than dot plots that show a wide spread. A dot plot can help you
gauge the amount of random error in a measurement. Systematic error is harder to
deal with and may not show in statistical analysis.

Measure of dispersion: Standard Deviation
Recall the errors work we looked at in the first session: we divided the error up into
systematic and random components…
[Diagram: error divided into two components: systematic error (unknown) and random error (knowable).]
The standard deviation is a measure of how spread out your data is – the dispersion
of your data - and can be used to calculate the size of random error. As we are
working mostly with samples that are drawn from a larger distribution, we use a
formula for the sample standard deviation…
$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$
Where
• the large Σ symbol means ‘find the total of all the following’
• $\bar{x}$ stands for the mean
• x stands for each data value in turn
• N stands for the number of data items
What the formula is asking you to do is…
• Find the mean of the data ($\bar{x}$)
• Subtract the mean from each data value in turn and then square the result
• Find the total of the squares of the deviations from the mean
• Divide this total by one less than the number of data items – you now have the
mean square of the deviations from the mean (the sample variance)
• Take the square root to find the sample standard deviation
The best way to learn to apply this formula is to work with an example and to work in
columns to represent each of the stages in the calculation…

Typical question
Calculate the mean and sample standard deviation of the 10 acid volumes obtained by
SB in the titration experiment: here are the results again in cm³…
SB 48.7 41.9 45.8 49.0 46.3 49.0 55.6 44.7 54.9 48.4
The first thing to do is to calculate the mean for this data. I get the total to be 484.3,
and so the mean is 48.43 cm³, which we can conveniently round to 48.4.
The next thing to do is arrange the data in the first column of a three column table –
we will use the other two columns to record results as we go on…
$x$           $x - \bar{x}$          $(x - \bar{x})^2$
Data value    Deviation              Square deviation
48.7          48.7 – 48.4 = 0.3      0.3² = 0.09
41.9          41.9 – 48.4 = -6.5     (-6.5)² = 42.25
45.8
49.0
46.3
49.0
55.6
44.7
54.9
48.4
Total square deviation               162.01
Once you have your value for the total square deviation, you can complete the
calculation as follows…
$$S = \sqrt{\frac{162.01}{10 - 1}} = \sqrt{18.0} = 4.24 \approx 4.2$$
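The column-by-column method can be mirrored in a few lines of Python. This is only a sketch of the table layout used here; the names `deviations` and `squares` are mine, and the mean is rounded to 48.4 as in the worked example so that the figures match the table.

```python
import math

# SB's 10 acid volumes in cm³
sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]

n = len(sb)
mean = round(sum(sb) / n, 1)  # 48.43 rounded to 48.4, as in the worked example

# Columns 2 and 3 of the table: deviation and square deviation
deviations = [x - mean for x in sb]
squares = [d ** 2 for d in deviations]

total_square_deviation = sum(squares)
s = math.sqrt(total_square_deviation / (n - 1))

for x, d, sq in zip(sb, deviations, squares):
    print(f"{x:5.1f} {d:6.1f} {sq:7.2f}")
print(f"Total square deviation {total_square_deviation:.2f}")
print(f"S = {s:.2f}")
```

Working with the unrounded mean of 48.43 changes the total square deviation only in the third decimal place, so the rounding does no harm here.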

Your turn
Calculate the mean and sample standard deviation of the 10 acid volumes obtained by
KB in the titration experiment: here are the results again in cm³…
KB 45.9 48.1 51.9 50.2 52.1 50.6 50.0 48.8 49.8 50.0
Use the blank table below to calculate your results…
$x$           $x - \bar{x}$          $(x - \bar{x})^2$
Data value    Deviation              Square deviation

Total square deviation
Remember to complete the calculation by dividing the total square deviation by one
less than the number of data items and then taking the square root of the result. I
found the answer to be 1.8 cm³ rounded to two significant figures.
You might want to put marks on your dot plot for this data at ±S either side of the
mean, and then ±2S and then ±3S where S is the sample standard deviation.

Standard deviation: calculating with MS Excel
Excel has functions built in that will calculate a wide range of statistics for you.
To calculate the sample standard deviation, you just
• Type your data into a column (say the first number is typed in cell A1 and the 10th
in cell A10)
• Click in cell A12 (or any cell outside the data range) and type
=stdev(A1:A10) and press enter
• You should see the standard deviation figure appear
The command =stdev(A1:A10) is called a formula; the ‘=’ sign tells Excel that a
calculation is needed. The cell range A1:A10 tells Excel which figures to include in
the calculation.
Calculating with your scientific calculator
All scientific calculators have a routine for calculating the standard deviation.
• Learning how to use your particular calculator’s routine is best done as a ‘simon
says’ exercise in the lesson as each make of calculator (and sometimes different
models within a manufacturer’s range) will have a different logic.
• Make absolutely sure that you can get the same answers for mean and standard
deviation using your calculator’s statistics mode as you do when using the table
layout.
• Calculators often use the symbol $\sigma_{n-1}$ for what we have called the sample standard
deviation. The symbol $\sigma_{n-1}$ will often be found as a legend above a key.
Sample standard deviation as an estimate of population standard deviation
The sample standard deviation of (say) the diameter of 30 airgun pellets taken at
random from a box of 500 pellets provides you with an estimate of the standard
deviation of the population in the box (in turn a sample of the whole production of
pellets!). There will be some error on the sample standard deviation estimate.
We use different symbols for statistics calculated from samples and the values we are
trying to estimate for the population…
                     Sample       Population
Mean                 $\bar{x}$    µ
Standard deviation   S or S.D.    σ

Homework: Week 4
If you work through the questions below in time for the next lesson, you will have all
the mathematical techniques at your fingertips. Next week we look at the Normal
Distribution and the standard error of the mean, and then go on to calculate
confidence intervals of the mean.
Skill questions
1) Below is some data relating to the reaction time in milliseconds (thousandths
of a second) of a person before (B) and after (A) ingesting 60 mL of a rather
good malt whisky
B 476 524 511 527 514 524 510 510 482 491
A 839 785 816 842 820 849 895 808 890 837
a) Plot both sets of results on a single dot plot
b) Calculate the mean and standard deviation for both sets of results
c) Organise your statistics into a table to allow comparison
2) Below is a series of 20 measurements of the weight of mature weevils in mg…
346 303 371 417 414 438 262 349 409 311 329
284 277 316 325 265 334 342 366 1500
a) Calculate the mean and standard deviation of this data set
b) Which weevil has a weight that is very far from the mean?
c) Calculate the value of $(x - \bar{x}) / S$, where $(x - \bar{x})$ is the difference between
the extreme weevil and the mean and S is the standard deviation.

Questions that need explanations: harder
3) Two science technicians are calibrating an electrophoresis gel. Technician C
measures the position of the 4th calibration band on 7 gels. Technician D
measures the position of the 4th calibration band on 9 gels. Their results are
shown below in mm.
Tech C 10.9 13.1 16.9 15.2 17.1 15.6 15.0
Tech D 6.8 11.3 18.7 15.5 19.2 16.2 14.9 12.6 14.0
Draw diagrams and make calculations to compare the results of the two
technicians. Write a short paragraph explaining which of the two technicians is
generating the more consistent results. Explain your reasoning.
Hint: a dot plot might be a good starting point
4) Below are the measurements of a sample of airgun pellets in mm to the nearest
0.1 mm.
2.3 2.4 2.4 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7
2.7 2.7 2.8 2.8 2.9 3.0 3.0 3.1 3.2 3.2 3.3
3.4 3.4 3.5 3.8 4.2 4.2 4.4 6.5
a) Draw a dot plot showing the data

b) How would you describe the distribution of dots?
c) Find the median and the mean. Write a sentence comparing the two
values of central tendency.
d) Calculate the standard deviation using a calculator routine.
e) Is the 6.5 value an outlier?

The five number summary
There is an alternative to the mean and standard deviation for describing distributions.
The five number summary consists of the following set of numbers
• The minimum value in the data set
• The lower quartile (LQ) value (where 25% of the data are lower than the
quartile)
• The median
• The upper quartile (UQ) (where 75% of the data are lower than the upper
quartile)
• The maximum value in the data set
The median is found (as explained earlier) by finding the $\frac{1}{2}(n + 1)$th number in a list
of the n data items sorted into order of size. The maximum and minimum values are
easily found.
There are a number of slightly different ways of finding the upper and lower quartiles.
The method here is quoted in Armitage, 1994, and is best illustrated by an example.
The numbers below could be the widths of 25 samples of steel bar in mm measured
with a micrometer...
4.26 4.70 4.78 4.85 4.86 4.89 4.96 5.06 5.08 5.09 5.12 5.13 5.13
5.18 5.20 5.22 5.30 5.30 5.31 5.34 5.37 5.40 5.41 5.53 5.68
The list is sorted in order of size, and the median value is 5.13 mm. Use the formula
$q_l = \frac{1}{4}(n + 1)$ to find the position of the lower quartile, and $q_u = \frac{3}{4}(n + 1)$ to find the
position in the sorted list of the upper quartile. In this case, n = 25 so the positions are
$q_l = \frac{1}{4}(25 + 1) = 0.25 \times 26 = 6.5$ and $q_u = \frac{3}{4}(25 + 1) = 19.5$. These figures suggest that
we pick the mean of the 6th and 7th figures for the LQ and the mean of the 19th and
20th figures for the UQ. My five number summary for this data is (all in mm)...
Min 4.26
LQ 4.93 (4.925 rounded up)
Median 5.13
UQ 5.33 (rounded up)
Max 5.68
If you had 10 data items, then the LQ has position $q_l = \frac{1}{4}(10 + 1) = 0.25 \times 11 = 2.75$
which we round up to be the 3rd data item in the list. The UQ has position
$q_u = \frac{3}{4}(10 + 1) = 8.25$, which rounds down to the 8th.
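The quartile rule can be sketched in Python as shown below. The function name `quartile` is my own, and the rounding rule is an assumption inferred from the two examples above: a position ending in .5 takes the mean of the two neighbouring items, and any other position is rounded to the nearest whole item. Other textbooks and software packages use slightly different rules, so do not expect every program to agree exactly.

```python
def quartile(sorted_data, fraction):
    """Value at position fraction * (n + 1), counting from 1, in a sorted list.

    A position ending in .5 takes the mean of the two neighbouring items;
    any other position is rounded to the nearest whole item.
    """
    n = len(sorted_data)
    position = fraction * (n + 1)
    whole = int(position)
    if position - whole == 0.5:
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    return sorted_data[round(position) - 1]


# Widths of the 25 steel bar samples in mm
steel = sorted([
    4.26, 4.70, 4.78, 4.85, 4.86, 4.89, 4.96, 5.06, 5.08, 5.09, 5.12, 5.13, 5.13,
    5.18, 5.20, 5.22, 5.30, 5.30, 5.31, 5.34, 5.37, 5.40, 5.41, 5.53, 5.68,
])

summary = {
    "Min": steel[0],
    "LQ": quartile(steel, 0.25),
    "Median": quartile(steel, 0.5),
    "UQ": quartile(steel, 0.75),
    "Max": steel[-1],
}
for name, value in summary.items():
    print(f"{name:6} {value:.3f}")
```

The quartiles are printed to three decimal places so that you can see the values before any rounding is applied.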

Your turn
The airgun data from Homework Week 4 question 4 is reproduced below (all
measurements in mm)...
2.3 2.4 2.4 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7
2.7 2.7 2.8 2.8 2.9 3.0 3.0 3.1 3.2 3.2 3.3
3.4 3.4 3.5 3.8 4.2 4.2 4.4 6.5
Find the 5 number summary for this data....
Min
LQ
Median
UQ
Max
Box and whisker plot
Another kind of diagram can be drawn from the 5 number summary. The Box and
Whisker Plot is very useful for comparing two or more small samples and trying to
find similarities or differences in their distributions. You can include the Box and
Whisker plot on the same axes as a dot plot.
The data set for the steel bars from the previous page is shown below on a scale that
is a little wider each end than the maximum and minimum data points...
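[Figure: box and whisker plot of the steel bar data with MIN, LQ, MED, UQ and MAX marked, on a horizontal axis labelled ‘Diameter in mm’ running from 4.0 to 6.0.]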
The central box tells you the range of the central 50% of the distribution.

The Range and the Semi-Inter-quartile Range
In statistics the range of a set of data is simply the difference between the largest and
smallest item in the data set...
Range = Max − Min
The range is handy for scaling graphs, but as a measure of spread or dispersion, the
range suffers from some problems...
• The range depends on only two values
• Sensitive to extreme outliers
• The range takes no account of the distribution of values between the maximum
and minimum
The Semi-interquartile range (SIQR) is just half the difference between the upper
and lower quartiles for a set of data...
$$\mathrm{SIQR} = \frac{1}{2}(\mathrm{UQ} - \mathrm{LQ})$$
The advantages of the SIQR as a measure of dispersion include

• Depends on the quartiles so less influenced by extreme values
• Takes account of data distribution through position of the quartiles
• Can provide measure of spread when standard deviation may be problematic (e.g.
highly skewed distributions)
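As a quick illustration, here is the range and the SIQR for the steel bar data from the five number summary section. The function name `semi_interquartile_range` is my own; the quartiles are taken as the mean of the 6th and 7th items and the mean of the 19th and 20th items in the sorted list.

```python
def semi_interquartile_range(lq, uq):
    """Half the difference between the upper and lower quartiles."""
    return (uq - lq) / 2


# Quartiles of the steel bar data before rounding (all in mm)
lq = (4.89 + 4.96) / 2  # mean of the 6th and 7th items
uq = (5.31 + 5.34) / 2  # mean of the 19th and 20th items

data_range = 5.68 - 4.26  # Max minus Min
siqr = semi_interquartile_range(lq, uq)

print(f"Range = {data_range:.2f} mm, SIQR = {siqr:.2f} mm")
```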
Your turn
The table below contains the five number summaries of two samples of the length of
butterfly wings in mm. The sample sizes are also shown in the table as N.
Sample A Sample B
Min 10.5 11.3
LQ 12.2 12.7
Median 12.8 13.4
UQ 13.1 13.8
Max 13.4 14.3
N 36 40
Draw box and whisker plots representing these two samples on the same axes.
Discussion: What are the differences between the two samples?

The Normal Distribution
The distribution of heights of a sample of 1000 adults resident in England might look
a bit like this:
[Figure: histogram titled ‘Height distribution of 1000 people’. Horizontal axis: Height / cm, with interval midpoints from 132.5 to 197.5. Vertical axis: frequency, 0 to 250.]
As you can see, the distribution is
• Peaked in the middle of the height range

• Roughly symmetrical about the modal class
• Tends to have a ‘bell curved’ shape
Biological variables (height, weight, blood pressure, and so on) tend to have a
distribution shape like this as do genuine random errors in measurement. This
distribution shape became so common, it was referred to as the normal distribution.
Other distribution shapes have become more common as statistics developed as a
useful tool in industry and science - but the name has stuck.
Measures of central tendency for Normal distribution
For a sample of data that is drawn from a population that has a normal distribution for
the variable of interest, the three averages will give approximately the same value.
If the median and mean are appreciably different for a given sample, then this may
indicate that the distribution is skewed one way or another, and the histogram of the
data will look ‘pushed over’. If the median is less than the mean, you have positive
skew and if the median is higher than the mean, you have negative skew.

Percentage of data items under the Normal curve
[Figure: the theoretical Normal curve. Horizontal axis: SDs either side of mean, from -3.5 to 3.5. Annotations: ±1σ contains 68% of data; ±2σ contains 95.5% of data; ±3σ contains 99.7% of data.]
The theoretical normal curve above shows you that...

• 68% of data items should lie within ± 1 Standard Deviation from the mean
• 95.5% of data items should lie within ± 2 Standard Deviations from the mean
• 99.7% of data items should lie within ± 3 Standard deviations from the mean
This means that if (say) you had a batch of steel washers with mean thickness
0.35mm and a standard deviation 0.05mm, then you would expect that 99.7% of the
washers would have a thickness in the range 0.35 ± 0.15mm, or lie between 0.20 mm
and 0.50mm.
If you did find a washer that was 0.20 mm thick or thinner, then such a washer had a
chance of about 1.5 in 1000 (half of 0.3%) of being produced.
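You can check these percentages for yourself with the sketch below. It assumes Python 3.8 or later, where the standard `statistics` module provides `NormalDist`; the helper name `fraction_within` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a Normal distribution within ±k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"±{k} SD contains {100 * fraction_within(k):.2f}% of data")

# Washer example: mean thickness 0.35 mm, standard deviation 0.05 mm
mean, sd = 0.35, 0.05
low, high = mean - 3 * sd, mean + 3 * sd
print(f"99.7% range: {low:.2f} mm to {high:.2f} mm")
```

The percentages are printed to two decimal places; the figures of 68%, 95.5% and 99.7% quoted in these notes are rounded versions of them.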
Your turn
Suppose a batch of airgun pellets has a mean diameter of 0.177 inches with a standard
deviation of 0.008 inches. What range of diameters could 99.7% of the pellets be
expected to lie within?
Does this range include a pellet of 0.200 inches?
Discussion point: Would you send someone to prison if there was a 3 out of 1000
chance that they might really be innocent?

Standard error of the mean (SEM)
Suppose you took a large series of samples (say each sample is the heights of 10 adult
males resident in England). The means of the samples would show variation: the mean of
the means would be a very close approximation to the mean height of the population, and
the standard deviation of the sample means would tell you something about the sampling
error.
You can get an idea of the sampling error from a single sample by calculating
something called the standard error of the mean (SEM).
The standard error of the mean is calculated by dividing the sample standard deviation
by the square root of the size of the sample…
$$\mathrm{SEM} = \frac{S}{\sqrt{N}}$$
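As a small sketch of the formula, here is the SEM for SB's titration results from earlier, using S of about 4.24 cm³ from N = 10 readings. The function name `standard_error_of_mean` is my own.

```python
import math


def standard_error_of_mean(s, n):
    """Sample standard deviation divided by the square root of the sample size."""
    return s / math.sqrt(n)


# SB's titration results: S is about 4.24 cm³ from N = 10 readings
sem = standard_error_of_mean(4.24, 10)
print(f"SEM = {sem:.2f} cm³")
```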
Your turn
Complete the following table
Sample size   Mean /cm   Standard deviation /cm   Standard error of the mean /cm
10            175        2.5
100           175        2.5
1000          175        2.5
10000         175        2.5
Interpreting the SEM
As you can see, as the sample size increases, the SEM decreases. This shows you that
the mean of a larger sample is likely to lie closer to the mean of the population
than the mean of a small sample.
• The SEM does NOT say anything about an individual value for a given data
item
• Samples must always be randomly selected from the population and must be free
from bias

Central limit theorem
“The distribution of an average tends to be Normal, even when the distribution from
which the average is computed is decidedly non-Normal.” - Charles Annis
As a concrete example, consider the RND function on your calculator or the
=RAND() function in MS Excel. A calculator's RND function typically returns a random
number with 3 decimal places between 0.000 and 0.999 (Excel's =RAND() returns a number
from 0 up to, but not including, 1 with more decimal places), and each possible number
is equally likely to be returned. The distribution is referred to as ‘uniform’, and a
histogram would consist of bars of approximately equal height for a large enough sample.
I built a spreadsheet to pick 1000 samples of 100 random numbers each, and the
distribution of the means of the samples is shown below…
[Figure: histogram titled ‘Distribution of sample means from a population with uniform distribution’. Horizontal axis: 0.38 to 0.61 in steps of 0.01 (axis label extracted as ‘UCB’). Vertical axis: Frequency (1000 samples), 0 to 180.]
As you can see, the distribution looks to be Normal in shape.
• A typical mean for the means of 1000 samples was 0.500 and the standard
deviation of the means was 0.029.
• A typical mean for a single sample is 0.478 and a typical sample standard deviation is
0.3.
• As you can see, the standard deviation of the sample means is approximately the
same as the standard deviation of a single sample divided by the square root
of the sample size (0.3 ÷ √100 = 0.03, close to 0.029)
• The standard error of the mean calculated from a single sample is a guide to the
standard deviation of the distribution of the means.
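The experiment can be repeated without a spreadsheet. The sketch below is my own Python version, not the original spreadsheet: it uses `random.random()` in place of =RAND(), and the seed value is arbitrary and only there so that a run can be repeated.

```python
import random
import statistics

random.seed(2004)  # arbitrary seed so that the run is repeatable

SAMPLES = 1000     # number of samples drawn
SAMPLE_SIZE = 100  # random numbers in each sample

# random.random() returns a uniformly distributed number x with 0 <= x < 1
samples = [[random.random() for _ in range(SAMPLE_SIZE)] for _ in range(SAMPLES)]
sample_means = [statistics.mean(s) for s in samples]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Standard error of the mean estimated from the first sample alone
first = samples[0]
sem_first = statistics.stdev(first) / SAMPLE_SIZE ** 0.5

print(f"Mean of the sample means: {mean_of_means:.3f}")
print(f"SD of the sample means:   {sd_of_means:.3f}")
print(f"SEM from one sample:      {sem_first:.3f}")
```

Change the seed, or remove that line, and run it again to see how much the three figures move about from run to run.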

Confidence intervals
Because the Standard Error of the Mean is essentially an estimate of the standard
deviation of the sample means, we can use the SEM to estimate the chance that the
population mean falls within a certain range.
Recall that the area under the Normal curve within limits set by the standard deviation
is as follows...
SD range either side of mean   Percentage of area under the Normal curve within that range
± 1 SD                         68%
± 2 SD                         95.5%
± 3 SD                         99.7%
Turning these facts round and applying the concept to the standard error of the mean
results in the following rule for calculating a confidence interval of the mean for a
sample...
“There is a 95.5% probability that the population mean falls within ± 2 standard
errors of the sample mean.”
We can modify this rule slightly to give us a formula for the 95% confidence interval
of the mean as follows:
$$95\%\ \mathrm{CI} = \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}}$$
In words, the 95% confidence interval of the mean is given by the mean plus and
minus 1.96 multiplied by the standard error of the mean.
The factor 1.96 is used instead of 2 so that the probability is exactly 95% (leaving a
1 in 20 chance that the population mean lies outside the interval) rather than 95.5%.

Example calculation
Suppose you have the following information for the weight of a sample of sturgeon
roe caught in an area of the Caspian Sea - weights were recorded in grammes...
N                           30
Mean                        1.56 g
Sample standard deviation   0.2 g
The 95% confidence interval of the mean for this sample is given by...
$$
\begin{aligned}
95\%\ \mathrm{CI} &= \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}} \\
&= 1.56 \pm 1.96 \times \frac{0.2}{\sqrt{30}} \\
&= 1.56 \pm 0.072 \\
&= 1.49 \text{ to } 1.63\ \mathrm{g}
\end{aligned}
$$
And so we can say that there is a 1 in 20 chance that the sample of sturgeon roe is
actually an extremely unrepresentative one and that the population mean lies outside
the range 1.49 to 1.63 g.
Notice that if you took repeated samples in a perfectly random way and without any
kind of bias, you would expect, 5% of the time, to get a sample with a mean lying
outside this range.
Try the MS Excel spreadsheet simulation at this point - repeated pressing of the F9
function key will draw a new sample. Try pressing F9 50 times. Note each time the
sample mean falls outside the range.
The 5% probability splits symmetrically so there is a 2.5% probability that the sample
mean is below 1.49 g and a 2.5% probability that the sample mean is above 1.63 g.
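The arithmetic for the sturgeon roe example can be checked with the sketch below. The function name `confidence_interval_95` is my own, and it simply applies the 1.96 × SEM rule given above.

```python
import math


def confidence_interval_95(mean, s, n):
    """Return (lower, upper) as the mean plus and minus 1.96 standard errors."""
    half_width = 1.96 * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Sturgeon roe sample: N = 30, mean 1.56 g, sample standard deviation 0.2 g
lower, upper = confidence_interval_95(mean=1.56, s=0.2, n=30)
print(f"95% CI = {lower:.2f} to {upper:.2f} g")
```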

Your turn
Here is an extended example designed to take you from raw data to the 95%
confidence intervals for two sets of data. We will use this data in a later section so
please do complete the calculation - ask your tutor if you need help.
A calculator with a standard deviation routine or access to MS Excel will sure help!
The data
The weight in kg of sacks of flour was checked with an accurate weighing machine.
Random samples were obtained for the check that was carried out in two mills.
Samples of 20 sacks were taken in each case.
Mill A  9.00 9.05 9.05 9.10 9.10 9.10 9.15 9.15 9.20 9.20
        9.20 9.25 9.25 9.30 9.40 9.50 9.60 9.90 9.95 10.00
Mill B  9.10 9.30 9.35 9.35 9.40 9.40 9.40 9.45 9.45 9.45
        9.45 9.50 9.50 9.50 9.55 9.55 9.60 9.70 9.70 9.90
The initial data analysis
a) Draw dot plots for each of the samples from the two mills and describe the
appearance of the data - use the same axes for both dot plots and include a key
b) Calculate the 5 number summary for each of the data sets and organise your
results into a table
c) Draw box and whisker plots for each sample and comment on the appearance
of the diagrams
A more analytical approach
d) Calculate the mean and the sample standard deviation for each of the
samples
e) Calculate the 95% confidence interval of the sample mean for each of the
samples
f) Plot the 95% confidence intervals for each sample on your dot plot. Do the
intervals overlap?
g) Discussion: Is Mill A producing sacks that have a different weight from Mill
B? Try to make a statement based on probabilities.

The idea of a significance test
• In statistics, it is normal to lay out the precise criteria that you will use to
decide a question before designing an experiment, choosing samples and so
on. This pre-defined way of working avoids any unconscious bias creeping in
• A standard set of steps where you are clear what you are trying to decide is
called a significance test
• The table below shows the main steps involved in a significance test
• You can find much more detail by looking up the terms involved in a statistics
text book or on the Web
Stage in general process                     Specific example in flour sack example

State the null hypothesis                    “There is no difference in weight between
                                             the flour sacks from Mill A and Mill B”

State an alternate hypothesis                “There is a difference between the weight
                                             of flour sacks produced in Mill A and in
                                             Mill B”
                                             This is a two tailed test as we are not
                                             asking if sacks from Mill A weigh more
                                             than sacks from Mill B or vice versa

Decide the significance level to adopt       5% corresponds to a 1 in 20 chance of a
                                             'significant' result occurring when in fact
                                             there is no difference. Look up 'type I' and
                                             'type II' errors in a textbook.

Calculate the statistics as dictated by the  Calculate the mean and standard deviation for
type of data and number of variables:        the sacks from Mill A and from Mill B.
different situations call for different      Calculate the 95% confidence interval of
statistics                                   the mean for each sample.
                                             See if the confidence intervals overlap.

Accept or reject the null hypothesis         If the confidence intervals overlap then you
according to the statistical criteria        have to accept the null hypothesis of no
adopted                                      difference.
                                             If the confidence intervals do not overlap
                                             then you have to reject the null hypothesis
                                             and accept the alternate hypothesis.
This format is the one we shall adopt in scenario 6.
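The decision step can be sketched in Python as shown below. To avoid giving away the flour sack answers, it uses the SB and MB titration results from the dot plot section instead. The function names `ci95` and `intervals_overlap` are my own, and the sketch applies the 1.96 × SEM rule from these notes even though each sample has only 10 readings.

```python
import math
import statistics


def ci95(data):
    """95% confidence interval of the mean as (lower, upper), using 1.96 x SEM."""
    mean = statistics.mean(data)
    half_width = 1.96 * statistics.stdev(data) / math.sqrt(len(data))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Titration results in cm³ from the dot plot section
sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]
mb = [48.8, 51.2, 50.5, 51.3, 50.7, 51.2, 50.5, 50.5, 49.1, 49.6]

ci_sb = ci95(sb)
ci_mb = ci95(mb)
overlap = intervals_overlap(ci_sb, ci_mb)

print(f"SB: {ci_sb[0]:.2f} to {ci_sb[1]:.2f} cm³")
print(f"MB: {ci_mb[0]:.2f} to {ci_mb[1]:.2f} cm³")
if overlap:
    print("Intervals overlap: accept the null hypothesis of no difference")
else:
    print("Intervals do not overlap: reject the null hypothesis")
```

Notice that the test compares the means only. SB's results are far more spread out than MB's, and the overlap criterion says nothing about that.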

Your turn: dry run
• Below are two sets of data on the girth of Betula pendula as sampled from two
parts of a wood
• You suspect that the sample taken from the North segment may represent a
different population from the sample taken from the South
• Perform a significance test to verify your experimental hypothesis
• Use a 5% significance level
Data
North section girth sample /cm   South section girth sample /cm
386 796
486 818
627 382
553 76
108 470
240 689
397 611
280 719
16 211
51 509
588 73
138 291
194 521
348 854
173 810
424 964
25 364
761 833
217 688
305 988
Note: the experimental hypothesis is what you are trying to find out. You then frame
a more specific (usually negative) null hypothesis as part of the significance test.
