
MATH2099/MATH2859

“Statistics”
Lecturer Dr Gery Geenens
RC-2053
ggeenens@unsw.edu.au
Office hours: please use email to fix an appointment

Lectures: Tue 4pm-6pm


Lecture slides and other course materials will be
made available on the course web page (Blackboard9)
Tutorials start in Week 1 in the labs (Red Centre)
Tutes in the computer lab: Weeks 1, 2, 4, 6, 8 and 12
Tutes in a classroom: Weeks 3, 5, 7, 9, 11 and 13
Help with the course:
• You will be assigned a tutor for tutorials/labs, who
should be your first point of contact
• Keep up! Don’t get behind. Seek help if you need it.
Prepare for tuts/labs and ask your tutor
• Attempt all the online quizzes
• There will be consultation hours available with your
lecturer and some other statistics staff. Details on
the course web page or the door to RC-2053. Times can
be adjusted if necessary: contact the lecturer if all
advertised hours clash with your timetable
• Peer support for statistics through the Student
Support Scheme (SSS) in RC-3064
Textbooks and reference books
1. Required textbook:
Applied Statistics for Engineers and Scientists, J. Devore
and N. Farnum, 2nd Edition, Duxbury Press, Thomson
Publishers.
Bundled with the Student Solution Manual, it costs around
$120.00 from UNSW Bookstore.
2. Additional references:
Probability and Statistics for Engineering and the Sciences,
J.L. Devore. (Any edition is useful. There is now a new 7th
edition.) Duxbury Press, Thomson Publishers.
Statistical Methods for Engineers, G.G. Vining. Duxbury
Press, Thomson Publishers.
Assessments
• Matlab online quiz (4%): start as early in session as
you can. Due Monday 1pm, Week 3
• 3 Stats online quizzes (2% + 2% + 2%): due Monday
1pm, Weeks 5, 9 and 13
The online quizzes are through Maple TA: there is a
link to it from your Blackboard9 web page
• Midsession Test (15%): in tutorial in Week 7
• Matlab test (15%): Week 10. You should book a
time to suit you
• Final exam (60%): in exam period
Week 1: Data and Distributions
• Textbook reference: Ch 1, sections 1.1, 1.2, 1.3
• Recommended exercises from textbook:
- Stemplot (stem-and-leaf display): Q1, page 19
- Histogram: Q17, page 23
- Uniform distribution: Q19, page 31
What is Statistics?
• Describing data
• Producing data
• Drawing conclusions from data
• Statistical Science is the science of collecting,
organising and interpreting numerical facts, which
are referred to as data
• “Statistical Science is the Science of turning data into
information for decision making”
• Statistical science provides methods to enable us to
make intelligent judgements and informed decisions
in the presence of uncertainty and variation
Variation is endemic
• Natural Variation: Variation is the usual
situation and arises in nature, raw materials,
experimental conditions etc.
• Measurement Error: Variation in repeated
measurements on the same physical quantity.
Branches of Statistical Methods
• Descriptive: Summarize and describe
important features of the data
• Inferential: Draw conclusions (make
inferences) about some characteristic of a
population based on measurements on a
sample of individuals selected (how?) from
the population
Example 1: Does Cloud Seeding Work?
(from Devore, 1995, p. 39)
Is cloud seeding really effective in increasing rainfall?
For 26 pairs of days with similar weather, cloud seeding
was tried on one of the two days

Pair    Rainfall: Seeded (mm)    Rainfall: Unseeded (mm)
1       4.1                      1.0
2       7.7                      4.9
3       17.5                     4.9
4       31.4                     11.5
5       32.7                     17.3
…       …                        …
[Figure: Rainfall difference (Seeded - Unseeded); horizontal axis "Difference", 0 to 1500]
Cloud seeding experiment
• Most of the rainfall differences are positive, which
seems to suggest that cloud seeding increases
rainfall. However, can we be sure that the apparent
effect is not due to chance?
• Here we have a decision to make (does cloud
seeding work or not?) based on experimental data
• Statistics gives us methods for designing
experiments like this one, for collecting data and for
answering questions of interest based on the data
Example 2: Hair colour and pain tolerance
• An experiment conducted at the University of
Melbourne suggests that there may be a difference
in pain threshold for blonds and brunettes
• A group of subjects was divided into light blond, dark
blond, light brunette and dark brunette groups and a
pain threshold score was measured for each subject.
(A higher score means a higher pain threshold)
Data for Pain and Hair Colour Experiment
[Figure: pain threshold scores (horizontal axis 0 to 70) plotted separately for the D Brunette, L Brunette, D Blond and L Blond groups]
Hair colour and pain tolerance

• Again we have a decision to make (is pain threshold
related to hair colour?) based on experimental data
• Pain threshold seems to increase with lighter hair
colour, but is this effect real or just due to chance?
(the sample size is quite small)
• Statistics gives us methods for assessing apparent
differences between groups defined by different
characteristics (inherent or experimental)
Example 3: Defective bricks
• A sample of 214 bricks from a batch of bricks
yields 18 defective
• If the proportion of defective bricks in the entire
batch is less than 5%, the batch is considered
acceptable
• Is there evidence of something wrong with
the process?
Defective bricks continued
• Again we have a decision to make (is our
manufacturing process behaving as it should?) based
on experimental data
• Does a proportion of 18/214 = 8.4% defective in the
sample indicate that the long-run fraction of
defective bricks is bigger than 5%?
• Statistics provides methods for making decisions like
this one, as well as for deciding on what data should
be collected (e.g. the sample size and so on)
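As a rough illustration of the kind of calculation involved, the sketch below computes the sample proportion and then asks how likely 18 or more defectives would be if the long-run proportion were exactly 5%. The binomial model (bricks defective independently, each with the same probability) and the helper name `binomial_upper_tail` are assumptions made for this example; the formal methods are covered later in the course.

```python
from math import comb

n_sampled = 214      # bricks inspected (from the example)
n_defective = 18     # defective bricks found (from the example)
threshold = 0.05     # acceptable long-run proportion defective

# Sample proportion of defective bricks.
p_hat = n_defective / n_sampled


def binomial_upper_tail(k, n, p):
    """P(X >= k) when X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: each brick is defective independently with probability 0.05.
tail = binomial_upper_tail(n_defective, n_sampled, threshold)

print(f"sample proportion defective: {p_hat:.3f}")
print(f"P(18 or more defective if the true proportion is 5%): {tail:.4f}")
```

A small tail probability would suggest that a sample like this one is unlikely under a 5% defect rate; how small is "small enough" is itself a decision that statistical methods help to formalise.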
Example 4: Challenger Space Shuttle
• For the 23 space shuttle missions prior to
the Challenger disaster of January 28, 1986, the
following variables were recorded:
– Temperature at launch
– Prelaunch pressure
– Number of O-rings which failed (out of six)
[Figure: Data used night before launch. Does temperature have an effect on O-ring incidents? Note scale limits.]
[Figure: Data including launches with no incidents]
Probability of Field Failure O-rings
• For each field joint, let $p_F$ be the
probability of a field joint failure:
$p_F = p_a p_b p_c p_d$
• Let $p_{FF}$ be the probability of at least one
field joint failure:
$p_{FF} = 1 - (1 - p_F)^6$
Predictions under 2 scenarios:
Conditions for the ill-fated launch:
At 200psi, 31°F: $\hat{p}_F \approx 0.023$,
$\hat{p}_{FF} \approx 1 - (1 - 0.023)^6 = 0.13$
If delayed until a warmer time:
At 200psi, 60°F: $\hat{p}_F \approx 0.0032$,
$\hat{p}_{FF} \approx 1 - (1 - 0.0032)^6 = 0.019$
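The two predictions can be reproduced with a few lines of code. This is a minimal sketch: the function name `prob_at_least_one_failure` is made up for the example, and the formula treats the six field joints as failing independently with the same probability, which is what the expression 1 - (1 - p)^6 implies.

```python
def prob_at_least_one_failure(p_joint, n_joints=6):
    """P(at least one of n_joints fails), assuming independent, identical joints."""
    return 1 - (1 - p_joint) ** n_joints


# Estimated single-joint failure probabilities quoted for the two scenarios.
p_cold = prob_at_least_one_failure(0.023)    # 200 psi, 31 F
p_warm = prob_at_least_one_failure(0.0032)   # 200 psi, 60 F

print(f"launch at 31 F: {p_cold:.2f}")
print(f"launch at 60 F: {p_warm:.3f}")
print(f"ratio of the two risks: {p_cold / p_warm:.1f}")
```

The ratio printed on the last line makes the comparison explicit: the estimated risk at the colder temperature is several times the risk at the warmer one.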
Experimental procedure:
1. Formulate the question(s)
2. Decide what data is required/most appropriate
3. Collect the data
4. Analyse/interpret the data
5. Draw conclusions (answer the questions)

STATISTICAL SCIENCE CONTRIBUTES TO ALL STEPS!

Sample from Populations
• Population: set of all objects of interest in a
problem
• Sample: subset of the population collected for
the purpose of learning about characteristics
of the population
• Statistical Inference requires that the sample
be REPRESENTATIVE
Sampling Populations for Inference
[Diagram: a sample (X = sampled individuals) is drawn from the population; calculate the sample average, then infer that the population average is “close” to the sample average]
Example of Inference
• Population average height for adult males is unknown
– A sample of 1000 randomly chosen males gave a sample
average height of 176.5cm
– We infer that the population average height is near
176.5cm
• Question: Can we quantify the accuracy or precision of
this inference?
• Answer: Statistical Science can!
• BUT the sample should ideally be REPRESENTATIVE of
the population studied
(RANDOM samples can help with this)

Defective bricks example revisited
• Bricks are manufactured in batches
• Contract requires that the proportion of defective bricks (π)
in a batch is no larger than 5%
• Decide if a batch should be sent to a customer
• The Population is the set of all bricks in the batch
• Impractical to inspect every brick
• Collect a subset randomly from batch and calculate
the proportion p of defective in this sample

Descriptive Statistics
• Given sample data, our most basic statistical task is
to summarize it in some way
• Different ways of summarizing data are appropriate
for different types of data
• Types of data:
– Quantitative (Numerical)
– Qualitative (Categorical)
Individuals and Variables
• Individuals are the objects (people, animals,
things) described by the data. Individuals are
sometimes referred to as elements, units or
participants
• A variable is any characteristic of an
individual. A variable can take different values
for different individuals
• Variables can be multidimensional
(univariate / bivariate / multivariate)
Categorical & Numerical Variables
• Categorical (or qualitative) Variable
– individuals are placed into one of several groups
or categories
• Numerical (or quantitative) Variable
– takes numerical values for which numerical
operations apply
Revisit: Hair colour and pain tolerance
What kind of variable is hair colour?
What kind of variable is pain score?
[Figure: Data for Pain and Hair Colour Experiment. Pain threshold scores (horizontal axis 0 to 70) plotted separately for the D Brunette, L Brunette, D Blond and L Blond groups]
Displaying Distributions
• The distribution of a variable tells us what values it
takes and how often it takes these values
• A first step is to examine the distribution of a variable
in a data set using suitable graphs and numerical
summaries
• A second step (looked at in detail later in the course)
is to look at the relationships between variables
Displaying Distributions
Graphs for categorical variables:
Show frequencies of individuals or observations
in each category of the variable, e.g. bar charts,
pie charts
Graphs for quantitative variables:
The pattern of variation in a quantitative
variable is often displayed in a histogram or a
stemplot (or stem-and-leaf display)
Qualitative data: Bar Charts

Qualitative data: Pie Charts

Quantitative data: Stemplot (Stem-and-Leaf Display)
• Key point: Numbers must have a context for
sensible conclusions to be made from data
• A stemplot is a quick and easy way to
graphically display the key features of a
distribution of data
• These are best suited to a small to moderate
number of observations

To make a stemplot:
• Separate each observation into a stem (all but last
digit) and a leaf (final digit)
• E.g. 24 → 2|4, 139 → 13|9, 5 → 0|5
• Write all unique stems in vertical column with the
smallest at the top, and draw a vertical line at the
right of this column
• Write each leaf in the row to the right of its stem, in
increasing order out from the stem

Example 5: Expenditure ($) of 50 shoppers
3.11 8.88 9.26 10.81 12.69 13.78
15.23 15.62 17.00 17.39 18.36 18.43
19.27 19.50 19.54 20.16 20.59 22.22
23.04 24.47 24.58 25.13 26.24 26.26
27.65 28.06 28.08 28.38 32.03 34.98
36.37 38.64 39.16 41.02 42.97 44.08
44.67 45.40 46.69 48.65 50.39 52.75
54.80 59.07 61.22 70.32 82.70 85.76
86.37 93.34

Ex 5: Stemplot
(amounts rounded to the nearest dollar; stem = tens of dollars, leaf = dollars)

0 | 3 9 9
1 | 1 3 4 5 6 7 7 8 8 9
2 | 0 0 0 1 2 3 4 5 5 6 6 8 8 8 8
3 | 2 5 6 9 9
4 | 1 3 4 5 5 7 9
5 | 0 3 5 9
6 | 1
7 | 0
8 | 3 6 6
9 | 3
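The steps for making a stemplot translate directly into code. The sketch below is one possible implementation, not a standard library routine: the function name `stemplot` is made up for the example, and the amounts are rounded to the nearest dollar (halves rounded up) before being split into a tens stem and a units leaf.

```python
from collections import defaultdict

# Expenditure ($) of the 50 shoppers from Example 5.
amounts = [
    3.11, 8.88, 9.26, 10.81, 12.69, 13.78, 15.23, 15.62, 17.00, 17.39,
    18.36, 18.43, 19.27, 19.50, 19.54, 20.16, 20.59, 22.22, 23.04, 24.47,
    24.58, 25.13, 26.24, 26.26, 27.65, 28.06, 28.08, 28.38, 32.03, 34.98,
    36.37, 38.64, 39.16, 41.02, 42.97, 44.08, 44.67, 45.40, 46.69, 48.65,
    50.39, 52.75, 54.80, 59.07, 61.22, 70.32, 82.70, 85.76, 86.37, 93.34,
]


def stemplot(values):
    """Map each stem (all but the last digit) to its sorted leaves (last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Keep stems with no leaves so gaps in the distribution stay visible.
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}


# Round to whole dollars (assumption: halves round up, values are positive).
dollars = [int(a + 0.5) for a in amounts]
plot = stemplot(dollars)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Truncating instead of rounding (`int(a)`) is an equally common choice and gives a slightly different display; either way, every observation should appear exactly once, so the leaves should total 50.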

Variations on Stemplots
• Rounding or truncating the numbers to a few
digits before making a stemplot to avoid too
much detail in the stems
• Splitting each stem to give greater detail in the
distribution
• Back-to-back stemplots with common stems to
compare two related distributions
Ex 5: Splitting each stem
0 | 3
0 | 9 9
1 | 1 3 4
1 | 5 6 7 7 8 8 9
2 | 0 0 0 1 2 3 4
2 | 5 5 6 6 8 8 8 8
3 | 2
3 | 5 6 9 9
4 | 1 3 4
4 | 5 5 7 9
5 | 0 3
5 | 5 9
6 | 1
6 |
7 | 0
7 |
8 | 3
8 | 6 6
9 | 3
[Figure: Stemplot of the amounts spent by 50 consecutive supermarket shoppers: (a) without splitting stems, (b) splitting stems]
Example of Stem and Leaf
• Data set 1
9, 10, 15, 22, 9, 15, 16, 24, 11, 46
• Data set 2
25, 27, 28, 36, 38, 39, 42, 50
Draw a back-to-back stemplot for this example.
What is the stem and what is the leaf?
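One way to check a hand-drawn answer is to build the display in code. This is a sketch only: it takes the tens digit as the stem and the units digit as the leaf, and the helper name `leaves_by_stem` is made up for the example.

```python
def leaves_by_stem(values):
    """Group sorted units digits (leaves) under their tens digit (stem)."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault(stem, []).append(leaf)
    return rows


set1 = [9, 10, 15, 22, 9, 15, 16, 24, 11, 46]
set2 = [25, 27, 28, 36, 38, 39, 42, 50]

left = leaves_by_stem(set1)
right = leaves_by_stem(set2)
stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

lines = []
for s in stems:
    # Left-hand leaves are reversed so that both sides increase away from the stem.
    left_text = " ".join(str(x) for x in reversed(left.get(s, [])))
    right_text = " ".join(str(x) for x in right.get(s, []))
    lines.append(f"{left_text:>10} | {s} | {right_text}")

print("\n".join(lines))
```

Sharing one column of stems is what makes the comparison easy: data set 1 sits mostly on the low stems and data set 2 on the higher ones.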
Types of Quantitative Variables
• A variable is discrete if its set of possible
values constitutes a finite set or infinite
sequence (countable)
• A variable is continuous if its set of possible
values consists of an entire interval on a
number line (uncountable)
Frequency or relative frequency histograms for discrete data
• Determine the frequency and relative frequency for
each value
• Mark possible values on a horizontal axis
• Above each value, draw a rectangle (or line segment)
whose height is the relative frequency of that value
Example 6: Credit cards
Students from a statistics class were asked how
many credit cards they carry. X is the variable
representing the number of cards

x    # people    Relative frequency
0    12
1    42
2    57
3    24
4    9
5    4
6    2
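The relative frequency column can be filled in by dividing each count by the total number of students. A minimal sketch, with variable names chosen for the example:

```python
# Number of students carrying x credit cards (from the table).
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(counts.values())  # total number of students surveyed
rel_freq = {x: c / n for x, c in counts.items()}

print(f"n = {n}")
for x, c in counts.items():
    print(f"{x}    {c:>3}    {rel_freq[x]:.3f}")
```

These relative frequencies are the rectangle heights in the histogram for discrete data, and they must add up to 1.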
Credit Card Histogram

Histograms: Continuous Data
• Subdivide the measurement axis into a suitable
number of classes (or class intervals). Try to choose
sensible end points. Choose enough intervals to
avoid too much detail while retaining information
about important features of the distribution
• Determine the frequency and relative frequency for
each class. Divide each relative frequency by the
corresponding class width; the result is called the density
• Then mark the class boundaries on a horizontal
measurement axis
• Above each class interval, draw a rectangle whose
height is the density
Histogram for continuous data: property
• Multiplying both sides of the formula of the
density by the class width gives
relative frequency = (class width) × (density)
= (rect. width) × (rect. height)
= rectangle area
• The area of each rectangle is the relative
frequency of the corresponding class
• The total area of all rectangles must be 1
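The area property is easiest to see with classes of unequal width. The sketch below uses the 50 shopper amounts from Example 5; the class boundaries are hypothetical, chosen only to illustrate the calculation.

```python
# Expenditure ($) of the 50 shoppers from Example 5.
amounts = [
    3.11, 8.88, 9.26, 10.81, 12.69, 13.78, 15.23, 15.62, 17.00, 17.39,
    18.36, 18.43, 19.27, 19.50, 19.54, 20.16, 20.59, 22.22, 23.04, 24.47,
    24.58, 25.13, 26.24, 26.26, 27.65, 28.06, 28.08, 28.38, 32.03, 34.98,
    36.37, 38.64, 39.16, 41.02, 42.97, 44.08, 44.67, 45.40, 46.69, 48.65,
    50.39, 52.75, 54.80, 59.07, 61.22, 70.32, 82.70, 85.76, 86.37, 93.34,
]

# Hypothetical class boundaries: narrow classes where data are dense, wide in the tail.
boundaries = [0, 10, 20, 30, 40, 60, 100]
classes = list(zip(boundaries[:-1], boundaries[1:]))

# Frequency, relative frequency and density for each class [lo, hi).
counts = [sum(lo <= a < hi for a in amounts) for lo, hi in classes]
rel_freqs = [c / len(amounts) for c in counts]
densities = [rf / (hi - lo) for rf, (lo, hi) in zip(rel_freqs, classes)]

# Each rectangle's area is width x height = relative frequency.
total_area = sum(d * (hi - lo) for d, (lo, hi) in zip(densities, classes))

for (lo, hi), c, d in zip(classes, counts, densities):
    print(f"[{lo:>3}, {hi:>3})  count = {c:>2}  density = {d:.4f}")
print(f"total area = {total_area:.6f}")
```

Plotting the raw counts instead of densities would make the two wide classes look far more important than they are; dividing by the class width corrects for that.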
Histogram shapes
Typical words/phrases used to describe histograms and
other graphical displays of data (e.g. stem-and-leaf displays):
• symmetric, or skewed to the right/left;
• unimodal, or bimodal/multimodal;
• bell-shaped (if symmetric & unimodal);
• there are possible outliers around…, or there are no
obvious outliers;
• typical value of the data is …;
• the range of the data is …;
• compared to the typical value the spread of the data is
fairly big/small
Histograms and Stemplots
• Histograms replace:
– the stems, in a stemplot, by class intervals
– the leaves (showing the values, possibly rounded) by
counts, percentages or densities
• Stemplots are useful for displaying distributions of
smaller data sets.
• Histograms are useful for moderate to larger data
sets.
Example: A histogram with a density curve
• Survival times of 72 guinea pigs injected with
tubercle bacilli (Moore & McCabe)
• The smooth density curve is estimated using software.
There is no easy maths formula for this curve!
• Note the extra bumps in the right-hand tail.
Ignoring these, there is some positive skew
General density curves
A density curve is a smooth curve through a relative
frequency histogram used to summarise its key
features succinctly
Usually the smooth curve is described by a
mathematical formula f(x)

A density curve must satisfy two key properties:
• non-negative
• total area under it is 1 (integrates to unity)
Properties of density functions
• Non-negative function of a real valued variable
Reason: Proportion of outcomes in any interval
must be non-negative:
$f(x) \geq 0$
• Integral over the real numbers is unity
Reason: Proportion of all outcomes must be 1:
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
Example 6: Time Between Industrial Accidents
• 177 times (in days) between accidents at a DuPont
facility over a 10 year period [Vining, p.51]
• Density is relative frequency divided by bin width,
in bins of width 5 days
[Figure: density histogram of Time_bw_acdnts; horizontal axis 0 to 200, vertical axis Density 0.00 to 0.05]
Ex 6: Exponential Density for Time Between Industrial Accidents
Fitted p.d.f. is
$f(y) = \frac{1}{\lambda} \exp\left(-\frac{y}{\lambda}\right)$ with $\lambda = 20.412$
[Figure: fitted exponential density for Time_bw_acdnts; horizontal axis 0 to 200, vertical axis Density 0.00 to 0.05]
Using the Density Function
• Proportion of values between a and b is the area
$\int_a^b f(x)\,dx$
• E.g. calculate the chance that the time to the
next industrial accident after the one just
observed exceeds 80 days.
• Answer:
$\int_{80}^{\infty} \frac{1}{20.412} e^{-y/20.412}\,dy = 0.0199$
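The answer can be checked in two ways: with the closed form of the integral, which for this density is exp(-80/λ), and with a simple numerical integration. A minimal sketch, in which the function names and the choice to truncate the upper limit at 80 + 40λ are assumptions made for the example:

```python
from math import exp

lam = 20.412  # fitted value of lambda, in days (from the example)


def density(y):
    """Fitted exponential p.d.f. f(y) = (1/lambda) * exp(-y/lambda)."""
    return exp(-y / lam) / lam


def proportion_exceeding(t):
    """Closed form of the integral of f from t to infinity."""
    return exp(-t / lam)


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


p_exact = proportion_exceeding(80)
# The area beyond 80 + 40*lambda is negligible, so the integral is cut off there.
p_numeric = midpoint_integral(density, 80, 80 + 40 * lam)

print(f"closed form:       {p_exact:.4f}")
print(f"numerical integral: {p_numeric:.4f}")
```

So under the fitted model, only about 2% of gaps between accidents are expected to exceed 80 days.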