Eric Wolsztynski
eric.w@ucc.ie
Department of Statistics
School of Mathematical Sciences
University College Cork, Ireland
2015-2016
Version 1.0
ST1051-ST3905-ST5005-ST6030
Acknowledgment
These lecture notes make use of material written by Dr
Kingshuk Roy Choudhury and Dr Supratik Roy for previous course
syllabi. That material drew largely on [Dekking et al 2005].
Course information
References
[1] J. A. Rice, Mathematical Statistics and Data Analysis, 2nd Edition, ITP Duxbury Press, 1995
[2] J. L. Devore, Probability and Statistics for Engineering and the Sciences, 3rd Edition, Brooks-Cole, 1991
[3] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä and L. E. Meester, A Modern Introduction to Probability and Statistics, Springer, 2005
[4] B. W. Lindgren, Statistical Theory, 4th Edition, Chapman & Hall, 1993
[5] D. A. Berry and B. W. Lindgren, Statistics: Theory and Methods, 2nd Edition, 1995
[7] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference, 4th Edition, Dekker, 2014
[8] B. S. Everitt and T. Hothorn, A Handbook of Statistical Analyses Using R, 2nd Edition, Chapman & Hall, 2010
[10] R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Timetable
Practicals: ST1051
Monday 4-5pm in lab WGB G34 (TBC)
Tuesday 3-4pm in lab WGB G33 (TBC)
Assessment
ST1051/ST3905:
ST5005/ST6030:
Module objective
Outline
1 Motivation
2 Elements of Probability Theory
3 Discrete Random Variables
4 Continuous Random Variables
5 Limit theorems
6 Statistical Inference
7 Estimation
8 Hypothesis Testing
Section I
Motivation
General concepts
Probability? Statistics?
Typical examples
Business, financial mathematics and actuarial science:
decision making, investment strategies
Engineering:
tracking mobile terminals in wireless networks
genomics
Examples
[Dekking et al 2005]
Each rocket has three O-rings, and two rocket boosters are
used per launch
Modelling...
The probability p(t) that an individual O-ring fails should depend
on the launch temperature t. Use the data to calibrate this model
(a Binomial distribution) and estimate the expected number of
failures, 6p(t).
Aftermaths...
Section II
Elements of Probability Theory
Outline
Introduction
Computing probabilities
Introduction
Probability
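$\Omega = \{\text{Jan}, \text{Feb}, \text{Mar}, \text{Apr}, \text{May}, \text{Jun}, \text{Jul}, \text{Aug}, \text{Sep}, \text{Oct}, \text{Nov}, \text{Dec}\}$
$\Omega = \Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2) : \omega_1 \in \Omega_1,\ \omega_2 \in \Omega_2\}$
If $|\Omega_1| = r$ and $|\Omega_2| = s$, then $|\Omega_1 \times \Omega_2| = rs$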
Events
Recall: subsets of the sample space are called events
Events
Events and set operations
Computing probabilities
Probability
(i) $P(\Omega) = 1$
Recall:
(i) $P(\Omega) = 1$
How should we assign probabilities in the experiment where
we ask for the birthday month?
$A = (A \cap B) \cup (A \cap B^c)$
Hence
$P(A) = P(A \cap B) + P(A \cap B^c)$
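Splitting $A \cup B$ into its part inside $B$ and its part outside $B$, we obtain $(A \cup B) \cap B = B$ and $(A \cup B) \cap B^c = A \cap B^c$, two disjoint events
Thus
$P(A \cup B) = P(B) + P(A \cap B^c)$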
without replacement:
$A_n^k = \frac{n!}{(n-k)!} = n(n-1) \cdots (n-k+1)$
different ordered samples
$n! = n(n-1)(n-2) \cdots 1$
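The count of ordered samples without replacement can be checked numerically. The following is a minimal sketch in R; the values n = 5 and k = 3 are arbitrary and chosen only for illustration.

```r
n <- 5; k <- 3                                    # illustrative values, not from the notes
by_factorials <- factorial(n) / factorial(n - k)  # n! / (n - k)!
by_product    <- prod(n:(n - k + 1))              # n (n - 1) ... (n - k + 1)
c(by_factorials, by_product)
```

Both expressions count the same ordered samples, so the two entries should agree.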
(try with a = b = 1)
                    Gender
Breath test     Male   Female    Total
Pass             420      240      660
Fail             280       60      340
Total            700      300    1,000
Conditional probability and independence
Conditional probability
T+: high blood concentration (positive test)
              Toxicity
          D+     D-    Total
T+        25     14       39
T-        18     78       96
Total     43     92      135
Converting the frequencies to proportions (out of 135):
Frequencies (Toxicity):
          D+     D-    Total
T+        25     14       39
T-        18     78       96
Total     43     92      135
Proportions (Toxicity):
          D+     D-    Total
T+      .185   .104     .289
T-      .133   .578     .711
Total   .318   .682    1.000
If one knows that the test for high blood concentration was
positive, what is the probability of disease (toxicity)?
$P(D+ \mid T+) = \frac{P(D+ \cap T+)}{P(T+)} = \frac{25}{39} = \frac{.185}{.289} \approx .64 = 64\%$
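The same conditional probability can be computed directly from the counts. This is a minimal sketch in R; the object names (`tab`, `p_cond`) and the labels "T-" and "D-" for the negative test and the absence of toxicity are our own choices.

```r
# Counts from the toxicity table (rows: test result, columns: disease status)
tab <- matrix(c(25, 14,
                18, 78),
              nrow = 2, byrow = TRUE,
              dimnames = list(test = c("T+", "T-"), disease = c("D+", "D-")))
n <- sum(tab)                     # 135 patients in total
p_joint <- tab["T+", "D+"] / n    # P(D+ and T+)
p_test  <- sum(tab["T+", ]) / n   # P(T+)
p_cond  <- p_joint / p_test       # P(D+ | T+) = 25/39
round(p_cond, 3)
```

Working from the raw counts avoids the small rounding differences that appear when the proportions are rounded to three decimals first.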
Recall:
Total probability
$\bigcup_{i=1}^{m} B_i = \Omega$
Bayes' Theorem
Suppose a cow tests positive; what is the probability it really
has BSE?
Bayes' rule:
Suppose the events $B_1, B_2, \ldots, B_m$ are disjoint and
$\bigcup_{i=1}^{m} B_i = \Omega$. Then
$P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{m} P(A \mid B_j) P(B_j)}$
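Bayes' rule translates directly into a few lines of R. The sketch below is illustrative only: the function name `bayes_posterior` is our own, and the prevalence and test accuracies are hypothetical numbers chosen to show the mechanics, since these notes do not state them.

```r
# Bayes' rule for a partition B_1, ..., B_m
#   prior[i]      = P(B_i)
#   likelihood[i] = P(A | B_i)
bayes_posterior <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood),
            abs(sum(prior) - 1) < 1e-12)
  joint <- likelihood * prior   # P(A | B_i) P(B_i)
  joint / sum(joint)            # divide by P(A), the law of total probability
}

# Hypothetical numbers: 2% of cows infected; the test is positive
# with probability 0.70 if infected and 0.10 if not infected
post <- bayes_posterior(prior = c(0.02, 0.98), likelihood = c(0.70, 0.10))
post[1]   # P(infected | positive test)
```

With a rare condition, the posterior probability can stay small even after a positive test, because most positives come from the large uninfected group.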
Independence
Consider the three probabilities
Independence
Independence
Definition:
Finally, by definition of conditional probability, if A is
independent of B, then
$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A) P(B)}{P(A)} = P(B)$
that is, B is independent of A
Random variables
Random variables and distributions
M=a 1 2 3 4 5 6
p(a) 1/36 3/36 5/36 7/36 9/36 11/36
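A probability mass function like this one can be obtained by enumeration. The sketch below assumes that M is the maximum of two fair dice, which is what the probabilities 1/36, 3/36, ..., 11/36 suggest; the notes as extracted do not state the definition of M.

```r
# All 36 equally likely outcomes of two dice, each reduced to its maximum
m <- outer(1:6, 1:6, pmax)
counts <- table(m)        # number of outcomes giving each value of the maximum
pmf <- counts / 36
pmf
```

The counts should come out as 1, 3, 5, 7, 9, 11, matching the numerators in the table.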
$p(a_i) > 0$
$\sum_i p(a_i) = 1$
Conditions on f:
$f(x) \ge 0 \quad \forall x$
$\int f(x) \, dx = 1$
$\mathrm{Var}(X) = E[(X - E(X))^2]$
$= E(X^2) - E(X)^2$
$E(aX) = aE(X)$, with a constant
$E(XY) = E(X)E(Y)$ if X and Y are independent
$E(a + bX) = a + bE(X)$ (linearity)
$E(X + Y) = E(X) + E(Y)$ (linearity)
$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$
Variance:
Section III
Discrete Random Variables
Outline
The Binomial distribution
Binomial experiments
Consider performing an experiment with outcomes 1 (success) and 0
(failure) five times
A = {(0, 0, 0, 0, 1), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (1, 0, 0, 0, 0)}
Then $P(A) = 5(1-p)^4 p$, since there are five outcomes in the
event A, each having probability $(1-p)^4 p$
$p_X(1) = P(X = 1) = p$
and
$p_X(0) = P(X = 0) = 1 - p$
You will pass the exam if you answer six or more questions
correctly
Bernoulli / Binomial
Exercise:
Calculate the probability that you answered the first question
correctly and the second one incorrectly
We have
$P(X = 0) = P(R_1 = 0, R_2 = 0, \ldots, R_{10} = 0)$
$= P(R_1 = 0) P(R_2 = 0) \cdots P(R_{10} = 0)$
$= (3/4)^{10}$
The probability that we have answered exactly one question
correctly equals
$P(X = 1) = 10 \cdot (1/4) \cdot (3/4)^9$
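These probabilities can be reproduced with R's binomial functions. The sketch below takes n = 10 questions and success probability 1/4 per question, as in the calculation above, and also evaluates the probability of passing (six or more correct answers).

```r
n <- 10; p <- 1/4
p0 <- dbinom(0, n, p)            # P(X = 0) = (3/4)^10
p1 <- dbinom(1, n, p)            # P(X = 1) = 10 * (1/4) * (3/4)^9
p_pass <- 1 - pbinom(5, n, p)    # P(X >= 6) = 1 - P(X <= 5)
c(p0 = p0, p1 = p1, p_pass = p_pass)
```

`dbinom` gives the probability mass function and `pbinom` the cumulative distribution function, so the pass probability is obtained as a complement.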
for $k = 0, 1, \ldots, n$
$E(X) = np$
Its variance is
$\mathrm{Var}(X) = np(1 - p)$
The Geometric distribution
for $k = 1, 2, \ldots$
Its variance is
$\mathrm{Var}(X) = \frac{1 - p}{p^2}$
Geometric distribution
Exercise:
Let X have a Geo(p) distribution. For $n \ge 0$, show that
$P(X > n) = (1 - p)^n$.
Memoryless property
Memoryless property: for $n, k = 0, 1, 2, \ldots$ one has
$P(X > n + k \mid X > k) = P(X > n)$
We have:
$P(X > n + k \mid X > k) = \frac{P(\{X > k + n\} \cap \{X > k\})}{P(X > k)}$
$= \frac{P(X > k + n)}{P(X > k)}$
$= \frac{(1 - p)^{n+k}}{(1 - p)^k}$
$= (1 - p)^n$
$= P(X > n)$
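The memoryless property can be checked numerically. The sketch below is an illustration with arbitrary values p = 0.3, n = 4 and k = 3. Note that R's geometric functions count the failures before the first success, whereas X here counts trials, hence the shift by one in the helper `surv` (a name of our own).

```r
p <- 0.3   # arbitrary success probability, for illustration only
# With Y = number of failures before the first success, X = Y + 1,
# so P(X > n) = P(Y > n - 1)
surv <- function(n) pgeom(n - 1, prob = p, lower.tail = FALSE)
n <- 4; k <- 3
lhs <- surv(n + k) / surv(k)   # P(X > n + k | X > k)
rhs <- surv(n)                 # P(X > n)
c(lhs = lhs, rhs = rhs, formula = (1 - p)^n)
```

All three entries should coincide, up to floating-point error.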
The Poisson distribution
Section IV
Continuous Random Variables
Outline
Moments
Continuous random variables
a, b constant,
Ex: The function
$f(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \frac{1}{2\sqrt{x}} & \text{if } 0 < x < 1 \\ 0 & \text{if } x \ge 1 \end{cases}$
is a probability density function
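One way to check that this function is a density is to integrate it numerically. The following R sketch uses `integrate` on (0, 1), where the function is non-zero; the singularity at 0 is integrable, and the exact value is $[\sqrt{x}]_0^1 = 1$.

```r
# The density of the example: 1 / (2 sqrt(x)) on (0, 1), zero elsewhere
f <- function(x) ifelse(x > 0 & x < 1, 1 / (2 * sqrt(x)), 0)
total <- integrate(f, lower = 0, upper = 1)
total$value   # numerical approximation of the total area under f
```

The function is also non-negative everywhere, so both conditions for a probability density are met.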
Since the object hits the disc, we have F (b) = 1 when b > r
The inner disc defined by the hitting point has radius b and
area $\pi b^2$
We should put $F(b) = P(X \le b) = \frac{\pi b^2}{\pi r^2} = \frac{b^2}{r^2}$
for $0 \le b \le r$
Exercise:
Compute for the darts example the probability that
$0 < X \le r/2$, and the probability that $r/2 < X \le r$.
and
$E[g(X)] = \int_{-\infty}^{+\infty} g(x) f(x) \, dx$
The Uniform distribution
Exercise:
Argue that the distribution function F of a rv that has a
$U(\alpha, \beta)$ distribution is given by $F(x) = 0$ if $x < \alpha$, $F(x) = 1$
if $x > \beta$, and $F(x) = (x - \alpha)/(\beta - \alpha)$ for $\alpha \le x \le \beta$.
The Exponential distribution
The Normal distribution
Illustration
Example: relative frequency histogram of lifetimes of a
computer component
Notation: $N(\mu, \sigma^2)$
Right-tail probabilities $P(Z > z)$; rows give z to one decimal, columns the second decimal:
z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
0.1  .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
0.2  .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
0.3  .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
0.4  .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
0.5  .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
0.6  .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
0.7  .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
0.8  .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
0.9  .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
1.0  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
Since $P(Z > p_{0.80}) = 0.20$, we instead look for 0.2000 within
the table
We read that
$P(Z > 0.84) = 0.2005$
$P(Z > 0.85) = 0.1977$
Therefore $0.84 < p_{0.80} < 0.85$
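The table lookup can be cross-checked in R, where `qnorm` returns quantiles of the Normal distribution and `pnorm` its cumulative probabilities. This is a short sketch; the variable names are our own.

```r
p80 <- qnorm(0.80)                                # 80th percentile of N(0, 1)
tails <- pnorm(c(0.84, 0.85), lower.tail = FALSE) # the two table entries bracketing 0.20
p80
round(tails, 4)
```

The exact percentile falls between the two tabulated values of z, as found from the table.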
Standardization
$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
This principle is also (implicitly) fundamental in many
statistical inference methods
Standardization: example
A life assurance company has established that the lifetimes of
a certain subgroup of policy-holders are normally distributed
with a mean of 72 years and a standard deviation of 4 years,
i.e. the continuous lifetime $H \sim N(72, 4^2)$
Standardize:
$P(H > 78) = P\left(\frac{H - 72}{4} > \frac{78 - 72}{4}\right) = P(Z > 1.50)$
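Standardization can be verified numerically: the probability computed on the original scale should equal the one computed for the standardized value. A minimal R sketch, using the mean of 72 years and standard deviation of 4 years from the example:

```r
mu <- 72; sigma <- 4
z <- (78 - mu) / sigma                                            # standardized threshold
p_direct <- pnorm(78, mean = mu, sd = sigma, lower.tail = FALSE)  # P(H > 78)
p_std    <- pnorm(z, lower.tail = FALSE)                          # P(Z > 1.50)
c(z = z, p_direct = p_direct, p_std = p_std)
```

Note that `pnorm` is parameterized by the standard deviation, not the variance.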
Moments
$M_X(t) = E[e^{tX}]$
Useful properties:
1 Limit:
$\lim_{t \to 0} \frac{d^k M_X(t)}{dt^k} = E[X^k]$
2 If X, Y are independent,
$M_{X+Y}(t) = E[e^{t(X+Y)}]$
$= E[e^{tX} e^{tY}]$
$= E[e^{tX}] E[e^{tY}]$
$= M_X(t) M_Y(t)$
Examples of MGFs
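$X \sim \mathrm{Exp}(\lambda)$:
$M_X(t) = \int_0^{\infty} e^{tx} \frac{1}{\lambda} e^{-x/\lambda} \, dx$
$= \frac{1}{\lambda} \int_0^{\infty} e^{x[t - (1/\lambda)]} \, dx$
$= \frac{1}{\lambda} \left[ \frac{e^{x[t - (1/\lambda)]}}{t - (1/\lambda)} \right]_0^{\infty}$
$= \frac{1}{\lambda} \left( \lim_{x \to \infty} \frac{e^{x[t - (1/\lambda)]}}{t - (1/\lambda)} - \frac{1}{t - (1/\lambda)} \right)$
$= \frac{1}{1 - \lambda t}$
as long as $t < 1/\lambda$, since the upper limit will then vanish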
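$X \sim N(\mu, \sigma^2)$:
$M_X(t) = \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \, dx$
$= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{tx} e^{-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2}} \, dx$
$= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{x^2 - 2x(\mu + t\sigma^2) + \mu^2}{2\sigma^2}} \, dx$
$= e^{\frac{-\mu^2 + (\mu + t\sigma^2)^2}{2\sigma^2}} \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{(x - \mu_0)^2}{2\sigma^2}} \, dx$
where $\mu_0 = \mu + t\sigma^2$
The remaining integrand is the $N(\mu_0, \sigma^2)$ density, which integrates to 1, so $M_X(t) = e^{\mu t + \sigma^2 t^2 / 2}$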
Characteristic function
The MGF of a random variable does not always exist
$\varphi_X(t) = E[e^{itX}]$
$= \int e^{itx} f(x) \, dx$
$M_X(it) = \varphi_X(t)$
Additive property:
Cumulants
The log of the characteristic function is used to generate
cumulants $\kappa_n$:
$\log \varphi_X(t) = \sum_{n=1}^{\infty} \kappa_n \frac{(it)^n}{n!}$
$\kappa_1 = E[X]$
$\kappa_2 = E[X^2] - E[X]^2$
$\kappa_3 = 2E[X]^3 - 3E[X]E[X^2] + E[X^3]$
...
Section V
Limit theorems
Outline
Motivation
Limit theorems
Motivation
Limit theorems
Chebyshev's inequality
$P(|\bar{X}_n - \mu| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2}$
$P(|\bar{X}_n - \mu| > \epsilon) \to 0$
as $n \to \infty$
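This convergence can be illustrated by simulation. The sketch below is a rough illustration under assumptions of our own: samples are drawn from an Exponential distribution with mean 2, the tolerance is 0.1, and `tail_prob` is a hypothetical helper that estimates the probability that the sample mean is further than the tolerance from the true mean.

```r
set.seed(1)            # arbitrary seed, for reproducibility
mu <- 2; eps <- 0.1    # illustrative values, not from the notes

# Proportion of simulated sample means further than eps from mu
tail_prob <- function(n, reps = 1000) {
  xbar <- replicate(reps, mean(rexp(n, rate = 1 / mu)))
  mean(abs(xbar - mu) > eps)
}

probs <- sapply(c(10, 100, 1000, 5000), tail_prob)
probs
```

The estimated probabilities should shrink towards 0 as the sample size grows, in line with the limit stated above.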
where $\Phi(x)$ denotes the cdf of the Standard Normal distribution
Examples:
Section VI
Statistical Inference
Outline
Sampling
Exploratory Analysis and Descriptive statistics
Probability? Statistics?
Statistics!
Moneyball (2011)
Sampling
Population parameters
Statistical inference consists in estimating population features
The estimate of $\sigma_l^2$ is
$s_l^2 = \frac{1}{n_l - 1} \sum_{i=1}^{n_l} (X_{il} - \bar{X}_l)^2$
Cluster Sampling
Systematic Sampling
Exploratory data analysis
Ordered durations
96 100 102 104 105 105 105 105 105 105 107 107 108 108 108 108 109
109 109 110 110 110 110 110 110 110 111 111 112 112 112 112 112 112
112 112 113 113 113 113 115 115 116 116 117 118 118 118 119 119 119
120 120 120 120 121 121 121 122 122 124 125 125 126 126 126 128 129
130 130 131 132 132 132 133 134 134 135 135 136 137 138 139 140 141
142 143 144 144 145 145 149 157 158 168 173 174 184 199 200 200 202
205 207 210 210 214 214 216 216 216 216 221 223 224 225 226 226 229
230 230 230 230 230 231 231 233 235 235 235 237 237 238 238 240 240
240 240 240 240 242 242 243 244 244 245 245 245 245 [.....] 274 274
275 275 275 275 276 276 276 276 277 278 278 278 279 280 280 282 282
282 282 282 282 283 284 285 286 287 288 288 288 288 288 288 289 289
290 290 291 293 294 294 296 296 296 300 302 304 306
Middle elements (136th and 137th) = 240, much closer to max
(306) than to min (96) - implies asymmetry
Numerical summaries
X[1] , . . . , X[n]
$q_n(0.25) = Q_1(X) = X_{[\frac{n+1}{4}]}$
$q_n(0.75) = Q_3(X) = X_{[\frac{3(n+1)}{4}]}$
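These definitions can be compared with R's `quantile` function. The sketch below uses a made-up sample of size n = 7, so that (n + 1)/4 is a whole number. It relies on `type = 6`, which places the p-quantile at position p(n + 1) among the ordered values; R's default (`type = 7`) uses a different rule and may return different values.

```r
x <- c(12, 5, 9, 20, 7, 15, 3)   # made-up sample with n = 7, so (n + 1)/4 = 2
xs <- sort(x)                    # order statistics X[1], ..., X[n]
n <- length(x)
q1 <- xs[(n + 1) / 4]            # X[2]
q3 <- xs[3 * (n + 1) / 4]        # X[6]
q_r <- quantile(x, c(0.25, 0.75), type = 6)
c(q1 = q1, q3 = q3)
q_r
```

When p(n + 1) is not a whole number, `type = 6` interpolates linearly between the two neighbouring order statistics.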
Q1 (X )
median(X )
Q3 (X )
Ex: sepal width on the Iris data (50 flowers from each of 3 species
of iris)
[Figure: "Iris data (2nd component)", vertical axis from 2.0 to 4.0]
Drawing a histogram
Whenever feasible let the software do it!
Boxplot
Another way of summarising the underlying data distribution
[Figure: boxplot, vertical axis from 50 to 90]
Scatterplot
Example: daily readings of air quality values in NYC,
1st May - 30th Sept 1973 (R dataset airquality)
[Figure: two panels, each titled "Air quality, NYC, May-Sep 1973"; one axis labelled "Temperature (degrees F)"]
Section VII
Estimation
Outline
Statistical Inference
Estimation
Confidence intervals
Linear regression
Statistical Inference
Statistical inference
Detection:
Discrete probabilities (most of the time)
Estimation:
Discrete or continuous probabilities
Why estimation?
Estimation
Estimators
Let t = h(x1 , x2 , . . . , xn ) be an estimate based on the dataset
x1 , x2 , . . . , xn
The two most common criteria used are (a) Unbiasedness (b)
Minimum Variance
Let
$L(\theta) = f(x_1, x_2, \ldots, x_n; \theta)$,
be the joint pdf of $X_1, X_2, \ldots, X_n$
MLE:
$f(x_1, x_2, \ldots, x_n; \hat{\theta}) = \max_{\theta} f(x_1, x_2, \ldots, x_n; \theta)$
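The pdf is
$f(x, \theta) = \frac{1}{\theta} e^{-x/\theta}$, $x, \theta > 0$
Joint pdf is
$L(\theta) = \prod_i f(x_i, \theta)$
$= \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta}$
$= \theta^{-n} e^{-\sum_{i=1}^{n} x_i / \theta}$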
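The likelihood above can be maximised numerically and compared with the closed-form maximiser. Taking logs gives $\log L(\theta) = -n \log\theta - \sum_i x_i/\theta$, and setting its derivative to zero gives $\hat{\theta} = \bar{x}$. The sketch below is illustrative: the data are simulated with a true value of 3 chosen by us, and the search interval is an assumption.

```r
set.seed(42)                    # arbitrary seed, for reproducibility
x <- rexp(200, rate = 1 / 3)    # simulated data; true theta = 3 is our own choice

# Log-likelihood of the Exponential model with mean theta
loglik <- function(theta) -length(x) * log(theta) - sum(x) / theta

fit <- optimize(loglik, interval = c(0.01, 50), maximum = TRUE)
c(numerical = fit$maximum, closed_form = mean(x))
```

Working with the log-likelihood rather than the likelihood avoids numerical underflow when n is large.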
Confidence intervals
Confidence Intervals
A confidence interval is an interval in which we are very
confident the population parameter of interest lies
Ex: for a 95% CI, one needs to remove the most extreme
2.5% from each tail of the distribution
[Figure: density curve, horizontal axis from -4 to 4]
$Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}}$
The mean value was 732.16 and the standard deviation was
83.14
Linear regression
Regression
Let Y be a random variable and x a deterministic variable
(that is, non-random)
$Y = \beta_0 + \beta_1 x + \epsilon$
$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
$\frac{\partial SS}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (Y_i - \beta_0 - \beta_1 x_i) = 0$
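$\bar{x} = 4.9$, $\bar{y} = 1.46$, $\sum_{i=1}^{5} x_i^2 = 120.15$ and $\sum_{i=1}^{5} x_i y_i = 35.88$
Then,
$\hat{\beta}_1 = \frac{\sum_{i=1}^{5} x_i y_i - 5 \bar{x} \bar{y}}{\sum_{i=1}^{5} x_i^2 - 5 \bar{x}^2} = \frac{35.88 - 5(4.9)(1.46)}{120.15 - 5(4.9)^2} = 1.1$
and
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.46 - (1.1)(4.9) = -3.93$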
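The arithmetic of this example can be reproduced from the summary statistics alone. A minimal R sketch, with variable names of our own:

```r
n <- 5
xbar <- 4.9; ybar <- 1.46
sum_x2 <- 120.15; sum_xy <- 35.88

b1 <- (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar^2)   # slope estimate
b0 <- ybar - b1 * xbar                                     # intercept estimate
round(c(b0 = b0, b1 = b1), 2)
```

Both the numerator and the denominator of the slope are small differences of larger numbers, so rounding the summary statistics too early can noticeably change the result.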
[Figure: plot with axis labelled "Temperature (degrees F)"]
Section VIII
Hypothesis Testing
Outline
Two-sample tests
Goodness-of-fit tests
Summary
Concepts in hypothesis testing
Hypothesis Testing
We know that X is an unbiased estimator
NB: we cannot assume that, say, < 500, since is not the
rate of an Exponential distribution
Forming Hypotheses
Note that
$P[N(0, 1) < -1.645] = P[N(0, 1) > 1.645] = 0.05$
$P[N(0, 1) < -1.96] = P[N(0, 1) > 1.96] = 0.025$
Errors in detection
Recall: one seeks to retain or reject a null hypothesis H0 on the
basis of evidence. Let us denote by H1 the alternative hypothesis.
                   H0 is true          H1 is true
H0 is accepted     Correct decision    Type II error
H1 is accepted     Type I error        Correct decision
Test statistic
We test the hypotheses based on the sample
p-value
The test procedure becomes: Reject H0 if T > tc for some
unknown but computable tc
i.e.
P(|X| > tc | = 700) = 0.05
[Figure: standard Normal density, vertical axis "Density", annotated with $P(Z > 1.645 \mid H_0) = 0.05$]
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
One-sample, one-sided tests of the population mean
z-test in R
$\bar{X} \sim N(\mu, \sigma^2 / n)$
$X_1^2 + \cdots + X_n^2 \sim \chi^2(n - 1)$
$\frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t(n - 1)$
for $x \in \mathbb{R}$, where
$k_m = \frac{\Gamma\left(\frac{m+1}{2}\right)}{\Gamma(m/2) \sqrt{m\pi}}$
and
$\Gamma(u) = \int_0^{\infty} e^{-x} x^{u-1} \, dx$
X = 23.78778, s = 0.07827513, n = 23
Using R:
qt(0.025,22) = -2.073873
qt(0.975,22) = 2.073873
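With these summary statistics and t quantiles, a t-based 95% confidence interval for the mean can be sketched as follows. The formula $\bar{x} \pm t_{0.975, n-1} \, s/\sqrt{n}$ is the standard one; treating it as the intended use of the quantiles above is our assumption.

```r
xbar <- 23.78778; s <- 0.07827513; n <- 23
tq <- qt(0.975, df = n - 1)                 # upper 2.5% point of t(22)
ci <- xbar + c(-1, 1) * tq * s / sqrt(n)    # lower and upper limits
tq
ci
```

By symmetry of the t distribution, `qt(0.025, 22)` is simply the negative of `qt(0.975, 22)`, so one quantile suffices.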
t-test in R
One-sample, one-sided tests of the population proportion
Testing proportions in R
One-sample, two-sided tests
$z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$
Using standardization,
$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$
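$P\left(z_L < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < z_U\right) = P\left(z_L \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < z_U \frac{\sigma}{\sqrt{n}}\right)$
$= P\left(\bar{X} - z_U \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} - z_L \frac{\sigma}{\sqrt{n}}\right)$
$= 0.95$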
Using the given $\sigma = 0.1$ and $\alpha = 0.05$, we find the 95% CI:
$\left(23.788 - 1.96 \frac{0.1}{\sqrt{23}},\ 23.788 + 1.96 \frac{0.1}{\sqrt{23}}\right)$
i.e.
(23.747, 23.829) MJ/kg
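This interval can be reproduced in R, with `qnorm` supplying the Normal quantile in place of the rounded value 1.96. A minimal sketch, with variable names of our own:

```r
xbar <- 23.788; sigma <- 0.1; n <- 23; alpha <- 0.05
z <- qnorm(1 - alpha / 2)                        # approximately 1.96
ci <- xbar + c(-1, 1) * z * sigma / sqrt(n)      # lower and upper limits
round(ci, 3)
```

Changing `alpha` gives intervals at other confidence levels, for example `alpha <- 0.01` for a 99% interval.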
Two-sample tests
Two-sample z-test
Paired t-test
In a paired t-test, one compares the mean $\bar{d}$ of the differences
between two samples with a hypothesized difference in
means $d_0$, using a test statistic of the form
$t = \frac{\bar{d} - d_0}{s / \sqrt{n}}$, $df = n - 1$
Recall synopsis for the t-test:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
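As an illustration of the paired case, the sketch below computes the test statistic by hand and compares it with `t.test(..., paired = TRUE)`. The measurements are hypothetical values invented for this example, and the hypothesized difference is taken to be 0.

```r
# Hypothetical before/after measurements on the same 8 subjects
before <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 12.0)
after  <- c(11.8, 11.1, 12.9, 12.1, 11.7, 12.6, 12.8, 11.6)

d <- before - after                                      # paired differences
t_manual <- (mean(d) - 0) / (sd(d) / sqrt(length(d)))    # statistic with d0 = 0

res <- t.test(before, after, paired = TRUE, mu = 0)
c(manual = t_manual,
  t.test = unname(res$statistic),
  df     = unname(res$parameter))
```

A paired test on two samples is equivalent to a one-sample t-test on their differences, which is why the degrees of freedom are n - 1 rather than 2n - 2.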
Other tests in R
F-test to compare the variances of two samples from normal
populations: var.test()
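# Two simulated normal samples with different standard deviations
x <- rnorm(50, mean = 0, sd = 2)
y <- rnorm(30, mean = 1, sd = 1)
# F-test of the null hypothesis that the two variances are equal
var.test(x, y)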
Goodness-of-fit tests
$D^2 = \sum_{j=1}^{k} \frac{(n_j - m_j)^2}{m_j}$
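The statistic can be computed by hand and compared with R's `chisq.test`. In the sketch below, the observed counts are hypothetical die-rolling results invented for illustration, and the null hypothesis is a fair die, so that the expected counts $m_j$ are all equal.

```r
# Hypothetical counts n_j of each face in 60 rolls of a die
observed <- c(8, 12, 9, 11, 6, 14)
p0 <- rep(1 / 6, 6)                              # fair die under H0
expected <- sum(observed) * p0                   # expected counts m_j
D2 <- sum((observed - expected)^2 / expected)    # goodness-of-fit statistic

res <- chisq.test(observed, p = p0)
c(D2 = D2, chisq.test = unname(res$statistic), df = unname(res$parameter))
```

With k categories and no estimated parameters, `chisq.test` refers the statistic to a chi-squared distribution with k - 1 degrees of freedom.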
Testing for significance in linear regression
Regression Tests
$T_0 := \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MSE \, \frac{\sum_i x_i^2}{n \, SSX}}} \sim t_{n-2}$
$T_0 := \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\frac{MSE}{SSX}}} \sim t_{n-2}$
$SSR = \hat{\beta}_1^2 \, SSX$
$= SST - SSE$
Summary
1 What we need to do a test:
  1 null and alternative hypotheses
  2 a test statistic T
  3 a significance level $\alpha$