Stat Cookbook

Probability and Statistics
Cookbook
Version 0.2.4
14th May, 2017
http://statistics.zone/
Copyright
c Matthias Vallentin, 2017
Contents 14 Exponential Family 16 21.5 Spectral Analysis . . . . . . . . . . . . . 28
1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

1.1 Discrete Distributions . . . . . . . . . . 3 15.1 Credible Intervals . . . . . . . . . . . . . 16 22.1 Gamma Function . . . . . . . . . . . . . 29
1.2 Continuous Distributions . . . . . . . . 5 15.2 Function of parameters . . . . . . . . . . 17 22.2 Beta Function . . . . . . . . . . . . . . . 29
15.3 Priors . . . . . . . . . . . . . . . . . . . 17 22.3 Series . . . . . . . . . . . . . . . . . . . 29
2 Probability Theory 8 15.3.1 Conjugate Priors . . . . . . . . . 17 22.4 Combinatorics . . . . . . . . . . . . . . 30
15.4 Bayesian Testing . . . . . . . . . . . . . 18
3 Random Variables 8
3.1 Transformations . . . . . . . . . . . . . 9 16 Sampling Methods 18
16.1 Inverse Transform Sampling . . . . . . . 18
4 Expectation 9 16.2 The Bootstrap . . . . . . . . . . . . . . 18
16.2.1 Bootstrap Confidence Intervals . 18
5 Variance 9 16.3 Rejection Sampling . . . . . . . . . . . . 19
16.4 Importance Sampling . . . . . . . . . . . 19
6 Inequalities 10
17 Decision Theory 19
7 Distribution Relationships 10 17.1 Risk . . . . . . . . . . . . . . . . . . . . 19
17.2 Admissibility . . . . . . . . . . . . . . . 20
8 Probability and Moment Generating
17.3 Bayes Rule . . . . . . . . . . . . . . . . 20
Functions 11
17.4 Minimax Rules . . . . . . . . . . . . . . 20
9 Multivariate Distributions 11 18 Linear Regression 20
9.1 Standard Bivariate Normal . . . . . . . 11 18.1 Simple Linear Regression . . . . . . . . 20
9.2 Bivariate Normal . . . . . . . . . . . . . 11 18.2 Prediction . . . . . . . . . . . . . . . . . 21
9.3 Multivariate Normal . . . . . . . . . . . 11 18.3 Multiple Regression . . . . . . . . . . . 21
18.4 Model Selection . . . . . . . . . . . . . . 22
10 Convergence 11
10.1 Law of Large Numbers (LLN) . . . . . . 12 19 Non-parametric Function Estimation 22
10.2 Central Limit Theorem (CLT) . . . . . 12 19.1 Density Estimation . . . . . . . . . . . . 22
19.1.1 Histograms . . . . . . . . . . . . 23
11 Statistical Inference 12 19.1.2 Kernel Density Estimator (KDE) 23
11.1 Point Estimation . . . . . . . . . . . . . 12 19.2 Non-parametric Regression . . . . . . . 23
11.2 Normal-Based Confidence Interval . . . 13 19.3 Smoothing Using Orthogonal Functions 24
11.3 Empirical distribution . . . . . . . . . . 13
11.4 Statistical Functionals . . . . . . . . . . 13 20 Stochastic Processes 24
20.1 Markov Chains . . . . . . . . . . . . . . 24
12 Parametric Inference 13 20.2 Poisson Processes . . . . . . . . . . . . . 25
12.1 Method of Moments . . . . . . . . . . . 13
12.2 Maximum Likelihood . . . . . . . . . . . 14 21 Time Series 25
12.2.1 Delta Method . . . . . . . . . . . 14 21.1 Stationary Time Series . . . . . . . . . . 26 This cookbook integrates various topics in probability theory
12.3 Multiparameter Models . . . . . . . . . 14 21.2 Estimation of Correlation . . . . . . . . 26 and statistics, based on literature [1, 6, 3] and in-class material
12.3.1 Multiparameter delta method . . 15 21.3 Non-Stationary Time Series . . . . . . . 26 from courses of the statistics department at the University of
12.4 Parametric Bootstrap . . . . . . . . . . 15 21.3.1 Detrending . . . . . . . . . . . . 27 California in Berkeley but also influenced by others [4, 5]. If you
21.4 ARIMA models . . . . . . . . . . . . . . 27 find errors or have suggestions for improvements, please get in
13 Hypothesis Testing 15 21.4.1 Causality and Invertibility . . . . 28 touch at http://statistics.zone/.
1 Distribution Overview
1.1 Discrete Distributions
Notation1 FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b a + 1)2 1 eas e(b+1)s

bxca+1 I(a x b) a+b
Uniform Unif {a, . . . , b} axb
ba ba+1 2 12 s(b a)
1 x>b

Bernoulli Bern (p) (1 p)1x px (1 p)1x p p(1 p) 1 p + pes
!
n x
Binomial Bin (n, p) I1p (n x, x + 1) p (1 p)nx np np(1 p) (1 p + pes )n
x

np1 !n
np1 (1 p1 ) np1 p2
k
! k
n! x
X . X
Multinomial Mult (n, p) px1 pk k xi = n .. .. pi e si
x1 ! . . . xk ! 1 i=1 np2 p1 . i=0
npk
m N m
!
x np x nx nm nm(N n)(N m)
Hypergeometric Hyp (N, m, n) N N 2 (N 1)
p
np(1 p) n
N
! r
x+r1 r 1p 1p p
Negative Binomial NBin (r, p) Ip (r, x + 1) p (1 p)x r r 2
r1 p p 1 (1 p)es
1 1p pes
Geometric Geo (p) 1 (1 p)x x N+ p(1 p)x1 x N+
p p2 1 (1 p)es
x
X i x e s
Poisson Po () e e(e 1)
i=0
i! x!
1 We use the notation (s, x) and (x) to refer to the Gamma functions (see 22.1), and use B(x, y) and Ix to refer to the Beta functions (see 22.2).
3
Uniform (discrete) Binomial Geometric Poisson
n = 40, p = 0.3 0.8 p = 0.2
=1
n = 30, p = 0.6 p = 0.5 =4
n = 25, p = 0.9 p = 0.8 = 10

0.3
0.2 0.6
0.2
PMF
PMF
PMF
PMF
1

0.4
n

0.1

0.1
0.2

0.0

0.0

0.0

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x
Uniform (discrete) Binomial Geometric Poisson
1 1.00
1.0 1.00

0.75 0.8 0.75

i

n

CDF
CDF
CDF
CDF
0.50 0.6 0.50

i

n

0.25 0.4 0.25

n = 40, p = 0.3 p = 0.2
=1
n = 30, p = 0.6 p = 0.5

=4

0 0.00

n = 25, p = 0.9 0.2 p = 0.8 0.00

= 10
a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x
4
1.2 Continuous Distributions
Notation FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b a)2 esb esa

xa I(a < x < b) a+b
Uniform Unif (a, b) a<x<b
ba ba 2 12 s(b a)
1 x>b

(x )2
Z x
2 s2

1
N , 2 2

Normal (x) = (t) dt (x) = exp exp s +
2 2 2 2
(ln x )2

1 1 ln x 1 2 2 2
ln N , 2 e+ /2
(e 1)e2+

Log-Normal + erf exp
2 2 2 2 x 2 2 2 2

1 T
1 (x) 1
Multivariate Normal MVN (, ) (2)k/2 ||1/2 e 2 (x) exp T s + sT s
2
(+1)/2 (
+1

2 x2 2
>2
Students t Student() Ix , 1 + 0 >1
2

2 2 1<2

1 k x 1
Chi-square 2k , xk/21 ex/2 k 2k (1 2s)k/2 s < 1/2
(k/2) 2 2 2k/2 (k/2)
r
d
(d1 x)d1 d2 2
2d22 (d1 + d2 2)

d1 d1 (d1 x+d2 )d1 +d2 d2
F F(d1 , d2 ) I d1 x , d1 d1 d2 2 d1 (d2 2)2 (d2 4)

d1 x+d2 2 2 xB 2
, 2
1 x/ 1
Exponential Exp () 1 ex/ e 2 s (s < )
1
!
(, x) 1 x 1
Gamma Gamma (, ) x e s (s < )
() () 2 1
, x

1 /x 2 2(s)/2 p
Inverse Gamma InvGamma (, ) x e >1 >2 K 4s
() () 1 ( 1)2 ( 2) ()
P
k
i=1 i Y 1
k
i E [Xi ] (1 E [Xi ])
Dirichlet Dir () Qk xi i Pk Pk
i=1 (i ) i=1 i=1 i i=1 i + 1
k1
!
( + ) 1 X Y +r sk
Beta Beta (, ) Ix (, ) x (1 x)1 1+
() () + ( + )2 ( + + 1) r=0
++r k!
k=1

sn n

k k x k1 (x/)k 1 2 X n
Weibull Weibull(, k) 1 e(x/) e 1 + 2 1 + 2 1+
k k n=0
n! k
x
m x xm x2m
Pareto Pareto(xm , ) 1 x xm m
+1 x xm >1 >2 (xm s) (, xm s) s < 0
x x 1 ( 1)2 ( 2)
1
We use the rate parameterization where =
. Some textbooks use as scale parameter instead [6].
5
Uniform (continuous) Normal LogNormal Student's t
2.0 1.00 0.4 =1
= 0, = 0.2
2
= 0, = 3
2
= 0, 2 = 1 = 2, 2 = 2 =2
= 0, 2 = 5 = 0, 2 = 1 =5
=
= 2, 2 = 0.5 = 0.5, 2 = 1
= 0.25, 2 = 1
1.5 0.75 = 0.125, 2 = 1 0.3
PDF
PDF
PDF
PDF
1
1.0 0.50 0.2
ba
0.5 0.25 0.1
0.0 0.00 0.0
a b 5.0 2.5 0.0 2.5 5.0 0 1 2 3 5.0 2.5 0.0 2.5 5.0
x x x x
2 F Exponential Gamma
d1 = 1, d2 = 1 2.0 = 0.5 0.5 = 1, = 0.5
1.00 k=1 3 d1 = 2, d2 = 1 =1 = 2, = 0.5
k=2 d1 = 5, d2 = 2 = 2.5 = 3, = 0.5
k=3 d1 = 100, d2 = 1 = 5, = 1
k=4 d1 = 100, d2 = 100 0.4 = 9, = 2
k=5
1.5
0.75
2
0.3
PDF
PDF
PDF
PDF
0.50 1.0
0.2
1
0.25 0.5
0.1
0.00 0 0.0 0.0
0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
= 1, = 1 5 = 0.5, = 0.5 2.0 = 1, k = 0.5 4 xm = 1, k = 1
= 2, = 1 = 5, = 1 = 1, k = 1 xm = 1, k = 2
= 3, = 1 = 1, = 3 = 1, k = 1.5 xm = 1, k = 4
4 = 3, = 0.5 = 2, = 2 = 1, k = 5
4 = 2, = 5
1.5 3
3
3
PDF
PDF
PDF
PDF
1.0 2
2
2
0.5 1
1 1
0 0 0.0 0
0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
6
Uniform (continuous) Normal LogNormal Student's t
1 1.00 = 0, = 3
2 1.00
= 2, 2 = 2
0.75 = 0, 2 = 1
= 0.5, 2 = 1
= 0.25, 2 = 1
0.75 = 0.125, 2 = 1 0.75
0.50
CDF
CDF
CDF
CDF
0.50 0.50
0.25
0.25 0.25
= 0, 2 = 0.2 =1
= 0, 2 = 1 =2
= 0, 2 = 5 =5
0 0.00 = 2, 2 = 0.5 0.00 0.00 =
a b 5.0 2.5 0.0 2.5 5.0 0 1 2 3 5.0 2.5 0.0 2.5 5.0
x x x x
2 F Exponential Gamma
1.00 1.00 1.00
1.00
0.75 0.75 0.75 0.75

CDF
CDF
CDF
CDF
0.50 0.50 0.50 0.50
0.25 0.25 0.25 0.25

k=1 d1 = 1, d2 = 1 = 1, = 0.5
k=2 d1 = 2, d2 = 1 = 2, = 0.5
k=3 d1 = 5, d2 = 2 = 0.5 = 3, = 0.5
k=4 d1 = 100, d2 = 1 =1 = 5, = 1
0.00 k=5 0.00 d1 = 100, d2 = 100 0.00 = 2.5 0.00 = 9, = 2
0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
1.00 1.00
1.00 1.00 = 0.5, = 0.5
= 5, = 1
= 1, = 3
= 2, = 2
= 2, = 5
0.75 0.75 0.75 0.75
CDF
CDF
CDF
CDF
0.50 0.50 0.50 0.50
0.25 0.25 0.25 0.25
= 1, = 1 = 1, k = 0.5
= 2, = 1 = 1, k = 1 xm = 1, k = 1
= 3, = 1 = 1, k = 1.5 xm = 1, k = 2
0.00 = 3, = 0.5 0.00 0.00 = 1, k = 5 0.00 xm = 1, k = 4
0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
7
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] = Ai
Sample space i=1 i=1
Outcome (point or element) Bayes Theorem

Event A
n
-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn = Ai
1. A j=1 P [B | Aj ] P [Aj ] i=1
S
2. A1 , A2 , . . . , A = i=1 Ai A Inclusion-Exclusion Principle
3. A A = A A
n n
r
[ X X \
Probability Distribution P (1)r1

Ai = A ij

1. P [A] 0 A i=1 r=1 ii1 <<ir n j=1
2. P [] = 1
" #
G
X 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable (RV)
Probability space (, A, P) X:R
Properties Probability Mass Function (PMF)
P [] = 0 fX (x) = P [X = x] = P [{ : X() = x}]

B = B = (A A) B = (A B) (A B)
Probability Density Function (PDF)
P [A] = 1 P [A]
b
P [B] = P [A B] + P [A B]
Z
P [a X b] = f (x) dx
P [] = 1 P [] = 0 a
S T T S
( n An ) = n An ( n An ) = n An DeMorgan
S T Cumulative Distribution Function (CDF)
P [ n An ] = 1 P [ n An ]
P [A B] = P [A] + P [B] P [A B] FX : R [0, 1] FX (x) = P [X x]
= P [A B] P [A] + P [B]
1. Nondecreasing: x1 < x2 = F (x1 ) F (x2 )
P [A B] = P [A B] + P [A B] + P [A B]
2. Normalized: limx = 0 and limx = 1
P [A B] = P [A] P [A B]
3. Right-Continuous: limyx F (y) = F (x)
Continuity of Probabilities
S b
A1 A2 . . . = limn P [An ] = P [A]
Z
where A = i=1 Ai
T P [a Y b | X = x] = fY |X (y | x)dy ab
A1 A2 . . . = limn P [An ] = P [A] where A = i=1 Ai a
Independence
f (x, y)
fY |X (y | x) =
A
B P [A B] = P [A] P [B] fX (x)
Conditional Probability Independence
P [A B] 1. P [X x, Y y] = P [X x] P [Y y]
P [A | B] = P [B] > 0 2. fX,Y (x, y) = fX (x)fY (y)
P [B] 8
Z
3.1 Transformations E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
E [(Y )] 6= (E [X]) (cf. Jensen inequality)
Z = (X)
P [X Y ] = 1 = E [X] E [Y ]
Discrete P [X = Y ] = 1 = E [X] = E [Y ]
X
fZ (z) = P [(X) = z] = P [{x : (x) = z}] = P X 1 (z) =

fX (x)
X
E [X] = P [X x] X discrete
x1 (z) x=1
Continuous Sample mean

n
n = 1
Z X
X Xi
FZ (z) = P [(X) z] = f (x) dx with Az = {x : (x) z} n i=1
Az
Conditional expectation
Special case if strictly monotone Z

d

dx 1 E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (1 (z)) 1 (z) = fX (x) = fX (x)

dz dz |J| E [X] = E [E [X | Y ]]
Z
The Rule of the Lazy Statistician E(X,Y ) | X=x [=] (x, y)fY |X (y | x) dx

Z Z
E [Z] = (x) dFX (x) E [(Y, Z) | X = x] = (y, z)f(Y,Z)|X (y, z | x) dy dz

Z Z E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X A] E [(X)Y | X] = (X)E [Y | X]
A
E [Y | X] = c = Cov [X, Y ] = 0
Convolution
Z Z z
X,Y 0
Z := X + Y fZ (z) = fX,Y (x, z x) dx = fX,Y (x, z x) dx
0 5 Variance
Z
Z := |X Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z Z 2
2
X V [X] = X = E (X E [X])2 = E X 2 E [X]
Z := fZ (z) = |x|fX,Y (x, xz) dx = xfx (x)fX (x)fY (xz) dx " n # n
Y X X X
V Xi = V [Xi ] + Cov [Xi , Xj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
V Xi = V [Xi ] if Xi
Xj
Definition and properties i=1 i=1
X

xfX (x) X discrete Standard deviation p
sd[X] = V [X] = X

Z x

E [X] = X = x dFX (x) = Covariance

Z
xfX (x) dx X continuous

Cov [X, Y ] = E [(X E [X])(Y E [Y ])] = E [XY ] E [X] E [Y ]
P [X = c] = 1 = E [X] = c Cov [X, a] = 0
E [cX] = c E [X] Cov [X, X] = V [X]
E [X + Y ] = E [X] + E [Y ] Cov [X, Y ] = Cov [Y, X]
9
Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
Cov Xi , Yj = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
Xi Bern (p) = Xi Bin (n, p)
i=1
Correlation X Bin (n, p) , Y Bin (m, p) = X + Y Bin (n + m, p)
Cov [X, Y ]
[X, Y ] = p limn Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] limn Bin (n, p) = N (np, np(1 p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X
Y = [X, Y ] = 0 Cov [X, Y ] = 0 E [XY ] = E [X] E [Y ]
X NBin (1, p) = Geo (p)
Pr
Sample variance X NBin (r, p) = i=1 Geo (p)
n P P
1 X n )2 Xi NBin (ri , p) = Xi NBin ( ri , p)
S2 = (Xi X
n 1 i=1 X NBin (r, p) . Y Bin (s + r, p) = P [X s] = P [Y r]
Conditional variance Poisson
2 n n
!
V [Y | X] = E (Y E [Y | X])2 | X = E Y 2 | X E [Y | X] X X
Xi Po (i ) Xi Xj = Xi Po i
V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1

n n
X X i
6 Inequalities Xi Po (i ) Xi Xj = Xi Xj Bin Xj , Pn
j=1 j=1 j=1 j
Cauchy-Schwarz
2 Exponential
E [XY ] E X 2 E Y 2

n
X
Markov Xi Exp () Xi
Xj = Xi Gamma (n, )
E [(X)]
P [(X) t] i=1
t Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X E [X]| t]
t2
X

Chernoff X N , 2 = N (0, 1)

e
X N , Z = aX + b = Z N a + b, a2 2
2

P [X (1 + )] > 1
(1 + )1+
Xi N i , i2 Xi Xj =
P
Xi N
P
i , i i2
P
i i
Hoeffding
P [a < X b] = b a

X1 , . . . , Xn independent P [Xi [ai , bi ]] = 1 1 i n (x) = 1 (x) 0 (x) = x(x) 00 (x) = (x2 1)(x)
1

E X t e2nt2 t > 0
Upper quantile of N (0, 1): z = (1 )
P X
2 2
Gamma
E X | t 2 exp Pn 2n t

P |X 2
t>0
i=1 (bi ai ) X Gamma (, ) X/ Gamma (, 1)
P
Jensen Gamma (, ) i=1 Exp ()
P P
E [(X)] (E [X]) convex Xi Gamma (i , ) Xi
Xj = i Xi Gamma ( i i , )
10
Z
() 9.2 Bivariate Normal
= x1 ex dx
0
Let X N x , x2 and Y N y , y2 .
Beta
1 ( + ) 1 1 z
x1 (1 x)1 = x (1 x)1 f (x, y) = exp
2(1 2 )
p
B(, ) ()() 2x y 1 2
B( + k, ) +k1
E X k1
" #
E Xk =
2 2
=

B(, ) ++k1 x x y y x x y y
z= + 2
Beta (1, 1) Unif (0, 1) x y x y
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] +
X
(Y E [Y ])
Y
GX (t) = E tX |t| < 1 p
V [X | Y ] = X 1 2
" #
X (Xt)i X E Xi
ti

MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
9.3 Multivariate Normal
P [X = 0] = GX (0)
P [X = 1] = G0X (0) Covariance matrix (Precision matrix 1 )
(i)
GX (0)
P [X = i] = V [X1 ] Cov [X1 , Xk ]
i! .. .. ..
=

E [X] = G0X (1 ) . . .
(k)
E X k = MX (0) Cov [Xk , X1 ] V [Xk ]

X! (k) If X N (, ),
E = GX (1 )
(X k)!
2 1
V [X] = G00X (1 ) + G0X (1 ) (G0X (1 )) fX (x) = (2) n/2
||
1/2
exp (x )T 1 (x )
d 2
GX (t) = GY (t) = X = Y
Properties
9 Multivariate Distributions Z N (0, 1) X = + 1/2 Z = X N (, )
X N (, ) = 1/2 (X ) N (0, 1)
9.1 Standard Bivariate Normal X N (, ) = AX N A, AAT

p
Let X, Y N (0, 1) X
Z where Y = X + 1 2 Z X N (, ) kak = k = aT X N aT , aT a
Joint density
1 x2 + y 2 2xy
10 Convergence
f (x, y) = exp
2(1 2 )
p
2 1 2 Let {X1 , X2 , . . .} be a sequence of rvs and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) N x, 1 2 (X | Y = y) N y, 1 2

and D
1. In distribution (weakly, in law): Xn X
Independence
X
Y = 0 lim Fn (t) = F (t) t where F continuous
n 11
P
2. In probability: Xn X
Xn n(Xn ) D
Zn := q = Z where Z N (0, 1)
( > 0) lim P [|Xn X| > ] = 0 n
n V X
as
3. Almost surely (strongly): Xn X lim P [Zn z] = (z) zR
n
h i h i
P lim Xn = X = P : lim Xn () = X() = 1 CLT notations
n n
qm
Zn N (0, 1)
4. In quadratic mean (L2 ): Xn X 2

X n N ,
lim E (Xn X)2 = 0 n

n 2

X n N 0,
Relationships n

2

qm P D n(Xn ) N 0,
Xn X = Xn X = Xn X
as
Xn X = Xn X
P n(Xn )
N (0, 1)
D P
Xn X (c R) P [X = c] = 1 = Xn X
P P P
Xn X Yn Y = Xn + Yn X + Y
qm qm qm
Xn X Yn Y = Xn + Yn X + Y Continuity correction
P P P
Xn X Yn Y = Xn Yn XY
x + 12
P P

Xn X = (Xn ) (X)
P Xn x

D
Xn X = (Xn ) (X)
D / n
qm
Xn b limn E [Xn ] = b limn V [Xn ] = 0
x 12

n
qm
n x 1

X1 , . . . , Xn iid E [X] = V [X] < X P X
/ n
Slutzkys Theorem Delta method
D P D
Xn X and Yn c = Xn + Yn X + c
2

2 2

D P D
Xn X and Yn c = Xn Yn cX Yn N , = (Yn ) N (), (0 ())
n n
D D D
In general: Xn X and Yn Y =
6 Xn + Yn X + Y
11 Statistical Inference
10.1 Law of Large Numbers (LLN)
iid
Let X1 , , Xn F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = .
Weak (WLLN)
n
X
P
n 11.1 Point Estimation
Strong (SLLN) Point estimator bn of is a rv: bn = g(X1 , . . . , Xn )
h i
n
X
as
n bias(bn ) = E bn
P
Consistency: bn
10.2 Central Limit Theorem (CLT)
Sampling distribution: F (bn )
Let {X1 , . . . , Xn } be a sequence of iid rvs, E [X1 ] = , and V [X1 ] = 2 .
r h i
Standard error: se(n ) = V bn
b
12
h i h i
Mean squared error: mse = E (bn )2 = bias(bn )2 + V bn 11.4 Statistical Functionals
limn bias(bn ) = 0 limn se(bn ) = 0 = bn is consistent Statistical functional: T (F )
bn D Plug-in estimator of = (F ): bn = T (Fbn )
Asymptotic normality: N (0, 1) R
se Linear functional: T (F ) = (x) dFX (x)
Slutzkys Theorem often lets us replace se(bn ) by some (weakly) consis- Plug-in estimator for linear functional:
tent estimator
bn . Z n
1X
T (Fbn ) = (x) dFbn (x) = (Xi )
11.2 Normal-Based Confidence Interval n i=1

b 2 . Let z/2 = 1 (1 (/2)), i.e., P Z > z/2 = /2

Suppose bn N , se

b 2 = T (Fbn ) z/2 se
Often: T (Fbn ) N T (F ), se b
and P z/2 < Z < z/2 = 1 where Z N (0, 1). Then
pth quantile: F 1 (p) = inf{x : F (x) p}
Cn = bn z/2 se
b b=X n
n
1 X n )2
b2 =
(Xi X
11.3 Empirical distribution n 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi b)3

b=
Pn
I(Xi x) Pb3
Fn (x) = i=1
b n
i=1 (Xi Xn )(Yi Yn )
n b = qP qP
n 2 n 2
i=1 (Xi Xn ) i=1 (Yi Yn )
(
1 Xi x
I(Xi x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
E Fbn = F (x)

Let F = f (x; ) : be a parametric model with parameter space Rk
h i F (x)(1 F (x)) and parameter = (1 , . . . , k ).
V Fbn =
n
F (x)(1 F (x)) D 12.1 Method of Moments
mse = 0
n
P j th moment
Fbn F (x) Z
j () = E X j = xj dFX (x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn F )

P sup F (x) Fn (x) > = 2e2n
b 2
j th sample moment
x n
1X j
Nonparametric 1 confidence band for F
bj = X
n i=1 i
L(x) = max{Fbn n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s 1 () =
b1
1 2
= log 2 () =
b2
2n
.. ..
.=.
P [L(x) F (x) U (x) x] 1 k () =
bk
13
Properties of the MoM estimator Equivariance: bn is the mle = (bn ) is the mle of ()
bn exists with probability tending to 1 Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
Consistency: bn ples. If en is any other estimator, the asymptotic relative efficiency is:
p
Asymptotic normality: 1. se 1/In ()
(bn ) D
D
n(b ) N (0, ) N (0, 1)
se
q
where = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b 1/In (bn )
2. se
1
g = (g1 , . . . , gk ) and gj = j ()
(bn ) D
N (0, 1)
se
b
12.2 Maximum Likelihood Asymptotic optimality
Likelihood: Ln : [0, ) h i
V bn
n
Y are(en , bn ) = h i 1
Ln () = f (Xi ; ) V en
i=1
Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n () = log Ln () = log f (Xi ; )
i=1 b where is differentiable and 0 () 6= 0:
If = ()
Maximum likelihood estimator (mle)
n ) D
(b
N (0, 1)
Ln (bn ) = sup Ln () se(b
b )

where b = ()
b is the mle of and
Score function

s(X; ) = log f (X; ) b = 0 ()
se se(
b n )
b b

Fisher information
I() = V [s(X; )] 12.3 Multiparameter Models
In () = nI() Let = (1 , . . . , k ) and b = (b1 , . . . , bk ) be the mle.
Fisher information (exponential family)
2 `n 2 `n
Hjj = Hjk =
2 j k
I() = E s(X; )
Fisher information matrix
Observed Fisher information
E [H11 ] E [H1k ]

n
In () = .. .. ..
2 X

. . .
Inobs () =

log f (Xi ; )
2 i=1 E [Hk1 ] E [Hkk ]
Properties of the mle Under appropriate regularity conditions

P
Consistency: bn (b ) N (0, Jn )
14
with Jn () = In1 . Further, if bj is the j th component of , then Critical value c
Test statistic T
(bj j ) D Rejection region R = {x : T (x) > c}
N (0, 1)
se
bj Power function () = P [X R]
h i Power of a test: 1 P [Type II error] = 1 = inf ()
b 2j = Jn (j, j) and Cov bj , bk = Jn (j, k)
where se 1
Test size: = P [Type I error] = sup ()
0
12.3.1 Multiparameter delta method
Let = (1 , . . . , k ) and let the gradient of be Retain H0 Reject H0

H0 true Type
I Error ()
1 H1 true Type II Error () (power)
.
p-value
..
=

k

p-value = sup0 P [T (X) T (x)] = inf : T (x) R
P [T (X ? ) T (X)]

p-value = sup0 = inf : T (X) R
Suppose =b 6= 0 and b = ().
b Then, | {z }
1F (T (X)) since T (X ? )F
) D
(b
N (0, 1)
se(b
b )
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 0.05 strong evidence against H0

se(b
b ) =
b Jbn
b
0.05 0.1 weak evidence against H0
b and

b = b. > 0.1 little or no evidence against H0
and Jbn = Jn () =
Wald test
12.4 Parametric Bootstrap
Two-sided test
Sample from f (x; bn ) instead of from Fbn , where bn could be the mle or method
of moments estimator. b 0
Reject H0 when |W | > z/2 where W =
se
b
P |W | > z/2
13 Hypothesis Testing p-value = P0 [|W | > |w|] P [|Z| > |w|] = 2(|w|)
H0 : 0 versus H1 : 1
Likelihood ratio test
Definitions
Null hypothesis H0 sup Ln () Ln (bn )

Alternative hypothesis H1 T (X) = =
sup0 Ln () Ln (bn,0 )
Simple hypothesis = 0 k
Composite hypothesis > 0 or < 0 iid
D
X
(X) = 2 log T (X) 2rq where Zi2 2k and Z1 , . . . , Zk N (0, 1)
Two-sided test: H0 : = 0 versus H1 : 6= 0
i=1
One-sided test: H0 : 0 versus H1 : > 0 p-value = P0 [(X) > (x)] P 2rq > (x)
15
Multinomial LRT Natural form

X1 Xk
mle: pbn = ,..., fX (x | ) = h(x) exp { T(x) A()}
n n
k
Y pbj Xj = h(x)g() exp { T(x)}
Ln (b
pn )
T (X) = = = h(x)g() exp T T(x)

Ln (p0 ) j=1
p0j
k
X pbj D
(X) = 2 Xj log 2k1 15 Bayesian Inference
j=1
p 0j
The approximate size LRT rejects H0 when (X) 2k1, Bayes Theorem
Pearson Chi-square Test f (x | )f () f (x | )f ()
f ( | x) = =R Ln ()f ()
k f (xn ) f (x | )f () d
X (Xj E [Xj ])2
T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
T 2k1 X n = (X1 , . . . , Xn )

p-value = P 2k1 > T (x) xn = (x1 , . . . , xn )
D
2
Faster Xk1 than LRT, hence preferable for small n Prior density f ()
Likelihood f (xn | ): joint density of the data
Independence testing Yn
In particular, X n iid = f (xn | ) = f (xi | ) = Ln ()
I rows, J columns, X multinomial sample of size n = I J i=1
X
mles unconstrained: pbij = nij Posterior density f ( | xn )
X
Normalizing constant cn = f (xn ) = f (x | )f () d
R
mles under H0 : pb0ij = pbi pbj = Xni nj
Kernel: part of a density that dependsRon

PI PJ nX
LRT: = 2 i=1 j=1 Xij log Xi Xijj L ()f ()d
Posterior mean n = f ( | xn ) d = R n
R
PI PJ (X E[X ])2 Ln ()f () d
PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
LRT and Pearson 2k , where = (I 1)(J 1) 15.1 Credible Intervals
Posterior interval
14 Exponential Family Z b
P [ (a, b) | xn ] = f ( | xn ) d = 1
Scalar parameter a
fX (x | ) = h(x) exp {()T (x) A()} Equal-tail credible interval

= h(x)g() exp {()T (x)} Z a Z
f ( | xn ) d = f ( | xn ) d = /2
Vector parameter b
Highest posterior density (HPD) region Rn

( s
)
X
fX (x | ) = h(x) exp i ()Ti (x) A()
i=1 1. P [ Rn ] = 1
= h(x) exp {() T (x) A()} 2. Rn = { : f ( | xn ) > k} for some k
= h(x)g() exp {() T (x)} Rn is unimodal = Rn is an interval
16
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let = () and A = { : () }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for
Unif (0, ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [() | x ] = f ( | x ) d Exp () Gamma (, ) + n, + xi
A
i=1
Pn
0 i=1 xi 1 n
2
2

Posterior density N , c N 0 , 0 + / + 2 ,
2 2 02 c
0 c1
1 n
h( | xn ) = H 0 ( | xn ) + 2
02 c
Pn
02 + i=1 (xi )2
Bayesian delta method N c , 2 Scaled Inverse Chi- + n,
+n
square(, 02 )

+ nx n

| X n N (),
b seb 0 ()

N , 2
b
Normal- , + n, + ,
+n 2
scaled Inverse n 2
1X (
x )
Gamma(, , , ) + )2 +
(xi x
2 i=1 2(n + )
15.3 Priors 1
1 1
1 1

MVN(, c ) MVN(0 , 0 ) 0 + nc 0 0 + n x
,
1 1
1

Choice 0 + nc
Xn
MVN(c , ) Inverse- n + , + (xi c )(xi c )T
Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(, ) i=1
the researchs a priori knowledgevia prior elicitation n
X xi
Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (, ) + n, + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 kn where k0 > kn
Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (c , ) Gamma (0 , 0 ) 0 + nc , 0 + xi
i=1
Types
Flat: f () constant
R
Proper: f () d = 1
R
Improper: f () d =
Jeffreys Prior (transformation-invariant):
p p
f () I() f () det(I())
Conjugate: f () and f ( | xn ) belong to the same parametric family

17
Discrete likelihood Bayes factor
Likelihood Conjugate prior Posterior hyperparameters log10 BF10 BF10 evidence
n n 0 0.5 1 1.5 Weak
0.5 1 1.5 10 Moderate
X X
Bern (p) Beta (, ) + xi , + n xi
i=1 i=1
12 10 100 Strong
Xn n
X n
X >2 > 100 Decisive
Bin (p) Beta (, ) + xi , + Ni xi
p
i=1 i=1 i=1 1p BF10
n
X p = p where p = P [H1 ] and p = P [H1 | xn ]
NBin (p) Beta (, ) + rn, + xi 1 + 1p BF10
i=1
n
16 Sampling Methods
X
Po () Gamma (, ) + xi , + n
i=1
n
X 16.1 Inverse Transform Sampling
Multinomial(p) Dir () + x(i)
i=1 Setup
n
X
Geo (p) Beta (, ) + n, + xi U Unif (0, 1)
i=1 XF
F 1 (u) = inf{x | F (x) u}
15.4 Bayesian Testing Algorithm
1. Generate u Unif (0, 1)
If H0 : 0 :
2. Compute x = F 1 (u)
Z
Prior probability P [H0 ] = f () d
0 16.2 The Bootstrap
Z
Posterior probability P [H0 | xn ] = f ( | xn ) d Let Tn = g(X1 , . . . , Xn ) be a statistic.
0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:

Let H0 . . .Hk1 be k hypotheses. Suppose f ( | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ] i. Sample uniformly X1 , . . . , Xn Fbn .
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn = g(X1 , . . . , Xn ).
(b) Then
Marginal likelihood B B
!2
1 X 1 X
vboot = VFbn =
b Tn,b T
B B r=1 n,r
Z
n
f (x | Hi ) = f (xn | , Hi )f ( | Hi ) d b=1

16.2.1 Bootstrap Confidence Intervals
Posterior odds (of Hi relative to Hj )
Normal-based interval
n
P [Hi | x ] n
f (x | Hi ) P [Hi ] Tn z/2 se
b boot
=
P [Hj | xn ] f (xn | Hj ) P [Hj ] Pivotal interval
| {z } | {z }
Bayes Factor BFij prior odds 1. Location parameter = T (F )
18
2. Pivot Rn = bn 2. Generate u Unif (0, 1)
3. Let H(r) = P [Rn r] be the cdf of Rn Ln (cand )

3. Accept cand if u
4. Let Rn,b = bn,b bn . Approximate H using bootstrap: Ln (bn )
B
1 X 16.4 Importance Sampling
H(r)
b = I(Rn,b r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q() | xn ]:
5. = sample quantile of (bn,1

, . . . , bn,B ) iid
1. Sample from the prior 1 , . . . , n f ()
6. r = beta sample quantile of (Rn,1

, . . . , Rn,B ), i.e., r = bn
Ln (i )
2. wi = PB i = 1, . . . , B

7. Approximate 1 confidence interval Cn = a , b where
i=1 Ln (i )
PB
3. E [q() | xn ] i=1 q(i )wi
b 1 1 =

a
= bn H bn r1/2 = 2bn 1/2
2

b = bn Hb 1
=
bn r/2 =
2bn /2 17 Decision Theory
2
Percentile interval Definitions

Cn = /2 , 1/2 Unknown quantity affecting our decision:
Decision rule: synonymous for an estimator b
16.3 Rejection Sampling Action a A: possible value of the decision rule. In the estimation
context, the action is just an estimate of , (x).
b
Setup
Loss function L: consequences of taking action a when true state is or
We can easily sample from g() discrepancy between and , b L : A [k, ).
We want to sample from h(), but it is difficult Loss functions
k()
We know h() up to a proportional constant: h() = R Squared error loss: L(, a) = ( a)2
k() d (
Envelope condition: we can find M > 0 such that k() M g() K1 ( a) a < 0
Linear loss: L(, a) =
K2 (a ) a 0
Algorithm
Absolute error loss: L(, a) = | a| (linear loss with K1 = K2 )
1. Draw cand g() Lp loss: L(, a) = | a|p
2. Generate u Unif (0, 1)
(
0 a=
k(cand ) Zero-one loss: L(, a) =
3. Accept cand if u 1 a 6=
M g(cand )
4. Repeat until B values of cand have been accepted
17.1 Risk
Example
Posterior risk
We can easily sample from the prior g() = f ()
Z h i
r(b | x) = L(, (x))f
b ( | x) d = E|X L(, (x))
b
Target is the posterior h() k() = f (xn | )f ()
Envelope condition: f (xn | ) f (xn | bn ) = Ln (bn ) M (Frequentist) risk
Algorithm Z h i
1. Draw cand
f () R(, )
b = L(, (x))f
b (x | ) dx = EX| L(, (X))
b
19
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, )
b = L(, (x))f
b (x, ) dx d = E,X L(, (X))
b
Response variable Y
Covariate X (aka predictor variable or feature)
h h ii h i
r(f, )
b = E EX| L(, (X)
b = E R(, )
b
18.1 Simple Linear Regression

h h ii h i
r(f, )
b = EX E|X L(, (X)
b = EX r(b | X)
Model
17.2 Admissibility Yi = 0 + 1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = 2
Fitted line
b0 dominates b if
b0 rb(x) = b0 + b1 x
: R(, ) R(, )
b
Predicted (fitted) values
: R(, b0 ) < R(, )
b Ybi = rb(Xi )
b is inadmissible if there is at least one other estimator b0 that dominates Residuals
it. Otherwise it is called admissible. i = Yi Ybi = Yi b0 + b1 Xi
Residual sums of squares (rss)

17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(b0 , b1 ) = 2i
i=1
r(f, )
b = inf e r(f, )

e
R Least square estimates
(x)
b = inf r(b | x) x = r(f, )
b = r(b | x)f (x) dx
bT = (b0 , b1 )T : min rss

b0 ,
b1
Theorems
Squared error loss: posterior mean b0 = Yn b1 Xn

Pn Pn
Absolute error loss: posterior median i=1 (Xi Xn )(Yi Yn ) i=1 Xi Yi nXY
1 =
b Pn = P n
Zero-one loss: posterior mode i=1 (Xi Xn )
2 2 2
i=1 Xi nX

0
h i
E b | X n =
17.4 Minimax Rules 1
2 n1 ni=1 Xi2 X n
h i P
Maximum risk V b | X n = 2
)
R( b = sup R(, )
b
R(a) = sup R(, a) nsX X n 1
r Pn
2
i=1 Xi

b
Minimax rule se(
b b0 ) =
) sX n n
sup R(, )
b = inf R( e = inf sup R(, )
e
e e

b
se(
b b1 ) =
sX n
b = Bayes rule c : R(, )
b =c Pn Pn 2
where s2X = n1 i=1 (Xi X n )2 and 1
b2 = n2 i=1
i (unbiased estimate).
Least favorable prior Further properties:
P P
bf = Bayes rule R(, bf ) r(f, bf ) Consistency: b0 0 and b1 1
20
Asymptotic normality: 18.3 Multiple Regression
b0 0 D b1 1 D Y = X +
N (0, 1) and N (0, 1)
se(
b b0 ) se(
b b1 )
where
Approximate 1 confidence intervals for 0 and 1 :
X11 X1k 1 1
.. .. = ...
.. ..
b0 z/2 se( and b1 z/2 se( X= . =.

b b0 ) b b1 ) . .
Xn1 Xnk k n
Wald test for H0 : 1 = 0 vs. H1 : 1 6= 0: reject H0 if |W | > z/2 where
W = b1 /se(
b b1 ). Likelihood

1
R2 L(, ) = (2 2 )n/2 exp 2 rss
Pn b 2
Pn 2 2
i=1 (Yi Y ) rss
2
R = Pn 2
= 1 Pn i=1 i 2 = 1
i=1 (Yi Y ) i=1 (Yi Y )
tss
N
X
Likelihood rss = (y X)T (y X) = kY Xk2 = (Yi xTi )2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) fY |X (Yi | Xi ) = L1 L2
i=1 i=1 i=1 If the (k k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) b = (X T X)1 X T Y
i=1 h i
V b | X n = 2 (X T X)1
n
( )
Y 1 X 2
n
L2 = fY |X (Yi | Xi ) exp 2 Yi (0 1 Xi )
2 i b N , 2 (X T X)1

i=1
Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
rb(x) = bj xj
n i=1 i j=1
18.2 Prediction Unbiased estimate for 2

Observe X = x of the covariate and want to predict their outcome Y . n
1 X 2
b2 =
= X b Y
Yb = b0 + b1 x n k i=1 i
h i h i h i h i
V Yb = V b0 + x2 V b1 + 2x Cov b0 , b1 mle
nk 2
Prediction interval
b=X b2 =

Pn 2
n
2 2 i=1 (Xi X )
n =
b b P 2j + 1
n i (Xi X) 1 Confidence interval
Yb z/2 bn bj z/2 se(
b bj )
21
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y for covariates X and let S J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) k
AIC(S) = `n (bS ,
Issues
Bayesian Information Criterion (BIC)
Underfitting: too few covariates yields high bias
Overfitting: too many covariates yields high variance k
bS2 )
BIC(S) = `n (bS , log n
Procedure 2
1. Assign a score to each model Validation and training

2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi (S) Yi )2 m = |{validation data}|, often or
i=1
4 2
H0 : j = 0 vs. H1 : j 6= 0 j J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi Ybi (S)
mspe = E (Yb (S) Y )2 R
bCV (S) = (Yi Yb(i) )2 =
i=1 i=1
1 Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )1 XS (hat matrix)
X X
R(S) = mspei = E (Ybi (S) Yi )2
i=1 i=1
Training error
n
R
btr (S) =
X
(Ybi (S) Yi )2
19 Non-parametric Function Estimation
i=1
2 19.1 Density Estimation
R Pn b 2
R i=1 (Yi (S) Y )
rss(S) btr (S) R
R2 (S) = 1 =1 =1 Estimate f (x), where f (x) = P [X A] = A
f (x) dx.
P n 2
i=1 (Yi Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z 2 Z
h i L(f, fbn ) = f (x) fn (x) dx = J(h) + f 2 (x) dx
b
E R btr (S) < R(S)
h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) R(S) = 2
b b Cov Ybi , Yi
i=1
h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n 1 rss
R2 (S) = 1
n k tss h i
Mallows Cp statistic b(x) = E fbn (x) f (x)
h i
R(S)
b =R 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
22
19.1.1 Histograms KDE
n
Definitions 1X1 x Xi
fbn (x) = K
n i=1 h h
Number of bins m
Z Z
1 4 00 2 1
1 R(f, fn ) (hK )
b (f (x)) dx + K 2 (x) dx
Binwidth h = m 4 nh
Bin Bj has j observations c
2/5 1/5 1/5
c2 c3
Z Z
h = 1 c = 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
Define pbj = j /n and pj = Bj f (u) du n1/5
1 K 2 3
Z 4/5 Z 1/5
c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (K ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x Bj )
j=1
h Epanechnikov Kernel
h i pj
E fbn (x) = (
3

h
4 5(1x2 /5)
|x| < 5
h i p (1 p ) K(x) =
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
1 6 n n n
1 X X Xi Xj
Z

h = 1/3 R 2Xb 2
2 du JbCV (h) = fbn2 (x) dx f(i) (Xi ) K + K(0)
n (f 0 (u)) n i=1 hn2 i=1 j=1 h nh
2/3 Z 1/3
b C 3 0 2
R (fn , f ) 2/3 C= (f (u)) du
n 4 Z
K (x) = K (2) (x) 2K(x) K (2) (x) = K(x y)K(y) dy
Cross-validation estimate of E [J(h)]
Z
2Xb
n
2 n+1 X 2
m 19.2 Non-parametric Regression
JbCV (h) = fbn2 (x) dx f(i) (Xi ) = pb
n i=1 (n 1)h (n 1)h j=1 j Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by
Yi = r(xi ) + i
19.1.2 Kernel Density Estimator (KDE)
E [i ] = 0
Kernel K V [i ] = 2
K(x) 0 k-nearest Neighbor Estimator

R
K(x) dx = 1

R
xK(x) dx = 0 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}

R 2 2
x K(x) dx K >0 k
i:xi Nk (x)
23
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
xxi

K {0, 1, . . . } = Z discrete
wi (x) = h [0, 1] {Xt : t T } T =
[0, )

Pn
K
xxj continuous
j=1 h
4 Z 2
h4 f 0 (x)
Z
2 2 00 0 Notations Xt , X(t)
R(brn , r) x K (x) dx r (x) + 2r (x) dx
4 f (x) State space X
Z 2R 2
K (x) dx Index set T
+ dx
nhf (x)
c1
h 1/5 20.1 Markov Chains
n
c2
R (b
rn , r) 4/5 Markov chain
n
P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ] n T, x X

n
X n
X (Yi rb(xi ))2 Transition probabilities
JbCV (h) = (Yi rb(i) (xi ))2 = !2
i=1 i=1 K(0) pij P [Xn+1 = j | Xn = i]
1 Pn xx
j
K
j=1 h pij (n) P [Xm+n = j | Xm = i] n-step
19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Approximation
J (i, j) element is pij
X X
r(x) = j j (x) j j (x) pij > 0
P
j=1 j=1 i pij = 1
Multivariate regression
Y = + Chapman-Kolmogorov

0 (x1 ) J (x1 ) X
.. .. .. pij (m + n) = pij (m)pkj (n)
where i = i and = . . . k
0 (xn ) J (xn )
Least squares estimator Pm+n = Pm Pn
b = (T )1 T Y
Pn = P P = Pn
1
T Y (for equally spaced observations only)
n Marginal probability
2 n = (n (1), . . . , n (N )) where i (i) = P [Xn = i]
n J
R
bCV (J) =
X
Yi
X
j (xi )bj,(i) 0 , initial distribution
i=1 j=1 n = 0 Pn
24
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] (s, t)
(s, t) = p =p
{Xt : t [0, )} = number of events up to and including time t V [xs ] V [xt ] (s, s)(t, t)
X0 = 0
Independent increments: Cross-covariance function (CCV)
t0 < < tn : Xt1 Xt0
Xtn Xtn1
xy (s, t) = E [(xs xs )(yt yt )]
Intensity function (t)
P [Xt+h Xt = 1] = (t)h + o(h) Cross-correlation function (CCF)
P [Xt+h Xt = 2] = o(h)
xy (s, t)
Xs+t Xs Po (m(s + t) m(s)) where m(t) =
Rt
(s) ds xy (s, t) = p
0 x (s, s)y (t, t)
Homogeneous Poisson process
Backshift operator
(t) = Xt Po (t) >0
B k (xt ) = xtk
Waiting times
Wt := time at which Xt occurs Difference operator

1 d = (1 B)d
Wt Gamma t,

Interarrival times White noise
St = Wt+1 Wt
2

1 wt wn(0, w )
St Exp iid 2

Gaussian: wt N 0, w
E [wt ] = 0 t T
St V [wt ] = 2 t T
w (s, t) = 0 s 6= t s, t T
Wt1 Wt t
Random walk
21 Time Series
Drift
Pt
Mean function Z
xt = t + j=1 wj
xt = E [xt ] = xft (x) dx E [xt ] = t

Autocovariance function Symmetric moving average
x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t k

X k
X
mt = aj xtj where aj = aj 0 and aj = 1
x (t, t) = E (xt t )2 = V [xt ]

j=k j=k
25
21.1 Stationary Time Series Sample variance
n
Strictly stationary 1 X |h|
V [
x] = 1 x (h)
n n
P [xt1 c1 , . . . , xtk ck ] = P [xt1 +h c1 , . . . , xtk +h ck ] h=n
k N, tk , ck , h Z Sample autocovariance function
Weakly stationary nh
1 X

b(h) = (xt+h x
)(xt x
)
E x2t < t Z n t=1
2
E xt = m t Z
x (s, t) = x (s + r, t + r) r, s, t Z Sample autocorrelation function
Autocovariance function

b(h)
b(h) =
(h) = E [(xt+h )(xt )] h Z
b(0)

(0) = E (xt )2
(0) 0 Sample cross-variance function
(0) |(h)|
nh
(h) = (h) 1 X

bxy (h) = (xt+h x
)(yt y)
n t=1
Autocorrelation function (ACF)
Cov [xt+h , xt ] (t + h, t) (h) Sample cross-correlation function

x (h) = p =p =
V [xt+h ] V [xt ] (t + h, t + h)(t, t) (0)

bxy (h)
Jointly stationary time series bxy (h) = p
bx (0)b
y (0)
xy (h) = E [(xt+h x )(yt y )]
Properties
xy (h)
xy (h) = p 1
x (0)y (h) bx (h) = if xt is white noise
n
Linear process 1
bxy (h) = if xt or yt is white noise

X
X n
xt = + j wtj where |j | <
j= j=

21.3 Non-Stationary Time Series
X
2
(h) = w j+h j Classical decomposition model
j=
xt = t + st + wt
21.2 Estimation of Correlation
Sample mean t = trend
n
1X st = seasonal component
x
= xt
n t=1 wt = random noise term
26
21.3.1 Detrending Moving average polynomial
Least squares (z) = 1 + 1 z + + q zq z C q 6= 0
2
1. Choose trend model, e.g., t = 0 + 1 t + 2 t
Moving average operator
2. Minimize rss to obtain trend estimate bt = b0 + b1 t + b2 t2
3. Residuals , noise wt (B) = 1 + 1 B + + p B p
Moving average MA (q) (moving average model order q)
1
The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + 1 wt1 + + q wtq xt = (B)wt
k q
1 X X
vt = xt1 E [xt ] = j E [wtj ] = 0
2k + 1
i=k j=0
Pk ( Pqh
1 2
If 2k+1 i=k wtj 0, a linear trend function t = 0 + 1 t passes
w j=0 j j+h 0hq
(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + wt1
t = 0 + 1 t = xt = 1
2 2
(1 + )w h = 0

2
21.4 ARIMA models (h) = w h=1

0 h>1

Autoregressive polynomial
(

(z) = 1 1 z p zp z C p 6= 0 2 h=1
(h) = (1+ )
0 h>1
Autoregressive operator
ARMA (p, q)
(B) = 1 1 B p B p
xt = 1 xt1 + + p xtp + wt + 1 wt1 + + q wtq
Autoregressive model order p, AR (p)
(B)xt = (B)wt
xt = 1 xt1 + + p xtp + wt (B)xt = wt
Partial autocorrelation function (PACF)
AR (1) xih1 , regression of xi on {xh1 , xh2 , . . . , x1 }
k1 hh = corr(xh xh1
h , x0 xh1
0 ) h2
X k,||<1 X
xt = k (xtk ) + j (wtj ) = j (wtj ) E.g., 11 = corr(x1 , x0 ) = (1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P j
d xt = (1 B)d xt is ARMA (p, q)
E [xt ] = j=0 (E [wtj ]) = 0
2 h
w (B)(1 B)d xt = (B)wt
(h) = Cov [xt+h , xt ] = 12
(h) Exponentially Weighted Moving Average (EWMA)
(h) = (0) = h
(h) = (h 1) h = 1, 2, . . . xt = xt1 + wt wt1
27

X Frequency index (cycles per unit time), period 1/
xt = (1 )j1 xtj + wt when || < 1
j=1
Amplitude A
Phase
n+1 = (1 )xn +
x xn
U1 = A cos and U2 = A sin often normally distributed rvs
Seasonal ARIMA
Periodic mixture
Denoted by ARIMA (p, d, q) (P, D, Q)s
q
P (B s )(B)D d s
s xt = + Q (B )(B)wt X
xt = (Uk1 cos(2k t) + Uk2 sin(2k t))
k=1
21.4.1 Causality and Invertibility
P Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rvs with variances k2
ARMA (p, q) is causal (future-independent) {j } : j=0 j < such that Pq
(h) = k=1 k2 cos(2k h)
Pq

X (0) = E x2t = k=1 k2
xt = wtj = (B)wt
j=0 Spectral representation of a periodic process
P
ARMA (p, q) is invertible {j } : j=0 j < such that (h) = 2 cos(20 h)
2 2i0 h 2 2i0 h
X = e + e
(B)xt = Xtj = wt 2 2
Z 1/2
j=0
= e2ih dF ()
Properties 1/2
ARMA (p, q) causal roots of (z) lie outside the unit circle Spectral distribution function

X (z)
j 0
< 0
(z) = j z = |z| 1
(z) F () = 2 /2 < 0
j=0
2
0
ARMA (p, q) invertible roots of (z) lie outside the unit circle
F () = F (1/2) = 0

X (z) F () = F (1/2) = (0)
(z) = j z j = |z| 1
j=0
(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models
X 1 1
AR (p) MA (q) ARMA (p, q) f () = (h)e2ih
2 2
h=
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P R 1/2
Needs h= |(h)| < = (h) = 1/2
e2ih f () d h = 0, 1, . . .
21.5 Spectral Analysis f () 0
f () = f ()
Periodic process f () = f (1 )
R 1/2
xt = A cos(2t + ) (0) = V [xt ] = 1/2 f () d
2
= U1 cos(2t) + U2 sin(2t) White noise: fw () = w
28
ARMA (p, q) , (B)xt = (B)wt : 22.2 Beta Function
Z 1
(x)(y)
|(e2i )|2
2 Ordinary: B(x, y) = B(y, x) = tx1 (1 t)y1 dt =
fx () = w 0 (x + y)
|(e2i )|2 Z x
a1 b1
Pp Pq Incomplete: B(x; a, b) = t (1 t) dt
where (z) = 1 k=1 k z k and (z) = 1 + k=1 k z k 0
Regularized incomplete:
Discrete Fourier Transform (DFT) a+b1
B(x; a, b) a,bN X (a + b 1)!
Ix (a, b) = = xj (1 x)a+b1j
n
X B(a, b) j=a
j!(a + b 1 j)!
d(j ) = n1/2 xt e2ij t
I0 (a, b) = 0 I1 (a, b) = 1
i=1
Ix (a, b) = 1 I1x (b, a)
Fourier/Fundamental frequencies
22.3 Series
j = j/n
Finite Binomial
Inverse DFT n n
n1 X n(n + 1) X n
= 2n
X
xt = n 1/2
d(j )e 2ij t k=
2 k
j=0 k=1 k=0
n n
X X r+k r+n+1
Periodogram (2k 1) = n2 =
I(j/n) = |d(j/n)|2 k n
k=1 k=0
n n
Scaled Periodogram
X n(n + 1)(2n + 1) X k n+1
k2 = =
6 m m+1
k=1 k=0
4 n
P (j/n) = I(j/n) X
n(n + 1)
2 Vandermondes Identity:
n k3 = r
m n

m+n

2
!2 !2 X
n n k=1 =
2X 2X n k rk r
= xt cos(2tj/n + xt sin(2tj/n cn+1 1 k=0
n t=1 n t=1
X
ck = c 6= 1 Binomial Theorem:
c1 n
n nk k
k=0
X
a b = (a + b)n
22 Math k
k=0
22.1 Gamma Function Infinite

Z

Ordinary: (s) = ts1 et dt X 1 X p
0 pk = , pk = |p| < 1
Z 1p 1p
k=0 k=1
Upper incomplete: (s, x) = ts1 et dt
!
X d X d 1 1
Z xx kpk1 = pk
= = |p| < 1
dp dp 1 p (1 p)2
Lower incomplete: (s, x) = ts1 et dt k=0 k=0
0
X r+k1 k
( + 1) = () >1 x = (1 x)r r N+
k
(n) = (n 1)! nN k=0

(0) = (1) = X k
p = (1 + p) |p| < 1 , C
(1/2) = k
k=0
(1/2) = 2(1/2)
29
22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R
Examples. Springer, 2006.
Sampling [4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie, Algebra.
Springer, 2001.
k out of n w/o replacement w/ replacement [5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und Statistik.
k1 Springer, 2002.
Y n!
ordered nk = (n i) = nk [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
i=0
(n k)!
nk

n n! n1+r n1+r
unordered = = =
k k! k!(n k)! r n1
Stirling numbers, 2nd kind

(
n n1 n1 n 1 n=0
=k + 1kn =
k k k1 0 0 else
Partitions
n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n 1 : Pn,0 = 0, P0,0 = 1
i=1
Balls and Urns f :BU D = distinguishable, D = indistinguishable.
|B| = n, |U | = m f arbitrary f injective f surjective f bijective

( (
mn m n

n n! m = n
B : D, U : D mn m!
0 else m 0 else
(
m+n1 m n1 1 m=n
B : D, U : D
n n m1 0 else
m
( (
X n 1 mn n 1 m=n
B : D, U : D
k 0 else m 0 else
k=1
m
( (
X 1 mn 1 m=n
B : D, U : D Pn,k Pn,m
k=1
0 else 0 else
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):4553, 2008.
30
Univariate distribution relationships, courtesy Leemis and McQueston [2].
31

Stat Cookbook

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Stat Cookbook

Hochgeladen von

Copyright:

Verfügbare Formate

Probability and Statistics

1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

0.5 0.25 0.1

0.0 0.00 0.0

0.00 0 0.0 0.0

0.75 0.75 0.75 0.75

0.50 0.50 0.50 0.50

0.25 0.25 0.25 0.25

0.25 0.25 0.25 0.25

Outcome (point or element) Bayes Theorem

Properties Probability Mass Function (PMF)

P [] = 0 fX (x) = P [X = x] = P [{ : X() = x}]

Continuous Sample mean

Properties of the mle Under appropriate regularity conditions

Null hypothesis H0 sup Ln () Ln (bn )

fX (x | ) = h(x) exp {()T (x) A()} Equal-tail credible interval

Highest posterior density (HPD) region Rn

Conjugate: f () and f ( | xn ) belong to the same parametric family

18.1 Simple Linear Regression

Residual sums of squares (rss)

Squared error loss: posterior mean b0 = Yn b1 Xn

18.2 Prediction Unbiased estimate for 2

1. Assign a score to each model Validation and training

K(x) 0 k-nearest Neighbor Estimator

P [Xn = x | X0 , . . . , Xn1 ] = P [Xn = x | Xn1 ] n T, x X

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Autocovariance function Symmetric moving average

x (s, t) = E [(xs s )(xt t )] = E [xs xt ] s t k

k N, tk , ck , h Z Sample autocovariance function

Cov [xt+h , xt ] (t + h, t) (h) Sample cross-correlation function

22.1 Gamma Function Infinite

Stirling numbers, 2nd kind

Balls and Urns f :BU D = distinguishable, D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective

Das könnte Ihnen auch gefallen