Multivariate Methods/Analysis
Lecture Notes
Lecturer
Samuel Iddi (PhD)
Department of Statistics
University of Ghana
isamuel@ug.edu.gh
Course Objective:
To understand problems associated with multi-dimensional data.
To study basic multivariate distribution theory and methods.
To discuss some fundamental and important multivariate statistical
techniques.
To understand inferential methods and how to apply them to real-life problems
arising from various scientific fields.
To learn methods for data reduction.
To implement and apply the techniques in R.
Textbook:
1 Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate
Statistical Analysis. 6th Ed. Prentice-Hall.
Reference:
1 Muirhead R. J. (2005). Aspects of Multivariate Statistical Theory. John
Wiley and Sons.
2 Hardle, W. K. and Simar, L. (2012). Applied Multivariate Statistical
Analysis. 3rd Ed. Springer.
3 Everitt, B. and Hothorn, T. (2011). An Introduction to Applied
Multivariate Analysis with R. Use R! Series. Springer.
Office hour:
Thursday: 10:30am to 2:00pm or by appointment.
Phone Number: 0506783155, Office: 208
E-mail: isamuel@ug.edu.gh; isamuel.gh@gmail.com (sending an email is the
easiest way to contact me)
Personal Website: www.samueliddi.com
Teaching Assistant
TA: Enoch Sakyi-Yeboah
Tutorial: Thursday, 9:30am to 10:30am
Phone number: 0274873770
E-mail: enochsakyi10@gmail.com
Grading
Homework Assignments (20%), Interim Assessment (30%) and Final Exams
(50%).
Guidelines
Homework should be submitted one week from the day it is assigned.
R program code should also be submitted via email.
Late submissions will not be accepted.
Duplicate solutions will not be graded.
The TA will discuss solutions during tutorial hours.
The interim assessment may take the form of a project, presentation and
defense.
Computing
The main software package for this course is R, version 2.3.1 or higher.
Install R from the website www.r-project.org. You may also use RStudio,
but only after you have installed R first.
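As a quick sanity check after installation, a minimal R session like the following can be used (the data here are purely illustrative):

    # Confirm the installed R version
    R.version.string
    # Generate a small bivariate data set and inspect basic summaries
    x <- matrix(rnorm(20), ncol = 2)
    colMeans(x)   # sample mean vector
    cov(x)        # sample covariance matrix
    cor(x)        # sample correlation matrix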
Introduction
This course concerns the use of statistical methods for describing and
analyzing multivariate data.
Multivariate Analysis
This refers to a set of techniques which allow for the presence of more
than one outcome variable.
The challenges of learning from large data sets have led to the development
and evolution of this field of statistical science.
Data Setup
Example 2.1
A selection of four receipts from the university bookstore was obtained in
order to investigate the nature of book sales. Each receipt provided, among
other things, the number of books sold and the total amount of each sale. Let
the first variable be the total Cedi sales and the second variable be the
number of books sold. Then we can regard the corresponding numbers on the
receipts as four measurements on two variables.
Solution 2.1
Using the notation just introduced, with n = 4 receipts and p = 2 variables, the four measurements on two variables form the data matrix
X = [ 42 4
      52 5
      48 4
      58 3 ]
with x11 = 42, x12 = 4, x21 = 52, and so on.
Descriptive statistics
Sample mean
x̄j = (1/n) Σ_{i=1}^n xij , j = 1, . . . , p.
Mean vector x̄ = (x̄1 , x̄2 , . . . , x̄p)′.
Sample variance
s²j = (1/n) Σ_{i=1}^n (xij − x̄j)².
A very common definition is
s²j = (1/(n − 1)) Σ_{i=1}^n (xij − x̄j)².
Remarks on correlation:
unitless
covariance of standardized variables
range −1 ≤ rjk ≤ 1
rjk = 0 implies no linear association
rjk > 0 implies the pair of variables tend to deviate from their
respective means in the same direction.
The correlation is invariant under the transformations
yij = a xij + b, i = 1, 2, . . . , n
yik = c xik + d, i = 1, 2, . . . , n
provided a and c have the same sign (ac > 0). What happens if ac < 0?
Example 2.2
Find the arrays x̄, Sn , R from the example data above.
Solution 2.2
x̄1 = (1/4) Σ_{i=1}^4 xi1 = (1/4)(42 + 52 + 48 + 58) = 50
x̄2 = (1/4) Σ_{i=1}^4 xi2 = (1/4)(4 + 5 + 4 + 3) = 4
x̄ = ( 50 )
    ( 4 )
s11 = (1/4) Σ_{i=1}^4 (xi1 − x̄1)²
    = (1/4)[(42 − 50)² + (52 − 50)² + (48 − 50)² + (58 − 50)²] = 34
s22 = 0.5
s12 = s21 = (1/4) Σ_{i=1}^4 (xi1 − x̄1)(xi2 − x̄2) = −1.5
and
Sn = [ 34  −1.5
      −1.5  0.5 ]
r12 = r21 = s12 / √(s11 s22) = −1.5 / √(34 × 0.5) = −0.36
so
R = [ 1     −0.36
     −0.36   1 ]
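These computations can be reproduced in R; a minimal sketch with the bookstore data (note that cov() and cor() use the divisor n − 1, so Sn is obtained by rescaling):

    X <- matrix(c(42, 4,
                  52, 5,
                  48, 4,
                  58, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    xbar <- colMeans(X)          # (50, 4)'
    Sn <- cov(X) * (n - 1) / n   # divisor-n covariance: [[34, -1.5], [-1.5, 0.5]]
    R <- cor(X)                  # [[1, -0.36], [-0.36, 1]]
    xbar; Sn; R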
Euclidean Distance
Consider the point P = (x1 , x2) in the plane. The straight-line distance d(O, P) from P to the origin O = (0, 0) is, according to the Pythagorean theorem,
d(O, P) = √(x1² + x2²)
Statistical Distance
For P = (x1 , x2) and O = (0, 0), the statistical distance is computed from the standardized coordinates x1* = x1/√s11 and x2* = x2/√s22 as
d(O, P) = √[(x1/√s11)² + (x2/√s22)²] = √(x1²/s11 + x2²/s22)
If s11 = s22 , it is convenient to ignore the common divisor and use the
Euclidean distance.
Statistical Distance
d²(O, P) = x1²/s11 + x2²/s22 = c²
All points with coordinates (x1 , x2) at constant squared distance c² from the origin must satisfy x1²/s11 + x2²/s22 = c².
Alternative formulation
Alternatively, let P = (x11 , x12 , . . . , x1p) and Q = (x21 , x22 , . . . , x2p). The Euclidean distance is given by
d(P, Q) = √[(x11 − x21)² + (x12 − x22)² + · · · + (x1p − x2p)²]
        = √[ Σ_{j=1}^p (x1j − x2j)² ] = √[(x1 − x2)′(x1 − x2)]
        = ||x1 − x2||
Drawback
◦ All components contribute equally, although there may be random
fluctuations of a different magnitude in the components.
◦ Different pairs of components may be correlated differently.
Alternative formulation
Sy = (1/n) Σ_{i=1}^n C(xi − x̄)(xi − x̄)′C′ = C Sx C′
Requiring Sy to be the identity is satisfied by S = Sx = (C′C)⁻¹.
Measuring the distance between P and Q can then be performed as a Euclidean distance in the coordinate system yi = Cxi , since the transformed variables are uncorrelated with unit variance:
d²(P, Q) = (y1 − y2)′(y1 − y2)
         = [C(x1 − x2)]′[C(x1 − x2)]
         = (x1 − x2)′C′C(x1 − x2)
         = (x1 − x2)′Sx⁻¹(x1 − x2)
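In R, the base function mahalanobis() computes exactly this squared statistical distance; a minimal sketch reusing the bookstore data:

    X <- matrix(c(42, 4, 52, 5, 48, 4, 58, 3), ncol = 2, byrow = TRUE)
    # Squared statistical distance of each observation from the mean
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
    d2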
Definition 3.1
Let A be a square matrix of dimension k × k. If there exist a scalar λ and a k × 1 nonzero vector x (x ≠ 0) such that
Ax = λx,
then λ is called an eigenvalue of A and x an associated eigenvector.
Example 3.1
Let A = [ 1 0
          1 3 ].
Find the eigenvalues and associated eigenvectors of A.
Solution 3.1
|A − λI| = | 1−λ   0
             1    3−λ | = (1 − λ)(3 − λ) = 0,
so the eigenvalues are λ = 1 and λ = 3. For λ = 3, Ax = 3x gives
x1 = 3x1
x1 + 3x2 = 3x2
which implies that x1 = 0 and x2 = 1 (arbitrarily); hence x = (0, 1)′ is the
eigenvector corresponding to the eigenvalue 3.
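The eigenvalues and eigenvectors of Example 3.1 can be checked with R's eigen() (which scales eigenvectors to unit length):

    A <- matrix(c(1, 0,
                  1, 3), nrow = 2, byrow = TRUE)
    e <- eigen(A)
    e$values    # 3 and 1
    e$vectors   # columns are the normalized eigenvectors; the first is (0, 1)'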
Definition 3.2
A quadratic form Q(x) in the k variables x1 , x2 , . . . , xk is Q(x) = x′Ax, where
x′ = (x1 , x2 , . . . , xk) and A is a k × k symmetric matrix.
Note that a quadratic form can be written as Q(x) = Σ_{i=1}^k Σ_{j=1}^k aij xi xj . For example,
Q(x) = (x1 , x2) [ 1 1
                   1 1 ] (x1 , x2)′ = x1² + 2x1x2 + x2².
Singular value decomposition: A = UΛV′.
The expected value of a random matrix X = (Xij) is taken element-wise:
E(X) = [ E(X11) E(X12) . . . E(X1p)        [ E(X′1)
         E(X21) E(X22) . . . E(X2p)    =     E(X′2)
         ...                                 ...
         E(Xn1) E(Xn2) . . . E(Xnp) ]        E(X′n) ]
where
E(Xij) = ∫_{−∞}^{∞} xij fij(xij) dxij   if Xij is a continuous random variable on ℝ
E(Xij) = Σ_{xij} xij pij(xij)           if Xij is a discrete random variable
with f the probability density function (pdf) and p the probability mass
function.
Example 3.2
Consider the random vector X′ = (X1 , X2). Let the discrete random variable X1 have probability function
x1:       −1    0    1
p1(x1):   0.3  0.3  0.4
and let X2 have probability function
x2:       0    1
p2(x2):   0.8  0.2
Find E(X).
Solution 3.2
E(X) = ( E(X1) ) = ( Σ_{x1} x1 p1(x1) ) = ( −1 × 0.3 + 0 × 0.3 + 1 × 0.4 )
       ( E(X2) )   ( Σ_{x2} x2 p2(x2) )   ( 0 × 0.8 + 1 × 0.2 )
Thus, E(X) = ( 0.1 )
             ( 0.2 )
Immediate results:
Covariance Matrices
where fjk(xj , xk) and pjk(xj , xk) are the joint density function and joint
probability mass function of (Xj , Xk), respectively.
The population variance-covariance matrix is defined as
Σ = Cov(X) = E(X − µ)(X − µ)′ = [ σ11 σ12 . . . σ1p
                                  σ21 σ22 . . . σ2p
                                  ...
                                  σp1 σp2 . . . σpp ]
Example 3.3
Consider the previous example. Let the joint probabilities of X1 and X2 , p12(x1 , x2), be the entries in the following table.
              x2
x1            0      1      p1(x1)
−1            0.24   0.06   0.3
0             0.16   0.14   0.3
1             0.40   0.00   0.4
p2(x2)        0.8    0.2    1
Find the covariance matrix for the two random variables X1 and X2.
Solution 3.3
We have already shown that µ1 = E(X1) = 0.1 and µ2 = E(X2) = 0.2. Further,
σ11 = E(X1²) − µ1² = (−1)²(0.3) + 0²(0.3) + 1²(0.4) − (0.1)² = 0.69
σ22 = E(X2²) − µ2² = 0²(0.8) + 1²(0.2) − (0.2)² = 0.16
σ12 = E(X1X2) − µ1µ2 = (−1)(1)(0.06) + (1)(1)(0.00) − (0.1)(0.2) = −0.08
so that
Σ = [ 0.69  −0.08
     −0.08   0.16 ]
Example 3.4
Suppose Σ = [ 4  1  2       [ σ11 σ12 σ13
              1  9 −3    =    σ21 σ22 σ23
              2 −3 25 ]       σ31 σ32 σ33 ]
Obtain V^{1/2} and ρ.
Solution 3.4
V^{1/2} = diag(√σ11 , √σ22 , √σ33) = diag(2, 3, 5)
and (V^{1/2})⁻¹ = diag(1/2, 1/3, 1/5). Thus, the correlation matrix ρ is given by
ρ = (V^{1/2})⁻¹ Σ (V^{1/2})⁻¹ = [ 1     1/6   1/5
                                  1/6   1    −1/5
                                  1/5  −1/5   1 ]
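In R, cov2cor() performs exactly this conversion from a covariance matrix to a correlation matrix:

    Sigma <- matrix(c(4,  1,  2,
                      1,  9, -3,
                      2, -3, 25), nrow = 3, byrow = TRUE)
    V_half <- diag(sqrt(diag(Sigma)))                # V^{1/2} = diag(2, 3, 5)
    rho <- solve(V_half) %*% Sigma %*% solve(V_half)
    all.equal(rho, cov2cor(Sigma), check.attributes = FALSE)  # TRUE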
These results also hold if the population quantities are replaced by their
appropriate sample counterparts.
or
Z = ( Z1 )   [ c11 c12 . . . c1p    ( X1 )
    ( Z2 ) =   c21 c22 . . . c2p      ( X2 )   = CX
    ( ⋮ )      ...                    ( ⋮ )
    ( Zq )     cq1 cq2 . . . cqp ]    ( Xp )
The linear combinations Z = CX have mean vector E(Z) = CµX and covariance matrix Cov(Z) = CΣX C′.
Example 3.5
Let X′ = (X1 , X2) be a random vector with mean vector µX = (µ1 , µ2)′ and variance-covariance matrix
ΣX = [ σ11 σ12
       σ12 σ22 ]
Find the mean vector and covariance matrix for the linear combinations
Z1 = X1 − X2
Z2 = X1 + X2
Solution 3.5
Z = ( Z1 ) = [ 1 −1 ] ( X1 ) = CX
    ( Z2 )   [ 1  1 ] ( X2 )
µZ = CµX = ( µ1 − µ2 )
           ( µ1 + µ2 )
Cov(Z) = CΣX C′ = [ 1 −1 ] [ σ11 σ12 ] [ 1  1 ]
                  [ 1  1 ] [ σ12 σ22 ] [−1  1 ]
       = [ σ11 − 2σ12 + σ22    σ11 − σ22
           σ11 − σ22           σ11 + 2σ12 + σ22 ]
Matrix Inequalities
Maximization
max_{x≠0} (x′d)² / (x′Bx) = d′B⁻¹d
with the maximum attained when x = cB⁻¹d for some nonzero constant c.
Maximization of quadratic forms: For a p × p positive definite matrix B with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and corresponding eigenvectors ei , i = 1, 2, . . . , p:
◦ max_{x≠0} x′Bx/x′x = λ1 (attained when x = e1)
◦ min_{x≠0} x′Bx/x′x = λp (attained when x = ep)
◦ max_{x⊥e1 ,...,ek} x′Bx/x′x = λ_{k+1} (attained when x = e_{k+1} , k = 1, 2, . . . , p − 1)
Sample Geometry
The projection of yj on the equal-components vector 1n is
(y′j 1n / 1′n 1n) 1n = x̄j 1n
Denote the deviation vector from the mean vector by
dj = yj − x̄j 1n = (x1j − x̄j , x2j − x̄j , . . . , xnj − x̄j)′
The sum of squared deviations (sum of squared residuals):
L²j = d′j dj = Σ_{i=1}^n (xij − x̄j)² = (n − 1)sjj
Also, the cross-product (dot product) of the jth and kth deviation vectors:
d′j dk = Σ_{i=1}^n (xij − x̄j)(xik − x̄k) = (n − 1)sjk
The angle between deviation vectors dj and dk is obtained from
cos(θjk) = d′j dk / [(d′j dj)(d′k dk)]^{1/2} = (n − 1)sjk / [(n − 1)sjj (n − 1)skk]^{1/2} = sjk / (sjj skk)^{1/2} = rjk
Thus, the cosine of the angle between two deviation vectors is the
correlation between the two random variables Xj , Xk.
The following results can be observed:
◦ if θjk = 0 then rjk = 1, i.e. if the deviation vectors are in the same
direction, the two random variables have a perfect positive linear correlation.
◦ if θjk = π/2, then rjk = 0, i.e. if the two deviation vectors are orthogonal,
the correlation between them is zero.
◦ if θjk = π, i.e. if they move in opposite directions, then rjk = −1.
Result 4.1
Assume that we take a random sample X1 , X2 , . . . , Xn from a multivariate
population with unknown mean vector µ and covariance matrix Σ. Then,
1 X̄ is an unbiased estimator for µ and has covariance matrix (1/n)Σ.
2 S_{n−1} = [n/(n − 1)] Sn is an unbiased estimator for Σ.
3 Sn is a biased estimator for Σ; the bias is equal to −(1/n)Σ.
Proof 4.1
(1) E(X̄) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n)(nµ) = µ
Cov(X̄) = E(X̄ − µ)(X̄ − µ)′
        = E{ [(1/n) Σ_{i=1}^n (Xi − µ)] [(1/n) Σ_{i=1}^n (Xi − µ)]′ }
        = (1/n²) E{ [(X1 − µ) + · · · + (Xn − µ)] [(X1 − µ)′ + · · · + (Xn − µ)′] }
        = (1/n²) E[ (X1 − µ)(X1 − µ)′ + · · · + (Xn − µ)(Xn − µ)′ ]   (cross terms vanish by independence)
        = (1/n²)(Σ + · · · + Σ) = (1/n²) nΣ = (1/n)Σ
(2) (n − 1)E(S_{n−1}) = E[ Σ_{i=1}^n Xi X′i − nX̄X̄′ ] = Σ_{i=1}^n E(Xi X′i) − nE(X̄X̄′)
but Cov(Xi) = Σ = E(Xi − µ)(Xi − µ)′ = E(Xi X′i) − µµ′, so E(Xi X′i) = Σ + µµ′;
similarly Cov(X̄) = (1/n)Σ = E(X̄X̄′) − µµ′, so E(X̄X̄′) = (1/n)Σ + µµ′.
Hence
(n − 1)E(S_{n−1}) = n(Σ + µµ′) − n[(1/n)Σ + µµ′] = nΣ + nµµ′ − Σ − nµµ′ = (n − 1)Σ
⇒ E(S_{n−1}) = Σ.
(3) Now E(Sn) = [(n − 1)/n]Σ, so
bias = E(Sn) − Σ = [(n − 1)/n]Σ − Σ = −(1/n)Σ.
The biased estimator is Sn = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′. Note that, as n → ∞, Sn
becomes unbiased (in the limit). We introduce a special notation for the unbiased
estimator:
S = [n/(n − 1)] Sn = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′.
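The relation between the two estimators is easy to verify numerically in R (a sketch with simulated data):

    set.seed(1)
    X <- matrix(rnorm(50 * 3), ncol = 3)
    n <- nrow(X)
    S  <- cov(X)            # unbiased estimator, divisor n - 1
    Sn <- S * (n - 1) / n   # biased (maximum likelihood type) estimator, divisor n
    all.equal(S, Sn * n / (n - 1))   # TRUE: S = n/(n-1) Sn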
Generalized variance
With a single variable, the sample variance is often used to describe the
amount of variation in the measurements on that variable.
When p variables are observed on each unit, the variation is described by
the sample variance-covariance matrix
S = [ s11 s12 . . . s1p
      s21 s22 . . . s2p
      ...
      sp1 sp2 . . . spp ]  ,  sjk = [1/(n − 1)] Σ_{i=1}^n (xij − x̄j)(xik − x̄k).
As single-number summaries of multivariate variability, the generalized sample variance is defined as the determinant |S|, and the total sample variance as the trace s11 + s22 + · · · + spp.
Example 4.1
The sample covariance matrix obtained from a data set with two variables is given by
S = [ 252.04  −68.43
      −68.43  123.67 ]
Evaluate the generalized and total sample variance.
Solution 4.1
|S| = (252.04)(123.67) − (−68.43)² = 31169.79 − 4682.66 = 26487.13
Total sample variance = s11 + s22 = 252.04 + 123.67 = 375.71.
Generalized variance
Example 4.2
Find the generalized variance and total variance of the population covariance matrices
Σ1 = [ 2 1     and   Σ2 = [ 2 −1
       1 3 ]               −1  3 ]
Comment.
Solution 4.2
tr(Σ1) = tr(Σ2) = 5 and |Σ1| = |Σ2| = 5. The two covariance matrices have
the same total variance and generalized variance, but clearly the two
matrices are different.
Exercise 4.1
(a) Verify the relationship between |S| and |R| using
S = [ 4 3 1
      3 9 2
      1 2 1 ]
Exercise 4.2
(b) Using S = D^{1/2} R D^{1/2}, show that |S| = (s11 s22 · · · spp)|R|.
(c) Show that the generalized variance |S| = 0 for
X = [ 1 2 5
      4 1 6
      4 0 4 ]
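Exercise 4.1 and the determinant relation can be checked numerically in R:

    S <- matrix(c(4, 3, 1,
                  3, 9, 2,
                  1, 2, 1), nrow = 3, byrow = TRUE)
    R <- cov2cor(S)
    det(S)                    # generalized variance |S|
    prod(diag(S)) * det(R)    # equals |S|, since |S| = (s11 s22 s33)|R|
    sum(diag(S))              # total variance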
Sample Mean
x̄ = ( (1/n) Σ_{i=1}^n xi1 )          ( x11 + x21 + · · · + xn1 )
    ( (1/n) Σ_{i=1}^n xi2 )  = (1/n) ( x12 + x22 + · · · + xn2 )
    ( ⋮ )                            ( ⋮ )
    ( (1/n) Σ_{i=1}^n xip )          ( x1p + x2p + · · · + xnp )
  = (1/n) [ x11 x21 . . . xn1    ( 1 )
            x12 x22 . . . xn2      ( 1 )
            ...                    ( ⋮ )
            x1p x2p . . . xnp ]    ( 1 )
  = (1/n) X′1n
⇒ x̄′ = (1/n) 1′n X
⇒ 1n x̄′ = (1/n) 1n 1′n X = (1/n) Jn X
Sample Covariance
S = [1/(n − 1)] (X − 1n x̄′)′(X − 1n x̄′)
  = [1/(n − 1)] [X − (1/n)Jn X]′ [X − (1/n)Jn X]
  = [1/(n − 1)] X′[In − (1/n)Jn]′[In − (1/n)Jn]X
  = [1/(n − 1)] X′[In − (1/n)Jn]X
since In − (1/n)Jn is symmetric and idempotent.
Let D = diag(sjj). Then
R = D^{−1/2} S D^{−1/2}
S = D^{1/2} R D^{1/2}
Observe that these equations are analogous to their population counterparts.
Result 4.2
The linear combinations
b′X = b1X1 + b2X2 + · · · + bpXp
c′X = c1X1 + c2X2 + · · · + cpXp
have sample means, variances, and covariances that are related to x̄ and S by
Sample mean of b′X = b′x̄
Sample mean of c′X = c′x̄
Sample variance of b′X = b′Sb
Sample variance of c′X = c′Sc
Sample covariance of b′X and c′X = b′Sc
Example 4.3
Consider the data array
X = [ 42 4
      52 5
      48 4
      58 3 ]
Find x̄, Sn , R.
Solution 4.3
x̄ = (1/n) X′1n = (1/4) [ 42 52 48 58 ] ( 1 )   = ( 50 )
                        [ 4  5  4  3 ]  ( 1 )     ( 4 )
                                        ( 1 )
                                        ( 1 )
Sn = (1/n) X′[In − (1/n)Jn]X = (1/4) X′[I4 − (1/4)J4]X = [ 34  −1.5
                                                          −1.5  0.5 ]
consistent with Example 2.2, which also used the divisor n = 4.
R = D^{−1/2} Sn D^{−1/2}, where D^{1/2} = diag(√34, √0.5) = diag(5.83, 0.71) and D^{−1/2} = diag(0.17, 1.41), so
R = [ 0.17 0    ] [ 34   −1.5 ] [ 0.17 0    ]  = [ 1     −0.36
    [ 0    1.41 ] [ −1.5  0.5 ] [ 0    1.41 ]      −0.36   1 ]
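The same matrix operations can be carried out directly in R (a sketch mirroring the formulas above):

    X <- matrix(c(42, 4, 52, 5, 48, 4, 58, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    one <- rep(1, n)
    J <- one %*% t(one)                          # J_n = 1_n 1_n'
    xbar <- t(X) %*% one / n                     # (1/n) X'1_n = (50, 4)'
    Sn <- t(X) %*% (diag(n) - J / n) %*% X / n   # divisor-n covariance matrix
    Dih <- diag(1 / sqrt(diag(Sn)))              # D^{-1/2}
    R <- Dih %*% Sn %*% Dih                      # correlation matrix
    xbar; Sn; R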
Multivariate Normal Distribution
Introduction
The p-dimensional normal density is
f(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)′Σ⁻¹(x − µ) }
where (x − µ)′Σ⁻¹(x − µ) is the squared generalized (statistical) distance from x to µ. We write
X ∼ Np(µ, Σ).
For the bivariate case (p = 2), with σ12 = ρ12 √σ11 √σ22 ,
(x − µ)′Σ⁻¹(x − µ)
= (x1 − µ1 , x2 − µ2) [1/(σ11σ22 − σ12²)] [ σ22             −ρ12 √σ11 √σ22    ( x1 − µ1 )
                                           −ρ12 √σ11 √σ22   σ11 ]             ( x2 − µ2 )
= [1/(σ11σ22(1 − ρ12²))] [ (x1 − µ1)² σ22 − 2ρ12 √σ11 √σ22 (x1 − µ1)(x2 − µ2) + (x2 − µ2)² σ11 ]
= [1/(1 − ρ12²)] [ ((x1 − µ1)/√σ11)² + ((x2 − µ2)/√σ22)² − 2ρ12 ((x1 − µ1)/√σ11)((x2 − µ2)/√σ22) ]
so that the bivariate normal density is
f(x1 , x2) = [1 / (2π √(σ11σ22(1 − ρ12²)))]
× exp{ −[1/(2(1 − ρ12²))] [ ((x1 − µ1)/√σ11)² + ((x2 − µ2)/√σ22)² − 2ρ12 ((x1 − µ1)/√σ11)((x2 − µ2)/√σ22) ] }
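The bivariate normal density can be evaluated and its elliptical contours drawn in R; a minimal sketch assuming the contributed package mvtnorm (whose dmvnorm() evaluates the Np density) is installed:

    library(mvtnorm)
    mu <- c(0, 0)
    Sigma <- matrix(c(1, 0.5,
                      0.5, 1), nrow = 2)   # sigma11 = sigma22 = 1, rho12 = 0.5
    x <- seq(-3, 3, length.out = 60)
    z <- outer(x, x, function(a, b) dmvnorm(cbind(a, b), mean = mu, sigma = Sigma))
    contour(x, x, z)   # contours of constant density are ellipses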
Example 5.1
For X distributed as N3(µ, Σ), find the distribution of
( X1 − X2 )  = [ 1 −1  0 ] ( X1 )
( X2 − X3 )    [ 0  1 −1 ] ( X2 )   = AX
                           ( X3 )
Let Y = (Y1 , Y2)′ = AX. Then
Y ∼ N2(Aµ, AΣA′)
and
E(Y) = AE(X) = Aµ = [ 1 −1  0 ] ( µ1 )   = ( µ1 − µ2 )
                    [ 0  1 −1 ] ( µ2 )     ( µ2 − µ3 )
                                ( µ3 )
Properties
Note that with this assignment X, µ and Σ can respectively be rearranged and
partitioned as
X = ( X2 )    µ = ( µ2 )    Σ = [ σ22 σ24 | σ12 σ23 σ25
    ( X4 )        ( µ4 )          σ24 σ44 | σ14 σ34 σ45
    ( −− )        ( −− )          −−−−−−−−−−−−−−−−−−−−−
    ( X1 )        ( µ1 )          σ12 σ14 | σ11 σ13 σ15
    ( X3 )        ( µ3 )          σ23 σ34 | σ13 σ33 σ35
    ( X5 )        ( µ5 )          σ25 σ45 | σ15 σ35 σ55 ]
Thus X(1) = (X2 , X4)′ has the distribution
N2(µ(1) , Σ11) = N2( ( µ2 ) , [ σ22 σ24 )
                     ( µ4 )     σ24 σ44 ] )
It is clear from this example that the normal distribution for any subset can be
expressed by simply selecting the appropriate means and covariances from the
original µ and Σ.
Properties
See that X(1) = (X1 , X2)′ and X3 have covariance matrix Σ12 = (0, 0)′. Therefore
(X1 , X2) and X3 are independent. This implies that X3 is independent of X1
and also of X2.
Example 5.4 (Result)
Let ( X1 ) ∼ Np( ( µ1 ) , [ Σ11 | Σ12 )
    ( X2 )       ( µ2 )     Σ21 | Σ22 ] )
Then (X1 | X2 = x2) ∼ Nq(µ1|2 , Σ1|2) with
µ1|2 = µ1 + Σ12 Σ22⁻¹ (x2 − µ2)   and   Σ1|2 = Σ11 − Σ12 Σ22⁻¹ Σ21.
Conditional distribution
Indirect proof
Let A = [ I  −Σ12 Σ22⁻¹
          0   I ]
We know that X − µ ∼ N(0, Σ). This implies
A(X − µ) = ( X1 − µ1 − Σ12 Σ22⁻¹ (X2 − µ2) )  ∼ N(0, AΣA′)
           ( X2 − µ2 )
but
AΣA′ = [ I  −Σ12 Σ22⁻¹ ] [ Σ11 Σ12 ] [ I           0 ]
       [ 0   I         ] [ Σ21 Σ22 ] [ −Σ22⁻¹ Σ21  I ]
Conditional distribution
Thus,
AΣA′ = [ Σ11 − Σ12 Σ22⁻¹ Σ21    0
         0                      Σ22 ]
This implies that X1 − µ1 − Σ12 Σ22⁻¹(X2 − µ2) and X2 − µ2 are independent, with
X1 − µ1 − Σ12 Σ22⁻¹(X2 − µ2) ∼ N(0, Σ11 − Σ12 Σ22⁻¹ Σ21).
Conditioning on X2 = x2 therefore gives
(X1 | X2 = x2) ∼ N( µ1 + Σ12 Σ22⁻¹(x2 − µ2) , Σ11 − Σ12 Σ22⁻¹ Σ21 )
Hence,
(X1 | X2 = x2) ∼ N(µ1|2 , Σ1|2)
or, in the bivariate case,
f(x1 | x2) = f(x1 , x2)/f(x2)
= [1/(√(2π) √(σ11(1 − ρ12²)))] exp{ −[x1 − µ1 − (σ12/σ22)(x2 − µ2)]² / [2σ11(1 − ρ12²)] }
Distributional properties
Observe that
Σ_{i=1}^n (xi − µ)′Σ⁻¹(xi − µ) = Σ_{i=1}^n tr[(xi − µ)′Σ⁻¹(xi − µ)]
                               = Σ_{i=1}^n tr[Σ⁻¹(xi − µ)(xi − µ)′]
                               = tr[ Σ⁻¹ Σ_{i=1}^n (xi − µ)(xi − µ)′ ]
Now,
(xi − µ)(xi − µ)′ = (xi − x̄ + x̄ − µ)(xi − x̄ + x̄ − µ)′ = [(xi − x̄) + (x̄ − µ)][(xi − x̄) + (x̄ − µ)]′
thus,
Σ_{i=1}^n (xi − µ)(xi − µ)′ = Σ_{i=1}^n (xi − x̄)(xi − x̄)′ + n(x̄ − µ)(x̄ − µ)′ + 2 Σ_{i=1}^n (xi − x̄)(x̄ − µ)′
                            = Σ_{i=1}^n (xi − x̄)(xi − x̄)′ + n(x̄ − µ)(x̄ − µ)′
since Σ_{i=1}^n (xi − x̄) = 0, so the cross term vanishes.
The most likely parameter value is defined as the one yielding the maximum
likelihood L, given the data.
The corresponding parameter value is called the maximum likelihood
estimator or MLE (which is a random variable and a function of the data).
It is common practice to maximize the log-likelihood.
In our case, the log-likelihood is
l(µ, Σ) = log L(µ, Σ) = −(np/2) log(2π) − (n/2) log |Σ|
− (1/2) tr[ Σ⁻¹ Σ_{i=1}^n (xi − x̄)(xi − x̄)′ ] − (n/2)(x̄ − µ)′Σ⁻¹(x̄ − µ)
Covariance Estimation
Theorem 5.2
Let X1 , X2 , . . . , Xn be independent and identically distributed Np(µ, Σ). Then
the maximum likelihood estimators for µ and Σ are
µ̂ = X̄   and   Σ̂ = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′ = [(n − 1)/n] S = Sn
Proof
1. We have derived the result for the mean.
2. The likelihood for µ = µ̂ becomes
L(µ̂, Σ) = [1/((2π)^{np/2} |Σ|^{n/2})] exp{ −(1/2) tr[Σ⁻¹ nSn] }
This quantity is independent of µ̂. Furthermore, it is a simple function of Sn.
Covariance Estimation
If we apply the previous lemma with b = n/2 and B = nSn , then the
maximum value is achieved for
Σ̂ = [1/(2b)](nSn) = (1/n)(nSn) = Sn
Observe that the MLE for the mean is unbiased, but the MLE for the covariance
matrix is biased.
Substituting µ̂ and Σ̂ in L(µ, Σ) yields
L(µ̂, Σ̂) = [1/(2π)^{np/2}] exp{ −(1/2) tr[Σ̂⁻¹ nΣ̂] } · (1/|Σ̂|^{n/2})
         = [1/(2π)^{np/2}] e^{−n tr[Ip]/2} · (1/|Σ̂|^{n/2}) = [1/(2π)^{np/2}] e^{−np/2} · (1/|Σ̂|^{n/2})
Sufficient Statistics
Under multivariate normality, X̄ and Sn are sufficient statistics for µ and Σ.
Sampling distributions
X̄ ∼ Np(µ, (1/n)Σ), and (n − 1)S has a Wishart distribution with n − 1
degrees of freedom. By the multivariate central limit theorem, √n(X̄ − µ) is
approximately Np(0, Σ) and n(X̄ − µ)′S⁻¹(X̄ − µ) is approximately χ²p
for large n.
Note:
n should be large compared to p.
Assessing normality
QQ-Plot
Note that Φ(z) has no closed-form solution, and so numerical results from
standard normal tables are used.
[Figure: QQ-plot of the sample quantiles x(i) against the standard normal quantiles.]
The pairs of points lie very nearly along a straight line, and we would not
reject the notion that these data are normally distributed, particularly
with a sample size as small as n = 10.
As a rule of thumb: do not trust this procedure in samples of size n ≤ 20
There can be quite a bit of variability in the straightness of the QQ-plot
for small samples even when the observations are known to come from a
normal population.
Is the QQ-plot straight?
If it is straight, the correlation between the pairs would be 1.
We can carry out a test based on the correlation coefficient.
Correlation coefficient
rQ = Σ_{j=1}^n (x(j) − x̄)(q(j) − q̄) / { √[ Σ_{j=1}^n (x(j) − x̄)² ] √[ Σ_{j=1}^n (q(j) − q̄)² ] }
rQ = 8.584 / (√8.472 √8.795) = 0.994
The test for normality at the 10% level of significance is provided by
referring rQ = 0.994 to the table entry corresponding to n = 10 and
α = 0.10. This entry is 0.9351.
Since rQ > 0.9351, we do not reject the hypothesis of normality.
As an alternative, the Shapiro-Wilk test statistic may be computed.
For large sample sizes, the two statistics are nearly the same, so either can
be used to judge lack of normality.
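In R, the QQ-plot, the correlation coefficient rQ, and the Shapiro-Wilk alternative are all readily available (the sample below is illustrative):

    x <- rnorm(10)                     # illustrative sample; replace with the data
    q <- qqnorm(x, plot.it = FALSE)    # q$x: normal quantiles, q$y: sample quantiles
    rQ <- cor(q$x, q$y)                # correlation of the QQ pairs
    rQ
    qqnorm(x); qqline(x)               # visual check for straightness
    shapiro.test(x)                    # Shapiro-Wilk test of normality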
To check multivariate normality, compute for each observation the squared generalized distance d²ᵢ = (xᵢ − x̄)′S⁻¹(xᵢ − x̄), the sample analogue of (x − µ)′Σ⁻¹(x − µ).
Example
Company    x1 = sales (mil dollars)    x2 = profits (mil dollars)
1          126874                      4224
2          96933                       3835
3          86656                       3510
4          63438                       3758
5          55264                       3939
6          50976                       1809
7          39069                       2946
8          36156                       359
9          35209                       2480
10         32416                       2413
Warnings:
◦ the χ²p distribution is only an approximation; an F distribution can be used
instead.
◦ the quantities are not independent.
Order the squared distances:
d²(1) ≤ · · · ≤ d²(n)
Graph the pairs
( χ²p((i − 1/2)/n) , d²(i) )
i.e. tabulate i, (i − 1/2)/n, d²(i) and the corresponding χ²p quantile q²((i − 1/2)/n).
[Figure: chi-square plot of the ordered distances d²(i) against the χ²2 quantiles.]
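A chi-square plot like the one above can be produced in R as follows (a sketch; X is an n × p data matrix, here simulated):

    X <- matrix(rnorm(50 * 2), ncol = 2)        # illustrative bivariate data
    n <- nrow(X); p <- ncol(X)
    d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared generalized distances
    q <- qchisq((1:n - 0.5) / n, df = p)        # chi-square quantiles
    plot(q, sort(d2),
         xlab = "chi-square quantile", ylab = "ordered distances")
    abline(0, 1)   # points near this line support multivariate normality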
Outliers
Transformations
Introduction
Hotelling’s T 2
Example 5.7
Let the data matrix for a random sample of size n = 3 from a bivariate
normal population be
X = [ 6  9
      10 6
      8  3 ]
Evaluate T² for µ′0 = [9, 5]. What is the sampling distribution of T² in this case?
Hotelling's T²
Solution
X̄ = ( x̄1 ) = ( (6 + 10 + 8)/3 )  = ( 8 )
    ( x̄2 )   ( (9 + 6 + 3)/3 )     ( 6 )
and
S = [ 4 −3
     −3  9 ]
Hotelling's T²
Thus
S⁻¹ = [1/(4(9) − (−3)(−3))] [ 9 3    = [ 1/3  1/9
                              3 4 ]      1/9  4/27 ]
T² = 3 [8 − 9, 6 − 5] [ 1/3  1/9    ( 8 − 9 )  = 3 [−1, 1] ( −2/9 )  = 7/9
                        1/9  4/27 ] ( 6 − 5 )              ( 1/27 )
Before the sample is selected, T² has the distribution of a
[(3 − 1)2/(3 − 2)] F_{2,3−2} = 4F_{2,1} random variable.
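The computation in Example 5.7 is immediate in R:

    X <- matrix(c(6, 9,
                  10, 6,
                  8, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    xbar <- colMeans(X)   # (8, 6)'
    S <- cov(X)           # [[4, -3], [-3, 9]]
    mu0 <- c(9, 5)
    T2 <- n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)
    T2                    # 7/9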
Example 5.8
Test the hypothesis H0 : µ′ = [4, 50, 10] against H1 : µ′ ≠ [4, 50, 10] at level
of significance α = 0.10, using perspiration data from 20 healthy females on
which three components, X1 = sweat rate, X2 = sodium content and
X3 = potassium content, were measured. The data yield the following.
Hotelling's T²
Example 5.8
x̄ = (4.640, 45.4, 9.965)′,
S = [ 2.879  10.010  −1.810         S⁻¹ = [ 0.586 −0.022  0.258
      10.010 199.788 −5.640                −0.022  0.006 −0.002
      −1.810 −5.640   3.628 ]               0.258 −0.002  0.402 ]
Solution
T² = 20 [4.640 − 4, 45.4 − 50, 9.965 − 10] S⁻¹ ( 4.640 − 4 )
                                               ( 45.4 − 50 )
                                               ( 9.965 − 10 )  = 9.74
Hotelling's T²
The critical value is
[(n − 1)p/(n − p)] Fp,n−p(0.10) = [19(3)/17] F3,17(0.10) = 3.353(2.44) = 8.18
Since T² = 9.74 > 8.18, we reject H0 at the 10% level of significance.
Hotelling's T²
Property
One feature of the T² statistic is that it is invariant under changes in the units
of measurement for X of the form
Y = CX + d
where Y is p × 1, C is a nonsingular p × p matrix, X is p × 1 and d is p × 1.
Proof
Given x1 , x2 , x3 , . . . , xn , then ȳ = Cx̄ + d and
Sy = (1/n) Σ_{i=1}^n (yi − ȳ)(yi − ȳ)′
   = (1/n) Σ_{i=1}^n (Cxi + d − Cx̄ − d)(Cxi + d − Cx̄ − d)′
   = (1/n) Σ_{i=1}^n C(xi − x̄)(xi − x̄)′C′ = CSC′
Hotelling's T²
Result 5.1
Let X1 , X2 , . . . , Xn be a random sample from an Np(µ, Σ) population. Then
the test based on T² is equivalent to the likelihood ratio test of H0 : µ = µ0
versus H1 : µ ≠ µ0 because
Λ^{2/n} = [ 1 + T²/(n − 1) ]⁻¹ = |Σ̂| / |Σ̂0|
H0 is rejected for small values of Λ^{2/n} or, equivalently, large values of T².
Optimality Properties
Likelihood ratio tests are known to have certain optimality properties.
For testing H0 : θ = θ0 versus H1 : θ = θ1 (a simple alternative), the
likelihood ratio test has the highest power (among all tests) for a given
significance level α.
Recall
P(test rejects | H0) = α
P(test rejects | H1) = 1 − β
Confidence region
A 100(1 − α)% confidence region for µ is the set of all µ satisfying
n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ [(n − 1)p/(n − p)] Fp,n−p(α)
Example
Consider the following: n = 42,
x̄ = ( 0.564 )   S = [ 0.0144 0.0117     S⁻¹ = [ 203.018 −163.391
    ( 0.603 )         0.0117 0.0146 ]          −163.391  200.228 ]
The 95 percent confidence region for µ consists of all values (µ1 , µ2)
satisfying
42 [0.564 − µ1 , 0.603 − µ2] [ 203.018 −163.391    ( 0.564 − µ1 )  ≤ [2(41)/40] F2,40(0.05)
                              −163.391  200.228 ]  ( 0.603 − µ2 )
Since F2,40(0.05) = 3.23, this becomes
42(203.018)(0.564 − µ1)² + 42(200.228)(0.603 − µ2)²
− 84(163.391)(0.564 − µ1)(0.603 − µ2) ≤ 6.62
Proof of Result
T² = n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ c² ⇔ n(ℓ′x̄ − ℓ′µ)²/(ℓ′Sℓ) ≤ c² for every ℓ,
giving the simultaneous intervals
ℓ′x̄ − c √(ℓ′Sℓ/n) ≤ ℓ′µ ≤ ℓ′x̄ + c √(ℓ′Sℓ/n)
Example [College Data]
Consider a data set with
n = 87, x̄ = ( 526.59 )   S = [ 5691.34 600.51 217.25
             ( 54.69 )         600.51  126.05 23.37
             ( 25.13 )         217.25  23.37  23.11 ]
a. Compute the 95 percent simultaneous confidence intervals for µ1 , µ2 and µ3.
b. Construct a 95 percent confidence interval for µ2 − µ3.
c. Compute the 95 percent confidence region (confidence ellipse) for the pair
(µ2 , µ3).
Solution
(a) [p(n − 1)/(n − p)] Fp,n−p(α) = [3(87 − 1)/(87 − 3)] F3,84(0.05) = [3(86)/84](2.7) = 8.29
526.59 − √8.29 √(5691.34/87) ≤ µ1 ≤ 526.59 + √8.29 √(5691.34/87)
503.30 ≤ µ1 ≤ 549.88
54.69 − √8.29 √(126.05/87) ≤ µ2 ≤ 54.69 + √8.29 √(126.05/87)
51.22 ≤ µ2 ≤ 58.16
25.13 − √8.29 √(23.11/87) ≤ µ3 ≤ 25.13 + √8.29 √(23.11/87)
23.65 ≤ µ3 ≤ 26.61
(b) (x̄2 − x̄3) ± √{ [p(n − 1)/(n − p)] Fp,n−p(0.05) } √[(s22 + s33 − 2s23)/n]
= (54.69 − 25.13) ± √8.29 √{[126.05 + 23.11 − 2(23.37)]/87}
= 29.56 ± 3.12
(c) The 95 percent confidence ellipse for (µ2 , µ3) consists of all (µ2 , µ3) satisfying
0.849(54.69 − µ2)² + 4.633(25.13 − µ3)² − 2(0.859)(54.69 − µ2)(25.13 − µ3) ≤ 8.29
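The simultaneous T² intervals in part (a) can be reproduced in R from the summary statistics alone:

    n <- 87; p <- 3
    xbar <- c(526.59, 54.69, 25.13)
    s <- c(5691.34, 126.05, 23.11)                      # diagonal elements of S
    c2 <- p * (n - 1) / (n - p) * qf(0.95, p, n - p)    # about 8.29 (exact F quantile)
    half <- sqrt(c2) * sqrt(s / n)
    cbind(lower = xbar - half, upper = xbar + half)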
Bonferroni Intervals
Split the overall error rate over m statements: α = α1 + α2 + · · · + αm.
For αj = α/m, the ratio of the Bonferroni length to the T² length is
tn−1(α/(2m)) / √{ [p(n − 1)/(n − p)] Fp,n−p(α) }
Bonferroni Intervals
With m = 3 and t86(0.05/6) = 2.44:
526.59 ± 2.44 √(5691.34/87) or 506.86 ≤ µ1 ≤ 546.32
54.69 ± 2.44 √(126.05/87) or 51.75 ≤ µ2 ≤ 57.63
25.13 ± 2.44 √(23.11/87) or 23.87 ≤ µ3 ≤ 26.39
Observe that these intervals are shorter than the T² intervals constructed
above for µ1 , µ2 , µ3.
Exercise
For the example with n = 42,
x̄ = ( 0.564 )   S = [ 0.0144 0.0117     S⁻¹ = [ 203.018 −163.391
    ( 0.603 )         0.0117 0.0146 ]          −163.391  200.228 ]
i. Conduct a test of the null hypothesis H0 : µ′ = [0.55, 0.60] at the
α = 0.05 level of significance. Reject H0 if
n(x̄ − µ0)′S⁻¹(x̄ − µ0) > χ²p(α)
where χ²p(α) is the upper (100α)th percentile of a chi-square distribution
with p df.
Note: This test statistic yields essentially the same result as T² in
situations where the χ² test is appropriate. This follows from the fact that
[(n − 1)p/(n − p)] Fp,n−p(α) and χ²p(α) are approximately equal for n large
relative to p.
Confidence Statement: Let X1 , X2 , . . . , Xn be a random sample from a
population with mean µ and positive definite covariance Σ. If n − p is large,
ℓ′X̄ ± √(χ²p(α)) √(ℓ′Sℓ/n)
will contain ℓ′µ for every ℓ, with probability approximately 1 − α.
Paired comparisons
Let Xijt be the outcome for unit i on the jth variable under treatment
(experimental condition) t = 1, 2; j = 1, . . . , p; i = 1, 2, . . . , n.
Then define Dij = Xij1 − Xij2 and let Di be the difference vector for the ith
unit/pair: Di = (Di1 , Di2 , . . . , Dip)′.
Denote E(Di) = δ and the covariance Cov(Di) = Σd. Again, the T² statistic
plays a critical role:
T² = n(D̄ − δ)′Sd⁻¹(D̄ − δ).
Assume Di ∼ Np(δ, Σd). Then (for a test of the hypothesis H0 : δ = 0
against H1 : δ ≠ 0)
T² ∼ [(n − 1)p/(n − p)] Fp,n−p
Paired comparisons
Bonferroni intervals:
d̄j − tn−1(α/(2g)) √(s²dj/n) ≤ δj ≤ d̄j + tn−1(α/(2g)) √(s²dj/n)
Example
di1 = Xi11 − Xi12 : −19 −22 −18 −27 −4 −10 −14 17 9 4 −19
di2 = Xi21 − Xi22 : 12 10 42 15 −1 11 −4 60 −2 10 −7
Here
d̄ = ( d̄1 ) = ( −9.36 )   Sd = [ 199.26 88.38
    ( d̄2 )   ( 13.27 )          88.38 418.61 ]
T² = 11 [−9.36, 13.27] [ 199.26 88.38  ]⁻¹ ( −9.36 )  = 13.6
                       [ 88.38  418.61 ]   ( 13.27 )
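A sketch of this paired analysis in R, using the differences listed above:

    d1 <- c(-19, -22, -18, -27, -4, -10, -14, 17, 9, 4, -19)
    d2 <- c(12, 10, 42, 15, -1, 11, -4, 60, -2, 10, -7)
    D <- cbind(d1, d2)
    n <- nrow(D); p <- ncol(D)
    dbar <- colMeans(D)                                  # (-9.36, 13.27)
    Sd <- cov(D)
    T2 <- n * t(dbar) %*% solve(Sd) %*% dbar             # 13.6
    crit <- (n - 1) * p / (n - p) * qf(0.95, p, n - p)   # 9.47
    c(T2 = T2, critical = crit)                          # T2 > critical: reject H0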
Example
The 95% simultaneous confidence intervals, with c² = [(n − 1)p/(n − p)] F2,9(0.05) = 9.47:
δ1 : d̄1 ± √9.47 √(199.26/11) = −9.36 ± 13.10 or (−22.46, 3.74)
δ2 : d̄2 ± √9.47 √(418.61/11) = 13.27 ± 18.98 or (−5.71, 32.25)
The 95% simultaneous confidence intervals include zero, yet the
hypothesis H0 : δ = 0 was rejected at level α. What are we to conclude?
The evidence points towards real differences: the point δ = 0 falls
outside the 95% confidence region for δ.
Note that our analysis assumed a normal distribution for Di.
Exercise: Construct the 95% confidence region for δ and show that
δ′ = (0, 0) falls outside it. Find the Bonferroni simultaneous intervals.
Do they cover zero?
Repeated Measures
Each unit i yields q repeated measurements Xi = (Xi1 , . . . , Xiq)′.
Most often, not the mean vector µ itself but linear functions Cµ of its
components are of interest. We call C a contrast matrix.
Repeated measures
Examples are:
Baseline comparison
( µ1 − µ2 )    [ 1 −1  0 . . .  0     ( µ1 )
( µ1 − µ3 ) =    1  0 −1 . . .  0       ( µ2 )
( ⋮ )            ...                    ( ⋮ )
( µ1 − µq )      1  0  0 . . . −1 ]     ( µq )
Successive differences
( µ2 − µ1 )     [ −1  1  0 . . .  0 0     ( µ1 )
( µ3 − µ2 )  =     0 −1  1 . . .  0 0       ( µ2 )
( ⋮ )              ...                      ( ⋮ )
( µq − µq−1 )      0  0  0 . . . −1 1 ]     ( µq )
Question
(a) All three indices are evaluated for each patient. Test for the equality of
the mean indices with α = 0.05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous
confidence intervals. Note:
c′x̄ ± √{ [(n − 1)(q − 1)/(n − q + 1)] Fq−1,n−q+1(α) } √(c′Sc/n)
Common Variance
Example
µ′1 − µ′2 = [µ11 − µ21 , µ12 − µ22]. The 95% simultaneous confidence
intervals for the population differences are:
µ11 − µ21 : (204.4 − 130.0) ± √6.26 √[(1/45 + 1/55) 10963.7]
21.7 ≤ µ11 − µ21 ≤ 127.1
µ12 − µ22 : (556.6 − 355) ± √6.26 √[(1/45 + 1/55) 63661.3]
74.7 ≤ µ12 − µ22 ≤ 328.5
Exercise
Construct the 95% confidence ellipse for µ1 − µ2. Does the difference in means
cover 0′ = [0, 0]? [Or, will the T² statistic reject H0 : µ1 − µ2 = 0 at the 5%
level?]
Example
For α = 0.05, the critical value is χ22 (0.05) = 5.99 and since
T 2 > χ22 (0.05) = 5.99, we reject H0 .
Comparison of g populations
Xℓi = µ + τℓ + eℓi
with i = 1, 2, . . . , nℓ and ℓ = 1, 2, . . . , g.
We take eℓi ∼ N(0, Σ) and E(Xℓi) = µ + τℓ , with µ = overall mean and
τℓ = group effect.
The model is overspecified, and so we impose the identifiability
constraint Σ_{ℓ=1}^g nℓ τℓ = 0. That is, the group effects, weighted by the
nℓ individuals in each group, add up to zero.
The following terminology is sometimes useful
◦ systematic part: µ + τℓ
◦ random part: eℓi
MANOVA
Write xℓi − x̄ = (x̄ℓ − x̄) + (xℓi − x̄ℓ). This implies
(xℓi − x̄)(xℓi − x̄)′ = [(x̄ℓ − x̄) + (xℓi − x̄ℓ)][(x̄ℓ − x̄) + (xℓi − x̄ℓ)]′
= (x̄ℓ − x̄)(x̄ℓ − x̄)′ + (x̄ℓ − x̄)(xℓi − x̄ℓ)′
+ (xℓi − x̄ℓ)(x̄ℓ − x̄)′ + (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
MANOVA
Summing over i:
Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′ = nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′ + Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
since Σ_{i=1}^{nℓ} (xℓi − x̄ℓ) = 0, so the cross-product terms vanish.
MANOVA
Summing also over ℓ:
Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′ = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′ + Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
Total (corrected) Sum of Squares and Cross Products
= Treatment (Between) Sum of Squares and Cross Products
+ Residual (Within) Sum of Squares and Cross Products
T = B + W
MANOVA
Recall,
W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
B = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′
B + W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′
Remark:
◦ The "between" matrix B is often denoted as H, the "hypothesis" matrix.
◦ The "within" matrix W is often denoted as E, the "error" matrix.
|W − λ(B + W)| = 0
|B − θ(B + W)| = 0
|B − φW| = 0
Now, rephrasing the first and second root equations, we find
|(1 − λ)W − λB| = 0  ⇒  |B − [(1 − λ)/λ]W| = 0
and
|(1 − θ)B − θW| = 0  ⇒  |B − [θ/(1 − θ)]W| = 0
Thus
φ = (1 − λ)/λ = θ/(1 − θ),   θ = φ/(1 + φ) = 1 − λ,   λ = 1/(1 + φ) = 1 − θ.
Test Statistics
B = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′,   W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
H0 : τ1 = τ2 = · · · = τg = 0
Exact distribution of Λ*
If H0 is true and n = Σ nℓ is large,
−[n − 1 − (p + g)/2] ln Λ*
has approximately a chi-square distribution with p(g − 1) df.
Consequently, for large n, we reject H0 at level α if
−[n − 1 − (p + g)/2] ln Λ* ≥ χ²_{p(g−1)}(α)
where χ²_{p(g−1)}(α) is the upper (100α)th percentile of a chi-square
distribution with p(g − 1) df.
Exercise: Observations on two responses, (x1 , x2)′, are collected for two
treatments. The observed vectors are
Treatment 1: ( 3 ) ( 1 ) ( 2 )
             ( 3 ) ( 6 ) ( 3 )
Treatment 2: ( 2 ) ( 5 ) ( 3 ) ( 2 )
             ( 3 ) ( 1 ) ( 1 ) ( 3 )
(a) (i) Calculate Spooled.
    (ii) Test H0 : µ1 − µ2 = 0 employing a two-sample approach with α = 0.01.
    (iii) Construct 99% simultaneous confidence intervals for the differences
    µ1i − µ2i , i = 1, 2.
(b) (i) Break up the observations into mean, treatment and residual
    components. [Xℓi = X̄ + (X̄ℓ − X̄) + (Xℓi − X̄ℓ)]
    (ii) Construct the one-way MANOVA table.
    (iii) Evaluate Wilks' lambda Λ* and test for treatment effects. Set α = 0.01.
    Repeat the test using the chi-square approximation. Compare the
    conclusions. A sketch of this analysis in R follows.
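As mentioned in (b)(iii), the one-way MANOVA for this exercise can be set up with the base R functions manova() and summary():

    y1 <- c(3, 1, 2, 2, 5, 3, 2)
    y2 <- c(3, 6, 3, 3, 1, 1, 3)
    treat <- factor(c(1, 1, 1, 2, 2, 2, 2))
    fit <- manova(cbind(y1, y2) ~ treat)
    summary(fit, test = "Wilks")   # Wilks' lambda and its F approximation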
Profile analysis
Consider two populations with mean vectors µ1 = (µ11 , . . . , µ1p)′ and
µ2 = (µ21 , . . . , µ2p)′. Plotting the components of each mean vector against
the variable index gives the two profiles. We ask:
(1) Are the profiles parallel?
(2) Given that the profiles are parallel, are they also coincident?
(3) Given that the profiles are coincident, are they horizontal?
Profile analysis
The null hypothesis in (1) can be written H01 : Cµ1 = Cµ2 where C is
the (p − 1) × p contrast matrix
C = [ −1  1  0  0 . . .  0 0
       0 −1  1  0 . . .  0 0
       ...
       0  0  0  0 . . . −1 1 ]
We don't need new methodology to carry out this test. We merely need
to think of the following transformed observations:
Cx1i , i = 1, . . . , n1
Cx2i , i = 1, . . . , n2
The resulting observations have p − 1 components instead of p. Using
these observations, the ordinary T² test immediately applies. In our case,
this is a two-sample problem.
Profile analysis
When we fail to reject this hypothesis, we can conclude that parallel
profiles are plausible. It is then reasonable to test for coincidence.
We now look for a test of whether the distance between the two
profiles is zero.
Given that the profiles are parallel, the difference at every point (outcome) j
is the same, d = µ1j − µ2j ; hence, if the profiles are coincident,
pd = Σ_j (µ1j − µ2j) = 0
that is,
Σ_j µ1j = Σ_j µ2j
Profile analysis
If we use the notation J = Jp,1 for the column vector of p ones, then the
above equality becomes
J′µ1 = J′µ2
The vector J plays a role similar to the role of C in the previous test.
Now, however, we have a 'univariate' test: the F statistic has 1 and
n1 + n2 − 2 degrees of freedom, and the test statistic is
T² = (J′x̄1 − J′x̄2) [ (1/n1 + 1/n2) J′Spooled J ]⁻¹ (J′x̄1 − J′x̄2)
   = [J′(x̄1 − x̄2)]² / [ (1/n1 + 1/n2) J′Spooled J ]
Profile analysis
Remark
The hypothesis H01 can always be tested. It is the first hypothesis that
usually comes to mind.
Example
(3) All things considered, how would you describe your contributions to the
marriage?
(4) All things considered, how would you describe your outcomes from the
marriage?
Example
Subjects were also asked to respond to the following using the 5-point scale
below.
None at all   Very little   Some   A great deal   Tremendous amount
1             2             3      4              5
(1) What is the level of passionate love that you feel for your partner?
(2) What is the level of compassionate love that you feel for your partner?
Let
x1 = an 8-point scale response to Question 1
x2 = an 8-point scale response to Question 2
x3 = a 5-point scale response to Question 3
x4 = a 5-point scale response to Question 4
and the two populations are defined as
Population 1 = married men
Population 2 = married women
Example
The sample mean vectors are
x̄1 = (6.833, 7.033, 3.967, 4.700)′ (men),  x̄2 = (6.633, 7.000, 4.000, 4.533)′ (women),
and
Spooled = [ 0.606 0.262 0.066 0.161
            0.262 0.637 0.173 0.143
            0.066 0.173 0.810 0.029
            0.161 0.143 0.029 0.306 ]
For the test of parallelism, C(x̄1 − x̄2) = (−0.167, −0.066, 0.200)′ and
T² = [C(x̄1 − x̄2)]′ [ (1/30 + 1/30) CSpooled C′ ]⁻¹ [C(x̄1 − x̄2)] = 15(0.067) = 1.005
With α = 0.05,
c² = [(30 + 30 − 2)(4 − 1)/(30 + 30 − 4)] F3,56(0.05) = 3.11(2.8) = 8.7
Since T² = 1.005 < 8.7, we conclude that the hypothesis of parallel profiles
for men and women is tenable.
For the coincidence test, J′(x̄1 − x̄2) = 0.367 and
J′Spooled J = (1, 1, 1, 1) Spooled (1, 1, 1, 1)′ = 4.027
T² = (0.367)² / [ (1/30 + 1/30)(4.027) ] = 0.501
With α = 0.05, F1,58(0.05) = 4.0, and since T² = 0.501 < F1,58(0.05) = 4.0, we
cannot reject the hypothesis that the profiles are coincident. That is, the
responses of the men and women to the four questions posed appear to be the
same.
We could now test for level (horizontal) profiles; however, it does not make
sense to carry out this test since Questions 1 and 2 were measured on a scale
of 1-8, while Questions 3 and 4 were measured on a scale of 1-5.
The incompatibility of these scales makes the test for level profiles
meaningless.
PCA
Do not
◦ overestimate the importance of principal component analysis.
◦ over-interpret a principal component analysis.
They can be useful:
◦ in the exploratory phase of a data analysis (to learn about the structure
of the data)
◦ as input to other statistical procedures.
In matrix notation,
( Y1 )    [ ℓ11 . . . ℓ1p    ( X1 )
( Y2 ) =    ...                ( X2 )
( ⋮ )       ℓp1 . . . ℓpp ]    ( ⋮ )
( Yp )                         ( Xp )
Population PCA
This problem can be overcome by additional requirements:
◦ the Yj are uncorrelated.
◦ the Yj have maximal variance.
◦ the coefficient vectors ℓj have unit length.
The last requirement is formalized as follows:
||ℓj||² = ℓ′j ℓj = Σ_{k=1}^p ℓ²jk = 1
These requirements translate into an iterative procedure:
Y1 = ℓ′1 X with maximal variance, subject to ℓ′1 ℓ1 = 1
Y2 = ℓ′2 X with maximal variance, subject to ℓ′2 ℓ2 = 1 and cov(Y1 , Y2) = 0
...
Yj = ℓ′j X with maximal variance, subject to ℓ′j ℓj = 1 and
cov(Y1 , Yj) = · · · = cov(Yj−1 , Yj) = 0
Heuristic argument
Mathematical argument
It turns out that the heuristic reasoning of the previous section can be
generalized and proven mathematically.
Theorem: Let Σ be the covariance matrix of X with ordered
eigenvalue-eigenvector pairs (λj , ej), j = 1, 2, . . . , p, where λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Then:
◦ the principal components are Yj = e′j X = ej1 X1 + · · · + ejp Xp
◦ var(Yj) = λj
◦ cov(Yj , Yk) = e′j Σ ek = 0, j ≠ k
For distinct eigenvalues, the choice of the principal components is unique.
Procedure
Properties
The following properties ensure that the new variables are very desirable:
(1) The variance of the new variables is determined by the eigenvalues λj :
var(Yj) = λj.
(2) The covariance between pairs of distinct new variables is zero, and hence
also the correlation: Corr(Yj , Yk) = 0, j ≠ k.
(3) The total variability of the original variables is preserved by the new
variables. More precisely, the sum of the diagonal elements of the
original covariance matrix equals the sum of the eigenvalues.
(4) The total population variance is λ1 + · · · + λp. The jth principal
component "explains" the proportion
λj / (λ1 + · · · + λp)
of the variance.
The proportion of the total variance accounted for by the first principal
component is
λ1 / (λ1 + λ2 + λ3) = 5.83/8 = 0.73
Further, the first two components Y1 and Y2 could replace the original
three variables with little loss of information.
Sample PCs
Covariance or Correlation?
◦ The original variables could have different units. Then we are
comparing apples and oranges.
◦ The original variables could have widely varying standard deviations.
In both cases, the analysis is driven by the variables with large standard
deviations.
For example, a length in kilometers re-expressed in millimeters (×10⁶) ⇒ the
principal components will be pulled towards the variable in millimeters.
A solution is provided by using the correlation matrix instead. This
means that all variables are replaced by their standardized versions.
Each variance then equals 1 and the total variability is equal to p.
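In R, prcomp() computes sample principal components; setting scale. = TRUE bases the analysis on the correlation matrix, which addresses the units problem just described (the data here are illustrative):

    set.seed(2)
    X <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 100))  # wildly different scales
    pc_cov <- prcomp(X)                  # covariance PCA: dominated by column 2
    pc_cor <- prcomp(X, scale. = TRUE)   # correlation PCA: standardized variables
    summary(pc_cov)
    summary(pc_cor)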
Exercise 2: Find the principal components and the proportion of the total
variance explained by each when the covariance matrix is
Σ = [ σ²  ρσ² ρσ²
      ρσ² σ²  ρσ²
      ρσ² ρσ² σ² ]
⇒ Yij = αj + βj Xi + εij
Two Populations
Classification Error
That is, based on the classification rule and the observations made for a
particular specimen, we are led to believe that the specimen belongs to
one subgroup, whereas in reality it belongs to the other.
These probabilities are the first step to answering the following questions:
(1) What is the misclassification error?
(2) What is the misclassification cost?
The cost matrix is very simple in the case of two groups:
                    Classify as
True Population     R1        R2
π1                  0         c(2|1)
π2                  c(1|2)    0
Similarly,
R1 = { x | f1(x)/f2(x) > [c(1|2)p2] / [c(2|1)p1] }
and hence the regions are defined.
What if f1(x)/f2(x) = [c(1|2)p2]/[c(2|1)p1]?
This boundary case is fairly arbitrary, and the performance of the rule
will not change whether we assign this curve to R1 or to R2.
The classification rule is: Assign an observation with outcome vector x
to the first population π1 if
f1(x)/f2(x) > [c(1|2)p2] / [c(2|1)p1]
But it is plausible that one has a rough idea about the relative
severity of the misclassifications, e.g., the second type of
misclassification is two times as bad as one of the first type.
Remarks:
◦ The shape of the discriminant function depends on the form of f1(x)
and f2(x). It will change with changing parametric forms assumed
for these densities (e.g., normal densities with equal or with unequal
variances).
◦ If either the cost ratio or the prior probability ratio is unity, the
definition of the regions simplifies accordingly.
◦ If the product of the cost and prior probability ratios is unity, then we
actually allocate to the population with the highest probability. We
then classify to R1 if f1(x)/f2(x) > 1, or equivalently f1(x) > f2(x).
The ECM is not the only useful criterion to determine the classification
boundary. A few alternatives are:
◦ Total probability of misclassification (TPM): the ECM for equal
costs.
◦ Largest posterior probability: reduces to the TPM.
The classification rule is based on the ratio of the two densities evaluated
at x:
f1(x)/f2(x) = exp{ −(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2) }
After some manipulation, the classification region R1 is found to be:
(µ1 − µ2)′Σ⁻¹x − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) ≥ ln{ [c(1|2)p2] / [c(2|1)p1] }
Sample Version
In the above reasoning, µ1 , µ2 and Σ are assumed to be known
population values. However, in practice they are unknown. This implies
they have to be estimated from data.
The following algorithm can be used.
◦ Collect n1 observations from π1 and n2 observations from π2.
◦ Construct the sample statistics x̄1 , x̄2 , S1 and S2 , as estimators for
µ1 , µ2 , Σ1 and Σ2 respectively.
◦ Since we assume a common Σ, it is necessary to construct a common
S. In other words, S1 and S2 are assumed to estimate the same
quantity, and therefore they should be combined in a so-called
pooled covariance matrix:
Spooled = [ (n1 − 1)S1 + (n2 − 1)S2 ] / (n1 + n2 − 2)
Observe that, when the sample sizes n1 and n2 are equal, Spooled is
simply the average of S1 and S2; otherwise they are weighted by the
sample sizes they are based upon.
With equal costs and equal priors, the right hand side of the allocation rule
vanishes, whence it can be rewritten as
(x̄1 − x̄2)′Spooled⁻¹ x0 ≥ (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2)
This linear combination occurs both on the left hand side and on the
right hand side of the classification rule.
The rule can be rewritten as:
ℓ′x0 ≥ (1/2)(ℓ′x̄1 + ℓ′x̄2) = m
where ℓ = Spooled⁻¹(x̄1 − x̄2) is called the vector of discriminant coefficients.
Also,
ℓ′ = (x̄1 − x̄2)′Spooled⁻¹ = [37.61, −28.92]
ℓ′x̄1 = [37.61, −28.92] ( −0.0065 )  = 0.88
                       ( −0.0390 )
ℓ′x̄2 = [37.61, −28.92] ( −0.2483 )  = −10.10
                       (  0.0262 )
m = (1/2)(ℓ′x̄1 + ℓ′x̄2) = (1/2)(0.88 − 10.10) = −4.61
Classify a woman with x1 = −0.210 and x2 = −0.044 (i.e. x0 = (x1 , x2)′):
should this woman be classified as normal or as an obligatory carrier?
Assuming equal costs and equal priors, so that ln(1) = 0, we obtain:
Allocate x0 to π1 if ℓ′x0 ≥ m
Allocate x0 to π2 if ℓ′x0 < m
Since
ℓ′x0 = [37.61, −28.92] ( −0.210 )  = −6.62 < −4.61
                       ( −0.044 )
the rule with equal priors classifies her as an obligatory carrier. With
unequal priors p1 = 0.75 and p2 = 0.25,
w = (x̄1 − x̄2)′Spooled⁻¹ x0 − (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2) = −6.62 − (−4.61) = −2.01
is compared to ln(p2/p1) = ln(0.25/0.75) = −1.10.
Since w = −2.01 < ln(p2/p1) = −1.10, we classify the woman as π2 , an
obligatory carrier.
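The linear classification rule of this example can be mirrored in R; a minimal sketch working from the summary statistics above (MASS::lda offers a packaged alternative when the raw data are available):

    l <- c(37.61, -28.92)          # discriminant coefficients l' = (xbar1 - xbar2)' Spooled^{-1}
    xbar1 <- c(-0.0065, -0.0390)   # normal group mean
    xbar2 <- c(-0.2483, 0.0262)    # obligatory-carrier group mean
    m <- 0.5 * (sum(l * xbar1) + sum(l * xbar2))   # midpoint, about -4.61
    x0 <- c(-0.210, -0.044)
    w <- sum(l * x0) - m                           # about -2.01
    # With priors p1 = 0.75, p2 = 0.25, allocate to pi1 only if w >= ln(p2/p1)
    w >= log(0.25 / 0.75)                          # FALSE: classify as pi2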
For unequal covariance matrices, allocate x0 to π1 if
−(1/2) x′0 (S1⁻¹ − S2⁻¹) x0 + (x̄′1 S1⁻¹ − x̄′2 S2⁻¹) x0 − k ≥ ln{ [c(1|2)p2] / [c(2|1)p1] }
where k = (1/2) ln(|Σ1|/|Σ2|) + (1/2)(µ′1 Σ1⁻¹ µ1 − µ′2 Σ2⁻¹ µ2)
Guidelines
◦ If the populations are approximately normal and the covariance
matrices are unequal, use the quadratic classification rule.
◦ BUT: the quadratic rule is sensitive to departures from normality,
while the linear rule is much more generally valid, also outside the
normal framework, as we will learn later from Fisher's discriminant
analysis.
◦ Carry out checks before performing a classification procedure:
† Transform to normality first.
† Then check for homogeneity of the covariance matrices.
The order is important since these homogeneity checks are sensitive
to nonnormality.
ℓ ∝ Spooled⁻¹(x̄1 − x̄2)
Allocate x0 to π1 if
Y0 = (x̄1 − x̄2)′Spooled⁻¹ x0 ≥ (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2)
Observe that
(1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2) = (1/2)ℓ′(x̄1 + x̄2) = (1/2)(ȳ1 + ȳ2) = m̂,
the estimated midpoint of the two univariate means.
For high values (H0 rejected), we are able to discriminate between the
two populations.
Exercise