Multivariate Methods/Analysis
Lecture Notes
Lecturer
Samuel Iddi (PhD)
Department of Statistics
University of Ghana
isamuel@ug.edu.gh
Course Objective:
To understand problems associated with multi-dimensional data.
To study basic multivariate distribution theory and methods.
To discuss some fundamental and important multivariate statistical
techniques.
To understand inferential methods and how to apply them to real-life problems
arising from various scientific fields.
To learn methods for data reduction.
To implement and apply the techniques in R.
Textbook:
1 Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate
Statistical Analysis. 6th Ed. Prentice-Hall.
Reference:
1 Muirhead R. J. (2005). Aspects of Multivariate Statistical Theory. John
Wiley and Sons.
2 Hardle, W. K. and Simar, L. (2012). Applied Multivariate Statistical
Analysis. 3rd Ed. Springer.
3 Everitt, B. and Hothorn, T. (2011). An Introduction to Applied
Multivariate Analysis with R. Use R! Series. Springer.
Office hour:
Thursday: 10:30am to 2:00pm or by appointment.
Phone Number: 0506783155, Office: 208
E-mail: isamuel@ug.edu.gh; isamuel.gh@gmail.com (sending an email is the
easiest way to contact me)
Personal Website: www.samueliddi.com
Teaching Assistant
TA: Enoch Sakyi-Yeboah
Tutorial: Thursday, 9:30am to 10:30am
Phone number: 0274873770
E-mail: enochsakyi10@gmail.com
Grading
Homework Assignments (20%), Interim Assessment (30%) and Final Exams
(50%).
Guidelines
Homework should be submitted one week from the day it is assigned.
R program code should also be submitted via email.
Late submissions will not be accepted.
Duplicate solutions will not be graded.
The TA will discuss solutions during tutorial hours.
The interim assessment may take the form of a project, presentation and
defense.
Computing
The main software package for this course is R, version 2.3.1 or higher.
Install R from the website www.r-project.org. You may also use RStudio,
but only after you have installed R first.
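As a quick sanity check after installation, a minimal R session like the following can be used (the data here are purely illustrative):

    # Confirm the installed R version
    R.version.string
    # Generate a small bivariate data set and inspect basic summaries
    x <- matrix(rnorm(20), ncol = 2)
    colMeans(x)   # sample mean vector
    cov(x)        # sample covariance matrix
    cor(x)        # sample correlation matrix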
Introduction
This course concerns the use of statistical methods for describing and
analyzing multivariate data.
Multivariate Analysis
This refers to a set of techniques which allow for the presence of more
than one outcome variable.
The challenges of learning from large data sets have led to the development
and evolution of this field of statistical science.
Data Setup
Example 2.1
A selection of four receipts from the university bookstore was obtained in
order to investigate the nature of book sales. Each receipt provided, among
other things, the number of books sold and the total amount of each sale. Let
the first variable be the total Cedi sales and the second variable be the
number of books sold. Then we can regard the corresponding numbers on the
receipts as four measurements on two variables.
Solution 2.1
Using the notation just introduced, with n = 4 receipts and p = 2 variables, the four measurements on two variables form the data matrix
X = [ 42 4
      52 5
      48 4
      58 3 ]
with x11 = 42, x12 = 4, x21 = 52, and so on.
Descriptive statistics
Sample mean
x̄j = (1/n) Σ_{i=1}^n xij , j = 1, . . . , p.
Mean vector x̄ = (x̄1 , x̄2 , . . . , x̄p)′.
Sample variance
s²j = (1/n) Σ_{i=1}^n (xij − x̄j)².
A very common definition is
s²j = (1/(n − 1)) Σ_{i=1}^n (xij − x̄j)².
Remarks on correlation:
unitless
covariance of standardized variables
range −1 ≤ rjk ≤ 1
rjk = 0 implies no linear association
rjk > 0 implies the pair of variables tend to deviate from their
respective means in the same direction.
The correlation is invariant under the transformations
yij = a xij + b, i = 1, 2, . . . , n
yik = c xik + d, i = 1, 2, . . . , n
provided a and c have the same sign (ac > 0). What happens if ac < 0?
Example 2.2
Find the arrays x̄, Sn , R from the example data above.
Solution 2.2
x̄1 = (1/4) Σ_{i=1}^4 xi1 = (1/4)(42 + 52 + 48 + 58) = 50
x̄2 = (1/4) Σ_{i=1}^4 xi2 = (1/4)(4 + 5 + 4 + 3) = 4
x̄ = ( 50 )
    ( 4 )
s11 = (1/4) Σ_{i=1}^4 (xi1 − x̄1)²
    = (1/4)[(42 − 50)² + (52 − 50)² + (48 − 50)² + (58 − 50)²] = 34
s22 = 0.5
s12 = s21 = (1/4) Σ_{i=1}^4 (xi1 − x̄1)(xi2 − x̄2) = −1.5
and
Sn = [ 34  −1.5
      −1.5  0.5 ]
r12 = r21 = s12 / √(s11 s22) = −1.5 / √(34 × 0.5) = −0.36
so
R = [ 1     −0.36
     −0.36   1 ]
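These computations can be reproduced in R; a minimal sketch with the bookstore data (note that cov() and cor() use the divisor n − 1, so Sn is obtained by rescaling):

    X <- matrix(c(42, 4,
                  52, 5,
                  48, 4,
                  58, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    xbar <- colMeans(X)          # (50, 4)'
    Sn <- cov(X) * (n - 1) / n   # divisor-n covariance: [[34, -1.5], [-1.5, 0.5]]
    R <- cor(X)                  # [[1, -0.36], [-0.36, 1]]
    xbar; Sn; R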
Euclidean Distance
Consider the point P = (x1 , x2) in the plane. The straight-line distance d(O, P) from P to the origin O = (0, 0) is, according to the Pythagorean theorem,
d(O, P) = √(x1² + x2²)
Statistical Distance
For P = (x1 , x2) and O = (0, 0), the statistical distance is computed from the standardized coordinates x1* = x1/√s11 and x2* = x2/√s22 as
d(O, P) = √[(x1/√s11)² + (x2/√s22)²] = √(x1²/s11 + x2²/s22)
If s11 = s22 , it is convenient to ignore the common divisor and use the
Euclidean distance.
Statistical Distance
d²(O, P) = x1²/s11 + x2²/s22 = c²
All points with coordinates (x1 , x2) at constant squared distance c² from the origin must satisfy x1²/s11 + x2²/s22 = c².
Alternative formulation
Alternatively, let P = (x11 , x12 , . . . , x1p) and Q = (x21 , x22 , . . . , x2p). The Euclidean distance is given by
d(P, Q) = √[(x11 − x21)² + (x12 − x22)² + · · · + (x1p − x2p)²]
        = √[ Σ_{j=1}^p (x1j − x2j)² ] = √[(x1 − x2)′(x1 − x2)]
        = ||x1 − x2||
Drawback
◦ All components contribute equally, although there may be random
fluctuations of a different magnitude in the components.
◦ Different pairs of components may be correlated differently.
Alternative formulation
Sy = (1/n) Σ_{i=1}^n C(xi − x̄)(xi − x̄)′C′ = C Sx C′
Requiring Sy to be the identity is satisfied by S = Sx = (C′C)⁻¹.
Measuring the distance between P and Q can then be performed as a Euclidean distance in the coordinate system yi = Cxi , since the transformed variables are uncorrelated with unit variance:
d²(P, Q) = (y1 − y2)′(y1 − y2)
         = [C(x1 − x2)]′[C(x1 − x2)]
         = (x1 − x2)′C′C(x1 − x2)
         = (x1 − x2)′Sx⁻¹(x1 − x2)
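In R, the base function mahalanobis() computes exactly this squared statistical distance; a minimal sketch reusing the bookstore data:

    X <- matrix(c(42, 4, 52, 5, 48, 4, 58, 3), ncol = 2, byrow = TRUE)
    # Squared statistical distance of each observation from the mean
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
    d2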
Definition 3.1
Let A be a square matrix of dimension k × k. If there exist a scalar λ and a k × 1 nonzero vector x (x ≠ 0) such that
Ax = λx,
then λ is called an eigenvalue of A and x an associated eigenvector.
Example 3.1
Let A = [ 1 0
          1 3 ].
Find the eigenvalues and associated eigenvectors of A.
Solution 3.1
|A − λI| = | 1−λ   0
             1    3−λ | = (1 − λ)(3 − λ) = 0,
so the eigenvalues are λ = 1 and λ = 3. For λ = 3, Ax = 3x gives
x1 = 3x1
x1 + 3x2 = 3x2
which implies that x1 = 0 and x2 = 1 (arbitrarily); hence x = (0, 1)′ is the
eigenvector corresponding to the eigenvalue 3.
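The eigenvalues and eigenvectors of Example 3.1 can be checked with R's eigen() (which scales eigenvectors to unit length):

    A <- matrix(c(1, 0,
                  1, 3), nrow = 2, byrow = TRUE)
    e <- eigen(A)
    e$values    # 3 and 1
    e$vectors   # columns are the normalized eigenvectors; the first is (0, 1)'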
Definition 3.2
A quadratic form Q(x) in the k variables x1 , x2 , . . . , xk is Q(x) = x′Ax, where
x′ = (x1 , x2 , . . . , xk) and A is a k × k symmetric matrix.
Note that a quadratic form can be written as Q(x) = Σ_{i=1}^k Σ_{j=1}^k aij xi xj . For example,
Q(x) = (x1 , x2) [ 1 1
                   1 1 ] (x1 , x2)′ = x1² + 2x1x2 + x2².
Singular value decomposition: A = UΛV′.
The expected value of a random matrix X = (Xij) is taken element-wise:
E(X) = [ E(X11) E(X12) . . . E(X1p)        [ E(X′1)
         E(X21) E(X22) . . . E(X2p)    =     E(X′2)
         ...                                 ...
         E(Xn1) E(Xn2) . . . E(Xnp) ]        E(X′n) ]
where
E(Xij) = ∫_{−∞}^{∞} xij fij(xij) dxij   if Xij is a continuous random variable on ℝ
E(Xij) = Σ_{xij} xij pij(xij)           if Xij is a discrete random variable
with f the probability density function (pdf) and p the probability mass
function.
Example 3.2
Consider the random vector X′ = (X1 , X2). Let the discrete random variable X1 have probability function
x1:       −1    0    1
p1(x1):   0.3  0.3  0.4
and let X2 have probability function
x2:       0    1
p2(x2):   0.8  0.2
Find E(X).
Solution 3.2
E(X) = ( E(X1) ) = ( Σ_{x1} x1 p1(x1) ) = ( −1 × 0.3 + 0 × 0.3 + 1 × 0.4 )
       ( E(X2) )   ( Σ_{x2} x2 p2(x2) )   ( 0 × 0.8 + 1 × 0.2 )
Thus, E(X) = ( 0.1 )
             ( 0.2 )
Immediate results:
Covariance Matrices
where fjk(xj , xk) and pjk(xj , xk) are the joint density function and joint
probability mass function of (Xj , Xk), respectively.
The population variance-covariance matrix is defined as
Σ = Cov(X) = E(X − µ)(X − µ)′ = [ σ11 σ12 . . . σ1p
                                  σ21 σ22 . . . σ2p
                                  ...
                                  σp1 σp2 . . . σpp ]
Example 3.3
Consider the previous example. Let the joint probabilities of X1 and X2 , p12(x1 , x2), be the entries in the following table.
              x2
x1            0      1      p1(x1)
−1            0.24   0.06   0.3
0             0.16   0.14   0.3
1             0.40   0.00   0.4
p2(x2)        0.8    0.2    1
Find the covariance matrix for the two random variables X1 and X2.
Solution 3.3
We have already shown that µ1 = E(X1) = 0.1 and µ2 = E(X2) = 0.2. Further,
σ11 = E(X1²) − µ1² = (−1)²(0.3) + 0²(0.3) + 1²(0.4) − (0.1)² = 0.69
σ22 = E(X2²) − µ2² = 0²(0.8) + 1²(0.2) − (0.2)² = 0.16
σ12 = E(X1X2) − µ1µ2 = (−1)(1)(0.06) + (1)(1)(0.00) − (0.1)(0.2) = −0.08
so that
Σ = [ 0.69  −0.08
     −0.08   0.16 ]
Example 3.4
Suppose Σ = [ 4  1  2       [ σ11 σ12 σ13
              1  9 −3    =    σ21 σ22 σ23
              2 −3 25 ]       σ31 σ32 σ33 ]
Obtain V^{1/2} and ρ.
Solution 3.4
V^{1/2} = diag(√σ11 , √σ22 , √σ33) = diag(2, 3, 5)
and (V^{1/2})⁻¹ = diag(1/2, 1/3, 1/5). Thus, the correlation matrix ρ is given by
ρ = (V^{1/2})⁻¹ Σ (V^{1/2})⁻¹ = [ 1     1/6   1/5
                                  1/6   1    −1/5
                                  1/5  −1/5   1 ]
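In R, cov2cor() performs exactly this conversion from a covariance matrix to a correlation matrix:

    Sigma <- matrix(c(4,  1,  2,
                      1,  9, -3,
                      2, -3, 25), nrow = 3, byrow = TRUE)
    V_half <- diag(sqrt(diag(Sigma)))                # V^{1/2} = diag(2, 3, 5)
    rho <- solve(V_half) %*% Sigma %*% solve(V_half)
    all.equal(rho, cov2cor(Sigma), check.attributes = FALSE)  # TRUE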
These results also hold if the population quantities are replaced by their
appropriate sample counterparts.
or
Z = ( Z1 )   [ c11 c12 . . . c1p    ( X1 )
    ( Z2 ) =   c21 c22 . . . c2p      ( X2 )   = CX
    ( ⋮ )      ...                    ( ⋮ )
    ( Zq )     cq1 cq2 . . . cqp ]    ( Xp )
The linear combinations Z = CX have mean vector E(Z) = CµX and covariance matrix Cov(Z) = CΣX C′.
Example 3.5
Let X′ = (X1 , X2) be a random vector with mean vector µX = (µ1 , µ2)′ and variance-covariance matrix
ΣX = [ σ11 σ12
       σ12 σ22 ]
Find the mean vector and covariance matrix for the linear combinations
Z1 = X1 − X2
Z2 = X1 + X2
Solution 3.5
Z = ( Z1 ) = [ 1 −1 ] ( X1 ) = CX
    ( Z2 )   [ 1  1 ] ( X2 )
µZ = CµX = ( µ1 − µ2 )
           ( µ1 + µ2 )
Cov(Z) = CΣX C′ = [ 1 −1 ] [ σ11 σ12 ] [ 1  1 ]
                  [ 1  1 ] [ σ12 σ22 ] [−1  1 ]
       = [ σ11 − 2σ12 + σ22    σ11 − σ22
           σ11 − σ22           σ11 + 2σ12 + σ22 ]
Matrix Inequalities
Maximization
max_{x≠0} (x′d)² / (x′Bx) = d′B⁻¹d
with the maximum attained when x = cB⁻¹d for some nonzero constant c.
Maximization of quadratic forms: For a p × p positive definite matrix B with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 and corresponding eigenvectors ei , i = 1, 2, . . . , p:
◦ max_{x≠0} x′Bx/x′x = λ1 (attained when x = e1)
◦ min_{x≠0} x′Bx/x′x = λp (attained when x = ep)
◦ max_{x⊥e1 ,...,ek} x′Bx/x′x = λ_{k+1} (attained when x = e_{k+1} , k = 1, 2, . . . , p − 1)
Sample Geometry
The projection of yj on the equal-components vector 1n is
(y′j 1n / 1′n 1n) 1n = x̄j 1n
Denote the deviation vector from the mean vector by
dj = yj − x̄j 1n = (x1j − x̄j , x2j − x̄j , . . . , xnj − x̄j)′
The sum of squared deviations (sum of squared residuals):
L²j = d′j dj = Σ_{i=1}^n (xij − x̄j)² = (n − 1)sjj
Also, the cross-product (dot product) of the jth and kth deviation vectors:
d′j dk = Σ_{i=1}^n (xij − x̄j)(xik − x̄k) = (n − 1)sjk
The angle between deviation vectors dj and dk is obtained from
cos(θjk) = d′j dk / [(d′j dj)(d′k dk)]^{1/2} = (n − 1)sjk / [(n − 1)sjj (n − 1)skk]^{1/2} = sjk / (sjj skk)^{1/2} = rjk
Thus, the cosine of the angle between two deviation vectors is the
correlation between the two random variables Xj , Xk.
The following results can be observed:
◦ if θjk = 0 then rjk = 1, i.e. if the deviation vectors are in the same
direction, the two random variables have a perfect positive linear correlation.
◦ if θjk = π/2, then rjk = 0, i.e. if the two deviation vectors are orthogonal,
the correlation between them is zero.
◦ if θjk = π, i.e. if they move in opposite directions, then rjk = −1.
Result 4.1
Assume that we take a random sample X1 , X2 , . . . , Xn from a multivariate
population with unknown mean vector µ and covariance matrix Σ. Then,
1 X̄ is an unbiased estimator for µ and has covariance matrix (1/n)Σ.
2 S_{n−1} = [n/(n − 1)] Sn is an unbiased estimator for Σ.
3 Sn is a biased estimator for Σ; the bias is equal to −(1/n)Σ.
Proof 4.1
(1) E(X̄) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n)(nµ) = µ
Cov(X̄) = E(X̄ − µ)(X̄ − µ)′
        = E{ [(1/n) Σ_{i=1}^n (Xi − µ)] [(1/n) Σ_{i=1}^n (Xi − µ)]′ }
        = (1/n²) E{ [(X1 − µ) + · · · + (Xn − µ)] [(X1 − µ)′ + · · · + (Xn − µ)′] }
        = (1/n²) E[ (X1 − µ)(X1 − µ)′ + · · · + (Xn − µ)(Xn − µ)′ ]   (cross terms vanish by independence)
        = (1/n²)(Σ + · · · + Σ) = (1/n²) nΣ = (1/n)Σ
(2) (n − 1)E(S_{n−1}) = E[ Σ_{i=1}^n Xi X′i − nX̄X̄′ ] = Σ_{i=1}^n E(Xi X′i) − nE(X̄X̄′)
but Cov(Xi) = Σ = E(Xi − µ)(Xi − µ)′ = E(Xi X′i) − µµ′, so E(Xi X′i) = Σ + µµ′;
similarly Cov(X̄) = (1/n)Σ = E(X̄X̄′) − µµ′, so E(X̄X̄′) = (1/n)Σ + µµ′.
Hence
(n − 1)E(S_{n−1}) = n(Σ + µµ′) − n[(1/n)Σ + µµ′] = nΣ + nµµ′ − Σ − nµµ′ = (n − 1)Σ
⇒ E(S_{n−1}) = Σ.
(3) Now E(Sn) = [(n − 1)/n]Σ, so
bias = E(Sn) − Σ = [(n − 1)/n]Σ − Σ = −(1/n)Σ.
The biased estimator is Sn = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′. Note that, as n → ∞, Sn
becomes unbiased (in the limit). We introduce a special notation for the unbiased
estimator:
S = [n/(n − 1)] Sn = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′.
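The relation between the two estimators is easy to verify numerically in R (a sketch with simulated data):

    set.seed(1)
    X <- matrix(rnorm(50 * 3), ncol = 3)
    n <- nrow(X)
    S  <- cov(X)            # unbiased estimator, divisor n - 1
    Sn <- S * (n - 1) / n   # biased (maximum likelihood type) estimator, divisor n
    all.equal(S, Sn * n / (n - 1))   # TRUE: S = n/(n-1) Sn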
Generalized variance
With a single variable, the sample variance is often used to describe the
amount of variation in the measurements on that variable.
When p variables are observed on each unit, the variation is described by
the sample variance-covariance matrix
S = [ s11 s12 . . . s1p
      s21 s22 . . . s2p
      ...
      sp1 sp2 . . . spp ]  ,  sjk = [1/(n − 1)] Σ_{i=1}^n (xij − x̄j)(xik − x̄k).
As single-number summaries of multivariate variability, the generalized sample variance is defined as the determinant |S|, and the total sample variance as the trace s11 + s22 + · · · + spp.
Example 4.1
The sample covariance matrix obtained from a data set with two variables is given by
S = [ 252.04  −68.43
      −68.43  123.67 ]
Evaluate the generalized and total sample variance.
Solution 4.1
|S| = (252.04)(123.67) − (−68.43)² = 31169.79 − 4682.66 = 26487.13
Total sample variance = s11 + s22 = 252.04 + 123.67 = 375.71.
Generalized variance
Example 4.2
Find the generalized variance and total variance of the population covariance matrices
Σ1 = [ 2 1     and   Σ2 = [ 2 −1
       1 3 ]               −1  3 ]
Comment.
Solution 4.2
tr(Σ1) = tr(Σ2) = 5 and |Σ1| = |Σ2| = 5. The two covariance matrices have
the same total variance and generalized variance, but clearly the two
matrices are different.
Exercise 4.1
(a) Verify the relationship between |S| and |R| using
S = [ 4 3 1
      3 9 2
      1 2 1 ]
Exercise 4.2
(b) Using S = D^{1/2} R D^{1/2}, show that |S| = (s11 s22 · · · spp)|R|.
(c) Show that the generalized variance |S| = 0 for
X = [ 1 2 5
      4 1 6
      4 0 4 ]
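Exercise 4.1 and the determinant relation can be checked numerically in R:

    S <- matrix(c(4, 3, 1,
                  3, 9, 2,
                  1, 2, 1), nrow = 3, byrow = TRUE)
    R <- cov2cor(S)
    det(S)                    # generalized variance |S|
    prod(diag(S)) * det(R)    # equals |S|, since |S| = (s11 s22 s33)|R|
    sum(diag(S))              # total variance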
Sample Mean
x̄ = ( (1/n) Σ_{i=1}^n xi1 )          ( x11 + x21 + · · · + xn1 )
    ( (1/n) Σ_{i=1}^n xi2 )  = (1/n) ( x12 + x22 + · · · + xn2 )
    ( ⋮ )                            ( ⋮ )
    ( (1/n) Σ_{i=1}^n xip )          ( x1p + x2p + · · · + xnp )
  = (1/n) [ x11 x21 . . . xn1    ( 1 )
            x12 x22 . . . xn2      ( 1 )
            ...                    ( ⋮ )
            x1p x2p . . . xnp ]    ( 1 )
  = (1/n) X′1n
⇒ x̄′ = (1/n) 1′n X
⇒ 1n x̄′ = (1/n) 1n 1′n X = (1/n) Jn X
Sample Covariance
S = [1/(n − 1)] (X − 1n x̄′)′(X − 1n x̄′)
  = [1/(n − 1)] [X − (1/n)Jn X]′ [X − (1/n)Jn X]
  = [1/(n − 1)] X′[In − (1/n)Jn]′[In − (1/n)Jn]X
  = [1/(n − 1)] X′[In − (1/n)Jn]X
since In − (1/n)Jn is symmetric and idempotent.
Let D = diag(sjj). Then
R = D^{−1/2} S D^{−1/2}
S = D^{1/2} R D^{1/2}
Observe that these equations are analogous to their population counterparts.
Result 4.2
The linear combinations
b′X = b1X1 + b2X2 + · · · + bpXp
c′X = c1X1 + c2X2 + · · · + cpXp
have sample means, variances, and covariances that are related to x̄ and S by
Sample mean of b′X = b′x̄
Sample mean of c′X = c′x̄
Sample variance of b′X = b′Sb
Sample variance of c′X = c′Sc
Sample covariance of b′X and c′X = b′Sc
Example 4.3
Consider the data array
X = [ 42 4
      52 5
      48 4
      58 3 ]
Find x̄, Sn , R.
Solution 4.3
x̄ = (1/n) X′1n = (1/4) [ 42 52 48 58 ] ( 1 )   = ( 50 )
                        [ 4  5  4  3 ]  ( 1 )     ( 4 )
                                        ( 1 )
                                        ( 1 )
Sn = (1/n) X′[In − (1/n)Jn]X = (1/4) X′[I4 − (1/4)J4]X = [ 34  −1.5
                                                          −1.5  0.5 ]
consistent with Example 2.2, which also used the divisor n = 4.
R = D^{−1/2} Sn D^{−1/2}, where D^{1/2} = diag(√34, √0.5) = diag(5.83, 0.71) and D^{−1/2} = diag(0.17, 1.41), so
R = [ 0.17 0    ] [ 34   −1.5 ] [ 0.17 0    ]  = [ 1     −0.36
    [ 0    1.41 ] [ −1.5  0.5 ] [ 0    1.41 ]      −0.36   1 ]
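The same matrix operations can be carried out directly in R (a sketch mirroring the formulas above):

    X <- matrix(c(42, 4, 52, 5, 48, 4, 58, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    one <- rep(1, n)
    J <- one %*% t(one)                          # J_n = 1_n 1_n'
    xbar <- t(X) %*% one / n                     # (1/n) X'1_n = (50, 4)'
    Sn <- t(X) %*% (diag(n) - J / n) %*% X / n   # divisor-n covariance matrix
    Dih <- diag(1 / sqrt(diag(Sn)))              # D^{-1/2}
    R <- Dih %*% Sn %*% Dih                      # correlation matrix
    xbar; Sn; R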
Multivariate Normal Distribution
Introduction
The p-dimensional normal density is
f(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)′Σ⁻¹(x − µ) }
where (x − µ)′Σ⁻¹(x − µ) is the squared generalized (statistical) distance from x to µ. We write
X ∼ Np(µ, Σ).
For the bivariate case (p = 2), with σ12 = ρ12 √σ11 √σ22 ,
(x − µ)′Σ⁻¹(x − µ)
= (x1 − µ1 , x2 − µ2) [1/(σ11σ22 − σ12²)] [ σ22             −ρ12 √σ11 √σ22    ( x1 − µ1 )
                                           −ρ12 √σ11 √σ22   σ11 ]             ( x2 − µ2 )
= [1/(σ11σ22(1 − ρ12²))] [ (x1 − µ1)² σ22 − 2ρ12 √σ11 √σ22 (x1 − µ1)(x2 − µ2) + (x2 − µ2)² σ11 ]
= [1/(1 − ρ12²)] [ ((x1 − µ1)/√σ11)² + ((x2 − µ2)/√σ22)² − 2ρ12 ((x1 − µ1)/√σ11)((x2 − µ2)/√σ22) ]
so that the bivariate normal density is
f(x1 , x2) = [1 / (2π √(σ11σ22(1 − ρ12²)))]
× exp{ −[1/(2(1 − ρ12²))] [ ((x1 − µ1)/√σ11)² + ((x2 − µ2)/√σ22)² − 2ρ12 ((x1 − µ1)/√σ11)((x2 − µ2)/√σ22) ] }
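The bivariate normal density can be evaluated and its elliptical contours drawn in R; a minimal sketch assuming the contributed package mvtnorm (whose dmvnorm() evaluates the Np density) is installed:

    library(mvtnorm)
    mu <- c(0, 0)
    Sigma <- matrix(c(1, 0.5,
                      0.5, 1), nrow = 2)   # sigma11 = sigma22 = 1, rho12 = 0.5
    x <- seq(-3, 3, length.out = 60)
    z <- outer(x, x, function(a, b) dmvnorm(cbind(a, b), mean = mu, sigma = Sigma))
    contour(x, x, z)   # contours of constant density are ellipses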
Example 5.1
For X distributed as N3(µ, Σ), find the distribution of
( X1 − X2 )  = [ 1 −1  0 ] ( X1 )
( X2 − X3 )    [ 0  1 −1 ] ( X2 )   = AX
                           ( X3 )
Let Y = (Y1 , Y2)′ = AX. Then
Y ∼ N2(Aµ, AΣA′)
and
E(Y) = AE(X) = Aµ = [ 1 −1  0 ] ( µ1 )   = ( µ1 − µ2 )
                    [ 0  1 −1 ] ( µ2 )     ( µ2 − µ3 )
                                ( µ3 )
Properties
Note that with this assignment X, µ and Σ can respectively be rearranged and
partitioned as
X = ( X2 )    µ = ( µ2 )    Σ = [ σ22 σ24 | σ12 σ23 σ25
    ( X4 )        ( µ4 )          σ24 σ44 | σ14 σ34 σ45
    ( −− )        ( −− )          −−−−−−−−−−−−−−−−−−−−−
    ( X1 )        ( µ1 )          σ12 σ14 | σ11 σ13 σ15
    ( X3 )        ( µ3 )          σ23 σ34 | σ13 σ33 σ35
    ( X5 )        ( µ5 )          σ25 σ45 | σ15 σ35 σ55 ]
Thus X(1) = (X2 , X4)′ has the distribution
N2(µ(1) , Σ11) = N2( ( µ2 ) , [ σ22 σ24 )
                     ( µ4 )     σ24 σ44 ] )
It is clear from this example that the normal distribution for any subset can be
expressed by simply selecting the appropriate means and covariances from the
original µ and Σ.
Properties
See that X(1) = (X1 , X2)′ and X3 have covariance matrix Σ12 = (0, 0)′. Therefore
(X1 , X2) and X3 are independent. This implies that X3 is independent of X1
and also of X2.
Example 5.4 (Result)
Let ( X1 ) ∼ Np( ( µ1 ) , [ Σ11 | Σ12 )
    ( X2 )       ( µ2 )     Σ21 | Σ22 ] )
Then (X1 | X2 = x2) ∼ Nq(µ1|2 , Σ1|2) with
µ1|2 = µ1 + Σ12 Σ22⁻¹ (x2 − µ2)   and   Σ1|2 = Σ11 − Σ12 Σ22⁻¹ Σ21.
Conditional distribution
Indirect proof
Let A = [ I  −Σ12 Σ22⁻¹
          0   I ]
We know that X − µ ∼ N(0, Σ). This implies
A(X − µ) = ( X1 − µ1 − Σ12 Σ22⁻¹ (X2 − µ2) )  ∼ N(0, AΣA′)
           ( X2 − µ2 )
but
AΣA′ = [ I  −Σ12 Σ22⁻¹ ] [ Σ11 Σ12 ] [ I           0 ]
       [ 0   I         ] [ Σ21 Σ22 ] [ −Σ22⁻¹ Σ21  I ]
Conditional distribution
Thus,
AΣA′ = [ Σ11 − Σ12 Σ22⁻¹ Σ21    0
         0                      Σ22 ]
This implies that X1 − µ1 − Σ12 Σ22⁻¹(X2 − µ2) and X2 − µ2 are independent, with
X1 − µ1 − Σ12 Σ22⁻¹(X2 − µ2) ∼ N(0, Σ11 − Σ12 Σ22⁻¹ Σ21).
Conditioning on X2 = x2 therefore gives
(X1 | X2 = x2) ∼ N( µ1 + Σ12 Σ22⁻¹(x2 − µ2) , Σ11 − Σ12 Σ22⁻¹ Σ21 )
Hence,
(X1 | X2 = x2) ∼ N(µ1|2 , Σ1|2)
or, in the bivariate case,
f(x1 | x2) = f(x1 , x2)/f(x2)
= [1/(√(2π) √(σ11(1 − ρ12²)))] exp{ −[x1 − µ1 − (σ12/σ22)(x2 − µ2)]² / [2σ11(1 − ρ12²)] }
Distributional properties
Observe that
Σ_{i=1}^n (xi − µ)′Σ⁻¹(xi − µ) = Σ_{i=1}^n tr[(xi − µ)′Σ⁻¹(xi − µ)]
                               = Σ_{i=1}^n tr[Σ⁻¹(xi − µ)(xi − µ)′]
                               = tr[ Σ⁻¹ Σ_{i=1}^n (xi − µ)(xi − µ)′ ]
Now,
(xi − µ)(xi − µ)′ = (xi − x̄ + x̄ − µ)(xi − x̄ + x̄ − µ)′ = [(xi − x̄) + (x̄ − µ)][(xi − x̄) + (x̄ − µ)]′
thus,
Σ_{i=1}^n (xi − µ)(xi − µ)′ = Σ_{i=1}^n (xi − x̄)(xi − x̄)′ + n(x̄ − µ)(x̄ − µ)′ + 2 Σ_{i=1}^n (xi − x̄)(x̄ − µ)′
                            = Σ_{i=1}^n (xi − x̄)(xi − x̄)′ + n(x̄ − µ)(x̄ − µ)′
since Σ_{i=1}^n (xi − x̄) = 0, so the cross term vanishes.
The most likely parameter value is defined as the one yielding the maximum
likelihood L, given the data.
The corresponding parameter value is called the maximum likelihood
estimator or MLE (which is a random variable and a function of the data).
It is common practice to maximize the log-likelihood.
In our case, the log-likelihood is
l(µ, Σ) = log L(µ, Σ) = −(np/2) log(2π) − (n/2) log |Σ|
− (1/2) tr[ Σ⁻¹ Σ_{i=1}^n (xi − x̄)(xi − x̄)′ ] − (n/2)(x̄ − µ)′Σ⁻¹(x̄ − µ)
Covariance Estimation
Theorem 5.2
Let X1 , X2 , . . . , Xn be independent and identically distributed Np(µ, Σ). Then
the maximum likelihood estimators for µ and Σ are
µ̂ = X̄   and   Σ̂ = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)′ = [(n − 1)/n] S = Sn
Proof
1. We have derived the result for the mean.
2. The likelihood for µ = µ̂ becomes
L(µ̂, Σ) = [1/((2π)^{np/2} |Σ|^{n/2})] exp{ −(1/2) tr[Σ⁻¹ nSn] }
This quantity is independent of µ̂. Furthermore, it is a simple function of Sn.
Covariance Estimation
If we apply the previous lemma with b = n/2 and B = nSn , then the
maximum value is achieved for
Σ̂ = [1/(2b)](nSn) = (1/n)(nSn) = Sn
Observe that the MLE for the mean is unbiased, but the MLE for the covariance
matrix is biased.
Substituting µ̂ and Σ̂ in L(µ, Σ) yields
L(µ̂, Σ̂) = [1/(2π)^{np/2}] exp{ −(1/2) tr[Σ̂⁻¹ nΣ̂] } · (1/|Σ̂|^{n/2})
         = [1/(2π)^{np/2}] e^{−n tr[Ip]/2} · (1/|Σ̂|^{n/2}) = [1/(2π)^{np/2}] e^{−np/2} · (1/|Σ̂|^{n/2})
Sufficient Statistics
Under multivariate normality, X̄ and Sn are sufficient statistics for µ and Σ.
Sampling distributions
X̄ ∼ Np(µ, (1/n)Σ), and (n − 1)S has a Wishart distribution with n − 1
degrees of freedom. By the multivariate central limit theorem, √n(X̄ − µ) is
approximately Np(0, Σ) and n(X̄ − µ)′S⁻¹(X̄ − µ) is approximately χ²p
for large n.
Note:
n should be large compared to p.
Assessing normality
QQ-Plot
Note that Φ(z) has no closed-form solution, and so numerical results from
standard normal tables are used.
[Figure: QQ-plot of the sample quantiles x(i) against the standard normal quantiles.]
The pairs of points lie very nearly along a straight line, and we would not
reject the notion that these data are normally distributed, particularly
with a sample size as small as n = 10.
As a rule of thumb: do not trust this procedure in samples of size n ≤ 20
There can be quite a bit of variability in the straightness of the QQ-plot
for small samples even when the observations are known to come from a
normal population.
Is the QQ-plot straight?
If it is straight, the correlation between the pairs would be 1.
We can carry out a test based on the correlation coefficient.
Correlation coefficient
rQ = Σ_{j=1}^n (x(j) − x̄)(q(j) − q̄) / { √[ Σ_{j=1}^n (x(j) − x̄)² ] √[ Σ_{j=1}^n (q(j) − q̄)² ] }
rQ = 8.584 / (√8.472 √8.795) = 0.994
The test for normality at the 10% level of significance is provided by
referring rQ = 0.994 to the table entry corresponding to n = 10 and
α = 0.10. This entry is 0.9351.
Since rQ > 0.9351, we do not reject the hypothesis of normality.
As an alternative, the Shapiro-Wilk test statistic may be computed.
For large sample sizes, the two statistics are nearly the same, so either can
be used to judge lack of normality.
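In R, the QQ-plot, the correlation coefficient rQ, and the Shapiro-Wilk alternative are all readily available (the sample below is illustrative):

    x <- rnorm(10)                     # illustrative sample; replace with the data
    q <- qqnorm(x, plot.it = FALSE)    # q$x: normal quantiles, q$y: sample quantiles
    rQ <- cor(q$x, q$y)                # correlation of the QQ pairs
    rQ
    qqnorm(x); qqline(x)               # visual check for straightness
    shapiro.test(x)                    # Shapiro-Wilk test of normality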
To check multivariate normality, compute for each observation the squared generalized distance d²ᵢ = (xᵢ − x̄)′S⁻¹(xᵢ − x̄), the sample analogue of (x − µ)′Σ⁻¹(x − µ).
Example
Company    x1 = sales (mil dollars)    x2 = profits (mil dollars)
1          126874                      4224
2          96933                       3835
3          86656                       3510
4          63438                       3758
5          55264                       3939
6          50976                       1809
7          39069                       2946
8          36156                       359
9          35209                       2480
10         32416                       2413
Warnings:
◦ the χ²p distribution is only an approximation; an F distribution can be used
instead.
◦ the quantities are not independent.
Order the squared distances:
d²(1) ≤ · · · ≤ d²(n)
Graph the pairs
( χ²p((i − 1/2)/n) , d²(i) )
i.e. tabulate i, (i − 1/2)/n, d²(i) and the corresponding χ²p quantile q²((i − 1/2)/n).
[Figure: chi-square plot of the ordered distances d²(i) against the χ²2 quantiles.]
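A chi-square plot like the one above can be produced in R as follows (a sketch; X is an n × p data matrix, here simulated):

    X <- matrix(rnorm(50 * 2), ncol = 2)        # illustrative bivariate data
    n <- nrow(X); p <- ncol(X)
    d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared generalized distances
    q <- qchisq((1:n - 0.5) / n, df = p)        # chi-square quantiles
    plot(q, sort(d2),
         xlab = "chi-square quantile", ylab = "ordered distances")
    abline(0, 1)   # points near this line support multivariate normality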
Outliers
Transformations
Introduction
Hotelling’s T 2
Example 5.7
Let the data matrix for a random sample of size n = 3 from a bivariate
normal population be
X = [ 6  9
      10 6
      8  3 ]
Evaluate T² for µ′0 = [9, 5]. What is the sampling distribution of T² in this case?
Hotelling's T²
Solution
X̄ = ( x̄1 ) = ( (6 + 10 + 8)/3 )  = ( 8 )
    ( x̄2 )   ( (9 + 6 + 3)/3 )     ( 6 )
and
S = [ 4 −3
     −3  9 ]
Hotelling's T²
Thus
S⁻¹ = [1/(4(9) − (−3)(−3))] [ 9 3    = [ 1/3  1/9
                              3 4 ]      1/9  4/27 ]
T² = 3 [8 − 9, 6 − 5] [ 1/3  1/9    ( 8 − 9 )  = 3 [−1, 1] ( −2/9 )  = 7/9
                        1/9  4/27 ] ( 6 − 5 )              ( 1/27 )
Before the sample is selected, T² has the distribution of a
[(3 − 1)2/(3 − 2)] F_{2,3−2} = 4F_{2,1} random variable.
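The computation in Example 5.7 is immediate in R:

    X <- matrix(c(6, 9,
                  10, 6,
                  8, 3), ncol = 2, byrow = TRUE)
    n <- nrow(X)
    xbar <- colMeans(X)   # (8, 6)'
    S <- cov(X)           # [[4, -3], [-3, 9]]
    mu0 <- c(9, 5)
    T2 <- n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)
    T2                    # 7/9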
Example 5.8
Test the hypothesis H0 : µ′ = [4, 50, 10] against H1 : µ′ ≠ [4, 50, 10] at level
of significance α = 0.10, using perspiration data from 20 healthy females on
which three components, X1 = sweat rate, X2 = sodium content and
X3 = potassium content, were measured. The data yield the following.
Hotelling's T²
Example 5.8
x̄ = (4.640, 45.4, 9.965)′,
S = [ 2.879  10.010  −1.810         S⁻¹ = [ 0.586 −0.022  0.258
      10.010 199.788 −5.640                −0.022  0.006 −0.002
      −1.810 −5.640   3.628 ]               0.258 −0.002  0.402 ]
Solution
T² = 20 [4.640 − 4, 45.4 − 50, 9.965 − 10] S⁻¹ ( 4.640 − 4 )
                                               ( 45.4 − 50 )
                                               ( 9.965 − 10 )  = 9.74
Hotelling's T²
The critical value is
[(n − 1)p/(n − p)] Fp,n−p(0.10) = [19(3)/17] F3,17(0.10) = 3.353(2.44) = 8.18
Since T² = 9.74 > 8.18, we reject H0 at the 10% level of significance.
Hotelling's T²
Property
One feature of the T² statistic is that it is invariant under changes in the units
of measurement for X of the form
Y = CX + d
where Y is p × 1, C is a nonsingular p × p matrix, X is p × 1 and d is p × 1.
Proof
Given x1 , x2 , x3 , . . . , xn , then ȳ = Cx̄ + d and
Sy = (1/n) Σ_{i=1}^n (yi − ȳ)(yi − ȳ)′
   = (1/n) Σ_{i=1}^n (Cxi + d − Cx̄ − d)(Cxi + d − Cx̄ − d)′
   = (1/n) Σ_{i=1}^n C(xi − x̄)(xi − x̄)′C′ = CSC′
Hotelling's T²
Result 5.1
Let X1 , X2 , . . . , Xn be a random sample from an Np(µ, Σ) population. Then
the test based on T² is equivalent to the likelihood ratio test of H0 : µ = µ0
versus H1 : µ ≠ µ0 because
Λ^{2/n} = [ 1 + T²/(n − 1) ]⁻¹ = |Σ̂| / |Σ̂0|
H0 is rejected for small values of Λ^{2/n} or, equivalently, large values of T².
Optimality Properties
Likelihood ratio tests are known to have certain optimality properties.
For testing H0 : θ = θ0 versus H1 : θ = θ1 (a simple alternative), the
likelihood ratio test has the highest power (among all tests) for a given
significance level α.
Recall
P(test rejects | H0) = α
P(test rejects | H1) = 1 − β
Confidence region
A 100(1 − α)% confidence region for µ is the set of all µ satisfying
n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ [(n − 1)p/(n − p)] Fp,n−p(α)
Example
Consider the following: n = 42,
x̄ = ( 0.564 )   S = [ 0.0144 0.0117     S⁻¹ = [ 203.018 −163.391
    ( 0.603 )         0.0117 0.0146 ]          −163.391  200.228 ]
The 95 percent confidence region for µ consists of all values (µ1 , µ2)
satisfying
42 [0.564 − µ1 , 0.603 − µ2] [ 203.018 −163.391    ( 0.564 − µ1 )  ≤ [2(41)/40] F2,40(0.05)
                              −163.391  200.228 ]  ( 0.603 − µ2 )
Since F2,40(0.05) = 3.23, this becomes
42(203.018)(0.564 − µ1)² + 42(200.228)(0.603 − µ2)²
− 84(163.391)(0.564 − µ1)(0.603 − µ2) ≤ 6.62
Proof of Result
T² = n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ c² ⇔ n(ℓ′x̄ − ℓ′µ)²/(ℓ′Sℓ) ≤ c² for every ℓ,
giving the simultaneous intervals
ℓ′x̄ − c √(ℓ′Sℓ/n) ≤ ℓ′µ ≤ ℓ′x̄ + c √(ℓ′Sℓ/n)
Example [College Data]
Consider a data set with
n = 87, x̄ = ( 526.59 )   S = [ 5691.34 600.51 217.25
             ( 54.69 )         600.51  126.05 23.37
             ( 25.13 )         217.25  23.37  23.11 ]
a. Compute the 95 percent simultaneous confidence intervals for µ1 , µ2 and µ3.
b. Construct a 95 percent confidence interval for µ2 − µ3.
c. Compute the 95 percent confidence region (confidence ellipse) for the pair
(µ2 , µ3).
Solution
(a) [p(n − 1)/(n − p)] Fp,n−p(α) = [3(87 − 1)/(87 − 3)] F3,84(0.05) = [3(86)/84](2.7) = 8.29
526.59 − √8.29 √(5691.34/87) ≤ µ1 ≤ 526.59 + √8.29 √(5691.34/87)
503.30 ≤ µ1 ≤ 549.88
54.69 − √8.29 √(126.05/87) ≤ µ2 ≤ 54.69 + √8.29 √(126.05/87)
51.22 ≤ µ2 ≤ 58.16
25.13 − √8.29 √(23.11/87) ≤ µ3 ≤ 25.13 + √8.29 √(23.11/87)
23.65 ≤ µ3 ≤ 26.61
(b) (x̄2 − x̄3) ± √{ [p(n − 1)/(n − p)] Fp,n−p(0.05) } √[(s22 + s33 − 2s23)/n]
= (54.69 − 25.13) ± √8.29 √{[126.05 + 23.11 − 2(23.37)]/87}
= 29.56 ± 3.12
(c) The 95 percent confidence ellipse for (µ2 , µ3) consists of all (µ2 , µ3) satisfying
0.849(54.69 − µ2)² + 4.633(25.13 − µ3)² − 2(0.859)(54.69 − µ2)(25.13 − µ3) ≤ 8.29
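The simultaneous T² intervals in part (a) can be reproduced in R from the summary statistics alone:

    n <- 87; p <- 3
    xbar <- c(526.59, 54.69, 25.13)
    s <- c(5691.34, 126.05, 23.11)                      # diagonal elements of S
    c2 <- p * (n - 1) / (n - p) * qf(0.95, p, n - p)    # about 8.29 (exact F quantile)
    half <- sqrt(c2) * sqrt(s / n)
    cbind(lower = xbar - half, upper = xbar + half)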
Bonferroni Intervals
Split the overall error rate over m statements: α = α1 + α2 + · · · + αm.
For αj = α/m, the ratio of the Bonferroni length to the T² length is
tn−1(α/(2m)) / √{ [p(n − 1)/(n − p)] Fp,n−p(α) }
Bonferroni Intervals
With m = 3 and t86(0.05/6) = 2.44:
526.59 ± 2.44 √(5691.34/87) or 506.86 ≤ µ1 ≤ 546.32
54.69 ± 2.44 √(126.05/87) or 51.75 ≤ µ2 ≤ 57.63
25.13 ± 2.44 √(23.11/87) or 23.87 ≤ µ3 ≤ 26.39
Observe that these intervals are shorter than the T² intervals constructed
above for µ1 , µ2 , µ3.
Exercise
For the example with n = 42,
x̄ = ( 0.564 )   S = [ 0.0144 0.0117     S⁻¹ = [ 203.018 −163.391
    ( 0.603 )         0.0117 0.0146 ]          −163.391  200.228 ]
i. Conduct a test of the null hypothesis H0 : µ′ = [0.55, 0.60] at the
α = 0.05 level of significance. Reject H0 if
n(x̄ − µ0)′S⁻¹(x̄ − µ0) > χ²p(α)
where χ²p(α) is the upper (100α)th percentile of a chi-square distribution
with p df.
Note: This test statistic yields essentially the same result as T² in
situations where the χ² test is appropriate. This follows from the fact that
[(n − 1)p/(n − p)] Fp,n−p(α) and χ²p(α) are approximately equal for n large
relative to p.
Confidence Statement: Let X1 , X2 , . . . , Xn be a random sample from a
population with mean µ and positive definite covariance Σ. If n − p is large,
ℓ′X̄ ± √(χ²p(α)) √(ℓ′Sℓ/n)
will contain ℓ′µ for every ℓ, with probability approximately 1 − α.
Paired comparisons
Let Xijt be the outcome for unit i on the jth variable under treatment
(experimental condition) t = 1, 2; j = 1, . . . , p; i = 1, 2, . . . , n.
Then define Dij = Xij1 − Xij2 and let Di be the difference vector for the ith
unit/pair: Di = (Di1 , Di2 , . . . , Dip)′.
Denote E(Di) = δ and the covariance Cov(Di) = Σd. Again, the T² statistic
plays a critical role:
T² = n(D̄ − δ)′Sd⁻¹(D̄ − δ).
Assume Di ∼ Np(δ, Σd). Then (for a test of the hypothesis H0 : δ = 0
against H1 : δ ≠ 0)
T² ∼ [(n − 1)p/(n − p)] Fp,n−p
Paired comparisons
Bonferroni intervals:
d̄j − tn−1(α/(2g)) √(s²dj/n) ≤ δj ≤ d̄j + tn−1(α/(2g)) √(s²dj/n)
Example
di1 = Xi11 − Xi12 : −19 −22 −18 −27 −4 −10 −14 17 9 4 −19
di2 = Xi21 − Xi22 : 12 10 42 15 −1 11 −4 60 −2 10 −7
Here
d̄ = ( d̄1 ) = ( −9.36 )   Sd = [ 199.26 88.38
    ( d̄2 )   ( 13.27 )          88.38 418.61 ]
T² = 11 [−9.36, 13.27] [ 199.26 88.38  ]⁻¹ ( −9.36 )  = 13.6
                       [ 88.38  418.61 ]   ( 13.27 )
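A sketch of this paired analysis in R, using the differences listed above:

    d1 <- c(-19, -22, -18, -27, -4, -10, -14, 17, 9, 4, -19)
    d2 <- c(12, 10, 42, 15, -1, 11, -4, 60, -2, 10, -7)
    D <- cbind(d1, d2)
    n <- nrow(D); p <- ncol(D)
    dbar <- colMeans(D)                                  # (-9.36, 13.27)
    Sd <- cov(D)
    T2 <- n * t(dbar) %*% solve(Sd) %*% dbar             # 13.6
    crit <- (n - 1) * p / (n - p) * qf(0.95, p, n - p)   # 9.47
    c(T2 = T2, critical = crit)                          # T2 > critical: reject H0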
Example
The 95% simultaneous confidence intervals, with c² = [(n − 1)p/(n − p)] F2,9(0.05) = 9.47:
δ1 : d̄1 ± √9.47 √(199.26/11) = −9.36 ± 13.10 or (−22.46, 3.74)
δ2 : d̄2 ± √9.47 √(418.61/11) = 13.27 ± 18.98 or (−5.71, 32.25)
The 95% simultaneous confidence intervals include zero, yet the
hypothesis H0 : δ = 0 was rejected at level α. What are we to conclude?
The evidence points towards real differences: the point δ = 0 falls
outside the 95% confidence region for δ.
Note that our analysis assumed a normal distribution for Di.
Exercise: Construct the 95% confidence region for δ and show that
δ′ = (0, 0) falls outside it. Find the Bonferroni simultaneous intervals.
Do they cover zero?
Repeated Measures
Each unit i yields q repeated measurements Xi = (Xi1 , . . . , Xiq)′.
Most often, not the mean vector µ itself but linear functions Cµ of its
components are of interest. We call C a contrast matrix.
Repeated measures
Examples are:
Baseline comparison
( µ1 − µ2 )    [ 1 −1  0 . . .  0     ( µ1 )
( µ1 − µ3 ) =    1  0 −1 . . .  0       ( µ2 )
( ⋮ )            ...                    ( ⋮ )
( µ1 − µq )      1  0  0 . . . −1 ]     ( µq )
Successive differences
( µ2 − µ1 )     [ −1  1  0 . . .  0 0     ( µ1 )
( µ3 − µ2 )  =     0 −1  1 . . .  0 0       ( µ2 )
( ⋮ )              ...                      ( ⋮ )
( µq − µq−1 )      0  0  0 . . . −1 1 ]     ( µq )
Question
(a) All three indices are evaluated for each patient. Test for the equality of
the mean indices with α = 0.05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous
confidence intervals. Note:
c′x̄ ± √{ [(n − 1)(q − 1)/(n − q + 1)] Fq−1,n−q+1(α) } √(c′Sc/n)
Common Variance
Example
µ′1 − µ′2 = [µ11 − µ21 , µ12 − µ22]. The 95% simultaneous confidence
intervals for the population differences are:
µ11 − µ21 : (204.4 − 130.0) ± √6.26 √[(1/45 + 1/55) 10963.7]
21.7 ≤ µ11 − µ21 ≤ 127.1
µ12 − µ22 : (556.6 − 355) ± √6.26 √[(1/45 + 1/55) 63661.3]
74.7 ≤ µ12 − µ22 ≤ 328.5
Exercise
Construct the 95% confidence ellipse for µ1 − µ2. Does the difference in means
cover 0′ = [0, 0]? [Or, will the T² statistic reject H0 : µ1 − µ2 = 0 at the 5%
level?]
Example
For α = 0.05, the critical value is χ22 (0.05) = 5.99 and since
T 2 > χ22 (0.05) = 5.99, we reject H0 .
Comparison of g populations
Xℓi = µ + τℓ + eℓi
with i = 1, 2, . . . , nℓ and ℓ = 1, 2, . . . , g.
We take eℓi ∼ N(0, Σ) and E(Xℓi) = µ + τℓ , with µ = overall mean and
τℓ = group effect.
The model is overspecified, and so we impose the identifiability
constraint Σ_{ℓ=1}^g nℓ τℓ = 0. That is, the group effects, weighted by the
nℓ individuals in each group, add up to zero.
The following terminology is sometimes useful
◦ systematic part: µ + τℓ
◦ random part: eℓi
MANOVA
Write xℓi − x̄ = (x̄ℓ − x̄) + (xℓi − x̄ℓ). This implies
(xℓi − x̄)(xℓi − x̄)′ = [(x̄ℓ − x̄) + (xℓi − x̄ℓ)][(x̄ℓ − x̄) + (xℓi − x̄ℓ)]′
= (x̄ℓ − x̄)(x̄ℓ − x̄)′ + (x̄ℓ − x̄)(xℓi − x̄ℓ)′
+ (xℓi − x̄ℓ)(x̄ℓ − x̄)′ + (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
MANOVA
Summing over i:
Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′ = nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′ + Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
since Σ_{i=1}^{nℓ} (xℓi − x̄ℓ) = 0, so the cross-product terms vanish.
MANOVA
Summing also over ℓ:
Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′ = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′ + Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
Total (corrected) Sum of Squares and Cross Products
= Treatment (Between) Sum of Squares and Cross Products
+ Residual (Within) Sum of Squares and Cross Products
T = B + W
MANOVA
Recall,
W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
B = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′
B + W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄)(xℓi − x̄)′
Remark:
◦ The "between" matrix B is often denoted as H, the "hypothesis" matrix.
◦ The "within" matrix W is often denoted as E, the "error" matrix.
|W − λ(B + W)| = 0
|B − θ(B + W)| = 0
|B − φW| = 0
Now, rephrasing the first and second root equations, we find
|(1 − λ)W − λB| = 0  ⇒  |B − [(1 − λ)/λ]W| = 0
and
|(1 − θ)B − θW| = 0  ⇒  |B − [θ/(1 − θ)]W| = 0
Thus
φ = (1 − λ)/λ = θ/(1 − θ),   θ = φ/(1 + φ) = 1 − λ,   λ = 1/(1 + φ) = 1 − θ.
Test Statistics
B = Σ_{ℓ=1}^g nℓ(x̄ℓ − x̄)(x̄ℓ − x̄)′,   W = Σ_{ℓ=1}^g Σ_{i=1}^{nℓ} (xℓi − x̄ℓ)(xℓi − x̄ℓ)′
H0 : τ1 = τ2 = · · · = τg = 0
Exact distribution of Λ*
If H0 is true and n = Σ nℓ is large,
−[n − 1 − (p + g)/2] ln Λ*
has approximately a chi-square distribution with p(g − 1) df.
Consequently, for large n, we reject H0 at level α if
−[n − 1 − (p + g)/2] ln Λ* ≥ χ²_{p(g−1)}(α)
where χ²_{p(g−1)}(α) is the upper (100α)th percentile of a chi-square
distribution with p(g − 1) df.
Exercise: Observations on two responses, (x1 , x2)′, are collected for two
treatments. The observed vectors are
Treatment 1: ( 3 ) ( 1 ) ( 2 )
             ( 3 ) ( 6 ) ( 3 )
Treatment 2: ( 2 ) ( 5 ) ( 3 ) ( 2 )
             ( 3 ) ( 1 ) ( 1 ) ( 3 )
(a) (i) Calculate Spooled.
    (ii) Test H0 : µ1 − µ2 = 0 employing a two-sample approach with α = 0.01.
    (iii) Construct 99% simultaneous confidence intervals for the differences
    µ1i − µ2i , i = 1, 2.
(b) (i) Break up the observations into mean, treatment and residual
    components. [Xℓi = X̄ + (X̄ℓ − X̄) + (Xℓi − X̄ℓ)]
    (ii) Construct the one-way MANOVA table.
    (iii) Evaluate Wilks' lambda Λ* and test for treatment effects. Set α = 0.01.
    Repeat the test using the chi-square approximation. Compare the
    conclusions. A sketch of this analysis in R follows.
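As mentioned in (b)(iii), the one-way MANOVA for this exercise can be set up with the base R functions manova() and summary():

    y1 <- c(3, 1, 2, 2, 5, 3, 2)
    y2 <- c(3, 6, 3, 3, 1, 1, 3)
    treat <- factor(c(1, 1, 1, 2, 2, 2, 2))
    fit <- manova(cbind(y1, y2) ~ treat)
    summary(fit, test = "Wilks")   # Wilks' lambda and its F approximation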
Profile analysis
Consider two populations with mean vectors µ1 = (µ11 , . . . , µ1p)′ and
µ2 = (µ21 , . . . , µ2p)′. Plotting the components of each mean vector against
the variable index gives the two profiles. We ask:
(1) Are the profiles parallel?
(2) Given that the profiles are parallel, are they also coincident?
(3) Given that the profiles are coincident, are they horizontal?
Profile analysis
The null hypothesis in (1) can be written H01 : Cµ1 = Cµ2 where C is
the (p − 1) × p contrast matrix
C = [ −1  1  0  0 . . .  0 0
       0 −1  1  0 . . .  0 0
       ...
       0  0  0  0 . . . −1 1 ]
We don't need new methodology to carry out this test. We merely need
to think of the following transformed observations:
Cx1i , i = 1, . . . , n1
Cx2i , i = 1, . . . , n2
The resulting observations have p − 1 components instead of p. Using
these observations, the ordinary T² test immediately applies. In our case,
this is a two-sample problem.
Profile analysis
When we fail to reject this hypothesis, we can conclude that parallel
profiles are plausible. It is then reasonable to test for coincidence.
We now look for a test of whether the distance between the two
profiles is zero.
Given that the profiles are parallel, the difference at every point (outcome) j
is the same, d = µ1j − µ2j ; hence, if the profiles are coincident,
pd = Σ_j (µ1j − µ2j) = 0
that is,
Σ_j µ1j = Σ_j µ2j
Profile analysis
If we use the notation J = Jp,1 for the column vector of p ones, then the
above equality becomes
J′µ1 = J′µ2
The vector J plays a role similar to the role of C in the previous test.
Now, however, we have a 'univariate' test: the F statistic has 1 and
n1 + n2 − 2 degrees of freedom, and the test statistic is
T² = (J′x̄1 − J′x̄2) [ (1/n1 + 1/n2) J′Spooled J ]⁻¹ (J′x̄1 − J′x̄2)
   = [J′(x̄1 − x̄2)]² / [ (1/n1 + 1/n2) J′Spooled J ]
Profile analysis
Remark
The hypothesis H01 can always be tested. It is the first hypothesis that
usually comes to mind.
Example
(3) All things considered, how would you describe your contributions to the
marriage?
(4) All things considered, how would you describe your outcomes from the
marriage?
Example
Subjects were also asked to respond to the following using the 5-point scale
below.
None at all   Very little   Some   A great deal   Tremendous amount
1             2             3      4              5
(1) What is the level of passionate love that you feel for your partner?
(2) What is the level of compassionate love that you feel for your partner?
Let
x1 = an 8-point scale response to Question 1
x2 = an 8-point scale response to Question 2
x3 = a 5-point scale response to Question 3
x4 = a 5-point scale response to Question 4
and the two populations are defined as
Population 1 = married men
Population 2 = married women
Example
The sample mean vectors are
x̄1 = (6.833, 7.033, 3.967, 4.700)′ (men),  x̄2 = (6.633, 7.000, 4.000, 4.533)′ (women),
and
Spooled = [ 0.606 0.262 0.066 0.161
            0.262 0.637 0.173 0.143
            0.066 0.173 0.810 0.029
            0.161 0.143 0.029 0.306 ]
For the test of parallelism, C(x̄1 − x̄2) = (−0.167, −0.066, 0.200)′ and
T² = [C(x̄1 − x̄2)]′ [ (1/30 + 1/30) CSpooled C′ ]⁻¹ [C(x̄1 − x̄2)] = 15(0.067) = 1.005
With α = 0.05,
c² = [(30 + 30 − 2)(4 − 1)/(30 + 30 − 4)] F3,56(0.05) = 3.11(2.8) = 8.7
Since T² = 1.005 < 8.7, we conclude that the hypothesis of parallel profiles
for men and women is tenable.
For the coincidence test, J′(x̄1 − x̄2) = 0.367 and
J′Spooled J = (1, 1, 1, 1) Spooled (1, 1, 1, 1)′ = 4.027
T² = (0.367)² / [ (1/30 + 1/30)(4.027) ] = 0.501
With α = 0.05, F1,58(0.05) = 4.0, and since T² = 0.501 < F1,58(0.05) = 4.0, we
cannot reject the hypothesis that the profiles are coincident. That is, the
responses of the men and women to the four questions posed appear to be the
same.
We could now test for level (horizontal) profiles; however, it does not make
sense to carry out this test since Questions 1 and 2 were measured on a scale
of 1-8, while Questions 3 and 4 were measured on a scale of 1-5.
The incompatibility of these scales makes the test for level profiles
meaningless.
PCA
Do not
◦ overestimate the importance of principal component analysis.
◦ over-interpret a principal component analysis.
They can be useful:
◦ in the exploratory phase of a data analysis (to learn about the structure
of the data)
◦ as input to other statistical procedures.
In matrix notation,
( Y1 )    [ ℓ11 . . . ℓ1p    ( X1 )
( Y2 ) =    ...                ( X2 )
( ⋮ )       ℓp1 . . . ℓpp ]    ( ⋮ )
( Yp )                         ( Xp )
Population PCA
This problem can be overcome by additional requirements:
◦ the Yj are uncorrelated.
◦ the Yj have maximal variance.
◦ the coefficient vectors ℓj have unit length.
The last requirement is formalized as follows:
||ℓj||² = ℓ′j ℓj = Σ_{k=1}^p ℓ²jk = 1
These requirements translate into an iterative procedure:
Y1 = ℓ′1 X with maximal variance, subject to ℓ′1 ℓ1 = 1
Y2 = ℓ′2 X with maximal variance, subject to ℓ′2 ℓ2 = 1 and cov(Y1 , Y2) = 0
...
Yj = ℓ′j X with maximal variance, subject to ℓ′j ℓj = 1 and
cov(Y1 , Yj) = · · · = cov(Yj−1 , Yj) = 0
Heuristic argument
Mathematical argument
It turns out that the heuristic reasoning of the previous section can be
generalized and proven mathematically.
Theorem: Let Σ be the covariance matrix of X with ordered
eigenvalue-eigenvector pairs (λj , ej), j = 1, 2, . . . , p, where λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Then:
◦ the principal components are Yj = e′j X = ej1 X1 + · · · + ejp Xp
◦ var(Yj) = λj
◦ cov(Yj , Yk) = e′j Σ ek = 0, j ≠ k
For distinct eigenvalues, the choice of the principal components is unique.
Procedure
Properties
The following properties ensure that the new variables are very desirable:
(1) The variance of the new variables is determined by the eigenvalues λj :
var(Yj) = λj.
(2) The covariance between pairs of distinct new variables is zero, and hence
also the correlation: Corr(Yj , Yk) = 0, j ≠ k.
(3) The total variability of the original variables is preserved by the new
variables. More precisely, the sum of the diagonal elements of the
original covariance matrix equals the sum of the eigenvalues.
(4) The total population variance is λ1 + · · · + λp. The jth principal
component "explains" the proportion
λj / (λ1 + · · · + λp)
of the variance.
The proportion of the total variance accounted for by the first principal
component is
λ1 / (λ1 + λ2 + λ3) = 5.83/8 = 0.73
Further, the first two components Y1 and Y2 could replace the original
three variables with little loss of information.
Sample PCs
Covariance or Correlation?
◦ The original variables could have different units. Then we are
comparing apples and oranges.
◦ The original variables could have widely varying standard deviations.
In both cases, the analysis is driven by the variables with large standard
deviations.
For example, a length in kilometers re-expressed in millimeters (×10⁶) ⇒ the
principal components will be pulled towards the variable in millimeters.
A solution is provided by using the correlation matrix instead. This
means that all variables are replaced by their standardized versions.
Each variance then equals 1 and the total variability is equal to p.
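In R, prcomp() computes sample principal components; setting scale. = TRUE bases the analysis on the correlation matrix, which addresses the units problem just described (the data here are illustrative):

    set.seed(2)
    X <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 100))  # wildly different scales
    pc_cov <- prcomp(X)                  # covariance PCA: dominated by column 2
    pc_cor <- prcomp(X, scale. = TRUE)   # correlation PCA: standardized variables
    summary(pc_cov)
    summary(pc_cor)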
Exercise 2: Find the principal components and the proportion of the total
variance explained by each when the covariance matrix is
Σ = [ σ²  ρσ² ρσ²
      ρσ² σ²  ρσ²
      ρσ² ρσ² σ² ]
⇒ Yij = αj + βj Xi + εij
Two Populations
Classification Error
That is, based on the classification rule and the observations made for a
particular specimen, we are led to believe that the specimen belongs to
one subgroup, whereas in reality it belongs to the other.
These probabilities are the first step to answering the following questions:
(1) What is the misclassification error?
(2) What is the misclassification cost?
The cost matrix is very simple in the case of two groups:
                    Classify as
True Population     R1        R2
π1                  0         c(2|1)
π2                  c(1|2)    0
Similarly,
R1 = { x | f1(x)/f2(x) > [c(1|2)p2] / [c(2|1)p1] }
and hence the regions are defined.
What if f1(x)/f2(x) = [c(1|2)p2]/[c(2|1)p1]?
This boundary case is fairly arbitrary, and the performance of the rule
will not change whether we assign this curve to R1 or to R2.
The classification rule is: Assign an observation with outcome vector x
to the first population π1 if
f1(x)/f2(x) > [c(1|2)p2] / [c(2|1)p1]
But it is plausible that one has a rough idea about the relative
severity of the misclassifications, e.g., the second type of
misclassification is two times as bad as one of the first type.
Remarks:
◦ The shape of the discriminant function depends on the form of f1(x)
and f2(x). It will change with changing parametric forms assumed
for these densities (e.g., normal densities with equal or with unequal
variances).
◦ If either the cost ratio or the prior probability ratio is unity, the
definition of the regions simplifies accordingly.
◦ If the product of the cost and prior probability ratios is unity, then we
actually allocate to the population with the highest probability. We
then classify to R1 if f1(x)/f2(x) > 1, or equivalently f1(x) > f2(x).
The ECM is not the only useful criterion to determine the classification
boundary. A few alternatives are:
◦ Total probability of misclassification (TPM): the ECM for equal
costs.
◦ Largest posterior probability: reduces to the TPM.
The classification rule is based on the ratio of the two densities evaluated
at x:
f1(x)/f2(x) = exp{ −(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2) }
After some manipulation, the classification region R1 is found to be:
(µ1 − µ2)′Σ⁻¹x − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) ≥ ln{ [c(1|2)p2] / [c(2|1)p1] }
Sample Version
In the above reasoning, µ1 , µ2 and Σ are assumed to be known
population values. However, in practice they are unknown. This implies
they have to be estimated from data.
The following algorithm can be used.
◦ Collect n1 observations from π1 and n2 observations from π2.
◦ Construct the sample statistics x̄1 , x̄2 , S1 and S2 , as estimators for
µ1 , µ2 , Σ1 and Σ2 respectively.
◦ Since we assume a common Σ, it is necessary to construct a common
S. In other words, S1 and S2 are assumed to estimate the same
quantity, and therefore they should be combined in a so-called
pooled covariance matrix:
Spooled = [ (n1 − 1)S1 + (n2 − 1)S2 ] / (n1 + n2 − 2)
Observe that, when the sample sizes n1 and n2 are equal, Spooled is
simply the average of S1 and S2; otherwise they are weighted by the
sample sizes they are based upon.
With equal costs and equal priors, the right hand side of the allocation rule
vanishes, whence it can be rewritten as
(x̄1 − x̄2)′Spooled⁻¹ x0 ≥ (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2)
This linear combination occurs both on the left hand side and on the
right hand side of the classification rule.
The rule can be rewritten as:
ℓ′x0 ≥ (1/2)(ℓ′x̄1 + ℓ′x̄2) = m
where ℓ = Spooled⁻¹(x̄1 − x̄2) is called the vector of discriminant coefficients.
Also,
ℓ′ = (x̄1 − x̄2)′Spooled⁻¹ = [37.61, −28.92]
ℓ′x̄1 = [37.61, −28.92] ( −0.0065 )  = 0.88
                       ( −0.0390 )
ℓ′x̄2 = [37.61, −28.92] ( −0.2483 )  = −10.10
                       (  0.0262 )
m = (1/2)(ℓ′x̄1 + ℓ′x̄2) = (1/2)(0.88 − 10.10) = −4.61
Classify a woman with x1 = −0.210 and x2 = −0.044 (i.e. x0 = (x1 , x2)′):
should this woman be classified as normal or as an obligatory carrier?
Assuming equal costs and equal priors, so that ln(1) = 0, we obtain:
Allocate x0 to π1 if ℓ′x0 ≥ m
Allocate x0 to π2 if ℓ′x0 < m
Since
ℓ′x0 = [37.61, −28.92] ( −0.210 )  = −6.62 < −4.61
                       ( −0.044 )
the rule with equal priors classifies her as an obligatory carrier. With
unequal priors p1 = 0.75 and p2 = 0.25,
w = (x̄1 − x̄2)′Spooled⁻¹ x0 − (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2) = −6.62 − (−4.61) = −2.01
is compared to ln(p2/p1) = ln(0.25/0.75) = −1.10.
Since w = −2.01 < ln(p2/p1) = −1.10, we classify the woman as π2 , an
obligatory carrier.
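The linear classification rule of this example can be mirrored in R; a minimal sketch working from the summary statistics above (MASS::lda offers a packaged alternative when the raw data are available):

    l <- c(37.61, -28.92)          # discriminant coefficients l' = (xbar1 - xbar2)' Spooled^{-1}
    xbar1 <- c(-0.0065, -0.0390)   # normal group mean
    xbar2 <- c(-0.2483, 0.0262)    # obligatory-carrier group mean
    m <- 0.5 * (sum(l * xbar1) + sum(l * xbar2))   # midpoint, about -4.61
    x0 <- c(-0.210, -0.044)
    w <- sum(l * x0) - m                           # about -2.01
    # With priors p1 = 0.75, p2 = 0.25, allocate to pi1 only if w >= ln(p2/p1)
    w >= log(0.25 / 0.75)                          # FALSE: classify as pi2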
For unequal covariance matrices, allocate x0 to π1 if
−(1/2) x′0 (S1⁻¹ − S2⁻¹) x0 + (x̄′1 S1⁻¹ − x̄′2 S2⁻¹) x0 − k ≥ ln{ [c(1|2)p2] / [c(2|1)p1] }
where k = (1/2) ln(|Σ1|/|Σ2|) + (1/2)(µ′1 Σ1⁻¹ µ1 − µ′2 Σ2⁻¹ µ2)
Guidelines
◦ If the populations are approximately normal and the covariance
matrices are unequal, use the quadratic classification rule.
◦ BUT: the quadratic rule is sensitive to departures from normality,
while the linear rule is much more generally valid, also outside the
normal framework, as we will learn later from Fisher's discriminant
analysis.
◦ Carry out checks before performing a classification procedure:
† Transform to normality first.
† Then check for homogeneity of the covariance matrices.
The order is important since these homogeneity checks are sensitive
to nonnormality.
ℓ ∝ Spooled⁻¹(x̄1 − x̄2)
Allocate x0 to π1 if
Y0 = (x̄1 − x̄2)′Spooled⁻¹ x0 ≥ (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2)
Observe that
(1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2) = (1/2)ℓ′(x̄1 + x̄2) = (1/2)(ȳ1 + ȳ2) = m̂,
the estimated midpoint of the two univariate means.
For high values (H0 rejected), we are able to discriminate between the
two populations.
Exercise