Sie sind auf Seite 1von 47

DS-GA 1002 Lecture notes 8 November 9, 2015

Linear models: Algebra


Linear models are a pillar of modern data analysis. Many phenomena can be modeled as
linear, at least approximately. In addition, linear models tend to be easy to interpret (quan-
tity A is proportional to quantity B is simpler to understand than quantity A is proportional
to eB if B < 1 and to 1/B otherwise). Finally, linear models are often convenient from
a pragmatic point of view: fitting them to the data tends to be easy and computationally
tractable. We will begin with an overview of linear algebra; being comfortable with these
concepts is essential to understand linear-modeling methods.

1 Vector spaces
You are no doubt familiar with vectors in R2 or R3 , i.e.
 
  −1.1
2
x= , y =  0 . (1)
3
5

From the point of view of algebra, vectors are much more general objects. They are elements
of sets called vector spaces that satisfy the following definition.

Definition 1.1 (Vector space). A vector space consists of a set V and two operations + and
· satisfying the following conditions.

1. For any pair of elements x, y ∈ V the vector sum x + y belongs to V.

2. For any x ∈ V and any scalar α ∈ R the scalar multiple α · x ∈ V.

3. There exists a zero vector or origin 0 such that x + 0 = x for any x ∈ V.

4. For any x ∈ V there exists an additive inverse y such that x + y = 0, usually denoted
as −x.

5. The vector sum is commutative and associative, i.e. for all x, y ∈ V

x + y = y + x, (x + y) + z = x + (y + z). (2)

6. Scalar multiplication is associative, for any α, β ∈ R and x ∈ V

α (β · x) = (α β) · x. (3)
7. Scalar and vector sums are both distributive, i.e. for all α, β ∈ R and x, y ∈ V
(α + β) · x = α · x + β · x, α · (x + y) = α · x + α · y. (4)

A subspace of a vector space V is a subset of V that is also itself a vector space.

From now on, for ease of notation we will ignore the symbol for the scalar product ·, writing
α · x as α x.
Remark 1.2 (More general definition). We can define vector spaces over an arbitrary field,
instead of R, such as the complex numbers C. We refer to any linear algebra text for more
details.

We can easily check that Rn is a valid vector space together with the usual vector addition
 T
and vector-scalar product. In this case the zero vector is the all-zero vector 0 0 0 . . . .
When thinking about vector spaces it is a good idea to have R2 or R3 in mind to gain
intuition, but it is also important to bear in mind that we can define vector sets over many
other objects, such as infinite sequences, polynomials, functions and even random variables
as in the following example.

Example 1.3 (The vector space of zero-mean random variables). Zero-mean random vari-
ables belonging to the same probability space form a vector space together with the usual
operations for adding random variables together and for multiplying random variables and
scalars. This follows almost automatically from the fact that linear combinations of random
variables are also random variables and from linearity of expectation. You can check for in-
stance that if X and Y are zero-mean random variables, for any scalars α and β the random
variable αX + βY is also a zero-mean random variable. The zero vector of this vector space
is the random variable equal to 0 with probability one.

The definition of vector space guarantees that any linear combination of vectors in a vector
space V, obtained by adding the vectors after multiplying by scalar coefficients, belongs to
V. Given a set of vectors, a natural question to ask is whether they can be expressed as
linear combinations of each other, i.e. if they are linearly dependent or independent.
Definition 1.4 (Linear dependence/independence). A set of m vectors x1 , x2 , . . . , xm is
linearly dependent if there exist m scalar coefficients α1 , α2 , . . . , αm which are not all equal
to zero and such that
Xm
αi xi = 0. (5)
i=1

2
Otherwise, the vectors are linearly independent.

Equivalently, at least one vector in a linearly dependent set can be expressed as the linear
combination of the rest, whereas this is not the case for linearly independent sets.

Let us check the equivalence. Equation (5) holds with αj 6= 0 for some j if and only if
1 X
xj = α i xi . (6)
αj
i∈{1,...,m}/{j}

We define the span of a set of vectors {x1 , . . . , xm } as the set of all possible linear combina-
tions of the vectors:
( m
)
X
span (x1 , . . . , xm ) := y | y = αi xi for some α1 , α2 , . . . , αm ∈ R . (7)
i=1

This turns out to be a vector space.


Lemma 1.5. The span of any set of vectors x1 , . . . , xm belonging to a vector space V is a
subspace of V.

Proof. The span is a subset of V due to Conditions 1 and 2 in Definition 1.1. We now show
that it is a vector space. Conditions 5, 6 and 7 in Definition 1.1 hold because V is a vector
space. We check Conditions 1, 2, 3 and 4 by proving that for two arbitrary elements of the
span
m
X m
X
y1 = αi xi , y2 = β i xi , α1 , . . . , αm , β1 , . . . , βm ∈ R, (8)
i=1 i=1

γ1 y1 + γ2 y2 also belongs to the span. This holds because


m
X
γ1 y1 + γ2 y2 = (γ1 αi + γ2 βi ) xi , (9)
i=1

so γ1 y1 + γ2 y2 is in span (x1 , . . . , xm ). Now to prove Condition 1 we set γ1 = γ2 = 1, for


Condition 2 γ2 = 0, for Condition 3 γ1 = γ2 = 0 and for Condition 4 γ1 = −1, γ2 = 0.

When working with a vector space, it is useful to consider the set of vectors with the smallest
cardinality that spans the space. This is called a basis of the vector space.
Definition 1.6 (Basis). A basis of a vector space V is a set of independent vectors {x1 , . . . , xm }
such that
V = span (x1 , . . . , xm ) . (10)

3
An important property of all bases in a vector space is that they have the same cardinality.
Theorem 1.7. If a vector space V has a basis with finite cardinality then every basis of V
contains the same number of vectors.

This theorem, which is proven in Section A of the appendix, allows us to define the dimen-
sion of a vector space.
Definition 1.8 (Dimension). The dimension dim (V) of a vector space V is the cardinality
of any of its bases, or equivalently the smallest number of linearly independent vectors that
span V.

This definition coincides with the usual geometric notion of dimension in R2 and R3 : a
line has dimension 1, whereas a plane has dimension 2 (as long as they contain the origin).
Note that there exist infinite-dimensional vector spaces, such as the continuous real-valued
functions defined on [0, 1] or an iid sequence X1 , X2 , . . ..
The vector space that we use to model a certain problem is usually called the ambient
space and its dimension the ambient dimension. In the case of Rn the ambient dimension
is n.
Lemma 1.9 (Dimension of Rn ). The dimension of Rn is n.

Proof. Consider the set of vectors e1 , . . . , en ⊆ Rn defined by


     
1 0 0
0 1 0
e1 = · · · , e2 = · · · , . . . , en = · · · .
     (11)
0 0 1
One can easily check that this set is a basis. It is in fact the standard basis of Rn .

2 Inner product product and norm


Up to now, the only operations that we know how to apply to the vectors in a vector space
are addition and multiplication by a scalar. We now introduce a third operation, the inner
product between two vectors.
Definition 2.1 (Inner product). An inner product on a vector space V is an operation h·, ·i
that maps pairs of vectors to R and satisfies the following conditions.

• It is symmetric, for any x, y ∈ V


hx, yi = hy, xi . (12)

4
• It is linear, i.e. for any α ∈ R and any x, y, z ∈ V
hα x, yi = α hy, xi , hx + y, zi = hx, zi + hy, zi . (13)

• It is positive semidefinite: for any x ∈ V hx, xi ≥ 0 and hx, xi = 0 implies x = 0.

A vector space endowed with an inner product is called an inner-product space.

Example 2.2 (Dot product). We define the dot product between two vectors x, y ∈ Rn
as
X
x · y := x [i] y [i] , (14)
i

where x [i] is the ith entry of x. It is easy to check that the dot product is a valid inner
product. Rn endowed with the dot product is usually called an Euclidean space of dimension
n.

Example 2.3 (Covariance as an inner product). The covariance of two zero-mean random
variables X and Y is equal to E (XY ). It is a valid inner product in the vector space of
zero-mean random variables. It is obviously symmetric and linearity follows from linearity of
expectation. Finally, E (X 2 ) ≥ 0 because it is the sum or integral of a nonnegative quantity
and by Chebyshev’s inequality E (X 2 ) = 0 implies that X = 0 with probability one.

The norm of a vector is a generalization of the concept of length.


Definition 2.4 (Norm). Let V be a vector space, a norm is a function ||·|| from V to R that
satisfies the following conditions.

• It is homogeneous. For all α ∈ R and x ∈ V


||α x|| = |α| ||x|| . (15)

• It satisfies the triangle inequality


||x + y|| ≤ ||x|| + ||y|| . (16)
In particular, it is nonnegative (set y = −x).

5
• ||x|| = 0 implies that x is the zero vector 0.

A vector space equipped with a norm is called a normed space. Distances in a normed space
can be measured using the norm of the difference between vectors.

Definition 2.5 (Distance). The distance between two vectors in a normed space with norm
||·|| is

d (x, y) := ||x − y|| . (17)

Inner-product spaces are normed spaces because we can define a valid norm using the inner
product. The norm induced by an inner product is obtained by taking the square root of
the inner product of the vector with itself,
p
||x||h·,·i := hx, xi. (18)

The norm induced by an inner product is clearly homogeneous by linearity and symmetry
of the inner product. ||x||h·,·i = 0 implies x = 0 because the inner product is positive
semidefinite. We only need to establish that the triangle inequality holds to ensure that the
inner-product is a valid norm.

Theorem 2.6 (Cauchy-Schwarz inequality). For any two vectors x and y in an inner-product
space

|hx, yi| ≤ ||x||h·,·i ||y||h·,·i . (19)

Assume ||x||h·,·i 6= 0,

||y||h·,·i
hx, yi = − ||x||h·,·i ||y||h·,·i ⇐⇒ y = − x, (20)
||x||h·,·i
||y||h·,·i
hx, yi = ||x||h·,·i ||y||h·,·i ⇐⇒ y = x. (21)
||x||h·,·i

Proof. If ||x||h·,·i = 0 then x = 0 because the inner product is positive semidefinite, which
implies hx, yi = 0 and consequently that (19) holds with equality. The same is true if
||y||h·,·i = 0.
Now assume that ||x||h·,·i 6= 0 and ||y||h·,·i 6= 0. By semidefiniteness of the inner product,
2
0 ≤ ||y||h·,·i x + ||x||h·,·i y = 2 ||x||2h·,·i ||y||2h·,·i + 2 ||x||h·,·i ||y||h·,·i hx, yi , (22)

2
0 ≤ ||y||h·,·i x − ||x||h·,·i y = 2 ||x||2h·,·i ||y||2h·,·i − 2 ||x||h·,·i ||y||h·,·i hx, yi . (23)

6
These inequalities establish (19).
Let us prove (20) by proving both implications.
(⇒) Assume hx, yi = − ||x||h·,·i ||y||h·,·i . Then (22) equals zero, so ||y||h·,·i x = − ||x||h·,·i y
because the inner product is positive semidefinite.
(⇐) Assume ||y||h·,·i x = − ||x||h·,·i y. Then one can easily check that (22) equals zero, which
implies hx, yi = − ||x||h·,·i ||y||h·,·i .
The proof of (21) is identical (using (23) instead of (22)).
Corollary 2.7. The norm induced by an inner product satisfies the triangle inequality.

Proof.
||x + y||2h·,·i = ||x||2h·,·i + ||y||2h·,·i + 2 hx, yi (24)
≤ ||x||2h·,·i + ||y||2h·,·i
+ 2 ||x||h·,·i ||y||h·,·i by the Cauchy-Schwarz inequality (25)
 2
= ||x||h·,·i + ||y||h·,·i . (26)

Example 2.8 (Euclidean norm). The Euclidean or `2 norm is the norm induced by the dot
product in Rn ,
v
u n
√ uX
||x||2 := x · x = t x2i . (27)
i=1

In the case of R2 or R3 it is what we usually think of as the length of the vector.

Example 2.9 (The standard deviation as a norm). The standard deviation or root mean
square
p
σX = E (X 2 ) (28)
is the norm induced by the covariance inner product in the vector space of zero-mean random
variables.

7
3 Orthogonality
An important concept in linear algebra is orthogonality.
Definition 3.1 (Orthogonality). Two vectors x and y are orthogonal if

hx, yi = 0. (29)

A vector x is orthogonal to a set S, if

hx, si = 0, for all s ∈ S. (30)

Two sets of S1 , S2 are orthogonal if for any x ∈ S1 , y ∈ S2

hx, yi = 0. (31)

The orthogonal complement of a subspace S is

S ⊥ := {x | hx, yi = 0 for all y ∈ S} . (32)

Distances between orthogonal vectors measured in terms of the norm induced by the inner
product are easy to compute.
Theorem 3.2 (Pythagorean theorem). If x and y are orthogonal vectors

||x + y||2h·,·i = ||x||2h·,·i + ||y||2h·,·i . (33)

Proof. By linearity of the inner product

||x + y||2h·,·i = ||x||2h·,·i + ||y||2h·,·i + 2 hx, yi (34)


= ||x||2h·,·i + ||y||2h·,·i . (35)

If we want to show that a vector is orthogonal to a certain subspace, it is enough to show


that it is orthogonal to every vector in a basis of the subspace.
Lemma 3.3. Let x be a vector and S a subspace of dimension n. If for any basis b1 , b2 , . . . , bn
of S,

hx, bi i = 0, 1 ≤ i ≤ n, (36)

then x is orthogonal to S.

8
P n
Proof. Any vector v ∈ S can be represented as v = i αi=1 bi for α1 , . . . , αn ∈ R, from (36)
* +
X X
n n
hx, vi = x, αi=1 bi = αi=1 hx, bi i = 0. (37)
i i

We now introduce orthonormal bases.


Definition 3.4 (Orthonormal basis). A basis of mutually orthogonal vectors with norm equal
to one is called an orthonormal basis.

It is very easy to find the coefficients of a vector in an orthonormal basis: we just need to
compute the dot products with the basis vectors.
Lemma 3.5 (Coefficients in an orthonormal basis). If {u1 , . . . , un } is an orthonormal basis
of a vector space V, for any vector x ∈ V
n
X
x= hui , xi ui . (38)
i=1

Proof. Since {u1 , . . . , un } is a basis,


m
X
x= αi ui for some α1 , α2 , . . . , αm ∈ R. (39)
i=1

Immediately,
* m
+ m
X X
hui , xi = ui , αi ui = αi hui , ui i = αi (40)
i=1 i=1

because hui , ui i = 1 and hui , uj i = 0 for i 6= j.

For any subspace of Rn we can obtain an orthonormal basis by applying the Gram-Schmidt
method to a set of linearly independent vectors spanning the subspace.
Algorithm 3.6 (Gram-Schmidt).

Input: A set of linearly independent vectors {x1 , . . . , xm } ⊆ Rn .


Output: An orthonormal basis {u1 , . . . , um } for span (x1 , . . . , xm ).

Initialization: Set u1 := x1 / ||x1 ||2 .


For i = 1, . . . , m

9
1. Compute
i−1
X
vi := xi − huj , xi i uj . (41)
j=1

2. Set ui := vi / ||vi ||2 .

This implies in particular that we can always assume that a subspace has an orthonormal
basis.

Theorem 3.7. Every finite-dimensional vector space has an orthonormal basis.

Proof. To see that the Gram-Schmidt method produces an orthonormal basis for the span of
the input vectors we can check that span (x1 , . . . , xi ) = span (u1 , . . . , ui ) and that u1 , . . . , ui
is set of orthonormal vectors.

10
A Proof of Theorem 1.7
We prove the claim by contradiction. Assume that we have two bases {x1 , . . . , xm } and
{y1 , . . . , yn } such that m < n (or the second set has infinite cardinality). The proof follows
from applying the following lemma m times (setting r = 0, 1, . . . , m − 1) to show that
{y1 , . . . , ym } spans V and hence {y1 , . . . , yn } must be linearly dependent.

Lemma A.1. Under the assumptions of the theorem, if {y1 , y2 , . . . , yr , xr+1 , . . . , xm } spans
V then {y1 , . . . , yr+1 , xr+2 , . . . , xm } also spans V (possibly after rearranging the indices r +
1, . . . , m) for r = 0, 1, . . . , m − 1.

Proof. Since {y1 , y2 , . . . , yr , xr+1 , . . . , xm } spans V


r
X m
X
yr+1 = βi yi + γi xi , β1 , . . . , βr , γr+1 , . . . , γm ∈ R, (42)
i=1 i=r+1

where at least one of the γj is non zero, as {y1 , . . . , yn } is linearly independent by assumption.
Without loss of generality (here is where we might need to rearrange the indices) we assume
that γr+1 6= 0, so that
r m
!
1 X X
xr+1 = βi yi − γi xi . (43)
γr+1 i=1 i=r+2

This implies that any vector in the span of {y1 , y2 , . . . , yr , xr+1 , . . . , xm }, i.e. in V, can be rep-
resented as a linear combination of vectors in {y1 , . . . , yr+1 , xr+2 , . . . , xm }, which completes
the proof.

11
DS-GA 1002 Lecture notes 9 November 16, 2015

Linear models
1 Projections
The projection of a vector x onto a subspace S is the vector in S that is closest to x. In
order to define this rigorously, we start by introducing the concept of direct sum. If two
subspaces are disjoint, i.e. their only common point is the origin, then a vector that can be
written as a sum of a vector from each subspace is said to belong to their direct sum.

Definition 1.1 (Direct sum). Let V be a vector space. For any subspaces S1 , S2 ⊆ V such
that

S1 ∩ S2 = {0} (1)

the direct sum is defined as

S1 ⊕ S2 := {x | x = s1 + s2 s1 ∈ S1 , s2 ∈ S2 } . (2)

The representation of a vector in the direct sum of two subspaces is unique.

Lemma 1.2. Any vector x ∈ S1 ⊕ S2 has a unique representation

x = s1 + s2 s1 ∈ S1 , s2 ∈ S2 . (3)

Proof. If x ∈ S1 ⊕ S2 then by definition there exist s1 ∈ S1 , s2 ∈ S2 such that x = s1 + s2 .


Assume x = s01 + s02 , s01 ∈ S1 , s02 ∈ S2 , then s1 − s01 = s2 − s02 . This implies that s1 − s01
and s2 − s02 are in S1 and also in S2 . However, S1 ∩ S2 = {0}, so we conclude s1 = s01 and
s2 = s02 .

We can now define the projection of a vector x onto a subspace S by separating the vector
into a component that belongs to S and another that belongs to its orthogonal complement.

Definition 1.3 (Orthogonal projection). Let V be a vector space. The orthogonal projection
of a vector x ∈ V onto a subspace S ⊆ V is a vector denoted by PS x such that x−PS x ∈ S ⊥ .

Theorem 1.4 (Properties of orthogonal projections). Let V be a vector space. Every vector
x ∈ V has a unique orthogonal projection PS x onto any subspace S ⊆ V of finite dimension.
In particular x can be expressed as

x = PS x + PS ⊥ x. (4)
For any vector s ∈ S
hx, si = hPS x, si . (5)
For any orthonormal basis b1 , . . . , bm of S,
m
X
PS x = hx, bi i bi . (6)
i=1

Proof. Since V has finite dimension, so does S, which consequently has an orthonormal basis
with finite cardinality b01 , . . . , b0m by Theorem 3.7 in Lecture Notes 8. Consider the vector
m
X
p := hx, b0i i b0i . (7)
i=1

It turns out that x − p is orthogonal to every vector in the basis. For 1 ≤ j ≤ m,


* m
+
X
0 0 0 0


x − p, bj = x − hx, bi i bi , bj (8)
i=1
m
X
= x, b0j − hx, b0i i b0i , b0j



(9)

i=1 0
x, b0j


= − x, bj = 0, (10)
so by Lemma 3.3 in Lecture Notes 8 x − p ∈ S ⊥ and p is an orthogonal projection. Since
S ∩ S ⊥ = {0} 1 there cannot be two other vectors x1 ∈ S, x1 ∈ S ⊥ such that x = x1 + x2 so
the orthogonal projection is unique.
⊥
Notice that o := x − p is a vector in S ⊥ such that x − o = p is in S and therefore in S ⊥ .
This implies that o is the orthogonal projection of x onto S ⊥ and establishes (4).
Equation (5) follows immediately from the orthogonality of any vector s ∈ S and PS x.
Equation (6) follows from (5) and Lemma 3.5 in Lecture Notes 8.

Computing the norm of the projection of a vector onto a subspace is easy if we have access
to an orthonormal basis (as long as the norm is induced by the inner product).
Lemma 1.5 (Norm of the projection). The norm of the projection of an arbitrary vector
x ∈ V onto a subspace S ⊆ V of dimension d can be written as
v
u d
uX
||PS x||h·,·i = t hbi , xi2 (11)
i

for any orthonormal basis b1 , . . . , bd of S.


2
1
For any vector v that belongs to both S and S ⊥ hv, vi = ||v||2 = 0, which implies v = 0.

2
Proof. By (6)
||PS x||2h·,·i = hPS x, PS xi (12)
* d d
+
X X
= hbi , xi bi , hbj , xi bj (13)
i j
d X
X d
= hbi , xi hbj , xi hbi , bj i (14)
i j
d
X
= hbi , xi2 . (15)
i

Finally, we prove indeed that of a vector x onto a subspace S is the vector in S that is closest
to x in the distance induced by the inner-product norm.

Example 1.6 (Projection onto a one-dimensional subspace). To compute the projection


of
n a vector ox onto a one-dimensional subspace spanned by a vector v, we use the fact that
v/ ||v||h·,·i is a basis for span (v) (it is a set containing a unit vector that spans the subspace)
and apply (6) to obtain
hv, xi
Pspan(v) x = v. (16)
||v||2h·,·i

Theorem 1.7 (The orthogonal projection is closest). The orthogonal projection of a vector
x onto a subspace S belonging to the same inner-product space is the closest vector to x that
belongs to S in terms of the norm induced by the inner product. More formally, PS x is the
solution to the optimization problem
minimize ||x − u||h·,·i (17)
u
subject to u ∈ S. (18)

Proof. Take any point s ∈ S such that s 6= PS x


||x − s||2h·,·i = ||PS ⊥ x + PS x − s||2h·,·i (19)
= ||PS ⊥ x||2h·,·i + ||PS x − s||2h·,·i (20)
> 0 because s 6= PS x, (21)
where (20) follows from the Pythagorean theorem since because PS ⊥ x belongs to S ⊥ and
PS x − s to S.

3
2 Linear minimum-MSE estimation
We are interested in estimating the sample of a continuous random variable X from the
sample y of a random variable Y . If we know the joint distribution of X and Y then the
optimal estimator in terms of MSE is the conditional mean E (X|Y = y). However often it is
very challenging to completely characterize a joint distribution between two quantities, but
it is more tractable to obtain an estimate of their first and second order moments. In this
case it turns out that we can obtain the optimal linear estimate of X given Y by using our
knowledge of linear algebra.

Theorem 2.1 (Best linear estimator). Assume that we know the means µX , µY and variances
2
σX , σY2 of two random variables X and Y and their correlation coefficient ρXY . The best
linear estimate of the form aY + b of X given Y in terms of mean-square error is

ρXY σX (y − µY )
gLMMSE (y) = + µX . (22)
σY

Proof. First we determine the value of b. The cost function as a function of b is

h (b) = E (X − aY − b)2 = E (X − aY )2 + b2 − 2bE (X − aY )


 
(23)
= E (X − aY )2 + b2 − 2b (µX − aµY ) .

(24)

The first and second derivative with respect to b are

h0 (b) = 2b − 2 (µX − aµY ) , (25)


h00 (b) = 2. (26)

Since h00 is positive the function is convex so the minimum is obtained by setting h0 (b) to
zero, which yields

b = µX − aµY . (27)

Consider the centered random variables X̃ := X − µX and Ỹ := Y − µY . It turns out that


to find a we just need to find the best estimate of X̃ of the form aỸ

E (X − aY − b)2 = E ((X − µX − a (Y − µY ) + µX − aµY − b))2



(28)
 2 
= E X̃ − aỸ , (29)

clearly any a that minimizes the left-hand side also minimizes the right-hand side and vice
versa.

4
Consider the vector space of zero-mean random variables. X̃ and Ỹ belong to this vector
space. In fact,
D E  
X̃, Ỹ = E X̃ Ỹ (30)
= Cov (X, Y ) (31)
= σX σY ρXY , (32)
2  
X̃ = E X̃ 2 (33)

h·,·i
2
= σX , (34)
2  
Ỹ = E Ỹ 2 (35)

h·,·i

= σY2 . (36)

Any random variable of the form aỸ belongs to the subspace spanned by Ỹ . Since the
distance in this vector space is induced by the mean-square norm, by Theorem 1.7 the
vector of the form aỸ that approximates X̃ better is just the projection of X̃ onto the
subspace
n o spanned by Ỹ , which we will denote by PỸ X̃. This subspace has dimension 1, so
Ỹ /σY is a basis for the subspace. The projection is consequently equal to
* +
Ỹ Ỹ
PỸ X̃ = X̃, (37)
σY σY
D E Ỹ
= X̃, Ỹ (38)
σY2
σX ρXY Ỹ
= . (39)
σY
So a = σX ρXY /σY , which concludes the proof.

In words, the linear estimator of X given Y is obtained by

1. centering Y by removing its mean,

2. normalizing Ỹ by dividing by its standard deviation,

3. scaling the result using the correlation between Y and X,

4. scaling again using the standard deviation of X,

5. recentering by adding the mean of X.

5
3 Matrices
A matrix is a rectangular array of numbers. We denote the vector space of m × n matrices
by Rm×n . We denote the ith row of a matrix A by Ai: , the jth column by A:j and the (i, j)
entry by Aij . The transpose of a matrix is obtained by switching its rows and columns.

Definition 3.1 (Transpose). The transpose AT of a matrix A ∈ Rm×n is a matrix in


A ∈ Rm×n

AT ij = Aji .

(40)

A symmetric matrix is a matrix that is equal to its transpose.


Matrices map vectors to other vectors through a linear operation called matrix-vector prod-
uct.

Definition 3.2 (Matrix-vector product). The product of a matrix A ∈ Rm×n and a vector
x ∈ Rn is a vector in Ax ∈ Rn , such that
n
X
(Ax)i = Aij x [j] (41)
j=1

= hAi: , xi , (42)

i.e. the ith entry of Ax is the dot product between the ith row of A and x.
Equivalently,
n
X
Ax = A:j x [j] , (43)
j=1

i.e. Ax is a linear combination of the columns of A weighted by the entries in x.

One can easily check that the transpose of the product of two matrices A and B is equal to
the the transposes multiplied in the inverse order,

(AB)T = B T AT . (44)

We can now we can express the dot product between two vectors x and y as

hx, yi = xT y = y T x. (45)

The identity matrix is a matrix that maps any vector to itself.

6
Definition 3.3 (Identity matrix). The identity matrix in Rn×n is
 
1 0 ··· 0
0 1 · · · 0
I= . (46)
 ··· 
0 0 ··· 1

Clearly, for any x ∈ Rn Ix = x.

Definition 3.4 (Matrix multiplication). The product of two matrices A ∈ Rm×n and B ∈
Rn×p is a matrix AB ∈ Rm×p , such that
n
X
(AB)ij = Aik Bkj = hAi: , B:,j i , (47)
k=1

i.e. the (i, j) entry of AB is the dot product between the ith row of A and the jth column of
B.
Equivalently, the jth column of AB is the result of multiplying A and the jth column of B
n
X
AB = Aik Bkj = hAi: , B:,j i , (48)
k=1

and ith row of AB is the result of multiplying the ith row of A and B.

Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the
effect of the matrix of any vector.

Definition 3.5 (Matrix inverse). The inverse of a square matrix A ∈ Rn×n is a matrix
A−1 ∈ Rn×n such that

AA−1 = A−1 A = I. (49)

Lemma 3.6. The inverse of a matrix is unique.

Proof. Let us assume there is another matrix M such that AM = I, then

M = A−1 AM by (49) (50)


= A−1 . (51)

An important class of matrices are orthogonal matrices.

7
Definition 3.7 (Orthogonal matrix). An orthogonal matrix is a square matrix such that its
inverse is equal to its transpose,

UT U = UUT = I (52)

By definition, the columns U:1 , U:2 , . . . , U:n of any orthogonal matrix have unit norm and
orthogonal to each other, so they form an orthonormal basis (it’s somewhat confusing that
orthogonal matrices are not called orthonormal matrices instead). We can interpret applying
U T to a vector x as computing the coefficients of its representation in the basis formed by
the columns of U . Applying U to U T x recovers x by scaling each basis vector with the
corresponding coefficient:
n
X
T
x = UU x = hU:i , xi U:i . (53)
i=1

Applying an orthogonal matrix to a vector does not affect its norm, it just rotates the vector.

Lemma 3.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix U ∈ Rn×n
and any vector x ∈ Rn ,

||U x||2 = ||x||2 . (54)

Proof. By the definition of an orthogonal matrix

||U x||22 = xT U T U x (55)


T
=x x (56)
= ||x||22 . (57)

4 Eigendecomposition
An eigenvector v of A satisfies

Av = λv (58)

for a real number λ which is the corresponding eigenvalue. Even if A is real, its eigenvectors
and eigenvalues can be complex.

8
Lemma 4.1 (Eigendecomposition). If a square matrix A ∈ Rn×n has n linearly independent
eigenvectors v1 , . . . , vn with eigenvalues λ1 , . . . , λn it can be expressed in terms of a matrix
Q, whose columns are the eigenvectors, and a diagonal matrix containing the eigenvalues,
 
λ1 0 · · · 0
  0 λ2 · · · 0  
 v1 v2 · · · vn −1
 
A = v1 v2 · · · vn  (59)
 ··· 
0 0 · · · λn
= QΛQ−1 (60)

Proof.
 
AQ = Av1 Av2 · · · Avn (61)
 
= λ1 v1 λ2 v2 · · · λ2 vn (62)
= QΛ. (63)

As we will establish later on, if the columns of a square matrix are all linearly independent,
then the matrix has an inverse, so multiplying the expression by Q−1 on both sides completes
the proof.

Lemma 4.2. Not all matrices have an eigendecomposition

Proof. Consider for example the matrix


 
0 1
. (64)
0 0

Assume λ has a nonzero eigenvalue corresponding to an eigenvector with entries v1 and v2 ,


then
      
v2 0 1 v1 λv1
= = , (65)
0 0 0 v2 λv2

which implies that v2 = 0 and hence v1 = 0, since we have assumed that λ 6= 0. This implies
that the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors.

An interesting use of the eigendecomposition is computing successive matrix products very


fast. Assume that we want to compute

AA · · · Ax = Ak x, (66)

9
i.e. we want to apply A to x k times. Ak cannot be computed by taking the power of its
entries (try out a simple example to convince yourself). However, if A has an eigendecom-
position,

Ak = QΛQ−1 QΛQ−1 · · · QΛQ−1 (67)


= QΛk Q−1 (68)
 k 
λ1 0 ··· 0
 0 λk2 ··· 0 
= Q  Q−1 , (69)
 ··· 
k
0 0 · · · λn

using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to
taking the power of the diagonal entries. This allows to compute the k matrix products
using just 3 matrix products and taking the power of n numbers.
From high-school or undergraduate algebra you probably remember how to compute eigen-
vectors using determinants. In practice, this is usually not a viable option due to stability
issues. A popular technique to compute eigenvectors is based on the following insight. Let
A ∈ Rn×n be a matrix with eigendecomposition QΛQ−1 and let x be an arbitrary vector in
Rn . Since the columns of Q are linearly independent, they form a basis for Rn , so we can
represent x as
n
X
x= αi Q:i , αi ∈ R, 1 ≤ i ≤ n. (70)
i=1

Now let us apply A to x k times,


n
X
k
A x= αi Ak Q:i (71)
i=1
n
X
= αi λki Q:i . (72)
i=1

If we assume that the eigenvectors are ordered according to their magnitudes and that the
magnitude of one of them is larger than the rest, |λ1 | > |λ2 | ≥ . . ., and that α1 6= 0 (which
happens with high probability if we draw a random x) then as k grows larger the term
α1 λk1 Q:1 dominates. The term will blow up or tend to zero unless we normalize every time
before applying A. Adding the normalization step to this procedure results in the power
method or power iteration, an algorithm of great importance in numerical linear algebra.

Algorithm 4.3 (Power method).


Input: A matrix A.
Output: An estimate of the eigenvector of A corresponding to the largest eigenvalue.

10
v2 x1 v2 v2

x2 x3
v1 v1 v1
Figure 1: Illustration of the first three iterations of the power method for a matrix with eigen-
vectors v1 and v2 , whose corresponding eigenvalues are λ1 = 1.05 and λ2 = 0.1661.

Initialization: Set x1 := x/ ||x||2 , where x contains random entries.


For i = 1, . . . , k, compute
Axi−1
xi := . (73)
||Axi−1 ||2

Figure 1 illustrates the power method on a simple example, where the matrix– which was
just drawn at random– is equal to
 
0.930 0.388
A= . (74)
0.237 0.286

The convergence to the eigenvector corresponding to the eigenvalue with the largest magni-
tude is very fast.

5 Time-homogeneous Markov chains


A Markov chain is a sequence of discrete random variables X0 , X1 , . . . such that

pXk+1 |X0 ,X1 ,...,Xk (xk+1 |x0 , x1 , . . . , xk ) = pXk+1 |Xk (xk+1 |xk ) . (75)

In words, Xk+1 is conditionally independent of Xj for j ≤ k − 1 conditioned on Xk . If the


value of the random variables is restricted to a finite set {α1 , . . . , αn } with probability one,
the Markov chain is said to be time homogeneous. More formally,

Pij := pXk+1 |Xk (αi |αj ) (76)

only depends on i and j, not on k for all 1 ≤ i, j ≤ n, k ≥ 0.


If a Markov chain is time homogeneous we can group the transition probabilities Pij in a
transition matrix P . We express the pmf of Xk restricted to the set where it can be

11
nonzero as a vector,
 
pXk (α1 )
 p (α ) 
 
πk =  Xk 2  . (77)
 ··· 
pXk (αn )
By the Chain Rule, the pmf of Xk can be computed from the pmf of X0 using the transition
matrix,
πk = P P · · · P π0 = P k π0 . (78)
In some cases, no matter how we initialize the Markov chain, the Markov Chain forgets
its initial state and converges to a stationary distribution. This is exploited in Markov-
Chain Monte Carlo methods that allow to sample from arbitrary distributions by building
the corresponding Markov chain. These methods are very useful in Bayesian statistics.
A Markov Chain that converges to a stationary distribution π∞ is said to be ergodic. Note
that necessarily P π∞ = π∞ , so that π∞ is an eigenvector of the transition matrix with a
corresponding eigenvalue equal to one.
Conversely, let a transition matrix P of a Markov chain have a valid eigendecomposition with
n linearly independent eigenvectors v1 , v2 , . . . and corresponding eigenvalues λ1 > λ2 ≥ λ3 . . ..
If the eigenvector corresponding to the largest eigenvalue has non-negative entries then
v1
π∞ := Pn (79)
i=1 v1 (i)

is a valid pmf and


P π ∞ = λ 1 π∞ (80)
is also a valid pmf, which is only possible if the largest eigenvalue λ1 equals one. Now, if we
represent any possible initial pmf π0 in the basis formed by the eigenvectors of P we have
n
X
π0 = αi vi , αi ∈ R, 1 ≤ i ≤ n, (81)
i=1

and
πk = P k π0 (82)
Xn
= αi P k vi (83)
i=1
n
X
= α1 λ1 v1 + αi λki vi . (84)
i=2

12
Since the rest of eigenvalues are strictly smaller than one,
lim πk = α1 λ1 v1 = π∞ (85)
k→∞

where thePlast equality follows from the fact that the sequence of πk all belong to the closed
n
set {π | i π(i) = 1} so the limit also belongs to the set and hence is a valid pmf. We refer
the interested reader to more advanced texts treating Markov chains for further details.

6 Eigendecomposition of symmetric matrices


The following lemma shows that eigenvectors of a symmetric matrix corresponding to dif-
ferent nonzero eigenvalues are necessarily orthogonal.
Lemma 6.1. If A ∈ Rn×n is symmetric, then if ui and uj are eigenvectors of A corresponding
to different nonzero eigenvalues λi 6= λj 6= 0
uTi uj = 0. (86)

Proof. Since A = AT
1
uTi uj = (Aui )T uj (87)
λi
1
= uTi AT uj (88)
λi
1
= uTi Auj (89)
λi
λj
= uTi uj . (90)
λi
This is only possible if uTi uj = 0.

It turns out that every n×n symmetric matrix has n linearly independent vectors. The proof
of this is beyond the scope of these notes. An important consequence is that all symmetric
matrices have an eigendecomposition of the form
A = U DU T (91)
 
where U = u1 u2 · · · un is an orthogonal matrix.
The eigenvalues of a symmetric matrix λ1 , λ2 , . . . , λn can be positive, negative or zero. They
determine the value of the quadratic form:
n
T
X 2
q (x) := x Ax = λi xT ui (92)
i=1

13
If we order the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn then the first eigenvalue is the maximum
value attained by the quadratic if its input has unit `2 norm, the second eigenvalue is the
maximum value attained by the quadratic form if we restrict its argument to be normalized
and orthogonal to the first eigenvector, and so on.

Theorem 6.2. For any symmetric matrix A ∈ Rn with normalized eigenvectors u1 , u2 , . . . , un


with corresponding eigenvalues λ1 , λ2 , . . . , λn

λ1 = max uT Au, (93)


||u||2 =1

u1 = arg max uT Au, (94)


||u||2 =1

λk = max uT Au, (95)


||u||2 =1,u⊥u1 ,...,uk−1

uk = arg max uT Au. (96)


||u||2 =1,u⊥u1 ,...,uk−1

The theorem is proved in Section A.1 of the appendix.

7 The singular-value decomposition


If we consider the columns and rows of a matrix as sets of vectors then we can study their
respective spans.

Definition 7.1 (Row and column space). The row space row (A) of a matrix A is the span
of its rows. The column space col (A) is the span of its columns.

It turns out that the row space and the column space of any matrix have the same dimension.
We name this quantity the rank of the matrix.

Theorem 7.2. The rank is well defined;

dim (col (A)) = dim (row (A)) . (97)

Section A.2 of the appendix contains the proof.


The following theorem states that we can decompose any real matrix into the product of
orthogonal matrices containing bases for its row and column space and a diagonal matrix
with a positive diagonal. It is a fundamental result in linear algebra, but its proof is beyond
the scope of these notes.

14
Theorem 7.3. Without loss of generality let m ≤ n. Every rank r real matrix A ∈ Rm×n
has a unique singular-value decomposition of the form (SVD)
 T
v1

σ1 0 ··· 0
   0 σ2 · · · 0   v2T 
 
A = u1 u2 · · · um   (98)
 ···  · · · 
0 0 · · · σm T
vm
= U SV T , (99)

where the singular values σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0 are nonnegative real numbers, the


matrix U ∈ Rm×m containing the left singular vectors is orthogonal, and the matrix
V ∈ Rm×n containing the right singular vectors is a submatrix of an orthogonal matrix
(i.e. its columns form an orthonormal set).

Note that we can write the matrix as a sum of rank-1 matrices


m
X
A= σi ui viT (100)
i=1
r
X
= σi ui viT , (101)
i=1

where r is the number of nonzero singular values. The first r left singular vectors u1 , u2 , . . . , ur ∈
Rm form an orthonormal basis of the column space of A and the first r right singular vectors
v1 , v2 , . . . , vr ∈ Rn form an orthonormal basis of the row space of A. Therefore the rank of
the matrix is equal to r.

8 Principal component analysis


The goal of dimensionality-reduction methods is to project high-dimensional data onto
a lower-dimensional space while preserving as much information as possible. These meth-
ods are a basic tool in data analysis; some applications include visualization (especially if
we project onto R2 or R3 ), denoising and increasing computational efficiency. Principal
component analysis (PCA) is a linear dimensionality-reduction technique based on the
SVD.
If we interpret a set of data vectors as belonging to an ambient vector space, applying PCA
allows to find directions in this space along which the data have a high variation. This is
achieved by centering the data and then extracting the singular vectors corresponding to
the largest singular values. The next two sections provide a geometric and a probabilistic
justification.

15
Algorithm 8.1 (Principal component analysis).
Input: n data vectors x̃1 , x̃2 , . . . , x̃n ∈ Rm , a number k ≤ min {m, n}.
Output: The first k principal components, a set of orthonormal vectors of dimension m.

1. Center the data. Compute


n
1X
xi = x̃i − x̃i , (102)
n i=1

for 1 ≤ i ≤ n.

2. Group the centered data in a data matrix X


 
X = x1 x2 · · · xn . (103)

Compute the SVD of X and extract the left singular vectors corresponding to the k
largest singular values. These are the first k principal components.

8.1 PCA: Geometric interpretation

Once the data are centered, the energy of the projection of the data points onto different
directions in the ambient space reflects the variation of the dataset along those directions.
PCA selects the directions that maximize the `2 norm of the projection and are mutually
orthogonal. By Lemma 1.5, the sum of the squared `2 norms of the projection of the centered
data x1 , x2 , . . . , xn onto a 1D subspace spanned by a unit-norm vector u can be expressed as
n n
Pspan(u) xi 2 =
X X
2
uT xi xTi u (104)
i=1 i=1
= u XX T u
T
(105)
2
= X T u .
2
(106)

If we want to maximize the energy of the projection onto a subspace of dimension k, an


option is to choose orthogonal 1D projections sequentially. First we choose a unit vector
T 2

u1 that maximizes X u 2 and is consequently the 1D subspace that is better adapted to

the
Tdata.
Then, we choose a second unit vector u orthogonal to the first which maximizes
X u 2 and hence is the 1D subspace that is better adapted to the data while being in
2
the orthogonal complement of u1 . We repeat this procedure until we have k orthogonal
directions. This is exactly equivalent to performing PCA, as proved in Theorem 8.2 below.
The k directions correspond to the first k principal components. Figure 2 provides an
example in 2D. Note how the singular values are proportional to the energy that lies in the
direction of the corresponding principal component.

16
√ √ √
σ1 / √n = 0.705, σ1 / √n = 0.9832, σ1 / √n = 1.3490,
σ2 / n = 0.690 σ2 / n = 0.3559 σ2 / n = 0.1438

u1 u1
u1

u2 u2 u2

Figure 2: PCA of a dataset with n = 100 2D vectors with different configurations. The two
first singular values reflect how much energy is preserved by projecting onto the two first principal
components.

Theorem 8.2. For any matrix X ∈ Rm×n , where n > m, with left singular vectors u1 , u2 , . . . , um
corresponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm ,

σ1 = max X T u 2 , (107)
||u||2 =1

u1 = arg max X T u 2 , (108)
||u||2 =1

σk = max X T u 2 , 2 ≤ k ≤ r, (109)
||u||2 =1
u⊥u1 ,...,uk−1
T
uk = arg max X u ,
2
2 ≤ k ≤ r. (110)
||u||2 =1
u⊥u1 ,...,uk−1

Proof. If the SVD of X is U SV T then the eigendecomposition of XX T is equal to

XX T = U SV T V SU T = U S 2 U T , (111)

where V T V = I because n > m and the matrix has m nonzero singular values. S 2 is a
diagonal matrix containing σ12 ≥ σ22 ≥ . . . ≥ σm
2
in its diagonal.
The result now follows from applying Theorem 6.2 to the quadratic form

uXX T u = X T u 2 . (112)

This result shows that PCA is equivalent to choosing the best (in terms of `2 norm) k 1D
subspaces following a greedy procedure, since at each step we choose the best 1D subspace

17
√ √
σ1 /√n = 5.077 σ1 /√n = 1.261
σ2 / n = 0.889 σ2 / n = 0.139

u2

u1

u1
u2

Uncentered data Centered data


Figure 3: PCA applied to n = 100 2D data points. On the left the data are not centered. As a
result the dominant principal component u1 lies in the direction of the mean of the data and PCA
does not reflect the actual structure. Once we center, u1 becomes aligned with the direction of
maximal variation.

orthogonal to the previous ones. A natural question to ask is whether this method produces
the best k-dimensional subspace. A priori this is not necessarily the case; many greedy
algorithms produce suboptimal results. However, in this case the greedy procedure is indeed
optimal: the subspace spanned by the first k principal components is the best subspace we
can choose in terms of the `2 -norm of the projections. The theorem is proved in Section A.3
of the appendix.
Theorem 8.3. For any matrix X ∈ Rm×n with left singular vectors u1 , u2 , . . . , um corre-
sponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm ,
n n
Pspan(u1 ,u2 ,...,u ) xi 2 ≥
X X
||PS xi ||22 ,

k 2
(113)
i=1 i=1

for any subspace S of dimension k ≤ min {m, n}.

Figure 3 illustrates the importance of centering before applying PCA. Theorems 8.2 and 8.3
still hold if the data are not centered. However, the norm of the projection onto a certain
direction no longer reflects the variation of the data. In fact, if the data are concentrated
around a point that is far from the origin, the first principal component will tend be aligned
in that direction. This makes sense as projecting onto that direction captures more energy.
As a result, the principal components do not capture the directions of maximum variation
within the cloud of data.

18
8.2 PCA: Probabilistic interpretation

Let us interpret our data, x1 , x2 , . . . , xn in Rm , as samples of a random vector X of dimension


m. Recall that we are interested in determining the directions of maximum variation of the
data in ambient space. In probabilistic terms, we want to find the directions in which the
data have higher variance. The covariance matrix of the data provides this information. In
fact, we can use it to determine the variance of the data in any direction.
Lemma 8.4. Let u be a unit vector,
Var XT u = uT ΣX u.

(114)

Proof.
 2 
T T
− E 2 XT u
 
Var X u = E X u (115)
= E uXXT u − E uT X E XT u
  
(116)
 
= uT E XXT − E (X) E (X)T u

(117)
= uT ΣX u. (118)

Of course, if we only have access to samples of the random vector, we do not know the covari-
ance matrix of the vector. However we can approximate it using the empirical covariance
matrix.
Definition 8.5 (Empirical covariance matrix). The empirical covariance of the vectors
x1 , x2 , . . . , xn in Rm is equal to
n
1X
Σn := (xi − xn ) (xi − xn )T (119)
n i=1
1
= XX T , (120)
n
where xn is the sample mean, as defined in Definition 1.3 of Lecture Notes 4, and X is the
matrix containing the centered data as defined in (103).

If we assume that the mean of the data is zero (i.e. that the data have been centered using
the true mean), then the empirical covariance is an unbiased estimator of the true covariance
matrix:
n
! n
1X 1X
Xi XTi = E Xi XTi

E (121)
n i=1 n i=1
= ΣX . (122)

19
n=5 n = 20 n = 100

True covariance
Empirical covariance

Figure 4: Principal components of n data vectors samples from a 2D Gaussian distribution. The
eigenvectors of the covariance matrix of the distribution are also shown.


If the higher moments of the data E Xi2 Xj2 and E (Xi4 ) are finite, by Chebyshev’s inequality
the entries of the empirical covariance matrix converge to the entries of the true covariance
matrix. This means that in the limit

Var XT u = uT ΣX u

(123)
1
≈ uT XX T u (124)
n
1 2
= X T u 2 (125)
n
for any unit-norm vector u. In the limit the principal components correspond to the directions
of maximum variance of the underlying random vector. These directions also correspond to
the eigenvectors of the true covariance matrix by Theorem 6.2. Figure 4 illustrates how the
principal components converge to the eigenvectors of Σ.

20
A Proofs

A.1 Proof of Theorem 6.2

The eigenvectors are an orthonormal basis (they are mutually orthogonal and we assume that
they have been normalized), so we can represent any unit-norm vector hk that is orthogonal
to u1 , . . . , uk−1 as
m
X
hk = αi ui (126)
i=k

where
m
X
||hk ||22 = αi2 = 1, (127)
i=k

by Lemma 1.5. Note that h1 is just an arbitrary unit-norm vector.


Now we will show that the value of the quadratic form when the normalized input is restricted
to be orthogonal to u1 , . . . , uk−1 cannot be larger than λk ,
n m
!2
X X
hTk Ahk = λi αj uTi uj by (92) and (126) (128)
i=1 j=k
Xn
= λi αi2 because u1 , . . . , um is an orthonormal basis (129)
i=1
m
X
≤ λk αi2 because λk ≥ λk+1 ≥ . . . ≥ λm (130)
i=k
= λk , by (127). (131)
This establishes (93) and (95). To prove (108) and (110) we just need to show that uk
achieves the maximum
n
T
X 2
uk Auk = λi uTi uk (132)
i=1
= λk . (133)

A.2 Proof of Theorem 7.2

It is sufficient to prove
dim (row (A)) ≤ dim (row (A)) (134)

21
for an arbitrary matrix A. We can apply the result to AT to establish dim (row (A)) ≥
dim (row (A)), since row (A) = row (A)T and row (A) = row (A)T .
To prove (134) let r := dim (row (A)) and let x1 , . . . , xr ∈ Rn be a basis for row (A). Consider
the vectors Ax1 , . . . , Axr ∈ Rn . They belong to row (A) by (43), so if they are linearly
independent then dim (row (A)) must be at least r. We will prove that this is the case by
contradiction.
Assume that Ax1 , . . . , Axr are linearly dependent. Then there exist coefficients α1 , . . . , αr ∈
R such that
r r
!
X X
0= αi Axi = A α i xi (by linearity of the matrix product), (135)
i=1 i=1
Pr
This implies that i=1 αi xi is orthogonal to every row of A and hence to every vector in
row (A). However
Pr it is in the span of a basis of row (A) by construction! This is only
possible if i=1 αi xi = 0, which is a contradiction because x1 , . . . , xr are assumed to be
linearly independent.

A.3 Proof of Theorem 8.3

We prove the result by induction. The base case, k = 1, follows immediately from (108).
To complete the proof we need to show that if the results is true for k − 1 ≥ 1 (this is the
induction hypothesis) then it also holds for k.
Let S be an arbitrary subspace of dimension k. We choose an orthonormal basis for the
subspace b1 , b2 , . . . , bk such that bk is orthogonal to u1 , u2 , . . . , uk1 . We can do this by using
any vector that is linearly independent of u1 , u2 , . . . , uk1 and subtracting its projection onto
the span of u1 , u2 , . . . , uk1 (if the result is always zero then S is in the span and consequently
cannot have dimension k).
By the induction hypothesis,
n 2 Xk−1
X T 2
Pspan(u1 ,u2 ,...,uk ) xi = X ui by (6) (136)

1 2 2
i=1 i=1
Xn 2
≤ Pspan(b1 ,b2 ,...,bk ) xi (137)

1 2
i=1
k−1
X T 2
= X bi
2
by (6). (138)
i=1

22
By (110)
n
Pspan(u ) xi 2 = X T uk 2
X
k 2 2
(139)
i=1
n
Pspan(b ) xi 2
X
≤ k 2
(140)
i=1
2
= X T bk 2 . (141)

Combining (138) and (138) we conclude


n k
T 2
Pspan(u1 ,u2 ,...,u ) xi 2 =
X X
X ui (142)
k 2 2
i=1 i=1
k
X T 2
≤ X bi
2
(143)
i=1
Xn
≤ ||PS xi ||22 . (144)
i=1

23
DS-GA 1002 Lecture notes 10 November 23, 2015

Linear models
1 Linear functions
A linear model encodes the assumption that two quantities are linearly related. Mathemati-
cally, this is characterized using linear functions. A linear function is a function such that a
linear combination of inputs is mapped to the same linear combination of the corresponding
outputs.

Definition 1.1 (Linear function). A linear function T : Rn → Rm is a function that maps


vectors in Rn to vectors in Rm such that for any scalar α ∈ R and any vectors x1 , x2 ∈ Rn

T (x1 + x2 ) = T (x1 ) + T (x2 ) , (1)


T (α x1 ) = α T (x1 ) (2)

Multiplication with a matrix of dimensions m × n maps vectors in Rn to vectors in Rm . For


a fixed matrix, this is a linear function. Perhaps surprisingly, the converse is also true: any
linear function between Rn and Rm corresponds to multiplication with a certain matrix. The
proof is in Section A.1 of the appendix.

Theorem 1.2 (Equivalence between matrices and linear functions). For finite m, n every
linear function T : Rn → Rm can be represented by a matrix T ∈ Rm×n .

This implies that in order to analyze linear models in finite-dimensional spaces we can restrict
our attention to matrices.

1.1 Range and null space

The range of a matrix A ∈ Rm×n is the set of all possible vectors in Rm that we can reach
by applying the matrix to a vector in Rn .

Definition 1.3 (Range). Let A ∈ Rm×n ,

range (A) := {y | y = Ax for some x ∈ Rn } . (3)

This set is a subspace of Rm .


As we saw in the previous lecture notes (equation (43)), the product of a matrix and a vector
is a linear combination of the columns of the matrix, which implies that for any matrix A

range (A) = col (A) . (4)

In words, the range is spanned by the columns of the matrix.


The null space of a function is the set of vectors that are mapped to zero by the function.
If we interpret Ax as data related linearly to x, then the null space corresponds to the set
of vectors that are invisible under the measurement operator.

Definition 1.4 (Null space). The null space of A ∈ Rm×n contains the vectors in Rn that
A maps to the zero vector.

null (A) := {x | Ax = 0} . (5)

This set is a subspace of Rn .

The following lemma shows that the null space is the orthogonal complement of the row
space of the matrix.

Lemma 1.5. For any matrix A ∈ Rm×n

null (A) = row (A)⊥ . (6)

The lemma, proved in Section A.2 of the appendix, implies that the matrix is invertible if
we restrict the inputs to be in the row space of the matrix.

Corollary 1.6. Any matrix A ∈ Rm×n is invertible when acting on its row space. For any
two nonzero vectors x1 6= x2 in the row space of A

Ax1 6= Ax2 . (7)

Proof. Assume that for two different nonzero vectors x1 and x2 in the row space of A Ax1 =
Ax2 . Then x1 − x2 is a nonzero vector in the null space of A. By Lemma 1.5 this implies that
x1 − x2 is orthogonal to the row space of A and consequently to itself, so that x1 = x2 .

This means that for every matrix A ∈ Rm×n we can decompose any vector in Rn into two
components: one is in the row space and is mapped to a nonzero vector in Rm that is unique
in the sense that no other vector in row (A) is mapped to it, the other is in the null space
and is mapped to the zero vector.

2
1.2 Interpretation using the SVD

Recall that the left singular vectors of a matrix A that correspond to nonzero singular values
are a basis of the column space of A. It follows that they are also a basis for the range.
The right singular vectors corresponding to nonzero singular values are a basis of the row
space. As a result any orthonormal set of vectors that forms a basis of Rn together with
these singular vectors is a basis of the null space of the matrix. We can therefore write any
matrix A such that m ≥ n as
 
σ1 0 · · · 0 0 · · · 0
 0 σ2 · · · 0 0 · · · 0
 

 · · · 

A = [ u1 u2 · · · ur ur+1 · · · un ]  0 0 · · · σr 0 · · · 0

 [v|1 v2 {z· · · v}r v|r+1 {z· · · vn ]T .
| {z }
Basis of range(A)
 0 0 · · · 0 0 · · · 0 Basis of row(A) }
  Basis of null(A)
 ··· 
0 0 ··· 0 0 ··· 0

Note that the vectors ur+1 , . . . , un are a subset of an orthonormal basis of the orthogonal
complement of the range, which has dimension m − r.
The SVD provides a very intuitive characterization of the mapping between x ∈ Rn and
Ax ∈ Rm for any matrix A ∈ Rm×n with rank r,
r
X
σi viT x ui .

Ax = (8)
i=1

The linear function can be decomposed into four simple steps:

1. Compute the projection of x onto the right singular vectors of A: v1T x, v2T x, . . . , vrT x.

2. Scale the projections using the corresponding singular value: σ1 v1T x, σ2 v2T x, . . . , σr vrT x.

3. Multiply each scaled projection with the corresponding left singular vector ui .

4. Sum all the scaled left singular vectors.

1.3 Systems of equations

In a linear model we assume that the data y ∈ Rm can be represented as the result of
applying a linear function or matrix A ∈ Rm×n to an unknown vector x ∈ Rn ,

Ax = y. (9)

3
The aim is to determine x from the measurements. Depending on the structure of A and y
this may or may not be possible.
If we expand the matrix-vector product, the linear model is equivalent to a system of linear
equations

A11 x[1] + A12 x[2]+ . . . + A1n x[n] = y[1] (10)


A21 x[1] + A22 x[2]+ . . . + A2n x[n] = y[2] (11)
··· (12)
Am1 x[1] + Am2 x[2]+ . . . + Amn x[n] = y[m]. (13)

If the number of equations m is greater than the number of unknowns n the system is said
to be overdetermined. If there are more unknowns than equation n > m then the system
is underdetermined.
Recall that range (A) is the set of vectors that can be reached by applying A. If y does not
belong to this set, then the system cannot have a solution.

Lemma 1.7. The system y = Ax has one or multiple solutions if and only if y ∈ range (A).

Proof. If y = Ax has a solution then y ∈ range (A) by (4). If y ∈ range (A) then there is a
linear combination of the columns of A that yield y by (4) so the system has at least one
solution.

If the null space of the matrix has dimension greater than 0, then the system cannot have a
unique solution.

Lemma 1.8. If dim (null (A)) > 0, then if Ax = y has a solution, the system has an infinite
number of solutions.

Proof. The null space has at least dimension one, so it contains an infinite number of vectors
h such that for any solution x for which y = Ax x + h is also a solution.

In the critical case m = n, linear systems may have a unique solution if the matrix is full
rank, i.e. if all its rows (and its columns) are linearly independent. This means that the
data in the linear model completely specify the unknown vector of interest, which can be
recovered by inverting the matrix.

Lemma 1.9. For any square matrix A ∈ Rn×n , the following statements are equivalent.

1. null (A) = {0}.

2. A is full rank.

4
3. A is invertible.
4. The system Ax = y has a unique solution for every vector y ∈ Rn .

Proof. We prove that the statements imply each other in the order (1) ⇒ (2) ⇒ (3) ⇒
(4) ⇒ (1).
(1) ⇒ (2): If dim (null (A)) = 0, by Lemma 1.5 the row space of A is the orthogonal
complement of {0}, so it is equal to Rn and therefore the rows are all linearly independent.
(2) ⇒ (3): If A is full rank, its rows span all of Rn so range (A) = Rn and A is invertible by
Corollary 1.6.
(3) ⇒ (4) If A is invertible there is a unique solution to Ax = y which is A−1 y. If the
solution is not unique then Ax1 = Ax2 = y for some x1 6= x2 so that 0 and x1 − x2 have the
same image and A is not invertible.
(4) ⇒ (1) If Ax = y has a unique solution for 0, then Ax = 0 implies x = 0.

Recall that the inverse of a product of invertible matrices is equal to the product of the
inverses,

(AB)−1 = B −1 A−1 , (14)

and that the inverse of the transpose of a matrix is the transpose of the inverse,
−1 T
AT = A−1 . (15)

Using these facts, the inverse of a matrix A can be written in terms of its singular vectors
and singular values,
−1
A−1 = U SV T (16)
−1
= VT S −1 U −1

(17)
1 
σ
0 ··· 0
 01 1 · · · 0 
σ2  T
=V  U (18)

 ··· 
0 0 · · · σ1n
n
X 1
= vi uTi . (19)
σ
i=1 i

Note that if one of the singular values of a matrix is very small, the corresponding term
in (19) becomes very large. As a result, the solution to the system of equations becomes
very susceptible to noise in the data. In order to quantify the stability of the solution of a
system of equations we use the condition number of the corresponding matrix.

5
Definition 1.10 (Condition number). The condition number of a matrix is the ratio between
its largest and its smallest singular values,

cond (A) = σ_max / σ_min.    (20)

If the condition number is very large, then perturbations in the data may be dramatically
amplified in the corresponding solution. This is illustrated by the following example.

Example 1.11 (Ill-conditioned system). The matrix

[ 1.001   1 ]
[   1     1 ]                                                 (21)

has a condition number equal to 401. Compare the solutions to the corresponding system of
equations for two very similar vectors

[ 1     1     ]^{-1} [  1  ]   [ 1 ]
[ 1   1.001   ]      [  1  ] = [ 0 ],                         (22)

[ 1     1     ]^{-1} [ 1.1 ]   [  101 ]
[ 1   1.001   ]      [  1  ] = [ −100 ].                      (23)
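The amplification is easy to reproduce. The snippet below (NumPy) solves the two systems in (22) and (23) and prints the condition number of the matrix.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])

print(np.linalg.cond(A))    # ratio between the largest and smallest singular values

x1 = np.linalg.solve(A, np.array([1.0, 1.0]))   # solution for y = (1, 1)
x2 = np.linalg.solve(A, np.array([1.1, 1.0]))   # solution for the perturbed y = (1.1, 1)

print(x1, x2)   # a small change in y produces a completely different solution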

2 Least squares
Just like a system for which m = n, an overdetermined system will have a solution as long
as y ∈ range (A). However, if m > n then range (A) is a low-dimensional subspace of Rm .
This means that even a small perturbation in a random direction is bound to kick y out
of the subspace. As a result, in most cases overdetermined systems do not have a solution.
However, we may still compute the point in range (A) that is closest to the data y. If we
measure distance using the `2 norm, then this is denoted by the method of least squares.

Definition 2.1 (Least squares). The method of least squares consists of estimating x by
solving the optimization problem

min_{x ∈ R^n} ||y − Ax||_2    (24)

Tall matrices (with more rows than columns) are said to be full rank if all their columns are
linearly independent. If A is full rank, the least-squares problem has a closed-form solution,
given by the following theorem.

Theorem 2.2 (Least-squares solution). If A ∈ R^{m×n} is full rank and m ≥ n, the solution to
the least-squares problem (24) is equal to

x_LS = V S^{-1} U^T y          (25)
     = (A^T A)^{-1} A^T y.     (26)

Proof. The problem (24) is equivalent to

min_{z ∈ range(A)} ||y − z||_2    (27)

since every x ∈ R^n corresponds to a unique z ∈ range (A) (we are assuming that the matrix
is full rank, so the null space only contains the zero vector). By Theorem 1.7 in Lecture
Notes 9, the solution to Problem (27) is

P_{range(A)} y = Σ_{i=1}^{n} (u_i^T y) u_i    (28)
               = U U^T y,                     (29)

where A = U S V^T is the SVD of A, so the columns of U ∈ R^{m×n} are an orthonormal basis
for the range of A. Now, to find the solution we need to find the unique x_LS such that

A x_LS = U S V^T x_LS    (30)
       = U U^T y.        (31)

This directly implies

U^T U S V^T x_LS = U^T U U^T y.    (32)

We have

U^T U = I,    (33)

because the columns of U are orthonormal (note that U U^T ≠ I if m > n!). As a result,

x_LS = (S V^T)^{-1} U^T y          (34)
     = (V^T)^{-1} S^{-1} U^T y     (35)
     = V S^{-1} U^T y,             (36)

where we have used the fact that

V^{-1} = V^T    and    (V^T)^{-1} = V    (37)

because V^T V = V V^T = I (V is an n × n orthogonal matrix).


Finally,

(A^T A)^{-1} A^T = (V S^T U^T U S V^T)^{-1} V S^T U^T                               (38)
                 = (V diag(σ_1^2, σ_2^2, . . . , σ_n^2) V^T)^{-1} V S^T U^T          by (33)
                 = V diag(1/σ_1^2, 1/σ_2^2, . . . , 1/σ_n^2) V^T V S^T U^T           by (37)    (39)
                 = V diag(1/σ_1^2, 1/σ_2^2, . . . , 1/σ_n^2) S U^T                   by (37)    (40)
                 = V S^{-1} U^T                                                      by (37),   (41)

where we have used that S is diagonal, so S^T = S, and that A is full rank, so all the singular
values are nonzero and S is indeed invertible.
The matrix (A^T A)^{-1} A^T is called the pseudoinverse of A. In the square case it reduces to
the inverse of the matrix.
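To make the closed form concrete, here is a short Python sketch (NumPy; the tall matrix and the data are randomly generated purely for illustration) that evaluates (25), the pseudoinverse formula (26), and a library least-squares routine, and checks that the three agree.

import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
A = rng.normal(size=(m, n))      # a tall matrix, full rank with probability one
y = rng.normal(size=m)

# (25): x_LS = V S^{-1} U^T y, using the reduced SVD of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)

# (26): x_LS = (A^T A)^{-1} A^T y, solved as a linear system
x_normal_eq = np.linalg.solve(A.T @ A, A.T @ y)

# Library routine for comparison
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]

assert np.allclose(x_svd, x_normal_eq) and np.allclose(x_svd, x_lstsq)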

2.1 Linear regression

A very important application of least squares is fitting linear regression models. In linear
regression, we assume that a quantity of interest can be expressed as a linear combination
of other observed quantities.
a ≈ Σ_{j=1}^{n} θ_j c_j,    (42)

where a ∈ R is called the response or dependent variable, c1 , c2 , . . . , cn ∈ R are the


covariates or independent variables and θ1 , θ2 , . . . , θn ∈ R are the parameters of the

model. Given m observations of the response and the covariates, we can place the response
in a vector y and the covariates in a matrix X such that each column corresponds to a different
covariate. We can then fit the parameters so that the model approximates the response as
closely as possible in ℓ2 norm. This is achieved by solving a least-squares problem

min_{θ ∈ R^n} ||y − Xθ||_2    (43)

to fit the parameters.
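In practice this amounts to stacking the observed covariates as the columns of X and handing the problem to a least-squares routine. A minimal Python sketch with synthetic data (all values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
m = 100
c1 = rng.normal(size=m)                    # first covariate
c2 = rng.uniform(size=m)                   # second covariate
y = 2.0 * c1 - 1.0 * c2 + 0.1 * rng.normal(size=m)   # synthetic response

X = np.column_stack([c1, c2])              # each column of X is a covariate
theta = np.linalg.lstsq(X, y, rcond=None)[0]

print(theta)                               # roughly [2, -1] for this synthetic data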


Geometrically, the estimated parameters are those for which Xθ is the projection of the
response onto the subspace spanned by the covariates. Alternatively, linear regression also has a probabilistic interpre-
tation. It corresponds to computing the maximum likelihood estimator for a particular
model.

Lemma 2.3. Let Y and Z be random vectors of dimension n such that

Y = Xθ + Z,    (44)

where X is a deterministic matrix (not a random variable). If Z is an iid Gaussian random
vector with zero mean and unit variance, then the maximum likelihood estimate of θ given a
realization y of Y is the solution to the least-squares problem (43).

Proof. Setting Σ = I in Definition 2.20 of Lecture Notes 3, the likelihood function is

L(θ) = (1 / √((2π)^n)) exp( −(1/2) ||y − Xθ||_2^2 ).    (45)

Maximizing the likelihood yields

θ_ML = arg max_θ L(θ)              (46)
     = arg max_θ log L(θ)          (47)
     = arg min_θ ||y − Xθ||_2.     (48)
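This equivalence can also be checked numerically: for data generated according to (44), the log-likelihood (45) is never larger at a randomly drawn parameter vector than at the least-squares estimate. A rough Python sketch with synthetic data:

import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 2
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -0.5])
y = X @ theta_true + rng.normal(size=m)    # Z is iid standard Gaussian noise

def log_likelihood(theta):
    # Logarithm of (45), dropping the additive constant
    return -0.5 * np.sum((y - X @ theta) ** 2)

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# No randomly chosen parameter vector attains a higher log-likelihood
for _ in range(1000):
    assert log_likelihood(theta_ls) >= log_likelihood(rng.normal(size=n))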

[Figure: six panels plotting temperature (Celsius) against time for the maximum and minimum temperature series, showing the data and the fitted model over the full record (1860–2000) and over the shorter windows 1900–1905 and 1960–1965.]

Figure 1: Data and fitted model described by Example 2.4 for maximum and minimum temperatures.

Example 2.4 (Global warming). In this example we build a model for temperature data
taken at a weather station in Oxford over 150 years.1 The model is of the form
 
y ≈ a + b̃ cos( 2πt/12 + c̃ ) + d t                            (49)
  = a + b cos( 2πt/12 ) + c sin( 2πt/12 ) + d t,               (50)

where t denotes the time in months. The parameter a represents the mean temperature,
b and c account for periodic yearly fluctuations and d is the overall trend. If d is positive
then the model indicates that temperatures are increasing, whereas if it is negative then it
indicates that temperatures are decreasing. To fit these parameters using the data, we build
a matrix A with four columns,

    [ 1   cos(2πt_1/12)   sin(2πt_1/12)   t_1 ]
A = [ 1   cos(2πt_2/12)   sin(2πt_2/12)   t_2 ]                (51)
    [                 · · ·                    ]
    [ 1   cos(2πt_n/12)   sin(2πt_n/12)   t_n ],

compile the temperatures in a vector y and solve a least-squares problem. The results are
shown in Figures 1 and 2. The fitted model indicates that both the maximum and minimum
temperatures are increasing by about 0.8 degrees Celsius (around 1.4 ◦F) per 100 years.

1
The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/
stationdata/oxforddata.txt.
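The design matrix (51) and the fit are straightforward to assemble in Python. In the sketch below the monthly temperatures are assumed to be available in an array temp (the file name and any preprocessing are hypothetical, not part of these notes), and t is measured in months.

import numpy as np

# temp: monthly temperatures, t: time in months (both hypothetical placeholders)
temp = np.loadtxt("oxford_monthly_temperatures.txt")
t = np.arange(len(temp), dtype=float)

A = np.column_stack([
    np.ones_like(t),             # constant term, coefficient a
    np.cos(2 * np.pi * t / 12),  # yearly cosine component, coefficient b
    np.sin(2 * np.pi * t / 12),  # yearly sine component, coefficient c
    t,                           # linear trend, coefficient d
])

a, b, c, d = np.linalg.lstsq(A, temp, rcond=None)[0]
print(100 * 12 * d)              # trend in degrees Celsius per 100 years (t is in months)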

[Figure: maximum and minimum temperature data (1860–2000) with the fitted linear trend; the estimated trends are +0.75 ◦C / 100 years for the maximum and +0.88 ◦C / 100 years for the minimum temperature.]

Figure 2: Temperature trend obtained by fitting the model described by Example 2.4 for maximum and minimum temperatures.

A Proofs

A.1 Proof of Theorem 1.2

The matrix is
 
T := [ T (e_1)   T (e_2)   · · ·   T (e_n) ],    (52)

i.e. the columns of the matrix are the result of applying T to the standard basis of Rn .
Indeed, for any vector x ∈ R^n,

T (x) = T ( Σ_{i=1}^{n} x[i] e_i )                      (53)
      = Σ_{i=1}^{n} x[i] T (e_i)    by (1) and (2)      (54)
      = T x.                                            (55)
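The construction can be checked numerically by assembling the matrix column by column from the images of the standard basis vectors; the linear map in the Python sketch below is made up purely for illustration.

import numpy as np

def T(x):
    # An arbitrary linear map from R^3 to R^3, used only for illustration
    return np.array([2 * x[0] + x[2], x[1] - x[0], 3 * x[2]])

n = 3
basis = np.eye(n)
T_matrix = np.column_stack([T(basis[:, i]) for i in range(n)])   # columns are T(e_i)

x = np.array([1.0, -2.0, 0.5])
assert np.allclose(T(x), T_matrix @ x)   # T(x) = T x, as in (55)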

A.2 Proof of Lemma 1.5

We prove (6) by showing that both sets are subsets of each other.

Any vector x in the row space of A can be written as

x = A^T z,    (56)

for some vector z ∈ Rm . If y ∈ null (A) then

y^T x = y^T A^T z    (57)
      = (A y)^T z    (58)
      = 0.           (59)

So null (A) ⊆ row (A)⊥ .


If x ∈ row (A)⊥ then in particular it is orthogonal to every row of A, so Ax = 0 and
row (A)⊥ ⊆ null (A).
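A quick numerical illustration of this identity (Python, assuming SciPy is available; the matrix is random and purely illustrative): every basis vector of the null space returned by scipy.linalg.null_space is orthogonal to every row of A.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 5))    # a wide matrix, so the null space is nontrivial

N = null_space(A)              # columns form an orthonormal basis of null(A)
assert np.allclose(A @ N, 0)   # every null-space vector is orthogonal to every row of A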

