1 Vector spaces
You are no doubt familiar with vectors in R2 or R3 , i.e.
$$x = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \qquad y = \begin{bmatrix} -1.1 \\ 0 \\ 5 \end{bmatrix}. \qquad (1)$$
From the point of view of algebra, vectors are much more general objects. They are elements
of sets called vector spaces that satisfy the following definition.
Definition 1.1 (Vector space). A vector space consists of a set V and two operations + and
· satisfying the following conditions.

1. For any pair of elements x, y ∈ V the vector sum x + y belongs to V.

2. For any x ∈ V and any scalar α ∈ R, the scalar product α · x belongs to V.

3. There exists a zero vector 0 such that x + 0 = x for any x ∈ V.

4. For any x ∈ V there exists an additive inverse y such that x + y = 0, usually denoted
as −x.

5. The vector sum is commutative and associative, i.e. for all x, y, z ∈ V
$$x + y = y + x, \qquad (x + y) + z = x + (y + z). \qquad (2)$$

6. The scalar product is associative, i.e. for all α, β ∈ R and x ∈ V
$$\alpha \, (\beta \cdot x) = (\alpha \, \beta) \cdot x. \qquad (3)$$

7. Scalar and vector sums are both distributive, i.e. for all α, β ∈ R and x, y ∈ V
$$(\alpha + \beta) \cdot x = \alpha \cdot x + \beta \cdot x, \qquad \alpha \cdot (x + y) = \alpha \cdot x + \alpha \cdot y. \qquad (4)$$
From now on, for ease of notation we will ignore the symbol for the scalar product ·, writing
α · x as α x.
Remark 1.2 (More general definition). We can define vector spaces over an arbitrary field,
instead of R, such as the complex numbers C. We refer to any linear algebra text for more
details.
We can easily check that Rn is a valid vector space together with the usual vector addition
and vector-scalar product. In this case the zero vector is the all-zero vector
$\begin{bmatrix} 0 & 0 & 0 & \cdots \end{bmatrix}^T$.
When thinking about vector spaces it is a good idea to have R2 or R3 in mind to gain
intuition, but it is also important to bear in mind that we can define vector spaces over many
other objects, such as infinite sequences, polynomials, functions and even random variables
as in the following example.
Example 1.3 (The vector space of zero-mean random variables). Zero-mean random variables belonging to the same probability space form a vector space together with the usual
operations for adding random variables together and for multiplying random variables and
scalars. This follows almost automatically from the fact that linear combinations of random
variables are also random variables and from linearity of expectation. You can check for instance that if X and Y are zero-mean random variables, for any scalars α and β the random
variable αX + βY is also a zero-mean random variable. The zero vector of this vector space
is the random variable equal to 0 with probability one.
The definition of vector space guarantees that any linear combination of vectors in a vector
space V, obtained by adding the vectors after multiplying by scalar coefficients, belongs to
V. Given a set of vectors, a natural question to ask is whether they can be expressed as
linear combinations of each other, i.e. if they are linearly dependent or independent.
Definition 1.4 (Linear dependence/independence). A set of m vectors x1 , x2 , . . . , xm is
linearly dependent if there exist m scalar coefficients α1 , α2 , . . . , αm which are not all equal
to zero and such that
$$\sum_{i=1}^m \alpha_i x_i = 0. \qquad (5)$$
Otherwise, the vectors are linearly independent.
Equivalently, at least one vector in a linearly dependent set can be expressed as the linear
combination of the rest, whereas this is not the case for linearly independent sets.
Let us check the equivalence. Equation (5) holds with αj ≠ 0 for some j if and only if
$$x_j = -\frac{1}{\alpha_j} \sum_{i \in \{1,\ldots,m\} \setminus \{j\}} \alpha_i x_i. \qquad (6)$$
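Linear dependence can be checked numerically by stacking the vectors as the columns of a matrix and comparing the rank of the matrix with the number of vectors. The following NumPy sketch (not part of the original notes; the vectors are made up for illustration) does exactly this.

```python
import numpy as np

# Hypothetical vectors in R^3; x3 is constructed as x1 + 2*x2,
# so the set {x1, x2, x3} is linearly dependent.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2

# Stacking the vectors as columns: linear independence is equivalent
# to the matrix having rank equal to the number of vectors.
A = np.column_stack([x1, x2, x3])
dependent = np.linalg.matrix_rank(A) < A.shape[1]

B = np.column_stack([x1, x2])
independent = np.linalg.matrix_rank(B) == B.shape[1]

# The coefficients alpha = (1, 2, -1) satisfy (5):
residual = 1 * x1 + 2 * x2 - 1 * x3
```

Expressing x3 as a combination of x1 and x2 is the equivalence stated in (6).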
We define the span of a set of vectors {x1 , . . . , xm } as the set of all possible linear combinations of the vectors:
$$\operatorname{span}(x_1, \ldots, x_m) := \left\{ y \,\middle|\, y = \sum_{i=1}^m \alpha_i x_i \text{ for some } \alpha_1, \alpha_2, \ldots, \alpha_m \in \mathbb{R} \right\}. \qquad (7)$$
Lemma 1.5 (The span is a subspace). The span of any set of vectors {x1 , . . . , xm } ⊆ V is
a subspace of V.

Proof. The span is a subset of V due to Conditions 1 and 2 in Definition 1.1. We now show
that it is a vector space. Conditions 5, 6 and 7 in Definition 1.1 hold because V is a vector
space. We check Conditions 1, 2, 3 and 4 by noting that for two arbitrary elements of the
span
$$y_1 = \sum_{i=1}^m \alpha_i x_i, \qquad y_2 = \sum_{i=1}^m \beta_i x_i, \qquad \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m \in \mathbb{R}, \qquad (8)$$
any linear combination is again in the span,
$$\gamma_1 y_1 + \gamma_2 y_2 = \sum_{i=1}^m (\gamma_1 \alpha_i + \gamma_2 \beta_i)\, x_i, \qquad (9)$$
which in particular yields the zero vector (γ1 = γ2 = 0) and the additive inverse of y1
(γ1 = −1, γ2 = 0).
When working with a vector space, it is useful to consider the set of vectors with the smallest
cardinality that spans the space. This is called a basis of the vector space.
Definition 1.6 (Basis). A basis of a vector space V is a set of linearly independent vectors {x1 , . . . , xm }
such that
$$V = \operatorname{span}(x_1, \ldots, x_m). \qquad (10)$$
An important property of all bases in a vector space is that they have the same cardinality.
Theorem 1.7. If a vector space V has a basis with finite cardinality then every basis of V
contains the same number of vectors.
This theorem, which is proven in Section A of the appendix, allows us to define the dimension of a vector space.
Definition 1.8 (Dimension). The dimension dim (V) of a vector space V is the cardinality
of any of its bases, or equivalently the smallest number of linearly independent vectors that
span V.
This definition coincides with the usual geometric notion of dimension in R2 and R3 : a
line has dimension 1, whereas a plane has dimension 2 (as long as they contain the origin).
Note that there exist infinite-dimensional vector spaces, such as the continuous real-valued
functions defined on [0, 1] or an iid sequence X1 , X2 , . . ..
The vector space that we use to model a certain problem is usually called the ambient
space and its dimension the ambient dimension. In the case of Rn the ambient dimension
is n.
Lemma 1.9 (Dimension of Rn ). The dimension of Rn is n.
2 Inner product

An inner product on a vector space V is an operation ⟨·, ·⟩ that maps a pair of vectors to a
scalar and satisfies the following conditions.

• It is symmetric, i.e. for any x, y ∈ V
$$\langle x, y \rangle = \langle y, x \rangle. \qquad (12)$$
• It is linear, i.e. for any α ∈ R and any x, y, z ∈ V
$$\langle \alpha\, x, y \rangle = \alpha\, \langle x, y \rangle, \qquad \langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle. \qquad (13)$$
• It is positive semidefinite: ⟨x, x⟩ ≥ 0 for any x ∈ V, and ⟨x, x⟩ = 0 implies x = 0.
Example 2.2 (Dot product). We define the dot product between two vectors x, y ∈ Rn
as
$$x \cdot y := \sum_i x[i]\, y[i], \qquad (14)$$
where x[i] is the ith entry of x. It is easy to check that the dot product is a valid inner
product. Rn endowed with the dot product is usually called a Euclidean space of dimension
n.
Example 2.3 (Covariance as an inner product). The covariance of two zero-mean random
variables X and Y is equal to E(XY). It is a valid inner product in the vector space of
zero-mean random variables. It is obviously symmetric and linearity follows from linearity of
expectation. Finally, E(X²) ≥ 0 because it is the sum or integral of a nonnegative quantity,
and by Chebyshev’s inequality E(X²) = 0 implies that X = 0 with probability one.
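The inner-product properties can be illustrated with empirical averages, which satisfy symmetry, linearity and positive semidefiniteness exactly. The NumPy snippet below (an illustrative sketch, not part of the original notes; the samples are synthetic) checks this for the empirical version of E(UV).

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of two random variables, centered so they are empirically zero-mean.
X = rng.standard_normal(1000)
Y = rng.standard_normal(1000)
X -= X.mean()
Y -= Y.mean()

def inner(U, V):
    """Empirical version of the inner product <U, V> = E(UV)."""
    return float(np.mean(U * V))

a, b = 2.0, -3.0
# Symmetry, linearity and positive semidefiniteness mirror the
# corresponding properties of the expectation.
sym_ok = np.isclose(inner(X, Y), inner(Y, X))
lin_ok = np.isclose(inner(a * X + b * Y, Y), a * inner(X, Y) + b * inner(Y, Y))
psd_ok = inner(X, X) >= 0
```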
• ||x|| = 0 implies that x is the zero vector 0.

A vector space equipped with a norm is called a normed space. Distances in a normed space
can be measured using the norm of the difference between vectors.

Definition 2.5 (Distance). The distance between two vectors x and y in a normed space
with norm ||·|| is
$$d(x, y) := \|x - y\|.$$
Inner-product spaces are normed spaces because we can define a valid norm using the inner
product. The norm induced by an inner product is obtained by taking the square root of
the inner product of the vector with itself,
$$\|x\|_{\langle\cdot,\cdot\rangle} := \sqrt{\langle x, x \rangle}. \qquad (18)$$
The norm induced by an inner product is clearly homogeneous by linearity and symmetry
of the inner product. ||x||⟨·,·⟩ = 0 implies x = 0 because the inner product is positive
semidefinite. We only need to establish that the triangle inequality holds to ensure that the
induced norm is a valid norm.
Theorem 2.6 (Cauchy-Schwarz inequality). For any two vectors x and y in an inner-product
space
$$|\langle x, y \rangle| \le \|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle}. \qquad (19)$$
Assume ||x||⟨·,·⟩ ≠ 0. Then
$$\langle x, y \rangle = -\|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle} \iff y = -\frac{\|y\|_{\langle\cdot,\cdot\rangle}}{\|x\|_{\langle\cdot,\cdot\rangle}}\, x, \qquad (20)$$
$$\langle x, y \rangle = \|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle} \iff y = \frac{\|y\|_{\langle\cdot,\cdot\rangle}}{\|x\|_{\langle\cdot,\cdot\rangle}}\, x. \qquad (21)$$

Proof. If ||x||⟨·,·⟩ = 0 then x = 0 because the inner product is positive semidefinite, which
implies ⟨x, y⟩ = 0 and consequently that (19) holds with equality. The same is true if
||y||⟨·,·⟩ = 0.

Now assume that ||x||⟨·,·⟩ ≠ 0 and ||y||⟨·,·⟩ ≠ 0. By positive semidefiniteness of the inner
product,
$$0 \le \big\|\, \|y\|_{\langle\cdot,\cdot\rangle}\, x + \|x\|_{\langle\cdot,\cdot\rangle}\, y \,\big\|^2 = 2\, \|x\|^2_{\langle\cdot,\cdot\rangle}\, \|y\|^2_{\langle\cdot,\cdot\rangle} + 2\, \|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle}\, \langle x, y \rangle, \qquad (22)$$
$$0 \le \big\|\, \|y\|_{\langle\cdot,\cdot\rangle}\, x - \|x\|_{\langle\cdot,\cdot\rangle}\, y \,\big\|^2 = 2\, \|x\|^2_{\langle\cdot,\cdot\rangle}\, \|y\|^2_{\langle\cdot,\cdot\rangle} - 2\, \|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle}\, \langle x, y \rangle. \qquad (23)$$
These inequalities establish (19).
Let us prove (20) by proving both implications.

(⇒) Assume ⟨x, y⟩ = −||x||⟨·,·⟩ ||y||⟨·,·⟩ . Then (22) equals zero, so ||y||⟨·,·⟩ x = −||x||⟨·,·⟩ y
because the inner product is positive semidefinite.

(⇐) Assume ||y||⟨·,·⟩ x = −||x||⟨·,·⟩ y. Then one can easily check that (22) equals zero, which
implies ⟨x, y⟩ = −||x||⟨·,·⟩ ||y||⟨·,·⟩ .

The proof of (21) is identical (using (23) instead of (22)).
Corollary 2.7. The norm induced by an inner product satisfies the triangle inequality.

Proof.
$$\|x + y\|^2_{\langle\cdot,\cdot\rangle} = \|x\|^2_{\langle\cdot,\cdot\rangle} + \|y\|^2_{\langle\cdot,\cdot\rangle} + 2\, \langle x, y \rangle \qquad (24)$$
$$\le \|x\|^2_{\langle\cdot,\cdot\rangle} + \|y\|^2_{\langle\cdot,\cdot\rangle} + 2\, \|x\|_{\langle\cdot,\cdot\rangle}\, \|y\|_{\langle\cdot,\cdot\rangle} \quad \text{by the Cauchy-Schwarz inequality} \qquad (25)$$
$$= \big( \|x\|_{\langle\cdot,\cdot\rangle} + \|y\|_{\langle\cdot,\cdot\rangle} \big)^2. \qquad (26)$$
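Both inequalities are easy to verify numerically for the dot product and its induced Euclidean norm. The following NumPy check (an illustrative sketch, not part of the original notes; the vectors are drawn at random) also confirms that Cauchy-Schwarz is tight when one vector is a scalar multiple of the other, as in (21).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
norm = np.linalg.norm  # Euclidean norm, induced by the dot product

# Cauchy-Schwarz (19): |<x, y>| <= ||x|| ||y||.
cs_ok = abs(x @ y) <= norm(x) * norm(y) + 1e-12

# Equality holds when y is a nonnegative multiple of x, as in (21).
z = 3.0 * x
cs_tight = np.isclose(x @ z, norm(x) * norm(z))

# Triangle inequality (Corollary 2.7): ||x + y|| <= ||x|| + ||y||.
tri_ok = norm(x + y) <= norm(x) + norm(y) + 1e-12
```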
Example 2.8 (Euclidean norm). The Euclidean or ℓ2 norm is the norm induced by the dot
product in Rn ,
$$\|x\|_2 := \sqrt{x \cdot x} = \sqrt{\sum_{i=1}^n x[i]^2}. \qquad (27)$$
Example 2.9 (The standard deviation as a norm). The standard deviation or root mean
square
$$\sigma_X = \sqrt{E(X^2)} \qquad (28)$$
is the norm induced by the covariance inner product in the vector space of zero-mean random
variables.
3 Orthogonality
An important concept in linear algebra is orthogonality.
Definition 3.1 (Orthogonality). Two vectors x and y are orthogonal if
$$\langle x, y \rangle = 0. \qquad (29)$$
A set of vectors is orthonormal if every vector in the set has unit norm,
$$\langle x, x \rangle = 1, \qquad (30)$$
and any two distinct vectors x and y in the set are orthogonal,
$$\langle x, y \rangle = 0. \qquad (31)$$
Distances between orthogonal vectors measured in terms of the norm induced by the inner
product are easy to compute.

Theorem 3.2 (Pythagorean theorem). If x and y are orthogonal vectors,
$$\|x + y\|^2_{\langle\cdot,\cdot\rangle} = \|x\|^2_{\langle\cdot,\cdot\rangle} + \|y\|^2_{\langle\cdot,\cdot\rangle}.$$
A vector is orthogonal to a subspace as soon as it is orthogonal to a basis of the subspace.

Lemma 3.4. Let S be a subspace with basis b1 , . . . , bn . If
$$\langle x, b_i \rangle = 0, \qquad 1 \le i \le n, \qquad (36)$$
then x is orthogonal to S.
Proof. Any vector v ∈ S can be represented as $v = \sum_{i=1}^n \alpha_i b_i$ for some α1 , . . . , αn ∈ R, so from (36)
$$\langle x, v \rangle = \Big\langle x, \sum_{i=1}^n \alpha_i b_i \Big\rangle = \sum_{i=1}^n \alpha_i \langle x, b_i \rangle = 0. \qquad (37)$$
It is very easy to find the coefficients of a vector in an orthonormal basis: we just need to
compute the dot products with the basis vectors.
Lemma 3.5 (Coefficients in an orthonormal basis). If {u1 , . . . , un } is an orthonormal basis
of a vector space V, then for any vector x ∈ V
$$x = \sum_{i=1}^n \langle u_i, x \rangle\, u_i. \qquad (38)$$
Proof. Since {u1 , . . . , un } is a basis, there exist coefficients α1 , . . . , αn such that
$$x = \sum_{j=1}^n \alpha_j u_j. \qquad (39)$$
Immediately, by orthonormality of the basis,
$$\langle u_i, x \rangle = \Big\langle u_i, \sum_{j=1}^n \alpha_j u_j \Big\rangle = \sum_{j=1}^n \alpha_j \langle u_i, u_j \rangle = \alpha_i. \qquad (40)$$
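Lemma 3.5 translates directly into code: in an orthonormal basis, reconstructing a vector only requires dot products. A minimal NumPy sketch (not from the original notes; the basis is a rotation of the standard basis, chosen for illustration):

```python
import numpy as np

# An orthonormal basis of R^2 obtained by rotating the standard basis.
theta = 0.7
u1 = np.array([np.cos(theta), np.sin(theta)])
u2 = np.array([-np.sin(theta), np.cos(theta)])

x = np.array([3.0, -1.0])

# By Lemma 3.5, the coefficients are the inner products with the basis vectors.
c1, c2 = u1 @ x, u2 @ x
x_reconstructed = c1 * u1 + c2 * u2
recon_ok = np.allclose(x_reconstructed, x)
```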
For any subspace of Rn we can obtain an orthonormal basis by applying the Gram-Schmidt
method to a set of linearly independent vectors spanning the subspace.

Algorithm 3.6 (Gram-Schmidt). Given linearly independent vectors x1 , . . . , xm , for
i = 1, . . . , m:

1. Compute
$$v_i := x_i - \sum_{j=1}^{i-1} \langle u_j, x_i \rangle\, u_j. \qquad (41)$$

2. Set ui := vi / ||vi ||2 .
This implies in particular that we can always assume that a subspace has an orthonormal
basis.

Proof. To see that the Gram-Schmidt method produces an orthonormal basis for the span of
the input vectors, we can check that span (x1 , . . . , xi ) = span (u1 , . . . , ui ) and that u1 , . . . , ui
is a set of orthonormal vectors for every i.
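The method above can be sketched in a few lines of NumPy. This is an illustrative implementation, not part of the original notes; the tolerance parameter is an added practical detail for detecting (near-)dependent inputs.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormal basis for the span of the input vectors (Algorithm 3.6)."""
    basis = []
    for x in vectors:
        # Subtract the projections onto the directions found so far, as in (41).
        v = x - sum((u @ x) * u for u in basis)
        nv = np.linalg.norm(v)
        if nv > tol:  # a (near-)zero v signals linear dependence; skip it
            basis.append(v / nv)
    return basis

vecs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
basis = gram_schmidt(vecs)
U = np.column_stack(basis)

# The output vectors are orthonormal and span the same subspace:
orthonormal_ok = np.allclose(U.T @ U, np.eye(len(basis)))
same_span_ok = all(np.allclose(U @ (U.T @ x), x) for x in vecs)
```

In floating-point practice this "classical" Gram-Schmidt loop can lose orthogonality for ill-conditioned inputs; QR factorization is the numerically robust alternative.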
A Proof of Theorem 1.7
We prove the claim by contradiction. Assume that we have two bases {x1 , . . . , xm } and
{y1 , . . . , yn } such that m < n (or the second set has infinite cardinality). The proof follows
from applying the following lemma m times (setting r = 0, 1, . . . , m − 1) to show that
{y1 , . . . , ym } spans V and hence {y1 , . . . , yn } must be linearly dependent.
Lemma A.1. Under the assumptions of the theorem, if {y1 , y2 , . . . , yr , xr+1 , . . . , xm } spans
V then {y1 , . . . , yr+1 , xr+2 , . . . , xm } also spans V (possibly after rearranging the indices r +
1, . . . , m) for r = 0, 1, . . . , m − 1.
Proof. Since {y1 , y2 , . . . , yr , xr+1 , . . . , xm } spans V, we can express yr+1 as
$$y_{r+1} = \sum_{i=1}^r \beta_i y_i + \sum_{i=r+1}^m \gamma_i x_i, \qquad (42)$$
where at least one of the γj is nonzero, as {y1 , . . . , yn } is linearly independent by assumption.
Without loss of generality (here is where we might need to rearrange the indices) we assume
that γr+1 ≠ 0, so that
$$x_{r+1} = \frac{1}{\gamma_{r+1}} \left( y_{r+1} - \sum_{i=1}^r \beta_i y_i - \sum_{i=r+2}^m \gamma_i x_i \right). \qquad (43)$$
This implies that any vector in the span of {y1 , y2 , . . . , yr , xr+1 , . . . , xm }, i.e. in V, can be represented as a linear combination of vectors in {y1 , . . . , yr+1 , xr+2 , . . . , xm }, which completes
the proof.
DS-GA 1002 Lecture notes 9 November 16, 2015
Linear models
1 Projections
The projection of a vector x onto a subspace S is the vector in S that is closest to x. In
order to define this rigorously, we start by introducing the concept of direct sum. If two
subspaces are disjoint, i.e. their only common point is the origin, then a vector that can be
written as a sum of a vector from each subspace is said to belong to their direct sum.

Definition 1.1 (Direct sum). Let V be a vector space. For any subspaces S1 , S2 ⊆ V such
that
$$S_1 \cap S_2 = \{0\}, \qquad (1)$$
the direct sum is defined as
$$S_1 \oplus S_2 := \{ x \mid x = s_1 + s_2, \; s_1 \in S_1,\, s_2 \in S_2 \}. \qquad (2)$$
Any vector x ∈ S1 ⊕ S2 has a unique representation
$$x = s_1 + s_2, \qquad s_1 \in S_1,\, s_2 \in S_2. \qquad (3)$$
We can now define the projection of a vector x onto a subspace S by separating the vector
into a component that belongs to S and another that belongs to its orthogonal complement.
Definition 1.3 (Orthogonal projection). Let V be a vector space. The orthogonal projection
of a vector x ∈ V onto a subspace S ⊆ V is the vector PS x ∈ S such that x − PS x ∈ S ⊥ .
Theorem 1.4 (Properties of orthogonal projections). Let V be a vector space. Every vector
x ∈ V has a unique orthogonal projection PS x onto any subspace S ⊆ V of finite dimension.
In particular x can be expressed as
$$x = P_S\, x + P_{S^\perp}\, x. \qquad (4)$$
For any vector s ∈ S
$$\langle x, s \rangle = \langle P_S\, x, s \rangle. \qquad (5)$$
For any orthonormal basis b1 , . . . , bm of S,
$$P_S\, x = \sum_{i=1}^m \langle x, b_i \rangle\, b_i. \qquad (6)$$
Proof. Since V has finite dimension, so does S, which consequently has an orthonormal basis
with finite cardinality b′1 , . . . , b′m by Theorem 3.7 in Lecture Notes 8. Consider the vector
$$p := \sum_{i=1}^m \langle x, b'_i \rangle\, b'_i. \qquad (7)$$
Computing the norm of the projection of a vector onto a subspace is easy if we have access
to an orthonormal basis (as long as the norm is induced by the inner product).
Lemma 1.5 (Norm of the projection). For any orthonormal basis b1 , . . . , bd of a subspace
S ⊆ V of dimension d, the norm of the projection of an arbitrary vector x ∈ V onto S can
be written as
$$\|P_S\, x\|_{\langle\cdot,\cdot\rangle} = \sqrt{\sum_{i=1}^d \langle b_i, x \rangle^2}. \qquad (11)$$
Proof. By (6),
$$\|P_S\, x\|^2_{\langle\cdot,\cdot\rangle} = \langle P_S\, x, P_S\, x \rangle \qquad (12)$$
$$= \Big\langle \sum_{i=1}^d \langle b_i, x \rangle\, b_i, \sum_{j=1}^d \langle b_j, x \rangle\, b_j \Big\rangle \qquad (13)$$
$$= \sum_{i=1}^d \sum_{j=1}^d \langle b_i, x \rangle\, \langle b_j, x \rangle\, \langle b_i, b_j \rangle \qquad (14)$$
$$= \sum_{i=1}^d \langle b_i, x \rangle^2. \qquad (15)$$
Finally, we prove that the orthogonal projection of a vector x onto a subspace S is indeed
the vector in S that is closest to x in the distance induced by the inner-product norm.
Theorem 1.7 (The orthogonal projection is closest). The orthogonal projection of a vector
x onto a subspace S belonging to the same inner-product space is the closest vector to x that
belongs to S in terms of the norm induced by the inner product. More formally, PS x is the
solution to the optimization problem
$$\underset{u}{\text{minimize}} \quad \|x - u\|_{\langle\cdot,\cdot\rangle} \qquad (17)$$
$$\text{subject to} \quad u \in S. \qquad (18)$$
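Formula (6) and Theorem 1.7 can be checked numerically: with an orthonormal basis stored as the columns of a matrix B, the projection is B(Bᵀx). The sketch below (illustrative, not from the original notes; the subspace and test points are random) verifies that the residual is orthogonal to the subspace and that no sampled point of S is closer to x.

```python
import numpy as np

rng = np.random.default_rng(2)

# Orthonormal basis of a 2D subspace S of R^4, obtained via a reduced QR
# factorization of a random matrix.
B = np.linalg.qr(rng.standard_normal((4, 2)))[0]

x = rng.standard_normal(4)

# Projection via (6): P_S x = sum_i <x, b_i> b_i, i.e. B (B^T x).
Px = B @ (B.T @ x)

# x - P_S x lies in the orthogonal complement of S (Definition 1.3).
residual_orth = np.allclose(B.T @ (x - Px), 0)

# Theorem 1.7: P_S x is at least as close to x as random points of S.
others = B @ rng.standard_normal((2, 200))
closest = bool(np.all(np.linalg.norm(x - Px)
                      <= np.linalg.norm(x[:, None] - others, axis=0) + 1e-12))
```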
2 Linear minimum-MSE estimation
We are interested in estimating a continuous random variable X from the sample y of a
random variable Y . If we know the joint distribution of X and Y , then the optimal estimator
in terms of MSE is the conditional mean E (X|Y = y). However, it is often very challenging
to completely characterize a joint distribution between two quantities, whereas it is more
tractable to obtain estimates of their first and second order moments. In this case it turns
out that we can obtain the optimal linear estimate of X given Y by using our knowledge of
linear algebra.
Theorem 2.1 (Best linear estimator). Assume that we know the means µX , µY and variances
σX², σY² of two random variables X and Y , as well as their correlation coefficient ρXY . The best
linear estimate of the form aY + b of X given Y in terms of mean-square error is
$$g_{\text{LMMSE}}(y) = \frac{\rho_{XY}\, \sigma_X\, (y - \mu_Y)}{\sigma_Y} + \mu_X. \qquad (22)$$
Since h′′ is positive, the function h is convex, so the minimum is obtained by setting h′ (b) to
zero, which yields
$$b = \mu_X - a\, \mu_Y. \qquad (27)$$
Clearly any a that minimizes the left-hand side also minimizes the right-hand side and vice
versa.
4
Consider the vector space of zero-mean random variables. The centered variables
X̃ := X − µX and Ỹ := Y − µY belong to this vector space. In fact,
$$\langle \tilde{X}, \tilde{Y} \rangle = E(\tilde{X}\tilde{Y}) \qquad (30)$$
$$= \operatorname{Cov}(X, Y) \qquad (31)$$
$$= \sigma_X\, \sigma_Y\, \rho_{XY}, \qquad (32)$$
$$\|\tilde{X}\|^2_{\langle\cdot,\cdot\rangle} = E(\tilde{X}^2) \qquad (33)$$
$$= \sigma_X^2, \qquad (34)$$
$$\|\tilde{Y}\|^2_{\langle\cdot,\cdot\rangle} = E(\tilde{Y}^2) \qquad (35)$$
$$= \sigma_Y^2. \qquad (36)$$
Any random variable of the form aỸ belongs to the subspace spanned by Ỹ . Since the
distance in this vector space is induced by the mean-square norm, by Theorem 1.7 the
vector of the form aỸ that best approximates X̃ is just the projection of X̃ onto the
subspace spanned by Ỹ , which we will denote by PỸ X̃. This subspace has dimension 1, so
{Ỹ /σY } is a basis for it. The projection is consequently equal to
$$P_{\tilde{Y}}\, \tilde{X} = \Big\langle \tilde{X}, \frac{\tilde{Y}}{\sigma_Y} \Big\rangle \frac{\tilde{Y}}{\sigma_Y} \qquad (37)$$
$$= \langle \tilde{X}, \tilde{Y} \rangle \frac{\tilde{Y}}{\sigma_Y^2} \qquad (38)$$
$$= \frac{\sigma_X\, \rho_{XY}\, \tilde{Y}}{\sigma_Y}. \qquad (39)$$
So a = σX ρXY /σY , which concludes the proof.
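The estimator (22) can be exercised on simulated data. In the sketch below (illustrative, not from the original notes; the joint model of X and Y is made up), the moments are estimated empirically, so the resulting linear estimate is exactly the least-squares line for the sample and should beat any other linear competitor in empirical MSE.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Toy correlated pair (an assumed model, for illustration only).
X = rng.standard_normal(n)
Y = 0.8 * X + 0.6 * rng.standard_normal(n)

mu_X, mu_Y = X.mean(), Y.mean()
s_X, s_Y = X.std(), Y.std()
rho = float(np.mean((X - mu_X) * (Y - mu_Y)) / (s_X * s_Y))

def g_lmmse(y):
    """Best linear estimate of X given Y = y, as in (22)."""
    return rho * s_X * (y - mu_Y) / s_Y + mu_X

# With the empirical moments plugged in, (22) minimizes the empirical MSE
# over all linear estimates, so it beats an arbitrary competitor.
mse_lmmse = np.mean((X - g_lmmse(Y)) ** 2)
mse_other = np.mean((X - (0.5 * Y + 0.1)) ** 2)
lmmse_wins = mse_lmmse <= mse_other
```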
3 Matrices
A matrix is a rectangular array of numbers. We denote the vector space of m × n matrices
by Rm×n . We denote the ith row of a matrix A by Ai: , the jth column by A:j and the (i, j)
entry by Aij . The transpose of a matrix is obtained by switching its rows and columns,
$$\big(A^T\big)_{ij} = A_{ji}. \qquad (40)$$
Definition 3.2 (Matrix-vector product). The product of a matrix A ∈ Rm×n and a vector
x ∈ Rn is a vector Ax ∈ Rm such that
$$(Ax)_i = \sum_{j=1}^n A_{ij}\, x[j] \qquad (41)$$
$$= \langle A_{i:}, x \rangle, \qquad (42)$$
i.e. the ith entry of Ax is the dot product between the ith row of A and x.
Equivalently,
$$Ax = \sum_{j=1}^n A_{:j}\, x[j], \qquad (43)$$
i.e. Ax is a linear combination of the columns of A weighted by the entries of x.
One can easily check that the transpose of the product of two matrices A and B is equal to
the transposes multiplied in reverse order,
$$(AB)^T = B^T A^T. \qquad (44)$$
We can now express the dot product between two vectors x and y as
$$\langle x, y \rangle = x^T y = y^T x. \qquad (45)$$
Definition 3.3 (Identity matrix). The identity matrix in Rn×n is
$$I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & 1 \end{bmatrix}. \qquad (46)$$
Definition 3.4 (Matrix multiplication). The product of two matrices A ∈ Rm×n and B ∈
Rn×p is a matrix AB ∈ Rm×p such that
$$(AB)_{ij} = \sum_{k=1}^n A_{ik}\, B_{kj} = \langle A_{i:}, B_{:j} \rangle, \qquad (47)$$
i.e. the (i, j) entry of AB is the dot product between the ith row of A and the jth column of
B.

Equivalently, the jth column of AB is the result of multiplying A and the jth column of B,
$$(AB)_{:j} = A\, B_{:j}, \qquad (48)$$
and the ith row of AB is the result of multiplying the ith row of A and B.
Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the
effect of the matrix on any vector.

Definition 3.5 (Matrix inverse). The inverse of a square matrix A ∈ Rn×n is a matrix
A−1 ∈ Rn×n such that
$$A\, A^{-1} = A^{-1} A = I.$$
Definition 3.7 (Orthogonal matrix). An orthogonal matrix U is a square matrix such that
its inverse is equal to its transpose,
$$U^T U = U U^T = I. \qquad (52)$$
By definition, the columns U:1 , U:2 , . . . , U:n of any orthogonal matrix have unit norm and
are orthogonal to each other, so they form an orthonormal basis (it is somewhat confusing
that orthogonal matrices are not called orthonormal matrices instead). We can interpret
applying U T to a vector x as computing the coefficients of its representation in the basis
formed by the columns of U . Applying U to U T x recovers x by scaling each basis vector
with the corresponding coefficient:
$$x = U U^T x = \sum_{i=1}^n \langle U_{:i}, x \rangle\, U_{:i}. \qquad (53)$$
Applying an orthogonal matrix to a vector does not affect its norm; it just rotates (and
possibly reflects) the vector.

Lemma 3.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix U ∈ Rn×n
and any vector x ∈ Rn ,
$$\|Ux\|_2 = \|x\|_2.$$
4 Eigendecomposition

An eigenvector v ≠ 0 of a matrix A satisfies
$$Av = \lambda v \qquad (58)$$
for a scalar λ, which is the corresponding eigenvalue. Even if A is real, its eigenvectors
and eigenvalues can be complex.
Lemma 4.1 (Eigendecomposition). If a square matrix A ∈ Rn×n has n linearly independent
eigenvectors v1 , . . . , vn with eigenvalues λ1 , . . . , λn , it can be expressed in terms of a matrix
Q, whose columns are the eigenvectors, and a diagonal matrix Λ containing the eigenvalues,
$$A = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}^{-1} \qquad (59)$$
$$= Q\, \Lambda\, Q^{-1}. \qquad (60)$$

Proof.
$$AQ = \begin{bmatrix} Av_1 & Av_2 & \cdots & Av_n \end{bmatrix} \qquad (61)$$
$$= \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix} \qquad (62)$$
$$= Q\, \Lambda. \qquad (63)$$
As we will establish later on, if the columns of a square matrix are all linearly independent,
then the matrix has an inverse, so multiplying the expression by Q−1 on both sides completes
the proof.
which implies that v2 = 0 and hence v1 = 0, since we have assumed that λ ≠ 0. This implies
that the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors.
Suppose we need to compute
$$\underbrace{A\, A \cdots A}_{k}\; x = A^k x, \qquad (66)$$
i.e. we want to apply A to x k times. Ak cannot be computed by taking the power of its
entries (try out a simple example to convince yourself). However, if A has an eigendecomposition,
$$A^k = Q\, \Lambda^k\, Q^{-1},$$
using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to
taking the power of the diagonal entries. This allows us to compute the k matrix products
using just 3 matrix products and taking the power of n numbers.
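Both the decomposition and the cheap computation of powers can be verified numerically. A minimal NumPy sketch (illustrative, not part of the original notes; the matrix is a toy example with eigenvalues 1 and 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix Q whose columns
# are (unit-norm) eigenvectors.
eigvals, Q = np.linalg.eig(A)
Lam = np.diag(eigvals)
Qinv = np.linalg.inv(Q)

# A = Q Lambda Q^{-1}, as in (59)-(60).
decomp_ok = np.allclose(A, Q @ Lam @ Qinv)

# Powers through the decomposition: A^k = Q Lambda^k Q^{-1}.
k = 5
Ak = Q @ np.diag(eigvals ** k) @ Qinv
power_ok = np.allclose(Ak, np.linalg.matrix_power(A, k))
```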
From high-school or undergraduate algebra you probably remember how to compute eigenvectors using determinants. In practice, this is usually not a viable option due to stability
issues. A popular technique to compute eigenvectors is based on the following insight. Let
A ∈ Rn×n be a matrix with eigendecomposition QΛQ−1 and let x be an arbitrary vector in
Rn . Since the columns of Q are linearly independent, they form a basis for Rn , so we can
represent x as
$$x = \sum_{i=1}^n \alpha_i\, Q_{:i}, \qquad \alpha_i \in \mathbb{R}, \; 1 \le i \le n. \qquad (70)$$
Applying A repeatedly then yields
$$A^k x = \sum_{i=1}^n \alpha_i\, \lambda_i^k\, Q_{:i}.$$
If we assume that the eigenvalues are ordered according to their magnitudes, that the
magnitude of the first one is strictly larger than the rest, |λ1 | > |λ2 | ≥ . . ., and that α1 ≠ 0
(which happens with high probability if we draw a random x), then as k grows larger the term
α1 λk1 Q:1 dominates. The term will blow up or tend to zero unless we normalize every time
before applying A. Adding the normalization step to this procedure results in the power
method or power iteration, an algorithm of great importance in numerical linear algebra.
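The power iteration described above fits in a few lines. The sketch below (illustrative, not from the original notes; the Rayleigh-quotient eigenvalue estimate and the test matrix are added details) uses a symmetric toy matrix whose dominant eigenvalue is 3.

```python
import numpy as np

def power_method(A, iters=100, seed=0):
    """Estimate the eigenvector of A with the largest-magnitude eigenvalue."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])  # random start: alpha_1 != 0 almost surely
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)  # normalization keeps the iterate from blowing up or vanishing
    # Rayleigh quotient as an eigenvalue estimate (x has unit norm).
    return float(x @ A @ x), x

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_method(A)

# For this matrix the dominant eigenvalue is 3, eigenvector (1, 1)/sqrt(2).
lam_ok = np.isclose(lam, 3.0)
vec_ok = np.allclose(np.abs(v), np.ones(2) / np.sqrt(2))
```

Convergence is geometric with rate |λ2|/|λ1|, which is why the eigenvalue gap matters in practice.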
Figure 1: Illustration of the first three iterations x1 , x2 , x3 of the power method for a matrix
with eigenvectors v1 and v2 , whose corresponding eigenvalues are λ1 = 1.05 and λ2 = 0.1661.
Figure 1 illustrates the power method on a simple example, where the matrix, which was
just drawn at random, is equal to
$$A = \begin{bmatrix} 0.930 & 0.388 \\ 0.237 & 0.286 \end{bmatrix}. \qquad (74)$$
The convergence to the eigenvector corresponding to the eigenvalue with the largest magni-
tude is very fast.
A Markov chain is a sequence of random variables X0 , X1 , X2 , . . . that satisfies the Markov
property: conditioned on the present, the future does not depend on the past,
$$p_{X_{k+1} \mid X_0, X_1, \ldots, X_k}(x_{k+1} \mid x_0, x_1, \ldots, x_k) = p_{X_{k+1} \mid X_k}(x_{k+1} \mid x_k). \qquad (75)$$
If the random variables take values in a finite set {α1 , . . . , αn }, we can represent the pmf of
Xk as a vector,
$$\pi_k = \begin{bmatrix} p_{X_k}(\alpha_1) \\ p_{X_k}(\alpha_2) \\ \vdots \\ p_{X_k}(\alpha_n) \end{bmatrix}. \qquad (77)$$
By the Chain Rule, the pmf of Xk can be computed from the pmf of X0 using the transition
matrix,
$$\pi_k = \underbrace{P\, P \cdots P}_{k}\; \pi_0 = P^k \pi_0. \qquad (78)$$
In some cases, no matter how we initialize the Markov chain, it forgets its initial state
and converges to a stationary distribution. This is exploited in Markov-chain Monte Carlo
methods, which allow us to sample from arbitrary distributions by building the corresponding
Markov chain. These methods are very useful in Bayesian statistics.

A Markov chain that converges to a stationary distribution π∞ is said to be ergodic. Note
that necessarily P π∞ = π∞ , so that π∞ is an eigenvector of the transition matrix with a
corresponding eigenvalue equal to one.
Conversely, let the transition matrix P of a Markov chain have a valid eigendecomposition
with n linearly independent eigenvectors v1 , v2 , . . . and corresponding eigenvalues
λ1 > λ2 ≥ λ3 ≥ . . .. If the eigenvector corresponding to the largest eigenvalue has nonnegative
entries, then
$$\pi_\infty := \frac{v_1}{\sum_{i=1}^n v_1(i)} \qquad (79)$$
is a valid stationary distribution. Writing the initial pmf in the basis of eigenvectors,
π0 = Σni=1 αi vi , we have
$$\pi_k = P^k \pi_0 \qquad (82)$$
$$= \sum_{i=1}^n \alpha_i\, P^k v_i \qquad (83)$$
$$= \alpha_1\, \lambda_1^k\, v_1 + \sum_{i=2}^n \alpha_i\, \lambda_i^k\, v_i. \qquad (84)$$
Since λ1 = 1 and the rest of the eigenvalues are strictly smaller than one in magnitude,
$$\lim_{k \to \infty} \pi_k = \alpha_1 v_1 = \pi_\infty, \qquad (85)$$
where the last equality follows from the fact that the πk all belong to the closed set
{π | Σni=1 π(i) = 1}, so the limit also belongs to the set and hence is a valid pmf. We refer
the interested reader to more advanced texts treating Markov chains for further details.
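The eigenvector characterization (79) and the convergence of π_k can be checked on a small chain. In this NumPy sketch (illustrative, not from the original notes; the transition matrix is a made-up 3-state example, stored column-stochastic to match π_k = P π_{k−1}):

```python
import numpy as np

# Column-stochastic transition matrix of a toy 3-state chain:
# each column sums to one.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

# Stationary distribution: the eigenvector for eigenvalue 1,
# normalized to sum to one as in (79).
eigvals, V = np.linalg.eig(P)
i = int(np.argmin(np.abs(eigvals - 1)))
pi_inf = np.real(V[:, i])
pi_inf /= pi_inf.sum()

# Iterating the chain from a deterministic start converges to pi_inf.
pi0 = np.array([1.0, 0.0, 0.0])
pi_k = np.linalg.matrix_power(P, 50) @ pi0
converged = np.allclose(pi_k, pi_inf)
```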
Proof. Since A = AT , for two eigenvectors ui , uj with distinct eigenvalues λi ≠ λj (and
λi ≠ 0),
$$u_i^T u_j = \frac{1}{\lambda_i} (A u_i)^T u_j \qquad (87)$$
$$= \frac{1}{\lambda_i}\, u_i^T A^T u_j \qquad (88)$$
$$= \frac{1}{\lambda_i}\, u_i^T A\, u_j \qquad (89)$$
$$= \frac{\lambda_j}{\lambda_i}\, u_i^T u_j. \qquad (90)$$
Since λi ≠ λj , this is only possible if uTi uj = 0.
It turns out that every n×n symmetric matrix has n linearly independent eigenvectors. The
proof of this is beyond the scope of these notes. An important consequence is that all
symmetric matrices have an eigendecomposition of the form
$$A = U D U^T, \qquad (91)$$
where $U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}$ is an orthogonal matrix.
The eigenvalues of a symmetric matrix λ1 , λ2 , . . . , λn can be positive, negative or zero. They
determine the value of the quadratic form
$$q(x) := x^T A\, x = \sum_{i=1}^n \lambda_i \left( x^T u_i \right)^2. \qquad (92)$$
If we order the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn , then the first eigenvalue is the maximum
value attained by the quadratic form if its input has unit ℓ2 norm, the second eigenvalue is
the maximum value attained if we restrict the argument to be normalized and orthogonal
to the first eigenvector, and so on.
Definition 7.1 (Row and column space). The row space row (A) of a matrix A is the span
of its rows. The column space col (A) is the span of its columns.
It turns out that the row space and the column space of any matrix have the same dimension.
We name this quantity the rank of the matrix.
Theorem 7.3. Without loss of generality let m ≤ n. Every rank-r real matrix A ∈ Rm×n
has a singular-value decomposition (SVD) of the form
$$A = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \sigma_m \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_m^T \end{bmatrix} \qquad (98)$$
$$= U S V^T, \qquad (99)$$
where the singular values are ordered, σ1 ≥ σ2 ≥ . . . ≥ σm ≥ 0, and r is the number
of nonzero singular values. The first r left singular vectors u1 , u2 , . . . , ur ∈ Rm form an
orthonormal basis of the column space of A and the first r right singular vectors
v1 , v2 , . . . , vr ∈ Rn form an orthonormal basis of the row space of A. Therefore the rank of
the matrix is equal to r.
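These claims can be verified numerically: counting nonzero singular values recovers the rank, and projecting onto the leading singular vectors reproduces the matrix. A NumPy sketch (illustrative, not from the original notes; the rank-2 matrix is constructed by hand):

```python
import numpy as np

# A 3x4 matrix of rank 2: the third row is row1 + 2 * row2.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 2.0, 4.0, 7.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))          # number of nonzero singular values

rank_ok = (r == 2) and (np.linalg.matrix_rank(A) == 2)

# The first r left singular vectors span col(A): projecting the columns
# of A onto them reproduces A exactly.
Ur = U[:, :r]
colspace_ok = np.allclose(Ur @ (Ur.T @ A), A)

# Likewise, the first r right singular vectors span row(A).
Vr = Vt[:r, :].T
rowspace_ok = np.allclose(A @ (Vr @ Vr.T), A)
```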
Algorithm 8.1 (Principal component analysis).
Input: n data vectors x̃1 , x̃2 , . . . , x̃n ∈ Rm , a number k ≤ min {m, n}.
Output: The first k principal components, a set of orthonormal vectors of dimension m.

1. Center the data,
$$x_i := \tilde{x}_i - \frac{1}{n} \sum_{j=1}^n \tilde{x}_j, \qquad 1 \le i \le n,$$
and group the centered vectors as the columns of a matrix X.

2. Compute the SVD of X and extract the left singular vectors corresponding to the k
largest singular values. These are the first k principal components.
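The algorithm above translates directly into NumPy. This is an illustrative sketch, not the notes' reference implementation; the synthetic data cloud is made up so that the dominant direction is known in advance.

```python
import numpy as np

def pca(data, k):
    """First k principal components of the rows of `data` (Algorithm 8.1 sketch)."""
    # Step 1: center the data and arrange the centered vectors as columns of X.
    X = (data - data.mean(axis=0)).T
    # Step 2: left singular vectors for the k largest singular values.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k]

rng = np.random.default_rng(4)
# Synthetic 2D cloud stretched along the direction (1, 1) (toy data).
data = rng.standard_normal((500, 2)) @ np.array([[3.0, 3.0],
                                                 [0.5, -0.5]])

components, svals = pca(data, k=2)

# The first principal component should align with (1, 1)/sqrt(2),
# up to sign and sampling noise.
aligned = np.allclose(np.abs(components[:, 0]), np.ones(2) / np.sqrt(2), atol=0.1)
energy_ordered = svals[0] >= svals[1]
```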
Once the data are centered, the energy of the projection of the data points onto different
directions in the ambient space reflects the variation of the dataset along those directions.
PCA selects the directions that maximize the ℓ2 norm of the projection and are mutually
orthogonal. By Lemma 1.5, the sum of the squared ℓ2 norms of the projections of the centered
data x1 , x2 , . . . , xn onto a 1D subspace spanned by a unit-norm vector u can be expressed as
$$\sum_{i=1}^n \big\| P_{\operatorname{span}(u)}\, x_i \big\|_2^2 = \sum_{i=1}^n u^T x_i\, x_i^T u \qquad (104)$$
$$= u^T X X^T u \qquad (105)$$
$$= \big\| X^T u \big\|_2^2. \qquad (106)$$
Figure 2: PCA of a dataset with n = 100 2D vectors with different configurations. The two
first singular values reflect how much energy is preserved by projecting onto the two first
principal components. [Panel annotations, left to right: σ1/√n = 0.705, σ2/√n = 0.690;
σ1/√n = 0.9832, σ2/√n = 0.3559; σ1/√n = 1.3490, σ2/√n = 0.1438.]
Theorem 8.2. For any matrix X ∈ Rm×n , where n > m, with left singular vectors u1 , u2 , . . . , um
corresponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm ,
$$\sigma_1 = \max_{\|u\|_2 = 1} \big\| X^T u \big\|_2, \qquad (107)$$
$$u_1 = \arg\max_{\|u\|_2 = 1} \big\| X^T u \big\|_2, \qquad (108)$$
$$\sigma_k = \max_{\substack{\|u\|_2 = 1 \\ u \perp u_1, \ldots, u_{k-1}}} \big\| X^T u \big\|_2, \qquad 2 \le k \le r, \qquad (109)$$
$$u_k = \arg\max_{\substack{\|u\|_2 = 1 \\ u \perp u_1, \ldots, u_{k-1}}} \big\| X^T u \big\|_2, \qquad 2 \le k \le r. \qquad (110)$$

Proof. By the SVD,
$$X X^T = U S V^T V S U^T = U S^2 U^T, \qquad (111)$$
where V T V = I because n > m and the matrix has m nonzero singular values. S 2 is a
diagonal matrix containing σ1² ≥ σ2² ≥ . . . ≥ σm² in its diagonal.

The result now follows from applying Theorem 6.2 to the quadratic form
$$u^T X X^T u = \big\| X^T u \big\|_2^2. \qquad (112)$$
This result shows that PCA is equivalent to choosing the best (in terms of `2 norm) k 1D
subspaces following a greedy procedure, since at each step we choose the best 1D subspace
[Figure 3. Panel annotations: σ1/√n = 5.077, σ2/√n = 0.889; σ1/√n = 1.261,
σ2/√n = 0.139.]
orthogonal to the previous ones. A natural question to ask is whether this method produces
the best k-dimensional subspace. A priori this is not necessarily the case; many greedy
algorithms produce suboptimal results. However, in this case the greedy procedure is indeed
optimal: the subspace spanned by the first k principal components is the best subspace we
can choose in terms of the `2 -norm of the projections. The theorem is proved in Section A.3
of the appendix.
Theorem 8.3. For any matrix X ∈ Rm×n with left singular vectors u1 , u2 , . . . , um corresponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm , and for any subspace S of
dimension k ≤ m,
$$\sum_{i=1}^n \big\| P_{\operatorname{span}(u_1, u_2, \ldots, u_k)}\, x_i \big\|_2^2 \ge \sum_{i=1}^n \| P_S\, x_i \|_2^2, \qquad (113)$$
where x1 , . . . , xn are the columns of X.
Figure 3 illustrates the importance of centering before applying PCA. Theorems 8.2 and 8.3
still hold if the data are not centered. However, the norm of the projection onto a certain
direction no longer reflects the variation of the data. In fact, if the data are concentrated
around a point that is far from the origin, the first principal component will tend to be aligned
in that direction. This makes sense as projecting onto that direction captures more energy.
As a result, the principal components do not capture the directions of maximum variation
within the cloud of data.
8.2 PCA: Probabilistic interpretation
Proof.
$$\operatorname{Var}\big( X^T u \big) = E\Big( \big( X^T u \big)^2 \Big) - E^2\big( X^T u \big) \qquad (115)$$
$$= E\big( u^T X X^T u \big) - E\big( u^T X \big)\, E\big( X^T u \big) \qquad (116)$$
$$= u^T \Big( E\big( X X^T \big) - E(X)\, E(X)^T \Big)\, u \qquad (117)$$
$$= u^T \Sigma_X\, u. \qquad (118)$$
Of course, if we only have access to samples of the random vector, we do not know the
covariance matrix of the vector. However, we can approximate it using the empirical
covariance matrix.
Definition 8.5 (Empirical covariance matrix). The empirical covariance of the vectors
x1 , x2 , . . . , xn in Rm is equal to
$$\Sigma_n := \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x}_n)(x_i - \overline{x}_n)^T \qquad (119)$$
$$= \frac{1}{n} X X^T, \qquad (120)$$
where xn is the sample mean, as defined in Definition 1.3 of Lecture Notes 4, and X is the
matrix containing the centered data as defined in (103).
If we assume that the mean of the data is zero (i.e. that the data have been centered using
the true mean), then the empirical covariance is an unbiased estimator of the true covariance
matrix:
$$E\left( \frac{1}{n} \sum_{i=1}^n X_i X_i^T \right) = \frac{1}{n} \sum_{i=1}^n E\big( X_i X_i^T \big) \qquad (121)$$
$$= \Sigma_X. \qquad (122)$$
Figure 4: Principal components of n data vectors sampled from a 2D Gaussian distribution,
for n = 5, n = 20 and n = 100 (empirical covariance). The eigenvectors of the covariance
matrix of the distribution (true covariance) are also shown.
If the higher moments of the data E(Xi² Xj²) and E(Xi⁴) are finite, by Chebyshev’s inequality
the entries of the empirical covariance matrix converge to the entries of the true covariance
matrix. This means that in the limit
$$\operatorname{Var}\big( X^T u \big) = u^T \Sigma_X\, u \qquad (123)$$
$$\approx \frac{1}{n}\, u^T X X^T u \qquad (124)$$
$$= \frac{1}{n} \big\| X^T u \big\|_2^2 \qquad (125)$$
for any unit-norm vector u. In the limit the principal components correspond to the directions
of maximum variance of the underlying random vector. These directions also correspond to
the eigenvectors of the true covariance matrix by Theorem 6.2. Figure 4 illustrates how the
principal components converge to the eigenvectors of ΣX .
A Proofs
The eigenvectors are an orthonormal basis (they are mutually orthogonal and we assume that
they have been normalized), so we can represent any unit-norm vector hk that is orthogonal
to u1 , . . . , uk−1 as
$$h_k = \sum_{i=k}^m \alpha_i\, u_i, \qquad (126)$$
where
$$\|h_k\|_2^2 = \sum_{i=k}^m \alpha_i^2 = 1. \qquad (127)$$
It is sufficient to prove
$$\dim(\operatorname{row}(A)) \le \dim(\operatorname{col}(A)) \qquad (134)$$
for an arbitrary matrix A. Applying the result to AT establishes dim (row (A)) ≥
dim (col (A)), since row (AT ) = col (A) and col (AT ) = row (A).
To prove (134) let r := dim (row (A)) and let x1 , . . . , xr ∈ Rn be a basis for row (A). Consider
the vectors Ax1 , . . . , Axr ∈ Rm . They belong to col (A) by (43), so if they are linearly
independent then dim (col (A)) must be at least r. We will prove that this is the case by
contradiction.
Assume that Ax1 , . . . , Axr are linearly dependent. Then there exist coefficients α1 , . . . , αr ∈
R, not all zero, such that
$$0 = \sum_{i=1}^r \alpha_i\, A x_i = A \left( \sum_{i=1}^r \alpha_i x_i \right) \quad \text{(by linearity of the matrix product)}. \qquad (135)$$
This implies that $\sum_{i=1}^r \alpha_i x_i$ is orthogonal to every row of A and hence to every vector in
row (A). However, it is in the span of a basis of row (A) by construction! This is only
possible if $\sum_{i=1}^r \alpha_i x_i = 0$, which is a contradiction because x1 , . . . , xr are assumed to be
linearly independent.
We prove the result by induction. The base case, k = 1, follows immediately from (108).
To complete the proof we need to show that if the result is true for k − 1 ≥ 1 (this is the
induction hypothesis) then it also holds for k.

Let S be an arbitrary subspace of dimension k. We choose an orthonormal basis
b1 , b2 , . . . , bk for the subspace such that bk is orthogonal to u1 , u2 , . . . , uk−1 . We can do this
by taking any vector in S that is linearly independent of u1 , u2 , . . . , uk−1 and subtracting its
projection onto the span of u1 , u2 , . . . , uk−1 (if the result is always zero then S is in the span
and consequently cannot have dimension k).
By the induction hypothesis,

Σ_{i=1}^{n} ||P_span(u1,u2,...,uk−1) xi||_2^2 = Σ_{i=1}^{k−1} ||X^T ui||_2^2   by (6)   (136)
                                             ≥ Σ_{i=1}^{n} ||P_span(b1,b2,...,bk−1) xi||_2^2   (137)
                                             = Σ_{i=1}^{k−1} ||X^T bi||_2^2   by (6).   (138)
By (110),

Σ_{i=1}^{n} ||P_span(uk) xi||_2^2 = ||X^T uk||_2^2   (139)
                                  ≥ Σ_{i=1}^{n} ||P_span(bk) xi||_2^2   (140)
                                  = ||X^T bk||_2^2.   (141)

Adding (136)–(138) and (139)–(141) and applying (6) to the orthonormal bases u1, . . . , uk and b1, . . . , bk yields

Σ_{i=1}^{n} ||P_span(u1,...,uk) xi||_2^2 ≥ Σ_{i=1}^{n} ||P_S xi||_2^2,

which completes the induction step.
DS-GA 1002 Lecture notes 10 November 23, 2015
Linear models
1 Linear functions
A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions. A linear function T is a function such that a linear combination of inputs is mapped to the same linear combination of the corresponding outputs, i.e. for all scalars α and all vectors x, y,

T (α x) = α T (x),   (1)
T (x + y) = T (x) + T (y).   (2)
Theorem 1.2 (Equivalence between matrices and linear functions). For finite m, n every
linear function T : Rn → Rm can be represented by a matrix T ∈ Rm×n .
This implies that in order to analyze linear models in finite-dimensional spaces we can restrict
our attention to matrices.
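Theorem 1.2 suggests a concrete recipe: the i-th column of the representing matrix is T (ei). A minimal sketch, with a made-up linear function T:

```python
import numpy as np

# a hypothetical linear function T: R^3 -> R^2
def T(x):
    return np.array([2 * x[0] - x[1], x[1] + 3 * x[2]])

# build the representing matrix column by column: column i is T(e_i)
e = np.eye(3)
M = np.column_stack([T(e[:, i]) for i in range(3)])

x = np.array([1.0, -2.0, 0.5])
# the matrix reproduces the function on any input
assert np.allclose(M @ x, T(x))
```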
Definition 1.3 (Range). The range of a matrix A ∈ Rm×n is the set of all possible vectors in Rm that we can reach by applying the matrix to a vector in Rn.
Definition 1.4 (Null space). The null space of A ∈ Rm×n contains the vectors in Rn that
A maps to the zero vector.
The following lemma shows that the null space is the orthogonal complement of the row space of the matrix.

Lemma 1.5. For any matrix A ∈ Rm×n, the null space of A is the orthogonal complement of the row space of A.

The lemma, proved in Section A.2 of the appendix, implies that the matrix is invertible if we restrict the inputs to be in the row space of the matrix.
Corollary 1.6. Any matrix A ∈ Rm×n is invertible when acting on its row space: for any two nonzero vectors x1 ≠ x2 in the row space of A, Ax1 ≠ Ax2.
Proof. Assume that Ax1 = Ax2 for two different nonzero vectors x1 and x2 in the row space of A. Then x1 − x2 is a nonzero vector in the null space of A. By Lemma 1.5 this implies that x1 − x2 is orthogonal to the row space of A and consequently to itself, so that x1 = x2, which is a contradiction.
This means that for every matrix A ∈ Rm×n we can decompose any vector in Rn into two
components: one is in the row space and is mapped to a nonzero vector in Rm that is unique
in the sense that no other vector in row (A) is mapped to it, the other is in the null space
and is mapped to the zero vector.
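This decomposition is easy to compute with the SVD. A sketch with a made-up rank-2 matrix (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# a rank-2 matrix mapping R^4 -> R^3
A = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)          # numerical rank
V_row = Vt[:r].T               # orthonormal basis of row(A)

x = rng.standard_normal(4)
x_row = V_row @ (V_row.T @ x)  # component of x in the row space
x_null = x - x_row             # component of x in the null space

assert np.allclose(A @ x_null, 0)     # null-space component is annihilated
assert np.allclose(A @ x, A @ x_row)  # image is determined by the row-space component
```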
1.2 Interpretation using the SVD
Recall that the left singular vectors of a matrix A that correspond to nonzero singular values
are a basis of the column space of A. It follows that they are also a basis for the range.
The right singular vectors corresponding to nonzero singular values are a basis of the row
space. As a result, any orthonormal set of vectors that completes these right singular vectors to an orthonormal basis of Rn is a basis of the null space of the matrix. We can therefore write any
matrix A such that m ≥ n as
A = [ u1 u2 · · · ur ur+1 · · · un ] diag (σ1, σ2, . . . , σr, 0, . . . , 0) [ v1 v2 · · · vr vr+1 · · · vn ]^T ,

where u1, . . . , ur form a basis of range (A), v1, . . . , vr form a basis of row (A) and vr+1, . . . , vn form a basis of null (A).
Note that the vectors ur+1 , . . . , un are a subset of an orthonormal basis of the orthogonal
complement of the range, which has dimension m − r.
The SVD provides a very intuitive characterization of the mapping between x ∈ Rn and
Ax ∈ Rm for any matrix A ∈ Rm×n with rank r,
Ax = Σ_{i=1}^{r} σi (vi^T x) ui .   (8)
1. Compute the projection of x onto the right singular vectors of A: v1T x, v2T x, . . . , vrT x.
2. Scale the projections using the corresponding singular value: σ1 v1T x, σ2 v2T x, . . . , σr vrT x.
3. Multiply each scaled projection by the corresponding left singular vector ui and sum the results.
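The three steps can be verified directly against the matrix-vector product, for instance with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)

# step 1: project x onto the right singular vectors (v_i^T x)
proj = Vt[:r] @ x
# steps 2 and 3: scale by sigma_i and recombine with the left singular vectors
Ax = U[:, :r] @ (s[:r] * proj)

assert np.allclose(Ax, A @ x)
```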
In a linear model we assume that the data y ∈ Rm can be represented as the result of
applying a linear function or matrix A ∈ Rm×n to an unknown vector x ∈ Rn ,
Ax = y. (9)
The aim is to determine x from the measurements. Depending on the structure of A and y
this may or may not be possible.
If we expand the matrix-vector product, the linear model is equivalent to a system of linear
equations
If the number of equations m is greater than the number of unknowns n, the system is said to be overdetermined. If there are more unknowns than equations, n > m, then the system is underdetermined.
Recall that range (A) is the set of vectors that can be reached by applying A. If y does not
belong to this set, then the system cannot have a solution.
Lemma 1.7. The system y = Ax has one or multiple solutions if and only if y ∈ range (A).
Proof. If y = Ax has a solution then y ∈ range (A) by (4). If y ∈ range (A) then there is a linear combination of the columns of A that yields y by (4), so the system has at least one solution.
If the null space of the matrix has dimension greater than 0, then the system cannot have a
unique solution.
Lemma 1.8. If dim (null (A)) > 0, then if Ax = y has a solution, the system has an infinite
number of solutions.
Proof. The null space has dimension at least one, so it contains infinitely many vectors h such that Ah = 0. For any solution x with Ax = y, x + h is then also a solution, since A (x + h) = Ax + Ah = y.
In the critical case m = n, linear systems may have a unique solution if the matrix is full
rank, i.e. if all its rows (and its columns) are linearly independent. This means that the
data in the linear model completely specify the unknown vector of interest, which can be
recovered by inverting the matrix.
Lemma 1.9. For any square matrix A ∈ Rn×n, the following statements are equivalent.
1. dim (null (A)) = 0.
2. A is full rank.
3. A is invertible.
4. The system Ax = y has a unique solution for every vector y ∈ Rn .
Proof. We prove that the statements imply each other in the order (1) ⇒ (2) ⇒ (3) ⇒
(4) ⇒ (1).
(1) ⇒ (2): If dim (null (A)) = 0, by Lemma 1.5 the row space of A is the orthogonal
complement of {0}, so it is equal to Rn and therefore the rows are all linearly independent.
(2) ⇒ (3): If A is full rank, its rows span all of Rn, so row (A) = Rn and A is invertible on all of Rn by Corollary 1.6.
(3) ⇒ (4) If A is invertible there is a unique solution to Ax = y which is A−1 y. If the
solution is not unique then Ax1 = Ax2 = y for some x1 6= x2 so that 0 and x1 − x2 have the
same image and A is not invertible.
(4) ⇒ (1): If Ax = y has a unique solution for every y, then in particular Ax = 0 has the unique solution x = 0, so dim (null (A)) = 0.
Recall that the inverse of a product of invertible matrices is equal to the product of the inverses in reverse order,

(A B)^{-1} = B^{-1} A^{-1},   (14)

and that the inverse of the transpose of a matrix is the transpose of the inverse,

(A^T)^{-1} = (A^{-1})^T.   (15)
Using these facts, the inverse of a matrix A can be written in terms of its singular vectors
and singular values,
A^{-1} = (U S V^T)^{-1}   (16)
       = (V^T)^{-1} S^{-1} U^{-1}   (17)
       = V diag (1/σ1, 1/σ2, . . . , 1/σn) U^T   (18)
       = Σ_{i=1}^{n} (1/σi) vi ui^T.   (19)
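Equation (19) can be checked numerically by assembling the inverse from rank-one terms; a sketch with a random 4 × 4 matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))   # invertible with probability one

U, s, Vt = np.linalg.svd(A)
# assemble the inverse as a sum of rank-one terms (1/sigma_i) v_i u_i^T
A_inv = sum((1.0 / s[i]) * np.outer(Vt[i], U[:, i]) for i in range(4))

assert np.allclose(A_inv, np.linalg.inv(A))
```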
Note that if one of the singular values of a matrix is very small, the corresponding term
in (19) becomes very large. As a result, the solution to the system of equations becomes
very susceptible to noise in the data. In order to quantify the stability of the solution of a
system of equations we use the condition number of the corresponding matrix.
Definition 1.10 (Condition number). The condition number of a matrix is the ratio between its largest and its smallest singular values,

cond (A) = σmax / σmin.   (20)
If the condition number is very large, then perturbations in the data may be dramatically
amplified in the corresponding solution. This is illustrated by the following example.
The matrix

A := [ 1   1
       1   1.001 ]   (21)

has a condition number of approximately 4 · 10^3. Compare the solutions to the corresponding system of equations for two very similar vectors:

A^{-1} [ 1 ; 1 ] = [ 1 ; 0 ],   (22)
A^{-1} [ 1.1 ; 1 ] = [ 101.1 ; −100 ].   (23)
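The two solves can be reproduced numerically, and `np.linalg.cond` reports the condition number:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])
print(np.linalg.cond(A))   # large condition number for this matrix

x1 = np.linalg.solve(A, np.array([1.0, 1.0]))
x2 = np.linalg.solve(A, np.array([1.1, 1.0]))

# a small perturbation of the data changes the solution drastically
assert np.allclose(x1, [1.0, 0.0])
assert np.allclose(x2, [101.1, -100.0])
```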
2 Least squares
Just like a system for which m = n, an overdetermined system will have a solution as long
as y ∈ range (A). However, if m > n then range (A) is a low-dimensional subspace of Rm .
This means that even a small perturbation in a random direction is bound to kick y out
of the subspace. As a result, in most cases overdetermined systems do not have a solution.
However, we may still compute the point in range (A) that is closest to the data y. If we
measure distance using the `2 norm, the resulting method is known as least squares.
Definition 2.1 (Least squares). The method of least squares consists of estimating x by solving the optimization problem

x_LS := arg min_x ||y − A x||_2 .   (24)
Tall matrices (with more rows than columns) are said to be full rank if all their columns are
linearly independent. If A is full rank, then the solution to the least-squares problem has a
closed-form solution given by the following theorem.
Theorem 2.2 (Least-squares solution). If A ∈ Rm×n is full rank and m ≥ n the solution to
the least-squares problem (24) is equal to
x_LS = V S^{-1} U^T y   (25)
     = (A^T A)^{-1} A^T y.   (26)
since every x ∈ Rn corresponds to a unique z ∈ range (A) (we are assuming that the matrix
is full rank, so the null space only contains the zero vector). By Theorem 1.7 in Lecture
Notes 9, the solution to Problem (27) is
P_range(A) y = Σ_{i=1}^{n} (ui^T y) ui   (28)
             = U U^T y.   (29)
U^T U S V^T x_LS = U^T U U^T y.   (32)
We have
U T U = I, (33)
because the columns of U are orthonormal (note that U U T 6= I if m > n!). As a result
x_LS = (S V^T)^{-1} U^T y   (34)
     = (V^T)^{-1} S^{-1} U^T y   (35)
     = V S^{-1} U^T y,   (36)
where we have used the fact that

V^{-1} = V^T and (V^T)^{-1} = V,   (37)

that S is diagonal, so S^T = S, and that A is full rank, so that all the singular values are nonzero and S is indeed invertible.
The matrix (A^T A)^{-1} A^T is called the pseudoinverse of A. In the square case it reduces to the inverse of the matrix.
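The two closed-form expressions in Theorem 2.2 and numpy's least-squares solver can be compared on random data (a sketch; the matrix is full rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 3))   # tall, full-rank matrix
y = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)                  # V S^{-1} U^T y
x_normal = np.linalg.solve(A.T @ A, A.T @ y)    # (A^T A)^{-1} A^T y
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]  # numpy's solver

assert np.allclose(x_svd, x_normal)
assert np.allclose(x_svd, x_lstsq)
```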
A very important application of least squares is fitting linear regression models. In linear
regression, we assume that a quantity of interest can be expressed as a linear combination
of other observed quantities.
a ≈ Σ_{j=1}^{n} θj cj ,   (42)

where a is the response, c1, . . . , cn are the covariates and θ1, . . . , θn are the parameters of the model. Given m observations of the response and the covariates, we can place the response
in a vector y and the covariates in a matrix X such that each column corresponds to a different
covariate. We can then fit the parameters so that the model approximates the response as
closely as possible in `2 norm. This is achieved by solving the least-squares problem

θ_LS := arg min_θ ||y − X θ||_2 .   (43)
Lemma 2.3. Let Y and Z be random vectors of dimension n such that

Y = X θ + Z,   (44)

where Z is a Gaussian random vector with zero mean and identity covariance. The maximum-likelihood estimate of θ given a realization y of Y is the least-squares estimate.

Proof. Setting Σ = I in Definition 2.20 of Lecture Notes 3, we have that the likelihood function is

L (θ) = (1 / sqrt((2π)^n)) exp ( − (1/2) ||y − X θ||_2^2 ).   (45)
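Taking logarithms makes the equivalence with least squares explicit (a short derivation under the Gaussian model of the lemma):

```latex
\log L(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\,\lVert y - X\theta \rVert_2^2 .
```

The first term does not depend on θ, so maximizing L (θ) is equivalent to minimizing ||y − X θ||_2, i.e. the maximum-likelihood estimate coincides with the least-squares estimate of Definition 2.1.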
[Figure 1 here: six panels plotting temperature (degrees Celsius) against time, showing the data and the fitted model for the maximum (left column) and minimum (right column) temperatures over 1860–2000, 1900–1905 and 1960–1965.]
Figure 1: Data and fitted model described by Example 2.4 for maximum and minimum temper-
atures.
Example 2.4 (Global warming). In this example we build a model for temperature data
taken in a weather station in Oxford over 150 years.1 The model is of the form
y ≈ a + b̃ cos ( 2πt/12 + c̃ ) + d t   (49)
  = a + b cos ( 2πt/12 ) + c sin ( 2πt/12 ) + d t,   (50)
where t denotes the time in months. The parameter a represents the mean temperature,
b and c account for periodic yearly fluctuations and d is the overall trend. If d is positive then the model indicates that temperatures are increasing, whereas if it is negative then it indicates that temperatures are decreasing. To fit these parameters using the data, we build
a matrix A with four columns, corresponding to the constant term, cos (2πt/12), sin (2πt/12) and t, compile the temperatures in a vector y and solve a least-squares problem. The results are
shown in Figures 1 and 2. The fitted model indicates that both the maximum and minimum
temperatures are increasing by about 0.8 degrees Celsius (around 1.4 ◦ F).
1 The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
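The fitting procedure of Example 2.4 can be sketched with synthetic data standing in for the Oxford measurements (all numbers below are made up):

```python
import numpy as np

# monthly time index over 150 hypothetical years
t = np.arange(150 * 12, dtype=float)

# synthetic temperatures: mean + seasonal cycle + slow trend + noise
rng = np.random.default_rng(5)
y = 10 + 5 * np.cos(2 * np.pi * t / 12) + 0.0005 * t + rng.standard_normal(t.size)

# design matrix with four columns: constant, cos, sin and linear trend
A = np.column_stack([
    np.ones_like(t),
    np.cos(2 * np.pi * t / 12),
    np.sin(2 * np.pi * t / 12),
    t,
])

a, b, c, d = np.linalg.lstsq(A, y, rcond=None)[0]
# d estimates the warming trend per month; here it should come out positive
print(f"trend per month: {d:.5f}")
```

The same design matrix, with the real station data in y, produces the fits shown in the figures.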
[Figure 2 here: maximum (left) and minimum (right) temperature data (degrees Celsius) together with the fitted trend over 1860–2000.]
A Proofs
A.1 Proof of Theorem 1.2

The matrix is

T := [ T (e1) T (e2) · · · T (en) ] ,   (52)

i.e. the columns of the matrix are the result of applying T to the standard basis of Rn.
Indeed, for any vector x ∈ Rn,

T (x) = T ( Σ_{i=1}^{n} x[i] ei )   (53)
      = Σ_{i=1}^{n} x[i] T (ei)   by (1) and (2)   (54)
      = T x.   (55)
A.2 Proof of Lemma 1.5

We prove (6) by showing that the two sets are subsets of each other.
Any vector x in the row space of A can be written as

x = A^T z   (56)

for some z ∈ Rm. If y belongs to the null space of A, then

y^T x = y^T A^T z   (57)
      = (A y)^T z   (58)
      = 0,   (59)

so every vector in the null space is orthogonal to the row space.