EECS 127/227AT Optimization Models in Engineering


Spring 2019 Homework 2
Due date: 2/14/19, 23:00 (11pm). Please LaTeX or handwrite your homework solution and submit
an electronic version. Self grades are due 2/21/19 at 23:00 (11pm).

1. Gram-Schmidt Any set of $n$ linearly independent vectors in $\mathbb{R}^n$ can be used as a basis for $\mathbb{R}^n$.
However, certain bases can be more suitable for certain operations than others. For example, an
orthonormal basis can facilitate solving linear equations.

(a) Given a matrix $A \in \mathbb{R}^{n \times n}$, it can be written as a product of two matrices,
$$A = QR,$$
where $Q$ is a unitary matrix (its columns form an orthonormal basis for $\mathbb{R}^n$) and $R$ is an
upper-triangular matrix. For the matrix $A$, describe how the Gram-Schmidt process can be
used to find the $Q$ and $R$ matrices, and apply this to
$$A = \begin{bmatrix} 3 & -3 & 1 \\ 4 & -4 & -7 \\ 0 & 3 & 3 \end{bmatrix}$$
to find a unitary matrix $Q$ and an upper-triangular matrix $R$ (a short numerical sketch of this procedure is given after part (d)).


(b) Given an invertible matrix $A \in \mathbb{R}^{n \times n}$ and an observation vector $b \in \mathbb{R}^n$, the solution to the equation
$$Ax = b$$
is given by $x = A^{-1} b$. For the matrix $A = QR$ from part (a), assume that we want to solve
$$Ax = \begin{bmatrix} 8 \\ -6 \\ 3 \end{bmatrix}.$$
By using the fact that $Q$ is a unitary matrix, find a vector $\tilde{b}$ such that
$$Rx = \tilde{b}.$$
Then, given the upper-triangular matrix $R$ from part (a) and this $\tilde{b}$, find the elements of $x$ sequentially.
(c) Describe how your solution in the previous part is akin to Gaussian elimination for solving
a system of linear equations.
(d) Given an invertible matrix $B \in \mathbb{R}^{n \times n}$ and an observation vector $c \in \mathbb{R}^n$, find the computational
cost of finding the solution $z$ to the equation $Bz = c$ by using the QR decomposition of $B$.
Assume that the $Q$ and $R$ matrices are available, and that adding, multiplying, and dividing scalars
each take one unit of “computation”.
As an example, computing the inner product $a^\top b$ is said to be $O(n)$, since we perform $n$ scalar
multiplications, one for each product $a_i b_i$. Similarly, matrix-vector multiplication is $O(n^2)$, since it
can be viewed as computing $n$ inner products. The computational cost
of inverting a matrix in $\mathbb{R}^{n \times n}$ is $O(n^3)$, so the cost grows rapidly as the set of
equations grows in size. This is why the expression $A^{-1} b$ is usually not computed by directly
inverting the matrix $A$. Instead, the QR decomposition of $A$ is exploited to decrease the
computational cost.
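
Below is a small numerical sketch (not part of the assignment) of the procedure in parts (a), (b), and (d): classical Gram-Schmidt builds $Q$ and $R$ for the given $A$, and back substitution then solves $Rx = Q^\top b$. It uses NumPy, and the function names are illustrative only.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: orthonormalize the columns of A to obtain Q, R."""
    n = A.shape[1]
    Q = np.zeros_like(A, dtype=float)
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float)          # start from the j-th column
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]    # component along an earlier direction
            v -= R[i, j] * Q[:, i]         # remove that component
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]              # normalize what remains
    return Q, R

def back_substitution(R, b):
    """Solve Rx = b for upper-triangular R, starting from the last row."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

A = np.array([[3., -3., 1.],
              [4., -4., -7.],
              [0., 3., 3.]])
b = np.array([8., -6., 3.])

Q, R = gram_schmidt_qr(A)
x = back_substitution(R, Q.T @ b)   # Q^T b plays the role of the new right-hand side
print(np.allclose(A @ x, b))        # sanity check: should print True
```

The back-substitution loop touches each entry of $R$ at most once, which is one way to start the operation count asked for in part (d).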

2. Gradient Descent Algorithm


Given a continuous and differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient of $f$ at any point $x$, $\nabla f(x)$,
is orthogonal to the level curve of $f$ at $x$, and it points in the direction in which $f$ increases. In
other words, moving from the point $x$ in the direction $\nabla f(x)$ increases the value of $f$,
while moving in the direction $-\nabla f(x)$ decreases the value of $f$. This idea gives an iterative
algorithm for minimizing the function $f$: the gradient descent algorithm.
This problem is a light introduction to the gradient descent algorithm, which we will cover in more
detail later in the class.

(a) Consider $f(x) = \frac{1}{2}(x - 2)^2$, and assume that we use the gradient descent algorithm:
$$x[k+1] = x[k] - \eta \nabla f(x[k]) \quad \forall k \ge 0,$$
with some random initialization $x[0]$, where $\eta > 0$ is the step size (or the learning rate) of the
algorithm. Write $(x[k] - 2)$ in terms of $(x[0] - 2)$, and show that $x[k]$ converges to $2$, which is
the unique minimizer of $f$, when $\eta = 0.2$.
(b) What is the largest value of $\eta$ that we can use so that the gradient descent algorithm converges
to $2$ from all possible initializations in $\mathbb{R}$? What happens if we choose a larger step size?
(c) Now assume that we use the gradient descent algorithm to minimize $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ for
some $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, where $A$ has full column rank. First show that $\nabla f(x) =
A^\top A x - A^\top b$. Then, write $(x[k] - (A^\top A)^{-1} A^\top b)$ in terms of $(x[0] - (A^\top A)^{-1} A^\top b)$ and find
the largest step size that we can use (in terms of $A$ and $b$) so that the gradient descent
algorithm converges for all possible initializations. Your largest step size should be a function
of $\lambda_{\max}(A^\top A)$, the largest eigenvalue of $A^\top A$.
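
As a sketch of the iteration in part (c), the snippet below runs gradient descent on a randomly generated full-column-rank $A$ and vector $b$ (both stand-ins chosen here for illustration) with the conservative step size $\eta = 1/\lambda_{\max}(A^\top A)$, which lies inside the convergent range; determining the largest admissible $\eta$ is the exercise.

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * ||Ax - b||_2^2 with a conservative step size.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))            # full column rank with probability 1
b = rng.standard_normal(20)

eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()   # eta = 1 / lambda_max(A^T A)
x = rng.standard_normal(5)                      # random initialization x[0]
for _ in range(2000):
    x = x - eta * (A.T @ A @ x - A.T @ b)       # x[k+1] = x[k] - eta * grad f(x[k])

x_star = np.linalg.solve(A.T @ A, A.T @ b)      # the minimizer (A^T A)^{-1} A^T b
print(np.allclose(x, x_star))                   # should print True
```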

3. Frobenius norm The Frobenius norm of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as
$$\|A\|_F = \sqrt{\langle A, A \rangle} = \sqrt{\operatorname{trace}(A^\top A)}.$$
Recall that for two matrices $A, B \in \mathbb{R}^{m \times n}$, the canonical inner product defined over this space is
$\langle A, B \rangle := \operatorname{trace}(A^\top B) = \sum_{ij} A_{ij} B_{ij}$. Note this is equivalent to flattening the matrices into “vectors”
and then taking the inner product in $\mathbb{R}^{mn}$.
Show that the Frobenius norm satisfies all three properties of a norm.
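
As a quick numerical spot check of the definition (not a proof, and not required for the exercise), the Frobenius norm can be compared against the Euclidean norm of the flattened matrix, and the triangle inequality can be tested on a pair of random matrices:

```python
import numpy as np

# Spot check on random matrices: ||A||_F equals the 2-norm of the flattened A,
# and the triangle inequality holds for this particular pair (A, B).
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

fro = lambda M: np.sqrt(np.trace(M.T @ M))
print(np.isclose(fro(A), np.linalg.norm(A.ravel())))   # flattened-vector view
print(fro(A + B) <= fro(A) + fro(B) + 1e-12)           # triangle inequality
```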

4. Eigenvectors of a symmetric matrix Let $p, q \in \mathbb{R}^n$ be two linearly independent vectors with
unit norm ($\|p\|_2 = \|q\|_2 = 1$). Define the symmetric matrix $A := pq^\top + qp^\top$. In your derivations, it
may be useful to use the notation $c := p^\top q$.

(a) Show that p + q and p − q are eigenvectors of A, and determine the corresponding eigenvalues.

(b) Determine the nullspace and rank of A.



(c) Find an eigenvalue decomposition of A, in terms of p, q. Hint: use the previous two parts.

(d) What is the answer to the previous part if $p, q$ are not normalized? Write $A$ as a function of $p, q$
and their norms, and the new eigenvalues as a function of $p, q$ and their norms.
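
The following is a short numerical check of part (a), using randomly generated unit vectors $p, q$ (purely illustrative). It verifies that $p + q$ and $p - q$ are eigenvectors of $A$, recovering each eigenvalue via the Rayleigh quotient rather than stating it.

```python
import numpy as np

# Check that p + q and p - q are eigenvectors of A = p q^T + q p^T.
rng = np.random.default_rng(2)
p = rng.standard_normal(4); p /= np.linalg.norm(p)   # random unit vector
q = rng.standard_normal(4); q /= np.linalg.norm(q)   # random unit vector

A = np.outer(p, q) + np.outer(q, p)
for v in (p + q, p - q):
    lam = (v @ A @ v) / (v @ v)          # Rayleigh quotient recovers the eigenvalue
    print(np.allclose(A @ v, lam * v))   # True: v is an eigenvector of A
```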

5. Interpretation of the covariance matrix We are given $m$ data points $x^{(1)}, \ldots, x^{(m)}$ in $\mathbb{R}^n$. Let
$\hat{x} \in \mathbb{R}^n$ denote the sample average of the points:
$$\hat{x} := \frac{1}{m} \sum_{i=1}^{m} x^{(i)},$$
and let $\Sigma$ denote the sample covariance matrix:
$$\Sigma := \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \hat{x})(x^{(i)} - \hat{x})^\top.$$

Given a normalized direction $w \in \mathbb{R}^n$ with $\|w\|_2 = 1$, we consider the line with direction $w$ passing
through the origin: $L(w) = \{tw : t \in \mathbb{R}\}$. We then consider the projection of the points $x^{(i)}$,
$i = 1, \ldots, m$, onto the line $L(w)$. For example, the projection of the point $x^{(i)}$ onto the line $L(w)$ is
given by $t_i(w) w$, where
$$t_i(w) = \arg\min_{t} \|tw - x^{(i)}\|_2.$$
For any $w$, let $\hat{t}(w)$ denote the sample average of the $t_i(w)$:
$$\hat{t}(w) = \frac{1}{m} \sum_{i=1}^{m} t_i(w),$$
and let $\sigma^2(w)$ denote the empirical variance of the $t_i(w)$:
$$\sigma^2(w) = \frac{1}{m} \sum_{i=1}^{m} \big(t_i(w) - \hat{t}(w)\big)^2.$$

(a) Show that $t_i(w) = w^\top x^{(i)}$, $i = 1, \ldots, m$.

(b) Assume that $\hat{t}(w)$ is constant, i.e., it is independent of the direction $w$. Show that the sample
average of the data points, $\hat{x}$, is zero.

(c) In this subpart, we show that the covariance matrix alone provides a powerful visualization of
where the data points reside. Assume that the points $x^{(1)}, \ldots, x^{(m)}$ are in $\mathbb{R}^2$, and that the
sample average of the points, $\hat{x}$, is zero.
As defined at the beginning of the problem, for any unit vector $w$, $\sigma^2(w)$ denotes the empirical
variance of $\{t_i(w)\}_{i=1}^{m}$, and correspondingly, $\sigma(w)$ denotes the standard deviation. For this
problem, given a vector $w$, let us ignore all points that are more than three standard deviations
away. In other words, for each unit vector $w$, let us assume that the points $x^{(1)}, \ldots, x^{(m)}$ belong
to the set
$$S(w) := \left\{ z \in \mathbb{R}^2 : \hat{t}(w) - 3\sigma(w) \le w^\top z \le \hat{t}(w) + 3\sigma(w) \right\}.$$
For $w = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$, assume $\sigma(w) = 1$, and describe the shape of $S(w)$ and shade the set $S(w)$
in $\mathbb{R}^2$. Remember that the sample average of the points, $\hat{x}$, is zero.

(d) Note that the points $x^{(1)}, \ldots, x^{(m)}$ reside in $S(w)$ for each unit vector $w$. We can visualize
the region occupied by these points by finding the intersection of $S(w)$ for various $w$. Let the
sample covariance matrix $\Sigma$ be $\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}$. For each of the following $w$:
$$\left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix},\; \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; \begin{bmatrix} \tfrac{\sqrt{2}}{2} \\ \tfrac{\sqrt{2}}{2} \end{bmatrix},\; \begin{bmatrix} -\tfrac{\sqrt{2}}{2} \\ \tfrac{\sqrt{2}}{2} \end{bmatrix} \right\},$$
use the sample covariance matrix to find $\sigma(w)$, and shade the region $S(w)$ in $\mathbb{R}^2$.
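
The snippet below is a numerical check of the claim in part (a) on a single randomly generated point and unit direction (both illustrative): the $t$ minimizing $\|tw - x\|_2$ agrees with $w^\top x$, here compared against a brute-force grid search over $t$.

```python
import numpy as np

# Grid-search check that arg min_t ||t*w - x||_2 equals w^T x for a unit vector w.
rng = np.random.default_rng(3)
w = rng.standard_normal(2); w /= np.linalg.norm(w)   # unit direction
x = rng.standard_normal(2)                           # a data point

ts = np.linspace(-10.0, 10.0, 200001)                # fine grid of candidate t values
dists = np.linalg.norm(ts[:, None] * w - x, axis=1)  # ||t*w - x||_2 for each t
t_grid = ts[np.argmin(dists)]
print(np.isclose(t_grid, w @ x, atol=1e-3))          # True, up to the grid spacing
```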

6. Nullspace and Spectral Theorem Let $A, B \in \mathbb{S}^n_+$ (the set of symmetric positive semidefinite
matrices in $\mathbb{R}^{n \times n}$) and let $A = VV^\top$, where $V \in \mathbb{R}^{n \times r}$ and $r = \operatorname{rank}(A)$. Show that $\operatorname{null}(A) \subseteq
\operatorname{null}(B)$ if and only if there exists $Q \in \mathbb{S}^r_+$ such that $B = VQV^\top$.
Hint: For the forward direction, consider $A = U \Lambda U^\top$ and $B = W M W^\top$ and set $Q = (V^\dagger W) M (V^\dagger W)^\top$,
where $V^\dagger = (V^\top V)^{-1} V^\top$, $U \in \mathbb{R}^{n \times r}$, and $W \in \mathbb{R}^{n \times s}$ where $s = \operatorname{rank}(B)$.
This result is useful for finding an upper bound on the minimum-rank solution of a semidefinite
program (SDP). An SDP is a convex optimization problem where the variable we optimize over is
a PSD matrix. In these problems, we are usually interested in solutions that are low-rank
(i.e., many eigenvalues equal to zero). Before even solving an SDP (which is in general computationally
expensive), you can derive an upper bound on the minimum-rank solution. Doing
so requires the stated proposition. You will learn about SDPs later in the course. No
knowledge of SDPs is required to solve the following problems.

(a) Prove the backward direction first; that is, prove that if $B = VQV^\top$ then $\operatorname{null}(A) \subseteq \operatorname{null}(B)$. (A numerical illustration of this direction appears after part (e).)
(b) Prove that, given $B = VQV^\top$, $VV^\dagger = V(V^\top V)^{-1}V^\top$ is an orthogonal projection matrix
onto the column span of $V$ (recall the definition of an orthogonal projection matrix).
(c) Prove that, for symmetric matrices $A, B$, $\operatorname{null}(A) \subseteq \operatorname{null}(B)$ implies $\operatorname{range}(B) \subseteq \operatorname{range}(A)$.
(d) Prove that if $\operatorname{range}(W) \subseteq \operatorname{range}(V)$ then $VV^\dagger W = W$.
(e) Finally, use the above results to prove the forward direction. Let $A = U \Lambda U^\top$ and $B = W M W^\top$,
where $U \in \mathbb{R}^{n \times r}$ and $W \in \mathbb{R}^{n \times s}$ where $s = \operatorname{rank}(B)$. Find $Q$, in terms of $V, W, M$, such that
$B = VQV^\top = WMW^\top$.
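
Below is a small numerical illustration of the backward direction from part (a), with randomly generated $V$ and a PSD $Q$ (all dimensions and matrices are arbitrary stand-ins): it builds $A = VV^\top$ and $B = VQV^\top$ and checks that a basis of $\operatorname{null}(A)$ is annihilated by $B$.

```python
import numpy as np

# If B = V Q V^T with Q PSD, then every vector in null(A) = null(V V^T)
# should also lie in null(B).
rng = np.random.default_rng(4)
n, r = 6, 3
V = rng.standard_normal((n, r))      # rank r with probability 1
A = V @ V.T

G = rng.standard_normal((r, r))
Q = G @ G.T                          # a PSD matrix in S^r_+
B = V @ Q @ V.T

# A basis of null(A): right singular vectors with (numerically) zero singular value.
_, s, Vt = np.linalg.svd(A)
null_A = Vt[s < 1e-10]               # shape (n - r, n)
print(np.allclose(B @ null_A.T, 0))  # True: null(A) is contained in null(B)
```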