
EE236C (Spring 2011-12)

3. Conjugate gradient method

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method

3-1
Unconstrained quadratic minimization

minimize   f(x) = (1/2) xᵀAx − bᵀx

with A ∈ Sⁿ₊₊

• equivalent to solving Ax = b
• residual r = b − Ax is negative gradient at x: r = −∇f (x)

conjugate gradient method

• invented by Hestenes and Stiefel around 1951


• the most widely used iterative method for solving Ax = b, with A ≻ 0
• can be extended to non-quadratic unconstrained minimization

Conjugate gradient method 3-2


Krylov subspaces

definition: a sequence of nested subspaces (K0 ⊆ K1 ⊆ K2 ⊆ · · · )

K0 = {0},   Kk = span{b, Ab, . . . , Aᵏ⁻¹b} for k ≥ 1

if Kk+1 = Kk , then Ki = Kk for all i ≥ k

key property: A⁻¹b ∈ Kn (even when Kn ≠ Rⁿ)

Cayley-Hamilton theorem: p(A) = Aⁿ + a1Aⁿ⁻¹ + · · · + anI = 0 where

p(λ) = det(λI − A) = λⁿ + a1λⁿ⁻¹ + · · · + an−1λ + an

therefore A⁻¹b = −(1/an) (Aⁿ⁻¹b + a1Aⁿ⁻²b + · · · + an−1b)
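
A quick numerical illustration of this identity (a sketch, not part of the original slides; the matrix and right-hand side are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # a random positive definite A
b = rng.standard_normal(n)

a = np.poly(A)                       # characteristic polynomial coefficients [1, a1, ..., an]
# A^{-1} b = -(1/an) (A^{n-1} b + a1 A^{n-2} b + ... + a_{n-1} b)
Akb = [np.linalg.matrix_power(A, k) @ b for k in range(n)]   # b, Ab, ..., A^{n-1} b
x = -sum(a[i] * Akb[n - 1 - i] for i in range(n)) / a[n]
print(np.allclose(x, np.linalg.solve(A, b)))                 # True: A^{-1} b lies in K_n
```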

Conjugate gradient method 3-3


Krylov sequence

CG algorithm is a recursive method for computing the Krylov sequence

x(k) = argmin_{x ∈ Kk} f(x),    k ≥ 0

• from the previous page, x(n) = A⁻¹b


• we will see there is a simple two-term recurrence

x(k+1) = x(k) − ak ∇f (x(k)) + bk (x(k) − x(k−1))

example

A = [ 1   0 ]      b = [ 10 ]
    [ 0  10 ],         [ 10 ]

(figure: iterates x(0), x(1), x(2) of the Krylov sequence in the (x1, x2)-plane)
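
The Krylov sequence for this example can be computed directly from its definition (a small NumPy sketch, not part of the original slides): minimize f over Kk by writing x = Vk y for a basis Vk of Kk.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 10.0]])
b = np.array([10.0, 10.0])
n = len(b)

# x(k) minimizes f over Kk = span{b, Ab, ..., A^(k-1) b}; with x = Vk y,
# the minimizer of f(Vk y) solves (Vk^T A Vk) y = Vk^T b
xs = [np.zeros(n)]                                   # x(0) = 0
for k in range(1, n + 1):
    Vk = np.column_stack([np.linalg.matrix_power(A, i) @ b for i in range(k)])
    y = np.linalg.solve(Vk.T @ A @ Vk, Vk.T @ b)
    xs.append(Vk @ y)

print(xs[1])        # x(1) ~ (1.82, 1.82), the minimizer on the line spanned by b
print(xs[2])        # x(2) = (10, 1) = A^{-1} b
```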

Conjugate gradient method 3-4


Residuals of Krylov sequence

• optimality conditions in definition of Krylov sequence

x(k) ∈ Kk , ∇f (x(k)) = Ax(k) − b ∈ Kk⊥

• hence, residuals rk = b − Ax(k) satisfy

rk ∈ Kk+1, rk ∈ Kk⊥

(first property follows from b ∈ K1 and x(k) ∈ Kk )

(nonzero) residuals form an orthogonal basis for the Krylov subspaces

Kk = span{r0, r1, . . . , rk−1},    riᵀrj = 0 (i ≠ j)

Conjugate gradient method 3-5


Conjugate directions

the vectors vi = x(i) − x(i−1) satisfy

viᵀAvj = 0 for i ≠ j,    viᵀAvi = viᵀri−1

• directions are ‘conjugate’: orthogonal for the inner product ⟨v, w⟩ = vᵀAw


• in particular, if vi ≠ 0 it is independent of v1, . . . , vi−1
• Kk = span{v1, v2, . . . , vk }
(proofs on next page)

conjugate vectors: defined as pi = vi/αi, scaled so that ri−1ᵀpi = ‖ri−1‖₂², where

    αi = viᵀri−1 / ‖ri−1‖₂² = ‖ri−1‖₂² / (piᵀApi)

Conjugate gradient method 3-6


proof of properties on page 3-6 (assume j < i)

• vjᵀAvi = 0 because vj = x(j) − x(j−1) ∈ Kj ⊆ Ki−1 and

     Avi = A(x(i) − x(i−1)) = −ri + ri−1 ∈ Ki−1⊥


• expression for viᵀAvi follows from the fact that t = 1 minimizes

     f(x(i−1) + tvi) = f(x(i−1)) + (1/2) t² viᵀAvi − t viᵀri−1

• second expression for αi follows from

     αi = viᵀri−1 / ‖ri−1‖₂² = viᵀAvi / ‖ri−1‖₂² = αi² piᵀApi / ‖ri−1‖₂²

Conjugate gradient method 3-7


Recursion for pk

Kk = span{p1, p2, . . . , pk−1, rk−1}, so we can express pk as

p1 = δr0,     pk = δrk−1 + βpk−1 + γ1p1 + · · · + γk−2pk−2   (k > 1)

• γi = 0: take inner products with Api for i ≤ k − 2:

     piᵀApk = piᵀApk−1 = 0,    piᵀArk−1 = 0 (because Api ∈ Ki+1 ⊆ Kk−1)

• δ = 1: take inner product with rk−1 and use rk−1ᵀpk = ‖rk−1‖₂²

• expression for β: take inner product with Apk−1

     β = − pk−1ᵀArk−1 / (pk−1ᵀApk−1)

Conjugate gradient method 3-8


Basic conjugate gradient algorithm

x(0) = 0, r0 = b

for k = 1, 2, . . .
1. return x(k−1) if ‖rk−1‖₂ ≤ ε‖b‖₂
2. if k = 1, pk = r0; otherwise

   pk = rk−1 + βpk−1    where   β = − pk−1ᵀArk−1 / (pk−1ᵀApk−1)

3. compute

   x(k) = x(k−1) + αpk    where   α = ‖rk−1‖₂² / (pkᵀApk)

   and rk = b − Ax(k)

Conjugate gradient method 3-9


Improvements

step 3: compute residual recursively:

rk = rk−1 − αApk

step 2: simplify the expression for β by using

   rk−1 = rk−2 − (‖rk−2‖₂² / (pk−1ᵀApk−1)) Apk−1

taking the inner product with rk−1 (and using rk−1ᵀrk−2 = 0) gives

   β = ‖rk−1‖₂² / ‖rk−2‖₂²

this reduces the number of matrix-vector products with A to one per iteration

Conjugate gradient method 3-10


Conjugate gradient algorithm

x(0) = 0, r0 = b

for k = 1, 2, . . .
1. return x(k−1) if ‖rk−1‖₂ ≤ ε‖b‖₂

2. if k = 1, p1 = r0; else

   pk = rk−1 + (‖rk−1‖₂² / ‖rk−2‖₂²) pk−1

3. compute

   α = ‖rk−1‖₂² / (pkᵀApk),    x(k) = x(k−1) + αpk,    rk = rk−1 − αApk
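
A compact NumPy sketch of this algorithm (an illustrative reference implementation for these notes, not production code; it returns the whole list of iterates so convergence can be inspected in the experiments below):

```python
import numpy as np

def cg(A, b, eps=1e-8, maxiter=None):
    """Conjugate gradient for Ax = b with A symmetric positive definite.

    Follows the algorithm above (page 3-11) and returns the iterates
    x(0), x(1), ...
    """
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)                         # x(0) = 0
    r = b.astype(float)                     # r0 = b
    xs = [x.copy()]
    p, rho_prev = None, None
    for k in range(1, maxiter + 1):
        rho = r @ r                         # ||r_{k-1}||_2^2
        if np.sqrt(rho) <= eps * np.linalg.norm(b):
            break                           # return x(k-1)
        if k == 1:
            p = r.copy()                    # p1 = r0
        else:
            p = r + (rho / rho_prev) * p    # beta = ||r_{k-1}||^2 / ||r_{k-2}||^2
        Ap = A @ p                          # the one matrix-vector product per iteration
        alpha = rho / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap                  # recursive residual update
        rho_prev = rho
        xs.append(x.copy())
    return xs
```

On the 2×2 example of page 3-4 this reaches x(2) = (10, 1) = A⁻¹b after two steps, and one can check numerically that the directions it generates are A-conjugate (piᵀApj ≈ 0 for i ≠ j).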

Conjugate gradient method 3-11


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Analysis of Krylov sequence

minimize   f(x) = (1/2) xᵀAx − bᵀx

optimal value

   f(x⋆) = −(1/2) bᵀA⁻¹b = −(1/2) ‖x⋆‖_A²

suboptimality at x

   f(x) − f⋆ = (1/2) ‖x − x⋆‖_A²

relative error measure

   τ = (f(x) − f⋆) / (f(0) − f⋆) = ‖x − x⋆‖_A² / ‖x⋆‖_A²

here, ‖u‖_A = (uᵀAu)^(1/2) is the A-weighted norm
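
For the numerical sketches added in these notes it is convenient to evaluate this error measure directly (a small helper; `xstar` denotes the exact solution A⁻¹b obtained from a dense solve):

```python
import numpy as np

def tau(x, xstar, A):
    """Relative error (f(x) - f*) / (f(0) - f*) = ||x - x*||_A^2 / ||x*||_A^2."""
    e = x - xstar
    return (e @ A @ e) / (xstar @ A @ xstar)
```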

Conjugate gradient method 3-12


error after k steps in the Krylov sequence

• x(k) ∈ Kk = span{b, Ab, A²b, . . . , Aᵏ⁻¹b}, so it can be expressed as

     x(k) = γ1b + γ2Ab + · · · + γkAᵏ⁻¹b = p(A)b

  where p(s) = γ1 + γ2s + · · · + γksᵏ⁻¹ is a polynomial of degree k − 1 or less

• x(k) minimizes f(x) over Kk; hence

     2(f(x(k)) − f⋆) = inf_{x ∈ Kk} ‖x − x⋆‖_A²

                     = inf_{deg p < k} ‖(p(A) − A⁻¹)b‖_A²

we now use the eigenvalue decomposition of A to bound this quantity

Conjugate gradient method 3-13


simplification using eigenvalue decomposition of A
   A = QΛQᵀ = ∑_{i=1}^n λi qi qiᵀ      (QᵀQ = I,  Λ = diag(λ1, . . . , λn))

with b̄ = Qᵀb, the error expression simplifies to


   ‖(p(A) − A⁻¹)b‖_A² = ‖(p(Λ) − Λ⁻¹)b̄‖_Λ² = ∑_{i=1}^n (λi p(λi) − 1)² b̄i² / λi

hence (with q(λ) = 1 − λp(λ), so that deg q ≤ k and q(0) = 1)

   2(f(x(k)) − f⋆) = inf_{deg p < k} ∑_{i=1}^n (λi p(λi) − 1)² b̄i² / λi

                   = inf_{deg q ≤ k, q(0)=1} ∑_{i=1}^n q(λi)² b̄i² / λi

Conjugate gradient method 3-14


bounds on error

• absolute error
     f(x(k)) − f⋆ ≤ ( ∑_{i=1}^n b̄i²/(2λi) ) inf_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

                  = (1/2) ‖x⋆‖_A² inf_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

  (equality follows from ∑_i b̄i²/λi = bᵀA⁻¹b = ‖x⋆‖_A²)

• relative error

     τk = ‖x(k) − x⋆‖_A² / ‖x⋆‖_A² ≤ min_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

Conjugate gradient method 3-15


Convergence rate and spectrum of A

• if A has k distinct eigenvalues γ1, . . . , γk , CG terminates in k steps

     q(λ) = ((−1)ᵏ / (γ1 · · · γk)) (λ − γ1) · · · (λ − γk)

has degree k, q(0) = 1, q(λi) = 0 for all i; therefore τk = 0

• if eigenvalues are clustered in k groups, then τk is small


can find q(λ) of degree k, with q(0) = 1, that is small on spectrum

• if x⋆ is a linear combination of k eigenvectors, termination in k steps


take q of degree k with q(λi) = 0 where b̄i ≠ 0; then

     ∑_{i=1}^n q(λi)² b̄i² / λi = 0
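
A numerical illustration of the first property (a sketch reusing the `cg` and `tau` helpers added on pages 3-11 and 3-12 above; the matrix is constructed to have exactly 3 distinct eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
eigs = rng.choice([1.0, 4.0, 9.0], size=n)         # only 3 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)

for k, xk in enumerate(cg(A, b, eps=1e-12, maxiter=6)):
    print(k, tau(xk, xstar, A))    # tau_k drops to rounding-error level at k = 3
```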

Conjugate gradient method 3-16


other bounds (without proof)

• in terms of condition number κ = λmax/λmin

     τk ≤ 2 ( (√κ − 1) / (√κ + 1) )ᵏ

derived by taking for q a Chebyshev polynomial on [λmin, λmax]

• in terms of sorted eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn


     τk ≤ ( (λk − λn) / (λk + λn) )²

derived by taking q with roots at λ1, . . . , λk−1 and (λk + λn)/2
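
A quick check of the condition-number bound (again a sketch reusing the `cg` and `tau` helpers above; the spectrum is spread uniformly over [1, 100], so κ = 100):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
eigs = np.linspace(1.0, 100.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)
kappa = eigs[-1] / eigs[0]

for k, xk in enumerate(cg(A, b, eps=1e-10, maxiter=30)):
    bound = 2 * ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** k
    if k % 5 == 0:
        print(f"k={k:2d}   tau_k={tau(xk, xstar, A):9.2e}   bound={bound:9.2e}")
```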

Conjugate gradient method 3-17


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Conjugate gradient method as iterative method

in exact arithmetic

• CG was originally proposed as a direct (non-iterative) method


• in theory, convergence in at most n steps

in practice

• due to rounding errors, CG method can take ≫ n steps (or fail)


• CG is now used as an iterative method
• with luck (good spectrum of A), good approximation in ≪ n steps
• attractive if matrix-vector products are inexpensive

Conjugate gradient method 3-18


Preconditioned conjugate gradient algorithm

preconditioner

• apply CG after the linear change of coordinates x = T y, det T ≠ 0


• use CG to solve TᵀAT y = Tᵀb; then set x⋆ = T y⋆
• T or M = TTᵀ is called the preconditioner

implementation

• in a naive implementation, each iteration requires multiplications by T and Tᵀ
  (and A); also need to compute x⋆ = T y⋆ at the end
• can re-arrange the computation so each iteration requires one multiplication by
  M (and A), and no final computation of x⋆ = T y⋆

called preconditioned conjugate gradient (PCG) algorithm

Conjugate gradient method 3-19


Choice of preconditioner

• if the spectrum of TᵀAT is clustered, PCG converges fast


• extreme case: M = A⁻¹
• trade-off between enhanced convergence, cost of multiplying with M

examples

• diagonal M = diag(1/A11, . . . , 1/Ann)


• incomplete or approximate Cholesky factorization

best preconditioners are application-dependent
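
The re-arranged computation mentioned on the previous page is the standard PCG iteration; a sketch is given below (M is applied as a multiplication, consistent with the M ≈ A⁻¹ convention above, with the diagonal preconditioner as the usage example):

```python
import numpy as np

def pcg(A, b, apply_M, eps=1e-8, maxiter=None):
    """Preconditioned CG for Ax = b; apply_M(r) applies M ~ A^{-1} to a vector.

    Each iteration uses one product with A and one application of M.
    """
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = b.astype(float)
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for k in range(maxiter):
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = apply_M(r)                     # the one application of M per iteration
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# usage with the diagonal preconditioner M = diag(1/A11, ..., 1/Ann):
#   x = pcg(A, b, lambda r: r / np.diag(A))
```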

Conjugate gradient method 3-20


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Applications in optimization

nonlinear conjugate gradient methods

• extend linear CG method to nonquadratic functions


• local convergence similar to linear CG
• limited global convergence theory

inexact and truncated Newton methods

• use conjugate gradient method to compute (approximate) Newton step


• less reliable than exact Newton methods, but handle very large problems

Conjugate gradient method 3-21


Nonlinear conjugate gradient

minimize f (x)

(f convex and differentiable)

modifications needed to extend linear CG algorithm of page 3-11

• replace rk = b − Ax(k) with −∇f (x(k))


• determine α by line search

Conjugate gradient method 3-22


Fletcher-Reeves CG algorithm

CG algorithm of page 3-11 modified to minimize non-quadratic convex f

given x(0)

for k = 1, 2, . . .
1. return x(k−1) if ‖∇f(x(k−1))‖₂ ≤ ε
2. if k = 1, p1 = −∇f (x(0)); else

   pk = −∇f(x(k−1)) + βpk−1    where   β = ‖∇f(x(k−1))‖₂² / ‖∇f(x(k−2))‖₂²

3. update x(k) = x(k−1) + αpk where α = argmin_t f(x(k−1) + tpk)
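
A NumPy sketch of this algorithm (the slides assume an exact line search; here a simple backtracking line search stands in for it, which is an assumption of the sketch rather than part of the algorithm above; the optional `restart` argument anticipates the restarting remark on the next page):

```python
import numpy as np

def fletcher_reeves(f, grad, x0, eps=1e-6, maxiter=1000, restart=None):
    """Fletcher-Reeves nonlinear CG sketch for smooth convex f."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g
    for k in range(1, maxiter + 1):
        if np.linalg.norm(g) <= eps:
            break
        if g @ p >= 0:
            p = -g                          # safeguard: fall back to a gradient step
        # backtracking line search along p (stand-in for the exact line search)
        t, fx = 1.0, f(x)
        while f(x + t * p) > fx + 1e-4 * t * (g @ p):
            t *= 0.5
        x = x + t * p
        g_new = grad(x)
        if restart is not None and k % restart == 0:
            p = -g_new                      # periodic restart: gradient step
        else:
            beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves beta
            p = -g_new + beta * p
        g = g_new
    return x
```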

Conjugate gradient method 3-23


some observations

• first iteration is a gradient step; practical implementations restart the
  algorithm by taking a gradient step, for example, every n iterations
• update is gradient step with momentum term

x(k) = x(k−1) − αk ∇f (x(k−1)) + βk (x(k−1) − x(k−2))

• with exact line search, reduces to linear CG for quadratic f

line search

• exact line search in step 3 implies ∇f(x(k))ᵀpk = 0


• therefore in step 2, pk is a descent direction at x(k−1):

     ∇f(x(k−1))ᵀpk = −‖∇f(x(k−1))‖₂² < 0

Conjugate gradient method 3-24


Variations

Polak-Ribière: in step 2, compute β from

   β = ∇f(x(k−1))ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) / ‖∇f(x(k−2))‖₂²

Hestenes-Stiefel

   β = ∇f(x(k−1))ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) / ( pk−1ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) )

formulas are equivalent for quadratic f and exact line search
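
These formulas drop into the `fletcher_reeves` sketch above by replacing its `beta` line (hypothetical helper functions, written for the variable names used there: `g_new` is the newer gradient, `g` the older one, `p` the previous direction):

```python
def beta_fletcher_reeves(g_new, g, p):
    return (g_new @ g_new) / (g @ g)

def beta_polak_ribiere(g_new, g, p):
    return g_new @ (g_new - g) / (g @ g)

def beta_hestenes_stiefel(g_new, g, p):
    return g_new @ (g_new - g) / (p @ (g_new - g))
```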

Conjugate gradient method 3-25


Interpretation as restarted BFGS method

BFGS update (page 2-5) with Hk−1 = I:

   Hk⁻¹ = I + (1 + yᵀy/(sᵀy)) ssᵀ/(yᵀs) − (ysᵀ + syᵀ)/(yᵀs)

where y = ∇f(x(k)) − ∇f(x(k−1)), s = x(k) − x(k−1)

• ∇f(x(k))ᵀs = 0 if x(k) is determined by exact line search


• quasi-Newton step in iteration k is

     −Hk⁻¹∇f(x(k)) = −∇f(x(k)) + (yᵀ∇f(x(k)) / (yᵀs)) s

  (the ssᵀ and ysᵀ terms vanish since sᵀ∇f(x(k)) = 0, leaving only the syᵀ term)

this is the Hestenes-Stiefel update

nonlinear CG can be interpreted as L-BFGS with m = 1

Conjugate gradient method 3-26


References

• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chap. 10

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chap. 5

• S. Boyd, lecture notes for EE364b, Convex Optimization II

Conjugate gradient method 3-27
