
EE236C (Spring 2011-12)

3. Conjugate gradient method

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method

3-1
Unconstrained quadratic minimization

minimize   f(x) = (1/2) xᵀAx − bᵀx

with A ∈ Sⁿ₊₊

• equivalent to solving Ax = b
• residual r = b − Ax is negative gradient at x: r = −∇f (x)

conjugate gradient method

• invented by Hestenes and Stiefel around 1951


• the most widely used iterative method for solving Ax = b, with A ≻ 0
• can be extended to non-quadratic unconstrained minimization

Conjugate gradient method 3-2


Krylov subspaces

definition: a sequence of nested subspaces (K0 ⊆ K1 ⊆ K2 ⊆ · · · )

K0 = {0},   Kk = span{b, Ab, . . . , Aᵏ⁻¹b} for k ≥ 1

if Kk+1 = Kk , then Ki = Kk for all i ≥ k

key property: A⁻¹b ∈ Kn (even when Kn ≠ Rⁿ)

Cayley-Hamilton theorem: p(A) = Aⁿ + a1Aⁿ⁻¹ + · · · + anI = 0 where

p(λ) = det(λI − A) = λⁿ + a1λⁿ⁻¹ + · · · + an−1λ + an

therefore A⁻¹b = −(1/an) (Aⁿ⁻¹b + a1Aⁿ⁻²b + · · · + an−1b)
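
A quick numerical illustration of this identity (a sketch, not part of the original slides; the matrix and right-hand side are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # a random positive definite A
b = rng.standard_normal(n)

a = np.poly(A)                       # characteristic polynomial coefficients [1, a1, ..., an]
# A^{-1} b = -(1/an) (A^{n-1} b + a1 A^{n-2} b + ... + a_{n-1} b)
Akb = [np.linalg.matrix_power(A, k) @ b for k in range(n)]   # b, Ab, ..., A^{n-1} b
x = -sum(a[i] * Akb[n - 1 - i] for i in range(n)) / a[n]
print(np.allclose(x, np.linalg.solve(A, b)))                 # True: A^{-1} b lies in K_n
```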

Conjugate gradient method 3-3


Krylov sequence

CG algorithm is a recursive method for computing the Krylov sequence

x(k) = argmin_{x ∈ Kk} f(x),    k ≥ 0

• from the previous page, x(n) = A⁻¹b


• we will see there is a simple two-term recurrence

x(k+1) = x(k) − ak ∇f (x(k)) + bk (x(k) − x(k−1))

example

A = [ 1   0 ]      b = [ 10 ]
    [ 0  10 ],         [ 10 ]

(figure: iterates x(0), x(1), x(2) of the Krylov sequence in the (x1, x2)-plane)
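
The Krylov sequence for this example can be computed directly from its definition (a small NumPy sketch, not part of the original slides): minimize f over Kk by writing x = Vk y for a basis Vk of Kk.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 10.0]])
b = np.array([10.0, 10.0])
n = len(b)

# x(k) minimizes f over Kk = span{b, Ab, ..., A^(k-1) b}; with x = Vk y,
# the minimizer of f(Vk y) solves (Vk^T A Vk) y = Vk^T b
xs = [np.zeros(n)]                                   # x(0) = 0
for k in range(1, n + 1):
    Vk = np.column_stack([np.linalg.matrix_power(A, i) @ b for i in range(k)])
    y = np.linalg.solve(Vk.T @ A @ Vk, Vk.T @ b)
    xs.append(Vk @ y)

print(xs[1])        # x(1) ~ (1.82, 1.82), the minimizer on the line spanned by b
print(xs[2])        # x(2) = (10, 1) = A^{-1} b
```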

Conjugate gradient method 3-4


Residuals of Krylov sequence

• optimality conditions in definition of Krylov sequence

x(k) ∈ Kk , ∇f (x(k)) = Ax(k) − b ∈ Kk⊥

• hence, residuals rk = b − Ax(k) satisfy

rk ∈ Kk+1, rk ∈ Kk⊥

(first property follows from b ∈ K1 and x(k) ∈ Kk )

(nonzero) residuals form an orthogonal basis for the Krylov subspaces

Kk = span{r0, r1, . . . , rk−1},    riᵀrj = 0 (i ≠ j)

Conjugate gradient method 3-5


Conjugate directions

the vectors vi = x(i) − x(i−1) satisfy

viᵀAvj = 0 for i ≠ j,    viᵀAvi = viᵀri−1

• directions are ‘conjugate’: orthogonal for the inner product ⟨v, w⟩ = vᵀAw


• in particular, if vi ≠ 0 it is independent of v1, . . . , vi−1
• Kk = span{v1, v2, . . . , vk }
(proofs on next page)

conjugate vectors: defined as pi = vi/αi, scaled so that ri−1ᵀpi = ‖ri−1‖₂², where

    αi = viᵀri−1 / ‖ri−1‖₂² = ‖ri−1‖₂² / (piᵀApi)

Conjugate gradient method 3-6


proof of properties on page 3-6 (assume j < i)

• vjᵀAvi = 0 because vj = x(j) − x(j−1) ∈ Kj ⊆ Ki−1 and

     Avi = A(x(i) − x(i−1)) = −ri + ri−1 ∈ Ki−1⊥


• expression for viᵀAvi follows from the fact that t = 1 minimizes

     f(x(i−1) + tvi) = f(x(i−1)) + (1/2) t² viᵀAvi − t viᵀri−1

• second expression for αi follows from

     αi = viᵀri−1 / ‖ri−1‖₂² = viᵀAvi / ‖ri−1‖₂² = αi² piᵀApi / ‖ri−1‖₂²

Conjugate gradient method 3-7


Recursion for pk

Kk = span{p1, p2, . . . , pk−1, rk−1}, so we can express pk as

p1 = δr0,     pk = δrk−1 + βpk−1 + γ1p1 + · · · + γk−2pk−2   (k > 1)

• γi = 0: take inner products with Api for i ≤ k − 2:

     piᵀApk = piᵀApk−1 = 0,    piᵀArk−1 = 0 (because Api ∈ Ki+1 ⊆ Kk−1)

• δ = 1: take inner product with rk−1 and use rk−1ᵀpk = ‖rk−1‖₂²

• expression for β: take inner product with Apk−1

     β = − pk−1ᵀArk−1 / (pk−1ᵀApk−1)

Conjugate gradient method 3-8


Basic conjugate gradient algorithm

x(0) = 0, r0 = b

for k = 1, 2, . . .
1. return x(k−1) if ‖rk−1‖₂ ≤ ε‖b‖₂
2. if k = 1, pk = r0; otherwise

   pk = rk−1 + βpk−1    where   β = − pk−1ᵀArk−1 / (pk−1ᵀApk−1)

3. compute

   x(k) = x(k−1) + αpk    where   α = ‖rk−1‖₂² / (pkᵀApk)

   and rk = b − Ax(k)

Conjugate gradient method 3-9


Improvements

step 3: compute residual recursively:

rk = rk−1 − αApk

step 2: simplify the expression for β by using

   rk−1 = rk−2 − (‖rk−2‖₂² / (pk−1ᵀApk−1)) Apk−1

taking the inner product with rk−1 (and using rk−1ᵀrk−2 = 0) gives

   β = ‖rk−1‖₂² / ‖rk−2‖₂²

this reduces the number of matrix-vector products with A to one per iteration

Conjugate gradient method 3-10


Conjugate gradient algorithm

x(0) = 0, r0 = b

for k = 1, 2, . . .
1. return x(k−1) if ‖rk−1‖₂ ≤ ε‖b‖₂

2. if k = 1, p1 = r0; else

   pk = rk−1 + (‖rk−1‖₂² / ‖rk−2‖₂²) pk−1

3. compute

   α = ‖rk−1‖₂² / (pkᵀApk),    x(k) = x(k−1) + αpk,    rk = rk−1 − αApk
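
A compact NumPy sketch of this algorithm (an illustrative reference implementation for these notes, not production code; it returns the whole list of iterates so convergence can be inspected in the experiments below):

```python
import numpy as np

def cg(A, b, eps=1e-8, maxiter=None):
    """Conjugate gradient for Ax = b with A symmetric positive definite.

    Follows the algorithm above (page 3-11) and returns the iterates
    x(0), x(1), ...
    """
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)                         # x(0) = 0
    r = b.astype(float)                     # r0 = b
    xs = [x.copy()]
    p, rho_prev = None, None
    for k in range(1, maxiter + 1):
        rho = r @ r                         # ||r_{k-1}||_2^2
        if np.sqrt(rho) <= eps * np.linalg.norm(b):
            break                           # return x(k-1)
        if k == 1:
            p = r.copy()                    # p1 = r0
        else:
            p = r + (rho / rho_prev) * p    # beta = ||r_{k-1}||^2 / ||r_{k-2}||^2
        Ap = A @ p                          # the one matrix-vector product per iteration
        alpha = rho / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap                  # recursive residual update
        rho_prev = rho
        xs.append(x.copy())
    return xs
```

On the 2×2 example of page 3-4 this reaches x(2) = (10, 1) = A⁻¹b after two steps, and one can check numerically that the directions it generates are A-conjugate (piᵀApj ≈ 0 for i ≠ j).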

Conjugate gradient method 3-11


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Analysis of Krylov sequence

minimize   f(x) = (1/2) xᵀAx − bᵀx

optimal value

   f(x⋆) = −(1/2) bᵀA⁻¹b = −(1/2) ‖x⋆‖_A²

suboptimality at x

   f(x) − f⋆ = (1/2) ‖x − x⋆‖_A²

relative error measure

   τ = (f(x) − f⋆) / (f(0) − f⋆) = ‖x − x⋆‖_A² / ‖x⋆‖_A²

here, ‖u‖_A = (uᵀAu)^(1/2) is the A-weighted norm
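
For the numerical sketches added in these notes it is convenient to evaluate this error measure directly (a small helper; `xstar` denotes the exact solution A⁻¹b obtained from a dense solve):

```python
import numpy as np

def tau(x, xstar, A):
    """Relative error (f(x) - f*) / (f(0) - f*) = ||x - x*||_A^2 / ||x*||_A^2."""
    e = x - xstar
    return (e @ A @ e) / (xstar @ A @ xstar)
```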

Conjugate gradient method 3-12


error after k steps in the Krylov sequence

• x(k) ∈ Kk = span{b, Ab, A²b, . . . , Aᵏ⁻¹b}, so it can be expressed as

     x(k) = γ1b + γ2Ab + · · · + γkAᵏ⁻¹b = p(A)b

  where p(s) = γ1 + γ2s + · · · + γksᵏ⁻¹ is a polynomial of degree k − 1 or less

• x(k) minimizes f(x) over Kk; hence

     2(f(x(k)) − f⋆) = inf_{x ∈ Kk} ‖x − x⋆‖_A²

                     = inf_{deg p < k} ‖(p(A) − A⁻¹)b‖_A²

we now use the eigenvalue decomposition of A to bound this quantity

Conjugate gradient method 3-13


simplification using eigenvalue decomposition of A
   A = QΛQᵀ = ∑_{i=1}^n λi qi qiᵀ      (QᵀQ = I,  Λ = diag(λ1, . . . , λn))

with b̄ = Qᵀb, the error expression simplifies to


   ‖(p(A) − A⁻¹)b‖_A² = ‖(p(Λ) − Λ⁻¹)b̄‖_Λ² = ∑_{i=1}^n (λi p(λi) − 1)² b̄i² / λi

hence (with q(λ) = 1 − λp(λ), so that deg q ≤ k and q(0) = 1)

   2(f(x(k)) − f⋆) = inf_{deg p < k} ∑_{i=1}^n (λi p(λi) − 1)² b̄i² / λi

                   = inf_{deg q ≤ k, q(0)=1} ∑_{i=1}^n q(λi)² b̄i² / λi

Conjugate gradient method 3-14


bounds on error

• absolute error
     f(x(k)) − f⋆ ≤ ( ∑_{i=1}^n b̄i²/(2λi) ) inf_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

                  = (1/2) ‖x⋆‖_A² inf_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

  (equality follows from ∑_i b̄i²/λi = bᵀA⁻¹b = ‖x⋆‖_A²)

• relative error

     τk = ‖x(k) − x⋆‖_A² / ‖x⋆‖_A² ≤ min_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λi)²

Conjugate gradient method 3-15


Convergence rate and spectrum of A

• if A has k distinct eigenvalues γ1, . . . , γk , CG terminates in k steps

     q(λ) = ((−1)ᵏ / (γ1 · · · γk)) (λ − γ1) · · · (λ − γk)

has degree k, q(0) = 1, q(λi) = 0 for all i; therefore τk = 0

• if eigenvalues are clustered in k groups, then τk is small


can find q(λ) of degree k, with q(0) = 1, that is small on spectrum

• if x⋆ is a linear combination of k eigenvectors, termination in k steps


take q of degree k with q(λi) = 0 where b̄i ≠ 0; then

     ∑_{i=1}^n q(λi)² b̄i² / λi = 0
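
A numerical illustration of the first property (a sketch reusing the `cg` and `tau` helpers added on pages 3-11 and 3-12 above; the matrix is constructed to have exactly 3 distinct eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
eigs = rng.choice([1.0, 4.0, 9.0], size=n)         # only 3 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)

for k, xk in enumerate(cg(A, b, eps=1e-12, maxiter=6)):
    print(k, tau(xk, xstar, A))    # tau_k drops to rounding-error level at k = 3
```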

Conjugate gradient method 3-16


other bounds (without proof)

• in terms of condition number κ = λmax/λmin

     τk ≤ 2 ( (√κ − 1) / (√κ + 1) )ᵏ

derived by taking for q a Chebyshev polynomial on [λmin, λmax]

• in terms of sorted eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn


     τk ≤ ( (λk − λn) / (λk + λn) )²

derived by taking q with roots at λ1, . . . , λk−1 and (λk + λn)/2
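
A quick check of the condition-number bound (again a sketch reusing the `cg` and `tau` helpers above; the spectrum is spread uniformly over [1, 100], so κ = 100):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
eigs = np.linspace(1.0, 100.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)
kappa = eigs[-1] / eigs[0]

for k, xk in enumerate(cg(A, b, eps=1e-10, maxiter=30)):
    bound = 2 * ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** k
    if k % 5 == 0:
        print(f"k={k:2d}   tau_k={tau(xk, xstar, A):9.2e}   bound={bound:9.2e}")
```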

Conjugate gradient method 3-17


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Conjugate gradient method as iterative method

in exact arithmetic

• CG was originally proposed as a direct (non-iterative) method


• in theory, convergence in at most n steps

in practice

• due to rounding errors, CG method can take ≫ n steps (or fail)


• CG is now used as an iterative method
• with luck (good spectrum of A), good approximation in ≪ n steps
• attractive if matrix-vector products are inexpensive

Conjugate gradient method 3-18


Preconditioned conjugate gradient algorithm

preconditioner

• apply CG after the linear change of coordinates x = T y, det T ≠ 0


• use CG to solve TᵀAT y = Tᵀb; then set x⋆ = T y⋆
• T or M = TTᵀ is called the preconditioner

implementation

• in a naive implementation, each iteration requires multiplications by T and Tᵀ
  (and A); also need to compute x⋆ = T y⋆ at the end
• can re-arrange the computation so each iteration requires one multiplication by
  M (and A), and no final computation of x⋆ = T y⋆

called preconditioned conjugate gradient (PCG) algorithm

Conjugate gradient method 3-19


Choice of preconditioner

• if the spectrum of TᵀAT is clustered, PCG converges fast


• extreme case: M = A⁻¹
• trade-off between enhanced convergence, cost of multiplying with M

examples

• diagonal M = diag(1/A11, . . . , 1/Ann)


• incomplete or approximate Cholesky factorization

best preconditioners are application-dependent
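
The re-arranged computation mentioned on the previous page is the standard PCG iteration; a sketch is given below (M is applied as a multiplication, consistent with the M ≈ A⁻¹ convention above, with the diagonal preconditioner as the usage example):

```python
import numpy as np

def pcg(A, b, apply_M, eps=1e-8, maxiter=None):
    """Preconditioned CG for Ax = b; apply_M(r) applies M ~ A^{-1} to a vector.

    Each iteration uses one product with A and one application of M.
    """
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = b.astype(float)
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for k in range(maxiter):
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = apply_M(r)                     # the one application of M per iteration
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# usage with the diagonal preconditioner M = diag(1/A11, ..., 1/Ann):
#   x = pcg(A, b, lambda r: r / np.diag(A))
```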

Conjugate gradient method 3-20


Outline

• conjugate gradient method for linear equations

• convergence analysis

• conjugate gradient method as iterative method

• nonlinear conjugate gradient method


Applications in optimization

nonlinear conjugate gradient methods

• extend linear CG method to nonquadratic functions


• local convergence similar to linear CG
• limited global convergence theory

inexact and truncated Newton methods

• use conjugate gradient method to compute (approximate) Newton step


• less reliable than exact Newton methods, but handle very large problems

Conjugate gradient method 3-21


Nonlinear conjugate gradient

minimize f (x)

(f convex and differentiable)

modifications needed to extend linear CG algorithm of page 3-11

• replace rk = b − Ax(k) with −∇f (x(k))


• determine α by line search

Conjugate gradient method 3-22


Fletcher-Reeves CG algorithm

CG algorithm of page 3-11 modified to minimize non-quadratic convex f

given x(0)

for k = 1, 2, . . .
1. return x(k−1) if ‖∇f(x(k−1))‖₂ ≤ ε
2. if k = 1, p1 = −∇f (x(0)); else

   pk = −∇f(x(k−1)) + βpk−1    where   β = ‖∇f(x(k−1))‖₂² / ‖∇f(x(k−2))‖₂²

3. update x(k) = x(k−1) + αpk where α = argmin_t f(x(k−1) + tpk)
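
A NumPy sketch of this algorithm (the slides assume an exact line search; here a simple backtracking line search stands in for it, which is an assumption of the sketch rather than part of the algorithm above; the optional `restart` argument anticipates the restarting remark on the next page):

```python
import numpy as np

def fletcher_reeves(f, grad, x0, eps=1e-6, maxiter=1000, restart=None):
    """Fletcher-Reeves nonlinear CG sketch for smooth convex f."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g
    for k in range(1, maxiter + 1):
        if np.linalg.norm(g) <= eps:
            break
        if g @ p >= 0:
            p = -g                          # safeguard: fall back to a gradient step
        # backtracking line search along p (stand-in for the exact line search)
        t, fx = 1.0, f(x)
        while f(x + t * p) > fx + 1e-4 * t * (g @ p):
            t *= 0.5
        x = x + t * p
        g_new = grad(x)
        if restart is not None and k % restart == 0:
            p = -g_new                      # periodic restart: gradient step
        else:
            beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves beta
            p = -g_new + beta * p
        g = g_new
    return x
```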

Conjugate gradient method 3-23


some observations

• first iteration is a gradient step; practical implementations restart the
  algorithm by taking a gradient step, for example, every n iterations
• update is gradient step with momentum term

x(k) = x(k−1) − αk ∇f (x(k−1)) + βk (x(k−1) − x(k−2))

• with exact line search, reduces to linear CG for quadratic f

line search

• exact line search in step 3 implies ∇f(x(k))ᵀpk = 0


• therefore in step 2, pk is a descent direction at x(k−1):

     ∇f(x(k−1))ᵀpk = −‖∇f(x(k−1))‖₂² < 0

Conjugate gradient method 3-24


Variations

Polak-Ribière: in step 2, compute β from

   β = ∇f(x(k−1))ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) / ‖∇f(x(k−2))‖₂²

Hestenes-Stiefel

   β = ∇f(x(k−1))ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) / ( pk−1ᵀ(∇f(x(k−1)) − ∇f(x(k−2))) )

formulas are equivalent for quadratic f and exact line search
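
These formulas drop into the `fletcher_reeves` sketch above by replacing its `beta` line (hypothetical helper functions, written for the variable names used there: `g_new` is the newer gradient, `g` the older one, `p` the previous direction):

```python
def beta_fletcher_reeves(g_new, g, p):
    return (g_new @ g_new) / (g @ g)

def beta_polak_ribiere(g_new, g, p):
    return g_new @ (g_new - g) / (g @ g)

def beta_hestenes_stiefel(g_new, g, p):
    return g_new @ (g_new - g) / (p @ (g_new - g))
```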

Conjugate gradient method 3-25


Interpretation as restarted BFGS method

BFGS update (page 2-5) with Hk−1 = I:

   Hk⁻¹ = I + (1 + yᵀy/(sᵀy)) ssᵀ/(yᵀs) − (ysᵀ + syᵀ)/(yᵀs)

where y = ∇f(x(k)) − ∇f(x(k−1)), s = x(k) − x(k−1)

• ∇f(x(k))ᵀs = 0 if x(k) is determined by exact line search


• quasi-Newton step in iteration k is

     −Hk⁻¹∇f(x(k)) = −∇f(x(k)) + (yᵀ∇f(x(k)) / (yᵀs)) s

  (the ssᵀ and ysᵀ terms vanish since sᵀ∇f(x(k)) = 0, leaving only the syᵀ term)

this is the Hestenes-Stiefel update

nonlinear CG can be interpreted as L-BFGS with m = 1

Conjugate gradient method 3-26


References

• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chap. 10

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chap. 5

• S. Boyd, lecture notes for EE364b, Convex Optimization II

Conjugate gradient method 3-27
