
Lecture Notes

on
Continuous Optimization

Kok Lay Teo


Department of Mathematics
Curtin University of Technology
k.l.teo@curtin.edu.au
and

Song Wang
School of Mathematics & Statistics
The University of Western Australia
swang@maths.uwa.edu.au

Chapter 1

Motivation and Optimization Models


1 Construction of Optimization Models

For any industrial optimization study, the first task is to assess all the factors and their relationships, requirements, and objectives of the problem. Then, mathematical representations are constructed. This is called model construction, which is the first stage towards formulating an optimization problem. Model construction is one of the most important and interesting exercises in optimization studies. If an industrial problem is too complicated, we would need to make as many simplifications as possible in the construction of the model, as long as the answers remain realistic and can be used for the purposes for which they are intended. There are three types of simplifications: (i) assumptions; (ii) approximations; and (iii) estimations. To some extent, these three types of simplifications can overlap. Some further details are as follows:

1. Assumptions define the limitations of the model, which lead to structural simplifications. Their purpose is to achieve model simplicity while retaining adequate realism in the model being constructed.

2. Approximations are mathematical tools, which are used to approximate a complex function by a simplified one so that mathematical simplification can be achieved without much loss in accuracy.

3. Estimations are statistical techniques, which are used to assign values to parameters of the model for the cases when these parameters depend on some stochastic or random elements. Through estimations, we can assign deterministic values to these parameters without invalidating the answers, by ignoring their random nature.

Example 1.1.1. Suppose that we have an agricultural land and assume that its annual yield depends on the amount of rainfall and the quantity of fertilizer applied. Let s denote the amount of rainfall and x the quantity of fertilizer. The yield y is:

    y = f(s, x)

This model is only a simplified one, as y also depends on many other factors. The function f is, in general, nonlinear. Here, we simplify the problem by approximating f by a linear function, i.e.,

    y = as + bx

where a and b are constants. Rainfall is, of course, not known exactly. We can only estimate it by the recorded amount of rainwater from previous years together with meteorological predictions.

Example 1.1.2. A bus company wishes to determine the optimal routing policy for its buses. The following conditions are assumed: (i) the daily passenger carrying capacity of a bus is constant irrespective of traffic conditions; and (ii) the actual kilometres between two points are proportional to the straight-line distance. Let m(i, j) denote the actual kilometres and d(i, j) the straight-line distance. Then, we have

    m(i, j) = k d(i, j)

where k is a constant. This problem is known as the Vehicle Routing Problem.

Example 1.1.3. Consider a student study problem, where the efficiency of her daily study depends on her study method (denoted by m), hours of study (denoted by s), hours of rest (denoted by r), and others such as emotion (denoted by e) and health (denoted by h), etc. Then, the study efficiency can be written as:

    E = f(m, s, r, e, h)

Clearly, e and h are random variables. Thus, they will need to be estimated by using available previous data. Once they are estimated, we can write

    E = f^(m, s, r)

where E is to be maximized with respect to m, s, r. We are also required to impose constraints on the variables. For example, we may impose

    s >= 0, r >= 8 (hours)
    s + r <= 24
    m in {m1, m2, ..., mN}

where {m1, m2, ..., mN} denotes the set of N different study methods. This is an optimization problem with constraints. Since s and r are continuous variables, while m takes values from a discrete set, this problem is called a mixed integer programming problem. This problem is solvable if the function f is known. To determine f, we need to use our experimental data, past experience, etc.

2 General Form of Optimization Problems

    minimize f(x)
    subject to gi(x) = 0, i = 1, 2, ..., m,
               gi(x) >= 0, i = m + 1, ..., M,

where x = [x1, x2, ..., xn]^T in R^n, the superscript T denotes the transpose, and

    f(x) : objective (also known as cost) function;
    gi(x) = 0, i = 1, ..., m : equality constraints;
    gi(x) >= 0, i = m + 1, ..., M : inequality constraints.

The feasible region is the set which consists of all those x = [x1, x2, ..., xn]^T in R^n such that the above constraints are satisfied.
Some examples are given below.

Example 1.2.1.

    minimize f(x1, x2)
    subject to x1 - 2x2 = 0,
               x1 >= 1,
               x2 <= 1.

The feasible region is:

    D = { [x1, x2]^T in R^2 : x2 = x1/2, 1 <= x1 <= 2 }.

Example 1.2.2.

    min f(x1, x2, x3)
    subject to x1 >= 0, x2 >= 0, x3 >= 0,
               1 - (x1 + x2 + x3) >= 0.

The feasible region is the tetrahedron with vertices (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1).

3 Classification of Optimization Problems

For the cost function f(x), it can be: (i) a function of a single variable; (ii) a linear function; (iii) a sum of squares of linear functions; (iv) a quadratic function; (v) a smooth function; or (vi) a non-smooth function.
For the constraint functions gi(x), they can be: (i) no constraints; (ii) in the form of simple bounds; (iii) linear functions; (iv) smooth nonlinear functions; or (v) non-smooth nonlinear functions.
The following is a list of some well-known classes of optimization problems.

One-Dimensional Optimization Problem: f(x) is a function of a single variable, and there are no constraints, i.e., no gi(x).

Linear Programming (LP): f(x) and gi(x), i = 1, ..., M, are all linear.

Quadratic Programming (QP): f(x) is quadratic, and the gi(x) are linear.

Unconstrained Optimization Problems: f(x) is other than linear or quadratic, and there are no constraints, i.e., no gi(x).

Constrained Optimization Problems: f(x) and gi(x) are other than linear or quadratic.

Chapter 2

One-Dimensional Search Techniques


A standard one-dimensional optimization problem can be written as follows:

    min f(x)                (2.0.1)
    subject to a <= x <= b  (2.0.2)

where the objective function f is to be minimized subject to the constraint (2.0.2). If a = -infinity and b = infinity, then we have an unconstrained one-dimensional optimization problem, which can be re-stated as:

    min f(x)            (2.0.3)
    subject to x in R   (2.0.4)

where R denotes the real line.

1 Some Examples

Let us, first of all, consider the following 6 examples.

Example 2.1.1. Consider a situation where a company manufactures some commodity (Q). Let q be the units of the commodity manufactured, let p = p(q) be the sale price (which depends on q), and let c = c(q) be the total cost of production (which also depends on q). Furthermore, let T = T(q) be the total profit made by the company through the sale of the q units of the commodity (Q). Clearly, the total profit = total revenue (pq) - total cost. Thus, we have

    T = pq - c.

For this problem, we are interested in maximizing T.

Example 2.1.2. As another example, suppose that

    c = 40q + 20,000 and p = 160 - 0.01q,

where q is the number of units produced each week and the price and cost are measured in dollars. Then, we have

    Total revenue = pq = 160q - 0.01q^2.

Example 2.1.3. A farmer has hired a farm worker to put up a fence of 100 meters so as to enclose a rectangular region along a river. What is the largest area that can be enclosed? Clearly,

    A = xy.

Thus, the optimization problem is:

    max A = xy
    subject to 2x + y = 100,

or, equivalently,

    max A = x(100 - 2x) = 100x - 2x^2.

Note that

    A = -2(x^2 - 50x + 25^2 - 25^2) = -2(x - 25)^2 + 2 * 25^2 = -2(x - 25)^2 + 1250.

Hence at x = 25 we have A = 1250, i.e.,

    Amax = 1250.

Example 2.1.4. A tracking device located at (-1, 0) is used to track a missile which is descending along the path y^2 = x, (y >= 0). What angles of elevation theta must the device be capable of? Here

    tan(theta) = y / (1 + x).

Clearly, min theta = 0. Thus, we need to find max theta:

    max theta = arctan( y / (1 + x) )
    subject to x - y^2 = 0, y >= 0.

Equivalently,

    max theta = arctan( y / (1 + y^2) )
    subject to y >= 0.

It is easy to show that theta_max is approximately 27 degrees.

Example 2.1.5. A farmer has statistical records showing that if 25 orange trees are planted, each tree would yield 500 oranges (on the average), while the yield would decrease by about 10 oranges per tree for each additional tree planted. How many trees should be planted for maximum total yield?
Let x be the number of trees planted in excess of 25. The yield per tree is then 500 - 10x, and we have

    max y = (500 - 10x)(25 + x)
    subject to x >= 0.

Setting dy/dx = 250 - 20x = 0 gives x_opt = 25/2, so the optimal number of trees is approximately 12 + 25 = 37.
N.B. This problem should really be formulated as a discrete optimization problem, as x only takes the values 1, 2, 3, ...

Example 2.1.6. Tom is out of petrol at the location A in the figure. Tom thinks about how to reach the location B in the shortest possible time. He is capable of hiking 3 mph through the woods and jogging 5 mph on the road. We know that Dist = Rate x Time, or Time = D/R. Hence

    T = AC/Rw + CB/Rr = (1/3)(1 + x^2)^(1/2) + (1/5)(3 - x),

where Rw and Rr denote the rates through the woods and on the road, respectively. The optimization problem is:

    minimize T = (1/3)(1 + x^2)^(1/2) + (1/5)(3 - x)
    subject to 0 < x <= 3.

Note that x in (0, 3]. Setting dT/dx = 0 gives x_opt = 3/4, and T_opt = 52 minutes. If x = 0, then T = 1/3 + 3/5 = 14/15 hours = 56 minutes.

2 Properties of the Objective Function

A function f of a single variable x is said to be convex on [a, b] if for all x, y in [a, b] and for all lambda in [0, 1],

    f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y)

(i.e., the straight line connecting any two points of the graph over [a, b] lies above the graph).

Notes: (i) If the inequality is strict for all lambda in (0, 1), then f is strictly convex.
(ii) If equality holds everywhere, then the graph must be a straight line; a straight line is convex.

A function f of a single variable x is said to be concave on [a, b] if for all x, y in [a, b] and for all lambda in [0, 1],

    f(lambda x + (1 - lambda) y) >= lambda f(x) + (1 - lambda) f(y).

Notes: (i) A straight line is concave. (ii) A straight line is, in fact, both convex and concave.

Stationary point: any point at which

    df(x)/dx = 0.

A function f defined on [a, b] is said to attain a strict local minimum at x0 in [a, b] if there exists an epsilon > 0 such that

    f(x0) < f(x) for all x in N_epsilon(x0) \ {x0},

where N_epsilon(x0) is the epsilon-neighbourhood of x0, defined by

    N_epsilon(x0) = { x in [a, b] : |x - x0| < epsilon }.

A function f defined on [a, b] is said to attain a strict global minimum at x0 in [a, b] if

    f(x0) < f(x) for all x in [a, b] \ {x0}.

A function f defined on [a, b] is said to attain a local minimum at x0 in [a, b] if there exists an epsilon > 0 such that

    f(x0) <= f(x) for all x in N_epsilon(x0).

A function f defined on [a, b] is said to attain a global minimum at x0 in [a, b] if

    f(x0) <= f(x) for all x in [a, b].

Critical point: any point which is a candidate for a global optimum. Obviously, just finding stationary points does not give global optima.

3 Analytic Methods

Theorem 2.1 Let f be a function defined on (a, b) such that f' is also continuous on (a, b). A necessary condition for x* in (a, b) to be a local optimum is that

    df(x*)/dx = 0 (i.e., x* is a stationary point).

Example 2.3.1.

    f(x) = 3x^3 - 36x,
    f'(x) = 9x^2 - 36 = 0  =>  x = +/- 2:

a local maximum (at x = -2) and a local minimum (at x = 2).

Example 2.3.2. f(x) = |x| has a local as well as a global minimum at x = 0. But f'(x) and f''(x) are not defined at x = 0.

Theorem 2.2 Assume that f(x) and its first n derivatives are continuous. Then, f(x) has a local optimum at x* if and only if n is even, where n is the order of the first non-vanishing derivative at x*. The function f(x) has a maximum at x* if

    f^(n)(x*) < 0

and a minimum at x* if

    f^(n)(x*) > 0.

Example 2.3.3.

    f(x) = x^4 - 4x^3 + 6x^2 - 4x + 1 = (x - 1)^4,
    df/dx = 4(x - 1)^3,  d2f/dx2 = 12(x - 1)^2,  d3f/dx3 = 24(x - 1),  d4f/dx4 = 24.

At x* = 1 the first non-vanishing derivative is d4f/dx4 = 24 > 0, so x* = 1 is a minimum, since n = 4 is even.

Example 2.3.4.

    f(x) = (x - 1)^3.

The first non-vanishing derivative at x = 1 is d3f/dx3 = 6, with n odd. The necessary condition for a stationary point is satisfied:

    df/dx |_{x=1} = 0,

but x = 1 is not a local minimum.

Theorem 2.3 (Global minimization lemma) If f(x) is a convex function over a closed interval [a, b], then any local minimum of f(x) in this interval is also the global minimum of f(x) over the interval.

PROOF. Let p in [a, b] be a local minimum. Then,

    f(x) >= f(p),  for all x in [a, b] intersected with N_epsilon(p),

where N_epsilon(p) denotes the epsilon-neighbourhood of p. Suppose p is not a global minimum, i.e., there exists z in [a, b] such that f(z) < f(p). Since f is convex, for lambda in (0, 1) we have

    f(p + lambda(z - p)) - f(p) = f(lambda z + (1 - lambda)p) - f(p)
                               <= lambda f(z) + (1 - lambda)f(p) - f(p)
                               = lambda (f(z) - f(p)) < 0.

Note lambda is arbitrary. We may choose lambda sufficiently small such that p + lambda(z - p) is in N_epsilon(p). This violates the assumption that p is a local minimum.

Example 2.3.5. f(x) = x^2, -1 <= x <= 1. f is convex; df/dx = 2x, d2f/dx2 = 2 > 0, and n = 2, which is even, so x = 0 is a local minimum. But f is convex, so x = 0 is also a global minimum.

Theorem 2.4 The global maximum of a convex function f(x) over a closed interval a <= x <= b is attained either at x = a or at x = b, or at both.
Note that similar results apply to concave functions.

Example 2.3.6. f(x) = x^2, -1 <= x <= 2: maximum at x = 2.
Example 2.3.7. f(x) = x^2, -2 <= x <= 1: maximum at x = -2.
Example 2.3.8. f(x) = x^2, -1 <= x <= 1: maximum at x = -1 and x = 1.

4 Search Methods for General One-Dimensional Functions

Calculus methods are normally not practical, since

(i) extrema often occur at boundaries;
(ii) extrema may occur where derivatives do not exist;
(iii) the solution of f'(x) = 0 often requires numerical methods.

Usually, it is better to search for extrema directly (i.e., evaluating the function at certain selected points and comparing function values).
Question: how much is known about the function? From this, we can determine which method is to be used.

4.1 Function values given only at a set of n points (nothing else known)

Method: compare the function values and take the point which gives rise to the smallest.

4.2 Function is piecewise continuous, defined on a finite interval

Definition 2.1 (Piecewise continuity) A function is piecewise continuous on [a, b] if the interval can be divided into a finite number of non-overlapping subintervals over each of which the function is continuous.

Exhaustive Search: divide the interval into n subintervals of length h = (b - a)/n. Evaluate the function at the points

    xm = a + mh,  m = 0, 1, ..., n.

Then, find the point xK at which the function value is the smallest. The larger the value of n, the closer one is likely to get to the true minimum. The interval of uncertainty is defined as

    x_{K+1} - x_{K-1} = 2h.

Random Search: (used if the n for exhaustive search becomes too large to be practical). Select values randomly from the interval [a, b] and compare function values (Monte Carlo method).
Note: neither of the two methods ensures that we are anywhere near the true minimizer.

4.3 Function is unimodal and piecewise continuous on a finite interval

(We may make use of the methods to be discussed below even if we do not know whether the function is unimodal, provided we are happy with a local minimum.)

Definition 2.2 A function f of a single variable x on an interval [a, b] is said to be unimodal if it has a single local minimum.

4.4 Equal Interval Search (dichotomous search)

This is shown in the following figure. On the normalized interval [0, 1], we compute

    f(1/2 + delta/2) and f(1/2 - delta/2).

If f(1/2 + delta/2) < f(1/2 - delta/2), reject [0, 1/2 - delta/2]; the interval of uncertainty is reduced to 1/2 + delta/2. Repeat this procedure by placing two points delta apart at the midpoint of the remaining interval; the interval of uncertainty is reduced to

    (1/2)(1/2 + delta/2) + delta/2 = 1/4 + 3 delta/4 = 1/4 + delta(1 - 1/4).

Repeating this process again, the interval of uncertainty is reduced to

    (1/2)(1/4 + 3 delta/4) + delta/2 = 1/8 + 7 delta/8 = 1/8 + delta(1 - 1/8).

After n repetitions, the optimum is located within an interval of uncertainty of

    2^(-n) + delta(1 - 2^(-n)).
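The following is a minimal sketch of the dichotomous search in Python (the notes do not prescribe an implementation; the values of delta and the stopping tolerance here are arbitrary illustrative choices):

    def dichotomous_search(f, a, b, delta=1e-4, tol=1e-3):
        """Equal-interval (dichotomous) search for a unimodal f on [a, b].
        Two trial points delta apart are placed about the midpoint each step."""
        while b - a > tol:
            mid = 0.5 * (a + b)
            x1, x2 = mid - delta / 2, mid + delta / 2
            if f(x1) < f(x2):
                b = x2          # minimum lies in [a, x2]
            else:
                a = x1          # minimum lies in [x1, b]
        return 0.5 * (a + b)

    # The example used below: min x^2 - 6x + 2 on [0, 10] (exact minimum x* = 3)
    print(dichotomous_search(lambda x: x * x - 6 * x + 2, 0.0, 10.0))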

Example. min f(x) = x^2 - 6x + 2, x in [0, 10], by dichotomous search. The midpoint is x = 5. Choose delta = 0.5.

    f(5 + 0.5/2) = f(5.25) = (5.25)^2 - 6(5.25) + 2 = -1.9375,
    f(5 - 0.5/2) = f(4.75) = -3.9375.

f(4.75) < f(5.25), so the interval of uncertainty reduces to [0, 5.25]. The midpoint of [0, 5.25] is 2.625.

    f(2.625 + 0.25) = f(2.875) = -6.9844,
    f(2.625 - 0.25) = f(2.375) = -6.6094.

f(2.875) < f(2.375), so the interval of uncertainty reduces to [2.375, 5.25]. The midpoint of [2.375, 5.25] is 3.8125.

    f(3.8125 + 0.25) = f(4.0625) = -5.8711,
    f(3.8125 - 0.25) = f(3.5625) = -6.6836.

f(3.5625) < f(4.0625), so the interval of uncertainty is [2.375, 4.0625], the midpoint of which is 3.2188.

    f(3.2188 + 0.25) = f(3.4688) = -6.7800,
    f(3.2188 - 0.25) = f(2.9688) = -6.9990.

f(2.9688) < f(3.4688), so the interval of uncertainty is [2.375, 3.4688]. The exact answer is x* = 3 and fmin = -7.0.

4.5 Unequal Interval Searches

Fibonacci Search Method

This method has the property that, for a given number n of function values, the n points are placed so that the final interval of uncertainty is a minimum. Use the Fibonacci numbers

    F0 = F1 = 1,  Fn = F_{n-1} + F_{n-2} for n >= 2,

i.e., F2 = 2, F3 = 3, F4 = 5, F5 = 8, ...

Stage 1. On the interval [a, b] with L1 = b - a, place two points symmetrically:

    x1 = b - tau2 L1,  x2 = a + tau2 L1,

where

    tau2 = F_{n-2}/Fn = F_{n-2}/(F_{n-1} + F_{n-2}) < 1/2.

If f(x2) < f(x1), then a <= xmin <= x1; if f(x2) > f(x1), then x2 <= xmin <= b. In either case the interval of uncertainty becomes

    L2 = L1 - tau2 L1 = L1 (1 - tau2) = L1 (Fn - F_{n-2})/Fn = L1 F_{n-1}/Fn.

Stage 2. Consider the 2nd step, as shown by the figure. The distance between the two interior points is

    x1 - x2 = L1 - 2 tau2 L1 = L1 (F_{n-1} + F_{n-2} - 2F_{n-2})/Fn = L1 (F_{n-1} - F_{n-2})/Fn = L1 F_{n-3}/Fn,

so we need only to place one new point

    x3 = a + (F_{n-3}/Fn) L1 = tau3 L2,  tau3 = F_{n-3}/F_{n-1},

and compare: if f(x3) > f(x2), the minimum lies to the right of x3; if f(x3) < f(x2), it lies to the left of x2.

General step:

    L_{k+1} = Lk - tau_{k+1} Lk = Lk (1 - tau_{k+1}),  tau_{k+1} = F_{n-k-1}/F_{n-k+1},
            = Lk (F_{n-k+1} - F_{n-k-1})/F_{n-k+1}
            = Lk F_{n-k}/F_{n-k+1}.

Hence

    Lk = (F_{n-k+1}/F_{n-k+2}) L_{k-1} = ... = (F_{n-k+1}/Fn) L1.

Take k = n:

    Ln = (F1/Fn) L1 = L1/Fn.
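A sketch of the Fibonacci search in Python, using the placement tau2 = F_{n-2}/Fn derived above. The coincidence of the final two points at the midpoint (resolved in practice by the dichotomous final step described in the example below) is only indicated in a comment:

    def fib(n):
        F = [1, 1]
        while len(F) <= n:
            F.append(F[-1] + F[-2])
        return F

    def fibonacci_search(f, a, b, n):
        """Fibonacci search on [a, b]: after n evaluations the interval of
        uncertainty has length (b - a)/F_n, up to the final dichotomous step."""
        F = fib(n)
        x1 = a + F[n - 2] / F[n] * (b - a)   # tau_2 * L1 from each end
        x2 = b - F[n - 2] / F[n] * (b - a)
        f1, f2 = f(x1), f(x2)
        for k in range(2, n):
            if f1 < f2:                      # minimum in [a, x2]: reuse x1
                b, x2, f2 = x2, x1, f1
                x1 = a + F[n - k - 1] / F[n - k + 1] * (b - a)
                f1 = f(x1)
            else:                            # minimum in [x1, b]: reuse x2
                a, x1, f1 = x1, x2, f2
                x2 = b - F[n - k - 1] / F[n - k + 1] * (b - a)
                f2 = f(x2)
            # at k = n - 1 the new point coincides with the remaining interior
            # point; in practice it is offset by a small delta (dichotomous step)
        return 0.5 * (a + b)

    print(fibonacci_search(lambda x: x * x - 6 * x + 2, 0.0, 10.0, 5))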

Ratio of reduction:

    L_{n+1}/Ln = Fn/F_{n+1}.

Limit of Fn/F_{n+1}: from F_{n+1} = Fn + F_{n-1}, dividing by F_{n-1} we get

    F_{n+1}/F_{n-1} = Fn/F_{n-1} + 1,  or  (F_{n+1}/Fn)(Fn/F_{n-1}) = Fn/F_{n-1} + 1.

Let n -> infinity, with F_{n+1}/Fn -> t and Fn/F_{n-1} -> t. Then

    t^2 = t + 1,  or  t^2 - t - 1 = 0,
    t = (1 + sqrt(5))/2 = 1.618 (positive root),

so

    Fn/F_{n+1} = 1/t = 0.618.

Example. Minimize f(x) = x^2 - 6x + 2 subject to 0 <= x <= 10, locating the minimum to within 15% of the original interval (10 units), i.e., the final interval of uncertainty is 1.5.

15% of 10: Ln = L1/Fn, so we require

    1/Fn < 15/100  =>  Fn > 100/15 = 6.7  =>  F5 = 8, i.e., n = 5; L1 = 10.

Stage 1:

    tau2 = F_{n-2}/Fn = F3/F5 = 3/8,

so x2 = 0 + (3/8)(10) = 3.75 and x1 = 10 - (3/8)(10) = 6.25:

    f(3.75) = -6.44,  f(6.25) = 3.56.

Stage 2: f(3.75) < f(6.25), so 0 <= xmin < 6.25. Place the next point symmetrically in the remaining interval (tau3 = F_{n-3}/F_{n-1} = F2/F4 = 2/5, giving the point (2/5)(6.25) = 2.5):

    f(2.5) = -6.75 < f(3.75) = -6.44  =>  0 <= xmin < 3.75.

Stage 3: the point symmetric to 2.5 in [0, 3.75] is 1.25:

    f(1.25) = -3.9375 > f(2.5) = -6.75  =>  1.25 < xmin < 3.75.

Stage 4: the remaining interior point, 2.5, is at the midpoint of [1.25, 3.75], so the new point is superimposed at 2.5 (half the final interval); the final step is dichotomous:

    f(2.5) = -6.75 > f(2.5001)  =>  2.5 < xmin < 3.75.

Note: the final interval of uncertainty, 1.25, is 12.5% of 10.


We have 3 kinds of problems:

(1) Determine the position of the minimum to within an interval epsilon.
(2) Determine the minimum value of the function to within a relative error epsilon.
(3) Terminate the search if either (1) or (2) is satisfied.

For (1): if Fibonacci search is used, we can choose n such that

    Ln = L1/Fn <= epsilon.

For (2) and (3): we cannot choose n in advance; it is better to use the Golden Section search.

Golden Section Search

Demand:
(1) only one new function evaluation per step;
(2) the same ratio of interval reduction at each step.

Remember:
(a) we need 3 points to locate a minimum in an interval;
(b) we need 4 points (2 end points plus 2 internal points) to reduce the size of the interval of uncertainty; 3 points (i.e., one end point and the 2 internal points) must be reused at the next stage.

Let

    L_{k+2}/L_{k+1} = L_{k+1}/Lk = tau = ratio of interval reduction per step.

With points x1 < x2 < x3 < x4 (x1, x4 the end points),

    L_{k+1} = x3 - x1 = x4 - x2 = tau Lk,
    x2 - x1 = x4 - x3 = Lk - L_{k+1} = Lk - tau Lk = (1 - tau) Lk,
    x3 - x2 = (x3 - x1) - (x2 - x1) = tau Lk - (1 - tau) Lk = (2 tau - 1) Lk.

Also

    x3 - x2 = L_{k+1} - L_{k+2} = L_{k+1} - tau L_{k+1} = (1 - tau) L_{k+1} = tau (1 - tau) Lk.

Equating the two expressions, 2 tau - 1 = tau(1 - tau), i.e.,

    tau^2 + tau - 1 = 0,
    tau = (-1 + sqrt(5))/2 = 0.6180
        = ratio of the Golden Section.

Note: The new interval of uncertainty is equal to 0.618 of the previous one.
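A minimal Python sketch of the golden-section search just derived (the stopping tolerance is an illustrative choice):

    def golden_section(f, a, b, tol=1e-5):
        """Golden-section search: each step reuses one interior point and adds
        one new evaluation; the uncertainty shrinks by tau = 0.618 per step."""
        tau = (5 ** 0.5 - 1) / 2          # 0.6180..., the golden-section ratio
        x1, x2 = b - tau * (b - a), a + tau * (b - a)
        f1, f2 = f(x1), f(x2)
        while b - a > tol:
            if f1 < f2:                   # minimum in [a, x2]
                b, x2, f2 = x2, x1, f1
                x1 = b - tau * (b - a)
                f1 = f(x1)
            else:                         # minimum in [x1, b]
                a, x1, f1 = x1, x2, f2
                x2 = a + tau * (b - a)
                f2 = f(x2)
        return 0.5 * (a + b)

    # The example below: min x^2 on [-5, 15]
    print(golden_section(lambda x: x * x, -5.0, 15.0))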
COMPARISON

(1) Two-point equal interval search: interval reduction per step is slightly worse than 50%; evaluations per step = 2.

(2) n-step Fibonacci search: reduction at the kth step is F_{n-k+1}/F_{n-k+2}, and F_{m-1}/Fm -> 0.618 as m -> infinity; total reduction to L1/Fn after n function evaluations (n steps, i.e., k = n).

(3) Golden Section search: interval reduction to 0.618 of the previous one per extra function evaluation (excluding the 2 at the beginning).

Ratio of (Golden Section final interval)/(Fibonacci final interval) = (0.618)^n Fn, which is approximately 1.17 for n > 5: Fibonacci is approximately 17% better in interval reduction. The disadvantage of Fibonacci is that we must choose n, the number of function evaluations, before starting the search.

Example. min f(x) = x^2, x in [-5, 15], with epsilon = 1.5. With end points x1 = -5, x4 = 15:

    x3 = x1 + 0.618 x 20 = -5 + 12.36 = 7.36,
    x2 = x4 - 0.618 x 20 = 15 - 12.36 = 2.64,

    f(x2) = (2.64)^2 = 6.97,  f(x3) = (7.36)^2 = 54.2.

f(x2) < f(x3), so xmin in [-5, 7.36].

    x5 = x3 - 0.618 x 12.36 = 7.36 - 7.638 = -0.278,
    f(x5) = (-0.278)^2 = 0.0773 < f(x2), so xmin in [-5, 2.64].

    x6 = x2 - 0.618 x 7.64 = 2.64 - 4.72 = -2.08,
    f(x6) = (-2.08)^2 = 4.326;  f(x5) = 0.078 < f(x6), so xmin in [-2.08, 2.64].

    x7 = x6 + 0.618 x 4.72 = -2.08 + 2.917 = 0.84,
    f(x7) = (0.84)^2 = 0.71;  f(x5) = 0.078 < f(x7), so xmin in [-2.08, 0.84].

    x8 = x7 - 0.618 x 2.917 = 0.84 - 1.803 = -0.96,
    f(x8) = (-0.96)^2 = 0.92;  f(x5) < f(x8), so xmin in [-0.96, 0.84].

    x9 = x8 + 0.618 x 1.803 = -0.96 + 1.114 = 0.16,
    f(x9) = (0.16)^2 = 0.026;  f(x9) < f(x5), so xmin in [-0.278, 0.84].

The interval of uncertainty is now 1.114 < 1.5 = epsilon, so we stop.

5 Search Methods for Smooth Functions

5.1 Newton's Method

Suppose f(x) is smooth, so that f' and f'' exist. At any xk in [a, b], f(x) can be approximated locally by the following truncated Taylor expansion of f at xk:

    q(x) = f(xk) + f'(xk)(x - xk) + (1/2) f''(xk)(x - xk)^2.

Now we can find the minimum of q(x) from q'(x) = 0:

    q'(x) = f'(xk) + (1/2) f''(xk) * 2(x - xk) = 0,
    x_{k+1} := x = xk - f'(xk)/f''(xk).

This process can be continued until |x_{k+1} - xk| < epsilon (a prescribed tolerance). This method can also be viewed as the application of Newton's method (for solving nonlinear algebraic equations) to the equation

    g(x) := f'(x) = 0,
    x_{k+1} = xk - g(xk)/g'(xk).

Theorem 2.5 (Convergence) Let the function g(= f') have a continuous second derivative, and let x* satisfy g(x*) = 0, g'(x*) != 0. Then, provided x0 is sufficiently close to x*, the sequence {xk} with

    x_{k+1} = xk - g(xk)/g'(xk),  k = 0, 1, ...    (2.5.5)

converges to x* with an order of convergence of at least two.

PROOF. Let k1 and k2 be such that

    |g''(xi)| < k1,  |g'(xi)| > k2,  for all xi in N(x*, delta).

Since g(x*) = 0, we have

    x_{k+1} - x* = xk - x* - (g(xk) - g(x*))/g'(xk)
                 = -( g(x*) - [g(xk) + g'(xk)(x* - xk)] ) / g'(xk)
                 = (1/2) (g''(xi)/g'(xk)) (xk - x*)^2,

so

    |x_{k+1} - x*| <= (k1 / (2 k2)) |xk - x*|^2.

So, if (k1/(2 k2)) |xk - x*| < gamma < 1, we have

    |x_{k+1} - x*| < gamma |xk - x*|  (0 < gamma < 1).

The mapping from xk to x_{k+1} is a contraction, so xk - x* -> 0, i.e., xk -> x*. From (2.5.5) we see that the convergence rate is of second order.

Problem. Write a small C or Matlab program to implement Newton's method for an arbitrary f(x), and test your code with

    f(x) = x^4 - 4x + 1,  x in [0, 1.5].
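As a sketch of what such a program might look like (Python rather than the C/Matlab asked for; the starting point 1.2 and tolerance are arbitrary choices, and x0 should be taken where f''(x0) > 0):

    def newton_1d(fp, fpp, x0, tol=1e-8, max_iter=50):
        """Newton's method for min f: x_{k+1} = x_k - f'(x_k)/f''(x_k),
        i.e. root finding applied to g(x) = f'(x) = 0."""
        x = x0
        for _ in range(max_iter):
            step = fp(x) / fpp(x)
            x -= step
            if abs(step) < tol:
                break
        return x

    # Test problem from the notes: f(x) = x^4 - 4x + 1, minimum at x = 1
    fp  = lambda x: 4 * x ** 3 - 4      # f'(x)
    fpp = lambda x: 12 * x ** 2         # f''(x)
    print(newton_1d(fp, fpp, 1.2))      # converges to 1.0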

5.2 Method of False Position (Secant)

Newton's method uses information at only xk and thus needs f(xk), f'(xk) and f''(xk). If we use more points, then less information is required at each point. We now replace f''(xk) in Newton's formula by the difference quotient

    ( f'(x_{k-1}) - f'(xk) ) / ( x_{k-1} - xk ).

Then we have the method of false position:

    q(x) = f(xk) + f'(xk)(x - xk) + [ (f'(x_{k-1}) - f'(xk)) / (x_{k-1} - xk) ] (x - xk)^2 / 2.

Solving q'(x_{k+1}) = 0 gives

    x_{k+1} = xk - f'(xk) (x_{k-1} - xk) / ( f'(x_{k-1}) - f'(xk) ).

Let g(x) = f'(x). The secant method solves g(x) = 0, which gives

    x_{k+1} = xk - g(xk) (xk - x_{k-1}) / ( g(xk) - g(x_{k-1}) ).

Convergence of the MFP (Secant) Method

Theorem 2.6 Let g(= f') have a continuous second derivative and suppose x* is such that g(x*) = 0, g'(x*) != 0. Then, for x0 sufficiently close to x*, the sequence

    x_{k+1} = xk - g(xk) (xk - x_{k-1}) / ( g(xk) - g(x_{k-1}) ),  k = 0, 1, ...

converges to x* with order approximately 1.618.

PROOF. Similar to that for Newton's method. Let g[x, y] denote the divided difference defined by

    g[x, y] = ( g(y) - g(x) ) / ( y - x ),  ( g[x, x] = g'(x) ).

Then,

    x_{k+1} - x* = xk - x* - g(xk) (xk - x_{k-1}) / ( g(xk) - g(x_{k-1}) )
                 = (xk - x*) - ( (g(xk) - g(x*)) / (xk - x*) ) (xk - x*) / g[x_{k-1}, xk]
                 = (xk - x*) ( g[x_{k-1}, xk] - g[xk, x*] ) / g[x_{k-1}, xk].

Using the Taylor expansion of g,

    g[x_{k-1}, xk] = g'(x*) + O(e_{k-1} + e_k),  e_k = xk - x*,

and similarly

    g[x_{k-1}, xk] - g[xk, x*] = (1/2) g''(x*)(x_{k-1} - x*) + O(e_{k-1}^2),

so that

    |x_{k+1} - x*| = |M (xk - x*)(x_{k-1} - x*)| + higher-order terms,

where M = g''(x*) / (2 g'(x*)). Let delta_k = M(xk - x*). We have

    delta_{k+1} = M(x_{k+1} - x*) = M^2 (xk - x*)(x_{k-1} - x*) = delta_k delta_{k-1}.

Taking the logarithm and letting yk = log(delta_k), we have

    y_{k+1} = yk + y_{k-1},

which is the Fibonacci difference equation. From the previous discussion we know that

    y_{k+1}/yk -> 1/0.618 = 1.618,

so log(delta_{k+1}) tends to 1.618 log(delta_k), i.e., delta_{k+1} behaves like delta_k^{1.618}, and hence

    |x_{k+1} - x*| ~ |xk - x*|^{1.618},

i.e., the order of convergence is approximately 1.618.
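A minimal Python sketch of the secant iteration above (the two starting points bracketing x* = 1 are an arbitrary choice):

    def secant_min(fp, x0, x1, tol=1e-10, max_iter=100):
        """Method of false position (secant) for min f: solve g = f' = 0 via
        x_{k+1} = x_k - g(x_k)(x_k - x_{k-1})/(g(x_k) - g(x_{k-1}));
        order of convergence about 1.618, no second derivative needed."""
        g0, g1 = fp(x0), fp(x1)
        for _ in range(max_iter):
            x2 = x1 - g1 * (x1 - x0) / (g1 - g0)
            x0, g0, x1, g1 = x1, g1, x2, fp(x2)
            if abs(x1 - x0) < tol:
                break
        return x1

    # f(x) = x^4 - 4x + 1, so f'(x) = 4x^3 - 4; the minimizer is x* = 1
    print(secant_min(lambda x: 4 * x ** 3 - 4, 0.5, 1.5))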

6 Search Methods for Continuous Functions

6.1 Quadratic Fit (Powell's Quadratic Fit)

Start with three points x0, x1, x2 such that

    f(x0) > f(x1) < f(x2)  =>  xmin in (x0, x2).

Fit a quadratic through these three points. Choose a, b, c such that

    a x0^2 + b x0 + c = f(x0) = f0,    (2.6.6)
    a x1^2 + b x1 + c = f(x1) = f1,    (2.6.7)
    a x2^2 + b x2 + c = f(x2) = f2.    (2.6.8)

Estimate the minimum of f(x) as the minimum of the quadratic y = a x^2 + b x + c:

    dy/dx = 2ax + b = 0  =>  x = -b/(2a).

Solve Eqs. (2.6.6)-(2.6.8). Subtracting (2.6.6) from (2.6.7) and (2.6.8),

    a(x1^2 - x0^2) + b(x1 - x0) = f1 - f0,    (2.6.9)
    a(x2^2 - x0^2) + b(x2 - x0) = f2 - f0.    (2.6.10)

Dividing each by a,

    (b/a)(x1 - x0) = (f1 - f0)/a - (x1^2 - x0^2),
    (b/a)(x2 - x0) = (f2 - f0)/a - (x2^2 - x0^2),    (2.6.11)

and eliminating a between these two equations gives

    b/a = - [ f0(x1^2 - x2^2) + f1(x2^2 - x0^2) + f2(x0^2 - x1^2) ] / [ f0(x1 - x2) + f1(x2 - x0) + f2(x0 - x1) ],

so that the turning point is

    xm = -b/(2a) = (1/2) [ f0(x1^2 - x2^2) + f1(x2^2 - x0^2) + f2(x0^2 - x1^2) ] / [ f0(x1 - x2) + f1(x2 - x0) + f2(x0 - x1) ].    (2.6.12)

We need to test whether the new point is a minimum or not. That is,

    d2y/dx2 |_{x=xm} = 2a > 0.

Recall

    a(x1^2 - x0^2) + b(x1 - x0) = f1 - f0,    (2.6.13)
    a(x2^2 - x0^2) + b(x2 - x0) = f2 - f0.    (2.6.14)

Computing (2.6.13) x (x2 - x0) - (2.6.14) x (x1 - x0) and simplifying gives

    a = - [ f0(x1 - x2) + f1(x2 - x0) + f2(x0 - x1) ] / [ (x0 - x1)(x1 - x2)(x2 - x0) ].

For x0 < x1 < x2 the denominator is positive; therefore, the condition for xm to be a minimum is

    f0(x1 - x2) + f1(x2 - x0) + f2(x0 - x1) < 0.

Algorithm
0. Set epsilon > 0.
1. Find 3 points x0, x1, x2 such that f(x0) > f(x1) < f(x2).
2. Find the turning point x = xm of the quadratic fit through the 3 points.
3. Compute f(xm). Let x_bar in {x0, x1, x2, xm} be such that f(x_bar) is the minimum among these four points.
4. If xm is within the small prescribed distance epsilon of x1, take min{f(xm), f(x1)} as the required minimum. Otherwise go to step 5, 6, 7 or 8 as appropriate.
5. If x_bar = x1 and x1 in [x0, xm]: discard x2 and replace it by xm. Return to step 2.
6. If x_bar = x1 and x1 in [xm, x2]: discard x0 and replace it by xm. Return to step 2.
7. If x_bar = xm and xm in [x0, x1]: discard x2 and re-name the remaining 3 points as x0 := x0, x1 := xm, x2 := x1. Return to step 2.
8. If x_bar = xm and xm in [x1, x2]: discard x0 and re-name the remaining 3 points as x0 := x1, x1 := xm, x2 := x2. Return to step 2.
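A Python sketch of the algorithm above. The re-bracketing logic of steps 5-8 is condensed here into a sort around the best interior point, which is equivalent for a valid bracket; the tolerance is illustrative:

    def quadratic_fit_min(f, x0, x1, x2, eps=1e-6, max_iter=100):
        """Powell's quadratic fit: given a bracket f(x0) > f(x1) < f(x2), fit a
        parabola through the three points and move to its turning point (2.6.12)."""
        for _ in range(max_iter):
            f0, f1, f2 = f(x0), f(x1), f(x2)
            num = f0*(x1**2 - x2**2) + f1*(x2**2 - x0**2) + f2*(x0**2 - x1**2)
            den = f0*(x1 - x2) + f1*(x2 - x0) + f2*(x0 - x1)
            xm = 0.5 * num / den
            if abs(xm - x1) < eps:
                return xm if f(xm) < f1 else x1
            # keep the three points that still bracket the minimum and repeat
            pts = sorted([x0, x1, x2, xm])
            vals = [f(p) for p in pts]
            i = vals.index(min(vals))          # best (interior) point
            x0, x1, x2 = pts[i - 1], pts[i], pts[i + 1]
        return x1

    # The worked example below: f(x) = x^4 - 4x + 1 with bracket 0.5, 1, 1.5
    print(quadratic_fit_min(lambda x: x**4 - 4*x + 1, 0.5, 1.0, 1.5))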

Example. f(x) = x^4 - 4x + 1. Choose x0 = 0, h = 0.5, epsilon = 0.02. Tabulating:

    x     f(x)
    0      1
    0.5   -0.9375
    1     -2        <- bracket minimum
    1.5    0.0625

so we take x0 = 0.5, x1 = 1, x2 = 1.5. Then, by (2.6.12),

    xm = (1/2) [ -0.9375(1^2 - 1.5^2) + (-2)(1.5^2 - 0.5^2) + 0.0625(0.5^2 - 1^2) ]
              / [ -0.9375(1 - 1.5) + (-2)(1.5 - 0.5) + 0.0625(0.5 - 1) ]
       = (1/2) [ 0.9375 x 1.25 - 2 x 2 - 0.0625 x 0.75 ] / [ 0.9375 x 0.5 - 2 - 0.0625 x 0.5 ]
       = 0.92,

    f(xm) = -1.9636071.

Check the conditions for a minimum; here x_bar = x1 = 1 and x1 lies in [xm, x2], so we discard x = 0.5 (even though f(0.5) is part of the bracket) and replace it by xm:

    x0 = 0.92,  x1 = 1,  x2 = 1.5,
    f0 = -1.9636071,  f1 = -2,  f2 = 0.0625.

Using the same method we have

    xm = (1/2) [ -1.9636071(1^2 - 1.5^2) + (-2)(1.5^2 - 0.92^2) + 0.0625(0.92^2 - 1^2) ]
              / [ -1.9636071(1 - 1.5) + (-2)(1.5 - 0.92) + 0.0625(0.92 - 1) ]
       = 0.98880519,  f(xm) = -1.99925.

The iterates xm tend to 1: indeed xmin = 1 and f(1) = -2.

6.2 Davidon's Cubic Interpolation Method

It is generally better than Powell's method if the derivatives of f(x) are easy to evaluate. Consider the problem:

    min f(x) along x = x0 + alpha,

where x0 is the current point. Let f0 = f(x0) and f_a = f(x0 + alpha), where alpha is a given trial value, and suppose we know

    G0 = df/d(alpha) |_{alpha=0} = f'(x0), with G0 < 0,
    G_a = df/d(alpha) |_{alpha} = f'(x0 + alpha).

[Note: to cover the case where G0 > 0, i.e., the minimum lies to the left, use f_a = f(x0 - alpha).]

Minimization occurs in 3 stages:

(a) the order of magnitude of alpha_m, the minimizing value of alpha, is established;
(b) upper and lower bounds are found for alpha_m;
(c) cubic interpolation is used for more precise bounds.

(a) Initial approximation to alpha_m, viz.

    alpha = min{ K, -2(f0 - fe)/G0 },

where K = some representative magnitude for the problem (usually K = 2), and fe = a preliminary estimate (low rather than high) of f(x0 + alpha_m).

Note: if the function to be minimized is a quadratic, then

    -2(f0 - fe)/G0 = alpha_m

when fe is an exact estimate of the minimum of f(x0 + alpha).

Proof. For a quadratic,

    f(x0 + alpha) = a(x0 + alpha)^2 + b(x0 + alpha) + c,
    df(x0 + alpha)/d(alpha) = 2a(x0 + alpha) + b = 0
    =>  x0 + alpha_m = -b/(2a),  f(x0 + alpha_m) = c - b^2/(4a).    (*)

Now, look at -2(f0 - fe)/G0 with fe = c - b^2/(4a):

    -2( a x0^2 + b x0 + c - c + b^2/(4a) ) / (2a x0 + b)
      = -(4a^2 x0^2 + 4ab x0 + b^2) / ( 2a(2a x0 + b) )
      = -(2a x0 + b)^2 / ( 2a(2a x0 + b) )
      = -(2a x0 + b)/(2a) = -x0 - b/(2a) = alpha_m,

as claimed.

(b) Since G0 < 0, the minimum lies between alpha = 0 and alpha = alpha_hat if G_a > 0 or if f_a > f0 (from (a)). If neither holds, replace x0 + alpha by x0 + 2 alpha, repeating, if necessary, till the minimum of y(alpha) is bracketed; then start the interpolation.

(c) Interpolation formula. Assume a cubic y(alpha) approximates f(x0 + alpha), with

    y(0) = f0,              (2.6.15)
    y(alpha_hat) = f_a,     (2.6.16)
    y'(0) = G0,             (2.6.17)
    y'(alpha_hat) = G_a.    (2.6.18)

Assume

    y(alpha) = f0 + G0 alpha + y2 alpha^2 + y3 alpha^3,    (2.6.19)
    y'(alpha) = G0 + 2 y2 alpha + 3 y3 alpha^2.            (2.6.20)

Clearly, (2.6.19) and (2.6.20) satisfy (2.6.15) and (2.6.17), respectively. Setting alpha = alpha_hat and using (2.6.16) and (2.6.18), we have

    alpha_hat^2 y2 + alpha_hat^3 y3 = f_a - f0 - G0 alpha_hat,
    2 alpha_hat y2 + 3 alpha_hat^2 y3 = G_a - G0.

Solving these gives

    alpha_hat y2 = -(G0 + Z),
    alpha_hat^2 y3 = (1/3)(G0 + G_a + 2Z),

where

    Z = (3/alpha_hat)(f0 - f_a) + G0 + G_a.

(2.6.20) is then

    y'(alpha) = G0 - 2(G0 + Z)(alpha/alpha_hat) + (G0 + G_a + 2Z)(alpha/alpha_hat)^2.

To find alpha_m, set y'(alpha_m) = 0:

    alpha_m/alpha_hat = [ (G0 + Z) +/- ( (G0 + Z)^2 - G0(G0 + G_a + 2Z) )^(1/2) ] / (G0 + G_a + 2Z)
                      = (G0 + Z +/- W) / (G0 + G_a + 2Z),

where

    W = (Z^2 - G0 G_a)^(1/2).

Which sign? Consider

    y''(alpha) = [ -2(G0 + Z) + 2(G0 + G_a + 2Z)(alpha/alpha_hat) ] / alpha_hat,

so

    y''(alpha_m) = +/- 2W / alpha_hat > 0 for a minimum:

we must take the + sign, as alpha_hat is positive by definition. Hence

    alpha_m/alpha_hat = (G0 + Z + W) / (G0 + G_a + 2Z),

or, equivalently,

    alpha_m/alpha_hat = 1 - (G_a + W - Z) / (G_a - G0 + 2W)    (2.6.21)

for greater numerical accuracy.

Proof (of the equivalence): the right-hand side of (2.6.21) equals

    (G_a - G0 + 2W - G_a - W + Z) / (G_a - G0 + 2W) = (W - G0 + Z) / (G_a - G0 + 2W),

so the two expressions agree if and only if

    (G0 + Z + W)(G_a - G0 + 2W) = (G0 + G_a + 2Z)(W - G0 + Z).

Expanding both sides and cancelling the common terms, this reduces to

    2W^2 = 2(Z^2 - G0 G_a),

which is valid by the definition of W.

Algorithm
1. Evaluate f0 = f(x0) and G0 = f'(x0); check that G0 < 0.
2. Choose K and fe and determine alpha (normally use K = 2).
3. Evaluate f_a = f(x0 + alpha) and G_a = f'(x0 + alpha). If G_a > 0 or if f_a > f0, go to step 5; otherwise go to step 4.
4. Replace alpha by 2 alpha, evaluate the new f_a and G_a, and return to step 3.
5. Interpolate in the interval [0, alpha] for alpha_m using (2.6.21), where W and Z are as given above.
6. Return to step 5 to repeat the interpolation in the smaller interval, [alpha_m, alpha] or [0, alpha_m], according to whether f'(x0 + alpha_m) < 0, or f'(x0 + alpha_m) >= 0 (or f(x0 + alpha_m) > f(x0)).
7. Stop if alpha_m is within epsilon of the endpoints, or the interval is less than epsilon.

Example. Use Davidon's cubic interpolation method to find the minimum of

    f(x) = x^4 - 4x + 1

for epsilon = 0.001.

Solution.

1. Choose x0 = 0: f0 = 1, G0 = f'(0) = [4x^3 - 4]|_{x=0} = -4, so G0 < 0. Choose K = 2, fe = -4:

    alpha = min{ 2, -2(1 - (-4))/(-4) } = min{ 2, 5/2 } = 2.

2. f_a = f(2) = 2^4 - 4 x 2 + 1 = 9, G_a = f'(2) = 4 x 2^3 - 4 = 28 > 0, so the minimum is bracketed; go to step 5.

5. Interpolate on [0, 2]:

(a) Z = (3/alpha)(f0 - f_a) + G0 + G_a = (3/2)(1 - 9) + (-4) + 28 = -12 - 4 + 28 = 12,

(b) W = (Z^2 - G0 G_a)^(1/2) = (144 - (-4)(28))^(1/2) = (144 + 112)^(1/2) = 256^(1/2) = 16,

(c) alpha_m/alpha = 1 - (G_a + W - Z)/(G_a - G0 + 2W) = 1 - (28 + 16 - 12)/(28 + 4 + 32) = 1 - 32/64 = 1/2,

so alpha_m = 1, with f(0 + 1) = -2 and G1 = f'(0 + 1) = 0.

Return to 5 on the interval [0, 1]:

    Z = (3/1)(1 - (-2)) + (-4) + 0 = 5,
    W = (Z^2 - G0 G1)^(1/2) = (25 - (-4)(0))^(1/2) = 5,
    alpha_m/1 = 1 - (G1 + W - Z)/(G1 - G0 + 2W) = 1 - (0 + 5 - 5)/(0 + 4 + 10) = 1 - 0/14 = 1.

alpha_m = 1, as before, so stop: the minimum is at x = 1, f(1) = -2.

Check analytically:

    df/dx = 4x^3 - 4 = 0  =>  x = 1;
    d2f/dx2 |_{x=1} = 12x^2 |_{x=1} = 12 > 0,

so x = 1 is a minimum.
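A Python sketch of one cubic-interpolation step (stage (c)), reproducing the first interpolation of the worked example above:

    def cubic_interp_step(f0, g0, fa, ga, alpha):
        """One Davidon cubic-interpolation step on [0, alpha]: given f and f'
        at both ends (g0 < 0, and ga > 0 or fa > f0 so the minimum is
        bracketed), return the estimate alpha_m from (2.6.21)."""
        z = 3.0 * (f0 - fa) / alpha + g0 + ga
        w = (z * z - g0 * ga) ** 0.5
        return alpha * (1.0 - (ga + w - z) / (ga - g0 + 2.0 * w))

    # Worked example: f = x^4 - 4x + 1, x0 = 0, bracket alpha = 2
    f = lambda x: x**4 - 4*x + 1
    g = lambda x: 4*x**3 - 4
    print(cubic_interp_step(f(0), g(0), f(2), g(2), 2.0))   # 1.0, the exact minimizer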

Chapter 3

Unconstrained Optimization Techniques


1 Introduction

An optimization problem is defined as

    min f(x), x in R^n,

subject to

    (i)  gi(x) >= 0,  i = 1, 2, ..., m,
    (ii) gi(x) = 0,   i = m + 1, m + 2, ..., r,

where x = (x1, x2, ..., xn)^T. The constraints (i) and (ii) form a feasible region for the optimization problem. We denote it by Omega.

Global minimum: x* is said to be a global minimum if

    f(x*) <= f(x) for all x in Omega.

Strict global minimum: x* is said to be the strict global minimum if

    f(x*) < f(x) for all x in Omega \ {x*}.

Convex Set. A set C is said to be convex if for any two points in C the line segment joining the two points is also in C. Mathematically, for any x, y in C,

    lambda x + (1 - lambda) y in C, for all lambda in [0, 1].

Convexity. A function f(x) is said to be convex on a convex set X in R^n if

    f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y)

for any x, y in X and all lambda in [0, 1].

[Figure 1: An example of a convex function; the surface z = x^2 + y^2 over [-10, 10] x [-10, 10].]

Properties of Convex Functions

1. If both f1 and f2 are convex functions on X, then f1 + f2 is also convex on X.

PROOF. Suppose f1 and f2 are convex on X, i.e., for any x, y in X,

    fi(lambda x + (1 - lambda) y) <= lambda fi(x) + (1 - lambda) fi(y),

for lambda in [0, 1] and i = 1, 2. From this we have

    f1(lambda x + (1 - lambda) y) + f2(lambda x + (1 - lambda) y)
      <= lambda f1(x) + (1 - lambda) f1(y) + lambda f2(x) + (1 - lambda) f2(y)
      = lambda (f1(x) + f2(x)) + (1 - lambda)(f1(y) + f2(y)).

From the definition we see that f1 + f2 is convex.

2. If f is a convex function on X, then af is also convex on X for any a > 0.
PROOF. Exercise.

3. Combining items 1 and 2, we have that if fi (i = 1, 2, ..., m) are convex on X and ai >= 0 for i = 1, 2, ..., m, then

    sum_{i=1}^{m} ai fi(x)

is also convex on X.

4. If f(x) is convex on R^n, then the set

    Omega := { x in R^n : f(x) <= b },  b in R,

is a convex set.
PROOF. Let x, y in Omega, i.e., f(x) <= b and f(y) <= b. Consider

    z = lambda x + (1 - lambda) y,  lambda in [0, 1].

We need to prove that z in Omega, or equivalently f(z) <= b. Now,

    f(z) = f(lambda x + (1 - lambda) y)
         <= lambda f(x) + (1 - lambda) f(y)
         <= lambda b + (1 - lambda) b
         = b.

So z in Omega, and thus Omega is a convex set.

Taylor's Theorem
If f(x) is continuous and has continuous first partial derivatives over an open convex set X in R^n, where x = (x1, ..., xn)^T, then for any two points x and y = x + h in X there exists a theta, 0 <= theta <= 1, such that

    f(y) = f(x) + g(theta x + (1 - theta) y)^T h,

where

    g(z) = grad f(z), z in X, and h = (h1, h2, ..., hn)^T.

All of g, x and h will sometimes be regarded as n x 1 matrices.

Generalization
If f(x) is continuous and has continuous first and second partial derivatives over an open convex set X in R^n, then for any two points x and y = x + h in X there exists a theta in (0, 1) such that

    f(y) = f(x) + (g(x))^T h + (1/2) h^T G(theta x + (1 - theta) y) h,

where G(z) denotes the Hessian of the function f, i.e., the matrix of second partial derivatives [ d2f / (dxi dxj) ].

Necessary Condition for Local Optima

    g(x0) = grad f(x0) = 0.    (3.1.1)

This is equivalent to

    df(x0)/dxj = 0,  j = 1, 2, ..., n.    (3.1.2)

Sufficient Condition for Local Optima

Definition 3.1 A matrix A is said to be positive definite if x^T A x > 0 for all x != 0.

We have the following theorem.

Theorem 3.1 Let x0 be a solution to (3.1.2). Then x0 is a local minimum if the Hessian G(x0) of the function f is positive definite.

PROOF. At x0 we have, from Taylor's theorem,

    f(x0 + h) - f(x0) = (1/2) h^T G(theta x0 + (1 - theta)(x0 + h)) h    (3.1.3)

because g(x0) = 0 by (3.1.1). Assume that the second partial derivatives of f are continuous. Then, there exists an epsilon > 0 such that

    d2f(x0)/(dxi dxj)  and  d2f(theta x0 + (1 - theta)(x0 + h))/(dxi dxj)

have the same sign pattern, provided that

    theta x0 + (1 - theta)(x0 + h) in N(x0, epsilon),

where N(x0, epsilon) denotes the epsilon-neighborhood of x0, i.e., 0 < |h| < epsilon. Now, if

    h^T G(x0) h > 0,  0 < |h| < epsilon,

then

    f(x0 + h) - f(x0) > 0,  0 < |h| < epsilon.

This implies that x0 is a local minimum of f.

Note: The argument for local maxima is similar.


Consider a real quadratic function

    V(x) = x^T P x,

where P = [Pij] is an n x n matrix. Since V(x) = x^T P x is real and scalar,

    x^T P x = (x^T P x)^T = x^T P^T x,

so

    V(x) = x^T [ (P + P^T)/2 ] x.

Clearly, (P + P^T)/2 is symmetric. Thus, we may just as well assume that P is symmetric; without loss of generality we do so from now on.

Recall: P is positive definite if V(x) = x^T P x > 0 for all x != 0.
Note: the Hessian matrix is symmetric if f is twice continuously differentiable (f in C^2).

Sylvester's criterion. The symmetric matrix

        [ P11  P12  ...  P1n ]
    P = [ P12  P22  ...  P2n ]
        [ ...               ]
        [ P1n  P2n  ...  Pnn ]

is positive definite if and only if all its leading principal minors are positive, i.e.,

    P11 > 0,  det [ P11 P12 ; P12 P22 ] > 0,  ...,  det P > 0.

The proof of this is lengthy, and we illustrate it by an example.

Example. Consider the matrix

        [ 10   1  -2 ]
    P = [  1   4  -1 ]
        [ -2  -1   1 ]

By applying Sylvester's criterion, we obtain

    10 > 0,  det [ 10 1 ; 1 4 ] = 39 > 0,  det P = 17 > 0,

so P is positive definite.

Example. Find the minimum of

    f(x) = 2 x1^2 + 3 x2^2 + 4 x3^2 - 8 x1 - 12 x2 - 24 x3 + 10.

Setting the partial derivatives to zero:

    df/dx1 = 4 x1 - 8 = 0   =>  x1 = 2,
    df/dx2 = 6 x2 - 12 = 0  =>  x2 = 2,
    df/dx3 = 8 x3 - 24 = 0  =>  x3 = 3,

so x0 = (2, 2, 3)^T. The second derivatives are

    d2f/dx1^2 = 4,  d2f/dx2^2 = 6,  d2f/dx3^2 = 8,  d2f/(dxi dxj) = 0 for i != j,

so

    G = diag(4, 6, 8),

which is clearly positive definite. Hence there is a minimum at (2, 2, 3), with

    f(x0) = 8 + 12 + 36 - 16 - 24 - 72 + 10 = -46.
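Sylvester's criterion is easy to check numerically. A minimal Python sketch (NumPy's determinant routine is used for the leading principal minors):

    import numpy as np

    def is_positive_definite(P):
        """Sylvester's criterion: a symmetric matrix is positive definite
        iff all its leading principal minors are positive."""
        P = np.asarray(P, dtype=float)
        return all(np.linalg.det(P[:k, :k]) > 0 for k in range(1, len(P) + 1))

    # Matrix from the example above: minors 10, 39, 17 are all positive
    P = [[10, 1, -2], [1, 4, -1], [-2, -1, 1]]
    print(is_positive_definite(P))   # True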

Optima of Convex and Concave Functions

Theorem 3.2 Let f(x) be a convex function over a closed set X in R^n. Then, any local minimum of f(x) in X is also the global minimum of f(x) over X.

PROOF. Let x0 be a local minimum. Assume x0 is not a global minimum. Then, we can find y such that

    f(y) < f(x0).

Now, consider

    f(lambda x0 + (1 - lambda) y) <= lambda f(x0) + (1 - lambda) f(y)
                                  < lambda f(x0) + (1 - lambda) f(x0)
                                  = f(x0)

for all 0 <= lambda < 1. Rearranging the argument of the left-hand side,

    f(x0 + (1 - lambda)(y - x0)) < f(x0)

for all 0 <= lambda < 1. Now, let

    N(x0, y) := { x in X : x = x0 + (1 - lambda)(y - x0), lambda in [0, 1] }.

Then N(x0, y) contains points arbitrarily close to x0 with

    f(x) < f(x0),  x in N(x0, y) \ {x0},

so x0 is not a local minimum if it is not also a global minimum: a contradiction.

Definition 3.2 (Boundary point) A point x is a boundary point of X if every neighborhood of x contains a point not in X.

We have the following theorem.

Theorem 3.3 Let X be a closed bounded convex set in R^n and let f(x) be a convex function over X. If f(x) has global maxima, then one or more of these global maxima are boundary points of X.

Note: f(x) is concave on X iff (if and only if) -f is convex on X. All the results here can be extended to concave functions.

Let us consider the problem: find the minimum of f(x), x in Omega, where Omega is the feasible region and x = (x1, ..., xn)^T. The method depends again on what is known about the function f(x).

Two Types of Methods
(1) Methods requiring only function values.
(2) Methods using gradient information: much more efficient.

We shall discuss methods for these two types separately.

2 Search Methods Using Function Values Only

This type of method is also called direct search; f is assumed piecewise continuous.

2.1 Exhaustive Search

For the problem

    min f(x),  x = (x1, x2)^T,
    subject to 0 <= x1 <= b1, 0 <= x2 <= b2,

set h1 = b1/m1, h2 = b2/m2 and evaluate the function at x = (k1 h1, k2 h2)^T for 0 <= k1 <= m1 and 0 <= k2 <= m2. This can take a long time; it is essential to use the most efficient line search method, such as Fibonacci. But even then, it is very inefficient.

Definition 3.3 (Unimodality) A function is unimodal if there is some path from every point x to the optimum along which the function continuously decreases.

Below are some methods for unimodal functions.

2.2 Univariate Method

Consider

    min f(x),  x = (x1, x2, x3)^T.

First carry out a line search in the x1 direction to find the minimum of f(x1, x2, x3) for given x2, x3. Then, with (x1, x3) fixed, search along x2. Then, with (x1, x2) fixed, search along x3. (1st iteration.) Then, with (x2, x3) fixed, search along x1 again. Then, with (x1, x3) fixed, search along x2. Then, with (x1, x2) fixed, search along x3, etc. (2nd iteration.)

Example

    max f(x1, x2) = 10 - 2(1 - x1)^2 - (1 - x2)^2.

Use the univariate method with starting point x(0) = (0, 0)^T.

Solution. First, search along the x1 direction; the objective function becomes

    f(x1(0) + lambda1, x2(0)) = f(lambda1, 0) = 10 - 2(1 - lambda1)^2 - 1 = 9 - 2(1 - lambda1)^2.

Clearly, lambda1 = 1 maximizes f(lambda1, 0), so

    x(1) = (1, 0)^T.

Now, search along the x2 direction; the objective function becomes

    f(x1(1), x2(1) + lambda2) = f(1, lambda2) = 10 - 2(1 - 1)^2 - (1 - lambda2)^2 = 10 - (1 - lambda2)^2.

lambda2 = 1 maximizes f(1, lambda2), so

    x(2) = (1, 1)^T.  (1 iteration)

Now, come back to search along the x1 direction (i.e., we start the second iteration):

    f(x1(2) + lambda1, x2(2)) = f(1 + lambda1, 1) = 10 - 2(1 - 1 - lambda1)^2 - (1 - 1)^2 = 10 - 2 lambda1^2.

lambda1 = 0 maximizes f(1 + lambda1, 1), so

    x(3) = (1, 1)^T.

Now, search along the x2 direction:

    f(x1(3), x2(3) + lambda2) = f(1, 1 + lambda2) = 10 - 2(1 - 1)^2 - (1 - 1 - lambda2)^2 = 10 - lambda2^2.

lambda2 = 0 maximizes f(1, 1 + lambda2), so

    x(4) = (1, 1)^T.  (2nd iteration)

Since (1, 1)^T is the best solution in both the x1 and x2 directions, we have located the optimum (i.e., x* = (1, 1)^T and f(1, 1) = 10).
Note: In general, we will stop the algorithm when the |lambda_k| are less than some tolerance value in consecutive iterations.
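A Python sketch of the univariate method, assuming some 1-D minimizer line_min is available (e.g., the golden-section routine sketched in Chapter 2); the per-coordinate search window of +/- 10 is an arbitrary illustrative choice:

    def univariate_search(f, x, line_min, n_iter=20):
        """Univariate method: minimize f along one coordinate at a time."""
        x = list(x)
        for _ in range(n_iter):
            for i in range(len(x)):
                phi = lambda t: f([x[j] if j != i else t for j in range(len(x))])
                x[i] = line_min(phi, x[i] - 10.0, x[i] + 10.0)
        return x

    # The example above, recast as minimization of -f
    f = lambda x: -(10 - 2 * (1 - x[0]) ** 2 - (1 - x[1]) ** 2)
    print(univariate_search(f, [0.0, 0.0], golden_section))   # -> about [1, 1]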

2.3 Simplex Method

(This is not the one in Linear Programming.)

A simplex is the smallest regular polyhedron in n-dimensional space; it contains n + 1 vertices. In 2-dimensional space the simplex is a triangle; in 3-dimensional space it is a tetrahedron. For the 2-dimensional case, we have the following steps.

1. Start with any point x(0) = [x1(0), x2(0)]^T.
2. Select a distance h = length of a side of the triangle. Place the points

    x(1) = [x1(0) + h, x2(0)]^T  and  x(2) = [x1(0) + h/2, x2(0) + (sqrt(3)/2) h]^T

(so that x(0), x(1), x(2) form an equilateral triangle).
3. Evaluate the function at x(0), x(1) and x(2).
4. Find the point with the largest function value (say, x(1)).
5. Reflect the point of largest function value in the centroid of the remaining vertices to get a new point x(3). (Question: is x(3) a previous point?)
6. Evaluate f(x(3)). Is f(x(3)) < f(x(1))?
7. Go back to step 4.

Difficulties
(a) Oscillation: occurs if

    f(x(k+2)) > f(x(k)) and f(x(k+1)),

and also

    f(x(k+3)) > f(x(k)) and f(x(k+1)),

in which case x(k+4) = x(k+2).

Action to overcome: if the reflection step gives back an earlier point, reflect the point with the next largest function value instead of the largest, i.e., if

    f(x(k+3)) > f(x(k)) > f(x(k+1)),

then reflect x(k) to get x(k+4).

Example (exercise). min f(x) = 2(x1 - 1)^2 + 2 x1 x2 + x2^2.

2.4 Pattern Search

Consider

    min f(x)
    subject to x in R^n.    (1)

The search procedure is best described in terms of base points and temporary positions. The first base point is denoted by

    x(0) = B(0) = [b1(0), b2(0), ..., bn(0)]^T.    (2)

A step size Delta_xi is chosen for each variable xi. To use vector notation, we let

    D(i) = [0, ..., 0, Delta_xi, 0, ..., 0]^T.    (3)

We first perturb the variable x1. Then the temporary position T1(0) is determined by the formulas

    T1(0) = B(0) + D(1), if f(B(0) + D(1)) < min{ f(B(0)), f(B(0) - D(1)) };
    T1(0) = B(0) - D(1), if f(B(0) - D(1)) < min{ f(B(0)), f(B(0) + D(1)) };
    T1(0) = B(0),        if f(B(0)) < min{ f(B(0) + D(1)), f(B(0) - D(1)) }.

Now the next variable, x2, is perturbed, about the temporary position T1(0) instead of the original base point B(0), and T2(0) is calculated as the new temporary position. In general, the jth temporary position Tj(0) is obtained from T_{j-1}(0) by the formula

    Tj(0) = T_{j-1}(0) + D(j), if f(T_{j-1}(0) + D(j)) < min{ f(T_{j-1}(0)), f(T_{j-1}(0) - D(j)) };
    Tj(0) = T_{j-1}(0) - D(j), if f(T_{j-1}(0) - D(j)) < min{ f(T_{j-1}(0)), f(T_{j-1}(0) + D(j)) };
    Tj(0) = T_{j-1}(0),        if f(T_{j-1}(0)) < min{ f(T_{j-1}(0) - D(j)), f(T_{j-1}(0) + D(j)) }.

This expression covers all j, 0 <= j <= n, if we adopt the convention that

    T0(0) = B(0).    (4)

The last temporary position is designated the second base point B(1), i.e.,

    B(1) = Tn(0).    (5)

All these exploratory moves which determine the movement from B(0) to B(1) establish a pattern of movement. Now, instead of exploring around B(1) in a similar fashion, we assume that the pattern may persist and start the next temporary search position not at B(1) but at a point 2(B(1) - B(0)) from B(0). Thus,

    T0(1) = B(0) + 2(B(1) - B(0)) = 2B(1) - B(0).    (6)

This is illustrated in the figure.

A local exploration is now carried out around T0(1), and the equations for determining Tj(1) for j = 1, ..., n are the same as the equations for Tj(0) with the superscript 1 replacing 0. Then, if the final temporary position Tn(1) is an improvement on the objective function value at B(1), Tn(1) is designated a new base point B(2), i.e.,

    B(2) = Tn(1)    (7)

if

    f(Tn(1)) < f(B(1)).

Assuming this condition holds, we now make a further double step from B(2) to the temporary position T0(2), where

    T0(2) = B(1) + 2(B(2) - B(1)) = 2B(2) - B(1),    (8)

and perform new exploratory moves around T0(2). However, if the double jump was a false move and it turns out that the objective function has increased, i.e.,

    f(T0(2)) >= f(B(1)),    (9)

we retreat to the previous base point by setting

    B(2) = B(1).    (10)

The pattern of movement is thus destroyed, and the whole procedure is started again, treating B(2) as an initial base point with smaller step sizes Delta_xi, i = 1, ..., n. To enable the step size to be automatically adjusted, the step sizes Delta_xi are halved when no improvement can be made around some T0(k), and the whole procedure is repeated until the required accuracy is obtained.

The figure displays how the pattern search method would work on an objective function of two variables where the position of the minimum is indicated by the contours. Starting at the base point B(0), we decrease x1 by Delta_x1 and increase x2 by Delta_x2 to reach T2(0), which is the new base point B(1). We now jump to T0(1), start exploring around T0(1), and determine B(2), as it is an improvement on B(1). Next we explore around T0(2), which is double the distance B(1) to B(2) from B(1). We explore around T0(2) to find that only a decrease in x1 is worthwhile, so that we establish B(3). Exploration is now conducted around T0(3). Without any improvement due to the exploration, but as f(T2(3)) < f(B(3)), the new base point is B(4) and the temporary position is T0(4). Since the exploration around T0(4) leads to the result f(T2(4)) > f(B(4)), we return to the old base as B(5) = B(4). We would now search around B(5), possibly with a smaller step size.

Example

    min f(x1, x2) = x1^2 + 4 x2^2 - 4 x1 - 24 x2 + 44.

Choose Delta_x1 = Delta_x2 = 1, and let [0, 0]^T be the initial base point.
Evaluate the base point: B(0) = T0(0), f(0, 0) = 44.

Explore around the base point B(0) = [0, 0]^T:

    min{ f(1, 0), f(-1, 0) } = min{ 41, 49 } = 41 < f(0, 0) = 44:  f(1, 0) = 41, accept;
    min{ f(1, 1), f(1, -1) } = min{ 21, 69 } = 21 < f(1, 0) = 41:  f(1, 1) = 21, accept.

New base point B(1) = [1, 1]^T.
Temporary position T0(1) = 2B(1) - B(0) = [2, 2]^T - [0, 0]^T = [2, 2]^T, with f(2, 2) = 8.

Explore around T0(1) = [2, 2]^T:

    min{ f(3, 2), f(1, 2) } = min{ 9, 9 } = 9 > f(2, 2) = 8:  reject;
    min{ f(2, 3), f(2, 1) } = min{ 4, 20 } = 4 < f(2, 2) = 8:  f(2, 3) = 4, accept.

New base point B(2) = [2, 3]^T.
Temporary position T0(2) = 2B(2) - B(1) = [4, 6]^T - [1, 1]^T = [3, 5]^T, with f(3, 5) = 21.

Explore around T0(2) = [3, 5]^T:

    min{ f(4, 5), f(2, 5) } = min{ 24, 20 } = 20 < f(3, 5) = 21:  f(2, 5) = 20, accept;
    min{ f(2, 6), f(2, 4) } = min{ 40, 8 } = 8 < f(2, 5) = 20:  f(2, 4) = 8, accept.

But f(2, 4) = 8 > f(2, 3) = 4, so the new base point is B(3) = [2, 3]^T = B(2).

Explore around the base point B(3) = T0(3) = [2, 3]^T:

    min{ f(3, 3), f(1, 3) } = min{ 5, 5 } = 5 > f(2, 3) = 4:  reject;
    min{ f(2, 4), f(2, 2) } = min{ 8, 8 } = 8 > f(2, 3) = 4:  reject.

No improvement can be made around [2, 3]^T, so we may halve the step size and repeat the procedure until the desired accuracy is achieved. (In fact, the exact minimum is at x1 = 2, x2 = 3, with f(2, 3) = 4.)
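A compact Python sketch of the pattern search described above, tested on the worked example (minimum at (2, 3)). The move acceptance inside explore is a simplified rendering of the T-formulas:

    def explore(f, base, dx):
        """Exploratory moves: perturb each variable in turn about `base`."""
        t = list(base)
        for i in range(len(t)):
            for step in (+dx[i], -dx[i]):
                trial = t[:]
                trial[i] += step
                if f(trial) < f(t):
                    t = trial
                    break
        return t

    def pattern_search(f, x0, dx, tol=1e-6):
        """Pattern search: exploratory moves around a base point, then a pattern
        (double) move T0 = 2*B_new - B_old; halve the steps when no progress."""
        B = list(x0)
        dx = list(dx)
        while max(dx) > tol:
            T = explore(f, B, dx)
            if f(T) < f(B):                 # pattern move while it keeps improving
                B, T = T, [2 * t - b for b, t in zip(B, T)]
                T = explore(f, T, dx)
                while f(T) < f(B):
                    B, T = T, [2 * t - b for b, t in zip(B, T)]
                    T = explore(f, T, dx)
            else:
                dx = [d / 2 for d in dx]    # no improvement: reduce step sizes
        return B

    f = lambda x: x[0]**2 + 4*x[1]**2 - 4*x[0] - 24*x[1] + 44
    print(pattern_search(f, [0.0, 0.0], [1.0, 1.0]))   # -> [2, 3]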


3 Gradient-Type Methods

Consider the unconstrained optimization problem:

    min f(x), x in R^n.

An algorithm which generates a sequence of points {x(k)} such that

    f(x(k+1)) < f(x(k))

for all k is referred to as a descent method (i.e., the function value is reduced at each iteration).

NOTATION:

    g(k) = grad f(x(k)),
    G(k) = Hessian of f at x(k).

Consider the function f(x) along the line x(alpha) = x(k) + alpha s(k). Then f(x(alpha)) may be regarded as a function of alpha alone, with slope

    d f(x(alpha)) / d alpha = (s(k))^T grad f(x(k) + alpha s(k)) = (s(k))^T grad f(x(alpha)),

so

    d f(x(alpha)) / d alpha |_{alpha=0} = (s(k))^T grad f(x(k)) = (s(k))^T g(x(k)).

If a direction s(k) at x(k) is such that

    (s(k))^T g(x(k)) < 0,

then s(k) is called a descent direction, as the function value can always be reduced in a line search for some alpha > 0 (the slope of f(x(k) + alpha s(k)) is negative at alpha = 0).

Descent Algorithms
Given an initial estimate x(0), the kth iteration is:

(i) determine a direction of search s(k);
(ii) check for convergence;
(iii) find alpha(k) to minimize f(x(k) + alpha s(k)) with respect to alpha (line search);
(iv) set x(k+1) = x(k) + alpha(k) s(k);

i.e., move through R^n along a sequence of straight line segments.

Remarks
(1) This is a general structure within which most of the good methods lie.
(2) Different methods arise from different ways of generating the search direction s(k).
(3) The line search is idealized, in that an exact line search is impossible in practice.

A necessary condition for alpha(k) to minimize f(x(k) + alpha s(k)) is that

    d f(x(k) + alpha s(k)) / d alpha = 0 at alpha = alpha(k),

but

    d f(x(k) + alpha s(k)) / d alpha = ( grad f(x(k) + alpha s(k)) )^T s(k),

so a necessary condition for an exact line search is:

    ( grad f(x(k) + alpha(k) s(k)) )^T s(k) = (g(k+1))^T s(k) = 0.

3.1 Steepest Descent Method

Consider the optimization problem min f(x), x in R^n. Choose

    s(k) = -g(k) / ||g(k)||,

the direction in which the function decreases most rapidly in the neighbourhood of x(k).

Theorem 3.4 Given g(k) != 0, the direction s = -g(k)/||g(k)||, where ||g(k)|| = ((g(k))^T g(k))^(1/2), solves

    min s^T g(k)
    subject to ||s|| = 1.

PROOF.

    s^T g(k) = ||s|| ||g(k)|| cos(theta) = ||g(k)|| cos(theta),

where theta is the angle between s and g(k); the best value is attained when theta = pi, i.e., s = -g(k)/||g(k)||.

Remarks:
(1) Simple: needs only values of the function and its gradient.
(2) Global convergence to a stationary point.
(3) The convergence can be very slow.
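A Python sketch of steepest descent with an exact 1-D line search (the golden-section routine sketched in Chapter 2 is assumed available; the line-search bracket [0, 1] and the quadratic test problem are illustrative choices):

    import numpy as np

    def steepest_descent(f, grad, x0, line_min, tol=1e-6, max_iter=500):
        """Steepest descent: search along s = -g(x), with a 1-D line search
        (`line_min` minimizes a function of alpha) at every iteration."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            phi = lambda a: f(x - a * g)
            alpha = line_min(phi, 0.0, 1.0)   # bracket is an arbitrary choice
            x = x - alpha * g
        return x

    # Quadratic test: f = (1/2) x^T Q x - b^T x with Q = diag(2, 10), x* = (2, -3)
    Q = np.diag([2.0, 10.0]); b = np.array([4.0, -30.0])
    f = lambda x: 0.5 * x @ Q @ x - b @ x
    grad = lambda x: Q @ x - b
    print(steepest_descent(f, grad, [1.0, -2.0], golden_section))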
Example

    max f(x) = 10 - 2(1 - x1)^2 - (1 - x2)^2,

with starting point x(0) = (0, 0)^T. (This is a maximization problem; we search along +g, which is steepest descent applied to -f.)

Solution.

    g = grad f = [ 4(1 - x1), 2(1 - x2) ]^T,  g(0) = [4, 2]^T,
    x(0) + alpha g(0) = (4 alpha, 2 alpha)^T,

    f(x(0) + alpha g(0)) = 10 - 2(1 - 4 alpha)^2 - (1 - 2 alpha)^2
                         = 10 - 2(1 - 8 alpha + 16 alpha^2) - (1 - 4 alpha + 4 alpha^2)
                         = -36 alpha^2 + 20 alpha + 7.

Setting

    d f(x(0) + alpha g(0)) / d alpha = -72 alpha + 20 = 0

gives alpha(0) = 5/18, which maximizes f(x(0) + alpha g(0)) since d2f/d alpha^2 = -72 < 0. Then

    x(1) = x(0) + alpha(0) g(0) = (10/9, 5/9)^T,  f(x(1)) = 9.7778.

Next,

    g(1) = [ 4(1 - x1), 2(1 - x2) ]^T |_{x = x(1)} = [ -4/9, 8/9 ]^T,

    f(x(1) + alpha g(1)) = 10 - 2(-1/9 + 4 alpha/9)^2 - (4/9 - 8 alpha/9)^2
                         = 10 - (96 alpha^2 - 80 alpha + 18)/81.

Setting the derivative to zero,

    (192 alpha - 80)/81 = 0  =>  alpha(1) = 80/192 = 5/12,

with d2f/d alpha^2 = -192/81 < 0, so alpha(1) maximizes f(x(1) + alpha g(1)). Then

    x(2) = x(1) + alpha(1) g(1) = (10/9 - (5/12)(4/9), 5/9 + (5/12)(8/9))^T = (25/27, 25/27)^T,

    f(x(2)) = 10 - 2(2/27)^2 - (2/27)^2 = 9.98355.

Next,

    g(2) = [ 8/27, 4/27 ]^T,

    f(x(2) + alpha g(2)) = 10 - 2(2/27 - 8 alpha/27)^2 - (2/27 - 4 alpha/27)^2,

and setting the derivative to zero,

    288 alpha - 80 = 0  =>  alpha(2) = 80/288 = 5/18,

again a maximizer since the second derivative -288/(27)^2 is negative. Then

    x(3) = x(2) + alpha(2) g(2) = (245/243, 235/243)^T = (1.0082, 0.9671)^T,

    f(x(3)) = 9.9990, etc.

The iterates zig-zag slowly towards the optimum (1, 1)^T.

Global Convergence of the Steepest Descent Method

Assume that grad f(x*) = 0 and grad f(x) != 0 if x != x*. The SDM constructs a sequence {f(xi)} with

    f(x_{i+1}) = min_{alpha > 0} f(xi - alpha g(xi)) < f(xi),  i = 0, 1, ...,

where g(x) = grad f(x). Thus, {f(xi)} is a bounded monotone sequence, since

    f(x*) <= f(xi) <= f(x0),  i = 0, 1, ...,

so {f(xi)} is convergent for any initial point x0, i.e., the SDM is globally convergent.

Rate of Convergence
We take the following quadratic form as an example:

    f(x) = (1/2) x^T Q x - b^T x,

where Q is a positive-definite symmetric n x n matrix and b is an n x 1 vector. Since Q is positive definite, f(x) is strictly convex, and Q has only positive eigenvalues:

    0 < a = lambda_1 <= lambda_2 <= ... <= lambda_n = A.

The unique minimum point x* can be found from

    grad f(x*) = 0  =>  Q x* - b = 0, or x* = Q^{-1} b.

Introducing the quadratic function

    E(x) = (1/2)(x - x*)^T Q (x - x*)
         = (1/2) x^T Q x - x^T Q x* + (1/2)(x*)^T Q x*
         = f(x) + (1/2)(x*)^T Q x*,

E(x) and f(x) have the same minimum point x*, since (1/2)(x*)^T Q x* is a constant. We now apply the steepest descent method to E(x) or f(x):

    g(x) = Q x - b,
    x_{k+1} = xk - alpha_k gk  (gk = g(xk)),

where alpha_k is determined by the exact line search

    f(x_{k+1}) = min_{alpha in (0, infinity)} (1/2)(xk - alpha gk)^T Q (xk - alpha gk) - (xk - alpha gk)^T b.

Differentiating with respect to alpha and setting the result to zero, we get

    -gk^T Q (xk - alpha gk) + gk^T b = 0,

i.e.,

    alpha gk^T Q gk - gk^T (Q xk - b) = 0  =>  alpha_k = gk^T gk / gk^T Q gk,

so

    x_{k+1} = xk - ( gk^T gk / gk^T Q gk ) gk,  k = 0, 1, 2, ...

Lemma 3.1 The above iterative procedure satisfies

    E(x_{k+1}) = { 1 - (gk^T gk)^2 / [ (gk^T Q gk)(gk^T Q^{-1} gk) ] } E(xk).    (3.3.4)

PROOF. By direct computation,

    ( E(xk) - E(x_{k+1}) ) / E(xk) = ( 2 alpha_k gk^T Q yk - alpha_k^2 gk^T Q gk ) / ( yk^T Q yk )
        (yk = xk - x*, gk = Q yk = Q xk - b)
      = ( 2 (gk^T gk)^2 / gk^T Q gk - (gk^T gk)^2 / gk^T Q gk ) / ( gk^T Q^{-1} gk )
      = (gk^T gk)^2 / [ (gk^T Q gk)(gk^T Q^{-1} gk) ].

Eq. (3.3.4) follows from this.

Theorem 3.5 (Kantorovich inequality) For any vector x != 0, there holds

    (x^T x)^2 / [ (x^T Q x)(x^T Q^{-1} x) ] >= 4aA / (a + A)^2,    (3.3.5)

where Q is a positive definite n x n matrix and a and A are the smallest and largest eigenvalues of Q, respectively.
PROOF. Omitted.

Theorem 3.6 (Steepest descent, quadratic case) For any x0 in R^n, the SDM satisfies

    E(x_{k+1}) <= ( (A - a)/(A + a) )^2 E(xk),

with E(x) = (1/2)(x - x*)^T Q (x - x*).

PROOF. Combining (3.3.4) and (3.3.5) we have

    E(x_{k+1}) <= ( 1 - 4aA/(A + a)^2 ) E(xk) = ( (A - a)/(A + a) )^2 E(xk).

Definition 3.4 (Condition number) For any matrix Q, the condition number of Q is defined as r = A/a, where A is the largest and a the smallest eigenvalue (in magnitude).

Using the condition number, we have

    E(x_{k+1}) <= ( (r - 1)/(r + 1) )^2 E(xk),

i.e., linear convergence with the ratio ((r - 1)/(r + 1))^2.
N.B. Convergence of the SDM is very slow, since normally r >> 1 in practice.

Theorem 3.7 (Steepest descent, non-quadratic case) Suppose f(x) is defined on R^n, has continuous second partial derivatives, and has a relative minimum at x*. Suppose further that the Hessian of f, G(x*), has condition number r. If {xk} is a sequence generated by the SD method that converges to x*, then the sequence {f(xk)} converges to f(x*) linearly, with a convergence ratio no greater than ((r - 1)/(r + 1))^2.


3.2 Newtons Method
SDM (1) satisfies

2
Newtons method is based on the quadratic model of the function obtained by
Aa
E(xk+1 )
E(xk )
truncating the Taylor series expansion of f (x) about x(k) , i.e.,
A+a
with E(x) =

1
f (x(k) + )
= q (k) () = f (k) + T g (k) + T G(k) ,
2

1
(x x )T Q(x x ).
2

where g (k) () is the quadratic approximation at the point x(k) and = x x(k)
is the step correction.
Choose (k) as the minimizer of q (k) (), i.e., as the solution to

PROOF. Combining (3.3.4) and (3.3.5) we have




4aA
E(xK+1 )
1
E(xk )
(A + a)2

2
Aa
=
E(xk )
A+a

5q (k) () = 0,
giving
1

(k) = G(k) g (k)




if G(k) is positive definite.


Remarks

35

(1) requires f (k) , g (k) and G(k) , i.e. function values; first and second derivatives.
(2) the step (k) is only appropriate and well-defined if the quadratic model
has a minimum, i.e. G(k) is positive definite.

where s a point in between x and xk . Note that



0
A0 (x ) = x G1 (x )g(x )
0
= I G1 (x )G(x ) G1 (x ) g(x )
= 0

(3) basic Newtons method does not involve a line search as a step of (k) (i.e.
(k) = 1) goes to minimum of quadratic.

1
2
kA00 ()k kxk x k
2
2
C kxk x k

kxk+1 x k =

The Newtons Algorithm


(a) Choose an initial guess x( 0) and 0 <  << 1. Let k = 0.

(3.3.7)

When x0 is close to x , {xk } converges to x at a rate of at least 2nd order.

(b) Solve G(k) ( k) = g (k) for (k) .

Global Convergence
From (3.3.7) we see that if, at one stage, kxk x k > 1, then the method may
x(k) || < , then STOP. Otherwise, let k = k + 1 and GOTO not converge. Lets consider a damped scheme corresponding to (3.3.6):

(c) Set x(k+1) = x(k) + (k) .


(d) If ||x(k+1)
Step (b).

xk+1 = xk k G1
k gk ,

0 < k 1.

Remarks:

We may use k to control the step length. Now, consider a general form
(1) Newtons method is not a general purpose method, as G(k) may not be
xk+1 = xk Mk gk ,
(3.3.8)
always positive definite when x(k) is remote from x , where x is a local
minimum.
where Mk is an n n matrix. A two special cases are

(2) If G = 5 f (x ) is positive definite, good local convergence for starting


1. The Steepest descent method: Mk = I
point sufficiently close to x .
2. Newtons method : Mk = Gk .

Local Convergence of Newtons Method


Consider the iterates:
xk+1 = xk G1
k gk
where

Gk = Hessian and gk = f (xk ).

Define

A(x) = x (G(x))
Suppose x is a point such that
g(x ) = 0

and

= f (xk ) gkT Mk gk + O()


(when is small). So,
f (xk+1 ) f (xk ) gkT Mk gk .

G(x )

non-singular

= A(xk ) x + G1 (x )g(x )
= A(xk ) A(x ).

gkT Mk gk > 0
or Mk is positive definite.
Steepest Descent: Mk = I
positive definite
Newton:
Mk = G k
positive definite
When x is close to x .
May not be p.d. if x is away from x .

Therefore, taking the norm,


kxk+1 x k = kA(xk ) A(x )k

= f (xk ) + gkT (xk )(xk+1 xk ) + O(kxk+1 xk k

In order that f (xk+1 ) < f (xk ), we need

Then
xk+1 x

f (xk+1 )

(3.3.6)

g(x).

From Taylors expansion and (3) we have

kA0 (x )(xk x )k +

1
2
kA00 ()k kxk x k
2

Modifications to Newton's Method

(a) Assume that G_k = G(x_k) has eigenvalues λ_1^k < λ_2^k < ... < λ_n^k. Choose ν_k such that
ν_k + λ_1^k = δ > 0,
with δ a small positive number. Obviously, when λ_1^k > 0 we choose ν_k = 0 (and δ = λ_1^k). Now
ν_k I + G_k
is positive definite. So, we construct a modified Newton method
x_{k+1} = x_k − α_k (ν_k I + G_k)^{−1} g_k,   k = 0, 1, ....

(b) LDL^T decomposition of G_k (Choleski decomposition). Let
G_k = L_k D_k L_k^T,
where L_k is a lower triangular matrix and D_k = diag(d_1^k, d_2^k, ..., d_n^k) is a diagonal matrix. Choose ν_k with
ν_k + min{d_1^k, d_2^k, ..., d_n^k} = δ > 0
for a chosen δ. Then
E_k = D_k + ν_k I
is positive definite, and thus we construct
Ĝ_k = L_k E_k L_k^T,   x_{k+1} = x_k − α_k Ĝ_k^{−1} g_k,   k = 0, 1, ....
N.B. Ĝ_k^{−1} is easy to evaluate.

Exact line search: choose α^(k) to minimize f(x^(k) + α s^(k)), i.e. α^(k) must satisfy
d/dα f(x^(k) + α s^(k)) |_{α = α^(k)} = 0,
that is,
∇f(x^(k) + α^(k) s^(k))^T s^(k) = g^(k+1)T s^(k) = 0.

Example

Consider the quadratic function
f(x) = x_1² − 4x_1 + 5x_2² + 30x_2 + 50
and the starting point x^(1) = [1, −2]^T.
(a) Solve using Newton's method.
(b) Carry out 1 iteration of the steepest descent method with exact line search.
Solution
g(x) = [2x_1 − 4, 10x_2 + 30]^T,   G(x) = [[2, 0], [0, 10]],   G(x)^{−1} = [[1/2, 0], [0, 1/10]].
(a) Newton's method:
g^(1) = [−2, 10]^T,
δ^(1) = −G^{−1} g^(1) = −[[1/2, 0], [0, 1/10]] [−2, 10]^T = [1, −1]^T,
x^(2) = x^(1) + δ^(1) = [1, −2]^T + [1, −1]^T = [2, −3]^T.
Since g^(2) = [0, 0]^T, we have x^(2) = x*.
(b) Steepest descent method:
s^(1) = −g^(1) = [2, −10]^T,
x^(2) = x^(1) + α^(1) s^(1) = [1 + 2α^(1), −2 − 10α^(1)]^T,
g^(2) = [2(1 + 2α^(1)) − 4, 10(−2 − 10α^(1)) + 30]^T = [2(2α^(1) − 1), 10(1 − 10α^(1))]^T.
The exact line search condition g^(2)T s^(1) = 0 gives
4(2α^(1) − 1) − 100(1 − 10α^(1)) = 0, i.e. 1008α^(1) − 104 = 0,
α^(1) = 104/1008 = 13/126,
x^(2) = [1 + 26/126, −2 − 130/126]^T = [76/63, −191/63]^T
-- very slow!
Note: Newton's method converged in 1 iteration. This is true for any positive definite quadratic function, as Newton's method is based on a quadratic model of the function.
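The two computations above can be reproduced in a few lines; a sketch assuming NumPy.

```python
import numpy as np

G = np.diag([2.0, 10.0])
g = lambda x: np.array([2*x[0] - 4, 10*x[1] + 30])
x1 = np.array([1.0, -2.0])

# (a) one Newton step reaches the minimizer (2, -3)
print(x1 - np.linalg.solve(G, g(x1)))          # [ 2. -3.]

# (b) one exact-line-search steepest descent step
s = -g(x1)
alpha = (s @ s) / (s @ G @ s)                  # = 13/126
print(alpha, x1 + alpha * s)                   # [76/63, -191/63] ~ [1.206, -3.032]
```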

Inaccurate Line Search

Steepest descent and most other methods require a line search. So far we have assumed this can be done precisely, so that we choose the α that minimizes f(x + αs) exactly. This is O.K. analytically, but not practically when it is done on a computer: we must resort to approximate line searches.

If s is a descent direction at x, i.e. s^T ∇f(x) < 0, then
f(x + αs) < f(x)
for α > 0 small enough. Thus, we could replace finding the minimizer of f(x + αs) by any small enough α and still have a descent method. But if the α^(k)'s are chosen too small, we may not get to the minimum of f: we need at least a linear decrease in the function value to guarantee convergence. If α is chosen too big, s may no longer be a descent direction at the new point.

Approximate Line Search Conditions

An approximate minimizer α of f(x + αs) must satisfy
(1) sufficient function decrease: f(x + αs) ≤ f(x) + ρ α s^T ∇f(x);
(2) sufficient slope improvement: ∇f(x + αs)^T s ≥ σ s^T ∇f(x).
Here:
(a) s is a descent direction, i.e. s^T ∇f(x) < 0;
(b) ρ and σ are constants satisfying 0 < ρ < σ < 1. If σ = 0 is chosen, then α = α*, the exact minimizer: writing h(α) = f(x + αs), condition (2) reads |h'(α)| ≤ −σ h'(0), which holds on an interval [b, c] around α*.
An acceptable approximate minimizer is any point satisfying both of these conditions, i.e. any point in the interval [b, c].

Notes
(1) To ensure the existence of a point satisfying both of these conditions we need ρ ≤ σ.
(2) As σ → 0, the line search becomes more accurate. Typical values of σ are:
σ = 0.9 -- weak, i.e. not very accurate line search;
σ = 0.1 -- strong, i.e. fairly accurate line search.
(3) ρ is typically taken quite small, e.g. ρ = 0.01.

Steepest Descent Method with Approximate Line Search

Theorem 3.8 Let f ∈ C² and {x^(k)} be a sequence of points generated by the steepest descent method using an approximate line search. Then, either
(i) g^(k) = 0 for some k (g^(k) = ∇f(x^(k))), or
(ii) g^(k) → 0 as k → ∞, or
(iii) f^(k) → −∞ (no finite minimum).
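One simple way to realise the two tests is by bracketing: halve α when the decrease condition fails, double it when the slope condition fails. This is only a sketch (assuming NumPy), not the precise procedure of the notes.

```python
import numpy as np

def approx_line_search(f, grad, x, s, rho=0.01, sigma=0.9, alpha=1.0):
    f0 = f(x)
    slope0 = grad(x) @ s            # < 0 for a descent direction
    for _ in range(50):
        if f(x + alpha * s) > f0 + rho * alpha * slope0:
            alpha *= 0.5            # condition (1) fails: step too long
        elif grad(x + alpha * s) @ s < sigma * slope0:
            alpha *= 2.0            # condition (2) fails: step too short
        else:
            return alpha            # both acceptability conditions hold
    return alpha
```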

Note: as with any method using only first derivative information, this method can only guarantee convergence to a stationary point (i.e. x* such that g* = g(x*) = 0) of a general function f.

Let g^(k) = ∇f(x^(k)) and G^(k) = ∇²f(x^(k)).

Method of Steepest Descent
x^(k+1) = x^(k) − α^(k) ∇f(x^(k)).
Global convergence; linear convergence rate, so the local behaviour is not good; only needs knowledge of f and ∇f plus some line search.

Newton's Method
x^(k+1) = x^(k) − [∇²f(x^(k))]^{−1} ∇f(x^(k)).
Need ∇²f positive definite near the optimal point x*. Cannot be sure that f(x^(k+1)) < f(x^(k)), so we do not have global convergence. Good local properties -- converges quadratically.

To improve things, try a combination of both.

Newton's Method with Line Search
Use the direction −[∇²f(x^(k))]^{−1} ∇f(x^(k)) as in Newton's method, but with a line search:
x^(k+1) = x^(k) − α^(k) [∇²f(x^(k))]^{−1} ∇f(x^(k)),
where α^(k) minimizes f(x^(k) − α [∇²f(x^(k))]^{−1} ∇f(x^(k))) over α ∈ R^+.
If ∇²f(x^(k)), and hence [∇²f(x^(k))]^{−1}, is positive definite, then the search direction
s^(k) = −[∇²f(x^(k))]^{−1} ∇f(x^(k))
is a descent direction, because
g^(k)T s^(k) = −g^(k)T (G^(k))^{−1} g^(k) < 0.
Near the solution, α^(k) = 1 and we obtain the local convergence rate of Newton's method.
If G^(k) is not positive definite but is non-singular, search along ±s^(k), where the sign is chosen to obtain a descent direction. But then the stationary point of the approximating quadratic is not a minimum -- so why search in that direction!

3.3 Quasi-Newton Methods

Consider the optimization problem
min_{x ∈ R^n} f(x).
Recall that G^(k) = ∇²f(x^(k)) may not always be positive definite when x^(k) is far away from the local minimum. Thus, Newton's method is not a general purpose method. However, it has good local properties.
Quasi-Newton methods are based on the idea of approximating (G^(k))^{−1} at each iteration by a symmetric positive definite matrix H^(k), which we update at each iteration.

Newton's method: uses second derivatives; G^(k) may be indefinite.
Quasi-Newton methods: use only first derivatives; H^(k) is always positive definite, so s^(k) is a descent direction.
(Actually, some quasi-Newton methods do not keep H^(k) positive definite. Those that do are sometimes called variable metric methods.)

Algorithm
Given x^(1), H^(1). Set k = 1.
1. Evaluate f^(k) = f(x^(k)), g^(k) = ∇f(x^(k)).
2. Set s^(k) = −H^(k) g^(k) (the search direction).
3. Check for convergence. If ||s^(k)|| < ε, stop.
4. Set x^(k+1) = x^(k) + α^(k) s^(k), where α^(k) is chosen by a line search.
5. Update H^(k) to H^(k+1).
6. Set k = k + 1, go to Step 1.
Usually H^(1) = I. This implies that
s^(1) = −g^(1),
so we start with the steepest descent direction.

Approximation of the Inverse G^{−1}
The key idea in quasi-Newton methods is to approximate the inverse of the Hessian by H^(k) at step k. Let

δ^(k) = x^(k+1) − x^(k),   γ^(k) = g^(k+1) − g^(k),
where g^(k) = ∇f(x^(k)). Taylor's expansion gives
g^(k+1) = g^(k) + G^(k)(x^(k+1) − x^(k)) + higher order terms,
or
γ^(k) = G^(k) δ^(k) + higher order terms.
This expansion is exact if f(x) is quadratic. The above equality shows how G^(k) acts on δ^(k), and in the case that G^(i) = G is constant for all i = 0, 1, 2, ..., k (i.e. f(x) is quadratic), we have
γ^(i) = G δ^(i),   i = 0, 1, ..., n − 1,
or
Γ = G Δ,      (3.3.9)
where
Δ = (δ^(0), δ^(1), ..., δ^(n−1)) and Γ = (γ^(0), γ^(1), ..., γ^(n−1)).
Both Δ and Γ are n × n matrices, and we assume that δ^(0), ..., δ^(n−1) are linearly independent. From (3.3.9) we have
G^{−1} = Δ Γ^{−1}.
So, it is natural to construct successive approximations H^(k+1) to (G^(k))^{−1} based on data obtained from the first k steps of a descent process, in such a way that if G is constant, then the approximation is consistent with these steps. More specifically, H^(k+1) should satisfy
H^(k+1) γ^(i) = δ^(i),   0 ≤ i ≤ k.      (3.3.10)
Then, after n linearly independent steps we would have H^(n) = G^{−1}.
How do we achieve (3.3.10)?

Rank One Correction
H^(k+1) = H^(k) + a u u^T,
where a u u^T is a symmetric rank one matrix. If (3.3.10) is to be satisfied, we must have
H^(k) γ^(k) + a u u^T γ^(k) = δ^(k).
Set
u = δ^(k) − H^(k) γ^(k)
and choose a such that
a u^T γ^(k) = 1.
Then, since (u u^T) γ^(k) = (u^T γ^(k)) u,
H^(k+1) = H^(k) + (δ^(k) − H^(k) γ^(k))(δ^(k) − H^(k) γ^(k))^T / [(δ^(k) − H^(k) γ^(k))^T γ^(k)].
Problems: this update does not keep H^(k) positive definite, and the denominator may become zero.

Theorem 3.9 Suppose G is well-defined and positive definite, and δ^(1), ..., δ^(n) are linearly independent. Then the rank one method terminates on a quadratic function in at most n + 1 searches, with H^(n+1) = G^{−1}.
PROOF. For a quadratic, γ^(k) = G δ^(k). We want to show, by induction, that
H^(i) γ^(j) = δ^(j),   j = 1, ..., i − 1.
It is true for i = 2. Suppose it is true for some i > 2. Then, for j < i,
(δ^(i) − H^(i) γ^(i))^T γ^(j) = δ^(i)T γ^(j) − γ^(i)T H^(i) γ^(j) = δ^(i)T G δ^(j) − γ^(i)T δ^(j) = 0,
as γ^(i) = G δ^(i) and γ^(j) = G δ^(j). Then
H^(i+1) γ^(j) = H^(i) γ^(j) + 0 = δ^(j),   j < i.
Also, by construction,
H^(i+1) γ^(i) = δ^(i).
Therefore
δ^(j) = H^(n+1) γ^(j) = H^(n+1) G δ^(j),
and since the δ^(j)'s are linearly independent, it follows that
H^(n+1) G = I, i.e. H^(n+1) = G^{−1} -- termination.
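A sketch of the rank one update, assuming NumPy; the safeguard on the denominator reflects the problem noted above and is a common practical convention, not part of the original notes.

```python
import numpy as np

def rank_one_update(H, delta, gamma, tol=1e-12):
    u = delta - H @ gamma
    denom = u @ gamma
    if abs(denom) < tol * np.linalg.norm(u) * np.linalg.norm(gamma):
        return H                        # denominator too small: skip the update
    return H + np.outer(u, u) / denom   # H + a u u^T with a = 1/(u^T gamma)
```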

Rank Two Correction

H^(k+1) = H^(k) + a u u^T + b v v^T.
If the quasi-Newton condition (3.3.10) is satisfied, then
δ^(k) = H^(k) γ^(k) + a u u^T γ^(k) + b v v^T γ^(k).      (3.3.11)
Unlike before, u and v are not determined uniquely. Let u = δ^(k) and v = H^(k) γ^(k). Then (3.3.11) holds if
a u^T γ^(k) = 1, i.e. a = 1/(δ^(k)T γ^(k)),
and
b v^T γ^(k) = −1, i.e. b = −1/(γ^(k)T H^(k) γ^(k)).
Thus,
H^(k+1) = H^(k) + δ^(k) δ^(k)T/(δ^(k)T γ^(k)) − H^(k) γ^(k) γ^(k)T H^(k)/(γ^(k)T H^(k) γ^(k)).
This is the Davidon-Fletcher-Powell (DFP) formula.

Theorem 3.10 H^(k) positive definite ⟹ H^(k+1) positive definite. (This ensures the search direction is always downhill, so f can be reduced in the line search.)
PROOF. For any x ∈ R^n, we have
x^T H^(k+1) x = x^T H^(k) x + (x^T δ^(k))²/(δ^(k)T γ^(k)) − (x^T H^(k) γ^(k))²/(γ^(k)T H^(k) γ^(k)).
Let
a = (H^(k))^{1/2} x,   b = (H^(k))^{1/2} γ^(k).
(Note that (H^(k))^{1/2} is the matrix square root, not the square root in the real-number sense.) We have
x^T H^(k+1) x = a^T a − (a^T b)²/(b^T b) + (x^T δ^(k))²/(δ^(k)T γ^(k))
             = [(a^T a)(b^T b) − (a^T b)²]/(b^T b) + (x^T δ^(k))²/(δ^(k)T γ^(k)) > 0,
since (a^T a)(b^T b) ≥ (a^T b)² by the Cauchy-Schwarz inequality and, as shown below, δ^(k)T γ^(k) > 0. Indeed, with an exact line search,
δ^(k)T γ^(k) = δ^(k)T g^(k+1) − δ^(k)T g^(k) = −δ^(k)T g^(k) = α_k g^(k)T H^(k) g^(k) > 0,
since δ^(k)T g^(k+1) = 0 (x^(k+1) is the minimum point of f along δ^(k)) and, by definition, δ^(k) = −α_k H^(k) g^(k).

Properties:

(I) Is δ^(k)T γ^(k) > 0?

(i) Quadratic functions: γ^(k) = G δ^(k), so
δ^(k)T γ^(k) = δ^(k)T G δ^(k) > 0
for a quadratic function with positive definite Hessian.

(ii) General functions:
(a) Exact line search: g^(k+1)T s^(k) = 0, so
δ^(k)T γ^(k) = δ^(k)T g^(k+1) − δ^(k)T g^(k) = 0 + α_k g^(k)T H^(k) g^(k) > 0,
from the exact line search condition, δ^(k) = −α_k H^(k) g^(k), and H^(k) positive definite.
(b) Inexact line search: the condition on the slope can easily be used to ensure δ^(k)T γ^(k) > 0. We had
s^(k)T g^(k+1) ≥ σ s^(k)T g^(k),   0 < σ < 1.
Hence
δ^(k)T γ^(k) = δ^(k)T (g^(k+1) − g^(k)) = α_k s^(k)T (g^(k+1) − g^(k)) ≥ α_k (σ − 1) s^(k)T g^(k) = α_k (1 − σ) g^(k)T H^(k) g^(k) > 0,
since H^(k) is positive definite.

(iii) Quadratic termination: the minimum of a quadratic function with positive definite Hessian is found in at most n iterations.

Comparison on quadratic functions (G positive definite):
- Steepest descent: s^(k) = −g^(k); uses 1st derivatives only; can converge arbitrarily slowly.
- Quasi-Newton: s^(k) = −H^(k) g^(k); uses only 1st derivatives; quadratic termination, i.e. converges in at most n iterations.
- Newton: s^(k) = −(G^(k))^{−1} g^(k); converges in 1 iteration, but needs 2nd derivatives.

The key to quadratic termination is the concept of conjugate directions. We shall discuss the concept of conjugate directions later.
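A direct transcription of the DFP formula, assuming NumPy.

```python
import numpy as np

def dfp_update(H, delta, gamma):
    Hg = H @ gamma
    return (H + np.outer(delta, delta) / (delta @ gamma)
              - np.outer(Hg, Hg) / (gamma @ Hg))
```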

Example

f(x) = 10x_1² + x_2²,
g(x) = [20x_1, 2x_2]^T,   G(x) = [[20, 0], [0, 2]],   G^{−1} = [[1/20, 0], [0, 1/2]].
Take x^(1) = [1/10, 1]^T and H^(1) = I. Then
g^(1) = [2, 2]^T,   s^(1) = −g^(1) = [−2, −2]^T.
The exact line search gives α^(1) = 1/11, so
x^(2) = x^(1) + α^(1) s^(1) = [−9/110, 9/11]^T,   g^(2) = [−18/11, 18/11]^T,
δ^(1) = x^(2) − x^(1) = α^(1) s^(1) = [−2/11, −2/11]^T,
γ^(1) = g^(2) − g^(1) = [−40/11, −4/11]^T,
δ^(1)T γ^(1) = 8/11,   γ^(1)T γ^(1) = 1616/121.
The DFP update gives
H^(2) = I + δ^(1) δ^(1)T/(δ^(1)T γ^(1)) − γ^(1) γ^(1)T/(γ^(1)T γ^(1))
     = [[1, 0], [0, 1]] + (1/22)[[1, 1], [1, 1]] − (1/101)[[100, 10], [10, 1]]
     = (1/2222)[[123, −119], [−119, 2301]].
Then
s^(2) = −H^(2) g^(2) = (18/101)[1, −10]^T,
and the exact line search gives α^(2) = 101/220, so
x^(3) = x^(2) + α^(2) s^(2) = [0, 0]^T = x*,   g^(3) = [0, 0]^T.
For the final update,
δ^(2) = α^(2) s^(2) = (9/110)[1, −10]^T,
γ^(2) = g^(3) − g^(2) = (18/11)[1, −1]^T,
δ^(2)T γ^(2) = 81/55,
H^(2) γ^(2) = −H^(2) g^(2) = s^(2) = (18/101)[1, −10]^T,
γ^(2)T H^(2) γ^(2) = γ^(2)T s^(2) = 18²/101 = 324/101,
and hence
H^(3) = H^(2) + δ^(2) δ^(2)T/(δ^(2)T γ^(2)) − H^(2) γ^(2) γ^(2)T H^(2)/(γ^(2)T H^(2) γ^(2))
     = [[1/20, 0], [0, 1/2]] = G^{−1},
as predicted by the theory.

Another update formula is the BFGS formula (Broyden, Fletcher, Goldfarb, Shanno, 1970):

H_BFGS^(k+1) = H^(k) + (1 + γ^(k)T H^(k) γ^(k)/(δ^(k)T γ^(k))) δ^(k) δ^(k)T/(δ^(k)T γ^(k)) − (δ^(k) γ^(k)T H^(k) + H^(k) γ^(k) δ^(k)T)/(δ^(k)T γ^(k)).

The way we get this is by approximating G, instead of G^{−1}, by some matrix B^(k) = (H^(k))^{−1}. Then the quasi-Newton condition (3.3.10) changes to
γ^(k) = B^(k+1) δ^(k).
Updating B^(k) by a rank two correction as in DFP gives
B_BFGS^(k+1) = B^(k) + γ^(k) γ^(k)T/(γ^(k)T δ^(k)) − B^(k) δ^(k) δ^(k)T B^(k)/(δ^(k)T B^(k) δ^(k)).
This is the DFP formula with B replacing H and δ swapped with γ. The BFGS formula for H then comes about by requiring
B_BFGS^(k+1) H_BFGS^(k+1) = I.
The properties that the DFP method has are also evident in BFGS. What is more, for low accuracy line searches BFGS tends to do better than DFP and, in addition, if the conditions
f(x^(k+1)) ≤ f(x^(k)) and g^(k+1)T s^(k) ≥ σ g^(k)T s^(k),   0 < σ < 1,
hold in an inexact line search, then BFGS is globally convergent, whereas no such result exists for DFP.
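A sketch of the BFGS update of the inverse-Hessian approximation, assuming NumPy; it transcribes the H-form of the formula above.

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    dg = delta @ gamma
    Hg = H @ gamma
    return (H + (1.0 + (gamma @ Hg) / dg) * np.outer(delta, delta) / dg
              - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg)
```

Since H is kept symmetric, np.outer(delta, Hg) equals the term δ γ^T H of the formula.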

Combination of Steepest Descent and Newton's Methods

For any x_k, z_k ∈ R^n, Taylor's expansion gives
f(z_k) ≈ f(x_k) + g_k^T (z_k − x_k) + (1/2)(z_k − x_k)^T G_k (z_k − x_k),      (3.4.12)
where g_k = ∇f(x_k) and G_k is the Hessian of f at x_k. Let B be a known n × m matrix consisting of m linearly independent columns. Then
N := {Bu : u ∈ R^m}
is a linear subspace of R^n. Let z_k − x_k = B u_k. The expression (3.4.12) becomes
f(z_k) ≈ f(x_k) + g_k^T B u_k + (1/2) u_k^T B^T G_k B u_k.      (3.4.13)
So, our problem becomes: find u_k such that the RHS of (3.4.13) is minimized. Differentiating gives
B^T G_k B u_k + B^T g_k = 0.
From this we have
u_k = −(B^T G_k B)^{−1} B^T g_k,
and so
z_k = x_k − B (B^T G_k B)^{−1} B^T g_k.
Obviously B (B^T G_k B)^{−1} B^T is the inverse of G_k restricted to N. We expect it to be easy to evaluate.
Example. Suppose B = (I 0)^T, where I is the m × m identity matrix. Let
G = [[G_11, G_12], [G_21, G_22]],
where G_11 and G_22 are respectively m × m and (n−m) × (n−m) matrices. Then
(B^T G_k B)^{−1} = G_11^{−1}
and
B (B^T G_k B)^{−1} B^T = [[G_11^{−1}, 0], [0, 0]].
So B (B^T G_k B)^{−1} B^T is the inverse of G on the subspace N.

The Algorithm:
1. Set d_k = −B (B^T G_k B)^{−1} B^T g_k, where g_k and G_k are respectively the gradient and Hessian of f, and B is a given n × m (m < n) matrix.
2. Set z_k = x_k + α_k d_k, where α_k minimizes f(x_k + α d_k).
3. Set x_{k+1} = z_k − β_k g(z_k), where β_k minimizes f(z_k − β g(z_k)).
Notes:
1. Steps 1 and 2 perform a Newton iteration with the approximated inverse of the Hessian.
2. Step 3 performs a steepest descent iteration.

4.1 Convergence rate of the method for quadratic functions

Theorem 3.11 (combined method) Let Q be an n × n symmetric positive definite matrix, and let x* ∈ R^n. Define
E(x) = (1/2)(x − x*)^T Q (x − x*).
For any given n × m matrix B of rank m, the sequence {x_k}, k = 0, 1, ..., produced by applying the above algorithm to E(x) satisfies
E(x_{k+1}) ≤ (1 − δ) E(x_k),   k = 0, 1, ...,
where δ ∈ [0, 1] is the minimum of
(p^T p)² / [(p^T Q p)(p^T Q^{−1} p)]
over all vectors p in the null-space of B^T.
The proof is omitted here.
5 Conjugate Gradient Methods

5.1 The standard conjugate gradient method

Motivation: Consider the minimization of
F(x) = (1/2) x^T G x − b^T x,   x ∈ R^n,      (3.5.14)
where G is a positive definite and symmetric matrix. Let p_0, p_1, ..., p_k (k < n) be linearly independent vectors in R^n and put V_k = span{p_i, i = 0, ..., k} ⊂ R^n. The minimization of F over x_k + V_k is defined as
min_{w ∈ R^{k+1}} F(x_k + P_k w),      (3.5.15)
where P_k = (p_0, p_1, ..., p_k) is an n × (k+1) matrix. Using (3.5.14), we have
F(x_k + P_k w) = (1/2)(x_k + P_k w)^T G (x_k + P_k w) − b^T (x_k + P_k w).
Setting the gradient with respect to w to zero,
P_k^T G (x_k + P_k w) − P_k^T b = 0,
or
P_k^T (G x_k − b) + P_k^T G P_k w = 0.
From this we have
w = −(P_k^T G P_k)^{−1} P_k^T g_k,
where g_k = G x_k − b. We choose
x_{k+1} = x_k + P_k w = x_k − P_k (P_k^T G P_k)^{−1} P_k^T g_k.      (3.5.16)
It can be shown (later) that p_i^T g_j = 0 for j > i. So (3.5.16) becomes
x_{k+1} = x_k − P_k (P_k^T G P_k)^{−1} μ e_k,
where μ = g_k^T p_k and e_k is the kth column of the identity matrix. Furthermore, since p_0, p_1, ..., p_k are arbitrary, we can choose them such that
p_i^T G p_j = 0,   i ≠ j
(P_k^T G P_k then becomes a diagonal matrix), so that
x_{k+1} = x_k + α_k p_k,   with α_k = −(g_k^T p_k)/(p_k^T G p_k).      (3.5.17)

Definition 3.5 Given a symmetric matrix Q, two vectors d^(1) and d^(2) are said to be Q-orthogonal, or conjugate with respect to Q, if d^(1)T Q d^(2) = 0.
In our applications, we will also assume Q is positive definite.

Theorem 3.12 If Q is positive definite and the set of nonzero vectors d^(i), i = 1, 2, ..., k, are Q-conjugate, then they are linearly independent.
PROOF. Suppose they satisfy
α_1 d^(1) + ... + α_k d^(k) = 0
for a set of constants α_i, i = 1, 2, ..., k. Multiplying by Q and taking the scalar product with respect to d^(i), we have
α_1 d^(1)T Q d^(i) + ... + α_k d^(k)T Q d^(i) = 0.
From the conjugacy this reduces to α_i d^(i)T Q d^(i) = 0, and since d^(i)T Q d^(i) > 0 we have α_i = 0.

Consider
min (1/2) x^T Q x − b^T x,      (3.5.19)
where Q is positive definite, and assume that x* satisfies Q x* = b.

Theorem 3.13 Let {s^(i)}, i = 1, ..., n, be a set of nonzero Q-orthogonal vectors. For any x^(1) ∈ R^n, the sequence {x^(k)} generated by
x^(k+1) = x^(k) + α^(k) s^(k),
α^(k) = −(s^(k)T g^(k))/(s^(k)T Q s^(k)),
with g^(k) = Q x^(k) − b, converges to the unique solution x* of Q x = b after n iterations, that is, x^(n+1) = x*.
PROOF. Since the {s^(i)} are linearly independent,
x* − x^(1) = σ^(1) s^(1) + ... + σ^(n) s^(n)
for some set {σ^(i)}. Multiplying by Q and taking the inner product with respect to s^(k),
s^(k)T Q (x* − x^(1)) = σ^(k) s^(k)T Q s^(k),
so that
σ^(k) = s^(k)T Q (x* − x^(1)) / (s^(k)T Q s^(k)).      (3.5.20)
From the update formula, we get
x^(k) − x^(1) = Σ_{i=1}^{k−1} α^(i) s^(i),
and multiplying by Q and taking the inner product with s^(k) gives, by conjugacy,
s^(k)T Q (x^(k) − x^(1)) = 0.
Substituting this into (3.5.20), we get
σ^(k) = s^(k)T Q (x* − x^(k)) / (s^(k)T Q s^(k)).
But Q x* = b and Q x^(k) = g^(k) + b, so we have from the above
σ^(k) = −(s^(k)T g^(k))/(s^(k)T Q s^(k)) = α^(k).
Therefore x* = x^(1) + Σ α^(i) s^(i) = x^(n+1).

Let B^(k) be the subspace of R^n spanned by {s^(i), i = 1, ..., k}. Then we have the following theorem.

Theorem 3.14 Assume G is positive definite and let {s^(i)}, i = 1, ..., n, be a sequence of nonzero G-orthogonal vectors in R^n. Then, for any x^(1) ∈ R^n, the sequence {x^(k)} generated by
x^(k+1) = x^(k) + α^(k) s^(k),      (3.5.21)
α^(k) = −(g^(k)T s^(k))/(s^(k)T G s^(k)),      (3.5.22)
has the property that x^(k+1) minimizes
f(x) = (1/2) x^T G x − b^T x
on the line x = x^(k) + α s^(k), for all α ∈ (−∞, ∞), as well as on the linear variety x^(1) + B^(k).
PROOF. Since f is convex (actually strictly convex), a local minimum is a global minimum. This implies that we need only show g^(k+1) ⊥ B^(k). (B^(k) contains the line x^(k) + α s^(k).) We use mathematical induction to prove this.
It is trivially true for B^(0) = ∅. Assume it holds for k − 1, that is, g^(k) ⊥ B^(k−1). We have
x^(k+1) = x^(k) + α^(k) s^(k).
Multiplying by G and subtracting b from both sides gives
g^(k+1) = g^(k) + α^(k) G s^(k).
From (3.5.22) we have
s^(k)T g^(k+1) = s^(k)T g^(k) + α^(k) s^(k)T G s^(k) = 0.
For i < k,
s^(i)T g^(k+1) = s^(i)T g^(k) + α^(k) s^(i)T G s^(k).
But the first term on the RHS vanishes by induction, and the second term vanishes by conjugacy. Therefore
s^(i)T g^(k+1) = 0 for i ≤ k,
that is, g^(k+1) ⊥ B^(k).

Corollary 3.1 g^(k)T s^(i) = 0 for all i < k.

How do we generate {s^(i)}? If only first derivatives are available, we have a conjugate gradient method. It is applicable to a general minimization problem, not just for quadratic f.

5.2 Fletcher-Reeves Method (1964)

Let x^(1) be given.
1. Set s^(1) = −g^(1).
2. x^(k+1) = x^(k) + α^(k) s^(k), where α^(k) minimizes f along s^(k).
3. s^(k+1) = −g^(k+1) + β^(k) s^(k), where
β^(k) = (g^(k+1)T g^(k+1))/(g^(k)T g^(k)).
When k = n, set x^(1) = x^(n+1) (or continue as it is) and go back to Step 1.

On a quadratic function with G positive definite, the choice of β^(k) ensures that the s^(k) are conjugate. Since
s^(k+1) = −g^(k+1) + β^(k) s^(k),      (3.5.23)
multiplying (3.5.23) by γ^(k) := g^(k+1) − g^(k) = α^(k) G s^(k) gives
s^(k+1)T γ^(k) = −g^(k+1)T γ^(k) + β^(k) s^(k)T γ^(k).      (3.5.24)
Conjugacy of s^(k+1) and s^(k) requires
s^(k+1)T γ^(k) = α^(k) s^(k+1)T G s^(k) = 0.
Therefore, from (3.5.24),
0 = −g^(k+1)T γ^(k) + β^(k) s^(k)T γ^(k),
or
β^(k) = g^(k+1)T (g^(k+1) − g^(k)) / [(−g^(k) + β^(k−1) s^(k−1))^T (g^(k+1) − g^(k))]
     = −g^(k+1)T (g^(k+1) − g^(k)) / [g^(k)T (g^(k+1) − g^(k))],
since s^(k−1)T (g^(k+1) − g^(k)) = 0 by the previous theorem (g^(j)T s^(i) = 0 for i < j). Moreover, since
g^(i) = −s^(i) + β^(i−1) s^(i−1),
we also have g^(j)T g^(i) = 0 for i < j, and hence
β^(k) = (g^(k+1)T g^(k+1))/(g^(k)T g^(k)).
So, on a quadratic function with G positive definite, the Fletcher-Reeves method terminates in n iterations and generates conjugate directions.

Advantages over quasi-Newton methods
If n is large, we may have problems storing the approximation to the inverse Hessian in quasi-Newton methods. If n is not large, quasi-Newton methods are preferable.
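A sketch of the Fletcher-Reeves iteration for a general f, assuming NumPy; line_search(x, s) stands for any routine returning an (approximate) minimizer of f along s, e.g. the one sketched earlier, and is not part of the original notes.

```python
import numpy as np

def fletcher_reeves(x, grad, line_search, tol=1e-8, kmax=1000):
    n = len(x)
    g = grad(x)
    s = -g
    for k in range(kmax):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(x, s)
        x = x + alpha * s
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves beta
        s = -g_new + beta * s
        g = g_new
        if (k + 1) % n == 0:               # restart after n steps, as in Step 3
            s = -g
    return x
```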

Various Conjugate Gradient Algorithms

Other conjugate gradient methods arise from different choices of β^(k). For example, the Polak-Ribiere (1971) method uses
β^(k) = g^(k+1)T (g^(k+1) − g^(k)) / (g^(k)T g^(k)).
In the quadratic case, the above two expressions for β^(k) are identical, as g^(k+1)T g^(k) = 0. For general functions, the two methods behave differently, but on both numerical and theoretical grounds, the Polak-Ribiere method is preferable.

Example. Find the optimal point of f(x) = 10x_1² + x_2² using the Fletcher-Reeves method.
Solution. We start with β^(0) = 0. The gradient of f is g(x) = ∇f = (20x_1, 2x_2)^T. We choose x^(1) = (1/10, 1)^T, and so
g^(1) = (2, 2)^T,   s^(1) = −g^(1) = (−2, −2)^T.
From the algorithm we have
α^(1) = arg min_α f(x^(1) + α s^(1)) = 1/11.
So
x^(2) = x^(1) + α^(1) s^(1) = (−9/110, 9/11)^T,   g^(2) = (−18/11, 18/11)^T,
β^(1) = (g^(2)T g^(2))/(g^(1)T g^(1)) = 9²/11²,
s^(2) = −g^(2) + β^(1) s^(1) = (18/11, −18/11)^T + (81/121)(−2, −2)^T = (36/121)(1, −10)^T.
It is easy to show that α^(2) = 11/40, and thus
x^(3) = x^(2) + α^(2) s^(2) = (0, 0)^T = x*.
Recall that the Davidon-Fletcher-Powell method produced exactly the same iterates on this function:
x^(1) = (1/10, 1)^T,   x^(2) = (−9/110, 9/11)^T,   x^(3) = (0, 0)^T = x*.
These are the same as those of the Fletcher-Reeves method.

6.1 The linear CG method

Consider
min φ(x) = (1/2) x^T G x − b^T x,
where G is symmetric and positive definite.
Let β_{−1} = 0, p_{−1} = 0. Given x_0, let r_0 = G x_0 − b.
For k = 0, 1, ..., until convergence:
p_k = −r_k + β_{k−1} p_{k−1};
α_k = ||r_k||_2² / (p_k^T G p_k);
x_{k+1} = x_k + α_k p_k;
r_{k+1} = r_k + α_k G p_k;
β_k = ||r_{k+1}||_2² / ||r_k||_2²,
where || · ||_2 denotes the Euclidean norm.

Rate of Convergence
Define ||x||_G := (x^T G x)^{1/2}. The convergence of the linear CG algorithm is given by
||x_k − x*||_G ≤ C ((√κ − 1)/(√κ + 1))^k ||x_0 − x*||_G,
where κ = λ_max/λ_min is the condition number of G and x* is the minimum point of φ(x). From the above estimate we have
lim_{k→∞} ||x_k − x*||_G / ||x_0 − x*||_G ≤ lim_{k→∞} ((√κ − 1)/(√κ + 1))^k = 0.
Therefore, the linear CG method converges at least linearly, with ratio (√κ − 1)/(√κ + 1).
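A transcription of the linear CG loop above, assuming NumPy; note that the residual is r = Gx − b, so −r is the steepest descent direction for φ.

```python
import numpy as np

def linear_cg(G, b, x, tol=1e-10):
    r = G @ x - b
    p = -r
    for _ in range(len(b)):            # at most n iterations on a quadratic
        if np.linalg.norm(r) < tol:
            break
        Gp = G @ p
        alpha = (r @ r) / (p @ Gp)
        x = x + alpha * p
        r_new = r + alpha * Gp
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return x

G = np.array([[4.0, 1.0], [1.0, 3.0]])
print(linear_cg(G, np.array([1.0, 2.0]), np.zeros(2)))   # solves Gx = b
```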


Now, we know that min φ(x) is equivalent to solving G x = b. The CG method can thus also be used for solving G x = b. In particular, it can be used for solving the gradient equation
G_k δ_k = −g_k
in Newton's method.
Normally κ >> 1, so (√κ − 1)/(√κ + 1) ≈ 1, and thus the convergence of the linear CG method is very slow when used for solving large-scale linear systems.

6.2 Preconditioned CG methods

Consider the solution of
G x = b.
Multiplying by a matrix M^{−1}, we have
M^{−1} G x = M^{−1} b,   or G̃ x = b̃,
where G̃ = M^{−1} G. We expect M to be such that
- the condition number of G̃ is (much) smaller than that of G;
- M^{−1} is positive definite and symmetric;
- M is easily invertible.
Obviously, if we choose M = G, then M^{−1} G = I and we solve the problem in one iteration. This choice is not practical!

Incomplete Choleski Factorization
Since G is symmetric and positive definite, it can be decomposed into
G = L L^T   (Choleski),
where L is a lower triangular matrix. In the case that G = (G_ij), an n × n large-scale, sparse matrix, we can use the following algorithm to find the Incomplete Choleski Factorization or Decomposition (ICF or ICD) M = L L^T with L = (l_ij):

Algorithm (ICF):
l_11 = G_11^{1/2}
for i = 2 to n do
  for j = 1 to i − 1 do
    where (G_ij ≠ 0) do
      l_ij = (G_ij − Σ_{k=1}^{j−1} l_ik l_jk) / l_jj
    end do
  end do j
  l_ii = (G_ii − Σ_{k=1}^{i−1} l_ik²)^{1/2}
end do i

Finally, we can apply the linear CG algorithm to G̃ x = b̃ with M = L L^T.

Another Preconditioned CG method
The equation G x = b can also be written as
W^{−1/2} G W^{−1/2} (W^{1/2} x) = W^{−1/2} b,
or in short
G̃ x̃ = b̃.
The revised algorithm is

Algorithm (Preconditioned CG method):
r_0 = G x_0 − b, β_{−1} = 0, p_{−1} = 0;
For k = 0, 1, ..., until convergence:
p_k = −W^{−1} r_k + β_{k−1} p_{k−1};
α_k = (r_k^T W^{−1} r_k)/(p_k^T G p_k);
x_{k+1} = x_k + α_k p_k;
r_{k+1} = r_k + α_k G p_k;
β_k = (r_{k+1}^T W^{−1} r_{k+1})/(r_k^T W^{−1} r_k).

The preconditioner W can be chosen in various ways. For example, we can use
- W = diag(G) -- diagonal scaling;
- W = L L^T, where L is determined by Algorithm (ICF).
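A sketch of the preconditioned loop with the diagonal-scaling choice W = diag(G), assuming NumPy; any easily invertible symmetric positive definite W can be substituted.

```python
import numpy as np

def preconditioned_cg(G, b, x, tol=1e-10, kmax=200):
    W_inv = 1.0 / np.diag(G)           # W = diag(G), applied elementwise
    r = G @ x - b
    p = -(W_inv * r)
    for _ in range(kmax):
        if np.linalg.norm(r) < tol:
            break
        Gp = G @ p
        alpha = (r @ (W_inv * r)) / (p @ Gp)
        x = x + alpha * p
        r_new = r + alpha * Gp
        beta = (r_new @ (W_inv * r_new)) / (r @ (W_inv * r))
        p = -(W_inv * r_new) + beta * p
        r = r_new
    return x
```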

Chapter 4

Constrained Optimization Techniques

1 Introduction

Consider the nonlinearly constrained optimization problem:
min f(x)
subject to
(equality constraints)    g_i(x) = 0,  i = 1, ..., m,
(inequality constraints)  h_i(x) ≥ 0,  i = 1, ..., r,
(bounds on variables)     l_i ≤ x_i ≤ u_i,  i = 1, ..., n,
where x = [x_1, ..., x_n]^T. Here, none of these constraints is compulsory.

Solution Methods: Direct Method and Lagrangian Multipliers.
Let us demonstrate these methods using the following example.
Example. The sum of the areas of a cube and a sphere is constant. What is the ratio of an edge of the cube to the radius of the sphere when
(a) the sum of the volumes is minimum;
(b) the sum of the volumes is maximum?
Here, r = radius of the sphere and x = edge of the cube, so that
V = (4/3)πr³ + x³,   A = 4πr² + 6x².

Problem (1).
max V = (4/3)πr³ + x³
subject to
4πr² + 6x² = A = constant and r ≥ 0, x ≥ 0.

Problem (2).
min V = (4/3)πr³ + x³
subject to
4πr² + 6x² = A = constant and r ≥ 0, x ≥ 0.

Both problems have equality and bound constraints.

2.1 Direct method

Solve the constraint equation
4πr² + 6x² = A
for
x² = A/6 − (2π/3)r².
Substitute into the objective function to give
V = (4/3)πr³ + (A/6 − (2π/3)r²)^{3/2}.
Then
dV/dr = 4πr² + (3/2)(A/6 − (2π/3)r²)^{1/2} (−(4π/3)r)
     = 4πr² − 2πr (A/6 − (2π/3)r²)^{1/2}
     = 2πr {2r − (A/6 − (2π/3)r²)^{1/2}} = 0,      (4.2.1)
implying either
r = 0
or
2r − (A/6 − (2π/3)r²)^{1/2} = 0.      (4.2.2)
From (4.2.2) we have
4r² = A/6 − (2π/3)r²,  i.e.  (4 + 2π/3)r² = A/6.
So
r² = (A/6) · 3/(12 + 2π) = A/(4(6 + π)),  i.e.  r = (1/2)(A/(6 + π))^{1/2}.      (4.2.3)
We check the second derivative at r = 0 and at r = (1/2)(A/(6 + π))^{1/2}. Differentiating (4.2.1),
d²V/dr² = 8πr − 2π{A/6 − (2π/3)r²}^{1/2} + (4π²r²/3){A/6 − (2π/3)r²}^{−1/2}.
(i) At r = 0:
d²V/dr²|_{r=0} = −2π(A/6)^{1/2} < 0  ⟹ maximum.
Therefore, at r = 0 we have
x² = A/6 − (2π/3)r²|_{r=0} = A/6,  i.e.  x = (A/6)^{1/2}.
This gives the stationary-point solution of Problem 1 (the boundary case x = 0 should also be checked; see the Lagrangian treatment below).
(ii) At r = (1/2)(A/(6 + π))^{1/2}: here (A/6 − (2π/3)r²)^{1/2} = x = 2r, so
d²V/dr² = 8πr − 2π(2r) + (4π²r²/3)/(2r) = 4πr + (2π²/3)r > 0  ⟹ minimum,
and
x² = A/6 − (2π/3)r² = A/6 − (π/6) A/(6 + π) = (A(6 + π) − πA)/(6(6 + π)) = A/(6 + π).
So r = (1/2)(A/(6 + π))^{1/2} and x = (A/(6 + π))^{1/2} is the solution of Problem 2.

2.2 Lagrangian multipliers for equality constraints

Let us discuss the following problem:
min f(x),   x = [x_1, ..., x_n]^T,
subject to g_i(x) = 0,   i = 1, ..., m.

Implicit Function Theorem. If the rank of the Jacobian
J_m(x) = [∂g_i/∂x_j],   i = 1, ..., m, j = 1, ..., n,
evaluated at x = x^(0), is equal to m, then there exists a set of m functions φ_i, i = 1, ..., m, which are unique, continuous and differentiable in some neighbourhood of x^(0), such that
x_i = φ_i(x_{m+1}, ..., x_n),   i = 1, ..., m,
in some neighbourhood of x^(0).

Return to our problem:
min f(x),   x = [x_1, ..., x_n]^T,   subject to g_i(x) = 0,   i = 1, ..., m < n.
Suppose that the g_i satisfy the condition of the Implicit Function Theorem. Then the first m variables x_1, ..., x_m can, in principle, be eliminated by using the constraints
g_i(x_1, ..., x_n) = 0,   i = 1, ..., m.
Therefore, there is a function h(x_{m+1}, ..., x_n) such that
f(x_1, ..., x_n) = h(x_{m+1}, ..., x_n),
where x_{m+1}, ..., x_n are independent variables.
Now, instead of minimizing f, we minimize h with respect to x_{m+1}, ..., x_n. Clearly this is an unconstrained minimization problem. Thus, we know that the first partial derivatives of h at
x̂ = (x*_{m+1}, ..., x*_n)
must vanish if x̂ is a local minimum. Let us write this in the form of a total differential:
dh(x̂) = Σ_{j=m+1}^n (∂h(x̂)/∂x_j) dx_j = 0.      (4.2.4)
However, df = dh. We thus have from the above two equalities
df = Σ_{j=1}^n (∂f(x*)/∂x_j) dx_j = 0,      (4.2.5)
where x* = [x*_1, ..., x*_m, x̂].
Let us now consider the total differential of the constraints
g_i(x) = 0,   i = 1, ..., m,      (4.2.6)
at x*. It is
dg_i(x*) = Σ_{j=1}^n (∂g_i(x*)/∂x_j) dx_j = 0,   i = 1, ..., m.      (4.2.7)
Multiplying each of the equations in (4.2.7) by an associated Lagrange multiplier λ_i and subtracting from (4.2.5), we obtain
df(x*) − Σ_{i=1}^m λ_i dg_i(x*) = Σ_{j=1}^n {∂f(x*)/∂x_j − Σ_{i=1}^m λ_i ∂g_i(x*)/∂x_j} dx_j = 0.      (4.2.9)
Choose λ_i, i = 1, ..., m, such that
∂f(x*)/∂x_j − Σ_{i=1}^m λ_i ∂g_i(x*)/∂x_j = 0,   j = 1, ..., m.      (4.2.10)
Substituting (4.2.10) into (4.2.9), we have
Σ_{j=m+1}^n {∂f(x*)/∂x_j − Σ_{i=1}^m λ_i ∂g_i(x*)/∂x_j} dx_j = 0.      (4.2.11)
Since the dx_j, j = m+1, ..., n, are independent, we have
∂f(x*)/∂x_j − Σ_{i=1}^m λ_i ∂g_i(x*)/∂x_j = 0,   j = m+1, ..., n.      (4.2.12)
Thus, we conclude that the necessary conditions for a local minimum are:
g_i(x*) = 0,   i = 1, ..., m,      (4.2.13)
∂f(x*)/∂x_j − Σ_{i=1}^m λ_i ∂g_i(x*)/∂x_j = 0,   j = 1, ..., n.      (4.2.14)
Remark. These necessary conditions can also be obtained easily as follows. Define
F(x, λ) = f(x) − Σ_{i=1}^m λ_i g_i(x).
Then
∂F/∂x_j = ∂f/∂x_j − Σ_i λ_i ∂g_i/∂x_j = 0,   j = 1, ..., n,
and
∂F/∂λ_i = −g_i(x) = 0,   i = 1, ..., m.
These coincide with (4.2.14) and (4.2.13). But we must note that the approach given in this Remark is not a proof.

Examples. The same as the ones given before:
1. min V = (4/3)πr³ + x³ subject to 4πr² + 6x² = A = constant, r ≥ 0, x ≥ 0.
2. max V = (4/3)πr³ + x³ subject to 4πr² + 6x² = A = constant, r ≥ 0, x ≥ 0.
Solution. The region is defined by
4πr² + 6x² = A,   r ≥ 0, x ≥ 0,
which is closed, and V is a smooth function of x and r in this region. Maximum and minimum points can occur at the boundaries or where
∂V/∂r = ∂V/∂x = 0.
Introduce the Lagrangian L = V − λg, where g = 4πr² + 6x² − A. We have
L = (4/3)πr³ + x³ − λ(4πr² + 6x² − A).
The necessary conditions are:
∂L/∂r = 4πr² − 8πλr = 0,
∂L/∂x = 3x² − 12λx = 0,
∂L/∂λ = −(4πr² + 6x² − A) = 0.

From the first two we have
r = 0 or λ = r/2;   x = 0 or λ = x/4.
Let us consider these separately.
(i) If r = 0 ⟹ x = (A/6)^{1/2} ⟹ V = (A/6)^{3/2}.
(ii) If x = 0 ⟹ r = (A/(4π))^{1/2} ⟹ V = (4/3)π(A/(4π))^{3/2} = A^{3/2}/(6π^{1/2}).
(iii) If r ≠ 0 and x ≠ 0 ⟹ λ = r/2 = x/4, i.e. r = 2λ and x = 4λ. Then
4π(2λ)² + 6(4λ)² − A = 0,   i.e.   16πλ² + 96λ² = A,
so λ = (1/4)(A/(π + 6))^{1/2}, and hence
x = (A/(π + 6))^{1/2},   r = (1/2)(A/(π + 6))^{1/2},
V = A^{3/2}/(6(π + 6)^{1/2}).      (4.2.15)
Combining the above cases: since π < 6 < π + 6, we have
A^{3/2}/(6√π) > (A/6)^{3/2} = A^{3/2}/(6√6) > A^{3/2}/(6√(π + 6)),
so the global maximum (Problem 2) is attained in case (ii), and the global minimum (Problem 1) in case (iii).
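The three candidate volumes can be compared numerically; a minimal check assuming NumPy (not part of the original notes), taking A = 1 since the ordering is independent of A.

```python
import numpy as np

A = 1.0
V_cube   = (A / 6.0) ** 1.5                          # case (i):  r = 0
V_sphere = A ** 1.5 / (6.0 * np.sqrt(np.pi))         # case (ii): x = 0
V_mixed  = A ** 1.5 / (6.0 * np.sqrt(6.0 + np.pi))   # case (iii)
print(V_sphere, V_cube, V_mixed)   # 0.0940 > 0.0680 > 0.0551
```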

2.3 Various examples

Example 1. It has been determined that the profit per unit of output in a manufacturing operation depends on both the quality of the workers and on the quality of the maintenance of the machines and other components. Let us define
x_1 = dollars spent on work force per unit of output,
x_2 = dollars spent on maintenance per unit of output.
The profit P is related to these variables by
P = 40x_1 + 55x_2 + 8x_1x_2 − 10x_1² − 6x_2².
In addition, a total of $10 per unit of output is allowed for both work force and maintenance. Determine the optimum levels of manufacturing work force and of maintenance. We therefore wish to solve the following problem:
max P = 40x_1 + 55x_2 + 8x_1x_2 − 10x_1² − 6x_2²
subject to
x_1 + x_2 = 10,   x_1, x_2 ≥ 0.
Note that we have assumed we wish to spend all of the $10 per unit that is available. It might, in certain cases, turn out that it is not optimal to spend it all. Hence, it might be preferable to specify the constraint as
x_1 + x_2 ≤ 10.
We shall hope the solution satisfies these constraints for this problem. We form the Lagrangian function
F = 40x_1 + 55x_2 + 8x_1x_2 − 10x_1² − 6x_2² + λ[10 − x_1 − x_2].
The necessary conditions are
∂F/∂x_1 = 40 + 8x_2 − 20x_1 − λ = 0,
∂F/∂x_2 = 55 + 8x_1 − 12x_2 − λ = 0,
∂F/∂λ = 10 − x_1 − x_2 = 0.
Solving this linear system, we obtain
x_1 = 3.854,   x_2 = 6.146
as the optimal allocation of the expenditure of $10. The value of the maximum profit is
P = $306.51.
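The three necessary conditions of Example 1 are linear in (x_1, x_2, λ), so they can be solved directly; a check assuming NumPy.

```python
import numpy as np

M = np.array([[-20.0,   8.0, -1.0],   # 40 + 8 x2 - 20 x1 - lambda = 0
              [  8.0, -12.0, -1.0],   # 55 + 8 x1 - 12 x2 - lambda = 0
              [  1.0,   1.0,  0.0]])  # x1 + x2 = 10
rhs = np.array([-40.0, -55.0, 10.0])
x1, x2, lam = np.linalg.solve(M, rhs)
P = 40*x1 + 55*x2 + 8*x1*x2 - 10*x1**2 - 6*x2**2
print(x1, x2, P)                      # ~ 3.854, 6.146, 306.5
```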

Example 2. Consider the problem of finding the most profitable level of production of an entrepreneur who produces two goods or outputs Q1 and Q2 with a single input X, and let
x = h(q_1, q_2)
be the production function, i.e., the cost of production in terms of X is a function h of the quantities of the two outputs. A problem that is frequently encountered in economics is one in which the entrepreneur wishes to maximize his revenue
R = p_1 q_1 + p_2 q_2
with a given level of input x = x_0, where p_1 and p_2 are the unit prices of Q1 and Q2 respectively. Each is assumed to be given under purely competitive markets.

Example 3. Suppose that the output q of a firm is related to the inputs of labour l and capital k by the function
q = 120 l^{1/2} k^{1/2}.
We wish to find the cost-minimizing input levels for a given output level q if the price (rental) of capital is r and the wage rate is w. The total cost c is
c = wl + rk,
which is to be minimized, subject to
g(l, k) = 120 l^{1/2} k^{1/2} − q = 0.

Example 4 (Peak load pricing). With rising fuel costs, many utilities, such as the Western Power, may find it worthwhile to employ a pricing system in which customers are charged a higher rate during peak periods of use and a lower rate during off-peak periods. Disregarding the energy conservation arguments, the company is interested in knowing whether such a system maximizes profit. Let
q_1, q_2, ..., q_24 be the quantity (of electricity, say) sold or demanded during each of the 24 hours of the day,
p_1, ..., p_24 the corresponding price per unit,
y the hourly output capacity,
c(q_1, ..., q_24) the daily total operating cost, and
g(y) the daily cost of capital (capacity).
The problem is then to maximize the daily profit
z = Σ_{i=1}^{24} p_i q_i − c(q_1, ..., q_24) − g(y)
subject to 0 < q_i ≤ y, i = 1, ..., 24.
Notice that we assume that each q_i > 0, that is, some quantity is sold or demanded each hour of the day, and that y > 0. We also assume that each p_i charged is independent of the output sold.

Example 5 (Allocating joint costs). It often happens that a raw material is used simultaneously in the manufacture of several products. For instance, say beans are cooked and the juice coagulated and made into bean curd. The pulp is used either as hog feed or fertilizer. How does one allocate the joint cost of these products? To be more precise, suppose soy beans cost $10 per 100-lb bag. Let
B = number of bags of soy beans purchased,
q_1 = number of units of bean curd obtained from the B bags of beans,
q_2 = number of units of fertilizer,
p_1 = unit price of bean curd,
p_2 = unit price of fertilizer.
We wish to maximize the profit
z = p_1 q_1 + p_2 q_2 − 10B
subject to
0 < q_i ≤ B, i = 1, 2.
Notice again that we assume that we buy some beans and get bean curd and fertilizer as products. Suppose the demand functions are
p_1 = 40 − 2q_1 and p_2 = 20 − q_2.
Then
z = 40q_1 − 2q_1² + 20q_2 − q_2² − 10B.

Example 6. Let us consider a more complicated joint cost problem. Suppose one raw material R is used in the production of three products A, B, and C. To produce B and C requires a special processing of R. Let
x_1, x_2, and x_3 be the quantities of A, B, and C produced,
f_1(x_1), f_2(x_2), and f_3(x_3) be the respective price functions,
R = amount of raw material bought,
y = amount of R processed for the production of B and C.
Suppose further that the unit cost of R is $10 and the (joint) cost of processing each unit of y is $4. Our problem is to maximize
z = Σ_{i=1}^3 x_i f_i(x_i) − 10R − 4y
subject to
0 < x_1 ≤ R,   0 < x_2 ≤ y,   0 < x_3 ≤ y,   0 < y ≤ R.

Example 6 (A pay-off period theory of investment). It is common practice in the investment decision process to limit investment outlays to those projects that will return the investment within a specified pay-off period. Suppose a firm is considering n alternative investment projects, and that the ith project yields a discounted expected profit flow P_i(K_i), where K_i is the investment expenditure on the ith project. Suppose further that the firm has a fixed amount of capital C to distribute among the n projects. The problem is to maximize
P = Σ_{i=1}^n P_i(K_i)
subject to
Σ_{i=1}^n K_i ≤ C,   K_i ≥ 0, i = 1, ..., n.

Example 7 (Cost minimization under capital rationing). Suppose that two inputs, one current (x_1) and one capital (x_2), are involved in a productive process and that the firm has a ration of money capital available which cannot exceed a fixed amount, say K units. Let
w_1 = price of each unit of current input,
w_2 = price of each unit of capital input,
y = output.
Then we wish to minimize the total cost
C = w_1 x_1 + w_2 x_2
subject to
y = f(x_1, x_2),   w_2 x_2 ≤ K,   x_1 > 0, x_2 > 0.

Example 8. To maximize and minimize the distance from the origin to the ellipse
g(x, y) = 2x² + 3xy + 2y² − 4 = 0,
we determine the extremal values of the function
f(x, y) = x² + y²
subject to the constraint g(x, y) = 0.

Example 9 (The design of an electrical transformer). The design of an air-cooled, two-winding transformer rated at 540 VA, 110/220 volts, 50 cps will be stated. The objective to be minimized is the sum of initial cost and operating cost (cost of materials, winding material, core loss and conductor loss). We define the following variables:
VA = combined rating of windings,
x_1 = width of core leg,
x_2 = width of winding window,
x_3 = height of winding window,
x_4 = thickness of core,
x_5 = magnetic flux density,
x_6 = current density.
The design problem can then be stated as follows:
min Z = x_1 x_4 (x_1 + x_2 + x_3) + x_2 x_3 (x_1 + 1.57x_2 + x_4)
      + x_1 x_4 (x_1 + x_2 + x_3) x_5² + x_2 x_3 (x_1 + 1.57x_2 + x_4) x_6²
subject to
x_1 x_2 x_3 x_4 x_5 x_6 − 1080 = 0,
x_1 x_4 (x_1 + x_2 + x_3) x_5² + x_2 x_3 (x_1 + 1.57x_2 + x_4) x_6² − 28 ≤ 0,
[x_1 x_4 (x_1 + x_2 + x_3) x_5² + x_2 x_3 (x_1 + 1.57x_2 + x_4) x_6²] / [2x_1 (2x_1 + 4x_2 + 2x_3 + 3x_4) + 4x_2 (1.57x_2 + 1.57x_3 + x_4) + 2x_3 x_4] − 0.16 ≤ 0,
x_1, x_2, x_3, x_4, x_5, x_6 ≥ 0.
The derivation of the functional relationships used in this example is given by Schinzinger in Lavi, A., and T. P. Vogl (eds.): Recent Advances in Optimization Techniques, Wiley, New York, 1966.

Example 10 (Minimum weight design of a rotating disc). In a paper in [Fletcher, R. (ed.), Optimization, Academic Press, New York, 1969], de Silva describes the application of nonlinear programming to the optimal design of a steam turbine disc. That paper should be consulted for the details of the analysis; what is produced by the analysis can, however, be described here, and it illustrates a typical structural design application of nonlinear programming.
In what follows, the values of a_j, b_j, j = 1, 2, ..., m, are various dimensions of a discretized (parallel thickness) approximation to the actual curved turbine disc. For example, b_m is the fixed rim thickness and the b_j give the disc profile thicknesses. In the problem under consideration, the following assumptions were made:
a_j < a_{j+1},   j = 1, 2, ..., m − 1;
b_1 = b_2 and fixed;   b_{m−1} = b_m and fixed;
a_2 variable, all other a_k fixed;
b_j variable,   j = 3, ..., m − 2.
The design variables for the problem are then x = [b_3, ..., b_{m−2}, a_2]^T, and we have constraints
l ≤ x ≤ u,
where
l = [1, ..., 1, a_1 + ε_3]^T and u = [∞, ..., ∞, a_3 − ε_2]^T.

The objective function is derived in the paper to be
W = (1/3) Σ_{j=3}^{m−2} (a_{j+1} − a_{j−1})(a_{j+1} + a_j + a_{j−1}) b_j
  + (1/3) b_1 (3a_1² + a_2² + a_3² + a_2 a_3)
  + (1/3) b_m (3a_m² − a_{m−1}² − a_{m−2}² − a_{m−1} a_{m−2}).
There are also certain behavioural constraints relating to stresses which must be taken into account. Let us call these behavioural variables y(x); the components of y(x) are stresses. The constraints on them are given by
L ≤ y(x) ≤ U,
where
L = 0,   U = [σ_0, σ_0, ..., σ_0]^T,   σ_0 = critical stress.
The nonlinear programming problem which is to be solved to determine the minimum weight is to find an x such that W is minimized subject to
l ≤ x ≤ u,   L ≤ y(x) ≤ U.

Example 11 (A problem in optimal investment). Suppose an investor has $100,000 to invest on the stock market. Suppose further that the investor decides to view the future yields of the various stocks in probabilistic terms. He might assume, for example, the rate of return on investment in the ith security to be normally distributed with expected rate of return μ_i and variance σ_ii. A security with a high variance σ_ii is considered risky. One might invest in two securities i and j, and the covariance σ_ij of the returns measures the correlation between the rates of return. An obvious approach to investment in order to decrease risk is to hedge, i.e., to try to offset an unfavourable return on i by a favourable return on j.
Let us assume then that an investor has established, as well as he can, values of μ_i, σ_ii, σ_ij for his stocks. We will now assume that among alternatives with equal rates of return an investor will prefer the one with the smallest variance, and that among alternatives with equal variances, an investor will prefer the one with the greatest rate of return.
Let us now consider our investor with $100,000. He wishes to determine values of x_i, the fraction of his total investment to be invested in the ith security, for all i. If we assume that our investor has a certain coefficient of risk aversion λ ≥ 0, then the portfolio maximization problem becomes
max R = Σ_{i=1}^N μ_i x_i − λ Σ_{i=1}^N Σ_{j=1}^N σ_ij x_i x_j
subject to
Σ_{i=1}^N x_i = 1,   x_i ≥ 0, i = 1, ..., N.
If λ = 0, an investor has no risk aversion, i.e., we simply maximize the expected rate of return. If λ → ∞, the problem becomes one of minimizing the variance of the rate of return. Intermediate values of λ provide a balance between these extreme attitudes towards risk.
This general approach to portfolio selection was first suggested by Markowitz in:
Markowitz, H. M., Portfolio Selection, Wiley, New York, 1959.

Example 12. It is desired to find the dimensions of a closed cylindrical tank with a fixed (given) volume V_c which has the minimum surface area A. The area A of this tank is given by
A = 2πr² + 2πrl,
where r = the radius and l = the length of the cylinder, respectively. We wish to determine these quantities subject to the restriction that the volume is V_c = πr²l. Hence our problem is to
min A = 2πr² + 2πrl
subject to
πr²l = V_c.

Example 13. In a problem arising in electrical engineering in the study of encoding sets of analog messages, an optimization problem of the following sort arises:
min E = Σ_{j=1}^n a_j 2^{x_j} w_j
subject to
Σ_{j=1}^n x_j = c.
The a_j, w_j, and c are known constants. Find values of x_j which cause E to be a minimum subject to the given constraint.

Example 14. A problem in electrical networks asks us to
min z = Σ_{k=1}^n S_k² x_k²
subject to
Σ_{k=1}^n x_k = b,
where the S_k and b are known constants.

Example 15. The demand for oxygen in a plant is cyclic, approximately every hour. In the design of an oxygen production system, the problem the designer faces is essentially to find the optimal capacities of three machines: an oxygen producing machine, a compressor, and an inventory storage unit. The three combine to make the oxygen production system. Therefore, what we wish to find is the output level O, the compressor motor capacity H, and the inventory storage capacity V. There are economic and physical relationships which express O, H, and V as functions of the oxygen production rate per unit of time P and the maximum pressure of the compressor p. Typical cost relationships can be found in Jen et al. (1968). In order to minimize total cost we need to solve a problem of the form
min C = a_1 + a_2 P + b_1 [b_2 (D_1 − P)(t_2 − t_1)]^{b_3} + b_4 (D_1 − P)(t_2 − t_1) kRT ln(p/p_0) + c_1 [c_2 (I_m G R T / k) ln(p/p_0)]^{c_3}
subject to
P ≥ D_av,   p ≥ p_0.
All parameters except p and P in the objective function are either known constants or operating conditions which are chosen.
Jen, F. C., C. C. Pegels, and T. M. Dupuis, Optimal Capacities of Production Facilities, Management Science, 4: B-573-580, 1968.

Example 16 (An animal breeder's problem)

A breeder of laboratory animals in a highly competitive market has found that his best animals grow on a diet which is particularly sensitive to three nutritional elements, which we will call vitamins, protein, and minerals. He has found that the minimum daily diet for his animals is at least 10 mg of vitamins, at least 80 g of protein, and at least 2 g of minerals. There are six basic feed materials that can be used to satisfy these requirements. The contents of each are given in Table 1.
The animal breeder is in a difficult competitive situation for these feed materials. If he uses up his reserves of them at any time, he must purchase more on the open market, which causes increasing prices because of decreasing supply. Hence, the cost of these feed materials is generally a quadratic function of the amount purchased. Therefore, we assume a cost function of the form
z = c'x + x'Dx,
where c = [2, 3, 1, 10, 5, 7]' and c_j is the cost in dollars of feed material j. The matrix D is

D = [ 5  10  0  1  0   3
     10   3  0  2  0   0
      0   0  1  2  0   1
      1   2  2  6  3   0
      0   0  0  3 10   1
      3   0  1  0  1  20 ],

where the d_ij are the dollars per pound of nutrient i in feed material j.
In order to find the minimum cost diet for the animals we need to solve the problem
min z = c'x + x'Dx
subject to
Ax ≥ b,   x ≥ 0,
where c and D are given above. From Table 1 and the previous problem statement, we see that b = [10, 80, 2]' and

A = [ 1  21  40  65  10   0
      3 500  60  20  40 700
     10   5   3   0  15   5 ].

Table 1. Content of nutrient per 100 lb of feed material
Feed Material | Vitamins | Protein | Minerals
1             |     1    |     3   |    10
2             |    21    |   500   |     5
3             |    40    |    60   |     3
4             |    65    |    20   |     0
5             |    10    |    40   |    15
6             |     0    |   700   |     5

Example 17 (An inventory problem for a small business)
Consider a retail store which stocks and sells three different models of lawnmowers. The store owner cannot afford to have an inventory on hand worth more than $15,000 at any time. We shall assume that the past several years have given him enough experience so that he can predict his demand sufficiently accurately to treat this problem deterministically, i.e., without attempting to introduce probabilistic complications.
The lawnmowers are ordered in lots. If the models are numbered 1, 2, 3, then Q_j is the order quantity for model j. The store owner calculates his carrying charge on each item as I = 0.25. This means that the rate at which inventory costs accumulate is proportional to the investment in inventory at that time; I is the constant of proportionality, with units of dollars per year per dollar of investment in inventory. We define further
A_j = fixed cost of ordering a lot of model j,
c_j = cost of one unit of model j,
λ_j = demand rate (units per year) for model j.
The data the store owner has are shown in Table 1. We noted above that there is a limitation of $15,000 on the total value of the inventory. The store owner has an additional constraint: he has a maximum effective storage space of 6000 ft² in which to store his inventory, and each unit occupies 25 ft².
We can now state the problem of determining what values of Q_1, Q_2, and Q_3 will minimize the average annual cost of ordering and storage subject to the constraints of limited capital and storage space. This problem is
min Z = λ_1 A_1/Q_1 + (1/2) I c_1 Q_1 + λ_2 A_2/Q_2 + (1/2) I c_2 Q_2 + λ_3 A_3/Q_3 + (1/2) I c_3 Q_3
subject to
c_1 Q_1 + c_2 Q_2 + c_3 Q_3 ≤ M,
s_1 Q_1 + s_2 Q_2 + s_3 Q_3 ≤ S.

Table 1. Lawnmower store's costs
Model             |   1  |   2  |   3
Ordering cost A_j |  60  |  80  |  100
Unit cost c_j     |  30  | 110  |  60
Demand rate λ_j   | 800  | 500  | 1500

In terms of our given data, the optimization problem becomes
min Z = 48,000/Q_1 + 3.75Q_1 + 40,000/Q_2 + 13.75Q_2 + 150,000/Q_3 + 7.50Q_3
subject to
30Q_1 + 110Q_2 + 60Q_3 ≤ 15,000,
25Q_1 + 25Q_2 + 25Q_3 ≤ 6000,
Q_1, Q_2, Q_3 ≥ 0.

Example 18. A chemical manufacturing company sells three products and has found that its revenue function is
f = 10x + 4.4y² + 2z,
where x, y, and z are the monthly production rates of each chemical. It is found from break-even charts that it is necessary to impose the following limits on the production rates:
x ≥ 2,   (1/2)z² + y ≥ 3.
In addition, only a limited amount of raw material is available; hence the following restrictions must be imposed upon the production schedule:
x + 4y + 5z ≤ 32,
x + 3y + 2z ≤ 29.
Determine the best production schedule for this company and find the best value of the revenue function.
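Example 18 can also be attacked numerically. The following is a sketch assuming SciPy is available (not part of these notes) and reading the partly illegible second restriction as z²/2 + y ≥ 3; in SciPy's convention, 'ineq' constraints require fun(x) ≥ 0, and maximization is done by minimizing the negative revenue.

```python
import numpy as np
from scipy.optimize import minimize

revenue = lambda v: 10*v[0] + 4.4*v[1]**2 + 2*v[2]
cons = [{'type': 'ineq', 'fun': lambda v: v[0] - 2},
        {'type': 'ineq', 'fun': lambda v: 0.5*v[2]**2 + v[1] - 3},
        {'type': 'ineq', 'fun': lambda v: 32 - (v[0] + 4*v[1] + 5*v[2])},
        {'type': 'ineq', 'fun': lambda v: 29 - (v[0] + 3*v[1] + 2*v[2])}]
res = minimize(lambda v: -revenue(v), x0=np.array([2.0, 3.0, 1.0]),
               method='SLSQP', constraints=cons)
print(res.x, -res.fun)   # a local maximizer; try several starting points
```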

3 Inequality constraints

Consider the nonlinearly constrained optimization problem:
min f(x)
subject to
h_i(x) = 0,   i = 1, ..., m,
g_i(x) ≤ 0,   i = 1, ..., r,
where the functions f, h_i, i = 1, ..., m, and g_i, i = 1, ..., r, are assumed to have continuous first derivatives and satisfy certain regularity conditions. These regularity conditions are called the constraint qualification by Kuhn and Tucker in their original paper. In essence, the constraint qualification rules out rare situations on the boundaries of the feasible region which might invalidate the conditions stated in their theorem.

Definition. Let x* be a point satisfying the constraints
h_i(x*) = 0,   g_j(x*) ≤ 0
for i = 1, 2, ..., m and j = 1, 2, ..., r. Then x* is said to be a regular point of the constraints in the above problem if ∇h_i(x*), ∇g_j(x*), i = 1, 2, ..., m, j = 1, 2, ..., r, are linearly independent.

Kuhn-Tucker Conditions (first order necessary conditions)
Let x* be a local optimal solution, and suppose x* is a regular point for the constraint set. Then there is a vector λ* ∈ R^m and a vector μ* ∈ R^r with μ* ≥ 0 such that
∇f(x*) + λ*^T ∇h(x*) + μ*^T ∇g(x*) = 0,
μ*^T g(x*) = 0,
where
λ*^T ∇h(x*) = Σ_{i=1}^m λ_i* ∇h_i(x*),   μ*^T ∇g(x*) = Σ_{i=1}^r μ_i* ∇g_i(x*).

NOTE: The Kuhn-Tucker conditions can only be used to generate optimal solutions in low dimensional problems.

Example. Use the Kuhn-Tucker conditions to solve
min 2x_1² + 2x_1x_2 + x_2² − 10x_1 − 10x_2
subject to
x_1² + x_2² ≤ 5,   3x_1 + x_2 ≤ 6.
Solution. From the Kuhn-Tucker conditions we have
[4x_1 + 2x_2 − 10, 2x_1 + 2x_2 − 10]^T + μ_1 [2x_1, 2x_2]^T + μ_2 [3, 1]^T = [0, 0]^T,
μ_1 (x_1² + x_2² − 5) + μ_2 (3x_1 + x_2 − 6) = 0,   μ_1, μ_2 ≥ 0.
The last equation is equivalent to
μ_1 (x_1² + x_2² − 5) = 0,   μ_2 (3x_1 + x_2 − 6) = 0.
Now we have four equations with four unknowns. We can choose various combinations of active constraints and check the signs of the resulting Lagrange multipliers. For example, we assume that the first constraint is active (i.e. μ_1 ≠ 0) and the second is inactive (i.e. μ_2 = 0). Then we have
4x_1 + 2x_2 − 10 + 2μ_1 x_1 = 0,
2x_1 + 2x_2 − 10 + 2μ_1 x_2 = 0,
x_1² + x_2² − 5 = 0.
The solution to this system is
x_1 = 1,   x_2 = 2,   μ_1 = 1.
Substituting this into the second constraint, we have
3x_1 + x_2 = 5 < 6,
which is satisfied. Therefore, the above is a solution. (There may be other solutions.)
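A quick check, assuming NumPy, that (x_1, x_2) = (1, 2) with μ = (1, 0) satisfies the Kuhn-Tucker conditions of the example.

```python
import numpy as np

x = np.array([1.0, 2.0])
mu = np.array([1.0, 0.0])
grad_f = np.array([4*x[0] + 2*x[1] - 10, 2*x[0] + 2*x[1] - 10])
grad_g1 = np.array([2*x[0], 2*x[1]])
grad_g2 = np.array([3.0, 1.0])
print(grad_f + mu[0]*grad_g1 + mu[1]*grad_g2)       # stationarity: [0, 0]
print(x[0]**2 + x[1]**2 - 5, 3*x[0] + x[1] - 6)     # g1 = 0 (active), g2 < 0
```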
3.1 Penalty Function Methods

Let us first consider a simple optimization problem subject to an inequality constraint:
min f(x) subject to h(x) ≤ 0.
Define the penalty function
P(x) = 0 if h(x) ≤ 0;   P(x) = (h(x))² if h(x) > 0.
- P(x) has continuous first derivatives if h(x) has.
- ∂/∂x_j (h(x))² = 2 h(x) ∂h(x)/∂x_j.
- ∂P(x)/∂x_j = 0 if h(x) < 0, and ∂P(x)/∂x_j = 2 h(x) ∂h(x)/∂x_j if h(x) > 0.
But if ∂h(x)/∂x_j is continuous at h(x) = 0, then
2 h(x) ∂h(x)/∂x_j = 0 at h(x) = 0.
This, in turn, implies that ∂P(x)/∂x_j = 0 at h(x) = 0.
We then consider the objective function
F(x) = f(x) + M P(x),   M > 0,
with no constraint.
In the limit as M → ∞:
M P(x) = 0 for h(x) ≤ 0;   M P(x) → ∞ if h(x) > 0.
Hence, in the limit as M → ∞:
- the minimum of F(x) is inside the feasible region if the minimum of f(x) is inside;
- the minimum of F(x) is on the boundary if the minimum of f(x) is outside the feasible region.
We now consider a simple optimization problem subject to an equality constraint:
min f(x) subject to g(x) = 0.
Introduce the penalty function
P(x) = (g(x))².
Then we consider the objective function
F(x) = f(x) + M P(x).
Clearly, in the limit as M → ∞, the minimum of F(x) is on g(x) = 0.
We now return to the general problem:
min f(x)
subject to
h_ν(x) ≤ 0, ν = 1, ..., r,   g_μ(x) = 0, μ = 1, ..., m.
We replace the problem by a sequence of unconstrained optimization problems, depending on M_ν, ν = 1, ..., r, and M̄_μ, μ = 1, ..., m:
min F(x) = f(x) + Σ_{ν=1}^r M_ν P_ν(x) + Σ_{μ=1}^m M̄_μ P̄_μ(x).
For each set of constants M_ν and M̄_μ, the corresponding unconstrained optimization problem can be solved by the methods presented in the previous chapter. However, for large M_ν and M̄_μ, F(x) has deep narrow valleys along the original constraint lines, and all search methods work badly. Thus, in practice, we should proceed iteratively as follows:
Step 1. Let all M's and M̄'s = 0. Find the unconstrained minimum of F(x) = f(x), and check which constraints are satisfied, i.e. which are inactive.
Step 2. Take moderate values of the M's and M̄'s for the active (violated) constraints.
Step 3. Re-solve to get a new minimum of the corresponding F(x).
Step 4. Check for inactive constraints. Increase the values of those M's and M̄'s which correspond to active constraints.
Step 5. Re-solve. The process is continued until all constraints are satisfied within acceptable error.
NOTE: For large M's and M̄'s we still get steep valleys, but we have a good estimate of the minimum by then, so search methods will work well.
WARNING: The methods find local minima. So check for others: use different starting points and re-solve.

Example. min f(x) = −x² + 3x − 4
subject to
0 ≤ x ≤ 4,  i.e.  −x ≤ 0 and x − 4 ≤ 0.
F(x) = −x² + 3x − 4 + M_1 (max{0, −x})² + M_2 (max{0, x − 4})².
Step 1. Unconstrained problem, M_1 = M_2 = 0:
F(x) = −x² + 3x − 4 = −(x − 3/2)² + 9/4 − 4 = −(x − 3/2)² − 7/4,
always negative, with F → −∞ as |x| → ∞, so the minimum is at x = ±∞ and the constraints are active.
Step 2. Set
dF/dx = −2x + 3 − 2M_1 max{0, −x} + 2M_2 max{0, x − 4} = 0.
Case 1. −x < 0, x − 4 < 0 (both penalties inactive):
−2x + 3 = 0 ⟹ x = 3/2 and f(3/2) = −7/4 (an interior maximum of f, not a minimum).
Case 2. −x ≥ 0, x − 4 < 0:
−2x + 3 + 2M_1 x = 0 ⟹ (2M_1 − 2)x = −3 ⟹ x = −3/(2M_1 − 2) → 0 as M_1 → ∞
⟹ f(0) = −4.
Case 3. −x < 0, x − 4 ≥ 0:
−2x + 3 + 2M_2 (x − 4) = 0 ⟹ (2M_2 − 2)x = 8M_2 − 3 ⟹ x = (8M_2 − 3)/(2M_2 − 2) → 4 as M_2 → ∞
(all constraints satisfied) ⟹ f(4) = −8.
Case 4. −x ≥ 0 and x − 4 ≥ 0: impossible.
CONCLUSION: minimum at x = 4 with f(4) = −8.
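A sketch of the penalty iteration for this example, assuming NumPy: the unconstrained minimizer of F is approximated on a grid and the penalty weight is increased, so that the minimizer can be watched approaching x = 4.

```python
import numpy as np

f = lambda x: -x**2 + 3*x - 4
xs = np.linspace(-2.0, 8.0, 200001)
for M in [2.0, 10.0, 100.0, 1000.0]:
    F = (f(xs) + M * np.maximum(0.0, -xs)**2
               + M * np.maximum(0.0, xs - 4.0)**2)
    print(M, xs[np.argmin(F)])   # 6.5, 4.28, 4.03, 4.003 -> approaches 4
```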
Example.
min f(x) = −x₁² − 2x₂² + x₁x₂
subject to
  2x₁ + x₂ = 0
  x₁ ≤ 2 ⟹ x₁ − 2 ≤ 0
  x₂ ≤ 1 ⟹ x₂ − 1 ≤ 0
⟹
F(x) = −x₁² − 2x₂² + x₁x₂ + M̄₁(2x₁ + x₂)² + M₁(max{0, x₁ − 2})² + M₂(max{0, x₂ − 1})².

Step 1. Unconstrained problem: M̄₁ = 0, M₁ = M₂ = 0 ⟹
F(x) = −x₁² − 2x₂² + x₁x₂ = −(x₁ − ½x₂)² − (7/4)x₂²,
always negative ⟹ minimum at x = ∞. Violates constraints.

Step 2.
∂F/∂x₁ = −2x₁ + x₂ + 4M̄₁(2x₁ + x₂) + 2M₁ max{0, x₁ − 2} = 0   (A)
∂F/∂x₂ = −4x₂ + x₁ + 2M̄₁(2x₁ + x₂) + 2M₂ max{0, x₂ − 1} = 0   (B)

Case 1. x₁ − 2 < 0, x₂ − 1 < 0:
(A) ⟹ −2x₁ + x₂ + 4M̄₁(2x₁ + x₂) = 0
(B) ⟹ x₁ − 4x₂ + 2M̄₁(2x₁ + x₂) = 0
In the limit as M̄₁ → ∞ we have 2x₁ + x₂ = 0, and hence also
−2x₁ + x₂ = 0 and x₁ − 4x₂ = 0 ⟹ x₁ = x₂ = 0 ⟹ f(0, 0) = 0.

Case 2. x₁ − 2 ≥ 0, x₂ − 1 < 0:
(A) ⟹ −2x₁ + x₂ + 4M̄₁(2x₁ + x₂) + 2M₁(x₁ − 2) = 0
(B) ⟹ x₁ − 4x₂ + 2M̄₁(2x₁ + x₂) = 0
⟹
(8M̄₁ + 2M₁ − 2)x₁ + (4M̄₁ + 1)x₂ = 4M₁   (1)
(4M̄₁ + 1)x₁ + (2M̄₁ − 4)x₂ = 0   (2)
(1)·(2M̄₁ − 4) − (2)·(4M̄₁ + 1) ⟹
[(2M̄₁ − 4)(8M̄₁ + 2M₁ − 2) − (4M̄₁ + 1)²] x₁ = 4M₁(2M̄₁ − 4)
⟹ x₁ = (8M₁M̄₁ − 16M₁)/(4M₁M̄₁ − 44M̄₁ − 8M₁ + 7).
Try to vary M̄₁ and M₁ independently. Holding M₁ fixed and letting M̄₁ → ∞,
x₁ → 8M₁/(4M₁ − 44) = 2M₁/(M₁ − 11) → 2 as M₁ → ∞,
and from (2), x₂ = −(4M̄₁ + 1)x₁/(2M̄₁ − 4) → −2x₁ → −4
(all constraints satisfied)
⟹ f(2, −4) = −4 − 2(16) + (2)(−4) = −44.

Case 3. x₁ − 2 < 0, x₂ − 1 ≥ 0:
(A) ⟹ −2x₁ + x₂ + 4M̄₁(2x₁ + x₂) = 0
(B) ⟹ x₁ − 4x₂ + 2M̄₁(2x₁ + x₂) + 2M₂(x₂ − 1) = 0
⟹
(8M̄₁ − 2)x₁ + (4M̄₁ + 1)x₂ = 0   (3)
(4M̄₁ + 1)x₁ + (2M̄₁ + 2M₂ − 4)x₂ = 2M₂   (4)
(3)·(4M̄₁ + 1) − (4)·(8M̄₁ − 2) ⟹
[(4M̄₁ + 1)² − (8M̄₁ − 2)(2M̄₁ + 2M₂ − 4)] x₂ = −2M₂(8M̄₁ − 2)
⟹ x₂ = (4M₂ − 16M̄₁M₂)/(44M̄₁ − 16M̄₁M₂ + 4M₂ − 7).
Fix M₂ and let M̄₁ → ∞: x₂ → 16M₂/(16M₂ − 44) → 1 as M₂ → ∞, and from (3),
x₁ = −(4M̄₁ + 1)x₂/(8M̄₁ − 2) → −x₂/2 → −1/2.
The point (−1/2, 1) satisfies all constraints
⟹ f(−1/2, 1) = −1/4 − 2 − 1/2 = −11/4.

Case 4. x₁ − 2 ≥ 0, x₂ − 1 ≥ 0:
(A) ⟹ (8M̄₁ + 2M₁ − 2)x₁ + (4M̄₁ + 1)x₂ = 4M₁   (5)
(B) ⟹ (4M̄₁ + 1)x₁ + (2M̄₁ + 2M₂ − 4)x₂ = 2M₂   (6)
In the limit this would force x₁ = 2 and x₂ = 1, while the equality penalty forces 2x₁ + x₂ = 0: inconsistent.

CONCLUSION. Comparing f(0, 0) = 0, f(2, −4) = −44 and f(−1/2, 1) = −11/4, the minimum is at (2, −4) with f(2, −4) = −44. Case 4 means that we cannot simultaneously have both inequality constraints active, i.e. x₁ = 2 and x₂ = 1, and also satisfy 2x₁ + x₂ = 0.
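The same conclusion can be checked numerically. A minimal sketch, assuming Nelder-Mead as the search and a schedule of increasing penalty constants (all equal, for brevity); the starting point (1, −1) is an arbitrary choice, and, as in the warning above, other starts may land on the local minimizer (−1/2, 1).

```python
import numpy as np
from scipy.optimize import minimize

def F(x, M):
    f = -x[0]**2 - 2*x[1]**2 + x[0]*x[1]
    return (f + M * (2*x[0] + x[1])**2            # equality 2x1 + x2 = 0
              + M * max(0.0, x[0] - 2)**2         # x1 <= 2
              + M * max(0.0, x[1] - 1)**2)        # x2 <= 1

x = np.array([1.0, -1.0])
for M in [10.0, 100.0, 1e4]:                      # increase M gradually
    x = minimize(F, x, args=(M,), method="Nelder-Mead").x
print(x)    # approaches (2, -4), where f = -44
```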
Quadratic Programming
(QP)
min q(x) = ½xᵀGx + dᵀx
subject to
  aᵢᵀx = bᵢ, i ∈ E
  aᵢᵀx ≥ bᵢ, i ∈ I.

Equality Constraints (I = ∅)
(QPE)
min q(x) = ½xᵀGx + dᵀx
subject to
  Aᵀx = b,
where A has full rank m < n.

Elimination Method
Since A has rank m, one way to solve (QPE) would be to solve the equality constraints to obtain x₁ ∈ ℝᵐ in terms of x₂ ∈ ℝⁿ⁻ᵐ (by Gaussian elimination, say). Partitioning x = (x₁, x₂) and Aᵀ accordingly,
A₁ᵀx₁ + A₂ᵀx₂ = b ⟹ x₁ = A₁⁻ᵀ(b − A₂ᵀx₂).   (1)
Substituting (1) into q(x), (QPE) becomes
min ψ(x₂):
no constraints, quadratic. If ∇²ψ is positive definite, then the minimum is given by the unique point x₂* satisfying ∇ψ(x₂) = 0, and x₁* is found from (1).

Generalized Elimination
Let S and Z be n × m and n × (n − m) matrices such that
AᵀS = I (S is a generalized inverse) and AᵀZ = 0,
and such that [S ⋮ Z] is nonsingular. A solution of Aᵀx = b is given by
x = Sb + δ, where δ = Zy, y ∈ ℝⁿ⁻ᵐ.
The columns of Z act as basis vectors for the null space of Aᵀ, where the null space of Aᵀ is {δ : Aᵀδ = 0}. So
x = Sb + Zy   (2)
is the general solution of Aᵀx = b; (2) is a generalization of (1). Substitution of (2) into q(x) gives
ψ(y) = ½yᵀ(ZᵀGZ)y + (d + GSb)ᵀZy + q(Sb).
If ZᵀGZ is positive definite, then a unique minimizer y* exists which solves the linear system
(ZᵀGZ)y = −Zᵀ(d + GSb).
The solution is obtained by computing LLᵀ or LDLᵀ factors of ZᵀGZ; then x* is determined by substitution into (2). ZᵀGZ is referred to as the reduced Hessian matrix, and Zᵀ(d + GSb) is the reduced gradient.
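The generalized elimination procedure translates almost line by line into code. A minimal sketch, assuming the particular choices S = A(AᵀA)⁻¹ (which satisfies AᵀS = I) and Z from the full QR factorization of A; np.linalg.solve stands in for the LLᵀ factorization, and the test instance is invented.

```python
import numpy as np

def solve_qpe(G, d, A, b):
    """min 0.5 x^T G x + d^T x  s.t.  A^T x = b, via x = S b + Z y."""
    n, m = A.shape
    Q, _ = np.linalg.qr(A, mode="complete")
    Z = Q[:, m:]                          # columns span the null space of A^T
    S = A @ np.linalg.inv(A.T @ A)        # a generalized inverse: A^T S = I
    GR = Z.T @ G @ Z                      # reduced Hessian (assumed pos. def.)
    gR = Z.T @ (d + G @ (S @ b))          # reduced gradient
    y = np.linalg.solve(GR, -gR)
    return S @ b + Z @ y

G = np.array([[2.0, 0.0], [0.0, 2.0]])
d = np.array([-4.0, -5.0])
A = np.array([[1.0], [1.0]])              # constraint x1 + x2 = 1
b = np.array([1.0])
print(solve_qpe(G, d, A, b))              # -> [0.25, 0.75]
```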
LAGRANGIAN METHODS FOR EQUALITY CONSTRAINED PROBLEMS

Quadratic Programming Problem
min Q(x) = ½xᵀGx + dᵀx + Q₀
subject to
  Aᵀx = b.

Lagrangian function:
L(x, λ) = Q(x) − λᵀ(Aᵀx − b).

Equality constraints: the necessary conditions are that
∇ₓL(x*, λ*) = 0 ⟹ Gx* + d − Aλ* = 0,
∇_λL(x*, λ*) = 0 ⟹ Aᵀx* − b = 0.
In matrix form,
[ G   −A ] [ x* ]   [ −d ]
[ Aᵀ   0 ] [ λ* ] = [  b ].
Thus, find x*, λ* by solving this system of linear equations. If G is positive definite and A has full rank, then the coefficient matrix is nonsingular, and as it is square, its inverse exists.

Let S and Z be n × m and n × (n − m) matrices, respectively, such that [S ⋮ Z] is nonsingular and, in addition, AᵀS = I, AᵀZ = 0. Then it can be shown that the solution can be expressed in terms of the matrices
H = Z(ZᵀGZ)⁻¹Zᵀ,
T = S − Z(ZᵀGZ)⁻¹ZᵀGS,
V = SᵀGZ(ZᵀGZ)⁻¹ZᵀGS − SᵀGS,
which, when G is positive definite, can also be written as
H = G⁻¹ − G⁻¹A(AᵀG⁻¹A)⁻¹AᵀG⁻¹,
T = G⁻¹A(AᵀG⁻¹A)⁻¹, i.e. Tᵀ = (AᵀG⁻¹A)⁻¹AᵀG⁻¹,
V = −(AᵀG⁻¹A)⁻¹.
Thus the solution to the problem can be written as
x* = −Hd + Tb,
λ* = Tᵀd − Vb.

In fact, if x^(k) satisfies the constraints,
Aᵀx^(k) = b,
then, by setting
g^(k) = ∇Q(x^(k)),
it can also be shown that the optimal solution can be written as
x* = x^(k) − Hg^(k),
λ* = Tᵀg^(k).
Indeed,
x* = x^(k) − Hg^(k) = x^(k) − H{Gx^(k) + d}
   = x^(k) − HGx^(k) − Hd
   = x^(k) − {G⁻¹ − G⁻¹A(AᵀG⁻¹A)⁻¹AᵀG⁻¹}Gx^(k) − Hd
   = G⁻¹A(AᵀG⁻¹A)⁻¹ Aᵀx^(k) − Hd   [since Aᵀx^(k) = b]
   = G⁻¹A(AᵀG⁻¹A)⁻¹ b − Hd
   = −Hd + Tb,
and
λ* = Tᵀg^(k) = Tᵀ{Gx^(k) + d} = (AᵀG⁻¹A)⁻¹AᵀG⁻¹Gx^(k) + Tᵀd
   = (AᵀG⁻¹A)⁻¹ Aᵀx^(k) + Tᵀd = (AᵀG⁻¹A)⁻¹ b + Tᵀd = Tᵀd − Vb.
The proof is complete.
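Equivalently, the pair (x*, λ*) can be computed by assembling and solving the KKT system directly. A minimal sketch, assuming G positive definite and A of full rank so that the block matrix is nonsingular; the data repeat the invented instance used in the previous sketch.

```python
import numpy as np

def solve_kkt(G, d, A, b):
    """Solve [G -A; A^T 0][x; lam] = [-d; b]."""
    n, m = A.shape
    K = np.block([[G, -A], [A.T, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-d, b]))
    return sol[:n], sol[n:]               # x*, lambda*

G = np.array([[2.0, 0.0], [0.0, 2.0]])
d = np.array([-4.0, -5.0])
A = np.array([[1.0], [1.0]])
b = np.array([1.0])
x, lam = solve_kkt(G, d, A, b)
print(x, lam)     # x* = [0.25, 0.75]; lambda* = [-3.5], since Gx* + d = A lambda*
```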
Quadratic Programming (QP)
min q(x) = ½xᵀGx + dᵀx
subject to
  cᵢ(x) = aᵢᵀx − bᵢ = 0, i ∈ E
  cᵢ(x) = aᵢᵀx − bᵢ ≥ 0, i ∈ I.
The objective function is quadratic, i.e. the Hessian ∇²q = G is constant; the constraints are linear, i.e. each gradient ∇cᵢ = aᵢ is constant. If G is positive semi-definite, then this is a convex programming problem, and the first order necessary conditions, that there exist multipliers λ* such that
Gx* + d = Σ_{i∈A*} λᵢ*∇cᵢ = Aλ*, where A = [∇cᵢ, i ∈ A*],
cᵢ(x*) = 0, i ∈ E, and cᵢ(x*) ≥ 0, i ∈ I (feasibility),
λᵢ* ≥ 0, i ∈ I,
λᵢ*cᵢ(x*) = 0 for all i ∈ E ∪ I,
are both necessary and sufficient.

Dantzig-Wolfe (1959, 1963)
min Q(x) = ½xᵀGx + dᵀx
subject to
  Aᵀx = b
  x ≥ 0.
G positive definite ⟹ objective function convex; linear constraints ⟹ feasible region convex ⟹ convex programming problem ⟹ first order necessary conditions both necessary and sufficient.
Lagrangian:
L(x, λ, μ) = ½xᵀGx + dᵀx − λᵀ(Aᵀx − b) − μᵀx.
Necessary conditions: ∇ₓL = 0, together with feasibility and complementarity, gives
Gx + d − Aλ − μ = 0,
Aᵀx = b,
x ≥ 0, μ ≥ 0,
xᵢμᵢ = 0 for all i.
Thus we need a solution of the system of equations
[ Aᵀ  0   0 ] [ x ]   [  b ]
[ G  −A  −I ] [ λ ] = [ −d ]
              [ μ ]
with x ≥ 0, μ ≥ 0 and xᵢμᵢ = 0 for all i. This is very similar to the linear programming problem:
- a method for quadratic programming based on solving the above problem by LP-like techniques exists;
- it turns out to be equivalent to the following active set method.

Difference with linear programming: the minimum along a line can occur either
(1) when a new constraint becomes active, or
(2) due to the curvature of the objective function.
[Figure: minimum due to a new constraint becoming active vs. minimum due to the curvature of the quadratic (l = 0).]

Active Set Method for Quadratic Programming
Let 𝒜^(k) be an approximation to the set of active constraints at x^(k):
𝒜^(k) ⊆ A^(k) = {i : cᵢ(x^(k)) = 0}, usually 𝒜^(k) = A^(k),
and 𝒜^(k) contains all equality constraints E. Let
A^(k) = [∇cᵢ, i ∈ 𝒜^(k)],
and assume that ∇cᵢ, i ∈ 𝒜^(k), are linearly independent (i.e. A^(k) has full rank). Define the Equality constrained Quadratic Programming problem (EQP) as
min q(x) subject to cᵢ(x) = 0, i ∈ 𝒜^(k).
Since
q(x^(k) + s) = q(x^(k)) + g^(k)ᵀs + ½sᵀGs,
cᵢ(x^(k) + s) = cᵢ(x^(k)) + sᵀ∇cᵢ,
the search direction s^(k) can be obtained directly as the solution to
min_s ½sᵀGs + sᵀg^(k)
subject to
sᵀ∇cᵢ = 0, i ∈ 𝒜^(k) (i.e. sᵀA^(k) = 0),
and the multipliers satisfy
g^(k) = A^(k)λ^(k) (i.e. Gx^(k) + d = A^(k)λ^(k)).
The structure is very similar to linear programming.
Starting point: x^(1) is a feasible point of (QP);
- it can be obtained by methods analogous to artificial variables in linear programming;
- a good starting point is a vertex of the feasible region.
Let x^(1) be a solution of the corresponding EQP (if not, start at step 2).

kth iteration:
1. Evaluate the Lagrange multipliers λᵢ^(k), i ∈ 𝒜^(k).
(i) If λᵢ^(k) ≥ 0 for all i ∈ 𝒜^(k) ∩ I, stop.
(ii) Otherwise, let j be the index such that λⱼ^(k) is the most negative λᵢ^(k), i ∈ 𝒜^(k) ∩ I. Set 𝒜^(k+1) = 𝒜^(k) − {j}, x^(k+1) = x^(k) and k = k + 1.
2. Let x̄ be the solution of the EQP corresponding to 𝒜^(k), and let s^(k) = x̄ − x^(k).
3. Choose a steplength α^(k) which maintains feasibility (cᵢ ≥ 0, i ∈ I). Let
α^(k) = min{1, min ᾱᵢ over i ∈ I, i ∉ 𝒜^(k) with ∇cᵢᵀs^(k) < 0}, where ᾱᵢ = −cᵢ(x^(k))/(∇cᵢᵀs^(k)).
If the minimum is due to a new constraint becoming active, let l be the index of this constraint.
4. Set
x^(k+1) = x^(k) + α^(k)s^(k), and 𝒜^(k+1) = 𝒜^(k) + {l} if α^(k) < 1.
(a) If either x^(k+1) is a vertex (𝒜^(k+1) has n elements) or α^(k) = 1 (so x^(k+1) is a solution to the equality constrained problem), set k = k + 1 and go to 1.
(b) Otherwise set k = k + 1 and go to 2.

Notes:
- The multipliers are only tested when x^(k) is a solution of the corresponding equality constrained problem, i.e. when we know they exist, assuming linear independence of ∇cᵢ, i ∈ 𝒜^(k); this automatically holds if x^(k) is a vertex.
- The search direction s^(k) can be obtained directly. A sketch of the whole method follows, and the worked example after it traces the same steps by hand.
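A minimal sketch of the method for the all-inequality case aᵢᵀx ≥ bᵢ with G positive definite; each EQP is solved through its KKT system, and no safeguards for degenerate or rank-deficient working sets are included. Applied to the example that follows (the aᵢ as columns of A, starting vertex x^(1) = 0, working set {2, 3} in 0-based form), it reproduces x* = (1/5, 8/5).

```python
import numpy as np

def active_set_qp(G, d, A, b, x0, work, iters=50):
    x, W = np.asarray(x0, dtype=float), list(work)
    n = len(x)
    for _ in range(iters):
        Aw = A[:, W] if W else np.zeros((n, 0))
        m = Aw.shape[1]
        g = G @ x + d
        # EQP step: min 0.5 s^T G s + g^T s  s.t.  Aw^T s = 0
        K = np.block([[G, -Aw], [Aw.T, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(m)]))
        s, lam = sol[:n], sol[n:]
        if np.allclose(s, 0.0):                   # x solves the current EQP
            if m == 0 or lam.min() >= -1e-10:
                return x, dict(zip(W, lam))       # all multipliers >= 0: stop
            W.pop(int(np.argmin(lam)))            # drop most negative multiplier
            continue
        alpha, hit = 1.0, None                    # step length keeping c_i >= 0
        for i in range(A.shape[1]):
            ai_s = A[:, i] @ s
            if i not in W and ai_s < -1e-12:
                a_bar = (b[i] - A[:, i] @ x) / ai_s
                if a_bar < alpha:
                    alpha, hit = a_bar, i
        x = x + alpha * s
        if hit is not None:
            W.append(hit)                         # new constraint becomes active
    return x, {}

A = np.array([[-2.0, 1.0, 0.0],                   # gradients of c1, c2, c3
              [-1.0, 0.0, 1.0]])
b = np.array([-2.0, 0.0, 0.0])
G = 2.0 * np.eye(2)
d = np.array([-4.0, -5.0])
x, lam = active_set_qp(G, d, A, b, x0=[0.0, 0.0], work=[1, 2])
print(x, lam)    # -> [0.2, 1.6], multiplier 9/5 on constraint 0 (i.e. c1)
```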
Example.
min_{x∈ℝ²} q(x) = x₁² + x₂² − 4x₁ − 5x₂ + 2
subject to
  c₁(x) = −2x₁ − x₂ + 2 ≥ 0
  c₂(x) = x₁ ≥ 0
  c₃(x) = x₂ ≥ 0.

g(x) = (2x₁ − 4, 2x₂ − 5)ᵀ, G = [2 0; 0 2], ∇c₁ = (−2, −1)ᵀ, ∇c₂ = (1, 0)ᵀ, ∇c₃ = (0, 1)ᵀ.

Start at x^(1) = (0, 0)ᵀ: feasible, a vertex.
𝒜^(1) = A^(1) = {2, 3}, A^(1) = [1 0; 0 1], g^(1) = (−4, −5)ᵀ.
g^(1) = A^(1)λ^(1) ⟹ λ₂^(1) = −4, λ₃^(1) = −5.
Most negative: λ₃^(1) ⟹ j = 3 (drop c₃), 𝒜^(1) − {j} = {2}.
Solve the EQP:
min ½sᵀGs + sᵀg^(1) subject to sᵀ∇c₂ = 0.
sᵀ∇c₂ = 0 ⟹ s = (0, σ)ᵀ, so we minimize σ² − 5σ ⟹ σ = 5/2, i.e. s^(1) = (0, 5/2)ᵀ.
Step length: only ∇c₁ᵀs^(1) = −5/2 < 0, with −c₁(x^(1))/(∇c₁ᵀs^(1)) = 2/(5/2), so
α^(1) = min{1, 2/(5/2)} = 4/5 ⟹ l = 1 (add c₁).
x^(2) = x^(1) + α^(1)s^(1) = (0, 2)ᵀ, 𝒜^(2) = {2, 3} − {3} + {1} = {2, 1}: a vertex.
g^(2) = (−4, −1)ᵀ, and the multipliers solve
[1 −2; 0 −1](λ₂^(2), λ₁^(2))ᵀ = (−4, −1)ᵀ ⟹ λ₁^(2) = 1, λ₂^(2) = −2 ⟹ drop j = 2, 𝒜 = {1}.
EQP: sᵀ∇c₁ = 0 ⟹ s = σ(1, −2)ᵀ, so we minimize 5σ² − 2σ ⟹ 10σ − 2 = 0 ⟹ σ = 1/5, i.e. s^(2) = (1/5, −2/5)ᵀ.
Step length: only ∇c₃ᵀs^(2) = −2/5 < 0, with −c₃(x^(2))/(∇c₃ᵀs^(2)) = 2/(2/5) = 5, so α^(2) = min{1, 5} = 1.
x^(3) = x^(2) + s^(2) = (1/5, 8/5)ᵀ, 𝒜^(3) = A^(3) = {1}.
g^(3) = (−18/5, −9/5)ᵀ, and g^(3) = A^(3)λ^(3) = (−2, −1)ᵀλ₁^(3) ⟹ λ₁^(3) = 9/5 ≥ 0: stop.
Optimal solution: x* = (1/5, 8/5)ᵀ.
Indefinite Quadratic Programming
If the matrix G is not positive definite, there are likely to be stationary points which are local minima, local maxima or saddle points. Figure 1 depicts the contours of a two dimensional function F(x) for which G has one positive eigenvalue and one negative eigenvalue, together with three constraints.

We shall need the Householder transformation when we solve the indefinite quadratic programming problem. A Householder matrix has the form
H = I − (2/‖h‖²)hhᵀ,
where h is an n × 1 vector. It has nice numerical properties: one is that if H is a Householder matrix, then H² = I. Let
a = (a₁, a₂, a₃)ᵀ be an n × 1 vector,
where a₁ is r × 1, a₂ is a scalar, and a₃ is (n − r − 1) × 1. Then it can be shown that
Ha = (a₁, −z, 0)ᵀ,   (*)
where
z = (a₂² + ‖a₃‖²)^{1/2}, a scalar,
0 is an (n − r − 1) × 1 vector of zeros, and
h = (0, a₂ + z, a₃)ᵀ, so that ½‖h‖² = z² + a₂z.

Example. Take a₁ = (2, 8)ᵀ, a₂ = 6, a₃ = (3, 5)ᵀ, i.e. a = (2, 8, 6, 3, 5)ᵀ. Then
z = (36 + 34)^{1/2} = √70,
h = (0, 0, 6 + √70, 3, 5)ᵀ, ½‖h‖² = 70 + 6√70,
and
Ha = {I − hhᵀ/(70 + 6√70)} a = (2, 8, −√70, 0, 0)ᵀ.
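A sketch of this construction, reproducing the worked numbers above.

```python
import numpy as np

def householder_zero_tail(a, r):
    """Build h so that H = I - 2 h h^T/||h||^2 maps a to (a1, -z, 0)."""
    a = np.asarray(a, dtype=float)
    z = np.sqrt(np.sum(a[r:] ** 2))        # z = sqrt(a2^2 + ||a3||^2)
    h = np.zeros_like(a)
    h[r] = a[r] + z                        # a2 + z
    h[r + 1:] = a[r + 1:]                  # a3
    H = np.eye(len(a)) - 2.0 * np.outer(h, h) / (h @ h)
    return H, H @ a

H, Ha = householder_zero_tail([2, 8, 6, 3, 5], r=2)
print(Ha)      # [2, 8, -8.3666, 0, 0]; note sqrt(70) = 8.3666
print(H @ H)   # the identity, confirming H^2 = I
```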
Example (Indefinite QP).
min f(x) = x₁² − 4x₂²
subject to
  x₁ − 4x₂ ≥ 0,
  x₁ ≤ 1,
  −x₁ − 8x₂ ≤ 4,
⟹
  c₁(x) = x₁ − 4x₂ ≥ 0
  c₂(x) = 1 − x₁ ≥ 0
  c₃(x) = 4 + x₁ + 8x₂ ≥ 0.
g(x) = (2x₁, −8x₂)ᵀ, G(x) = [2 0; 0 −8],
∇c₁ = (1, −4)ᵀ, ∇c₂ = (−1, 0)ᵀ, ∇c₃ = (1, 8)ᵀ.
The Hessian G is not positive definite; in fact it is indefinite, with one positive eigenvalue 2 and one negative eigenvalue −8.

Starting point:
x^(1) = (1, 1/4)ᵀ ⟹ c^(1) = (0, 0, 7)ᵀ ⟹ 𝒜^(1) = A^(1) = {1, 2}, f^(1) = 3/4.
This is a feasible starting point which is a vertex ⟹ multipliers exist.
A^(1) = [1 −1; −4 0], g^(1) = (2, −2)ᵀ,
A^(1)λ^(1) = g^(1) ⟹ λ₁^(1) = 1/2, λ₂^(1) = −3/2 ⟹ drop j = 2:
𝒜^(2) = {1}, x^(2) = x^(1) = (1, 1/4)ᵀ, A^(2) = (1, −4)ᵀ.
Factor A^(2):
Q^(2)A^(2) = [R^(2); 0], Q^(2) = I − (1/γ)hhᵀ, where γ = ½‖h‖²,
h = (1 + √17, −4)ᵀ,
γ = ½[(1 + √17)² + 16] = ½[2(17) + 2√17] = 17 + √17
⟹
Q^(2) = (1/√17)[−1 4; 4 1],
Q^(2)A^(2) = (−√17, 0)ᵀ,
Q^(2)ᵀQ^(2) = (1/17)[17 0; 0 17] = I.
⟹
A^(2) = Q^(2)ᵀ [R^(2); 0] = [Q₁^(2), Q₂^(2)] [R^(2); 0]
⟹ Z = Q₂^(2) = (1/√17)(4, 1)ᵀ.
Reduced gradient:
g_R^(2) = Zᵀg^(2) = (1/√17)[4, 1](2, −2)ᵀ = 6/√17.
Reduced Hessian:
G_R^(2) = ZᵀGZ = (1/17)[4, 1][2 0; 0 −8](4, 1)ᵀ = 24/17,
which is positive definite.
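The hand factorization can be cross-checked with numpy's own (Householder-based) QR: the last n − m columns of the full Q give a Z with AᵀZ = 0. The sign of Z may differ from the hand computation, which affects neither ZᵀGZ nor a test of Zᵀg = 0.

```python
import numpy as np

A = np.array([[1.0], [-4.0]])             # A^(2)
G = np.array([[2.0, 0.0], [0.0, -8.0]])
g = np.array([2.0, -2.0])                 # g^(2)

Q, R = np.linalg.qr(A, mode="complete")
Z = Q[:, 1:]                              # null-space basis of A^T
print(Z.T @ A)                            # ~0
print(Z.T @ g)                            # +-6/sqrt(17) = +-1.4552
print(Z.T @ G @ Z)                        # 24/17 = 1.4118 > 0
```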

The reduced search direction is
s_R^(2) = −(G_R^(2))⁻¹ g_R^(2) = −(17/24)(6/√17) = −√17/4.
The search direction is then
s^(2) = Z s_R^(2) = (1/√17)(4, 1)ᵀ(−√17/4) = (−1, −1/4)ᵀ.
Line search: min f(x^(2) + αs^(2)) with
α^(2) = min{1, min over i ∈ I, i ∉ 𝒜^(2) with ∇cᵢᵀs^(2) < 0 of (bᵢ − aᵢᵀx^(2))/(aᵢᵀs^(2))}.
Here ∇c₂ᵀs^(2) = 1 > 0 and ∇c₃ᵀs^(2) = −3 < 0 with c₃(x^(2)) = 7, so
α^(2) = min{1, 7/3} = 1,
x^(3) = x^(2) + s^(2) = (1, 1/4)ᵀ + (−1, −1/4)ᵀ = (0, 0)ᵀ,
𝒜^(3) = 𝒜^(2) = {1},
g^(3) = (0, 0)ᵀ, g^(3) = A^(3)λ^(3) ⟹ λ₁^(3) = 0.
Although λ^(3) = 0 ≥ 0, the indefiniteness of G may mean x^(3) is not optimal. Try to remove constraint 1: 𝒜^(4) = ∅. G is indefinite, so try to find a descent direction of negative curvature moving into the interior of the feasible region from x^(3); if no such vector exists, then x^(3) is optimal.
x^(4) = (0, 0)ᵀ, g^(4) = (0, 0)ᵀ, c^(4) = (0, 1, 4)ᵀ.
We want s such that sᵀg^(4) ≤ 0 (o.k. as g^(4) = 0), sᵀGs < 0 and a₁ᵀs > 0:
sᵀGs = [s₁, s₂][2 0; 0 −8](s₁, s₂)ᵀ = 2s₁² − 8s₂² < 0,
a₁ᵀs = s₁ − 4s₂ > 0.
Try s^(4) = (0, −1)ᵀ.
Line search:
α^(4) = min{−cᵢ(x^(4))/(∇cᵢᵀs^(4)) : ∇cᵢᵀs^(4) < 0}
(a₂ᵀs^(4) = 0, a₃ᵀs^(4) = −8) ⟹ α^(4) = 4/8 = 1/2,
x^(5) = x^(4) + α^(4)s^(4) = (0, −1/2)ᵀ, 𝒜^(5) = {3},
A^(5) = (1, 8)ᵀ, g^(5) = (0, 4)ᵀ.
Factor: Q^(5)A^(5) = [R^(5); 0] with Q^(5) = I − (1/γ)hhᵀ,
h = (1 + √65, 8)ᵀ,
γ = ½‖h‖² = ½[(1 + √65)² + 64] = 65 + √65
⟹
Q^(5) = (1/√65)[−1 −8; −8 1],
Q^(5)ᵀQ^(5) = (1/65)[65 0; 0 65] = I,
A^(5) = Q^(5)ᵀ [R^(5); 0] = [Q₁^(5), Q₂^(5)] [R^(5); 0]
⟹ Z = Q₂^(5) = (1/√65)(−8, 1)ᵀ.
Reduced gradient:
g_R^(5) = Zᵀg^(5) = (1/√65)[−8, 1](0, 4)ᵀ = 4/√65.
Reduced Hessian:
G_R^(5) = ZᵀGZ = (1/65)[−8, 1][2 0; 0 −8](−8, 1)ᵀ = (128 − 8)/65 = 24/13 > 0.
The reduced direction is
s_R^(5) = −(G_R^(5))⁻¹ g_R^(5) = −(13/24)(4/√65) = −13/(6√65).
The search direction is
s^(5) = Z s_R^(5) = (1/√65)(−8, 1)ᵀ(−13/(6√65)) = (1/30)(8, −1)ᵀ.
Line search: ∇c₁ᵀs^(5) = 12/30 > 0 and ∇c₂ᵀs^(5) = −8/30 < 0 with c₂(x^(5)) = 1, so
α^(5) = min{1, 30/8} = 1.
x^(6) = x^(5) + α^(5)s^(5) = (0, −1/2)ᵀ + (8/30, −1/30)ᵀ = (4/15, −8/15)ᵀ,
𝒜^(6) = {3} (c₃ remains active: c₃(x^(6)) = 4 + 4/15 − 64/15 = 0).
g^(6) = (8/15, 64/15)ᵀ, and
g^(6) = A^(6)λ^(6) ⟹ (1, 8)ᵀλ₃^(6) = (8/15, 64/15)ᵀ ⟹ λ₃^(6) = 8/15 > 0.
The reduced gradient Zᵀg^(6) = (1/√65)(−8 · 8/15 + 64/15) = 0 and the reduced Hessian is 24/13 > 0, so
x* = (4/15, −8/15)ᵀ is optimal, with f(x*) = 16/225 − 4(64/225) = −16/15.

Necessary and Sufficient Conditions
For the problem
min f(x) subject to Aᵀx ≥ b,
the necessary conditions are:
  Aᵀx* ≥ b,
  Zᵀg(x*) = 0,
  λᵢ* ≥ 0, i ∈ A*,
  ZᵀG(x*)Z is positive semidefinite;
and the sufficient conditions are:
  Aᵀx* ≥ b,
  Zᵀg(x*) = 0,
  λᵢ* > 0, i ∈ A*,
  ZᵀG(x*)Z is positive definite.
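A quick numerical check of these conditions at the optimizer x* = (4/15, −8/15) of the indefinite QP example above.

```python
import numpy as np

x = np.array([4/15, -8/15])
G = np.array([[2.0, 0.0], [0.0, -8.0]])
g = np.array([2*x[0], -8*x[1]])               # gradient of f at x*
a3 = np.array([1.0, 8.0])                     # gradient of the active c3

Q, _ = np.linalg.qr(a3.reshape(2, 1), mode="complete")
Z = Q[:, 1:]                                  # null space of the active gradient
print(x[0] - 4*x[1] >= 0, 1 - x[0] >= 0)      # inactive constraints feasible
print(abs(4 + x[0] + 8*x[1]) < 1e-12)         # c3 active
print(Z.T @ g)                                # ~0: reduced gradient vanishes
print((a3 @ g) / (a3 @ a3))                   # 8/15: multiplier positive
print(Z.T @ G @ Z)                            # 24/13: reduced Hessian pos. def.
```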
The Sequential Quadratic Programming Algorithm
Sequential quadratic programming methods for nonlinear constrained optimization were developed mainly by Han [S.P. Han, Superlinearly convergent variable metric algorithms for general nonlinear programming problems, Mathematical Programming 11 (1976) 263; S.P. Han, A globally convergent method for nonlinear programming, J. of Optimization Theory and Applications 22 (1977) 297] and Powell [M.J.D. Powell, A fast algorithm for nonlinearly constrained optimization calculations, in: Numerical Analysis, ed. G.A. Watson, Lecture Notes in Mathematics, Vol. 630 (Springer-Verlag, Berlin-Heidelberg-New York, 1978); M.J.D. Powell, The convergence of variable metric methods for nonlinearly constrained optimization calculations, in: Nonlinear Programming 3, ed. O.L. Mangasarian, R.R. Meyer and S.M. Robinson (Academic Press, New York, 1978)], based on the initial work of Wilson [R.B. Wilson, A simplicial algorithm for concave programming, Ph.D. Thesis, Graduate School of Business Administration, Harvard University, Boston (1963)].

Consider the constrained nonlinear optimization problem
min f(x)   (1)
subject to
gᵢ(x) = 0, i = 1, ..., m   (2)
hᵢ(x) ≥ 0, i = 1, ..., r   (3)
lᵢ ≤ xᵢ ≤ uᵢ, i = 1, ..., n.   (4)
For convenience, define the bound constraints lᵢ ≤ xᵢ ≤ uᵢ, i = 1, ..., n, by some functions hᵢ, i = r + 1, ..., r + 2n, to simplify the notation. More precisely, we let
h_{r+i} = uᵢ − xᵢ ≥ 0, i = 1, ..., n   (5)
h_{r+n+i} = xᵢ − lᵢ ≥ 0, i = 1, ..., n.   (6)

Let x^(k) be a current iterate, v^(k) an approximation of the optimal Lagrange multipliers, and B^(k) a positive definite approximation of the Hessian matrix of the Lagrangian function
L(x, λ, μ) = f(x) − Σ_{i=1}^m λᵢgᵢ(x) − Σ_{i=1}^{r+2n} μᵢhᵢ(x).   (7)
Linearizing the nonlinear constraints (2) and (3), and minimizing a quadratic approximation of the Lagrangian function (7), we obtain a subproblem of the form
min ½dᵀB^(k)d + ∇f(x^(k))ᵀd   (8a)
subject to
∇gᵢ(x^(k))ᵀd + gᵢ(x^(k)) = 0, i = 1, ..., m   (8b)
∇hᵢ(x^(k))ᵀd + hᵢ(x^(k)) ≥ 0, i = 1, ..., r   (8c)
lᵢ − xᵢ^(k) ≤ dᵢ ≤ uᵢ − xᵢ^(k), i = 1, ..., n.   (8d)
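For the equality-constrained case (r = 0, no bounds), one iteration reduces to solving the KKT system of the subproblem (8a)-(8b). A minimal sketch with α_k = 1 and a fixed B^(k); the line search (13) and the update of B^(k) (e.g. by BFGS) are omitted, and the test problem is invented.

```python
import numpy as np

def sqp_step(grad_f, g, jac_g, x, B):
    """One SQP step: solve  min 0.5 d^T B d + grad_f^T d  s.t.  Jg d + g = 0."""
    gf, Jg = grad_f(x), jac_g(x)               # data at the current iterate
    m = Jg.shape[0]
    K = np.block([[B, -Jg.T], [Jg, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-gf, -g(x)]))
    d, lam = sol[:len(x)], sol[len(x):]
    return x + d, lam                          # full step, QP multipliers

# Invented test:  min x1^2 + x2^2  s.t.  x1 + x2 - 2 = 0  ->  x* = (1, 1).
x = np.array([3.0, -1.0])
for _ in range(5):
    x, lam = sqp_step(lambda x: 2*x,
                      lambda x: np.array([x[0] + x[1] - 2]),
                      lambda x: np.array([[1.0, 1.0]]),
                      x, B=2*np.eye(2))
print(x, lam)    # -> [1, 1], lambda = [2]
```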

Let d^(k) be the solution of (8). Introduce the corresponding Lagrangian function L^(k):
L^(k) = ½dᵀB^(k)d + ∇f(x^(k))ᵀd
  − Σ_{i=1}^m λᵢ{∇gᵢ(x^(k))ᵀd + gᵢ(x^(k))}
  − Σ_{i=1}^r μᵢ{∇hᵢ(x^(k))ᵀd + hᵢ(x^(k))}
  − Σ_{i=1}^n μ_{r+i}{uᵢ − xᵢ^(k) − dᵢ}
  − Σ_{i=1}^n μ_{r+n+i}{xᵢ^(k) + dᵢ − lᵢ}.   (9)
From the Kuhn-Tucker conditions, there exist values {λᵢ^(k)}, i = 1, ..., m, and {μᵢ^(k)}, i = 1, ..., r + 2n, such that
B^(k)d^(k) + ∇f(x^(k)) − Σ_{i=1}^m λᵢ^(k)∇gᵢ(x^(k)) − Σ_{i=1}^r μᵢ^(k)∇hᵢ(x^(k)) + Σ_{i=1}^n μ_{r+i}^(k)eᵢ − Σ_{i=1}^n μ_{r+n+i}^(k)eᵢ = 0,   (10)
where eᵢ = (0, ..., 0, 1, 0, ..., 0)ᵀ with the 1 in the ith position. Let
u^(k) = (λ₁^(k), ..., λ_m^(k), μ₁^(k), ..., μ_{r+2n}^(k))ᵀ.   (11)
Then a new iterate is determined by
x^(k+1) = x^(k) + α_k d^(k),   (12)
where α_k is a line search parameter. α_k is designed to produce a sufficient decrease of the merit function
φ_k(α) = ψ_{r^(k)}((x^(k), v^(k)) + α(d^(k), u^(k) − v^(k))).   (13)
Since the line search may depend on the approximation v^(k) of the optimal Lagrange multipliers of (7), we update v^(k) simultaneously by
v^(k+1) = v^(k) + α_k(u^(k) − v^(k)).   (14)
In (13), r^(k) is a vector of penalty parameters and controls the degree of penalizing the objective or Lagrangian function when leaving the feasible region. A possible merit function is the augmented Lagrangian function
ψ_r(x, v) = f(x) − Σ_{i=1}^m (vᵢgᵢ(x) − ½rᵢgᵢ(x)²)
  − Σ_{i=m+1}^{m+r+2n} { vᵢh_{i−m}(x) − ½rᵢh_{i−m}(x)²  if h_{i−m}(x) ≤ vᵢ/rᵢ;  ½vᵢ²/rᵢ  if h_{i−m}(x) > vᵢ/rᵢ }   (15)
proposed by Schittkowski. The penalty parameter r^(k) is updated by a suitable rule to guarantee a descent direction d^(k) with respect to the chosen merit function.

However, we cannot always implement the quadratic programming subproblem (8) as it stands. It is possible that the feasible region of (8) is empty although the original problem (1)-(4) is solvable. The second drawback is the recalculation of the gradients of all constraints at each iteration, although some of them might be inactive at an optimal solution, i.e., locally redundant.

To avoid both disadvantages, an additional variable δ and an active set strategy are introduced, leading to the modified subproblem
min ½dᵀB^(k)d + ∇f(x^(k))ᵀd + ½ρ_kδ²   (16a)
subject to
∇gᵢ(x^(k))ᵀd + (1 − δ)gᵢ(x^(k)) = 0, i = 1, ..., m   (16b)
∇hᵢ(x^(k))ᵀd + (1 − δ)hᵢ(x^(k)) ≥ 0, i ∈ J_k \ {1, ..., m},
∇hᵢ(x^{(k(i))})ᵀd + hᵢ(x^(k)) ≥ 0, i ∈ K_k   (16c)
l − x^(k) ≤ d ≤ u − x^(k)   (16d)
d ∈ ℝⁿ   (16e)
0 ≤ δ ≤ 1,   (16f)
where
J_k = {1, ..., m} ∪ {i : hᵢ(x^(k)) < ε or vᵢ^(k) > 0}
and
K_k = {1, ..., r} \ {i : hᵢ(x^(k)) < ε or vᵢ^(k) > 0}.
Here we have v^(k) = (v₁^(k), ..., v_{r+2n}^(k))ᵀ, with vᵢ^(k) denoting the multiplier approximation associated with the ith constraint, and ε is a user-provided tolerance. The index k(i) indicates gradients which have been calculated in previous iterations. The term ½ρ_kδ² is an additional penalty term, where ρ_k is designed to reduce the influence of δ on the solution of (16). It is easy to see that the point
d^(0) = 0, δ^(0) = 1
satisfies the constraints of (16) and can also be used as a feasible starting point for a quadratic programming algorithm.
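To make the two branches of (15) concrete, here is a sketch for a single equality constraint g and a single inequality constraint h; the two-constraint simplification and all names are illustrative only.

```python
def merit(f, g, h, x, v, r):
    """Augmented Lagrangian merit function (15): v multipliers, r penalties."""
    val = f(x) - (v[0] * g(x) - 0.5 * r[0] * g(x) ** 2)
    if h(x) <= v[1] / r[1]:              # near-active branch: quadratic penalty
        val -= v[1] * h(x) - 0.5 * r[1] * h(x) ** 2
    else:                                # well-inside branch: constant term
        val -= 0.5 * v[1] ** 2 / r[1]
    return val
```

Along the line of (13), φ_k(α) is this function evaluated at x^(k) + αd^(k) with multipliers v^(k) + α(u^(k) − v^(k)); the line search decreases it.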
