Unconstrained and Constrained Optimization Algorithms


Soman K.P
1 Introduction
Optimization is a means of finding the most efficient way of solving complex
mathematical problems, and is a vital part of most branches of computational science and
engineering.
We encounter optimization in our day-to-day life. Without thinking about it, most people are constantly trying to do things in an optimal way, whether looking for discounts to minimize the cost of the weekly shopping trip or finding the shortest path between two cities. When choosing between a long queue and a short one, most people pick the short queue to minimize the time spent waiting. Most of these everyday problems are solved by intuition, and it is often not crucial to find the absolutely best solution. These are all examples of simple optimization problems. Unfortunately, many important optimization problems are not that easy to solve. Optimization is used in many areas and is in many cases a very powerful tool. Common, more advanced examples are minimizing the weight of a structure while maintaining the desired strength, or finding the optimal route for an airplane so as to minimize fuel consumption. In such cases it can be impossible to solve the problem by intuition. Instead, a mathematical algorithm executed on a computer, an optimization routine, is often applied to the problem [1]. We are interested in applying optimization algorithms to signal and image processing applications. Recent developments in compressed sensing have spurred great interest in optimization theory, especially the theory of L1-norm optimization. Exploiting sparsity in signal representations and sparsity in image gradients requires a strong footing in optimization theory. This chapter is a first-level introduction to unconstrained optimization theory.

2 Unconstrained Optimization
In this section, numerical schemes for solving unconstrained optimization problems are introduced. Solving an unconstrained optimization problem is closely related to the root-finding process, so it is worth considering root-finding algorithms first.


2.1 Root Finding Algorithm
In a root-finding problem, we find $x^* \in \mathbb{R}$ that satisfies
$$f(x^*) = 0$$
where $f:\mathbb{R}\to\mathbb{R}$ is a smooth function in one variable. Newton's method is the most representative approach for this type of problem. Newton's method is an iterative procedure, which successively generates a sequence $x_k$ that approaches a root $x^*$ as $k$ increases. In Figure 1.1, Newton's method is illustrated on the graph of an arbitrary function $f$. At iteration $k$, Newton's method draws a tangential line $y = ax + b$ at the current point $(x_k, f(x_k))$. See the straight line drawn tangential to the curve at $(x_k, f(x_k))$ in Figure 1.1. We need to determine the slope $a$ and the y-intercept $b$. The slope is simply the derivative $f'(x_k)$ at $x_k$:
$$a = f'(x_k)$$
Note that the tangential line must pass through $(x_k, f(x_k))$, which can be found by evaluating $f$ at $x_k$. Plugging this into $y = ax + b$ gives $f(x_k) = f'(x_k)x_k + b$, which in turn gives $b = f(x_k) - f'(x_k)x_k$. Thus we have obtained the tangential line
$$y = f'(x_k)x + f(x_k) - f'(x_k)x_k = f(x_k) + f'(x_k)(x - x_k)$$
Newton's method updates $x_k$ so that $x_{k+1}$ is the root of the tangential line. Thus, we get the following Newton update formula:
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$




Figure 1.1: Illustration of Newton's method for root finding

Newton's method repeats this procedure until it converges to the root. The main idea of Newton's method is to linearize $f$ (i.e., find the tangential line at the current point), then find `the' root of the linearized function. That root is used as the next point, and the procedure is repeated until it converges to a root of $f$. This idea of approximating the original function locally and 'solving' the approximated function instead of the original one prevails in numerical optimization algorithms. We will definitely see this paradigm again. Letting $p_k = x_{k+1} - x_k$, we have
$$p_k = -f(x_k)/f'(x_k) \qquad (1.1)$$

We can consider $p_k$ as a step to the next point $x_{k+1}$. Remember the formula given in (1.1); we will see a very similar expression in numerical optimization, too.




Exercise 1: Find the square root of 3 using Newton's method.
Solution: We take $f(x) = x^2 - 3$. This is the equation of a parabola which cuts the x-axis at $x = \pm\sqrt{3}$; in other words, the solutions of $f(x) = 0$ are $x = \pm\sqrt{3}$.
We take $x_0 = 1$ and proceed with $f'(x) = 2x$:
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 1 - \frac{(-2)}{2} = 2$$
$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 2 - \frac{1}{4} = 7/4$$
and so on. Depending on the starting point, it will converge to one of the roots.
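The update formula (1.1) translates directly into a few lines of code. Below is a minimal sketch in Python; the tolerance, iteration cap, and choice of test function are illustrative assumptions, not part of the original text:

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0, following update (1.1)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:           # converged: f(x) is essentially zero
            return x
        x = x - fx / fprime(x)      # x_{k+1} = x_k - f(x_k)/f'(x_k)
    return x                        # may not have converged

# Exercise 1: square root of 3 via f(x) = x^2 - 3, starting at x0 = 1
root = newton_root(lambda x: x**2 - 3, lambda x: 2 * x, x0=1.0)
print(root)  # ~1.7320508
```

Starting from $x_0 = 1$, the iterates are 1, 2, 1.75, ..., exactly as in the worked solution above.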

Newton's method can fail to find a root. Consider $f(x) = x^2 + 1$ and $x_0 = 0$. Then we have $f(x_0) = 1$ and $f'(x_0) = 0$, so $x_1$ is undefined. This example illustrates the importance of the starting point: a bad starting point can cause the algorithm to fail. You may think that Newton's method is very robust if a good starting point is chosen. Unfortunately, that is not the case. In some cases, Newton's method converges very slowly, or even fails to converge from any ordinary starting point. Usually, Newton's method shows great performance when the starting point is close to a root or the function $f$ is very nice (e.g., $f$ is convex). However, it is not guaranteed that Newton's method will converge for a general function $f$.




Figure 1.2: Illustration of global and local minimizers. The circular point is the global minimizer; the rectangular points are local minimizers.

2.2 Local Minimizer
In an unconstrained optimization problem, we minimize an objective function that depends on real variables, with no restrictions at all on the values of these variables. The mathematical formulation is
$$\underset{x}{\text{minimize}}\ f(x)$$
where $x \in \mathbb{R}^n$ is a real vector with $n \ge 1$ and $f:\mathbb{R}^n \to \mathbb{R}$ is a smooth function. Usually we lack a global perspective on the function $f$. All we know is how to evaluate $f$ at specific points, and maybe its gradient
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)^T.$$

We need to come up with an algorithm which can find some minimum of $f$ with only this minimal information. The good news is that there are several such algorithms! The bad news, however, is that most algorithms find a local minimizer. This, in turn, tells us that global minimization is a very difficult task. Then what is the difference between a global and a local minimizer? Let's examine the definitions of each.


Definition 1. $x^*$ is called a global minimizer of $f$ if $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$. On the other hand, $x^*$ is called a local minimizer of $f$ if $f(x^*) \le f(x)$ for all $x \in N_\varepsilon(x^*)$, where $N_\varepsilon(x)$ denotes an $\varepsilon$-neighborhood of $x$.

Note from the definition that a global minimizer is a local minimizer, but a local minimizer may not be a global minimizer. The difference between global and local minimizers is best depicted by Figure 1.2. Both the square and circular points are local minimizers. However, there is only one global minimizer, which is the circular point. Even from the figure, you can imagine why it is hard to find a global minimizer when only some function evaluations and gradient information are given. Since we aim for a local minimizer, we need to know more about local minimizers. What are their characteristics? Listing a few:

- The tangential slope is zero at a local minimizer. In other words, $\nabla f(x^*) = 0$.
- $f(x^*) \le f(x)$ for $x \in N_\varepsilon(x^*)$.
- $\|\nabla f(x)\|$ is small if $x \in N_\varepsilon(x^*)$.
- $\nabla^2 f(x^*)$ (the Hessian matrix) is positive semidefinite.

In designing gradient-based algorithms, Taylor's theorem plays a crucial role. Let's take a look at Taylor's theorem. At first we look at the one-variable case:
Any given $f(x)$ can be expressed as a power series with respect to a chosen point $x_0$, as follows:
$$f(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)^2 + a_3(x - x_0)^3 + \ldots \qquad (1.2)$$
Now how do we find the values of $a_0, a_1, a_2, \ldots$ of this infinite series so that the equation holds?

2.3 Method
The general idea is to process both sides of the equation and choose values of $x$ so that only one unknown appears each time.

To obtain $a_0$: choose $x = x_0$ in (1.2). This results in $a_0 = f(x_0)$.

To obtain $a_1$: first take the derivative of (1.2):
$$\frac{d}{dx}f(x) = a_1 + 2a_2(x - x_0) + 3a_3(x - x_0)^2 + 4a_4(x - x_0)^3 + \ldots \qquad (1.3)$$
Now choose $x = x_0$. Then $a_1 = \left.\dfrac{df}{dx}\right|_{x=x_0}$.

To obtain $a_2$: take the derivative of (1.3):
$$\frac{d^2}{dx^2}f(x) = 2a_2 + 3(2)a_3(x - x_0) + 4(3)a_4(x - x_0)^2 + 5(4)a_5(x - x_0)^3 + \ldots \qquad (1.4)$$
Now choose $x = x_0$. Then $a_2 = \dfrac{1}{2}\left.\dfrac{d^2 f}{dx^2}\right|_{x=x_0}$.

To obtain $a_3$: take the derivative of (1.4):
$$\frac{d^3}{dx^3}f(x) = 3(2)a_3 + 4(3)(2)a_4(x - x_0) + 5(4)(3)a_5(x - x_0)^2 + \ldots \qquad (1.5)$$
Now choose $x = x_0$. Then $a_3 = \dfrac{1}{3(2)}\left.\dfrac{d^3 f}{dx^3}\right|_{x=x_0}$.

To obtain $a_k$: take the $k$-th derivative of equation (1.2) and then choose $x = x_0$:
$$a_k = \frac{1}{k!}\left.\frac{d^k f}{dx^k}\right|_{x=x_0}$$
Summarizing, the Taylor series expansion of $f(x)$ with respect to $x_0$ is given by:
$$f(x) = f(x_0) + \left.\frac{df}{dx}\right|_{x=x_0}(x - x_0) + \frac{1}{2!}\left.\frac{d^2 f}{dx^2}\right|_{x=x_0}(x - x_0)^2 + \ldots + \frac{1}{k!}\left.\frac{d^k f}{dx^k}\right|_{x=x_0}(x - x_0)^k + \ldots$$

Generalization to multivariable functions:
Let $x_1, x_2, x_3$ be three independent variables. Expanding about a point $(\bar{x}_1, \bar{x}_2, \bar{x}_3)$,
$$\begin{aligned} f(x_1,x_2,x_3) = \alpha_0 &+ \alpha_1(x_1-\bar{x}_1) + \alpha_2(x_2-\bar{x}_2) + \alpha_3(x_3-\bar{x}_3) \\ &+ \alpha_{11}(x_1-\bar{x}_1)^2 + \alpha_{22}(x_2-\bar{x}_2)^2 + \alpha_{33}(x_3-\bar{x}_3)^2 \\ &+ \alpha_{12}(x_1-\bar{x}_1)(x_2-\bar{x}_2) + \alpha_{13}(x_1-\bar{x}_1)(x_3-\bar{x}_3) + \alpha_{23}(x_2-\bar{x}_2)(x_3-\bar{x}_3) + \ldots \end{aligned} \qquad (1.6)$$
Using a similar method as described above, this time with partial derivatives,
$$\alpha_0 = f(\bar{x}_1,\bar{x}_2,\bar{x}_3), \quad \alpha_1 = \left.\frac{\partial f}{\partial x_1}\right|, \quad \alpha_2 = \left.\frac{\partial f}{\partial x_2}\right|, \quad \alpha_3 = \left.\frac{\partial f}{\partial x_3}\right|$$
$$\alpha_{11} = \frac{1}{2!}\left.\frac{\partial^2 f}{\partial x_1^2}\right|, \quad \alpha_{22} = \frac{1}{2!}\left.\frac{\partial^2 f}{\partial x_2^2}\right|, \quad \alpha_{33} = \frac{1}{2!}\left.\frac{\partial^2 f}{\partial x_3^2}\right|$$
$$\alpha_{12} = \left.\frac{\partial^2 f}{\partial x_1 \partial x_2}\right|, \quad \alpha_{13} = \left.\frac{\partial^2 f}{\partial x_1 \partial x_3}\right|, \quad \alpha_{23} = \left.\frac{\partial^2 f}{\partial x_2 \partial x_3}\right|$$
with all derivatives evaluated at the expansion point $(\bar{x}_1,\bar{x}_2,\bar{x}_3)$.

To get a simple, concise expression, we assume the variables are $x_1, x_2, \ldots, x_n$, so that we denote the generic point in the domain of the function as $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$. Then
$$f(\mathbf{x}) \approx f(\bar{\mathbf{x}}) + \nabla f(\bar{\mathbf{x}})^T(\mathbf{x}-\bar{\mathbf{x}}) + \frac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T \nabla^2 f(\bar{\mathbf{x}})(\mathbf{x}-\bar{\mathbf{x}}) + \ldots \qquad (1.7)$$
where
$$\nabla f(\mathbf{x}) = \begin{pmatrix} \partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_n \end{pmatrix}$$
and
$$\nabla^2 f(\mathbf{x}) = \begin{pmatrix} \partial^2 f/\partial x_1^2 & \partial^2 f/\partial x_1 \partial x_2 & \cdots & \partial^2 f/\partial x_1 \partial x_n \\ \partial^2 f/\partial x_2 \partial x_1 & \partial^2 f/\partial x_2^2 & \cdots & \partial^2 f/\partial x_2 \partial x_n \\ \vdots & \vdots & \ddots & \vdots \\ \partial^2 f/\partial x_n \partial x_1 & \partial^2 f/\partial x_n \partial x_2 & \cdots & \partial^2 f/\partial x_n^2 \end{pmatrix}$$
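As a quick numerical illustration of (1.7), the sketch below compares a function against its second-order Taylor approximation built from finite-difference estimates of the gradient and Hessian. The test function, displacement, and step sizes are illustrative assumptions:

```python
import numpy as np

def grad_fd(f, x, h=1e-5):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hess_fd(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian."""
    n = len(x); H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x+ei+ej) - f(x+ei-ej) - f(x-ei+ej) + f(x-ei-ej)) / (4*h*h)
    return H

f = lambda x: x[0]**2 + np.sin(x[1])       # arbitrary smooth test function
xbar = np.array([1.0, 0.5])
d = np.array([0.01, -0.02])                # small displacement
taylor = f(xbar) + grad_fd(f, xbar) @ d + 0.5 * d @ hess_fd(f, xbar) @ d
print(f(xbar + d), taylor)                 # the two values agree closely
```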

3 Line Search Methods
As in Newton's method for root finding, in unconstrained optimization we design an iterative algorithm. Starting from an initial guess, we search for a direction $p$ to take, and then decide how far we will go in direction $p$. This methodology is called a line search algorithm. Thus we update the current point $x_k$ as
$$x_{k+1} = x_k + \alpha p$$
where $\alpha$ is called the step length and $p$ the search direction. In the root-finding algorithm, $p$ is determined by equation (1.1) and $\alpha = 1$. Then what would be a good search direction for unconstrained optimization?





3.1 The Steepest Descent Method
From calculus, we have learned that $\nabla f(x)$ gives the steepest ascent direction at $x$. Thus, $-\nabla f(x)$ is the steepest descent direction. Since the only thing we need is a direction to take, we normalize the steepest descent direction to get $p$:
$$p = -\frac{\nabla f(x)}{\|\nabla f(x)\|} \qquad (1.8)$$
The line search method that uses this direction is called the steepest descent method.
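A minimal steepest descent loop might look as follows. The direction is $-\nabla f$ as in (1.8); a constant step length stands in for a proper line search, and the test function, step size, and tolerance are illustrative assumptions:

```python
import numpy as np

def steepest_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Steepest descent: step along p = -grad f(x), per (1.8).
    A constant step length stands in for a proper line search."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # gradient ~ 0: stationary point reached
            break
        x = x - alpha * g              # move in the steepest descent direction
    return x

# minimize f(x, y) = (x - 1)^2 + 4 y^2, with gradient (2(x - 1), 8 y)
grad = lambda x: np.array([2 * (x[0] - 1), 8 * x[1]])
print(steepest_descent(grad, np.array([3.0, 2.0])))    # ~ [1, 0]
```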

3.2 The Newton Method
In contrast to the steepest descent method, Newton's method takes a more sophisticated search direction $p$. As in Newton's method for root finding, we approximate the original objective function $f$ locally, and then take a minimizer of the approximated function $\hat{f}$ to determine the next point. Linearizing $f$, however, is not an option, because a linearized $f$ does not have a minimum. The next option is to move one step further: take a quadratic approximation of $f$. Let's say we are at the current point $x_k$. According to Taylor's theorem we have
$$f(x_k + p) = f(x_k) + p^T \nabla f(x_k) + \frac{1}{2} p^T \nabla^2 f(x_k)\, p + \ldots$$
The higher-order terms in the Taylor expansion are very small if $p$ is small. Thus if we define $\hat{f}(x)$ to be
$$\hat{f}(x) = f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k), \qquad (1.9)$$
then $\hat{f}(x) \approx f(x)$ for $x$ close to $x_k$. Thus $\hat{f}$ is a very good approximation to $f$ around $x_k$. In order to find the search direction $p$, we need to find a minimizer $x$ of $\hat{f}$. We know that for $x$ to be a local minimum, $\nabla \hat{f}(x) = 0$. Thus we must have
$$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$$
This is obtained by taking the derivative on both sides of (1.9) and equating it to the zero vector, as follows:


$$\hat{f}(x) = f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$$
Expanding the quadratic term and differentiating with respect to $x$ term by term gives
$$\nabla \hat{f}(x) = \nabla f(x_k) + \nabla^2 f(x_k)(x - x_k)$$
Setting $\nabla \hat{f}(x) = 0$ we obtain
$$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$$
Solving for the search direction $p = x - x_k$ we get
$$x - x_k = -\left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)$$
or
$$x_{k+1} = x = x_k - \left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)$$
The Newton search direction is therefore
$$p_k = x - x_k = -\left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)$$
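In code, the Newton direction is obtained by solving the linear system $\nabla^2 f(x_k)\, p = -\nabla f(x_k)$ rather than forming the inverse explicitly. A minimal sketch follows; the quadratic test function is an illustrative assumption:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method for minimization: solve H p = -g at each step."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)   # Newton direction p_k
        x = x + p                           # full Newton step (alpha = 1)
    return x

# minimize f(x, y) = (x - 1)^2 + 4 y^2 (quadratic: converges in one step)
grad = lambda x: np.array([2 * (x[0] - 1), 8 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 8.0]])
print(newton_minimize(grad, hess, np.array([3.0, 2.0])))  # -> [1, 0]
```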

4 Constrained Optimization
The other major type of optimization is constrained optimization. In constrained optimization,
the search space is dictated by constraints. The constraints can be either equality or inequality
constraints.

For example, let us consider a case with one equality constraint. $\min_x f(x)$ s.t. $g(x) = 0$ means that, on the set of points $x$ that satisfy $g(x) = 0$, we search for the $x^*$ at which $f(x)$ is minimum. Therefore, the condition to be satisfied at the optimal point $x^*$ is not $\nabla f(x^*) = 0$; that is the condition for an unconstrained optimization problem.

To find the condition for optimality in the case of constrained optimization, we make use of the level sets of the objective function. Recall that a level set is the set of points on which the function value equals a given constant. We draw several level sets in sequence, starting from the minimum possible level-set value. We then gradually increase the value and draw the corresponding level sets. At a particular value, the corresponding level-set curve of the objective function $f(x)$ just touches the zero level-set curve of $g(x)$ (in other words, touches the curve $g(x) = 0$). At the point of contact $x^*$, we can draw a common tangent to the two level sets. This tangent is orthogonal to both gradients $\nabla f(x^*)$ and $\nabla g(x^*)$ drawn at $x^*$. This means the gradients $\nabla f(x^*)$ and $\nabla g(x^*)$ are collinear. There are two possibilities: the gradients may point in the same direction or in opposite directions. These are shown in Figure 1.3 and Figure 1.4. All depends on the direction in which $f(x)$ and $g(x)$ are increasing at the point $x^*$.


Figure 1.3. Optimality conditions using level set curves, case 1


Figure 1.4. Optimality conditions using level set curves, case 2

Well, we obtained optimality conditions for one equality constraint. What if there are several equality constraints? Note that each equality constraint puts a severe restriction on the search space. If there is more than one equality constraint, the search space is limited to the set of common points that satisfy all the equality constraints. Now, at the optimal point, what are the conditions to be satisfied? For the purpose of illustration, let us consider the case of two equality constraints $g_1(x) = 0,\ g_2(x) = 0$, as in Figure 1.5. The objective function (the function whose optimum location is to be found) is $f(x)$. As before, we assume all functions are defined over $\mathbb{R}^2$. At the point of intersection $x^*$ of $g_1(x) = 0$ and $g_2(x) = 0$, the level set of the objective function $f(x)$ may not touch either curve tangentially. However, $\nabla f(x^*)$ has to lie in the plane spanned by $\nabla g_1(x^*)$ and $\nabla g_2(x^*)$. In other words,
$$\nabla f(x^*) = \lambda_1 \nabla g_1(x^*) + \lambda_2 \nabla g_2(x^*).$$
That is, $\nabla f(x^*)$ is a linear combination of $\nabla g_1(x^*)$ and $\nabla g_2(x^*)$. This is illustrated in Figure 1.5. The signs of $\lambda_1$ and $\lambda_2$ depend on the directions in which $g_1(x)$ and $g_2(x)$ are increasing at $x^*$.



Figure 1.5. Optimality condition for problem with two equality constraints.
5 Lagrangian Function
Lagrange [3] introduced a new function, called the Lagrangian function, for which applying the first-order optimality condition for an unconstrained problem yields the required condition for the constrained problem. For example, let us consider
$$\min_x f(x) \quad \text{s.t.} \quad g_1(x) = 0;\ g_2(x) = 0 \qquad (1.10)$$

The Lagrangian function is given by
$$L(x, \lambda_1, \lambda_2) = f(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x) \qquad (1.11)$$

On differentiating with respect to $x$, $\lambda_1$, $\lambda_2$ we obtain, respectively,

$$\nabla f(x) - \lambda_1 \nabla g_1(x) - \lambda_2 \nabla g_2(x) = 0 \qquad (1.12)$$
$$g_1(x) = 0$$
$$g_2(x) = 0$$
The point $x^*$ satisfying the above three conditions is given by
$$\nabla f(x^*) = \lambda_1 \nabla g_1(x^*) + \lambda_2 \nabla g_2(x^*); \quad g_1(x^*) = 0; \quad g_2(x^*) = 0 \qquad (1.13)$$

These are the same conditions we obtained using geometrical arguments. The importance of the Lagrangian function is that it converts a constrained optimization problem into an unconstrained optimization problem. The Lagrangian function is obtained by adding the constraints, multiplied by constants called Lagrange multipliers, to the objective function. Note that the sign of a multiplier depends on the direction in which the constraint function is increasing at $x^*$.
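For small problems, the stationarity and feasibility conditions (1.12) can be solved symbolically. Below is a sketch using SymPy on a made-up example, minimizing $f(x,y) = x^2 + y^2$ subject to the single constraint $x + y - 2 = 0$; the problem data are illustrative assumptions:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2            # objective (illustrative)
g = x + y - 2              # equality constraint g(x, y) = 0

L = f - lam * g            # Lagrangian, as in (1.11)

# First-order conditions: dL/dx = dL/dy = 0 plus feasibility g = 0
eqs = [sp.diff(L, x), sp.diff(L, y), g]
print(sp.solve(eqs, [x, y, lam]))   # {x: 1, y: 1, lam: 2}
```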
5.1 Optimization problems with inequality constraints
Next we consider deriving the optimality conditions when the constraints are of inequality type. Inequality constraints are less stringent than equality constraints because we have more search space: the set of points with $g(x) \ge 0$ is much larger than the set with $g(x) = 0$. Let us consider a problem of the type
$$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.14)$$



Figure 1.6. Optimality conditions with inequality constraints
Note that $g(x) = 0$ divides the search domain into two regions: one region where $g(x) \ge 0$ and the other where $g(x) < 0$; see Figure 1.6. If the global minimum of the function $f(x)$ lies in $g(x) \ge 0$, then the constraint $g(x) \ge 0$ is no longer a constraint at all, since $f(x)$ can assume its global minimum value at $x = x^*$.

So, for $g(x) \ge 0$ to be a real constraint, the global minimum of $f(x)$ must lie in $g(x) < 0$. Figure 1.6 shows such a situation. We draw level sets of $f(x)$ until a level set just grazes $g(x) = 0$. Let the point where it just touches $g(x) = 0$ be $x^*$. At $x^*$, we note that the two gradients are collinear and point in the same direction. Therefore, the required first-order optimality condition is

$$\nabla f(x^*) = \lambda \nabla g(x^*), \quad \lambda > 0 \qquad (1.15)$$
$$g(x^*) = 0$$

There is one problem with the above two conditions: they do not take care of the situation where $g(x) \ge 0$ is not constraining the objective function. This happens when the global minimum of $f(x)$ lies in $g(x) \ge 0$. At that global minimum point $x^*$, $\nabla f(x^*) = 0$, but $g(x^*)$ and $\nabla g(x^*)$ are not necessarily zero. This would violate the optimality conditions given above. To take care of that, we write a new condition, $\lambda\, g(x^*) = 0$, which says that if the constraint is not active ($g(x^*) \ne 0$) at the optimal point, then $\lambda = 0$. The condition $\lambda\, g(x^*) = 0$ is called the complementarity condition.

Putting it all together, we write the conditions at the optimal point $x^*$ as
$$\nabla f(x^*) = \lambda \nabla g(x^*), \quad \lambda \ge 0 \qquad (1.16)$$
$$g(x^*) \ge 0$$
$$\lambda\, g(x^*) = 0$$
These three conditions take care of both situations. Let us see both situations separately.

Case 1: $g(x) \ge 0$ is a constraint (the global optimum lies in $g(x) < 0$):
$$\nabla f(x^*) = \lambda \nabla g(x^*), \quad \lambda > 0, \quad \nabla f(x^*) \ne 0 \qquad (1.17)$$
$$g(x^*) = 0$$
$$\lambda\, g(x^*) = 0, \quad \lambda > 0$$





Case 2: $g(x) \ge 0$ is not a constraint (the global optimum lies in $g(x) \ge 0$):
$$\nabla f(x^*) = \lambda \nabla g(x^*), \quad \lambda = 0, \quad \nabla f(x^*) = 0 \qquad (1.18)$$
$$g(x^*) \ge 0$$
$$\lambda\, g(x^*) = 0 \quad \text{since } \lambda = 0$$
A very important point to note here is that the Lagrange multiplier $\lambda$ is not unrestricted in sign. If $g(x) = 0$ is an active constraint, then at the optimal point both $\nabla f(x)$ and $\nabla g(x)$ must point in the same direction.
5.2 Lagrangian formulation for inequality constraints
Let us now try to construct a Lagrangian function that produces the same optimality conditions. The problem $\min_x f(x)$ s.t. $g(x) \ge 0$ we rewrite as
$$\min_x f(x) \quad \text{s.t.} \quad g(x) - s^2 = 0 \qquad (1.19)$$
where $s$ is any real number, so that $s^2$ is positive for any $s$. We write the Lagrangian as
$$L(x, \lambda, s) = f(x) - \lambda\left(g(x) - s^2\right), \quad \lambda \ge 0 \qquad (1.20)$$
so that
$$\frac{\partial L}{\partial x} = \nabla f(x) - \lambda \nabla g(x) = 0 \implies \nabla f(x^*) = \lambda \nabla g(x^*) \qquad (1.21)$$
$$\frac{\partial L}{\partial \lambda} = -\left(g(x) - s^2\right) = 0 \implies g(x^*) \ge 0, \text{ because } s^2 \ge 0 \qquad (1.22)$$
$$\frac{\partial L}{\partial s} = 2\lambda s = 0 \implies \lambda\, g(x^*) = 0, \text{ because } g(x^*) = 0 \text{ when } s = 0 \qquad (1.23)$$

While writing the Lagrangian function, we should make sure that $\frac{\partial L}{\partial x} = 0$ produces $\nabla f(x^*) = \lambda \nabla g(x^*)$ with $\lambda \ge 0$.
Let us now examine the directions of $\nabla f(x)$ and $\nabla g(x)$ at the optimal point for the various combinations of type of optimization (max or min) and type of inequality constraint (less-than or greater-than).
a. Combination type 1:
$$\max_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.24)$$
Here, for the constraint to be active, the global maximum must lie outside the $g(x) \ge 0$ region. Under that situation, at the optimal point $x^*$, $\nabla f(x^*) = -\lambda \nabla g(x^*),\ \lambda \ge 0$. See Figure 1.7.

Figure 1.7. Optimality conditions combination type 1

The Lagrangian function for the problem is $L(x, \lambda, s) = f(x) + \lambda\left(g(x) - s^2\right),\ \lambda \ge 0$.
b. Combination type 2:
$$\max_x f(x) \quad \text{s.t.} \quad g(x) \le 0 \qquad (1.25)$$

The Lagrangian for the problem is $L(x, \lambda, s) = f(x) - \lambda\left(g(x) + s^2\right),\ \lambda \ge 0$. See Figure 1.8.

Figure 1.8. Optimality conditions combination type 2
c. Combination type 3:
$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0 \qquad (1.26)$$

The Lagrangian for the problem is $L(x, \lambda, s) = f(x) + \lambda\left(g(x) + s^2\right),\ \lambda \ge 0$. See Figure 1.9.
Remembering the Lagrangian form for every case is difficult. To make it easy to remember, we may convert the formulation into a standard form and remember only the Lagrangian for that standard form.
A maximization problem is converted into a minimization problem by changing the objective function to $-f(x)$; that is, $\max_x f(x) = -\min_x\left(-f(x)\right)$.
Similarly, a $\le$ inequality constraint is converted into a $\ge$ constraint by multiplying by $-1$; that is, $g(x) \le 0$ is changed into $-g(x) \ge 0$.

Figure 1.9. Optimality conditions combination type 3


6 Formulation with several equality and inequality constraints
Consider a general case where we have m equality constraints and n inequality constraints.
Given:
$$\min_x f(x) \quad \text{s.t.} \quad h_i(x) = 0,\ i = 1,2,\ldots,m; \qquad g_j(x) \ge 0,\ j = 1,2,\ldots,n \qquad (1.27)$$

The Lagrangian for the above problem is
$$L(x, \lambda, \mu, s) = f(x) - \sum_{i=1}^{m} \lambda_i h_i(x) - \sum_{j=1}^{n} \mu_j\left(g_j(x) - s_j^2\right); \quad \mu_j \ge 0\ \forall j \qquad (1.28)$$

The first-order optimality conditions are obtained as
$$\frac{\partial L}{\partial x} = 0 \implies \nabla f(x^*) = \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j \nabla g_j(x^*); \quad \mu_j \ge 0\ \forall j \qquad (1.29)$$
$$\frac{\partial L}{\partial \lambda_i} = 0 \implies h_i(x^*) = 0\ \forall i \qquad (1.30)$$
$$\frac{\partial L}{\partial \mu_j} = 0 \implies g_j(x^*) - s_j^2 = 0 \implies g_j(x^*) \ge 0\ \forall j \qquad (1.31)$$
$$\frac{\partial L}{\partial s_j} = 0 \implies \mu_j s_j = 0 \implies \mu_j g_j(x^*) = 0\ \forall j \qquad (1.32)$$

Putting it all together, the optimal point $x^*$ has to satisfy the following:
$$\nabla f(x^*) = \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j \nabla g_j(x^*); \quad \mu_j \ge 0\ \forall j \qquad (1.33)$$
(first-order gradient condition)
$$h_i(x^*) = 0\ \forall i \quad \text{(feasibility condition)}$$

$$g_j(x^*) \ge 0\ \forall j \quad \text{(feasibility condition)}$$
$$\mu_j g_j(x^*) = 0\ \forall j \quad \text{(complementarity condition)}$$
These conditions are called the KKT conditions, or Karush-Kuhn-Tucker conditions.
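For a concrete problem, the KKT conditions can be checked numerically. The sketch below solves a small made-up problem with scipy and then verifies stationarity and complementarity by recovering the multiplier in a least-squares sense; the problem data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# min f(x) = (x0 - 2)^2 + (x1 - 1)^2  s.t.  g(x) = 1 - x0^2 - x1^2 >= 0
f = lambda x: (x[0] - 2)**2 + (x[1] - 1)**2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 1)])
g = lambda x: 1 - x[0]**2 - x[1]**2
grad_g = lambda x: np.array([-2 * x[0], -2 * x[1]])

res = minimize(f, x0=[0.0, 0.0], constraints={'type': 'ineq', 'fun': g})
xs = res.x

# Stationarity (1.33): grad f = mu * grad g  ->  recover mu by least squares
mu = np.linalg.lstsq(grad_g(xs).reshape(-1, 1), grad_f(xs), rcond=None)[0][0]
print("x* =", xs)                          # ~ [0.894, 0.447] (on the circle)
print("mu =", mu, ">= 0:", mu >= 0)        # multiplier must be nonnegative
print("complementarity mu*g:", mu * g(xs)) # ~ 0 since the constraint is active
```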
7 Convex Optimization problems
A vast number of problems in engineering, including signal and image processing problems, can be posed as constrained optimization problems of the type
$$\underset{x}{\text{minimize}}\ f_0(x) \quad \text{subject to} \quad g_i(x) \le 0,\ i = 1,2,\ldots,m; \qquad h_i(x) = 0,\ i = 1,2,\ldots,n \qquad (1.34)$$


However, such problems can be very hard to solve in general, especially when the number of decision variables in $x$ is large. There are several reasons for this difficulty:
1) The problem terrain may be riddled with local optima.
2) It might be very hard to find a feasible point (i.e., an $x$ which satisfies all the equalities and inequalities); in fact the feasible set, which need not even be fully connected, could be empty.
3) Stopping criteria used in general optimization algorithms are often arbitrary.
4) Optimization algorithms might have very poor convergence rates.
5) Numerical problems could cause the minimization algorithm to stop altogether, or to wander.
It has been known for a long time that if $f$ is a convex function (which we will define soon), and all the constraints together define a convex set, then the first three problems disappear: any local optimum is, in fact, a global optimum; feasibility of convex optimization problems can be determined unambiguously, at least in principle; and very precise stopping criteria are available using duality (which will be defined soon). However, convergence rate and numerical sensitivity issues still remain a potential problem.

It was not until the late 1980s and 1990s that researchers in the former Soviet Union and the United States discovered that if, in addition to convexity, the objective function $f$ satisfies a property known as self-concordance (discussed later), then issues of convergence and numerical sensitivity can be avoided using interior point methods [2-6]. The self-concordance property is satisfied by a very large set of important functions used in engineering. Hence, it is now possible to solve a large class of convex optimization problems in engineering with great efficiency.

7.1 Convex Sets
In this section we list some important convex sets and operations.
We will be concerned only with optimization problems whose decision variables are vectors in $\mathbb{R}^n$ or matrices in $\mathbb{R}^{m \times n}$.
A function $f:\mathbb{R}^n \to \mathbb{R}^m$ is affine if it has the form $f(x) = Ax + b$. Affine functions are sometimes loosely referred to as linear.
$S \subseteq \mathbb{R}^n$ is a subspace if it contains the plane through any two of its points and the origin, i.e.,
$$x, y \in S,\ \lambda, \mu \in \mathbb{R} \implies \lambda x + \mu y \in S.$$
Two common representations of a subspace are as the range of a matrix,
$$\text{range}(A) = \left\{ Aw \mid w \in \mathbb{R}^q \right\} = \left\{ w_1 a_1 + \cdots + w_q a_q \mid w_i \in \mathbb{R} \right\}, \quad \text{where } A = [a_1 \cdots a_q],$$
or alternatively as the null space of a matrix,
$$\text{nullspace}(B) = \left\{ x \mid Bx = 0 \right\}.$$
A set $S \subseteq \mathbb{R}^n$ is affine if it contains the line through any two points in it, i.e.,
$$x, y \in S,\ \lambda, \mu \in \mathbb{R},\ \lambda + \mu = 1 \implies \lambda x + \mu y \in S.$$

Figure 1.10. Example of Affine Set

Geometrically, an affine set is simply a set that is a translate of a subspace, which is centered at the origin (for a set of points to be a subspace, the null vector, the origin, must be a member of the set). Two common representations for an affine set are the range of an affine function,
$$S = \left\{ Az + b \mid z \in \mathbb{R}^q \right\},$$
or alternatively as the solution set of a system of linear equalities:
$$S = \left\{ x \mid Bx = d \right\}.$$
A set $S \subseteq \mathbb{R}^n$ is a convex set if it contains the line segment joining any two of its points, i.e.,
$$x, y \in S,\ \lambda, \mu \ge 0,\ \lambda + \mu = 1 \implies \lambda x + \mu y \in S.$$

Figure 1.11. Convex Sets
Geometrically, we can think of convex sets as always bulging outward, with no dents or kinks in them. Clearly subspaces and affine sets are convex, since their definitions subsume convexity.
A set $S \subseteq \mathbb{R}^n$ is a convex cone if it contains all rays passing through its points which emanate from the origin, as well as all line segments joining any points on those rays, i.e.,
$$x, y \in S,\ \lambda, \mu \ge 0 \implies \lambda x + \mu y \in S.$$
Geometrically, $x, y \in S$ means that $S$ contains the entire 'pie slice' between $x$ and $y$.


Figure 1.12. Convex Cone
The nonnegative orthant $\mathbb{R}^n_+$ is a convex cone. The set $S^n_+ = \left\{ X \in S^n \mid X \succeq 0 \right\}$ of symmetric positive semidefinite (PSD) matrices is also a convex cone, since any positive combination of semidefinite matrices is semidefinite. Hence we call $S^n_+$ the positive semidefinite cone.
A convex cone $K \subseteq \mathbb{R}^n$ is said to be proper if it is closed, has nonempty interior, and is pointed, i.e., there is no line in $K$. A proper cone $K$ defines a generalized inequality $\succeq_K$ on $\mathbb{R}^n$:
$$x \succeq_K y \iff x - y \in K$$
(with the strict version $x \succ_K y \iff x - y \in \text{interior}(K)$).


Figure 1.13. Example of Convex Cone
This formalizes our use of the $\succeq$ symbol:
- $K = \mathbb{R}^n_+$: $x \succeq_K y$ means $x_i \ge y_i$ (componentwise inequality);
- $K = S^n_+$: $X \succeq_K Y$ means $X - Y$ is PSD.

Given points $x_i \in \mathbb{R}^n$ and $\theta_i \in \mathbb{R}$, then $y = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k$ is said to be a
1. linear combination for any real $\theta_i$;
2. affine combination if $\sum_i \theta_i = 1$;
3. convex combination if $\sum_i \theta_i = 1,\ \theta_i \ge 0$;
4. conic combination if $\theta_i \ge 0$.
The linear (resp. affine, convex, conic) hull of a set $S$ is the set of all linear (resp. affine, convex, conic) combinations of points from $S$, and is denoted by span($S$) (resp. Aff($S$), Co($S$), Cone($S$)). It can be shown that this is the smallest such set containing $S$.
As an example, consider the set $S = \{(1,0,0), (0,1,0), (0,0,1)\}$. Then span($S$) is $\mathbb{R}^3$; Aff($S$) is the hyperplane passing through the three points; Co($S$) is the unit simplex, the triangle joining the vectors along with all the points inside it; Cone($S$) is the nonnegative orthant $\mathbb{R}^3_+$.
Recall that a hyperplane, represented as $\left\{ x \mid a^T x = b \right\}\ (a \ne 0)$, is in general an affine set, and is a subspace if $b = 0$. Another useful representation of a hyperplane is $\left\{ x \mid a^T(x - x_0) = 0 \right\}$, where $a$ is the normal vector and $x_0$ lies on the hyperplane. Hyperplanes are convex, since they contain all lines (and hence segments) joining any of their points.
A halfspace, described as $\left\{ x \mid a^T x \le b \right\}\ (a \ne 0)$, is generally convex, and is a convex cone if $b = 0$. Another useful representation is $\left\{ x \mid a^T(x - x_0) \le 0 \right\}$, where $a$ is the (outward) normal vector and $x_0$ lies on the boundary.



Figure 1.14. Example of Half Space
We now come to a very important fact about properties which are preserved under intersection. Let $A$ be an arbitrary index set (possibly uncountably infinite) and $\{S_\alpha \mid \alpha \in A\}$ a collection of sets. Then we have the following: if every $S_\alpha$ is a subspace (resp. affine, convex, a convex cone), then $\bigcap_{\alpha \in A} S_\alpha$ is a subspace (resp. affine, convex, a convex cone).

In fact, every closed convex set $S$ is the (usually infinite) intersection of the halfspaces which contain it, i.e.,
$$S = \bigcap \left\{ H \mid H \text{ halfspace},\ S \subseteq H \right\}.$$
For example, another way to see that $S^n_+$ is a convex cone is to recall that a matrix $X \in S^n$ is positive semidefinite if $z^T X z \ge 0$ for all $z \in \mathbb{R}^n$. Thus we can write
$$S^n_+ = \bigcap_{z \in \mathbb{R}^n} \left\{ X \in S^n \ \Big|\ z^T X z = \sum_{i,j=1}^{n} z_i z_j X_{ij} \ge 0 \right\}. \qquad (1.35)$$

Now observe that the summation above is actually linear in the components of $X$, so $S^n_+$ is the infinite intersection of halfspaces containing the origin (which are convex cones) in $S^n$.
We continue with our listing of useful convex sets.
A polyhedron is the intersection of a finite number of halfspaces:
$$P = \left\{ x \mid a_i^T x \le b_i,\ i = 1,2,\ldots,k \right\} = \left\{ x \mid Ax \preceq b \right\}$$
where $\preceq$ above means componentwise inequality.



Figure 1.15. Example of Polyhedron

A bounded polyhedron is called a polytope, which also has the alternative representation $P = \text{Co}\{v_1, v_2, \ldots, v_N\}$, where $\{v_1, v_2, \ldots, v_N\}$ are its vertices. For example, the nonnegative orthant $\mathbb{R}^n_+ = \left\{ x \in \mathbb{R}^n \mid x \succeq 0 \right\}$ is a polyhedron, while the probability simplex $\left\{ x \in \mathbb{R}^n \mid x \succeq 0,\ \sum_i x_i = 1 \right\}$ is a polytope.
If $f$ is a norm, then the norm ball $B = \left\{ x \mid f(x - x_c) \le 1 \right\}$ is convex, and the norm cone $C = \left\{ (x, t) \mid f(x) \le t \right\}$ is a convex cone. Perhaps the most familiar norms are the $\ell_p$ norms on $\mathbb{R}^n$:
$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p},\ p \ge 1; \qquad \|x\|_\infty = \max_i |x_i| \qquad (1.36)$$
The corresponding norm balls (in $\mathbb{R}^2$) look like this:


Figure 1.16. Norm Balls
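The sketch below evaluates the $\ell_p$ norms of (1.36) and checks the convexity of a norm ball empirically: for two points in the ball, every convex combination must stay in the ball. The sample points and the choice $p = 1.5$ are arbitrary illustrative assumptions:

```python
import numpy as np

def lp_norm(x, p):
    """l_p norm from (1.36); p = np.inf gives the max norm."""
    if np.isinf(p):
        return np.max(np.abs(x))
    return np.sum(np.abs(x)**p)**(1.0 / p)

p = 1.5
x = np.array([0.6, 0.3])      # two points inside the unit l_p ball
y = np.array([-0.2, 0.7])
assert lp_norm(x, p) <= 1 and lp_norm(y, p) <= 1

# every convex combination theta*x + (1-theta)*y must stay inside the ball
for theta in np.linspace(0, 1, 11):
    z = theta * x + (1 - theta) * y
    assert lp_norm(z, p) <= 1.0 + 1e-12
print("all convex combinations stayed inside the ball")
```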
Two further properties are helpful in visualizing the geometry of convex sets. The first is the separating hyperplane theorem, which states that if $S, T \subseteq \mathbb{R}^n$ are convex and disjoint ($S \cap T = \emptyset$), then there exists a hyperplane $\left\{ x \mid a^T x = b,\ a \ne 0 \right\}$ which separates them.


Figure 1.17. Separating Hyperplane theorem

The second property is the supporting hyperplane theorem, which states that there exists a supporting hyperplane at every point on the boundary of a convex set, where a supporting hyperplane $\left\{ x \mid a^T x = a^T x_0 \right\}$ supports $S$ at $x_0 \in \partial S$ if
$$x \in S \implies a^T x \le a^T x_0.$$



Figure 1.18. Supporting Hyperplane Theorem
7.2 Convex Functions
In this section, we introduce the reader to some important convex functions and techniques for verifying convexity. The objective is to sharpen the reader's ability to recognize convexity.
A. Convex functions
A function $f:\mathbb{R}^n \to \mathbb{R}$ is convex if its domain dom($f$) is convex and for all $x, y \in \text{dom}(f)$, $\theta \in [0,1]$,
$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y);$$
$f$ is concave if $-f$ is convex.


Figure 1.19. Types of Functions
The convexity of a differentiable function $f:\mathbb{R}^n \to \mathbb{R}$ can also be characterized by conditions on its gradient $\nabla f$ and Hessian $\nabla^2 f$. Recall that, in general, the gradient yields a first-order Taylor approximation at $x_0$:
$$f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) \qquad (1.37)$$

We have the following first-order condition: $f$ is convex if and only if for all $x, x_0 \in \text{dom}(f)$,
$$f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0), \qquad (1.38)$$
i.e., the first-order approximation of $f$ is a global underestimator.


Figure 1.20. Illustration of Taylor Series
Recall that the Hessian of $f$, $\nabla^2 f$, yields a second-order Taylor series expansion around $x_0$:
$$f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0) \qquad (1.39)$$

We have the following necessary and sufficient second-order condition: a twice-differentiable function $f$ is convex if and only if for all $x \in \text{dom}(f)$, $\nabla^2 f(x) \succeq 0$, i.e., its Hessian is positive semidefinite on its domain.
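The second-order condition suggests a simple numerical convexity check: evaluate the Hessian at sample points and test whether its smallest eigenvalue is nonnegative. A sketch follows; the test function and sample grid are illustrative assumptions:

```python
import numpy as np

def hessian(x):
    """Hessian of f(x, y) = x^4 + y^2 (a convex function)."""
    return np.array([[12 * x[0]**2, 0.0],
                     [0.0,          2.0]])

# sample the domain and verify the smallest Hessian eigenvalue is >= 0
for x0 in np.linspace(-2, 2, 9):
    for x1 in np.linspace(-2, 2, 9):
        lam_min = np.linalg.eigvalsh(hessian([x0, x1]))[0]
        assert lam_min >= 0, "Hessian not PSD: f is not convex here"
print("Hessian PSD at all sampled points (consistent with convexity)")
```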

7.3 Concept of Self-Concordance

Nesterov and Nemirovski [5] introduced the notion of self-concordance and a class of self-concordant functions. This provides a new tool for analyzing Newton's method that exploits the affine invariance of the method.

7.3.1 Definition (for one variable):
A function $f:\mathbb{R}\to\mathbb{R}$ is self-concordant when $f$ is convex and $|f'''(x)| \le 2\left(f''(x)\right)^{3/2}$ for all $x \in \text{dom}(f)$.
Significance: If Newton's method is applied to a quadratic function (whose Hessian is a constant matrix), then it converges in one iteration. By extension, if the Hessian matrix does not change rapidly, Newton's method ought to converge rapidly. Changes in the second derivative can be measured using the third derivative. Intuitively, the third derivative should be small relative to the second derivative. The self-concordance property reflects this requirement.
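A classic example is $f(x) = -\log x$, for which $f''(x) = 1/x^2$ and $f'''(x) = -2/x^3$, so $|f'''(x)| = 2(f''(x))^{3/2}$ exactly and the definition holds with equality. A quick numerical confirmation (the sampled range is an arbitrary choice):

```python
import numpy as np

# f(x) = -log x on x > 0: f'' = 1/x^2, f''' = -2/x^3
for x in np.linspace(0.1, 10.0, 50):
    f2 = 1.0 / x**2
    f3 = -2.0 / x**3
    # self-concordance: |f'''(x)| <= 2 * f''(x)^(3/2); here it holds with equality
    assert abs(f3) <= 2 * f2**1.5 + 1e-12
print("-log x satisfies the self-concordance inequality on the sampled range")
```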

7.4 Concept of Duality
Earlier we saw the conditions to be satisfied at the optimal point of a general constrained optimization problem. These conditions do not help us find the solution except in very simple cases; they only help us check whether an optimal point has been reached. We need some iterative procedure to solve the problem. To this end, corresponding to the problem
$$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0, \qquad (1.40)$$
we show that it is equivalent to a two-step optimization of the Lagrangian function without slack variables. That is,
$$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.41)$$
$$= \min_x \max_{\lambda \ge 0} L(x, \lambda), \quad L(x, \lambda) = f(x) - \lambda g(x) \qquad (1.42)$$
$$= \min_x \max_{\lambda \ge 0} \left[ f(x) - \lambda g(x) \right] \qquad (1.43)$$

To understand the logic behind this, let us consider a case where the domain is $\mathbb{R}^2$. The inner maximization $\max_{\lambda \ge 0}\left[f(x) - \lambda g(x)\right]$ can be visualized as follows.

At every point $x$ in the domain, we evaluate $f(x) - \lambda g(x)$ for different values of $\lambda \ge 0$. In the region where $g(x) > 0$, the maximum value possible at any $x$ is $f(x)$ itself; this is achieved for $\lambda = 0$. In the region where $g(x) < 0$, the maximum value possible is infinity, and it is obtained for $\lambda = \infty$, because in this region, at any point $x$, $g(x)$ is negative, and therefore $-\lambda g(x)$ grows without bound. Note that $\lambda$ is not allowed to take negative values. In addition, on $g(x) = 0$, the highest possible value of the Lagrangian at any $x$ is $f(x)$.
Through this inner computation-cum-optimization process, we associate a value with every point in the domain. The value in the infeasible region is infinity, and the value at any feasible point $x$ is $f(x)$ itself. Now we apply the outer optimization (basically a search). This search will find the location $x^*$ where the Lagrangian was assigned the minimum value in the previous inner-loop computations. Though this is not the actual computational procedure we finally follow for finding the optimal point, readers can see that the logic of the formulation is right, and that proceeding this way will definitely end up finding the location of $x^*$. Now we are ready for one of the most important concepts in convex optimization theory: Lagrangian duality.

7.5 Lagrangian Duality
According to the Lagrangian duality concept, for a wide class of functions,
$$\min_x \max_{\lambda} L(x, \lambda) = \max_{\lambda} \min_x L(x, \lambda) \qquad (1.44)$$
That is, the order of maximization and minimization can be swapped.
The original problem $\min_x \max_\lambda L(x, \lambda)$ is called the primal, and the swapped version $\max_\lambda \min_x L(x, \lambda)$ is called the dual of the primal.

Why does swapping make sense?
Consider a plot of the image of the domain of $x$ under the map $x \mapsto (g(x), f(x))$. The optimal primal solution lies on the ordinate, on the lower boundary of the image of the mapping.

Figure 1.21. Image of the domain of x under the map $x \mapsto (g(x), f(x))$.
In the dual problem, the Lagrangian $f(x) - \lambda^T g(x)$ is being minimized over $x$. On the graph, this is the y-intercept of the line with slope $\lambda$ passing through the point $(g(x), f(x))$. The minimization finds the smallest such intercept, ranging over all $x$; this is the dual function. The subsequent maximization of the dual function takes the maximum of such y-intercepts. This yields the same point as the primal solution.




Appendix-1
Understanding Lagrangian Duality
Understanding the concept of duality is very important in the theory of support vector machines, because one rarely solves the optimization problem arising in an SVM in the primal, owing to the computational complexity involved. The main stumbling block a newcomer to this field faces is the concept of the Lagrange multiplier and Lagrangian duality. To facilitate an easy entry, we consider the duality concept in linear programming, which many are familiar with.
Duality in Linear Programming
Linear programming was developed as a discipline in the 1940s, motivated initially by the need to solve complex planning problems in wartime operations. Its development accelerated rapidly in the postwar period as many industries found valuable uses for linear programming. The founders of the subject are generally regarded as George B. Dantzig, who devised the simplex method in 1947, and John von Neumann, who established the theory of duality that same year. The Nobel Prize in economics was awarded in 1975 to the mathematician Leonid Kantorovich (USSR) and the economist Tjalling Koopmans (USA) for their contributions to the theory of optimal allocation of resources, in which linear programming played a key role. Many industries use linear programming as a standard tool, e.g. to allocate a finite set of resources in an optimal way. Examples of important application areas include airline crew scheduling, shipping and telecommunication networks, oil refining and blending, and stock and bond portfolio selection.
The most remarkable mathematical property of linear programs is the theory of duality. Duality in linear programming is essentially a unifying theory that develops the relationships between a given linear program and another related linear program stated in terms of dual variables. The intriguing feature is that both the primal and the dual have the same optimal value for their objective functions.
To understand the logic behind duality, let us consider two examples.



Example 1. Given the linear program
$$\min x_2 \quad \text{s.t.} \quad x_1 + x_2 \ge 8; \quad -3x_1 + 2x_2 \ge 6,$$
how do we lower-bound the value of the optimum solution? That is, instead of solving the problem, can we say something about an upper/lower bound on the objective function using a linear combination of the constraints?
Multiplying the first constraint by 3 and adding it to the second gives $5x_2 \ge 30$, which implies $x_2 \ge 6$.
For any feasible solution, $x_2$ is at least 6.
Example 2. Given the linear program
$$\begin{aligned} \max\ & 5x_1 + 6x_2 + 9x_3 + 8x_4 \\ \text{s.t.}\ & x_1 + 2x_2 + 3x_3 + x_4 \le 5 \\ & x_1 + x_2 + 2x_3 + 3x_4 \le 3 \\ & x_i \ge 0,\ i = 1,2,3,4, \end{aligned}$$
how do we upper-bound the value of the optimum solution?
We choose $y_1, y_2 \ge 0$. We multiply the first constraint by $y_1$ and the second by $y_2$ and add. The choice of $y_1$ and $y_2$ should be such that, in the resulting sum, the coefficient of $x_1$ is greater than 5, the coefficient of $x_2$ greater than 6, the coefficient of $x_3$ greater than 9, and the coefficient of $x_4$ greater than 8. Since $5y_1 + 3y_2$ is greater than this sum, the upper bound will be $5y_1 + 3y_2$. That is,
$$y_1\left(x_1 + 2x_2 + 3x_3 + x_4\right) + y_2\left(x_1 + x_2 + 2x_3 + 3x_4\right) \le 5y_1 + 3y_2$$


So if we choose $y_1$ and $y_2$ such that the following conditions are met,
$$y_1 + y_2 \ge 5; \quad 2y_1 + y_2 \ge 6; \quad 3y_1 + 2y_2 \ge 9; \quad y_1 + 3y_2 \ge 8; \quad y_1, y_2 \ge 0,$$
then the solution will give an upper bound (by substituting $y_1$ and $y_2$ into $5y_1 + 3y_2$) for our original problem.
Then, to get a tight upper bound, we should choose $y_1$ and $y_2$ so as to minimize $5y_1 + 3y_2$ while satisfying the above constraints. Thus we obtain another optimization problem, as follows:
$$\begin{aligned} \min\ & 5y_1 + 3y_2 \\ \text{s.t.}\ & y_1 + y_2 \ge 5 \\ & 2y_1 + y_2 \ge 6 \\ & 3y_1 + 2y_2 \ge 9 \\ & y_1 + 3y_2 \ge 8 \\ & y_1, y_2 \ge 0 \end{aligned}$$
We call the above optimization problem the dual of the original primal problem.
Using Excel we get the following result: $x_1^* = 1,\ x_2^* = 2,\ x_3^* = 0,\ x_4^* = 0$ for the primal and $y_1^* = 1,\ y_2^* = 4$ for the dual, so that
$$5x_1^* + 6x_2^* + 9x_3^* + 8x_4^* = 17 = 5y_1^* + 3y_2^*.$$
In essence, we get another linear program with the same optimal objective function value. An incidental advantage of the above dual problem is that we now have only two variables, though we have more constraints.
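The same primal/dual pair can also be checked with scipy's LP solver instead of Excel. Note that linprog minimizes, so the primal maximization is passed with a negated objective; this is a sketch, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

# primal: max 5x1+6x2+9x3+8x4  s.t.  x1+2x2+3x3+x4 <= 5,  x1+x2+2x3+3x4 <= 3, x >= 0
c = np.array([5, 6, 9, 8])
A = np.array([[1, 2, 3, 1],
              [1, 1, 2, 3]])
b = np.array([5, 3])
primal = linprog(-c, A_ub=A, b_ub=b)       # negate c: linprog minimizes

# dual: min 5y1+3y2  s.t.  A^T y >= c, y >= 0   (pass as -A^T y <= -c)
dual = linprog(b, A_ub=-A.T, b_ub=-c)

print(primal.x, -primal.fun)    # ~ [1, 2, 0, 0], 17
print(dual.x, dual.fun)         # ~ [1, 4], 17
```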

Getting the dual in Lagrangian way
Lagrange (centuries before practical solution of LPs using simplex and interior point methods was invented) put the above procedure in the framework of calculus. The method goes as follows.
Suppose our objective is of maximization type. Lagrange tells us to form a new objective function by adding to the actual objective function the constraints multiplied by positive quantities (assuming our constraints are inequalities), such that optimization of the new objective function gives an upper bound for the original optimization.
For example, for the following LP
$$\max_x c^T x \quad \text{s.t.} \quad Ax \le b,\ x \ge 0,$$
the Lagrangian function is
$$L = c^T x + y^T(b - Ax) + \lambda^T x, \quad y \ge 0,\ \lambda \ge 0.$$
We can easily prove that maximization of $L$ for given $y \ge 0,\ \lambda \ge 0$ gives an optimum value that is higher than the original problem's optimum value.
We now take the derivative of the new objective function $L(x, y, \lambda)$ with respect to the primal variable $x$ and equate it to zero. Then we substitute the resulting expression back into the Lagrangian, so that the new objective function becomes devoid of the primal variable $x$:
$$\frac{\partial L(x, y, \lambda)}{\partial x} = c - A^T y + \lambda = 0 \implies \lambda = A^T y - c.$$
Now
$$L_D(y, \lambda) = c^T x + y^T(b - Ax) + (A^T y - c)^T x = y^T b = b^T y.$$
This function we minimize with respect to $y$, subject to $\lambda = A^T y - c$. That is,


$$\min_y L_D(y, \lambda) = b^T y \quad \text{such that} \quad \lambda = A^T y - c,\ y \ge 0,\ \lambda \ge 0.$$
This may also be written as (since $\lambda \ge 0$)
$$\min_y L_D(y) = b^T y \quad \text{s.t.} \quad A^T y \ge c,\ y \ge 0.$$
This is the dual of the primal.
To make the concept more transparent, let us apply the method to an LP without packing our variables into vectors and coefficients into a matrix. Consider again the LP
$$\begin{aligned} \max\ & 5x_1 + 6x_2 + 9x_3 + 8x_4 \\ \text{s.t.}\ & x_1 + 2x_2 + 3x_3 + x_4 \le 5 \\ & x_1 + x_2 + 2x_3 + 3x_4 \le 3 \\ & x_i \ge 0,\ i = 1,2,3,4, \end{aligned}$$
whose solution is given by $x_1^* = 1,\ x_2^* = 2,\ x_3^* = 0,\ x_4^* = 0$ and $5x_1^* + 6x_2^* + 9x_3^* + 8x_4^* = 17$.
Let us take the Lagrangian
$$L(x, y, \lambda) = 5x_1 + 6x_2 + 9x_3 + 8x_4 + y_1\left(5 - x_1 - 2x_2 - 3x_3 - x_4\right) + y_2\left(3 - x_1 - x_2 - 2x_3 - 3x_4\right) + \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 + \lambda_4 x_4,$$
with the condition $y_1, y_2, \lambda_1, \lambda_2, \lambda_3, \lambda_4 \ge 0$.
Taking derivatives with respect to the primal variables,


$$\frac{\partial L(x, y, \lambda)}{\partial x_1} = 5 - (y_1 + y_2) + \lambda_1 = 0$$
$$\frac{\partial L(x, y, \lambda)}{\partial x_2} = 6 - (2y_1 + y_2) + \lambda_2 = 0$$
$$\frac{\partial L(x, y, \lambda)}{\partial x_3} = 9 - (3y_1 + 2y_2) + \lambda_3 = 0$$
$$\frac{\partial L(x, y, \lambda)}{\partial x_4} = 8 - (y_1 + 3y_2) + \lambda_4 = 0$$
In matrix form,
$$\frac{\partial L(x, y, \lambda)}{\partial x} = \begin{pmatrix} 5 \\ 6 \\ 9 \\ 8 \end{pmatrix} - \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} + \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \end{pmatrix} = c - A^T y + \lambda = 0.$$
Substituting this back into the Lagrangian, we obtain $L_D(y) = 5y_1 + 3y_2$. We minimize this subject to the constraints obtained from $\lambda = A^T y - c \ge 0$:
$$y_1 + y_2 \ge 5; \quad 2y_1 + y_2 \ge 6; \quad 3y_1 + 2y_2 \ge 9; \quad y_1 + 3y_2 \ge 8; \quad y_1, y_2 \ge 0.$$
So the dual problem is
$$\begin{aligned} \min\ & 5y_1 + 3y_2 \\ \text{s.t.}\ & y_1 + y_2 \ge 5 \\ & 2y_1 + y_2 \ge 6 \\ & 3y_1 + 2y_2 \ge 9 \\ & y_1 + 3y_2 \ge 8 \\ & y_1, y_2 \ge 0 \end{aligned}$$


In matrix form, the dual problem is
$$\min_y L_D(y) = b^T y \quad \text{s.t.} \quad A^T y \ge c,\ y \ge 0.$$
Complementarity Conditions at the Optimal Point $(x^*, y^*, \lambda^*)$
For the primal and dual, the optimal values of the variables are
$$x_1^* = 1,\ x_2^* = 2,\ x_3^* = 0,\ x_4^* = 0; \qquad y_1^* = 1,\ y_2^* = 4; \qquad \lambda_1^* = 0,\ \lambda_2^* = 0,\ \lambda_3^* = 2,\ \lambda_4^* = 5.$$
[Note: The values of the $\lambda$'s are obtained by substituting the optimal values of the $y$'s into the equations
$$\frac{\partial L}{\partial x_1} = 5 - (y_1 + y_2) + \lambda_1 = 0, \quad \frac{\partial L}{\partial x_2} = 6 - (2y_1 + y_2) + \lambda_2 = 0,$$
$$\frac{\partial L}{\partial x_3} = 9 - (3y_1 + 2y_2) + \lambda_3 = 0, \quad \frac{\partial L}{\partial x_4} = 8 - (y_1 + 3y_2) + \lambda_4 = 0.]$$
Let us substitute these optimal values into the Lagrangian:
$$L(x^*, y^*, \lambda^*) = \underbrace{5x_1^* + 6x_2^* + 9x_3^* + 8x_4^*}_{=17} + \underbrace{y_1^*}_{=1}\underbrace{\left(5 - x_1^* - 2x_2^* - 3x_3^* - x_4^*\right)}_{=0} + \underbrace{y_2^*}_{=4}\underbrace{\left(3 - x_1^* - x_2^* - 2x_3^* - 3x_4^*\right)}_{=0} + \underbrace{\lambda_1^*}_{=0}\underbrace{x_1^*}_{=1} + \underbrace{\lambda_2^*}_{=0}\underbrace{x_2^*}_{=2} + \underbrace{\lambda_3^*}_{=2}\underbrace{x_3^*}_{=0} + \underbrace{\lambda_4^*}_{=5}\underbrace{x_4^*}_{=0}$$
Notice that if a constraint expression is nonzero, the corresponding Lagrange multiplier is 0, and vice versa.
In matrix notation,
$$y^{*T}\left(b - Ax^*\right) = 0, \qquad \lambda^{*T} x^* = 0.$$

These conditions are called complementarity conditions.
Finally, we capture the entire Lagrangian duality theorem in the following single mathematical statement:
$$\min_{y \ge 0,\ \lambda \ge 0}\ \max_{x}\ \left[ c^T x + y^T(b - Ax) + \lambda^T x \right]$$
More on Lagrangian duality
For equality constraints we have:
Case (1)
$$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) = b$$
The Lagrangian dual problem is constructed as
$$\min_{\lambda}\ \max_{x \ge 0}\ \left[ f(x) + \lambda\left(g(x) - b\right) \right], \quad \text{here } \lambda \text{ is unrestricted in sign.}$$
Case (2)
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) = b$$
The Lagrangian dual problem is constructed as
$$\max_{\lambda}\ \min_{x \ge 0}\ \left[ f(x) + \lambda\left(g(x) - b\right) \right], \quad \text{here } \lambda \text{ is unrestricted in sign.}$$
Case (3)
$$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \ge b$$
The Lagrangian function is
$$L(x, \lambda) = f(x) + \lambda\left(g(x) - b\right), \quad \lambda \ge 0.$$

We want $g(x) - b$ to be greater than or equal to zero. Multiplying this by a positive quantity $\lambda$ and adding it to the objective function gives us a new objective function whose maximization leads to an upper bound of the original problem. The important point to note here is the sign of $\lambda$: only a positive multiplier leads to an upper bound. The solution obtained will be a function of $\lambda$. We then minimize this function w.r.t. $\lambda$ to get a tighter bound. The difference is called the duality gap. If the problem is a convex optimization problem, the duality gap will be zero.
So the Lagrangian dual is
$$\min_{\lambda \ge 0}\ \max_{x \ge 0}\ \left[ f(x) + \lambda\left(g(x) - b\right) \right]$$
Case (4)
$$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \le b$$
At first, we convert the $\le$ inequality into $\ge$ by rewriting $g(x) \le b$ as $b - g(x) \ge 0$. Then we proceed as in the previous case. Therefore, the Lagrangian function and dual are given by
$$L(x, \lambda) = f(x) + \lambda\left(b - g(x)\right), \quad \lambda \ge 0$$
$$\min_{\lambda \ge 0}\ \max_{x \ge 0}\ \left[ f(x) + \lambda\left(b - g(x)\right) \right]$$
Case (5)
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \le b$$
The Lagrangian function is
$$L(x, \lambda) = f(x) + \lambda\left(g(x) - b\right), \quad \lambda \ge 0.$$


We want $g(x) - b$ to be less than or equal to zero. Multiplying this by a positive quantity $\lambda$ and adding it to the objective function gives us a new objective function whose minimization leads to a lower bound of the original problem. The important point to note here is the sign of $\lambda$: only a positive multiplier leads to a lower bound when multiplying $g(x) - b$. The solution obtained will be a function of $\lambda$. We then maximize this function w.r.t. $\lambda$ to get a tighter bound. The difference is called the duality gap. If the problem is a convex optimization problem, the duality gap will be zero.
So the Lagrangian dual is
$$\max_{\lambda \ge 0}\ \min_{x \ge 0}\ \left[ f(x) + \lambda\left(g(x) - b\right) \right]$$
Case (6)
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \ge b$$
At first, we convert the $\ge$ inequality into $\le$ by rewriting $g(x) \ge b$ as $b - g(x) \le 0$. Then we proceed as in the previous case.
The Lagrangian function and dual are given by
$$L(x, \lambda) = f(x) + \lambda\left(b - g(x)\right), \quad \lambda \ge 0$$
$$\max_{\lambda \ge 0}\ \min_{x \ge 0}\ \left[ f(x) + \lambda\left(b - g(x)\right) \right]$$
General Case
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) \ge b_i,\ i = 1,2,\ldots,m; \qquad h_j(x) = a_j,\ j = 1,2,\ldots,n$$
$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i\left(b_i - g_i(x)\right) + \sum_{j=1}^{n} \mu_j\left(a_j - h_j(x)\right), \quad \lambda \ge 0$$

The Lagrangian dual is
$$\max_{\lambda \ge 0,\ \mu}\ \min_{x \ge 0}\ \left[ f(x) + \sum_{i=1}^{m} \lambda_i\left(b_i - g_i(x)\right) + \sum_{j=1}^{n} \mu_j\left(a_j - h_j(x)\right) \right]$$
Note that $\mu$ is unrestricted in sign and $\lambda \ge 0$.
General case with unconstrained inner optimization
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) \ge b_i,\ i = 1,2,\ldots,m; \qquad h_j(x) = a_j,\ j = 1,2,\ldots,n$$
Here we form the Lagrangian with multipliers attached to the constraint $x \ge 0$ as well:
$$L(x, \lambda, \mu, \theta) = f(x) + \sum_{i=1}^{m} \lambda_i\left(b_i - g_i(x)\right) + \sum_{j=1}^{n} \mu_j\left(a_j - h_j(x)\right) - \theta^T x, \quad \lambda \ge 0,\ \theta \ge 0$$
The Lagrangian dual is
$$\max_{\lambda \ge 0,\ \theta \ge 0,\ \mu}\ \min_{x}\ \left[ f(x) + \sum_{i=1}^{m} \lambda_i\left(b_i - g_i(x)\right) + \sum_{j=1}^{n} \mu_j\left(a_j - h_j(x)\right) - \theta^T x \right]$$
When does the Lagrange multiplier take a zero (or non-zero) value in the case of inequality constraints?
To answer this question, let us consider the problem
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) \le b_i,\ i = 1,2,\ldots,m$$
We add a positive (variable) quantity $s_i^2$ to each constraint and make each constraint an equality constraint. Thus the optimization problem becomes
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) + s_i^2 = b_i,\ i = 1,2,\ldots,m$$

The Lagrangian is given by
$$L(x, s, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i\left(g_i(x) + s_i^2 - b_i\right)$$
On differentiating with respect to the primal variables and equating to zero, we obtain the first-order necessary conditions for optimality as
$$\frac{\partial L}{\partial x} = \nabla f(x) + \sum_i \lambda_i \nabla g_i(x) = 0 \qquad (1)$$
$$\frac{\partial L}{\partial s_i} = 2\lambda_i s_i = 0 \qquad (2)$$

The second relation implies that when the slack variable is nonzero, the Lagrange multiplier must necessarily be zero. Note that the slack variable becomes zero when the constraint becomes active, that is, when the constraint becomes an equality and is on the verge of violation. So at the optimal point, a Lagrange multiplier can be nonzero only if the corresponding constraint is active. This also follows from the fact that the Lagrange multiplier is the rate of change of the objective function when an active constraint is relaxed to accommodate more search space.
This condition also implies that at the optimal point $x^*$,
$$\lambda_i\left(g_i(x^*) - b_i\right) = 0, \quad i = 1,2,\ldots,m.$$
This condition is called the complementarity condition.
What are the KKT conditions for the following optimization problem?
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) + s_i^2 = b_i,\ i = 1,2,\ldots,m$$
The Lagrangian is given by

$$L(x, s, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i\left(g_i(x) + s_i^2 - b_i\right)$$
$$\frac{\partial L}{\partial x} = \nabla f(x) + \sum_i \lambda_i \nabla g_i(x) = 0$$
$$\frac{\partial L}{\partial s_i} = 2\lambda_i s_i = 0$$
$$\frac{\partial L}{\partial \lambda_i} = g_i(x) + s_i^2 - b_i = 0$$
We can rewrite these conditions without using the slack variables as
$$\nabla f(x) = -\sum_i \lambda_i \nabla g_i(x)$$
$$\lambda_i\left(g_i(x) - b_i\right) = 0, \quad i = 1,2,\ldots,m$$
$$g_i(x) \le b_i, \quad i = 1,2,\ldots,m$$
These conditions are called the KKT conditions at the optimal point.
KKT Conditions for a General Case
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) \ge b_i,\ i = 1,2,\ldots,m; \qquad h_j(x) = a_j,\ j = 1,2,\ldots,n$$
Subtracting a positive slack variable from each of the inequality constraints, we obtain
$$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) - s_i^2 = b_i,\ i = 1,2,\ldots,m; \qquad h_j(x) = a_j,\ j = 1,2,\ldots,n$$
Taking the Lagrangian, we obtain
$$L(x, \lambda, \mu, s) = f(x) + \sum_{i=1}^{m} \lambda_i\left(b_i + s_i^2 - g_i(x)\right) + \sum_{j=1}^{n} \mu_j\left(a_j - h_j(x)\right)$$
The first-order necessary conditions are

$$\frac{\partial L}{\partial x} = \nabla f(x) - \sum_i \lambda_i \nabla g_i(x) - \sum_j \mu_j \nabla h_j(x) = 0$$
$$\frac{\partial L}{\partial s_i} = 2\lambda_i s_i = 0$$
$$\frac{\partial L}{\partial \lambda_i} = b_i + s_i^2 - g_i(x) = 0$$
$$\frac{\partial L}{\partial \mu_j} = a_j - h_j(x) = 0$$
Without using slack variables, these conditions reduce to
$$\nabla f(x) = \sum_i \lambda_i \nabla g_i(x) + \sum_j \mu_j \nabla h_j(x)$$
$$\lambda_i\left(g_i(x) - b_i\right) = 0, \quad i = 1,2,\ldots,m$$
$$g_i(x) \ge b_i, \quad i = 1,2,\ldots,m$$
$$h_j(x) = a_j, \quad j = 1,2,\ldots,n$$
Very important note: usually, in the Lagrangian formulation, we do not add slack variables, so care must be taken in writing the KKT conditions. Simply differentiating the Lagrangian with respect to the primal and dual variables and setting the result equal to zero leads to wrong KKT conditions.
Example 3: Find the dual of the following LP using the Lagrangian dual:
$$\max_x c^T x \quad \text{s.t.} \quad Ax = b,\ x \ge 0$$
$$L(x, y, s) = c^T x + y^T(b - Ax) + s^T x, \quad s \ge 0$$
$$\frac{\partial L}{\partial x} = c - A^T y + s = 0 \implies A^T y - s = c$$
Substituting $A^T y - s = c$ into the Lagrangian, we obtain
$$c^T x + y^T(b - Ax) + s^T x = (A^T y - s)^T x + y^T b - y^T Ax + s^T x = y^T b = b^T y$$
Therefore the Lagrangian dual is
$$\min_y b^T y \quad \text{s.t.} \quad A^T y - s = c,\ s \ge 0,$$
or
$$\min_y b^T y \quad \text{s.t.} \quad A^T y \ge c.$$
Example 4: Find the dual of the following LP using the Lagrangian dual:
$$\max_x c^T x \quad \text{s.t.} \quad Ax \le b,\ x \ge 0$$
The Lagrangian is
$$L(x, y, s) = c^T x + y^T(b - Ax) + s^T x, \quad y \ge 0,\ s \ge 0$$
$$\frac{\partial L}{\partial x} = c - A^T y + s = 0 \implies c - A^T y = -s$$
On substituting $c - A^T y = -s$ into the Lagrangian, we obtain
$$c^T x + y^T(b - Ax) + s^T x = (A^T y - s)^T x + y^T b - y^T Ax + s^T x = y^T b = b^T y$$
Therefore the Lagrangian dual is
$$\min_{y \ge 0} b^T y \quad \text{s.t.} \quad A^T y - s = c,\ s \ge 0,$$
or
$$\min_{y \ge 0} b^T y \quad \text{s.t.} \quad A^T y \ge c.$$



Newton's Method

Newton's method, or the Newton-Raphson method, is known to perform better than the previously discussed algorithms on quadratic functions. The previous methods utilize only the first derivatives in selecting a search direction. If higher derivatives are used, the algorithm can be more effective, and that is what Newton's method does by involving the second derivative in finding the search direction. Newton's method, however, is only locally convergent, and hence the initial point has to be somewhat close to the minimizer for good convergence. If the objective function is quadratic, the algorithm converges in one step to the true minimizer, but if the function is non-quadratic, it provides only an estimate of the position of the exact minimizer.

We can obtain a quadratic approximation to a given twice continuously differentiable objective function $f$ using the Taylor series expansion of $f$ about a point $x_i$, neglecting terms of order three and above:
$$f(x) \approx f(x_i) + (x - x_i)^T \nabla f(x_i) + \frac{1}{2}(x - x_i)^T \nabla^2 f(x_i)(x - x_i)$$
Applying the first-order necessary condition for optimality, we get
$$\nabla f(x) = \nabla f(x_i) + \nabla^2 f(x_i)(x - x_i) = 0$$
Let $g_i = \nabla f(x_i)$ and $h_i = \nabla^2 f(x_i)$. Then, if $h_i > 0$, a minimum estimate $x_{i+1}$ can be obtained as
$$x_{i+1} = x_i - \alpha_i h_i^{-1} g_i$$
where $\alpha_i$ is the step length. The convergence criterion can be a zero value of the gradient, or any other criterion we have discussed previously.
Let us see how this algorithm works through an example.
Consider the problem of minimizing Powell's function
$$f(x_1, x_2, x_3, x_4) = (x_1 + 10x_2)^2 + 5(x_3 - x_4)^2 + (x_2 - 2x_3)^4 + 10(x_1 - x_4)^4$$
The gradient and the Hessian matrix of the function are calculated as

$$\nabla f(x_1,x_2,x_3,x_4) = \begin{pmatrix} 2(x_1 + 10x_2) + 40(x_1 - x_4)^3 \\ 20(x_1 + 10x_2) + 4(x_2 - 2x_3)^3 \\ 10(x_3 - x_4) - 8(x_2 - 2x_3)^3 \\ -10(x_3 - x_4) - 40(x_1 - x_4)^3 \end{pmatrix}$$

$$\nabla^2 f(x_1,x_2,x_3,x_4) = \begin{pmatrix} 2 + 120(x_1 - x_4)^2 & 20 & 0 & -120(x_1 - x_4)^2 \\ 20 & 200 + 12(x_2 - 2x_3)^2 & -24(x_2 - 2x_3)^2 & 0 \\ 0 & -24(x_2 - 2x_3)^2 & 10 + 48(x_2 - 2x_3)^2 & -10 \\ -120(x_1 - x_4)^2 & 0 & -10 & 10 + 120(x_1 - x_4)^2 \end{pmatrix}$$

Taking $x_0 = [1, 1, 0, -1]^T$ as our starting point, $f(x_0) = 287$, and we get
$$g_0 = \begin{pmatrix} 342 \\ 224 \\ 2 \\ -330 \end{pmatrix}, \qquad h_0 = \begin{pmatrix} 482 & 20 & 0 & -480 \\ 20 & 212 & -24 & 0 \\ 0 & -24 & 58 & -10 \\ -480 & 0 & -10 & 490 \end{pmatrix}$$
$$h_0^{-1} = \begin{pmatrix} 0.1126 & -0.0089 & 0.0154 & 0.1106 \\ -0.0089 & 0.0057 & 0.0008 & -0.0087 \\ 0.0154 & 0.0008 & 0.0203 & 0.0155 \\ 0.1106 & -0.0087 & 0.0155 & 0.1107 \end{pmatrix}$$
$$h_0^{-1} g_0 = \begin{pmatrix} 0.0476 & 1.0952 & 0.3810 & -0.6190 \end{pmatrix}^T$$
$$x_1 = x_0 - h_0^{-1} g_0 = [0.9524,\ -0.0952,\ -0.3810,\ -0.3810]^T$$
Proceeding in this manner, we obtain the following results until the algorithm converges:

Iteration | x values | Function value
1 | [0.9524, -0.0952, -0.3810, -0.3810]^T | 31.8089
2 | [0.6349, -0.0635, -0.2540, -0.2540]^T | 6.2823
3 | [0.4233, -0.0423, -0.1693, -0.1693]^T | 1.2409
4 | [0.2822, -0.0282, -0.1129, -0.1129]^T | 0.2452
5 | [0.1881, -0.0188, -0.0753, -0.0753]^T | 0.0484
6 | [0.1254, -0.0125, -0.0502, -0.0502]^T | 0.0096

The iterations continue until the minimum is reached at the point [0, 0, 0, 0]^T.
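The iteration above is easy to reproduce in code. The following sketch implements the full-step ($\alpha_i = 1$) Newton iteration for Powell's function, using the gradient and Hessian derived above:

```python
import numpy as np

def grad(x):
    x1, x2, x3, x4 = x
    return np.array([2*(x1 + 10*x2) + 40*(x1 - x4)**3,
                     20*(x1 + 10*x2) + 4*(x2 - 2*x3)**3,
                     10*(x3 - x4) - 8*(x2 - 2*x3)**3,
                     -10*(x3 - x4) - 40*(x1 - x4)**3])

def hess(x):
    x1, x2, x3, x4 = x
    a, b = 120*(x1 - x4)**2, 12*(x2 - 2*x3)**2
    return np.array([[2 + a,  20,       0,        -a],
                     [20,     200 + b,  -2*b,      0],
                     [0,      -2*b,     10 + 4*b, -10],
                     [-a,     0,        -10,      10 + a]])

f = lambda x: (x[0] + 10*x[1])**2 + 5*(x[2] - x[3])**2 \
              + (x[1] - 2*x[2])**4 + 10*(x[0] - x[3])**4

x = np.array([1.0, 1.0, 0.0, -1.0])
for k in range(1, 7):
    x = x - np.linalg.solve(hess(x), grad(x))   # full Newton step
    print(k, np.round(x, 4), round(f(x), 4))    # matches the table above
```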

Analysis of Newton's Method

There is no guarantee that the Newton step will point in a direction of decreasing objective function values if $h_i$ is not positive definite. Moreover, even if $h_i > 0$, the direction may still not be a descent direction. Some remedial measures are discussed afterwards. Despite this drawback, Newton's method has superior convergence when the starting point is near the minimizer.
The convergence analysis of Newton's method when the objective function is quadratic is straightforward.
Let
$$f(x) = \frac{1}{2} x^T A x - b^T x + c$$
$$g(x) = Ax - b, \qquad h(x) = A$$
Given any initial point $x_0$,
$$x_1 = x_0 - h_0^{-1} g_0 = x_0 - A^{-1}(Ax_0 - b) = A^{-1}b = x^*$$
with $x^*$ being the true minimizer. Thus the algorithm converges in a single step for a quadratic function, irrespective of the starting point, and the order of convergence for this case is infinity. For general cases the order of convergence is at least 2.
Newton's method has superior convergence properties if the starting point is near the solution, but it is not guaranteed to converge if we start far away, and it may not even be well defined because of singularity of the Hessian matrix. Other drawbacks of the method are that evaluating the Hessian matrix in large dimensions can be computationally expensive, and furthermore a set of n linear equations has to be solved to get the search direction in each iteration. The main problem arises from the Hessian matrix not being positive definite. We will see a method to overcome this difficulty in the next section.
Levenberg-Marquardt Modification

If the Hessian matrix h_i is not positive definite, then the search direction d_i = −h_i^{−1} g_i may
not point in a descent direction. A simple technique to overcome this is the Levenberg-Marquardt
modification of Newton's algorithm:

x_{i+1} = x_i − (h_i + μ_i I)^{−1} g_i

where μ_i ≥ 0.
The underlying idea of this modification is as follows. Consider a symmetric matrix h, which may
not be positive definite. Let λ_1, λ_2, ..., λ_n be the eigenvalues of h with the respective
eigenvectors v_1, v_2, ..., v_n. The eigenvalues are real, but not all of them need be positive. Now
consider the matrix G = h + μI with μ ≥ 0. The eigenvalues of G are λ_1 + μ, λ_2 + μ, ..., λ_n + μ.
In fact,

G v_i = (h + μI) v_i = h v_i + μ v_i = λ_i v_i + μ v_i = (λ_i + μ) v_i

This means that v_i is also an eigenvector of G, with eigenvalue λ_i + μ. Therefore, if μ is
sufficiently large, all the eigenvalues of G will be positive, making G positive definite. The search
direction will then always point in a descent direction.
x_{i+1} = x_i − α_i (h_i + μ_i I)^{−1} g_i

The Levenberg-Marquardt modification of Newton's algorithm approaches the behaviour of the
pure Newton's method as μ tends to zero, and approaches a pure gradient method with small step
size as μ tends to infinity. In practice, we may start with a small value of μ and then slowly
increase it until the iteration produces descent.
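A minimal MATLAB sketch of this damping loop on a small non-convex example of our own (the
objective, gradient and Hessian handles below are illustrative, not part of any toolbox):

%% Code: Levenberg-Marquardt-damped Newton iteration (sketch)
fun  = @(x) x(1)^4 - 2*x(1)^2 + x(2)^2;            % example non-convex objective
grad = @(x) [4*x(1)^3 - 4*x(1); 2*x(2)];
hess = @(x) [12*x(1)^2 - 4, 0; 0, 2];              % indefinite near x(1) = 0
x = [0.5; 1];  mu = 1e-3;
for k = 1:20
    g = grad(x);  h = hess(x);
    d = -(h + mu*eye(2)) \ g;                      % damped Newton direction
    if fun(x + d) < fun(x)
        x = x + d;  mu = mu/2;                     % accept step, relax damping
    else
        mu = 10*mu;                                % reject step, increase damping
    end
end
disp(x)                                            % approaches the minimizer [-1; 0]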
Newton's Method for Nonlinear Least Squares

Suppose we are given m measurements of a process at m points in time. Let t_1, t_2, ..., t_m be
the measurement times and y_1, y_2, ..., y_m the measurement values, as shown in the figure.

Figure: Measurement values y_i plotted against the measurement times t_i

We want to fit a sinusoidal curve to the measurements so as to predict the process at a particular
time. The equation of the sinusoid is

y = A sin(ωt + φ)

We have to find the values of the parameters A, ω and φ that minimize the error between the
actual and the predicted values of the measurements. We can construct the objective function as

minimize Σ_{i=1}^{m} ( y_i − A sin(ωt_i + φ) )^2
Problems of this type are known as nonlinear least squares problems. In general such problems
are defined as

minimize Σ_{i=1}^{m} f_i(x)^2

where the f_i(x) are given functions.

We will see how we can apply Newton's method to the example problem. Let x = [A, ω, φ]^T be
the vector of decision variables and write

r_i(x) = y_i − A sin(ωt_i + φ)

Defining r = [r_1, r_2, ..., r_m]^T, the objective function can be expressed as

f(x) = r(x)^T r(x)

To apply Newton's method we need to compute the gradient and the Hessian of f.
The j-th component of the gradient is

(∇f(x))_j = ∂f/∂x_j (x) = 2 Σ_{i=1}^{m} r_i(x) ∂r_i/∂x_j (x)

Denoting the Jacobian matrix of r as

J(x) = [ ∂r_1/∂x_1 (x)  ...  ∂r_1/∂x_n (x)
              ...                ...
         ∂r_m/∂x_1 (x)  ...  ∂r_m/∂x_n (x) ]

the gradient can be represented as

∇f(x) = 2 J(x)^T r(x)
To compute the Hessian matrix, note that its (k, j)-th component is

∂²f/∂x_k∂x_j (x) = ∂/∂x_k ( 2 Σ_{i=1}^{m} r_i(x) ∂r_i/∂x_j (x) )
                 = 2 Σ_{i=1}^{m} ( ∂r_i/∂x_k (x) ∂r_i/∂x_j (x) + r_i(x) ∂²r_i/∂x_k∂x_j (x) )

Let S(x) be the matrix whose (k, j)-th component is

S(x)_{kj} = Σ_{i=1}^{m} r_i(x) ∂²r_i/∂x_k∂x_j (x)

The Hessian can now be written as

H(x) = 2 ( J(x)^T J(x) + S(x) )
Therefore, Newton's method applied to the nonlinear least squares problem is given by

x_{i+1} = x_i − ( J(x_i)^T J(x_i) + S(x_i) )^{−1} J(x_i)^T r(x_i)

If the second derivatives are considerably small, the matrix S(x) can be neglected, in which case
Newton's method reduces to the Gauss-Newton method:

x_{i+1} = x_i − ( J(x_i)^T J(x_i) )^{−1} J(x_i)^T r(x_i)

In the Gauss-Newton method too, the matrix J(x)^T J(x) may sometimes fail to be positive
definite, and as before the Levenberg-Marquardt modification can be implemented by adding a
positive μI term to it. An alternative interpretation of the Levenberg-Marquardt algorithm is to
view the term μI as an approximation to S(x) in Newton's method.
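As a concrete sketch, the Gauss-Newton iteration for the sinusoid-fitting example can be coded as
follows. The synthetic data and starting guess below are our own illustration, not measurements
from the text, and this is not a robust fitting routine:

%% Code: Gauss-Newton for the sinusoid fit (sketch)
t = (0:0.1:2)';                              % measurement times (synthetic)
y = 2*sin(3*t + 0.5);                        % noise-free synthetic measurements
x = [1.5; 2.8; 0.3];                         % initial guess for [A, omega, phi]
for k = 1:50
    r = y - x(1)*sin(x(2)*t + x(3));         % residual vector r(x)
    J = [ -sin(x(2)*t + x(3)), ...           % dr/dA
          -x(1)*t.*cos(x(2)*t + x(3)), ...   % dr/domega
          -x(1)*cos(x(2)*t + x(3)) ];        % dr/dphi
    d = -(J'*J) \ (J'*r);                    % Gauss-Newton direction
    x = x + d;
    if norm(d) < 1e-8, break; end
end
disp(x)                                      % approaches [2; 3; 0.5]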
Quasi-Newton Methods

In Newton's method, we have to calculate the Hessian of the objective function at each iteration.
When the dimension of the problem is high and the Hessian of the function is difficult to
calculate, the computation of the inverse Hessian may take considerable time. Quasi-Newton
methods avoid computing the Hessian while retaining the fast local convergence of Newton's
method. They aim at a computationally economical approximation to the inverse Hessian. This
approximation is then updated at each iteration so that it retains some of the properties of the true
inverse Hessian, the prominent one being positive definiteness, which guarantees a descent
direction.
Quasi-Newton Condition

Suppose that we have already calculated x_i, x_{i+1}, ∇f(x_i) and ∇f(x_{i+1}). From the Taylor
series expansion we can write

∇f(x_{i+1}) = ∇f(x_i) + h_i (x_{i+1} − x_i) + o(‖x_{i+1} − x_i‖)

Neglecting the higher order terms and denoting the inverse Hessian by H, we can rewrite the
previous equation as

H_i ( ∇f(x_{i+1}) − ∇f(x_i) ) = x_{i+1} − x_i

This is called the Quasi-Newton condition. In the general form we write it as

H_{i+1} γ_i = δ_i

where γ_i = ∇f(x_{i+1}) − ∇f(x_i) and δ_i = x_{i+1} − x_i.
Specifically, Quasi-Newton methods have the form

d_i = −H_i g_i
α_i = argmin_{α ≥ 0} f(x_i + α d_i)
x_{i+1} = x_i + α_i d_i

Quasi-Newton methods are in a sense conjugate direction methods, since the search directions
generated are A-conjugate. There are some specific updating formulae for the inverse Hessian:
Broyden's method, the DFP method, and the BFGS method.

Broyden's method

Broyden's method is a rank one correction formula for obtaining the inverse Hessian H_{i+1}
from H_i. We write it as

H_{i+1} = H_i + α_i z_i z_i^T

This is called a rank one correction since

rank( z_i z_i^T ) = rank( [z_{1i}, ..., z_{ni}]^T [z_{1i}, ..., z_{ni}] ) = 1

This can be verified by substituting any value for z_i. This is sometimes called the single-rank
symmetric (SRS) algorithm. It can also be observed that if H_i is symmetric then H_{i+1} will
also be symmetric. Now the task is to find α_i and z_i such that the Quasi-Newton condition is
satisfied:

H_{i+1} γ_i = ( H_i + α_i z_i z_i^T ) γ_i = δ_i
Since z_i^T γ_i is a scalar, z_i can be expressed as

z_i = ( δ_i − H_i γ_i ) / ( α_i z_i^T γ_i )

Hence,

H_{i+1} = H_i + ( δ_i − H_i γ_i )( δ_i − H_i γ_i )^T / ( α_i (z_i^T γ_i)^2 )

To eliminate z_i, we premultiply α_i z_i (z_i^T γ_i) = δ_i − H_i γ_i by γ_i^T to obtain

γ_i^T δ_i − γ_i^T H_i γ_i = α_i ( γ_i^T z_i )^2

Substituting this relation gives

H_{i+1} = H_i + ( δ_i − H_i γ_i )( δ_i − H_i γ_i )^T / ( γ_i^T ( δ_i − H_i γ_i ) )
The Broyden algorithm

Step 1: Set i = 0 and select x_0 and a real symmetric positive definite H_0.
Step 2: If g_i = 0, stop; else d_i = −H_i g_i.
Step 3: Compute α_i = argmin_{α ≥ 0} f(x_i + α d_i) and x_{i+1} = x_i + α_i d_i.
Step 4: Compute γ_i = ∇f(x_{i+1}) − ∇f(x_i), δ_i = x_{i+1} − x_i and

H_{i+1} = H_i + ( δ_i − H_i γ_i )( δ_i − H_i γ_i )^T / ( γ_i^T ( δ_i − H_i γ_i ) )

Step 5: Set i = i + 1 and go to Step 2.
This algorithm also has the disadvantage that the approximate inverse Hessian might not always
be positive definite. So we proceed to a rank two correction formula, namely the DFP algorithm.
Example

Find the minimizer of the function f(x, y) = 1.5x^2 + y^2 + 5.

The function can be written as f(x, y) = 0.5 X^T A X + 5 with

A = [3 0; 0 2],  X = [x; y]

Take X_0 = [1, 2]^T. Then g(x, y) = [3x, 2y]^T, so g_0 = [3, 4]^T. Let H_0 be an identity matrix
of order 2.
The search direction d_0 = −H_0 g_0 = [−3, −4]^T.

The step length α_0 = −g_0^T d_0 / ( d_0^T A d_0 ) = 0.4237

X_1 = X_0 + α_0 d_0 = [−0.2711, 0.3052]^T
δ_0 = X_1 − X_0 = α_0 d_0 = [−1.2712, −1.6949]^T
g_1 = [−0.8133, 0.6104]^T
γ_0 = g_1 − g_0 = [−3.8133, −3.3896]^T

H_1 = H_0 + ( δ_0 − H_0 γ_0 )( δ_0 − H_0 γ_0 )^T / ( γ_0^T ( δ_0 − H_0 γ_0 ) )
    = [ 0.5814  −0.2791
       −0.2791   0.8140 ]

d_1 = −H_1 g_1 = [0.6432, −0.7238]^T
α_1 = 0.4216
X_2 = X_1 + α_1 d_1 = [0, 0]^T

Since this is a quadratic problem in two variables, the solution is arrived at in the second iteration
itself. The solution can be verified from the contour plot of the function.
Figure: Plot of the function f(x, y)
Figure: Contour plot of the function
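The two iterations above can be reproduced with a short script. This is a minimal sketch for the
quadratic case only (the exact step length formula used below is valid only for quadratics):

%% Code: Broyden (rank one) method on the quadratic example (sketch)
A = [3 0; 0 2];  X = [1; 2];  H = eye(2);
g = A*X;                                     % gradient of 0.5*X'*A*X + 5
for k = 1:2
    d = -H*g;
    alpha = -(g'*d)/(d'*A*d);                % exact line search (quadratic case)
    Xnew = X + alpha*d;   gnew = A*Xnew;
    delta = Xnew - X;     gamma = gnew - g;
    u = delta - H*gamma;
    H = H + (u*u')/(gamma'*u);               % rank one correction
    X = Xnew;  g = gnew;
end
disp(X)                                      % converges to [0; 0]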

DFP Algorithm

As mentioned previously, the DFP algorithm uses a rank two update:

H_{i+1} = H_i + α_i z_i z_i^T + β_i p_i p_i^T

Now this must satisfy the Quasi-Newton condition

( H_i + α_i z_i z_i^T + β_i p_i p_i^T ) γ_i = δ_i

Carrying out the computations as done previously, we obtain

z_i = δ_i,  p_i = H_i γ_i,  α_i = 1/(δ_i^T γ_i),  β_i = −1/(γ_i^T H_i γ_i)

On substitution,

H_{i+1} = H_i + δ_i δ_i^T / ( δ_i^T γ_i ) − ( H_i γ_i )( H_i γ_i )^T / ( γ_i^T H_i γ_i )

The DFP algorithm is the same as the Broyden algorithm except for the rank two update of the
inverse Hessian. This formula was considered by Davidon and then modified by Fletcher and
Powell, hence the name DFP. The method is also called the variable metric algorithm.
Example

Find the minimizer of the function f(x, y) = x^2 + xy + y^2 + 3x − 2y.

The function can be rewritten as f(x, y) = 0.5 X^T A X − b^T X with

A = [2 1; 1 2],  b = [−3; 2],  X = [x; y]

Take X_0 = [1, 0]^T. Then g = [2x + y + 3, x + 2y − 2]^T, so g_0 = [5, −1]^T. Again let H_0 be
the identity matrix of second order. The search direction d_0 = −H_0 g_0 = [−5, 1]^T.

The step length α_0 = −g_0^T d_0 / ( d_0^T A d_0 ) = 0.6190

X_1 = X_0 + α_0 d_0 = [−2.0952, 0.6190]^T
δ_0 = X_1 − X_0 = α_0 d_0 = [−3.0952, 0.6190]^T
g_1 = [−0.5714, −2.8571]^T
γ_0 = g_1 − g_0 = [−5.5714, −1.8571]^T

H_1 = H_0 + δ_0 δ_0^T / ( δ_0^T γ_0 ) − ( H_0 γ_0 )( H_0 γ_0 )^T / ( γ_0^T H_0 γ_0 )
    = [ 0.6952  −0.4190
       −0.4190   0.9238 ]

d_1 = −H_1 g_1 = [−0.8, 2.4]^T
α_1 = 0.7143
X_2 = X_1 + α_1 d_1 = [−2.6667, 2.3333]^T

The result can be verified against the plots given. The quadratic function of two variables, as
expected, converges in the second iteration to the true solution.
Figure: Plot of the function
Figure: Contour plot of the function
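Relative to the rank one sketch above, only the update line changes. A minimal DFP version on
this example (same ad-hoc variable names as before):

%% Code: DFP method on the quadratic example (sketch)
A = [2 1; 1 2];  b = [-3; 2];  X = [1; 0];  H = eye(2);
g = A*X - b;
for k = 1:2
    d = -H*g;
    alpha = -(g'*d)/(d'*A*d);                % exact line search (quadratic case)
    Xnew = X + alpha*d;   gnew = A*Xnew - b;
    delta = Xnew - X;     gamma = gnew - g;
    Hg = H*gamma;
    H = H + (delta*delta')/(delta'*gamma) - (Hg*Hg')/(gamma'*Hg);  % DFP update
    X = Xnew;  g = gnew;
end
disp(X)                                      % converges to [-2.6667; 2.3333]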
The major advantage of this method is that the positive definiteness and the symmetry of the H
matrix are preserved. When used for minimizing quadratic functions, the search directions
generated are all A-conjugate, and at the n-th iteration H_n becomes the true inverse of the
Hessian h. When used along with an exact line search for minimizing quadratic functions, the
method converges within n steps. If the function is strictly convex, then together with exact line
search techniques the method shows global convergence. However, in the case of large non-
quadratic problems, the algorithm has a tendency to get stuck owing to near-singularity of the H
matrix. To overcome this, the BFGS algorithm was formulated.
BFGS Algorithm

The BFGS algorithm was suggested independently by Broyden, Fletcher, Goldfarb and Shanno.
In all the previous algorithms, update formulas were derived for approximating the inverse of the
Hessian matrix. An alternative is to approximate the Hessian matrix itself. To do this, let B_i be
the approximation of the Hessian at the i-th step. B_{i+1} must satisfy the relation

B_{i+1} δ_j = γ_j,  0 ≤ j ≤ i

We can observe that this condition is the same as the previous Quasi-Newton condition for
H_{i+1} except that the roles of γ and δ are interchanged. Thus, given any update formula for the
H matrix, the corresponding formula for B can be obtained by interchanging H and B as well as γ
and δ. Specifically, the DFP update for H corresponds to the BFGS update for B, and formulas
related in this fashion are called dual or complementary.
The DFP update for H is

H_{i+1} = H_i + δ_i δ_i^T / ( δ_i^T γ_i ) − ( H_i γ_i )( H_i γ_i )^T / ( γ_i^T H_i γ_i )

and, by making use of the duality concept, the BFGS update formula for B is obtained as

B_{i+1} = B_i + γ_i γ_i^T / ( γ_i^T δ_i ) − ( B_i δ_i )( B_i δ_i )^T / ( δ_i^T B_i δ_i )
The above equation gives the BFGS update for the approximate Hessian. To find the inverse of
the approximate Hessian, take

H_{i+1} = B_{i+1}^{−1} = ( B_i + γ_i γ_i^T / ( γ_i^T δ_i ) − ( B_i δ_i )( B_i δ_i )^T / ( δ_i^T B_i δ_i ) )^{−1}

To obtain the inverse of the B matrix, we make use of the Sherman-Morrison formula for the
matrix inverse, stated as follows. If M is a nonsingular matrix and u and v are column vectors
such that 1 + v^T M^{−1} u ≠ 0, then M + uv^T is nonsingular and

( M + uv^T )^{−1} = M^{−1} − ( M^{−1} u )( v^T M^{−1} ) / ( 1 + v^T M^{−1} u )

Applying this relation twice to B_{i+1} yields

H_{i+1} = H_i + ( 1 + γ_i^T H_i γ_i / ( γ_i^T δ_i ) ) δ_i δ_i^T / ( δ_i^T γ_i )
              − ( δ_i γ_i^T H_i + H_i γ_i δ_i^T ) / ( γ_i^T δ_i )

This is the BFGS formula for updating H_i.
Example

Find the minimizer of the function f(x, y) = 1.5x^2 − 2xy + y^2 − 2x − y + 5.

The function can be rewritten as f(x, y) = 0.5 X^T A X − b^T X + 5 with

A = [3 −2; −2 2],  b = [2; 1],  X = [x; y]

Take X_0 = [2, 3]^T. Then g = [3x − 2y − 2, −2x + 2y − 1]^T, so g_0 = [−2, 1]^T. Again let H_0
be the identity matrix of second order. The search direction d_0 = −H_0 g_0 = [2, −1]^T.

The step length α_0 = −g_0^T d_0 / ( d_0^T A d_0 ) = 0.2273

X_1 = X_0 + α_0 d_0 = [2.4545, 2.7727]^T
δ_0 = X_1 − X_0 = α_0 d_0 = [0.4545, −0.2273]^T
g_1 = [−0.1818, −0.3636]^T
γ_0 = g_1 − g_0 = [1.8182, −1.3636]^T

H_1 = H_0 + ( 1 + γ_0^T H_0 γ_0 / ( γ_0^T δ_0 ) ) δ_0 δ_0^T / ( δ_0^T γ_0 )
          − ( δ_0 γ_0^T H_0 + H_0 γ_0 δ_0^T ) / ( γ_0^T δ_0 )
    = [ 0.5537  0.4050
        0.4050  0.7066 ]

d_1 = −H_1 g_1 = [0.2479, 0.3306]^T
α_1 = 2.2
X_2 = X_1 + α_1 d_1 = [3.0000, 3.5000]^T


Figure: Plot of the function
Figure: Contour plot of the function
Like the DFP method, this method also ensures A-conjugacy of the search directions and the
positive definiteness of the approximate inverse Hessian. The BFGS update is reasonably robust
when the line search techniques are not exact, and it is far more efficient than the DFP algorithm.
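Likewise, a minimal sketch of the BFGS iteration on the example above (same ad-hoc variable
names as in the earlier sketches):

%% Code: BFGS (inverse-Hessian form) on the quadratic example (sketch)
A = [3 -2; -2 2];  b = [2; 1];  X = [2; 3];  H = eye(2);
g = A*X - b;
for k = 1:2
    d = -H*g;
    alpha = -(g'*d)/(d'*A*d);                % exact line search (quadratic case)
    Xnew = X + alpha*d;   gnew = A*Xnew - b;
    delta = Xnew - X;     gamma = gnew - g;
    gd = gamma'*delta;
    H = H + (1 + (gamma'*H*gamma)/gd)*(delta*delta')/gd ...
          - (delta*(gamma'*H) + (H*gamma)*delta')/gd;   % BFGS update
    X = Xnew;  g = gnew;
end
disp(X)                                      % converges to [3.0; 3.5]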

Convex optimization:

1. Consider the unconstrained problem min_x f(x), where f: R^n → R is smooth.

a. One form of the Barzilai-Borwein method takes steps of the form

x_{k+1} = x_k − α_k ∇f(x_k)

where

s_k := x_k − x_{k−1},  y_k := ∇f(x_k) − ∇f(x_{k−1}),  α_k := s_k^T s_k / ( s_k^T y_k )

Write down an explicit formula for α_k in terms of s_k and A, for the special case in which f is a
strictly convex quadratic, that is, f(x) = ½ x^T A x, where A is symmetric positive definite.

b. Considering the steepest descent method x_{k+1} = x_k − α_k ∇f(x_k) applied to the strictly
convex quadratic, write down an explicit formula for the exact minimizing α_k.

c. Show that the step lengths obtained in parts (a) and (b) are related as follows: the Barzilai-
Borwein step α_{k+1} of part (a) equals the exact steepest descent step α_k of part (b).
Solution:

a. Consider f(x) = ½ x^T A x. Therefore ∇f(x) = Ax. Since

y_k := ∇f(x_k) − ∇f(x_{k−1}) = A x_k − A x_{k−1} = A ( x_k − x_{k−1} ) = A s_k

we have

α_k := s_k^T s_k / ( s_k^T y_k ) = s_k^T s_k / ( s_k^T A s_k )
b. Consider the steepest descent method x_{k+1} = x_k − α_k ∇f(x_k). Since f(x) = ½ x^T A x,

f( x_k − α_k ∇f(x_k) ) = ½ ( x_k − α_k ∇f(x_k) )^T A ( x_k − α_k ∇f(x_k) )

To find the exact minimizing α_k, differentiate the above expression with respect to α_k and
equate to zero. Expanding,

f( x_k − α_k ∇f(x_k) ) = ½ [ x_k^T A x_k − 2 α_k ∇f(x_k)^T A x_k + α_k^2 ∇f(x_k)^T A ∇f(x_k) ]

On differentiating and equating to zero,

−∇f(x_k)^T A x_k + α_k ∇f(x_k)^T A ∇f(x_k) = 0

Therefore, using ∇f(x_k) = A x_k,

α_k = ∇f(x_k)^T A x_k / ( ∇f(x_k)^T A ∇f(x_k) ) = x_k^T A^2 x_k / ( x_k^T A^3 x_k )
c. We have s_{k+1} := x_{k+1} − x_k = −α_k ∇f(x_k) = −α_k A x_k, so A x_k = −s_{k+1}/α_k.
From part (a),

α_{k+1} = s_{k+1}^T s_{k+1} / ( s_{k+1}^T A s_{k+1} )
        = ( α_k A x_k )^T ( α_k A x_k ) / ( ( α_k A x_k )^T A ( α_k A x_k ) )
        = x_k^T A^2 x_k / ( x_k^T A^3 x_k )
        = α_k

which is exactly the steepest descent step of part (b).
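A minimal MATLAB sketch of this Barzilai-Borwein iteration on a small quadratic (our own
variable names; the first step length is arbitrary):

%% Code: Barzilai-Borwein method on a quadratic (sketch)
A = [3 0; 0 2];  x = [1; 2];
g = A*x;  alpha = 0.1;                       % arbitrary first step
for k = 1:50
    xnew = x - alpha*g;  gnew = A*xnew;
    s = xnew - x;  y = gnew - g;             % here y = A*s
    alpha = (s'*s)/(s'*y);                   % BB step length
    x = xnew;  g = gnew;
    if norm(g) < 1e-10, break; end
end
disp(x)                                      % converges to [0; 0]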
2. Suppose that f: R^n → R is a twice continuously differentiable function and that {x_k} is a
sequence of iterates in R^n.

a. Suppose that liminf ‖∇f(x_k)‖ = 0. Is it true that all accumulation points of {x_k} are stationary
(that is, satisfy the first order necessary conditions)?

b. Suppose that lim ‖∇f(x_k)‖ = 0. Is it true that all accumulation points of {x_k} are stationary?

c. Suppose that the sequence {x_k} converges to a point x*, that the gradients ∇f(x_k) converge
to zero, and that the Hessians ∇²f(x_k) at all these points are positive definite. Show that the
second order necessary conditions are satisfied at the limit x*.

d. For the situation described in part (c), can we say that the second order sufficient conditions
will be satisfied at x*? Explain.

Solution:

a. No. liminf ‖∇f(x_k)‖ = 0 guarantees only that there is a subsequence K such that
lim_{k∈K} ∇f(x_k) = 0. An accumulation point may be the limit of another subsequence K' for
which ∇f(x_k), k ∈ K', does not approach 0.

b. Yes. Since lim ∇f(x_k) = 0, we have lim_{k∈K} ∇f(x_k) = 0 for every subsequence K of
{1, 2, ...}. If x̄ is any accumulation point, there is a subsequence K' such that lim_{k∈K'} x_k = x̄.
We have lim_{k∈K'} ∇f(x_k) = ∇f(x̄) = 0, so x̄ is stationary.

c. We have ∇f(x*) = lim_k ∇f(x_k) = 0. Since all the ∇²f(x_k) are positive definite, the limit
∇²f(x*) is at least positive semidefinite. (The minimum eigenvalue of ∇²f(x_k) is positive for all
k; it may approach zero as k → ∞ but cannot become negative.)

d. No. We have λ_min( ∇²f(x_k) ) > 0 for every k, but in the limit ∇²f(x*) may be only positive
semidefinite, not positive definite.
3. a. The BFGS quasi-Newton updating formula for the approximate inverse Hessian H_k can be
written as follows:

H_{k+1} = ( I − ρ_k s_k y_k^T ) H_k ( I − ρ_k y_k s_k^T ) + ρ_k s_k s_k^T,  where ρ_k = 1/( y_k^T s_k )

Show that if H_k is positive definite and the curvature condition y_k^T s_k > 0 holds, then
H_{k+1} is also positive definite.

b. If y_k^T s_k ≤ 0, is it still possible for H_{k+1} to be positive definite?

Solution:

a. Suppose there exists v ≠ 0 such that v^T H_{k+1} v = 0. Then

v^T H_{k+1} v = v^T ( I − ρ_k s_k y_k^T ) H_k ( I − ρ_k y_k s_k^T ) v + ρ_k v^T s_k s_k^T v = 0

Since ρ_k = 1/( s_k^T y_k ) > 0 and H_k is positive definite, both terms are nonnegative, so each
must vanish:

( I − ρ_k y_k s_k^T ) v = 0  and  s_k^T v = 0

The first condition gives v = ρ_k ( s_k^T v ) y_k, and combined with the second this implies
v = 0, a contradiction. Hence H_{k+1} is positive definite.

b. No. By the secant condition we have H_{k+1} y_k = s_k. Taking the inner product of both
sides with y_k, we obtain y_k^T H_{k+1} y_k = y_k^T s_k ≤ 0; therefore H_{k+1} cannot be
positive definite.
4. Consider the following form of the conjugate gradient method for solving Ax = b (or,
equivalently, minimizing f(x) = ½ x^T A x − b^T x, where A is symmetric positive definite).

Given x_0:
Set r_0 ← A x_0 − b, p_0 ← −r_0, k ← 0;
while r_k ≠ 0
    α_k ← −( r_k^T p_k ) / ( p_k^T A p_k );
    x_{k+1} ← x_k + α_k p_k;
    r_{k+1} ← A x_{k+1} − b;
    β_{k+1} ← ( r_{k+1}^T A p_k ) / ( p_k^T A p_k );
    p_{k+1} ← −r_{k+1} + β_{k+1} p_k;
    k ← k + 1;
end (while)

Show that r_k^T p_j = 0 for all j = 0, 1, ..., k − 1.   (0.1)

(You may assume that the vectors p_j are conjugate, i.e. p_j^T A p_i = 0 when i ≠ j.)

(Hint: prove by induction. Show first that r_{k+1}^T p_k = 0 for all k, which establishes (0.1) for
k = 1. Then show that if (0.1) holds for some k, it continues to hold for k + 1, i.e.
r_{k+1}^T p_j = 0 for all j = 0, 1, ..., k.)

Solution:

r_{k+1} = A x_{k+1} − b = A ( x_k + α_k p_k ) − b = r_k + α_k A p_k

Taking the inner product of both sides with p_k, we obtain

p_k^T r_{k+1} = p_k^T r_k + α_k p_k^T A p_k = 0,  by definition of α_k

Thus r_{k+1}^T p_k = 0 for every k; in particular the claim (0.1) holds for k = 1.

Suppose now that r_k^T p_j = 0 for j = 0, 1, ..., k − 1 holds for some k. For j = 0, 1, ..., k − 1,
take the inner product of r_{k+1} = r_k + α_k A p_k with p_j to obtain

p_j^T r_{k+1} = p_j^T r_k + α_k p_j^T A p_k = 0

Here p_j^T r_k = 0 by the inductive hypothesis and p_j^T A p_k = 0 by conjugacy. Together with
r_{k+1}^T p_k = 0 shown above, this gives r_{k+1}^T p_j = 0 for all j = 0, 1, ..., k, as required.


Minimizing Non-Smooth Convex Functions
Introduction
In many applications of optimization, an exact solution is less useful than a simple, well structured
approximate solution. An example is found in compressed sensing, where we prefer a sparse signal (e.g.
containing few frequencies) that matches the observations well to a more complex signal that matches the
observations even more closely. The need for simple, approximate solutions has a profound effect on the
way that optimization problems are formulated and solved. Regularization terms can be introduced into
the formulation to induce the desired structure, but such terms are often non-smooth and thus may
complicate the algorithms. On the other hand, an algorithm that is too slow for finding exact solutions
may become competitive and even superior when we need only an approximate solution. In this section
we introduce a non-smooth function that frequently appears in modern optimization problems and
describe how it is tackled elegantly in the framework of traditional optimization theory, which has simple
first order and second order conditions for checking optimality.
Non-Smooth Function: the l1 norm
One of the most widely used non-smooth functions is f(x) = |x| = abs(x). Its graph is given by

Figure: Plot of f(x) = |x|

When more variables are involved,

f(x) = ‖x‖_1 = Σ_{i=1}^{n} |x_i|

Figure: Plot of f(x) = f(x_1, x_2) = |x_1| + |x_2|
Figure: Plot of f(x) = f(x_1, x_2) = x_1^2 + x_2^2
Minimizing a function means finding the point at which the function value is minimum.
Let us consider the function

f(x) = λ|x| + (x − c)^2,  λ > 0

where λ is a control parameter (it has to be given to the algorithm). The function consists of two
parts, namely f_1(x) = λ|x| and f_2(x) = (x − c)^2. Both are convex functions. The first is not
differentiable at x = 0, whereas the second is differentiable at every x. This type of optimization
problem is now very typical in sparsity constrained optimization.

For every x there is a function value f(x). Here λ is treated as a weighting or control parameter.
We have to find the point at which the function f(x) is minimum. As per calculus, f'(x) = 0 gives
the solution of the above unconstrained optimization problem if the functions are differentiable
everywhere. However, one of our functions is not differentiable everywhere, though we know
that it fails to be differentiable only at one point. This section is about how to overcome such a
situation. Initially we assume the solution is not at x = 0. In that case,

x* = argmin_x λ|x| + (x − c)^2

f'(x) = 2(x − c) + λ sign(x) = 0,  since (d/dx)|x| = sign(x) for x ≠ 0

implies

x* = c − (λ/2) sign(x*)

Note that here we should have an idea about the location (on the positive or negative side of the
x axis) of the optimal point to decide the sign.
Consider three examples to deduce an algorithm for the above minimization problem.

Example 1:

In the above one-variable problem let us choose λ = 1 and c = 10. Then the equation becomes

f(x) = 1·|x| + (x − 10)^2

We note the following: since f(x) is basically the sum of two convex functions, the minimum
point is expected to lie between the optimal points of the two individual functions. In our case
f_1(x) = 1·|x| = |x| has its minimum at 0, so it is quite easy to decide that the sign of the solution
point depends on the sign of the minimum point of the second function. Here
f_2(x) = (x − c)^2 = (x − 10)^2, and the minimum of this function is at x* = 10. So for the given
function f(x) = f_1(x) + f_2(x), the minimum should lie between 0 and 10, and the sign of the
solution of f(x) = |x| + (x − 10)^2 is positive:

x* = argmin_x |x| + (x − 10)^2


Figure: Minimum of f(x) = |x| + (x − 10)^2

To find the solution of f'(x) = 0, we take (d/dx)|x| = 1 (positive side). So

f'(x) = 1 + 2(x − 10) = 0
x* = 10 − 1/2 = 9.5

The plot of the function f(x) for different values of x is shown below.

Figure: Plot of f(x) = |x| + (x − 10)^2

The minimum occurs at x* = 9.5.
Example 2:

Consider the function f(x) = λ|x| + (x + c)^2. Here

x* = argmin_x λ|x| + (x + c)^2
f'(x) = 2(x + c) + λ sign(x) = 0

implies

x* = −c − (λ/2) sign(x*)

For example, with f(x) = |x| + (x + 10)^2 the optimal value x* should lie between −10 and 0,
hence the sign of x* is negative. To find the solution of f'(x) = 0, we put (d/dx)|x| = −1:

f'(x) = −1 + 2(x + 10) = 0
x* = −10 + 1/2 = −9.5

Figure: Plot of f(x) = |x| + (x + 10)^2

The minimum occurs at x* = −9.5.
Example 3:

Now consider a function where c is less than λ/2. For any c < λ/2, the optimal value can be
shown to be at zero. For example, consider the function f(x) = λ|x| + (x − c)^2 with λ = 1 and
c = 1/4, i.e.

f(x) = |x| + (x − 1/4)^2

Here we assume (d/dx)|x| = +1:

x* = c − (λ/2) sign(x*) = 1/4 − 1/2 = −1/4

We expect the solution to have positive sign, since the two individual optima are 0 and 1/4, but
on substituting we find that the sign of x* flips: x* = 1/4 − 1/2 = −1/4. This should not happen,
so we set x* = 0. This can in fact be verified by plotting the function.

Figure: Plot of f(x) = |x| + (x − 1/4)^2

From the graph it is clear that the minimum value occurs at x* = 0. This is true whenever
c < λ/2.

This leads to the important conclusion that for any c less than λ/2, i.e. c < λ/2, the optimal value
is at zero. Then a question arises: how can

x* = c − (λ/2) sign(x*)

be satisfied at x* = 0, when sign(0) is not defined? Note that at x = 0, (d/dx)|x| does not exist. To
counter this problem the concept of the sub-differential was introduced, according to which |x| at
x = 0 has a sub-differential, which is any number between −1 and +1 (its left and right
derivatives). Now if we take the sub-differential value to be 1/2, then

x* = 1/4 − (1/2)(1/2) = 0

is satisfied.

Figure: Diagram of the subdifferential

More details are available in later sections.

Optimization in 2-D Unconstrained Problems:
The 2-D minimization problems (of the type in the equation above) can be solved in the same
way as 1-D optimization problems. Consider the function

f(x_1, x_2) = λ|x_1| + (x_1 − c_1)^2 + λ|x_2| + (x_2 − c_2)^2

Here the function is separable, so finding the solution for each variable separately is possible. In
vector form, the function can be written as

f(x) = λ‖x‖_1 + ‖x − c‖_2^2 = λ‖x‖_1 + (x − c)^T (x − c)

where x = [x_1; x_2] and c = [c_1; c_2]. So

f(x) = λ( |x_1| + |x_2| ) + [x_1 − c_1, x_2 − c_2] [x_1 − c_1; x_2 − c_2]
The optimal solution is obtained as

x* = argmin_x λ‖x‖_1 + ‖x − c‖_2^2

As per calculus,

∇f(x) = 2(x − c) + λ sign(x) = 0

implies

x* = c − (λ/2) sign(x*)

or, componentwise,

x_1* = c_1 − (λ/2) sign(x_1*),  x_2* = c_2 − (λ/2) sign(x_2*)
An example with full solution is given below:

min_x λ‖x‖_1 + ‖x − c‖_2^2  with λ = 1 and c = [5, 3, −7, 1]^T

The solution follows componentwise from x_i* = c_i − (λ/2) sign(x_i*):

x_1* = 5 − 1/2 = 9/2
x_2* = 3 − 1/2 = 5/2
x_3* = −7 + 1/2 = −13/2
x_4* = 1 − 1/2 = 1/2

so x* = [9/2, 5/2, −13/2, 1/2]^T: each component of c is shrunk towards zero by λ/2.
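The componentwise rule above is the soft-thresholding (shrinkage) operator. A minimal MATLAB
sketch with our own function name (the ADMM codes later in this chapter call an identical
helper):

%% Code: soft thresholding (shrinkage) operator
function z = shrinkage(c, kappa)
% componentwise solution of min_x 2*kappa*|x| + (x - c)^2,
% i.e. shrink each entry of c towards zero by kappa
z = sign(c) .* max(abs(c) - kappa, 0);
end

For the example above, shrinkage([5; 3; -7; 1], 0.5) returns [4.5; 2.5; -6.5; 0.5].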
Transition to proximal methods:
What happens if we repeat the optimization problem with the point vector c replaced by the
solution vector x*? It is easy to imagine: the iterates will converge to the 0 vector, which is the
optimum point of the first function f_1(x) = λ‖x‖_1. The idea behind proximal methods is to take
an initial guess c of the solution of the non-differentiable function f_1(x) and solve a new
optimization problem of the form

f(x) = f_1(x) + ‖x − c‖_2^2

Proximal Algorithms
Proximal algorithms are state of the art optimization methods. These classes of algorithms serve
as a tool for solving convex optimization problems, generally non-smooth unconstrained
problems of large size. For every convex function it is possible to define a proximal operator.
(Proximal algorithms generalize the concept of the orthogonal projection of a point onto a set C. They
can be especially useful when dealing with a complicated-looking optimization scheme (which often arise
from machine learning and signal processing problems), where they help decompose the complicated
scheme into a sequence of much simpler optimization steps which, when updated in an iterative fashion,
provide a solution to the original problem. [Ref: http://www.meetup.com/NU-Machine-Learning-
Meetup/events/146037982/])

Definition
The proximal operator of a function f: R^n → R is defined by

prox_f(v) = argmin_x ( f(x) + ½ ‖x − v‖_2^2 )

Here v is the approximate minimal point of the function f(x) and hence assumed to be a known
vector. The proximal operator of a scaled function λf can be expressed as

prox_{λf}(v) = argmin_x ( f(x) + (1/2λ) ‖x − v‖_2^2 ),  where λ > 0

This is called the proximal operator of f with respect to λ, which is a trade-off parameter between
the two terms. In order to find the minimum of the function f(x), an additional strictly convex
function ‖x − v‖_2^2 is appended to the original f(x). By adding a strongly convex function we
are making the given sum of functions more convex than f alone, so it is easier to converge to the
solution. Now start from an initial point x_0, find the minimum point x* of the combined
function, substitute it again, and repeat the same until the iterates converge to the minimum of f.

Figure: Illustration of the proximal algorithm
Why proximal algorithms for solving optimization problems?
Compared with other optimization techniques, proximal algorithms have several advantages.
They can be fast, and it is easy to compute the proximal operator of most common functions.
Unlike Newton's method, they do not involve complex computations like the second derivative
or Hessian matrix. There are functions whose second derivative is difficult, or practically
impossible, to compute; proximal algorithms avoid any such computation, and they are simple to
derive and easy to understand.
Fixed Points:
The point x* minimizes the function f if and only if

x* = prox_f(x*)

which implies that x* is a fixed point of prox_f. For example, 0 is the fixed point of the proximal
operator of f(x) = |x|, and more generally any minimizer of a convex f is a fixed point of prox_f.

Now we learn a new duality for a function. This duality expresses a function in terms of an
optimization problem involving its dual function.



Legendre-Fenchel Transformation: the Concept of the Dual Function
Transformations are functions that map from one space to another space. The LF transform maps
the (x, f(x)) space to the (p, f*(p)) space. It uses a supremum. The LF transform of a continuous
function f: R → R can be defined as

f*(p) = sup_{x∈R} { px − f(x) }

The direct interpretation for a single-variable differentiable function f(x) is as follows. To find
f*(p) for a given p, draw a line with slope p through the origin on the same plot as f(x).

Further, find the location at which px − f(x) is maximum. In the figure, we can easily locate a
range of x where the above difference is positive; then locate precisely where this difference is
maximum. This maximum value is f*(p).

Figure: Illustration of the Legendre-Fenchel Transform
Another method:
For a single-variable differentiable function f(x), finding p and f*(p) reduces to the following.
Given p, first draw a line with slope p passing through the origin. Move this line (parallel to
itself) along the x-axis until it just leaves the plot of the function f(x). Note the y-intercept (on the
f(x) axis). The negative of the y-intercept is the required f*(p).

Figure: Legendre-Fenchel Transform

This visualization allows us to easily find the dual function, especially for norm functions, that
is, for f(x) = ‖x‖_1 or f(x) = ‖x‖_2.

Important Observation
For a differentiable function, corresponding to (x, f(x)) the dual variable p is the slope at x. That
is, p = f'(x) = (d/dx) f(x), and f*(p) is the negative of the intercept of the tangential line at
(x, f(x)). Now suppose f(x) is not differentiable at x. Then the concept of the sub-differential
helps us out: we find the sub-differential value p for which the negative of the y-intercept is
maximum. So for every (x, f(x)) there is a (p, f*(p)), and p is either a gradient or a sub-gradient
at x. For every p there is an x such that (x, f(x)) → (p, f*(p)), where

f*(p) = sup_x { xp − f(x) }

For the higher dimensional case it is possible to write

f*(p) = sup_{x∈R^n} { <x, p> − f(x) }

Here p = ∇f(x) if the function is differentiable; otherwise p is a sub-differential. Another
important fact is that, irrespective of whether f(x) is convex or not, the LF conjugate f*(p) is
always convex.
Proof: the Legendre-Fenchel conjugate is always convex

f*(z) = sup_{x∈R^n} { x^T z − f(x) }

For proving that the function is convex, we need to verify Jensen's inequality for a given
θ ∈ [0, 1]:

f*( θz_1 + (1 − θ)z_2 ) ≤ θ f*(z_1) + (1 − θ) f*(z_2)

The LHS can be expanded by using the definition of the conjugate:

f*( θz_1 + (1 − θ)z_2 ) = sup_{x∈R^n} { x^T ( θz_1 + (1 − θ)z_2 ) − f(x) }

We can rewrite the function f(x) as f(x) = θf(x) + (1 − θ)f(x) and substitute in the above:

f*( θz_1 + (1 − θ)z_2 ) = sup_{x∈R^n} { θ( x^T z_1 − f(x) ) + (1 − θ)( x^T z_2 − f(x) ) }

Let us write p_1 = θ( x^T z_1 − f(x) ) and p_2 = (1 − θ)( x^T z_2 − f(x) ). By the property of the
supremum,

sup_{x∈R^n} { p_1 + p_2 } ≤ sup_{x∈R^n} { p_1 } + sup_{x∈R^n} { p_2 }

Now

sup_{x∈R^n} { θ( x^T z_1 − f(x) ) } = θ f*(z_1)
sup_{x∈R^n} { (1 − θ)( x^T z_2 − f(x) ) } = (1 − θ) f*(z_2)

Therefore

f*( θz_1 + (1 − θ)z_2 ) ≤ θ f*(z_1) + (1 − θ) f*(z_2)

Hence we can say that, irrespective of the function f(x), its LF conjugate f*(z) is always convex.
Dual of the dual is the primal:
For a convex bounded function, the dual of the dual is the primal. It is possible to prove this by
using the Legendre-Fenchel transform.

Proof (by example):
Consider the function f(x) = x^2. We can define the dual function f*(p) as

f*(p) = sup_{x∈R} ( xp − x^2 )

Now, to find the maximum value, take the derivative of the above function and equate it to zero:

p − 2x = 0,  so  x = p/2

This is the location at which the function attains its maximum, and to find the function value at
this location we back-substitute in f*(p). That is,

f*(p) = xp − x^2 = p^2/2 − p^2/4 = p^2/4

So here the dual of the function f(x) = x^2 is f*(p) = p^2/4, and (x, f(x)) → (p, f*(p)), here
(x, x^2) → (p, p^2/4).

Now find the dual of f*(p), that is, f**. We can define f** as

f**(q) = sup_{p∈R} ( pq − f*(p) ) = sup_{p∈R} ( pq − p^2/4 )

To find the maximum, take the derivative with respect to p and equate to zero:

q − p/2 = 0,  which implies  p = 2q

Back-substituting, f**(q) = 2q·q − (2q)^2/4 = 2q^2 − q^2 = q^2 = f(q), so taking the dual twice
recovers the primal function.
Examples of Dual Functions:
Norm Function
Consider the norm function f(y) = |y|. It is a function which is non-differentiable at y = 0.

Figure: Diagram of the subdifferential

At y = 0 there are infinitely many lines whose slope lies between −1 and +1 such that the function
value always lies above them. The slopes of the set of lines that lie below the function f(y) are
called sub-differentials. So for the above function, the sub-differential at any point y < 0 is −1
and at any point y > 0 is +1. But at y = 0 the sub-differential becomes a closed set, the interval
[−1, 1]: if we draw a set of lines at y = 0 whose slopes are between −1 and +1, the lines always
lie below the function.
Now consider the norm functions.

L2 norm (two-variable case)
Let f(x) = ‖x‖_2. Let p be the dual variable (a vector of the same dimension as x). Let us find
f*(p) by visualizing the function. For any x = [x_1; x_2] ≠ 0,

p = ∇f(x) = [ x_1/√(x_1^2 + x_2^2) ; x_2/√(x_1^2 + x_2^2) ],  so  ‖p‖_2 = 1

This means the dual variable p for any x ≠ 0 is such that its L2 norm is 1. Also we note that every
tangential plane to this function at a point (x, f(x)) passes through the origin, implying that
f*(p) = 0.

Conversely, for any p with L2 norm 1, we have (x, f(x)) (in this case infinitely many). This set of
(x, f(x)) can be found by taking a plane with gradient p (with L2 norm 1) and then moving it
(parallel to itself) towards the surface till it just leaves the cone. All such planes pass through the
origin.

We note that at x = (0, 0) the gradient does not exist. However, here we have a sub-differential:
at x = (0, 0) it can be any vector p = [p_1; p_2] whose norm is at most 1. Thus the dual variable
corresponding to x = (0, 0) is any p with ‖p‖_2 ≤ 1, and altogether the dual variable p = [p_1; p_2]
has an L2 norm lying between 0 and 1.

What if we take a p with ‖p‖_2 > 1? What is the corresponding f*(p)? This can be obtained by
the following mental visualization: take a plane passing through the origin with gradient vector p,
‖p‖_2 > 1. Move this plane (parallel to itself) till it just leaves the function surface. We note that
we have to move infinitely far, and accordingly the intercept also moves to infinity. Thus we
arrive at the following dual formulation for the L2 norm function.
f*(p) = { 0   if ‖p‖_2 ≤ 1
        { +∞  otherwise

Now we go for an algebraic proof. Let f(x) = ‖x‖_2 and let p be the dual variable (a vector of the
same dimension as x).

f*(p) = max_{x∈R^n} { x^T p − ‖x‖_2 }
      = max_{x∈R^n} { ‖p‖_2 ‖x‖_2 cos θ − ‖x‖_2 }
      = max_{x∈R^n} { ‖x‖_2 ( ‖p‖_2 cos θ − 1 ) }

There are two cases to be considered.

1) ‖p‖_2 ≤ 1. The highest value possible under this condition for x^T p − ‖x‖_2 is zero. The
highest value occurs when p is aligned with x and its L2 norm is 1 (or when x = 0).

2) ‖p‖_2 > 1. Under this condition the highest value goes to infinity, by aligning x with p and
letting ‖x‖_2 go to infinity.

f*(p) = { 0   if ‖p‖_2 ≤ 1
        { +∞  otherwise
L1 norm
Let f(x) = ‖x‖_1 and let p be the dual variable (a vector of the same dimension as x).

f*(p) = max_{x∈R^n} { x^T p − ‖x‖_1 }

We consider two complementary cases.

1) ‖p‖_∞ ≤ 1. The highest value possible under this condition for x^T p − ‖x‖_1 is zero. The
highest value occurs when p = sign(x): then x^T p = ‖x‖_1 and the function value is zero; in all
other cases the function value is non-positive.

2) ‖p‖_∞ > 1. Under this condition the highest value goes to infinity, by letting the component of
x corresponding to an entry of p with |p_i| > 1 go to infinity.

f*(p) = { 0   if ‖p‖_∞ ≤ 1
        { +∞  otherwise
Example 1:
In practical applications we will often need f(x) = ‖x‖_1. Then, since f** = f,

‖x‖_1 = max_{‖p‖_∞ ≤ 1} { x^T p }

If x = [3; 4; 5], then ‖x‖_1 = 3 + 4 + 5 = 12, and

max_{‖p‖_∞ ≤ 1} { x^T p } = max_{‖p‖_∞ ≤ 1} { 3p_1 + 4p_2 + 5p_3 }

Here, to attain the maximum we have to choose p = sign(x); no other p with ‖p‖_∞ ≤ 1 gives this
value:

p = sign(x) = [1; 1; 1]

Therefore

max_{‖p‖_∞ ≤ 1} { x^T p } = 1·3 + 1·4 + 1·5 = 12 = ‖x‖_1

Example 2:
Similarly, for f(x) = ‖x‖_2:

‖x‖_2 = max_{‖p‖_2 ≤ 1} { x^T p }

If x = [3; 4; 5], then ‖x‖_2 = √(3^2 + 4^2 + 5^2) = √50, and

max_{‖p‖_2 ≤ 1} { x^T p } = max_{‖p‖_2 ≤ 1} { 3p_1 + 4p_2 + 5p_3 }

The solution for this is obtained by

p_i = x_i / √( x_1^2 + x_2^2 + x_3^2 ),  i = 1, 2, 3

so

max_{‖p‖_2 ≤ 1} { x^T p } = 3·(3/√50) + 4·(4/√50) + 5·(5/√50) = 50/√50 = √50 = ‖x‖_2
Moreau Decomposition
The Moreau decomposition is defined as

v = prox_f(v) + prox_{f*}(v),  (λ = 1)
prox_{f*}(v) = v − prox_f(v)

where f* is the dual of the function f, or the convex conjugate of the function f, defined as

f*(p) = sup_x { x^T p − f(x) }

For general λ > 0 the identity reads

v = prox_{λf}(v) + λ prox_{λ^{−1} f*}(v/λ)
λ prox_{λ^{−1} f*}(v/λ) = v − prox_{λf}(v)

Example 1 (for λ = 1):
Consider the function f(x) = x^2. We can define the dual function f*(p) as

f*(p) = sup_{x∈R} ( xp − x^2 )

Now, to find the maximum value, take the derivative of the above function and equate it to zero:

p − 2x = 0,  so  x = p/2

This is the location at which the function attains its maximum; back-substituting in f*(p),

f*(p) = (p/2)p − (p/2)^2 = p^2/4

i.e. f(x) = x^2 and f*(x) = x^2/4.

Now consider the Moreau identity with λ = 1:

v = prox_f(v) + prox_{f*}(v)

Now,

prox_f(v) = argmin_x ( f(x) + ½ (x − v)^2 ) = argmin_x ( x^2 + ½ (x − v)^2 )

Differentiate with respect to x and equate to zero:

2x + (x − v) = 0,  3x = v,  x = v/3

And

prox_{f*}(v) = argmin_x ( f*(x) + ½ (x − v)^2 ) = argmin_x ( x^2/4 + ½ (x − v)^2 )

Differentiate with respect to x and equate to zero:

x/2 + (x − v) = 0,  3x/2 = v,  x = 2v/3

Now, according to the Moreau identity,

v = prox_f(v) + prox_{f*}(v) = v/3 + 2v/3 = v

Hence proved.
Example 2 (for λ > 1):
Consider λ = 5, v = 10, f(x) = x^2 and f*(x) = x^2/4. According to the Moreau identity,

v = prox_{λf}(v) + λ prox_{λ^{−1} f*}(v/λ)

Now,

prox_{λf}(v) = argmin_x ( f(x) + (1/2λ)(x − v)^2 ) = argmin_x ( x^2 + (1/10)(x − 10)^2 )

Differentiate with respect to x and equate to zero:

2x + (1/5)(x − 10) = 0,  (11/5)x = 2,  x = 10/11

And

prox_{λ^{−1} f*}(v/λ) = argmin_x ( f*(x) + (λ/2)(x − v/λ)^2 )
                      = argmin_x ( x^2/4 + (5/2)(x − 2)^2 )

Differentiate with respect to x and equate to zero:

x/2 + 5(x − 2) = 0,  (11/2)x = 10,  x = 20/11

Now,

v = prox_{λf}(v) + λ prox_{λ^{−1} f*}(v/λ) = 10/11 + 5·(20/11) = 110/11 = 10

Hence proved.

Basic operations of proximal operators
Regularization:
If f(x) = φ(x) + (ρ/2)‖x − a‖_2^2, then

prox_{λf}(v) = prox_{λ̃φ}( (λ̃/λ) v + (λ̃ρ) a ),  where λ̃ = λ/(1 + λρ)
Proof:
We know that

prox_{λf}(v) = argmin_x ( f(x) + (1/2λ)‖x − v‖_2^2 )
             = argmin_x ( φ(x) + (ρ/2)‖x − a‖_2^2 + (1/2λ)‖x − v‖_2^2 )

Expanding the quadratic terms,

= argmin_x ( φ(x) + (ρ/2)( x^T x − 2a^T x + a^T a ) + (1/2λ)( x^T x − 2v^T x + v^T v ) )

Dropping the constant terms and collecting the rest,

= argmin_x ( φ(x) + ½ (ρ + 1/λ) x^T x − ( ρa + v/λ )^T x )

With λ̃ = λ/(1 + λρ) we have ρ + 1/λ = 1/λ̃, so completing the square gives

= argmin_x ( φ(x) + (1/2λ̃) ‖ x − λ̃( ρa + v/λ ) ‖_2^2 )
= prox_{λ̃φ}( (λ̃/λ) v + (λ̃ρ) a )

Hence proved.
Postcomposition:
If f(x) = αφ(x) + b, with α > 0, then

prox_{λf}(v) = prox_{αλφ}(v)

Proof:

prox_{λf}(v) = argmin_x ( αφ(x) + b + (1/2λ)‖x − v‖_2^2 )
             = argmin_x ( φ(x) + (1/2αλ)‖x − v‖_2^2 )    (dividing by α, dropping the constant b)
             = prox_{αλφ}(v)

Hence proved.
Affine addition:
If f(x) = φ(x) + a^T x + b, then

prox_{λf}(v) = prox_{λφ}( v − λa )

Proof:

prox_{λf}(v) = argmin_x ( φ(x) + a^T x + b + (1/2λ)‖x − v‖_2^2 )
             = argmin_x ( φ(x) + (1/2λ)‖x − (v − λa)‖_2^2 )   (completing the square, dropping constants)
             = prox_{λφ}( v − λa )

Hence proved.
Proximal Gradient Method
Consider the problem

min_x f(x) + g(x)

where f: R^n → R is differentiable and g: R^n → R is non-differentiable. For solving these types
of problems, we first take a gradient step on the smooth function f(x) to obtain a temporary point,
which is then used to evaluate the proximal operator of the second function g(x). The proximal
gradient method is given as

x_{k+1} = prox_{λ_k g}( x_k − λ_k ∇f(x_k) )

where λ_k > 0 is a step-size (control) parameter. So it is important to choose λ properly to reach
the optimum point. If it is not chosen appropriately, the iterates will oscillate between some
values and never reach the minimum of the function. Here comes the concept of the Lipschitz
constant: if ∇f has Lipschitz constant L, any fixed step 0 < λ < 1/L works. But there are also
cases where we do not know the Lipschitz constant; in such a case we go for a line search
method. [18]
Line search algorithm
Algorithm:
Given x_k, λ_{k−1}, and parameter β ∈ (0, 1).
Let λ := λ_{k−1}
repeat
  1. Let z := prox_{λg}( x_k − λ∇f(x_k) )
  2. break if f(z) ≤ f̂_λ(z, x_k)
  3. Update λ := βλ
return λ_k := λ, x_{k+1} := z

Figure: Proximal Gradient Algorithm

Here we first take the current point x_k for the function f(x) and use it to compute the gradient
step x_k − λ∇f(x_k). This point then acts as v in the proximal operator of the second function
g(x):

z = prox_{λg}( x_k − λ∇f(x_k) )

But here the selection of λ is crucial. In order to verify it, we define an additional function
f̂_λ(z, x_k) at z, which acts as an upper bound for the function f(z). We then check
f(z) ≤ f̂_λ(z, x_k) to validate λ, or in other words to make a correct parabola fit. If it is satisfied,
then z minimizes the majorizing model of the combined function f(x) + g(x) and is taken as the
next iterate; otherwise λ is updated to make a correct parabola fit. The update is λ := βλ, and a
typical value for the parameter β is ½.

Note: f̂_λ(z, x_k) is a majorization function over f at x_k. The upper-bound function is defined as

f̂_λ(x, y) = f(y) + ∇f(y)^T (x − y) + (1/2λ)‖x − y‖_2^2,  λ > 0

For a fixed y, this function is convex, satisfies f̂_λ(x, x) = f(x), and is an upper bound on f when
0 < λ ≤ 1/L.
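For the common case g(x) = τ‖x‖_1 the proximal step is exactly the shrinkage operator seen
earlier, and the method becomes iterative soft thresholding. A minimal sketch for
min ½‖Ax − b‖_2^2 + τ‖x‖_1 with a fixed step λ = 1/‖A‖_2^2 (the synthetic data below is our
own illustration):

%% Code: proximal gradient method for 0.5*||Ax-b||^2 + tau*||x||_1 (sketch)
A = randn(20,50);  xtrue = zeros(50,1);  xtrue([3 12 30]) = [2; -1; 1.5];
b = A*xtrue;  tau = 0.1;
lambda = 1/norm(A)^2;                        % fixed step; norm(A) is the spectral norm
x = zeros(size(A,2),1);
for k = 1:200
    v = x - lambda*(A'*(A*x - b));           % gradient step on the smooth part
    x = sign(v).*max(abs(v) - lambda*tau, 0);% prox of tau*||.||_1 (shrinkage)
end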
Primal-dual proximal method
A general class of functions representing energy minimization problems can be written as

min_{x∈X} { F(Kx) + G(x) }

where F and G are proper convex functions, K is a matrix and x is a vector in R^n. By the
definition of the Legendre-Fenchel transformation,

F(Kx) = max_y { <Kx, y> − F*(y) } = max_y { <x, K^T y> − F*(y) }

Now rewrite the problem by replacing F(Kx) with its convex conjugate:

min_{x∈X} { F(Kx) + G(x) } = min_x max_y { <Kx, y> − F*(y) + G(x) }

Consider

max_y { <Kx, y> − F*(y) } = −min_y { F*(y) − <Kx, y> }

Applying the proximal method with y = y_k and x̄ = x̄_k, we obtain

y_{k+1} = argmin_y { F*(y) − <Kx̄_k, y> + (1/2σ)‖y − y_k‖_2^2 }
        = argmin_y { F*(y) + (1/2σ)‖y − ( y_k + σKx̄_k )‖_2^2 }
        = prox_{σF*}( y_k + σKx̄_k )

Considering

min_x { <Kx, y> + G(x) } = min_x { <x, K^T y> + G(x) }

and applying the proximal method with x_k and y_{k+1} given,

x_{k+1} = argmin_x { <x, K^T y_{k+1}> + G(x) + (1/2τ)‖x − x_k‖_2^2 }
        = argmin_x { G(x) + (1/2τ)‖x − ( x_k − τK^T y_{k+1} )‖_2^2 }
        = prox_{τG}( x_k − τK^T y_{k+1} )

We also apply an acceleration (extrapolation) to the x update by

x̄_{k+1} = x_{k+1} + θ( x_{k+1} − x_k )

Putting all the parts together:

y_{k+1} = prox_{σF*}( y_k + σKx̄_k )
x_{k+1} = prox_{τG}( x_k − τK^T y_{k+1} )
x̄_{k+1} = x_{k+1} + θ( x_{k+1} − x_k )
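A minimal MATLAB sketch of this loop (our own variable names; proxFstar and proxG stand for
user-supplied proximal evaluations of F* and G, and the step sizes are chosen so that
σ τ ‖K‖^2 ≤ 1):

%% Code: primal-dual proximal iteration (sketch)
% proxFstar(v, sigma) and proxG(v, tau) are user-supplied prox handles
normK = norm(K);  sigma = 1/normK;  tau = 1/normK;  theta = 1;
x = zeros(size(K,2),1);  y = zeros(size(K,1),1);  xbar = x;
for k = 1:500
    y    = proxFstar(y + sigma*(K*xbar), sigma);  % dual step
    xnew = proxG(x - tau*(K'*y), tau);            % primal step
    xbar = xnew + theta*(xnew - x);               % extrapolation
    x    = xnew;
end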

ADMM: Alternating Direction Method of Multipliers
ADMM is a simple and powerful iterative algorithm for convex optimization problems. For
multivariable problems it can be dramatically faster than conventional methods, and it handles
linear and quadratic programming in a single framework.

Form I:
Consider the unconstrained problem

min_x f(x) + g(x)

where f: R^n → R and g: R^n → R are convex functions. The alternating direction method of
multipliers (ADMM) is defined as

x_{k+1} := prox_{λf}( z_k − u_k )
z_{k+1} := prox_{λg}( x_{k+1} + u_k )
u_{k+1} := u_k + x_{k+1} − z_{k+1}

where k is an iteration counter.
Explanation:
Here we change the problem by introducing a new variable z and also a constraint. The variable
has been split into two variables x and z; the number of variables is doubled, but the optimization
problem remains the same. We are basically converting an unconstrained problem into a
constrained one and solving it by using the augmented Lagrangian. The optimization problem
becomes

min f(x) + g(z)
subject to x − z = 0

and this is equivalent to minimizing f(x) + g(x). We can write the augmented Lagrangian form
associated with the above problem as

L_ρ(x, z, y) = f(x) + g(z) + y^T(x − z) + (ρ/2)‖x − z‖_2^2

where ρ > 0 is a parameter and y ∈ R^n is the Lagrangian dual variable. It is called the
augmented Lagrangian in the sense that, in addition to the Lagrangian dual term, another term
(ρ/2)‖x − z‖_2^2 is added. This is a strongly convex function, so the addition of this extra term
increases the convexity of our original problem and speeds convergence to the solution. So now
there are three variables: x, y and z.
Now the ADMM can be expressed as,
x_{k+1} := argmin_x L_ρ( x, z_k, y_k )
z_{k+1} := argmin_z L_ρ( x_{k+1}, z, y_k )
y_{k+1} := y_k + ρ( x_{k+1} − z_{k+1} )

In each step we update x and z using the previously updated values of the primal and dual
variables; y_{k+1} is the dual (acceleration) update.
Now we reduce ADMM to its proximal version. Consider the augmented Lagrangian form; z and
y are initialized (say randomly), and the Lagrangian is minimized to find the x update:

x_{k+1} = argmin_x ( f(x) + <y_k, x> + (ρ/2)‖x − z_k‖_2^2 )

Similarly,

z_{k+1} = argmin_z ( g(z) − <y_k, z> + (ρ/2)‖x_{k+1} − z‖_2^2 )
y_{k+1} := y_k + ρ( x_{k+1} − z_{k+1} )

Pull the linear terms into the quadratic ones to get

x_{k+1} = argmin_x ( f(x) + (ρ/2)‖x − z_k + y_k/ρ‖_2^2 )
z_{k+1} = argmin_z ( g(z) + (ρ/2)‖x_{k+1} − z + y_k/ρ‖_2^2 )
y_{k+1} = y_k + ρ( x_{k+1} − z_{k+1} )

Setting u_k = y_k/ρ and λ = 1/ρ:
x_{k+1} = argmin_x ( f(x) + (1/2λ)‖x − ( z_k − u_k )‖_2^2 )
z_{k+1} = argmin_z ( g(z) + (1/2λ)‖z − ( x_{k+1} + u_k )‖_2^2 )
u_{k+1} = u_k + x_{k+1} − z_{k+1}

From the definition of the proximal operator,

prox_{λf}(v) = argmin_x ( f(x) + (1/2λ)‖x − v‖_2^2 )

we can therefore rewrite the above ADMM form using proximal operators as

x_{k+1} := prox_{λf}( z_k − u_k )
z_{k+1} := prox_{λg}( x_{k+1} + u_k )
u_{k+1} := u_k + x_{k+1} − z_{k+1}
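In code, the scheme is only a few lines once the two proximal operators are available. A minimal
sketch (proxf and proxg are user-supplied function handles and n is the problem dimension, not
library routines):

%% Code: proximal form of ADMM (sketch)
lam = 1;  x = zeros(n,1);  z = zeros(n,1);  u = zeros(n,1);
for k = 1:1000
    x = proxf(z - u, lam);         % x-update
    z = proxg(x + u, lam);         % z-update
    u = u + x - z;                 % scaled dual update
    if norm(x - z) < 1e-8, break; end
end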
Form II:
Consider the constrained optimization problem

min f(x)
subject to x ∈ C

where f and C are convex. This problem is rewritten in ADMM form as

min f(x) + g(z)
subject to x − z = 0

where g is the indicator function of C, defined as

g(z) = { 0   if z ∈ C
       { +∞  otherwise

Here the constraint is absorbed into an indicator function of a different variable, and the variables
are made equal with a new constraint. Now consider the (scaled) augmented Lagrangian
associated with the minimization problem:

L_ρ(x, z, u) = f(x) + g(z) + (ρ/2)‖x − z + u‖_2^2

According to the definition of the indicator function, g(z) = 0 means we are actually minimizing
the original function f(x). So as the algorithm proceeds, at every stage we are forced to choose a
z which is an element of C; if not, then g(z) = +∞ and further optimization is not possible.

Now find the updates for x and z. When updating x, choose z and u initially and minimize the
Lagrangian:

x_{k+1} = argmin_x ( f(x) + (ρ/2)‖x − z_k + u_k‖_2^2 )

Now find the update for z. If we set z = x + u, then (ρ/2)‖x − z + u‖_2^2 becomes zero, but then
g(z) may be +∞. Therefore we should be careful while updating z, choosing z such that
g(z) = 0. Here the concept of projection is used: in order to choose the z which minimizes the
Lagrangian, we project. The update of z is obtained as

z_{k+1} = Π_C( x_{k+1} + u_k )

i.e. project x_{k+1} + u_k onto the set C, and

u_{k+1} = u_k + x_{k+1} − z_{k+1}

If x − z = 0 then u_{k+1} = u_k.

Linear and Quadratic Programming:
The standard form of quadratic programming is defined as

min ( ½ x^T P x + q^T x )
subject to Ax = b,  x ≥ 0

where x ∈ R^n. There are two constraints. Here it is assumed that P ∈ S^n_+ (symmetric positive
semidefinite). When P = 0 the above problem reduces to the standard form of linear
programming. Now express the problem in ADMM form as

min f(x) + g(z)
subject to x − z = 0

One constraint is kept with f(x), the second constraint is absorbed by the function g, and a new
third constraint is introduced. Here

f(x) = ½ x^T P x + q^T x,  dom f = { x | Ax = b }

f(x) is the original objective function with restricted domain. We introduce a new indicator
function g(z) of the nonnegative orthant R^n_+; this indicator function makes sure that z is
always greater than or equal to zero. Now consider the augmented Lagrangian associated with
the minimization problem:

L_ρ(x, z, u) = f(x) + g(z) + (ρ/2)‖x − z + u‖_2^2
Now find the updates for x and z. For updating x, choose z and u initially and minimize the
Lagrangian:

x_{k+1} = argmin_x ( f(x) + (ρ/2)‖x − z_k + u_k‖_2^2 )

But here the x update is an equality constrained least squares problem with optimality
conditions:

x_{k+1} = argmin_{x: Ax=b} ( ½ x^T P x + q^T x + (ρ/2)‖x − z_k + u_k‖_2^2 )

Now take the Lagrangian corresponding to the above problem:

L(x, v) = ½ x^T P x + q^T x + (ρ/2)‖x − z_k + u_k‖_2^2 + v^T( Ax − b )

Differentiate with respect to x and v, then equate to zero:

∂L/∂x = Px + q + ρ( x − z_k + u_k ) + A^T v = 0
∂L/∂v = Ax − b = 0

The first equation gives

( P + ρI ) x_{k+1} + A^T v = −q + ρ( z_k − u_k )

and the second gives A x_{k+1} = b. Rewriting as a linear system,

[ P + ρI   A^T ] [ x_{k+1} ]   [ −q + ρ( z_k − u_k ) ]
[ A        0   ] [ v       ] = [ b                    ]

Now the update of z is obtained as

z_{k+1} = ( x_{k+1} + u_k )_+   (projection onto the nonnegative orthant)

and

u_{k+1} = u_k + x_{k+1} − z_{k+1}

Therefore the scaled version of ADMM consists of the iterations

x_{k+1} = argmin_x ( f(x) + (ρ/2)‖x − z_k + u_k‖_2^2 )
z_{k+1} = ( x_{k+1} + u_k )_+
u_{k+1} = u_k + x_{k+1} − z_{k+1}

%% Code: Linear Programming

function [z, history] = linprog(c, A, b, rho, alpha)
% [x, history] = linprog(c, A, b, rho, alpha);
% Solves min c'x s.t. Ax = b, x >= 0 via ADMM. rho is the augmented
% Lagrangian parameter and alpha the over-relaxation parameter.

% Global constants and defaults
QUIET    = 0;
MAX_ITER = 1000;
ABSTOL   = 1e-4;
RELTOL   = 1e-2;
history  = [];

% Data preprocessing
[m, n] = size(A);

% ADMM solver
x = zeros(n,1);
z = zeros(n,1);
u = zeros(n,1);

for k = 1:MAX_ITER

    % x-update: solve the KKT system derived above
    tmp = [ rho*eye(n), A'; A, zeros(m) ] \ [ rho*(z - u) - c; b ];
    x = tmp(1:n);

    % z-update with relaxation; max(., 0) projects onto x >= 0
    zold  = z;
    x_hat = alpha*x + (1 - alpha)*zold;
    z = max(x_hat + u, 0);

    % u-update (scaled dual variable)
    u = u + (x_hat - z);
end
end

%% Code: Quadratic Programming

function [z, history] = quadprog(P, q, r, lb, ub, rho, alpha)
% [x, history] = quadprog(P, q, r, lb, ub, rho, alpha)
% Solves min 0.5*x'Px + q'x + r s.t. lb <= x <= ub via ADMM.

QUIET    = 0;
MAX_ITER = 1000;
ABSTOL   = 1e-4;
RELTOL   = 1e-2;
history  = [];

% Data preprocessing
n = size(P,1);

% ADMM solver
x = zeros(n,1);
z = zeros(n,1);
u = zeros(n,1);

for k = 1:MAX_ITER

    % x-update: solve (P + rho*I) x = rho*(z - u) - q via a cached
    % Cholesky factorization R'*R = P + rho*I
    if k == 1
        R = chol(P + rho*eye(n));
    end
    x = R \ (R' \ (rho*(z - u) - q));

    % z-update with relaxation: clip onto the box [lb, ub]
    zold  = z;
    x_hat = alpha*x + (1-alpha)*zold;
    z = min(ub, max(lb, x_hat + u));

    % u-update
    u = u + (x_hat - z);

end
end

L1-norm problems:
We now describe simple but important problems involving the l1 norm and their solution via the
ADMM algorithm.

Basis Pursuit:
Basis pursuit is the equality constrained l1 minimization problem

minimize ‖x‖_1
subject to Ax = b

with variable x ∈ R^n, data A ∈ R^{m×n} and b ∈ R^m, with m < n.
Basis pursuit is often used as a heuristic for finding a sparse solution to an underdetermined
system of linear equations. It plays a central role in modern statistical signal processing,
particularly the theory of compressed sensing. [ ]

We can rewrite the above basis pursuit problem in ADMM format as

min f(x) + ‖z‖_1
subject to x − z = 0

where f(x) is the indicator function of { x ∈ R^n | Ax = b }. Now, using the ADMM algorithm,
we find the updates for x_{k+1}, z_{k+1} and u_{k+1}.
Consider the augmented Lagrangian corresponds to the minimization problemas,

2
1 2
( , , ) ( )
2
L x z u f x z x z u


= + + +
First we have to find the update for x . Since ( ) f x is an indicator function of
{ }
|
n
x R Ax b = ,
we are going for projection. Here the x update involves solving a linearly constrained minimum
Euclidean norm problem and hence for finding the update of x , projecting on to
{ }
|
n
x R Ax b = .
ie,

( )
1 k k k
x z u
+
=
Where is the projection onto
{ }
|
n
x R Ax b = ie, projecting
( )
k k k
x z u = ontoAx b = . And
will get the update as,
110

( )
( )
( ) ( )
1 1
1 k T T k k T T
x I A AA A z u A AA b

+
= +
Proof:
We have to find an
1 k
x
+
which should satisfiesAx b = . That is
1 k
x
+
should be the closest vector in
the affine space defined by Ax b =


Figure: Projection onto Ax b =
ie,
1 k k
x x e
+
=
Let,
k k k
x z u =
Therefore
1 k
Ax b
+
=
( )
k
A x e b =

k
Ax Ae b =

k
Ae Ax b =
Let,
T
e A v =
Now,
T k
AA v Ax b =
( ) ( )
1
T k
v AA Ax b

=
( ) ( )
1 1
( )
T k T
v AA Ax AA b

=
Since
( ) ( )
1 1
( )
T T T k T
e A v A AA Ax AA b

(
= =
(


( ) ( )
1 1
( )
T T k T T
e A AA Ax A AA b

=
Now we can find
1 k
x
+
,
111

1 k k
x x e
+
=
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )
1 1
1
1 1
k k k T T k k T T
k k T T k k T T
x z u A AA A z u A AA b
z u A AA A z u A AA b

+

(
=
(

= +
\
Therefore,
( )
( )
( ) ( )
1 1
1 k T T k k T T
x I A AA A z u A AA b

+
= +
Now find the update for z and u as,
( )
1 1
1
k k k
z S x u

+ +
= +

and
1 1 1 k k k k
u u x z
+ + +
= +

%% Code: Basis Pursuit
function [z, history] = basis_pursuit(A, b, rho, alpha)
% [z, history] = basis_pursuit(A, b, rho, alpha)

% Global constants and defaults
QUIET    = 0;
MAX_ITER = 1000;
ABSTOL   = 1e-4;
RELTOL   = 1e-2;

% Data preprocessing
[m, n] = size(A);

% ADMM solver
x = zeros(n,1);
z = zeros(n,1);
u = zeros(n,1);

% precompute static matrices for the x-update (projection onto Ax = b)
AAt = A*A';
P = eye(n) - A' * (AAt \ A);
q = A' * (AAt \ b);

for k = 1:MAX_ITER

    % x-update: project z - u onto the affine set {x | Ax = b}
    x = P*(z - u) + q;

    % z-update with relaxation: soft thresholding
    zold  = z;
    x_hat = alpha*x + (1 - alpha)*zold;
    z = shrinkage(x_hat + u, 1/rho);

    % u-update
    u = u + (x_hat - z);
end
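The z-updates above and in the lasso solver below call a shrinkage routine that is not listed in these notes. A minimal sketch, assuming the usual elementwise soft thresholding operator S_\kappa from the derivations, is:

function z = shrinkage(a, kappa)
% elementwise soft thresholding: S_kappa(a) = (a - kappa)_+ - (-a - kappa)_+
z = max(0, a - kappa) - max(0, -a - kappa);
end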

Lasso (Least Absolute Shrinkage and Selection Operator)
Consider the l1-regularized linear regression, also called the lasso problem,

min \frac{1}{2} \| Ax - b \|_2^2 + \lambda \| x \|_1

Rewrite the lasso problem in ADMM format,

min \frac{1}{2} \| Ax - b \|_2^2 + \lambda \| z \|_1   subject to x - z = 0

Write the corresponding augmented Lagrangian form as

L_\rho ( x, z, y ) = \frac{1}{2} \| Ax - b \|_2^2 + \lambda \| z \|_1 + y^T ( x - z ) + \frac{\rho}{2} \| x - z \|_2^2

In order to find the update for x, minimize the Lagrangian with respect to x:

x^{k+1} = \arg\min_x L_\rho ( x, z^k, y^k )
        = \arg\min_x \frac{1}{2} \| Ax - b \|_2^2 + y^{kT} x + \frac{\rho}{2} \| x - z^k \|_2^2
        = \arg\min_x \frac{1}{2} \left( x^T A^T A x - 2 x^T A^T b + b^T b \right) + y^{kT} x + \frac{\rho}{2} \| x - z^k \|_2^2

Differentiate with respect to x and equate to zero,

A^T A x - A^T b + y^k + \rho ( x - z^k ) = 0
( A^T A + \rho I ) x^{k+1} = A^T b + \rho z^k - y^k
x^{k+1} = ( A^T A + \rho I )^{-1} ( A^T b + \rho z^k - y^k )

Now find the update corresponding to z:

z^{k+1} = \arg\min_z L_\rho ( x^{k+1}, z, y^k )
        = \arg\min_z \lambda \| z \|_1 - y^{kT} z + \frac{\rho}{2} \| x^{k+1} - z \|_2^2
        = \arg\min_z \lambda \| z \|_1 + \frac{\rho}{2} \left\| z - x^{k+1} - \frac{ y^k }{ \rho } \right\|_2^2

This implies

z^{k+1} = S_{\lambda / \rho} \left( x^{k+1} + \frac{ y^k }{ \rho } \right)

And,

y^{k+1} = y^k + \rho ( x^{k+1} - z^{k+1} )

Therefore the ADMM algorithm corresponding to the lasso problem is

x^{k+1} = ( A^T A + \rho I )^{-1} ( A^T b + \rho z^k - y^k )
z^{k+1} = S_{\lambda / \rho} \left( x^{k+1} + \frac{ y^k }{ \rho } \right)
y^{k+1} = y^k + \rho ( x^{k+1} - z^{k+1} )

%% Code: LASSO (Least Absolute Shrinkage and Selection Operator)
function [z, history] = lasso(A, b, lambda, rho, alpha)
% lasso: Solve the lasso problem via ADMM

% Global constants and defaults
QUIET    = 0;
MAX_ITER = 1000;
ABSTOL   = 1e-4;
RELTOL   = 1e-2;

% Data preprocessing
[m, n] = size(A);

% save a matrix-vector multiply
Atb = A'*b;

% ADMM solver
x = zeros(n,1);
z = zeros(n,1);
u = zeros(n,1);

% cache the Cholesky factorization used by the x-update
if( m >= n )    % if skinny
    L = chol(A'*A + rho*eye(n), 'lower');
else            % if fat
    L = chol(eye(m) + (1/rho)*(A*A'), 'lower');
end
U = L';

for k = 1:MAX_ITER

    % x-update
    q = Atb + rho*(z - u);    % temporary value
    if( m >= n )              % if skinny
        x = U \ (L \ q);
    else                      % if fat: matrix inversion lemma
        x = q/rho - (A'*(U \ ( L \ (A*q) )))/rho^2;
    end

    % z-update with relaxation
    zold  = z;
    x_hat = alpha*x + (1 - alpha)*zold;
    z = shrinkage(x_hat + u, lambda/rho);

    % u-update
    u = u + (x_hat - z);
end



Legendre-Fenchel Transformations - Concept of Dual Function
Applications to computer vision
ROF model:
Consider a standard ROF model,

\min_{u \in X} \| \nabla u \|_1 + \frac{ \lambda }{ 2 } \| u - g \|_2^2

The convex conjugate of \| \cdot \| is an indicator function

\delta_P ( p ) = \begin{cases} 0 & \text{if } \| p \| \le 1 \\ \infty & \text{otherwise} \end{cases}

That is, \| \cdot \|^* = \delta_P ( p ). Therefore we can write \| \nabla u \|_1 as

\| \nabla u \|_1 = \max_{ p : \| p \| \le 1 } \left( p^T \nabla u - \delta_P ( p ) \right)

i.e.,

\| \nabla u \|_1 = \max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle - \delta_P ( p ) \right)

Therefore the ROF function can be written in min-max form as

\min_{u \in X} \max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 - \delta_P ( p ) \right)

The above objective function contains the primal form (represented in min, with u as variable) and the dual form (represented in max, with p as variable).
The update equations need to be found for u and p.
Computing the update terms by adding the proximal term:
1. Update equation for p:

\max_{ p : \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

To add the proximal term, the max operation needs to be converted to minimization form,

\min_{ p : \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ 1 }{ 2 \sigma } \| p - p^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. p and equate to 0:

- \nabla u + \frac{ 1 }{ \sigma } ( p - p^0 ) = 0

In order to generalize, p^0 \to p^n and p \to p^{n+1}, i.e.,

- \nabla u^n + \frac{ 1 }{ \sigma } ( p^{n+1} - p^n ) = 0
or   p^{n+1} = p^n + \sigma \nabla u^n

Projecting onto the space \| p \| \le 1 implies

p^{n+1} = \frac{ p^n + \sigma \nabla u^n }{ \max \left( 1, | p^n + \sigma \nabla u^n | \right) }

2. Update equation for u:

\min_{u \in X} \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 \right)

\langle p, \nabla u \rangle = - \langle u, \mathrm{div}\, p \rangle

The minimization term becomes

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 \right)

Adding the proximal term,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 + \frac{ 1 }{ 2 \tau } \| u - u^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. u and equate to 0:

- \mathrm{div}\, p + \lambda ( u - g ) + \frac{ 1 }{ \tau } ( u - u^0 ) = 0

In order to generalize, u^0 \to u^n and u \to u^{n+1}, i.e.,

- \mathrm{div}\, p^{n+1} + \lambda ( u^{n+1} - g ) + \frac{ 1 }{ \tau } ( u^{n+1} - u^n ) = 0
( 1 + \tau \lambda ) u^{n+1} = u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g
u^{n+1} = \frac{ u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g }{ 1 + \tau \lambda }

Therefore the update equations are

p^{n+1} = \frac{ p^n + \sigma \nabla u^n }{ \max \left( 1, | p^n + \sigma \nabla u^n | \right) }

u^{n+1} = \frac{ u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g }{ 1 + \tau \lambda }


%% Code: Image denoising based on the ROF model

function primal_dual_denoising_ROF (img_src, lambda, max_iter)

noise_image = img_src;                    % original noisy input image
if (size (noise_image,3) == 3)
    noise_image = rgb2gray (noise_image); % convert the noisy image to grayscale
end
[n_row, n_col] = size (noise_image);
N = n_row * n_col;

% generate gradient matrix for the nabla operator.
% Vectorize the original image (n_row by n_col) to a vector with
% n_row*n_col elements via row-wise scanning, i.e. associate the (i,j)
% element of the image matrix with the (i-1)*n_col+j element of the vector.

noise_image = reshape (noise_image', N, 1);
nabla = make_nabla (n_col, n_row);

% initialization
u = noise_image;      % primal variable u initially taken as the input noisy image
p = zeros (2*N, 1);   % dual variable p initially set to zero
head_u = u;

% algorithm parameters
L = sqrt (8);         % Lipschitz constant (L)
tau = 1/L;
sigma = 1/L;
lambda = 30;          % overrides the function argument in this demo
gamma = 0.7*lambda;
max_iter = 100;       % overrides the function argument in this demo

nabla_t = nabla';

for n_processing = 1:max_iter
    old_u = u;
    % update dual: p = proj(p + sigma*nabla*head_u)
    temp_p = nabla * head_u * sigma + p;
    sqrt_p = sqrt (temp_p(1:N).^2 + temp_p(N+1:2*N).^2);
    sqrt_p_ = [sqrt_p; sqrt_p];
    p = temp_p./(max(1, sqrt_p_));
    % update primal
    var = tau * lambda/ (1 + tau * lambda);
    u = (1 - var) * old_u + var * (noise_image + nabla_t * p/lambda);
    theta = 1/sqrt (1 + 2*gamma*tau);

    tau = tau*theta;
    sigma = sigma/theta;

    % calculate head_u (extrapolation step)
    head_u = u + theta * (u - old_u);
    drawnow;
end


Figure: Result of denoising using the ROF model (lambda = 30)
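The denoising and deconvolution codes in this section all rely on a make_nabla routine that is not listed in these notes. A minimal sketch, assuming the row-wise vectorization described in the comments above and simple forward differences with Neumann (zero-derivative) boundary handling, is:

function nabla = make_nabla(n_col, n_row)
% sparse 2N x N forward-difference operator: rows 1..N give the
% x-derivative, rows N+1..2N the y-derivative of the vectorized image
N  = n_row * n_col;
e  = ones(N, 1);
Dx = spdiags([-e e], [0 1], N, N);       % horizontal differences
Dx(n_col:n_col:N, :) = 0;                % no difference across row ends
Dy = spdiags([-e e], [0 n_col], N, N);   % vertical differences
Dy(N-n_col+1:N, :) = 0;                  % last image row: zero derivative
nabla = [Dx; Dy];
end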

Huber-ROF model:
The objective function is

\min_{u \in X} \| \nabla u \|_h + \frac{ \lambda }{ 2 } \| u - g \|_2^2

where \| \cdot \|_h is the Huber norm, defined as

| x |_h = \begin{cases} \frac{ x^2 }{ 2 \alpha } & \text{if } | x | \le \alpha \\ | x | - \frac{ \alpha }{ 2 } & \text{if } | x | > \alpha \end{cases}

To find the dual form of the Huber norm, consider a single variable function

f ( x ) = \frac{ x^2 }{ 2 \alpha }

f^* ( p ) = \max_x \left( x p - \frac{ x^2 }{ 2 \alpha } \right)

Differentiating w.r.t. x, we get p - \frac{ x }{ \alpha } = 0, which implies x = \alpha p for | p | \le 1. Substituting this, we get

f^* ( p ) = \frac{ \alpha p^2 }{ 2 }, for | p | \le 1

Consider

f ( x ) = | x | - \frac{ \alpha }{ 2 }

f^* ( p ) = \max_x \left( x p - | x | + \frac{ \alpha }{ 2 } \right)

Differentiating w.r.t. x we get | p | = 1, and

f^* ( p ) = \frac{ \alpha }{ 2 }, for | p | = 1

Now for the multivariable problem the objective in min-max form can be written as

\min_{u \in X} \max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 - \frac{ \alpha }{ 2 } \| p \|^2 - \frac{ \alpha }{ 2 } \right)

Note: the constant term \frac{ \alpha }{ 2 } can be dropped from the objective function, as it doesn't affect the optimization problem:

\min_{u \in X} \max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 - \frac{ \alpha }{ 2 } \| p \|^2 \right)

Computing the update terms by adding the proximal term:
1. Update equation for p:

\max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle - \frac{ \alpha }{ 2 } \| p \|^2 \right)

Changing to min form,

\min_{ p : \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ \alpha }{ 2 } \| p \|^2 \right)

Adding the proximal term,

\min_{ p : \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ \alpha }{ 2 } \| p \|^2 + \frac{ 1 }{ 2 \sigma } \| p - p^0 \|^2 \right)

To find the solution, differentiate w.r.t. p and equate to 0; we get

- \nabla u + \alpha p + \frac{ 1 }{ \sigma } ( p - p^0 ) = 0

In order to generalize, p^0 \to p^n and p \to p^{n+1}:

- \nabla u^n + \alpha p^{n+1} + \frac{ 1 }{ \sigma } ( p^{n+1} - p^n ) = 0
- \sigma \nabla u^n + \sigma \alpha p^{n+1} + p^{n+1} - p^n = 0
( 1 + \sigma \alpha ) p^{n+1} = p^n + \sigma \nabla u^n
p^{n+1} = \frac{ p^n + \sigma \nabla u^n }{ 1 + \sigma \alpha }

Projecting onto the space \| p \| \le 1,

p^{n+1} = \frac{ \dfrac{ p^n + \sigma \nabla u^n }{ 1 + \sigma \alpha } }{ \max \left( 1, \left| \dfrac{ p^n + \sigma \nabla u^n }{ 1 + \sigma \alpha } \right| \right) }

2. Update equation for u:

\min_{u \in X} \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 \right)

\langle p, \nabla u \rangle = - \langle u, \mathrm{div}\, p \rangle

The minimization term becomes

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 \right)

Adding the proximal term,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \frac{ \lambda }{ 2 } \| u - g \|_2^2 + \frac{ 1 }{ 2 \tau } \| u - u^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. u and equate to 0:

- \mathrm{div}\, p + \lambda ( u - g ) + \frac{ 1 }{ \tau } ( u - u^0 ) = 0

In order to generalize, u^0 \to u^n and u \to u^{n+1}, i.e.,

- \mathrm{div}\, p^{n+1} + \lambda ( u^{n+1} - g ) + \frac{ 1 }{ \tau } ( u^{n+1} - u^n ) = 0
( 1 + \tau \lambda ) u^{n+1} = u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g
u^{n+1} = \frac{ u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g }{ 1 + \tau \lambda }

Therefore the update equations are

p^{n+1} = \frac{ \dfrac{ p^n + \sigma \nabla u^n }{ 1 + \sigma \alpha } }{ \max \left( 1, \left| \dfrac{ p^n + \sigma \nabla u^n }{ 1 + \sigma \alpha } \right| \right) }

u^{n+1} = \frac{ u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda g }{ 1 + \tau \lambda }


%% Code: Image denoising based on the Huber-ROF model

function primal_dual_denoising_Huber_ROF (img_src, lambda, max_iter)

noise_image = img_src;                    % original noisy input image
if (size(noise_image,3) == 3)
    noise_image = rgb2gray (noise_image); % convert the noisy image to grayscale
end
[n_row, n_col] = size (noise_image);
N = n_row * n_col;

% generate gradient matrix for the nabla operator.
% Vectorize the original image (n_row by n_col) to a vector with
% n_row*n_col elements via row-wise scanning, i.e. associate the (i,j)
% element of the image matrix with the (i-1)*n_col+j element of the vector.

noise_image = reshape (noise_image', N, 1);
nabla = make_nabla (n_col, n_row);

% initialization
u = noise_image;      % primal variable u initially taken as the input noisy image
p = zeros (2*N, 1);   % dual variable p initially set to zero
head_u = u;

% algorithm parameters
L = sqrt (8);         % Lipschitz constant (L)
alfa = 0.05;          % Huber parameter alpha
lambda = 30;          % overrides the function argument in this demo
gamma = lambda;
delta = alfa;
mu = 2*sqrt(gamma*delta)/L;
theta = 1/(1 + mu);
tau = mu/(2*gamma);
sigma = mu/(2*delta);
max_iter = 100;       % overrides the function argument in this demo
nabla_t = nabla';

for n_processing = 1:max_iter
    old_u = u;
    % update dual: p = proj( (p + sigma*nabla*head_u)/(1 + sigma*alfa) )
    temp_p = (nabla * head_u * sigma + p)/(1 + sigma*alfa);
    sqrt_p = sqrt (temp_p(1:N).^2 + temp_p(N+1:2*N).^2);
    sqrt_p_ = [sqrt_p; sqrt_p];
    p = temp_p./(max(1, sqrt_p_));
    % update primal
    var = tau * lambda/(1 + tau * lambda);
    u = (1 - var) * old_u + var * (noise_image + nabla_t * p/lambda);
    % calculate head_u (extrapolation step)
    head_u = u + theta * (u - old_u);
    drawnow;
end


Figure: Result of denoising using the Huber-ROF model (lambda = 10)


TV-L1 denoising:
The objective function can be written as

\min_{u \in X} \| \nabla u \|_1 + \lambda \| u - g \|_1

or   \min_{u \in X} \| \nabla u \|_1 + \| \lambda ( u - g ) \|_1

The dual function for \| \nabla u \|_1 is

\| \nabla u \|_1 = \max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

where \delta_P ( p ) is an indicator function,

\delta_P ( p ) = \begin{cases} 0 & \text{if } \| p \| \le 1 \\ \infty & \text{otherwise} \end{cases}

The dual function for \| \lambda ( u - g ) \|_1 is

\| \lambda ( u - g ) \|_1 = \max_q \langle q, \lambda ( u - g ) \rangle - \delta_Q ( q )

where \delta_Q ( q ) is an indicator function,

\delta_Q ( q ) = \begin{cases} 0 & \text{if } \| q \| \le 1 \\ \infty & \text{otherwise} \end{cases}

Now the TV-L1 objective function in min-max form can be written as

\min_{u \in X} \max_{ \| p \| \le 1 } \max_{ \| q \| \le 1 } \left( \langle p, \nabla u \rangle + \langle q, \lambda ( u - g ) \rangle - \delta_P ( p ) - \delta_Q ( q ) \right)

Computing the update equations:
1. Update equation for p:

\max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

Converting this into min form and adding the proximal term, we get

\min_{ \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ 1 }{ 2 \sigma } \| p - p^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. p and equate to 0:

- \nabla u + \frac{ 1 }{ \sigma } ( p - p^0 ) = 0

In order to generalize, p^0 \to p^n and p \to p^{n+1}, i.e.,

- \nabla u^n + \frac{ 1 }{ \sigma } ( p^{n+1} - p^n ) = 0
or   p^{n+1} = p^n + \sigma \nabla u^n

Projecting onto the space \| p \| \le 1,

p^{n+1} = \frac{ p^n + \sigma \nabla u^n }{ \max \left( 1, | p^n + \sigma \nabla u^n | \right) }

2. Update equation for q:

\max_{ \| q \| \le 1 } \langle q, \lambda ( u - g ) \rangle - \delta_Q ( q )

Converting this into min form and adding the proximal term, we get

\min_{ \| q \| \le 1 } \left( - \langle q, \lambda ( u - g ) \rangle + \frac{ 1 }{ 2 \sigma } \| q - q^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. q and equate to 0:

- \lambda ( u - g ) + \frac{ 1 }{ \sigma } ( q - q^0 ) = 0

Generalizing, q^0 \to q^n and q \to q^{n+1}:

- \lambda ( u^n - g ) + \frac{ 1 }{ \sigma } ( q^{n+1} - q^n ) = 0
q^{n+1} - q^n = \sigma \lambda ( u^n - g )
q^{n+1} = q^n + \sigma \lambda ( u^n - g )

Projecting onto the space \| q \| \le 1,

q^{n+1} = \frac{ q^n + \sigma \lambda ( u^n - g ) }{ \max \left( 1, | q^n + \sigma \lambda ( u^n - g ) | \right) }

3. Update equation for u:

\min_{u \in X} \left( \langle p, \nabla u \rangle + \langle q, \lambda ( u - g ) \rangle \right)

Changing \langle p, \nabla u \rangle = - \langle u, \mathrm{div}\, p \rangle,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \langle q, \lambda ( u - g ) \rangle \right)

Adding the proximal term,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \langle q, \lambda ( u - g ) \rangle + \frac{ 1 }{ 2 \tau } \| u - u^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. u and equate to 0:

- \mathrm{div}\, p + \lambda q + \frac{ 1 }{ \tau } ( u - u^0 ) = 0

Generalizing, u^0 \to u^n and u \to u^{n+1}:

- \mathrm{div}\, p^{n+1} + \lambda q^{n+1} + \frac{ 1 }{ \tau } ( u^{n+1} - u^n ) = 0
u^{n+1} = u^n + \tau \left( \mathrm{div}\, p^{n+1} - \lambda q^{n+1} \right)

%% Code: Image denoising based on the TV-L1 model

function primal_dual_denoising_TV_L1(img_src, lambda, max_iter)

noise_image = img_src;
if (size(noise_image,3) == 3)
    noise_image = rgb2gray(noise_image);
end
[n_row, n_col] = size(noise_image);
N = n_row * n_col;

% generate gradient matrix for the nabla operator.
% Vectorize the original image (n_row by n_col) to a vector with
% n_row*n_col elements via row-wise scanning, i.e. associate the (i,j)
% element of the image matrix with the (i-1)*n_col+j element of the vector.

noise_image = reshape (noise_image', N, 1);
nabla = make_nabla (n_col, n_row);

% initialization
u = noise_image;
p = zeros (2*N, 1);
head_u = u;

% algorithm parameters
L = sqrt (8);         % Lipschitz constant (L)
tau = 0.02;
sigma = (1/L^2)/tau;
theta = 0.5;
lambda = 30;          % overrides the function argument in this demo
max_iter = 100;       % overrides the function argument in this demo

nabla_t = nabla';

for n_processing = 1:max_iter
    old_u = u;
    % update dual
    temp_p = nabla * head_u * sigma + p;
    sqrt_p = sqrt(temp_p(1:N).^2 + temp_p(N+1:2*N).^2);
    sqrt_p_ = [sqrt_p; sqrt_p];
    p = temp_p./(max(1, sqrt_p_));
    % update primal: closed-form proximal step for the lambda*|u - g| term
    temp_u = old_u - tau * nabla_t * p;
    index1 = (temp_u - noise_image) > tau*lambda;
    index2 = (temp_u - noise_image) < -tau*lambda;
    index3 = ~index1 & ~index2;
    u(index1) = temp_u(index1) - tau*lambda;
    u(index2) = temp_u(index2) + tau*lambda;
    u(index3) = noise_image(index3);
    % calculate head_u (extrapolation step)
    head_u = u + theta * (u - old_u);
    drawnow;
end



Figure: Result of denoising based on the TV-L1 model (lambda = 2)

Image Deconvolution:
The objective function can be written as

\min_{u \in X} \| \nabla u \|_1 + \frac{ \lambda }{ 2 } \| A u - g \|_2^2

The dual function for \| \nabla u \|_1 is

\| \nabla u \|_1 = \max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

where \delta_P ( p ) is an indicator function,

\delta_P ( p ) = \begin{cases} 0 & \text{if } \| p \| \le 1 \\ \infty & \text{otherwise} \end{cases}

Now the objective in min-max form can be written as

\min_{u \in X} \max_{ p : \| p \| \le 1 } \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| A u - g \|_2^2 - \delta_P ( p ) \right)

Computing the update equations:
1. Update equation for p:

\max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

Converting this into min form and adding the proximal term, we get

\min_{ \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ 1 }{ 2 \sigma } \| p - p^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. p and equate to 0:

- \nabla u + \frac{ 1 }{ \sigma } ( p - p^0 ) = 0

In order to generalize, p^0 \to p^n and p \to p^{n+1}, i.e.,

- \nabla u^n + \frac{ 1 }{ \sigma } ( p^{n+1} - p^n ) = 0
or   p^{n+1} = p^n + \sigma \nabla u^n

Projecting onto the space \| p \| \le 1,

p^{n+1} = \frac{ p^n + \sigma \nabla u^n }{ \max \left( 1, | p^n + \sigma \nabla u^n | \right) }

2. Update equation for u:

\min_{u \in X} \left( \langle p, \nabla u \rangle + \frac{ \lambda }{ 2 } \| A u - g \|_2^2 \right)

Changing \langle p, \nabla u \rangle = - \langle u, \mathrm{div}\, p \rangle, and adding the proximal term,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \frac{ \lambda }{ 2 } \| A u - g \|_2^2 + \frac{ 1 }{ 2 \tau } \| u - u^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. u and equate to 0:

- \mathrm{div}\, p + \lambda A^T ( A u - g ) + \frac{ 1 }{ \tau } ( u - u^0 ) = 0

Generalizing, u^0 \to u^n and u \to u^{n+1}:

- \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda A^T ( A u^{n+1} - g ) + u^{n+1} - u^n = 0
u^{n+1} + \tau \lambda A^T A u^{n+1} = u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda A^T g
( I + \tau \lambda A^T A ) u^{n+1} = u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda A^T g
u^{n+1} = ( I + \tau \lambda A^T A )^{-1} \left( u^n + \tau \, \mathrm{div}\, p^{n+1} + \tau \lambda A^T g \right)
%% Code: Image deconvolution

function tv_deconv_primal_dual(degraded_img, noise_var, filter, lambda, max_iter, check, handles)
[n_row, n_col] = size(degraded_img);
N = n_row * n_col;
v_degraded_img = reshape(degraded_img', N, 1);
nabla = make_nabla(n_col, n_row);

% initialization
u = v_degraded_img;
p = zeros(2*N, 1);
head_u = u;

% algorithm parameters
L = sqrt(8);
sigma = 10;
tau = 1/L^2/sigma;
theta = 1;

% embed the blur kernel in an image-sized array, centered at the origin
A = filter;
B = zeros(n_row, n_col);
B(1:size(A,1), 1:size(A,2)) = A;
B = circshift(B, -floor(size(A)/2));

% since A is a circular convolution, (I + tau*lambda*A'A)^{-1} is applied
% in the Fourier domain
fft_degraded_img = fft2(degraded_img);
fft_filter = fft2(B);
fft_filter_conj = conj(fft_filter);
fft_filter_sqr = fft_filter.*fft_filter_conj;
fft_degraded_times_fft_filter_conj = fft_degraded_img.*fft_filter_conj;

% Wiener deconvolution result, for comparison
estimated_nsr = noise_var / var(degraded_img(:));
wnr_img = deconvwnr(degraded_img, filter, estimated_nsr);

energy = zeros(max_iter, 1);
nabla_t = nabla';
axes(handles.axes_tv);   % GUI axes handle supplied by the caller
for n_processing = 1:max_iter
    old_u = u;
    % update dual
    temp_p = nabla * head_u * sigma + p;
    sqrt_p = sqrt(temp_p(1:N).^2 + temp_p(N+1:2*N).^2);
    sqrt_p_ = [sqrt_p; sqrt_p];
    p = temp_p./(max(1, sqrt_p_));
    % primal update: solve (I + tau*lambda*A'A) u = temp_u + tau*lambda*A'g
    temp_u_vector = old_u - tau * nabla_t * p;
    fft_temp_u = fft2((reshape(temp_u_vector, n_col, n_row))');
    temp_u = ifft2((tau*lambda*fft_degraded_times_fft_filter_conj + fft_temp_u)./...
        (1 + tau*lambda*fft_filter_sqr));
    u = reshape(temp_u', N, 1);
    % calculate head_u (extrapolation step)
    head_u = u + theta * (u - old_u);

    temp_u = (reshape(u, n_col, n_row))';
    temp_u = imfilter(temp_u, filter, 'circular');
    drawnow;
end

Optic flow:
The main objective of the optic flow concept is to compute a flow field estimating the motion of pixels in two consecutive image frames.
Consider the standard L1-norm based optic flow formulation,

\min_{u \in X, v \in Y} \| \nabla u \|_1 + \| \nabla v \|_1 + \lambda \| I_1 ( x + f ) - I_2 ( x ) \|_1

where I_1, I_2 are the successive image pair, \lambda is a control parameter, and f is the flow vector ( u, v ) at any pixel ( x, y ) in the image.
The dual function for \| \nabla u \|_1 is

\| \nabla u \|_1 = \max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

where \delta_P ( p ) is an indicator function. Similarly it is possible to write the dual functions for \| \nabla v \|_1 and \| I_1 ( x + f ) - I_2 ( x ) \|_1.
Now the objective function in min-max form can be written as

\min_{u \in X, v \in Y} \max_{ p, q, r } \left( \langle p, \nabla u \rangle + \langle q, \nabla v \rangle + \langle r, \lambda ( I_1 ( x + f ) - I_2 ( x ) ) \rangle - \delta_P ( p ) - \delta_Q ( q ) - \delta_R ( r ) \right)

Computing the update equations:
1. Update equation for p:

\max_{ \| p \| \le 1 } \langle p, \nabla u \rangle - \delta_P ( p )

Converting this into min form and adding the proximal term, we get

\min_{ \| p \| \le 1 } \left( - \langle p, \nabla u \rangle + \frac{ 1 }{ 2 \sigma_p } \| p - p^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. p and equate to 0:

- \nabla u + \frac{ 1 }{ \sigma_p } ( p - p^0 ) = 0

In order to generalize, p^0 \to p^n and p \to p^{n+1}, i.e.,

- \nabla u^n + \frac{ 1 }{ \sigma_p } ( p^{n+1} - p^n ) = 0
or   p^{n+1} = p^n + \sigma_p \nabla u^n

Projecting onto the space \| p \| \le 1,

p^{n+1} = \frac{ p^n + \sigma_p \nabla u^n }{ \max \left( 1, | p^n + \sigma_p \nabla u^n | \right) }

2. Update equation for q:

\max_{ \| q \| \le 1 } \langle q, \nabla v \rangle - \delta_Q ( q )

Converting this into min form and adding the proximal term, we get

\min_{ \| q \| \le 1 } \left( - \langle q, \nabla v \rangle + \frac{ 1 }{ 2 \sigma_q } \| q - q^0 \|^2 \right)

To solve this, or to derive the update equation, differentiate w.r.t. q and equate to 0:

- \nabla v + \frac{ 1 }{ \sigma_q } ( q - q^0 ) = 0

In order to generalize, q^0 \to q^n and q \to q^{n+1}, i.e.,

- \nabla v^n + \frac{ 1 }{ \sigma_q } ( q^{n+1} - q^n ) = 0
or   q^{n+1} = q^n + \sigma_q \nabla v^n

Projecting onto the space \| q \| \le 1,

q^{n+1} = \frac{ q^n + \sigma_q \nabla v^n }{ \max \left( 1, | q^n + \sigma_q \nabla v^n | \right) }

3. Update equation for r:

\max_{ \| r \| \le 1 } \langle r, \lambda ( I_1 ( x + f ) - I_2 ( x ) ) \rangle

Converting this into min form and adding the proximal term, we get

\min_{ \| r \| \le 1 } \left( - \langle r, \lambda ( I_1 ( x + f ) - I_2 ( x ) ) \rangle + \frac{ 1 }{ 2 \sigma_r } \| r - r^0 \|^2 \right)

Now differentiate with respect to r and equate to zero,

- \lambda ( I_1 ( x + f ) - I_2 ( x ) ) + \frac{ 1 }{ \sigma_r } ( r - r^0 ) = 0

In order to generalize, r^0 \to r^n and r \to r^{n+1}:

r^{n+1} = r^n + \sigma_r \lambda ( I_1 ( x + f ) - I_2 ( x ) )

Normalizing (projecting onto \| r \| \le 1),

r^{n+1} = \frac{ r^n + \sigma_r \lambda ( I_1 ( x + f ) - I_2 ( x ) ) }{ \max \left( 1, | r^n + \sigma_r \lambda ( I_1 ( x + f ) - I_2 ( x ) ) | \right) }

4. Update equation for u:
Here f is the flow vector ( u, v ) at any pixel ( x, y ) in the image. Therefore, by linearizing around f^0 we can rewrite the expression \langle r, \lambda ( I_1 ( x + f ) - I_2 ( x ) ) \rangle as

\langle r, \lambda ( I_1 ( x + f ) - I_2 ( x ) ) \rangle \approx \langle r, \lambda ( I_1 ( x + f^0 ) - I_2 ( x ) + ( f - f^0 )^T \nabla I_1 ) \rangle

It is possible to expand the terms involving f and f^0 as

I_1 ( x + f^0 ) - I_2 ( x ) + ( f - f^0 )^T \nabla I_1 = I_1 ( x + f^0 ) - I_2 ( x ) + ( u - u^0 ) I_x + ( v - v^0 ) I_y

Now it is easy to expand the dot product expression involving r as

\langle r, \lambda ( I_1 ( x + f^0 ) - I_2 ( x ) ) \rangle + \lambda \langle r, ( u - u^0 ) I_x \rangle + \lambda \langle r, ( v - v^0 ) I_y \rangle

Here,

\langle r, ( u - u^0 ) I_x \rangle = r^T I_x ( u - u^0 ) = \langle I_x^T r, u - u^0 \rangle

Now consider the objective function w.r.t. u,

\min_{u \in X} \left( \langle p, \nabla u \rangle + \lambda \langle I_x^T r, u - u^0 \rangle \right)

Rewrite as

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \lambda \langle I_x^T r, u - u^0 \rangle \right)

Taking the proximal,

\min_{u \in X} \left( - \langle u, \mathrm{div}\, p \rangle + \lambda \langle I_x^T r, u - u^0 \rangle + \frac{ 1 }{ 2 \tau_u } \| u - u^0 \|^2 \right)

Now differentiate the above expression w.r.t. u and equate to zero,

- \mathrm{div}\, p + \lambda I_x^T r + \frac{ 1 }{ \tau_u } ( u - u^0 ) = 0

On generalizing,

- \mathrm{div}\, p^{n+1} + \lambda I_x^T r^{n+1} + \frac{ 1 }{ \tau_u } ( u^{n+1} - u^n ) = 0
u^{n+1} = u^n + \tau_u \left( \mathrm{div}\, p^{n+1} - \lambda I_x^T r^{n+1} \right)

5. Update equation for v:

\min_{v \in Y} \left( \langle q, \nabla v \rangle + \lambda \langle I_y^T r, v - v^0 \rangle \right)

Rewrite as

\min_{v \in Y} \left( - \langle v, \mathrm{div}\, q \rangle + \lambda \langle I_y^T r, v - v^0 \rangle \right)

Taking the proximal,

\min_{v \in Y} \left( - \langle v, \mathrm{div}\, q \rangle + \lambda \langle I_y^T r, v - v^0 \rangle + \frac{ 1 }{ 2 \tau_v } \| v - v^0 \|^2 \right)

Now differentiate the above expression w.r.t. v and equate to zero,

- \mathrm{div}\, q + \lambda I_y^T r + \frac{ 1 }{ \tau_v } ( v - v^0 ) = 0

On generalizing,

- \mathrm{div}\, q^{n+1} + \lambda I_y^T r^{n+1} + \frac{ 1 }{ \tau_v } ( v^{n+1} - v^n ) = 0
v^{n+1} = v^n + \tau_v \left( \mathrm{div}\, q^{n+1} - \lambda I_y^T r^{n+1} \right)
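No listing is given in these notes for the optic flow solver, so the following is only a sketch of one iteration of the updates derived above, for the u-component (v is handled symmetrically with q and I_y). All names here are assumptions: nabla and nabla_t are the gradient operator and its transpose from the denoising codes, Ix and Iy are vectorized image derivatives, I1w is the warped image I_1(x + f^0), and sigma_p, sigma_r, tau_u, lambda are step-size parameters.

% dual ascent in p, then projection onto ||p|| <= 1
temp_p = p + sigma_p * (nabla * u);
sqrt_p = sqrt(temp_p(1:N).^2 + temp_p(N+1:2*N).^2);
p = temp_p ./ max(1, [sqrt_p; sqrt_p]);

% dual ascent in r on the linearized data term, then projection onto |r| <= 1
rho = I1w - I2 + (u - u0).*Ix + (v - v0).*Iy;
r = r + sigma_r * lambda * rho;
r = r ./ max(1, abs(r));

% primal descent: u^{n+1} = u^n + tau_u*(div p - lambda*Ix.*r), with div = -nabla_t
u = u - tau_u * (nabla_t * p + lambda * (Ix .* r));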

Row space and column space of a matrix
Consider an m by n matrix,

A = \begin{bmatrix} r_{11} & r_{12} & \dots & r_{1n} \\ r_{21} & r_{22} & \dots & r_{2n} \\ \vdots & & & \vdots \\ r_{m1} & r_{m2} & \dots & r_{mn} \end{bmatrix}

The vectors R_1 = [ r_{11} \; r_{12} \dots r_{1n} ], \dots, R_m = [ r_{m1} \; r_{m2} \dots r_{mn} ] in R^n are called the row vectors of A.
The vectors C_1 = [ r_{11} ; r_{21} ; \dots ; r_{m1} ], \dots, C_n = [ r_{1n} ; r_{2n} ; \dots ; r_{mn} ] in R^m are called the column vectors of A.
The subspace of R^n spanned by the row vectors of A is called the row space of A, and the subspace of R^m spanned by the column vectors of A is called the column space of A.
Find the basis for row space and column space:
If a matrix R is in row-reduced (or echelon) form, then
i. The row vectors with leading 1s form a basis for the row space of R.
ii. The column vectors containing the leading 1s of the row vectors form a basis for the column space of R.
For example, consider the matrix in echelon form,

A = \begin{bmatrix} 1 & 2 & 5 & 0 & 3 \\ 0 & 1 & 3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}

The row space basis is the set of row vectors with a leading 1, that is,

\{ [ 1 \; 2 \; 5 \; 0 \; 3 ], [ 0 \; 1 \; 3 \; 0 \; 0 ], [ 0 \; 0 \; 0 \; 1 \; 0 ] \}

Hence the row space dimension is 3.
The column space basis set is

\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \right\}

hence the column space also has dimension 3.
Null space:
The null space of an m by n matrix A is the set of all solutions to the equation Ax = 0:

N ( A ) = \{ x : x \in R^n \text{ and } Ax = 0 \}

If A is a 5 x 3 matrix, then the elements of the right null space of A are vectors of R^3 and the elements of the left null space are vectors of R^5.
Linear independence:
A set of n vectors \{ u_1, u_2, \dots, u_n \} is said to be linearly independent if

a_1 u_1 + a_2 u_2 + \dots + a_n u_n = 0 only when all a_i = 0 (the trivial solution)

If any other case exists, i.e., if some a_i are nonzero, then the vectors \{ u_1, u_2, \dots, u_n \} are linearly dependent. Then it is possible to express one vector as a linear combination of the others.
Rank:
The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix. The maximum number of linearly independent rows in a matrix A is called the row rank of A, and the maximum number of linearly independent columns in A is called the column rank of A.
If A is an m by n matrix, then the row rank of A is at most m and the column rank of A is at most n. If A is a null matrix, then the rank of A is zero.
The rank of an m by n matrix A can be computed using the Gauss elimination method. In Gauss elimination, use elementary row operations to reduce A to echelon form; the number of pivots (leading coefficients) gives the rank of the matrix A.

It is possible to find the rank of a matrix in MATLAB using the command rank, which provides an estimate of the number of linearly independent rows or columns of a given matrix. The command rank(A) returns the number of singular values of A that are larger than the default tolerance, max(size(A))*eps(norm(A)).
Examples:
Consider

A = \begin{bmatrix} 8 & 1 & 6 \\ 3 & 5 & 7 \\ 4 & 9 & 2 \end{bmatrix}

rank ( A ) = 3; it is a full rank matrix.
If

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}

then rank ( A ) = 2.
The rank of the above matrix is found using Gauss elimination. By doing elementary row operations, the matrix A is reduced to the echelon form

\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}

In the echelon form the third row is zero and only two rows are independent. Hence the rank of the matrix is 2.
If

A = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{bmatrix}

then using Gaussian elimination the matrix A is reduced to the row-reduced echelon form

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

In this form there are three nonzero rows, and hence the rank of the matrix is 3.
Problems:
1. Using the concept of rank, check whether a vector ( x, y, z ) is in the row space of a matrix A.
Solution:
i. Find the row rank of the matrix A.
ii. Find the row rank of the matrix A with the vector ( x, y, z ) appended as a row vector.
If both are the same, then we can say that the vector ( x, y, z ) is from the row space of A. A quick MATLAB illustration of this test is shown below.
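As a sketch, using the data of Example 1 below:

A = [1 2 3; 4 5 6; 7 8 9];
w = [4 3 6];                       % candidate row vector (x, y, z)
if rank(A) == rank([A; w])
    disp('w is in the row space of A')
else
    disp('w is not in the row space of A')
end
% for the column space test of Problem 2, append as a column instead:
% rank(A) == rank([A, w'])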
2. Using the concept of rank, check whether a vector [ x ; y ; z ] is in the column space of a matrix A.

Solution:

i. Find the column rank of the matrix A.
ii. Find the column rank of the matrix A with the vector [ x ; y ; z ] appended as a new column.
If both are the same, then we can say that the vector [ x ; y ; z ] is in the column space of A.
Example 1:
Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}

Check whether ( 4, 3, 6 ) is in the row space of A.
Step 1:
First we have to find the rank of A using the Gaussian elimination method.

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
\xrightarrow{ R_2 \to R_2 - 4 R_1, \; R_3 \to R_3 - 7 R_1 }
\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \end{bmatrix}
\xrightarrow{ R_3 \to R_3 - 2 R_2 }
\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}

The row reduced form of A has 2 nonzero rows, hence the row rank of matrix A is 2.
Step 2:
Find the rank of the matrix A with the vector ( 4, 3, 6 ) appended as a row vector. Now A becomes

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 4 & 3 & 6 \end{bmatrix}

Now find the rank of this matrix using the Gaussian elimination method:

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 4 & 3 & 6 \end{bmatrix}
\xrightarrow{ R_2 \to R_2 - 4 R_1, \; R_3 \to R_3 - 7 R_1, \; R_4 \to R_4 - 4 R_1 }
\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \\ 0 & -5 & -6 \end{bmatrix}
\xrightarrow{ R_3 \to R_3 - 2 R_2, \; R_4 \to R_4 - \frac{5}{3} R_2 }
\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \\ 0 & 0 & 4 \end{bmatrix}

Interchange R_3 and R_4:

\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 4 \\ 0 & 0 & 0 \end{bmatrix}

Hence the row reduced form of the appended matrix has 3 nonzero rows, so its row rank is 3.
So it is clear that the matrix A and the appended A have different row ranks. Hence we can say that the vector ( 4, 3, 6 ) is not in the row space of matrix A, i.e., ( 4, 3, 6 ) cannot be expressed as a linear combination of the row vectors.
Example 2:
Let

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 6 & 7 & 8 & 9 \\ 1 & 2 & 3 & 5 \end{bmatrix}

Check whether ( 1, 2, 8, 9 ) is in the row space of A.
Step 1:
First we have to find the rank of A using the Gaussian elimination method.
The row reduced form of A is

\begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & -5 & -10 & -15 \\ 0 & 0 & 0 & 1 \end{bmatrix}

and the number of nonzero rows is 3, hence the row rank of matrix A is 3.
Step 2:
Find the rank of the matrix A with the vector ( 1, 2, 8, 9 ) appended as a row vector. Now A becomes

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 6 & 7 & 8 & 9 \\ 1 & 2 & 3 & 5 \\ 1 & 2 & 8 & 9 \end{bmatrix}

Now find the rank of this matrix using the Gaussian elimination method. The row reduced form of this A is

\begin{bmatrix} 1 & 2 & 3 & 4 \\ 0 & -5 & -10 & -15 \\ 0 & 0 & 5 & 5 \\ 0 & 0 & 0 & 1 \end{bmatrix}

and the number of nonzero rows is 4, hence the row rank of the appended matrix is 4.
So it is clear that the matrix A and the appended A have different row ranks. Hence we can say that the vector ( 1, 2, 8, 9 ) is not in the row space of matrix A.
3. How to check whether a given vector is in the right null space of the matrix A?
Solution:
If a given vector X is in the right null space, then AX = 0.
That is, take the dot product of X with all the rows of A; if all are zero then we can say that the vector is in the right null space of A.
Example:
Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}

Check whether X = [ 1 ; -2 ; 1 ] is in the right null space of A.
If a given vector X is in the right null space, then AX = 0, i.e.,

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 - 4 \\ 10 - 10 \\ 16 - 16 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

Hence we can say that the given vector is in the right null space of A.
4. How to check whether a given vector is in the left null space of the matrix A?
Solution:
If a given vector X is in the left null space, then A^T X = 0.
That is, take the dot product of X with all the rows of A^T; if all are zero then we can say that the vector X is in the left null space of A.
Example:
Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}

Check whether X = [ 1 ; -2 ; 1 ] is in the left null space of A.
If a given vector X is in the left null space, then A^T X = 0, i.e.,

\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix} = \begin{bmatrix} 8 - 8 \\ 10 - 10 \\ 12 - 12 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

Hence we can say that the given vector is in the left null space of A.
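Both checks are one-liners in MATLAB, using the same data as above:

A = [1 2 3; 4 5 6; 7 8 9];
X = [1; -2; 1];
A*X      % right null space test: returns the zero vector
A'*X     % left null space test: also returns the zero vector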
5. Given a set of vectors in R^3, check whether they are independent or not.

Solution:
Write the vectors as rows (or columns) of a matrix and find its rank; if the rank of the matrix equals the number of vectors, then they are independent.
Example:
i. Check whether the given vectors [ 1, 2, 3 ], [ 5, 9, 8 ], [ 5, 5, -20 ] are independent or not.

Solution:
Write the vectors as the rows of a matrix A and find its rank. Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 5 & 9 & 8 \\ 5 & 5 & -20 \end{bmatrix}

Rank ( A ) = 2.
Rank \ne number of vectors, therefore the given vectors are dependent.
ii. Check whether the given vectors [ 1, 2, 1 ], [ 2, 3, 4 ], [ 5, 1, 2 ], [ 6, 3, 1 ] are independent or not.
Solution:
Write the vectors as the rows of a matrix A and find its rank. Let

A = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 3 & 4 \\ 5 & 1 & 2 \\ 6 & 3 & 1 \end{bmatrix}

Rank ( A ) = 3.
Rank \ne number of vectors, therefore the given vectors are dependent.
iii. Check whether the given vectors (the rows of A below) are independent.
Solution:
Write the vectors as the rows of a matrix A and find its rank. Let

A = \begin{bmatrix} 1 & 1 & 2 \\ 4 & 2 & 1 \\ 2 & 2 & 4 \end{bmatrix}

Rank ( A ) = 2.
Rank \ne number of vectors, therefore the given vectors are dependent.
iv. Check whether the given vectors (the rows of A below) are independent.
Solution:
Write the vectors as the rows of a matrix A and find its rank. Let

A = \begin{bmatrix} 1 & 2 & 1 \\ 1 & 0 & 1 \\ 2 & 1 & 1 \end{bmatrix}

Rank ( A ) = 3.
Rank = number of vectors, therefore the given vectors are independent.
v. Check whether the given vectors (the rows of A below) are independent.
Solution:
Write the vectors as the rows of a matrix A and find its rank. Let

A = \begin{bmatrix} 1 & 2 & 5 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}

Rank ( A ) = 3.
Rank = number of vectors, therefore the given vectors are independent.

vi. Find the rank of the following matrices using MATLAB.
Let

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 6 & 8 & 10 & 12 \\ 11 & 14 & 17 & 20 \end{bmatrix}

rank ( A ) = 2

B = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 3 & 2 & 0 \\ 0 & 2 & 1 & 1 \\ 1 & 1 & 3 & 2 \end{bmatrix}

rank ( B ) = 4; since it is a full rank matrix, all the rows are linearly independent.

C = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 0 & 3 \end{bmatrix}

rank ( C ) = 2; the three rows are not independent.

D = \begin{bmatrix} 1 & 2 & 5 \\ 1 & 0 & 1 \\ 2 & 1 & 1 \end{bmatrix}

rank ( D ) = 2

E = \begin{bmatrix} 1 & 0 & 1 \\ 2 & 1 & 3 \end{bmatrix}

rank ( E ) = 2

F = \begin{bmatrix} 1 & 0 & 2 & 1 \\ 2 & 1 & 1 & 0 \\ 0 & 2 & 3 & 1 \end{bmatrix}

rank ( F ) = 3
Problem:
1. Find the projection of the vector x = ( 1, 6, 2, 3 )^T onto all subspaces of the following matrix with rank = 2:

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 6 & 8 & 10 & 12 \\ 11 & 14 & 17 & 20 \end{bmatrix}

Solution:
x can be expressed as a linear combination of the row basis and the right null basis, i.e.,

x = x_{rowspace} + x_{nullspace(R)}

Similarly, x can also be expressed as a linear combination of the column basis and the left null basis,

x = x_{columnspace} + x_{nullspace(L)}

Consider

x = x_{rowspace} + x_{nullspace(R)}

where x \in R^4, x_{rowspace} \in R^4, x_{nullspace(R)} \in R^4.
Now it is possible to express a part of x as a linear combination of the rows of A, i.e.,

x = x_R + x_N = A^T \beta + x_N

Premultiply with A:

A x = A x_N + A A^T \beta
A x = 0 + A A^T \beta
A A^T \beta = A x

Since A is a matrix with rank = 2, A A^T is not invertible. Therefore rewrite it using the SVD A A^T = U \Sigma V^T :

U \Sigma V^T \beta = A x
\beta = V \Sigma^{-1} U^T A x

(with \Sigma^{-1} understood as the pseudo-inverse of the diagonal factor). Therefore,

x_R = A^T \beta = A^T V \Sigma^{-1} U^T A x
x_N = x - x_R = \left( I - A^T V \Sigma^{-1} U^T A \right) x

Repeat the same for B = A^T. Consider

x = x_{columnspace} + x_{nullspace(L)} = B^T \gamma + x_N

Now premultiply with B:

B x = B x_N + B B^T \gamma
B x = 0 + B B^T \gamma
B B^T \gamma = B x

B B^T is not invertible; therefore rewrite it using the SVD B B^T = U \Sigma V^T :

U \Sigma V^T \gamma = B x
\gamma = V \Sigma^{-1} U^T B x

Therefore,

x_C = B^T \gamma = B^T V \Sigma^{-1} U^T B x
x_N = x - x_C = \left( I - B^T V \Sigma^{-1} U^T B \right) x
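A numerical sketch of this computation in MATLAB, using pinv in place of the explicit SVD-based pseudo-inverse written above:

A = [1 2 3 4; 5 6 7 8; 6 8 10 12; 11 14 17 20];
x = [1; 6; 2; 3];
xR  = A' * pinv(A*A') * A * x;    % projection onto the row space
xNR = x - xR;                     % projection onto the right null space
B = A';
xC  = B' * pinv(B*B') * B * x;    % projection onto the column space
xNL = x - xC;                     % projection onto the left null space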
Vector/matrix operations in Excel using matrix.xla
Matrix.xla is a matrix add-in for Excel.
Installation:
i. Open Excel.
ii. From the menu tool bar select Tools and then select Add-in.
iii. Once in the add-in manager, browse for matrix.xla and select it.
iv. Click OK.
Open the Excel sheet and click on the Add-ins toolbar. A small blue icon with the letter M appears on the Add-ins command.

The options available there are:
Selector tool: select matrix pieces.
Generator tool: generate different kinds of matrices, like random, Hilbert etc.
Macros: starter icon for macro stuff.
Help: online help manual.
Selector tool: for selecting several different matrix formats like diagonal, triangular, tridiagonal etc. Simply select any cell in the matrix and choose the option you want from the menu.
Generator tool: generate different kinds of matrices.

Macro stuff: under Macros, different matrix operations are available.

There are different functions available in matrix.xla which help to perform the matrix/vector operations in Excel.
Rank of a matrix
The rank of a matrix is the maximum number of independent rows or columns in the matrix. In matrix.xla the function M_RANK determines the rank of a given matrix.
Example:
Step 1: Click the Generator option and then select Random; this gives a convenient way to generate a matrix of random numbers.
Select the appropriate number of rows and columns. The "starting from" space indicates where the first matrix element should be placed (cell A1 in the given example). It is also possible to choose the maximum and minimum values for the random numbers, as well as the matrix type, like integer or decimal. Now pressing the Generate button produces the desired random matrix.

Step 2: Find the rank of this random matrix by using the function M_RANK. For example, write the following command in cell D2: =M_RANK(A1:C3), and press Enter.

Examples:
Find the rank of the following matrices using matrix.xla.
Let

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 6 & 8 & 10 & 12 \\ 11 & 14 & 17 & 20 \end{bmatrix}

rank ( A ) = 2

B = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 3 & 2 & 0 \\ 0 & 2 & 1 & 1 \\ 1 & 1 & 3 & 2 \end{bmatrix}

rank ( B ) = 4; since it is a full rank matrix, all the rows are linearly independent.

C = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 0 & 3 \end{bmatrix}

rank ( C ) = 2; the three rows are not independent.

D = \begin{bmatrix} 1 & 2 & 5 \\ 1 & 0 & 1 \\ 2 & 1 & 1 \end{bmatrix}

rank ( D ) = 2

E = \begin{bmatrix} 1 & 0 & 1 \\ 2 & 1 & 3 \end{bmatrix}

rank ( E ) = 2

F = \begin{bmatrix} 1 & 0 & 2 & 1 \\ 2 & 1 & 1 & 0 \\ 0 & 2 & 3 & 1 \end{bmatrix}

rank ( F ) = 3

Singular Value Decomposition (SVD) of a matrix
The SVD of a matrix can be visualized as a factorization of A into three separate matrices, that is,

A = U \Sigma V^T

It is possible to compute the SVD of a matrix in matrix.xla.
Step 1: Open a new Excel sheet and load the input matrix A.
Step 2: Select the option SVD available in the matrix operation menu. Then select the matrix A in the "matrix/vector A" space and also indicate the starting cell in the "output starting from cell" space. In the given example, cells A2:D5 indicate the input matrix and cell F2 indicates the output starting cell. Now press the Run button.

Solving a linear system Ax = b in matrix.xla
Step 1: Select the button for Ax=b from the matrix operation window.
Step 2: The cell range for A appears in the "matrix/vector A" selection and the cell range for the right hand side appears in the "matrix/vector B" selection. The address of the first output cell is given in the "output starting from cell" selection. After selection, click the Run button; the solution will be computed and appears in the output cells.






Projection
1. Find the projection matrix P_c onto the column space of A_{m \times n}.

Here we are projecting a vector X \in R^m onto the column space of A_{m \times n}.
Given

X = X_{CS} + X_{LNS}

i.e., X = A \beta + X_{LNS}, where A \beta is a linear combination of the columns of A and \beta is the vector of coefficients. For instance, with two columns,

X = c_1 \begin{bmatrix} | \\ x_1 \\ | \end{bmatrix} + c_2 \begin{bmatrix} | \\ x_2 \\ | \end{bmatrix} + e

where e is the residual orthogonal to the columns.
Premultiply with A^T :

A^T X = A^T A \beta + A^T X_{LNS}

X_{LNS} is orthogonal to all the columns of A (the rows of A^T). When A^T X_{LNS} = 0, this implies

A^T A \beta = A^T X

If A^T A is invertible,

\beta = ( A^T A )^{-1} A^T X

Therefore,

X_{CS} = A \beta = A ( A^T A )^{-1} A^T X

So the projection matrix P_c onto the column space of A_{m \times n} is

P_c = A ( A^T A )^{-1} A^T

In general,

\beta = ( A^T A )^{+} A^T X   (using the pseudo-inverse, if A^T A is not invertible)

and hence

X_{CS} = A \beta = A ( A^T A )^{+} A^T X

So the projection matrix P_c onto the column space of A_{m \times n} is

P_c = A ( A^T A )^{+} A^T
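A small numerical sketch of the column space projection (the data here are made up for illustration):

A  = [1 0; 0 1; 1 1];            % two independent columns spanning a plane in R^3
Pc = A * pinv(A'*A) * A';        % projection matrix; pinv also covers the
                                 % rank-deficient case
X   = [1; 2; 5];
Xcs = Pc * X;                    % component of X in the column space
Xln = X - Xcs;                   % component in the left null space; A'*Xln is zero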
2. Find the projection matrix P_r onto the row space of A_{m \times n}.

Here we are projecting a vector X \in R^n onto the row space of A_{m \times n}.
Given

X = X_{RS} + X_{RNS}

i.e., X = A^T \beta + X_{RNS}, where A^T \beta is a linear combination of the rows of A and \beta is the vector of coefficients.
Premultiply with A:

A X = A A^T \beta + A X_{RNS}

X_{RNS} is perpendicular to the rows of A. When A X_{RNS} = 0, this implies

A A^T \beta = A X

Therefore,

\beta = ( A A^T )^{+} A X   (using the pseudo-inverse, if A A^T is not invertible)

and hence

X_{RS} = A^T \beta = A^T ( A A^T )^{+} A X

So the projection matrix P_r onto the row space of A_{m \times n} is

P_r = A^T ( A A^T )^{+} A

3. Find the projection matrix P to project a vector onto the right null space of A_{m \times n}.
Solution:
Here we again start from projecting a vector X \in R^n onto the row space of A_{m \times n}.
Given

X = X_{RS} + X_{RNS}

i.e., X = A^T \beta + X_{RNS}. Premultiplying with A and using A X_{RNS} = 0 gives

A A^T \beta = A X

In general, if A A^T is not invertible (not a full rank matrix), take the SVD A A^T = U \Sigma V^T :

U \Sigma V^T \beta = A X
\beta = V \Sigma^{-1} U^T A X

Now,

X_{RS} = A^T \beta = A^T V \Sigma^{-1} U^T A X

and hence

X_{RNS} = X - X_{RS} = \left( I - A^T V \Sigma^{-1} U^T A \right) X

So the projection matrix P onto the right null space of A_{m \times n} is

P = I - A^T V \Sigma^{-1} U^T A

Properties of projection matrices
1. Prove that the projection matrix P_c is symmetric.
Proof:

P_c = A ( A^T A )^{-1} A^T
P_c^T = \left[ A ( A^T A )^{-1} A^T \right]^T = A \left[ ( A^T A )^T \right]^{-1} A^T = A ( A^T A )^{-1} A^T
P_c^T = P_c

Hence the proof.
2. Prove that P_c^2 = P_c.
Proof:

P_c^2 = P_c P_c
      = A ( A^T A )^{-1} A^T A ( A^T A )^{-1} A^T
      = A ( A^T A )^{-1} ( A^T A ) ( A^T A )^{-1} A^T
      = A ( A^T A )^{-1} A^T   (since ( A^T A ) ( A^T A )^{-1} = I)
P_c^2 = P_c

Hence the proof.
Geometric interpretation of P^2 = P:
Given X = X_{CS} + X_{LNS}.

Here P is such that P X = X_{CS}, where X_{CS} is a vector in the column space.
Now

P^2 X = P ( P X ) = P X_{CS} = X_{CS} = P X

This implies that P^2 = P.
3. Prove that the eigenvalues of a projection matrix are 0 or 1.
Proof:
Let P X = \lambda X with X \ne 0.
Premultiply with P. Now,

P^2 X = \lambda P X = \lambda^2 X

Since P^2 X = P X, the above relation becomes

P X = \lambda^2 X
\lambda X = \lambda^2 X
( \lambda^2 - \lambda ) X = 0
\lambda ( \lambda - 1 ) X = 0

This implies \lambda = 0 or \lambda = 1, since X \ne 0.
4. Is a projection matrix P invertible or not?
In general a projection matrix P is not invertible, since one or more eigenvalues of P are zero.
Geometric interpretation:
Suppose X_1 and X_2 have the same projection, i.e.,

P X_1 = P X_2 = X_c

If P were invertible it would have a unique inverse. But here multiple vectors have the same projected vector, therefore a unique inverse is not possible.

Singular Value Decomposition and PCA - A Geometric Viewpoint

The singular value decomposition is over a hundred years old. For the case of square matrices, it was discovered independently by Beltrami in 1873 and Jordan in 1874. The technique was extended to rectangular matrices by Eckart and Young in the 1930s, and its use as a computational tool dates back to the 1960s. Golub and van Loan [1] demonstrated its usefulness and feasibility in a wide variety of applications.
The Singular Value Decomposition (SVD) is a topic rarely reached in undergraduate linear algebra courses and often skipped over in graduate courses. Consequently relatively few mathematicians are familiar with what M.I.T. Professor Gilbert Strang calls "absolutely a high point of linear algebra." These pages are a brief introduction to SVD suitable for inclusion in a standard undergraduate level linear algebra course.
The SVD has a variety of applications in scientific computing, signal processing, automatic control, and many other areas.
Understanding SVD geometrically through PCA
In linear algebra, the most important concept is SVD. This says that any m x n matrix A can be represented as a product of three matrices:

A = U \Sigma V^T

What is the physical interpretation of this statement?
To understand it geometrically, we consider each row of matrix A as a point in n-dimensional space. To visualize, let the dimension of matrix A be m x 2, so that each data point (row vector) is a 2-tuple which can be plotted on a plane as shown below.

In matrix form the data is visualized as shown below.

Data (rows x_1^T, ..., x_8^T with coordinates X1, X2):

x_1^T : ( 1, 1 )
x_2^T : ( 2, 2.5 )
x_3^T : ( 3, 4 )
x_4^T : ( 4, 4.2 )
x_5^T : ( -1, -1.5 )
x_6^T : ( -2, -2.8 )
x_7^T : ( -3, -3.5 )
x_8^T : ( -4, -3.9 )

For example, let

A = \begin{bmatrix} 1 & 1 \\ 2 & 2.5 \\ 3 & 4 \\ 4 & 4.2 \\ -1 & -1.5 \\ -2 & -2.8 \\ -3 & -3.5 \\ -4 & -3.9 \end{bmatrix}

Here each row represents a point in the X1-X2 plane.
This matrix can be decomposed into three matrices using SVD (or PCA). What are these matrices?
Let us start from the Pythagoras theorem. What is the sum of the squares of the lengths of all the row vectors (here each data point can be considered as a vector)? It is the sum of squares of all the elements of this matrix A. Looking at it another way, it is the sum of squares of the X1 coordinates and the X2 coordinates.
Based on matrix A, how will we represent this sum? Consider

e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} and e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}

which represent our unit norm axes. Then

A e_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ -1 \\ -2 \\ -3 \\ -4 \end{bmatrix} and A e_2 = \begin{bmatrix} 1 \\ 2.5 \\ 4 \\ 4.2 \\ -1.5 \\ -2.8 \\ -3.5 \\ -3.9 \end{bmatrix}

Symbolically, the X1 and X2 coordinates of all data points are respectively the vectors A e_1 and A e_2. Now we need to square and sum each of the elements of these vectors to get our required sum. This is easily obtained by the dot product. Let T represent the total; this T is in fact the total variation in the data.

T = ( A e_1 )^T ( A e_1 ) + ( A e_2 )^T ( A e_2 )

or

T = e_1^T A^T A e_1 + e_2^T A^T A e_2 = variation along the X1 axis + variation along the X2 axis.

But a 2-D vector can be decomposed into two components in any two orthogonal directions. We are again using the Pythagoras theorem, and since our data vector lengths do not change, we get the same sum T. This is exemplified in the following figure.

For example, we may use two orthogonal unit vectors v_1 and v_2 and project the data onto those axes. Since the total sum of squares of vector lengths remains the same, T can be written as

T = v_1^T A^T A v_1 + v_2^T A^T A v_2

Also note that we can choose two unit norm orthogonal vectors v_1 and v_2 such that in one direction the variation is maximum. Let v_1 be the direction in which the variation is maximum. [In the figure it can be seen that

v_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}

(a 45 degree direction).] The components of the data points along this direction are A v_1. So the variation along this direction is

( A v_1 )^T ( A v_1 ) = v_1^T A^T A v_1

Mathematically, how will we derive the direction in which the variation is maximum? It is the solution of the following optimization problem:

\max_v v^T A^T A v   subject to v^T v = 1

The constraint v^T v = 1 ensures that v has unit norm.
Taking the Lagrangian and applying the first order optimality condition, we obtain the following:

L ( v, \lambda ) = v^T A^T A v - \lambda ( v^T v - 1 )   ( \lambda is the Lagrangian multiplier)

\frac{ \partial L }{ \partial v } = 2 A^T A v - 2 \lambda v = 0 \Rightarrow A^T A v = \lambda v

The direction is given by an eigenvector of A^T A. But A^T A has two eigenvectors, since A^T A is a 2x2 matrix. We choose the eigenvector corresponding to the largest eigenvalue, because \lambda represents the variation along the direction v_1. This follows from the fact that

v^T A^T A v = v^T \lambda v = \lambda v^T v = \lambda

Let this eigenvalue be \lambda_1 and the corresponding eigenvector be v_1. This value \lambda_1 is the maximum variation among all possible directions. In the figure, this direction corresponds to the principal axis of the ellipse along which the data is distributed. The remaining variation is along an axis perpendicular to the principal axis. Incidentally, all eigenvectors of a real symmetric matrix are orthogonal. Since A^T A is symmetric, the total variation T can be split by projecting the data onto the normalized eigenvectors of A^T A. These directions are respectively v_1 and v_2, and the variations are respectively \lambda_1 and \lambda_2. Also T = \lambda_1 + \lambda_2.
Let A v_1 = a_1 u_1. Here A v_1 represents the components of each data point (the row vectors of A) along the direction of the vector v_1, a_1 is a scalar, and u_1 is a unit vector. Since there are m data points (m row vectors), there are m components and hence u_1 is of dimension m x 1.
Similarly we have A v_2 = a_2 u_2.
Therefore

A [ v_1 \; v_2 ] = [ u_1 \; u_2 ] \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}

or A V = U \Sigma.
Since the columns of V are orthonormal, V is a 2x2 orthonormal matrix and hence V V^T = I.
Thus

A = U \Sigma V^T = [ u_1 \; u_2 ] \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}

Observe the dimensions of each matrix:

A_{m \times 2} = U_{m \times 2} \Sigma_{2 \times 2} V^T_{2 \times 2}

In the example considered, the two columns of A are independent and hence we say it has full column rank. In general only r out of n columns of A may be independent; SVD can be generalized to deal with such a matrix. Also we can show that a_1 = \sqrt{ \lambda_1 } and a_2 = \sqrt{ \lambda_2 }. All this can be proved by the linear algebra concepts described in the next section.
To summarize:
Any rectangular matrix A can be factorized into 3 matrices.
1. The columns of V are eigenvectors of the matrix A^T A.
2. The columns of U are (normalized) projections of A (the data points, which are the rows of A) onto the columns of V.
3. \Sigma is a diagonal matrix, and the diagonal elements are the square roots of the variation of the data points along the columns of V.
Using linear algebra we can also prove the following:
1. The columns of U are orthonormal.
2. The columns of U are eigenvectors of the matrix A A^T. Note that the matrix A A^T is an m x m matrix.
3. The eigenvalues of A A^T and A^T A are the same (for normalized eigenvectors).
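A short MATLAB sketch of this geometric picture, using the example data above (the signs of the last four points are as assumed in the reconstruction of the data table):

A = [1 1; 2 2.5; 3 4; 4 4.2; -1 -1.5; -2 -2.8; -3 -3.5; -4 -3.9];
[U, S, V] = svd(A, 'econ');       % columns of V are the principal directions
lam = eig(A'*A);                  % eigenvalues of A'A
disp(sort(diag(S).^2));           % squared singular values ...
disp(sort(lam));                  % ... equal the eigenvalues of A'A
disp(sum(A(:).^2));               % total variation T ...
disp(sum(lam));                   % ... equals lambda_1 + lambda_2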
Prove that the row rank of A = rank of A^T A (note that A^T A is symmetric).
Proof:
Let the dimension of A be m x n.
We use the relation: row rank of A + nullity of A = number of columns of A.
That is, row rank of ( A ) + nullity of ( A ) = n    (1)
[This basically means: number of independent rows of A + nullity of ( A ) = n. Because, when we compute Ax, we are dot-producting the rows of A with x. For some x \ne 0, x \in R^n, Ax = 0, and for some other x \in R^n, Ax \ne 0. This means rank of Rowspace ( A ) + rank of null space of A must be equal to n.
Further visualization:
Let r < n be the number of independent rows. Let x_{b1}, x_{b2}, \dots, x_{br} be the set of bases that spans the row space of A. For any x \in R^n which can be expressed as a linear combination of x_{b1}, x_{b2}, \dots, x_{br}, the product Ax cannot be a zero vector, because x cannot be orthogonal to all the rows of A. But remember x can be orthogonal to some rows of A.
There remain n - r orthogonal vectors in R^n which are orthogonal to each of x_{b1}, x_{b2}, \dots, x_{br}. Let these orthogonal basis vectors be x_{b(r+1)}, x_{b(r+2)}, \dots, x_{bn}. Now for any x \in R^n which is a linear combination of the vectors x_{b(r+1)}, x_{b(r+2)}, \dots, x_{bn}, the product Ax is a zero vector, because this time x is orthogonal to all the rows of A.
We know that the space spanned by x_{b1}, x_{b2}, \dots, x_{br} is the row space of A, and the space spanned by x_{b(r+1)}, x_{b(r+2)}, \dots, x_{bn} is the null space of A. Hence the result.]
Let B = A^T A. The dimension of B is n x n.
rank ( B ) + nullity of ( B ) = n    (2)
We will now show that both A and A^T A have the same null space.
For x \in R^n, Ax = 0 \Rightarrow A^T A x = 0; conversely, A^T A x = 0 \Rightarrow x^T A^T A x = \| Ax \|^2 = 0 \Rightarrow Ax = 0. This means that A and A^T A have the same null space. Therefore
nullity ( A ) = nullity of ( A^T A ) = nullity of ( B )    (3)
From (1), (2) and (3), we obtain
Row rank of ( A ) = rank ( A^T A )
On the same lines we can prove that row rank of ( A^T ) = rank ( A A^T ).
The next theorem says that rank ( A^T A ) = rank ( A A^T ).
This means row rank of ( A ) = row rank of ( A^T ), or row rank of ( A ) = column rank of ( A ).
So rank ( A ) = rank ( A^T ) = rank ( A^T A ) = rank ( A A^T ).
Prove that the eigenvalues of X^T X and X X^T are the same (for normalized eigenvectors). Also, corresponding to every eigenvector of X^T X there is a corresponding eigenvector of X X^T.
Proof:
Let the dimension of X be m x n, and let the m rows of X represent m data points in n-dimensional space:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_m^T \end{bmatrix}

From linear algebra, X^T X and X X^T are square matrices with dimensions n x n and m x m respectively.
Let

X^T X v = \lambda v, where v^T v = 1

v is an eigenvector of X^T X, and \lambda is the corresponding eigenvalue.
We can now show that \lambda is also an eigenvalue of X X^T (corresponding to a normalized eigenvector).
Multiplying both sides with X, we obtain

X X^T ( X v ) = \lambda ( X v )

This implies that X v is an eigenvector of X X^T.
Let X v = a u so that u^T u = 1; a is a scalar.
On substitution of X v = a u into X X^T ( X v ) = \lambda ( X v ), we obtain

X X^T a u = \lambda a u

This implies

X X^T u = \lambda u

This proves that the eigenvalues are the same (corresponding to the normalized eigenvector pair u, v).
Thus, the numbers of nonzero eigenvalues of X X^T and X^T X are the same.
This also proves that the rank of X X^T is the same as the rank of X^T X.
Can we establish some relation between a and \lambda? Yes, we will prove that a = \sqrt{ \lambda }.
Proof:
X v = a u, where a is a scalar and u is a unit vector.

v^T X^T X v = a^2 u^T u = a^2

But from X^T X v = \lambda v,

v^T X^T X v = \lambda v^T v = \lambda

This means a^2 = \lambda, i.e., a = \sqrt{ \lambda }.
Therefore X v = \sqrt{ \lambda } \, u. This is the relation between the eigenvectors of X^T X and X X^T.
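A quick numerical check of this eigenvalue relation (random data assumed for illustration):

X = randn(5, 3);
ev1 = sort(eig(X'*X), 'descend');    % three eigenvalues of X'X
ev2 = sort(eig(X*X'), 'descend');    % five eigenvalues of XX'; the three
                                     % largest match ev1, the rest are zero
[Vx, Dx] = eig(X'*X);
v = Vx(:, end);                      % an eigenvector v of X'X
u = X*v / sqrt(Dx(end, end));        % then Xv = sqrt(lambda)*u with unit u
disp(norm(X*X'*u - Dx(end,end)*u));  % ~0: u is an eigenvector of XX'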
The dimensions of the v's are ( n x 1 ) and of the u's are ( m x 1 ).
X^T X and X X^T have r eigenvectors with nonzero eigenvalues. We assume the eigenvalues are distinct; the eigenvectors are then orthogonal and hence independent.
Since X v = \sqrt{ \lambda } \, u, the vectors u_1, u_2, \dots, u_r span the column space of X. Similarly, v_1, v_2, \dots, v_r span the row space of X.
We now have

X [ v_1, v_2, \dots, v_r ] = [ u_1, u_2, \dots, u_r ] \begin{bmatrix} \sqrt{ \lambda_1 } & 0 & \dots & 0 \\ 0 & \sqrt{ \lambda_2 } & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \sqrt{ \lambda_r } \end{bmatrix}

Since v_i \in R^n, we can find n - r orthogonal vectors v_{r+1}, v_{r+2}, \dots, v_n such that X v = 0. From this we derive

X [ v_1, \dots, v_r, \dots, v_n ] = [ u_1, \dots, u_r, \dots, u_m ] \begin{bmatrix} \sqrt{ \lambda_1 } & & & \\ & \ddots & & \\ & & \sqrt{ \lambda_r } & \\ & & & 0 \end{bmatrix}

where u_{r+1}, u_{r+2}, \dots, u_m are arbitrary (orthonormal) vectors.
Representing the above matrix relation as X V = U \Sigma, we see that V is an ( n x n ) orthogonal matrix. Therefore

X_{m \times n} = U_{m \times n} \Sigma_{n \times n} V^T_{n \times n}

On simplification,

X_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}

Applications
1. Image compression
2. Finding the pseudo-inverse of a matrix, as in least square estimation
3. Noise filtering
4. Face recognition
5. Watermarking
6. Term-document matrices and singular value decompositions
7. Application to cryptanalysis
8. SVD methods for visualization of gene expression data
http://public.lanl.gov/mewall/kluwer2002.html
9. Recent work has shown how the singular value decomposition (SVD) may be used in a multiresolution form analogous to the wavelet decomposition
10. Phylogenetic tree construction using SVD (file: Errickson Trees SVD)
11. Deblurring of images (refer book, signal processing directory: SVD and Signal Processing: Algorithms, Applications and Architectures)
12. Handwritten digit recognition (book: Numerical Linear Algebra and Data Mining)
13. Text summarization
14. Image fusion
15. Speech enhancement using SVD (SVD speech thesis)
16. Speech modeling (same as above)
