
COS 511: Theoretical Machine Learning

Lecturer: Rob Schapire Lecture #17


Scribe: Eric Goldlust April 9, 2008
Bound on the Loss of the Widrow-Hoff algorithm
Last time, we were analyzing the performance of the Widrow-Hoff algorithm, specified as
follows:
    $w_1 = 0$
    for $t = 1$ to $T$ do
        get example $x_t \in \mathbb{R}^n$
        predict response $\hat{y}_t = w_t \cdot x_t$
        observe response $y_t \in \mathbb{R}$
        set $w_{t+1} = w_t - \eta (w_t \cdot x_t - y_t) x_t$
    end for
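The loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot` and `widrow_hoff`, the use of plain Python lists, and the toy examples are assumptions, not part of the lecture.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def widrow_hoff(examples, eta):
    """Run Widrow-Hoff over a sequence of (x_t, y_t) pairs.

    Returns the final weight vector w_{T+1}, the list of vectors
    w_1, ..., w_T that were used to predict, and the cumulative loss L_WH.
    """
    n = len(examples[0][0])
    w = [0.0] * n                     # w_1 = 0
    used, loss = [], 0.0
    for x, y in examples:
        y_hat = dot(w, x)             # predict y_hat_t = w_t . x_t
        loss += (y_hat - y) ** 2      # accumulate (y_hat_t - y_t)^2
        used.append(list(w))
        step = eta * (y_hat - y)      # eta * (w_t . x_t - y_t)
        w = [wi - step * xi for wi, xi in zip(w, x)]
    return w, used, loss


# Toy run with two made-up examples and eta = 0.5.
w, used, loss = widrow_hoff([([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)], 0.5)
# -> w = [0.5, -0.5], loss = 2.0
```

Both toy predictions are 0 because each example is orthogonal to the current weight vector, so each round contributes a loss of 1.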
We defined the loss under Widrow-Hoff to be
\[ L_{WH} = \sum_{t=1}^{T} (\hat{y}_t - y_t)^2 \]
and the loss under a fixed vector $u$ to be
\[ L_u = \sum_{t=1}^{T} (u \cdot x_t - y_t)^2, \]
and partially proved the following theorem about $L_{WH}$, which does not rely on any
distributional assumptions about the data.
Theorem. Assume that for every $t$, we know that $\|x_t\|_2 \le 1$. Then:
\[ L_{WH} \le \min_u \left[ \frac{L_u}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right]. \]
In the previous lecture, we reduced the proof of this theorem to the proof of the following
lemma, which we now prove:
Lemma. Let $\|x_t\|_2 \le 1$. Define the potential function $\Phi_t = \|w_t - u\|_2^2$. Define the
time-$t$ signed error of Widrow-Hoff by $\ell_t = w_t \cdot x_t - y_t$ and the signed error from a
fixed $u$ by $g_t = u \cdot x_t - y_t$. Then, for every $t$ and $u$:
\[ \Phi_{t+1} - \Phi_t \le -\eta \ell_t^2 + \frac{\eta}{1-\eta} g_t^2. \]
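Proof. Let $\Delta_t = \eta \ell_t x_t$, so that the update reads $w_{t+1} = w_t - \Delta_t$. Then
\[
\begin{aligned}
\Phi_{t+1} - \Phi_t &= \|w_{t+1} - u\|_2^2 - \|w_t - u\|_2^2 \\
&= \|(w_t - u) - \Delta_t\|_2^2 - \|w_t - u\|_2^2 \\
&= \|\Delta_t\|_2^2 - 2 (w_t - u) \cdot \Delta_t \\
&= \eta^2 \ell_t^2 \underbrace{\|x_t\|_2^2}_{\le 1} - 2 \eta \ell_t \underbrace{x_t \cdot (w_t - u)}_{= \ell_t - g_t} \\
&\le \eta^2 \ell_t^2 - 2 \eta \ell_t^2 + 2 \eta \ell_t g_t \\
&= \eta^2 \ell_t^2 - 2 \eta \ell_t^2 + 2 \eta \left[ \left( \ell_t \sqrt{1-\eta} \right) \left( \frac{g_t}{\sqrt{1-\eta}} \right) \right].
\end{aligned}
\]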
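We next use the real algebraic inequality $ab \le (a^2 + b^2)/2$ (see footnote 1) for the case
where $a = \ell_t \sqrt{1-\eta}$ and $b = g_t / \sqrt{1-\eta}$. This gives
\[
\begin{aligned}
\Phi_{t+1} - \Phi_t &\le \eta^2 \ell_t^2 - 2 \eta \ell_t^2 + \eta (1-\eta) \ell_t^2 + \frac{\eta}{1-\eta} g_t^2 \\
&= \left[ \eta^2 - 2\eta + \eta (1-\eta) \right] \ell_t^2 + \frac{\eta}{1-\eta} g_t^2 \\
&= -\eta \ell_t^2 + \frac{\eta}{1-\eta} g_t^2,
\end{aligned}
\]
which completes the proof of the lemma, and in turn the theorem.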
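As a sanity check on the lemma, the inequality can be tested numerically on random inputs. The sketch below is not a proof; the helper names, the sampling ranges, and the small floating-point tolerance `tol` are all assumptions made for illustration. It draws $x_t$ with Euclidean norm at most 1 and $\eta$ strictly between 0 and 1, as the derivation requires.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def potential_drop_holds(w, u, x, y, eta, tol=1e-9):
    """Check Phi_{t+1} - Phi_t <= -eta*l^2 + eta/(1-eta)*g^2 for one step."""
    l = dot(w, x) - y                    # signed error of Widrow-Hoff
    g = dot(u, x) - y                    # signed error of the fixed u
    w_next = [wi - eta * l * xi for wi, xi in zip(w, x)]
    phi = sum((wi - ui) ** 2 for wi, ui in zip(w, u))
    phi_next = sum((wi - ui) ** 2 for wi, ui in zip(w_next, u))
    return phi_next - phi <= -eta * l * l + eta / (1 - eta) * g * g + tol


def point_in_unit_ball(rng, n):
    """Draw a vector with Euclidean norm at most 1 (not uniformly distributed)."""
    v = [rng.gauss(0, 1) for _ in range(n)]
    norm = sum(vi * vi for vi in v) ** 0.5
    scale = rng.random() / norm
    return [vi * scale for vi in v]


rng = random.Random(0)                   # fixed seed for repeatability
checks = []
for _ in range(1000):
    w = [rng.uniform(-2, 2) for _ in range(3)]
    u = [rng.uniform(-2, 2) for _ in range(3)]
    x = point_in_unit_ball(rng, 3)
    y = rng.uniform(-2, 2)
    eta = rng.uniform(0.05, 0.95)
    checks.append(potential_drop_holds(w, u, x, y, eta))
all_hold = all(checks)
```

Since the lemma holds for every $w_t$, $u$, and admissible $x_t$, `all_hold` should come out true for any seed, up to floating-point error.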
Generalization: Varying the Loss function and the Norm
When we originally derived the Widrow-Hoff update rule, we tried to find a value of $w_{t+1}$
that minimized a linear combination of the loss of $w_{t+1}$ on $(x_t, y_t)$ and the norm
$\|w_{t+1} - w_t\|_2^2$. Specifically, we wanted to minimize
\[ \eta (w_{t+1} \cdot x_t - y_t)^2 + \|w_{t+1} - w_t\|_2^2. \]
We can try to generalize this objective function by replacing either the loss term or the
norm term with a more general function.
General loss function, $L_2$ distance
If our objective function is given by:
\[ \eta L(w_{t+1}, x_t, y_t) + \|w_{t+1} - w_t\|_2^2 \]
then we get the Gradient Descent (GD) update rule:
\[ w_{t+1} = w_t - \eta \nabla_w L(w_{t+1}, x_t, y_t) \approx w_t - \eta \nabla_w L(w_t, x_t, y_t). \]
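The approximate GD rule can be sketched as follows, with the square loss as the example. The function names are hypothetical. The gradient of $(w \cdot x - y)^2$ in $w$ is $2 (w \cdot x - y) x$, so GD with step size $\eta / 2$ reproduces the Widrow-Hoff update with step size $\eta$; the notes absorb this constant factor into $\eta$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gd_step(w, grad, eta):
    """One approximate GD update: w_{t+1} = w_t - eta * grad L(w_t)."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def square_loss_grad(w, x, y):
    """Gradient in w of (w . x - y)^2, namely 2 (w . x - y) x."""
    r = dot(w, x) - y
    return [2.0 * r * xi for xi in x]


# Made-up numbers for illustration.
w = [0.5, -0.25]
x, y = [1.0, 2.0], 1.0
eta = 0.1

w_gd = gd_step(w, square_loss_grad(w, x, y), eta / 2)    # GD with step eta/2
r = dot(w, x) - y
w_wh = [wi - eta * r * xi for wi, xi in zip(w, x)]        # Widrow-Hoff with eta
# w_gd and w_wh agree up to floating-point error
```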
Square loss, Relative Entropy distance
If our objective function is given by:
\[ \eta (w_{t+1} \cdot x_t - y_t)^2 + RE(w_t \,\|\, w_{t+1}) \]
then we find the following update rule, which is specified component-wise:
\[ w_{t+1,i} = \frac{w_{t,i} \exp\{-\eta (w_{t+1} \cdot x_t - y_t) x_{t,i}\}}{Z_t} \approx \frac{w_{t,i} \exp\{-\eta (w_t \cdot x_t - y_t) x_{t,i}\}}{Z_t} \]
where the $Z_t$ are normalization factors. Note the parallel with the original Widrow-Hoff rule:
In that case, we added $-\Delta_t$ to $w_t$. In this case, we multiply componentwise with the
exponentiation of the components of $-\Delta_t$ and then normalize.
Footnote 1: Follows from $a^2 - 2ab + b^2 = (a - b)^2 \ge 0$.
General loss function, Relative Entropy distance
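If our objective function is given by:
\[ \eta L(w_{t+1}, x_t, y_t) + RE(w_t \,\|\, w_{t+1}) \]
then we get the Exponentiated Gradient (EG) update rule:
\[ w_{t+1,i} = \frac{w_{t,i} \exp\left\{-\eta \frac{\partial L}{\partial w_i}(w_{t+1}, x_t, y_t)\right\}}{Z_t} \approx \frac{w_{t,i} \exp\left\{-\eta \frac{\partial L}{\partial w_i}(w_t, x_t, y_t)\right\}}{Z_t}. \]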
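A minimal sketch of the approximate EG update, using the square loss for the gradient. The function name `eg_step`, the uniform starting weights, and the example values are assumptions for illustration only.

```python
import math


def eg_step(w, grad, eta):
    """Approximate EG update: w_i <- w_i * exp(-eta * grad_i) / Z_t."""
    unnorm = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(unnorm)                      # normalization factor Z_t
    return [ui / z for ui in unnorm]


# Start from uniform weights on N = 3 components (made-up example).
w = [1 / 3, 1 / 3, 1 / 3]
x, y = [1.0, 0.0, -1.0], 1.0
r = sum(wi * xi for wi, xi in zip(w, x)) - y     # signed error w . x - y
grad = [2.0 * r * xi for xi in x]                # gradient of the square loss
w_next = eg_step(w, grad, eta=0.5)
```

Because the update is multiplicative and normalized, `w_next` stays positive and sums to 1. Here the prediction is too low, so the weight moves toward the component whose feature is positive.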
There is a performance bound for the EG update rule that looks similar to the one we
proved about WH. In order to make an apples-to-apples comparison, we first rewrite the
original WH bound as follows:
\[ \|x_t\|_2 \le 1 \;\Rightarrow\; L_{WH} \le \min_{u : \|u\|_2 = 1} [a L_u + b] \quad \text{for some } a, b. \]
For EG with square loss, there is the following similar-looking bound:
\[ \|x_t\|_\infty \le 1 \;\Rightarrow\; L_{EG} \le \min_{u : \|u\|_1 = 1} [a L_u + b \ln N] \quad \text{for some } a, b. \]
We can now add an element to our recurring dichotomy between additive and multiplicative
algorithms:
additive updates                 | multiplicative updates
$L_2 / L_2$                      | $L_\infty / L_1$
Support Vector Machines          | AdaBoost
Perceptrons                      | Winnow
Gradient Descent / Widrow-Hoff   | Exponentiated Gradient
Using Online Algorithms in a Batch Setting
We have analyzed both the batch setting and the online setting. In some sense, the results
in the online setting are stronger because they do not rely on statistical assumptions about
the data (for example, that the examples are i.i.d.). We now analyze the use of these online
algorithms on batch data where these statistical assumptions are assumed to hold. The
result will be simple and fast algorithms with generalization bounds that come for free from
the analysis we did in the online setting.
The Batch Setting
Given $S = \langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$, assume that for any $i$, $(x_i, y_i) \sim D$. Let there be a new
test point $(x, y)$ that is also distributed according to $D$. We want to find a linear predictor
$v$ with a low expected loss (also called risk or true risk), defined by:
\[ R_v = E_{(x,y) \sim D} \left[ (v \cdot x - y)^2 \right]. \]
When we say that we want the expected loss to be low, we mean this in relation to that of
the best possible value for any $u$, i.e. it should be low relative to $\min_u R_u$.
One reasonable way to accomplish this is to use Widrow-Hoff on the dataset, treating it as
though the examples were arriving online. This would yield a sequence of weight vectors
$w_1 = 0, w_2, \ldots, w_m$. We could then return the final hypothesis $v = w_m$. Unfortunately,
if we return $w_m$, the analysis turns out to be too difficult. We can, however, analyze the
performance if we return $v = \frac{1}{m} \sum_{t=1}^{m} w_t$. In this case, we can prove the following theorem:
Theorem.
\[ E_S[R_v] \le \min_u \left[ \frac{R_u}{1-\eta} + \underbrace{\frac{\|u\|_2^2}{\eta m}}_{\to 0 \text{ as } m \to \infty} \right] \]
Here, the expectation is over the random training set S. We proceed in steps, starting with
three lemmas.
Lemma (1).
\[ (v \cdot x - y)^2 \le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2. \]
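Proof.
\[
\begin{aligned}
(v \cdot x - y)^2 &= \left[ \left( \frac{1}{m} \sum_{t=1}^{m} w_t \right) \cdot x - y \right]^2 \\
&= \left[ \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y) \right]^2 \\
&\le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2 \quad \text{(by convexity of } f(z) = z^2 \text{)}.
\end{aligned}
\]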
Lemma (2).
\[ E\left[ (u \cdot x_t - y_t)^2 \right] = E\left[ (u \cdot x - y)^2 \right]. \]
Proof. This statement is true because $(x_t, y_t)$ and $(x, y)$ are identically distributed.
Lemma (3).
\[ E\left[ (w_t \cdot x_t - y_t)^2 \right] = E\left[ (w_t \cdot x - y)^2 \right]. \]
Proof. $w_t$ is chosen before $(x_t, y_t)$ is seen (it depends only on the first $t-1$ examples), so
$(x_t, y_t)$ and $(x, y)$ are identically distributed given $w_t$. Note that this is not true, for
example, given $w_{t+1}$, which depends on $(x_t, y_t)$.
We are now ready to prove the theorem.
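Proof. For any fixed $u$, we have:
\[
\begin{aligned}
E_S[R_v] &= E\left[ (v \cdot x - y)^2 \right] \\
&\le E\left[ \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2 \right] && \text{(by lemma 1)} \\
&= \frac{1}{m} \sum_{t=1}^{m} E\left[ (w_t \cdot x - y)^2 \right] \\
&= \frac{1}{m} \sum_{t=1}^{m} E\left[ (w_t \cdot x_t - y_t)^2 \right] && \text{(by lemma 3)} \\
&= \frac{1}{m} E\left[ \sum_{t=1}^{m} (w_t \cdot x_t - y_t)^2 \right] \\
&= \frac{1}{m} E[L_{WH}] \\
&\le \frac{1}{m} E\left[ \frac{\sum_{t=1}^{m} (u \cdot x_t - y_t)^2}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right] && \text{(previously shown in WH analysis)} \\
&= \frac{1}{m} \left[ \frac{\sum_{t=1}^{m} E\left[ (u \cdot x_t - y_t)^2 \right]}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right] \\
&= \frac{1}{m} \left[ \frac{\sum_{t=1}^{m} E\left[ (u \cdot x - y)^2 \right]}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right] && \text{(by lemma 2)} \\
&= \frac{R_u}{1-\eta} + \frac{\|u\|_2^2}{\eta m}.
\end{aligned}
\]
Since this holds for any $u$, the theorem follows.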
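The averaged predictor analyzed above can be sketched as a single pass over the training set. The function name `averaged_widrow_hoff` and the toy sample are assumptions for illustration. Note that the average is taken over $w_1, \ldots, w_m$, the vectors used to predict, so the vector produced by the last update is not included.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def averaged_widrow_hoff(sample, eta):
    """Return v = (1/m) * sum_{t=1}^m w_t from one online pass over the sample."""
    n = len(sample[0][0])
    m = len(sample)
    w = [0.0] * n                        # w_1 = 0
    total = [0.0] * n
    for x, y in sample:
        total = [ti + wi for ti, wi in zip(total, w)]   # add w_t before updating
        step = eta * (dot(w, x) - y)
        w = [wi - step * xi for wi, xi in zip(w, x)]    # w_{t+1}
    return [ti / m for ti in total]


# Toy sample with m = 2: w_1 = [0, 0] and w_2 = [0.5, 0].
v = averaged_widrow_hoff([([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)], 0.5)
# -> [0.25, 0.0]
```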
Probability Modeling
So far, we have analyzed situations where we deal with $(x, y)$ pairs, where $y$ could be real or
categorical. We now consider the situation where we receive only $x$ and our goal is to model
its distribution. Let $x \sim P$. The goal is to estimate $P$. This task is called Probability
Modeling or Density Estimation.
One example of where this can be useful is in speech recognition. In order to decide if
a speaker has just said "I sat on a chair" or "I fat on a chair", a system could use prior
estimates of the relative likelihood of these two phrases in English in order to decide that
"I sat on a chair" was more likely.
We might also want to perform density estimation for classification problems. If we
wanted to estimate human gender based on height, we could build separate density estimates
(possibly just by estimating Gaussian parameters) for men and women and then use Bayes'
rule to decide which gender was more likely under that probability model given the height
of the test person. This is called the generative approach.
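As a rough sketch of the generative recipe just described, the code below fits one Gaussian per class and applies Bayes' rule. The heights are made-up numbers, and the function names, the equal class prior, and the neutral class labels "a" and "b" are all assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def fit_gaussian(values):
    """Maximum-likelihood mean and standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def posterior_a(x, params_a, params_b, prior_a=0.5):
    """P(class a | x) by Bayes' rule from two class-conditional Gaussians."""
    pa = prior_a * gaussian_pdf(x, *params_a)
    pb = (1 - prior_a) * gaussian_pdf(x, *params_b)
    return pa / (pa + pb)


# Made-up training heights in cm for two classes (illustration only).
heights_a = [170.0, 175.0, 180.0, 185.0]
heights_b = [155.0, 160.0, 165.0, 170.0]
params_a = fit_gaussian(heights_a)
params_b = fit_gaussian(heights_b)

p_tall = posterior_a(182.0, params_a, params_b)   # above 0.5: class a more likely
p_mid = posterior_a(170.0, params_a, params_b)    # midpoint of the two means
```

With equal priors and equal fitted variances, as in this toy data, the posterior crosses 0.5 exactly at the midpoint of the two class means.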
By contrast, the discriminative approach makes no attempt to model the distribution
of the data, but rather just tries to find an effective classification rule. In this case, we would
just estimate a threshold height above which to classify a test person as male.
An advantage of the discriminative approach is that it is direct and simple and does not
require assumptions about the distribution of the data. The generative approach, however,
has the advantage that expert knowledge of the distribution can sometimes lead to higher
performance with less training data.
One example where the generative approach is effective is in the detection of fraudulent
calling-card phone calls. One could build, for every customer, a probability distribution over
phone calls (frequency, duration, location, etc.), and then build similar models for fraudsters.
Given a test phone call, one could use the probabilities under these two distributions to
estimate the probability that the call is fraudulent. In this case, the generative approach
can work well.