Rob Schapire
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone customer
  (Collect, CallingCard, PersonToPerson, etc.)
  • yes I’d like to place a collect call long distance please (Collect)
  • operator I need to make a call but I need to bill it to my office (ThirdNumber)
  • yes I’d like to place a call on my master card please (CallingCard)
  • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
  • easy to find “rules of thumb” that are “often” correct
  • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
  • hard to find single highly accurate prediction rule
The Boosting Approach
• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
Strong and Weak Learnability
• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
  • introduced “AdaBoost” algorithm
  • strong practical advantages over previous boosting algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96] [Abney, Schapire & Singer ’99] [Tieu & Viola ’00]
[Jackson & Craven ’96] [Haruno, Shirai & Ooyama ’99] [Walker, Rambow & Rogati ’01]
[Freund & Schapire ’96] [Cohen & Singer ’99] [Rochery, Schapire, Rahim & Gupta ’01]
[Quinlan ’96] [Dietterich ’00] [Merler, Furlanello, Larcher & Sboner ’01]
[Breiman ’96] [Schapire & Singer ’00] [Di Fabbrizio, Dutton, Gupta et al. ’02]
[Maclin & Opitz ’97] [Collins ’00] [Qu, Adam, Yasui et al. ’02]
[Bauer & Kohavi ’97] [Escudero, Màrquez & Rigau ’00] [Tur, Schapire & Hakkani-Tür ’03]
[Schwenk & Bengio ’98] [Iyer, Lewis, Schapire et al. ’00] [Viola & Jones ’04]
[Schapire, Singer & Singhal ’98] [Onoda, Rätsch & Müller ’00] [Middendorf, Kundaje, Wiggins et al. ’04]
…
• continuing development of theory and algorithms:
[Breiman ’98, ’99] [Ridgeway, Madigan & Richardson ’99] [Wyner ’02]
[Schapire, Freund, Bartlett & Lee ’98] [Kivinen & Warmuth ’99] [Rudin, Daubechies & Schapire ’03]
[Grove & Schuurmans ’98] [Friedman, Hastie & Tibshirani ’00] [Jiang ’04]
[Mason, Bartlett & Baxter ’98] [Rätsch, Onoda & Müller ’00] [Lugosi & Vayatis ’04]
[Schapire & Singer ’99] [Rätsch, Warmuth, Mika et al. ’00] [Zhang ’04]
[Cohen & Singer ’99] [Allwein, Schapire & Singer ’00] [Rosset, Zhu & Hastie ’04]
[Freund & Mason ’99] [Friedman ’01] [Zhang & Yu ’05]
[Domingo & Watanabe ’99] [Koltchinskii, Panchenko & Lozano ’01][Bartlett & Traskin ’07]
[Mason, Baxter, Bartlett & Frean ’99] [Collins, Schapire & Singer ’02] [Mease & Wyner ’07]
[Duffy & Helmbold ’99, ’02] [Demiriz, Bennett & Shawe-Taylor ’02] [Shalev-Shwartz & Singer ’08]
[Freund & Mason ’99] [Lebanon & Lafferty ’02] …
Basic Algorithm and Core Theory
• introduction to AdaBoost
• analysis of training error
• analysis of test error based on
margins theory
A Formal Description of Boosting
• given training set $(x_1, y_1), \ldots, (x_m, y_m)$; labels $y_i \in \{-1, +1\}$
• for $t = 1, \ldots, T$:
  • construct distribution $D_t$ over $\{1, \ldots, m\}$
  • find weak classifier $h_t$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
• output final classifier $H_{\text{final}}$
AdaBoost
[with Freund]
• constructing $D_t$:
  • $D_1(i) = 1/m$
  • given $D_t$ and $h_t$:
    $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t}\,\exp(-\alpha_t\, y_i\, h_t(x_i))$$
    where $Z_t$ = normalization factor and
    $$\alpha_t = \tfrac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$$
• final classifier:
  $$H_{\text{final}}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$$
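To make the update concrete, here is a minimal numpy sketch of AdaBoost as described above; the decision-stump weak learner and the small clipping guard on the error are illustrative assumptions, not part of the slides.

import numpy as np

def best_stump(X, y, D):
    # Exhaustive search for the stump (feature, threshold, sign)
    # with smallest D-weighted error: a simple "weak learner".
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thresh, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, T):
    m = len(y)                                 # labels y_i in {-1, +1}
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        eps, j, thresh, sign = best_stump(X, y, D)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against eps_t = 0
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = sign * np.where(X[:, j] <= thresh, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = D / D.sum()                        # Z_t = normalization factor
        ensemble.append((alpha, j, thresh, sign))
    return ensemble

def H_final(ensemble, X):
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    F = np.zeros(len(X))
    for alpha, j, thresh, sign in ensemble:
        F += alpha * sign * np.where(X[:, j] <= thresh, 1, -1)
    return np.sign(F)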
Toy Example
[figure: toy dataset over three boosting rounds, showing distributions D_1, D_2, D_3 and weak classifiers h_1, h_2, h_3]
• Round 1: $h_1$, $D_2$; $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$
• Round 2: $h_2$, $D_3$; $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$
• Round 3: $h_3$; $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
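These round weights follow directly from the $\alpha_t$ formula; a quick check (the printed values match the slide's up to rounding of the displayed $\epsilon_t$):

import numpy as np

for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 ln((1-eps_t)/eps_t)
    print(f"round {t}: eps = {eps:.2f}, alpha = {alpha:.3f}")
# round 1: eps = 0.30, alpha = 0.424
# round 2: eps = 0.21, alpha = 0.662
# round 3: eps = 0.14, alpha = 0.908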
Final Classifier
[figure: $H_{\text{final}} = \mathrm{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$; the three weak classifiers combine into the final decision regions]
Analyzing the Training Error
[with Freund]
• Theorem:
  • write $\epsilon_t$ as $1/2 - \gamma_t$   [ $\gamma_t$ = “edge” ]
  • then
    $$\text{training error}(H_{\text{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t(1-\epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)$$
• so: if $\forall t: \gamma_t \ge \gamma > 0$
  then $\text{training error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$
• AdaBoost is adaptive:
  • does not need to know $\gamma$ or $T$ a priori
  • can exploit $\gamma_t \gg \gamma$
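For a concrete feel for the bound, here is a quick numerical check of the three expressions using the $\epsilon_t$ values from the toy example (an illustrative choice):

import numpy as np

eps = np.array([0.30, 0.21, 0.14])               # eps_t = 1/2 - gamma_t
gamma = 0.5 - eps                                # edges gamma_t
prod_form = np.prod(2 * np.sqrt(eps * (1 - eps)))
equiv_form = np.prod(np.sqrt(1 - 4 * gamma**2))  # identical by algebra
exp_bound = np.exp(-2 * np.sum(gamma**2))
print(prod_form, equiv_form, exp_bound)
# 0.518... 0.518... 0.602...  (product form is below the exponential bound, as the theorem states)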
Proof
• let $F(x) = \sum_t \alpha_t h_t(x)$, so $H_{\text{final}}(x) = \mathrm{sign}(F(x))$
• Step 1: unwrapping recurrence:
  $$D_{\text{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i F(x_i))}{\prod_t Z_t}$$
Proof (cont.)
• Step 2: $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$
• Proof:
  $$\begin{aligned}
  \text{training error}(H_{\text{final}})
    &= \frac{1}{m}\sum_i \begin{cases} 1 & \text{if } y_i \neq H_{\text{final}}(x_i) \\ 0 & \text{else} \end{cases} \\
    &= \frac{1}{m}\sum_i \begin{cases} 1 & \text{if } y_i F(x_i) \le 0 \\ 0 & \text{else} \end{cases} \\
    &\le \frac{1}{m}\sum_i \exp(-y_i F(x_i)) \\
    &= \sum_i D_{\text{final}}(i) \prod_t Z_t \\
    &= \prod_t Z_t
  \end{aligned}$$
Proof (cont.)
• Step 3: $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$
• Proof:
  $$\begin{aligned}
  Z_t &= \sum_i D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i)) \\
      &= \sum_{i: y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i: y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} \\
      &= \epsilon_t\, e^{\alpha_t} + (1-\epsilon_t)\, e^{-\alpha_t} \\
      &= 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \end{aligned}$$
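A side note that is standard but not spelled out on the slides: the $\alpha_t$ used throughout is precisely the value minimizing $Z_t$, which is where the closed form in Step 3 comes from. In LaTeX:

% Z_t(alpha) = eps_t e^{alpha} + (1 - eps_t) e^{-alpha} is convex in alpha;
% setting its derivative to zero recovers AdaBoost's choice of alpha_t:
\frac{dZ_t}{d\alpha} = \epsilon_t e^{\alpha} - (1-\epsilon_t) e^{-\alpha} = 0
\;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad Z_t(\alpha_t) = 2\sqrt{\epsilon_t(1-\epsilon_t)}.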
How Will Test Error Behave? (A First Guess)
[figure: expected plot of train and test error vs. # of rounds T (20–100): training error falls steadily while test error first falls, then rises]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
• “Occam’s razor”
• overfitting
• hard to know when to stop training
Actual Typical Run
[figure: train and test error (%) vs. # of rounds T (10–1000), boosting C4.5 on the “letter” dataset: test error keeps falling even after training error reaches zero]
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
• training error only measures whether classifications are
right or wrong
• should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
= (weighted fraction voting correctly)
−(weighted fraction voting incorrectly)
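In code, the normalized margin of one labeled example under the weighted vote is a one-liner; a minimal sketch (the $\alpha$ values reuse the toy example, the vote pattern is made up):

import numpy as np

def margin(y, alphas, preds):
    # y in {-1, +1}; preds[t] = h_t(x) in {-1, +1}; alphas[t] = weight of h_t.
    # Equals (weighted fraction voting correctly) - (weighted fraction voting
    # incorrectly), so it lies in [-1, +1] and is positive iff the vote is correct.
    alphas, preds = np.asarray(alphas, float), np.asarray(preds, float)
    return y * np.dot(alphas, preds) / alphas.sum()

print(margin(+1, [0.42, 0.65, 0.92], [+1, +1, -1]))   # 0.0754: correct, but low confidence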
[figure: train/test error (%) vs. # of rounds alongside the cumulative distribution of margins after 5, 100, and 1000 rounds]

# rounds             5      100    1000
train error (%)      0.0    0.0    0.0
test error (%)       8.4    3.3    3.1
% margins ≤ 0.5      7.7    0.0    0.0
minimum margin       0.14   0.52   0.55
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: with high probability, for all $\theta > 0$,
  $$\text{generalization error} \le \hat{\Pr}[\text{margin} \le \theta] + \tilde{O}\left(\frac{\sqrt{d/m}}{\theta}\right)$$
  ($d$ = “complexity” of weak classifiers, $m$ = number of training examples)
Other Ways of Understanding Boosting
• game theory
• loss minimization
• an information-geometric view
Just a Game
[with Freund]
• game defined by matrix M; row player plays mixed strategy P, column player plays mixed strategy Q
• expected loss $= P^\top M Q \equiv M(P, Q)$
The Minmax Theorem
• von Neumann’s minmax theorem:
  $$\min_P \max_Q M(P, Q) = \max_Q \min_P M(P, Q) = v \quad (\text{the “value” of game } M)$$
• in words:
  • $v = \min\max$ means:
    • row player has “minmax” strategy $P^*$ such that $\forall$ column strategy $Q$: loss $M(P^*, Q) \le v$
  • $v = \max\min$ means:
    • this is optimal in the sense that column player has “maxmin” strategy $Q^*$ such that $\forall$ row strategy $P$: loss $M(P, Q^*) \ge v$
The Boosting Game
• row player ↔ booster
• column player ↔ weak learner
• let $\{g_1, \ldots, g_N\}$ = space of all weak classifiers
• matrix M:
  • row ↔ example $(x_i, y_i)$
  • column ↔ weak classifier $g_j$
  • $M(i, j) = \begin{cases} 1 & \text{if } y_i = g_j(x_i) \\ 0 & \text{else} \end{cases}$
  • encodes which weak classifiers are correct on which examples
[figure: the m × N matrix M, rows indexed by examples (x_1, y_1), …, (x_m, y_m) (booster), columns by weak classifiers g_1, …, g_N (weak learner)]
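On small instances the value of this game is a plain linear program, which makes the connection tangible; a sketch assuming scipy is available, with a made-up 3×3 matrix M:

import numpy as np
from scipy.optimize import linprog

# M(i, j) = 1 if weak classifier g_j is correct on example i, else 0 (toy instance).
M = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)
m, N = M.shape

# Booster picks P minimizing max_j M(P, j). LP variables x = (P_1..P_m, v):
# minimize v subject to M^T P <= v (componentwise), sum(P) = 1, P >= 0.
c = np.r_[np.zeros(m), 1.0]
A_ub = np.c_[M.T, -np.ones(N)]
b_ub = np.zeros(N)
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
P, v = res.x[:m], res.x[m]
print(P.round(3), round(v, 3))   # uniform P, value 2/3: weak learning holds with gamma = 1/6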
Boosting and the Minmax Theorem
• γ-weak learning assumption says:
  ∀ distribution P over examples, ∃ $g_j$ such that
  $$\Pr_{i \sim P}[g_j(x_i) = y_i] \ge \tfrac{1}{2} + \gamma$$
• i.e., $\min_P \max_j M(P, j) \ge \tfrac{1}{2} + \gamma$
• so, by the minmax theorem:
  γ-weak learning assumption holds
  ⇔ (value of game M) ≥ 1/2 + γ
  ⇔ every example classified correctly with margin ≥ 2γ by some weighted majority classifier
Idea for Boosting
• recall: under the weak learning assumption, some weighted majority classifier attains margin ≥ 2γ on every example
• further, those weights are a maxmin strategy for M
• idea: use general algorithm for finding maxmin strategy through repeated play
Boosting and Repeated Play
• on each round t:
  • booster chooses distribution $D_t$ over examples $(x_i, y_i)$
  • weak learner responds with a weak classifier $h_t$
AdaBoost and Game Theory
• summarizing:
  • weak learning assumption implies maxmin strategy for M defines large-margin classifier
  • AdaBoost finds maxmin strategy by applying general algorithm for solving games through repeated play
• consequences:
  • weights on weak classifiers converge to (approximately) maxmin strategy for game M
  • (average of) distributions $D_t$ converges to (approximately) minmax strategy
  • margins and edges connected via minmax theorem
  • explains why AdaBoost maximizes margins
• different instantiation of game-playing algorithm gives on-line learning algorithms (such as weighted majority algorithm)
AdaBoost and Loss Minimization
• AdaBoost greedily minimizes the exponential loss:
  $$\prod_t Z_t = \frac{1}{m}\sum_i \exp(-y_i F(x_i)), \quad \text{where } F(x) = \sum_t \alpha_t h_t(x)$$
Functional Gradient Descent
[Mason et al.][Friedman]
• want to minimize
  $$L(F) = L(F(x_1), \ldots, F(x_m)) = \sum_i \exp(-y_i F(x_i))$$
• gradient descent in function space: $F \leftarrow F - \alpha \nabla_F L(F)$
• but only allowed to move by adding a weak classifier, so take the step $F \leftarrow F + \alpha h_t$ with $h_t$ closest to the negative gradient direction
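A sketch of one such step for the exponential loss: the negative gradient at the current F has coordinates $y_i \exp(-y_i F(x_i))$, so the weak classifier best aligned with it is the one with smallest weighted error, and an exact line search along $h_t$ recovers AdaBoost's $\alpha_t$ (the weak_learner callable is an assumed stand-in):

import numpy as np

def gradient_step(F_vals, X, y, weak_learner):
    # One boosting round viewed as functional gradient descent on
    # L(F) = sum_i exp(-y_i F(x_i)); F_vals[i] holds the current F(x_i).
    w = np.exp(-y * F_vals)                 # -dL/dF(x_i) = y_i * w_i
    D = w / w.sum()                         # same distribution D_t as AdaBoost
    h_vals = weak_learner(X, y, D)          # h_t in {-1,+1}^m with small D-weighted error
    eps = D[h_vals != y].sum()
    eps = np.clip(eps, 1e-10, 1 - 1e-10)    # guard against a perfect weak classifier
    alpha = 0.5 * np.log((1 - eps) / eps)   # exact line search along h_t
    return F_vals + alpha * h_vals          # F <- F + alpha h_t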
[figure: observed probability vs. predicted probability (%), 0–100 on both axes]
• tempting to conclude:
  • AdaBoost is just an algorithm for minimizing exponential loss
  • AdaBoost works only because of its loss function
  • ∴ more powerful optimization techniques for the same loss should work even better
A Note of Caution
A Dual Information-geometric Perspective
An Iterative-projection Algorithm
• say we want to find the point closest to $x_0$ in the set $\mathcal{P}$ = { intersection of N hyperplanes }
• algorithm: [Bregman; Censor & Zenios]
  • start at $x_0$
  • repeat: pick a hyperplane and project onto it
[figure: successive projections of x_0 onto the hyperplanes, converging into P]
• if $\mathcal{P} \neq \emptyset$, under general conditions, will converge correctly
AdaBoost is an Iterative Projection Algorithm
[Kivinen & Warmuth]
• algorithm:
  • start at $D_1$ = uniform
  • for t = 1, 2, …:
    • pick hyperplane/weak classifier $h_t \leftrightarrow g_j$
    • $D_{t+1}$ = (entropy) projection of $D_t$ onto hyperplane:
      $$D_{t+1} = \arg\min_{D:\, \sum_i D(i)\, y_i\, g_j(x_i) = 0} \mathrm{RE}(D \,\|\, D_t)$$
• claim: equivalent to AdaBoost
• further: choosing $h_t$ with minimum error ≡ choosing farthest hyperplane
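To see the claimed equivalence at work: the entropy projection onto the hyperplane $\sum_i D(i) y_i g_j(x_i) = 0$ has the exponential form $D_{t+1}(i) \propto D_t(i)\exp(-\alpha\, y_i\, g_j(x_i))$, and solving the constraint for $\alpha$ gives back exactly AdaBoost's $\alpha_t$. A numerical sketch with made-up values:

import numpy as np

D = np.array([0.1, 0.2, 0.3, 0.4])      # current D_t
yg = np.array([+1, +1, -1, +1])         # y_i * g_j(x_i) for the chosen weak classifier

eps = D[yg == -1].sum()                 # D_t-weighted error of g_j
alpha = 0.5 * np.log((1 - eps) / eps)   # the alpha that makes the constraint hold

D_next = D * np.exp(-alpha * yg)        # exponential-family projection step
D_next /= D_next.sum()

print(np.dot(D_next, yg))               # ~0: D_{t+1} lies on the hyperplane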
Boosting as Maximum Entropy
• corresponding optimization problem:
  $$\min_{D \in \mathcal{P}} \mathrm{RE}(D \,\|\, \text{uniform})$$
• where
  $$\mathcal{P} = \text{feasible set} = \left\{ D : \sum_i D(i)\, y_i\, g_j(x_i) = 0 \ \forall j \right\}$$
• $\mathcal{P} \neq \emptyset$ ⇔ weak learning assumption does not hold
  • in this case, $D_t \to$ (unique) solution
• if weak learning assumption does hold then
  • $\mathcal{P} = \emptyset$
  • $D_t$ can never converge
  • dynamics are fascinating but unclear in this case
Unifying the Two Cases
[with Collins & Singer]
• two distinct cases:
  • weak learning assumption holds
    • $\mathcal{P} = \emptyset$
    • dynamics unclear
  • weak learning assumption does not hold
    • $\mathcal{P} \neq \emptyset$
    • can prove convergence of $D_t$’s
• to unify: work instead with unnormalized versions of $D_t$’s
  • standard AdaBoost: $D_{t+1}(i) = \dfrac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))}{\text{normalization}}$
  • instead, drop the normalization: $d_{t+1}(i) = d_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))$
  • algorithm is otherwise unchanged
Reformulating AdaBoost as Iterative Projection
• optimization problem:
  $$\min_{d \in \mathcal{P}} \mathrm{RE}(d \,\|\, \mathbf{1})$$
• where
  $$\mathcal{P} = \left\{ d : \sum_i d(i)\, y_i\, g_j(x_i) = 0 \ \forall j \right\}$$
Exponential Loss as Entropy Optimization
• exponential loss:
  $$\inf_{\lambda} \sum_i \exp\left(-y_i \sum_j \lambda_j g_j(x_i)\right)$$
• all vectors $d_t$ created by AdaBoost have the form:
  $$d(i) = \exp\left(-y_i \sum_j \lambda_j g_j(x_i)\right)$$
• let $\mathcal{Q}$ = { all vectors of this form }; $\overline{\mathcal{Q}}$ = closure of $\mathcal{Q}$
Duality
[Della Pietra, Della Pietra & Lafferty]
Experiments, Applications and Extensions
• basic experiments
• multiclass classification
• confidence-rated predictions
• text categorization / spoken-dialogue systems
• incorporating prior knowledge
• active learning
• face detection
Practical Advantages of AdaBoost
• fast
• simple and easy to program
• no parameters to tune (except T )
• flexible — can combine with any learning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules
of thumb
→ shift in mind set — goal now is merely to find classifiers
barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc.
• has been extended to learning problems well beyond
binary classification
Caveats
[figure: two scatter plots of test error (%) on benchmark datasets, each with C4.5 error on one axis (0–30 on both axes)]
Multiclass Problems
[with Freund]
• say $y \in Y = \{1, \ldots, k\}$
• direct approach (AdaBoost.M1):
  $$h_t : X \to Y$$
  $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}$$
  $$H_{\text{final}}(x) = \arg\max_{y \in Y} \sum_{t: h_t(x) = y} \alpha_t$$
• can prove:
  $$\text{training error}(H_{\text{final}}) \le \frac{k}{2} \cdot \prod_t Z_t$$
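A minimal sketch of the AdaBoost.M1 update and arg-max combination rule above; the weak_learner callable returning a multiclass predictor is an assumed black box:

import numpy as np

def adaboost_m1(X, y, weak_learner, T):
    # y[i] in {1, ..., k}; weak_learner(X, y, D) returns h with h(X) -> labels.
    m = len(y)
    D = np.full(m, 1.0 / m)
    ensemble = []
    for t in range(T):
        h = weak_learner(X, y, D)
        wrong = h(X) != y
        eps = D[wrong].sum()
        if eps >= 0.5:                          # M1 needs eps_t < 1/2 even when k > 2
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.where(wrong, np.exp(alpha), np.exp(-alpha))
        D /= D.sum()                            # Z_t
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X, k):
    # H_final(x) = argmax_y sum of alpha_t over rounds t with h_t(x) = y
    votes = np.zeros((len(X), k))
    for alpha, h in ensemble:
        votes[np.arange(len(X)), h(X) - 1] += alpha
    return votes.argmax(axis=1) + 1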
Confidence-rated Predictions
• same update rule with real-valued $h_t$:
  $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \exp(-\alpha_t\, y_i\, h_t(x_i))$$
  and identical rule for combining weak classifiers
• question: how to choose $\alpha_t$ and $h_t$ on each round
Confidence-rated Predictions (cont.)
• saw earlier:
  $$\text{training error}(H_{\text{final}}) \le \prod_t Z_t = \frac{1}{m}\sum_i \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)$$
• idea: on each round, choose $\alpha_t$ (and $h_t$) to minimize $Z_t$
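With real-valued $h_t$, $Z_t(\alpha)$ is a convex function of $\alpha$, so the greedy choice is easy to compute numerically; a sketch assuming scipy, with made-up confidence-rated predictions:

import numpy as np
from scipy.optimize import minimize_scalar

def best_alpha(D, y, h_vals):
    # Minimize Z_t(alpha) = sum_i D(i) exp(-alpha y_i h_t(x_i)),
    # convex in alpha for any fixed confidence-rated h_t.
    Z = lambda a: np.dot(D, np.exp(-a * y * h_vals))
    res = minimize_scalar(Z, bounds=(0.0, 20.0), method="bounded")
    return res.x, res.fun

D = np.full(4, 0.25)
y = np.array([+1, +1, -1, -1])
h_vals = np.array([0.9, 0.2, -0.7, 0.1])   # sign = predicted label, |.| = confidence
alpha, Z = best_alpha(D, y, h_vals)
print(alpha, Z)   # for {-1,+1}-valued h_t this recovers alpha_t = 1/2 ln((1-eps_t)/eps_t)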
[figure: % error (10–50) vs. rounds on the HMIHY task]
Weak Classifiers
rnd   term
2     card
3     my home
4     person ? person
5     code
6     I
More Weak Classifiers
rnd   term
7     time
8     wrong number
9     how
10    call
11    seven
12    trying to
13    and
14    third
15    to
16    for
17    charges
18    dial
19    just
(per-category columns AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT not recoverable)
Finding Outliers
examples with most weight are often outliers (mislabeled and/or ambiguous)
• I’m trying to make a credit card call (Collect)
• hello (Rate)
• yes I’d like to make a long distance collect call please (CallingCard)
• calling card please (Collect)
• yeah I’d like to use my calling card number (Collect)
• can I get a collect call (CallingCard)
• yes I would like to make a long distant telephone call and have the charges billed to another number (CallingCard DialForMe)
• yeah I can not stand it this morning I did oversea call is so bad (BillingCredit)
• yeah special offers going on for long distance (AttService Rate)
• mister allen please william allen (PersonToPerson)
• yes ma’am I I’m trying to make a long distance call to a non dialable point in san miguel philippines (AttService Other)
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
[figure: system architecture — automatic speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → text-to-speech]
[figure: % error rate (10–80) vs. # training examples (100–10,000), comparing data+knowledge, knowledge only, and data only]
Results: Helpdesk
[figure: classification accuracy (45–90%) vs. # training examples (0–2500); data + knowledge above data alone, with knowledge alone lowest]
Problem: Labels are Expensive
• idea:
  • use selective sampling to choose which examples to label
  • focus on least confident examples [Lewis & Gale]
  • for boosting, use (absolute) margin as natural confidence measure [Abe & Mamitsuka]
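A sketch of the resulting selection rule: score the unlabeled pool by absolute normalized margin and request labels for the least confident examples (the pool predictions here are made up):

import numpy as np

def select_queries(alphas, pool_preds, n_queries):
    # pool_preds[t, i] = h_t(x_i) in {-1, +1} on unlabeled example i.
    # Smallest |margin| = least confident vote = most informative to label.
    alphas = np.asarray(alphas, dtype=float)
    margins = np.abs(alphas @ pool_preds) / alphas.sum()
    return np.argsort(margins)[:n_queries]

alphas = [0.42, 0.65, 0.92]
pool_preds = np.array([[+1, -1, +1, +1, -1],
                       [+1, -1, -1, +1, +1],
                       [+1, +1, -1, -1, +1]])
print(select_queries(alphas, pool_preds, 2))   # indices of the 2 closest-to-tie examples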
Labeling Scheme
[figure: % error rate (24–32) vs. # labeled examples (0–40,000)]
[figure: % error rate (10–20) vs. # labeled examples (0–16,000)]