
Introduction to Statistical Learning Theory

Olivier Bousquet¹, Stéphane Boucheron², and Gábor Lugosi³

¹ Max-Planck Institute for Biological Cybernetics, Spemannstr. 38, D-72076 Tübingen, Germany
olivier.bousquet@m4x.org
WWW home page: http://www.kyb.mpg.de/~bousquet
² Université de Paris-Sud, Laboratoire d'Informatique, Bâtiment 490, F-91405 Orsay Cedex, France
stephane.boucheron@lri.fr
WWW home page: http://www.lri.fr/~bouchero
³ Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
lugosi@upf.es
WWW home page: http://www.econ.upf.es/~lugosi
Abstract. The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.
1 Introduction
The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is of gaining knowledge, making predictions, making decisions or constructing models from a set of data. This is studied in a statistical framework, that is, there are assumptions of a statistical nature about the underlying phenomena (in the way the data is generated).
As a motivation for the need of such a theory, let us just quote V. Vapnik (Vapnik, [1]): Nothing is more practical than a good theory.
Indeed, a theory of inference should be able to give a formal definition of words like learning, generalization, and overfitting, and also to characterize the performance of learning algorithms so that, ultimately, it may help design better learning algorithms.
There are thus two goals: make things more precise and derive new or improved algorithms.
1.1 Learning and Inference
What is under study here is the process of inductive inference, which can roughly be summarized as the following steps:
1. Observe a phenomenon
2. Construct a model of that phenomenon
3. Make predictions using this model
Of course, this definition is very general and could be taken more or less as the goal of Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.
In this tutorial we consider a special case of the above process, namely the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or −1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the label of unseen instances.
Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may not be the best thing to do, as it would lead to poor performance on unseen instances (this is usually referred to as overfitting).
Fig. 1. Trade-off between fit and complexity.
The general idea behind the design of learning algorithms is thus to look for regularities (in a sense to be defined later) in the observed phenomenon (i.e. the training data). These can then be generalized from the observed past to the future. Typically, one would look, in a collection of possible models, for one which fits the data well, but at the same time is as simple as possible (see Figure 1). This immediately raises the question of how to measure and quantify the simplicity of a model (i.e. a {−1, +1}-valued function).
It turns out that there are many ways to do so, but no best one. For example, in Physics, people tend to prefer models which have a small number of constants and correspond to simple mathematical formulas. Often, the length of description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.
This lack of a universally best choice can actually be formalized in what is called the No Free Lunch theorem, which in essence says that, if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize and there is thus no better algorithm (any algorithm would be beaten by another one on some phenomenon).
Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:
Generalization = Data + Knowledge
1.2 Assumptions
We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary.
At the core of the theory is a probabilistic model of the phenomenon (or data generation process). Within this model, the relationship between past and future observations is that they are both sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution).
An immediate consequence of this very general setting is that one can construct algorithms (e.g. k-nearest neighbors with appropriate k) that are consistent, which means that, as one gets more and more data, the predictions of the algorithm are closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have an arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B.
Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning algorithm encodes specific
knowledge (or a specific assumption about what the optimal classifier looks like), and works best when this assumption is satisfied by the problem to which it is applied.

Bibliographical remarks. Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen, and Stone [3], Devroye, Györfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6], Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9], McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and Chervonenkis [15].
2 Formalization

We consider an input space X and an output space Y. Since we restrict ourselves to binary classification, we choose Y = {−1, 1}. Formally, we assume that the pairs (X, Y) ∈ X × Y are random variables distributed according to an unknown distribution P. We observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P, and the goal is to construct a function g : X → Y which predicts Y from X.
We need a criterion to choose this function g. This criterion is a low probability of error P(g(X) ≠ Y). We thus define the risk of g as
\[ R(g) = P(g(X) \neq Y) = \mathbb{E}[\mathbf{1}_{g(X) \neq Y}] . \]
Notice that P can be decomposed as P_X × P(Y|X). We introduce the regression function η(x) = E[Y|X = x] = 2P[Y = 1|X = x] − 1 and the target function (or Bayes classifier) t(x) = sgn η(x). This function achieves the minimum risk over all possible measurable functions:
\[ R(t) = \inf_g R(g) . \]
We will denote the value R(t) by R*, called the Bayes risk. In the deterministic case, one has Y = t(X) almost surely (P[Y = 1|X] ∈ {0, 1}) and R* = 0. In the general case we can define the noise level as s(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 (s(X) = 0 almost surely in the deterministic case), and this gives R* = E[s(X)].
Our goal is thus to identify this function t, but since P is unknown we cannot directly measure the risk, and we also cannot know directly the value of t at the data points. We can only measure the agreement of a candidate function with the data. This is called the empirical risk:
\[ R_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{g(X_i) \neq Y_i} . \]
It is common to use this quantity as a criterion to select an estimate of t.
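As a minimal numerical illustration of these definitions, the sketch below uses a synthetic one-dimensional distribution; the regression function eta, the candidate classifier g and all sample sizes are hypothetical choices made only to make R(g), R_n(g) and R* tangible.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # hypothetical regression function eta(x) = E[Y | X = x]
    return np.clip(np.sin(3 * x), -0.9, 0.9)

def sample(n):
    x = rng.uniform(-1, 1, n)
    # P[Y = 1 | X = x] = (1 + eta(x)) / 2
    y = np.where(rng.uniform(size=n) < (1 + eta(x)) / 2, 1, -1)
    return x, y

g = lambda x: np.where(x > 0, 1, -1)          # a fixed candidate classifier
t = lambda x: np.where(eta(x) >= 0, 1, -1)    # Bayes classifier t = sgn(eta)

x_train, y_train = sample(50)
emp_risk = np.mean(g(x_train) != y_train)     # R_n(g), computed from the data

x_mc, y_mc = sample(200_000)                  # large sample as a proxy for P
true_risk = np.mean(g(x_mc) != y_mc)          # ~ R(g)
bayes_risk = np.mean((1 - np.abs(eta(x_mc))) / 2)   # ~ R* = E[s(X)]
print(emp_risk, true_risk, bayes_risk, np.mean(t(x_mc) != y_mc))
```

The last printed value, the Monte Carlo risk of t, should be close to the Bayes risk, while R_n(g) fluctuates around R(g) from one training sample to the next.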
2.1 Algorithms

Now that the goal is clearly specified, we review the common strategies to (approximately) achieve it. We denote by g_n the function returned by the algorithm.
Because one cannot compute R(g) but only approximate it by R_n(g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i and R_n(g_n) = 0), but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = −Y, so that R(g_n) = 1.⁴ So one would have minimum empirical risk but maximum risk.
It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined). The first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for 'complicated' functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:
\[ g_n = \arg\min_{g \in G} R_n(g) . \]
Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.
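For a finite model G, empirical risk minimization can be carried out by exhaustive search. The sketch below does this for a hypothetical class of one-dimensional decision stumps g_s(x) = sign(x − s); the class and the threshold grid are illustrative assumptions, not part of the tutorial's setting.

```python
import numpy as np

def erm(x, y, thresholds=np.linspace(-1, 1, 201)):
    """Brute-force ERM over the finite model G = {g_s : s in thresholds},
    where g_s(x) = sign(x - s). Returns the chosen threshold and its R_n."""
    best_s, best_risk = None, np.inf
    for s in thresholds:
        risk = np.mean(np.where(x > s, 1, -1) != y)   # empirical risk R_n(g_s)
        if risk < best_risk:
            best_s, best_risk = s, risk
    return best_s, best_risk
```

For instance, erm(x_train, y_train) applied to the sample of the previous sketch returns the stump with the smallest empirical risk on that sample.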
Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:
\[ g_n = \arg\min_{g \in G_d, \, d \in \mathbb{N}} R_n(g) + \mathrm{pen}(d, n) . \]
The penalty pen(d, n) gives preference to models where the estimation error is small and measures the size or capacity of the model.

Regularization. Another, usually easier to implement, approach consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ‖g‖. Then one has to minimize the regularized empirical risk:
\[ g_n = \arg\min_{g \in G} R_n(g) + \lambda \|g\|^2 . \]

⁴ Strictly speaking this is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.
Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows one to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem and most often one uses extra validation data for this task.
Most existing (and successful) methods can be thought of as regularization methods.

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be 'normalized', i.e. when it corresponds to some probability distribution over G.
Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer −log π(g).⁵ Reciprocally, from a regularizer of the form ‖g‖², if there exists a measure μ on G such that ∫ e^{−λ‖g‖²} dμ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in R^d going through the origin, G can be identified with R^d and, taking μ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on R^d as a prior.⁶
This type of normalized regularizer, or prior, can be used to construct another probability distribution ρ on G (usually called the posterior), as
\[ \rho(g) = \frac{e^{-\gamma R_n(g)} \pi(g)}{Z(\gamma)} , \]
where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.
There are several ways in which this ρ can be used. If we take the function maximizing it, we recover regularization, as
\[ \arg\max_g \rho(g) = \arg\min_g R_n(g) - \frac{1}{\gamma} \log \pi(g) , \]
where the regularizer is −(1/γ) log π(g).⁷
Also, ρ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to ρ and outputs g(x). This procedure is usually called Gibbs classification.
Another way in which the distribution ρ constructed above can be used is by taking the expected prediction of the functions in G:
\[ g_n(x) = \mathrm{sgn}(\mathbb{E}_\rho[g(x)]) . \]

⁵ This is fine when G is countable. In the continuous case, one has to consider the density associated to π. We omit these details.
⁶ Generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care. One can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space.
⁷ Note that minimizing γ R_n(g) − log π(g) is equivalent to minimizing R_n(g) − (1/γ) log π(g).
This is typically called Bayesian averaging.
At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand, and there is no universally best choice.
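For a finite class G = {g_1, ..., g_N}, the three uses of the posterior ρ described above can be written in a few lines. The sketch below assumes the empirical risks R_n(g_i), a prior vector π and the ±1 predictions of each g_i on the inputs of interest are available; all names and array shapes are illustrative.

```python
import numpy as np

def posterior(emp_risks, prior, gamma):
    """rho(g) proportional to exp(-gamma * R_n(g)) * pi(g) over a finite class."""
    w = np.asarray(prior) * np.exp(-gamma * np.asarray(emp_risks))
    return w / w.sum()                                  # division by Z(gamma)

def gibbs_predict(weights, preds, rng):
    """Gibbs classification: sample one g ~ rho and output its prediction."""
    return preds[rng.choice(len(weights), p=weights)]

def bayes_average_predict(weights, preds):
    """Bayesian averaging: sign of the rho-expected prediction."""
    return np.where(weights @ np.asarray(preds) >= 0, 1, -1)

# preds has shape (N, m): row i holds g_i's +/-1 predictions on m inputs
```

Taking the single g_i with the largest weight instead recovers (normalized) regularization, as in the arg max identity above.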
2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data) and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.
Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g∈G} R(g), to write
\[ R(g_n) - R^* = [R(g^*) - R^*] + [R(g_n) - R(g^*)] . \]
The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.
Estimating the approximation error is usually hard since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s.
It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error⁸ can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error.
Another possible decomposition of the risk is the following:
\[ R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)] . \]
In this case, one estimates the risk by its empirical counterpart and some quantity which approximates (or upper bounds) R(g_n) − R_n(g_n).
To summarize, we write the three types of results we may be interested in.

⁸ For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case for example of Structural Risk Minimization or Regularization based algorithms.
– Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.
– Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how "optimal" the algorithm is, given the model it uses.
– Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.
3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = E[1_{g_n(X)≠Y}] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for
\[ \mathbb{P}[R(g_n) - R_n(g_n) > \varepsilon] . \]
For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class
\[ F = \{ f : (x, y) \mapsto \mathbf{1}_{g(x) \neq y} : g \in G \} . \qquad (1) \]
Notice that G contains functions with range in {−1, 1} while F contains non-negative functions with range in {0, 1}. In the remainder of the tutorial, we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).
We use the shorthand notation P f = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as
\[ P f_n - P_n f_n . \qquad (2) \]
An empirical process is a collection of random variables indexed by a class of functions, and such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data):
\[ \{ P f - P_n f \}_{f \in F} . \]
One of the most studied quantities associated to empirical processes is their supremum:
\[ \sup_{f \in F} P f - P_n f . \]
It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.
3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:
\[ R(g) - R_n(g) = \mathbb{E}[f(Z)] - \frac{1}{n} \sum_{i=1}^n f(Z_i) . \]
It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that
\[ \mathbb{P}\Big[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n f(Z_i) - \mathbb{E}[f(Z)] = 0 \Big] = 1 . \]
This indicates that with enough samples, the empirical risk of a function is a good approximation to its true risk.
It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have
\[ \mathbb{P}\Big[ \Big| \frac{1}{n} \sum_{i=1}^n f(Z_i) - \mathbb{E}[f(Z)] \Big| > \varepsilon \Big] \le 2 \exp\Big( - \frac{2 n \varepsilon^2}{(b - a)^2} \Big) . \]

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then
\[ \mathbb{P}\Big[ | P_n f - P f | > (b - a) \sqrt{\frac{\log(2/\delta)}{2n}} \Big] \le \delta , \]
or (by inversion, see Appendix A) with probability at least 1 − δ,
\[ | P_n f - P f | \le (b - a) \sqrt{\frac{\log(2/\delta)}{2n}} . \]
Applying this to f(Z) = 1_{g(X)≠Y}, we get that for any g, and any δ > 0, with probability at least 1 − δ,
\[ R(g) \le R_n(g) + \sqrt{\frac{\log(2/\delta)}{2n}} . \qquad (3) \]
Notice that one has to consider a fixed function g and the probability is with respect to the sampling of the data. If the function depends on the data this does not apply!
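The inequality and its inversion are easy to check by simulation for a fixed f. In the sketch below, f(Z) is a hypothetical Bernoulli variable (so P f = p and b − a = 1); the observed failure rate of the bound stays below δ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta, trials = 100, 0.3, 0.05, 20_000

# inversion of Hoeffding's inequality for a fixed f with range [0, 1]
eps = np.sqrt(np.log(2 / delta) / (2 * n))

samples = rng.binomial(1, p, size=(trials, n))        # f(Z_i) ~ Bernoulli(p)
deviations = np.abs(samples.mean(axis=1) - p)         # |P_n f - P f| per sample
print("bound:", eps, "failure rate:", np.mean(deviations > eps), "<=", delta)
```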
3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which P f − P_n f ≤ √(log(2/δ)/(2n)) (and this set of samples has measure P[S] ≥ 1 − δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.
Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {−1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that
\[ P f - P_n f = 1 . \]
To see this, take the function which is f(X_i) = Y_i on the data and f(X) = −Y everywhere else. This does not contradict Hoeffding's inequality but shows that it does not yield what we need.
Fig. 2. Convergence of the empirical risk to the true risk over the class of functions.
Figure 2 illustrates the above argumentation. The horizontal axis corresponds to the functions in the class. The two curves represent the true risk and the empirical risk (for some training sample) of these functions. The true risk is fixed, while for each different sample, the empirical risk will be a different curve. If we observe a fixed function g and take several different samples, the point on the empirical curve will fluctuate around the true risk with fluctuations controlled by Hoeffding's inequality. However, for a fixed sample, if the class is big enough, one can find, somewhere along the axis, a function for which the difference between the two curves will be very large.
3.4 Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose. The idea is to consider uniform deviations:
\[ R(f_n) - R_n(f_n) \le \sup_{f \in F} (R(f) - R_n(f)) . \qquad (4) \]
In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.
Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define
\[ C_i = \{ (x_1, y_1), \ldots, (x_n, y_n) : P f_i - P_n f_i > \varepsilon \} . \]
This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,
\[ \mathbb{P}[C_i] \le \delta . \]
We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)
\[ \mathbb{P}[C_1 \cup C_2] \le \mathbb{P}[C_1] + \mathbb{P}[C_2] \le 2\delta . \]
More generally, if we have N functions in our class, we can write
\[ \mathbb{P}[C_1 \cup \cdots \cup C_N] \le \sum_{i=1}^N \mathbb{P}[C_i] . \]
As a result we obtain
\[ \mathbb{P}[\exists f \in \{f_1, \ldots, f_N\} : P f - P_n f > \varepsilon] \le \sum_{i=1}^N \mathbb{P}[P f_i - P_n f_i > \varepsilon] \le N \exp(-2 n \varepsilon^2) . \]
Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 − δ,
\[ \forall g \in G, \quad R(g) \le R_n(g) + \sqrt{\frac{\log N + \log(1/\delta)}{2n}} . \]
This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.
Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].
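The resulting deviation term is straightforward to evaluate; the sketch below (with arbitrary illustrative values of n, N and δ) shows how mildly it grows with the size of the class.

```python
import numpy as np

def finite_class_bound(n, N, delta):
    """Uniform deviation term sqrt((log N + log(1/delta)) / (2 n)) for |G| = N."""
    return np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

for N in (10, 1000, 10**6):                 # the penalty is only logarithmic in N
    print(N, finite_class_bound(n=1000, N=N, delta=0.05))
```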
3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality
\[ R(g_n) \le R_n(g_n) + \sup_{g \in G} (R(g) - R_n(g)) , \]
which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,
\[ R_n(g^*) - R_n(g_n) \ge 0 . \]
Thus we obtain
\[ R(g_n) = R(g_n) - R(g^*) + R(g^*) \le R_n(g^*) - R_n(g_n) + R(g_n) - R(g^*) + R(g^*) \le 2 \sup_{g \in G} | R(g) - R_n(g) | + R(g^*) . \]
We obtain that with probability at least 1 − δ,
\[ R(g_n) \le R(g^*) + 2 \sqrt{\frac{\log N + \log(2/\delta)}{2n}} . \]
We notice that in the right hand side, both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.
3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far.
– Inference requires assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).
– The error bounds are valid with respect to the repeated sampling of training sets.
– For a fixed function g, for most of the samples,
\[ R(g) - R_n(g) \approx 1/\sqrt{n} . \]
– For most of the samples, if |G| = N,
\[ \sup_{g \in G} R(g) - R_n(g) \approx \sqrt{\log N / n} . \]
The extra variability comes from the fact that the chosen g_n changes with the data.
So the result we have obtained so far is that, with high probability, for a finite class of size N,
\[ \sup_{g \in G} (R(g) - R_n(g)) \le \sqrt{\frac{\log N + \log(1/\delta)}{2n}} . \]
There are several things that can be improved:
– Hoeffding's inequality only uses the boundedness of the functions, not their variance.
– The union bound is as bad as if all the functions in the class were independent (i.e. if f_1(Z) and f_2(Z) were independent).
– The supremum over G of R(g) − R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) − R_n(g_n) by the supremum might be loose.
4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows us to extend the previous results to the (countably) infinite case.
Recall that by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),
\[ \mathbb{P}\Big[ P f - P_n f > \sqrt{\frac{\log(1/\delta(f))}{2n}} \Big] \le \delta(f) . \]
Hence, if we have a countable set F, the union bound immediately yields
\[ \mathbb{P}\Big[ \exists f \in F : P f - P_n f > \sqrt{\frac{\log(1/\delta(f))}{2n}} \Big] \le \sum_{f \in F} \delta(f) . \]
Choosing δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1, this makes the right-hand side equal to δ and we get the following result. With probability at least 1 − δ,
\[ \forall f \in F, \quad P f \le P_n f + \sqrt{\frac{\log(1/p(f)) + \log(1/\delta)}{2n}} . \]
We notice that if F is finite (with size N), taking a uniform p gives the log N as before.
Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well-chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).
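The effect of the weights p(f) can be made concrete. In the sketch below the class is indexed by k = 1, 2, ... and p(f_k) = 2^{-k} is a hypothetical prior that favors small k; functions with larger prior weight receive a smaller deviation term.

```python
import numpy as np

def weighted_union_bound(n, p_f, delta):
    """Deviation term sqrt((log(1/p(f)) + log(1/delta)) / (2 n)) for a countable class."""
    return np.sqrt((np.log(1 / p_f) + np.log(1 / delta)) / (2 * n))

for k in (1, 5, 20):                        # p(f_k) = 2**(-k), which sums to at most one
    print(k, weighted_union_bound(n=1000, p_f=2.0 ** (-k), delta=0.05))
```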
4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider
\[ F_{z_1, \ldots, z_n} = \{ (f(z_1), \ldots, f(z_n)) : f \in F \} . \]
The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:
\[ S_F(n) = \sup_{(z_1, \ldots, z_n)} | F_{z_1, \ldots, z_n} | . \]

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G and notice that S_F(n) = S_G(n).
It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 − δ,
\[ \forall g \in G, \quad R(g) \le R_n(g) + 2 \sqrt{2 \, \frac{\log S_G(2n) + \log(2/\delta)}{n}} . \]
Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {−1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that
\[ S_G(n) = 2^n . \]

In other words, the VC dimension of a class G is the size of the largest set that it can shatter.
In order to illustrate this definition, we give some examples. The first one is the set of half-planes in R^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.

It is interesting to notice that the number of parameters needed to define half-spaces in R^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only:
\[ \{ \mathrm{sgn}(\sin(t x)) : t \in \mathbb{R} \} , \]
which actually has infinite VC dimension (this is an exercise left to the reader).
Fig. 4. VC dimension of sinusoids.

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class G has VC dimension h, it entails that for all n ≤ h, S_G(n) = 2^n, and S_G(n) < 2^n otherwise. This seems of little use, but actually an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5: the growth function, which is exponential (its logarithm is linear) up until the VC dimension, becomes polynomial afterwards.

Fig. 5. Typical behavior of the log growth function.

This behavior is captured in the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
\[ S_G(n) \le \sum_{i=0}^{h} \binom{n}{i} , \]
and for all n ≥ h,
\[ S_G(n) \le \Big( \frac{e n}{h} \Big)^h . \]

Using this lemma along with Theorem 2, we immediately obtain that if G has VC dimension h, with probability at least 1 − δ,
\[ \forall g \in G, \quad R(g) \le R_n(g) + 2 \sqrt{2 \, \frac{h \log \frac{2 e n}{h} + \log(2/\delta)}{n}} . \]
What is important to recall from this result is that the difference between the true and empirical risk is at most of order
\[ \sqrt{\frac{h \log n}{n}} . \]
An interpretation of the VC dimension and growth functions is that they measure the effective size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just 'count' the number of functions in the class but depends on the geometry of the class (rather, of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.
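For very simple classes the projection sizes, and hence the VC dimension, can be checked by brute force. The sketch below projects the class of half-lines on R, {x ↦ sign(x − s) : s ∈ R}, onto a given sample by scanning a finite grid of thresholds (so it only gives a lower estimate of |G_{z_1,...,z_n}|); the class, the grid and the sample points are illustrative assumptions.

```python
import numpy as np

def projection_size(points, classify, grid):
    """Number of distinct labelings of `points` realized by the class over `grid`,
    i.e. a brute-force (lower) estimate of |G_{z_1,...,z_n}|."""
    return len({tuple(classify(points, theta)) for theta in grid})

stump = lambda x, s: np.where(x > s, 1, -1)   # half-lines on R: VC dimension 1
grid = np.linspace(-2, 2, 2001)

print(projection_size(np.array([0.5]), stump, grid))        # 2 = 2**1: one point is shattered
print(projection_size(np.array([0.2, 0.8]), stump, grid))   # 3 < 2**2: no pair is shattered
```

This is consistent with Lemma 1: below n = h the projection size is 2^n, and above it grows only polynomially (here S(n) = n + 1).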
4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient of the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called the 'virtual' or 'ghost' sample.
We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0, such that nt² ≥ 2,
\[ \mathbb{P}\Big[ \sup_{f \in F} (P - P_n) f \ge t \Big] \le 2 \, \mathbb{P}\Big[ \sup_{f \in F} (P'_n - P_n) f \ge t/2 \Big] . \]

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with ∧ denoting the conjunction of two events)
\[ \mathbf{1}_{(P - P_n) f_n > t} \, \mathbf{1}_{(P - P'_n) f_n < t/2} = \mathbf{1}_{(P - P_n) f_n > t \,\wedge\, (P'_n - P) f_n \ge -t/2} \le \mathbf{1}_{(P'_n - P_n) f_n > t/2} . \]
Taking expectations with respect to the second sample gives
\[ \mathbf{1}_{(P - P_n) f_n > t} \, \mathbb{P}'[(P - P'_n) f_n < t/2] \le \mathbb{P}'[(P'_n - P_n) f_n > t/2] . \]
By Chebyshev's inequality (see Appendix A),
\[ \mathbb{P}'[(P - P'_n) f_n \ge t/2] \le \frac{4 \, \mathrm{Var} f_n}{n t^2} \le \frac{1}{n t^2} . \]
Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence
\[ \mathbf{1}_{(P - P_n) f_n > t} \Big( 1 - \frac{1}{n t^2} \Big) \le \mathbb{P}'[(P'_n - P_n) f_n > t/2] . \]
Taking the expectation with respect to the first sample gives the result.

This lemma allows us to replace the expectation P f by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,
\[ F_{Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n} , \]
which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient that is needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:
\[ \mathbb{P}[P'_n f - P_n f > t] \le e^{- n t^2 / 2} . \]
We now just have to put the pieces together:
\[ \mathbb{P}\Big[ \sup_{f \in F} (P - P_n) f \ge t \Big] \le 2 \, \mathbb{P}\Big[ \sup_{f \in F} (P'_n - P_n) f \ge t/2 \Big] = 2 \, \mathbb{P}\Big[ \sup_{f \in F_{Z_1, \ldots, Z_n, Z'_1, \ldots, Z'_n}} (P'_n - P_n) f \ge t/2 \Big] \le 2 \, S_F(2n) \, \mathbb{P}[(P'_n - P_n) f \ge t/2] \le 4 \, S_F(2n) \, e^{- n t^2 / 8} . \]
Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows one to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions.
We now show how to modify the proof above to get a distribution-dependent result. We use the notation N(F, z_1^n) := |F_{z_1, ..., z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as
\[ H_F(n) = \log \mathbb{E}[N(F, Z_1^n)] . \]
Theorem 3. For any δ > 0, with probability at least 1 − δ,
\[ \forall g \in G, \quad R(g) \le R_n(g) + 2 \sqrt{2 \, \frac{H_G(2n) + \log(2/\delta)}{n}} . \]

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity
\[ I = \mathbb{P}\Big[ \sup_{f \in F_{Z_1^n, Z_1'^n}} (P'_n - P_n) f \ge t/2 \Big] . \]
Let σ_1, ..., σ_n be n independent random variables such that P(σ_i = 1) = P(σ_i = −1) = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n − P_n) f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have
\[ I \le \mathbb{E}\Big[ \mathbb{P}_\sigma \Big[ \sup_{f \in F_{Z_1^n, Z_1'^n}} \frac{1}{n} \sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2 \Big] \Big] , \]
and the union bound leads to
\[ I \le \mathbb{E}\Big[ N(F, Z_1^n, Z_1'^n) \max_f \mathbb{P}_\sigma \Big[ \frac{1}{n} \sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i)) \ge t/2 \Big] \Big] . \]
Since σ_i (f(Z'_i) − f(Z_i)) ∈ [−1, 1], Hoeffding's inequality finally gives
\[ I \le \mathbb{E}[N(F, Z, Z')] \, e^{- n t^2 / 8} . \]
The rest of the proof is as before.
5 Capacity Measures

We have seen so far three measures of capacity or size of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are however other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.

5.1 Covering Numbers

We start by endowing the function class F with the following (random) metric:
\[ d_n(f, f') = \frac{1}{n} \, | \{ f(Z_i) \neq f'(Z_i) : i = 1, \ldots, n \} | . \]
This is the normalized Hamming distance of the 'projections' on the sample.
Given such a metric, we say that a set f_1, ..., f_N covers F at radius ε if
\[ F \subset \bigcup_{i=1}^N B(f_i, \varepsilon) . \]
We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter if we apply this definition to the original class G or the loss class F, since N(F, ε, n) = N(G, ε, n).
The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{−d}.
When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G). This again allows one to use the finite union bound, provided we can relate the behavior of all functions in G to that of the functions in the cover. A typical result, which we provide without proof, is the following.

Theorem 4. For any t > 0,
\[ \mathbb{P}[\exists g \in G : R(g) > R_n(g) + t] \le 8 \, \mathbb{E}[N(G, t, n)] \, e^{- n t^2 / 128} . \]

Covering numbers can also be defined for classes of real-valued functions.
We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{Z_1^n}| = N(G, Z_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,
\[ N(G, \varepsilon, n) \le C h (4e)^h \varepsilon^{-h} . \]

The interest of this result is that the upper bound does not depend on the sample size n.
The covering number bound is a generalization of the VC entropy bound, where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).
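Given the 0/1 projections of the functions on the sample, a cover at radius ε in the metric d_n can be constructed greedily, which yields an upper bound on N(F, ε, n). The sketch below assumes `projections` is an (N, n) array of loss values f(Z_i); the greedy strategy, with centers taken from the class itself, is one simple choice among many.

```python
import numpy as np

def greedy_cover_size(projections, eps):
    """Size of a greedy eps-cover of the projected class under the normalized
    Hamming distance d_n; an upper bound on the covering number N(F, eps, n)."""
    remaining = list(range(len(projections)))
    n_centers = 0
    while remaining:
        center = projections[remaining[0]]            # any still-uncovered function
        dists = np.mean(projections[remaining] != center, axis=1)
        remaining = [idx for idx, d in zip(remaining, dists) if d > eps]
        n_centers += 1
    return n_centers
```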
5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent {−1, 1}-valued random variables with probability 1/2 of taking either value.
For convenience we introduce the following notation (signed empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables (i.e. conditionally on the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as
\[ \mathcal{R}(F) = \mathbb{E} \sup_{f \in F} R_n f , \]
and the conditional Rademacher average is defined as
\[ \mathcal{R}_n(F) = \mathbb{E}_\sigma \sup_{f \in F} R_n f . \]

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 − δ,
\[ \forall f \in F, \quad P f \le P_n f + 2 \mathcal{R}(F) + \sqrt{\frac{\log(1/\delta)}{2n}} , \]
and also, with probability at least 1 − δ,
\[ \forall f \in F, \quad P f \le P_n f + 2 \mathcal{R}_n(F) + \sqrt{\frac{2 \log(2/\delta)}{n}} . \]

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.
The proof of the above result requires a powerful tool called a concentration inequality for empirical processes.
Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume for all i = 1, ..., n,
\[ \sup_{z_1, \ldots, z_n, z'_i} | F(z_1, \ldots, z_i, \ldots, z_n) - F(z_1, \ldots, z'_i, \ldots, z_n) | \le c , \]
then for all ε > 0,
\[ \mathbb{P}[\, | F - \mathbb{E}[F] | > \varepsilon \,] \le 2 \exp\Big( - \frac{2 \varepsilon^2}{n c^2} \Big) . \]

The meaning of this result is thus that, as soon as one has a function of n independent random variables which is such that its variation is bounded when one variable is modified, the function will satisfy a Hoeffding-like inequality.
Proof of Theorem 5. To prove Theorem 5, we will follow three steps:
1. Use concentration to relate sup_{f∈F} P f − P_n f to its expectation,
2. use symmetrization to relate the expectation to the Rademacher average,
3. use concentration again to relate the Rademacher average to the conditional one.
We first show that McDiarmid's inequality can be applied to sup_{f∈F} P f − P_n f. We denote temporarily by P_n^i the empirical measure obtained by modifying one element (e.g. Z_i is replaced by Z'_i) of the sample. It is easy to check that the following holds:
\[ \Big| \sup_{f \in F} (P f - P_n f) - \sup_{f \in F} (P f - P_n^i f) \Big| \le \sup_{f \in F} | P_n^i f - P_n f | . \]
Since f ∈ {0, 1} we obtain
\[ | P_n^i f - P_n f | = \frac{1}{n} | f(Z'_i) - f(Z_i) | \le \frac{1}{n} , \]
and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.
We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class F,
\[ \mathbb{E} \sup_{f \in F} P f - P_n f \le 2 \, \mathbb{E} \sup_{f \in F} R_n f , \]
and
\[ \mathbb{E} \sup_{f \in F} | P f - P_n f | \ge \frac{1}{2} \, \mathbb{E} \sup_{f \in F} | R_n f | - \frac{1}{2 \sqrt{n}} . \]

Proof. We only prove the first part. We introduce a ghost sample and its corresponding measure P'_n. We successively use the fact that E P'_n f = P f and that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):
\[ \mathbb{E} \sup_{f \in F} P f - P_n f = \mathbb{E} \sup_{f \in F} \mathbb{E}'[P'_n f] - P_n f \le \mathbb{E} \sup_{f \in F} P'_n f - P_n f = \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i)) \le \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z'_i) + \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n -\sigma_i f(Z_i) = 2 \, \mathbb{E} \sup_{f \in F} R_n f , \]
where the third step uses the fact that f(Z_i) − f(Z'_i) and σ_i (f(Z_i) − f(Z'_i)) have the same distribution, and the last step uses the fact that σ_i f(Z'_i) and −σ_i f(Z_i) have the same distribution.

The above already establishes the first part of Theorem 5. For the second part, we need to use concentration again. For this we apply McDiarmid's inequality to the following functional:
\[ F(Z_1, \ldots, Z_n) = \mathcal{R}_n(F) . \]
It is easy to check that F satisfies McDiarmid's assumptions with c = 1/n. As a result, E[F] = R(F) can be sharply estimated by F = R_n(F).

Loss Class and Initial Class. In order to make use of Theorem 5 we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and −σ_i Y_i have the same distribution:
\[ \mathcal{R}(F) = \mathbb{E} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{g(X_i) \neq Y_i} = \mathbb{E} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \frac{1}{2} (1 - Y_i g(X_i)) = \frac{1}{2} \, \mathbb{E} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n -\sigma_i Y_i g(X_i) = \frac{1}{2} \mathcal{R}(G) . \]
Notice that the same is valid for conditional Rademacher averages, so that we obtain that with probability at least 1 − δ,
\[ \forall g \in G, \quad R(g) \le R_n(g) + \mathcal{R}_n(G) + \sqrt{\frac{2 \log(2/\delta)}{n}} . \]

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:
\[ \frac{1}{2} \, \mathbb{E}_\sigma \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(X_i) = \frac{1}{2} + \mathbb{E}_\sigma \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n - \frac{1 - \sigma_i g(X_i)}{2} = \frac{1}{2} - \mathbb{E}_\sigma \inf_{g \in G} \frac{1}{n} \sum_{i=1}^n \frac{1 - \sigma_i g(X_i)}{2} = \frac{1}{2} - \mathbb{E}_\sigma \inf_{g \in G} R_n(g, \sigma) . \]
This indicates that, given a sample and a choice of the random variables σ_1, ..., σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.
An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: it measures how much the class can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i and then R_n(G) = 1/2, so that there is no hope of uniform convergence to zero of the difference between the true and empirical risks.
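The procedure just described is easy to carry out whenever an empirical risk minimizer for G is available. The sketch below estimates the quantity 1/2 − E_σ inf_{g∈G} R_n(g, σ) derived above by Monte Carlo over random signs, reusing the hypothetical brute-force stump minimizer `erm` from the sketch in Section 2.1; the number of draws is an arbitrary choice.

```python
import numpy as np

def conditional_rademacher(x, erm_fit, n_draws=200, seed=0):
    """Monte Carlo estimate of 1/2 - E_sigma inf_g R_n(g, sigma): draw random
    labels sigma_i and record how well the class can fit this pure noise."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_draws):
        sigma = rng.choice([-1, 1], size=len(x))
        _, min_risk = erm_fit(x, sigma)        # best empirical fit to the noise
        values.append(0.5 - min_risk)
    return float(np.mean(values))

# e.g. conditional_rademacher(x_train, erm) with the stump class from Section 2.1
```

A value close to 1/2 signals a class rich enough to fit arbitrary noise, for which no uniform convergence can be expected.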
For a finite set G with |G| = N, one can show that
\[ \mathcal{R}_n(G) \le \sqrt{\frac{2 \log N}{n}} , \]
where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection on the sample of a class G with VC dimension h, and using Lemma 1, we have
\[ \mathcal{R}(G) \le \sqrt{\frac{2 h \log \frac{e n}{h}}{n}} . \]
This result along with Theorem 5 allows one to recover the Vapnik-Chervonenkis bound with a concentration-based proof.
Although the benefit of using concentration may not be entirely clear at this point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does.
One has the following result, called Dudley's entropy bound:
\[ \mathcal{R}_n(F) \le \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N(F, t, n)} \, dt . \]
As a consequence, along with Haussler's upper bound, we can get the following result:
\[ \mathcal{R}_n(F) \le C \sqrt{\frac{h}{n}} . \]
We can thus, with this approach, remove the unnecessary log n factor of the VC bound.

6 Advanced Topics

In this section, we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.
6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of P_n f is actually a binomial law of parameters P f and n (since we are summing n i.i.d. random variables f(Z_i) which can either be 0 or 1 and are equal to 1 with probability E f(Z_i) = P f). Denoting p = P f, we have an exact expression for the deviations of P_n f from P f:
\[ \mathbb{P}[P f - P_n f \ge t] = \sum_{k=0}^{\lfloor n(p - t) \rfloor} \binom{n}{k} p^k (1 - p)^{n - k} . \]
Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on P[P f − P_n f ≥ t]:
\[ \Big( \frac{1 - p}{1 - p - t} \Big)^{n (1 - p - t)} \Big( \frac{p}{p + t} \Big)^{n (p + t)} \quad \text{(exponential)} \]
\[ e^{- n p ((1 + t/p) \log(1 + t/p) - t/p)} \quad \text{(Bennett)} \]
\[ e^{- \frac{n t^2}{2 p (1 - p) + 2 t / 3}} \quad \text{(Bernstein)} \]
\[ e^{- 2 n t^2} \quad \text{(Hoeffding)} \]
Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of P f − P_n f have a Gaussian behavior of the form exp(−nt²/2p(1−p)) (i.e. Gaussian with variance p(1−p)), while the large deviations have a Poisson behavior of the form exp(−3nt/2).
So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian with maximum variance, hence the term exp(−2nt²).
Each function f ∈ F has a different variance P f (1 − P f) ≤ P f. Moreover, for each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,
\[ P f \le P_n f + \sqrt{\frac{2 P f \log(1/\delta)}{n}} + \frac{2 \log(1/\delta)}{3 n} . \]
The Gaussian part (the second term on the right hand side) dominates (for P f not too small, or n large enough), and it depends on P f. We thus want to combine Bernstein's inequality with the union bound and the symmetrization.
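Inverting these tail bounds gives deviation terms that can be compared numerically. The sketch below contrasts the Hoeffding term √(log(1/δ)/(2n)) with the Bernstein term quoted above, using P f itself as the variance proxy; the values of n, δ and P f are arbitrary illustrations.

```python
import numpy as np

def hoeffding_term(n, delta):
    return np.sqrt(np.log(1 / delta) / (2 * n))

def bernstein_term(pf, n, delta):
    """sqrt(2 P f log(1/delta) / n) + 2 log(1/delta) / (3 n), as in the bound above."""
    return np.sqrt(2 * pf * np.log(1 / delta) / n) + 2 * np.log(1 / delta) / (3 * n)

for pf in (0.5, 0.1, 0.01):     # Bernstein wins when P f (hence the variance) is small
    print(pf, hoeffding_term(10**4, 0.05), bernstein_term(pf, 10**4, 0.05))
```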
6.2 Normalization

The idea is to consider the ratio
\[ \frac{P f - P_n f}{\sqrt{P f}} . \]
Here (since f ∈ {0, 1}), Var f ≤ P f² = P f.
The reason for considering this ratio is that, after normalization, the fluctuations are more 'uniform' over the class F. Hence the supremum in
\[ \sup_{f \in F} \frac{P f - P_n f}{\sqrt{P f}} \]
is not necessarily attained at functions with large variance, as was the case previously.
Moreover, we know that our goal is to find functions with small error P f (hence small variance). The normalized supremum takes this into account.
We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik-Chervonenkis, [18]). For δ > 0, with probability at least 1 − δ,
\[ \forall f \in F, \quad \frac{P f - P_n f}{\sqrt{P f}} \le 2 \sqrt{\frac{\log S_F(2n) + \log(4/\delta)}{n}} , \]
and also with probability at least 1 − δ,
\[ \forall f \in F, \quad \frac{P_n f - P f}{\sqrt{P_n f}} \le 2 \sqrt{\frac{\log S_F(2n) + \log(4/\delta)}{n}} . \]

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:
\[ \mathbb{P}\Big[ \sup_{f \in F} \frac{P f - P_n f}{\sqrt{P f}} \ge t \Big] \le 2 \, \mathbb{P}\Big[ \sup_{f \in F} \frac{P'_n f - P_n f}{\sqrt{(P_n f + P'_n f)/2}} \ge t \Big] . \]
The second step consists in randomization (with Rademacher variables):
\[ \cdots = 2 \, \mathbb{E}\Big[ \mathbb{P}_\sigma \Big[ \sup_{f \in F} \frac{\frac{1}{n} \sum_{i=1}^n \sigma_i (f(Z'_i) - f(Z_i))}{\sqrt{(P_n f + P'_n f)/2}} \ge t \Big] \Big] . \]
Finally, one uses a tail bound of Bernstein type.

Let us explore the consequences of this result. From the fact that for non-negative numbers A, B, C,
\[ A \le B + C \sqrt{A} \ \Rightarrow \ A \le B + C^2 + C \sqrt{B} , \]
we easily get for example
\[ \forall f \in F, \quad P f \le P_n f + 2 \sqrt{P_n f \, \frac{\log S_F(2n) + \log(4/\delta)}{n}} + 4 \, \frac{\log S_F(2n) + \log(4/\delta)}{n} . \]
In the ideal situation where there is no noise (i.e. Y = t(X) almost surely), and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain
\[ R(g_n) = O\Big( \frac{h \log n}{n} \Big) . \]
So, in a way, Theorem 7 allows us to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow one to remove the log n factor in this case).
It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 − δ,
\[ R(g_n) \le R(g^*) + 2 \sqrt{R(g^*) \, \frac{\log S_G(2n) + \log(4/\delta)}{n}} + 4 \, \frac{\log S_G(2n) + \log(4/\delta)}{n} . \]
We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n, while as soon as R(g*) > 0 the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor √R(g*) of the square root term is not the right quantity to use here, since it does not vary with n. We will see later that one can have instead R(g_n) − R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f − f* (which would be needed to have the mentioned factor), so we will need a refined approach.
6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.
The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximum at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's Noise Condition). For some c > 0, assume
\[ | \eta(X) | > \frac{1}{c} \quad \text{almost surely} . \]
This condition implies that there is no region where the decision is completely random, i.e. the noise is bounded away from 1/2.

Definition 7 (Tsybakov's Noise Condition). Let α ∈ [0, 1]; assume that one of the following equivalent conditions is satisfied:
(i) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X) η(X) ≤ 0] ≤ c (R(g) − R*)^α,
(ii) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c (∫_A |η(x)| dP(x))^α,
(iii) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.
We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent:
(i) ⇔ (ii) It is easy to check that R(g) − R* = E[|η(X)| 1_{g(X)η(X)≤0}]. For each function g, there exists a set A such that 1_A = 1_{gη≤0}.
(ii) ⇒ (iii) Let A = {x : |η(x)| ≤ t}. Then
\[ \mathbb{P}[|\eta| \le t] = \int_A dP(x) \le c \Big( \int_A |\eta(x)| \, dP(x) \Big)^\alpha \le c \, t^\alpha \Big( \int_A dP(x) \Big)^\alpha , \]
so that
\[ \mathbb{P}[|\eta| \le t] \le c^{\frac{1}{1 - \alpha}} \, t^{\frac{\alpha}{1 - \alpha}} . \]
(iii) ⇒ (i) We write
\[ R(g) - R^* = \mathbb{E}[|\eta(X)| \mathbf{1}_{g \eta \le 0}] \ge t \, \mathbb{E}[\mathbf{1}_{g \eta \le 0} \mathbf{1}_{|\eta| > t}] \ge t \, (\mathbb{E}[\mathbf{1}_{g \eta \le 0}] - \mathbb{P}[|\eta| \le t]) \ge t \, (\mathbb{P}[g \eta \le 0] - B t^{\frac{\alpha}{1 - \alpha}}) . \]
Taking t = ((1 − α) P[g η ≤ 0] / B)^{(1−α)/α} finally gives
\[ \mathbb{P}[g \eta \le 0] \le \frac{B^{1 - \alpha}}{(1 - \alpha)^{1 - \alpha} \alpha^\alpha} \, (R(g) - R^*)^\alpha . \]
We notice that the parameter α has to be in [0, 1]. Indeed, one has the opposite inequality
\[ R(g) - R^* = \mathbb{E}[|\eta(X)| \mathbf{1}_{g \eta \le 0}] \le \mathbb{E}[\mathbf{1}_{g \eta \le 0}] = \mathbb{P}[g(X) \eta(X) \le 0] , \]
which is incompatible with condition (i) if α > 1.
We also notice that when α = 0 Tsybakov's condition is void, and when α = 1 it is equivalent to Massart's condition.
Consequences. The conditions we impose on the noise yield a crucial relationship between the variance and the expectation of functions in the so-called relative loss class, defined as
\[ \tilde{F} = \{ (x, y) \mapsto f(x, y) - \mathbf{1}_{t(x) \neq y} : f \in F \} . \]
This relationship will allow us to exploit Bernstein type inequalities applied to this latter class.
Under Massart's condition, one has (written in terms of the initial class), for g ∈ G,
\[ \mathbb{E}[(\mathbf{1}_{g(X) \neq Y} - \mathbf{1}_{t(X) \neq Y})^2] \le c \, (R(g) - R^*) , \]
or, equivalently, for f̃ ∈ F̃, Var f̃ ≤ P f̃² ≤ c P f̃. Under Tsybakov's condition this becomes, for g ∈ G,
\[ \mathbb{E}[(\mathbf{1}_{g(X) \neq Y} - \mathbf{1}_{t(X) \neq Y})^2] \le c \, (R(g) - R^*)^\alpha , \]
and for f̃ ∈ F̃, Var f̃ ≤ P f̃² ≤ c (P f̃)^α.
In the finite case, with |G| = N, one can easily apply Bernstein's inequality to F̃ and the finite union bound to get that, with probability at least 1 − δ, for all g ∈ G,
\[ R(g) - R^* \le R_n(g) - R_n(t) + \sqrt{\frac{8 c \, (R(g) - R^*)^\alpha \log(N/\delta)}{n}} + \frac{4 \log(N/\delta)}{3 n} . \]
As a consequence, when t ∈ G, and g_n is the minimizer of the empirical error (hence R_n(g_n) ≤ R_n(t)), one has
\[ R(g_n) - R^* \le C \Big( \frac{\log(N/\delta)}{n} \Big)^{\frac{1}{2 - \alpha}} , \]
which is always better than n^{−1/2} for α > 0 and is valid even if R* > 0.
6.4 Local Rademacher Averages

In this section we generalize the above result by introducing a localized version of the Rademacher averages. Going from the finite to the general case is more involved than what has been seen before. We first give the appropriate definitions, then state the result and give a proof sketch.

Definitions. Local Rademacher averages refer to Rademacher averages of subsets of the function class determined by a condition on the variance of the functions.

Definition 8 (Local Rademacher Average). The local Rademacher average at radius r ≥ 0 for the class F is defined as
\[ \mathcal{R}(F, r) = \mathbb{E} \sup_{f \in F : P f^2 \le r} R_n f . \]
The reason for this definition is that, as we have seen before, the crucial ingredient for obtaining better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows one to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.
Next we introduce the concept of a sub-root function, a real-valued function with certain monotonicity properties.

Definition 9 (Sub-Root Function). A function ψ : R → R is sub-root if
(i) ψ is non-decreasing,
(ii) ψ is non-negative,
(iii) ψ(r)/√r is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function
(i) is continuous,
(ii) has a unique (non-zero) fixed point r* satisfying ψ(r*) = r*.

Figure 6 shows a typical sub-root function and its fixed point.
Fig. 6. An example of a sub-root function and its fixed point.
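Because a sub-root function is non-decreasing with ψ(r)/√r non-increasing, its fixed point r* can be found by simply iterating r ← ψ(r). The sketch below does this for a hypothetical sub-root function ψ(r) = a√r + b/n, which mimics the shape of the local Rademacher averages encountered below; a, b and n are arbitrary constants.

```python
import numpy as np

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(r); for a sub-root psi this converges to the unique
    non-zero fixed point r* with psi(r*) = r*."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) < tol:
            break
        r = r_next
    return r

a, b, n = 2.0, 1.0, 1000
psi = lambda r: a * np.sqrt(r) + b / n      # a hypothetical sub-root function
print(fixed_point(psi))                     # roughly a**2 when b/n is small
```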
Before seeing the rationale for introducing the sub-root concept, we need yet another definition, that of a 'star-hull' (somewhat similar to a convex hull).

Definition 10 (Star-Hull). Let F be a set of functions. Its star-hull is defined as
\[ \star F = \{ \alpha f : f \in F, \ \alpha \in [0, 1] \} . \]
Now, we state a lemma that indicates that, by taking the star-hull of a class of functions, we are guaranteed that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F,
\[ \mathcal{R}_n(\star F, r) \ \text{is sub-root} . \]

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ⋆F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [−1, 1]) and let r* be the fixed point of R(⋆F, r). There exists a constant C > 0 such that with probability at least 1 − δ,
\[ \forall f \in F, \quad P f - P_n f \le C \Big( \sqrt{r^* \, \mathrm{Var} f} + \frac{\log(1/\delta) + \log \log n}{n} \Big) . \]
If in addition the functions in F satisfy Var f ≤ c (P f)^α, then one obtains that with probability at least 1 − δ,
\[ \forall f \in F, \quad P f \le C \Big( P_n f + (r^*)^{\frac{1}{2 - \alpha}} + \frac{\log(1/\delta) + \log \log n}{n} \Big) . \]

Proof. We only give the main steps of the proof.
1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. which includes the variance). This inequality tells us that with high probability,
\[ \sup_{f \in F} P f - P_n f \le \mathbb{E} \sup_{f \in F} P f - P_n f + c \sqrt{\sup_{f \in F} \mathrm{Var} f / n} + c' / n , \]
for some constants c, c'.
2. The second step consists in 'peeling' the class, that is, splitting it into subclasses according to the variance of the functions:
\[ F_k = \{ f : \mathrm{Var} f \in [x^k, x^{k+1}) \} . \]
3) 2e can then apply Talagrand8s inequality to each o& the sub#classes sepa#
rately to get 5ith high probability
sup
$ & $n & sup $ & $n & I c "Car& 4n I c 4n ,
& ; % & ; %
1) Then the sy((etriGation le((a allo5s to introduce local >ade(acher av#
erages) 2e get that 5ith high probability
& ; , $& $n & 2>@;, "Car&A I c "Car& 4n I c 4n )
<) 2e then have to Tsolve8 this inequality) Things are si(ple i& > behaves li%e a
square root &unction since 5e can upper bound the local >ade(acher average
by the value o& its &i"ed point) 2ith high probability,
$ & $n & 2 r Car& I c "Car& 4n I c 4n )
.) ;inally, 5e use the relationship bet5een variance and e"pectation
Car& c@$ & A ,
and solve the inequality in $ & to get the result)
We will not go into the details of how to apply the above result, but we give some remarks about its use.

An important example is the case where the class F has finite VC dimension h. In that case, one has

R(F, r) ≤ C √( r h log n / n ),

so that r* ≤ C h log n / n. As a consequence, we obtain, under the Tsybakov noise condition, a rate of convergence of Pf_n to Pf* of O(1/n^(1/(2−β))). It is important to note that in this case, the rate of convergence of P_n f to Pf is O(1/√n). So we obtain a fast rate by looking at the relative error. These fast rates can be obtained provided the target t belongs to F (but it is not needed that R* = 0). This requirement can be removed if one uses structural risk minimization or regularization.
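As a rough numerical illustration of these rates (all constants are set to 1, an arbitrary choice, and the logarithmic factor is kept), one can compare in Python the global rate 1/√n with the localized rate (r*)^(1/(2−β)) obtained from r* ≈ h log n / n:

    import math

    def rates(n, h, beta):
        # Global rate 1/sqrt(n) versus the localized rate (r*)**(1/(2-beta))
        # with r* ~ h*log(n)/n (constants set to 1 for illustration).
        slow = 1.0 / math.sqrt(n)
        r_star = h * math.log(n) / n
        fast = r_star ** (1.0 / (2.0 - beta))
        return slow, fast

    for n in (10**3, 10**4, 10**5):
        print(n, rates(n, h=1, beta=1.0))
    # With beta = 1 the localized rate behaves like log(n)/n,
    # which eventually decays much faster than the global 1/sqrt(n) rate.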
Another related result is that, as in the global case, one can obtain a bound with data-dependent (i.e. conditional) local Rademacher averages,

R_n(F, r) = E_σ [ sup_{f∈F : P_n f² ≤ r} R_n f ].

The result is the same as before (with different constants) under the same conditions as in Theorem 8. With probability at least 1 − δ,

Pf ≤ C ( P_n f + (r*_n)^(1/(2−β)) + (log(1/δ) + log log n) / n ),

where r*_n is the fixed point of a sub-root upper bound of R_n(F, r).
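For a finite class, the conditional local Rademacher average can be approximated by Monte Carlo over the random signs σ_i. The sketch below is a hedged illustration: it assumes the class is given as a matrix values[j, i] = f_j(X_i) (an assumption made purely for the example) and restricts the supremum to the functions with P_n f² ≤ r, as in the definition above.

    import numpy as np

    def local_rademacher(values, r, n_draws=2000, seed=0):
        # Monte Carlo estimate of E_sigma sup_{f: P_n f^2 <= r} (1/n) sum_i sigma_i f(X_i)
        # for a finite class given by values[j, i] = f_j(X_i).
        rng = np.random.default_rng(seed)
        m, n = values.shape
        keep = values[(values ** 2).mean(axis=1) <= r]    # functions with P_n f^2 <= r
        if keep.shape[0] == 0:
            return 0.0                                    # empty constraint set
        sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
        sups = (sigma @ keep.T / n).max(axis=1)           # sup over f for each sign draw
        return sups.mean()

    # Toy class: 50 random functions evaluated at 100 points (illustrative only).
    values = np.random.default_rng(1).normal(scale=0.3, size=(50, 100))
    print(local_rademacher(values, r=0.1))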
Hence, we can get improved rates when the noise is well-behaved, and these rates interpolate between n^(−1/2) and n^(−1). However, it is not in general possible to estimate the parameters (c and β) entering the noise conditions, but we will not discuss this issue further here. Another point is that although the capacity measure that we use seems 'local', it does depend on all the functions in the class, but each of them is implicitly appropriately rescaled. Indeed, in R(⋆F, r), each function f ∈ F with Pf² ≥ r is considered at scale r/Pf².
Bibliographical remarks. Hoeffding's inequality appears in [19]. For a proof of the contraction principle we refer to Ledoux and Talagrand [20].

The Vapnik-Chervonenkis-Sauer-Shelah lemma was proved independently by Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi, and Haussler [24], Cesa-Bianchi and Haussler [25], Frankl [26], Haussler [27], Szarek and Talagrand [28].

Uniform deviations of averages from their expectations is one of the central problems of empirical process theory. Here we merely refer to some of the comprehensive coverages, such as Dudley [29], Giné [30], Vapnik [1], van der Vaart and Wellner [31]. The use of empirical processes in classification was pioneered by Vapnik and Chervonenkis [18, 15] and re-discovered 20 years later by Blumer, Ehrenfeucht, Haussler, and Warmuth [32] and Ehrenfeucht, Haussler, Kearns, and Valiant [33]. For surveys see Anthony and Bartlett [2], Devroye, Györfi, and Lugosi [4], Kearns and Vazirani [7], Natarajan [12], Vapnik [14, 1].

The question of how sup_{f∈F} (P(f) − P_n(f)) behaves has been known as the Glivenko-Cantelli problem and much has been said about it. A few key references include Alon, Ben-David, Cesa-Bianchi, and Haussler [24], Dudley [34, 35, 36], Talagrand [37, 38], and Vapnik and Chervonenkis [18, 39].

The VC dimension has been widely studied and many of its properties are known. We refer to Anthony and Bartlett [2], Assouad [40], Cover [41], Dudley [42, 29], Goldberg and Jerrum [43], Karpinski and Macintyre [44], Khovanskii [45], Koiran and Sontag [46], Macintyre and Sontag [47], Steele [48], and Wenocur and Dudley [49].

The bounded differences inequality was first formulated explicitly by McDiarmid [17], who proved it by martingale methods (see the surveys [17], [50]), but closely related concentration results have been obtained in various ways, including information-theoretic methods (see Ahlswede, Gács, and Körner [51], Marton [52], [53], [54], Dembo [55], Massart [56], and Rio [57]), Talagrand's induction method [58], [59], [60] (see also Luczak and McDiarmid [61], McDiarmid [62], Panchenko [63, 64, 65]), and the so-called "entropy method", based on logarithmic Sobolev inequalities, developed by Ledoux [66], [67]; see also Bobkov and Ledoux [68], Massart [69], Rio [57], Boucheron, Lugosi, and Massart [70], [71], Boucheron, Bousquet, Lugosi, and Massart [72], and Bousquet [73].

Symmetrization lemmas can be found in Giné and Zinn [74] and Vapnik and Chervonenkis [18, 15].

The use of Rademacher averages in classification was first promoted by Koltchinskii [75] and Bartlett, Boucheron, and Lugosi [76]; see also Koltchinskii and Panchenko [77, 78], Bartlett and Mendelson [79], Bartlett, Bousquet, and Mendelson [80], Bousquet, Koltchinskii, and Panchenko [81], and Kégl, Linder, and Lugosi [82].
A  Probability Tools

This section recalls some basic facts from probability theory that are used throughout this tutorial (sometimes without explicitly mentioning it).

We denote by A and B some events (i.e. elements of a σ-algebra), and by X some real-valued random variable.
A.1  Basic Facts

• Union: P[A or B] ≤ P[A] + P[B].
• Inclusion: If A ⊂ B, then P[A] ≤ P[B].
• Inversion: If P[X > t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^(−1)(δ) (see the sketch after this list).
• Expectation: If X ≥ 0, then E[X] = ∫₀^∞ P[X ≥ t] dt.
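The 'Inversion' fact is the device used throughout the tutorial to turn a tail bound into a high-probability statement. As a hedged Python example (the Hoeffding-type tail F(t) = exp(−2nt²) for the deviation of an empirical mean of [0, 1]-valued variables is assumed here only to have something concrete to invert), setting F(t) = δ and solving for t gives the familiar √(log(1/δ)/(2n)) deviation:

    import math

    def hoeffding_deviation(n, delta):
        # Invert the tail bound P[mean deviation > t] <= exp(-2*n*t**2):
        # with probability at least 1 - delta, the deviation is at most this t.
        return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

    print(hoeffding_deviation(n=1000, delta=0.05))   # about 0.0387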
A.2  Basic Inequalities

All the inequalities below are valid as soon as the right-hand side exists (a numerical comparison of the last three is sketched after this list).

• Jensen: for f convex,  f(E[X]) ≤ E[f(X)].
• Markov: If X ≥ 0, then for all t > 0,  P[X ≥ t] ≤ E[X] / t.
• Chebyshev: for t > 0,  P[|X − E[X]| ≥ t] ≤ Var X / t².
• Chernoff: for all t,  P[X ≥ t] ≤ inf_{λ ≥ 0} E[ e^(λ(X − t)) ].
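A small computation makes the relative strength of these inequalities concrete. The Python example below (a Binomial(100, 1/2) random variable and the threshold t = 60 are arbitrary illustrative choices) compares the exact tail probability with the Markov, Chebyshev, and Chernoff bounds; for the Chernoff bound, the infimum over λ is evaluated in closed form as exp(−n·KL(t/n, 1/2)), which is the optimized exponential moment bound for a binomial.

    import math

    n, q, t = 100, 0.5, 60          # X ~ Binomial(n, q); we bound P[X >= t]
    mean, var = n * q, n * q * (1 - q)

    exact = sum(math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(t, n + 1))
    markov = mean / t                          # Markov (uses X >= 0 only)
    chebyshev = var / (t - mean) ** 2          # Chebyshev via |X - E[X]| >= t - E[X]
    a = t / n                                  # Chernoff: optimized exponential bound
    chernoff = math.exp(-n * (a * math.log(a / q) + (1 - a) * math.log((1 - a) / (1 - q))))

    print(exact, markov, chebyshev, chernoff)
    # roughly 0.028, 0.83, 0.25, 0.13: each bound is valid, and they get sharper
    # as more information (positivity, variance, exponential moments) is used.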
B  No Free Lunch

We can now give a formal definition of consistency and state the core results about the impossibility of universally good algorithms.

Definition 11 (Consistency). An algorithm is consistent if for any probability measure P,

lim_{n→∞} R(g_n) = R*  almost surely.
It is important to understand the reasons that make possible the existence of consistent algorithms. In the case where the input space X is countable, things are somehow easy since, even if there is no relationship at all between inputs and outputs, by repeatedly sampling data independently from P one will get to see an increasing number of different inputs, which will eventually converge to all the inputs. So, in the countable case, an algorithm which would simply learn 'by heart' (i.e. makes a majority vote when the instance has been seen before, and produces an arbitrary prediction otherwise) would be consistent.
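A hedged Python sketch of such a 'learn by heart' rule on a countable input space: predict by majority vote over the labels already observed for that exact input, and output an arbitrary label (here +1, an arbitrary choice) for unseen or tied inputs. The class name and interface below are invented for the illustration.

    from collections import defaultdict

    class MemorizeClassifier:
        # Majority vote over the labels seen for an input; arbitrary guess otherwise.
        def __init__(self):
            self.votes = defaultdict(int)      # input -> sum of observed +1/-1 labels

        def fit(self, xs, ys):
            for x, y in zip(xs, ys):
                self.votes[x] += y
            return self

        def predict(self, x):
            if self.votes[x] > 0:
                return 1
            if self.votes[x] < 0:
                return -1
            return 1                           # arbitrary prediction (unseen or tied input)

    clf = MemorizeClassifier().fit([0, 1, 1, 2], [1, -1, -1, 1])
    print([clf.predict(x) for x in (0, 1, 2, 3)])   # [1, -1, 1, 1]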
In the case where X is not countable (e.g. X = ℝ), things are more subtle. Indeed, in that case there is a seemingly innocent assumption that becomes crucial: to be able to define a probability measure P on X, one needs a σ-algebra on that space, which is typically the Borel σ-algebra. So the hidden assumption is that P is a Borel measure. This means that the topology of ℝ plays a role here, and thus that the target function t will be Borel measurable. In a sense this guarantees that it is possible to approximate t from its value (or approximate value) at a finite number of points. The algorithms that will achieve consistency are thus those which use the topology in the sense of 'generalizing' the observed values to neighborhoods (e.g. local classifiers). In a way, the measurability of t is one of the crudest notions of smoothness of functions.
We now cite two important results. The first one tells us that for a fixed sample size, one can construct arbitrarily bad problems for a given algorithm.

Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any n and any ε > 0, there exists a distribution P such that R* = 0 and

P[ R(g_n) ≥ 1/2 − ε ] = 1.
The second result is more subtle and indicates that, given an algorithm, one can construct a problem for which this algorithm will converge as slowly as one wishes.

Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm and any sequence (a_n) that converges to 0, there exists a probability distribution P such that R* = 0 and

R(g_n) ≥ a_n.
In the above theorem, the 'bad' probability measure is constructed on a countable set (where the outputs are not related at all to the inputs, so that no generalization is possible), and is such that the rate at which one gets to see new inputs is as slow as the convergence of a_n.
Finally, we mention other notions of consistency.

Definition 12 (VC consistency of ERM). The ERM algorithm is consistent if for any probability measure P,

R(g_n) → R(g*)  in probability,

and

R_n(g_n) → R(g*)  in probability.
Definition 13 (VC non-trivial consistency of ERM). The ERM algorithm is non-trivially consistent for the set F and the probability distribution P if for any c ∈ ℝ,

inf_{f∈F : Pf > c} P_n(f) → inf_{f∈F : Pf > c} P(f)  in probability.
References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, Belmont, CA (1984)
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
5. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York (1973)
6. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1972)
7. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994)
8. Kulkarni, S., Lugosi, G., Venkatesh, S.: Learning pattern classification - a survey. IEEE Transactions on Information Theory 44 (1998) 2178–2206. Information Theory: 1948–1998. Commemorative special issue.
9. Lugosi, G.: Pattern classification and learning theory. In Györfi, L., ed.: Principles of Nonparametric Learning, Springer, Vienna (2002) 5–62
10. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992)
11. Mendelson, S.: A few notes on statistical learning theory. In Mendelson, S., Smola, A., eds.: Advanced Lectures in Machine Learning. LNCS 2600, Springer (2003) 1–40
12. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA (1991)
13. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
15. Vapnik, V., Chervonenkis, A.: Theory of Pattern Recognition. Nauka, Moscow (1974) (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
16. von Luxburg, U., Bousquet, O., Schölkopf, B.: A compression approach to support vector model selection. The Journal of Machine Learning Research 5 (2004) 293–323
17. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics 1989, Cambridge University Press, Cambridge (1989) 148–188
18. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971) 264–280
19. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13–30
20. Ledoux, M., Talagrand, M.: Probability in Banach Space. Springer-Verlag, New York (1991)
21. Sauer, N.: On the density of families of sets. Journal of Combinatorial Theory, Series A 13 (1972) 145–147
22. Shelah, S.: A combinatorial problem: Stability and order for models and theories in infinity languages. Pacific Journal of Mathematics 41 (1972) 247–261
23. Alesker, S.: A remark on the Szarek-Talagrand theorem. Combinatorics, Probability, and Computing 6 (1997) 139–144
24. Alon, N., Ben-David, S., Cesa-Bianchi, N., Haussler, D.: Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM 44 (1997) 615–631
25. Cesa-Bianchi, N., Haussler, D.: A graph-theoretic generalization of the Sauer-Shelah lemma. Discrete Applied Mathematics 86 (1998) 27–35
26. Frankl, P.: On the trace of finite sets. Journal of Combinatorial Theory, Series A 34 (1983) 41–45
27. Haussler, D.: Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A 69 (1995) 217–232
28. Szarek, S., Talagrand, M.: On the convexified Sauer-Shelah theorem. Journal of Combinatorial Theory, Series B 69 (1997) 183–192
29. Dudley, R.: Uniform Central Limit Theorems. Cambridge University Press, Cambridge (1999)
30. Giné, E.: Empirical processes and applications: an overview. Bernoulli 2 (1996) 1–28
31. van der Vaart, A., Wellner, J.: Weak convergence and empirical processes. Springer-Verlag, New York (1996)
32. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (1989) 929–965
33. Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L.: A general lower bound on the number of examples needed for learning. Information and Computation 82 (1989) 247–261
34. Dudley, R.: Central limit theorems for empirical measures. Annals of Probability 6 (1978) 899–929
35. Dudley, R.: Empirical processes. In: École de Probabilité de St. Flour 1982, Lecture Notes in Mathematics 1097, Springer-Verlag, New York (1984)
36. Dudley, R.: Universal Donsker classes and metric entropy. Annals of Probability 15 (1987) 1306–1326
37. Talagrand, M.: The Glivenko-Cantelli problem. Annals of Probability 15 (1987) 837–870
38. Talagrand, M.: Sharper bounds for Gaussian and empirical processes. Annals of Probability 22 (1994) 28–76
39. Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications 26 (1981) 821–832
40. Assouad, P.: Densité et dimension. Annales de l'Institut Fourier 33 (1983) 233–282
41. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 14 (1965) 326–334
42. Dudley, R.: Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics 31 (3) (1979) 306–308
43. Goldberg, P., Jerrum, M.: Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning 18 (1995) 131–148
44. Karpinski, M., Macintyre, A.: Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Science 54 (1997)
45. Khovanskii, A.G.: Fewnomials. Translations of Mathematical Monographs, vol. 88, American Mathematical Society (1991)
46. Koiran, P., Sontag, E.: Neural networks with quadratic VC dimension. Journal of Computer and System Science 54 (1997)
47. Macintyre, A., Sontag, E.: Finiteness results for sigmoidal "neural" networks. In: Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, Association of Computing Machinery, New York (1993) 325–334
48. Steele, J.: Existence of submatrices with all possible columns. Journal of Combinatorial Theory, Series A 28 (1978) 84–88
49. Wenocur, R., Dudley, R.: Some special Vapnik-Chervonenkis classes. Discrete Mathematics 33 (1981) 313–318
50. McDiarmid, C.: Concentration. In Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B., eds.: Probabilistic Methods for Algorithmic Discrete Mathematics, Springer, New York (1998) 195–248
51. Ahlswede, R., Gács, P., Körner, J.: Bounds on conditional probabilities with applications in multi-user communication. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 34 (1976) 157–177 (correction in 39:353–354, 1977)
52. Marton, K.: A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory 32 (1986) 445–446
53. Marton, K.: Bounding d̄-distance by informational divergence: a way to prove measure concentration. Annals of Probability 24 (1996) 857–866
54. Marton, K.: A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis 6 (1996) 556–571. Erratum: 7:609–613, 1997.
55. Dembo, A.: Information inequalities and concentration of measure. Annals of Probability 25 (1997) 927–939
56. Massart, P.: Optimal constants for Hoeffding type inequalities. Technical report, Mathématiques, Université de Paris-Sud, Report 98.86 (1998)
57. Rio, E.: Inégalités de concentration pour les processus empiriques de classes de parties. Probability Theory and Related Fields 119 (2001) 163–175
58. Talagrand, M.: A new look at independence. Annals of Probability 24 (1996) 1–34 (Special Invited Paper)
59. Talagrand, M.: Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.E.S. 81 (1995) 73–205
60. Talagrand, M.: New concentration inequalities in product spaces. Inventiones Mathematicae 126 (1996) 505–563
61. Luczak, M.J., McDiarmid, C.: Concentration for locally acting permutations. Discrete Mathematics (2003) to appear
62. McDiarmid, C.: Concentration for independent permutations. Combinatorics, Probability, and Computing 2 (2002) 163–178
63. Panchenko, D.: A note on Talagrand's concentration inequality. Electronic Communications in Probability 6 (2001)
64. Panchenko, D.: Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability 7 (2002)
65. Panchenko, D.: Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability, to appear (2003)
66. Ledoux, M.: On Talagrand's deviation inequalities for product measures. ESAIM: Probability and Statistics 1 (1997) 63–87, http://www.emath.fr/ps/
67. Ledoux, M.: Isoperimetry and Gaussian analysis. In Bernard, P., ed.: Lectures on Probability Theory and Statistics, École d'Été de Probabilités de St-Flour XXIV-1994 (1996) 165–294
68. Bobkov, S., Ledoux, M.: Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution. Probability Theory and Related Fields 107 (1997) 383–400
69. Massart, P.: About the constants in Talagrand's concentration inequalities for empirical processes. Annals of Probability 28 (2000) 863–884
70. Boucheron, S., Lugosi, G., Massart, P.: A sharp concentration inequality with applications. Random Structures and Algorithms 16 (2000) 277–292
71. Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities using the entropy method. The Annals of Probability 31 (2003) 1583–1614
72. Boucheron, S., Bousquet, O., Lugosi, G., Massart, P.: Moment inequalities for functions of independent random variables. The Annals of Probability (2004), to appear
73. Bousquet, O.: A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris 334 (2002) 495–500
74. Giné, E., Zinn, J.: Some limit theorems for empirical processes. Annals of Probability 12 (1984) 929–989
75. Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory 47 (2001) 1902–1914
76. Bartlett, P., Boucheron, S., Lugosi, G.: Model selection and error estimation. Machine Learning 48 (2001) 85–113
77. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics 30 (2002)
78. Koltchinskii, V., Panchenko, D.: Rademacher processes and bounding the risk of function learning. In Giné, E., Mason, D., Wellner, J., eds.: High Dimensional Probability II (2000) 443–459
79. Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (2002) 463–482
80. Bartlett, P., Bousquet, O., Mendelson, S.: Localized Rademacher complexities. In: Proceedings of the 15th Annual Conference on Computational Learning Theory (2002) 44–48
81. Bousquet, O., Koltchinskii, V., Panchenko, D.: Some local measures of complexity of convex hulls and generalization bounds. In: Proceedings of the 15th Annual Conference on Computational Learning Theory, Springer (2002) 59–73
82. Antos, A., Kégl, B., Linder, T., Lugosi, G.: Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research 3 (2002) 73–98