algorithm A, so we make no assumptions about it, except that it returns a classifier with domain X.
We want to show that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
This second requirement seems to be another way of saying that we are working with a labelled set.
So the real work is in showing that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
While the intuitive idea of the proof is (claimed to be) simple, there are many tricks in the details. We also have
to recall many definitions. So let's warm up before getting into the details.
The algorithm A will receive a sample of m examples from C labelled by some fi . Intuitively, we let it learn
from these m examples and show how it can go wrong on the examples it didn't see.
How many labellings of C are possible? For each example x ∈ C we can choose f_i(x) = 1 or f_i(x) = 0. Since C has 2m elements, there are T = 2^{2m} possible labellings f_1, …, f_T.
Each sample possibly seen by A is a sequence of m labelled examples. If we think of the examples first being labelled by f_i and then randomly drawn, one after another, with repetition allowed, there are k = (2m)^m possible sequences or samples. This follows because we draw from C, a set with 2m elements, exactly m times. With repetition allowed, there are 2m choices for each draw, each assumed equally likely.
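As a quick sanity check on these counts, here's a tiny Python sketch for m = 2 (the value of m is just an illustrative assumption):

```python
from itertools import product

m = 2                      # a toy size; any m works the same way
C = list(range(2 * m))     # C has 2m elements

# All possible labellings f_i of C: one bit per element.
labellings = list(product([0, 1], repeat=len(C)))
assert len(labellings) == 2 ** (2 * m)          # T = 2^(2m)

# All possible ordered samples of m draws from C, with repetition.
samples = list(product(C, repeat=m))
assert len(samples) == (2 * m) ** m             # k = (2m)^m

print(len(labellings), len(samples))            # 16 16 for m = 2
```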
So we can represent all possible samples with two indices, as S_i^j, where i ∈ [T] represents a sample labelled by f_i and j ∈ [k] indicates which of the k = (2m)^m possible samples from this labelled-by-f_i version of C actually occurred.
For convenience we define h_i^j = A(S_i^j). So h_i^j is the output, the classifier returned by the function or algorithm A, upon receiving S_i^j as its training set.
In the book's version of the proof, the distribution D_i is just used to say that each labelled example in C is equally likely to be drawn or sampled. It also implies that some labelling f_i is the true labelling. So each x ∈ C is equally likely and it comes with the correct or actual label f_i(x). D_i also lets us use our earlier definitions.
First we note that, for any classifier h and distribution D_i, we have:

L_{D_i}(h) = E[1_{h(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)}.
This follows because the actual risk is the expected value of the 0 − 1 loss function over all labelled
examples, and because each labelled example is equally likely.
Later we will use this for the outputs of our algorithm in the form:

L_{D_i}(h_i^j) = E[1_{h_i^j(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h_i^j(x)≠f_i(x)}.
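Since each of the 2m labelled examples is equally likely under D_i, the risk is just an average of indicator values. A minimal Python sketch (the labelling and classifier below are made-up examples, not anything from the proof):

```python
m = 2
C = list(range(2 * m))

def true_risk(h, f, C):
    """L_{D_i}(h): average 0-1 loss over the 2m equally likely labelled examples."""
    return sum(1 for x in C if h(x) != f(x)) / len(C)

f = lambda x: x % 2            # a hypothetical 'true' labelling f_i
h = lambda x: 0                # a classifier that always predicts 0

print(true_risk(h, f, C))      # 0.5: h errs exactly on the odd elements of C
```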
We want to prove that the worst expected risk over all possible labellings f_i is at least 1/4.
In other words, that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
If we can prove this, then D_i is the distribution we seek, while f_i is the labelling such that L_{D_i}(f_i) = 0. Since all the distribution adds is the equal probability of each example, intuitively we're looking for the i ∈ [T] that indexes the 'true' labelling that our algorithm can't perform well on.
Note that E_{S_i^j}[L_{D_i}(h_i^j)] is taken over all samples labelled by f_i, which is to say over the k possible samples.
E_{S_i^j}[L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j)

So it suffices to show that:

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ 1/4.
We know that a max over i ∈ [T] is at least as large as an average over i ∈ [T], so we can safely derive:
max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ (1/T) ∑_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j)
Now for a little trick. We know we can reverse the order of summation, since we are dealing with a finite number of terms. We'll see why we want to in a moment. But this gives us:
max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
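Both steps so far, max ≥ average and swapping the order of a finite double sum, can be checked numerically. Here's a sketch using a random matrix as a stand-in for the losses L_{D_i}(h_i^j) (the sizes T and k are toy assumptions):

```python
import random
random.seed(0)

T, k = 8, 6      # toy numbers of labellings and samples
# Stand-in losses: L[i][j] plays the role of L_{D_i}(h_i^j).
L = [[random.random() for j in range(k)] for i in range(T)]

row_avg = [sum(row) / k for row in L]                           # (1/k) sum_j L[i][j]
col_avg = [sum(L[i][j] for i in range(T)) / T for j in range(k)]  # (1/T) sum_i L[i][j]

# Swapping the order of summation leaves the overall average unchanged.
overall = sum(row_avg) / T
assert abs(overall - sum(col_avg) / k) < 1e-12

# A max is at least an average, and an average is at least a min.
assert max(row_avg) >= overall >= min(col_avg)
print("max >= average >= min holds")
```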
If we look at the leftmost sum and its preceding 1/k, we see that we have another, different average.
Note that E_j[(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
This follows because the expectation is calculated over all j ∈ [k], and all k sample sequences are equally likely.
Just as we knew before that a max is at least an average, so do we also know that an average is at least a minimum.
E_j[(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j).
Chaining these inequalities together:

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
So it is enough to prove that:

min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4
Recall that:

L_{D_i}(h_i^j) = E[1_{h_i^j(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h_i^j(x)≠f_i(x)}.
Before we plug this into the derivation above, we can do a little work on this sum. What we want is a nice lower bound, independent of h_i^j.
Recall that A only gets to see m examples from a set C of 2m elements. Moreover it's possible that A will get the same example more than once in its sample. So A will see at most m unique examples.

Let v_1, …, v_p be the examples of C that never appear in the sample S_i^j, and let p be their number. Since only as many as half of C can be seen, at least half won't be seen.
Therefore p ≥ m and 1/(2m) ≥ 1/(2p).
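We can sanity-check this counting with a quick simulation (m = 5 and the number of trials are arbitrary choices):

```python
import random
random.seed(1)

m = 5
C = list(range(2 * m))

# Draw many samples of m examples (with repetition) and count the unseen ones.
for _ in range(1000):
    sample = [random.choice(C) for _ in range(m)]
    unseen = [x for x in C if x not in sample]
    p = len(unseen)
    assert p >= m                       # at most m distinct seen, so at least m unseen
    assert 1 / (2 * m) >= 1 / (2 * p)   # hence 1/(2m) >= 1/(2p)

print("p was always at least", m)
```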
Let's first note (letting h = h_i^j, since the indices don't matter here) that

(1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}.
This follows because the v_r are simply a subset of C and the indicator function is non-negative.
We can frame this another way. Let V be the set of the v_r, so V ⊂ C, and

(1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{x∈V} 1_{h(x)≠f_i(x)} = (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}
Since 1/(2m) ≥ 1/(2p), we can say that

L_{D_i}(h) = (1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)} ≥ (1/(2p)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}
Note that we didn't use any properties of h to make this argument. So for all h_i^j we have
L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p 1_{h_i^j(v_r)≠f_i(v_r)}
Averaging over i ∈ [T] then gives:

(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/T) ∑_{i∈[T]} (1/(2p)) ∑_{r=1}^p 1_{h_i^j(v_r)≠f_i(v_r)}.
We will again find it useful to change the order of summation, which is also justified because we are summing
over a finite number of terms.
(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}.
We almost have an average on the RHS. If we factor out the 1/2, then we'll have

E_{r∈[p]}[(1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}],

so that

(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = (1/2) E_{r∈[p]}[(1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}] ≥ (1/2) min_{r∈[p]} (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}.
Let's think about this rightmost sum, which is just a function g of r:

g(r) = ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}
This sum counts the errors of all the possible h_i^j on just the one unseen element v_r.
Recall also that each h_i^j = A(S_i^j) is determined by the labels f_i applied to the examples and the order j ∈ [k] in which the examples were given to A.
Let's take any r ∈ [p] and consider it fixed. Out of all of the T labellings f_i, we can always find pairs (f_i, f_I) that agree on every x ∈ C other than v_r.
To make this more intuitive, imagine ordering the elements of C as x_1, x_2, x_3, …, x_{2m}. Then each f_i can be represented by a sequence of 2m bits, and every such sequence of 2m bits represents an f_i. Half of the f_i will give f_i(v_r) = 1 and the other half will give f_i(v_r) = 0. Each f_i that returns 1 at v_r will have its unique partner f_I that returns 0 at v_r and agrees with f_i everywhere else.
Now the fact that v_r ∈ V becomes important. The algorithm A never sees the difference between the labelling f_i and its almost-twin f_I. In other words, S_i^j = S_I^j.
Since the same input to the algorithm guarantees the same output, this gives h_i^j = h_I^j.
To make the next step a little clearer, let E be the set of the f_i such that f_i(v_r) = 1.
So

g(r) = ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = ∑_{f_i∈E} (1_{h_i^j(v_r)≠f_i(v_r)} + 1_{h_I^j(v_r)≠f_I(v_r)}) = |E| = T/2.
Note that when we indexed over E we summed the indicator functions in pairs rather than separately. Since h_i^j = h_I^j while f_i(v_r) ≠ f_I(v_r), exactly one indicator in each pair equals 1, so each pair contributes 1 and the total is |E| = T/2.
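The pairing argument can be verified by brute force for a tiny C. The learner below is a hypothetical stand-in (it memorizes its sample and predicts the majority label elsewhere); the argument says g(r) = T/2 for any deterministic A, so any rule would give the same count:

```python
from itertools import product

m = 2
C = list(range(2 * m))
labellings = list(product([0, 1], repeat=len(C)))   # all T = 2^(2m) labellings
T = len(labellings)

def A(sample):
    """Hypothetical learner: memorize the sample, predict the majority label elsewhere."""
    memory = dict(sample)
    labels = [y for _, y in sample]
    majority = 1 if 2 * sum(labels) >= len(labels) else 0
    return lambda x: memory.get(x, majority)

sample_points = (0, 1)          # a fixed sequence j of m = 2 draws
v_r = 3                         # an element of C the learner never sees

# Count the errors of all the h_i^j at the single unseen point v_r.
errors = sum(1 for f in labellings
             if A([(x, f[x]) for x in sample_points])(v_r) != f[v_r])

assert errors == T // 2         # g(r) = T/2, as the pairing argument predicts
print(errors, "of", T)          # 8 of 16
```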
So

(1/2) min_{r∈[p]} (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = (1/2) min_{r∈[p]} (1/T) g(r) = (1/2)(1/T)(T/2) = 1/4
Hence (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4.
Note that we didn't use the particular ordering j ∈ [k] or the distribution D_i, since our pairs (f_i, f_I) ranged over all of the labellings (and hence all of the distributions). So the above inequality holds for all labellings and all orders.
Hence

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4,
since (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4 no matter the sample.
Finally, we choose a maximizing index i and then the distribution D_i that labels examples with f_i.
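Putting the whole argument to the test: for m = 2 we can enumerate every labelling and every sample sequence and compute max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] exactly. The learner is again a hypothetical stand-in (memorize the sample, predict 0 elsewhere); the theorem promises at least 1/4 for any choice:

```python
from itertools import product

m = 2
C = list(range(2 * m))
labellings = list(product([0, 1], repeat=len(C)))   # T = 2^(2m) labellings
samples = list(product(C, repeat=m))                # k = (2m)^m ordered samples

def A(sample):
    """Hypothetical learner: memorize the sample, predict 0 on anything unseen."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

def risk(h, f):
    """L_{D_i}(h): average 0-1 loss over the 2m equally likely examples."""
    return sum(1 for x in C if h(x) != f[x]) / len(C)

# E_{S_i^j}[L_{D_i}(h_i^j)] for each labelling f_i: average over the k samples.
expected = []
for f in labellings:
    total = 0.0
    for seq in samples:
        h = A([(x, f[x]) for x in seq])
        total += risk(h, f)
    expected.append(total / len(samples))

print(max(expected))          # 0.5625, comfortably above the 1/4 bound
assert max(expected) >= 0.25
```

Here the worst labelling for this particular learner is the all-ones one, but the point of the theorem is that some labelling must cost at least 1/4 no matter which learner we plug in.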