algorithm A, so we make no assumptions about it, except that it returns a classifier with domain X.
We want to show that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
This second requirement seems to be another way of saying that we are working with a labelled set.
So the real work is in showing that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
While the intuitive idea of the proof is (claimed to be) simple, there are many tricks in the details. We also have
to recall many definitions. So let's warm up before getting into the details.
The algorithm A will receive a sample of m examples from C labelled by some fi . Intuitively, we let it learn
from these m examples and show how it can go wrong on the examples it didn't see.
How many labellings of C are possible? For each example x ∈ C we can choose f_i(x) = 1 or f_i(x) = 0. Since C has 2m elements, there are T = 2^{2m} possible labellings f_1, …, f_T.
Each sample possibly seen by A is a sequence of m labelled examples. If we think of the examples first being labelled by f_i and then randomly drawn, one after another, with repetition allowed, there are k = (2m)^m possible sequences or samples. This follows because we draw from C, a set with 2m elements, exactly m times. With repetition allowed, there are 2m choices for each draw, each assumed equally likely.
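As a quick sanity check on these counts, here's a tiny Python sketch for m = 2 (the value of m is just an illustrative assumption):

```python
from itertools import product

m = 2                      # a toy size; any m works the same way
C = list(range(2 * m))     # C has 2m elements

# All possible labellings f_i of C: one bit per element.
labellings = list(product([0, 1], repeat=len(C)))
assert len(labellings) == 2 ** (2 * m)          # T = 2^(2m)

# All possible ordered samples of m draws from C, with repetition.
samples = list(product(C, repeat=m))
assert len(samples) == (2 * m) ** m             # k = (2m)^m

print(len(labellings), len(samples))            # 16 16 for m = 2
```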
So we can represent all possible samples with two indices, as S_i^j, where i ∈ [T] represents a sample labelled by f_i and j ∈ [k] indicates which of the k = (2m)^m possible samples from this labelled-by-f_i version of C actually occurred.
For convenience we define h_i^j = A(S_i^j). So h_i^j is the output, the classifier returned by the function or algorithm A, upon receiving S_i^j as its training set.
In the book's version of the proof, the distribution D_i is just used to say that each labelled example in C is equally likely to be drawn or sampled. It also implies that some labelling f_i is the true labelling. So each x ∈ C is equally likely and it comes with the correct or actual label f_i(x). D_i also lets us use our earlier definitions.
First we note that, for any classifier h and distribution D_i, we have:

L_{D_i}(h) = E[1_{h(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)}.
This follows because the actual risk is the expected value of the 0 − 1 loss function over all labelled
examples, and because each labelled example is equally likely.
Later we will use this for the outputs of our algorithm in the form:

L_{D_i}(h_i^j) = E[1_{h_i^j(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h_i^j(x)≠f_i(x)}.
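Since each of the 2m labelled examples is equally likely under D_i, the risk is just an average of indicator values. A minimal Python sketch (the labelling and classifier below are made-up examples, not anything from the proof):

```python
m = 2
C = list(range(2 * m))

def true_risk(h, f, C):
    """L_{D_i}(h): average 0-1 loss over the 2m equally likely labelled examples."""
    return sum(1 for x in C if h(x) != f(x)) / len(C)

f = lambda x: x % 2            # a hypothetical 'true' labelling f_i
h = lambda x: 0                # a classifier that always predicts 0

print(true_risk(h, f, C))      # 0.5: h errs exactly on the odd elements of C
```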
We want to prove that the worst expected risk over all possible labellings f_i is at least 1/4.
In other words, that max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] ≥ 1/4.
If we can prove this, then D_i is the distribution we seek, while f_i is the labelling such that L_{D_i}(f_i) = 0. Since all the distribution adds is the equal probability of each example, intuitively we're looking for the i ∈ [T] that indexes the 'true' labelling that our algorithm can't perform well on.
Note that E_{S_i^j}[L_{D_i}(h_i^j)] is taken over all samples labelled by f_i, which is to say over the k possible samples.
E_{S_i^j}[L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j)

So it suffices to show that:

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ 1/4.
We know that a max over i ∈ [T] is at least as large as an average over i ∈ [T], so we can safely derive:
max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ (1/T) ∑_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j)
Now for a little trick. We know we can reverse the order of summation, since we are dealing with a finite number of terms. We'll see why we want to in a moment. But this gives us:
max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
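Both steps so far, max ≥ average and swapping the order of a finite double sum, can be checked numerically. Here's a sketch using a random matrix as a stand-in for the losses L_{D_i}(h_i^j) (the sizes T and k are toy assumptions):

```python
import random
random.seed(0)

T, k = 8, 6      # toy numbers of labellings and samples
# Stand-in losses: L[i][j] plays the role of L_{D_i}(h_i^j).
L = [[random.random() for j in range(k)] for i in range(T)]

row_avg = [sum(row) / k for row in L]                           # (1/k) sum_j L[i][j]
col_avg = [sum(L[i][j] for i in range(T)) / T for j in range(k)]  # (1/T) sum_i L[i][j]

# Swapping the order of summation leaves the overall average unchanged.
overall = sum(row_avg) / T
assert abs(overall - sum(col_avg) / k) < 1e-12

# A max is at least an average, and an average is at least a min.
assert max(row_avg) >= overall >= min(col_avg)
print("max >= average >= min holds")
```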
If we look at the leftmost sum and its preceding 1/k, we see that we have another, different average.
Note that E_j[(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
This follows because the expectation is calculated over all j ∈ [k], and all k sample sequences are equally likely.
Just as we knew before that a max is at least an average, so do we also know that an average is at least a minimum.
E_j[(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)] = (1/k) ∑_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j).
Chaining these inequalities together:

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j)
So it is enough to prove that:

min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4
Recall that:

L_{D_i}(h_i^j) = E[1_{h_i^j(x)≠f_i(x)}] = (1/(2m)) ∑_{x∈C} 1_{h_i^j(x)≠f_i(x)}.
Before we plug this into the derivation above, we can do a little work on this sum. What we want is a nice lower bound, independent of h_i^j.
Recall that A only gets to see m examples from a set C of 2m elements. Moreover it's possible that A will get the same example more than once in its sample. So A will see at most m unique examples.

Let v_1, …, v_p be the examples of C that never appear in the sample S_i^j, and let p be their number. Since only as many as half of C can be seen, at least half won't be seen.
Therefore p ≥ m and 1/(2m) ≥ 1/(2p).
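We can sanity-check this counting with a quick simulation (m = 5 and the number of trials are arbitrary choices):

```python
import random
random.seed(1)

m = 5
C = list(range(2 * m))

# Draw many samples of m examples (with repetition) and count the unseen ones.
for _ in range(1000):
    sample = [random.choice(C) for _ in range(m)]
    unseen = [x for x in C if x not in sample]
    p = len(unseen)
    assert p >= m                       # at most m distinct seen, so at least m unseen
    assert 1 / (2 * m) >= 1 / (2 * p)   # hence 1/(2m) >= 1/(2p)

print("p was always at least", m)
```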
Let's first note (letting h = h_i^j, since the indices don't matter here) that

(1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}.
This follows because the v_r are simply a subset of C and the indicator function is non-negative.
We can frame this another way. Let V be the set of the v_r, so V ⊂ C, and

(1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{x∈V} 1_{h(x)≠f_i(x)} = (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}
Since 1/(2m) ≥ 1/(2p), we can say that

L_{D_i}(h) = (1/(2m)) ∑_{x∈C} 1_{h(x)≠f_i(x)} ≥ (1/(2m)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)} ≥ (1/(2p)) ∑_{r=1}^p 1_{h(v_r)≠f_i(v_r)}
Note that we didn't use any properties of h to make this argument. So for all h_i^j we have
L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p 1_{h_i^j(v_r)≠f_i(v_r)}
Averaging over i ∈ [T] then gives:

(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/T) ∑_{i∈[T]} (1/(2p)) ∑_{r=1}^p 1_{h_i^j(v_r)≠f_i(v_r)}.
We will again find it useful to change the order of summation, which is also justified because we are summing
over a finite number of terms.
(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}.
We almost have an average on the RHS. If we factor out the 1/2, then we'll have

E_{r∈[p]}[(1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}],

so that

(1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ (1/(2p)) ∑_{r=1}^p (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = (1/2) E_{r∈[p]}[(1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}] ≥ (1/2) min_{r∈[p]} (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}.
Let's think about this rightmost sum, which is just a function g of r:

g(r) = ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)}
This sum counts the errors of all the possible h_i^j on just the one unseen element v_r.
Recall also that each h_i^j = A(S_i^j) is determined by the labels f_i applied to the examples and the order j ∈ [k] in which the examples were given to A.
Let's take any r ∈ [p] and consider it fixed. Out of all of the T labellings f_i, we can always find pairs (f_i, f_I) that agree on every x ∈ C other than v_r.
To make this more intuitive, imagine ordering the elements of C as x_1, x_2, x_3, …, x_{2m}. Then each f_i can be represented by a sequence of 2m bits, and every such sequence of 2m bits represents an f_i. Half of the f_i will give f_i(v_r) = 1 and the other half will give f_i(v_r) = 0. Each f_i that returns 1 at v_r will have its unique partner f_I that returns 0 at v_r and agrees with f_i everywhere else.
Now the fact that v_r ∈ V becomes important. The algorithm A never sees the difference between the labelling f_i and its almost-twin f_I. In other words, S_i^j = S_I^j.
Since the same input to the algorithm guarantees the same output, this gives h_i^j = h_I^j.
To make the next step a little clearer, let E be the set of the f_i such that f_i(v_r) = 1.
So

g(r) = ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = ∑_{f_i∈E} (1_{h_i^j(v_r)≠f_i(v_r)} + 1_{h_I^j(v_r)≠f_I(v_r)}) = |E| = T/2.
Note that when we indexed over E we summed the indicator functions in pairs rather than separately. Since h_i^j = h_I^j while f_i(v_r) ≠ f_I(v_r), exactly one indicator in each pair equals 1, so each pair contributes 1 and the total is |E| = T/2.
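The pairing argument can be verified by brute force for a tiny C. The learner below is a hypothetical stand-in (it memorizes its sample and predicts the majority label elsewhere); the argument says g(r) = T/2 for any deterministic A, so any rule would give the same count:

```python
from itertools import product

m = 2
C = list(range(2 * m))
labellings = list(product([0, 1], repeat=len(C)))   # all T = 2^(2m) labellings
T = len(labellings)

def A(sample):
    """Hypothetical learner: memorize the sample, predict the majority label elsewhere."""
    memory = dict(sample)
    labels = [y for _, y in sample]
    majority = 1 if 2 * sum(labels) >= len(labels) else 0
    return lambda x: memory.get(x, majority)

sample_points = (0, 1)          # a fixed sequence j of m = 2 draws
v_r = 3                         # an element of C the learner never sees

# Count the errors of all the h_i^j at the single unseen point v_r.
errors = sum(1 for f in labellings
             if A([(x, f[x]) for x in sample_points])(v_r) != f[v_r])

assert errors == T // 2         # g(r) = T/2, as the pairing argument predicts
print(errors, "of", T)          # 8 of 16
```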
So

(1/2) min_{r∈[p]} (1/T) ∑_{i∈[T]} 1_{h_i^j(v_r)≠f_i(v_r)} = (1/2) min_{r∈[p]} (1/T) g(r) = (1/2)(1/T)(T/2) = 1/4
Hence (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4.
Note that we didn't use the particular ordering j ∈ [k] or the distribution D_i, since our pairs (f_i, f_I) ranged over all of the labellings (and hence all of the distributions). So the above inequality holds for all labellings and all orders.
Hence

max_{i∈[T]} (1/k) ∑_{j∈[k]} L_{D_i}(h_i^j) ≥ min_{j∈[k]} (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4,
since (1/T) ∑_{i∈[T]} L_{D_i}(h_i^j) ≥ 1/4 no matter the sample.
Finally, we choose a maximizing index i and then the distribution D_i that labels examples with f_i.
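Putting the whole argument to the test: for m = 2 we can enumerate every labelling and every sample sequence and compute max_{i∈[T]} E_{S_i^j}[L_{D_i}(h_i^j)] exactly. The learner is again a hypothetical stand-in (memorize the sample, predict 0 elsewhere); the theorem promises at least 1/4 for any choice:

```python
from itertools import product

m = 2
C = list(range(2 * m))
labellings = list(product([0, 1], repeat=len(C)))   # T = 2^(2m) labellings
samples = list(product(C, repeat=m))                # k = (2m)^m ordered samples

def A(sample):
    """Hypothetical learner: memorize the sample, predict 0 on anything unseen."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

def risk(h, f):
    """L_{D_i}(h): average 0-1 loss over the 2m equally likely examples."""
    return sum(1 for x in C if h(x) != f[x]) / len(C)

# E_{S_i^j}[L_{D_i}(h_i^j)] for each labelling f_i: average over the k samples.
expected = []
for f in labellings:
    total = 0.0
    for seq in samples:
        h = A([(x, f[x]) for x in seq])
        total += risk(h, f)
    expected.append(total / len(samples))

print(max(expected))          # 0.5625, comfortably above the 1/4 bound
assert max(expected) >= 0.25
```

Here the worst labelling for this particular learner is the all-ones one, but the point of the theorem is that some labelling must cost at least 1/4 no matter which learner we plug in.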