
Chapter 4

Sample-Based Classification

Optimal classification requires full knowledge of the feature-label distribution. In practice, that is
a relatively rare scenario, and a combination of distributional knowledge and sample data must be
employed to obtain a classifier. In fact, in the past few decades, there has been an emphasis on
distribution-free, purely data-driven classification algorithms. In this chapter, we will introduce
the basic concepts related to sample-based classification, including sampling, designed classifiers
and error rates, and consistency. The chapter ends with a section showing that distribution-free
classification rules have severe limitations. The material in this chapter provides the foundation for
the next several chapters on sample-based classification.

4.1 Classification Rules

The training data set $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ consists of $n$ sample feature vectors and
their associated labels, which are typically produced by performing a vector measurement $X_i$ on
each of $n$ specimens in an experiment, and then having an "expert" produce a label $Y_i$ for each
specimen. We assume that $S_n$ is an independent and identically distributed (i.i.d.) sample from the
feature-label distribution $p(x, y)$; i.e., the set of sample points is independent and each sample point
has distribution $p(x, y)$ (but see Section 4.5.2 for a different scenario). This implies that, for each
specimen corresponding to feature vector $X_i$, the "expert" produces a label $Y_i = 0$ with probability
$P(Y = 0 \mid X_i)$ and a label $Y_i = 1$ with probability $P(Y = 1 \mid X_i)$. Hence, the labels in the training
data are not the "true" labels, as in general there is no such thing; the labels are simply assumed
to be assigned with the correct probabilities. In addition, notice that the numbers $N_0 = \sum_{i=1}^n I_{Y_i = 0}$
and $N_1 = \sum_{i=1}^n I_{Y_i = 1}$ of sample points from class 0 and class 1 are binomial random variables with
parameters $(n, 1-p)$ and $(n, p)$, respectively, where $p = P(Y = 1)$. Obviously, $N_0$ and $N_1$ are not
independent, since $N_0 + N_1 = n$. (All of these concepts extend immediately to any number of classes $c > 2$.)

Given the training data $S_n$ as input, a classification rule is an operator $\Psi_n$ that outputs a trained
classifier $\psi_n$. The subscript "$n$" reminds us that $\psi_n$ is a function of the data $S_n$ (it plays a similar
role to the "hat" notation used for estimators in classical statistics). It is important to understand
the difference between a classifier and a classification rule; the latter does not output class labels,
but rather classifiers.

Formally, let $\mathcal{C}$ denote the set of classifiers of interest; e.g., $\mathcal{C}$ may be the set of all Borel-measurable
functions from $R^d$ into $\{0, 1\}$ (this includes all functions of interest in practice). Then a classification
rule is defined as a mapping $\Psi_n : [R^d \times \{0,1\}]^n \to \mathcal{C}$. In other words, $\Psi_n$ maps sample data
$S_n \in [R^d \times \{0,1\}]^n$ into a classifier $\psi_n = \Psi_n(S_n) \in \mathcal{C}$.

Example 4.1. (Nearest-Centroid Classification Rule.) Consider the following simple classifier:
$$\psi_n(x) = \begin{cases} 1, & \|x - \hat{\mu}_1\| < \|x - \hat{\mu}_0\|, \\ 0, & \text{otherwise,} \end{cases} \tag{4.1}$$
where
$$\hat{\mu}_0 = \frac{1}{N_0} \sum_{i=1}^n X_i I_{Y_i = 0} \quad \text{and} \quad \hat{\mu}_1 = \frac{1}{N_1} \sum_{i=1}^n X_i I_{Y_i = 1} \tag{4.2}$$
are the sample means for each class. In other words, the classifier assigns to the test point x the label
of the nearest (sample) class mean. It is easy to see that this classification rule produces hyperplane
decision boundaries. By replacing the sample mean with other types of centroids (e.g., the sample
median), a family of like-minded classification rules can be obtained. Notice the similarity between
(4.1) and (??) — more on this in Chapter 5.
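
As an illustration, here is a minimal Python sketch of this rule (using NumPy); the function names and the synthetic Gaussian data are choices made here for illustration, not part of the example.

```python
import numpy as np

def nearest_centroid_rule(X, y):
    """Train the nearest-centroid classifier of (4.1)-(4.2).

    X: (n, d) array of feature vectors; y: (n,) array of 0/1 labels.
    Returns the designed classifier psi_n as a function of a test point x.
    """
    mu0_hat = X[y == 0].mean(axis=0)   # sample mean of class 0, eq. (4.2)
    mu1_hat = X[y == 1].mean(axis=0)   # sample mean of class 1, eq. (4.2)

    def psi_n(x):
        # label 1 iff x is strictly closer to mu1_hat; ties go to label 0, as in (4.1)
        return int(np.linalg.norm(x - mu1_hat) < np.linalg.norm(x - mu0_hat))

    return psi_n

# toy usage on synthetic data (illustrative parameters)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
psi_n = nearest_centroid_rule(X, y)
print(psi_n(np.array([1.8, 2.1])))   # expected to print 1
```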

Example 4.2. (Nearest-Neighbor Classification Rule.) Another simple classifier is given by
$$\psi_n(x) = Y_{(1)}(x), \tag{4.3}$$
where $(X_{(1)}(x), Y_{(1)}(x))$ is a training point such that
$$X_{(1)}(x) = \arg\min_{X_i \in \{X_1, \ldots, X_n\}} \|X_i - x\|. \tag{4.4}$$

In other words, the classifier assigns to the test point x the label of the nearest neighbor in the
training data. The decision boundaries produced by this classification rule are very complex (see
Chapter 6). A family of similar classification rules is obtained by replacing the Euclidean norm with
other metrics (e.g., the correlation). In addition, a straightforward generalization of this classification
rule is obtained by assigning to the test point the majority label in the set of k nearest training
points (k = 1 yielding the previous case), with odd k to avoid ties. This is called the k-nearest
neighbor classification rule, which will be studied in detail in Chapter 6.
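
A minimal Python sketch of this rule and its k-nearest-neighbor generalization follows; the Euclidean metric matches (4.4), while the function names and the toy data are illustrative choices.

```python
import numpy as np

def knn_rule(X, y, k=1):
    """k-nearest-neighbor classification rule (odd k avoids ties); k=1 gives (4.3)-(4.4)."""
    def psi_n(x):
        dist = np.linalg.norm(X - x, axis=1)   # Euclidean distances to all training points
        nearest = np.argsort(dist)[:k]         # indices of the k nearest neighbors
        return int(y[nearest].sum() > k / 2)   # majority label among the k neighbors
    return psi_n

# toy usage (illustrative data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
x_test = np.array([2.5, 3.0])
print(knn_rule(X, y, k=1)(x_test))   # 1-NN prediction
print(knn_rule(X, y, k=5)(x_test))   # 5-NN prediction
```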

Figure 4.1: Discrete histogram rule: top shows distribution of the sample data in the bins; bottom
shows the designed classifier.

Example 4.3. (Discrete Histogram Rule.) Assume that $p(x)$ is concentrated over a finite number
of points $\{x^1, \ldots, x^b\}$ in $R^d$. This corresponds to the case where the measurement $X$ can yield
only a finite number of different values. Let $U_j$ and $V_j$ be the number of training points with
$(X_i = x^j, Y_i = 0)$ and $(X_i = x^j, Y_i = 1)$, respectively, for $j = 1, \ldots, b$. The discrete histogram rule
is given by
$$\psi_n(x^j) = \begin{cases} 1, & U_j < V_j, \\ 0, & \text{otherwise,} \end{cases} \tag{4.5}$$
for $j = 1, \ldots, b$. In other words, the discrete histogram rule assigns to $x^j$ the majority label among
the training points that coincide with $x^j$. In the case of a tie, the label is set to zero. See Figure 4.1
for an illustration.
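
A possible Python sketch of the discrete histogram rule (4.5) follows, representing the bins $x^j$ simply as hashable values; the function names and the toy data are illustrative.

```python
from collections import defaultdict

def discrete_histogram_rule(xs, ys):
    """Discrete histogram rule (4.5): majority vote within each bin, ties assigned to 0."""
    U = defaultdict(int)   # U[x]: number of class-0 training points in bin x
    V = defaultdict(int)   # V[x]: number of class-1 training points in bin x
    for x, y in zip(xs, ys):
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    # psi_n(x) = 1 iff U_j < V_j; bins never seen in training default to label 0
    return lambda x: int(U[x] < V[x])

xs = [0, 0, 0, 1, 1, 2, 2]            # observed bin values
ys = [0, 1, 1, 0, 0, 1, 0]            # their labels
psi_n = discrete_histogram_rule(xs, ys)
print([psi_n(j) for j in range(3)])   # prints [1, 0, 0]
```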

If random factors are allowed in the definition of a classification rule (e.g., allowing a random tie-
breaking procedure in the discrete histogram rule), then one ends up with random classifiers. Unless
otherwise stated, all classification rules considered in the sequel are nonrandom. (See Section 4.5.1
for examples of random classification rules.)

4.2 Classification Error Rates

Given a classification rule $\Psi_n$, the error of a designed classifier $\psi_n = \Psi_n(S_n)$ trained on data $S_n$ is
given by
$$\varepsilon_n = P(\Psi_n(S_n)(X) \neq Y \mid S_n) = P(\psi_n(X) \neq Y \mid S_n). \tag{4.6}$$

Here, $(X, Y)$ can be considered to be a test point, which is independent of $S_n$. Notice that $\varepsilon_n$ is
similar to the classifier error defined in (??). However, there is a fundamental difference between the
two error rates, as $\varepsilon_n$ is a function of the random sample data $S_n$ and therefore a random variable.
On the other hand, assuming that $\Psi_n$ is nonrandom, the classification error $\varepsilon_n$ is an ordinary real
number once the data $S_n$ is specified and fixed. The error $\varepsilon_n$ is sometimes called the conditional
error, since it is conditioned on the data.

Another important error rate in sample-based classification is the expected error:

µn = E["n ] = P ( n (X) 6= Y ) . (4.7)

This error rate is nonrandom. It is sometimes called the unconditional error.

Comparing "n and µn , we observe that the conditional error "n is usually the one of most practical
interest, since it is the error of the classifier designed on the actual sample data at hand. Nevertheless,
µn can be of interest because it is data-independent: it is a function only of the classification rule.
Therefore, µn can be used to define global properties of classification rules, such as consistency
(see the next section). In addition, since it is nonrandom, µn can be bounded, tabulated and
plotted. This can be convenient in both analytical and empirical (simulation) studies. Finally, the
most common criterion for comparing the performance of classification rules is to pick the one with
smallest expected error µn , for a fixed given sample size n.

Similarly to what was done in (??), we can define class-specific error rates:
$$\varepsilon_n^0 = P(\psi_n(X) \neq Y \mid Y = 0, S_n), \qquad \varepsilon_n^1 = P(\psi_n(X) \neq Y \mid Y = 1, S_n). \tag{4.8}$$
Clearly, $\varepsilon_n = (1-p)\varepsilon_n^0 + p\varepsilon_n^1$. We also define the expected error rates $\mu_n^0 = E[\varepsilon_n^0]$ and $\mu_n^1 = E[\varepsilon_n^1]$.

4.3 Consistency

Consistency has to do with the natural requirement that, as the sample size $n$ increases to infinity, the
classification error should in some sense converge to the Bayes error. Accordingly, the classification
rule $\Psi_n$ (throughout this section, what we call a classification rule $\Psi_n$ is actually a sequence
$\{\Psi_n; n = 1, 2, \ldots\}$) is said to be consistent if, as $n \to \infty$,
$$\varepsilon_n \to \varepsilon^*, \quad \text{in probability,} \tag{4.9}$$
that is, given any $\tau > 0$, $P(|\varepsilon_n - \varepsilon^*| > \tau) \to 0$. (See Section ?? for a review of modes of convergence
for random variables.) In other words, for a large sample size $n$, $\varepsilon_n$ will be near $\varepsilon^*$ with large
probability. The classification rule $\Psi_n$ is said to be strongly consistent if, as $n \to \infty$,
$$\varepsilon_n \to \varepsilon^*, \quad \text{with probability 1,} \tag{4.10}$$

that is, P ("n ! "⇤ ) = 1. Since convergence with probability 1 implies convergence in probability,
strong consistency imples ordinary (“weak”) consistency. Strong consistency is a much more de-
manding criterion than ordinary consistency. It roughly requires "n to converge to "⇤ for almost
all possible sequences of training data {Sn ; n = 1, 2, . . .}. In a very real sense, however, ordinary
consistency is enough for practical purposes. (Recall Example ?? and the remark that follows it.)
Furthermore, all commonly used consistent classification rules turn out, interestingly, to be strongly
consistent as well.

The previous definitions hold for a given fixed feature-label distribution $p(x, y)$, so a classification
rule can be consistent under one feature-label distribution but not under another. A universally
(strongly) consistent classification rule is consistent under any distribution; hence, universal consis-
tency is a property of the classification rule alone. Many were skeptical that such classification rules
existed, until Stone's 1977 paper (see Chapter 6).

Example 4.4. (Consistency of the Nearest-Centroid Classification Rule.) The classifier in (4.1) can
be written as:
$$\psi_n(x) = \begin{cases} 1, & a_n^T x + b_n > 0, \\ 0, & \text{otherwise,} \end{cases} \tag{4.11}$$
where $a_n = \hat{\mu}_1 - \hat{\mu}_0$ and $b_n = -(\hat{\mu}_1 - \hat{\mu}_0)^T(\hat{\mu}_1 + \hat{\mu}_0)/2$ (use the fact that $\|x - \hat{\mu}\|^2 = (x - \hat{\mu})^T(x - \hat{\mu})$).
Now, assume that the feature-label distribution of the problem is specified by multivariate spherical
Gaussian densities $p(x \mid Y = 0) \sim N_d(\mu_0, I_d)$ and $p(x \mid Y = 1) \sim N_d(\mu_1, I_d)$, with $\mu_0 \neq \mu_1$ and
$P(Y = 0) = P(Y = 1)$. The classification error is given by
$$\begin{aligned}
\varepsilon_n &= P(\psi_n(X) = 1 \mid Y = 0)\,P(Y = 0) + P(\psi_n(X) = 0 \mid Y = 1)\,P(Y = 1) \\
&= \frac{1}{2}\left( P(a_n^T X + b_n > 0 \mid Y = 0) + P(a_n^T X + b_n \le 0 \mid Y = 1) \right) \\
&= \frac{1}{2}\left( \Phi\!\left(\frac{a_n^T \mu_0 + b_n}{\|a_n\|}\right) + \Phi\!\left(-\frac{a_n^T \mu_1 + b_n}{\|a_n\|}\right) \right),
\end{aligned} \tag{4.12}$$
where $\Phi(\cdot)$ is the CDF of a standard Gaussian and we used the fact that $a_n^T X + b_n \mid Y = i \sim
N(a_n^T \mu_i + b_n, \|a_n\|^2)$, for $i = 0, 1$ (see Section ?? for the properties of the multivariate Gaussian
distribution). We also know from (??) that the Bayes error for this problem is
$$\varepsilon^* = \Phi\!\left(-\frac{\|\mu_1 - \mu_0\|}{2}\right). \tag{4.13}$$
Now, by the vector version of the Law of Large Numbers (cf. Thm. ??), we know that, with
probability 1, $\hat{\mu}_0 \to \mu_0$ and $\hat{\mu}_1 \to \mu_1$, so that $a_n \to a = \mu_1 - \mu_0$ and $b_n \to b = -(\mu_1 - \mu_0)^T(\mu_1 + \mu_0)/2$,
as $n \to \infty$. Furthermore, $\varepsilon_n$ in (4.12) is a continuous function of $a_n$ and $b_n$, so we can write
$$\varepsilon_n = \varepsilon_n(a_n, b_n) \to \varepsilon(a, b) = \Phi\!\left(-\frac{\|\mu_1 - \mu_0\|}{2}\right) = \varepsilon^* \quad \text{with probability 1,} \tag{4.14}$$
as can be easily verified. Hence, the nearest-centroid classification rule is strongly consistent under
spherical Gaussian densities with the same variance and equally-likely classes.

The nearest-centroid classification rule is not universally consistent; examples of non-Gaussian
feature-label distributions can be given for which the classification error does not converge to the
Bayes error as sample size increases.
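
The convergence in Example 4.4 can be checked numerically by evaluating the exact error expression (4.12) for classifiers designed on increasingly large samples; the particular means, dimension, and sample sizes below are arbitrary illustrative choices, and the standard Gaussian CDF is implemented via the error function.

```python
import math
import numpy as np

def Phi(t):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = np.random.default_rng(0)
d = 2
m0, m1 = np.zeros(d), np.ones(d)
bayes = Phi(-np.linalg.norm(m1 - m0) / 2)          # eps* from (4.13)

def eps_n(n):
    """Exact conditional error (4.12) of a nearest-centroid classifier designed on n points."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, d)) + np.where(y[:, None] == 1, m1, m0)
    mu0_hat, mu1_hat = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    a = mu1_hat - mu0_hat
    b = -(mu1_hat - mu0_hat) @ (mu1_hat + mu0_hat) / 2
    return 0.5 * (Phi((a @ m0 + b) / np.linalg.norm(a))
                  + Phi(-(a @ m1 + b) / np.linalg.norm(a)))

for n in (20, 100, 1000, 10000):
    print(n, round(eps_n(n), 4), "-> Bayes error", round(bayes, 4))
```
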
Example 4.5. (Consistency of the Discrete Histogram Rule.) With $c_0 = P(Y = 0)$, $c_1 = P(Y = 1)$,
$p_j = P(X = x^j \mid Y = 0)$, and $q_j = P(X = x^j \mid Y = 1)$, for $j = 1, \ldots, b$, we have that
$$\eta(x^j) = P(Y = 1 \mid X = x^j) = \frac{c_1 q_j}{c_0 p_j + c_1 q_j}, \tag{4.15}$$
for $j = 1, \ldots, b$. Therefore, the Bayes classifier is
$$\psi^*(x^j) = I_{\eta(x^j) > 1/2} = I_{c_1 q_j > c_0 p_j}, \tag{4.16}$$
for $j = 1, \ldots, b$, with Bayes error
$$\varepsilon^* = E[\min\{\eta(X), 1 - \eta(X)\}] = \sum_{j=1}^b \min\{c_0 p_j, c_1 q_j\}. \tag{4.17}$$

Now, the error of the classifier in (4.5) can be written as:
$$\begin{aligned}
\varepsilon_n = P(\psi_n(X) \neq Y) &= \sum_{j=1}^b P(X = x^j, \psi_n(x^j) \neq Y) \\
&= \sum_{j=1}^b \left[ P(X = x^j, Y = 0)\, I_{\psi_n(x^j) = 1} + P(X = x^j, Y = 1)\, I_{\psi_n(x^j) = 0} \right] \\
&= \sum_{j=1}^b \left[ c_0 p_j I_{V_j > U_j} + c_1 q_j I_{U_j \ge V_j} \right].
\end{aligned} \tag{4.18}$$

Clearly $U_j$ is a binomial random variable with parameters $(n, c_0 p_j)$. To see this, note that $U_j$ is the
number of times that one of the $n$ training points independently falls into the "bin" $(X = x^j, Y = 0)$,
which happens with probability $c_0 p_j$. It follows from the Law of Large Numbers (cf. Thm. ??) that $U_j/n \to c_0 p_j$ a.s.
as $n \to \infty$. Similarly, $V_j$ is a binomial random variable with parameters $(n, c_1 q_j)$ and $V_j/n \to c_1 q_j$
a.s. as $n \to \infty$. It follows that $I_{V_j/n > U_j/n} \to I_{c_1 q_j > c_0 p_j}$ a.s., provided that $c_1 q_j \neq c_0 p_j$, as the function
$I_{u > v}$ is continuous everywhere except at $u = v$. But we can rewrite (4.18) as
$$\varepsilon_n = \sum_{\substack{j=1 \\ c_0 p_j = c_1 q_j}}^{b} c_0 p_j \;+ \sum_{\substack{j=1 \\ c_0 p_j \neq c_1 q_j}}^{b} \left[ c_0 p_j I_{V_j/n > U_j/n} + c_1 q_j I_{U_j/n \ge V_j/n} \right]. \tag{4.19}$$

Therefore, "n ! "⇤ a.s. and the discrete histogram rule is strongly consistent.

Figure 4.2: Representation of the expected classification error vs. sample size for a consistent clas-
sification rule.

The following result, which is a simple application of Theorem ??, shows that consistency can be
fully characterized by the behavior of the expected classification error as sample size increases.

Theorem 4.1. The classification rule $\Psi_n$ is consistent if and only if, as $n \to \infty$,
$$E[\varepsilon_n] \to \varepsilon^*. \tag{4.20}$$

Proof. Note that $\{\varepsilon_n; n = 1, 2, \ldots\}$ is a uniformly bounded random sequence, since $0 \le \varepsilon_n \le 1$ for
all $n$. It follows from Theorem ?? that $\varepsilon_n \to \varepsilon^*$ in probability if and only if $\varepsilon_n \to \varepsilon^*$ in $L^1$, that is,
if and only if, as $n \to \infty$,
$$E[|\varepsilon_n - \varepsilon^*|] = E[\varepsilon_n - \varepsilon^*] = E[\varepsilon_n] - \varepsilon^* \to 0, \tag{4.21}$$
proving the assertion. $\square$

Theorem 4.1 states that consistency is characterized entirely by the first moment of the random
variable $\varepsilon_n$ as $n$ increases. This is of course not sufficient for strong consistency, which depends
on the behavior of the entire distribution of $\varepsilon_n$. Notice that $\{\mu_n; n = 1, 2, \ldots\}$ is a sequence of
real numbers (not random variables) and the convergence in (4.20) is ordinary convergence, so the
sequence can be plotted to obtain a graphical representation of consistency. See Figure 4.2 for an
illustration, where the expected classification error is represented as a continuous function of $n$ for
ease of interpretation.

Example 4.6. (Consistency of the Nearest-Neighbor Classification Rule.) In Chapter 6 it will be
shown that the expected error of the nearest-neighbor classification rule of Example 4.2 satisfies
$\lim_{n \to \infty} E[\varepsilon_n] \le 2\varepsilon^*$. Assume that the feature-label distribution is such that $\varepsilon^* = 0$. Then, by
Theorem 4.1, the nearest-neighbor classification rule is consistent.

The nearest-neighbor classification rule requires $\varepsilon^* = 0$ to be consistent. This is a very special
case; according to the zero-one law for perfect discrimination (cf. Exercise XX) this is equivalent to
requiring class-conditional densities that do not overlap. In fact, the k-nearest-neighbor classification
rule is not universally consistent, for any fixed $k = 1, 2, \ldots$ However, we will see in Chapter 6 that the
k-nearest-neighbor classification rule is universally consistent, provided that $k$ is allowed to increase
with $n$ at a specified rate.

Consistency is a large-sample property, and could be irrelevant in small-sample cases, as non-
consistent classification rules are typically better than consistent classification rules when the train-
ing data size is small. The reason is that consistent classification rules, especially universal ones,
tend to be complex, while non-consistent ones are often simpler. We saw this counterintuitive phe-
nomenon represented in the "scissors plot" of Figure Fig-basic(b). In that plot, the blue curve
represents the expected error of a consistent classification rule, while the green curve corresponds to
a rule that is not consistent. However, the non-consistent classification rule is still better at small
sample sizes ($n < N_0$ in the plot), where the performance of the complex consistent rule degrades
due to overfitting. The precise value of $N_0$ is very difficult to pinpoint, as it depends on the complexity
of the classification rules, the dimensionality of the feature vector, and the feature-label distribution.
We will have more to say about this topic in later chapters.

4.4 No-Free-Lunch Theorems

Universal consistency is a remarkable property in that it appears to imply that no knowledge at all
about the feature-label distribution is needed to obtain optimal performance, i.e., a purely data-
driven approach obtains performance arbitrarily close to the optimal performance if one has a large
enough sample.

The next two theorems, due to Devroye, show that this is deceptive. They are sometimes called "No-
Free-Lunch" theorems, as they imply that some knowledge about the feature-label distribution must
be obtained to guarantee acceptable performance (or at least, to avoid terrible performance), after
all. In the interest of conciseness, we present the theorems without proof (both theorems are stated
and proved in [?]). The proofs are based on finding simple feature-label distributions (in fact, discrete
ones with zero Bayes error) that are "bad" enough.

The first theorem states that all classification rules can be arbitrarily bad at finite sample sizes. In
the case of universally consistent classification rules, this means that one can never know if their
finite-sample performance will be satisfactory, no matter how large n is (unless one knows something
about the feature-label distribution).

Theorem 4.2. For every $\tau > 0$, integer $n$, and classification rule $\Psi_n$, there exists a feature-label
distribution $p(x, y)$ (with $\varepsilon^* = 0$) such that
$$E[\varepsilon_n] \ge \frac{1}{2} - \tau. \tag{4.22}$$

The feature-label distribution in the previous theorem may have to be different for different $n$. The
next remarkable theorem applies to a fixed feature-label distribution and implies that, though one
may get $E[\varepsilon_n] \to \varepsilon^*$ in a distribution-free manner, one must know something about the feature-label
distribution in order to guarantee a rate of convergence.

Theorem 4.3. For every classification rule $\Psi_n$, there exists a monotonically decreasing sequence
$a_1 \ge a_2 \ge a_3 \ge \cdots$ converging to zero such that there is a feature-label distribution $p(x, y)$ (with
$\varepsilon^* = 0$) for which
$$E[\varepsilon_n] \ge a_n, \tag{4.23}$$
for all $n = 1, 2, \ldots$

As an example, for the discrete histogram rule with $\varepsilon^* = 0$, one can show that there exists a constant
$r > 0$ such that $E[\varepsilon_n] < e^{-rn}$, for $n = 1, 2, \ldots$ (see Exercise 2). This, however, does not contradict
Theorem 4.3, because the constant $r > 0$ is distribution-dependent; in fact, it can be made as close
to zero as desired by a suitable choice of distribution.

4.5 Additional Topics

4.5.1 Ensemble Classification

Ensemble classification rules combine the decisions of multiple classification rules by majority voting.
This is an application of the "wisdom of the crowds" principle, which can reduce overfitting and
increase accuracy over the component classification rules (which are called in some contexts "weak"
learners).

Formally, given a set of classification rules $\{\Psi_n^1, \ldots, \Psi_n^m\}$, an ensemble classification rule $\Psi_{n,m}^E$ pro-
duces a classifier
$$\psi_{n,m}^E(x) = \begin{cases} 1, & \frac{1}{m} \sum_{j=1}^m \Psi_n^j(S_n)(x) > \frac{1}{2}, \\ 0, & \text{otherwise.} \end{cases} \tag{4.24}$$

In other words, the ensemble classifier assigns label 1 to the test point x if a majority of the
component classification rules produce label 1 on x; otherwise, it assigns label 0.
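
A minimal Python sketch of the majority-vote combination (4.24) follows; the component classifiers are represented simply as functions returning 0 or 1, and the hand-made threshold classifiers in the usage example are purely illustrative.

```python
def ensemble_classifier(component_classifiers):
    """Combine trained classifiers psi^1, ..., psi^m by majority vote, as in (4.24)."""
    m = len(component_classifiers)
    def psi_E(x):
        votes = sum(psi(x) for psi in component_classifiers)   # number of label-1 votes
        return int(votes / m > 0.5)                            # majority -> 1, ties -> 0
    return psi_E

# toy usage with three hand-made "classifiers" on scalar inputs
psi1 = lambda x: int(x > 0.0)
psi2 = lambda x: int(x > 1.0)
psi3 = lambda x: int(x > 2.0)
psi_E = ensemble_classifier([psi1, psi2, psi3])
print(psi_E(1.5))   # votes 1, 1, 0 -> label 1
print(psi_E(0.5))   # votes 1, 0, 0 -> label 0
```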

In practice, ensemble classifiers are almost always produced by resampling. (This is a general
procedure in PR, which will be important again when we discuss error estimation in Chapter XX.)
Consider an operator $\tau : (S_n, \xi) \mapsto S_n^*$, which applies a "perturbation" to the training data $S_n$ and
produces a modified data set $S_n^*$. The variable $\xi$ represents random factors, rendering $S_n^*$ random,
given the data $S_n$. A base classification rule $\Psi_n$ is selected, and the component classification rules
are defined by $\Psi_n^j(S_n) = \Psi_n(\tau(S_n)) = \Psi_n(S_n^*)$. Notice that this produces random classification
rules, due to the randomness of $\tau$. It follows that the ensemble classification rule $\Psi_{n,m}^E$ in (4.24) is
likewise random. This means that different classifiers result from repeated application of $\Psi_{n,m}^E$ to
the same training data $S_n$. We will not consider random classification rules in detail in this book.

The perturbation $\tau$ may consist of taking random subsets of the data, adding small random noise
to the training points, flipping the labels randomly, and more. Here we consider in detail an
example of perturbation called bootstrap sampling. Given a fixed data set $S_n = \{(X_1 = x_1, Y_1 =
y_1), \ldots, (X_n = x_n, Y_n = y_n)\}$, the empirical feature-label distribution is a discrete distribution with
probability mass function given by $\hat{P}(X = x_i, Y = y_i) = \frac{1}{n}$, for $i = 1, \ldots, n$. A bootstrap sample is
a sample $S_n^*$ from the empirical distribution; it consists of $n$ equally-likely draws with replacement
from the original sample $S_n$. Some sample points will appear multiple times, whereas others will not
appear at all. The probability that any given sample point will not appear in $S_n^*$ is $(1 - 1/n)^n \approx e^{-1}$.
It follows that a bootstrap sample of size $n$ contains on average $(1 - e^{-1})n \approx 0.632\,n$ of the original
sample points. The ensemble classification rule in (4.24) is called in this case a bootstrap aggregate
and the procedure is called "bagging."
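
The following sketch illustrates bootstrap sampling and bagging under these definitions; the choice of the nearest-centroid rule of Example 4.1 as the base rule, the ensemble size m = 25, and the toy data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """n equally-likely draws with replacement from the empirical distribution."""
    idx = rng.integers(0, len(y), len(y))
    return X[idx], y[idx]

def bagging_rule(X, y, base_rule, m=25):
    """Bootstrap aggregation: majority vote of m classifiers trained on bootstrap samples."""
    classifiers = [base_rule(*bootstrap_sample(X, y)) for _ in range(m)]
    return lambda x: int(np.mean([psi(x) for psi in classifiers]) > 0.5)

def centroid_rule(X, y):                       # base ("weak") rule, as in Example 4.1
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda x: int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
psi_bag = bagging_rule(X, y, centroid_rule)
print(psi_bag(np.array([1.7, 1.9])))           # bagged prediction for a test point

# each bootstrap sample contains, on average, about 63.2% of the distinct original points
idx = rng.integers(0, 80, 80)
print(len(np.unique(idx)) / 80)                # roughly 1 - e^{-1} ~ 0.632
```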

4.5.2 Mixture Sampling vs. Separate Sampling

The assumption being made thus far is that the training data Sn is an i.i.d. sample from the feature-
label distribution p(x, y); i.e., the set of sample points is independent and each sample point has
distribution p(x, y). In this case, each Xi is distributed as p(x | Y = 0) with probability P (Y = 0)
or p(x | Y = 1) with probability P (Y = 1). It is common to say then that each Xi is sampled from
a mixture of the populations p(x | Y = 0) and p(x | Y = 1), with mixing proportions P (Y = 0) and
P (Y = 1), respectively.

This sampling design is pervasive in the PR literature — most papers and textbooks assume it,
often tacitly. However, suppose sampling is not from the mixture of populations, but rather from
each population separately, such that a nonrandom number n0 of sample points are drawn from
p(x | Y = 0), while a nonrandom number n1 of points are drawn from p(x | Y = 1), where
n0 + n1 = n. This separate sampling case is quite distinct from unconstrained random sampling,
where the numbers N0 and N1 of sample points from each class are binomial random variables
(see Section ??). In addition, in separate sampling, the labels Y1 , . . . , Yn are no longer independent:
knowing that, say, $Y_1 = 0$ is informative about the status of $Y_2$, since the number of points from class
0 is fixed. A key fact in the separate sampling case is that the class prior probabilities $p_0 = P(Y = 0)$
and $p_1 = P(Y = 1)$ are not estimable from the data. In the random sampling case, $\hat{p}_0 = N_0/n$ and
$\hat{p}_1 = N_1/n$ are unbiased estimators of $p_0$ and $p_1$, respectively; they are also consistent estimators,
i.e., $\hat{p}_0 \to p_0$ and $\hat{p}_1 \to p_1$ with probability 1 as $n \to \infty$, by virtue of the Law of Large Numbers
(Theorem ??). Therefore, provided that the sample size is large enough, $\hat{p}_0$ and $\hat{p}_1$ provide decent
estimates of the prior probabilities. However, $\hat{p}_0$ and $\hat{p}_1$ are clearly nonsensical estimators in the
separate sampling case. As a matter of fact, there is no sensible estimator of $p_0$ and $p_1$ under
separate sampling; there is simply no information about $p_0$ and $p_1$ in the training data.
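
The contrast between the two sampling designs can be seen in the following sketch, where the true prior $p_1 = 0.1$ and the class-conditional distributions are arbitrary illustrative choices: under mixture sampling, $N_1/n$ estimates $p_1$, whereas under separate sampling it merely returns the design ratio $n_1/n$ and carries no information about $p_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = 0.1                    # true prior P(Y = 1), unknown to the analyst
n = 1000

# Mixture (i.i.d.) sampling: labels drawn with probability p1, then features per class
y_mix = (rng.random(n) < p1).astype(int)
X_mix = rng.normal(loc=2 * y_mix)            # X | Y=0 ~ N(0,1), X | Y=1 ~ N(2,1)
print("mixture:  N1/n =", y_mix.mean())      # close to p1 = 0.1

# Separate sampling: n0 and n1 fixed by design (e.g., a case-control study)
n0, n1 = 500, 500
y_sep = np.array([0] * n0 + [1] * n1)
X_sep = rng.normal(loc=2 * y_sep)
print("separate: N1/n =", y_sep.mean())      # 0.5 by construction, says nothing about p1
```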

Separate sampling is a very common scenario in observational case-control studies in biomedicine,
which typically proceed by collecting data from the populations separately, where the separate
sample sizes $n_0$ and $n_1$, with $n_0 + n_1 = n$, are pre-determined experimental design parameters. The
reason is that one of the populations is often small (e.g., a rare disease). Sampling from the mixture
of healthy and diseased populations would produce a very small number of diseased subjects (or
none). Therefore, retrospective studies employ a fixed number of members of each population, under
the assumption that the outcomes (labels) are known in advance.

Separate sampling can be seen as an example of restricted random sampling, where in this case the
restriction corresponds to conditioning on $N_0 = n_0$, or equivalently, $N_1 = n_1$. Failure to account
for this restriction in sampling will manifest itself in two ways. First, it will affect the design of
classifiers that require, directly or indirectly, estimation of $p_0$ and $p_1$, in which case alternative
procedures, such as minimax classification (see Section ??), need to be employed; an example of this
will be given in Chapter 5. Second, it will affect population-wide average performance metrics,
such as the expected classification error rates. Under a separate sampling restriction, the expected
error rate is given by
$$\mu_{n_0, n_1} = E[\varepsilon_n \mid N_0 = n_0] = P(\psi_n(X) \neq Y \mid N_0 = n_0). \tag{4.25}$$
This is in general different, and sometimes greatly so, from the unconstrained expected error rate
$\mu_n$ (see Section ?? for examples). Failure to account for the sampling mechanism used to acquire
the data can have practical negative consequences in the analysis of performance of classification
algorithms, and even in the accuracy of the classifiers themselves.
