Similarity-Based Methods
1.1 Radial Basis Functions (RBF)
Rather than fix the value of k, the radial basis function (RBF) technique allows
all data points to contribute to g(x), but not equally. The natural choice is
that points further from x should contribute less; a radial basis function (or
kernel) $\phi(z)$ quantifies this contribution, where z will be the distance $\|x - x_i\|$.
The properties of a typical kernel are that it is positive, non-increasing in $|z|$,
and $\phi(\|x\|)$ integrates to 1. The most commonly used is the Gaussian kernel,

$$\phi(z) = \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}z^2}.$$

The final hypothesis is the kernel-weighted average of the target values,

$$g(x) = \frac{\sum_{i=1}^{N} \phi_i(x)\, y_i}{\sum_{i=1}^{N} \phi_i(x)}, \qquad (1.1)$$
where $\phi_i(x) = \phi(\|x - x_i\|/r)$. For classification, take the sign of the resulting
function. The "radial" in RBF is because the weights only depend on how far a
point is from x. The scale parameter r determines the kernel width. The smaller
the scale parameter, the more emphasis is placed on the nearer points. When
$r \to 0$, the kernel behaves like a delta function, picking out the nearest point;
in this limit, the final hypothesis is similar to the nearest neighbor rule.
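To make the nonparametric RBF concrete, here is a minimal Python sketch of the kernel-weighted average in equation (1.1); the toy data set and the function name `rbf_predict` are invented for this illustration:

```python
import math

def rbf_predict(x, X, y, r):
    """Nonparametric RBF estimate g(x): a weighted average of all targets y_i,
    weighted by a Gaussian kernel of the scaled distance ||x - x_i|| / r.
    (The kernel's normalizing constant cancels in the ratio.)"""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    phi = [math.exp(-0.5 * (dist(x, xi) / r) ** 2) for xi in X]
    return sum(p * yi for p, yi in zip(phi, y)) / sum(phi)

# Toy 1-d data set.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 0.0, 1.0]

# Small r behaves like the nearest neighbor rule; large r averages everything.
print(rbf_predict((0.9,), X, y, r=0.1))   # ~1.0 (nearest point dominates)
print(rbf_predict((0.9,), X, y, r=10.0))  # ~0.5 (close to the mean of y)
```

The two calls illustrate the two limits discussed above: as r shrinks, the nearest point's kernel value dominates the average; as r grows, all kernel values approach 1 and g(x) approaches the mean of the targets.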
[Figure: nonparametric RBF final hypotheses for r = 0.1 and r = 0.3]
As r gets large, the kernel width gets larger and more of the points contribute
to the value of g. Thus, the choice of r is similar to the choice of k; in fact,
the RBF technique with the window kernel is sometimes called the r-nearest
neighbor rule, because it uniformly weights all neighbors within distance r of
x. A small r leads to a more complex hypothesis. One way to choose r is
using cross validation. One can also get a universal consistency result. The
intuition is that r sets the width of the kernel, in that data points within r
of x contribute to g(x). The volume of this region is of order $r^d$, and so the
number of points in this region is of order $N r^d$. We want $r \to 0$ so that the
influential points are close to x. We also want the number of influential points
to be large, so we want $N r^d \to \infty$. Under these two conditions, one obtains
asymptotic (as $N \to \infty$) universal consistency, an analog of Theorem 1.2.
1.1.1 The RBF Network
Exercise 1.10

(a) For the Gaussian kernel, what is g(x) as $\|x\| \to \infty$ for the nonparametric RBF technique versus for the parametric RBF network with fixed $w_i$?

(b) Let $\Phi$ be the square feature matrix defined by $\Phi_{ij} = \phi_j(x_i)$, and
assume that $\Phi$ is invertible. Show that with $w = \Phi^{-1} y$, g(x) exactly
interpolates the data points, i.e. $g(x_i) = y_i$.
The RBF network is exactly a linear model with a non-linear transform into
an N-dimensional space determined by the kernel $\phi$. This is a linear model
with N parameters, whose VC dimension will generally be N. By tuning w,
we are able to fit the data exactly, and we will have poor generalization.
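The exact interpolation property of Exercise 1.10(b) is easy to verify numerically. The following sketch (data set and helper names invented for the example) builds the square feature matrix with one Gaussian bump per data point, solves $\Phi w = y$, and checks that $g(x_i) = y_i$:

```python
import math

def gaussian(z):
    return math.exp(-0.5 * z * z)

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

# One bump per data point: Phi_ij = phi(|x_i - x_j| / r).
X = [0.0, 1.0, 2.5]
y = [1.0, -1.0, 0.5]
r = 1.0
Phi = [[gaussian(abs(xi - xj) / r) for xj in X] for xi in X]
w = solve(Phi, y)  # w = Phi^{-1} y

# g exactly interpolates: g(x_i) = y_i at every data point.
def g(x):
    return sum(wj * gaussian(abs(x - xj) / r) for wj, xj in zip(w, X))

for xi, yi in zip(X, y):
    print(round(g(xi), 6), yi)
```

The Gaussian kernel matrix on distinct points is invertible, so the solve succeeds; the perfect fit at every training point is exactly the overfitting concern raised above.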
The obvious solution is to constrain the model in some way. The root of the
problem is that we have too many parameters, one for each bump. Solution:
restrict the number of bumps to $k \ll N$. If we restrict the number of bumps
to k, then only k weights $w_1, \ldots, w_k$ need to be learned. Naturally, we will
choose the weights that best fit the data. Where should the bumps be placed?
It is natural to let the data tell us where to place the bumps; we will denote
the centers of the bumps by $\mu_1, \ldots, \mu_k$. The final hypothesis then takes the
parameterized form below, which is also illustrated in the accompanying feedforward
network diagram.
$$g(x) = \sum_{j=1}^{k} w_j\, \phi\!\left(\frac{\|x - \mu_j\|}{r}\right), \qquad (1.2)$$

[Figure: feedforward network view of the RBF model, with one unit per bump $\phi(\|x - \mu_j\|/r)$ and output weights $w_1, \ldots, w_k$]
where the unknown parameters $\{w_j\}_{j=1}^k$ and $\{\mu_j\}_{j=1}^k$ need to be learned by
fitting the data.⁸ For classification, we take the sign to get the final hypothesis.
This is the radial basis function network learning model. A useful graphical
representation of the model is illustrated. There are several important features
that are worth discussing. First, the hypothesis set is very similar to a linear
model, except that the transformation functions $\phi_j(x)$ can depend on the data
set (through the choice of $\mu_j$, which is chosen to fit the data). In the linear
model, these transformation functions were fixed ahead of time. Because the
$\mu_j$ appear inside a non-linear function, this is the first model we encounter that is not
linear in its parameters. It is linear in the $w_j$, but non-linear in the $\mu_j$. It
⁸ We assume the y-values have zero mean. When you have N bumps, this is not an issue,
because you have a bump on each data point. When you only have a small number of
bumps, a bias in the y-values can distort the learning (as is also the case if you do linear
regression without the constant term and the data y-values have a bias). To combat this,
one typically works with $\tilde{y}_i = y_i - \bar{y}$, because the bias can always be added back at the end:

$$g(x) = \bar{y} + \sum_{j=1}^{k} w_j\, \phi\!\left(\frac{\|x - \mu_j\|}{r}\right).$$
turns out that allowing the basis functions to depend on the data adds
significant power to this model over the linear model. We will see a more
detailed discussion of this later in Chapter ??. Other than the parameters
$w_j, \mu_j$, which are chosen to fit the data, there are two high-level parameters k
and r which specify the nature of the hypothesis set. These parameters are
a quantitative realization of a discussion in Chapter ?? about the size of a
hypothesis set (in this case the number of bumps you are allowed to have)
versus the complexity of a single hypothesis in the set (in this case how wiggly
an individual hypothesis is); k controls the size of H and r the complexity of
the individual hypotheses. One heuristic for setting r is so that k bumps of
width r can more or less cover the whole data set. Let R be the radius of the
data, $R = \max_{i,j} \|x_i - x_j\|$; then the data sit in a hypersphere of radius R,
having volume proportional to $R^d$. Assuming the data are well spread out
in this bounding hypersphere, for the k hyperspheres of radius r to cover the
data, they should have comparable total volume, i.e. $k r^d \approx R^d$. Thus, we set

$$r = \frac{R}{k^{1/d}}.$$
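A rough sketch of this heuristic (the helper name `kernel_scale` is invented, and R is computed as the largest pairwise distance, as above):

```python
import math

def kernel_scale(X, k):
    """Heuristic r = R / k^(1/d), so that k bumps of width r roughly cover
    the data set."""
    d = len(X[0])
    # R: the largest pairwise distance in the data set.
    R = max(math.dist(a, b) for a in X for b in X)
    return R / k ** (1.0 / d)

# Four points on the unit square: R = sqrt(2), d = 2.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(kernel_scale(X, k=4))  # sqrt(2) / 4^(1/2) = sqrt(2)/2
```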
1.1.2 Fitting the Model Parameters
For a given k and r, we still need to determine the centers $\mu_j$ and the weights
$w_j$. For a given set of $\mu_j$, the hypothesis is linear in the $w_j$. For least squares
regression, we can solve exactly for the $w_j$ (see Chapter ??). Specifically,
define the feature matrix $\Phi$ by

$$\Phi_{ij} = \phi\!\left(\frac{\|x_i - \mu_j\|}{r}\right).$$

Then $w = \Phi^{\dagger} y = (\Phi^{\mathsf{T}}\Phi)^{-1}\Phi^{\mathsf{T}} y$. Recall also the discussion about overfitting
from Chapter ??. It is prudent to use regularization when the data are noisy.
With regularization parameter $\lambda$, the solution becomes $w = (\Phi^{\mathsf{T}}\Phi + \lambda I)^{-1}\Phi^{\mathsf{T}} y$.
For classification, instead of regression, you can use your favorite algorithm
for linear models from Chapter ??; the linear programming algorithm in Problem ?? also works quite well. So, we know what to do if the $\mu_j$ are fixed.
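With the centers fixed, the regularized least squares solution $w = (\Phi^{\mathsf{T}}\Phi + \lambda I)^{-1}\Phi^{\mathsf{T}} y$ can be sketched in plain Python by solving the normal equations directly; the function name, toy data, and center choices are all invented for the example:

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z)

def fit_weights(X, y, centers, r, lam=0.0):
    """Least-squares weights for fixed centers:
    w = (Phi^T Phi + lam I)^{-1} Phi^T y, via Gaussian elimination."""
    N, k = len(X), len(centers)
    Phi = [[phi(math.dist(x, m) / r) for m in centers] for x in X]
    # A = Phi^T Phi + lam I (k x k), b = Phi^T y (k-vector).
    A = [[sum(Phi[i][p] * Phi[i][q] for i in range(N)) + (lam if p == q else 0.0)
          for q in range(k)] for p in range(k)]
    b = [sum(Phi[i][p] * y[i] for i in range(N)) for p in range(k)]
    # Gaussian elimination with partial pivoting on [A | b].
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        piv = max(range(col, k), key=lambda rr: abs(M[rr][col]))
        M[col], M[piv] = M[piv], M[col]
        for rr in range(col + 1, k):
            f = M[rr][col] / M[col][col]
            for cc in range(col, k + 1):
                M[rr][cc] -= f * M[col][cc]
    w = [0.0] * k
    for rr in range(k - 1, -1, -1):
        w[rr] = (M[rr][k] - sum(M[rr][cc] * w[cc]
                                for cc in range(rr + 1, k))) / M[rr][rr]
    return w

# Two bumps for four points; the symmetric data give equal weights.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 1.0, 0.0]
w = fit_weights(X, y, centers=[(0.5,), (2.5,)], r=1.0, lam=0.1)
print(w)
```

Setting `lam=0` recovers the unregularized pseudo-inverse solution; a small positive `lam` tempers the fit when the data are noisy, as the text recommends.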
The hard part is to determine the bump centers, because this is where the
model becomes essentially non-linear. Luckily, the physical interpretation of
this hypothesis set as a set of bumps centered on the $\mu_j$ helps us. When there
were N bumps (the non-parametric case), the natural choice was to place one
on top of each data point. Now that there are only k, we should still choose
the bump centers to represent the data well. So, given the $x_i$ (we do not
need to know the $y_i$), we would like to place the centers so that the $x_i$ are
well represented. This is an unsupervised learning problem; in fact, a very
important one. One way to formulate the task is to require that no $x_i$ be too
far away from a bump center. The $x_i$ should cluster around the centers, and
each $\mu_j$ should represent one cluster. This is known as the k-center problem,
which is known to be NP-hard. We wish to partition the data into k clusters and pick
representative centers $\{\mu_j\}_{j=1}^k$, one for each cluster. Clustering is one of the
classic unsupervised learning tasks.
1.1.3 k-means Clustering

The k-means clustering error for a partition $S_1, \ldots, S_k$ with centers $\mu_1, \ldots, \mu_k$ is

$$E_{\mathrm{km}}(S_1, \ldots, S_k; \mu_1, \ldots, \mu_k) = \sum_{j=1}^{k} \sum_{x_i \in S_j} \|x_i - \mu_j\|^2. \qquad (1.4)$$
The k-means objective considers each cluster $S_j$ separately, and for every
cluster computes an error $E_j$ for cluster $S_j$, so $E_{\mathrm{km}} = \sum_{j=1}^{k} E_j$. The error
for cluster $S_j$ measures how well its representative center approximates the
data points in $S_j$ (using the sum of squared deviations). Finding the optimal
k-means partition to minimize $E_{\mathrm{km}}$ is also NP-hard.
Exercise 1.11

(a) If the partitions are fixed to $S_1, \ldots, S_k$, then show that the centers
which minimize $E_{\mathrm{km}}$ are the centroids of the clusters:

$$\mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i.$$

(b) If the centers are fixed to $\mu_1, \ldots, \mu_k$, then show that the partitions
$\{S_j\}_{j=1}^k$ which minimize $E_{\mathrm{km}}$ are obtained from the Voronoi regions
of the centers, by adding to $S_j$ every point in the Voronoi region of
$\mu_j$:

$$S_j = \{x_i : \|x_i - \mu_j\| \le \|x_i - \mu_\ell\| \text{ for } \ell = 1, \ldots, k\}.$$
The previous exercise says that if we fix a partition, then the optimal centers are easy to obtain; similarly, if we fix the centers, then the optimal
partition is easy to obtain. This suggests a very simple iterative algorithm
for obtaining a good clustering, which is known as Lloyd's algorithm. The
algorithm starts with candidate centers and iteratively improves them until
convergence.
1: Initialize the centers $\mu_1, \ldots, \mu_k$ (for example, to k different data points).
2: Construct $S_j$: add to $S_j$ every data point whose nearest center is $\mu_j$ (the Voronoi regions of Exercise 1.11(b)).
3: Update each $\mu_j$ to the centroid of its cluster $S_j$.
4: Repeat steps 2 and 3 until $E_{\mathrm{km}}$ stops decreasing.
Exercise 1.12

Show that steps 2 and 3 in Lloyd's algorithm can never increase $E_{\mathrm{km}}$, and
hence that the algorithm must converge. [Hint: There are only a finite
number of different partitions.]
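A minimal Python sketch of Lloyd's algorithm follows; initializing with the first k data points is one simple choice, and the data set and names are invented for the illustration:

```python
import math

def lloyd(X, k, iters=100):
    """Lloyd's algorithm: alternate the two steps of Exercise 1.11 --
    assign each point to its nearest center (Voronoi regions), then move
    each center to its cluster's centroid. E_km never increases, so the
    iteration converges."""
    centers = [list(X[i]) for i in range(k)]  # initialize with first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: partition by nearest center.
        clusters = [[] for _ in range(k)]
        for x in X:
            j = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[j].append(x)
        # Step 3: recompute centroids (keep the old center if a cluster is empty).
        new = [[sum(c) / len(S) for c in zip(*S)] if S else centers[j]
               for j, S in enumerate(clusters)]
        if new == centers:  # converged: partition and centers are stable
            break
        centers = new
    return centers, clusters

# Two well-separated groups of three points each.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
centers, clusters = lloyd(X, k=2)
print(sorted(centers))
```

On this toy data the algorithm converges in a few iterations to the two group centroids; in practice one typically restarts from several random initializations, since Lloyd's algorithm only finds a local minimum of $E_{\mathrm{km}}$.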
$$g(x) = \sum_{j=1}^{k} w_j\, \phi\!\left(\left\|\Sigma_j^{-1/2}(x - \mu_j)\right\|\right) = \sum_{j=1}^{k} w_j\, e^{-\frac{1}{2}(x - \mu_j)^{\mathsf{T}} \Sigma_j^{-1} (x - \mu_j)}, \qquad (1.5)$$
where the last expression is for the Gaussian kernel. The scalars $\{w_j\}_{j=1}^k$,
the $d \times 1$ vectors $\{\mu_j\}_{j=1}^k$, and the $d \times d$ positive definite symmetric matrices
$\{\Sigma_j\}_{j=1}^k$ need to be learned from the data. Intuitively, it is still the case that
each bump j represents a cluster of data points centered at $\mu_j$, but now the
scale r is replaced by a scale matrix $\Sigma_j$ which captures the covariance of
the points in the cluster (scale and correlations); for the Gaussian kernel, the
$\mu_j$ are the centroids, and the $\Sigma_j$ are the covariance matrices. It is still the
case that the $w_j$ can be fit using techniques for fitting linear models, once the
$\mu_j$ and $\Sigma_j$ (the location and shape parameters) are given. The location and
shape parameters can be learned in an unsupervised way; an E-M algorithm
similar to Lloyd's algorithm can be applied (Section 1.2.1 elaborates on the
details); for Gaussian kernels, this unsupervised learning task is called learning
a Gaussian Mixture Model (GMM), another important unsupervised learning
problem.
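A short sketch of evaluating such a hypothesis (names invented; the inverse covariances are supplied precomputed, and the Gaussian's normalizing constant is absorbed into the weights):

```python
import math

def gaussian_bump(x, mu, Sigma_inv):
    """One full-covariance Gaussian bump: exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu)).
    Sigma_inv is the precomputed inverse covariance of the cluster."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    n = len(d)
    quad = sum(d[p] * Sigma_inv[p][q] * d[q] for p in range(n) for q in range(n))
    return math.exp(-0.5 * quad)

def g(x, weights, mus, Sigma_invs):
    """The hypothesis of equation (1.5): a weighted sum of elliptical bumps."""
    return sum(w * gaussian_bump(x, mu, Si)
               for w, mu, Si in zip(weights, mus, Sigma_invs))

# Two bumps; an identity inverse covariance reduces to the spherical r = 1 case.
mus = [(0.0, 0.0), (3.0, 0.0)]
Sigma_invs = [[[1.0, 0.0], [0.0, 4.0]],   # tight in y (variance 1/4)
              [[1.0, 0.0], [0.0, 1.0]]]   # spherical
print(g((0.0, 0.0), [1.0, 1.0], mus, Sigma_invs))
```

The quadratic form $(x - \mu_j)^{\mathsf{T}}\Sigma_j^{-1}(x - \mu_j)$ is the squared Mahalanobis distance, so each bump's level sets are ellipsoids shaped by its cluster's covariance rather than spheres of a single radius r.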