
1. Similarity-Based Methods

1.1 Radial Basis Functions

Rather than fix the value of k, the radial basis function (RBF) technique allows
all data points to contribute to g(x), but not equally. The natural choice is
that points further from x should contribute less; a radial basis function (or
kernel) φ(z) quantifies this contribution, where z will be the distance ‖x − x_i‖.
The properties of a typical kernel are that it is positive, non-increasing in |z|,
and that φ(‖x‖) integrates to 1. The most commonly used is the Gaussian kernel,
\[
  \phi(z) = \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}z^2}.
\]
Another common kernel is the window kernel,
\[
  \phi(z) =
  \begin{cases}
    \dfrac{\Gamma(\frac{d}{2}+1)}{\pi^{d/2}} & z \le 1,\\
    0 & z > 1,
  \end{cases}
\]
where Γ(·) is the Gamma function (the constants of proportionality, which
depend on d, are chosen so that φ(‖x‖) integrates to 1).
The direct analog of the k-nearest neighbor rule is to take a weighted
average of the target values using weights specified by the kernel,
\[
  g(x) = \frac{\sum_{i=1}^{N} \alpha_i(x)\, y_i}{\sum_{i=1}^{N} \alpha_i(x)},
  \tag{1.1}
\]
where α_i(x) = φ(‖x − x_i‖/r). For classification, take the sign of the resulting
function. The "radial" in RBF is because the weights only depend on how far a
point x_i is from x. The scale parameter r determines the kernel width. The smaller
the scale parameter, the more emphasis is placed on the nearer points. When
r → 0, the kernel behaves like a delta function, picking out the nearest point;
in this limit, the final hypothesis is similar to the nearest neighbor rule.

[Figure: the RBF final hypothesis for scale parameters r = 0.1 and r = 0.3.]

As r gets large, the kernel width gets larger and more of the points contribute
to the value of g. Thus, the choice of r is similar to the choice of k; in fact,
the RBF technique with the window kernel is sometimes called the r-nearest
neighbor rule, because it uniformly weights all neighbors within distance r of
x. A small r leads to a more complex hypothesis. One way to choose r is
using cross validation. One can also get a universal consistency result. The
intuition is that r sets the width of the kernel, in that data points within r
of x contribute to g(x). The volume of this region is of order r^d, and so the
number of points in this region is of order N r^d. We want r → 0 so that the
influential points are close to x. We also want the number of influential points
to be large, so we want N r^d → ∞. Under these two conditions, one obtains
asymptotic (as N → ∞) universal consistency, an analog of Theorem 1.2.
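
To make Equation (1.1) concrete, here is a minimal sketch in Python/NumPy (not from the text; the function names and toy data are our own) of the non-parametric RBF estimate with the Gaussian kernel. Note that the kernel's normalizing constant cancels in the ratio, so it does not affect g(x).

```python
import numpy as np

def gaussian_kernel(z, d):
    """Gaussian kernel: phi(z) = (2*pi)^(-d/2) * exp(-z^2 / 2)."""
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * z ** 2)

def rbf_predict(x, X, y, r):
    """Non-parametric RBF estimate g(x) of Equation (1.1)."""
    d = X.shape[1]
    z = np.linalg.norm(X - x, axis=1) / r   # ||x - x_i|| / r for every data point
    alpha = gaussian_kernel(z, d)           # weights alpha_i(x); constant cancels below
    return alpha @ y / np.sum(alpha)        # kernel-weighted average of the targets

# Toy 1-d regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)

for r in (0.1, 0.3):
    print(r, rbf_predict(np.array([0.25]), X, y, r))
```

A smaller r makes the prediction at a point depend almost entirely on its nearest neighbors, as discussed above.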

1.1.1 Radial Basis Function Networks

The non-parametric radial basis function technique leads us directly to the
parametric version, which is called a radial basis function network. Another
way to view the function in Equation (1.1) is as a weighted sum of bump
functions,
\[
  g(x) = \sum_{i=1}^{N} w_i(x)\, \phi\!\left(\frac{\|x - x_i\|}{r}\right),
  \tag{1.2}
\]
where w_i(x) = y_i / Σ_{i=1}^{N} φ(‖x − x_i‖/r).
There is a bump centered on each data point, and the final hypothesis
rescales each bump to have height w_i(x) and sums. The heights of the bumps
depend on w_i(x), and the widths depend on r. If we were to set the heights
w_i to constants, independent of x, and denote φ(‖x − x_i‖/r) by Φ_i(x), then
we have
\[
  g(x) = \sum_{i=1}^{N} w_i\, \Phi_i(x) = \mathbf{w}^{\mathrm{t}}\mathbf{z},
  \tag{1.3}
\]
where z = [Φ_1(x), …, Φ_N(x)]^t is a transformed N-dimensional vector obtained
from x. Since the non-linear transform is obtained from the bumps placed on
each data point, these are often called local basis functions.

[Figure: illustration for scale parameters r = 0.1 and r = 0.3.]

Exercise 1.10
(a) For the Gaussian kernel, what is g(x) as ‖x‖ → ∞ for the non-parametric
    RBF technique versus for the parametric RBF network with fixed w_i?
(b) Let Φ be the square feature matrix defined by Φ_ij = Φ_j(x_i), and
    assume that Φ is invertible. Show that with w = Φ^{-1} y, g(x) exactly
    interpolates the data points, i.e. g(x_i) = y_i.


The RBF network is exactly a linear model with a non-linear transform into
an N-dimensional space determined by the kernel φ. This is a linear model
with N parameters whose VC-dimension will generally be N. By tuning w,
we are able to fit the data exactly, and we will have poor generalization.
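
To see this concretely, the following sketch (our own illustration, with hypothetical data) builds the N × N feature matrix Φ with Φ_ij = φ(‖x_i − x_j‖/r), sets w = Φ^{-1}y as in Exercise 1.10(b), and verifies that the hypothesis reproduces every training target, i.e. zero in-sample error.

```python
import numpy as np

def phi(z):
    """Gaussian bump; the (2*pi)^(-d/2) constant is absorbed into the weights."""
    return np.exp(-0.5 * z ** 2)

rng = np.random.default_rng(1)
N, r = 15, 0.3
X = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(N)

# Feature matrix Phi_ij = phi(||x_i - x_j|| / r): one bump per data point.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Phi = phi(D / r)

# w = Phi^{-1} y makes g(x_i) = y_i for every training point:
# N parameters fit N targets exactly, which is exactly the overfitting risk.
w = np.linalg.solve(Phi, y)
print(np.allclose(Phi @ w, y))   # True: zero in-sample error
```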
The obvious solution is to constrain the model in some way. The root of the
problem is that we have too many parameters, one for each bump. The solution
is to restrict the number of bumps to k ≪ N, so that only k weights
w_1, …, w_k need to be learned. Naturally, we will
choose the weights that best fit the data. Where should the bumps be placed?
It is natural to let the data tell us where to place the bumps; we will denote
the centers of the bumps by μ_1, …, μ_k. The final hypothesis then takes on
the parameterized form below, which is also illustrated in the accompanying
feedforward network diagram.
[Figure: feedforward network representation of the RBF network; the inputs feed
the bumps φ(‖x − μ_1‖/r), …, φ(‖x − μ_k‖/r), which are combined with weights
w_1, …, w_k to produce g(x).]
\[
  g(x) = \sum_{j=1}^{k} w_j\, \phi\!\left(\frac{\|x - \mu_j\|}{r}\right),
\]
where the unknown parameters {w_j}_{j=1}^{k} and {μ_j}_{j=1}^{k} need to be learned by
fitting the data.⁸ For classification, we take the sign to get the final hypothesis.
This is the radial basis function network learning model. A useful graphical
representation of the model is illustrated above. There are several important features
that are worth discussing. First, the hypothesis set is very similar to a linear
model, except that the transformation functions Φ_j(x) can depend on the data
set (through the choice of μ_j, which is chosen to fit the data). In the linear
model, these transformation functions were fixed ahead of time. Because the
μ_j appear inside a non-linear function, this is our first model that is not
linear in its parameters. It is linear in the w_j, but non-linear in the μ_j.
⁸ We assume the y-values have zero mean. When you have N bumps, this is not an issue,
because you have a bump on each data point. When you only have a small number of
bumps, a bias in the y-values can distort the learning (as is also the case if you do linear
regression without the constant term and the data y-values have a bias). To combat this,
one typically works with ŷ_i = y_i − ȳ, because the bias can always be added back at the end:
\[
  g(x) = \bar{y} + \sum_{j=1}^{k} w_j\, \phi\!\left(\frac{\|x - \mu_j\|}{r}\right).
\]
The unknown parameters (the w_j, μ_j) are learned using the unbiased ŷ_i.


turns out that allowing the the basis functions to depend on the data adds
significant power to this model over the linear model. We will see a more
detailed discussion of this later in Chapter ??. Other than the parameters
wj , j which are chosen to fit the data, there are two high-level parameters k
and r which specify the nature of the hypothesis set. These parameters are
a quantitative realization of a discussion in Chapter ?? about the size of a
hypothesis set (in this case the number of bumps you are allowed to have)
versus the complexity of a single hypothesis in the set (in this case how wiggly
an individual hypothesis is); k controls the size of H and r the complexity of
the individual hypotheses. One heuristic for setting r is so that k bumps of
width r can more or less cover the whole data set. Let R be the radius of the
data, R = maxi,j kxi xj k; then the data sits in a hypersphere of radius R,
having volume proportional to (R)d . Assuming the data are well spread out
in this bounding hypersphere, for the k hyperspheres of radius r to cover the
data, they should have comparable volume, i.e. krd ( 12 R)d . Thus, we set
r=

R
k 1/d

It is important to choose a good value of k to avoid over- or under-fitting. A
good strategy for choosing k is via cross validation.
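
As a small sketch (our own helper; it simply implements the heuristic above), one can compute r from the data radius and the chosen k:

```python
import numpy as np

def heuristic_scale(X, k):
    """r = R / k^(1/d), with R the largest pairwise distance in the data."""
    n, d = X.shape
    R = max(np.linalg.norm(X - X[i], axis=1).max() for i in range(n))
    return R / k ** (1.0 / d)

X = np.random.default_rng(2).uniform(-1, 1, size=(100, 2))
print(heuristic_scale(X, k=9))   # kernel width for a 9-bump RBF network
```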

1.1.2 Fitting the Data

For a given k and r, we still need to determine the centers μ_j and the weights
w_j. For a given set of μ_j, the hypothesis is linear in the w_j. For least squares
regression, we can solve exactly for the w_j (see Chapter ??). Specifically,
define the feature matrix Φ by
\[
  \Phi_{ij} = \phi\!\left(\frac{\|x_i - \mu_j\|}{r}\right).
\]
Then w = Φ^† y = (Φ^t Φ)^{-1} Φ^t y. Recall also the discussion about overfitting
from Chapter ??. It is prudent to use regularization when the data are noisy.
With regularization parameter λ, the solution becomes w = (Φ^t Φ + λI)^{-1} Φ^t y.
For classification, instead of regression, you can use your favorite algorithm
for linear models from Chapter ??; the linear programming algorithm in
Problem ?? also works quite well. So, we know what to do if the μ_j are fixed.
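
A sketch of this fitting step (our own function names; the centers here are placed arbitrarily for illustration, pending the clustering discussion below):

```python
import numpy as np

def phi(z):
    return np.exp(-0.5 * z ** 2)             # Gaussian bump (constant absorbed into w)

def fit_weights(X, y, centers, r, lam=0.0):
    """Regularized least squares: w = (Phi^t Phi + lam*I)^{-1} Phi^t y,
    with Phi_ij = phi(||x_i - mu_j|| / r)."""
    Phi = phi(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) / r)
    k = centers.shape[0]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

def predict(x, centers, w, r):
    """g(x) = sum_j w_j phi(||x - mu_j|| / r)."""
    return phi(np.linalg.norm(centers - x, axis=1) / r) @ w

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

centers = np.linspace(-1, 1, 5).reshape(-1, 1)   # 5 arbitrary centers for illustration
w = fit_weights(X, y, centers, r=0.5, lam=1e-3)  # small lambda guards against noise
print(predict(np.array([0.0]), centers, w, r=0.5))
```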
The hard part is to determine the bump centers, because this is where the
model becomes essentially non-linear. Luckily, the physical interpretation of
this hypothesis set as a set of bumps centered on the μ_j helps us. When there
were N bumps (the non-parametric case), the natural choice was to place one
on top of each data point. Now that there are only k, we should still choose
the bump centers to well represent the data. So, given the x_i (we do not
need to know the y_i), we would like to place the centers so that the x_i are
well represented. This is an unsupervised learning problem; in fact, a very
important one. One way to formulate the task is to require that no x_i be too
far away from a bump center. The x_i should cluster around the centers, and
each μ_j should represent one cluster. This is known as the k-center problem,
which is NP-hard. We wish to partition the data into k clusters and pick
representative centers {μ_j}_{j=1}^{k}, one for each cluster. Clustering is one of the
classic unsupervised learning tasks.

1.1.3 k-means Clustering

One way to mathematically formulate the center selection problem is known
as k-means clustering. We already saw the basic algorithm when we discussed
partitioning for the branch and bound approach to finding the nearest neighbor.
The goal is to partition the data points x_1, …, x_N into k sets S_1, …, S_k
and select a center μ_1, …, μ_k for each partition. The centers are representative
of the data if every data point in partition S_j is close to its corresponding
center μ_j. We can therefore measure how good the centers are using the sum
of squares error, which yields the k-means objective function:
\[
  E_{\mathrm{km}}(S_1, \ldots, S_k; \mu_1, \ldots, \mu_k)
  = \sum_{j=1}^{k} \sum_{x_i \in S_j} \|x_i - \mu_j\|^2.
  \tag{1.4}
\]

The k-means objective considers each cluster S_j separately, and for every
cluster computes an error E_j for cluster S_j, so E_km = Σ_{j=1}^{k} E_j. The error
for cluster S_j measures how well its representative center approximates the
data points in S_j (using the sum of squared deviations). Finding the optimal
k-means partition to minimize E_km is also NP-hard.
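
For concreteness, a small sketch (our own function name; the partition is encoded as a cluster label per point) that evaluates the objective (1.4):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """E_km of (1.4): sum over clusters of squared distances to the cluster center."""
    return sum(np.sum((X[labels == j] - centers[j]) ** 2)
               for j in range(len(centers)))

# Four points split into two clusters, each point 0.5 away from its center.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [4.0, 0.5]])
print(kmeans_objective(X, labels, centers))   # 4 * 0.25 = 1.0
```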
Exercise 1.11
(a) If the partitions are fixed to S_1, …, S_k, then show that the centers
    which minimize E_km are the centroids of the clusters:
    \[
      \mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i.
    \]
(b) If the centers are fixed to μ_1, …, μ_k, then show that the partitions
    {S_j}_{j=1}^{k} which minimize E_km are obtained from the Voronoi regions
    of the centers by adding to S_j every point in the Voronoi region of μ_j:
    \[
      S_j = \{\, x_i : \|x_i - \mu_j\| \le \|x_i - \mu_\ell\| \text{ for } \ell = 1, \ldots, k \,\}.
    \]

The previous exercise says that if we fix a partition, then the optimal centers
are easy to obtain, and similarly, if we fix the centers, then the optimal
partition is easy to obtain. This suggests a very simple iterative algorithm
for obtaining a good clustering, which is known as Lloyd's algorithm. The
algorithm starts with candidate centers and iteratively improves them until
convergence.

Lloyd's Algorithm for k-Means Clustering
1: Use the greedy approach (page ??) to initialize the μ_j.
2: Construct each S_j by associating to μ_j all points closest to μ_j.
3: Compute new centers μ_j as the centroids of the current S_j.
4: Repeat steps 2 and 3 until E_km stops decreasing.
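
A minimal sketch of Lloyd's algorithm (our own code; step 1 here uses random data points as initial centers rather than the greedy approach referenced above):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers (random data points here, not the greedy approach).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center (the Voronoi regions).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                  # assignments stable, so E_km has stopped decreasing
        labels = new_labels
        # Step 3: move each center to the centroid of its current cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Example usage on synthetic 2-d data.
X = np.random.default_rng(4).normal(size=(300, 2))
centers, labels = lloyd(X, k=3)
```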

Exercise 1.12
Show that steps 2 and 3 in Lloyd's algorithm can never increase E_km, and
hence that the algorithm must converge. [Hint: There are only a finite
number of different partitions.]

Lloyd's algorithm produces a partition which is locally optimal in an unusual
sense (analogous to a Nash equilibrium in game theory): given the centers,
there is no better way to assign the points to centers; and, given the partition,
there is no better choice for the centers. However, the algorithm need not
find the optimal partition, as one might be able to improve the k-means
objective by simultaneously changing the centers and the cluster memberships.
Lloyd's algorithm falls into a class of algorithms known as E-M
(expectation-maximization) algorithms. It optimizes a complex objective function
by separating the variables to be optimized into two sets. If one set is known, then
it is easy to optimize the other set, and so the natural iterative algorithm
follows, as with Lloyd's algorithm. It is an active area of theoretical research to
find algorithms which can guarantee a near optimal clustering (to within a
factor of 1 + ε). There are some algorithms which achieve this kind of accuracy
and run in O(n^{2d}) time (curse of dimensionality). For our practical purposes,
we do not need the best clustering per se; we just need a good one that is
representative of the data, because we still have some flexibility to fit the data
by choosing the weights w_j. Lloyd's algorithm works just fine in practice, and
it is quite efficient.
Potpourri
There have been several approaches to justifying the RBF technique. Our
approach was to treat the RBF as a natural way to extend the k-nearest neighbor
algorithm. RBF networks also arise naturally through Tikhonov regularization
for fitting non-linear functions to the data. Tikhonov regularization uses
a regularization term which penalizes curvature in the final hypothesis (see
Problem 1.22). RBF networks also arise naturally from noisy interpolation
theory, where one tries to achieve minimum expected in-sample fit error under
the assumption that the x-values are noisy; this is similar to regularization,
where one asks g to be nearly a constant equal to y_i in the neighborhood of
x_i, not just at x_i.

In our treatment, the RBF network consists of k identical bumps. A natural
way to extend this model is to allow the bumps to have different shapes. This
can be accomplished by choosing different scale parameters for each bump (so
r_j instead of r). Further, we can drop the requirement that the bumps be
rotationally symmetric. The final hypothesis resulting from this more general
RBF network has the form:
\[
  g(x) = \sum_{j=1}^{k} w_j\, \phi\!\left(\|\Sigma_j^{-1/2}(x - \mu_j)\|\right)
       = \sum_{j=1}^{k} w_j\, e^{-\frac{1}{2}(x - \mu_j)^{\mathrm{t}} \Sigma_j^{-1} (x - \mu_j)},
  \tag{1.5}
\]
where the last expression is for the Gaussian kernel. The scalars {w_j}_{j=1}^{k},
the d × 1 vectors {μ_j}_{j=1}^{k}, and the d × d positive definite symmetric matrices
{Σ_j}_{j=1}^{k} need to be learned from the data. Intuitively, it is still the case that
each bump j represents a cluster of data points centered at μ_j, but now the
scale r is replaced by a scale matrix Σ_j which captures the covariance of
the points in the cluster (scale and correlations); for the Gaussian kernel, the
μ_j are the centroids, and the Σ_j are the covariance matrices. It is still the
case that the w_j can be fit using techniques for fitting linear models, once the
μ_j and Σ_j (the location and shape parameters) are given. The location and
shape parameters can be learned in an unsupervised way; an E-M algorithm
similar to Lloyd's algorithm can be applied (Section 1.2.1 elaborates on the
details); for Gaussian kernels, this unsupervised learning task is called learning
a Gaussian Mixture Model (GMM), another important unsupervised learning
problem.
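
As a sketch of how such a generalized bump is evaluated (our own function; in practice the μ_j and Σ_j would come from fitting a GMM, and the w_j from the linear fit described earlier):

```python
import numpy as np

def general_rbf(x, w, mus, Sigmas):
    """g(x) = sum_j w_j * exp(-0.5 * (x - mu_j)^t Sigma_j^{-1} (x - mu_j))."""
    g = 0.0
    for wj, mu, Sigma in zip(w, mus, Sigmas):
        diff = x - mu
        g += wj * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
    return g

# Two elliptical bumps in 2-d with hand-picked (illustrative) parameters.
w = [1.0, -0.5]
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.diag([0.5, 0.1]), np.array([[0.3, 0.1], [0.1, 0.2]])]
print(general_rbf(np.array([0.5, 0.2]), w, mus, Sigmas))
```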
