
Computer Vision, Graphics, and Pattern Recognition Group
Department of Mathematics and Computer Science
University of Mannheim
D-68131 Mannheim, Germany

Reihe Informatik
10/2001

Efficient Feature Subset Selection for
Support Vector Machines

Matthias Heiler, Daniel Cremers, Christoph Schnörr

Technical Report 21/2001
Computer Science Series
October 2001

The technical reports of the CVGPR Group are listed under
http://www.ti.uni-mannheim.de/bmg/Publications-e.html
Efficient Feature Subset Selection for
Support Vector Machines

Matthias Heiler, Daniel Cremers, Christoph Schnörr

Computer Vision, Graphics, and Pattern Recognition Group
Department of Mathematics and Computer Science
University of Mannheim, 68131 Mannheim, Germany

heiler@uni-mannheim.de
http://www.ti.uni-mannheim.de/bmg

Abstract

Support vector machines can be regarded as algorithms for compressing information
about class membership into a few support vectors with a clear geometric
interpretation. It is tempting to use this compressed information to select the
most relevant input features. In this paper we present a method for doing so and
provide evidence that it selects high-quality feature sets at a fraction of the cost
of classical methods.

Keywords: support vector machine, feature subset selection, wrapper method

1 Introduction

The feature subset selection problem is an old problem studied in machine learning,
statistics, and pattern recognition [1]. For classification purposes, the problem can be
stated as follows: given a data set with features X_1, X_2, ..., X_n and labels Y, select
a feature subset such that a machine learning algorithm trained on it achieves good
performance.

John et al. helped structure the field by distinguishing filter methods, which select
feature subsets based on general criteria independent of any specific learning algorithm,
from wrapper methods, which tailor feature subsets to suit the inductive bias of a
given learning algorithm [2]. The wrapper method treats feature selection as a search
problem in the space of all possible feature subsets. It is well-known that exhaustive
search through all possible feature subsets is the only way to select the optimal
features [3, 4]. However, when there are n features this space obviously has 2^n elements,
which is generally too large to be searched exhaustively. Thus, numerous heuristic
search algorithms have been proposed for determining a suboptimal feature subset in a
computationally efficient way (e.g., [1, 5, 6]).
In this paper, we focus on a specific learning algorithm for classification, the support
vector machine. In this context, application of the wrapper method has one severe
disadvantage: it can be computationally expensive. This is due to the fact that, to
assess the quality of each feature subset, the machine learning algorithm must be trained
and evaluated on it. Unfortunately, training SVMs can be slow, rendering the wrapper
method a costly procedure for feature selection, especially on large multiclass data sets.

To overcome this difficulty, we present a novel strategy for feature subset selection
which is directly based on the support vector architecture and the representation of deci-
sion functions in terms of support vectors. The general idea is to train a support vector
machine once on a data set containing all features, extract some relevance measure from
the trained machine, and use this information to lead a hill-climbing search directly
toward a good feature subset. Since the number of reiterations of the training proce-
dure increases only linearly with the number of selected features, this algorithm can be
orders of magnitude faster than the wrapper method. Furthermore, we show that this
computational efficiency can be obtained without sacrificing classification accuracy.

After completion of this work [7], the authors became aware of similar ideas reported
in [8]. Whereas the latter work is applied in the context of visual object recognition [9],
we focus directly on the feature selection problem and present here for the first time
extensive numerical results which reveal the performance of our approach on established
benchmark data sets [10].

2 SVM-based Feature Selection

Let us motivate the feature selection algorithm with a simple example: Assume we are
given a two-dimensional binary classification problem where only one input feature is
relevant for classification. The other input feature contains noise. We train an SVM with
a linear kernel on this problem and find a separating hyperplane with maximal margin
(Figure 1).
Now the key observation is that the normal \vec{w} of the separating hyperplane will point
in the relevant direction, i.e., it will be approximately colinear with the basis vector \vec{e}_i
that is used for the relevant feature and approximately orthogonal to the other one.
This holds for any n-dimensional SVM with a linear kernel: if we take away all the basis
vectors which are orthogonal to \vec{w}, we lose no information about class membership,
as the corresponding features have no influence on the SVM decision. Accordingly, we
can define the importance of each feature X_k by its amount of colinearity with \vec{w}:

    d_k = (\langle \vec{w}, \vec{e}_k \rangle)^2.    (1)
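For the linear kernel this measure can be read off directly from the trained machine. A
minimal sketch, assuming scikit-learn's SVC (which wraps LIBSVM); the function name
is illustrative and not part of the original method description:

    # Sketch: relevance d_k for a linear kernel (assumes scikit-learn).
    import numpy as np
    from sklearn.svm import SVC

    def linear_relevance(X, y):
        """Return d_k = <w, e_k>^2 for each feature k, cf. equation (1)."""
        svm = SVC(kernel="linear").fit(X, y)
        w = svm.coef_[0]      # normal of the separating hyperplane (binary problem)
        return w ** 2         # squared colinearity with each basis vector e_k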
In the nonlinear case the SVM decision function [11] reads

    f(\vec{x}) = \sum_i \alpha_i y_i \langle \Phi(\vec{x}), \Phi(\vec{x}_i) \rangle + b,    (2)

with a set of given training vectors \vec{x}_i, corresponding class labels y_i, Lagrange multipliers
\alpha_i associated with the SVM optimization problem, and some offset b from the origin.
As the nonlinear mapping \Phi appears only inside the scalar product \langle \Phi(\vec{x}), \Phi(\vec{x}_i) \rangle, it is
usually expressed in terms of a kernel function.
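To make (2) concrete, the following sketch evaluates the decision function for a Gaussian
(RBF) kernel k(\vec{x}, \vec{x}_i) = exp(-\gamma ||\vec{x} - \vec{x}_i||^2). The kernel choice and the attribute names
are assumptions for illustration only; they follow scikit-learn's conventions, where
dual_coef_ stores the products \alpha_i y_i and intercept_ stores b:

    # Sketch of the decision function (2) for an RBF kernel, given a trained
    # scikit-learn SVC (assumption): dual_coef_ holds alpha_i * y_i, intercept_ holds b.
    import numpy as np

    def decision_value(svm, x, gamma):
        sv = svm.support_vectors_                           # training vectors x_i with alpha_i > 0
        coef = svm.dual_coef_[0]                            # alpha_i * y_i
        k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))  # kernel values k(x, x_i)
        return float(np.dot(coef, k) + svm.intercept_[0])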


Figure 1: Separating hyperplane for feature selection. Circles indicate the support
vectors.

Compared to the linear case, the influence of a feature X_k on the decision boundary
is no longer independent of the other features. It varies depending on where in the
input space it is determined. However, for a given point \vec{x} in input space, we can define
the influence of X_k as the squared partial derivative with respect to X_k:

    d_{k,\vec{x}} = (\langle \nabla f(\vec{x}), \vec{e}_k \rangle)^2.    (3)
Note that for SVMs with a linear kernel this definition reduces to the measure of colinearity
defined in (1).
In order to evaluate (3), we have to select meaningful points \vec{x}. To this end, we
can take advantage of the information-compressing capabilities of the support vector
machine: the SVM decision function (2) is essentially linear in the mapped input vectors
\Phi(\vec{x}_i). More precisely, it is linear in the mapped input vectors for which the SVM
optimization process yields non-zero Lagrange parameters \alpha_i > 0. These input vectors
are the support vectors SV = \{\vec{x}_i : \alpha_i > 0\}, and in practice their number is often small
compared to the number of all input vectors [11, p. 135]. It is clear that features
which have little influence on the support vectors also have a small effect on the SVM
decision. Thus, a good measure of the importance of a feature is its average influence
evaluated at the support vectors:

    d_k = \frac{1}{|SV|} \sum_{\vec{x}_i \in SV} \frac{d_{k,\vec{x}_i}}{\sum_k d_{k,\vec{x}_i}}.    (4)

Note that the denominator \sum_k d_{k,\vec{x}_i} ensures that each support vector's contribution
sums to one. This prevents outliers and support vectors located at very narrow
margins from dominating the overall result. Or, equivalently, it increases the influence
of support vectors at clear, well-separated margins.
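A compact sketch of (3) and (4), assuming for illustration a Gaussian (RBF) kernel,
for which the gradient of f is available in closed form, and a trained scikit-learn SVC
(an assumption; the original experiments use LIBSVM directly):

    # Sketch of equations (3) and (4) for k(x, z) = exp(-gamma * ||x - z||^2),
    # given a trained scikit-learn SVC (dual_coef_ holds alpha_i * y_i).
    import numpy as np

    def svm_relevance(svm, gamma):
        sv = svm.support_vectors_                    # support vectors x_i
        coef = svm.dual_coef_[0]                     # alpha_i * y_i
        d = np.zeros(sv.shape[1])
        for x in sv:                                 # evaluate (3) at every support vector
            diff = sv - x                            # x_i - x for all support vectors
            k = np.exp(-gamma * np.sum(diff ** 2, axis=1))
            grad = 2.0 * gamma * (coef * k) @ diff   # gradient of f at x
            d_kx = grad ** 2                         # squared partial derivatives, eq. (3)
            d += d_kx / d_kx.sum()                   # per-vector normalization, eq. (4)
        return d / len(sv)                           # average over all support vectors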
Once we have calculated the importance measure \{d_k\}_{k=1,\ldots,n}, we use a simple hill-
climbing search to determine the optimal number of relevant features. Specifically, we
rank the features according to their d_k values and, starting from an empty feature set,
successively add the features with the highest rank until the accuracy of an SVM trained
on the selected features stops increasing.
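The search itself might look as follows. This is a sketch assuming scikit-learn and
10-fold cross-validated accuracy; the stopping rule is simplified here to the first
non-improving step:

    # Sketch of the ranking search: add features in order of decreasing d_k
    # until the cross-validated accuracy stops increasing.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def select_features(X, y, d):
        order = np.argsort(d)[::-1]          # feature indices ranked by relevance d_k
        best_acc, selected = 0.0, []
        for k in order:
            candidate = selected + [k]
            acc = cross_val_score(SVC(), X[:, candidate], y, cv=10).mean()
            if acc <= best_acc:              # accuracy stopped increasing
                break
            best_acc, selected = acc, candidate
        return selected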

3 Experiments

To evaluate the performance of the SVM feature selection method we ran it on a number
of data sets from the UCI machine learning repository [10]. For comparison, we also ran
the wrapper method with hill-climbing search. Note that overfitting is a general problem
with feature selection on small data sets [12, 13]. We tried to avoid it by using 10-fold
cross-validation during the feature selection process, as well as a completely separate
test set for assessing the quality of the selected features. For the hill-climbing search
we used the stopping criterion proposed by Kohavi et al. [12]. For all experiments we
employed the LIBSVM package with its default parameter settings [14].
Table 1 summarizes the results of our experiments. For each of the data sets exam-
ined we could reduce the number of features used for classification without sacrificing
accuracy. More precisely, a one-tailed t-test at the 5% level revealed no statistically
significant decrease in classification performance after feature selection. On
the contrary, for the chess, led, and optidigit data sets we found our feature se-
lection to significantly increase classification accuracy. Note that the t-test revealed no
difference in performance between the wrapper and the SVM-based selection method.
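One possible implementation of such a test is sketched below; it assumes SciPy and
per-fold accuracies, and the pairing of the samples is an assumption made here for
illustration, not a detail stated above:

    # Sketch: one-tailed paired t-test at the 5% level on per-fold accuracies.
    # acc_before / acc_after: fold accuracies with all features / selected subset.
    from scipy.stats import ttest_rel

    def significant_decrease(acc_before, acc_after, alpha=0.05):
        t, p_two_sided = ttest_rel(acc_before, acc_after)
        p = p_two_sided / 2.0 if t > 0 else 1.0 - p_two_sided / 2.0
        return p < alpha      # True if accuracy dropped significantly after selection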
The led data benefitted greatly from feature selection. This is not surprising, as
led is a synthetic data set consisting of 7 relevant features and 17 features containing
random noise. Both feature selection algorithms reliably extract the 7 relevant features;
however, some irrelevant features, which appear to be predictive on the given data, are
also included. As this data set contains binary features only, we could easily estimate
the Shannon entropy h, compute the information gain h(y) - h(y|X_k) for each individual
feature X_k [15], and compare it to our relevance measure d_k. Figure 2 visualizes both rel-
evance measures: the strong correlation (r = 0.99) between them is apparent, as is
the clear distinction between the 7 relevant features on the left and the random features
on the right. Note that for experiments comprising continuous features, computing the
information gain is often not straightforward, while evaluating the SVM-based relevance
measure is.
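For completeness, a sketch of the information gain used in this comparison for discrete
features such as those of the led data (plain NumPy; the function names are illustrative):

    # Sketch: information gain h(y) - h(y|X_k) for a discrete feature X_k.
    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(x_k, y):
        gain = entropy(y)
        for v in np.unique(x_k):              # condition on each feature value
            mask = (x_k == v)
            gain -= mask.mean() * entropy(y[mask])
        return gain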

Table 1: Performance of different feature selection methods on a test set. Feature
selection significantly reduces the number of features without sacrificing classification
accuracy.

                     No Selection          Wrapper            SVM Selection
Data set          Features  Acc. (%)   Features  Acc. (%)   Features  Acc. (%)
breast cancer          9      95.71        5       95.42        6       95.71
chess                 35      33.33        5       86.67        4       86.67
crx                   43      83.73        5       85.15        5       86.03
diabetes               8      74.13        4       75.19        5       74.91
led                   24      66.27       11       73.10       10       74.70
mfeat                649      97.60      n.a.       n.a.       31       97.50
glass                  9      60.63        4       61.36        5       60.54
mushroom             125      99.95        8       99.95       14      100.00
optidigit             64      97.68       36       98.32       36       98.39

Figure 2: Feature selection on the led data set. The plot compares the SVM feature
criterion d_k with the information gain for each of the 24 features.

Table 2: CPU time used for the feature selection process. SVM selection is much faster
than the wrapper method as fewer classifiers must be trained.

Data set         Wrapper (s)   SVM Selection (s)
breast cancer         19.29             4.073
chess                 45.03             1.48
crx                  355.74            17.92
diabetes              45.15            19.35
glass                 13.87             3.46
led                  253.25            23.04
mfeat                  n.a.         59162.13
mushroom           27518.39          2977.34
optidigit          77742.152         2843.34

Our results show that the wrapper and the SVM method selected features of equal
quality in all cases examined. Consequently, it is interesting to compare the methods in
terms of speed. Table 2 shows the CPU time in seconds used by each method on the
different data sets. We can see that, especially for the larger data sets, the SVM-based
feature selection has a clear advantage over the wrapper. This is not surprising, as the
wrapper needs to train a larger number of SVMs. Specifically, to select a subset of d
features from a set of n given features, the wrapper method examines (d^2 + d(2r + 1))/2
SVMs, where r = n - d denotes the number of removed features, while the method
we propose examines only d + 1 SVMs: one for computing the relevance measure d_k
and d during hill-climbing. Thus, incorporating the information collected from the SVM
reduces the run-time complexity from quadratic to linear.
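As an illustrative example, selecting d = 10 out of n = 24 features (so r = 14) requires
the wrapper to examine (10^2 + 10 * 29)/2 = 195 SVMs, whereas the proposed method
trains only 10 + 1 = 11.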

4 Conclusion

We propose a method that utilizes the information-compressing capabilities of the sup-
port vector machine for feature selection. It is easy to understand, simple to implement,
fast to execute, and it performs as accurately as the wrapper method on a number of
real-world data sets.

Software

For our experiments we used the LIBSVM package [14]. The program used to cal-
culate the relevance measure d_k from a trained SVM is available for download from
http://www.cvgpr.uni-mannheim.de/heiler/.

References

[1] P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-
Hall International, 1982.

[2] George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset
selection problem. In Proceedings of ICML-94, 11th International Conference on
Machine Learning, pages 121-129, New Brunswick, NJ, 1994.

[3] T. Cover and J. van Campenhout. On the possible orderings in the measurement
selection problem. IEEE Trans. on Systems, Man, and Cybernetics, 7:657-661,
1977.

[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer-Verlag, 1996.

[5] Ron Kohavi and George H. John. Wrappers for feature subset selection. In Pro-
ceedings of the Tenth International Conference on Computational Learning Theory,
pages 245-271, 1997.

[6] P. Somol, P. Pudil, J. Novovičová, and P. Paclík. Adaptive floating search methods
in feature selection. Patt. Recog. Letters, 20(11):1157-1163, 1999.

[7] Matthias Heiler. Optimization criteria and learning algorithms for large mar-
gin classifiers. Master's thesis, University of Mannheim, Department of Mathematics
and Computer Science, Computer Vision, Graphics, and Pattern Recognition Group,
D-68131 Mannheim, Germany, 2001.

[8] Theodoros Evgeniou. Learning with kernel machine architectures. PhD thesis,
Massachusetts Institute of Technology, June 2000.

[9] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations
for object detection using kernel classifiers. In Asian Conference on Computer Vision,
pages 687-692, 2000.

[10] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[11] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y.,
1995.

[12] Ron Kohavi and Dan Sommerfield. Feature subset selection using the wrapper
method: Overfitting and dynamic search space topology. In Usama M. Fayyad
and Ramasamy Uthurusamy, editors, First International Conference on Knowledge
Discovery and Data Mining (KDD-95), 1995.

[13] T. Scheffer and R. Herbrich. Unbiased assessment of learning algorithms. In IJCAI-
97, pages 798-803, 1997.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library
for support vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[15] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley &
Sons, Inc., New York, 1991.
