Reihe Informatik
10/2001
heiler@uni-mannheim.de
http://www.ti.uni-mannheim.de/bmg
Abstract

The support vector machine compresses the information about class membership into a few support vectors with a clear geometric interpretation. It is tempting to use this information for selecting the most relevant input features. In this paper we present a method for doing so and provide evidence that it selects high-quality feature sets at a fraction of the costs of classical methods.
Keywords: support vector machine, feature subset selection, wrapper method
1 Introduction

The feature subset selection problem is an old problem studied in machine learning, statistics and pattern recognition [1]. For classification purposes, the problem can be stated as follows: Given a data set with features X_1, X_2, ..., X_n and labels Y, select the subset of features that allows the best classification of the labels. Since the number of candidate subsets grows exponentially with n, the search space is generally too large to be searched exhaustively. Thus, numerous heuristic search algorithms have been proposed for determining a suboptimal feature subset in a computationally efficient way (e.g., [1, 5, 6]).
In this paper, we focus on a specific learning algorithm for classification, the support vector machine. In this context, application of the wrapper method has one severe disadvantage: It can be computationally expensive. This is due to the fact that to assess the quality of each feature subset the machine learning algorithm must be trained and evaluated on it. Unfortunately, training SVMs can be slow, rendering the wrapper method a costly procedure for feature selection, especially on large multiclass data sets.
To overcome this difficulty, we present a novel strategy for feature subset selection which is directly based on the support vector architecture and the representation of decision functions in terms of support vectors. The general idea is to train a support vector machine once on a data set containing all features, extract some relevance measure from the trained machine, and use this information to lead a hill-climbing search directly toward a good feature subset. Since the number of reiterations of the training procedure increases only linearly with the number of selected features, this algorithm can be orders of magnitude faster than the wrapper method. Furthermore, we show that this computational efficiency can be obtained without sacrificing classification accuracy.
After completion of this work [7], the authors became aware of similar ideas reported in [8]. Whereas the latter work is applied in the context of visual object recognition [9], we focus directly on the feature selection problem and present here for the first time extensive numerical results which reveal the performance of our approach for established benchmark data sets [10].
In the two-feature example of Figure 1, the normal vector \vec{w} of the separating hyperplane is approximately parallel to the basis vector that is used for the relevant feature and approximately orthogonal to the other one. This holds for any n-dimensional SVM with linear kernel: If we take away all the basis vectors which are orthogonal to \vec{w} we will lose no information about class membership, as the corresponding features have no influence on the SVM decision. Accordingly, we can define the importance of each feature X_k by its amount of colinearity with \vec{w}:
d_k = (\langle \vec{w}, \vec{e}_k \rangle)^2 .   (1)
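For illustration, the colinearity measure (1) can be read directly off a trained linear-kernel SVM, since \vec{w} is an explicit model parameter. The short sketch below uses scikit-learn's SVC (a LIBSVM wrapper) on a binary problem; the function name and setup are our own, not part of the paper's software.

from sklearn.svm import SVC

def linear_relevance(X, y):
    """Colinearity measure d_k = <w, e_k>^2 of Eq. (1) for a linear kernel."""
    clf = SVC(kernel="linear").fit(X, y)
    w = clf.coef_[0]      # normal vector of the separating hyperplane (binary case)
    return w ** 2         # squared projection onto each coordinate axis e_k

Features whose score is close to zero correspond to directions nearly orthogonal to \vec{w} and can be removed with little effect on the decision function.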
In the nonlinear case the SVM decision function [11] reads

f(\vec{x}) = \sum_i \alpha_i y_i \langle \Phi(\vec{x}), \Phi(\vec{x}_i) \rangle + b ,   (2)

with a set of given training vectors \vec{x}_i, corresponding class labels y_i, Lagrange multipliers \alpha_i associated with the SVM optimization problem, and some offset from the origin b.
As the nonlinear mapping \Phi appears only inside the scalar product \langle \Phi(\vec{x}), \Phi(\vec{x}_i) \rangle, it is never evaluated explicitly but replaced by a kernel function. In the nonlinear case, however, the influence of a feature is no longer independent from the other features; it varies depending on where in the input space it is determined.
Figure 1: Separating hyperplane for feature selection. Circles indicate the support vectors.
However, for a given point in input space, \vec{x}, we can define the influence of X_k as the squared partial derivative with respect to X_k:
d_{k,\vec{x}} = (\langle \nabla f(\vec{x}), \vec{e}_k \rangle)^2 .   (3)
Note that for SVMs with linear kernel this definition reduces to the measure of colinearity defined in (1).
In order to evaluate (3), we have to select meaningful points \vec{x}. To this end, we can take advantage of the information-compressing capabilities of the support vector machine: The SVM decision function (2) essentially is linear in the mapped input vectors \Phi(\vec{x}_i). More precisely, it is linear in the mapped input vectors for which the SVM optimization process yields non-zero Lagrange parameters \alpha_i > 0. These input vectors are the support vectors SV = \{\vec{x}_i : \alpha_i > 0\}, and in practice their number is often small compared to the number of all input vectors [11, p. 135]. It is clear that features which have little influence on the support vectors also have a small effect on the SVM decision. Thus, a good measure of the importance of a feature is its average influence evaluated at the support vectors:
d_k = \frac{1}{|SV|} \sum_{\vec{x}_i \in SV} \frac{d_{k,\vec{x}_i}}{\sum_{k'} d_{k',\vec{x}_i}} .   (4)

For each support vector \vec{x}_i the influence is normalized by \sum_{k'} d_{k',\vec{x}_i}, so that its contribution sums to one. This is to avoid that outliers and support vectors located at very narrow margins dominate the overall result too much. Or, equivalently, it increases the influence of support vectors at clear, well-separated margins.
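To make (3) and (4) concrete for a nonlinear kernel, the sketch below computes the normalized average influence for an RBF kernel, where the gradient of (2) is available in closed form. It assumes a binary problem, scikit-learn's SVC as the LIBSVM wrapper, and an explicitly chosen gamma; it is an illustration of the measure, not the authors' original program.

import numpy as np
from sklearn.svm import SVC

def rbf_relevance(X, y, gamma=0.1):
    """Average normalized influence d_k over the support vectors, cf. Eqs. (3) and (4).

    For k(x, x_i) = exp(-gamma * ||x - x_i||^2) the gradient of (2) is
    grad f(x) = sum_i alpha_i * y_i * (-2 * gamma) * (x - x_i) * k(x, x_i).
    """
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    sv = clf.support_vectors_                # support vectors x_i
    coef = clf.dual_coef_[0]                 # alpha_i * y_i (binary case)

    d = np.zeros(X.shape[1])
    for x in sv:                             # evaluate (3) at every support vector
        diff = x - sv                                         # rows: x - x_i
        k_vals = np.exp(-gamma * np.sum(diff ** 2, axis=1))   # kernel values
        grad = ((coef * k_vals)[:, None] * (-2.0 * gamma) * diff).sum(axis=0)
        d_kx = grad ** 2                     # squared partial derivatives, Eq. (3)
        d += d_kx / d_kx.sum()               # per-point normalization from Eq. (4)
    return d / len(sv)                       # average over the support vectors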
Once we have calculated the importance measures \{d_k\}_{k=1,...,n} we use a simple hill-climbing search to determine the optimal number of relevant features. Specifically, we rank the features according to their d_k values and, starting from an empty feature set, subsequently add features with highest rank until the accuracy of an SVM trained on the selected features stops increasing.
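A minimal sketch of this ranking-and-growing search, assuming relevance scores d from a function such as rbf_relevance above and using mean cross-validated accuracy as the stopping signal (the plain "stop at the first non-improvement" rule below is a simplification of the criterion used in the experiments):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_features(X, y, d, cv=10):
    """Rank features by d_k and add them greedily until accuracy stops increasing."""
    ranking = np.argsort(d)[::-1]            # most relevant feature first
    selected, best_acc = [], -np.inf
    for k in ranking:
        candidate = selected + [int(k)]
        acc = cross_val_score(SVC(), X[:, candidate], y, cv=cv).mean()
        if acc <= best_acc:                  # accuracy no longer improves: stop
            break
        selected, best_acc = candidate, acc
    return selected, best_acc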
3 Experiments

To evaluate the performance of the SVM feature selection method we ran it on a number of data sets from the UCI machine learning repository [10]. For comparison, we also ran the wrapper method with hill-climbing search. Note that overfitting is a general problem with feature selection on small data sets [12, 13]. We tried to avoid it by using 10-fold cross-validation during the feature selection process, as well as a completely separate test set for assessing the quality of the selected features. For the hill-climbing search we used the stopping criterion proposed by Kohavi et al. [12]. For all experiments we employed the LIBSVM package with its default parameter settings [14].
Table 1 summarizes the results of our experiments. For each of the data sets examined we could reduce the number of features used for classification without sacrificing accuracy. More precisely, a one-tailed t-test revealed at the 5% level no statistically significant decrease in classification performance after feature selection. On the contrary, for the chess, the led and the optidigit data sets we found our feature selection to significantly increase classification accuracy. Note that the t-test revealed no difference in performance between the wrapper and the SVM-based selection method.
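The significance check can be sketched as a one-tailed paired t-test on per-fold accuracies; the helper below is our illustration (the function name and interface are assumptions), deriving the one-sided p-value from the two-sided value reported by scipy.

from scipy import stats

def significant_decrease(acc_full, acc_selected, alpha=0.05):
    """One-tailed paired t-test: did feature selection decrease accuracy?

    acc_full, acc_selected: per-fold accuracies with all vs. selected features.
    """
    t, p_two_sided = stats.ttest_rel(acc_full, acc_selected)
    # One-sided p-value for the alternative "accuracy with all features is higher".
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha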
The led data benefited very much from feature selection. This is not surprising as led is a synthetic data set consisting of 7 relevant features and 17 features containing random noise. Both feature selection algorithms reliably extract the 7 relevant features; however, some irrelevant features, which appear to be predictive on the given data, are also included. As this data set contains binary features only, we could easily estimate the Shannon entropy h, compute the information gain h(y) - h(y|X_k) for each individual feature X_k [15], and compare it to our relevance measure d_k. Figure 2 visualizes both relevance measures: The strong correlation (r = 0.99) between them is apparent, as is the clear distinction between the 7 relevant features on the left and the random features on the right. Note that for experiments comprising continuous features computing the information gain is often not straightforward, while evaluating the SVM-based relevance measure is.
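For reference, the information gain used in this comparison is straightforward to compute for discrete features; the short sketch below is ours and follows the textbook definition h(y) - h(y|X_k) [15].

import numpy as np

def entropy(labels):
    """Shannon entropy h of a discrete label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """h(y) - h(y|X_k) for a single discrete (e.g. binary) feature x."""
    h_cond = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - h_cond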
Table 1: Performance of different feature selection methods on a test set. Feature selection significantly reduces the number of features without sacrificing classification accuracy.

                 No Selection             Wrapper                  SVM Selection
Data set         Features  Accuracy (%)   Features  Accuracy (%)   Features  Accuracy (%)
breast cancer           9         95.71          5         95.42          6         95.71
chess                  35         33.33          5         86.67          4         86.67
crx                    43         83.73          5         85.15          5         86.03
diabetes                8         74.13          4         75.19          5         74.91
led                    24         66.27         11         73.10         10         74.70
mfeat                 649         97.60       n.a.          n.a.         31         97.50
glass                   9         60.63          4         61.36          5         60.54
mushroom              125         99.95          8         99.95         14        100.00
optidigit              64         97.68         36         98.32         36         98.39
Figure 2: SVM feature criterion and information gain for each feature of the led data set.
Table 2: CPU time (in seconds) used for the feature selection process. SVM selection is much faster than the wrapper method as fewer classifiers must be trained.

Data set          Wrapper   SVM Selection
breast cancer       19.29           4.073
chess               45.03            1.48
crx                355.74           17.92
diabetes            45.15           19.35
glass               13.87            3.46
led                253.25           23.04
mfeat                n.a.        59162.13
mushroom         27518.39         2977.34
optidigit       77742.152         2843.34
Our results show that the wrapper and the SVM method selected features of equal quality in all cases examined. Consequently, it is interesting to compare the methods in terms of speed. Table 2 shows the CPU time in seconds used by each method for the different data sets. We can see that especially for the larger data sets the SVM-based feature selection has a clear advantage over the wrapper. This is not surprising as the wrapper needs to train a larger number of SVMs. Specifically, to select a subset of d features from a set of n given features the wrapper method examines (d^2 + d(2r + 1))/2 SVMs, where r = (n - d) denotes the number of removed features, while the method we propose examines only (d + 1) SVMs: one for computing the relevance measure d_k and d during hill-climbing. Thus, incorporating the information collected from the SVM reduces the run-time complexity from quadratic to linear.
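As a worked illustration (our arithmetic, plugged into the formula above), take the optidigit entry of Table 1 with n = 64 and d = 36, hence r = 28:

(d^2 + d(2r + 1))/2 = (36^2 + 36 \cdot 57)/2 = 1674 SVMs for the wrapper, versus d + 1 = 37 for the proposed method.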
4 Conclusion

We propose a method that utilizes the information-compressing capabilities of the support vector machine for feature selection. It is easy to understand, simple to implement, fast to execute, and it performs as accurately as the wrapper method on a number of real-world data sets.
Software

For our experiments we used the LIBSVM package by [14]. The program used to calculate the relevance measure d_k from a trained SVM is available for download from http://www.cvgpr.uni-mannheim.de/heiler/.
References

[1] P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall International, 1982.

[2] George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning, pages 121-129, New Brunswick, NJ, 1994.

[3] T. Cover and J. van Campenhout. On the possible orderings in the measurement selection problem. IEEE Trans. on Systems, Man, and Cybernetics, 7:657-661, 1977.

[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

[5] Ron Kohavi and George H. John. Wrappers for feature subset selection. In Proceedings of the Tenth International Conference on Computational Learning Theory, pages 245-271, 1997.

[6] P. Somol, P. Pudil, J. Novovicova, and P. Paclik. Adaptive floating search methods in feature selection. Patt. Recog. Letters, 20(11):1157-1163, 1999.

[7] Matthias Heiler. Optimization criteria and learning algorithms for large margin classifiers. Master's thesis, University of Mannheim, Department of Mathematics and Computer Science, Computer Vision, Graphics, and Pattern Recognition Group, D-68131 Mannheim, Germany, 2001.

[8] Theodoros Evgeniou. Learning with kernel machine architectures. PhD thesis, Massachusetts Institute of Technology, June 2000.

[9] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Asian Conference on Computer Vision, pages 687-692, 2000.

[10] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[11] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.

[12] Ron Kohavi and Dan Sommerfield. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, First International Conference on Knowledge Discovery and Data Mining (KDD-95), 1995.

[13] T. Scheffer and R. Herbrich. Unbiased assessment of learning algorithms. In IJCAI-97, pages 798-803, 1997.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[15] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., New York, 1991.