Abstract
Support vector machines (SVM), one of the newer techniques for pattern classification, have been widely used in many application areas. The setting of the kernel parameters for the SVM in the training process impacts the classification accuracy. Feature selection is another factor that impacts classification accuracy. The objective of this research is to simultaneously optimize the parameters and the feature subset without degrading the SVM classification accuracy. We present a genetic algorithm (GA) approach for feature selection and parameters optimization to solve this kind of problem. We tried several real-world datasets using the proposed GA-based approach and the Grid algorithm, a traditional method of performing parameters searching. Compared with the Grid algorithm, our proposed GA-based approach significantly improves the classification accuracy and requires fewer input features for the support vector machines.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Support vector machines; Classification; Feature selection; Genetic algorithm; Data mining
C (penalty parameter). The Grid algorithm is an alternative to finding the best C and gamma when using the RBF kernel function. However, this method is time consuming and does not perform well (Hsu and Lin, 2002; LaValle and Branicky, 2002). Moreover, the Grid algorithm cannot perform the feature selection task.

Genetic algorithms have the potential to generate both the optimal feature subset and the SVM parameters at the same time. Our research objective is to optimize the parameters and feature subset simultaneously, without degrading the SVM classification accuracy. The proposed method performs feature selection and parameters setting in an evolutionary way. Based on whether or not feature selection is performed independently of the learning algorithm that constructs the classifier, feature subset selection algorithms can be classified into two categories: the filter approach and the wrapper approach (John, Kohavi, & Pfleger, 1994; Kohavi and John, 1997). The wrapper approach to feature subset selection is used in this paper because of its accuracy.

In the literature, only a few algorithms have been proposed for SVM feature selection (Bradley, Mangasarian, & Street, 1998; Bradley and Mangasarian, 1998; Weston et al., 2001; Guyon, Weston, Barnhill, & Vapnik, 2002; Mao, 2004). Some other GA-based feature selection methods have been proposed (Raymer, Punch, Goodman, Kuhn, & Jain, 2000; Yang and Honavar, 1998; Salcedo-Sanz, Prado-Cumplido, Pérez-Cruz, & Bousoño-Calzón, 2002). However, these papers focused on feature selection and did not deal with parameters optimization for the SVM classifier. Fröhlich and Chapelle (2003) proposed a GA-based feature selection approach that used the theoretical bounds on the generalization error for SVMs; the SVM regularization parameter can also be optimized using GAs in Fröhlich's paper.

This paper is organized as follows: a brief introduction to the SVM is given in Section 2. Section 3 describes basic GA concepts. Section 4 describes the GA-based feature selection and parameter optimization. Section 5 presents the experimental results from using the proposed method to classify several real-world datasets. Section 6 summarizes the results and draws a general conclusion.

Eqs. (1) and (2) can be combined into one set of inequalities:

y_i(\langle w \cdot x_i \rangle + b) - 1 \ge 0, \quad \forall i = 1, \ldots, m \qquad (3)

The SVM finds an optimal separating hyperplane with the maximum margin by solving the following optimization problem:

\min_{w,b} \; \frac{1}{2} w^T w \quad \text{subject to: } y_i(\langle w \cdot x_i \rangle + b) - 1 \ge 0 \qquad (4)

It is known that to solve this quadratic optimization problem one must find the saddle point of the Lagrange function:

L_p(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left( y_i(\langle w \cdot x_i \rangle + b) - 1 \right) \qquad (5)

where \alpha_i denotes the Lagrange multipliers, hence \alpha_i \ge 0. The search for an optimal saddle point is necessary because L_p must be minimized with respect to the primal variables w and b and maximized with respect to the non-negative dual variables \alpha_i. Differentiating with respect to w and b yields the following equations:

\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (6)

\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (7)

The Karush-Kuhn-Tucker (KKT) conditions for the optimum constrained function are necessary and sufficient for a maximum of Eq. (5). The corresponding KKT complementarity conditions are

\alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right] = 0, \quad \forall i \qquad (8)

Substituting Eqs. (6) and (7) into Eq. (5) transforms L_p into the dual Lagrangian L_D(\alpha):

\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle \qquad (9)

\text{subject to: } \alpha_i \ge 0, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0
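The weight vector of Eq. (6) can be recovered directly from the dual solution. The following is a minimal sketch, assuming scikit-learn (whose SVC class wraps LIBSVM) and a synthetic two-class dataset; it is for illustration and is not the authors' implementation.

```python
# Illustrative sketch: recover the primal weight vector w of Eq. (6)
# from the dual solution of a linear soft-margin SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only, so Eq. (6),
# w = sum_i alpha_i y_i x_i, reduces to a single dot product.
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

# For a linear kernel, coef_ exposes the same w; the two should agree.
assert np.allclose(w, clf.coef_)
print("support vectors:", len(clf.support_vectors_), "w:", w, "b:", b)
```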
The data points x_i with non-zero \alpha_i are called support vectors, as the optimal decision hyperplane f(x, \alpha^*, b^*) depends on them exclusively.

2.2. The optimal hyperplane for nonseparable data (linear generalized SVM)

The above concepts can also be extended to the non-separable case, i.e. when Eq. (3) has no solution. The goal is to construct a hyperplane that makes the smallest number of errors. To get a formal setting of this problem we introduce the non-negative slack variables \xi_i \ge 0, i = 1, \ldots, m, such that

\langle w \cdot x_i \rangle + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \qquad (11)

\langle w \cdot x_i \rangle + b \le -1 + \xi_i \quad \text{for } y_i = -1 \qquad (12)

In terms of these slack variables, the problem of finding the hyperplane that provides the minimum number of training errors, i.e. keeping the constraint violation as small as possible, has the formal expression:

\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to: } y_i(\langle w \cdot x_i \rangle + b) + \xi_i - 1 \ge 0, \; \xi_i \ge 0 \qquad (13)

This optimization model can be solved using the Lagrangian method, which is almost equivalent to the method for solving the optimization problem in the separable case. One must maximize the same dual-variables Lagrangian L_D(\alpha) (Eq. (14)) as in the separable case:

\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle \qquad (14)

\text{subject to: } 0 \le \alpha_i \le C, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0

With a kernel function k(\cdot,\cdot) replacing the inner product, the dual problem becomes

L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i \cdot x_j) \qquad (16)

\text{subject to: } 0 \le \alpha_i \le C, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0

This optimization model can be solved using the method for solving the optimization in the separable case. Therefore, the optimal hyperplane has the form of Eq. (17). Depending upon the applied kernel, the bias b can be implicitly part of the kernel function. Therefore, if a bias term can be accommodated within the kernel function, the nonlinear SV classifier can be shown as Eq. (18).

f(x, \alpha^*, b^*) = \sum_{i=1}^{m} y_i \alpha_i \langle \Phi(x_i), \Phi(x) \rangle + b = \sum_{i=1}^{m} y_i \alpha_i k(x_i, x) + b \qquad (17)

f(x, \alpha^*, b^*) = \sum_{i \in SV} y_i \alpha_i \langle \Phi(x_i), \Phi(x) \rangle = \sum_{i \in SV} y_i \alpha_i k(x_i, x) \qquad (18)

Some kernel functions include the polynomial, radial basis function (RBF), and sigmoid kernels (Burges, 1998), shown as functions (19), (20), and (21). In order to improve classification accuracy, the kernel parameters in these kernel functions should be properly set.

Polynomial kernel:

k(x_i, x_j) = (1 + x_i \cdot x_j)^d \qquad (19)

Radial basis function kernel:

k(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2) \qquad (20)

Sigmoid kernel:

k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j - \delta) \qquad (21)
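As a concrete illustration of Eqs. (19)-(21), the minimal sketch below evaluates the three kernels for a pair of feature vectors. The parameter values (d, gamma, kappa, delta) are arbitrary placeholders, not values recommended in this paper.

```python
# Minimal sketch of the kernel functions of Eqs. (19)-(21).
import numpy as np

def polynomial_kernel(xi, xj, d=3):
    """Eq. (19): k(xi, xj) = (1 + xi . xj)^d."""
    return (1.0 + np.dot(xi, xj)) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    """Eq. (20): k(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, kappa=0.1, delta=1.0):
    """Eq. (21): k(xi, xj) = tanh(kappa * xi . xj - delta)."""
    return np.tanh(kappa * np.dot(xi, xj) - delta)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(xi, xj), rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))
```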
Fig. 3. The chromosome comprises three parts, C, γ, and the features mask.
\text{fitness} = W_A \times \text{SVM\_accuracy} + W_F \times \left( \sum_{i=1}^{n_f} C_i \times F_i \right)^{-1} \qquad (23)

where
W_A: SVM classification accuracy weight;
SVM_accuracy: SVM classification accuracy;
W_F: weight for the number of features;
C_i: cost of feature i;
F_i: '1' represents that feature i is selected; '0' represents that feature i is not selected.

Scaling the feature values can help to increase SVM accuracy according to our experimental results. Generally, each feature can be linearly scaled to the range [-1, +1] or [0, 1] by formula (24), where v is the original value, v' is the scaled value, max_a is the upper bound of the feature value, and min_a is the lower bound of the feature value:

v' = \frac{v - \min_a}{\max_a - \min_a} \qquad (24)
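The following is a minimal sketch of Eq. (23) and Eq. (24), assuming unit feature costs C_i and the weights reported later in the paper (W_A = 0.8, W_F = 0.2); it is illustrative only and not the authors' implementation.

```python
# Sketch of the fitness of Eq. (23) and the linear scaling of Eq. (24).
import numpy as np

def scale_feature(v, min_a, max_a):
    """Eq. (24): map an original value v into [0, 1]."""
    return (v - min_a) / (max_a - min_a)

def fitness(svm_accuracy, feature_mask, feature_costs, w_a=0.8, w_f=0.2):
    """Eq. (23): fitness = W_A * accuracy + W_F * (sum_i C_i * F_i)^-1.
    Assumes at least one feature is selected (otherwise the sum is zero)."""
    selected_cost = np.sum(np.asarray(feature_costs) * np.asarray(feature_mask))
    return w_a * svm_accuracy + w_f * (1.0 / selected_cost)

mask = np.array([1, 0, 1, 1])             # F_i: 1 = feature i selected
costs = np.ones_like(mask, dtype=float)   # C_i: unit cost per feature (assumed)
print(scale_feature(7.0, min_a=2.0, max_a=12.0))   # -> 0.5
print(fitness(0.86, mask, costs))
```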
Fig. 4. System architecture of the proposed GA-based feature selection and parameters optimization for support vector machines (dataset scaling, population, trained SVM classifier, classification accuracy for the testing set, fitness evaluation, and genetic operations repeated until the termination criteria are satisfied, yielding the optimized (C, γ) and feature subset).
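The architecture of Fig. 4 amounts to a wrapper loop: decode each chromosome into (C, γ) and a feature mask, cross-validate an RBF SVM on the masked features, score it with Eq. (23), and apply genetic operators until termination. The sketch below is written with scikit-learn rather than the authors' Matlab/LIBSVM implementation; the decoding bounds, the dataset, and the reduced population and generation counts are assumptions for illustration only.

```python
# Minimal sketch of the GA wrapper of Fig. 4 with a binary chromosome of
# n_c bits for C, n_r bits for gamma, and n_f feature bits (Fig. 3).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))     # Eq. (24) scaling
NC, NR, NF = 20, 20, X.shape[1]
C_RANGE, G_RANGE = (0.01, 1000.0), (1e-4, 10.0)               # assumed bounds

def bits_to_value(bits, low, high):
    """Decode a bit string into a real value inside [low, high]."""
    frac = int("".join(map(str, bits)), 2) / (2 ** len(bits) - 1)
    return low + frac * (high - low)

def decode(chrom):
    C = bits_to_value(chrom[:NC], *C_RANGE)
    gamma = bits_to_value(chrom[NC:NC + NR], *G_RANGE)
    mask = np.array(chrom[NC + NR:], dtype=bool)
    return C, gamma, mask

def fitness(chrom, w_a=0.8, w_f=0.2):
    C, gamma, mask = decode(chrom)
    if not mask.any():
        return 0.0
    acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"),
                          X[:, mask], y, cv=5).mean()
    return w_a * acc + w_f * (1.0 / mask.sum())                # Eq. (23), C_i = 1

def evolve(pop_size=30, generations=20, cx_rate=0.7, mut_rate=0.02):
    length = NC + NR + NF
    pop = rng.integers(0, 2, size=(pop_size, length))
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        probs = fit / fit.sum()                                # roulette wheel
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):                    # two-point crossover
            if rng.random() < cx_rate:
                a, b = sorted(rng.choice(length, size=2, replace=False))
                children[i, a:b], children[i + 1, a:b] = \
                    parents[i + 1, a:b].copy(), parents[i, a:b].copy()
        flips = rng.random(children.shape) < mut_rate          # bit-flip mutation
        children[flips] = 1 - children[flips]
        children[0] = pop[np.argmax(fit)]                      # elitism
        pop = children
    best = pop[np.argmax([fitness(ind) for ind in pop])]
    return decode(best)

C, gamma, mask = evolve()
print("C =", C, "gamma =", gamma, "features kept =", mask.sum())
```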
Table 1
Datasets from the UCI repository
No. Names No. of classes No. of instances Nominal features Numeric features Total features
1 German (credit card) 2 1000 0 24 24
2 Australian (credit card) 2 690 6 8 14
3 Pima-Indian diabetes 2 760 0 8 8
4 Heart disease (Statlog Project) 2 270 7 6 13
5 Breast cancer (Wisconsin) 2 699 0 10 10
6 Contraceptive Method Choice 3 1473 7 2 9
7 Ionosphere 2 351 0 34 34
8 Iris 3 150 0 4 4
9 Sonar 2 208 0 60 60
10 Statlog project: vehicle 4 940 0 18 18
11 Vowel 11 990 3 10 13
(5) Termination criteria. When the termination criteria are satisfied, the process ends; otherwise, we proceed with the next generation.
(6) Genetic operation. In this step, the system searches for better solutions by genetic operations, including selection, crossover, mutation, and replacement.

5. Numerical illustrations

5.1. Experiment descriptions

To evaluate the classification accuracy of the proposed system in different classification tasks, we tried several real-world datasets from the UCI database (Hettich, Blake, & Merz, 1998). These datasets have been frequently used as benchmarks to compare the performance of different classification methods in the literature. These datasets consist of numeric and nominal attributes. Table 1 summarizes the number of numeric attributes, number of nominal attributes, number of classes, and number of instances for these datasets.

To guarantee valid results for making predictions regarding new data, the data set was further randomly partitioned into training sets and independent test sets via k-fold cross validation. Each of the k subsets acted as an independent holdout test set for the model trained with the remaining k-1 subsets. The advantages of cross validation are that all of the test sets are independent and the reliability of the results can be improved. The data set is divided into k subsets for cross validation. A typical experiment uses k = 10; other values may be used according to the data set size. For a small data set, it may be better to set a larger k, because this leaves more examples in the training set (Salzberg, 1997). This study used k = 10, meaning that all of the data are divided into 10 parts, each of which takes turns at being the testing data set; the other nine parts serve as the training data set for adjusting the model prediction parameters.

Our implementation was carried out in the Matlab 6.5 development environment by extending LIBSVM, which was originally designed by Chang and Lin (2001). The empirical evaluation was performed on an Intel Pentium IV CPU running at 1.6 GHz with 256 MB of RAM.

The Grid search algorithm is a common method for searching for the best C and γ. Fig. 5 shows the process of the Grid algorithm combined with the SVM classifier. In the Grid algorithm, pairs of (C, γ) are tried and the one with the best cross-validation accuracy is chosen. After identifying a 'better' region on the grid, a finer grid search on that region can be conducted (Hsu et al., 2003).

This research conducted the experiments using the proposed GA-based approach and the Grid algorithm. The results from the proposed method were compared with those from the Grid algorithm. In all of the experiments, 10-fold cross validation was used to estimate the accuracy of each learned classifier. Some empirical results are reported in the following sections.
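For reference, the Grid baseline can be reproduced in a few lines. The sketch below uses scikit-learn for illustration (the paper's experiments used Matlab/LIBSVM), and the exponential grid ranges are the commonly suggested ones from Hsu et al. (2003), not values taken from the paper.

```python
# Sketch of the Grid algorithm: try (C, gamma) pairs on an exponential grid
# and keep the pair with the best cross-validation accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),
              "gamma": 2.0 ** np.arange(-15, 4, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10).fit(X, y)
print("best (C, gamma):", search.best_params_,
      "CV accuracy:", search.best_score_)
```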
5.2. Accuracy calculation

Accuracy using the binary-target datasets can be demonstrated by the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate. For the multiple-class datasets, the accuracy is demonstrated only by the average hit rate. A two-by-two table with the classification results on the left side and the target status on top is shown in Table 2.

Table 2
Classification results versus target status

                        Target (or disease) +    Target (or disease) -
Predicted (or Test) +   True Positive (TP)       False Positive (FP)
Predicted (or Test) -   False Negative (FN)      True Negative (TN)

Some cases with the 'positive' class (with disease) are correctly classified as positive (TP, the true positive fraction); however, some cases with the 'positive' class will be classified as negative (FN, the false negative fraction). Conversely, some cases with the 'negative' class (without the disease) will be correctly classified as negative (TN, the true negative fraction), while some cases with the 'negative' class will be classified as positive (FP, the false positive fraction). TP and FP are two important evaluation measures for classifiers (Woods and Bowyer, 1997). Sensitivity and specificity describe how well the classifier discriminates between cases with the positive and the negative class (with and without disease).

Sensitivity is the proportion of cases with the positive class that are classified as positive (the true positive rate, expressed as a percentage). In probability notation, sensitivity is P(T+|D+) = TP/(TP + FN). Specificity is the proportion of cases with the negative class that are classified as negative (the true negative rate, expressed as a percentage). In probability notation, specificity is P(T-|D-) = TN/(TN + FP). The overall hit rate is the overall accuracy, calculated by (TP + TN)/(TP + FP + FN + TN).

The SVM_accuracy of the fitness in function (23) is measured by sensitivity × specificity for the datasets with two classes (positive or negative), and by the overall hit rate for the datasets with multiple classes.

The detailed parameter settings for the genetic algorithm are as follows: population size 500, crossover rate 0.7, mutation rate 0.02, two-point crossover, roulette wheel selection, and elitism replacement. We set n_c = 20 and n_r = 20; the value of n_f depends on the experimental datasets stated in Section 5.1. According to the fitness function of Eq. (23), W_A and W_F can influence the experimental result. The higher W_A is, the higher the classification accuracy is; the higher W_F is, the smaller the number of features is. We can compromise between the weights W_A and W_F. Taking the German and Australian datasets, for example, as shown in Figs. 6 and 7, the accuracy (measured by the overall hit rate) is high with large numbers of features when a high W_A and a low W_F are defined. We defined W_A = 0.8 and W_F = 0.2 for all experiments. The user can choose different weight values; however, the results could be different.

The termination criteria are that the generation number reaches 600 or that the fitness value does not improve during the last 100 generations. The best chromosome is obtained when the termination criteria are satisfied. Taking the German dataset, for example, the positive hit rate, negative hit rate, overall hit rate, number of selected features, and the best pairs of (C, γ) for each fold using the GA-based approach and the Grid algorithm are shown in Table 3. For the GA-based approach, the average positive hit rate is 89.6%, the average negative hit rate is 76.6%, the average overall hit rate is 85.6%, and the average number of features is 13. For the Grid algorithm, the average positive hit rate is 88.8%, the average negative hit rate is 46.2%, the average overall hit rate is 76.0%, and all 24 features are used. Note that the weights W_A and W_F are 0.8 and 0.2 in all of our experiments.

Table 4 shows the summary results for the positive hit rate, negative hit rate, overall hit rate, number of selected features, and running time for the 11 UCI datasets using the two approaches. In Table 4, the accuracy and average number of features are given in the form 'average ± standard deviation'. The GA-based approach generated small feature subsets, while the Grid algorithm uses all of the features.

To compare the overall hit rate of the proposed GA-based approach with the Grid algorithm, we used the nonparametric Wilcoxon signed rank test for all of the datasets. As shown in Table 4, the p-values for diabetes, breast cancer, and vehicle are larger than the prescribed statistical significance level of 0.05, but the other p-values are smaller than the significance level of 0.05. Generally, compared with the Grid algorithm, the proposed GA-based approach has good accuracy performance with fewer features.
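The positive, negative, and overall hit rates reported in Tables 3 and 4 follow the definitions given in Section 5.2. The sketch below computes them from a confusion matrix laid out as in Table 2; the counts used are made-up example values, not results from the paper.

```python
# Sketch of the accuracy measures of Section 5.2.
def hit_rates(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)          # positive hit rate, P(T+|D+)
    specificity = tn / (tn + fp)          # negative hit rate, P(T-|D-)
    overall = (tp + tn) / (tp + fp + fn + tn)
    # SVM_accuracy in Eq. (23): sensitivity * specificity for binary datasets
    return sensitivity, specificity, overall, sensitivity * specificity

print(hit_rates(tp=62, fp=8, fn=10, tn=120))
```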
Fig. 6. Illustration of the classification accuracy versus the accuracy weight, WA, for fold #4 of the German dataset.
Table 3
Experimental results for German dataset using GA-based approach and Grid algorithm
Table 4
Experimental results summary of GA-based approach and Grid algorithm on the test sets
The ability of a classifier to discriminate between 'positive' cases (+) and 'negative' cases (-) is evaluated using Receiver Operating Characteristic (ROC) curve analysis. ROC curves can also be used to compare the diagnostic performance of two or more diagnostic classifiers (DeLeo and Rosenfeld, 2001). For every possible cut-off point or criterion value selected to discriminate between the two populations (with positive or negative class value), a pair of sensitivity and specificity values is generated. An ROC curve shows the trade-off between sensitivity and specificity, and demonstrates that the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the classifier. The area under the curve (AUC) is the evaluation criterion for the classifier.

Taking fold #4 of the German dataset, for example, the ROC curves of the GA-based approach and the Grid algorithm are shown in Fig. 8, where the AUC is 0.85 and 0.82, respectively. The ROC curves for the other nine folds can be plotted in the same manner. In short, the average AUC over the 10 folds of the testing dataset of the German dataset is 0.8424 for the GA-based approach.
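Per-fold ROC curves and AUC values of this kind can be produced with standard tooling. The sketch below uses scikit-learn and a single illustrative train/test split; the dataset and the SVM parameters are assumptions, not the folds or settings used in the paper.

```python
# Sketch of the ROC/AUC evaluation: score the test fold with the SVM decision
# function, then trace sensitivity against 1-specificity over all cut-offs.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
scores = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr).decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)   # fpr = 1 - specificity
print("AUC:", roc_auc_score(y_te, scores))
```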
Fig. 7. Illustration of the classification accuracy versus the accuracy weight, WA, for fold #3 of the Australia dataset.
[Fig. 8: ROC curves of the GA-based approach and the Grid search; x-axis: 1-Specificity]

6. Conclusion

Generally, compared with the Grid algorithm, the proposed GA-based approach has good accuracy performance with fewer features. This study showed experimental results with the RBF kernel. However, other kernel parameters can also be optimized using the same approach. The proposed approach can also be applied to support vector regression (SVR). Because the kernel parameters and input features heavily influence the predictive accuracy of the SVR with different kernel functions, we can use the same GA-based feature selection and parameters optimization procedures to improve the SVR accuracy.
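As a hedged illustration of that extension, the sketch below adapts the fitness evaluation to regression by scoring a cross-validated SVR instead of a classifier. The dataset, the epsilon value, and the error-to-fitness mapping are assumptions for illustration, not choices taken from the paper.

```python
# Sketch: the chromosome still encodes (C, gamma, feature mask), but the
# fitness is built from a cross-validated SVR error instead of accuracy.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

def svr_fitness(C, gamma, mask, w_a=0.8, w_f=0.2):
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0
    mse = -cross_val_score(SVR(C=C, gamma=gamma, epsilon=0.1),
                           X[:, mask], y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    return w_a * (1.0 / (1.0 + mse)) + w_f * (1.0 / mask.sum())

print(svr_fitness(C=10.0, gamma=0.1, mask=np.ones(X.shape[1])))
```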
References

Hsu, C. W., & Lin, C. J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46(1–3), 219–314.
Joachims, T. (1998). Text categorization with support vector machines. In Proceedings of the European conference on machine learning (ECML) (pp. 137–142). Chemnitz, Germany.
John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Proceedings of the 11th international conference on machine learning (pp. 121–129). San Mateo, CA.
Kecman, V. (2001). Learning and soft computing. Cambridge, MA: The MIT Press.
Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
LaValle, S. M., & Branicky, M. S. (2002). On the relationship between classical grid search and probabilistic roadmaps. International Journal of Robotics Research, 23(7–8), 673–692.
Lin, H. T., & Lin, C. J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University. Available at: http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf.
Mao, K. Z. (2004). Feature subset selection for support vector machines through discriminative function pruning analysis. IEEE Transactions on Systems, Man, and Cybernetics, 34(1), 60–67.
Pontil, M., & Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6), 637–646.
Raymer, M. L., Punch, W. F., Goodman, E. D., Kuhn, L. A., & Jain, A. K. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164–171.
Salcedo-Sanz, S., Prado-Cumplido, M., Pérez-Cruz, F., & Bousoño-Calzón, C. (2002). Feature selection via genetic optimization. In Proceedings of the ICANN international conference on artificial neural networks (pp. 547–552). Madrid, Spain.
Salzberg, S. L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1, 317–327.
Schölkopf, B., & Smola, A. J. (2000). Statistical learning and kernel methods. Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (Vol. 13, pp. 668–674). Cambridge, MA: MIT Press.
Woods, K., & Bowyer, K. W. (1997). Generating ROC curves for artificial neural networks. IEEE Transactions on Medical Imaging, 16(3), 329–337.
Yang, J., & Honavar, V. (1998). Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13(2), 44–49.
Yu, G. X., Ostrouchov, G., Geist, A., & Samatova, N. F. (2003). An SVM-based algorithm for identification of photosynthesis-specific genome features. In Second IEEE computer society bioinformatics conference (pp. 235–243). CA, USA.