
A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue

Zhenyu Chen a,b, Jianping Li a,*, Liwei Wei a,b
a Institute of Policy & Management, Chinese Academy of Sciences, Beijing 100080, China
b Graduate University of Chinese Academy of Sciences, Beijing 100039, China

Received 30 November 2006; received in revised form 31 July 2007; accepted 31 July 2007
Artificial Intelligence in Medicine (2007) 41, 161-175
http://www.intl.elsevierhealth.com/journals/aiim
KEYWORDS: Multiple kernel learning; Support vector machine; Feature selection; Rule extraction; Gene expression data
Summary

Objective: Gene expression profiling using microarray techniques has recently been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise, and the number of genes overwhelms the number of available samples, which poses a great challenge for machine learning and statistical techniques. The support vector machine (SVM) has been used successfully to classify gene expression data of cancer tissue. In the medical field, however, it is crucial to give the user a transparent decision process, and how to explain the computed solutions and present the extracted knowledge remains a main obstacle for SVM.

Material and methods: A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We also propose a novel rule extraction approach that uses the information provided by the separating hyperplane and the support vectors to improve the generalization capacity and comprehensibility of the rules and to reduce the computational complexity.

Results and conclusion: Two public gene expression datasets, the leukemia dataset and the colon tumor dataset, are used to demonstrate the performance of this approach. Using the small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% on both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.

© 2007 Elsevier B.V. All rights reserved.

This research has been partially supported by a grant from the National Natural Science Foundation of China (#70531040) and by the 973 Project (#2004CB720103), Ministry of Science and Technology, China.
* Corresponding author. Tel.: +86 10 6263 4957; fax: +86 10 6254 2629.
E-mail addresses: zychen@casipm.ac.cn (Z. Chen), ljp@casipm.ac.cn (J. Li), lwwei@casipm.ac.cn (L. Wei).
0933-3657/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.artmed.2007.07.008
1. Introduction

DNA microarray technology makes it possible to measure the expression levels of thousands of genes in a single experiment [1,2]. There are many potential applications of DNA microarray technology, such as the functional assignment of genes and the recognition of gene regulation networks [3]. The diagnosis and treatment of human diseases has become a major application field of this technology. In particular, DNA microarray technology is considered a promising tool for cancer diagnosis [1-4]. Tumors with similar histopathological appearance can show very different responses to the same therapies, which greatly limits the decision accuracy of traditional clinical methods that rely on a limited set of historical and pathological features. Cancer is fundamentally a malfunction of genes, so utilizing gene expression data might be the most direct way of diagnosis [1-5]. Some reported works in the literature focus on constructing a decision support system to assist doctors and clinicians in their decision-making process [3,6]. Such information systems usually consist of the following three phases: feature selection, modeling of the problem and knowledge discovery.
Many machine learning techniques have been used to model gene expression data. Unsupervised learning methods, including hierarchical clustering [4,7], self-organizing maps [8], fuzzy adaptive resonance theory (ART) [9,10] and K-means clustering [11], are widely used in the functional assignment of novel genes, marker gene identification and class discovery. Recently, more researchers have paid attention to supervised learning techniques. The artificial neural network (ANN) [12-14] is the most popular supervised learning method in medical research. In particular, the fuzzy neural network (FNN) [6,13,14], one of the advanced ANN models, can extract the causality between input and output variables as explicit IF-THEN rules, which increases the explanation capacity of the neural network. Other supervised learning methods include Bayesian approaches [15], decision trees [16] and the support vector machine (SVM) [17-19]. Besides, ensemble learning methods [20-22] are also used in this field to improve the performance of single approaches.
The most outstanding characteristic of gene expression data is that it contains a large number of gene expression values (several thousands to tens of thousands) and a relatively small sample size (a few dozen). Furthermore, many genes are highly correlated, which leads to redundancy in the data. Besides, gene expression data contain a high level of technical and biological noise. These factors make the clustering or classification results susceptible to over-fitting and sometimes under-fitting [3,21,23,24]. Therefore, feature selection has to be performed prior to the application of a classification or clustering algorithm [3,6]. Feature selection also improves the transparency of the computational model. Especially for gene expression data analysis, a small set of genes that is indicative of important differences in cell states can serve either as a convenient diagnosis panel or as the candidates for the very expensive and time-consuming analyses required to determine whether they could serve as useful targets for therapy.
Feature selection aims at finding a powerfully predictive subset of features within a database and at reducing as far as possible the number of features presented to the modeling process [25-28]. The methods proposed to tackle the feature selection problem fall into two basic categories: filter and wrapper methods [29]. In filter methods, the data are preprocessed and some top-ranked features are selected using a quality metric, independently of the classifier. Because they are more efficient than wrapper methods, filter methods, such as the T-statistic, information entropy, information gain and a series of statistical impurity measures [1,30], are widely used on large-scale data such as gene expression data. In wrapper methods, the search for a good feature subset is conducted by using the classifier itself as part of the evaluation function [29,31]. Wrapper methods usually obtain better predictive accuracy estimates than filter methods, but they usually require much more computational time.
A recent breakthrough in feature selection research is the development of SVM-based techniques, which scale to thousands of variables and typically exhibit excellent performance in reducing the number of variables while maintaining or improving classification accuracy [32,19]. Most of these SVM-based techniques use a forward selection or backward elimination strategy [19,33,34], which makes it very hard for them to find global solutions [19]. Some approaches, including SVM-based recursive feature elimination (SVM-RFE) [19] and its improvements [35], have been proposed to deal with this issue, but they are clearly computationally expensive.
Hardin et al. argue that linear SVM-based feature selection algorithms may remove strongly relevant variables and keep weakly relevant ones [32]. In fact, SVM-RFE and the multiple kernel support vector machine (MK-SVM) proposed in this paper carry out a multivariable feature selection process. That is to say, they aim not at ranking single marker genes but at finding the gene group that works together as pathway components and reflects the state of the cell [36]. New issues then arise: what is the contribution of each gene in the gene group, and what is the interaction of the genes? Besides, Hardin et al. also point out that in gene expression data even random gene subsets give good classification performance [32]. So a more transparent model is required to give a further explanation of the selected gene subset. The extraction of human-comprehensible rules from the selected gene subset is necessary to answer these doubts and debates.
Much research has focused on rule extraction from ANNs in the last decade [37-39]. The approaches to rule extraction from ANNs can be categorized as: link rule extraction techniques, black-box rule extraction techniques, extracting fuzzy rules and extracting rules from recurrent networks [40]. But there are few papers published on rule extraction from SVM. Nunez et al. propose an SVM + prototypes method in which support vectors and the prototype points defined by the K-means clustering algorithm are used to determine the rules [41]; however, the introduction of K-means clustering makes the rule extraction process uncertain and sensitive to the initialization. Fung et al. define rules as hypercubes and use linear programming to optimize the vertices of the rules [42]. The drawback of this method is that it is only suitable for the linear kernel, whereas the nonlinear mapping and the kernel trick are well known to be among the most important characteristics of SVM. Fu et al. propose a rule-extraction approach (RuleExSVM) to obtain hyper-rectangular rules based on the information provided by the support vectors and the decision function [43]. Some other approaches treat SVM as a black box: following the training of SVM, an interpretable model such as a decision tree [44,45] is used to generate rules. However, they cannot guarantee that the extracted rules have the same generalization performance as the SVM.
Multiple kernel learning considers multiple kernels or parameterizations of kernels to replace the single fixed kernel [46]. It provides more flexibility and the chance to choose a suitable kernel. Some efficient methods [47-49] have been proposed to perform the optimization over certain convex combinations of basic kernels. In the present paper, this idea is extended to perform feature selection and rule extraction in the framework of SVM. The multiple kernels are described as a convex combination of two kinds of single feature basic kernels. A sparse optimization method, 1-norm based linear programming, is proposed to carry out the optimization of the parameter of each basic kernel (the feature parameter). In this way, feature selection becomes equivalent to multiple-parameter learning [51], which is easy to perform, and simple, comprehensible rules can then be extracted by an algorithm with low computational cost.
This paper is organized as follows. Section 2 first describes SVM briefly; the proposed MK-SVM scheme for feature selection and rule extraction is then developed in detail. Section 3 presents the experimental results and analysis on two public gene expression datasets: the ALL-AML leukemia dataset and the colon tumor dataset. Section 4 summarizes the results and draws a general conclusion.
2. MK-SVM for feature selection and rule extraction

2.1. Brief introduction of SVM

In this section we briefly describe the basic SVM concepts for a typical two-class classification problem. These concepts can also be found in [23,25,19]. Given a training set of data points $G = \{(\vec{x}_i, y_i)\}_{i=1}^{n}$, with $\vec{x}_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$, the nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function $\phi$ and constructs an optimal hyperplane, defined by $\vec{w} \cdot \phi(\vec{x}) + b = 0$, to separate the examples of the two classes. For SVM with the L1 soft-margin formulation, this is done by solving the primal problem:

$$\min\; J(\vec{w}, \vec{\xi}) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1)$$

$$\text{s.t.}\quad y_i(\vec{w}^{T}\phi(\vec{x}_i) + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, \ldots, n \qquad (2)$$

where the $\xi_i \ge 0$ are non-negative slack variables and the regularization parameter $C$ determines the trade-off between maximizing the margin $2/\|\vec{w}\|$ and minimizing the empirical risk.
The above quadratic optimization problem can be solved by finding the saddle point of the Lagrange function:

$$L_p(\vec{w}, b, \vec{\xi}, \vec{\alpha}) = J(\vec{w}, \vec{\xi}) - \sum_{i=1}^{n} \alpha_i \{ y_i(\vec{w}^{T}\phi(\vec{x}_i) + b) + \xi_i - 1 \} - \sum_{i=1}^{n} \delta_i \xi_i \qquad (3)$$

where $\alpha_i$ and $\delta_i$ denote the Lagrange multipliers, with $\alpha_i \ge 0$ and $\delta_i \ge 0$.
Differentiating with respect to $\vec{w}$, $b$ and $\xi_i$ yields the following equations:

$$\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{n} \alpha_i y_i \phi(\vec{x}_i) \qquad (4)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i \alpha_i = 0 \qquad (5)$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \delta_i,\quad i = 1, \ldots, n \qquad (6)$$
Substituting Eqs. (4) and (5) into Eq. (3) transforms $L_p$ into the dual Lagrangian:

$$\max\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(\vec{x}_i, \vec{x}_j) \qquad (7)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C,\;\; i = 1, \ldots, n \qquad (8)$$

where $k(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)$ is called the kernel function. The Gaussian kernel is the most commonly used kernel function:

$$K(\vec{x}, \vec{y}) = \exp\!\left(-\sigma^{-2}\,\|\vec{x} - \vec{y}\|^2\right) \qquad (9)$$

where $\sigma^2$ is the kernel parameter.
The Karush-Kuhn-Tucker (KKT) conditions for the constrained optimum are necessary and sufficient for a maximum of Eq. (3). The corresponding KKT complementarity conditions require that the optimal solutions $\vec{\alpha}^{*}$, $\vec{w}^{*}$, $b^{*}$ satisfy:

$$\alpha_i^{*}\left[ y_i\left( \sum_{j=1}^{n} \alpha_j^{*} y_j k(\vec{x}_i, \vec{x}_j) + b^{*} \right) - 1 \right] = 0,\quad i = 1, \ldots, n \qquad (10)$$

In a typical classification task, only a small subset of the Lagrange multipliers $\alpha_i^{*}$ tends to be greater than zero. The training vectors with nonzero $\alpha_i^{*}$ are called support vectors. Geometrically, these vectors determine the optimal separating hyperplane and satisfy

$$y_i\left( \sum_{j=1}^{n} \alpha_j^{*} y_j k(\vec{x}_i, \vec{x}_j) + b^{*} \right) = 1 \qquad (11)$$

for every support vector $\vec{x}_i$.
2.2. MK-SVM I: MK-SVM for feature selection

Multiple kernel learning is viewed as an effective way to design an optimal kernel. Micchelli et al. [46] draw the conclusion that the optimal kernel can always be obtained as a convex combination of finitely many basic kernels. In the present paper, this idea is extended to improve the transparency of kernel methods.
In practice, a simple and efficient method is to describe the kernel function as a convex combination of basic kernels:

$$k(\vec{x}_i, \vec{x}_j) = \sum_{d=1}^{m} \beta_d\, k_d(\vec{x}_i, \vec{x}_j), \qquad \beta_d \ge 0 \qquad (12)$$
In the present paper, we introduce a new kernel, the single feature kernel, to improve the explanation capacity of SVM. The single feature kernel can be written as

$$k(\vec{x}_i, \vec{x}_j) = \sum_{d=1}^{m} \beta_d\, k(x_{i,d}, x_{j,d}) \qquad (13)$$

where $x_{i,d}$ denotes the $d$th attribute of the input vector $\vec{x}_i$.
In Eq. (13), the parameter $\beta_d$ represents the weight of the single feature kernel. It plays a crucial role in feature selection, so it is called the feature coefficient. If the parameter $\beta_d$ equals zero, the corresponding feature has no impact on the output of the SVM and can be discarded. It is therefore the introduction of the single feature kernel that transforms the feature selection problem, which usually has a huge search space and high computational complexity, into the problem of finding sparse feature coefficients.
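For illustration, the following minimal sketch (Python/NumPy; the helper names and the choice of per-feature Gaussian kernels $k(x_{i,d}, x_{j,d}) = \exp(-(x_{i,d}-x_{j,d})^2/\sigma^2)$ are assumptions of ours, not code from our implementation) shows how the combined Gram matrix of Eq. (13) can be assembled from single feature kernels and feature coefficients:

import numpy as np

def single_feature_gaussian_kernels(X1, X2, sigma2):
    # Per-feature Gaussian kernels k(x_{i,d}, x_{j,d}) = exp(-(x_{i,d} - x_{j,d})^2 / sigma2).
    # Returns an array of shape (m, n1, n2): one Gram matrix per feature d.
    diff = X1[:, None, :] - X2[None, :, :]            # (n1, n2, m)
    return np.transpose(np.exp(-(diff ** 2) / sigma2), (2, 0, 1))

def multiple_kernel(X1, X2, beta, sigma2):
    # Convex combination of single feature kernels, Eq. (13): sum_d beta_d * k_d.
    K_d = single_feature_gaussian_kernels(X1, X2, sigma2)
    return np.tensordot(beta, K_d, axes=1)            # (n1, n2)

A feature coefficient $\beta_d = 0$ simply removes the $d$th Gram matrix from the sum, which is exactly how feature selection is expressed in this formulation.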
When the single feature kernel described in Eq. (13) is used, the optimization problem of SVM changes into

$$\max\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{d=1}^{m} \beta_d\, k(x_{i,d}, x_{j,d}) \qquad (14)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0; \qquad 0 \le \alpha_i \le C,\;\; i = 1, \ldots, n; \qquad \beta_d \ge 0,\;\; d = 1, \ldots, m \qquad (15)$$
Unlike the standard SVM, both the Lagrange coefficients $\alpha_j$ and the feature coefficients $\beta_d$ need to be optimized. A two-stage iterative procedure is used to decompose the problem into two simple convex optimization problems [50], so that each can be solved by standard optimizers. With the feature coefficients $\beta_d$ fixed, the Lagrange coefficients $\alpha_j$ can be obtained by solving the optimization problem described in (14) and (15).
The optimization of the feature coefficients $\beta_d$ can be seen as a model selection problem for SVM, which has been well studied [51-54]. The choice of multiple hyperparameters in SVM can be made by minimizing some estimate of the generalization error [51,53]. Here, a 1-norm soft-margin error function is minimized to obtain a sparse solution [55]:

$$\min\; J(\vec{\beta}, \vec{\xi}) = \sum_{d=1}^{m} \beta_d + \lambda \sum_{i=1}^{n} \xi_i \qquad (16)$$

$$\text{s.t.}\quad y_i\left( \sum_{d=1}^{m} \beta_d \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) + b \right) \ge 1 - \xi_i; \qquad \xi_i \ge 0,\;\; i = 1, \ldots, n; \qquad \beta_d \ge 0,\;\; d = 1, \ldots, m \qquad (17)$$

In Eq. (16), the regularization parameter $\lambda$ controls the sparsity of the feature coefficients.
The dual of this linear program is (see Appendix A for the derivation):

$$\max\; \sum_{i=1}^{n} u_i \qquad (18)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} u_i y_i \left( \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) \right) \le 1,\;\; d = 1, \ldots, m; \qquad \sum_{i=1}^{n} u_i y_i = 0; \qquad 0 \le u_i \le \lambda,\;\; i = 1, \ldots, n \qquad (19)$$
Algorithm 1 (Training of MK-SVM I).
Input: input data vectors $\vec{x}_i \in \mathbb{R}^m$ and outputs $y_i \in \{-1, 1\}$.
Step 1: Initialization. Set the regularization parameters $C = 1$ and $\lambda = 1$, and the kernel parameter $\sigma^2 = x_{i,d}$ ($x_{i,d}$ is the $d$th attribute of a randomly chosen input vector $\vec{x}_i$). The feature coefficients are set to $\{\beta_d^0 = 1 \mid d = 1, \ldots, m\}$.
Step 2: While $\frac{1}{m} \sum_{d=1}^{m} |\beta_d^t - \beta_d^{t-1}| \ge \varepsilon$:
1. Solve for the Lagrange coefficients $\alpha_j^t$. The Lagrange coefficients $\alpha_j^t$ are obtained by solving the quadratic program described by Eqs. (20) and (21), in which the feature coefficients $\beta_d^{t-1}$ are used:

$$\max\; \sum_{i=1}^{n} \alpha_i^t - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i^t \alpha_j^t y_i y_j \sum_{d=1}^{m} \beta_d^{t-1} k(x_{i,d}, x_{j,d}) \qquad (20)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i^t = 0; \qquad 0 \le \alpha_i^t \le C,\;\; i = 1, \ldots, n \qquad (21)$$

2. Solve for the feature coefficients $\beta_d^t$. The feature coefficients $\beta_d^t$ are obtained by solving the dual linear program described by Eqs. (22) and (23), in which the Lagrange coefficients $\alpha_j^t$ from the previous step are used:

$$\max\; \sum_{i=1}^{n} u_i^t \qquad (22)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} u_i^t y_i \left( \sum_{j=1}^{n} \alpha_j^t y_j k(x_{i,d}, x_{j,d}) \right) \le 1,\;\; d = 1, \ldots, m; \qquad \sum_{i=1}^{n} u_i^t y_i = 0; \qquad 0 \le u_i^t \le \lambda,\;\; i = 1, \ldots, n \qquad (23)$$

3. Calculate cross-validation errors. The cross-validation errors are calculated using the coefficients $\alpha_j^t$ and $\beta_d^t$ obtained in the two steps above:

$$f(\vec{x}^{test}) = \mathrm{sign}\left( \sum_{d=1}^{m} \beta_d^t \sum_{j=1}^{n} \alpha_j^t y_j k(x_{j,d}, x_d^{test}) + b \right) \qquad (24)$$

4. End.
Step 3: Tune the parameters and go back to Step 2 until the output is optimal.
Output: the feature coefficients $\vec{\beta}$, the support vectors and the classification results. The selected feature subset is composed of the features having nonzero feature coefficients.
Note that the problem has converged when the selected features do not vary between iterations, so we define the following convergence criterion:

$$\frac{1}{m} \sum_{d=1}^{m} |\beta_d^t - \beta_d^{t-1}| \le \varepsilon \qquad (25)$$

where $\beta_d^t$ represents the feature coefficients obtained in iteration $t$. We set $\varepsilon$ to 0.01 in the experiments. In practice, a few iterations (usually no more than three) achieve a desirable performance.
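A minimal sketch of this alternating procedure for small problems is given below (Python with NumPy/SciPy; it is not the implementation used for the experiments, and the use of SLSQP for the small dual QP and of the primal LP (16)-(17) in place of the dual (22)-(23) are our own simplifying choices):

import numpy as np
from scipy.optimize import minimize, linprog

def train_mksvm1(X, y, C=1.0, lam=1.0, sigma2=0.1, eps=0.01, max_iter=10):
    # Alternating optimization of Algorithm 1 (sketch): QP for alpha, then LP for beta.
    n, m = X.shape
    diff = X[:, None, :] - X[None, :, :]
    K = np.transpose(np.exp(-(diff ** 2) / sigma2), (2, 0, 1))   # K[d, i, j]
    beta = np.ones(m)
    for _ in range(max_iter):
        beta_old = beta.copy()
        # Step 1: dual QP (20)-(21) with beta fixed.
        Q = (y[:, None] * y[None, :]) * np.tensordot(beta, K, axes=1)
        fun = lambda a: 0.5 * a @ Q @ a - a.sum()
        jac = lambda a: Q @ a - 1.0
        cons = {"type": "eq", "fun": lambda a: y @ a, "jac": lambda a: y}
        res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0.0, C)] * n,
                       constraints=[cons], method="SLSQP")
        alpha = res.x
        # Step 2: primal LP (16)-(17) with alpha fixed; variables [beta, b, xi].
        S = np.einsum("dij,j->id", K, alpha * y)      # S[i, d] = sum_j alpha_j y_j k_d(i, j)
        c = np.concatenate([np.ones(m), [0.0], lam * np.ones(n)])
        A_ub = np.hstack([-y[:, None] * S, -y[:, None], -np.eye(n)])
        bounds = [(0, None)] * m + [(None, None)] + [(0, None)] * n
        lp = linprog(c, A_ub=A_ub, b_ub=-np.ones(n), bounds=bounds, method="highs")
        beta, b = lp.x[:m], lp.x[m]
        if np.abs(beta - beta_old).mean() < eps:       # convergence criterion, Eq. (25)
            break
    return alpha, beta, b

The selected features are then simply the indices whose feature coefficients exceed a small tolerance, e.g. np.nonzero(beta > 1e-8)[0].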
2.3. MK-SVM II: SVM with a mixture of kernels

In this section we use a new single feature kernel with stricter conditions:

$$k(\vec{x}_i, \vec{x}_j) = \sum_{d=1}^{m} \beta_d\, k(x_{i,d}, x_{j,d}), \qquad \beta_d \ge 0, \qquad \sum_{d=1}^{m} \beta_d = 1 \qquad (26)$$

Substituting (26) into the quadratic program described in (7) and (8), we can deduce the following formulation:

$$\max\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{d=1}^{m} \beta_d\, k(x_{i,d}, x_{j,d}) \qquad (27)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0; \qquad 0 \le \alpha_i \le C,\;\; i = 1, \ldots, n; \qquad \beta_d \ge 0,\;\; d = 1, \ldots, m; \qquad \sum_{d=1}^{m} \beta_d = 1 \qquad (28)$$
Let $\gamma_{i,d} = \alpha_i \beta_d$. Since $\sum_{d=1}^{m} \beta_d = 1$, we get:

$$\sum_{i,d} \gamma_{i,d} = \sum_{i,d} \alpha_i \beta_d = \sum_{i} \alpha_i \qquad (29)$$

$$\sum_{i,d} y_i \gamma_{i,d} = \sum_{i,d} y_i \alpha_i \beta_d = \sum_{i} y_i \alpha_i = 0 \qquad (30)$$

The optimization problem described in (27) and (28) then changes into:

$$\max\; \sum_{i,d} \gamma_{i,d} - \frac{1}{2} \sum_{i,j,d} \gamma_{i,d}\, \gamma_{j,d}\, y_i y_j\, k(x_{i,d}, x_{j,d}) \qquad (31)$$

$$\text{s.t.}\quad \sum_{i,d} y_i \gamma_{i,d} = 0; \qquad 0 \le \sum_{d=1}^{m} \gamma_{i,d} \le C,\;\; i = 1, \ldots, n; \qquad \gamma_{i,d} \ge 0,\;\; d = 1, \ldots, m \qquad (32)$$

Here the Lagrange coefficient $\alpha_j$ of SVM is replaced by the new coefficient $\gamma_{i,d}$. The number of coefficients that need to be optimized increases from $n$ to $n \times m$, which greatly increases the computational cost for high-dimensional datasets.
The linear programming SVM [56,57] is a promising approach to reduce the computational cost of SVM. Recently, Wu et al. pointed out that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM [56]. Ikeda et al. concluded, by means of a geometrical description, that the SVM solutions depend rather little on the norm used [57]. Based on these ideas, we propose the 1-norm based MK-SVM:

$$\min\; J(\vec{\gamma}, \vec{\xi}) = \sum_{i,d} \gamma_{i,d} + \lambda \sum_{i=1}^{n} \xi_i \qquad (33)$$

$$\text{s.t.}\quad y_i\left( \sum_{j,d} \gamma_{j,d}\, y_j\, k(x_{i,d}, x_{j,d}) + b \right) \ge 1 - \xi_i; \qquad \xi_i \ge 0,\;\; i = 1, \ldots, n; \qquad \gamma_{i,d} \ge 0,\;\; d = 1, \ldots, m \qquad (34)$$

In Eq. (33), the regularization parameter $\lambda$ controls the sparsity of the coefficients $\gamma_{i,d}$.
The dual of this linear program is (see Appendix A for the derivation):

$$\max\; \sum_{i=1}^{n} u_i \qquad (35)$$

$$\text{s.t.}\quad \sum_{i=1}^{n} u_i y_i\, y_j\, k(x_{i,d}, x_{j,d}) \le 1,\;\; j = 1, \ldots, n,\;\; d = 1, \ldots, m; \qquad \sum_{i=1}^{n} u_i y_i = 0; \qquad 0 \le u_i \le \lambda,\;\; i = 1, \ldots, n \qquad (36)$$

It can be seen that the above idea is equivalent to the approach called "mixture of kernels" [58], so the new coefficient $\gamma_{i,d}$ is called the mixture coefficient. When the vector $\vec{\gamma}_i$ contains at least one nonzero mixture coefficient $\gamma_{i,d}$, the corresponding input data vector $\vec{x}_i$ is also called a support vector. Obviously, the number of attributes retained in each support vector is not identical, which improves the flexibility of rule extraction. The details of the rule extraction are described in the next section.
Algorithm 2 (Training of MK-SVM II).
Input: input data vectors $\vec{x}_i \in \mathbb{R}^m$ and outputs $y_i \in \{-1, 1\}$.
Step 1: Initialization. Set the regularization parameter $\lambda = 1$ and the kernel parameter $\sigma^2 = x_{i,d}$ ($x_{i,d}$ is the $d$th attribute of a randomly chosen input vector $\vec{x}_i$).
Step 2: Solve for the coefficients $\gamma_{i,d}$. The coefficients $\gamma_{i,d}$ are obtained by solving the dual linear program described by Eqs. (35) and (36).
Step 3: Calculate cross-validation errors. The cross-validation errors are calculated from the output of MK-SVM II:

$$f(\vec{x}^{test}) = \mathrm{sign}\left( \sum_{j,d} \gamma_{j,d}\, y_j\, k(x_d^{test}, x_{j,d}) + b \right) \qquad (37)$$

Step 4: Tune the parameters and go back to Step 2 until the output is optimal.
Output: the mixture coefficients $\vec{\gamma}$, the support vectors and the classification results.
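A minimal sketch of this training step is shown below (Python/SciPy; it is not the implementation used for the experiments, and the variable ordering, the use of the primal LP (33)-(34) rather than its dual, and the per-feature Gaussian kernels are our own assumptions). In the overall pipeline MK-SVM II is applied only to the small gene subset selected by MK-SVM I, so the $n \times m$ variables of the LP remain manageable:

import numpy as np
from scipy.optimize import linprog

def train_mksvm2(X, y, lam=1.0, sigma2=0.1):
    # Solve the 1-norm MK-SVM II primal LP, Eqs. (33)-(34) (sketch).
    # Variable order: [gamma_{1,1..m}, ..., gamma_{n,1..m}, b, xi_1..xi_n].
    n, m = X.shape
    diff = X[:, None, :] - X[None, :, :]                 # (n, n, m)
    K = np.exp(-(diff ** 2) / sigma2)                    # K[i, j, d] = k(x_{i,d}, x_{j,d})
    M = (K * y[None, :, None]).reshape(n, n * m)         # M[i, (j,d)] = y_j k(x_{i,d}, x_{j,d})
    c = np.concatenate([np.ones(n * m), [0.0], lam * np.ones(n)])
    A_ub = np.hstack([-y[:, None] * M, -y[:, None], -np.eye(n)])
    bounds = [(0, None)] * (n * m) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n), bounds=bounds, method="highs")
    gamma = res.x[: n * m].reshape(n, m)                 # mixture coefficients gamma[j, d]
    b = res.x[n * m]
    return gamma, b

def decision_function(X_test, X_sv, y_sv, gamma, b, sigma2=0.1):
    # Eq. (37): f(x) = sign( sum_{j,d} gamma_{j,d} y_j k(x_d, x_{j,d}) + b ).
    diff = X_test[:, None, :] - X_sv[None, :, :]         # (t, n, m)
    K = np.exp(-(diff ** 2) / sigma2)
    score = np.einsum("tjd,jd->t", K, gamma * y_sv[:, None]) + b
    return np.sign(score)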
2.4. Rule extraction from MK-SVM II

According to the KKT complementarity conditions, the support vectors of MK-SVM II satisfy

$$\sum_{j,d} \gamma_{j,d}\, y_j\, k(x_{i,d}, x_{j,d}) + b = \pm 1,\quad i = 1, \ldots, n \qquad (38)$$

The following simplified formulation is investigated first, for the case of the linear kernel:

$$\sum_{j,d} \gamma_{j,d}\, y_j\, x_{i,d}\, x_{j,d} + b = \pm 1,\quad i = 1, \ldots, N \qquad (39)$$

This problem is similar to that addressed in [42], where mathematical programming is used to optimize the vertices of the rules. For MK-SVM II, the support vectors can be used as vertices of hypercubes, and a series of hypercubes can approximate the subspace of each class. For MK-SVM II with the linear kernel, one of the corners of the (normalized) input space can obviously be used as the other vertex of each rule in order to increase the coverage rate. The extracted rules in a two-dimensional space then have the form:

IF $l_1 \le x_1 \le 1$ AND $0 \le x_2 \le u_2$, THEN $y = 1\,(-1)$   (40)

This is easily extended to an $m$-dimensional space. We define a rule as covering a hypercube with axis-parallel faces and one vertex on the hyperplane. It has the following form:

IF $\bigwedge_{c=1}^{p} \left( l_c \le x_c \le u_c \right)$, THEN $y = 1\,(-1)$   (41)

where $p$ is the number of premises in the rule and $l_c$ and $u_c$ are its lower and upper bounds.
The MK-SVM II based approach can also extract rules from a nonlinear separating hyperplane.
We now introduce the concepts of "up-regulated" and "down-regulated".

Definition 1. Considering a gene $j$ and a sample $i$, the gene $j$ is said to be up-regulated in sample $i$ if

$$a_{ij} \ge a \qquad (42)$$

where $a_{ij}$ is the expression value of gene $j$ in sample $i$ and $a$ is a constant.

Definition 2. Considering a gene $j$ and a sample $i$, the gene $j$ is said to be down-regulated in sample $i$ if

$$a_{ij} \le a \qquad (43)$$

where $a_{ij}$ is the expression value of gene $j$ in sample $i$ and $a$ is a constant.

Wang et al. defined "up-regulated", "down-regulated" and gene regulation probabilities [59] and then used various statistical methods to compute the gene regulation probabilities. In the present paper, the rule extraction algorithm is used to learn the constant $a$. Using the two definitions above, the rule described in Eq. (40) can be translated into the following form:

IF $x_1$ up-regulated AND $x_2$ down-regulated, THEN $y = 1\,(-1)$   (44)
Rules with linguistic labels have the following advantages:
(1) They can be used to analyze data from multiple heterogeneous sources. Different microarray techniques use different mechanisms to measure gene expression levels, which makes the expression levels reported by different techniques not directly comparable with each other. For rules with linguistic labels, the parameter $a$ is obtained by the machine learning process and varies with the experimental settings.
(2) They make it possible to compare and fuse the results on multiple heterogeneous datasets in order to improve the classifier's performance.
(3) As further work, we can define a membership function and use an MK-SVM-based algorithm to optimize its parameters. This will lead to a new way of extracting fuzzy rules from SVM.
We define the following five measures to evaluate the generalization capacity and comprehensibility of the extracted rules:
(1) Soundness: it measures how many times each rule is correctly fired. A rule is correctly fired when all of its conditions are satisfied by the sample and its consequent matches the target decision.
(2) Completeness: it measures how many times samples are correctly classified by this rule and by no rule preceding it; it is therefore useful in the rule-ordering algorithm for the final decision.
(3) False-alarm: it measures how many times each rule is misfired. A rule is misfired when all of its conditions are satisfied by the sample but its consequent does not match the target decision.
(4) Number of rules: it measures the comprehensibility of the extracted rule set.
(5) Number of conditions per rule: it also measures the comprehensibility of the extracted rules.
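One way to read the data-dependent measures operationally is sketched below (Python; the dictionary-based rule representation and the function names are our own choices, not part of the original algorithms). A rule stores the interval of each retained attribute and a class label, and the measures are accumulated over an ordered rule list:

import numpy as np

# A rule is a dict: {"bounds": {feature_index: (low, high)}, "label": +1 or -1}.
def rule_fires(rule, x):
    # True when all conditions l_c <= x_c <= u_c of the rule are satisfied by sample x.
    return all(lo <= x[c] <= hi for c, (lo, hi) in rule["bounds"].items())

def evaluate_rules(rules, X, y):
    # Soundness, false-alarm and completeness counts for an ordered rule list.
    stats = [{"soundness": 0, "false_alarm": 0, "completeness": 0} for _ in rules]
    for x, target in zip(X, y):
        claimed = False                       # has an earlier rule already classified x correctly?
        for r, s in zip(rules, stats):
            if not rule_fires(r, x):
                continue
            if r["label"] == target:
                s["soundness"] += 1
                if not claimed:
                    s["completeness"] += 1    # first rule in the ordering to get x right
                    claimed = True
            else:
                s["false_alarm"] += 1
    return stats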
Algorithm 3 (Extracting IF-THEN rules from MK-SVM II).
Input: support vectors $D = \{(\vec{X}_1, y_1), \ldots, (\vec{X}_N, y_N)\}$.
Initialization: $R_o = \{\text{all support vectors}\}$, $R_n = \{\,\}$.
Step 1: Set a small perturbation $\delta > 0$ and the threshold of the soundness measure.
Step 2:
Output: a set of rules $R_n$.
In Step 2 of Algorithm 3, $m(i)$ represents the number of conditions of a support vector.
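As a hedged illustration only (this is one plausible reading, not the authors' exact Step 2; it reuses rule_fires from the earlier sketch, and the corner-selection heuristic, tolerance and names are our assumptions), a candidate rule can be built from each support vector by keeping its nonzero-coefficient attributes, extending each kept attribute toward a corner of the normalized space, and retaining the rule if its soundness passes the threshold:

import numpy as np

def extract_rules(sv_X, sv_y, gamma, decision_fn, X_train, y_train,
                  delta=0.01, soundness_threshold=2):
    # Sketch of a rule-extraction step: the support vector is one vertex of the hypercube;
    # in each kept dimension the bound is extended toward 0 or 1, whichever side a small
    # perturbation delta leaves the predicted class unchanged.
    rules = []
    for x, label, g in zip(sv_X, sv_y, gamma):
        kept = np.nonzero(np.abs(g) > 1e-8)[0]            # attributes with nonzero gamma_{i,d}
        bounds = {}
        for c in kept:
            x_up = x.copy()
            x_up[c] = min(x[c] + delta, 1.0)
            if decision_fn(x_up[None, :])[0] == label:    # extend upward toward the corner at 1
                bounds[c] = (x[c], 1.0)
            else:                                         # otherwise extend downward toward 0
                bounds[c] = (0.0, x[c])
        rule = {"bounds": bounds, "label": label}
        sound = sum(1 for xi, yi in zip(X_train, y_train)
                    if yi == label and rule_fires(rule, xi))
        if sound >= soundness_threshold:
            rules.append(rule)
    return rules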
Because the extracted rules usually overlap, a rule-ordering algorithm is necessary to obtain a rule set that minimizes the number of rules and maximizes the classification accuracy.

Algorithm 4 (Rule reduction and ordering).
Input: $R_n$ = {all extracted rules}, $R_m$ = { }, and the calculated soundness and false-alarm measures of each rule.
Initialization: set the completeness measure of each rule equal to its false-alarm measure.
Output: the rule set $R_m$ for the final decision.
For Algorithms 3 and 4, the soundness, false-alarm and completeness measures of each rule must be computed in the selected feature subspace, so the computational complexity is $O(ln)$, where $l = \sum_{i=1}^{N} m(i)$. Because MK-SVM II has good feature reduction ability, $l$ is usually very small. For SVM, $N$ can be reduced to less than 10% of the number of training samples $n$, so $n$ is the most important factor affecting the computational cost.
We now compare the computational complexity of rule extraction from MK-SVM II with that of some other SVM-based rule-extraction techniques. RuleExSVM consists of three phases; the second phase has the highest computational complexity, $O(mnNt)$, where $m$ represents the dimension of the feature space and $t$ the number of search iterations. This is evidently higher than the computational complexity of MK-SVM. In the SVM + prototypes method, the K-means clustering has computational complexity $O(mnkt)$, where $k$ represents the number of classes and $t$ the number of iterations for tuning the centroids. Besides, an incremental scheme is used in the SVM + prototypes method to tune the extracted rules, so the SVM + prototypes method also has a high computational complexity.
2.5. Methodology

The MK-SVM-based information system consists of the following three phases: feature selection, predictive modeling and rule extraction.
(1) Gene selection: Many wrapper approaches are computationally expensive, so a filter algorithm is often used as a preprocessing tool to reduce the redundant genes [3,6]. This is not necessary for MK-SVM I. An iteration of two simple convex optimization problems needs to be implemented in the training of MK-SVM I, and they can be easily solved by standard optimizers. The selected gene group is composed of the genes corresponding to nonzero feature coefficients $\beta_d$. The number of genes in the group is controlled by the regularization parameter $\lambda$.
(2) Predictive modeling: MK-SVM II can also be used to select relevant genes; however, too many variables would then need to be optimized, which increases the computational burden. It is therefore a good choice for MK-SVM II to use the genes selected by MK-SVM I in the previous step to make predictions and extract rules.
(3) Rule extraction: After the training of MK-SVM II, Algorithm 3 is carried out to extract rules. In each support vector, the number of attributes with nonzero mixture coefficients is different, which results in a different number of premises in each rule. Algorithm 4 is then used to obtain a rule set that contains the minimum number of rules and has maximum classification accuracy.
3. Experiments

Two datasets, the ALL-AML leukemia dataset and the colon tumor dataset, are used here to evaluate the performance of the MK-SVM scheme.
(1) ALL-AML leukemia dataset: This dataset is a collection of expression measurements reported by Golub et al. [1]. It is already divided into a training dataset and a testing dataset. The training dataset consists of 38 bone marrow samples (27 ALL and 11 AML) over 7129 probes from 6817 human genes. The testing dataset provides 34 samples, with 20 ALL and 14 AML.
(2) Colon tumor dataset: This dataset is a collection of expression measurements from colon biopsy samples reported by Alon et al. [60]. It contains 62 samples, of which 40 are from tumors (labeled as negative) and 22 are from healthy parts of the colons of the same patients (labeled as positive). Two thousand out of around 6500 genes were selected based on the confidence in the measured expression levels. We randomly split the data into 31 samples for training and 31 samples for testing.
It is important to conduct the cross-validation around the feature selection, because using a fixed set of features selected with the whole training dataset induces a bias in the results [61]. Instead, one should withhold a pattern, select features, and assess the performance of the classifier with the selected features on the left-out examples. The results reported in this section were obtained on the independent testing set and with a tenfold cross-validation on the training set.
The grid search algorithm is a common method for finding the best hyperparameters of SVM; for details of the grid search algorithm, see [25,53]. Our implementation was carried out in the Matlab 6.5 development environment, and its optimization toolbox was used to solve the mathematical programs in this paper.
A simple normalization method is used to make it easy to tune the free parameters:

$$x_{i,d} = \frac{x_{i,d} - \min(\vec{x}_i)}{\max(\vec{x}_i) - \min(\vec{x}_i)} \qquad (45)$$
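Read literally, Eq. (45) rescales each sample vector $\vec{x}_i$ by its own minimum and maximum; a minimal sketch of that reading (Python/NumPy, a helper of our own, not the Matlab code used in the experiments) is:

import numpy as np

def normalize(X):
    # Min-max scaling of Eq. (45), read here as row-wise (per-sample) scaling to [0, 1].
    mins = X.min(axis=1, keepdims=True)
    maxs = X.max(axis=1, keepdims=True)
    return (X - mins) / (maxs - mins)

A per-gene (column-wise) variant would use axis=0 instead.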
3.1. ALL-AML leukemia dataset

The regularization parameter $\lambda$ in MK-SVM I controls the sparsity of the feature coefficients $\beta_d$, so it plays an important role in feature selection. In theory, a small regularization parameter $\lambda$ puts more weight on the first term of Eq. (16) and leads to fewer nonzero feature coefficients $\beta_d$. Here a grid search algorithm is used and $\lambda$ is set to {0, 10, 20, ..., 1000}. Table 1 gives the number of selected genes when the regularization parameter $\lambda$ and the kernel parameter $\sigma^2$ are varied. It can be seen from Table 1 that the higher $\lambda$ is, the higher the number of selected genes. Table 2 gives the selected genes for varying $\lambda$ and fixed $\sigma^2$. When $\lambda$ is smaller than some threshold, no genes are selected; as $\lambda$ increases, more and more genes enter the selected gene subset. Considering the high connectivity of the gene network, we are most interested in the minimum number of genes that have strong explanatory ability to characterize the cancer types. Rule extraction will give more information about the discovered relevant genes.
MK-SVM II, making use of the selected genes, is then used to predict the types of the samples. We select the gene subsets that have the highest classification accuracy on the training set under a tenfold cross-validation; Table 3 shows the selected subsets and their performance on the testing set.
Table 1  Variation of the number of selected genes with the choice of λ and σ²

λ      σ² = 0.05   σ² = 0.1   σ² = 0.5
40     4           5          7
50     4           7          8
60     4           9          10
70     4           9          12
80     7           9          14
90     7           11         15
100    8           11         16
The classification performance of MK-SVM, compared with that of some other well-known feature selection and classification methods [61-63], is shown in Table 4. Note that all the experimental results are obtained on the independent testing set (34 samples). The emerging patterns approach is a method for finding good gene subsets and discriminating the class of samples [62]. ARAM is a predictive neural network known as the adaptive resonance associative map [63]; the Fisher and entropy criteria are used as filter tools for its feature selection. IB1 (IB) is a case-based nearest neighbor (K-NN) classifier. The naive Bayes (NB) rule uses Bayes' theorem to predict the class of each sample. C4.5 represents a classification model based on a decision tree. SF represents a sequential forward search strategy for feature selection. FCBF is a fast correlation-based filter algorithm. BIRS (best incremental ranked subset) is a gene ranking algorithm [61]. It can be seen that the classification accuracy of MK-SVM is slightly lower than that of the FCBF-based NB and IB classifiers and higher than that of the other approaches. However, the number of genes selected by MK-SVM is far smaller than that used by the FCBF-based NB and IB classifiers.
Table 2  Gene selection using MK-SVM I with different λ (σ² = 0.1)

λ      Selected genes
30     None
40     D49950, M55150, U50136, X17042, X95735
50     D49950, M55150, U46499, U50136, X04085, X17042, X95735
60     D49950, M23197, U46499, U50136, U82759, X04085, X17042, X95735, U22376
70     D49950, M55150, U46499, U50136, U82759, X04085, X17042, X95735, U22376
80     D49950, HG2855, M55150, U50136, U82759, X04085, X17042, X95735, U22376
90     D49950, HG2855, M55150, M81933, U50136, U63289, U70867, U82759, X04085, X95735, U22376
100    D49950, HG2855, M55150, M81933, U50136, U63289, U70867, U82759, X04085, X95735, U22376
Table 3  Gene subsets having the highest classification accuracy on the training set and their performance on the testing set

σ²     λ    Accuracy   Selected genes
0.01   80   32/34      J04615, M27891, M92287, U32944, U46499, M31523, X95735, U22376, M31523
0.05   80   31/34      M23197, M55150, U12471, U50136, X95735, U22376
0.1    60   30/34      D49950, M23197, U46499, U50136, U82759, X04085, X17042, X95735, U22376
0.4    40   31/34      D49950, U50136, U82759, X17042, X59417, Y12670
0.5    50   30/34      D49950, M23197, U50136, U63289, U82759, X17042, X59417, Y12670
Table 4  Classification performance of MK-SVM on the ALL/AML leukemia dataset, compared with some other methods

Feature selection    Classifier         Number of features   Accuracy (%)
MK-SVM I             MK-SVM II          9                    94.12
-                    SVM                25-1000              88.24-94.12
-                    Clustering         50                   85.29
Emerging pattern     Emerging pattern   1                    91.18
Fisher               ARAM               10                   90
Entropy              ARAM               10                   94.12
BIRS                 NB                 2.5                  93.04
BIRS                 IB                 3.3                  93.04
BIRS                 C4.5               1.2                  88.57
SF                   NB                 3.2                  87.32
SF                   IB                 2.3                  88.93
SF                   C4.5               1.6                  87.32
FCBF                 NB                 45.80                95.89
FCBF                 IB                 45.80                94.46
FCBF                 C4.5               45.80                83.21
In the last step, samples with the selected gene subsets (shown in Table 3) are used as the inputs to train MK-SVM II, and rules are then extracted using Algorithms 3 and 4. When the first gene subset in Table 3 (λ = 80) and the third gene subset (λ = 60) are used, the extracted rules and their performance on the independent testing set are as shown in Table 5. Of the original nine genes, only one marker gene, U46499, is used in the rule set. Rule 1 tells us that if the expression level of U46499 is down-regulated (a ≤ 146), the corresponding samples are ALL leukemia tissues. Similarly, if the expression level of U46499 is up-regulated (a ≥ 288), the corresponding samples are AML leukemia tissues. The performance of these two rules is promising: there are no misclassifications, and 31 samples are covered by the two rules on the independent testing set.
Table 6 shows the experimental results using the second gene subset in Table 3 (λ = 80). A new marker gene, X95735, is included in the rule set. According to rule 1, 18 samples are correctly classified as ALL leukemia tissues and one sample is misclassified on the independent testing set. According to rule 2, 13 samples are correctly classified as AML leukemia tissues and two samples are misclassified on the independent testing set.
Similarly, Table 7 shows the experimental results using the fourth gene subset in Table 3 (λ = 40). Table 7 contains nine rules involving five marker genes. Two samples are incorrectly predicted and one sample is not covered by the rules on the testing set. Table 8 shows the experimental results using the fifth gene subset in Table 3 (λ = 50). Two rules with one marker gene, U63289, are extracted. Only one sample is incorrectly predicted, and 31 of the 34 testing samples are covered by the rules.
The experimental results of rule extraction indicate that:
(1) The extracted rules have good comprehensibility. Few rules, with few premises, are extracted from MK-SVM II.
(2) The extracted rules have good generalization performance. Most rule sets achieve better classification accuracy than the MK-SVM II classifier.
(3) The extracted rules have good explanation capacity. Through rule extraction, more compact gene subsets are found. More importantly, multiple good diagnostic gene groups (gene subsets) are indicative of possible pathways of the genetic network.
Table 5  Rules extracted from MK-SVM II using the first and third of the selected optimal gene subsets

No.   Rule body                a      Class   False-alarm (testing set)   Soundness (testing set)
1     U46499 down-regulated    146    ALL     0                           18
2     U46499 up-regulated      288    AML     0                           13
Overall                                       0/34                        31/34
Table 6  Rules extracted from MK-SVM II using the second of the selected optimal gene subsets

No.   Rule body                a      Class   False-alarm (testing set)   Soundness (testing set)
1     X95735 down-regulated    938    ALL     1                           18
2     X95735 up-regulated      1050   AML     2                           13
Overall                                       3/34                        31/34
Table 7  Rules extracted from MK-SVM II using the fourth of the selected optimal gene subsets

No.   Rule body                a      Class   False-alarm (testing set)   Soundness (testing set)
1     U22376 up-regulated      3309   ALL     0                           7
2     D49950 down-regulated    54     ALL     0                           7
3     U82759 down-regulated    99     ALL     0                           11
4     U50136 up-regulated      2137   AML     0                           7
5     D49950 up-regulated      259    AML     0                           3
6     U82759 up-regulated      805    AML     0                           2
7     X17042 up-regulated      8749   AML     0                           2
8     X17042 down-regulated    2208   ALL     1                           15
9     U22376 down-regulated    880    AML     1                           7
Overall                                       2/34                        33/34
3.2. Colon tumor dataset

First, MK-SVM I is used to select relevant genes. We use a tenfold cross-validation on the training set to tune the free parameters of MK-SVM I (σ², λ) and select the gene subsets that have the highest classification accuracy. The selected gene subsets are then used as the attributes to train MK-SVM II and to classify the samples. Table 9 lists the gene subsets identified by MK-SVM I with the optimal kernel parameters σ² and regularization parameters λ, together with their performance on the testing set.
The performance of MK-SVM on the colon tumor dataset is compared with that of some of the other approaches [61-63] briefly described in the last section, and the results are shown in Table 10. It can be seen that the classification accuracy of MK-SVM is slightly lower than that of the entropy-based ARAM and higher than that of the other approaches. However, the number of genes selected by MK-SVM is far smaller than that used by the entropy-based ARAM.
The extracted rules and their soundness and false-alarm measures are shown in Table 11. Table 11 contains six rules involving three marker genes. One sample is incorrectly predicted and three samples are not covered by the rules on the testing set.
Table 8  Rules extracted from MK-SVM II using the fifth of the selected optimal gene subsets

No.   Rule body                a      Class   False-alarm (testing set)   Soundness (testing set)
1     U63289 down-regulated    163    ALL     1                           18
2     U63289 up-regulated      208    AML     0                           13
Overall                                       1/34                        31/34
Table 9  Gene subsets having the highest classification accuracy on the training set and their performance on the testing set

σ²    λ    Accuracy   Selected gene subset
0.2   10   28/31      T71025, T94579, H64489, Z50753, M76378, R88740, R65697, M80815, T62947, X75208, J02854, D00860, M82919, M81651, H08393, H55916
0.5   5    27/31      T79152, H64489, Z50753, M76378, R65697, J02854, K02268, M82919, M81651, H08393, H55916
0.8   5    29/31      T79152, H64489, Z50753, M76378, J03210, R65697, J02854, K02268, M81651, H08393, H55916
Table 10  Classification performance of MK-SVM and some other methods

Feature selection    Classifier         Number of features   Accuracy (%)
MK-SVM I             MK-SVM II          11                   93.55
-                    SVM                2000                 90.32
-                    Clustering         2000                 87.10
Emerging pattern     Emerging pattern   35                   91.94
Fisher               ARAM               25                   86.94
Entropy              ARAM               135                  96.13
BIRS                 NB                 3.5                  85.48
BIRS                 IB                 6.3                  79.76
BIRS                 C4.5               2.9                  83.81
SF                   NB                 5.9                  84.05
SF                   IB                 4.8                  66.67
SF                   C4.5               3.3                  80.71
FCBF                 NB                 14.60                77.62
FCBF                 IB                 14.60                80.71
FCBF                 C4.5               14.60                88.73

Table 11  Rules extracted from the colon tumor dataset

No.   Rule body                                           a              Class   False-alarm (testing)   Soundness (testing)
1     H08393 up-regulated                                 129.9575       Neg.    0                       8
2     M76378 up-regulated                                 935.2          Pos.    0                       2
3     J02854 up-regulated                                 1235.5         Pos.    1                       4
4     H08393 down-regulated                               44.3212        Pos.    0                       2
5     M76378 down-regulated                               306.0          Neg.    2                       16
6     M76378 down-regulated AND J02854 down-regulated     749.4, 413.4   Neg.    2                       17
Overall                                                                          1/31                    28/31
4. Conclusions

In this paper, an MK-SVM-based feature selection and rule extraction scheme is proposed to discover knowledge from gene expression data. This scheme consists of three phases: feature selection, predictive modeling and rule extraction.
Unlike traditional combinatorial search methods, feature subset selection is translated into the model selection problem of SVM, which has been well studied. An iteration of two simple convex optimization problems needs to be implemented, and they can be easily solved by standard optimizers. We conduct experiments to evaluate the classification performance of this method on two open datasets, the ALL-AML leukemia dataset and the colon tumor dataset. Compared with some popular methods, MK-SVM achieves good classification accuracy using the identified gene subsets.
We propose a novel rule extraction approach for SVM that takes advantage of the information provided by the separating hyperplane and the support vectors. Compared with other approaches, the rules obtained by the present approach have good generalization capacity and comprehensibility. The experimental results illustrate that the rule-based classification achieves promising generalization performance on the testing set using a few rules extracted from MK-SVM II. Besides, the computational cost of this approach is low.
Because the gene expression levels reported by different techniques are not comparable with each other, we introduce the concepts of "up-regulated" and "down-regulated" and obtain rules with linguistic descriptions. As further work, we will fuse the results on multiple heterogeneous datasets to improve the classifier's performance. Extracting fuzzy rules from SVM is another interesting direction.
Rule extraction is helpful for identifying new and more complex phenotypes of cancer [3]. Moreover, we can provide more transparent explanations of the contribution of each selected gene and of the interaction of the selected genes by investigating the rules further. It is clear that the rules can also illustrate whether or not random gene subsets give good classification performance. All of the above ideas will be covered in another paper with detailed descriptions and analyses.
Appendix A. Primal/dual problem of linear programming

Let us start with the following problem:

$$\min\; J(\vec{\beta}, \vec{\xi}) = \sum_{d=1}^{m} \beta_d + \lambda \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.}\quad y_i\left( \sum_{d=1}^{m} \beta_d \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) + b \right) \ge 1 - \xi_i; \qquad \xi_i \ge 0,\;\; i = 1, \ldots, n; \qquad \beta_d \ge 0,\;\; d = 1, \ldots, m$$

The Lagrangian for this problem is

$$L_p(\vec{\beta}, b, \vec{\xi}) = J(\vec{\beta}, \vec{\xi}) - \sum_{i=1}^{n} u_i \left[ y_i\left( \sum_{d=1}^{m} \beta_d \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) + b \right) + \xi_i - 1 \right] - \sum_{i=1}^{n} \eta_i \xi_i - \sum_{k=1}^{m} \nu_k \beta_k$$

where $u_i \ge 0$, $\eta_i \ge 0$ and $\nu_k \ge 0$ denote Lagrange multipliers.
Setting the derivatives with respect to the primal variables to zero gives:

$$\frac{\partial L_p}{\partial \beta_k} = 0 \;\Rightarrow\; \nu_k = 1 - \sum_{i=1}^{n} u_i y_i \left( \sum_{j=1}^{n} \alpha_j y_j k(x_{i,k}, x_{j,k}) \right)$$

$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i u_i = 0$$

$$\frac{\partial L_p}{\partial \xi_i} = 0 \;\Rightarrow\; u_i = \lambda - \eta_i,\quad i = 1, \ldots, n$$

Plugging these three equations back in, we obtain the dual problem:

$$\max\; \sum_{i=1}^{n} u_i$$

$$\text{s.t.}\quad \sum_{i=1}^{n} u_i y_i \left( \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) \right) \le 1,\;\; d = 1, \ldots, m; \qquad \sum_{i=1}^{n} u_i y_i = 0; \qquad 0 \le u_i \le \lambda,\;\; i = 1, \ldots, n$$
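As a quick numerical sanity check of this derivation (not part of the paper; random surrogate values $S_{i,d}$ stand in for $\sum_j \alpha_j y_j k(x_{i,d}, x_{j,d})$, and the SciPy solver choice is ours), the optimal values of the primal LP (16)-(17) and the dual LP (18)-(19) can be compared directly:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, lam = 12, 4, 1.0
y = rng.choice([-1.0, 1.0], size=n)
S = rng.normal(size=(n, m))                                    # surrogate for sum_j alpha_j y_j k(x_{i,d}, x_{j,d})

# Primal LP: variables [beta (m), b (1), xi (n)], minimize sum(beta) + lam * sum(xi).
c_p = np.concatenate([np.ones(m), [0.0], lam * np.ones(n)])
A_p = np.hstack([-y[:, None] * S, -y[:, None], -np.eye(n)])    # -y_i (S_i . beta + b) - xi_i <= -1
primal = linprog(c_p, A_ub=A_p, b_ub=-np.ones(n),
                 bounds=[(0, None)] * m + [(None, None)] + [(0, None)] * n, method="highs")

# Dual LP: variables u (n), maximize sum(u) <=> minimize -sum(u).
A_d = (y[:, None] * S).T                                       # sum_i u_i y_i S[i, d] <= 1 for each d
dual = linprog(-np.ones(n), A_ub=A_d, b_ub=np.ones(m),
               A_eq=y[None, :], b_eq=[0.0],
               bounds=[(0, lam)] * n, method="highs")

print(primal.fun, -dual.fun)   # the two optimal values agree up to solver tolerance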
References
[1] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531-7.
[2] van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart A, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530-6.
[3] Matthias EF, Anthony R, Nikola K. Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue. Artif Intell Med 2003;28:165-89.
[4] Yeoh E, Ross ME, Shurtleff SA. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002;1:133-43.
[5] Slonim K. From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002;32:502-8.
[6] Tung WL, Quek C. GenSo-FDSS: a neural-fuzzy decision support system for pediatric ALL cancer subtype identification using gene expression data. Artif Intell Med 2005;33:61-88.
[7] Sacchi L, Bellazzi R, Larizza C, Magni P, Curk T, Petrovic U, et al. TA-clustering: cluster analysis of gene expression profiles through temporal abstractions. Int J Med Inform 2005;74:505-17.
[8] Hautaniemi S, Yli-Harja O, Astola J, Kauraniemi P, Kallioniemi A, Wolf M, et al. Analysis and visualization of gene expression microarray data in human cancer using self-organizing maps. Mach Learn 2003;52:45-66.
[9] Tomida S, Hanai T, Honda H, Kobayashi T. Analysis of expression profile using fuzzy adaptive resonance theory. Bioinformatics 2002;18:1073-83.
[10] Takahashi H, Kobayashi T, Honda H. Construction of robust prognostic predictors by using projective adaptive resonance theory as a gene filtering method. Bioinformatics 2005;2:179-86.
[11] Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001;17:1131-42.
[12] Xu Y, Selaru FM, Yin J, Zou TT, Shustova V, Mori Y, et al. Artificial neural networks and gene filtering distinguish between global gene expression profiles of Barrett's esophagus and esophageal cancer. Cancer Res 2002;62:3493-7.
[13] Ando T, Suguro M, Kobayashi T, Seto M, Honda H. Selection of causal gene sets for lymphoma prognostication from expression profiling and construction of prognostic fuzzy neural network models. J Biosci Bioeng 2003;96:161-7.
[14] Takahashi H, Masuda K, Ando T, Kobayashi T, Honda H. Prognostic predictor with multiple fuzzy neural models using expression profiles from DNA microarray for metastases of breast cancer. J Biosci Bioeng 2004;98:193-9.
[15] Roth V, Lange T. Bayesian class discovery in microarray datasets. IEEE Trans Biomed Eng 2004;51:707-18.
[16] Sevon P, Toivonen H, Ollikainen V. TreeDT: tree pattern mining for gene mapping. IEEE/ACM Trans Comput Biol Bioinform 2006;3:174-85.
[17] Mao Y, Zhou X, Pi D, Sun Y, Wong S. Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection. J Biomed Biotechnol 2005;2:160-71.
[18] Wu W, Liu X, Xu M, Peng J, Rudy S. A hybrid SOM-SVM method for analyzing zebrafish gene expression. In: Kittler J, Petrou M, Nixon M, editors. Proceedings of the 17th international conference on pattern recognition. Los Alamitos, California, United States: IEEE Computer Society; 2004. p. 3236.
[19] Isabella G, Jason W, Stephen B, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn 2002;46:389-422.
[20] Dettling M, Buhlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 2003;19:1061-9.
[21] Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003;52:91-118.
[22] Hong JH, Cho SB. The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming. Artif Intell Med 2006;36:43-58.
[23] Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
[24] Paul TK, Iba H. Selection of the most useful subset of genes for gene expression-based classification. In: Paul TK, editor. Proceedings of the 2004 congress on evolutionary computation. 2004. p. 2076-83.
[25] Huang CL, Wei CJ. GA-based feature selection and feature optimization for support vector machine. Expert Syst Appl 2006;31:231-40.
[26] Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell 1997;97:245-71.
[27] Zhu ZX, Ong YS, Dash M. Wrapper-filter feature selection algorithm using a memetic framework. IEEE Trans Syst Man Cybern B Cybern 2007;37:70-6.
[28] Yang J, Honavar V. Feature subset selection using a genetic algorithm. In: Feature extraction, construction and subset selection: a data mining perspective. New York: Kluwer; 1998.
[29] Inza I, Larranaga P, Blanco R, Cerrolaza AJ. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004;31:91-103.
[30] Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S. RankGene: identification of diagnostic genes based on expression data. Bioinformatics 2003;19:1578-9.
[31] Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273-324.
[32] Hardin D, Tsamardinos I, Aliferis CF. A theoretical characterization of linear SVM-based feature selection. In: Russell G, Dale S, editors. Proceedings of the 21st international conference on machine learning. New York: ACM; 2004. p. 377-84.
[33] Mao KZ. Feature subset selection for support vector machines through discriminative function pruning analysis. IEEE Trans Syst Man Cybern B Cybern 2004;34:60-7.
[34] Cao L, Seng CK, Gu Q, Lee HP. Saliency analysis of support vector machines for gene selection in tissue classification. Neural Comput Appl 2003;11:244-9.
[35] Duan K, Rajapakse JC, Wang H, Azuaje F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobiosci 2005;4:228-34.
[36] Gregory P, Pablo T. Microarray data mining: facing the challenges. ACM SIGKDD Explor Newslett 2003;5:1-5.
[37] Roy A. On connectionism, rule extraction, and brain-like learning. IEEE Trans Fuzzy Syst 2000;8:222-7.
[38] Gupta A, Park S, Lam SM. Generalized analytic rule extraction for feedforward neural networks. IEEE Trans Knowl Data Eng 1999;11:985-91.
[39] Kasabov NK. On-line learning, reasoning, rule extraction and aggregation in locally optimized evolving fuzzy neural networks. Neurocomputing 2001;41:25-45.
[40] Taha IA, Ghosh J. Symbolic interpretation of artificial neural networks. IEEE Trans Knowl Data Eng 1999;11:448-63.
[41] Nunez H, Angulo C, Catala A. Rule extraction from support vector machines. 2002 [online]. http://www.dice.ucl.ac.be/esann/proceedings/papers.php?ann=2002#ES2002-51 (accessed July 19, 2007).
[42] Fung G, Sandilya S, Rao R. Rule extraction from linear support vector machines. In: Grossman RL, Bayardo R, Bennett K, Vaidya J, editors. Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2005. p. 32-40.
[43] Fu XJ, Ong CJ, Keerthi S, Hung GG, Goh LP. Extracting the knowledge embedded in support vector machines. In: Grossman RL, Bayardo R, Bennett K, Vaidya J, editors. Proceedings of the 2004 IEEE international joint conference on neural networks. 2004. p. 291-6.
[44] He J, Hu HJ, Harrison R, Tai PC, Pan Y. Rule generation for protein secondary structure prediction with support vector machines and decision tree. IEEE Trans Nanobiosci 2006;5:46-53.
[45] Barakat N, Diederich J. Eclectic rule extraction from support vector machines. Int J Comput Intell 2005;2:59-62.
[46] Micchelli CA, Pontil M. Learning the kernel function via regularization. J Mach Learn Res 2005;6:1099-125.
[47] Lanckriet GRG, Cristianini N, Bartlett P, El Ghaoui L, Jordan MI. Learning the kernel matrix with semidefinite programming. J Mach Learn Res 2004;5:27-72.
[48] Bach FR, Lanckriet GRG, Jordan MI. Multiple kernel learning, conic duality and the SMO algorithm. In: Russell G, Dale S, editors. Proceedings of the 21st international conference on machine learning. New York: ACM; 2004. p. 418.
[49] Sonnenburg S, Ratsch G, Schafer C. Learning interpretable SVMs for biological sequence classification. In: Miyano S, Mesirov J, Kasif S, Pevzner P, Waterman M, editors. Proceedings of the 9th annual international conference on research in computational molecular biology. Berlin: Springer; 2005. p. 389-407.
[50] Gunn SR, Kandola JS. Structural modeling with sparse kernels. Mach Learn 2002;48:137-63.
[51] Chapelle O, Vapnik V, Bousquet O, Mukherjee S. Choosing multiple parameters for support vector machines. Mach Learn 2002;46:131-59.
[52] Friedrichs F, Igel C. Evolutionary tuning of multiple SVM parameters. Neurocomputing 2005;64:107-17.
[53] Duan K, Keerthi SS, Poo AN. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 2003;51:41-59.
[54] Keerthi SS. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Trans Neural Netw 2002;13:1225-9.
[55] Demiriz A, Bennett KP, Shawe-Taylor J. Linear programming boosting via column generation. Mach Learn 2002;46:225-54.
[56] Wu Q, Zhou DX. SVM soft margin classifiers: linear programming versus quadratic programming. Neural Comput 2005;17:1160-87.
[57] Ikeda K, Murata N. Geometrical properties of Nu support vector machines with different norms. Neural Comput 2005;17:2508-29.
[58] Bi JB, Fung G, Dundar M, Rao B. Semi-supervised mixture of kernels via LPBoost methods. In: Han J, Wah BW, Raghavan V, Wu X, Rastogi R, editors. Proceedings of the 5th IEEE international conference on data mining. Los Alamitos, California, United States: IEEE Computer Society; 2005. p. 569-72.
[59] Wang H, Huang D. Regulation probability method for gene selection. Pattern Recogn Lett 2006;27:116-22.
[60] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999;96:6745-50.
[61] Ruiz R, Riquelme JC, Aguilar-Ruiz JS. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recogn 2006;39:2383-92.
[62] Li J, Wong L. Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics 2002;18:725-34.
[63] Tan AH, Pan H. Predictive neural networks for gene expression data analysis. Neural Netw 2005;18:297-306.
