Computational Statistics and Data Analysis 55 (2011) 1897–1908

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis

journal homepage: www.elsevier.com/locate/csda


Gene selection and prediction for cancer classification using support vector machines with a reject option

Hosik Choi a, Donghwa Yeo b, Sunghoon Kwon c, Yongdai Kim c,*

a Department of Informational Statistics, Hoseo University, Asan, Chungnam 336-795, Republic of Korea

b Institute of Mathematical Sciences, Ewha University, Seoul 120-750, Republic of Korea

c Department of Statistics, Seoul National University, Seoul 151-742, Republic of Korea

Article info

Article history:

Received 23 August 2009
Received in revised form 16 August 2010
Accepted 1 December 2010
Available online 9 December 2010

Keywords:

Classification; Reject option; Support vector machines; Lasso

Abstract

In cancer classification based on gene expression data, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning in cases of observations that are difficult to classify. In this paper, we consider a problem of gene selection with a reject option. Typically, gene expression data comprise expression levels of several thousands of candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. We propose a machine learning approach in which we apply the $l_1$ penalty to the SVM with a reject option. This method is referred to as the $l_1$ SVM with a reject option. We develop a novel optimization algorithm for this SVM, which is sufficiently fast and stable to analyze gene expression data. The proposed algorithm realizes an entire solution path with respect to the regularization parameter. Results of numerical studies show that, in comparison with the standard $l_1$ SVM, the proposed method efficiently reduces prediction errors without hampering gene selectivity. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

Statistical and machine learning approaches are widely used to construct predictive models that distinguish cancer patients from normal ones based on gene expression data. Such approaches include the support vector machine (SVM; Terrence et al., 2000; Guyon et al., 2002; Ambroise and McLachlan, 2002), logistic regression (Shevade and Keerthi, 2003; Liao and Chin, 2007), and boosting (Hong et al., 2005). The standard approach is to classify all future observations. In some cases, however, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning for observations that are difficult to classify. Many empirical studies in the engineering community support the conjecture that the use of a reject option effectively reduces misclassification error rates; for examples, refer to Chow (1970), Tortorella (2000), and Lendgrebe et al. (2006), and see also the references in McLachlan (1992). Recently, Bartlett and Wegkamp (2008) proposed a learning algorithm with a reject option based on the SVM, referred to as the $l_2$ SVM with a reject option, and studied its theoretical properties.

* Corresponding author. Tel.: +82 2 880 9091; fax: +82 2 883 6144. E-mail addresses: choi.hosik@gmail.com (H. Choi), ydkim0903@gmail.com (Y. Kim).

0167-9473/$ – see front matter © 2010 Elsevier B.V. All rights reserved.


H. Choi et al. / Computational Statistics and Data Analysis 55 (2011) 1897–1908

Typically, gene expression data consist of expression levels of several thousands of candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. For literature on gene selection, refer to Guyon et al. (2002). In this paper, we propose a machine learning approach for gene selection with a reject option; in this approach, the $l_1$ penalty is applied to the SVM with a reject option. The proposed method is referred to as the $l_1$ SVM with a reject option. The $l_1$ penalty is widely used for variable selection in several contexts (Tibshirani, 1996; Shevade and Keerthi, 2003; Zhu et al., 2004) since it provides a sparse solution (i.e., some estimated coefficients are exactly zero). For the SVM with a reject option, Wegkamp (2007) proved that the $l_1$ penalty provides desirable large-sample properties. However, a crucial difficulty in using the $l_1$ penalty with gene expression data is computation, since the $l_1$ penalty is not differentiable. Various efficient optimization algorithms for $l_1$-penalized problems, such as the fixed-point algorithm and block coordinate descent algorithm, have been proposed by Shevade and Keerthi (2003), Park and Hastie (2007), Wu and Lange (2008), Genkin et al. (2007), Koh et al. (2007), Balakrishnan and Madigan (2008), Hale et al. (2008), Meier et al. (2008), Rosset and Zhu (2007) and Kim et al. (2008). These algorithms, however, are not applicable to the $l_1$ SVM with a reject option, since the loss function as well as the penalty is not differentiable. The optimization problem for the $l_1$ SVM with a reject option is similar to that of the standard $l_1$ SVM (Zhu et al., 2004), although the former is much more difficult because there are multiple nondifferentiable points in the loss function of the $l_1$ SVM with a reject option.

In this paper, we develop a novel optimization algorithm for the $l_1$ SVM with a reject option, which is sufficiently fast and stable to analyze gene expression data. In particular, the proposed algorithm provides an entire solution path with respect to the regularization parameter. The paper is organized as follows. In Section 2, we review the SVM and $l_1$ SVM with a reject option. A novel optimization algorithm for the $l_1$ SVM with a reject option is developed in Section 3. Numerical results on simulated data, four publicly available gene expression data sets, and a mass spectrometry protein data set are presented in Section 4. Concluding remarks follow in Section 5.

2. Methodology

2.1. SVM with a reject option: review

Let $(x_1, y_1), \ldots, (x_n, y_n)$ be the input–output pairs of given data, where $x_i \in R^p$ is a vector of gene expression levels and $y_i \in \{-1, 1\}$ denotes a class label ($-1$ for normal and $1$ for cancer). We assume that the data are $n$ independent copies of a random vector $(X, Y)$. Traditional learning algorithms try to find the optimal hyperplane that minimizes the misclassification error rate $E(I(Yf(X) < 0))$ among all linear hyperplanes $f(x) = \beta_0 + x^\top\beta$, $\beta_0 \in R$, $\beta \in R^p$. The optimal hyperplane can be estimated by minimizing a regularized empirical risk given as

$$\sum_{i=1}^{n} \phi\big(y_i(\beta_0 + x_i^\top\beta)\big) + J_\lambda(\beta),$$

where $\phi(z)$ is a convex surrogate loss function of the 0-1 loss $I(z < 0)$ and $J_\lambda$ is a penalty function controlling the misclassification error and classifier's complexity. Various surrogate loss functions $\phi(z)$ correspond to various learning algorithms such as logistic regression, boosting, and SVM (Hastie et al., 2001).

A learning algorithm with a reject option is used to construct a classifier $t : R^p \to \{-1, 1, ⓡ\}$, where the interpretation of the output ⓡ is that of being in doubt and hence of deferring the decision. Herbei and Wegkamp (2006) introduced the misclassification error rate of a classifier with a reject option as

$$L_d(t) = d\,\Pr(t(X) = ⓡ) + \Pr(t(X) \neq Y,\ t(X) \neq ⓡ),$$

where $d \in [0, 1/2)$ is a cost of a reject option. For a given real valued function $f(x)$ and $\delta > 0$, Bartlett and Wegkamp (2008) proposed a method for constructing a classifier with a reject option $t_\delta^f(x)$ by

$$t_\delta^f(x) = \mathrm{sign}\,f(x)\, I(|f(x)| > \delta) + ⓡ\, I(|f(x)| \le \delta).$$

Then, the misclassification error rate of $t_\delta^f$ becomes $E(l_{d,\delta}(Yf(X)))$, where

$$l_{d,\delta}(z) = \begin{cases} 1 & \text{if } z < -\delta, \\ d & \text{if } |z| \le \delta, \\ 0 & \text{otherwise.} \end{cases}$$

Fig. 1(a) and (b) shows a comparison between the 0-1 loss and 0-1 loss with a reject option ($l_{d,\delta}$). Bartlett and Wegkamp (2008) proposed the SVM with a reject option by replacing $l_{d,\delta}$ with a convex surrogate loss and applying the $l_2$ norm of the coefficients as a penalty function. In this approach, the SVM with a reject option estimates a classifier by minimizing the following regularized empirical risk:

$$\sum_{i=1}^{n} \phi_d\big(y_i(\beta_0 + x_i^\top\beta)\big) + \lambda \|\beta\|_2^2, \tag{1}$$

where $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ and the surrogate loss function $\phi_d$, the hinge loss with a reject option, is expressed as

$$\phi_d(z) = \begin{cases} 1 - \dfrac{1-d}{d}\, z & \text{if } z < 0, \\ 1 - z & \text{if } 0 \le z < 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Bartlett and Wegkamp (2008) showed that $\phi_d$ is a reasonable surrogate loss for $l_{d,\delta}$ by proving the Fisher consistency. Furthermore, they noted that the optimization problem (1) can be solved by quadratic programming. Fig. 2 shows a comparison between $l_{d,\delta}$, $\phi_d$, and the hinge loss $\phi_H(z) = (1 - z)_+$, which is a surrogate loss for the SVM; $(z)_+ = \max\{z, 0\}$. It should be noted that $\phi_d$ is piecewise linear, like the hinge loss. However, there are two nondifferentiable points in $\phi_d$, one at $z = 1$ and the other at $z = 0$, while there is only one nondifferentiable point, at $z = 1$, in the hinge loss.

Fig. 1. The 0-1 loss and 0-1 loss with a reject option. (a) 0-1 loss. (b) 0-1 loss with a reject option.

Fig. 2. Comparison between the hinge loss $\phi_d$ with a reject option when $d = 0.2$ and the other loss functions.
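The losses and the decision rule reviewed in this section are simple enough to write down directly. The following Python sketch is illustrative only and is not code from the paper: the function names and the string used for the reject label are assumptions, and the sketch assumes $0 < d < 1/2$ so that $(1-d)/d$ is defined.

```python
REJECT = "reject"  # hypothetical stand-in for the circled-r reject label


def zero_one_reject_loss(z, d, delta):
    """0-1 loss with a reject option, l_{d,delta}(z), for a margin z = y * f(x)."""
    if z < -delta:
        return 1.0      # accepted and misclassified
    if abs(z) <= delta:
        return d        # rejected: pay the reject cost d
    return 0.0          # accepted and correctly classified


def hinge_loss(z):
    """Standard hinge loss (1 - z)_+."""
    return max(1.0 - z, 0.0)


def hinge_reject_loss(z, d):
    """Hinge loss with a reject option, phi_d(z) of Eq. (2); assumes 0 < d < 1/2."""
    if z < 0:
        return 1.0 - (1.0 - d) / d * z
    if z < 1:
        return 1.0 - z
    return 0.0


def classify_with_reject(score, delta):
    """Decision rule t_delta^f: reject when |f(x)| <= delta, else sign of f(x)."""
    if abs(score) <= delta:
        return REJECT
    return 1 if score > 0 else -1
```

For $d = 0.2$ the slope of `hinge_reject_loss` on the negative side is $-(1-d)/d = -4$, compared with $-1$ for the hinge loss, which is the steeper left branch of $\phi_d$ in Fig. 2.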

2.2. $l_1$ SVM with a reject option

In the case of the $l_1$ SVM with a reject option, the $l_2$ norm is replaced with the $l_1$ norm in the regularized empirical risk (1) of the SVM with a reject option. In other words, we construct a classifier by minimizing the following regularized empirical risk:

$$\sum_{i=1}^{n} \phi_d\big(y_i(\beta_0 + x_i^\top\beta)\big) + \lambda \|\beta\|_1, \tag{3}$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\lambda > 0$ is a regularization parameter controlling the complexity and sparsity of the classifier. An important advantage of the $l_1$ penalty is that some of the estimated coefficients are exactly zero, hence allowing automatic gene selection. In addition, it is known that when the number of genes is much larger than the number of observations, learning procedures with the $l_1$ penalty produce highly accurate classifiers (Greenstein, 2006).

Computation in the case of the $l_1$ SVM with a reject option is difficult. Since the penalty as well as the surrogate loss is piecewise linear, we can use a linear program to solve the optimization problem in this case. However, a standard linear program is computationally very demanding when the dimension of the input (i.e., the number of genes) is very large. The situation becomes worse when an optimal regularization parameter $\lambda$ (and possibly $d$) is to be selected, because a linear program must be solved for each of various values of $\lambda$. In this paper, we propose a novel efficient optimization algorithm for the $l_1$ SVM with a reject option that yields an entire solution path with respect to $\lambda$. A solution path is a function $(\hat\beta_0(\lambda), \hat\beta(\lambda))$ of $\lambda$, where $(\hat\beta_0(\lambda), \hat\beta(\lambda))$ is the minimizer of (3) for a given $\lambda$. There are several solution-path-finding algorithms with the $l_1$ penalty, such as LARS proposed by Efron et al. (2004), the $l_1$ SVM by Zhu et al. (2004), general differentiable piecewise linear loss by Rosset and Zhu (2007), and a generalized linear model by Park and Hastie (2007). However, such algorithms are not directly applicable to the $l_1$ SVM with a reject option, as there are multiple nondifferentiable points in the surrogate loss function $\phi_d$.

When $p$ is much larger than $n$, linear classifiers frequently perform better than nonlinear ones in many applications. Hall et al. (2005) have provided some explanations for this phenomenon. Since gene expression data are high-dimensional and have low sample size, we only consider linear classifiers in this paper. In addition, nonlinear models can be approximated by employing a linear model using a basis expansion technique such as a spline (Zhang et al., 2004).
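To make the objective (3) concrete, the sketch below evaluates it for a given linear classifier and lists the selected (nonzero) coefficients. It is a minimal illustration rather than the authors' implementation; the function names and the toy data are assumptions, and no optimization is performed here.

```python
def hinge_reject_loss(z, d):
    """phi_d(z) of Eq. (2); assumes 0 < d < 1/2."""
    if z < 0:
        return 1.0 - (1.0 - d) / d * z
    return max(1.0 - z, 0.0)


def l1_reject_objective(beta0, beta, X, y, d, lam):
    """Regularized empirical risk (3): sum of phi_d over margins plus lam * ||beta||_1."""
    risk = 0.0
    for x_i, y_i in zip(X, y):
        score = beta0 + sum(b * x for b, x in zip(beta, x_i))
        risk += hinge_reject_loss(y_i * score, d)
    return risk + lam * sum(abs(b) for b in beta)


def selected_genes(beta):
    """Indices of nonzero coefficients, i.e., the genes kept by the l1 penalty."""
    return [j for j, b in enumerate(beta) if b != 0.0]


# Toy data (hypothetical): two observations, two genes.
X = [[2.0, 0.3], [-2.0, 0.1]]
y = [1, -1]
value = l1_reject_objective(0.0, [1.0, 0.0], X, y, d=0.2, lam=0.5)
```

With `beta = [1.0, 0.0]` both margins equal 2, so the loss term vanishes and the objective reduces to the penalty `lam * 1.0`; only the first gene is selected.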

3. Algorithm

In this section, we develop an optimization algorithm for (3), which provides an entire solution path with respect to λ. The entire solution path helps in saving computing time significantly when an optimal regularization parameter is being searched.

3.1. Computation

Since $\phi_d(z)$ of (2) is decomposed as a sum of two functions, that is, $\phi_d(z) = (1 - z)_+ + a(-z)_+$, where $a = (1 - 2d)/d$, the problem of minimizing (3) is equivalent to the following constrained optimization problem:

$$\min_{\beta_0, \beta} Q(\beta_0, \beta) = \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + a \sum_{i=1}^{n} (-y_i f(x_i))_+, \tag{4}$$

subject to $\|\beta\|_1 \le s$ for some $s > 0$, where $f(x_i) = \beta_0 + x_i^\top\beta$. Here, the complexity as well as sparsity are controlled by $s$. Since there is a one-to-one relation between $s$ and $\lambda$, we find an entire solution path with respect to $s$. That is, we obtain $\{(\hat\beta_0(s), \hat\beta(s)) : 0 \le s < \infty\}$, where $(\hat\beta_0(s), \hat\beta(s))$ is the minimizer of $Q(\beta_0, \beta)$ subject to $\|\beta\|_1 \le s$.

The main portion of this algorithm is dedicated to identifying the "kinks" in the path. The algorithm finds the kinks $0 = s_0 < s_1 < \cdots < s_m < \infty$ and solutions $(\hat\beta_0(s_0), \hat\beta(s_0)), (\hat\beta_0(s_1), \hat\beta(s_1)), \ldots, (\hat\beta_0(s_m), \hat\beta(s_m))$ such that for $s \in (s_{k-1}, s_k]$, the $(\hat\beta_0(s), \hat\beta(s))$ term is obtained by linear interpolation of $(\hat\beta_0(s_{k-1}), \hat\beta(s_{k-1}))$ and $(\hat\beta_0(s_k), \hat\beta(s_k))$. In practice, we set $s_m$ as the smallest $s$ value such that all coefficients of the corresponding solution are nonzero. The equivalent optimization problem to (4) is

$$\min_{\beta_0, \beta, \xi, \zeta} \sum_{i=1}^{n} \xi_i + a \sum_{i=1}^{n} \zeta_i, \tag{5}$$

subject to $\xi_i \ge 1 - y_i f(x_i)$, $\zeta_i \ge -y_i f(x_i)$, $\xi_i \ge 0$, and $\zeta_i \ge 0$, for $i = 1, \ldots, n$, and $\|\beta\|_1 \le s$, where $\xi = (\xi_1, \ldots, \xi_n)$ and $\zeta = (\zeta_1, \ldots, \zeta_n)$. The primal Lagrange function (Bertsekas, 2003) for (5) is

$$L = \sum_{i=1}^{n} (\xi_i + a\zeta_i) + \sum_{i=1}^{n} \alpha_i (1 - y_i f(x_i) - \xi_i) + \sum_{i=1}^{n} \gamma_i (-y_i f(x_i) - \zeta_i) - \sum_{i=1}^{n} \kappa_i \xi_i - \sum_{i=1}^{n} \rho_i \zeta_i + \eta \Big( \sum_{j=1}^{p} |\beta_j| - s \Big),$$

where $\alpha_i \ge 0$, $\gamma_i \ge 0$, $\kappa_i \ge 0$, $\rho_i \ge 0$, and $\eta \ge 0$ are Lagrange multipliers. Taking derivatives of $L$ with respect to $\beta_0$, $\beta$, $\xi$, and $\zeta$, we have

$$\begin{aligned}
\frac{\partial L}{\partial \beta_0} = 0 &\iff \sum_{i=1}^{n} (\alpha_i + \gamma_i) y_i = 0, \\
\frac{\partial L}{\partial \beta_j} = 0 &\iff -\sum_{i=1}^{n} (\alpha_i + \gamma_i) y_i x_{ij} + \eta\,\mathrm{sign}(\beta_j) = 0, \quad j \in V(s), \\
\frac{\partial L}{\partial \xi_i} = 0 &\iff 1 - \alpha_i - \kappa_i = 0, \quad i = 1, \ldots, n, \\
\frac{\partial L}{\partial \zeta_i} = 0 &\iff a - \gamma_i - \rho_i = 0, \quad i = 1, \ldots, n,
\end{aligned}$$

and, considering the Karush–Kuhn–Tucker (KKT) conditions,

$$\alpha_i (1 - y_i f(x_i) - \xi_i) = 0, \quad \kappa_i \xi_i = 0, \quad \rho_i \zeta_i = 0, \quad \gamma_i (0 - y_i f(x_i) - \zeta_i) = 0, \quad \text{and} \quad \eta \Big( \sum_{j=1}^{p} |\beta_j| - s \Big) = 0,$$

where $V(s) = \{j : \hat\beta_j(s) \neq 0\}$. We divide the samples into 5 subsets: left–left set $LL(s) = \{i : y_i \hat f(x_i) < 0\}$, left elbow set $LE(s) = \{i : y_i \hat f(x_i) = 0\}$, left–right set $LR(s) = \{i : 0 < y_i \hat f(x_i) < 1\}$, elbow set $E(s) = \{i : y_i \hat f(x_i) = 1\}$, and right set $R(s) = \{i : y_i \hat f(x_i) > 1\}$. The margins of these subset samples are shown in Fig. 3. The KKT conditions imply that

$$\begin{aligned}
LL(s) = \{i : y_i \hat f(x_i) < 0\} &\longrightarrow \alpha_i = 1 \text{ and } \gamma_i = a, \\
LE(s) = \{i : y_i \hat f(x_i) = 0\} &\longrightarrow \alpha_i = 1 \text{ and } \gamma_i + \rho_i = a, \\
LR(s) = \{i : 0 < y_i \hat f(x_i) < 1\} &\longrightarrow \alpha_i = 1 \text{ and } \gamma_i = 0, \\
E(s) = \{i : y_i \hat f(x_i) = 1\} &\longrightarrow \alpha_i + \kappa_i = 1 \text{ and } \gamma_i = 0, \text{ and} \\
R(s) = \{i : y_i \hat f(x_i) > 1\} &\longrightarrow \alpha_i = 0 \text{ and } \gamma_i = 0.
\end{aligned} \tag{6}$$

Fig. 3. The locations of margins of samples in 5 subsets $LL(s)$, $LE(s)$, $LR(s)$, $E(s)$, and $R(s)$.
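The partition into the five sets and the multiplier values implied by (6) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, and the numerical tolerance `tol` used to decide whether a margin sits exactly at 0 or 1 is an implementation choice that the paper does not specify.

```python
def partition_by_margin(margins, tol=1e-8):
    """Split sample indices into LL, LE, LR, E, R by the margin y_i * f(x_i).

    tol is an assumed numerical tolerance for the two elbows at 0 and 1.
    """
    sets = {"LL": [], "LE": [], "LR": [], "E": [], "R": []}
    for i, m in enumerate(margins):
        if abs(m) <= tol:
            sets["LE"].append(i)      # left elbow: margin equal to 0
        elif abs(m - 1.0) <= tol:
            sets["E"].append(i)       # elbow: margin equal to 1
        elif m < 0.0:
            sets["LL"].append(i)
        elif m < 1.0:
            sets["LR"].append(i)
        else:
            sets["R"].append(i)
    return sets


def multipliers_from_kkt(sets, a):
    """(alpha_i, gamma_i) implied by (6); None marks a value left free on an elbow."""
    fixed = {}
    for i in sets["LL"]:
        fixed[i] = (1.0, a)
    for i in sets["LE"]:
        fixed[i] = (1.0, None)        # gamma_i lies in [0, a]
    for i in sets["LR"]:
        fixed[i] = (1.0, 0.0)
    for i in sets["E"]:
        fixed[i] = (None, 0.0)        # alpha_i lies in [0, 1]
    for i in sets["R"]:
        fixed[i] = (0.0, 0.0)
    return fixed


example_sets = partition_by_margin([-0.5, 0.0, 0.3, 1.0, 2.0])
example_multipliers = multipliers_from_kkt(example_sets, a=3.0)
```

Only the observations in $LE(s)$ and $E(s)$ carry multipliers that are not pinned down by (6), which is why the linear systems that follow involve one unknown per elbow observation.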

Hence, the minimizer of (5) can be found by solving the following system of linear equations

$$\begin{aligned}
&\sum_{i=1}^{n} (\alpha_i + \gamma_i) y_i x_{ij} - \eta\,\mathrm{sign}(\beta_j) = 0 \quad \text{for } j \in V(s), \\
&\sum_{i=1}^{n} (\alpha_i + \gamma_i) y_i = 0, \\
&y_i \Big( \beta_0 + \sum_{j \in V(s)} \beta_j x_{ij} \Big) = 0 \quad \text{for } i \in LE(s), \\
&y_i \Big( \beta_0 + \sum_{j \in V(s)} \beta_j x_{ij} \Big) = 1 \quad \text{for } i \in E(s), \text{ and} \\
&\|\beta\|_1 = \sum_{j=1}^{p} \mathrm{sign}(\beta_j) \beta_j = s,
\end{aligned} \tag{7}$$

with the KKT conditions (6). Now, considering the right derivatives of (7), we have

$$\begin{aligned}
&\sum_{i \in E(s)} \frac{\partial \alpha_i}{\partial s} y_i x_{ij} + \sum_{i \in LE(s)} \frac{\partial \gamma_i}{\partial s} y_i x_{ij} - \frac{\partial \eta}{\partial s}\,\mathrm{sign}(\beta_j) = 0 \quad \text{for } j \in V(s), \\
&\sum_{i \in E(s)} \frac{\partial \alpha_i}{\partial s} y_i + \sum_{i \in LE(s)} \frac{\partial \gamma_i}{\partial s} y_i = 0, \\
&\frac{\partial \beta_0}{\partial s} + \sum_{j \in V(s)} \frac{\partial \beta_j}{\partial s} x_{ij} = 0 \quad \text{for } i \in LE(s) \text{ or } E(s), \text{ and} \\
&\sum_{j \in V(s)} \mathrm{sign}(\beta_j) \frac{\partial \beta_j}{\partial s} = 1.
\end{aligned} \tag{8}$$

Solving the system of linear equations (8), we have $\partial \beta_j / \partial s$ for $j \in V(s)$ and grow $(\hat\beta_0(s), \hat\beta(s))$ by $\hat\beta_0(s + h) = \hat\beta_0(s) + h\, \partial \beta_0 / \partial s$ and $\hat\beta_j(s + h) = \hat\beta_j(s) + h\, \partial \beta_j / \partial s$ for $j \in V(s)$. It is not difficult to show that $(\hat\beta_0(s + h), \hat\beta(s + h))$ is the solution of (7) with […] or $LE(s + h)$ is changed. Next, we update $V(s + h)$ by either adding a new coefficient that is not in $V(s)$ but satisfies (7) or deleting a coefficient that is in $V(s)$ and becomes zero; then, we re-solve (8) and further grow $(\hat\beta_0, \hat\beta)$ in a similar manner. Starting this procedure with $V(0+)$, which is obtained by minimizing (4) with only one coefficient via a line search, we obtain the entire solution path. We summarize the proposed algorithm for the $l_1$ SVM with a reject option in Table 1.

Table 1. The algorithm for the $l_1$ SVM with a reject option (pseudo code).

1. Find LL(0+), LE(0+), LR(0+), E(0+), R(0+), and V(0+).
2. Repeat until all coefficients corresponding to the solution are nonzero.
   (a) Compute the increase in s.
   (b) Grow the active coefficients linearly until the $l_1$ norm becomes s.
   (c) Update the sets LL(s), LE(s), LR(s), E(s), R(s), and V(s).
   (d) Compute the new right derivatives using linear system (8).

Table 2. Computational cost with varying p: CPU time (standard errors).

p       L1SVM            L1SVM-R
25      0.246 (0.005)    0.381 (0.016)
50      0.334 (0.009)    0.529 (0.035)
100     0.430 (0.014)    0.742 (0.040)
200     0.648 (0.025)    1.224 (0.098)
1000    2.458 (0.077)    5.647 (0.702)
2000    4.869 (0.170)    15.802 (1.763)
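Because the path is piecewise linear in $s$, the stored kinks are enough to recover the solution at any $s$ by linear interpolation. The sketch below shows this lookup step only; it does not compute the kinks. The function name, the data layout (a list of kinks and a parallel list of `(beta0, beta)` pairs), and the choice to return the last stored solution for $s \ge s_m$ are assumptions made for illustration.

```python
from bisect import bisect_left


def path_solution(s, kinks, solutions):
    """Solution (beta0, beta) at l1 budget s from kinks s_0 < ... < s_m.

    kinks[k] is s_k and solutions[k] is the pair (beta0, beta) at s_k.
    For s in (s_{k-1}, s_k], the two neighbouring kink solutions are
    linearly interpolated. Values of s outside [s_0, s_m] are clamped
    (an assumption: the path is not extended beyond the last stored kink).
    """
    if s <= kinks[0]:
        return solutions[0]
    if s >= kinks[-1]:
        return solutions[-1]
    k = bisect_left(kinks, s)                 # kinks[k-1] < s <= kinks[k]
    w = (s - kinks[k - 1]) / (kinks[k] - kinks[k - 1])
    b0_lo, beta_lo = solutions[k - 1]
    b0_hi, beta_hi = solutions[k]
    beta0 = (1.0 - w) * b0_lo + w * b0_hi
    beta = [(1.0 - w) * lo + w * hi for lo, hi in zip(beta_lo, beta_hi)]
    return beta0, beta


# Hypothetical path with three kinks and two coefficients.
kinks = [0.0, 1.0, 3.0]
solutions = [(0.0, [0.0, 0.0]), (0.5, [1.0, 0.0]), (0.5, [2.0, 1.0])]
beta0_mid, beta_mid = path_solution(2.0, kinks, solutions)
```

In this toy path the interpolated coefficients at $s = 2$ are `[1.5, 0.5]`, whose $l_1$ norm is again 2, matching the constraint $\|\beta\|_1 = s$ along the path.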

3.2. Computational complexity

The computational complexity of the proposed algorithm equals the cost of solving (8) multiplied by the number of kinks. Note that the number of equations in (8) is $|E(s)| + |LE(s)| + |V(s)| + 2$. The standard $l_1$ SVM algorithm developed by Zhu et al. (2004) must solve a system of $|E(s)| + |V(s)| + 2$ equations whenever $E(s)$ is changed. The proposed algorithm deals with two nondifferentiable points, that is, exactly twice as many as the standard $l_1$ SVM algorithm. Let us suppose that $|E(s)| = |LE(s)|$. Since the complexity of solving the system of linear equations is proportional to the square of the number of equations, the computing time for the proposed algorithm is approximately 4 times that for the standard $l_1$ SVM. Since the number of kinks is equal to the sample size in the worst case, we can conclude that the overall computational complexity of the $l_1$ SVM with a reject option is approximately 4 times that of the standard $l_1$ SVM.

Now, we compare the computational cost of the $l_1$ SVM with a reject option (L1SVM-R) with that of the standard $l_1$ SVM (L1SVM) algorithm using a small simulation. The experiment was conducted in R under Windows using a 2.33 GHz Intel Core2 Duo processor and 2 GB RAM. Training observations of size $n = 50$ were generated from the model with the same parameters as in Section 4.1. The elapsed computing times (in seconds) were recorded using the system.time() function in R. We replicated this process 20 times and computed the average elapsed times and corresponding standard errors. Table 2 shows the results for various $p$, with standard errors in parentheses. In summary, our path algorithm, L1SVM-R, is about 1.5 to 3.3 times slower than the L1SVM in Table 2, which is consistent with the factor of approximately 4 obtained from our theoretical calculations.
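The slowdown factors can be read off Table 2 directly, and the quadratic-cost argument can be checked with a few lines of arithmetic. The Python sketch below is illustrative: the timings are copied from Table 2, while the function names and the example set sizes are assumptions.

```python
# (L1SVM, L1SVM-R) average times in seconds, copied from Table 2.
TABLE2_SECONDS = {
    25: (0.246, 0.381),
    50: (0.334, 0.529),
    100: (0.430, 0.742),
    200: (0.648, 1.224),
    1000: (2.458, 5.647),
    2000: (4.869, 15.802),
}


def slowdown(times):
    """Ratio of L1SVM-R time to L1SVM time for each p."""
    return {p: with_reject / standard for p, (standard, with_reject) in times.items()}


def quadratic_cost_ratio(n_elbow, n_left_elbow, n_active):
    """Ratio of per-kink costs, assuming cost is quadratic in the number of equations.

    Numerator: |E| + |LE| + |V| + 2 equations (with reject option).
    Denominator: |E| + |V| + 2 equations (standard l1 SVM).
    """
    with_reject = n_elbow + n_left_elbow + n_active + 2
    standard = n_elbow + n_active + 2
    return (with_reject / standard) ** 2


ratios = slowdown(TABLE2_SECONDS)
```

The ratio produced by `quadratic_cost_ratio` approaches 4 only when $|E(s)| = |LE(s)|$ dominates $|V(s)| + 2$; for smaller elbow sets it stays below 4, in line with the observed slowdowns growing from about 1.5 at $p = 25$ to about 3.2 at $p = 2000$.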

4. Numerical results

In this section, we compare L1SVM-R with the L1SVM that does not have a reject option by analyzing the simulated as well as real data sets. Particularly, we study how the reject option affects prediction accuracy and gene selectivity. Further, the prediction accuracy of the l 1 SVM with a reject option is compared with that of the l 2 SVM with a reject option by analyzing real data sets.

4.1. Simulation I

For simulation data, we generate a sample of size $n$ as follows. Let $\mu_+$ be a $p$-dimensional vector whose first $q$ entries are $D$ and the other $p - q$ entries are zero, and let $\mu_- = -\mu_+$. Then, we generate $x$ from $N_p(\mu_+, \Sigma)$ and assign $y = 1$ for the first $n/2$ observations, and generate $x$ from $N_p(\mu_-, \Sigma)$ and assign $y = -1$ for the last $n/2$ observations. We let the $(k, l)$ entry of $\Sigma$ be $r^{|k - l|}$, where $r \in [0, 1)$. It should be noted that all input variables except the first $q$ are noise variables.

It is a standard procedure to fix the reject cost $d$ in advance. However, the choice of $d$ affects the estimated decision boundary. Fig. 4 compares the prediction scores $\hat f(x_i)$ of the L1SVM-R according to various $d$ values based on a simulated data set for fixed $\lambda$. This figure shows that as the value of $d$ decreases, the variation in the prediction scores increases. This is because the prediction scores of a classifier with a smaller reject cost tend to be far away from the boundary, since $\phi_d$ assigns more weight to misclassified observations. When the purpose of using a reject option is to improve the prediction accuracy by not using observations close to the decision boundary, it would be natural to choose an optimal $d$ instead of fixing it in advance. In this section, we choose the optimal $d$ using either validation samples or a model selection criterion.

Fig. 4. The box-plots of prediction $\hat f(x)$s according to $d$ values.

Fig. 5. Comparison of empirical margins of the L1SVM and L1SVM-R. (a) The scatter plot of margins. (b) The histogram of p-values of the permutation test.
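The simulation design of this section can be sketched in a few lines. The Python sketch below is not the authors' code (their experiments were run in R); the function names and the seed argument are assumptions. It uses the fact that a stationary Gaussian AR(1) sequence with unit variance has covariance $r^{|k-l|}$ between its $k$th and $l$th entries, which avoids forming $\Sigma$ explicitly.

```python
import math
import random


def sigma_entry(k, l, r):
    """(k, l) entry of the covariance matrix: r^{|k - l|}."""
    return r ** abs(k - l)


def simulate_sample(n, p, q, D, r, seed=0):
    """Draw n observations: first n/2 from N_p(mu_+, Sigma), last n/2 from N_p(-mu_+, Sigma).

    mu_+ has D in its first q entries and 0 elsewhere. The noise is generated
    by the AR(1) recursion e_k = r * e_{k-1} + sqrt(1 - r^2) * z_k, whose
    covariance is r^{|k - l|}.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i < n // 2 else -1
        noise = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            noise.append(r * noise[-1] + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0))
        mean = [label * D if j < q else 0.0 for j in range(p)]
        X.append([m + e for m, e in zip(mean, noise)])
        y.append(label)
    return X, y


# Setting used for Fig. 5(a): p = 100, q = 5, D = 1.0, r = 0.0, n = 100.
X_sim, y_sim = simulate_sample(n=100, p=100, q=5, D=1.0, r=0.0, seed=1)
```

Only the first `q` columns of `X_sim` carry signal; the remaining `p - q` columns are the noise variables that a good gene selection procedure should leave out.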

Fig. 5(a) compares the empirical margins $y_i \hat f(x_i)$ of the L1SVM and L1SVM-R calculated on a simulated data set with $p = 100$, $q = 5$, $D = 1.0$, $r = 0.0$, and $n = 100$. We can observe that the margins of L1SVM-R are larger than those of L1SVM, particularly for large margin values. To assess the statistical significance of this difference, we applied a permutation test. Fig. 5(b) shows the histogram of 100 p-values of the permutation test obtained on 100 simulated data sets of size 100, which indicates that the difference of margins is statistically significant.
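The text does not say which permutation test was applied to the margins. One common choice for paired quantities, shown here purely as an assumption, is a sign-flip permutation test on the paired differences with the absolute sum of differences as the test statistic. The function name, the number of permutations, and the seed are all hypothetical.

```python
import random


def paired_permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation p-value for paired samples a and b.

    Assumed statistic: |sum of paired differences|. Under the null hypothesis
    the sign of each difference is exchangeable, so signs are flipped at random.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    for _ in range(n_perm):
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            count += 1
    # Add-one correction so that the p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)
```

Applied to the margins of the two classifiers on the same observations, a small p-value would indicate that the margins of one method are systematically larger than those of the other.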

Table 3 compares the prediction accuracy as well as variable selectivity. The training sample size is 100 and the regularization parameters $(d, \lambda)$ are selected on the basis of an independent validation sample of size 100. Next, we choose the optimal $\delta$, which minimizes the empirical risk of the 0-1 loss with a reject option. Misclassification errors are calculated on the basis of another independent test sample of size 2000. In the table, “Total MIS” denotes the misclassification error rate based on all observations in a test sample; “Accept MIS,” the rate based only on observations accepted by the L1SVM-R classifier; and “Reject MIS,” the rate based on rejected observations. “Reject” denotes the proportion of rejected observations. We repeat the simulation 20 times and report the average misclassification errors with their standard errors in parentheses. The p-values are obtained by the paired t-test with 20 paired error rates of L1SVM-R and L1SVM. In the table, “Czeros” and “Cnzeros” are, respectively, the average numbers of correct zero coefficients (true is zero and estimated as zero) and correct nonzero coefficients (true is nonzero and estimated as nonzero) in the estimated models, “Count” represents the frequencies of the first $q$ coefficients being estimated as nonzero among 20 simulations, and “Others” represents the frequencies of selected noisy variables. In all simulations, we set $D = 0.5$ and $q = 5$. In the table, the smallest misclassification error rates are highlighted in bold face. L1SVM-R has lower misclassification errors than the L1SVM in all but two cases (Reject MIS with $p = 100$, $r = 0.6$ and with $p = 2000$, $r = 0.6$), and the improvements are statistically significant in most cases. Further, the misclassification error rates on rejected observations are around 0.5, which indicates that the L1SVM-R successfully selects observations near a decision boundary. One of the reasons for the superior prediction performance of L1SVM-R is the use of only high-quality samples (samples far away from the decision boundary), which makes the corresponding classifier robust to less informative samples that are usually located near a decision boundary, hence yielding better prediction accuracy. The use of only high-quality data cannot be easily realized, since we do not know a decision boundary before constructing it; L1SVM-R implements this approach suitably. For variable selectivity, L1SVM-R and L1SVM are clearly competitive. The results of the simulations show that L1SVM-R significantly improves prediction accuracy without hampering variable selectivity.

Table 3. Comparison between prediction accuracy and variable selectivity values of the L1SVM and those of L1SVM-R: average misclassification errors and average numbers of coefficients (standard errors). The entries of each column are listed in the order in which they appear in the source; the row alignment across columns is not preserved.

p: 100, 2000
r: 0, 0.3, 0.6
Method: L1SVM, L1SVM-R
Total MIS: 0.277 (0.005), 0.251 (0.004), 0.211 (0.003), 0.224 (0.005), 0.178 (0.006), 0.147 (0.003), 0.279 (0.005), 0.258 (0.004), 0.231 (0.006), 0.215 (0.003), 0.151 (0.003), 0.185 (0.007)
Accept MIS: 0.250 (0.006), 0.230 (0.004), 0.200 (0.005), 0.191 (0.004), 0.131 (0.003), 0.153 (0.006), 0.134 (0.003), 0.160 (0.008), 0.254 (0.006), 0.236 (0.004), 0.193 (0.003), 0.208 (0.006)
Reject MIS: 0.493 (0.013), 0.470 (0.024), 0.460 (0.023), 0.459 (0.014), 0.465 (0.018), 0.483 (0.013), 0.515 (0.024), 0.465 (0.013), 0.505 (0.017), 0.478 (0.021), 0.492 (0.016), 0.480 (0.013)
Reject: 0.087 (0.013), 0.119 (0.016), 0.084 (0.012), 0.078 (0.010), 0.077 (0.008), 0.048 (0.007)
p-value: 0.004, 0.003, 0.003, 0.002, 0.068, 0.001
Czeros: 1989.80 (2.598), 1992.30 (1.920), 1981.65 (7.156), 1984.25 (6.159), 1988.65 (3.966), 1991.90 (1.360), 90.90 (0.976), 89.15 (1.186), 90.95 (0.752), 90.35 (1.164), 92.45 (0.634), 89.25 (1.015)
Cnzeros: 4.100 (0.287), 4.300 (0.147), 3.750 (0.176), 2.800 (0.573), 4.650 (0.263), 4.050 (0.340), 4.500 (0.154), 2.650 (0.567), 3.550 (0.198), 4.650 (0.263), 5 (0), 5 (0)
Count: 20 20 20 20 20; 20 20 20 20 20; 19 10 11 12 19 18 14 13 12 18; 20 20 18 20 15 20 20 17 20 16; 9 10 13; 8 12; 18 18 15 13 18 17 18 14 14 18; 20 17 17 17 15 20 18 18 17 17; 9; 12 12; 12 12
Others: 0.979, 0.863, 0.853, 0.052, 1.232, 0.052, 0.108, 0.118, 0.537, 1.211, 0.031, 0.031

Table 4. Prediction results of the L1SVM, L1SVM-R, and L1SVM-R (remeasure): average misclassification errors (standard errors).

σ_e   r     L1SVM           L1SVM-R         L1SVM-R (remeasure)   pv1     pv2     pv3
0.5   0     0.217 (0.007)   0.212 (0.007)   0.208 (0.007)         0.001   0.000   0.000
0.5   0.3   0.259 (0.007)   0.253 (0.007)   0.249 (0.007)         0.000   0.000   0.000
0.5   0.6   0.312 (0.010)   0.311 (0.010)   0.307 (0.010)         0.447   0.016   0.000
1     0     0.309 (0.009)   0.302 (0.009)   0.292 (0.009)         0.002   0.000   0.000
1     0.3   0.325 (0.010)   0.319 (0.010)   0.312 (0.010)         0.017   0.000   0.000
1     0.6   0.365 (0.011)   0.360 (0.011)   0.355 (0.012)         0.005   0.000   0.000

Table 5. Real data sets.

Reference                     Data       (n, p)          (+1, −1)
Iwao et al. (2002)            Breast     (110, 2379)     (98, 12)
Bhattacharjee et al. (2001)   Lung       (203, 12602)    (186, 17)
Yukinawa et al. (2006)        Thyroid    (168, 2000)     (128, 40)
Petricoin et al. (2002)       Ovarian    (252, 15154)    (162, 90)
Singh et al. (2002)           Prostate   (102, 12600)    (52, 50)
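The four error summaries reported in Table 3 and the data-driven choice of $\delta$ can be sketched as follows. This Python sketch is illustrative and rests on assumptions: the function names are hypothetical, the candidate grid for $\delta$ (zero plus the observed absolute scores) is one convenient choice rather than the paper's, and rejection is decided from the same scores that are evaluated, whereas in Table 3 the accepted and rejected sets are those of the L1SVM-R classifier.

```python
def sign(s):
    return 1 if s > 0 else -1


def reject_metrics(scores, labels, delta):
    """Total MIS, Accept MIS, Reject MIS, and rejected proportion for scores f(x_i)."""
    errors = [sign(s) != y for s, y in zip(scores, labels)]
    rejected = [abs(s) <= delta for s in scores]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return {
        "Total MIS": rate(errors),
        "Accept MIS": rate([e for e, r in zip(errors, rejected) if not r]),
        "Reject MIS": rate([e for e, r in zip(errors, rejected) if r]),
        "Reject": rate(rejected),
    }


def empirical_reject_risk(scores, labels, d, delta):
    """Empirical risk of the 0-1 loss with a reject option, l_{d,delta}."""
    total = 0.0
    for s, y in zip(scores, labels):
        if abs(s) <= delta:
            total += d                 # rejected observation costs d
        elif sign(s) != y:
            total += 1.0               # accepted and misclassified
    return total / len(scores)


def choose_delta(scores, labels, d):
    """Pick delta minimizing the empirical risk over an assumed candidate grid."""
    candidates = sorted({0.0} | {abs(s) for s in scores})
    return min(candidates, key=lambda t: empirical_reject_risk(scores, labels, d, t))


# Hypothetical validation scores and labels.
scores = [2.0, -1.5, 0.2, -0.1, 0.8, -0.3]
labels = [1, -1, -1, 1, 1, 1]
best_delta = choose_delta(scores, labels, d=0.2)
metrics = reject_metrics(scores, labels, best_delta)
```

In this toy example all three errors fall on scores close to zero, so rejecting those observations leaves no errors among the accepted ones while the error rate on the rejected ones is high, the same qualitative pattern as the Accept MIS and Reject MIS columns.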

4.2. Simulation II

The reject option can be used to further improve accuracy when measurement errors exist in $x$. For given $y \in \{-1, 1\}$, let $z$ follow a multivariate normal distribution with mean $\mu_y$ and variance $\Sigma$. Here, $D$ in $\mu_y$ and the structure of $\Sigma$ are the same as those in simulation I. However, instead of observing $z$, suppose we observe $x = z + \epsilon$, where $\epsilon$ follows a multivariate normal distribution with mean 0 and variance $\sigma_e^2 I$. We consider the following classification strategy. Let $R$ be the index set of rejected observations in the test samples. Then, for $i \in R$, we measure $x_i$ one more time and denote the new measurement by $\tilde{x}_i$. Next, we replace the original $x_i$ by $(x_i + \tilde{x}_i)/2$ and make a prediction based on $(x_i + \tilde{x}_i)/2$. We perform the simulation using three methods including the new strategy: (1) no rejection for training and test data sets (L1SVM), (2) reject option for training and test data sets (L1SVM-R), and (3) reject option plus prediction with remeasurement of an observation in the test data set when it is rejected (L1SVM-R (remeasure)). Table 4 shows the results for two measurement noise levels, $\sigma_e = 0.5$ and $1$, when $p = 2000$. In the table, “pv1,” “pv2,” and “pv3” are p-values obtained via the paired t-test between L1SVM and L1SVM-R, L1SVM and L1SVM-R (remeasure), and L1SVM-R and L1SVM-R (remeasure), respectively. We can see that the remeasurement with a reject option improves the prediction accuracy significantly in almost all cases.
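The remeasurement strategy can be sketched as a small wrapper around any scoring function. The Python sketch below is illustrative only: `predict_with_remeasure`, the `remeasure` callback, and the toy scoring function are hypothetical names, and the sketch forces a decision after one remeasurement, as in method (3).

```python
def predict_with_remeasure(x, score, delta, remeasure):
    """Predict a label; if x is rejected, average it with one new measurement.

    score:     function mapping a feature vector to f(x)
    remeasure: zero-argument function returning a fresh measurement x_tilde
               of the same observation (called only when x is rejected)
    """
    s = score(x)
    if abs(s) <= delta:
        x_tilde = remeasure()
        x_avg = [(u + v) / 2.0 for u, v in zip(x, x_tilde)]
        s = score(x_avg)               # predict from (x + x_tilde) / 2
    return 1 if s > 0 else -1


# Toy linear score using only the first coordinate (hypothetical).
def toy_score(x):
    return x[0]


label_rejected = predict_with_remeasure([0.1], toy_score, 0.3, lambda: [0.9])
label_accepted = predict_with_remeasure([2.0], toy_score, 0.3, lambda: [-5.0])
```

Averaging two measurements with independent errors of variance $\sigma_e^2$ halves the measurement-error variance to $\sigma_e^2 / 2$, which is a plausible reason why the remeasured predictions in Table 4 are more accurate.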

4.3. Analysis of gene expression data

In this section, we analyze four gene expression data sets and a mass spectrometry data set; all these data sets are publicly available. The basic information on the data sets is given in Table 5, and detailed descriptions are given below.

The Breast cancer data set contains 98 female breast cancer samples and 12 normal samples. These data were obtained using adapter-tagged competitive PCR instead of conventional cDNA microarrays.

The Lung cancer data set consists of a total of 203 snap-frozen lung tumor samples and normal lung specimens. Of these, 125 adenocarcinoma samples were associated with clinical data and with histological slides from adjacent sections. The 203 specimens include histologically defined lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, SCLC cases, and normal lung specimens. Other adenocarcinomas were suspected to be extrapulmonary metastases based on clinical history.

The third data set pertains to Thyroid cancer, which is a relatively common cancer, accounting for roughly 1% of the total cancer incidence. There are two main types of thyroid cancer: papillary carcinoma (PC) and follicular carcinoma (FC). In addition to these malignant types, a benign tumor, follicular adenoma (FA), is also prevalent.


Table 6. Test error rates of L2SVM and L2SVM-R.

| Method | Breast | Lung | Thyroid | Ovarian | Prostate |
|---|---|---|---|---|---|
| L2SVM | 0.090 (0.003) | 0.033 (0.008) | 0.256 (0.014) | 0.011 (0.002) | 0.154 (0.015) |
| L2SVM-R | 0.070 (0.002) | 0.028 (0.003) | 0.230 (0.008) | 0.010 (0.003) | 0.137 (0.011) |

Table 7. Prediction accuracies, reject portions, p-values, and mean model sizes (Vsize, |V|), with the corresponding standard errors in parentheses, of the L1SVM and L1SVM-R on 5 real data sets.

| Data | Method | Total MIS | Accept MIS | Reject MIS | Reject | p-value | Vsize |
|---|---|---|---|---|---|---|---|
| Breast | L1SVM | 0.017 (0.007) | 0.017 (0.007) | 0 (0) | | | 1.400 (0.197) |
| Breast | L1SVM-R | 0.003 (0.001) | 0.002 (0.001) | 0.050 (0.050) | 0.001 (0.001) | 0.041 | 1.150 (0.082) |
| Lung | L1SVM | 0.011 (0.001) | 0.010 (0.001) | 0.150 (0.180) | | | 3.400 (0.440) |
| Lung | L1SVM-R | 0.011 (0.001) | 0.009 (0.001) | 0.150 (0.180) | 0.001 (0.000) | 0.165 | 2.600 (0.180) |
| Thyroid | L1SVM | 0.254 (0.018) | 0.235 (0.018) | 0.512 (0.071) | | | 3.450 (0.872) |
| Thyroid | L1SVM-R | 0.235 (0.009) | 0.218 (0.011) | 0.481 (0.068) | 0.059 (0.011) | 0.097 | 2.850 (0.386) |
| Ovarian | L1SVM | 0.011 (0.004) | 0.009 (0.003) | 0.389 (0.200) | | | 7.550 (0.682) |
| Ovarian | L1SVM-R | 0.007 (0.001) | 0.005 (0.001) | 0.661 (0.200) | 0.003 (0.001) | 0.140 | 5.950 (0.540) |
| Prostate | L1SVM | 0.153 (0.019) | 0.150 (0.018) | 0.615 (0.104) | | | 6.500 (0.605) |
| Prostate | L1SVM-R | 0.139 (0.013) | 0.126 (0.013) | 0.339 (0.098) | 0.026 (0.008) | 0.158 | 6.900 (0.692) |

The fourth data set consists of mass spectrometry data for 90 normal samples and 162 Ovarian cancer samples. A normal sample refers to a blood (more precisely, serum) sample collected from a healthy patient, while a cancer sample refers to a blood sample collected from an ovarian cancer patient. Each sample is described by 15154 mass/charge identities of proteins in the blood sample.

Finally, the Prostate cancer data set has 102 samples (52 prostate cancer samples and 50 normal samples) and 12 600 genes.

To calculate misclassification errors, we divide each data set into two parts, training and test data sets, by randomly selecting 2/3 and 1/3 of the observations, respectively. We construct a classifier on the training data and select the regularization parameters $(d, \lambda)$ by minimizing the BIC-type criterion (Schwarz, 1978) calculated on the training data,

$$\frac{1}{n}\sum_{i=1}^{n} \phi_d\bigl(y_i f(x_i)\bigr) + \frac{\log n}{2n}\,|V|,$$

where $|V|$ is the number of nonzero coefficients in $\beta$. Refer to Zou et al. (2007) for a justification of the BIC-type criterion. The optimal $\delta$ value is selected by minimizing the empirical average of the 0-1 loss with a reject option. Next, we measure the misclassification error on the test data. We repeat this procedure 20 times and summarize the results in Tables 6 and 7.

To illustrate the importance of gene selection in real data analysis, we compare the $\ell_2$ SVM (Hastie et al., 2004, denoted by L2SVM) and the $\ell_2$ SVM with a reject option (Bartlett and Wegkamp, 2008, denoted by L2SVM-R) with L1SVM and L1SVM-R. L2SVM-R is implemented using quadratic programming. The results are presented in Table 6. L2SVM-R is always superior to L2SVM; this fact, when considered with the results of Table 7 (Total MIS), indicates that the methods with the $\ell_1$ penalty perform better than those with the $\ell_2$ penalty. Therefore, gene selection is indispensable for constructing accurate predictive models.

We now compare L1SVM-R with L1SVM in detail. The results in Table 7 show that L1SVM-R performs consistently better than L1SVM in terms of prediction. In terms of Total MIS and Accept MIS, L1SVM-R performs better than L1SVM in 4 out of 5 data sets (the “Breast,” “Thyroid,” “Ovarian,” and “Prostate” data sets), while their performances are similar for the “Lung” data set. Further, in most cases L1SVM-R has lower error rates even for rejected samples. The numbers of selected genes (Vsize in the table) are similar. The improvements of L1SVM-R over L1SVM are not statistically significant in many cases. However, it should be noted that L1SVM is already a highly accurate classifier, and it is very difficult to significantly improve highly accurate classifiers on real data sets.
Many recent studies on the development of new learning algorithms have investigated consistency, that is, whether a proposed algorithm consistently improves on an old but highly accurate method across various data sets. For further examples, refer to Zhang et al. (2006) and Wang et al. (2008). In this context, we can conclude that the proposed $\ell_1$ SVM with a reject option is a useful new tool for gene expression data.

To assess the stability of gene selection, we investigated the genes most frequently selected by L1SVM-R and L1SVM in the Ovarian cancer data set over the 20 random partitions and found that the top 5 genes matched. Table 8 presents the frequencies of the selection of the 5 most important genes by L1SVM-R and L1SVM. Note that L1SVM-R never misses the first two of the most important genes, while L1SVM misses them in some partitions. This result suggests that L1SVM-R is more stable in gene selection than L1SVM.
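The tuning steps described earlier in this section (the BIC-type criterion for $(d, \lambda)$ and the choice of $\delta$ by the empirical 0-1 loss with a reject option) can be sketched as follows. This is only an illustration under stated assumptions: the per-sample surrogate losses $\phi_d(y_i f(x_i))$ are taken as already computed, the rejection rule $|f(x)| \le \delta$ with reject cost $d$ is assumed here rather than quoted from this section, and all function names and numbers are hypothetical.

```python
import math


def bic_type_criterion(surrogate_losses, n_nonzero):
    """Mean surrogate loss plus (log n / 2n) * |V| on the training data."""
    n = len(surrogate_losses)
    return sum(surrogate_losses) / n + math.log(n) / (2.0 * n) * n_nonzero


def select_by_bic(fits):
    """fits maps (d, lam) -> (per-sample surrogate losses, number of nonzero coefficients)."""
    return min(fits, key=lambda key: bic_type_criterion(*fits[key]))


def reject_01_loss(y, score, delta, d):
    """0-1 loss with a reject option: cost d for a rejection, 1 for an error, 0 otherwise.

    Assumed rule: reject when |score| <= delta, otherwise predict sign(score).
    """
    if abs(score) <= delta:
        return d
    return 0.0 if y * score > 0 else 1.0


def select_delta(ys, scores, deltas, d):
    """Pick the threshold with the smallest empirical average loss (first one on ties)."""
    def average_loss(delta):
        return sum(reject_01_loss(y, s, delta, d) for y, s in zip(ys, scores)) / len(ys)
    return min(deltas, key=average_loss)


# Hypothetical candidate fits: a dense fit with small loss and a sparse fit with larger loss.
fits = {
    (0.2, 0.1): ([0.0, 0.1, 0.0, 0.3], 3),
    (0.2, 1.0): ([0.2, 0.4, 0.1, 0.5], 1),
}
best_d_lam = select_by_bic(fits)

# Hypothetical labels in {-1, +1} and decision values f(x_i).
ys = [1, -1, 1, -1]
scores = [2.0, -1.5, -0.3, 0.2]
best_delta = select_delta(ys, scores, deltas=[0.0, 0.5, 1.0], d=0.2)
```

In this toy example the penalty term outweighs the difference in mean loss, so the sparser fit is selected, and rejecting the two low-margin samples is cheaper than misclassifying them.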


Table 8. Frequencies of the selection of the 5 most important genes identified by L1SVM and L1SVM-R in the Ovarian cancer data.

| Method | X508022 | X702461 | X8832 | X180015 | X657213 |
|---|---|---|---|---|---|
| L1SVM | 18 | 16 | 16 | 12 | 11 |
| L1SVM-R | 20 | 20 | 16 | 10 | 8 |
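Selection frequencies of this kind can be tallied with a short sketch like the one below. It assumes that each random partition yields the set of features with nonzero coefficients; the names `selection_frequencies` and `top_k` and the feature labels are hypothetical.

```python
from collections import Counter


def selection_frequencies(selected_sets):
    """Count in how many partitions each feature received a nonzero coefficient."""
    counts = Counter()
    for genes in selected_sets:
        counts.update(set(genes))  # each feature counts at most once per partition
    return counts


def top_k(counts, k):
    """Return the k most frequently selected features."""
    return [gene for gene, _ in counts.most_common(k)]


# Hypothetical selections from three random partitions.
runs = [{"g1", "g2"}, {"g1"}, {"g1", "g2", "g3"}]
freq = selection_frequencies(runs)
most_stable = top_k(freq, 2)
```

Comparing the resulting top-k lists and counts for two methods gives the kind of stability summary reported in Table 8.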

Fig. 6. Various surrogate losses.

5. Conclusion

In this paper, we proposed a learning method that can simultaneously select signal genes and produce a highly accurate predictive model by incorporating a reject option. In addition, we developed an efficient computational algorithm that realizes an entire solution path and is able to work with high-dimensional gene expression data without much difficulty. Analysis of simulated and real data sets suggested strong potential for the application of the proposed method to gene expression data analysis. A disadvantage of the proposed method is that there are multiple regularization parameters $(\lambda, d)$; this issue was resolved by employing the BIC-type criterion.

There are various applications of the reject option. For example, we can remeasure the gene expression levels of rejected samples. Refer to Section 4.2 for an example. Remeasurement further improves prediction accuracy, particularly in cases of measurement errors in gene expression levels. We examined this possibility only by means of simulation since we do not have real data for remeasurement. We aim to further pursue this issue in the near future.

An opposite approach to using the reject option is to downweight the misclassified observations located far away from the decision boundary. With this approach, we can construct a classifier that is robust to outliers. An example is the ψ-learning proposed by Shen et al. (2003). The surrogate loss of ψ-learning is the same as the hinge loss on the positive side but is smaller than the hinge loss on the negative side. This contrasts with the surrogate loss for the $\ell_1$ SVM with a reject option, which is larger than the hinge loss on the negative side. We can consider a modification of the ψ-learning method in which the surrogate loss has a negative slope on the negative side. See Fig. 6. Interestingly, this surrogate loss corresponds to the surrogate loss of the $\ell_1$ SVM with a reject option with $d > 1/2$.
This suggests that two seemingly different approaches, the reject option and ψ-learning, may be related in an interesting way. We will attempt to tackle this issue in future research.
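The comparison of surrogate losses can be made concrete with a small sketch. The reject-option loss below uses the generalized hinge form of Bartlett and Wegkamp (2008), with slope $-(1-d)/d$ on the negative side; this form is assumed here rather than quoted from this section. The function `psi_like` is only a truncated-hinge stand-in with the qualitative shape described above, not the exact ψ loss of Shen et al. (2003).

```python
def hinge(z):
    """Standard hinge loss as a function of the margin z = y * f(x)."""
    return max(0.0, 1.0 - z)


def reject_hinge(z, d):
    """Generalized hinge loss with reject cost d (assumed Bartlett-Wegkamp form).

    On the negative side the slope is -(1 - d) / d, which is steeper than the
    hinge loss when d < 1/2 and flatter when d > 1/2.
    """
    a = (1.0 - d) / d
    if z < 0:
        return 1.0 - a * z
    return max(0.0, 1.0 - z)


def psi_like(z):
    """Truncated hinge: equals the hinge loss for z >= 0 and is flat (= 1) for z < 0."""
    return min(1.0, max(0.0, 1.0 - z))


# Loss values at a badly misclassified point (margin z = -1).
z = -1.0
values = {
    "hinge": hinge(z),
    "reject, d = 0.25": reject_hinge(z, 0.25),
    "reject, d = 0.75": reject_hinge(z, 0.75),
    "psi-like": psi_like(z),
}
```

At $d = 1/2$ the assumed reject-option loss coincides with the hinge loss, so values of $d$ above and below $1/2$ fall on opposite sides of it for negative margins.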

Acknowledgements

We are grateful to the anonymous referees and the co-editor for their helpful comments. Choi’s work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0003377). Yeo’s work was supported by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0093827). Kim’s work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-314-C00046).


References


Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene expression data. Proceedings of the National Academy of Sciences USA 99, 6562–6566.

Balakrishnan, S., Madigan, D., 2008. Algorithms for sparse linear classifiers in the massive data. Journal of Machine Learning Research 9, 313–337.

Bartlett, P., Wegkamp, M.H., 2008. Classification with a reject option using a hinge loss. Journal of Machine Learning Research 9, 1823–1840.

Bertsekas, D.P., 2003. Nonlinear Programming, second ed. Athena Scientific.

Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E.J., Lander, E.S., Wong, W., Johnson, B.E., Golub, T.R., Sugarbaker, D.J., Meyerson, M., 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences USA 98, 13790–13795.

Chow, C.K., 1970. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16, 41–46.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32, 407–499.

Genkin, A., Lewis, D.D., Madigan, D., 2007. Large-scale Bayesian logistic regression for text categorization. Technometrics 49, 291–304.

Greenstein, E., 2006. Best subset selection, persistence in high-dimensional statistical learning and optimization under L1 constraint. The Annals of Statistics 34, 2367–2386.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 389–422.

Hale, E., Yin, W., Zhang, Y., 2008. Fixed-point continuation for L1-minimization: methodology and convergence. SIAM Journal on Optimization 19, 1107–1130.

Hall, P., Marron, J.S., Neeman, A., 2005. Geometric representation of high dimension low sample size data. Journal of the Royal Statistical Society. Series B 67, 427–444.

Hastie, T., Rosset, S., Tibshirani, R., Zhu, J., 2004. The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning, first ed. Springer-Verlag, New York.

Herbei, R., Wegkamp, M.H., 2006. Classification with reject option. The Canadian Journal of Statistics 34, 709–721.

Hong, P., Liu, S., Zhou, Q., Lu, X., Liu, S., Wong, H., 2005. A boosting approach for motif modeling using CHIP–chip data. Bioinformatics 21, 2636–2643.

Iwao, K., Matoba, R., Ueno, N., Ando, A., Miyoshi, Y., Matsubara, K., Noguchi, S., Kato, K., 2002. Molecular classification of primary breast tumors possessing distinct prognostic properties. Human Molecular Genetics 11, 199–206.

Kim, J., Kim, Y., Kim, Y., 2008. A gradient-based optimization algorithm for lasso. Journal of Computational and Graphical Statistics 17, 994–1009.

Koh, K., Kim, S.J., Boyd, S., 2007. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research 8, 1519–1555.

Lendgrebe, C.W., Tax, M.J., Paclik, P., Duin, P.W., 2006. The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters 27, 908–917.

Liao, J.G., Chin, K.V., 2007. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23, 1945–1951.

McLachlan, G.J., 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Meier, L., van de Geer, S., Buhlmann, P., 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society. Series B 70, 53–71.

Park, M.Y., Hastie, T., 2007. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society. Series B 69, 659–677.

Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.A., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A., 2002. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577.

Rosset, S., Zhu, J., 2007. Piecewise linear regularized solution paths. The Annals of Statistics 35, 1012–1030.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.

Shen, X., Tseng, G.C., Zhang, X., Wong, W.H., 2003. On ψ-learning. Journal of the American Statistical Association 98, 724–734.

Shevade, S.K., Keerthi, S.S., 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19, 2246–2253.

Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., Sellers, W.R., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.

Terrence, S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B 58, 267–288.

Tortorella, F., 2000. An optimal reject rule for binary classifiers. Lecture Notes in Computer Science 876, 611–620.

Wang, L., Zhu, J., Zou, H., 2008. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24, 412–419.

Wegkamp, M.H., 2007. Lasso type classifiers with a reject option. Electronic Journal of Statistics 1, 155–168.

Wu, T.T., Lange, K., 2008. Coordinate descent algorithm for lasso penalized regression. The Annals of Applied Statistics 2, 224–244.

Yukinawa, N., Oba, S., Kato, K., Taniguchi, K., Iwao-Koizumi, K., Tamaki, Y., Noguchi, S., Ishii, S., 2006. A multi-class predictor based on a probabilistic model: application to gene expression profiling-based diagnosis of thyroid tumors. BMC Bioinformatics 7, 1471–2164.

Zhang, H.H., Ahn, J., Lin, X., Park, C., 2006. Gene selection using support vector machines with nonconvex penalty. Bioinformatics 22, 88–95.

Zhang, H.H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R., Klein, B., 2004. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association 99, 659–672.

Zhu, J., Rosset, S., Hastie, T., Tibshirani, R., 2004. 1-norm support vector machines. In: Thrun, S., et al. (Eds.), Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge, MA.

Zou, H., Hastie, T., Tibshirani, R., 2007. On the degrees of freedom of the lasso. The Annals of Statistics 35, 2173–2192.