

Training ν-Support Vector Regression: Theory and Algorithms


Chih-Chung Chang and Chih-Jen Lin
Department of Computer Science and Information Engineering
National Taiwan University
Taipei 106, Taiwan (cjlin@csie.ntu.edu.tw)

Abstract

We discuss the relation between ε-Support Vector Regression (ε-SVR) and ν-Support Vector Regression (ν-SVR). In particular we focus on properties that are different from those of C-Support Vector Classification (C-SVC) and ν-Support Vector Classification (ν-SVC). We then discuss some issues that do not occur in the case of classification: the possible range of ε and the scaling of target values. A practical decomposition method for ν-SVR is implemented, and computational experiments are conducted. We show some interesting numerical observations specific to regression.

1  Introduction

The ν-support vector machine (Schölkopf et al. 2000; Schölkopf et al. 1999) is a new class of support vector machines (SVM). It can handle both classification and regression. Properties of training ν-support vector classifiers (ν-SVC) have been discussed in (Chang and Lin 2001b). In this paper we focus on ν-support vector regression (ν-SVR). Given a set of data points {(x_1, y_1), ..., (x_l, y_l)}, such that x_i ∈ R^n is an input and y_i ∈ R is a target output, the primal problem of ν-SVR is as follows:

(P_ν)    min  (1/2) w^T w + C ( νε + (1/l) Σ_{i=1}^{l} (ξ_i + ξ_i*) )                    (1.1)
         s.t. (w^T φ(x_i) + b) − y_i ≤ ε + ξ_i,
              y_i − (w^T φ(x_i) + b) ≤ ε + ξ_i*,
              ξ_i, ξ_i* ≥ 0,  i = 1, ..., l,   ε ≥ 0.

Here 0 ≤ ν ≤ 1, C is the regularization parameter, and the training vectors x_i are mapped into a higher (maybe infinite) dimensional space by the function φ. The ε-insensitive loss function means that if w^T φ(x) + b is in the range y ± ε, no loss is incurred. This formulation is different from the original ε-SVR (Vapnik 1998):

(P_ε)    min  (1/2) w^T w + (C/l) Σ_{i=1}^{l} (ξ_i + ξ_i*)                               (1.2)
         s.t. (w^T φ(x_i) + b) − y_i ≤ ε + ξ_i,
              y_i − (w^T φ(x_i) + b) ≤ ε + ξ_i*,
              ξ_i, ξ_i* ≥ 0,  i = 1, ..., l.

As it is difficult to select an appropriate ε, Schölkopf et al. introduced a new parameter ν which lets one control the number of support vectors and training errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions. The two dual formulations of (P_ν) and (P_ε) are:

    min  (1/2)(α − α*)^T Q (α − α*) + y^T (α − α*)
    s.t. e^T (α − α*) = 0,  e^T (α + α*) ≤ Cν,                                           (1.3)
         0 ≤ α_i, α_i* ≤ C/l,  i = 1, ..., l.

    min  (1/2)(α − α*)^T Q (α − α*) + y^T (α − α*) + ε e^T (α + α*)
    s.t. e^T (α − α*) = 0,                                                               (1.4)
         0 ≤ α_i, α_i* ≤ C/l,  i = 1, ..., l,

where Q_ij ≡ φ(x_i)^T φ(x_j) is the kernel and e is the vector of all ones. The approximating function is

    f(x) = Σ_{i=1}^{l} (α_i* − α_i) φ(x_i)^T φ(x) + b.

For regression, the parameter ν replaces ε, while in the case of classification, ν replaces C. In (Chang and Lin 2001b) we have discussed the relation between ν-SVC and C-SVC as well as how to solve ν-SVC in detail. Here we are interested in

different properties for regression. For example, the relation between ν-SVR and ε-SVR is not the same as that between ν-SVC and C-SVC. In addition, similar to the situation of ν-SVC, we show that the inequality e^T(α + α*) ≤ Cν can be replaced by an equality, so that algorithms for ν-SVC can be applied to ν-SVR. These are the main topics of Sections 2 and 3. In Section 4 we discuss the possible range of ε and show that it might be easier to use ν-SVM. We also demonstrate some situations where the scaling of the target values y is needed. Note that these issues do not occur for classification. Section 5 describes the decomposition algorithm implemented, and Section 6 presents computational experiments, where we discuss some interesting numerical observations specific to support vector regression.
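As a concrete illustration of the two parameterizations, the following is a minimal sketch (not from the original paper) using scikit-learn's LIBSVM-based estimators NuSVR and SVR, which implement ν-SVR and ε-SVR respectively. The dataset, parameter values, and the use of scikit-learn itself are illustrative assumptions; note that scikit-learn's C is the per-point upper bound, i.e. it plays the role of C/l in the notation above.

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.standard_normal(200)

# nu-SVR: nu controls the fraction of support vectors / errors, epsilon is found automatically.
nu = 0.4
nu_model = NuSVR(kernel="rbf", C=1.0, nu=nu).fit(X, y)
frac_sv = len(nu_model.support_) / len(X)
print(f"nu = {nu:.2f}, fraction of support vectors = {frac_sv:.2f}")  # expected to be at least about nu

# epsilon-SVR: the tube width epsilon must be chosen by hand.
eps_model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("epsilon-SVR support vectors:", len(eps_model.support_))
```

Up to the solver's tolerance, the printed fraction of support vectors should not fall below ν, in line with the property quoted above.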

2  The Relation Between ε-SVR and ν-SVR

In this section we derive a relationship between the solution sets of ε-SVR and ν-SVR which allows us to conclude that the inequality constraint in (1.3) can be replaced by an equality. In the dual formulations given earlier, e^T(α + α*) is related to Cν. Similar to (Chang and Lin 2001b), we scale them to the following formulations so that e^T(α + α*) is related to ν:

(D_ν)    min  (1/2)(α − α*)^T Q (α − α*) + (y/C)^T (α − α*)
         s.t. e^T (α − α*) = 0,  e^T (α + α*) ≤ ν,                                       (2.1)
              0 ≤ α_i, α_i* ≤ 1/l,  i = 1, ..., l.

(D_ε)    min  (1/2)(α − α*)^T Q (α − α*) + (y/C)^T (α − α*) + (ε/C) e^T (α + α*)
         s.t. e^T (α − α*) = 0,
              0 ≤ α_i, α_i* ≤ 1/l,  i = 1, ..., l.
For convenience, following (Schölkopf et al. 2000), we use α^(*) to denote the pair (α, α*). Remember that for ν-SVC, not all 0 ≤ ν ≤ 1 lead to meaningful problems (D_ν). Here the situation is similar, so in the following we define a value ν* which will be the upper bound of the interesting interval of ν.

Definition 1  Define ν* ≡ min e^T(α + α*), where the minimum is taken over all optimal solutions α^(*) of (D_ε), ε = 0.


Note that 0 ≤ α_i, α_i* ≤ 1/l implies that the optimal solution set of (D_ν) or (D_ε) is bounded. As their objective and constraint functions are all continuous, any limit point of a sequence in the optimal solution set is in it as well. Hence the optimal solution set of (D_ν) or (D_ε) is closed and bounded (i.e. compact). Using this property, for ε = 0 there is at least one optimal solution which satisfies e^T(α + α*) = ν*.

The following lemma shows that for (D_ε), ε > 0, at an optimal solution one of α_i and α_i* must be zero:

Lemma 1  If ε > 0, all optimal solutions of (D_ε) satisfy α_i α_i* = 0.

Proof. If the result is wrong, we can reduce two nonzero α_i and α_i* by the same amount, so that α_i − α_i* stays the same but the term (ε/C) e^T(α + α*) of the objective function is decreased. Hence α^(*) is not an optimal solution, a contradiction. □

The following lemma is similar to (Chang and Lin 2001b, Lemma 4):

Lemma 2  If α_1^(*) is any optimal solution of (D_{ε_1}), α_2^(*) is any optimal solution of (D_{ε_2}), and 0 ≤ ε_1 < ε_2, then

    e^T(α_1 + α_1*) ≥ e^T(α_2 + α_2*).                                                   (2.2)

Therefore, for any optimal solution α_ε^(*) of (D_ε), ε > 0, we have e^T(α_ε + α_ε*) ≤ ν*.

Unlike the case of classification, where e^T α_C is a well-defined function of C whenever α_C is an optimal solution of the C-SVC problem, here for ε-SVR different optimal solutions of the same (D_ε) may have different e^T(α + α*). The main reason is that −e^T α is the only linear term of the objective function of C-SVC, but for (D_ε) the linear term becomes (y/C)^T(α − α*) + (ε/C) e^T(α + α*). We elaborate more on this in Lemma 4, where we prove that e^T(α + α*) is a well-defined function of ε if Q is a positive definite matrix. The following lemma shows the relation between (D_ν) and (D_ε). In particular, we show that for 0 < ν ≤ ν*, any optimal solution of (D_ν) satisfies e^T(α + α*) = ν.

Lemma 3  For any (D_ν), 0 < ν ≤ ν*, one of the following two situations must happen:

1. (D_ν)'s optimal solution set is part of the solution set of a (D_ε), ε > 0, or
2. (D_ν)'s optimal solution set is the same as that of (D_ε), where ε > 0 is any element of a unique open interval.

In addition, any optimal solution of (D_ν) satisfies e^T(α + α*) = ν and α_i α_i* = 0.

Proof. The Karush-Kuhn-Tucker (KKT) condition of (D_ν) shows that there exist ε ≥ 0 (where ε/C is the multiplier of e^T(α + α*) ≤ ν) and b (the multiplier of e^T(α − α*) = 0) such that, for i = 1, ..., l,

    (Q(α − α*))_i + y_i/C − b + ε/C   is  ≥ 0 if α_i = 0,   = 0 if 0 < α_i < 1/l,   ≤ 0 if α_i = 1/l,
    −(Q(α − α*))_i − y_i/C + b + ε/C  is  ≥ 0 if α_i* = 0,  = 0 if 0 < α_i* < 1/l,  ≤ 0 if α_i* = 1/l,   (2.3)

together with the complementarity condition ε (ν − e^T(α + α*)) = 0.

If ε = 0, these conditions are exactly the optimality conditions of (D_ε) with ε = 0, so α^(*) is an optimal solution of (D_ε), ε = 0. Then e^T(α + α*) ≥ ν* > ν contradicts the feasibility of α^(*) for (D_ν). Therefore ε > 0, so the complementarity condition implies that e^T(α + α*) = ν. Then there are two possible situations:

Case 1: ε is unique. By assigning this ε to (D_ε), all optimal solutions of (D_ν) are KKT points of (D_ε). Hence (D_ν)'s optimal solution set is part of that of a (D_ε). This (D_ε) is unique, as otherwise we could find another ε which satisfies (2.3).

Case 2: ε is not unique; that is, there are ε_1 < ε_2. Suppose ε_1 and ε_2 are the smallest and the largest values satisfying the KKT condition; again, their existence is based on the compactness of the optimal solution set. Let α_1^(*) and α_2^(*) be optimal solutions of (D_ν) which satisfy (2.3) with ε_1 and ε_2, respectively; they are then also optimal solutions of (D_{ε_1}) and (D_{ε_2}), and they satisfy e^T(α_1 + α_1*) = e^T(α_2 + α_2*) = ν. For any ε with ε_1 < ε < ε_2, consider the problem (D_ε). From Lemma 2, since ε_1 < ε < ε_2,

    ν = e^T(α_1 + α_1*) ≥ e^T(α + α*) ≥ e^T(α_2 + α_2*) = ν,

where α^(*) is any optimal solution of (D_ε). Hence all optimal solutions of (D_ε) satisfy e^T(α + α*) = ν as well as all KKT conditions of (D_ν), so (D_ε)'s optimal solution set is contained in that of (D_ν).

Hence (D_ν) and (D_ε) share at least one optimal solution α̂^(*). For any other optimal solution α^(*) of (D_ν), it is feasible for (D_ε). Since

    (1/2)(α̂ − α̂*)^T Q (α̂ − α̂*) + (y/C)^T (α̂ − α̂*) = (1/2)(α − α*)^T Q (α − α*) + (y/C)^T (α − α*)

and e^T(α + α*) ≤ ν = e^T(α̂ + α̂*), we have

    (1/2)(α − α*)^T Q (α − α*) + (y/C)^T (α − α*) + (ε/C) e^T(α + α*)
        ≤ (1/2)(α̂ − α̂*)^T Q (α̂ − α̂*) + (y/C)^T (α̂ − α̂*) + (ε/C) e^T(α̂ + α̂*).

Therefore, all optimal solutions of (D_ν) are also optimal for (D_ε). Hence (D_ν)'s optimal solution set is the same as that of (D_ε), where ε > 0 is any element of the unique open interval (ε_1, ε_2).

Finally, as (D_ν)'s optimal solution set is the same as or part of that of a (D_ε), ε > 0, Lemma 1 gives α_i α_i* = 0. □

Using the above results, we now summarize the main theorem:

Theorem 1  We have
1. ν* ≤ 1.
2. For any ν ∈ [ν*, 1], (D_ν) has the same optimal objective value as (D_ε), ε = 0.
3. For any ν ∈ (0, ν*), Lemma 3 holds. That is, one of the following two situations must happen:
   (a) (D_ν)'s optimal solution set is part of the solution set of a (D_ε), ε > 0, or
   (b) (D_ν)'s optimal solution set is the same as that of (D_ε), where ε > 0 is any element of a unique open interval.
4. For all (D_ν), 0 ≤ ν ≤ 1, there are always optimal solutions at which the equality e^T(α + α*) = ν holds.

Proof. From the explanation after Definition 1, there exists an optimal solution α^(*) of (D_ε), ε = 0, which satisfies e^T(α + α*) = ν*. This α^(*) must satisfy α_i α_i* = 0, i = 1, ..., l, as otherwise α_i and α_i* could be reduced by the same amount, keeping the objective value but decreasing e^T(α + α*). With 0 ≤ α_i, α_i* ≤ 1/l this gives ν* ≤ 1, so the first result holds. In addition, this α^(*) is also feasible for (D_ν), ν ≥ ν*. Since (D_ν) has the same objective function as (D_ε), ε = 0, but one more constraint, this solution of (D_ε), ε = 0, is also optimal for (D_ν). Hence (D_ν) and (D_ε), ε = 0, have the same optimal objective value.

For 0 < ν ≤ ν*, we already know from Lemma 3 that optimal solutions occur only at the equality. For ν* ≤ ν ≤ 1, we know that (D_ε), ε = 0, has an optimal solution α^(*) which satisfies e^T(α + α*) = ν*. We can then increase some elements of α and α* such that the new vectors ᾱ^(*) satisfy e^T(ᾱ + ᾱ*) = ν but ᾱ − ᾱ* = α − α*. Hence ᾱ^(*) is an optimal solution of (D_ν) which satisfies the equality constraint. □

Therefore, the above results make sure that it is safe to solve the following problem instead of (D_ν):

(D̂_ν)   min  (1/2)(α − α*)^T Q (α − α*) + (y/C)^T (α − α*)
         s.t. e^T (α − α*) = 0,  e^T (α + α*) = ν,
              0 ≤ α_i, α_i* ≤ 1/l,  i = 1, ..., l.

This is important, as for existing SVM algorithms it is easier to handle equalities than inequalities. Note that for ν-SVC there is also a ν* such that for ν ∈ (ν*, 1], (D_ν) is infeasible; there ν* = 2 min(#positive data, #negative data)/l can be easily calculated (Crisp and Burges 2000). For ν-SVR, on the other hand, it is difficult to know ν* a priori. However, we do not have to worry about this: if a (D̂_ν) with ν > ν* is solved, a solution with objective value equal to that of (D_ε), ε = 0, is obtained; then some α_i and α_i* may both be nonzero.
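To make the equality-constrained form (D̂_ν) concrete, the following is a minimal sketch (not from the paper) of solving it directly as a quadratic program with cvxpy. The use of cvxpy, the random data, and the tiny ridge added to the kernel matrix (for numerical robustness of the Cholesky factorization only) are assumptions of this sketch, not part of the original algorithm.

```python
import numpy as np
import cvxpy as cp

def solve_scaled_nu_dual(K, y, C, nu):
    """Sketch of (D-hat_nu): min 1/2 (a-a*)^T K (a-a*) + (y/C)^T (a-a*)
    s.t. e^T(a-a*) = 0, e^T(a+a*) = nu, 0 <= a_i, a*_i <= 1/l."""
    l = len(y)
    a = cp.Variable(l, nonneg=True)
    a_star = cp.Variable(l, nonneg=True)
    d = a - a_star
    # Express the quadratic term via a Cholesky factor of K (ridge only guards
    # against numerical semi-definiteness).
    L = np.linalg.cholesky(K + 1e-10 * np.eye(l))
    objective = 0.5 * cp.sum_squares(L.T @ d) + (y / C) @ d
    constraints = [cp.sum(d) == 0,
                   cp.sum(a + a_star) == nu,
                   a <= 1.0 / l,
                   a_star <= 1.0 / l]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return a.value, a_star.value

# Illustrative usage with the RBF kernel Q_ij = exp(-||xi - xj||^2 / n) used in Section 6.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = np.sin(X[:, 0])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / X.shape[1])
alpha, alpha_star = solve_scaled_nu_dual(K, y, C=1.0, nu=0.5)
print((alpha + alpha_star).sum(), (alpha - alpha_star).sum())  # approx. nu and 0
```

This is only a reference formulation; the decomposition method of Section 5 solves the same problem without forming the full kernel matrix.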

Since there are always optimal solutions of the dual problem which satisfy e^T(α + α*) = ν, this also implies that the ε ≥ 0 constraint of (P_ν) is not necessary. In the following theorem, we derive the same result directly from the primal side:

Theorem 2  Consider a problem which is the same as (P_ν) but without the inequality constraint ε ≥ 0. For any 0 < ν < 1, any optimal solution of this modified problem must satisfy ε ≥ 0.

Proof. Assume (w, b, ξ, ξ*, ε) is an optimal solution with ε < 0. Then for each i,

    −ε − ξ_i* ≤ w^T φ(x_i) + b − y_i ≤ ε + ξ_i,                                          (2.4)

which implies

    ξ_i + ξ_i* + 2ε ≥ 0.                                                                 (2.5)

With (2.4),

    −0 − max(0, ξ_i* + ε) ≤ −ε − ξ_i* ≤ w^T φ(x_i) + b − y_i ≤ ε + ξ_i ≤ 0 + max(0, ξ_i + ε).

Hence (w, b, max(0, ξ + εe), max(0, ξ* + εe), 0) is a feasible solution of (P_ν). From (2.5) and ε < 0,

    max(0, ξ_i + ε) + max(0, ξ_i* + ε) ≤ ξ_i + ξ_i* + ε.

Therefore, with ε < 0 and 0 < ν < 1,

    (1/2) w^T w + C ( νε + (1/l) Σ_{i=1}^{l} (ξ_i + ξ_i*) )
  > (1/2) w^T w + (C/l) ( lε + Σ_{i=1}^{l} (ξ_i + ξ_i*) )
  ≥ (1/2) w^T w + (C/l) Σ_{i=1}^{l} ( max(0, ξ_i + ε) + max(0, ξ_i* + ε) ),

which implies that (w, b, ξ, ξ*, ε) is not an optimal solution. Therefore, any optimal solution must satisfy ε ≥ 0. □

Next we demonstrate an example where one (D_ε) corresponds to many (D_ν). Take two training points x_1 = 0, x_2 = 0, with target values y_1 = −Δ < 0 and y_2 = Δ > 0, and consider ε = Δ. If the linear kernel is used and C = 1, (D_ε) becomes

    min  2ε (α_1* + α_2)
    s.t. 0 ≤ α_1, α_1*, α_2, α_2* ≤ 1/l,
         α_1 − α_1* + α_2 − α_2* = 0.

Thus α_1* = α_2 = 0, and any 0 ≤ α_1 = α_2* ≤ 1/l is an optimal solution. Therefore, for this ε, the possible values of e^T(α + α*) range from 0 to 1. The relation between ε and ν is illustrated in Figure 1.

[Figure 1: An example where one (D_ε) corresponds to different (D_ν) (ν plotted against ε).]

3  When the Kernel Matrix Q is Positive Definite


In the previous section we have shown that for ε-SVR, e^T(α + α*) may not be a well-defined function of ε, where α^(*) is any optimal solution of (D_ε). Because of this difficulty, we cannot directly apply the results on the relation between C-SVC and ν-SVC to ε-SVR and ν-SVR. In this section we show that if Q is positive definite, then e^T(α + α*) is a function of ε and all results discussed in (Chang and Lin 2001b) hold.

Assumption 1  Q is positive definite.

Lemma 4  If ε > 0, then (D_ε) has a unique optimal solution. Therefore, we can define a function e^T(α_ε + α_ε*) of ε, where α_ε^(*) is the optimal solution of (D_ε).

Proof. Since (D_ε) is a convex problem, if α_1^(*) and α_2^(*) are both optimal solutions, then for all 0 ≤ λ ≤ 1 the convex combination λα_1^(*) + (1 − λ)α_2^(*) is optimal as well, so

    (1/2)(λ(α_1 − α_1*) + (1 − λ)(α_2 − α_2*))^T Q (λ(α_1 − α_1*) + (1 − λ)(α_2 − α_2*))
      + (y/C)^T (λ(α_1 − α_1*) + (1 − λ)(α_2 − α_2*)) + (ε/C) e^T (λ(α_1 + α_1*) + (1 − λ)(α_2 + α_2*))
  = λ [ (1/2)(α_1 − α_1*)^T Q (α_1 − α_1*) + (y/C)^T (α_1 − α_1*) + (ε/C) e^T (α_1 + α_1*) ]
      + (1 − λ) [ (1/2)(α_2 − α_2*)^T Q (α_2 − α_2*) + (y/C)^T (α_2 − α_2*) + (ε/C) e^T (α_2 + α_2*) ].

This implies

    (α_1 − α_1*)^T Q (α_2 − α_2*) = (1/2)(α_1 − α_1*)^T Q (α_1 − α_1*) + (1/2)(α_2 − α_2*)^T Q (α_2 − α_2*).   (3.1)

Since Q is positive semidefinite, we can write Q = L^T L, so (3.1) implies ‖L(α_1 − α_1*) − L(α_2 − α_2*)‖² = 0. Hence L(α_1 − α_1*) = L(α_2 − α_2*). Since Q is positive definite, L is invertible, so α_1 − α_1* = α_2 − α_2*. Since ε > 0, Lemma 1 gives (α_1)_i (α_1*)_i = 0 and (α_2)_i (α_2*)_i = 0. Thus α_1 = α_2 and α_1* = α_2*. □

For convex optimization problems, if the Hessian is positive definite, there is a unique optimal solution. Unfortunately, here the Hessian is [Q −Q; −Q Q], which is only positive semidefinite. Hence special effort is needed to prove the uniqueness of the optimal solution.

Note that in the above proof, L(α_1 − α_1*) = L(α_2 − α_2*) implies (α_1 − α_1*)^T Q (α_1 − α_1*) = (α_2 − α_2*)^T Q (α_2 − α_2*). Since α_1^(*) and α_2^(*) are both optimal solutions, they have the same objective value, so y^T(α_1 − α_1*) + ε e^T(α_1 + α_1*) = y^T(α_2 − α_2*) + ε e^T(α_2 + α_2*). This is not enough to prove that e^T(α_1 + α_1*) = e^T(α_2 + α_2*). On the contrary, for C-SVC the objective function is (1/2) α^T Q α − e^T α, so without the positive definiteness assumption, α_1^T Q α_1 = α_2^T Q α_2 already implies e^T α_1 = e^T α_2. Thus e^T α_C is a function of C.

We then state some parallel results from (Chang and Lin 2001b) without proofs:

Theorem 3  If Q is positive definite, the relation between (D_ν) and (D_ε) is summarized as follows:

1. (a) For any ν ∈ [ν*, 1], (D_ν) has the same optimal objective value as (D_ε), ε = 0.
   (b) For any ν ∈ (0, ν*), (D_ν) has a unique optimal solution, which is the same as that of either one (D_ε), ε > 0, or of some (D_ε), where ε is any number in an interval.
2. If α_ε^(*) is the optimal solution of (D_ε), ε > 0, the relation between e^T(α_ε + α_ε*) and ε is as follows: there are 0 < ε_1 < ··· < ε_s and A_i, B_i such that

    e^T(α_ε + α_ε*) =  ν*            if 0 < ε ≤ ε_1,
                       A_i + B_i ε   if ε_i ≤ ε ≤ ε_{i+1}, i = 1, ..., s − 1,
                       0             if ε_s ≤ ε,

where α_ε^(*) is the optimal solution of (D_ε). We also have

    A_i + B_i ε_{i+1} = A_{i+1} + B_{i+1} ε_{i+1},  i = 1, ..., s − 2,                    (3.2)

and A_{s−1} + B_{s−1} ε_s = 0. In addition, B_i ≤ 0, i = 1, ..., s − 1.

The second result of Theorem 3 shows that e^T(α_ε + α_ε*) is a piecewise linear function of ε. In addition, it is always decreasing.
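Assumption 1 is easy to sanity-check numerically for the RBF kernel used later in the experiments. The following is a small sketch (not from the paper, random data assumed): on distinct points the smallest eigenvalue of Q should be strictly positive, up to floating-point accuracy.

```python
import numpy as np

# Numerical check of Assumption 1 for Q_ij = exp(-||xi - xj||^2 / n).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))                      # 50 distinct points, n = 4 attributes
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-sq / X.shape[1])
print(np.linalg.eigvalsh(Q).min())                    # expected to be > 0
```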

4  Some Issues Specific to Regression

The motivation of ν-SVR is that it may not be easy to decide the parameter ε. Hence here we are interested in the possible range of ε. As expected, the results show that ε is related to the target values y.

Theorem 4  The zero vector is an optimal solution of (D_ε) if and only if

    ε ≥ ( max_{i=1,...,l} y_i − min_{i=1,...,l} y_i ) / 2.                               (4.1)

Proof. If the zero vector is an optimal solution of (D_ε), the KKT condition implies that there is a b such that

    [y/C; −y/C] + (ε/C)[e; e] − b[e; −e] ≥ 0.

Hence ε/C − b ≥ −y_i/C and ε/C + b ≥ y_i/C for all i. Therefore,

    ε/C − b ≥ −( min_{i=1,...,l} y_i )/C   and   ε/C + b ≥ ( max_{i=1,...,l} y_i )/C,

so, adding the two inequalities, ε ≥ ( max_{i=1,...,l} y_i − min_{i=1,...,l} y_i ) / 2.

On the other hand, if (4.1) is true, we can easily check that α = α* = 0 satisfies the KKT condition, so the zero vector is an optimal solution of (D_ε). □

Therefore, when using ε-SVR, the largest value of ε worth trying is (max_{i=1,...,l} y_i − min_{i=1,...,l} y_i)/2.

On the other hand, ε should not be too small: if ε ≈ 0, most data points become support vectors and overfitting tends to happen. Unfortunately, we have not been able to find an effective lower bound on ε; intuitively, however, it should also be related to the target values y. As the effective range of ε is affected by the target values y, a way to deal with this difficulty for ε-SVM is to scale the target values before training. For example, if all target values are scaled to [−1, +1], then the effective range of ε is [0, 1], the same as that of ν. It may then be easier to choose ε. There are other reasons to scale the target values. For example, we encountered situations where, if the target values y are not properly scaled, it is difficult to adjust the value of C. In particular, if the y_i, i = 1, ..., l, are large numbers and C is chosen to be a small number, the approximating function is nearly a constant.
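The two practical points above (the upper bound (4.1) on ε and the value of scaling y) can be checked with a short sketch. This is not from the paper; it assumes scikit-learn's SVR and MinMaxScaler, deliberately unscaled synthetic targets, and the expectation that the LIBSVM solver recognizes the zero vector as optimal when ε reaches the bound of Theorem 4.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 50.0 + 10.0 * np.sin(3 * X).ravel()               # deliberately unscaled targets

eps_max = (y.max() - y.min()) / 2                      # upper bound from Theorem 4, (4.1)
model = SVR(kernel="rbf", C=1.0, epsilon=eps_max).fit(X, y)
print(len(model.support_))                             # expect 0 (or very few) support vectors
print(np.ptp(model.predict(X)))                        # prediction is (nearly) a constant

# Scaling the targets to [-1, +1] makes the effective range of epsilon [0, 1].
y_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(y.reshape(-1, 1)).ravel()
model2 = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y_scaled)
print(len(model2.support_))
```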

5  Algorithms

The algorithm considered here for ν-SVR is similar to the decomposition method in (Chang and Lin 2001b) for ν-SVC. The implementation is part of the software LIBSVM (Chang and Lin 2001a). Another SVM software which has also implemented ν-SVR is mySVM (Rüping 2000). The basic idea of the decomposition method is that in each iteration the indices of the variables are separated into two sets B and N, where B is the working set and N is its complement. The variables indexed by N are fixed and a sub-problem in the variables indexed by B is solved. The decomposition method was first proposed for SVM classification (Osuna et al. 1997; Joachims 1998; Platt 1998). Extensions to ε-SVR appear in, for example, (Keerthi et al. 2000; Laskov 2002). The main difference among these methods is the working set selection, which may significantly affect the number of iterations. Due to the additional equality constraint in (1.3) for the ν-SVM, more care is needed in the working set selection. Discussions for classification are in (Keerthi and Gilbert 2002; Chang and Lin 2001b).

For consistency with other SVM formulations in LIBSVM, we consider (D̂_ν) in the following scaled form:

    min  (1/2) ᾱ^T Q̄ ᾱ + p̄^T ᾱ
    s.t. ȳ^T ᾱ = Δ_1,
         ē^T ᾱ = Δ_2,                                                                     (5.1)
         0 ≤ ᾱ_t ≤ C,  t = 1, ..., 2l,

where

    Q̄ = [Q −Q; −Q Q],  ᾱ = [α; α*],  p̄ = [y; −y],  ȳ = [e; −e],  ē = [e; e],  Δ_1 = 0,  Δ_2 = Cνl.

That is, we replace C/l by C. Note that, because of the result in Theorem 1, we are safe to use an equality constraint in (5.1). The sub-problem is then

    min_{ᾱ_B}  (1/2) ᾱ_B^T Q̄_BB ᾱ_B + (p̄_B + Q̄_BN ᾱ_N^k)^T ᾱ_B
    s.t. ȳ_B^T ᾱ_B = Δ_1 − ȳ_N^T ᾱ_N^k,
         ē_B^T ᾱ_B = Δ_2 − ē_N^T ᾱ_N^k,                                                   (5.2)
         0 ≤ (ᾱ_B)_t ≤ C,  t = 1, ..., q,

where q is the size of the working set. Following the idea of Sequential Minimal Optimization (SMO) by Platt (1998), we use only two elements as the working set in each iteration. The main advantage is that an analytic solution of (5.2) can be obtained, so there is no need to use optimization software. Our working set selection follows (Chang and Lin 2001b), which is a modification of the selection in the software SVMlight (Joachims 1998). Since they dealt with the case of more general selections, where the size is not restricted to two, here we have a simpler derivation directly using the KKT condition. It is similar to that in (Keerthi and Gilbert 2002, Section 5).

If two elements i and j with ȳ_i ≠ ȳ_j are selected, then ȳ_B^T ᾱ_B = Δ_1 − ȳ_N^T ᾱ_N^k and ē_B^T ᾱ_B = Δ_2 − ē_N^T ᾱ_N^k are two equations in two variables, so in general (5.2) has only one feasible point: starting from ᾱ^k, the solution of the k-th iteration, it cannot be moved any more. On the other hand, if ȳ_i = ȳ_j, the two equalities become the same, so there are multiple feasible solutions. Therefore, we must keep ȳ_i = ȳ_j while selecting the working set.

The KKT condition of (5.1) shows that there are ρ and b such that

    ∇f(ᾱ)_i + b ȳ_i − ρ   is  = 0 if 0 < ᾱ_i < C,   ≥ 0 if ᾱ_i = 0,   ≤ 0 if ᾱ_i = C,      (5.3)

where f(ᾱ) ≡ (1/2) ᾱ^T Q̄ ᾱ + p̄^T ᾱ. Define r_1 ≡ ρ − b and r_2 ≡ ρ + b. If ȳ_i = 1, the KKT condition becomes

    ∇f(ᾱ)_i − r_1   is  ≥ 0 if ᾱ_i < C,   ≤ 0 if ᾱ_i > 0.                                  (5.4)

On the other hand, if ȳ_i = −1, it is

    ∇f(ᾱ)_i − r_2   is  ≥ 0 if ᾱ_i < C,   ≤ 0 if ᾱ_i > 0.

Hence, indices i and j are selected from either

    i = argmin_t { ∇f(ᾱ)_t | ȳ_t = 1, ᾱ_t < C },   j = argmax_t { ∇f(ᾱ)_t | ȳ_t = 1, ᾱ_t > 0 },      (5.5)

or

    i = argmin_t { ∇f(ᾱ)_t | ȳ_t = −1, ᾱ_t < C },  j = argmax_t { ∇f(ᾱ)_t | ȳ_t = −1, ᾱ_t > 0 },     (5.6)

depending on which one gives a larger ∇f(ᾱ)_j − ∇f(ᾱ)_i (i.e. a larger KKT violation). If the selected ∇f(ᾱ)_j − ∇f(ᾱ)_i is smaller than a given tolerance (10^−3 in our experiments), the algorithm stops.

Similar to the case of ν-SVC, the zero vector cannot be the initial solution here, because of the additional equality constraint ē^T ᾱ = Δ_2 in (5.1). We assign the initial α and α* the same values: the first ⌊νl/2⌋ components equal C, the next component equals C(νl/2 − ⌊νl/2⌋), and the remaining components are zero.

Table 1: Solving ν-SVR and ε-SVR with C = 1 (time in seconds).

| Problem | l | ν | ε | ν Iter. | ε Iter. | ν Time | ε Time | ν* |
| pyrimidines | 74 | 0.2 | 0.135131 | 181 | 145 | 0.03 | 0.02 | 0.817868 |
| | | 0.4 | 0.064666 | 175 | 156 | 0.03 | 0.04 | |
| | | 0.6 | 0.028517 | 365 | 331 | 0.04 | 0.03 | |
| | | 0.8 | 0.002164 | 695 | 460 | 0.05 | 0.05 | |
| mpg | 392 | 0.2 | 0.152014 | 988 | 862 | 0.19 | 0.16 | 0.961858 |
| | | 0.4 | 0.090124 | 1753 | 1444 | 0.32 | 0.27 | |
| | | 0.6 | 0.048543 | 2115 | 1847 | 0.40 | 0.34 | |
| | | 0.8 | 0.020783 | 3046 | 2595 | 0.56 | 0.51 | |
| bodyfat | 252 | 0.2 | 0.012700 | 1112 | 1047 | 0.14 | 0.13 | 0.899957 |
| | | 0.4 | 0.006332 | 2318 | 2117 | 0.25 | 0.23 | |
| | | 0.6 | 0.002898 | 3553 | 2857 | 0.37 | 0.31 | |
| | | 0.8 | 0.001088 | 4966 | 3819 | 0.48 | 0.42 | |
| housing | 506 | 0.2 | 0.161529 | 799 | 1231 | 0.30 | 0.34 | 0.946593 |
| | | 0.4 | 0.089703 | 1693 | 1650 | 0.53 | 0.45 | |
| | | 0.6 | 0.046269 | 1759 | 2002 | 0.63 | 0.60 | |
| | | 0.8 | 0.018860 | 2700 | 2082 | 0.85 | 0.65 | |
| triazines | 186 | 0.2 | 0.380308 | 175 | 116 | 0.13 | 0.10 | 0.900243 |
| | | 0.4 | 0.194967 | 483 | 325 | 0.18 | 0.15 | |
| | | 0.6 | 0.096720 | 422 | 427 | 0.20 | 0.18 | |
| | | 0.8 | 0.033753 | 532 | 513 | 0.23 | 0.23 | |
| mg | 1385 | 0.2 | 0.366606 | 1928 | 1542 | 1.58 | 1.18 | 0.992017 |
| | | 0.4 | 0.216329 | 3268 | 3294 | 2.75 | 2.35 | |
| | | 0.6 | 0.124792 | 3400 | 3300 | 3.36 | 2.76 | |
| | | 0.8 | 0.059115 | 4516 | 4296 | 4.24 | 3.65 | |
| abalone | 4177 | 0.2 | 0.168812 | 4189 | 3713 | 15.68 | 11.69 | 0.994775 |
| | | 0.4 | 0.094959 | 8257 | 7113 | 30.38 | 22.88 | |
| | | 0.6 | 0.055966 | 12483 | 12984 | 42.74 | 37.41 | |
| | | 0.8 | 0.026165 | 18302 | 18277 | 65.98 | 54.04 | |
| space ga | 3107 | 0.2 | 0.087070 | 5020 | 4403 | 10.47 | 7.56 | 0.990468 |
| | | 0.4 | 0.053287 | 8969 | 7731 | 18.70 | 14.44 | |
| | | 0.6 | 0.032080 | 12261 | 10704 | 26.27 | 20.72 | |
| | | 0.8 | 0.014410 | 16311 | 13852 | 32.71 | 27.19 | |
| cpusmall | 8192 | 0.2 | 0.086285 | 8028 | 7422 | 82.66 | 59.14 | 0.990877 |
| | | 0.4 | 0.054095 | 16585 | 15240 | 203.20 | 120.48 | |
| | | 0.6 | 0.031285 | 22376 | 19126 | 283.71 | 163.96 | |
| | | 0.8 | 0.013842 | 28262 | 24840 | 355.59 | 213.25 | |
| cadata | 20640 | 0.2 | 0.294803 | 12153 | 10961 | 575.11 | 294.53 | 0.997099 |
| | | 0.4 | 0.168370 | 24614 | 20968 | 1096.87 | 574.77 | |
| | | 0.6 | 0.097434 | 35161 | 30477 | 1530.01 | 851.91 | |
| | | 0.8 | 0.044636 | 42709 | 40652 | 1883.35 | 1142.27 | |

Table 2: Solving ν-SVR and ε-SVR with C = 100 (time in seconds). ‡: experiments where ν ≥ ν*.

| Problem | l | ν | ε | ν Iter. | ε Iter. | ν Time | ε Time | ν* |
| pyrimidines | 74 | 0.2 ‡ | 0.000554 | 29758 | 11978 | 0.63 | 0.27 | 0.191361 |
| | | 0.4 ‡ | 0.000317 | 30772 | 11724 | 0.65 | 0.27 | |
| | | 0.6 ‡ | 0.000240 | 27270 | 11802 | 0.58 | 0.27 | |
| | | 0.8 ‡ | 0.000146 | 20251 | 12014 | 0.44 | 0.28 | |
| mpg | 392 | 0.2 | 0.121366 | 85120 | 74878 | 9.53 | 8.26 | 0.876646 |
| | | 0.4 | 0.069775 | 210710 | 167719 | 24.50 | 19.32 | |
| | | 0.6 | 0.032716 | 347777 | 292426 | 42.08 | 34.82 | |
| | | 0.8 | 0.007953 | 383164 | 332725 | 47.61 | 40.86 | |
| bodyfat | 252 | 0.2 | 0.001848 | 238927 | 164218 | 16.80 | 11.58 | 0.368736 |
| | | 0.4 ‡ | 0.000486 | 711157 | 323016 | 50.77 | 23.24 | |
| | | 0.6 ‡ | 0.000291 | 644602 | 339569 | 46.23 | 24.33 | |
| | | 0.8 ‡ | 0.000131 | 517370 | 356316 | 37.28 | 25.55 | |
| housing | 506 | 0.2 | 0.092998 | 154565 | 108220 | 24.21 | 16.87 | 0.815085 |
| | | 0.4 | 0.051726 | 186136 | 182889 | 30.49 | 29.51 | |
| | | 0.6 | 0.026340 | 285354 | 271278 | 48.62 | 45.64 | |
| | | 0.8 | 0.002161 | 397115 | 284253 | 69.16 | 49.12 | |
| triazines | 186 | 0.2 | 0.193718 | 16607 | 22651 | 0.94 | 1.20 | 0.582147 |
| | | 0.4 | 0.074474 | 34034 | 47205 | 1.89 | 2.52 | |
| | | 0.6 ‡ | 0.000381 | 106621 | 51175 | 5.69 | 2.84 | |
| | | 0.8 ‡ | 0.000139 | 68553 | 50786 | 3.73 | 2.81 | |
| mg | 1385 | 0.2 | 0.325659 | 190065 | 195519 | 87.99 | 89.20 | 0.966793 |
| | | 0.4 | 0.189377 | 291315 | 299541 | 139.10 | 141.73 | |
| | | 0.6 | 0.107324 | 397449 | 407159 | 194.81 | 196.14 | |
| | | 0.8 | 0.043439 | 486656 | 543520 | 241.20 | 265.27 | |
| abalone | 4177 | 0.2 | 0.162593 | 465922 | 343594 | 797.48 | 588.92 | 0.988298 |
| | | 0.4 | 0.091815 | 901275 | 829951 | 1577.83 | 1449.37 | |
| | | 0.6 | 0.053244 | 1212669 | 1356556 | 2193.97 | 2506.52 | |
| | | 0.8 | 0.024670 | 1680704 | 1632597 | 2970.98 | 2987.30 | |
| space ga | 3107 | 0.2 | 0.078294 | 510035 | 444455 | 595.42 | 508.41 | 0.984568 |
| | | 0.4 | 0.048643 | 846873 | 738805 | 1011.82 | 867.32 | |
| | | 0.6 | 0.028933 | 1097732 | 1054464 | 1362.67 | 1268.40 | |
| | | 0.8 | 0.013855 | 1374987 | 1393044 | 1778.38 | 1751.39 | |
| cpusmall | 8192 | 0.2 | 0.070568 | 977374 | 863579 | 4304.42 | 3606.35 | 0.978351 |
| | | 0.4 | 0.041640 | 1783725 | 1652396 | 8291.12 | 7014.32 | |
| | | 0.6 | 0.022280 | 2553150 | 2363251 | 11673.62 | 10691.95 | |
| | | 0.8 | 0.009616 | 3085005 | 2912838 | 14784.05 | 12737.35 | |
| cadata | 20640 | 0.2 | 0.263428 | 1085719 | 1081038 | 16003.55 | 15475.36 | 0.995602 |
| | | 0.4 | 0.151341 | 2135097 | 2167643 | 31936.05 | 31474.21 | |
| | | 0.6 | 0.087921 | 2813070 | 2614179 | 42983.89 | 38580.61 | |
| | | 0.8 | 0.039595 | 3599953 | 3379580 | 54917.10 | 49754.27 | |

It has been proved that if the decomposition method of LIBSVM is used for solving (D_ε), ε > 0, then α_i α_i* = 0 always holds during the iterations (Lin 2001, Theorem 4.1). For ν-SVR we do not have this property, as α_i and α_i* may both be nonzero during the iterations.

Next we discuss how to find ν*. We claim that if Q is positive definite and (α, α*) is any optimal solution of (D_ε), ε = 0, then

    ν* = Σ_{i=1}^{l} |α_i − α_i*|.

Note that by defining β ≡ α − α*, the problem (D_ε), ε = 0, is equivalent to

    min  (1/2) β^T Q β + (y/C)^T β
    s.t. e^T β = 0,
         −1/l ≤ β_i ≤ 1/l,  i = 1, ..., l.

When Q is positive definite, this is a strictly convex programming problem, so there is a unique optimal solution β. That is, we have a unique α − α*, but there may be multiple optimal (α, α*). With the conditions 0 ≤ α_i, α_i* ≤ 1/l, computing Σ|α_i − α_i*| amounts to reducing α_i and α_i* equally until one of them becomes zero; thus |α_i − α_i*| is the smallest possible α_i + α_i* for a fixed α_i − α_i*, and the minimum of e^T(α + α*) over optimal solutions is Σ|α_i − α_i*|. In the next section we use the RBF kernel, so if no data points coincide, Q is positive definite.
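The formula above suggests a simple way to estimate ν* with an off-the-shelf ε-SVR solver. The following sketch (not from the paper) assumes scikit-learn's SVR; note that scikit-learn's per-point bound C plays the role of C/l in this paper, so the scaled dual variables are dual_coef_/(C·l), and the result is only an estimate up to the solver's tolerance.

```python
import numpy as np
from sklearn.svm import SVR

def estimate_nu_star(X, y, C_per_point=1.0, gamma=None):
    """Estimate nu* = sum_i |alpha_i - alpha_i*| of the scaled dual by solving
    epsilon-SVR with epsilon = 0 and rescaling the dual coefficients."""
    l = len(y)
    gamma = gamma if gamma is not None else 1.0 / X.shape[1]   # exp(-||xi - xj||^2 / n)
    model = SVR(kernel="rbf", C=C_per_point, gamma=gamma, epsilon=0.0).fit(X, y)
    # dual_coef_ holds alpha_i - alpha_i* for the support vectors.
    return np.abs(model.dual_coef_).sum() / (C_per_point * l)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(150)
print(estimate_nu_star(X, y))   # any nu above this value essentially behaves like epsilon = 0
```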

6  Experiments

In this section we present some numerical comparisons between ν-SVR and ε-SVR. We use the RBF kernel Q_ij = e^{−‖x_i − x_j‖²/n}, where n is the number of attributes of the training data. The computational experiments for this section were done on a Pentium III-500 with 256MB RAM using the gcc compiler. Our implementation is part of the software LIBSVM, which includes both ν-SVR and ε-SVR using the decomposition method. We used 100MB as the cache size of LIBSVM for storing recently used Q_ij. The shrinking heuristic of LIBSVM is turned off for an easier comparison.

We test problems from various collections. Problems housing, abalone, mpg, pyrimidines, and triazines are from the Statlog collection (Michie et al. 1994). From StatLib (http://lib.stat.cmu.edu/datasets) we select bodyfat, space ga, and cadata. Problem cpusmall is from the Delve archive, which collects data for evaluating learning in valid experiments (http://www.cs.toronto.edu/~delve). Problem mg is a Mackey-Glass time series where we use the same settings as the experiments in (Flake and Lawrence 2002); thus we predict 85 time steps into the future with six inputs. Some data entries have missing attributes, so we remove them before conducting the experiments. Both the target and attribute values of these problems are scaled to [−1, +1]; hence the effective range of ε is [0, 1].

For each problem, we first solve its (D_ν) form using ν = 0.2, 0.4, 0.6, and 0.8. We then solve (D_ε) with ε equal to the value obtained from ν-SVR, for comparison. Tables 1 and 2 present the number of training data (l), the number of iterations (ν Iter. and ε Iter.), and the training time (ν Time and ε Time), using C = 1 and C = 100, respectively. In the last column we also list the ν* of each problem. From both tables we have the following observations:

1. Following the theoretical results, we indeed see that as ν increases, its corresponding ε decreases.

2. If ν ≤ ν*, as ν increases, the number of iterations of ν-SVR and of its corresponding ε-SVR increases. Note that the case ν ≤ ν* covers all results in Table 1 and most of Table 2. Our explanation is as follows: when ν is larger, there are more support vectors, so during the iterations the number of non-zero variables is also larger. In (Hsu and Lin 2002) it has been pointed out that if during iterations there are more non-zero variables than at the optimum, the decomposition method takes many iterations to reach the final face; here a face means the sub-space obtained by considering only the free variables. An example is in Figure 2, where we plot the number of free variables against the number of iterations. To be more precise, the y-axis is the number of indices with 0 < α_i < C or 0 < α_i* < C, where (α, α*) is the solution at that iteration. We can see that, whether solving ν-SVR or ε-SVR, it takes many iterations to identify the optimal face. From the viewpoint of ε-SVR, we can consider ε e^T(α + α*) as a penalty term

[Figure 2: Iterations and number of free variables (#FSV) for problem mpg, C = 100; (a) ν-SVR with ν = 0.4, 0.6, (b) ε-SVR with the corresponding ε.]


[Figure 3: Iterations and number of free variables (#FSV) for problem triazines, C = 100; ν = 0.6, 0.8 and ε = 0.000139.]

in the objective function of (D_ε); hence when ε is larger, fewer α_i, α_i* are non-zero, that is, the number of support vectors is smaller.

3. For a few problems (e.g. pyrimidines, bodyfat, and triazines), ν ≥ ν* is encountered. When this happens, their ε should be zero, but due to numerical inaccuracy the output ε are small positive numbers. Then, for different ν, solving the corresponding (D_ε) takes about the same number of iterations, as essentially we solve the same problem: (D_ε) with ε = 0. On the other hand, surprisingly, in this situation (D_ν) becomes easier to solve as ν increases, with fewer iterations. Its solution is now optimal for (D_ε), ε = 0, but the larger ν is, the more pairs (α_i, α_i*) are both non-zero. Therefore, contrary to the general case, where it is difficult to identify free variables during iterations and move them back to the bounds at the optimum, there is now no strong need to do so. To be more precise, in the beginning of the decomposition method many variables become nonzero as we modify them to minimize the objective function; if in the end most of these variables remain nonzero, we do not need the effort of putting them back to the bounds. In Figure 3 we again plot the number of free variables against the number of iterations, using problem triazines with ν = 0.6, 0.8 and ε = 0.000139 ≈ 0. It can clearly be seen that for large ν, the decomposition method identifies the optimal face more quickly, so the total number of iterations is smaller.

4. When ν ≤ ν*, we observe only minor differences in the number of iterations of ν-SVR and ε-SVR. In Table 1, for nearly all problems ν-SVR takes a few more iterations than ε-SVR. However, in Table 2, for problems triazines and mg, ν-SVR is slightly faster. Note that there are several dissimilarities between the algorithms for ε-SVR and ν-SVR. For example, ε-SVR generally starts from the zero vector, but ν-SVR has to use a nonzero initial solution. For the working set selection, the two indices selected for ε-SVR can be any α_i or α_i*, but the two equality constraints lead to the selection (5.5) and (5.6) for ν-SVR, where the set is chosen either from {α_1, ..., α_l} or from {α_1*, ..., α_l*}. Furthermore, as the stopping tolerance 10^−3 might be too loose in some cases, the ε obtained after solving (D_ν) may be a little different from the theoretical value; hence we actually solve two problems with slightly different optimal solution sets. All these factors may contribute to the difference in iterations.

5. We also see that it is much harder to solve problems using C = 100 than using C = 1. The difference is even more dramatic than in the case of classification. We do not have a good explanation for this observation.

7  Conclusions and Discussions

In this paper we have shown that the inequality in the ν-SVR formulation can be treated as an equality. Hence algorithms similar to those for ν-SVC can be applied to ν-SVR. In addition, in Section 6 we have shown similarities and dissimilarities in the numerical behavior of ν-SVR and ε-SVR. We think that in the future, the relation between C and ε (or C and ν) should be investigated in more detail. The model selection of these parameters is also an important issue.

References
Chang, C.-C. and C.-J. Lin (2001a). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, C.-C. and C.-J. Lin (2001b). Training ν-support vector classifiers: Theory and algorithms. Neural Computation 13(9), 2119–2147.

Crisp, D. J. and C. J. C. Burges (2000). A geometric interpretation of ν-SVM classifiers. In S. Solla, T. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, Volume 12, Cambridge, MA. MIT Press.

Flake, G. W. and S. Lawrence (2002). Efficient SVM regression training with SMO. Machine Learning 46, 271–290.

Hsu, C.-W. and C.-J. Lin (2002). A simple decomposition method for support vector machines. Machine Learning 46, 291–314.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, Cambridge, MA. MIT Press.

Keerthi, S. S. and E. G. Gilbert (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning 46, 351–360.

Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2000). Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks 11(5), 1188–1193.

Laskov, P. (2002). An improved decomposition algorithm for regression support vector machines. Machine Learning 46, 315–350.

Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12(6), 1288–1298.

Michie, D., D. J. Spiegelhalter, C. C. Taylor, and J. Campbell (Eds.) (1994). Machine Learning, Neural and Statistical Classification. Upper Saddle River, NJ, USA: Ellis Horwood. Data available at http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/.

Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130–136. IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, Cambridge, MA. MIT Press.

Rüping, S. (2000). mySVM – another one of those support vector machines. Software available at http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

Schölkopf, B., A. Smola, R. C. Williamson, and P. L. Bartlett (2000). New support vector algorithms. Neural Computation 12, 1207–1245.

Schölkopf, B., A. J. Smola, and R. Williamson (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems, Volume 11, Cambridge, MA. MIT Press.

Vapnik, V. (1998). Statistical Learning Theory. New York, NY: Wiley.
