
Efficient Classification Algorithms using SVMs for Large Datasets

A Project Report Submitted in partial fulfilment of the requirements for the Degree of

Master of Technology
in

Computational Science

by

S.N.Jeyanthi

Supercomputer Education and Research Center
INDIAN INSTITUTE OF SCIENCE
BANGALORE 560 012, INDIA
JUNE 2007

TO

Mom, Dad and My nephew Laddan

Acknowledgements
First and foremost, I would like to thank my advisor Dr. Shirish K. Shevade for his constant support and encouragement. I gratefully recollect all the hardships he underwent to read my handwritten results and the patience he had in making me understand the concepts. It was a great learning experience under him. I would also like to thank Prof. Matthew Jacob T. for his continued guidance throughout my stint at IISc. I would be remiss if I failed to thank Prof. R. Govindarajan, who upheld all the demands I took to him, however small or big they were. I would like to thank Naimisha, Joth, Sivagama Sundari and Garima, whose company made my life here an enjoyable one. A special mention of Shijesta, who helped me in every regard right from the beginning of my life at IISc, and of Sreepathi for his help with Linux and allied stuff, amidst their busy schedules. I would also like to extend my sincere gratitude to all my SSL labmates for making this work really computational with the patience they maintained when my processes occupied their CPUs, delaying their work. I am also grateful to the office staff, Sekhar and Mallika, without whom none of the formal procedures would have been completed. Last but not least, I owe a lot to my parents and my sister, who wonderfully handled my fragility during the stressful period I faced here and kept me going. I am nowhere without them.

Abstract
The Support Vector Machine (SVM) is a machine learning tool based on the idea of large margin data classification. The tool has a strong theoretical foundation, and the classification algorithms based on it give good generalization performance. Standard implementations, though they provide good classification accuracy, are slow and do not scale well; hence they cannot be applied to large-scale data mining applications. They typically need a large number of support vectors, so the training as well as the classification times are high. The first part of our work involves developing a new learning algorithm in which we solve the dual problem, adding the support vectors incrementally. This algorithm selects new support vectors from a random sample based on generalization ability. In the second part of our work, we developed a classification algorithm that solves the primal problem instead of the dual problem. This algorithm performs better in terms of the resulting classifier complexity and is comparable in terms of generalization error when compared with the first-phase algorithm. In both phases we reduced the resulting classifier complexity compared with existing works. Experimental results on real-world large datasets show that these methods help to reduce the storage cost, produce classification accuracy comparable to existing works, and result in a reduction of support vectors, thereby reducing the inference time.


Contents
Acknowledgements
Abstract
Abbreviations
1 Introduction
  1.1 Background
    1.1.1 SVMs - Non-Separable case - Soft Margin Classification
    1.1.2 A brief note on nucleus of our phase-1 algorithms
  1.2 Survey of related work
  1.3 Motivation
  1.4 Organization of the report
2 Classification Algorithms solving the Dual problem
  2.1 Simple 59 sampling Algorithm
    2.1.1 Numerical Experiments
    2.1.2 Algorithm-2
  2.2 Heuristics 59 sampling
    2.2.1 Numerical Experiments on large datasets
    2.2.2 Analysis
3 Classification Algorithms solving the Primal problem
  3.1 Solving the Regularized Least Squares Primal problem
    3.1.1 Numerical Experiments
    3.1.2 Analysis of the Algorithms of phase-2
4 Conclusion
A Tables
B Figures
C Machine Configuration and Dataset Details
References

List of Tables
2.1 Algorithm-1 on Banana Dataset
2.2 Algorithm-1 on Splice Dataset
2.3 Algorithm-1 & Algorithm-2 on Shuttle dataset
2.4 Algorithm-1 & Algorithm-2 on Adult-8 dataset
2.5 Algorithm-1 & Algorithm-2 on IJCNN1 dataset
2.6 Algorithm-1 & Algorithm-3 on Shuttle dataset
2.7 Algorithm-1 & Algorithm-3 on IJCNN1 dataset
3.1 Algorithm-4 on small datasets for Mean and Std on TestError, #BVs, Training set error
3.2 Algorithm-4 on small datasets for Mean and Std on Lambda, Sigmasqr
3.3 Algorithm-5 & Algorithm-6 over Adult8 and Shuttle
3.4 Comparison of our work and one of the existing works [5]
A.1 Algorithm-1 on Waveform dataset
A.2 Algorithm-1 on Image Dataset
A.3 Algorithm-1 & Algorithm-2 on MNIST 3vO dataset
A.4 Algorithm-1 & Algorithm-2 on MNIST 3v8 dataset
A.5 Algorithm-1 & Algorithm-2 on Vehicle dataset
A.6 Algorithm-5 & Algorithm-6 over M3V8 & M3VO
A.7 Algorithm-5 & Algorithm-6 over IJCNN1 & Vehicle
C.1 Characteristics of the Large Datasets


List of Figures
1.1 Linear Support Vector Machine example
2.1 Behaviour of dual objective value: Algorithm-1 on small datasets
2.2 Generalization error performance by Algorithm-1
2.3 Shuttle dataset - Cumulative time performance by Algorithm-1 & Algorithm-2
2.4 Shuttle dataset - Test error performance by Algorithm-1 & Algorithm-2
2.5 Adult-1 dataset - 59pt and Test error by Algorithm-1 with C=1.0
2.6 Adult-7 dataset - Test error by Algorithm-1 with C=1.0
2.7 Shuttle dataset - Test error by Algorithm-3
2.8 IJCNN1 dataset - Test error by Algorithm-3
3.1 Algorithm-4 over Banana dataset for Test Error on 100 instances
3.2 Algorithm-4 over Banana dataset for Training Error on 100 instances
3.3 Algorithm-4 on Banana Dataset for Test & Training Error variation
3.4 Algorithm-5 on Adult8 for test and training error
3.5 Algorithm-5 on Adult8 for cumulative time in seconds
3.6 Algorithm-6 on Adult-8 dataset for #BVs vs Test error
3.7 Algorithm-6 on Adult-8 dataset for Time vs Test error
B.1 Algorithm-1 on Image dataset for Test error
B.2 Algorithm-1 on Waveform dataset for Test error
B.3 Algorithm-1 with C=1.0 on Adult-1 dataset for Dual objective function value variation
B.4 Algorithm-1 with C=10.0 on Adult-1 dataset for Test error
B.5 Algorithm-3 Common Sampling on Shuttle dataset for Cumulative Time
B.6 Algorithm-3 Common Sampling on Shuttle dataset for Test error
B.7 Algorithm-3 Common Sampling on IJCNN1 dataset for Test error
B.8 Algorithm-3 on Shuttle dataset for Test error with different parameter combinations
B.9 Algorithm-4 on Banana dataset for # of BVs
B.10 Algorithm-4 on Breast Cancer dataset for Test Error
B.11 Algorithm-4 on Diabetis dataset for # of BVs
B.12 Algorithm-4 on Diabetis dataset for Test Error
B.13 Algorithm-4 on Heart dataset for Training Error
B.14 Algorithm-4 on Thyroid dataset for Training Error
B.15 Algorithm-5 on Shuttle dataset for Training and Test Error
B.16 Algorithm-6 on Shuttle dataset for CPU Time
B.17 Algorithm-5 on MNIST 3Vs8 dataset for CPU Time
B.18 Algorithm-5 on MNIST 3VsO dataset for CPU Time
B.19 Algorithm-5 on MNIST 3Vs8 dataset for Training and Test Error
B.20 Algorithm-5 on MNIST 3VsO dataset for Training and Test Error
B.21 Algorithm-5 on IJCNN1 dataset for CPU Time
B.22 Algorithm-5 on IJCNN1 dataset for Training and Test Error
B.23 Algorithm-6 on IJCNN1 dataset for #BVs vs Test error
B.24 Algorithm-6 on IJCNN1 dataset for Time vs Test error
B.25 Algorithm-6 on Shuttle dataset for #BV vs Test error
B.26 Algorithm-6 on Shuttle dataset for Time vs Test error

Abbreviations
MSMO - Modified Sequential Minimal Optimization
SVM - Support Vector Machine
59pt - 59 point
SVs - Support Vectors
59s - 59 sampling
100s - 100 sampling
H59S - Heuristics 59 Sampling
BVs - Basis Vectors
KKT - Karush-Kuhn-Tucker
QP - Quadratic Programming


Chapter 1 Introduction
1.1 Background

Binary pattern recognition involves constructing a decision rule to classify examples into one of two classes based on a training set of examples whose classification is known a priori. Support Vector Machines (SVMs) construct a decision surface in the feature space that separates the two categories and maximizes the margin of separation between the two classes of points. This decision surface can then be used as a basis for classifying points of unknown class.

Suppose we have N training data points {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} where x_i ∈ R^d and y_i ∈ {+1, -1}. The problem of finding a maximal margin separating hyperplane can be written as

\min_{w,b} \; \frac{1}{2} w^T w \quad \text{subject to} \quad y_i (w^T x_i - b) \ge 1, \quad i = 1, \dots, N

This is a convex quadratic programming problem. Introducing Lagrange multipliers and solving to get the Wolfe dual, we get:

\text{maximize } L_D = \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0

The solution of the primal problem is given by

w = \sum_{i=1}^{N} \alpha_i y_i x_i

To train the SVM, we search through the feasible region of the dual problem and maximize the objective function. The optimality of the solution can be checked using the KKT conditions. Refer to figure 1.1. The equations of the three hyperplanes are:

Figure 1.1: Linear Support Vector Machine example

Optimal hyperplane H: \; w \cdot x - b = 0

Supporting hyperplanes, parallel to and equidistant from the optimal hyperplane:

H_1: \; w \cdot x - b = +1
H_2: \; w \cdot x - b = -1

Examples with α_i > 0 are called support vectors. Only they participate in the definition of the optimal separating hyperplane; the other examples can be removed. We can classify a new object x with

f(x) = \mathrm{sgn}(w \cdot x - b)
     = \mathrm{sgn}\left( \left( \sum_{i=1}^{N} \alpha_i y_i x_i \right) \cdot x - b \right)
     = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) - b \right)
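The decision function above is straightforward to evaluate once the support vectors, their coefficients and the bias are known. The following is a minimal illustrative sketch (not code from the report; all names and values are made up for illustration):

import numpy as np

def svm_predict(x, sv_x, sv_y, alpha, b):
    # f(x) = sgn( sum_i alpha_i * y_i * (x_i . x) - b ), using only the support vectors
    s = sum(a * y * np.dot(xi, x) for a, y, xi in zip(alpha, sv_y, sv_x))
    return 1 if s - b >= 0 else -1

# illustrative usage with made-up values
sv_x = [np.array([1.0, 2.0]), np.array([-1.0, -0.5])]
sv_y = [+1, -1]
alpha = [0.7, 0.7]
print(svm_predict(np.array([0.5, 1.0]), sv_x, sv_y, alpha, b=0.0))

Note that only the support vectors enter the sum, which is why a smaller SV set directly reduces inference time.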

1.1.1 SVMs - Non-Separable case - Soft Margin Classification

Very large datasets invariably contain noisy data points, and in general there will be no linear separation in the input space. In that case there will be training examples lying between the two supporting hyperplanes. We need a way to tolerate noise and outliers, and to take into consideration the positions of more training points than just those closest to the boundary. We therefore relax the constraints of the primal problem by introducing non-negative slack variables ξ_i, 1 ≤ i ≤ N, in the constraints. The corresponding Wolfe dual remains the same; the only difference from the maximal margin case is that the α_i now have an upper bound of C instead of ∞.
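For reference, the corresponding soft-margin primal in its standard form (a sketch of the usual formulation; the report states it only in words) is:

\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\, w^T w \;+\; C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad y_i\,(w^T x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N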

1.1.2 A brief note on nucleus of our phase-1 algorithms

SMO algorithm [3]: SMO breaks the large QP optimization problem that needs to be solved for training an SVM into a series of smallest-possible QP problems. These small QP problems are solved analytically, which avoids a time-consuming numerical QP solver.
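To make the "smallest possible QP" concrete, the two-variable subproblem that SMO solves has a closed-form update. The following is a hedged sketch of that standard Platt-style update (it is not the exact MSMO routine used in this work; names are illustrative):

import numpy as np

def smo_two_point_update(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    # Analytic solution of the two-variable subproblem with box clipping.
    # E_i = f(x_i) - y_i is the prediction error at the current iterate.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12          # curvature along the equality-constraint line
    if L >= H or eta <= 0:
        return a1, a2                     # this sketch simply skips degenerate cases
    a2_new = float(np.clip(a2 + y2 * (E1 - E2) / eta, L, H))
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new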

MSMO algorithm [6]: The authors of this algorithm enhance SMO further by alleviating the inefficiency associated with maintaining and updating a single threshold value in SMO. In particular, we took Modification-2 of [6] and modified it to suit our work. The modifications we made were:

1. Instead of the whole MSMO algorithm, we took only the loop that [6] refers to as the inner loop. This way we operate only on the worst KKT violators.

2. Instead of initializing the dual variables to 0, we allowed warm-start initialization.

1.2 Survey of related work

There has been a constant effort toward improving the training and classification times of algorithmic approaches to SVMs. Methods like chunking and decomposition [3] try to scale up the QP to be solved. Other scale-up methods include the Kernel-Adatron [10] and RSVM [7]. Recently, there has been increased interest in exploiting the approximation ideas discussed in [9]. There are also works that solve the primal problem directly rather than taking up the dual, which is the common procedure in SVM implementations. Our results were compared with [5], which uses this primal-solve idea, and also with [1]. Their idea of picking basis vectors is based on [11], which is a non-SVM technique.

1.3 Motivation

All the scale-up methods that solve the dual problem still suffer from the kernel matrix being too large to fit in memory, and they take more training time since their time complexity is at least O(n^2). Their inference time is O(n_sv). We have reduced the classification time and the memory complexity while retaining comparable generalization accuracy. Techniques like the RVM [11] provide good empirical support for the fact that only a small fraction of the training samples is enough to capture the essence of the whole dataset while still producing acceptable test error performance. This was one of the motivations for the second phase of our work.

1.4 Organization of the report

Chapter 2 describes the algorithms developed and experimented with during the first phase of the work, in which we solved the dual problem; this includes Algorithm-1, Algorithm-2 & Algorithm-3, along with an analysis of those algorithms. Chapter 3 explains Algorithm-4, Algorithm-5 and Algorithm-6, where we solved the primal problem, and the relevant analysis. Chapter 4 concludes the report with directions for future work. The appendices give the results of the experiments as tables and graphs, with the last appendix giving details of the machine configuration on which the experiments were conducted, along with details of the datasets.

Chapter 2 Classification Algorithms solving the Dual problem


2.1 Simple 59 sampling Algorithm

Non-linear SVM: When the points in the input space are not linearly separable, we can transform the data points to another, high-dimensional space such that the data points become linearly separable in that feature space. This is also the case where the decision function is not a linear function of the data. The Wolfe dual for this case is:

\text{maximize } L_D = \sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0

where φ(·) is the transformation to the high-dimensional space. As can be seen, the training algorithm depends only on dot products of the data in the high-dimensional space. If there is a kernel function k such that φ(x_i) · φ(x_j) = k(x_i, x_j), we only need to use k in the training algorithm and never φ explicitly. Algorithm-1 solves this dual problem by incremental addition of SVs to the current set of SVs.
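As an illustration of the kernel trick, the Gram matrix entries φ(x_i) · φ(x_j) = k(x_i, x_j) can be computed directly from the inputs. A small NumPy sketch for a Gaussian kernel follows; the exact parameterization in terms of sigmasqr is an assumption, since the report only names the hyperparameter:

import numpy as np

def rbf_gram_matrix(X, Z, sigmasqr):
    # K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigmasqr)); the factor of 2 is an assumption.
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigmasqr))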

Description of the algorithm:

The algorithm starts by randomly picking two points from opposite classes. Since they belong to opposite classes, they become the current SVs. The dual variables corresponding to them can be set without a call to the MSMO routine, with just one call to the kernel function. Different 59-point chunks are sampled until a chunk is found that gives the minimum 59pt error with respect to the current SVs. Initially, the outer loop (loop forever) of the algorithm was based on the saturation of the dual objective function value. Following [8] on the termination condition, an appropriate exit condition was set based on heuristics. The stopping criterion is based on the number of times we obtained a 59pt chunk whose points were all classified correctly by the current set of SVs, or the number of times we failed to get a 59pt chunk whose minimum 59pt error was less than the current error threshold. If either of these counts reaches the specified threshold, the algorithm stops. The threshold value is set after some trial and error with various values and depends on the size of the dataset. Otherwise, the error threshold is updated as the mean of the 59pt errors of all the samples chosen but discarded so far. The error threshold controls the kind of points that get sampled; in this way we avoid choosing outliers or noisy samples with respect to the current SVs.

For each misclassified point among the sampled points, the point is added to the model and trained with the MSMO algorithm with warm-start initialization. With the temporary addition of one more point to the SV array, the test error on the remaining 58 points is observed. The point that gave the minimum 58pt test error is finally added to the model. This process is repeated until the stopping criterion is met.

Algorithm-1: Simple 59 Sampling
/***
 * This algorithm solves the dual problem by considering one point at
 * a time. Calls IMSMO as a routine for training with the current
 * set of points in the SV array
 *
 * Input: Input data matrix, label information
 * Output: Set of Support vectors
 * Initialization: Error threshold = huge value
 ***/
begin
  Randomly sample 2 points belonging to different classes.
  Add them to the current set of Support Vectors.
  Set the corresponding dual variables (alpha values)
  Loop forever
    Loop to randomly sample 59 points
      Choose that set of 59 points with which the current SVs give a 59pt test error
      less than the current error threshold
      Break if insufficient points
    end loop random sampling of 59 points
    Break if the inner loop was quit because of insufficient points
    Update the error threshold as the average of the 59pt test errors
    Loop over misclassified points
      Add the point to the current SVs
      Train using IMSMO with warm start and test over the remaining points
    end loop over misclassified points
    Add the misclassified point that gave the minimum error over the remaining points
    to the SV array
    Save the dual variables for the next iteration
  end loop forever
end
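A minimal sketch of the inner sampling loop follows (illustrative only; the function and variable names are not from the report): random 59-point chunks are drawn from the not-yet-used pool until one of them has a 59pt error below the current threshold with respect to the current SVs.

import numpy as np

def sample_59_chunk(X, y, predict, err_threshold, pool, rng, chunk_size=59, max_tries=100):
    # predict(x) is the decision function defined by the current set of SVs
    pool = np.asarray(pool)
    for _ in range(max_tries):
        if pool.size < chunk_size:
            return None                                   # insufficient points: caller breaks out
        idx = rng.choice(pool, size=chunk_size, replace=False)
        chunk_err = np.mean([predict(X[i]) != y[i] for i in idx])
        if chunk_err < err_threshold:
            return idx                                    # acceptable chunk found
    return None

rng = np.random.default_rng(0)   # example random generator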

Following [9], experiments were done by choosing 59 samples; and since the authors of [5] note that there is no universal answer for how many samples to choose, experiments were done with 100 samples as well.

Complexity analysis:

While picking 59 samples, they are tested to check that they give an error less than the error threshold; this takes O(n_sv) kernel evaluations for one chunk of 59. The operation is repeated until we find a satisfactory chunk or run out of points, so the total number of kernel evaluations here is O((n - n_sv) * n_sv), where n is the number of training samples and n_sv is the number of SVs. After the points are chosen, the kernel matrix corresponding to these points is set up, which amounts to O(sampsze^2) kernel evaluations, where sampsze is either 59 or 100 in our case. For every point in the 59-point chunk, the cost of training with MSMO amounts to O(sampsze * n_sv^2) time. This MSMO training takes more time because of the warm-start initialization, which leads to O(n * n_sv) additional kernel function evaluations at the beginning for setting up the appropriate data structures. When MSMO is used with the full dataset, this cost does not arise, as the dual variables are initialized to zero. With regard to memory usage, a kernel matrix is used for storing the kernel function evaluations of every new 59-point group; its size is independent of the training set size and is a constant. The whole process is repeated until all of the n - n_sv remaining points have been considered for addition to the model. This is essentially equivalent to dividing the left-out points into 59-sample groups and considering them for addition to the model. The timing may be improved by keeping a few more backup kernel evaluations, since points once thrown away are likely to be sampled again.

2.1.1 Numerical Experiments

Over Small Datasets

Figure 2.1 shows that Algorithm-1 on the small datasets does not reach the optimal dual value, represented as horizontal lines in the figure. Tables 2.1 and 2.2 show the details comparing three cases:

MSMO represents the MSMO algorithm applied over the aforementioned datasets; the training involves all points of the dataset.

59s/A-1(59) represents Algorithm-1 applied with a sampling chunk of 59 points.

100s/A-1(100) represents Algorithm-1 applied with a sampling chunk of 100 points.

Figure 2.1: Behaviour of dual objective value: Algorithm-1 on small datasets

Table 2.1: Algorithm-1 on Banana Dataset (C = 10.0, sigmasqr = 0.1)
Type   Time      nBsvs  Bsvs  Zsvs  Threshold  GenErr(%)  Dual     Gap
MSMO   2         65     59    276   -0.0988    11.877     670.230  1.077e-8
59s    4.399     42     54    15    -0.0254    13.122     620.749  8.473e-3
100s   7.244     39     57    15    -0.0482    12.489     647.356  6.791e-3

Table 2.2: Algorithm-1 on Splice Dataset (C = 100.0, sigmasqr = 100.0)
Type   Time      nBsvs  Bsvs  Zsvs  Threshold  GenErr   Dual      Gap
MSMO   130       489    0     511   1.686      11.0344  1206.224  1.4737e-6
59s    1382.215  268    0     12    1.199      12.00    986.380   6.8275e-1
100s   1901.682  275    0     9     1.001      11.724   1003.503  6.4411e-1

As the tables show, this algorithm performs poorly with respect to time, comparably with respect to generalization error, and better with respect to the number of SVs. Between the 59s and 100s cases, 100s gives a better generalization error at the expense of more training time. It is also to be noted that the 100s case is able to give a better test error because its number of SVs is larger than in the 59s case. The column of dual values agrees with figure 2.1, showing that the dual problem is not completely solved and the optimal dual value is yet to be reached. The reason for the algorithm taking more training time could be the repeated kernel function evaluations: we sample 59 or 100 points, take only one of them and throw away the rest, so all calculations pertaining to the thrown-away points may have to be repeated when those points get sampled again. In the case of the MSMO algorithm, in contrast, none of the kernel evaluations are repeated, at the expense of the memory needed for storing the whole kernel matrix or the dot products. Refer to the appendix (tables A.1 and A.2) for tables exhibiting the behaviour of this algorithm on the Image and Waveform datasets.
Figure 2.2: Generalization error performance by Algorithm-1

Figure 2.2 shows the test error fluctuations as every point gets added to the current set of SVs, comparing type 2 (59s) and type 3 (100s) on the Banana and Splice datasets. As already discussed, the plots show that the 100s type gives a better generalization error rate than the 59s type. For the performance on the remaining two datasets, refer to figures B.1 and B.2.

Over Large Datasets

We used the datasets used in [5], as their work is similar in idea to ours of incremental addition of points. Refer to the second row of tables 2.3, 2.4 and 2.5 for the performance of Algorithm-1 on Shuttle, Adult-8 and IJCNN1, and to figures 2.3 and 2.4 for its performance on the Shuttle dataset. For details on the other datasets, like MNIST and Vehicle, refer to the appendix, tables A.3, A.4 and A.5.

Table 2.3: Algorithm-1 & Algorithm-2 on Shuttle dataset
Type   Time(s)   GenErr(%)  nBsvs Bsvs Zeros  Dual      Gap       Thresh
MSMO   2964.31   0.0965     123   44   43333  22204.17  1.814     -0.908
A-1    30.35     0.0552     97 in 135         13788.79  4.7e+05   -0.694
A-2    32.01     0.0552     97 in 135         13788.79  4.81e+05  -0.694

Table 2.4: Algorithm-1 & Algorithm-2 on Adult-8 dataset (C = 1.0)
Type   Time(s)   GenErr(%)  nBsvs Bsvs Zeros  Dual      Gap       Thresh
MSMO   3302.0    14.85      680   7577 14439  7573.49   0.2268    0.3018
A-1    35.11     25.6       155 in 155        138.09    1.7e+04   -0.0834
A-2    38.97     25.61      155 in 155        138.09    1.7e+04   -0.0833

2.1.2 Algorithm-2

It can be seen that Algorithm-1 takes more time than MSMO for small datasets, but the time reduces for large datasets. A possible explanation is that the 59/100-point sampling loop becomes overkill when the dataset is small; hence saving time with 59/100 sampling is a better option only when the dataset is big enough. In order to further improve the training time of the 59s type, Algorithm-2 was implemented with the following modifications:

1. The number of times the Take Step procedure is called is kept under control; we used 10 and 15 as the maximum number of Take Step calls.

2. After a point gets selected for addition as the next SV (because it gave the least 59pt test error compared with the other misclassified points), it is trained completely with MSMO to get the correct setting of the dual variables (αs).

This way the process of selecting the next suitable support vector is expedited to save CPU time. Refer to the last row of tables 2.3, 2.4, 2.5, A.3, A.4 and A.5 for the results obtained with this algorithm on the datasets considered. We expected an improvement in the training time, but this was not successful. A possible explanation is that, because the points selected as the next SV are chosen based on only partial training with MSMO, this variant takes more time to reach the test error of its 59s counterpart. We also compared our results with [5]; we observe comparable performance on test error and # of SVs for all but the Adult-8 dataset. Refer to figures 2.3 and 2.4 for the performance of Algorithm-2 on the Shuttle dataset.

Table 2.5: Algorithm-1 & Algorithm-2 on IJCNN1 dataset
Type   Time      GenErr  nBsvs Bsvs zeros  Dual      Gap       Thresh
MSMO   4164.4    1.41    2580  794  46616  14014.95  8.805     0.415
59s    4097.34   1.651   1780 in 2081      10851.52  3.7e+04   0.223
A-2    4970.815  1.642   1783 in 2059      1.00      6.4e+05   0.247

Figure 2.3: Shuttle dataset - Cumulative time performance by Algorithm-1 & Algorithm-2

Figure 2.4: Shuttle dataset - Test error performance by Algorithm-1 & Algorithm-2

Observations made on the Adult dataset

As the test error on the Adult datasets (1, 7, 8) did not improve at all (refer to figures 2.5, 2.6, B.3, B.4), we did the following experiments to study Algorithm-1 & Algorithm-2 on this dataset:


1. Changing the random seed - so that the path taken by the algorithms when choosing the points would vary. With this, the stability of the algorithm is studied for different random-seed-based paths.

2. Increasing the C value - we tried to get zero training set error so that the optimal hyperplane stabilizes.

3. Observing the 59pt test error - instead of taking the whole test set, the random small set of points that are yet to be trained on can be treated as a test set and the behaviour of the algorithm studied.

4. Observing the training set error - similar to the previous experiment, but the whole training set is involved.

5. Observing the variation of the dual objective function value - no saturation is seen, and the curve increases almost linearly with the number of points.

With all these observations, it was seen that the algorithm still behaved poorly without stabilizing; the generalization error fluctuated about a particular value.

Figure 2.5: Adult-1 dataset - 59pt and Test error by Algorithm-1 with C=1.0

Figure 2.6: Adult-7 dataset - Test error by Algorithm-1 with C=1.0


2.2 Heuristics 59 sampling

Description of the algorithm:

As was observed in the Algorithm-1 & Algorithm-2 case, the optimal dual value was never reached for any of the datasets considered, and for datasets like Adult and MNIST 3Vs8 the test error rate was unacceptably high. Hence Algorithm-3 was developed, improving upon the Simple 59 Sampling algorithm by using intelligent heuristics while sampling the points to be included as SVs. The idea of shrinking was also employed, whereby points that are correctly classified a given number of times are removed from the active set of points, as they do not help to improve the solution. Important kernel computations were saved in a cache, anticipating an improvement in training time. Additional data structures are used to help in intelligent sampling of the points (refer to the header of the algorithm for details of the data structures).

The algorithm begins by randomly sampling two points from opposite classes; they become the current SVs. There is an intermediate array of input points that stands between the whole search space and the current SV set. The points in the intermediate array are classified using the points in the current SV array. The 59 points are sampled only from the currently misclassified points in the intermediate array. A count of the number of times each point in the intermediate array gets correctly classified is maintained to support shrinking. When fewer than 59 points are misclassified, all the points whose count is greater than some threshold are flushed and new points are brought in. This way any correctly classified sample is removed and room is created for new points from the whole search space. Once a set of 59 points is obtained, the process of adding a point to the current SV array is the same as in Simple 59 Sampling. After the addition of a new SV, one sweep through the intermediate array is done to update the appropriate data structures. This process is repeated until any new sampling from the whole search space fails to increase the number of misclassified points in the intermediate array.

The various caches that are used are:

1. a cache to hold the kernel values between the points in the current SV set

2. a cache to hold the kernel values between the current SVs and the points in the intermediate array

3. a cache to hold the kernel values between the current SVs and the 59 random samples

4. a cache to hold the kernel values between the 59 randomly sampled points

Algorithm-3: Heuristics 59 Sampling
/***
 * This algorithm solves the dual problem by considering one point at a time,
 * selected based on intelligent heuristics. Maintains an intermediate cache of
 * 2000 points from which all sampling is done. The cache is periodically
 * refreshed with fresh points from the whole search space.
 * Calls Improved Modified SMO as a routine for training with the current set of
 * points in the SV array. The points to be added to the SV array are sampled
 * from the misclassified set of points in the intermediate array
 * Input: Input data matrix, label information
 * Parameters: Intermediate array size, threshold on the number of times a point
 *             may be correctly classified
 * Important data structures: kernel cache for SVs, kernel cache for SVs vs 2000
 *             points, linked list of misclassified points, array (of size 2000)
 *             counting how many times each point was correctly classified
 * Output: Set of Support vectors
 ***/
begin
  Randomly sample 2 points belonging to different classes.
  Add them to the current set of Support Vectors.
  Set the corresponding dual variables (alpha values)
  Do the first sampling of 2000 points
  Classify them as either correctly classified or misclassified according to the current SVs
  Loop forever
    Loop to randomly sample 59 points
      Randomly choose 59 points from the misclassified portion of the 2000-point array.
      If sampling is successful, continue with the next part of the algorithm
      Break if sampling fails, because of insufficient misclassified points, a specified number of times
      Shrinking: if sampling fails, remove all the correctly classified points in the 2000-point
      array and, in place of them, insert new points from the whole search space.
      Insert the newly inserted misclassified points into the linked list, set the cache data structures
    end loop random sampling of 59 points
    Break if the inner loop was quit because of insufficient points
    Set the appropriate cache data structures.
    Loop over 59 points
      Add the point to the current SVs
      Train using IMSMO with warm start and test over the remaining points
    end loop over 59 points
    Add the point that gave the minimum error over the remaining points to the SV array
    Set the SV kernel cache, update the linked list and other data structures.
    Save the dual variables (alphas) for the next iteration
  end loop forever
end
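A sketch of the shrinking step described above (illustrative names only; the exact bookkeeping in the implementation may differ): points whose correct-classification count exceeds the threshold are flushed and replaced by fresh points drawn from the whole search space.

import numpy as np

def shrink_intermediate_array(inter_idx, correct_count, count_threshold, full_pool, rng):
    # Flush points that have been classified correctly more than count_threshold times,
    # then refill the intermediate array with fresh points from the whole search space.
    kept = [i for i in inter_idx if correct_count.get(i, 0) <= count_threshold]
    n_new = len(inter_idx) - len(kept)
    candidates = [i for i in full_pool if i not in set(inter_idx)]
    fresh = (list(rng.choice(candidates, size=min(n_new, len(candidates)), replace=False))
             if candidates else [])
    return kept + fresh

# rng = np.random.default_rng(0); correct_count is a dict of per-point counts.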

Complexity Analysis:

Memory Complexity: Referring to section 2.2 for the list of caches used in this algorithm, for case 1 it is O(n_sv^2), for case 2 it is O(n_sv * interarraysize), for case 3 it is O(n_sv * sampsze), where sampsze is taken as 59 in this algorithm, and for case 4 it is O(sampsze^2).

Time Complexity: When the first sampling of the intermediate-array points (mostly 2000 in our case) is done, there are 2 * 2000 kernel evaluations, where 2 is the current number of support vectors. This is followed by testing of these points, with an O(n) insertion into the linked list if a point gets misclassified. The kernel evaluations for setting the kernel cache between the SVs and the 59 points can be avoided by saving the indices and accessing the corresponding locations of the 2000-vs-SVs kernel cache, as those 59 points are part of the intermediate array. When there are insufficient points, assuming the worst case in which all the points currently in the 2000-point array are correctly classified, and with every new point being sampled and tested, the total is 2000 * O(n_sv) testing kernel evaluations, with possible insertion into the linked list and updating of the count array of classification information. As in Algorithms 1 & 2, there is the setting of kernel evaluations between the 59 points, which amounts to O(sampsze^2) kernel evaluations. This whole process is repeated until the heuristics-based stopping criterion is met. After a point gets added to the model, all these operations have to be repeated, in addition to insertion and deletion of points in the maintained linked list, which are O(n) operations. During every deletion from the list there are 2000 kernel evaluations; so, as an upper bound, when all the points get removed there will be 2000^2 kernel evaluations, even though this extreme case rarely happens in practice. The most important distinction between the MSMO routines of this algorithm and Algorithm-1 is that all of these routines are passed the kernel matrix, so they need not call the kernel function at all. Hence there is no additional cost incurred because of the warm-start initialization that occurred with Algorithm-1.

2.2.1 Numerical Experiments on large datasets

The experiments were done on the Shuttle, IJCNN1, Adult-8 and MNIST datasets. Refer to tables 2.6 and 2.7 for a comparison between MSMO, Algorithm-1 & Algorithm-3.

Table 2.6: Algorithm-1 & Algorithm-3 on Shuttle dataset
Type   Time(s)   GenErr(%)  nBsvs Bsvs Zeros  Dual      Gap       Thresh
MSMO   2964.31   0.0965     123   44   43333  22204.17  1.814     -0.908
A-1    30.35     0.055      97 in 135         13788.79  4.79e+05  -0.693
A-3    72.76     0.1103     79 in 94          11964.55  8.92e+05  -0.605

Table 2.7: Algorithm-1 & Algorithm-3 on IJCNN1 dataset
Type   Time(s)   GenErr(%)  nBsvs Bsvs Zeros  Dual      Gap       Thresh
MSMO   4164.4    1.41       2580  794  46616  14014.95  8.805     0.415
A-1    4097.33   1.651      1780 in 2081      10851.53  3.7e+04   0.223
A-3    22356.44  1.857      1603 in 1791      10483.55  8.19e+04  0.237

2.2.2 Analysis

Algorithms 1 & 2 vs Algorithm-3: It may be noted that with regard to the number of SVs, which was our original concern, Algorithm-3 performs well by giving a smaller number of SVs. The likely reason is the heuristics that aid in selecting the samples intelligently.

Possible reasons for Algorithm-3 taking more time: This new algorithm takes more time because of the book-keeping operations involved in maintaining the caches. When the linked list of misclassified points does not have enough points to sample 59 points for the algorithm to proceed, a fresh set of 2000 points is resampled, which requires setting the kernel cache and thus involves dot product operations. As the algorithm proceeds to add new points, contrary to expectation, it takes still longer for every single point to be added, because the operations of clearing the intermediate array, evaluating the kernel function for the freshly sampled points, and other relevant book-keeping need to be repeated. A point that was removed from the intermediate array during some initial iteration may come in again, and then all the calculations belonging to that point need to be redone.


Even though intelligent heuristics are involved, because of the aforementioned reasons this algorithm cannot be applied to very large datasets. Refer to figures 2.7 and 2.8 for the variation of time and test error given by this algorithm for every point that gets considered, for datasets like Shuttle and IJCNN1.

Figure 2.7: Shuttle dataset - Test error by Algorithm-3

Figure 2.8: IJCNN1 dataset - Test error by Algorithm-3

Other experimental tricks that were deployed:

1. Experiments were done with common sampling as well, since always sampling from the misclassified portion may give too much importance to outliers. The common sampling procedure samples the 59 points from the whole intermediate array rather than merely from the linked list of misclassified points. The graphs showed behaviour similar to the misclassified-portion-only sampling. Common sampling was applied to the Shuttle and IJCNN1 datasets; refer to figures B.5, B.6 and B.7.

2. Cross-validation was done over the intermediate array size and the number of times a point may be correctly classified; refer to figure B.8.

There was not much improvement seen after these experiments.


Motivation for the Second Phase - developing Algorithm-4

Behaviour similar to what was observed with the Adult datasets under Algorithm-1 & Algorithm-2 was also observed here. A possible explanation is that, because for the Adult datasets most of the SVs are bounded SVs, with their dual variables taking the value of the penalty parameter C, the algorithms take a prohibitively long time to stabilize, as they consider only misclassified points. For the Adult dataset, since most of the SVs are at the margin, the addition of a new point drastically affects the stability of the optimal hyperplane; this leads to oscillation of the test error about a particular value. As a side effect, this algorithm also fails to reach the optimal dual value of the dataset considered, as can be observed from the tables presented. The second phase of the work involves solving the primal problem rather than the dual, which is the common procedure in SVM implementations.

Chapter 3 Classification Algorithms solving the Primal problem


3.1 Solving the Regularized Least Squares Primal problem
The objective function that is minimized is:

\min_{\beta} \;\; \frac{\lambda}{2}\, \beta^T \beta \;+\; \frac{1}{2}\, \lVert K\beta - y \rVert^2

where y is the class information vector with y_i ∈ {+1, -1}, n is the given number of points, the kernel matrix K is of size n x #BV, λ is the regularization parameter, and β (of dimension #BV, at most d_max) is an unconstrained coefficient vector, not to be confused with the Lagrange multipliers that are used in SVM implementations. Solving for β, we get

\beta = (K^T K + \lambda I)^{-1} K^T y

where the term (K^T K + λI) is the symmetric positive definite Hessian. We seek a solution of the form:

f(x) = \sum_{i=1}^{n} \beta_i \, k(x_i, x)
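The β solve above can be written down directly. The following is a minimal NumPy sketch (not the report's code; it assumes K is the n x #BV kernel matrix between all training points and the current basis vectors, and uses a Cholesky factorization instead of an explicit inverse, in the spirit of the implementation described next):

import numpy as np

def solve_rls_beta(K, y, lam):
    # beta = (K^T K + lambda*I)^{-1} K^T y via a Cholesky factorization of the Hessian
    d = K.shape[1]
    H = K.T @ K + lam * np.eye(d)        # symmetric positive definite Hessian
    L = np.linalg.cholesky(H)            # H = L L^T
    z = np.linalg.solve(L, K.T @ y)      # forward substitution
    return np.linalg.solve(L.T, z)       # back substitution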

Description of the algorithm:

The algorithm begins with an empty set of basis vectors and proceeds to solve the aforementioned primal by unconstrained optimization.

Efficient Cholesky factorization of the Hessian: This is done for including a point in the model and the subsequent solving. Instead of decomposing from scratch, we perform efficient low-rank updates of the factorization. Instead of inverting the Hessian, we maintain and update its Cholesky factorization (LL^T) and compute the inverse on demand with forward substitutions.

Stopping criterion: This depends on the size of the dataset to which the algorithm is applied. The specific stopping criteria are given under the respective sections on small and large datasets.

Parameter d_max: This parameter helps to keep the complexity of the obtained classifier under control, which indirectly helps in controlling the training time as well. The algorithm can be stopped after d_max basis vectors have been added to the model.

Algorithm-4
/***
 * This algorithm directly minimizes the primal problem with the least-squares loss function
 * Input: Input data matrix, class information
 * Hyperparameters: lambda, sigmasqr
 * Parameters: d_max, tolerance
 * Output: Set of Basis vectors
 ***/
begin
  Repeat
    For every candidate example (examples not in the current set of BVs)
      Include it in the model efficiently
      Observe the generalization performance on the remaining points
    end for candidate examples
    Add the point that gave the better test error to the BV list
  Till the stopping criterion
end
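The "include it in the model efficiently" step corresponds to a bordered (low-rank) update of the Cholesky factor: appending one column to K adds one row and column to the Hessian, so only a forward substitution and a square root are needed. The following is a hedged sketch under that interpretation (names and the use of scipy are assumptions; the report's implementation may differ):

import numpy as np
from scipy.linalg import solve_triangular

def add_basis_vector(K, L, k_new, lam):
    # K: current n x d kernel matrix; L: lower Cholesky factor of H = K^T K + lam*I;
    # k_new: n-vector of kernel values of the candidate basis vector against all points.
    h = K.T @ k_new                                   # new off-diagonal block of the Hessian
    gamma = float(k_new @ k_new) + lam                # new diagonal entry of the Hessian
    l = solve_triangular(L, h, lower=True)            # forward substitution: L l = h
    delta = np.sqrt(gamma - float(l @ l))             # new diagonal entry of the factor
    L_new = np.block([[L, np.zeros((L.shape[0], 1))],
                      [l[None, :], np.array([[delta]])]])
    return np.column_stack([K, k_new]), L_new

Because only the new row of the factor is computed, trying a candidate costs far less than refactorizing the whole Hessian from scratch.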

3.1.1 Numerical Experiments

Over small datasets

The stopping criterion for small datasets is that we stop when a better error over the remaining points in the training set cannot be found, in other words when we do not have a candidate example that gives a better error over the remaining points in the training set. The base algorithm (Algorithm-4) presented above was used without any modifications for small datasets, following the experience from Algorithms 1 & 3 that sampling is overkill for small datasets. 100 instances each of the Banana, Breast-Cancer, Thyroid, Diabetis and Heart datasets were obtained from [2]. For each dataset, over a 6x6 grid whose x-axis is σ² and whose y-axis is λ, the training error over the remaining points (referred to as minerror hereafter) was found using Algorithm-4, and the grid point giving the minimum value determines the hyperparameter combination taken as input for that particular dataset (a sketch of this grid selection is given after the list below). Refer to figures 3.1 & 3.2 for the performance of the algorithm over the Banana dataset. In the figures,

1. the tag oldlogic refers to runs made on the 100 instances with the result of cross-validation with the parameter d_max set to n, where n is the size of the dataset.


2. the tag dmaxlogic refers to runs made with parameters obtained with cross-validation done with d_max set to n/10, and individual runs with d_max set to n.

3. the tag dmaxlogic_small is the same as the second case except that the individual runs are also made with d_max = n/10.
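A minimal sketch of the 6x6 grid selection of (σ², λ) mentioned above; run_algorithm4 is a hypothetical callable standing in for one Algorithm-4 training run that returns the minerror for a given hyperparameter pair:

import numpy as np
from itertools import product

def select_hyperparameters(sigmasqr_grid, lambda_grid, run_algorithm4):
    # run_algorithm4(sigmasqr=..., lam=...) is assumed to return the training error over
    # the remaining points ("minerror") for that hyperparameter combination.
    best_s2, best_lam, best_err = None, None, np.inf
    for s2, lam in product(sigmasqr_grid, lambda_grid):
        err = run_algorithm4(sigmasqr=s2, lam=lam)
        if err < best_err:
            best_s2, best_lam, best_err = s2, lam, err
    return best_s2, best_lam, best_err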

Figure 3.1: Algorithm-4 over Banana dataset for Test Error on 100 instances

Figure 3.2: Algorithm-4 over Banana dataset for Training Error on 100 instances

For the boxplot of the # of BVs for the above three cases on the Banana dataset, refer to figure B.9. For the performance on the remaining datasets, refer to figures B.9, B.10, B.11, B.12, B.13 and B.14. Tables 3.1 and 3.2 present the mean and standard deviation of the important parameters under observation for the four datasets B-Banana, H-Heart, BC-Breast Cancer, DB-Diabetis. Refer to figure 3.3 for a glimpse of how the training and test error vary with the incremental addition of BVs on the Banana dataset; the figure shows one instance of the Banana dataset with the relevant parameter and hyperparameter values. As the graph indicates, the algorithm stops when the training set error cannot be reduced further.

Table 3.1: Algorithm-4 on small datasets - Mean +/- Std of Training set error (MinErr), #BVs and Test Error
Dataset  MinErr            #BV               TestErr
B        0.257 +/- 0.0282  115.52 +/- 59.17  0.114 +/- 0.0073
H        0.434 +/- 0.0416  17.38 +/- 5.327   0.175 +/- 0.031
BC       0.538 +/- 0.0299  56.95 +/- 5.112   0.288 +/- 0.0434
DB       0.513 +/- 0.0223  98.17 +/- 10.132  0.256 +/- 0.0202

Table 3.2: Algorithm-4 on small datasets - Mean +/- Std of lambda and sigmasqr
Dataset  lambda             sigmasqr
B        0.0919 +/- 0.0258  0.051 +/- 0.045
H        0.244 +/- 0.36     10.9 +/- 9.0
BC       0.1 +/- 0.00       1.0 +/- 0.0
DB       0.098 +/- 0.0126   1.0 +/- 0.0

Over Large datasets

The same set of hyperparameters and kernel functions as was used by [5] in their experiments is used for both Algorithm-5 and Algorithm-6. The major modifications to the base algorithm are:

1. Algorithm-4 with 59 sampling (Algorithm-5)

2. Algorithm-4 with an intermediate cache together with 59 sampling (Algorithm-6)

3. The stopping criterion is based on the saturation of the training error within a specified tolerance.


Figure 3.3: Algorithm-4 on Banana Dataset for Test & Training Error variation

Algorithm-5
/***
 * Repeat till a suitable stopping criterion
 *   Repeat for all of the 59 randomly chosen points
 *     Select one point from the 59 points.
 *     Add the point to the existing model efficiently, following the idea of
 *     Cholesky decomposition from Algorithm-4.
 *     Observe the training set error given by the current point in the model.
 *   End repeat 59 points
 *   Add the point that gave the minimum training set error.
 *   Sample a new set of 59 points.
 *   If a set of newly sampled 59 points cannot monotonically reduce the minerror
 *   quantity, those points are not considered and another set of 59 points is sampled.
 *   No cache is used
 ***/


Numerical results

Refer to figures 3.4 and 3.5 for Algorithm-5 on the Adult-8 dataset. As can be observed, for this dataset, for which Algorithm-1 could not give promising performance, the primal-solve based Algorithm-4 is able to give a comparable test error rate. Refer to [5] for the test error rates on this dataset; [5] follows the approach of Newton optimization for adding the basis elements greedily.

Figure 3.4: Algorithm-5 on Adult8 for test and training error

Figure 3.5: Algorithm-5 on Adult8 for cumulative time in seconds

For the plots on other datasets for this algorithm, refer to appendix B. As can be observed, the smaller the tolerance of the algorithm, the better the test error, at the expense of more CPU time. It can also be observed that in all cases the training error drops exponentially after a certain number of BV additions and then starts to saturate. This indicates that as the algorithm proceeds, the solution starts to stabilize, and any new addition of BVs brings little improvement to the solution, again at the cost of time.


Algorithm-6
/***
 * There is an intermediate array of size 1000/150/100 holding the current
 * candidate points for inclusion in the BV set.
 * After one point is added, the minerror values given by all these points are
 * sorted. (Recall that minerror is the training set error.)
 * The last 58 points in the sorted order are removed, in addition to the point
 * that gave the minimum minerror, so that a total of 59 points are removed from
 * the intermediate array.
 * Next, a new set of 59 points is sampled from the whole space of training
 * points, excluding the ones in the BV set, and inserted into the intermediate
 * array. The corresponding kernel cache values are calculated and set.
 * The loop is repeated till the aforementioned stopping criterion for large
 * datasets.
 ***/

In both Algorithm-5 and Algorithm-6, time analysis revealed that most of the time is spent in dot product evaluation. With the cached variant (Algorithm-6), the time is even higher, as the kernel function has to be evaluated to fill the cache entries every time a new set of 59 points is sampled.
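A sketch of the candidate refresh described above (illustrative names only; not the report's code): candidates are sorted by their minerror, the worst 58 plus the just-added point are dropped, and 59 fresh points are drawn from the remaining training points.

import numpy as np

def refresh_candidates(cand_idx, minerr, added_idx, full_pool, bv_set, rng, drop=58):
    # minerr[i] is the training error obtained when candidate i is tried in the model
    order = sorted(cand_idx, key=lambda i: minerr[i])          # best (smallest) first
    kept = [i for i in order[:-drop] if i != added_idx]        # drop the worst 58 and the added point
    pool = [i for i in full_pool if i not in bv_set and i not in set(cand_idx)]
    n_new = len(cand_idx) - len(kept)
    fresh = (list(rng.choice(pool, size=min(n_new, len(pool)), replace=False))
             if pool else [])
    return kept + fresh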

3.1.2 Analysis of the Algorithms of phase-2

To aid comparison with the plots of [5], plots of # of BVs versus test error and of test error versus cumulative time are presented. The size of the intermediate cache was set to 150 rather than 1000 as the experiments continued, since setting 1000 as the intermediate cache size of Algorithm-6 took a very long time. Refer to figures 3.6, 3.7, B.23, B.24, B.25 and B.26 for the zoomed versions of the plots with an intermediate cache size of 150 for the three datasets Shuttle, IJCNN1 and Adult8; for all these plots the tolerance was set to 0.01. Refer to tables 3.3, A.6 and A.7 for a comparison of Algorithms 5 & 6 with varying tolerance. In both cases, decreasing the tolerance gives a better test error rate, with an increase in the # of BVs and the training time. The abbreviations ME and TE stand for training set error and test set error respectively.


Figure 3.6: Algorithm-6 on Adult-8 dataset for #BVs vs Test error

Figure 3.7: Algorithm-6 on Adult-8 dataset for Time vs Test error

Table 3.3: Algorithm-5 & Algorithm-6 over Adult8 and Shuttle
                Adult8                              Shuttle
                S59              KC                 S59              KC
Tolerance       0.01     0.001   0.01    0.001      0.01     0.001   0.01    0.001
BV              28       129     25      68         111      245     44      60
ME              0.4502   0.4265  0.4496  0.43377    0.0194   0.0161  0.0371  0.0364
TE              0.1509   0.1464  0.1537  0.1508     0.0017   0.00152 0.0072  0.00689
Time(s)         163.61   1018.85 174.71  603.306    772.94   3273.35 341.22  620.4

Analysis of the effect of tolerance on the parameters under study

As can be observed in the above tables, for a given algorithm (either 5 or 6), decreasing the tolerance helps to improve the test error performance. The tolerance was reduced down to 0.0001 and Algorithm-5 was run to observe the changes: the test error rate reduced, with an increase in training time.


Algorithm-5 Vs Algorithm-6

Test Error: Between Algorithm-5 and Algorithm-6, Algorithm-5 gives a better test error rate than Algorithm-6. This is evident on the Adult-8, Shuttle and IJCNN1 datasets. The remaining three datasets deviate slightly from this behaviour, with both algorithms giving comparable test error.

# of BVs: Except for the Vehicle dataset, we got a comparable number of basis vectors on all datasets, and for Vehicle we got a very good reduction in the number of BVs. The reduction in BVs is attributed to the intelligent heuristics employed in Algorithm-6.

Summary of comparison with the existing work, phase-1 algorithms and phase-2 algorithms

All the comparisons with [5] were done against their SpSVM-R method, where R stands for Regularized. Refer to table 3.4 for the important results.

Table 3.4: Comparison of our work and one of the existing works [5]
           Generalization Error (%)               # of BVs or # of SVs
Dataset    [5]           Algo-1  Algo-5  Algo-6   [5]           Algo-1  Algo-5  Algo-6
Shuttle    0.05          0.05    0.17    0.72     > 10^2        135     111     44
Adult-8    14.4 to 14.6  25.6    15.09   15.37    10^2 to 10^3  155     28      25
M 3v8      1 to 2        22.17   1.1     0.81     > 10^3        12      153     201
IJCNN1     1 to 2        1.65    5.01    5.31     10^3          2081    173     183

It may be noted that, as the phase-2 algorithms take any point from the training set without regard to its position relative to the hyperplane, those algorithms can best capture the essence of the original dataset; thus they give rise to fewer BVs than the phase-1 algorithms. The phase-1 algorithms try to differentiate the two classes well and learn the hyperplanes, so they give better classification accuracy than the phase-2 algorithms.

Chapter 4 Conclusion
With all the algorithms that were developed, we were able to achieve a smaller number of SVs or BVs, whereby the resulting classifier can classify unseen examples efficiently. It is also observed that solving the dual problem does not always give the better solution: for a few datasets where the phase-1 algorithms failed, the phase-2 algorithms performed well. It is also to be noted that we tried solving the primal while working towards an approximate solution (all the algorithms resort to random 59 sampling while choosing the next example to be added); all the algorithms are Probabilistic Speed-up methods [9]. Between the phase-1 and phase-2 algorithms, in terms of generalization error the phase-1 algorithms performed well, and in terms of SVs or BVs the phase-2 algorithms took the lead. Also to be noted is that in no case did we work with the whole dataset at one time. This makes all our algorithms scalable in terms of memory: we need not load the full dataset into main memory.

As future work, we are trying to improve upon the training time. Our speculation that using a kernel cache would improve CPU time was not confirmed experimentally; instead of giving a better time, in most cases having the cache increased the time, and analysis showed that most of the time is spent in kernel evaluations. We plan to pursue our work in the direction of efficient usage of the cache. Such an improvement is to be done for both phases of the work.


Appendix A Tables
Table A.1: Algorithm-1 on Waveform dataset (C = 1.0, sigmasqr = 10.0)
Type   Time    nBsvs  Bsvs  Zsvs  Threshold  GenErr  Dual     Gap
MSMO   4       137    1     262   0.143      10.739  167.653  2.9044e-8
A-1    10.188  79     2     3     -0.0408    12.086  145.861  1.687e-2
A-2    17.371  81     1     2     0.219      11.717  145.836  1.9078e-2

Table A.2: Algorithm-1 on Image Dataset (C = 100.0, sigmasqr = 1.0)
Type   Time    nBsvs  Bsvs  Zsvs  Threshold  GenErr  Dual      Gap
MSMO   81      365    6     929   -0.0478    3.168   1110.285  7.717e-7
A-1    78.546  152    6     30    -0.0149    3.960   1020.138  3.190e-1
A-2    99.897  157    4     13    -0.011     3.663   969.84    3.174e-1

Table A.3: Algorithm-1 & Algorithm-2 on MNIST 3vO dataset (poly kernel, normalized tuples, #iter a function of #trsamp)
Type   Time      GenErr  nBsvs Bsvs zeros  Dual  Gap       Thresh
MSMO   14033.16  0.29    4192  0    55808  2.68  9.79      0.693
A-1    8485.39   0.33    916 in 922        1.58  1.26e+05  0.196
A-2    11832.6   0.38    940 in 944        1.64  1.19e+05  0.187

MNIST 3 Vs 8 - polynomial kernel - norm tuples - #iter fn of #trsamps Type Time GenErr nBsvs Bsvs zeros Dual Gap Thresh MSMO 1114.15 0.403 1823 0 10159 1.03 3.88 -0.164 A-1 0.596 22.177 12 in 12 0.0142162 1.015e+05 0.156 A-2 0.672 22.127 12 in 12 0.014 1.015e+05 0.156 Table A.4: Algorithm-1 & Algorithm-2 on MNIST 3v8 dataset

Vehicle - #iter fn of #trsamps

Type | Time      | GenErr | nBsvs | Bsvs  | zeros | Dual      | Gap       | Thresh
MSMO | 523941.14 | 11.499 | 5281  | 18758 | 54784 | 642562.98 | 41.53     | -1.202
A-1  | 1.04e+05  | 17.655 | 983   | in    | 1008  | 23890.98  | 2.5e+06   | -0.532
A-2  | 306.81    | 19.83  | 264   | in    | 277   | 4888.676  | 1.295e+06 | -0.454

Table A.5: Algorithm-1 & Algorithm-2 on Vehicle dataset

Variables | M3V8 S59         | M3V8 KC          | M3VO S59         | M3VO KC
          | 0.01    | 0.001  | 0.01    | 0.001  | 0.01    | 0.001  | 0.01    | 0.001
BV        | 153     | 999    | 201     | 970    | 148     | 475    | 174     | 381
ME        | 0.1123  | 0.0384 | 0.0945  | 0.0355 | 0.0838  | 0.0556 | 0.0752  | 0.0569
TE        | 0.0111  | 0.0035 | 0.0081  | 0.0040 | 0.0145  | 0.0083 | 0.0124  | 0.0083
Time(s)   | 1.2e+03 | 1.8e+04| 2.4e+03 | 3.6e+04| 5.6e+03 | 9.2e+03|         |

Table A.6: Algorithm-5 & Algorithm-6 over M3V8 & M3VO

Variables | IJCNN1 S59         | IJCNN1 KC          | Vehicle S59          | Vehicle KC
          | 0.01    | 0.001    | 0.01    | 0.001    | 0.01     | 0.001     | 0.01    | 0.001
BV        | 173     | 965      | 183     | 820      | 18       | 181       | 21      | 126
ME        | 0.1987  | 0.0946   | 0.1823  | 0.1064   | 0.436    | 0.3932    | 0.4312  | 0.3971
TE        | 0.0501  | 0.0195   | 0.0531  | 0.0239   | 0.139    | 0.125     | 0.1347  | 0.1254
Time(s)   | 2055.34 | 52438.11 | 5016.59 | 7.5e+04  | 3.98e+02 | 6.35e+03  | 4.9e+02 | 5.6e+03

Table A.7: Algorithm-5 & Algorithm-6 over IJCNN1 & Vehicle

Appendix B Figures
[Plots not reproducible in text form; figure captions only.]

Figure B.1: Algorithm-1 on Image dataset for Test error
Figure B.2: Algorithm-1 on Waveform dataset for Test error
Figure B.3: Algorithm-1 with C=1.0 on Adult-1 dataset for Dual objective function value variation
Figure B.4: Algorithm-1 with C=10.0 on Adult-1 dataset for Test error
Figure B.5: Algorithm-3 Common Sampling on Shuttle dataset for Cumulative Time
Figure B.6: Algorithm-3 Common Sampling on Shuttle dataset for Test error
Figure B.7: Algorithm-3 Common Sampling on IJCNN1 dataset for Test error
Figure B.8: Algorithm-3 on Shuttle dataset for Test error with different parameter combinations
Figure B.9: Algorithm-4 on Banana dataset for # of BVs
Figure B.10: Algorithm-4 on Breast Cancer dataset for Test Error
Figure B.11: Algorithm-4 on Diabetis dataset for # of BVs
Figure B.12: Algorithm-4 on Diabetis dataset for Test Error
Figure B.13: Algorithm-4 on Heart dataset for Training Error
Figure B.14: Algorithm-4 on Thyroid dataset for Training Error
Figure B.15: Algorithm-5 on Shuttle dataset for Training and Test Error
Figure B.16: Algorithm-6 on Shuttle dataset for CPU Time
Figure B.17: Algorithm-5 on MNIST 3Vs8 dataset for CPU Time
Figure B.18: Algorithm-5 on MNIST 3VsO dataset for CPU Time
Figure B.19: Algorithm-5 on MNIST 3Vs8 dataset for Training and Test Error
Figure B.20: Algorithm-5 on MNIST 3VsO dataset for Training and Test Error
Figure B.21: Algorithm-5 on IJCNN1 dataset for CPU Time
Figure B.22: Algorithm-5 on IJCNN1 dataset for Training and Test Error
Figure B.23: Algorithm-6 on IJCNN1 dataset for #BVs vs Test error
Figure B.24: Algorithm-6 on IJCNN1 dataset for Time vs Test error
Figure B.25: Algorithm-6 on Shuttle dataset for #BV vs Test error
Figure B.26: Algorithm-6 on Shuttle dataset for Time vs Test error

Appendix C Machine Configuration and Dataset Details


The kernel functions used in this work are:

1. Radial Basis Function (Gaussian kernel): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))

2. Polynomial kernel of degree p: K(x_i, x_j) = (x_i · x_j + 1)^p

where the hyperparameter for the RBF kernel is σ^2 (in addition to the regularization parameter C), and for the polynomial case it is p, the degree of the polynomial. (An illustrative C sketch of these kernel evaluations is given at the end of this appendix.)

We implemented all these algorithms in the C language and ran them on an Intel Pentium-4 3.00GHz Linux machine with 1GB RAM. The executables were generated with GCC version 4.0.2.

Small datasets

For each of the 100 instances of the Banana, Waveform, Image, Splice, Heart, Thyroid, Diabetis and Breast-Cancer datasets, cross validation was done and the hyperparameters were set accordingly. For all these datasets we used the RBF kernel. These datasets were obtained from [2].

Large Datasets


The datasets used are Adult-8, IJCNN1, Shuttle, MNIST 3-Vs-Others, MNIST 3-Vs-8 and Vehicle. Except for Adult-8, all of them were obtained from [1]; Adult-8 was obtained from [4]. For these datasets the hyperparameters were set to the same values as in [5]. We used the RBF kernel for all datasets except MNIST, for which we used a polynomial kernel of degree 9. Refer to Table C.1 for the size and dimensionality of the large datasets used in this work.

Dataset    | Training Set Size | Dim | Test Set Size | Dim
Shuttle    | 43500             | 9   | 14500         | 9
Adult-8    | 22696             | 123 | 9865          | 123
IJCNN1     | 49990             | 22  | 91701         | 22
MNIST 3v8  | 11982             | 769 | 1984          | 769
MNIST 3vO  | 60000             | 780 | 10000         | 780
Vehicle    | 78823             | 100 | 19705         | 100

Table C.1: Characteristics of the large datasets
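For concreteness, the two kernel functions listed at the beginning of this appendix can be evaluated as in the following C sketch. This is an illustration only; the function names rbf_kernel and poly_kernel are ours and are not taken from the project code.

#include <math.h>

/* Gaussian (RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2)) */
double rbf_kernel(const double *xi, const double *xj, int dim, double sigma_sqr)
{
    double dist_sqr = 0.0;
    for (int d = 0; d < dim; d++) {
        double diff = xi[d] - xj[d];
        dist_sqr += diff * diff;           /* squared Euclidean distance */
    }
    return exp(-dist_sqr / (2.0 * sigma_sqr));
}

/* Polynomial kernel of degree p: K(x_i, x_j) = (x_i . x_j + 1)^p */
double poly_kernel(const double *xi, const double *xj, int dim, int p)
{
    double dot = 0.0;
    for (int d = 0; d < dim; d++)
        dot += xi[d] * xj[d];              /* inner product of the two tuples */
    return pow(dot + 1.0, (double)p);
}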

References
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2004. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[2] G. Ratsch. Benchmark data repository. Intelligent Data Analysis Group, Fraunhofer-FIRST, Technical Report, 2005. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm

[3] John C. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14, Microsoft Research, April 1998.

[4] J. C. Platt. Adult datasets. http://www.research.microsoft.com/~jplatt/smo.html

[5] S. S. Keerthi, O. Chapelle and D. DeCoste. Building Support Vector Machines with Reduced Classifier Complexity. Journal of Machine Learning Research, 2006.

[6] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya and K. R. K. Murthy. Improvements to Platt's SMO Algorithm for SVM Classifier Design. Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production Engineering.

[7] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced Support Vector Machines. In Proc. 1st SIAM Int. Conf. on Data Mining, 2001.

[8] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

[9] A. Smola and B. Scholkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 911-918, June 2000.

[10] Thomas Friess, Nello Cristianini and Colin Campbell. The Kernel-Adatron Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines. In Proc. 15th ICML, Morgan Kaufmann Publishers, 1998.

[11] M. E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 2001.
