Graduate School of Science and Technology, Kobe University, Kobe, Japan
abe@eedept.kobe-u.ac.jp
Abstract

To resolve unclassifiable regions for pairwise support vector machines, decision directed acyclic graph support vector machines have been proposed. But with this architecture, the generalization ability depends on the tree structure. In this paper, to improve the generalization ability, we propose to optimize the structure so that the class pairs with higher generalization abilities are put in the upper nodes of the tree. We show the effectiveness of our method for some benchmark data sets.

1. Introduction

Support vector machines are originally formulated for two-class classification problems [1]. But since the decision functions of two-class support vector machines are directly determined to maximize the generalization ability, an extension to multiclass problems is not unique. There are roughly three ways to solve this problem: one-against-all, pairwise, and all-at-once classification.

The original formulation by Vapnik [1] is one-against-all classification, in which one class is separated from the remaining classes. With this formulation, however, unclassifiable regions exist. Instead of discrete decision functions, Vapnik [2, p. 438] proposed using continuous decision functions: a datum is classified into the class with the maximum value of the decision functions. Inoue and Abe [3] proposed fuzzy support vector machines, in which membership functions are defined using the decision functions. Abe [4] showed that support vector machines with continuous decision functions and fuzzy support vector machines are equivalent.

In pairwise classification, the n-class problem is converted into n(n − 1)/2 two-class problems. Kreßel [5] showed that this formulation reduces the unclassifiable regions, but they still remain. To resolve unclassifiable regions for pairwise classification, Platt, Cristianini, and Shawe-Taylor [6] proposed decision-tree-based pairwise classification called the Decision Directed Acyclic Graph (DDAG). Pontil and Verri [7] proposed using the rules of a tennis tournament to resolve unclassifiable regions. Not knowing their work, Kijsirikul and Ussivakul [8] proposed the same method and called it the Adaptive Directed Acyclic Graph (ADAG). The problem with DDAGs and ADAGs is that the generalization regions depend on the tree structure [4]. Abe and Inoue [9] extended one-against-all fuzzy support vector machines to pairwise classification.

In the all-at-once formulation we need to determine all the decision functions at once [10, 11], [2, pp. 437–440]. But this results in simultaneously solving a problem with a larger number of variables than the above-mentioned methods.

In this paper, we propose to determine the optimal structure of a decision tree. In decision trees, the unclassifiable regions are assigned to the classes associated with leaf nodes. Thus if class pairs with low generalization ability are assigned to the leaf nodes, the associated decision functions are used for classification: the classes that are difficult to separate are classified using the decision functions determined for these classes. As a measure for estimating the generalization ability, we can use any measure developed for two-class problems, because in DDAGs and ADAGs the decision functions for all the class pairs need to be determined in advance.

In Section 2 we explain two-class support vector machines, and in Section 3 we discuss pairwise SVMs. In Section 4 we explain DDAGs and ADAGs, and in Section 5 we propose to optimize DDAGs and ADAGs. In Section 6 we demonstrate the effectiveness of our method by computer experiments.

2 Two-class Support Vector Machines

Let the m-dimensional inputs x_i (i = 1, ..., M) belong to Class 1 or 2 and the associated labels be y_i = 1 for Class 1 and y_i = −1 for Class 2. Let the decision function be

    D(x) = w^t x + b,    (1)

where w is an m-dimensional vector, b is a scalar, and

    y_i D(x_i) ≥ 1 − ξ_i    for i = 1, ..., M.    (2)

Here the ξ_i are nonnegative slack variables.
The distance between the separating hyperplane D(x) = 0 and the training datum with ξ_i = 0 nearest to the hyperplane is called the margin. The hyperplane D(x) = 0 with the maximum margin is called the optimal separating hyperplane. To determine the optimal separating hyperplane, we minimize

    (1/2) ||w||^2 + C Σ_{i=1}^{M} ξ_i    (3)

subject to the constraints

    y_i (w^t x_i + b) ≥ 1 − ξ_i    for i = 1, ..., M,    (4)

where C is the margin parameter that determines the tradeoff between maximization of the margin and minimization of the classification error. The data that satisfy the equality in (4) are called support vectors.

To enhance separability, the input space is mapped into a high-dimensional dot-product space called the feature space. Let the mapping function be g(x). If the dot product in the feature space is expressed by H(x, x') = g(x)^t g(x'), then H(x, x') is called a kernel function, and we do not need to treat the feature space explicitly. The kernel functions used in this study are as follows:

1. Dot-product kernels

    H(x, x') = x^t x'.    (5)

2. Polynomial kernels

    H(x, x') = (x^t x' + 1)^d,    (6)

where d is an integer.

3. RBF kernels

    H(x, x') = exp(−γ ||x − x'||^2),    (7)

where γ is a positive parameter for slope control.

To simplify notation, in the following we discuss support vector machines with the dot-product kernel. The extension to the feature space is straightforward.

3 Pairwise Support Vector Machines

In pairwise support vector machines, we determine the decision functions for all combinations of class pairs. In determining a decision function for a class pair, we use the training data for the corresponding two classes. Thus, in each training the number of training data is reduced considerably compared to one-against-all support vector machines, which use all the training data. But the number of decision functions is n(n − 1)/2, compared to n for one-against-all support vector machines, where n is the number of classes.

Let the decision function for class i against class j, with the maximum margin, be

    D_ij(x) = w_ij^t x + b_ij,    (8)

where w_ij is an m-dimensional vector, b_ij is a scalar, and D_ij(x) = −D_ji(x). The regions

    R_i = {x | D_ij(x) > 0, j = 1, ..., n, j ≠ i}    (9)

do not overlap, and if x is in R_i, we classify x into class i. If x is not in any R_i (i = 1, ..., n), we classify x by voting. Namely, for the input vector x we calculate

    D_i(x) = Σ_{j=1, j≠i}^{n} sign(D_ij(x)),    (10)

where

    sign(x) = 1 for x ≥ 0,  −1 for x < 0,    (11)

and classify x into the class

    arg max_{i=1,...,n} D_i(x).    (12)

If x ∈ R_i, then D_i(x) = n − 1 and D_k(x) < n − 1 for k ≠ i. Thus x is classified into class i. But if no D_i(x) equals n − 1, (12) may be satisfied for plural i's. In this case, x is unclassifiable. If the decision functions for a three-class problem are as shown in Fig. 1, the shaded region is unclassifiable since D_i(x) = 1 (i = 1, 2, and 3).

[Figure 1. Unclassifiable regions by the pairwise formulation: the boundaries D_12(x) = 0, D_13(x) = 0, and D_23(x) = 0 in the (x_1, x_2) plane leave a central region assigned to no class.]

4 Decision-tree Based Support Vector Machines

4.1 Decision Directed Acyclic Graphs

Fig. 2 shows the decision tree for the three classes shown in Fig. 1. In the figure, a barred class number indicates that x does not belong to that class. As the top-level classification, we can choose any pair of classes. Except at the leaf nodes, if D_ij(x) > 0 we consider that x does not belong to class j, and if D_ij(x) < 0, not to class i. Then if D_12(x) > 0, x does not belong to Class 2. Thus it belongs to either Class 1 or Class 3, and the next classification pair is Classes 1 and 3. The generalization regions become as shown in Fig. 3. The unclassifiable regions are resolved, but clearly the generalization regions depend on the tree structure.

Classification by a DDAG is executed by list processing. Namely, we first generate a list with the class numbers as elements. Then we calculate the decision function, for the input x, corresponding to the first and the last elements. Let these classes be i and j with D_ij(x) > 0; then we delete the element j from the list. We repeat this procedure until one element is left and classify x into the class that corresponds to the remaining element. For Fig. 2, we generate the list {1, 3, 2}. If D_12(x) > 0, we delete element 2 from the list and obtain {1, 3}. Then if D_13(x) > 0, we delete element 3 from the list. Since only 1 is left in the list, we classify x into Class 1.

Training of a DDAG is the same as for conventional pairwise support vector machines: we need to determine n(n − 1)/2 decision functions for an n-class problem. The advantage of DDAGs is that classification is faster than with conventional pairwise support vector machines or pairwise fuzzy support vector machines: in a DDAG, classification can be done by calculating only (n − 1) decision functions.

[Figure 2. Decision-tree-based pairwise classification: the root evaluates D_12(x), the second level evaluates D_13(x) or D_32(x), and the leaves are Class 1, Class 3, and Class 2.]
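The voting rule of (10)–(12) and the DDAG classification by list processing can be sketched as follows. Here `decision(i, j, x)` is a hypothetical callable returning D_ij(x), with D_ij(x) = −D_ji(x) assumed; it stands in for whatever trained pairwise decision functions are available.

```python
def vote_classify(decision, n, x):
    """Pairwise voting of Eqs. (10)-(12).

    Returns (class index, unclassifiable flag); the flag is True when the
    maximum of Eq. (12) is attained by plural classes."""
    d = [sum(1 if decision(i, j, x) >= 0 else -1   # sign() of Eq. (11)
             for j in range(n) if j != i)
         for i in range(n)]
    best = max(d)
    winners = [i for i in range(n) if d[i] == best]
    return winners[0], len(winners) > 1

def ddag_classify(decision, n, x):
    """DDAG classification by list processing: evaluate the decision
    function for the first and last classes in the list and delete the
    losing class, until one class remains (n - 1 evaluations in total)."""
    classes = list(range(n))
    while len(classes) > 1:
        i, j = classes[0], classes[-1]
        if decision(i, j, x) > 0:
            classes.pop()      # x on the class-i side: x is not in class j
        else:
            classes.pop(0)     # otherwise x is not in class i
    return classes[0]
```

For instance, with the toy rule decision(i, j, x) = x[i] − x[j], both routines agree on clearly separated inputs, while `vote_classify` additionally flags ties as unclassifiable, mirroring the shaded region of Fig. 1.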
4.2 Adaptive Directed Acyclic Graphs

An ADAG is based on the rules of a tennis tournament. For three-class problems, an ADAG has an equivalent DDAG. Reconsider the example shown in Fig. 1. Let the first-round matches be {Class 1, Class 2} and {Class 3}. Then for an input x, in the first match x is classified into Class 1 or Class 2, and in the second match x is classified into Class 3. The second-round match is then either {Class 1, Class 3} or {Class 2, Class 3} according to the outcome of the first round. The resulting generalization region for each class is the same as that shown in Fig. 3. Thus for a three-class problem there are three different ADAGs, each having an equivalent DDAG. When the number of classes is more than three, we can show that the set of ADAGs is included in the set of DDAGs [4]. According to the computer simulations in [12, 8], the classification performance of the two methods is almost identical.

[Figure 3. Generalization region by decision-tree-based pairwise classification: the unclassifiable region of Fig. 1 is resolved and assigned to Class 3.]

5 Optimizing Decision Trees

Classification by DDAGs or ADAGs is faster than by pairwise fuzzy SVMs. But the problem is that the generalization ability depends on the structure of the decision tree. In DDAGs, the unclassifiable regions are assigned to the classes associated with the leaf nodes. For example, in the DDAG for the three-class problem shown in Fig. 2, the unclassifiable region is assigned to Class 3, which is associated with the leaf node of D_32(x), as shown in Fig. 3. Since any ADAG is converted into a DDAG, the above discussion holds for ADAGs: the unclassifiable regions are assigned to the classes associated with the leaf nodes of the equivalent DDAG.

Thus, if we put the class pairs that are easily separated in the upper nodes, the unclassifiable regions are assigned to the classes that are difficult to separate. This means that the class pairs that are difficult to separate are classified by the decision boundaries determined for these pairs.

In forming a DDAG or an ADAG, we need to train SVMs for all pairs of classes. Thus, in determining the optimal structure, we can use any of the measures developed for estimating the generalization ability. In the computer experiments we use the following estimate of the generalization error E_ij for classes i and j:

    E_ij = SV_ij / M_ij,    (13)

where SV_ij is the number of support vectors for classes i and j and M_ij is the number of training data for classes i and j [1].

Therefore, the algorithm to determine the DDAG structure for an n-class problem is as follows:

1. Generate the initial list {1, ..., n}.

2. If there are no generated lists, terminate the algorithm. Otherwise, select a list and select the class pair (i, j) with the highest generalization ability from the list.

3. If the list selected at Step 2 has more than two elements, generate two lists by deleting i or j from the list. Go to Step 2.
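The structure-determination algorithm above can be sketched recursively. This is a sketch under the assumption that `error(i, j)` returns the estimate E_ij = SV_ij / M_ij of Eq. (13) for a trained pairwise machine; the tuple-based tree encoding is illustrative, not part of the original method.

```python
def build_optimal_ddag(classes, error):
    """Recursively build the optimized DDAG: at each node select the class
    pair with the highest generalization ability (lowest error estimate
    E_ij of Eq. (13)), then branch by deleting one class of the pair.

    A node is ((i, j), left, right); a leaf is the surviving class."""
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]                    # leaf: x is classified here
    # the most easily separated pair goes into the upper node
    i, j = min(((a, b) for a in classes for b in classes if a < b),
               key=lambda p: error(*p))
    left = [c for c in classes if c != j]    # D_ij(x) > 0: x is not class j
    right = [c for c in classes if c != i]   # D_ij(x) < 0: x is not class i
    return ((i, j), build_optimal_ddag(left, error),
                    build_optimal_ddag(right, error))
```

Each internal node stores the selected pair; the left branch corresponds to D_ij(x) > 0 (x is not in class j) and the right branch to D_ij(x) < 0, matching Steps 1–3 above.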
Figure 4 shows an example for a four-class problem. First we generate the list {1, 2, 3, 4}. Then at the top level we select the pair of classes with the highest generalization ability from Classes 1 to 4. Let them be Classes 1 and 2. Then we generate the two lists {2, 3, 4} and {1, 3, 4}, and we iterate the above procedure for the two lists.

[Figure 4. Determination of the DDAG for a four-class problem: the root list (1, 2, 3, 4) selects the match 1 vs 2, generating the lists (1, 3, 4) and (2, 3, 4), whose matches are 3 vs 4 and 2 vs 3, respectively.]

The above procedure determines the structure off-line. We can also determine the structure while classifying x as follows (an extension to ADAGs is straightforward):

1. Generate the initial list {1, ..., n}.

2. Select the class pair (i, j) with the highest generalization ability from the list. If x is on the Class i side of the decision function, delete j from the list; otherwise, delete i.

3. If the list has more than one element, go to Step 2. Otherwise, classify x into the class associated with the remaining element and terminate the algorithm.

6 Performance Evaluation

Since the generalization abilities of the ADAGs and DDAGs did not differ very much, in the following we show the results for DDAGs. We evaluated the performance of the proposed method using the thyroid data,^1 blood cell data, hiragana data [13], and the MNIST data^2 listed in Table 1.

^1 ftp://ics.uci.edu: pub/machine-learning-databases
^2 http://yann.lecun.com/exdb/mnist/

Table 1. Benchmark data specification

    Data          Inputs  Classes  Train.  Test
    Thyroid           21        3    3772   3428
    Blood cell        13       12    3097   3100
    Hiragana-50       50       39    4610   4610
    Hiragana-105     105       38    8375   8356
    Hiragana-13       13       38    8375   8356
    MNIST            784       10   60000  10000

We used the polynomial kernel with degree 3 and the RBF kernel with γ = 1. The range of each input was normalized into [0, 1]. We trained the support vector machines by the primal-dual interior-point method combined with the decomposition technique. For the thyroid and MNIST data we set C = 10000, and for the other data sets C = 1000. We used an Athlon MP 2000 personal computer.

Table 2 shows the recognition rates of the test data for the conventional pairwise support vector machine (SVM), the pairwise fuzzy support vector machine (FSVM) [9], and DDAGs. Column "OPT" lists the recognition rate of the optimum DDAG. For comparison, the maximum, minimum, and average recognition rates over the DDAGs are also listed. Poly3 denotes the polynomial kernel with degree 3 and RBF1 denotes the RBF kernel with γ = 1.

The number of DDAGs for a three-class problem is 3, but the number of DDAGs explodes as n increases. Thus, except for the thyroid data, we randomly generated 10000 DDAGs and calculated the maximum, minimum, and average recognition rates of the test data.

Table 2. Recognition rates of pairwise SVMs (%)
(Max., Min., Ave., and OPT refer to DDAGs.)

    Data          Kernel   SVM    FSVM   Max.   Min.   Ave.   OPT
    Thyroid       Poly3    97.81  97.87  97.89  97.81  97.86  97.81
                  RBF1     97.26  97.37  97.40  97.26  97.34  97.26
    Blood cell    Poly3    92.07  92.77  93.00  92.32  92.47  92.65
                  RBF1     91.93  92.23  92.41  91.84  92.08  92.00
    Hiragana-50   Poly3    98.57  98.89  99.37  98.52  98.75  98.61
                  RBF1     98.37  98.85  99.22  98.29  98.56  98.50
    Hiragana-105  Poly3    100    100    100    100    100    100
                  RBF1     99.98  100    100    99.98  99.99  99.99
    Hiragana-13   Poly3    99.63  99.66  99.72  99.58  99.64  99.66
                  RBF1     99.61  99.65  99.70  99.55  99.60  99.63
    MNIST         Poly3    97.85  98.01  97.99  97.91  97.96  97.91
                  RBF1     97.18  97.78  97.49  97.31  97.41  97.54

From the table, for the thyroid data the recognition rate of the optimum DDAG is the minimum recognition rate among the 3 DDAGs. But in general, the recognition rate of the optimum DDAG is comparable to the average recognition rate of the DDAGs.

The recognition rates of FSVMs are equal to or better than the average recognition rates of DDAGs and comparable to the maximum recognition rates of DDAGs.

For pairwise classification, training time is the same for fuzzy SVMs and DDAGs. But the advantage of DDAGs and ADAGs is short classification time. Table 3 lists the classification times for the blood cell and hiragana-50 test data. Classification by DDAGs is much faster than by SVMs and FSVMs.

7 Conclusions

In this paper we proposed to optimize the structure of a decision directed acyclic graph support vector machine by putting the class pairs with higher generalization ability at the upper nodes of the decision tree. According to the computer experiments for several benchmark data sets, the optimized DDAGs show the average recognition rates of the DDAGs.
Table 3. Classification time comparison (s)

    Data         Parm   DDAG  SVM  FSVM
    Blood cell   Poly3   0.9  1.2   2.4
                 RBF1    0.7  2.6   5.2
    Hiragana-50  Poly3    15   66   114
                 RBF1     17   76   159

References

[1] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, London, UK, 1995.

[2] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.

[3] T. Inoue and S. Abe. Fuzzy support vector machines for pattern classification. In Proceedings of International Joint Conference on Neural Networks (IJCNN '01), volume 2, pages 1449–1454, July 2001.

[4] S. Abe. Analysis of multiclass support vector machines. In Proceedings of International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA'2003), pages 385–396, Vienna, Austria, February 2003.

[5] U. H.-G. Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255–268. The MIT Press, Cambridge, MA, 1999.

[6] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 547–553. The MIT Press, Cambridge, MA, 2000.

[7] M. Pontil and A. Verri. Support vector machines for 3-D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.

[8] B. Kijsirikul and N. Ussivakul. Multiclass support vector machines using adaptive directed acyclic graph. In Proceedings of International Joint Conference on Neural Networks (IJCNN 2002), pages 980–985, 2002.

[9] S. Abe and T. Inoue. Fuzzy support vector machines for multiclass problems. In Proceedings of the Tenth European Symposium on Artificial Neural Networks (ESANN 2002), pages 116–118, Bruges, Belgium, April 2002.

[10] K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 307–326. The MIT Press, Cambridge, MA, 1999.

[11] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN'99), pages 219–224, 1999.

[12] C. Nakajima, M. Pontil, and T. Poggio. People recognition and pose estimation in image sequences. In Proceedings of International Joint Conference on Neural Networks (IJCNN 2000), volume IV, pages 189–194, 2000.

[13] S. Abe. Pattern Classification: Neuro-fuzzy Methods and Their Comparison. Springer-Verlag, London, UK, 2001.