Graduate School of Science and Technology, Kobe University, Kobe, Japan
abe@eedept.kobe-u.ac.jp
Abstract

To resolve unclassifiable regions for pairwise support vector machines, decision directed acyclic graph support vector machines have been proposed. But with this architecture, the generalization ability depends on the tree structure. In this paper, to improve the generalization ability, we propose to optimize the structure so that the class pairs with higher generalization abilities are put in the upper nodes of the tree. We show the effectiveness of our method for some benchmark data sets.

1. Introduction

Support vector machines are originally formulated for two-class classification problems [1]. But since the decision functions of two-class support vector machines are directly determined to maximize the generalization ability, an extension to multiclass problems is not unique. There are roughly three ways to solve this problem: one-against-all, pairwise, and all-at-once classification.

The original formulation by Vapnik [1] is one-against-all classification, in which one class is separated from the remaining classes. With this formulation, however, unclassifiable regions exist. Instead of discrete decision functions, Vapnik [2, p. 438] proposed using continuous decision functions: a datum is classified into the class with the maximum value of the decision functions. Inoue and Abe [3] proposed fuzzy support vector machines, in which membership functions are defined using the decision functions. Abe [4] showed that support vector machines with continuous decision functions and fuzzy support vector machines are equivalent.

In pairwise classification, the n-class problem is converted into n(n − 1)/2 two-class problems. Kreßel [5] showed that this formulation reduces the unclassifiable regions, but they still remain. To resolve unclassifiable regions for pairwise classification, Platt, Cristianini, and Shawe-Taylor [6] proposed decision-tree-based pairwise classification called the Decision Directed Acyclic Graph (DDAG). Pontil and Verri [7] proposed using the rules of a tennis tournament to resolve unclassifiable regions. Not knowing their work, Kijsirikul and Ussivakul [8] proposed the same method and called it the Adaptive Directed Acyclic Graph (ADAG). The problem with DDAGs and ADAGs is that the generalization regions depend on the tree structure [4]. Abe and Inoue [9] extended one-against-all fuzzy support vector machines to pairwise classification.

In the all-at-once formulation we need to determine all the decision functions at once [10, 11], [2, pp. 437–440]. But this results in simultaneously solving a problem with a larger number of variables than the above-mentioned methods.

In this paper, we propose to determine the optimal structure of a decision tree. In decision trees, the unclassifiable regions are assigned to the classes associated with leaf nodes. Thus if class pairs with low generalization ability are assigned to the leaf nodes, the associated decision functions are used for classification: the classes that are difficult to separate are classified using the decision functions determined for these classes. As a measure for estimating the generalization ability, we can use any measure developed for two-class problems, because in DDAGs and ADAGs the decision functions for all the class pairs need to be determined in advance.

In Section 2 we explain two-class support vector machines, and in Section 3 we discuss pairwise SVMs. In Section 4 we explain DDAGs and ADAGs, and in Section 5 we propose to optimize DDAGs and ADAGs. In Section 6 we demonstrate the effectiveness of our method by computer experiments.

2 Two-class Support Vector Machines

Let the m-dimensional inputs x_i (i = 1, ..., M) belong to Class 1 or 2 and the associated labels be y_i = 1 for Class 1 and y_i = −1 for Class 2. Let the decision function be

    D(x) = w^t x + b,    (1)

where w is an m-dimensional vector, b is a scalar, and

    y_i D(x_i) ≥ 1 − ξ_i    for i = 1, ..., M.    (2)

Here the ξ_i are nonnegative slack variables.
The distance between the separating hyperplane D(x) = 0 and the training datum with ξ_i = 0 nearest to the hyperplane is called the margin. The hyperplane D(x) = 0 with the maximum margin is called the optimal separating hyperplane. To determine the optimal separating hyperplane, we minimize

    (1/2) ||w||^2 + C Σ_{i=1}^{M} ξ_i    (3)

subject to the constraints

    y_i (w^t x_i + b) ≥ 1 − ξ_i    for i = 1, ..., M,    (4)

where C is the margin parameter that determines the tradeoff between maximization of the margin and minimization of the classification error. The data that satisfy the equality in (4) are called support vectors.

To enhance separability, the input space is mapped into a high-dimensional dot-product space called the feature space. Let the mapping function be g(x). If the dot product in the feature space is expressed by H(x, x') = g(x)^t g(x'), then H(x, x') is called a kernel function, and we do not need to treat the feature space explicitly. The kernel functions used in this study are as follows:

1. Dot-product kernels

    H(x, x') = x^t x'.    (5)

2. Polynomial kernels

    H(x, x') = (x^t x' + 1)^d,    (6)

where d is an integer.

3. RBF kernels

    H(x, x') = exp(−γ ||x − x'||^2),    (7)

where γ is a positive parameter for slope control.

To simplify notation, in the following we discuss support vector machines with the dot-product kernel. The extension to the feature space is straightforward.

3 Pairwise Support Vector Machines

In pairwise support vector machines, we determine the decision functions for all combinations of class pairs. In determining a decision function for a class pair, we use the training data for the corresponding two classes. Thus, in each training the number of training data is reduced considerably compared to one-against-all support vector machines, which use all the training data. But the number of decision functions is n(n − 1)/2, compared to n for one-against-all support vector machines, where n is the number of classes.

Let the decision function for class i against class j, with the maximum margin, be

    D_ij(x) = w_ij^t x + b_ij,    (8)

where w_ij is an m-dimensional vector, b_ij is a scalar, and D_ij(x) = −D_ji(x). The regions

    R_i = {x | D_ij(x) > 0, j = 1, ..., n, j ≠ i}    (9)

do not overlap, and if x is in R_i, we classify x into class i. If x is not in any R_i (i = 1, ..., n), we classify x by voting. Namely, for the input vector x we calculate

    D_i(x) = Σ_{j=1, j≠i}^{n} sign(D_ij(x)),    (10)

where

    sign(x) = 1 for x ≥ 0,  −1 for x < 0,    (11)

and classify x into the class

    arg max_{i=1,...,n} D_i(x).    (12)

If x ∈ R_i, then D_i(x) = n − 1 and D_k(x) < n − 1 for k ≠ i. Thus x is classified into class i. But if no D_i(x) equals n − 1, (12) may be satisfied for plural i's. In this case, x is unclassifiable. If the decision functions for a three-class problem are as shown in Fig. 1, the shaded region is unclassifiable since D_i(x) = 1 (i = 1, 2, and 3).

[Figure 1. Unclassifiable regions by the pairwise formulation: the boundaries D_12(x) = 0, D_13(x) = 0, and D_23(x) = 0 in the (x_1, x_2) plane leave a central region assigned to no class.]

4 Decision-tree Based Support Vector Machines

4.1 Decision Directed Acyclic Graphs

Fig. 2 shows the decision tree for the three classes shown in Fig. 1. In the figure, a barred class number indicates that x does not belong to that class. As the top-level classification, we can choose any pair of classes. Except at the leaf nodes, if D_ij(x) > 0 we consider that x does not belong to class j, and if D_ij(x) < 0, not to class i. Then if D_12(x) > 0, x does not belong to Class 2. Thus it belongs to either Class 1 or Class 3, and the next classification pair is Classes 1 and 3. The generalization regions become as shown in Fig. 3. The unclassifiable regions are resolved, but clearly the generalization regions depend on the tree structure.

Classification by a DDAG is executed by list processing. Namely, we first generate a list with the class numbers as elements. Then we calculate the decision function, for the input x, corresponding to the first and the last elements. Let these classes be i and j with D_ij(x) > 0; then we delete the element j from the list. We repeat this procedure until one element is left and classify x into the class that corresponds to the remaining element. For Fig. 2, we generate the list {1, 3, 2}. If D_12(x) > 0, we delete element 2 from the list and obtain {1, 3}. Then if D_13(x) > 0, we delete element 3 from the list. Since only 1 is left in the list, we classify x into Class 1.

Training of a DDAG is the same as for conventional pairwise support vector machines: we need to determine n(n − 1)/2 decision functions for an n-class problem. The advantage of DDAGs is that classification is faster than with conventional pairwise support vector machines or pairwise fuzzy support vector machines: in a DDAG, classification can be done by calculating only (n − 1) decision functions.

[Figure 2. Decision-tree-based pairwise classification: the root evaluates D_12(x), the second level evaluates D_13(x) or D_32(x), and the leaves are Class 1, Class 3, and Class 2.]
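The voting rule of (10)–(12) and the DDAG classification by list processing can be sketched as follows. Here `decision(i, j, x)` is a hypothetical callable returning D_ij(x), with D_ij(x) = −D_ji(x) assumed; it stands in for whatever trained pairwise decision functions are available.

```python
def vote_classify(decision, n, x):
    """Pairwise voting of Eqs. (10)-(12).

    Returns (class index, unclassifiable flag); the flag is True when the
    maximum of Eq. (12) is attained by plural classes."""
    d = [sum(1 if decision(i, j, x) >= 0 else -1   # sign() of Eq. (11)
             for j in range(n) if j != i)
         for i in range(n)]
    best = max(d)
    winners = [i for i in range(n) if d[i] == best]
    return winners[0], len(winners) > 1

def ddag_classify(decision, n, x):
    """DDAG classification by list processing: evaluate the decision
    function for the first and last classes in the list and delete the
    losing class, until one class remains (n - 1 evaluations in total)."""
    classes = list(range(n))
    while len(classes) > 1:
        i, j = classes[0], classes[-1]
        if decision(i, j, x) > 0:
            classes.pop()      # x on the class-i side: x is not in class j
        else:
            classes.pop(0)     # otherwise x is not in class i
    return classes[0]
```

For instance, with the toy rule decision(i, j, x) = x[i] − x[j], both routines agree on clearly separated inputs, while `vote_classify` additionally flags ties as unclassifiable, mirroring the shaded region of Fig. 1.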
4.2 Adaptive Directed Acyclic Graphs

An ADAG is based on the rules of a tennis tournament. For three-class problems, an ADAG has an equivalent DDAG. Reconsider the example shown in Fig. 1. Let the first-round matches be {Class 1, Class 2} and {Class 3}. Then for an input x, in the first match x is classified into Class 1 or Class 2, and in the second match x is classified into Class 3. The second-round match is then either {Class 1, Class 3} or {Class 2, Class 3} according to the outcome of the first round. The resulting generalization region for each class is the same as that shown in Fig. 3. Thus for a three-class problem there are three different ADAGs, each having an equivalent DDAG. When the number of classes is more than three, we can show that the set of ADAGs is included in the set of DDAGs [4]. According to the computer simulations in [12, 8], the classification performance of the two methods is almost identical.

[Figure 3. Generalization region by decision-tree-based pairwise classification: the unclassifiable region of Fig. 1 is resolved and assigned to Class 3.]

5 Optimizing Decision Trees

Classification by DDAGs or ADAGs is faster than by pairwise fuzzy SVMs. But the problem is that the generalization ability depends on the structure of the decision tree. In DDAGs, the unclassifiable regions are assigned to the classes associated with the leaf nodes. For example, in the DDAG for the three-class problem shown in Fig. 2, the unclassifiable region is assigned to Class 3, which is associated with the leaf node of D_32(x), as shown in Fig. 3. Since any ADAG is converted into a DDAG, the above discussion holds for ADAGs: the unclassifiable regions are assigned to the classes associated with the leaf nodes of the equivalent DDAG.

Thus, if we put the class pairs that are easily separated in the upper nodes, the unclassifiable regions are assigned to the classes that are difficult to separate. This means that the class pairs that are difficult to separate are classified by the decision boundaries determined for these pairs.

In forming a DDAG or an ADAG, we need to train SVMs for all pairs of classes. Thus, in determining the optimal structure, we can use any of the measures developed for estimating the generalization ability. In the computer experiments we use the following estimate of the generalization error E_ij for classes i and j:

    E_ij = SV_ij / M_ij,    (13)

where SV_ij is the number of support vectors for classes i and j and M_ij is the number of training data for classes i and j [1].

Therefore, the algorithm to determine the DDAG structure for an n-class problem is as follows:

1. Generate the initial list {1, ..., n}.

2. If there are no generated lists, terminate the algorithm. Otherwise, select a list and select the class pair (i, j) with the highest generalization ability from the list.

3. If the list selected at Step 2 has more than two elements, generate two lists by deleting i or j from the list. Go to Step 2.
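The structure-determination algorithm above can be sketched recursively. This is a sketch under the assumption that `error(i, j)` returns the estimate E_ij = SV_ij / M_ij of Eq. (13) for a trained pairwise machine; the tuple-based tree encoding is illustrative, not part of the original method.

```python
def build_optimal_ddag(classes, error):
    """Recursively build the optimized DDAG: at each node select the class
    pair with the highest generalization ability (lowest error estimate
    E_ij of Eq. (13)), then branch by deleting one class of the pair.

    A node is ((i, j), left, right); a leaf is the surviving class."""
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]                    # leaf: x is classified here
    # the most easily separated pair goes into the upper node
    i, j = min(((a, b) for a in classes for b in classes if a < b),
               key=lambda p: error(*p))
    left = [c for c in classes if c != j]    # D_ij(x) > 0: x is not class j
    right = [c for c in classes if c != i]   # D_ij(x) < 0: x is not class i
    return ((i, j), build_optimal_ddag(left, error),
                    build_optimal_ddag(right, error))
```

Each internal node stores the selected pair; the left branch corresponds to D_ij(x) > 0 (x is not in class j) and the right branch to D_ij(x) < 0, matching Steps 1–3 above.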
Figure 4 shows an example for a four-class problem. First we generate the list {1, 2, 3, 4}. Then at the top level we select the pair of classes with the highest generalization ability from Classes 1 to 4. Let them be Classes 1 and 2. Then we generate the two lists {2, 3, 4} and {1, 3, 4}, and we iterate the above procedure for the two lists.

[Figure 4. Determination of the DDAG for a four-class problem: the root list (1, 2, 3, 4) selects the match 1 vs 2, generating the lists (1, 3, 4) and (2, 3, 4), whose matches are 3 vs 4 and 2 vs 3, respectively.]

The above procedure determines the structure off-line. We can also determine the structure while classifying x as follows (an extension to ADAGs is straightforward):

1. Generate the initial list {1, ..., n}.

2. Select the class pair (i, j) with the highest generalization ability from the list. If x is on the Class i side of the decision function, delete j from the list; otherwise, delete i.

3. If the list has more than one element, go to Step 2. Otherwise, classify x into the class associated with the remaining element and terminate the algorithm.

6 Performance Evaluation

Since the generalization abilities of the ADAGs and DDAGs did not differ very much, in the following we show the results for DDAGs. We evaluated the performance of the proposed method using the thyroid data,^1 blood cell data, hiragana data [13], and the MNIST data^2 listed in Table 1.

^1 ftp://ics.uci.edu: pub/machine-learning-databases
^2 http://yann.lecun.com/exdb/mnist/

Table 1. Benchmark data specification

    Data          Inputs  Classes  Train.  Test
    Thyroid           21        3    3772   3428
    Blood cell        13       12    3097   3100
    Hiragana-50       50       39    4610   4610
    Hiragana-105     105       38    8375   8356
    Hiragana-13       13       38    8375   8356
    MNIST            784       10   60000  10000

We used the polynomial kernel with degree 3 and the RBF kernel with γ = 1. The range of each input was normalized into [0, 1]. We trained the support vector machines by the primal-dual interior-point method combined with the decomposition technique. For the thyroid and MNIST data we set C = 10000, and for the other data sets C = 1000. We used an Athlon MP 2000 personal computer.

Table 2 shows the recognition rates of the test data for the conventional pairwise support vector machine (SVM), the pairwise fuzzy support vector machine (FSVM) [9], and DDAGs. Column "OPT" lists the recognition rate of the optimum DDAG. For comparison, the maximum, minimum, and average recognition rates over the DDAGs are also listed. Poly3 denotes the polynomial kernel with degree 3 and RBF1 denotes the RBF kernel with γ = 1.

The number of DDAGs for a three-class problem is 3, but the number of DDAGs explodes as n increases. Thus, except for the thyroid data, we randomly generated 10000 DDAGs and calculated the maximum, minimum, and average recognition rates of the test data.

Table 2. Recognition rates of pairwise SVMs (%)
(Max., Min., Ave., and OPT refer to DDAGs.)

    Data          Kernel   SVM    FSVM   Max.   Min.   Ave.   OPT
    Thyroid       Poly3    97.81  97.87  97.89  97.81  97.86  97.81
                  RBF1     97.26  97.37  97.40  97.26  97.34  97.26
    Blood cell    Poly3    92.07  92.77  93.00  92.32  92.47  92.65
                  RBF1     91.93  92.23  92.41  91.84  92.08  92.00
    Hiragana-50   Poly3    98.57  98.89  99.37  98.52  98.75  98.61
                  RBF1     98.37  98.85  99.22  98.29  98.56  98.50
    Hiragana-105  Poly3    100    100    100    100    100    100
                  RBF1     99.98  100    100    99.98  99.99  99.99
    Hiragana-13   Poly3    99.63  99.66  99.72  99.58  99.64  99.66
                  RBF1     99.61  99.65  99.70  99.55  99.60  99.63
    MNIST         Poly3    97.85  98.01  97.99  97.91  97.96  97.91
                  RBF1     97.18  97.78  97.49  97.31  97.41  97.54

From the table, for the thyroid data the recognition rate of the optimum DDAG is the minimum recognition rate among the 3 DDAGs. But in general, the recognition rate of the optimum DDAG is comparable to the average recognition rate of the DDAGs.

The recognition rates of FSVMs are equal to or better than the average recognition rates of DDAGs and comparable to the maximum recognition rates of DDAGs.

For pairwise classification, training time is the same for fuzzy SVMs and DDAGs. But the advantage of DDAGs and ADAGs is short classification time. Table 3 lists the classification times for the blood cell and hiragana-50 test data. Classification by DDAGs is much faster than by SVMs and FSVMs.

7 Conclusions

In this paper we proposed to optimize the structure of a decision directed acyclic graph support vector machine by putting the class pairs with higher generalization ability at the upper nodes of the decision tree. According to the computer experiments for several benchmark data sets, the optimized DDAGs show the average recognition rates of the DDAGs.
Table 3. Classification time comparison (s)

    Data         Parm   DDAG  SVM  FSVM
    Blood cell   Poly3   0.9  1.2   2.4
                 RBF1    0.7  2.6   5.2
    Hiragana-50  Poly3    15   66   114
                 RBF1     17   76   159

References

[1] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, London, UK, 1995.

[2] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.

[3] T. Inoue and S. Abe. Fuzzy support vector machines for pattern classification. In Proceedings of International Joint Conference on Neural Networks (IJCNN '01), volume 2, pages 1449–1454, July 2001.

[4] S. Abe. Analysis of multiclass support vector machines. In Proceedings of International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA'2003), pages 385–396, Vienna, Austria, February 2003.

[5] U. H.-G. Kreßel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255–268. The MIT Press, Cambridge, MA, 1999.

[6] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 547–553. The MIT Press, Cambridge, MA, 2000.

[7] M. Pontil and A. Verri. Support vector machines for 3-D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.

[8] B. Kijsirikul and N. Ussivakul. Multiclass support vector machines using adaptive directed acyclic graph. In Proceedings of International Joint Conference on Neural Networks (IJCNN 2002), pages 980–985, 2002.

[9] S. Abe and T. Inoue. Fuzzy support vector machines for multiclass problems. In Proceedings of the Tenth European Symposium on Artificial Neural Networks (ESANN 2002), pages 116–118, Bruges, Belgium, April 2002.

[10] K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 307–326. The MIT Press, Cambridge, MA, 1999.

[11] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN'99), pages 219–224, 1999.

[12] C. Nakajima, M. Pontil, and T. Poggio. People recognition and pose estimation in image sequences. In Proceedings of International Joint Conference on Neural Networks (IJCNN 2000), volume IV, pages 189–194, 2000.

[13] S. Abe. Pattern Classification: Neuro-fuzzy Methods and Their Comparison. Springer-Verlag, London, UK, 2001.