

Multivariate alternating decision trees


Hong Kuan Sok a,*, Melanie Po-Leen Ooi a,b, Ye Chow Kuang a, Serge Demidenko a,c

a Advanced Engineering Platform & Electrical and Computer Systems Engineering, School of Engineering, Monash University, 47500 Bandar Sunway, Malaysia
b School of Engineering & Physical Sciences, Heriot-Watt University, 62200 Putrajaya, Malaysia
c School of Engineering & Advanced Technology, Massey University, Private Bag 1102904, Auckland 0745, New Zealand

* Corresponding author. Tel.: +60 35 514 6238; fax: +60 35 514 6207. E-mail address: sok.hong.kuan@monash.edu (H.K. Sok).

Article info

Article history: Received 5 May 2015; Received in revised form 7 July 2015; Accepted 17 August 2015.
Keywords: Alternating decision tree; Boosting; Multivariate decision tree; Lasso; LARS.
Pattern Recognition (2015), http://dx.doi.org/10.1016/j.patcog.2015.08.014

Abstract

Decision trees are comprehensible, but at the cost of a relatively lower prediction accuracy compared to other powerful black-box classifiers such as SVMs. Boosting has been a popular strategy to create an ensemble of decision trees to improve their classification performance, but at the expense of the comprehensibility advantage. To this end, the alternating decision tree (ADTree) has been proposed to allow boosting within a single decision tree so that comprehensibility is retained. However, existing ADTrees are univariate, which limits their applicability. This research proposes a novel algorithm – the multivariate ADTree. It presents and discusses its different variations (Fisher's ADTree, Sparse ADTree, and Regularized Logistic ADTree) along with their empirical validation on a set of publicly available datasets. It is shown that the multivariate ADTree has high prediction accuracy comparable to that of decision tree ensembles, while retaining comprehensibility close to that of individual univariate decision trees.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Decision trees are among the most powerful and popular classifiers available. They are acyclic-directed graphical models that solve classification problems using a symbolic representation, i.e., a graph of decision nodes that are connected via edges (Fig. 1(a)). As a result, they follow flowchart-like human logic and reasoning, making them highly comprehensible. Decision trees model the domain problem as a set of decision rules. Such a model is transparent and understandable to specialists in relevant application areas [1]. For example, in [2] medical experts used the quantitative information obtained from an alternating decision tree model to gain a better understanding of the relation between disease phenotypes and affection status. The comprehensibility trait therefore makes decision trees highly accessible to users outside the machine learning community, and hence they can be found in a wide range of applications such as business [3], manufacturing [4], computational biology [5], bioinformatics [6], etc.

It is often possible to further improve the classification accuracy of an individual decision tree by combining a number of decision trees to make majority-voted decisions [7]. There are two popular strategies to achieve this: bagging [8] and boosting [9]. Unfortunately, an ensemble of decision trees results in many variations in the symbolic representation, which causes the overall classifier to be large, complex, and difficult to interpret. This negates the comprehensibility advantage of being a decision tree [10]. The issue of large and incomprehensible boosted decision trees led to the invention of the alternating decision tree (ADTree), which was designed to retain interpretability in the boosting paradigm [10]. Rather than building a decision tree at every boosting cycle, a much simpler decision stump is created.

Fig. 1(b) shows a graphical illustration of the ADTree. Similar to the decision tree in Fig. 1(a), the ADTree is also an acyclic-directed graphical model. However, the symbolic meaning of each node and the manner in which the nodes are connected are different. It does not use leaf nodes as the terminal nodes, or decision nodes as the internal nodes. Instead, many decision stumps (or one-level decision trees) are combined to obtain a special representation where each of the stumps consists of a decision node and two prediction nodes.

ADTree can be viewed as a loose generalization of standard decision trees, boosted decision trees, and boosted decision stumps [10] for the following reasons. First, ADTree can be used as an alternative representation of any standard decision tree model with the same functionality. In addition, ADTree allows multiple decision stumps under the same prediction node to obtain majority-voted decisions. Boosting can thus be implemented directly within the same tree, as opposed to the conventional way of creating boosted decision trees or boosted decision stumps. There are a number of extensions of ADTree such as multi-label ADTree [11], multi-class ADTree [12] and complex-feature ADTree [13]. ADTree has been successfully applied in various areas such as genetic disorders [14], corporate performance prediction [3], and bioinformatics [15].


Fig. 1. Decision trees: (a) a classical decision tree consisting of decision nodes as internal nodes and leaf nodes as terminal nodes; (b) an alternating decision tree which can be used to represent the standard decision tree shown in part (a) to make the same prediction; and (c) the accommodation of boosting in the ADTree, whereby more decision stumps can be added to any existing prediction node (highlighted in a circle) to obtain majority-voted decisions.

Unfortunately, there are two major drawbacks of using univariate decision nodes in ADTree. First, as with any other univariate decision tree, splitting on a single feature produces an axis-parallel partitioning of the input space. This leads to a high bias and generates large decision trees in classification problems that have co-dependent features. The resultant large and complex decision tree complicates the interpretation process. Second, ADTree induction is based on the probably approximately correct (PAC) learning framework, which requires a weak learner to achieve an error rate ε that is slightly better than random guessing for binary class problems; formally ε ≤ 0.5 − Ψ for a small constant Ψ (known as the edge). Unfortunately, simple univariate decision stumps sometimes do not satisfy this weak learning condition. This causes the boosting procedure to fail in generating a functioning ADTree model [16].

The aim of this paper is to present a novel multivariate alternating decision tree learning algorithm with boosting capability that offers improved classification performance while remaining comprehensible. The goals are to:

1. Outperform the existing univariate ADTree and multivariate (unboosted) decision trees in terms of prediction accuracy while offering good comprehensibility;
2. Match the performance of univariate decision trees for univariate problems while outperforming them on multivariate datasets;
3. Provide superior comprehensibility compared to ensemble-based decision trees.

There are several different parts of the existing ADTree algorithm that can be restructured in order to induce a multivariate ADTree. In this paper, three possible variations are explored, namely Fisher's ADTree, Sparse ADTree [17], and regularized Logistic ADTree. The Sparse ADTree presented in the earlier paper was the first attempt to induce a multivariate ADTree. The current paper presents significantly new and further developed results that cover two additional multivariate ADTree designs. This increases the material coverage and comprehension as well as applicability by practitioners and researchers in the field. In addition, the experiments are significantly more rigorous, with extended discussions on the validity, usage and applicability of the multivariate alternating decision trees. All three variants of the multivariate ADTree were tested on a set of real-world datasets against a number of established decision tree learning algorithms: the original univariate ADTree [10]; the univariate decision trees C4.5 [18] and CART [19]; the multivariate decision tree – Fisher's decision tree [20]; and ensembles of decision trees – Boosted C4.5 and oblique Random Forest [21]. Note that there are other variants of decision trees presented in the literature (e.g., [22,23]). However, the benchmarking algorithms were selected based on the availability of source code, and they are used as representatives of the different decision tree families. This was done in order to compare the overall prediction accuracy, induction time, tree size and complexity/comprehensibility against different families of decision trees. For statistical verification and comparisons, standard 10 × 10 fold stratified cross-validations were performed on all datasets to generate performance estimations.

The rest of this paper is organized as follows: Section 2 provides a brief literature review on supervised learning, boosting, and ADTree. The proposed multivariate ADTree algorithms are presented in Section 3. The experimental setup and obtained results are given in Section 4 together with detailed discussions. Section 5 presents the conclusion and outlines future work.

2. Background on alternating decision tree

2.1. Supervised learning framework

For better readability, the notations used in this paper are first described. Vectors are typed in bold (e.g., x) and they are all column vectors unless specified otherwise. Scalars are typed in regular font (e.g., λ). Matrices are given in capital bold (e.g., X). Specific entries in vectors are indexed with a scalar. For example, the ith entry of a column vector x is denoted as x_i. For matrices, the entry of the ith row and jth column of a matrix X is denoted as X_ij. The entire ith row of a matrix X is denoted as X_i and the entire jth column of a matrix X is denoted as X_j.

Under supervised learning, a training dataset [X, y] consists of a set of n labeled samples, where each sample x ∈ R^p is a real-valued column vector of p features and its corresponding label y ∈ {+1, −1} assumes either the positive or the negative class for a binary classification problem. The dimension of the design matrix X is n × p, and the column vector y is of length n. The ith row of the design matrix X, or X_i, refers to the ith sample as a transposed vector, i.e., x^T. The goal of a decision tree learning algorithm is to learn a single classification model. For ensemble learning, the weak learner is repeatedly called to learn multiple models.

2.2. Boosting

Boosting is an important development in the field of machine learning. It allows for any choice of a prevalent learning algorithm as long as the weak learning condition ε ≤ 0.5 − Ψ is satisfied for binary class problems. Paper [24] shows that decision trees are a popular choice as weak learners due to their inherent instability to small variations in training datasets. Boosting creates such variations through a weight distribution over the training samples by sequential reweighting. This paper implements two different boosting algorithms to induce the multivariate ADTree, namely AdaBoost and LogitBoost (see Table 1).


Table 1
AdaBoost and LogitBoost algorithms.

AdaBoost
Input: training dataset [X, y]
1. Initialize: w_i = 1/n for all i.
2. For t = 1, ..., T:
   2.1. Obtain f_t(x) from a weak learner.
   2.2. Determine \alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t} with \varepsilon_t = \sum_{i=1}^{n} w_i^{(t)} I(f_t(X_i) \neq y_i).
   2.3. Update the weight distribution w_i^{(t+1)} = w_i^{(t)} \exp(-y_i \alpha_t f_t(X_i)).
Output: F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)

LogitBoost
Input: training dataset [X, y]
1. Initialize: y_i^* = (y_i + 1)/2, w_i = 1/n, G(X_i) = 0, and p(X_i) = 0.5 for all i.
2. For t = 1, ..., T:
   2.1. Compute a working response z_i = \frac{y_i^* - p(X_i)}{p(X_i)(1 - p(X_i))} and weights w_i = p(X_i)(1 - p(X_i)).
   2.2. Fit g_t(x) by the weighted least-squares regression of z_i to X_i with weights w_i.
   2.3. Update G(X_i) \leftarrow G(X_i) + \frac{1}{2} g_t(X_i) and p(X_i) = \frac{\exp(G(X_i))}{\exp(G(X_i)) + \exp(-G(X_i))}.
Output: F(x) = \sum_{t=1}^{T} g_t(x)
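For illustration, the following is a minimal NumPy sketch of the two update rules summarized in Table 1. It is not the authors' implementation: the weak learner is left abstract (the `fit_weak_learner` and `fit_weighted_regressor` callbacks are hypothetical), and only the boosting bookkeeping is shown, not the ADTree induction itself.

```python
import numpy as np

def adaboost(X, y, fit_weak_learner, T=50):
    """AdaBoost as in Table 1: y in {-1,+1}; fit_weak_learner(X, y, w) -> f with f(X) in {-1,+1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: uniform weights
    models, alphas = [], []
    for _ in range(T):
        f = fit_weak_learner(X, y, w)            # step 2.1
        eps = np.sum(w * (f(X) != y))            # weighted error
        if eps >= 0.5 or eps == 0:               # weak-learning condition violated (or perfect fit)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # step 2.2
        w = w * np.exp(-y * alpha * f(X))        # step 2.3 (renormalized, as is standard practice)
        w /= w.sum()
        models.append(f); alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * f(Xq) for a, f in zip(alphas, models)))

def logitboost(X, y, fit_weighted_regressor, T=50):
    """LogitBoost as in Table 1: fit_weighted_regressor(X, z, w) -> g with real-valued g(X)."""
    n = len(y)
    ystar = (y + 1) / 2.0                        # step 1: targets in {0,1}
    G = np.zeros(n)
    p = np.full(n, 0.5)
    models = []
    for _ in range(T):
        w = np.clip(p * (1 - p), 1e-12, None)    # step 2.1: weights
        z = (ystar - p) / w                      # working response (pseudo-label)
        g = fit_weighted_regressor(X, z, w)      # step 2.2
        G += 0.5 * g(X)                          # step 2.3
        p = np.exp(G) / (np.exp(G) + np.exp(-G))
        models.append(g)
    return lambda Xq: np.sign(sum(g(Xq) for g in models))
```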

AdaBoost initializes the weight distribution w as a uniform one with an initial weight value of 1/n. The weight of the ith sample at the tth boosting procedure is indicated as w_i^{(t)}. AdaBoost then repeats for T boosting procedures to obtain a weak model f_t(x) from the weak learner, determines the linear coefficient α_t of the weak model based on the error ε_t, and updates the weight distribution before the next boosting procedure starts. The indicator function I(·) returns 1 if the Boolean expression inside the function evaluates to True. The output is obtained through a linear combination of the weak models.

For LogitBoost, the uniform weight distribution is initialized in the same manner as in AdaBoost. In addition, it also keeps track of the probability estimate of the positive class p(x) and the regression value G(x) for all n samples. It then repeats for T boosting procedures. The working response (or pseudo-label z) and the weight distribution are updated at the beginning of every boosting procedure. A regression function g(x) is fitted to the weighted least-squares regression problem. The regression values of all training samples are updated to calculate the new probability estimate of each sample at the end of every boosting procedure. The output is a regression function, whereby classification is achieved by taking the sign of the summation.

2.3. Alternating decision tree

ADTree uniquely bridges the gap between boosting and the decision tree algorithm. Instead of the conventional approach of building a forest of decision trees, the boosting procedure is incorporated within a single decision tree to facilitate comprehensibility. ADTree consists of alternating layers of decision nodes and prediction nodes, starting with a root prediction node. Mathematically, ADTree can be described as a set of decision rules as shown in (1). Each decision rule returns one of the following: a positive prediction score α^+, a negative prediction score α^−, or a zero score, depending on the nested if statement, such that r_t(x): if (precondition) then [if (condition) then α^+ else α^−] else 0. The precondition is a conjunction of conditions, while the condition itself is a Boolean predicate that is embedded in the decision node.

\text{ADTree model} := \{ r_t(x) \}_{t=0}^{T}    (1)

To perform classification, an input sample is sorted top-down from the root prediction node. Instead of following a single path from the root decision node to one of the leaf nodes as in standard decision trees, one or more paths may be traversed within the ADTree due to possible multiple decision stumps under the same prediction node. The prediction scores from all the traversed prediction nodes are summed to make a prediction on the class label. The sign of the summation indicates either a positive or a negative class label. The magnitude of the summation is a good indication of classification confidence.

In terms of learning, the ADTree model can be grown through any boosting algorithm. AdaBoost was implemented in the seminal work on ADTree [10]. In later years, some research works on ADTree used different boosting algorithms, such as AdaBoost.MH to induce a slightly different ADTree model to handle multi-label problems [11], while others employed LogitBoost to address multiclass problems [12].

Fig. 2. Illustration of ADTree induction whereby a new decision stump is added to one of the existing prediction nodes after each boosting procedure. The weak learner generates a set of base conditions C that become potential candidates for forming the next decision node. The best combination of base condition and prediction node is chosen to form a new decision stump.

With reference to Fig. 2, the root prediction node is first constructed given the original dataset. The rest of the induction is repeated for a given number of boosting iterations. Each boosting cycle adds a new decision stump to one of the "best" prediction nodes to optimally expand the ADTree. Precondition refers to the choice of a prediction node that is selected for inclusion into the ADTree. Condition refers to the decision node of the decision stump. The two prediction values refer to the prediction nodes of the decision stump. The weight distribution over the training dataset is then updated based on the newly added decision rule. This helps to guide the next weak learner when generating a new set of base conditions.

The weak learner shown in Fig. 2 is independent from the core ADTree induction. The existing univariate ADTree uses an exhaustive approach to generate a set of univariate base conditions, each based on a different feature given the weight distribution. The next section of the paper proposes different methods to replace this weak learner in order to introduce multivariate decision nodes. This allows the induction of multivariate base conditions in order to build a multivariate ADTree.

3. Proposed multivariate ADTree

As discussed above in Section 2, ADTree requires a set of base conditions at the beginning of every boosting procedure (Fig. 2). These are potential decision node candidates that dictate the characteristics of the ADTree model. The weak learner is responsible for generating this set of base conditions using the weighted training dataset.
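To make the rule-set view in (1) concrete, the following is a small sketch (not the authors' implementation) of how an ADTree scores a sample: every rule contributes α^+, α^− or 0, the contributions are summed, and the sign gives the class. The `Rule` container and the callable predicates are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class Rule:
    precondition: Callable[[np.ndarray], bool]  # conjunction of conditions along the path
    condition: Callable[[np.ndarray], bool]     # Boolean predicate of the decision node
    alpha_pos: float                            # prediction score if condition is True
    alpha_neg: float                            # prediction score if condition is False

def adtree_score(x: np.ndarray, rules: List[Rule]) -> float:
    """Sum of prediction scores of all traversed prediction nodes, as in Eq. (1)."""
    total = 0.0
    for r in rules:
        if r.precondition(x):
            total += r.alpha_pos if r.condition(x) else r.alpha_neg
        # otherwise the rule contributes 0
    return total

def adtree_predict(x: np.ndarray, rules: List[Rule]) -> int:
    return 1 if adtree_score(x, rules) >= 0 else -1

# Example: a root prediction node plus one multivariate stump on x^T beta >= theta
beta, theta = np.array([0.7, -0.2, 0.1]), 0.5
rules = [
    Rule(lambda x: True, lambda x: True, 0.1, 0.1),               # root: constant score
    Rule(lambda x: True, lambda x: x @ beta >= theta, 0.8, -0.6), # first decision stump
]
print(adtree_predict(np.array([1.0, 0.0, 2.0]), rules))
```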


As discussed earlier, the existing ADTree implements a univariate weak learner (Fig. 3(a)) where a set of univariate base conditions is used to evaluate the split. The selected jth feature x_j is thresholded at ϑ to form the univariate split.

Fig. 3(b)–(d) shows the three proposed approaches to replace the univariate weak learner in order to induce a multivariate ADTree. In the first approach (Fig. 3(b)), a reweighted training dataset is provided at the beginning of every boosting procedure. The aim is to use Fisher's discriminant analysis to obtain a vector β that forms an artificial feature x^T β with a higher discrimination power compared to that of the individual features. This results in a multivariate base condition. This tree is referred to as the Fisher's ADTree (Section 3.1).

Fig. 3(c) shows the Sparse ADTree, whereby feature selection is incorporated to prune out irrelevant features from the multivariate base condition. Sparse linear discriminant analysis [25] is employed to zero out redundant features in the discriminative vector β. Thus, a sparse multivariate base condition is obtained (Section 3.2).

The Fisher's ADTree and Sparse ADTree are based on AdaBoost, whereby the weight distribution must be explicitly included in the weak learner. An interesting property of LogitBoost is that the weight is intrinsically a part of the linear regression problem. Therefore, using LogitBoost allows us to take advantage of well-developed regularized regression techniques [26] to perform feature selection. In this paper, the additive logistic regression interpretation of LogitBoost is used to form the multivariate base condition g(x) as a regression function instead of a Boolean function f(x) (see Fig. 3(d)). This is the regularized LADTree (Section 3.3).

Fig. 3. Weak learner for: (a) univariate ADTree, where a set of univariate base conditions is obtained through an exhaustive approach, one for each feature indexed by j; (b) Fisher's ADTree, which results in a single multivariate base condition as all features are used to form an artificial feature x^T β instead of the jth feature x_j; (c) Sparse ADTree, where the β vector can be sparse with many zero elements to facilitate feature selection; and (d) regularized LADTree, where the base condition g(x) is a regression function instead of a Boolean function f(x).

3.1. Fisher's ADTree

Fisher's discriminant [27] is a well-established supervised technique that finds a subspace upon which the projected samples are well separated according to their class labels. The objective is to maximize the between-class covariance β^T Σ_b β with respect to the within-class covariance β^T Σ_w β of the projected samples. This forms Fisher's ratio J(β) (2), which can be maximized by solving a generalized eigenvalue problem. The optimized β parameter is then used in the proposed Fisher's ADTree to form an artificial feature x^T β, which is a linear combination of all original features. This results in a multivariate decision node, since it uses all the features rather than just the individual jth feature used in the univariate variants. The number of dimensions of the subspace is determined by the total number of classes K: there can be at most K − 1 discriminative projections. For binary class problems, this results in a single discriminative vector β.

J(\beta) = \frac{\beta^T \Sigma_b \beta}{\beta^T \Sigma_w \beta},    (2)

where Σ_b and Σ_w are respectively the between-class and within-class covariances of the original dataset. They are estimated from the training dataset using (3) and (4). The mean vector of the entire training dataset is denoted as μ, while the mean vector of class k is denoted as μ_k.

\Sigma_b = \sum_{k=1}^{K} (\mu_k - \mu)(\mu_k - \mu)^T    (3)

\Sigma_w = \sum_{k=1}^{K} \sum_{i \in \text{class } k} (X_i - \mu_k)(X_i - \mu_k)^T    (4)

In the original Fisher's discriminant estimation, the weight distribution is not incorporated as part of the optimization formulation. In order to use Fisher's discriminant under the boosting framework, the learning of β needs to be adaptive to the weight distribution. Otherwise, each boosting procedure would produce an identical β, which defies the purpose of the boosting.

To achieve weight adaptation, weighted versions of Σ_b and Σ_w are used, as shown in (5) and (6), to find the discriminative vector β, where the weighted kth class mean vector and the overall mean vector are given in (7) and (8) respectively. The class label y is required in these calculations to obtain the discriminative vector β. The detailed weighted Fisher's discriminant, which forms the weak learner for Fisher's ADTree, is shown in Algorithm 1. Given β, an artificial feature is generated through the linear projection x^T β and is used to form the multivariate base condition (see Fig. 3(b)).

\Sigma_b = \sum_{k=1}^{K} \frac{1}{\sum_{i \in \text{class } k} w_i} (\mu_k - \mu)(\mu_k - \mu)^T    (5)

\Sigma_w = \sum_{k=1}^{K} \sum_{i \in \text{class } k} w_i (X_i - \mu_k)(X_i - \mu_k)^T    (6)

\mu_k = \frac{\sum_{i \in \text{class } k} w_i X_i}{\sum_{i \in \text{class } k} w_i}    (7)

\mu = \frac{1}{K} \sum_{k=1}^{K} \mu_k    (8)

A similar approach is implemented in Fisher's decision tree [20], which is a multivariate extension of C4.5. It should be emphasized that the proposed Fisher's ADTree differs from the existing Fisher's decision tree in its ability to "boost" several decision stumps under the same prediction node to improve the final prediction.

Algorithm 1. Weighted Fisher's discriminant

Input: training dataset [X, y] and weight distribution w
The statistical procedure to extract information based on [X, y] includes:
1. Calculate the weighted means of the positive and negative class, μ_1 and μ_2 respectively, using (7);
2. Calculate the weighted between-class covariance matrix Σ_b using (5);
3. Calculate the weighted within-class covariance matrix Σ_w using (6);
4. Maximize Fisher's ratio (2) by solving the generalized eigenvalue problem Σ_b β = λ Σ_w β, where λ and β are termed the eigenvalue and eigenvector respectively.
Output: β
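A minimal NumPy/SciPy sketch of Algorithm 1 for the binary case is given below; it assumes labels in {−1, +1} and returns the leading generalized eigenvector of (Σ_b, Σ_w). The small ridge term added to Σ_w for numerical stability is an implementation choice of this sketch, not part of the paper's formulation.

```python
import numpy as np
from scipy.linalg import eigh

def weighted_fisher_discriminant(X, y, w, ridge=1e-6):
    """Algorithm 1: weighted Fisher's discriminant for binary labels y in {-1, +1}."""
    p = X.shape[1]
    mus, totals = [], []
    Sw = np.zeros((p, p))
    for c in np.unique(y):                                     # two classes expected
        idx = (y == c)
        wc = w[idx]
        mu_c = (wc[:, None] * X[idx]).sum(axis=0) / wc.sum()   # Eq. (7)
        Xc = X[idx] - mu_c
        Sw += (wc[:, None] * Xc).T @ Xc                        # Eq. (6)
        mus.append(mu_c); totals.append(wc.sum())
    mu = np.mean(mus, axis=0)                                  # Eq. (8)
    Sb = sum((1.0 / t) * np.outer(m - mu, m - mu) for m, t in zip(mus, totals))  # Eq. (5)
    # Step 4: generalized eigenproblem  Sb beta = lambda Sw beta
    vals, vecs = eigh(Sb, Sw + ridge * np.eye(p))
    return vecs[:, np.argmax(vals)]                            # discriminative vector beta

# Usage: beta defines the artificial feature x^T beta of the next multivariate base condition,
# with X the (n, p) design matrix, y the (n,) labels and w the (n,) boosting weights.
```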


3.2. Sparse ADTree

This section describes the concept of the Sparse ADTree; a detailed implementation can be found in the earlier work [17]. The Sparse ADTree design uses Sparse Linear Discriminant Analysis (SLDA) [25] in place of Fisher's discriminant analysis. This allows some redundant features to be zeroed out and removed from the classifier, thus achieving a more optimal subset of features. However, SLDA is not originally designed for boosting. Thus, the underlying solver for SLDA is modified to account for the weight distribution. The background on SLDA (presented below in Sections 3.2.1 and 3.2.2) describes how it can be adapted to boost ADTree.

3.2.1. Sparse Linear Discriminant Analysis

Sparse Linear Discriminant Analysis (SLDA) allows some of the elements of β to be exactly zero and hence effectively removes the corresponding features from the discriminant analysis. To achieve this, SLDA uses Optimal Scoring [28] to incorporate a sparsity-inducing penalty for feature selection purposes. Optimal Scoring is just a different formulation from Fisher's discriminant, whereby they both produce discriminative vectors β that are equivalent up to a factor [28].

Optimal Scoring converts the categorical label y from the training dataset into a real-valued label Yθ, whereby Y is the indicator matrix of y with one-hot representation, i.e., Y_ik = 1 if the ith training sample is assigned to class k, and 0 elsewhere in that ith row. The vector θ comprises real-valued entries that assign a real value to each class. The Optimal Scoring formulation is shown in (9), with θ^T D_π θ = 1 to avoid a null solution.

\min_{\theta, \beta} \; n^{-1} \| Y\theta - X\beta \|_2^2 \quad \text{subject to } \theta^T D_\pi \theta = 1, \text{ where } D_\pi = n^{-1} Y^T Y    (9)

Optimal Scoring in SLDA is further constrained by the Elastic Net, one of the well-examined and documented Lasso-type penalization techniques [29], resulting in (10). The Elastic Net is a convex combination of both the Ridge ‖β‖_2^2 [30] and Lasso ‖β‖_1 [31] penalizations. By penalizing the l_1-norm of β, the Lasso forces some of the β elements to be exactly zero, while the Ridge penalty stabilizes β to ensure that a unique solution is obtainable and encourages grouping of correlated features (similar β magnitudes for correlated features). The regularization parameters for the Lasso and Ridge penalties are λ_1 and λ_2 respectively.

\min_{\theta, \beta} \; n^{-1} \| Y\theta - X\beta \|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \quad \text{subject to } \theta^T D_\pi \theta = 1    (10)

The authors of SLDA proposed a simple iterative algorithm to solve (10) for the two parameters: the optimal score vector θ and the discriminative vector β. First, θ is held fixed while optimizing β, and then β is held fixed while solving for θ. This process is repeated until convergence takes place. The detailed implementation is shown in Algorithm 2.

Algorithm 2. SLDA

Input: training dataset, i.e., X and Y
1. Initialize a trivial optimal score vector θ_0 which consists of all 1s. For initialization, θ is set to θ = (I − θ_0 θ_0^T D_π) θ*, where θ* is a random vector. The optimal score vector θ is then normalized such that θ^T D_π θ = 1;
2. Repeat until convergence:
   2.1. For fixed θ, solve (11) to obtain β using the LARSEN [29] algorithm;
   2.2. For fixed β, calculate θ = (D_π)^{-1} Y^T X β. The optimal score vector θ is then ortho-normalized to make it orthogonal to θ_0.
Output: β and θ

3.2.2. Adaptation for Sparse ADTree

As shown in Fig. 3(c), a β solution is required to form the multivariate base condition. In order to adapt SLDA for boosting, a simple modification has been made to the LARSEN algorithm (step 2.1 of Algorithm 2), as proposed in [17], to guide the learning of β based on the weight distribution. For fixed θ, the optimization problem in (10) can be solved easily using the LARSEN algorithm by reformulating (10) into the Lasso regression problem (11) using the augmented training dataset (12).

\min_{\beta} \; n^{-1} \| Y^* \theta - X^* \beta \|_2^2 + \lambda_1 \|\beta\|_1    (11)

X^*_{(n+p) \times p} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad Y^*_{(n+p)} = \begin{pmatrix} Y \\ 0 \end{pmatrix}    (12)

For the Lasso regression problem (11), the solution β is no longer unique. Rather, β is now a piecewise-linear function of λ_1. The solution β is sparse for large values of λ_1, as it allows more penalization. The entire family of β solutions is termed the regularization path. The regularization path starts from a null β solution and ends with a full β solution. LARSEN finds the linear breakpoints analytically, which results in a series of β^(k) solutions indexed by k. However, it does not accommodate additional weight inputs in solving each possible solution β^(k). In order to implement boosting, it is necessary to adapt the weight distribution in finding the linear breakpoints. The detailed implementation can be found in [17].

Given a series of β^(k) solutions (the regularization path) from Algorithm 2, only one β solution is required to form the multivariate base condition. Model selection techniques can be applied to select an optimal β solution. In the Sparse ADTree, generalized cross-validation (GCV) as shown in (13) was implemented for this purpose. It generates a measure of how well the estimated model (β solution) fits the output, accounting for the size of the training dataset, i.e., n, and the complexity of β in terms of degrees of freedom d [32]. The most optimal β solution is the one with the lowest GCV measure. It should be noted that other model selection techniques can be implemented, such as the Akaike Information Criterion [33] or the Bayesian Information Criterion [34]. Depending on the application, the adopted choice may have some effect on the decision node complexity and tree size.

GCV = \frac{\| Y^* \theta - X^* \beta \|_2^2}{(n - d)^2}    (13)

One additional characteristic of this Sparse ADTree is that it allows a user to preselect the number of features for the decision nodes. For example, the Sparse ADTree can be used to generate a univariate ADTree by selecting k = 1 to obtain a β solution with one active feature.

3.3. Regularized logistic ADTree

The multivariate ADTree can be induced based on a different boosting technique. In this research, the use of LogitBoost [35] is specifically investigated because of its unique structure. LogitBoost is an additive logistic regression interpretation of AdaBoost [35].
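A sketch of the augmentation in (12) and the GCV-based selection in (13) is shown below. For brevity, the regularization path is computed with scikit-learn's `lasso_path` on the augmented data, standing in for the weighted LARSEN solver of [17]; approximating the degrees of freedom d by the number of nonzero coefficients is a common convention for the Lasso and an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def augment(X, ytheta, lam2):
    """Eq. (12): turn the Elastic Net problem into a plain Lasso problem."""
    n, p = X.shape
    Xa = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    ya = np.concatenate([ytheta, np.zeros(p)])
    return Xa, ya

def select_beta_by_gcv(X, ytheta, lam2=1e-3):
    """Compute a regularization path on the augmented data and pick beta by GCV, Eq. (13)."""
    Xa, ya = augment(X, ytheta, lam2)
    n = X.shape[0]
    alphas, coefs, _ = lasso_path(Xa, ya)        # coefs has shape (p, number of path points)
    best, best_gcv = None, np.inf
    for j in range(coefs.shape[1]):
        beta = coefs[:, j]
        d = np.count_nonzero(beta)               # degrees of freedom ~ number of active features
        if d >= n:
            continue
        rss = np.sum((ya - Xa @ beta) ** 2)
        gcv = rss / (n - d) ** 2                 # Eq. (13)
        if gcv < best_gcv:
            best, best_gcv = beta, gcv
    return best

# ytheta would be Y @ theta, the real-valued optimal-scoring response for a fixed theta.
```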


It links AdaBoost to classical logistic regression, which is a probabilistic discriminative model for classification tasks. Logistic regression models the log-odds ratio between the positive and negative class posteriors Pr(y = +1 | x) and Pr(y = −1 | x) with a regression model G(x) as follows:

\log \frac{\Pr(y = +1 \mid x)}{\Pr(y = -1 \mid x)} = G(x)    (14)

LogitBoost is a nonparametric extension of (14). It uses a linear combination of regression models (15) to estimate the log-odds ratio instead of the fixed parametric form for G(x). The total number of regression models is M.

G(x) = \sum_{m=1}^{M} g_m(x)    (15)

In each boosting procedure of LogitBoost, the aim is to solve a weighted least-squares regression problem (the details were presented earlier in Section 2.2). Hence, LogitBoost can be viewed as an Iteratively Reweighted Least Squares (IRLS) regression formulation. Any regression model g_m(x) can be implemented to induce G(x). To form the multivariate ADTree, the regression model g_m(x) is restricted to be of linear type, g_m(x): x^T β, which results in solving (16). The matrix W is diagonal of dimension n × n. Each diagonal entry indicates the weight value of one training sample. The vector z is the updated pseudo-label of length n. The only parameter to optimize is β. The weight is incorporated directly such that the output and design matrix are W^{1/2} z and W^{1/2} X respectively. Hence, the optimization process is of a standard linear regression type.

\min_{\beta} \; n^{-1} \| W^{1/2} z - W^{1/2} X \beta \|_2^2    (16)

By expressing the problem in the form of (16), it becomes possible to take advantage of the vast regularized linear regression literature to accommodate the boosting weight distribution. Note that for the Sparse ADTree learning algorithm, the LARSEN algorithm has been modified to accommodate the weight distribution. For the regularized LADTree, the weight distribution is assimilated as a part of the linear regression problem by minimizing the residual between W^{1/2} z and W^{1/2} X β. This alleviates the need to convert the categorical responses to real-valued ones through optimal scoring.

Unfortunately, the use of (16) alone is still insufficient, since a constraint or penalization function must be placed on β in order to provide the capability to shape the characteristics of the ADTree decision node (e.g., feature selection). Therefore a penalization function J(β) is applied to (16) to obtain the constrained regression solution shown in (17). From a Bayesian perspective, this is effectively equivalent to placing a prior on the β solution and maximizing the posterior likelihood.

\min_{\beta} \; n^{-1} \| W^{1/2} z - W^{1/2} X \beta \|_2^2 + J(\beta)    (17)

There is a wide range of penalization techniques of the form (17). Classical ones include Ridge (‖β‖_2^2), Lasso (‖β‖_1), and the Elastic Net, as presented previously in Section 3.2. Their solvers can be implemented in a classical form in each boosting procedure to produce multivariate base conditions for the regularized LADTree induction (see Fig. 3(d)). In this paper, two different variants of the regularized LADTree, using Lasso and Elastic Net respectively, are presented.

The proposed regularized LADTree has a modular design that can seamlessly incorporate different types of linear regularization techniques. This essentially gives ADTree the ability to change its inherent model selection (or feature selection) approach without affecting the learning algorithm itself. The use of different regularization techniques also allows users to preselect the number of features for their given application. For example, selecting k = 1 in the original LARS (the solver for Lasso regression) [36] or in the LARSEN algorithm will generate a univariate tree. The proposed regularized LADTree's modularity and flexibility are the greatest advantage of this approach over all other ADTree designs. LADTree users can apply any of the newer or classical penalization techniques [26] and select any number of features that they wish to incorporate in order to customize the tree for their specific applications.

4. Comparative experimental analysis

4.1. Experimental design and validation

The proposed new multivariate ADTree designs discussed in Section 3 above are Fisher's ADTree, Sparse ADTree, regularized LADTree using Lasso, and regularized LADTree using Elastic Net. In order to gauge their performance against other types of decision trees, several algorithms that are well known and well represented in the literature were chosen to cover each general type of decision tree (Table 2). A discriminant analysis classifier is also included, since this technique has been implemented in two of the multivariate ADTree designs. The chosen learning algorithms are listed below:

1. Univariate decision trees: C4.5 and CART [37];
2. Multivariate decision tree: Fisher's decision tree [20];
3. Ensemble of univariate decision trees: Boosted C4.5 [37];
4. Ensemble of multivariate decision trees: Oblique Random Forest [21];
5. Univariate boosted decision tree: ADTree [10];
6. Sparse discriminant analysis [38].

Table 2
Abbreviated algorithm names.

Abbreviation   Description
ADT            Alternating Decision Tree
C4.5           C4.5
CART           CART
FADT           Fisher's Alternating Decision Tree
FDT            Fisher's Decision Tree
oRF            Oblique Random Forest
rLADTEN        Elastic Net Regularized Logistic Alternating Decision Tree
rLADTL         Lasso Regularized Logistic Alternating Decision Tree
SADTEN         Elastic Net Regularized Sparse Alternating Decision Tree
SLDA           Sparse Linear Discriminant Analysis

The datasets used in this study are given in Table 3. The datasets are shortlisted such that each of them consists of only real-valued feature measurements. Datasets with categorical features are excluded, since multivariate trees must convert categorical features to real-valued features, and such conversions could bias the performance comparisons.

The University of California, Irvine (UCI) datasets [39] are associated with a wide range of real-world problems. This allows comparing the performance of the trees across datasets of varying characteristics (i.e., feature measurements of different nature representing particular domain problems). Three additional spectral datasets from the University of Eastern Finland (UEF) [40] are included because their characteristics are known to have highly correlated features. This allows comparisons between the decision trees on multivariate correlated features. All datasets are preprocessed to center each feature to zero mean and unit standard deviation. All experiments were conducted on a PC with an Intel® Core™ 3.2 GHz i5 CPU and 4 GB RAM.

A standard 10-times 10-fold stratified cross-validation was performed on each dataset for each learning algorithm to generate performance estimation data. The employed performance metrics were: prediction accuracy, induction time, decision tree size, and decision node complexity. Comprehensibility can be viewed as a tradeoff between the decision tree size and the decision node complexity.
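The following sketch shows one LogitBoost cycle of a regularized LADTree base condition as described by (16)–(17): the working response and weights are computed as in Table 1, the square-root-weight transformation is applied, and a Lasso fit yields a sparse linear regression function g(x) = x^T β. The use of scikit-learn's `Lasso` with a fixed `alpha` is an illustrative choice of this sketch; the paper's method instead obtains β solutions along a regularization path (e.g., via LARS/LARSEN) and selects among them.

```python
import numpy as np
from sklearn.linear_model import Lasso

def rladt_base_condition(X, ystar, G, alpha=0.01):
    """One boosting cycle: fit a sparse linear g(x) = x^T beta to the working response.

    X     : (n, p) design matrix
    ystar : (n,) labels converted to {0, 1}
    G     : (n,) current regression values of the ADTree for the training samples
    """
    p_hat = np.exp(G) / (np.exp(G) + np.exp(-G))               # class-posterior estimate
    w = np.clip(p_hat * (1.0 - p_hat), 1e-12, None)            # LogitBoost weights
    z = (ystar - p_hat) / w                                    # working response (pseudo-label)
    sw = np.sqrt(w)
    Xw, zw = sw[:, None] * X, sw * z                           # W^{1/2} X and W^{1/2} z, Eq. (16)
    model = Lasso(alpha=alpha, fit_intercept=False).fit(Xw, zw)    # J(beta) = Lasso penalty, Eq. (17)
    beta = model.coef_
    g = lambda Xq: Xq @ beta                                   # regression base condition g(x)
    return g, beta

# The nonzero entries of beta give the features used by this multivariate decision node;
# G would then be updated by G <- G + 0.5 * g(X), as in Table 1.
```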


For each dataset, the best performing algorithm was given a rank value of 1, the second was given a rank value of 2, and so on. The ranks were averaged if the performances were tied. The average rank of each learning algorithm was calculated over multiple datasets as shown in (18), where r_{il} represents the rank of the lth algorithm on the ith dataset. The total number of datasets is M, while the total number of algorithms is L.

R_l = \frac{1}{M} \sum_{i} r_{il}    (18)

Table 3
Summary of UCI and UEF datasets.

Dataset ID   Dataset             Number of samples   Number of features
1            Breast cancer       569                 30
2            Blood transfusion   748                 4
3            Liver disorder      345                 6
4            Vertebral           310                 6
5            Pimaindian          768                 8
6            Heart               267                 44
7            MAGIC gamma         19,020              10
8            Parkinson           195                 22
9            Haberman            306                 3
10           ILPD                579                 10
11           Ionosphere          351                 33
12           Spambase            4601                57
13           Wilt                4839                5
14           QSAR                1055                41
15           Climate             540                 18
16           Banknote            1372                4
17           Woodchip (UEF)      10,000              26
18           Forest (UEF)        707                 93
19           Paper (UEF)         180                 31

Statistical comparison was conducted along the lines suggested in [41]. The null hypothesis was that all algorithms perform similarly. The nonparametric Friedman's test (19) was used for hypothesis testing. It is based on the average ranks of the classification models and detects whether there are statistically significant differences among the classifiers. In case of rejection of the null hypothesis, Nemenyi's test [41] was applied to determine which pairs of algorithms were statistically different. That was done based on the critical difference (20) in terms of rank value, where the critical values for Nemenyi's test are q_α.

\chi_F^2 = \frac{12M}{L(L+1)} \left[ \sum_{l} R_l^2 - \frac{L(L+1)^2}{4} \right]    (19)

CD = q_\alpha \sqrt{\frac{L(L+1)}{6M}}    (20)

4.2. Experimental results and discussions

The raw results covering prediction accuracy, induction time, decision tree size, and split complexity are summarized in the tables in Appendices A–D. Each value in these tables represents the average ± standard deviation of 10-time 10-fold stratified cross-validation for a particular pair of learning algorithm and dataset. Induction time is excluded from the statistical comparisons and only the average computation time across all datasets is reported, due to the varying execution speed of the different platform implementations such as MATLAB, JAVA, and R. In addition, there are some timing overheads, as both the JAVA and R code are called from within the MATLAB platform when conducting the experiments. There are reported differences in execution times of the same code under different programming languages, such as in [42], where it is shown that JAVA, MATLAB, and R are slower than C++ (approximately 2.2–2.69, 9–11, and 475–491 times respectively). Nevertheless, raw induction time is included to show that the fast induction property of the decision tree is not lost in the proposed multivariate ADTree variants in comparison to other decision tree families.

Statistical comparison results are available in Fig. 4. All statistically significant differences were detected at a 0.01 significance level. The Friedman's and post-hoc Nemenyi's tests were performed based on the average rank values of every learning algorithm. The average ranks are shown in brackets next to their corresponding learning algorithms. A lower average rank value indicates better performance, and vice versa. Groups of algorithms that are not statistically significantly different are indicated using a bold line.

When examining the classifiers separately across all datasets, almost all of them have similar prediction accuracy without a statistically significant difference (SSD) (see Fig. 4). Only oRF is significantly more accurate than ADT. Each classifier is superior in some cases and inferior in others, as predicted by the "No-Free-Lunch" theorem [43]. Therefore, average ranks of the algorithms' performance were also derived, because they give a measure of how well a particular classifier performs across a variety of datasets.

When only statistical testing is considered, it can be seen that the classical decision trees (C4.5 and CART) are consistently within the top group of algorithms for all the performance indicators. These results agree well with the literature findings and can be considered as validating them. These algorithms are fast to build and they are comprehensible. The above is perhaps the main reason why they remain relevant despite more powerful methods being introduced over the years to further improve classification performance.

Despite being within the first group of algorithms without SSD in terms of prediction accuracy, C4.5 and CART are ranked in the bottom half of the 11 classifiers. This shows that their performance indeed could be improved with different tree induction strategies. The ensemble-based decision trees (oRF and boosted C4.5) have been designed aiming for this. They are consistently ranked in the top tier of classification performance (see Fig. 4). Nonetheless, based on the performed experimental analysis, the accuracy improvement is not statistically significant when compared to C4.5 and CART across the range of different datasets. Yet the tradeoff in terms of decision tree size and split complexity is statistically significantly worse compared to some other classifiers. They also have the worst average rank in terms of induction time. For example, when comparing oRF to CART (see Fig. 4), it can be seen that the former is statistically larger with more complex nodes. At the same time it offers no statistically significant improvement in accuracy. This clearly negates some of the good qualities of being a decision tree.

The proposed multivariate ADTree variants offer a flexible nonparametric approach to adapt to different characteristics of datasets. They have a built-in opportunity to decide whether to build a full decision tree or a simple decision stump (a linear decision boundary like SLDA) based on user-supplied stopping criteria. The multivariate ADTree modifications are also able to decide whether to boost multiple splits on the same input space sub-region or to use a standard decision tree partitioning (like FDT). SADT and rLADT allow the use of univariate or multivariate decision nodes (or even both types) within the same tree. The benefits of these properties are further elaborated below in relation to other decision tree families and SLDA.

4.2.1. Generalizing the alternating decision tree

ADTree is a tradeoff between the classical decision trees and an ensemble of decision trees. It retains the comprehensibility of a decision tree despite going through boosting cycles. The ADT algorithm has significantly worse prediction accuracies compared to the oblique random forest. Besides, its induction time is on average longer than that of most of the other analyzed decision trees. These limitations are due to the univariate base conditions, as validated in the experiments.
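The statistical protocol of (18)–(20) can be reproduced with a few lines of NumPy/SciPy, as sketched below for an accuracy matrix with one row per dataset and one column per algorithm. The critical value q_α must be taken from a Nemenyi (Studentized range) table for the chosen significance level; it is passed in as a parameter here rather than hard-coded.

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(acc):
    """Eq. (18): acc has shape (M datasets, L algorithms); rank 1 is the most accurate, ties averaged."""
    ranks = np.vstack([rankdata(-row) for row in acc])
    return ranks.mean(axis=0)

def friedman_statistic(acc):
    """Eq. (19)."""
    M, L = acc.shape
    R = average_ranks(acc)
    return 12.0 * M / (L * (L + 1)) * (np.sum(R ** 2) - L * (L + 1) ** 2 / 4.0)

def nemenyi_cd(M, L, q_alpha):
    """Eq. (20): critical difference in average rank; q_alpha looked up from a Nemenyi table."""
    return q_alpha * np.sqrt(L * (L + 1) / (6.0 * M))

# Example for 19 datasets and 11 classifiers:
# acc = np.loadtxt("accuracy_matrix.csv", delimiter=",")
# print(average_ranks(acc), friedman_statistic(acc), nemenyi_cd(19, 11, q_alpha))
```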


Fig. 4. Average ranks of the algorithms (rank values are shown in brackets next to the algorithms) and the corresponding statistical comparisons for: (a) prediction accuracy, (b) decision tree size, and (c) split complexity (total number of nonzero feature coefficients). Groups of algorithms that are not statistically significantly different are connected with a bold line. CD refers to the critical difference in terms of rank value. The proposed multivariate ADTree variants are shown in bold for readability.

Fig. 5. Frequency of training samples versus feature range (first decision stump) for: (a) univariate ADTree, and (b) Fisher's ADTree for Forest dataset.

All three proposed multivariate ADTree variants overcome these limitations. The Forest dataset can be used as an example to illustrate this. ADT selected the 90th feature of the Forest dataset for splitting in the first decision stump. However, the histogram in Fig. 5(a) shows an obvious distribution overlap between positive and negative training samples over the selected feature range. This violates the weak learning condition of achieving at least 50% accuracy, which is the requirement for boosting to work. In contrast, the proposed FADT algorithm uses Fisher's discriminant analysis. It synthesizes a feature (through linear projection) that is more discriminative, where the positive and negative training samples are well separated over the feature range (see Fig. 5(b)).

The univariate ADTree is a subclass of the SADT and rLADT algorithms. It can be generated by choosing to use only one active feature when computing the regularization path. A separate analysis was performed in this research to compare ADT with SADTUNI and rLADTUNI. The raw performances of the three algorithms are tabulated in Appendix E. Both SADTUNI and rLADTUNI do not suffer from ADT's exhaustive approach to generating a set of univariate base conditions. Therefore, it can be observed that both trees are consistently faster to induce compared to ADT on all datasets. Since they are all univariate ADTrees, the split complexity is completely dependent on the tree size. Both rLADTUNI and SADTUNI are statistically smaller than ADT, thus leading to the conclusion that they are generally more comprehensible. However, ADT is statistically significantly better in prediction accuracy in comparison to SADTUNI and rLADTUNI. Thus, even though SADT and rLADT are able to induce a purely univariate ADTree, in cases where prediction accuracy is prized over induction time and tree size, it is more beneficial to induce multivariate trees.
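As an illustration of the weak-learning issue discussed above, the sketch below evaluates, under stated assumptions, the best weighted error achievable by a univariate decision stump on a single feature and checks whether the edge condition ε ≤ 0.5 − Ψ holds. It is a diagnostic sketch only; feature index 89 (the 90th feature) in the usage comment is purely an example mirroring the Forest-dataset discussion.

```python
import numpy as np

def best_stump_error(x, y, w):
    """Smallest weighted error of a threshold stump on one feature (y in {-1,+1}, w sums to 1)."""
    errs = []
    for thr in np.unique(x):
        pred = np.where(x >= thr, 1, -1)
        e = np.sum(w * (pred != y))
        errs.append(min(e, 1.0 - e))        # also allow the flipped stump orientation
    return min(errs)

def satisfies_weak_learning(x, y, w, psi=0.0):
    """Check the edge condition eps <= 0.5 - psi for a univariate stump on feature x."""
    eps = best_stump_error(x, y, w)
    return eps <= 0.5 - psi, eps

# Usage (hypothetical): check the 90th feature of a dataset X with uniform weights
# ok, eps = satisfies_weak_learning(X[:, 89], y, np.full(len(y), 1.0 / len(y)), psi=0.02)
```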


Table 4
Performance comparison between FDT and FADT for some cases where FADT predicts better than FDT. The best performing value is highlighted in bold.

Dataset          Prediction accuracy             Induction time (s)           Tree size
                 FDT            FADT             FDT           FADT           FDT              FADT
Liver disorder   50.67 ± 4.61   63.47 ± 1.54     0.00 ± 0.00   0.19 ± 0.10    7.24 ± 1.32      43.42 ± 15.18
Heart            73.62 ± 1.95   76.34 ± 2.22     0.04 ± 0.00   0.09 ± 0.09    11.58 ± 0.60     17.95 ± 13.21
Pimaindians      64.96 ± 4.86   75.16 ± 1.04     0.01 ± 0.00   0.04 ± 0.02    14.50 ± 1.55     26.38 ± 11.20
MAGIC gamma      74.69 ± 4.12   82.52 ± 0.12     1.10 ± 0.03   8.55 ± 1.29    266.46 ± 53.76   109.69 ± 8.04
ILPD             59.03 ± 3.52   71.39 ± 0.26     0.01 ± 0.00   0.03 ± 0.05    10.84 ± 2.83     8.65 ± 5.57
QSAR             81.18 ± 0.90   85.67 ± 0.57     0.13 ± 0.01   0.06 ± 0.04    44.32 ± 1.54     14.59 ± 6.62

Table 5
Prediction accuracy of SADT and SLDA for cases whereby SADT generates a single SLDA model in its sole decision node (e.g., it behaves like an SLDA). The best performing value is highlighted in bold.

Dataset             SADT             SLDA
Blood transfusion   76.21 ± 0.00     66.00 ± 0.42
Banknote            98.21 ± 0.05     97.49 ± 0.09
Woodchip (UEF)      99.58 ± 0.01     99.61 ± 0.01
Paper (UEF)         100.00 ± 0.00    100.00 ± 0.00

Table 6
Prediction accuracies of C4.5, CART and rLADTree for medical datasets with highly discriminative features. The best performing value is highlighted in bold.

Dataset           C4.5            CART            rLADTL          rLADTEN
Breast cancer     93.52 ± 0.78    93.04 ± 0.73    96.40 ± 0.26    96.52 ± 0.23
Liver disorders   65.76 ± 2.14    66.38 ± 2.45    66.99 ± 2.22    66.85 ± 1.54
Vertebral         81.23 ± 1.01    80.81 ± 1.25    82.87 ± 1.03    82.68 ± 1.43

4.2.2. Fisher's ADTree as an extension to Fisher's decision tree

The proposed FADT is a direct extension of the existing FDT algorithm. The difference comes from the accommodation of boosting in FADT, which allows a majority-voted decision from multiple multivariate decision nodes on the same input space sub-region.

The performed experiments did not show any statistically significant differences in the performance metrics between FADT and FDT. However, the average prediction accuracy rank of FADT was better than that of FDT, while the average induction time, decision tree size, and split complexity ranks of FDT were better than those of FADT. The incorporation of boosting in FADT improved the prediction accuracies for 11 out of 19 datasets. Both trees had similar prediction accuracies for 6 datasets, while FDT was more accurate on 2 datasets. Table 4 shows some examples where FADT predicted better than FDT.

Using the Liver Disorder dataset as an example (Table 4), it can be seen that FADT improves the classification accuracy on Liver Disorder by 13% over FDT. At the same time, FADT's tree size is larger by around 6 times compared to that of FDT. This example may lead to the wrong conclusion that FADT improves the classification performance at the cost of a larger decision tree. While this may be true for ensemble-based multivariate decision trees such as the oblique random forest, it is not the case for FADT. In actuality, FADT built a smaller tree on 7 out of the 19 experimented datasets. Some examples of that are the MAGIC gamma, ILPD and QSAR datasets (Table 4). Most interesting is that FADT improved the MAGIC gamma prediction by 8% while building a smaller tree (some 2.5 times smaller) than that of FDT. This can be explained by the boosting providing significantly better discrimination on the already partitioned regions rather than going down in depth for further splitting.

In the two cases where FDT gave a better prediction compared to FADT (i.e., the Ionosphere and Woodchip datasets), it is likely that FADT suffered from the over-fitting phenomenon. Using the Woodchip dataset as an example, it can be observed that FADT generated a significantly larger decision tree size of 141.04 ± 5.00 compared to 18.22 ± 0.36 for FDT. This is likely due to over-fitting. In general, FADT does improve the prediction accuracy of FDT through boosting. Furthermore, in many cases it achieves this without necessarily sacrificing the tree size.

4.2.3. Sparse ADTree – a nonparametric extension of SLDA

The proposed SADT is a direct nonparametric extension of SLDA, which itself is a powerful discriminant analysis method. It is important to note that the parametric form of Sparse Linear Discriminant Analysis makes a linear assumption about the underlying data. However, there are cases where a linear classifier is insufficient. Such cases can be accommodated by SADT by inducing a suitable number of decision boundaries to better discriminate the input space. The ability of SADT to extend SLDA into a tree representation increases the prediction accuracy across multiple datasets. In fact, it showed improvements on 12 out of the 19 datasets employed in the reported experimental research. Besides, SADT performed comparably with SLDA on the other 7 datasets.

Fig. 6. SADT models based on: (a) woodchip and (b) heart datasets. SADT is capable of generating either an SLDA-like model as in (a), or further extending it into a tree model as in (b).

Table 5 compares the classification performance on datasets where SADT generates a single SLDA classifier in its sole decision stump. For example, Woodchip is a linearly separable dataset for which SADT generated a single decision boundary. This is essentially similar to SLDA (Fig. 6(a)). There is no surprise therefore that both SADT and SLDA achieved close experimental results in terms of their accuracy performance: 99.58 ± 0.01% and 99.61 ± 0.01% respectively. Close performances can also be noticed in their induction time (0.39 ± 0.05 s for SADT and 0.58 ± 0.03 s for SLDA) as well as split complexity (both were 26.00 ± 0.00).

Table 7
Prediction accuracies and tree sizes of C4.5, CART and rLADT for spectral datasets with highly correlated features. The best performing value is highlighted in bold.

           Prediction accuracy                                            Decision tree size
Dataset    C4.5           CART           rLADTL         rLADTEN           C4.5             CART             rLADTL          rLADTEN
Woodchip   91.94 ± 0.18   91.64 ± 0.26   99.61 ± 0.02   99.45 ± 0.01      516.54 ± 3.97    313.08 ± 25.40   4.00 ± 0.00     4.00 ± 0.00
Forest     88.61 ± 0.70   88.30 ± 0.52   95.94 ± 0.35   93.19 ± 0.32      44.64 ± 1.09     27.52 ± 3.96     16.45 ± 13.78   4.00 ± 0.00
Paper      95.32 ± 1.39   96.99 ± 1.33   98.36 ± 1.07   96.18 ± 1.67      10.48 ± 0.30     10.02 ± 0.75     37.93 ± 9.81    58.12 ± 9.01

Fig. 7. rLADTL model (a) and stem plot (b) of the β feature coefficients of every spectral measurement for the Woodchip dataset, taken from the decision node in (a). The stems are colored according to the visible-light color of the corresponding wavelength.

Fig. 8. rLADTL model on the Vertebral dataset, in which it is possible to boost multiple decision stumps on the same input-space sub-region, and there are both univariate (white) and multivariate (gray) decision nodes.

In most other cases in the reported experimental research, more than a single decision boundary was required. This is illustrated with the Heart example shown in Fig. 6(b): the classification performance increased from 68.04 ± 1.30% (SLDA) to 76.36 ± 1.77% (SADT) by building a tree rather than a single decision boundary.
In short, SADT behaves as SLDA for datasets that are linearly separable and behaves as a tree for those that are not. This alleviates the need for practitioners to select the right parametric form to achieve better prediction. However, the improved prediction comes at the cost of a longer induction time and a higher split complexity. For example, the Heart dataset required an induction time of 2.88 ± 0.68 s compared to SLDA's 0.14 ± 0.01 s, along with a split complexity of 263.09 ± 251.29 compared to SLDA's 8.20 ± 2.13.
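
A toy illustration of this point (not the paper's Heart experiment) is sketched below, assuming scikit-learn: LinearDiscriminantAnalysis stands in for a single SLDA-like linear boundary, and a depth-two DecisionTreeClassifier stands in for a small tree of splits, evaluated on a synthetic dataset that is not linearly separable.

```python
# Toy comparison: one linear decision boundary vs. a shallow tree of splits
# on non-linearly-separable data (illustrative only; not SLDA or SADT).
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

single_boundary = LinearDiscriminantAnalysis()                        # one linear split
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)    # a few splits

for name, clf in [("single linear boundary", single_boundary),
                  ("two-level tree", shallow_tree)]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: {100 * acc:.1f} % accuracy")
```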
Although SADT was larger than SLDA, it in fact built the smallest tree on average among all the decision trees in the reported experimental analysis. Furthermore, its split complexity was ranked just below that of the univariate trees. SADT achieved better classification than C4.5 and CART on 9 out of the 19 datasets, while inducing a smaller decision tree on 12 out of the 19 datasets. In short, it can be concluded that SADT is a successful nonparametric extension to SLDA: it creates a parsimonious multivariate decision tree that is the smallest on average even when compared against univariate decision trees, with only a slightly higher split complexity.

4.2.4. Regularized LADTree

The most notable extension is the proposed rLADTree. Despite being a boosted and multivariate variant, it shows no statistically significant difference in tree size and node complexity when compared to univariate, unboosted decision trees such as C4.5 and CART. First, the performance of rLADTree is examined and compared with that of C4.5 and CART on datasets with highly discriminative features. In the performed experiments these were represented by medical datasets. Feature measurements in such databases are good indicators of the capability to discriminate between different types of medical conditions (see Table 6).
It is generally known that univariate decision trees perform well on this kind of dataset [44]. Breast cancer, for example, is the problem of discriminating between malignant and benign breast-cancer diagnoses, with features extracted from digitized images of fine-needle aspirates of breast masses. C4.5 and CART achieved average accuracies of 93.52 ± 0.78% and 93.04 ± 0.73%, respectively. On the same dataset, rLADTL and rLADTEN achieved similar accuracies of 96.42 ± 0.26% and 96.52 ± 0.23%.
On the contrary, univariate decision nodes do not handle datasets with complex feature interactions well [44]. For datasets with highly correlated multivariate features, such as the spectral datasets (see Table 7), multivariate decision trees are preferable. For example, the Wood-chip dataset consists of spectral reflectance measurements for two different types of woodchips: birch and Scots pine. The induced rLADTL model,

comprising only a single decision stump, achieved an accuracy of 99.61 ± 0.02%. The magnitudes of the feature coefficients (together with their signs) of this decision stump are shown in Fig. 7, where the colors represent wavelengths in the visible-light range. From the rLADTL model it is therefore easy to comprehend the importance of each spectral measurement and which wavelengths best discriminate between the birch and Scots pine species. In contrast, C4.5 and CART induced large decision trees of over 500 and 300 nodes, respectively (refer to Appendix C). These large decision trees were far less comprehensible than the one induced using the rLADTL model, and at the same time they achieved lower classification accuracies.
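
A plot in the spirit of Fig. 7(b) can be produced by drawing one stem per spectral band and coloring it by wavelength. The sketch below is illustrative only: it assumes matplotlib and NumPy, substitutes random numbers for the actual rLADTL stump coefficients, and approximates the visible-light coloring with a spectral colormap.

```python
# Illustrative stem plot of a decision stump's feature coefficients, colored
# by wavelength (random coefficients stand in for the real rLADT-L stump).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
wavelengths = np.linspace(400, 700, 31)          # assumed visible-range grid, nm
coefs = rng.normal(0.0, 1.0, wavelengths.size)   # stand-in for beta coefficients

# Map each wavelength to a color from a spectral-like colormap.
colors = plt.cm.nipy_spectral((wavelengths - 400) / 300)

plt.vlines(wavelengths, 0, coefs, colors=colors, linewidth=2)
plt.scatter(wavelengths, coefs, c=colors, s=15)
plt.axhline(0, color="k", linewidth=0.5)
plt.xlabel("wavelength (nm)")
plt.ylabel("coefficient (magnitude and sign)")
plt.show()
```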
The regularized LADTree allows flexible hierarchical modeling by building a suitable regression model for estimating the class posterior distribution used in classification. Incorporating regularization terms enables the decision-node complexity to be optimally selected. Thus, it is possible to have multivariate decision nodes with varying sparsity, univariate decision nodes, or both within the same regression model.
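
This idea can be sketched with any off-the-shelf sparse linear learner: fitting an L1-penalized logistic regression to the examples reaching a node yields a coefficient vector whose nonzero entries define the split, so the same machinery can give a dense multivariate node, a sparse one, or a purely univariate one depending on the penalty strength. The example below assumes scikit-learn and is a simplified stand-in for, not a reproduction of, the rLADTree node optimization.

```python
# Sketch of a regularized multivariate decision stump (illustrative only):
# an L1-penalized logistic regression whose surviving coefficients define the
# split; stronger penalties push it towards a univariate split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put features on a comparable scale

for C in (1.0, 0.05, 0.01):             # smaller C = stronger L1 penalty
    stump = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    stump.fit(X, y)
    w = stump.coef_.ravel()
    n_active = np.count_nonzero(w)
    kind = "univariate" if n_active == 1 else "multivariate"
    print(f"C={C:<5} -> {n_active} active features ({kind} split), "
          f"split rule: w.x + b > 0")
```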
The versatility of the regularized LADTree can be illustrated using the Vertebral dataset. The rLADTL model consists of decision nodes that are both univariate and multivariate within the same tree (Fig. 8). It achieved an accuracy of 82.87 ± 1.03% compared to the best-performing 84.94 ± 1.08% of oRF. Yet its tree size was significantly smaller (36.19 ± 6.21 compared to 4614.32 ± 75.26), and it offered a lower node complexity (44.45 ± 34.29 compared to 4514.32 ± 75.26).
In short, the regularized LADTree is able to induce a range of possible regression models: (1) a single linear regression model (similar to SLDA); (2) an additive regression model; and (3) a hierarchical regression model (a decision tree). Its modularity and ability to match different data complexities enable it to retain the good qualities of a decision tree, such as a short induction time and good comprehensibility, while enjoying the advantages of boosting and of the feature selection techniques of the well-established regularization approach.

4.2.5. Comparisons between multivariate ADTree variants

Each proposed multivariate ADTree algorithm has its own distinctive set of characteristics that distinguishes it from the other variants. From the experiments, it was observed that Fisher's ADTree is the fastest to induce, whereas the sparse ADTree is the most comprehensible and the regularized LADTree is the most accurate. Table 8 highlights the main characteristics of the algorithms and their differences.

5. Conclusion

In this paper, three different methods are presented to induce a multivariate ADTree. The aim is to equip the decision tree with boosting capability while ensuring that it remains comprehensible. Although no single algorithm can outperform all others on all possible datasets, as suggested by the No-Free-Lunch theorem, it is clear from the performed experimental analysis that an optimal tree can be built if the dataset characteristics are matched by the right choice of decision tree algorithm. For example, if the domain problem has a few highly discriminative features, the C4.5 and CART algorithms are capable of generating an optimal decision tree, as shown in the literature and confirmed by the performed experiments. However, they induce large and incomprehensible decision trees with lower prediction accuracies when complex interactions exist among the features (i.e., spectral datasets). Ensemble-based forests give high classification performance across most data types, but at the expense of other factors such as induction time and comprehensibility. In many cases, the characteristics of the datasets are unknown a priori; therefore, an optimal classifier with the right induction bias that best captures the underlying characteristics is also unknown. More often than not, practitioners are required to experiment with different types of classifiers to determine the suitable one for their given domain problem.
The proposed multivariate ADTree variants (in particular, the sparse ADTree and the regularized LADTree) are non-parametric decision trees that are equipped with additional boosting and regularization techniques to better match the complexity of given datasets. They are therefore able to optimally represent datasets with a few highly discriminative features (as C4.5 and CART do), datasets with correlated features such as spectral datasets (as multivariate decision trees do), and datasets that require multiple models (as ensemble-based forests do). The proposed Fisher's ADTree is a boosted alternative to multivariate decision trees such as Fisher's decision tree. The proposed sparse ADTree incorporates a sparseness criterion into the multivariate ADTree to allow better comprehension through feature selection. It is a nonparametric extension to SLDA: it performs the same partitioning as SLDA for datasets that satisfy the linear assumption, while also overcoming the limitations of SLDA by automatically fitting multiple decision boundaries to improve the classification accuracy for datasets that cannot be classified with a single linear decision boundary.

Table 8
Comparisons between characteristics of multivariate ADTrees.

Characteristics | Fisher's ADTree | Sparse ADTree | Regularized LADTree
Boosting | AdaBoost | AdaBoost | LogitBoost
Penalization terms | None | Restricted to Elastic Net | May use any regularization technique
Prediction accuracy | Lower than Sparse and Regularized ADTree | Better than Fisher's ADTree | Best among the multivariate ADTree variants
Induction time | Fastest among the multivariate ADTrees due to the use of a single analytical solution | Slowest among the multivariate ADTrees due to a series of optimizations and the use of additional parameters (i.e., the optimal score vector) | Faster than Sparse ADTree
Decision tree size | Approximately the same as Regularized LADTree | Smallest among the multivariate ADTrees | Approximately the same as Fisher's ADTree
Multivariate split complexity | Most complex of the three variants due to the implemented Fisher's discriminant analysis | Approximately the same as Regularized LADTree | Approximately the same as Sparse ADTree
Advantages | Embedded feature extraction for better discrimination and satisfying the weak learning condition | 1. Feature selection mechanism to select an optimal feature set for each decision node. 2. Regularization path with "early stopping" that allows decision node complexity tuning. | 1. Flexible hierarchical additive modeling. 2. Framework that allows any linear-type regularization (maximum a posteriori) without modification to the solver. 3. Probabilistic decisions due to the additive logistic regression interpretation.
Disadvantages | No feature selection mechanism | Restricted to Elastic Net | Higher model complexity compared to Sparse ADTree

The most distinctive is the regularized LADTree, which performs without statistically significant difference from the state-of-the-art C4.5 and CART algorithms in terms of tree size and node complexity for most datasets, despite being a boosted multivariate tree. Most significantly, the regularized LADTree had better classification performance across all datasets: it is ranked directly in the second tier after the decision tree ensemble algorithms while remaining comprehensible. For example, on applications that contain features with complex interactions, the regularized LADTree builds a more accurate and much smaller tree with its multivariate nodes compared to C4.5 and CART. At the same time, its node complexity remains small due to the use of regularization techniques. It is important to note that the greatest advantage lies in the regularized LADTree's modularity, which allows a wide range of established linear regularization techniques to be applied. This bridges the decision tree and regularization research fields.
For future research, it would be important to investigate how the ADTree can be designed around different boosting algorithms to handle a wide range of domain problems. This would give it an advantage over classical decision trees, which often require a new learning mechanism to achieve certain properties.

Acknowledgments

This work was supported by Monash University Malaysia through a Higher Degree Research Scholarship, and by the Malaysia Ministry of Higher Education Fundamental Research Grant Scheme FRGS/2/2014/TK03/MUSM/02/1.

Appendix A

See Table A1.

Appendix B

See Table B1.

Appendix C

See Table C1.

Appendix D

See Table D1.

Appendix E

See Table E1.

Table A1
Prediction accuracy (average ± standard deviation of 10-time 10-fold stratified cross validation) in terms of percentage.

ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA

1 93.52 7 0.78 93.047 0.73 94.97 70.46 94.687 0.80 96.127 0.26 96.89 7 0.37 96.40 70.26 96.52 7 0.23 97.167 0.26 97.217 0.27 96.36 7 0.25
2 77.92 7 0.62 78.28 7 0.37 78.52 70.57 77.38 7 0.41 76.38 7 0.41 76.217 0.00 76.21 70.00 76.217 0.00 77.54 7 0.42 78.50 7 1.01 66.007 0.42
3 65.767 2.14 66.387 2.45 50.6774.61 62.34 7 1.37 63.477 1.54 62.81 7 0.81 66.99 72.22 66.857 1.54 68.907 1.17 72.59 7 1.24 62.55 7 0.38
4 81.23 7 1.01 80.81 7 1.25 83.06 71.24 82.717 1.24 82.87 7 1.13 83.357 0.57 82.87 71.03 82.687 1.43 83.107 1.51 84.94 7 1.08 79.65 7 0.87
5 74.577 0.90 74.147 0.37 64.96 74.86 72.54 7 0.91 75.167 1.04 76.017 0.41 74.68 70.87 74.357 0.85 73.747 1.08 76.09 7 0.54 76.147 0.15
6 75.50 7 1.88 78.317 1.12 73.62 71.95 78.577 1.66 76.34 7 2.22 76.36 7 1.77 79.43 70.00 79.43 7 0.00 80.81 7 1.12 81.187 1.21 68.04 7 1.30
7 85.127 0.15 85.377 0.12 74.69 74.12 78.59 7 0.10 82.52 7 0.12 78.90 7 0.03 81.40 70.09 81.43 7 0.11 88.007 0.64 87.737 0.07 79.45 7 0.02
8 83.83 7 2.01 86.88 7 1.63 84.63 71.57 88.86 7 1.57 81.82 7 2.28 81.677 1.01 81.28 72.27 77.99 7 1.93 92.747 1.13 91.99 7 1.44 82.02 7 1.34
9 70.547 1.08 72.217 1.23 71.60 71.00 71.917 1.59 72.56 7 0.71 71.28 7 0.81 72.98 70.83 73.077 1.03 71.277 1.29 69.36 7 1.28 74.047 0.63
10 68.357 1.75 71.117 0.65 59.03 73.52 71.23 7 0.52 71.417 0.63 71.39 7 0.26 71.51 70.00 71.517 0.00 71.58 7 1.34 72.137 0.85 63.40 7 0.60
11 90.29 7 1.29 88.707 1.02 87.97 71.47 84.187 1.39 80.157 1.57 84.767 0.92 87.81 71.15 87.477 1.11 94.02 7 0.42 94.30 7 0.42 86.487 0.80
12 92.79 7 0.40 92.417 0.23 90.83 70.35 93.63 7 0.16 91.03 7 0.06 90.917 0.08 91.64 70.14 91.60 7 0.20 95.23 7 0.28 86.727 0.24 90.62 7 0.06
13 98.157 0.11 98.23 7 0.07 97.93 70.17 96.46 7 0.06 96.147 0.05 94.59 7 0.06 94.61 70.00 94.617 0.00 98.53 7 0.08 98.25 7 0.11 91.28 7 0.06
14 83.517 1.08 82.88 7 0.57 81.18 70.90 84.047 0.92 85.677 0.57 85.177 0.36 85.13 70.79 85.23 7 0.67 86.917 0.65 87.36 7 0.48 84.95 7 0.29
15 89.84 7 0.71 91.277 0.37 90.06 70.98 91.077 0.44 91.497 0.00 91.497 0.00 91.49 70.00 91.497 0.00 92.517 0.51 91.55 7 0.25 78.677 0.83
16 98.517 0.25 98.26 7 0.32 99.74 70.13 88.80 7 0.23 99.067 0.12 98.217 0.05 96.66 70.06 96.62 7 0.04 99.80 7 0.10 99.82 7 0.05 97.497 0.09
17 91.94 7 0.18 91.647 0.26 99.47 70.05 67.82 7 0.16 91.357 1.95 99.58 7 0.01 99.61 70.02 99.45 7 0.01 98.667 0.08 99.317 0.04 99.617 0.01
18 88.617 0.70 88.30 7 0.52 95.25 70.46 83.717 0.38 95.84 7 0.31 96.42 7 0.36 95.94 70.35 93.197 0.32 92.46 7 0.46 96.46 7 0.34 96.29 7 0.26
19 95.32 7 1.39 96.99 7 1.33 100.00 70.00 96.517 1.67 100.007 0.00 100.007 0.00 98.36 71.07 96.187 1.67 96.417 1.38 98.517 0.69 100.007 0.00

Table B1
Induction time (average ± standard deviation of 10-time 10-fold stratified cross validation) in terms of seconds.

ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA

1 0.02 7 0.00 0.167 0.01 0.02 7 0.00 1.43 7 0.39 0.02 7 0.04 0.447 0.22 1.02 7 0.47 2.917 0.71 1.107 0.12 3.96 70.30 0.107 0.03
2 0.007 0.00 0.05 7 0.00 0.007 0.00 0.077 0.03 0.137 0.08 0.017 0.00 0.02 7 0.03 0.93 7 0.38 0.02 7 0.00 14.36 70.54 0.017 0.00
3 0.007 0.00 0.03 7 0.00 0.007 0.00 0.117 0.05 0.197 0.10 0.017 0.01 0.08 7 0.03 0.017 0.02 0.067 0.01 6.26 70.12 0.017 0.00
4 0.007 0.00 0.03 7 0.00 0.007 0.00 0.197 0.09 0.107 0.08 0.02 7 0.01 0.09 7 0.04 0.107 0.05 0.167 0.02 3.40 70.07 0.017 0.00
5 0.017 0.00 0.09 7 0.00 0.017 0.00 0.127 0.07 0.077 0.06 0.047 0.02 0.09 7 0.06 0.08 7 0.04 0.58 7 0.10 12.90 70.29 0.017 0.00
6 0.02 7 0.00 0.077 0.00 0.047 0.00 0.32 7 0.14 0.09 7 0.09 2.88 7 0.68 0.277 0.25 0.107 0.07 0.577 0.10 2.59 70.07 0.147 0.01
7 1.357 0.04 8.58 7 0.13 1.107 0.03 16.25 7 0.79 8.55 7 1.29 1.26 7 0.77 14.077 1.79 0.32 7 0.25 263.29 7 82.54 362.60 73.17 0.197 0.02
8 0.017 0.00 0.047 0.00 0.017 0.00 1.25 7 0.16 0.117 0.07 0.25 7 0.15 0.54 7 0.23 15.69 7 2.07 0.247 0.03 1.6770.03 0.047 0.01
9 0.007 0.00 0.02 7 0.01 0.007 0.00 0.05 7 0.03 0.137 0.05 0.03 7 0.01 0.067 0.03 0.78 7 0.24 0.017 0.00 5.94 70.13 0.007 0.00
10 0.017 0.00 0.08 7 0.00 0.017 0.00 0.03 7 0.03 0.03 7 0.05 0.08 7 0.05 0.017 0.02 0.067 0.02 0.54 7 0.10 8.75 70.14 0.017 0.00
11 0.02 7 0.00 0.107 0.01 0.02 7 0.00 1.96 7 0.82 0.067 0.03 0.447 0.25 1.05 7 0.46 0.017 0.02 0.85 7 0.08 3.54 70.05 0.107 0.11
12 0.92 7 0.05 2.83 7 0.07 0.777 0.06 41.23 7 2.72 0.747 0.36 6.82 7 2.75 14.90 7 3.09 0.83 7 0.26 11.38 7 2.01 45.17 70.43 1.75 7 0.11
13 0.05 7 0.00 0.477 0.01 0.067 0.00 0.577 0.31 0.167 0.07 0.03 7 0.01 0.017 0.00 0.03 7 0.00 4.167 0.58 36.42 70.57 0.02 7 0.00
14 0.107 0.00 0.377 0.02 0.137 0.01 6.687 1.42 0.067 0.04 2.85 7 1.13 1.89 7 0.93 0.017 0.00 4.62 7 0.79 13.23 70.27 0.23 7 0.02
15 0.017 0.00 0.117 0.00 0.017 0.00 0.777 0.45 0.02 7 0.01 0.05 7 0.04 0.017 0.00 0.017 0.00 0.78 7 0.07 4.6770.08 0.03 7 0.01
16 0.017 0.00 0.127 0.01 0.017 0.00 0.377 0.09 0.047 0.02 0.017 0.01 0.187 0.15 0.017 0.01 0.46 7 0.06 7.95 70.16 0.017 0.00
17 1.077 0.12 7.30 7 0.52 0.30 7 0.02 1.25 7 0.02 7.98 7 0.67 0.39 7 0.05 0.077 0.00 0.30 7 0.03 76.26 7 10.19 109.92 71.25 0.58 7 0.03
18 0.117 0.00 0.917 0.05 0.187 0.01 6.23 7 1.10 0.02 7 0.02 11.177 5.35 3.187 4.07 0.46 7 0.05 5.63 7 0.78 5.36 70.08 0.977 0.04
19 0.007 0.00 0.047 0.00 0.007 0.00 1.34 7 0.26 0.007 0.00 0.117 0.02 0.667 0.24 0.83 7 0.24 0.017 0.00 1.23 70.02 0.08 7 0.00

Table C1
Decision tree size (average ± standard deviation of 10-time 10-fold stratified cross validation) in terms of total number of nodes. SLDA is not a decision tree and hence not included.

ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF

1 21.44 70.94 13.08 7 1.62 11.00 70.78 71.65 712.25 6.40 7 4.79 11.59 74.84 45.617 13.24 42.167 12.06 1091.14 738.20 1253.86 7 52.21
2 10.82 71.06 13.62 7 2.26 6.12 70.66 31.93 77.48 33.88 7 12.39 4.00 70.00 6.82 7 4.76 5.177 3.70 62.80 724.90 5992.06 7 121.13
3 49.18 72.52 24.80 7 4.97 7.24 71.32 33.28 76.64 43.42 7 15.18 5.62 72.64 36.197 6.21 36.94 7 12.65 691.08 7495.77 4614.32 7 75.26
4 19.92 70.89 13.92 7 2.61 14.50 71.55 51.91 714.21 26.38 7 11.20 7.63 72.33 34.06 7 10.07 33.707 12.11 1522.35 7266.26 2649.22 7 79.57
5 39.02 74.72 18.60 7 6.76 9.34 72.64 25.75 77.54 19.09 7 10.22 13.24 74.32 23.20 7 10.41 21.677 7.74 4086.29 71509.32 7703.04 7 141.01
6 36.90 70.84 3.88 7 1.71 11.58 70.60 23.08 75.33 17.95 7 13.21 50.56 712.02 16.96 7 11.54 11.86 7 7.54 1333.74 731.50 1710.66 7 40.41
7 726.48 720.51 209.62 7 18.26 266.46 753.76 137.65 73.80 109.69 7 8.04 15.94 78.25 100.63 7 8.35 101.86 7 8.90 71846.90 718979.61 131636.68 7 653.16
8 19.04 71.11 11.40 7 1.96 9.46 70.60 90.25 78.48 27.707 11.51 16.30 77.90 57.85 7 13.21 77.29 7 15.88 726.31 7127.97 1191.58 7 38.67
9 4.88 71.04 4.42 7 2.26 1.90 70.51 29.98 710.19 36.25 7 11.53 18.22 75.43 42.82 7 10.52 42.197 7.75 28.50 722.10 3522.84 7 86.06
10 55.64 78.40 2.007 1.61 10.84 72.83 8.05 74.32 8.65 7 5.57 15.88 78.10 4.87 7 2.75 4.93 7 2.94 3750.71 7926.73 6550.84 7 82.18
11 26.54 71.14 9.84 7 2.38 11.36 70.77 64.06 715.41 29.417 13.17 14.02 74.67 53.32 7 13.64 41.80 7 8.75 1075.88 728.76 1729.80 7 54.26
12 207.42 74.27 116.68 7 15.31 91.94 72.41 139.87 73.49 25.03 7 8.23 11.80 74.52 70.187 7.62 70.87 7 9.99 2335.49 795.68 8218.66 7 230.63
13 51.40 71.50 40.80 7 3.13 25.50 73.78 36.88 711.14 40.78 7 8.62 5.02 71.27 4.007 0.00 4.007 0.00 4895.30 7180.73 7115.56 7 205.23
14 110.96 71.69 43.30 7 6.18 44.32 71.54 104.56 711.39 14.59 7 6.62 33.31 711.11 48.22 7 8.46 51.43 7 10.28 5463.94 795.72 6193.34 7 112.46
15 28.06 71.05 3.32 7 1.19 14.86 71.60 37.00 717.02 10.06 7 6.13 5.68 73.55 4.007 0.00 4.007 0.00 1524.22 745.23 2052.64 7 52.95
16 29.06 71.05 33.187 1.01 11.32 70.67 64.18 78.55 28.30 7 7.21 4.33 70.85 34.45 7 16.43 5.29 7 2.72 1205.94 795.27 1868.34 7 67.28



17 516.54 73.97 313.08 7 25.40 18.22 70.36 30.55 70.32 141.04 7 5.00 4.00 70.00 4.007 0.00 4.007 0.00 17562.82 7141.26 13836.32 7 222.18
18 44.64 71.09 27.52 7 3.96 8.62 70.55 86.62 78.22 6.137 3.49 10.30 74.03 16.45 7 13.78 4.007 0.00 2224.28 761.67 1741.60 7 61.64
19 10.48 70.30 10.02 7 0.75 3.00 70.00 76.12 77.46 4.007 0.00 4.00 70.00 37.93 7 9.81 58.127 9.01 18.50 730.03 745.02 7 26.68

Table D1
Complexity of multivariate split (average ± standard deviation) in terms of total number of nonzero coefficients.

ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA

1 10.22 7 1.96 6.04 7 2.24 150.007 23.74 23.55 74.08 54.007 132.60 90.86 7 166.57 202.37 7200.66 138.197 107.47 520.577 19.10 2884.65 7130.52 22.197 1.87
2 4.917 2.00 6.317 3.58 10.247 5.52 10.31 72.49 43.40 7 44.17 4.007 0.00 3.82 710.48 3.03 7 4.39 28.95 7 12.23 5892.06 7121.13 4.007 0.00
3 24.09 7 4.98 11.90 7 10.25 18.727 19.00 10.76 72.21 84.82 7 87.82 9.20 7 13.04 44.45 734.29 41.80 7 31.76 336.277 242.46 4514.32 775.26 5.977 0.17
4 9.46 7 2.62 6.46 7 4.45 40.50 7 17.40 16.97 74.74 50.767 57.16 12.98 7 12.43 35.82 739.89 30.677 26.79 737.007 129.39 2549.22 779.57 4.92 7 0.31
5 19.017 6.48 8.80 7 8.18 33.36 7 31.31 8.25 72.51 48.177 73.28 29.83 7 32.14 37.85 743.21 35.56 7 41.20 2021.29 7 747.64 7603.04 7141.01 6.88 7 0.38
6 17.95 7 1.60 1.447 2.78 232.767 56.39 7.36 71.78 248.60 7 480.87 263.09 7 251.29 83.07 7127.79 51.54 7 69.33 641.87 7 15.75 4831.98 7121.24 8.20 7 2.13
7 362.747 27.11 104.317 28.88 1327.30 7 967.66 45.55 71.27 362.30 7 73.92 46.54 7 80.58 138.84 757.95 130.217 50.22 35900.45 7 9483.78 197305.02 7979.74 9.86 7 0.40
8 9.02 7 1.62 5.20 7 1.79 93.06 7 18.19 29.75 72.83 195.80 7 294.95 100.577 135.98 208.73 7138.44 244.28 7 127.61 338.96 7 59.91 2183.16 777.35 17.20 7 2.19
9 1.94 7 1.70 1.717 2.90 1.357 2.43 9.66 73.40 35.167 42.59 13.447 18.77 23.17 722.82 23.89 7 24.34 12.45 7 10.55 3422.84 786.06 1.95 7 0.33
10 27.32 7 12.83 0.50 7 2.71 49.20 7 49.05 2.35 71.44 25.137 63.21 40.83 7 68.14 8.80 721.31 9.137 24.40 1851.68 7 458.88 9676.26 7123.26 7.55 7 0.72
11 12.777 2.14 4.42 7 3.32 170.94 7 46.05 21.02 75.14 312.517 406.59 128.69 7 173.76 256.27 7206.14 175.067 160.33 512.94 7 14.38 4074.50 7135.66 26.20 7 3.41
12 103.217 9.23 57.84 7 19.64 2591.79 7 227.20 46.29 71.16 456.577 629.77 200.707 193.33 825.20 7452.10 686.867 470.06 1150.727 44.27 28415.31 7807.22 54.92 7 1.35
13 6.94 7 11.70 0.05 7 0.50 61.25 7 27.60 8.86 72.96 66.30 7 53.11 6.56 7 7.09 4.96 70.20 4.95 7 0.22 2422.65 7 90.36 7015.56 7205.23 4.95 7 0.22
14 54.98 7 7.28 21.157 9.93 888.06 7 98.76 34.52 73.80 185.737 354.97 422.98 7 452.77 295.28 7181.99 291.917 174.98 2706.977 47.86 18280.02 7337.37 39.98 7 0.92
15 13.53 7 2.52 1.167 2.42 124.747 44.41 12.00 75.67 54.36 7 110.28 17.59 7 53.92 6.82 75.12 6.82 7 5.12 737.117 22.61 3905.28 7105.89 8.04 7 3.13
16 14.03 7 1.77 16.09 7 2.14 20.64 7 4.21 21.06 72.85 36.40 7 30.41 3.777 3.69 21.68 723.53 3.717 4.28 578.157 45.95 1768.34 767.28 3.187 0.39
17 257.777 9.31 156.04 7 38.09 223.86 7 39.27 9.85 70.11 1213.68 7 137.87 26.007 0.00 25.95 70.22 25.95 7 0.22 8756.417 70.63 34340.80 7555.46 26.007 0.00
18 21.82 7 2.35 13.26 7 5.55 354.337 88.37 28.54 72.74 159.03 7 382.03 274.067 515.69 293.93 7572.49 86.52 7 7.59 1087.147 30.84 7387.20 7277.38 67.117 6.18
19 4.747 0.60 4.517 0.88 31.007 0.00 25.04 72.49 31.007 0.00 28.83 7 1.58 273.28 7238.99 163.977 74.44 8.28 7 13.27 1612.55 766.70 30.017 1.31

Table E1
Comparison between the univariate ADTree (ADT), the univariate version of SADT (SADTU) and the univariate version of rLADT (rLADTU), reporting prediction accuracy, induction time, decision tree size and split complexity for each dataset ID.

References

[1] P. Geurts, A. Irrthum, L. Wehenkel, Supervised learning with decision tree-based methods in computational and systems biology, Mol. Biosyst. 5 (12) (2009) 1593–1605.
[2] K.-Y.K. Liu, J. Lin, X. Zhou, S.T.C.S. Wong, Boosting alternating decision trees modeling of disease trait information, BMC Genet. 6 (Suppl. 1) (2005) S132.
[3] G. Creamer, Y. Freund, Using boosting for financial analysis and performance prediction: application to S&P 500 companies, Latin American ADRs and banks, Comput. Econ. 36 (2) (2010) 133–151.
[4] M.P.-L. Ooi, H.K. Sok, Y.C. Kuang, S. Demidenko, C. Chan, Defect cluster recognition system for fabricated semiconductor wafers, Eng. Appl. Artif. Intell. 26 (3) (2013) 1029–1043.
[5] C. Kingsford, S.L. Salzberg, What are decision trees? Nat. Biotechnol. 26 (9) (2008) 1011–1013.
[6] J. He, H. Hu, R. Harrison, P. Tai, Y. Pan, Transmembrane segments prediction and understanding using support vector machine and decision tree, Expert Syst. Appl. 30 (2006) 64–72.
[7] J. Quinlan, Bagging, boosting, and C4.5, in: Proceedings of the 13th National Conference on Artificial Intelligence, 1996, pp. 725–730.
[8] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[9] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[10] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 124–133.
[11] F. De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, in: Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 35–49.
[12] G. Holmes, B. Pfahringer, R. Kirkby, Multiclass alternating decision trees, in: Proceedings of the 13th European Conference on Machine Learning, 2002, pp. 161–172.
[13] Y.C. Kuang, M.P.L. Ooi, Complex feature alternating decision tree, Int. J. Intell. Syst. Technol. Appl. 9 (3) (2010) 335–353.
[14] R. Guy, P. Santago, C. Langefeld, Bootstrap aggregating of alternating decision trees to detect sets of SNPs that associate with disease, Genet. Epidemiol. 36 (2012) 99–106.
[15] G. Stiglic, M. Bajgot, P. Kokol, Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays, BMC Bioinform. 11 (2010), article 176.
[16] M. Drauschke, Multi-class ADTboost, Technical Report No. 6, Department of Photogrammetry, Institute of Geodesy and Geoinformation, University of Bonn, 2008.
[17] H.K. Sok, M.P.-L. Ooi, Y.C. Kuang, Sparse alternating decision tree, Pattern Recognit. Lett. 60–61 (2015) 57–64.
[18] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, 1993.
[19] L. Breiman, J.H. Friedman, R.A. Olshen, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[20] A. López-Chau, J. Cervantes, L. López-García, F.G. Lamont, Fisher's decision tree, Expert Syst. Appl. 40 (16) (2013) 6283–6291.
[21] B. Menze, B. Kelm, D. Splitthoff, On oblique random forests, in: Proceedings of the European Conference on Machine Learning (ECML/PKDD), 2011, pp. 453–469.
[22] A. Franco-Arcega, Splitting attribute subsets for large datasets, in: Proceedings of the 23rd Canadian Conference on Artificial Intelligence, 2010, pp. 370–373.
[23] S. Schulter, P. Wohlhart, C. Leistner, A. Saffari, P.M. Roth, H. Bischof, Alternating decision forests, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 508–515.
[24] J. Kozak, U. Boryczka, Multiple boosting in the ant colony decision forest meta-classifier, Knowl.-Based Syst. 75 (2015) 141–151.
[25] L. Clemmensen, T. Hastie, D. Witten, B. Ersbøll, Sparse discriminant analysis, Technometrics 53 (4) (2011) 406–413.
[26] T. Hesterberg, N.H. Choi, L. Meier, C. Fraley, Least angle and ℓ1 penalized regression: a review, Stat. Surv. 2 (2008) 61–93.
[27] R. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[28] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23 (1995) 73–102.
[29] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67 (2) (2005) 301–320.
[30] A. Hoerl, R. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1) (1970) 55–67.
[31] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) 58 (1996) 267–288.
[32] Y. Chen, P. Du, Y. Wang, Variable selection in linear models, Wiley Interdiscip. Rev.: Comput. Stat. 6 (1) (2014) 1–9.
[33] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, 1971.
[34] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461–464.
[35] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2000) 337–407.
[36] B. Efron, T. Hastie, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.

[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explor. 11 (1) (2009).
[38] K. Sjöstrand, L. Clemmensen, SpaSM: a MATLAB toolbox for sparse statistical modeling, 2012. [Online]. Available: 〈http://www2.imm.dtu.dk/projects/spasm〉 (accessed 21.08.14).
[39] A. Frank, A. Asuncion, UCI Machine Learning Repository. [Online]. Available: 〈http://archive.ics.uci.edu/ml〉.
[40] University of Eastern Finland, Spectral Color Research Group. [Online]. Available: 〈https://www.uef.fi/spectral/spectral-database〉.
[41] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[42] S. Aruoba, J. Fernández-Villaverde, A Comparison of Programming Languages in Economics, Working Paper No. 20263, National Bureau of Economic Research, 2014.
[43] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82.
[44] L. Rokach, O. Maimon, Top-down induction of decision trees classifiers—a survey, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35 (4) (2005) 476–487.

Hong Kuan Sok received the Bachelor of Engineering (Honours) degree in Electrical and Computer Systems Engineering from Monash University, Malaysia, in 2010. He is currently a Ph.D. student with particular interests in machine learning and pattern recognition.

Melanie Ooi Po-Leen received the Ph.D. degree from Monash University, Malaysia, in 2011. She is currently a Senior Lecturer with the Engineering Faculty, Monash
University. Her research interests include machine learning, computer vision, biomedical imaging and electronic design and test.

Ye Chow Kuang received the Bachelor of Engineering (Honours) degree in electromechanical engineering and the Ph.D. degree from the University of Southampton. He joined Monash University, Malaysia, where he works in the field of machine intelligence and statistical modelling.

Serge Demidenko received the M.E. degree from the Belarusian State University of Informatics and Radio Electronics, and the Ph.D. degree from the Institute of Engineering
Cybernetics, Belarusian Academy of Sciences. He is currently a Professor and the Associate Head of School of Engineering and Advanced Technology, and a Cluster Leader
with Massey University, New Zealand. His research interests include electronic design and test, instrumentation and measurements, and signal processing.
