Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
Multivariate alternating decision trees

Hong Kuan Sok, Melanie Po-Leen Ooi, Ye Chow Kuang, Serge Demidenko

Article history: Received 5 May 2015; Received in revised form 7 July 2015; Accepted 17 August 2015
Keywords: Alternating decision tree; Boosting; Multivariate decision tree; Lasso; LARS

Abstract

Decision trees are comprehensible, but at the cost of a relatively lower prediction accuracy compared to other powerful black-box classifiers such as SVMs. Boosting has been a popular strategy to create an ensemble of decision trees to improve their classification performance, but at the expense of the comprehensibility advantage. To this end, the alternating decision tree (ADTree) has been proposed to allow boosting within a single decision tree to retain comprehension. However, existing ADTrees are univariate, which limits their applicability. This research proposes a novel algorithm – the multivariate ADTree. It presents and discusses its different variations (Fisher's ADTree, Sparse ADTree, and Regularized Logistic ADTree) along with their empirical validation on a set of publicly available datasets. It is shown that the multivariate ADTree has high prediction accuracy comparable to that of decision tree ensembles, while retaining good comprehension close to that of individual univariate decision trees.

© 2015 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2015.08.014
Corresponding author: H.K. Sok. Tel.: +60 35 514 6238; fax: +60 35 514 6207. E-mail address: sok.hong.kuan@monash.edu.
1. Introduction

Decision trees are among the most powerful and popular classifiers available. They are acyclic-directed graphical models that solve classification problems using symbolic representation, i.e., a graph of decision nodes that are connected via edges (Fig. 1(a)). As a result, they follow the flowchart-like human logic and reasoning, making them highly comprehensible. Decision trees model the domain problem as a set of decision rules. Such a model is transparent and is understandable to specialists in relevant application areas [1]. For example in [2], medical experts used the quantitative information obtained from the alternating decision tree model to gain a better understanding between disease phenotypes and affection status. The comprehensibility trait, therefore, makes decision trees highly accessible to users outside just the machine learning community, and therefore they can be found in a wide range of applications such as business [3], manufacturing [4], computational biology [5], bioinformatics [6], etc.

It is often possible to further improve the classification accuracy of an individual decision tree by combining a number of decision trees to make majority-voted decisions [7]. There are two popular strategies to achieve this: bagging [8] and boosting [9]. Unfortunately, an ensemble of decision trees results in many variations in the symbolic representation, which causes the overall classifier to be large, complex, and difficult to interpret. This negates the comprehensibility advantage of being a decision tree [10]. The issue with large and incomprehensible boosted decision trees led to the invention of the alternating decision tree (ADTree), which was designed to retain interpretability in the boosting paradigm [10]. Rather than building a decision tree at every boosting cycle, a much simpler decision stump is created.

Fig. 1(b) shows a graphical illustration of the ADTree. Similar to the decision tree in Fig. 1(a), ADTree is also an acyclic-directed graphical model. However, the symbolic meaning of each node and the manner in which they are connected are different. It does not use leaf nodes at the terminal nodes, or decision nodes as the internal nodes. Instead, many decision stumps (or one-level decision trees) are combined to obtain a special representation where each of the stumps consists of a decision node and two prediction nodes.

ADTree can be viewed as a loose generalization of standard decision trees, boosted decision trees, and boosted decision stumps [10] due to the following reasons. First, ADTree can be used as an alternative to represent any standard decision tree model with the same functionality. In addition to that, ADTree allows multiple decision stumps under the same prediction node to get majority-voted decisions. Boosting can be implemented directly within the same tree as opposed to the conventional way of creating boosted decision trees or boosted decision stumps. There are a number of extensions of ADTree such as multi-label ADTree [11], multi-class ADTree [12] and complex feature ADTree [13]. ADTree has been successfully implemented in various applications such as genetic disorders [14], corporate performance prediction [3], and bioinformatics [15].

Unfortunately, there are two major drawbacks of using univariate decision nodes in ADTree. First, as with any other univariate decision tree, splitting based on a single feature is an axis-parallel partitioning of the input space. This leads to a high bias and generates large decision trees in classification problems that have co-dependent features.
Fig. 1. Decision trees: (a) a classical decision tree consisting of decision nodes as internal nodes and leaf nodes as terminal nodes; (b) an alternating decision tree which can be used to represent the standard decision tree shown in part (a) to make the same prediction; and (c) the accommodation of boosting in the ADTree, whereby more decision stumps can be added to any existing prediction node (highlighted in a circle) to obtain majority-voted decisions.
The resultant large and complex decision tree complicates the interpretation process. Second, ADTree induction is based on the probably approximately correct (PAC) learning framework, which requires a weak learner to achieve an error rate ε that is slightly better than random guessing for binary class problems; formally ε ≤ 0.5 − Ψ for a small constant Ψ (known as the edge). Unfortunately, simple univariate decision stumps sometimes do not satisfy the weak learning condition. This causes the boosting procedure to fail in generating a functioning ADTree model [16].

The aim of this paper is to present a novel multivariate alternating decision tree learning algorithm with boosting capability that offers improved classification performance of decision trees while remaining comprehensible. The goals are to:

1. Outperform the existing univariate ADTree and multivariate (unboosted) decision trees in terms of prediction accuracy while offering good comprehensibility;
2. Match the performance of univariate decision trees for univariate problems while outperforming them on multivariate datasets;
3. Provide superior comprehensibility compared to the ensemble-based decision trees.

There are several different subsections in the existing ADTree algorithm that can be restructured in order to induce a multivariate ADTree. In this paper, three possible variations are explored, namely Fisher's ADTree, Sparse ADTree [17], and regularized Logistic ADTree. The Sparse ADTree presented in the earlier paper was the first attempt to induce a multivariate ADTree. The current paper presents significantly new and further developed results that cover two additional multivariate alternating decision tree (ADTree) designs. This increases the material coverage and comprehension as well as applicability by practitioners and researchers in the field. In addition, the experiments are significantly more rigorous, with extended discussions on the validity, usage and applicability of the multivariate alternating decision trees. All three variants of the multivariate ADTree were tested on a set of real-world datasets against a number of established decision tree learning algorithms such as the original univariate ADTree [10]; univariate decision trees: C4.5 [18] and CART [19]; the multivariate decision tree – Fisher's decision tree [20]; and ensembles of decision trees: Boosted C4.5 and oblique Random Forest [21]. Note that there are other variants of decision trees presented in the literature (e.g., [22,23]). However, the benchmarking algorithms are selected based on the availability of the source codes, and they are used as representatives of the different decision tree families. This was done in order to compare the overall prediction accuracy, induction time, tree size and complexity/comprehensibility against different families of decision trees. For statistical verification and comparisons, the standard 10 × 10-fold stratified cross-validations were performed on all datasets to generate performance estimations.

The rest of this paper is as follows: Section 2 provides a brief literature review on supervised learning, boosting, and ADTree. The proposed multivariate ADTree algorithms are presented in Section 3. The experimental setup and obtained results are given in Section 4 together with the detailed discussions. Section 5 presents the conclusion and outlines the future work.

2. Background on alternating decision tree

2.1. Supervised learning framework

For better readability, the notations used in this paper are first described. Vectors are typed in bold (e.g., x) and they are all column vectors unless specified otherwise. Scalars are typed in regular font (e.g., λ). Matrices are given in capital bold (e.g., X). Specific entries in vectors are indexed with a scalar. For example, the ith entry of a column vector x is denoted as x_i. For matrices, the entry in the ith row and jth column of a matrix X is denoted as X_ij. The entire ith row of a matrix X is denoted as X_i and the entire jth column of a matrix X is denoted as X_j.

Under supervised learning, a training dataset [X; y] consists of a set of n labeled samples, where each sample x ∈ R^p is a real-valued column vector of p features and its corresponding label y ∈ {+1, −1} assumes either the positive or the negative class for a binary classification problem. The dimension of the design matrix X is n × p, and the column vector y is of length n. The ith row of the design matrix X, or X_i, refers to the ith sample as a transposed vector, i.e., x^T. The goal of a decision tree learning algorithm is to learn a single classification model. For ensemble learning, the weak learner is repeatedly called to learn multiple models.

2.2. Boosting

Boosting is an important development in the field of machine learning. It allows for any choice of a prevalent learning algorithm as long as the weak learning condition ε ≤ 0.5 − Ψ is satisfied for binary class problems. Paper [24] shows that decision trees are a popular choice as weak learners due to their inherent instability to small variations in training datasets. Boosting creates such variations through a weight distribution over the training samples by sequential reweighting. This paper implements two different boosting algorithms to induce the multivariate ADTree, namely AdaBoost and LogitBoost (see Table 1).

AdaBoost initializes the weight distribution w as a uniform one with an initial weight value of 1/n. The weight of the ith sample at the tth boosting procedure is indicated as w_i^(t). AdaBoost then repeats for T boosting procedures to obtain a weak model f_t(x) from the weak learner, determines the linear coefficient of the weak model α_t
Table 1
AdaBoost and LogitBoost algorithms. Output: F(x) = Σ_{t=1}^{T} g_t(x).
based on the error ε_t, and updates the weight distribution before the next boosting procedure starts. An indicator function I(·) returns 1 if the Boolean expression inside the function evaluates to True. The output is obtained through a linear combination of the weak models.

For LogitBoost, the uniform weight distribution is also initialized in the same manner as in AdaBoost. In addition, it also keeps track of the probability estimate of the positive class p(x) and the regression value G(x) for all n samples. It then repeats for T boosting procedures. The working response (or pseudo-label z) and the weight distribution are updated at the beginning of every boosting procedure. A regression function g(x) is fitted to the weighted least squares regression problem. The regression values of all training samples are updated to calculate the new probability estimate of each sample at the end of every boosting procedure. The output is a regression function, whereby classification is achieved by taking the sign of the summation.
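As a rough illustration of the two boosting procedures summarized above, the sketch below implements AdaBoost with weighted decision stumps and a single LogitBoost round with a weighted least-squares fit. It is a minimal Python/NumPy sketch under the assumptions that labels are in {−1, +1} (or {0, 1} for the LogitBoost round) and that a stump or linear model is an acceptable stand-in for the weak learner; the function names and the LogitBoost update formulas follow the textbook formulation, since Table 1 is not fully reproduced here.

import numpy as np

def fit_stump(X, y, w):
    # Weighted decision stump: choose the (feature, threshold, sign) with the
    # smallest weighted error over the training set.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return (lambda Z, j=j, thr=thr, sign=sign: np.where(Z[:, j] <= thr, sign, -sign)), err

def adaboost(X, y, T=10):
    # AdaBoost: uniform initial weights 1/n, T boosting rounds, weak model f_t,
    # coefficient alpha_t from the weighted error, then reweighting.
    n = len(y)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(T):
        f_t, eps = fit_stump(X, y, w)
        if eps >= 0.5:                        # weak-learning condition violated
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * f_t(X))      # emphasize misclassified samples
        w /= w.sum()
        models.append(f_t)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * f(Z) for a, f in zip(alphas, models)))

def logitboost_round(X, y01, G):
    # One LogitBoost round: update the working response z and weights from the
    # current log-odds G, fit a weighted least-squares linear model g(x), and
    # add it to G.  y01 holds 0/1 class labels.
    p = 1.0 / (1.0 + np.exp(-G))              # probability estimate of the positive class
    w = np.clip(p * (1.0 - p), 1e-6, None)    # boosting weight distribution
    z = (y01 - p) / w                         # working response (pseudo-label)
    Xb = np.hstack([X, np.ones((len(y01), 1))])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * Xb, np.sqrt(w) * z, rcond=None)
    return G + Xb @ beta                      # updated regression values G(x)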
Fig. 2. Illustration of ADTree induction whereby a new decision stump is added to one of the existing prediction nodes after each boosting procedure. The weak learner generates a set of base conditions C that become potential candidates for forming the next decision node. The best combination of base condition and prediction node is chosen to form a new decision stump.

2.3. Alternating decision tree
ADTree uniquely bridges the gap between boosting and the decision tree algorithm. Instead of the conventional approach of building a forest of decision trees, the boosting procedure is incorporated within a single decision tree to facilitate comprehensibility. ADTree consists of alternating layers of decision nodes and prediction nodes, starting with a root prediction node. Mathematically, ADTree can be described as a set of decision rules as shown in (1). Each decision rule returns one of the following: a positive prediction score α+, a negative prediction score α−, or a zero score, depending on the nested if statement such that r_t(x): if (precondition) then [if (condition) then α+ else α−] else 0. The precondition is a conjunction of conditions, while the condition itself is a Boolean predicate that is embedded in the decision node.

ADTree model := {r_t(x)}_{t=0}^{T},    (1)

To perform classification, an input sample is sorted top-down from the root prediction node. Instead of following a single path from the root decision node to one of the leaf nodes as in standard decision trees, one or more paths could be traversed within ADTree due to possible multiple decision stumps under the same prediction node. The prediction scores from all the traversed prediction nodes are summed to make a prediction on the class label. The sign of the summation is used to indicate either a positive or a negative class label. The magnitude of the summation is a good indication of classification confidence.

In terms of learning, the ADTree model can be grown through any boosting algorithm. AdaBoost was implemented in the seminal work on ADTree [10]. In later years, some research works on ADTree used different boosting algorithms, such as AdaBoost.MH to induce a slightly different ADTree model to handle multi-label problems [11], while others employed LogitBoost to address multiclass problems [12].

With reference to Fig. 2, the root prediction node is first constructed given the original dataset. The rest of the induction is repeated for a given number of boosting iterations. Each boosting cycle adds a new decision stump to one of the "best" prediction nodes to optimally expand the ADTree. Precondition refers to the choice of a prediction node that is selected for inclusion into the ADTree. Condition refers to the decision node of the decision stump. Two prediction values refer to the prediction nodes of the decision stump. The weight distribution over the training dataset is then updated based on the newly added decision rule. This helps to guide the next weak learner when generating a new set of base conditions.

The weak learner shown in Fig. 2 is independent from the core ADTree induction. The existing univariate ADTree uses an exhaustive approach to generate a set of univariate base conditions, each based on a different feature given the weight distribution. The next section of the paper proposes different methods to replace this weak learner in order to introduce multivariate decision nodes. This allows the induction of multivariate base conditions in order to build a multivariate ADTree.
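To make the rule-set view in (1) concrete, the following minimal sketch represents each decision rule r_t(x) by a precondition, a condition and the two prediction scores, and classifies a sample by summing the scores of all traversed prediction nodes and taking the sign, as described above. The class layout, the toy rules and the Python/NumPy representation are illustrative assumptions, not the original implementation.

import numpy as np

class DecisionRule:
    # One ADTree decision rule r_t(x): if the precondition holds, return
    # alpha_plus or alpha_minus depending on the condition; otherwise return 0.
    def __init__(self, precondition, condition, alpha_plus, alpha_minus):
        self.precondition = precondition    # callable: x -> bool (conjunction of conditions)
        self.condition = condition          # callable: x -> bool (decision node predicate)
        self.alpha_plus = alpha_plus        # prediction-node score when condition is True
        self.alpha_minus = alpha_minus      # prediction-node score when condition is False

    def score(self, x):
        if not self.precondition(x):
            return 0.0
        return self.alpha_plus if self.condition(x) else self.alpha_minus

def adtree_predict(rules, x):
    # Sum the prediction scores of all traversed prediction nodes; the sign
    # gives the class label and the magnitude indicates confidence.
    total = sum(r.score(x) for r in rules)
    return np.sign(total), abs(total)

# A toy two-rule ADTree: r_0 is the root prediction node (always active),
# r_1 is a decision stump attached under the root.
rules = [
    DecisionRule(lambda x: True, lambda x: True, 0.2, 0.2),          # root score
    DecisionRule(lambda x: True, lambda x: x[0] <= 1.5, 0.7, -0.4),  # stump on feature 0
]
label, confidence = adtree_predict(rules, np.array([1.0, 3.2]))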
3. Proposed multivariate ADTree

As discussed above in Section 2, ADTree requires a set of base conditions at the beginning of every boosting procedure (Fig. 2). These are potential decision node candidates that dictate the characteristics of the ADTree model. The weak learner is responsible for generating this set of base conditions using the weighted training dataset. Three weak learner designs are proposed and illustrated in Fig. 3: Fisher's ADTree (Section 3.1), the Sparse ADTree (Section 3.2), and the regularized LADTree (Section 3.3), in which the base condition is a regression function g(x) rather than a Boolean function f(x) (see Fig. 3(d)).

Fig. 3. Weak learner for: (a) univariate ADTree, where a set of univariate base conditions is obtained through an exhaustive approach, one for each feature indexed by j; (b) Fisher's ADTree, which results in a single multivariate base condition as all features are used to form an artificial feature x^T β instead of the jth feature x_j; (c) Sparse ADTree, where the β vector can be sparse with many zero elements to facilitate feature selection; and (d) regularized LADTree, where the base condition g(x) is a regression function instead of a Boolean function f(x).

3.1. Fisher's ADTree

Fisher's discriminant [27] is a well-established supervised technique that finds a subspace upon which the projected samples are well separated according to their class labels. The objective is to maximize the between-class covariance β^T Σ_b β with respect to the within-class covariance β^T Σ_w β of the projected samples. This forms Fisher's ratio J(β) (2), which can be maximized by solving the generalized eigenvalue problem. The optimized β parameter is then used in the proposed Fisher's ADTree to form an artificial feature x^T β, which is a linear combination of all the original features. This results in a multivariate decision node, since it uses all the features rather than just the individual jth feature used in the univariate variants. The number of dimensions of the subspace is determined by the total number of classes K: it can only have a maximum of K − 1 discriminative projections. For binary class problems, this results in a single discriminative vector β.

J(β) = (β^T Σ_b β) / (β^T Σ_w β),    (2)

where Σ_b and Σ_w are respectively the between-class and within-class covariances of the original dataset. They are estimated from the training dataset using (3) and (4). The mean vector of the entire training dataset is denoted as μ, while the mean vector of class k is denoted as μ_k. To accommodate the boosting weight distribution w, weighted estimates are used, including:

Σ_w = Σ_{k=1}^{K} Σ_{i ∈ kth class} w_i (X_i − μ_k)^T (X_i − μ_k),    (6)

μ_k = ( Σ_{i ∈ kth class} w_i X_i ) / ( Σ_{i ∈ kth class} w_i ),    (7)

μ = (1/K) Σ_{k=1}^{K} μ_k.    (8)

A similar approach is implemented in Fisher's decision tree [20], which is a multivariate extension of C4.5. It should be emphasized that the proposed Fisher's ADTree differs from the existing Fisher's decision tree in its ability to "boost" several decision stumps under the same prediction node to improve the final prediction.

Algorithm 1. Weighted Fisher's discriminant

Input: training dataset [X; y] and weight distribution w
Statistical procedure to extract information based on [X; y] includes:
1. Calculate the weighted means of the positive and negative classes, μ1 and μ2, using (7);
2. Calculate the weighted between-class covariance matrix Σ_b using (5);
3. Calculate the weighted within-class covariance matrix Σ_w using (6);
4. Maximize Fisher's ratio (2) by solving the generalized eigenvalue problem Σ_b β = λ Σ_w β, where λ and β are also termed the eigenvalue and eigenvector respectively.
Output: β
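A compact sketch of Algorithm 1 for the binary case is given below. It computes the weighted class means (7), the weighted within-class covariance (6) and the mean of the class means (8), and then solves the generalized eigenvalue problem Σ_b β = λ Σ_w β for the single discriminative direction. Because Eq. (5) is not reproduced above, the weighted between-class covariance uses an assumed (standard) form, and the small ridge term added for numerical stability is likewise an implementation choice rather than part of the original algorithm.

import numpy as np

def weighted_fisher_direction(X, y, w):
    # Weighted Fisher's discriminant for binary labels y in {-1, +1}.
    classes = np.unique(y)
    p = X.shape[1]
    mus, class_weights, Sw = [], [], np.zeros((p, p))
    for c in classes:
        idx = (y == c)
        mu_c = np.average(X[idx], axis=0, weights=w[idx])   # Eq. (7)
        diff = X[idx] - mu_c
        Sw += diff.T @ (w[idx][:, None] * diff)             # Eq. (6)
        mus.append(mu_c)
        class_weights.append(w[idx].sum())
    mu = np.mean(mus, axis=0)                               # Eq. (8)
    Sb = np.zeros((p, p))
    for mu_c, wk in zip(mus, class_weights):
        d = (mu_c - mu)[:, None]
        Sb += wk * (d @ d.T)                                # assumed form of Eq. (5)
    # Generalized eigenvalue problem Sb beta = lambda Sw beta: take the leading
    # eigenvector (K - 1 = 1 discriminative direction for K = 2 classes).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-8 * np.eye(p), Sb))
    beta = np.real(evecs[:, np.argmax(np.real(evals))])
    return beta / np.linalg.norm(beta)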
3.2. Sparse ADTree

1. Initialize the score vector θ from a random vector θ*; the optimal score vector θ is then normalized such that θ^T D_π θ = 1;
2. Repeat until convergence:
2.1. For fixed θ, solve (11) to obtain β using the LARSEN [29] algorithm;
2.2. For fixed β, calculate θ = (D_π)^{-1} Y^T X β. The optimal score vector θ is then ortho-normalized to make it orthogonal to θ_0.
Output: β and θ
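The steps above describe an alternating optimal-scoring procedure: for a fixed score vector θ an Elastic Net (LARSEN) problem is solved for β, and for a fixed β the score vector is updated as θ = D_π^{-1} Y^T X β and re-normalized. The sketch below mirrors those two steps, using scikit-learn's ElasticNet as a stand-in solver for LARSEN and omitting the orthogonalization against the trivial score vector θ_0; the indicator-matrix construction, the prior matrix D_π and all parameter values are assumptions made for illustration only.

import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_discriminant_direction(X, Y, lam=1e-3, l1_ratio=0.5, n_iter=20, seed=0):
    # X: n-by-p design matrix; Y: n-by-K class indicator matrix.
    n, K = Y.shape
    D_pi = (Y.T @ Y) / n                          # diagonal matrix of class priors (assumed)
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=K)                    # start from a random score vector
    theta /= np.sqrt(theta @ D_pi @ theta)        # normalize: theta^T D_pi theta = 1
    enet = ElasticNet(alpha=lam, l1_ratio=l1_ratio, fit_intercept=False, max_iter=10000)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):                       # "repeat until convergence"
        z = Y @ theta                             # scored response for fixed theta
        beta = enet.fit(X, z).coef_               # sparse beta (step 2.1)
        theta = np.linalg.solve(D_pi, Y.T @ (X @ beta))   # step 2.2
        theta /= np.sqrt(theta @ D_pi @ theta)    # re-normalize the score vector
    return beta, theta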
3.3. Regularized logistic ADTree (LADTree)

… links AdaBoost to the classical logistic regression, which is a probabilistic discriminative model for classification tasks. The logistic regression models the log-odds ratio between the positive and negative class posteriors Pr(y = +1 | x) and Pr(y = −1 | x) with a regression model G(x) as follows:

log [ Pr(y = +1 | x) / Pr(y = −1 | x) ] = G(x).    (14)

LogitBoost is a nonparametric extension to (14). It uses a linear combination of regression models (15) to estimate the log-odds ratio instead of the fixed parametric form for G(x). The total number of regression models is M.

G(x) = Σ_{m=1}^{M} g_m(x).    (15)

In each boosting procedure of LogitBoost, the aim is to solve a weighted least squares regression problem (the details were presented earlier in Section 2.2). Hence, LogitBoost can be viewed as an Iteratively Reweighted Least Squares (IRLS) regression formulation. Any regression model g_m(x) can be implemented to induce G(x). To form the multivariate ADTree, the regression model g_m(x) is restricted to be of a linear type, g_m(x) := x^T β, which results in solving (16). The matrix W is diagonal of n × n dimension. Each diagonal entry indicates a weight value for one training sample. Vector z is the updated pseudo-label of length n. The only parameter to optimize is β. The weight is incorporated directly such that the output and design matrix are W^{1/2} z and W^{1/2} X respectively. Here the optimization process is of a standard linear regression type.

min_β n^{-1} ‖ W^{1/2} z − W^{1/2} X β ‖²₂.    (16)

By expressing the problem in the form of (16), it becomes possible to take advantage of the vast regularized linear regression literature to accommodate the boosting weight distribution. Note that for the Sparse ADTree learning algorithm, the LARSEN algorithm has been modified to accommodate the weight distribution. For the regularized LADTree, the weight distribution is assimilated as a part of the linear regression problem in minimizing the residual value between W^{1/2} z and W^{1/2} X β. This alleviates the need to convert the categorical responses to real-valued ones through optimal scoring.

Unfortunately, just the use of (16) is still insufficient, since a constraint or penalization function must be placed on β in order to provide the capability to shape the characteristics of the ADTree decision node (e.g., feature selection). Therefore a penalization function J(β) is added to (16) to obtain the constrained regression solution shown in (17). From a Bayesian perspective, this is effectively equivalent to placing a prior on the β solution in maximizing the posterior likelihood.

min_β n^{-1} ‖ W^{1/2} z − W^{1/2} X β ‖²₂ + J(β).    (17)

There is a wide range of penalization techniques of the form (18). Classical ones include Ridge (‖β‖²₂), Lasso (‖β‖₁), and the Elastic Net as presented previously in Section 3.2. Their solvers can be implemented in a classical form in each boosting procedure to produce multivariate base conditions for the regularized LADTree induction (see Fig. 3(d)). In this paper, two different variants of the regularized LADTree, using Lasso and Elastic Net respectively, are presented.

The proposed regularized LADTree has a modular design that can seamlessly incorporate different types of linear regularization techniques. This essentially gives ADTree the ability to change its inherent model selection (or feature selection) approach without affecting the learning algorithm itself. The use of different regularization techniques also allows users to preselect the number of features for their given application. For example, selecting k = 1 in the original LARS (the solver for Lasso regression) [36] or the LARSEN algorithm will generate a univariate tree. The proposed regularized LADTree's modularity and flexibility are the greatest advantage of this approach over all other ADTree designs. LADTree users can apply any of the newer classical penalization techniques [26] and select any number of features that they wish to incorporate in order to customize the tree for their specific applications.
4. Comparative experimental analysis

4.1. Experimental design and validation

The proposed new multivariate ADTree designs discussed in Section 3 above are Fisher's ADTree, Sparse ADTree, regularized LADTree using Lasso, and regularized LADTree using Elastic Net. In order to gauge their performance against other types of decision trees, several algorithms that are well known and well represented in the literature are chosen so as to include each general type of decision tree (Table 2). The discriminant analysis classifier is also included since this technique has been implemented in two of the multivariate ADTree designs. The chosen learning algorithms are listed below:

1. Univariate decision tree: C4.5 and CART [37];
2. Multivariate decision tree: Fisher's decision tree [20];
3. Ensemble of univariate decision trees: Boosted C4.5 [37];
4. Ensemble of multivariate decision trees: Oblique Random Forest [21];
5. Univariate boosted decision tree: ADTree [10];
6. Sparse discriminant analysis [38].

Table 2
Abbreviated algorithm names.

Abbreviation  Description
ADT           Alternating Decision Tree
C4.5          C4.5
CART          CART
FADT          Fisher's Alternating Decision Tree
FDT           Fisher's Decision Tree
oRF           Oblique Random Forest
rLADTEN       Elastic Net Regularized Logistic Alternating Decision Tree
rLADTL        Lasso Regularized Logistic Alternating Decision Tree
SADTEN        Elastic Net Regularized Sparse Alternating Decision Tree
SLDA          Sparse Linear Discriminant Analysis

The datasets used in this study are given in Table 3. The datasets are shortlisted such that each of them consists of only real-valued feature measurements. Datasets with categorical features are excluded, since multivariate trees must convert categorical features to real-valued features, and such conversions could bias the performance comparisons.

University of California, Irvine (UCI) datasets [39] are associated with a wide range of real-world problems. This allows comparing performances of the trees across datasets of varying characteristics (i.e., feature measurements of different nature representing particular domain problems). Three additional spectral datasets from the University of Eastern Finland (UEF) [40] are included because their characteristics are known to have highly correlated features. This allows comparisons between the decision trees on multivariate correlated features. All datasets are preprocessed to center each feature to zero and to scale it to a standard deviation of one. All experiments were conducted on a PC with an Intel® Core™ 3.2 GHz i5 CPU and 4 GB RAM.

A standard 10-times 10-fold stratified cross-validation was performed on each dataset for each learning algorithm to generate performance estimation data. The employed performance metrics were: prediction accuracy, induction time, decision tree size, and decision node complexity. Comprehensibility can be viewed as a tradeoff between the decision tree size and decision node complexity.
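The evaluation protocol described above can be reproduced along the following lines with scikit-learn's RepeatedStratifiedKFold; the sketch standardizes each feature to zero mean and unit standard deviation and reports the mean and standard deviation of accuracy. The choice of library, the stand-in classifier and the exact way the standard deviation is aggregated are assumptions, since the paper's own implementation is not specified here.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier   # stand-in for the evaluated tree learners

def repeated_cv_accuracy(X, y, make_clf, n_splits=10, n_repeats=10, seed=0):
    # 10-times 10-fold stratified cross-validation on standardized features,
    # returning the mean and standard deviation of accuracy in percent.
    Xs = StandardScaler().fit_transform(X)        # zero mean, unit standard deviation
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = [make_clf().fit(Xs[tr], y[tr]).score(Xs[te], y[te]) for tr, te in rskf.split(Xs, y)]
    return 100.0 * np.mean(scores), 100.0 * np.std(scores)

# Example: acc_mean, acc_std = repeated_cv_accuracy(X, y, DecisionTreeClassifier)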
Fig. 4. Average ranks of the algorithms (rank values are shown in brackets next to the algorithms) and the corresponding statistical comparisons for: (a) prediction accuracy, (b) decision tree size, and (c) split complexity (total number of nonzero feature coefficients). Groups of algorithms that are not statistically significantly different are connected by a bold line. CD refers to the critical difference in terms of rank value. The proposed multivariate ADTree variants are shown in bold for readability.

Fig. 5. Frequency of training samples versus feature range (first decision stump) for: (a) univariate ADTree, and (b) Fisher's ADTree, for the Forest dataset.
… the three proposed multivariate ADTree variants overcome these limitations. The Forest dataset can be used as an example to illustrate it. ADT selected the 90th feature of the Forest dataset for splitting in the first decision stump. However, the histogram in Fig. 5(a) shows an obvious distribution overlap between the positive and negative training samples over the selected feature range. This violates the weak learning condition of achieving at least 50% accuracy, which is the requirement for boosting to work. In contrast to that, the proposed FADT algorithm uses Fisher's discriminant analysis. It synthesizes a feature (through linear projection) that is more discriminative, where the positive and negative training samples are well separated over the feature range (see Fig. 5(b)).

The univariate ADTree is a subclass of the SADT and rLADT algorithms. It can be generated by choosing to use only one active feature when computing the regularization path. A separate analysis was performed in this research to compare ADT with SADTUNI and rLADTUNI. The raw performances of the three algorithms are tabulated in Appendix E. Both SADTUNI and rLADTUNI do not suffer from the ADT's exhaustive approach that is used to generate a set of univariate base conditions. Therefore, it can be observed that both trees are consistently faster to induce compared to ADT on all datasets. Since they are all univariate ADTrees, the split complexity is completely dependent on the tree size. Both rLADTUNI and SADTUNI are statistically smaller than ADT, thus leading to the conclusion that they are generally more comprehensible. However, the ADT is statistically significantly better in its accuracy of prediction in comparison to SADTUNI and rLADTUNI. Thus, even though SADT and rLADT are able to induce a purely univariate ADTree, in cases where prediction accuracy is prized over induction time and tree size, it is more beneficial to induce multivariate trees.
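The effect discussed above can be checked numerically: a univariate stump on a poorly separated feature cannot reach a weighted accuracy above 0.5, whereas a stump on the Fisher-projected artificial feature x^T β typically can. The helper below scores the best achievable weighted stump accuracy for any candidate feature; the variable names and the specific feature index are illustrative assumptions.

import numpy as np

def weighted_stump_accuracy(feature, y, w):
    # Best weighted accuracy of a single-feature decision stump; values at or
    # below 0.5 violate the weak-learning condition required for boosting.
    best = 0.0
    for thr in np.unique(feature):
        for sign in (1.0, -1.0):
            pred = np.where(feature <= thr, sign, -sign)
            best = max(best, np.sum(w * (pred == y)) / np.sum(w))
    return best

# acc_raw    = weighted_stump_accuracy(X[:, 89], y, w)   # e.g. the 90th feature (index 89)
# acc_fisher = weighted_stump_accuracy(X @ beta, y, w)   # Fisher-projected artificial feature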
Table 4
Performance comparison between FDT and FADT for some cases where FADT predicts better than FDT. The best performing value is highlighted in bold.

Dataset          Accuracy (FDT)   Accuracy (FADT)   Time, s (FDT)   Time, s (FADT)   Tree size (FDT)   Tree size (FADT)
Liver disorder   50.67 ± 4.61     63.47 ± 1.54      0.00 ± 0.00     0.19 ± 0.10      7.24 ± 1.32       43.42 ± 15.18
Heart            73.62 ± 1.95     76.34 ± 2.22      0.04 ± 0.00     0.09 ± 0.09      11.58 ± 0.60      17.95 ± 13.21
Pima Indians     64.96 ± 4.86     75.16 ± 1.04      0.01 ± 0.00     0.04 ± 0.02      14.50 ± 1.55      26.38 ± 11.20
MAGIC gamma      74.69 ± 4.12     82.52 ± 0.12      1.10 ± 0.03     8.55 ± 1.29      266.46 ± 53.76    109.69 ± 8.04
ILPD             59.03 ± 3.52     71.39 ± 0.26      0.01 ± 0.00     0.03 ± 0.05      10.84 ± 2.83      8.65 ± 5.57
QSAR             81.18 ± 0.90     85.67 ± 0.57      0.13 ± 0.01     0.06 ± 0.04      44.32 ± 1.54      14.59 ± 6.62
Table 5
Prediction accuracy of SADT and SLDA for cases whereby SADT generates a single SLDA model in its sole decision node (e.g., it behaves like an SLDA). The best performing value is highlighted in bold.

Dataset            SADT            SLDA
Blood transfusion  76.21 ± 0.00    66.00 ± 0.42
Banknote           98.21 ± 0.05    97.49 ± 0.09
Woodchip (UEF)     99.58 ± 0.01    99.61 ± 0.01
Paper (UEF)        100.00 ± 0.00   100.00 ± 0.00

Table 6
Prediction accuracies of C4.5, CART and rLADTree for medical datasets with highly discriminative features. The best performing value is highlighted in bold.

Dataset          C4.5           CART           rLADTL         rLADTEN
Breast cancer    93.52 ± 0.78   93.04 ± 0.73   96.40 ± 0.26   96.52 ± 0.23
Liver disorders  65.76 ± 2.14   66.38 ± 2.45   66.99 ± 2.22   66.85 ± 1.54
Vertebral        81.23 ± 1.01   80.81 ± 1.25   82.87 ± 1.03   82.68 ± 1.43
Table 7
Prediction accuracies and tree sizes of C4.5, CART and rLADT for spectral datasets with highly correlated features. The best performing value is highlighted in bold.

           Prediction accuracy                                             Tree size
Dataset    C4.5           CART           rLADTL         rLADTEN           C4.5            CART            rLADTL          rLADTEN
Wood-chip  91.94 ± 0.18   91.64 ± 0.26   99.61 ± 0.02   99.45 ± 0.01      516.54 ± 3.97   313.08 ± 25.40  4.00 ± 0.00     4.00 ± 0.00
Forest     88.61 ± 0.70   88.30 ± 0.52   95.94 ± 0.35   93.19 ± 0.32      44.64 ± 1.09    27.52 ± 3.96    16.45 ± 13.78   4.00 ± 0.00
Paper      95.32 ± 1.39   96.99 ± 1.33   98.36 ± 1.07   96.18 ± 1.67      10.48 ± 0.30    10.02 ± 0.75    37.93 ± 9.81    58.12 ± 9.01
Fig. 7. rLADTL model (a) and stem plot (b) of the β feature coefficients for each spectral measurement of the woodchip dataset, taken from the decision node in (a). The stems are colored according to the visible light colors depending on the wavelength.
… was ranked just below the univariate trees. SADT achieved better classification on 9 out of the 19 datasets compared to C4.5 and CART, while inducing a smaller decision tree on 12 out of the 19 datasets. In short, it can be concluded that SADT is a successful nonparametric extension to SLDA. It creates a parsimonious version of a multivariate decision tree that results in the smallest decision tree on average, even when compared against univariate decision trees, with only a slightly higher split complexity.
Table 8
Comparisons between the characteristics of the multivariate ADTrees.

Disadvantages: Fisher's ADTree – no feature selection mechanism; Sparse ADTree – restricted to the Elastic Net; regularized LADTree – higher model complexity compared to the Sparse ADTree.
…comprehensibility for most datasets despite being a boosted multivariate tree. Most significantly, the regularized LADTree had better classification performance across all datasets. It is ranked directly in the second tier after the decision tree ensemble algorithms while remaining comprehensible. For example, on applications that contain features with complex interactions, the regularized LADTree builds a more accurate and much smaller tree with its multivariate node compared to C4.5 and CART. At the same time its node complexity remains small due to the use of regularization techniques. It is important to note that the greatest advantage lies in the regularized LADTree's modularity, which allows a wide range of established linear regularization techniques to be applied. This bridges the decision tree and the powerful regularization research fields.

For future research, it would be important to investigate how ADTree can be designed based on different boosting algorithms to handle a wide range of domain problems. This would lead to an advantage over the classical decision trees, which often require a new learning mechanism to achieve certain properties.

Appendix A

See Table A1.

Appendix B

See Table B1.

Appendix C

See Table C1.

Appendix D

See Table D1.
Table A1
Prediction accuracy (average7standard deviation of 10-time 10-fold stratified cross validation) in terms of percentage.
ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA
1 93.52 7 0.78 93.047 0.73 94.97 70.46 94.687 0.80 96.127 0.26 96.89 7 0.37 96.40 70.26 96.52 7 0.23 97.167 0.26 97.217 0.27 96.36 7 0.25
2 77.92 7 0.62 78.28 7 0.37 78.52 70.57 77.38 7 0.41 76.38 7 0.41 76.217 0.00 76.21 70.00 76.217 0.00 77.54 7 0.42 78.50 7 1.01 66.007 0.42
3 65.767 2.14 66.387 2.45 50.6774.61 62.34 7 1.37 63.477 1.54 62.81 7 0.81 66.99 72.22 66.857 1.54 68.907 1.17 72.59 7 1.24 62.55 7 0.38
4 81.23 7 1.01 80.81 7 1.25 83.06 71.24 82.717 1.24 82.87 7 1.13 83.357 0.57 82.87 71.03 82.687 1.43 83.107 1.51 84.94 7 1.08 79.65 7 0.87
5 74.577 0.90 74.147 0.37 64.96 74.86 72.54 7 0.91 75.167 1.04 76.017 0.41 74.68 70.87 74.357 0.85 73.747 1.08 76.09 7 0.54 76.147 0.15
6 75.50 7 1.88 78.317 1.12 73.62 71.95 78.577 1.66 76.34 7 2.22 76.36 7 1.77 79.43 70.00 79.43 7 0.00 80.81 7 1.12 81.187 1.21 68.04 7 1.30
7 85.127 0.15 85.377 0.12 74.69 74.12 78.59 7 0.10 82.52 7 0.12 78.90 7 0.03 81.40 70.09 81.43 7 0.11 88.007 0.64 87.737 0.07 79.45 7 0.02
8 83.83 7 2.01 86.88 7 1.63 84.63 71.57 88.86 7 1.57 81.82 7 2.28 81.677 1.01 81.28 72.27 77.99 7 1.93 92.747 1.13 91.99 7 1.44 82.02 7 1.34
9 70.547 1.08 72.217 1.23 71.60 71.00 71.917 1.59 72.56 7 0.71 71.28 7 0.81 72.98 70.83 73.077 1.03 71.277 1.29 69.36 7 1.28 74.047 0.63
10 68.357 1.75 71.117 0.65 59.03 73.52 71.23 7 0.52 71.417 0.63 71.39 7 0.26 71.51 70.00 71.517 0.00 71.58 7 1.34 72.137 0.85 63.40 7 0.60
11 90.29 7 1.29 88.707 1.02 87.97 71.47 84.187 1.39 80.157 1.57 84.767 0.92 87.81 71.15 87.477 1.11 94.02 7 0.42 94.30 7 0.42 86.487 0.80
12 92.79 7 0.40 92.417 0.23 90.83 70.35 93.63 7 0.16 91.03 7 0.06 90.917 0.08 91.64 70.14 91.60 7 0.20 95.23 7 0.28 86.727 0.24 90.62 7 0.06
13 98.157 0.11 98.23 7 0.07 97.93 70.17 96.46 7 0.06 96.147 0.05 94.59 7 0.06 94.61 70.00 94.617 0.00 98.53 7 0.08 98.25 7 0.11 91.28 7 0.06
14 83.517 1.08 82.88 7 0.57 81.18 70.90 84.047 0.92 85.677 0.57 85.177 0.36 85.13 70.79 85.23 7 0.67 86.917 0.65 87.36 7 0.48 84.95 7 0.29
15 89.84 7 0.71 91.277 0.37 90.06 70.98 91.077 0.44 91.497 0.00 91.497 0.00 91.49 70.00 91.497 0.00 92.517 0.51 91.55 7 0.25 78.677 0.83
16 98.517 0.25 98.26 7 0.32 99.74 70.13 88.80 7 0.23 99.067 0.12 98.217 0.05 96.66 70.06 96.62 7 0.04 99.80 7 0.10 99.82 7 0.05 97.497 0.09
17 91.94 7 0.18 91.647 0.26 99.47 70.05 67.82 7 0.16 91.357 1.95 99.58 7 0.01 99.61 70.02 99.45 7 0.01 98.667 0.08 99.317 0.04 99.617 0.01
18 88.617 0.70 88.30 7 0.52 95.25 70.46 83.717 0.38 95.84 7 0.31 96.42 7 0.36 95.94 70.35 93.197 0.32 92.46 7 0.46 96.46 7 0.34 96.29 7 0.26
19 95.32 7 1.39 96.99 7 1.33 100.00 70.00 96.517 1.67 100.007 0.00 100.007 0.00 98.36 71.07 96.187 1.67 96.417 1.38 98.517 0.69 100.007 0.00
Table B1
Induction time (average7standard deviation of 10-time 10-fold stratified cross validation) in terms of seconds.
ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA
1 0.02 7 0.00 0.167 0.01 0.02 7 0.00 1.43 7 0.39 0.02 7 0.04 0.447 0.22 1.02 7 0.47 2.917 0.71 1.107 0.12 3.96 70.30 0.107 0.03
2 0.007 0.00 0.05 7 0.00 0.007 0.00 0.077 0.03 0.137 0.08 0.017 0.00 0.02 7 0.03 0.93 7 0.38 0.02 7 0.00 14.36 70.54 0.017 0.00
3 0.007 0.00 0.03 7 0.00 0.007 0.00 0.117 0.05 0.197 0.10 0.017 0.01 0.08 7 0.03 0.017 0.02 0.067 0.01 6.26 70.12 0.017 0.00
4 0.007 0.00 0.03 7 0.00 0.007 0.00 0.197 0.09 0.107 0.08 0.02 7 0.01 0.09 7 0.04 0.107 0.05 0.167 0.02 3.40 70.07 0.017 0.00
5 0.017 0.00 0.09 7 0.00 0.017 0.00 0.127 0.07 0.077 0.06 0.047 0.02 0.09 7 0.06 0.08 7 0.04 0.58 7 0.10 12.90 70.29 0.017 0.00
6 0.02 7 0.00 0.077 0.00 0.047 0.00 0.32 7 0.14 0.09 7 0.09 2.88 7 0.68 0.277 0.25 0.107 0.07 0.577 0.10 2.59 70.07 0.147 0.01
7 1.357 0.04 8.58 7 0.13 1.107 0.03 16.25 7 0.79 8.55 7 1.29 1.26 7 0.77 14.077 1.79 0.32 7 0.25 263.29 7 82.54 362.60 73.17 0.197 0.02
8 0.017 0.00 0.047 0.00 0.017 0.00 1.25 7 0.16 0.117 0.07 0.25 7 0.15 0.54 7 0.23 15.69 7 2.07 0.247 0.03 1.6770.03 0.047 0.01
9 0.007 0.00 0.02 7 0.01 0.007 0.00 0.05 7 0.03 0.137 0.05 0.03 7 0.01 0.067 0.03 0.78 7 0.24 0.017 0.00 5.94 70.13 0.007 0.00
10 0.017 0.00 0.08 7 0.00 0.017 0.00 0.03 7 0.03 0.03 7 0.05 0.08 7 0.05 0.017 0.02 0.067 0.02 0.54 7 0.10 8.75 70.14 0.017 0.00
11 0.02 7 0.00 0.107 0.01 0.02 7 0.00 1.96 7 0.82 0.067 0.03 0.447 0.25 1.05 7 0.46 0.017 0.02 0.85 7 0.08 3.54 70.05 0.107 0.11
12 0.92 7 0.05 2.83 7 0.07 0.777 0.06 41.23 7 2.72 0.747 0.36 6.82 7 2.75 14.90 7 3.09 0.83 7 0.26 11.38 7 2.01 45.17 70.43 1.75 7 0.11
13 0.05 7 0.00 0.477 0.01 0.067 0.00 0.577 0.31 0.167 0.07 0.03 7 0.01 0.017 0.00 0.03 7 0.00 4.167 0.58 36.42 70.57 0.02 7 0.00
14 0.107 0.00 0.377 0.02 0.137 0.01 6.687 1.42 0.067 0.04 2.85 7 1.13 1.89 7 0.93 0.017 0.00 4.62 7 0.79 13.23 70.27 0.23 7 0.02
15 0.017 0.00 0.117 0.00 0.017 0.00 0.777 0.45 0.02 7 0.01 0.05 7 0.04 0.017 0.00 0.017 0.00 0.78 7 0.07 4.6770.08 0.03 7 0.01
16 0.017 0.00 0.127 0.01 0.017 0.00 0.377 0.09 0.047 0.02 0.017 0.01 0.187 0.15 0.017 0.01 0.46 7 0.06 7.95 70.16 0.017 0.00
17 1.077 0.12 7.30 7 0.52 0.30 7 0.02 1.25 7 0.02 7.98 7 0.67 0.39 7 0.05 0.077 0.00 0.30 7 0.03 76.26 7 10.19 109.92 71.25 0.58 7 0.03
18 0.117 0.00 0.917 0.05 0.187 0.01 6.23 7 1.10 0.02 7 0.02 11.177 5.35 3.187 4.07 0.46 7 0.05 5.63 7 0.78 5.36 70.08 0.977 0.04
19 0.007 0.00 0.047 0.00 0.007 0.00 1.34 7 0.26 0.007 0.00 0.117 0.02 0.667 0.24 0.83 7 0.24 0.017 0.00 1.23 70.02 0.08 7 0.00
Table C1
Decision tree size (average7 standard deviation of 10-time 10-fold stratified cross validation) in terms of total number of nodes. SLDA is not a decision tree and hence not included.
ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF
1 21.44 70.94 13.08 7 1.62 11.00 70.78 71.65 712.25 6.40 7 4.79 11.59 74.84 45.617 13.24 42.167 12.06 1091.14 738.20 1253.86 7 52.21
2 10.82 71.06 13.62 7 2.26 6.12 70.66 31.93 77.48 33.88 7 12.39 4.00 70.00 6.82 7 4.76 5.177 3.70 62.80 724.90 5992.06 7 121.13
3 49.18 72.52 24.80 7 4.97 7.24 71.32 33.28 76.64 43.42 7 15.18 5.62 72.64 36.197 6.21 36.94 7 12.65 691.08 7495.77 4614.32 7 75.26
4 19.92 70.89 13.92 7 2.61 14.50 71.55 51.91 714.21 26.38 7 11.20 7.63 72.33 34.06 7 10.07 33.707 12.11 1522.35 7266.26 2649.22 7 79.57
5 39.02 74.72 18.60 7 6.76 9.34 72.64 25.75 77.54 19.09 7 10.22 13.24 74.32 23.20 7 10.41 21.677 7.74 4086.29 71509.32 7703.04 7 141.01
6 36.90 70.84 3.88 7 1.71 11.58 70.60 23.08 75.33 17.95 7 13.21 50.56 712.02 16.96 7 11.54 11.86 7 7.54 1333.74 731.50 1710.66 7 40.41
7 726.48 720.51 209.62 7 18.26 266.46 753.76 137.65 73.80 109.69 7 8.04 15.94 78.25 100.63 7 8.35 101.86 7 8.90 71846.90 718979.61 131636.68 7 653.16
8 19.04 71.11 11.40 7 1.96 9.46 70.60 90.25 78.48 27.707 11.51 16.30 77.90 57.85 7 13.21 77.29 7 15.88 726.31 7127.97 1191.58 7 38.67
9 4.88 71.04 4.42 7 2.26 1.90 70.51 29.98 710.19 36.25 7 11.53 18.22 75.43 42.82 7 10.52 42.197 7.75 28.50 722.10 3522.84 7 86.06
10 55.64 78.40 2.007 1.61 10.84 72.83 8.05 74.32 8.65 7 5.57 15.88 78.10 4.87 7 2.75 4.93 7 2.94 3750.71 7926.73 6550.84 7 82.18
11 26.54 71.14 9.84 7 2.38 11.36 70.77 64.06 715.41 29.417 13.17 14.02 74.67 53.32 7 13.64 41.80 7 8.75 1075.88 728.76 1729.80 7 54.26
12 207.42 74.27 116.68 7 15.31 91.94 72.41 139.87 73.49 25.03 7 8.23 11.80 74.52 70.187 7.62 70.87 7 9.99 2335.49 795.68 8218.66 7 230.63
13 51.40 71.50 40.80 7 3.13 25.50 73.78 36.88 711.14 40.78 7 8.62 5.02 71.27 4.007 0.00 4.007 0.00 4895.30 7180.73 7115.56 7 205.23
14 110.96 71.69 43.30 7 6.18 44.32 71.54 104.56 711.39 14.59 7 6.62 33.31 711.11 48.22 7 8.46 51.43 7 10.28 5463.94 795.72 6193.34 7 112.46
15 28.06 71.05 3.32 7 1.19 14.86 71.60 37.00 717.02 10.06 7 6.13 5.68 73.55 4.007 0.00 4.007 0.00 1524.22 745.23 2052.64 7 52.95
16 29.06 71.05 33.187 1.01 11.32 70.67 64.18 78.55 28.30 7 7.21 4.33 70.85 34.45 7 16.43 5.29 7 2.72 1205.94 795.27 1868.34 7 67.28
Table D1
Complexity of multivariate split (average 7 standard deviation) in terms of total number of nonzero coefficients.
ID C4.5 CART FDT ADT FADT SADT rLADTL rLADTEN boosted C4.5 oRF SLDA
1 10.22 7 1.96 6.04 7 2.24 150.007 23.74 23.55 74.08 54.007 132.60 90.86 7 166.57 202.37 7200.66 138.197 107.47 520.577 19.10 2884.65 7130.52 22.197 1.87
2 4.917 2.00 6.317 3.58 10.247 5.52 10.31 72.49 43.40 7 44.17 4.007 0.00 3.82 710.48 3.03 7 4.39 28.95 7 12.23 5892.06 7121.13 4.007 0.00
3 24.09 7 4.98 11.90 7 10.25 18.727 19.00 10.76 72.21 84.82 7 87.82 9.20 7 13.04 44.45 734.29 41.80 7 31.76 336.277 242.46 4514.32 775.26 5.977 0.17
4 9.46 7 2.62 6.46 7 4.45 40.50 7 17.40 16.97 74.74 50.767 57.16 12.98 7 12.43 35.82 739.89 30.677 26.79 737.007 129.39 2549.22 779.57 4.92 7 0.31
5 19.017 6.48 8.80 7 8.18 33.36 7 31.31 8.25 72.51 48.177 73.28 29.83 7 32.14 37.85 743.21 35.56 7 41.20 2021.29 7 747.64 7603.04 7141.01 6.88 7 0.38
6 17.95 7 1.60 1.447 2.78 232.767 56.39 7.36 71.78 248.60 7 480.87 263.09 7 251.29 83.07 7127.79 51.54 7 69.33 641.87 7 15.75 4831.98 7121.24 8.20 7 2.13
7 362.747 27.11 104.317 28.88 1327.30 7 967.66 45.55 71.27 362.30 7 73.92 46.54 7 80.58 138.84 757.95 130.217 50.22 35900.45 7 9483.78 197305.02 7979.74 9.86 7 0.40
8 9.02 7 1.62 5.20 7 1.79 93.06 7 18.19 29.75 72.83 195.80 7 294.95 100.577 135.98 208.73 7138.44 244.28 7 127.61 338.96 7 59.91 2183.16 777.35 17.20 7 2.19
9 1.94 7 1.70 1.717 2.90 1.357 2.43 9.66 73.40 35.167 42.59 13.447 18.77 23.17 722.82 23.89 7 24.34 12.45 7 10.55 3422.84 786.06 1.95 7 0.33
10 27.32 7 12.83 0.50 7 2.71 49.20 7 49.05 2.35 71.44 25.137 63.21 40.83 7 68.14 8.80 721.31 9.137 24.40 1851.68 7 458.88 9676.26 7123.26 7.55 7 0.72
11 12.777 2.14 4.42 7 3.32 170.94 7 46.05 21.02 75.14 312.517 406.59 128.69 7 173.76 256.27 7206.14 175.067 160.33 512.94 7 14.38 4074.50 7135.66 26.20 7 3.41
12 103.217 9.23 57.84 7 19.64 2591.79 7 227.20 46.29 71.16 456.577 629.77 200.707 193.33 825.20 7452.10 686.867 470.06 1150.727 44.27 28415.31 7807.22 54.92 7 1.35
13 6.94 7 11.70 0.05 7 0.50 61.25 7 27.60 8.86 72.96 66.30 7 53.11 6.56 7 7.09 4.96 70.20 4.95 7 0.22 2422.65 7 90.36 7015.56 7205.23 4.95 7 0.22
14 54.98 7 7.28 21.157 9.93 888.06 7 98.76 34.52 73.80 185.737 354.97 422.98 7 452.77 295.28 7181.99 291.917 174.98 2706.977 47.86 18280.02 7337.37 39.98 7 0.92
15 13.53 7 2.52 1.167 2.42 124.747 44.41 12.00 75.67 54.36 7 110.28 17.59 7 53.92 6.82 75.12 6.82 7 5.12 737.117 22.61 3905.28 7105.89 8.04 7 3.13
16 14.03 7 1.77 16.09 7 2.14 20.64 7 4.21 21.06 72.85 36.40 7 30.41 3.777 3.69 21.68 723.53 3.717 4.28 578.157 45.95 1768.34 767.28 3.187 0.39
17 257.777 9.31 156.04 7 38.09 223.86 7 39.27 9.85 70.11 1213.68 7 137.87 26.007 0.00 25.95 70.22 25.95 7 0.22 8756.417 70.63 34340.80 7555.46 26.007 0.00
18 21.82 7 2.35 13.26 7 5.55 354.337 88.37 28.54 72.74 159.03 7 382.03 274.067 515.69 293.93 7572.49 86.52 7 7.59 1087.147 30.84 7387.20 7277.38 67.117 6.18
19 4.747 0.60 4.517 0.88 31.007 0.00 25.04 72.49 31.007 0.00 28.83 7 1.58 273.28 7238.99 163.977 74.44 8.28 7 13.27 1612.55 766.70 30.017 1.31
Appendix E

Comparison between univariate ADTree, univariate version of SADT (SADTU) and univariate version of rLADT (rLADTU).

References

[1] P. Geurts, A. Irrthum, L. Wehenkel, Supervised learning with decision tree-based methods in computational and systems biology, Mol. Biosyst. 5 (12) (2009) 1593–1605.
[2] K.-Y.K. Liu, J. Lin, X. Zhou, S.T.C.S. Wong, Boosting alternating decision trees modeling of disease trait information, BMC Genet. 6 (Suppl. 1) (2005) S132.
[3] G. Creamer, Y. Freund, Using boosting for financial analysis and performance prediction.
[4] M.P.-L. Ooi, H.K. Sok, Y.C. Kuang, S. Demidenko, C. Chan, Defect cluster recognition system for fabricated semiconductor wafers, Eng. Appl. Artif. Intell. 26 (3) (2013) 1029–1043.
[5] C. Kingsford, S.L. Salzberg, What are decision trees? Nat. Biotechnol. 26 (9) (2008) 1011–1013.
[6] … Syst. Appl. 30 (2006) 64–72.
[7] J. Quinlan, Bagging, boosting, and C4.5, in: Proceedings of the 13th National Conference on Artificial Intelligence, 1996, pp. 725–730.
[8] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[12] …, pp. 161–172.
[13] Y.C. Kuang, M.P.L. Ooi, Complex feature alternating decision tree, Int. J. Intell. …
[14] … trees to detect sets of SNPs that associate with disease, Genet. Epidemiol. 36 (2012) 99–106.
[15] G. Stiglic, M. Bajgot, P. Kokol, Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays, BMC Bioinform. 11 (2010) (article 176).
[16] M. Drauschke, Multi-class ADTboost, Technical Report No. 6, Department of Photogrammetry, Institute of Geodesy and Geoinformation, University of Bonn, 2008.
[17] H.K. Sok, M.P.-L. Ooi, Y.C. Kuang, Sparse alternating decision tree, Pattern Recognit. Lett. 60–61 (2015) 57–64.
[18] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, 1993.
[19] L. Breiman, J.H. Friedman, R.A. Olshen, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[22] A. Franco-Arcega, Splitting attribute subsets for large datasets, in: Proceedings …
[29] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 67 (2) (2005) 301–320.
[31] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) 58 (1996) 267–288.
[32] Y. Chen, P. Du, Y. Wang, Variable selection in linear models, Wiley Interdiscip. …
[36] B. Efron, T. Hastie, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explor. 11 (1) (2009).
[38] K. Sjöstrand, L. Clemmensen, SpaSM: a Matlab toolbox for sparse statistical modeling, 2012. [Online]. Available: 〈http://www2.imm.dtu.dk/projects/spasm〉 (accessed 21.08.14).
[39] A. Frank, A. Asuncion, UCI machine learning repository. [Online]. Available: 〈http://archive.ics.uci.edu/ml〉.
[40] University of Eastern Finland, Spectral Color Research Group. [Online]. Available: 〈https://www.uef.fi/spectral/spectral-database〉.
[41] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[42] S. Aruoba, J. Fernández-Villaverde, A Comparison of Programming Languages in Economics, Working Paper No. 20263, National Bureau of Economic Research, 2014.
[43] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82.
[44] L. Rokach, O. Maimon, Top-down induction of decision trees classifiers—a survey, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35 (4) (2005) 476–487.
Hong Kuan Sok received the Bachelor of Engineering (Honours) in Electrical and Computer Systems Engineering degree from Monash University, Malaysia in 2010. He is
currently a Ph.D. student with particular interests in machine learning and pattern recognition.
Melanie Ooi Po-Leen received the Ph.D. degree from Monash University, Malaysia, in 2011. She is currently a Senior Lecturer with the Engineering Faculty, Monash
University. Her research interests include machine learning, computer vision, biomedical imaging and electronic design and test.
Ye Chow Kuang received the Bachelor of Engineering (Honours) degree in electromechanical engineering, and the Ph.D. degree from University of Southampton. He joined
Monash University, Malaysia, where he is involved in the field of machine intelligence and statistical modelling.
Serge Demidenko received the M.E. degree from the Belarusian State University of Informatics and Radio Electronics, and the Ph.D. degree from the Institute of Engineering
Cybernetics, Belarusian Academy of Sciences. He is currently a Professor and the Associate Head of School of Engineering and Advanced Technology, and a Cluster Leader
with Massey University, New Zealand. His research interests include electronic design and test, instrumentation and measurements, and signal processing.