Information Fusion
journal homepage: www.elsevier.com/locate/inffus
Article info

Article history:
Received 8 February 2011
Received in revised form 14 March 2012
Accepted 15 March 2012
Available online xxxx

Keywords:
Imprecise probabilities
Credal sets
Uncertainty measures
Supervised classification
Decision trees
Ensemble methods

Abstract

In this paper, we present an experimental comparison among different strategies for combining decision trees built by means of imprecise probabilities and uncertainty measures. It has been proven that the combination or fusion of the information obtained from several classifiers can improve the final process of classification. We use previously developed schemes, known as Bagging and Boosting, along with a new one based on the variation of the root node via the information rank of each feature of the class variable. To this end, we applied two different approaches to deal with missing data and continuous variables. We use a set of tests on the performance of the methods analyzed here to show that, with the appropriate approach, the Boosting scheme constitutes an excellent way to combine this type of decision tree. It should be noted that it provides good results, even compared with a standard Random Forest classifier, a successful procedure very commonly used in the literature.

© 2012 Elsevier B.V. All rights reserved.
1566-2535/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.inffus.2012.03.003
Please cite this article in press as: J. Abellán, Ensembles of decision trees based on imprecise probabilities and uncertainty measures, Informat. Fusion (2012), http://dx.doi.org/10.1016/j.inffus.2012.03.003
J. Abellán / Information Fusion xxx (2012) xxx–xxx
IRN, with a standard previous preprocessing approach, performs well compared to other classic ensemble methods that use classical decision trees. The method does not need to set the number of trees to be used; rather, it uses a number of decision trees which depends on the number of informative features (using the IIG criterion) in relation to the class variable. These informative features are used as root nodes.

We study the performance of the following methods: the original IRN method, Bagging CTs (noted as BAGCT), and Boosting CTs (noted as BOOCT). To compare the results, we also use the following methods: the IRN method replacing its final procedure with a vote procedure (noted as IRNV), Bagging decision trees built with Quinlan's classic Information Gain (Info-Gain) split criterion [10] (noted as BAGIG), Boosting decision trees built with the Info-Gain split criterion (noted as BOOIG), and the Random Forest method.

To check the accuracy of the above-mentioned methods, we used two different approaches to deal with missing data and continuous variables. We show that for two of the classification methods used here, the type of approach can be strongly related to their accuracy. We compare the performance of the classification methods analyzed by conducting a set of tests, including a Friedman rank [11,12] on the methods used in each approach, and a set of known post hoc tests. The most important outcome of this experimental study is that the Boosting scheme, with the appropriate approach, is the best way to ensemble credal decision trees. It obtains excellent results compared with other ensemble procedures for this type of decision tree. Indeed, it should be noted that Boosting credal decision trees obtains a better Friedman rank score than a standard Random Forest method in the comparative study conducted among the seven methods presented in this paper. The results obtained with the set of tests carried out reinforce this statement.

The rest of the paper is organized as follows: in Section 2, we present previous knowledge on the method for building credal decision trees and the ensemble schemes utilized with credal decision trees in this paper. In Section 3, we check all the ensemble methods studied here, using two different approaches to deal with missing data and continuous variables, on a set of datasets widely used in classification. Section 4 is devoted to the conclusions.

2. Background

2.1. Method for building credal decision trees

A decision tree is a simple structure that can be used as a classifier. An important reference on its origin is the work Classification and Regression Trees by Breiman et al. [13]. Quinlan's well-known ID3 algorithm for building decision trees, which was presented later, should also be mentioned.

In a decision tree, each internal (non-leaf) node represents an attribute variable (or predictive attribute, or feature) and each branch represents one of the states of this variable. Each tree leaf specifies an expected value of the class variable (the variable to be predicted). Important aspects of the procedures for building decision trees are: the split criterion, i.e. the criterion used for branching; and the stop criterion, i.e. the criterion used to stop branching.

Decision trees are built using a set of data referred to as the training dataset. A different set, called the test dataset, is used to check the model. When we obtain a new sample or instance of the test dataset, we can make a decision or prediction about the state of the class variable by following the path in the tree from the root node to a leaf node, using the sample values and the tree structure.

The split criterion employed to build credal decision trees [4] is based on the application of uncertainty measures on convex sets of probability distributions. Specifically, probability intervals are extracted from the dataset for each case of the class variable using Walley's imprecise Dirichlet model (IDM) [14], which represents a specific kind of convex set of probability distributions (see [15]). The IDM depends on a given hyperparameter s which does not depend on the sample space [14]. The IDM estimates that the probabilities for each value of the class variable C are within the interval:

    p(c_j) ∈ [ n_{c_j} / (N + s), (n_{c_j} + s) / (N + s) ],   j = 1, ..., k,

with n_{c_j} as the frequency of the set of values (C = c_j) in the dataset, and N the sample size. The value of the parameter s determines the speed at which the upper and lower probability values converge when the sample size increases. Higher values of s give a more cautious inference. Walley [14] does not give a definitive recommendation for the value of this parameter, but he suggests two candidates: s = 1 or s = 2.

For this type of probability intervals, the maximum entropy is estimated. This is a total uncertainty measure which is well known for this type of sets (see [16,17]).

The procedure for building credal trees is very close to the one used in Quinlan's [10] well-known ID3 algorithm, replacing its Info-Gain split criterion with the Imprecise Info-Gain (IIG) split criterion. This criterion can be defined as follows: in a classification problem, let C be the class variable, {Z_1, ..., Z_k} be the set of features, and Z be a feature; then

    IIG^D(C, Z) = S(K^D(C)) − Σ_i P(Z = z_i) S(K^D(C | Z = z_i)),

where K^D(C) and K^D(C | Z = z_i) are the credal sets obtained using the IDM for the variables C and (C | Z = z_i) respectively, for a partition D of the dataset (see [4]); and S() is the maximum entropy function (see [15]).

The IIG criterion is different from the classical ones. It is based on the principle of maximum uncertainty (see [5]), broadly used in classic information theory, in which it is known as the maximum entropy principle [18]. The use of the maximum entropy function in the procedure for building decision trees is justified in Abellán and Moral [6]. It is important to note that for a feature Z and a partition D, IIG^D(C, Z) can be negative. This situation does not appear with classical split criteria such as the Info-Gain criterion used in ID3.³ This characteristic enables the IIG criterion to reveal features that worsen the information on the class variable.

Each node No in a decision tree produces a partition of the dataset (for the root node, D is considered to be the entire dataset). Furthermore, each node No has an associated list L of feature labels (that are not in the path from the root node to No). The procedure for building credal trees is explained in the algorithm in Fig. 1.

Considering this algorithm, when an Exit situation is attained, i.e. when there are no more features to introduce in a node, or when the uncertainty is not reduced (steps 1 and 4 of the algorithm, respectively), a leaf node is produced. In a leaf node, the most probable state or value of the class variable for the partition associated with that leaf node is inserted. To avoid obtaining unclassified instances, if we do not have one single most probable class value, we can select the one obtained in its parent node, and so on recursively (see [8]).

Credal decision trees can be somewhat smaller than the ones produced with similar procedures replacing the IIG criterion with the classic ones (see [7]). This normally produces a reduction of the overfitting of the model (see [6]), which could advise against its use in bagging or boosting schemes (see [19]).

³ The Info-Gain criterion is actually a particular case of the IIG criterion, using the parameter s = 0.
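The IDM interval and the IIG split criterion above can be made concrete with a small sketch. The maximum-entropy distribution inside the IDM credal set can be obtained by sharing the extra mass s among the least frequent classes so that the distribution becomes as uniform as the intervals allow (see [15]); the water-filling loop below implements that idea. This is an illustrative sketch, not the author's software; all function names are ours.

```python
import math
from collections import Counter

def idm_intervals(counts, s=1.0):
    """IDM probability intervals [n_j/(N+s), (n_j+s)/(N+s)] per class."""
    N = sum(counts)
    return [(n / (N + s), (n + s) / (N + s)) for n in counts]

def max_entropy_idm(counts, s=1.0):
    """Maximum-entropy distribution inside the IDM credal set.

    Water-filling: start from the lower bounds n_j/(N+s) and pour the
    remaining mass s/(N+s) onto the currently smallest probabilities."""
    N = sum(counts)
    p = [n / (N + s) for n in counts]
    mass = s / (N + s)
    order = sorted(range(len(p)), key=lambda i: p[i])
    i = 0
    while mass > 1e-12:
        group = i + 1                       # indices sharing the minimum level
        level = p[order[i]]
        nxt = p[order[i + 1]] if i + 1 < len(order) else float("inf")
        need = (nxt - level) * group        # mass needed to reach the next level
        if need >= mass:
            for j in order[:group]:
                p[j] += mass / group
            mass = 0.0
        else:
            for j in order[:group]:
                p[j] += nxt - level
            mass -= need
            i += 1
    return p

def S(p):
    """Shannon entropy (in bits) of a distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def iig(C, Z, s=1.0):
    """Imprecise Info-Gain IIG^D(C, Z) for class labels C and feature values Z."""
    classes = sorted(set(C))
    def counts(labels):
        c = Counter(labels)
        return [c.get(cl, 0) for cl in classes]
    root = S(max_entropy_idm(counts(C), s))
    cond = 0.0
    for z in set(Z):
        sub = [c for c, zz in zip(C, Z) if zz == z]
        cond += len(sub) / len(C) * S(max_entropy_idm(counts(sub), s))
    return root - cond
```

For example, with class counts (4, 1) and s = 1 the intervals are [4/6, 5/6] and [1/6, 2/6] and the maximum-entropy distribution is (2/3, 1/3); a feature that splits four instances into four singleton branches yields a negative IIG, exactly the situation the criterion is designed to expose.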
Fig. 1. Procedure to build a credal decision tree.

Credal trees and the IIG criterion have been successfully used in other procedures and tools in data mining, as a part of a procedure for selecting variables [20] and on datasets with classification noise [7]. An extended version of the IIG criterion has been used to define a semi-naive Bayes classifier [21].

2.2. Ensemble methods using credal decision trees

The general idea normally used in the procedures for combining decision trees is based on the generation of a set of different decision trees combined with a majority vote criterion. When a new unclassified instance arises, each single decision tree makes a prediction and the instance is assigned to the class value with the highest number of votes. In an ensemble scheme, if all decision trees are quite similar, ensemble performance will not be much better than that of a single decision tree. However, if the ensemble comprises a broad set of different decision trees exhibiting good individual performance, the ensemble will become more robust, with a better prediction capacity. There are many different approaches to this problem, but Bagging [2], Random Forests [3], and AdaBoost [1] stand out as the best known and most competitive ones.

The schemes used here with credal trees can be briefly described as follows:

(i) The Bootstrap Aggregating ensemble method, known as Breiman's Bagging [2], is an intuitive and simple method that performs very well. Diversity in Bagging is obtained by using bootstrap replicates of the original training dataset: different training datasets are randomly drawn with replacement. Subsequently, a single decision tree is built with each instance of the training dataset using the standard approach [13]. Thus, each tree can be defined by a different set of variables, nodes and leaves. Finally, their predictions are combined by a majority vote.

(ii) Boosting is an ensemble method based on the work of Kearns [22]. This procedure can be described in the following steps: (i) applying a weak classifier (such as a decision tree) to the learning data, where each observation is assigned an initially equal weight; (ii) applying weights to the observations in the learning sample that are inversely proportional to the accuracy of the computed predicted classifications; (iii) going to step (i) M times (M previously fixed); and (iv) combining predictions from individual models (weighted by the accuracy of the models). In the Boosting procedure, the weights of misclassified samples are increased, while the weights of correctly classified samples are decreased. Thus, the principal idea of this method is that it uses a sequence of successive classifiers where each one depends upon its predecessors, and each classifier considers the error of the previous classifier in order to decide what to focus on during the next iteration over the data. Boosting assumes the training dataset to be perfect, so this procedure does not perform as well as other strategies with noisy training datasets. A special type of Boosting is the AdaBoost algorithm [1], which has demonstrated excellent experimental performance on benchmark datasets and real applications. AdaBoost is an adaptive algorithm in the sense that classifiers built subsequently are tweaked in favor of instances misclassified by previous classifiers. It is a feature selector with a principled strategy: minimization of an upper bound on the empirical error.

(iii) The IIG criterion is used in two important ways in the recent IRN-Ensemble method of Abellán and Masegosa [8]. As a split criterion, a single credal tree needs to be built and used to obtain the set of features that will make up the root node of each tree used in the ensemble scheme. As stated in the preceding sections, this criterion can use certain features to provide negative values of information for the class variable. Therefore, all the features can be considered as potential root nodes or, on the contrary, none of them can be considered as root nodes. The method is not based upon a final voting procedure, and can be expressed as follows:

- Use the IIG rank to obtain the set of features {Z_1, ..., Z_m} where IIG(C | Z_j) > 0, ∀ j ∈ {1, ..., m}.
- Build m credal trees {T_1, ..., T_m}. Each T_j is built using Z_j as the root node; the rest of the nodes for this tree are selected following the procedure described in the previous section, using the IIG criterion.
- In order to classify a new observation, apply the new case to each of the m credal trees and consider the frequency (number of cases) of each state of the class variable in each of the leaf nodes. For a tree T_j, a new observation is associated with a leaf node, and this also has an associated partition of the dataset. Then n^j_{c_i} is the frequency (number of cases) with which c_i appears in the partition generated by the leaf node of T_j. Calculate the following relative frequencies (probabilities):

      q_i = (Σ_j n^j_{c_i}) / (Σ_j Σ_i n^j_{c_i}),

  for each c_i.
- For a new observation, the value obtained by the IRN-Ensemble classification method is c_h, where c_h = arg max_i {q_i}.

As can be inferred from the above explanation, the IRN method has an important characteristic with respect to the rest: it does not fix the number of trees (or iterations) to be used. It depends on the number of informative features with respect to the class variable (see [8]). This implies that it is possible to use a number of trees equal to the number of features in the dataset, or even zero.⁴ As a very low number of trees can be used, the authors of the IRN method do not consider a final voting procedure appropriate to classify a new observation.

3. Experimentation

Our aim is to study the performance of the following methods: the original IRN method (noted as IRN); the IRN method replacing its final

⁴ In that situation the classification procedure always returns the case of the class variable with the greatest frequency in the dataset.
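The decision rule of the IRN-Ensemble, item (iii) above, pools leaf-node class frequencies across the m credal trees rather than taking a per-tree vote. A minimal sketch of just that rule, with hypothetical leaf counts (extracting the leaf frequencies from actual trees is assumed to happen elsewhere):

```python
from collections import defaultdict

def irn_classify(leaf_counts):
    """IRN-Ensemble decision rule.

    `leaf_counts` holds, for each credal tree T_j, the class frequencies
    n^j_{c_i} of the leaf node reached by the new observation, e.g.
    [{'a': 3, 'b': 1}, {'a': 1, 'b': 2}].
    Returns (predicted class c_h, relative frequencies q_i)."""
    totals = defaultdict(int)
    for leaf in leaf_counts:
        for ci, n in leaf.items():
            totals[ci] += n          # sum over trees: Σ_j n^j_{c_i}
    grand = sum(totals.values())     # Σ_j Σ_i n^j_{c_i}
    q = {ci: n / grand for ci, n in totals.items()}
    # c_h = arg max_i q_i (frequencies are pooled, not voted)
    return max(q, key=q.get), q
```

With the two illustrative leaves above, class 'a' wins with q = 4/7 even though the trees would split 1–1 under a majority vote, which is precisely why pooling frequencies behaves differently from voting.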
Table 2
Percentage of correct classification of the methods with the (A) approach.

a less sensitive test than others, as it is described by García and Herrera. With Nemenyi's test we can encounter situations where the differences expressed by the Friedman test were not detected. Hence, as the latter authors recommended, we considered conducting a post hoc Holm's test.

The study of the p-values is very interesting because, citing a paragraph by García and Herrera [27]: "A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about how significant the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most important, it does this without committing to a particular level of significance."

When a p-value comes from a multiple comparison, it does not take into account the other comparisons in the multiple study. To solve this problem, we can use a study of the Adjusted P-values (APVs) [31]. Using the APVs in a statistical analysis gives us more information because it is not focused on a fixed level of significance. The work by García and Herrera [27] provides an explanation of how to obtain these values. In some situations, such as this one, certain methods are more similar in performance. Therefore, to make a clear distinction between them, it is very interesting to study the APVs of the tests conducted in the experiments.

Table 4 gives the results of the Wilcoxon's test carried out on each method separately in order to compare its performance using approaches (A) and (B), respectively. The table shows the approach in which the method is better according to Wilcoxon's test.¹⁰

Tables 5 and 6 show Friedman's ranks of the methods with approaches (A) and (B), respectively (the null hypothesis is rejected in both cases).

Table 7 shows the p-values of the tests carried out on the methods when approach (A) is used. The column p-value_H corresponds to the p-values of the Holm's tests on the methods in each row. The column APV_H corresponds to the adjusted p-values of each comparison obtained from the Holm's tests. Table 8 shows the same results of the tests on the methods when approach (B) is used.¹¹

3.1. Comments on the results

From Tables 2 and 3 we can obtain the differences between the results of the methods using approaches (A) and (B). Some differences can be seen in favor of the results obtained from approach (B) compared to the results from approach (A). This is particularly evident for some small datasets such as glass2, hepatitis and sonar, and for medium ones such as vowel. These differences are more noticeable in the methods BOOCT and RF.

We conducted Wilcoxon's tests on each separate model with each approach, i.e. each method using approach (A) against the same method using approach (B). The results are given in Table 4. In these tests, we obtained that the IRN method with the (A) approach exhibits no significant differences from the same method with the (B) approach. The same situation also arises for the IRNV, BAGCT, BAGIG and BOOIG methods. However, it is important to remark that the tests performed for the BOOCT and RF methods express that there are significant positive differences in accuracy when the (B) approach is used.

The Friedman test carried out for each set of methods with each approach rejects the null hypothesis. Therefore, Holm's tests were conducted. The Friedman's ranks of the methods using approach (A), expressed in Table 5, show that the best methods are IRN and RF, in this order; whereas the worst methods are those in which the Info-Gain criterion was used for the decision trees: BOOIG and BAGIG. The situation is different for the best methods when approach (B) is used (Table 6); the best methods here are BOOCT and RF, in this order, but the situation is similar for the worst methods: again, the BOOIG and BAGIG methods have the worst results.

Focusing on the results of the Holm's tests, Tables 7 and 8 express the p-values of each comparison between two methods with each approach.

¹⁰ It is the same table of results obtained for a weaker level of significance of 0.1 and for a stronger one of 0.01.
¹¹ In both tables, for a weaker level of significance α = 0.1, Holm's procedure rejects those hypotheses that have a p-value_H ≤ 0.003846. With the Nemenyi's test the results are very similar to the ones obtained with the Holm's test.
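Holm's step-down procedure and the adjusted p-values discussed above can be sketched as follows: sort the raw p-values ascending, compare the i-th smallest against α/(m − i + 1), and stop rejecting at the first failure; the APV of a comparison is the running maximum of (m − i + 1) · p_(i). This is an illustrative sketch with made-up p-values, not the values or code of the paper's experiments.

```python
def holm(pvalues, alpha=0.05):
    """Holm's step-down procedure on a list of raw p-values.

    Returns (rejection flags in the original order, adjusted p-values).
    APV_i = max_{j <= i} (m - j + 1) * p_(j) over the ascending ordering;
    APVs are often truncated at 1, but the tables in this paper report
    the untruncated values, so no truncation is applied here."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    apv = [0.0] * m
    running_max = 0.0
    still_rejecting = True
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        apv[i] = running_max
        if still_rejecting and pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            still_rejecting = False
    return reject, apv
```

For instance, holm([0.01, 0.04, 0.03]) rejects only the first hypothesis at α = 0.05 (0.01 ≤ 0.05/3, but 0.03 > 0.05/2 stops the procedure), and yields APVs of 0.03, 0.06 and 0.06.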
Table 3
Percentage of correct classification of the methods with the (B) approach.

Table 4
Results of the Wilcoxon's test conducted with a level of significance of 0.05, performed on each method separately using approaches (A) and (B). The table shows the approach in which the method is better using the results of the Wilcoxon's test; "=" indicates non-significant statistical differences.

Table 6
Friedman's ranks of the algorithms with the approach (B).

Method  Rank
IRN     3.02
RF      3.04
BAGCT   3.42
IRNV    3.5
BOOCT   4.42
BAGIG   5.16
BOOIG   5.44

Table 7
P-values table for approach (A). For α = 0.05, Holm's procedure rejects those hypotheses that have a p-value_H ≤ 0.003333. The APV_H column corresponds to the adjusted p-values obtained from the Holm's tests.

i   Methods            p-value_H   APV_H
1   IRN vs. BOOIG      0.002381    0.00157
2   BOOIG vs. RF       0.0025      0.001714
3   IRN vs. BAGIG      0.002632    0.008761
4   BAGIG vs. RF       0.002778    0.00938
5   BAGCT vs. BOOIG    0.002941    0.016088
6   IRNV vs. BOOIG     0.003125    0.023968
7   BAGCT vs. BAGIG    0.003333    0.066046
8   IRNV vs. BAGIG     0.003571    0.092279
9   IRN vs. BOOCT      0.003846    0.285308
10  BOOCT vs. RF       0.004167    0.286933
11  BOOCT vs. BOOIG    0.004545    1.045492
12  BAGCT vs. BOOCT    0.005       1.045492
13  IRNV vs. BOOCT     0.005556    1.18929
14  BAGIG vs. BOOCT    0.00625     1.806828
15  IRN vs. IRNV       0.007143    3.024777
16  IRNV vs. RF        0.008333    3.024777
17  IRN vs. BAGCT      0.01        3.024777
18  BAGCT vs. RF       0.0125      3.024777
19  BAGIG vs. BOOIG    0.016667    3.024777
20  IRNV vs. BAGCT     0.025       3.024777
21  IRN vs. RF         0.05        3.024777

Table 7 shows that with approach (A) the best methods are IRN and the RF. They obtain very similar results, i.e. a similar number of wins in the tests carried out.¹² The IRNV and BAGCT methods are also winners when they are compared with the worst method: BOOIG. Table 8 shows that with the approach (B) the best methods are the BOOCT and RF methods.¹³ The IRN and IRNV methods are also winners when they are compared with the worst method: BOOIG.

It is not easy to compare the best methods for each approach with the results obtained in this work, but, as García and Herrera recommend for this type of situation, we can use the study of the APVs to analyse the differences between the best methods in this study in greater depth.

¹² IRN has one more win in the Holm's test than the RF if a weaker level of significance of 0.1 is used.
¹³ We must remark that BOOCT has one more win in the Holm's test than the RF if a weaker level of significance of 0.1 is used.
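The Friedman ranks reported in Tables 5–7 are average ranks over datasets: on each dataset the methods are ranked by accuracy (rank 1 is best, ties share the mean rank), and the per-dataset ranks are then averaged per method. A minimal sketch of that computation, on hypothetical accuracy values rather than the paper's data:

```python
def friedman_ranks(accuracies):
    """Average Friedman ranks from a {method: [accuracy per dataset]} table.

    Rank 1 is the best method on a dataset; ties share the mean rank."""
    methods = list(accuracies)
    n_datasets = len(next(iter(accuracies.values())))
    sums = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        # sort methods by accuracy on dataset d, best first
        scores = sorted(((accuracies[m][d], m) for m in methods), reverse=True)
        pos = 0
        while pos < len(scores):
            end = pos
            while end + 1 < len(scores) and scores[end + 1][0] == scores[pos][0]:
                end += 1                       # extend over a block of ties
            mean_rank = (pos + 1 + end + 1) / 2  # ranks are 1-based
            for k in range(pos, end + 1):
                sums[scores[k][1]] += mean_rank
            pos = end + 1
    return {m: sums[m] / n_datasets for m in methods}
```

For example, with two datasets where method A scores (0.9, 0.8), B scores (0.8, 0.8) and C scores (0.7, 0.9), A and B tie on the second dataset and share rank 2.5, giving average ranks of 1.75, 2.25 and 2.0 respectively; a lower average rank means a better method, as in Table 6.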
Table 8
P-values table for approach (B). For α = 0.05, Holm's procedure rejects those hypotheses that have a p-value_H ≤ 0.003333. The APV_H column corresponds to the adjusted p-values obtained from the Holm's tests.

i   Methods            p-value_H   APV_H
1   BOOCT vs. BOOIG    0.002381    0.000002
2   BOOIG vs. RF       0.0025      0.000005
3   IRN vs. BOOIG      0.002632    0.001238
4   BAGIG vs. BOOCT    0.002778    0.005715
5   BAGIG vs. RF       0.002941    0.010002
6   BAGCT vs. BOOIG    0.003125    0.015142
7   IRNV vs. BOOIG     0.003333    0.05366
8   IRNV vs. BOOCT     0.003571    0.215965
9   IRNV vs. RF        0.003846    0.310844
10  IRN vs. BAGIG      0.004167    0.310844
11  BAGCT vs. BOOCT    0.004545    0.466564
12  BAGCT vs. RF       0.005       0.620745
13  BAGIG vs. BOOIG    0.005556    0.744935
14  BAGCT vs. BAGIG    0.00625     0.929148
15  IRN vs. BOOCT      0.007143    1.257081
16  IRN vs. RF         0.008333    1.431879
17  IRNV vs. BAGIG     0.01        1.431879
18  IRN vs. IRNV       0.0125      1.431879
19  IRN vs. BAGCT      0.016667    1.431879
20  IRNV vs. BAGCT     0.025       1.431879
21  BOOCT vs. RF       0.05        1.431879

Table 7 shows that for approach (A) there exist very similar results in the strength of the rejection of the null hypothesis (values of the APVs) for the comparisons IRN vs. M and RF vs. M, with M being any other method. However, in Table 8, concerning the tests on the methods with approach (B), we can observe that the situation is more clearly in favor of the BOOCT method when it is compared with the RF method. For example, the APV of the comparison RF vs. BOOIG is 2.5 times greater than that of the comparison BOOCT vs. BOOIG; and the APV of the comparison RF vs. BAGIG is 1.75 times that of the comparison BOOCT vs. BAGIG.

We can use a similar analysis to check the differences between the worst methods for each approach. We have the same situation with both approaches: comparisons BOOIG vs. M always give an APV that is notably lower¹⁴ than the one for BAGIG vs. M, with M being any other method. Thus, we can say that clearly the worst method of our study with the two approaches is BOOIG.

Using the analysis reported in this paper, we can summarize the following: the methods using approach (A) are worse than or equal to the equivalent ones using approach (B); the RF method obtains the best average value with both approaches but it cannot be considered the best one for either approach; with approach (B), the RF and BOOCT methods can be considered the best methods, with BOOCT being better than the RF in the comparisons with the other methods.

4. Conclusions

In this paper, we analyzed different strategies for combining credal decision trees, i.e. decision trees built using imprecise probabilities (using the IDM) and uncertainty measures (using the maximum entropy method). We used previously created Bagging and Boosting schemes and a new one specifically developed for credal decision trees. We experimentally showed that different approaches to dealing with missing data and continuous variables can significantly vary the results of some methods for creating ensembles of credal decision trees. We can conclude that the Boosting scheme with the appropriate approach is the best way to combine credal decision trees. Using the best approach for each method (the (B) approach), we used a set of tests carried out on the results obtained from the seven methods analyzed here to show that Boosting credal trees and the Random Forest methods are the best ones, with Boosting credal trees being better than the Random Forest method in comparison with any other method.

¹⁴ In this case, lower signifies stronger evidence in favor of the M method.

Acknowledgments

This work has been supported by the Spanish Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía under Project TIC-06016.

I am very grateful to the anonymous reviewers of this paper for their valuable suggestions and comments to improve its quality.

References

[1] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–156.
[2] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.
[3] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[4] J. Abellán, S. Moral, Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems 18 (12) (2003) 1215–1225.
[5] G.J. Klir, Uncertainty and Information: Foundations of Generalized Information Theory, John Wiley, Hoboken, NJ, 2006.
[6] J. Abellán, S. Moral, Upper entropy of credal sets, applications to credal classification, International Journal of Approximate Reasoning 39 (2–3) (2005) 235–255.
[7] J. Abellán, A. Masegosa, An experimental study about simple decision trees for bagging ensemble on data sets with classification noise, in: C. Sossai, G. Chemello (Eds.), ECSQARU, LNCS, vol. 5590, Springer, 2009, pp. 446–456.
[8] J. Abellán, A. Masegosa, An ensemble method using credal decision trees, European Journal of Operational Research 205 (1) (2010) 218–226.
[9] M. Zaffalon, The naive credal classifier, Journal of Statistical Planning and Inference 105 (2002) 5–21.
[10] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[11] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937) 675–701.
[12] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Annals of Mathematical Statistics 11 (1940) 86–92.
[13] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Statistics/Probability Series, Belmont, 1984.
[14] P. Walley, Inferences from multinomial data: learning about a bag of marbles, Journal of the Royal Statistical Society B 58 (1996) 3–57.
[15] J. Abellán, Uncertainty measures on probability intervals from the Imprecise Dirichlet model, International Journal of General Systems 35 (5) (2006) 509–528.
[16] J. Abellán, G.J. Klir, S. Moral, Disaggregated total uncertainty measure for credal sets, International Journal of General Systems 35 (1) (2006) 29–44.
[17] J. Abellán, A. Masegosa, Requirements for total uncertainty measures in Dempster–Shafer theory of evidence, International Journal of General Systems 37 (6) (2008) 733–747.
[18] E.T. Jaynes, On the rationale of maximum-entropy methods, Proceedings of the IEEE 70 (9) (1982) 939–952.
[19] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine Learning 52 (3) (2003) 199–215.
[20] J. Abellán, A. Masegosa, A filter-wrapper method to select variables for the naive Bayes classifier based on credal decision trees, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 17 (6) (2009) 833–854.
[21] J. Abellán, Application of uncertainty measures on credal sets on the naive Bayes classifier, International Journal of General Systems 35 (6) (2006) 675–686.
[22] M. Kearns, Thoughts on hypothesis boosting, unpublished manuscript, 1988.
[23] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
[24] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, 1993, pp. 1022–1027.
[25] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, B. De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18 (5–6) (2005) 684–692.
[26] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[27] S. García, F. Herrera, An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
[28] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1 (1945) 80–83.
[29] S. Holm, A simple sequentially rejective Bonferroni test procedure, Scandinavian Journal of Statistics 6 (1979) 65–70.
[30] P.B. Nemenyi, Distribution-free multiple comparison, PhD thesis, Princeton University, 1963.
[31] P.H. Westfall, S.S. Young, Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, John Wiley and Sons, 2004.