
In Proceedings of The 5th Australian Joint Conference on Artificial Intelligence, World Scientific, 355-360, 1992.

CONSTRUCTING CONJUNCTIVE TESTS FOR DECISION TREES


ZIJIAN ZHENG
Basser Department of Computer Science, University of Sydney, NSW 2006
Email: zijian@cs.su.oz.au

ABSTRACT: This paper discusses an approach to constructing new attributes based on decision trees and production rules. It can improve concepts learned in the form of decision trees by simplifying them and improving their predictive accuracy. In addition, the approach can distinguish relevant primitive attributes from irrelevant primitive attributes.

1. Introduction
If the training examples are presented in a suitable form, learning classifiers from them can be relatively easy. Selective induction algorithms such as ID3 [5] can learn good concepts in this situation. When the attributes used to describe the training examples are inappropriate for the concept to be learned, however, learning with selective induction methods alone can be difficult. To overcome this problem, a learning system needs to be able to create new attributes that are more appropriate than the primitive ones for the concept to be learned. Our work on constructive induction is based on decision trees and on production rules generated from those decision trees [6]. We use ∧ (logical and) as the operator for attribute construction, with the conditions of production rules as candidate operands. Generating production rules from decision trees can find groups of primitive attributes whose conjunctions, used as new attributes, are more strongly related to the class of the examples, and it reduces the search space for attribute construction. If the decision tree built using only new attributes has classification accuracy higher than or comparable to that of the tree built using primitive attributes, we can assume that any primitive attribute which does not appear in a new attribute is irrelevant. Removing irrelevant primitive attributes usually improves the decision tree.

We first briefly describe decision trees and production rules in section 2; the details can be found in [5] and [6]. Then, in section 3, we present our constructive induction learning algorithms and discuss constructing new attributes and distinguishing relevant primitive attributes from irrelevant ones. Section 4 summarizes experimental results of attribute construction on several real domains. Section 5 discusses this approach and its relationship to other attribute construction approaches. Finally, section 6 concludes with areas for further research.

2. Decision Trees and Production Rules


We use the selective inductive learning algorithm C4.5 (an extension of ID3) [5] as the base of our constructive induction learning algorithms, although any algorithm that constructs decision trees would serve as well. C4.5 accepts a set of examples, the training set, described by a set of attributes and classes. It produces a learned concept in the form of a decision tree, each leaf of which denotes a class. A decision node denotes a test on an attribute, with a subsidiary decision tree for each possible outcome of the test. The algorithm C4.5rules [6] is used to generate production rules from decision trees. From a decision tree, a set of production rules is derived using the same training set that was used to build the tree. These rules correspond to a subset of the paths in the tree, each generalized by removing conditions that are not clearly relevant.
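As a rough illustration of this last step (a sketch only, not the actual C4.5rules procedure), the following Python fragment reads a tree path as a production rule and then greedily drops conditions whose removal does not lower the rule's accuracy on the training set. The data representation, helper names and the simple greedy criterion are assumptions made for the sketch.

# A sketch only: a path of a decision tree is read as a production rule
# "if cond1 and cond2 and ... then class", and conditions whose removal does
# not hurt the rule's accuracy on the training set are dropped.

def satisfies(example, conditions):
    # True if the example meets every (attribute, value) condition.
    return all(example[a] == v for a, v in conditions)

def rule_accuracy(conditions, target_class, examples):
    # Fraction of covered training examples that actually have the rule's class.
    covered = [e for e in examples if satisfies(e, conditions)]
    if not covered:
        return 0.0
    return sum(e["class"] == target_class for e in covered) / len(covered)

def generalize_rule(conditions, target_class, examples):
    # Greedily drop conditions that do not lower the rule's training accuracy.
    conditions = list(conditions)
    dropped = True
    while dropped and len(conditions) > 1:
        dropped = False
        base = rule_accuracy(conditions, target_class, examples)
        for c in list(conditions):
            reduced = [x for x in conditions if x != c]
            if rule_accuracy(reduced, target_class, examples) >= base:
                conditions = reduced
                dropped = True
                break
    return conditions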

3. Constructing New Attributes from Production Rules



Table 1. The CI1 algorithm (TS_i is the training set used for building decision tree T_i).
1. Assign the training set supplied by the task to TS_1.
2. Repeat until some stopping criterion is met.
   2.1 Build a decision tree T_i based on TS_i.
   2.2 Generate production rules based on the decision tree T_i.
   2.3 Construct new attributes based on the production rules.
   2.4 Select the best constructed attributes using an information theoretic measure.
   2.5 Rewrite the training set from TS_i to TS_(i+1) using both the primitive attributes and the selected new attributes.

Table 2. The CI2 algorithm (PA_i is the set, or a subset, of primitive attributes; TS1_i, TS2_i and TS3_i are the training sets used for building decision trees).
1. Assign the primitive attributes supplied by the task to PA_1 and assign the training set supplied by the task to TS1_1.
2. Repeat until some stopping criterion is met.
   2.1 Build a decision tree T1_i based on TS1_i.
   2.2 Generate production rules based on the decision tree T1_i.
   2.3 Construct new attributes based on the production rules.
   2.4 Rewrite the training set from TS1_i to TS2_i using both PA_i and the new attributes, then build a decision tree T2_i and select the best newly constructed attributes.
   2.5 Rewrite the training set from TS1_i to TS3_i using only the new attributes, then build a decision tree T3_i and select the best newly constructed attributes.
   2.6 Select a subset of PA_i based on the new attributes selected in step 2.5 and assign it to PA_(i+1).
   2.7 Rewrite the training set from TS1_i to TS1_(i+1) using PA_(i+1).

Note: the subscript i indicates the ith execution cycle of the algorithm.
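The control flow of Table 1 can be pictured as the following Python skeleton. The helper callables stand for C4.5, C4.5rules and the construction, selection and rewriting steps described in the text; their names and signatures are assumptions of this sketch, not part of the original system.

# A skeleton of the CI1 loop of Table 1. All helpers are placeholders passed in
# by the caller; the simplified stopping test ("no new attributes") is only part
# of the criterion discussed in the text.

def ci1(training_set, build_tree, derive_rules,
        construct_attributes, select_attributes, rewrite_examples,
        max_cycles=10):
    ts = training_set                                   # TS_1
    tree = build_tree(ts)                               # step 2.1
    for cycle in range(max_cycles):                     # step 2: repeat
        rules = derive_rules(tree, ts)                  # step 2.2
        candidates = construct_attributes(rules)        # step 2.3
        new_attrs = select_attributes(candidates, ts)   # step 2.4
        if not new_attrs:                               # nothing new: stop
            break
        ts = rewrite_examples(ts, new_attrs)            # step 2.5: TS_i -> TS_(i+1)
        tree = build_tree(ts)                           # step 2.1 of the next cycle
    return tree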

Given a training set described by a set of primitive attributes and classes, our constructive induction learning algorithms learn a concept by iterating three basic steps: (1) learn a decision tree using a selective induction learning algorithm; (2) generate production rules from the decision tree; (3) construct and select new attributes based on the production rules and the trees, then rewrite the training set. The algorithms are detailed in Table 1 and Table 2, and differ in whether new attributes are used in subsequent attribute construction and whether irrelevant attributes are identified and deleted.

After creating new attributes, the algorithm CI1 adds them to the primitive attributes and any previously created attributes, then uses all of these attributes in the following execution cycle. Therefore, in CI1, all primitive attributes and all previously created attributes can be used in generating production rules and thus in constructing new attributes. The algorithm CI2, in contrast, identifies and deletes irrelevant primitive attributes after creating new attributes, then uses only the relevant primitive attributes in the following execution cycle. So only the primitive attributes most recently identified as relevant are used in generating production rules and thus in constructing new attributes.

The tree built using only new attributes is sometimes better than the tree built using new and primitive attributes, and vice versa. In some situations, the tree built using only the primitive attributes identified as relevant is better than both. Therefore, in the ith execution cycle, algorithm CI2 builds three decision trees (T1_i, T2_i, T3_i) in steps 2.1, 2.4 and 2.5, using respectively the primitive attributes, the primitive and new attributes, and only the new attributes.

The stopping criterion currently used in CI1 and CI2 is "no new attributes can be created" or "the new attributes cause the decision tree to have lower predictive accuracy than the decision tree built using only the primitive attributes". Because the predictive accuracy of the decision tree is required, the training set is split into two parts: one part is used as a training set for building the decision tree, generating production rules, and constructing new attributes; the other is used as a test set for estimating the accuracy of the tree.
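Step 2.6 of CI2, identifying the relevant primitive attributes, amounts to collecting the primitive attributes that occur in the selected conjunctions. A minimal sketch, assuming each new attribute is represented as a tuple of (attribute, value) conditions:

# A sketch of CI2's step 2.6: a primitive attribute is kept as relevant only if
# it appears in at least one selected new attribute; all others are dropped.

def relevant_primitives(selected_new_attributes):
    # Collect the primitive attributes mentioned by the selected conjunctions.
    relevant = set()
    for conjunction in selected_new_attributes:
        for attribute, _value in conjunction:
            relevant.add(attribute)
    return relevant

def prune_primitives(primitive_attributes, selected_new_attributes):
    # Return PA_(i+1): the primitives that occur in some selected new attribute.
    keep = relevant_primitives(selected_new_attributes)
    return [a for a in primitive_attributes if a in keep]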
Figure 1. Accuracy and size improvement on the Tic-Tac-Toe problem (accuracy and number of nodes of the trees T1_i, T2_i and T3_i built in each execution cycle).

Figure 2. Accuracy and size improvement on the Heart-disease problem (accuracy and number of nodes of the trees T1_i, T2_i and T3_i built in each execution cycle).

If the data set is small, 10-fold cross-validation [1] is performed instead. The output of the algorithms CI1 and CI2 is the decision tree with the greatest predictive accuracy.

We now turn to constructing new attributes from production rules and selecting the best of the constructed attributes using a decision tree. The following four methods are used to construct new attributes.
1. For every production rule that has more than one condition, use the conjunction of two of its conditions as a new attribute; the two conditions nearest the leaf of the decision tree are taken [3].
2. For every production rule that has more than one condition, use the conjunction of two of its conditions as a new attribute; the two conditions nearest the root of the decision tree are taken.
3. For every production rule that has more than two conditions, use the conjunction of three of its conditions as a new attribute; the three conditions nearest the root of the decision tree are taken.
4. For every production rule that has more than two conditions, use the conjunction of the three conditions nearest the root as a new attribute, as in method 3; in addition, for every production rule that has only two conditions, use their conjunction as a new attribute.

The number of new attributes constructed can be rather large, so it may be necessary to select the best of them and use only the selected new attributes in the following process. To evaluate new attributes we use the same criterion used to choose the attribute to split on during tree formation, i.e., an information theoretic measure. All new attributes not used in building the tree are eliminated.
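Assuming each production rule is represented as a list of (attribute, value) conditions ordered from the condition nearest the root to the condition nearest the leaf, the four methods can be sketched as follows; the representation and the function name are assumptions of this sketch.

# A sketch of the four construction methods. Each new attribute is returned as a
# tuple of conditions whose conjunction defines the attribute.

def construct_new_attributes(rules, method):
    new_attributes = set()
    for conditions in rules:           # conditions ordered root -> leaf
        n = len(conditions)
        if method == 1 and n > 1:
            new_attributes.add(tuple(conditions[-2:]))   # two conditions nearest the leaf
        elif method == 2 and n > 1:
            new_attributes.add(tuple(conditions[:2]))    # two conditions nearest the root
        elif method == 3 and n > 2:
            new_attributes.add(tuple(conditions[:3]))    # three conditions nearest the root
        elif method == 4:
            if n > 2:
                new_attributes.add(tuple(conditions[:3]))
            elif n == 2:
                new_attributes.add(tuple(conditions))    # rules with exactly two conditions
    return new_attributes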

4. Experiments
This section gives results of experiments on three domains: Tic-Tac-Toe, Cleveland Heart Disease, and LED from the UCI Repository of Machine Learning Databases. In Figures 1 to 4, the indices of the decision trees built during the execution of algorithm CI2 are used as the X-axis to show how the accuracy and size of the trees improve as new attributes are constructed and irrelevant primitive attributes are identified. The dashed line indicates the end of a cycle of the algorithm. The accuracy, the size (the number of nodes), and the time are average values obtained from 10-fold cross-validation.

The first experiment is a 10-fold cross-validation on the Tic-Tac-Toe problem. The data set contains 958 instances described by 9 primitive attributes.

Figure 3. Accuracy and size improvement on the LED problem (accuracy and number of nodes of the trees T1_i, T2_i and T3_i built in each execution cycle).

Figure 4. Time needed for constructing new attributes on the Heart-disease problem (time in seconds for each tree built over the execution cycles).

The algorithm CI2 with the third attribute construction method executes just one cycle and creates 8 new attributes (after selection). Each new attribute is the conjunction of three conditions from a production rule, for example (top-left-square = o) ∧ (top-middle-square = o) ∧ (top-right-square = o). This automatically constructed attribute means that all three squares in the top row of the tic-tac-toe board contain "o", representing a win for "o". Figure 1 shows the significant improvement in accuracy and size of the decision trees.

The second experiment is a 10-fold cross-validation on the Cleveland Heart-disease problem. The data set contains 303 instances, with 6 missing values, described by 13 primitive attributes. The algorithm CI2 with the fourth attribute construction method was used. Through 3 cycles of execution, an average of 5.9 new attributes is selected. The new attributes improve the decision trees in both accuracy (the error rate is reduced from 26.4% to 19.5%) and complexity (the size is reduced from 54.0 to 13.0 nodes). In addition, the algorithm identifies an average of 7.6 of the 13 primitive attributes as relevant through constructing new attributes. All other primitive attributes are assumed to be irrelevant, or unimportant, and are eliminated. From Figure 2, we can see that the decision trees built using only new attributes are more accurate and simpler than the trees built using primitive and new attributes, and the latter are more accurate and simpler than the trees built using only primitive attributes. Furthermore, the trees built after deleting the primitive attributes identified as irrelevant are better again.

The third experiment is on the LED problem (Figure 3). It involves 24 primitive attributes, 17 of them irrelevant. We use 500 instances as the training set and 5000 instances as the test set, both containing 10% noise. The algorithm CI2 with the third attribute construction method executes three cycles. 10 new attributes are created (after selection). All 17 irrelevant primitive attributes are identified and eliminated, and the size and accuracy of the decision trees are improved significantly. However, unlike the results of the two experiments above, the most accurate tree is built using primitive and new attributes. The smallest tree is still the one built using only new attributes.
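To make the rewriting step concrete, the fragment below evaluates the constructed Tic-Tac-Toe attribute mentioned above on one example and appends its value to the example's description; the dictionary encoding of the board is an assumption made for this sketch.

# Evaluating the constructed attribute
# (top-left-square = o) ∧ (top-middle-square = o) ∧ (top-right-square = o)
# on one example and adding it to the example's description.

top_row_all_o = (
    ("top-left-square", "o"),
    ("top-middle-square", "o"),
    ("top-right-square", "o"),
)

def conjunction_holds(example, conjunction):
    # A conjunctive attribute is true iff every condition holds for the example.
    return all(example[attribute] == value for attribute, value in conjunction)

example = {
    "top-left-square": "o",
    "top-middle-square": "o",
    "top-right-square": "o",
    "middle-middle-square": "x",      # remaining squares omitted for brevity
}
example["new-attr-1"] = conjunction_holds(example, top_row_all_o)   # True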

5. Discussion
In section 3, we presented four methods for constructing new attributes from production rules. Through experiments, we find that method 2 is better than method 1. This might be because selective induction learning algorithms tend to select as decision nodes first those attributes that provide more information for splitting the training examples, especially when the training set is noisy. Therefore, the attributes in the nodes near the root of the tree are more relevant than those in the nodes near the leaves, so method 2 can construct better new attributes than method 1. When some attributes are irrelevant, they tend to appear (if at all) in the nodes near the leaves of the decision tree.

Using pairwise conjunctions of conditions of production rules as new attributes produces simple new attributes, but it also limits the information contained in a new attribute. The experiments tell us that method 3 performs better than method 2 on some real-world problems. Because C4.5rules generates many production rules containing only two conditions in some domains, method 3 cannot construct enough new attributes there; method 4 was adopted to solve this problem, and it performs better than method 3 on, for example, the Cleveland Heart-disease problem. In addition to these four construction methods, we have tried using the conjunction of all conditions of a production rule as a new attribute. Algorithm CI1 with this method works well on some artificial problems, such as some DNF and CNF expressions: it can very quickly construct the disjuncts of a DNF target concept (or the negations of the conjuncts of a CNF target concept) as new attributes, and thus can produce a decision tree with higher accuracy and lower complexity. However, it does not perform well on some real-world problems, especially when real-valued primitive attributes are involved.

Pagallo and Haussler's FRINGE [3,4] constructs new attributes by conjoining the parent and grandparent nodes of all positive leaves in the decision tree; it is appropriate for learning DNF expressions. To deal with CNF problems as well, Pagallo implemented SymFringe [4], which constructs new attributes from the conjunctions of the parent and grandparent nodes of all positive and negative leaves in the decision tree. Yang et al. [7] present an algorithm, DCFringe, that also conjoins the parent and grandparent nodes of all positive leaves but considers the pattern of the tree and uses both ∧ and ∨ as constructive operators. All of these algorithms directly use the decision nodes near the leaves of the tree as constructive operands and use pairwise conjunction (DCFringe uses disjunction as well) as the constructive operator. The main differences between our algorithms and the FRINGE-like algorithms are that we use the conditions of production rules generated from the decision tree as constructive operands and allow two or three terms in a conjunction, and that in a production rule the conditions nearest the root of the decision tree are used to construct new attributes (except in method 1).

FRINGE has two approaches to new attribute selection: keeping attributes that have been used at least once, or keeping attributes that have been used in the last iteration [4]. CITRE [2] uses "utility" to select attributes. The utility of each new attribute is measured in terms of the information gained by using the attribute to split the entire training set into disjoint subsets. The attributes are then ordered by utility, and those with the lowest utilities are deleted until the total count of primitive and new attributes is down to a predetermined number. As pointed out by Matheus and Rendell [2], this greedy method can fail if a new attribute with poor utility on the entire data set exhibits a relatively high utility at some point after the first split. For this reason, we select new attributes by building a decision tree: all new attributes not used in building the decision tree are deleted. In our experiments, usually fewer than 20 new attributes are constructed, and the largest number constructed was 70.
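The information-theoretic measure referred to above can be illustrated, for a boolean constructed attribute, as the information gained by splitting a set of examples on the attribute's truth value. A minimal sketch, assuming examples are dictionaries with a "class" key:

# A sketch of an information-gain style measure for a conjunctive new attribute:
# the entropy reduction obtained by splitting the examples on its truth value.

from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, conjunction):
    def holds(e):
        return all(e[a] == v for a, v in conjunction)
    true_part = [e for e in examples if holds(e)]
    false_part = [e for e in examples if not holds(e)]
    total = len(examples)
    remainder = sum(len(part) / total * entropy(part)
                    for part in (true_part, false_part) if part)
    return entropy(examples) - remainder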
The time needed for selecting new attributes by this method is only the time needed for building a decision tree. Figure 4 shows the time needed on a MIPS RS3330 for building all decision trees with algorithm CI2 on the Cleveland Heart-disease problem in experiment 2 above. The time needed for constructing new attributes (mainly for generating production rules) dominates the total execution time. Because some irrelevant primitive attributes are identified and deleted after every execution cycle of the algorithm, the tree becomes smaller and smaller, with the result that the execution time of each cycle becomes shorter and shorter. This is one of the characteristics of algorithm CI2: irrelevant (or unimportant) primitive attributes can be identified and eliminated through constructing new attributes.

Unlike the attribute construction approaches discussed above, which use the result of selective induction to guide their construction, another kind of approach constructs new attributes directly from the training data while the concept is being built, although these systems also learn decision trees as their target concepts. For example, CART [1] uses Boolean combinations of primitive attributes as new attributes. During tree building, it chooses for a node a split of the form (s1 ∧ s2 ∧ ... ∧ sn) or (s1 ∨ s2 ∨ ... ∨ sn) that maximizes the decrease in impurity of the node, where each si is a split on a primitive attribute. Like CART, ID2-of-3 [8] constructs new attributes directly from the training data while building the decision tree, but it uses M-of-N concepts instead of Boolean combinations.
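To make the contrast concrete, the fragment below evaluates a CART-style Boolean conjunction and an ID2-of-3-style M-of-N test over the same set of conditions; the (attribute, value) condition representation and the example values are assumptions made for this sketch.

# A conjunctive split requires every condition to hold; an M-of-N test requires
# only that at least m of the n conditions hold.

def conjunction_split(example, conditions):
    return all(example[a] == v for a, v in conditions)

def m_of_n_test(example, conditions, m):
    return sum(example[a] == v for a, v in conditions) >= m

conditions = [("a1", 1), ("a2", 1), ("a3", 1)]
example = {"a1": 1, "a2": 0, "a3": 1}
print(conjunction_split(example, conditions))   # False: the condition on a2 fails
print(m_of_n_test(example, conditions, 2))      # True: 2 of the 3 conditions hold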

6. Conclusion and Future Work


We have described an approach to attribute construction based on decision trees and production rules. As a bias for constructing new attributes, it improves the decision tree significantly in both accuracy and simplicity in some domains, such as Heart-disease and Tic-Tac-Toe. In addition, it can distinguish relevant attributes from irrelevant attributes in some domains, such as LED24 and Heart-disease.

The second part of the stopping criterion of our attribute construction algorithms is "the new attributes cause the decision tree to have lower predictive accuracy than the decision tree built using only the primitive attributes". Other stopping criteria may be better than the current one. Another idea is to merge the algorithms CI1 and CI2: when no more primitive attributes can be identified as irrelevant, add the newly created attributes to the relevant primitive attributes and allow them to be used in the following cycles of attribute construction, so that more complex new attributes are constructed through iteration. In addition, we want to characterize the kinds of domains for which the approach discussed here is appropriate, and we will investigate other operators and construction approaches for attribute construction.

7. Acknowledgements
I am very grateful to Ross Quinlan for his advice and suggestions, especially during the drafting of this paper, and for supplying C4.5 and C4.5rules. Thanks to Mike Cameron-Jones and Kaiming Ting for many helpful comments and discussions on the ideas presented in this paper, and thanks to P.M. Murphy and D. Aha for creating and managing the UCI machine learning databases.

8. References
1. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees (Belmont, CA: Wadsworth, 1984).
2. C.J. Matheus and L.A. Rendell, Constructive induction on decision trees, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (1989), pp. 645-650.
3. G. Pagallo and D. Haussler, Boolean feature discovery in empirical learning, Machine Learning, 5 (1990), pp. 71-100.
4. G. Pagallo, Adaptive Decision Tree Algorithms for Learning from Examples, PhD thesis (University of California at Santa Cruz, 1990).
5. J.R. Quinlan, Induction of decision trees, Machine Learning, 1 (1986), pp. 81-106.
6. J.R. Quinlan, Generating production rules from decision trees, Proceedings of IJCAI (1987), pp. 304-307.
7. D. Yang, L. Rendell, and G. Blix, A scheme for feature construction and a comparison of empirical methods, Proceedings of IJCAI (1991), pp. 699-704.
8. P.M. Murphy and M.J. Pazzani, ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees, Proceedings of the Eighth International Workshop on Machine Learning (Morgan Kaufmann, 1991), pp. 183-187.
