Improved Discretization Based Decision Tree For Continuous Attributes

International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page257

Improved Discretization Based Decision Tree
for Continuous Attributes

S.J yothsna
Gudlavalleru Engineering College,
Gudlavalleru.

G.Bharthi
Asst. Professor
Gudlavalleru Engineering College,
Gudlavalleru.

Abstract :- The majority of the Machine Learning and
Data Mining applications can easily be applicable
only on discrete features. However, data in solid
world are sometimes continuous by nature. Even for
algorithms that will directly encounter continuous
features, learning is most often ineffective and
effective. Hence discretization addresses this problem
by finding the intervals of numbers which happen to
be more concise to represent and specify.
Discretization of continuous attributes is one of the
important data preprocessing steps of knowledge
extraction. The proposed improved discretization
approach significantly reduces the IO cost and also
requires one time sorting for numerical attributes
which leads to a better performance in time
dimension on rule mining algorithms. According to
the experimental results, our algorithm acquires less
execution time over the Entropy based algorithm and
also adoptable for any attribute selection method by
which the accuracy of rule mining is improved.
Keywords Discretization, Preprocessing, Data Mining,
Machine learning

I. INTRODUCTION

Discretization of continuous attributes simply
not only broadens the scope of a given range of data
mining algorithms able to analyze data in discrete form,
but might also dramatically amplify the speed at which
these tasks can be carried out. A discrete feature, also
known as qualitative features, including sex and level of
education, is only able to be limited among a number of
values. Continuous features might be ranked if you want
and admit to meaningful arithmetic operations. However,
discrete features sometimes can possibly be arrayed
within the meaningful order. However no arithmetic
operations can be placed upon them. Data discretization
is a multipurpose pre-processing method that reduces the
quantity of distinct values to obtain given continuous
variable by dividing its range right into a finite set of
disjoint intervals, and after that relates these intervals
with meaningful labels [2]. Subsequently, data are
analyzed or reported with it higher-level of data

representation instead of the subtle individual values,
therefore results in a simplified data representation in
data exploration and data mining process. Discretization
of continuous attributes plays an important role in
knowledge discovery. Many algorithms linked to data
mining require the running examples contain only
discrete values, and rules with discrete values are
normally shorter and even more understandable. Suitable
discretization is useful to increase the generalization and
accuracy of discovered knowledge. Discretization
algorithms might be categorized into unsupervised and
supervised based upon if the class label details are used.
Equal Width and Equal Frequency are two representative
unsupervised discretization algorithms. Compared to
supervised discretization, previous research[6][9] has
indicated that unsupervised discretization algorithms do
not have as much computational complexity, but may
contribute to usually not as good classification
performance. When classification performance is
probably the main concern, supervised discretization
should really be adopted.
There are several benefits associated with using
discrete values over continuous ones: (1) Discretization
will reduce the number of continuous features' values,
which brings smaller demands on system's storage.
(2)Discrete features are in close proximity to a
knowledge-level representation than continuous ones.
(3)Data can also be reduced and simplified through
discretization. For both users and experts, discrete
features are easier to comprehend, use, and explain.
(4)Discretization makes learning more accurate and faster
[5]. (5)Besides the many advantages of obtaining discrete
data over continuous one, a suite of classification
learning algorithms is only able to cope with discrete
data. Successful discretization can significantly extend
the application range of many learning algorithms.
Possibly one of the supervised discretization
methods, introduced by Fayyad and Irani, is referred to as
entropy-based discretization. An entropy-based method
will use the class information entropy of candidate
partitions to decide on boundaries for discretization.
Class information entropy is naturally a measure of purity
and it measures the quantity of information which will be


needed to specify to which class an outbreak belongs. It
considers one big interval by using all of the known
values regarding a feature then recursively partitions this
interval into smaller subintervals until some stopping
criterion, for instance Minimum Description Length
(MDL) Principle or maybe an optimal large number of
intervals has been reached thus creating multiple
intervals of feature [11].

Discretization methods can possibly be
supervised or unsupervised depending upon whether it
uses class information files sets. Supervised methods
make use of the course label when partitioning the
ongoing features. On the other hand, unsupervised
discretization methods tend not to require the instruction
information to discretize continuous attributes.
Supervised discretization can be further characterized as
error-based, entropy-based or statistics based.
Unsupervised discretization is seen in earlier methods
like equal-width and equal-frequency.
Discretization methods can also be viewed as
dynamic or static. A dynamic method would discretize
continuous values when a classifier has been built, for
instance in C4.5 while in the static approach
discretization is done previous to the classification task.

II. LITERATURE SURVEY

Discretization method which is supervised,
static and global. This methods discretization measure
takes account of the distribution of class probability
vector by applying Gini criterion [1] and its stopping
criterion involves a tradeoff between simplicity and
predictive accuracy by incorporating the number of
partition intervals.
ADVANTAGES:
The purpose of this nonparametric test was to
determine if significant differences existed
between two populations.
Effective data classification using Decision tree
with discretization.
Reduces number of partitioning iterations.
DISADVANTAGES:
Cut points are selected by recursively applying
the same binary discretization method.
Doesnt discretization binary data.
Problem in discretization small instances.
In this system Multivariate Discretization
(MVD) Method [2] based on the idea of transforming the
problem of unsupervised discretization in association
rules into a supervised problem. Within the support-
confidence framework, they find that a rule with high
confidence usually makes the corresponding data space
have a high density. Thus, they firstly use a density-based
clustering technique to identify the regions with high
densities. Regarding every region as a class, they then
develop a genetic algorithm to simultaneously discretize
multiattributes according to entropy criterion.

ADVANTAGES:
Generates quality rules.
Generates high frequent association rules with
proposed discretization approach.
MVD-CG discretizes variables based on the
HDRs (High density regions) where some
patterns with relatively high confidences are
hidden.
DISADVANTAGES:
The disadvantage is that MVD really discretizes
the attributes one at a time instead of
discretizing themsimultaneously.
For association rules this system uses basic
apriori algorithm which generates high
candidate sets.
A whole new rule-based algorithm for
classifying and [8] proposes a new and effective
supervised discretization algorithm in accordance to
correlation maximization (CM) is proposed by employing
multiple correspondence analysis (MCA). MCA seems to
be an effective technique to capture the correlations
between multiple variables. Two main questions ought to
be answered when preparing a discretization algorithm:
the time you need to cut and how to cut. Many
discretization algorithms are based on information
entropy, for instance maximumentropy which discretizes
the numeric attributes using the criterion of minimum
information loss. IEM is an often one on account of its
efficiency and good performance among the
classification stage. IEM selects the very first cut-point
that minimizes the entropy function over all possible
candidate cut-points and recursively applies this strategy
to both induced intervals. The Minimum Description
Length (MDL) principle is employed to discover if you
would like to accept a selected candidate cut-point or not
and thus stop the recursion in the event the cut-point will
not satisfy a pre-defined condition. An applicant cut-
point, MCA is made use of to measure the correlation
between intervals/items and classes. The mattress that
allows the highest correlation in the classes is selected
being a cut-point. The geometrical representation of
MCA just not only visualizes the correlation relationship
between intervals/items and classes, but additionally
presents an elegant way to decide the cut-points. For one
numeric feature, the candidate cut-point that maximizes


the correlation between feature intervals and classes is
chosen like the first cut-point, then the strategy is
performed among the nearly everywhere intervals
recursively to further partition the intervals. Empirical
comparisons with IEM, IEMV, CAIM, and CACC
supervised discretization algorithms are conducted using
six well-known classifiers. Currently, CM places focus
on discretizing a dataset with two classes and shows
promising results. This will be extended to handle a
dataset that come with than two classes in our future
work.
Discretization algorithms are mainly categorized
as supervised and unsupervised algorithms. Popular
unsupervised top-down algorithms are Equal Width,
Equal Frequency [10] and standard deviation. While the
supervised top-down algorithms are maximum entropy
[11], Paterson-Niblett which uses dynamic discretization,
Information Entropy Maximization (IEM) and class
attribute interdependence Maximization (CAIM). Kurgan
and Cios have shown the outperforming results of CAIM
discretization algorithm when compared to other
algorithms. As CAIM considers largest interdependence
between classes and attribute it improves classification
accuracy. Unlike other discretization algorithm CAIM
automatically generate the intervals and interval
boundaries for your given data without any user input.
Over the next couple of section, C4.5 a tree based
classification is discussed.
C4.5 builds decision trees typically from a
variety of training data in the same fashion as ID3,
making use of the information gain ratio. Each node of
this very tree, C4.5 chooses one attribute of the results
that the majority of effectively splits its multitude of
samples into subsets enriched available as one class as
well as other. It calculates the post gain for the attributes.
Compared to the attribute when using the highest
information gain is chosen in order to make the decision.
Then upon the bases on that attribute, divide the given
training set into a subsets. Then recursively apply the
algorithmfor each subset till the set contains instances of
the very same class. If the set contains instances of the
same class, then return that class.

III. PROPOSED APPROACH:

Algorithm: Improved Discretization method.
Attributes:Ai
Input:
N, number of examples.
Ai, continuous attributes.
Cj, class values in training set.
Global Threshold value
Output: Interval borders in Ai
Procedure:
1. for each continuous attribute Ai in training dataset do
2. Do normalize the attribute within 0-1 range
3. Sorting the values of continuous attribute Ai in
ascending order.
4. for each class Cj in training dataset do
5. Find the minimum (Minvalue) using StdDev
attribute value of Ai for Cj
6. Find the maximum (Max) attribute value of Ai
for Cj.
7. endfor
8. Find the cut points in the continuous attributes values
based on the Min and Max values of each class Cj.
Best Cutpoint range measure:
9. Find the conditional probability P(Cj/A) on each cut
point and select the cut point with maximumprobability
value.
Stopping criteria:
10. If the cut point using the maximum probability value
is exist and satisfies the global threshold value then it can
be taken as an interval border else consider the next cut
point, where information gain value and global threshold
value satisfy the same point.
12. endfor


Improved Decisi on tree measure:

Modified Information or entropy is gi ven as
ModInfo(D)=
3
1
og
m
i i
i
S l S

,m different classes
ModInfo(D)=
2
3
1
og
i i
i
S l S

=
3 3
1 1 2 2
log log S S S S
Where
1
S indicates set of samples which
belongs to target class anamoly,
2
S indicates set
of samples which belongs to target class normal.

Information or Entropy to each attribute is
calculated using

1
( ) / ( )
v
A i i
i
Info D D D ModInfo D

The term Di /D acts as the weight of the jth
partition. ModInfo(D) is the expected information
required to classify a tuple from D based on the
partitioning by A.
IV. Experimental Results:

RULE-7 TECHNIQUE:
==================

(word_freq_your = '(0.28698-0.770745]') and
(word_freq_money ='(0.02-INF)') and (word_freq_all =
'(0.214647-0.615166]') =>is_spam=1 (422.0/5.0)
(word_freq_free ='(0.068896-INF)') and (char_freq_! =
'(0.107811-INF)') =>is_spam=1 (372.0/15.0)
(word_freq_remove = '(0.026225-INF)') and
(word_freq_george = '(-INF-0.008661]') =>is_spam=1
(440.0/23.0)
(char_freq_$ ='(0.156751-INF)') and (word_freq_000 =
'(0.218378-INF)') =>is_spam=1 (78.0/3.0)
(char_freq_$ ='(0.156751-INF)') and (word_freq_hp ='(-
INF-0.075835]') and (capital_run_length_total =
'(0.090418-0.211566]') =>is_spam=1 (28.0/2.0)
(word_freq_hp = '(-INF-0.075835]') and
(capital_run_length_longest = '(0.041854-0.073868]')
and (word_freq_edu = '(-INF-0.047378]') and
(word_freq_george = '(-INF-0.008661]') and
(capital_run_length_total = '(0.066714-0.090418]') and
(char_freq_$ = '(0.156751-INF)') => is_spam=1
(31.0/0.0)
(char_freq_! = '(0.107811-INF)') and
(capital_run_length_average = '(0.058836-INF)') =>
is_spam=1 (45.0/3.0)
(word_freq_internet = '(0.036215-INF)') and
(word_freq_edu = '(-INF-0.047378]') and
(word_freq_order = '(0.092351-INF)') => is_spam=1
(33.0/0.0)
(capital_run_length_average = '(0.046493-0.058836]')
and (word_freq_george = '(-INF-0.008661]') and
(capital_run_length_longest = '(0.02916-0.041854]') =>
is_spam=1 (35.0/5.0)
and (char_freq_! = '(0.107811-INF)') => is_spam=1
(31.0/2.0)
(word_freq_hp ='(-INF-0.075835]') and (word_freq_free
= '(0.068896-INF)') and (word_freq_re = '(-INF-
0.026082]') and (capital_run_length_longest =
'(0.041854-0.073868]') and (capital_run_length_average
='(0.030341-0.046493]') =>is_spam=1 (21.0/2.0)
(word_freq_hp ='(-INF-0.075835]') and (word_freq_our
= '(0.185737-INF)') and (word_freq_your = '(0.28698-
0.770745]') and (word_freq_george ='(-INF-0.008661]')
=>is_spam=1 (87.0/23.0)
(capital_run_length_longest ='(0.02916-0.041854]') and
(word_freq_edu ='(-INF-0.047378]') and (char_freq_( =
'(-INF-0.010126]') and (char_freq_$ ='(0.156751-INF)')
=>is_spam=1 (11.0/0.0)
(word_freq_hp ='(-INF-0.075835]') and (char_freq_$ =
'(0.096152-0.156751]') and (char_freq_! = '(0.049475-
0.107811]') =>is_spam=1 (33.0/4.0)
and (char_freq_( = '(0.010126-0.106447]') and
(capital_run_length_average ='(0.030341-0.046493]') =>
is_spam=1 (11.0/0.0)
(word_freq_over ='(0.212283-INF)') and (word_freq_pm
= '(-INF-0.101716]') and (word_freq_all = '(-INF-
0.214647]') =>is_spam=1 (18.0/2.0)
(word_freq_edu ='(-INF-0.047378]') and (char_freq_! =
'(0.049475-0.107811]') and (word_freq_mail =
'(0.049675-0.327926]') and (word_freq_credit =
'(0.064194-INF)') =>is_spam=1 (7.0/0.0)
(word_freq_free ='(0.068896-INF)') and (word_freq_edu
= '(-INF-0.047378]') and (char_freq_$ = '(0.045623-
0.096152]') =>is_spam=1 (8.0/1.0)
and (word_freq_650 = '(0.023453-INF)') and
(word_freq_internet ='(-INF-0.036215]') =>is_spam=1
(15.0/1.0)
(word_freq_business ='(0.362835-INF)') =>is_spam=1
(18.0/5.0)


(word_freq_hp ='(-INF-0.075835]') and (word_freq_re =
'(-INF-0.026082]') and (capital_run_length_average =
'(0.058836-INF)') and (word_freq_our = '(0.022361-
0.185737]') =>is_spam=1 (7.0/0.0)
(word_freq_re ='(-INF-0.026082]') and (word_freq_font
= '(0.081988-INF)') and (char_freq_; = '(-INF-
0.128582]') =>is_spam=1 (14.0/1.0)
(word_freq_hp ='(-INF-0.075835]') and (word_freq_re =
'(-INF-0.026082]') and (char_freq_! ='(0.107811-INF)')
and (word_freq_will = '(-INF-0.159165]') and
(word_freq_meeting ='(-INF-0.178499]') =>is_spam=1
(13.0/1.0)
(word_freq_free ='(0.068896-INF)') and (char_freq_( =
'(-INF-0.010126]') and (capital_run_length_average =
'(0.058836-INF)') and (char_freq_! = '(0.049475-
0.107811]') =>is_spam=1 (5.0/0.0)
(word_freq_your = '(0.28698-0.770745]') and
(word_freq_business = '(0.095342-0.362835]') =>
is_spam=1 (7.0/1.0)
=>is_spam=0 (2811.0/122.0)

Number of Rules : 26

V. CONCLUSION AND FUTURE SCOPE

Discretization of continuous features plays an
important role in data pre-processing. This paper briefly
introduces that the generation of the problem of
discretization brings many benefits including improving
the algorithms efficiency and expanding their
application scope. There have been drawbacks in the
existing literature to classify discretization methods. The
idea and drawbacks of some typical methods are
expressed in details by supervised or unsupervised
category. Proposed Improved discretization approach
significantly reduces the IO cost and also requires one
time sorting for numerical attributes which leads to a
better performance in time dimension on rule mining
algorithms. According to the experimental results, our
algorithmacquires less execution time over the Entropy
based algorithm and also adoptable for any attribute
selection method by which the accuracy of rule mining is
improved.

REFERENCES
[1]: A DISCRETIZATION ALGORITHM BASED ON
GINI CRITERION XIAO-HANG ZHANG, JUN WU,
TING-J IE LU, YUAN J IANG, Proceedings of the Sixth
International Conference on Machine Learning and
Cybernetics, Hong Kong, 19-22 August 2007.
[2]: A Novel Multivariate Discretization Method for
Mining Association Rules Hantian Wei, 2009 Asia-
Pacific Conference on Information Processing
[3]: A Rule-Based Classification Algorithmfor Uncertain
Data, IEEE International Conference on Data
Engineering
[4]: M. C. Ludl, G. Widmer. Relative unsupervised
discretization for association rule mining. In: In
Proceedings of the 4
th
European Conference on Principles
and Practice of Knowledge Discovery in Databases,
Berlin, Germany, Springer, 2000.
[5]: Stephen D. Bay. Multivariate discretization for set
mining. Knowledge and Information Systems, 2001,
3(4): 491-512.
[6]: Stephen D. Bay and Michael J. Pazzani. Detecting
group differences: Mining contrast sets. Data Mining and
Knowledge Discovery, 2001, 5(3): 213-246.
[7]: CAIM Discretization AlgorithmLukasz A. Kurgan
[8]: Effective Supervised Discretization for Classification
based on Correlation Maximization
Qiusha Zhu, Lin Lin, Mei-Ling Shyu
[9]: X.S.Li, D.Y.Li. A New Method Based on Density
Clustering for Discretization of Continuous Attributes,
Journal of SystemSimulation, 15(6):804-806,813,2005
[10]: R.Kass, L.Wasserman. A reference Bayesian test
for nested hypotheses and its relationship to the Schwarz
criterion, Journal of the American Statistical Association,
Vol.90:928-935, 1995.
[11]: Comparative Analysis of Supervised and
Unsupervised Discretization Techniques
Rajashree Dash

Improved Discretization Based Decision Tree For Continuous Attributes

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Improved Discretization Based Decision Tree For Continuous Attributes

Hochgeladen von

Copyright:

Verfügbare Formate

International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page257

Das könnte Ihnen auch gefallen