UNIT – III
• Classification and prediction are data analysis methods used to extract models
describing data classes or to predict future data trends.
o Classification predicts categorical (discrete, unordered) labels.
o Prediction models continuous-valued functions.
Issues Regarding Classification and Prediction
Preparing the Data for Classification and Prediction
The following preprocessing steps are applied to data before classification and
prediction to improve
o accuracy,
o efficiency, and
o scalability.
Data cleaning:
o To remove or reduce noise (using smoothing techniques) and
o to treat missing values.
Relevance analysis:
o Many attributes in the data may be redundant or irrelevant.
o Including such attributes may slow down the learning process.
o The following methods are used to remove such data:
Correlation analysis - used to identify whether any two given
attributes are statistically related.
Attribute subset selection - used to find a reduced set of attributes.
• Data transformation and reduction
o The data is transformed by normalization.
o Normalization involves scaling all values for a given attribute so that they fall
within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
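Min-max normalization, as described above, can be sketched in Python; the function name and default target range are illustrative, not from the notes:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [20000, 50000, 80000]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```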
o Concept hierarchies are used for transforming continuous-valued attributes
by generalizing them to higher-level concepts.
o Example:
Numeric values for the attribute income can be generalized to
discrete ranges such as low, medium, and high.
Categorical attributes like street can be generalized to higher-level
concepts like city.
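The income example above might be sketched as follows; the threshold values are made up for illustration and are not from the notes:

```python
def generalize_income(income):
    """Generalize a numeric income to the concept hierarchy
    low < medium < high (thresholds are illustrative)."""
    if income < 40000:
        return "low"
    elif income < 80000:
        return "medium"
    return "high"

print(generalize_income(25000))   # low
print(generalize_income(60000))   # medium
print(generalize_income(100000))  # high
```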
o Generalization compresses the original training data and requires only a few
input/output operations.
o Other data reduction methods are:
Wavelet transformation
Principal components analysis
Binning
Histogram analysis
Clustering
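As one illustration of the reduction methods listed above, equal-width binning can be sketched as follows (the function name and bin count are illustrative):

```python
def equal_width_bins(values, k):
    """Partition numeric values into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # clamp the top edge of the range into the last bin
        idx = min(int((v - lo) / width), k - 1) if width else 0
        bins[idx].append(v)
    return bins

print(equal_width_bins([1, 2, 3, 4, 5, 6], 2))  # [[1, 2, 3], [4, 5, 6]]
```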
Speed:
o Refers to the computational costs involved in generating and using the
given classifier or predictor.
Robustness:
o Refers to the ability of the classifier or predictor to make correct
predictions given noisy data or data with missing values.
Scalability:
o Refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
Interpretability:
o Refers to the level of understanding and insight that is provided by the
classifier or predictor.
o Interpretability is subjective and therefore more difficult to assess.
Some decision tree algorithms produce
o binary trees (each internal node branches to exactly two nodes), while others produce
o non-binary trees.
Step 1: The tree starts as a single node, N, representing the training tuples in D.
Steps 2 and 3: If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class.
Steps 4 and 5: handle the terminating conditions.
Step 6: Calls Attribute selection method to determine the splitting criterion.
The splitting criterion indicates
o which attribute to test at node N, by determining the best way to separate or
partition the tuples in D into individual classes,
o which branches to grow from node N with respect to the outcomes of the
chosen test, and
o the splitting attribute (and possibly a split-point or a splitting subset).
Step 7: The node N is labeled with the splitting criterion which serves as a test at the
node. A branch is grown from node N for each outcome of the splitting criterion.
Steps 10 and 11: The tuples in D are partitioned.
Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on
the training data.
There are three possible scenarios:
A is discrete-valued:
The outcomes of the test at node N correspond directly to the known values of A.
A branch is created for each known value aj of A and labeled with that value.
Partition Dj is the subset of class-labeled tuples in D having value aj of A.
If all tuples in the partition have the same value for A, then A need not be considered
in any future partitioning of the tuples and can be removed from the attribute list.
A is continuous-valued:
The test at node N has two possible outcomes, corresponding to the conditions
o A <= split_point and
o A > split_point,
where split_point is the split-point returned by Attribute selection method.
Two branches are grown from N and labeled according to the outcomes.
The tuples are partitioned such that
o D1 holds the subset of class-labeled tuples in D for which A <= split_point, and
o D2 holds the remaining tuples.
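The two-way partition on a continuous-valued attribute can be sketched as follows; the attribute index, split point, and sample tuples are illustrative:

```python
def binary_split(tuples, attr_index, split_point):
    """Partition class-labeled tuples on a continuous attribute A:
    D1 gets tuples with A <= split_point, D2 gets the rest."""
    d1 = [t for t in tuples if t[attr_index] <= split_point]
    d2 = [t for t in tuples if t[attr_index] > split_point]
    return d1, d2

# e.g. income at index 0, split point 40
d1, d2 = binary_split([(25, "yes"), (40, "no"), (60, "yes")], 0, 40)
print(d1)  # [(25, 'yes'), (40, 'no')]
print(d2)  # [(60, 'yes')]
```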
A is discrete-valued and a binary tree must be produced:
The test at node N is of the form "A ∈ SA?", where
o SA is the splitting subset for A returned by Attribute selection method.
If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied.
Two branches are grown from N.
o The left branch out of N is labeled yes, so that D1 corresponds to the subset
of class-labeled tuples in D that satisfy the test.
o The right branch out of N is labeled no, so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.
The expected information needed to classify a tuple in D is given by

    Info(D) = - Σ (i = 1 to m) pi log2(pi)

where
• pi - probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
• Info(D) - average amount of information needed to identify the class label of a
tuple in D; also known as the entropy of D.
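A minimal sketch of Info(D), computing the entropy of a list of class labels (the function name is illustrative):

```python
from math import log2
from collections import Counter

def info(labels):
    """Info(D): expected information (entropy, in bits) needed to classify
    a tuple, given the class labels of the tuples in D."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())

# A 50/50 class split needs exactly 1 bit of information.
print(info(["yes", "yes", "no", "no"]))  # 1.0
```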
Now, suppose we were to partition the tuples in D on some attribute A having v
distinct values, {a1, a2, ..., av}, as observed from the training data.
If A is discrete-valued, these values correspond directly to the v outcomes of a test
on A.
Attribute A can be used to split D into v partitions or subsets, {D1, D2, ..., Dv},
where Dj contains those tuples in D that have outcome aj of A.
These partitions would correspond to the branches grown from node N.
Ideally, we would like this partitioning to produce an exact classification of the
tuples.
That is, we would like for each partition to be pure.
However, it is quite likely that the partitions will be impure (e.g., where a
partition may contain a collection of tuples from different classes rather than from
a single class).
How much more information would we still need (after the partitioning) in order
to arrive at an exact classification?
This amount is measured by

    InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

The term |Dj| / |D| acts as the weight of the jth partition.
InfoA(D) is the expected information required to classify a tuple from D based on
the partitioning by A.
The smaller the expected information (still) required, the greater the purity of the
partitions.
Information gain is defined as the difference between the original information
requirement (i.e., based on just the proportion of classes) and the new requirement
(i.e., obtained after partitioning on A). That is,

    Gain(A) = Info(D) - InfoA(D)
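Putting the pieces together, information gain for a discrete-valued attribute can be sketched as follows; the helper and parameter names are illustrative:

```python
from math import log2
from collections import Counter

def info(labels):
    """Info(D): entropy of the class labels in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(tuples, labels, attr_index):
    """Gain(A) = Info(D) - InfoA(D), where InfoA(D) weights the entropy
    of each partition Dj by |Dj| / |D|."""
    n = len(labels)
    partitions = {}  # attribute value aj -> class labels of tuples in Dj
    for t, label in zip(tuples, labels):
        partitions.setdefault(t[attr_index], []).append(label)
    info_a = sum(len(dj) / n * info(dj) for dj in partitions.values())
    return info(labels) - info_a

# An attribute that separates the classes perfectly gains the full 1 bit.
data = [("a",), ("a",), ("b",), ("b",)]
labels = ["yes", "yes", "no", "no"]
print(info_gain(data, labels, 0))  # 1.0
```

An attribute whose values carry no class information (e.g. the same value everywhere) would yield a gain of 0; the attribute with the highest gain is chosen as the splitting attribute.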