
MC1630 / DATA WAREHOUSING & DATA MINING LECTURE NOTES

UNIT – III

• Classification and prediction are data analysis methods used to extract models describing data classes or to predict future data trends.
  o Classification predicts categorical (discrete, unordered) labels.
  o Prediction models continuous-valued functions.
Issues Regarding Classification and Prediction
Preparing the Data for Classification and Prediction

• The following preprocessing steps are applied to the data before classification and prediction to improve
  o accuracy,
  o efficiency, and
  o scalability.
• Data cleaning:
  o Remove or reduce noise (using smoothing techniques).
  o Treat missing values.
• Relevance analysis:
  o Many attributes in the data may be redundant or irrelevant.
  o Including such attributes may slow down the learning process.
  o The following methods are used to identify and remove such attributes:
    - Correlation analysis - used to identify whether any two given attributes are statistically related.
    - Attribute subset selection - used to find a reduced set of attributes.

• Data transformation and reduction:
  o The data is transformed by normalization.
  o Normalization involves scaling all values of a given attribute so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch follows this list).
  o Concept hierarchies are used to transform continuous-valued attributes by generalizing them to higher-level concepts.
  o Example:
    - Numeric values for the attribute income can be generalized to discrete ranges such as low, medium, and high.
    - Categorical attributes like street can be generalized to higher-level concepts like city.
  o Generalization compresses the original training data and requires fewer input/output operations.
  o Other data reduction methods are:
    - wavelet transformation,
    - principal components analysis,
    - binning,
    - histogram analysis, and
    - clustering.
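As a rough illustration of the normalization and generalization steps above, the following Python sketch scales a numeric attribute into [0.0, 1.0] and maps income values to the discrete ranges low / medium / high; the attribute values and the income cut-offs are hypothetical assumptions, not figures from these notes.

# Minimal sketch: min-max normalization and concept-hierarchy-style
# generalization of a numeric attribute (illustrative values only).

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:          # all values equal: avoid division by zero
        return [new_min for _ in values]
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

def generalize_income(income):
    """Map a numeric income to a higher-level concept (hypothetical ranges)."""
    if income < 30000:
        return "low"
    elif income < 70000:
        return "medium"
    return "high"

incomes = [12000, 45000, 98000, 67000, 23000]
print(min_max_normalize(incomes))               # all values now lie in [0.0, 1.0]
print([generalize_income(x) for x in incomes])  # ['low', 'medium', 'high', 'medium', 'low']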

Comparing Classification and Prediction Methods



Classification and prediction methods are compared and evaluated according to the following criteria:
• Accuracy:
  o The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information).
  o The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
  o Accuracy can be estimated using one or more test sets that are independent of the training set (a small sketch follows this list).
• Speed:
  o Refers to the computational cost involved in generating and using the given classifier or predictor.
• Robustness:
  o Refers to the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
• Scalability:
  o Refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
• Interpretability:
  o Refers to the level of understanding and insight that is provided by the classifier or predictor.
  o Interpretability is subjective and therefore more difficult to assess.
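As a hedged illustration of estimating accuracy on an independent test set, the sketch below holds out part of a labeled dataset, trains a classifier on the rest, and reports the fraction of held-out tuples classified correctly. It assumes scikit-learn and its bundled Iris data purely for convenience; neither is mentioned in these notes.

# Minimal sketch: estimate classifier accuracy on a test set that is
# independent of the training set (scikit-learn is an assumed dependency).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep 30% of the tuples aside; the classifier never sees them while training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy: proportion of previously unseen tuples whose class label is predicted correctly.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))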

Classification by Decision Tree Induction



• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where
  o each internal node (non-leaf node) denotes a test on an attribute,
  o each branch represents an outcome of the test, and
  o each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• Example (figure): internal nodes are denoted by rectangles and leaf nodes by ovals.
• Some decision tree algorithms produce
  o binary trees (each internal node branches to exactly two nodes), while others produce
  o non-binary trees.


Decision trees in classification:

• Given a tuple X whose associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple (see the sketch below).
• Decision trees can easily be converted to classification rules.
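To make the path-tracing idea concrete, here is a minimal Python sketch in which a decision tree is represented as a nested dictionary; the attributes (age, student), their values, and the class labels are invented for illustration and are not taken from these notes.

# Minimal sketch: classify a tuple by tracing a path from the root to a leaf.
# Internal nodes test an attribute; leaf nodes hold a class label.

tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"no":  {"label": "does_not_buy"},
                                     "yes": {"label": "buys"}}},
        "middle_aged": {"label": "buys"},
        "senior":      {"label": "does_not_buy"},
    },
}

def classify(node, tuple_x):
    """Follow the branch matching the tuple's attribute value until a leaf is reached."""
    while "label" not in node:              # internal node: test its attribute
        value = tuple_x[node["attribute"]]
        node = node["branches"][value]      # follow the branch for this outcome
    return node["label"]                    # leaf node: return its class label

print(classify(tree, {"age": "youth", "student": "yes"}))   # -> buys

Each root-to-leaf path in such a tree can be read off directly as one classification rule, e.g. IF age = youth AND student = yes THEN class = buys.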
Decision tree classifiers

• The construction of decision tree classifiers
  o does not require any domain knowledge or parameter setting,
  o is appropriate for exploratory knowledge discovery, and
  o is generally accurate.
• Decision trees can handle high-dimensional data.
• Decision tree induction algorithms are used for classification in many application areas, such as
  o medicine,
  o manufacturing and production,
  o financial analysis,
  o astronomy, and
  o molecular biology.

• Steps in decision tree construction:
  o Attribute selection measures - select the attribute that best partitions the tuples into distinct classes.
  o Tree pruning - identify and remove branches that reflect outliers or noise, with the goal of improving classification accuracy on unseen data.

Decision Tree Induction


Introduction:

• ID3, C4.5, and CART adopt a greedy (non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
• The approach starts with a training set of tuples and their associated class labels.
• The training set is recursively partitioned into smaller subsets as the tree is built.

The strategy:
• The algorithm is called with three parameters:
  o D (data partition) - the complete set of training tuples and their associated class labels.
  o Attribute_list - the list of attributes describing the tuples.
  o Attribute_selection_method - specifies the procedure for selecting the attribute that best partitions the tuples; it uses an attribute selection measure such as
    - information gain (the resulting tree may have multi-way splits), or
    - the Gini index (the resulting tree is binary).
Algorithm:


Step 1: The tree starts as a single node, N, representing the training tuples in D.
Steps 2 and 3: If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class.
Steps 4 and 5: Check the terminating conditions.
Step 6: Calls Attribute_selection_method to determine the splitting criterion.
• The splitting criterion indicates
  o which attribute to test at node N, by determining the best way to separate or partition the tuples in D into individual classes,
  o which branches to grow from node N with respect to the outcomes of the chosen test, and
  o the splitting attribute (and possibly a split-point or a splitting subset).
Step 7: The node N is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from node N for each outcome of the splitting criterion.
Steps 10 and 11: The tuples in D are partitioned. Let A be the splitting attribute, with v distinct values {a1, a2, ..., av} based on the training data. There are three possible scenarios:


A is discrete-valued:
• The outcomes of the test at node N correspond directly to the known values of A.
• A branch is created for each known value aj of A and labeled with that value.
• Partition Dj is the subset of class-labeled tuples in D having value aj of A.
• Because all tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples and can be removed from attribute_list.
A is continuous-valued:
• The test at node N has two possible outcomes, corresponding to the conditions
  o A <= split_point and
  o A > split_point,
  where split_point is the split-point returned by Attribute_selection_method.
• Two branches are grown from N and labeled according to these outcomes.
• The tuples are partitioned such that
  o D1 holds the subset of class-labeled tuples in D for which A <= split_point, and
  o D2 holds the remaining tuples.
A is discrete-valued and a binary tree must be produced:
• The test at node N is of the form "A ∈ SA?",
  o where SA is the splitting subset for A returned by Attribute_selection_method.
• If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied.
• Two branches are grown from N.
  o The left branch out of N is labeled "yes", so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test.
  o The right branch out of N is labeled "no", so that D2 corresponds to the subset of class-labeled tuples from D that do not satisfy the test.

Step 14: The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition Dj.
The recursive partitioning stops only when any one of the following terminating conditions is true:
• All of the tuples in partition D belong to the same class.
• There are no remaining attributes on which the tuples may be further partitioned.
  o In this case, an approach called majority voting is employed:
  o node N is converted into a leaf and labeled with the most common class in D.
• There are no tuples for a given branch, i.e., a partition Dj is empty.
  o In this case, a leaf is created and labeled with the majority class in D.
Step 15: The resulting decision tree is returned.
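Under the assumption of discrete-valued attributes and information gain as the attribute selection measure (the multi-way-split case described above), the following is a minimal, non-authoritative Python sketch of this recursive divide-and-conquer procedure; the tiny training set and attribute names are invented for illustration.

# Minimal sketch of recursive decision-tree induction (ID3-style):
# discrete attributes only, information gain as the selection measure.
import math
from collections import Counter

def entropy(tuples):
    """Info(D): expected information needed to classify a tuple in D."""
    counts = Counter(label for _, label in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(tuples, attr):
    """Gain(A) = Info(D) - InfoA(D) for a discrete-valued attribute A."""
    total = len(tuples)
    partitions = {}
    for x, label in tuples:
        partitions.setdefault(x[attr], []).append((x, label))
    info_a = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(tuples) - info_a

def majority_class(tuples):
    """Most common class label in a partition (used for majority voting)."""
    return Counter(label for _, label in tuples).most_common(1)[0][0]

def build_tree(tuples, attribute_list):
    labels = {label for _, label in tuples}
    if len(labels) == 1:                    # all tuples belong to the same class
        return {"label": labels.pop()}
    if not attribute_list:                  # no attributes left: majority voting
        return {"label": majority_class(tuples)}
    # Attribute selection: the attribute with the highest information gain.
    best = max(attribute_list, key=lambda a: info_gain(tuples, a))
    # "default" stands in for empty partitions (attribute values with no tuples).
    node = {"attribute": best, "branches": {}, "default": majority_class(tuples)}
    remaining = [a for a in attribute_list if a != best]
    partitions = {}
    for x, label in tuples:
        partitions.setdefault(x[best], []).append((x, label))
    for value, part in partitions.items():  # one branch per observed value of A
        node["branches"][value] = build_tree(part, remaining)
    return node

# Hypothetical training tuples: (attribute-value dict, class label).
D = [
    ({"age": "youth",       "student": "no"},  "no"),
    ({"age": "youth",       "student": "yes"}, "yes"),
    ({"age": "middle_aged", "student": "no"},  "yes"),
    ({"age": "senior",      "student": "yes"}, "yes"),
    ({"age": "senior",      "student": "no"},  "no"),
]
print(build_tree(D, ["age", "student"]))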

The computational complexity of the algorithm


O (n X |D|) X log (|D|))
where
o
n - number of attributes describing the tuples in D
o
|D| - number of training tuples in D.
Attribute Selection Measures

• An attribute selection measure is used to select the splitting criterion that best separates a given data partition D of class-labeled training tuples into individual classes.
• Ideally, splitting D into smaller partitions according to the outcomes of the splitting criterion makes each partition pure (i.e., all tuples that fall into a given partition belong to the same class).
• Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
• The attribute selection measure
  o ranks each attribute describing the given training tuples;
  o the attribute with the best score is chosen as the splitting attribute;
  o if the splitting attribute is continuous-valued or if we are restricted to binary trees, then
    - a split-point or splitting subset, respectively, is determined as part of the splitting criterion.
• The tree node created for partition D is labeled with the splitting criterion,
  o branches are grown for each outcome of the criterion, and
  o the tuples are partitioned accordingly.

• Three popular attribute selection measures:
  o information gain,
  o gain ratio, and
  o the Gini index.
Information gain
• ID3 uses information gain as its attribute selection measure.
• Let node N represent the tuples of partition D.
• The splitting attribute for node N is the attribute with the highest information gain; it
  o minimizes the information needed to classify the tuples in the resulting partitions,
  o reflects the least randomness / "impurity" in these partitions, and
  o minimizes the expected number of tests needed to classify a given tuple, yielding a simple tree.

• The expected information needed to classify a tuple in D is given by

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

  where
  o p_i is the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|, and
  o m is the number of classes.
• Info(D) is the average amount of information needed to identify the class label of a tuple in D;
  o it is also known as the entropy of D.

• Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}, as observed from the training data.
• If A is discrete-valued, these values correspond directly to the v outcomes of a test on A.
• Attribute A can be used to split D into v partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A.
• These partitions would correspond to the branches grown from node N.
• Ideally, we would like this partitioning to produce an exact classification of the tuples; that is, we would like each partition to be pure.
• However, it is quite likely that the partitions will be impure (e.g., a partition may contain a collection of tuples from different classes rather than from a single class).
• How much more information would we still need (after the partitioning) in order to arrive at an exact classification? This amount is measured by

  InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

• The term |Dj| / |D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
• The smaller the expected information (still) required, the greater the purity of the partitions.

• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

  Gain(A) = Info(D) - InfoA(D)
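To make these formulas concrete, the short Python sketch below computes Info(D), InfoA(D), and Gain(A) for a hypothetical partition D containing 9 tuples of one class and 5 of another, split by some attribute A into three subsets; the class distributions are invented for illustration.

# Minimal worked example of Info(D), InfoA(D), and Gain(A).
import math

def info(class_counts):
    """Info(D) = - sum_i p_i * log2(p_i), with p_i = |C_i,D| / |D|."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

D_counts = [9, 5]                      # class distribution of D (9 vs. 5 tuples)
partitions = [[2, 3], [4, 0], [3, 2]]  # class distributions of D1, D2, D3 under A

info_D = info(D_counts)
info_A = sum(sum(Dj) / sum(D_counts) * info(Dj) for Dj in partitions)

print("Info(D)  =", info_D)            # about 0.94 bits
print("InfoA(D) =", info_A)            # about 0.69 bits
print("Gain(A)  =", info_D - info_A)   # about 0.25 bits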
