
Classification and Prediction

• Two forms of data analysis that can be used to
extract models describing important data
classes or to predict future data trends.
• Classification predicts categorical (discrete, unordered) class labels, whereas prediction models continuous-valued functions (see the sketch after this list).
• Example: a classification model can be built to
categorize bank loan applications as either
safe or risky, whereas a prediction model is
used to predict the expenditures in dollars of
potential customers on computer equipment.
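
A minimal sketch of the distinction, assuming scikit-learn; the toy loan-style feature values and labels below are invented for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data: [income_in_thousands, years_employed]
X = np.array([[30, 1], [85, 10], [45, 3], [120, 15]])

# Classification: predict a categorical label ("risky" / "safe")
labels = np.array(["risky", "safe", "risky", "safe"])
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict([[60, 5]]))    # a discrete class label

# Prediction (regression): predict a continuous value (dollar expenditure)
spend = np.array([400.0, 2500.0, 900.0, 3100.0])
reg = DecisionTreeRegressor(random_state=0).fit(X, spend)
print(reg.predict([[60, 5]]))    # a continuous-valued estimate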

How does classification work?
• Two-step process: Learning and Classification (sketched after this list).
• Since the class label of each training tuple is
provided, this method is also known as
supervised learning.
• This contrasts with unsupervised learning, in which the class label of each training tuple is not known and the number or set of classes to be learned may not be known in advance. Example: clustering.
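
A minimal sketch of the two steps, assuming scikit-learn and its bundled iris data; the learning step builds the model from labeled training tuples, and the classification step applies it to tuples it has not seen:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)     # class labels y are provided (supervised)

# Step 1 (learning): build the classifier from the labeled training tuples
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Step 2 (classification): predict class labels for new tuples
print(model.predict(X[:5]))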

Preparing Data for Classification
• Data Cleaning: reducing noise and handling missing values.
• Relevance Analysis: Correlation analysis and
Feature Subset Selection.
• Data Transformation: Normalization (one possible pipeline is sketched below).
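
One possible preparation pipeline, sketched with scikit-learn utilities; the specific choices (mean imputation, ANOVA-based feature selection, min-max normalization) are assumptions for illustration, not prescribed by the slides:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0, np.nan],
              [2.0, 180.0, 0.5],
              [1.5, np.nan, 0.7],
              [3.0, 150.0, 0.9]])
y = np.array([0, 1, 0, 1])

X = SimpleImputer(strategy="mean").fit_transform(X)    # data cleaning: fill missing values
X = SelectKBest(f_classif, k=2).fit_transform(X, y)    # relevance analysis: keep informative features
X = MinMaxScaler().fit_transform(X)                    # data transformation: normalize to [0, 1]
print(X)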

Comparing classification and
prediction methods
• Accuracy refers to the ability of a given
classifier to correctly predict the class label of
new or previously unseen data.
• Speed refers to the computational costs
involved in generating and using the given
classifier or predictor.
• Robustness is the ability of the classifier or
predictor to make correct predictions given
noisy data or data with missing values.

• Scalability refers to the ability to construct the
classifier or predictor efficiently given large
amounts of data.
• Interpretability refers to the level of
understanding and insight that is provided by
the classifier or predictor.

Classification using Decision Tree
Induction
• Decision tree induction is the learning of decision
trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node (a small sketch follows below).
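
A small sketch of that structure, assuming scikit-learn; export_text prints the internal test nodes, branches, and leaf class labels of a learned tree, with the root node on the first line:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node is an attribute test, each branch an outcome of the test,
# and each leaf holds a class label.
print(export_text(tree, feature_names=load_iris().feature_names))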

Example

[figure]
Why are decision tree classifiers so
popular?
• Construction of decision tree classifiers does not require any domain knowledge or parameter setting.
• Representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate.
• They can handle high-dimensional data.
• They are simple and fast.
• They have good accuracy.

Evaluating the Accuracy of a Classifier
or Predictor
• Holdout, random subsampling, cross-validation, and the bootstrap are common
techniques for assessing accuracy based on
randomly sampled partitions of the given
data.
• The use of such techniques to estimate
accuracy increases the overall computation
time, yet is useful for model selection.

Holdout Method and Random
Subsampling

• The given data are randomly partitioned into
two independent sets, a training set and a test
set.
• Typically, two-thirds of the data are allocated
to the training set, and the remaining one-
third is allocated to the test set.
• The training set is used to derive the model,
whose accuracy is estimated with the test set.
• The estimate is pessimistic because only a
portion of the initial data is used to derive the
model.
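
A sketch of the holdout method with scikit-learn, using the two-thirds / one-third allocation described above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out one-third of the data as an independent test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # derive the model from the training set
print(model.score(X_test, y_test))                                     # estimate accuracy on the test set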

• Random subsampling is a variation of the
holdout method in which the holdout method
is repeated k times.
• The overall accuracy estimate is taken as the
average of the accuracies obtained from each
iteration.
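
Random subsampling simply repeats that split and averages the results; a minimal sketch, with k = 10 chosen arbitrarily:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Repeat the holdout method k times and average the accuracies
accuracies = []
for i in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    accuracies.append(DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te))
print(np.mean(accuracies))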

Cross-validation
• In k-fold cross-validation, the initial data are
randomly partitioned into k mutually exclusive
subsets or “folds,” D1, D2, ..., Dk, each of
approximately equal size.
• In iteration i, partition Di is reserved as the test
set, and the remaining partitions are collectively
used to train the model.
• For classification, the accuracy estimate is the
overall number of correct classifications from the
k iterations, divided by the total number of tuples
in the initial data.
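
A sketch of 10-fold cross-validation with scikit-learn; because the folds are of approximately equal size, averaging the per-fold accuracies closely matches the correct-over-total estimate described above:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold serves as the test set exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())   # overall accuracy estimate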

Bootstrap
• The bootstrap method samples the given
training tuples uniformly with replacement.
• That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set (see the sketch below).
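
A minimal sketch of bootstrap sampling with numpy; tuples are drawn uniformly with replacement, so some are selected more than once while others are never selected:

import numpy as np

rng = np.random.default_rng(0)
d = 10                                  # number of training tuples

# Draw d indices uniformly with replacement: a tuple may be re-added several times
sample = rng.integers(0, d, size=d)
print(sorted(sample))                   # duplicates are likely
print(set(range(d)) - set(sample))      # tuples never selected in this sample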

Ensemble Methods—Increasing the
Accuracy
• Bagging and boosting are two techniques for
increasing classifier accuracy.
• Both combine a series of k learned models (classifiers or predictors), M1, M2, ..., Mk, with the aim of creating an improved composite model, M*.

Bagging
• Given a set, D, of d tuples, bagging works as follows.
For iteration i (i = 1, 2, ..., k), a training set, Di, of d
tuples is sampled with replacement from the original
set of tuples, D.
• Because sampling with replacement is used, some of
the original tuples of D may not be included in Di,
whereas others may occur more than once. A classifier
model, Mi, is learned for each training set, Di.
• To classify an unknown tuple, X, each classifier, Mi,
returns its class prediction, which counts as one vote.
The bagged classifier, M, counts the votes and assigns
the class with the most votes to X (sketched below).
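
A sketch of the bagging procedure above, assuming scikit-learn decision trees as the base classifiers; each Mi is trained on a bootstrap sample Di and the composite classifier assigns the majority-vote class (scikit-learn's BaggingClassifier packages the same idea):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
d, k = len(X), 5

models = []
for i in range(k):
    idx = rng.integers(0, d, size=d)                               # sample Di with replacement from D
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))    # learn Mi from Di

# Classify an unknown tuple: each Mi casts one vote; the majority class wins
X_new = X[:1]
votes = [m.predict(X_new)[0] for m in models]
print(max(set(votes), key=votes.count))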

