
Classification Learning

Prof. Vijay Ukani

Computer Science and Engineering Department Nirma University, Ahmedabad

- Learning
- Classification
- Hypothesis generation
- Decision Tree Induction
- Classification by Backpropagation
- Bayesian classification

Learning vs Memory
Learning is how you acquire new information about the world, and memory is how you store that information over time
-Eric R. Kandel, M.D., vice chairman of The Dana Alliance for Brain Initiatives and recipient of the 2000 Nobel Prize

Extracting useful information from a huge collection of available data.
Knowledge Discovery in Databases (KDD), Data Mining, Knowledge Extraction, and Data Archaeology are other terms used to denote the process.
Mining can take the following forms:
- Concept Description: Characterisation and Discrimination
- Association Analysis
- Classification and Prediction
- Cluster Analysis
- Outlier Analysis
- Evolution Analysis

Supervised Learning
- The training data are accompanied by labels indicating the class of each observation
- New data are classified based on the training set, so a training set is required
- Ex. classroom teaching: the teacher adjusts the pace and flow of the talk based on feedback

Unsupervised Learning
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
- Ex. a video lecture: there is no feedback on errors
- Used for clustering

Classification
- Predicts categorical class labels
- Constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- It is a kind of supervised learning

Classification: A Two Step Process


Model construction: describing a set of predetermined classes

- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Classification: A Two Step Process


Model usage: for classifying future or unknown objects

- Estimate the accuracy of the model
- If the accuracy is acceptable, use the model to classify unknown samples

Classification can be achieved with numerous techniques:
- Hypothesis Learning
  - Find-S
  - Candidate Elimination
- Decision Tree Induction
- Neural Networks
- Bayesian Network

Hypothesis Learning
- Inferring a boolean-valued function from training examples of its input and output
- The task is a search through a large space of hypotheses defined by the hypothesis representation
- Each hypothesis consists of a conjunction of constraints on the instance attributes

Hypothesis Representation
For each attribute, the hypothesis will either
- indicate by a "?" that any value is acceptable for this attribute,
- specify a single required value (e.g., Warm) for the attribute, or
- indicate by a "∅" that no value is acceptable
Ex.
<Sunny, Warm, Normal, Strong, Warm, Same>
<Sunny, Warm, ?, ?, Warm, Same>
<?, ?, ?, ?, ?, ?> - Most general hypothesis
<∅, ∅, ∅, ∅, ∅, ∅> - Most specific hypothesis

Training Data


General-to-specific ordering
Hypotheses can be ordered by generality, from the most general to the most specific


Find-S
Finds the maximally specific hypothesis consistent with the training data.
The algorithm:
- Start with h initialized to <∅, ∅, ∅, ∅, ∅, ∅>
- For the first positive training example, generalize h to cover the example: h = <Sunny, Warm, Normal, Strong, Warm, Same>
- After the next positive example, h becomes <Sunny, Warm, ?, Strong, Warm, Same>
- The 3rd example is negative, so ignore it
- The 4th example (positive) leads to further generalization of h: <Sunny, Warm, ?, Strong, ?, ?>
At each stage, h is the most specific hypothesis consistent with the training examples seen so far.
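The trace above can be reproduced with a short sketch. The four training examples below are an assumption: they are the EnjoySport examples from Mitchell's Machine Learning, which agree with the hypotheses shown in the trace. `None` stands in for the most specific hypothesis <∅, ..., ∅>.

```python
# Assumed training examples (Mitchell's EnjoySport data); True marks a positive example.
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples):
    """Return the maximally specific conjunctive hypothesis covering all positives."""
    h = None  # None represents <∅, ..., ∅>, which covers no instance
    for x, positive in examples:
        if not positive:
            continue  # Find-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: copy it verbatim
        else:
            # keep constraints that still match, relax the others to "?"
            h = [hc if hc == xc else "?" for hc, xc in zip(h, x)]
    return h


hypothesis = find_s(EXAMPLES)
print(hypothesis)
```

Because negative examples are skipped entirely, the sketch cannot detect inconsistent training data, which is one of the issues raised on the next slide.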




Issues with Find-S

- Has the learner converged to the correct hypothesis?
- Why prefer the most specific hypothesis?
- Are the training examples consistent?
- What if there are multiple maximally specific hypotheses consistent with the training data?


Candidate-Elimination algorithm
- The idea is to output a description of the set of all hypotheses consistent with the training examples
- This set of all hypotheses consistent with the observed training examples is called the Version Space
- The version space can be represented by its most general and most specific members
- These members form the general and specific boundary sets that delimit the version space within the partially ordered hypothesis space
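A minimal sketch of how the two boundary sets can be maintained for conjunctive hypotheses. The training examples and attribute domains are assumptions taken from Mitchell's EnjoySport task; `None` stands for ∅, and the function names are illustrative only.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hc == "?" or hc == xc for hc, xc in zip(h, x))


def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    if None in h2:  # h2 covers nothing, so every hypothesis is at least as general
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]  # specific boundary
    G = [tuple(["?"] * n)]   # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if not covers(s, x):
                    # minimal generalization of s that covers x
                    s = tuple(xc if sc is None else (sc if sc == xc else "?")
                              for sc, xc in zip(s, x))
                if any(more_general(g, s) for g in G):
                    new_S.append(s)
            S = list(dict.fromkeys(new_S))
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimal specializations of g that exclude x
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for v in values:
                        spec = g[:i] + (v,) + g[i + 1:]
                        if v != x[i] and any(more_general(spec, s) for s in S):
                            new_G.append(spec)
            new_G = list(dict.fromkeys(new_G))
            # keep only the maximally general members
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G


# Assumed domains and examples (EnjoySport).
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(EXAMPLES, DOMAINS)
print("S =", S)
print("G =", G)
```

Every hypothesis lying between S and G in the general-to-specific ordering is consistent with the examples, so the pair (S, G) is a compact description of the whole version space.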

Version Space representation


Candidate-Elimination algorithm(1/2)


Candidate-Elimination algorithm(2/2)


Candidate-Elimination algorithm




Final Version Space


Issues with Candidate-Elimination

- Practical applications of the Candidate-Elimination and Find-S algorithms are limited by the fact that both perform poorly when given noisy training data
- They allow only conjunctive hypotheses
- With Candidate-Elimination, if the number of training examples is small, the result may be a partially learned hypothesis


Decision Tree Induction with ID3

- An internal node is a test on an attribute
- A branch represents an outcome of the test, e.g., Color = red
- A leaf node represents a class label or class label distribution
- At each node, one attribute is chosen to split the training examples into distinct classes as much as possible
- A new case is classified by following a matching path to a leaf node


Weather Data: Play or not Play?

Outlook    Temperature  Humidity  Windy   Play?
sunny      hot          high      weak    No
sunny      hot          high      strong  No
overcast   hot          high      weak    Yes
rain       mild         high      weak    Yes
rain       cool         normal    weak    Yes
rain       cool         normal    strong  No
overcast   cool         normal    strong  Yes
sunny      mild         high      weak    No
sunny      cool         normal    weak    Yes
rain       mild         normal    weak    Yes
sunny      mild         normal    strong  Yes
overcast   mild         high      strong  Yes
overcast   hot          normal    weak    Yes
rain       mild         high      strong  No

Final Tree for Play?

Outlook
  sunny: Humidity
    high: No
    normal: Yes
  overcast: Yes
  rain: Windy
    true (strong): No
    false (weak): Yes

Classification Rules
Another way of representing the tree. Ex.
If Outlook = sunny and Humidity = high then Play = No
If Outlook = sunny and Humidity = normal then Play = Yes
If Outlook = overcast then Play = Yes
If Outlook = rain and Windy = strong then Play = No
If Outlook = rain and Windy = weak then Play = Yes
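The rule set maps directly onto nested conditionals. The sketch below is illustrative: the function name `classify` is hypothetical, and the attribute values follow the spelling of the weather table ("rain", "strong", "weak").

```python
def classify(outlook, humidity, windy):
    """Apply the five classification rules read off the decision tree."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    if outlook == "rain":
        return "No" if windy == "strong" else "Yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


print(classify("sunny", "normal", "strong"))
print(classify("rain", "high", "strong"))
```

Temperature does not appear as a parameter because no rule tests it: the tree never needed that attribute to separate the training examples.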


Building Decision Tree

Top-down tree construction
At start, all training examples are at the root. Partition the examples recursively by choosing one attribute each time.

Bottom-up tree pruning

Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.


Basic algorithm (a greedy algorithm)
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (continuous-valued attributes are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
- There are no samples left

Building Decision Tree

Choosing the Splitting Attribute

At each node, the available attributes are evaluated on how well they separate the classes of the training examples. A goodness function is used for this purpose. Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index


A criterion for attribute selection

Which is the best attribute?
The one which will result in the smallest tree Heuristic: choose the attribute that produces the purest nodes

Popular impurity criterion: information gain

Information gain increases with the average purity of the subsets that an attribute produces

Strategy: choose attribute that results in greatest information gain


Information Gain

Computing information
Information gain measures how well a given attribute separates the training examples according to their target classification, i.e., how well the attribute alone can classify the training samples.

Entropy characterizes the (im)purity of an arbitrary collection of examples. Given a collection S of samples with n classes, where p_i is the probability of occurrence of the i-th class:

$$\mathrm{entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$


Consider the training set S containing 14 samples. Its two classes, yes and no, contain 9 and 5 samples respectively. The entropy of S relative to this Boolean classification is

$$\mathrm{entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$


For a Boolean classification:
- Entropy is 0 if all members belong to the same class
- Entropy is 1 if both classes occur in equal numbers
- Entropy is between 0 and 1 if the classes occur in unequal numbers
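The entropy values above can be checked numerically. This is a small sketch; the helper name `entropy` and its list-of-counts interface are choices made for the example.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a collection described by its per-class counts."""
    total = sum(counts)
    # classes with zero members contribute nothing (0 * log 0 is taken as 0)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


print(round(entropy([9, 5]), 3))  # 9 yes, 5 no
print(entropy([7, 7]))            # both classes equally frequent
print(abs(entropy([14, 0])))      # all members in one class
```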


Information Gain
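It is the expected reduction in entropy caused by partitioning the collection according to an attribute:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

For example, Windy splits S into 8 weak and 6 strong samples:

$$\mathrm{Gain}(S, \mathrm{Windy}) = \mathrm{Entropy}(S) - \frac{8}{14}\mathrm{Entropy}(S_{weak}) - \frac{6}{14}\mathrm{Entropy}(S_{strong}) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.0) = 0.048$$

On similar grounds:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Temperature) = 0.029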
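The gains can be recomputed from the 14 weather tuples. The sketch below encodes the table as text (lowercased) and applies the gain formula; the names `ROWS`, `entropy` and `gain` are illustrative. Small differences in the third decimal place come from rounding.

```python
from collections import Counter
from math import log2

COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
TABLE = """\
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in TABLE.splitlines()]


def entropy(rows):
    """Entropy of the class label (play) over the given rows."""
    counts = Counter(r["play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    """Expected reduction in entropy from partitioning rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


for attr in COLUMNS[:-1]:
    print(attr, round(gain(ROWS, attr), 3))
```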

Which attribute to select?


Decision Tree
- The information gain of the Outlook attribute is the highest among all attributes
- It provides the best prediction of the target class
- Select Outlook as the test attribute
- Create a branch below the node for each possible value of Outlook


Decision Tree


Continuing to split

For the Outlook = sunny branch:
gain("Temperature") = 0.571
gain("Humidity") = 0.971
gain("Windy") = 0.020
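Repeating the "pick the highest-gain attribute, then split" step on every branch gives the whole tree. The following is a minimal sketch of that recursion under the stopping conditions listed earlier, not Quinlan's original implementation; the nested-dict tree representation and all names are choices made for the example.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]
TABLE = """\
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in TABLE.splitlines()]


def entropy(rows):
    counts = Counter(r["play"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


def id3(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = Counter(r["play"] for r in rows)
    # stop: node is pure, or no attributes remain (majority vote)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    # branches are created only for values present in rows,
    # so the "no samples left" case does not arise in this sketch
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}


tree = id3(ROWS, ATTRS)
print(tree)
```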


Decision Tree


Issues with ID3

- Overfit tree
- Incorporating continuous-valued attributes
- Alternative attribute selection measure
- Handling missing values


Tree Overfitting
Consider the error of tree T over
- the training data: error_train(T)
- the entire data distribution: error_data(T)
A tree T overfits the training data if there is an alternative tree T' such that

$$error_{train}(T) < error_{train}(T') \quad \text{and} \quad error_{data}(T) > error_{data}(T')$$


Tree Overfitting
As the tree grows more complex, its error rate on unobserved samples tends to increase beyond a certain point.
Consider adding an erroneous tuple to the set:
Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No
What is the effect of this 15th tuple on the tree?

The new, more complex tree fits the training data better than the existing tree, but may not perform as well over the entire distribution.


Avoid Overfitting
Approaches fall broadly into two classes:
- Prepruning: stop growing the tree earlier, when further splitting is not statistically significant
  - Less successful, as it is difficult to estimate precisely when to stop growing the tree
- Post-pruning: allow the tree to overfit the data, and then prune it
  - Reduced Error Pruning
  - Rule Post-Pruning (C4.5)


Continuous Valued Attribute

- ID3 requires that the test attributes and the target attribute be discrete valued
- A continuous-valued attribute should be converted into a discrete-valued attribute by partitioning its values into discrete intervals
- Partitioning requires splitting the range of values
- Find the best possible splitting points using
  - Information Gain
  - χ² analysis
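One common way to pick a split point by information gain is to try the midpoint between each pair of adjacent sorted values and keep the threshold with the highest gain. The sketch below assumes a two-way split (value <= t versus value > t); the temperature readings and labels are illustrative values, not part of the weather table.

```python
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))


def best_threshold(values, labels):
    """Return (gain, threshold) of the best binary split of a numeric attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy([c for _, c in pairs])
    best = (0.0, None)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # no threshold fits between equal values
        t = (a + b) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        g = (base
             - len(left) / len(pairs) * entropy(left)
             - len(right) / len(pairs) * entropy(right))
        if g > best[0]:
            best = (g, t)
    return best


# Illustrative data (assumed): numeric temperature readings with play labels.
temperature = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
print(best_threshold(temperature, play))
```

The resulting test (temperature <= t) can then be treated as a Boolean attribute and compared with the other attributes in the usual way.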


Alternative attribute selection measure

- The information gain measure favors attributes with many values over those with few values
- Consider adding an attribute Date, with a different value for every tuple, to the training data
- As the Date field alone is capable of classifying the training tuples, information gain will select it, and the split will produce pure partitions
  - This may result in overfitting (selection of an attribute that is non-optimal for prediction)
- The result is a tree with a single node and 14 branches, each leading to a leaf node, which would fare poorly on subsequent examples
- This is not a good classifier

Alternative attribute selection measure

Use another measure, such as gain ratio, to select the attribute

Problem with gain ratio: it may overcompensate

May choose an attribute just because its split information is very low Standard fix:
First, only consider attributes with greater than average information gain Then, compare them on gain ratio


Gain ratios for weather data

Attribute    Info   Gain                 Split info             Gain ratio
Outlook      0.693  0.940-0.693 = 0.247  info([5,4,5]) = 1.577  0.247/1.577 = 0.156
Temperature  0.911  0.940-0.911 = 0.029  info([4,6,4]) = 1.557  0.029/1.557 = 0.019
Humidity     0.788  0.940-0.788 = 0.152  info([7,7]) = 1.000    0.152/1.000 = 0.152
Windy        0.892  0.940-0.892 = 0.048  info([8,6]) = 0.985    0.048/0.985 = 0.049
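Gain ratio divides the information gain by the split information, which is the entropy of the branch sizes themselves. The sketch below recomputes both quantities from per-branch class counts taken from the weather data; the `[yes, no]` count layout and the function names are choices made for the example.

```python
from math import log2


def info(counts):
    """Entropy in bits of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(branches):
    """branches: one [yes, no] count pair per attribute value.

    Returns (gain, split_info, gain_ratio).
    """
    sizes = [sum(b) for b in branches]
    total = sum(sizes)
    parent = [sum(col) for col in zip(*branches)]  # class counts before the split
    remainder = sum(s / total * info(b) for s, b in zip(sizes, branches))
    gain = info(parent) - remainder
    split_info = info(sizes)
    return gain, split_info, gain / split_info


# [yes, no] counts per branch, read off the weather table.
WEATHER = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],      # sunny, overcast, rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool
    "Humidity": [[3, 4], [6, 1]],             # high, normal
    "Windy": [[6, 2], [3, 3]],                # weak, strong
}
for name, branches in WEATHER.items():
    print(name, [round(v, 3) for v in gain_ratio(branches)])
```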


Handling missing values

- Available data may be missing values for some attributes
- It is common to estimate the missing attribute value based on other examples for which this attribute has a known value
Strategies
- assign it the value that is most common among training examples
- assign it the most common value among examples at node n that have the classification c(x)


- The algorithm for top-down induction of decision trees (ID3) was developed by Ross Quinlan
  - Gain ratio is just one modification of this basic algorithm
  - It led to the development of C4.5, which can deal with numeric attributes, missing values, and noisy data
- Similar approach: CART
- There are many other attribute selection criteria (but almost no difference in accuracy of result)


- Top-Down Decision Tree Construction
- Choosing the Splitting Attribute
- Information Gain is biased towards attributes with a large number of values
- Gain Ratio takes the number and size of branches into account when choosing an attribute


Classification by Backpropagation
- Backpropagation: a neural network learning algorithm
- A neural network: a set of connected input/output units where each connection has a weight associated with it
- During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
- Also referred to as connectionist learning due to the connections between units
- Handwritten character recognition and medical diagnosis are well-known applications of neural networks

Neural Network Learning

Weaknesses
- Long training time
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"
- Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network

Strengths
- High tolerance to noisy data
- Ability to classify untrained patterns
- Well-suited for continuous-valued inputs and outputs
- Successful on a wide array of real-world data
- Algorithms are inherently parallel
- Techniques have recently been developed for the extraction of rules from trained neural networks


Multilayer Feed-Forward Neural Network

- It consists of one input layer, one or more hidden layers, and an output layer
- Each layer consists of units
- The input layer corresponds to the attributes in the training data
- The inputs pass through the input layer, are weighted, and are fed to the second layer, known as the hidden layer
- The outputs of a hidden layer are forwarded to the next layer
- The output layer emits the network's prediction
- There are no clear rules for designing the topology, such as the number of hidden layers or the initial weights


Multilayer Feed-Forward Neural Network

A two-layer neural network


Defining a Network Topology

- First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
- Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
- One input unit per domain value, each initialized to 0
- Output: for classification with more than two classes, one output unit per class is used
- If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

- Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
- For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
Steps
- Initialize weights (to small random numbers) and biases in the network
- Propagate the inputs forward (by applying the activation function)
- Backpropagate the error (by updating weights and biases)
- Check the terminating condition (when error is very small, etc.)

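Propagate the inputs forward
- Inputs are applied to the input layer and pass through it unchanged
- The net input and output of each unit in the hidden layers are calculated. For unit j, with O_i the output of unit i in the previous layer, w_ij the connection weight and θ_j the bias:

$$I_j = \sum_i w_{ij} O_i + \theta_j$$

- Given the net input, the output is computed as

$$O_j = \frac{1}{1 + e^{-I_j}}$$

- This process is repeated until the output of the output layer is computed, which gives the network's prediction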
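The forward pass can be sketched in a few lines. The 3-2-1 topology, the weights, the biases and the input tuple below are illustrative values chosen for the example, and the function names are hypothetical.

```python
from math import exp


def sigmoid(net_input):
    """Logistic activation: O_j = 1 / (1 + e^(-I_j))."""
    return 1.0 / (1.0 + exp(-net_input))


def unit_output(inputs, weights, bias):
    """Net input I_j = sum_i w_ij * O_i + theta_j, squashed by the sigmoid."""
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias
    return sigmoid(net_input)


def forward(x, hidden_units, output_units):
    """Each layer is a list of (weights, bias) pairs, one pair per unit."""
    hidden = [unit_output(x, w, b) for w, b in hidden_units]
    outputs = [unit_output(hidden, w, b) for w, b in output_units]
    return hidden, outputs


# Illustrative network: 3 inputs, 2 hidden units, 1 output unit.
HIDDEN = [([0.2, 0.4, -0.5], -0.4), ([-0.3, 0.1, 0.2], 0.2)]
OUTPUT = [([-0.3, -0.2], 0.1)]

hidden, outputs = forward((1, 0, 1), HIDDEN, OUTPUT)
print([round(h, 3) for h in hidden], [round(o, 3) for o in outputs])
```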

Backpropagate the error
- The error is backpropagated to update the weights and biases to reflect the error of the network's prediction
- For a unit j in the output layer, with T_j the known target value:

$$Err_j = O_j (1 - O_j)(T_j - O_j)$$

- For a hidden layer unit j, the error is computed from the errors of the units k in the next layer:

$$Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$$

- The weights and biases are updated:

$$\Delta w_{ij} = (l)\, Err_j \, O_i, \qquad w_{ij} = w_{ij} + \Delta w_{ij}$$

$$\Delta \theta_j = (l)\, Err_j, \qquad \theta_j = \theta_j + \Delta \theta_j$$

where l is the learning rate (a value between 0 and 1)

- Case updating: the weights and biases are updated after each tuple is presented
- Epoch updating: the updates are accumulated in variables, and the actual weights and biases are updated after all tuples have been presented


Terminating conditions:
- All Δw_ij in the previous iteration were less than a threshold
- The percentage of tuples misclassified in the previous iteration was less than a threshold
- A prespecified number of iterations has expired
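Putting the steps together, a training loop with case updating might look like the sketch below. It assumes a single hidden layer and a single output unit, and it uses two of the terminating conditions (weight changes below a threshold, or an epoch limit). The topology, learning rate, seed and the AND-gate samples are choices made for the example.

```python
import random
from math import exp


def sigmoid(net_input):
    return 1.0 / (1.0 + exp(-net_input))


def forward(net, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out


def sse(net, samples):
    """Sum of squared errors of the network over the samples."""
    return sum((t - forward(net, x)[1]) ** 2 for x, t in samples)


def train(samples, n_hidden=2, rate=0.5, max_epochs=2000, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # initialize weights and biases to small random numbers
    net = {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": rng.uniform(-0.5, 0.5),
    }
    for _ in range(max_epochs):
        largest = 0.0  # largest weight change seen in this epoch
        for x, target in samples:
            hidden, out = forward(net, x)
            # error terms, computed before any weight is changed
            err_out = out * (1 - out) * (target - out)
            err_hidden = [h * (1 - h) * err_out * w
                          for h, w in zip(hidden, net["w_out"])]
            # case updating: apply the changes right after this tuple
            for j, h in enumerate(hidden):
                delta = rate * err_out * h
                net["w_out"][j] += delta
                largest = max(largest, abs(delta))
            net["b_out"] += rate * err_out
            for j, err in enumerate(err_hidden):
                for i, xi in enumerate(x):
                    delta = rate * err * xi
                    net["w_hidden"][j][i] += delta
                    largest = max(largest, abs(delta))
                net["b_hidden"][j] += rate * err
        if largest < threshold:
            break  # all weight changes were below the threshold
    return net


AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
untrained = train(AND_GATE, max_epochs=0)  # same seed, no updates applied
trained = train(AND_GATE)
print("SSE before:", sse(untrained, AND_GATE), "after:", sse(trained, AND_GATE))
```

Switching to epoch updating would mean accumulating the deltas in separate variables inside the inner loop and applying them once per epoch.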


Example Problem
AND gate

XOR Gate


Bayesian Classification: Why?

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' Theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

- Let X be a data sample ("evidence"): class label is unknown
- Let H be a hypothesis that X belongs to class C
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability
  - E.g., X will buy computer, regardless of age, income, ...
- P(X): probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  - E.g., given that X will buy computer, the probability that X is 31..40, medium income
Data Mining: Concepts and Techniques

Bayesian Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$

- Informally, this can be written as: posteriori = likelihood × prior / evidence
- Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost

Towards Naïve Bayesian Classifier

- Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n)
- Suppose there are m classes C_1, C_2, ..., C_m
- Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
- This can be derived from Bayes theorem:

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$

- Since P(X) is constant for all classes, only

$$P(X|C_i)\,P(C_i)$$

needs to be maximized

Derivation of Naïve Bayes Classifier

- A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$

- This greatly reduces the computation cost: only the class distribution needs to be counted
- If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_{i,D}| (# of tuples of C_i in D)
- If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and P(x_k|C_i) is

$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
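For a continuous attribute, the class-conditional likelihood is the Gaussian density evaluated at the observed value. A small sketch follows; the mean of 38 and standard deviation of 12 for an age attribute within one class are assumed values for illustration only.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, std):
    """Density g(x, mu, sigma) of a normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)


# Assumed class statistics: customers in class C_i have age 38 +/- 12 years.
likelihood = gaussian(35, mean=38, std=12)
print(round(likelihood, 4))
```

In practice the mean and standard deviation would be estimated from the values of A_k among the training tuples of class C_i.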

Naïve Bayesian Classifier: Training Dataset

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31..40   high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31..40   low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31..40   medium   no       excellent      yes
31..40   high     yes      fair           yes
>40      medium   no       excellent      no

Class:
C1: buys_computer = yes
C2: buys_computer = no
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)


Naïve Bayesian Classifier: An Example

P(C_i):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

Compute P(X|C_i) for each class:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|C_i):
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|C_i) × P(C_i):
P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028
P(X | buys_computer = no) × P(buys_computer = no) = 0.007

Therefore, X belongs to class (buys_computer = yes)
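The worked example can be reproduced by counting directly over the 14 training tuples. This is a sketch for categorical attributes only, without any smoothing; the names `ROWS` and `score` are illustrative.

```python
# Columns: age, income, student, credit_rating, buys_computer
TABLE = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
ROWS = [line.split() for line in TABLE.splitlines()]


def score(x, label):
    """P(X|C_i) * P(C_i) under the conditional independence assumption."""
    in_class = [r for r in ROWS if r[-1] == label]
    prior = len(in_class) / len(ROWS)
    likelihood = 1.0
    for k, value in enumerate(x):
        # fraction of class tuples whose k-th attribute equals the observed value
        likelihood *= sum(r[k] == value for r in in_class) / len(in_class)
    return prior * likelihood


X = ("<=30", "medium", "yes", "fair")
scores = {label: score(X, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```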


Avoiding the 0-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be nonzero; otherwise, the predicted probability will be zero

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$

- Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
- Use the Laplacian correction (or Laplacian estimator): add 1 to each case
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts
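The correction amounts to adding one to every count before normalizing. A minimal sketch, where the function name and the `k` parameter (the amount added per value) are choices made for the example:

```python
def laplace_estimates(counts, k=1):
    """Add k to every count so that no estimated probability is zero."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (count + k) / total for value, count in counts.items()}


income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)
for value, prob in corrected.items():
    print(value, round(prob, 4))
```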

Naïve Bayesian Classifier: Comments

Advantages
- Easy to implement
- Good results obtained in most of the cases
Disadvantages
- Assumption: class conditional independence, therefore loss of accuracy
- Practically, dependencies exist among variables
  - E.g., hospital patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.)
  - Dependencies among these cannot be modeled by the Naïve Bayesian Classifier


How to deal with these dependencies? Bayesian Belief Networks

1. Data Mining: Concepts and Techniques, J. Han and M. Kamber
2. Machine Learning, Tom Mitchell
3. Knowledge Acquisition from Data Mining, Xindong Wu
4. An Implementation of ID3 Decision Tree Learning Algorithm, Wei Peng, Juhua Chen, Haiping Zhou




Thank You