
Evaluating Model Accuracy and Bias-Variance Trade-Off
Bias check:
How well do the predicted values fit the actual values? (Ideally, a low-bias model is the best model.)

Variance check (error variance): Σ (actual - predicted)^2 / n, summed over all n records


Model error check: training vs. test/validation
(Ideally, the best-fit model has low error variance on both the training data and the test/validation data.)
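
The error-variance check above can be sketched in a few lines. This is a minimal illustration, not a full evaluation procedure; the function name `mean_squared_error` and all data values are hypothetical and chosen only to show the training vs. test comparison.

```python
def mean_squared_error(actual, predicted):
    """Sum of squared (actual - predicted) differences divided by n."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, for illustration only.
train_actual = [3.0, 5.0, 7.0, 9.0]
train_predicted = [2.5, 5.5, 7.0, 8.0]
test_actual = [4.0, 6.0, 8.0]
test_predicted = [3.0, 7.5, 6.0]

train_error = mean_squared_error(train_actual, train_predicted)
test_error = mean_squared_error(test_actual, test_predicted)

# A test error far above the training error is a sign of overfitting.
print(f"train error = {train_error:.3f}, test error = {test_error:.3f}")
```

In this made-up case the test error is several times the training error, which is the pattern the check is meant to catch.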
MACHINE LEARNING ALGORITHMS
Decision Trees
Decision tree algorithms are also known as Top-Down Induction of Decision Trees (TDIDT).

Important Terminology:
Root Node : The first test/decision point
Branch : A chain of connected nodes ending in a leaf
Leaves : End-of-route nodes (final decisions/conclusions)

Famous TDIDT algorithms are:
- C5.0(Quinlan)
- CART(Breiman)
Trees are Rules expressed

Within a branch, nodes are connected with “AND”.

Branches with similar outcomes are connected with “OR”.
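
The AND/OR reading of a tree can be sketched as follows. The attributes, values, and outcomes in `RULES` are hypothetical and exist only for illustration; each entry is one root-to-leaf path.

```python
# Each rule is one root-to-leaf path: (conditions joined by AND, outcome).
# Hypothetical tree, for illustration only.
RULES = [
    ([("outlook", "sunny"), ("humidity", "high")], "no"),
    ([("outlook", "sunny"), ("humidity", "normal")], "yes"),
    ([("outlook", "overcast")], "yes"),
]


def path_matches(record, conditions):
    """A path applies only if ALL of its conditions hold (AND)."""
    return all(record.get(attr) == value for attr, value in conditions)


def classify(record, rules):
    """Return the outcome of the first path the record satisfies."""
    for conditions, outcome in rules:
        if path_matches(record, conditions):
            return outcome
    return None


def rule_text(rules, outcome):
    """Join every path that leads to the same outcome with OR."""
    paths = [
        " AND ".join(f"{attr} = {value}" for attr, value in conditions)
        for conditions, out in rules
        if out == outcome
    ]
    return " OR ".join(f"({path})" for path in paths)


print(rule_text(RULES, "yes"))
```

Calling `classify({"outlook": "sunny", "humidity": "high"}, RULES)` follows the first path and returns its outcome.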

What is the best tree?

The smallest tree (fewest nodes) with the smallest error (fewest incorrectly classified records).

Advantages:
• Fast
• Robust
• Explicable
Regression Trees

It turns out that we are collecting very similar records at each leaf, so we can use the mean or median of the records at a leaf as the predicted value for all new records that satisfy the same conditions. Such trees are called regression trees.
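
As a minimal sketch of this idea, the prediction at a leaf is just a summary statistic of the training targets that landed there. The values in `leaf_targets` are hypothetical.

```python
from statistics import mean, median

# Target values of the training records that ended up in one leaf
# (hypothetical numbers, for illustration only).
leaf_targets = [200.0, 210.0, 190.0, 400.0]

# Either statistic can serve as the prediction for any new record
# that follows the same path to this leaf.
leaf_mean = mean(leaf_targets)
leaf_median = median(leaf_targets)

print(f"mean prediction = {leaf_mean}, median prediction = {leaf_median}")
```

The median is less affected by the single large value in this leaf, which is one reason to prefer it when leaves contain outliers.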

Two questions arise for both regression and classification problems:

• Which attribute to choose (where to start)?
• Where to stop (to avoid overfitting)?
Attribute Selection Criteria
• Main principle:
- Select the attribute that partitions the learning set (dataset) into subsets that are as “PURE” as possible

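• Various measures of purity:
- Entropy
- Information Gain
- Gini Index

Note: The lower the entropy or Gini index, the higher the purity of the nodes. Information gain works the other way: it measures the reduction in entropy achieved by a split, so a higher value indicates a better split.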


We can measure the purity of a leaf/node using the methods below:

For classification trees: entropy/Gini index
For regression trees: RMSE/MAPE
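
For the regression case, one way to read "purity" is how tightly the target values in a node cluster around the node's prediction. The sketch below assumes the node predicts its own mean; the function names and data are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n


def node_scores(targets):
    """Score a node, assuming it predicts the mean of its own targets."""
    prediction = sum(targets) / len(targets)
    predicted = [prediction] * len(targets)
    return rmse(targets, predicted), mape(targets, predicted)


# Hypothetical nodes: same mean (10.5), very different spread.
tight_node = [10.0, 12.0, 11.0, 9.0]
spread_node = [2.0, 20.0, 5.0, 15.0]

tight_rmse, tight_mape = node_scores(tight_node)
spread_rmse, spread_mape = node_scores(spread_node)
print(f"tight:  RMSE = {tight_rmse:.2f}, MAPE = {tight_mape:.1f}%")
print(f"spread: RMSE = {spread_rmse:.2f}, MAPE = {spread_mape:.1f}%")
```

Both nodes predict the same value, but the tightly clustered node scores much lower on both measures, so it is the "purer" of the two.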
Two Most Popular Decision Tree Algorithms

• C5.0:
- Multi split
- Information Gain (Measure of Purity)
- Pessimistic pruning (To avoid overfitting)

• CART:
- Binary Split
- Gini Index (Measure of Purity)
- Cost Complexity Pruning (To avoid overfitting)
C5.0 Algorithm
MEASURE OF PURITY
Entropy becomes zero when the probability of any one class (Pi) equals 1, because log(1) = 0.
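
The entropy formula itself is not reproduced here, so the sketch below assumes the standard Shannon entropy with a base-2 logarithm, and takes information gain to be the parent's entropy minus the size-weighted entropy of the child nodes. The labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels at a node."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels, for illustration only.
pure_node = ["yes"] * 4
mixed_node = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]

pure_entropy = entropy(pure_node)      # one class only: Pi = 1, log2(1) = 0
mixed_entropy = entropy(mixed_node)    # two equally likely classes
gain = information_gain(mixed_node, perfect_split)
```

A split that separates the two classes completely removes all of the parent's entropy, so its gain equals the parent's entropy.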
We can keep growing the tree until we exhaust the data. But is that the right time to stop?

HOW TO MINIMIZE OVERFITTING?

Why prune? To avoid overfitting.
REDUCED ERROR PRUNING/PESSIMISTIC PRUNING
Here ‘f’ is the observed error rate at each leaf (# bad / total): 2/6, 1/2, 2/6.
The ‘e’ value (the estimated error) comes from the formula.

The weighted sum of errors for the lowest layer is calculated as follows:
(6/14)*0.47 + (2/14)*0.72 + (6/14)*0.47 = 0.51

This estimate (0.51) is higher than that of the parent node, hence prune the complete lower layer.
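
The formula for ‘e’ is not reproduced here. The sketch below assumes it is the upper confidence bound on the error rate used in C4.5-style pessimistic pruning, with z = 0.69 (a 25% confidence level); under that assumption it reproduces the 0.47, 0.72, and 0.51 figures quoted above. The function name `pessimistic_error` is hypothetical.

```python
import math


def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound on the true error rate.

    f: observed error rate at the node (# bad / total)
    n: number of records at the node
    z: normal deviate for the confidence level (0.69 is an assumption)
    """
    z2 = z * z
    numerator = f + z2 / (2 * n) + z * math.sqrt(f / n - f * f / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)


# (bad, total) for the three leaves of the lowest layer, as given above.
leaves = [(2, 6), (1, 2), (2, 6)]
total_records = sum(total for _, total in leaves)

estimates = [pessimistic_error(bad / total, total) for bad, total in leaves]

# Weight each leaf's estimate by its share of the records.
combined = sum(total / total_records * e
               for (_, total), e in zip(leaves, estimates))

print([round(e, 2) for e in estimates], round(combined, 2))
```

The pruning decision then compares `combined` with the same estimate computed for the parent node on its own counts: if the subtree's combined estimate is higher, the layer is replaced by a single leaf.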
CART Algorithm
MEASURE OF PURITY
** Here ‘S’ is the total number of records
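
The Gini formula is not reproduced here, so the sketch below assumes the standard definition: one minus the sum of squared class proportions, where each proportion is a class count divided by S. Because CART uses binary splits, a candidate split is scored by the size-weighted Gini of its two children. The labels are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """1 - sum((class count / S)^2), where S is the total number of records."""
    s = len(labels)
    return 1.0 - sum((count / s) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini index of the two children of a binary split."""
    s = len(left) + len(right)
    return len(left) / s * gini_index(left) + len(right) / s * gini_index(right)


# Hypothetical labels, for illustration only.
pure_node = ["yes"] * 4
mixed_node = ["yes", "yes", "no", "no"]

pure_gini = gini_index(pure_node)      # a single class gives 0
mixed_gini = gini_index(mixed_node)    # two equally likely classes give 0.5

# A split that separates the classes perfectly scores 0, the purest result.
perfect_split = split_gini(["yes", "yes"], ["no", "no"])
```

Among the candidate binary splits, the one with the lowest weighted Gini index would be chosen.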