
Rule-Based Classification

Lecture 21/10-09-09 (no class because of placements)
Lecture 22/11-09-09 (no class)
Lecture 23/12-09-09 (no class)
Lecture 24/14-09-09

Building Classification Rules
• Direct Method:
  – Extract rules directly from data.
  – e.g., RIPPER, CN2, Holte's 1R
• Indirect Method:
  – Extract rules from other classification models (e.g., decision trees, neural networks).
  – e.g., C4.5rules

Direct Method: Sequential Covering Algorithm
• Extracts rules directly from data.
• When there are several classes, rules are extracted one class at a time.
• The criterion for selecting which class to consider first depends on factors such as class prevalence.
Algorithm
1: Let E - training Records, A – set of attr. value pairs, {(Aj,vj)}.
2: Let Yo be an ordered set of classes {y1,y2,y3-------- ,yk}.
3: Let R = { } be initial
∈ rule (decision) list.
4: for each class y Yo- {yk} do
5: while stopping condition is not met do
6: r Learn-One-Rule (E, A, y).
Remove training records from E that are covered by r.
Add r to the bottom of the rule list: R R V r.
end while
end for
Insert the default rule, { } yk, to the bottom of the rule list R
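As a concrete illustration, the following minimal Python sketch mirrors the loop above; learn_one_rule, stop, and the rule object's covers() method are hypothetical helpers introduced only for this sketch, not part of any library.

```python
def sequential_covering(records, ordered_classes, learn_one_rule, stop):
    """Sketch of the sequential covering loop above.

    records: training set E; ordered_classes: Yo = [y1, ..., yk];
    learn_one_rule(records, y) and stop(records, y) are assumed helpers,
    and each returned rule r must offer a covers(record) method.
    """
    rule_list = []                                  # R = { }
    for y in ordered_classes[:-1]:                  # every class except yk
        while not stop(records, y):
            r = learn_one_rule(records, y)
            records = [e for e in records if not r.covers(e)]
            rule_list.append(r)                     # R = R v r (append at bottom)
    rule_list.append(("default", ordered_classes[-1]))  # { } -> yk
    return rule_list
```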

Sequential Covering Algorithm
(in short for your quick reference)
1. Start from an empty rule set.
2. Extract a rule using the Learn-One-Rule
function.
3. Remove training records covered by the
rule.
4. Repeat Steps (2) and (3) until the stopping
criterion is met.

Learn-One-Rule function
• Objective: extract a rule that covers as many positive examples as possible, and few or no negative examples, in the training dataset.
• Computationally expensive because of the exponential size of the search space.
• Generates an initial rule and keeps growing (refining) it until the stopping criterion is met.
Example of Sequential Covering
[Figure: (i) the original data; (ii) Step 1; (iii) Step 2: rule R1 is found; (iv) Step 3: after the records covered by R1 are removed, rule R2 is found.]
Aspects of Sequential Covering
• Rule Growing Strategy

• Instance Elimination

• Rule Evaluation

• Stopping Criterion

• Rule Pruning

Rule Growing
• Two common strategies:
  1. General-to-specific
  2. Specific-to-general
General-to-specific
• The initial rule r: { } → y has poor quality, as it covers all examples in the training set.
• Conjuncts are subsequently added to improve the quality of the rule.
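A possible Python sketch of this greedy general-to-specific growth; the candidate conjunct list and the evaluate quality measure (for instance one of the rule-evaluation metrics later in this lecture) are assumptions for illustration.

```python
def grow_rule(records, target_class, candidate_conjuncts, evaluate):
    """Grow one rule from { } -> target_class by greedily adding conjuncts.

    candidate_conjuncts: list of (attribute, value) pairs;
    evaluate(antecedent, records, target_class) is an assumed quality measure.
    """
    antecedent = []                                 # { } -> y: covers everything
    best = evaluate(antecedent, records, target_class)
    while True:
        gains = [(evaluate(antecedent + [c], records, target_class), c)
                 for c in candidate_conjuncts if c not in antecedent]
        if not gains:
            break
        score, conjunct = max(gains)
        if score <= best:                           # no conjunct improves the rule
            break
        antecedent.append(conjunct)                 # refine the rule
        best = score
    return antecedent, target_class
```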

Specific-to-general
• Two specific rules,
  (Refund=No, Status=Single, Income=85K) → (Class=Yes)
  (Refund=No, Status=Single, Income=90K) → (Class=Yes)
  can be generalized by dropping the Income conjunct:
  (Refund=No, Status=Single) → (Class=Yes)
Specific-to-general
• One of the positive examples is chosen randomly as the initial seed:
  Body temp=warm-blooded, Skin cover=hair, gives birth=yes, aquatic creature=no, aerial creature=no, has legs=yes, hibernates=no ⇒ Mammals
• One of the conjuncts is then removed so that the rule can cover more positive examples, e.g.:
  Body temp=warm-blooded, Skin cover=hair, gives birth=yes, aquatic creature=no, aerial creature=no, has legs=yes ⇒ Mammals
  Skin cover=hair, gives birth=yes, aquatic creature=no, aerial creature=no, has legs=yes, hibernates=no ⇒ Mammals
Rule Evaluation
• Suppose a training set contains 60 positive and 100 negative examples.
• Consider two rules:
  – R1: covers 50 positive examples and 5 negative examples
  – R2: covers 2 positive examples and 0 negative examples
• The accuracy of R1 is 90.9% and that of R2 is 100%. Still, R1 is the better rule because of its coverage; the measures on the next slide make this trade-off explicit.
Rule Evaluation
• Metrics, for a rule that covers n instances, nc of which it classifies correctly (k is the number of classes, p the prior probability of the predicted class):
  – Accuracy = nc / n
  – Likelihood ratio statistic: R = 2 Σ_{i=1}^{k} f_i log(f_i / e_i), where f_i and e_i are the observed and expected class frequencies among the covered instances
  – Laplace = (nc + 1) / (n + k)
  – M-estimate = (nc + kp) / (n + k); more generally, (nc + pm) / (n + m) for a chosen parameter m
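These measures translate directly into code. A sketch in Python, using the notation above; the final line re-checks rules R1 and R2 from the previous slide.

```python
import math

def accuracy(nc, n):
    return nc / n                        # fraction of covered records that are correct

def laplace(nc, n, k):
    return (nc + 1) / (n + k)            # smoothed accuracy: low-coverage rules score lower

def m_estimate(nc, n, m, p):
    return (nc + p * m) / (n + m)        # general form; m = k recovers (nc + kp)/(n + k)

def likelihood_ratio(f, e):
    """R = 2 * sum_i f_i * log(f_i / e_i); f and e are the observed and
    expected class counts among the records covered by the rule."""
    return 2 * sum(fi * math.log(fi / ei) for fi, ei in zip(f, e) if fi > 0)

# R1 (50 of 55 correct) vs R2 (2 of 2 correct): Laplace prefers R1
print(laplace(nc=50, n=55, k=2), laplace(nc=2, n=2, k=2))  # ~0.895 vs 0.75
```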
FOIL's Information Gain (Rule Evaluation contd.)
• Rule r: A → + covers p0 positive examples and n0 negative examples.
• Suppose we add a new conjunct B; the extended rule becomes r′: A ∧ B → +, covering p1 positive examples and n1 negative examples.
• Then FOIL's information gain = p1 × ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )
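A sketch of the formula in Python; the example counts are made up for illustration (they are not from the exercises below).

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain when rule A -> + (covering p0 positives,
    n0 negatives) is extended to A ^ B -> + (covering p1, n1)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: refining a rule from 100+/90- down to 80+/10-
print(foil_gain(p0=100, n0=90, p1=80, n1=10))   # ~60.5, so the conjunct helps
```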
• Q2. Consider two rules:
  – R1: A → C
  – R2: A ∧ B → C
  Suppose R1 covers 350 positive examples and 150 negative examples, while R2 covers 300 positive examples and 50 negative examples. Compute FOIL's information gain for rule R2 with respect to R1.
• Q4. Page no. 317
Stopping Criterion and Rule Pruning
• Stopping criterion
  – Compute the gain.
  – If the gain is not significant, discard the new rule.
• Rule Pruning (see the sketch after this list)
  – Remove one of the conjuncts in the rule.
  – Compare the error rate on a validation set before and after pruning.
  – If the error improves, prune the conjunct.
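A minimal sketch of this pruning loop, assuming a rule is represented as a list of conjuncts and error_rate(rule, data) is a hypothetical helper that measures its error on the validation set.

```python
def prune_rule(rule, validation_set, error_rate):
    """Greedily remove conjuncts while the validation error improves."""
    best = error_rate(rule, validation_set)
    improved = True
    while improved and len(rule) > 1:
        improved = False
        for conjunct in rule:
            candidate = [c for c in rule if c is not conjunct]
            err = error_rate(candidate, validation_set)
            if err < best:                 # pruning this conjunct helped
                rule, best, improved = candidate, err, True
                break
    return rule
```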

Instance Elimination
[Figure: rules R1, R2, R3 drawn over a training set of class = + and class = − examples.]
• Why do we need to eliminate instances?
  – Otherwise, the next rule is identical to the previous rule.
• Why do we remove both positive and negative instances?
  – To ensure that the next rule is different.
  – To prevent underestimating the accuracy of the rule.
Indirect Methods for Rule-Based Classifiers and Instance-Based Classifiers
Lecture 26/17-09-09

Indirect Methods (generating a rule set from a decision tree)
[Figure: decision tree with root P; P=No leads to Q (No → −, Yes → +); P=Yes leads to R (No → +, Yes → Q with No → −, Yes → +).]
Rule Set:
r1: (P=No, Q=No) ⇒ −
r2: (P=No, Q=Yes) ⇒ +
r3: (P=Yes, R=No) ⇒ +
r4: (P=Yes, R=Yes, Q=No) ⇒ −
r5: (P=Yes, R=Yes, Q=Yes) ⇒ +
Consider r2, r3, r5: the class label is always predicted as + when Q=Yes.
So we can simplify to r2′: (Q=Yes) ⇒ +, keeping r3: (P=Yes) ∧ (R=No) ⇒ +.
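The extraction step itself is a walk over all root-to-leaf paths. A Python sketch, using a nested-dict tree representation assumed just for this example.

```python
def tree_to_rules(node, path=()):
    """Yield one (conjunct-list, label) rule per root-to-leaf path.

    A node is either a class label (leaf) or a dict of the form
    {"attr": name, "branches": {value: subtree, ...}} -- an assumed
    representation chosen just for this sketch.
    """
    if not isinstance(node, dict):                  # leaf reached
        yield list(path), node
        return
    for value, subtree in node["branches"].items():
        yield from tree_to_rules(subtree, path + ((node["attr"], value),))

# The tree from this slide:
tree = {"attr": "P", "branches": {
    "No":  {"attr": "Q", "branches": {"No": "-", "Yes": "+"}},
    "Yes": {"attr": "R", "branches": {
        "No": "+",
        "Yes": {"attr": "Q", "branches": {"No": "-", "Yes": "+"}}}}}}

for conjuncts, label in tree_to_rules(tree):
    print(conjuncts, "=>", label)   # reproduces r1 through r5
```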
Classification rules extracted from a decision tree
[Figure: decision tree — Give Birth? (Yes → Mammals; No → Live in Water? (Yes → Fishes; Sometimes → Amphibians; No → Can Fly? (Yes → Birds; No → Reptiles))).]
C4.5rules:
(Give Birth=No, Live in Water=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Live in Water=No, Can Fly=No) → Reptiles
( ) → Amphibians
Advantages of Rule-Based
Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees

Instance-Based Classifiers

Eager learners vs. lazy learners
• Eager learners
  – Decision tree and rule-based classifiers are examples of eager learners.
  – They are designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
• Lazy learners
  – They delay the process of modeling the training data until an unseen instance needs to be classified.
  – Instance-based classifiers belong to this class.
  – They memorize the entire training data and perform classification directly from it when a test instance arrives.

Instance-Based Classifiers
[Figure: a set of stored cases with attributes Atr1 … AtrN and a Class label, and an unseen case with attributes Atr1 … AtrN to be classified.]
• Store the training records.
• Use the training records to predict the class label of unseen cases.
Instance-Based Classifiers
• Examples:
  – Rote-learner (classifier)
    • Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
    • Its drawback is that some test records may not be classified at all because they do not match any instance in the training data.
    • Solution?
  – Nearest neighbor
    • Uses the k "closest" points (nearest neighbors) to perform classification.
Nearest Neighbor Classifiers
• Basic idea: the intuition behind the nearest-neighbor classifier is captured by the saying:
• "If it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck."
[Figure: for a test record, compute the distance to the training records and choose the k "nearest" ones.]
• A nearest-neighbor classifier represents each instance as a data point in a d-dimensional space, where d is the number of attributes.
Nearest-Neighbor Classifiers
[Figure: an unknown record plotted among the stored training records.]
● Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
● To classify an unknown record:
  – Compute its distance to the training records.
  – Identify the k nearest neighbors.
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a point x.]
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
• Compute the distance between two points, e.g. with the Euclidean distance:
  d(p, q) = √( Σ_i (p_i − q_i)² )
• Determine the class from the nearest-neighbor list:
  – Take the majority vote of the class labels among the k nearest neighbors.
Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points.
  – If k is too large, the neighborhood may include points from other classes.
[Figure: k-nearest-neighbor classification with a large k.]
Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (see the sketch after this list).
  – Example:
    • height of a person may vary from 1.5 m to 1.8 m
    • weight of a person may vary from 90 lb to 300 lb
    • income of a person may vary from $10K to $1M
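Min-max scaling is one common fix, sketched below in plain Python with made-up sample values based on the ranges above.

```python
def min_max_scale(rows):
    """Rescale each attribute (column) to [0, 1] so that no single
    attribute dominates the Euclidean distance. rows: numeric tuples."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    span = [h - l if h > l else 1.0 for l, h in zip(lo, hi)]  # avoid divide-by-zero
    return [tuple((v - l) / s for v, l, s in zip(row, lo, span)) for row in rows]

# height (m), weight (lb), income ($): wildly different ranges
people = [(1.5, 90.0, 10_000.0), (1.8, 300.0, 1_000_000.0), (1.6, 150.0, 50_000.0)]
print(min_max_scale(people))   # every column now lies in [0, 1]
```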

Nearest Neighbor Classification…
• k-NN classifiers are lazy learners:
  – They do not build models explicitly,
  – unlike eager learners such as decision tree induction and rule-based systems.
  – Classifying unknown records is therefore relatively expensive.
Algorithm
• 1: Let k be the number of nearest neighbors and D the set of training examples.
• 2: for each test example z = (x′, y′) do
• 3:   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
• 4:   Select Dz ⊆ D, the set of k closest training examples to z.
• 5:   y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
• 6: end for
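A self-contained Python sketch of steps 1-6 with Euclidean distance; the tiny 2-D training set is made up for illustration.

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(z, training, k):
    """Steps 3-5 above: rank D by distance to z, take D_z (the k closest),
    and return y' = argmax_v sum I(v = y_i) via majority vote."""
    d_z = sorted(training, key=lambda xy: euclidean(z, xy[0]))[:k]
    return Counter(y for _, y in d_z).most_common(1)[0][0]

# Made-up 2-D training set D of (point, label) pairs
D = [((0.0, 0.0), "A"), ((1.0, 1.5), "B"), ((5.0, 5.0), "A"), ((1.2, 2.1), "B")]
print(knn_classify((1.0, 2.0), D, k=3))   # -> "B": two of the three closest are B
```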

• Once the nearest-neighbor list is obtained, the test example is classified based on the majority class of its nearest neighbors:
  – Majority voting: y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
• Here v is a class label, yi is the class label of one of the nearest neighbors, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.

• In the majority-voting approach, every neighbor has the same impact on the classification (refer to the earlier figure on choosing k).
• This makes the classification algorithm sensitive to the choice of k.
• To reduce this sensitivity, we assign each nearest neighbor xi a weight based on its distance:
  wi = 1 / d(x′, xi)²
• As a result of weighting by distance, training examples located far away from z have a weaker impact on the classification.
• Using the distance-weighted voting scheme, the class label is determined by:
  y′ = argmax_v Σ_{(xi, yi) ∈ Dz} wi × I(v = yi)
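Building on the k-NN sketch above, a distance-weighted variant; treating an exact match (d = 0) as an infinite weight is one possible design choice, not the only one.

```python
import math
from collections import defaultdict

def weighted_knn_classify(z, training, k):
    """Distance-weighted voting: y' = argmax_v sum w_i * I(v = y_i),
    with w_i = 1 / d(x', x_i)^2."""
    def euclidean(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    d_z = sorted(training, key=lambda xy: euclidean(z, xy[0]))[:k]
    votes = defaultdict(float)
    for x, y in d_z:
        d = euclidean(z, x)
        votes[y] += 1.0 / (d * d) if d > 0 else float("inf")  # exact match dominates
    return max(votes, key=votes.get)
```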

Characteristics
• 1. NN classification is a part of instance-based
learning.
• 2. Lazy learners like NN classifiers do not need
model building.
• 3. NN classifiers make their predictions based
on local information whereas DT and rule-based
classifiers attempt to find a global model that fits
the entire input space.
• 4. Appropriate proximity measures play a
significant role in NN classifiers.
