
Classification: Alternative Techniques

Lecture Notes for Chapter 5


Introduction to Data Mining
by
Tan, Steinbach, Kumar


Instance-Based Classifiers
[Figure: a set of stored training records with attributes Atr1, ..., AtrN and a Class column, next to an unseen record whose class is to be predicted]

Store the training records

Use training records to predict the class label of unseen cases


Instance-Based Classifiers


Examples:

Rote-learner
Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly

Nearest neighbor
Uses k closest points (nearest neighbors) for
performing classification


Nearest Neighbor Classifiers


Basic idea:
  If it walks like a duck, quacks like a duck, then it's probably a duck
[Figure: a test record placed among the training records; compute the distance from the test record to the training records and choose the k nearest ones]


Nearest-Neighbor Classifiers

Requires three things
  The set of stored records
  Distance metric to compute distance between records
  The value of k, the number of nearest neighbors to retrieve

To classify an unknown record (see the sketch below):
  Compute distance to other training records
  Identify the k nearest neighbors
  Use class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
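Here is a minimal sketch of that procedure in Python (names such as knn_predict and the toy data are mine, not from the slides), using Euclidean distance and a majority vote:

```python
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, x, k=3):
    # 1. Compute distance from x to every stored training record
    dists = [(euclidean(rec, x), label) for rec, label in zip(train_X, train_y)]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(dists, key=lambda t: t[0])[:k]
    # 3. Take the majority vote of their class labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# toy usage
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))   # -> "A"
```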


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a test record]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

1-Nearest Neighbor
Voronoi Diagram

[Figure: Voronoi diagram induced by the training points; the 1-nearest-neighbor rule labels each cell with the class of its training point]


Nearest Neighbor Classification


Compute distance between two points:
  Euclidean distance:

    d(p, q) = sqrt( Σi (pi − qi)² )

Determine the class from the nearest-neighbor list
  Take the majority vote of class labels among the k nearest neighbors
  Weigh the vote according to distance (see the weighted-vote sketch below)
    weight factor, w = 1/d²
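A small sketch of the distance-weighted vote with w = 1/d² (the helper name weighted_vote and the epsilon guard against zero distances are my additions):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest records
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d * d + 1e-12)   # w = 1/d^2; epsilon avoids division by zero
    return max(scores, key=scores.get)

print(weighted_vote([(0.5, "A"), (1.0, "B"), (1.1, "B")]))   # "A" wins: 4.0 vs ~1.83
```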


Nearest Neighbor Classification


Choosing the value of k:
  If k is too small, sensitive to noise points


If k is too large, neighborhood may include points from
other classes


Nearest Neighbor Classification


Scaling issues
  Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  Example:
    height of a person may vary from 1.5 m to 1.8 m
    weight of a person may vary from 90 lb to 300 lb
    income of a person may vary from $10K to $1M
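One common remedy is min-max scaling of each attribute before computing distances. A sketch under that assumption (the slides do not prescribe a specific scaling method), using the value ranges from the example above:

```python
def min_max_scale(records):
    # records: list of equal-length numeric tuples, one per training record
    cols = list(zip(*records))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(rec, lo, hi))
        for rec in records
    ]

# (height in m, weight in lb, income in $) -- all attributes end up in [0, 1]
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.7, 180, 60_000)]
print(min_max_scale(people))
```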


Nearest Neighbor Classification


Problem with Euclidean measure:
  High dimensional data
    curse of dimensionality
  Can produce counter-intuitive results

    111111111110  vs  011111111111    d = 1.4142
    100000000000  vs  000000000001    d = 1.4142

  Solution: Normalize the vectors to unit length
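A quick check of the example above: the two pairs of 12-bit vectors have the same Euclidean distance, while normalizing them to unit length separates the mostly-overlapping pair from the disjoint one (a small sketch, pairing the vectors as shown above):

```python
import math

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = [1] * 11 + [0];  b = [0] + [1] * 11      # 111111111110 vs 011111111111
c = [1] + [0] * 11;  d = [0] * 11 + [1]      # 100000000000 vs 000000000001

print(dist(a, b), dist(c, d))                          # both 1.4142...
print(dist(unit(a), unit(b)), dist(unit(c), unit(d)))  # ~0.43 vs ~1.41 after normalization
```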


Nearest Neighbor Classification

k-NN classifiers are lazy learners
  They do not build models explicitly
  Unlike eager learners such as decision tree induction and rule-based systems
  Classifying unknown records is relatively expensive


Bayes Classifier
A probabilistic framework for solving classification problems

Conditional probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem


Given:
  A doctor knows that meningitis causes stiff neck 50% of the time
  Prior probability of any patient having meningitis is 1/50,000
  Prior probability of any patient having stiff neck is 1/20

If a patient has a stiff neck, what's the probability he/she has meningitis?

  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
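A one-line check of this calculation (values taken from the slide):

```python
p_s_given_m, p_m, p_s = 0.5, 1 / 50_000, 1 / 20
print(p_s_given_m * p_m / p_s)   # 0.0002
```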

Bayesian Classifiers

Consider each attribute and class label as random variables

Given a record with attributes (A1, A2, ..., An)
  Goal is to predict class C
  Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An)

Can we estimate P(C | A1, A2, ..., An) directly from data?


Bayesian Classifiers

Approach:
  Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes theorem:

    P(C | A1 A2 ... An) = P(A1 A2 ... An | C) P(C) / P(A1 A2 ... An)

  Choose the value of C that maximizes P(C | A1, A2, ..., An)
  Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C)

How to estimate P(A1, A2, ..., An | C)?


Naïve Bayes Classifier

Assume independence among attributes Ai when the class is given:

  P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)

  Can estimate P(Ai | Cj) for all Ai and Cj.
  New point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.


How to Estimate Probabilities from Data?

Class: P(C) = Nc / N
  e.g., P(No) = 7/10, P(Yes) = 3/10

For discrete attributes:
  P(Ai | Ck) = |Aik| / Nck
    where |Aik| is the number of instances having attribute value Ai and belonging to class Ck, and Nck is the number of instances of class Ck
  Examples:
    P(Status=Married | No) = 4/7
    P(Refund=Yes | Yes) = 0
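These counts are easy to compute directly. A sketch of the counting estimators (the tiny training set here is a stand-in for the slide's table, which is not reproduced in these notes):

```python
from collections import Counter, defaultdict

def fit_counts(records, labels):
    class_counts = Counter(labels)                 # Nc per class
    value_counts = defaultdict(Counter)            # |Aik| per (attribute index, class)
    for rec, c in zip(records, labels):
        for j, value in enumerate(rec):
            value_counts[(j, c)][value] += 1
    return class_counts, value_counts

def p_class(c, class_counts):
    return class_counts[c] / sum(class_counts.values())        # P(C) = Nc / N

def p_attr(j, value, c, class_counts, value_counts):
    return value_counts[(j, c)][value] / class_counts[c]       # P(Ai | C) = |Aik| / Nck

records = [("No", "Married"), ("Yes", "Single"), ("No", "Single"), ("No", "Married")]
labels  = ["No", "No", "Yes", "No"]
cc, vc = fit_counts(records, labels)
print(p_class("No", cc))                     # 3/4
print(p_attr(1, "Married", "No", cc, vc))    # 2/3
```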


How to Estimate Probabilities from Data?

For continuous attributes:
  Discretize the range into bins
    one ordinal attribute per bin
    violates independence assumption

Two-way split: (A < v) or (A > v)

choose only one of the two splits as new attribute

Probability density estimation:


Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)


How to Estimate Probabilities from Data?

Normal distribution:

  P(Ai | cj) = 1 / sqrt(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

  One for each (Ai, ci) pair

For (Income, Class=No):
  If Class=No
    sample mean = 110
    sample variance = 2975

  P(Income = 120 | No) = 1 / (sqrt(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
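The Gaussian likelihood used above is a one-liner (mean 110 and variance 2975 come from the slide):

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(gaussian_pdf(120, 110, 2975))   # ~0.0072
```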


Example of Naïve Bayes Classifier

Given a test record:

  X = (Refund = No, Married, Income = 120K)

P(X | Class=No) = P(Refund=No | Class=No)
                  × P(Married | Class=No)
                  × P(Income=120K | Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class=Yes) = P(Refund=No | Class=Yes)
                   × P(Married | Class=Yes)
                   × P(Income=120K | Class=Yes)
                 = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes)
  therefore P(No | X) > P(Yes | X)
    => Class = No

Naïve Bayes Classifier

If one of the conditional probabilities is zero, then the entire expression becomes zero

Probability estimation:

  Original:   P(Ai | C) = Nic / Nc
  Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

    c: number of classes
    p: prior probability
    m: parameter
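A sketch comparing the three estimates for the zero-count case (the Nic = 0, Nc = 3, c = 2 numbers correspond to P(Refund=Yes | Yes) from the earlier example; m = 3 and p = 0.5 are made-up parameter choices):

```python
def original(n_ic, n_c):
    return n_ic / n_c

def laplace(n_ic, n_c, c):
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    return (n_ic + m * p) / (n_c + m)

print(original(0, 3))                 # 0.0  -> zeroes out the whole product
print(laplace(0, 3, c=2))             # 0.2
print(m_estimate(0, 3, m=3, p=0.5))   # 0.25
```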


Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes
M: mammals
N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N)
  => Mammals
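The calculation above can be reproduced by counting over the table. A compact sketch (rows are the table in the same order; the helper names are mine):

```python
rows = [  # (give_birth, can_fly, live_in_water, have_legs, class), in table order
    ("yes","no","no","yes","mammals"), ("no","no","no","no","non-mammals"),
    ("no","no","yes","no","non-mammals"), ("yes","no","yes","no","mammals"),
    ("no","no","sometimes","yes","non-mammals"), ("no","no","no","yes","non-mammals"),
    ("yes","yes","no","yes","mammals"), ("no","yes","no","yes","non-mammals"),
    ("yes","no","no","yes","mammals"), ("yes","no","yes","no","non-mammals"),
    ("no","no","sometimes","yes","non-mammals"), ("no","no","sometimes","yes","non-mammals"),
    ("yes","no","no","yes","mammals"), ("no","no","yes","no","non-mammals"),
    ("no","no","sometimes","yes","non-mammals"), ("no","no","no","yes","non-mammals"),
    ("no","no","no","yes","mammals"), ("no","yes","no","yes","non-mammals"),
    ("yes","no","yes","no","mammals"), ("no","yes","no","yes","non-mammals"),
]

def likelihood(x, cls):
    recs = [r for r in rows if r[-1] == cls]
    p = 1.0
    for j, v in enumerate(x):
        p *= sum(1 for r in recs if r[j] == v) / len(recs)   # P(Aj = v | class)
    return p

x = ("yes", "no", "yes", "no")   # the test record
for cls in ("mammals", "non-mammals"):
    prior = sum(1 for r in rows if r[-1] == cls) / len(rows)
    print(cls, likelihood(x, cls) * prior)   # ~0.021 vs ~0.0027 -> mammals
```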


Naïve Bayes (Summary)

Robust to isolated noise points

Handles missing values by ignoring the instance during probability estimate calculations

Robust to irrelevant attributes

Independence assumption may not hold for some attributes
  Use other techniques such as Bayesian Belief Networks (BBN)


Artificial Neural Networks (ANN)

Output Y is 1 if at least two of the three inputs are equal to 1.


Artificial Neural Networks (ANN)

  Y = I( 0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0 )

  where I(z) = 1 if z is true
               0 otherwise
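A direct transcription of that decision rule (a trivial sketch; the truth values follow the slide's description):

```python
def predict(x1, x2, x3):
    # Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)
    return int(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

for inputs in [(0, 0, 1), (0, 1, 1), (1, 1, 1)]:
    print(inputs, predict(*inputs))   # Y is 1 iff at least two inputs are 1
```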


Artificial Neural Networks (ANN)

Model is an assembly of inter-connected nodes and weighted links

Output node sums up each of its input values according to the weights of its links

Perceptron Model

Compare output node against some threshold t

  Y = I( Σi wi Xi − t > 0 )   or   Y = sign( Σi wi Xi − t )


General Structure of ANN

Training an ANN means learning the weights of the neurons


Algorithm for learning ANN


Initialize

the weights (w0, w1, , wk)

Adjust

the weights in such a way that the output


of ANN is consistent with class labels of training
examples
2
Objective function: E Yi f ( wi , X i )
i

Find the weights wis that minimize the above


objective function

e.g., backpropagation algorithm (see lecture notes)
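As a concrete (and deliberately simple) illustration of weight learning, the sketch below uses the classic perceptron update rule rather than the backpropagation algorithm the slide refers to; the data are the "at least two of three inputs are 1" examples from the earlier ANN slide:

```python
def train_perceptron(X, Y, lr=0.1, epochs=20):
    w = [0.0] * len(X[0])          # weights w1..wk, initialized to zero
    b = 0.0                        # bias term (plays the role of -t)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            pred = int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
            err = y - pred                       # error drives the weight adjustment
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

X = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
Y = [int(sum(x) >= 2) for x in X]                # 1 iff at least two inputs are 1
w, b = train_perceptron(X, Y)
print(w, b)   # learns weights implementing the same "at least two of three" rule
```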


Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data


Support Vector Machines

One Possible Solution


Support Vector Machines

Another possible solution


Support Vector Machines

Other possible solutions


Support Vector Machines

Which one is better? B1 or B2?


How do you define better?


Support Vector Machines

Find the hyperplane that maximizes the margin => B1 is better than B2


Support Vector Machines


  w · x + b = 0
  w · x + b = +1
  w · x + b = −1

  f(x) =  1   if w · x + b ≥ 1
         −1   if w · x + b ≤ −1

  Margin = 2 / ||w||²

Support Vector Machines


We want to maximize:

  Margin = 2 / ||w||²

Which is equivalent to minimizing:

  L(w) = ||w||² / 2

But subject to the following constraints:

  f(xi) =  1   if w · xi + b ≥ 1
          −1   if w · xi + b ≤ −1

This is a constrained optimization problem


Numerical approaches to solve it (e.g., quadratic programming)
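As one possible numerical approach (a stand-in for the quadratic-programming solvers the slide mentions, not the slide's method), the soft-margin objective can also be attacked with a simple subgradient descent on the hinge loss. A rough sketch with made-up toy data:

```python
def train_linear_svm(X, Y, C=1.0, lr=0.01, epochs=200):
    # minimizes ||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w.x_i + b)); labels must be +1/-1
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point violates the margin: hinge term is active
                w = [wi - lr * (wi - C * y * xi) for wi, xi in zip(w, x)]
                b += lr * C * y
            else:            # only the regularizer ||w||^2 / 2 contributes
                w = [wi - lr * wi for wi in w]
    return w, b

X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
Y = [+1, +1, -1, -1]
print(train_linear_svm(X, Y))   # hyperplane roughly along x1 + x2 = 0
```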


Support Vector Machines


What if the problem is not linearly separable?


Support Vector Machines


What if the problem is not linearly separable?

  Introduce slack variables ξi
    Need to minimize:

      L(w) = ||w||² / 2 + C Σ_{i=1}^{N} ξi^k

    Subject to:

      f(xi) =  1   if w · xi + b ≥ 1 − ξi
              −1   if w · xi + b ≤ −1 + ξi


Nonlinear Support Vector Machines

What if the decision boundary is not linear?


Nonlinear Support Vector Machines

Transform data into a higher dimensional space
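A minimal illustration of the idea (the quadratic map and the toy points are mine; real SVMs use kernel functions to do this implicitly): a circular boundary in the original space becomes a linear one after an explicit transformation.

```python
def transform(x1, x2):
    # map (x1, x2) -> (x1^2, x2^2): the circle x1^2 + x2^2 = 1 becomes the line z1 + z2 = 1
    return (x1 * x1, x2 * x2)

points = [(0.5, 0.2, "inside"), (0.1, -0.4, "inside"), (2.0, 1.0, "outside"), (-1.5, 2.0, "outside")]
for x1, x2, label in points:
    z1, z2 = transform(x1, x2)
    print(label, "linear test z1 + z2 < 1:", z1 + z2 < 1)
```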


Ensemble Methods
Construct a set of classifiers from the training data

Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers


General Idea


Why does it work?


Suppose there are 25 base classifiers
  Each classifier has error rate ε = 0.35
  Assume the classifiers are independent
  Probability that the ensemble classifier makes a wrong prediction:

    Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} = 0.06
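A quick check of that number (25 independent base classifiers with ε = 0.35; the ensemble errs when a majority, i.e. 13 or more, are wrong):

```python
from math import comb

eps = 0.35
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 3))   # ~0.06
```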


Examples of Ensemble Methods


How to generate an ensemble of classifiers?
  Bagging
  Boosting


Bagging

Sampling with replacement

Original Data       1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)   7   8   10  8   2   5   10  10  5   9
Bagging (Round 2)   1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)   1   8   5   10  5   5   9   6   3   7

Build classifier on each bootstrap sample

Each record has probability 1 − (1 − 1/n)^n (≈ 0.632 for large n) of being selected in a given bootstrap sample
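A sketch of the bootstrap sampling step (random draws, so the rounds will not match the table above; the seed is arbitrary):

```python
import random

def bootstrap_sample(records, rng):
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]   # draw n times with replacement

rng = random.Random(42)
original = list(range(1, 11))
for round_no in range(1, 4):
    print(f"Bagging (Round {round_no}):", bootstrap_sample(original, rng))
# each original record appears in a given sample with probability 1 - (1 - 1/n)^n
```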


Boosting
An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  Initially, all N records are assigned equal weights
  Unlike bagging, weights may change at the end of each boosting round


Boosting
Records that are wrongly classified will have their weights increased

Records that are classified correctly will have their weights decreased

Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)   4   4   8   10  4   5   4   6   3   4

Example 4 is hard to classify
  Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
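The slides do not spell out the exact reweighting formula, so the sketch below uses an AdaBoost-style update as an illustrative assumption: misclassified records are scaled up by e^α, correctly classified ones are scaled down by e^(−α), and the weights are then renormalized.

```python
import math

def update_weights(weights, misclassified):
    # misclassified: one boolean per record for the current round's classifier
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]

w = [0.1] * 10                                        # initially, equal weights
w = update_weights(w, [i == 3 for i in range(10)])    # suppose only example 4 was misclassified
print([round(x, 3) for x in w])                       # example 4: 0.5, the rest: ~0.056
```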

