
# CS446 – Fall 2015

## Introduction to Data Mining

K Nearest Neighbor
Classification
All slides were adapted from:
1. *Introduction to Data Mining* by Tan et al.
2. Dr. Ibrahim Albluwi
Is it a Duck?
• If it quacks like a duck, walks like a duck, and looks like a duck, then
most probably it is a duck!

[Figure: a test record is compared with all the training records, and the
"nearest" record is chosen.]
k Nearest Neighbor (kNN)
NN Classification
• Given an unseen record r that needs to be classified:
– Compute the distance between r and every record in the training set.
– Choose the record r_minDist that has the minimum distance to r.
– Assign to r the class value of r_minDist.

• Example: How should r = (X=2, Y=2) be classified?

|    | X | Y | Class |
|----|---|---|-------|
| r1 | 1 | 1 | YES   |
| r2 | 3 | 5 | NO    |
| r3 | 7 | 9 | NO    |
| r4 | 4 | 7 | YES   |

• Distance(r1, r) = sqrt((1-2)² + (1-2)²) = sqrt(2)
• Distance(r2, r) = sqrt((3-2)² + (5-2)²) = sqrt(10)
• Distance(r3, r) = sqrt((7-2)² + (9-2)²) = sqrt(74)
• Distance(r4, r) = sqrt((4-2)² + (7-2)²) = sqrt(29)

The closest record is r1, so r is classified as YES.
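The nearest-neighbor steps above can be sketched in Python using the slide's example data (function names are illustrative):

```python
import math

# Training records from the example: (X, Y) -> Class
training = [
    ((1, 1), "YES"),
    ((3, 5), "NO"),
    ((7, 9), "NO"),
    ((4, 7), "YES"),
]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_classify(r, records):
    """1-NN: return the class of the training record closest to r."""
    _, label = min(records, key=lambda rec: euclidean(rec[0], r))
    return label

print(nn_classify((2, 2), training))  # closest record is (1, 1) -> YES
```

Running this classifies r = (2, 2) as YES, matching the worked example.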

Lazy Learners
• Nearest neighbor classification is considered a lazy method,
whereas decision tree classification is considered an eager method.

• Lazy Learners:
– Do not build any model: Zero training time.
– Delay “thinking” to classification time.
– Most time is spent on classification.

• Eager Learners:
– Spend most of the time on building the model prior to
classification.
– Classification is quick since the model is ready.
Proximity Measures
Definitions:
• Similarity: A numerical measure of how alike two data objects are.

• Dissimilarity (or Distance): A numerical measure of how different two
data objects are.

• Which proximity measure to use? This depends heavily on the
attribute types.
Distance Measures

• Numeric Attributes:
– Manhattan Distance, Euclidean Distance, etc.

– Normalization is very important to avoid the domination of
one (or a few) attributes over the others.
Example:
• the height of a person may vary from 1.5 m to 1.8 m
• the weight of a person may vary from 90 lb to 300 lb
• the income of a person may vary from $10K to $1M
If we do not normalize, the income attribute will dominate
the distance computation.
Numeric Attributes
• Distance between two attribute values: |v1-v2|
• Distance between two records: many possibilities.

• Euclidean Distance:
  d(i, j) = sqrt(|x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|²)

• Manhattan Distance:
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
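Both distance measures are one-liners in Python (a minimal sketch; the function names are illustrative):

```python
def euclidean(xi, xj):
    """d(i, j) = sqrt( sum_k |x_ik - x_jk|^2 )"""
    return sum(abs(a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5

def manhattan(xi, xj):
    """d(i, j) = sum_k |x_ik - x_jk|"""
    return sum(abs(a - b) for a, b in zip(xi, xj))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```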
Euclidean Distance
Example:

|          | Age | Income | Height | Weight |
|----------|-----|--------|--------|--------|
| Record 1 | 45  | 2000   | 1.6    | 80     |
| Record 2 | 32  | 1200   | 1.75   | 75     |

Normalization:
age1 = 45/140 = 0.32          age2 = 32/140 = 0.23
income1 = 2000/5000 = 0.4     income2 = 1200/5000 = 0.24
height1 = 1.6/2.1 = 0.76      height2 = 1.75/2.1 = 0.83
weight1 = 80/150 = 0.53       weight2 = 75/150 = 0.5

Euclidean Distance =
sqrt((0.32-0.23)² + (0.4-0.24)² + (0.76-0.83)² + (0.53-0.5)²) ≈ 0.2
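The normalized distance can be checked in code. This sketch assumes the attribute maxima implied by the slide's normalization (140, 5000, 2.1, 150); exact fractions give ≈ 0.201 rather than the rounded intermediate values:

```python
import math

r1 = [45, 2000, 1.60, 80]
r2 = [32, 1200, 1.75, 75]
max_vals = [140, 5000, 2.1, 150]  # attribute maxima assumed from the slide

# Max-normalize each attribute into [0, 1], then take Euclidean distance.
n1 = [v / m for v, m in zip(r1, max_vals)]
n2 = [v / m for v, m in zip(r2, max_vals)]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(n1, n2)))
print(round(dist, 3))  # 0.201
```

Without normalization the income term (|2000 - 1200| = 800) would swamp all the others.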

Ordinal Attributes
• Assign numbers depending on order.

Example:

|          | Height | Income | GPA |
|----------|--------|--------|-----|
| Record 1 | Short  | Low    | A   |
| Record 2 | Tall   | Medium | A   |

• A1 [Short, Medium, Tall]: d1 = |2-0|/2 = 1
• A2 [Low, Medium, High]: d2 = |1-0|/2 = 0.5
• A3 [A, B, C, D, F]: d3 = |0-0|/4 = 0
• Distance = (1 + 0.5 + 0)/3 = 0.5
• Similarity = 1 - 0.5 = 0.5
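The ordinal recipe — rank each value on its scale, then normalize by the scale's range — can be sketched as (names are illustrative):

```python
def ordinal_distance(v1, v2, scale):
    """Map ordered categories to ranks, normalize by the scale's range."""
    return abs(scale.index(v1) - scale.index(v2)) / (len(scale) - 1)

height = ["Short", "Medium", "Tall"]
income = ["Low", "Medium", "High"]
gpa = ["A", "B", "C", "D", "F"]

# Average the per-attribute distances from the example.
d = (ordinal_distance("Short", "Tall", height)
     + ordinal_distance("Low", "Medium", income)
     + ordinal_distance("A", "A", gpa)) / 3
print(d)  # 0.5
```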
Nominal Attributes
• Similarity between two attribute values: Match = 1 and Mismatch = 0.

• Similarity between two records:
(Number of matches)/(Number of attributes)

Example:

|          | Eye Color | Country | Job      | Married |
|----------|-----------|---------|----------|---------|
| Record 1 | Black     | Jordan  | Engineer | Yes     |
| Record 2 | Green     | Jordan  | Engineer | Yes     |

• Similarity = (0 + 1 + 1 + 1)/4 = 0.75
• Distance = 1 - 0.75 = 0.25
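The match/mismatch rule for nominal records is a one-liner (a minimal sketch, function name illustrative):

```python
def nominal_similarity(r1, r2):
    """Fraction of attributes on which the two records match."""
    matches = sum(1 for a, b in zip(r1, r2) if a == b)
    return matches / len(r1)

r1 = ["Black", "Jordan", "Engineer", "Yes"]
r2 = ["Green", "Jordan", "Engineer", "Yes"]
sim = nominal_similarity(r1, r2)
print(sim, 1 - sim)  # 0.75 0.25
```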
Notes
• For records with mixed attribute types, compute the distance for each
attribute individually in the range [0, 1] and then average over all
attributes.

• Example:

|          | Age | Gender | Height | Weight |
|----------|-----|--------|--------|--------|
| Record 1 | 45  | Female | Short  | 80     |
| Record 2 | 32  | Male   | Tall   | 75     |

• Age (numeric): |45/140 - 32/140| = |0.32 - 0.23| = 0.09
• Gender (nominal): 1 (mismatch = 1, match = 0)
• Height (ordinal on [Short, Medium, Tall]): |2-0|/2 = 1
• Weight (numeric): |80/150 - 75/150| = |0.53 - 0.5| = 0.03

• Distance(R1, R2) = (0.09 + 1 + 1 + 0.03)/4 = 0.53
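The mixed-type average can be sketched as one function. The `kinds`/`params` scheme here is an assumption for illustration, not a standard API: numeric attributes carry their maximum, ordinal ones carry their ordered scale.

```python
def mixed_distance(r1, r2, kinds, params):
    """Per-attribute distance in [0, 1], averaged over all attributes.

    kinds[k] is 'numeric', 'nominal', or 'ordinal'; params[k] holds the
    numeric maximum or the ordered scale (illustrative convention).
    """
    total = 0.0
    for a, b, kind, p in zip(r1, r2, kinds, params):
        if kind == "numeric":
            total += abs(a - b) / p          # p = attribute maximum
        elif kind == "nominal":
            total += 0.0 if a == b else 1.0  # match = 0, mismatch = 1
        else:                                # ordinal: normalized rank gap
            total += abs(p.index(a) - p.index(b)) / (len(p) - 1)
    return total / len(r1)

d = mixed_distance(
    [45, "Female", "Short", 80],
    [32, "Male", "Tall", 75],
    ["numeric", "nominal", "ordinal", "numeric"],
    [140, None, ["Short", "Medium", "Tall"], 150],
)
print(round(d, 2))  # 0.53
```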

kNN Concerns
• What if the nearest neighbor is actually a noisy record?
• What if there are several nearest neighbors (all equally distant)?
• What if several records are all very close to the unseen record,
each at a slightly different distance?

• Assign to the unseen record:

– The majority class value among the K-NNs if the class
attribute is nominal.
– The median class value among the K-NNs if the class attribute
is ordinal.
– The mean class value among the K-NNs if the class attribute is
numeric.
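The nominal case (majority vote among the K nearest neighbors) can be sketched as follows; the median and mean variants only change the final aggregation step:

```python
import math
from collections import Counter

def knn_classify(r, records, k=3):
    """k-NN with majority vote for a nominal class attribute."""
    nearest = sorted(records, key=lambda rec: math.dist(rec[0], r))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1, 1), "YES"), ((3, 5), "NO"),
            ((7, 9), "NO"), ((4, 7), "YES")]
# 3 nearest neighbors of (2, 2): YES, NO, YES -> majority YES
print(knn_classify((2, 2), training, k=3))
```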
Examples

[Figure: nearest neighborhoods of a test point for different values of K.]

• Using a very small K: sensitive to noise and susceptible to overfitting.
• Using a very large K: computationally expensive and may consider
irrelevant records.
• K may need to be set experimentally.
Does k-NN Classification Work?
• In the limit (with enough data and optimal conditions), k-NN is
guaranteed to have an error rate no more than twice the error of an
optimal (Bayes) classifier.

In a Voronoi diagram, all points in a cell are closer to the record in
that cell than to any record in the other cells.

To classify a record: see in which cell it falls and assign to it the
class of the record in that cell.

[Figure: Voronoi cells around + and − training records.]

NN classifiers can learn complex patterns that are difficult for
decision trees.
Notes
• When to use NN-classification?
– If there are fewer than 20 attributes.
[Curse of Dimensionality: in high dimensions intuition fails, distance
measures become less meaningful, and computation becomes expensive.]
– If the application affords long classification time.
– If there is a lot of training data.