
CS446 – Fall 2015

Introduction to Data Mining

K Nearest Neighbor
Classification
All the slides were adapted from:
1- Introduction to Data Mining by Tan et al.
2- Dr. Ibrahim Albluwi
3- Dr. Noureddin Sadawi
Is it a Duck?
• If it quacks like a duck and walks like a duck, and looks like a
duck, then most probably, it is a duck!

[Figure: the test record is compared with all the training records, and the class of the "nearest" record is chosen.]
k Nearest Neighbor (kNN)
NN Classification
• Given an unseen record r that needs to be classified:
– Compute the distance between r and all of the other records in the
training set.
– Choose the record rminDist that has the minimum distance to r.
– Assign to r the class value of rminDist.

• Example: How should r = (X=2, Y=2) be classified?

  Record  X  Y  Class
  r1      1  1  YES
  r2      3  5  NO
  r3      7  9  NO
  r4      4  7  YES

• Distance(r1, r) = sqrt((1-2)² + (1-2)²) = sqrt(2).
• Distance(r2, r) = sqrt((3-2)² + (5-2)²) = sqrt(10).
• Distance(r3, r) = sqrt((7-2)² + (9-2)²) = sqrt(74).
• Distance(r4, r) = sqrt((4-2)² + (7-2)²) = sqrt(29).

The closest record is r1, so r will be classified as YES.
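The nearest-neighbor rule above can be sketched in Python (an illustrative sketch, not code from the slides):

```python
import math

# Training records from the example table: ((X, Y), class).
train = [((1, 1), "YES"), ((3, 5), "NO"), ((7, 9), "NO"), ((4, 7), "YES")]
r = (2, 2)  # the unseen record

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# 1-NN: choose the training record with the minimum distance to r
# and assign its class to r.
nearest = min(train, key=lambda rec: euclidean(rec[0], r))
predicted = nearest[1]  # "YES", since r1 = (1, 1) is closest at sqrt(2)
```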


Algorithm
Lazy Learners
• Nearest neighbor classification is considered a lazy method,
whereas decision tree classification is considered an eager
method.

• Lazy Learners:
– Do not build any model: Zero training time.
– Delay “thinking” to classification time.
– Most time is spent on classification.

• Eager Learners:
– Spend most of the time on building the model prior to
classification.
– Classification is quick since the model is ready.
Proximity Measures
Definitions:
• Similarity: A numerical measure of how much two data objects are
alike.

• Dissimilarity or Distance: A numerical measure of how different two
data objects are.

• Proximity: Similarity or dissimilarity depending on context.

• What proximity measure to use? This is highly dependent on the
attribute types.
Distance Measures

• Numeric Attributes:
– Manhattan Distance, Euclidean Distance, etc.

– Normalization is very important to avoid the domination of
one (or a few) attributes over the others.
Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
If we do not normalize, the income attribute will dominate
the distance computation.
Numeric Attributes
• Distance between two attribute values: |v1-v2|
• Distance between two records: many possibilities.

• Euclidean Distance:
  d(i, j) = sqrt(|x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|²)

• Manhattan Distance:
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
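Both distances can be written directly from the formulas; an illustrative sketch in Python (not code from the slides):

```python
import math

def euclidean(x, y):
    # d(i, j) = sqrt(|x_i1 - x_j1|^2 + ... + |x_ip - x_jp|^2)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # d(i, j) = |x_i1 - x_j1| + ... + |x_ip - x_jp|
    return sum(abs(a - b) for a, b in zip(x, y))
```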
Euclidean Distance
Example:
            Age  Income  Height  Weight
  Record 1  45   2000    1.6     80
  Record 2  32   1200    1.75    75

Normalization (dividing each attribute by its assumed maximum of 140, 5000, 2.1 and 150):
  Age1 = 45/140 = 0.32        Age2 = 32/140 = 0.23
  Income1 = 2000/5000 = 0.4   Income2 = 1200/5000 = 0.24
  Height1 = 1.6/2.1 = 0.76    Height2 = 1.75/2.1 = 0.83
  Weight1 = 80/150 = 0.53     Weight2 = 75/150 = 0.5

Euclidean Distance =
  sqrt((0.32-0.23)² + (0.4-0.24)² + (0.76-0.83)² + (0.53-0.5)²) ≈ 0.199
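A sketch of the same computation in Python (the maxima 140, 5000, 2.1 and 150 are the scaling constants assumed in the example):

```python
import math

r1 = {"age": 45, "income": 2000, "height": 1.6, "weight": 80}
r2 = {"age": 32, "income": 1200, "height": 1.75, "weight": 75}
max_vals = {"age": 140, "income": 5000, "height": 2.1, "weight": 150}

def normalized_euclidean(a, b, max_vals):
    # Scale each attribute into [0, 1] by its maximum, then apply
    # the ordinary Euclidean distance.
    return math.sqrt(sum((a[k] / m - b[k] / m) ** 2 for k, m in max_vals.items()))

d = normalized_euclidean(r1, r2, max_vals)  # ~0.201 (0.199 with rounded values)
```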


Ordinal Attributes
• Assign numbers depending on order.

• Distance between two attribute values: |v1-v2|/(n-1)

• Distance between two records: average distance between attributes.

Example:
            Height  Income  GPA
  Record 1  Short   Low     A
  Record 2  Tall    Medium  A

• A1 [Short, Medium, Tall]: d1 = |2-0|/2 = 1
• A2 [Low, Medium, High]: d2 = |1-0|/2 = 0.5
• A3 [A, B, C, D, F]: d3= |0-0|/4 = 0
• Distance = (1+0.5+0)/3 = 0.5
• Similarity = 1-0.5 = 0.5
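The ordinal computation can be sketched as follows (illustrative Python; the rank scales mirror the example):

```python
# Each ordinal attribute maps to a rank; per-attribute distance is
# |v1 - v2| / (n - 1), and the record distance is the average.
scales = {
    "height": ["Short", "Medium", "Tall"],
    "income": ["Low", "Medium", "High"],
    "gpa":    ["F", "D", "C", "B", "A"],
}

def ordinal_distance(r1, r2, scales):
    total = 0.0
    for attr, scale in scales.items():
        n = len(scale)
        total += abs(scale.index(r1[attr]) - scale.index(r2[attr])) / (n - 1)
    return total / len(scales)

r1 = {"height": "Short", "income": "Low", "gpa": "A"}
r2 = {"height": "Tall", "income": "Medium", "gpa": "A"}
d = ordinal_distance(r1, r2, scales)  # (1 + 0.5 + 0) / 3 = 0.5
```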
Nominal Attributes
• Similarity between two attribute values: Match = 1 and Mismatch = 0.

• Similarity between two records:
(Number of matches)/(Number of attributes)

• Distance between two records = 1 – similarity.

Example:
            Eye Color  Country  Job       Married
  Record 1  Black      Jordan   Engineer  Yes
  Record 2  Green      Jordan   Engineer  Yes

• Similarity= (0 + 1 + 1 + 1)/4 = 0.75


• Distance = 1 – 0.75 = 0.25
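A minimal sketch of the match/mismatch rule (illustrative, not from the slides):

```python
def nominal_similarity(r1, r2):
    # Match = 1, mismatch = 0, averaged over all attributes.
    matches = sum(a == b for a, b in zip(r1, r2))
    return matches / len(r1)

r1 = ("Black", "Jordan", "Engineer", "Yes")
r2 = ("Green", "Jordan", "Engineer", "Yes")
sim = nominal_similarity(r1, r2)  # 3/4 = 0.75
dist = 1 - sim                    # 0.25
```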
Notes
• For records with mixed attribute types, compute the distance for each
attribute individually in the range [0, 1] and then compute the average
over all attributes.

• Example:
            Age  Gender  Height  Weight
  Record 1  45   Female  Short   80
  Record 2  32   Male    Tall    75

• |R1(Age) – R2(Age)| = |45/140 – 32/140| = |0.32 – 0.23| = 0.09
• |R1(Gender) – R2(Gender)| = 1 (Mismatch = 1, Match = 0)
• |R1(Height) – R2(Height)| = |2 – 0|/2 = 1
• |R1(Weight) – R2(Weight)| = |80/150 – 75/150| = |0.53 – 0.5| = 0.03

• Distance(R1, R2) = (0.09 + 1 + 1 + 0.03) / 4 = 0.53
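The mixed-type computation can be sketched as (illustrative; the maxima 140 and 150 and the three-level height scale are the assumptions from the example):

```python
# Per-attribute distances in [0, 1], averaged over all attributes.
HEIGHTS = ["Short", "Medium", "Tall"]

def mixed_distance(r1, r2):
    d_age = abs(r1["age"] - r2["age"]) / 140          # numeric, scaled by max
    d_gender = 0 if r1["gender"] == r2["gender"] else 1  # nominal
    d_height = abs(HEIGHTS.index(r1["height"])
                   - HEIGHTS.index(r2["height"])) / 2    # ordinal, /(n-1)
    d_weight = abs(r1["weight"] - r2["weight"]) / 150    # numeric, scaled by max
    return (d_age + d_gender + d_height + d_weight) / 4

r1 = {"age": 45, "gender": "Female", "height": "Short", "weight": 80}
r2 = {"age": 32, "gender": "Male", "height": "Tall", "weight": 75}
d = mixed_distance(r1, r2)  # (0.09 + 1 + 1 + 0.03) / 4 ≈ 0.53
```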


kNN Concerns
• What if the nearest neighbor is actually a noisy record?
• What if there are several nearest neighbors (all equidistant)?
• What if there are several records that are all very close to the
unseen record but each having a different distance?

• Use K-Nearest Neighbors instead of 1-Nearest Neighbor!

• Assign to the unseen record:


– The majority class value among the K-NNs if the class
attribute is nominal.
– The median class value among the K-NNs if the class attribute
is ordinal.
– The mean class value among the K-NNs if the class attribute is
numeric.
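The three aggregation rules can be sketched in one function (an illustrative sketch; for the ordinal and numeric cases the labels are assumed to be numbers):

```python
import math
from collections import Counter
from statistics import mean, median

def knn_predict(train, r, k, target="nominal"):
    # train: list of (features, label) pairs; r: feature tuple to classify.
    by_dist = sorted(train, key=lambda rec: math.dist(rec[0], r))
    labels = [label for _, label in by_dist[:k]]
    if target == "nominal":
        return Counter(labels).most_common(1)[0][0]  # majority vote
    if target == "ordinal":
        return median(labels)  # labels assumed to be numeric ranks
    return mean(labels)        # numeric class attribute (regression)

train = [((1, 1), "YES"), ((3, 5), "NO"), ((7, 9), "NO"), ((4, 7), "YES")]
pred = knn_predict(train, (2, 2), k=3)  # 3 nearest: YES, NO, YES -> "YES"
```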
Examples

[Figure: a test record X and its (a) 1-nearest, (b) 2-nearest and (c) 3-nearest neighbors.]

• Using a very small K: sensitive to noise and susceptible to overfitting.
• Using a very large K: computationally expensive and may
consider irrelevant records.
• May need to set K experimentally.
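Setting K experimentally usually means trying several values on held-out data; a minimal sketch with hypothetical 1-D toy data (not from the slides):

```python
import random

random.seed(0)
# Hypothetical toy data: class is 1 when x > 0.5, with ~10% label noise.
xs = [random.random() for _ in range(200)]
data = [(x, int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5)) for x in xs]
train, valid = data[:150], data[150:]

def knn_predict(train, x, k):
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [c for _, c in neighbors]
    return max(set(labels), key=labels.count)  # majority vote

def accuracy(k):
    return sum(knn_predict(train, x, k) == c for x, c in valid) / len(valid)

# Try odd K values (odd avoids two-class ties) and keep the best on validation.
best_k = max(range(1, 16, 2), key=accuracy)
```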
How Many Neighbors?
[Figures: worked examples on choosing the number of neighbors, and an illustration of standardized distance.]
Does k-NN Classification Work?
• In the limit (as the amount of training data grows), nearest-neighbor
classification is guaranteed to have an error rate that is no more than
twice the error of an optimal (Bayes) classifier.

In a Voronoi diagram, all points in a cell are closer to the record in
that cell than to any record in the other cells.

To classify a record: see in which cell it falls and assign to it the
class of the record in that cell.

NN-classifiers can learn complex patterns that are difficult for
decision trees.
Notes
• When to use NN-Classification?
– If there are fewer than 20 attributes.
[Curse of Dimensionality: in higher dimensions, intuition fails,
distance measures become less meaningful, and computation
becomes expensive.]
– If the application affords long classification time.
– If there are lots of training data.

• Advantages of NN-Classification:
– Quick training time.
– Can learn complex patterns.
– Can be used for regression (numeric class attributes).

• Disadvantages of NN-Classification:
– Slow at query time.
– Easily fooled by irrelevant attributes [Feature subset selection is
very important].