
k-nearest neighbor: Instance-based learning

B. B. Misra

A classification of learning algorithms

Eager learning algorithms
Neural networks, decision trees, Bayesian classifiers

Lazy learning algorithms

k-nearest neighbor, case-based reasoning

Lazy vs. Eager Learning

Lazy learning (e.g., instance-based learning): simply stores the training data and waits until it is given a test tuple. It does not build a model explicitly.
Eager learning: given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
Lazy: less time in training but more time in predicting.
Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function.

Eager: must commit to a single hypothesis that covers the entire instance space

General Idea of Instance-based Learning

Learning: store all the data instances.
Performance: when a new query instance is encountered, retrieve a set of similar instances from memory and use them to classify the new query.

Pros and Cons of Instance Based Learning

Pros:
Can construct a different approximation to the target function for each distinct query instance to be classified.
Can use more complex, symbolic representations.

Cons:
Cost of classification can be high.
Uses all attributes (does not learn which are most important).

Example: 1-Nearest Neighbor

Example: 3-Nearest Neighbor

k-nearest neighbor (knn) learning

Most basic type of instance-based learning.
Assumes all instances are points in n-dimensional space.
A distance measure is needed to determine the closeness of instances.
Classify an instance by finding its nearest neighbors and picking the most popular class among them.
Robust to noisy data because it averages over the k nearest neighbors.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
To overcome it, stretch the axes (weight the attributes) or eliminate the least relevant attributes.
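Axis stretching can be read as giving each attribute its own weight in the distance, with a weight of 0 eliminating the attribute altogether. The sketch below illustrates that reading; the function name `weighted_euclidean`, the weights, and the sample points are all made up for illustration.

```python
import math

def weighted_euclidean(x, y, weights):
    """Euclidean distance after stretching each axis by its weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))

# Made-up points; the third attribute is assumed to be irrelevant noise.
p = (1.0, 2.0, 50.0)
q = (1.0, 3.0, -20.0)

print(weighted_euclidean(p, q, (1, 1, 1)))  # dominated by the noisy attribute
print(weighted_euclidean(p, q, (1, 1, 0)))  # weight 0 eliminates that attribute
```

How to choose the weights is a separate question (for example by cross-validation); the sketch only shows their effect on the distance.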

knn learning
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example:
height of a person may vary from 1.5 m to 1.8 m
weight of a person may vary from 90 lb to 300 lb
income of a person may vary from $10K to $1M
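One common way to scale attributes is min-max scaling, which maps every attribute to the range [0, 1]. The slides do not name a specific method, so this is only one possible choice; the function name `min_max_scale` and the three sample people are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ))
    return scaled

# Hypothetical people: (height in m, weight in lb, income in $)
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.65, 195, 505_000)]

for row in min_max_scale(people):
    print(row)
```

After scaling, a difference of $10K in income no longer outweighs the full range of heights when distances are computed.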

Important Decisions
Distance measure
Value of k (usually odd)
Voting mechanism
Memory indexing

Euclidean Distance
Typically used for real-valued attributes.
Instance x (often called a feature vector):

\[ \langle a_1(x), a_2(x), \ldots, a_n(x) \rangle \]

Distance between two instances x_i and x_j:

\[ d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \bigl(a_r(x_i) - a_r(x_j)\bigr)^2} \]
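The distance formula translates directly into code. A minimal sketch, assuming each instance is given as a sequence of numbers (the function name `euclidean_distance` is illustrative):

```python
import math

def euclidean_distance(xi, xj):
    """Square root of the summed squared attribute differences."""
    if len(xi) != len(xj):
        raise ValueError("instances must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance((0, 0), (3, 4)))  # 3-4-5 right triangle
```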

Discrete Valued Target Function

Training algorithm:
For each training example ⟨x, f(x)⟩, add the example to the list training_examples.

Classification algorithm:
Given a query instance x_q to be classified,
let x_1, ..., x_k be the k training examples nearest to x_q, and return

\[ \hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)) \]

where \delta(a, b) = 1 if a = b and \delta(a, b) = 0 otherwise.
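The classification step can be sketched as a majority vote over the k nearest stored examples. This is an illustrative sketch, not a reference implementation: the function names, the toy data, and the tie-breaking rule (the class of the nearest neighbor among the tied classes wins) are assumptions.

```python
import math
from collections import Counter

def distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training_examples, xq, k):
    """Return the most common label among the k examples nearest to xq."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort stored (x, f(x)) pairs by distance to the query, nearest first.
    neighbors = sorted(training_examples, key=lambda ex: distance(ex[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    best = max(votes.values())
    # Assumed tie-break: among the top-voted classes, the nearest neighbor wins.
    for _, label in neighbors:
        if votes[label] == best:
            return label

# Toy 2-D data, invented for illustration.
training_examples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
                     ((5, 5), "B"), ((6, 5), "B")]

print(knn_classify(training_examples, (1.5, 1.5), k=3))
print(knn_classify(training_examples, (5, 6), k=3))
```

"Training" here is nothing more than building the list `training_examples`; all the work happens at query time.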

Continuous valued target function

The algorithm computes the mean value of the k nearest training examples rather than the most common value. Replace the final line of the previous algorithm with

\[ \hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i) \]
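For a continuous target, the vote is replaced by an average. A minimal sketch under the same assumptions as before (Euclidean distance, examples stored as (x, f(x)) pairs); the function names and the toy data are hypothetical.

```python
import math

def distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_regress(training_examples, xq, k):
    """Return the mean target value of the k examples nearest to xq."""
    if k < 1:
        raise ValueError("k must be at least 1")
    neighbors = sorted(training_examples, key=lambda ex: distance(ex[0], xq))[:k]
    # Divide by the number of neighbors actually found (k, unless data is short).
    return sum(value for _, value in neighbors) / len(neighbors)

# Toy 1-D data, invented for illustration: f(x) = x.
training_examples = [((1,), 1.0), ((2,), 2.0), ((3,), 3.0), ((10,), 10.0)]

print(knn_regress(training_examples, (2,), k=3))  # mean of 1.0, 2.0 and 3.0
```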

Using the training data, classify the test data with 1-NN, 3-NN, 5-NN and 7-NN.

Training data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
1       6      1           10          4       No
2       4      2           8           5       No
3       5      2           7           4       Yes
4       5      1           8           4       Yes
5       5      1           10          5       No
6       6      1           8           6       Yes
7       7      1           14          5       No

Test data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
8       7      2           9           4       ?

The distance of the 1st training record from the test record is

\[ d_1 = \sqrt{(6-7)^2 + (1-2)^2 + (10-9)^2 + (4-4)^2} = \sqrt{3} = 1.732 \]

Number  Mondrian?  Distance from test data
1       No         1.732
2       No         3.317
3       Yes        2.828
4       Yes        2.449
5       No         2.646
6       Yes        2.646
7       No         5.196

For 1-NN, the 1st record (class = No) is the closest, i.e. the 1st neighbor, so the test record is classified No.
For 3-NN, the neighbors are the 1st (No), the 4th (Yes), and {the 5th (No) and the 6th (Yes), both at equal distance}. A tie occurs; to break it, use some mechanism, such as choosing one at random or giving priority to the 1st neighbor, in which case the test record is classified No.
For 5-NN, the neighbors in order are the 1st (No), 4th (Yes), 5th (No), 6th (Yes) and 3rd (Yes), so the test record is classified Yes.
For 7-NN, the training set has only 7 records, so all are considered; there are 4 No and 3 Yes, so the test record is classified No.
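The worked example can be checked with a short script. The data come from the tables above; the tie handling is one possible reading of the rule described in the text (keep every record tied with the k-th nearest, then let the nearest neighbor decide a tied vote) and is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter

# (lines, line types, rectangles, colors) -> Mondrian?
TRAINING = [
    ((6, 1, 10, 4), "No"),
    ((4, 2, 8, 5), "No"),
    ((5, 2, 7, 4), "Yes"),
    ((5, 1, 8, 4), "Yes"),
    ((5, 1, 10, 5), "No"),
    ((6, 1, 8, 6), "Yes"),
    ((7, 1, 14, 5), "No"),
]
TEST = (7, 2, 9, 4)

def squared_distance(x, y):
    """Squared Euclidean distance; integers here, so ties compare exactly."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def classify(k):
    ranked = sorted(TRAINING, key=lambda ex: squared_distance(ex[0], TEST))
    # Assumed rule: keep every record as close as the k-th nearest one.
    cutoff = squared_distance(ranked[k - 1][0], TEST)
    neighbors = [ex for ex in ranked if squared_distance(ex[0], TEST) <= cutoff]
    votes = Counter(label for _, label in neighbors)
    best = max(votes.values())
    # Assumed rule: a tied vote goes to the class of the nearest neighbor.
    for _, label in neighbors:
        if votes[label] == best:
            return label

for number, (x, label) in enumerate(TRAINING, start=1):
    print(number, label, round(math.sqrt(squared_distance(x, TEST)), 3))

for k in (1, 3, 5, 7):
    print(f"{k}-NN ->", classify(k))
```

Comparing squared distances avoids floating-point issues when checking whether the 5th and 6th records are equally far from the test record.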

Algorithm questions
What is the space complexity for the model?
What is the time complexity for learning the model?
What is the time complexity for classification of an instance?

Analysis of KNN Algorithm

Advantages of KNN Algorithm
KNN is an easy-to-understand and easy-to-implement classification technique. It can perform well in many situations. Cover and Hart show that the error of the nearest neighbor rule is bounded above by twice the Bayes error under certain reasonable assumptions. Also, the error of the general KNN method asymptotically approaches the Bayes error and can be used to approximate it.
KNN is particularly well suited for multi-modal classes as well as applications in which an object can have many class labels.

Disadvantages of KNN Algorithm
The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially as the training set grows.

Case-Based Reasoning (CBR)

CBR: Uses a database of problem solutions to solve new problems

Stores symbolic descriptions (tuples or cases), not points in a Euclidean space.
Applications: customer service (product-related diagnosis), legal rulings.

Methodology
Instances are represented by rich symbolic descriptions (e.g., function graphs).
Search for similar cases; multiple retrieved cases may be combined.
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving.

Challenges
Find a good similarity metric.
Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.