
Nearest Neighbour and Clustering


- Application Score Card
- Where to use Clustering and Nearest Neighbour Prediction
- The General Idea
- How Clustering and Nearest Neighbour Prediction Work
- Case Study: Image Recognition and Human Handwriting

Nearest Neighbour and Clustering


Nearest Neighbor
- Used for prediction as well as consolidation.
- Space is defined by the problem to be solved (supervised learning).
- Generally uses only distance metrics to determine nearness.

Clustering
- Used mostly for consolidating data into a high-level view and general grouping of records into like behaviors.
- Space is defined as a default n-dimensional space, is defined by the user, or is a predefined space driven by past experience (unsupervised learning).
- Can use other metrics besides distance to determine nearness of two records - for example, linking two points together.

K Nearest Neighbors

Advantages
- Nonparametric architecture
- Simple
- Powerful
- Requires no training time

Disadvantages
- Memory intensive
- Classification/estimation is slow

K Nearest Neighbors
The key issues involved in training this model include setting:
- the variable K, typically chosen with validation techniques (e.g. cross-validation)
- the type of distance metric


Euclidean measure:

$$\mathrm{Dist}(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}$$

Figure: K Nearest Neighbors example - stored training-set patterns, the input pattern X to be classified, and the Euclidean distance measure to the nearest three patterns.

KNN algorithm:
- Store all input data in the training set.
- For each pattern in the test set, search for the K nearest patterns to the input pattern using a Euclidean distance measure.
- For classification, compute the confidence for each class as Ci/K, where Ci is the number of patterns among the K nearest patterns belonging to class i. The classification for the input pattern is the class with the highest confidence.
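As a minimal sketch of this procedure (not the slides' own code), the NumPy snippet below stores the training patterns, finds the K nearest by Euclidean distance, and reports the class with the highest confidence Ci/K; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    """Classify input pattern x using the K nearest training patterns.
    Returns the predicted class and its confidence Ci/K."""
    # Euclidean distance from x to every stored training pattern
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the K nearest patterns
    nearest = np.argsort(dists)[:k]
    # Confidence for each class = (number of the K neighbours in that class) / K
    classes, counts = np.unique(train_y[nearest], return_counts=True)
    best = np.argmax(counts)
    return classes[best], counts[best] / k

# Toy usage: two 2-D classes
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(knn_classify(train_X, train_y, np.array([0.95, 1.0]), k=3))  # (1, 0.666...)
```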

Training parameters and typical settings


Number of nearest neighbors
The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings. K=1 is a good baseline model to benchmark against. A good rule of thumb is that K should be less than the square root of the total number of training patterns.
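One hedged way to carry out this selection is cross-validation with scikit-learn; the synthetic data set, the 5-fold split, and the K range capped at the square root of N are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the training set (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

ks = range(1, int(np.sqrt(len(X))) + 1)   # rule of thumb: keep K below sqrt(N)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in ks]
print("K=1 baseline accuracy:", scores[0])
print("best K:", list(ks)[int(np.argmax(scores))])
```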

Training parameters and typical settings


Input compression
Since KNN is very storage intensive, we may want to compress the data patterns as a preprocessing step before classification. Input compression may result in slightly worse performance, but sometimes it improves performance because it performs an automatic normalization of the data, which equalizes the effect of each input in the Euclidean distance measure.

Nearest Neighbour and Clustering


- Clustering and nearest neighbour are among the oldest techniques used in data mining.
- In clustering, like records are grouped together and put into the same grouping.
- Nearest neighbour prediction is quite close to clustering: to find a prediction value for one record, look for records with similar predictor values in the historical database and use the prediction value of the record that is nearest to the unknown record.
- Example: sorting laundry uses clustering. In business, clusters are more dynamic - the cluster a record falls into may change daily or monthly, so cluster membership is harder to decide.
- Another nearest neighbour example: the income group of your neighbours. The best way to predict an unknown person's income may be to look at the people who live closest to that person.
- Nearest neighbour prediction algorithms work on a database in much the same way; many factors determine the nearness condition - a person's location, school attended, degree attained, and so on.

Business Score Card


- Measures critical to business success deal with ease of deployment and with real-world concerns: avoiding serious mistakes as well as achieving big successes.
- A data mining technique needs to be easy to use, deployable in as automated a fashion as possible, able to provide clear, understandable answers, and able to provide answers that can be converted into ROI.

Business Score Card
Automation: Nearest neighbour systems are relatively automated, although some preprocessing is performed to convert predictors into values that can be used in a measure of distance. Unordered categorical predictors (e.g. eye color) need to be defined in terms of their distance from each other rather than simply whether there is a match (for example, whether blue is close to brown).

Clarity: Excellent for clear explanation of why a prediction was made. A single example or a set of examples can be extracted from the historical database as evidence for why a prediction should or should not be made. The system can also communicate when it is not confident of its prediction.

ROI: Since the individual records of the nearest neighbours are returned directly, without altering the database, it is possible to understand all facets of business behavior and thus derive a more complete estimate of the ROI - not just from the prediction but from a variety of different factors.

Where to use clustering and nearest neighbor prediction


- Applications range from predicting personal bankruptcy to computer recognition of human handwriting.
- Clustering for clarity: like records are grouped together, giving a high-level view of what is going on in the database. Clustering as segmentation provides a bird's-eye view of the business.
- Commercial offerings such as PRIZM and MicroVision group the population by demographic information into segments. The clustering information is then used by the end user to tag the customers in the database, so the business user gets a high-level view of what is happening within each cluster. Once users have worked with these clusters, they know more about how their customers will react.

Clustering for outlier analysis


Clustering can be done to the point where some records stick out as outliers - for example, the profit of particular stores or departments.

Nearest Neighbour for Prediction


- One particular object can be closer to another object than to a third object; people have an innate sense of ordering on a variety of objects.
- An apple is closer to an orange than to a tomato; a Toyota Corolla is closer to a Honda Civic than to a Porsche.
- This sense of ordering places objects in time and space and makes sense in the real world.

This definition of nearness, which seems to be ubiquitous, also allows us to make predictions. The nearest neighbour prediction algorithm is simply stated: objects that are near to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbours.

- Classic nearest neighbour example: text retrieval. Define a document of interest and look for more documents like it; the system looks for important characteristics shared with documents that have been marked as interesting.
- The technique can be used in a wide variety of places. Successful use depends on preformatting the data so that nearness can be calculated and individual records can be defined.
- This is easy for text retrieval but not for time-series data such as stock prices, where individual records are not so easily defined.

Application Score Card


Rules are seldom used for prediction here; these techniques are used mostly for unsupervised learning.
Clusters: The underlying prediction method for nearest neighbour technology is nearness in some feature space. This is the same underlying metric used by most clustering algorithms, although for nearest neighbour the feature space is shaped in such a way as to facilitate a particular prediction.

Links: Nearest neighbour techniques can be used for link analysis as long as the data is preformatted so that the predictor values to be linked fall within the same record.
Outliers: Nearest neighbour techniques are particularly good at detecting outliers, since they effectively create a space within which it is possible to determine when a record is out of place.

Rules: One strength of nearest neighbour techniques is that they take all the predictors into account to some degree, which is helpful for prediction but makes for a complex model that cannot easily be described as a rule. The systems are also generally optimized for prediction of new records rather than exhaustive extraction of interesting rules from the database.

Sequences: Nearest neighbour techniques have been used successfully to make predictions on time sequences; the time values need to be encoded in the records.
Text: Most text retrieval systems are based on nearest neighbour technology, and most of the remaining breakthroughs come from further refinements of the predictor-weighting algorithms and the distance calculations.

General Idea
Nearest neighbour is a refinement of clustering in the sense that both use distance in some feature space to create structure either in the data or in the predictions. Nearest neighbour adds a way of automatically determining the weighting of the importance of the predictors, and hence how distance will be measured within the feature space.

Clustering is one special case in which the importance of each predictor is considered to be equivalent. Example: clustering a set of people into groups of friends.

There is no best way to cluster


- Is clustering on financial status better than clustering on eye color or on food habits?
- If clustering has no specific purpose and is just used to group data, probably all of these are acceptable.
- The reasons for clustering are often ill defined, because clusters are used more often for exploration and summarization than for prediction.

How are tradeoffs made when determining which records fall into which clusters?
Example: aged vs. young, classical vs. rock. When clustering a large number of records, these tradeoffs are explicitly defined by the clustering algorithm.

Difference betn clustering and NN


- Main distinction: clustering is an unsupervised learning technique, while nearest neighbour prediction is a supervised learning technique.
- Unsupervised: there is no particular target driving the creation of the models.
- Supervised: the model is built for prediction, so the patterns presented are the most important thing.

N-dimensional space
Both clustering and nearest neighbour work in an n-dimensional space. In order to define what is near and what is far away, it is helpful to have a space in which distance can be calculated.

How is the space for clustering and nearest neighbor defined?


- Clustering: an n-dimensional space is built by assigning one predictor to each dimension.
- Nearest neighbour: predictors are also mapped to dimensions, but those dimensions are then stretched or compressed according to how important the particular predictor is in making the prediction. Stretching a dimension makes it more important than the others.

How clustering and NN prediction work


Looking at an n-dimensional space of database records - for example, a plot of age vs. income:
- Cluster 1: retirees with modest incomes
- Cluster 2: middle-aged weekend golfers
- Cluster 3: wealthy young people with exclusive club memberships
- plus an additional set of records that do not fit any of these (outliers).

The distance between a cluster and a given data point is measured from the center of mass of the cluster. The center of mass can be calculated as the average of the predictor values of the cluster's records. Clusters can be defined solely by their center, or by their center with some radius attached, in which case all points that fall within the radius are classified into that cluster.

The center record is a prototypical record. Normal database records are mapped onto the n-dimensional space; with 2 or 3 dimensions this is easy to visualize, with more dimensions it becomes complex.

How is nearness defined?


Clustering and nearest neighbour both work in an n-dimensional space, where one record can be close to or far from another record. The simplest definition of nearness: any record in the historical database that is exactly the same as the record to be predicted is considered close, and anything else is far away.

The difficulty with this strategy is that exact matches are unlikely in a real database, and a perfectly matching record may be spurious. Better results are obtained by taking a vote among several nearby records.

Two other distances are commonly used:
- Manhattan distance: adds up the differences in each predictor between the historical record and the record to be predicted.
- Euclidean distance (Pythagoras): the distance between two points in n dimensions, obtained by squaring the differences of the predictor values for the two records and taking the square root of the sum.

Example - distance between records xyz and abc: age 6, salary 3100, eye color 0, gender 1, income band 1 (high = 3, medium = 2, low = 1). Total difference = 3108.

The difference is dominated by salary; whether the other fields match or not hardly matters. To balance this, use normalized values on a 0-100 scale. The maximum salary difference in the data set is 16543; the difference between xyz and abc is 3100, which is about 19% of the maximum. The normalized distance becomes 6 + 19 + 0 + 100 + 100 = 225.
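The arithmetic above can be reproduced with a small sketch; the per-field maxima other than the quoted salary figure of 16543, and the choice to score categorical mismatches as the full 100, are assumptions made to match the totals in the text.

```python
# Raw per-field differences between records xyz and abc (values from the example)
raw_diff = {"age": 6, "salary": 3100, "eye_color": 0, "gender": 1, "income_band": 1}

# Maximum possible difference per field over the data set (salary max from the text;
# the other maxima are illustrative assumptions; categorical mismatches count as full scale)
max_diff = {"age": 100, "salary": 16543, "eye_color": 1, "gender": 1, "income_band": 1}

# Unnormalized Manhattan distance is dominated by salary
print(sum(raw_diff.values()))                                  # 3108

# Normalized to a 0-100 scale per field: 6 + 19 + 0 + 100 + 100
normalized = sum(100.0 * raw_diff[f] / max_diff[f] for f in raw_diff)
print(round(normalized))                                       # approximately 225
```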

Weighting the dimensions: dist with a purpose:


When a very high-income record (say, Mukesh Ambani) is added, an outlier appears in a clustering of age vs. income, and normalizing does not help in this case. So when "near" is defined, how important is each dimension's contribution? The answer depends on what is to be accomplished.

Calculating dimension weights


There are several automatic ways of calculating the importance of different dimensions. In document classification and prediction, for example, the dimensions of the space are often the individual words contained in the document. Whether the word "entrepreneur" occurs is significant; whether the word "the" occurs, even several times, is of little significance.

Two weighting schemes:
1. Inverse frequency: if "the" occurs in 10,000 documents, its weight is 1/10000 = 0.0001; if "entrepreneur" occurs in 100 documents, its weight is 1/100 = 0.01.
2. Importance of the word for the topic to be predicted: if the topic is starting a small business, words such as "entrepreneur" and "venture capital" are given higher weights.

Data mining on documents is a special situation: there are many dimensions and all of them are binary. Other business problems mix binary (gender), categorical (eye color), and numeric (revenue) dimensions. Each dimension is weighted depending on its relevance to the topic to be predicted; the weight can be calculated as the correlation between the predictor and the prediction value,

or as the conditional probability that the prediction has a certain value given that the predictor has a certain value. Dimension weights can also be calculated via search algorithms: random weights are tried initially and then slowly modified to improve the accuracy of the system.
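A minimal sketch of the inverse-frequency weighting scheme described above; the helper names and the use of the weights inside a Euclidean distance are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def inverse_frequency_weights(doc_freq):
    """Weight each word dimension by 1 / (number of documents containing it)."""
    return {word: 1.0 / freq for word, freq in doc_freq.items()}

# From the example: 'the' appears in 10,000 documents, 'entrepreneur' in 100
weights = inverse_frequency_weights({"the": 10000, "entrepreneur": 100})
print(weights)   # {'the': 0.0001, 'entrepreneur': 0.01}

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over binary word dimensions (illustrative)."""
    return np.sqrt(sum(w * (a[d] - b[d]) ** 2 for d, w in weights.items()))

doc1 = {"the": 1, "entrepreneur": 1}
doc2 = {"the": 1, "entrepreneur": 0}
print(weighted_distance(doc1, doc2, weights))   # only 'entrepreneur' contributes
```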

Hierarchical and nonhierarchical clustering


Hierarchical clustering builds clusters from small to big. It is unsupervised learning, and fewer or greater numbers of clusters may be desired depending on the application. At one extreme there are as many clusters as there are records; in that case the records within each cluster are optimally similar to each other (there is only one per cluster) but different from the other clusters.

Such a clustering probably cannot find useful patterns: there is no summary information, and the data is not understood any better. Fewer clusters than the original number of records is better. The advantage of hierarchical clustering is that it allows end users to choose either many clusters or only a few.

A hierarchical clustering can be viewed as a tree: smaller clusters merge to create the next highest level of clusters, which merge again at that level, and so on. The user can decide what number of clusters adequately summarizes the data while still providing useful information. A single cluster gives great summarization but does not provide any specific information.

Two approaches to hierarchical clustering:
1. Agglomerative: start with as many clusters as there are records, each cluster containing one record. The clusters that are nearest to each other are merged, and this continues until there is a single cluster containing all records at the top of the hierarchy.

2. Divisive: take the opposite approach. Start with all records in one cluster, split it into smaller pieces, and then try to split those further.

Non-hierarchical clustering is faster to create from the historical database, but the user must make decisions about the number of clusters desired or the nearness required, often running the algorithm multiple times. It either starts with an arbitrary clustering and iteratively improves it by shuffling records, or builds clusters by taking one record at a time according to some criterion.

Nonhierarchical clustering
Two non-hierarchical approaches:
1. Single-pass methods: the database is passed through only once in order to create the clusters.
2. Reallocation methods: records are moved (reallocated) from one cluster to another to create better clusters, requiring multiple passes through the database. Both are faster than hierarchical clustering.

Algorithm for the single-pass technique (see the sketch below):
- Read in a record from the database and determine the cluster it best fits (by the measure of nearness).
- If even the nearest cluster is still far away, create a new cluster with this record.
- Read the next record and repeat.
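A sketch of the single-pass method under the stated steps; the Euclidean nearness measure, the fixed distance threshold, and the running-centre update are assumptions.

```python
import numpy as np

def single_pass_cluster(records, threshold):
    """Assign each record, in the order read, to the nearest existing cluster,
    or start a new cluster if the nearest centre is still too far away."""
    centres, members = [], []
    for rec in records:
        if centres:
            dists = [np.linalg.norm(rec - c) for c in centres]
            best = int(np.argmin(dists))
        if not centres or dists[best] > threshold:
            centres.append(rec.astype(float))                   # new cluster seeded by this record
            members.append([rec])
        else:
            members[best].append(rec)
            centres[best] = np.mean(members[best], axis=0)      # update the running centre
    return centres, members

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centres, members = single_pass_cluster(data, threshold=1.0)
print(len(centres))   # 2 clusters
```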

Reading records is expensive, so the single-pass method scores well on speed. Its problem is that it can create overly large clusters: decisions made early on, and the sequence in which records are processed, affect the result. Reallocation solves this problem by readjusting the clusters, optimizing the similarity within them.

Algorithm for reallocation (see the sketch below):
1. Preset the number of clusters desired.
2. Randomly pick a record to become the center, or seed, of each of these clusters.
3. Go through the database and assign each record to the nearest cluster.
4. Recalculate the centers of the clusters.
5. Repeat steps 3 and 4 until there is minimal reallocation between clusters.
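A compact sketch of the reallocation loop above (essentially a k-means iteration); the random seeding and the convergence test are illustrative choices, and empty clusters are not handled for simplicity.

```python
import numpy as np

def reallocation_cluster(records, n_clusters, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: preset the number of clusters and pick random records as seeds
    centres = records[rng.choice(len(records), n_clusters, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign each record to the nearest cluster centre
        dists = np.linalg.norm(records[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the centres of the clusters
        new_centres = np.array([records[labels == k].mean(axis=0)
                                for k in range(n_clusters)])
        # Step 5: repeat until there is (almost) no reallocation between clusters
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

data = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
                  np.random.default_rng(2).normal(3, 0.2, (20, 2))])
labels, centres = reallocation_cluster(data, n_clusters=2)
print(np.bincount(labels))   # roughly 20 records per cluster
```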

Records assigned initially may not be good fits, but by recalculating the centers, clusters that better match the data are formed: each center moves toward regions of high density and away from outliers. Predefining the number of clusters may be a worse idea than letting the number be driven by the data.

There is no one right answer as to how clustering should be done.

HC
Hierarchical clustering has an advantage over non-hierarchical clustering: the clusters are defined solely by the data (no predetermined number). The number of clusters can be increased or decreased by moving down or up the hierarchy. The hierarchy can be built either from the top, by dividing further, or from the bottom, by merging records at every level.

Merges or splits are usually done two clusters at a time.
Agglomerative algorithm:
- Start with as many clusters as there are records, with one record in each cluster.
- Combine the two nearest clusters into a larger cluster.
- Continue until only one cluster remains.

Divisive algorithm:
- Start with one cluster that contains all the records in the database.
- Determine the division of the existing cluster that best maximizes similarity within clusters and dissimilarity between clusters.
- Divide the cluster and repeat on the two smaller clusters.
- Stop when some minimum threshold of cluster size or total number of clusters has been reached, or when there is only one record per cluster.

Divisive techniques are quite expensive to compute, since they consider separating a cluster into every possible pair of smaller clusters and pick the best one (minimum average distance). Agglomerative techniques are therefore usually preferred; there the decisions are about which clusters to merge.

Ward's method: join the clusters whose resulting merged cluster has the minimum total distance between all records. It produces a symmetric hierarchy and is good at recovering cluster structure, but it is sensitive to outliers and has difficulty recovering elongated structures.

The merge decision can be made in several ways. Single link method: join the clusters whose nearest records are as near as possible. Because clusters can be joined on a single nearest pair of records, this technique can create long, snake-like clusters and is not good at extracting classical spherical, compact clusters.

Complete link method: join the clusters whose most distant records are as near as possible, so that all records are linked within some maximum distance; this favours compact clusters. Group average link method: join the clusters where the average distance between all pairs of records is as small as possible; by considering both the nearest and the most distant records, the resulting clusters range between the elongated single-link and the tight complete-link clusters.
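These merge rules correspond to standard linkage options in SciPy's hierarchical clustering; the toy data and the cut of the tree into two clusters below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Agglomerative hierarchies under the merge rules described above:
# 'single' = nearest-pair link, 'complete' = most-distant-pair link,
# 'average' = group average link, 'ward' = Ward's minimum-variance method.
for method in ("single", "complete", "average", "ward"):
    tree = linkage(data, method=method)
    labels = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels))
```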

Implementation of KNN method for object recognition.

Outline
- Introduction
- Description of the problem
- Description of the method
- Image library
- Process of identification
- Example
- Future work

Introduction


Generally speaking, the problem of object recognition is how to teach a computer to recognize different objects in a picture. This is a nontrivial problem. Some of the main difficulties are separating an object from the background (especially in the presence of clutter or occlusions) and recognizing an object under different lighting.

Introduction.


In this research I am trying to improve the accuracy of object recognition by implementing the KNN method with a new weighted Hamming-Levenshtein distance that I developed.

Description of the problem.

The problem of object recognition can be divided into two parts: 1) locating an object in the picture; 2) identifying the object. For example, assume that we have the following picture:

Description of the problem.

Description of the problem.




and we have the following library of images that we will use for object identification:

Description of the problem.




Our goal is to identify and locate objects from our library on the picture.

Description of the problem.




In this research I have developed a method of object identification assuming that we already know the location of the object; I plan to develop the location method in my future work.

Description of the method.




We will use the KNN method to identify objects. For example, assume we need to identify an object X in a given picture. Let us consider the space of pictures generated by the image of X and the images from our library.

Description of the method.


[Figure: the space of pictures, with library images A1-A3, B1-B3, C1-C2 plotted around the unknown image X]


In this space we pick, say, the 5 images closest to X and identify X by the plurality class of those nearest pictures.

Nearest neighbors: A1, B1, A2, B2, A3

Description of the method.




In order to use the KNN method we need to introduce a measure of similarity between two pictures. First of all, to say something about similarity between pictures, we need some idea of the shape of the objects in them. To do this we use an edge-detection method (the Sobel method, for example).

Description of the method.




Next, we turn the edge-detected picture into a bit array by thresholding the intensities to 0 or 1. In fact, we keep the images in the library in this form.
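A sketch of this preprocessing step using SciPy's Sobel filters; the threshold value and the synthetic test image are assumptions, since the slides do not prescribe a particular implementation.

```python
import numpy as np
from scipy import ndimage

def edge_bitmap(image, threshold=0.5):
    """Edge-detect an image with the Sobel operator and threshold to a 0/1 bit array."""
    gx = ndimage.sobel(image, axis=0)        # derivative along axis 0 (vertical)
    gy = ndimage.sobel(image, axis=1)        # derivative along axis 1 (horizontal)
    magnitude = np.hypot(gx, gy)
    magnitude /= magnitude.max() or 1.0      # scale to [0, 1], avoiding division by zero
    return (magnitude > threshold).astype(np.uint8)

# Synthetic test image: a bright square on a dark background
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
print(edge_bitmap(img))
```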

Description of the method.




Now, in order to compare two pictures, we need to compare two 2-dimensional bit arrays. It may seem natural to use the traditional Hamming distance for bitstrings, which is defined as follows: given two bitstrings of the same dimension, the Hamming distance is the minimum number of symbol changes needed to change one bitmap into the other.

Description of the method.

For example, the Hamming distance between (A) 10001001 and (B) 11100000 is 4. Notice that the Hamming distance between (A) 10001001 and (C) 10010010 is also 4, but intuitively one can regard (C) as a better match for (A) than (B).
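A quick check of the quoted distances with a minimal helper (not the slides' own code):

```python
def hamming(a, b):
    """Number of positions at which two equal-length bitstrings differ."""
    return sum(x != y for x, y in zip(a, b))

A, B, C = "10001001", "11100000", "10010010"
print(hamming(A, B), hamming(A, C))   # 4 4 -- yet C is intuitively the better match for A
```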

Description of the method.




We can modify the Hamming distance using the idea of the Levenshtein distance, which is usually used for comparing text strings and is obtained by finding the cheapest way to transform one string into another. The transformations are the one-step operations of insertion, deletion, and substitution, each with a certain cost.

Description of the method.




Also, since different parts of an image have different levels of importance in the recognition process, we can assign a weight value to each pixel of the image and use it in the definition of the distance. For example, we can eliminate the background of a picture by assigning zero weight to the corresponding pixels.

Description of the method.




To get the weighted Hamming-Levenshtein distance between two pictures we:
- divide each bitstring into several substrings of the same length;
- compare corresponding substrings using the Levenshtein distance;
- and sum all these distances, each multiplied by the average weight of its substring.
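A sketch of the weighted Hamming-Levenshtein distance as just described, treating each row of the bit array as one substring; the unit edit costs, the row-wise split, and the toy arrays are assumptions about details the slides leave open.

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def weighted_hamming_levenshtein(bits_a, bits_b, weights):
    """Compare two 2-D bit arrays row by row with Levenshtein distance,
    weighting each row's distance by its average pixel weight."""
    total = 0.0
    for row_a, row_b, row_w in zip(bits_a, bits_b, weights):
        total += row_w.mean() * levenshtein(row_a.tolist(), row_b.tolist())
    return total

# Toy example: two 4x4 edge bitmaps and a weight array that ignores the last row
a = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
b = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 1, 1]])
w = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]], dtype=float)
print(weighted_hamming_levenshtein(a, b, w))   # only the third row contributes
```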

Image library.


Each object in the library is represented by several images taken under different lighting and from different sides. Each image in the library is represented by two 2-dimensional arrays: the first contains the edge-detected picture turned into a bit array, and the second contains the weight value assigned to each pixel.

Process of identification.


To identify an object, we turn its edge-detected image into a bit array by thresholding the intensities to 0 or 1. Then we measure the distance between this image and each image in our library, using the corresponding weight arrays and the weighted Hamming-Levenshtein distance. Finally, we identify the object using the KNN method.

Example.


Below are some results of object identification obtained using the method described above.

Example.


Assume that we have the image library with the following edge-detected images of objects and weighted images.

Example.

Example.


We want to identify objects from this picture.

Example.


Let us try to identify the following picture.

Picture 1

Example.


We compare this picture with each image in our library, and we get the following table of distances.

Example.
Distances from Picture 1 to the library images:

Image:     Bear 1  Bear 2  Bear 3  Cat 1   Cat 2   Cat 3   Dog 1   Dog 2   Dog 3
Distance:  876     21009   24495   27401   25986   24538   21629   26809   25546


If we select the three closest neighbours of Picture 1, we can identify it as a Bear.

Example.


Let us do similar calculations for these two pictures:

Picture 2.

Picture 3.

           Bear 1  Bear 2  Bear 3  Cat 1   Cat 2   Cat 3   Dog 1   Dog 2   Dog 3
Picture 2  31678   24644   31662   1864    22798   22242   23087   25679   25785
Picture 3  32629   23790   32150   28687   25655   25824   1577    24042   23880

Future work.


- Develop a method for locating an object in a picture.
- Develop an approach to reasonable weight distribution on the images in the library.
- Improve the identification algorithm to allow comparison of pictures of different sizes.
- Continue working on improving the definition of the weighted Hamming-Levenshtein distance.

Introduction
Objective: to recognise images of handwritten digits based on classification methods for multivariate data - Optical Character Recognition (OCR).
Predict the label of each image using the classification function learned from the training data.

OCR is basically a classification task on multivariate data


- Variables: pixel values
- Classes: each type of character

Handwritten Digit data


- 16 x 16 (= 256 pixel) grey-scale images of digits in the range 0-9, with pixel values x_ij in [0, 1]
- X_i = [x_i1, x_i2, ..., x_i256], y_i in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- 9298 labelled samples in total; training set of ~1000 images, test set randomly selected from the full database

[Figure: a sample 16 x 16 grey-scale digit image]

Basic idea
Correctly identify the digit given an image

Dimension reduction - PCA



PCA is done on the mean-centered images. The eigenvectors of the 256 x 256 covariance matrix are called the eigen-digits (each 256-dimensional); the larger an eigenvalue, the more important the corresponding eigen-digit. The ith PC of an image X is y_i = e_i X.
[Figure: the average digit image and the leading eigen-digits, each displayed as a 16 x 16 image]

PCA (continued)
Based on the eigenvalues, the first 64 PCs were found to be significant, capturing ~92.74% of the variance.

Any image can then be represented by its PCs: Y = [y_1, y_2, ..., y_64]. The reduced data matrix has 64 variables: Y is a 1000 x 64 matrix.
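A sketch of this dimensionality-reduction step; scikit-learn's 8 x 8 digits data set stands in for the 16 x 16 data used in the slides, so the covariance matrix is 64 x 64 rather than 256 x 256 and the number of retained PCs (20) is illustrative rather than the slides' 64.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)           # X: n_samples x 64 pixel values

X_mean = X.mean(axis=0)
X_centered = X - X_mean                       # mean-centre the images
cov = np.cov(X_centered, rowvar=False)        # pixel covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvectors = "eigen-digits"
order = np.argsort(eigvals)[::-1]             # largest eigenvalue = most important
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_pc = 20                                     # number of PCs to keep (illustrative)
E = eigvecs[:, :n_pc]                         # columns are the leading eigen-digits
Y = X_centered @ E                            # reduced data matrix: n_samples x n_pc
print("variance captured: %.2f%%" % (100 * eigvals[:n_pc].sum() / eigvals.sum()))
```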

[Figure: cumulative percentage of variance explained vs. number of principal components; the first 64 components explain 92.74% of the variance]

The eigenvectors are a rotation of the original axes to more meaningful directions, and the PCs are the projections of the data onto these new axes.

Image reconstruction: the original image can be reconstructed by projecting the PCs back onto the old axes. Using the most significant PCs gives a reconstructed image that is close to the original. These features can then be used for further investigation - for example, classification.

Interpreting the PCs as Image Features

Image Reconstruction
- Mean-centered image: I = X - X_mean
- PCs as features: y_i = e_i I; Y = [y_1, y_2, ..., y_64] = E I, where E = [e_1 e_2 ... e_64]
- Reconstruction: X_recon = E*Y + X_mean
[Figure: an actual image from the test set alongside reconstructions using all 256, 150, and 64 principal components]
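A sketch of the reconstruction experiment shown in the figure; again the 8 x 8 scikit-learn digits stand in for the 16 x 16 data, so the component counts and errors are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)           # 8x8 digits standing in for the 16x16 data

for n_pc in (64, 20, 10):                     # 64 PCs = all pixels for this smaller data set
    pca = PCA(n_components=n_pc).fit(X)
    Y = pca.transform(X)                      # PCs as features
    X_recon = pca.inverse_transform(Y)        # projects PCs back: E*Y + X_mean, as on the slide
    err = np.abs(X - X_recon).mean()
    print(f"{n_pc} PCs: mean reconstruction error {err:.3f}")
```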

Normality test on PCs


[Figure: Q-Q plots of sample data versus the standard normal distribution for principal components 1, 3, 5, 10, 20, 30, 40, 50, and 60]

Classification
- The principal components are used as the features of the images.
- LDA, assuming multivariate normality of the feature groups and a common covariance matrix.
- Fisher discriminant procedure, which assumes only a common covariance matrix.

Classification (contd.)
- Equal cost of misclassification.
- Misclassification error rates: APER based on the training data, AER on the validation data, averaged over several random samplings of training and validation data from the full data set.
- Error rates using different numbers of PCs were compared.

Performing LDA
- Prior probabilities of each class were taken as the frequency of that class in the data.
- Equal covariance matrices is a strong assumption; error rates were used to check the validity of the assumption, and S_pooled was used for the covariance matrix.

LDA Results
No of PCs   APER %   AER %
256         1.8      13.63
150         4        10.91
64          6.4      10.087
APER underestimates the AER. Using 64 PCs is better than using 150 or 256 PCs: the PCs with lower eigenvalues tend to capture the noise in the data.

Fisher Discriminants
- Uses equal prior probabilities and covariances.
- The number of discriminants can be r <= 9.
- When all discriminants are used (r = 9), Fisher is equivalent to LDA (verified by the error rates).
- Error rates with different values of r were compared.

Fisher Discriminant Results


r = 2 discriminants

No of PCs   APER %   AER %
256         32       45
150         34.5     42
64          37.4     40

Both AER and APER are very high

Fisher Discriminant Results


r = 7 discriminants

No of PCs   APER %   AER %
256         3.2      14.1
150         4.8      12.4
64          7.9      10.8

Considerable improvement in AER and APER; performance is close to LDA, and using 64 PCs is again better.

Fisher Discriminant Results


r = 9 (all) discriminants

No of PCs   APER %   AER %
256         1.6      13.21
150         4.3      10.55
64          6.4      9.86

No significant performance gain over r = 7; error rates are approximately those of LDA (as expected).

Nearest Neighbour Classifier


- Finds the nearest neighbour in the training set to the test image and assigns its label to the test image.
- No assumption about the distribution of the data.
- Euclidean distance is used to find the nearest neighbour.

[Figure: a test point lying among Class 1 and Class 2 training points is assigned to Class 2, the class of its nearest neighbour]

K-Nearest Neighbour Classifier (KNN)


Compute the k nearest neighbours and assign the class by majority vote.

[Figure: with k = 3, the test point is assigned to Class 1 (2 votes) over Class 2 (1 vote)]

1-NN Classification Results:


No of PCs   AER %
256         7.09
150         7.01
64          6.45

Test error rates have improved compared to LDA and Fisher; using 64 PCs gives better results. Using higher values of k does not improve the recognition rate.
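A sketch of the full PCA-plus-nearest-neighbour pipeline with scikit-learn; the 8 x 8 digits data, the 20 retained PCs, and the train/test split are assumptions, so the error rate will not match the table above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=20).fit(X_train)       # PCs as image features (20 is illustrative)
knn = KNeighborsClassifier(n_neighbors=1)     # 1-NN with Euclidean distance
knn.fit(pca.transform(X_train), y_train)
aer = 1 - knn.score(pca.transform(X_test), y_test)
print(f"test error rate: {aer:.2%}")
```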

Misclassification in NN:
Recognised as vs. actual digit:

[Table: 10 x 10 confusion matrix for the NN classifier, actual digit on the rows and recognised digit on the columns; the diagonal counts (1376 for digit 0, 1113 for digit 1, 728 for digit 2, ...) dominate]

Euclidean distances between transformed images of the same class can be very high.

