The space is defined by the problem to be solved (supervised learning), or is defined as a default n-dimensional space, is defined by the user, or is a predefined space driven by past experience (unsupervised learning).
Metrics other than distance can be used to determine the nearness of two records, for example linking two points together.
K Nearest Neighbors
Advantages:
- Nonparametric architecture
- Simple
- Powerful
- Requires no training time

Disadvantages:
- Memory intensive
- Classification/estimation is slow
K Nearest Neighbors
The key issues involved in training this model include:
- setting the variable K
- validation techniques (e.g., cross-validation)
$\mathrm{Dist}(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}$
[Figure: stored training-set patterns with an input pattern X to classify; Euclidean distances to the nearest three patterns are shown.]
Search for the K nearest patterns to the input pattern using a Euclidean distance measure
For classification, compute the confidence for each class as Ci/K, where Ci is the number of patterns among the K nearest patterns belonging to class i. The classification for the input pattern is the class with the highest confidence.
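A minimal sketch of this search-and-vote step in Python with NumPy (array names and the tie-breaking are illustrative, not from the slides):

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    """Classify input pattern x by a vote among its k nearest
    stored training patterns, using Euclidean distance."""
    # Euclidean distance from x to every stored pattern
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k nearest patterns
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    confidence = counts / k                    # C_i / K for each class i
    best = np.argmax(confidence)               # class with the highest confidence
    return labels[best], confidence[best]
```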
Use the prediction value of the record that is nearest to the unknown record. Example: a laundry service uses clustering. In business, clusters are more dynamic: the cluster a record falls into may change daily or monthly, and is therefore difficult to decide on. Another nearest-neighbor example: the income group of one's neighbors.
The best way to predict an unknown person's income is possibly to look at the closest persons. The nearest-neighbor prediction algorithm works on a database in much the same way. Many factors enter the nearness condition: the person's location, school attended, degree attained, and so on.
Business Score Card (BSC)
Automation: nearest-neighbor techniques are relatively automated, although some preprocessing is needed to convert predictors into values that can be used in a distance measure. Unordered categorical predictors (e.g., eye color) need distances defined between their values beyond an exact match (e.g., whether blue is close to brown).
Clarity: excellent for clearly explaining why a prediction was made. A single example, or a set of examples, can be extracted from the historical database as evidence for why a prediction should or should not be made. The system can also communicate when it is not confident of its prediction.
ROI: since the individual nearest-neighbor records are returned directly, without altering the database, it is possible to understand all facets of business behavior and thus derive a more complete estimate of the ROI, not just from the prediction but from a variety of different factors.
The population is grouped by demographic information into segments. The clustering information is then used by the end user to tag the customers in the database. The business user gets a high-level view of what is happening within each cluster. Once they have worked with these clusters, users will know more about how their customers react.
The definition of nearness that seems to be ubiquitous also allows us to make predictions. The nearest-neighbor prediction algorithm can be stated simply: objects that are near to each other will have similar prediction values. Thus, if you know the prediction value for one object, you can predict it for its nearest neighbors.
The classic nearest-neighbor example is text retrieval: define a document, then look for more documents like it. Nearest neighbor looks for the important characteristics shared with documents that have been marked as interesting. The technique can be used in a wide variety of places; successful use depends on preformatting the data so that nearness can be calculated and individual records can be defined.
This is easy for text retrieval, but not for time-series data such as stock prices, where there are no inherently defined records.
Links: nearest-neighbor techniques can be used for link analysis as long as the data is preformatted so that the predictor values to be linked fall within the same record. Outliers: nearest-neighbor techniques are particularly good at detecting outliers, since they effectively create a space within which it is possible to determine when a record is out of place.
Rules: one strength of nearest-neighbor techniques is that they take all the predictors into account to some degree, which is helpful for prediction but makes for a complex model that cannot easily be described as a rule. The systems are also generally optimized for predicting new records rather than for exhaustively extracting interesting rules from the database.
Sequences: nearest-neighbor techniques have been used successfully to make predictions on time sequences; the time values need to be encoded into records. Text: most text retrieval systems are based on nearest-neighbor techniques, and most of the remaining breakthroughs come from further refinements of the predictor-weighting algorithms and the distance calculations.
General Idea
Nearest neighbor is a refinement of clustering, in the sense that both use distance in some feature space to create either structure in the data or predictions. Nearest neighbor adds a way of automatically determining the weighting of the predictors' importance and how distance will be measured within the feature space.
Clustering is one special case in which the importance of every predictor is considered equivalent. Example: clustering a set of people into groups of friends.
How are tradeoffs made when determining which records fall into which clusters?
Example: aged vs. young, classical vs. rock. When clustering a large number of records, these tradeoffs are explicitly defined by the clustering algorithm.
N-dimensional space
Clustering and nearest neighbor both work in an n-dimensional space. In order to define what is near and what is far away, it is helpful to have a space in which distance can be calculated.
The distance between a cluster and a given data point is measured from the center of mass of the cluster. The center of mass can be calculated as the average of the predictor values. Clusters are defined either solely by their center, or by their center with some radius attached, in which case all points that fall within the radius are classified into that cluster.
The center record is the prototypical record. Normal database records are mapped onto the n-dimensional space. Two or three dimensions are easy to visualize; more dimensions are complex.
The difficulty with this strategy is that exact matches of records in the database are unlikely, and a perfectly matching record may be spurious. Better results come from taking a vote among several nearby records.
Two other distances: the Manhattan distance adds up the differences between each predictor of the historical record and the record to be predicted. The Euclidean distance (Pythagorean) measures the distance between two points in n dimensions by squaring the differences of the predictor values for the two records and taking the square root of the sum.
Distance between records xyz and abc: age: 6; salary: 3100; eye color: 0; gender: 1; income: 1 (coded high = 3, medium = 2, low = 1). Total difference = 3108.
The difference is dominated by salary; whether the other fields match barely matters. To balance this, use values normalized to the range 0 to 100. The maximum salary difference in the data set is 16,543; between xyz and abc it is 3,100, which is 19% of the maximum. The normalized total becomes 6 + 19 + 0 + 100 + 100 = 225.
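A sketch of this 0-100 normalization in Python; only the maximum salary difference of 16,543 is taken from the text, and the records and field names are illustrative:

```python
def normalized_distance(rec_a, rec_b, max_diff):
    """Manhattan-style distance with every predictor difference
    rescaled to the 0-100 range so no single field dominates."""
    total = 0.0
    for field, value in rec_a.items():
        if isinstance(value, (int, float)):
            # Numeric field: difference as a percentage of the largest
            # difference observed for that field in the data set.
            total += 100.0 * abs(value - rec_b[field]) / max_diff[field]
        else:
            # Categorical field: 0 on a match, 100 on a mismatch.
            total += 0.0 if value == rec_b[field] else 100.0
    return total

# Illustrative records; 16,543 is the maximum salary difference from the text.
xyz = {"salary": 29100, "eye_color": "blue", "gender": "M"}
abc = {"salary": 26000, "eye_color": "blue", "gender": "F"}
print(normalized_distance(xyz, abc, {"salary": 16543}))  # 19 + 0 + 100, about 118.7
```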
Weights: 1. Inverse frequency is often used: "the" occurred in 10,000 documents, so its weight is 1/10,000 = 0.0001; "entrepreneur" occurred in 100 documents, so its weight is 1/100 = 0.01. 2. Weight by the importance of the word for the topic to be predicted: if the topic is starting a small business, words such as "entrepreneur" and "venture capital" are given higher weights.
Data mining on documents is a special situation: there are many dimensions, and all dimensions are binary. Other business problems mix binary (gender), categorical (eye color), and numeric (revenue) dimensions. Each dimension is weighted depending on its relevance to the topic to be predicted; the weight can be calculated as the correlation between the predictor and the prediction value.
Or as the conditional probability that the prediction has a certain value given that the predictor has a certain value. Dimension weights can also be calculated via a search algorithm: random weights are tried initially, then slowly modified to improve the accuracy of the system.
Such clustering probably cannot find useful patterns: there is no summary information, and the data is not understood any better. Fewer clusters than original records is better. The advantage of hierarchical clustering (HC) is that it allows end users to choose from either many clusters or only a few.
Hierarchical clustering can be viewed as a tree: smaller clusters merge to create the next higher level of clusters, which merge again at that level, and so on. The user can decide what number of clusters adequately summarizes the data while still providing useful information. A single cluster gives great summarization but does not provide any specific information.
There are two algorithmic approaches to hierarchical clustering: 1. Agglomerative: agglomerative techniques start with as many clusters as there are records, each cluster containing one record. The clusters nearest to each other are merged, and this continues until there is a single cluster containing all records at the top of the hierarchy.
2. Divisive: divisive techniques take the opposite approach. They start with all records in one cluster, split it into smaller pieces, and then try to split those further.
Non-hierarchical clustering is faster to create from a historical database. The user decides on the number of clusters desired, or on the nearness required, and the algorithm may be run multiple times. Either start with an arbitrary clustering and iteratively improve it by shuffling records, or create the clusters by taking one record at a time according to some criterion.
Nonhierarchical clustering
Two non-hierarchical approaches: 1. Single-pass methods: the database is passed through only once in order to create the clusters. 2. Reallocation methods: records are moved, or reallocated, from one cluster to another to create better clusters; these make multiple passes through the database. Both are faster than hierarchical clustering.
Algorithm for the single-pass technique: read a record from the database and determine the cluster it best fits (by a measure of nearness). If even the nearest cluster is still far away, start a new cluster with this record. Read the next record and repeat.
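A minimal sketch of the single-pass method, assuming records are NumPy vectors and `threshold` is the user's nearness requirement (both names are illustrative):

```python
import numpy as np

def single_pass_cluster(records, threshold):
    """One pass over the data: each record joins the nearest existing
    cluster, or starts a new one if the nearest is still too far away."""
    centers, members = [], []
    for rec in records:
        if centers:
            dists = [np.linalg.norm(rec - c) for c in centers]
            i = int(np.argmin(dists))
            if dists[i] <= threshold:
                members[i].append(rec)
                # Update the center as the mean of the cluster's records.
                centers[i] = np.mean(members[i], axis=0)
                continue
        # Nearest cluster too far away (or none yet): start a new cluster.
        centers.append(np.asarray(rec, dtype=float))
        members.append([rec])
    return centers, members
```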
Reading records is expensive, so single-pass methods score better there. The problem is that clusters can grow too large, decisions made early cannot be revisited, and the sequence in which records are processed matters. Reallocation solves this problem by readjusting the clusters, optimizing similarity.
Algorithm for reallocation: preset the number of clusters desired. Randomly pick a record to become the center, or seed, of each cluster. Go through the database and assign each record to the nearest cluster. Recalculate the centers of the clusters. Repeat the assignment and recalculation steps until there is minimal reallocation between clusters.
Records assigned initially may not be good fits; by recalculating the centers, clusters that actually match better are formed, with each center moving toward high-density regions and away from outliers. Predefining the number of clusters may be a worse idea than letting it be driven by the data. A sketch of the reallocation loop follows below.
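A minimal sketch of the reallocation method (essentially a k-means-style loop) in NumPy; the convergence test and the seeding are illustrative choices:

```python
import numpy as np

def reallocation_cluster(X, n_clusters, max_iter=100, seed=0):
    """Reallocation method: seed clusters with random records, assign
    every record to its nearest center, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]  # random seed records
    for _ in range(max_iter):
        # Assign each record to the nearest cluster center.
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Recalculate each center (keep the old one if a cluster emptied).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):   # minimal reallocation: stop
            break
        centers = new_centers
    return labels, centers
```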
Hierarchical clustering
Hierarchical clustering has an advantage over non-hierarchical clustering: the clusters are defined solely by the data (no predetermined number). The number of clusters can be increased or decreased by moving down or up the hierarchy. The hierarchy can be built either from the top, by dividing further, or from the bottom, by merging records at every level.
Merges or splits are usually done two at a time. Agglomerative algorithm: start with as many clusters as there are records, with one record in each cluster; combine the two nearest clusters into a larger cluster; continue until only one cluster remains.
Divisive algorithm: start with one cluster that contains all the records in the database. Determine the division of the existing cluster that best maximizes similarity within clusters and dissimilarity between clusters. Divide the cluster and repeat on the two smaller clusters. Stop when some minimum threshold of cluster size or total number of clusters has been reached, or when there is only one record per cluster.
Divisive techniques are quite expensive to compute: they separate a cluster into every possible smaller cluster and pick the best split (minimum average distance). Agglomerative techniques are preferred, where decisions are made only to merge clusters.
Join the clusters whose resulting merged cluster has the minimum total distance between all records: Ward's method. It produces a symmetric hierarchy and is good at recovering cluster structure, but it is sensitive to outliers and has difficulty recovering elongated structure.
Merge decisions can be made in several ways. Join the clusters whose nearest records are as near as possible: the single-link method. Because clusters can be joined on a single nearest pair of records, this technique can create long, snake-like clusters, and it is not good at extracting classical spherical, compact clusters.
Join the clusters whose most distant records are as near as possible: the complete-link method, in which all records are linked within some maximum distance; this favors compact clusters. Join the clusters where the average distance between all pairs of records is as small as possible: the group-average-link method, which takes both the nearest and the most distant records into account and produces clusters ranging from elongated single-link to tight complete-link clusters.
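The slides do not name a library, but if SciPy is available all four merge criteria can be compared directly; a sketch on toy data (`linkage` and `fcluster` are SciPy functions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 4)                 # toy data: 50 records, 4 predictors

# The four merge criteria discussed above:
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)         # bottom-up agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per method
```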
Outline
- Introduction
- Description of the problem
- Description of the method
- Image library
- Process of identification
- Example
- Future work
Introduction
Generally speaking, the problem of object recognition is how to teach a computer to recognize different objects in a picture. This is a nontrivial problem. Some of the main difficulties are separating an object from the background (especially in the presence of clutter or occlusions) and recognizing an object under different lighting.
Introduction.
In this research I am trying to improve the accuracy of object recognition by implementing the KNN method with a new weighted Hamming-Levenshtein distance that I have developed.
The problem of object recognition can be divided into two parts: 1) locating an object in the picture; 2) identifying the object. For example, assume that we have the following picture:
and we have the following library of images that we will use for object identification:
Our goal is to identify and locate objects from our library on the picture.
In this research I have developed a method of object identification that assumes we already know the location of the object; I am going to develop the location method in my future work.
We will use the KNN method to identify objects. For example, assume we need to identify an object X in a given picture. Consider the space of pictures generated by the image of X and the images from our library.
In this space we pick, say, the 5 images closest to X, and identify X by finding the plurality class among these nearest pictures.
In order to use the KNN method we need a measure of similarity between two pictures. First of all, to say anything about similarity between pictures, we need some idea of the shape of the objects in them. To do this we use an edge-detection method (the Sobel method, for example).
Next, we turn the edge-detected picture into a bit array by thresholding intensities to 0 or 1. In fact, we keep the images in the library in this form.
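A sketch of this edge-detect-and-threshold step, assuming SciPy's `ndimage.sobel` is used and with an illustrative threshold of 0.25 (the talk does not fix a value):

```python
import numpy as np
from scipy import ndimage

def edge_bitmap(image, threshold=0.25):
    """Edge-detect a grayscale image with the Sobel operator and
    threshold the gradient magnitude to a 0/1 bit array."""
    gx = ndimage.sobel(image, axis=0)       # gradient along rows
    gy = ndimage.sobel(image, axis=1)       # gradient along columns
    magnitude = np.hypot(gx, gy)            # edge strength at each pixel
    m = magnitude.max()
    if m > 0:
        magnitude = magnitude / m           # normalize to [0, 1]
    return (magnitude > threshold).astype(np.uint8)
```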
Now, in order to compare two pictures, we need to compare two 2-dimensional bit arrays. It may seem natural to use the traditional Hamming distance for bitstrings, defined as follows: given two bitstrings of the same dimension, the Hamming distance is the minimum number of symbol changes needed to turn one bitmap into the other.
For example, the Hamming distance between (A) 10001001 and (B) 11100000 is 4. Notice that the Hamming distance between (A) 10001001 and (C) 10010010 is also 4, but intuitively one can regard (C) as a better match for (A) than (B).
We can modify the Hamming distance using the idea of the Levenshtein distance, which is usually used for comparing text strings and is obtained by finding the cheapest way to transform one string into another. The transformations are the one-step operations of insertion, deletion, and substitution, each with a certain cost.
Also, since different parts of an image have different levels of importance in the recognition process, we can assign a weight value to each pixel and use it in the definition of the distance. For example, we can eliminate the background of a picture by assigning zero weight to the corresponding pixels.
Divide each bitstring into several substrings of the same length. Then compare corresponding substrings using the Levenshtein distance, and sum all these distances multiplied by the average weight of each substring.
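A sketch of this combined measure in plain Python; the substring length and the unit edit costs are assumptions, since the talk does not fix them:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def weighted_hamming_levenshtein(bits_a, bits_b, weights, sub_len=16):
    """Split the bitstrings into equal substrings, compare corresponding
    substrings with Levenshtein distance, and sum those distances
    scaled by the average pixel weight of each substring."""
    total = 0.0
    for s in range(0, len(bits_a), sub_len):
        chunk_w = weights[s:s + sub_len]
        w = sum(chunk_w) / len(chunk_w)    # average weight of the substring
        total += w * levenshtein(bits_a[s:s + sub_len], bits_b[s:s + sub_len])
    return total
```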
Image library.
Each object in the library is represented by several images taken under different lighting and from different sides. Each image is represented by two 2-dimensional arrays: the first contains the edge-detected picture turned into a bit array, and the second contains the weight values assigned to each pixel.
Process of identification.
To identify an object,
we turn its edge-detected image into a bit array by thresholding intensities to 0 or 1. Then we measure the distance between this image and each image in our library, using the corresponding weight arrays and the weighted Hamming-Levenshtein distance. Using the KNN method, we identify the object.
Example.
Below are some results of object identification obtained using the method described above.
Assume that we have an image library with the following edge-detected images of objects and their weighted images.
Picture 1
We compare this picture with each image in our library, and we get the following table of distances.
           Bear 1  Bear 2  Bear 3  Cat 1  Cat 2  Cat 3  Dog 1  Dog 2  Dog 3
Picture 1     876   21009   24495  27401  25986  24538  21629  26809  25546
If we select the three closest neighbors of Picture 1 (Bear 1 at 876, Bear 2 at 21009, Dog 1 at 21629), the plurality class is Bear, so we identify it as a bear.
Picture 2.
Picture 3.
           Bear 1  Bear 2  Bear 3  Cat 1  Cat 2  Cat 3  Dog 1  Dog 2  Dog 3
Picture 2   31678   24644   31662   1864  22798  22242  23087  25679  25785
Picture 3   32629   23790   32150  28687  25655  25824   1577  24042  23880
Future work.
- Develop a method for locating an object in the picture.
- Develop an idea of a reasonable weight distribution on the images in the library.
- Improve the identification algorithm to allow comparing pictures of different sizes.
- Continue improving the definition of the weighted Hamming-Levenshtein distance.
Introduction
Objective: to recognize images of handwritten digits using classification methods for multivariate data: Optical Character Recognition (OCR).
Predict the label of each image using the classification function learned from training
Training set: ~1000 images. Test set: randomly selected from the full database.
[Figure: a sample 16 x 16 digit image from the data set.]
Basic idea
Correctly identify the digit given an image
PCA is done on the mean-centered images. The eigenvectors of the covariance matrix are called the eigen-digits (256-dimensional). The larger an eigenvalue, the more important that eigen-digit. The ith PC of an image X is y_i = e_i X (the projection of X onto eigenvector e_i). A sketch of this computation follows after the figure below.
[Figures: the average digit and sample eigen-digits displayed as 16 x 16 images; the covariance matrix is 256 x 256.]
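A minimal NumPy sketch of the eigen-digit computation described above; the random data stands in for the real 1000 x 256 image matrix, and variable names are illustrative:

```python
import numpy as np

X = np.random.rand(1000, 256)            # stand-in: one 16x16 image per row

mean_digit = X.mean(axis=0)              # the "average digit"
centered = X - mean_digit                # mean-center the images
cov = np.cov(centered, rowvar=False)     # 256 x 256 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

E = eigvecs[:, :64]                      # the 64 most important eigen-digits
Y = centered @ E                         # 1000 x 64 reduced data matrix
```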
PCA (continued)
Based on the eigenvalues, the first 64 PCs were found to be significant.
Any image can then be represented by its PCs: Y = [y1 y2 ... y64]. This gives a reduced data matrix with 64 variables: Y is a 1000 x 64 matrix.
[Figure: percentage of variance explained versus number of principal components; the value 92.74% is marked at 64 PCs.]
The eigenvectors are a rotation of the original axes to more meaningful directions, and the PCs are the projections of the data onto each of these new axes. Image reconstruction: the original image can be reconstructed by projecting the PCs back onto the old axes. Using the most significant PCs gives a reconstructed image that is close to the original. These features can then be used for further investigations, e.g., classification!
Image Reconstruction
Mean-centered image: I = X - X_mean. PCs as features: y_i = e_i^T I, so Y = [y1, y2, ..., y64] = E^T I, where E = [e1 e2 ... e64]. Reconstruction: X_recon = E Y + X_mean (projecting onto the eigenvectors uses E^T; projecting back uses E).
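Continuing the PCA sketch above, a minimal reconstruction example (X, E, and mean_digit are assumed to be already computed there):

```python
import numpy as np

# x: one original 256-pixel image; E: 256 x 64 eigen-digit matrix.
x = X[0]
y = E.T @ (x - mean_digit)            # its 64 PC features, Y = E^T I
x_recon = E @ y + mean_digit          # project back onto the old axes
err = np.linalg.norm(x - x_recon)     # error shrinks as more PCs are kept
```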
[Figure: an actual image from the test set and its reconstructions using all 256, 150, and 64 principal components.]
[Figure: Q-Q plots of sample data versus the standard normal for principal components 10, 20, 40, and 50.]
Classification
Principal components are used as the features of the images. LDA assumes multivariate normality of the feature groups and a common covariance; the Fisher discriminant procedure assumes only a common covariance.
Classification (contd.)
Equal costs of misclassification are assumed. Misclassification error rates: the APER (apparent error rate) is computed on the training data and the AER (actual error rate) on the validation data, averaged over several random samplings of training and validation data from the full data set.
Performing LDA
The prior probabilities of each class were taken as the frequency of that class in the data. Equality of the covariance matrices is a strong assumption; error rates were used to check the validity of this assumption, and S_pooled was used for the covariance matrix.
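A minimal sketch of this procedure, assuming scikit-learn is available; the synthetic arrays stand in for the 64-PC features and digit labels, and `LinearDiscriminantAnalysis` defaults to class frequencies as priors and a pooled covariance:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 64))          # stand-in 64-PC feature matrix
labels = rng.integers(0, 10, size=1000)  # stand-in digit labels 0-9

train, valid = slice(0, 800), slice(800, None)
lda = LinearDiscriminantAnalysis()       # priors default to class frequencies
lda.fit(Y[train], labels[train])

aper = 1 - lda.score(Y[train], labels[train])  # apparent error rate (training)
aer = 1 - lda.score(Y[valid], labels[valid])   # actual error rate (validation)
```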
LDA Results
No. of PCs    APER %    AER %
64            6.4       10.087
The APER underestimates the AER. Using 64 PCs is better than using 150 or 256 PCs! The PCs with lower eigenvalues tend to capture the noise in the data.
Fisher Discriminants
Fisher's procedure uses equal prior probabilities and equal covariances. The number of discriminants r can be at most 9. When all discriminants are used (r = 9), Fisher is equivalent to LDA (verified by the error rates).
No. of PCs    APER %    AER %
256           32        45
150           34.5      42
64            37.4      40
Considerable improvement in AER and APER. Performance is close to LDA. Using 64 PCs is better.
No. of PCs    APER %    AER %
64            6.4       9.86
No significant performance gain from r = 7 onward. Error rates are approximately those of LDA (as expected!).
[Table: KNN test error rates of 7.09%, 7.01%, and 6.45%.]
Test error rates have improved compared to LDA and Fisher. Using 64 PCs gives better results. Using higher values of k does not improve the recognition rate.
Misclassification in NN:
[Table: 10 x 10 confusion matrix of true digit (rows 0-9) versus recognized digit (columns 0-9); the diagonal counts of correctly recognized digits are 1376, 1113, 728, 690, 687, 517, 714, 657, 547, and 664, with small off-diagonal counts.]
Euclidean distances between transformed images of the same class can be very high.