By Ohad Klausner
1. Introduction
Optical character recognition, usually abbreviated to OCR, involves a computer system designed to translate images of typewritten text (usually captured by a scanner) into machine-editable text, or to translate pictures of characters into a standard encoding scheme representing them. OCR began as a field of research in artificial intelligence and computational vision. In this project I decided to implement OCR using the appearance-based recognition technique. Formally, the problem can be stated as follows: given a training dataset x and an object o, find the object xj within the dataset that is most similar to o. PCA (defined below) is a popular technique in appearance-based recognition.
4. The Algorithm
4.1 Creating PCA subspace (eigenspace).
o Organize the image database into column vectors. The vector size equals the image height multiplied by the image width. All the database images must be of the same size, 64 x 48 in our case (equivalent to a 36 size font). The result is a vector_size x database_size matrix, ColumnVectors.
o Find the empirical mean vector. Find the empirical mean along each dimension (row) of the data matrix ColumnVectors and subtract it from every column. Store the mean-subtracted data in a vector_size x database_size matrix, MeanSubracted.
o Compute the eigenvectors and eigenvalues of the covariance matrix of MeanSubracted. In this phase I deviated from the original PCA algorithm. Finding the eigenvectors and eigenvalues of the covariance matrix of MeanSubracted, which is a vector_size x vector_size matrix, was more than Matlab could handle: when I tried to do so, I got an out of virtual memory error. Therefore I used a method (that I found on the web): obtain instead the covariance matrix of the transposed MeanSubracted matrix. This is a database_size x database_size matrix, which is much smaller, but its nonzero eigenvalues match those of the original matrix up to a scale factor, and its eigenvectors are related to the required ones by a single multiplication. In order to convert these eigenvectors to the eigenvectors required, I multiplied them by the MeanSubracted matrix.
o Sort the eigenvectors by decreasing eigenvalue.
o Create a k dimensional subspace. Save the first k eigenvectors as a matrix, SubSpace. An eigenvector whose normalized eigenvalue is close to zero is not saved, even if it is among the k largest eigenvalues. The end result is a k (or less) dimensional subspace.
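The steps above can be sketched as follows. This is only an illustration in Python with NumPy, not the Matlab code of the project. The function name build_subspace, the eps threshold, the choice to normalize eigenvalues by the largest one, and the random toy data are all assumptions. The trick works because if v is an eigenvector of the small matrix A^T A with eigenvalue lambda (A being MeanSubracted), then A v is an eigenvector of the large matrix A A^T with the same eigenvalue.

```python
import numpy as np

def build_subspace(column_vectors, k, eps=1e-10):
    """Return (mean, subspace) for a vector_size x database_size matrix."""
    # Empirical mean along each dimension (row), then subtract it.
    mean = column_vectors.mean(axis=1, keepdims=True)
    mean_subtracted = column_vectors - mean

    # Small database_size x database_size matrix instead of the
    # vector_size x vector_size covariance matrix.
    small = mean_subtracted.T @ mean_subtracted
    eigenvalues, eigenvectors = np.linalg.eigh(small)

    # Sort by decreasing eigenvalue and keep at most k eigenvectors.
    order = np.argsort(eigenvalues)[::-1][:k]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Assumption: "normalized" means divided by the largest eigenvalue.
    keep = eigenvalues / eigenvalues[0] > eps
    # Convert to eigenvectors of the large matrix, then scale to unit length.
    subspace = mean_subtracted @ eigenvectors[:, keep]
    subspace /= np.linalg.norm(subspace, axis=0)
    return mean, subspace


# Toy data standing in for 20 images of 64 x 48 pixels (an assumption).
rng = np.random.default_rng(0)
column_vectors = rng.random((64 * 48, 20))
mean, subspace = build_subspace(column_vectors, k=32)
print(subspace.shape)
```

With only 20 toy images the subspace has fewer than k = 32 columns, which matches the "k (or less)" remark above: mean subtraction removes one degree of freedom, so at least one eigenvalue is numerically zero and its eigenvector is dropped.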
[...] database column vectors in the subspace. Project both the image vector and the training set (database) vectors onto the PCA subspace: multiply SubSpaceT (the transpose of SubSpace) by ImageColumVec and by ColumnVectors, and compute the distances using the L2 norm.
o Return the name (label) of the column vector with the minimal distance.
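A minimal sketch of this recognition step, again in Python with NumPy rather than Matlab. The function name recognize, the toy subspace (the first three coordinate axes) and the labels are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def recognize(image_column_vec, column_vectors, labels, subspace):
    """Return the label of the database column closest to the image."""
    # Coordinates in the subspace: SubSpace^T times the vectors.
    image_coords = subspace.T @ image_column_vec        # shape (k,)
    database_coords = subspace.T @ column_vectors       # shape (k, database_size)
    # L2 distance from the image to every database column.
    distances = np.linalg.norm(database_coords - image_coords[:, None], axis=0)
    return labels[int(np.argmin(distances))]


# Toy example: a 3-dimensional subspace of a 6-dimensional space.
subspace = np.eye(6)[:, :3]
column_vectors = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]).T
labels = ["ALEPH", "BETH", "GIMEL"]          # hypothetical labels
query = column_vectors[:, 1] + 0.05          # slightly perturbed second entry
print(recognize(query, column_vectors, labels, subspace))
```

Subtracting the mean vector from both the image and the database before projecting would shift every projected point by the same offset, so the distances and the returned label would not change.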
5. The Program
The program consists of 9 files and 2 main functions, written in Matlab.
Main functions:
1. CreateDB(): loads the database, trains it and saves the data for the OCR use.
2. OCR(image_file_name): loads the image and recognizes it.
Files:
1. database.zip: an archive of the database images. Unzip it before database creation and training.
2. imread2.m: this function loads an image and creates its [...]
3. [...] This file is the one that needs to be changed in order to update/add images to the database.
4. pca.m: this function trains the dataset with the images loaded by the script, using the PCA algorithm described above.
5. CreateDB.m: holds the CreateDB() function.
6. PCAdateBase.mat: if created, holds the PCA saved data.
7. im_resize.m: resizes the image to the desired size.
8.
6. Results
I ran the program using a full Hebrew Aleph Beth database of 270 images (10 different images per letter). After running a few tests, the recognition success rate was about 50%, due to two main reasons: location sensitivity and similar letters with small differences between them.
Location sensitivity: character recognition via appearance-based recognition is location sensitive, because characters come in different sizes (even at the same font size) and cannot always be at the center of the image. In addition, in different fonts the same letter can be placed in different locations within the letter box. When a given letter is located in a different place than in the correct dataset images, the recognition algorithm returns a large distance to them under the L2 norm, and a shorter distance may be found to an incorrect letter.
[...] but the same image (letter), moved aside within the letter box, was incorrectly recognized as [...]
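The location sensitivity can be reproduced on a toy example. The sketch below (Python with NumPy) compares raw pixel vectors with the L2 norm. The 8 x 8 images are invented for illustration and are not taken from the project database, and the comparison is done in pixel space rather than in the PCA subspace.

```python
import numpy as np

# 8 x 8 toy "letters": 1 is ink, 0 is background.
letter = np.zeros((8, 8))
letter[1:7, 2] = 1                      # vertical stroke in column 2

other = np.zeros((8, 8))
other[1:7, 5] = 1                       # vertical stroke in column 5
other[1, 3:5] = 1                       # plus a short horizontal stroke

shifted = np.roll(letter, 3, axis=1)    # same letter, moved 3 pixels right

def l2(a, b):
    """L2 distance between two images viewed as vectors."""
    return np.linalg.norm(a.ravel() - b.ravel())

print("shifted vs. own letter:  ", l2(shifted, letter))
print("shifted vs. other letter:", l2(shifted, other))
```

The shifted letter ends up closer to the other shape than to its own unshifted image, because the L2 norm compares pixels position by position and the two copies of the stroke no longer overlap.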
Similar letters: the Hebrew Aleph Beth contains very similar letters with very small differences between them. For example, VAV and NUN SOFIT can be called a similar couple. This kind of similarity can cause mistakes with a specific font in which a letter resembles its similar-couple partner. For example, this Kaph Sofit was incorrectly recognized as Resh.
7. Discussion
The goal of my project was to create a reliable OCR using the PCA method. After testing this method and ending up with a poor recognition rate, as described above, one may think that I failed to reach that goal. However, with a few enhancements (maybe a project for next year) the problems described above can be corrected.
[...] would minimize the similar-letters mistakes. This can be done by loading a whole-word image, dividing it into letters using clustering and/or edge detection techniques, and sending each letter to my OCR with a flag indicating the letter's location in the word (last letter or not). Because most of the similar couples described above contain one final (Sofit) letter, ignoring the final letters when checking a letter from the beginning or the middle of the word would minimize those mistakes.
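One possible shape for such a position flag is sketched below in Python. The function name recognize_with_position is hypothetical, and so is the convention that final-form labels end with "SOFIT". It assumes the distances to all database entries have already been computed.

```python
def recognize_with_position(distances, labels, is_last_letter):
    """Pick the closest label, skipping final forms unless the letter is last."""
    candidates = [
        (distance, label)
        for distance, label in zip(distances, labels)
        # Assumption: final (Sofit) letters are labelled "... SOFIT".
        if is_last_letter or not label.upper().endswith("SOFIT")
    ]
    return min(candidates)[1]


labels = ["VAV", "NUN SOFIT"]           # a similar couple from the results
distances = [0.9, 0.5]                  # made-up distances for illustration
print(recognize_with_position(distances, labels, is_last_letter=False))
print(recognize_with_position(distances, labels, is_last_letter=True))
```

In the middle of a word the final form is never considered, so the slightly closer NUN SOFIT can no longer displace VAV. At the end of a word both forms remain candidates.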
[...] image box (both the dataset and the given character), the location sensitivity problem would be solved, because all the letters would be in the same location within the image box. This can again be done by using clustering and/or edge detection techniques to find the location of the letter within the image box and move it to the center.
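A simple way to try this is sketched below in Python with NumPy. Instead of clustering or edge detection, it uses the bounding box of the non-background pixels, which is a simpler substitute and assumes a clean, uniform background. The function name center_letter and the background value 0 are assumptions.

```python
import numpy as np

def center_letter(image, background=0):
    """Return a copy of image with the letter's bounding box centered."""
    ink_rows, ink_cols = np.nonzero(image != background)
    if ink_rows.size == 0:
        return image.copy()             # empty box: nothing to move
    top, bottom = ink_rows.min(), ink_rows.max()
    left, right = ink_cols.min(), ink_cols.max()
    letter = image[top:bottom + 1, left:right + 1]

    centered = np.full_like(image, background)
    row0 = (image.shape[0] - letter.shape[0]) // 2
    col0 = (image.shape[1] - letter.shape[1]) // 2
    centered[row0:row0 + letter.shape[0], col0:col0 + letter.shape[1]] = letter
    return centered


# The same 3 x 2 stroke placed in two different corners of an 8 x 8 box.
a = np.zeros((8, 8), dtype=int)
a[0:3, 0:2] = 1
b = np.zeros((8, 8), dtype=int)
b[4:7, 5:7] = 1
print(np.array_equal(center_letter(a), center_letter(b)))
```

Applying the same function to both the database images and the image to be recognized puts every letter at the same place before the vectors are built.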
[...] computation required. A large k requires more computation time but results in better accuracy. In this case k was chosen arbitrarily to be 32. A higher k would probably have given more accurate results. Examining the eigenvalues can provide a method for a better choice of k.
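One common way to examine the eigenvalues is to keep the smallest k whose eigenvalues account for a chosen fraction of their total. The sketch below (Python with NumPy) illustrates that idea. The function name choose_k and the 95% default are assumptions, not values used in this project.

```python
import numpy as np

def choose_k(eigenvalues, energy=0.95):
    """Smallest k whose largest eigenvalues reach the given share of the total."""
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(values) / values.sum()
    # First position where the cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(values))


# Made-up eigenvalues for illustration.
print(choose_k([5.0, 3.0, 1.0, 1.0], energy=0.75))
```

A threshold chosen this way ties k to how quickly the eigenvalues decay, rather than fixing it in advance.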