
Devanagari Font Design for Optical
Character Recognition
Dual Degree Dissertation
Submitted in partial fulfillment of the requirements
of the degree of
(Bachelor of Technology & Master of Technology)
By
Mustafa Saifee
07D07004
Supervisor:
Prof. Ravi Poovaiah
Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY

May 2012
Approval Sheet
Devanagari Font Design for Optical Character
Recognition by Mustafa Saifee is approved for the degree of Bachelor of Technology in Electrical
Engineering & Master of Technology in Communication and Signal Processing.
Examiner
Examiner
Supervisor
Chairman
Date
Place
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand
that any violation of the above will be cause for disciplinary action by the institute and can
also evoke penal action from the sources which have thus not been properly cited or from whom
proper permission has not been taken when needed.
(Signature)
(Name)
(Roll No.)
Date
Acknowledgements
I express my sincere gratitude to Prof. Ravi Poovaiah for his support, guidance and constant
encouragement. I would like to thank Manoj G. from CDAC for his guidance and support. At this time I
remember with great reverence my parents and brother, whose support and prayers are always
my strength. I would also like to thank my faculty advisor and HoD Prof. Abhay Karandikar for
his support.
Mustafa Saifee
07D07004
Abstract
The original motivations for developing optical character recognition technologies were modest:
to convert printed text on flat physical media to digital form, producing machine-readable digital
content. By doing this, words that had been inert and bound to physical material would be brought
into the digital realm and thus gain new and powerful functionalities and analytical possibilities.
OCR is crucial to the computerization of printed texts so that they can be electronically searched,
stored more compactly, displayed on-line, and used in machine processes such as machine translation,
text to speech and text mining.
We design a Devanagari script font optimized for OCR. In this report we first study the basics
of OCR systems and the working of Devanagari OCR. We also study Latin fonts designed for OCR
and the precautions that need to be taken while designing a Devanagari script font for OCR. Then
we apply this knowledge to design the font which is showcased in the report. We also
discuss the different features of letterforms which help in recognition. Finally, we test the
font and find the results encouraging.
Contents
1. Introduction
2. Devanagari Script
2.1 Alphabets
2.2 Anatomy
2.3 Character Frequency in Hindi
3. Optical Character Recognition
3.1 History of OCR
3.2 Applications of OCR
3.3 Recognition of Devanagari Script
4. Typefaces for OCR
4.1 Latin Fonts for OCR
4.3 Devanagari Font for OCR
5. Font Design for OCRs
5.1 Design Features Important for OCR
5.2 Special Care for Devanagari
6. Proposed Font Designed for OCRs
6.1 Designing Characters
6.2 Evolution of Typeface
6.3 Anatomy of the Typeface
6.4 Testing of the Typefaces
7. Future Work
References
List of Figures
Figure 2.1 Bhagwat's grouping of letters on the basis of graphical similarities
Figure 2.2 Bhagwat's guidelines
Figure 2.3 Naik's grouping of letters on the basis of the position of the vertical bar or the kana
Figure 3.1 OCR-A (top); OCR-B (Bottom)
Figure 3.2 Procedure of Devanagari Recognition
Figure 3.3 Image before binarization (left); Image after binarization (right)
Figure 3.4 Horizontal Projection Profiles of a document for line segmentation
Figure 3.5 Vertical Projection Profiles of a document for word segmentation
Figure 3.6 Three parts of a Devanagari word
Figure 3.7 The procedure of Hindi character segmentation
Figure 4.1 Characters of OCR-A
Figure 4.2 Characters of OCR-B
Figure 4.3 Optimal Overlapping of characters
Figure 4.4 More coverage ratio because of serifs
Figure 4.5 First test version of OCR-B
Figure 4.6 First published version of OCR-B
Figure 4.7 OCR-B (top), OCRBczyk (bottom)
Figure 5.1 Addition of elements in OCR-B to differentiate two characters
Figure 5.2 Serifs in a typeface (grey serifs)
Figure 5.3 Shadow Characters
Figure 5.4 Counters is the circular negative space (grey)
Figure 5.5 Example of few characters after the removal of Shiro Rekha
Figure 5.6 Similarities between different characters once the Shiro Rekha is removed
Figure 5.7 Similarities between different characters once the bottom strip is removed
Figure 5.8 Similarities between descender and Halanta
Figure 5.9 Example of open counter (grey)
Figure 6.1 Extension in the diagonal stem of to differentiate it from
Figure 6.2 Difference in the shape of the bowl
Figure 6.3 Final design of letters and
Figure 6.4 Common element of and
Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar
Figure 6.7 Difference in and once Shiro Rekha is removed
Figure 6.8 Closed counter in the letters and
Figure 6.9 Final design of the letters and
Figure 6.10 Small diagonal stroke and the openness of the counter in
Figure 6.11 Overlapping of the letters and
Figure 6.13 Distance between horizontal and vertical bar of the letters
Figure 6.19 Equal character height of all the letters
Figure 6.21 Overlap of the letters and without the Shiro Rekha
Figure 6.23 Overlap of the letters and
Figure 6.26 Small white space between the letters and lower matras
Figure 6.27 Final design of matras
Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)
Figure 6.29 Grid used in OCR-D
Figure 6.30 H:V Ratio
Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document typed in OCR-D
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.37 Extracted text when OCR algorithm is executed on document typed in Yogesh
Figure 6.38 Scanned document typeset in Surekh
Figure 6.39 Extracted text when OCR algorithm is executed on document typed in Surekh
List of Tables
Table 2.1 Vowels in Devanagari
Table 2.2 Consonants in Devanagari
Table 2.3 Character Frequency in Hindi
Chapter 1: Introduction
Having machines replicate human functions, like reading, is an old dream. However, over the last five
decades, machine reading has grown from a dream to reality. Machine reading uses the principles
of Optical Character Recognition (OCR). OCR has also become one of the most successful applications
of technology in the field of pattern recognition and artificial intelligence. Since the mid
1950s, OCR has been a very active field of research and development. While OCR technology
for some scripts like Latin is fairly mature, and commercial OCR systems like Nuance OmniPage
Pro or ABBYY FineReader are available which can perform with high accuracy, it is still under
development for other scripts like Chinese and Devanagari.
Although a great deal of research has been done on OCR applications for Latin script, even these
OCR based machines are still not able to compete with human reading capabilities. This problem
is more prominent for other scripts for which OCR technology is relatively newer. Typefaces
are very important in determining the performance of OCR technology. Hence, in order to
improve the accuracy of an OCR system, typefaces which are specially designed for OCR are
required. For Latin script, quite a few typefaces have been designed which are optimized for OCR.
These specially designed typefaces have a unique and well defined character set which allows for
greater accuracy in recognition. This in turn helps in building low cost systems which can recognize
characters using simple algorithms. However, no Devanagari script font is available which is
designed specifically for machine reading, and we address this problem in this report.
In general, documents contain text, graphics, and images. The procedure of reading the text com-
ponent in such a document can be divided into three steps:
1. Document layout analysis in which the text component of the document is extracted.
2. Segmentation, i.e. extraction of characters from the text component of the document.
3. Recognition of the segmented characters.
Typically, the OCR character segmentation stage needs to be redesigned for each new script, while
the other stages are easier to port from one script to another and can be generalized over large
classes of languages. There is a great need for OCR related research in Indian languages as there
are many technical challenges which are specific to Devanagari script. With the spread of computers
in organizations and offices, automatic processing and machine reading of paper documents
is gaining importance in India. Although a lot of research is going on in Devanagari script recognition,
there are no commercial OCR systems focusing on Devanagari based languages. OCR for
Devanagari is still in the research and development stage.
In chapter 2, we give an overview of the Devanagari script. We discuss the alphabets in the
Devanagari script and how they are grouped. Thereafter we discuss the anatomy of the script and
the graphical grouping of the alphabets.
In chapter 3, we have a look at the basics of OCR systems. First, we discuss the history and the applica-
tions of OCR system and then we look at one of the algorithms used in OCR systems for Devanagari
script. We analyse this algorithm discussing all the steps that are involved in character recognition.
In chapter 4, we discuss the features of typefaces which are designed specifically for OCR systems.
We discuss the need for a specially designed typeface for OCR and perform an in-depth analysis of
one of the most commonly used Latin typefaces for OCR systems, OCR-B. We discuss the precautions
taken by its designer while designing the typeface. Finally, we look at the lack of Devanagari
typefaces designed for OCR systems.
In chapter 5, we discuss the precautions that need to be taken while designing Devanagari type-
faces for OCR systems. We also look at the design decisions taken specifically for Devanagari
script because of the difference in the recognition algorithm of Devanagari and Latin script.
In chapter 6, we design a new Devanagari font optimized for OCR systems. We discuss each character
in detail and how the features of the letters are designed for improved performance in OCR systems.
We also present the evolution of our font from the first version to the final version. Thereafter, we test
this font on an OCR system which is available for free download on the internet.
Finally, we discuss the scope of future work and the improvements in the design which can fur-
ther enhance the performance of the font.
Chapter 2: Devanagari Script
Devanagari script is the most important and widely used script in India. It is the script used by
many Indian languages like Hindi, Marathi, Nepali and Sanskrit. Several other languages like
Punjabi and Kashmiri use close variations of this script. It was also formerly used to write Gujarati.
Devanagari is a part of the Brahmi script family. An evolutionary transition can be seen from
Brahmi script to the Gupta script to the Nagari script to Devanagari script. It was first seen in the 7th
century A.D. and the transition to a more stable form can be seen from the 11th century onwards.
The current appearance of Devanagari was reached sometime around the 12th century.
Etymologically, the word Devanagari is considered to be a combination of two Sanskrit words: Deva,
meaning God, Brahma or sometimes the king, and nagara, meaning city. Combined, they literally
mean the city of god or the script used in the city of god. The use of the name Devanagari is
relatively recent; the older term Nagari is also used.
The Devanagari script represents sounds consistently. Unlike letters of the English
alphabet, which can be pronounced in different ways, the letters of the Devanagari script have the
same pronunciation (with a few minor exceptions). Some of the conceptual differences between Latin
and Devanagari scripts are as follows:
In Devanagari script each character has a horizontal bar (Shiro Rekha) at the top. In contemporary
usage the Shiro Rekha is broken between words, to differentiate between two words.
Devanagari alphabets do not have distinct letter cases, i.e. upper and lower case characters.
The concept of matras is not present in Latin script. They can occur as standalone characters
or with other alphabets to modify their sound.
2.1 Alphabets
There are around 50 basic characters in the script. The groups of vowels and consonants are called
Swaras and Vyanjanas respectively. The grouping of vowels and consonants in Devanagari is done
on the basis of the phonetic point of articulation. Within a word, vowels often take modified shapes
called modifiers or matras. Consonant modifiers are also possible. Moreover, 2 to 5 consonants
can combine to form compound characters called conjuncts, which may partly retain the shape
of the constituent consonants. Along with these there also exists a set of signs or diacritical marks
which indicate the nasalization of vowels, the use of Persian sounds, etc.
Vowels
Devanagari in its most elaborate form has 18 vowels out of which 11 are frequently used. Others
can be seen in the Vedic and non-Vedic Sanskrit text. Vowels in Devanagari are transcribed in two
distinct forms: the independent form, and the dependent (matra) form. The independent form is
used when the vowel letter appears alone, at the beginning of a word, or immediately following
another vowel letter. Matras are used when the vowel follows a consonant.
Independent form    Modifier or Matras    Independent form    Modifier or Matras
None

Table 2.1 Vowels in Devanagari
Apart from these, there also exists another set of vowels which have been added to the traditional
Devanagari to expand its range. For example is used to write transliterations or English loan
words like ball ().
Consonants
There are around 33 consonants in Devanagari script which are grouped phonetically. The first set
of 25 consonants are called occlusive, and the remaining 8 are called non-occlusive. The occlusive consonants
are further divided into five groups: gutturals, palatals, cerebrals or retroflex, dentals and labials.
The first four consonants in each of these groups are further divided into two groups, plosive and voiced
plosive, and the last consonant is the nasal consonant. The plosive and voiced plosive are again
divided into unaspirated and aspirated versions (each having one character).
The 8 non-occlusive consonants are divided into three groups: semivowels or approximants, sibilants
and aspirate, having four, three and one character respectively.
Occlusive Consonants
Plosive Voiced Plosive Nasal
Unaspirated Aspirated Unaspirated Aspirated
Gutturals
Palatals
Cerebrals
Dentals
Labials
Non Occlusive Consonants
Semivowels
Sibilants
Aspirate
Table 2.2 Consonants in Devanagari
Conjunct
Conjuncts are combinations of two to five consonants. There are about a thousand conjuncts in
Devanagari script. Some of these conjuncts partly retain the shape of the constituent consonants
while there are others like ( + ) which are not clearly derived from the letters making up
their components.
Diacritics
Diacritics are glyphs added to a letter or basic glyph to change the sound of the letter. Some of the
commonly used diacritics in Devanagari are Visarga, Chandra, Chandrabindu, Halanta and Nukta. Visarga ()
is an unvoiced variation of . Chandra is an open mid front rounded independent vowel . In its
dependent form it is placed on the top of the consonants ( ). Chandrabindu is used to represent the
inherent nasalization of the vowel.
Halanta () is used to represent a lone consonant without a vowel. It kills the vowel and reduces
the consonant to its base form. Nukta ( ) is used to represent the Persian sounds encountered in
some of the borrowed Urdu words like for or for .
2.2 Anatomy
The anatomy of a letter can be defined as a system which depicts the structural form of a letter,
describing the key features of a letter in a typeface. The first attempt at graphical classification of
Devanagari script was made by S. V. Bhagwat. He grouped letters on the basis of graphical similarities
as shown in figure 2.1.
Figure 2.1 Bhagwat's grouping of letters on the basis of graphical similarities [1]
He also defined guidelines for the letters and terminology for some of the graphical elements
present in the letters, which are shown in figure 2.2.
Figure 2.2 Bhagwat's guidelines [1]
The topmost line is the Rafar Line, which is followed by the Matra Line. The Matra Line denotes the
top of the upper matras. After the Matra Line comes the Head Line, which is the top of the
Shiro Rekha. The Head Line is followed by the Upper Mean Line and the Lower Mean Line. The Upper
Mean Line indicates the point where the actual letter starts, for example the upper part of the
counter of or . The Lower Mean Line denotes the point where the characteristic feature of the
letter comes to an end, for example the lower part of the counter of or . This is followed by the
Base Line, which denotes the end of the character and the point where the lower matra starts. The
lowermost line is the Rukar Line, which denotes the end of the lowest portion of the Rukar.
Bapurao Naik also attempted a graphical grouping of letters. Naik organized letters graphically
into five groups on the basis of the position of the kana or the vertical bar. An important aspect of
this grouping is that and are missing from the groups. Naik's grouping of letters is shown in
figure 2.3.
Figure 2.3 Naik's grouping of letters on the basis of the position of the vertical bar or the kana [1]
A few others, like M. W. Gokhale and Mahendra Patel, have also proposed different methods to
group the letters and create the vocabulary of Devanagari script. A comprehensive study of the anatomy
of Devanagari script can be found in [2].
2.3 Character Frequency in Hindi

Table 2.3 shows the frequencies of letters in the Hindi language. The list is based on a set of Hindi texts
totalling 978,430 characters (238,604 words), of which 736,216 characters were used for the counting.
The texts consist of a good mix of different literary genres. Of course, if other texts were used as a
basis, the result would be slightly different.
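A frequency list of this kind can be approximated with a short counting routine. The sketch below is only an illustration and not the procedure used in [3]: the function name `character_frequencies` is hypothetical, and we assume that only whitespace is excluded from the count, whereas the cited study evidently excluded more characters. Note that in Unicode text the matras are separate code points and are therefore counted as characters of their own.

```python
from collections import Counter


def character_frequencies(text):
    """Return {character: percentage} sorted from most to least frequent.

    Assumption: every non-whitespace code point is counted.
    """
    letters = [ch for ch in text if not ch.isspace()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    # most_common() yields characters in descending order of count
    return {ch: 100.0 * n / total for ch, n in counts.most_common()}
```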
8.22% 1.62% 0.35%
7.14% 1.45% 0.31%
6.85% 1.39% 0.30%
5.91% 1.34% 0.27%
4.82% 1.31% 0.20%
3.78% 1.16% 0.20%
3.48% 1.15% 0.19%
3.47% 1.01% 0.17%
3.44% 0.94% 0.15%
3.28% 0.81% 0.13%
3.20% 0.78% 0.10%
3.02% 0.76% 0.10%
2.89% 0.75% 0.10%
2.66% 0.70% 0.09%
2.45% 0.67% 0.05%
2.21% 0.67% 0.03%
2.20% 0.66% 0.03%
1.96% 0.57% 0.01%
1.78% 0.45% 0.01%
1.68% 0.36% 0.00%
Table 2.3 Character Frequency in Hindi[3]
Chapter 3: Optical Character Recognition
With the recent emergence and widespread application of multimedia technologies, there is
increasing demand to create a paperless environment in our daily life. A wide variety of information
which has conventionally been stored on paper is now converted to electronic form for better
storage and intelligent processing. The primary purpose of such systems is to facilitate the retrieval
of information based on a given query. Representing documents as images is undesirable
because it does not allow the user to edit or search the document. These limitations can be overcome
by representing the data as text, which takes less storage space and is also easier to process.
This kind of conversion can be achieved by Optical Character Recognition.
Optical Character Recognition or OCR is a technology which allows a machine to recognize text
from an image. It is the conversion of a scanned image of printed or hand-written text to machine
encoded text. It is important for computerizing printed texts so that they can be searched electronically,
stored compactly or used for machine processing like translation or text to speech
conversion.
3.1 History of OCR

The dream of making machines perform human tasks like reading is not new. The first attempt was
in 1870, when C. R. Carey invented an image transmission system. During the first decades of the 20th
century many attempts were made. But the modern version of OCR came into existence in the 1940s.
The First OCR

The first OCR was installed at Reader's Digest in 1954. It was used to convert typewritten reports to
punched cards so that they could be input into the computer.
First Generation OCR

The first commercial OCR systems appeared between 1960 and 1965. These OCRs could read only
constrained letter shapes. The characters were specifically designed for machine recognition and did not
look very natural. With time, these OCRs were able to recognize up to 10 different fonts.
Second Generation OCR

The reading machines of the second generation appeared between the middle of the 1960s and the early
1970s. These systems were able to recognize regular machine printed characters and also had
hand-printed character recognition capabilities. The first one of this kind was the IBM 1287. In this
period, characters in Latin script were also standardized. OCR-A and OCR-B were designed
in this period. These fonts were designed so that they could be recognized by a machine but were
still readable by a human.
Figure 3.1 OCR-A (top); OCR-B (Bottom)
Third Generation OCR

These first appeared in the middle of the 1970s. The challenge was to recognize poorly scanned documents
and hand-written character sets. Low cost and high accuracy were also main objectives, which
were achieved in part because of advances in technology.
Present OCR
Today OCRs are available at a very low cost and OCR systems are also available as software packages.
Omnifont OCRs are available for Latin script. Although systems are available for Latin,
Cyrillic, Far Eastern and many Middle Eastern scripts, such systems for Devanagari are still in the
research and development stage. This is mainly due to the lack of a commercial market.
3.2 Applications of OCR

OCR has been used to computerize data for dissemination and processing. The first major use of
OCR was in the banking industry, where it was first used to read credit card numbers. Nowadays
OCRs are widely used for automated data entry, especially in banks, where they are used to read account
numbers, customer identification, amounts of money, etc.
OCR is also used for text entry, i.e. extracting text out of a scanned document. The reading machine
is used to process large amounts of text, which can then be used for several other purposes like
searching within the document.
OCRs also have huge application for the blind. This was one of the earliest envisioned applications
of OCRs. Combined with text to speech conversion, OCRs would enable blind people to read
printed documents. OCR can also be used for automated license plate reading and can help in
reading specially designed forms automatically. Once the text is computerized it can be used for
machine processes like text to speech conversion, language translation and text mining.
3.3 Recognition of Devanagari Script

The most important principle of automatic pattern recognition is teaching the machine what kinds
of patterns may be present and what they look like. In OCR the patterns are letters, numbers and
punctuation marks. The machine is trained to recognize the patterns by showing it all the kinds of characters
present in the script. This period is referred to as the training period. On the basis of these examples
the machine builds a prototype of each character. Then during recognition the machine
compares the unknown character to the prototypes and assigns the character which is the closest
match. The four steps in recognition shown in figure 3.2 are as follows:
1. Preprocessing
2. Segmentation
3. Recognition
4. Post Processing
Preprocessing
The text document is generally scanned at 300 or 400 DPI. Preprocessing is done to improve
the accuracy of the recognition algorithm. The main steps in preprocessing are noise removal,
binarization and skew correction.
Noise Removal or De-Noising

The main sources of noise in the input image are as follows:
Noise due to the quality of paper on which the printing is done.
Noise induced due to printing on both sides of paper or the quality of printing.
Noise added due to the scanner source brightness and sensors.
All this noise reduces the accuracy of the OCR system. As a result, having a noise
correction routine in place becomes essential. To reduce the amount of noise, the image is passed
through a mean filter; in this filter the intensity of each pixel is replaced by the average intensity
of the pixels surrounding it. After de-noising the image is subjected to binarization and skew (or tilt)
correction.
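The mean filter described above can be sketched in a few lines. This is a minimal illustration rather than the routine used in any particular OCR system: the function name `mean_filter` and the `radius` parameter are our own, the image is assumed to be a list of rows of grayscale intensities, and we assume the averaging window includes the pixel itself and is clipped at the image borders.

```python
def mean_filter(image, radius=1):
    """Replace each pixel by the mean of its (2*radius+1) x (2*radius+1) window.

    Assumptions: the window includes the centre pixel and is clipped
    at the borders of the image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            count = 0
            # Visit every neighbour that lies inside the image
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            row.append(total / count)
        result.append(row)
    return result
```

An isolated dark or bright speck is spread over its neighbourhood and so loses contrast, at the price of slightly blurring the strokes of the letters.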
Preprocessing: noise removal, binarization and skew (or tilt) correction
Segmentation: line, word and character segmentation
Recognition
Post Processing
Output Text
Figure 3.2 Procedure of Devanagari Recognition
Binarization
Printed documents generally have black text on a white background. Hence most OCR algorithms
are designed to interpret bi-level images (images in which a pixel has only two possible values,
i.e. black and white). The process of converting colored or grayscale images to bi-level images is
often known as binarization or thresholding. A comprehensive study of the methods of binarization
for OCRs can be found in [4].
Figure 3.3 Image before binarization (left); Image after binarization (right)
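The simplest form of thresholding compares every pixel with one fixed value. The sketch below assumes a grayscale image stored as a list of rows with intensities from 0 (black) to 255 (white); the function name `binarize` and the default threshold of 128 are illustrative choices of ours and are not taken from [4], which surveys more robust methods.

```python
def binarize(image, threshold=128):
    """Convert a grayscale image to a bi-level image.

    Returns 1 for ink (intensity below the threshold) and 0 for background.
    Assumption: a single global threshold is adequate for the scan.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]
```

Representing ink as 1 is convenient for the later steps, because the number of black pixels in a row or column is then simply the sum of its values.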
Skew (or Tilt) Correction
When a document is scanned, a small amount of skew (or tilt) is unavoidable. The skew angle is the angle
that the text lines make with the horizontal line. Skew estimation and correction are important preprocessing
steps of document layout analysis and character recognition. One of the popular skew estimation
techniques is based on the projection profile of the document. The horizontal/vertical projection
profile is a histogram of the number of black pixels along horizontal/vertical scan lines. In Devanagari,
the Shiro Rekha is used to find the skew angle. The algorithm for skew correction can be found in [5].
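One way to use projection profiles for skew estimation is to try a set of candidate angles and keep the one whose profile is most sharply peaked; the long Shiro Rekha produces a very strong peak when the candidate angle matches the skew. The sketch below illustrates this idea and is not the algorithm of [5]. The function names, the search range and the use of the sum of squared bin counts as a sharpness score are all assumptions of ours. The input is a bi-level image with 1 for ink, and a positive angle means that text lines run downwards to the right in image coordinates.

```python
import math


def profile_score(binary, angle_deg):
    """Score the horizontal projection profile taken along lines tilted by angle_deg.

    Each ink pixel is projected along the tilted direction onto the left edge;
    the score is the sum of squared bin counts, which is large when ink
    concentrates in a few bins.
    """
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(binary):
        for x, pixel in enumerate(row):
            if pixel:
                key = round(y - x * slope)
                counts[key] = counts.get(key, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) with the sharpest profile."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_score(binary, angle))
```

Once the angle is known, the image can be rotated by the opposite angle before segmentation.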
Segmentation
Segmentation is the process of dividing the page into its constituent elements. The aim of segmentation
is to extract all the characters from the text in the image. This is needed to recognize
these characters.
The segmentation phase is a very crucial stage since this is where most of the errors occur. Even in
good quality documents, adjacent characters sometimes touch each other due to inappropriate
scanning resolution or the design of characters. This can create problems in segmentation.
Incorrect segmentation leads to incorrect recognition. Segmentation in OCR occurs in three steps:
line segmentation, word segmentation and character segmentation. While the precise algorithms for
segmentation can be found in [6] and [7], an overview of the segmentation process is given below.
Line Segmentation
In line segmentation our aim is to separate out the lines of text from the image. For this, the global
horizontal projection profile method is used, which constructs a histogram of the black pixels
in every row as shown in figure 3.4. Based on the peak/valley points of the histogram, individual
lines are separated. The steps for line segmentation are as follows:
1. Horizontal projection profile for the image is created.
2. Using the projection profile, the points from which the line starts and ends are found.
3. For a line of text, the upper line is drawn at the point where we start finding black pixels and the lower
line is drawn where we stop finding black pixels. The process then continues for the next
line and so on.
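The three steps above can be sketched as follows. This is a simplified illustration and not the algorithm of [6] or [7]: the function names are hypothetical, the input is assumed to be a de-skewed bi-level image with 1 for ink, and a text line is assumed to be any maximal run of rows containing at least one black pixel.

```python
def horizontal_projection(binary):
    """Step 1: count the black pixels in every row."""
    return [sum(row) for row in binary]


def segment_lines(binary):
    """Steps 2 and 3: return (top, bottom) row indices of each text line.

    'top' is the first row containing ink and 'bottom' is the first
    empty row after the line (exclusive end).
    """
    lines = []
    top = None
    profile = horizontal_projection(binary)
    for y, ink in enumerate(profile):
        if ink > 0 and top is None:
            top = y                      # we start finding black pixels
        elif ink == 0 and top is not None:
            lines.append((top, y))       # we stop finding black pixels
            top = None
    if top is not None:                  # line touching the bottom edge
        lines.append((top, len(profile)))
    return lines
```

On a noisy scan a small threshold would be used in place of the comparison with zero, so that stray specks between lines do not merge two lines into one.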
Figure 3.4 Horizontal Projection Profiles of a document for line segmentation [8]
Word Segmentation
After line segmentation the boundary of the line (i.e. the top and bottom of the line) is known. Word
segmentation is extracting the boundaries of the words from these segmented lines. Word segmentation
is done in the same way as line segmentation, but in place of horizontal profiling, vertical
projection profiling is done as shown in figure 3.5. The steps for word segmentation are as follows:
1. Vertical projection profile for the segmented line is created.
2. Using the projection profile, the points from which the word starts and ends are found.
3. Then we create vertical lines at the start and end of each word. The process then continues for the
next word and so on.
Figure 3.5 Vertical Projection Profiles of a document for word segmentation[8]
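Word segmentation can be sketched in the same spirit. Because the Shiro Rekha joins the letters of a Devanagari word, an empty column normally indicates a gap between words. The code below is an illustration under that assumption and is not taken from [6] or [7]; the function names and the `min_gap` parameter (the number of consecutive empty columns required to end a word) are our own.

```python
def vertical_projection(line_image):
    """Step 1: count the black pixels in every column of a segmented line."""
    width = len(line_image[0]) if line_image else 0
    return [sum(row[x] for row in line_image) for x in range(width)]


def segment_words(line_image, min_gap=1):
    """Steps 2 and 3: return (left, right) column indices of each word.

    'right' is exclusive. Gaps narrower than min_gap columns are treated
    as part of the word.
    """
    profile = vertical_projection(line_image)
    words = []
    left = None
    gap = 0
    for x, ink in enumerate(profile):
        if ink > 0:
            if left is None:
                left = x
            gap = 0
        elif left is not None:
            gap += 1
            if gap >= min_gap:
                words.append((left, x - gap + 1))
                left = None
                gap = 0
    if left is not None:                 # word touching the right edge
        words.append((left, len(profile) - gap))
    return words
```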
Character Segmentation
Once the words are segmented, the next step is to extract the characters from these words. A
word in Devanagari script is further divided into three parts, as shown in figure 3.6:
1. Top
2. Core (or Middle)
3. Bottom
The top strip and the core strip are separated by the Head Line or the Shiro Rekha. But there is
no separation between the core strip and the bottom strip. The top strip contains the top matras
and the bottom strip contains the bottom matras or the descenders of some of the characters. The
Shiro Rekha is a unique feature of Devanagari script and helps to identify Devanagari in multilingual
documents. It also helps in the identification of the baseline of the text.

Top Strip
Head Line (Shiro Rekha)
Core Strip
Bottom Strip
Figure 3.6 Three parts of a Devanagari word

The steps of character segmentation shown in figure 3.7 are as follows:
1. The Shiro Rekha is identified and the top strip is separated from the core and bottom strips. The text
is now divided into two parts: a) the Shiro Rekha and the top matras, and b) the core-bottom part of the text.
2. The core strip and bottom strip are identified in the core-bottom part of the text, and the lower
matras are extracted.
3. The core strip is segmented into different letters or characters, which may include conjuncts, punctuation
or numerals.
4. Conjuncts are segmented into single characters.
5. The Shiro Rekha is removed from the extracted top strip and the top matras are extracted.
6. Once the segmentation of the core characters is done, the Shiro Rekha is put back on top of the individual
characters.
Figure 3.7 The procedure of Hindi character segmentation[6]
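Steps 1 and 3 of this procedure can be illustrated with projection profiles. The sketch below is a strong simplification and not the algorithm of [6]: it assumes that the Shiro Rekha is the row of the word image with the most black pixels, that it has a known thickness, and that neighbouring core characters are separated by at least one empty column once the Shiro Rekha is ignored. It does not handle lower matras or conjuncts, and all names are hypothetical.

```python
def find_shiro_rekha(word_image):
    """Return the index of the row with the most ink, taken to be the head line."""
    profile = [sum(row) for row in word_image]
    return max(range(len(profile)), key=profile.__getitem__)


def split_core_characters(word_image, headline_thickness=1):
    """Return (head_row, [(left, right), ...]) for the part below the head line.

    'right' is exclusive. Columns are examined only below the Shiro Rekha,
    so the head line no longer joins the characters together.
    """
    head = find_shiro_rekha(word_image)
    below = word_image[head + headline_thickness:]
    width = len(word_image[0])
    column_ink = [sum(row[x] for row in below) for x in range(width)]
    characters = []
    left = None
    for x, ink in enumerate(column_ink):
        if ink > 0 and left is None:
            left = x
        elif ink == 0 and left is not None:
            characters.append((left, x))
            left = None
    if left is not None:
        characters.append((left, width))
    return head, characters
```

The rows above the returned head row form the top strip, from which the top matras would be extracted in step 5.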
Recognition
Segmentation is followed by recognition of the characters. The two main methods used for recog-
nizing characters are as follows:
Template Matching
Feature Based Recognition
Template Matching
In this method a matrix containing the image of the input character is matched with the set of
prototypes created in the training period. The distance between the pattern and each prototype is
computed and the character which is the best match to the pattern is assigned to the pattern.
The technique is simple and easy to implement in hardware. However, this technique is sensitive to
noise and style variations and has no way of handling rotated characters.
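Template matching can be sketched with a very simple distance measure. In the illustration below the distance is the number of pixels in which two equally sized bi-level images differ (the Hamming distance); this choice, the function names and the dictionary of prototypes keyed by label are assumptions of ours, since the text does not fix a particular distance.

```python
def hamming_distance(pattern, prototype):
    """Count the pixels in which two equally sized bi-level images differ."""
    return sum(
        p != q
        for row_p, row_q in zip(pattern, prototype)
        for p, q in zip(row_p, row_q)
    )


def match_template(pattern, prototypes):
    """Return the label of the prototype closest to the pattern.

    'prototypes' maps a label to a bi-level image of the same size as
    'pattern', as it would be built during the training period.
    """
    return min(prototypes, key=lambda label: hamming_distance(pattern, prototypes[label]))
```

Because the comparison is made pixel by pixel, a small shift, rotation or change of stroke weight alters many pixels at once, which is why the method is sensitive to noise and style variations.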
Feature Based Recognition

In this method significant features of the pattern are measured and examined. These features are
then compared to the prototypes developed in the training phase. The description which provides the
closest match provides the recognition. These features can be, for example, the presence of a vertical bar or the
number of conjunctions.
The recognition algorithm can be found in detail in [9].
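As a toy illustration of feature based recognition, the sketch below measures two features of a bi-level glyph and assigns the label whose stored feature vector is nearest. It is not the algorithm of [9]. The two features (whether some column is at least 90% ink, taken as a vertical bar, and the number of separate strokes crossed by the middle row), the 90% rule and all names are assumptions made for the example.

```python
def extract_features(glyph):
    """Return (has_vertical_bar, strokes_in_middle_row) for a bi-level glyph."""
    height = len(glyph)
    width = len(glyph[0])
    column_ink = [sum(row[x] for row in glyph) for x in range(width)]
    # Assumed rule: a column that is at least 90% ink counts as a vertical bar
    has_bar = any(ink >= 0.9 * height for ink in column_ink)
    middle = glyph[height // 2]
    # Count runs of ink along the middle row
    strokes = sum(
        1 for x in range(width) if middle[x] and (x == 0 or not middle[x - 1])
    )
    return (1 if has_bar else 0, strokes)


def classify(glyph, prototype_features):
    """Return the label whose feature vector is closest to that of the glyph."""
    features = extract_features(glyph)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, prototype_features[label]))

    return min(prototype_features, key=squared_distance)
```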
Post Processing
The result of recognition is a set of individual characters. However, these characters do not
contain the complete information; they have to be combined to form strings. This process is called
grouping. Grouping depends on the location of the characters in the document. Characters which are
close to each other are grouped together to form a word, since the distance between two words is
greater than the distance between the letters of a word.
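A minimal sketch of such grouping is given below. It assumes that each recognized character comes with the left and right x-coordinates of its bounding box on a single text line, and that a fixed gap threshold separates words; both the data layout and the name group_into_words are assumptions made for illustration.

```python
def group_into_words(boxes, gap_threshold):
    """Group recognized characters into words.

    boxes: list of (character, x_left, x_right) tuples from one text line.
    A new word starts whenever the gap to the previous character
    exceeds gap_threshold.
    """
    words = []
    current = ""
    prev_right = None
    for char, left, right in sorted(boxes, key=lambda box: box[1]):
        if prev_right is not None and left - prev_right > gap_threshold:
            words.append(current)
            current = ""
        current += char
        prev_right = right
    if current:
        words.append(current)
    return words


# Gaps of 1 pixel inside words and 6 pixels between the two words.
line = [("a", 0, 4), ("b", 5, 9), ("c", 15, 19), ("d", 20, 24)]
print(group_into_words(line, gap_threshold=3))
```

In practice the threshold would have to be chosen relative to the point size of the text rather than fixed in pixels.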
Chapter 4: Typefaces for OCR
A typeface is the design of a collection of alphanumeric symbols. A typeface may include letters,
numerals, punctuation and various symbols, often for multiple languages. Typefaces are usually
grouped together in a family containing individual fonts for italic, bold and other variations of
the primary design. Strictly speaking, a font is the physical embodiment of a typeface (i.e. a
computer file or a metal piece in letterpress): the typeface is what we see, whereas the font is
what we use. In the rest of this thesis, font and typeface are used interchangeably.
4.1 Latin Fonts for OCR

Typefaces are designed for OCR so that they can be read by low-cost systems. These fonts have
unique and well-defined character sets which allow for greater accuracy in recognition. The most
popular Latin script fonts designed for OCR are as follows:
OCR-A
OCR-B
OCR-A
OCR-A is a monospaced font designed by American Type Founders. It was developed to meet the
standards set by the American National Standards Institute for the processing of documents by
banks, credit card companies and similar businesses. The design was kept simple so that it can be
read by machines, but it is very difficult for humans to read.
ABCDEFGHIJKLM
NOPQRSTUVWXYZ
abcdefghijklm
nopqrstuvwxyz
0123456789
&$%.!?
Figure 4.1 Characters of OCR-A
OCR-B
OCR-B was designed by Adrian Frutiger for the European Computer Manufacturers Association
(ECMA). It is a monospaced font and was designed following the standards of ECMA. The first
version contained 109 characters. The main objective was to create an international standard for
optical recognition. ECMA also wanted to avoid the wider acceptance of OCR-A because of its
unnatural look. Therefore, OCR-B was designed to be pleasant to the human eye.
ABCDEFGHIJKLM
NOPQRSTUVWXYZ
abcdefghijklm
nopqrstuvwxyz
0123456789
&$%.!?
Figure 4.2 Characters of OCR-B
It pushed the limits of optical recognition. Among machine-readable typefaces, this was the first
to give consideration to aesthetics. OCR-B was declared a worldwide standard in 1973.
The principle of OCR-B was that all characters must differ from each other by at least 7% in the
worst possible case. To check this, two characters were overlapped in such a way that they
overlapped optimally, as shown in figure 4.3. This test was also carried out using two different
printing weights: a fine weight, caused by a lack of ink in the typewriter ribbon, and a fat weight,
caused by ink blots.
Figure 4.3 Optimal Overlapping of characters
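The overlap test described above can be approximated in code. The sketch below is an assumption-laden illustration, not the procedure used for OCR-B: it represents glyphs as small binary bitmaps, searches a few pixel shifts for the best overlap, and measures the difference as the share of non-common ink in the combined ink area. The exact measure behind the 7% figure is not given in the text, and the names ink, min_difference and THRESHOLD are hypothetical.

```python
def ink(glyph):
    """Return the set of (row, col) positions that are inked in a binary bitmap."""
    return {(r, c) for r, row in enumerate(glyph) for c, value in enumerate(row) if value}


def min_difference(glyph_a, glyph_b, max_shift=2):
    """Smallest difference ratio between two glyphs over all small shifts of glyph_b.

    The ratio is (ink present in only one glyph) / (ink present in either glyph),
    an assumed stand-in for the measure used in the original test.
    """
    a = ink(glyph_a)
    b = ink(glyph_b)
    best = 1.0
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            shifted = {(r + dr, c + dc) for r, c in b}
            union = a | shifted
            if union:
                best = min(best, len(a ^ shifted) / len(union))
    return best


THRESHOLD = 0.07  # the 7% minimum difference quoted above

square = [[1, 1], [1, 1]]
notched = [[1, 1], [1, 0]]
print(min_difference(square, notched) >= THRESHOLD)
```

Running the same check on thinned and thickened versions of each bitmap would correspond to the fine and fat printing weights mentioned above.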
Generous character spacing was provided, since it is important for correct recognition, whereas
serifs were avoided because they increase the common coverage area of the characters and therefore
the similarity between characters, as shown in figure 4.4.
Figure 4.4 More coverage ratio because of serifs
First Test Version

The first test version was designed in 1963 and had 109 characters; it is shown in figure 4.5. In
this version the bowl shape of the uppercase letters was constant, whereas there were two types of
bowl shapes in the lowercase letters: a round bowl, for example in c d p, and a flat bowl in b g q.
Initially the height of the numerals and the uppercase letters was kept the same, but this was
changed before the first test version. All the numerals also had a dynamic shape, but the curves
were different. The uppercase O was very similar to the numeral 0.
Figure 4.5 First test version of OCR-B[10]
First Published Version
Figure 4.6 First published version of OCR-B[10]
OCR-B was first published as Standard ECMA-11 in 1965, containing 112 characters, and is shown
in figure 4.6. Some characters underwent considerable changes. The flat dynamic bowl of b g q
was converted to a static bowl shape to match the rest of the characters. The height of the
uppercase characters was also reduced to differentiate them more from the numerals. The outer
diagonals of W were curved in the new version. Also, the numeral 0 received a more oval shape to
differentiate it from the character O, and the dot of j was now normally placed. Altogether the
typeface now had a more consistent look and feel.
Some further corrections took place from 1969 onwards. The British pound sign was considerably
changed. There were still problems in differentiating D O 0 and B 8 &. With D, the curved stroke
now started directly at the stem; O was given a much more oval shape and 0 became more angular. B
was made wider, which resolved the problem of the B-8 pair, and the upper bowl of & was made
smaller. A horizontal bar was added to j (just like i) and the descender of y was curved. These
changes were not beneficial in terms of shape or aesthetics but were very important for
differentiating the characters.
Even with the international standard in 1976, the OCR-B project was far from over. The number of
characters increased constantly, from 121 characters in 1976 to 147 in 1994. Also in 1994, a
designer named Alexander Branczyk designed a proportional version of OCR-B called OCRBczyk. It
featured much finer visual details but remained true to the design of OCR-B.
Figure 4.7 OCR-B (top), OCRBczyk (bottom)
Application of OCR-B
Since the 1960s machine-readable typefaces have been used for data recognition. They can be found
on cheques, bank statements, credit cards and postal forms. OCR-B can also be found on the paying-in
forms and identity cards of many countries. Most barcode numbers are also set in OCR-B.
4.3 Devanagari Font for OCR
Although the development of OCR for Indian scripts is an active area of research today, not much
work has been done on designing a Devanagari font optimized for OCR. Unlike for the Latin script,
there is not even one commercially available Devanagari font which is optimized for Devanagari
OCR systems.
A few of the most common Hindi fonts are KrutiDev, Mangal, Surekh and Yogesh, but none of
them is designed for OCR. All these fonts have some letters with parts above the Shiro Rekha.
KrutiDev and Yogesh also have some letters which are not connected horizontally, like . In
addition, the stroke width of Yogesh is too thin for an OCR font, and Surekh is not a monolinear
font, which is why it cannot be used for OCR. Therefore there is a need for a Devanagari font
designed for OCR.
Chapter 5: Font Design for OCRs
Font design is the art and process of designing typefaces. Regardless of the method used to specify
a type design, all the characters of a typeface should have artistic consistency. No character
should look small or large compared to the other characters in the font. When designing type for
OCR systems, however, special precautions have to be taken for better accuracy. Many of the
principles of type design for Latin OCR fonts apply directly to Devanagari fonts, but due to the
difference in the segmentation algorithm extra care needs to be taken while designing for a
Devanagari OCR system.
5.1 Design Features Important for OCR

Every letter has to be more strongly differentiated than is customary in type design. Most of the
principles for designing type for OCR remain the same as for Latin, while special care needs to be
taken for Devanagari because of the difference in character segmentation. However, many constraints
which were present while designing OCR-B are no longer applicable because of advances in
technology. For example, earlier OCR systems were only able to read monospaced fonts, but current
systems can also recognize proportional fonts accurately. Some of the points that should be taken
care of while designing type for OCR are as follows.
One Character should Never be Contained in Another Character

No character, when overlapped with another, should lie completely inside the other character. This
is very important for correct recognition. To achieve this, additional features or elements are
added to a character to differentiate it from the others, as shown in figure 5.1. Similar-looking
characters can also be given different counter sizes, like and .
Figure 5.1 Addition of elements in OCR-B to differentiate two characters
Font should be Monolinear

Monolinear fonts are fonts in which the vertical and horizontal strokes have the same visual
weight. If a font has varying stroke widths, the thin strokes may break at small point sizes during
scanning or during binarization, thus creating problems in recognition.
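One rough way to check this property on a rendered glyph is to compare the thinnest runs of ink along rows and along columns. The sketch below is a simplified heuristic under stated assumptions (a non-empty binary bitmap with axis-aligned strokes); it is not a measure taken from the literature, and the names run_lengths and thinnest_strokes are hypothetical.

```python
def run_lengths(lines):
    """Lengths of all consecutive runs of ink (1s) in each line of a binary bitmap."""
    runs = []
    for line in lines:
        length = 0
        for value in line:
            if value:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:
            runs.append(length)
    return runs


def thinnest_strokes(glyph):
    """Return (shortest horizontal ink run, shortest vertical ink run).

    The first value approximates the width of the thinnest vertical stroke,
    the second the height of the thinnest horizontal stroke.
    """
    columns = [list(column) for column in zip(*glyph)]
    return min(run_lengths(glyph)), min(run_lengths(columns))


# An L shape with a 2-pixel stem and a 2-pixel bar: both values are equal.
monolinear = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
# A 3-pixel stem with a 1-pixel bar: the thin bar is the stroke likely to break.
contrasted = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
print(thinnest_strokes(monolinear), thinnest_strokes(contrasted))
```

A large difference between the two values points to a thin stroke that may not survive scanning or binarization at small sizes.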
Font should be Sans Serif
A serif is a small decorative line added at the end of some of the strokes that make up the basic
form of a character, as shown in figure 5.2. A typeface with serifs is called a serif typeface and
a typeface without serifs is called sans serif. Sans serif typefaces are preferred for OCR because
serifs increase the common coverage area of the characters and therefore the similarity between
characters.
Figure 5.2 Serifs in a typeface (grey serifs)
Generous Character Spacing

Character spacing is the distance between two characters. White space between two characters
helps in character segmentation, but it should not be comparable to the width of the space
character (' '). If the character spacing is not sufficient, the characters can end up touching
each other because of the noise added while scanning, which creates problems in character
segmentation.
Shadow characters should also be avoided. A character is said to be under the shadow of another
character if they do not physically touch each other but it is not possible to separate them merely
by drawing a vertical line. An example of shadow characters is shown in figure 5.3.
Although the algorithm takes care of shadow characters, they reduce the accuracy in some cases.
Figure 5.3 Shadow Characters[6]
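The definition of shadow characters given above can be turned into a simple test. The sketch below assumes each character is available as a set of (row, column) ink positions, treats two characters as touching when any of their pixels are 8-connected neighbours, and calls them shadowed when their column ranges overlap without touching. The function names are hypothetical.

```python
def column_span(pixels):
    """Return the leftmost and rightmost inked column of a character."""
    columns = [c for _, c in pixels]
    return min(columns), max(columns)


def touches(pixels_a, pixels_b):
    """True if some pixel of a coincides with or is an 8-neighbour of a pixel of b."""
    b = set(pixels_b)
    return any(
        (r + dr, c + dc) in b
        for r, c in pixels_a
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    )


def is_shadowed(pixels_a, pixels_b):
    """True if the characters do not touch but no vertical line can separate them."""
    a_left, a_right = column_span(pixels_a)
    b_left, b_right = column_span(pixels_b)
    columns_overlap = a_left <= b_right and b_left <= a_right
    return columns_overlap and not touches(pixels_a, pixels_b)


# A long top stroke overhanging a short stem further down.
overhang = {(0, 0), (0, 1), (0, 2), (0, 3)}
stem = {(3, 2), (4, 2)}
# A separate stem well to the right.
far_stem = {(0, 6), (1, 6)}
print(is_shadowed(overhang, stem), is_shadowed(overhang, far_stem))
```

A font designer could run such a check on every pair of adjacent glyphs at the intended spacing to find pairs that would need a more elaborate segmentation step.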
Big Closed Counters

The enclosed or partially enclosed circular or curved negative space (white space) of some letters
such as d, o, and s is the counter as shown in figure 5.4.
Figure 5.4 The counter is the circular negative space (grey)
While designing for OCR, the counters need to be kept large so that they do not get completely
filled, either because of noise while scanning or while printing. A filled counter can result in
faulty recognition, as the character can be confused with another character; for example, if the
counter of is filled it can be confused with by the OCR.
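Whether a counter has survived printing and scanning can be checked by counting the enclosed white regions of the binarized glyph. The following sketch assumes a rectangular binary bitmap (1 for ink, 0 for background) and uses a 4-connected flood fill; any white region that never reaches the image border is counted as a closed counter. The name count_closed_counters is hypothetical.

```python
from collections import deque


def count_closed_counters(glyph):
    """Count white regions of a binary bitmap that are fully enclosed by ink."""
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]

    def flood(start_r, start_c):
        """Fill one white region; report whether it reaches the image border."""
        queue = deque([(start_r, start_c)])
        seen[start_r][start_c] = True
        reaches_border = False
        while queue:
            r, c = queue.popleft()
            if r in (0, rows - 1) or c in (0, cols - 1):
                reaches_border = True
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols:
                    if not seen[nr][nc] and glyph[nr][nc] == 0:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
        return reaches_border

    counters = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == 0 and not seen[r][c] and not flood(r, c):
                counters += 1
    return counters


ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# The same shape after noise has filled the single-pixel counter.
filled = [row[:] for row in ring]
filled[2][2] = 1
print(count_closed_counters(ring), count_closed_counters(filled))
```

The example shows how a counter of only one pixel disappears with a single noisy pixel, which is why large counters are recommended above.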
Bold Strokes
Stroke width is another feature of a font which is very important for recognition, as thin strokes
can get smudged or broken because of poor quality printing and scanning. A bold stroke also helps
in the process of binarization.
5.2 Special Care for Devanagari

Apart from the precautions stated above, some special care has to be taken for Devanagari because
of the complicated segmentation process. For character segmentation the script is divided into three
parts: top, core (or middle) and bottom, and these parts are recognized separately. This increases
the complexity because, unlike in the Latin script, the ascenders and descenders of the characters
in the core strip are not treated as part of the character in the Devanagari script. So no
differentiating feature can be placed in the ascender or descender of a character. The special
precautions that need to be taken are discussed below.
Removal of Shiro Rekha and the Top Strip

Removal of Shiro Rekha is the second step in character segmentation (as shown in figure 3.7).
When Shiro Rekha is removed, all the features of the character at the level of Shiro Rekha or above
it are also removed from the core strip as shown in figure 5.5.
Figure 5.5 Example of few characters after the removal of Shiro Rekha
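The effect described above can be reproduced on a toy bitmap. The sketch below assumes that the Shiro Rekha is the row with the most ink in the word image and that it is one pixel thick; this horizontal-projection heuristic is an assumption made for illustration, and the names find_shiro_rekha and core_strip are hypothetical.

```python
def find_shiro_rekha(word):
    """Return the index of the row with the most ink, assumed to be the Shiro Rekha."""
    row_sums = [sum(row) for row in word]
    return row_sums.index(max(row_sums))


def core_strip(word):
    """Return the rows strictly below the Shiro Rekha.

    Everything at or above the header line is dropped, including any
    distinguishing feature of a character located there.
    """
    return word[find_shiro_rekha(word) + 1:]


word = [
    [0, 0, 1, 0, 0, 0],  # a feature above the header line
    [1, 1, 1, 1, 1, 1],  # the header line (Shiro Rekha)
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
print(find_shiro_rekha(word))
print(core_strip(word))
```

After the cut, the two stems in the core strip are identical, although only the first one carried a feature above the header line.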
When important features of a character lie at the level of the Shiro Rekha or above it, they are
removed along with it, resulting in no recognition or in recognition as a different character. For
example has a curve at the level of the Shiro Rekha which, when removed, results in looking
like . Similarly looks like when the Shiro Rekha is removed. This can be seen in figure 5.6.
Figure 5.6 Similarities between different characters once the Shiro Rekha is removed
Also, the differentiating characteristic between the kana () and the purna viraam () is the presence
of the Shiro Rekha above the kana. Once the Shiro Rekha is removed there is no differentiating
feature between these two characters, and one can be confused with the other. So while designing,
some differentiating feature has to be added to one of the two characters so that they can be
recognized accurately.
Removal of Bottom Strip

The step after the removal of the top strip in character segmentation is the removal of the bottom
strip. The bottom strip contains the lower matras, the halanta and the descenders of the letters in
the core strip. The most difficult part of this step is to determine where the core strip ends and
the bottom strip begins, because in the Devanagari script the lower matras are connected to the
characters in the core strip.
Also, a few characters like have characteristic features extending into the bottom strip. When
these features are removed, the character might closely resemble other characters, as shown in
figure 5.7.
Figure 5.7 Similarities between different characters once the bottom strip is removed
Also in some cases the descender resembles a particular lower matra or a diacritical mark. While
recognizing the lower matras in the bottom strip, the descender can be confused for the lower
matra which would result in incorrect recognition of both the character and the lower matra as
shown in figure 5.8.
Figure 5.8 Similarities between descender and Halanta
Recognition of Characters
Recognition of characters is much more complicated in Devanagari than in Latin because the
graphical similarities between letters are much greater. Some letters differ by just one stroke,
like , which has only an additional diagonal stroke compared to . Others differ from each other
only by the presence of a vertical line, like and .
Also, unlike the Latin script, Devanagari has letters which are disjoint horizontally, which results
in inaccurate recognition. This should be avoided wherever possible; for example can also be
designed as .
The open counters in the letters should also be designed carefully. An open counter is a partially
enclosed curved negative space of a letter, as shown in figure 5.9.
Figure 5.9 Example of open counter (grey)
While designing the counters, special care needs to be taken so that the strokes forming these
curves do not get connected because of noise or smudging, as this causes the algorithm to confuse
two letters. For example, if the strokes of connect together, it can be recognized as .
Chapter 6: Proposed Font Designed for OCRs
The proposed version contains 53 characters, including letters and matras. The font is Unicode
based. A reduction in the calligraphic strokes can be seen. All the characters are designed to have
the same height, i.e. no part of a character goes above the Shiro Rekha or below into the bottom
strip. Characters which were similar in design were given additional features. Also, a small gap is
left between the lower matras and the core strip, which helps in segmentation. The font is
monolinear and has a bold stroke so that the strokes are not broken in the process of binarization.

[Type specimen of the proposed font at 6, 8, 9, 10, 11, 12, 14, 18, 24, 36, 48, 60, 72 and 96 pts]
[Text specimen of the proposed font at 8, 10, 11, 12, 14, 18 and 24 pts]
6.1 Designing Characters
The characters , , and have similar features. Hence while designing these characters, care
must be taken so that the OCR is able to differentiate between them. In order to incorporate
differences in the features of these characters, the following steps are taken:
The diagonal bar of is elongated so that it can be differentiated from . The elongated part of
is shown in Figure 6.1.
The bowls of these characters are designed differently so that even if there is smudging and the
horizontal bar in breaks, these characters can be differentiated by the shapes of their bowls.
The bowls of and are kept the same, as they can be differentiated using the elongated diagonal
bar, and the bowls of and are different. The difference in the shape of the bowls of and is
shown in Figure 6.2 by overlapping these characters.
Figure 6.1 Extension in the diagonal stem of to differentiate it from
Figure 6.2 Difference in the shape of the bowl
In the first version, the width of was also compressed on the assumption that it would give better
results, but test results showed that the width did not have a prominent impact, and hence the width
of was changed to the standard in the final design. The final design of , , and is shown in
Figure 6.3.
Figure 6.3 Final design of letters , and
The common element in this group is the presence of the filled counter, as shown in figure 6.4,
which is the most distinguishing feature of the characters. The final design of the letters can be
seen in figure 6.5.
Figure 6.4 Common element of and
To differentiate the letters in this group the horizontal and the vertical distance of the horizontal
and vertical bar is not kept the same as shown in figure 6.6.
Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar
Also the top part of doesn't go above the Shiro Rekha so that it doesn't look like once the Shiro
Rekha is removed as shown in figure 6.7.

The counters of the letters should be large so that they do not get filled with noise. Also, a
closed counter was designed so that if a joint is broken at one end, the letter still does not look
like the half form of the letter, as shown in figure 6.8. Also, the diagonal bar of has to be bold
so that it is not broken while printing or scanning or in the process of binarization.
Figure 6.8 Closed counter in the letters
The final design of the letters can be seen in the figure 6.9.
The challenge while designing this group was that should not look like . For this, the length of
the diagonal stroke was reduced, which also made the counter more open, as shown in figure 6.10.
Also, should not look like if the lower counter is filled. The difference in the shapes of and
can be seen in figure 6.11.
Figure 6.10 Small diagonal stroke and the openness of the counter in
Figure 6.11 Overlapping of the letters and
While designing these characters, the following care must be taken so that the OCR is able to
recognize them:
The horizontal bar shown in figure 6.13 should not touch the vertical bar at the left even in
the presence of noise; to ensure this, the distance between them should be kept large.
The closed counter of should be large so that it is not filled at small sizes or in the presence
of noise.
The letter should not look like after the Shiro Rekha is removed, as shown in figure 6.14.
Figure 6.13 Distance between horizontal and vertical bar of the letters
No character when overlapped with another should be completely inside the other letter.
The diagonal bar of has to be bold so that it is not broken while printing or scanning or in the
process of binarization.
The counter of had to be large so that it does not get filled with noise, because if it is filled
the OCR system confuses with . The final design of the letters can be seen in figure 6.18.

The letters and are designed to have the same character height as the other letters, as
shown in figure 6.19.
The ending of the stroke of had to be extended more than required for the normal design so
that it is not confused with .
Figure 6.19 Equal character height of all the letters

The main concern while designing these letters is that should not look like after the Shiro
Rekha is removed. An overlap of these letters without the Shiro Rekha is shown in the figure 6.21
and the final design of the letters is shown in the figure 6.22.
Figure 6.21 Overlap of the letters and without the Shiro Rekha
The horizontal bar of should be bold so that it is not broken while scanning or printing.
The top of should not go above the Shiro Rekha.
The letter should not completely overlap . The difference is shown in figure 6.23.
Figure 6.23 Overlap of the letters and
These characters don't have any common element. The final design of the letters is shown in figure 6.25.
Matras
While designing the lower matras, a small white space was left between the lower matras and the
bottom of the letters, as shown in figure 6.26. The final design is shown in figure 6.27.
Figure 6.26 Small white space between the letters and lower matras

Figure 6.27 Final design of matras
6.2 Evolution of the Typefaces
The first version contained 52 characters. All the characters had the same character height. The
stroke width of the first version was very thin. The bowls of and had a dynamic shape. Some of
the letters, like , were compressed so that the character widths of all the letters were
comparable. The lower mean line was also kept higher.
The final version, on the other hand, has a much bolder stroke so that the strokes are not broken
while printing or scanning. Some characters underwent considerable corrections. The horizontal bars
of and were brought closer to the Shiro Rekha. The knot at the bottom part of , , and was
removed to make the characters more open, and the diagonal stroke at the bottom was converted to a
straight stroke. was given a more calligraphic look. Altogether the typeface now has a more natural
look and better stroke consistency than the first version. A comparison of the first and final
versions is shown in figure 6.28.

Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)
6.3 Anatomy of the Typefaces
The grid used is shown in figure 6.29.
Upper Matra
Shiro Rekha
Character Height
Lower Matra
Figure 6.29 Grid used in OCR-D
The H:V ratio used is 1.1, as shown in figure 6.30.
Figure 6.30 H:V Ratio
6.4 Testing of the Typefaces
Once the font is designed, the next step is to test its accuracy on an OCR system. For this, an OCR
system called HindiOCR is used. HindiOCR was designed by Oliver Hellwig of the Department for
Languages and Cultures of Southern Asia, Freie Universität Berlin. It converts printed Hindi
texts into rich-text documents (RTF) in Devanagari Unicode encoding and processes standard
image formats, i.e. *.jpeg, *.png and *.bmp. A free demo version of HindiOCR can be found at
[11]. The test document can be seen in figure 6.31. Figure 6.32 shows the text after the process of
binarization, and figure 6.33 shows the result of HindiOCR when the algorithm was run on the test
document in figure 6.31.
Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image
Comparison with Other Fonts
The performance of the font was compared with that of other fonts. For testing purposes the fonts
Surekh and Yogesh were used. The same text was typeset in all three fonts and the results were
compared. OCR-D gave the best result and Surekh had the most errors. Although Yogesh performed
much better than Surekh, some errors occurred consistently: ' ' was most of the time recognized
as ' ', and sometimes '' was recognized as a combination of and because of the disjoint in .
Figure 6.34 shows the scanned text document typeset in OCR-D and figure 6.35 shows the result when
the OCR algorithm was executed on this scanned document. Figures 6.36 and 6.38 show the scanned
text documents typeset in Yogesh and Surekh respectively, and figures 6.37 and 6.39 show their
results.
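The comparison above is reported qualitatively. One possible way to quantify such a comparison, shown here only as a sketch and not as the method used in this work, is the character error rate: the edit distance between the typeset text and the OCR output, divided by the length of the typeset text. The function names are hypothetical, and the reference text is assumed to be non-empty.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, recognized):
    """Edit distance normalized by the length of the reference text."""
    return edit_distance(reference, recognized) / len(reference)


# Placeholder strings; real use would compare the typeset text with the OCR output.
print(character_error_rate("abcd", "abxd"))
```

For Devanagari text the strings are sequences of Unicode code points, so a wrongly recognized matra counts as one error separately from its base letter.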
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document in figure 6.34
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.38 Scanned document typeset in Surekh
Chapter 7: Future Work
All the letters of the Devanagari script have been designed and tested on an OCR system.
Although all the vowels (independent and dependent forms) and the consonants have been
designed, the numerals and some of the diacritics still have to be designed.
Recognition of conjuncts in Devanagari has to be studied. This includes the algorithm for
recognition and the algorithm for separating the half form of a letter from the full form. The
design of conjuncts has to be completed. Also, the kerning table has to be decided upon, keeping in
mind generous character spacing.
Comprehensive testing of the font also needs to be done and, based on the results of the tests, the
design of the characters has to be tweaked appropriately.
References
[1] Bapurao S. Naik, Typography of Devanagari
[2] Girish Dalvi, Anatomy of Devanagari Typefaces
[3] www.stefantrost.de
[4] Tushar Patnaik, Shalu Gupta, Deepak Arya, Comparison of Binarization Algorithm in
Indian Language OCR
[5] B.B. Chaudhuri and U. Pal, Skew Angle Detection of Digitized Indian Script Documents
[6] Huanfeng Ma, David Doermann, Adaptive Hindi OCR Using Generalized Hausdorff Image
Comparison
[7] Vijay Kumar, Pankaj K. Sengar, Segmentation of Printed Text in Devanagari Script and
Gurmukhi Script
[8] Mudit Agrawal, M. N. S. S. K. Pavan Kumar, C. V. Jawahar, Indexing and Retrieval of
Devanagari Text in Printed Documents
[9] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, Offline Recognition of
Devanagari Script: A Survey
[10] Heidrun Osterer, Philipp Stamm, Adrian Frutiger Typefaces. The Complete Work
[11] http://www.indsenz.com