
Sinhala OCR (Digital, Handwritten, & Palm-leaf Text)

Nikeshala Wickramaarachchi1 (nikeshalaw@gmail.com)
D. L. Anoj De Silva2 (anojonline@gmail.com)
Tashila Kannangara3 (tashila.kannangara@gmail.com)
G. Balachandran4 (bala@apiit.lk)

School of Computing, Asia Pacific Institute of Information Technology (APIIT), Sri Lanka.

1.0 Abstract
OCR (Optical Character Recognition) for Sinhala script has become an area of interest in recent years, with a number of studies conducted on the subject. This paper uses multi-font, multi-size digital text, handwritten text and palm-leaf manuscripts as three (3) case studies to address Sinhala OCR. Each case study addressed its problem domain by developing a demo tool. The demo tools were implemented mainly using Artificial Neural Networks.

Key Words: Sinhala Script, Optical Character Recognition, OCR, Artificial Neural Networks, Feature Extraction, Image Processing.

2.0 Introduction
Sinhala is an official language of Sri Lanka and is used primarily by its ethnic majority, the Sinhalese. Sinhala script is principally used as the writing system for the Sinhala language. It derives its orthography from the Brahmi script. The Brahmic scripts are a family of abugidas (writing systems) used in South Asia, Southeast Asia, Tibet, Mongolia and Manchuria. Moreover, the Sinhala writing system was influenced by the Pallava Grantha script, which was in use around the 8th to 10th centuries.

The art of converting human-readable documents into machine-readable and editable ASCII and/or Unicode files is known as Optical Character Recognition (OCR). Most modern OCR engines for scripts such as Arabic, Latin, Chinese and Korean are capable of handling multi-font and multi-size characters. These engines handle font families such as Serif and Sans-serif and different font sizes such as 10, 12 and 14. Our research found no reported OCR engine for Sinhala script that supports multiple fonts and multiple sizes. It therefore remains a challenging problem to develop a practical OCR system for multi-font and multi-size characters contained in a single document. Case study one (1) tries to address this problem.

1 Author of Case Study 2
2 Author of Case Study 1
3 Author of Case Study 3

Case study two (2) addresses handwritten Sinhala script. A large number of organizations in Sri Lanka deal with data acquired in the form of Sinhala handwriting. Handwriting is a major source of input to most organizations where data is collected using hand-filled forms such as registration forms, tax forms, visa forms and census forms. Currently all collected data needs to be entered into information systems manually for processing and storage. The manual data entry process is extremely time consuming and error prone. These organizations would benefit greatly from a system that could convert handwritten Sinhala script directly to electronic text.

Case study three (3) addresses palm-leaf manuscripts. Much of Sri Lanka's historical data, such as medicinal potions, Buddhist dharma and astrological data, is written on palm leaves. Over the past two thousand years most valuable data was written in palm-leaf manuscripts. Most of these palm leaves are nearing the end of their natural lifetime or are facing destruction. Some OCR systems have been created for palm-leaf manuscripts, but not for Sinhala script.

3.0 Sinhala Script


Indic languages primarily belong to two major linguistic families, Indo-Aryan and Dravidian. In Sri Lanka the majority language is Sinhala, which belongs to the Indo-Aryan family. Sinhala script uses the abugida system. An abugida is a writing system in which each consonant letter represents a pure consonant accompanied by a specific vowel; the other vowels are indicated by modification of the consonant sign, either by means of diacritics or through a change in the form of the consonant (Daniels and Bright, 1996, p.4). To indicate a consonant with a different vowel, symbols are added around the base symbol: before, after, above or below it. In some cases, modifiers are placed on both sides of the consonant. This feature makes OCR a complex task for Indic scripts.

According to Gunasekara (1891, p.3), there are two types of alphabets in Sinhala: the Elu alphabet and the Mixed Sinhala alphabet. The Sinhala alphabet in use at present differs from both. The contemporary Sinhala alphabet consists of a total of 60 letters: 18 vowels, 40 pure consonants, and the Anusvaraya and Visargaya. Some researchers consider that there are 41 pure consonants in the contemporary Sinhala alphabet (Premaratne & Bigun 2002).

4.0 Optical Character Recognition (OCR)


An OCR system consists of several stages: preprocessing, segmentation, feature extraction and character recognition. The objective of the preprocessing stage is to enhance the image quality and prepare the image for the later stages. The output of the OCR system is highly influenced by the preprocessing module. Activities such as thresholding, noise removal, skew correction and background line removal are usually conducted within the preprocessing stage. OCR requires only the foreground of the image. Extracting the foreground from the background of an image is known as thresholding (binarization). Noise removal is usually conducted on the image after thresholding. Noise can be defined as any unwanted information contained in a digital image. Noise in document images can be caused by certain attributes of the scanner, improper tuning of scanning parameters, the texture of the source and the type of implement used to produce the characters.
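As an illustration of the thresholding step described above, the following is a minimal sketch, not the implementation used in the case studies. It assumes NumPy, an 8-bit grayscale image with dark ink on a light background, and a fixed threshold of 128; the function name `binarize` and the threshold value are illustrative assumptions.

```python
import numpy as np

def binarize(gray, threshold=128):
    """Return a 0/1 image in which 1 marks foreground (ink).

    Assumption: dark text on a light background, so pixels darker than
    `threshold` are treated as foreground. The value 128 is arbitrary.
    """
    gray = np.asarray(gray)
    return (gray < threshold).astype(np.uint8)

# Tiny 3x4 grayscale "page": a dark stroke on a light background.
page = np.array([
    [250, 245,  30, 240],
    [248,  20,  25, 251],
    [252, 249,  35, 247],
], dtype=np.uint8)

binary = binarize(page)
print(binary)
```

Whether foreground pixels are the ones above or below the threshold depends on how the image encodes intensity, so the comparison may need to be reversed for a given scanner or image format.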

The goal of segmentation is to break down a set of characters into smaller entities prior to the recognition process. To reach the final output of segmentation, the group of characters should be segmented into lines, words and finally individual characters. Character recognition should not be directly conducted on raw segmented characters because characters of different sizes and the large number of input variables can cause problems for pattern recognition systems. Feature extraction is used after segmentation to transform raw character images into a smaller and consistent number of variables known as features. After the feature extraction stage character recognition is conducted. The objective of character recognition is to successfully recognize a character using the extracted character features.

5.0 Case Study 1: Multi-font and Multi-size Sinhala OCR


Our research shows that the only currently available (and reported) Sinhala OCR engine (2009), developed by UCSC, is capable of single-font, single-size character recognition, but not of multi-font and multi-size Sinhala character recognition at once. In reality, however, most documents contain at least two fonts and at least two font sizes. This case study presents a practical scheme for multi-font and multi-size character recognition for Sinhala script using an Artificial Neural Network (ANN), which proves the concept that multi-font and multi-size Optical Character Recognition can be applied to Sinhala script as well.

Optical images containing multi-font and multi-size Sinhala vowel characters are taken as the inputs to the system. Each image goes through pre-processing techniques such as grayscale dilation and median filtering and is then converted to a binary image using a global threshold value. As the first step, the image goes through a grayscale dilation process to reduce unwanted color detail. Noise filtering techniques are then applied to reduce noise to a certain extent: median filtering, sharpening and smoothing are used to enhance the image quality so that the OCR system can produce a reasonable overall output. The grayscaled, noise-reduced image is then converted to a binary image, as shown in Figures 1, 2 and 3. This binary image consists of just two pixel values (black and white). An intensity value is chosen; pixels with values higher than the chosen value are marked as 1 (black pixels) and pixels with lower values are marked as 0 (white pixels). This process helps to differentiate objects from the background of an image (Gonzalez & Woods, 2002).

Figure 1: RGB Image with background color

Figure 2: Grayscale Image

Figure 3: After Thresholding
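The median-filtering step in the pre-processing chain above can be sketched as follows. This is a minimal illustration assuming NumPy and a 3x3 window with edge replication at the borders; the paper does not state the window size, and the function name `median_filter3` is hypothetical.

```python
import numpy as np

def median_filter3(gray):
    """Apply a 3x3 median filter; borders are handled by edge replication."""
    gray = np.asarray(gray)
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the nine shifted views so each pixel sees its 3x3 neighbourhood.
    windows = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3)
        for dx in range(3)
    ])
    return np.median(windows, axis=0).astype(gray.dtype)

# A flat light patch with one dark speck of impulse noise in the middle.
noisy = np.full((5, 5), 200, dtype=np.uint8)
noisy[2, 2] = 0

clean = median_filter3(noisy)
print(clean)
```

Because the median ignores a single outlier in each window, the isolated speck is removed while the flat region is left unchanged.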

Once the image is pre-processed, the segmentation process has to be completed. An optical image often contains multiple text lines. A Horizontal Projection Profile is therefore used to segment the text lines, as shown in Figures 4 and 5. The projection profile gives valleys of zero height for the OFF pixels between the text lines. Segmentation of the image into separate lines is done at these valley points (Reddy & Krishnamoorthi 2008).

Figure 4: Image with multiple text lines

Figure 5: Horizontal Projection Profile of Figure 4.
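Line segmentation with a horizontal projection profile can be sketched as below. This is a minimal illustration assuming NumPy and a binary image in which 1 marks ink; the helper names `find_runs` and `segment_lines` are hypothetical and are not taken from the demo tool.

```python
import numpy as np

def find_runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # first row of a text line
        elif value == 0 and start is not None:
            runs.append((start, i))        # a valley closes the line
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines(binary):
    """Sum ink pixels per row, then cut at the zero-height valleys."""
    profile = np.asarray(binary).sum(axis=1)
    return find_runs(profile)

page = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=np.uint8)

lines = segment_lines(page)
print(lines)  # [(1, 3), (5, 6)]
```

The sketch cuts only where the profile is exactly zero, so skewed or overlapping lines would need skew correction or a non-zero valley threshold first.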

After the text lines of an optical image are segmented, the characters/glyphs are segmented by applying Vertical Projection Profiles to the segmented text lines, as shown in Figure 6.
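The same idea applied column-wise to a single text line can be sketched as follows. This assumes NumPy and a binary line image in which 1 marks ink; `segment_characters` is a hypothetical name, and the sketch separates glyphs only where an entirely empty column lies between them.

```python
import numpy as np

def segment_characters(line):
    """Split a binary text-line image into glyph images at empty columns."""
    line = np.asarray(line)
    has_ink = line.sum(axis=0) > 0   # vertical projection profile as a mask
    # Pad with False so every glyph has a rising and a falling edge.
    padded = np.concatenate(([False], has_ink, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # first column of each glyph
    ends = np.flatnonzero(edges == -1)     # one past the last column
    return [line[:, s:e] for s, e in zip(starts, ends)]

line = np.array([
    [1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0],
], dtype=np.uint8)

glyphs = segment_characters(line)
print([g.shape for g in glyphs])  # [(3, 2), (3, 1), (3, 1)]
```

Touching characters share at least one inked column and are therefore returned as a single glyph by this approach.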

Figure 6

Once the characters are segmented, each character is extracted into a box with the width and height of the respective character. The isolated characters/glyphs are then resized to a specific image size (250 x 250) so that each contains only the specified amount of pixel data. This is the normalization process used to meet the main objective of this case study, which is multi-font and multi-size character recognition. A size-invariant, shape-invariant, constant-size character/glyph image would be the ideal solution to make the feature extraction more generalized (Shatil & Khan, 2006).
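The normalization step above (crop to the character's own box, then resize to a constant size) can be sketched as below. The paper does not state the interpolation method, so nearest-neighbour sampling is an assumption here, as are the function names; NumPy is assumed.

```python
import numpy as np

def crop_to_ink(glyph):
    """Trim empty rows and columns around the glyph (needs at least one ink pixel)."""
    glyph = np.asarray(glyph)
    rows = np.flatnonzero(glyph.any(axis=1))
    cols = np.flatnonzero(glyph.any(axis=0))
    return glyph[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image, size=250):
    """Resize to size x size by nearest-neighbour sampling."""
    h, w = image.shape
    row_idx = np.arange(size) * h // size
    col_idx = np.arange(size) * w // size
    return image[row_idx][:, col_idx]

def normalize(glyph, size=250):
    return resize_nearest(crop_to_ink(glyph), size)

# The same L-shaped glyph at two "font sizes".
small = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
], dtype=np.uint8)
large = np.kron(small, np.ones((3, 3), dtype=np.uint8))  # 3x enlargement

a = normalize(small)
b = normalize(large)
print(a.shape, np.array_equal(a, b))
```

In this toy case the two sizes of the glyph normalize to identical 250 x 250 images, which is the property the normalization is meant to provide for the later stages.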

The concept of this feature extraction method is to create an abstract image of a character/glyph from the total pixel data (250 x 250 pixels) obtained after the normalization process. A sample of an abstract image created by the prototype is shown below in Figure 8.

Figure 8: Before sampling (250 x 250 pixels) and after sampling to 25 x 25 pixels
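One way to sample a 250 x 250 glyph down to a 25 x 25 abstract image is to average non-overlapping 10 x 10 blocks. The paper does not specify the sampling rule, so block averaging is an assumption in this sketch, as is the function name `downsample`; NumPy is assumed.

```python
import numpy as np

def downsample(glyph, out=25):
    """Average non-overlapping blocks to get an out x out grid of ink densities."""
    glyph = np.asarray(glyph, dtype=float)
    h, w = glyph.shape
    if h % out or w % out:
        raise ValueError("image size must be a multiple of the output size")
    bh, bw = h // out, w // out
    # Reshape so each block gets its own pair of axes, then average them away.
    return glyph.reshape(out, bh, out, bw).mean(axis=(1, 3))

glyph = np.zeros((250, 250), dtype=np.uint8)
glyph[:, :125] = 1            # ink in the left half only

grid = downsample(glyph)
features = grid.ravel()       # 625 values that could feed a classifier
print(grid.shape, features.shape)
```

Each of the 625 values is the fraction of ink in one block, which reduces the 62,500 input pixels to a small, fixed-length feature vector.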

Finally, a feed-forward neural network with the back-propagation algorithm for supervised learning is chosen for training and recognition. As mentioned above, the demo tool developed to recognize multi-font and multi-size Sinhala characters/glyphs proves the concept that multi-font and multi-size OCR for Sinhala script is possible. This achievement takes Sinhala OCR technology to a new level through the use of Artificial Neural Networks.
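A feed-forward network trained with back-propagation can be sketched in a few lines. The architecture below (one tanh hidden layer, softmax output, full-batch gradient descent) and all hyperparameters are assumptions for illustration; the paper does not report the network's layout. The toy data are 2 x 2 binary patterns rather than real glyph features, and NumPy is assumed.

```python
import numpy as np

def train_mlp(X, y, hidden=8, classes=2, lr=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network; return a predict function and the loss history."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, classes))
    b2 = np.zeros(classes)
    Y = np.eye(classes)[y]                      # one-hot targets
    losses = []
    for _ in range(epochs):
        # Forward pass.
        H = np.tanh(X @ W1 + b1)
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        losses.append(float(-np.mean(np.log(P[np.arange(len(y)), y]))))
        # Backward pass (gradients of the mean cross-entropy).
        d_logits = (P - Y) / len(y)
        dW2 = H.T @ d_logits
        db2 = d_logits.sum(axis=0)
        dZ = (d_logits @ W2.T) * (1.0 - H ** 2)       # tanh derivative
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        # Gradient-descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def predict(X_new):
        H = np.tanh(X_new @ W1 + b1)
        return np.argmax(H @ W2 + b2, axis=1)

    return predict, losses

# Flattened 2x2 patterns: class 0 has ink in the left column, class 1 in the right.
X = np.array([
    [1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0],
    [0, 1, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1],
], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

predict, losses = train_mlp(X, y)
print(predict(X), losses[0], losses[-1])
```

In a real system the input layer would take the extracted glyph features (for example, a flattened 25 x 25 abstract image) and the output layer would have one unit per character class.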

6.0 Case Study 2: Offline Sinhala Handwriting OCR


OCR and handwriting recognition for Sinhala script have attracted a significant amount of attention in recent years. Analysis of existing research reveals that most efforts focus on a limited subset of Sinhala characters and on recognizing constrained, well-defined handwriting. The significant lack of research in the area of unconstrained Sinhala handwritten script recognition prevents the existing research and development attempts from being useful in any realistic environment.

Handwriting recognition is the task of transforming a language represented in its spatial form of graphical marks into its digital representation (Plamondon and Srihari 2000, p.64). The ultimate handwritten script recognition system should be able to recognize unconstrained writing produced by any writer, deal with different writing styles and languages, and remain unaffected by the size of the vocabulary. Developing such a system remains a challenge due to the complex nature of handwriting. Some of the factors contributing to this complexity are writer dependency, various writing styles, similar-looking characters, the nature of the input signal and the vocabulary.

In this research the offline Sinhala handwriting recognition system was developed, trained and tested using handwritten names collected from the National Identity Cards (NICs) of Sri Lankan citizens that contain Sinhala script. Since names of individuals contain a majority of contemporary Sinhala characters, names were chosen as the domain to test the demo tool. Since NICs contain names written in unconstrained Sinhala script and are readily available, they were selected as the medium for acquiring handwritten Sinhala names.

As described in case study 1, this research also uses thresholding and noise removal techniques for image preprocessing. After preprocessing, segmentation and feature extraction methods are used. Finally, an ANN is used as the classifier for character recognition. The major steps are shown in Figure 9.

Figure 9: Major steps of the system

The use of an ANN provided a considerable level of accuracy for handwriting recognition, but the test results suggested room for improvement. Recognizing all Sinhala characters falls into the category of large vocabulary problems. For such problems, utilizing knowledge of the lexicon is a recommended method of increasing the performance of the system. A technique such as the Hidden Markov Model (HMM) can be integrated into the system to increase its overall accuracy. Koerich et al. (2002, p.99) and Marinai et al. (2005, p.31) have used hybrid classifiers of ANN and HMM to increase the accuracy of handwriting recognition systems.
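As a much simpler illustration of using lexicon knowledge than the ANN and HMM hybrids cited above, a recognized word can be replaced by the closest entry in a list of known names. The sketch below uses Levenshtein edit distance; it is not an HMM and not the method of the cited works, and the lexicon of romanized names is a hypothetical placeholder.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def correct(word, lexicon):
    """Return the lexicon entry closest to the recognized word."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))

# Hypothetical lexicon of names; a real system would hold Sinhala strings.
lexicon = ["nimal", "kamal", "sunil"]
print(correct("kamol", lexicon))  # kamal
```

A post-processing step of this kind can only repair words that are in the lexicon, so it suits closed domains such as lists of names.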

6.0 Case Study 3: Palm leaf manuscripts OCR


Palm leaves were once a popular writing medium, especially in the Asian region. These manuscripts are created by first carving characters or letters into the dried palm leaf using a metal stylus. Next, the contrast, legibility and visibility of the carving are improved by applying lampblack with coconut oil or another aromatic oil that has insect-repellent qualities. The lifetime of palm-leaf manuscripts is not as long as that of artificial materials. They face destruction from causes such as dampness, fungus and insects. The destruction of palm-leaf manuscripts leads to the risk of losing a wealth of ancient knowledge contained within them. OCR systems can be used to preserve the knowledge contained in these manuscripts more efficiently than a manual system.

The domain would be too vast if the selected topic were to address an area such as medicine-related or dharma-related manuscripts. To avoid those difficulties, the scope is narrowed to address only horoscopes written on palm leaves.

Sinhala palm-leaf horoscopes have three main parts in which characters are written: two cages that give the zodiac sign and the Nawanshakaya; a description written in Sanskrit using Sinhala script, which describes the time at which the person was born and related details; and a description in Sinhala of the person's astronomical details. The proposed system considers only the two cages and the Sinhala description. Due to the difficulty of identifying and recognizing them, compound characters, touching characters and the Sanskrit description are not addressed within the domain.

Even though the basic system is complete, many enhancements are needed before it can be used as a product. Some are listed below.
- A capability to acquire the data from within the system (integrated scanning facility).
- A fully automated image pre-processing stage.
- A capability to segment overlapped lines and overlapped characters.
- A capability to match the output against the possible words in a lookup table.

7.0 Conclusion
This paper uses multi-font, multi-size digital text, handwritten text and palm-leaf manuscripts as three case studies to address Sinhala OCR for the first time. All three (3) case studies addressed their problem domains by developing demo tools. Except for the palm-leaf OCR, the systems show satisfactory results. However, improvements could be made in the preprocessing, segmentation, feature extraction and character recognition stages to improve the overall accuracy of all three systems. Recognizing all Sinhala characters falls into the category of large vocabulary problems. For such problems, utilizing knowledge of the lexicon is a recommended method.

8.0 Acknowledgements
Our heartfelt gratitude goes out to our project supervisor, Mr. Balachandran Gnanasekaraiyer, for the vital encouragement and guidance he provided us at all times. We would also like to thank our project assessors, Mr. Gamindu Hemachandra and Ms. Jina R. Daluwatta, for the valuable feedback they gave us during the important stages of this project. The support given by the academic staff, lab administrators and library staff at APIIT is deeply appreciated.

9.0 References
Daniels, P.T., Bright, W., 1996. The World's Writing Systems. 1st ed. Oxford University Press, New York.

Gonzalez, R. C., Woods, R. E., 2002. Digital Image Processing. 2nd ed. Pearson Education, Delhi.

Gunasekara, A. M., 1891. A Comprehensive Grammar of the Sinhalese Language. Asian Educational Services, New Delhi.

Koerich, A.L., Leydier, Y., Sabourin, R., Suen, C.Y., 2002. A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models. In: Eighth International Workshop on Frontiers in Handwriting Recognition, August 6-8 2002, Ontario, Canada. pp. 99-104.

Marinai, S., Gori, M., Soda, G., 2005. Artificial Neural Networks for Document Analysis and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 23-35.

Plamondon, R., Srihari, S. N., 2000. On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-84.

Premaratne, H.L., Bigun, J., 2002. Recognition of Printed Sinhala Characters Using Linear Symmetry. The 5th Asian Conference on Computer Vision, 23-25 January 2002, Melbourne, Australia.

Reddy, N.V.S., Krishnamoorthi, 2008. Hierarchical Recognition System for Machine Printed Kannada Characters. IJCSNS International Journal of Computer Science and Network Security, vol. 8, no. 11, pp. 44-53.

Shatil, A.S.M., Khan, M., c.2006. Minimally Segmenting High Performance Bangla Optical Character Recognition Using Kohonen Network. Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.

Biographical Notes

G. Balachandran

Bala is a lecturer at the School of Computing at the Asia Pacific Institute of Information Technology, Sri Lanka, and is a consultant to the ICT Agency of Sri Lanka for the Tamil language. He was responsible for the standardization of Tamil encoding, collation and keyboard. He has also been working in ICT localization for the last 4 years and is a member of the Local Language Working Group (LLWG) at the ICT Agency.

D. L. Anoj De Silva, Nikeshala Wickramaarachchi, and Tashila Kannangara

Anoj, Nikeshala and Tashila are graduates of the APIIT city campus, which is affiliated with Staffordshire University, UK, and hold the B.Sc. (Hons) in Computing, specializing in Software Engineering. Currently Anoj is working as an Associate Software Engineer at Virtusa Corporation, Nikeshala as an intern at Unilever (Pvt) Ltd and Tashila as an E-Marketing Executive at Archmage (Pvt) Ltd.