Sie sind auf Seite 1von 16

Guide to OCR for Arabic Scripts

Volker Mrgner r Haikal El Abed


Editors

Guide to OCR
for Arabic Scripts

Editors
Volker Mrgner
Institute for Communications Technology
Braunschweig Technical University
Braunschweig, Niedersachsen
Germany

Haikal El Abed
Institute for Communications Technology
Braunschweig Technical University
Braunschweig, Niedersachsen
Germany

ISBN 978-1-4471-4071-9
ISBN 978-1-4471-4072-6 (eBook)
DOI 10.1007/978-1-4471-4072-6
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2012941223
Springer-Verlag London 2012
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publishers location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

It is with great pleasure that I accepted to write the foreword for this book.
In our society, we use computers increasingly for all kinds of communication.
This means that for a language to work well for communication, we need to have
computer toolslanguage technologyfor that language. Language technology
makes communication more efficient, and it enables many more users to have access
to information in various forms.
Arabic is an important language, spoken in a large region of the world, and it
is important to support the use of Arabicin business, in public administration, in
private homes, and even in the exchange of information with the rest of the world.
As not all Arabic information is already available in electronic form, one of the
basic requirements in such a scenario is the ability of computers to read images
of written Arabic, be it handwritten or printed, be it online or offline. This kind
of technology (OCRoptical character recognition) is difficult to create even for
languages in Latin script, and written Arabic adds to the complexity.
This book is a comprehensive guide to the field of Arabic OCR, offering thorough descriptions of the data sets for training and testing, several different OCR
methodologies, and the evaluation and assessment thereof.
Before closing, I would like to congratulate the editors in having assembled such
an important collection of contributions to a complex problem for which the Arabic
world, and the rest of the world, needs solutions.
Copenhagen, Denmark

Bente Maegaard

Preface

Arabic Scripts
The Internet is a source of information which is used by approximately 2 billion users worldwide (www.internetworldstats.com). Two aspects affect the way in
which the Internet is used, especially in searching for documents. One is the availability of more and more scanned documents with varying amounts of metadata for
searching these documents on the Internet, and the other os the appearance of more
and more languages with non-Latin characters. Both aspects show the importance of
developing recognition technology for all types of characters and languages to make
the content of scanned images of text available to the Internet users. A worldwideused acronym for any type of text recognition is OCR, which means optical character recognition. OCR is used not only for recognizing printed characters, but it is
often also used for cursive handwriting, even when words instead of single characters are recognized. Some alternative acronyms are used for the case of handwritten
words, like HWR (handwritten word recognition) but these are not in common use
today.
Knowing that about 200 million people in the world use Arabic as their first language it is obvious that a growing interest of that huge group of Arabic-speaking
Internet users is to search for documents in their mother tongue. In parallel to this
situation is in the past few years a growing interest in Arabic word and text recognition has been observed. During that time two events have been important landmarks in Arabic text recognition technology development. In 2002 a database on
Arabic handwritten words (IFN/ENIT-database)1 was made available to the community and has served as a reference for competitions since 2005 (ICDAR 2005).2
In September 2006 a summit on Arabic and Chinese Handwriting Recognition was
held at College Park, MD in the USA (SACH2006),3 where experts from both re1 http://www.ifnenit.com
2 http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=10526
3 http://www.umiacs.umd.edu/lamp/meetings/SACH06/

vii

viii

Preface

Table 1 Arabic characters (www.ethnologue.com)

Fig. 1 Example of an Arabic


printed text

search fields presented their actual work. From that time intensive research on Arabic script recognition started and has resulted in a big step forward today.
Arabic script is the second most widespread script in the world; it is used not only
for Arabic but also for the Persian, Urdu, and Pashto languages, for example. Today
14 languages use Arabic script worldwide, which shows its importance. Characteristics of Arabic script are a writing direction from right to left, characters within a
word being mostly connected, 28 characters with different shapes for different positions in a word, and dots and diacritical signs above and below characters. Table 1
shows all the shapes of the 28 Arabic characters.
For different languages some additional characters may be used. Typical for Arabic script is also the variation of a word in length by elongation of the connecting
lines between the characters. Figure 1 shows an example.

Preface

ix

Table 2 Examples of
ligatures

A further important special Arabic script style is the possibility to write characters as vertical or horizontal ligatures. These ligatures modify the shape of the characters significantly. Some examples of ligatures are shown in Table 2. All these typical Arabic script characteristics influence the processing and recognition of Arabic
script in different ways and make it clear that a simple adaptation of Latin characterbased processing is not possible.
This book presents the state of the art of OCR for Arabic scripts presented from
most active and successful groups. The parts of the book show that a lot of work still
has to be done on Arabic script recognition. But the techniques and algorithms used
are of general interest; many problems are typical not only for Arabic but for many
other scripts. We believe that the collection of Arabic OCR related work is also an
inspiration for other scripts and vice versa.
The book is divided into four parts. Part I, Pre-processing, presents different
aspects of pre-processing and feature extraction for Arabic OCR systems. Part II,
Recognition, includes chapters with details about different recognition approaches.
Part III collects chapters describing the important aspects of how to assess the performance of a recognition system. The final Part IV, Applications presents system
solutions for selected application fields.

Part I: Pre-processing
Part I presents different approaches for the pre-processing of OCR systems for
Arabic. It starts with an overview of Arabic handwriting recognition technology.
Srihari and Ball present in their chapter the parts of a recognition system from
pre-processing to classification. Finally they discuss application fields and challenges. Chapters 26 deal with pre-processing tasks of an OCR system. Bukhari,
Shafait, and Breuel discuss layout analysis methods, Setlur and Govindaraju preprocessing issues, Belaid and Ouwayed segmentation of ancient Arabic documents,
and Likforman-Sulem et al. features for word recognition systems.

Part II: Recognition


Chapters 715 present different approaches for the recognition of Arabic script. The
first six chapters all use HMM-based approaches. Borovikov and Zavorin present a
multi-stage approach to document analysis, Ahmed, Mahmoud, and Parvez a recognizer for printed Arabic text, Pechwitz, El Abed, and Mrgner an offline handwritten

Preface

Arabic word recognizer, Dreuw, Rybach, Heigold, and Ney a large vocabulary optical character recognition system, Alkhoury, Gimenez, and Juan a Bernoulli-based
handwriting recognition system, Jifroodian and Suen a handwritten Farsi word
recognition system, Kessentini, Paquet, and Ben Hamadou a multi-stream Markov
model recognizer.
Two chapters discuss further approaches. Graves presents a recognition system
based on multidimensional recurrent neural networks, and Mozaffari discusses the
application of fractal theory for document analysis and recognition. Khemakhem
and Belghith discuss an OCR system based on the combination of complementary
systems in Chap. 15.

Part III: Evaluation


The subject of Part III is the evaluation of recognition systems. In Chap. 16 Zavorin and Borovikov discuss data collection and annotation, and Arabic handwriting recognition competitions are described in Chap. 17 by Mrgner and El Abed. In
Chap. 18, Slimane et al. describe benchmarking strategies for Arabic word recognition.

Part IV: Applications


The final Part IV presents different applications using Arabic script recognition
technology. In Chap. 19 Cheriet and Moghaddam present a robust word spotting system for historical Arabic manuscripts. Natarajan discusses, in Chap. 20,
script-independent methods for Arabic handwriting recognition, and Kundu and
Hines present an Arabic handwriting recognition system using over-segmentation
in Chap. 21. Boubaker et al. discuss online Arabic databases and applications using
these data in Chap. 22, and Abdelazeem et al. present, in Chap. 23, techniques for
using online and offline features for Arabic handwriting recognition.

Target Audience
This book provides an overview of the state-of-the-art research in the field of OCR
for Arabic scripts. Different aspects and solutions have been addressed by the authors, and we hope that this comprehensive collection of ideas, problems, and solutions motivates researchers to continue this work. In that sense this book shall serve
as a reference for researchers and graduate students studying OCR technology and
methodology in general and for Arabic script in particular.
Braunschweig, Germany

Volker Mrgner
Haikal El Abed

Acknowledgements

We thank all the chapter authors for their contributions to this book, which make
it an invaluable resource to researchers in the area of OCR for all kinds of script
using Arabic characters. A total of 67 authors prepared their contributions for the
23 chapters of the book and supported us during all steps from concept to draft and
finalized chapter. We would like to thank Bente Maegaard, Director of the Centre for Language Technology at Copenhagen University, Denmark, for writing the
foreword and for her continuing support in developing Arabic language resources
and a network of researchers in this field. We would also like to thank Springer for
their encouragement and persistence in driving us to complete this work in a timely
manner.
Braunschweig, Germany

Volker Mrgner
Haikal El Abed

xi

Contents

Part I

Pre-processing

An Assessment of Arabic Handwriting Recognition Technology . . .


Sargur N. Srihari and Gregory Ball

Layout Analysis of Arabic Script Documents . . . . . . . . . . . . .


Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel

35

A Multi-stage Approach to Arabic Document Analysis . . . . . . . .


Eugene Borovikov and Ilya Zavorin

55

Pre-processing Issues in Arabic OCR . . . . . . . . . . . . . . . . . .


Zhixin Shi, Srirangaraj Setlur, and Venu Govindaraju

79

Segmentation of Ancient Arabic Documents . . . . . . . . . . . . . . 103


Abdel Belad and Nazih Ouwayed

Features for HMM-Based Arabic Handwritten Word Recognition


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Laurence Likforman-Sulem, Ramy Al Hajj Mohammad, Chafic
Mokbel, Fares Menasri, Anne-Laure Bianne-Bernard, and Christopher
Kermorvant

Part II

Recognition

Printed Arabic Text Recognition . . . . . . . . . . . . . . . . . . . . 147


Irfan Ahmed, Sabri A. Mahmoud, and Mohammed Tanvir Parvez

Handwritten Arabic Word Recognition Using


the IFN/ENIT-database . . . . . . . . . . . . . . . . . . . . . . . . . 169
Mario Pechwitz, Haikal El Abed, and Volker Mrgner

RWTH OCR: A Large Vocabulary Optical Character Recognition


System for Arabic Scripts . . . . . . . . . . . . . . . . . . . . . . . . 215
Philippe Dreuw, David Rybach, Georg Heigold, and Hermann Ney
xiii

xiv

Contents

10 Arabic Handwriting Recognition Using Bernoulli HMMs . . . . . . 255


Ihab Alkhoury, Adri Gimnez, and Alfons Juan
11 Handwritten Farsi Word Recognition Using Hidden Markov
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Puntis Jifroodian Haghighi and Ching Y. Suen
12 Offline Arabic Handwriting Recognition with Multidimensional
Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 297
Alex Graves
13 Application of Fractal Theory in Farsi/Arabic Document
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Saeed Mozaffari
14 Multi-stream Markov Models for Arabic Handwriting
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Yousri Kessentini, Thierry Paquet, and AbdelMajid Ben Hamadou
15 Toward Distributed Cursive Writing OCR Systems Based
on a Combination of Complementary Approaches . . . . . . . . . . 351
Maher Khemakhem and Abdelfettah Belghith
Part III Evaluation
16 Data Collection and Annotation for Arabic Document Analysis . . . 375
Ilya Zavorin and Eugene Borovikov
17 Arabic Handwriting Recognition Competitions . . . . . . . . . . . . 395
Volker Mrgner and Haikal El Abed
18 Benchmarking Strategy for Arabic Screen-Rendered Word
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Fouad Slimane, Slim Kanoun, Jean Hennebert, Rolf Ingold, and
Adel M. Alimi
Part IV Applications
19 A Robust Word Spotting System for Historical Arabic
Manuscripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Mohamed Cheriet and Reza Farrahi Moghaddam
20 Arabic Text Recognition Using a Script-Independent Methodology:
A Unified HMM-Based Approach for Machine-Printed
and Handwritten Text . . . . . . . . . . . . . . . . . . . . . . . . . . 485
Premkumar Natarajan, Rohit Prasad, Huaigu Cao, Krishna
Subramanian, Shirin Saleem, David Belanger, Shiv Vitaladevuni, Matin
Kamali, and Ehry MacRostie

Contents

xv

21 Arabic Handwriting Recognition Using VDHMM


and Over-segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Amlan Kundu and Tom Hines
22 Online Arabic Databases and Applications . . . . . . . . . . . . . . . 541
Houcine Boubaker, Abdelkarim Elbaati, Najiba Tagougui, Haikal
El Abed, Monji Kherallah, and Adel M. Alimi
23 On-line Arabic Handwritten Word Recognition Based on HMM
and Combination of On-line and Off-line Features . . . . . . . . . . 559
Sherif Abdelazeem, Hesham M. Eraqi, and Hany Ahmed
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587

Contributors

Sherif Abdelazeem Electronics Engineering Dept., The American University in


Cairo (AUC), Cairo, Egypt
Hany Ahmed Electronics Engineering Dept., The American University in Cairo
(AUC), Cairo, Egypt
Irfan Ahmed Information and Computer Science, King Fahd University of
Petroleum and Minerals, Dhahran, Saudi Arabia
Adel M. Alimi REGIM Group, National School of Engineers (ENIS), University
of Sfax, Sfax, Tunisia; Research Group on Intelligent Machines (REGIM), National
School of Engineers ENIS, University of Sfax, Sfax, Tunisia
Ihab Alkhoury DSIC, Universitat Politcnica de Valncia, Valncia, Spain
Ramy Al Hajj Mohammad Lebanese International University, Beqaa Valley,
Lebanon
Gregory Ball Center of Excellence for Document Analysis and Recognition
(CEDAR), University at Buffalo, State University of New York, Amherst, NY, USA
Abdel Belad LORIA, Vandoeuvre, France
David Belanger BBN, Cambridge, MA, USA
Abdelfettah Belghith HANA Research Group, ENSI, University of Manouba,
Manouba, Tunisia
AbdelMajid Ben Hamadou Laboratoire MIRACL, Universit de Sfax Tunisie,
Sfax, Tunisia
Anne-Laure Bianne-Bernard Telecom ParisTech, CNRS UMR 5141, Paris,
France; A2iA, Paris, France
Eugene Borovikov Knowledge and Information Management Division, CACI,
Lanham, MD, USA; Research and Development, CACI International Inc, Lanham,
MD, USA
xvii

xviii

Contributors

Houcine Boubaker Research Group on Intelligent Machines (REGIM), National


School of Engineers ENIS, University of Sfax, Sfax, Tunisia
Thomas M. Breuel Technical University of Kaiserslautern, Kaiserslautern, Germany
Syed Saqib Bukhari Technical University of Kaiserslautern, Kaiserslautern, Germany
Huaigu Cao BBN, Cambridge, MA, USA
Mohamed Cheriet Synchromedia Laboratory for Multimedia Communication in
Telepresence, cole de Technologie Suprieure, Montral, QC, Canada
Philippe Dreuw Human Language Technology and Pattern Recognition, RWTH
Aachen University, Aachen, Germany
Haikal El Abed Institute for Communications Technology (IfN), Technische Universitt Braunschweig, Braunschweig, Germany
Abdelkarim Elbaati Research Group on Intelligent Machines (REGIM), National
School of Engineers ENIS, University of Sfax, Sfax, Tunisia
Hesham M. Eraqi Electronics Engineering Dept., The American University in
Cairo (AUC), Cairo, Egypt
Adri Gimnez DSIC, Universitat Politcnica de Valncia, Valncia, Spain
Venu Govindaraju Department of CSE, University at Buffalo, SUNY, Buffalo,
NY, USA
Alex Graves Technical University of Munich, Munich, Germany
Puntis Jifroodian Haghighi CENPARMI (Center for Pattern Recognition and Machine Intelligence), Department of Computer Science and Software Engineering,
Concordia University, Montreal, Quebec, Canada
Georg Heigold Human Language Technology and Pattern Recognition, RWTH
Aachen University, Aachen, Germany
Jean Hennebert DIVA Group, Department of Informatics, Universtity of Fribourg,
Fribourg, Switzerland
Tom Hines MITRE, McLean, VA, USA
Rolf Ingold DIVA Group, Department of Informatics, Universtity of Fribourg, Fribourg, Switzerland
Alfons Juan DSIC, Universitat Politcnica de Valncia, Valncia, Spain
Matin Kamali BBN, Cambridge, MA, USA
Slim Kanoun National School of Engineers (ENIS), University of Sfax, Sfax,
Tunisia

Contributors

xix

Christopher Kermorvant A2iA, Paris, France


Yousri Kessentini Laboratoire LITIS EA 4108, Universit de Rouen France, SaintEtienne du Rouvray, France
Maher Khemakhem MIRACL Lab, The Higher Institute of Management University of Sousse, Sousse, Tunisia
Monji Kherallah Research Group on Intelligent Machines (REGIM), National
School of Engineers ENIS, University of Sfax, Sfax, Tunisia
Amlan Kundu MITRE, McLean, VA, USA
Laurence Likforman-Sulem Telecom ParisTech, CNRS UMR 5141, Paris, France
Ehry MacRostie BBN, Cambridge, MA, USA
Sabri A. Mahmoud Information and Computer Science, King Fahd University of
Petroleum and Minerals, Dhahran, Saudi Arabia
Volker Mrgner Institute for Communications Technology (IfN), Technische Universitt Braunschweig, Braunschweig, Germany
Fares Menasri A2iA, Paris, France
Reza Farrahi Moghaddam Synchromedia Laboratory for Multimedia Communication in Telepresence, cole de Technologie Suprieure, Montral, QC, Canada
Chafic Mokbel UOB, El-Koura, Lebanon
Saeed Mozaffari Electrical and Computer Department, Semnan University, Semnan, Iran
Premkumar Natarajan BBN, Cambridge, MA, USA
Hermann Ney Human Language Technology and Pattern Recognition, RWTH
Aachen University, Aachen, Germany
Nazih Ouwayed LORIA, Vandoeuvre, France
Thierry Paquet Laboratoire LITIS EA 4108, Universit de Rouen France, SaintEtienne du Rouvray, France
Mohammed Tanvir Parvez Information and Computer Science, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
Mario Pechwitz Institute for Communications Technology (IfN), Technische Universitt Braunschweig, Braunschweig, Germany
Rohit Prasad BBN, Cambridge, MA, USA
David Rybach Human Language Technology and Pattern Recognition, RWTH
Aachen University, Aachen, Germany
Shirin Saleem BBN, Cambridge, MA, USA

xx

Contributors

Srirangaraj Setlur Department of CSE, University at Buffalo, SUNY, Buffalo,


NY, USA
Faisal Shafait German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
Zhixin Shi Department of CSE, University at Buffalo, SUNY, Buffalo, NY, USA
Fouad Slimane DIVA Group, Department of Informatics, Universtity of Fribourg,
Fribourg, Switzerland
Sargur N. Srihari Center of Excellence for Document Analysis and Recognition
(CEDAR), University at Buffalo, State University of New York, Amherst, NY, USA
Krishna Subramanian BBN, Cambridge, MA, USA
Ching Y. Suen CENPARMI (Center for Pattern Recognition and Machine Intelligence), Department of Computer Science and Software Engineering, Concordia
University, Montreal, Quebec, Canada
Najiba Tagougui Research Group on Intelligent Machines (REGIM), National
School of Engineers ENIS, University of Sfax, Sfax, Tunisia
Shiv Vitaladevuni BBN, Cambridge, MA, USA
Ilya Zavorin Knowledge and Information Management Division, CACI, Lanham,
MD, USA; Research and Development, CACI International Inc, Lanham, MD, USA

Das könnte Ihnen auch gefallen