E-mail: rumikristeva@hotmail.com
Abstract. A document image is a visual representation of a paper document, such as a journal article
page, a cover page of facsimile transmission, office correspondence, an application form, etc. Document
image understanding as a research endeavor consists of developing processes for taking a document
through various representations: from scanned image to semantic representation. This paper describes
the processes and subprocesses involved in document image understanding. It also presents an
approach to Old Bulgarian character recognition and its program implementation. The input
transformation, the recognition algorithm, and the criteria for the recognition decision are described.
1. INTRODUCTION
The need to process documents on paper by computer has led to an area of research
that may be referred to as document image understanding (DIU). The goal of a DIU
system is to convert a raster image representation of a document, e.g., a paper
document scanned by a flatbed document scanner, into an appropriate symbolic form
[1]. DIU as a research endeavor consists of studying all processes involved in taking a
document through various representations: from a scanned or facsimile multi-page
document to high-level semantic descriptions of the document. Thus it involves many
sub-disciplines of computer science including image processing, pattern recognition,
natural language processing, artificial intelligence and database systems.
The symbolic representation desired as output of a DIU system can take one of several
forms: an editable description, a representation from which the document can be
(exactly) reconstructed, a semantic description useful for document sorting/filing, etc.
Representation schemes that are useful for editing and exact reproduction are standards
for electronic document description.
Developing a DIU system with performance comparable to that achieved by a human
expert is still decades from realization [4]. The state of the art in DIU can be subdivided
into five areas as follows:
1. System architecture - The complexity of the DIU task leads to modularization into
manageable processes. Because the processes are interdependent, issues arise of how to
maintain communication and integrate the results from each process.
2. SYSTEM ARCHITECTURE
Figure 1 shows the organization of the DIU system developed at CEDAR [5]. The
architecture allows for parallel development of different subsystems. The DIU
architecture consists of three major components:
Fig. 1. Organization of DIU system
1. The tool box contains all the modules needed for document processing. Tools
developed for different conceptual levels are coordinated by the control.
2. The knowledge base consists of two sub-components: document models and
general knowledge. A document model describes the aspects of a document domain or
a group of documents that share similar layout structure. The expressive power of the
model representation dictates the capability of a DIU system to handle different types of
documents. General knowledge is shared by different document domains. It describes
the tasks that are needed to locate and identify document components, such as text
blocks and line segments. A task is carried out by one of the modules in the tool box.
The general knowledge can apply to objects of different domains since they share
similar structural information. Lexicons used by different tools such as for OCR and NLP
are stored in document models.
3. Control is the most critical issue in DIU system design. Its functions include: (1)
selective use of tools, and (2) intelligent combination of data extracted from document
sub-areas to generate a representation of the scanned document. It examines the
problem state in the working memory and uses the facts in the knowledge base to
determine which modules in the tool box should be used. Working memory is a
temporary storage where different levels of data will be stored during document
processing and will be updated after each module activation. The search process stops
when all the objects specified in the document model have been located.
Tool interaction is determined by the knowledge base. The general knowledge defines the
dependency or the activation order of tools, e.g., area-labeling can only be activated
after area-segmentation. A document model defines the tool interactions needed in
different document sub-areas since each sub-area may require a different level of
interpretation, e.g., recognizing the recipient (name and address) on a business letter
requires both OCR and NLP while reading the title of a technical document only needs
OCR.
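The way general knowledge fixes the activation order of tools can be illustrated with a minimal sketch. It is not the CEDAR implementation. Only the rule that area-labeling follows area-segmentation, and the tools needed for the recipient and the title, come from the text; the remaining dependency entries and all names (DEPENDENCIES, DOCUMENT_MODEL, activation_order) are assumptions made for illustration, and the dependency graph is assumed to be acyclic.

```python
from typing import Dict, List, Set

# General knowledge (hypothetical): tool -> tools that must run before it.
# Only "area_labeling after area_segmentation" is stated in the text.
DEPENDENCIES: Dict[str, List[str]] = {
    "area_segmentation": [],
    "area_labeling": ["area_segmentation"],
    "ocr": ["area_labeling"],  # assumed ordering
    "nlp": ["ocr"],            # assumed ordering
}

# Document model (hypothetical): sub-area -> tools needed to interpret it.
DOCUMENT_MODEL: Dict[str, List[str]] = {
    "recipient": ["ocr", "nlp"],  # business letter: name and address
    "title": ["ocr"],             # technical document title
}


def activation_order(required: List[str],
                     dependencies: Dict[str, List[str]]) -> List[str]:
    """Return the required tools and their prerequisites in a valid order."""
    order: List[str] = []
    done: Set[str] = set()

    def visit(tool: str) -> None:
        if tool in done:
            return
        for prerequisite in dependencies[tool]:
            visit(prerequisite)  # prerequisites are scheduled first
        done.add(tool)
        order.append(tool)

    for tool in required:
        visit(tool)
    return order


if __name__ == "__main__":
    for sub_area, tools in DOCUMENT_MODEL.items():
        print(sub_area, "->", activation_order(tools, DEPENDENCIES))
```

In this sketch the control would call activation_order once per sub-area of the document model and then run the listed modules from the tool box, updating working memory after each activation.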
3. DECOMPOSITION AND STRUCTURAL ANALYSIS
4. TEXT RECOGNITION
Character Recognition, also known as Optical Character Recognition or OCR, is
concerned with the automatic conversion of scanned and digitized images of characters
in running text into their corresponding symbolic forms. The ability of humans to read
poor quality machine print as well as text with unusual fonts and handwriting is far from
matched by today's machines.
We have experimented with an approach [11] to character recognition in Old Bulgarian
text documents. Most OCR systems have binarization as a preprocessing step. This
approach uses the vertical projection, onto the horizontal axis, of text characters that
have been inclined in advance. Under this transformation the projection contour takes a
different shape from that of standing (upright) characters.
This makes it considerably simpler to match an image projection against a model
projection, and only a minimum number of parameters needs to be observed.
Figure 2 shows a scanned Old Bulgarian text document.
This section presents an approach to character recognition that is well suited to Old
Bulgarian texts. Old Bulgarian texts deserve separate treatment because the characters
were hand-drawn and the scribe aimed to make all instances of the same character as
identical as possible. Character spacing was observed accurately, which reduces
character segmentation problems.
Information is presented on the developed program CYR1.0. The program is used for
recognition and analysis of Old Bulgarian characters. Existing programs offer no
possibility of working with Old Bulgarian texts. Experiments were made only with the font
OldCyr, for recognition without and after information loss.
The approach offered in this paper [11,12] observes only a minimum number of
parameters of the projection: the minimum value, the maximum value, and the width.
Figure 4 shows the differences between the vertical projections of standing and
inclined characters.
Fig. 4. Vertical character projection (Old Cyr)
The projection of a character inclined in advance gives more information. This saves
time in the recognition of a single character.
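The transformation can be sketched in a few lines. This is a minimal illustration and not the CYR1.0 code: the bitmap representation, the shear used to incline the character, the slope value, and the function names are all assumptions. The minimum and maximum are taken here over the columns between the first and last inked column, which is one possible reading of the three parameters.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def shear(bitmap: Bitmap, slope: float) -> Bitmap:
    """Incline a character: shift each row right in proportion to its
    height above the bottom row (slope >= 0 is assumed)."""
    height, width = len(bitmap), len(bitmap[0])
    max_shift = int(round(slope * (height - 1)))
    out = [[0] * (width + max_shift) for _ in range(height)]
    for y, row in enumerate(bitmap):
        shift = int(round(slope * (height - 1 - y)))
        for x, pixel in enumerate(row):
            if pixel:
                out[y][x + shift] = 1
    return out


def vertical_projection(bitmap: Bitmap) -> List[int]:
    """Count ink pixels in every column (projection onto the horizontal axis)."""
    return [sum(column) for column in zip(*bitmap)]


def projection_parameters(profile: List[int]) -> Tuple[int, int, int]:
    """Return (minimum value, maximum value, width) of the inked part."""
    inked = [i for i, value in enumerate(profile) if value > 0]
    if not inked:
        return (0, 0, 0)
    span = profile[inked[0]:inked[-1] + 1]
    return (min(span), max(span), len(span))


if __name__ == "__main__":
    bar = [[1], [1], [1]]  # a vertical stroke, three pixels high
    print(projection_parameters(vertical_projection(bar)))
    print(projection_parameters(vertical_projection(shear(bar, 1.0))))
```

For the vertical stroke, the upright projection is a single column of height 3, while the inclined one spreads over three columns of height 1, so the two contours differ in all three parameters.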
Figure 5 shows the main menu.
All criteria are structured in tables. The recognition algorithm compares the values for
each input character (after the described transformations) with the values in the tables
and makes the recognition decision. Additionally, an OCR system may use spell checkers
or other lexical analyzers that make use of context information to correct recognition
errors and resolve ambiguities in the generated text.
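A table-driven decision of this kind might look as follows. This is a sketch under stated assumptions, not the criteria used in CYR1.0: the table values are invented, the labels are only example letter names, and the tolerance test with a summed absolute difference is an assumed matching rule.

```python
from typing import Dict, Optional, Tuple

Params = Tuple[int, int, int]  # (minimum value, maximum value, width)

# Hypothetical model table; the numbers are illustrative, not measured.
MODEL_TABLE: Dict[str, Params] = {
    "az": (1, 5, 7),
    "buki": (2, 6, 8),
    "vedi": (2, 7, 6),
}


def recognize(params: Params, table: Dict[str, Params],
              tolerance: int = 1) -> Optional[str]:
    """Return the label of the closest table entry, or None when no entry
    has every parameter within the tolerance."""
    best_label: Optional[str] = None
    best_distance: Optional[int] = None
    for label, model in table.items():
        diffs = [abs(p - m) for p, m in zip(params, model)]
        if max(diffs) > tolerance:
            continue  # at least one criterion is not satisfied
        distance = sum(diffs)
        if best_distance is None or distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


if __name__ == "__main__":
    print(recognize((2, 6, 7), MODEL_TABLE))
    print(recognize((9, 9, 9), MODEL_TABLE))
```

Returning None for a character that satisfies no table entry leaves room for a later lexical step to resolve it from context.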
Program CYR1.0 is structured as five separate modules. Each of them is a specific
routine with specific functions:
MEN1 - routine implementing the main menu and searching for the input file to be
processed. It operates only on files in .BMP format.
MIT - routine for reading and processing a single character. After loading the character
from the input file, it normalizes the coordinates. There are separate procedures for
the computing operations and for computing all parameters.
HIST1 - routine that visualizes the histogram of each character and saves it in .BMP
format.
TT1 - routine containing all required tables of parameters.
TT2 - routine forming the output. It makes the decision based on the values from TT1.
6. CONCLUSION
The major modules in a DIU system are: system architecture, decomposition and
structural analysis, text recognition and interpretation, table, diagram and image
understanding, and database and system performance evaluation.
The system architecture provides a computational framework to integrate and
regulate activities needed in document layout analysis and content interpretation.
Decomposition and structural analysis is responsible for decomposing a document into
several regions, each of which contains homogeneous entities. These regions are then
grouped into logical units to form a high-level interpretation of the document structure.
Current OCR technology has limited success in recognizing poor quality text.
The use of contextual information, such as lexicon and syntax, has shown promising
results in degraded text recognition. Evaluation of the performance of document
analysis systems was discussed. Meaningful performance evaluation should be related
directly to the goals of the system.
The presented approach uses the vertical projection, onto the horizontal axis, of text
characters inclined in advance. This transformation makes it possible to apply additional
recognition methods, such as fuzzy logic, neural networks, and others. A large volume of
input information is reduced to a few basic criteria, which considerably reduces and
simplifies the comparison operation.
The program CYR1.0 for Old Bulgarian character recognition can be used for the analysis
of Old Bulgarian texts and as an additional tool in the humanities.
REFERENCES
1. Michael Garris, Darrin Dimmick, Form Design for High Accuracy Optical
Character Recognition, IEEE Transactions PAMI, June 1996
2. P.J. Grother, Handprinted Forms and Character Database, NIST Special
Database 19, Technical Report, National Institute of Standards and Technology, March
1995
3. S.N. Srihari and S.W. Hull. Character Recognition. Center of Excellence for
Document Analysis and Recognition (CEDAR), Technical Report, January 1995
4. M. Garris, J. Blue, G. Candela, D. Dimmick, J. Geist, P. Grother, S. Janet and C.
Wilson, NIST Form-Based Handprint Recognition System, Technical Report NISTIR
5469, National Institute of Standards and Technology, July 1994
5. R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Creecy, B. Hammond, J.
Hull, N. Larsen, T. Vogl and C. Wilson, The First Census Optical Character
Recognition System Conference, Technical Report NISTIR 4912, National Institute of
Standards and Technology, July 1992
6. P. Grother, Karhunen Loeve feature extraction for neural handwritten
character recognition, Proc. Application of Artificial Neural Network III, vol. 1709, pp.
155-166, SPIE, Orlando, April 1992
7. S.N. Srihari. Document Image Understanding. Center of Excellence for
Document Analysis and Recognition (CEDAR), May, 1992
8. S.W. Lam, A.C. Girardin and S.N. Srihari. Gray-Scale Character Recognition
Using Boundary Features. SPIE/IS&T Symposium on Electronic Imaging Science
& Technology, San Jose, California, 1992.
9. J.J. Hull, S. Khoubyari, T.K. Ho, Visual Global Context: Word Image Matching in a
Methodology for Degraded Text Recognition, Symposium on Document Analysis and
Information Retrieval, Las Vegas, Nevada, March 1992.
10. C.L. Wilson, Evaluation of Character Recognition Systems, Neural Networks for
Signal Processing III, IEEE, pp.485-495, New York, 1992
11. Geortchev V., Krusteva R., Boneva A., Stanischev K., Experimentally analysis on
old Bulgarian text character recognition, MIM2000 IFAC Symposium on
Manufacturing, Modeling, Management and Control, University of Patras, Rio, Greece
(July 12-14, 2000), Proceedings (Editors: P. Groumpos & A. Tzes), ISBN 0 08043554 8,
Session WP1: Applications, WP1, pp. 124-127, 2000
12. Geortchev V., D. Butchvarov, A. Boneva, R. Krusteva and K. Stanischev (1999).
Letter characters recognition after information loss. In: Proceedings "Scientific
reports" (in Bulgarian): Section 3: Mechatronics, ISSN 1310-3946, Sofia, Bulgaria,
pp. 3.39-3.44, 1999