
Document Analysis:

Text Recognition

spring semester 2008

Prénom Nom
Outline

 Objectives
 Typology
 Processing chain
 Methodology
 Character recognition
 Word recognition
 An OCR experiment
 Conclusion

© Prof. Rolf Ingold


Objectives

 Text recognition is the most advanced domain of document analysis


 It aims at analyzing images to extract text, i.e. sequences of character codes (ASCII/Unicode)



Character recognition typology

 Machine printed vs. handwritten text


 on-line vs. off-line handwriting
 Isolated characters, connected characters, cursive text
 Various alphabets
 Western languages (Roman, Greek, ...)
 Asian (Chinese, Japanese, Korean, Thai, ...)
 Arabic alphabets
 Limited vocabulary
 only numbers, only uppercase text
 with/without diacritics/punctuation
 restricted vocabulary (city names, street names, ...)
 language
 contextual knowledge



Factors influencing performance

 Variability in style
 single writer, omni-writer
 mono-font, multi-font, omni-font
 Geometrical variability
 in size
 in orientation (rotated)
 in transformations (slanted, perspective view, ...)
 Image resolution
 binary images, starting at 200 dpi
 grey-level images starting at 150 dpi
 Image quality
 degraded support (historical documents)
 acquisition conditions (bad illumination, optical aberration,
noise, ...)



European languages

 European text is characterized by


 limited set of characters (26 to ~100 classes)
 diacritics and punctuation
 isolated characters and cursive scripts
 left-to-right and top-to-bottom writing
 large variety of fonts
 different handwriting styles



Arabic

 Arabic text is characterized by


 right-to-left writing
 limited set of characters
 context dependent glyphs
 connected characters
 diacritics
 justification by word stretching



Asian scripts

 Asian text is characterized by


 numerous scripts (hanzi, kanji, hanja, han tu, hiragana, katakana, ...)
 horizontal and vertical writing
 very large alphabets
 structured characters
 grid based layout



OCR Methodologies

 OCR systems are very complex and combine several steps


 image segmentation into characters or isolated shapes
 preprocessing of hypothetical segmented characters
 size normalization, morphological filtering, thinning, ...
 feature extraction of isolated shapes (see the sketch after this list)
 global measures (width-to-height ratio, density, center of gravity, moments, ...)
 local properties (stroke thickness, ...)
 shape identification using
 a single classifier (neural network, support vector machine, ...)
 multiple classifiers and fusion methods
 word validation using contextual information
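To make the feature-extraction step concrete, here is a minimal sketch in Python (all code examples below use this language); it assumes a binarized character image stored as a NumPy array with foreground pixels set to 1, and computes some of the global measures listed above: width-to-height ratio, density and center of gravity.

```python
import numpy as np

def global_features(char_img):
    """Global measures of a binarized character image (foreground pixels = 1)."""
    ys, xs = np.nonzero(char_img)            # assumes at least one foreground pixel
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    h = bottom - top + 1                     # bounding-box height
    w = right - left + 1                     # bounding-box width
    box = char_img[top:bottom + 1, left:right + 1]
    return {
        "aspect_ratio": w / h,               # width-to-height ratio
        "density": box.mean(),               # fraction of ink inside the bounding box
        "centroid": (xs.mean(), ys.mean()),  # center of gravity (x, y)
    }
```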



Character vs. word recognition

 There is a paradox with cursive text or connected characters:


 character recognition supposes prior character segmentation
 character segmentation requires prior character recognition

 Several approaches to bypass this paradox


 entire word modeling and recognition
 multiple hypothesis generation and testing
 combined character segmentation and recognition (HMM)



Processing chain

[Processing chain diagram]

line segmentation
word segmentation
character segmentation

two parallel branches:
 isolated character recognition: normalization, feature/primitive extraction, identification
 word recognition: normalization, feature/primitive extraction, identification

post-analysis
recognized word


Isolated character recognition

 Isolated character recognition is applicable


 on high quality printed text
 on constrained handwriting (forms)
 The challenge is to take into account the variability within each class

 Performance depends on
 size of alphabet (number of classes)
 image quality



Several classification strategies

 Direct comparison with class model

 Statistical pattern recognition


 using features

 Structural pattern recognition


 using primitives

 Hybrid approaches combining statistical and structural approaches

 Use of multiple classifiers and fusion of their results



Comparison with class model or class samples

 The unknown pattern is compared with one representative (model) of each class


 a similarity measure is returned for each comparison
 the decision is determined by the most similar model or sample
 rejection may occur if even the best similarity is below a threshold
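A minimal sketch of this decision rule; it assumes every class is represented by a single template image of the same size as the unknown pattern, and the pixel-agreement similarity and rejection threshold are illustrative choices, not prescribed here.

```python
import numpy as np

def similarity(a, b):
    """Fraction of pixels on which two equally sized binary images agree."""
    return float(np.mean(a == b))

def classify(pattern, templates, reject_threshold=0.8):
    """Return the label of the most similar template, or None (rejection)."""
    scores = {label: similarity(pattern, tpl) for label, tpl in templates.items()}
    best = max(scores, key=scores.get)
    if scores[best] < reject_threshold:      # even the best match is too dissimilar
        return None
    return best
```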



Similarity measures

 Hamming distance

 Warping distance
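A short sketch of both measures: the Hamming distance counts the pixels on which two equally sized binary images differ, and the warping distance is read here as standard dynamic time warping between two 1-D feature sequences (one possible interpretation of the slide).

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions where two equally sized binary images differ."""
    return int(np.sum(a != b))

def warping_distance(s, t):
    """Dynamic-time-warping distance between two 1-D feature sequences."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)      # D[i, j]: cost of aligning s[:i] with t[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```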



Features for statistical approaches

 Sometimes preprocessing at image level is required


 smoothing
 size normalization
 stroke normalization
 skeletonization
 ...
 Features are extracted
 horizontal and vertical projection profiles (see the sketch after this list)
 central moments
 intersections with lines
 global transforms (Hough, Fourier, ...)
 local features (densities, moments, ...)
 ...
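As an example, the projection-profile features reduce to row and column ink counts of the (size-normalized) binary image; a minimal sketch:

```python
import numpy as np

def projection_profiles(char_img):
    """Horizontal and vertical projection profiles of a binary image (foreground = 1)."""
    horizontal = char_img.sum(axis=1)   # one ink count per row
    vertical = char_img.sum(axis=0)     # one ink count per column
    return horizontal, vertical
```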



Central moments

 Central moments are shift invariant properties defined as

$$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q\, f(x, y)$$

with $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$ and $m_{pq} = \sum_x \sum_y x^p y^q\, f(x, y)$

 They can be computed from the raw moments using the following formulas

$$\mu_{00} = m_{00}, \qquad \mu_{10} = \mu_{01} = 0, \qquad \mu_{11} = m_{11} - m_{10}\,\bar{y}$$
$$\mu_{20} = m_{20} - m_{10}\,\bar{x}, \qquad \mu_{02} = m_{02} - m_{01}\,\bar{y}$$
$$\mu_{21} = m_{21} - m_{20}\,\bar{y} - 2 m_{11}\,\bar{x} + 2 m_{01}\,\bar{x}^2, \qquad \mu_{12} = m_{12} - m_{02}\,\bar{x} - 2 m_{11}\,\bar{y} + 2 m_{10}\,\bar{y}^2$$
$$\mu_{30} = m_{30} - 3 m_{20}\,\bar{x} + 2 m_{10}\,\bar{x}^2, \qquad \mu_{03} = m_{03} - 3 m_{02}\,\bar{y} + 2 m_{01}\,\bar{y}^2$$
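The definition above translates directly into code; a small sketch treating the (binary or grey-level) image itself as f(x, y), with x indexing columns and y indexing rows:

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of an image f(x, y), computed from its definition."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()                          # assumes a non-empty image
    x_bar = (xs * img).sum() / m00           # m10 / m00
    y_bar = (ys * img).sum() / m00           # m01 / m00
    return (((xs - x_bar) ** p) * ((ys - y_bar) ** q) * img).sum()
```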



Primitives for structural approaches

 Shapes are decomposed into strokes and several properties are extracted
 number of connected components
 number of holes
 number and relative position of singular points
 extremities
 connections
 crossings
 concavities, convexities
 ...
 These primitives are represented as
 strings
 trees
 graphs
and used for comparison
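A minimal sketch of two of these primitives, counting connected components and holes; it assumes SciPy is available and a binary image with foreground pixels set to 1 (8-connectivity for the ink, 4-connectivity for the background).

```python
import numpy as np
from scipy import ndimage

def component_and_hole_counts(char_img):
    """Number of connected components and holes of a binary character image."""
    fg = char_img.astype(bool)
    _, n_components = ndimage.label(fg, structure=np.ones((3, 3)))  # 8-connected ink
    padded_bg = np.pad(~fg, 1, constant_values=True)                # background, padded
    _, n_bg = ndimage.label(padded_bg)                              # 4-connected background
    return {"components": n_components, "holes": n_bg - 1}          # exclude outer background
```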


Identification

 For statistical approaches, different classifiers are used


 discriminant functions
 kNN classifier
 multi-layer perceptron
 support vector machines
 ...

 For structural approaches, typical techniques are


 hierarchical classification
 string distances
 graph matching
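As an example of the statistical classifiers listed above, a minimal kNN sketch over feature vectors, using Euclidean distance and majority voting (the feature representation and the value of k are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_features, train_labels, k=3):
    """Assign x the majority label among its k nearest training samples."""
    dists = np.linalg.norm(train_features - x, axis=1)   # Euclidean distances to all samples
    nearest = np.argsort(dists)[:k]                      # indices of the k closest samples
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```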



Multilayer perceptron

 Information is propagated through a layered network of "neurons"


 decision is given by the highest activation on the output layer
 weights of connections are computed in a training phase
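A sketch of the forward propagation and decision rule just described; the weight matrices and bias vectors are assumed to have been learned beforehand (e.g. by back-propagation), and tanh is only one possible activation function.

```python
import numpy as np

def mlp_decide(x, weights, biases):
    """Propagate a feature vector through the layers and return the winning class index."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)               # hidden layers
    output = weights[-1] @ a + biases[-1]    # output-layer activations
    return int(np.argmax(output))            # decision: highest activation on the output layer
```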



Hierarchical classification

 Structural pattern recognition can be performed hierarchically



OCR Difficulties

 The main sources of errors are


 variability of character shapes (special fonts, handwriting)
 image defects: noise and distortions
 broken or touching characters

 shape similarity ("0" and "O", "1", "I" and "l", "5" and "S", ...)
 small shapes : punctuation, accents, superscripts ("er", "ème")
 special characters ("©", "½", "±", ...) or bullets



Word recognition

 Text recognition at word level makes sense


 in case of restricted vocabulary
 for language driven approaches
 for knowledge based approaches
 for keyword spotting

 Word recognition is typically used for


 cursive scripts
 handwriting
 noisy text, difficult to segment
 low resolution text



Word Recognition specificities

 Word recognition is more complex than character recognition


 usually the number of classes is much higher
 more features are needed
 Word recognition can take into account external information
 language based knowledge
 dictionary
 character frequencies, bigrams, trigrams, etc.
 structural constraints (security number, dates, ...)
 restricted vocabulary (e.g. city names, street names, ...)
 redundancy (e.g. zip codes and city names)

 Hidden Markov Models (HMM) are well suited for word recognition



Hidden Markov models (1)

 Each class is modeled by a two-stage stochastic process using hidden and visible states

 A model λ = (A, B, π) is composed of


 A, the matrix of transition probabilities: $a_{ij} = P(q_t = x_j \mid q_{t-1} = x_i)$
 B, the matrix of observation probabilities: $b_j(k) = P(y_k \mid q_t = x_j)$
 π, the vector of initial state probabilities: $\pi_i = P(q_0 = x_i)$



Hidden Markov models (2)

 The probability of an observation can be computed using


$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) = \sum_{Q} P(O \mid \lambda, Q)\, P(Q \mid \lambda) = \sum_{q_1, \dots, q_T} \pi_{q_1}\, b_{q_1}(o_1)\, a_{q_1 q_2}\, b_{q_2}(o_2) \cdots a_{q_{T-1} q_T}\, b_{q_T}(o_T)$$

 A pattern is assigned to the model with the highest posterior probability (i.e., the model that best explains the pattern)

 The parameters of the model (probabilities) are determined in a training phase using training samples
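In practice the sum over all state sequences is not enumerated; the standard forward algorithm evaluates P(O | λ) in O(T·N²) time. A sketch with A, B and π as NumPy arrays and a discrete observation sequence:

```python
import numpy as np

def forward_probability(obs, A, B, pi):
    """P(O | lambda) of a discrete HMM, computed with the forward algorithm."""
    alpha = pi * B[:, obs[0]]                # initialization: alpha_i(1) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # induction: sum over previous states
        # (in practice alpha is rescaled here to avoid numerical underflow)
    return float(alpha.sum())                # termination: sum over final states
```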



Post-analysis

 Post-analysis is performed with the aim of validating / correcting character recognition
 based on dictionaries
 bigrams, trigrams
 confidence of character recognition (if available)
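A simple sketch of dictionary-based correction: a recognized word is replaced by the closest dictionary entry when its edit distance is small enough; the dictionary and the distance threshold are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_word(word, dictionary, max_dist=1):
    """Replace an OCR output word by the closest dictionary entry, if close enough."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word
```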



Character Recognition Performance

 OCR (Optical Character Recognition) is the most mature technique of document analysis

 For most applications, very high accuracy is required


 a 99% recognition rate would still generate 30-50 errors per page (a full page typically contains a few thousand characters)
 99.9% to 99.99% is often requested

 OCR systems may be designed for


 standard OCR-A, OCR-B fonts, specially designed for OCR
 mono-font recognition, specialized for typewriting
 omni-font or multi-font text recognition (the most popular)
 trainable systems, being tuned for specific fonts and styles



OCR experiment

 One magazine page [Hebdo 18, 2007, Editorial]
 good quality printed text
 medium layout complexity



First Experiment (Standard MS-Office OCR)

 Layout not understood


 Many OCR errors
 => Unusable !



Second Experiment (ReadIris)

 Page layout correctly recognized
 Italic correctly detected
 including one false detection
 A few word segmentation errors
 Correct de-hyphenation
 A few OCR errors
 often as a consequence of segmentation errors!



Conclusion on OCR technologies

 Imperfect results in printed character recognition


 Recognition of uncontrolled handwriting not mature
 Practical problems with
 mathematics (symbols and formulas)
 special fonts or scripts
 logos
 => perfect document recognition is not achievable !

 Some applications can deal with approximate results


 Recognition algorithms should be tuned to prefer rejections to errors
 Include manual correction in the processing chain

