
Document Analysis:

Text Recognition

spring semester 2008

Prénom Nom
Outline

 Objectives
 Typology
 Processing chain
 Methodology
 Character recognition
 Word recognition
 An OCR experiment
 Conclusion

© Prof. Rolf Ingold


Objectives

 Text recognition is the most advanced domain of document analysis


 It aims at analyzing images to extract text, i.e. sequences of character codes (ASCII/Unicode)



Character recognition typology

 Machine printed vs. handwritten text


 on-line vs. off-line handwriting
 Isolated characters, connected characters, cursive text
 Various alphabets
 Western languages (Roman, Greek, ...)
 Asian (Chinese, Japanese, Korean, Thai, ...)
 Arabic alphabets
 Limited vocabulary
 only numbers, only uppercase text
 with/without diacritics/punctuation
 restricted vocabulary (city names, street names, ...)
 language
 contextual knowledge



Factors influencing performance

 Variability in style
 single writer, omni-writer
 mono-font, multi-font, omni-font
 Geometrical variability
 in size
 in orientation (rotated)
 in transformations (slanted, perspective view, ...)
 Image resolution
 binary images, starting at 200 dpi
 grey-level images starting at 150 dpi
 Image quality
 degraded support (historical documents)
 acquisition conditions (bad illumination, optical aberration,
noise, ...)



European languages

 European text is characterized by


 limited set of characters (26 to ~100 classes)
 diacritics and punctuation
 isolated characters and cursive scripts
 left-to-right and top-to-bottom writing
 large variety of fonts
 different handwriting styles



Arabic

 Arabic text is characterized by


 right-to-left writing
 limited set of characters
 context dependent glyphs
 connected characters
 diacritics
 justification by word stretching



Asian scripts

 Asian text is characterized by


 numerous scripts (hanzi, kanji, hanja, han tu, hiragana, katakana, ...)
 horizontal and vertical writing
 very large alphabets
 structured characters
 grid based layout



OCR Methodologies

 OCR systems are very complex and combine several steps


 image segmentation into characters or isolated shapes
 preprocessing of hypothetical segmented characters
 size normalization, morphological filtering, thinning, ...
 feature extraction of isolated shapes (see the sketch after this list)
 global measures (width-to-height ratio, density, center of gravity, moments, ...)
 local properties (stroke thickness, ...)
 shape identification using
 a single classifier (neural network, support vector machine, ...)
 multiple classifiers and fusion methods
 word validation using contextual information
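To make the feature-extraction step concrete, here is a minimal sketch in Python (all code examples below use this language); it assumes a binarized character image stored as a NumPy array with foreground pixels set to 1, and computes some of the global measures listed above: width-to-height ratio, density and center of gravity.

```python
import numpy as np

def global_features(char_img):
    """Global measures of a binarized character image (foreground pixels = 1)."""
    ys, xs = np.nonzero(char_img)            # assumes at least one foreground pixel
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    h = bottom - top + 1                     # bounding-box height
    w = right - left + 1                     # bounding-box width
    box = char_img[top:bottom + 1, left:right + 1]
    return {
        "aspect_ratio": w / h,               # width-to-height ratio
        "density": box.mean(),               # fraction of ink inside the bounding box
        "centroid": (xs.mean(), ys.mean()),  # center of gravity (x, y)
    }
```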



Character vs. word recognition

 There is a paradox with cursive text or connected characters:


 character recognition supposes prior character segmentation
 character segmentation requires prior character recognition

 Several approaches to bypass this paradox


 entire word modeling and recognition
 multiple hypothesis generation and testing
 combined character segmentation and recognition (HMM)



Processing chain

[Processing chain diagram]

line segmentation
word segmentation
character segmentation

two parallel branches:
 isolated character recognition: normalization, feature/primitive extraction, identification
 word recognition: normalization, feature/primitive extraction, identification

post-analysis
recognized word


Isolated character recognition

 Isolated character recognition is applicable


 on high quality printed text
 on constrained handwriting (forms)
 The challenge is to take into account the variability within each class

 Performance depends on
 size of alphabet (number of classes)
 image quality



Several classification strategies

 Direct comparison with class model

 Statistical pattern recognition


 using features

 Structural pattern recognition


 using primitives

 Hybrid approaches combining statistical and structural approaches

 Use of multiple classifiers and fusion of their results



Comparison with class model or class samples

 The unknown pattern is compared with one representative (model) of each class


 a similarity measure is returned for each comparison
 the decision is determined by the most similar model or sample
 rejection may occur if even the best similarity is below a threshold
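A minimal sketch of this decision rule; it assumes every class is represented by a single template image of the same size as the unknown pattern, and the pixel-agreement similarity and rejection threshold are illustrative choices, not prescribed here.

```python
import numpy as np

def similarity(a, b):
    """Fraction of pixels on which two equally sized binary images agree."""
    return float(np.mean(a == b))

def classify(pattern, templates, reject_threshold=0.8):
    """Return the label of the most similar template, or None (rejection)."""
    scores = {label: similarity(pattern, tpl) for label, tpl in templates.items()}
    best = max(scores, key=scores.get)
    if scores[best] < reject_threshold:      # even the best match is too dissimilar
        return None
    return best
```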



Similarity measures

 Hamming distance

 Warping distance
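A short sketch of both measures: the Hamming distance counts the pixels on which two equally sized binary images differ, and the warping distance is read here as standard dynamic time warping between two 1-D feature sequences (one possible interpretation of the slide).

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions where two equally sized binary images differ."""
    return int(np.sum(a != b))

def warping_distance(s, t):
    """Dynamic-time-warping distance between two 1-D feature sequences."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)      # D[i, j]: cost of aligning s[:i] with t[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```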



Features for statistical approaches

 Sometimes preprocessing at image level is required


 smoothing
 size normalization
 stroke normalization
 skeletonization
 ...
 Features are extracted
 horizontal and vertical projection profiles (see the sketch after this list)
 central moments
 intersections with lines
 global transforms (Hough, Fourier, ...)
 local features (densities, moments, ...)
 ...
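As an example, the projection-profile features reduce to row and column ink counts of the (size-normalized) binary image; a minimal sketch:

```python
import numpy as np

def projection_profiles(char_img):
    """Horizontal and vertical projection profiles of a binary image (foreground = 1)."""
    horizontal = char_img.sum(axis=1)   # one ink count per row
    vertical = char_img.sum(axis=0)     # one ink count per column
    return horizontal, vertical
```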



Central moments

 Central moments are shift invariant properties defined as

$$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q\, f(x, y)$$

with $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$ and $m_{pq} = \sum_x \sum_y x^p y^q\, f(x, y)$

 They can be computed from the raw moments using the following formulas

$$\mu_{00} = m_{00}, \qquad \mu_{10} = \mu_{01} = 0, \qquad \mu_{11} = m_{11} - m_{10}\,\bar{y}$$
$$\mu_{20} = m_{20} - m_{10}\,\bar{x}, \qquad \mu_{02} = m_{02} - m_{01}\,\bar{y}$$
$$\mu_{21} = m_{21} - m_{20}\,\bar{y} - 2 m_{11}\,\bar{x} + 2 m_{01}\,\bar{x}^2, \qquad \mu_{12} = m_{12} - m_{02}\,\bar{x} - 2 m_{11}\,\bar{y} + 2 m_{10}\,\bar{y}^2$$
$$\mu_{30} = m_{30} - 3 m_{20}\,\bar{x} + 2 m_{10}\,\bar{x}^2, \qquad \mu_{03} = m_{03} - 3 m_{02}\,\bar{y} + 2 m_{01}\,\bar{y}^2$$
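The definition above translates directly into code; a small sketch treating the (binary or grey-level) image itself as f(x, y), with x indexing columns and y indexing rows:

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of an image f(x, y), computed from its definition."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()                          # assumes a non-empty image
    x_bar = (xs * img).sum() / m00           # m10 / m00
    y_bar = (ys * img).sum() / m00           # m01 / m00
    return (((xs - x_bar) ** p) * ((ys - y_bar) ** q) * img).sum()
```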



Primitives for structural approaches

 Shapes are decomposed into strokes and several properties are extracted
 number of connected components
 number of holes
 number and relative position of singular points
 extremities
 connections
 crossings
 concavities, convexities
 ...
 These primitives are represented as
 strings
 trees
 graphs
and used for comparison
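A minimal sketch of two of these primitives, counting connected components and holes; it assumes SciPy is available and a binary image with foreground pixels set to 1 (8-connectivity for the ink, 4-connectivity for the background).

```python
import numpy as np
from scipy import ndimage

def component_and_hole_counts(char_img):
    """Number of connected components and holes of a binary character image."""
    fg = char_img.astype(bool)
    _, n_components = ndimage.label(fg, structure=np.ones((3, 3)))  # 8-connected ink
    padded_bg = np.pad(~fg, 1, constant_values=True)                # background, padded
    _, n_bg = ndimage.label(padded_bg)                              # 4-connected background
    return {"components": n_components, "holes": n_bg - 1}          # exclude outer background
```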


Identification

 For statistical approaches, different classifiers are used


 discriminant functions
 kNN classifier
 multi-layer perceptron
 support vector machines
 ...

 For structural approaches, typical techniques are


 hierarchical classification
 string distances
 graph matching
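As an example of the statistical classifiers listed above, a minimal kNN sketch over feature vectors, using Euclidean distance and majority voting (the feature representation and the value of k are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_features, train_labels, k=3):
    """Assign x the majority label among its k nearest training samples."""
    dists = np.linalg.norm(train_features - x, axis=1)   # Euclidean distances to all samples
    nearest = np.argsort(dists)[:k]                      # indices of the k closest samples
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```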



Multilayer perceptron

 Information is propagated through a layered network of "neurons"


 decision is given by the highest activation on the output layer
 weights of connections are computed in a training phase
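A sketch of the forward propagation and decision rule just described; the weight matrices and bias vectors are assumed to have been learned beforehand (e.g. by back-propagation), and tanh is only one possible activation function.

```python
import numpy as np

def mlp_decide(x, weights, biases):
    """Propagate a feature vector through the layers and return the winning class index."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)               # hidden layers
    output = weights[-1] @ a + biases[-1]    # output-layer activations
    return int(np.argmax(output))            # decision: highest activation on the output layer
```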



Hierarchical classification

 Structural pattern recognition can be performed hierarchically



OCR Difficulties

 The main sources of errors are


 variability of character shapes (special fonts, handwriting)
 image defects: noise and distortions
 broken or touching characters

 shape similarity ("0" and "O", "1", "I" and "l", "5" and "S", ...)
 small shapes : punctuation, accents, superscripts ("er", "ème")
 special characters ("©", "½", "±", ...) or bullets



Word recognition

 Text recognition at word level makes sense


 in case of restricted vocabulary
 for language driven approaches
 for knowledge based approaches
 for keyword spotting

 Word recognition is typically used for


 cursive scripts
 handwriting
 noisy text, difficult to segment
 low resolution text



Word Recognition specificities

 Word recognition is more complex than character recognition


 usually the number of classes is much higher
 more features are needed
 Word recognition can take into account external information
 language based knowledge
 dictionary
 character frequencies, bigrams, trigrams, etc.
 structural constraints (security number, dates, ...)
 restricted vocabulary (e.g. city names, street names, ...)
 redundancy (e.g. zip codes and city names)

 Hidden Markov Models (HMM) are well suited for word recognition



Hidden Markov models (1)

 Each class is modeled by a two-stage stochastic process using hidden and visible states

 A model λ = (A, B, π) is composed of


 A, the matrix of transition probabilities: $a_{ij} = P(q_t = x_j \mid q_{t-1} = x_i)$
 B, the matrix of observation probabilities: $b_j(k) = P(y_k \mid q_t = x_j)$
 π, the vector of initial state probabilities: $\pi_i = P(q_0 = x_i)$



Hidden Markov models (2)

 The probability of an observation can be computed using


$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) = \sum_{Q} P(O \mid \lambda, Q)\, P(Q \mid \lambda) = \sum_{q_1, \dots, q_T} \pi_{q_1}\, b_{q_1}(o_1)\, a_{q_1 q_2}\, b_{q_2}(o_2) \cdots a_{q_{T-1} q_T}\, b_{q_T}(o_T)$$

 A pattern is assigned to the model with the highest posterior probability (i.e., the model that best explains the pattern)

 The parameters of the model (probabilities) are determined in a training phase using training samples
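In practice the sum over all state sequences is not enumerated; the standard forward algorithm evaluates P(O | λ) in O(T·N²) time. A sketch with A, B and π as NumPy arrays and a discrete observation sequence:

```python
import numpy as np

def forward_probability(obs, A, B, pi):
    """P(O | lambda) of a discrete HMM, computed with the forward algorithm."""
    alpha = pi * B[:, obs[0]]                # initialization: alpha_i(1) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # induction: sum over previous states
        # (in practice alpha is rescaled here to avoid numerical underflow)
    return float(alpha.sum())                # termination: sum over final states
```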



Post-analysis

 Post-analysis is performed with the aim of validating / correcting character recognition
 based on dictionaries
 bigrams, trigrams
 confidence of character recognition (if available)
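A simple sketch of dictionary-based correction: a recognized word is replaced by the closest dictionary entry when its edit distance is small enough; the dictionary and the distance threshold are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_word(word, dictionary, max_dist=1):
    """Replace an OCR output word by the closest dictionary entry, if close enough."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word
```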



Character Recognition Performance

 OCR (Optical Character Recognition) is the most mature technique of document analysis

 For most applications, very high accuracy is required


 a 99% recognition rate would still generate 30-50 errors per page (a full page typically contains a few thousand characters)
 99.9% to 99.99% is often requested

 OCR systems may be designed for


 standard OCR-A, OCR-B fonts, specially designed for OCR
 mono-font recognition, specialized for typewriting
 omni-font or multi-font text recognition (the most popular)
 trainable systems, being tuned for specific fonts and styles



OCR experiment

 One magazine page [Hebdo 18, 2007, Editorial]
 good quality printed text
 medium layout complexity



First Experiment (Standard MS-Office OCR)

 Layout not understood


 Many OCR errors
 => Unusable !



Second Experiment (ReadIris)

 Page layout correctly recognized
 Italic correctly detected
 including one false detection
 A few word segmentation errors
 Correct de-hyphenation
 A few OCR errors
 often as a consequence of segmentation errors!



Conclusion on OCR technologies

 Imperfect results in printed character recognition


 Recognition of uncontrolled handwriting not mature
 Practical problems with
 mathematics (symbols and formulas)
 special fonts or scripts
 logos
 => perfect document recognition is not achievable !

 Some applications can deal with approximate results


 Recognition algorithms should be tuned to prefer rejections to errors
 Include manual correction in the processing chain

