
TEXT EXTRACTION USING DOCUMENT STRUCTURE FEATURES AND SUPPORT VECTOR MACHINES

Konstantinos Zagoris and Nikos Papamarkos

Image Processing and Multimedia Laboratory Department of Electrical & Computer Engineering Democritus University of Thrace 67100 Xanthi, Greece kzagoris@ee.duth.gr papamark@ee.duth.gr

ABSTRACT

In order to successfully locate and retrieve document images such as technical articles and newspapers, a text localization technique must be employed. The proposed method detects and extracts homogeneous text areas in document images, regardless of font type and size, by using connected components analysis to detect blocks of foreground objects. Next, a descriptor that consists of a set of structural features is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies each block as text or not.

KEY WORDS Page Layout, Text Extraction, Support Vector Machines, Document Structure Elements, Connected Component Analysis

1. Introduction

Nowadays, there is an abundance of document images such as technical articles, business letters, faxes and newspapers, sparked by the ease of creating them with scanners or digital cameras. Therefore, the need to easily locate and retrieve this kind of document quickly arises. In order to successfully exploit them, a text localization technique must be employed with the purpose of determining the location of the text inside them. In the previous literature, there are top-down approaches [1,2] employing recursive algorithms to segment the whole page into small regions. On the other hand, the most frequently used approaches are the bottom-up methods, which segment the page into small regions and then merge them based on some criteria. Such a method is proposed by Strouthopoulos et al. [3]: a technique to automatically detect and extract text in mixed-type color documents using a combination of an adaptive color reduction technique and a page layout analysis approach. Jain and Yu [4] have presented a geometric layout analysis of technical journal pages using connected component extraction to efficiently implement page segmentation and region identification.


The proposed method detects and extracts homogeneous text in document images, regardless of font type and size, by using connected components analysis to detect the objects, document structure features to construct a descriptor, and Support Vector Machines to tag the appropriate objects as text. The proposed technique has the ability to adapt to the peculiarities of each document image database, since the features are adjusted to it each time.

2. Text Extraction Algorithm

Figure 1 depicts the overall structure of the proposed algorithm. After applying preprocessing techniques (binarization etc.), the initial blocks are identified using the Connected Component Analysis (CCA) method. Then, these blocks are expanded and merged to model lines of text.

[Figure 1 flowchart: (1) Locate, merge and extract blocks; (2) Extract the features from the blocks; (3) Find the blocks which contain text using Support Vector Machines; (4) Extract or locate the text blocks and present them to the user.]

Figure 1. The steps of the proposed text-extraction algorithm.

Next, a descriptor that consists of a set of structural features (determined by a procedure which we call Feature Standard Deviation Analysis of Structure Elements) is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies the block as text or not.


Figure 2. The Block Extraction Steps: (a) The Original Document, (b) the Connected Components, (c) the Expanded Connected Components, (d) the Final Blocks after Merging the Connected Components.

3. Block Extraction

The primary goal of the block extraction method is to detect and extract the objects of the document. This is accomplished by using the Connected Components Labeling and Filtering technique. After applying a binarization method appropriate for the document (e.g. Otsu [5]) and identifying all the Connected Components (CCs) (Figure 2(b)), the most common height of the CCs (h_CC) is calculated. The next step is the expansion of the left and right sides of the CCs by 50% of h_CC, as Figure 2(c) depicts.


Finally, to locate the text-lines the overlapping CCs are merged (Figure 2(d)).
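The block-extraction stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binarized image with foreground pixels equal to 1, uses SciPy's connected-component labelling, and all function and variable names are our own.

```python
import numpy as np
from scipy import ndimage

def extract_blocks(binary_img):
    # 1. Label the connected components (Figure 2(b)).
    labels, n = ndimage.label(binary_img)
    boxes = ndimage.find_objects(labels)
    heights = [b[0].stop - b[0].start for b in boxes]
    # 2. Most common CC height h_CC.
    h_cc = np.bincount(heights).argmax()
    # 3. Expand each box left and right by 50% of h_CC (Figure 2(c)).
    expanded = []
    for sl in boxes:
        y0, y1 = sl[0].start, sl[0].stop
        x0 = max(0, sl[1].start - h_cc // 2)
        x1 = sl[1].stop + h_cc // 2
        expanded.append([y0, y1, x0, x1])
    # 4. Merge overlapping boxes to form text-line blocks (Figure 2(d)).
    merged = True
    while merged:
        merged = False
        out = []
        for box in expanded:
            for other in out:
                overlap = not (box[1] <= other[0] or box[0] >= other[1] or
                               box[3] <= other[2] or box[2] >= other[3])
                if overlap:
                    other[0] = min(other[0], box[0]); other[1] = max(other[1], box[1])
                    other[2] = min(other[2], box[2]); other[3] = max(other[3], box[3])
                    merged = True
                    break
            else:
                out.append(box)
        expanded = out
    return expanded
```

For example, two neighbouring characters whose expanded boxes overlap are merged into a single text-line block.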

4. Creation of the Block Descriptor

The next step involves the feature extraction stage for the blocks. The extracted features construct a descriptor of each block that maximizes the separability between the blocks. The spatial features are constructed from a number of suitable Document Structure Elements (DSEs) that the blocks contain.

Figure 3. (a) The Pixel Order of the DSEs (pixels b_8, b_7, ..., b_1, b_0), (b) The DSE of L_142.

Analytically, a DSE is any 3x3 binary block, as Figure 3 depicts. Therefore, there are in total 2^9 = 512 DSEs. An integer label L_j is assigned to each DSE as

L_j = \sum_{i=0}^{8} b_{ji} 2^i

(Figure 3(a)). For a block B, if C is the number of its columns and R the number of its rows, then the block B contains (C-2)(R-2) DSEs. The initial descriptor of the block B is the histogram H of the DSEs that the block B contains, and it is calculated by the following equation:

H(L_n) = \begin{cases} H(L_n) + 1, & \text{if } L_j = L_n \\ H(L_n), & \text{if } L_j \neq L_n \end{cases}    (1)

for n = 1, 2, ..., (C-2)(R-2), where L_j, L_n \in [1, 510]. Note that the 0 and 511 DSEs are removed because they correspond to pure background and pure document objects, respectively. According to the above analysis, a normalized histogram is constructed by the following equation:

X(L_n) = \frac{H(L_n)}{\sum_{i=1}^{510} H(L_i)}    (2)

where X(L) is a vector of 510 elements. Next, a feature reduction algorithm is applied, which reduces the number of features from 510 to 32. We call this algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE).
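The DSE histogram of Eqs. (1)-(2) can be sketched as below. This is an illustrative reconstruction: the exact pixel order of Figure 3(a) is an assumption here (we weight the 3x3 window row by row), and the function name is our own.

```python
import numpy as np

def dse_descriptor(block):
    # block: 2-D binary array with R rows and C columns.
    # Each 3x3 window is a DSE; its label is L = sum b_i * 2^i over the
    # 9 pixels (the row-major pixel order used here is an assumption).
    weights = 2 ** np.arange(9).reshape(3, 3)
    R, C = block.shape
    hist = np.zeros(512)
    for r in range(R - 2):          # (R-2)(C-2) DSE positions in total
        for c in range(C - 2):
            L = int((block[r:r+3, c:c+3] * weights).sum())
            hist[L] += 1
    # Drop L=0 (pure background) and L=511 (pure foreground),
    # then normalize by the total count (Eq. 2).
    hist = hist[1:511]
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The result is the 510-element vector X(L), ready for the FSDASE reduction step.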

If there are T text blocks and P non-text blocks, then the stages of the FSDASE algorithm are:

1. Find the Standard Deviation (SD) SDXT(L_n) of the X(L_n) of the T text blocks, for each L_n of the DSEs.

2. Do the same for the P non-text blocks: find the SD SDXP(L_n) of the X(L_n), for each L_n of the DSEs.

3. Normalize the SDXT(L_n) and SDXP(L_n):

SDXT(L_n) = \frac{SDXT(L_n)}{\sum_{i=1}^{510} SDXT(L_i)}, \quad SDXP(L_n) = \frac{SDXP(L_n)}{\sum_{i=1}^{510} SDXP(L_i)}

4. Then define the vector O(L_n) as:

O(L_n) = \left| SDXT(L_n) - SDXP(L_n) \right|

5. Finally, take those 32 DSEs that correspond to the first 32 maximum values of O(L_n).
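The five FSDASE stages can be sketched compactly. This is a minimal illustration under our reading of the algorithm (in particular, the absolute difference in step 4 is an assumption); the function name and array layout are our own.

```python
import numpy as np

def fsdase(text_desc, nontext_desc, k=32):
    # text_desc:    (T, 510) array, the X(L_n) descriptors of the T text blocks
    # nontext_desc: (P, 510) array, the descriptors of the P non-text blocks
    sdxt = text_desc.std(axis=0)       # step 1: SD per DSE over text blocks
    sdxp = nontext_desc.std(axis=0)    # step 2: same for non-text blocks
    sdxt = sdxt / sdxt.sum()           # step 3: normalize both SD vectors
    sdxp = sdxp / sdxp.sum()
    o = np.abs(sdxt - sdxp)            # step 4: O(L_n) (absolute difference assumed)
    return np.argsort(o)[::-1][:k]     # step 5: indices of the k largest O(L_n)
```

The returned indices select the 32 DSE frequencies that form the final block descriptor.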

The goal of the FSDASE is to find those DSEs that have maximum SD at the text blocks and minimum SD at the non-text blocks, or the opposite. Obviously, a training dataset is required to determine the optimal DSEs. Fortunately, this does not cause a problem, because such a dataset is already required for the training of the SVMs. Therefore, the final block descriptor is a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs that the block contains. This descriptor is used to train the Support Vector Machines. Note that the descriptor has the ability to adapt to the demands of each set of document images. The advantages are twofold:

- A noisy document has a different set of DSEs than a clean document.

- If more computational power is available, the descriptor can easily increase its size above 32.

5. Support Vector Machines

The Support Vector Machines (SVMs), introduced in 1992 [6,7], are based on statistical learning theory and have recently been applied to many and various classification problems.

If D is a given training dataset \{(x_i, y_i)\}_{i=1}^{n}, with x_i \in [0,1]^L and y_i \in \{-1, +1\}, where x_i is the input vector and y_i is the label corresponding to x_i, the original linear SVM classifier satisfies the following conditions:

w^T x_i + b \geq +1, when y_i = +1
w^T x_i + b \leq -1, when y_i = -1

or, equivalently,

y_i \left[ w^T x_i + b \right] - 1 \geq 0    (3)

If the training data are not linearly separable (as in our case), then they are mapped from the input space X to a feature space F using the kernel method, defined as:

k(x, x') = \phi(x)^T \phi(x')    (4)

where \phi(x) is the feature map mapping the input space to a high-dimensional feature space where the training data become linearly separable. The most commonly used kernels are the Polynomial kernel ((x^T x' + 1)^p), the Radial Basis Function (\exp\{-\gamma \|x - x'\|^2\}) and the Sigmoid kernel (\tanh(k x^T x' - \delta)). Our experiments showed the Radial Basis Function to be the most robust kernel.

If w = \sum_{i=1}^{n} \alpha_i x_i, the SVM conditions of Eq. (3) transform to:

y_i \left[ \sum_{j=1}^{n} \alpha_j k(x_j, x_i) + b \right] - 1 \geq 0    (5)

In practice, the classifier must sometimes misclassify some data points (for instance, to overcome the overfitting problem). This is achieved using the slack variables \xi_i \geq 0, so Eq. (5) is changed to:

y_i \left[ \sum_{j=1}^{n} \alpha_j k(x_j, x_i) + b \right] - 1 + \xi_i \geq 0    (6)

Finally, the maximum margin classifier is calculated by solving the following constrained optimization problem, expressed in terms of the variables \alpha_i:

maximize_\alpha \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j k(x_i, x_j)

subject to: \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C    (7)

The constant C > 0 defines the tradeoff between the training error and the margin. The training data x_i for which \alpha_i > 0 are called support vectors.

One of the difficulties of the SVM consists of finding the correct parameters to train it. In our case, we have two parameters: the C from the maximum margin classifier and the \gamma from the Radial Basis Function kernel. The goal is to find the values of the two parameters C and \gamma so that the classifier can accurately predict the unknown data. This is achieved through a cross-validation procedure using a grid search over the two parameters. The values of the above parameters for our document image database are calculated as C = 8, \gamma = 8. Finally, the output of the SVM classifies each block as text or not (Figure 4).

6. Implementation and Evaluation

The proposed technique is implemented in a visual environment (Figure 5) with the help of Visual Studio 2008 and LIBSVM [8], and is based on the Microsoft .NET Framework 3.5. The programming language used is C#/XAML. The program can be downloaded from the following web address: http://orpheus.ee.duth.gr/download/TextFinder_1.0.9.zip

To evaluate the proposed text extraction technique, the Document Image Database from the University of Oulu [9,10] is employed, which includes 198 documents of various types. In our experiments we used a set of 48 article documents. These document images contained a mixture of text and pictures. From this database, five images were selected and their extracted blocks were used to determine the proper DSEs and to serve as training samples for the SVMs. The overall results are presented in Table 1.

Figure 4. The Final Text Extracted Blocks by the SVM.

Table 1. Experimental Results.

Document Images | Blocks | Success Rate
----------------|--------|-------------
48              | 25958  | 98.453%

7. Conclusion

In this paper, a bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images. A Connected Component analysis technique is applied to detect the objects of the document. Then a powerful descriptor is extracted based on structural elements. Finally, a trained SVM classifies the objects as text or non-text. The proposed technique is implemented in a visual environment and the experimental results are very promising.

Figure 5. The application of the proposed method.


Acknowledgement

This work is co-funded by the project "PENED 2003-03ΕΔ679".

References

[1] R. Ingold and D. Armangil, A Top-Down Document Analysis Method for Logical Structure Recognition, Proc. First Int'l Conf. Document Analysis and Recognition, Saint-Malo, France, 1991, 41-49.

[2] Y. Chenevoy and A. Belaid, Hypothesis Management for Structured Document Recognition, Proc. First Int'l Conf. Document Analysis and Recognition, Saint-Malo, France, 1991, 121-129.

[3] C. Strouthopoulos, N. Papamarkos, and A. E. Atsalakis, Text extraction in complex colour documents, Pattern Recognition, 35, 2002, 1743-1758.

[4] A. K. Jain and B. Yu, Document Representation and Its Application to Page Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), March 1998.

[5] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Systems, Man, and Cybernetics, 9, 1979, 62-66.

[6] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, 144-152.

[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, 20, 1995, 273-297.

[8] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[9] University of Oulu, Finland, Document Image Database, http://www.ee.oulu.fi/research/imag/document/.

[10] J. Sauvola and H. Kauniskangas, MediaTeam Document Database II, a CD-ROM collection of document images, University of Oulu, Finland, 1999.