
SECTIONALIZATION AND DEPICTION OF COMPRESSED DOCUMENT IMAGE
MR. MURUGESH K
Assistant Professor, Department of ECE, Don Bosco Institute of Technology, Bangalore, India.
Email id: muru.dbit@gmail.com

Dr. MAHESH P K
HOD & Professor, Department of ECE, Don Bosco Institute of Technology, Bangalore, India.

ABSTRACT: Segmentation is the process of partitioning an image and extracting a block of
interest. Studying the characteristics of such a block is challenging if extraction has to be
performed on a compressed image directly. Segmenting the compressed image itself is the objective
of this paper. The proposition here is to develop a way to segment and extract a specified block,
and further to carry out its characterization, without decompressing the compressed image. The
two main reasons for keeping the image in compressed form and avoiding decompression are to
minimize the additional computing time and storage space. In particular, this paper works on
run-length compressed images.

KEYWORDS: Density, Compressed Data, Document Characterization, Entropy, Document Block Extraction.

INTRODUCTION

In Pattern Recognition (PR) and Document Image Analysis (DIA) systems, the segmentation and
characterization of a particular block in a document image underpins many applications, such as
logo extraction and detection, signature extraction from official documents, and text, photo and
line extraction [5]. Beyond the applications mentioned above, block segmentation has also been
applied to 2D plot image documents and to layout-aware text extraction [4]. For classification
using wavelet coefficients, text blocks extracted from postal images have been used. All of these
methods, however, share the limitation of working only with decompressed or uncompressed data. A
document illustrating the real-life applications of logo, signature and data extraction, with the
regions of interest held in rectangular blocks, is presented in Fig-1.

Another issue is the characterization of the extracted/segmented block. Often, characterization of
the specific block on its own is sufficient, but circumstances arise where the characterization of
the block must be expressed relative to the characterization of the whole document. This means
that characterizing only the segmented block in the compressed format is not enough; the complete
document must also be characterized in the compressed format.

Working with documents directly in their compressed form is challenging, specifically for
applications in pattern recognition and image analysis. The idea of working on compressed data
was first presented in the 1980s. Run-length encoding (RLE) was the simplest form of compression,
first used for coding pictures and television signals [4]. There have been a number of attempts
at working directly on compressed document images. In the literature on run-length information
processing, operations such as component extraction, page layout analysis, image rotation and
skew detection are reported. Several initiatives have also emerged in finding equivalence,
document similarity and retrieval. Performing morphology-related operations on run-length data is
one of the more recent works. In most instances, either some partial decoding or the run-length
data of an uncompressed image is used to carry out these operations. So far, no attempt has been
reported in the literature at extracting blocks of documents from TIFF-compressed binary text
documents for characterization. There has, however, been a first step showing the feasibility of
performing extraction/segmentation without decompression on JPEG documents, which is not directly
applicable to processing TIFF-compressed data.

Fig.1: Block sections of Logo, Data and Signature used in the analysis of document image
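The run-length representation that the methods above operate on can be illustrated with a minimal
encoder/decoder for one binary row. The convention that runs alternate starting with the
background colour 0, in the spirit of CCITT-style coding, is an assumption here:

```python
def rle_encode(row):
    """Run-length encode one binary pixel row.

    Assumed convention: runs alternate starting with background (0),
    so a row beginning with 1 gets a leading zero-length run.
    """
    runs = []
    current, length = 0, 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current, length = pixel, 1
    runs.append(length)
    return runs


def rle_decode(runs):
    """Expand alternating runs (starting with 0) back into a pixel row."""
    row, value = [], 0
    for length in runs:
        row.extend([value] * length)
        value = 1 - value
    return row


row = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]
runs = rle_encode(row)          # [2, 3, 1, 1, 3]
assert rle_decode(runs) == row  # lossless round trip
```

The round-trip check at the end is what makes operating on the runs instead of the pixels safe:
no information is lost by the compression.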

UNDERSTANDING THE PROBLEM AND TERMINOLOGIES

Characterizing a specific block, or segmenting and analyzing a document image, normally means
extracting the information in a rectangular segment of the document. Such segmentation requires
extracting the vertical and horizontal boundaries, which can be done with ease in a decompressed
image simply by using the rows and columns of the block, as shown in Fig-2. The same block, when
subjected to run-length compression, no longer appears rectangular, so automatically tracing the
block becomes a difficult task. As can be observed in Fig-2, the contents of the specific block
lie within an enclosed boundary of runs in the compressed data. This enclosed boundary can be
located inside the compressed data with the aid of the start (y1) and end (y2) columns of the
block. Once the boundary of the block has been estimated inside the compressed data, further
refinement of the runs at the start and end positions is essential in order to obtain the
accurate boundary of the block. The conventional view of the precise locations of the specified
block in the compressed data is given in Fig-2.

Fig. 2: Specified block runs and their respective residue runs

Another significant aspect of this research work is the relative and absolute characterization of
the specified block from the compressed information. Analyzing the contents of the extracted
block alone is known as absolute characterization; analyzing them with respect to the source
document is relative characterization.
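As noted above, extraction in the decompressed domain only needs the rows and columns of the
block. A minimal sketch (the function name and the list-of-rows image representation are
illustrative, not from the paper):

```python
def extract_block(image, x1, x2, y1, y2):
    """Cut the rectangular block spanning rows x1..x2 and columns
    y1..y2 (inclusive, 0-based) out of an uncompressed binary image
    stored as a list of pixel rows."""
    return [row[y1:y2 + 1] for row in image[x1:x2 + 1]]


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
block = extract_block(image, 1, 2, 1, 2)  # [[1, 1], [1, 0]]
```

It is exactly this one-line slicing that becomes non-trivial once the rows are stored as runs,
which motivates the boundary-refinement scheme of the next section.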

PROPOSED MODEL

This section proposes a new method for extracting specified document blocks directly from
run-length compressed document data. All the stages involved in segmenting/extracting a document
block are shown in Fig-3. The proposed model exploits the run-length compressed data extracted
from TIFF compressed documents by means of Huffman decoding. Using the given rows (x1 and x2) of
the document block, the segmentation of the horizontal boundaries is performed in the compressed
data.

The stages of the model are:

1. TIFF, PDF, BMP compressed documents (CCITT Group 3, PackBits compressed data) are decoded into run-length compressed document data.
2. Given the specified block with rows (x1 & x2), the horizontal boundaries are extracted from the compressed data using x1 and x2.
3. The vertical boundary positions and the corresponding residues are extracted using y1 and y2.
4. The compressed data matrix of the specified block is extracted using the position table.
5. The residue runs are removed from the compressed matrix using the start and end residues, giving the updated compressed data matrix of the specified block.
6. Segment analysis (entropy computations and density estimation) is performed on the result.
Fig. 3: Proposed model for block segmentation from Compressed data

The next stage is to extract the vertical boundary positions and the corresponding residue runs
from each row in the compressed data, using the columns (y1 and y2) of the specified block. The
scheme used for fixing the boundary positions and their corresponding residues is summarized in
Table-1 and Table-2, where 'j' denotes the column position of a run in the compressed data. For
every row of runs in the compressed data, the cumulative sum of runs (runSum) is computed until
one of the cases specified in Table-1 is met, which indicates the position of y1 in the
compressed data. In the same way, the row scan is continued to obtain the position of y2 based on
the conditions given in Table-2. These positions of y1 and y2 drawn from the compressed data, P1
and P2, and their corresponding residues, R1 and R2, are arranged in a position table as shown in
Fig-2. The compressed data matrix of the particular block is then extracted using this position
table. Since this compressed data only roughly represents the particular block, the residue runs
must be discarded using the residue values tabulated earlier. The updated compressed data,
illustrated in Fig-4a, is obtained after the residue runs are removed from the compressed data of
the block using the scheme specified in Table-3. The resulting compressed data of the specified
block is used for further document analysis, such as relative and absolute characterization using
the entropy and density features discussed in the next section.
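The per-row scheme described above (cumulative runSum, positions P1/P2, residues R1/R2, then
residue removal) can be sketched as follows. Since Tables 1-3 are not reproduced in the text, the
exact case analysis and residue convention here are assumptions:

```python
def locate_run(runs, col):
    """Find the run covering 1-based column `col` in one compressed
    row by accumulating run lengths (runSum).  Returns the run index
    and the cumulative sum up to and including that run."""
    run_sum = 0
    for j, run in enumerate(runs):
        run_sum += run
        if run_sum >= col:
            return j, run_sum
    raise ValueError("column lies beyond the end of the row")


def extract_block_runs(row_runs, y1, y2):
    """Keep the runs of one compressed row spanning columns y1..y2
    (inclusive) and trim the start/end residues so the block
    boundary becomes exact.  The residue bookkeeping stands in for
    the exact rules of Tables 1-3."""
    p1, s1 = locate_run(row_runs, y1)
    p2, s2 = locate_run(row_runs, y2)
    block = row_runs[p1:p2 + 1]                   # rough block, with residues
    start_residue = y1 - (s1 - row_runs[p1]) - 1  # pixels left of y1
    end_residue = s2 - y2                         # pixels right of y2
    block[0] -= start_residue
    block[-1] -= end_residue
    return block


# One row of runs: 4 white, 6 black, 5 white, 3 black (columns 1..18).
# Block columns 6..12 cut into the black run and the following white run.
print(extract_block_runs([4, 6, 5, 3], 6, 12))  # [5, 2]
```

Applying this to every row of the compressed matrix yields the updated compressed data of the
block, corresponding to the position-table and residue-removal stages of Fig-3.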

Fig. 4: Schematic view of block-segment in compressed and decompressed binary image.

Fig. 5: Compressed and decompressed version of the specified block segment.

EXPERIMENTAL ANALYSIS

This section demonstrates the extraction of document blocks directly from the run-length
compressed data. Results obtained on compressed data are difficult to visualize and present;
therefore, for better understanding of the proposed methodology and its results, the compressed
data of the extracted segments is decompressed and shown in this section. As an example, consider
the decompressed version of a sample document of size 1009 x 1542. The decompressed forms of 4
segmented/extracted document blocks from the example document, with their respective rows and
columns, are (300 x 400; where x1 = 500, x2 = 800, y1 = 700, y2 = 1100), (300 x 300; where x1 =
100, x2 = 400, y1 = 200, y2 = 500), (300 x 300; where x1 = 700, x2 = 1000, y1 = 1200, y2 = 1500)
and (400 x 300; where x1 = 100, x2 = 500, y1 = 1200, y2 = 1500). The method has been tested in a
similar way on 35 randomly chosen compressed documents in Kannada, Bengali and English scripts.

The performance of the proposed procedure can be assessed in two ways. One way of measuring the
exactness of the segmented block is to decompress the compressed data of block A(i,j) and check
its similarity with the uncompressed ground-truth version B(i,j), using the formula

Accuracy(%) = (number of positions where A(i,j) = B(i,j)) / (m x n) x 100

where m and n are the number of rows and columns, respectively, in the ground-truth block segment
and the decompressed version of the segmented/extracted block. The second method of computing
accuracy is to have a ground truth of the compressed data of block B(i,j) and compare it with the
compressed data of the extracted/segmented block A(i,j) using the formula

Accuracy(%) = (number of positions where A(i,j) = B(i,j)) / (m x n') x 100

where m and n' are the rows and columns, respectively, of the specified block in the compressed
data, with n' < n. The accuracy of the 4 segmented blocks of Fig-6, computed using both methods,
is reported.
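The first accuracy measure, read as position-wise agreement between the decompressed extracted
block and its ground truth, could be computed as follows (this element-matching reading of the
formula is an assumption):

```python
def pixel_accuracy(extracted, ground_truth):
    """Percentage of pixels at which the decompressed extracted
    block agrees with the ground-truth block; both are lists of
    equal-length pixel rows (m rows, n columns)."""
    m, n = len(ground_truth), len(ground_truth[0])
    matches = sum(
        1
        for i in range(m)
        for j in range(n)
        if extracted[i][j] == ground_truth[i][j]
    )
    return matches / (m * n) * 100


gt = [[0, 1, 1], [1, 0, 0]]
ex = [[0, 1, 1], [1, 0, 1]]
print(pixel_accuracy(ex, gt))  # 5 of 6 pixels agree
```

The second measure has the same shape, with the runs of the compressed matrices compared in place
of pixels and n' columns of runs in place of n pixel columns.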

Fig. 6: Viewing the sample document and different block-segments extracted from
compressed data

BLOCK CHARACTERIZATION

This section demonstrates the relative and absolute characterization of the specified block
extracted directly from the compressed data. The entropy and density features [3,2], calculated
from the Cumulative Entropy Quantifier (CEQ) and the Sequential Entropy Quantifier (SEQ), are
used for characterizing the extracted document block after decompression.

The mathematical formula for CEQ is as follows:

E(t) = p * log2(1/p) + (1 - p) * log2(1/(1 - p))

where t denotes a transition from 0 to 1 or from 1 to 0, E(t) is the entropy, p is the
probability of a transition appearing in each row, and hence 1 - p is the probability of its
absence. For SEQ, if a transition appears between two columns C1 and C2 in row r, then the
corresponding row entropy is formulated as

E(r) = -(pos/n) * log2(pos/n)

where r = 1, ..., m, and the position parameter pos indicates the place of the transition point
along the n columns in the horizontal direction. For absolute characterization of the
extracted/segmented block, only the features of the block itself are taken into consideration.
The parameter p for CEQ and SEQ is computed as the ratio of the total number of (0-1) or (1-0)
transitions in each row of the block to the total number of possible transitions in that row, and
the parameter pos indicates the column position of a transition in each row of the block. The
relative and absolute characterization using the density feature is given below:

Absolute density = (number of foreground pixels in the block) / (total number of pixels in the block)

Relative density = (absolute density of the block) / (absolute density of the entire document)
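A minimal sketch of these feature computations on a decompressed binary block, under the
assumptions stated above (Shannon form for the per-row CEQ term; relative density as the ratio of
block density to document density):

```python
import math


def transition_probability(row):
    """p: number of (0-1) or (1-0) transitions in the row, divided
    by the maximum possible transitions, len(row) - 1."""
    transitions = sum(1 for a, b in zip(row, row[1:]) if a != b)
    return transitions / (len(row) - 1)


def ceq_row_entropy(p):
    """Per-row entropy E(t) = p*log2(1/p) + (1-p)*log2(1/(1-p)),
    the assumed Shannon form of the CEQ term."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


def absolute_density(block):
    """Foreground (1) pixels over total pixels of the block."""
    total = sum(len(row) for row in block)
    ones = sum(sum(row) for row in block)
    return ones / total


def relative_density(block, document):
    """Block density normalized by the density of the whole
    document (the assumed reading of 'relative')."""
    return absolute_density(block) / absolute_density(document)


row = [0, 0, 1, 1, 0, 1]         # 3 transitions over 5 adjacent pairs
p = transition_probability(row)  # 0.6
print(ceq_row_entropy(p))        # ~0.971
```

Note that both p and the densities can equally well be computed directly from the run-length
data, since each run boundary is a transition and the black runs sum to the foreground count.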

The relative and absolute characterization using the entropy and density features for the
different blocks extracted in Fig-6 is tabulated in Table-6 and Table-5, respectively. From these
tables it can be observed that the extracted blocks cover the different combinations of these
features: one block is a case of both high density and high entropy; Fig-6d is an example of low
entropy and low density; Fig-6e shows low density and high entropy; and Fig-6b shows low entropy
and high density.

Table 5: Absolute density and entropy computations of sample document and extracted blocks

Sample            Density       CEQ          SEQ
Sample Document   D  = 0.0945   C  = 75.321  S  = -1.096 x 10^3
Block Segment-1   D1 = 0.1325   C1 = 27.873  S1 = -2.217 x 10^6
Block Segment-2   D2 = 0.1556   C2 = 35.070  S2 = -3.117 x 10^6
Block Segment-3   D3 = 0.0677   C3 = 17.183  S3 = -0.886 x 10^6
Block Segment-4   D4 = 0.0797   C4 = 25.514  S4 = -2.860 x 10^6

CONCLUSION

In this paper, a new idea is proposed for segmenting/extracting a specified document block in
rectangular fragments directly from run-length compressed data, without decompression. Further,
the relative and absolute characterization of the extracted blocks using entropy and density
features is demonstrated. This study shows that document analysis is possible on compressed data
without resorting to decompression, and it opens the door to plenty of research issues on
documents in their compressed form.

REFERENCES

Thomas M. Breuel. Binary morphology and related operations on run-length representations.
International Conference on Computer Vision Theory and Applications, pages 159-166, 2008.

B. B. Chaudhuri, P. Nagabhushan, and Mohammed Javed. Entropy computation of document images in
run-length compressed domain. International Conference on Signal and Image Processing (ICSIP'14),
January 8-11, 2014.

P. Nagabhushan, Mohammed Javed, and B. B. Chaudhuri. Extraction of projection profile,
run-histogram and entropy features straight from run-length compressed documents. Proceedings of
the Second IAPR Asian Conference on Pattern Recognition (ACPR'13), Okinawa, Japan, November 2013.

Mohammed Javed, B. B. Chaudhuri, and P. Nagabhushan. Extraction of line-word-character segments
directly from run-length compressed printed text-documents. National Conference on Computer
Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG'13), Jodhpur, India,
December 19-21, 2013.

Eduard Hovy, Cartic Ramakrishnan, Gully A. P. C. Burns, and Abhishek Patnia. Layout-aware text
extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7:7,
2012.

