
IPASJ International Journal of Computer Science (IIJCS)
Volume 5, Issue 7, July 2017, ISSN 2321-5992
Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm

An Architecture for Efficient News Items Clustering and Retrieval Based on Language Models for a Dynamic Collection of E-Newspapers

Deepa Nagalavi 1, M. Hanumanthappa 2
1 Dept. of Computer Science and Applications, Bangalore University, Bangalore, India.
2 Dept. of Computer Science and Applications, Bangalore University, Bangalore, India.

ABSTRACT
Newspaper pages comprise multiple individual articles divided into multiple columns. The challenging part of this task is to organize and integrate the article blocks in the newspaper. This paper proposes a novel approach for article reconstruction from newspapers, including the aggregation of the multiple sections of an article and the recovery of the reading order of each individual article. The process combines diverse information sources: the geometric layout, the semantic content and the sequence of article blocks are all mined in the model using a clustering algorithm to deal with complex newspaper layouts. The work consists of different sub-tasks, such as identifying the sections of an article, establishing the boundary of each individual article, and identifying the sequence of blocks. Subsequently, the reading order of English text is used to aggregate the blocks and retrieve an individual article from the newspaper.

Keywords: Newspaper, Information Retrieval System, Clustering Algorithm, Natural Language Processing, News Retrieval, Article Reconstruction.

1. INTRODUCTION
In the information age, newspapers are primary information sources. Almost everywhere, newspapers are delivered to the doorstep, each with hundreds of new articles every day. Newspapers play an important role in disseminating current information and events and keep their readers up to date. E-newspapers are electronic replicas of traditional printed newspapers that store information electronically. They can contain a variety of data such as photographs, statistics, graphs, interviews, polls, debates on a topic, etc. The layout of newspapers is therefore heterogeneous. News design involves arranging the newspaper according to editorial and graphical guidelines. The editorial guideline states that news stories should be ordered by importance. The newspaper page layout is designed on a typographic grid system. Newspaper writing attempts to answer the basic questions about a particular event, such as who, what, when, where, why and often how, at the opening of the article. A news article is organized into headline, byline, dateline, lead paragraph, explanation and additional-information sections, and a jumpline section.
A newspaper page contains multiple individual articles, and each article is divided into multiple sections. Headlines can be identified by analyzing their features. Since news articles are presented in multiple small blocks, reading an article automatically is a challenging task, especially as the order of the blocks is not known. The blocks of each individual article are identified based on similarity measures. The difficult task is recovering the reading order of the different blocks and reconstructing the article on the newspaper page, because of the many different forms of multi-article page layouts.

2. LITERATURE SURVEY
In the literature, several prominent procedures and techniques have been applied to page segmentation and reading order recovery. Many optimized algorithms and techniques have been proposed to process the complex page layouts of newspapers.


The work in [14] formulates the reading order using a topological sort of text lines; spatial layout-based rules help in finding the pairwise relationships connecting the text line segments. Other methods in the literature formulate the reading order with image processing. R. Smith [17] developed an image processing method for reading order recovery that extracts the column structure by locating the tab-stops in a page and obtains the reading order in a top-down manner. Chen et al. [3] also used image processing and developed a method to retrieve information from Chinese newspapers, applying a set of rules based on visual information such as distance, size and color. In contrast, Thakare et al. [15] proposed a methodology based on a Genetic Algorithm (GA). The genetic algorithm is used to retrieve hidden knowledge from the newspaper and use it in the decision-making process; it creates document clusters of news articles. Since newspaper articles are classified into different domains, the algorithm groups each article into its respective domain based on the similarity of keywords. The Genetic Algorithm operates on a population of possible targets that fit the solution and is best suited as a search technique.
On the other hand, Aiello and Pegoretti [11] used text processing techniques as an alternative to image processing. They proposed three text-processing-based algorithms, simple clustering, comparative clustering and agglomerative clustering, collectively referred to as graph clustering algorithms. The objects in an article are represented as the nodes of a graph, and an edge represents a connection between them; the connection graph therefore represents the objects that belong to the same article as a cluster. The proposed algorithms all start with a graph containing one node per article object and no edges. Each algorithm then establishes links between the nodes that belong together, and each fully connected component forms a cluster; the connected unit of objects for a page is a graph in which each article is a clique. Among the three algorithms, [11] shows that simple clustering is the most suitable, since it provides a high performance rate and low computational complexity after removing stop words. In contrast, R. Beretta and L. Laura [1] proposed an extended graph clustering technique to evaluate newspaper article identification. Identifying the difficulties in [11], the work in [1] focuses on a specific graph clustering problem: a newspaper page is considered as a graph, the blocks of articles are its nodes, and all the nodes of the same article are connected together. The approach is evaluated with coverage and performance measures, and the accuracy of intra-cluster density and inter-cluster sparsity is reported [1].
Gao et al. [10] use a bipartite graph model to improve the reading order recovery of an article. The model has two sets of vertices, predecessors and successors, which correspond to the blocks of the article within a page. The reading order of the blocks is represented by the edges and is derived from reading transition probabilities. In English documents, the reading order generally runs from top to bottom and from left to right; topological analysis therefore helps to find the spatially admissible reading orders before selecting the linguistically plausible blocks. The transition score for reading text is measured using different properties such as textual content, position and style. An optimal matching with maximum weight is then computed using the classic Kuhn-Munkres algorithm [10]. The optimized graph is stored in a queue, and the article aggregation method reads the sequence from the ready queue and merges the blocks into articles.
This paper proposes a methodology that improves the efficiency of the solutions discussed above for the article reconstruction problem. It combines visual and textual semantic information with natural language processing models to obtain the optimal relationship between the blocks of an article.

3. ARTICLE RECONSTRUCTION
E-newspapers are available in PDF format, so an information extraction tool is used to transform each page image into text format while retaining the original layout and look of the document. In order to extract an individual article, the different sections of the article are first identified. The article blocks are then read sequentially based on the reading order of the text. The presented work focuses on how to identify the proper reading order of the components in a page; accordingly, artificial intelligence techniques are applied so that an article is identified and read the way a human reader would read it. The proposed process architecture for automatic news article retrieval is shown in Fig. 1. The process shows how to reconstruct and extract articles from an e-newspaper and create a database of individual articles.
3.1) Pre-processing: E-newspapers are converted to text format. An OCR-based tool converts the document from PDF into a text representation that looks almost as good as the initial document. It uses intelligent pre-processing algorithms that improve image quality by removing noise, straightening text lines and whitening the background, in order to achieve the highest conversion accuracy.
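As an illustration only, the following minimal Python sketch shows one way such a PDF-to-text conversion step could be wired up. It assumes the pdf2image and pytesseract packages (and the Tesseract OCR engine) are installed; it is not the specific OCR tool used in this work.

from pdf2image import convert_from_path
import pytesseract

def pdf_to_text(pdf_path):
    # Render each PDF page as an image, then run OCR on it
    pages = convert_from_path(pdf_path, dpi=300)
    return [pytesseract.image_to_string(page) for page in pages]

page_texts = pdf_to_text("newspaper.pdf")   # hypothetical input file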


3.2) Headline Section: A headline summarizes the story in a brief statement. It is placed above the story, generally in a larger font size and bold style. Headline features such as the font feature, word feature, punctuation-mark feature, keyword feature and abbreviation feature are identified, and the weights of these features are calculated to extract headlines from English newspapers.
3.3) Lead Paragraph: A byline is generally placed between the headline and the text of the article, or at the end of the article, and introduces the writer by name. An article's dateline is the part of the article that identifies the location from which the reporter filed the article. As the dateline appears at the beginning of the lead paragraph, it is considered the initial point of the article content. When a news article does not contain a byline or dateline section, the lead paragraph starts with a large dropped initial capital letter. Drop caps are not typically readable as text, but they are an elegant indication of where the text starts.

Fig. 1: Process Architecture

3.4) Explanation Section: The content of an article is divided into multiple blocks placed in a heterogeneous format, and the blocks are not in sequential order. The blocks of each individual article are therefore grouped first to build the article boundary, and then the reading order of the text is identified to connect the blocks in the correct sequence.
A. The text blocks of a news article are categorized into logical units based on similarity measures. Style similarity and content similarity are the two measures used to group the article blocks. The style similarity measure checks whether two text blocks of an article are similar in style, using properties such as width, height, font, background and other formatting properties. The content similarity measure uses a semantic similarity approach. WordNet, a general-purpose ontology, is used for the similarity calculation: knowledge-based WordNet measures identify the semantic relatedness between terms. This approach tackles each word individually rather than defining the meanings of all words simultaneously in a given context. The Lesk algorithm is a knowledge-based algorithm commonly used in many NLP applications for its simplicity and speed. The keywords of each block are given as input to the algorithm, which selects the sense that has the maximum overlap between its dictionary definition (gloss) and the surrounding text; the context is usually the set of words around the target word. The gloss of each concept with the maximum match in WordNet is selected through semantic and lexical relations, and the glosses are then compared with their associated concepts. Finally, the selected senses are listed as terms of the article blocks. Vector space model measures, Term Frequency and Inverse Document Frequency, are then used to identify the similarity between the blocks of an article, as sketched below.
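A minimal sketch of this vector-space content-similarity step is given below, assuming scikit-learn is available; the block texts are placeholders and the exact weighting used in this work may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

blocks = ["text of block one ...", "text of block two ...", "text of block three ..."]   # placeholder block texts
tfidf = TfidfVectorizer(stop_words="english").fit_transform(blocks)   # one TF-IDF vector per block
similarity = cosine_similarity(tfidf)   # similarity[i][j] approximates the content similarity of blocks i and j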

Simplified_Lesk(word, sentence)
{
    best_sense = most frequent sense of word
    max_overlap = 0
    context = set of words in sentence
    for each sense s of word do
        signature = set of words in the gloss of s
        overlap = number of words common to signature and context
        if overlap > max_overlap then
            max_overlap = overlap
            best_sense = s
    return best_sense
}
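For reference, a runnable counterpart of the Simplified_Lesk pseudocode above is sketched in Python, assuming NLTK and its WordNet corpus are installed; variable names are illustrative.

from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, 0
    for sense in wn.synsets(word):                            # each candidate sense of the word
        signature = set(sense.definition().lower().split())   # words of the dictionary gloss
        overlap = len(signature & context)                     # words shared with the sentence
        if overlap > best_overlap:                             # keep the sense with the largest overlap
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "the bank raised the interest rate on deposits"))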
B. Both a grammatical part-of-speech language model and a word-based n-gram model have associated probabilities and are used to check the expected word across the chronological sequence of blocks. As in other natural language processing systems, the interpolation equation (eq. 1) [4][5] characterizes a system that has been observed and from which relevant information can be extracted. The combination of language models efficiently provides the information needed to predict the sequence of words, sentences and article blocks in English newspapers [4].

P_combined(w_i | w_{i-(n-1)}, t_{i-(m-1)}) = [P_n-gram(w_i | w_{i-(n-1)})]^a * [P_m-pos(w_i | t_{i-(m-1)})]^(1-a)    (eq. 1)
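A minimal sketch of the combination in (eq. 1) is shown below: a geometric interpolation of the n-gram and part-of-speech model probabilities with weight a. The probability values and the weight are placeholders, not figures from [4][5].

def combined_probability(p_ngram, p_pos, a=0.6):
    # Geometric interpolation of the two model probabilities with weight a (placeholder value)
    return (p_ngram ** a) * (p_pos ** (1 - a))

score = combined_probability(p_ngram=0.02, p_pos=0.15)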

3.5) Jumpline Section: Important news articles need to appear on the front page, and it is difficult to display several complete articles on one page, so the jumpline concept is used on the front page. A jumpline indicates the continuation of an article on a later page of the newspaper: it informs the reader of the page number where the rest of the story can be found. Normally it appears at the end of a column, for example "continued on page 7". A jumpline at the top of a block indicates where the article is continued from, for example "continued from page 1". Jumplines therefore help readers find and read an article contiguously without difficulty when a front-page article continues on another page. Sometimes the page number is omitted or the phrase "continued on next page" is used. Whatever style is chosen for jumplines, it is kept consistent in fonts, spacing and alignment throughout the article and throughout the newspaper design.
This work combines these techniques to identify the different sections of a news article. Article reconstruction interprets each section of the article together with the proper reading order of the text, so that a complete article can be extracted from the newspaper. A clustering algorithm based on a multi-objective evolutionary approach then effectively identifies the clusters corresponding to each individual article.

4. AN ARTICLE READING TRANSITION ANALYSIS

4.1) headline(): Headline features such as the font feature, word feature, punctuation-mark feature, keyword feature and abbreviation feature are identified, and the weights of these features are then calculated to extract headlines.
if (first word is followed by a colon or hyphen symbol) then {
    fWord = firstWord;
    if (fWord.contains("/")) then searchElement[] = fWord.split("/");
    else searchElement[0] = fWord;
    for (i = 0 to searchElement.length) {
        result = binarySearch(cityName, searchElement[i]);
        if (result == true) then article->root = text_block;
    }
    if (fWord.fontStyle == BOLD) then article->root = text_block;
} else {
    fLetter = fWord.subString(0, 1);
    if (fLetter.fontSize > fSize * 2 or fWord.fontSize > fSize) then article->root = text_block;
}


4.2) leadParagraph(): The dateline section or the drop-cap letter of the first paragraph is identified to find the root of the article content.

4.3) distance(TB1, TB2): the physical distance between text blocks TB1 and TB2, taken as the minimum of the distances in the horizontal and vertical directions; it is used to identify the closest blocks, as illustrated in the sketch below.
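An illustrative sketch of this distance measure for axis-aligned bounding boxes is given below; the (x, y, w, h) box format is an assumption, not the representation used in the paper.

def block_distance(b1, b2):
    # b1 and b2 are bounding boxes given as dicts with keys x, y, w, h (assumed format)
    dx = max(0, max(b1["x"], b2["x"]) - min(b1["x"] + b1["w"], b2["x"] + b2["w"]))   # horizontal gap
    dy = max(0, max(b1["y"], b2["y"]) - min(b1["y"] + b1["h"], b2["y"] + b2["h"]))   # vertical gap
    return min(dx, dy)   # minimum of the horizontal and vertical distances

d = block_distance({"x": 0, "y": 0, "w": 100, "h": 50}, {"x": 120, "y": 0, "w": 100, "h": 50})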

4.4) stlSim(TB1, TB2): the style similarity between text blocks TB1 and TB2; it is set to a large value if the blocks are similar in terms of width, height, font, background, and so on.

4.5) blockSim(TB1, TB2): Similar blocks are grouped into individual articles using the simple clustering algorithm. This algorithm adds an edge to the connection graph by checking the similarity matrix built from style similarity, content similarity and distance similarity. For content similarity, both a thesaurus-based and a distributional algorithm are used to identify the similarity between two blocks effectively. Efficiency is enhanced by pre-processing the blocks and using an extended vector space model with all related terms identified by the Lesk algorithm.
blockClustering
{
    set the connection graph to have no edges
    compute the similarity matrix M
    for all elements m[tb1, tb2] = (conSim() + stlSim() + distance()) in M
        if m[tb1, tb2] >= threshold then add an edge in the connection graph between tb1 and tb2
}
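A minimal Python sketch of this clustering step follows, assuming the networkx package is available: an edge is added whenever the combined similarity exceeds the threshold, and each connected component of the resulting graph is treated as one article. The threshold value and the similarity callback are illustrative.

import networkx as nx

def cluster_blocks(num_blocks, similarity, threshold=1.5):
    g = nx.Graph()
    g.add_nodes_from(range(num_blocks))
    for i in range(num_blocks):
        for j in range(i + 1, num_blocks):
            if similarity(i, j) >= threshold:    # combined conSim + stlSim + distance score
                g.add_edge(i, j)
    return list(nx.connected_components(g))      # one set of block indices per article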
4.6) textCont(TB1, TB2): The transfer degree from one text block to the next block in the sequence is identified with the interpolation model (eq. 1) [4][5] of a part-of-speech based and a word-based n-gram language model. The sequences of words in the article blocks are identified more efficiently using predefined dictionaries. The function identifies the list of words most likely to follow the sentence; its input is the words of the last sentence of a block grouped by the blockSim() function.

4.7) readingTransition(TB1, TB2): The reading transition score between the blocks of an article. The list of words produced by textCont() is matched against the first word of the remaining blocks of the article; if a match is found, textCont() is executed for the next block of the article.
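A rough sketch of this matching step is shown below; the function and variable names are assumptions for illustration, not the authors' code.

def reading_transition(candidate_words, remaining_blocks):
    # candidate_words: next-word predictions from textCont(); remaining_blocks: candidate block texts
    candidates = {w.lower() for w in candidate_words}
    for block in remaining_blocks:
        words = block.split()
        if words and words[0].lower() in candidates:
            return block   # this block is the most likely continuation
    return None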

5. NEWS ARTICLE CLUSTERING


Generally, newspaper pages consist of multiple independent articles. These articles are divided into several blocks or columns to fit the articles into a small space and to enhance reading speed. A news article is composed of elementary logical units, also called document objects [7]. The important characteristics of a document object are its bounding box, its position within the page, and its content. Here the concentration is on reading the text objects in order to construct the content of an article automatically and read it sequentially in its proper order; thus, the text inside the blocks is used to reconstruct an individual article from the newspaper.
Reconstruction of an individual article from the newspaper is carried out by merging the results of the tasks discussed in Section 4, and the identified blocks are then grouped by analyzing their reading transition scores. Cluster analysis, or clustering, performs this grouping of individual articles, so that the cluster of one article is dissimilar to the clusters of other articles. A cluster model follows a few characteristics: there should be high similarity between cases inside a cluster, and each cluster should be as distinct as possible from the others [7][1]. Connectivity-based algorithms connect the components of an article to form clusters based on similarity measures; they form a whole family of methods that differ in the linkage criterion. Connectivity-based algorithms are also known as hierarchical clustering, a method of cluster analysis that seeks to build a hierarchy of clusters. The nearest-neighbour chain algorithm is used to perform agglomerative clustering; it creates the hierarchy of clusters by repeatedly merging pairs of clusters to form larger and more accurate clusters. It identifies pairs of clusters and merges them by following paths in the nearest-neighbour graph, using a stack to find and merge the nearest pairs of clusters. Since it takes less time and space than the greedy algorithm, it is well suited to merging pairs of clusters in a different order. The characteristic of this algorithm is to sequentially combine larger and more similar clusters until all elements end up in the same cluster.
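As an illustration, a single-linkage agglomerative clustering over a block dissimilarity matrix can be sketched with SciPy as follows; the matrix values and the cut threshold are placeholders, and this library call is not necessarily the implementation used in this work.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])                      # pairwise block dissimilarities (placeholder values)
Z = linkage(squareform(D), method="single")          # hierarchy built by repeatedly merging the closest clusters
labels = fcluster(Z, t=0.5, criterion="distance")    # one cluster label per block after cutting the hierarchy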

The algorithm follows an agglomerative scheme: it merges previously separate clusters into new combined clusters, deleting the corresponding rows and columns in the proximity matrix. The reading transition scores of the blocks are stored in the matrix D, an N*N matrix D = [d(i,j)]. The clusters are numbered 0, 1, ..., (n-1), and L(r) is the level of the r-th clustering. A cluster with sequence number n is denoted (n), and the proximity between clusters (a) and (b) is denoted d[(a),(b)] [8].

The algorithm is composed of the following steps (an illustrative sketch of the merge loop follows the list):


1) Input the e-newspaper.
2) Convert the e-newspaper to a text-format document.
3) Read the blocks of the newspaper.
4) Identify all the sections of each article.
5) Set the connection graph to have no edges.
6) Store the reading transition scores of the blocks in the matrix D.
7) In the current clustering, find the closest pair of clusters (a) and (b), i.e. d[(a),(b)] = min(d[(i),(j)]), where the minimum is taken over all pairs of clusters in the current clustering.
8) Increment the sequence number: n = n + 1. Merge clusters (a) and (b) into a single cluster and set L(n) = d[(a),(b)].
9) Update the distance matrix D by deleting the rows and columns corresponding to clusters (a) and (b) and adding a row and column for the newly formed cluster, with d[(r),(a,b)] = min(d[(r),(a)], d[(r),(b)]).
10) Check whether all the data points are in one cluster; if not, repeat from step 7, otherwise move to the next step.
11) Retrieve the cluster of the article corresponding to each headline and store it in the database.
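The sketch below illustrates the merge loop of steps 6-10 in Python, assuming D is a square matrix of block dissimilarities derived from the reading transition scores; it follows the single-linkage update d[(r),(a,b)] = min(d[(r),(a)], d[(r),(b)]) and is given as an illustration, not as the authors' implementation.

import numpy as np

def agglomerate(D):
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)                        # a cluster is never merged with itself
    clusters = [{i} for i in range(len(D))]            # one cluster per block
    levels = []
    while len(clusters) > 1:
        a, b = np.unravel_index(np.argmin(D), D.shape) # step 7: closest pair of clusters
        if a > b:
            a, b = b, a
        levels.append(D[a, b])                         # step 8: L(n) = d[(a),(b)]
        clusters[a] |= clusters[b]                     # step 8: merge (a) and (b)
        del clusters[b]
        merged = np.minimum(D[a, :], D[b, :])          # step 9: d[(r),(a,b)] = min(d[(r),(a)], d[(r),(b)])
        D[a, :] = merged
        D[:, a] = merged
        D = np.delete(np.delete(D, b, axis=0), b, axis=1)
        np.fill_diagonal(D, np.inf)
    return clusters[0], levels                         # step 10: stop when all blocks are in one cluster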

6. EXPERIMENTAL RESULTS
Experimental data from 25 newspapers of different publishers were collected. Each paper consists of around 8 to 10 pages, and each page contains at most 4 to 5 individual articles. Because of the inconsistency of newspaper page layouts, the experimental results on different newspapers can vary. The proposed method is evaluated with three measures: precision, recall and F-measure. Precision is the fraction of retrieved instances that are relevant, whereas recall is the fraction of relevant instances that are retrieved. F-measure is the harmonic mean of precision and recall, F = 2 * precision * recall / (precision + recall). A perfect precision score of 1.0 means that every retrieved result was relevant, whereas a perfect recall score of 1.0 means that all relevant results were retrieved. Table 1 shows the experimental results for the 25 tested newspapers.
Table 1: Experimental results of the algorithm under the three measures
Sl. No Recall Precision F-Measure
1 0.62 0.42 0.52
2 1.00 0.75 0.88
3 1.00 1.00 1.00
4 0.73 0.57 0.65
5 0.92 0.73 0.83
6 1.00 0.57 0.79
7 0.91 0.91 0.91
8 1.00 0.90 0.95
9 0.91 0.83 0.87
10 0.67 0.50 0.59
11 1.00 1.00 1.00
12 0.63 0.63 0.63
13 0.60 0.38 0.49
14 1.00 0.75 0.88
15 0.91 0.83 0.87
16 1.00 0.90 0.95
17 1.00 1.00 1.00

18 0.88 0.78 0.83
19 0.91 0.91 0.91
20 0.88 0.88 0.88
21 1.00 1.00 1.00
22 1.00 0.88 0.94
23 1.00 0.75 0.88
24 1.00 1.00 1.00
25 0.92 0.73 0.83

The plots in Fig. 2 highlight the performance of the algorithm under the three measures precision, recall and F-measure. The plot shows that the recall rate is higher than the precision rate, while the F-measure lies between precision and recall.

Fig 2: Graphical representation of an experimental result


7. CONCLUSION
In this work the features of news articles are analyzed and an approach is proposed to retrieve articles automatically. The proposed work identifies the blocks of each individual article efficiently based on reading transition scores, providing an efficient solution for automatic article extraction from English e-newspapers. It identifies the reading order of an article and retrieves the information needed for article aggregation. The geometric and content information of the newspaper is analyzed, which helps to improve reliability and efficiency. In summary, this paper presents a novel system for newspaper segmentation that combines the components of individual articles and sub-articles.

REFERENCES
[1] Beretta, R., Laura, L., "Performance Evaluation of Algorithms for Newspaper Article Identification", in Proc. International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2011. DOI: 10.1109/ICDAR.2011.87.
[2] Bloechle, J.L., Pugin, C., Ingold, R., "Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring", in Proc. DAS'08, 2008.
[3] Chen, M., Ding, X., Liang, J., "Analysis, Understanding and Representation of Chinese Newspaper with Complex Layout", in Proc. CIP'00, 2000.
[4] Cavalieri, D.C., et al., "Combination of Language Models for Word Prediction: An Exponential Approach", IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, Sept. 2016. DOI: 10.1109/TASLP.2016.2547743.
[5] Cavalieri, D.C., et al., "A Part of Speech tag Clustering for a Word Prediction System in Portuguese Language", Sociedad Española para el Procesamiento del Lenguaje Natural, 2011, ISSN 1135-5948.
[6] Gendron, G.R., "Natural Language Processing: A Model to Predict a Sequence of Words", MODSIM World 2015, Paper No. 13, 2015.
[7] Hadjar, K., Rigamonti, M., Lalanne, D., Ingold, R., "Xed: A New Tool for Extracting Hidden Structures from Electronic Documents", in Proc. DIAL'04, 2004.


[8] Abu-Dalbouh, H., Md Norwawi, N., "Bidirectional Agglomerative Hierarchical Clustering using AVL Tree Algorithm", IJCSI International Journal of Computer Science Issues, vol. 8, issue 5, no. 1, September 2011, ISSN (Online): 1694-0814.
[9] Talukder, K.H., Rahman, Md. M., Ahmed, T., "An Efficient Speech Generation Method Based on Character and Modifier of Bangla PDF Document", in Proc. 13th International Conference on Computer and Information Technology (ICCIT), IEEE, 2010.
[10] Gao, L., Tang, Z., Lin, X., Wang, Y., "A Graph-based Method of Newspaper Article Reconstruction", in Proc. 21st International Conference on Pattern Recognition (ICPR 2012), Tsukuba, Japan, November 11-15, 2012.
[11] Aiello, M., Pegoretti, A., "Textual Article Clustering in Newspaper Pages", Applied Artificial Intelligence, vol. 20, no. 9, pp. 767-796, 2006. [Online]. Available: http://dx.doi.org/10.1080/08839510600903858.
[12] Minami, M., Morikawa, H., Aoyama, T., "The Design of Naming-based Composition System for Ubiquitous Computing Applications", in Proc. 2004 International Symposium on Applications and the Internet Workshops, IEEE, 2004.
[13] Naughton, M., Kushmerick, N., Carthy, J., "Clustering Sentences for Discovering Events in News Articles", in Proc. ECIR 2006, pp. 535-538, 2006.
[14] Meunier, J.L., "Optimized XY-Cut for Determining a Page Reading Order", in Proc. ICDAR'05, 2005.
[15] Thakare, A.D., Muthiyan, N., Nangde, D., Patil, D., Patil, M., "Clustering of News Articles to Extract Hidden Knowledge", IJETAE, vol. 2, issue 11, November 2012, ISSN 2250-2459. [Online]. Available: www.ijetae.com.
[16] Singh, S.P., et al., "Word and Phrase Prediction Tool for English and Hindi Language", in Proc. International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), IEEE, 2016.
[17] Smith, R.W., "Hybrid Page Layout Analysis via Tab-Stop Detection", in Proc. ICDAR'09, 2009.
[18] Sutheebanjard, P., Premchaiswadi, W., "A Modified Recursive X-Y Cut Algorithm for Solving Block Ordering Problems", in Proc. ICCET'10, 2010.

