
Information Retrieval by Document Re-ranking using Term Association Graph
Veningston. K
Department of Computer Science and Engineering
Government College of Technology
Coimbatore 641013, INDIA
veningstonk@gct.ac.in

Shanmugalakshmi. R
Department of Computer Science and Engineering
Government College of Technology
Coimbatore 641013, INDIA
cseit.gct@gmail.com

ABSTRACT
Most Information Retrieval techniques are based on representing documents using the traditional vector space model, i.e. the bag-of-words model. In this paper, associations among words in the documents are assessed and expressed in a term graph model that represents the document content and the relationships among keywords. Most modern web search engines employ a two-level ranking strategy. First, an initial list of documents is prepared using a low-quality ranking function which consumes little computation. Second, the initial list is re-ranked by machine learning algorithms which involve expensive computation. This paper experiments with the second level of the ranking strategy, which exploits a term graph data structure to assess the importance of a document for the user query; documents are thus re-ranked according to the association and similarity that exist among the documents. The proposed algorithms achieve promising results within the top 10 search results.

Categories and Subject Descriptors


H.3.3 [Information Storage and Retrieval]: Text Mining, Information Search and Retrieval - Search process.

General Terms
Algorithms, Experimentation

Keywords
Term association graph, Document search, Re-ranking

1. INTRODUCTION
Information Retrieval (IR) [1][2] has become one of the dominant areas of research in web mining due to the growth and evolution of web documents. As the amount of information on the Web increases rapidly, it creates many new challenges for Web search. When a query is submitted by a user, a typical search engine returns a large set of results; users expect relevant documents within the first few pages of search results.
Most state-of-the-art retrieval techniques adopt the approach of transforming the document retrieval problem into a machine learning problem. Typically, documents are represented using the popular Vector Space Model (VSM) [14]. Intuitively, the documents are preprocessed in order to prepare a list of terms with corresponding term frequencies; in this way, a corresponding vector can be constructed to represent each document. Thus, a collection of documents is represented by a term-by-frequency matrix which can subsequently be interpreted as a structured relation.
Existing web search engines rank Web pages mainly based on keyword matching and hyperlink structures (e.g. authorities and hubs) [7][8]. Not much importance has been paid to measuring the informative value of Web pages. The following are techniques used to represent web documents for web IR.

1.1 Vector space model


In the vector space model [18], documents and queries are represented as vectors d_j = (w_{1j}, w_{2j}, ..., w_{tj}) and q = (w_{1q}, w_{2q}, ..., w_{tq}) respectively. Each dimension corresponds to a separate term; if a term occurs in the document, its value in the vector is nonzero. There are several methods of computing these term weights. One of the best-known schemes is term frequency-inverse document frequency (tf-idf) weighting. Vector operations can then be used to compare documents with queries. tf-idf is a numerical statistic which reflects how important a word is to a document in a collection or corpus. The Boolean frequency is defined as tf(t,d) = 1 if t occurs in d and 0 otherwise. A term occurring frequently in the document but rarely in the rest of the collection is given a high weight. Relevance rankings of documents in a keyword search can be calculated using the cosine similarity between the document vector D and the query vector Q, given in Equation (1).
Similarity = \cos(\theta) = \frac{D \cdot Q}{\|D\| \, \|Q\|} = \frac{\sum_{i=1}^{n} D_i \, Q_i}{\sqrt{\sum_{i=1}^{n} D_i^2} \; \sqrt{\sum_{i=1}^{n} Q_i^2}}    (1)

1.2 Best Matching 25 (BM25)


BM25 (Best Matching 25) [13] is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationships between the query terms within a document. Given a query Q containing keywords q_1, q_2, ..., q_n, the BM25 score of a document D is computed using Equation (2).

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{tf_i \, (k_1 + 1)}{tf_i + k_1 \left(1 - b + b \, \frac{|D|}{avgdl}\right)}    (2)
where tf_i is q_i's term frequency in document D, |D| is the length of document D in number of words, avgdl is the average document length in the dataset, and k_1 and b are tuning parameters.

IDF(q_i) = \log \frac{N - df_i + 0.5}{df_i + 0.5}    (3)

where N is the total number of documents in the dataset and df_i is the number of documents containing the query term q_i. k_1 has little effect on retrieval performance, and b is a document-length normalization parameter which is tuned to optimize retrieval performance by varying its value within the range 0 to 1.
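The following minimal sketch (an assumption of how Equations (2) and (3) can be realised, not code from the paper) scores one tokenised document against a query; k1 = 1.2 and b = 0.75 are common default settings.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score a document for a query using Equations (2) and (3)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)            # document frequency of q
        idf = math.log((N - df + 0.5) / (df + 0.5))      # Equation (3)
        denom = tf[q] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[q] * (k1 + 1) / denom          # Equation (2)
    return score

corpus = [["catalytic", "enzyme"], ["ethanol", "blood"], ["catalytic", "phosphate", "enzyme"]]
print(bm25_score(["catalytic", "phosphate"], corpus[2], corpus))
```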

1.3 Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA)
Two problems that arise when using the vector space model are synonymy, i.e. there are many ways to refer to the same object (e.g. car and automobile), which leads to poor recall, and polysemy, i.e. most words have more than one distinct meaning (e.g. model, python, and chip), which leads to poor precision.
LSI finds the hidden semantic meaning of terms based on their occurrences in documents. LSI is a technique that maps query terms and documents to a latent semantic space; comparing terms in this space makes synonymous terms look more similar. In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms, since co-occurring terms are projected onto the same dimensions. The dimensions of the reduced space correspond to the axes of greatest variation. pLSA evolved from LSA by adding a probabilistic model. LSA [19] and pLSA [20] are representative topic modelling approaches for identifying word/document relationships, whereas the work proposed here identifies word/word relationships. LSA does not address the problem of polysemy, as there is no active disambiguation. Polysemy refers to words with multiple meanings, whereas synonymy refers to separate words that have the same meaning.
pLSA is a generative probabilistic model which learns the parameters that best explain the data and uses the model to predict or infer new data based on the data already seen.
Procedure 1: pLSA
for each word of document d in the training set,
    Choose a topic z according to a multinomial distribution conditioned on d.
    Generate the word by drawing from a multinomial conditioned on z.
pLSA is not a proper generative model for new documents because each document is generated from its own mixture of topics; the number of parameters grows linearly with the size of the corpus, and it therefore becomes difficult to generate, or assign probabilities to, a new document.

1.4 Latent Dirichlet Allocation (LDA)

In LDA, a document is a mixture of topics as in pLSA, but the topic mixture is drawn from a Dirichlet prior α. When a uniform Dirichlet prior is used, pLSA and LDA become the same. A word is also generated according to another parameter β. α and β are corpus-level parameters sampled once per corpus: α is the parameter of the Dirichlet prior on the per-document topic distributions, i.e. it controls how much the prior scatters mass across the different topics z, and β is the parameter of the Dirichlet prior on the per-topic word distributions. θ is the topic distribution of a document. In order to compute the topics of a given document, it is essential to compute the posterior distribution of the hidden variables given the document.
Procedure 2: LDA
for each document,
    Choose θ ~ Dirichlet(α)
    for each of the N words w_n:
        Choose a topic z_n ~ Multinomial(θ)
        Choose a word w_n from p(w_n | z_n, β)
where p(w_n | z_n, β) is a multinomial probability conditioned on the topic z_n (Figure 1 shows the corresponding graphical model). Exact inference in LDA is computationally intractable.
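As an illustrative sketch (not part of the paper), scikit-learn's LatentDirichletAllocation approximates this posterior with variational inference; doc_topic_prior and topic_word_prior play the roles of α and β, and the toy corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "ribonuclease catalytic enzyme activity",
    "ethanol blood breath alcohol analysis",
    "catalytic phosphate enzymatic ethylation",
]  # toy corpus for illustration only

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(
    n_components=2,        # number of topics z
    doc_topic_prior=0.1,   # alpha: Dirichlet prior on per-document topic mixtures
    topic_word_prior=0.01, # beta: Dirichlet prior on per-topic word distributions
    random_state=0,
)
theta = lda.fit_transform(counts)  # estimated per-document topic distributions
print(theta.round(3))
```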

1.5 Term Graph Model


In this model, text documents are modeled as a graph whose vertices represent words and whose edges denote meaningful statistical (e.g. co-occurrence) or linguistic (e.g. grammatical) relationships between the words. There are different types of text graphs, including the thesaurus graph [22], the concept graph [21], and so on. A thesaurus graph denotes terms as vertices and sense relations, e.g. synonymy or antonymy, as edges. A concept graph denotes concepts as vertices and conceptual relations, e.g. hypernymy or hyponymy, as edges. The graph model proposed in [12] incorporates topological properties of the graph such as the degree distribution, average path length, and clustering coefficient. The degree distribution gives the probability that a randomly selected vertex v_i will have degree k = 1, 2, 3, ...; the degree of a vertex v_i is defined as the number of edges adjacent to v_i. The average path length is the average number of edges in the shortest path between any two vertices in a graph. The average clustering coefficient of the graph is computed by averaging the clustering coefficients of all vertices in the graph; the clustering coefficient of a vertex measures the proportion of its neighbors that are themselves neighbors. The average clustering coefficient can be used to identify connected graph partitions, i.e. it indicates the strength of connectivity within the graph. Two variants of text graphs have been presented in [12], namely the undirected co-occurrence text graph and the directed co-occurrence text graph with grammatical constraints. The major limitation of this approach is that it does not preprocess the documents before constructing the text graph. Moreover, these text graphs are document-based graphs, not collection-based graphs; the approach therefore constructs an individual graph for every text document, which may increase the computational complexity when it is used for re-ranking. The proposed approach thus constructs a collection-based graph, which is more appropriate for the document ranking process.
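To make the topological properties concrete, the sketch below (a hypothetical example with an invented co-occurrence graph) computes the degree distribution, average clustering coefficient, and average path length with networkx.

```python
import networkx as nx

# Invented co-occurrence text graph; edge weights stand for co-occurrence strength.
G = nx.Graph()
G.add_weighted_edges_from([
    ("ribonuclease", "catalytic", 0.12),
    ("catalytic", "enzyme", 0.096),
    ("ribonuclease", "anticodon", 0.10),
    ("anticodon", "trna", 0.10),
])

degrees = dict(G.degree())                   # degree k of every vertex
avg_clustering = nx.average_clustering(G)    # mean clustering coefficient
avg_path = (nx.average_shortest_path_length(G)
            if nx.is_connected(G) else float("inf"))   # average path length
print(degrees, avg_clustering, avg_path)
```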

2. RELATED WORKS

The goal of IR is to effectively retrieve documents relevant to users' queries. Many retrieval models and techniques have been proposed in the literature. Graph-based document ranking algorithms are mostly used to calculate term weights that represent the contribution of a term in a search context. In this research, variants of IR techniques utilizing graph representations are investigated.
Conceptual graph [21] is a network of concept nodes and
relation nodes. The concept nodes represent entities, attributes, or
events. The relation nodes identify the kind of relationship
between two concept nodes. The similarity between document and
query conceptual graph is measured as the relative size of each
one of their intersection graphs. The model presented in [26]
employs conceptual graphs to represent the text contents of
documents and query. The model compares two conceptual
graphs namely document graph and query graph by measuring
their similarity. This measure is suitable for IR since it considers
not only topical aspects of the phrases but also the relationships
among texts.
Thesaurus graph [22] is a graph representing semantic relations between terms. In a thesaurus graph, vertices denote terms and edges denote sense relations, e.g. synonymy or antonymy. In a variant of the concept graph called WordNet [23], vertices denote concepts and edges denote conceptual relations, e.g. hypernymy or hyponymy.
Syntactic-semantic association graphs [24] are graphs in which edge relations combine two or more different criteria, such as statistical and linguistic features, e.g. term frequency or rank in some semantic hierarchy.
The co-occurrence graph presented in [25] further refines the edge relation by defining co-occurrence either within a fixed window or within the same sentence.
The dependency-based semantic model proposed in [26] filters out meaningless edge relations under statistical linguistic interpretations. The vertices can also be weighted based on statistical or linguistic criteria.
The PageRank algorithm [7][8] is a technique for estimating importance scores for Web pages. The PageRank value is computed by weighting each hyperlink to a Web page proportionally to the quality of the Web page containing the hyperlink. The PageRank values are calculated recursively starting from arbitrary initial values. The PageRank algorithm computes the importance of a web page based on the linkage structure that exists among pages. This technique employs the global hyperlink structure of the Web in order to produce rankings of search results, but it does not consider the contents of the page. Many variants of the PageRank algorithm have been presented in the literature.
Usage based PageRank is a personalization algorithm which
combines usage data and link analysis techniques for ranking and
recommending web pages to the user. It produces personalized
navigational sub graph using web page structure and previously
recorded user sessions for ranking the web pages [27].
Topic sensitive PageRank Algorithm is a variant of original
PageRank algorithm for improving the ranking of search results.
The original PageRank algorithm presented in [8] computes a
single PageRank vector using the link structure of the Web in
order to capture the relative importance of Web pages,
independent of any particular search query. In order to yield more
accurate search results, the approach proposed in [28] computes a
set of PageRank vectors biased using a set of representative topics
in order to capture the notion of importance more accurately with
respect to a particular topic.
The Hyperlink-Induced Topic Search (HITS) algorithm presented in [29] addresses the abundance problem, i.e. the number of pages that could reasonably be relevant is too large for the user to digest when the system returns a very large set of results. This scheme
therefore assigns two scores for each page i.e. authority which
estimates the value of the content of the page and hub which
estimates the value of its links to other pages. Pages with high
authority scores are considered to have content relevant to the
query, whereas pages with high hub scores are considered to
contain links to relevant content. The intuition is that a page
which points to many other pages is a good hub, and a page that
many pages point to is a good authority. Good authorities are
those pointed to by good hubs, and good hubs are pointed to by
good authorities. It is an algorithmic formulation of the notion of
authority, based on the relationship between a set of relevant
authoritative pages and the set of hub pages that join them
together in the link structure.
TextRank [30] is a graph-based ranking model for natural
language text processing. The algorithm takes into account edge
weights while computing the score associated with a vertex in the
text graph.

3. TERM GRAPH REPRESENTATION


The term association graph model is an enhanced model of the VSM [14]. Traditional term weighting schemes such as Boolean weighting, term frequency (tf) weighting and term frequency-inverse document frequency (tf-idf) weighting assign a weight to each term according to its importance without considering term associations. Boolean weights, tf weights and tf-idf weights determine the weight of each term in a document independently. Thus, rich information, such as the relationships existing among the terms in a document, is not considered by traditional term weighting schemes. Hence, the term graph model proposed in [3] for text classification has been adopted here in order to solve the problem of retrieving relevant documents for a user query in IR.

3.1 Preprocessing
Extract all the terms from the collection of documents. Each text document in the corpus is treated as a transaction in which each word is an item, similar to a transaction in the Association Rule Mining (ARM) approach [4].

3.2 Graph model construction


The graph data structure reveals the important semantic relationships among the words of a document when the terms in the corpus are expressed as a graph model. Features of the documents are extracted using a data mining technique and represented in the term graph model.

3.2.1 Frequent item-set mining


After preprocessing, each document in the corpus is stored as a transaction (item-set) in which each term/concept (item) is represented numerically by a unique non-negative integer. Then the first step of the Apriori algorithm [6] alone is used to find all subsets of items that appear more often than a user-specified threshold (minimum support threshold) in the corpus.
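A minimal sketch of this first Apriori pass is given below (hypothetical code, restricted to item pairs for brevity; the paper mines item-sets of arbitrary length). The transactions are invented.

```python
from itertools import combinations
from collections import Counter

# Each document is a transaction: the set of its distinct terms (toy data).
transactions = [
    {"ribonuclease", "catalytic", "lysine", "phosphate"},
    {"ribonuclease", "adx", "mrna"},
    {"ribonuclease", "anticodon", "trna"},
    {"isozyme", "enzyme", "catalytic"},
]
min_support = 0.1
n = len(transactions)

# Pass 1: frequent single items.
item_counts = Counter(t for tx in transactions for t in tx)
frequent_items = {t for t, c in item_counts.items() if c / n >= min_support}

# Pass 2: frequent pairs, generated only from frequent items (Apriori pruning).
pair_counts = Counter()
for tx in transactions:
    for a, b in combinations(sorted(tx & frequent_items), 2):
        pair_counts[(a, b)] += 1
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)
```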

3.2.2 Term graph construction


In order to construct the graph from the set of frequent item-sets mined from the text collection, first create a node for each unique term that appears at least once in the frequent item-sets; second, create an edge between two nodes a and b if and only if they are both contained in one frequent item-set. The weight of the edge between a and b is the largest support value among all the frequent item-sets that contain both terms a and b.
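This construction can be sketched as follows (an illustrative reading of the rule above, with made-up item-sets and supports):

```python
import networkx as nx
from itertools import combinations

# Hypothetical frequent item-sets with their supports (same shape as Table 1).
frequent_itemsets = {
    frozenset({"ribonuclease", "catalytic", "lysine"}): 0.12,
    frozenset({"ribonuclease", "anticodon", "trna"}): 0.10,
    frozenset({"isozyme", "enzyme", "catalytic"}): 0.096,
}

TG = nx.Graph()
for itemset, support in frequent_itemsets.items():
    TG.add_nodes_from(itemset)                    # a node for every unique term
    for a, b in combinations(itemset, 2):
        # edge weight = largest support among item-sets containing both terms
        prev = TG.edges[a, b]["weight"] if TG.has_edge(a, b) else 0.0
        TG.add_edge(a, b, weight=max(prev, support))

print(TG.number_of_nodes(), TG.number_of_edges())
```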

Typically, after a query is submitted to a medical information database or search engine, a list of documents is returned to the user. In this work, a document denotes the title and abstract of a medical journal article returned by the search engine. It is assumed that if a keyword occurs frequently in the documents retrieved for a particular query, those documents are close, in terms of similarity, to the query. Thus, the support metric [4] is employed to measure the interestingness of a particular keyword t extracted from the document for the query, as defined in Equations (4) and (5).

Support_d(t_i) = \frac{\sum_{i=1}^{n} f_d(t_i)}{\sum_{j=1}^{N} \sum_{i=1}^{n} f_{d_j}(t_i)}    (4)

f_d(t_i) = \frac{term\_frequency_d(t_i)}{MAX(term\_frequency_d(t_i))}    (5)

where f_d(t_i) is the support of the term t_i, defined from the frequency of the term/concept t_i in the document d, n is the number of terms in an item-set, and N is the number of frequent item-sets, i.e. the number of documents returned for the query. The minimum support threshold of a term is assumed to be 0.1 in this paper.
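The sketch below shows one plausible way to evaluate Equations (4) and (5) for a candidate term over the documents returned for a query; the term frequencies are invented and this reading of the equations is an assumption.

```python
# Hypothetical evaluation of Equations (4) and (5) for one term over returned documents.
returned_docs = [
    {"ribonuclease": 4, "catalytic": 3, "lysine": 1},   # term -> raw frequency per document
    {"ribonuclease": 2, "anticodon": 2, "trna": 5},
    {"isozyme": 3, "enzyme": 1, "catalytic": 2},
]

def f_d(term, doc):
    """Equation (5): frequency of the term normalised by the document's maximum frequency."""
    return doc.get(term, 0) / max(doc.values())

def support(term, docs):
    """One reading of Equation (4): the term's normalised frequency summed over the
    returned documents, relative to the normalised frequencies of all terms."""
    numerator = sum(f_d(term, d) for d in docs)
    denominator = sum(f_d(t, d) for d in docs for t in d)
    return numerator / denominator

print(support("ribonuclease", returned_docs))
```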

Table 1 shows sample frequent item-sets and the corresponding term graph for the query "Ribonuclease". Before the items are extracted, stop words such as "the", "of", "to", etc. are removed from the documents. The maximum length of an item-set is limited to twenty words; thereby the extraction of meaningless terms is avoided and the computational time is reduced. The length of the item-set may be increased for large-scale systems. The graph is constructed using the benchmark TREC-9 Filtering track dataset [9], which is a medical information database.

Table 1. Frequent item-sets and their corresponding document support values

Doc ID    Item-set                                                              Support
62920     {Ribonuclease, catalytic, lysine, phosphate, enzymatic, ethylation}   0.12
54711     {Ribonuclease, Adx, glucocorticoids, chymotrypsin, mRNA}              0.2
55199     {Ribonuclease, anticodon, alanine, tRNA}                              0.1
64711     {Cl- channels, catalytic, Monophosphate, cells}                       0.072
65118     {isozyme, enzyme, aldehyde, catalytic}                                0.096

Figure 2. The term association graph representation for the frequent item-sets shown in Table 1.

4. TERM GRAPH BASED DOCUMENT RE-RANKING

The term graph model reveals richer information, i.e. the associations between terms that exist across different documents. The term graph model presented in this paper is different from the one proposed in [3]: the terms, i.e. the nodes, are selected based on a novel support metric similar to [10], shown in Equation (4). Two approaches are presented: first, a Term Rank based document re-ranking method, and second, a Term Distance matrix based document re-ranking method. Their detailed descriptions are given in the following sub-sections.


4.1 Term Rank based approach (TRM)


When searching for information on the WWW, the user enters a search query into a search engine; the engine returns a list of web pages or web snippets as the search result, which is usually a huge set, so ranking these web pages is a very significant task. Since much information is contained in the link structure of the WWW, information such as which pages link to others can be used to supplement search algorithms. This is the motivation behind the PageRank algorithm proposed in [7][8].
Intuitively, every web page has some number of forward links (out-edges) and backward links (in-edges). Web pages vary greatly in the number of back-links they have. For example, a Wikipedia page has many more back-links than typical web pages, which have very few. Typically, highly linked pages are more important than pages with few links, and back-links coming from important pages confer more importance on a page. For example, if a web page has a link from the Google home page, it may be just one link, but it is a very important one.
The notion of PageRank is employed in this approach: PageRank scores are computed for the nodes in the term graph. If a word appears frequently together with many other words in the corpus, it is an important word; words that appear together with some important words may also be important. Since the PageRank algorithm assumes a directed, un-weighted graph as input, the term graph shown in Figure 2 is transformed into a directed graph structure. The rank of each term is then computed using Equation (6).

Rank(t_a) = c \sum_{t_b \in T_a} \frac{Rank(t_b)}{N_{t_b}}    (6)

where t_a and t_b are terms (nodes), T_a is the set of terms that point to t_a, T_b is the set of terms that t_b points to, N_{t_b} = |T_b| is the number of links going out of t_b, and c is a normalization factor. This equation is applied recursively to assign a rank to each node in the graph: it is computed by starting from a set of initial ranks and iterating the computation until it converges. If two terms point to each other but to no other terms, this loop will accumulate rank but never distribute rank to other terms during the iteration.
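A small sketch of this computation (illustrative, not the authors' code) uses networkx: the undirected term graph is converted to a directed graph and ranks are obtained by power iteration; note that networkx's pagerank also applies the usual damping factor, a slight variation on the plain Equation (6).

```python
import networkx as nx

# Toy term graph; in practice this is the collection-based graph built in Section 3.
TG = nx.Graph([("ribonuclease", "catalytic"), ("catalytic", "enzyme"),
               ("ribonuclease", "anticodon"), ("anticodon", "trna")])
DG = TG.to_directed()            # each undirected edge becomes two opposite arcs

term_rank = nx.pagerank(DG, alpha=0.85)   # iterative rank computation (with damping)
top_terms = sorted(term_rank, key=term_rank.get, reverse=True)
print(top_terms[:3], term_rank[top_terms[0]])
```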

4.1.1 Method 1
Once the term rank is computed for all the terms in the graph, the documents retrieved for the initial query are re-ranked using the following procedure.

- List the words that are linked with the query-term node in decreasing order of TermRank value. A term depth may also be specified in order to control the size of the resultant document set; the number of results is directly proportional to the specified term depth. If the depth is low, the user will get more specific results; if the depth is high, the coverage of the results will be higher, but more general results will be generated.
- Assume the top k terms (k = 10, 15, 20, ...) and identify the documents which contain these top words.
- Order the documents in descending order of the TermRank associated with the top k terms.
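One plausible rendering of this procedure is sketched below (a hypothetical helper, not the authors' implementation); term_rank and the directed graph DG are assumed to come from the previous step, and each retrieved document is a (doc_id, set-of-terms) pair.

```python
def rerank_method1(query_term, term_rank, graph, documents, k=10):
    """Re-rank documents by the TermRank of the query term's top-k neighbouring terms."""
    neighbours = sorted(graph[query_term], key=term_rank.get, reverse=True)[:k]
    top_terms = set(neighbours)

    def doc_score(terms):
        # documents containing higher-ranked top terms float to the front
        return sum(term_rank[t] for t in top_terms & set(terms))

    return sorted(documents, key=lambda d: doc_score(d[1]), reverse=True)

docs = [("d1", {"catalytic", "enzyme"}), ("d2", {"anticodon", "trna"})]
# reranked = rerank_method1("ribonuclease", term_rank, DG, docs)
```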

4.1.2 Method 2
Given a term graph with a TermRank associated with each node, the Spearman Correlation Coefficient (SCC) is employed in order to compute the linear relationship between two sets of ranked terms. Dominantly correlated sets of terms are thus considered to re-rank the documents.

SCC = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (7)

List the words that are linearly linked with the query term in decreasing order of TermRank value for the specified depth i, where i = 1, ..., N. Find as many such linear lists with non-overlapping terms as possible, and then compute the SCC between these linear lists of ranked terms. Additional relevant documents can thereby be accumulated whenever a list of ranked terms is positively correlated with the list that possesses the highest TermRank value. The SCC is computed as Spearman's rank correlation coefficient using Equation (7).
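A quick sketch of Equation (7) with SciPy is shown below (toy TermRank values, invented for illustration):

```python
from scipy.stats import spearmanr

# TermRank values of candidate terms along two linear lists (toy numbers).
list_a = [0.31, 0.24, 0.18, 0.11, 0.07]
list_b = [0.29, 0.22, 0.20, 0.09, 0.08]

scc, p_value = spearmanr(list_a, list_b)   # Equation (7) on the ranked values
if scc > 0:
    # positively correlated lists contribute their documents to the re-ranked result
    print("merge documents from both lists; SCC =", round(scc, 3))
```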

4.2 Term Distance based approach (TDM)

The term distance based similarity function employed in this approach has been adopted from [3], wherein it is applied to a text classification task. Intuitively, the distance between two terms in the term graph TG reveals significant association between the terms. If the terms appear more frequently in different documents in the corpus, then those terms have a higher probability of being connected with many other terms in TG.
Given a term graph TG, the term distance matrix is constructed. If TG has N terms, then the size of the term distance matrix TDM_N is N x N. Each element TDM_N[t_i][t_j] holds the least number of hops between terms t_i and t_j. TDM_N is an adjacency-style matrix, shown in Figure 3.
The TDM_N is employed to define the similarity of the documents to a query term. For example, if the query term is assumed to be T5, then documents which contain the terms T1, T6, T7, T8, ..., T12, T17, T18, T19 would be assumed to be more relevant to the query. Since the distance between the query term and these document terms is smallest (T6, T7, T8, ..., T12 are directly connected with T5), documents which contain these terms are ranked higher.
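The matrix can be derived directly from the term graph with a breadth-first shortest-path computation, as in the hypothetical sketch below (toy terms T1, T5, T6, T7):

```python
import networkx as nx
import numpy as np

TG = nx.Graph([("T1", "T5"), ("T5", "T6"), ("T5", "T7"), ("T6", "T7")])  # toy graph
terms = list(TG.nodes)
hops = dict(nx.all_pairs_shortest_path_length(TG))   # least number of hops between terms

N = len(terms)
TDM = np.zeros((N, N), dtype=int)
for i, ti in enumerate(terms):
    for j, tj in enumerate(terms):
        TDM[i, j] = hops[ti].get(tj, -1)   # -1 marks unreachable term pairs
print(terms)
print(TDM)
```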

Figure 3. Term Distance Matrix (TDM_N) of the graph shown in Figure 2. Each entry holds the least number of hops between the corresponding pair of terms; the full matrix (one row and one column per term, T1 through T19) is not reproduced here.

5. EXPERIMENTAL EVALUATION
The approaches proposed in this work focus on query-centric re-ranking of search results. Typically, the input query issued by the user consists of keywords; the user does not have a specific page in mind but intends to find documents related to a topic or concept. The proposed approaches have been evaluated using a synthetic dataset.

5.1 Dataset description


The OHSUMED document collection [9][17], which was used for the TREC-9 Filtering Track, has been employed in this work. The OHSUMED test collection is a set of 348,566 document references from the MEDLINE on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a period of 5 years, from 1987 to 1991. The available fields are title, abstract, indexing terms, and author.
Table 2. Statistics about the corpus considered for experimentation

Document Corpus: Synthetic dataset (OHSUMED)
# of docs: 348,566
# of queries: 106
Avg. doc. length: 210
Avg. doc. length after preprocessing: 64

5.2 Evaluation metrics


The re-ranking algorithms proposed in this work have been
evaluated using a variety of accepted IR metrics [1][2][11].

5.2.1 Precision
This measures the accuracy of the retrieved results. Precision defines the fraction of retrieved documents that are labeled as relevant, i.e. the documents ranked in the top n results that are found to be relevant. If irrelevant documents appear within the top k, user satisfaction drops, so P@k measures the user's satisfaction with the top k results.

P@k = \frac{\text{\# of relevant documents retrieved among the top } k}{k}    (8)

5.2.2 Recall
This measures the coverage of the relevant documents in the retrieved results. Recall defines the fraction of relevant documents that are retrieved.

R@k = \frac{\text{\# of relevant documents retrieved among the top } k}{\text{total \# of relevant documents}}    (9)

5.2.3 Mean Reciprocal Rank (MRR)
The mean reciprocal rank is the average of the reciprocal ranks of the results over a sample of queries Q. If the first relevant document for a query occurs at rank position k, its reciprocal rank score is 1/k. MRR is the mean of the reciprocal rank across multiple queries, given by Equation (10).

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}    (10)

5.2.4 F-Measure
This measure combines precision (P) and recall (R), i.e. the harmonic mean of precision and recall. Recall and precision are evenly weighted by the F-score.

F = \frac{2 \cdot P \cdot R}{P + R}    (11)

5.2.5 Mean Average Precision (MAP)
The most commonly used document ranking measure is MAP, which is the mean of the Average Precision (AP) over all queries.

AP = \frac{1}{R} \sum_{k} P@k \cdot rel(k)    (12)

where rel(k) = 1 if the k-th ranked document is relevant and rel(k) = 0 if it is not, R is the total number of relevant documents in the collection, and the sum runs over the positions k of the ranked result list. MAP for a set of queries is the mean of the precision scores for each query, computed at each position after a relevant document is retrieved. This metric is used as a single-value summary of a retrieval process over a set of queries Q. MAP is computed using Equation (13).

MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} AP(q)    (13)
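For reference, the sketch below (illustrative code, not from the paper) computes P@k, AP, and MRR from a binary relevance list; the judgments are invented and each run is assumed to contain at least one relevant result.

```python
def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k                             # Equation (8)

def average_precision(relevance, total_relevant):
    hits = [precision_at_k(relevance, k)
            for k in range(1, len(relevance) + 1) if relevance[k - 1]]
    return sum(hits) / total_relevant                         # Equation (12)

def mean_reciprocal_rank(runs):
    # Equation (10); assumes every run contains at least one relevant document
    return sum(1.0 / (run.index(1) + 1) for run in runs) / len(runs)

run = [1, 0, 1, 1, 0]        # toy top-5 relevance judgments for one query
print(precision_at_k(run, 5), average_precision(run, total_relevant=4),
      mean_reciprocal_rank([run]))
```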

5.3 Baselines for Comparison


The state-of-the-art bag-of-words models, i.e. the tf-idf based vector space model, Okapi BM25, and the language model, have been taken as the baseline systems. In addition to these, K-means clustering and affinity graph ranking approaches have also been used for comparison against the results produced by the proposed re-ranking approaches.

5.3.1 K-means clustering algorithm
K-means is used to re-rank the top results: K is chosen to be 10, and the top document from each cluster is used to construct the top 10 results. The k-means partition-based clustering algorithm [16] typically attempts to minimize the distance between documents in the same cluster, i.e. if D(d_1, d_2, ..., d_n) are the n documents and C(c_1, c_2, ..., c_k) are the k cluster centroids, then k-means tries to minimize the objective shown in Equation (14).

\sum_{i=1}^{k} \sum_{j=1}^{n} similarity(d_j, c_i)    (14)

5.3.2 Affinity graph ranking
In this approach [15], the document collection is modeled as a graph by generating links between documents. The affinity of d_i to d_j, which is similar to cosine similarity, defines the similarity between each document pair as shown in Equation (15).

affinity(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|}    (15)

The affinity graph models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. The affinity measure is asymmetric because affinity(d_i, d_j) ≠ affinity(d_j, d_i), whereas cosine similarity is symmetric. In the affinity graph model, documents are considered as nodes, and the collection is modeled as a graph in which a directional link from d_i to d_j (i ≠ j) with weight affinity(d_i, d_j) is constructed if affinity(d_i, d_j) ≥ affinity_threshold; otherwise no link is constructed. The affinity ranking scheme then re-ranks the top documents returned by the baseline system.
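A small sketch of Equation (15) and the link-construction rule follows (invented tf-idf vectors; the threshold value is an assumption):

```python
import numpy as np

# Rows of D are tf-idf document vectors (toy numbers).
D = np.array([[0.4, 0.1, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.6, 0.2]])
norms = np.linalg.norm(D, axis=1)
affinity = (D @ D.T) / norms[:, None]   # Equation (15): affinity(d_i, d_j) = d_i . d_j / ||d_i||
np.fill_diagonal(affinity, 0.0)

affinity_threshold = 0.3                # assumed value for illustration
links = affinity >= affinity_threshold  # keep a directed link i -> j only above the threshold
print(affinity.round(2))
print(links)
```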

5.4 Evaluation on Synthetic data set


Queries were issued to retrieve documents from the OHSUMED collection [9], and the proposed methods were then applied to re-rank the top 20 documents. The dataset contains associated documents for 106 queries; each query describes a medical search need. The relevance of the documents with respect to the queries was judged by humans on a three-point scale: highly relevant, moderately relevant, and not relevant. The dataset consists of around 16,140 query-document pairs with relevance judgments. The sample set of queries used for experimentation is given in Table 3.
Table 3. Sample test queries and the expected documents to be retrieved

Test query     Expected Documents
Phosphate      Catalytic activity
Alcolmeter     Blood ethanol concentration measures
Ethanol        Alcohol, Breath analysis and tests
Morphine       Narcotic Syndromes
Endorphin      Blood-Brain Barrier due to Alcoholism
Alcoholism     Brain syndromes, Treatments
Erythrocyte    State of Hemoglobin on narcotic consumption
Serotonin      Brain test during sleep and depression
Platelets      Measure of platelet affinity, Treatments
Abstinent      Appetite for food or drink, insulin tests
Precision at various recall points is estimated and plotted in Figure 4. It is observed that the proposed approaches outperform the baseline systems.

Figure 4. Precision-Recall curve obtained for 10 queries


Figure 5 illustrates the improvement in MRR for the sample test queries; over 21% improvement is achieved within the top 10 result positions after re-ranking.

Figure 5. MRR Evaluation for 10 queries


Figure 6 shows the results of the proposed TRM1, TRM2, and TDM schemes, which improve the coverage of results within the top 10 compared to the baseline BM25, clustering, and affinity graph ranking approaches on the OHSUMED medical document dataset.

Figure 6. F-Measure Evaluation for 10 queries

Figure 7. MAP Evaluation for 10 queries

Figure 7 shows the MAP improvement at the top 10 positions, computed after each relevant document is retrieved, over the 10 queries. Comparing the proposed Term Rank based methods, TRM1 and TRM2, with the Term Distance method TDM, TRM1 outperforms the other methods.

5.5 Result Analysis
The graph representation of text estimates the associations between terms in order to compute term ranks, i.e. term importance. The computed term rank associated with each term is further used to judge documents for relevance. The evaluation on OHSUMED shows that the term association model improves retrieval performance by identifying highly relevant documents for the query. The term association graph representation can also be used to suggest related keywords for enhancing a search even before the documents are retrieved. The same graph-based term representation model could thus be adapted to represent a user's implicit interests in order to rank documents customized to the individual user.

6. CONCLUSION
An implementation of algorithms that exploit the term association graph model has been proposed for efficient retrieval of information from a large corpus of text documents. This paper discussed the challenges present in state-of-the-art IR systems and presented three methods to enhance the document re-ranking task in order to meet the information need of the user. The proposed methods capture hidden semantic associations and improve document representation and retrieval by incorporating the term associations available within the documents into the retrieval process; thus, the effectiveness of information retrieval is enhanced. The implementation results show that the proposed algorithms improve retrieval performance in terms of accuracy and coverage.
It is inferred that a gap still exists in identifying the most relevant information that is of interest to the user. The work presented in this paper may therefore be extended in the following directions: (1) the reputation of a word alone does not guarantee the desired information for the searcher, so a user-specific relevance factor may be considered; and (2) the proposed approaches do not consider the user's search preferences, so a personalized search feature may be enabled in order to organize the search results to be more appropriate and relevant to the user.

7. ACKNOWLEDGMENTS

The work presented in this paper is supported and funded by the Department of Science and Technology (DST), Ministry of Science and Technology, Government of India, under the INSPIRE scheme. The authors wish to extend their thanks to DST, Government College of Technology Coimbatore, and the anonymous reviewers for their helpful comments, which improved the presentation of this paper.

8. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, (1999).
[2] Christopher D. Manning, Prabhakar Raghavan, Hinrich
Schutze. Introduction to Information Retrieval, Cambridge
University Press, 2008.



[3] Wei Wang, Diep Bich Do, and Xuemin Lin. Term Graph Model for Text Classification. In Proc. Springer LNAI, pp. 19-30, 2005.
[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second edition. Elsevier/Morgan Kaufmann, 2006.
[5] Bing Liu, Chee Wee Chin, and Hwee Tou Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the 12th ACM International Conference on World Wide Web, pp. 251-260, 2003.
[6] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. of 20th Intl. Conf. on VLDB, 1994.
[7] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of 7th Intl. Conf. WWW, pp. 107-117, 1998.
[8] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford CS Technical Report, 1998.
[9] Hersh WR, Buckley C, Leone TJ, and Hickam DH. OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference, pp. 192-201, 1994.
[10] Kenneth Wai-Ting Leung and Dik Lun Lee. Deriving Concept-based User Profiles from Search Engine Logs. IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 7, pp. 969-982, July 2010.
[11] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proc. SIGIR, pp. 41-48, 2000.
[12] Roi Blanco and Christina Lioma. Graph-based term weighting for information retrieval. Springer Information Retrieval, Volume 15, Issue 1, pp. 54-92, February 2012.
[13] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, A. Gull, and M. Lau. Okapi at TREC. In Proc. Text REtrieval Conference, pp. 21-30, 1992.
[14] S. K. M. Wong and Vijay V. Raghavan. Vector space model of Information Retrieval: A reevaluation. In Proc. SIGIR, pp. 167-185, ACM, 1984.
[15] Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, and Wei-Ying Ma. Improving web search results using affinity graph. In Proc. SIGIR, pp. 504-511, ACM, 2005.
[16] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[17] Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval. In Proc. SIGIR, pp. 3-10, ACM, 2007.
[18] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Vol. 18, Issue 11, pp. 613-620, 1975.
[19] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, Issue 6, pp. 391-407, 1990.
[20] Thomas Hofmann. Probabilistic latent semantic indexing. In Proc. ACM SIGIR, pp. 50-57, 1999.
[21] M. Montes-y-Gomez, A. López-López, and A. Gelbukh. Information retrieval with conceptual graph matching. In Proc. 12th Intl. Conf. on Database and Expert Systems Applications, Springer LNCS, Volume 1873, pp. 312-321, 2000.
[22] Leicht, E. A., Holme, P., and Newman, M. E. J. Vertex similarity in networks. Physical Review E, 2006.
[23] Sigman, M. and Cecchi, G. A. Global organization of the WordNet lexicon. Proceedings of the National Academy of Sciences, 99(3), pp. 1742-1747, 2002.
[24] Nastase, V., Sayyad-Shirabad, J., Sokolova, M., and Szpakowicz, S. Learning noun-modifier semantic relations with corpus-based and WordNet-based features. In American Association for Artificial Intelligence, pp. 781-786, 2006.
[25] A. P. Masucci and G. J. Rodgers. Network properties of written human language. Physical Review E, 74(2), 2006.
[26] Pado, S. and Lapata, M. Dependency-based construction of semantic space models. Computational Linguistics, 33(2), pp. 161-199, 2007.
[27] Magdalini Eirinaki and Michalis Vazirgiannis. UPR: Usage-based page ranking for web personalization. In Proc. 5th IEEE Intl. Conf. on Data Mining (ICDM), pp. 130-137, 2005.
[28] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4), pp. 784-796, 2003.
[29] J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5), pp. 604-632, 1999.
[30] R. Mihalcea and P. Tarau. TextRank: Bringing Order into Texts. In Proceedings of Empirical Methods in Natural Language Processing, ACL, pp. 404-411, 2004.
