
Modern Information Retrieval

Chapter 5 Query Operations

Presenter: 林秉儀
Student ID: 89522022
Introduction

• It is difficult to formulate queries which are well designed for retrieval purposes.
• The initial query formulation can be improved through query expansion and term reweighting.
• Approaches based on:
– feedback information from the user
– information derived from the set of documents initially retrieved (called the local set of documents)
– global information derived from the document collection
User Relevance Feedback

• The user is presented with a list of the retrieved documents and, after examining them, marks those which are relevant.
• Two basic operations:
– Query expansion: addition of new terms taken from relevant documents
– Term reweighting: modification of term weights based on the user's relevance judgement
User Relevance Feedback (cont’d)

• User relevance feedback is used to:
– expand queries with the vector model
– reweight query terms with the probabilistic model
– reweight query terms with a variant of the probabilistic model
Vector Model

• Define:
– Weight:
Let $k_i$ be a generic index term in the set $K = \{k_1, \ldots, k_t\}$.
A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
– Document index term vector:
the document $d_j$ is associated with an index term vector $\vec{d}_j$ represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Vector Model (cont’d)

• Define (from Chapter 2):
– term weighting: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
– normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
where $freq_{i,j}$ is the raw frequency of $k_i$ in the document $d_j$
– inverse document frequency for $k_i$: $idf_i = \log \frac{N}{n_i}$
– query term weight: $w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}$
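As a minimal sketch of the definitions above, the following Python functions compute the document weight $w_{i,j}$ and the query weight $w_{i,q}$ from raw term frequencies; the dict-based representation of documents is an assumption made here for illustration, not part of the model:

```python
import math

def tf_idf_weights(doc_freqs, N, n):
    """Document weight w_{i,j} = f_{i,j} * log(N / n_i).

    doc_freqs: dict term -> raw frequency freq_{i,j} in document d_j
    N: total number of documents in the collection
    n: dict term -> number of documents containing the term (n_i)
    """
    max_freq = max(doc_freqs.values())
    return {t: (f / max_freq) * math.log(N / n[t])
            for t, f in doc_freqs.items()}

def query_weights(query_freqs, N, n):
    """Query weight w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log(N / n_i)."""
    max_freq = max(query_freqs.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / n[t])
            for t, f in query_freqs.items()}
```

Note that the 0.5 offset in the query formula keeps every query term's normalized frequency in the range [0.5, 1].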
Vector Model (cont’d)

• Define:
– query vector: the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
– $D_r$: set of relevant documents identified by the user
– $D_n$: set of non-relevant documents among the retrieved documents
– $C_r$: set of relevant documents among all documents in the collection
– $\alpha, \beta, \gamma$: tuning constants
Query Expansion and Term Reweighting for the Vector Model

• Ideal case
$C_r$: the complete set of relevant documents for a given query $q$
– the best query vector is given by
$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
• The relevant documents $C_r$ are not known a priori; they are precisely what we are looking for.
Query Expansion and Term Reweighting for the Vector Model (cont’d)

• Three classic ways to calculate the modified query $\vec{q}_m$:
– Standard_Rocchio: $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Regular: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Dec_Hi: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, \max_{non\text{-}relevant}(\vec{d}_j)$
• $D_r$ and $D_n$ are the document sets judged relevant and non-relevant by the user.
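The Standard_Rocchio update above can be sketched as follows; vectors are represented as sparse dicts, and the default constants (1.0, 0.75, 0.15) are common choices from the literature, not values fixed by the slides:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard_Rocchio: q_m = alpha*q + (beta/|Dr|) * sum_{dj in Dr} dj
                                       - (gamma/|Dn|) * sum_{dj in Dn} dj.

    query: dict term -> weight; relevant/nonrelevant: lists of such dicts.
    Components driven negative are dropped, as is common in practice.
    """
    qm = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        c = coef / len(docs)  # per-document contribution
        for d in docs:
            for t, w in d.items():
                qm[t] = qm.get(t, 0.0) + c * w
    return {t: w for t, w in qm.items() if w > 0}
```

Terms that appear only in relevant documents enter the modified query, which is exactly how query expansion arises from this formula.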
Term Reweighting for the Probabilistic Model

• Similarity in the vector model: the correlation between the vectors $\vec{d}_j$ and $\vec{q}$, quantified as
$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}$
• The probabilistic model ranks documents according to the probabilistic ranking principle.
– $P(k_i | R)$: the probability of observing the term $k_i$ in the set $R$ of relevant documents
– $P(k_i | \bar{R})$: the probability of observing the term $k_i$ in the set $\bar{R}$ of non-relevant documents
Term Reweighting for the Probabilistic Model (cont’d)

• The similarity of a document $d_j$ to a query $q$ can be expressed as
$sim(d_j, q) \sim \sum_i w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$
• For the initial search
– the equation above is estimated under the following assumptions:
$P(k_i|R) = 0.5 \qquad P(k_i|\bar{R}) = \frac{n_i}{N}$
where $n_i$ is the number of documents which contain the index term $k_i$
– which gives
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \frac{N - n_i}{n_i}$
Term Reweighting for the Probabilistic Model (cont’d)

• For the feedback search
– $P(k_i|R)$ and $P(k_i|\bar{R})$ can be approximated as:
$P(k_i|R) = \frac{|D_{r,i}|}{|D_r|} \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$
where $D_r$ is the set of relevant documents according to the user judgement,
and $D_{r,i}$ is the subset of $D_r$ composed of the documents containing the term $k_i$
– The similarity of $d_j$ to $q$ becomes:
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left( \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \times \frac{N - |D_r| - (n_i - |D_{r,i}|)}{n_i - |D_{r,i}|} \right)$
• Note that no query expansion occurs in this procedure; only the query term weights change.
Term Reweighting for the Probabilistic Model (cont’d)

• Adjustment factor
– Because $|D_r|$ and $|D_{r,i}|$ are usually small, a 0.5 adjustment factor is added to $P(k_i|R)$ and $P(k_i|\bar{R})$:
$P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
– An alternative is the adjustment factor $n_i/N$:
$P(k_i|R) = \frac{|D_{r,i}| + \frac{n_i}{N}}{|D_r| + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + \frac{n_i}{N}}{N - |D_r| + 1}$
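The two smoothed estimates above can be sketched in a single helper; the `adjust` parameter name and the string labels are illustrative choices, not terminology from the slides:

```python
def estimate_probs(Dr, Dri, N, ni, adjust="half"):
    """Estimate P(ki|R) and P(ki|R_bar) from feedback counts, with either
    the 0.5 adjustment ('half') or the ni/N adjustment ('idf').

    Dr:  number of documents judged relevant (|Dr|)
    Dri: number of those containing term ki (|Dr,i|)
    N:   collection size; ni: document frequency of ki
    """
    a = 0.5 if adjust == "half" else ni / N
    p_rel = (Dri + a) / (Dr + 1)
    p_nonrel = (ni - Dri + a) / (N - Dr + 1)
    return p_rel, p_nonrel
```

The adjustment keeps both probabilities strictly between 0 and 1 even when $|D_{r,i}| = 0$ or $|D_{r,i}| = |D_r|$, which would otherwise make the log-odds in the similarity formula undefined.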
A Variant of Probabilistic Term Reweighting

• In 1983, Croft extended the above weighting scheme by suggesting distinct initial search methods and by adapting the probabilistic formula to include within-document frequency weights.
• The variant of probabilistic term reweighting:
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \, F_{i,j,q}$
where $F_{i,j,q}$ is a factor which depends on the triple $[k_i, d_j, q]$.
A Variant of Probabilistic Term Reweighting (cont’d)

• Distinct formulations are used for the initial search and for feedback searches.
– initial search:
$F_{i,j,q} = C + idf_i \times f_{i,j} \qquad f_{i,j} = K + (1 - K) \frac{freq_{i,j}}{\max_l freq_{l,j}}$
where $f_{i,j}$ is a normalized within-document frequency;
$C$ and $K$ should be adjusted according to the collection.
– feedback searches:
$F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) f_{i,j}$
Automatic Local Analysis

• Clustering: the grouping of documents which satisfy a set of common properties.
• Attempting to obtain a description for a larger cluster of relevant documents automatically: identify terms which are related to the query terms, such as:
– synonyms
– stemming variations
– terms with a distance of at most k words from a query term
Automatic Local Analysis (cont’d)

• The local strategy examines, at query time, the documents retrieved for a given query q to determine terms for query expansion.
• Two basic types of local strategy:
– Local clustering
– Local context analysis
• Local strategies are suited to intranet environments rather than to web documents.
Query Expansion Through Local Clustering

• Local feedback strategies expand the query with terms correlated to the query terms.
• Such correlated terms are those present in local clusters built from the local document set.
Query Expansion Through Local Clustering (cont’d)

• Definition:
– Stem:
Let V(s) be a non-empty set of words which are grammatical variants of each other. A canonical form s of V(s) is called a stem.
Example: if V(s) = {polish, polishing, polished} then s = polish
– $D_l$: the local document set, i.e. the set of documents retrieved for a given query q
• Strategies for building local clusters:
– Association clusters
– Metric clusters
– Scalar clusters
Association Clusters

• An association cluster is based on the co-occurrence of stems inside documents.
• Definition:
– $f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j$, $d_j \in D_l$
– Let $\vec{m} = (m_{ij})$ be an association matrix with $|S_l|$ rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$
– The matrix $\vec{s} = \vec{m} \vec{m}^t$ is a local stem-stem association matrix.
– Each element $s_{u,v}$ in $\vec{s}$ expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$
Association Clusters (cont’d)

• The correlation factor $c_{u,v}$ quantifies the absolute frequencies of co-occurrence.
– Unnormalized association matrix $\vec{s}$:
$s_{u,v} = c_{u,v}$
– Normalized:
$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
Association Clusters (cont’d)

• To build local association clusters:
– Consider the u-th row in the association matrix $\vec{s}$.
– Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$.
– Then $S_u(n)$ defines a local association cluster around the stem $s_u$.
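The construction above can be sketched directly from stem frequencies; the list-of-dicts document representation is an assumption made for the example:

```python
def association_matrix(freqs, normalized=False):
    """Stem-stem correlations c_{u,v} = sum_j f_{su,j} * f_{sv,j} over the
    local document set; optionally normalized by c_{u,u} + c_{v,v} - c_{u,v}.

    freqs: list of dicts, one per local document, mapping stem -> frequency.
    Returns the sorted stem list and a dict (u, v) -> correlation.
    """
    stems = sorted({s for d in freqs for s in d})
    c = {(u, v): sum(d.get(u, 0) * d.get(v, 0) for d in freqs)
         for u in stems for v in stems}
    if normalized:
        c = {(u, v): cv / (c[(u, u)] + c[(v, v)] - cv)
             for (u, v), cv in c.items()}
    return stems, c

def association_cluster(stems, c, u, n):
    """S_u(n): the n stems v != u with the largest correlation to u."""
    others = [v for v in stems if v != u]
    return sorted(others, key=lambda v: c[(u, v)], reverse=True)[:n]
```

The normalized form is a Jaccard-like coefficient: it equals 1 when two stems always co-occur with equal frequencies and shrinks toward 0 as their occurrences diverge.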
Metric Clusters

• Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
• It might be worthwhile to factor in the distance between two terms in the computation of their correlation factor.
Metric Clusters (cont’d)

• Let $r(k_i, k_j)$ be the distance between two keywords $k_i$ and $k_j$ in the same document.
• If $k_i$ and $k_j$ are in distinct documents, we take $r(k_i, k_j) = \infty$.
• A local stem-stem metric correlation matrix $\vec{s}$ is defined such that each element $s_{u,v}$ expresses a metric correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$
Metric Clusters (cont’d)

• Given a local metric matrix $\vec{s}$, to build local metric clusters:
– Consider the u-th row in the metric correlation matrix.
– Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$.
– Then $S_u(n)$ defines a local metric cluster around the stem $s_u$.
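The metric correlation for one pair of stems can be sketched from keyword positions; representing keyword occurrences as word-position lists within a single document is an assumption of this example:

```python
def metric_correlation(positions_u, positions_v):
    """c_{u,v} = sum over keyword occurrence pairs of 1 / r(ki, kj),
    where r is the word distance within one document.

    positions_u, positions_v: word positions (same document) of keywords
    in V(su) and V(sv). Pairs in distinct documents would contribute
    1/infinity = 0, so they are simply omitted from the sums.
    """
    return sum(1.0 / abs(i - j)
               for i in positions_u for j in positions_v if i != j)
```

Adjacent occurrences contribute 1 each, so nearby terms dominate the correlation, which is precisely the distance-sensitivity the slide motivates.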
Scalar Clusters

• Two stems with similar neighborhoods have some synonymity relationship.
• The way to quantify such neighborhood relationships is to arrange all correlation values $s_{u,i}$ in a vector $\vec{s}_u$, to arrange all correlation values $s_{v,i}$ in another vector $\vec{s}_v$, and to compare these vectors through a scalar measure.
Scalar Clusters (cont’d)

• Let $\vec{s}_u = (s_{u,1}, s_{u,2}, \ldots, s_{u,n})$ and $\vec{s}_v = (s_{v,1}, s_{v,2}, \ldots, s_{v,n})$ be two vectors of correlation values for the stems $s_u$ and $s_v$.
• Let $\bar{s} = (\bar{s}_{u,v})$ be a scalar association matrix. Each $\bar{s}_{u,v}$ can be defined as
$\bar{s}_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$
• Let $S_u(n)$ be a function which returns the set of n largest values $\bar{s}_{u,v}$, $v \neq u$. Then $S_u(n)$ defines a scalar cluster around the stem $s_u$.
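The scalar measure above is just the cosine of two correlation-matrix rows, sketched here with rows as plain lists:

```python
import math

def scalar_correlation(row_u, row_v):
    """Cosine between two rows of the correlation matrix:
    s_bar_{u,v} = (su . sv) / (|su| * |sv|)."""
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm = (math.sqrt(sum(a * a for a in row_u))
            * math.sqrt(sum(b * b for b in row_v)))
    return dot / norm if norm else 0.0
```

Because it compares whole neighborhood profiles rather than direct co-occurrence, two stems can score high even if they never co-occur, as long as they correlate with the same third stems.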
Interactive Search Formulation

• Stems (or terms) that belong to clusters associated with the query stems (or terms) can be used to expand the original query.
• A stem $s_u$ which belongs to a cluster (of size n) associated with another stem $s_v$ (i.e. $s_u \in S_v(n)$) is said to be a neighbor of $s_v$.
Interactive Search Formulation (cont’d)

• Figure: the stem $s_u$ as a neighbor of the stem $s_v$ inside the cluster $S_v(n)$.
Interactive Search Formulation (cont’d)

• For each stem $s_v$, select m neighbor stems from the cluster $S_v(n)$ (which might be of type association, metric, or scalar) and add them to the query.
• Hopefully, the additional neighbor stems will retrieve new relevant documents.
• $S_v(n)$ may be composed of stems obtained using normalized or unnormalized correlation factors:
– a normalized cluster tends to group stems which are more rare;
– an unnormalized cluster tends to group stems due to their large frequencies.
Interactive Search Formulation (cont’d)

• Using information about correlated stems to improve the search:
– Let two stems $s_u$ and $s_v$ be correlated with a correlation factor $c_{u,v}$.
– If $c_{u,v}$ is larger than a predefined threshold, then a neighbor stem of $s_u$ can also be interpreted as a neighbor stem of $s_v$, and vice versa.
– This provides greater flexibility, particularly with Boolean queries.
– Consider the expression $(s_u + s_v)$, where the + symbol stands for disjunction.
– Let $s_u'$ be a neighbor stem of $s_u$.
– Then one can try both $(s_u' + s_v)$ and $(s_u + s_v)$ as synonym search expressions, because of the correlation given by $c_{u,v}$.
Query Expansion Through Local Context Analysis

• The local context analysis procedure operates in three steps:
– 1. Retrieve the top n ranked passages using the original query. This is accomplished by breaking up the documents initially retrieved by the query into fixed-length passages (for instance, of size 300 words) and ranking these passages as if they were documents.
– 2. For each concept c in the top ranked passages, the similarity sim(q, c) between the whole query q (not individual query terms) and the concept c is computed using a variant of tf-idf ranking.
Query Expansion Through Local Context Analysis (cont’d)

– 3. The top m ranked concepts (according to sim(q, c)) are added to the original query q. Each added concept is assigned a weight given by $1 - 0.9 \times i/m$, where i is the position of the concept in the final concept ranking. The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them.
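Step 3's weight assignment can be sketched as follows; the function name and the 1-based position convention are assumptions made for the example:

```python
def concept_weights(ranked_concepts, m=None, query_terms=()):
    """Weights for local context analysis expansion: the i-th ranked
    concept (1-based) gets 1 - 0.9 * i/m, and each original query term
    is stressed with a weight of 2.0, per the slide's scheme.
    """
    if m is None:
        m = len(ranked_concepts)
    weights = {t: 2.0 for t in query_terms}
    for i, c in enumerate(ranked_concepts[:m], start=1):
        weights[c] = 1.0 - 0.9 * i / m
    return weights
```

The linear decay keeps every added concept's weight in (0.1, 1.0), so expansion terms can never outweigh the original query terms.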
