Presenter: 林秉儀
Student ID: 89522022
Introduction
• Define:
– Weight:
Let $k_i$ be a generic index term in the set $K = \{k_1, \ldots, k_t\}$.
A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
– Document index term vector:
The document $d_j$ is associated with an index term vector $\vec{d}_j$, represented by
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Vector Model (cont’d)
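• Define
– From Chapter 2:
Term weighting: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
Normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
where $freq_{i,j}$ is the raw frequency of $k_i$ in the document $d_j$, $N$ is the total number of documents in the collection, and $n_i$ is the number of documents in which $k_i$ appears
Inverse document frequency for $k_i$: $idf_i = \log \frac{N}{n_i}$
Query term weight: $w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$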
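The weighting formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names `document_weights` and `query_weights` are hypothetical, documents are assumed to be lists of already-tokenized terms, and the natural logarithm is assumed because the slides do not state the base.

```python
import math
from collections import Counter

def document_weights(docs):
    """Return ([{term: w_ij} per document], n), with w_ij = f_ij * log(N / n_i)."""
    N = len(docs)
    # n[term] = n_i, the number of documents that contain the term
    n = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)  # raw frequencies freq_ij
        if not freq:
            weights.append({})
            continue
        max_freq = max(freq.values())  # max_l freq_lj
        weights.append({t: (f / max_freq) * math.log(N / n[t])
                        for t, f in freq.items()})
    return weights, n

def query_weights(query, n, N):
    """Return {term: w_iq}, with w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    # Assumption: query terms that appear in no document are dropped (n_i would be 0).
    freq = Counter(t for t in query if t in n)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / n[t])
            for t, f in freq.items()}

docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
doc_w, n = document_weights(docs)
q_w = query_weights(["a", "c", "zzz"], n, len(docs))
```

A term that occurs in every document gets weight 0 under this scheme, since $\log(N/n_i) = \log 1 = 0$.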
Vector Model (cont’d)
• Define:
– Query vector:
The query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
– $D_r$: set of relevant documents identified by the user
– $D_n$: set of non-relevant documents among the retrieved documents
– $C_r$: set of relevant documents among all documents in the collection
– α, β, γ: tuning constants
Query Expansion and Term Reweighting for the Vector Model
• Ideal case
$C_r$: the complete set of relevant documents for a given query $q$
– The best query vector is given by
$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
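As a rough illustration of the ideal-case formula, the sketch below averages the relevant document vectors and subtracts the average of all remaining vectors. The function name `optimal_query` and the representation of documents as equal-length lists of weights are assumptions made for this example only.

```python
def optimal_query(doc_vectors, relevant_ids):
    """q_opt = mean of vectors in C_r minus mean of vectors outside C_r."""
    rel = [d for j, d in enumerate(doc_vectors) if j in relevant_ids]
    non = [d for j, d in enumerate(doc_vectors) if j not in relevant_ids]
    if not rel or not non:
        raise ValueError("need at least one relevant and one non-relevant document")
    t = len(doc_vectors[0])
    # len(rel) is |C_r| and len(non) is N - |C_r|
    return [sum(d[i] for d in rel) / len(rel) - sum(d[i] for d in non) / len(non)
            for i in range(t)]

vectors = [[1, 0], [3, 0], [0, 2]]
q_opt = optimal_query(vectors, relevant_ids={0, 1})
```

In practice $C_r$ is not known in advance, which is why the feedback formulas work from the user-judged sets $D_r$ and $D_n$ instead.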
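– Ide_Regular:
$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Dec_Hi:
$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \max_{non\text{-}relevant}(\vec{d}_j)$
where $\max_{non\text{-}relevant}(\vec{d}_j)$ is the highest-ranked non-relevant document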
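The two Ide feedback formulas can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, vectors are plain lists of equal length, the tuning constants default to 1, and `ranked_non_relevant` is assumed to be ordered by the original ranking so that its first element is the highest-ranked non-relevant document.

```python
def ide_regular(q, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(D_r) - gamma*sum(D_n)."""
    return [alpha * q[i]
            + beta * sum(d[i] for d in relevant)
            - gamma * sum(d[i] for d in non_relevant)
            for i in range(len(q))]

def ide_dec_hi(q, relevant, ranked_non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(D_r) - gamma*(highest-ranked non-relevant document)."""
    # Assumption: with no non-relevant documents, nothing is subtracted.
    top = ranked_non_relevant[0] if ranked_non_relevant else [0.0] * len(q)
    return [alpha * q[i]
            + beta * sum(d[i] for d in relevant)
            - gamma * top[i]
            for i in range(len(q))]

q = [1.0, 0.0, 1.0]
D_r = [[0, 1, 1], [1, 1, 0]]
D_n = [[0, 0, 2], [1, 0, 0]]  # first entry is the highest-ranked non-relevant document
q_regular = ide_regular(q, D_r, D_n)
q_dec_hi = ide_dec_hi(q, D_r, D_n)
```

Terms with a nonzero weight in the modified query that were zero in the original query (the second component here) are the expansion terms; the other components are reweighted.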
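– The similarity of $d_j$ to $q$:
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left[ \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \div \frac{n_i - |D_{r,i}|}{N - |D_r| - (n_i - |D_{r,i}|)} \right]$
where $D_{r,i}$ is the subset of $D_r$ whose documents contain the term $k_i$
• No query expansion occurs in the procedure.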
Term Reweighting for the Probabilistic Model (cont’d)
• Adjustment factor
– Because $|D_r|$ and $|D_{r,i}|$ are often small, a 0.5 adjustment factor is added to the estimates of $P(k_i|R)$ and $P(k_i|\bar{R})$:
$P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$
$P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
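A small numeric sketch shows why the adjustment matters. Without it, a term that occurs in every judged-relevant document gives $P(k_i|R) = 1$, and the log-odds term $\log \frac{P(k_i|R)}{1 - P(k_i|R)}$ is undefined. The function names below are hypothetical, and the natural logarithm is an assumption.

```python
import math

def adjusted_estimates(dr, dri, N, ni):
    """Return (P(k_i|R), P(k_i|not R)) with the 0.5 adjustment factor.

    dr = |D_r|, dri = |D_r,i|, N = collection size, ni = documents containing k_i.
    """
    p_rel = (dri + 0.5) / (dr + 1)
    p_non = (ni - dri + 0.5) / (N - dr + 1)
    return p_rel, p_non

def term_relevance_weight(p_rel, p_non):
    """log(P/(1-P)) + log((1-Q)/Q), the per-term log factor of the similarity."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# The term appears in all 4 judged-relevant documents and in 10 of 100 overall.
p_rel, p_non = adjusted_estimates(dr=4, dri=4, N=100, ni=10)
weight = term_relevance_weight(p_rel, p_non)
```

With these inputs the adjusted estimate of $P(k_i|R)$ is 0.9 rather than 1, so the weight stays finite.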
$F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) f_{i,j}$
Automatic Local Analysis
• Clustering: the grouping of documents which satisfy a set of common properties.
• Goal: automatically obtain a description of a larger cluster of relevant documents by identifying terms which are related to the query terms, such as:
– Synonyms
– Stemming
– Variations
– Terms with a distance of at most k words from a query term
Automatic Local Analysis (cont’d)
– Normalized:
$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
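The normalization can be illustrated with a short sketch. The definition of $c_{u,v}$ is not given in this section; the code assumes it is the co-occurrence correlation $c_{u,v} = \sum_j f_{u,j} \times f_{v,j}$ over the local document set, and the function names and the `freqs` layout are hypothetical.

```python
def association_matrix(freqs):
    """freqs[u][j] = frequency of term u in local document j.

    Assumption: c[u][v] = sum over j of freqs[u][j] * freqs[v][j].
    """
    n = len(freqs)
    return [[sum(a * b for a, b in zip(freqs[u], freqs[v])) for v in range(n)]
            for u in range(n)]

def normalize(c):
    """s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    n = len(c)
    s = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            denom = c[u][u] + c[v][v] - c[u][v]
            # A zero denominator only arises for terms with no occurrences.
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s

freqs = [[2, 0, 1],   # term 0 in documents 0..2
         [1, 1, 0],   # term 1
         [0, 0, 3]]   # term 2
c = association_matrix(freqs)
s = normalize(c)
```

Under this assumption every term has $s_{u,u} = 1$, and terms that never co-occur have $s_{u,v} = 0$, so the normalized values are comparable across frequent and rare terms.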
Association Clusters (cont’d)
[Figure: diagram labeled $S_v(n)$, $s_u$, $s_v$]
Interactive Search Formulation (cont’d)