tf_ij is the term frequency of token t_i in document d_j. f_ij is the raw frequency of token t_i in document d_j. Note that f_ij can be retrieved from L_frequency of key t_i in the Inverted Index D. tf_ij is normalized by the frequency of the maximum-occurring token in document d_j.
idf_i is the inverse document frequency of token t_i. N is the total number of documents in the corpus. df_i is the document frequency of token t_i. Note that df_i can be retrieved from the df value of key t_i in the Inverted Index D. Also note that the idf_i value is constant for a token and does not vary across documents.
W_ij = tf_ij * idf_i
where W_ij is the TF-IDF weight of token t_i in document d_j. Note that W_ij does not need any extra data structure; it can be stored in the Inverted Index D itself. The only change to be made is that L_frequency will now contain the list of TF-IDF weights of the corresponding token t_i for all documents. By going through all the keys (tokens) in the Inverted Index D for a fixed index j in L_frequency, the vector V_dj for document d_j can be retrieved very easily.
V_dj = (W_1j, W_2j, ..., W_Mj), where M is the total number of tokens present in the corpus.
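As a concrete illustration, the scheme above can be sketched in Python: raw frequencies in the inverted index are replaced in place by TF-IDF weights, and a document vector V_dj is read off by iterating over all keys for a fixed j. All names here are illustrative rather than taken from the implementation, and the standard form idf_i = log(N / df_i) is assumed.

```python
import math

# Illustrative sketch (names are assumptions, not from the report).
# inverted_index maps token t_i -> {doc_id j: raw frequency f_ij}.
inverted_index = {
    "cosine":  {0: 3, 2: 1},
    "vector":  {0: 1, 1: 2, 2: 2},
    "ranking": {1: 4},
}
N = 3  # total number of documents in the corpus

# Frequency of the maximum-occurring token per document, for normalization.
max_freq = {}
for postings in inverted_index.values():
    for j, f in postings.items():
        max_freq[j] = max(max_freq.get(j, 0), f)

# W_ij = tf_ij * idf_i, stored back into the same structure,
# so no extra data structure is needed.
for token, postings in inverted_index.items():
    idf_i = math.log(N / len(postings))   # df_i = number of docs containing t_i
    for j in postings:
        postings[j] = (postings[j] / max_freq[j]) * idf_i  # tf_ij * idf_i

def doc_vector(j):
    """V_dj: one component per key (token) in the inverted index."""
    return [postings.get(j, 0.0) for postings in inverted_index.values()]
```

Note how a token occurring in every document (here "vector") gets idf = log(1) = 0 and thus contributes nothing, which matches idf as a discrimination measure.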
Query Vector Construction
A query is a small text which is itself considered a document d_q. The only difference between the query document and the other documents in the corpus is that the query document is already in the form of plain text (free from HTML tags), so the tokenization process can start with punctuation removal. The same regular expression for punctuation symbols, list of stop words, and stemmer used for extracting tokens from the documents in the corpus are used for extracting tokens from the query document. The tokens extracted from the query document which do not occur in the Inverted Index D are filtered out. The rest of the tokens form the set for the query document vector V_dq. Note that because of the small size of the query text, the dimension of V_dq will be very small in comparison to M.
University of Malta | Shashi Narayan
V_dq = (W_1q, W_2q, ..., W_mq), where m ≪ M
Calculation of the TF-IDF weight W_iq for token t_i: tf_iq can be calculated very easily with the formula described above, because it depends only on the query document d_q. Calculating idf_i is slightly confusing because df_i and N depend on the existing corpus plus the query document; if we include the query document, both will increase by one. In the current SE, df_i and N are taken independently of the query document, so idf_i is the discrimination power of token t_i based on the existing corpus alone. Finally, the product of tf_iq and idf_i produces W_iq, and hence V_dq.
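The query-side pipeline described above can be sketched as follows. This is a minimal illustration: the stemmer and the corpus's actual punctuation regex are omitted for brevity, the helper names are hypothetical, and idf_i = log(N / df_i) over the existing corpus (ignoring the query document) is assumed, as the text specifies.

```python
import math
import re

# Assumed corpus state: token -> {doc_id: frequency}; names are illustrative.
inverted_index = {"cosine": {0: 3, 2: 1}, "vector": {0: 1, 1: 2, 2: 2}}
N = 3
stop_words = {"the", "of", "a"}

def query_vector(query_text):
    """Build V_dq: tokenize, filter, then weight by tf_iq * idf_i."""
    tokens = re.findall(r"\w+", query_text.lower())   # punctuation removal
    # Drop stop words and tokens absent from the Inverted Index D.
    tokens = [t for t in tokens if t not in stop_words and t in inverted_index]
    counts = {t: tokens.count(t) for t in tokens}
    max_f = max(counts.values(), default=0)           # for tf normalization
    # idf from the existing corpus only; the query document is not counted.
    return {t: (c / max_f) * math.log(N / len(inverted_index[t]))
            for t, c in counts.items()}

vq = query_vector("the cosine of a vector")   # dimension m = 2, far below M
```

The resulting dictionary has one entry per surviving query token, so its dimension m stays tiny compared with M.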
Cosine Similarity Model
To find the relevant documents, the query document d_q is compared with all the documents d_j in the corpus one by one using cosine similarity. This similarity value can be used to rank the relevant documents d_j.
sim(d_q, d_j) = (V_dq · V_dj) / (|V_dq| |V_dj|)
             = ( Σ_{i=1..m} W_iq * W_ij ) / ( sqrt(Σ_{i=1..m} W_iq^2) * sqrt(Σ_{i=1..M} W_ij^2) )
Note that the dot product runs over the dimension m of V_dq, which is very small in comparison to M (m ≪ M). This makes the process very fast. The only thing that looks time-consuming is the calculation of V_dj (dimension M). But fortunately, V_dj can be calculated for all j in advance and stored in memory. This assumption and pre-calculation together make the search for relevant documents extremely fast.
A threshold value on sim(d_q, d_j) can be used to reduce the size of the relevant document set.
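The ranking step can be sketched as below, under the assumptions stated in the text: the dot product iterates only over the query's m nonzero dimensions, and each |V_dj| is precomputed and held in memory. Function and variable names are illustrative.

```python
import math

def cosine_sim(query_vec, doc_weights, doc_norm):
    """sim(d_q, d_j). query_vec: token -> W_iq; doc_weights: token -> W_ij;
    doc_norm: precomputed |V_dj| over all M dimensions."""
    # Dot product over the query's m dimensions only (m << M).
    dot = sum(w_iq * doc_weights.get(t, 0.0) for t, w_iq in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    return dot / (q_norm * doc_norm) if q_norm and doc_norm else 0.0

def rank(query_vec, docs, threshold=0.0):
    """docs: doc_id -> (weights, precomputed norm).
    Returns (doc_id, similarity) pairs above the threshold, best first."""
    scored = [(doc_id, cosine_sim(query_vec, w, n))
              for doc_id, (w, n) in docs.items()]
    return sorted((s for s in scored if s[1] > threshold),
                  key=lambda s: s[1], reverse=True)
```

The `threshold` parameter plays the role of the cut-off mentioned above for shrinking the relevant-document set.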
Result Format (HTML and XML)
In the result, the top 10 most relevant documents are shown in decreasing order of the similarity value similarityValue. The number of relevant documents to be shown can be changed by a small update in the Global Information file of my implementation.
For each document on the result page, the following details are shown:
queryText: value of the query text.
similarityValue: cosine similarity between the query document and the document, sim(d_q, d_j).
docRank: rank of the document in the list sorted by decreasing similarityValue.
docLink: hyperlink to the document in the corpus, to easily view the content of the document.
docName: name of the document.
docTitle: title of the document, extracted from the title value of the document in the list L_DocDetails (stored during the tokenization process of the head part). If no such detail is available, the title is computed after the search completes and is taken as the first line of the document up to some fixed length (specified in the Global Information file of the system's implementation).
Snippet(doc, query): snippet for the query explaining the document, extracted from the description and summary values of the document in the list L_DocDetails (meta-tag information). If no such details are available, the snippet is constructed from the document itself using sentence fragments. In the current implementation, when no meta-tag information is available, each sentence of the document is ranked against the query vector and the snippet is constructed from the top-ranked sentences up to some fixed length (specified in the Global Information file of the system's implementation). Note that along with these two methods, Google also uses the Open Directory Project (ODP) to get snippets; our current SE cannot use ODP, as no web addresses are provided for the documents.
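The fallback snippet construction can be sketched as follows. The exact sentence-scoring rule is not spelled out in the text, so this sketch assumes a simple one: the score of a sentence is the sum of the query-vector weights of the tokens it contains. All names are hypothetical.

```python
import re

def build_snippet(text, query_vec, max_len=160):
    """Rank each sentence against the query vector and assemble a snippet
    from the top-ranked sentences, up to a fixed length (max_len)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def score(sentence):
        # Assumed scoring: sum of query weights of tokens in the sentence.
        words = set(re.findall(r"\w+", sentence.lower()))
        return sum(w for t, w in query_vec.items() if t in words)

    ranked = sorted(sentences, key=score, reverse=True)
    snippet = ""
    for s in ranked:
        if len(snippet) + len(s) > max_len:
            break
        snippet += (" " if snippet else "") + s
    return snippet
```

In the real system the length limit would come from the Global Information file rather than a hard-coded parameter.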
Tokens from the query text are highlighted in docTitle and Snippet(doc, query).
HTML Structure
<html><head><title>queryText</title></head><body>
<h2>queryText</h2>
( <b>Result docRank: similarityValue <a href=docLink>docName</a></b>
<br><b>Title:</b> docTitle
<br><b>Snippet:</b> Snippet(doc, query)
<br><br> )
</body></html>

The parenthesized block is repeated once for each result document.
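A minimal sketch of rendering this structure in Python might look like the following; the field names mirror the template above, and `html.escape` is used defensively on inserted values (an assumption, not necessarily part of the original implementation).

```python
import html

def render_results(query_text, results):
    """Render the result page for a list of result records, where each
    record is a dict with rank, sim, link, name, title, and snippet."""
    rows = []
    for r in results:
        rows.append(
            f'<b>Result {r["rank"]}: {r["sim"]:.3f} '
            f'<a href="{html.escape(r["link"])}">{html.escape(r["name"])}</a></b>'
            f'<br><b>Title:</b> {html.escape(r["title"])}'
            f'<br><b>Snippet:</b> {html.escape(r["snippet"])}<br><br>'
        )
    return (f"<html><head><title>{html.escape(query_text)}</title></head><body>"
            f"<h2>{html.escape(query_text)}</h2>{''.join(rows)}</body></html>")
```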