[Figure: taxonomy of IR models. Browsing: flat, structure guided, hypertext. Probabilistic extensions: inference network, belief network. Text views: index terms, full text, full text + structure. Example document classes: professional documents, cultural documents]
Boolean Model
Binary decision criterion: a document is either relevant or non-relevant.
A data retrieval model.
Advantage: clean formalism, simplicity.
Disadvantages:
It is not simple to translate an information need into a Boolean expression.
Exact matching may lead to retrieval of too few or too many documents.
Boolean Model
The Boolean model is one of the oldest and simplest models of Information Retrieval. It is based on set theory and Boolean algebra. A document is represented as a set of keywords. Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.
Output: a document is either relevant or not. There are no partial matches and no ranking.
In this model, each document is treated as a bag of index terms. Index terms are simply words or phrases from the document that are important for establishing the meaning of the document.
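The set-based view above can be sketched in a few lines; the document IDs, keywords, and the `having` helper below are invented for illustration:

```python
# Boolean model sketch: each document is a set of index terms, and a
# query is evaluated with set operations (AND = intersection,
# OR = union, NOT = difference against the whole collection).
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "ranking"},
    "d3": {"boolean", "algebra"},
}

def having(term):
    """IDs of documents whose term set contains `term`."""
    return {i for i, terms in docs.items() if term in terms}

# Query: information AND (boolean OR ranking)
result = having("information") & (having("boolean") | having("ranking"))
print(sorted(result))  # each document either matches or it doesn't
```

Note the output is just a set of matching documents: no partial matches, no ranking, exactly as the model prescribes.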
Boolean Retrieval Model
The query is a Boolean algebra expression using connectives such as AND, OR, and NOT. The documents retrieved are those that completely match the given query. Partial matches are not retrieved, and the retrieved set of documents is not ordered.
For example, say there are four documents in the system. For each term in the query, a list of documents that contain the term is created. Then the lists are merged according to the Boolean operators.
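A minimal sketch of this postings-merge evaluation for an AND query, using an invented four-document collection:

```python
from collections import defaultdict

# Four toy documents (contents invented for illustration).
documents = {
    1: "boolean model set theory",
    2: "vector space model",
    3: "boolean retrieval postings",
    4: "probabilistic model",
}

# For each term, build a postings list: sorted IDs of documents
# containing the term.
postings = defaultdict(list)
for doc_id in sorted(documents):
    for term in set(documents[doc_id].split()):
        postings[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted postings lists for an AND query."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# "boolean AND model" -> documents containing both terms
print(intersect(postings["boolean"], postings["model"]))  # [1]
```

An OR query would merge the same lists by union, and NOT by skipping the IDs present in the term's list.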
Boolean Model
Advantages
It is simple, efficient and easy to implement.
It was one of the earliest retrieval methods to be implemented. It
remained the primary retrieval model for at least three decades.
It is very precise in nature. The user exactly gets what is specified.
The Boolean model is still widely used in small-scale searches, such as searching emails or files on local hard drives, or in a mid-sized library.
Disadvantages
In the Boolean model, the retrieval strategy is based on binary criteria. So, partial matches are not retrieved; only those documents that exactly match the query are retrieved.
Hence, to retrieve effectively from a large set of documents, users must have good domain knowledge to form good queries.
The retrieved documents are not ranked.
Vector Space Model: Disadvantages
Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
Search keywords must precisely match document terms; word substrings might result in a "false positive match".
Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
The order in which the terms appear in the document is lost in the vector space representation.
Terms are assumed to be statistically independent.
Weighting is intuitive but not very formal.
Probabilistic Model
Why probabilities in IR?
The user's information need must be translated into a query, and that representation of the need is uncertain. Likewise, the system's representation of each document's content is uncertain. Matching is therefore an uncertain guess of whether a document has relevant content.
For any document d, the cosine similarity is the weighted sum, over all terms in the query q, of the weights of those terms in d. This in turn can be computed by a postings intersection, exactly as in the postings-merge algorithm.
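A minimal sketch of this term-at-a-time accumulation; the `weighted_postings` table and its weight values are invented for illustration, not taken from the slides:

```python
# Term-at-a-time scoring: a document's score is the sum, over query
# terms, of that term's weight in the document, accumulated by walking
# the weighted postings of each query term.
# The weights below are made-up illustrative tf-idf values.
weighted_postings = {
    "caesar": {1: 1.2, 4: 0.4},
    "brutus": {1: 0.9, 2: 0.7},
}

def score(query_terms):
    scores = {}
    for term in query_terms:
        for doc_id, weight in weighted_postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Rank documents by descending accumulated score.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score(["caesar", "brutus"]))  # document 1 accumulates both weights
```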
Ranking
Ranking of query results is one of the fundamental problems in
information retrieval (IR), the scientific/engineering discipline behind
search engines.
Given a query q and a collection D of documents that match the query, the
problem is to rank, that is, sort, the documents in D according to some
criterion so that the "best" results appear early in the result list displayed to
the user. Classically, ranking criteria are phrased in terms of relevance of
documents with respect to an information need expressed in the query.
Ranking is often reduced to the computation of numeric scores on
query/document pairs.
Ranking functions are evaluated by a variety of means; one of the simplest
is determining the precision of the first k top-ranked results for some fixed
k; for example, the proportion of the top 10 results that are relevant, on
average over many queries.
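The precision-at-k evaluation just described can be sketched as follows; the ranking and the relevance judgments are made up:

```python
# Precision at k: the fraction of the top-k ranked results that are
# relevant. Averaging this over many queries evaluates a ranking function.
def precision_at_k(ranked_ids, relevant_ids, k):
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

ranking = ["d3", "d1", "d7", "d2", "d9"]   # system output (invented)
relevant = {"d1", "d2", "d5"}              # judged relevant (invented)
print(precision_at_k(ranking, relevant, 5))  # 2 of the top 5 -> 0.4
```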
Ch. 6
Log-frequency weighting
The log-frequency weight of term t in document d is:

    w(t,d) = 1 + log10(tf(t,d))   if tf(t,d) > 0
    w(t,d) = 0                    otherwise

So tf = 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4, etc.
Score for a document-query pair: sum over terms t in both q and d:

    score(q,d) = sum over t in (q AND d) of (1 + log10 tf(t,d))

The score is 0 if none of the query terms is present in the document.
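A small sketch of the log-frequency weight, reproducing the example values on the slide:

```python
import math

# Log-frequency weight: w(t,d) = 1 + log10(tf) if tf > 0, else 0.
def log_tf_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
# reproduces the slide's values: 0->0, 1->1, 2->1.3, 10->2, 1000->4
```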
Sec. 6.2.1
idf weight
df(t) is the document frequency of t: the number of documents that contain t.
df(t) is an inverse measure of the informativeness of t, and df(t) <= N.
We define the idf (inverse document frequency) of t by:

    idf(t) = log10(N / df(t))
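A quick sketch of idf; the collection size N and the df values are invented (chosen so the numbers come out round):

```python
import math

# idf(t) = log10(N / df(t)): rare terms get a high idf, while a term
# occurring in every document gets idf 0.
N = 1_000_000  # assumed collection size

def idf(df):
    return math.log10(N / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("the", 1_000_000)]:
    print(term, idf(df))  # idf shrinks as df grows
```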
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w(t,d) = log(1 + tf(t,d)) * log10(N / df(t))

It is the best-known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf.
The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
Sec. 6.2.2
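A sketch of the tf-idf weight as written above; the base of the first log is not specified on the slide, so base 10 is assumed here, and the counts are invented:

```python
import math

# tf-idf weight: w(t,d) = log(1 + tf) * log10(N / df).
# Base 10 is assumed for both logs.
N = 1000  # assumed collection size

def tf_idf(tf, df):
    return math.log10(1 + tf) * math.log10(N / df)

# Same tf, different rarity: the rarer term gets the higher weight.
print(tf_idf(3, 100))  # common term
print(tf_idf(3, 5))    # rarer term, higher weight
```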
Language Models (LMs)
How can we come up with good queries?
Think of words that would likely appear in a
relevant document.
Idea of LM:
A document is a good match to a query if the
document model is likely to generate the query.
Language Models (LMs)
Generative Model:
Recognize or generate strings.
The full set of strings that can be generated is called the language of the
automaton.
Language Model:
A function that puts a probability measure over strings drawn from some
vocabulary.
Language Models (LMs)
Example 1:
Calculate the probability of a word sequence.
Multiply the probabilities that the model gives to each word
in the sequence, together with the probability of continuing
or stopping after producing each word.
P(frog said that toad likes frog)
  = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
  ≈ 0.000000000001573
(continue with probability 0.8 after each of the first five words, then stop with probability 0.2)
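The calculation above can be checked with a short script; the word probabilities and the continue/stop probabilities are the ones from the example:

```python
# Unigram language model probability of a word sequence, with an
# explicit continue/stop decision after each word.
p = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
p_continue, p_stop = 0.8, 0.2

def seq_prob(words):
    prob = 1.0
    for i, w in enumerate(words):
        prob *= p[w]
        # continue after every word except the last, then stop
        prob *= p_continue if i < len(words) - 1 else p_stop
    return prob

words = ["frog", "said", "that", "toad", "likes", "frog"]
print(seq_prob(words))  # ~1.573e-12, matching the slide
```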
Application
Community-based Question Answering (CQA) System:
Question search: given a queried question, find a semantically equivalent question.
General Search Engine:
Given a query, rank documents.
Set Theoretic Models
Flat browsing