
UNIT- II MODELING

Taxonomy and Characterization of IR Models


Boolean Model
Vector Model
Term Weighting
Scoring and Ranking
Language Models
Set Theoretic Models
Probabilistic Models
Algebraic Models
Structured Text Retrieval Models
Models for Browsing
Define Information Retrieval
Information retrieval (IR) is the activity of obtaining information
resources relevant to an information need from a collection of
information resources. Searches can be based on full-text or
other content-based indexing.

The tracing and recovery of specific information from stored data.

The techniques of storing and recovering, and often disseminating, recorded data, especially through the use of a computerized system.
A taxonomy of information retrieval models

The user task is either retrieval (ad hoc or filtering) or browsing.

Classic models: Boolean, Vector, Probabilistic
Set theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured models: Non-overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
            Index Terms    Full Text      Full Text + Structure
Retrieval   Classic        Classic        Structured
            Set Theoretic  Set Theoretic
            Algebraic      Algebraic
            Probabilistic  Probabilistic
Browsing    Flat           Flat           Structure Guided
                           Hypertext      Hypertext
A formal characterization of IR models

D : A set composed of logical views (or representations) for the documents in the collection.
Q : A set composed of logical views (or representations) for the user information needs (queries).
F : A framework for modeling document representations, queries, and their relationships.
R(qi, dj) : A ranking function which defines an ordering among the documents with regard to the query qi.
Define
ki : a generic index term
K : the set of all index terms {k1, ..., kt}
wi,j : a weight associated with index term ki of a document dj
gi : a function that returns the weight associated with ki in any t-dimensional vector ( gi(dj) = wi,j )
Retrieval Models
A retrieval model specifies the details of:
Document representation
Query representation
Retrieval function
Determines a notion of relevance.
Notion of relevance can be binary or
continuous (i.e. ranked retrieval).

Boolean Model
Boolean model
Binary decision criterion
Data retrieval model
Advantage
clean formalism, simplicity
Disadvantage
It is not simple to translate an information need into
a Boolean expression.
exact matching may lead to retrieval of too few or
too many documents
Example
Boolean Model
Boolean Model is one of the oldest and simplest models of
Information Retrieval.
It is based on set theory and Boolean algebra. A document is
represented as a set of keywords.
Queries are Boolean expressions of keywords, connected by
AND, OR, and NOT, including the use of brackets to
indicate scope.
Output: Document is relevant or not. No partial matches or
ranking.
In this model, each document is taken as a bag of index
terms.
Index terms are simply words or phrases from the document
that are important to establish the meaning of the document

Boolean Retrieval Model
The query is a Boolean algebra expression using connectives such as AND, OR, and NOT.
The documents retrieved are the documents that completely match the given
query.
Partial matches are not retrieved. Also, the retrieved set of documents is not
ordered. For example, say there are four documents in the system.
For each term in the query, a list of documents that contain the term is created.
Then the lists are merged according to the Boolean operators.
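The list merging described above can be sketched with Python set operations; the document texts and query terms below are illustrative assumptions, not from the text:

```python
# Minimal sketch of Boolean retrieval over an inverted index.
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information theory",
    4: "retrieval of stored data",
}

# Build the inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(a, b): return index.get(a, set()) & index.get(b, set())
def OR(a, b):  return index.get(a, set()) | index.get(b, set())
def NOT(a):    return set(docs) - index.get(a, set())

# "information AND retrieval" -> documents containing both terms
print(sorted(AND("information", "retrieval")))   # [1]
# "retrieval OR database"
print(sorted(OR("retrieval", "database")))       # [1, 2, 4]
```

Merging the per-term lists with set intersection, union, and complement corresponds exactly to the AND, OR, and NOT connectives of the query.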

Boolean Model
Advantages
It is simple, efficient and easy to implement.
It was one of the earliest retrieval methods to be implemented. It
remained the primary retrieval model for at least three decades.
It is very precise in nature. The user exactly gets what is specified.
Boolean model is still widely used in small scale searches like
searching emails, files from local hard drives or in a mid-sized library.
Disadvantages
In Boolean model, the retrieval strategy is based on binary criteria.
So, partial matches are not retrieved. Only those documents that exactly
match the query are retrieved.
Hence, to effectively retrieve from a large set of documents users must
have a good domain knowledge to form good queries.
The retrieved documents are not ranked.

Vector Based Model


Vector Based Model
The main problem with Boolean model is its inability to fetch partial
matches and the absence of any scoring procedure to rank the
retrieved documents.
This problem was addressed in the vector based model of
Information retrieval.
The VSM allows decisions to be made about which documents are
similar to each other and to keyword queries.
Vector space model or term vector model is an algebraic model
for representing text documents (and any objects, in general) as
vectors of identifiers, such as, for example, index terms. It is used in
information filtering, information retrieval, indexing and relevancy
rankings. Its first use was in the SMART Information Retrieval
System.
How it works: Overview
Each document is broken down into a word
frequency table
The tables are called vectors and can be stored
as arrays
A vocabulary is built from all the words in all
documents in the system
Each document is represented as a vector
based against the vocabulary
Example
The vocabulary contains all words used
a, dog, and, cat, frog
The vocabulary needs to be sorted
a, and, cat, dog, frog

Document A: A dog and a cat.


a and cat dog frog
2 1 1 1 0
Vector: (2,1,1,1,0)
Document B: A frog.
a and cat dog frog
1 0 0 0 1
Vector: (1,0,0,0,1)
Example
Let d1 = (2,1,1,1,0) and d2 = (1,0,0,0,1)
d1 · d2 = 2×1 + 1×0 + 1×0 + 1×0 + 0×1 = 2
|d1| = √(2² + 1² + 1² + 1² + 0²) = √7 = 2.646
|d2| = √(1² + 0² + 0² + 0² + 1²) = √2 = 1.414
Similarity = 2 / (2.646 × 1.414) ≈ 0.53
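The computation above can be reproduced with a short cosine-similarity sketch over the two example vectors:

```python
import math

def cosine(u, v):
    # dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = (2, 1, 1, 1, 0)  # "A dog and a cat."
d2 = (1, 0, 0, 0, 1)  # "A frog."
print(round(cosine(d1, d2), 3))  # 0.535
```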
Advantage and Disadvantage
Advantages
Simple model based on linear algebra
Term weights not binary
Allows computing a continuous degree of similarity between queries and
documents
Allows ranking documents according to their possible relevance
Allows partial matching

Disadvantage:
Long documents are poorly represented because they have poor similarity
values (a small scalar product and a large dimensionality)
Search keywords must precisely match document terms; word substrings
might result in a "false positive match"
Semantic sensitivity; documents with similar context but different term
vocabulary won't be associated, resulting in a "false negative match".
The order in which the terms appear in the document is lost in the vector
space representation.
Theoretically assumed terms are statistically independent.
Weighting is intuitive but not very formal.
Probabilistic Model
Why probabilities in IR?

User side: Information Need → Query. The representation of the user's need is uncertain.
System side: Documents → Document Representation. An uncertain guess of whether a document has relevant content.
How do we match the two?

In traditional IR systems, matching between each document and


query is attempted in a semantically imprecise space of index terms.
Probabilities provide a principled foundation for uncertain reasoning.
Can we use probabilities to quantify our uncertainties?
The Probability Ranking Principle
If a reference retrieval system's response to each request is a
ranking of the documents in the collection in order of decreasing
probability of relevance to the user who submitted the request,
where the probabilities are estimated as accurately as possible on
the basis of whatever data have been made available to the system
for this purpose, the overall effectiveness of the system to its user
will be the best that is obtainable on the basis of those data.

[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron;
van Rijsbergen (1979:113); Manning & Schütze (1999:538)
Probability Ranking Principle

Let x be a document in the collection.
Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.
Need to find p(R|x): the probability that a document x is relevant.

p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)
p(R|x) + p(NR|x) = 1

p(R), p(NR): prior probability of retrieving a (non-)relevant document.
p(x|R), p(x|NR): probability that if a relevant (non-relevant) document is retrieved, it is x.
Probabilistic Model
The similarity sim(dj,q) of the document dj to the query q is defined as the
ratio

sim(dj, q) = P(R|dj) / P(NR|dj)

Using Bayes' rule,

sim(dj, q) = [P(dj|R) × P(R)] / [P(dj|NR) × P(NR)]

P(R) stands for the probability that a document randomly selected from the
entire collection is relevant.
P(dj|R) stands for the probability of randomly selecting the document dj
from the set R of relevant documents.
Estimation of Term Relevance
In the very beginning:
P(ki|R) = 0.5
P(ki|NR) = dfi / N

Next, the ranking can be improved as follows. Let V be the subset of documents
initially retrieved, and Vi the subset of V containing the index term ki:
P(ki|R) = Vi / V
P(ki|NR) = (dfi − Vi) / (N − V)

For small values of V, these estimates are smoothed:
P(ki|R) = (Vi + 0.5) / (V + 1)
P(ki|NR) = (dfi − Vi + 0.5) / (N − V + 1)

Alternatively, dfi/N can be used as the adjustment factor instead of 0.5:
P(ki|R) = (Vi + dfi/N) / (V + 1)
P(ki|NR) = (dfi − Vi + dfi/N) / (N − V + 1)
Probabilistic Retrieval Strategy
Estimate how terms contribute to relevance
How do things like tf, df, and length influence your
judgments about document relevance?
One answer is the Okapi formulae (S. Robertson)

Combine to find document relevance probability
Order documents by decreasing probability

Probabilistic Model Advantages &
Disadvantages
Advantages
Based on a firm theoretical foundation
Theoretically justified optimal ranking scheme
Disadvantages
Making the initial guess to get V
Binary word-in-doc weights (not using term frequencies)
Independence of terms (can be alleviated)
Amount of computation
Has never worked convincingly better in practice
Scoring and Ranking
A ranking can then be computed by sorting documents by descending score. An alternative
approach is to define a score function on pairs of documents (d1, d2) that is positive if and
only if d1 is more relevant to the query than d2, and to use this information to sort. For the
purpose of ranking the documents matching this query, we are really interested in the
relative (rather than absolute) scores of the documents in the collection.

For any document d, the cosine similarity is the weighted sum, over all terms in the query q, of
the weights of those terms in d. This in turn can be computed by a postings intersection,
exactly as in the algorithm.
Ranking
Ranking of query results is one of the fundamental problems in
information retrieval (IR), the scientific/engineering discipline behind
search engines.
Given a query q and a collection D of documents that match the query, the
problem is to rank, that is, sort, the documents in D according to some
criterion so that the "best" results appear early in the result list displayed to
the user. Classically, ranking criteria are phrased in terms of relevance of
documents with respect to an information need expressed in the query.
Ranking is often reduced to the computation of numeric scores on
query/document pairs.
Ranking functions are evaluated by a variety of means; one of the simplest
is determining the precision of the first k top-ranked results for some fixed
k; for example, the proportion of the top 10 results that are relevant, on
average over many queries.
Ch. 6

Scoring as the basis of ranked retrieval


We wish to return in order the documents most
likely to be useful to the searcher
How can we rank-order the documents in the
collection with respect to a query?
Assign a score, say in [0, 1], to each
document
This score measures how well document and
query match.

Take 1: Jaccard coefficient


Recall from Lecture 3: a commonly used
measure of overlap of two sets A and B:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example


What is the query-document match score that
the Jaccard coefficient computes for each of
the two documents below?
Query: ides of march
Document 1: caesar died in march
Document 2: the long march
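A quick sketch answering this question, treating each text as a set of terms:

```python
def jaccard(a, b):
    # Jaccard coefficient: |A ∩ B| / |A ∪ B| over the term sets
    a, b = set(a), set(b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()
print(round(jaccard(query, doc1), 3))  # 0.167  (only "march" shared, union of 6 terms)
print(round(jaccard(query, doc2), 3))  # 0.2    (only "march" shared, union of 5 terms)
```

Note that Jaccard scores Document 2 higher than Document 1 even though both share only "march" with the query, simply because Document 2 is shorter; this is one of the weaknesses tf weighting addresses next.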
Term frequency tf
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
We want to use tf when computing query-document
match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term
frequency.

NB: frequency = count in IR


Sec. 6.2

Log-frequency weighting
The log frequency weight of term t in d is:
w_t,d = 1 + log10(tf_t,d), if tf_t,d > 0
w_t,d = 0, otherwise

tf = 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms
t in both q and d:
score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)
The score is 0 if none of the query terms is
present in the document.
Sec. 6.2.1

idf weight
dft is the document frequency of t: the number of
documents that contain t
df_t is an inverse measure of the informativeness of t
df_t ≤ N
We define the idf (inverse document frequency)
of t by idf_t = log10(N / df_t)

We use log (N/dft) instead of N/dft to dampen the


effect of idf.

Will turn out the base of the log is immaterial.



idf example, suppose N = 1 million


term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10(N / df_t)


There is one idf value for each term t in a collection.
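The idf column follows directly from the formula; a sketch computing the table values:

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

# One idf value per term in the collection.
idf = {t: math.log10(N / df[t]) for t in df}

for t in df:
    print(f"{t:10s} df={df[t]:>9,}  idf={idf[t]:.0f}")
# calpurnia -> 6, animal -> 4, sunday -> 3, fly -> 2, under -> 1, the -> 0
```

Rare terms (low df) get high idf and thus dominate the score; a term occurring in every document gets idf 0 and contributes nothing.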
Sec. 6.2.2

tf-idf weighting
The tf-idf weight of a term is the product of its tf
weight and its idf weight.
w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)
Best known weighting scheme in information
retrieval
Note: the - in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a
document
Increases with the rarity of the term in the collection

Score for a document given a query

Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_t,d

There are many variants


How tf is computed (with/without logs)
Whether the terms in the query are also weighted

Sec. 6.3

Binary → count → weight matrix

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      5.25                  3.18           0            0       0        0.35
Brutus      1.21                  6.1            0            1       0        0
Caesar      8.59                  2.54           0            1.51    0.25     0
Calpurnia   0                     1.54           0            0       0        0
Cleopatra   2.85                  0              0            0       0        0
mercy       1.51                  0              1.9          0.12    5.25     0.88
worser      1.37                  0              0.11         4.15    0.25     1.95

Each document is now represented by a real-valued
vector of tf-idf weights ∈ R^|V|
Sec. 6.4

tf-idf weighting has many variants

Columns headed n are acronyms for weight schemes.

Why is the base of the log in idf immaterial?


Language Models (LMs)
Outline
Language models
Finite automata and language models
Types of language models
Multinomial distributions over words
Query likelihood model
Application

Language Models (LMs)
How can we come up with good queries?
Think of words that would likely appear in a
relevant document.
Idea of LM:
A document is a good match to a query if the
document model is likely to generate the query.

Language Models (LMs)
Generative Model:
Recognize or generate strings.

The full set of strings that can be generated is called the language of the
automaton.
Language Model:
A function that puts a probability measure over strings drawn from some
vocabulary.

Language Models (LMs)

Example 1:
Calculate the probability of a word sequence.
Multiply the probabilities that the model gives to each word
in the sequence, together with the probability of continuing
or stopping after producing each word.
P(frog said that toad likes frog)
= (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01)
× (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
= 0.000000000001573

Most of the time, we will omit the STOP and (1 − STOP) probabilities.
Query Likelihood Model
Query likelihood model:
Rank document by P(d|q)
Likelihood that document d is
relevant to the query.
Using Bayes' rule: P(d|q) = P(q|d) P(d) / P(q)
P(q) is the same for all documents.
P(d) is treated as uniform across all d, so:
P(d|q) ∝ P(q|d)

Application
Community-based Question Answering (CQA)
System:
Question Search.
Given a queried question, find a semantically
equivalent question for the queried question.
General Search Engine
Given a query, rank documents.

Set Theoretic Models
Set Theoretic Models

The Boolean model imposes a binary criterion


for deciding relevance
The question of how to extend the Boolean
model to accommodate partial matching and a
ranking has attracted considerable attention in
the past
We discuss now two set theoretic models for
this:
Fuzzy Set Model
Extended Boolean Model
Fuzzy Set Model

Queries and docs represented by sets of index


terms: matching is approximate from the start
This vagueness can be modeled using a fuzzy
framework, as follows:
with each index term a fuzzy set is associated
each doc has a degree of membership in this fuzzy
set
This interpretation provides the foundation for
many models for IR based on fuzzy theory
Alternative Set Theoretic Models
-Fuzzy Set Model
Model
a query term: a fuzzy set
a document: degree of membership in this set
membership function
Associate membership function with the elements of
the class
0: no membership in the set
1: full membership
between 0 and 1: marginal membership in the set
Structured Text Retrieval Models
Models for Browsing
Models for browsing

Flat browsing

Structure guided browsing

The hypertext model


Flat browsing
The documents might be represented as
dots in a plane or as elements in a list.
Relevance feedback
Disadvantage : In a given page or screen
there may not be any indication about the
context where the user is.
Structure guided browsing

Organized in a directory structure. It groups


documents covering related topics.
The same idea can be applied to a single
document.
Using history map.
The hypertext model
Written text is usually conceived to be read
sequentially.

The reader should not expect to fully


understand the message conveyed by the
writer by randomly reading pieces of text here
and there.
End of Unit II
