Lecture 3-Skip Pointers and Phrase Queries

Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Recall basic merge

Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
[Figure: postings lists for Brutus (2, ..., 41, 48, 64, 128) and Caesar (..., 11, 17, 21, 31)]
If the list lengths are m and n, the merge takes O(m+n) operations.
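The linear merge described above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `intersect` and the two example lists are assumptions, chosen to be consistent with the docIDs mentioned on these slides.

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists using at most m + n steps."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID on both lists: it belongs to the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


# Hypothetical example postings (assumed, not taken from the figure).
brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one of the two cursors, which is where the O(m+n) bound comes from.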
Can we do better?
Yes (if the index isn't changing too fast).
Augment postings with skip pointers (at indexing time)
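[Figure: two postings lists augmented with skip pointers; the upper list has skip pointers to 41 and 128, the lower list has skip pointers to 11 and 31]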
Why?
To skip postings that will not figure in the search
results.
How?
Where do we place skip pointers?
Query processing with skip pointers

[Figure: the two postings lists with skip pointers, as on the previous slide]
Suppose we've stepped through the lists until we process 8 on each list. We match it and advance.
We then have 41 on the upper list and 11 on the lower. 11 is smaller.
But the skip successor of 11 on the lower list is 31, which is still no larger than 41, so
we can skip ahead past the intervening postings.
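The walkthrough above can be sketched in code. This is a hedged illustration under stated assumptions: skip pointers are represented as a dict from a posting's index to the index it skips to (a hypothetical representation), and the example lists and skip positions are assumed values chosen to agree with the docIDs in the walkthrough.

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted docID lists, following skip pointers when safe.

    skips1 / skips2 map a source index to a target index (assumed layout).
    A skip is taken only if the docID at its target does not overshoot the
    current docID on the other list, so no match can be missed.
    """
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]  # jump past the intervening postings
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return answer


# Hypothetical example data (assumed, consistent with the walkthrough).
upper = [2, 4, 8, 41, 48, 64, 128]
upper_skips = {0: 3, 3: 6}          # 2 -> 41, 41 -> 128
lower = [1, 2, 3, 8, 11, 17, 21, 31]
lower_skips = {0: 4, 4: 7}          # 1 -> 11, 11 -> 31
print(intersect_with_skips(upper, upper_skips, lower, lower_skips))  # [2, 8]
```

After 8 is matched, the cursors sit on 41 and 11; since the skip target 31 is not greater than 41, the lower cursor jumps straight to 31 and never visits 17 or 21.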
Where do we place skips?

Tradeoff:
More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
Placing skips
Simple heuristic: for postings of length L, use √L evenly spaced skip pointers
[Moffat and Zobel 1996].
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps
changing because of updates.
This definitely used to help; with modern hardware it may
not unless you're memory-based [Bahle et al. 2002].
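A minimal sketch of the square-root placement heuristic is shown below. The function name `place_skips` and the index-to-index dict it returns are assumptions made for illustration; a real index would store skips inside the postings structure itself.

```python
import math


def place_skips(postings):
    """Return {source_index: target_index} with roughly sqrt(L) skips.

    Skips are evenly spaced with a span of floor(sqrt(L)) postings.
    Lists too short to benefit (span < 2) get no skip pointers.
    """
    length = len(postings)
    span = math.isqrt(length)
    if span < 2:
        return {}
    # Stop early enough that every target index stays inside the list.
    return {i: i + span for i in range(0, length - span, span)}


print(place_skips(list(range(16))))  # {0: 4, 4: 8, 8: 12}
```

Because the spacing depends on L, the pointers have to be recomputed whenever a list grows or shrinks, which is why this is easiest for a relatively static index.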
Positional postings and phrase queries

Many complex or technical concepts and many
organization and product names are multiword
compounds or phrases.
Most recent search engines support a double
quotes syntax ("stanford university") for phrase
queries.
As many as 10% of web queries are phrase
queries, and many more are implicit phrase
queries (such as person names), entered without
use of double quotes.
1. Biword indexes
One approach to handling phrases is to consider
every pair of consecutive terms in a document as
a phrase.
For example, the text "Friends, Romans,
Countrymen" would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a
vocabulary term.
The concept of a biword index can be extended to
longer sequences of words, and if the index
includes variable length word sequences, it is
generally referred to as a phrase index.
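The biword idea can be sketched in a few lines. This is an illustrative sketch only: the tokenizer (lowercasing and keeping alphanumeric runs) and the names `biwords` and `build_biword_index` are assumptions, not part of the lecture.

```python
import re
from collections import defaultdict


def biwords(text):
    """Return every pair of consecutive tokens as a single 'biword' term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenizer
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(biwords(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)


print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```

A two-word phrase query then becomes an ordinary lookup of one vocabulary term in this index.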
2. Positional indexes
A biword index is not the standard solution.
Rather, a positional index is most commonly
employed.
Here, for each term in the vocabulary, we store
postings of the form docID: ⟨position1, position2, ...⟩, e.g.
to, 993427:
⟨1, 6: ⟨7, 18, 33, 72, 86, 231⟩;
2, 5: ⟨1, 17, 74, 222, 255⟩;
4, 5: ⟨8, 16, 190, 429, 433⟩;
5, 2: ⟨363, 367⟩;
7, 3: ⟨13, 23, 191⟩; ...⟩
be, 178239:
⟨1, 2: ⟨17, 25⟩;
4, 5: ⟨17, 191, 291, 430, 434⟩; ...⟩
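As a rough sketch, a positional index of this shape can be built as a nested mapping from term to docID to position list. The whitespace tokenizer, the 1-based positions, the function name, and the two toy documents below are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Return {term: {docID: [positions]}} with 1-based token positions."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()  # assumed tokenizer
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


# Hypothetical toy collection.
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(index["to"])  # {1: [1, 5], 2: [1, 4]}
print(index["be"])  # {1: [2, 6], 2: [5]}
```

The per-document frequency shown in the postings above is simply the length of each position list in this representation.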
2. Positional indexes
To process a phrase query, we still need to access
the inverted index entries for each distinct term.
As before, we would start with the least frequent
term and then work to further restrict the list of
possible candidates.
In the merge operation, the same general
technique is used as before, but rather than simply
checking that both terms are in a document, we
also need to check that their positions of
appearance in the document are compatible with
the phrase query being evaluated.
Example: Satisfying phrase queries

Suppose the postings lists for "to" and "be" are as in the previous slide, and
the query is "to be or not to be". The postings lists to access are:
to, be, or, not. We will examine intersecting the postings lists for "to"
and "be". We first look for documents that contain both terms. Then,
we look for places in the lists where there is an occurrence of "be"
with a token index one higher than a position of "to", and then we
look for another occurrence of each word with token index 4 higher
than the first occurrence. In the above lists, the pattern of
occurrences that is a possible match is:
to: ⟨...; 4: ⟨..., 429, 433⟩; ...⟩
be: ⟨...; 4: ⟨..., 430, 434⟩; ...⟩
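The position check in this example can be sketched as follows. This is a simplified illustration, not the lecture's algorithm: it uses set lookups instead of a linear merge of position lists, the function name and `offset` parameter are hypothetical, and the data contains only the postings visible on the slide (the list for "be" is incomplete there).

```python
def positional_intersect(post_a, post_b, offset=1):
    """Return {docID: [p, ...]} where term A is at p and term B at p + offset."""
    result = {}
    for doc_id in sorted(post_a.keys() & post_b.keys()):
        positions_b = set(post_b[doc_id])
        hits = [p for p in post_a[doc_id] if p + offset in positions_b]
        if hits:
            result[doc_id] = hits
    return result


# Postings as shown on the slide: {docID: [positions]}.
to = {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
      4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]}
be = {1: [17, 25], 4: [17, 191, 291, 430, 434]}

# Step 1: every place where "be" directly follows "to".
to_be = positional_intersect(to, be)
print(to_be)  # {4: [16, 190, 429, 433]}

# Step 2: keep starts p where "to be" occurs again 4 tokens later.
candidates = {d: [p for p in ps if p + 4 in ps] for d, ps in to_be.items()}
candidates = {d: ps for d, ps in candidates.items() if ps}
print(candidates)  # {4: [429]}
```

The surviving candidate corresponds to the pattern 429, 433 for "to" and 430, 434 for "be" in document 4; a full evaluation would still need to confirm "or" and "not" at the positions in between.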
