03 Dict

Introduction to Information Retrieval
Introduction to
Information Retrieval
Hinrich Schtze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval
Outline

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

2
Dictionary data structures for inverted indexes

The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list in what data structure?
Sec. 3.1
Sec. 3.1
A nave dictionary
An array of struct:
char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
Sec. 3.1
Dictionary data structures

Two main choices:
Hash table Tree
Some IR systems use hashes, some trees
Sec. 3.1
Hashes
Each vocabulary term is hashed to an integer
(We assume youve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search [tolerant retrieval] If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
Sec. 3.1
Trees
Simplest: binary tree More usual: B-trees Trees require a standard ordering of characters and hence strings but we standardly have one Pros: Solves the prefix problem (terms starting with hyp) Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
Binary tree
Sec. 3.1
Tree: B-tree
a-hu hy-m
n-z
Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
Outline


10
Sec. 3.2
Wild-card queries: *
mon*: find all docs containing any word beginning mon. Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon w < moo *mon: find words ending in mon: harder
Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ?
How to handle * in the middle of a term

Example: m*nchen We could look up m* and *nchen in the B-tree and intersect the two term sets. Expensive Alternative: permuterm index Basic idea: Rotate every wildcard query, so that the * occurs at the end. Store each of these rotations in the dictionary, say, in a B-tree
12
B-trees handle *s at the end of a query term

co*tion
Sec. 3.2
How can we handle *s in the middle of query term? We could look up co* AND *tion in a B-tree and intersect the two term sets
Expensive
The solution: transform wild-card queries so that the *s occur at the end This gives rise to the Permuterm Index.
Permuterm index
For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree where $ is a special symbol
14
Permuterm term mapping
15
Permuterm index
For HELLO, weve stored: hello$, ello$h, llo$he, lo$hel, and o$hell Queries
For X, look up X$ For X*, look up $X* For *X, look up X$* For *X*, look up X*
For X*Y, look up Y$X*

Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree. But permuterm index is the more common name.
16
Processing a lookup in the permuterm index
Rotate query wildcard to the right Use B-tree lookup as before Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)
17
k-gram indexes
More space-efficient than permuterm index Enumerate all character k-grams (sequence of k characters) occurring in a term 2-grams are called bigrams. Example: from April is the cruelest month we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$ $ is a special word boundary symbol, as before. Maintain an inverted index from bigrams to the terms that contain the bigram
18
Postings list in a 3-gram inverted index
19
k-gram (bigram, trigram, . . . ) indexes

Note that we now have two different types of inverted indexes The term-document inverted index for finding documents based on a query consisting of terms The k-gram index for finding terms based on a query consisting of k-grams
20
Processing wildcarded terms in a bigram index

Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon . . . . . . but also many false positives like MOON. We must postfilter these terms against query. Surviving terms are then looked up in the term-document inverted index. k-gram index vs. permuterm index
k-gram index is more space efficient.
Permuterm index doesnt require postfiltering.
21
Sec. 3.2.2
Processing wild-card queries

As before, we must execute a Boolean query for each enumerated, filtered term. Wild-cards can result in expensive query execution (very large disjunctions)
pyth* AND prog*
If you encourage laziness people will respond!

Search
Type your search terms, use * if you need to. E.g., Alex* will match Alexander.
Which web search engines allow wildcard queries?
Outline


23
Distance between misspelled word and correct word
We will study several alternatives. weighted edit distance Edit distance and Levenshtein distance k-gram overlap
24
Weighted edit distance

As above, but weight of an operation depends on the characters involved. Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. Therefore, replacing m by n is a smaller edit distance than by q.
25
Edit distance
The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2. Levenshtein distance: The admissible basic operations are insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2
26
Levenshtein distance: Algorithm
27
28
29
30
31
Each cell of Levenshtein matrix

cost of getting here from my upper left neighbor (copy or replace) cost of getting here from my left neighbor (insert) cost of getting here from my upper neighbor (delete) the minimum of the three possible movements; the cheapest way of getting here
32
Levenshtein distance: Example
33
Exercise
Compute
Levenshtein distance matrix for OSLO SNOW What are the Levenshtein editing operations that transform cat into catcat?
34
35
36
37
38
39
40
41
How do I read out the editing operations that transform OSLO into SNOW?
42
43
44
45
46
47
48
49
50
51
52
Outline


53
Sec. 3.3
Spell correction
Two main flavors:
Isolated word
Check each word on its own for misspelling Will not catch typos resulting in correctly spelled words e.g., from form
Context-sensitive
Look at surrounding words, e.g., I flew form Heathrow to Narita.
Correcting queries
First: isolated word spelling correction Premise 1: There is a list of correct words from which the correct spellings come. Premise 2: We have a way of computing the distance between a misspelled word and a correct word. Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word. Example: informaton information For the list of correct words, we can use the vocabulary of all words that occur in our collection. Why is this problematic?
55
Alternatives to using the term vocabulary
A standard dictionary (Websters, OED etc.) An industry-specific dictionary (for specialized IR systems)
56
k-gram indexes for spelling correction

Enumerate all k-grams in the query term Example: bigram index, misspelled word bordroom Bigrams: bo, or, rd, dr, ro, oo, om Use the k-gram index to retrieve correct words that match query term k-grams Threshold by number of matching k-grams
57
k-gram indexes for spelling correction: bordroom
58
Sec. 3.3.4
Jaccard coefficient
A commonly-used measure of overlap Let X and Y be two sets; then the J.C. is
X Y / X Y
Equals 1 when X and Y have the same elements and zero when they are disjoint X and Y dont have to be of the same size
Example : bigrams match

Query A = bordroom reaches term B = boardroom A = bo or rd dr ro oo om B = bo oa ar rd dr ro oo om A B / A B = 2 / (7 + 8 -6) = 6/9
60
Sec. 3.3.5
Context-sensitive spell correction

Text: I flew from Heathrow to Narita. Consider the phrase query flew form Heathrow Wed like to respond Did you mean flew from Heathrow? because no docs matched the query phrase.
Sec. 3.3.5
Context-sensitive correction
Need surrounding context to catch this. First idea: retrieve dictionary terms close (in weighted edit distance) to each query term Now try all possible resulting phrases with one word fixed at a time
flew from heathrow fled form heathrow flea form heathrow
Hit-based spelling correction: Suggest the alternative that has lots of hits.
Outline


63
Soundex
Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives. Example: chebyshev / tchebyscheff Algorithm:
Turn every token to be indexed into a 4-character reduced form Do the same with query terms
Build and search an index on the reduced forms
64
Soundex algorithm
Retain
the first letter of the term. Change all occurrences of the following letters to 0 (zero): A, E, I, O, U, H, W, Y Change letters to digits as follows: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6 Repeatedly remove one out of each pair of consecutive identical digits Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits 65
Example: Soundex of HERMAN

Retain H ERMAN 0RM0N 0RM0N 06505 06505 06505 06505 655 Return H655 Note: HERMANN will generate the same code
66
How useful is Soundex?

Not very for information retrieval OK for similar-sounding terms. The idea owes its origins to work in international police department , seeking to match names for wanted criminals despite the names being spelled differently in different countries.
67

03 Dict

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

03 Dict

Hochgeladen von

Copyright:

Verfügbare Formate

Introduction to Information Retrieval

Introduction to Information Retrieval

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

Introduction to Information Retrieval

Dictionary data structures for inverted indexes

Introduction to Information Retrieval

Introduction to Information Retrieval

Dictionary data structures

Some IR systems use hashes, some trees

Introduction to Information Retrieval

Introduction to Information Retrieval

Introduction to Information Retrieval

Introduction to Information Retrieval

Introduction to Information Retrieval

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

Introduction to Information Retrieval

Introduction to Information Retrieval

How to handle * in the middle of a term

Introduction to Information Retrieval

B-trees handle *s at the end of a query term

Introduction to Information Retrieval

Introduction to Information Retrieval

Permuterm term mapping

Introduction to Information Retrieval

For X*Y, look up Y$X*

Introduction to Information Retrieval

Processing a lookup in the permuterm index

Introduction to Information Retrieval

Introduction to Information Retrieval

Postings list in a 3-gram inverted index

Introduction to Information Retrieval

k-gram (bigram, trigram, . . . ) indexes

Introduction to Information Retrieval

Processing wildcarded terms in a bigram index

Permuterm index doesnt require postfiltering.

Introduction to Information Retrieval

Processing wild-card queries

If you encourage laziness people will respond!

Which web search engines allow wildcard queries?

Introduction to Information Retrieval

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

Introduction to Information Retrieval

Distance between misspelled word and correct word

Introduction to Information Retrieval

Weighted edit distance

Introduction to Information Retrieval

Introduction to Information Retrieval

Levenshtein distance: Algorithm

Introduction to Information Retrieval

Levenshtein distance: Algorithm

Introduction to Information Retrieval

Levenshtein distance: Algorithm

Introduction to Information Retrieval

Levenshtein distance: Algorithm

Introduction to Information Retrieval

Levenshtein distance: Algorithm

Introduction to Information Retrieval

Each cell of Levenshtein matrix

Introduction to Information Retrieval

Levenshtein distance: Example

Introduction to Information Retrieval

Introduction to Information Retrieval

Introduction to Information Retrieval

For XY, look up Y$X