
Inverted Index

Lecture 2
Abstraction of search engine architecture
[Figure: a crawler builds the indexed corpus; a doc analyzer turns documents into doc representations, which the indexer stores in the index. A user's query becomes a query representation, the ranker matches it against the index to produce results, and evaluation/feedback loop back into the ranking procedure.]
2
What we have now
• Documents have been
– Crawled from Web
– Tokenized/normalized
– Represented as Bag-of-Words
• Let’s do search!
– Query: “information retrieval”
         information  retrieval  retrieved  is  helpful  for  you  everyone
Doc1          1           1          0      1      1      1    0      1
Doc2          1           0          1      1      1      1    1      0
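
With this representation, answering the query amounts to checking, for each document, whether every query term's entry is 1. A minimal Python sketch using the vectors from the table above (it assumes every query term is in the vocabulary):

terms = ["information", "retrieval", "retrieved", "is",
         "helpful", "for", "you", "everyone"]
docs = {
    "Doc1": [1, 1, 0, 1, 1, 1, 0, 1],
    "Doc2": [1, 0, 1, 1, 1, 1, 1, 0],
}

def matches(query, doc_vector):
    # A document matches if it contains every query term.
    return all(doc_vector[terms.index(t)] == 1 for t in query)

query = ["information", "retrieval"]
print([d for d, v in docs.items() if matches(query, v)])  # ['Doc1']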

3
Term-document incidence matrices

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0

Boolean query: Brutus AND Caesar BUT NOT Calpurnia

Entry is 1 if the play contains the word, 0 otherwise.
4
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus,
Caesar and Calpurnia (complemented) → bitwise AND them:
– 110100 (Brutus) AND
– 110111 (Caesar) AND
– 101111 (Calpurnia, complemented)
– = 100100  →  Antony and Cleopatra, Hamlet

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0
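
A minimal sketch of this Boolean retrieval in Python, representing each incidence vector as a 6-bit integer (one bit per play, taken from the table above):

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors as 6-bit integers; the leftmost bit is Antony and Cleopatra.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar BUT NOT Calpurnia: complement Calpurnia, then bitwise AND.
result = brutus & caesar & ~calpurnia & 0b111111   # mask keeps 6 bits

print(f"{result:06b}")   # 100100
print([p for i, p in enumerate(plays) if result & (1 << (5 - i))])
# ['Antony and Cleopatra', 'Hamlet']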

5
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.

6
Can’t build the matrix
• 500K x 1M matrix has half-a-trillion 0's and 1's.
• But it has no more than one billion 1's. Why?
– Each document has at most 1,000 words, so there are at most 1,000 × 10⁶ = 10⁹ ones: the matrix is extremely sparse.
• What's a better representation?
– We only record the 1 positions.

7
Complexity analysis
• Space complexity analysis
– O(D × V)
• D is the total number of documents and V is the vocabulary size
– Zipf's law: each document only has about 10% of the vocabulary observed in it
• 90% of the space is wasted!
– Space efficiency can be greatly improved by storing only the words that actually occur

8
Solution: linked list for each document
Doc1: Term1, Term2, Term5, …
Doc2: Term4, Term6, Term8, …
Doc3: Term9, Term21, Term15, …
Doc4: Term1, Term12, Term5, …
Doc5: Term3, Term27, Term5, …
…
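
A minimal sketch of this structure in Python, with ordinary lists standing in for the linked lists (the TermN names are the placeholders from above):

# Forward index: each document stores only the terms that occur in it,
# instead of a full vocabulary-sized 0/1 row.
forward_index = {
    "Doc1": ["Term1", "Term2", "Term5"],
    "Doc2": ["Term4", "Term6", "Term8"],
    "Doc3": ["Term9", "Term21", "Term15"],
    "Doc4": ["Term1", "Term12", "Term5"],
    "Doc5": ["Term3", "Term27", "Term5"],
}

# Space is now proportional to the tokens that actually occur
# (roughly 10% of V per document), not to D × V.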

9
Complexity analysis
• Time complexity analysis
– O(|q| × D × |D|)
• |q| is the length of the query, D is the number of documents, and |D| is the length of a document

def naive_search(q, D):
    doclist = []
    for wi in q:                # each query word
        for d in D:             # scan every document in the collection
            for wj in d:        # bottleneck, since most of them won't match!
                if wi == wj:
                    doclist.append(d)
                    break
    return doclist

10
Solution: Inverted Index

The key data structure underlying modern IR
11
Inverted index
• Build a look-up table for each word in the vocabulary
– From a word, find the documents containing it!
• The word "posting" comes from the word positions in the index
• Time complexity: O(|q| × |L|), where |L| is the average length of a posting list

Query: "information retrieval"

Dictionary      Postings
information  →  Doc1, Doc2
retrieval    →  Doc1
retrieved    →  Doc2
is           →  Doc1, Doc2
helpful      →  Doc1, Doc2
for          →  Doc1, Doc2
you          →  Doc2
everyone     →  Doc1
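
A minimal Python sketch of building such a look-up table and answering the query by intersecting posting lists (the two document texts are assumed here so that they reproduce the postings above):

from collections import defaultdict

docs = {
    "Doc1": "information retrieval is helpful for everyone",
    "Doc2": "information retrieved is helpful for you",
}

# Build the inverted index: word -> set of docIDs
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    # Intersect the posting lists of all query words.
    postings = [index[w] for w in query.split()]
    return sorted(set.intersection(*postings))

print(search("information retrieval"))  # ['Doc1']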
12
Inverted Index
• If we have a corpus of 1 million documents, each of length 1,000 words, and a total vocabulary size of 500,000, what is the approximate maximum size of the postings and the size of the matrix (which contains a 1 in row i and column j if word i occurs in document j, and a 0 otherwise), respectively?

A. 10⁹ and 5×10¹¹
B. Both 10⁹
C. 10³ and 5×10¹¹
D. Both 5×10¹¹
13
Inverted Index
Answer: Option A: 10⁹ and 5×10¹¹

If no word is repeated within a document, then the postings can at most reach a size of 1,000 × 10⁶ = 10⁹, whereas the matrix has 500,000 rows and 10⁶ columns, yielding a total size of 5×10¹¹ elements.

14
Structures for inverted index
• Dictionary: modest size
– Needs fast random access
– Stays in memory
– Hash table, B-tree, …
• Postings: huge
– Stay on disk
– Contain docID, term freq, term position, …
– Compression is needed

"Key data structure underlying modern IR"
– Christopher D. Manning

15
Sorting-based inverted index construction
<Tuple>: <termID, docID>

[Figure: Parse & Count → Sort by termID → Create a posting for each termID. Parsing emits tuples in docID order (doc1: <1,1>, <2,1>, <3,1>, …; doc2: <3,2>, <7,2>, <4,2>, …; …; doc300: <6,300>, <3,300>, …). Sorting by termID brings all the info about term 1 together (<1,1>, <1,2>, <1,5>, <1,6>, …, <1,300>, then <2,1>, …), from which one posting per termID is created: <1: 1, 2, 5, …, 300>, <2: 1, 4, 6, …, 200>, …, <4999: 3, 8, …, 299>, <5000: 7, 9, …, 300>.]
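
A minimal Python sketch of the sort-then-group step on a few tuples (in memory only, so it just illustrates the ordering logic; the IDs are the ones from the figure):

from itertools import groupby

# (termID, docID) tuples as emitted during parsing, in docID order
pairs = [(1, 1), (2, 1), (3, 1),          # doc1
         (3, 2), (7, 2), (4, 2), (1, 2),  # doc2
         (1, 5)]                          # doc5, and so on

# Sort by termID (then docID) so all tuples of a term become adjacent.
pairs.sort()

# Group consecutive tuples into one posting list per termID.
postings = {term: [doc for _, doc in group]
            for term, group in groupby(pairs, key=lambda p: p[0])}
print(postings)   # {1: [1, 2, 5], 2: [1], 3: [1, 2], 4: [2], 7: [2]}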


16
Problem
• All the index data cannot fit in memory, so how do we sort it all by term?
Solution
• Divide the data into fixed-size blocks that can fit in memory
• Create a separate index for each block, sorted by term, and then merge all the block indexes using merge sort
17
Sorting-based inverted index construction
<Tuple>: <termID, docID>
[Figure: Parse & Count → "local" sort → merge sort. Parsing emits tuples in docID order (doc1: <1,1>, <2,1>, <3,1>; doc2: <3,2>, <7,2>, <4,2>; …; doc300: <6,300>, <3,300>, …). Each block is sorted locally by termID (one block: <1,1>, <1,2>, <2,1>, <2,4>, …; another: <1,5>, <1,6>, <2,3>, <2,7>, …), and the sorted runs are then merge-sorted into the final termID order, with all info about term 1 first: <1,1>, <1,2>, <1,5>, <1,6>, …, <1,300>, <2,1>, …, <5000,299>, <5000,300>.]

18
HashMap or Dictionary
• Do we need to sort all (termID, docID) pairs?
• We can use a HashMap or Dictionary to create postings on the fly as we parse the documents
• Benefit
– The number of items that need to be sorted is substantially reduced

19
Sec. 4.3

SPIMI: Single-pass in-memory indexing

• Key idea 1: Divide the data into fixed-size blocks that can fit in memory.
• Key idea 2: Don't sort. Accumulate postings in postings lists as they occur, using a dictionary.
• With these two ideas we can generate a complete inverted index for each block.
• These separate indexes can then be merged into one big index.

20
Postings Merge
Postings to be merged                              Merged postings

Brutus  d2 d3 d6     Brutus  d9 d19 d21      →     Brutus  d2 d3 d6 d9 d19 d21
Caesar  d4 d5 d7     Caesar  d8 d9 d10       →     Caesar  d4 d5 d7 d8 d9 d10
noble   d2 d5 d6     Julius  d13 d15 d17     →     Julius  d13 d15 d17
with    d2 d3 d4     killed  d16 d17 d19     →     killed  d16 d17 d19
                                                   noble   d2 d5 d6
                                                   with    d2 d3 d4

(on disk)
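
A minimal sketch of merging two block indexes (dicts of sorted posting lists, matching the figure; since block 1 holds the smaller docIDs, concatenation keeps each merged list sorted):

def merge_indexes(a, b):
    # Union of the two dictionaries; concatenate postings for shared terms.
    merged = {}
    for term in sorted(a.keys() | b.keys()):
        merged[term] = a.get(term, []) + b.get(term, [])
    return merged

block1 = {"Brutus": [2, 3, 6], "Caesar": [4, 5, 7],
          "noble": [2, 5, 6], "with": [2, 3, 4]}
block2 = {"Brutus": [9, 19, 21], "Caesar": [8, 9, 10],
          "Julius": [13, 15, 17], "killed": [16, 17, 19]}

print(merge_indexes(block1, block2)["Brutus"])  # [2, 3, 6, 9, 19, 21]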
21
Sec. 4.3

SPIMI-Invert
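
A minimal Python sketch of the SPIMI idea for a single block (simplified from the book's SPIMI-Invert: the token stream is assumed to be (term, docID) pairs, and the block index is returned instead of being written to disk):

from collections import defaultdict

def spimi_invert(token_stream):
    # Accumulate postings on the fly; no global sort of (termID, docID) pairs.
    dictionary = defaultdict(list)        # term -> posting list
    for term, doc_id in token_stream:
        postings = dictionary[term]       # created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Only the (much smaller) set of terms is sorted before writing the block out.
    return {term: dictionary[term] for term in sorted(dictionary)}

tokens = [("brutus", 2), ("caesar", 2), ("brutus", 3), ("caesar", 4)]
print(spimi_invert(tokens))   # {'brutus': [2, 3], 'caesar': [2, 4]}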

22
Sec. 4.2

How to merge the sorted runs?

• It is more efficient to do a multi-way merge, where you read from all blocks simultaneously.
• Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you are not killed by disk seeks.
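
A minimal sketch of such a multi-way merge using a heap, with the sorted runs shown as in-memory lists of (termID, docID) tuples (real runs would be streamed from disk in chunks):

import heapq

# One sorted run of (termID, docID) tuples per block
run1 = [(1, 1), (1, 2), (2, 1), (2, 4)]
run2 = [(1, 5), (1, 6), (2, 3), (2, 7)]
run3 = [(1, 300), (2, 300)]

# heapq.merge reads lazily from all runs at once, always emitting
# the globally smallest remaining tuple.
merged = list(heapq.merge(run1, run2, run3))
print(merged)
# [(1, 1), (1, 2), (1, 5), (1, 6), (1, 300), (2, 1), (2, 3), (2, 4), (2, 7), (2, 300)]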

23
Slide Credits
• Lecture Notes, Text Retrieval and Mining by
Christopher Manning and Prabhakar
Raghavan, Stanford University

24
