UE16CS412
Instructor: Dr S Natarajan
Professor and Key Resource Person
Department of Computer Science and Engineering
PES University
Bengaluru
natarajan@pes.edu 9945280225
Slides: Adapted from the book “Introduction to Information Retrieval”
by the authors from Stanford University, supplemented with
slides from other universities
UNIT I
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing: Tokenization, Stemming, etc.
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
IR:
- original definition
“Information retrieval embraces the intellectual
aspects of the description of information and its
specification for search, and also whatever
systems, techniques, or machines are employed
to carry out the operation.”
Calvin Mooers, 1951
© Tefko Saracevic
Introduction to Information Retrieval
Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
You see things; and you say "Why?"
But I dream things that never were;
and I say "Why not?"
George Bernard Shaw
IR Through the Ages
• 3rd Century BCE
– Library of Alexandria
• 500,000 volumes
• catalogs and classifications
• 1600
– University of Oxford Library
• All books printed in England
Document Collections
[Image: stone inscriptions at the Brihadeeshwara Temple]
IR Through the Ages
• 1755
– Johnson’s Dictionary
• Set standard for dictionaries
• Included common language
• Helped standardize spelling
• 1800
– Library of Congress
• 1828
– Webster’s Dictionary
• Significantly larger than previous dictionaries
• Standardized American spelling
• 1852
– Roget’s Thesaurus
Document Collections
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE
IR Through the Ages
• Late 1980’s
– First mini-computer and PC systems incorporating
“relevance ranking”
• Early 1990’s
– information storage revolution
• 1992
– First large-scale information service incorporating
probabilistic retrieval (West’s legal retrieval system)
IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS (Wide Area Information Services)
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
IR Through the Ages
• Mid 1990’s to present
– Multimedia databases
• 1994 to present
– The Internet and Web explosion
• e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
– Digital Libraries
– Data Mining
– Agents and Filtering
– Knowledge and Distributed Intelligence
– Information Organization
– Knowledge Management
Historical Summary
• 1990’s
– Large-scale, full-text IR and filtering experiments
and systems (TREC)
– Dominance of ranking
– Many web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC (Text Retrieval Conference)
– Recommender Systems
• Ringo (Social Information Filtering System)
• Amazon
• NetPerceptions (leading seller of personalization
technology during the Internet boom of the late
1990s)
– Automated Text Categorization & Clustering
IR History Continued
2000’s:
– Link Analysis for Web Search
• Google
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA TIDES(Translingual Information Detection,
Extraction and Summarization)
– Document Summarization
– Learning to Rank
IR History continued
2000’s (continued)
Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
Question Answering
• TREC Q/A track
Multimedia IR
Cross-Language IR
WEB changed everything
• New issues:
• Trust
• Privacy, etc
• Additional sources:
• Social networking
• Wikipedia, etc.
Recent IR History
• 2010’s
– Intelligent Personal Assistants
• Siri
• Cortana
• Google Now
• Alexa
– Complex Question Answering
• IBM Watson
– Distributional Semantics
– Deep Learning
[Figure: retrieval technologies vs. collection scale – Boolean retrieval and filtering, then ranked retrieval and concept-based retrieval, then ranked filtering at gigabyte scale; summarization, information extraction, distributed retrieval, data mining, and visualization at terabyte scale.]
1-page word document without any images = ~10 kilobytes (KB) of disk space.
1 terabyte = one hundred million imageless word docs.
1 petabyte = one thousand terabytes.
Historical Summary
• The Future
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
– IR in context
– “Anytime, Anywhere”
Information Retrieval
• Ad Hoc Retrieval
– Given a query and a large database of text objects, find the
relevant objects
• Distributed Retrieval
– Many distributed databases
• Information Filtering
– Given a text object from an information stream (e.g. newswire)
and many profiles (long-term queries), decide which profiles
match
• Multimedia Retrieval
– Databases of other types of unstructured data, e.g. images,
video, audio
Information Retrieval
• Multilingual Retrieval
– Retrieval in a language other than English
• Cross-language Retrieval
– Query in one language (e.g. Spanish), retrieve
documents in other languages (e.g. Chinese,
French, and Spanish)
Information Retrieval Family Trees
[Figure: IR family tree – Cyril Cleverdon (Cranfield); Michael Lesk (Bell Labs, Rutgers, etc.); Bruce Croft (University of Massachusetts, Amherst); Donna Harman (NIST).]
Personalities in IR
Dr Marti Hearst, University of California at Berkeley
Dr Thorsten Joachims, Cornell
IR Systems and Tools
• Lucene (Apache)
• MeTA (Modern Text Analysis) (Univ. of Illinois at Urbana-Champaign)
• Lemur & Indri (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
Other Text Books in Information Retrieval
1. Modern Information Retrieval: The Concepts and Technology behind
Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition,
ACM Press Books, 2011
2. Search Engines: Information Retrieval in Practice, Bruce Croft, Donald
Metzler and Trevor Strohman, 1st Edition, Addison Wesley, 2009
3. Information Retrieval: Implementing and Evaluating Search Engines,
Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, 1st Edition,
MIT Press, 2010
4. Information Retrieval: Algorithms and Heuristics, David A. Grossman
and Ophir Frieder, 2nd Edition, Springer, 2004
5. Managing Gigabytes: Compressing and Indexing Documents and
Images, Ian H. Witten, Alistair Moffat, Timothy C. Bell, 2nd Edition,
Morgan Kaufmann, 1999
6. Readings in Information Retrieval, Karen Spärck Jones and Peter
Willett, Morgan Kaufmann, 1997
Courses in IR
CS276: Information Retrieval and Web Search – Stanford
CS 371R: Information Retrieval and Web Search - UT Computer Science
INFO/CS 4300: Language and Information Cornell
COMPSCI 646: Information Retrieval Umass
COMPSCI 546, Applied Information Retrieval UMass
CS60035: Information Retrieval – IITKgp
Info 240: Principles of Information Retrieval | UC Berkeley School of
Information
CS 54701: Information Retrieval - Purdue Computer Science
605.744 - Information Retrieval | Johns Hopkins University
COSC 488 –Information Retrieval Georgetown University
CS 4501/6501: Information Retrieval Virginia
Introduction to Information Retrieval Univ of Munich
CS510 - Advanced Information Retrieval UIUC
Unstructured (text) vs. structured (database)
data in the mid-nineties
Unstructured (text) vs. structured (database)
data today
The classic search model
User task: Get rid of mice in a politically correct way
   ↓ (misconception?)
Info need: Info about removing mice without killing them
   ↓ (misformulation?)
Query: how trap mice alive
   ↓
Search engine → query Results over the document Collection, with query refinement feeding back into the query
Definitions
Word – A delimited string of characters as it appears
in the text.
Term – A “normalized” word (case, morphology,
spelling etc); an equivalence class of words.
Token – An instance of a word or term occurring in a
document.
Type – The same as a term in most cases: an
equivalence class of tokens.
Shakespeare's classics
Romeo and Juliet
Macbeth
A Midsummer Night’s Dream
King Lear
Hamlet
The Tempest
Julius Caesar
Richard III
Othello
Henry IV
Twelfth Night
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing: Tokenization, Stemming, etc.
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Typical IR System Architecture
[Diagram: Users → Interface → Query Engine → Index; an Indexer (with tokenization) builds the Index from the Document Collection.]
A Typical Web Search Engine
[Diagram: the same pipeline, with a Crawler fetching pages from the Web and feeding the Indexer.]
Term Document (Incident) Matrix
Example:
You are given 3 documents as follows
D1: I go to the movie Super30
D2: You go to a library
D3: She goes to a Park
Stop words (articles, prepositions, pronouns: I, to, the, You, a, She, etc.) carry little content, so let us remove them.
Term Document (Incident) Matrix
We build Term Document Matrix
          D1  D2  D3
go         1   1   0
movie      1   0   0
Super30    1   0   0
library    0   1   0
goes       0   0   1
Park       0   0   1
We address some of the queries as below
movie OR Park returns documents D1 and D3
go OR goes returns documents D1, D2 and D3
go AND movie returns the document D1
go AND library returns the document D2
Inverted index (get documents for the terms):
go → D1, D2
movie → D1
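The worked example above can be checked with a short sketch. The documents and stop words are taken from the slide; the helper name `matches` is ours:

```python
# Term-document incidence matrix for the 3-document example.
docs = {
    "D1": "I go to the movie Super30",
    "D2": "You go to a library",
    "D3": "She goes to a Park",
}
stop_words = {"i", "to", "the", "you", "a", "she"}

terms = sorted({w for text in docs.values()
                for w in text.split() if w.lower() not in stop_words})
# incidence[t][d] = 1 if term t occurs in document d
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in terms}

def matches(query_terms, op):
    """Answer a flat OR/AND query over the incidence matrix."""
    result = []
    for d in docs:
        bits = [incidence[t][d] for t in query_terms]
        if (any(bits) if op == "OR" else all(bits)):
            result.append(d)
    return result

print(matches(["movie", "Park"], "OR"))   # ['D1', 'D3']
print(matches(["go", "movie"], "AND"))    # ['D1']
```

The four slide queries (movie OR Park, go OR goes, go AND movie, go AND library) all come out as stated.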
The term-document incidence matrix
Main idea: record for each document whether it contains each word
out of all the different words Shakespeare used (about 32K).
            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0
...
Query “Brutus AND Caesar AND NOT Calpurnia”
We compute the result for our query as the bitwise AND between the
vectors for Brutus and Caesar and the complement of the vector for Calpurnia:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
¬Calpurnia         1                0             1          1        1        1
AND                1                0             0          1        0        0

So the answer is Antony and Cleopatra and Hamlet.
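The same computation can be done with each term row packed into a machine word; the bit patterns below are the rows of the table (leftmost bit = Antony and Cleopatra), and the complement is taken by XOR with an all-ones mask:

```python
# Rows of the Shakespeare incidence matrix as bit vectors, one bit per play
# (order: Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
ALL = 0b111111                               # mask for the 6 plays

result = brutus & caesar & (ALL ^ calpurnia)  # AND NOT via complement
print(format(result, "06b"))                  # 100100 -> Antony&Cleopatra, Hamlet
```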
Boolean Retrieval Model
Boolean Models Problems
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents
retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
Extended Boolean Model
The Boolean model is simple and elegant,
but it makes no provision for ranking.
As with the fuzzy model, a ranking can be
obtained by relaxing the condition on set
membership.
Extend the Boolean model with the notions of
partial matching and term weighting.
Combine characteristics of the Vector model
with properties of Boolean algebra.
The Idea
The extended Boolean model (introduced by
Salton, Fox, and Wu, 1983) is based on a
critique of a basic assumption in Boolean
algebra.
Let
  q = kx ∨ ky (or q = kx ∧ ky)
  wxj = fxj · idf(x) / max(idf(i)), the normalized weight associated with the pair [kx, dj]
For brevity, write wxj = x and wyj = y, so each document dj is a point (x, y) in the unit square.
For the disjunctive query q = kx ∨ ky we want the document as far as possible from (0, 0):
  sim(q_or, dj) = sqrt((x² + y²) / 2)
For the conjunctive query q = kx ∧ ky we want the document as close as possible to (1, 1):
  sim(q_and, dj) = 1 − sqrt(((1 − x)² + (1 − y)²) / 2)
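A sketch of the two similarity formulas above, using the p = 2 (Euclidean) form of the Salton–Fox–Wu model; the names `sim_or` and `sim_and` are ours:

```python
from math import sqrt

def sim_or(x: float, y: float) -> float:
    """Extended Boolean similarity for q = kx OR ky (p = 2):
    the distance of (x, y) from (0, 0), normalised to [0, 1]."""
    return sqrt((x * x + y * y) / 2)

def sim_and(x: float, y: float) -> float:
    """Extended Boolean similarity for q = kx AND ky (p = 2):
    1 minus the distance of (x, y) from (1, 1), normalised to [0, 1]."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

# Both term weights at 1 -> perfect match for both query forms:
print(sim_or(1, 1), sim_and(1, 1))    # 1.0 1.0
# Only one term present -> partial credit, OR scoring higher than AND:
print(sim_or(1, 0), sim_and(1, 0))    # ~0.707 vs ~0.293
```

This is exactly the ranking the Boolean model lacks: a document matching one of two disjuncts now scores ≈ 0.707 rather than simply 1 or 0.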
Example: Westlaw
Westlaw query dashboard
Choosing connectors
Westlaw Queries/Information Needs
Comments on Westlaw
Does Google use the Boolean Model?
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing: Tokenization, Stemming, etc.
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer → token stream: Friends, Romans, countrymen
  → Linguistic modules → modified tokens: friend, roman, countryman
  → Indexer → inverted index:
       friend → 2, 4
       roman → 1, 2
       countryman → 13, 16
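A minimal end-to-end sketch of this pipeline. The “linguistic modules” step is reduced here to lowercasing and punctuation stripping (no stemming, so romans stays romans), and the second document text is invented for illustration:

```python
# Tokenize, normalise, build an inverted index, then intersect two
# postings lists with the classic linear merge.
docs = {1: "Friends, Romans, countrymen.",
        2: "Romans counted friends."}

def tokens(text):
    return [w.strip(".,").lower() for w in text.split()]

index = {}
for doc_id, text in docs.items():
    for term in tokens(text):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)        # postings stay sorted by docID

def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect(index["romans"], index["friends"]))   # [1, 2]
```

The merge walks both lists in step, which is the standard algorithm for AND queries over postings.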
Document delineation and character sequence decoding
• Obtaining the character sequence in a document:
– What format is it in?
• pdf/word/excel/html?
– What language is it in?
– What character set is in use?
• (CP1252, UTF-8, …)
• Each of these is a classification problem, which we will study later in the
course.
• But these tasks are often done heuristically …
Document delineation and character
sequence decoding
• Choosing a document unit:
• Documents being indexed can include docs from many
different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain
multiple languages/formats:
– a French email with a German pdf attachment
– a French email quoting clauses from an English-language
contract
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing: Tokenization, Stemming, etc.
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Introduction to Information Retrieval Sec. 2.2.1
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
Friends Romans and Countrymen
A token is an instance of a sequence of characters.
Each such token is now a candidate for an index
entry, after further processing (described below).
But what are valid tokens to emit?
Introduction to Information Retrieval Sec. 2.2.1
Tokenization
Issues in tokenization:
  Finland’s capital → Finland? Finlands? Finland’s?
  Hewlett-Packard → Hewlett and Packard as two tokens?
    state-of-the-art: break up the hyphenated sequence?
    co-education
    lowercase, lower-case, lower case?
    It can be effective to get the user to put in possible hyphens
  San Francisco: one token or two?
    How do you decide it is one token?
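One possible policy for the hyphen and apostrophe questions above, as a hedged sketch (the regex and the `split_hyphens` switch are our illustrative choices, not a standard):

```python
import re

# Split on non-alphanumerics but keep internal apostrophes and hyphens,
# then optionally break hyphenated words apart in a second pass.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text, split_hyphens=False):
    toks = TOKEN.findall(text)
    if split_hyphens:
        toks = [p for t in toks for p in t.split("-")]
    return toks

print(tokenize("Hewlett-Packard's state-of-the-art lab"))
# ["Hewlett-Packard's", 'state-of-the-art', 'lab']
print(tokenize("Hewlett-Packard's state-of-the-art lab", split_hyphens=True))
```

Either behaviour is defensible; what matters is that the indexer and the query parser apply the same policy.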
Introduction to Information Retrieval Sec. 2.2.1
Numbers
3/20/91 Mar. 12, 1991 20/3/91
55 B.C.
B-52
My PGP key is 324a3df234cb23e
(800) 234-2333
Often have embedded spaces
Older IR systems may not index numbers
But often very useful: think about things like looking up error
codes/stacktraces on the web
(One answer is using n-grams)
Will often index “meta-data” separately
Creation date, format, etc.
Introduction to Information Retrieval Sec. 2.2.1
[Example: the Arabic rendering of ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’ is read right-to-left, while the numerals within it are read left-to-right.]
With Unicode, the surface presentation is complex, but the
stored form is straightforward.
Introduction to Information Retrieval Sec. 2.2.2
Stop words
Non-content-bearing words are stop words, e.g. the, of, to,
and, in, for, that, said.
With a stop list, you exclude from the dictionary entirely the
commonest words. Intuition: they have little semantic content: the,
a, and, to, be.
There are a lot of them: ~30% of postings are for the top 30 words.
But the trend is away from doing this:
  Good compression techniques (IIR 5) mean the space for including
  stop words in a system is very small.
  Good query optimization techniques (IIR 7) mean you pay little at query
  time for including stop words.
You need them for:
  Phrase queries: “King of Denmark”
  Various song titles, etc.: “Let it be”, “To be or not to be”
  “Relational” queries: “flights to London”
Introduction to Information Retrieval Sec. 2.2.3
Normalization to terms
We need to “normalize” words in indexed text as well
as query words into the same form.
  We want to match U.S.A. and USA.
The result is terms: a term is a (normalized) word type,
which is an entry in our IR system dictionary.
We most commonly implicitly define equivalence
classes of terms by, e.g.,
  deleting periods to form a term: U.S.A., USA → USA
  deleting hyphens to form a term: anti-discriminatory, antidiscriminatory → antidiscriminatory
Term frequency tf
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
This determines the density of terms in the document
We want to use tf when computing query-document match
scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term
frequency.
Introduction to Information Retrieval Sec. 6.2
Log-frequency weighting
The log-frequency weight of term t in d is
  w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
  w_{t,d} = 0                     otherwise
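In code, with the convention above (weight 0 for an absent term):

```python
from math import log10

def log_tf_weight(tf: int) -> float:
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + log10(tf) if tf > 0 else 0.0

print(log_tf_weight(0))     # 0.0
print(log_tf_weight(1))     # 1.0
print(log_tf_weight(10))    # 2.0
print(log_tf_weight(1000))  # 4.0
```

This captures the point made above: 10 occurrences score higher than 1 occurrence, but only by one unit, not ten times.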
Introduction to Information Retrieval Sec. 2.2.3
Case folding
Reduce all letters to lower case
exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…
Normalization to terms
Lemmatization
• Reduce inflectional forms (forms of nouns; the past
tense, past participle, and present participle forms of
verbs; the comparative and superlative forms of
adjectives and adverbs) and variant forms to the base form
• E.g., am, are, is → be
        car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be
different color
• Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization
Transform to the standard form according to syntactic
category, e.g.
  verb + ing → verb
  noun + s → noun
Needs POS (part-of-speech) tagging.
More accurate than stemming, but needs more resources.
Sec. 2.2.4
Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all
reduced to automat.
Porter’s algorithm
• Commonest algorithm for stemming English
– Results suggest it’s at least as good as other
stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound
command, select the one that applies to the
longest suffix.
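This is not the real Porter stemmer – just a toy one-phase illustration of the “longest applicable suffix” convention, with a few Porter-style rules:

```python
# Toy stemmer: among the rules that match, the one with the
# longest suffix is selected, as in the sample convention above.
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"),
         ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word: str) -> str:
    applicable = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not applicable:
        return word
    suf, rep = max(applicable, key=lambda r: len(r[0]))  # longest suffix wins
    return word[: len(word) - len(suf)] + rep

print(stem("relational"))   # relate
print(stem("ponies"))       # poni
print(stem("caresses"))     # caress
```

Note how ponies matches both -ies and -s; the longest-suffix convention picks -ies, giving poni rather than ponie.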
Sec. 2.2.4
Other stemmers
• Other stemmers exist:
– Lovins stemmer
• http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Language-specificity
• The above methods embody transformations
that are
– Language-specific, and often
– Application-specific
• These are “plug-in” addenda (added at the
end) to the indexing process
• Both open source and commercial plug-ins are
available for handling these
Sec. 2.2.4
un+cook+ed=uncooked
Affix
Affix
Prefixes
• Prefixes are word parts (affixes) that
comes at the beginning of the root word
or base word.
un+cook+ed=uncooked
Prefix
Affix
Suffix
• Suffixes are word parts (affixes) that come
at the end of the root word or base word.
un+cook+ed=uncooked
Prefix
Suffix
Introduction to Information Retrieval Sec. 3.1
A naïve dictionary
An array of struct:
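The array-of-struct figure is missing from this extract; a sketch with hypothetical field names (`term`, `doc_freq`, `postings_ptr`) and illustrative numbers:

```python
from dataclasses import dataclass
import bisect

@dataclass
class DictEntry:
    term: str          # the term itself (a fixed-width char array in a C layout)
    doc_freq: int      # number of documents containing the term
    postings_ptr: int  # where the postings list for the term lives

# Kept sorted by term, so lookup is a binary search over the array.
dictionary = [DictEntry("a", 656265, 0),
              DictEntry("aachen", 65, 1),
              DictEntry("zulu", 221, 2)]

def lookup(term):
    keys = [e.term for e in dictionary]
    i = bisect.bisect_left(keys, term)
    return dictionary[i] if i < len(keys) and keys[i] == term else None

print(lookup("zulu").doc_freq)   # 221
```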
Introduction to Information Retrieval Sec. 3.1
Hashtables
Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search [tolerant retrieval]
If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything
Introduction to Information Retrieval Sec. 3.1
Tree: B-tree
[Figure: a B-tree over the dictionary; the root partitions terms into the ranges a–hu, hy–m, and n–z.]
Trees
Simplest: binary tree
More usual: B-trees
Trees require a standard ordering of characters and hence
strings … but we typically have one
Pros:
Solves the prefix problem (terms starting with hyp)
Cons:
Slower: O(log M) [and this requires balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing: Tokenization, Stemming, etc.
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Phrase queries
Want to answer queries such as stanford
university – as a phrase.
Thus the sentence “I went to university at
Stanford” is not a match.
It no longer suffices to store only <term: docs> entries;
we also store positions:
  <be: 993427;
    1: <7, 18, 33, 72, 86, 231>;
    2: <3, 149>;
    4: <17, 191, 291, 430, 434>;
    5: <363, 367, …> >
Which of docs 1, 2, 4, 5 could contain “to be
or not to be”?
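A sketch of answering a two-word phrase with positional postings. The postings for be are taken from the slide; those for to are invented for illustration:

```python
# term -> {docID: sorted positions}
positional_index = {
    "to": {1: [4, 9, 20], 4: [16, 190], 5: [300, 362]},
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
}

def phrase_docs(t1, t2, index):
    """Docs where some position of t2 is exactly (position of t1) + 1."""
    hits = []
    for doc in index[t1].keys() & index[t2].keys():
        pos2 = set(index[t2][doc])
        if any(p + 1 in pos2 for p in index[t1][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs("to", "be", positional_index))   # [4, 5]
```

With these (partly invented) positions, docs 4 and 5 have to immediately followed by be (16→17 and 362→363), while doc 1 contains both words but never adjacently.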
Wild-card queries: *
mon*: find all docs containing any word beginning
with “mon”.
Easy with binary tree (or B-tree) lexicon: retrieve all
words in range: mon ≤ w < moo
*mon: find words ending in “mon”: harder
Maintain an additional B-tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.
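With a sorted in-memory lexicon standing in for the B-tree, both range lookups above are binary searches (the last-character bump below is a simplification that ignores prefixes ending in 'z'):

```python
import bisect

# A toy sorted lexicon; mon* becomes the range mon <= w < moo.
lexicon = sorted(["demon", "mon", "money", "monsoon", "month", "moon", "sermon"])

def prefix_range(prefix, words):
    lo = bisect.bisect_left(words, prefix)
    # Upper bound: bump the last character (simplified; breaks for 'z').
    hi = bisect.bisect_left(words, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return words[lo:hi]

print(prefix_range("mon", lexicon))   # ['mon', 'money', 'monsoon', 'month']

# *mon: a second lexicon of reversed terms, searched with nom <= w < non.
rev = sorted(w[::-1] for w in lexicon)
print([w[::-1] for w in prefix_range("nom", rev)])   # words ending in 'mon'
```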
Query processing
At this point, we have an enumeration of all terms in
the dictionary that match the wild-card query.
We still have to look up the postings for each
enumerated term.
E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
Introduction to Information Retrieval Sec. 3.2.1
Permuterm index
For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
Queries:
  X     → lookup on X$
  X*    → lookup on $X*
  *X    → lookup on X$*
  *X*   → lookup on X*
  X*Y   → lookup on Y$X*
  X*Y*Z → ??? Search for X*Y and Y*Z: look up Y$X* and Z$Y*?
Example: query hel*o, so X = hel and Y = o → lookup o$hel*
Example: we’re looking for c*h*r → look up h$c* and r$h*?
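A permuterm sketch for single-'*' queries; the rotation table maps each rotation back to its term, and the query is rotated so the '*' would sit at the end:

```python
def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

lexicon = ["hello", "help", "halo"]
permuterm = {}                      # rotation -> original term
for term in lexicon:
    for rot in rotations(term):
        permuterm[rot] = term

def wildcard(query):
    """Handle queries with one '*', e.g. hel*o -> prefix lookup on o$hel."""
    q = query + "$"
    star = q.index("*")
    key = q[star + 1:] + q[:star]   # rotate the query so '*' is at the end
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})

print(wildcard("hel*o"))   # ['hello']
print(wildcard("hel*"))    # ['hello', 'help']
```

In a real system the prefix match over rotations would itself be a B-tree range scan rather than a dictionary walk.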
Wild card queries
1. The wildcard queries by a user can be divided into 5 cases:
   1. X*    → $X*
   2. *X    → X$*
   3. X*Y   → Y$X*
   4. X*Y*Z → (Z$X*) and (Y*)
   5. *X*   → can be converted to the X* form
2. The first 3 cases can be done by prefix matching with the corresponding rotated
forms given on the RHS.
3. The 4th and 5th are exceptional cases.
4. The 5th can be converted to the 1st case (*X* rotates to X*).
5. For case 4, the two lookups (Z$X*) and (Y*) are run and their posting lists
generated; a bitwise AND on the vectors containing their posting lists then
yields the required result.
Bigram (k-gram) indexes
Enumerate all bigrams occurring in any term; e.g., from
“April is the cruelest month” we get
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, th, h$
$ is a special word-boundary symbol.
Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
Introduction to Information Retrieval Sec. 3.2.2
Example postings in the bigram index:
  $m → mace, madden
  mo → among, amortize
  on → along, among
Processing wild-cards
Query mon* can now be run as
$m AND mo AND on
Gets terms that match AND version of our wildcard
query.
But we’d enumerate moon.
Must post-filter these terms against query.
Surviving enumerated terms are then looked up in
the term-document inverted index.
Fast, space efficient (compared to permuterm).
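A sketch of the bigram-index pipeline with the post-filtering step, using a toy lexicon (our choice) in which moon is exactly the false positive the slide mentions:

```python
import re

lexicon = ["moon", "month", "mood", "demon"]

def bigrams(term):
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

index = {}
for term in lexicon:
    for bg in bigrams(term):
        index.setdefault(bg, set()).add(term)

def query_bigrams(q):
    # '*' interrupts the term, so only real ends get the '$' boundary.
    grams = set()
    for part in ("$" + q + "$").split("*"):
        grams |= {part[i:i + 2] for i in range(len(part) - 1)}
    return grams

def wildcard(q):
    candidates = set(lexicon)
    for g in query_bigrams(q):          # $m AND mo AND on for mon*
        candidates &= index.get(g, set())
    pattern = re.compile("^" + q.replace("*", ".*") + "$")
    return sorted(t for t in candidates if pattern.match(t))  # post-filter

print(sorted(index["$m"] & index["mo"] & index["on"]))  # enumerates moon too
print(wildcard("mon*"))                                  # post-filter keeps month
```

moon survives the bigram AND (it contains $m, mo, and on) but is removed by the regex post-filter, exactly as described above.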
Introduction to Information Retrieval Sec. 3.2.2
Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting user queries to retrieve “right” answers
Two main flavors:
  Isolated word
    Check each word on its own for misspelling
    Will not catch typos resulting in correctly spelled words,
    e.g., from mistyped as form
  Context-sensitive
    Look at surrounding words,
    e.g., I flew form Heathrow to Narita.
Introduction to Information Retrieval Sec. 3.3
Document correction
Especially needed for OCR’ed documents
Correction algorithms are tuned for this: rn/m
Can use domain-specific knowledge
E.g., OCR can confuse O and D more often than it would confuse O
and I (adjacent on the QWERTY keyboard, so more likely
interchanged in typing).
But also: web pages and even printed material have
typos
Goal: the dictionary contains fewer misspellings
But often we don’t change the documents and
instead fix the query-document mapping
Introduction to Information Retrieval Sec. 3.3
Query mis-spellings
Our principal focus here
E.g., the query Alanis Morisett
We can either
Retrieve documents indexed by the correct spelling, OR
Return several suggested alternative queries with the
correct spelling
Did you mean … ?
Introduction to Information Retrieval Sec. 3.3.3
Edit distance
Given two strings S1 and S2, the minimum number of
operations to convert one to the other
Operations are typically character-level
Insert, Delete, Replace, (Transposition)
E.g., the edit distance from dof to dog is 1
From cat to act is 2 (Just 1 with transpose.)
from cat to dog is 3.
Generally found by dynamic programming.
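The standard dynamic program (insert/delete/replace at unit cost; no transposition):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum insert/delete/replace operations to turn s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace or copy
    return d[m][n]

print(edit_distance("dof", "dog"))   # 1
print(edit_distance("cat", "act"))   # 2
print(edit_distance("cat", "dog"))   # 3
```

Adding the transposition operation (the Damerau variant) would bring cat → act down to 1, as noted above.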
Introduction to Information Retrieval Sec. 3.3.4
n-gram overlap
Enumerate all the n-grams in the query string as well
as in the lexicon
Use the n-gram index (recall wild-card search) to
retrieve all lexicon terms matching any of the query
n-grams
Threshold by number of matching n-grams
Variants – weight by keyboard layout, etc.
Introduction to Information Retrieval Sec. 3.3.4
One option – Jaccard coefficient
  J(X, Y) = |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and
zero when they are disjoint.
X and Y don’t have to be of the same size.
Always assigns a number between 0 and 1.
Now threshold to decide if you have a match:
e.g., if J(X, Y) > 0.8, declare a match.
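A sketch of the coefficient on bigram sets; the pair bord/lord is our illustrative example:

```python
def bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|, always in [0, 1]."""
    return len(x & y) / len(x | y)

q, cand = bigrams("bord"), bigrams("lord")
print(round(jaccard(q, cand), 2))   # 0.5 (shared: or, rd; union size 4)
print(jaccard(q, cand) > 0.8)       # False -> no match at this threshold
```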
Introduction to Information Retrieval Sec. 3.3.4
Matching bigrams
Consider the query lord – we wish to identify words
matching 2 of its 3 bigrams (lo, or, rd)
Introduction to Information Retrieval Sec. 3.3.5
Context-sensitive correction
Consider “I flew form Heathrow to Narita”: we need the surrounding context to catch form, a correctly spelled word standing in for from.
First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
Now try all possible resulting phrases with one word
“fixed” at a time
flew from heathrow
fled form heathrow
flea form heathrow
Hit-based spelling correction: Suggest the
alternative that has lots of hits.
Introduction to Information Retrieval Sec. 3.3.5
Exercise
Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in
this scheme?
Introduction to Information Retrieval Sec. 3.3.5
Another approach
Break phrase query into a conjunction of biwords
(Lecture 2).
Look for biwords that need only one term corrected.
Enumerate only phrases containing “common”
biwords.
Introduction to Information Retrieval Sec. 3.3.5
Soundex
Class of heuristics to expand a query into phonetic
equivalents
Language specific – mainly for names
E.g., chebyshev → tchebycheff
Invented for the U.S. census, in 1918
Introduction to Information Retrieval Sec. 3.4
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Introduction to Information Retrieval Sec. 3.4
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to ‘0’ (zero):
   A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows:
   B, F, P, V → 1
   C, G, J, K, Q, S, X, Z → 2
   D, T → 3
   L → 4
   M, N → 5
   R → 6
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be of the
form <uppercase letter> <digit> <digit> <digit>.
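The six steps above transcribe directly into a minimal sketch (as a check, Hermann encodes to H655):

```python
def soundex(word: str) -> str:
    """Soundex code following the six steps above."""
    table = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            table[ch] = digit
    w = "".join(ch for ch in word.upper() if ch.isalpha())
    digits = [table.get(ch, "0") for ch in w]     # steps 2-3: other letters -> '0'
    collapsed = [digits[0]]
    for d in digits[1:]:                          # step 4: drop repeated digits
        if d != collapsed[-1]:
            collapsed.append(d)
    body = "".join(d for d in collapsed[1:] if d != "0")  # step 5: drop zeros
    return (w[0] + body + "000")[:4]              # steps 1 and 6: letter + 3 digits

print(soundex("Hermann"))   # H655
```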
Soundex
Soundex is the classic algorithm, provided by most
databases (Oracle, Microsoft, …)
How useful is soundex?
Not very – for information retrieval
Okay for “high recall” tasks (e.g., Interpol), though
biased to names of certain nationalities
Zobel and Dart (1996) show that other algorithms for
phonetic matching perform much better in the
context of IR
Introduction to Information Retrieval
Exercise
Draw yourself a diagram showing the various indexes
in a search engine incorporating all the functionality
we have talked about
Identify some of the key design choices in the index
pipeline:
Does stemming happen before the Soundex index?
What about n-grams?
Given a query, how would you parse and dispatch
sub-queries to the various indexes?
Introduction to Information Retrieval Sec. 3.5
Resources
IIR 3, MG 4.2
Efficient spell retrieval:
K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software – Practice and Experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
Nice, easy reading on spell correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html