
Algorithms for Information Retrieval

UE16CS412

Instructor: Dr S Natarajan
Professor and Key Resource Person
Department of Computer Science and Engineering
PES University
Bengaluru
natarajan@pes.edu 9945280225
Slides: Adapted from the book “Introduction to Information Retrieval” by the
authors from Stanford University, supplemented with slides from other
universities
UNIT I
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
IR:
- original definition
“Information retrieval embraces the intellectual
aspects of the description of information and its
specification for search, and also whatever
systems, techniques, or machines are employed
to carry out the operation.”
Calvin Mooers, 1951

© Tefko Saracevic
Introduction to Information Retrieval

Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).

– These days we frequently think first of web search, but
there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval

5
You see things; and you say "Why?"
But I dream things that never were;
and I say "Why not?"
George Bernard Shaw

IR Through the Ages
• 3rd Century BCE
– Library of Alexandria
• 500,000 volumes
• catalogs and classifications

• 13th Century A.D.


– First concordance of the Bible
• What is a concordance?

• 15th Century A.D.


– Invention of printing

• 1600
– University of Oxford Library
• All books printed in England
Document Collections

Stone inscriptions at
Brihadeeshwara Temple

4
IR Through the Ages
• 1755
– Johnson’s Dictionary
• Set standard for dictionaries
• Included common language
• Helped standardize spelling
• 1800
– Library of Congress
• 1828
– Webster’s Dictionary
• Significantly larger than previous dictionaries
• Standardized American spelling
• 1852
– Roget’s Thesaurus
Document Collections

IR in the 17th century: Samuel Pepys, the famous English diarist,
subject-indexed his treasured 1,000+ book library with key words.
5
IR Through the Ages
• 1876
– Dewey Decimal Classification
• 1880’s
– Carnegie Public Libraries
• 1,681 built (first public library 1850)
• 1930’s
– Punched card retrieval systems
• 1940’s
– Bush’s Memex
– Shannon’s Communication Theory
– Zipf’s “Law”
Early ideas of IR/search
Vannevar Bush - Memex - 1945
"A memex is a device in which an individual stores all his books, records, and
communications, and which is mechanized so that it may be consulted with
exceeding speed and flexibility. It is an enlarged intimate supplement to his
memory.”
Bush seems to understand that computers won’t just store information as a product;
they will transform the process people use to produce and use information.
Historical Summary
• 1960’s
– Basic advances in retrieval and indexing techniques
• 1970’s
– Probabilistic and vector space models
– Clustering, relevance feedback
– Large, on-line, Boolean information services
– Fast string matching
• 1980’s
– Natural Language Processing and IR
– Expert systems and IR
– Off-the-shelf IR systems
IR History Continued

• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE

15
IR Through the Ages
• Late 1980’s
– First mini-computer and PC systems incorporating
“relevance ranking”

• Early 1990’s
– information storage revolution

• 1992
– First large-scale information service incorporating
probabilistic retrieval (West’s legal retrieval system)
IR History Continued

• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS (Wide Area Information Services)
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista

17
IR Through the Ages
• Mid 1990’s to present
– Multimedia databases

• 1994 to present
– The Internet and Web explosion
• e.g. Google, Yahoo, Lycos, Infoseek (now Go)

• 1995 to present
– Digital Libraries
– Data Mining
– Agents and Filtering
– Knowledge and Distributed Intelligence
– Information Organization
– Knowledge Management
Historical Summary
• 1990’s
– Large-scale, full-text IR and filtering experiments
and systems (TREC)
– Dominance of ranking
– Many web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
IR History Continued

• 1990’s continued:
– Organized Competitions
• NIST TREC (Text Retrieval Conference)
– Recommender Systems
• Ringo (Social Information Filtering System)
• Amazon
• NetPerceptions (leading seller of personalization
technology during the Internet boom of the late
1990s)
– Automated Text Categorization & Clustering

20
IR History Continued
2000’s continued:
Link Analysis of Web Search
Google
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA TIDES(Translingual Information Detection,
Extraction and Summarization)
– Document Summarization
– Learning to Rank

21
IR History continued
2000’s (continued)
Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
Question Answering
• TREC Q/A track
Multimedia IR
Cross-Language IR
WEB changed everything
• New issues:
• Trust
• Privacy, etc
• Additional sources:
• Social networking
• Wikipedia, etc
Recent IR History

• 2010’s
– Intelligent Personal Assistants
• Siri
• Cortana
• Google Now
• Alexa
– Complex Question Answering
• IBM Watson
– Distributional Semantics
– Deep Learning
23
History of information retrieval: gradual channel changes


Trends in IR Technology
(Figure) On-line information grows from gigabytes to terabytes to petabytes over
time, while retrieval technology progresses from Boolean retrieval and filtering
(around 1970) through ranked retrieval, concept-based retrieval, ranked filtering,
information extraction, summarization, distributed retrieval, data mining,
visualization, and image and video retrieval (1990 and beyond).

Batch systems ... Interactive systems ... Database systems ... Cheap storage ... Internet ... Multimedia ...

A 1-page Word document without any images ≈ 10 kilobytes (KB) of disk space.
1 terabyte ≈ one hundred million imageless Word docs.
1 petabyte = one thousand terabytes.
Historical Summary
• The Future
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
– IR in context
– “Anytime, Anywhere”
Information Retrieval
• Ad Hoc Retrieval
– Given a query and a large database of text objects, find the
relevant objects

• Distributed Retrieval
– Many distributed databases

• Information Filtering
– Given a text object from an information stream (e.g. newswire)
and many profiles (long-term queries), decide which profiles
match

• Multimedia Retrieval
– Databases of other types of unstructured data, e.g. images,
video, audio
Information Retrieval

• Multilingual Retrieval
– Retrieval in a language other than English

• Cross-language Retrieval
– Query in one language (e.g. Spanish), retrieve
documents in other languages (e.g. Chinese,
French, and Spanish)
Information Retrieval Family Trees

Cyril Cleverdon (Cranfield)
Karen Sparck Jones (Cambridge)
Gerard Salton (Cornell)
Keith van Rijsbergen (University of Glasgow)
Bruce Croft (University of Massachusetts, Amherst)
Donna Harman (NIST)
Michael Lesk (Bell Labs, Rutgers, etc.)
Personalities in IR

Dr Marti Hearst (Univ of California at Berkeley), Dr Thorsten Joachims (Cornell),
Dr Raymond Mooney, Dr Gordon Cormack, Dr David Grossman,
Dr Ricardo Baeza-Yates, Dr Chris Manning, Dr Raghavan
Gerard Salton
 1927-1995. Born in Germany,
Professor at Cornell (co-founded
the CS department), Ph.D from
Harvard in Applied Mathematics
 Father of information retrieval
 Vector space model
 SMART information retrieval
system
 First recipient of SIGIR outstanding
contribution award, now called the
Gerard Salton Award
Top IR Conferences
• ACM SIGIR (Special Interest Group in Information Retrieval)
• European Conference on Information Retrieval (ECIR)
• Asia Information Retrieval Societies’ Conference
• Australasian Document Computing Symposium
• Text REtrieval Conference (TREC) of NIST
• ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR)
• ACM Conference on Information and Knowledge Management (CIKM)
• SPIRE (String Processing and Information Retrieval)
• FQAS (Flexible Query Answering Systems)
• JCDL (Joint Conference on Digital Libraries)
• IIiX (Information Interaction in Context)
• ACM SAC (Symposium on Applied Computing), IAR (Information Access and Retrieval)
• ACM WSDM (Web Search and Data Mining) Conference
• ACM SIGIR CHIIR (Conference on Human Information Interaction and Retrieval)
• The Web Conference
• DESIRES (Design of Experimental Search & Information REtrieval Systems)
• Forum of Information Retrieval Evaluation (FIRE) –ISI Kolkata
Top IR Journals
• Transactions on Asian and Low-Resource Language
Information Processing (TALLIP) of ACM
• Transactions On Information Systems (TOIS) of ACM
• IP&M (Information Processing & Management), Elsevier
• JDOC (Journal of Documentation), Emerald Insight
• JASIST (Journal of the Association for Information Science
and Technology), Wiley
• Information Retrieval Journal, Springer
• International Journal of Information Retrieval Research
(IJIRR) – IGI-Global
IR toolkits

• Lucene (Apache)
• MeTA (Modern Text Analysis) (Univ. of Illinois at Urbana-
Champaign)
• Lemur & Indri (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• IR Systems and Tools
• Information Retrieval Systems
Other Text Books in Information Retrieval
1. Modern Information Retrieval: The Concepts and Technology behind
Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition,
ACM Press Books, 2011
2. Search Engines: Information Retrieval in Practice, Bruce Croft, Donald
Metzler and Trevor Strohman, 1st Edition, Addison Wesley, 2009
3. Information Retrieval: Implementing and Evaluating Search Engines,
Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, 1st Edition,
MIT Press, 2010
4. Information Retrieval: Algorithms and Heuristics, David A. Grossman
and Ophir Frieder, 2nd Edition, Springer, 2004
5. Managing Gigabytes: Compressing and Indexing Documents and
Images, Ian H. Witten, Alistair Moffat, Timothy C. Bell, Second Edition,
Van Nostrand Reinhold, 1994
6. Readings in Information Retrieval, Karen Sparck Jones and Peter
Willett , Morgan Kaufmann, 1997
Courses in IR
CS276: Information Retrieval and Web Search, Stanford
CS 371R: Information Retrieval and Web Search - UT Computer Science
INFO/CS 4300: Language and Information Cornell
COMPSCI 646: Information Retrieval Umass
COMPSCI 546, Applied Information Retrieval UMass
CS60035: Information Retrieval – IITKgp
Info 240: Principles of Information Retrieval | UC Berkeley School of
Information
CS 54701: Information Retrieval - Purdue Computer Science
605.744 - Information Retrieval | Johns Hopkins University
COSC 488 –Information Retrieval Georgetown University
CS 4501/6501: Information Retrieval Virginia
Introduction to Information Retrieval Univ of Munich
CS510 - Advanced Information Retrieval UIUC
Unstructured (text) vs. structured (database) data in the mid-nineties

Unstructured (text) vs. structured (database) data today
Sec. 1.1

Basic assumptions of Information Retrieval

• Collection: A set of documents


– Assume it is a static collection for the moment

• Goal: Retrieve documents with information


that is relevant to the user’s information need
and helps the user complete a task

39
The classic search model

User task: get rid of mice in a politically correct way (misconception?)
Info need: info about removing mice without killing them (misformulation?)
Query: how trap mice alive
The query goes to a search engine over the collection; the user inspects the
results and may refine the query.
Sec. 1.1

How good are the retrieved docs?


 Precision: fraction of retrieved docs that are
relevant to the user’s information need
 Recall: fraction of relevant docs in the collection
that are retrieved

 More precise definitions and measurements to follow later

41
Introduction to Information Retrieval

Definitions
 Word – A delimited string of characters as it appears
in the text.
 Term – A “normalized” word (case, morphology,
spelling etc); an equivalence class of words.
 Token – An instance of a word or term occurring in a
document.
 Type – The same as a term in most cases: an
equivalence class of tokens.

42
Shakespeare's classics
Romeo and Juliet
Macbeth
A Midsummer Night’s dream
King Lear
Hamlet
The Tempest
Julius Caesar
Richard III
Othello
Henry IV
Twelfth Night
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Typical IR System Architecture

A Typical Web Search Engine: users pose queries through an interface to a
query engine, which answers them from an index; the index is built by an
indexer fed by a crawler that fetches pages from the Web.

A Typical Information Retrieval System: the same pipeline, except that the
indexer (with a tokenization step) builds the index from a document
collection rather than from crawled Web pages.

Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Sec. 1.3

Boolean queries: Exact match

• In the Boolean retrieval model we can pose any query that is a
Boolean expression:
– Boolean Queries are queries using AND, OR and NOT
to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
58
• Boolean AND retrieves very few documents (low recall); AND
narrows the search
• Boolean OR retrieves too many documents (low precision); OR
broadens the search
• Boolean NOT eliminates many good documents (low recall); NOT
always excludes records with the specified term
A Boolean query looks for an exact match, hence:
• the query specifies precise retrieval criteria
• every document either matches or fails to match the query
• the result is a set of documents, unordered in pure exact match
Stopwords / Stoplist
 Function words do not bear useful information for IR:
of, in, about, with, I, although, …
 Stoplist: a list of stopwords that are not to be used as index terms:
 Prepositions
 Articles
 Pronouns
 Some adverbs and adjectives
 Some frequent words (e.g. document)

 The removal of stopwords usually improves IR effectiveness
 A few “standard” stoplists are commonly used.

64
Term-Document (Incidence) Matrix
Example:
You are given 3 documents as follows:
D1: I go to the movie Super30
D2: You go to a library
D3: She goes to a Park
Let us remove the stop words (articles, pronouns and prepositions: I, to, the, You, a, She, etc.).
Term-Document (Incidence) Matrix
We build Term Document Matrix
D1 D2 D3
go 1 1 0
movie 1 0 0
Super30 1 0 0
library 0 1 0
goes 0 0 1
Park 0 0 1
We address some of the queries as below
movie OR Park returns documents D1 and D3
go OR goes returns documents D1, D2 and D3
go AND movie returns the document D1
go AND library returns the document D2
Inverted index (get documents for the terms):
go → D1, D2; movie → D1
The term-document incidence matrix

Main idea: record for each document whether it contains each word
out of all the different words Shakespeare used (about 32K).
(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)

Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
...

Matrix element (t , d ) is 1 if the play in column d contains the


word in row t , 0 otherwise.

24
Query “Brutus AND Caesar AND NOT Calpurnia”

We compute the result for our query as the bitwise AND of the vectors for
Brutus, Caesar and the complement of Calpurnia:

(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)

Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
¬Calpurnia 1 0 1 1 1 1
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
AND 1 0 0 1 0 0

Bitwise AND returns two documents, “Antony and Cleopatra” and “Hamlet”.
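To make the bitwise computation concrete, here is a minimal Python sketch, assuming one integer bit vector per term with the bits in the column order Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth:

# Incidence bit vectors, one bit per play (leftmost bit = Antony and Cleopatra).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

mask = (1 << len(plays)) - 1          # keep only the six play bits
# Brutus AND Caesar AND NOT Calpurnia as bitwise operations.
answer = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)

# Report the plays whose bit is set (the leftmost bit is column 0).
hits = [plays[i] for i in range(len(plays)) if answer & (1 << (len(plays) - 1 - i))]
print(hits)   # ['Antony and Cleopatra', 'Hamlet']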
The results: two documents

Antony and Cleopatra, Act III, Scene ii


Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring, and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.

28
Boolean Retrieval Model

• Popular retrieval model because:


– Easy to understand for simple queries.
– Clean formalism.
• Boolean models can be extended to include
ranking.
• Reasonably efficient implementations possible for
normal queries.

77
Boolean Models: Problems
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents
retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or
irrelevant, how should the query be modified?

80
Extended Boolean Model
 Boolean model is simple and elegant.
 But, no provision for a ranking
 As with the fuzzy model, a ranking can be
obtained by relaxing the condition on set
membership
 Extend the Boolean model with the notions of
partial matching and term weighting
 Combine characteristics of the Vector model
with properties of Boolean algebra
The Idea
 The extended Boolean model (introduced by
Salton, Fox, and Wu, 1983) is based on a
critique of a basic assumption in Boolean
algebra
 Let
q = kx ∨ ky
wxj = fxj · idf(x) / maxi idf(i) be the weight associated with [kx, dj]
Further, write wxj = x and wyj = y, so document dj is the point (x, y)
• We want the document to be as far as possible from (0,0):
the similarity of q = (kx ∨ ky) with documents dj and dj+1 grows with
their distance from (0,0)
• We want the document to be as close as possible to (1,1):
the similarity of q = (kx ∧ ky) with documents dj and dj+1 grows as
they approach (1,1)
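A minimal sketch of the resulting similarity functions, assuming the usual Euclidean (p = 2) form of the Salton-Fox-Wu model for a two-term query, with x and y being the document's weights for kx and ky:

from math import sqrt

def sim_or(x, y):
    # q = kx OR ky: reward documents far from (0, 0).
    return sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    # q = kx AND ky: reward documents close to (1, 1).
    return 1 - sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# A document with both weights high scores well on both queries;
# a document with only one high weight still gets partial credit.
print(sim_or(0.9, 0.1), sim_and(0.9, 0.1))
print(sim_or(0.5, 0.5), sim_and(0.5, 0.5))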
Example: Westlaw

Largest commercial legal search service – 700K subscribers


Legal search started in 1975, ranking added in 1992
Federated search added in 2010
Boolean Search and ranked retrieval both offered
Document ranking only wrt chronological order
Expert queries are carefully defined and incrementally
developed

38
Westlaw query dash board
Choosing connectors
Westlaw Queries/Information Needs

“trade secret” /s disclos! /s prevent /s employe!


Information need: Information on the legal theories involved in
preventing the disclosure of trade secrets by employees formerly
employed by a competing company.

disab! /p access! /s work-site work-place (employment /3 place)


Information need: Requirements for disabled people to be able to
access a workplace.

host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest


Information need: Cases about a host’s responsibility for drunk
guests.

39
Comments on WestLaw

Proximity operators: /3 = within 3 words, /s = within the same
sentence, /p = within a paragraph
Space is disjunction, not conjunction (This was standard in
search pre-Google.)
Long, precise queries: incrementally developed, unlike web
search
Why professional searchers like Boolean queries: precision,
transparency, control.

40
Does Google use the Boolean Model?

On Google, the default interpretation of a query [w1 w2 ... wn] is


w1 AND w2 AND ... AND wn
Cases where you get hits which don’t contain one of the wi:
Page contains variant of wi (morphology, misspelling,
synonym)
long query (n is large)
Boolean expression generates very few hits
wi was in the anchor text
Google also ranks the result set
Simple Boolean Retrieval returns matching documents in no
particular order.
Google (and most well-designed Boolean engines) rank hits
according to some estimator of relevance

41
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
    ↓ Tokenizer
Token stream: Friends Romans Countrymen
    ↓ Linguistic modules
Modified tokens: friend roman countryman
    ↓ Indexer
Inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16
Document delineation (Description)
and character sequence decoding
• Obtaining the character sequence in a document:
– What format is it in?
• pdf/word/excel/html?
– What language is it in?
– What character set is in use?
• (CP1252, UTF-8, …)
• Each of these is a classification problem, which we will study later in the
course.
• But these tasks are often done heuristically …

Document delineation and character
sequence decoding
• Choosing a document unit:
• Documents being indexed can include docs from many
different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain
multiple languages/formats
– French email with a German pdf attachment.
– French email quote clauses from an English-language
contract

• There are commercial and open source libraries that


can handle a lot of this stuff
Document delineation and character
sequence decoding
We return from our query “documents” but there
are often interesting questions of grain size:

What is a unit document?


– A file?
– An email? (Perhaps one of many in a single mbox file)
• What about an email with 5 attachments?
– A group of files (e.g., PPT or LaTeX split over HTML
pages)

Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
 Input: “Friends, Romans and Countrymen”
 Output: Tokens
 Friends Romans and Countrymen
 A token is an instance of a sequence of characters
 Each such token is now a candidate for an index
entry, after further processing
 Described below
 But what are valid tokens to emit?
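A deliberately naive tokenizer, sketched below, already handles the simple example above; the issues on the following slides are exactly the cases where such a rule breaks down (the regular expression is an illustrative choice, not the book's):

import re

def tokenize(text):
    # Split on anything that is not a letter or digit -- a deliberately naive rule.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']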
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
 Issues in tokenization:
 Finland’s capital →
Finland? Finlands? Finland’s?
 Hewlett-Packard → Hewlett and Packard as two
tokens?
 state-of-the-art: break up hyphenated sequence.
 co-education
 lowercase, lower-case, lower case ?
 It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?
 How do you decide it is one token?
Introduction to Information Retrieval Sec. 2.2.1

Numbers
 3/20/91 Mar. 12, 1991 20/3/91
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333
 Often have embedded spaces
 Older IR systems may not index numbers
 But often very useful: think about things like looking up error
codes/stacktraces on the web
 (One answer is using n-grams)
 Will often index “meta-data” separately
 Creation date, format, etc.
Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues


 French
 L'ensemble  one token or two?
 L ? L’ ? Le ?
 Want l’ensemble to match with un ensemble
 Until at least 2003, it didn’t on Google
 Internationalization!

 German noun compounds are not segmented


 Lebensversicherungsgesellschaftsangestellter
------ -------------------- ---------------- ---------------
 ‘life insurance company employee’
 German retrieval systems benefit greatly from a compound splitter
module
 Can give a 15% performance boost for German
Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues


 Chinese, Japanese and Thai have no spaces
between words:
 莎拉波娃现在居住在美国东南部的佛罗里达。
 Not always guaranteed a unique tokenization
 Further complicated in Japanese, with multiple
alphabets intermingled
 Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!


Introduction to Information Retrieval Sec. 2.2.1

Tokenization: language issues


 Arabic (or Hebrew) is basically written right to left,
but with certain items like numbers written left to
right
 Words are separated, but letter forms within a word
form complex ligatures

 ← → ←→ ← start
 ‘Algeria achieved its independence in 1962 after 132
years of French occupation.’
 With Unicode, the surface presentation is complex, but the
stored form is straightforward
Introduction to Information Retrieval Sec. 2.2.2

Stop words
 Non-content bearing words are Stop Words – eg. the, of, to,
and, in, for , that , said
 With a stop list, you exclude from the dictionary entirely the
commonest words. Intuition: They have little semantic content: the,
a, and, to, be
 There are a lot of them: ~30% of postings for top 30 words
 But the trend is away from doing this:
 Good compression techniques means the space for including
stopwords in a system is very small
 Good query optimization techniques mean you pay little at query
time for including stop words.
 You need them for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
Introduction to Information Retrieval Sec. 2.2.3

Normalization to terms
 We need to “normalize” words in indexed text as well
as query words into the same form
 We want to match U.S.A. and USA
 Result is terms: a term is a (normalized) word type,
which is an entry in our IR system dictionary
 We most commonly implicitly define equivalence
classes of terms by, e.g.,
 deleting periods to form a term
 U.S.A., USA → USA
 deleting hyphens to form a term
 anti-discriminatory, antidiscriminatory → antidiscriminatory
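A minimal sketch of this equivalence classing, implementing only the two rules above (case folding is treated separately later in this unit):

def normalize(word):
    # Delete periods and hyphens to form the equivalence-class term.
    return word.replace(".", "").replace("-", "")

print(normalize("U.S.A."))               # USA
print(normalize("anti-discriminatory"))  # antidiscriminatory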
Introduction to Information Retrieval Sec. 2.2.3

Normalization: other languages


 Accents: e.g., French résumé vs. resume.
 Umlauts: e.g., German: Tuebingen vs. Tübingen
 Should be equivalent
 Most important criterion:
 How are your users likely to write their queries for these
words?

 Even in languages that standardly have accents, users


often may not type them
 Often best to normalize to a de-accented term
 Tuebingen, Tübingen, Tubingen  Tubingen
Introduction to Information Retrieval Sec. 2.2.3

Normalization: other languages


 Normalization of things like date forms
 7月30日 vs. 7/30
 Japanese use of kana vs. Chinese characters

 Tokenization and normalization may depend on the


language and so is intertwined with language
detection
Morgen will ich in MIT …   (Is this the German word “mit”?)

 Crucial: Need to “normalize” indexed text as well as


query terms into the same form
Sec. 1.1

Unstructured data in 1620


• Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for
Brutus and Caesar, then strip out lines containing
Calpurnia?
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near
countrymen) not feasible
– Ranked retrieval (best documents to return)
• Later lectures
112
Introduction to Information Retrieval

Term frequency tf
 The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
 This determines the density of terms in the document
 We want to use tf when computing query-document match
scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with term
frequency.
Introduction to Information Retrieval Sec. 6.2

Log-frequency weighting
 The log frequency weight of term t in d is
w(t,d) = 1 + log10 tf(t,d)  if tf(t,d) > 0,  and 0 otherwise

 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

 Score for a document-query pair: sum over terms t in both q and d:

score(q,d) = Σ t∈q∩d (1 + log10 tf(t,d))

 The score is 0 if none of the query terms is present in the document.
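A direct transcription of this weighting, assuming the document's term frequencies are available as a plain dictionary:

from math import log10

def w(tf):
    # Log-frequency weight of a term with raw frequency tf in a document.
    return 1 + log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    # Sum the weights of the query terms that occur in the document.
    return sum(w(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"caesar": 10, "brutus": 2}
print(score(["brutus", "caesar"], doc_tf))   # (1 + log10 2) + (1 + log10 10) ≈ 3.30
print(score(["calpurnia"], doc_tf))          # 0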
Introduction to Information Retrieval Sec. 2.2.3

Case folding
 Reduce all letters to lower case
 exception: upper case in mid-sentence?
 e.g., General Motors
 Fed vs. fed
 SAIL vs. sail
 Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…

 Longstanding Google example: [fixed in 2011…]


 Query C.A.T.
 #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
Introduction to Information Retrieval Sec. 2.2.3

Normalization to terms

 An alternative to equivalence classing is to do


asymmetric expansion
 An example of where this may be useful
 Enter: window Search: window, windows
 Enter: windows Search: Windows, windows, window
 Enter: Windows Search: Windows
 Potentially more powerful, but less efficient
Introduction to Information Retrieval

Thesauri and Soundex


Thesaurus: a book of words or of information about a particular field
or set of concepts; especially, a book of words and their
synonyms
 Do we handle synonyms and homonyms?
 E.g., by hand-constructed equivalence classes
 car = automobile color = colour
 We can rewrite to form equivalence-class terms
 When the document contains automobile, index it under car-
automobile (and vice-versa)
 Or we can expand a query
 When the query contains automobile, look under car as well
 What about spelling mistakes?
 One approach is Soundex, which forms equivalence classes
of words based on phonetic heuristics
Sec. 2.2.4

Lemmatization
• Reduce inflectional (forms of nouns, the past
tense, past participle, and present participle
forms of verbs, and the comparative and
superlative forms of adjectives and
adverbs)/variant forms to base form
• E.g., am, are, is → be
car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be
different color
• Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization
 Transform to standard form according to syntactic
category, e.g. verb + ing → verb, noun + s → noun
 Needs POS (Part of Speech) tagging
 More accurate than stemming, but needs more resources

 It is crucial to choose the stemming/lemmatization rules:
noise vs. recognition rate, i.e. a compromise between precision and recall

light/no stemming: -recall, +precision    severe stemming: +recall, -precision

123
Sec. 2.2.4

Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all
reduced to automat.

for example compressed and compression are both accepted as
equivalent to compress →
for exampl compress and compress ar both accept as equival to
compress
Stemming
 Reason:
 Different word forms may bear similar meaning
(e.g. search, searching): create a “standard”
representation for them
 Stemming:
 Removing some endings of words:
computer, compute, computes, computing, computed,
computation → comput
125
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping,
Program, 14(3) :130-137)
 Step 1: plurals and past participles
 SSES -> SS caresses -> caress
 (*v*) ING -> motoring -> motor
 Step 2: adj->n, n->v, n->adj, …
 (m>0) OUSNESS -> OUS callousness -> callous
 (m>0) ATIONAL -> ATE relational -> relate
 Step 3:
 (m>0) ICATE -> IC triplicate -> triplic
 Step 4:
 (m>1) AL -> revival -> reviv
 (m>1) ANCE -> allowance -> allow
 Step 5:
 (m>1) E -> probate -> probat
 (m > 1 and *d and *L) -> single letter controll -> control
126
Sec. 2.2.4

Porter’s algorithm
• Commonest algorithm for stemming English
– Results suggest it’s at least as good as other
stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound
command, select the one that applies to the
longest suffix.
Sec. 2.2.4

Typical rules in Porter


• sses → ss: caresses → caress
• ies → i: ponies → poni
• ational → ate: variational → variate
• tional → tion: optional → option
• ss → ss: caress → caress
• s → (deleted): cats → cat
• Weight-of-word-sensitive rules:
• (m>1) EMENT → (deleted)
• replacement → replac
• cement → cement
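In practice one usually calls an existing implementation rather than re-coding the rules; a sketch using NLTK's Porter stemmer, assuming the nltk package is installed:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "motoring", "relational",
             "compressed", "compression", "cats"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, motoring -> motor, cats -> cat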
Sec. 2.2.4

Other stemmers
• Other stemmers exist:
– Lovins stemmer
• http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

• Single-pass, longest suffix removal (about 250 rules)


– Paice/Husk stemmer
– Snowball
• Full morphological analysis (lemmatization)
– At most modest benefits for retrieval
Sec. 2.2.4

Language-specificity
• The above methods embody transformations
that are
– Language-specific, and often
– Application-specific
• These are “plug-in” addenda (added at the
end) to the indexing process
• Both open source and commercial plug-ins are
available for handling these
Sec. 2.2.4

Does stemming help?


• English: very mixed results. Helps recall for
some queries but harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …

– 30% performance gains for Finnish!
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Affixes
• Affixes are word parts that change the
meaning of a root or base word.
• Prefixes and Suffixes are both Affixes.

un + cook + ed = uncooked  (“un” and “ed” are affixes)
Prefixes
• Prefixes are word parts (affixes) that
comes at the beginning of the root word
or base word.

un + cook + ed = uncooked  (“un” is the prefix)
Suffix
• Suffixes are word parts (affixes) that come
at the end of the root word or base word.

un + cook + ed = uncooked  (“un” is the prefix, “ed” is the suffix)
Introduction to Information Retrieval Sec. 3.1

Dictionary data structures for inverted


indexes
 The dictionary data structure stores the term
vocabulary, document frequency, pointers to each
postings list … in what data structure?

136
Introduction to Information Retrieval Sec. 3.1

A naïve dictionary
 An array of struct:

term: char[20] (20 bytes) | doc. frequency: int (4/8 bytes) | pointer to postings list: Postings* (4/8 bytes)
 How do we store a dictionary in memory efficiently?
 How do we quickly look up elements at query time?
137
Introduction to Information Retrieval Sec. 3.1

Dictionary data structures


 Two main choices:
 Hashtables
 Trees
 Some IR systems use hashtables, some trees

138
Introduction to Information Retrieval Sec. 3.1

Hashtables
 Each vocabulary term is hashed to an integer
 (We assume you’ve seen hashtables before)
 Pros:
 Lookup is faster than for a tree: O(1)
 Cons:
 No easy way to find minor variants:
 judgment/judgement
 No prefix search [tolerant retrieval]
 If vocabulary keeps growing, need to occasionally do the
expensive operation of rehashing everything

139
Introduction to Information Retrieval Sec. 3.1

Tree: binary tree


Root: a-m | n-z; a-m splits into a-hu | hy-m, and n-z splits into n-sh | si-z

140
Introduction to Information Retrieval Sec. 3.1

Tree: B-tree

a-hu | hy-m | n-z

 Definition: Every internal node has a number of children
in the interval [a,b] where a, b are appropriate natural
numbers, e.g., [2,4].
141
Introduction to Information Retrieval Sec. 3.1

Trees
 Simplest: binary tree
 More usual: B-trees
 Trees require a standard ordering of characters and hence
strings … but we typically have one
 Pros:
 Solves the prefix problem (terms starting with hyp)
 Cons:
 Slower: O(log M) [and this requires balanced tree]
 Rebalancing binary trees is expensive
 But B-trees mitigate the rebalancing problem

142
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Phrase queries
 Want to answer queries such as stanford
university – as a phrase
 Thus the sentence “I went to university at
Stanford” is not a match.
 No longer suffices to store only

<term : docs> entries


A first attempt: Biword indexes
 Index every consecutive pair of terms in the text
as a phrase
 For example the text “Friends, Romans,
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary term
 Two-word phrase query-processing is now
immediate.
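Generating the biword dictionary terms from a token stream is a one-liner; a minimal sketch:

def biwords(tokens):
    # Every consecutive pair of tokens becomes one dictionary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']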
Longer phrase queries
 Longer phrases are processed as we did with
wild-cards:
 stanford university palo alto can be broken into
the Boolean query on biwords:
stanford university AND university palo AND
palo alto

Without the docs, we cannot verify that the docs


matching the above Boolean query do contain
the phrase.
Can have false positives!
Extended biwords
 Parse the indexed text and perform part-of-
speech-tagging (POST).
 Bucket the terms into (say) Nouns (N) and
articles/prepositions (X).
 Now deem any string of terms of the form NX*N
to be an extended biword.
 Each such extended biword is now made a term in
the dictionary.
 Example:
 catcher in the rye
N X X N
Query processing
 Given a query, parse it into N’s and X’s
 Segment query into enhanced biwords
 Look up index
 Issues
 Parsing longer queries into conjunctions
 E.g., the query tangerine trees and marmalade
skies is parsed into
 tangerine trees AND trees and marmalade AND
marmalade skies
Other issues
 False positives, as noted before
 Index blowup due to bigger dictionary
Solution 2: Positional indexes
 Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Positional index example

<be: 993427;
 1: <7, 18, 33, 72, 86, 231>;
 2: <3, 149>;
 4: <17, 191, 291, 430, 434>;
 5: <363, 367, …>; …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

 Can compress position values/offsets


 Nevertheless, this expands postings storage
substantially
Processing a phrase query
 Extract inverted index entries for each distinct
term: to, be, or, not.
 Merge their doc:<position lists…> to enumerate
all positions with “to be or not to be”.
 to:
 2:<1,17,74,222,551>; 4:<8,16,190,429,433>;
7:13,23,191; ...
 be:
 1:<17,19>; 4:<17,191,291,430,434>;
5:<14,19,101>; ...
 Same general method for proximity searches
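A simplified two-term version of this positional merge, assuming each postings entry maps a docID to a sorted list of positions; the same loop handles /k proximity by testing abs(p2 - p1) <= k instead of p2 == p1 + 1:

def phrase_intersect(postings1, postings2):
    # postings: dict mapping docID -> sorted list of positions of the term.
    # Return docIDs where some position of term2 is exactly one past term1.
    hits = []
    for doc in sorted(postings1.keys() & postings2.keys()):
        pos1, pos2 = postings1[doc], set(postings2[doc])
        if any(p + 1 in pos2 for p in pos1):
            hits.append(doc)
    return hits

to_ = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be_ = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_intersect(to_, be_))   # [4]  -- "to be" occurs in doc 4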
Proximity queries
 LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Here, /k means “within k words of”.
 Clearly, positional indexes can be used for
such queries; biword indexes cannot.
 Exercise: Adapt the linear merge of postings
to handle proximity queries. Can you make it
work for any value of k?
Positional index size
 Can compress position values/offsets as we did
with docs in the last lecture
 Nevertheless, this expands postings storage
substantially
Rules of thumb
 Positional index size factor of 2-4 over non-
positional index
 Positional index size 35-50% of volume of
original text
 Caveat: all of this holds for “English-like”
languages
Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Introduction to Information Retrieval Sec. 3.2

Wild-card queries: *
 mon*: find all docs containing any word beginning
with “mon”.
 Easy with binary tree (or B-tree) lexicon: retrieve all
words in range: mon ≤ w < moo
 *mon: find words ending in “mon”: harder
 Maintain an additional B-tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.

Exercise: from this, how can we enumerate all terms


meeting the wild-card query pro*cent ?
157
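The range lookup mon ≤ w < moo over a sorted lexicon can be sketched with the standard bisect module, assuming a small illustrative vocabulary:

import bisect

def prefix_matches(sorted_terms, prefix):
    # All terms w with prefix <= w < prefix-with-last-character-incremented.
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return sorted_terms[lo:hi]

lexicon = sorted(["moat", "money", "monkey", "month", "moon", "moose"])
print(prefix_matches(lexicon, "mon"))   # ['money', 'monkey', 'month']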
Introduction to Information Retrieval Sec. 3.2

Query processing
 At this point, we have an enumeration of all terms in
the dictionary that match the wild-card query.
 We still have to look up the postings for each
enumerated term.
 E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.

158
Introduction to Information Retrieval Sec. 3.2

B-trees handle *’s at the end of a


query term
 How can we handle *’s in the middle of query term?
 co*tion
 We could look up co* AND *tion in a B-tree and
intersect the two term sets
 Expensive
 The solution: transform wild-card queries so that the
*’s occur at the end
 This gives rise to the Permuterm Index.

159
Introduction to Information Retrieval Sec. 3.2.1

Permuterm index
 For term hello, index under:
 hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
 Queries:
 X → lookup on X$        X* → lookup on $X*
 *X → lookup on X$*       *X* → lookup on X*
 X*Y → lookup on Y$X*     X*Y*Z → ???
Example: query = hel*o, i.e. X = hel, Y = o → lookup on o$hel*
For X*Y*Z: search for X*Y and Y*Z, i.e. look up Y$X* and Z$Y*?
Example: we’re looking for c*h*r → look up h$c* and r$h*?
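Both halves of the permuterm idea fit in a few lines; a minimal sketch that generates the rotations for an indexed term and rotates a single-* query so the wildcard lands at the end:

def permuterm_rotations(term):
    # Index every rotation of term + '$' (the word-boundary marker).
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    # Rotate a query containing one '*' until the '*' is the last character.
    q = query + "$"
    while not q.endswith("*"):
        q = q[1:] + q[0]
    return q

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate_query("hel*o"))   # 'o$hel*' -- then prefix-search the rotations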
Wild card queries
1. The wild card queries by user can be divided into 5 cases.
1. X* --> X
2. *X --> X$*
3. X*Y --> Y$X*
4. X*Y*Z --> (Z$X*) and (Y*)
5. *X* --> can be converted to X* form

2. The 1st 3 cases can be done by prefix matching with the corresponding forms
given in RHS.
3. 4th and 5th are exceptional cases.
4. 5th can be converted to 1st case.
5. For case 4,
Two queries can be taken and their posting lists can be generated. Then a bitwise
AND operation on vectors containing their posting lists can be done to obtain
the required result.
The above mentioned two queries (for case X*Y*Z) are (Z$X*) and (Y*)
Introduction to Information Retrieval Sec. 3.2.1

Permuterm query processing


 Rotate query wild-card to the right
 Now use B-tree lookup as before.
 Permuterm problem: ≈ quadruples lexicon size

Empirical observation for English.

162
Language Model

 Language Model (LM)


 A language model is a probability distribution over
entire sentences or texts
• N-gram: unigrams, bigrams, trigrams,…

 In a simple n-gram language model, the


probability of a word, conditioned on some
number of previous words
 In other words, using the previous N-1 words in
a sequence we want to predict the next word

 Sue swallowed the large green ____.


• A. Frog
• B. Mountain
• C. Car
• D. Pill
Introduction to Information Retrieval Sec. 3.2.2

Bigram (k-gram) indexes


 Enumerate all k-grams (sequence of k chars)
occurring in any term
 e.g., from text “April is the cruelest month” we get
the 2-grams (bigrams)

$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,h$
 $ is a special word boundary symbol
 Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
165
Introduction to Information Retrieval Sec. 3.2.2

Bigram index example


 The k-gram index finds terms based on a query
consisting of k-grams (here k=2).

$m → mace, madden
mo → among, amortize
on → along, among

166
Introduction to Information Retrieval

Example with 3-grams

 Suppose the correct word is “november":


$$n, $no, nov, ove, vem, emb, mbe, ber, er$, r$$
 And the query term is “december":
$$d, $de, dec, ece, cem, emb, mbe, ber, er$, r$$
 So 5 trigrams overlap (out of 10 in each term)
 Issue: Fixed number of k-grams that differ
does not work for words of differing length.
 How can we turn this into a normalized measure of
overlap?
 We have to use Jaccard coefficient 167
Postings list in a 3-gram inverted index

k-gram (bigram, trigram, . . . ) indexes


Note that we now have two different types of inverted indexes:
The Term-Document Inverted Index, for finding documents based on a
query consisting of terms
The k-gram Index, for finding terms based on a query consisting of
k-grams
Introduction to Information Retrieval Sec. 3.2.2

Processing wild-cards
 Query mon* can now be run as
 $m AND mo AND on
 Gets terms that match AND version of our wildcard
query.
 But we’d enumerate moon.
 Must post-filter these terms against query.
 Surviving enumerated terms are then looked up in
the term-document inverted index.
 Fast, space efficient (compared to permuterm).

169
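Putting the last few slides together, a minimal sketch (with an assumed toy vocabulary) that builds the bigram-to-term index and answers mon* as an AND over its bigrams followed by the post-filter:

from collections import defaultdict

def kgrams(term, k=2):
    t = "$" + term + "$"
    return {t[i:i + k] for i in range(len(t) - k + 1)}

vocabulary = ["moon", "month", "mono", "melon", "sermon"]
index = defaultdict(set)                 # k-gram -> set of terms
for term in vocabulary:
    for g in kgrams(term):
        index[g].add(term)

# Query mon*: AND the postings of $m, mo, on, then post-filter false hits.
candidates = index["$m"] & index["mo"] & index["on"]
matches = sorted(t for t in candidates if t.startswith("mon"))
print(sorted(candidates))   # includes the false positive 'moon'
print(matches)              # ['mono', 'month']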
Introduction to Information Retrieval

Processing wildcard terms in a bigram index

 Query hel* can now be run as:


$h AND he AND el
 ... but this will show up many false positives like heel
 Postfilter, then look up surviving terms in term–document inverted index.
 k-gram vs. permuterm index
 k-gram index is more space-efficient
 permuterm index does not require postfiltering.

170
Introduction to Information Retrieval Sec. 3.2.2

Processing wild-card queries


 As before, we must execute a Boolean query for each
enumerated, filtered term.
 Wild-cards can result in expensive query execution
(very large disjunctions…)
 pyth* AND prog*
 If you encourage “laziness” people will respond!

Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.

 Which web search engines allow wildcard queries? 171


Introduction
Architecture of IR systems
IR Models – Boolean and Extended Boolean
Vocabulary, Posting Lists
Preprocessing : Tokenization, Stemming etc
Algorithms to search in posting lists
Positional and Phrase Queries
Search structure for dictionaries
Wildcard queries
Spelling and Phonetic Corrections
Introduction to Information Retrieval Sec. 3.3

Spell correction
 Two principal uses
 Correcting document(s) being indexed
 Correcting user queries to retrieve “right” answers
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words
 e.g., from  form
 Context-sensitive
 Look at surrounding words,
 e.g., I flew form Heathrow to Narita.

175
Introduction to Information Retrieval Sec. 3.3

Document correction
 Especially needed for OCR’ed documents
 Correction algorithms are tuned for this: rn/m
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often than it would confuse O
and I (adjacent on the QWERTY keyboard, so more likely
interchanged in typing).
 But also: web pages and even printed material have
typos
 Goal: the dictionary contains fewer misspellings
 But often we don’t change the documents and
instead fix the query-document mapping
176
Introduction to Information Retrieval Sec. 3.3

Query mis-spellings
 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve documents indexed by the correct spelling, OR
 Return several suggested alternative queries with the
correct spelling
 Did you mean … ?

177
Introduction to Information Retrieval Sec. 3.3.2

Isolated word correction


 Fundamental premise – there is a lexicon from which
the correct spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry-specific” lexicon – hand-maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis-spellings)

178
Introduction to Information Retrieval Sec. 3.3.2

Isolated word correction


 Given a lexicon and a character sequence Q, return
the words in the lexicon closest to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance (Levenshtein distance)
 Weighted edit distance
 n-gram overlap

179
Introduction to Information Retrieval

Dynamic programming (Cormen et al.)

 Optimal substructure: The optimal solution to the problem


contains within it subsolutions, i.e., optimal solutions to
subproblems.
 Overlapping subsolutions: The subsolutions overlap. These
subsolutions are computed over and over again when
computing the global optimal solution in a brute-force
algorithm.
 Subproblem in the case of edit distance: what is the edit
distance of two prefixes
 Overlapping subsolutions: We need most distances of prefixes
3 times – this corresponds to moving right, diagonally, down.

180

180
Introduction to Information Retrieval Sec. 3.3.3

Edit distance
 Given two strings S1 and S2, the minimum number of
operations to convert one to the other
 Operations are typically character-level
 Insert, Delete, Replace, (Transposition)
 E.g., the edit distance from dof to dog is 1
 From cat to act is 2 (Just 1 with transpose.)
 from cat to dog is 3.
 Generally found by dynamic programming.

181
Introduction to Information Retrieval

Simple example of Edit distance


Find the Levenshtein edit distance between ME and MY
Introduce NULL
If the characters are same copy the previously computed value otherwise
take the minimum values of the three neighbouring values and add 1
NULL M Y
NULL 0 1 2
M 1 0 1
E 2 1 1

The edit distance is 1

182
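The dynamic program behind this table, as a minimal sketch of the standard Levenshtein recurrence (without the transposition operation):

def edit_distance(s1, s2):
    # d[i][j] is the distance between the first i characters of s1
    # and the first j characters of s2.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # copy or replace
    return d[m][n]

print(edit_distance("ME", "MY"))    # 1
print(edit_distance("cat", "dog"))  # 3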
Introduction to Information Retrieval Sec. 3.3.3

Weighted edit distance


 As above, but the weight of an operation depends on
the character(s) involved
 Meant to capture OCR or keyboard errors
Example: m more likely to be mis-typed as n than as q
 Therefore, replacing m by n is a smaller edit distance than
by q
 This may be formulated as a probability model
 Requires weight matrix as input
 Modify dynamic programming to handle weights

183
Introduction to Information Retrieval Sec. 3.3.4

Using edit distances


 Given query, first enumerate all character sequences
within a preset (weighted) edit distance (e.g., 2)
 Intersect this set with list of “correct” words
 Show terms you found to user as suggestions
 Alternatively,
 We can look up all possible corrections in our inverted
index and return all docs … slow
 We can run with a single most likely correction
 The alternatives disempower the user, but save a
round of interaction with the user
184
Introduction to Information Retrieval Sec. 3.3.4

Edit distance to all dictionary terms?


 Given a (mis-spelled) query – do we compute its edit
distance to every dictionary term?
 Expensive and slow
 Alternative?
 How do we cut the set of candidate dictionary
terms?
 One possibility is to use n-gram overlap for this
 This can also be used by itself for spelling correction.

185
Introduction to Information Retrieval Sec. 3.3.4

n-gram overlap
 Enumerate all the n-grams in the query string as well
as in the lexicon
 Use the n-gram index (recall wild-card search) to
retrieve all lexicon terms matching any of the query
n-grams
 Threshold by number of matching n-grams
 Variants – weight by keyboard layout, etc.

186
Introduction to Information Retrieval Sec. 3.3.4

Example with trigrams


 Suppose the text is november
 Trigrams are nov, ove, vem, emb, mbe, ber.
 The query is december
 Trigrams are dec, ece, cem, emb, mbe, ber.
 So 3 trigrams overlap (of 6 in each term)
 How can we turn this into a normalized measure of
overlap?

187
Introduction to Information Retrieval Sec. 3.3.4

One option – Jaccard coefficient


 A commonly-used measure of overlap
 Let X and Y be two sets; then the J.C. is

X Y / X Y
 Equals 1 when X and Y have the same elements and
zero when they are disjoint
 X and Y don’t have to be of the same size
 Always assigns a number between 0 and 1
 Now threshold to decide if you have a match
 E.g., if J.C. > 0.8, declare a match
188
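Continuing the november/december example, a minimal sketch of the Jaccard computation over character k-gram sets:

def kgrams(term, k=3):
    # Character k-grams without boundary markers, as in the example above.
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

a, b = kgrams("november"), kgrams("december")
print(sorted(a & b))             # ['ber', 'emb', 'mbe']
print(round(jaccard(a, b), 2))   # 3 shared of 9 distinct -> 0.33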
Introduction to Information Retrieval Sec. 3.3.4

Matching trigrams
 Consider the query lord – we wish to identify words
matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card

Standard postings “merge” will enumerate …


Adapt this to using Jaccard (or another) measure.
189
Introduction to Information Retrieval Sec. 3.3.5

Context-sensitive spell correction


 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.

190
Introduction to Information Retrieval Sec. 3.3.5

Context-sensitive correction
 Need surrounding context to catch this.
 First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
 Now try all possible resulting phrases with one word
“fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 Hit-based spelling correction: Suggest the
alternative that has lots of hits.

191
Introduction to Information Retrieval Sec. 3.3.5

Exercise
 Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in
this scheme?

192
Introduction to Information Retrieval Sec. 3.3.5

Another approach
 Break phrase query into a conjunction of biwords
(Lecture 2).
 Look for biwords that need only one term corrected.
 Enumerate only phrases containing “common”
biwords.

193
Introduction to Information Retrieval Sec. 3.3.5

General issues in spell correction


 We enumerate multiple alternatives for “Did you
mean?”
 Need to figure out which to present to the user
 The alternative hitting most docs
 Query log analysis
 More generally, rank alternatives probabilistically
argmaxcorr P(corr | query)
 From Bayes rule, this is equivalent to
argmaxcorr P(query | corr) * P(corr)

Noisy channel Language model


194
Introduction to Information Retrieval Sec. 3.4

Soundex
 Class of heuristics to expand a query into phonetic
equivalents
 Language specific – mainly for names
 E.g., chebyshev  tchebycheff
 Invented for the U.S. census … in 1918

195
Introduction to Information Retrieval Sec. 3.4

Soundex – typical algorithm


 Turn every token to be indexed into a 4-character
reduced form
 Do the same with query terms
 Build and search an index on the reduced forms
 (when the query calls for a soundex match)

 http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

196
Introduction to Information Retrieval Sec. 3.4

Soundex – typical algorithm


1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0'
(zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
 B, F, P, V → 1
 C, G, J, K, Q, S, X, Z → 2
 D, T → 3
 L → 4
 M, N → 5
 R → 6
197
Introduction to Information Retrieval Sec. 3.4

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be of the
form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.

Will hermann generate the same code?


198
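A minimal sketch that follows the steps above literally (assuming plain ASCII names); it produces H655 for both spellings:

CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.upper()
    digits = [CODES.get(ch, "0") for ch in word]
    # Keep one digit out of each run of consecutive identical digits.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # Retain the first letter, drop zeros from the rest, pad with trailing zeros.
    rest = [d for d in collapsed[1:] if d != "0"]
    return (word[0] + "".join(rest) + "000")[:4]

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 -- same code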
Introduction to Information Retrieval Sec. 3.4

Soundex
 Soundex is the classic algorithm, provided by most
databases (Oracle, Microsoft, …)
 How useful is soundex?
 Not very – for information retrieval
 Okay for “high recall” tasks (e.g., Interpol), though
biased to names of certain nationalities
 Zobel and Dart (1996) show that other algorithms for
phonetic matching perform much better in the
context of IR

199
Introduction to Information Retrieval

What queries can we process?


 We have
 Positional inverted index with skip pointers
 Wild-card index
 Spell-correction
 Soundex
 Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

200
Introduction to Information Retrieval

Exercise
 Draw yourself a diagram showing the various indexes
in a search engine incorporating all the functionality
we have talked about
 Identify some of the key design choices in the index
pipeline:
 Does stemming happen before the Soundex index?
 What about n-grams?
 Given a query, how would you parse and dispatch
sub-queries to the various indexes?

201
Introduction to Information Retrieval Sec. 3.5

Resources
 IIR 3, MG 4.2
 Efficient spell retrieval:
 K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
 J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software - practice and experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
 Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
 Nice, easy reading on spell correction:
 Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
202
