II. Information Retrieval (Basics Cont.) : Web Search - Summer Term 2006

Web Search Summer Term 2006
II. Information Retrieval (Basics Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Organizational Remarks
Exercises: Please register for the exercises by sending me (huerst@informatik.uni-freiburg.de) an email by Friday, May 5th, with
- your name,
- student ID number (Matrikelnummer),
- degree program (Studiengang),
- plans for the exam.
This is only to organize the exercises and has no effect if you decide to drop this course later.
Recap: IR System & Tasks Involved

[Figure: IR system overview, built up over three slides. User side: information need, user interface, query, query processing (parsing & term processing), logical view of the information need, searching, ranking, result representation, results. Document side: documents, select data for indexing, parsing & term processing, indexing, index. Also shown: performance evaluation.]
Query Languages: Boolean Search

So far:
a) Single terms (unrelated / bag of words)
b) Boolean combinations of terms (AND, OR, NOT)

Boolean search: the main search model before the Web came along (note: mainly professional users).

Advantages of Boolean queries:
- Precise (mathematical model)
- Offers great control and transparency
- Good for domains where results are ranked by something other than relevance, e.g. chronologically
Boolean Search (Cont.)

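Disadvantages of Boolean queries:
- Sometimes hard to specify, even for experts
- Binary decision (relevant or not)
- Bag of words, no position information

Example:
Query: New York City
Doc. 1: This is a nice city.
Doc. 2: This city has a new library.
(Both documents match individual query terms, but neither is about New York City.)

Query: New AND York AND City
Doc. 1: New York has a new library.
Doc. 2: The city of York has a new library.
(Only Doc. 2 contains all three terms, although it is not about New York City.)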
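The second example can be reproduced with a small sketch of Boolean AND retrieval over an inverted index. The helper names (`tokenize`, `build_index`, `boolean_and`) and the simple lowercase tokenization are assumptions made for illustration; they are not taken from the slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the ids of the documents that contain every given term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# The two documents of the second example above.
docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)

print(sorted(boolean_and(index, "New", "York", "City")))  # [2]
```

Because the index only records which terms occur in which document, the query cannot express that the three words have to appear next to each other.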
Further Query Types

- Phrases, e.g. "New York City"
- Proximity, e.g. University NEAR Freiburg (finds "University of Freiburg" and "Albert-Ludwigs University Freiburg")
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS "binary search tree"
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, ...)
- Spelling corrections and some more (often application dependent)
Phrases
Often used (esp. for web search): quotes, e.g. "New York City"
Advantage: easy to use and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2])
How do we support this?
- We need word positions.
- We need all original words (e.g. no stop word removal in "University of Freiburg").
- We need an efficient way to do this.
Approaches to Support Phrases

Biword indexes:
- Idea: Store pairs of consecutive words (in addition to single terms), e.g. "New York City" is represented by the terms New, York, City, New York, York City
- Might cause problems for phrases with more than 2 words, but often works quite well

Positional indexes:
- Idea: Store the position of each word in the postings list
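A minimal sketch of the biword idea, assuming a simple lowercase tokenizer; the function names and the three sample documents are made up for illustration. It also shows the kind of problem that can occur for phrases with more than two words: a document can contain every biword of the phrase without containing the phrase itself.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_biword_index(docs):
    """Index single terms and pairs of consecutive words (biwords)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for term in tokens:
            index[term].add(doc_id)
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Intersect the postings of all biwords of the phrase."""
    tokens = tokenize(phrase)
    if len(tokens) == 1:
        keys = tokens
    else:
        keys = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    postings = [index.get(key, set()) for key in keys]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample documents.
docs = {
    1: "New York City has a new library.",
    2: "In New York, the York City football fans met.",
    3: "The city of York has a new library.",
}
index = build_biword_index(docs)

# Doc. 2 contains "new york" and "york city", but not the full phrase.
print(sorted(phrase_candidates(index, "New York City")))  # [1, 2]
```

The result is therefore a candidate set; for longer phrases the candidates would still need to be checked against positions or the original text.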
Positional Indexes Example

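Non-positional postings (term, frequency, document IDs):
CITY   18453   23 25 32 47
NEW    23535   18 23 25 47
YORK    9421   25 47 53 55

Positional postings (docID:count[positions]):
CITY   18453   23:4[3,12,46,78]  25:3[43,120,221]  32:6[12,20,57,200,322,481]
NEW    23535   ..., 25:6[41,87,136,...], ...
YORK    9421   ..., 25:2[42,137], ...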
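A phrase lookup on positional postings can be sketched as follows, using the positions for document 25 as far as they are visible in the example. Two assumptions are made here: only three of the six NEW positions are shown on the slide, so only those are used, and the list [43,120,221] is read as belonging to CITY. The function name `phrase_positions` is hypothetical.

```python
def phrase_positions(positions_per_term):
    """Given one position list per phrase word (in phrase order) for a single
    document, return the start positions where the words occur consecutively."""
    if not positions_per_term:
        return []
    later = [set(positions) for positions in positions_per_term[1:]]
    return [
        start
        for start in positions_per_term[0]
        if all(start + offset in positions
               for offset, positions in enumerate(later, start=1))
    ]

# Positions in document 25 (assumption: only the positions visible on the slide).
doc25 = {
    "new": [41, 87, 136],
    "york": [42, 137],
    "city": [43, 120, 221],
}

print(phrase_positions([doc25["new"], doc25["york"]]))                 # [41, 136]
print(phrase_positions([doc25["new"], doc25["york"], doc25["city"]]))  # [41]
```

A full phrase query would first intersect the document IDs of all phrase words and then run this position check once per remaining document.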
Positional Indexes
Also works for queries such as
- University [word]1 Freiburg
- University NEAR Freiburg

Problem: Size
- Need to store additional information (positions) on top of an already large index (stop words!)
- Approx. size: 2-4 times the original index, 1/2 the size of the uncompressed documents [2]

In practice: Combinations exist, e.g. an index with names as phrases and useful biwords, plus stored positions
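A proximity query on a positional index might look like the sketch below. The exact semantics of NEAR are not defined on the slide; this sketch assumes that the second word has to follow the first with at most `max_gap` words in between. The function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]}, with positions counted from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def near(index, left, right, max_gap=1):
    """Docs in which `right` follows `left` with at most `max_gap` words between."""
    left_postings = index.get(left.lower(), {})
    right_postings = index.get(right.lower(), {})
    hits = set()
    # Only documents that contain both words can match.
    for doc_id in left_postings.keys() & right_postings.keys():
        if any(0 < r - l <= max_gap + 1
               for l in left_postings[doc_id]
               for r in right_postings[doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "The University of Freiburg is in Germany.",
    2: "Albert-Ludwigs University Freiburg offers a web search course.",
    3: "Freiburg is a city; its university is well known.",
}
index = build_positional_index(docs)

print(sorted(near(index, "University", "Freiburg", max_gap=1)))  # [1, 2]
print(sorted(near(index, "University", "Freiburg", max_gap=0)))  # [2]
```

With `max_gap=0` the same function behaves like a two-word phrase query, which is one reason a single positional index can serve both query types.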
Pattern Matching Wildcards

Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, ...

Trailing wildcard queries, e.g. fußball*
- Can easily be found if the dictionary is stored as a B-tree

Leading wildcard queries, e.g. *meister
- Can easily be found if the dictionary is stored as a reverse B-tree (i.e. terms stored backwards)
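The two lookups can be sketched with a sorted list and binary search standing in for the B-tree. This is an assumption made to keep the example short; a real dictionary would use an actual B-tree. The function names and the small term list are made up for illustration.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms of a sorted list that start with `prefix`."""
    i = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

# Hypothetical dictionary terms.
terms = ["federball", "fußball", "fußballer", "fußballspiel",
         "fußballweltmeister", "weltmeister"]
forward = sorted(terms)                     # stands in for the B-tree
backward = sorted(t[::-1] for t in terms)   # stands in for the reverse B-tree

def trailing_wildcard(prefix):
    """Answer queries of the form prefix*."""
    return prefix_range(forward, prefix)

def leading_wildcard(suffix):
    """Answer queries of the form *suffix via the reversed terms."""
    return sorted(t[::-1] for t in prefix_range(backward, suffix[::-1]))

print(trailing_wildcard("fußball"))
# ['fußball', 'fußballer', 'fußballspiel', 'fußballweltmeister']
print(leading_wildcard("meister"))
# ['fußballweltmeister', 'weltmeister']
```

In both cases all matching terms are adjacent in sort order, so one search plus a short scan is enough.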
Wildcards (Cont.)
General wildcards, e.g. f*ball (matches e.g. fußball, federball, ...)

Idea: Move the * to the end

Permuterm index:
- For each word (e.g. fußball), add an end symbol (e.g. fußball$) and create all rotations (e.g. fußball$, ußball$f, ßball$fu, ball$fuß, ..., l$fußbal, $fußball)
- Dictionary = all permuterms (rotations), postings = dictionary terms that have this rotation

Query:
- Rotate the query so that the * is at the end (e.g. f*ball becomes ball$f*) and get the postings from the permuterm index (e.g. ball$fuß, ball$feder, ...)
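The construction and lookup can be sketched as follows for queries with exactly one *. The function names are hypothetical, and the prefix lookup is done here by a linear scan over all rotations to keep the example short; in practice it would be a prefix search on a B-tree over the permuterm dictionary.

```python
from collections import defaultdict

END = "$"  # end-of-word symbol, as on the slide

def rotations(term):
    """All rotations of term + '$', e.g. fußball$, ußball$f, ..., $fußball."""
    marked = term + END
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm_index(terms):
    """Map every rotation to the original dictionary term(s) it came from."""
    index = defaultdict(set)
    for term in terms:
        for rotation in rotations(term):
            index[rotation].add(term)
    return index

def wildcard_lookup(index, query):
    """Answer a query that contains exactly one '*'."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    head, tail = query.split("*")
    prefix = tail + END + head  # f*ball -> ball$f (the * moves to the end)
    matches = set()
    for rotation, originals in index.items():  # linear scan, see note above
        if rotation.startswith(prefix):
            matches |= originals
    return sorted(matches)

# Hypothetical dictionary terms.
index = build_permuterm_index(["fußball", "federball", "fußballer", "weltmeister"])

print(wildcard_lookup(index, "f*ball"))    # ['federball', 'fußball']
print(wildcard_lookup(index, "*meister"))  # ['weltmeister']
```

Trailing and leading wildcards fall out as special cases: fußball* becomes the prefix $fußball, and *meister becomes meister$.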
Structural Queries
In practice: Often semi-structured documents

Structural queries: Use available structure to better specify the information need, e.g.
AUTHOR = Ottmann AND TEXT CONTAINS "search tree"

Requires storing structure information, e.g. in a parametric index,

encoded in the dictionary:
OTTMANN.AUTHOR   9 17 19 28
OTTMANN.TITLE    12 26 44 48
OTTMANN.BODY     8 9 17 23

or in the postings:
OTTMANN   8.BODY   9.AUTHOR, 9.BODY   12.TITLE
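Both encodings can be sketched with plain dictionaries. The OTTMANN entries for documents 8, 9 and 12 follow the postings example above; the entries for the term "tree" are hypothetical, and mapping TEXT to the BODY zone is an assumption. The function names are made up for illustration.

```python
# Variant 1: the zone is part of the dictionary key ("term.ZONE").
zone_in_dictionary = {
    "ottmann.AUTHOR": [9],
    "ottmann.TITLE": [12],
    "ottmann.BODY": [8, 9],
    "tree.BODY": [9, 12],  # hypothetical entry, not from the slide
}

# Variant 2: the zone is stored with each posting.
zone_in_postings = {
    "ottmann": [(8, "BODY"), (9, "AUTHOR"), (9, "BODY"), (12, "TITLE")],
    "tree": [(9, "BODY"), (12, "BODY")],  # hypothetical entry
}

def docs_by_key(index, term, zone):
    """Variant 1: look up the combined key term.ZONE."""
    return set(index.get(f"{term}.{zone}", []))

def docs_by_posting(index, term, zone):
    """Variant 2: look up the term, then filter its postings by zone."""
    return {doc_id for doc_id, z in index.get(term, []) if z == zone}

# AUTHOR = Ottmann AND TEXT CONTAINS tree (TEXT treated as the BODY zone).
for lookup, index in ((docs_by_key, zone_in_dictionary),
                      (docs_by_posting, zone_in_postings)):
    hits = lookup(index, "ottmann", "AUTHOR") & lookup(index, "tree", "BODY")
    print(sorted(hits))  # [9]
```

The first variant makes the dictionary larger (one entry per term and zone), while the second keeps one dictionary entry per term and makes each posting larger.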
Summary: Further Query Types

- Phrases, e.g. "New York City"
- Proximity, e.g. University NEAR Freiburg (finds "University of Freiburg" and "Albert-Ludwigs University Freiburg")
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS "binary search tree"
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, ...)
- Spelling corrections and some more (often application dependent)
