INDEXING
Static and Dynamic Inverted Indices
Index Construction and Index Compression
Searching
Sequential Searching and Pattern Matching
Query Operations
Query Languages
Query Processing
Relevance Feedback and Query Expansion
Automatic Local and Global Analysis
Measuring Effectiveness and Efficiency
Inverted Files
Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in order to
speed up the searching task.
Structure of inverted file:
Vocabulary: the set of all distinct words in the text
Occurrences: lists containing all the information
needed for each word of the vocabulary (text
positions, frequency, documents where the word
appears, etc.)
Index Components and Index Life Cycle
The two principal components of an inverted index are the dictionary and the
postings lists. For each term in the text collection, there is a postings list
that contains information about the term's occurrences in the collection.
Two distinct phases: static inverted index
The life cycle of a static inverted index, built for a never-changing text
collection, consists of two distinct phases (for a dynamic index the two phases
coincide):
1. Index construction: The text collection is processed
sequentially, one token at a time,
and a postings list is built for each term in the
collection in an incremental fashion.
2. Query processing: The information stored in the
index that was built in phase 1 is used
to process search queries.
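The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: it assumes whitespace tokenization, postings that hold only document ids, and the hypothetical helper names build_index and search.

```python
from collections import defaultdict

def build_index(docs):
    """Phase 1: scan tokens sequentially and grow each postings list incrementally."""
    index = defaultdict(list)  # term -> postings list of document ids
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            postings = index[token]
            # Document ids arrive in increasing order, so checking the
            # last posting is enough to avoid duplicates.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

def search(index, term):
    """Phase 2: answer a single-term query from the finished index."""
    return index.get(term.lower(), [])

docs = ["the garden has many flowers", "that house has a garden"]
index = build_index(docs)
print(search(index, "garden"))  # [0, 1]
print(search(index, "house"))   # [1]
```

For a static index, build_index runs once and search runs many times afterwards; a dynamic index would have to interleave the two.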
The Dictionary
The dictionary is the central data structure that is used to manage the set of terms
found in a text collection.
It provides a mapping from the set of index terms to the locations of their
postings lists. At query time, locating the query terms' postings lists in the index
is one of the first operations performed when processing an incoming keyword
query.
At indexing time, the dictionary's lookup capability allows the search engine to
quickly obtain the memory address of the inverted list for each incoming term
and to append a new posting at the end of that list.
Dictionary implementations found in search engines usually support the
following set of
operations:
1. Insert a new entry for term T.
2. Find and return the entry for term T (if present).
3. Find and return the entries for all terms that start with a given prefix P.
When building an index for a text collection, the search engine performs
operations of types 1 and 2 to look up incoming terms in the dictionary and to
add postings for these terms to the index.
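The three operations can be illustrated with a small sketch. It assumes one possible realization, a sorted list of terms searched with binary search; the class name SortedDictionary and its method names are hypothetical, and a Python list stands in for the "location" of each postings list.

```python
import bisect

class SortedDictionary:
    """Illustrative dictionary that keeps its terms in sorted order."""

    def __init__(self):
        self.terms = []    # sorted list of terms, used for prefix lookups
        self.entries = {}  # term -> postings list (stand-in for its location)

    def insert(self, term):
        """Operation 1: insert an entry for term if absent; return its postings list."""
        if term not in self.entries:
            bisect.insort(self.terms, term)
            self.entries[term] = []
        return self.entries[term]

    def find(self, term):
        """Operation 2: return the entry for term, or None if it is not present."""
        return self.entries.get(term)

    def find_prefix(self, prefix):
        """Operation 3: return (term, entry) pairs for all terms starting with prefix."""
        start = bisect.bisect_left(self.terms, prefix)
        result = []
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break  # terms are sorted, so no later term can match
            result.append((term, self.entries[term]))
        return result

d = SortedDictionary()
for doc_id, term in [(1, "garden"), (1, "house"), (2, "garden"), (2, "gate")]:
    d.insert(term).append(doc_id)  # indexing uses operations 1 and 2 together

print(d.find("house"))       # [1]
print(d.find_prefix("ga"))   # [('garden', [1, 2]), ('gate', [2])]
```

Keeping the terms sorted is what makes the prefix operation cheap; a plain hash table would support operations 1 and 2 but would need a full scan for operation 3.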
The two most common ways to realize
an in-memory dictionary are
Example
Text (numbers give the 1-based character position at which each word starts):
1 6 12 16 18 26 30 37 41 46 55 59 67 71
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file
Vocabulary    Occurrences
beautiful     71
flowers       46, 59
garden        18, 30
house         6
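A short sketch of how such an inverted file could be produced. It assumes that occurrences are 1-based character offsets counted over the single-spaced sentence (offsets counted under a different convention will differ by one), and that common words are dropped through a stopword list; the STOPWORDS set and the function name build_occurrences are illustrative choices, not part of the slides.

```python
import re
from collections import defaultdict

# Assumed stopword list: the words the example leaves out of the vocabulary.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_occurrences(text):
    """Map each non-stopword to the 1-based offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in STOPWORDS:
            occurrences[word].append(match.start() + 1)  # convert to 1-based
    return dict(sorted(occurrences.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inverted = build_occurrences(text)
for word, offsets in inverted.items():
    print(word, offsets)
# beautiful [71]
# flowers [46, 59]
# garden [18, 30]
# house [6]
```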
Inverted Files with TF-IDF
The prior example supports only Boolean queries.
TF-IDF ranking also needs the document frequency and the term
frequency.
Vocabulary entry Posting file entry
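As a rough sketch of where the two statistics could live: the code below stores the term frequency in each posting and derives the document frequency from the length of the postings list. The weight tf * log(N / df) is one common TF-IDF variant chosen here as an assumption; the slides do not fix a formula, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_tf_index(docs):
    """Build term -> [(doc_id, tf), ...] with whitespace tokenization."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    return postings

def tf_idf(postings, term, n_docs):
    """Return {doc_id: weight} using the assumed weight tf * log(N / df)."""
    entries = postings.get(term, [])
    df = len(entries)  # document frequency = number of postings
    if df == 0:
        return {}
    idf = math.log(n_docs / df)
    return {doc_id: tf * idf for doc_id, tf in entries}

docs = ["garden garden flowers", "house garden", "flowers flowers flowers"]
postings = build_tf_index(docs)
weights = tf_idf(postings, "garden", len(docs))
print(weights)  # document 0 weighs twice as much as document 1
```

A real index would usually store df in the vocabulary entry so that it is available without reading the postings list.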
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file
Vocabulary    Occurrences
beautiful     4
flowers       3
garden        2
house         1
Block Effect on Inverted File Size
How big are inverted files?
In relation to original collection size
Index    Small collection (1 MB)    Medium collection (200 MB)    Large collection (2 GB)
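[Figure: binary merging of partial indices. The initial dumps I1 to I8 are merged pairwise over three levels: merges 1, 2, 4 and 5 at level 1; merges 3 and 6 at level 2, producing I1...4 and I5...8; merge 7 at level 3, producing the final index. The numbers give the merge order.]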
Large Index Construction Time
The total time to generate partial indices is O(n)
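Once the partial indices exist, they have to be combined. The sketch below merges them pairwise, level by level, in the spirit of a binary merge tree. It assumes each partial index is a small in-memory dict from term to a sorted list of document ids; the names merge_two and merge_all are hypothetical, and it does not reproduce any particular merge order within a level.

```python
def merge_two(a, b):
    """Merge two partial indices (term -> sorted list of document ids)."""
    merged = {}
    for term in sorted(set(a) | set(b)):
        merged[term] = sorted(set(a.get(term, [])) | set(b.get(term, [])))
    return merged

def merge_all(partials):
    """Merge partial indices pairwise, one level at a time."""
    level = list(partials)
    while len(level) > 1:
        next_level = [merge_two(level[i], level[i + 1])
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd index out moves up unmerged
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else {}

partials = [
    {"garden": [1]},
    {"garden": [2], "house": [2]},
    {"flowers": [3]},
]
final_index = merge_all(partials)
print(final_index)  # {'flowers': [3], 'garden': [1, 2], 'house': [2]}
```

With eight partial indices this loop performs four merges, then two, then one, which matches the three levels of a binary merge.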
4. Natural Language
Single-Word Queries
A query is formulated as a single word
A document is modeled as a long sequence
of words
A word is a sequence of letters surrounded by
separators
What counts as a letter and what as a separator? e.g., is "on-line" one word or two?
The division of the text into words is not
arbitrary
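The effect of the separator choice can be seen in a small tokenizer sketch. It assumes ASCII letters only and treats the hyphen either as part of a word or as a separator; the function name tokenize and the keep_hyphen flag are illustrative.

```python
import re

def tokenize(text, keep_hyphen=True):
    """Split text into lowercase words; the hyphen is a letter or a separator."""
    if keep_hyphen:
        pattern = r"[A-Za-z]+(?:-[A-Za-z]+)*"  # "on-line" stays one word
    else:
        pattern = r"[A-Za-z]+"                 # "on-line" becomes two words
    return [word.lower() for word in re.findall(pattern, text)]

print(tokenize("On-line retrieval", keep_hyphen=True))   # ['on-line', 'retrieval']
print(tokenize("On-line retrieval", keep_hyphen=False))  # ['on', 'line', 'retrieval']
```

Whichever rule is chosen, the same rule has to be applied to both documents and queries, otherwise the query word will not be found in the vocabulary.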
Context Queries
Definition
- Search words in a given context
Types
Phrase
>a sequence of single-word queries
>e.g., "enhance retrieval"
Proximity
>a sequence of single words or phrases, together with a maximum allowed distance
between them
>e.g., within distance (enhance, retrieval, 4) will match "enhance the
power of retrieval"
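Both context query types can be checked against word positions. The sketch below is a minimal illustration that assumes whitespace tokenization, distances measured in words, and that the second word must follow the first; the slides do not say whether order matters, so that is an assumption, as are the function names.

```python
def positions(text):
    """Map each word to the list of word positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index

def phrase_match(index, words):
    """True if the words occur consecutively and in the given order."""
    return any(
        all(start + i in index.get(word, []) for i, word in enumerate(words))
        for start in index.get(words[0], [])
    )

def within_distance(index, first, second, max_dist):
    """True if `second` occurs at most max_dist words after `first`."""
    return any(
        0 < q - p <= max_dist
        for p in index.get(first, [])
        for q in index.get(second, [])
    )

doc = positions("enhance the power of retrieval")
print(within_distance(doc, "enhance", "retrieval", 4))  # True
print(within_distance(doc, "enhance", "retrieval", 3))  # False
print(phrase_match(doc, ["enhance", "retrieval"]))      # False
print(phrase_match(doc, ["power", "of", "retrieval"]))  # True
```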
Boolean Queries
Definition
A syntax composed of atoms that retrieve
documents, and of Boolean operators which work
on their operands
e.g., translation AND syntax OR syntactic
Fuzzy Boolean
Retrieve documents that satisfy only some of the operands (an AND may require a document to appear in
more operands than an OR)
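On postings lists, Boolean operators reduce to set operations. The sketch below uses made-up document ids and reads the example as translation AND (syntax OR syntactic); the parenthesization is an assumption, since the query as written leaves the precedence open. The fuzzy variant is likewise only one possible reading: it retrieves documents that match at least min_match operands.

```python
# Made-up postings: term -> set of ids of documents containing the term.
postings = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def docs(term):
    """Atom: the set of documents that contain the term."""
    return postings.get(term, set())

# Strict Boolean: AND is set intersection, OR is set union.
strict = docs("translation") & (docs("syntax") | docs("syntactic"))
print(sorted(strict))  # [2, 4]

def fuzzy(terms, min_match):
    """Assumed fuzzy reading: keep documents matching at least min_match terms."""
    counts = {}
    for term in terms:
        for doc_id in docs(term):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return {doc_id for doc_id, count in counts.items() if count >= min_match}

terms = ["translation", "syntax", "syntactic"]
print(sorted(fuzzy(terms, 1)))  # [1, 2, 3, 4, 5]
print(sorted(fuzzy(terms, 2)))  # [2, 4]
```

A low min_match behaves like a relaxed OR and a higher one like a relaxed AND, which mirrors the idea that AND may demand more matching operands than OR.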
Natural Language
Generalization of fuzzy Boolean
A query is an enumeration of words and
context queries
All the documents matching a portion of the
user query are retrieved
Query Processing: Introduction
It is difficult to formulate queries that are well designed for
retrieval purposes.
The initial query formulation can be improved through query
expansion and term reweighting.
Approaches based on:
feedback information from the user
information derived from the set of documents initially
retrieved (called the local set of documents)
global information derived from the document collection
User Relevance Feedback
User is presented with a list of the retrieved
documents and, after examining them, marks
those which are relevant.
Two basic operations:
Query expansion: addition of new terms from
relevant documents
Term reweighting: modification of term weights
based on the user's relevance judgements
User Relevance Feedback (contd)
User relevance feedback is used to:
expand queries with the vector model
reweight query terms with the probabilistic model
reweight query terms with a variant of the
probabilistic model
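For the vector model, expansion and reweighting are often done together with the Rocchio formula. The slides do not name a specific method, so the sketch below is an assumption about which technique is meant: it moves the query vector toward the documents marked relevant and away from the others. The default values of alpha, beta and gamma are illustrative, and vectors are plain dicts from term to weight.

```python
from collections import Counter

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (term -> weight) after user feedback."""
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            # New terms from relevant documents enter here (query expansion).
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Keep only terms whose weight stays positive.
    return {term: weight for term, weight in new_query.items() if weight > 0}

query = {"garden": 1.0}
relevant = [{"garden": 1.0, "flowers": 1.0}]
non_relevant = [{"house": 1.0}]
print(rocchio(query, relevant, non_relevant))
# {'garden': 1.75, 'flowers': 0.75}
```

In this run "flowers" is the expansion term, while the weight of "garden" is raised from 1.0 to 1.75, which is the reweighting part.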
Automatic Local Analysis
Clustering: the grouping of documents that satisfy a set of
common properties.
Goal: automatically obtain a description for a larger cluster of
relevant documents by identifying terms related to the query terms, such as:
Synonyms
Stemming variations
Terms at a distance of at most k words from a query
term
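The last kind of related term, words that occur within k words of a query term, is easy to sketch. The code below is a simplified illustration that assumes whitespace tokenization and simply counts neighbors in the locally retrieved documents; the function name nearby_terms is hypothetical, and no stopword filtering or weighting is applied.

```python
from collections import Counter

def nearby_terms(local_docs, query_terms, k):
    """Count words that occur within k words of any query term."""
    counts = Counter()
    for text in local_docs:
        words = text.lower().split()
        for i, word in enumerate(words):
            if word in query_terms:
                window = words[max(0, i - k): i + k + 1]
                counts.update(w for w in window if w not in query_terms)
    return counts

local_docs = [
    "the garden has many flowers",
    "flowers in the garden are beautiful",
]
candidates = nearby_terms(local_docs, {"garden"}, k=1)
print(candidates.most_common())  # [('the', 2), ('has', 1), ('are', 1)]
```

The result also shows why such candidates usually need filtering: with a small window, function words like "the" dominate the counts.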
Automatic Local Analysis (contd)
The local strategy is that the documents
retrieved for a given query q are examined at
query time to determine terms for query
expansion.
Two basic types of local strategy:
Local clustering
Local context analysis
Local strategies suit intranet environments,
not web documents.
Measuring Search Effectiveness