Completed UNIT-III 20.9.17

UNIT-III
INDEXING
Static and Dynamic Inverted Indices
Index Construction and Index Compression.
Searching -
Sequential Searching and Pattern Matching.
Query Operations
Query Languages
Query Processing
Relevance Feedback and Query Expansion
Automatic Local and Global Analysis
Measuring Effectiveness and Efficiency
Inverted Files
Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in order to
speed up the searching task.
Structure of inverted file:
Vocabulary: is the set of all distinct words in the text
Occurrences: lists containing all information
necessary for each word of the vocabulary (text
position, frequency, documents where the word
appears, etc.)
Index Components and Index Life Cycle
The two principal components of an inverted
index are
the dictionary and the postings lists. For each
term in the text collection, there is a postings
list that contains information about the term's
occurrences in the collection.
Two distinct phases: static inverted
index
The life cycle of a static inverted index, built for a
never-changing text collection, consists of
two distinct phases (for a dynamic index the two phases
coincide):
1. Index construction: The text collection is processed
sequentially, one token at a time,
and a postings list is built for each term in the
collection in an incremental fashion.
2. Query processing: The information stored in the
index that was built in phase 1 is used
to process search queries.
The Dictionary
The dictionary is the central data structure that is used to manage the set of terms
found in a text collection.
It provides a mapping from the set of index terms to the locations of their
postings lists. At query time, locating the query terms' postings lists in the index
is one of the first operations performed when processing an incoming keyword
query.
At indexing time, the dictionary's lookup capability allows the search engine to
quickly obtain the memory address of the inverted list for each incoming term
and to append a new posting at the end of that list.
Dictionary implementations found in search engines usually support the
following set of
operations:
1. Insert a new entry for term T.
2. Find and return the entry for term T (if present).
3. Find and return the entries for all terms that start with a given prefix P.
When building an index for a text collection, the search engine performs
operations of types 1 and 2 to look up incoming terms in the dictionary and to
add postings for these terms to the index.
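The three dictionary operations can be sketched in a few lines of Python. This is a minimal illustration, not the layout of any particular search engine: the class name `SortedDictionary` and its method names are hypothetical, and keeping the terms in a sorted list is just one simple way to make prefix lookups cheap.

```python
import bisect


class SortedDictionary:
    """Toy in-memory dictionary: a sorted term list plus a term -> postings map."""

    def __init__(self):
        self.terms = []      # terms in lexicographical order
        self.postings = {}   # term -> postings list

    def insert(self, term):
        """Operation 1: insert an entry for a term (no-op if already present)."""
        if term not in self.postings:
            bisect.insort(self.terms, term)
            self.postings[term] = []
        return self.postings[term]

    def find(self, term):
        """Operation 2: return the entry for a term, or None if absent."""
        return self.postings.get(term)

    def find_prefix(self, prefix):
        """Operation 3: return all terms that start with the given prefix."""
        start = bisect.bisect_left(self.terms, prefix)
        result = []
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            result.append(term)
        return result


d = SortedDictionary()
tokens = "the garden has many flowers the flowers".split()
for position, token in enumerate(tokens):
    d.insert(token).append(position)   # look up the term, then append a posting

print(d.find("flowers"))     # [4, 6]
print(d.find_prefix("f"))    # ['flowers']
```

The loop at the bottom mirrors index construction: every incoming token is looked up (and inserted if new), and a posting is appended to the end of its list.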
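The two most common ways to realize
an in-memory dictionary are a sort-based dictionary (terms kept in a sorted
array or search tree) and a hash-based dictionary (terms kept in a hash table).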
Example
Text (the number above each word is its position in the text):
1     6      12   16  18       25   29      36   40    45        54   58       66   70
That  house  has  a   garden.  The  garden  has  many  flowers.  The  flowers  are  beautiful
Inverted file
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
Inverted Files with TF-IDF
The prior example only supports Boolean queries.
For TF-IDF ranking, the index must also store the document
frequency and the term frequency.
Vocabulary entry    Posting file entry
k  d_k              doc_1 f_1k   doc_2 f_2k   ...
d_k : document frequency of term k
doc_i : i-th document that contains term k
f_ik : term frequency of term k in document i
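As a rough sketch of this layout, the following Python builds one vocabulary entry per term, holding the document frequency and a postings list of (document, term frequency) pairs. The function name `build_frequency_index` and the two-document collection are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter, defaultdict


def build_frequency_index(docs):
    """Map each term k to (d_k, [(doc_id, f_ik), ...])."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        # Counter gives the term frequency f_ik of every term in this document.
        for term, freq in Counter(text.lower().split()).items():
            postings[term].append((doc_id, freq))
    # The document frequency d_k is simply the length of the postings list.
    return {term: (len(plist), plist) for term, plist in postings.items()}


docs = {  # hypothetical two-document collection
    1: "the garden has many flowers",
    2: "the flowers are beautiful flowers",
}
index = build_frequency_index(docs)
print(index["flowers"])   # (2, [(1, 1), (2, 2)])
print(index["garden"])    # (1, [(1, 1)])
```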
Space Requirements
The space required for the vocabulary is rather
small. According to Heaps' law, the vocabulary grows
as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6
in practice
TREC-2: 1 GB text, 5 MB lexicon
On the other hand, the occurrences demand much
more space. Since each word appearing in the text is
referenced once in that structure, the extra space is
O(n)
To reduce space requirements, a technique called
block addressing is used
Block Addressing
The text is divided into blocks
The occurrences point to the blocks where
the word appears
Advantages:
the number of pointers is smaller than positions
all the occurrences of a word inside a single block are
collapsed to one reference
Disadvantages:
online search over the qualifying blocks if exact
positions are required
Example
Text:
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are
beautiful
Inverted file
Vocabulary   Occurrences (block numbers)
beautiful    4
flowers      3
garden       2
house        1
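Block addressing can be sketched as follows. The block boundaries are an assumption: the slide does not state them, but blocks of four words happen to give the same block numbers as the example. The helper names `build_block_index` and `exact_positions` are hypothetical.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers (1-based) containing it."""
    index = {}
    for i, word in enumerate(words):
        block = i // block_size + 1
        blocks = index.setdefault(word, [])
        # All occurrences inside a single block collapse to one reference.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index


def exact_positions(words, index, word, block_size):
    """Recover exact word positions by scanning only the qualifying blocks."""
    positions = []
    for block in index.get(word, []):
        start = (block - 1) * block_size
        for i in range(start, min(start + block_size, len(words))):
            if words[i] == word:
                positions.append(i)
    return positions


text = ("that house has a garden the garden has many "
        "flowers the flowers are beautiful")
words = text.split()
index = build_block_index(words, block_size=4)

print(index["garden"])                            # [2]
print(index["flowers"])                           # [3]
print(exact_positions(words, index, "garden", 4)) # [4, 6]
```

The second function shows the disadvantage listed above: the index only narrows the search to a block, so exact positions require a sequential scan of the text inside it.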
Block Effect on Inverted File Size
How big are inverted files?
Index size in relation to the original collection size
Index                   Small collection   Medium collection   Large collection
                        (1 MB)             (200 MB)            (2 GB)
Addressing words        45%   73%          36%   64%           35%   63%
Addressing 256 blocks   27%   41%          18%   32%           5%    9%
Addressing 64K blocks   18%   25%          1.7%  2.4%          0.5%  0.7%
In each pair, the left figure removes stopwords while the right
figure indexes them
Blocks require the text to be available in order to locate
terms within blocks.
Searching
The search algorithm on an inverted
index follows three steps:
1. Vocabulary search: the words present in
the query are located in the vocabulary
2. Retrieval of occurrences: the lists of the
occurrences of all query words found are
retrieved
3. Manipulation of occurrences: the
occurrences are processed to solve the
query
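The three steps can be sketched for a simple conjunctive query. This is only an illustration under stated assumptions: the index is a plain Python dict from word to a sorted list of document numbers, the query is treated as an AND of its words, and the function name `search` is hypothetical.

```python
def search(index, query_words):
    """Answer an AND query over an index mapping word -> sorted document list."""
    if not query_words:
        return []
    # Step 1: vocabulary search. Locate every query word in the vocabulary.
    if any(word not in index for word in query_words):
        return []   # a missing word means no document can match all words
    # Step 2: retrieval of occurrences. Fetch the list of each query word.
    lists = [index[word] for word in query_words]
    # Step 3: manipulation of occurrences. Intersect the lists to solve the query.
    result = set(lists[0])
    for occurrences in lists[1:]:
        result &= set(occurrences)
    return sorted(result)


index = {  # hypothetical document-level index
    "garden": [1, 2],
    "flowers": [2, 3],
    "house": [1],
}
print(search(index, ["garden", "flowers"]))   # [2]
print(search(index, ["garden", "tree"]))      # []
```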
Searching
Searching an inverted file starts with the vocabulary, so it helps to
store the vocabulary in a separate file
Structures used to store the vocabulary include
Hashing: O(1) lookup, does not support range
queries
Tries: O(c) lookup, c = length(word)
B-trees: O(log v) lookup, v = vocabulary size
An alternative is simply storing the words in
lexicographical order
cheaper in space and very competitive, with O(log v)
lookup cost by binary search
Vocabulary Construction
The whole vocabulary is kept in a suitable data
structure that stores, for each word, a list of its
occurrences
Each word of each text in the corpus is read
and searched for in the vocabulary
If it is not found, it is added to the vocabulary
with an empty list of occurrences
The new position is then added to the end of the list
of occurrences for the word
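The construction loop described above might look like this in Python. It is a minimal sketch: a dict stands in for the "suitable data structure", an occurrence is recorded as a (text number, word position) pair, and the name `build_index` is hypothetical.

```python
def build_index(texts):
    """Build word -> list of (text number, word position) occurrences."""
    vocabulary = {}
    for text_id, text in enumerate(texts):
        for position, word in enumerate(text.lower().split()):
            # A word not yet in the vocabulary is added with an empty list.
            if word not in vocabulary:
                vocabulary[word] = []
            # The new position goes at the end of the word's occurrence list.
            vocabulary[word].append((text_id, position))
    return vocabulary


texts = ["That house has a garden", "The garden has many flowers"]
vocabulary = build_index(texts)
print(vocabulary["garden"])   # [(0, 4), (1, 1)]
print(vocabulary["has"])      # [(0, 2), (1, 2)]
```

Because the texts are read in order, every occurrence list comes out sorted without any extra sorting step.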
Index File Construction
Once the text is exhausted the vocabulary is
written to disk with the list of occurrences.
Two files are created:
in the first file, each list of word occurrences is stored
contiguously
in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer to its
list in the first file is also included. This allows the
vocabulary to be kept in memory at search time
The overall process is O(n) worst-case time
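The two-file layout can be illustrated with in-memory buffers. This is a sketch under assumptions that are not in the slides: occurrences are stored as 4-byte little-endian unsigned integers, the "pointer" is a byte offset plus a count, and the names `write_index` and `read_occurrences` are hypothetical.

```python
import io
import struct


def write_index(vocabulary):
    """Return (postings bytes, lexicon) for a word -> occurrence list mapping.

    The lexicon is a list of (word, byte offset, count) in lexicographical order.
    """
    postings_file = io.BytesIO()          # first file: contiguous occurrence lists
    lexicon = []                          # second file: sorted words plus pointers
    for word in sorted(vocabulary):
        occurrences = vocabulary[word]
        offset = postings_file.tell()
        postings_file.write(struct.pack(f"<{len(occurrences)}I", *occurrences))
        lexicon.append((word, offset, len(occurrences)))
    return postings_file.getvalue(), lexicon


def read_occurrences(postings, offset, count):
    """Follow a lexicon pointer into the postings bytes."""
    return list(struct.unpack_from(f"<{count}I", postings, offset))


vocabulary = {"flowers": [45, 58], "beautiful": [70], "garden": [18, 29], "house": [6]}
postings, lexicon = write_index(vocabulary)
print(lexicon[1])                          # ('flowers', 4, 2)
print(read_occurrences(postings, 4, 2))    # [45, 58]
```

Only the small lexicon needs to stay in memory at search time; each occurrence list is read from the first file on demand.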
Faster Large Index Construction
An option is to use the previous algorithm until the
main memory is exhausted. When no more
memory is available, the partial index Ii obtained
up to now is written to disk and erased from main
memory before continuing with the rest of the text
Once the text is exhausted, a number of partial
indices Ii exist on disk
The partial indices are merged to obtain the final
index
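The merge step can be sketched with partial indices held as Python dicts. This is an illustration only: real partial indices live on disk and are merged by streaming, and the names `merge_two` and `merge_all` are hypothetical. The sketch assumes the partial indices are given in text order, so the occurrences of an earlier index always precede those of a later one.

```python
def merge_two(index_a, index_b):
    """Merge two partial indices (word -> sorted occurrence list)."""
    merged = {}
    for word in sorted(set(index_a) | set(index_b)):
        # index_a covers an earlier part of the text, so its occurrences come first.
        merged[word] = index_a.get(word, []) + index_b.get(word, [])
    return merged


def merge_all(partials):
    """Merge partial indices pairwise, one level at a time."""
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_two(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            next_level.append(level[-1])   # odd index out moves up unmerged
        level = next_level
    return level[0] if level else {}


partials = [  # hypothetical partial indices dumped in text order
    {"garden": [1], "house": [2]},
    {"garden": [5]},
    {"flowers": [10], "house": [9]},
]
final_index = merge_all(partials)
print(final_index)
# {'garden': [1, 5], 'house': [2, 9], 'flowers': [10]}
```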
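Example
Level 3:        I 1...8 (final index), produced by merge 7
Level 2:        I 1...4, I 5...8, produced by merges 3 and 6
Level 1:        I 1...2, I 3...4, I 5...6, I 7...8, produced by merges 1, 2, 4 and 5
Initial dumps:  I1 I2 I3 I4 I5 I6 I7 I8
The numbers 1 to 7 show a possible order of the merge operations.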
Large Index Construction Time
The total time to generate the partial indices is O(n)
The number of partial indices is O(n/M), where M is the size of main memory
Merging the O(n/M) partial indices requires
log2(n/M) merging levels, each taking O(n) time
The total cost of this algorithm is O(n log(n/M))
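A quick arithmetic check of these figures, with made-up sizes: assuming a text of n = 1024 units and room for M = 128 units in memory, there are 8 partial indices and 3 merging levels, which matches a binary merge tree over eight dumps. The function name `merge_levels` is hypothetical.

```python
import math


def merge_levels(n, m):
    """Return (number of partial indices, number of binary merging levels)."""
    partial_indices = math.ceil(n / m)
    levels = math.ceil(math.log2(partial_indices)) if partial_indices > 1 else 0
    return partial_indices, levels


print(merge_levels(n=1024, m=128))   # (8, 3)
print(merge_levels(n=100, m=128))    # (1, 0): everything fits in memory
```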

Conclusion
Inverted files are used to index text
The indices are appropriate when the
text collection is large and semi-static
If the text collection is volatile, online
searching is the only option
Some techniques combine online and
indexed searching
Index Construction and Index
Compression
Sort-Based Index Construction
Merge-Based Index Construction
In-Memory Index Construction
Pattern Matching
Data retrieval
A pattern is a set of syntactic features that must
occur in a text segment
Types
Words
Prefixes
e.g. comput -> computer, computation, computing, etc.
Suffixes
e.g. ters -> computers, testers, painters, etc.
Substrings
e.g. tal -> coastal, talk, metallic, etc.
Ranges
e.g. between held and hold -> hoax and hissing
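The pattern types listed above can be tried out against a small vocabulary. This is a minimal sketch using the words from the examples; the function names are hypothetical, and the range is taken as exclusive of both end words, which is an assumption.

```python
import bisect


def prefix_matches(vocab, prefix):
    return [w for w in vocab if w.startswith(prefix)]


def suffix_matches(vocab, suffix):
    return [w for w in vocab if w.endswith(suffix)]


def substring_matches(vocab, substring):
    return [w for w in vocab if substring in w]


def range_matches(sorted_vocab, low, high):
    """Words lying strictly between low and high in lexicographical order."""
    start = bisect.bisect_right(sorted_vocab, low)
    end = bisect.bisect_left(sorted_vocab, high)
    return sorted_vocab[start:end]


vocab = sorted([
    "computer", "computation", "computing", "computers", "testers",
    "painters", "coastal", "talk", "metallic", "held", "hissing", "hoax", "hold",
])

print(prefix_matches(vocab, "comput"))     # ['computation', 'computer', 'computers', 'computing']
print(suffix_matches(vocab, "ters"))       # ['computers', 'painters', 'testers']
print(substring_matches(vocab, "tal"))     # ['coastal', 'metallic', 'talk']
print(range_matches(vocab, "held", "hold"))  # ['hissing', 'hoax']
```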
Proximal Nodes
This model tries to find a good compromise
between expressiveness and efficiency.
It does not define a specific language, but a
model in which it is shown that a number of
useful operators can be included achieving
good efficiency.
Tree Matching
The leaves of the query can be not only
structural elements but also text patterns,
meaning that the ancestor of the leaf must
contain that pattern.
Query Languages-Outline
Keyword-Based Querying
Pattern Matching
Structural Queries
Query Protocols
Trends and Research Issues
Keyword-Based Querying
A query is the formulation of a user information
need
Keyword-based queries are popular
Query types, ranging from data retrieval (1) towards information retrieval (4):
1. Single-Word Queries
2. Context Queries
3. Boolean Queries
4. Natural Language
Single-Word Queries
A query is formulated by a word
A document is formulated by long sequences
of words
A word is a sequence of letters surrounded by
separators
What are letters and separators? e.g. is the hyphen in "on-line" a letter or a separator?
The division of the text into words is not
arbitrary
Context Queries
Definition
- Search for words in a given context
Types
Phrase
>a sequence of single-word queries
>e.g. "enhance retrieval"
Proximity
>a sequence of single words or phrases, together with a maximum allowed distance
between them
>e.g. within distance (enhance, retrieval, 4) will match "enhance the
power of retrieval"
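Both context query types can be sketched over a positional index of a single text. The assumptions here are mine: distance is measured in words, the proximity check only looks for the second word after the first, and the names `positions`, `phrase_match` and `within_distance` are hypothetical.

```python
def positions(words):
    """Build word -> list of word positions for one text."""
    index = {}
    for i, word in enumerate(words):
        index.setdefault(word, []).append(i)
    return index


def phrase_match(index, phrase):
    """True if the words of a non-empty phrase occur consecutively."""
    first, *rest = phrase
    return any(
        all(start + k + 1 in index.get(word, []) for k, word in enumerate(rest))
        for start in index.get(first, [])
    )


def within_distance(index, a, b, max_distance):
    """True if b occurs at most max_distance words after a."""
    return any(
        0 < pos_b - pos_a <= max_distance
        for pos_a in index.get(a, [])
        for pos_b in index.get(b, [])
    )


index = positions("enhance the power of retrieval".split())

print(phrase_match(index, ["enhance", "retrieval"]))        # False
print(phrase_match(index, ["power", "of", "retrieval"]))    # True
print(within_distance(index, "enhance", "retrieval", 4))    # True
print(within_distance(index, "enhance", "retrieval", 3))    # False
```

The phrase "enhance retrieval" does not match this text because the two words are not adjacent, while the proximity query with distance 4 does.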
Boolean Queries
Definition
A syntax composed of atoms that retrieve
documents, and of Boolean operators which work
on their operands
e.g. translation AND (syntax OR syntactic)
Fuzzy Boolean
Retrieve documents in which some of the operands appear (AND may require more of the
operands to appear than OR)
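A Boolean query can be evaluated as a small syntax tree over sets of documents. In this sketch the atoms are words, the postings are invented, and the example query is read as translation AND (syntax OR syntactic); that grouping, like the function name `evaluate`, is an assumption.

```python
def evaluate(node, index):
    """Evaluate a query tree: a word (atom) or a tuple (operator, left, right)."""
    if isinstance(node, str):
        return index.get(node, set())      # an atom retrieves a set of documents
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "AND":
        return left_docs & right_docs
    if operator == "OR":
        return left_docs | right_docs
    raise ValueError(f"unknown operator: {operator}")


index = {  # hypothetical postings: word -> set of document numbers
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(query, index)))   # [2, 4]
```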
Natural Language
Generalization of fuzzy Boolean
A query is an enumeration of words and
context queries
All the documents matching a portion of the
user query are retrieved
Query Processing: Introduction
It is difficult to formulate queries which are well designed for
retrieval purposes.
The initial query formulation can be improved through query
expansion and term reweighting.
Approaches based on:
feedback information from the user
information derived from the set of documents initially
retrieved (called the local set of documents)
global information derived from the document collection
User Relevance Feedback
User is presented with a list of the retrieved
documents and, after examining them, marks
those which are relevant.
Two basic operations:
Query expansion: addition of new terms from
relevant documents
Term reweighting: modification of term weights
based on the user relevance judgement
User Relevance Feedback
The usage of user relevance feedback to:
expand queries with the vector model
reweight query terms with the probabilistic model
reweight query terms with a variant of the
probabilistic model
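The slides do not give a formula for the vector model case. One well-known formulation is Rocchio's, which moves the query vector towards the marked relevant documents and away from the non-relevant ones. The sketch below assumes that formulation; the function name `rocchio`, the default weights and the tiny vectors are illustrative, not values from the source.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (term -> weight) from user feedback.

    Vectors are dicts from term to weight. Terms whose new weight is not
    positive are dropped, which is a common but optional convention.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            weight -= gamma * sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
        if weight > 0:
            new_query[term] = weight
    return new_query


query = {"garden": 1.0}
relevant = [{"garden": 1.0, "flowers": 1.0}]   # marked relevant by the user
non_relevant = [{"house": 1.0}]                # retrieved but not marked
new_query = rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25)
print(new_query["garden"], new_query["flowers"])   # 1.5 0.5
```

The result shows both basic operations at once: "flowers" is a new term added by query expansion, and "garden" has been reweighted.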
Automatic Local Analysis
Clustering: the grouping of documents which satisfy a set of
common properties.
The aim is to obtain a description for a larger cluster of
relevant documents automatically:
identify terms which are related to the query terms, such as:
Synonyms
Stemming variations
Terms with a distance of at most k words from a query
term
Automatic Local Analysis (contd)
The local strategy is that the documents
retrieved for a given query q are examined at
query time to determine terms for query
expansion.
Two basic types of local strategy:
Local clustering
Local context analysis
Local strategies suit the environment of
intranets, but not web documents.
Measuring Search Effectiveness
