
Introduction to Data Structure, IR

1
Total IR System

Figure: the total IR system. Item input flows through item normalization into document file creation. Items feed the Selective Dissemination of Information (Mail) process, which uses private profiles and private mail files, and the indexing process, which builds private and public index files. Automatic File Build (AFB) uses AFB profiles to generate candidate index records for public indexing.

2
Major Data Structures

Figure: the two major data structures. Item input passes through item normalization and document file creation into the Document Manager, which holds the original document file, and through data structure processing of tokens into the Document Search Manager, which holds the searchable file used for indexing and search.

3
Introduction to Data Structure
 Two aspects of data structures from the IRS perspective
 Ability to represent concepts and their relationships
 Its support for locating those concepts
 Two major data structures
 Document manager: stores and manages the received items in their normalized form
 Document search manager: contains the processing tokens and associated data needed to support search
 The results of a search are references to items, which are passed to the document manager for retrieval
 The data structures that support the search function are dealt with here.

4
Outline
 Introduction to Data Structure in IR
 Stemming
 Porter Stemming Algorithm
 Dictionary Look-up Stemmers
 Successor Stemmers
 Major Data Structures
 Inverted File Structures
 N-Gram Data Structures
 PAT Data Structures
 Signature File Structure
 Hypertext Data Structures

5
Introduction to Data Structure
 Before placing data in the searchable data structure, the transformation of data called
stemming is applied.
 Conflation is the term used to refer to mapping multiple morphological variants to a
single representation called stem/root.
 Reduce tokens to “root” form of words to recognize morphological variation.
“computer”, “computational”, “computation” all reduced to same token “compute”
 Correct morphological analysis is language specific and can be complex.
 Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative
fashion.
 Stemming provides compression, saving storage and processing.
 Stemming improves recall.
 The stemming process has to categorize a word before deciding whether to stem it.
 Proper names and acronyms should not be stemmed, as they are not related to a common core concept.
 The stemming process causes loss of information.
 Tense information is lost; for example, once the concept "economic support" is indexed it can no longer be determined whether it occurred in the past or will occur in the future.

6
The Porter algorithm
 The Porter algorithm consists of a set of condition/action rules.
 The conditions fall into three classes
 Conditions on the stem
 Conditions on the suffix
 Conditions on rules
Conditions on the stem
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences:

   [C](VC)^m [V]

   Measure   Examples
   m = 0     TR, EE, TREE, Y, BY
   m = 1     TROUBLE, OATS, TREES, IVY
   m = 2     TROUBLES, PRIVATE, OATEN

2. *<X> : the stem ends with a given letter X

3. *v* : the stem contains a vowel
7
Conditions on the stem (con’t)
4. *d : the stem ends in a double consonant
5. *o : the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y

Suffix conditions take the form: (current_suffix == pattern)

Conditions on rules
The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply
{ step1a(word);
step1b(stem);
if (the second or third rule of step 1b was used)
step1b1(stem);
step1c(stem);
step2(stem);
step3(stem);
step4(stem);
step5a(stem);
step5b(stem);
}
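To make the measure condition concrete, here is a minimal sketch (not the full Porter algorithm) of how m could be computed for a stem; the treatment of 'y' is simplified and the function name is illustrative.

import re

def measure(stem: str) -> int:
    # Classify each letter as vowel 'v' or consonant 'c'; treat 'y' as a vowel
    # when it follows a consonant (a simplification of Porter's rule).
    classes = []
    for i, ch in enumerate(stem.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and classes[i - 1] == "c"):
            classes.append("v")
        else:
            classes.append("c")
    # Collapse runs of identical classes, then count VC transitions: [C](VC)^m[V].
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(classes))
    return collapsed.count("vc")

for word in ["tree", "trouble", "oats", "troubles", "private"]:
    print(word, measure(word))
# tree -> 0, trouble -> 1, oats -> 1, troubles -> 2, private -> 2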
8
Table / Dictionary Look up
 Store a table of all index terms and their stems.
 The original term or stemmed version of the term is looked up in a dictionary and
replaced by the stem that best represents it.
 Implemented in the INQUERY and RetrievalWare systems
 KSTEM, a table look-up algorithm implemented in INQUERY, uses the following six data
files
 Dictionary of words (lexicon)
 Supplemental list of words for the dictionary
 Exceptions list for those words that should retain an “e” at the end
 (e.g., “suites” to “suite” but “suited” to “suit”)
 Direct_Conflation - allows definition of direct conflation via
 word pairs that override the stemming algorithm
 Country_Nationality - conflations between nationalities and
 countries (“British” maps to “Britain”)
 Proper Nouns - a list of proper nouns that should not be stemmed.
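A minimal sketch of the table look-up idea; the tiny word lists below are illustrative stand-ins for the KSTEM data files, whose real contents and formats are not shown in the slides.

dictionary = {"suite", "suit", "memory", "computer"}          # lexicon (plus supplemental list)
direct_conflation = {"was": "be", "children": "child"}        # word pairs that override the algorithm
country_nationality = {"british": "britain", "dutch": "holland"}
proper_nouns = {"Porter", "Britain"}                          # never stemmed

def lookup_stem(word: str) -> str:
    if word in proper_nouns:                                  # proper nouns are not stemmed
        return word
    w = word.lower()
    if w in direct_conflation:
        return direct_conflation[w]
    if w in country_nationality:                              # nationality -> country
        return country_nationality[w]
    if w in dictionary:                                       # already a dictionary headword
        return w
    # Simple suffix stripping, accepting a candidate only if it is in the dictionary.
    for suffix, repl in (("es", "e"), ("ed", ""), ("s", "")):
        if w.endswith(suffix) and w[: -len(suffix)] + repl in dictionary:
            return w[: -len(suffix)] + repl
    return w

print(lookup_stem("suites"))   # suite (retains the trailing "e")
print(lookup_stem("suited"))   # suit
print(lookup_stem("British"))  # britain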

15
Successor Stemmer
 Successor stemmers are based on the length of prefixes that optimally stem
expansions of additional suffixes
 The algorithm determines word and morpheme boundaries based on the distribution
of phonemes that distinguish one word from another
 The process determines the successor variety for a word, uses this information
to divide the word into segments and selects one of the segments as the stem
 The successor variety of a segment of a word, in a set of words, is the number of
distinct letters that occupy the segment length plus one character position
 Ex: the successor variety of the first 3 letters of a 5-letter word is the number of
words that have the same first 3 letters but a different 4th letter, plus one
 The successor variety of any prefix of a word is the number of children associated
with the node in the symbol tree representing that prefix

16
Successor Stemmer
 Determine word and morpheme boundaries based on the distribution of
phonemes in a large body of utterances.
 The successor variety of a string is the number of different characters that follow
it in words in some body of text.
 The successor variety of substrings of a term will decrease as more characters are
added until a segment boundary is reached
 Test word: READABLE
 Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, RIPE, ROPE

   Prefix      Successor Variety   Letters
   R           3                   E, I, O
   RE          2                   A, D
   REA         1                   D
   READ        3                   A, I, S
   READA       1                   B
   READAB      1                   L
   READABL     1                   E
   READABLE    1                   (blank)
17
Figure (symbol tree, not reproduced): the successor variety for the first letter "b" is three; the successor variety for the prefix "ba" is two.
18
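A minimal sketch of computing successor varieties against the corpus listed above; counts may differ by one from the slide's table depending on whether the end-of-word blank is counted.

corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ",
          "READABLE", "READING", "READS", "RED", "RIPE", "ROPE"]

def successor_variety(prefix: str, words) -> int:
    # Number of distinct characters that follow `prefix` in the word list;
    # a blank marks words that end exactly at the prefix.
    successors = set()
    for w in words:
        if w.startswith(prefix):
            successors.add(w[len(prefix)] if len(w) > len(prefix) else " ")
    return len(successors)

word = "READABLE"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, corpus))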
Successor Stemmer (Cont..)
 cutoff method
 some cutoff value is selected and a boundary is identified whenever the cutoff value is
reached
 peak and plateau method
 segment break is made after a character whose successor variety exceeds that of the
characters immediately preceding and following it
 complete method
 Break on boundaries of complete word
 entropy method
 Let|D | be the no. of words beginning with the i length sequence of letters a
ai

 Let |Da ij |be the no. of words in Dai with the successor j
|Daij|
 The probability that a member Daihas the successor j is given by |Dai|
The entropy of|Dai |is
26
|Daij| |Daij|
Hai  -
|Dai|
 log2
|Dai|
j 1
19
Successor Stemmer (Cont..)
 Two criteria used to evaluate various segmentation methods
1. the number of correct segment cuts divided by the total number of cuts
2. the number of correct segment cuts divided by the total number of true
boundaries
 After segmenting, if the first segment occurs in more than 12 words in the
corpus, it is probably a prefix.
 The successor variety stemming process has three parts
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem

20
Affix Removal Stemmers
 Affix removal algorithms remove suffixes and/or prefixes from terms leaving
a stem
 If a word ends in "ies" but not "eies" or "aies" (Harman 1991)
   Then "ies" -> "y"
 If a word ends in "es" but not "aes", "ees" or "oes"
   Then "es" -> "e"
 If a word ends in "s" but not "us" or "ss"
   Then "s" -> "NULL"

21
Stemming Studies : Conclusion
 The majority of stemming's effects on retrieval performance have been
positive
 Stemming is as effective as manual conflation
 The effect of stemming is dependent on the nature of vocabulary used
 There appears to be little difference between the retrieval effectiveness of
different full stemmers
Related Data Structures for Processing Token (PT) Searchable Files
 Inverted file system
 Minimize secondary storage access when multiple search terms are applied across
the total database
 N-gram
 Breaks processing tokens into smaller string units and uses the token fragments for
search
 Improve efficiencies and conceptual manipulation over full word inversion
 PAT Trees and Arrays
 View the text of an item as a single long stream versus a juxtaposition of words
 Signature file
 Fast elimination of non-relevant items reducing the searchable items into a
manageable subset
 Hypertext
 Manually or automatically create embedded links within one item to a related item

25
Inverted File Structure
 Commonly used in DBMS and IR
 For each word, a list of the documents in which the word is found is stored
 Composed of three basic files
 Document file
 Inversion lists: contain the document identifiers
 Dictionary: lists all the unique words and other information used in query optimization (e.g. length
of inversion lists)
 The inversion list contains the document identifier for each document in which the word is found.
 To support proximity, contiguous word phrases and term weighting, all occurrences of a word
are stored in the inversion list along with their word positions.
 For systems that support ranking, the list is re-organized into rank order.

Doc #1: computer, bit, byte
Doc #2: memory, byte
Doc #3: computer, bit, memory
Doc #4: byte, computer

   Dictionary      Inversion Lists (Posting File)
   bit (2)         1, 3
   byte (3)        1, 2, 4
   computer (3)    1, 3, 4
   memory (2)      2, 3
26
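A minimal sketch of building the dictionary and inversion lists for the four example documents above; word positions are stored with each posting so that proximity and phrase searches can be supported.

from collections import defaultdict

docs = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}

inversion_lists = defaultdict(list)              # word -> list of (doc_id, position)
for doc_id, words in docs.items():
    for pos, word in enumerate(words):
        inversion_lists[word].append((doc_id, pos))

# The dictionary holds each unique word plus the length of its inversion list.
dictionary = {w: len(postings) for w, postings in inversion_lists.items()}

print(dictionary["byte"])            # 3
print(inversion_lists["computer"])   # [(1, 0), (3, 0), (4, 1)]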
Dictionary
• It is a stored list of all unique words in the system and a pointer to the location
of its inversion list
• In case of zoning the dictionary may be partitioned by zone.
• There could be a dictionary & set of inversion lists for “the Abstract “ Zone.
• And another a dictionary & set of inversion lists for “the Main Body “ Zone.
• Words with Special Characteristics are frequently stored in their own
dictionaries for optimal representation and manipulation (ex. Dates)
• If the inversion list contains only one or two entries, these can be stored as a
part of Dictionary

27
Inverted File Structure
B-tree of Inversion Lists
• A root node with between 2 and 2m keys
• All other internal nodes have between m and 2m keys
• All keys are kept in order, from smaller to larger
• All leaves are at the same level, or differ by at most one level

Figure: a B-tree whose root holds the keys B and M; its children cover the ranges A to B, C to L and M to Z, and the leaves hold the inversion lists Bit - 1,3; Byte - 1,2,4; Computer - 1,3,4; Memory - 2,3.

28
Inverted File Structure (Cont.)
 Additional information, such as term frequency and term position, can be stored in the posting file.
 Each document is represented by a set of weighted keywords (terms):
   D1 -> {(t1, w1), (t2, w2), ...}
   e.g. D1 -> {(comput, 0.2), (architect, 0.3), ...}
        D2 -> {(comput, 0.1), (network, 0.5), ...}
 Inverted file:
   comput -> {(D1, 0.2), (D2, 0.1), ...}
 The inverted file is used during retrieval for higher efficiency.
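A minimal sketch of using the weighted inversion lists above to produce a ranked result list; the simple additive scoring is an illustrative assumption, not a ranking formula taken from the slides.

inverted = {
    "comput":    [("D1", 0.2), ("D2", 0.1)],
    "architect": [("D1", 0.3)],
    "network":   [("D2", 0.5)],
}

def rank(query_terms):
    scores = {}
    for term in query_terms:
        for doc, weight in inverted.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + weight   # accumulate term weights
    # Re-organize the result into rank order, as systems that support ranking do.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(rank(["comput", "network"]))   # [('D2', 0.6), ('D1', 0.2)]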
• An IRS freezes the document file and its associated inversion lists once a maximum size is
reached, and starts a new structure.
• The frozen document file, dictionary and inversion lists are archived and remain available for user
queries.
• This allows the latest database to be searched by queries interested in more recent
information.
• Inverted file structures provide optimum performance for large databases.
• Inverted file structures are well suited to storing concepts and their relationships.
• Each inversion list represents a concept.
• Finer resolution of concepts can be achieved by storing locations within an item, and the weight of the item,
in the inversion list.
Location of concepts is made easy by their listing in the dictionary and inversion lists.
29
N-Gram
• N-gram can be viewed as a special technique for conflation
• N-gram are a fixed length consecutive series of n characters
• N-grams do not capture semantics; they are based on a fixed number of characters
• The searchable data structure is transformed into overlapping n-grams, which are
used to create the searchable database.
• # represents the interword symbol which can be(blank, period, semicolon,
colon etc.)
se ea co ol lo on ny Bigrams (no interword symbols)

sea col olo lon ony Trigrams(no interword symbols)

#se sea ea# #co col olo lon ony ny# Trigrams(with interword symbol #)

#sea# #colo colon olony lony# Pentagrams(with interword symbol #)

Figure :-Bigrams, Trigrams and Pentagrams for “sea colony”
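A minimal sketch reproducing the figure: n-grams are generated word by word, padding each word with the interword symbol when one is supplied.

def ngrams(text: str, n: int, interword: str = "") -> list[str]:
    grams = []
    for word in text.split():
        padded = interword + word + interword       # e.g. "#sea#" when interword = "#"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(ngrams("sea colony", 2))        # ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(ngrams("sea colony", 3, "#"))   # ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
print(ngrams("sea colony", 5, "#"))   # ['#sea#', '#colo', 'colon', 'olony', 'lony#']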

30
N-Gram contd..
• N-grams produce word fragments rather than semantically meaningful word stems
• Mapping longer words into shorter n-gram fragments seems more appropriate
• N gram are used in spelling error detection and correction
• Most approaches look at the statistics on probability of occurrence of n-grams (trigrams in
most approaches) in the English vocabulary and indicate any word that contains non-existent
to seldom used ngrams as a potential erroneous word.
• Damerau specified 4 categories of errors
Error Category Example
Single Character Insertion compuuter
Single Character Deletion compter
Single Character Substitution compiter
Transposition of two adjacent characters comptuer
Figure :- Categories of Spelling Errors
• Frequency of occurrence of N gram patterns can be used for identifying the
language of an item
• In IR trigrams are used for text compression and to manipulate the length of
index terms
31
N-Gram contd ..

• A word string consisting of m letters yields m-1 bigrams, m-2 trigrams and, in general,
m-(n-1) n-grams (ignoring interword padding).
• The theoretical number of possible n-grams is very high: for an alphabet of 26 letters
there are 26² = 676 possible bigrams and 26³ = 17,576 possible trigrams.
• However, in English only about 64% of these bigrams and 16% of all trigrams
are actually in use.
N-Gram contd ..
•The data structure consists of fixed length overlapping symbol segments that define
the searchable processing tokens.

•These tokens have logical linkages to all the items in which the tokens are found.
Inversion lists, document vectors and other proprietary data structures are used to
store the linkage data structure and are used in the search process.
• In some cases just the least frequently occurring n-gram is kept as part of a first-pass
search process.

• Yochum, D'Amore and Fatah Comlekoglu studied n-gram data structures using an inverted file
structure for n = 2 to n = 26 and determined that trigrams are the optimal length.
• N-grams place a finite limit on the number of searchable tokens.
• The maximum number of unique n-grams that can be generated, MaxSeg, can be calculated as a
function of n, the length of the n-grams, and λ, the number of processable symbols from the
alphabet (i.e., non-interword symbols):

   MaxSeg_n = λ^n

33
Suffix Trees and Suffix Arrays

Modern Information Retrieval


by R. Baeza-Yates and B. Ribeiro-Neto
Addison-Wesley, 1999.
(Chapter 8)
Introduction
 Word-based indexing
 Inverted indices are good for word searches
 Queries such as phrases are expensive to solve using inverted files
 For word-based applications, inverted files perform better
 Suffix trees and suffix arrays
 support more complex queries, such as phrase and substring searches
Text Suffixes

This is a text. A text has many words. Words are made from letters.

text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
The Suffix Trie and Suffix Tree

1    11       19   28       33    40    46   50        60
This is a text. A text has many words. Words are made from letters.

Figure (tree structure not reproduced): the suffix trie and the more compact suffix tree built over the suffixes starting at the marked word positions; each leaf stores the starting position (11, 19, 28, 33, 40, 50, 60, ...) of the suffix it represents.
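Because the tree figure is only summarized above, here is a minimal sketch of the closely related suffix array for the same text, indexing only the suffixes that begin at word starts; the variable names are illustrative.

text = "This is a text. A text has many words. Words are made from letters."

# Starting positions of word-initial suffixes, sorted by the suffix they point to.
positions = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]
suffix_array = sorted(positions, key=lambda p: text[p:])

for p in suffix_array[:5]:
    print(p, repr(text[p:p + 15]))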
PAT Trees and PAT Arrays
Information Retrieval: Data Structures and Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ: Prentice Hall, 1992.
(Chapter 5)
Introduction

 Text searching methods may be classified as lexicographical indices (indices that are
sorted), clustering techniques, and indices based on hashing.
 Two new lexicographical indices for text, called PAT trees and PAT arrays, take an index
of size similar to or smaller than the text itself.
 Briefly, the traditional model of text used in information retrieval is that of a set of
documents.
 Each document is assigned a list of keywords (attributes), with optional relevance weights associated with each
keyword. This model is oriented to library applications.
 Problems with traditional IR
 A basic structure is assumed (documents and words). This may be reasonable for many applications, but
not for others.
 Keywords must be extracted from the text (this is called "indexing"). This task is not trivial and error
prone, whether it is done by a person, or by a computer.
 Queries are restricted to keywords. Because the number of keywords is variable, common database
techniques are not useful in this context.
 Approximate query search is not possible
PAT Trees and PAT Arrays

 PAT tree transforms the input stream into a searchable data structure consisting of
substrings
 PAT trees are used for searching text and images.
 PAT is short for PATRICIA (Practical Algorithm To Retrieve Information Coded In
Alphanumerics).
 Each position in the text corresponds to a semi-infinite string (sistring), the string that starts at
that position and extends arbitrarily far to the right, or to the end of the text.
 Advantages of this model
 No structure of the text is needed, although if there is one, it can be used.
 No keywords are used. The queries are based on prefixes of sistrings, that is, on any substring of the text.
 This model is simpler and does not restrict the query domain. Furthermore, almost any searching structure can
be used to support this view of text.
PAT tree

 Definition: Patricia Tree stores every semi-infinite string (sistring) of a


document
 Two things we have to know
 PATRICIA TREE
 SISTRING
PATRICIA TREE

 A particular type of “trie”


 Example: a trie and a PATRICIA tree built over the keys '010', '011' and '101'.

Figure (not reproduced): the trie branches on every bit position (levels 0-2). The PATRICIA tree collapses single-child paths: the root tests the first bit, its 1-branch leads directly to the leaf '101', and its 0-branch leads to a node testing the third bit, which separates the leaves '010' and '011'.

PATRICIA TREE

 Therefore, PATRICIA TREE will have the following attributes in its internal
nodes:
 Index bit (check bit)
 Child pointers (each node must contain exactly 2 children)
 On the other hand, leaf nodes must be storing actual content for final
comparison
Semi-infinite Strings
 In creating a PAT tree, each position in the input string is the anchor
point for a substring that starts at that point and extends to the end of the input.
 All such substrings are unique.
 A substring can start at any point in the text and can be uniquely
indexed by its starting location and length.
 A substring can be extended beyond the end of the input stream with additional null
characters; these substrings are called sistrings.
 Example
Text Once upon a time, in a far away land …
sistring 1 Once upon a time …
sistring 2 nce upon a time …
sistring 8 on a time, in a …
sistring 11 a time, in a far …
sistring 22 a far away land …
 Compare sistrings
22 < 11 < 2 < 8 < 1
PAT Tree
 PAT Tree
A Patricia tree constructed over all the possible sistrings of a text
 Patricia tree
 a binary digital tree where the individual bits of the keys are used to decide on the branching
 A zero bit will cause a branch to the left subtree
 A one bit will cause a branch to the right subtree
 each internal node indicates which bit of the query is used for branching
 absolute bit position
 a count of the number of bits to skip
 each external node points to a sistring
 the integer displacement to original text
Example

Text:       01100100010111 ...
sistring 1: 01100100010111 ...
sistring 2: 1100100010111 ...
sistring 3: 100100010111 ...
sistring 4: 00100010111 ...
sistring 5: 0100010111 ...
sistring 6: 100010111 ...
sistring 7: 00010111 ...
sistring 8: 0010111 ...

Figure (PAT tree diagrams not reproduced): the PAT tree is built by inserting sistrings 1 through 8 in turn. Each internal node stores a skip counter (the total displacement of the bit to be inspected) and pointers to its subtrees; each external node stores the integer displacement of a sistring into the text. The completed tree is then used to search for 00101.
Note: sistrings 3 and 6 need 4 bits before they can be distinguished.
Figures (not reproduced): the PAT binary tree for the input "100110001101", and the same tree with skipped bits.
Indexing Points

 The above example assumes every position in the text is indexed, i.e. n external nodes,
one for each indexed position in the text.
 For word and phrase searches, only the sistrings that begin at word boundaries are necessary.
 The number of sistrings to be included in the tree is application dependent.
 Trade-off between size of the index and search requirements
Prefix searching

 Idea : every subtree of the PAT tree has all the sistrings with a given prefix.
 Search: proportional to the query length; follow the tree until the prefix is
exhausted or an external node is reached.
 At this point we need to verify whether we could have skipped bits. This is done with a single
comparison of any of the sistrings in the subtree (considering an external node as a subtree of
size one). If this comparison is successful, then all the sistrings in the subtree (which share the
common prefix) are the answer; otherwise there are no sistrings in the answer.

Search for the prefix


“10100” and its answer
Proximity Searching

 Find all places where s1 is at most a fixed number of characters (given by the user)
away from s2.
   in (4) ation ==> insulation, international, information
 Algorithm
1. Search for s1 and s2.
2. Select the smaller answer set from these two sets and
sort by position.
3. Traverse the unsorted answer set, searching every
position in the sorted set and checking if the distance
between positions satisfying the proximity condition.

Sort + traverse time: m1 log m1 + m2 log m1 (assuming m1 < m2)


Range Searching
 Search for all the strings within a certain lexicographical range.
 Ex: for the range "abc" .. "acc":
 "abracadabra" and "acacia" are in the range
 "abacus" and "acrimonious" are not
 Algorithm
 Search each end of the defining intervals.
 Collect all the sub-trees between (and including) them.
Longest Repetition Searching

 the match between two different positions of a text where this match is the longest in
the entire text, e.g.,
Text: 01100100010111, with sistrings 1-8 as in the earlier example.

Figure (tree not reproduced): the tallest internal node of the PAT tree gives a pair of sistrings that match for the greatest number of characters.
"Most Frequent" Matching
 The most frequently occurring strings within the text database
 e.g., the most frequent trigram
 Find the most frequent trigram: find the largest subtree at a distance of 3 characters from the root

Figure (tree not reproduced): the largest such subtree corresponds to the most frequent prefix; in the example, characters 1, 2 and 3 are the same for sistrings 100100010111 and 100010111.
Building PAT Trees as Patricia Trees (1)
 Bucketing of external nodes
 collect more than one external node
 a bucket replaces any subtree with size less than a certain constraint (b), which
saves a significant number of internal nodes
 the external nodes inside a bucket do not have any structure associated with them, which
increases the number of comparisons for each search
Building PAT Trees as Patricia Trees (2)

 Mapping the tree onto the disk using super-nodes


 Advantage: save the number of disk access and space
 Every disk page has a single entry point, contains as much of the trees as possible, and
 terminates either in external nodes or in pointers to other disk pages
 The pointers in internal nodes will address either a disk page or another node inside the same page
 reduces the storage cost of internal nodes
 Example
 Assume a disk page contains on the order of 1,000 internal/external nodes
 on the average, each disk page contains about 10 steps of a root-to-leaf path
PAT Trees Represented as Arrays
 External node bucket size, b
 If we keep the external nodes in the bucket in the same relative order as they would be in the
tree
 Indirect binary search vs. sequential search

Figure: the PAT array 7 4 8 5 1 6 3 2 holds the external nodes of the example PAT tree in left-to-right (lexicographic) order, indexing into the text 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
Searching PAT Trees as Arrays
 Prefix searching and range searching are done by an indirect binary search over the array, with
the results of the comparisons being less than, equal to, or greater than.
 Example
Search for the prefix 100 and its answer
 Most frequent, Longest repetition
 Manber and Baeza-Yates (1991)

PAT array: 7 4 8 5 1 6 3 2
Text:      0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
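A minimal sketch of the indirect binary search described above, reconstructing the PAT array for the example text and searching for the prefix 100; materializing the sistrings as Python strings is a simplification, since a real implementation compares directly against the text.

import bisect

text = "01100100010111"
positions = range(1, 9)                                    # sistrings 1..8 (1-based)
pat_array = sorted(positions, key=lambda p: text[p - 1:])  # -> [7, 4, 8, 5, 1, 6, 3, 2]

def prefix_search(prefix: str) -> list[int]:
    # All sistring positions whose sistring begins with `prefix`.
    sistrings = [text[p - 1:] for p in pat_array]
    lo = bisect.bisect_left(sistrings, prefix)
    hi = bisect.bisect_right(sistrings, prefix + "\uffff")  # upper bound of the prefix range
    return pat_array[lo:hi]

print(pat_array)            # [7, 4, 8, 5, 1, 6, 3, 2]
print(prefix_search("100")) # [6, 3]: sistrings 6 and 3 start with "100"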
Comparisons
 Signature files
 Use hashing techniques to produce an index
 Advantage
 storage overhead is small (10%-20%)
 Disadvantages
 the search time on the index is linear
 some answers may not match the query, thus filtering must be done
Comparisons (Continued)
 Inverted files
 storage overhead (30% ~ 100%)
 search time for word searches is logarithmic
 PAT arrays
 potential use in other kind of searches
 phrases
 regular expression searching
 approximate string searching
 longest repetitions
 most frequent searching
Signature Files

Information Retrieval: Data Structures and Algorithms


Signature Files
 Characteristics
 Word-oriented index structures based on hashing
 Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over
the index
 Suitable for not very large texts
 Inverted files outperform signature files for most applications
Structure

 Use superimposed coding to create signature.


 Each text is divided into logical blocks.
 A block contains n distinct non-common words.
 Each word yields “word signature”.
 A word signature is a B-bit pattern, with m bits set to 1.
 Each word is divided into successive, overlapping triplets, e.g. free --> ▲fr, fre, ree, ee▲
 Each such triplet is hashed to a bit position.
 The word signatures are OR’ed to form block signature.
 Block signatures are concatenated to form the document signature.
Example
 Example (n=2, B=12, m=4)
word signature
free 001 000 110 010
text 000 010 101 001
block signature 001 010 111 011
 Search
 Use the hash functions to determine the m 1-bit positions of the search word.
 Examine each block signature: a block may contain the word only if it has a 1 in every bit
position where the search word's signature has a 1.
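A minimal sketch of superimposed coding as described above; the triplet hash is illustrative, so the resulting bit patterns will not match the slide's example values, but the OR/AND mechanics are the same.

B = 12   # bits per block signature (the slide's example uses B = 12, m = 4)

def word_signature(word: str) -> int:
    # Hash each overlapping triplet of the padded word to one of B bit positions.
    sig = 0
    padded = "\u25b2" + word + "\u25b2"              # the slide's ▲ interword padding
    for i in range(len(padded) - 2):
        sig |= 1 << (hash(padded[i:i + 3]) % B)
    return sig

def block_signature(words) -> int:
    # Superimpose (OR) the word signatures of one logical block.
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = block_signature(["free", "text"])            # one logical block, n = 2 words
query = word_signature("free")
print(query & block == query)                        # True: the block may contain "free"
print(format(block, "012b"))                         # the 12-bit block signature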
False Drop
 false alarm (false hit, or false drop) Fd:
   the probability that a block signature seems to qualify, given that the block does not actually qualify
   Fd = Prob{signature qualifies | block does not}
 For a given value of B, the value of m that minimizes the false drop probability is such
that each row of the matrix contains "1"s with probability 0.5:
   Fd = 2^(-m)
   m = B ln2 / n
Sequential Signature File (SSF)

Document signatures are stored sequentially, one after another.

Assume each document spans exactly one logical block.

The size of the document signature F = the size of the block signature B.
Classification of Signature-Based Methods

 Compression
If the signature matrix is deliberately sparse, it can be compressed.
 Vertical partitioning
Storing the signature matrix column-wise improves the response time at the expense of insertion time.
 Horizontal partitioning
Grouping similar signatures together and/or providing an index on the signature matrix may result in
better-than-linear search.
Classification of Signature-Based Methods
 Sequential storage of the signature matrix
 without compression
sequential signature files (SSF)
 with compression
bit-block compression (BC)
variable bit-block compression (VBC)
 Vertical partitioning
 without compression
bit-sliced signature files (BSSF, B’SSF)
frame sliced (FSSF)
generalized frame-sliced (GFSSF)
Classification of Signature-Based Methods
(Continued)

 with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)
 Horizontal partitioning
 data independent partitioning
Gustafson’s method
partitioned signature files
 data dependent partitioning
2-level signature files
S-trees
Criteria

 the storage overhead


 the response time on single word queries
 the performance on insertion, as well as whether the insertion maintains the
“append-only” property
Compression
 idea
 Create sparse document signatures on purpose.
 Compress them before storing them sequentially.
 Method
 Use B-bit vector, where B is large.
 Hash each word into one (or k) bit position(s).
 Use run-length encoding (McIlroy 1982).
Compression using run-length encoding

data 0000 0000 0000 0010 0000


base             0000 0001 0000 0000 0000
management       0000 1000 0000 0000 0000
system           0000 0000 0000 0000 1000
block signature  0000 1001 0000 0010 1000

The lengths of the zero runs between successive "1"s are L1, L2, L3, L4, L5, and the stored
signature is [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.

Search: decode the encoded lengths of all the preceding intervals.

Example: search "data"
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = 000000
Disadvantage: search becomes slow; the method ensures no false hits.
Bit-block Compression (BC)
Data Structure:
(1) The sparse vector is divided into groups of consecutive bits
( bit-blocks ).
(2) Each bit block is encoded individually.
Algorithm:
Part I. It is one bit long and indicates whether there are any "1"s in the bit-block (1)
or the bit-block is all zeros (0). In the latter case, the bit-block signature stops here.
   0000 1001 0000 0010 1000
   0    1    0    1    1
Part II. It indicates the number s of "1"s in the bit-block. It consists of s-1 "1"s and a
terminating zero.
   10 0 0
Part III. It contains the offsets of the "1"s from the beginning of the bit-block.
   00 11 10 00
   Note: within a 4-bit block, offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.
block signature: 01011 | 10 0 0 | 00 11 10 00
Bit-block Compression (BC)
(Continued)

Search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) check the 4th block of signature 01011 | 10 0 0 | 00 11 10 00
(4) OK, there is at least one setting in the 4th bit-block.
(5) Check furthermore. “0” tells us there is only one setting in
the 4th bit-clock. Is it the 3rd bit?
(6) Yes, “10” confirms the result.

Discussion:
(1) Bit-block compression requires less space than the Sequential
Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less
than that of the Sequential Signature File.
Vertical Partitioning
 idea
avoid bringing useless portions of the document signature in main memory
 methods
 store the signature file in a bit-sliced form or in a frame-sliced form
 store the signature matrix column-wise to improve the response time at the expense of insertion time
Bit-Sliced Signature Files (BSSF)

The signature matrix is transposed: instead of storing each document signature row-wise,
the matrix is stored as F bit-files, one per bit position, each holding that bit for every document.

search: (1) retrieve m bit-files.


e.g., the word signature of free is 001 000 110 010
the document contains “free”: 3rd, 7th, 8th, 11th bit are set
i.e., only 3rd, 7th, 8th, 11th files are examined.
(2) “and” these vectors. The 1s in the result N-bit vector
denote the qualifying logical blocks (documents).
(3) retrieve text file through pointer file.
insertion: require F disk accesses for a new logical block (document),
one for each bit-file, but no rewriting
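A minimal sketch of the bit-sliced search: the signature matrix is held column-wise, and only the slices corresponding to the query word's set bits are read and AND'ed; the document signatures below are illustrative values, not taken from the slides.

B = 12
doc_signatures = [0b001010111011, 0b000010101001, 0b001000110010]

# Transpose into B bit slices; slice b holds that bit for every document.
slices = [[(sig >> (B - 1 - b)) & 1 for sig in doc_signatures] for b in range(B)]

def search(word_sig: int) -> list[int]:
    set_bits = [b for b in range(B) if (word_sig >> (B - 1 - b)) & 1]
    result = [1] * len(doc_signatures)
    for b in set_bits:                        # retrieve and AND only these bit slices
        result = [r & s for r, s in zip(result, slices[b])]
    return [d for d, r in enumerate(result) if r]

print(search(0b001000110010))   # documents whose signature covers all the query bits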
Frame-Sliced Signature File (FSSF)

 Ideas
 random disk accesses are more expensive than sequential ones
 force each word to hash into bit positions that are closer to each other in the
document signature
 these bit files are stored together and can be retrieved with a few random accesses
 Procedures
 The document signature (F bits long) is divided into k frames of s consecutive bits
each.
 For each word in the document, one of the k frames will be chosen by a hash
function.
 Using another hash function, the word sets m bits in that frame.
Frame-Sliced Signature File (Cont.)

Figure (not reproduced): the document signatures divided column-wise into frames.
Each frame will be kept in consecutive disk blocks.
FSSF (Continued)

 Example (n=2, B=12, s=6, f=2, m=3)


Word Signature
free 000000 110010
text 010110 000000
doc. signature 010110 110010
 Search
 Only one frame has to be retrieved for a single-word query, i.e., only one random disk
access is required.
   e.g., to search for documents that contain the word "free": because the word signature of
   "free" is placed in the 2nd frame, only the 2nd frame has to be examined.
 At most k frames have to be scanned for a k-word query.
 Insertion
 Only f frames have to be accessed instead of F bit-slices.
Vertical Partitioning with Compression

 idea
 create a very sparse signature matrix
 store it in a bit-sliced form
 compress each bit slice by storing the position of the 1s in the slice.
Compressed Bit Slices (CBS)
 Rooms for improvements
 Searching
 Each search word requires the retrieval of m bit files.
 The search time could be improved if m was forced to be “1”.
 Insertion
 Require too many disk accesses (equal to F, which is typically 600-1000).
Compressed Bit Slices (CBS)
(Continued)

 Let m = 1. To maintain the same false drop probability, F has to be increased.
 Let S denote the size of a signature; the bit files and the bit matrix are therefore sparse.
 To compress each bit file, we store only the positions of the "1"s.
 Because the size of each compressed bit file is unpredictable, the positions are stored in
buckets of size Bp.
 A directory with S pointers, one for each bit slice, is needed.
● Differences with inversion
» The directory (hash table) is sparse
» The actual word is stored nowhere
» Simple structure
Search example: hash a word to obtain its bucket address (h("base") = 30), then obtain the
pointers to the relevant documents from the postings buckets.
Compressed Bit Slices (CBS)
(Continued)

• There is no need to split documents into logical blocks any more.


• The pointer file can be eliminated. Instead of storing the position of each "1" in a
(compressed) bit file, we can store a pointer to the document in the text file.
• The compressed bit files will contain pointers to the appropriate documents (or
logical blocks).
• The set of all the compressed bit files will be called "level 1" or the "postings file."
The postings file consists of postings buckets of size Bp bytes (Bp is a design
parameter). Each such bucket contains pointers to the documents in the text
file, as well as an extra pointer to an overflow postings bucket, if
necessary.
Doubly Compressed Bit Slices
Idea: compress the sparse directory of CBS.

 Distinguish synonyms partially.
 Search example: h1("base") = 30 and h2("base") = 011; follow the pointers of the postings
buckets to retrieve the qualifying documents.
• This method tries to compress the sparse directory of CBS. The file
structure consists of a hash table, an intermediate file, a postings file
and the text file
• It uses a hashing function h1(), which returns values in the range (0,
S-1) and determines the slot in the directory.

• The difference is that DCBS makes an effort to distinguish among


synonyms, by using a second hashing function h2(), which returns bit
strings that are h bits long.
• These hash codes are stored in the "intermediate file," which consists of
buckets of Bi bytes (design parameter).
• Each such bucket contains records of the form (hashcode, ptr). The
pointer ptr is the head of a linked list of postings buckets.

Searching for the word "base" is handled as follows:


Step 1 h1("base") = 30: The 30-th pointer of the directory will be followed.
The corresponding chain of intermediate buckets will be examined
Step 2 h2("base") = (011)2: The records in the above intermediate buckets
will be examined. If a matching hash code is found (at most one will
exist!), the corresponding pointer is followed, to retrieve the chain of
postings buckets.
Step 3 The pointers of the above postings buckets will be followed, to retrieve the
qualifying documents.
No False Drop method

The idea is to modify the intermediate file of the DCBS, and store a pointer to the
word in the text file.
Specifically, each record of the intermediate file will have the format (hashcode, ptr,
ptr-to-word), where ptr-to-word is a pointer to the word in the text file.
The advantages of storing ptr-to-word instead of storing the actual word are two:
(1) space is saved (a word from the dictionary is 8 characters long)
(2)the records of the intermediate file have fixed length.
No False Drops Method

 Distinguishes between synonyms completely, by using a pointer to the word in the text file.
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature
matrix horizontally.
2. Grouping criterion

Partitioned Signature Files
 Using a portion of a document signature as a signature key to partition the signature
file.
 All signatures with the same key will be grouped into a so-called “module”.
 When a query signature arrives,
 examine its signature key and look for the corresponding modules
 scan all the signatures within those modules that have been selected
