
File Structures

Chap 12. Extendible Hashing

Things you have to learn


™ The problem solved by extendible hashing
™ How extendible hashing works (how it combines
tries with conventional, static hashing)
™ Use of the buffer, file, and index classes of
previous chapters to implement extendible hashing
™ Alternative approaches: dynamic hashing and
linear hashing
Introduction

‰ Dynamic files
• dynamic: records are added to and deleted from the data set
• such files can undergo a lot of growth

‰ Static hashing
• described in chapter 11 (direct hashing)
• typically worse than B-Tree for dynamic files
• eventually requires file reorganization

‰ Extendible hashing
• Robust, self-adjusting hashing for dynamic files
• Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)

2
Overview(1)

‰ Direct access (hashing) files have a static size, so they are not
suitable for files whose size is unknown in advance
‰ Dynamic file structure is desired which retains the
feature of fast retrieval by primary key, and which also
expands and contracts as the number of records in the
file fluctuates (without reorganizing the whole file)
‰ Similar motivation!
• Indexed-sequential File ==> B tree
• Hashing ==> Extendible Hashing

3
Overview(2)

‰ Extendible Hashing
• Primary key → hashing function → H(key)
• Extract the first d bits of H(key) → index into the directory
• Directory table look-up → file pointer

4
How Extendible Hashing works

‰ Idea from tries (radix searching)
• The branching factor of the tree is equal to the number of alternative
symbols in each position of the key
– e.g.) Radix 26 trie - able, abrahms, adams, anderson, andrews,
baird
• Use only a portion of the key: use-more-as-we-need-more
capability
[Figure: radix 26 trie for the keys above. Paths: a-b-l → able,
a-b-r → abrahms, a-d → adams, a-n-d-e → anderson, a-n-d-r → andrews,
b → baird]

5
Formal Definition of Trie

‰ A search structure in which each successive character of
the key is used to determine the direction of the search
at each successive level of the tree
‰ The branching factor (the radix of the trie) at any level is
potentially equal to the number of values that the
character can take

6
Extendible Hashing

‰ H maps keys to a fixed address space whose largest address is one
less than a power of 2 (e.g., 65535 = 2^16 - 1 when d = 16)
‰ The first d bits of H(key) are used as an index into a directory array
containing 2^d entries, which usually resides in primary
memory
‰ File pointers point to blocks of records known as buckets,
where an entire bucket is read by one physical data
transfer; buckets may be added to or removed from the
file dynamically
‰ The value d, the directory size (2^d), and the number of
buckets change automatically as the file expands and
contracts
7
Extendible Hashing Example
Directory with d=3 and 4 buckets

Directory entries      Bucket   d'   Keys in bucket
000, 001, 010, 011     B0       1    H(key)=0..
100                    B100     3    H(key)=100..
101                    B101     3    H(key)=101..
110, 111               B11      2    H(key)=11..
8
Turning the Trie into a Directory

‰ Using Trie for extendible hashing buckets
• Use Radix 2 Trie :
– Keys in A : beginning with 0
– Keys in B : beginning with 10
– Keys in C : beginning with 11
• Retrieving from secondary storage the buckets containing keys,
instead of individual keys
[Figure: radix 2 trie; branch 0 leads to bucket A, branch 1 leads to a
node whose branch 0 leads to B and whose branch 1 leads to C]

9
Representation of Trie (1)

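‰ Tree is not preferable
• many comparisons are required for searching
‰ A flattened array
• Make a complete full binary tree
• Collapse it into the directory structure
[Figure: the trie extended to a complete binary tree (A is reached by both
00 and 01) and collapsed into a directory: 00 → A, 01 → A, 10 → B, 11 → C]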

10
Representation of Trie(2)

‰ Directory is a complete binary tree
• Directory entry: a pointer to the associated bucket
• Given an address beginning with the bits 10, the directory entry at
index 10 (binary), i.e. entry 2, gives us a pointer to the associated bucket
• Hashing is introduced so that addresses are uniformly distributed

11
Retrieve a Record

‰ Steps in retrieving a record with a given key


• find H(given key)
• extract first d bits of H(given key)
• use this value as an index into the directory to find a pointer
• use this pointer to read a bucket into primary memory
• locate the desired record within the bucket (scan)
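
The retrieval steps above can be sketched in a few lines. This is a minimal in-memory illustration, not the implementation from the course: the names Bucket, Directory, firstBits and search are hypothetical, the hash value is assumed to be 16 bits wide and already computed, and buckets live in memory rather than in a file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-ins for the directory and its buckets.
struct Bucket {
    std::vector<std::pair<std::string, int>> records;  // (key, record address)
};

struct Directory {
    int depth;                   // d: number of hash bits used as the index
    std::vector<Bucket*> cells;  // 2^d bucket pointers
};

// Assumed 16-bit hash value; extract its first (most significant) d bits.
inline unsigned firstBits(std::uint16_t hash, int d) {
    assert(d >= 0 && d <= 16);
    return d == 0 ? 0u : static_cast<unsigned>(hash) >> (16 - d);
}

// Returns a pointer to the record address, or nullptr if the key is absent.
inline const int* search(const Directory& dir, std::uint16_t hash,
                         const std::string& key) {
    const Bucket* b = dir.cells[firstBits(hash, dir.depth)];  // directory look-up
    for (const auto& rec : b->records)                        // scan the bucket
        if (rec.first == key) return &rec.second;
    return nullptr;
}
```

With the directory in memory, the only step that would touch secondary storage in a file-based version is reading the bucket.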

12
Splitting to Handle Overflow (1)

‰ Example 1: Overflowing of bucket A
• Split A into A and D
• Use the additional, previously unused bit to divide the addresses
between the two buckets
• No need to expand the directory

Before: 00 → A, 01 → A, 10 → B, 11 → C
After:  00 → A, 01 → D, 10 → B, 11 → C

13
Splitting to Handle Overflow(2)

‰ Example 2: Overflowing of bucket B
• Do not have additional unused bits
• Need to expand the directory
1. Divide B using 3 bits of hash address
2. Make a complete full binary tree
3. Collapse it into the directory structure

Directory before the split: 00 → A, 01 → A, 10 → B, 11 → C
14
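1. Result of overflow of bucket B
[Figure: trie in which B (address 10) has split into B (100) and D (101);
A is still reached by 0 and C by 11]
2. Complete Binary Tree
[Figure: the trie extended to three levels, so A appears under 000 to 011
and C under 110 and 111]
3. Directory
000 → A
001 → A
010 → A
011 → A
100 → B
101 → D
110 → C
111 → C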
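
The two overflow cases (split without and with directory expansion) can be sketched together. This is a simplified in-memory model under stated assumptions, not the file-based classes of the chapter: keys are taken to be 8-bit hash values, a bucket holds two keys, and the names ExtendibleHash, prefix, doubleDirectory and split are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical model: keys are 8-bit hash values, and a directory cell is
// indexed by the first (most significant) `depth` bits of the key.
constexpr int kHashBits = 8;
constexpr std::size_t kBucketCapacity = 2;

struct Bucket {
    int localDepth = 0;              // d': bits shared by all keys in the bucket
    std::vector<std::uint8_t> keys;
};

struct ExtendibleHash {
    int depth = 0;                   // d: bits used by the directory
    std::vector<std::shared_ptr<Bucket>> cells{std::make_shared<Bucket>()};

    static std::size_t prefix(std::uint8_t key, int bits) {
        return bits == 0 ? 0 : static_cast<std::size_t>(key >> (kHashBits - bits));
    }

    // Double the directory: old cell i becomes new cells 2i and 2i+1.
    void doubleDirectory() {
        std::vector<std::shared_ptr<Bucket>> bigger(cells.size() * 2);
        for (std::size_t i = 0; i < cells.size(); ++i)
            bigger[2 * i] = bigger[2 * i + 1] = cells[i];
        cells = std::move(bigger);
        ++depth;
    }

    // Split a full bucket on its next unused bit and repoint directory cells.
    void split(std::shared_ptr<Bucket> old) {
        if (old->localDepth == depth) doubleDirectory();  // no unused bit left
        auto fresh = std::make_shared<Bucket>();
        fresh->localDepth = ++old->localDepth;
        std::vector<std::uint8_t> keep;
        for (std::uint8_t k : old->keys)
            (prefix(k, old->localDepth) & 1 ? fresh->keys : keep).push_back(k);
        old->keys = keep;
        for (std::size_t i = 0; i < cells.size(); ++i)
            if (cells[i] == old && ((i >> (depth - old->localDepth)) & 1))
                cells[i] = fresh;
    }

    void insert(std::uint8_t key) {
        for (;;) {
            auto b = cells[prefix(key, depth)];
            if (b->keys.size() < kBucketCapacity) { b->keys.push_back(key); return; }
            assert(b->localDepth < kHashBits);  // too many identical hash values
            split(b);
        }
    }
};
```

A bucket whose depth is below the directory depth splits by repointing existing cells, as in Example 1; only a bucket at full depth forces the directory to double, as in Example 2.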
Another Example
Bucket B100 overflows, then…

Directory (d=3)   Bucket   d'   Keys in bucket
000, 001          B00      2    H(key)=00..
010, 011          B01      2    H(key)=01..
100               B100     3    H(key)=100..
101               B101     3    H(key)=101..
110, 111          B11      2    H(key)=11..

16
Another Example (cont’d)
Directory (d=4)            Bucket   d'   Keys in bucket
0000, 0001, 0010, 0011     B00      2    H(key)=00..
0100, 0101, 0110, 0111     B01      2    H(key)=01..
1000                       B1000    4    H(key)=1000..
1001                       B1001    4    H(key)=1001..
1010, 1011                 B101     3    H(key)=101..
1100, 1101, 1110, 1111     B11      2    H(key)=11..

Bucket B100 overflows, so d increases to 4
17
Contraction

‰ A pair of adjacent buckets with the same value of d' which
share a common value of the first d'-1 bits of H(key) can
be combined if their average load is < 50%, so that all records
fit into one bucket

‰ File contraction is the reverse of expansion
• the directory can be compacted and d decremented whenever every
pair of adjacent directory pointers points to the same bucket

18
Implementation

‰Creating the address


‰Directory and bucket operations
‰Classes for representing bucket and directory
objects

19
Creating Address

‰ Function hash(KEY)
• Fold/Add hashing algorithm
• Do not MOD hashing value by address space since no fixed
address space exists
• Output from the hash function for a number of keys
bill 0000 0011 0110 1100
lee 0000 0100 0010 1000
pauline 0000 1111 0110 0101
alan 0100 1100 1010 0010
julie 0010 1110 0000 1001
mike 0000 0111 0100 1101
elizabeth 0010 1100 0110 1010
mark 0000 1010 0000 0111

20
Example of Function Hash (key)

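#include <cstring>

// Returns an integer hash value for key. The result is always below
// 19937, so it fits in 15 bits.
int Hash (const char * key)
{
    int sum = 0;
    int len = static_cast<int>(strlen(key));
    if (len % 2 == 1) len ++; // make len even; the trailing '\0' pads an odd-length key
    for (int j = 0; j < len; j+=2)
        sum = (sum + 100 * key[j] + key[j+1]) % 19937; // fold two characters at a time
    return sum;
}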

21
Function MakeAddress(key,depth)

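// Returns the low-order `depth` bits of Hash(key) in reverse order, so the
// least significant hash bit becomes the leftmost bit of the address.
int MakeAddress (const char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;      // make room for the next bit
        int lowbit = hashVal & 1;  // take the current lowest hash bit
        retval = retval | lowbit;  // append it to the address
        hashVal = hashVal >> 1;    // move on to the next hash bit
    }
    return retval;
}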
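
As a check on the two functions, the sketch below repeats them so that it compiles on its own and adds a hypothetical helper, HashBits, that renders the hash value as a 16-character bit string. The expected values in the usage note were worked out by hand from the ASCII codes and agree with the table of hash outputs shown earlier.

```cpp
#include <bitset>
#include <cassert>
#include <cstring>
#include <string>

// Fold/add hash from the slides, repeated here so the example is self-contained.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // the trailing '\0' pads an odd-length key
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Low-order `depth` bits of the hash, in reverse order.
int MakeAddress(const char* key, int depth) {
    assert(depth >= 0 && depth <= 15);
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hypothetical helper: the hash value as a 16-character bit string.
std::string HashBits(const char* key) {
    return std::bitset<16>(static_cast<unsigned long>(Hash(key))).to_string();
}
```

For example, Hash("bill") is 876, which HashBits renders as 0000001101101100, and MakeAddress("bill", 4) reads the low bits 1100 backwards to give 0011, i.e. 3.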
22
Classes for Representing Bucket and
Directory Objects
‰ Extendible hashing consists of a set of buckets stored in
a file and a directory that references them
‰ Each bucket is a record that contains a particular set of
keys and information associated with the keys
‰ A directory is primarily an array containing the record
addresses of the buckets
‰ In our implementation, we use a data record reference
for the information associated with the keys (we will treat
the buckets as sets of key-reference pairs)
‰ Class Bucket becomes a derived class of the class
TextIndex

23
Main Members of class Bucket

class Bucket: protected TextIndex
{protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split (); // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd); // calculate the range of a new (split) bucket
    int Redistribute (Bucket & newBucket); // redistribute keys
    int FindBuddy (); // find the bucket that is the buddy of this
    int TryCombine (); // attempt to combine buckets
    int Combine (Bucket * buddy, int buddyIndex); // combine buckets
    int Depth; // number of bits used 'in common' by keys in bucket
    Directory & Dir; // directory that contains the bucket
    int BucketAddr; // address in file
    friend class Directory;
    friend class BucketBuffer;
};
24
Definition of class Directory

class Directory
{public:
    Directory (…..); ~Directory();
    int Open (..); int Create(…); int Close();
    int Insert(…); int Delete(…); int Search(…);

protected:
    int DoubleSize();
    int Collapse();
    int InsertBucket (….);
    int Find (…);
    int StoreBucket(…);
    int LoadBucket(…);
    …..
};

25
Deletion

‰ When to combine buckets


• Buddy buckets: the buckets are siblings and at the leaf level of
the tree (Buddy means something like friend)
e.g., A and D on page 13 are buddy buckets

‰ Examine the directory to see if we can make changes there
• Shrink the directory if none of the buckets requires the depth of
address information that is currently available in the directory

26
Buddy Bucket

‰ Given a bucket with an address uvwxy, where u, v, w, x,
and y have values of either 0 or 1, the buddy bucket, if it
exists, has the address uvwxz, such that

z = y XOR 1

‰ If enough keys are deleted, the contents of buddy
buckets can be combined into a single bucket
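
The buddy rule can be written as a one-line bit operation. The sketch below is illustrative only: the function name findBuddy and its parameters are hypothetical, and it assumes that a bucket is paired only when it uses as many address bits as the directory does, which is a simplification rather than something the slides state.

```cpp
#include <cassert>

// Returns the directory address of the buddy bucket, or -1 if there is none.
// bucketAddress is the bucket's address written with bucketDepth bits.
int findBuddy(int bucketAddress, int bucketDepth, int directoryDepth) {
    assert(bucketDepth >= 0 && bucketDepth <= directoryDepth);
    if (directoryDepth == 0) return -1;           // a single bucket has no buddy
    if (bucketDepth < directoryDepth) return -1;  // assumption: only full-depth buckets pair up
    return bucketAddress ^ 1;                     // flip the last bit: z = y XOR 1
}
```

For a bucket at address 100 (binary) in a directory of depth 3, the buddy is 101; a bucket of depth 2 in the same directory is reported as having no buddy under this simplification.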

27
Collapsing the Directory

‰ Collapse condition
• If the directory has only a single cell, downsizing is impossible
• If there is any pair of directory cells that do not both point to the
same bucket, collapsing is impossible

‰ Allocating space
• Allocate half the size of the original
• Copy the bucket references shared by each cell pair to a single
cell in the new directory
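
The collapse condition and the copy step can be sketched as follows. This is a minimal in-memory version under assumptions: the directory is modeled as a vector of bucket file addresses, and the name collapse and its signature are hypothetical rather than the Directory::Collapse member of the class definition.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical directory: cell i holds the file address of a bucket.
// Halves `cells` and returns true if every pair (2i, 2i+1) shares a bucket.
bool collapse(std::vector<int>& cells, int& depth) {
    assert(cells.size() == (std::size_t{1} << depth));
    if (depth == 0) return false;                    // a single cell cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // some bucket still needs all d bits
    std::vector<int> half(cells.size() / 2);         // allocate half the original size
    for (std::size_t i = 0; i < half.size(); ++i)
        half[i] = cells[2 * i];                      // one reference per cell pair
    cells = std::move(half);
    --depth;
    return true;
}
```

Calling it repeatedly until it returns false shrinks the directory as far as the current buckets allow.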

28
Extendible Hashing
Performance
‰ Time : O(1)
• If the directory can be kept in RAM: a single access
• Otherwise: two accesses are necessary
‰ Space utilization of the bucket
• r (# of records), b (block size), N (# of blocks)
• Utilization = r / (bN)
• Average utilization → 0.69
‰ Space utilization for the directory
• How large a directory should we expect to have, given an
expected number of keys?
• Expected value for the directory size by Flajolet (1983)
– Estimated directory size = (3.92 / b) × r^(1+1/b)

29
Space Utilization for Buckets

‰ Periodic and fluctuating
• With uniformly distributed addresses, all the buckets tend to fill up
at the same time → split at the same time
• As the buckets fill up: utilization past 90%
• After a concentrated series of splits: utilization below 50%
‰ r : # of records , b : block size
• N ~= r / (b ln 2)
• Utilization = r / (bN) ~= ln 2 = 0.69
• Average utilization of 69%
‰ B tree space utilization
• Normal B-tree : 67%, B-tree with redistribution in insertion :
85 %
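
The utilization and directory-size estimates are easy to evaluate numerically. The sketch below simply encodes the formulas from the performance slides; the function names are hypothetical, and b is taken to mean the number of records that fit in one bucket.

```cpp
#include <cassert>
#include <cmath>

// r: number of records, b: records per bucket, n: number of buckets.

// Expected number of buckets: N ~= r / (b ln 2).
double expectedBuckets(double r, double b) {
    assert(r >= 0 && b > 0);
    return r / (b * std::log(2.0));
}

// Utilization = r / (bN).
double utilization(double r, double b, double n) {
    assert(b > 0 && n > 0);
    return r / (b * n);
}

// Flajolet's estimate: (3.92 / b) * r^(1 + 1/b).
double estimatedDirectorySize(double r, double b) {
    assert(r >= 0 && b > 0);
    return 3.92 / b * std::pow(r, 1.0 + 1.0 / b);
}
```

Substituting the expected bucket count into the utilization formula cancels r and b and leaves ln 2, which is where the 69% average comes from. For r = 1,000,000 and b = 50 the directory estimate works out to roughly 1.0 × 10^5 cells.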

30
Alternative Approaches(1):
Dynamic Hashing
‰ Similar to extendible hashing
• Use a directory to track bucket addresses
• Extend the directory through the use of tries

‰ Start with a hash function that covers an address space
of a fixed size

‰ When overflow occurs
• The buckets split, forming the leaves of a trie that grows down
from the original address node

31
Alternative Approaches(1):
Dynamic Hashing (cont’d)

‰ Two kinds of nodes
• External node: references a data bucket
• Internal node: points to two child index nodes
• When a node splits and gains children, it changes from an external
node to an internal node

‰ Two hash functions
• Apply the first hash function to reach a node in the original address space
• if an external node is found : the search is completed
• if an internal node is found : apply the second hash function to
choose which child to follow

32
[Figure: growth of the index in dynamic hashing
(a) Original address space with nodes 1, 2, 3, 4
(b) Node 4 has split into children 40 and 41
(c) Node 2 has split into 20 and 21, and node 41 into 410 and 411]
Dynamic Hashing vs. Extendible
Hashing(1)
‰ Overflow handling
• Both schemes extend the hash function locally, as a binary
search trie

‰ Both schemes use directory structure


• Dynamic hashing: a linked structure
• Extendible hashing: perfect tree expressible as an array

‰ Space utilization
• the same for both schemes (about 69%)

34
Dynamic Hashing and
Extendible Hashing(2)
‰ Growth of directory
• Dynamic hashing: slower, more gradual growth
• Extendible hashing: extend directory by doubling it

‰ Actual size of an index node
• An index node in dynamic hashing is larger than a directory cell in
extendible hashing (because of pointers)

‰ Page fault
• Dynamic hashing: more than one page fault (with linked
structure for the directory)
• Extendible hashing: single page fault

35
Alternative Approaches(2):
Linear Hashing

‰ Unlike extendible hashing and dynamic hashing, linear
hashing does not use a directory.
‰ The actual address space is extended one bucket at a
time as buckets overflow
‰ Because the extension of the address space does not
necessarily correspond to the bucket that is overflowing,
linear hashing necessarily involves the use of overflow
buckets, even as the address space expands
‰ No directory: avoids the additional seek resulting from the
additional layer
‰ Use more bits of the hashed value as the address space grows
• h_d(k) : depth d hashing function (using function make_address)
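
One common way to compute a bucket address in linear hashing is sketched below. It rests on assumptions that go beyond the slides: h_d is taken to be the d low-order bits of an already-computed hash value (rather than the output of make_address), n is a hypothetical split pointer meaning that buckets 0 to n-1 have already been split, and the names hd and linearAddress are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// h_d(k): the d low-order bits of the hash value (an assumption, see above).
inline std::uint32_t hd(std::uint32_t hash, int d) {
    assert(d >= 0 && d < 32);
    return hash & ((std::uint32_t{1} << d) - 1);
}

// n: index of the next bucket to split; buckets 0..n-1 already use d+1 bits.
inline std::uint32_t linearAddress(std::uint32_t hash, int d, std::uint32_t n) {
    assert(d >= 0 && d < 31);
    std::uint32_t a = hd(hash, d);
    return a >= n ? a : hd(hash, d + 1);  // a split bucket needs one more bit
}
```

With d = 2 and n = 1, bucket 00 has already been split, so a hash value ending in 100 is sent to bucket 100, while a value ending in 01 still goes to bucket 01.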
36
The growth of address space in
linear hashing(1)
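[Figure: buckets and their addresses as the address space grows]
(a)  a    b    c    d
     00   01   10   11
(b)  a    b    c    d    A
     000  01   10   11   100
(c)  a    b    c    d    A    B
     00   01   10   11   100  101
(d)  a    b    c    d    A    B    C
     00   01   10   11   100  101  110
Overflow buckets in the figure: w in the top row of panels, x in panels (c) and (d)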
(continued...) 37
The growth of address space in
linear hashing(2)

a b c d A B C D
00 01 10 11 100 101 110 111

(e)

38
Approaches to Controlling
Splitting
‰ Postpone splitting: increase space utilization
• B-Tree: redistribution rather than splitting
• Hashing: placing records in chains of overflow buckets to
postpone splitting

‰ Triggering event for splitting
• Linear hashing
– A split is triggered every time any bucket overflows
– The bucket that is split is not necessarily the one that overflowed
• Litwin (1980): use the overall load factor of the file as the trigger
– Fewer than 2 seeks on average, 75% ~ 80% storage utilization

39
Approaches to Controlling
Splitting (cont’d)
‰ Postpone splitting for extendible hashing
• Use chained overflow buckets and avoid doubling the directory space
• About 1.1 seeks on average, 76% ~ 81% storage utilization

40
