Chap 12. Extendible Hashing: File Structures

Things you have to learn

The problem solved by extendible hashing
How extendible hashing works (how it combines tries with conventional, static hashing)
Use of the buffer, file, and index classes of previous chapters to implement extendible hashing
Alternative approaches: dynamic hashing and linear hashing
Introduction
Dynamic files
• dynamic: records are added to and deleted from the data set
• such files can undergo a lot of growth
Static hashing
• described in chapter 11 (direct hashing)
• typically worse than B-Tree for dynamic files
• eventually requires file reorganization
Extendible hashing
• Robust, self-adjusting hashing for dynamic files
• Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
Overview(1)
Direct access (hashing) files have a static size, so they are not suitable for files whose size is unknown in advance
A dynamic file structure is desired that retains the feature of fast retrieval by primary key, and that also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)
Similar motivation!
• Indexed-sequential File ==> B tree
• Hashing ==> Extendible Hashing
Overview(2)
Extendible Hashing
[Figure: primary key → hashing function H(key) → extract first d bits → directory (index) table look-up → file pointer]
How Extendible Hashing works
Idea from tries (radix searching)
• The branching factor of the tree is equal to the number of alternative symbols in each position of the key
– e.g., radix-26 trie for able, abrahms, adams, anderson, andrews, baird
• Use only a portion of the key: use-more-as-we-need-more capability
[Figure: radix-26 trie. a-b-l: able; a-b-r: abrahms; a-d: adams; a-n-d-e: anderson; a-n-d-r: andrews; b: baird]
Formal Definition of Trie
A search structure in which each successive character of the key is used to determine the direction of the search at each successive level of the tree
The branching factor (the radix of the trie) at any level is potentially equal to the number of values that the character can take
Extendible Hashing
H maps keys to a fixed address space whose largest address is one less than a power of 2 (e.g., addresses up to 65535 = 2^16 - 1 when 16 bits are used)
The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
File pointers point to blocks of records known as buckets; an entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts
Extendible Hashing Example
Directory with d=3 and 4 buckets
Directory entry       Bucket   d'   Keys in bucket
000, 001, 010, 011    B0       1    H(key)=0..
100                   B100     3    H(key)=100..
101                   B101     3    H(key)=101..
110, 111              B11      2    H(key)=11..
Turning the Trie into a Directory
Using a trie for extendible hashing buckets
• Use a radix-2 trie:
– Keys in A: beginning with 0
– Keys in B: beginning with 10
– Keys in C: beginning with 11
[Figure: radix-2 trie. Branch 0 leads to bucket A; branch 1 leads to a node whose branch 0 leads to bucket B and whose branch 1 leads to bucket C]
• Retrieve from secondary storage the buckets containing keys, instead of individual keys
Representation of Trie (1)
A tree is not preferable
• many comparisons are required for searching
A flattened array
• Make a complete (full) binary tree
• Collapse it into the directory structure
[Figure: the trie extended to a complete binary tree, then collapsed into a directory: 00 → A, 01 → A, 10 → B, 11 → C]
Representation of Trie(2)
The directory is a complete binary tree collapsed into an array
• Directory entry: a pointer to the associated bucket
• Given an address beginning with the bits 10, directory entry 10 (binary, i.e., index 2) gives us a pointer to the associated bucket
• A hash function is introduced so that addresses are uniformly distributed
Retrieve a Record
Steps in retrieving a record with a given key

• find H(given key)
• extract first d bits of H(given key)
• use this value as an index into the directory to find a pointer
• use this pointer to read a bucket into primary memory
• locate the desired record within the bucket (scan)
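The retrieval steps above can be sketched in memory as follows. This is only an illustration under stated assumptions: `SimpleBucket`, `SimpleDirectory`, and this `Search` are hypothetical stand-ins, buckets live in a vector rather than in a file, and the address is formed the way the implementation slides form it (low-order hash bits, reversed).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-ins for the on-disk structures.
struct SimpleBucket {
    std::vector<std::pair<std::string, int>> entries;  // (key, record address)
};

struct SimpleDirectory {
    int depth;                          // d: number of address bits in use
    std::vector<int> cells;             // 2^d cells, each an index into buckets
    std::vector<SimpleBucket> buckets;  // assumption: buckets kept in memory
};

// Fold-and-add hash from the slides (result is below 19937).
int Hash(const std::string& key) {
    int sum = 0;
    for (std::size_t j = 0; j < key.size(); j += 2) {
        int second = (j + 1 < key.size()) ? key[j + 1] : 0;  // pad odd lengths with 0
        sum = (sum + 100 * key[j] + second) % 19937;
    }
    return sum;
}

// Address = the low-order `depth` bits of the hash, in reversed order.
int MakeAddress(const std::string& key, int depth) {
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Returns the record address stored for key, or -1 if the key is absent.
int Search(const SimpleDirectory& dir, const std::string& key) {
    int cell = MakeAddress(key, dir.depth);  // hash the key, keep d bits
    assert(cell >= 0 && cell < static_cast<int>(dir.cells.size()));
    const SimpleBucket& bucket = dir.buckets[dir.cells[cell]];  // follow the pointer
    for (const auto& entry : bucket.entries)  // scan within the bucket
        if (entry.first == key) return entry.second;
    return -1;
}
```

In a file-based version, following the pointer would be the single bucket read; everything else happens in primary memory.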
Splitting to Handle Overflow (1)
Example 1: Overflow of bucket A
• Split A into A and D
• Use the additional, previously unused bit to divide the addresses between the two buckets
• No need to expand the directory
Directory entry   Before split   After split
00                A              A
01                A              D
10                B              B
11                C              C
Splitting to Handle Overflow(2)
Example 2: Overflow of bucket B
• There are no additional unused bits
• Need to expand the directory
1. Divide B using 3 bits of the hash address
2. Make a complete (full) binary tree
3. Collapse it into the directory structure
[Figure: directory before the split: 00 → A, 01 → A, 10 → B, 11 → C]
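1. Result of overflow of bucket B
[Figure: trie in which branch 0 leads to A; under branch 1, branch 0 leads to a node that splits into B (0) and D (1), and branch 1 leads to C]
2. Complete binary tree
[Figure: the same trie extended to three full levels]
3. Directory
Directory entry       Bucket
000, 001, 010, 011    A
100                   B
101                   D
110, 111              C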
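Both overflow cases (split without expanding the directory, and split after doubling it) can be sketched in a few lines. This is a minimal in-memory sketch, not the class design used later in the slides. It assumes an 8-bit hash value (`kHashBits`), a bucket capacity of 2 (`kCapacity`), leading hash bits as the directory index, and shared pointers in place of file addresses. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

constexpr int kHashBits = 8;          // assumed width of H(key)
constexpr std::size_t kCapacity = 2;  // assumed bucket capacity

struct Bucket {
    int localDepth;           // d': leading bits shared by every hash in the bucket
    std::vector<int> hashes;  // stored H(key) values
};

struct Directory {
    int depth = 0;  // d
    std::vector<std::shared_ptr<Bucket>> cells{std::make_shared<Bucket>(Bucket{0, {}})};

    // The first `depth` bits of the hash select a directory cell.
    int Index(int hash) const { return hash >> (kHashBits - depth); }

    // Doubling: every cell becomes two adjacent cells with the same pointer.
    void Double() {
        std::vector<std::shared_ptr<Bucket>> bigger;
        for (const auto& cell : cells) {
            bigger.push_back(cell);
            bigger.push_back(cell);
        }
        cells = std::move(bigger);
        ++depth;
    }

    void Split(int hash) {
        std::shared_ptr<Bucket> old = cells[Index(hash)];
        if (old->localDepth == depth) Double();  // no unused bits left
        ++old->localDepth;
        auto fresh = std::make_shared<Bucket>(Bucket{old->localDepth, {}});
        // The upper half of the cells that pointed at `old` now point at `fresh`.
        int span = 1 << (depth - old->localDepth);  // cells per half
        int first = (Index(hash) / (2 * span)) * (2 * span);
        for (int i = first + span; i < first + 2 * span; ++i) cells[i] = fresh;
        // Redistribute the stored hashes through the updated directory.
        std::vector<int> moved = std::move(old->hashes);
        old->hashes.clear();
        for (int h : moved) cells[Index(h)]->hashes.push_back(h);
    }

    void Insert(int hash) {
        while (cells[Index(hash)]->hashes.size() >= kCapacity) {
            // Sketch limitation: more than kCapacity identical hashes cannot be split.
            assert(cells[Index(hash)]->localDepth < kHashBits);
            Split(hash);
        }
        cells[Index(hash)]->hashes.push_back(hash);
    }
};
```

With these assumptions, inserting the hashes 0, 64, 128, 192, 160 doubles the directory twice (d = 2). Inserting 32 afterwards splits the bucket for prefix 0 without doubling, as in Example 1.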
Another Example
Bucket B100 overflows, then…
d=3
Directory entry   Bucket   d'   Keys in bucket
000, 001          B00      2    H(key)=00..
010, 011          B01      2    H(key)=01..
100               B100     3    H(key)=100..
101               B101     3    H(key)=101..
110, 111          B11      2    H(key)=11..
Another Example (cont’d)
Bucket B100 overflows, d increases to 4
d=4
Directory entry           Bucket   d'   Keys in bucket
0000, 0001, 0010, 0011    B00      2    H(key)=00..
0100, 0101, 0110, 0111    B01      2    H(key)=01..
1000                      B1000    4    H(key)=1000..
1001                      B1001    4    H(key)=1001..
1010, 1011                B101     3    H(key)=101..
1100, 1101, 1110, 1111    B11      2    H(key)=11..
Contraction
A pair of adjacent buckets with the same value of d' that share a common value of the first d'-1 bits of H(key) can be combined if their average load is below 50%, so that all their records fit into one bucket
File contraction is the reverse of expansion
• the directory can be compacted and d decremented whenever every pair of adjacent directory pointers has the same value
Implementation
Creating the address

Directory and bucket operations
Classes for representing bucket and directory objects
Creating Address
Function hash(KEY)
• Fold/Add hashing algorithm
• Do not MOD the hash value by the address space, since no fixed address space exists
• Output from the hash function for a number of keys
Key         Hash value (binary)
bill        0000 0011 0110 1100
lee         0000 0100 0010 1000
pauline     0000 1111 0110 0101
alan        0100 1100 1010 0010
julie       0010 1110 0000 1001
mike        0000 0111 0100 1101
elizabeth   0010 1100 0110 1010
mark        0000 1010 0000 0111
Example of Function Hash (key)
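#include <cstring>

// Returns an integer hash value for key; the result is below 19937, so it fits in 15 bits
int Hash (const char * key)
{
    int sum = 0;
    int len = (int) strlen(key);
    if (len % 2 == 1) len++; // make len even; the terminating '\0' pads the last pair
    for (int j = 0; j < len; j += 2) // fold two characters at a time and add
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}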
Function MakeAddress(key,depth)
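int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // Reverse the bits: take the low-order depth bits of the hash, lowest bit first,
    // so the lowest bit of hashVal becomes the highest bit of the address
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}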
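A worked example may help in reading the bit reversal. The block below restates the two functions from the slides in self-contained form and adds `ToBits`, a hypothetical helper used only to display values; the depth bound in the assertion is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Fold-and-add hash as given in the slides (result is below 19937).
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: the terminating '\0' pads the last pair
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Low-order `depth` bits of the hash, reversed, as in the slides.
int MakeAddress(const char* key, int depth) {
    assert(depth >= 0 && depth <= 15);  // assumption: the hash has at most 15 useful bits
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hypothetical helper: the low `width` bits of value as a string of 0s and 1s.
std::string ToBits(int value, int width) {
    std::string bits;
    for (int i = width - 1; i >= 0; --i)
        bits += ((value >> i) & 1) ? '1' : '0';
    return bits;
}
```

For the key bill, `ToBits(Hash("bill"), 16)` gives 0000001101101100, matching the table of hash outputs. `MakeAddress("bill", 4)` reads the low four bits 1100 from the right, producing 0011, i.e. address 3.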
Classes for Representing Bucket and
Directory Objects
Extendible hashing consists of a set of buckets stored in
a file and a directory that references them
Each bucket is a record that contains a particular set of
keys and information associated with the keys
A directory is primarily an array containing the record
addresses of the buckets
In our implementation, we use a data record reference
for the information associated with the keys (we will treat
the buckets as sets of key-reference pairs)
Class Bucket becomes a derived class of the class
TextIndex
Main Members of class Bucket
class Bucket: protected TextIndex
{
protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split (); // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd); // calculate the range of a new (split) bucket
    int Redistribute (Bucket & newBucket); // redistribute keys
    int FindBuddy (); // find the bucket that is the buddy of this
    int TryCombine (); // attempt to combine buckets
    int Combine (Bucket * buddy, int buddyIndex); // combine buckets
    int Depth; // number of bits used 'in common' by keys in this bucket
    Directory & Dir; // directory that contains the bucket
    int BucketAddr; // address in file
    friend class Directory;
    friend class BucketBuffer;
};
Definition of class Directory
class Directory
{
public:
    Directory (…..); ~Directory();
    int Open (..); int Create(…); int Close();
    int Insert(…); int Delete(…); int Search(…);
protected:
    int DoubleSize(); // double the size of the directory
    int Collapse(); // collapse the directory to half its size
    int InsertBucket (….);
    int Find (…);
    int StoreBucket(…);
    int LoadBucket(…);
    …..
};
Deletion
When to combine buckets
• Buddy buckets: buckets that are siblings at the leaf level of the tree (buddy means something like friend)
e.g., A and D in Example 1 of Splitting to Handle Overflow are buddy buckets
Examine the directory to see if we can make changes there
• Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory
Buddy Bucket
Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the address uvwxz, such that
z = y XOR 1
If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
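The buddy rule translates directly into a one-bit flip. The sketch below is illustrative only: `BuddyAddress` is a hypothetical free function, not the `FindBuddy` member declared earlier. The guard that only a bucket using the full directory depth has a buddy is an assumption made for this sketch.

```cpp
#include <cassert>

// Returns the directory address of the buddy bucket, or -1 if there is none.
// address: directory address of the bucket (bits uvwxy)
// bucketDepth: d' of the bucket; dirDepth: d of the directory
int BuddyAddress(int address, int bucketDepth, int dirDepth) {
    assert(bucketDepth >= 0 && bucketDepth <= dirDepth);
    if (dirDepth == 0) return -1;           // a single-cell directory has no buddy
    if (bucketDepth < dirDepth) return -1;  // assumption: only full-depth buckets pair up
    return address ^ 1;                     // z = y XOR 1: flip the last address bit
}
```

For example, with d = 3 the bucket at address 100 (4) has its buddy at 101 (5), and vice versa.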
Collapsing the Directory
Collapse condition
• If the directory has only a single cell, downsizing is impossible
• If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible
Allocating space
• Allocate half the size of the original
• Copy the bucket reference shared by each cell pair to a single cell in the new directory
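The collapse condition and the copy step can be sketched as below. The directory is modeled as a plain vector of bucket addresses whose size is a power of two. This free function `Collapse` is a hypothetical illustration, not the member function of class Directory.

```cpp
#include <cstddef>
#include <vector>

// Halves `cells` and returns true if every adjacent cell pair refers to the
// same bucket; otherwise leaves `cells` unchanged and returns false.
// Assumption: cells.size() is a power of two.
bool Collapse(std::vector<int>& cells) {
    if (cells.size() <= 1) return false;  // single cell: downsizing is impossible
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // a pair differs: cannot collapse
    std::vector<int> smaller(cells.size() / 2);      // half the original size
    for (std::size_t i = 0; i < smaller.size(); ++i)
        smaller[i] = cells[2 * i];  // one reference per former pair
    cells = smaller;
    return true;
}
```

A caller that tracks the directory depth d would decrement it whenever this returns true.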
Extendible Hashing
Performance
Time: O(1)
• If the directory can be kept in RAM: a single access
• Otherwise: two accesses are necessary
Space utilization of the buckets
• r (# of records), b (block size in records), N (# of blocks)
• Utilization = r / (bN)
• Average utilization → 0.69
Space utilization for the directory
• How large a directory should we expect to have, given an expected number of keys?
• Expected value for the directory size by Flajolet (1983)
– Estimated directory size = (3.92 / b) × r^(1 + 1/b)
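To get a feel for the estimate, the formula can be evaluated directly. The function name and the choice of `double` for both parameters are assumptions of this sketch. The result is only the estimated number of directory cells, not an exact count.

```cpp
#include <cassert>
#include <cmath>

// Estimated directory size (in cells) for r records and buckets holding b records:
// (3.92 / b) * r^(1 + 1/b), the estimate quoted above.
double EstimatedDirectorySize(double r, double b) {
    assert(r > 0.0 && b > 0.0);
    return 3.92 / b * std::pow(r, 1.0 + 1.0 / b);
}
```

Because the exponent 1 + 1/b is greater than 1, the estimate grows slightly faster than linearly in r. Larger buckets bring the exponent closer to 1 and shrink the leading factor 3.92 / b.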
Space Utilization for Buckets
Periodic and fluctuating
• With uniformly distributed addresses, all the buckets tend to fill up at the same time → split at the same time
• As the buckets fill up: utilization past 90%
• After a concentrated series of splits: below 50%
r: # of records, b: block size
• N ≈ r / (b ln 2)
• Utilization = r / (bN) ≈ ln 2 ≈ 0.69
• Average utilization of 69%
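The step from the bucket-count estimate to the 69% figure can be written out by substituting N into the utilization formula. The symbols r, b, and N are the ones defined above; nothing else is assumed.

```latex
\[
\text{Utilization} \;=\; \frac{r}{bN}
\;\approx\; \frac{r}{b \cdot \dfrac{r}{b \ln 2}}
\;=\; \ln 2 \;\approx\; 0.693
\]
```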
B-tree space utilization
• Normal B-tree: 67%; B-tree with redistribution on insertion: 85%
Alternative Approaches(1):
Dynamic Hashing
Similar to extendible hashing
• Use a directory to track bucket addresses
• Extend the directory through the use of tries
Start with a hash function that covers an address space of a fixed size
When overflow occurs
• The buckets split, forming the leaves of a trie that grows down from the original address node
Dynamic Hashing (cont’d)
Two kinds of nodes
• External node: references a data bucket
• Internal node: points to two child index nodes
• When a node's bucket splits, the node changes from an external node to an internal node
Two hash functions
• Apply the first hash function to find a node in the original address space
• If an external node is found: the search is completed
• If an internal node is found: apply the second hash function to choose which child to follow
[Figure: growth of the index in dynamic hashing. (a) Original address space with nodes 1, 2, 3, 4. (b) Node 4 has split into children 40 and 41. (c) Node 2 has also split into 20 and 21, and node 41 into 410 and 411.]
Dynamic Hashing vs. Extendible Hashing(1)
Overflow handling
• Both schemes extend the hash function locally, as a binary search trie
Both schemes use a directory structure
• Dynamic hashing: a linked structure
• Extendible hashing: a perfect tree expressible as an array
Space utilization
• the same for both schemes (about 69%)
Dynamic Hashing vs. Extendible Hashing(2)
Growth of directory
• Dynamic hashing: slower, more gradual growth
• Extendible hashing: extends the directory by doubling it
Actual size of an index node
• An index node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)
Page fault
• Dynamic hashing: more than one page fault (with a linked structure for the directory)
• Extendible hashing: single page fault
Linear Hashing
Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.
The actual address space is extended one bucket at a time as buckets overflow
Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands
No directory: avoids the additional seek caused by the extra layer
Use more bits of the hashed value
• h_d(k): depth-d hashing function (using function make_address)
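The slides do not spell out how a key is mapped to a bucket while the address space is only partly extended. A common formulation is sketched below under explicit assumptions. Here h_d(k) is taken to be the low-order d bits of an already-hashed key (not make_address), and `next` counts the buckets already split in the current round. Both function names are hypothetical.

```cpp
#include <cassert>

// h_d(k): assumed here to be the low-order d bits of an already-hashed key.
unsigned HashAtDepth(unsigned hashed, int d) {
    return hashed & ((1u << d) - 1u);
}

// Address of a key when buckets 0 .. next-1 have already been split in this round.
// Assumption: 0 <= next < 2^d.
unsigned LinearAddress(unsigned hashed, int d, unsigned next) {
    assert(d >= 0 && d < 31);
    assert(next < (1u << d));
    unsigned addr = HashAtDepth(hashed, d);
    if (addr < next)                         // this bucket has already been split,
        addr = HashAtDepth(hashed, d + 1);   // so one more bit decides between the halves
    return addr;
}
```

With d = 2 and one bucket split, a key whose hashed value ends in 100 goes to the new bucket 100, while a key ending in 01 still goes to bucket 01. No directory look-up is involved.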
The growth of address space in linear hashing
[Figure: (a) buckets a, b, c, d at addresses 00, 01, 10, 11. (b) bucket a splits, adding A at 100. (c) bucket b splits, adding B at 101. (d) bucket c splits, adding C at 110. (e) bucket d splits, adding D at 111, so that all addresses use 3 bits. Overflow buckets w and x appear in panels (b) through (d).]
Approaches to Controlling Splitting
Postpone splitting: increase space utilization
• B-Tree: redistribution rather than splitting
• Hashing: placing records in chains of overflow buckets to postpone splitting
Triggering event for splitting
• Linear hashing
– A split is triggered every time any bucket overflows
– The bucket that is split is not necessarily the one that overflowed
• Litwin (1980): use the overall load factor of the file
– Fewer than 2 seeks, 75% ~ 80% storage utilization
Approaches to Controlling Splitting (cont'd)
Postpone splitting for extendible hashing
• Use chained overflow buckets and avoid doubling the directory space
• 1.1 seeks, 76% ~ 81% storage utilization
