
IT623 Algorithms & Data Structures
Hashing
Notes based on Prof Anish Maturia's lectures
Hash Table and Hashing
◼ The data structures (binary search trees, etc.)
discussed so far work with the input keys by
comparing them.
◼ In practice, we can process the input key to make
insert/search/delete operations run faster!
◼ For example,
◆ Integers consist of digits.
◆ Strings consist of letters.
We can process a key to map it to an integer, and
use that integer as an array index.
Basic ideas
In general,
◼ Universe of keys : U = {u0, u1,…, uN-1}.
◼ For each key, it should be relatively easy to
compute its corresponding index.
◼ Operations supported:
◆ Find
◆ Insert
◆ Delete. Deletions may be unnecessary in some
applications.

Basic ideas
◼ Given a key, compute its corresponding index
into a hash table.
◼ Index is computed using a hash function.

hash_function(key) = Index

Example Applications
◼ Compilers use hash tables (symbol tables) to
keep track of declared variables.
◼ Operating systems use hash tables in many
areas, for example memory page tables,
process tables, and thread tables.
◼ On-line spell checkers. After prehashing the
entire dictionary, one can check each word in
constant time and print out the misspelled
words in order of their appearance in the
document.

Unrealistic Hashing: bit vector

Universe of keys : U = {u0, u1,…, uN-1}.

entry[i] = 1 if ui ∈ S, and entry[i] = 0 if ui ∉ S

◼ Find(ui) : Test entry[i] → O(1) time


◼ Insert(ui) : Set entry[i] to 1 → O(1) time
◼ Delete(ui) : Set entry[i] to 0 → O(1) time
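The three operations above can be written down almost verbatim. A minimal sketch in Java (the class and method names are illustrative, not from the slides):

```java
// Bit-vector "hashing" over the universe {0, 1, ..., N-1}.
// entry[i] == true means key i is currently in the set S.
class BitVectorSet {
    private final boolean[] entry;

    BitVectorSet(int universeSize) {
        entry = new boolean[universeSize];
    }

    boolean find(int key)  { return entry[key]; }  // O(1)
    void insert(int key)   { entry[key] = true; }  // O(1)
    void delete(int key)   { entry[key] = false; } // O(1)
}
```

Every operation is a single array access, which is exactly why each runs in O(1) time.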

Unrealistic Hashing: bit vector
Features :
◼ Each operation takes constant time.
◼ Simple implementation.
◼ The scheme wastes too much space if the universe
of keys is very large compared with the actual
number of elements to be stored.
◆ For example, consider your student id. If we treat it as a 9-
digit integer, then the universe size is 10^9, but we only have
about 1600 students → around 10^9 − 1600 entries will be
wasted.

Hashing
◼ Let U={K0, K1 , …, KN-1} be the universe of keys.
◼ Let T[0 … m-1] be an array representing the
hash table, where m is much smaller than N.
◼ The soul of the hashing scheme h is the hash
function
h: Key universe → [0 .. m-1]
that maps each key in the universe to an integer
between 0 and m-1.
◼ For each key K, h(K) is called the hash value of
K and K is supposed to be stored at T[h(K)].

Hashing
h(K1) → index1
h(K2) → index2
h(K3) → index2

(The indexes range over the hash table slots 0 … m-1.)

Two (or more) keys may get into the same location !
Hashing
◼ What do we do if two (or more) keys have the
same hash values?
◼ There are two aspects to this.
◆ We should design the hash function such that it
spreads the keys uniformly among the entries of T.
This will decrease the likelihood that two keys
have the same hash values.
◆ Nevertheless, since N > m, we still need a solution
when this event happens.

Example of a bad hash function
◼ Suppose that our keys are strings of letters.
◆ Let each letter equal its ASCII value:
Key = cn-1 cn-2 . . . c0
◆ For example:
‘A’ = 65, ‘Z’ = 90, ‘a’ = 97, ‘z’ = 122
◼ Our hash function:
h(cn-1 ... c0) = ( Σi=0..n-1 ci ) % m

(This simply adds the string’s characters values up and takes the
modulus by m, where m is the size of the hash table.)
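A small sketch of this hash function makes the weakness easy to demonstrate (the class name is illustrative):

```java
// The "bad" hash: sum the character values and take the remainder.
// Because addition is commutative, character order is ignored entirely.
class BadHash {
    static int hash(String key, int m) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++)
            sum += key.charAt(i);   // 'A' contributes 65, 'B' 66, ...
        return sum % m;
    }
}
```

Any anagram of a key collides with it, e.g. hash("ABC", m) == hash("CBA", m) for every table size m.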

Example of a bad hash function
◼ Our hash function:
h(cn-1 ... c0) = ( Σi=0..n-1 ci ) % m

◼ This hash function gives the same result for any
permutation of the string:
h(“ABC”) = h(“CBA”) = h(“ACB”)
◼ Keys cannot be spread uniformly → so it is not a
good idea!

Improving the hash function
◼ We can improve the hash function so that the letters
contribute differently according to their positions.
h(cn-1 ... c0) = ( Σi=0..n-1 ci * r^i ) % m

(Each character ci is weighted by r^i, i.e., by its position i.)


◼ r is the radix
◆ Integers: r = 10
◆ Bit strings: r=2
◆ Strings: r = 128

Improving the hash function
◼ Need to be careful about overflows, since we may
raise the base to a large power.
◼ We can do all computation in modulo arithmetic.
◼ For example, taking the modulus at each step and
computing r^j incrementally, so that no intermediate
value overflows:

sum = 0; power = 1;                   // power holds r^j % m
for (int j = 0; j < n; j++) {
    sum = (sum + c[j] * power) % m;
    power = (power * r) % m;
}
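Packaged as a runnable method, this becomes the following sketch (the class name is illustrative). Since the key is written cn-1 … c0, c0 is the rightmost character, so the scan runs from the end of the string with the weight r^i growing as i increases:

```java
// Positional (polynomial) hash with all arithmetic done modulo m,
// so intermediate values never overflow for reasonable table sizes.
class PolyHash {
    static int hash(String key, int r, int m) {
        int sum = 0, power = 1;   // power = r^i % m
        // c0 is the rightmost character: scan right to left
        for (int j = key.length() - 1; j >= 0; j--) {
            sum = (sum + key.charAt(j) * power) % m;
            power = (power * r) % m;
        }
        return sum;
    }
}
```

Unlike the order-blind sum, this function distinguishes permutations: with r = 128 and m = 101, "ABC" and "CBA" hash to different slots.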

Java Object Hash: Used for searching a value
• JavaHashExample.java
hashCode for an integer 2009 is 2009
hashCode for a string “2009” is 1537223

String class hashCode() method calculates the hash code as:

s.charAt(0) * 31^(n-1) + s.charAt(1) * 31^(n-2) + ... + s.charAt(n-1)
= 50*31^3 + 48*31^2 + 48*31^1 + 57   → ASCII of '2' is 50, of '0' is 48, and of '9' it is 57
= 1,537,223
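The slide's arithmetic can be checked directly. The sketch below (class name is illustrative) evaluates the same 31-based polynomial via Horner's rule, which is the recurrence `String.hashCode()` itself uses:

```java
// Recompute String.hashCode(): h = 31*h + s.charAt(i) for each character.
// Expanding the recurrence gives s.charAt(0)*31^(n-1) + ... + s.charAt(n-1).
class StringHashDemo {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h;
    }
}
```

For "2009" this yields 1,537,223, matching both the slide and `"2009".hashCode()`.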
Division method
• A simple and effective hash function
h(x) = x mod tableSize

• The table size should be a prime number


• Why?

Why Primes?
• h(x) = x mod 12
• Consider the set of keys {0, 1, …, 100}
• The keys that are multiples of 3 will be hashed only
to indexes that are multiples of 3:
  index 0: 0, 12, 24, 36, …
  index 3: 3, 15, 27, 39, …
  index 6: 6, 18, 30, 42, …
  index 9: 9, 21, 33, 45, …
  (indexes 1, 2, 4, 5, 7, 8, 10, 11 receive none of them)
Why Primes?
• Now try it with x mod 13
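A quick way to compare the two moduli is to count how many distinct slots the multiples of 3 in {0, …, 100} actually occupy. A small sketch (class and method names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Hash every multiple of 3 in {0..100} with h(x) = x mod m,
// and report how many distinct table slots end up used.
class PrimeModDemo {
    static int slotsUsed(int m) {
        Set<Integer> slots = new HashSet<>();
        for (int k = 0; k <= 100; k += 3)
            slots.add(k % m);
        return slots.size();
    }
}
```

With m = 12 the keys crowd into just 4 slots (0, 3, 6, 9); with the prime m = 13 they spread over all 13 slots, because 3 and 13 share no common factor.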

Collision Handling
◼ When the hash values of two keys are the same (i.e.,
the two keys need to be stored in the same location),
we have a collision.
◼ Collisions can happen even if we design our hash
function carefully, since N is much larger than m.
◼ There are two approaches to resolve collisions.
◆ separate chaining -- we can convert the hash table to a table
of linked lists
◆ open addressing -- we can relocate the key to a different
entry in case of collision

Collision Handling – Separate Chaining

◼ A hash table becomes a table of linked lists.


◼ To insert a key K, we compute h(K). If T[h(K)] contains a null
pointer, we initialize this entry to point to a linked list that
contains K alone. If T[h(K)] is a non-empty list, we just insert K
at the beginning of this list.
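The insertion rule above (new keys go to the front of the list at T[h(K)]) can be sketched as follows for integer keys; the class name and the division hash are illustrative choices, not prescribed by the slides:

```java
import java.util.LinkedList;

// Separate chaining: the hash table is an array of linked lists.
class ChainedHashTable {
    private final LinkedList<Integer>[] table;
    private final int m;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        this.m = m;
        table = new LinkedList[m];
    }

    private int h(int key) { return ((key % m) + m) % m; }

    void insert(int key) {
        if (table[h(key)] == null)
            table[h(key)] = new LinkedList<>();   // first key in this slot
        table[h(key)].addFirst(key);              // otherwise: prepend
    }

    boolean find(int key) {
        LinkedList<Integer> list = table[h(key)];
        return list != null && list.contains(key);
    }

    void delete(int key) {
        LinkedList<Integer> list = table[h(key)];
        if (list != null) list.remove(Integer.valueOf(key));
    }
}
```

With m = 101, the colliding keys 5 and 106 (both hash to slot 5) simply share one short chain.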

Separate Chaining
◼ To delete a key K, we compute h(K). Then search for
K within the list at T[h(K)]. Delete K if it is found.

◼ Assume that we will be storing n keys. Then we
should make m the smallest prime number larger
than n. If the hash function works well, the number
of keys in each linked list will be a small constant.
Therefore, we expect that each search, insert, and
delete operation can be done in constant time.
◼ For example, if we want to store 100 keys, we should
use m = 101, which is the smallest prime number
larger than 100.

Collision Handling – Open Addressing
◼ Separate chaining has the disadvantage of using linked
lists.
◼ An alternative method is to relocate the key K to be
inserted if it collides with an existing key. That is, we
store K at an entry different from h(K).
◼ Two issues:
◆ What is the relocation scheme?
◆ How to search for K later?
◼ Two common methods for resolving a collision in open
addressing :
◆ Linear probing
◆ Double hashing

Linear Probing
◼ Insertion
◆ Let K be the new key to be inserted. We compute h(K).
◆ For i = 0 to m-1,
1. compute L = (h(K) + i) % m
2. if T[L] is empty, then we put K there and stop.
◆ If we cannot find an empty entry to put K, it means that
the table is full and we should report an error.

Linear Probing
◼ Searching
◆ Let K be the key to be searched. We compute h(K).
◆ For i = 0 to m-1,
1. compute L = (h(K) + i) % m
2. If T[L] contains K, then we are done and we can stop.
3. If T[L] is empty, then K is not in the table and we can stop
too. (If K were in the table, it would have been placed in
T[L] by our insertion strategy.)
◆ If we cannot find K at the end of the for-loop, we have scanned
the entire table and so we can report that K is not in the table.

Linear Probing - Example
(table size m = 11, with h(K) = K % 11)

h(21)=10
h(42)=9
h(31)=9
h(12)=1
h(54)=10
h(20)=9
Exercise
• Perform linear probing with values 50, 700, 76, 85, 92,
73, 101 using a 7-element hash table
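One way to work the exercise in code, assuming the division hash h(k) = k % 7 (the exercise only specifies a 7-element table, so the hash function is an assumption; the class name is illustrative):

```java
// Linear probing insertion, exactly as in the pseudocode above:
// probe h(k), h(k)+1, h(k)+2, ... (mod m) until an empty slot is found.
// null marks an empty slot.
class LinearProbing {
    static Integer[] insertAll(int[] keys, int m) {
        Integer[] table = new Integer[m];
        for (int key : keys) {
            for (int i = 0; i < m; i++) {
                int L = (key % m + i) % m;
                if (table[L] == null) { table[L] = key; break; }
            }
        }
        return table;
    }
}
```

Under that assumption the exercise's values fill the table as [700, 50, 85, 92, 73, 101, 76]: 50 lands in slot 1, 700 in 0, 76 in 6, then 85, 92, 73 and 101 all collide at slots 1 or 3 and probe forward.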
Quiz #1
• Assume a hash table that stores strings
• A possible hash function is the string’s length:
h(x) = x.length()
• Used by early versions of PHP
• Is this a good hash function?
Quiz #2
• Suppose the hash function is h(x) = x mod 9. We have the
following hash table, implemented using linear probing.

index: 0   1   2   3   4   5   6   7   8
key:   9   18  –   12  3   14  4   21  –

• In which order could the elements have been added to the
table?
A. 9, 14, 4, 18, 12, 3, 21
B. 12, 3, 14, 18, 4, 9, 21
C. 12, 14, 3, 9, 4, 18, 21
D. 9, 12, 14, 3, 4, 21, 18
E. 12, 9, 18, 3, 14, 21, 4
• Assume the following table of hash values.

  key:  A  B  C  D  E  F  G
  hash: 5  2  5  1  4  1  3

• Which of the following could be the contents of the linear-probing
array if the keys are inserted in some order?

  index: 0  1  2  3  4  5  6
  A.     A  F  D  B  G  E  C
  B.     F  A  D  B  G  E  C
Linear Probing – Primary Clustering
◼ We call a block of contiguously occupied
table entries a cluster.
◼ On average, when we insert a new key K,
we may hit the middle of a cluster.
◆ Therefore, the time to insert K would be
proportional to half the size of the cluster.
That is, the larger the cluster, the slower
the performance.
Linear Probing – Disadvantages
◼ Once h(K) falls into a cluster, this cluster will
definitely grow in size by one. Thus, this may worsen
the performance of insertion in the future.
◼ If two clusters are separated by only one empty entry,
then inserting one key into that entry can merge the
two clusters together.
◆ The cluster size can increase drastically by a single
insertion.
◆ This means that the performance of insertion (searching)
can deteriorate drastically after a single insertion.

Double Hashing
◼ To alleviate the problem of primary clustering, when
resolving a collision, we should examine alternative
positions in a more “random” fashion. To this end,
we work with two hash functions h and h2.
◼ Insertion
◆ Let K be the new key to be inserted. We compute h(K).
◆ For i = 0 to m-1,
1. compute L = (h(K) + i * h2 (K)) % m
2. if T[L] is empty, then we put K there and stop.
◆ If we cannot find an empty entry to put K, it means that the
table is full and we should report an error.

Double Hashing
◼ Searching
◆ Let K be the key to be searched. We
compute h(K).
◆ For i = 0 to m-1,
1. compute L = (h(K) + i * h2 (K)) % m
2. if T[L] contains K, then we are done and we can stop.
3. If T[L] is empty, then K is not in the table and we can
stop too. (If K were in the table, it would have been
placed in T[L] by our insertion strategy.)
◆ If we cannot find K at the end of the for-loop, we
have scanned the entire table and so we can
report that K is not in the table.

Double Hashing – Choice of h2
◼ For any key K, h2(K) must be relatively prime to the
table size m. Otherwise, we will only be able to
examine a fraction of the table entries.
◆ For example, if h(K) = 0 and h2(K) = m/2, then we can only
examine the entries T[0], T[m/2], and nothing else!

◼ One solution is to make m prime, choose r to be a


prime smaller than m, and set
h2(K) = r - (K % r).

Double Hashing – Example
(table size m = 11, with h(K) = K % 11 and h2(K) = 7 − (K % 7))

h(21)=10; h2(21)=7
h(42)=9;  h2(42)=7
h(31)=9;  h2(31)=4
h(12)=1;  h2(12)=2
h(54)=10; h2(54)=2
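The example can be replayed in code. The parameters m = 11, h(K) = K % 11 and h2(K) = 7 − (K % 7) are inferred from the listed hash values rather than stated on the slide, and the class name is illustrative:

```java
// Double hashing insertion: probe L = (h(K) + i*h2(K)) % m for i = 0, 1, ...
// h2(K) = r - (K % r), with r a prime smaller than the prime table size m.
class DoubleHashing {
    static Integer[] insertAll(int[] keys, int m, int r) {
        Integer[] table = new Integer[m];   // null marks an empty slot
        for (int key : keys) {
            int h = key % m;
            int h2 = r - (key % r);
            for (int i = 0; i < m; i++) {
                int L = (h + i * h2) % m;
                if (table[L] == null) { table[L] = key; break; }
            }
        }
        return table;
    }
}
```

Inserting 21, 42, 31, 12, 54 in that order: 21→10, 42→9, 31 collides at 9 and jumps by h2=4 to slot 2, 12→1, and 54 collides at 10, then at 1, and settles in slot 3.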
Exercise
• Perform double hashing with values 50, 700, 76, 85,
92, 73, 101 using
h(k) = k % 11
h2(k) = 7 − (k % 7)
Deletion in Open Addressing
◼ We cannot just delete a key in open
addressing strategies.
◼ Otherwise, suppose that the table
stores three keys K1, K2 and K3 that
have identical probe sequences:
K1 is stored at h(K1), K2 at the
second probe location, and K3 at the
third probe location.
◼ If K2 is to be deleted and we make
the slot containing K2 empty, then
when we search for K3, we will find
an empty slot before finding K3.
◼ So, we will report that K3 does not
exist in the table !!!
Deletion in Open Addressing
◼ Instead, we add an extra bit to
each entry to indicate whether the
key stored there has been deleted
or not.
◼ Such a delete bit serves two
purposes:
◆ searching – we should NOT stop at
a deleted entry
◆ insertion – that position is logically
empty (though the deleted key is still
in the hash table), so we can
overwrite this entry
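The delete bit can be sketched as follows on top of linear probing; the class name and the division hash h(k) = k % m are illustrative choices:

```java
// Lazy deletion ("tombstones") for open addressing with linear probing.
// slot[L] == null means the slot was never used; deleted[L] is the delete bit.
class TombstoneTable {
    private final Integer[] slot;
    private final boolean[] deleted;
    private final int m;

    TombstoneTable(int m) {
        this.m = m;
        slot = new Integer[m];
        deleted = new boolean[m];
    }

    void insert(int key) {
        for (int i = 0; i < m; i++) {
            int L = (key % m + i) % m;
            if (slot[L] == null || deleted[L]) {  // logically empty: reuse it
                slot[L] = key;
                deleted[L] = false;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    boolean find(int key) {
        for (int i = 0; i < m; i++) {
            int L = (key % m + i) % m;
            if (slot[L] == null) return false;    // truly empty: stop searching
            if (!deleted[L] && slot[L] == key) return true;
            // deleted entry: do NOT stop, keep probing
        }
        return false;
    }

    void delete(int key) {
        for (int i = 0; i < m; i++) {
            int L = (key % m + i) % m;
            if (slot[L] == null) return;
            if (!deleted[L] && slot[L] == key) { deleted[L] = true; return; }
        }
    }
}
```

For brevity this sketch lets insert reuse the first tombstone it meets, which could create a duplicate if the key already appears later in the probe sequence; a fuller implementation remembers the first tombstone but finishes the search before writing.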
