Hashing

Why Hashing

The Internet has grown to millions of users generating terabytes of content every day.
According to internet data tracking services, the amount of content on the internet doubles every six months.
With this kind of growth, it is impossible to find anything on the internet unless we develop new data structures and algorithms for storing and accessing data.


So what is wrong with traditional data structures like arrays and linked lists?


Suppose we have a very large data set stored in an array. Then, in the case of:
Sorted array: binary search -> O(log n)
Unsorted array: linear search -> O(n)
We therefore discuss a technique called hashing that allows us to update and retrieve any entry in constant time, O(1), on average.
Constant-time, or O(1), performance means that the time to perform the operation does not depend on the data size n.
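The three costs above can be compared with a minimal sketch. The sample data and the helper names `linear_search` and `binary_search` are assumptions made for illustration; Python's built-in `dict` stands in for a hash table.

```python
from bisect import bisect_left

def linear_search(items, target):
    """Scan every element in turn: O(n) comparisons in the worst case."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """Halve the search range at each step: O(log n). Input must be sorted."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = [42, 7, 19, 88, 3]            # hypothetical sample data
sorted_data = sorted(data)           # [3, 7, 19, 42, 88]

print(linear_search(data, 88))       # 3
print(binary_search(sorted_data, 88))  # 4

# A dict is a hash table: the average lookup cost does not grow with its size.
index_of = {value: i for i, value in enumerate(data)}
print(index_of[88])                  # 3
```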

The Dictionary or Map Data Structure
In a mathematical sense, a map is a relation between two sets.

Map function

We can define a map M as a set of pairs of the form (key, value), where, given a key, we can find its value using a function that maps keys to values.

Hashing

Initially, we can assume that the hash table structure is merely an array of some fixed size containing the items.
A stored item needs a data member, called the key, that is used to compute the index value for the item.
The key could be an integer, a string, etc., e.g. a name or ID that is part of a large employee structure.
The size of the array is TableSize.
The items stored in the hash table are indexed by values from 0 to TableSize - 1.
Each key is mapped to some number in the range 0 to TableSize - 1.
This mapping is called a hash function.

Array as a Dictionary or Map

The concept of a hash table generalizes the idea of an array: the key does not have to be an integer.
We can have a name as a key or, for that matter, any object.

The trick is to find a hash function that computes an index, so that an object can be stored at a specific location in the table where it can easily be found.

Example of Hashing

We want to store the set of strings {abc, def, ghi} in a table.
Fast access is required.
We are not concerned about ordering.

Suppose we assign a = 1, b = 2, etc. to all alphabetical characters.
We compute a number for each string by summing the values of its characters:
abc = 1 + 2 + 3 = 6
def = 4 + 5 + 6 = 15
ghi = 7 + 8 + 9 = 24

If we assume a table of size 5 to store these strings, we can compute the location of each string by taking its sum mod 5. We then store abc at 6 mod 5 = 1, def at 15 mod 5 = 0, and ghi at 24 mod 5 = 4, i.e. in locations 1, 0 and 4.

Here the simple hash function is: sum of the characters mod table size.
Hashing therefore seems to be a great way to store the (key, value) pairs of a dictionary in a table.
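The example can be sketched in a few lines of Python. The function names `letter_value` and `simple_hash` are made up for illustration, and the sketch assumes lowercase keys only.

```python
def letter_value(ch):
    """Map 'a' -> 1, 'b' -> 2, ..., 'z' -> 26 (lowercase letters only)."""
    return ord(ch) - ord("a") + 1

def simple_hash(s, table_size):
    """Sum of the character values mod the table size."""
    return sum(letter_value(ch) for ch in s) % table_size

TABLE_SIZE = 5
table = [None] * TABLE_SIZE
for word in ["abc", "def", "ghi"]:
    table[simple_hash(word, TABLE_SIZE)] = word

print(table)  # ['def', 'abc', None, None, 'ghi']
```

This sketch overwrites a cell if two strings hash to the same index, so it only works as long as no collision occurs.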
Problem with Hashing

Disadvantage of the method


If the set contains permutations of the same letters, such as abc and bac, they produce the same sum and hence the same hash value.
In this case the strings hash to the same location, creating what we call a collision.
Question 1: How do we pick a good hash function?
Question 2: How do we deal with collisions?

Hash function

Problems:
Keys may not be numeric.
The number of possible keys is much larger than the space available in the table.
Different keys may map to the same location: the hash function is not one-to-one, which causes collisions.
If there are too many collisions, the performance of the hash table suffers dramatically.

Factors Affecting Hash Table Design
Hash function
Table size
Usually fixed at the start
Collision handling scheme

Hash Function Properties

Hash Function - Effective Use of Table Size

Different Ways to Design a Hash Function for String Keys
It is difficult to find a perfect hash function, that is, a function that has no collisions.
But we can do better by using a hash function as follows. Suppose we need to store a dictionary in a hash table.
A dictionary is a set of strings, and we can define a hash function as follows. Assume that S is a string of length n and S = S1 S2 ... Sn. Then

H(S) = S1 + p*S2 + p^2*S3 + ... + p^(n-1)*Sn

where p is a prime number. If p is larger than every character value, each string leads to a unique number. When we take the number mod TableSize, collisions are still possible, but there may be fewer of them than with a naive hash function like the sum of the characters.
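A minimal sketch of such a prime-based string hash, assuming it has the polynomial form H(S) = S1 + p*S2 + ... + p^(n-1)*Sn. The choice p = 31, the table size 13 and the letter values a = 1, b = 2, ... are illustrative assumptions, not values given in the slides.

```python
def letter_value(ch):
    """Map 'a' -> 1, 'b' -> 2, ..., 'z' -> 26 (lowercase letters only)."""
    return ord(ch) - ord("a") + 1

def sum_hash(s, table_size):
    """Naive hash: sum of the character values mod the table size."""
    return sum(letter_value(ch) for ch in s) % table_size

def poly_hash(s, table_size, p=31):
    """Polynomial hash: S1 + p*S2 + ... + p^(n-1)*Sn, mod the table size."""
    total = 0
    power = 1                      # p^0 for the first character
    for ch in s:
        total += letter_value(ch) * power
        power *= p
    return total % table_size

TABLE_SIZE = 13                    # hypothetical table size
print(sum_hash("abc", TABLE_SIZE), sum_hash("bac", TABLE_SIZE))    # 6 6
print(poly_hash("abc", TABLE_SIZE), poly_hash("bac", TABLE_SIZE))  # 8 4
```

The permutations abc and bac collide under the sum but not under the polynomial hash in this table. With a table of size 5, both strings still land in slot 1, which shows that taking the value mod TableSize can reintroduce collisions.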

Implementation of hash tables

Hash tables consist of two components:
a bucket array and
a hash function.
Bucket Array

Each cell is thought of as a bucket or container that holds key-element pairs.
In an array A of size N, an element e with key k is inserted in A[k].

[Figure: bucket array indexed by keys 000-000-0000 through 999-999-9999; the cell for key 401-863-7639 holds Roberto, and the other cells shown are (null).]


Collision Resolution

If, when an element is inserted, it hashes to the same value as an already inserted element, then we have a collision and need to resolve it.
There are several methods for dealing with this:
Separate chaining
Open addressing

Separate Chaining

The idea is to keep a list of all elements that hash to the same value.
The array elements are pointers to the first nodes of the lists.
A new item is inserted at the front of the list.
Advantages:
Better space utilization for large items.
Simple collision handling: searching a linked list.
Overflow: we can store more items than the hash table size.
Deletion is quick and easy: deletion from the linked list.
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.

Index  Chain (front of list first)
0      0
1      81 -> 1
2      (empty)
3      (empty)
4      64 -> 4
5      25
6      36 -> 16
7      (empty)
8      (empty)
9      49 -> 9

Operations

Initialization: all entries are set to NULL.
Find:
Locate the cell using the hash function.
Do a sequential search on the linked list in that cell.
Insertion:
Locate the cell using the hash function.
If the item does not already exist, insert it as the first item in the list.
Deletion:
Locate the cell using the hash function.
Delete the item from the linked list.
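These operations can be sketched as follows. This is a simplified illustration, not a reference implementation: the class name `ChainedHashTable` is made up, Python lists stand in for linked lists, an empty list stands in for NULL, and keys are assumed to be non-negative integers hashed with key % table size.

```python
class ChainedHashTable:
    """Separate chaining: each cell holds a list of the keys that hash to it."""

    def __init__(self, table_size=10):
        self.table_size = table_size
        # Initialization: every cell starts with an empty chain.
        self.cells = [[] for _ in range(table_size)]

    def _hash(self, key):
        return key % self.table_size

    def find(self, key):
        # Locate the cell, then search its chain sequentially.
        return key in self.cells[self._hash(key)]

    def insert(self, key):
        chain = self.cells[self._hash(key)]
        if key not in chain:
            chain.insert(0, key)   # new items go to the front of the chain

    def delete(self, key):
        chain = self.cells[self._hash(key)]
        if key in chain:
            chain.remove(key)

table = ChainedHashTable(10)
for key in [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]:
    table.insert(key)

print(table.cells[1])  # [81, 1]
print(table.cells[6])  # [36, 16]
```

Inserting the keys 0, 1, 4, ..., 81 with hash(key) = key % 10 yields the chains of the earlier example, with the most recently inserted key at the front of each chain.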

Cost of searching

Cost = constant time to evaluate the hash function + time to traverse the list.

Disadvantages of Separate Chaining
Separate chaining has the disadvantage of using linked lists.
It requires the implementation of a second data structure.

Open Addressing

More formally:
Cells h0(x), h1(x), h2(x), ... are tried in succession, where hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
The function f is the collision resolution strategy.
There are two common collision resolution strategies:
Linear probing
Double hashing

Linear Probing

In linear probing, collisions are resolved by sequentially scanning the array (with wraparound) until an empty cell is found, i.e. f is a linear function of i, typically f(i) = i.
Example:
Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
Table size is 10.
Hash function is hash(x) = x mod 10.
f(i) = i.
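The example can be traced with a short sketch. The function name `linear_probe_insert` is made up for illustration, and keys are assumed to be non-negative integers.

```python
def linear_probe_insert(table, key):
    """Insert key using h_i(x) = (hash(x) + i) mod TableSize, with f(i) = i."""
    size = len(table)
    home = key % size                  # hash(x) = x mod TableSize
    for i in range(size):
        slot = (home + i) % size       # wrap around at the end of the array
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 10
for key in [89, 18, 49, 58, 9]:
    linear_probe_insert(table, key)

print(table)  # [49, 58, 9, None, None, None, None, None, 18, 89]
```

Keys 89 and 18 go to their home cells 9 and 8. Key 49 collides at 9 and wraps to 0, 58 probes 8, 9, 0 and lands in 1, and 9 probes 9, 0, 1 and lands in 2.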
18 mod 13=5
41 mod 13=2
22 mod 13=9
44 mod 13=5
59 mod 13=7
32 mod 13=6
31 mod 13=5
73 mod 13=8

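For h1, with h1(x) = x mod 13:
18 mod 13 = 5
41 mod 13 = 2
22 mod 13 = 9
44 mod 13 = 5 [collision: use h2]
59 mod 13 = 7
32 mod 13 = 6
31 mod 13 = 5 [collision: use h2]
73 mod 13 = 8 [collision: use h2]
For h2, with h2(x) = 8 - (x mod 8):
44 mod 8 = 4, 8 - 4 = 4
31 mod 8 = 7, 8 - 7 = 1
73 mod 8 = 1, 8 - 1 = 7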
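A minimal sketch of double hashing for these keys, assuming h1(x) = x mod 13 and h2(x) = 8 - (x mod 8) as suggested by the numbers above, and assuming the i-th probe is (h1(x) + i * h2(x)) mod 13. The function names are made up, and the resulting table is what this sketch produces; the original figures are not available to confirm it.

```python
def h1(key, size):
    """Primary hash: the home cell."""
    return key % size

def h2(key):
    """Secondary hash: the step size, always between 1 and 8."""
    return 8 - (key % 8)

def double_hash_insert(table, key):
    """Probe (h1 + i * h2) mod TableSize for i = 0, 1, 2, ..."""
    size = len(table)
    start, step = h1(key, size), h2(key)
    for i in range(size):
        slot = (start + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free cell found along the probe sequence")

table = [None] * 13
for key in [18, 41, 22, 44, 59, 32, 31, 73]:
    double_hash_insert(table, key)

print(table)
# [44, None, 41, 73, None, 18, 32, 59, 31, 22, None, None, None]
```

Because 13 is prime and the step is never a multiple of 13, each probe sequence visits every cell. Key 44 probes 5, 9, 0; key 31 probes 5, 6, 7, 8; key 73 probes 8, 2, 9, 3.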

Disadvantages of Open Addressing
In an open addressing hashing system, all the data go inside the table.
Thus, a bigger table is needed.
If a collision occurs, alternative cells are tried until an empty cell is found.

Example 1

Give the contents of the hash table that results when you insert items with the keys D E M O C R A T in that order into an initially empty table of M = 5 lists, using separate chaining with unordered lists.
Use the hash function 11k mod M to transform the kth letter of the alphabet into a table index, e.g., hash(I) = hash(9) = 99 % 5 = 4.

Example 2

Give the contents of the hash table that results when you insert items with the keys R E P U B L I C A N in that order into an initially empty table of size M = 16 using linear probing. Use the hash function 11k mod M to transform the kth letter of the alphabet into a table index.

Example 3

Give the contents of the hash table that results when you insert items with the keys A N O T H E R X M P L in that order into an initially empty table of size M = 16 using double hashing. Use the hash function 11k mod M for the initial probe and the second hash function (k mod 3) + 1 for the search increment.

Example

For the following two questions, use the following values:
67 46 88 91 123 141 152 155 178 288 390 399 465 572 621 734
Draw a diagram to show how the values are inserted into a hash table with 20 positions. Use the division method of hashing and the linear probing method of resolving collisions.

Example 4

Step 1: Apply the Division Method to Get the Index

67 % 20 = 7
46 % 20 = 6
88 % 20 = 8
91 % 20 = 11
...
734 % 20 = 14

Step 2: Create Table Using Linear Probing Method
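The table can be built with a short sketch of both steps. The variable names are illustrative, and the printed table is what this sketch produces under the stated method (value % 20, then the next free cell with wraparound).

```python
values = [67, 46, 88, 91, 123, 141, 152, 155, 178, 288,
          390, 399, 465, 572, 621, 734]
TABLE_SIZE = 20
table = [None] * TABLE_SIZE

for value in values:
    slot = value % TABLE_SIZE            # Step 1: division method
    while table[slot] is not None:       # Step 2: linear probing on collision
        slot = (slot + 1) % TABLE_SIZE
    table[slot] = value

for index, value in enumerate(table):
    print(index, value)
```

Three values are displaced by collisions: 288 moves from cell 8 to 9, 572 from 12 to 13, and 621 from 1 to 2. Cells 0, 4, 16 and 17 stay empty.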

Example

Draw a diagram to show how the values are inserted into a hash table that uses the hash function key % 10 to determine into which of ten chains to put the value.


Step 1: Apply the Division Method to Get the Index


67 % 10 = 7
46 % 10 = 6
88 % 10 = 8
91 % 10 = 1
...
734 % 10 = 4

Hash Tables
Ordering is not important.
It is not possible to search the records in sorted order.
Finding the record with the minimum or maximum value is difficult.

Hashing

As we saw with binary search, certain data structures such as a binary search tree can help improve the efficiency of searches.
From linear search to binary search, we improved our search efficiency from O(n) to O(log n).
We now present a new data structure, called a hash table, that improves our search efficiency to O(1), or constant time, on average.
The worst-case search time for a hash table is O(n); however, with a good hash function the probability of that happening is very small.
The best and average cases are O(1).
Summary
Hash tables store a collection of records with keys.
The location of a record depends on the hash value of the record's key.
When a collision occurs under open addressing, the next available location is used.
Searching for a particular key is generally quick.
When an item is deleted from an open-addressing table, its location must be marked in a special way, so that later searches know that the spot used to be occupied.
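The special deletion marker can be sketched for a linear probing table as follows. The names `LinearProbingTable` and `DELETED` are made up for illustration, keys are assumed to be non-negative integers, and `insert` assumes the key is not already present.

```python
EMPTY = None
DELETED = object()   # hypothetical marker for a cell that used to be occupied

class LinearProbingTable:
    def __init__(self, size=10):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield the cells to try, starting at key mod TableSize."""
        size = len(self.slots)
        for i in range(size):
            yield (key % size + i) % size

    def insert(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY or self.slots[slot] is DELETED:
                self.slots[slot] = key
                return slot
        raise RuntimeError("hash table is full")

    def find(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY:
                return -1              # a never-used cell ends the search
            if self.slots[slot] is not DELETED and self.slots[slot] == key:
                return slot
        return -1

    def delete(self, key):
        slot = self.find(key)
        if slot != -1:
            self.slots[slot] = DELETED  # mark the cell, do not empty it

table = LinearProbingTable(10)
for key in [89, 18, 49]:
    table.insert(key)      # 49 collides with 89 at cell 9 and wraps to cell 0
table.delete(89)           # cell 9 is marked as deleted
print(table.find(49))      # 0: the search continues past the marked cell
```

If cell 9 were simply reset to empty, the search for 49 would stop there and wrongly report that the key is absent.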