DECEMBER 1, 2017
LATIF SIDDIQ SUNNY
Hashing
Hashing is a technique that is used to uniquely identify a specific object from a group of similar objects.
Suppose we have a large table of data in which we want to insert, remove, and search for items.
If we keep the data in a sorted array, an item can be found in O(log(n)) time using Binary Search, but
insert and remove operations become costly because the sorted order has to be maintained.
If we use a sorted or unsorted linked list, search takes O(n) time, and so does any insert or remove
that must first locate a position or an item.
With a Balanced Binary Search Tree (for example, an AVL Tree or a Red-Black Tree), we get moderate
search, insert, and delete times. These operations are guaranteed to run in O(log(n)) time.
Insertion, search, and removal in O(log(n)) are good, but as the table grows, even this cost becomes
significant. We would like to be able to find an item in O(1) time. For this, we use Hashing.
Hashing is therefore a technique suited to workloads dominated by insertion and search: it lets us insert
data and search for it in O(1) time on average. With this technique, we store the data in a Hash table.
Hash Function
A hash function maps a big number or a string to a small integer that can be used as an index into the
hash table. A good hash function should be:
1. Efficiently computable.
2. Uniform in how it distributes the keys (each table position is equally likely for each key).
The hash function maps the search key to an index; the index gives the place in the hash table
where the corresponding record should be stored and where it will later be found.
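As a rough illustration of the two properties above, the sketch below maps a string to a table index with a simple polynomial hash. The function name `string_hash`, the base 31, and the table size 101 are illustrative assumptions and are not part of these notes.

```python
def string_hash(key: str, m: int, base: int = 31) -> int:
    """Map a string to an index in range(m) with a polynomial hash."""
    h = 0
    for ch in key:
        # Reduce modulo m at every step so the running value stays small.
        h = (h * base + ord(ch)) % m
    return h


index = string_hash("apple", 101)
print(0 <= index < 101)  # prints: True
```

The same key always gives the same index, and the cost is one pass over the characters of the key.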
Hashing Techniques
We can use a Direct Access Table for hashing. We build a large array and use the key itself as the
index, so the hashing function is h(k) = k.
Advantage:
Disadvantage:
2. We cannot store large key values, because there is a limit on how large an array can be.
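A direct access table is usually understood as an array indexed by the key itself. The sketch below assumes small non-negative integer keys; the class name `DirectAccessTable` and the `max_key` parameter are hypothetical names chosen for this example.

```python
class DirectAccessTable:
    """Direct access table: the key itself is the array index."""

    def __init__(self, max_key: int):
        # One slot per possible key; None marks an empty slot.
        self.slots = [None] * (max_key + 1)

    def insert(self, key: int, value) -> None:
        self.slots[key] = value

    def search(self, key: int):
        return self.slots[key]

    def remove(self, key: int) -> None:
        self.slots[key] = None


table = DirectAccessTable(max_key=99)
table.insert(42, "answer")
print(table.search(42))  # prints: answer
```

Every operation is a single array access, but the array needs one slot for every possible key, which is the size limitation noted above.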
Since the array cannot be arbitrarily large, we can use a small array and modify the hash function:
h(k) = k % m, where k is the key and m is the size of the array
But now there is a chance of collision. For example, let m = 7. If we insert 6, the hash value of 6 is 6,
so we store 6 at index 6 of the array. If we then insert 13, the hash value of 13 is 13 % 7 = 6, but this
slot is not empty, so a collision occurs. Although we have overcome the limitation of a very large array,
we cannot avoid such collisions.
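The collision in the example above can be reproduced in a few lines. This is only a sketch: the helper name `h` follows the formula in the text, while the `collisions` list is an illustrative addition.

```python
def h(k: int, m: int) -> int:
    """Modulo hash: map key k to a slot in range(m)."""
    return k % m


m = 7
table = [None] * m
collisions = []
for key in (6, 13):
    slot = h(key, m)
    if table[slot] is None:
        table[slot] = key
    else:
        # The slot is already taken by a different key.
        collisions.append((key, table[slot], slot))

print(collisions)  # prints: [(13, 6, 6)]
```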
Separate chaining
In this method we use the same hash function described above, but this time the array is an array of
head pointers, one linked list per slot. If a slot holds no data, its head is null. Whenever a key
hashes to a slot, we insert it at the end of that slot's linked list.
With this scheme, colliding keys no longer overwrite each other, but within a slot we have to search the
chain linearly.
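A minimal sketch of separate chaining, assuming integer keys and the hash function h(k) = k % m. The class and method names are hypothetical, and the sketch does not check for duplicate keys.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None


class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.heads = [None] * m  # None means the chain is empty

    def _slot(self, key: int) -> int:
        return key % self.m

    def insert(self, key: int) -> None:
        slot = self._slot(key)
        node = Node(key)
        if self.heads[slot] is None:
            self.heads[slot] = node
            return
        cur = self.heads[slot]
        while cur.next is not None:
            cur = cur.next
        cur.next = node  # append at the end of the chain

    def search(self, key: int) -> bool:
        cur = self.heads[self._slot(key)]
        while cur is not None:  # linear scan of one chain only
            if cur.key == key:
                return True
            cur = cur.next
        return False

    def remove(self, key: int) -> bool:
        slot = self._slot(key)
        prev, cur = None, self.heads[slot]
        while cur is not None:
            if cur.key == key:
                if prev is None:
                    self.heads[slot] = cur.next
                else:
                    prev.next = cur.next
                return True
            prev, cur = cur, cur.next
        return False


table = ChainedHashTable(7)
for key in (6, 13, 20):
    table.insert(key)
print(table.search(13), table.search(27))  # prints: True False
```

All three keys land in slot 6 and form one chain, so the search for 27 walks that whole chain before reporting that the key is absent.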
Advantages:
1. Simple to implement.
2. The hash table never fills up; we can always add more elements to a chain.
3. Less sensitive to the hash function or load factors.
4. It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
Disadvantages:
1. Cache performance of chaining is not good, as keys are stored in linked lists. Open addressing
provides better cache performance, as everything is stored in the same table.
2. Wastage of space (some parts of the hash table are never used).
3. If the chain becomes long, then search time can become O(n) in worst case.
4. Uses extra space for links.
Analysis:
Since the chains do not all have the same length, we work with the expected chain length.
$$l(x) = \sum_{y \in T} c_{x,y}$$
where x and y are both in T, and T is the set of elements in the slots of the table.
$$E(l(x)) = E\Big(\sum_{y \in T} c_{x,y}\Big) = \sum_{y \in T} E(c_{x,y})$$
Now,
$$E(c_{x,y}) = 1 \cdot P(c_{x,y} = 1) + 0 \cdot P(c_{x,y} = 0) = 1 \cdot P(h(x) = h(y)) = \frac{1}{m}$$
Here P(h(x) = h(y)) = 1/m is the probability that x and y get the same slot, that is, have the same hash value.
So, now
$$E(l(x)) = \sum_{y \in T} \frac{1}{m} = \frac{n}{m} = \alpha$$
where n is the number of elements in T and α is the Load Factor.
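The load factor can also be checked numerically. The sketch below is an illustration under assumed values (m = 10 slots, n = 50 random keys, hash k % m): it counts how many keys land in each slot and compares the average chain length with n/m. The helper name `chain_lengths` is hypothetical.

```python
import random


def chain_lengths(keys, m):
    """Return the number of keys that land in each of the m slots."""
    lengths = [0] * m
    for k in keys:
        lengths[k % m] += 1
    return lengths


random.seed(0)  # fixed seed so the run is repeatable
m, n = 10, 50
keys = random.sample(range(1_000_000), n)
lengths = chain_lengths(keys, m)

alpha = n / m
average = sum(lengths) / m
print(alpha, average)  # prints: 5.0 5.0
```

Individual chains may be longer or shorter than 5, but their average over all slots is always n/m.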
Search Complexity
$$\frac{1}{n}\left(n + \frac{n(n-1)}{2m}\right) = 1 + \frac{n-1}{2m} < 1 + \frac{n}{2m} = 1 + \frac{\alpha}{2}$$
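An unsuccessful search has to scan a whole chain, whose expected length is α. When n = m, α = 1, and in general whenever α is bounded by a constant the search complexity is O(c) for a constant c, that is, O(1).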
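As a small worked example of the bound above, the sketch below evaluates 1 + (n - 1)/(2m) and compares it with 1 + α/2. The function name and the values n = 9, m = 4 are assumptions chosen so the arithmetic is exact.

```python
def expected_successful_search(n: int, m: int) -> float:
    """Expected cost of a successful search: 1 + (n - 1) / (2m)."""
    return 1 + (n - 1) / (2 * m)


n, m = 9, 4
alpha = n / m
cost = expected_successful_search(n, m)
print(cost, 1 + alpha / 2)  # prints: 2.0 2.125
```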
Open Addressing
Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method
a hash collision is resolved by probing, or searching through alternate locations in the array (the probe
sequence) until either the target record is found, or an unused array slot is found, which indicates that
there is no such key in the table.
Insert(k): Keep probing until an empty slot is found. Once an empty slot is found, insert k.
Search(k): Keep probing until a slot whose key equals k is found or an empty slot is reached.
Delete(k): If we simply empty the slot of a deleted key, a later search may stop too early and fail. So,
slots of deleted keys are marked specially as “deleted”.
Insert can place an item in a deleted slot, but search does not stop at a deleted slot.
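The three operations can be sketched as follows. This is one possible arrangement, not a definitive implementation: it assumes integer keys, the hash k % m, and a probe sequence that moves one slot at a time. The sentinel names `EMPTY` and `DELETED` and the class name are hypothetical, and duplicate keys are not checked.

```python
EMPTY = object()    # slot that has never held a key
DELETED = object()  # marker left behind by a removed key


class OpenAddressingTable:
    def __init__(self, m: int):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key: int):
        # Probe sequence: h(k), h(k) + 1, ... wrapping around the table.
        start = key % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key: int) -> bool:
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key
                return True
        return False  # table is full

    def search(self, key: int) -> bool:
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return False  # a never-used slot ends the search
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                return True
        return False

    def delete(self, key: int) -> bool:
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return False
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                self.slots[idx] = DELETED  # mark, do not empty
                return True
        return False


table = OpenAddressingTable(7)
for key in (6, 13, 20):
    table.insert(key)
table.delete(13)
print(table.search(20))  # prints: True
```

Keys 6, 13, and 20 end up in slots 6, 0, and 1. After 13 is deleted, the search for 20 still passes over slot 0 because it is marked as deleted rather than empty.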
Linear Probing
Let h(x) be the slot index computed using the hash function and S be the table size. On the i-th attempt (i = 0, 1, 2, ...), linear probing examines slot (h(x) + i) % S.
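Let E[T(m, n)] be the expected number of probes in an array of size m that holds n elements. Then
$$E[T(m, n)] = 1 + \frac{n}{m} E[T(m-1, n-1)]$$
where the 1 is for the hashing itself, and E[T(m-1, n-1)] applies when the current slot is filled, which happens with probability n/m; there are then n-1 elements left in an array of size m-1. Assuming inductively that E[T(m-1, n-1)] is at most (m-1)/((m-1)-(n-1)), we get
$$E[T(m, n)] \le 1 + \frac{n}{m} \cdot \frac{m-1}{(m-1)-(n-1)} < 1 + \frac{n}{m-n} = \frac{m}{m-n} = \frac{1}{1 - \frac{n}{m}} = \frac{1}{1-\alpha} = 1 + \alpha + \alpha^2 + \alpha^3 + \alpha^4 + \cdots$$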
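To get a feel for the bound 1/(1 - α), the sketch below evaluates it for a few load factors and compares it with partial sums of the series 1 + α + α² + .... The function names are hypothetical and the load factors are illustrative values.

```python
def expected_probes(alpha: float) -> float:
    """Upper bound 1 / (1 - alpha) on the expected number of probes."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return 1 / (1 - alpha)


def geometric_partial_sum(alpha: float, terms: int) -> float:
    """1 + alpha + alpha^2 + ... using the given number of terms."""
    return sum(alpha ** i for i in range(terms))


print(expected_probes(0.5))           # prints: 2.0
print(expected_probes(0.75))          # prints: 4.0
print(geometric_partial_sum(0.5, 4))  # prints: 1.875
```

The bound grows quickly as the table fills: a half-full table needs about 2 probes, a table that is three quarters full about 4.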
The main problem with linear probing is clustering: many consecutive occupied slots form groups, and it
starts taking longer to find a free slot or to search for an element.
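Clustering can be seen by counting probes during insertion. The sketch below assumes a table of size 10 and the hash k % 10; the helper name `insert_linear` is hypothetical.

```python
def insert_linear(table, key):
    """Insert key with linear probing; return the number of probes used."""
    m = len(table)
    for i in range(m):
        idx = (key % m + i) % m
        if table[idx] is None:
            table[idx] = key
            return i + 1
    raise RuntimeError("table is full")


table = [None] * 10
probes = [insert_linear(table, k) for k in (10, 20, 30, 1)]
print(probes)  # prints: [1, 2, 3, 3]
```

Keys 10, 20, and 30 all hash to slot 0 and build a cluster in slots 0 to 2. Key 1 hashes to a different slot, yet it still needs 3 probes because its home slot lies inside that cluster.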
Quadratic Probing
In this probing, h(x, i) = (h(x) + i*i) % m is used, where h(x) is the slot given by the hash function and i is the attempt number.
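Claim: if m is a prime, the first ceil(m/2) probes go to distinct slots.
Assume the first ceil(m/2) probes are not unique, so the i-th and j-th probes go to the same location for some 0 ≤ i < j < ceil(m/2). Then
(h(k) + i*i) % m = (h(k) + j*j) % m
i*i ≡ j*j (mod m)
i*i - j*j ≡ 0 (mod m)
(i+j)(i-j) ≡ 0 (mod m)
Since m is a prime, m must divide (i+j) or (i-j). But neither is divisible by m, because 0 < j-i ≤ i+j < m when i < j < ceil(m/2). This is a contradiction, so the first ceil(m/2) probes are all distinct.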
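The argument above can be checked by brute force for small table sizes. The sketch below is only a check, not a proof; the function names are hypothetical, and m = 7 (prime) and m = 8 (not prime) are example sizes.

```python
import math


def quadratic_probes(home: int, m: int, count: int):
    """First `count` slots visited by quadratic probing from slot `home`."""
    return [(home + i * i) % m for i in range(count)]


def first_half_distinct(m: int) -> bool:
    """Check that the first ceil(m/2) probes are distinct for every home slot."""
    half = math.ceil(m / 2)
    return all(
        len(set(quadratic_probes(home, m, half))) == half
        for home in range(m)
    )


print(quadratic_probes(3, 7, 4))  # prints: [3, 4, 0, 5]
print(first_half_distinct(7))     # prints: True
print(first_half_distinct(8))     # prints: False
```

For m = 8 the probes at i = 1 and i = 3 coincide (1 and 9 leave the same remainder), which shows why the primality of m matters.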
For an array of size m, m! probe sequences are possible. Linear and quadratic probing each generate only m
of them, one for each starting slot.
Double Hashing
In this probing, h(x, i) = (h(x) + i*hash2(x)) % m is used, where hash2 is a second hash function and i is the attempt number.
Double hashing requires more computation time as two hash functions need to be computed.
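A sketch of the probe sequence under double hashing. The second hash function used here, 1 + (x % (m - 1)), is an assumption made for this example (a common textbook choice) and is not taken from these notes; `hash1`, `hash2`, and `double_hash_probes` are hypothetical names.

```python
def hash1(x: int, m: int) -> int:
    return x % m


def hash2(x: int, m: int) -> int:
    # Assumed second hash; it never returns 0, so the probe always moves.
    return 1 + (x % (m - 1))


def double_hash_probes(x: int, m: int):
    """All m slots visited for key x, in probe order."""
    step = hash2(x, m)
    return [(hash1(x, m) + i * step) % m for i in range(m)]


print(double_hash_probes(6, 7))   # prints: [6, 0, 1, 2, 3, 4, 5]
print(double_hash_probes(13, 7))  # prints: [6, 1, 3, 5, 0, 2, 4]
```

Keys 6 and 13 collide on their first probe, but their step sizes differ, so they follow different sequences afterwards. With a prime m, every step size from 1 to m - 1 visits all m slots.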
Perfect Hashing
In this hashing, the data can be found in O(1) time even in the worst case. This is possible if
we have some domain knowledge about the data, for example when the set of keys is known in advance.
It builds on the same idea as double hashing, namely using a second hash function: wherever a collision
could occur, a second-step hash function gives each key a unique slot, so the data can be searched in O(1) time.
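The idea of exploiting known keys can be illustrated with a brute-force search for a collision-free hash function. This is a simplified sketch, not the construction these notes have in mind: it assumes a fixed key set, a prime p = 101 larger than every key, and the hypothetical hash family ((a * k) % p) % m.

```python
def slot(a: int, k: int, m: int, p: int = 101) -> int:
    """One member of the assumed hash family, selected by the multiplier a."""
    return ((a * k) % p) % m


def find_perfect_hash(keys, m, p=101):
    """Search for a multiplier a that sends every key to its own slot."""
    for a in range(1, p):
        if len({slot(a, k, m, p) for k in keys}) == len(keys):
            return a
    return None


keys = [6, 13, 20, 27]  # all four collide under k % 7
m = 7
a = find_perfect_hash(keys, m)

table = [None] * m
if a is not None:
    for k in keys:
        table[slot(a, k, m)] = k


def lookup(key: int) -> bool:
    # One slot computation and one comparison, whatever the key.
    return a is not None and table[slot(a, key, m)] == key


print(lookup(13), lookup(5))  # prints: True False
```

Once a suitable multiplier is found, every lookup inspects exactly one slot, so no probing or chain walking is needed.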