
Fundamentals of Python: From First Programs Through Data Structures

Chapter 19
Unordered Collections: Sets and Dictionaries

Objectives
After completing this chapter, you will be able to:
  Implement a set type and a dictionary type using lists
  Explain how hashing can help a programmer achieve constant access time to unordered collections
  Explain strategies for resolving collisions during hashing, such as linear probing, quadratic probing, and bucket/chaining
  Use a hashing strategy to implement a set type and a dictionary type
  Use a binary search tree to implement a sorted set type and a sorted dictionary type

Using Sets
A set is a collection of items in no particular order
Most typical operations:
  Return the number of items in the set
  Test for the empty set (a set that contains no items)
  Add an item to the set
  Remove an item from the set
  Test for set membership
  Obtain the union of two sets
  Obtain the intersection of two sets
  Obtain the difference of two sets


The Python set Class


A Sample Session with Sets
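A short session along the following lines, using Python's built-in set type, exercises the typical set operations (size, emptiness test, add, remove, membership, union, intersection, difference). The values are illustrative assumptions and are not taken from the original slides.

```python
# Illustrative session with Python's built-in set type.
a = {1, 2, 3}
b = set([3, 4, 5])           # build a set from any iterable

size = len(a)                # number of items in the set
is_empty = len(set()) == 0   # test for the empty set
a.add(4)                     # add an item (no effect if already present)
a.remove(1)                  # remove an item (KeyError if it is absent)
has_two = 2 in a             # membership test

union = a | b                # items in either set
intersection = a & b         # items in both sets
difference = a - b           # items in a but not in b

print(sorted(union), sorted(intersection), sorted(difference))
```

The results are sorted before printing because a set does not promise any particular order for its items.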


Applications of Sets
Sets have many applications in the area of data processing
  Example: In database management, the answer to a query that contains the conjunction of two keys could be constructed from the intersection of the sets of items associated with those keys
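As a minimal sketch of the database example, assume a hypothetical index that maps each key to the set of record IDs associated with it. The index contents and the name `query_and` are made up for illustration.

```python
# Hypothetical index: each key maps to the set of record IDs that carry it.
index = {
    "language:python": {101, 102, 105},
    "topic:hashing": {102, 103, 105, 108},
}


def query_and(index, key1, key2):
    """Return the IDs of records associated with both keys."""
    return index.get(key1, set()) & index.get(key2, set())


matches = query_and(index, "language:python", "topic:hashing")
print(sorted(matches))
```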


Implementations of Sets
Arrays and lists may be used to contain the data items of a set
A linked list has the advantage of supporting constant-time removals of items
  Once they are located in the structure
Hashing attempts to approximate random access into an array for insertions, removals, and searches


Relationship Between Sets and Dictionaries
A dictionary is an unordered collection of elements called entries
  Each entry consists of a key and an associated value
  A dictionary's keys must be unique, but its values may be duplicated
One can think of a dictionary as having a set of keys
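Python's built-in dict shows this relationship directly: keys are unique, values may repeat, and the keys can be treated as a set. The names and values below are illustrative assumptions.

```python
ages = {"ann": 31, "bob": 27, "cy": 31}   # values may repeat, keys may not
ages["bob"] = 28                          # assigning to an existing key replaces its value

keys = set(ages)                          # the keys form a set
shared = ages.keys() & {"bob", "dee"}     # key views support set operations
```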


List Implementations of Sets and Dictionaries
The simplest implementations of sets and dictionaries use lists
This section presents these implementations and assesses their run-time performance


Sets
List implementation of a set
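A list-based set might be sketched as follows. This is not the textbook's listing; the class name `ListSet` and its method names are assumptions chosen to mirror the typical set operations.

```python
class ListSet:
    """A set that stores its items in a Python list (sketch)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._items        # linear search of the list

    def __iter__(self):
        return iter(self._items)

    def add(self, item):
        if item not in self._items:       # keep items unique
            self._items.append(item)

    def remove(self, item):
        self._items.remove(item)          # raises ValueError if absent

    def union(self, other):
        return ListSet(list(self) + list(other))

    def intersection(self, other):
        return ListSet(x for x in self if x in other)

    def difference(self, other):
        return ListSet(x for x in self if x not in other)
```

Every method that looks for an item walks the underlying list, which is where the linear running time of this implementation comes from.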


Dictionaries
Our list-based implementation of a dictionary is called ListDict
  The entries in a dictionary consist of two parts, a key and a value
A list implementation of a dictionary behaves in many ways like a list implementation of a set
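The following is a minimal sketch of what a ListDict could look like, assuming entries are stored as two-item lists. Only the class name comes from the slides; the helper `_find` and the method set are assumptions.

```python
class ListDict:
    """A dictionary that stores [key, value] entries in a list (sketch)."""

    def __init__(self):
        self._entries = []

    def _find(self, key):
        """Return the index of the entry for key, or -1 (linear search)."""
        for i, entry in enumerate(self._entries):
            if entry[0] == key:
                return i
        return -1

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return self._find(key) != -1

    def __getitem__(self, key):
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries[i][1]

    def __setitem__(self, key, value):
        i = self._find(key)
        if i == -1:
            self._entries.append([key, value])   # new entry
        else:
            self._entries[i][1] = value          # replace the existing value

    def pop(self, key):
        i = self._find(key)
        if i == -1:
            raise KeyError(key)
        return self._entries.pop(i)[1]

    def keys(self):
        return [entry[0] for entry in self._entries]
```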


Complexity Analysis of the List Implementations of Sets and Dictionaries
The list implementations of sets and dictionaries require little programmer effort
  Unfortunately, they do not perform well
Basic accessing methods must perform a linear search of the underlying list
  Each basic accessing method is O(n)


Hashing Strategies
Key-to-address transformation, or hashing function
  Acts on a given key by returning its relative position in an array
Hash table
  An array used with a hashing strategy
Collision
  Placement of different keys at the same array index
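These three terms can be illustrated with a small sketch. The hashing function shown, the remainder of the key divided by the array length, is one common choice assumed here for integer keys; the keys and the capacity are arbitrary.

```python
def home_index(key, capacity):
    """A simple hashing function for integer keys: remainder of division."""
    return key % capacity


capacity = 8
table = [None] * capacity      # the hash table
collisions = []

for key in [3, 14, 24, 11, 30]:
    index = home_index(key, capacity)
    if table[index] is None:
        table[index] = key
    else:
        # A different key already occupies this index: a collision.
        collisions.append((key, table[index], index))

print(collisions)
```

Here 11 collides with 3 at index 3, and 30 collides with 14 at index 6. The sketch only records collisions and does not resolve them.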


The Relationship of Collisions to Density
Density (load factor)
  The number of keys relative to the length of an array
As the density decreases, so does the probability of collisions
  Keeping the load factor low (say, below 0.2) seems like a good way to avoid collisions
  The cost of memory incurred by load factors below 0.5 is probably prohibitive for data sets of millions of items
  Even load factors below 0.5 cannot prevent many collisions from occurring for some data sets
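A rough way to see the effect of density is to hash one fixed data set into arrays of different lengths and count the keys whose home index is already taken. The sketch below assumes the remainder hashing function and uses the first 20 perfect squares as an arbitrary data set; `count_collisions` is a hypothetical helper.

```python
def count_collisions(keys, capacity):
    """Count keys whose home index (key % capacity) is already taken."""
    used = set()
    collisions = 0
    for key in keys:
        index = key % capacity
        if index in used:
            collisions += 1
        else:
            used.add(index)
    return collisions


keys = [k * k for k in range(1, 21)]      # 20 arbitrary keys (perfect squares)
for capacity in (20, 40, 100):
    load_factor = len(keys) / capacity
    print(capacity, load_factor, count_collisions(keys, capacity))
```

For this data set the collision count drops as the array grows, yet some collisions remain even at a load factor of 0.2, because the keys are not spread evenly over the indexes.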

Hashing with Non-Numeric Keys
Try returning the sum of the ASCII values in the string
  This method has the effect of producing the same hash value for anagrams
    Strings that contain the same characters, but in a different order
The first letters of many words in English are unevenly distributed
  This might have the effect of weighting or biasing the sums generated
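The anagram problem is easy to reproduce. The sketch below sums character codes with `ord` and reduces the sum to an index; the function name and the capacity of 101 are assumptions.

```python
def ascii_sum_hash(text, capacity):
    """Hash a string by summing its character codes, then taking the remainder."""
    return sum(ord(ch) for ch in text) % capacity


# "listen" and "silent" contain the same characters, so their sums are equal.
print(ascii_sum_hash("listen", 101), ascii_sum_hash("silent", 101))
```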

Hashing with Non-Numeric Keys (continued)
One solution:
  If the length of the string is greater than a certain threshold
    Drop the first character from the string before computing the sum
    Can also subtract the ASCII value of the last character
Python also includes a standard hash function for use in hashing applications
  The function can receive any hashable Python object as an argument and returns an integer; equal objects yield equal integers, but the integer is not guaranteed to be unique to one object
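A brief sketch of how the built-in `hash` function might be used to compute a home index. The capacity of 16 and the sample keys are arbitrary. Note that hash values for strings can differ from one run of the interpreter to the next, so no specific values are shown.

```python
capacity = 16
indexes = []
for key in ("cinema", 42, (1, 2), 3.5):
    # With a positive capacity, the remainder is always in range(capacity).
    indexes.append(hash(key) % capacity)

# Equal objects have equal hash values.
same = hash(1) == hash(1.0)            # True, because 1 == 1.0

# Mutable built-ins such as lists are unhashable.
try:
    hash([1, 2])
    list_hashable = True
except TypeError:
    list_hashable = False
```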


Linear Probing
Linear probing
  Simplest way to resolve a collision
  Search the array, starting from the collision spot, for the first available position
At the start of an insertion, the hashing function is run to compute the home index of the item
  If the cell at the home index is not available, move the index to the right to probe for an available cell
  When the search reaches the last position of the array, probing wraps around to continue from the first position

Linear Probing (continued)

For retrievals, stop the probing process when the current array cell is empty or it contains the target item
For removals, probe in the same way; if the target item is found, its cell is set to DELETED
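Insertion, retrieval, and removal with linear probing can be sketched as three small functions. This is a sketch under assumptions, not the textbook's code: the table is a plain Python list, `EMPTY` and `DELETED` are marker values, the function names are made up, and `insert` does not check for duplicates.

```python
EMPTY = None
DELETED = object()          # marker for a cell whose item was removed


def insert(table, item):
    """Place item in the first available cell at or after its home index."""
    index = hash(item) % len(table)
    for _ in range(len(table)):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
        index = (index + 1) % len(table)     # wrap around at the end
    raise OverflowError("table is full")


def find(table, item):
    """Return the index of item, or -1 if probing reaches an EMPTY cell."""
    index = hash(item) % len(table)
    for _ in range(len(table)):
        cell = table[index]
        if cell is EMPTY:
            return -1
        if cell is not DELETED and cell == item:
            return index
        index = (index + 1) % len(table)
    return -1


def remove(table, item):
    """Mark the cell holding item as DELETED; return True if it was found."""
    index = find(table, item)
    if index == -1:
        return False
    table[index] = DELETED
    return True
```

The DELETED marker matters for retrieval: a search must keep probing past a removed cell, because items inserted after a collision may sit beyond it.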

Linear Probing (continued)


Problem: After several insertions/removals, an item is farther away from its home index than it needs to be
  This increases the average overall access time
Two ways to deal with this problem:
  After a removal, shift the items on the cell's right over to the cell's left until an empty cell, a currently occupied cell, or the home indexes for each item are reached
  Regularly rehash the table (e.g., when the load factor reaches 0.5)
Clustering: Occurs when items causing a collision are relocated to the same region within the array

Quadratic Probing
To avoid the clustering associated with linear probing, we can advance the search for an empty position a considerable distance from the collision point
  Quadratic probing: Increments the home index by the square of a distance on each attempt
Problem: By jumping over some cells, one or more of them might be missed
  Can lead to some wasted space


Quadratic Probing (continued)


Here is the code for insertions, updated to use quadratic probing:
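As a minimal sketch (not the textbook's listing), an insertion with quadratic probing might look like this. It assumes a plain list with None marking empty cells; the function name and the decision to give up after as many probes as there are cells are assumptions.

```python
def insert_quadratic(table, item):
    """Insert item, probing at home + 1, home + 4, home + 9, and so on."""
    home = hash(item) % len(table)
    index = home
    distance = 1
    # Give up after len(table) probes; quadratic probing may skip free cells.
    for _ in range(len(table)):
        if table[index] is None:
            table[index] = item
            return index
        index = (home + distance ** 2) % len(table)
        distance += 1
    raise OverflowError("no free cell found along the probe sequence")
```

With an array of length 4 and a home index of 0, the probe sequence only ever visits indexes 0 and 1, so an insertion can fail while indexes 2 and 3 are still free. This is the wasted space mentioned earlier.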


Chaining
Items are stored in an array of linked lists (chains)
  Each item's key locates the bucket (index) of the chain in which the item resides or is to be inserted
Retrieval and removal each perform these steps:
  Compute the item's home index in the array
  Search the linked list at that index for the item
To insert an item:
  Compute the item's home index in the array
  If the cell is empty, create a node with the item and assign the node to the cell; else (collision), insert the item in the chain
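The steps above can be sketched with a small node class and three functions. The names are assumptions, and inserting at the head of the chain is one possible choice; the slides do not say where in the chain a colliding item goes.

```python
class Node:
    """A singly linked node holding one item."""

    def __init__(self, data, next=None):
        self.data = data
        self.next = next


def chain_insert(table, item):
    """Insert item at the head of the chain at its home index."""
    index = hash(item) % len(table)
    # Covers both cases: an empty cell (None) or an existing chain.
    table[index] = Node(item, table[index])


def chain_contains(table, item):
    """Search the chain at the item's home index."""
    node = table[hash(item) % len(table)]
    while node is not None:
        if node.data == item:
            return True
        node = node.next
    return False


def chain_remove(table, item):
    """Unlink the first node holding item; return True if one was found."""
    index = hash(item) % len(table)
    prev, node = None, table[index]
    while node is not None:
        if node.data == item:
            if prev is None:
                table[index] = node.next
            else:
                prev.next = node.next
            return True
        prev, node = node, node.next
    return False
```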

Complexity Analysis
Linear probing: Complexity depends on the load factor (D) and the tendency of items to cluster
  Worst case (method traverses the entire array before locating the item's position): behavior is linear
  Average behavior in searching for an item that cannot be found is $\frac{1}{2}\left[1 + \frac{1}{(1 - D)^2}\right]$
Quadratic probing: Tends to mitigate clustering
  Average search complexity is $1 - \log_e(1 - D) - \frac{D}{2}$ for the successful case and $\frac{1}{1 - D} - D - \log_e(1 - D)$ for the unsuccessful case
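To get a feel for these formulas, they can be evaluated for a few load factors. The function names below are assumptions, and the formulas only make sense for load factors D with 0 <= D < 1.

```python
import math


def linear_unsuccessful(d):
    """Average probes for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)


def quadratic_successful(d):
    """Average probes for a successful search with quadratic probing."""
    return 1 - math.log(1 - d) - d / 2


def quadratic_unsuccessful(d):
    """Average probes for an unsuccessful search with quadratic probing."""
    return 1 / (1 - d) - d - math.log(1 - d)


for d in (0.25, 0.5, 0.75, 0.9):
    print(d,
          round(linear_unsuccessful(d), 2),
          round(quadratic_successful(d), 2),
          round(quadratic_unsuccessful(d), 2))
```

At D = 0.5, for example, the linear probing formula gives 2.5 probes for an unsuccessful search, and every formula grows quickly as D approaches 1.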

Complexity Analysis (continued)


Chaining:
  Locating an item consists of two parts:
    Computing the home index: constant time behavior
    Searching the linked list upon a collision: linear
  Worst case (all items that have collided with each other are in one chain, which is a linked list): O(n)
  If the lists are evenly distributed in the array and the array is fairly large, the second part can be close to constant
  Best case (a chain of length 1 occupies each array cell): O(1)

Case Study: Profiling Hashing Strategies
Request:
  Write a program that allows a programmer to profile different hashing strategies
Analysis:
  Should allow the programmer to gather statistics on the number of collisions caused by the hashing strategies
  Other useful information:
    Hash table's load factor
    Number of probes needed to resolve collisions during linear or quadratic probing

Case Study: Profiling Hashing Strategies (continued)
Analysis (continued):
  Here are the profiler's results:

Design:
  Profiler class requires instance variables to track a table, the number of collisions, and the number of probes
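A profiler along these lines might be sketched as follows. Only the class name and the three tracked quantities come from the design; the method names, the use of linear probing with integer keys, and the convention of counting the visit to the home cell as one probe are all assumptions.

```python
class Profiler:
    """Counts collisions and probes while filling a table by linear probing."""

    def __init__(self, capacity):
        self._table = [None] * capacity
        self._collisions = 0
        self._probes = 0

    def insert(self, key):
        index = key % len(self._table)
        if self._table[index] is not None:
            self._collisions += 1             # home cell already taken
        probes = 1                            # the visit to the home cell
        while self._table[index] is not None:
            index = (index + 1) % len(self._table)
            probes += 1
            if probes > len(self._table):
                raise OverflowError("table is full")
        self._table[index] = key
        self._probes += probes

    def load_factor(self):
        filled = sum(cell is not None for cell in self._table)
        return filled / len(self._table)

    def report(self):
        return {"load factor": self.load_factor(),
                "collisions": self._collisions,
                "probes": self._probes}
```

A usage example: inserting the keys 3, 8, 13, and 1 into a table of length 5 produces two collisions (8 and 13 both hash to index 3) and seven probes in total.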

Hashing Implementation of Dictionaries
HashDict uses the bucket/chaining strategy
To manage the array, declare three instance variables: _table, _size, and _capacity
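A minimal sketch of such a class, using the three instance variables named above. Everything else is an assumption: each chain node is represented as a three-item list `[key, value, next]` to keep the sketch short, the default capacity is arbitrary, and the table is never resized.

```python
class HashDict:
    """A dictionary using bucket/chaining (sketch; no resizing)."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._table = [None] * capacity     # each cell heads a chain or is None
        self._size = 0

    def _find(self, key):
        """Return (index, node) for key; node is None if key is absent."""
        index = hash(key) % self._capacity
        node = self._table[index]
        while node is not None and node[0] != key:
            node = node[2]
        return index, node

    def __len__(self):
        return self._size

    def __contains__(self, key):
        return self._find(key)[1] is not None

    def __getitem__(self, key):
        node = self._find(key)[1]
        if node is None:
            raise KeyError(key)
        return node[1]

    def __setitem__(self, key, value):
        index, node = self._find(key)
        if node is not None:
            node[1] = value                 # replace the existing value
        else:
            # Link a new [key, value, next] node at the head of the chain.
            self._table[index] = [key, value, self._table[index]]
            self._size += 1

    def keys(self):
        for node in self._table:
            while node is not None:
                yield node[0]
                node = node[2]
```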


Hashing Implementation of Sets
The design of the methods for HashSet is the same as that of the methods in HashDict, except:
  __contains__ searches for an item (not a key)
  add inserts an item only if it is not already in the set
  A single iterator method is included instead of separate methods that return keys and values
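The three differences can be seen in a short sketch. As a simplification that is not from the slides, each bucket here is a Python list rather than a linked chain; the default capacity is arbitrary and the table is never resized.

```python
class HashSet:
    """A set using bucket/chaining, with Python lists as buckets (sketch)."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._table = [[] for _ in range(capacity)]   # one bucket per cell
        self._size = 0

    def _bucket(self, item):
        return self._table[hash(item) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, item):
        return item in self._bucket(item)     # searches for an item, not a key

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:                # insert only if not already present
            bucket.append(item)
            self._size += 1

    def __iter__(self):
        # A single iterator over the items, bucket by bucket.
        for bucket in self._table:
            yield from bucket
```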


Sorted Sets and Dictionaries
Each item added to a sorted set must be comparable with its other items
  The same applies to keys added to a sorted dictionary
The iterator for each type of collection guarantees its users access to items or keys in sorted order
Implementation alternatives:
  List-based: must maintain a sorted list of the items
  Hashing implementation: not feasible
  Binary search tree implementation: generally provides logarithmic access to data items
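A sorted set backed by a binary search tree might be sketched as follows. The class name and the representation of each node as a three-item list `[item, left, right]` are assumptions. The tree is not rebalanced, so access is logarithmic only while the tree stays reasonably balanced.

```python
class TreeSortedSet:
    """A sorted set backed by an unbalanced binary search tree (sketch)."""

    def __init__(self):
        self._root = None      # each node is a list [item, left, right]
        self._size = 0

    def __len__(self):
        return self._size

    def add(self, item):
        """Insert item unless an equal item is already present."""
        if self._root is None:
            self._root = [item, None, None]
            self._size = 1
            return
        node = self._root
        while True:
            if item == node[0]:
                return                          # already in the set
            side = 1 if item < node[0] else 2   # 1 = left child, 2 = right child
            if node[side] is None:
                node[side] = [item, None, None]
                self._size += 1
                return
            node = node[side]

    def __contains__(self, item):
        node = self._root
        while node is not None:
            if item == node[0]:
                return True
            node = node[1] if item < node[0] else node[2]
        return False

    def __iter__(self):
        """In-order traversal, which visits the items in sorted order."""
        stack, node = [], self._root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node[1]
            node = stack.pop()
            yield node[0]
            node = node[2]
```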

Summary
A set is an unordered collection of items
  Each item is unique
  List-based implementation: linear-time access
  Hashing implementation: constant-time access
Items in a sorted set can be visited in sorted order
  A tree-based implementation of a sorted set supports logarithmic-time access
A dictionary is an unordered collection of entries, where each entry consists of a key and a value
  Each key is unique; values may be duplicated

Summary (continued)
A sorted dictionary imposes an ordering by comparison on its keys
Implementations of both types of dictionaries are similar to those of sets
Hashing: Technique for locating an item in constant time
  Techniques to resolve collisions: linear probing, quadratic probing, chaining
  The run-time and memory aspects involve the load factor of the array