Why Sort?
- Rendezvous: bring matching items together
- Eliminating duplicates
- Summarizing groups of items
- Ordering: sometimes, output must be ordered
  - e.g., return results in decreasing order of relevance
Sorting and hashing are upcoming fundamentals of query processing. But first, a look at the storage hardware underneath, which has major implications for how we design these algorithms.
Economics
[Figure: price per gigabyte of RAM ("sometimes the DB"), SSD ("sometimes a cache"), and magnetic disk]
The storage hierarchy, small & fast at the top, big & slow at the bottom:
Registers → On-chip Cache → On-Board Cache → RAM → SSD → Disk
9/1/15
Components of a Disk
- Platters spin on a central spindle
- The arm assembly moves the disk heads to position them over the desired track; arm movement takes ~8-10 msec on average
- Data lives on concentric tracks, each divided into sectors

[Figure: storage latency analogy, in "how far away is the data?" terms]
- Registers: 1 (my head, 1 min)
- On-chip cache: 2 (this room)
- On-board cache: 10 (this building, 10 min)
- Memory: 100 (Sacramento, 1.5 hr)
- Disk: 10^6 (Pluto, 2 years)
- Tape/optical robot: 10^9 (Andromeda, 2,000 years)
SSD references:
http://www.tomshardware.com/charts/ssd-charts-2014/benchmarks,129.html
http://www.storagesearch.com/ssdmyths-endurance.html
Meanwhile, data trends cut both ways:
- It's big: we can generate and archive data cheaply and easily
- It's small: many rich data sets have a (small) fixed size
Hmmm! Either way, many people will continue to worry about magnetic disk for some time yet.
Streaming data through RAM
- Simple case: INPUT fills a single input buffer, we compute f(x), and OUTPUT drains a single output buffer
- Better: keep multiple input buffers and multiple output buffers in RAM, so that INPUT and OUTPUT I/O can proceed while f(x) is being computed
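As a rough illustration (my own sketch, not from the slides), paged streaming of a function over input can be written in a few lines of Python; `PAGE_SIZE`, `stream_apply`, and the toy data are all invented for the demo:

```python
# Minimal sketch of streaming f(x) over data too big for RAM: read one "page"
# of items into an input buffer, apply f, collect results in an output
# buffer, and flush the output buffer whenever it fills.

PAGE_SIZE = 4  # items per page; tiny, for illustration only

def stream_apply(items, f, page_size=PAGE_SIZE):
    """Apply f to items one page at a time, yielding full output pages."""
    out_buf = []
    for i in range(0, len(items), page_size):
        in_buf = items[i:i + page_size]      # one input page
        for x in in_buf:
            out_buf.append(f(x))
        while len(out_buf) >= page_size:     # flush full output pages
            yield out_buf[:page_size]
            out_buf = out_buf[page_size:]
    if out_buf:                              # flush the final partial page
        yield out_buf

pages = list(stream_apply(list(range(10)), lambda x: x * x))
```

Only one input page and one output page are resident at a time; adding extra buffers (as on the slide) would let the next read overlap with computing f(x).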
Sorting: 2-Way
Given: a file F of N pages, and only a few buffer pages of RAM.
Pass 0 (conquer):
- read a page, sort it, write it
- only one buffer page is used (INPUT → I/O buffer → sort → OUTPUT)
- a repeated batch job
An example, starting from a 7-page input file:

Input file:    3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (sort each page):
1-page runs:   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (merge pairs of runs):
2-page runs:   2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2:
4-page runs:   2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3:
8-page runs:   1,2 2,3 3,4 4,5 6,6 7,8 9

Each pass reads and writes every page, so the total cost is 2N(⌈log2 N⌉ + 1) I/Os.
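The passes above can be sketched as a toy Python program (my own sketch, not from the slides; flat lists stand in for on-disk runs, and `heapq.merge` stands in for the buffered 2-way merge):

```python
# Toy 2-way external merge sort over a list of "pages" (each page is a small
# list of keys). Pass 0 sorts each page; each later pass merges runs two at
# a time until a single sorted run remains.
import heapq

def two_way_sort(pages):
    # Pass 0 (conquer): sort each page individually -> 1-page runs.
    runs = [sorted(p) for p in pages]
    passes = 1
    # Merge passes: repeatedly merge adjacent pairs of runs.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]                     # two runs (or one leftover)
            merged.append(list(heapq.merge(*pair)))  # streaming 2-way merge
        runs = merged
        passes += 1
    return runs[0], passes

# The example from the text: 7 input pages, 13 keys.
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_sort(pages)
```

For 7 pages this takes ⌈log2 7⌉ + 1 = 4 passes, matching PASS 0 through PASS 3 above.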
Merging Runs
With B buffer pages in RAM, each merge step uses B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer, streaming run pages between disk and RAM.
Example (2-way, B=3): merging the 4-page runs 2,3 4,4 6,7 8,9 and 1,2 3,5 6 through INPUT 1 and INPUT 2 yields the 8-page run 1,2 2,3 3,4 4,5 6,6 7,8 9.
For 2-way merging, the number of passes = ⌈log2 N⌉ + 1, so total cost is 2N(⌈log2 N⌉ + 1) I/Os.
Number of I/O passes as a function of file size N (pages) and buffer pages B:

N               B=3   B=5   B=9
100               7     4     3
1,000            10     5     4
10,000           13     7     5
100,000          17     9     6
1,000,000        20    10     7
10,000,000       23    12     8
100,000,000      26    14     9
1,000,000,000    30    15    10
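A small script (my own check, not from the slides) reproduces the table above using the standard external-sort formula: passes = 1 + ⌈log_{B-1}(⌈N/B⌉)⌉, i.e. one conquer pass plus enough (B-1)-way merge passes to reduce ⌈N/B⌉ initial runs to one:

```python
# Check of the passes table: 1 conquer pass plus ceil(log_{B-1}(ceil(N/B)))
# merge passes. Integer arithmetic avoids floating-point log rounding errors.

def int_log_ceil(x, base):
    """Smallest k such that base**k >= x, in exact integer arithmetic."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k

def num_passes(N, B):
    initial_runs = -(-N // B)            # ceil(N/B) runs after pass 0
    return 1 + int_log_ceil(initial_runs, B - 1)

row = [num_passes(10**9, B) for B in (3, 5, 9)]   # last row of the table
```

With a billion pages, going from B=3 to B=9 buffers cuts the pass count from 30 to 10.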
How big a file can we sort in two passes? Pass 0 makes sorted runs of B pages each; one merge pass combines up to B-1 runs. Answer: B(B-1). In other words, we can sort N pages of data in about √N space.
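To make the B(B-1) two-pass capacity concrete, a back-of-envelope calculation (the 8 KB page size is my assumption, not from the text):

```python
# With B buffer pages we can two-pass sort B*(B-1) pages -- roughly sqrt(N)
# memory for an N-page file. Assuming 8 KB pages (illustrative assumption):
PAGE_BYTES = 8 * 1024

B = 1_000                                     # ~8 MB of sort buffers
sortable_pages = B * (B - 1)                  # pages sortable in two passes
sortable_bytes = sortable_pages * PAGE_BYTES  # ~7.6 GB of data
```

So roughly 8 MB of RAM suffices to two-pass sort several gigabytes.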
Internal Sort: More on Heapsort
In pass 0, instead of sorting one page at a time, keep a heap over the input buffers to produce longer runs (replacement selection):
- Maintain two heaps in RAM: H1 (Heap 1) feeds the current run; H2 collects records for the next run
- Let m be the last key written to the current output run
- For each incoming record with key k:
  - if k >= m, insert it into H1 (it can still be emitted in the current run)
  - if k < m, insert it into H2 (it must wait for the next run)
- else, when H1 empties: { H1 = H2; H2.reset(); start new output run; }
Fact: on random input, the average length of a run is 2(B-2).
Worst case: what is the min length of a run, and how does this arise? Runs shrink to about B-2 pages when the input arrives in reverse order, since every new key is < m.
Best case: what is the max length of a run, and how does this arise? If the input is already sorted (or almost sorted), the entire file comes out as a single run.
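The H1/H2 logic above can be sketched in Python (my own toy version, not the slides' code; `heap_capacity` stands in for the B-2 pages of heap space):

```python
# Toy replacement selection: H1 feeds the current run; records with keys
# smaller than the last key emitted wait in H2 for the next run.
import heapq

def replacement_selection(records, heap_capacity):
    runs, h1, h2 = [], [], []
    it = iter(records)
    for r in it:                          # prime H1 up to capacity
        heapq.heappush(h1, r)
        if len(h1) == heap_capacity:
            break
    while h1:
        run = []
        while h1:
            m = heapq.heappop(h1)         # smallest key that can extend run
            run.append(m)
            nxt = next(it, None)
            if nxt is None:
                continue
            if nxt >= m:
                heapq.heappush(h1, nxt)   # can still go out in this run
            else:
                heapq.heappush(h2, nxt)   # must wait for the next run
        runs.append(run)
        h1, h2 = h2, []                   # H1 = H2; H2.reset(); new run
    return runs
```

On already-sorted input every record qualifies for the current run, so the whole file becomes one run (the best case above); on reverse-sorted input every record is deferred, giving capacity-sized runs (the worst case).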
Alternative: Hashing
Idea: many uses of sorting (rendezvous, duplicate elimination, grouping) don't actually need total order; they only need matching items brought together. Divide and conquer with hashing achieves exactly that.
Two Phases
Partition (divide): read the original relation one page at a time through an INPUT buffer; apply hash function hp to each record to route it into one of B-1 output buffers (partitions 1 ... B-1), flushing each buffer to its partition on disk as it fills.
ReHash (conquer): read partitions into a RAM hash table one at a time, using a different hash function hr. Then go through each bucket of this hash table to achieve rendezvous in RAM (rendezvous in the bucket).
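A toy Python sketch of the two phases, applied to duplicate elimination (my own illustration: `hp`, `hr`, and the partition count are stand-ins I chose for the demo; real partitions would be disk files):

```python
# Partition with hp (divide), then rehash each partition in RAM with a
# different function hr and scan its buckets (conquer).

NUM_PARTITIONS = 3            # stands in for the B-1 on-disk partitions

def hp(x):                    # partitioning hash (toy choice)
    return x % NUM_PARTITIONS

def hr(x):                    # different rehash function for the RAM table
    return (x * 31 + 7) % 8

def hash_distinct(items):
    # Phase 1 - Partition: route each item to one partition "file".
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for x in items:
        partitions[hp(x)].append(x)
    # Phase 2 - ReHash: build an in-RAM table per partition with hr.
    # Duplicates of a value land in the same partition AND the same bucket.
    out = []
    for part in partitions:
        buckets = {}
        for x in part:
            buckets.setdefault(hr(x), set()).add(x)  # rendezvous in the bucket
        for b in buckets.values():
            out.extend(b)
    return out

result = sorted(hash_distinct([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))
```

Note that only one partition's hash table needs to fit in RAM at a time, which is what makes the two-phase scheme work out-of-core.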
Memory Requirement
How big of a table can we hash in two passes?
- B-1 partitions result from Pass 1
- Each should be no more than B pages in size
- Answer: B(B-1)
The two strategies mirror each other: hashing is Divide then Conquer; sorting is Conquer (sort runs) then Merge.
So which is better?
Simplest analysis:
- Same memory requirement for 2 passes
- Same I/O cost
But we can dig a bit deeper.
Sorting pros:
- Great if input already sorted (or almost sorted) w/heapsort
- Great if the output needs to be sorted anyway
- Not sensitive to data skew or bad hash functions
Hashing pros:
- For duplicate elimination, scales with # of distinct values
  - not # of items! We'll see this again.
Summary
Sort/Hash Duality
Hashing is Divide & Conquer
Sorting is Conquer & Merge