
File Structures by Folk, Zoellick, and Riccardi

Chap 8. Cosequential Processing and the Sorting of Large Files

SNU-OOPSLA-LAB


Chapter Objectives(1)

Describe a class of frequently used processing activities known as cosequential processing
Provide a general object-oriented model for implementing
varieties of cosequential processes
Illustrate the use of the model to solve a number of
different kinds of cosequential processing problems,
including problems other than simple merges and
matches
Introduce heapsort as an approach to overlapping I/O with
sorting in RAM


Chapter Objectives(2)

Show how merging provides the basis for sorting very large files
Examine the costs of K-way merges on disk and find ways
to reduce those costs
Introduce the notion of replacement selection
Examine some of the fundamental concerns associated
with sorting large files using tapes rather than disks
Introduce UNIX utilities for sorting, merging, and
cosequential processing


Contents
8.1 Cosequential operations
8.2 Application of the OO Model to a General Ledger Program
8.3 Extension of the OO Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix


8.1 An Object-Oriented Model for Implementing Cosequential Processes

Cosequential operations

Coordinated processing of two or more sequential lists to produce a single list
Kinds of operations

merging, or union
matching, or intersection
combination of above


Matching Names in Two Lists(1)

So-called intersection operation
  Output the names common to the two lists
Things that must be dealt with to make the match procedure work reasonably
  initializing: arranging things so that the procedure starts correctly
  getting and accessing the next list item
  synchronizing the two lists
  handling EOF conditions
  recognizing errors
    e.g. duplicate names or names out of sequence


Matching Names in Two Lists(2)

In comparing two names

if Item(1) is less than Item(2), read the next name from List 1
if Item(1) is greater than Item(2), read the next name from List 2
if the names are the same, output the name and read the next names from both lists


Cosequential match procedure(1)


PROGRAM: match
[Figure: List 1 supplies Item(1) and List 2 supplies Item(2), using the initialize() and input() procedures. When Item(1) < Item(2) the program advances List 1; when Item(1) > Item(2) it advances List 2; when the names are the same it outputs the matched name and advances both lists.]


Cosequential match procedure(2)


int Match(char * List1, char * List2, char * OutputList)
{
  int MoreItems; // true if items remain in both of the lists
  // initialize input and output lists
  InitializeList(1, List1); InitializeList(2, List2);
  InitializeOutput(OutputList);
  // get first item from both lists
  MoreItems = NextItemInList(1) && NextItemInList(2);
  while (MoreItems) { // loop until no items remain in one of the lists
    if (Item(1) < Item(2)) MoreItems = NextItemInList(1);
    else if (Item(1) == Item(2)) { // match found
      ProcessItem(1);
      MoreItems = NextItemInList(1) && NextItemInList(2);
    }
    else // Item(1) > Item(2)
      MoreItems = NextItemInList(2);
  }
  FinishUp();
  return 1;
}


General Class for Cosequential Processing(1)


template <class ItemType>
class CosequentialProcess
// base class for cosequential processing
{ public:
  // the following methods provide basic list processing
  // these must be defined in subclasses
  virtual int InitializeList (int ListNumber, char * ListName) = 0;
  virtual int InitializeOutput (char * OutputListName) = 0;
  virtual int NextItemInList (int ListNumber) = 0;
    // advance to the next item in this list
  virtual ItemType Item (int ListNumber) = 0;
    // return the current item from this list
  virtual int ProcessItem (int ListNumber) = 0;
    // process the item in this list
  virtual int FinishUp() = 0; // complete the processing
  // 2-way cosequential match method
  virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};


General Class for Cosequential Processing(2)

A Subclass to support lists that are files of strings, one per line

class StringListProcess : public CosequentialProcess<String &>
{ public:
  StringListProcess (int NumberOfLists); // constructor
  // Basic list processing methods
  int InitializeList (int ListNumber, char * ListName);
  int InitializeOutput (char * OutputListName);
  int NextItemInList (int ListNumber); // get the next item
  String & Item (int ListNumber); // return the current item
  int ProcessItem (int ListNumber); // process the item
  int FinishUp(); // complete the processing
protected:
  ifstream * List; // array of list files
  String * Items; // array of current items, one from each list
  ofstream OutputList;
  static const char * LowValue; // used so that NextItemInList() doesn't
    // have to get the first item in a special way
  static const char * HighValue; // comes after all legal input values
};


General Class for Cosequential Processing(3)

Appendix H: full implementation


An example of a main program
#include "coseq.h"
int main()
{
  StringListProcess ListProcess(2); // process with 2 lists
  ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
  return 0;
}


Merging Two Lists(1)

Based on matching operation


Difference

must read each of the lists completely
must change the behavior of the MoreItems (MoreNames) flags: keep processing as long as there are records in either list
HighValue
  the special value (we use \xFF)
  comes after all legal input values in the files, to ensure that both input files are read to completion


Merging Two Lists(2)

Cosequential merge procedure based on a single loop

This method has been added to class CosequentialProcess


No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists
  (char * List1Name, char * List2Name, char * OutputListName)
{
  int MoreItems1, MoreItems2; // true if more items remain in each list
(continued...)


Merging Two Lists(3)

  InitializeList(1, List1Name);
  InitializeList(2, List2Name);
  InitializeOutput(OutputListName);
  MoreItems1 = NextItemInList(1);
  MoreItems2 = NextItemInList(2);
  while (MoreItems1 || MoreItems2) { // if either list has more items
    if (Item(1) < Item(2)) { // list 1 has the next item to be processed
      ProcessItem(1);
      MoreItems1 = NextItemInList(1);
    }
    else if (Item(1) == Item(2)) { // the items are equal
      ProcessItem(1);
      MoreItems1 = NextItemInList(1);
      MoreItems2 = NextItemInList(2);
    }
    else { // Item(1) > Item(2): list 2 has the next item
      ProcessItem(2);
      MoreItems2 = NextItemInList(2);
    }
  }
  FinishUp();
  return 1;
}


Cosequential merge procedure(1)


PROGRAM: merge
[Figure: List 1 (NAME_1) and List 2 (NAME_2) both feed OutputList. When Item(1) < Item(2) or the names match, NAME_1 from List 1 goes to the output; when Item(1) > Item(2), NAME_2 from List 2 goes to the output.]


Summary of the Cosequential Processing Model(1)

Assumptions

two or more input files are processed in a parallel fashion


each file is sorted
in some cases, there must exist a high key value and/or a low key value
records are processed in a logical sorted order
for each file, there is only one current record
records should be manipulated only in internal memory


Summary of the Cosequential Processing Model(2)

Essential Components
initialization: read the first logical record from each list
one main synchronization loop
  continues as long as relevant records remain
selection within the main synchronization loop:
  if (Item(1) > Item(2)) then ..........
  else if (Item(1) < Item(2)) then .........
  else ........... /* current keys equal */
  endif
Input files and output files are sequence checked by comparing the previous item value with the new one


Summary of the Cosequential Processing Model(3)

Essential components (cont'd)
  substitute high values for the actual key when EOF is reached
    the main loop terminates when high values have occurred for all relevant input files
    no special code is needed to deal with EOF
  I/O and error detection are relegated to supporting methods, so the details of these activities do not obscure the principal processing logic


8.2 The General Ledger Program (1)


Account table (Fig 8.6)

  Acct-No   Acct-Title    Jan    Feb    Mar    Apr
  101       check #1      100    200    170    ...
  102       check #2      500    270    320    ...
  505       advertize     300    129    230    ...

Journal entry table (Fig 8.7)

  Acct-No   Check-No   Date       Description    Debit/Credit
  101       112        04/02/86   auto-repair    -30
  505       213        05/13/86   newspaper      -39
  540       670        04/13/86   printer        +60

Ledger Printout (Fig 8.8)

  101  check #1
       1271   04/02/86   auto-expense   -78
       1272   04/03/86   advertise      -30

8.2 The General Ledger Program(2)


Ledger List and Journal List (Fig 8.10)

  Ledger (master) list:    Journal (transaction) list:
  101  check#1             101  1271  Auto-expense
                           101  1272  Rent
                           101  1273  Advertising
  102  check#2             102  670   Office-expense

The ledger (master) account number and the journal (transaction) account number are the keys that coordinate the two lists
Class MasterTransactionProcess (Fig 8.12)
Subclass LedgerProcess (Fig 8.14)


8.2 The General Ledger Program (3)


template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{ public:
  MasterTransactionProcess(); // constructor
  virtual int ProcessNewMaster() = 0; // processing when a new master is read
  virtual int ProcessCurrentMaster() = 0; // processing for each transaction for a master
  virtual int ProcessEndMaster() = 0; // processing after all transactions for a master
  virtual int ProcessTransactionError() = 0; // no master matches the transaction
  // cosequential processing of master and transaction records
  int PostTransactions (char * MasterFileName, char * TransactionFileName,
    char * OutputListName);
};
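The body of PostTransactions is not shown on these slides. The sketch below is an illustration, not the book's exact code: it shows how the four virtual hooks above might be driven, assuming list 1 holds the master (ledger) records and list 2 the transaction (journal) records.

template <class ItemType>
int MasterTransactionProcess<ItemType>::PostTransactions
  (char * MasterFileName, char * TransactionFileName, char * OutputListName)
{
  InitializeList(1, MasterFileName);        // list 1: master (ledger) records
  InitializeList(2, TransactionFileName);   // list 2: transaction (journal) records
  InitializeOutput(OutputListName);
  int MoreMasters = NextItemInList(1);
  int MoreTransactions = NextItemInList(2);
  if (MoreMasters) ProcessNewMaster();      // set up the first master account
  while (MoreMasters || MoreTransactions) {
    if (Item(1) < Item(2)) {                // no more transactions for this master
      ProcessEndMaster();
      MoreMasters = NextItemInList(1);
      if (MoreMasters) ProcessNewMaster();
    }
    else if (Item(1) == Item(2)) {          // transaction matches the current master
      ProcessCurrentMaster();               // e.g. add the amount to the account
      ProcessItem(2);                       // post (print) the transaction
      MoreTransactions = NextItemInList(2);
    }
    else {                                  // Item(1) > Item(2): transaction without a master
      ProcessTransactionError();
      MoreTransactions = NextItemInList(2);
    }
  }
  FinishUp();
  return 1;
}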


8.3 Extension of the Model to Include Multiway Merging

A K-way Merge Algorithm

A very general form of cosequential file processing


Merge K input lists to create a single, sequentially
ordered output list
Algorithm

begin loop
determine which list has the key with the lowest value
output that key
move ahead one key in that list
in duplicate input entries, move ahead in each list
loop again
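As an illustration only (not code from the book), the loop above can be written directly against the Item()/NextItemInList()/ProcessItem() interface of the cosequential model, using a linear scan to find the minimum and relying on the HighValue convention (an exhausted list reports a key larger than any legal key):

template <class ItemType>
int MergeKLists(int K, ItemType HighValue)
// sketch of a K-way merge; assumes lists 1..K have already been initialized
{
  for (int i = 1; i <= K; i++)
    NextItemInList(i);                           // prime every list with its first item
  while (1) {
    int lowest = 1;                              // linear scan: which list has the lowest key?
    for (int i = 2; i <= K; i++)
      if (Item(i) < Item(lowest)) lowest = i;
    if (Item(lowest) == HighValue) break;        // all lists are exhausted
    ProcessItem(lowest);                         // output that key
    for (int i = 1; i <= K; i++)                 // move ahead one key in every list that
      if (i != lowest && Item(i) == Item(lowest))  // holds a duplicate of the key...
        NextItemInList(i);
    NextItemInList(lowest);                      // ...and in the chosen list itself
  }
  return 1;
}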


Selection Tree for Merging Large Number of Lists

K-way merge
  works nicely if K is no larger than 8 or so
  if K > 8, the set of comparisons needed to find the minimum key is expensive
    (a loop over all K current keys for every output key)

Selection Tree (useful if K > 8)
  a time vs. space trade-off
  a kind of tournament tree
  the minimum value is at the root node
  the depth of the tree is log2 K, so each output key costs about log2 K comparisons
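A full tournament tree involves some bookkeeping; as a compact stand-in with the same log2 K comparisons per output key, a min-heap of (current key, list index) pairs can be used. The sketch below is not from the book; it merges K sorted in-memory lists of ints:

#include <queue>
#include <vector>
#include <functional>
#include <utility>

// Each heap entry is (current key, (list index, position within that list)).
typedef std::pair<int, std::pair<int, int> > Entry;

std::vector<int> MergeWithMinHeap(const std::vector<std::vector<int> > & lists)
{
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > pq;
  for (int i = 0; i < (int)lists.size(); i++)
    if (!lists[i].empty())
      pq.push(Entry(lists[i][0], std::make_pair(i, 0)));   // each list's first key
  std::vector<int> output;
  while (!pq.empty()) {
    Entry e = pq.top(); pq.pop();                  // smallest current key overall
    output.push_back(e.first);                     // output that key
    int list = e.second.first, pos = e.second.second + 1;
    if (pos < (int)lists[list].size())             // move ahead one key in that list
      pq.push(Entry(lists[list][pos], std::make_pair(list, pos)));
  }
  return output;
}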


Selection Tree

                        5
               7                 5
           7       11        5       8
  List 0:  7, 10, 17, ...
  List 1:  9, 19, 23, ...
  List 2: 11, 13, 32, ...
  List 3: 18, 22, 24, ...
  List 4: 12, 14, 21, ...
  List 5:  5,  6, 25, ...
  List 6: 15, 20, 30, ...
  List 7:  8, 16, 29, ...

(each internal node holds the smaller of its two children's current keys; the root holds the overall minimum, 5, currently from List 5)


8.4 A Second Look at Sorting in Memory


Read the whole file into memory, sort it, then write the whole file back to disk

Can we improve on the time that it takes for this RAM sort?
  perform some parts of the work in parallel
  selection sort is good, but it cannot be used to sort the entire file

Use the heap technique!
  processing and I/O can occur in parallel
  keep all the keys in a heap
  build the heap while reading a block
  rebuild the heap (output keys in order) while writing a block

Overlapping processing and I/O : Heapsort

Heap
  a kind of binary tree: a complete binary tree
  each node has a single key, and that key is greater than or equal to the key at its parent node (so the smallest key is at the root)
  storage for the tree can be allocated sequentially (as an array), so there is no need for pointers or other dynamic overhead for maintaining the heap


A heap in both its tree form and as it would be stored in an array
(the children of the node at position n are at positions 2n and 2n+1)

                      A (1)
            B (2)               C (3)
       E (4)     H (5)     I (6)     D (7)
    G (8)  F (9)

  Array:   A  B  C  E  H  I  D  G  F    (positions 1 through 9)

Class Heap and Method Insert(1)


class Heap
{ public:
  Heap(int maxElements);
  int Insert (char * newKey);
  char * Remove();
protected:
  int MaxElements; int NumElements;
  char ** HeapArray;
  void Exchange (int i, int j); // exchange elements i and j
  int Compare (int i, int j) // compare elements i and j
    { return strcmp(HeapArray[i], HeapArray[j]); } // requires <cstring>
};


Class Heap and Method Insert(2)


int Heap::Insert(char * newKey)
{
  if (NumElements == MaxElements) return FALSE;
  NumElements++; // add the new key at the last position
  HeapArray[NumElements] = newKey;
  // re-order the heap
  int k = NumElements; int parent;
  while (k > 1) { // k has a parent
    parent = k / 2;
    if (Compare(k, parent) >= 0) break;
      // HeapArray[k] is in the right place
    // else exchange k and parent
    Exchange(k, parent);
    k = parent;
  }
  return TRUE;
}
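Remove() is declared in class Heap but its body is not shown on these slides. A sketch consistent with the Insert() code above (1-based array, smallest key at the root) could be:

char * Heap::Remove()
{
  if (NumElements == 0) return 0;           // empty heap: return a null pointer
  char * smallest = HeapArray[1];           // the smallest key is at the root
  HeapArray[1] = HeapArray[NumElements];    // move the last key into the root
  NumElements--;
  int k = 1;                                // sift the new root down to its place
  while (2 * k <= NumElements) {
    int child = 2 * k;                      // pick the smaller of the two children
    if (child + 1 <= NumElements && Compare(child + 1, child) < 0)
      child = child + 1;
    if (Compare(k, child) <= 0) break;      // heap property restored
    Exchange(k, child);
    k = child;
  }
  return smallest;
}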


Heap Building Algorithm(1)


input key order : F D C G H I B E A
  New key to be inserted    Heap, after insertion of the new key (array positions 1-9)
  F                         F
  D                         D F
  C                         C F D
  G                         C F D G
  H                         C F D G H

  (selected heaps are also shown in tree form in the original figure)
  (continued...)

Heap Building Algorithm(2)


input key order : F D C G H I B E A

  New key to be inserted    Heap, after insertion of the new key (array positions 1-9)
  I                         C F D G H I
  B                         B F C G H I D
  E                         B E C F H I D G
  A                         A B C E H I D G F

  (selected heaps are also shown in tree form in the original figure)
  (continued...)

Heap Building Algorithm(3)


input key order : F D C G H I B E A

  Last key inserted: A      Final heap (array positions 1-9):  A B C E H I D G F

  In tree form:
                        A
               B                 C
          E        H         I       D
        G    F


Illustration for overlapping input with heap building(1)


(Free ride of main-memory processing: heap building is faster than the I/O!)

Total RAM area allocated for the heap:
  First input buffer. The first part of the heap is built here: the first record is added to the heap, then the second record, and so forth.
  Second input buffer. This buffer is being filled while the heap is being built in the first buffer.


Illustration for overlapping input with heap building(2)


(One heap keeps growing during the I/O time!)

  Second part of the heap is built here: the first record is added to the heap, then the second record, etc.
  Third input buffer. This buffer is filled while the heap is being built in the second buffer.
  Third part of the heap is built here.
  Fourth input buffer. This buffer is filled while the heap is being built in the third buffer.


Sorting while Writing to the File

Heap rebuilding while writing a block
  (another free ride on main-memory processing)
Retrieving the keys in order (Fig 8.20)

  while (there are elements left in the heap)
    get the smallest value (at the root)
    move the element in the last heap position into the root
    decrease the number of elements
    reorder the heap

Overlapping retrieve-in-order with I/O (see the short example below)
  retrieve a block of records in order
  while writing this block, retrieve the next block in order
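With a Remove() like the sketch shown earlier (returning 0 when the heap is empty), the retrieve-in-order loop reduces to a few lines; WriteToOutputBuffer() is a hypothetical routine standing in for the buffered, overlapped write:

char * key;
while ((key = heap.Remove()) != 0)   // smallest remaining key each time
  WriteToOutputBuffer(key);          // hypothetical: a real version would fill one
                                     // buffer while the previous one is being written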


8.5 Merging as a Way of Sorting Large Files on Disk

Keysort: holding only the keys in memory
  Two shortcomings of keysort
    a substantial cost of seeking remains after the keys are sorted (the records themselves still have to be retrieved in sorted order)
    it cannot sort really large files
      e.g. a file with 8,000,000 records of 100 bytes each (an 800 MB file) and 10-byte keys: 8,000,000 x 10 bytes = 80 MB of keys, too many to hold in RAM at once

Multiway merge algorithm
  small overhead for maintaining pointers and temporary variables
  run: a sorted subfile
  use heapsort for each run
    split, read in, heapsort, write back


Sorting through the creation of runs and the subsequent merging of runs

  8,000,000 unsorted records
      |  80 internal sorts
      v
  80 runs, each containing 100,000 sorted records
      |  merge
      v
  8,000,000 records in sorted order

Multiway merging (K-way merge-sort)

Can be extended to files of any size


Reading during run creation is sequential

no seeking due to sequential reading

Reading and writing during the merge are largely sequential
  sort each run with heapsort, overlapping I/O
  then do a K-way merge of the K runs
Since I/O is largely sequential, tapes can be used


How Much Time Does a Merge Sort Take?

Assumptions

only one seek is required for any sequential access

only one rotational delay is required per access

Four I/Os (cf. the previous slide)

during the sort phase

reading all records into RAM for sorting, forming runs

writing sorted runs out to disk

during the merge phase

reading sorted runs into RAM for merging

writing sorted file out to disk


Four Steps(1)

Step 1: Reading records into RAM for sorting and forming runs
  assume: 10 MB of RAM for input buffering, 800 MB file size
  seek time --> 8 msec, rotational delay --> 3 msec
  transmission rate --> 0.0145 MB/msec
  Time for step 1:
    access 80 blocks: 80 x 11 msec, plus transfer 80 blocks: 800/0.0145 msec (roughly 1 sec + 55 sec)

Step 2: Writing sorted runs out to disk
  writing is the reverse of reading
  the time for step 2 equals the time for step 1


Four Steps(2)

Step 3: Reading sorted runs into RAM for merging
  the 10 MB of RAM now holds pieces of all 80 runs
    it is reallocated as 80 input buffers, one per run
    each buffer holds 1/80 of a run (0.125 MB), so each run must be accessed 80 times to read all of it
  total seek & rotation time --> 80 runs x 80 seeks = 6,400 seeks; 6,400 x 11 msec = about 70 seconds
  transfer time --> about 60 seconds
  total time = total seek & rotation time + transfer time


Four Steps(3)
Step 4: Writing the sorted file out to disk
  we need to know how big the output buffers are
  with 200,000-byte (200 KB) output buffers:
    800,000,000 bytes / 200,000 bytes per seek --> 4,000 seeks
  total seek & rotation time = 4,000 x 11 msec (about 44 seconds)
  transfer time is still about 60 seconds

Consider Table 8.1 (p. 323)
  What if we use keysort for the 800 MB file? --> 24 hrs 26 mins 40 secs
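A rough back-of-the-envelope total, using only the numbers assumed in these slides (11 msec per access, roughly 60 seconds of transfer per pass over the 800 MB file):

$$
\begin{aligned}
\text{step 1} &\approx 80 \times 11\,\text{ms} + 60\,\text{s} \approx 61\,\text{s} \\
\text{step 2} &\approx 61\,\text{s} \\
\text{step 3} &\approx 6400 \times 11\,\text{ms} + 60\,\text{s} \approx 130\,\text{s} \\
\text{step 4} &\approx 4000 \times 11\,\text{ms} + 60\,\text{s} \approx 104\,\text{s} \\
\text{total} &\approx 356\,\text{s} \approx 6\ \text{minutes, versus roughly a day for keysort}
\end{aligned}
$$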


Effect of buffering on the number of seeks required


The 800 MB file (8,000,000 sorted records) is divided into 80 runs of 10 MB each. During the merge, the 10 MB of RAM is split into 80 input buffers, so reading each run takes 80 accesses:

  1st run  = 80 buffers' worth (80 accesses)
  2nd run  = 80 buffers' worth (80 accesses)
  ...
  80th run = 80 buffers' worth (80 accesses)


Sorting a Very Large File

Two kinds of I/O

Sort phase

I/O is sequential if using heapsort


Since sequential access is minimal seeking, we cannot
algorithmically speed up I/O

Merge phase

RAM buffers for each run get loaded, reloaded at predictable


times -> random access
For performance, look for ways to cut down on the number of
random accesses that occur while reading runs
you can have some chance here!


The Cost of Increasing the File Size

K-way merge of K runs
  merge sort = O(K^2) in disk accesses (the merge operation alone requires about K^2 seeks)
  if K is a big number, you are in trouble!
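Why roughly K^2? With M bytes of memory and a file of N bytes, the counting goes as follows (the same reasoning as the buffering slide above, where K = 80 gives 6,400 seeks):

$$
K=\frac{N}{M}\ \text{runs},\qquad
\text{buffer per run}=\frac{M}{K},\qquad
\text{reads per run}=\frac{N/K}{M/K}=\frac{N}{M}=K,\qquad
\text{total buffer reloads}\approx K \times K = K^{2}.
$$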

Some ways to reduce the time (8.5.4, 8.5.5, 8.5.6)
  add more hardware (disk drives, RAM, I/O channels)
  reduce the order of the merge (K), increasing the buffer size available to each run
  increase the lengths of the initial sorted runs
  find ways to overlap I/O operations


Hardware-based Improvements

Increasing the amount of RAM
  longer and fewer initial runs
  fewer seeks
Increasing the number of disk drives
  no delay due to seek time after generation of runs
  assign input and output to separate drives
Increasing the number of I/O channels
  with separate I/O channels, I/O can overlap
  improves transmission time


Decreasing the Number of Seeks Using Multiple-step Merges

K-way merge characteristics
  a selection tree is used
  K is proportional to N (the number of runs grows with the file size)
  the number of comparisons is N * log2 K (for a K-way merge of N records)
    O(N log N): reasonably efficient in comparisons

Reducing seeks means reducing the number of runs, so each run can be given a bigger buffer space
  a multiple-step merge provides a way to do this without more RAM


Multiple-step merge(1)

Do not merge all runs at one time
  break the original set of runs into small groups and merge the runs in each group separately
  this leads to fewer seeks, but extra transmission time for the second pass
  reads every record twice during merging: once to form the intermediate runs and once to form the final sorted file
  similar in spirit to using a selection tree when merging n lists


Two-step merge of 800 runs

(25 sets x 32 runs) = 800 runs

  Step 1: the 800 runs are divided into 25 sets of 32 runs each;
          each set is merged (a 32-way merge) into one intermediate run
  Step 2: the 25 intermediate runs are merged (a 25-way merge) into the final sorted file


Multiple-step merge(2)

Essence of multiple-step merging
  Can we do even better with more than two steps?
  it increases the available buffer space for each run
  an extra pass versus a decrease in random accesses
  a trade-off between seek & rotation time and transmission time

Major costs in a merge sort
  seek and rotation time, transmission time, buffer size, and the number of runs


Increasing Run Lengths Using Replacement Selection(1)

Facts of Life
  we want to use heapsort in memory
  we want to create longer output runs
  can we produce longer output runs while still using a heap in memory?

Replacement Selection
  Idea
    always select the key in memory that has the lowest value
    output that key
    replace it with a new key from the input list
    use 2 heaps in the memory buffer
(continued...)


Increasing Run Lengths Using Replacement Selection(2)

Implementation
  step 1: read records into memory and sort them with heapsort; this heap is the primary heap
  step 2: write out the record with the lowest key value
  step 3: bring in a new record and compare its key with that of the record just output
    step 3-a: if the new key is higher, insert the new record into its proper place in the primary heap, along with the other records selected for output
    step 3-b: if the new key is lower, place the record in a secondary heap, which holds records with key values lower than those already written out
  step 4: repeat step 3 as long as there are records in the primary heap and records to be read in; when the primary heap is empty, make the secondary heap the primary heap and repeat steps 2 and 3
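A compact sketch of these four steps (an illustration, not the book's code: it uses std::priority_queue min-heaps instead of the slides' Heap class, and ReadNextKey / WriteKey / StartNewRun are hypothetical I/O helpers):

#include <queue>
#include <vector>
#include <functional>

bool ReadNextKey(int & key);   // hypothetical: get the next input key, false at EOF
void WriteKey(int key);        // hypothetical: append a key to the current run
void StartNewRun();            // hypothetical: close the current run, open a new one

typedef std::priority_queue<int, std::vector<int>, std::greater<int> > MinHeap;

void ReplacementSelection(int P)   // P = number of keys that fit in memory
{
  MinHeap primary, secondary;      // secondary collects keys that arrive too late (step 3-b)
  int key;
  while ((int)primary.size() < P && ReadNextKey(key))   // step 1: fill memory
    primary.push(key);
  while (!primary.empty()) {
    StartNewRun();
    while (!primary.empty()) {
      int lowest = primary.top(); primary.pop();
      WriteKey(lowest);                                 // step 2: output the lowest key
      if (ReadNextKey(key)) {                           // step 3: bring in a new key
        if (key >= lowest) primary.push(key);           // 3-a: it can still go in this run
        else secondary.push(key);                       // 3-b: it must wait for the next run
      }
    }
    primary.swap(secondary);                            // step 4: secondary becomes primary
  }
}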


Example of the principle underlying replacement selection
Input:
21, 67, 12, 5, 47, 16

(the front of the input string is to the right; memory is organized as a heap)

  Remaining input    Memory (P=3)      Output run
  21, 67, 12          5   47   16
  21, 67              12  47   16      5
  21                  67  47   16      12, 5
  --                  67  47   21      16, 12, 5
  --                  67  47   --      21, 16, 12, 5
  --                  67  --   --      47, 21, 16, 12, 5
  --                  --  --   --      67, 47, 21, 16, 12, 5


Replacement Selection(1)

What happens if a key arrives in memory too late to be output into its proper position relative to the other keys (e.g., if the 4th key read were 2 rather than 12)?
  use a second heap, holding keys to be included in the next run
  refer to Figure 8.25 (page 335)

Two questions
  Given P locations in memory, how long a run can we expect replacement selection to produce, on the average?
    on the average, we can expect a run length of 2P
  What are the costs of using replacement selection?
  Knuth provides an excellent discussion (pages 335-336)
(continued...)


Comparison of access times required to sort 8 million records, using both RAM sorts and replacement selection

  Approach                          Records/seek   Size of runs   Seeks to     Merge order   Total number   Total seek & rotation
                                    to form runs   formed         form runs    used          of seeks       delay time
  800 RAM sorts followed
  by an 800-way merge               10,000         10,000         1,600        800           681,600        58
  Replacement selection followed
  by a 534-way merge
  (records in random order)         2,500          15,000         6,400        534           521,134        48
  Replacement selection followed
  by a 200-way merge
  (records partially ordered)       2,500          40,000         6,400        200           206,400        30


Step-by-step operation of replacement selection with 2 heaps, working to form two sorted runs (1)
Input:
33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16

(the front of the input string is to the right; parenthesized keys belong to the second heap, i.e., the next run)

  Remaining input                          Memory (P=3)          Output run (A)
  33, 18, 24, 58, 14, 17, 7, 21, 67, 12     5    47    16
  33, 18, 24, 58, 14, 17, 7, 21, 67         12   47    16        5
  33, 18, 24, 58, 14, 17, 7, 21             67   47    16        12, 5
  33, 18, 24, 58, 14, 17, 7                 67   47    21        16, 12, 5
  33, 18, 24, 58, 14, 17                    67   47   ( 7)       21, 16, 12, 5
  33, 18, 24, 58, 14                        67  (17)  ( 7)       47, 21, 16, 12, 5
  33, 18, 24, 58                           (14) (17)  ( 7)       67, 47, 21, 16, 12, 5

Step-by-step operation of replacement selection, working to form two sorted runs (2)
First run complete; now start building the second run

  Remaining input    Memory (P=3)      Output run (B)
  33, 18, 24, 58     14   17    7
  33, 18, 24         14   17   58      7
  33, 18             24   17   58      14, 7
  33                 24   18   58      17, 14, 7
  --                 24   33   58      18, 17, 14, 7
  --                 --   33   58      24, 18, 17, 14, 7
  --                 --   --   58      33, 24, 18, 17, 14, 7
  --                 --   --   --      58, 33, 24, 18, 17, 14, 7


Replacement Selection Plus Multiple Merging

The total number of seeks is less than for the one-step merge
The two-step merge requires transferring the data two more times than the one-step merge does
  the two-step merges and replacement selection are still better, but the results are less dramatic
  refer to the tables on the next slides


Comparison of merges, considering transmission times (1): one-step merge

  Approach                     Records/seek   Merge pattern   Seeks for sorts   Seek + rotational   Total passes    Total transmission   Total of seek, rotation, and
                               to form runs   used            and merges        delay time (min)    over the file   time (min)           transmission times (min)
  RAM sorts                    10,000         800-way         681,700           298                 4               43                   341
  Replacement selection
  (records in random order)    2,500          534-way         521,134           228                 4               43                   271
  Replacement selection
  (records partially ordered)  2,500          200-way         206,400           90                  4               43                   133

(continued...)


Comparison of merges, considering transmission times (2): two-step merge

  Approach                     Records/seek   Merge pattern              Seeks for sorts   Seek + rotational   Total passes    Total transmission   Total of seek, rotation, and
                               to form runs   used                       and merges        delay time (min)    over the file   time (min)           transmission times (min)
  RAM sorts                    10,000         25 x 32-way (one 25-way)   127,200           56                  6               65                   121
  Replacement selection
  (records in random order)    2,500          19 x 28-way (one 19-way)   124,438           55                  6               65                   120
  Replacement selection
  (records partially ordered)  2,500          20 x 10-way (one 20-way)   110,400           48                  6               65                   113

Using Two Disks with Replacement Selection

Two disk drives

Sort phase

the run selection & output can overlap

Merge phase

input & output can overlap


reduce transmission by 50%
seeking is virtually eliminated

output disk becomes input disk, and vice versa


seeking will occur on input disk, output is sequential

substantially reducing merge & transmission time



Memory organization for replacement selection

  disk 1  -->  input buffers  -->  heap  -->  output buffers  -->  disk 2


More Drives? More Processors?

More drives?

Until I/O becomes so fast that processing cannot keep up with it

More processors?

mainframes
vector and array processors
massively parallel machines
very fast local area networks


Effects of Multiprogramming

Increases the efficiency of the overall system by overlapping processing and I/O
The effects are very hard to predict


A Concept Toolkit for External Sorting

For in-RAM sorting, use heapsort
  use as much RAM as possible
Use a multiple-step merge when the number of initial runs is so large that seek and rotation time is much greater than transmission time
Use replacement selection when there is a good possibility that the records are partially ordered
Use more than one disk drive and I/O channel so that reading and writing can overlap
Look for ways to take advantage of new architectures and systems
  parallel processing or high-speed networks


Sorting Files on Tape

Balanced Merge with several tape drives

Figure 8.28 (2-way balanced merge on 4 tapes)
  Step 1 (initial distribution of runs):
    T1: R1 R3 R5 R7 R9
    T2: R2 R4 R6 R8 R10
    T3: --
    T4: --

If P is the number of passes, N the number of runs, and k the number of input drives, then
  P = ceiling(log_k N)

  4 tape drives (2 for input, 2 for output), 10 runs ==> 4 passes
  20 tape drives (10 for input, 10 for output), 200 runs ==> 3 passes
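Checking the two examples against the formula:

$$
P=\lceil \log_{k} N \rceil:\qquad
k=2,\ N=10 \ \Rightarrow\ \lceil \log_{2} 10 \rceil = 4\ \text{passes};\qquad
k=10,\ N=200 \ \Rightarrow\ \lceil \log_{10} 200 \rceil = 3\ \text{passes}.
$$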


Other ways of Balanced Merge

(Fig 8.30) One possible merge pattern for 10 runs on 4 tapes
(entries are run lengths, in units of initial runs, on each tape at each step)

           T1           T2           T3       T4
  Step 1   1 1 1 1 1    1 1 1 1 1    --       --
  Step 2   --           --           2 2 2    2 2
  Step 3   4            4            2        --
  Step 4   --           --           --       10

(Fig 8.31) Another merge pattern for the same 10 runs

           T1           T2       T3     T4
  Step 1   1 1 1 1 1    1 1 1    1 1    --
  Step 2   1 1 1        1        --     3 3
  Step 3   1 1          --       5      3
  Step 4   1            4        5      --
  Step 5   --           --       --     10


K-way Balanced Merge on Tapes

Some difficult questions
  How does one choose an initial distribution that leads readily to an efficient merge pattern?
  Are there algorithmic descriptions of the merge patterns, given an initial distribution?
  Given N runs and J tape drives, is there some way to compute the optimal merging performance, so that we have a yardstick against which to compare the performance of any specific algorithm?


Unix: Sorting and Cosequential Processing

Sorting in Unix

The Unix sort command


The qsort library routine

Cosequential processing utilities in Unix

Compare files byte by byte: cmp
Report differences between files: diff
Select lines common to two sorted files: comm


Let's Review!!
8.1 Cosequential operations
8.2 Application of the Model to a General Ledger Program
8.3 Extension of the Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix
