
AN ABSTRACT OF THE THESIS OF

Boontee Kruatrachue for the degree of Doctor of Philosophy in


Electrical and Computer Engineering presented on June 10, 1987.
Title: Static Task Scheduling and Grain Packing in Parallel
Processing Systems
Redacted for Privacy
Abstract approved:
Theodore G. Lewis

We extend previous results for optimally scheduling


concurrent program modules, called tasks, on a fixed, finite number
of parallel processors in two fundamental ways: (1) we introduce a
new heuristic which considers the time delay imposed by message
transmission among concurrently running tasks; and (2) we
introduce a second heuristic which maximizes program execution
speed by duplicating tasks. Simulation results are given which
suggest an order of magnitude improvement in program execution
speed over previous scheduling heuristics. The first solution, called
ISH (insertion scheduling heuristic), provides only a small
improvement over current solutions but has a smaller time
complexity than DSH: O(N²). DSH (duplication scheduling heuristic)
is an O(N⁴) heuristic that (1) gives up to an order of magnitude
improvement in performance, (2) solves the max-min problem of
parallel processor scheduling by duplicating selected scheduled
tasks on some PEs, and (3) gives monotonically growing
improvements as the number of PEs is increased. The max-min
problem is caused by the trade-off between maximum parallelism
versus minimum communication delay.

The DSH is also applied in "Grain Packing", which is a new way


to define the grain size for a user program on a specific parallel
processing system. Instead of defining the grain size before
scheduling, grain packing uses fine grain scheduling to construct
larger grains. In this way all available parallelism is considered as
well as the communication delay.
Static Task Scheduling and Grain Packing
in
Parallel Processing Systems

by

Boontee Kruatrachue

A THESIS

submitted to
Oregon State University

in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy

Completed June 10, 1987


Commencement June, 1988
APPROVED:

Professor of Computer Science, in charge of major

Head of Department of Electrical & Computer Engineering

Dean of Graduate School

Date thesis is presented June 10, 1987

Typed by Boontee Kruatrachue


DEDICATION

This dissertation is dedicated to my aunt


Dr. Fuangfoong Kruatrachue,
my parents
Dr. Mongkol Kruatrachue
Dr. Foongfuang Kruatrachue
Mrs. Praneet Kruatrachue
my brother and sister
Samapap Kruatrachue
Kritawan Kruatrachue.
ACKNOWLEDGEMENTS

I am deeply grateful to Dr. Ted Lewis for his help, guidance,


inspiration, and many hours he spent reading the thesis and
commenting on it. Also, I thank him for his constant encouragement
and confidence in me throughout the years of my research at Oregon
State University.

I would like to express my gratitude to my parents and my


aunts for their love, support and encouragement.

I thank my sister for all her love and support. Finally, I


express my gratitude to both the Electrical and Computer
Engineering Department and Computer Science Department for
financial support and computer equipment.
TABLE OF CONTENTS

Page
I. INTRODUCTION 1

1.1 Motivation and Purpose of this Research 1

1.2 Previous Works 4

1.3 Scope 11

1.4 Outline of the Dissertation 15

II. THE PARALLEL PROCESSOR SCHEDULING


ENVIRONMENT 17

2.1 Background 17

2.1.1 Scheduling Definition, Function and Goal 17

2.1.2 Critical Path Nature of a Task Graph


and Gantt Chart 18

2.1.3 Difference between Allocation and


Scheduling 19

2.1.4 List Scheduling 20

2.2 The Parallel Processor Scheduling Problem 20

2.2.1 Communication and Parallelism


Trade off Problem 21

2.2.2 Grain Size Problem 23

2.2.3 Level Alteration and the Critical Path 24

III. DUPLICATION SCHEDULING HEURISTIC (DSH) &


INSERTION SCHEDULING HEURISTIC (ISH) 30

3.1 Introduction 30

3.2 Definitions 30
TABLE OF CONTENTS
(continuation)
Page

3.3 Insertion Scheduling Heuristic (ISH) 34

3.4 Duplication Scheduling Heuristic (DSH) 39

3.5 Complexity of ISH and DSH 44

IV. EXPERIMENT RESULTS 65

V. OPTIMAL GRAIN DETERMINATION FOR


PARALLEL PROCESSING SYSTEM 79

5.1 Introduction 79

5.2 Grain Packing Approach 82

VI. CONCLUSION 93

6.1 Significance of this Research 93

6.2 Future Related Research 94

BIBLIOGRAPHY 96

APPENDIX 101
LIST OF FIGURES
Figure Page

1.1 An Example of Program Represented by Task Graph 16

2.1 Comparison between Allocation and Scheduling 26

2.2 The Allocation Consideration due to


Communication Delay 27

2.3 The Comparison between Parallelism and


Communication Delay 27

2.4 The Comparison of Large Grain Versus Fine


Grain Scheduling 28

3.1 The Segment of a Three-processor Schedule


after Node 7's Assignment 47

3.2 The Main List Scheduler 48

3.3 The Update_R_queue 49

3.4 The Example of Task Graph List Scheduling 50

3.5 The Locate-PE of ISH 52

3.6 The Assigned-Node of ISH 53

3.7 The Example of ISH Task Insertion and Scheduling 54

3.8 The Choices in Implementing ISH Step 2.1.2 55

3.9 The Average Speedup Ratio Comparison between


ISH Versions on 10 Random-generated 350 Nodes
Task Graphs 56

3.10 The Task Duplication Concept 57

3.11 The Task Duplication Process (TDP) 58

3.12 The Copy_LIP of DSH 59

3.13 The Sample Task Graph of 11 Nodes and Its


Intermediate Gantt Chart using DSH 60
LIST OF FIGURES
(continuation)
Figure Page
3.14 The Example of Duplication Task List (CTlst)
on PEi Constructed by TDP for Node 11 61

3.15 The Locate_PE of DSH 63

3.16 The Assigned-Node of DSH 64

4.1 The Average Speedup Ratio Comparison


(20 Nodes) 68

4.2 The Average Speedup Ratio Comparison


(50 Nodes) 69

4.3 The Average Speedup Ratio Comparison


(100 Nodes) 70

4.4 The Average Speedup Ratio Comparison


(150 Nodes) 71

4.5 The Average Speedup Ratio Comparison


(250 Nodes) 72

4.6 The Average Speedup Ratio Comparison


(350 Nodes) 73

4.7 The Average DSH Speedup Ratio Comparison


for Different Delay 74

4.8 The Average Speedup Ratio Comparison for


Non-identical Node Size and Non-identical
Communication Delay 75

5.1 An Example of User Program and Its


Task Graph Construction 87

5.2 An Example of Fine Grain Task Graph


Construction 88

5.3 An Example of Fine Grain Node Size


Calculation 89
LIST OF FIGURES
(continuation)
Figure Page
5.4 An Example of Communication Delay Calculation
for the specific architecture 90

5.5 An Example of Fine Grain Scheduling using DSH


in Comparison with Load Balancing and Single PE 91

5.6 An Example of User Program Restructure 92


LIST OF TABLES
Table Page

2.1 The Complexity Comparison of Scheduling


Problems 29

4.1 ISH's Speedup Ratio Improvement over Hu and Yu


Heuristics 76

4.2 DSH's Speedup Ratio Improvement over Hu and Yu


Heuristics 76

4.3 The Effect of Communication Delay on Speedup


Ratio Comparison 77

4.4 DSH and ISH Speedup Ratio Improvement over


Hu and Yu Heuristics for Variable Node Size
and Communication Delay 78
STATIC TASK SCHEDULING AND GRAIN PACKING IN
PARALLEL PROCESSING SYSTEMS
CHAPTER I
INTRODUCTION
1.1 Motivation and Purpose of this Research
The goal of parallel processing is to achieve very high-speed
computing by partitioning a sequential program into concurrent
parts and then executing the concurrent parts, simultaneously. This
usually means that a programmer must manually schedule the
concurrent parts on the available processors. Besides being time-
consuming and prone to errors, this approach limits the number and
kind of applications that can take advantage of parallel processing.
Instead, an automatic means of allocating and scheduling parts of a
program on multiple processors is desired.

The research problem in this dissertation is the problem of


optimal scheduling of concurrent program modules called tasks on a
fixed, finite number of parallel processors. We call this the
scheduling problem for parallel processor systems. To be specific,
consider a system of fully connected identical processors and
assume the program or tasks submitted to the system are specified
by an acyclic directed task graph TG (N,E). N is a set of numbered
nodes, each node is assigned an integer weight representing the
execution time of a task. The node execution time is also called its
size. E is a set of edges representing precedence constraints among
nodes. Edges are also assigned an integer representing the amount of
communication between nodes in the form of a message which is

sent from one node to another. A node must wait for all of its input
messages before it can start execution. A node N1 is called an
immediate predecessor to node N2, if N1 sends a message to N2.
Likewise, node N2 is called the immediate successor node of N1
because N2 receives a message from N1. A successor (predecessor)
node may have zero, one or more immediate predecessors
(successors).
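The task graph TG(N, E) described above can be sketched as a small data structure. The following is an illustrative sketch only (the class and method names are ours, not the thesis's); it records node sizes, message transmission times on edges, and the immediate predecessor/successor relations, and then builds the fragment of Figure 1.1 in which node 1 precedes nodes 2 through 6 and node 7 succeeds nodes 5 and 6.

```python
from collections import defaultdict

# Hypothetical representation of a task graph TG(N, E): integer node
# weights are execution times ("sizes"), integer edge weights are message
# transmission times, and edges encode precedence constraints.
class TaskGraph:
    def __init__(self):
        self.size = {}                 # node -> execution time
        self.delay = {}                # (pred, succ) -> message transmission time
        self.succs = defaultdict(set)  # node -> immediate successors
        self.preds = defaultdict(set)  # node -> immediate predecessors

    def add_node(self, n, exec_time):
        self.size[n] = exec_time

    def add_edge(self, n1, n2, msg_time):
        # n1 is an immediate predecessor of n2: n2 must wait for n1's message
        self.delay[(n1, n2)] = msg_time
        self.succs[n1].add(n2)
        self.preds[n2].add(n1)

# A fragment of the graph in Figure 1.1: all node and message sizes are 1.
tg = TaskGraph()
for n in range(1, 8):
    tg.add_node(n, 1)
for n in (2, 3, 4, 5, 6):
    tg.add_edge(1, n, 1)
tg.add_edge(5, 7, 1)
tg.add_edge(6, 7, 1)
```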

Figure 1.1 is an example of a program represented by TG. The


label number inside a node is a node number. The number adjacent
to a node is the node's size, and represents the execution time for
that node. The number on each branch is the message's size, and
represents the transmission time for that branch. Node 1 has no
predecessor, but node 1 is the immediate predecessor of nodes
2,3,4,5 and 6. Node 7 is the immediate successor to nodes 5 and 6.
All nodes are of size 1, and all messages are of size 1.

Given TG for some program P, we seek the scheduling of tasks


to processors such that the shortest possible execution time is
obtained. Furthermore, we assume that only one program is
executed on the parallel processor at a time. The resulting shortest-
time execution is called the optimal scheduling length for program P.
Ullman [Ullm 75] has shown that finding the optimal schedule length
for this type of problem is generally very hard and is an NP-
complete problem. Because of the computational complexity of
optimal solution strategies, a need has arisen for a simplified
suboptimal ("near optimal") approach to this scheduling problem.

Scheduling research for this problem has emphasized heuristic


approaches.

Without communication delay, this type of problem is solved


by using algorithms belonging to the class of HLF (high level first) or
CP (critical path) [Hu 61] (see previous work section). In HLF, tasks
are executed from the highest level, level by level, where level is
defined as the longest path to the ending node. HLF algorithms
provide "the nearest optimal" schedule most of the time compared to
other heuristics [Adam 74].

In order to be more realistic, we need to add the


communication delay to this scheduling problem. This addition
makes the problem harder. In fact, an algorithm for finding the
optimal solution for an arbitrary task graph is not known. Moreover,
addition of communication delays introduces a key difficulty in
parallel processor scheduling, the so called max-min problem.

The max-min problem occurs because of the trade-off possible


between maximum parallelism versus minimum communication
delay.

If tasks are allocated to parallel processors in such a manner


as to maximize the amount of simultaneous execution of tasks
without regard for the cost of message transmission, the result may
be a program that runs slower than on a single processor. This case
arises when communication costs are high compared to execution
time delays. Alternately, when available parallelism is not exploited

to advantage, the parallel processor may be underutilized.


Therefore, a "good near optimal" scheduling algorithm must solve
the max-min problem and consider available communication
bandwidth as well as available concurrency.
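The max-min trade-off just described can be made concrete with a small worked example. The numbers below are illustrative, not taken from the thesis: task 1 feeds tasks 2 and 3, every task takes 2 time units, and a cross-processor message takes msg units (zero units on the same processor, per the locality assumption).

```python
# Illustrative numbers (not from the thesis): task 1 of size 2 feeds
# tasks 2 and 3, each of size 2.
size1, size2, size3 = 2, 2, 2

def makespan_one_pe():
    # All three tasks on one processor: no communication delay at all.
    return size1 + size2 + size3

def makespan_two_pe(msg):
    # Task 1 and task 2 on PE1 (local, zero delay); task 3 on PE2 must
    # wait for task 1's message to cross the interconnection network.
    finish2 = size1 + size2
    finish3 = size1 + msg + size3
    return max(finish2, finish3)

# With msg = 5 the "maximally parallel" schedule takes 9 units, slower
# than the 6 units on a single processor; with msg = 1 it takes 5 units
# and parallelism wins. The best choice depends on the delay.
```

This is exactly the case described above: when communication costs are high compared to execution times, spreading tasks out for maximum parallelism makes the program slower than running it on one processor.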

While various scheduling problems have been studied for


many years, previous results are extended in three fundamental
ways by
1) introducing a new method which includes the time-delay
imposed by message transmission among tasks,
2) proposing an entirely new heuristic which maximizes
program execution speed by duplicating tasks on the fixed set
of processors, and
3) proposing a solution to the max-min problem by duplicating
tasks (this max-min problem has not widely been recognized
in the literature and so has never been solved by previous
researchers).

The scheduling results of these two new schedulers will also


show that load balancing does not yield "near optimal" schedules for
this scheduling problem.

Moreover, a new method to solve the "near optimal-grain size"


problem, called "Grain Packing", is proposed. Grain packing uses the
fine grain scheduling results of the new heuristic to find a "near-
optimal grain size" for an application program on a parallel
processing system.
1.2 Previous Works

The scheduling problem for parallel processor systems has


been widely studied [Coff 76]. A detailed survey of earlier results
can be found in a paper by Gonzalez[Gonz 77]. This survey presents
an extensive discussion of processor scheduling, as well as results on
job scheduling, taking into account a wide range of scheduling
constraints.

Research results in parallel processor scheduling can be


roughly classified into five groups based on the type of tasks to be
scheduled, type of parallel processor system and the goal of the
scheduler. The five groups are 1) Optimal precedent-scheduling, 2)
Optimal communication precedent scheduling, 3) Load balancing
communication scheduling, 4) Dynamic task scheduling, and 5)
Independent task scheduling.

1: Optimal precedent-scheduling. The objective of the first


type of scheduler is to minimize the schedule length. The tasks can
be represented by an acyclic directed task graph G(T,<), where T is a
set of nodes representing tasks (with known execution time), and <
is a set of edges representing the precedence constraints. We also
assume zero communication delay between any two communicating
tasks.

Finding the optimal schedule length for this type of problem is


generally very hard and is an NP-complete problem. The problem is
NP-complete even in two simple restricted cases, 1) scheduling unit-
time tasks to an arbitrary number of processors, 2) two-processor
scheduling with all tasks requiring one or two time units [Ullm 75].

With the addition of more restrictions this problem can be solved in


polynomial time [Lens 78]. For example, when the task graph is a
tree and all tasks execute in one time unit, a solution can be found in
O(n) time; see [Hu 61].

Hu's list scheduling algorithm uses a level number equal to the


length of the longest path from the node to the ending node as a
priority number, i.e. tasks are executed level by level from the
highest level first.
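Hu's level number can be sketched directly from this definition. The graph below is illustrative (not one of the thesis's examples): with unit-time tasks, the level of a node is one plus the highest level among its immediate successors, and the ending node has level 1.

```python
# A sketch of Hu's level number: the level of a node is the length of the
# longest path from it to the ending node, counting unit execution times.
# The diamond-shaped graph here is a hypothetical example.
succs = {1: [2, 3], 2: [4], 3: [4], 4: []}

def level(n):
    if not succs[n]:            # ending node
        return 1
    return 1 + max(level(s) for s in succs[n])

# Hu's list scheduler executes ready tasks level by level, highest first.
priority_order = sorted(succs, key=level, reverse=True)
```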

Coffman and Graham [Coff 72] gave an O(n²) list scheduling


algorithm similar to Hu's list scheduling algorithm except that the
task scheduling priorities are assigned in such a way that nodes at
the same level have different priorities. The algorithm gives an
optimal length schedule for an arbitrary graph containing unit-time
delay tasks on a 2-processor system. Sethi [Seth 76] gave a less
complex algorithm which provides the same schedule in 0(na(n) + e),
where e is the number of edges in the graph, and a(n) is almost a
constant function of n. Nett [Nett 76] extended Hu's algorithm to
provide the same optimal schedule for 2 processors. Task priorities
are still equal to the node's level number, but tasks at the same
level are ordered by the number of each task's immediate
successors.

Kaufman [Kauf 74] reports an algorithm similar to Hu's


algorithm that works on a tree containing tasks with arbitrary
execution time. This algorithm finds a schedule in time bounded by
1 + (p-1)t/T, where p = number of processors, t = longest task

execution time, and T = summation of all task execution times.

All of the algorithms described above belong to the more


general class of HLF (highest level first) algorithms, which have also
been called CP (critical path), LP (longest path), or LPT (largest
processing time) algorithms.

Since no schedule for an arbitrary graph can be shorter than


its critical path, the level number or critical path length is the key to
successful precedence task scheduling. In fact experimental studies
of simulated task graphs show that:

very simple HLF-algorithms can produce optimal schedules


most of the time for unit-time tasks (within 90% of optimal for
over 700 cases) [Bash 83],
HLF-algorithms yield "near optimal" schedules most of the
time for arbitrary time tasks (4.4 percent away from optimal
in 899 of 900 cases). HLF algorithms also provided the best
results compared to other list scheduling algorithms (847 cases
of 900 cases) [Adam 74].

Bounds on the ratio of HLF algorithm schedules versus optimal


schedules are summarized as follows:

Precedence    P     T     Algorithm    Bound        Ref.
Tree          Ar    =1    [Hu 61]      optimal      [Hu 61]
Ar            =2    =1    [Coff 72]    optimal      [Coff 72]
Ar            =2    Ar    HLF          4/3          [Chen 75]
Ar            Ar    Ar    HLF          2-1/(P-1)    [Chen 75]

where P = number of processors, T = task execution time, Ar
= arbitrary (precedence, number of processors, execution times).

Kohler [Kohl 75] suggested a branch-and-bound algorithm to


obtain the optimal solution for an arbitrary task graph. The
algorithm begins with construction of a searching tree. Then an
elimination rule is used to eliminate branches that violate the
constraints, and a selection rule is used to select only the most
promising branches first. Therefore, only part of the solution space
is searched. The algorithm is very general and can be applied to
many kinds of scheduling problems.

Branch-and-bound guarantees an optimal solution, but the


solution space for the precedence-constraint scheduling problem is
very large. Therefore, the branch-and-bound algorithm is much
slower than HLF-algorithms.

2: Optimal communication precedent scheduling. The second


type of scheduling problem is exactly the same as the first type but
with the addition of communication delays between communicating
nodes located in different processors.

Because of the computational complexity of optimal solution


strategies, a need has arisen for a simplified suboptimal approach to
this scheduling problem. Recent research in this area has
emphasized heuristic approaches.

Yu's [Yu 84] heuristics are based on Hu's algorithm [Hu 61].
Yu's improvements were to consider communication delays when

making task assignments, and to use a combinatorial min-max


weight matching scheme at each task assignment step. Yu's results
were compared with results from Hu's algorithms, and the
comparisons were good for large communication delays. However,
the results were not significantly different from those of Hu for
small communication delays. Also, nodes in the application task
graph must have identical size (execution time) and so must the
communication delays.

3: Load balancing communication scheduling can be


represented by a task graph G(T,E), where T is a set of nodes
representing tasks with known execution time, and E is a set of
edges representing the amount of communication between nodes
located on different processors (no precedence constraint is
assumed). The scheduling (allocation) objective is to minimize the
communication delay between processors and to balance the load
among processors. Since there are no precedence constraints, load
balancing yields a system with high throughput and faster response
time.

Exclusion of the precedence constraints results in a different


schedule objective, a completely different schedule solution, and a
totally different scheduling problem from the previous two types.
This type of problem is more suited for partitioning (allocation) of a
task graph than scheduling tasks on processors. While we are not
concerned with this kind of problem, there are three approaches
which have been taken by the following researchers:

Graph theoretical approach : [Ston 77], [Ston 78], [Bokh 81],


Integer programming approach : [Chu 80], [Ma 82],
Heuristic approach : [Etc 82], [Ma 84].

4: Dynamic task scheduling is about the same as the second


type of problem except that every parameter (task execution time,
amount of communication, precedence constraint, number of nodes
in task graph) is dynamic and can be changed during runtime. An
example of this type of task graph is a graph which represents a
program with loop or branching statements. We generally do not
know how many times a loop will be executed nor which branch will
be executed beforehand. Solutions to this type of problem are
obtained through stochastic scheduling algorithms.

A critical path can not be found until the dynamic graph is


executed. If the objective is to minimize the schedule length, the
schedule of this type of problem would have the following
properties:
1) scheduling must be done during task execution because
there is not enough information for scheduling before
execution.
2) Statistical data about the dynamic task graph may have to
be gathered during runtime in order to predict the behavior of
the task graph. [Schw 87]
3) Since the schedule is determined at runtime, a complex
schedule strategy might introduce excessive overhead with no
guarantee of a "near optimal" schedule length.

4) The "goodness" of the schedule length computed by the


"dynamic" scheduler depends on 1) how closely the scheduler
can predict the future behavior of a task graph, and 2) how
much excessive overhead is introduced.

Several scheduling models were proposed for dynamic


scheduling [Rama 76], [Jens 77], [Schw 87] along with a scheduler
heuristic [Rama 76], [Kung 81]. Some researchers have focused on
special cases, for example loops and branches [Tows 86].

5: Independent task scheduling: The tasks for this type of


scheduling problem are completely independent from each other.
There are several objectives, such as load balancing and individual
task deadlines. Since there are no precedence constraints among
tasks, this type of scheduling is different from the first two.
Examples of this type of scheduler are found in [Ni 85] and [Krau
75].
1.3 Scope
Given a program P which can be partitioned into separate
cooperating and communicating parts called tasks, the goal of this
research is to devise an efficient heuristic scheduler to assign
statically a set of cooperative tasks (task graph) of P into a finite
number of processors in the parallel processor system. The
assumptions on the task graph and parallel processor system are,

1: (Data Flow Property): Each task of P, pi must wait for its


input before it can begin to execute; but once its inputs have

been obtained, the task can execute to completion in time ti.


Therefore, the task assignment must consider the order of
execution of each task (precedence constraint).
2: (Non-homogeneous Time Delays): Nodes of a task graph to
be scheduled need not be of the same size (execution times ti
may vary from task to task), hence the assignment of tasks to
processors must be optimized with regard to variable but
statically defined time delay.
3: (Non-homogeneous Communication Delays): Tasks
communicate with one another. The time taken to
communicate is measured as a communication time delay.
These delays are related to the interconnection topology of the
underlying hardware, the method of transmission, and the
number of bits of information to be transmitted. The
underlying hardware parameters are assumed to be fixed, but
the number of bits per message may vary, hence delays may
vary from connection to connection. Thus, communication
delays are not required to be identical.
4: (Max Parallelism): Parallelism must be exploited as much
as possible.
5: (Min Communication) : Communication time delays must be
suppressed as much as possible.

The main objective of this research is to devise a heuristic


that takes advantage of parallelism while at the same time reducing
the communication delay, so that the execution of a task graph is
completed as early as possible. Unfortunately, maximal parallelism

and minimal communication delay compete with one another


leading to a trade-off problem (Max-min), as is discussed in Chapter
II.

Load balancing is not included in our objectives, since load


balancing tends to distribute tasks evenly to every processor in the
system even though tasks can be distributed to a smaller number of
processors. This tends to increase the communication delay, hence
increasing the runtime.

The assumptions above can be rephrased in terms of the task


graph as follows:
1: (Acyclic Graph): The task graph is acyclic,
2: (Static Topology): The task graph is defined beforehand, and
remains unchanged during program execution,
3: (Non-homogeneous Labels): Message and node size need not
be identical,
4: (Non-preemption): Once a task begins executing, it executes
to completion,
5: (Static Scheduling): Once a task has been scheduled on a
processor, the schedule is not changed.

In addition, we assume the following properties of the parallel


processing system.
1: (Connectivity): All processors are connected so that a
message can be sent from any processor to any other
processor (for simplicity, the queuing delay that may occur
due to the congestion on the interconnection network is not

analyzed),
2: (Locality): Transmission time between nodes located on the
same processor is assumed to be zero time units,
3: (Identicality): Each processor is the same (speed, function).
4: (Co-Processor I/O): A processor can execute tasks and
communicate with another processor at the same time. This is
typical with I/O processors (channels) and direct memory
access. Hence, the term "processing element" (PE) is used
instead of "processor" to imply the existence of an I/O
processor,
5: (Single Application): Only one program is executed at a time
on the parallel processing system--this is to maximize
execution speed.

In spite of the connectivity and identicality assumptions, the


proposed DSH scheduling algorithm can also be applied to partially
connected parallel processor systems with non-identical processors of
different speeds.

An efficient solution to the optimal scheduling problem


described above is a significant contribution to parallel processing
system research, because the solution enables more effective and
widespread use of parallel processing systems for general purpose
applications. The new scheduling heuristic, DSH, produces near-
optimal solutions with speedup ratios several times better than the
earlier solutions (the Hu and Yu heuristics [Hu 61], [Yu 84]). Also, the
heuristic can be used in related disciplines to solve problems in

operations research, job assignment in industrial management, and


programmer work assignment in software engineering.
1.4 Outline of the Dissertation

The research is organized into six chapters as follows:


The second chapter gives background and details on the
scheduling problem. Problems and difficulties in parallel processor
precedence task scheduling with communication delay are
presented, including the parallelism and communication trade-off
problem.

Heuristic scheduling algorithms and their complexities are


described in Chapter Three along with examples of scheduling of
sample task graphs.

Chapter Four presents results from the proposed heuristics


and compares these results with selected scheduling algorithms
described in the literature. The software used to conduct these
experiments includes a random task graph generator, Hu's algorithm,
Yu's algorithm, and a report generator.

Chapter Five presents "grain packing", which is the method


used to define optimal grain sizes for a user application program on
a specific parallel processing system using DSH.

Finally, Chapter Six contains a summary, conclusions, and


recommendations for future research.

[Task graph figure omitted: x = node number, z = node size,
y = communication delay on each branch.]

Program Parallel;
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q : real;
Begin
a := 1; b := 2; c := 3; d := 4; e := 5; f := 6; {Node 1}
g := a*b; {Node 2}
h := c*d; {Node 3}
i := d*e; {Node 4}
j := e*f; {Node 5}
k := d*f; {Node 6}
l := j*k; {Node 7}
m := 4*l; {Node 8}
n := 3*m; {Node 9}
o := n*i; {Node 10}
p := o*h; {Node 11}
q := p*g; {Node 12}
end.

Figure 1.1 An Example of Program Represented by Task Graph.
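One way the task graph of Figure 1.1 can be derived from the program is a def-use rule: each statement is a node, and an edge runs from the node that defines a variable to each node that reads it. The sketch below is our illustration of that idea (the thesis does not spell out this construction), using a subset of the figure's statements.

```python
# Hypothetical def-use extraction for a subset of Figure 1.1's program:
# (node number, variables defined, variables read).
program = [
    (1, ["a", "b", "c", "d", "e", "f"], []),
    (2, ["g"], ["a", "b"]),
    (3, ["h"], ["c", "d"]),
    (5, ["j"], ["e", "f"]),
    (6, ["k"], ["d", "f"]),
    (7, ["l"], ["j", "k"]),
]

defined_by, edges = {}, set()
for node, defs, reads in program:
    for v in reads:
        edges.add((defined_by[v], node))   # predecessor -> successor
    for v in defs:
        defined_by[v] = node
```

The edges recovered here match the text: node 1 is the immediate predecessor of nodes 2 through 6, and node 7 is the immediate successor of nodes 5 and 6.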



CHAPTER II
THE PARALLEL PROCESSOR SCHEDULING ENVIRONMENT
2.1 Background

2.1.1. Scheduling Definition, Function and Goals.

A scheduler is an algorithm that takes as inputs a task graph


and the number of available processors, and produces as output an
allocation and execution order of tasks to be processed by each
processor. An optimal schedule results when the scheduler
guarantees the shortest time to complete all tasks in a graph.
However, it is known that finding an optimal schedule is a difficult
problem and generally intractable. Consequently, restrictions are
added to reduce the computational complexity of this class of
problems. Even so, most of these problems are still classified as NP-
hard problems [Coff 76], as shown in Table 2.1.

Scheduling problems can be classified by the type of jobs that


schedulers distribute, the type of processor systems that process the
jobs, and the restrictive conditions added to the problem. First, jobs
vary from a set of independent tasks to a set of precedence-
constrained tasks. The latter can be modeled by a precedence graph,
where each node represents a task and each arc represents a
precedence constraint between nodes connected by the arc. An
example of a precedence task graph is shown in Figure 2.1.

Second, processor systems depend on their connection


topology, number of processors, communication protocol among

processors, and the similarity of processors in the system.

Third, restrictive conditions can be based on characteristics of


a scheduler itself (preemption or non-preemption). The difficulty of
the scheduling problem differs among each class of scheduler as
shown in Table 2.1.

2.1.2. Critical Path Nature of the Task Graph and Gantt Chart.

The length of a path in a task graph is the sum of all node and
branch sizes along the path. The critical path of a task graph is the
longest path from the start node to the exit node. For example, there
are two identical longest length paths in Figure 2.1 so there are two
critical paths. One is through nodes 1,5,7,8,9,10,11,12 and the other
through nodes 1,6,7,8,9,10,11,12. The lower bound on the execution
time of the task graph is the length of the critical path. That is, a
program cannot execute to completion in less time than given by the
length of the critical path regardless of the number of processors in
the system.
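The path-length and critical-path definitions above can be sketched as follows. The small graph and its weights are illustrative (not Figure 2.1): the length of a path sums node sizes and branch sizes along it, and the critical path length is the longest such sum from the start node to the exit node.

```python
# Illustrative three-node graph: node sizes, branch (message) sizes,
# and immediate successors.
size = {1: 1, 2: 1, 3: 1}
delay = {(1, 2): 1, (1, 3): 2, (2, 3): 0}
succs = {1: [2, 3], 2: [3], 3: []}

def longest_to_exit(n):
    # Longest path length from n to the exit, summing node and branch sizes.
    if not succs[n]:
        return size[n]
    return size[n] + max(delay[(n, s)] + longest_to_exit(s) for s in succs[n])

# Lower bound on the task graph's execution time, regardless of the
# number of processors in the system.
critical_path_length = longest_to_exit(1)
```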

Processor run time is given by the difference (tstop - tstart),


where tstart and tstop are the times a processor starts and finishes
node execution. The execution time of a task graph on a parallel
processor system is the longest run time among processors in the
system, given by maxi (tstopi - tstarti). Processor run time includes
both communication delay time and execution time.
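These definitions amount to a one-line computation over per-processor start and stop times. The numbers below are illustrative stand-ins, not the values of Figure 2.1.

```python
# Hypothetical start/stop times for three processing elements.
t_start = {"PE1": 0, "PE2": 1, "PE3": 0}
t_stop = {"PE1": 10, "PE2": 7, "PE3": 6}

# Run time per processor: tstop - tstart (this span already includes any
# communication delay time as well as node execution time).
run_time = {pe: t_stop[pe] - t_start[pe] for pe in t_start}

# The task graph's execution time is the longest run time among processors.
execution_time = max(run_time.values())
```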

A form of Gantt chart [Clar 52], as shown in Figure 2.1, is used


to show the scheduling of nodes on processors. From the scheduling
Gantt chart in Figure 2.1, the longest run time belongs to processor
PE1, and the execution time of the task graph is 10 time units.
that this time includes the communication delay time (2 units, from
node 1 to 5 and node 5 to 7) plus node execution time in the critical
path (8 units, node 1, 5, 7, 8, 9, 10, 11, 12).

2.1.3. Difference between Allocation and Scheduling

Figure 2.1 also shows how an allocation might be different


than a schedule. This difference is due to the bound on the execution
time defined by the critical path of the task graph. An allocation
defines where the node is to execute, but not the order of execution
of each node. The order of execution is dictated by the availability of
input messages to each node on each processor. That is, the node
that receives all of its messages begins execution first. But, this
execution order may not be the optimal one.

In Figure 2.1, nodes 2, 3, 4, 5, and 6 are the successor nodes of
node 1. Since the directed graph does not specify the order of
sending communication messages (this order depends on
communication protocol), nodes 2,3, and 4 receive their messages
either before or at the same time as node 5. If they are assigned to
the same PE (PE2) and an execution order is not specified, nodes 2,3,
and 4 may be scheduled for execution before node 5 as shown in the
Allocation Gantt Chart. This results in longer task graph execution
time because node 5 is in the critical path, and any execution time
before node 5 execution (node 2,3,4) is included in the overall run
time. But, if the scheduler specifies the order of node execution so
node 5 can be assigned and executed before nodes 2, 3, and 4, then
an overall shorter execution time is obtained as shown in the
scheduling Gantt chart of Figure 2.1.

2.1.4. List Scheduling

One class of scheduling heuristics, into which many parallel
processing schedulers fall, is list scheduling. In list
scheduling each task (node) is assigned a priority, then a list of nodes
is constructed in decreasing priority order. Whenever a processor is
available, a ready node with the highest priority is selected from the
list and assigned to the processor. If more than one node has the
same priority, a node is selected randomly.

The schedulers in this class differ only in the way that each
scheduler assigns priorities to nodes. Priority assignment results in
different schedules because nodes are selected in a different order.
A comparison between different node priorities (level, co-level,
random) was made by Adam et al. [Adam 74]. The comparison
suggests that using level number as node priority comes nearest
to optimal.
2.2 The Parallel Processor Scheduling Problem

The principal difficulties encountered when designing
schedulers of this type are reviewed before presenting new
heuristic schedulers. The first two problems are due to
communication delay, and the third problem is due to the alteration
of critical paths of a task graph, which is a subset of the Dynamic
Scheduling Problem discussed in Chapter One.

2.2.1. Communication and Parallelism Trade off Problem

When there is no communication delay between tasks,
all ready tasks can be allocated to all available processors so that the
overall execution time of the task graph is reduced. This situation
may occur in a shared-memory parallel processor where messages
are passed at memory cycle speeds. In fact, this is the basis of
earlier schedulers that do not consider communication delays, for
example Hu's heuristic [Hu 61].

On the other hand, when there is a communication delay,
scheduling must be based on both the communication delay and the
point in time when each processor is ready for execution. It is
possible for ready nodes with long communication delays to end up
assigned to the same processor as their immediate predecessors as
shown in Figure 2.2 Gantt chart A. Node 3 start time on PE2 is later
than node 3 start time on PE1 since the communication delay of
node 3 is greater than the execution time of node 2. So node 3
should be assigned to PE1, which has its immediate predecessor
node 1. Conversely, as shown in Gantt chart B, if the communication
delay is less than the execution time of node 2, node 3 should be
assigned to PE2 instead. Hence, adding a communication delay
constraint increases the difficulty of arriving at an optimal schedule
because the scheduler must examine the start time of each node on
each available processor in order to select the best one.
As shown above, it would be a mistake to always increase the
amount of parallelism available by simply starting each node as
soon as possible. Distributing parallel nodes to as many processors as
possible tends to increase the communication delay, which
contributes to the overall execution time. In short, there is a trade-
off between taking advantage of maximal parallelism versus
minimizing communication delay. This problem has not been widely
recognized and will be called the max-min problem for parallel
processing.

The task graph in Figure 2.3 demonstrates the dramatic effect
of the max-min problem. If communication delay D3 between node
1 and node 3 is less than execution time of node 2, node 3 is
assigned to PE2 in order to begin its execution sooner. Because node
2 and node 3 are the immediate predecessors of node 4, and they are
assigned to different processors, node 4 cannot avoid the
communication delay from one of its immediate predecessors. Thus,
the execution time of this task graph is the summation of the
execution time of nodes 1, 2, 4, plus communication delay Dx or Dy
depending on where node 4 is assigned.

But, what happens if node 4's communication delays are larger
than node 3's execution time? Then, assigning node 3 to PE1 will
result in a shorter task graph execution time. This is so even if node
3 finishes its execution later than the previous assignment as shown
in Gantt chart C.

Current communication delay scheduling heuristics try to take
advantage of parallelism and reduce communication delay. But none
of the current heuristics solves the max-min problem. A new
method is proposed for the solution of the max-min problem by
duplicating tasks where necessary to reduce the overall
communication delay and maximize parallelism at the same time.
The method is called DSH.

2.2.2. Grain Size Problem

Another problem closely related to the max-min problem is
the grain size problem. The challenge of this problem is to determine
the "best" node size for every node in the task graph such that the
task graph execution time is minimized. The size of a node can be
altered by adding or removing tasks from the node. Such nodes are
called grains to indicate the packing of tasks within a node.

If a grain is too big, parallelism is reduced because potentially
concurrent tasks are grouped in a node and executed sequentially
by one processor. If a grain size is too small, more overhead in the
form of context switching, scheduling time, and communication
delay is added to the overall execution time.

The solution to the max-min problem can be used to solve the
grain size problem, since in the grain size problem there is also a
trade-off between parallelism (small grain) and communication
(large grain).

As shown in Figure 2.4, the small grain scheduler can take
advantage of parallelism, but the large grain scheduler cannot
because there is no parallelism in the large grain task graph. Also,
for large grain, the order of execution of each small task grouped
inside the larger grain is fixed before schedule time, and the order
may not be the optimal one. As shown in Figure 2.4 Gantt chart D,
fixing the order of execution of the small grains too early in the
algorithm results in sequential execution of the whole task graph.

Figure 2.4 shows the technique used to define the best grain
size. The grain size is defined by grouping the scheduled tasks
obtained from the small grain schedule shown in Gantt chart A. This
forms the larger grain schedule shown in Gantt chart B. The
grouping decision depends on the underlying parallel processor
system hardware and software. Usually, the more we can group
smaller grains, the shorter the task graph execution time because of
the reduction of overhead.
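The grouping step above can be sketched in code. This is a simplified illustration (the function name and data shapes are assumptions, and the grouping policy shown, one packed grain per PE, is the simplest possible case; the thesis notes the real decision is hardware- and software-dependent).

```python
# Grain packing: group the tasks of a fine-grain schedule into larger
# grains, here one grain per PE, preserving each PE's execution order.
def pack_grains(schedule):
    """schedule: list of (node, pe, start_time) tuples from a fine-grain
    scheduler. Returns {pe: [nodes]} with each PE's tasks ordered by
    start time, i.e. one packed grain per PE."""
    grains = {}
    for node, pe, start in sorted(schedule, key=lambda x: (x[1], x[2])):
        grains.setdefault(pe, []).append(node)
    return grains

# Four scheduled fine-grain tasks on two PEs become two packed grains.
print(pack_grains([(1, 0, 0), (2, 1, 1), (5, 0, 1), (3, 1, 2)]))
# {0: [1, 5], 1: [2, 3]}
```

Because the ordering inside each grain comes from the fine-grain schedule, the execution order within a grain is the scheduled one rather than an order fixed before schedule time.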

2.2.3. Level Alteration and the Critical Path

Another important scheduling problem caused by the
introduction of non-zero communication delays is due to the
alteration of node levels and their impact on critical path calculation.
Any heuristic that uses level numbers or critical path length faces
this problem. The level of a node is defined as the length of the
longest path from the node to the exit node. This length includes all
node execution times and all communication delay times along the
path.

Node level was first used in scheduling by Hu [Hu 61]. Adam
[Adam 74] showed that among all priority schedulers, level priority
schedulers are the best at getting close to the optimal schedule.

Unfortunately, the level numbers do not remain constant when
communication delays are considered because the level of each node
changes as the length of the path leading to the exit node changes.
The path length varies depending on communication delay, and the
communication delay changes depending on task allocation.
Communication delay is zero if tasks are allocated to the same PE,
and non-zero if tasks are allocated to different PEs. We call this the
level number problem for parallel processor scheduling.

The scheduling techniques used in this paper do not solve the
level number problem, nor does any other known scheduling
technique. The node level in this paper is the summation of the node
sizes along the path to the exit node excluding the communication
delay. The reason that this is an unsolved problem is as follows.
Level number is used as node priority which has to be defined
before schedule time in order to construct a schedule. But, the
communication delay, which is a part of a level number as
previously described, is not defined until nodes are scheduled
because the communication delay is a function of assigned PE. A
better approximation of level number may be obtained by iteration:
schedule, then calculate node level, schedule ... etc. The time
complexity would be tremendously increased and the resulting level
number would be only an approximation.
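The level definition used in this paper, the sum of node sizes along the longest path to the exit node, excluding communication delay, can be sketched as follows (names and data shapes are assumptions for illustration, not the thesis code).

```python
# Node level: sum of node sizes along the longest path to the exit node,
# with communication delays deliberately excluded, since they are unknown
# before PEs are assigned.
def node_levels(sizes, succ):
    """sizes: {node: size}; succ: {node: [immediate successors]}.
    Returns {node: level}; an exit node's level is its own size."""
    memo = {}
    def level(u):
        if u not in memo:
            memo[u] = sizes[u] + max((level(v) for v in succ.get(u, [])),
                                     default=0)
        return memo[u]
    return {u: level(u) for u in sizes}

# Chain 1 -> 2 -> 3 with unit sizes: levels are 3, 2, 1.
print(node_levels({1: 1, 2: 1, 3: 1}, {1: [2], 2: [3]}))  # {1: 3, 2: 2, 3: 1}
```

Including branch delays here would require knowing the PE assignment, which is exactly the circularity described above.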
[Figure 2.1: Allocation Gantt Chart and Scheduling Gantt Chart for PE1
and PE2 over time units 0-14; in the task graph notation, x = node
number, z = node size, y = communication delay.]

Figure 2.1 Comparison between Allocation and Scheduling.

[Figure 2.2: Gantt charts A (Dx greater than node 2's execution time)
and B (Dx less than node 2's execution time) for PE1 and PE2.]

Figure 2.2 The Allocation Consideration due to Communication Delay.

[Figure 2.3: Gantt charts A, B, and C showing alternative assignments
of nodes 1-4 to PE1 and PE2 under communication delays D3 and Dx.]

Figure 2.3 The Comparison between Parallelism and Communication Delay.

[Figure 2.4: Gantt charts A and B show small grain scheduling followed
by grain packing (grouping) on PE1 and PE2; Gantt chart D shows large
grain scheduling.]

Figure 2.4 The Comparison of Large Grain versus Fine Grain Scheduling.

Graph        Task Execution      Number of         Complexity
Topology     Time                Predecessors
--------------------------------------------------------------
tree         identical           arbitrary         O(N)
arbitrary    identical           2                 O(N)
arbitrary    identical           arbitrary         NP-hard
arbitrary    1 or 2 time units   fixed P >= 2      NP-hard
arbitrary    arbitrary           arbitrary         strong NP-hard

(P = number of processors)

Table 2.1 The Complexity Comparison of Scheduling Problems.

CHAPTER III
DUPLICATION SCHEDULING HEURISTIC (DSH) & INSERTION
SCHEDULING HEURISTIC (ISH)

3.1. Introduction
Two new scheduling heuristics are proposed to solve the
scheduling problem as discussed in Chapter I. They are the insertion
scheduling heuristic (ISH) and the duplication scheduling heuristic
(DSH). Both heuristics are improvements over the Hu heuristic that
address the communication delay and max-min problems. ISH is
essentially Hu's heuristic with an improvement for the
communication delay problem, inserting tasks in available
communication delay time slots. However, ISH does not solve the
max-min problem. DSH solves the max-min problem using task
duplication scheduling.

The inputs of a heuristic are the task graph (directed task
graph model) and the parallel processing system as described
earlier. The output is a Gantt chart of all PEs in the system. The
Gantt chart shows the order of execution of each task of a task graph
on each PE. Some definitions and terminology pertaining to these
heuristics, with examples drawn from Figure 3.1, are given as follows.

3.2. Definitions
A node is ready if all of its immediate predecessors have been
assigned. Node 8 is ready since node 7 was assigned to PE2.

The ready time of a PE (trp) is the time when a PE has
finished its assigned task and is ready for execution of the next task.
The ready times for PE1, PE2, PE3 are 10, 14, 9 respectively.

The message ready time of a node (tmr) is the time when
all the messages to the node have been received by the PE
containing the node. This time is the latest communication delay of
all the messages sent from the node's immediate predecessors. The
immediate predecessor of the node that has the latest
communication delay will be called the latest immediate predecessor
(LIP). The times that PE2 receives messages from nodes 4, 5, 6 are
12, 13, 11 respectively. Hence, tmr = 13 and LIP = node 5.

The starting time (tsn) of a node for a PE is the earliest time
when the node can start its execution on the PE. It is the larger of
either the ready time of the PE or the message ready time of the
node. Node 7's tsn is 13.

The idle time slot (tidp) is the time interval between the PE
ready time and the assigned node starting time, if the node starting
time is more than the PE ready time. Otherwise the idle time is zero.
PE2's tidp is 11 to 13.
tidp = 0, if node start time <= PE ready time;
otherwise tidp = node start time - PE ready time

The finishing time (tf) of a node for a PE is the time when the
node finishes its execution on that PE. It is the starting time of that
node plus the size of that node.
tf = tsn + node size
PE2's finish time is 14.
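The three timing definitions above translate directly into code. This is a sketch with assumed helper names; it reproduces the worked example (PE2 ready at 11, node 7's messages ready at 13, node size 1).

```python
# Timing definitions: tsn = max(trp, tmr); tidp = max(0, tsn - trp);
# tf = tsn + node size.
def start_time(trp, tmr):
    return max(trp, tmr)        # node starts when both PE and messages ready

def idle_slot(trp, tsn):
    return max(0, tsn - trp)    # length of the idle interval on the PE

def finish_time(tsn, size):
    return tsn + size

# Node 7 on PE2: ready time 11, message ready time 13, size 1.
tsn = start_time(11, 13)
print(tsn, idle_slot(11, tsn), finish_time(tsn, 1))   # 13 2 14
```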

The ready queue is a queue of ordered ready nodes. The
order is defined by the node priority; a node's priority is its level.
The highest level node is first in the queue, so it is scheduled
first, and the lowest level node is scheduled last. Nodes at the same
level are ordered according to the number of their immediate
successors; the node with the greatest number of immediate
successors is scheduled first.

The assigned node (AN) is the highest priority node selected
from the ready queue. After node 7 is assigned to PE2, node 8 is
inserted in the ready queue and becomes the assigned node.

The assigned PE is the PE chosen to execute the assigned
node. The assigned PE for node 7 is PE2.

The first ready PE (PERF) is the first PE in the set of all PEs
to become ready after each assigned node is scheduled on the
assigned PE. After node 7 is assigned, PE3 is PERF and is ready at
time unit 9.

ISH and DSH share the same list scheduling heuristic shown in
pseudocode in Figure 3.2. The heuristic tries to minimize the
schedule length by minimizing the finishing time of each assigned
node. At first, the level of each node in the task graph is calculated
and used as each node's priority. (An example of node level is shown
in Figure 3.4). The ready queue is then initialized by inserting all
nodes with no immediate predecessors into the ready queue. This
usually means that only one node is inserted into the ready queue
initially. Then, the node at the top of the queue (with the highest
priority, i.e. highest level) is assigned to a PE. The PE is selected by
the processor selection routine called Locate-PE. For the first node,
any PE can be selected because no PE has been assigned a task, yet.
The heuristic continues assigning nodes and updating the ready
queue until all the nodes in the task graph are assigned.
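The main loop just described can be sketched as a compact, runnable program. This is an illustration, not the thesis pseudocode of Figure 3.2: the Locate_PE stand-in here simply picks the earliest-ready PE, ignoring communication delay and the same-level tie-break by successor count, both of which the real heuristics handle.

```python
import heapq

# List scheduling skeleton: pop the highest-level ready node, place it on
# a PE, then release successors whose predecessors are all assigned
# (the Update_R_queue step).
def list_schedule(sizes, succ, levels, num_pe):
    pred_count = {u: 0 for u in sizes}
    for u, vs in succ.items():
        for v in vs:
            pred_count[v] += 1
    # heapq is a min-heap, so negate levels to pop the highest level first
    ready = [(-levels[u], u) for u in sizes if pred_count[u] == 0]
    heapq.heapify(ready)
    pe_ready = [0] * num_pe
    schedule = []
    while ready:
        _, node = heapq.heappop(ready)
        pe = min(range(num_pe), key=lambda p: pe_ready[p])  # Locate_PE stand-in
        schedule.append((node, pe, pe_ready[pe]))
        pe_ready[pe] += sizes[node]
        for v in succ.get(node, []):                        # Update_R_queue
            pred_count[v] -= 1
            if pred_count[v] == 0:
                heapq.heappush(ready, (-levels[v], v))
    return schedule

# Node 1 fans out to nodes 2 and 3 on two PEs.
print(list_schedule({1: 1, 2: 1, 3: 1}, {1: [2, 3]},
                    {1: 2, 2: 1, 3: 1}, 2))
# [(1, 0, 0), (2, 1, 0), (3, 0, 1)]
```

ISH and DSH keep this loop and differ only in the Locate_PE and Assign_node routines, as the following sections describe.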

Each time a node is assigned, the assigned PE is marked, then
the marked PE with the earliest ready time is located. The last node
assigned to that PE is called the event node, and is used to update
the ready queue. The reason to mark and unmark a PE is to prevent
one event node from updating the ready queue a second time. Each
time an event node is selected, the PERF is unmarked. Hence, the
event node cannot be used to update the ready queue again. The PE
is marked each time a new node is assigned to it, and unmarked
each time an event node is chosen.

Pseudocode of the Update_R_queue routine is shown in Figure 3.3.
The ready queue is updated by inserting a new ready node chosen
after the assignment of an event node. The new ready node is
selected as follows: the number of immediate predecessors of each
immediate successor node of the event node is decremented by one.
If the number is zero, that immediate successor node is chosen as
a new ready node and is inserted into the ready queue.
Decrementing is repeated for all immediate successors of the event
node. The new ready nodes are placed in the ready queue according
to their priority, thus maintaining the order of the queue.

A node is assigned by getting a ready node from the front of
the ready queue. Thus, the node with the highest level and with the
maximum number of immediate successors is assigned first. Then,
the processor selection routine (Locate_PE) is used to select the PE,
and the Assign_node routine assigns the node to the selected PE.

Figure 3.4 illustrates a step-by-step example of the main list
scheduler along with the scheduling result in Gantt chart form,
ready queue, and task graph at each scheduling step, given the
sample task graph and two PEs in the parallel processing system.
The Locate_PE routine, used in this example, selects the assigned-PE
that has the earliest start time for each assigned-node without any
task duplication but considering the communication delay. The
Assign_Node, in this example, assigns an assigned-node to the
selected PE at its start time calculated by Locate_PE without any
task insertion.

3.3. Insertion Scheduling Heuristic (ISH)

ISH is essentially Hu's heuristic with two improvements. First,
Hu assumed no communication delay. So, the first improvement is in
the processor selection routine which takes into account
communication delay. The second improvement is in the Assign-
Node routine which makes use of the idle time of a PE by trying to
assign ready nodes (hole tasks) to idle time slots. ISH tries to
minimize the schedule length by utilizing idle time. However, ISH
does not solve the max-min problem of trading communication
delay for parallelism.

The Locate-PE routine returns the PE (PEL) that can start
executing the assigned node earliest and returns the assigned node
start time on the PEL, ST. If there is no communication delay, the
PEL is equal to PERF. But in the presence of communication delays,
the PE containing the assigned-node's immediate predecessor (or the
PE that communicates with the assigned node) must be considered
as well because communication delay is assumed to be zero when
the message source and destination tasks are located in the same PE.
On the other hand, the communication delay can be the major part
of an assigned node's starting time. Therefore, the assigned PE is the
first PE that can start executing the assigned node, and is selected
from the PERF and the PEs that have the assigned node's immediate
predecessors. This means that the minimum schedule length is
enforced instead of load balancing, because the assigned PE is
selected by the assigned node's starting time. The details of the
Locate-PE routine are shown in Figure 3.5.
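The selection rule just described can be sketched as follows. This is an illustrative simplification of Figure 3.5 (names and data shapes are assumptions): the candidate set is the first-ready PE plus the PEs holding the node's immediate predecessors, and a predecessor's message costs nothing on its own PE.

```python
# ISH-style processor selection: pick, among the candidate PEs, the one
# giving the assigned node the earliest start time.
def locate_pe(pe_ready, preds, pe_of, finish, delay):
    """pe_ready: ready time per PE; preds: the node's immediate
    predecessors; pe_of[p]: PE holding predecessor p; finish[p]: its
    finish time; delay[p]: its message delay when crossing PEs.
    Returns (best_pe, start_time)."""
    def start_on(pe):
        tmr = max((finish[p] + (0 if pe_of[p] == pe else delay[p])
                   for p in preds), default=0)
        return max(pe_ready[pe], tmr)

    candidates = {min(range(len(pe_ready)), key=lambda p: pe_ready[p])}
    candidates |= {pe_of[p] for p in preds}
    return min(((pe, start_on(pe)) for pe in candidates), key=lambda x: x[1])

# One predecessor on PE0 (finishes at 3, message delay 4): starting on
# busy PE0 at 3 beats waiting on idle PE1 until 7.
print(locate_pe([3, 0], [1], {1: 0}, {1: 3}, {1: 4}))   # (0, 3)
```

The example shows the point made above: minimum start time, not load balancing, drives the selection.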

The purpose of the assign node routine is not only to assign
node AN to the PEL column of the Gantt chart but also to insert hole
tasks into the idle time slot created by the communication delays.
Instead of changing the node priority to select the assigned node
that creates the smallest idle time, the node with the highest level is
still the highest priority node and is assigned first. But, the idle time
created by this strategy is exploited by searching through all the
ready nodes from the front of the ready queue to find nodes (hole
tasks) that can be inserted into the idle time slot. Searching
continues until the idle time slot is filled or no hole task is found.
The details of the Assign_node routine are shown in Figure 3.6.

Figure 3.7 contains a step-by-step example of the Locate_PE and
Assign_Node procedures used by the ISH scheduler along with the
intermediate scheduling results returned in the Gantt chart form
and the ready queue, given a sample task graph and two PEs.

Figure 3.7 starts with Locate_PE for node 7 (nodes 1, 4, 5,
and 6 have already been assigned, Gantt chart A). Locate_PE
returns PE1 as the assigned PE, with node 7's start time at time unit
4 and the idle time slot from time unit 3 to 4. Then, Assign_node tries
to assign a hole task to the idle time slot, and finally Assign_node
assigns node 7 to PE1 starting at time unit 4.

The hole task assignment (Step 2 of Figure 3.6) starts with
searching from the top of the ready queue to find a ready node
that can start execution within the idle time slot. The first hole task
candidate node from the ready queue is node 2. Although node 2 can
start execution within the idle time slot, it can start execution
on PE3 earlier, so the hole task search continues to the next node,
node 3. Node 3 is selected as a hole task and assigned to the idle
time slot (since it can start execution within the idle time slot and
cannot start execution on another PE earlier than PE1). Then, node 3

is used to update the ready queue (Update_R_queue) and the idle
time slot is modified due to the assignment of node 3. After the
modification, the idle time slot is empty and the hole task
assignment step (Step 2) is done. Then, node 7 is assigned at its start
time on PE1.

Finally, Figure 3.7 shows the improvement in the final task
schedule result of ISH compared to the schedule result without hole
task insertion.

From the example of task graph scheduling in Figure 3.7, there
are choices in implementing ISH. Fifty task graphs, with the number
of PEs varied from 2 to 70 for a total of 1750 runs, were simulated
in order to make decisions on those choices. The choices and their
simulation results are as follows:

1: NISH, the non-insertion scheduling heuristic, is level-list
scheduling with communication delay but no task insertion (no Step
2 in Figure 3.6).

2: ISH0 is about the same as ISH. The difference is in Step
2.1.2, as shown in Figure 3.8. The hole task for ISH0 is any ready
task that can start execution within the idle time slot, even though it
may be able to start execution on another PE earlier.

3: ISH1 is ISH. There are two criteria for a ready task to be a
hole task. The first criterion is that the ready task must be able to start
execution within the idle time slot. The second criterion is that the
ready task must not be able to start execution on another PE earlier.
ISH1 heuristically enforces the second criterion by only inserting a
ready task whose message ready time is no sooner than the idle time
slot's start time, since a node whose messages are ready sooner tends
to be able to start execution on another PE earlier (in Figure 3.7,
node 2 is not a hole task since its message ready time on PE1 is time
unit 1 and the idle time slot's start time is time unit 3).

4: ISH2 has the same criteria as ISH1 except that it heuristically
enforces the second criterion in a stricter way (more runtime).
ISH2 tests the second criterion by performing Locate_PE on each
candidate hole task to ensure that it cannot start on another PE
earlier.

From the simulation results in Figure 3.9, we can see that:

1: ISH2 has the best speedup ratio in most cases (except for
PE = 6-12, where ISH1 is better). On average, the descending order of
the speedup ratio is ISH2, ISH1, NISH, and ISH0.

2: ISH0 is better than NISH only when the number of PEs is
small (2-6). The main reason is that ISH0 does not enforce the
second criterion. Hence, a hole task could start execution on another
PE earlier than in the idle time slot, finally resulting in a longer
overall task graph execution time than the scheduler without task
insertion (NISH). The reason that ISH0 is better on a small number
of PEs is that ISH0 can squeeze some extra PE execution time out of
the wasted idle time, and for a small number of PEs (relative to the
number of nodes in the task graph) most PEs are busy and the saved
idle time is beneficial (as seen in Figure 3.7). On the other hand,
when the number of PEs is large, most PEs are idle, so it is risky to
try to save idle time if hole tasks could start execution on
another PE earlier.

In conclusion, ISH1 is selected for ISH, since its speedup
performance is better than NISH's and its time complexity is less
than ISH2's (ISH2 gives only a small improvement, 4 percent, over ISH1).

3.4. Duplication Scheduling Heuristic (DSH)

DSH is an improvement over ISH because it duplicates tasks to
reduce the cost of communication. The duplication of a task is called
the task duplication concept (TDC). Task duplication has not been
explored by other researchers.

The TDC solves the max-min problem by duplicating the task
nodes that influence the communication delay. As shown in Figure
3.10, node 1 is duplicated to run on both PE1 and PE2. The
duplication decreases the starting time of node 3 on PE2, so that
parallelism (nodes 2 and 3) is fully exploited. Node 2 and node 3 run
in parallel on different PEs, and node 5 can start execution sooner
than if we were to assign node 2 and node 3 to the same PE. The TDC
takes advantage of parallelism and reduces the communication
delays at the same time. Since node 3 is also duplicated to run on
PE1, there is no communication delay for node 4 except the time it
takes to run the duplicated node 3 (the communication delay is
usually bigger than the duplicated nodes for a "fine grain").

The TDC is used in a task duplication heuristic (TDP), which is
shown in detail in Figure 3.11. The inputs to TDP are an assigned
node and a PE that is a candidate for the assigned PE (PEc). TDP
calculates the starting time ST of the assigned node, and constructs
the duplicated task list, CTlst, for the assigned node's starting time ST,
if there are any duplicated tasks. The duplication task list is a list of
duplicated tasks and their starting times on the candidate PE.

TDP begins with the calculation of the message ready time of
the assigned node and finds the latest immediate predecessor node
(LIP) that causes that starting time. Then, if the assigned node's
starting time is more than the PEc ready time (there is a
communication delay) and the LIP was not assigned to PEc, TDP tries
to minimize the communication delay by copying predecessor(s) of
the assigned node to PEc with the hope that the copied node will
reduce the communication delay. To copy the LIP, there are two cases
depending on where the LIP is located.

In the first case, the LIP was assigned to another PE in the
system (not PEc). TDP tries to copy the LIP node into the idle time slot
of PEc, since the duplication of the LIP may improve the start
time of the assigned node.

For the second case, the LIP has already been duplicated in the
idle time slot of PEc. To start the assigned node sooner, the LIP
has to start its execution sooner. Thus, the CTlst is searched to find
the node that affects the start time of the LIP node. The search
starts with the LIP of the LIP of the assigned node. If that node is
assigned to some other PE, the search process stops and that node is
the search node. Otherwise, the process searches deeper levels until
it finds the LIP of the LIP node that is located on some other PE, and
that node becomes the search node.

Once the search node is found, it is copied into the idle time
slot of PEc. Then all the duplicated task nodes that start after the
copied node are removed and re-copied due to the duplication of the
search node. The reason for re-copying is that the start time (and
the order in CTlst) of a node located after the copied search node
may change due to the presence of the search node. The re-copy
may indirectly cause the LIP node's starting time to decrease so the
assigned node can start execution sooner. The duplication process
continues until duplication fails or the LIP is already assigned to
PEc (no reason to copy). If some idle time remains, hole tasks are
inserted into the time slot as in the ISH heuristic.

The details of the copy routine (COPY_LIP) are shown in Figure
3.12. To copy the node, the starting time of the copied node and the
LIP node of the copied node have to be determined first. If the
copied node's starting time is within the idle time slot and no
duplicated node is assigned to that time, then the duplication is
successful and the LIP of the copied node is recorded for the
purpose of searching as previously described. If the copied node's
starting time is not in the idle time slot, the LIP node of the copied
node has to be duplicated if possible. Otherwise, the copy fails and
the duplication process is terminated at that point.
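The core trade-off that TDP evaluates, start late waiting for a message, or start earlier by copying the LIP into the idle slot, can be sketched as follows. This is a heavily simplified illustration (assumed names, one level of duplication only, no slot-capacity or recursive re-copy logic, all of which the full TDP of Figure 3.11 handles).

```python
# Start time of an assigned node on a candidate PE, allowing its latest
# immediate predecessor (LIP) to be duplicated into the PE's idle slot.
def start_with_duplication(pe_ready, preds, finish, delay, sizes,
                           pred_msg_ready):
    """preds: immediate predecessors, all assumed to be on other PEs.
    finish[p], delay[p], sizes[p]: predecessor p's finish time, message
    delay, and size; pred_msg_ready[p]: when p's own inputs are ready on
    the candidate PE. Returns the better of the two start times."""
    arrival = {p: finish[p] + delay[p] for p in preds}
    lip = max(arrival, key=arrival.get)
    start = max(pe_ready, max(arrival.values()))       # no duplication
    # Duplicate the LIP into the idle slot [pe_ready, start).
    copy_finish = max(pe_ready, pred_msg_ready[lip]) + sizes[lip]
    others = max((arrival[p] for p in preds if p != lip), default=0)
    duplicated = max(copy_finish, others, pe_ready)    # with duplication
    return min(start, duplicated)

# Predecessor node 1 (size 1, inputs ready at 0) finishes at 1 with
# message delay 5: waiting gives start 6; copying node 1 gives start 1.
print(start_with_duplication(0, [1], {1: 1}, {1: 5}, {1: 1}, {1: 0}))   # 1
```

The example mirrors the max-min resolution described above: duplication wins exactly when the communication delay exceeds the cost of re-running the predecessor locally.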

If the idle time slot is large enough, TDP will continue to
duplicate the predecessors of the assigned node in order to make the
node start sooner. TDP is used in Locate-PE and Assign-node in
order to find the starting time and the duplicated task list for each
assigned node on any specified PE.

Figure 3.14 illustrates a step-by-step example of the CTlst
created by TDP, given the sample task graph in Figure 3.13, three
PEs in the parallel processing system, and the intermediate
scheduling result of DSH in Gantt chart A.

In Figure 3.14, all nodes in the task graph have already been
assigned except the last node (node 11, the exit node, Gantt chart A).
Locate_PE of DSH has to find the start time of node 11 by using
TDP on all three PEs in order to compare and make the assignment
decision. Figure 3.14 shows only TDP (CTlst) on PE1, which is also the
one that Locate_PE will select for node 11's assignment.

The Locate_PE of DSH is shown in Figure 3.15. It is basically
the same as that of ISH. The difference is in the use of TDP to
find the assigned node starting time instead of finding it by
calculating communication delay without TDC.

The Assign_node's function is the same as that of ISH except
for the use of TDP to find the assigned node's starting time along
with the duplication task list. The details of the Assigned_node are
shown in Figure 3.16.

An example of a schedule computed by DSH is shown in Figure
3.13 Gantt chart B (final assignment). This example illustrates how
DSH solves the max-min problem. From Gantt chart A (after node
10's assignment), we see that DSH has taken full advantage of all
existing parallelism within the task graph and no PE is idle due to
communication delay. This is because no messages are passed, due to
the duplication of tasks. At this point, nodes 2, 3, 4, nodes 5, 6, 7,
and nodes 8, 9, 10 are scheduled to execute in parallel. But without
task duplication (ISH or NISH), none of this parallelism would be
exploited, because all nodes would run on the same PE due to the
extremely high communication delay.

Node 11's assignment illustrates the max-min problem.
Instead of scheduling node 11 at time 11, nodes 6, 9, 7, and 10 are
duplicated so that node 11 can be scheduled at time 10. In this
extreme case, the parallelism of nodes 8, 9, 10 and nodes 5, 6, 7 gives no
improvement in the overall task graph execution time. On the
contrary, the parallelism of nodes 8, 9, 10 and nodes 5, 6, 7 is harmful to
node 11's assignment. Consequently, DSH has to duplicate them and
schedule them to run serially on PE1 to avoid the communication
delays.

Before node 11's assignment, DSH tried to take advantage of
as much parallelism as possible even though the parallelism is
harmful to future node assignments (node 11). The duplications
of nodes 6, 9, 7, 10 at node 11's assignment can be viewed as feedback
(correction) from previous scheduling. From this feedback DSH
decides that maximum parallelism should not be exploited; only the
node 2, 4 parallelism. In short, DSH's scheduling strategy is to exploit
all possible parallelism first, but when it discovers that the parallelism
is harmful, DSH duplicates nodes and schedules them to
execute serially.

Notice the redundant node allocation in Gantt chart B (all nodes in PE2 and nodes 2, 7, 10 in PE3). A simple clean-up algorithm could be constructed to get rid of redundant tasks, resulting in the final Gantt chart C. Alternatively, the redundancies could be used for fault tolerance.

This simple example was intentionally constructed to show the max-min problem. It shows the feedback case and the creation of a duplication task list (CTlst). If one of the communication delays from node 8, 9, or 10 to node 11 were less than 6, the result would show a great improvement over scheduling without duplication (in this case, assigning all nodes to one PE, with a task graph execution time of 12), and the parallelism of nodes 8, 9, 10 and nodes 5, 6, 7 would be beneficial to node 11's assignment.

3.5. Complexity of ISH and DSH.

The differences between ISH and DSH are in the processor selection routine (Locate_PE) and in the way a node is assigned to the selected PE (Assign_node). The node priority (highest level) and the main algorithm are the same. Hence, the time complexity of the main algorithm is described first, where P = number of PEs in the system, N = number of nodes in a task graph, and B = number of branches in a task graph.

From the main algorithm in Figure 3.2: The complexity of the node level calculation (step 1) is O(B). Since N nodes are taken from the ready queue, the complexity of steps 4 and 7.1 is O(N). Also, N nodes are inserted into the ready queue, so node insertion is O(N). But before each insertion of a ready node, each of the ready node's predecessors must have been assigned to some PE. To check predecessor node assignment, each branch has to be traversed once, so the complexity of steps 3 and 6 is O(N+B). Each PE is initialized in step 2 with complexity O(P). In step 7.2, PERF is located before each node assignment, so the complexity of that step is O(NP). The complexity of the main algorithm is O(B + N + P + NP + O(Locate_PE) + O(Assign_node)).

In Locate_PE of ISH, the node start time is calculated for each PE that has been assigned one of the assigned node's immediate predecessors (bounded by P). The start time calculation is done by comparing the communication delay for each immediate predecessor node, with complexity O(N), so the execution time complexity of one call to Locate_PE ISH is O(PN). But Locate_PE is called whenever a node is assigned, so the total complexity becomes O(PN²).

For Assign_node of ISH, each time a node is assigned and there is an idle time slot, a hole task is searched for to fill the time slot. So the complexity of Assign_node ISH is O(N²).

Therefore, the execution time complexity of ISH is O(B + N + P + PN + PN² + N²), which is O(N²) for constant P.

TDP is used to calculate a node start time in Locate_PE of DSH. The order of TDP is O(N³). Therefore the complexity of Locate_PE DSH is O(PN⁴). The complexity of Assign_node DSH is O(N²).

Therefore, the execution time complexity of DSH is O(B + N + P + PN + PN⁴ + N²), which is O(N⁴) for constant P.

The time complexity of both Hu's heuristic and Yu's heuristic D is O(N) [Hu 61], [Yu 84]. The complexities of ISH and DSH are O(N²) and O(N⁴), respectively.

[Figure: a Gantt chart segment (time units 9-14) showing PE1, PE2, PE3, and the ready queue; x = don't care, i = idle time slot.]

Figure 3.1 The Segment of a Three-processor Schedule after Node 7's Assignment

Main List Scheduler

input: 1) TG, directed task graph
2) NP, number of PEs in the parallel processing system
output: PE_GC, Gantt chart consisting of an array [1..NP], one entry per PE in the system (a list of tasks ordered by their execution time on a PE, including each task's start time and finish time)
Begin
1) Level_graph(TG); {calculate the level number for each node in TG}
2) Init_Gantt(PE_GC); {reset the Gantt chart of each PE to empty}
3) Init_R_queue(RQ); {insert all nodes having no immediate predecessors into RQ, the ready queue, in order by their level number}
4) Get_node(AN,RQ); {get AN, the assigned node, from the front of the ready queue}
5) Assign_node(AN,0,PE_GC,1,RQ); {assign AN, the assigned node, to PE1 at start time 0}
repeat
6) Update_R_queue(RQ,AN,TG); {update the ready queue using the assigned node}
7) if RQ, the ready queue, is not empty then
7.1) Get_node(AN,RQ); {get a new AN from the front of RQ}
7.2) Locate_PE(AN,PE_GC,PEL,ST); {return PEL, the PE that has the smallest ST, AN's start time}
7.3) Assign_node(AN,ST,PE_GC,PEL,RQ);
until all nodes in TG are assigned

Figure 3.2 Main List Scheduler.
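The control flow of Figure 3.2 can be sketched in Python. This is a simplified illustration, not the thesis implementation: communication delay is ignored (a pure level-list scheduler), the task graph is a plain successor dictionary, and the Locate_PE step simply picks the PE giving the earliest start time.

```python
import heapq

def list_schedule(succ, size, level, num_pes):
    """Level-ordered list scheduling of a task graph, ignoring
    communication delay.  succ: node -> list of immediate successors,
    size: node -> execution time, level: node -> priority level."""
    npred = {n: 0 for n in succ}          # immediate-predecessor counts
    data_ready = {n: 0 for n in succ}     # latest predecessor finish time
    for n in succ:
        for s in succ[n]:
            npred[s] += 1
    # Ready queue ordered by level (higher level = higher priority).
    ready = [(-level[n], n) for n in succ if npred[n] == 0]
    heapq.heapify(ready)
    pe_ready = [0] * num_pes              # each PE's ready time
    gantt = {}                            # node -> (pe, start, finish)
    while ready:
        _, an = heapq.heappop(ready)      # step 7.1: get the assigned node
        # Locate_PE analogue: pick the PE giving the earliest start time.
        pe = min(range(num_pes),
                 key=lambda p: max(pe_ready[p], data_ready[an]))
        st = max(pe_ready[pe], data_ready[an])
        gantt[an] = (pe, st, st + size[an])   # Assign_node analogue
        pe_ready[pe] = st + size[an]
        for s in succ[an]:                # step 6: update the ready queue
            npred[s] -= 1
            data_ready[s] = max(data_ready[s], st + size[an])
            if npred[s] == 0:
                heapq.heappush(ready, (-level[s], s))
    return gantt
```

For example, a diamond graph 1 → {2, 3} → 4 with unit execution times on two PEs yields a schedule of length 3.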

Update_R_queue (AN,RQ,TG)

input: 1) AN, assigned node
2) RQ, ready queue
3) TG, task graph
output: RQ, ready queue
Begin
Let IMS be the set of immediate successor nodes of AN
For all nodes IN in IMS
Do
1) N = Num_of_Immediate_Pred(IN); {N = number of immediate predecessors of node IN}
2) N = N - 1; {store N as the number of predecessors of node IN for the next Update_R_queue call}
3) if N = 0 then
Ordered_Insert(IN,RQ)
{keep the ready queue in order by level number}

Figure 3.3 Update_R_queue.

[Figure: a worked list-scheduling trace on a four-node task graph with two PEs; bold numbers in the graph are node levels. Each stage shows the Gantt chart (PE1, PE2) and the ready queue with, for each node, N# (node number), #Pr (number of immediate predecessors not yet assigned), and Level (the node's level number). Stages shown: after initializing the Gantt chart and ready queue (step 3 of Figure 3.2); after assigning node 1 to PE1 (step 5 of Figure 3.2); after updating the ready queue using node 1 (step 6).]

Figure 3.4 The Example of Task Graph List Scheduling

[Figure, continued: the remaining stages of the trace — after assigning node 3 (step 7.3, Figure 3.2); after updating the ready queue using node 3 (repeat loop back to step 6); after assigning node 2 (step 7.3, Figure 3.2); after updating the ready queue using node 2 (step 6, Figure 3.2); and after assigning node 4 (done).]

Figure 3.4 The Example of Task Graph List Scheduling (Continuation).

Locate_PE ISH (AN,PE_GC,PEL,ST)

input: 1) AN, assigned node
2) PE_GC, array of Gantt charts of all PEs
output: 1) PEL, assigned PE
2) ST, AN's start time on PEL
Begin
1) First_Ready_PE(PE_GC,PERF); {from all PEs in the system, compare PE ready times in PE_GC and find the one that is ready at the earliest time (PERF)}
2) Initially, PEL = PERF
3) if Num_of_Immediate_Pred(AN) > 0
then Start_time(AN,PERF,ST,PE_GC) {calculate AN's start time on PERF, which is the larger of PERF's ready time and AN's message ready time, i.e., the communication delay from the LIP node. Figure 3.7 is an example of the calculation of node 7's start time.}
4) Let IMP be the set of all immediate predecessor nodes of AN
5) For all P such that X is in IMP and X is in PE_GC[P] {for every PE that executed one of AN's immediate predecessors} Do
5.1) Start_time(AN,P,STA,PE_GC) {calculate AN's start time on PE P}
5.2) if STA < ST then
ST = STA; PEL = P;

Figure 3.5 The Locate_PE of ISH.
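The core rule inside Locate_PE — a node's start time on a PE is the larger of the PE's ready time and the node's message ready time, where a message from a predecessor on the same PE costs nothing — can be sketched as follows. The data layout (dictionaries for placement, finish times, and per-edge delays) is an assumption for illustration, and for brevity every PE is scanned rather than only the first-ready PE and the predecessors' PEs:

```python
def start_time(node, pe, pe_ready, placement, finish, pred_delay):
    """Earliest start of `node` on `pe`: the larger of the PE's ready
    time and the node's message ready time.  A message from a
    predecessor already on `pe` costs nothing; one from another PE
    costs that edge's communication delay."""
    msg_ready = 0
    for pred, d in pred_delay.get(node, {}).items():
        arrival = finish[pred] + (0 if placement[pred] == pe else d)
        msg_ready = max(msg_ready, arrival)
    return max(pe_ready[pe], msg_ready)

def locate_pe(node, pe_ready, placement, finish, pred_delay):
    """Return the (PE, start time) pair with the smallest start time."""
    best = min(range(len(pe_ready)),
               key=lambda p: start_time(node, p, pe_ready, placement,
                                        finish, pred_delay))
    return best, start_time(node, best, pe_ready, placement, finish,
                            pred_delay)
```

Note that a busy PE holding a predecessor can beat an idle PE that must wait for a message, which is exactly why Locate_PE checks the predecessors' PEs.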

Assign_Node ISH (AN,ST,PE_GC,PEL,RQ)

input: 1) AN, assigned node
2) PEL, assigned PE
3) ST, AN's start time on PEL
4) PE_GC, PEs' Gantt charts
5) RQ, ready queue
output: PE_GC[PEL], PEL's Gantt chart
Begin
1) Idle time slot = ST - Ready_Time(PEL)
2) if idle time slot > 0 then
2.1) repeat
2.1.1) Initially, HT, the hole task = task at the top of RQ
2.1.2) repeat
Start_time(HT,PEL,STH,PE_GC); {compute STH = start time of HT on PEL}
if STH is within the idle time slot
then begin
FTH = size(HT) + STH; {hole task's finish time}
if FTH is within the idle time slot
then HT is the hole task; end;
if HT is not a hole task then HT = Next_task(HT,RQ); {HT becomes the task next to HT in RQ}
until a hole task is found or RQ has been searched through
{find a hole task that can be assigned within the idle time slot of PEL by searching the ready queue}
2.1.3) if a hole task was found then
Insert the hole task into PE_GC[PEL] at time STH;
Update_R_queue(HT,RQ,TG);
idle time slot = idle time slot minus the STH-to-FTH time slot;
until no hole task is found or no idle time remains;
2.2) Insert the remaining idle time slots into PE_GC[PEL].
3) Insert AN into PE_GC[PEL] at time ST.

Figure 3.6 The Assign_Node of ISH.
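The interval test at the heart of step 2 can be isolated into a small sketch (following the variant in Figure 3.8 that pulls the candidate start time up to the slot start). The earliest-start table `est` and the single-slot signature are simplifications for illustration, not the thesis routine:

```python
def find_hole_task(ready_queue, slot_start, slot_end, est, size):
    """Return (task, start) for the first ready task that fits wholly
    inside the idle slot [slot_start, slot_end), or None.  est maps
    task -> earliest start on this PE (message ready time); size maps
    task -> execution time."""
    for task in ready_queue:
        sth = max(est[task], slot_start)  # pull the start up to the slot
        fth = sth + size[task]            # candidate finish time
        if fth <= slot_end:               # start and finish both fit
            return task, sth
    return None
```

With a slot of length 1 only a unit-size ready task qualifies; a task whose message arrives after the slot ends is skipped.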

Gantt chart A and ready queue A are shown after the Locate_PE routine selects PE1 as the assigned PE for node 7, starting at time unit 4, with idle time slot 4-5. Gantt chart B is shown after node 3's (hole task) assignment, followed by node 7's assignment.

[Figure: Gantt charts A and B (time units 0-5 on PE1, PE2, PE3) with their ready queues listing each node and its level.]

* Locate_PE returns PE1 for node 7 at time 4, but node 7 has not been assigned to PE1 yet.

[Figure: the final Gantt chart of ISH (with task insertion) compared with the Gantt chart of level-list scheduling without task insertion (NISH), time units 0-8 on PE1, PE2, PE3.]

Figure 3.7 The Example of ISH Task Insertion and Scheduling.

ISH0 step 2.1.2
repeat
Start_time(HT,PEL,STH,PE_GC);
if STH < the idle time slot start time
then STH = idle time slot start time;
FTH = size(HT) + STH;
if FTH is within the idle time slot then HT is the hole task;
if HT is not a hole task then HT = Next_task(HT,RQ);
until a hole task is found or RQ has been searched through

ISH1 (ISH) step 2.1.2
repeat
Start_time(HT,PEL,STH,PE_GC);
if STH is within the idle time slot
then begin FTH = size(HT) + STH;
if FTH is within the idle time slot
then HT is the hole task; end;
if HT is not a hole task then HT = Next_task(HT,RQ);
until a hole task is found or RQ has been searched through

ISH2 step 2.1.2
repeat
Start_time(HT,PEL,STH,PE_GC);
hole task found = false;
if STH <= the idle time slot end time
then begin
Locate_PE(HT,PE_GC,PEL,STH2);
if STH <= STH2 then begin
FTH = size(HT) + STH;
if FTH is within the idle time slot
then HT is the hole task; end;
end;
if HT is not a hole task then HT = Next_task(HT,RQ);
until a hole task is found or RQ has been searched through

Figure 3.8 The Choices in Implementing ISH Step 2.1.2

[Figure: average speedup ratio versus number of PEs (12-48), one curve per version: NISH, ISH0, ISH1, and ISH2.]

Figure 3.9 The Average Speedup Ratio Comparison between ISH Versions on 10 Randomly Generated 350-Node Task Graphs.

[Figure: two-PE Gantt charts of the same task graph scheduled with and without TDC; duplicating node 1 (underlined) on the second PE lets the schedule with TDC finish one time unit earlier.]

* An underlined number indicates a duplicated node.

Figure 3.10 The Task Duplication Concept.
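The arithmetic behind Figure 3.10 can be sketched directly: duplicating a predecessor on the consumer's PE trades the communication delay for a second copy of the predecessor's execution time, so it pays off whenever the delay exceeds that execution time. The numbers below are illustrative, not values taken from the figure:

```python
def start_with_message(pred_finish, delay, pe_ready):
    """Consumer's start time when its predecessor stays on a remote
    PE: wait for the message to arrive."""
    return max(pe_ready, pred_finish + delay)

def start_with_duplication(pred_size, pred_msg_ready, pe_ready):
    """Consumer's start time when the predecessor is duplicated on
    the consumer's PE: run the local copy first, then the consumer."""
    return max(pe_ready, pred_msg_ready) + pred_size
```

With a unit-size predecessor, a delay of 3, and an otherwise idle PE, duplication lets the consumer start at time 1 instead of time 4.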

Task Duplication Process, TDP (AN,PEc,ST,CTlst,CTcnt)

input: 1) AN, an assigned node
2) PEc, assigned-PE candidate
output: 1) ST, AN's start time
2) CTlst of PEc, a list of duplicated tasks and their start times on PEc
3) CTcnt, number of tasks duplicated in PEc's CTlst
Begin
CTlst is empty and CTcnt = 0
repeat
Start_time(AN,PEc,ST,ANLIP,CTlst,PE_GC); {the same as the Start_time used in ISH, except that both PE_GC and CTlst are used to calculate AN's start time on PEc; also returns ANLIP, the LIP node of AN}
COPY = false; {flag indicating whether the LIP has successfully been copied}
if (ST > Ready_Time(PEc)) and ANLIP is not in PE_GC[PEc] then
1) if ANLIP is not in PEc's CTlst then
Start_time(ANLIP,PEc,STL,LIPLIP,CTlst,PE_GC);
Copy_LIP(ANLIP,LIPLIP,CTlst,STL,CTcnt,COPY,CTPT);
if COPY then Shift_Task_in_CTlst(CTPT,CTlst);
{remove all nodes in CTlst that are located after the copied node; let SRN be the set of removed nodes;
For all removed nodes RN in SRN Do
Start_time(RN,PEc,STRN,RNLIP,CTlst,PE_GC);
Copy_LIP(RN,RNLIP,CTlst,STRN,CTcnt,COPY,CTPT)}
2) if ANLIP is in PEc's CTlst then
2.1) Search_CTlst_LIP(ANLIP,LIP,Found); {search the ancestors of ANLIP in CTlst to find a LIP node that is not in CTlst, or until LIP becomes an entry node}
2.2) if Found and LIP is not in PE_GC[PEc] then
Start_time(LIP,PEc,STL,LIPLIP,CTlst,PE_GC);
Copy_LIP(LIP,LIPLIP,CTlst,STL,CTcnt,COPY,CTPT);
if COPY then Shift_Task_in_CTlst(CTPT,CTlst);
until not COPY or ST <= Ready_Time(PEc)

Figure 3.11 Task Duplication Process (TDP).

Copy_LIP (LIP,LIPLIP,CTlst,STL,CTcnt,COPY,CTPT)

input: 1) LIP, the LIP node to be copied into CTlst
2) LIPLIP, LIP of the LIP node
3) STL, start time of the LIP node
4) CTlst, linked list of copied tasks
5) CTcnt, number of copied tasks in CTlst
output: 1) CTlst
2) CTcnt
3) CTPT, pointer to the copied task in CTlst
4) COPY, Boolean indicating whether Copy_LIP was successful
Begin
1) Insert the LIP node into CTlst.
2) if the insert was successful
then COPY = true; CTPT points to the copied node in CTlst
else COPY = false; CTPT = nil;
3) if not COPY and LIP is not an entry node and LIPLIP is not in PE_GC[PEc] then
3.1) if LIPLIP is not in PEc's CTlst then
Start_time(LIPLIP,PEc,STLL,LIPLIPLIP,CTlst,PE_GC);
Copy_LIP(LIPLIP,LIPLIPLIP,CTlst,STLL,CTcnt,COPY,CTPT);
3.2) if LIPLIP is in PEc's CTlst then
3.2.1) Search_CTlst_LIP(LIPLIP,LIP,Found); {search the ancestors of LIPLIP in CTlst to find a LIP node that is not in CTlst, or until LIP becomes an entry node}
3.2.2) if Found and LIP is not in PE_GC[PEc] then
Start_time(LIP,PEc,STL,LIPLIP,CTlst,PE_GC);
Copy_LIP(LIP,LIPLIP,CTlst,STL,CTcnt,COPY,CTPT);

Figure 3.12 The Copy_LIP of DSH.

Gantt chart A is an intermediate DSH scheduling result before node 11's assignment. Gantt chart B is the final result after Assign_Node(node 11, CTlst). Gantt chart C is Gantt chart B after removing redundant tasks.

[Figure: the 11-node sample task graph and three three-PE Gantt charts (A, B, C) for time units 0-10; duplicated nodes are underlined.]

Figure 3.13 The Sample Task Graph of 11 Nodes and Its Intermediate Gantt Chart using DSH.

CTlst after the call to Start_time for node 11 (LIP = 9, start time = 11). No node has been duplicated yet. The first entry is a dummy entry showing the last node on PE1 (node 8) and its finish time (5). The last entry shows the LIP of the assigned node (node 11) and node 11's start time.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     0    9  11  11     0

1: LIP = node 9, and node 9 is not in CTlst (case 1 of Figure 3.11).
2: Start_time(node 9) returns the LIP of node 9 (node 6) and node 9's start time (10).
3: Copy_LIP(node 9) (node 9 is the LIP of node 11).
4: Copy succeeds; node 9 is duplicated in CTlst at time 10.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     9    6  10  11    10
     0    9  11  11     0

1: Recalculate the start time of node 11 (LIP = 9, ST = 11).
2: LIP (node 9) is already in CTlst (case 2 of Figure 3.11).
3: Search_CTlst_LIP(node 9): node 9's LIP = node 6, and node 6 is not in CTlst.
4: Start_time(node 6), Copy_LIP(node 6).
5: Copy succeeds; node 6 is duplicated in CTlst at time 5.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     6    4   5   6     5
     9    6  10  11    10
     0    9  11  11     0

1: Remove node 9 from CTlst (since node 6 was duplicated successfully).
2: Start_time(node 9): LIP = node 6, ST = 6.
3: Re-Copy_LIP(node 9) with the new start time (6).

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     6    4   5   6     5
     9    6   6   7     6
     0    9  11  11     0

* N# = node number of the duplicated node, LIP = LIP of N#, ST = start time of N# on CTlst, FT = finish time of N#, MSRT = message ready time of N# on CTlst

Figure 3.14 The Example of Duplication Task List (CTlst) on PE1 Constructed by TDP for Node 11.

1: Recalculate the start time of node 11 (LIP = 10, ST = 11).
2: LIP (node 10) is not in CTlst (case 1 of Figure 3.11).
3: Start_time(node 10) returns ST = 11, LIP = node 7; Copy_LIP(node 10).
4: Copying node 10 fails, since node 10's ST = 11.
5: Node 10's LIP (node 7) is not in CTlst (case 3.1 of Figure 3.12).
6: Start_time(node 7), Copy_LIP(node 7).
7: Copy succeeds; node 7 is duplicated in CTlst at time 7.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     6    4   5   6     5
     9    6   6   7     6
     7    4   7   8     5
     0   10  11  11     0

1: Recalculate the start time of node 11 (LIP = 10, ST = 11).
2: LIP (node 10) is not in CTlst (case 1 of Figure 3.11).
3: Start_time(node 10) returns ST = 8, LIP = node 7; Copy_LIP(node 10).
4: Copy succeeds; node 10 is duplicated in CTlst at time 8.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     6    4   5   6     5
     9    6   6   7     6
     7    4   7   8     5
    10    7   8   9     8
     0   10  11  11     0

1: Recalculate the start time of node 11 (LIP = 10, ST = 9).
2: No idle time slot is left; TDP terminates with node 11's start time on PE1 equal to 9 and four nodes duplicated.
3: The final Gantt chart (B) is shown in Figure 3.13.

    N#  LIP  ST  FT  MSRT
     0    8   5   5     0
     6    4   5   6     5
     9    6   6   7     6
     7    4   7   8     5
    10    7   8   9     8
     0   10   9   9     0

Figure 3.14 The Example of Duplication Task List (CTlst) on PE1 Constructed by TDP for Node 11 (Continuation).

Locate_PE DSH (AN,PE_GC,PEL,ST,DTlst)

input: 1) AN, assigned node
2) PE_GC, array of Gantt charts of all PEs
output: 1) PEL, assigned PE
2) ST, AN's start time on PEL
3) DTlst, duplication task list
Begin
1) First_Ready_PE(PE_GC,PERF); {from all PEs in the system, compare PE ready times in PE_GC and find the one that is ready at the earliest time (PERF)}
2) PEL = PERF
3) if Num_of_Immediate_Pred(AN) > 0
then TDP(AN,PERF,ST,CTlst,CTcnt);
4) Let IMP be the set of immediate predecessor nodes of AN
5) For all P such that X is in IMP and X is in PE_GC[P] {every PE that executed one of AN's immediate predecessors} Do
5.1) TDP(AN,P,STA,CTAlst,CTAcnt); {calculate AN's start time on PE P}
5.2) if (STA < ST) or (STA = ST and CTAcnt < CTcnt) then
ST = STA; PEL = P;
DTlst = CTAlst; CTcnt = CTAcnt

Figure 3.15 The Locate_PE of DSH.

Assign_Node DSH (AN,PE_GC,PEL,ST,DTlst)

input: 1) AN, assigned node
2) PEL, assigned PE
3) ST, AN's start time on PEL
4) DTlst, duplication task list
output: PE_GC, array of Gantt charts of all PEs
Begin
1) if DTlst is not empty
then insert all the tasks in the duplication task list into PE_GC[PEL].
2) Insert hole tasks into all idle time slots between the duplicated tasks. The details are as described in Assign_Node ISH, step 2.
3) Insert idle tasks into all remaining idle time slots.
4) Insert AN into PE_GC[PEL] at time ST.

Figure 3.16 The Assign_Node of DSH.

CHAPTER IV
EXPERIMENT RESULTS

This chapter describes the test results for ISH and DSH compared to the results for Hu's heuristic [Hu 61] and Yu's heuristic D [Yu 84]. The comparison test consists of applying the heuristics to a wide range of randomly generated task graphs with 20, 50, 100, 150, 250, and 350 nodes, for a total of 340 task graphs. This approach closely follows the approach used by [Adam 74]. The only difference from Adam's test is that we compared our results with Hu's and Yu's results instead of the optimal solution, because there is no algorithm to find the optimal solution for precedence task graphs that include communication delays.

We used the speedup ratio, which is the ratio of the execution time of the task graph on a uniprocessor (no communication delay) to the execution time on the parallel processing system, as our measure of performance improvement. The number of PEs in the parallel processing system was varied from 2 to 15 in the 20-node tests, 2 to 30 in the 150-node tests, and 2 to 70 in the 250-node and 350-node tests.
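The performance measure can be written as a one-line function; the sample values used below are illustrative, not data from the experiments:

```python
def speedup_ratio(uniprocessor_time, parallel_time):
    """Execution time of the task graph on one PE (with no
    communication delay) divided by its execution time on the
    parallel processing system."""
    return uniprocessor_time / parallel_time
```

A ratio below 1 means the parallel run was actually slower than a single PE, which is exactly what happens for communication-intensive graphs under the earlier heuristics.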

The random graphs can be classified into two groups. The first group has identical node sizes and identical communication delays, in order to study the effects of communication delay on the performance of each heuristic. The second group has variable node sizes and variable communication delays, in order to study the stability of the heuristics when node and communication delay sizes change.

For the first group, the task graph node size is one time unit and the communication delay sizes are varied: 1, 3, 5, 10, and 20 time units. At each communication delay size, 10 task graphs were randomly generated and scheduled. An average speedup ratio was computed from the 10 task graph runs for each communication delay, and the results are plotted in Figures 4.1-4.6. The minimum, maximum, and average of each data point in Figures 4.1-4.6 are summarized in the Appendix. The speedup ratios at the saturation points of Figures 4.1-4.6 and the percentage improvements of ISH and DSH are summarized in Table 4.1 and Table 4.2, respectively. Table 4.3 summarizes the percentage of the average speedup ratio relative to the average speedup ratio for unit-delay task graphs for all four heuristics. This percentage, which is the ratio of the speedup ratio at each delay to the speedup ratio at unit delay, shows how speedup decreases as the delay increases.

From Table 4.1, we conclude that ISH gives improvements of up to 45% over previous heuristics. On the other hand, DSH gives much greater improvement, as shown in Table 4.2. The percentage improvement of DSH increases as the communication delay increases. This means that DSH can handle communication delays better than previous methods. The improvement is up to 420% over Hu's heuristic and 270% over Yu's heuristic at unity speedup for the 20-node, 20-delay tests. For the 350-node, 20-delay tests, the improvement is up to 378% and 158%, respectively.

From Table 4.3, the percentage speedup ratio of DSH decreases slowly as the communication delay increases. Also, the average speedup ratio never goes below 1. This indicates that DSH handles communication delay better than previous heuristics. The speedup ratios of DSH for different delays are plotted in Figure 4.7.

It is also interesting to note that the speedup ratio of DSH never decreases when the number of PEs available in the system is increased, in contrast to the other methods (especially Hu's heuristic). This is an important property of a good scheduling algorithm, since the total task execution time should not increase with increases in the number of available PEs in the system. If adding PEs to the system would cause a greater delay, a good scheduler should be able to decide not to use the additional PEs.

For the second group, the task graphs have 20, 150, 250, and 350 nodes. For each number of nodes, 10 task graphs were randomly generated with different node sizes and communication delays, as shown in Table 4.4. The scheduling results are plotted in Figure 4.8. The speedup ratios at the saturation points marked in Figure 4.8 and the percentage improvements of DSH are summarized in Table 4.4. The results are about the same as for the first group at the same communication delay ratios.
[Figure: four panels plotting average speedup ratio versus number of PEs (up to 15) for 20-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.1 The Average Speedup Ratio Comparison (20 Nodes).


[Figure: four panels plotting average speedup ratio versus number of PEs (up to 27) for 50-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.2 The Average Speedup Ratio Comparison (50 Nodes).


[Figure: four panels plotting average speedup ratio versus number of PEs (up to 48) for 100-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.3 The Average Speedup Ratio Comparison (100 Nodes).


[Figure: four panels plotting average speedup ratio versus number of PEs (up to 48) for 150-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.4 The Average Speedup Ratio Comparison (150 Nodes).


[Figure: four panels plotting average speedup ratio versus number of PEs (up to 68) for 250-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.5 The Average Speedup Ratio Comparison (250 Nodes).
[Figure: four panels plotting average speedup ratio versus number of PEs (up to 68) for 350-node task graphs with communication delays of 1, 5, 10, and 20 time units; each panel compares DSH, ISH, Yu, and Hu.]

Figure 4.6 The Average Speedup Ratio Comparison (350 Nodes).


[Figure: four panels (20, 100, 150, and 350 nodes) plotting the average DSH speedup ratio versus number of PEs, with one curve per communication delay size.]

Figure 4.7 The Average DSH Speedup Ratio Comparison for Different Delays.
[Figure: four panels (20, 150, 250, and 350 nodes) plotting average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu on graphs with non-identical node sizes and communication delays.]

Figure 4.8 The Average Speedup Ratio Comparison for Non-identical Node Size and Non-identical Communication Delay.
Table 4.1 ISH's Speedup Ratio Improvement over Hu and Yu D Heuristics.

        20 Nodes         150 Nodes        250 Nodes        350 Nodes
 D     % Hu    % Yu     % Hu    % Yu     % Hu    % Yu     % Hu    % Yu
 1     6.55    2.10    36.69    1.11    48.20    3.59    56.38    1.18
 3    34.42   14.52    44.99    5.60    57.56    7.11    60.28    4.51
 5    78.11   18.80    77.26   19.02    74.65    5.22    80.29    6.80
10    93.42   23.64    86.01   20.24    83.97   18.53    87.85   17.54
20    62.60   45.98    73.99   16.84    94.74   17.58   120.45   19.26

D = communication delay size (time units); node sizes are 1 time unit.
% A = percentage improvement over heuristic A = (ISH's speedup - A's speedup) / A's speedup * 100

Table 4.2 DSH's Speedup Ratio Improvement over Hu and Yu D Heuristics.

        20 Nodes         150 Nodes        250 Nodes        350 Nodes
 D     % Hu    % Yu     % Hu    % Yu     % Hu    % Yu     % Hu    % Yu
 1    42.28   11.12    53.27   13.38    59.92   11.79    65.70    7.21
 3    69.29   35.13   101.01   46.40   116.42   47.13   140.75   46.91
 5   142.06   78.81   159.04   73.94   176.74   66.72   183.71   68.05
10   229.35  139.70   240.61  120.18   226.04  110.07   232.83  208.25
20   420.83  270.37   295.05  165.28   327.10  157.88   378.37  158.78

20 Nodes
 D   DSH S   S% Hu   S% Yu   S% ISH  S% DSH
 1   2.714  100.00  100.00  100.00  100.00
 3   1.964   52.03   50.00   57.26   72.37
 5   1.650   30.02   36.85   43.77   60.80
10   1.220   16.43   21.04   26.01   44.95
20   1.000   11.07   10.10   14.74   36.85

150 Nodes
 D   DSH S   S% Hu   S% Yu   S% ISH  S% DSH
 1  10.018  100.00  100.00  100.00  100.00
 3   7.162   54.51   55.36   57.82   71.49
 5   5.559   32.83   36.17   42.58   55.49
10   3.774   16.95   16.40   23.07   37.67
20   2.552    9.88   10.88   12.58   25.47

250 Nodes
 D   DSH S   S% Hu   S% Yu   S% ISH  S% DSH
 1  14.726  100.00  100.00  100.00  100.00
 3  10.795   54.17   55.71   57.59   73.31
 5   8.341   32.73   37.99   38.57   56.60
10   5.735   19.10   20.73   23.71   38.94
20   3.814    9.69   11.23   12.74   25.90

350 Nodes
 D   DSH S   S% Hu   S% Yu   S% ISH  S% DSH
 1  18.524  100.00  100.00  100.00  100.00
 3  13.802   51.28   54.37   56.17   74.51
 5  11.30    35.63   38.92   41.08   61.00
10   7.695   20.68   21.39   24.84   41.54
20   4.865    9.10   10.88   12.82   26.26

D = communication delay size. DSH S = average speedup ratio of DSH.
S% X = (speedup ratio of heuristic X at delay D / speedup ratio of X at D = 1) * 100

Table 4.3 The Effect of Communication Delay on Speedup Ratio Comparison.
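The normalization used in Table 4.3 can be restated as a small function; the example values reproduce entries of the S% DSH column from the DSH S column (e.g., 150 nodes at delay 3: 7.162 / 10.018 × 100 ≈ 71.49):

```python
def s_percent(speedup_at_delay, speedup_at_unit_delay):
    """S% X = (speedup ratio at delay D / speedup ratio at D = 1) * 100,
    rounded to two decimals as in Table 4.3."""
    return round(100.0 * speedup_at_delay / speedup_at_unit_delay, 2)
```

The same function applied to each DSH S entry recovers the entire S% DSH column, confirming the footnote formula.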

                                       ISH              DSH
Nodes  Node size  Delay size  RD    % Hu    % Yu     % Hu    % Yu
 20     1-5(3)     1-5(3)      5   155.41   61.62   303.93  155.60
150     1-5(3)     1-5(3)      5    77.20   13.95   216.35  103.44
250     1-5(3)     1-5(3)      5    71.63   11.36   218.74  118.02
350     1-5(3)     1-5(3)      5    84.91   21.77   230.84  117.88

Node size 1-5(3) means node sizes varied from 1 to 5 with an average of 3 time units.
Delay size (communication delay) uses the same format as node size.
RD is the ratio of average communication delay to average node size.

Table 4.4 DSH and ISH Speedup Ratio Improvement over Hu and Yu D Heuristics for Variable Node Size and Communication Delay.

CHAPTER V
OPTIMAL GRAIN DETERMINATION FOR PARALLEL
PROCESSING SYSTEMS

5.1 Introduction
Solutions to the "grain size" problem for parallel processing systems and an example of grain size determination are given in this chapter. The grain size problem, as previously described in Chapter II, is stated as follows:
Grain Size Problem. How should a given program be partitioned into concurrent modules to obtain the shortest possible program execution time, and what is the "best" size for each concurrent module?

While these problems have been widely studied [Babb 83], we propose a new solution called "grain packing" which provides:

1: a new way to determine grain size for any underlying parallel processing architecture and any kind of application program, with the advantage that each grain is of the best size for scheduling, reducing communication delay, and enhancing parallelism,

2: a new way to schedule a given application program to execute on a given parallel processing system, with the advantage that program execution time is as short as possible,

3: an automatic parallel program development scheme, which saves the user time and reduces the errors that occur when doing program development (grain size determination and scheduling) by hand,

4: a parallel processor simulator tool: given a user program and a specification of a target parallel system, a simulator can compute the speedup ratio and execution time of the application without actually running it on the real system. This can tell the user whether the program has too much communication overhead to take advantage of the parallelism in the target system, saving both the cost and the time of finding out whether the target system is appropriate.

Researchers have taken two general approaches to solving these problems:

1: Basic scheduling strategy [Grah 72]. This strategy consists of assigning a task whose predecessors have all completed to the first available processor. Therefore, if a processor is idle, it is because no task can be assigned to it. Examples of this type of scheduling are load balancing and list scheduling [Hu 61].

It is known that this strategy may not produce the best schedule for a given task graph. For example, forcing some processors to remain idle can decrease the execution time of some task graphs rather than increasing it as might be expected [Rama 72]. On the other hand, this type of scheduler yields "near optimal" schedules most of the time [Adam 74].

Because of the general performance of this strategy, it has been used in real parallel processing systems. Most parallel processing system users believe it is best to take advantage of all parallelism, keep all processors busy (load balancing), and start each task as soon as possible. It is generally thought that these rules of thumb provide the best possible program execution time.

While load balancing works very well in an ideal system, it
yields "poor" results in the presence of the unavoidable
communication delays of real systems. By "poor" results, we mean
that program execution time is "far" from optimal. In fact, as shown
in Chapter IV, in the case of an application with intensive
communication, execution time on several processors is greater than
execution time on one processor! The main reason for this failure is
that load balancing attempts to utilize all available parallelism
without regard for the corresponding high cost of communication.

2: Large grain data flow [Babb 84]. This strategy is based on an
awareness of the communication delay between tasks on different
processors. Instead of taking advantage of all available parallelism
in a program, the program is partitioned in such a way that the
execution time of each task is much greater than its communication
delay. This makes the communication delays appear negligible.

This strategy seems to be a good solution, but it still has
problems:

1) Babb acknowledges that "what qualifies as large-grain
processing will vary somewhat depending on the underlying
architecture. In general, 'large grain' means that the amount of
processing a lowest level program performs during an execution is
large compared to the overhead of scheduling its execution" [Babb 84],
but he offers no method of grain size determination for a
particular system. If the grain size is defined manually, the process
is time-consuming and prone to errors.

2) Since grains are typically large, some parallelism is forced
to run sequentially inside the large grain, and hence the application
fails to take full advantage of the available parallelism. Again, large
grain dataflow does not solve the max-min problem. Instead it tries
to reduce the communication delay by "throwing away" the available
parallelism in a user program.

5.2 Grain Packing Approach

Grain Packing is an alternative approach that automatically
determines the best grain sizes to schedule on a target parallel
processing system. Instead of trying to define the best grain size and
then scheduling those grains, grain packing starts from the smallest
grain size, schedules these small grains, and then defines larger grains.
Since the final grain sizes are defined after the fine grains are
scheduled, all parallelism is taken into consideration, and the only
parallelism discarded is parallelism that would decrease performance.

Grain packing can be divided into four main steps:

1: Fine grain task graph construction. A fine grain task
graph is constructed from a user program. This step involves
1.1) parallelism extraction [Rama 69] from the user program, as
shown in Figures 5.1-5.2.

1.2) Node size calculation. The size of each node is an
estimate of the time needed to execute all tasks in the node. Sizes
are measured in units of CPU machine cycles; hence, if the CPUs in
the system run at different speeds, the size of each node can be
calculated separately for each CPU. As shown in Figure 5.3, some
move instructions are not included, since moves are needed only
when communicating tasks are not in the same node.

1.3) Edge size calculation. The communication delay is
estimated as the time taken to transmit a message between
processors. If each link has a different transmission rate, the
distance between two processors varies (more than one hop in a
hypercube, for instance), or the message length varies, then the
communication delay is calculated by adding up all of these effects.
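As a concrete illustration of step 1.2, using the M68000 cycle counts from Figure 5.3 (the function and its argument layout are our own, not from the thesis):

```python
def node_size(instr_cycles, optional=frozenset()):
    """Sum the CPU cycles of a node's instructions, skipping the
    optional communication MOVEs: those are needed only when the
    consuming task lives in another node, so they are charged to the
    edge (communication delay) instead of the node."""
    return sum(c for i, c in enumerate(instr_cycles) if i not in optional)

# Multiply node (Figure 5.3): MOVE.W 15, MOVE.W 15, MULU 71, MOVE.L 20;
# the final MOVE.L of the product is optional -> node size 101.
mul_node = node_size([15, 15, 71, 20], optional={3})

# Add node: MOVE.L 20, MOVE.L 20, ADD.L 8, MOVE.L 20; only the ADD.L
# always executes -> node size 8.
add_node = node_size([20, 20, 8, 20], optional={0, 1, 3})
```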

2: Fine grain scheduling. The task graph from step 1 is
scheduled on a parallel processor system using a fine grain
scheduler that can take advantage of all available parallelism while
reducing the communication delay (for example, DSH in Chapter 3).
The application program's execution time is determined at this step
from each node's execution time and the communication delays,
using information from the fine grain task graph and the specific
architecture of the target parallel processing system.

Since the grain size and the user program's execution time
depend on the scheduler used, the choice of scheduler is critical.
The scheduler used in grain packing must provide a solution to the
max-min problem and give monotonically growing improvement as
the number of processors is increased (speedup ratio >= 1).

An example of communication delay calculation is shown in
Figure 5.4, where T1 and T5 represent the MOV instructions of Figure
5.3, T2 and T4 represent DMA fetch and set-up delays, T3 is the
transmission delay on the serial link, and T6 is the delay of the
communication protocol. Communication delay is a function of both
the application program and the specific architecture of the target
parallel processing system. The final schedule is shown in Figure 5.5.
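The delay model just described can be written out as a small helper (our own sketch of Figure 5.4's arithmetic; the protocol term T6 is left as a parameter because it is system dependent):

```python
def comm_delay(t1=20, t2=20, t3=32, t4=20, t5=20, protocol=0):
    """Com. delay = T1 + T2 + T3 + T4 + T5 + Com. protocol, where
    T1/T5 are the optional MOV instructions, T2/T4 the DMA fetch and
    set-up delays, and T3 the 32-bit serial-link transmission time
    (already normalized to M68000 cycles). All values are cycles."""
    return t1 + t2 + t3 + t4 + t5 + protocol

base = comm_delay()                 # 112 cycles before protocol overhead
fig55 = comm_delay(protocol=5 * 20) # 212 cycles, protocol = 5 MOVs
```

With no protocol cost the delay is 112 cycles; charging the protocol at 5 MOV instructions (5 x 20 cycles), as assumed in Figure 5.5, gives 212.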

The schedule obtained by DSH yields a speedup ratio of 2.39,
while load balancing yields a speedup ratio of 1.13. This
underscores the importance of selecting a scheduler that solves the
max-min problem.
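These ratios follow directly from the makespans in Figure 5.5 (361 time units for DSH and 761 for load balancing on eight PEs, against 864 on a single PE); a quick check of the arithmetic:

```python
# Speedup ratio = single-PE execution time / parallel execution time,
# using the makespans from Figure 5.5.
single_pe, dsh_time, lb_time = 864, 361, 761

dsh_speedup = single_pe / dsh_time   # ~2.39
lb_speedup = single_pe / lb_time     # ~1.13
```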

3: Grain Packing. In this step, the Gantt chart from step 2 is
analyzed. Fine grains are "packed" together to form larger grains in
order to reduce overhead. Here "overhead" includes all optional
move instructions and all the overhead caused by communication
protocols. Since the overhead is system dependent, the way fine
grains are packed depends on each specific system. Usually, the
larger the grain, the smaller the overhead.

An example of grain packing is shown in Figure 5.5. The grain
boundary is obtained from the scheduler, hence it reflects the best
schedule trading off communication delay against parallelism. Also,
some fine grains may be duplicated and grouped into more than one
large grain to reduce the communication delay and increase the
parallelism (see the duplication details in DSH).
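A minimal sketch of the packing bookkeeping (the data structures here are purely illustrative, not from the thesis): each PE's scheduled task set becomes one large grain, and every edge whose endpoints fall inside a common grain, possibly thanks to a duplicated task, sheds its communication overhead.

```python
def pack_grains(gantt, edge_delay):
    """gantt: {pe: [tasks in start-time order]} from fine grain
    scheduling; edge_delay: {(producer, consumer): delay in cycles}.
    Returns one grain per PE plus the total communication overhead
    eliminated by packing."""
    grains = {pe: set(tasks) for pe, tasks in gantt.items()}
    saved = sum(d for (u, v), d in edge_delay.items()
                # an edge becomes internal if some grain holds both
                # endpoints; duplication can make that true on any PE
                if any(u in g and v in g for g in grains.values()))
    return grains, saved
```

With the 212-cycle delays assumed in Figure 5.5, every edge that packing makes internal saves those 212 cycles of communication overhead.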

4: Parallel module generation. Based on the grain
information from step 3, a compiler might construct modules to run
in parallel on the parallel processor system. Alternatively, a user
program can be restructured to achieve the optimal runtime, as
shown by the Occam program in Figure 5.6. This figure also shows
the steps taken to find the best grain size, starting from a user
defined grain size, to fine grain size, and finally to large, packed
grain size. The run time for the user-defined and packed grain task
graphs is the same for both DSH and load balancing (536,536 and
361,361 time units) because no more duplication is possible.

Once all grains are packed, they can be rescheduled by a
simple scheduler for a given real system. Such a simple scheduler
might be an operating system scheduler which runs while the
user program is running.

By using grain packing, users do not need to learn a specific
parallel programming language such as Occam. A programmer
typically does not know details such as grain execution time and
communication delay cost, so the grain size and parallelism selected
by a programmer are not optimal. For example, a programmer who
tries to take advantage of parallelism in matrix multiplication will
produce too fine a grain and, as a consequence, introduce more
communication delay than necessary. Indeed, the Occam matrix
multiplication program in Figure 5.1 contains no more parallelism
than a corresponding C or Pascal matrix multiplication program. In
short, the parallelism identified by a programmer using a parallel
programming language is not useful information for producing an
optimal parallel program. This counter-intuitive result may seem
controversial, but it is observed in even the simplest examples.

To sum up, grain packing provides a new way to develop a
program on a particular system. It

1: gives an optimal way to partition a serial or parallel
program on a specific computer architecture.
2: gives a run time estimate of a particular program on a
particular system before running the program. A speedup of
less than or equal to one means the program is not suitable to
run on the specific architecture.
3: packs grains automatically, which saves the user time and
reduces errors which might occur if grain packing were done
by hand.
4: applies to any language, such as C, Pascal, Fortran, or
Modula-2. Grain packing allows more applications to take
advantage of parallel processing systems because existing
programs do not need to be re-written in a new language.

OCCAM Matrix multiplication

PAR
  INT A11, A12, B11, B21, C11 :
  SEQ
    Chan1in ? A11
    Chan1in ? A12
    Chan1in ? B11
    Chan1in ? B21
    C11 := (A11*B11)+(A12*B21)
    Chan1out ! C11
  INT A11, A12, B12, B22, C12 :
  SEQ
    Chan2in ? A11
    Chan2in ? A12
    Chan2in ? B12
    Chan2in ? B22
    C12 := (A11*B12)+(A12*B22)
    Chan2out ! C12
  INT A21, A22, B11, B21, C21 :
  SEQ
    Chan3in ? A21
    Chan3in ? A22
    Chan3in ? B11
    Chan3in ? B21
    C21 := (A21*B11)+(A22*B21)
    Chan3out ! C21
  INT A21, A22, B12, B22, C22 :
  SEQ
    Chan4in ? A21
    Chan4in ? A22
    Chan4in ? B12
    Chan4in ? B22
    C22 := (A21*B12)+(A22*B22)
    Chan4out ! C22
  INT C11, C12, C21, C22, Sum :
  SEQ
    Chan1out ? C11
    Chan2out ? C12
    Chan3out ? C21
    Chan4out ? C22
    Sum := C11+C12+C21+C22
    Chan5out ! Sum

Task graph representation of the OCCAM program:

  [A11 A12] [B11 B12]   [C11 C12]
  [A21 A22] [B21 B22] = [C21 C22]

  C11 = A11*B11 + A12*B21
  C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21
  C22 = A21*B12 + A22*B22
  Sum = C11 + C12 + C21 + C22

Figure 5.1 An Example of User Program and Its Task Graph
Construction.

User Specified Grain.

[Diagram omitted.]

Fine Grain Decomposition from User Defined Grain.

[Diagram omitted.]

Figure 5.2 An Example of Fine Grain Task Graph Construction.

M68000 ASSEMBLY LANGUAGE                      CPU CYCLES

Multiply node (Ann*Bnn):
            MOVE.W  Axx, D1                       15
            MOVE.W  Bxx, D2                       15
            MULU    D1, D2                        71
  CM. OPT.* MOVE.L  D2, PAR                       20
                                     Node size = 101

Add node (Ann*Bnn + Ann*Bnn):
  CM. OPT.* MOVE.L  PAR1, D1                      20
  CM. OPT.* MOVE.L  PAR2, D2                      20
            ADD.L   D1, D2                         8
  CM. OPT.* MOVE.L  D2, PSUM                      20
                                     Node size = 8

* CM. OPT. = OPTIONAL FOR COMMUNICATION (needed only when the
  communicating task is in another node)

Communication delay (Cm) = CM. OPT. + extra delay = 20 + ...

[Diagram omitted: fine grain decomposition from the user defined
grain, with multiply nodes of size 101, add nodes of size 8, and
edges of size 20+.]

Figure 5.3 An Example of Fine Grain Node Size Calculation.

[Diagram omitted: tasks A (on PE1) and B (on PE2) communicate
through memory, DMA, and a serial link; the delays along the path
are labeled T1-T5.]

Com. Delay = T1 + T2 + T3 + T4 + T5 + Com. Protocol
           = 20 + 20 + 32 + 20 + 20 + ?
           = 112 + Com. Protocol

T1, T5 : MOV instructions (Figure 5.3)
T2, T4 : DMA fetch and set-up delay
T3     : 32-bit transmission time at 20 Mbit/sec,
         normalized to M68000 cycles at 20 MHz
Com. Protocol:
  1. Protocol code execution time
  2. Synchronization time
  3. Routing time (# of hops)

Figure 5.4 An Example of Communication Delay Calculation for the
Specific Architecture.

Fine Grain Decomposition from User Defined Grain.

[Figure content largely unrecoverable from the scan. Recoverable
information: the fine grain task graph uses multiply nodes of size
101 and communication delays of 212, assuming the communication
protocol costs the equivalent of 5 MOV instructions. Gantt Chart A
(DSH, 8 PEs) finishes at time 361, Gantt Chart B (Load Balancing,
8 PEs) at time 761, and Gantt Chart C (single PE) at time 864.]

Figure 5.5 An Example of Fine Grain Scheduling using DSH in
Comparison with Load Balancing and Single PE.

Occam matrix multiplication program (reconstructed):

PAR
  INT A11, A12, B11, B21, C11 :
  SEQ
    Chan1in ? A11
    Chan1in ? A12
    Chan1in ? B11
    Chan1in ? B21
    C11 := (A11*B11)+(A12*B21)
    Chan1out ! C11
  INT A11, B12, Mul2 :
  SEQ
    Chan2in ? A11
    Chan2in ? B12
    Mul2 := A11*B12
    Chan2out ! Mul2
  ...
  INT A22, B22, Mul7 :
  SEQ
    Chan7in ? A22
    Chan7in ? B22
    Mul7 := A22*B22
    Chan7out ! Mul7
  INT C11, Mul2, Mul3, Mul4, Mul5, Mul6, Mul7, Sum :
  SEQ
    Chan1out ? C11
    Chan2out ? Mul2
    ...
    Chan7out ? Mul7
    Sum := C11+Mul2+Mul3+Mul4+Mul5+Mul6+Mul7
    Chan8out ! Sum

[Diagrams omitted: the user defined grain (run times T1,T2 =
536,536), the fine grain task graph used for scheduling, and the
grain after grain packing.]

T1 is the task graph run time using DSH.
T2 is the task graph run time using Load Balancing.

Figure 5.6 An Example of User Program Restructure.

CHAPTER VI
CONCLUSION

6.1 Significance of this Research

We proposed two new scheduling heuristics for task graphs
that have communication delays.

ISH (insertion scheduling heuristic) provides an improvement
of up to 45% over the current solutions and has a smaller time
complexity than DSH, O(N^2).

DSH (duplication scheduling heuristic) is an O(N^4) heuristic
which we recommend be used to solve the scheduling problem, for
three main reasons:

First, the resulting schedules from DSH provide up to an order
of magnitude improvement in performance over the current solutions.
The improvement keeps growing as the ratio of communication
delay to node size increases. Over a wide range of randomly
generated test data (340 task graphs) run with a varied number of
PEs, for a total of 9040 test runs, DSH provided improvement in
almost all tests. Performance is about the same in tests with a small
communication delay ratio and a small number of PEs, but in no
test did a program scheduled by DSH perform worse.

Second, DSH solves the max-min problem by duplicating some
scheduled tasks on some PEs. The max-min problem has not been
fully explored elsewhere, yet it is of major importance to gaining
optimal or near-optimal schedules.

Third, DSH gives monotonically growing improvements as the
number of PEs is allowed to increase. In cases where the
communication delay ratio is very high, DSH gives a speedup ratio
of 1.0, which indicates that such a task graph should be executed on
a single processor.

Furthermore, the small grain schedule obtained from DSH can
be used in "grain packing" to find a "near optimal" grain size for
parallel programs.

The main problem that might compromise the "near optimal"
schedules produced by DSH is the level alteration and critical path
problem mentioned in Chapter II. The schedule might not be "near
optimal" because the critical path changes as allocation is done.
This problem remains unsolved and is part of the dynamic
scheduling problem.

6.2 Future Related Research

In this section, some possible areas of future related research
are discussed. The main directions lie in relaxing some of the
assumptions in the task graph and parallel system models.

There are many extensions that can be made to our parallel
processing model. First, the scheduler assumes all PEs are fully
connected. Second, the calculation of communication delay takes
into account neither the queuing delay nor the number of hops in a
network. The inclusion of shortest paths, a routing algorithm, and
the scheduling of messages in a parallel processor scheduler would
be another useful area to explore.

An extension of the task graph model is to handle a dynamic
task graph, where node execution times, the amount of message
passing, the precedence constraints, and the number of nodes in the
task graph are dynamic and can change during runtime. An example
of this is a task graph with loops and branches.

This extension is extremely hard to achieve because critical
path information is not available until runtime. Hence, the
performance of such a scheduler depends on 1) how closely the
scheduler can predict the future behavior of the task graph, and
2) how much overhead is introduced by the scheduler if scheduling
is done during runtime. This extension is the most important
problem in the parallel processing area because it would allow many
more application programs to run on parallel processing systems.

BIBLIOGRAPHY
Adam 74 T. L. Adam, K. M. Chandy, J. R. Dickson, "A Comparison of
List Schedules for Parallel Processing Systems," Comm.
ACM., Vol. 17, pp. 685-690, Dec. 1974.
Babb 84 R. G. Babb, "Parallel Processing with Large-Grain Data
Flow Techniques", Computer, Vol. 17, No. 7, July 1984,
pp. 55-61.
Bash 83 A. F. Bashir, V. Susarla, and K. Varavan, "A Statistical
Study of a Task Scheduling Algorithm," IEEE Trans.
Comput., Vol. C-32, No. 8, Aug. 1983, pp. 774-777.
Blaz 84 J. Blazewicz and J. Weglarz, "Scheduling Independent 2-
Processor Tasks to Minimize Scheduling," Information
Processing Letter, Vol 18, No. 5, June 1984, pp. 267-273.
Bokh 81 S. H. Bokhari, "On the Mapping Problem," IEEE Trans.
Computers, Vol. C-30, No. 3, pp. 207-214, March 1981.
Chen 75 N. F. Chen, and C. L. Liu, "On a Class of Scheduling
Algorithms for Multiprocessing Systems," Proc. 1974
Sagamore Computer Conference on Parallel Processing, T.
Feng, ed., Springer, Berlin, 1975, pp. 1-16.
Chou 82 T. Chou and J. Abraham, "Load Balancing in Distributed
Systems," IEEE Transactions on Software Engineering,
Vol. SE-8, No. 4, July 1982, pp. 401-412.
Chu 80 W. W. Chu et al., "Task Allocation in Distributed Data
Processing," Computer, Vol. 13, No. 11, pp. 57-69, Nov.
1980.
Clar 52 W. Clark, The Gantt Chart, 3rd edition, London: Pittman
and Sons, 1952.
Coff 72 E. G. Coffman, Jr., and R. L. Graham, "Optimal Scheduling
for two Processor System," Acta Information, Vol. 1, No.
3, pp. 200-213, 1972.
Coff 76 E. G. Coffman, Computer and Job-Shop Scheduling Theory.
New York: Wiley, 1976.
Dogr 78 A. Dogramaci and J. Surkis, "Limitation of a Parallel
Processor Scheduling Algorithm," Int. J. Prod. Res., Vol.
16, No. 1, pp. 83-85, 1978.
Efe 82 Kemal Efe, "Heuristic Models of Task Assignment
Scheduling in Distributed Systems," Computer, Vol. 15,
No. 6, pp. 50-56, June 1982.
Gabo 82 H. N. Gabow, "An Almost-Linear Algorithm for Two-
Processor Scheduling," ACM. J., Vol. 29, No. 3, July 1982,
pp 766-780.
Gonz 77 M. J. Gonzalez Jr., "Deterministic Processor Scheduling,"
ACM Computing Surveys, Vol. 9, No. 3, Sept. 1977, pp.
173-204.

Grah 72 R. L. Graham, "Bounds on Multiprocessing Anomalies and
Related Packing Algorithms," AFIPS 1972 Conf. Proc.,
Vol. 40, AFIPS Press, Montvale, N.J., pp. 205-217.
Horv 77 E. C. Horvath, S. Lam, and R. Sethi, "A Level Algorithm for
Preemptive Scheduling," ACM J., Vol. 24, No. 1, 1977, pp.
32-43.
Hu 61 T. C. Hu, "Parallel Sequencing and Assembly Line
Problems," Operations Research, Vol. 9, No. 6, 1961, pp.
841-848.

Jens 77 John E. Jensen, "A Fixed-Variable Scheduling Model for
Multiprocessors," Proc. of 1977 International Conference
on Parallel Processing, pp. 108-117, 1977.
Kasa 84 H. Kasahara, S. Narita, "Practical Multiprocessor
Scheduling Algorithms for Efficient Parallel Processing,"
IEEE Transactions on Computers, Vol. C-33, No. 11, Nov.
1984, pp. 1023-1029.

Kauf 74 M. T. Kaufman, "An Almost-Optimal Algorithm for the


Assembly Line Scheduling Problem," IEEE Trans.
Comput., Vol. C-23, No. 11, Nov. 1974, pp 1169-1174
Kohl 75 W. H. Kohler, "A Preliminary Evaluation of the Critical
Path Method for Scheduling Tasks on Multiprocessor
Systems," IEEE Transactions on Computers, Vol. C-24, No.
12, Dec. 1975, pp. 1235-1238.
Krau 75 K. L. Krause, V. Y. Shen, and H. D. Schwetman, "Analysis
of Several Task-Scheduling Algorithms for a Model of
Multiprogramming Computer Systems," ACM J., Vol. 22,
No. 4, October 1975, pp. 522-550.
Kund 81 M. Kunde, "Nonpreemptive LP-Scheduling on
Homogeneous Multiprocessor Systems," SIAM J. Comput.,
Vol. 10, No. 1, Feb. 1981, pp. 151-173.
Kung 81 H. T. Kung, "Synchronized and Asynchronized Parallel
Algorithms for Multiprocessors," Tutorial on Parallel
Processing, IEEE Computer Society Press, 1981, pp. 428-
463.
Lam 77 S. Lam, and R. Sethi, "Worst Case Analysis of Two
Scheduling Algorithms," SIAM Journal of Computing, Vol.
6, 1977, pp. 518-536.
Lens 78 J. K. Lenstra, and A. H. G. Rinnooy Kan, "Complexity of
Scheduling under Precedence Constraints," Operations
Research, Vol. 26, No. 1, Jan.-Feb. 1978, pp. 22-35.
Lo 81 Virginia Lo, Jane W. S. Liu, "Task Assignment in
Distributed Multiprocessor Systems," Proc. of 1981
International Conference on Parallel Processing, pp 358-
360, 1981.
Ma 82 P. Ma, E. Y. Lee and M. Tsuchiya, "A Task Allocation
Model for Distributed Computing Systems," IEEE Trans.
Computers, Vol. C-31, No. 1, pp. 41-45, Jan. 1982.
Ma 84 P. Ma, "A Model to Solve Timing-Critical Application
Problems in Distributed Computer Systems," Computer,
Vol. 17, No. 1, pp. 62-68, Jan. 1984.
Nett 76 E. Nett, "On Further Applications of the Hu Algorithm to
Scheduling Problems," Proc. of 1976 International
Conference on Parallel Processing, pp 317-325, 1976.
Ni 85 Lionel M. Ni and Kai Hwang, "Optimal Load Balancing in a
Multiprocessor System with Many Job Classes," IEEE
Transactions on Software Engineering, Vol. SE-11, No. 5,
May 1985, pp. 491-496.
Rama 69 C. V. Ramamoorthy and M. J. Gonzalez, "A Survey of the
Techniques for Recognizing Parallel Processable Streams
in Computer Programs," AFIPS FJCC, 1969.
Rama 72 C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez,
"Optimal Scheduling Strategies in a Multiprocessor
System," IEEE Trans. Comput., Vol. C-21, No. 2, Feb. 1972,
pp 137-146.
Rama 76 C. V. Ramamoorthy and W. H. Leung, "A Scheme for
Parallel Execution of Sequential Programs," Proc. of 1976
International Conference on Parallel Processing, pp 312-
316, 1976.
Schw 87 K. Schwan, R. Ramnath, S. Vasudevan, and D. Ogle, "A
System for Parallel Programming," Proc. of the Ninth
International Conference on Software Engineering, Mar.
1987, pp. 270-282.
Seth 76 R. Sethi, "Scheduling Graphs on Two Processors," SIAM J.
Comput., Vol. 5, No. 1, March 1976, pp. 73-82.
Shen 85 Chien-Chung Shen and Wen-Hsiang Tsai, "A Graph
Matching Approach to Optimal Task Assignment in
Distributed Computing Systems Using a Minimax
Criterion," IEEE Trans. Computers, Vol. C-34, No. 3, pp.
207-214, March 1985.
Ston 77 H. S. Stone, "Multiprocessor Scheduling with the Aid of
Network Flow Algorithms," IEEE Trans. Software
Engineering, Vol. SE-3, pp. 85-94, Jan. 1977.
Ston 78 H. S. Stone and S. H. Bokhari, "Control of Distributed
Processes," Computer, Vol. 11, No. 7, pp. 97-106, July
1978.
Tows 86 Don Towsley "Allocating Programs Containing Branches
and Loops Within a Multiple Processor System," IEEE
Transactions on Software Engineering, Vol. SE-12, No. 10,
Oct. 1986, pp. 1018-1024.
Ullm 75 J. D. Ullman, "NP-Complete Scheduling Problems," J. of
Computer and System Science, pp. 384-393, 1975.
Yu 84 Wang Ho Yu, "LU Decomposition on a Multiprocessing
System with Communication Delay," Ph.D. dissertation,
Department of Electrical Engineering and Computer
Science, University of California, Berkeley, 1984.

APPENDIX
Data points of Figure 4.1 (20 nodes, delays = 1)

P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 1.6670 1.6670 1.6670 1.6670 1.8180 1.7878 1.6670 1.8180 1.7878 1.8180 1.8180 1.8180
3 2.0000 2.2220 2.0444 2.0000 2.2220 2.1332 2.0000 2.2220 2.1776 2.2220 2.2220 2.2220
4 2.2220 2.5000 2.3332 2.2220 2.5000 2.4444 2.5000 2.5000 2.5000 2.5000 2.5000 2.5000
5 2.2220 2.2220 2.2220 2.5000 2.8570 2.6428 2.5000 2.8570 2.5714 2.5000 2.8570 2.6428
6 2.2220 2.5000 2.2776 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
7 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
8 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
9 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
10 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
11 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
12 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
13 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
14 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
15 2.2220 2.2220 2.2220 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142 2.5000 2.8570 2.7142
Data points of Figure 4.1 (20 nodes, delays = 5)

P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 0.6450 0.7410 0.7016 0.7140 0.8700 0.7670 0.9090 1.1760 0.9978 1.1760 1.3330 1.2370
3 0.6450 0.8000 0.7096 0.7690 0.8000 0.7938 1.0000 1.1760 1.1162 1.3330 1.5380 1.4342
4 0.5710 0.8330 0.7342 0.8000 1.0000 0.9312 1.0000 1.3330 1.1772 1.4290 1.6670 1.5420

5 0.5560 0.8330 0.7250 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.4290 1.6670 1.5678
6 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
7 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
8 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
9 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
10 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
11 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
12 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
13 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
14 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
15 0.5560 0.8000 0.6670 1.0000 1.0000 1.0000 1.0530 1.3330 1.1878 1.5380 1.8180 1.6500
Data points of Figure 4.1 (20 nodes, delays = 10)

P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 0.2630 0.4350 0.3716 0.4350 0.5410 0.4922 0.5560 0.8000 0.6478 1.0530 1.1110 1.0646
3 0.3030 0.4350 0.3690 0.4350 0.5560 0.5288 0.5560 0.7690 0.7152 1.1110 1.1760 1.1370
4 0.2630 0.4440 0.4078 0.4550 0.5710 0.5418 0.5710 0.8000 0.7056 1.1110 1.2500 1.1926
5 0.2630 0.4550 0.3828 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2056
6 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
7 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
8 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
9 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
10 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
11 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
12 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
13 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
14 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
15 0.2630 0.4440 0.3646 0.5560 0.5880 0.5714 0.5710 0.8000 0.7056 1.1760 1.2500 1.2204
Data points of Figure 4.1 (20 nodes, delays = 20)

P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 0.2330 0.2990 0.2462 0.2330 0.3080 0.2744 0.3030 0.4260 0.3996 1.0000 1.0000 1.0000
3 0.1900 0.2380 0.2258 0.2330 0.3030 0.2478 0.3080 0.4440 0.3606 1.0000 1.0000 1.0000
4 0.1900 0.3080 0.2412 0.2350 0.3080 0.2784 0.3080 0.4440 0.3878 1.0000 1.0000 1.0000
5 0.1900 0.2350 0.2170 0.2380 0.3130 0.2810 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
6 0.1600 0.2380 0.2030 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
7 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
8 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
9 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
10 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
11 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
12 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
13 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
14 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
15 0.1600 0.2350 0.1934 0.3080 0.4440 0.3372 0.3080 0.4440 0.3616 1.0000 1.0000 1.0000
Data points of Figure 4.6 (350 nodes, delays = 1)
P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 1.9770 1.9770 1.9770 1.9770 1.9890 1.9866 1.9770 1.9890 1.9866 1.9890 1.9890 1.9890
4 3.8890 3.8890 3.8890 3.8890 3.9330 3.9066 3.8890 3.9330 3.9066 3.8890 3.9330 3.9242
6 5.7380 5.7380 5.7380 5.7380 5.8330 5.7570 5.7380 5.8330 5.7570 5.7380 5.8330 5.7570
8 7.4470 7.6090 7.4794 7.4470 7.6090 7.5442 7.4470 7.6090 7.5442 7.6090 7.6090 7.6090
10 8.9740 9.2110 9.1162 8.9740 9.2110 9.1636 8.9740 9.2110 9.1636 9.2110 9.4590 9.3598
12 10.6060 10.9380 10.6724 10.6060 10.9380 10.6724 10.6060 10.9380 10.6724 10.9380 10.9380 10.9380
14 12.0690 12.0690 12.0690 12.0690 12.0690 12.0690 12.0690 12.0690 12.0690 12.5000 12.5000 12.5000
16 12.9630 13.4620 13.3622 12.9630 13.4620 13.3622 12.9630 13.4620 13.3622 14.0000 14.0000 14.0000
18 14.0000 14.5830 14.4664 14.5830 14.5830 14.5830 14.0000 14.5830 14.4664 15.2170 15.2170 15.2170
20 14.5830 15.9090 15.3670 15.2170 15.9090 15.4938 15.2170 15.9090 15.4938 15.9090 16.6670 16.3638
22 12.9630 16.6670 15.7746 15.2170 16.6670 16.2254 15.2170 17.5000 16.3920 15.9090 17.5000 17.0152
24 11.6670 17.5000 15.8486 15.2170 17.5000 16.7252 15.2170 18.4210 16.9094 15.9090 18.4210 17.3836
26 10.9380 18.4210 15.7486 15.2170 18.4210 17.0936 15.2170 18.4210 17.0936 16.6670 19.4440 17.9240
28 10.6060 18.4210 14.5958 15.2170 18.4210 16.9094 15.2170 19.4440 17.2982 16.6670 19.4440 17.9240
30 10.6060 18.4210 13.7344 15.2170 18.4210 16.9094 15.2170 19.4440 17.4824 16.6670 19.4440 17.9240
32 10.2940 15.2170 12.3156 15.2170 18.4210 17.0936 15.2170 19.4440 17.4824 16.6670 19.4440 18.1286
34 10.0000 13.4620 11.8304 15.2170 18.4210 17.0936 15.2170 19.4440 17.4824 16.6670 19.4440 18.1286
36 9.7220 12.9630 11.3300 15.2170 18.4210 17.0936 15.2170 19.4440 17.4824 16.6670 19.4440 18.1286
38 9.7220 12.9630 11.3300 15.2170 18.4210 17.0936 15.2170 19.4440 17.4824 16.6670 19.4440 18.1286
40 9.7220 12.9630 11.1792 15.2170 18.4210 17.0936 15.2170 19.4440 17.4824 16.6670 19.4440 18.2952
42 9.7220 12.9630 11.1792 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 19.4440 18.2952
44 9.7220 12.9630 11.1792 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
46 9.7220 12.9630 11.1792 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 19.4440 18.2952
48 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
50 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
52 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
54 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
56 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
58 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
60 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
62 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
64 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
66 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
68 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
70 9.7220 12.9630 11.1128 15.2170 18.4210 17.2778 15.2170 19.4440 17.4824 16.6670 20.5880 18.5240
Data points of Figure 4.6 (350 nodes, delays = 5)
P # min Hu max Hu Ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 1.9020 1.9340 1.9212 1.9130 1.9440 1.9294 1.9340 1.9440 1.9400 1.9550 1.9660 1.9616
4 3.3330 3.5710 3.4546 3.3980 3.6080 3.5158 3.5710 3.6840 3.6234 3.6460 3.7630 3.7156
6 4.0700 4.6050 4.3400 4.3750 4.9300 4.6040 4.7950 5.2240 4.9350 5.0000 5.3030 5.1800
8 4.3210 5.2240 4.7350 4.6050 5.3030 4.9832 5.5560 6.0340 5.7624 5.8330 6.3640 6.1054
10 4.7300 5.3850 4.9810 5.0720 5.8330 5.2850 6.2500 6.6040 6.4132 6.6040 7.1430 6.8946
12 5.0720 5.6450 5.2624 5.2240 6.0340 5.4666 6.6040 7.0000 6.8660 7.2920 7.7780 7.5146
14 5.3030 5.6450 5.4042 5.6450 6.1400 5.7812 6.6040 7.1430 7.0352 7.9550 8.3330 8.1432
16 5.3030 5.6450 5.5246 5.7380 6.1400 5.9982 6.4810 7.4470 7.1310 8.5370 8.9740 8.6670
18 5.3850 5.9320 5.7252 5.8330 6.1400 6.0574 6.6040 7.2920 7.0650 8.9740 9.4590 9.1658
20 5.3850 5.9320 5.7442 5.8330 6.3640 6.1674 6.7310 7.4470 7.1810 9.2110 9.7220 9.5146
22 5.7380 6.3640 6.0234 5.8330 6.8630 6.2900 6.7310 7.2920 7.1202 9.7220 10.0000 9.8332
24 5.3030 6.4810 6.0462 6.1400 7.0000 6.5968 6.7310 7.4470 7.1512 10.0000 10.2940 10.1764
26 5.3030 6.3640 6.0228 6.2500 7.0000 6.6680 6.7310 7.4470 7.1810 10.0000 10.6060 10.2976
28 4.9300 6.3640 5.9050 6.2500 6.8630 6.5642 6.7310 7.4470 7.1512 10.0000 10.6060 10.4224
30 4.0700 6.3640 5.3178 6.2500 7.0000 6.6680 6.6040 7.4470 7.1258 10.2940 10.9380 10.6764
32 4.0700 6.2500 4.9670 6.2500 7.1430 6.5968 6.7310 7.4470 7.1512 10.2940 10.9380 10.7428
34 3.8890 4.8610 4.2168 6.2500 7.1430 6.6202 6.7310 7.4470 7.1512 10.6060 11.2900 10.9460
36 3.8460 4.8610 4.1720 6.2500 7.0000 6.6954 6.7310 7.4470 7.1810 10.6060 11.2900 10.9460
38 3.8460 4.8610 4.1634 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.2900 10.9460
40 3.8460 4.8610 4.1634 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.2900 10.9460
42 3.8460 4.8610 4.1634 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.6670 11.0918
44 3.8460 4.7950 4.0988 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.6670 11.1672
46 3.4650 4.7950 3.9826 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.6670 11.1672
48 3.4650 4.7950 3.9826 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.6060 11.6670 11.1672
50 3.4650 4.7950 3.9826 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
52 3.4650 4.7950 3.9826 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
54 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
56 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
58 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
60 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
62 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
64 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
66 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
68 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
70 3.4650 4.4870 3.9210 6.2500 7.1430 6.7240 6.7310 7.4470 7.1810 10.9380 11.6670 11.3000
Data points of Figure 4.6 (350 nodes, delays = 10)
P# min Hu max Hu ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 1.7070 1.8320 1.7724 1.7410 1.8320 1.7862 1.8040 1.8620 1.8308 1.9020 1.9340 1.9210
4 2.1210 2.6920 2.4662 2.2440 2.8230 2.5910 2.7780 3.0700 2.9698 3.0430 3.3020 3.2076
6 2.3490 2.8000 2.5654 2.4820 2.9660 2.7986 3.2410 3.7630 3.5160 3.7630 4.1180 3.9540
8 2.5180 2.8690 2.7376 2.6520 3.0700 2.9216 3.5710 4.0230 3.7874 4.3210 4.6050 4.5006
10 2.5930 3.0970 2.8620 2.7340 3.0970 2.9680 3.5710 4.3210 4.0302 4.9300 5.1470 5.0452
12 2.8460 3.1250 2.9512 2.8690 3.3980 3.1286 3.8040 4.6050 4.1292 5.3030 5.5560 5.4538
14 2.8690 3.3980 3.1120 3.0970 3.4650 3.2900 3.7630 4.2680 4.0970 5.5560 5.9320 5.8180
16 2.8690 3.3980 3.1544 3.0970 3.7230 3.4028 3.8040 4.3750 4.1868 5.8330 6.2500 6.1014
18 3.0970 3.4650 3.2542 3.1250 3.7630 3.3874 4.0230 4.4870 4.2550 6.0340 6.4810 6.2986
20 3.1250 3.5000 3.3348 3.1250 3.8040 3.5396 3.8040 4.5450 4.2118 6.1400 6.7310 6.5140
22 3.1820 3.5000 3.3590 3.1530 3.8040 3.5452 4.0230 4.5450 4.2666 6.3640 6.8630 6.7114
24 3.1530 3.8460 3.5054 3.4310 3.8460 3.6868 3.8040 4.7300 4.2574 6.4810 7.0000 6.8150
26 2.6320 3.8460 3.2582 3.4310 3.8890 3.6954 3.8460 4.7300 4.2872 6.6040 7.4470 7.0400
28 2.5930 3.8890 3.3358 3.4310 3.8890 3.6332 3.6840 4.7300 4.2658 6.7310 7.4470 7.1524
30 2.2290 3.5350 2.9758 3.4310 3.8890 3.6262 3.9770 4.7950 4.3724 6.7310 7.4470 7.2430
32 1.9770 3.1250 2.7066 3.4310 3.8890 3.6332 4.0700 4.7300 4.3430 6.7310 7.6090 7.2754
34 1.9770 3.1530 2.7098 3.4310 3.8890 3.6954 3.9330 4.4870 4.2670 6.8630 7.7780 7.4302
36 1.9770 2.8460 2.5498 3.4310 3.8890 3.6954 3.9770 4.6050 4.2590 6.8630 7.7780 7.4302
38 1.9770 2.8460 2.5452 3.4310 3.8890 3.6954 3.9330 4.6050 4.2502 6.8630 7.7780 7.4640
40 1.9770 2.6120 2.3482 3.4310 3.8890 3.6954 3.9330 4.5450 4.2382 7.0000 7.7780 7.5224
42 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9330 4.5450 4.2900 7.0000 7.9550 7.5578
44 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9330 4.5450 4.2606 7.0000 7.9550 7.5916
46 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9330 4.7950 4.3222 7.0000 7.9550 7.6270
48 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9330 4.4870 4.2376 7.0000 7.9550 7.6270
50 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9330 4.5450 4.2382 7.0000 7.9550 7.6624
52 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
54 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
56 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
58 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
60 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
62 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
64 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
66 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
68 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
70 1.9770 2.4480 2.3120 3.4310 3.8890 3.6954 3.9770 4.7950 4.3430 7.0000 7.9550 7.6948
Data points of Figure 4.6 (350 nodes, delays = 20)
P# min Hu max Hu ave Hu min Yu max Yu ave Yu min ISH max ISH ave ISH min DSH max DSH ave DSH
2 1.0800 1.2540 1.1440 1.0940 1.2680 1.1988 1.4290 1.5770 1.5026 1.6360 1.7240 1.6706
4 1.1820 1.2730 1.2306 1.2150 1.3510 1.3080 1.7330 1.9230 1.8390 2.3490 2.4140 2.3878
6 1.1780 1.3620 1.2846 1.3260 1.4400 1.3650 1.8920 2.1470 2.0018 2.8000 2.9170 2.8558
8 1.1860 1.4770 1.3524 1.3730 1.6130 1.4632 1.8040 2.2010 1.9810 3.0970 3.2710 3.1712
10 1.2680 1.4960 1.4140 1.3730 1.6360 1.5064 1.8320 2.2580 2.1142 3.3650 3.5000 3.4316
12 1.2730 1.5150 1.4312 1.4770 1.7680 1.6032 1.8320 2.3180 2.1356 3.5350 3.6840 3.6314
14 1.3670 1.6430 1.5014 1.4960 1.8040 1.6414 1.9890 2.3180 2.1540 3.6840 3.8460 3.7968
16 1.3780 1.7950 1.5708 1.5150 1.8230 1.6806 2.0000 2.3330 2.1848 3.8890 4.0700 3.9964
18 1.4960 1.8230 1.6242 1.6430 1.8230 1.7176 2.0350 2.3490 2.2294 4.0230 4.2170 4.1190
20 1.5090 1.8320 1.6332 1.5150 1.8320 1.7290 2.0000 2.5360 2.2236 4.1180 4.3750 4.2390
22 1.6590 1.8320 1.7572 1.6590 1.8320 1.7592 2.0110 2.3490 2.1912 4.2170 4.4300 4.3222
24 1.5020 1.8320 1.6968 1.6590 1.8320 1.7644 2.1470 2.5000 2.2992 4.3210 4.4870 4.4092
26 1.1180 1.8420 1.4726 1.8230 1.8420 1.8324 2.1340 2.3490 2.2458 4.3210 4.6050 4.4666
28 1.1180 1.8320 1.4356 1.6590 2.0470 1.8744 2.1600 2.2880 2.2418 4.4300 4.6670 4.5354
30 1.1080 1.8420 1.4194 1.6590 2.0470 1.8792 2.1340 2.4820 2.2724 4.4300 4.7300 4.5834
32 1.0320 1.6590 1.3646 1.8230 2.0470 1.9102 2.1340 2.4820 2.2610 4.4870 4.7300 4.6188
34 0.9830 1.1860 1.1146 1.8230 2.0470 1.9078 2.0830 2.5360 2.2422 4.5450 4.8610 4.6816
36 0.9310 1.1860 1.0920 1.6590 2.0470 1.8798 2.1340 2.3650 2.2292 4.5450 4.8610 4.7070
38 0.9260 1.1860 1.0902 1.6590 2.0470 1.8798 2.1340 2.3650 2.2552 4.5450 4.9300 4.7334
40 0.8790 1.1860 1.0808 1.6590 2.0470 1.8798 2.0350 2.5550 2.2030 4.6050 4.9300 4.7454
42 0.8790 1.1860 1.0660 1.6590 2.0470 1.8798 2.0350 2.3650 2.2014 4.6050 4.9300 4.7454
44 0.8790 1.1860 1.0660 1.6590 2.0470 1.8798 2.1340 2.3650 2.2610 4.6050 5.0000 4.7850
46 0.8790 1.1860 1.0556 1.6590 2.0470 1.8798 2.0590 2.4310 2.2592 4.6050 5.0720 4.8126
48 0.8370 1.1860 1.0472 1.6590 2.0470 1.8798 2.0590 2.3970 2.2552 4.6050 5.0720 4.8256
50 0.8370 1.1860 1.0324 1.6590 2.0470 1.8798 2.0590 2.3970 2.2552 4.6050 5.0720 4.8388
52 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6050 5.0720 4.8388
54 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.0720 4.8512
56 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6050 5.0720 4.8526
58 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.0720 4.8650
60 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.0720 4.8650
62 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.0720 4.8650
64 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6050 5.1470 4.8676
66 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.1470 4.8800
68 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.1470 4.8800
70 0.8370 1.1110 1.0174 1.6590 2.0470 1.8798 2.0590 2.3330 2.2424 4.6670 5.1470 4.8800
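For readers who wish to reprocess these data points, the following is a minimal Python sketch (not part of the original thesis) that parses rows in the whitespace-separated layout used above into per-heuristic (min, max, ave) speedup triples. The two sample rows are copied from the delays = 20 table; the column order assumed is P#, then min/max/ave for Hu, Yu, ISH, and DSH in turn.

```python
# Parse appendix-style speedup rows: P# followed by (min, max, ave)
# triples for the Hu, Yu, ISH, and DSH heuristics.

rows = """
2 1.0800 1.2540 1.1440 1.0940 1.2680 1.1988 1.4290 1.5770 1.5026 1.6360 1.7240 1.6706
4 1.1820 1.2730 1.2306 1.2150 1.3510 1.3080 1.7330 1.9230 1.8390 2.3490 2.4140 2.3878
"""

HEURISTICS = ["Hu", "Yu", "ISH", "DSH"]

def parse(table_text):
    """Return a list of (P#, {heuristic: (min, max, ave)}) tuples."""
    data = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        p = int(fields[0])                      # number of PEs
        vals = list(map(float, fields[1:]))     # 12 speedup values
        triples = {name: tuple(vals[i * 3:i * 3 + 3])
                   for i, name in enumerate(HEURISTICS)}
        data.append((p, triples))
    return data

table = parse(rows)
# Average DSH speedup on 4 PEs from the sample rows:
print(table[1][1]["DSH"][2])  # -> 2.3878
```

With all rows of a table pasted into `rows`, the same structure makes it straightforward to regenerate the curves of Figure 4.6 (average speedup versus processor count for each heuristic).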