
Flynn's Classification (1972) [1]

SISD: Single Instruction stream, Single Data stream
  Conventional sequential machines
  The program executed is the instruction stream, and the data operated on is the data stream
SIMD: Single Instruction stream, Multiple Data streams
  Vector machines
  Processors execute the same program, but operate on different data streams
MIMD: Multiple Instruction streams, Multiple Data streams
  Parallel machines
  Independent processors execute different programs, using unique data streams
MISD: Multiple Instruction streams, Single Data stream
  Systolic array machines
  A common data structure is manipulated by separate processors executing different instruction streams (programs)
Anshul Kumar, CSE IITD slide 2
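SISD
[Figure: SISD organization. One control unit (C), one processor (P) and memory (M), linked by a single instruction stream (IS) and a single data stream (DS).]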
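SIMD
[Figure: SIMD organization. One control unit (C) issues a single instruction stream (IS) to several processors (P), each with its own data stream (DS) to memory (M).]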
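MISD
[Figure: MISD organization. Several control units (C), each with its own instruction stream (IS), drive several processors (P) connected by data streams (DS) to memory (M).]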
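MIMD
[Figure: MIMD organization. Several control units (C), each with its own instruction stream (IS), drive several processors (P), each with its own data stream (DS) to memory (M).]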
Classification of Parallel Architectures [Ref: Sima et al.]
Parallel architectures (PAs)
  Data-parallel architectures (DPs)
    Vector architectures
    Associative and neural architectures
    SIMDs
    Systolic architectures
  Function-parallel architectures
    Instruction-level PAs (ILPs)
      Pipelined processors
      VLIWs
      Superscalar processors
    Thread-level PAs and process-level PAs (MIMDs)
      Distributed memory MIMD
      Shared memory MIMD
What is Pipelining
A technique used in advanced microprocessors where the
microprocessor begins executing a second instruction before
the first has been completed.

- A Pipeline is a series of stages, where some work is done at
each stage. The work is not finished until it has passed
through all stages.

With pipelining, the computer architecture allows the next
instructions to be fetched while the processor is performing
arithmetic operations, holding them in a buffer close to the
processor until each instruction operation can be performed.
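Four Pipelined Instructions
[Figure: four instructions in a five-stage pipeline (IF, ID, EX, M, W). The first instruction completes after 5 cycles and each later one completes 1 cycle after its predecessor.]
Cycle:          1   2   3   4   5   6   7   8
Instruction 1:  IF  ID  EX  M   W
Instruction 2:      IF  ID  EX  M   W
Instruction 3:          IF  ID  EX  M   W
Instruction 4:              IF  ID  EX  M   W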
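The overlap of four instructions in a five-stage pipeline can be sketched in a few lines of Python. This is a minimal illustration that assumes an ideal pipeline with no stalls or hazards; the function names `pipeline_schedule` and `total_cycles` are hypothetical and chosen only for this example.

```python
def pipeline_schedule(num_instructions, stages=("IF", "ID", "EX", "M", "W")):
    """Return one dict per instruction mapping clock cycle -> stage.

    Assumes an ideal pipeline: instruction i enters IF in cycle i + 1
    and advances one stage per cycle (no stalls, no hazards).
    """
    schedule = []
    for i in range(num_instructions):
        schedule.append({i + s + 1: name for s, name in enumerate(stages)})
    return schedule


def total_cycles(num_instructions, num_stages=5):
    """Cycles needed to finish all instructions in an ideal pipeline."""
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    for i, row in enumerate(pipeline_schedule(4), start=1):
        print(f"Instruction {i}:", row)
    print("Total cycles:", total_cycles(4))
```

Under these assumptions, k stages and n instructions take k + n - 1 cycles, compared with k * n cycles when instructions are not overlapped.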
Instruction Fetch

The Instruction Fetch (IF) stage is responsible for obtaining the
requested instruction from memory. The instruction and the
program counter (which is incremented to the next
instruction) are stored in the IF/ID pipeline register as
temporary storage so that they may be used in the next stage at
the start of the next clock cycle.

Instruction Decode

The Instruction Decode (ID) stage is responsible for decoding
the instruction and sending out the various control lines to
the other parts of the processor. The instruction is sent to the
control unit where it is decoded and the registers are fetched
from the register file.
Memory and IO

The Memory and IO (MEM) stage is responsible for storing
and loading values to and from memory. It is also responsible
for input or output from the processor. If the current
instruction is not of Memory or IO type, then the result from
the ALU is passed through to the write back stage.


DATA FLOW COMPUTERS
Data flow computers execute the instructions
as the data becomes available
Data Flow architectures are highly
asynchronous
In the data flow architecture there is no need
to store intermediate or final results, because
they are passed as tokens among instructions.
The program sequencing depends on the data
availability
The information appears as operation packets
and data tokens
Operation packet = opcode + operands +
destination
Data tokens = data(result) +destination
Data flow computers have packet
communication architecture
Dataflow computers have distributed
multiprocessor organization
Copyright 2004 David J. Lilja 15
What is a performance metric?
Count: of how many times an event occurs
Duration: of a time interval
Size: of some parameter
A value derived from these fundamental measurements
Good metrics are
Linear -- nice, but not necessary
Reliable -- required
Repeatable -- required
Easy to use -- nice, but not necessary
Consistent -- required
Independent -- required

III. PERFORMANCE METRICS
A performance metric is a measure of a system's performance. It focuses on measuring a certain aspect of the system and allows comparison of various types of systems. The criteria for evaluating performance in parallel computing can include speedup, efficiency and scalability.
Speedup
Speedup is the most basic of parameters in multiprocessing systems and shows how much faster a parallel algorithm is than a sequential algorithm. It is defined as follows:
$S_p = T_1 / T_p$

where $S_p$ is the speedup, $T_1$ is the execution time for a sequential algorithm, $T_p$ is the execution time for a parallel algorithm and p is the number of processors.
There are three possibilities for speedup: linear, sub-linear and super-linear, shown in Figure 1. When $S_p = p$, i.e. when the speedup is equal to the number of processors, the speedup is called linear. In such a case, doubling the number of processors will double the speedup. In the case of sub-linear speedup ($S_p < p$), adding processors yields less than a proportional gain, and beyond some point may even reduce the speedup. Most algorithms are sub-linear because of the growing parallel overhead from such areas as interprocessor communication, load imbalance, synchronization, and extra computation. An interesting case occurs in super-linear speedup ($S_p > p$), which is mainly due to the increase in total cache size as processors are added.
Efficiency
Another performance metric in parallel computing is efficiency. It is defined as the achieved fraction of the total potential parallel processing gain. It estimates how well the processors are used in solving the problem.
$E_p = S_p / p = T_1 / (p \, T_p)$
where $E_p$ is the efficiency.
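The two definitions above can be checked numerically with a short Python sketch. The timings used below are made-up illustrative values, not measurements, and the helper names are assumptions of this example.

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p."""
    return t1 / tp


def efficiency(t1, tp, p):
    """E_p = S_p / p = T_1 / (p * T_p)."""
    return speedup(t1, tp) / p


def classify(t1, tp, p, tol=1e-9):
    """Label the speedup as linear, sub-linear or super-linear."""
    s = speedup(t1, tp)
    if abs(s - p) <= tol:
        return "linear"
    return "sub-linear" if s < p else "super-linear"


if __name__ == "__main__":
    # Hypothetical timings: 100 s sequentially, 25 s on 8 processors.
    print(speedup(100, 25), efficiency(100, 25, 8), classify(100, 25, 8))
```

With these sample numbers the speedup is 4 on 8 processors, so only half of the potential parallel gain is achieved.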
The Random Access Machine Model
RAM model of serial computers:
Memory is a sequence of words, each capable
of containing an integer.
Each memory access takes one unit of time
Basic operations (add, multiply, compare) take
one unit time.
Instructions are not modifiable
Read-only input tape, write-only output tape


7.2.2 The PRAM model
The PRAM is an idealized parallel machine which was developed as a
straightforward generalization of the sequential RAM. Because we will be
using it often, we give here a detailed description of it.
Description
A PRAM uses p identical processors, each one with a distinct id-number, and
able to perform the usual computation of a sequential RAM that is equipped
with a finite amount of local memory. The processors communicate through
some shared global memory (Figure 7.4) to which all are connected. The
shared memory contains a finite number of memory cells. There is a global
clock that sets the pace of the machine execution. In one time-unit period
each processor can perform, if it so wishes, any or all of the following three
steps:
1. Read from a memory location, global or local;
2. Execute a single RAM operation, and
3. Write to a memory location, global or local.

[Figure 7.4: PRAM organization. A control unit and processors P1, P2, ..., Pn, each with its own private memory, are connected through an interconnection network to the global memory.]

Advanced Topics in Algorithms and Data Structures
Classification of the PRAM model
The power of a PRAM depends on the kind of access to the shared memory locations.
In every clock cycle:
In the Exclusive Read Exclusive Write (EREW) PRAM, each memory location can be accessed only by one processor.
In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write.
In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.
It is easy to allow concurrent reading. However, concurrent writing gives rise to conflicts.
If multiple processors write to the same memory location simultaneously, it is not clear what is written to the memory location.
In the Common CRCW PRAM, all the processors must write the same value.
In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in writing.
In the Priority CRCW PRAM, processors have priorities associated with them and the highest priority processor succeeds in writing.
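The three CRCW write-conflict rules can be illustrated with a small Python sketch that resolves the writes aimed at one memory cell in one clock cycle. It is only a sequential model of the rules; the function name `resolve_write` is hypothetical, and treating a lower processor id as higher priority is an assumption made for this example.

```python
import random


def resolve_write(requests, rule):
    """Return the value stored in a cell after one cycle of concurrent writes.

    requests: list of (processor_id, value) pairs targeting the same cell.
    rule: "common", "arbitrary" or "priority".
    """
    if not requests:
        return None
    if rule == "common":
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW requires all writers to agree")
        return values.pop()
    if rule == "arbitrary":
        # Any one writer may succeed; pick one at random.
        return random.choice(requests)[1]
    if rule == "priority":
        # Assumption: the smallest processor id has the highest priority.
        return min(requests, key=lambda req: req[0])[1]
    raise ValueError(f"unknown rule: {rule}")


if __name__ == "__main__":
    writes = [(3, "c"), (1, "a"), (2, "b")]
    print(resolve_write(writes, "priority"))
```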
The EREW PRAM is the weakest and the Priority CRCW PRAM is the strongest PRAM model.
The relative powers of the different PRAM models are as follows:
EREW <= CREW <= Common CRCW <= Arbitrary CRCW <= Priority CRCW
An algorithm designed for a weaker model can be executed within the same time and work complexities on a stronger model.
We say model A is less powerful compared to model B if either:
the time complexity for solving a problem is asymptotically less in model B as compared to model A, or,
if the time complexities are the same, the processor or work complexity is asymptotically less in model B as compared to model A.
An algorithm designed for a stronger PRAM model can be simulated on a weaker model either with asymptotically more processors (work) or with asymptotically more time.
Adding n numbers on a PRAM
[Figure: adding n numbers on a PRAM]
This algorithm works on the EREW PRAM model as there are no read or write conflicts.
We will use this algorithm to design a matrix multiplication algorithm on the EREW PRAM.
PRAMs are classified as EREW, CREW and CRCW.
EREW. In the exclusive-read-exclusive-write (EREW) PRAM model, no conflicts are permitted for either reading or writing. If, during the execution of a program on this model, some conflict occurs, the program's behavior is undefined.
CREW. In the concurrent-read-exclusive-write (CREW) PRAM model, simultaneous readings by many processors from some memory cell are permitted. However, if a writing conflict occurs, the behavior of the program is undefined. The idea behind this model is that it may be cheap to implement some broadcasting primitive on a real machine, so one needs to examine the usefulness of such a decision.
CRCW. Finally, the concurrent-read-concurrent-write (CRCW) PRAM, the strongest of these models, permits simultaneous accesses for both reading and writing. In the case of multiple processors trying to write to some memory cell, one must define which of the processors eventually does the writing. There are several answers that researchers have given to this question. The most commonly used are the COMMON, ARBITRARY and PRIORITY rules.
Theorem 8 Any algorithm written for the PRIORITY CRCW model can be simulated on the EREW model with a slowdown of O(lg r), where r is the number of processors employed by the algorithm.
Proof. We have an algorithm that runs correctly on a PRIORITY CRCW PRAM, and an EREW PRAM machine on which we want to simulate the algorithm. If we were to execute the algorithm without modification on the EREW machine, it would not work. The problem would not be in the executable part of the code, since both machines understand the same set of executable statements. Instead, the problem would be in the statements that access the shared memory for reading or writing. For example, every time the algorithm says:
Processor Pi reads from memory location y into x
it might involve a concurrent read, which the EREW machine cannot handle. The same is true for a concurrent write statement (depicted visually in figure 7.8). In order to fix the problem, we have to replace each statement of that sort with a sequence of statements that do not involve any concurrency, but have the same result as if we had concurrency.
Let us assume that the algorithm uses r processors, named P1, P2, ..., Pr; the EREW machine also uses r processors. The EREW machine, however, will need a little more memory: r auxiliary memory locations A[1..r], which will be used to resolve the conflicts.
The idea is to replace each code fragment of the algorithm:
Processor Pi accesses (reads into x or writes x into) memory location y
with code which:
a) has the processors request permission to access a particular memory location,
b) finds out, for every memory location, whether there is a conflict, and
c) decides which of the competing processors will do the access.
This is achieved by the following fragment of code:
1. Processor Pi writes (y, i) into A[i]
2. Auxiliary array A[1..r] is sorted lexicographically in increasing order.
3. Processor Pi reads A[i-1] and A[i] and determines whether it is the highest priority processor accessing some memory location
4. If Pi is the highest priority processor, then:
   If the operation was a write, Pi does the writing
   Else Pi does the reading, and the value read is propagated to the processors interested in this value.
The last step takes O(lg r) time. The sorting step also takes O(lg r), as the following non-trivial fact shows. (We mention it here without proof. For a proof, consult [?].)
Prefix Sum - Doubling
CREW PRAM
Given: n elements in A[0..n-1]
Var: A and j are global, i is local
spawn (P1, P2, ..., Pn-1) // note the number of processors
for all Pi, 1 <= i <= n-1
    for j = 0 to log n - 1 do
        if (i - 2^j >= 0)
            A[i] = A[i] + A[i - 2^j]
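The doubling scheme for prefix sums can be simulated sequentially in Python. This is a sketch, not a parallel implementation: the inner loop stands in for the processors P1..Pn-1, and a snapshot of the array models the PRAM assumption that all processors read before any processor writes in a step. Rounding log n up for sizes that are not powers of two is an assumption of this example.

```python
def prefix_sum_doubling(values):
    """Inclusive prefix sums by recursive doubling (simulated PRAM steps)."""
    a = list(values)
    n = len(a)
    steps = (n - 1).bit_length() if n > 0 else 0  # ceil(log2 n) for n >= 1
    for j in range(steps):
        stride = 1 << j
        old = a[:]  # every processor reads the values from the previous step
        for i in range(1, n):  # processor P_i
            if i - stride >= 0:
                a[i] = old[i] + old[i - stride]
    return a


if __name__ == "__main__":
    print(prefix_sum_doubling([1, 2, 3, 4, 5]))
```

Each cell is read by two processors in the same step (its owner and the processor `stride` positions to the right), which is why the scheme is stated for the CREW model.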
Sum of elements EREW PRAM
Given: n elements in A[0..n-1]
Var: A and j are global, i is local
spawn (P0, P1, P2, ..., Pn/2-1) // P = n/2
for all Pi, 0 <= i <= (n/2 - 1)
    for j = 0 to log n - 1 do
        if (i mod 2^j = 0) and (2i + 2^j < n)
            A[2i] = A[2i] + A[2i + 2^j]
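The tree-style summation can likewise be simulated sequentially. The sketch below follows the index arithmetic of the pseudocode; the inner loop plays the role of the n/2 processors, and rounding log n up for sizes that are not powers of two is an assumption of this example.

```python
def erew_sum(values):
    """Sum n numbers in ceil(log2 n) simulated EREW PRAM steps."""
    a = list(values)
    n = len(a)
    if n == 0:
        return 0
    steps = (n - 1).bit_length()  # ceil(log2 n)
    for j in range(steps):
        stride = 1 << j
        old = a[:]  # reads of a step happen before its writes
        for i in range(n // 2):  # processor P_i
            if i % stride == 0 and 2 * i + stride < n:
                a[2 * i] = old[2 * i] + old[2 * i + stride]
    return a[0]


if __name__ == "__main__":
    print(erew_sum(range(1, 9)))
```

In every step each active processor touches two cells that no other active processor touches, so there are no concurrent reads or writes.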
Algorithm
Let A and B be the given shared arrays of r and s elements respectively, sorted in nondecreasing order. It is required to merge them into a shared array C.
As presented by Akl[4], the algorithm is as follows:
Let P1, P2, ..., PN be the N processors available.
Step 1: The algorithm selects N-1 elements from array A for a shared array A'. This divides A into N approximately equal-size segments. A shared array B' of N-1 elements of B is chosen similarly.
For this step, each Pi inserts A[i * ⌈r/N⌉] and B[i * ⌈s/N⌉] in parallel into the ith location of A' and B' respectively.
Step 2: This step merges A' and B' into a shared array V of size 2N - 2. Each element v of V is a triple consisting of an element of A' or B', followed by its position in A' or B', followed by the name A or B.
For this step each Pi:
a. Using sequential BINARY SEARCH, each processor searches the array B' in parallel to find the smallest j such that A'[i] < B'[j]. If such j exists, then V[i + j - 1] is set to the triple (A'[i], i, A), otherwise V[i + N - 1] is set to the triple (A'[i], i, A).
b. Using sequential BINARY SEARCH, each processor searches the array A' to find the smallest j such that B'[i] < A'[j]. If such j exists, then V[i + j - 1] is set to the triple (B'[i], i, B), otherwise V[i + N - 1] is set to the triple (B'[i], i, B).
Step 3: To merge A and B into the shared array C, the indices of the two elements (one in A and one in B) at which each processor is to begin merging are computed in a shared array Q of ordered pairs. This step is executed as follows:
a. P1 sets Q[1] to (1,1).
b. Each Pi, i >= 2, checks whether V[2i - 2] is equal to (A'[k], k, A). If it is, then Pi searches B using BINARY SEARCH to find the smallest j such that B[j] > A'[k] and sets Q[i] to (k * ⌈r/N⌉, j); otherwise Pi searches A using BINARY SEARCH to find the smallest j such that A[j] > B'[k] and sets Q[i] to (j, k * ⌈s/N⌉).
Step 4: Each Pi, i < N, uses the sequential merge with Q[i] = (x, y) and Q[i+1] = (u, v) to merge the two subarrays A[x..u-1] and B[y..v-1], and places the result of the merge in array C starting at position x + y - 1.
Processor PN uses Q[N] = (w, z) to merge the two subarrays A[w..r] and B[z..s].

List ranking EREW algorithm
LIST-RANK(L) (in O(lg n) time)
for each processor i, in parallel
    do if next[i] = nil
        then d[i] <- 0
        else d[i] <- 1
while there exists an object i such that next[i] != nil
    do for each processor i, in parallel
        do if next[i] != nil
            then d[i] <- d[i] + d[next[i]]
                 next[i] <- next[next[i]]
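The pointer-jumping idea behind LIST-RANK can be simulated sequentially in Python. This is a sketch under the assumption that the list is given as an array of successor indices with `None` marking the tail; the snapshots model the synchronous PRAM step in which all reads precede all writes.

```python
def list_rank(next_index):
    """Distance of every object from the end of the list, by pointer jumping.

    next_index[i] is the successor of object i, or None at the tail.
    """
    n = len(next_index)
    nxt = list(next_index)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(p is not None for p in nxt):
        old_d, old_nxt = d[:], nxt[:]  # synchronous step: read old values
        for i in range(n):  # processor i
            if old_nxt[i] is not None:
                d[i] = old_d[i] + old_d[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
    return d


if __name__ == "__main__":
    # List order 3 -> 4 -> 2 -> 1 -> 0 -> 5 (illustrative data).
    print(list_rank([5, 0, 1, 4, 2, None]))
```

Each round doubles the distance a pointer spans, so the loop runs about lg n times for a list of n objects.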
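List-ranking EREW algorithm
[Figure: pointer jumping on a six-object list holding objects 3, 4, 6, 1, 0, 5 in list order. The d values are (a) 1 1 1 1 1 0 initially, (b) 2 2 2 2 1 0 after the first iteration, (c) 4 4 3 2 1 0 after the second, and (d) 5 4 3 2 1 0 after the third.]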
Applications of List Ranking
Expression Tree Evaluation
Parentheses Matching
Tree Traversals
Ear Decomposition of Graphs
Euler tour of trees



Graph coloring

Determining whether the vertices of a graph can be colored with c colors so that no two
adjacent vertices are assigned the same color is called the
graph coloring problem. To solve the problem quickly, we can create a
processor for every possible coloring of the graph (c^n colorings for n vertices); each processor then checks
whether the coloring it represents is valid.
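The per-processor check can be sketched in Python. The loop below enumerates the candidate colorings one after another, standing in for the processors that would each test a single coloring in parallel; the function names and the edge-list representation are assumptions of this example.

```python
from itertools import product


def is_valid_coloring(edges, coloring):
    """The test one processor performs: no edge joins two equal colors."""
    return all(coloring[u] != coloring[v] for u, v in edges)


def find_coloring(num_vertices, edges, c):
    """Try all c**num_vertices colorings; return a valid one or None."""
    for coloring in product(range(c), repeat=num_vertices):
        if is_valid_coloring(edges, coloring):
            return coloring
    return None


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(find_coloring(3, triangle, 2))
    print(find_coloring(3, triangle, 3))
```

The number of candidate colorings grows exponentially with the number of vertices, which is why the approach needs one processor per coloring to be fast.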




Linear Arrays and Rings
Linear Array
Asymmetric network
Degree d = 2
Diameter D = N-1
Bisection bandwidth: b = 1
Allows different sections of the channel to be used by different sources concurrently.
Ring
d = 2
D = N-1 for a unidirectional ring, or $D = \lfloor N/2 \rfloor$ for a bidirectional ring
[Figure: linear array; ring; ring arranged to use short wires]
Ring
Fully Connected Topology
Needs N(N-1)/2 links to connect N
processor nodes.
Example
N=16 -> 120 connections.
N=1,024 -> 523,776 connections
D=1
d=N-1

Chordal ring
Example
N=16, d=3 -> D=5


Multidimensional Meshes and Tori
Mesh
Popular topology, particularly for SIMD architectures, since it matches many data-parallel applications (e.g. image processing, weather forecasting).
Illiac IV, Goodyear MPP, CM-2, Intel Paragon
Asymmetric
d = 2k except at boundary nodes.
A k-dimensional mesh has $N = n^k$ nodes.
Torus
Mesh with looping connections at the boundaries to provide symmetry.

[Figure: 2D grid; 3D cube]
Trees
Diameter and average distance are logarithmic
k-ary tree, height $d = \log_k N$
An address is specified as a d-vector of radix-k coordinates describing the path down from the root
Fixed degree
Route up to the common ancestor and down
Bisection BW?
Trees (cont.)
Fat tree
The channel width increases as we go up
Solves bottleneck problem toward the root

Star
Two level tree with d=N-1, D=2
Centralized supervisor node
Hypercubes
Each PE is connected to $d = \log_2 N$ other PEs
Binary labels of neighbor PEs differ in only one bit
A d-dimensional hypercube can be partitioned into two (d-1)-dimensional
hypercubes
The distance between Pi and Pj in a hypercube: the number of bit positions in
which i and j differ (i.e. the Hamming distance)
Example:
10011 ⊕ 01001 = 11010
Distance between PE19 and PE9 is 3


[Figure: hypercubes of dimension 0-D through 5-D; the nodes of the 3-D cube are labeled 000 through 111.]
*From Parallel Computer Architectures; A Hardware/Software approach, D. E. Culler
Hypercube routing functions
Example
Consider 4D hypercube (n=4)
Source address s = 0110 and destination address
d = 1101
Direction bits r = 0110 ⊕ 1101 = 1011
1. Route from 0110 to 0111 because r = 1011
2. Route from 0111 to 0101 because r = 1011
3. Skip dimension 3 because r = 1011
4. Route from 0101 to 1101 because r = 1011
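The routing example can be reproduced with a short Python sketch of dimension-order routing. It corrects the differing address bits from the least significant dimension upward, which matches the order of the four steps above; the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which node labels a and b differ."""
    return bin(a ^ b).count("1")


def hypercube_route(src, dst, n_bits):
    """Nodes visited when correcting differing bits from dimension 1 upward."""
    direction_bits = src ^ dst
    path = [src]
    current = src
    for dim in range(n_bits):
        if (direction_bits >> dim) & 1:
            current ^= 1 << dim  # cross the link in this dimension
            path.append(current)
        # a 0 bit means this dimension is skipped
    return path


if __name__ == "__main__":
    route = hypercube_route(0b0110, 0b1101, 4)
    print([format(node, "04b") for node in route])
```

The number of hops equals the Hamming distance between source and destination, so no route in an n-dimensional hypercube needs more than n hops.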

k-ary n-cubes
Rings, meshes, tori and hypercubes are special cases
of a general topology called a k-ary n-cube
Has n dimensions with k nodes along each dimension
An n-processor ring is an n-ary 1-cube
An n x n mesh is an n-ary 2-cube (without end-around
connections)
An n-dimensional hypercube is a 2-ary n-cube

$N = k^n$
Routing distance is minimized for topologies with
higher dimension
Cost is lowest for lower dimension. Scalability is also
greatest and VLSI layout is easiest.

Cube-connected cycle
d = 3
$D = 2k - 1 + \lfloor k/2 \rfloor$
Example N=8
We can use the 2CCC network
Network properties
Node degree d - the number of edges incident on
a node.
In degree
Out degree
Diameter D of a network is the maximum
shortest path between any two nodes.
The network is symmetric if it looks the same
from any node.
The network is scalable if it is expandable with
scalable performance when the machine
resources are increased.
Bisection width
Bisection width is the minimum number of wires that must be cut to
divide the network into two equal halves.
Small bisection width -> low bandwidth
A large bisection width -> a lot of extra wires

A cut of a network C(N1,N2) is a set of channels that partition the set of all
nodes into two disjoint sets N1 and N2. Each element of C(N1,N2) is a
channel with a source in N1 and destination in N2 or vice versa.
A bisection of a network is a cut that partitions the entire network nearly
in half, such that $|N_2| \le |N_1| \le |N_2| + 1$. Here $|N_2|$ means the number of
nodes that belong to the partition $N_2$.
The channel bisection of a network is the minimum channel count over all
bisections of the network:

$B_c = \min_{\text{bisections}} |C(N_1, N_2)|$
Factors Affecting Performance
Functionality: how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence
Network latency: worst-case time for a unit message to be transferred
Bandwidth: maximum data rate
Hardware complexity: implementation costs for wire, logic, switches, connectors, etc.
2 × 2 Switches
*From Advanced Computer Architectures, K. Hwang, 1993.
Switches
Module size    Legitimate states    Permutation connections
2 × 2          4                    2
4 × 4          256                  24
8 × 8          16,777,216           40,320
N × N          N^N                  N!
Permutation function: each input can only be connected to a single output.
Legitimate state: each input can be connected to multiple outputs, but each output can only be connected to a single input.
Single-stage networks
Single stage Shuffle-Exchange IN (left)
Perfect shuffle mapping function (right)
Perfect shuffle operation: cyclic shift 1
place left, eg 101 --> 011
Exchange operation: invert least
significant bit, e.g. 101 --> 100

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003
Multistage Interconnection Networks
The capability of single-stage networks is limited, but if we cascade enough of
them together, they form a completely connected MIN (Multistage
Interconnection Network).
Switches can perform their own routing or can be controlled by a central router.
This type of network can be classified into the following categories:
Nonblocking
A network is called strictly nonblocking if it can connect any idle input to any idle output
regardless of what other connections are currently in process
Rearrangeable nonblocking
In this case a network should be able to establish all possible connections between
inputs and outputs by rearranging its existing connections.
Blocking interconnection
A network is said to be blocking if it can perform many, but not all, possible connections
between terminals.
Example: the Omega network
Omega networks
A multistage IN using 2 × 2 switch boxes and a perfect shuffle
interconnect pattern between the stages
In the Omega MIN there is one unique path from each input to each
output.
No redundant paths -> no fault tolerance and the possibility of blocking


Example:
Connect input 101 to output
001
Use the bits of the destination
address, 001, for dynamically
selecting a path
Routing:
- 0 means use upper output
- 1 means use lower output
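Destination-tag routing through an Omega network can be sketched in Python. The sketch assumes the usual arrangement of a perfect shuffle in front of every stage, and that destination bits are consumed from the most significant bit first; both function names are hypothetical.

```python
def perfect_shuffle(position, n_bits):
    """Cyclic left shift by one place of an n_bits-wide line number."""
    mask = (1 << n_bits) - 1
    return ((position << 1) | (position >> (n_bits - 1))) & mask


def omega_route(src, dst, n_bits):
    """Line numbers occupied after each stage when routing src to dst."""
    position = src
    path = [src]
    for stage in range(n_bits):
        position = perfect_shuffle(position, n_bits)
        bit = (dst >> (n_bits - 1 - stage)) & 1  # 0 = upper, 1 = lower output
        position = (position & ~1) | bit
        path.append(position)
    return path


if __name__ == "__main__":
    route = omega_route(0b101, 0b001, 3)
    print([format(line, "03b") for line in route])
```

After log2 N stages every bit of the line number has been replaced by a destination bit, so the message arrives at the destination whatever the source was.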

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003
Omega networks
$\log_2 N$ stages of 2 × 2 switches
N/2 switches per stage
$S = (N/2) \log_2 N$ switches
Number of permutations in an Omega network: $2^S$
Baseline networks
The network can be generated recursively
The first stage N × N, the second (N/2) × (N/2)
Networks are topologically equivalent if one network can be easily
reproduced from the other networks by simply rearranging nodes at each
stage.

*From Advanced Computer Architectures, K. Hwang, 1993.
Crossbar Network
Each junction is a switching component
connecting the row to the column.
Can only have one connection in each column

*From Advanced Computer Architectures, K. Hwang, 1993.
Crossbar Network
The major advantage of the cross-bar switch is its
potential for speed.
In one clock, a connection can be made between
source and destination.
The diameter of the cross-bar is one.
Blocking if the destination is in use
Because of its complexity, the cost of the cross-
bar switch can become the dominant factor for a
large multiprocessor system.
Crossbars can be used to implement the a × b
switches used in MINs. In this case each crossbar
is small so costs are kept down.


Performance Comparison
Network     Latency           Switching complexity   Wiring complexity   Blocking
Bus         Constant, O(N)    O(1)                   O(w)                yes
MIN         O(log2 N)         O(N log2 N)            O(N w log2 N)       yes
Crossbar    O(1)              O(N^2)                 O(N^2 w)            no
PCAM Algorithm Design
Partitioning
Computation and data are decomposed.
Communication
Coordinate task execution
Agglomeration
Combining of tasks for performance
Mapping
Assignment of tasks to processors

Partitioning
Ignore the number of processors and the
target architecture.
Expose opportunities for parallelism.
Divide up both the computation and data
Can take two approaches
domain decomposition
functional decomposition
Domain Decomposition
Start algorithm design by analyzing the data
Divide the data into small pieces
Approximately equal in size
Then partition the computation by associating
it with the data.
Communication issues may arise as one task
needs the data from another task.
Functional Decomposition
Focus on the computation
Divide the computation into disjoint tasks
Avoid data dependency among tasks
After dividing the computation, examine the
data requirements of each task.

Functional Decomposition
Not as natural as domain
decomposition
Consider search problems
Often functional
decomposition is very useful at
a higher level.
Climate modeling
Ocean simulation
Hydrology
Atmosphere, etc.
Communication
The information flow between tasks is
specified in this stage of the design

Remember:
Tasks execute concurrently.
Data dependencies may limit concurrency.

Communication
Define Channel
Link the producers with the consumers.
Consider the costs
Intellectual
Physical
Distribute the communication.
Specify the messages that are sent.

Communication Patterns
Local vs. Global
Structured vs. Unstructured
Static vs. Dynamic
Synchronous vs. Asynchronous
Local Communication
Communication within a neighborhood.
Algorithm choice determines communication.
Global Communication
Not localized.
Examples
All-to-All
Master-Worker

Structured Communication
Each task's communication resembles each
other task's communication
Is there a pattern?
Unstructured Communication
No regular pattern that can
be exploited.
Examples
Unstructured Grid
Resolution changes
Complicates the next stages of design
Synchronous Communication
Both consumers and producers are aware
when communication is required
Explicit and simple
[Figure: synchronous communication steps at t = 1, t = 2, t = 3]
Asynchronous Communication
Timing of send/receive is unknown.
No pattern

Consider: very large data structure
Distribute among computational tasks (polling)
Define a set of read/write tasks
Shared Memory

Agglomeration
Partition and Communication steps were
abstract
Agglomeration moves to concrete.
Combine tasks to execute efficiently on some
parallel computer.
Consider replication.
Mapping
Specify where each task is to operate.
Mapping may need to change depending on
the target architecture.
Mapping is NP-complete.

Mapping
Goal: Reduce Execution Time
Concurrent tasks ---> Different processors
High communication ---> Same processor
Mapping is a game of trade-offs.

Mapping
Many domain-decomposition problems make
mapping easy.
Grids
Arrays
etc.
Speedup in Simplest Terms
Speedup = Sequential Execution Time / Parallel Execution Time

Quinn's notation for speedup is ψ(n,p) for data size n and p processors.
Linear Speedup Usually Optimal
Speedup is linear if S(n) = O(n)
Theorem: The maximum possible speedup for parallel
computers with n PEs for traditional problems is n.
Proof:
Assume a computation is partitioned perfectly into n
processes of equal duration.
Assume no overhead is incurred as a result of this
partitioning of the computation (e.g., partitioning
process, information passing, coordination of processes,
etc),
Under these ideal conditions, the parallel computation will
execute n times faster than the sequential computation.
The parallel running time is $t_s/n$.
Then the parallel speedup of this computation is
$S(n) = t_s / (t_s/n) = n$
Linear Speedup Usually Optimal (cont)
We shall later see that this proof is not valid for
certain types of nontraditional problems.
Unfortunately, the best speedup possible for most
applications is much smaller than n
The optimal performance assumed in last proof is
unattainable.
Usually some parts of programs are sequential and allow
only one PE to be active.
Sometimes a large number of processors are idle for
certain portions of the program.
During parts of the execution, many PEs may be waiting
to receive or to send data.
E.g., recall blocking can occur in message passing

Superlinear Speedup
Superlinear speedup occurs when S(n) > n
Most texts besides Akl's and Quinn's argue that
Linear speedup is the maximum speedup obtainable.
The preceding proof is used to argue that
superlinearity is always impossible.
Occasionally speedup that appears to be superlinear may
occur, but can be explained by other reasons such as
the extra memory in parallel system.
a sub-optimal sequential algorithm used.
luck, in case of algorithm that has a random aspect in
its design (e.g., random selection)

Superlinearity (cont)
Selim Akl has given a multitude of examples that establish that superlinear
algorithms are required for many nonstandard problems
If a problem either cannot be solved or cannot be solved in the
required time without the use of parallel computation, it seems fair to
say that $t_s = \infty$.
Since for a fixed $t_p > 0$, $S(n) = t_s / t_p$ is greater than 1 for all sufficiently
large values of $t_s$, it seems reasonable to consider these solutions
to be superlinear.
Examples include nonstandard problems involving
Real-Time requirements where meeting deadlines is part of the
problem requirements.
Problems where all data is not initially available, but has to be
processed after it arrives.
Real life situations such as a person who can only keep a driveway
open during a severe snowstorm with the help of friends.
Some problems are natural to solve using parallelism and sequential
solutions are inefficient.
Superlinearity (cont)
The last chapter of Akl's textbook and several journal
papers by Akl were written to establish that
superlinearity can occur.
It may still be a long time before the possibility of
superlinearity occurring is fully accepted.
Superlinearity has long been a hotly debated topic and is
unlikely to be widely accepted quickly.
For more details on superlinearity, see [2] Parallel
Computation: Models and Methods, Selim Akl, pgs 14-20
(Speedup Folklore Theorem) and Chapter 12.
This material is covered in more detail in my PDA class.
Speedup Analysis
Recall speedup definition: $\psi(n,p) = t_s / t_p$

A bound on the maximum speedup is given by

$\psi(n,p) \le \dfrac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}$

Inherently sequential computations are $\sigma(n)$
Potentially parallel computations are $\varphi(n)$
Communication operations are $\kappa(n,p)$
The bound above is due to the assumption in the formula
that the speedup of the parallel portion of the computation
will be exactly p.
Note $\kappa(n,p) = 0$ for SIMDs, since communication steps are
usually included with computation steps.
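The bound can be explored numerically with a small Python sketch. The sequential time, parallelizable time and the linear communication model used below are made-up illustrative values, not data from the slides, and the function names are hypothetical.

```python
def speedup_bound(sigma, phi, kappa, p):
    """Upper bound on speedup: (sigma + phi) / (sigma + phi / p + kappa)."""
    return (sigma + phi) / (sigma + phi / p + kappa)


def best_processor_count(sigma, phi, kappa_of_p, max_p):
    """Processor count in 1..max_p that maximizes the speedup bound."""
    return max(
        range(1, max_p + 1),
        key=lambda p: speedup_bound(sigma, phi, kappa_of_p(p), p),
    )


if __name__ == "__main__":
    # Assumed costs: 10 units sequential, 1000 units parallelizable,
    # and communication that grows as 2 * p.
    best = best_processor_count(10, 1000, lambda p: 2 * p, 64)
    print(best, speedup_bound(10, 1000, 2 * best, best))
```

Because the computation term shrinks with p while the communication term grows, the bound peaks at an intermediate processor count and then falls.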
Execution time for parallel portion: $\varphi(n)/p$
Shows a nontrivial parallel algorithm's computation
component as a decreasing function of the
number of processors used.
[Figure: time versus processors, decreasing curve]
Time for communication: $\kappa(n,p)$
Shows a nontrivial parallel algorithm's
communication component as an increasing
function of the number of processors.
[Figure: time versus processors, increasing curve]
Execution Time of Parallel Portion: $\varphi(n)/p + \kappa(n,p)$
Combining these, we see for a fixed problem size,
there is an optimum number of processors that
minimizes overall execution time.
[Figure: time versus processors]
Speedup Plot
[Figure: speedup versus processors, with the curve "elbowing out"]
Cost
The cost of a parallel algorithm (or program) is
Cost = Parallel running time #processors
Since cost is a much overused word, the term
algorithm cost is sometimes used for clarity.
The cost of a parallel algorithm should be compared
to the running time of a sequential algorithm.
Cost removes the advantage of parallelism by
charging for each additional processor.
A parallel algorithm whose cost is big-oh of the
running time of an optimal sequential algorithm is
called cost-optimal.
Cost Optimal
From the last slide, a parallel algorithm is cost-optimal if
parallel cost = O(f(t)),
where f(t) is the running time of an optimal
sequential algorithm.
Equivalently, a parallel algorithm for a problem is said
to be cost-optimal if its cost is proportional to the
running time of an optimal sequential algorithm for
the same problem.
By proportional, we mean that
cost = $t_p \times n = k \times t_s$
where k is a constant and n is the number of processors.
In cases where no optimal sequential algorithm is
known, then the fastest known sequential
algorithm is sometimes used instead.
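A small sketch of how cost is compared against sequential time. The example is an assumption added for illustration: summing n numbers (n a power of two) with a binary tree on n/2 processors in log2(n) steps, against n - 1 additions sequentially. The function names are hypothetical.

```python
import math


def cost(parallel_time, processors):
    """Cost = parallel running time x number of processors."""
    return parallel_time * processors


def tree_sum_cost_ratio(n):
    """Cost of a tree-based parallel sum divided by sequential time.

    Assumes n is a power of two, n/2 processors, log2(n) parallel
    steps, and n - 1 additions for the sequential algorithm.
    """
    sequential_time = n - 1
    processors = n // 2
    parallel_time = math.log2(n)
    return cost(parallel_time, processors) / sequential_time


for n in (16, 256, 4096):
    print(n, round(tree_sum_cost_ratio(n), 2))
```

The ratio keeps growing with n (roughly like log2(n)/2), so under these assumptions the cost is not proportional to the sequential time and this version of the algorithm is not cost-optimal.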
Efficiency
$$\text{Efficiency} = \frac{\text{Sequential execution time}}{\text{Processors used} \times \text{Parallel execution time}} = \frac{\text{Speedup}}{\text{Processors used}}$$
Efficiency is denoted in Quinn by $\varepsilon(n,p)$ for a problem of size n on p processors.
$$\text{Efficiency} = \frac{\text{Sequential running time}}{\text{Processors} \times \text{Parallel running time}} = \frac{\text{Speedup}}{\text{Processors}} = \frac{\text{Sequential running time}}{\text{Cost}}$$
Bounds on Efficiency
Recall
(1) $\text{efficiency} = \dfrac{\text{speedup}}{\text{processors}} = \dfrac{\text{speedup}}{p}$

For algorithms for traditional problems, superlinearity is not
possible and
(2) $\text{speedup} \le \text{processors}$
Since speedup ≥ 0 and processors > 1, it follows from the
above two equations that
$0 \le \varepsilon(n,p) \le 1$
Algorithms for non-traditional problems also satisfy
$0 \le \varepsilon(n,p)$. However, for superlinear algorithms it follows that
$\varepsilon(n,p) > 1$ since speedup > p.
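A minimal sketch of the speedup and efficiency definitions. The timings are hypothetical numbers chosen only for illustration.

```python
def speedup(sequential_time, parallel_time):
    """Speedup = sequential time / parallel time."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, processors):
    """Efficiency = sequential time / (processors x parallel time)."""
    return sequential_time / (processors * parallel_time)


# Hypothetical timings, assumed only for illustration.
t_s, t_p, p = 100.0, 20.0, 8
print(speedup(t_s, t_p))         # 5.0
print(efficiency(t_s, t_p, p))   # 0.625
```

Here the speedup of 5 on 8 processors is below p, so the efficiency falls between 0 and 1 as expected for a traditional problem.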
Amdahl's Law
Let f be the fraction of operations in a
computation that must be performed
sequentially, where $0 \le f \le 1$. The maximum
speedup achievable by a parallel computer
with n processors is
$$S(n) \le \frac{1}{f + (1 - f)/n} \le \frac{1}{f}$$
The word "law" is often used by computer scientists when it is an observed
phenomenon (e.g., Moore's Law) and not a theorem that has been proven in a strict
sense.
However, Amdahl's law can be proved for traditional problems.
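The bound can be evaluated numerically. The sketch below uses a hypothetical sequential fraction f = 0.1 and shows the bound approaching, but never reaching, 1/f as processors are added.

```python
def amdahl_speedup(f, n):
    """Upper bound on speedup with sequential fraction f and n processors."""
    return 1.0 / (f + (1.0 - f) / n)


f = 0.1  # hypothetical sequential fraction, assumed for illustration
for n in (10, 100, 1000, 10000):
    print(n, round(amdahl_speedup(f, n), 2))
print("limit 1/f =", 1.0 / f)
```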
Proof for Traditional Problems: If the fraction of the
computation that cannot be divided into concurrent tasks is f,
and no overhead is incurred when the computation is divided into
concurrent parts, the time to perform the computation with n
processors is given by $t_p \ge f t_s + [(1 - f) t_s] / n$, as shown below:
Proof of Amdahl's Law (cont.)
Using the preceding expression for $t_p$,

$$S(n) = \frac{t_s}{t_p} \le \frac{t_s}{f t_s + \dfrac{(1 - f)\, t_s}{n}} = \frac{1}{f + \dfrac{1 - f}{n}}$$

The last expression is obtained by dividing numerator and
denominator by $t_s$, which establishes Amdahl's law.

Multiplying numerator and denominator by n produces the
following alternate version of this formula:

$$S(n) \le \frac{n}{nf + (1 - f)} = \frac{n}{1 + (n - 1) f}$$
Amdahl's Law
The preceding proof assumes that speedup cannot be
superlinear; i.e.,
$S(n) = t_s / t_p \le n$
This assumption is only valid for traditional problems.
Question: Where is this assumption used?
The pictorial portion of this argument is taken from
chapter 1 of Wilkinson and Allen.
Sometimes Amdahl's law is just stated as
$S(n) \le 1/f$
Note that S(n) never exceeds 1/f and approaches
1/f as n increases.
Consequences of Amdahl's Limitations to
Parallelism
For a long time, Amdahl's law was viewed as a fatal flaw to the
usefulness of parallelism.
Amdahl's law is valid for traditional problems and has several
useful interpretations.
Some textbooks show how Amdahl's law can be used to
increase the efficiency of parallel algorithms.
See Reference (16), Jordan & Alaghband textbook.
Amdahl's law shows that efforts required to further reduce
the fraction of the code that is sequential may pay off in large
performance gains.
Hardware that achieves even a small decrease in the percent
of things executed sequentially may be considerably more
efficient.
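To see how much a small reduction in the sequential fraction can matter, the sketch below evaluates the Amdahl bound for a few hypothetical values of f on an assumed machine with 1000 processors.

```python
def amdahl_speedup(f, n):
    """Amdahl bound: 1 / (f + (1 - f) / n)."""
    return 1.0 / (f + (1.0 - f) / n)


n = 1000  # assumed processor count, for illustration only
for f in (0.02, 0.01, 0.005):
    print(f, round(amdahl_speedup(f, n), 1))
```

Under these assumptions, halving the sequential fraction from 2% to 1% nearly doubles the bound on speedup.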
Limitations of Amdahl's Law
A key flaw in past arguments that Amdahl's law is
a fatal limit to the future of parallelism is
Gustafson's Law: The proportion of the computations
that are sequential normally decreases as the problem
size increases.
Note: Gustafson's law is an observed phenomenon and not a
theorem.
Other limitations in applying Amdahl's Law:
Its proof focuses on the steps in a particular algorithm,
and does not consider that other algorithms with more
parallelism may exist.
Amdahl's law applies only to standard problems where
superlinearity cannot occur.
Other Limitations of Amdahl's Law
Recall

$$\psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}$$

Amdahl's law ignores the communication cost
$\kappa(n,p)$ in MIMD systems.
This term does not occur in SIMD systems, as
communications routing steps are deterministic and
counted as part of computation cost.
On communications-intensive applications, even the
$\kappa(n,p)$ term does not capture the additional
communication slowdown due to network
congestion.
As a result, Amdahl's law usually overestimates the
speedup achievable.
Amdahl Effect
Typically the communication time $\kappa(n,p)$ has
lower complexity than $\varphi(n)/p$ (i.e., the time for the
parallel part).
As n increases, $\varphi(n)/p$ dominates $\kappa(n,p)$.
As n increases,
the sequential proportion of the algorithm decreases
the speedup increases
Amdahl Effect: Speedup is usually an
increasing function of the problem size.
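The effect can be reproduced with the speedup bound and some made-up cost models. In the sketch below, `sigma`, `phi`, and `kappa` are hypothetical functions chosen so that the communication term grows more slowly than the parallel work; they are not taken from Quinn or from these slides.

```python
import math


def speedup_bound(n, p, sigma, phi, kappa):
    """Bound: (sigma + phi) / (sigma + phi/p + kappa)."""
    return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p))


# Hypothetical cost models, assumed only for illustration.
def sigma(n):
    return n                   # assumed sequential work


def phi(n):
    return n * n               # assumed parallelizable work


def kappa(n, p):
    return n * math.log2(p)    # assumed communication time


p = 16
for n in (100, 1000, 10000):
    print(n, round(speedup_bound(n, p, sigma, phi, kappa), 2))
```

For a fixed p = 16, the bound rises toward 16 as n grows, which is the behavior the Amdahl effect describes.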
Illustration of Amdahl Effect
[Plot: speedup vs. processors, with one curve each for n = 100, n = 1,000, and n = 10,000]
The Isoefficiency Metric
(Terminology)
Parallel system: a parallel program executing
on a parallel computer
Scalability of a parallel system: a measure of
its ability to increase performance as the number
of processors increases
A scalable system maintains efficiency as
processors are added
Isoefficiency: a way to measure scalability
Notation Needed for the
Isoefficiency Relation
n: data size
p: number of processors
T(n,p): execution time, using p processors
$\psi(n,p)$: speedup
$\sigma(n)$: inherently sequential computations
$\varphi(n)$: potentially parallel computations
$\kappa(n,p)$: communication operations
$\varepsilon(n,p)$: efficiency

Note: At least in some printings, there appears to be a misprint on page 170
in Quinn's textbook, with $\varphi(n)$ sometimes printed using a different symbol. To
correct, simply replace each such symbol with $\varphi$.
Isoefficiency Concepts
$T_0(n,p)$ is the total time spent by processes
doing work not done by the sequential algorithm.
$T_0(n,p) = (p-1)\,\sigma(n) + p\,\kappa(n,p)$
We want the algorithm to maintain a constant
level of efficiency as the data size n increases.
Hence, $\varepsilon(n,p)$ is required to be a constant.
Recall that T(n,1) represents the sequential
execution time.
The Isoefficiency Relation
Suppose a parallel system exhibits efficiency $\varepsilon(n,p)$. Define

$$C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}$$
$$T_0(n,p) = (p-1)\,\sigma(n) + p\,\kappa(n,p)$$

In order to maintain the same level of efficiency as the number of
processors increases, n must be increased so that the following
inequality is satisfied.

$$T(n,1) \ge C\,T_0(n,p)$$
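A sketch of how the relation can be checked numerically for a given n and p. The cost models below are hypothetical (no sequential work, linear parallel work, logarithmic communication) and the function names are invented for this example.

```python
import math


def isoefficiency_constant(efficiency):
    """C = efficiency / (1 - efficiency)."""
    return efficiency / (1.0 - efficiency)


def total_overhead(n, p, sigma, kappa):
    """T0(n, p) = (p - 1) * sigma(n) + p * kappa(n, p)."""
    return (p - 1) * sigma(n) + p * kappa(n, p)


def maintains_efficiency(n, p, efficiency, sigma, phi, kappa):
    """Check T(n, 1) >= C * T0(n, p), with T(n, 1) = sigma(n) + phi(n)."""
    sequential_time = sigma(n) + phi(n)
    return sequential_time >= (isoefficiency_constant(efficiency)
                               * total_overhead(n, p, sigma, kappa))


# Hypothetical cost models, assumed only for illustration.
def sigma(n):
    return 0.0               # assumed: no inherently sequential work


def phi(n):
    return float(n)          # assumed: linear parallelizable work


def kappa(n, p):
    return math.log2(p)      # assumed: logarithmic communication time


print(maintains_efficiency(24, 8, 0.5, sigma, phi, kappa))
print(maintains_efficiency(23, 8, 0.5, sigma, phi, kappa))
```

With efficiency 0.5 the constant C is 1, and the overhead on 8 processors is 8 · 3 = 24, so under these assumptions n = 24 is the smallest problem size that satisfies the relation.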
Isoefficiency Relation Derivation
(See page 170-17 in Quinn)
MAIN STEPS:
Begin with speedup formula
Compute total amount of overhead
Assume efficiency remains constant
Determine relation between sequential
execution time and overhead
Deriving Isoefficiency Relation
(see Quinn, pgs 170-17)
Determine overhead:
$$T_0(n,p) = (p-1)\,\sigma(n) + p\,\kappa(n,p)$$
Substitute overhead into speedup equation:
$$\psi(n,p) \le \frac{p\,(\sigma(n) + \varphi(n))}{\sigma(n) + \varphi(n) + T_0(n,p)}$$
Substitute $T(n,1) = \sigma(n) + \varphi(n)$. Assume efficiency is constant.
$$T(n,1) \ge C\,T_0(n,p)$$
Isoefficiency Relation
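The algebra between the speedup inequality and the final relation can be filled in as follows. This is a sketch using only the definitions above, with $\varepsilon(n,p) = \psi(n,p)/p$ and assuming $0 < \varepsilon(n,p) < 1$.

```latex
\begin{align*}
\varepsilon(n,p) = \frac{\psi(n,p)}{p}
  &\le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n) + T_0(n,p)}
   = \frac{T(n,1)}{T(n,1) + T_0(n,p)} \\
\frac{T_0(n,p)}{T(n,1)} &\le \frac{1 - \varepsilon(n,p)}{\varepsilon(n,p)} \\
T(n,1) &\ge \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}\, T_0(n,p) = C\, T_0(n,p)
\end{align*}
```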
Isoefficiency Relation Usage
Used to determine the range of processors for which
a given level of efficiency can be maintained
The way to maintain a given efficiency is to increase
the problem size when the number of processors
increases.
The maximum problem size we can solve is limited by
the amount of memory available
The memory size is a constant multiple of the
number of processors for most parallel systems

The Scalability Function
Suppose the isoefficiency relation reduces to $n \ge f(p)$
Let M(n) denote the memory required for a
problem of size n
M(f(p))/p shows how memory usage per
processor must increase to maintain the same
efficiency
We call M(f(p))/p the scalability function [i.e.,
scale(p) = M(f(p))/p]

Meaning of Scalability Function
To maintain efficiency when increasing p, we must
increase n
Maximum problem size is limited by available
memory, which increases linearly with p
Scalability function shows how memory usage per
processor must grow to maintain efficiency
If the scalability function is a constant this means the
parallel system is perfectly scalable
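A small sketch of evaluating the scalability function. The isoefficiency function `f` and the memory function `memory` below are hypothetical examples chosen for illustration (n ≥ p log2 p with M(n) = n), not results taken from these slides.

```python
import math


def scalability(p, f, memory):
    """scale(p) = M(f(p)) / p: memory needed per processor."""
    return memory(f(p)) / p


# Hypothetical example: isoefficiency relation n >= p * log2(p)
# and memory requirement M(n) = n. Both are assumptions.
def f(p):
    return p * math.log2(p)


def memory(n):
    return n


for p in (2, 16, 256):
    print(p, scalability(p, f, memory))
```

Under these assumptions the memory needed per processor grows like log2(p), so the system is not perfectly scalable, but it grows slowly. If f(p) were a constant multiple of p instead, the same function would return a constant.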
Interpreting Scalability Function
[Plot: memory needed per processor vs. number of processors, with curves labeled C, C log p, Cp, and Cp log p, a line labeled "Memory Size", and regions labeled "Can maintain efficiency" and "Cannot maintain efficiency"]
CSE 160/Berman
Odd-Even Transposition Sort
Parallel version of bubblesort: many compare-exchanges
are done simultaneously
Algorithm consists of Odd Phases and Even
Phases
In even phase, even-numbered processes exchange
numbers (via messages) with their right neighbor
In odd phase, odd-numbered processes exchange
numbers (via messages) with their right neighbor
Algorithm alternates odd phase and even phase
for O(n) iterations
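The phases can be simulated sequentially to see the data movement. This is a sketch, not the message-passing version: each list position plays the role of one process, and the function name and the choice to start with an even phase are assumptions made for the example.

```python
def odd_even_transposition_sort(values):
    """Sequentially simulate odd-even transposition sort.

    Returns the sorted list and the list of states after each phase.
    """
    a = list(values)
    n = len(a)
    trace = [list(a)]
    for phase in range(n):
        # Assumption: the first phase is an even phase (pairs start at P0).
        start = phase % 2
        # Every pair (i, i+1) in a phase is independent, so on a parallel
        # machine all of these compare-exchanges happen simultaneously.
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        trace.append(list(a))
    return a, trace


result, trace = odd_even_transposition_sort([3, 10, 4, 8, 1])
for t, state in enumerate(trace):
    print(f"T={t}", state)
```

Running n phases is enough for any input of length n, which is why the loop bound is simply `n`.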

Odd-Even Transposition Sort
Data Movement
[Figure: general compare-exchange pattern for n=5 across processes P0 P1 P2 P3 P4 at steps T=1 through T=5]
Odd-Even Transposition Sort
Example
General Pattern for n=5
       P0  P1  P2  P3  P4
T=0     3  10   4   8   1
T=1     3  10   4   8   1
T=2     3   4  10   1   8
T=3     3   4   1  10   8
T=4     3   1   4   8  10
T=5     1   3   4   8  10
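Odd-Even Transposition Code
Compare-exchange accomplished through message passing

Even Phase

/* Left partner of each pair: its own value is B, it keeps the smaller one */
P_i = 0, 2, 4, ..., n-2
    recv(&A, P_i+1);
    send(&B, P_i+1);
    if (A<B) B=A;

/* Right partner of each pair: its own value is A, it keeps the larger one */
P_i = 1, 3, 5, ..., n-1
    send(&A, P_i-1);
    recv(&B, P_i-1);
    if (A<B) A=B;

Odd Phase

/* Left partner of each pair: keeps the smaller value */
P_i = 1, 3, 5, ..., n-3
    recv(&A, P_i+1);
    send(&B, P_i+1);
    if (A<B) B=A;

/* Right partner of each pair: keeps the larger value */
P_i = 2, 4, 6, ..., n-2
    send(&A, P_i-1);
    recv(&B, P_i-1);
    if (A<B) A=B;

[Figures: pairing of processes P0 P1 P2 P3 P4 in each phase]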
9/06/99
Sorting on the CRCW and
CREW PRAMs
Sort on the CRCW PRAM
Similar idea to the one used to design MIN-CRCW
Needs a more powerful (less realistic) model of resolving concurrent
writes, one that sums the values to be concurrently written
Uses n(n-1)/2 processors and runs in constant time
Drawback: the number of processors and the powerful model
Sort on the CREW PRAM
Uses an auxiliary two-dimensional array Win[1:n,1:n]
Separates the write and the sum, which take O(1) and O(log n) time, using $O(n^2)$
processors
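The CRCW idea can be simulated sequentially as an enumeration (rank) sort. The sketch below is an assumption about the details: one virtual processor per pair (i, j) with i < j adds 1 to the counter of the larger element, ties go to the later index, and the summed counter is the element's final position. The function name is hypothetical.

```python
def crcw_enumeration_sort(values):
    """Sequential simulation of a sum-CRCW enumeration sort."""
    n = len(values)
    rank = [0] * n
    # One virtual processor per pair (i, j), n(n-1)/2 in total.
    # On the summing CRCW model all increments to the same rank cell
    # are combined in a single concurrent write.
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] > values[j]:
                rank[i] += 1
            else:
                rank[j] += 1   # ties count against the later index
    # Each element is written to the position given by its rank.
    out = [None] * n
    for i, r in enumerate(rank):
        out[r] = values[i]
    return out


print(crcw_enumeration_sort([3, 10, 4, 8, 1]))
```

The nested loops take quadratic time here; the constant parallel time comes only from running every pair on its own processor and relying on the summing write.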
Odd-Even Merge Sort on
the EREW PRAM
Sorting on the EREW PRAM model: the preceding sorts use too many processors
Merge Sort on the EREW PRAM
Based on the idea of divide-and-conquer
Idea: divide a list into two sub-lists, recursively sort them, and
merge
Uses n processors with time complexity O(n)
S(n) = O(log n), C(n) = $O(n^2)$
Odd-Even Merge Sort on the EREW PRAM
Speeds up the merge process
Make an odd-index list and an even-index list, and merge them recursively
Finish with an even-odd compare-exchange step
Odd-Even Merge Sort on the
EREW PRAM: contd
Odd-Even Merge Sort on the EREW PRAM
Number of processors used: n
$W(n) = 1 + 2 + \cdots + \log n = O(\log^2 n)$
$C(n) = O(n \log^2 n)$
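A sequential sketch of the odd-even merge, written recursively. It assumes list lengths that are powers of two and uses 0-based slicing, so `a[0::2]` is the slide's odd-index list in 1-based terms. The function names are hypothetical.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the two interleaved sub-lists.
    c = odd_even_merge(a[0::2], b[0::2])
    d = odd_even_merge(a[1::2], b[1::2])
    # Final compare-exchange step between neighbors d[i] and c[i + 1].
    out = [c[0]]
    for i in range(len(d) - 1):
        out.append(min(d[i], c[i + 1]))
        out.append(max(d[i], c[i + 1]))
    out.append(d[-1])
    return out


def odd_even_merge_sort(values):
    """Sort a list whose length is a power of two."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return odd_even_merge(odd_even_merge_sort(values[:mid]),
                          odd_even_merge_sort(values[mid:]))


print(odd_even_merge_sort([3, 10, 4, 8, 1, 7, 2, 9]))
```

The two recursive merges are independent of each other, and so are the compare-exchanges in the final loop, which is what the parallel version exploits.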
Sorting on the One-dimensional Mesh
Any comparison-based parallel sorting algorithm on Mp
must perform at least n-1 communication steps to
properly decide the relative order of the
elements in P1 and Pn.
Can achieve a speedup of at most log n on the
one-dimensional mesh
Two algorithms
Insertion Sort
Odd-Even Transposition Sort
Sorting on the Two-dimensional Mesh
Order in the two-dimensional mesh
Row-major and column-major orders
Snake order
Snake-order sorting
Repetition of row sort and column sort
In the row sort, the direction alternates from row to row, following the snake order
$W(n) = (\lceil \log n \rceil + 1)\sqrt{n} \approx \sqrt{n}\,\log n$
$C(n) = n\,W(n)$
$S(n) \approx \sqrt{n}$
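A sequential sketch of snake-order sorting by alternating row and column sorts (often called shearsort). The function names are hypothetical, and the phase count used here, ceil(log2(rows)) + 1 row sorts with a column sort between consecutive row sorts, is the usual bound in terms of the number of rows rather than a figure taken from the slide.

```python
import math


def shearsort(grid):
    """Sort a rows x cols grid into snake order; returns a new grid."""
    g = [list(row) for row in grid]
    rows = len(g)
    cols = len(g[0])
    phases = math.ceil(math.log2(rows)) + 1
    for phase in range(phases):
        # Row sort: even rows ascending, odd rows descending (snake order).
        for r in range(rows):
            g[r].sort(reverse=(r % 2 == 1))
        if phase == phases - 1:
            break
        # Column sort: every column ascending from top to bottom.
        for c in range(cols):
            column = sorted(g[r][c] for r in range(rows))
            for r in range(rows):
                g[r][c] = column[r]
    return g


def snake_order(grid):
    """Read a grid in snake order: left-to-right, then right-to-left."""
    out = []
    for r, row in enumerate(grid):
        out.extend(row if r % 2 == 0 else row[::-1])
    return out


grid = [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]]
print(snake_order(shearsort(grid)))
```

On a mesh, each row or column sort would itself be a parallel one-dimensional sort such as odd-even transposition; `list.sort` stands in for that here.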
Bitonic MergeSort
EREW PRAM
What is a Bitonic List?
A sequence of numbers, $x_1, \ldots, x_n$, with the property that
(1) there exists an index i such that $x_1 < x_2 < \cdots < x_i$ and $x_i > x_{i+1} > \cdots > x_n$, or else
(2) there exists a cyclic shift of indices so that condition (1) holds
What is rearrangement?
For $X = (x_1, \ldots, x_n)$, $X' = (x'_1, \ldots, x'_n)$ is defined by
$x'_i = \min(x_i, x_{i+n/2})$, $x'_{i+n/2} = \max(x_i, x_{i+n/2})$, for $1 \le i \le n/2$
Property
Let A and B be the sub-lists of X after the rearrangement
Then A and B are bitonic lists,
and every element of B is at least as large as every element of A.
Bitonic MergeSort EREW
PRAM: contd
Bitonic Sort
Input: a bitonic list
Output: a sorted list
Algorithm:
Recursive rearrangement
Bitonic Merge
Merges two increasing-order lists into one sorted list
Change the index for rearrangement: (i + n/2) => (n-i+1)
The results are two bitonic sub-lists
Call Bitonic Sort on each of the two sub-lists
Bitonic MergeSort Algorithm
Input: a random list
Output: a sorted list
Recursive calls of Bitonic MergeSort followed by a call of
Bitonic Merge
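The three routines can be sketched sequentially as below, assuming list lengths that are powers of two. One simplification is an assumption of this sketch: instead of changing the comparison index to (n-i+1) in the merge, the second list is reversed, which pairs up the same elements and turns the concatenation into a bitonic list. The function names are hypothetical.

```python
def rearrange(x):
    """Split a bitonic list into a low half and a high half."""
    half = len(x) // 2
    low = [min(x[i], x[i + half]) for i in range(half)]
    high = [max(x[i], x[i + half]) for i in range(half)]
    return low, high


def bitonic_sort(x):
    """Sort a bitonic list by recursive rearrangement."""
    if len(x) <= 1:
        return list(x)
    low, high = rearrange(x)
    return bitonic_sort(low) + bitonic_sort(high)


def bitonic_merge(a, b):
    """Merge two increasing lists of equal length into one sorted list."""
    # Reversing b makes a + reversed(b) bitonic (assumption noted above).
    return bitonic_sort(a + b[::-1])


def bitonic_merge_sort(x):
    """Sort an arbitrary list whose length is a power of two."""
    if len(x) <= 1:
        return list(x)
    mid = len(x) // 2
    return bitonic_merge(bitonic_merge_sort(x[:mid]),
                         bitonic_merge_sort(x[mid:]))


print(bitonic_sort([1, 4, 6, 8, 7, 5, 3, 2]))
print(bitonic_merge_sort([3, 10, 4, 8, 1, 7, 2, 9]))
```

All min/max pairs inside one call to `rearrange` are independent, so each rearrangement is a single parallel step with one processor per pair.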

Bitonic MergeSort EREW
PRAM: contd
Bitonic MergeSort Complexity
$W(n) = O(\log^2 n)$
$C(n) = O(n \log^2 n)$
$S(n) = O(n / \log n)$

Bitonic Merge Sorting Network
See the figure