
Chapter 2

Conditions of parallelism
In order to move parallel processing into the
mainstream of computing, H. T. Kung has
identified the need for significant progress
in three key areas:
1. Computation models for parallel computing
2. Interprocessor communication in parallel
architectures.
3. System integration for incorporating parallel
systems into general computing environments.

Conditions of parallelism
Data and resource dependences:
The ability to execute several program segments in
parallel requires each segment to be independent of
the other segments.
Dependence graphs are used to describe these
relationships.
The nodes of the graph correspond to program
statements (instructions), and the directed edges with
different labels show the relations among the
statements.
Analysis of the dependence graph shows where
opportunities for parallelization exist.

Data dependence
1. Flow dependence: a statement S2 is flow-dependent on
statement S1 if an execution path exists from S1 to S2 and if
at least one output of S1 feeds in as input to S2.
2. Antidependence: statement S2 is antidependent on
statement S1 if S2 follows S1 in program order and if the
output of S2 overlaps the input to S1.
3. Output dependence: two statements are output-dependent
if they produce (write) the same output variable.
4. I/O dependence: read and write are I/O statements. I/O
dependence occurs not because the same variable is
involved but because the same file is referenced by both I/O
statements.

5. Unknown dependence: the dependence relation
between two statements cannot be determined in the
following situations:
A. The subscript of a variable is itself subscripted
(indirect addressing).
B. The subscript does not contain the loop index
variable.
C. A variable appears more than once with subscripts
having different coefficients of the loop variable.
D. The subscript is nonlinear in the loop index variable.
Example
S1 load r1,a
S2 add r2,r1
S3 move r1,r3
S4 store b, r1
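The dependences in this example can be derived mechanically from each statement's read and write sets. Below is a minimal sketch (not part of the original slides); the sets chosen for each instruction are assumptions based on the usual load/add/move/store semantics, and the pairwise check is conservative, so it also flags S1-S4 through r1 even though r1 is redefined by S3 in between.

# Read/write sets for: S1 load r1,a   S2 add r2,r1   S3 move r1,r3   S4 store b,r1
stmts = {
    "S1": ({"a"}, {"r1"}),            # (reads, writes)
    "S2": ({"r1", "r2"}, {"r2"}),
    "S3": ({"r3"}, {"r1"}),
    "S4": ({"r1"}, {"b"}),
}

order = ["S1", "S2", "S3", "S4"]
for i, si in enumerate(order):
    for sj in order[i + 1:]:
        ri, wi = stmts[si]
        rj, wj = stmts[sj]
        if wi & rj:
            print(sj, "is flow-dependent on", si, "via", wi & rj)
        if ri & wj:
            print(sj, "is antidependent on", si, "via", ri & wj)
        if wi & wj:
            print(si, "and", sj, "are output-dependent via", wi & wj)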

S1 read (4), A(i)    read array A from tape unit 4
S2 rewind (4)
S3 write (4), B(i)   write array B into tape unit 4
S4 rewind (4)
Here S1 and S3 are I/O dependent because both reference the same file on tape unit 4, even though different arrays are involved.

Control dependence: this refers to the situation
where the order of execution of statements
cannot be determined before run time, e.g. the
statements following an if condition. Control
dependence often prohibits parallelism.
Resource dependence: deals with conflicts in
using shared resources.
When the conflicting resource is an ALU, it is
called ALU dependence.
If it is a memory (storage) location, it is called
storage dependence.
Bernstein's Conditions
In 1966, Bernstein derived a set of conditions
under which two processes can execute in
parallel.
Input set: all input variables needed to execute
the process.
Output set: all output variables generated
after the execution.

Consider two processes P1 and P2 having I1 and I2 as
input sets and O1 and O2 as output sets.
These two processes can execute in parallel
if they satisfy the following conditions:
I1 ∩ O2 = ∅ (anti-independent)
I2 ∩ O1 = ∅ (flow-independent)
O1 ∩ O2 = ∅ (output-independent)
The input set is also called the read set or the
domain of the process.
The output set is also called the write set or
the range of the process.
P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Only five pairs can execute in parallel:
P1-P5, P2-P3, P2-P5, P3-P5, P4-P5.
The parallelism relation is commutative: P1 || P2
implies P2 || P1.
But it is not transitive: P1 || P2 and P2 || P3 do not
necessarily guarantee P1 || P3.
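The five parallel pairs can be verified mechanically from the read and write sets of the statements. Below is a minimal sketch (not part of the original slides) that applies Bernstein's three conditions to every pair of P1..P5; the variable sets are taken directly from the statements above.

from itertools import combinations

procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E   (input set, output set)
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def bernstein(p, q):
    # True when Ip ∩ Oq = Iq ∩ Op = Op ∩ Oq = ∅
    (ip, op), (iq, oq) = procs[p], procs[q]
    return not (ip & oq) and not (iq & op) and not (op & oq)

parallel = [(p, q) for p, q in combinations(sorted(procs), 2) if bernstein(p, q)]
print(parallel)   # [('P1','P5'), ('P2','P3'), ('P2','P5'), ('P3','P5'), ('P4','P5')]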
Hardware and software parallelism
For the implementation of parallelism, we need
special hardware and software support, and there is
often a mismatch problem between the two.
Hardware parallelism
This refers to the type of parallelism defined
by the machine architecture and hardware
multiplicity.
Hardware parallelism is often a function of
cost and performance tradeoffs.
It displays the resource utilization patterns of
simultaneously executable operations.
One way to characterize the parallelism in a processor is by
the number of instruction issues per machine cycle. If a
processor issues k instructions per machine cycle, then it is
called a k-issue processor.
A conventional processor takes one or more machine cycles
to issue a single instruction. Such processors are
called one-issue machines.
For example, the Intel i960CA is a three-issue processor: one
arithmetic, one memory-access and one branch instruction
can be issued per cycle.
A multiprocessor system built with n k-issue processors
should be able to handle a maximum of nk threads
of instructions simultaneously.
Software parallelism
This type of parallelism is defined by the
control and data dependence of programs.

Mismatch Example
Software parallelism: there are eight instructions,
four load and four arithmetic instructions, which could
complete in three cycles if resources were unlimited.
The software parallelism is therefore 8/3 = 2.67
instructions per cycle.
Hardware parallelism: a two-issue processor can
execute only one load (memory access) and one
arithmetic operation simultaneously, so the eight
instructions need seven cycles.
The hardware parallelism is 8/7 = 1.14 instructions per
cycle.
With a dual-processor system, the ratio becomes 8/6 = 1.33.
Of the many types of software parallelism,
the two most important are control parallelism
and data parallelism.
Control parallelism allows two or more
operations to be performed in parallel, e.g.
pipelined operations.
Data parallelism is where nearly the same
operation is performed over many data
elements by many processors in parallel.

To solve the mismatch problem
To solve the problem of hardware and software
mismatch, one approach is to develop
compilation support.
The other is hardware redesign for more
efficient exploitation by an intelligent compiler.
One must design the compiler and the hardware
jointly; interaction between the two can lead to a
better solution to the mismatch problem.
Hardware and software design tradeoffs also
exist in terms of cost, complexity, expandability,
compatibility and performance.
Program partitioning and
scheduling
Grain size or granularity is a measure of the
amount of computation involved in a software
process. The simplest measure is to count the
number of instructions in a grain (program
segment).
Grain size determines the basic program segment
chosen for parallel processing.
Grain sizes are commonly described as fine,
medium or coarse, depending on the processing
level involved.
Latency is a time measure of the
communication overhead incurred between
machine subsystems.
For example, memory latency is the time required
by a processor to access memory.
Synchronization latency is the time required
for two processors to synchronize with each
other.
Levels of parallelism
1. Instruction level: a typical grain contains
fewer than 20 instructions, called fine grain. It is
easy to detect parallelism at this level.
2. Loop level: here the grain size is less than
500 instructions.
3. Procedure level: this corresponds to medium
grain and contains less than 2000 instructions.
Detection of parallelism is more difficult at this
level than at the fine-grain level.
4. Subprogram level: the number of
instructions ranges into the thousands, forming
coarse grain.
5. Job level: this corresponds to the parallel
execution of essentially independent jobs
(programs). The grain size can be as high as
tens of thousands of instructions in a single
program.
Program Flow mechanisms
Conventional computers are based on a control
flow mechanism by which the order of program
execution is explicitly stated in the user programs.
Data flow computers are based on a data-driven
mechanism which allows the execution of any
instruction to be driven by data availability.
Reduction computers are based on a demand-driven
mechanism which initiates an operation
based on a demand for its results by other
computations.
Control flow computers
Conventional von Neumann computers use a
program counter to sequence the execution of
instructions in a program.
Control flow computers use shared memory to
hold program instructions and data.
In data flow computers, instructions are
executed as soon as their operands are available;
the data goes directly to the instructions that need it.
Computational results (data tokens) are passed
directly between instructions.
The data generated by an instruction will be
duplicated into many copies and forwarded
directly to all needy instructions. Data tokens,
once consumed by an instruction, will no
longer be available for reuse.
This scheme requires no shared memory and no
program counter. It only requires special
mechanisms to detect data availability and to
match data tokens with needy instructions.
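The firing rule described above can be illustrated with a short sketch. This is a minimal simulation (an assumption for illustration, not the MIT design): an instruction fires as soon as all of its input tokens are present, and its result token is made available to every consumer. Token consumption and tagging are not modelled.

# Program for x = (a + b) * (a - b), as (operation, input tokens, output token)
instrs = [
    ("add", ("a", "b"), "t1"),
    ("sub", ("a", "b"), "t2"),
    ("mul", ("t1", "t2"), "x"),
]
ops = {"add": lambda p, q: p + q, "sub": lambda p, q: p - q, "mul": lambda p, q: p * q}

tokens = {"a": 6, "b": 2}                 # initial data tokens
pending = list(instrs)
while pending:
    # fire every instruction whose operands are all available in this step
    ready = [i for i in pending if all(s in tokens for s in i[1])]
    for op, srcs, dst in ready:
        tokens[dst] = ops[op](*(tokens[s] for s in srcs))
        pending.remove((op, srcs, dst))
    print("fired:", [op for op, _, _ in ready])   # add and sub fire together

print(tokens["x"])                        # 32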
A data flow architecture
Arvind and his associates at MIT have developed a
tagged-token architecture for building data flow
computers.
The global architecture consists of n processing
elements (PEs) interconnected by an n x n routing
network.
Within each PE, the machine provides a token-matching
mechanism which dispatches only those
instructions whose input data are already
available.
Instructions are stored in the program memory.
Each datum is tagged with the address of the
instruction to which it belongs.
Tagged tokens enter the PE through the local path.
It is the machine's job to match up data with
the same tag and deliver them to needy instructions.
Each instruction represents a synchronization
operation.

Another synchronization mechanism called the
I-structure is also provided within each PE. The
I-structure is a tagged memory unit for
overlapped usage of a data structure by both the
producer and the consumer processes. Each word
of the I-structure uses a 2-bit tag indicating
whether the word is empty, is full, or has
pending read requests.
Demand-driven mechanisms
The computation is triggered by the demand for
an operation's result.
E.g. a = ((b + 1) * c) - (d / e).
A data-driven computation takes a bottom-up
approach, starting from the innermost
operations.
Such computations are also called eager
evaluation, because the operations are carried out
immediately after all their operands become
available.
A demand-driven computation takes a top-down
approach.
Here the operations start only when the
result a is demanded.
They are also called lazy evaluation, because
the operations are executed only when their
results are required by another instruction.
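The two evaluation orders for a = ((b + 1) * c) - (d / e) can be contrasted with a short sketch (not part of the original slides). Lazy, demand-driven evaluation is modelled with zero-argument lambdas that run only when a_lazy() is actually demanded; the input values are arbitrary assumptions.

b, c, d, e = 3, 5, 8, 2

# Eager / data-driven: every subexpression is computed as soon as its
# operands exist, starting from the innermost operations.
t1 = b + 1
t2 = t1 * c
t3 = d / e
a_eager = t2 - t3

# Lazy / demand-driven: nothing runs until a_lazy() is demanded, and the
# demand propagates top-down to the subexpressions.
t1l = lambda: b + 1
t2l = lambda: t1l() * c
t3l = lambda: d / e
a_lazy = lambda: t2l() - t3l()

print(a_eager, a_lazy())   # both print 16.0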
System interconnect network
Static and dynamic networks are used for
interconnecting computer subsystems or for
constructing multiprocessors/multicomputers.
Static: formed of point-to-point direct
connections which will not change during
program execution.
Dynamic: implemented with switched
channels which are dynamically configured to
match the communication demand.
Parameters
Node degree: the number of edges incident on a
node; the sum of the in-degree and the out-degree.
It reflects the number of I/O ports required per
node, and thus the cost of the node.
Diameter D: the maximum, over all pairs of nodes,
of the shortest path between them. The path length
is measured by the number of links traversed.
Network size: the total number of nodes.
Bisection width b: the minimum number of edges
along a cut that divides the network into two equal
halves. In a communication network, each edge
corresponds to a channel with w bit wires, so the
wire bisection width is B = bw. B reflects the wiring
density of the network.
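These definitions can be made concrete with a small sketch (not part of the original slides) that computes the node degree and diameter of a 16-node ring directly from its adjacency lists, using breadth-first search from every node; the diameter is the largest shortest-path distance found.

from collections import deque

N = 16
adj = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}   # ring links

def eccentricity(src):
    # Distance from src to the farthest node, by breadth-first search.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

degree = len(adj[0])                                # 2 for a ring
diameter = max(eccentricity(i) for i in range(N))   # N/2 = 8
print(degree, diameter)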
Data routing functions
A data routing network is used for inter-PE data
exchange. The routing network can be static, such as
the hypercube routing network used in the TMC/CM-2,
or dynamic, such as the multistage network used in
the IBM GF11.
Commonly seen data routing functions among
PEs include shifting, rotation, permutation,
broadcast, multicast, personalized
communication, shuffle, etc. These routing
functions can be implemented on ring, mesh,
hypercube and other topologies.
Permutations: for n objects there are n!
permutations by which the n objects can be
reordered. The set of all permutations forms a
permutation group.
E.g. the permutation (a, b, c)(d, e), a bijective
mapping.
In circular fashion: a→b, b→c, c→a, d→e, e→d.
The cycle (a, b, c) has a period of 3 and (d, e) has a
period of 2, so the period of the whole permutation
is 2 * 3 = 6.
Permutations can be implemented by a crossbar
switch or a multistage network.
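A minimal sketch (not part of the original slides) of the permutation (a, b, c)(d, e): the period is found by composing the mapping with itself until every element returns to its starting position, and it equals the least common multiple of the cycle lengths.

from math import lcm

perm = {"a": "b", "b": "c", "c": "a", "d": "e", "e": "d"}

def period(p):
    n, q = 1, dict(p)
    while any(q[x] != x for x in p):   # keep composing p with itself
        q = {x: p[q[x]] for x in p}
        n += 1
    return n

print(period(perm))   # 6
print(lcm(3, 2))      # the same value, from the cycle lengths 3 and 2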
Perfect Shuffle and Exchange
The perfect shuffle maps a node address by shifting
its bits one position to the left and wrapping the
most significant bit around to the least significant
position (a circular left shift).
Inverse perfect shuffle: a circular right shift.
Hypercube routing functions: for a 3-cube, three
routing functions are defined, one per dimension;
routing function C_i connects each node to the node
whose address differs in bit i.
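These routing functions on 3-bit node addresses (eight nodes) can be written out in a few lines; a minimal sketch (not part of the original slides):

n = 3                       # address width, 2**n = 8 nodes

def shuffle(x):
    # Perfect shuffle: circular left shift, MSB wraps to the LSB position.
    return ((x << 1) | (x >> (n - 1))) & (2**n - 1)

def inverse_shuffle(x):
    # Inverse perfect shuffle: circular right shift.
    return (x >> 1) | ((x & 1) << (n - 1))

def cube(i, x):
    # Hypercube routing function C_i: complement bit i of the address.
    return x ^ (1 << i)

print([shuffle(x) for x in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
print([cube(2, x) for x in range(8)])   # [4, 5, 6, 7, 0, 1, 2, 3]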
Network performance parameters
1. Functionality: this refers to how the network
supports data routing, interrupt handling,
synchronization etc.
2. Network latency: this refers to the worst-case time
delay for a unit message to be transferred through the
network.
3. Bandwidth: refers to the maximum data transfer rate,
in Mbytes/s, through the network.
4. Hardware Complexity: refers to the implementation
costs such as those of wires, switches etc.
5. Scalability: refers to the ability of a network to be
expandable with increasing machine resources.
Static Connection Network
Static networks use direct links which are fixed
once built. This type of network is more
suitable for computers where the
communication pattern is fixed or predictable.
Linear Array
A 1-D network.
N nodes are connected by N - 1 links.
Internal nodes have degree 2; end nodes have
degree 1.
The diameter is N - 1.
The bisection width is b = 1.
Ring and Chordal ring

N nodes are connected by N links.
All nodes have degree 2.
The diameter is floor(N/2).
The bisection width is b = 2.
By increasing the node degree from 2 to 3 or 4,
we get a chordal ring. In general, the more links
added, the higher the node degree and the shorter
the network diameter.

Barrel shifter: N = 16 nodes. Network size
N = 2^n, node degree d = 2n - 1, diameter D = n/2.
Node i is connected to node j if j = (i ± 2^r) mod N
for some r = 0, 1, ..., n - 1.
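A minimal sketch (not part of the original slides) that generates the barrel shifter links for N = 16 and confirms the node degree of 2n - 1 (the +2^(n-1) and -2^(n-1) neighbours coincide):

N, n = 16, 4
links = set()
for i in range(N):
    for r in range(n):
        links.add(frozenset({i, (i + 2**r) % N}))   # j = i + 2^r mod N
        links.add(frozenset({i, (i - 2**r) % N}))   # j = i - 2^r mod N

degree = sum(1 for link in links if 0 in link)
print(degree)   # 2n - 1 = 7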
Tree
A k-level completely balanced binary tree
has N = (2^k) - 1 nodes.
E.g. a 5-level tree has 31 nodes.
The maximum node degree is 3.
The diameter is 2(k - 1).
Star: a 2-level tree with a high node degree of
d = N - 1 and a constant diameter of 2.

Fat tree
The channel width of a fat tree increases as we
ascend from the leaves to the root.
The conventional tree suffers from a bottleneck
towards the root, since the traffic towards the
root becomes heavier; the fat tree relieves this
bottleneck. This idea has been applied in the
Connection Machine CM-5.
Mesh and torus
A 3 * 3 mesh network is shown. Meshes have been
implemented in the Illiac IV, MPP, DAP and CM-2.
A k-dimensional mesh has N = n^k nodes.
The interior node degree is 2k (4 in the 2-D case).
The network diameter is k(n - 1).
The node degrees at the boundary and corner
nodes are 3 and 2 respectively.
The Illiac mesh and the torus are variations of the mesh.
Illiac mesh:
Node degree 4.
Diameter: n - 1 (3 - 1 = 2 for the 3 * 3 example).
Links = 2N (2 * 9 = 18).
b = 2n (2 * 3 = 6).
Torus:
Node degree = 4.
Diameter = 2 * floor(n/2) (= 2 for n = 3).
Links = 2N (= 18).
b = 2n (= 6).

K-ary n-cube:
n is the number of dimensions and k is the radix,
i.e. the number of nodes along each dimension, so N = k^n.
Node degree = 2n (= 6 for n = 3).
Diameter = n * floor(k/2).
Links = nN.
Bisection width = 2k^(n-1).
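The parameter formulas quoted above for the mesh, torus and k-ary n-cube can be tabulated with a small sketch (not part of the original slides); the numeric comments use the 3 * 3 and 3-ary 3-cube examples from the text.

def mesh(k, n):        # k-dimensional mesh with n nodes per side, N = n**k
    N = n**k
    return {"N": N, "interior degree": 2 * k, "diameter": k * (n - 1)}

def torus(n):          # n x n 2-D torus
    return {"N": n * n, "degree": 4, "diameter": 2 * (n // 2),
            "links": 2 * n * n, "bisection": 2 * n}

def kary_ncube(k, n):  # k nodes along each of n dimensions
    N = k**n
    return {"N": N, "degree": 2 * n, "diameter": n * (k // 2),
            "links": n * N, "bisection": 2 * k**(n - 1)}

print(mesh(2, 3))        # the 3 * 3 mesh: degree 4, diameter 4
print(torus(3))          # degree 4, diameter 2, 18 links, bisection 6
print(kary_ncube(3, 3))  # degree 6, diameter 3, bisection 18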
Dynamic connection networks
Dynamic connections can implement all
communication patterns based on program
demands.
Instead of fixed connections, switches are
used.
In increasing order of cost and performance,
dynamic interconnection networks include
bus systems, multistage interconnection
networks (MINs) and crossbar switch networks.
Bus system
A bus system is a collection of wires and
connectors for data transactions among the
processors, memory modules and peripheral
devices attached to the bus.
The bus is used for only one transaction at a
time. In case of multiple requests, the
arbitration logic must allocate and deallocate
the bus, servicing the requests one at a time.

For this reason the bus has been called a
contention bus or a time-sharing bus among
multiple functional modules.
A bus system has a lower cost but provides
limited bandwidth compared to the other
two dynamic connection networks.
Switch modules
An a * b switch module has a inputs and b
outputs. A binary switch has a = b = 2.
In a switch, one-to-one and one-to-many
mappings are allowed, but many-to-one
mappings are not allowed, due to conflicts at
the output terminal.

MIN
MINs have been used in both MIMD and
SIMD computers.
A number of a * b switches are used in each
stage, with fixed interstage connections (ISCs)
between the switches.
The ISC patterns used include the perfect shuffle,
butterfly, cube connections, etc.
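As an illustration (an assumption, not a network named in the slides), consider an eight-input Omega network, a common MIN built from 2 * 2 switches with perfect-shuffle interstage connections. Destination-tag routing sets each switch straight or crossed according to one bit of the destination address, most significant bit first; a minimal sketch:

n = 3                                     # log2(ports), so 3 switch stages

def shuffle(x):
    return ((x << 1) | (x >> (n - 1))) & (2**n - 1)

def route(src, dst):
    # Return the port reached after each stage when routing src -> dst.
    x, path = src, []
    for stage in range(n):
        x = shuffle(x)                    # interstage perfect shuffle
        bit = (dst >> (n - 1 - stage)) & 1
        x = (x & ~1) | bit                # switch output picked by dst bit
        path.append(x)
    return path

print(route(0b010, 0b110))                # [5, 3, 6]; ends at port 6 = 0b110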
Crossbar network
The highest bandwidth and interconnection
capability are provided by crossbar networks.
Each crosspoint switch can provide a
dedicated connection path between a pair of nodes.
Table 2.4 gives a summary of dynamic network
characteristics.
