3.1 P-M-S Notation
We utilize the "P-M-S" notation [79] to describe key components of a computer system. The letters stand for Processor, Memory and Switch.
Figure 3.1: The PMS notation for a hypothetical cpu with only two registers and two functional units connected by a switch.
Figure 3.2: Cartoon illustration of how a computer "bus" moves data from one component on the bus to another.
Figure 3.3: The PMS notation for the "von Neumann" architecture, with a single memory holding both data and instructions.
Figure 3.4: The PMS notation for the "Harvard" architecture, with separate data and instruction memories attached to the processor (cpu).
memory and processor would in fact go through a switch, just as in the diagram of a simple cpu in Figure 3.1. Even a personal computer (PC) includes a bus which connects the processor, memory and peripherals. The PMS diagram of a basic PC is left as Exercise 3.1.
Figure 3.5: Depiction of a possible mapping between a cache and main memory. The lightly shaded areas in main memory depict regions that a given cache block could "mirror" whereas the dark area depicts the current locations being held by the cache. In some caches, temporary inconsistencies may be allowed between the cache value and the corresponding memory location.
Figure 3.6: The PMS notation for a basic processor with a cache and memory management logic: the cpu (P) and cache (M) are connected by a bus (S) to the main memory (M).
Figure 3.7: The PMS notation for a basic shared-memory multiprocessor architecture.
with the fact that the cache value is more recent than the value in main memory. We do not intend to survey the different techniques for dealing with this here. Similarly, we do not consider in detail what happens on a write when the location is not in cache. But suffice it to say that one can assume that the location is brought into cache on a write and the new value is at least placed at that time in the corresponding cache location.
Strictly speaking, the MMU is another processor (a "P" in the notation) that handles memory requests from the main processor and has the cache as its own separate memory, as shown in Figure 3.6. A key point about a cache is that the contents are frequently destroyed by a memory reference. Since multiple locations in main memory map onto a given cache location, this is a necessary evil.
Another significant point is that caches are often loaded with chunks of contiguous data (an entire cache line or block) whenever there is a cache fault. This is done to increase performance since it is often faster to load several contiguous memory locations at one time than it would be to load them individually, due to the design of current memory chips and other factors. Loading several nearby memory locations at one time is motivated by the assumption of locality of reference, namely that if one memory location is read now, then other nearby memory locations will likely be read soon. Thus, loading a cache line is presumed to be fetching useful data in advance.
3.2 Shared-memory multiprocessors
We now use the PMS notation to describe one of the two principal types of parallel computer architectures. The shared-memory multiprocessor was the first parallel system commercially available, and it is by far the most successful from a commercial point of view. Essentially every workstation vendor now offers shared-memory multiprocessor versions of their advanced workstations, and it is common to find commodity multiprocessor PC's in use.
The bus, which is the switch in this design, can be a bottleneck since all memory references must flow through it. The following definition makes this notion more precise.
Definition 3.1 We say there is contention for a switch (or more precisely for the link between the switch and memory) if there are two or more processors trying to read from or write to memory. More generally, we can define contention for any device to be when two or more agents attempt to use it simultaneously.
Performance estimates can be given based on this simple model once we know basic quantitative information about the performance of individual parts. Let us assume that each processor can do F binary operations per unit of time. Let us assume that generically these would require two loads from memory and a store to memory at the completion of the operation. If there is no memory at all in each processor, then every operation would presumably involve three memory references. If the bus can transfer a maximum of W words of data per unit of time, then the full potential of the shared-memory architecture would be realized only if $W \ge 3PF$, where P is the number of processors. This means that there would be a limit in efficiency unless

$P \le \frac{W}{3F}$.    (3.1)
We can quantify the effect of the bottleneck by estimating the time for execution of a hypothetical computation for large P. Once the bus has saturated, so that W words per time unit are being transmitted, then at most W/3 operations can be executed per time unit, no matter how many processors there are. So the parallel time $T_P$ to do a total of N of such operations could not be less than 3N/W. We pause to record this observation as the following slightly more general result.
Theorem 3.1 Assume that an arithmetic computation requires $\ell$ memory references (loads from memory or stores to memory), and that the bus can transfer a maximum of W words of data per unit of time. Then the parallel time $T_P$ to do a total of N of such computations on a basic shared-memory multiprocessor could not be less than $\ell N/W$ times the time to do one of them, regardless of the number of processors P.
On a single processor, suppose we can do F binary operations per time unit (assume the bus is fast enough to allow this rate). So the sequential time to do a total of N of such operations would be $T_1 = N/F$. Therefore the speed-up with P processors must be bounded by

$S_P \le \frac{W}{3F}$    (3.2)

no matter how large P becomes. In general, we would have the following result.
Theorem 3.2 Assume that an arithmetic computation requires $\ell$ memory references (loads from memory or stores to memory), that F of them can be done per unit of time, and that the bus can transfer a maximum of W words of data per unit of time. Then the parallel speed-up $S_P$ to do a total of N of such computations on a basic shared-memory multiprocessor is limited by

$S_P \le \frac{W}{\ell F}$.    (3.3)
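As a sketch of how the bound (3.3) might be applied (the machine parameters below are hypothetical, not measurements from any system in this chapter), one can simply evaluate $W/(\ell F)$:

      program busbound
c     Sketch: evaluate the speed-up bound S_P <= W/(l*F) of (3.3)
c     for hypothetical machine parameters.
      real*8 f, w, bound
      integer l
      f = 1.0d9
      w = 2.0d8
      l = 3
      bound = w / (dble(l) * f)
      print *, 'speed-up is bounded by', bound
      end

With these made-up numbers the bound is well below one, i.e., the bus would keep even a single processor from running at full speed; this is one motivation for the local caches introduced in Section 3.2.2.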
Figure 3.8: The PMS notation for a shared-memory multiprocessor with a local cache for each processor.
3.2.2 Adding a local cache
Typical numbers for commodity microprocessors today would have F measured in billions of instructions per second, whereas buses able to transmit billions of words of data per second are rare. So a typical value for W/F could be significantly less than one. (See Exercise 3.3 for an example.) Thus real systems often interpose a local memory (e.g. a cache, Section 3.1.2) as shown in Figure 3.8. This changes completely the performance assessment made previously, since it could be the case that substantial computations are done only on data residing completely in cache, with little traffic over the bus. As a general rule, algorithms with a larger $W_M$ will favor such machines.
Although this architecture is more difficult to model at an abstract level, we will see that assessments can be made for particular algorithms. Moreover, this architecture has proved to be a huge commercial success. Almost all of the workstation-class computers sold today have a parallel version with an architecture essentially of this type. Even PC's have evolved to incorporate such a multiprocessor design. These are often referred to as symmetric multiprocessors (SMP) reflecting the fact that all of the processors have the same role. This is meant to be in contrast to a "master-slave" relationship among processors. Symmetry is also a feature of many distributed-memory parallel computers (see Section 3.3), so it is not a distinguishing feature. The SMP acronym of course also could stand for "shared-memory multiprocessor" which would be a better characterization.
Although the introduction of a local cache at each processor increases the potential performance of the overall multiprocessor, it also complicates the design substantially. The major problem is insuring cache coherence, that is the agreement of cache values which are mirroring the same values in shared memory: Coherency addresses the issue of what memory is updated, whereas consistency addresses when a memory location is updated. It is well beyond the scope of the book to explain the various techniques used to insure the correctness of cache values, but suffice it to say that all of the designs are done to insure that the multiple caches behave as if there were no cache at all, except hopefully everything works faster in the typical case. The interested reader is referred to [32, 61].
The SGI Origin series, a type of distributed-shared memory computer, provides hardware support for coherent access of a global address space utilizing all memory. However, not all parallel computers implement coherent memories. The Cray "T3" series of massively parallel computers includes hardware support to view all of memory in a global address space (i.e., including memory at other processors) but there is no restriction on the access.
The cache-only memory architecture (COMA) is a shared-memory computer with typically a large local cache (the KSR-1 computer [52] had 32 Megabytes of local cache for each processor). However, the main memory consists simply of the memory in the other processors' caches (see Figure 3.16). Any reference to data not in the local cache of a given processor results in a cache miss, together with a cache replacement in the local cache. As one might imagine, the difficult part of this architecture is to insure cache coherence (Section 3.2, page 54), but the benefit is a shared-memory computer with essentially the largest possible local cache.
3.3 Distributed-memory multicomputers
Bandwidth limitations render a single bus multiprocessor impractical for large P. Such machines have rarely exceeded thirty-odd processors. That, in addition to the limited bandwidth of a single memory module, or even a bank of interleaved memory, motivates the distributed-memory multicomputer. These are called massively-parallel processor (MPP) systems when P is sufficiently large, say a thousand or more. Since it is never clear what "massive" means, we can also take MPP to mean moderately parallel processor. Roughly speaking, the range 32 < P < 1000 might characterize this range. Million processor machines have been announced (such as the IBM Blue Gene machine), so the letter M can be taken to mean all of these things.
We note the distinction between multiprocessor and multicomputer in the definitions of shared-memory and distributed-memory architectures, respectively. The latter is intended to evoke the impression of a collection of computers, that is, a collection of complete computer systems. The term multiprocessor highlights the fact that the processors in a shared-memory multiprocessor are "multi" but the memory is not.
The individual computers in a distributed-memory multicomputer are often called nodes. The nodes in a distributed-memory multicomputer are made of, at a minimum, a processor (cpu) and some memory. They may also have disks connected via SCSI buses, individual ethernet connections, individual serial lines, graphics output devices, and so forth (anything a single computer might have).
Figure 3.9: A distributed-memory multicomputer with nodes (consisting of processor-memory pairs) connected to the network without a separate communication system (store-and-forward approach).
Figure 3.10: A distributed-memory multicomputer with six processors connected by a network with a ring topology.
Figure 3.11: The two possible states for a 2 × 2 cross-bar switch. Dotted lines indicate paths not in use, and solid lines indicate connections being used.
Figure 3.12: A 4 × 4 two-stage interconnect switch using four 2 × 2 cross-bar switches (the grey boxes) as basic building blocks.
Delta prototype machine and then later in the commercial follow-on, the Paragon. These meshes incorporate a wrap-around and thus should be referred to as a torus mesh. The Cray T3D uses a three-dimensional mesh. It is also a (three-dimensional) torus, hence the "T" in the name "T3D".
Many other graphs have been proposed, and some of them have found their way into MPP systems. These include trees, fat trees [60], and hypercubes (see Section 3.3.2). Such graphs (or networks) will also be discussed in Chapter 7 in the context of algorithms for collective, or aggregate, operations.
Another network of current interest is based on using a cross-bar switch. The resulting graph can be thought of as the complete graph on P nodes, i.e., all of the nodes are directly connected. However, the connections in this case can not be accurately represented by a static graph. More precisely, a cross-bar switch can be defined as a switch which can implement simultaneous (contention free) communications between processors i and $\pi(i)$ for all processor numbers i, where $\pi$ denotes an arbitrary permutation. (A permutation $\pi$ of $\{1, \ldots, P\}$ is a one-to-one mapping of this set onto itself. In particular, if $\pi(i) = \pi(j)$ then $i = j$.) Any of the P nodes can be effectively directly connected to any other node at any given time, but the set of simultaneous connections is at most P. For a 2 × 2 cross-bar switch, there are only two states for the switch, as indicated in Figure 3.11.
A similar type of switch effectively allows arbitrary connections through the use of a multi-stage interconnect. The basic building block of a multi-stage interconnect is a cross-bar switch, a 2 × 2 cross-bar switch in the simplest case. In a typical multi-stage interconnect, at most P paths could be in use and conflicts could reduce this number substantially. In particular, not all permutations communicate without contention simultaneously. Figure 3.12 shows a 4 × 4 multi-stage interconnect with two stages based on 2 × 2 cross-bar switches. Figure 3.13 indicates contention between two message routes that is typical in a 4 × 4 multi-stage interconnect.
Cross-bar switches are utilized on both the IBM SP2 machine and on the NEC research prototype Cenju-3 machine [55]. Their networks consist of a multi-stage interconnect switch using 4 × 4 cross-bar switches. Thus it takes only three stages to connect 64 processors, instead of the six stages that would be required using 2 × 2 cross-bar switches. Using larger cross-bar switches as the basic building blocks not only allows fewer stages, but it also decreases the number of communications with contention (see Exercise 3.15).
Figure 3.13: Contention in a 4 × 4 two-stage interconnect switch between messages from 2 to 0 (dark dashed line) and from 0 to 1 (dark solid line) which would have to use the same communication wire at the top (where the dark solid and dashed lines coincide).
Figure 3.14: Hypercubes of dimension d = 2, 3, 4 constructed inductively. Dotted lines indicate links joining matching nodes from two identical copies of the lower-dimensional hypercube.
Corporation, founded in 1983, was one of the first commercial enterprises to make hypercube-connected parallel computers.
in this approach. The first Connection Machine [82] used a variant of this network in which the ring is replaced by a mesh.
Hypercubes have one feature not available with some other networks, in particular rings and meshes. Appropriate sub-cubes of a hypercube are themselves hypercubes of the appropriate dimension. Sub-meshes of a toroidal mesh can never retain the toroidal topology, so algorithms using this feature can have different behavior for sub-meshes.
Another useful feature of hypercubes is that many other networks can be imbedded in them. By a graph embedding we mean a one-to-one mapping of the vertices of one graph into another (i.e., no vertex is mapped to another vertex more than once) in which each edge is mapped to an edge (hence neighboring vertices get mapped to neighboring vertices).
It is possible to imbed a ring in a hypercube using what is known as a Gray code [?]. Such codes are not unique, but are defined by the property of being minimal switching codes in their binary representation. That is, a Gray code is a sequence of integers $g_0, g_1, \ldots$ having the property that the binary representations of $g_i$ and $g_{i+1}$ differ by at most one bit for all i. They were created initially to minimize the energy expended in switching relays from one state to another.
The natural representation of a hypercube has a similar property. That is, if the number of a node has binary representation $i_d i_{d-1} \cdots i_2 i_1$ where the j-th coordinate of the vertex is $i_j$, then the neighbors of any node differ by at most one bit. In particular, they differ precisely in the coordinate direction corresponding to the edge between them. Therefore, any embedding of a ring onto a hypercube naturally defines a Gray code. Similarly, a Gray code provides an imbedding of a ring of $2^d$ nodes into a d-dimensional hypercube.
Lemma 3.3 A Gray code for the (d + 1)-dimensional hypercube can be constructed from one for the d-dimensional hypercube, starting from $g_0 = 0$ and $g_1 = 1$, via

$g_{P+i} = g_{P-1-i} + 2^d$ for $0 \le i \le P - 1$, where $P = 2^d$.    (3.4)
The sequence defined by (3.4) is known as the binary reflected Gray code since it traverses the paired (i - 1)-dimensional hypercubes in a reflected order while traversing dimension i (see Figure 3.14).
Toroidal meshes can be imbedded into hypercubes by viewing them as Cartesian products of rings. That is, a $2^j \times 2^k$ mesh naturally imbeds into a (j + k)-dimensional hypercube, using a pair of Gray codes, one for each coordinate. Similarly, a three-dimensional toroidal mesh of size $2^i \times 2^j \times 2^k$ naturally imbeds into a (i + j + k)-dimensional hypercube.
The time to send a message can be modeled as

$t(m) = \alpha + \beta m$,    (3.5)

where m is the number of words being sent, $\alpha$ is the "latency" corresponding to the cost of sending a null message, i.e. a message of no length, and $\beta$ is the incremental time required to send an additional word in a message. This model does not account for contention in the network that can occur when simultaneous communications are attempted that involve intersecting links in the network.
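A sketch of how the model (3.5) behaves, with hypothetical values of $\alpha$ and $\beta$ (not measurements): for short messages the latency term dominates, and the message length at which the two terms balance, $m = \alpha/\beta$, is a useful rule of thumb.

      program msgtime
c     Sketch: the communication model (3.5), t(m) = alpha + beta*m,
c     with hypothetical latency and per-word transfer costs.
      real*8 alpha, beta, t
      integer m
      alpha = 1.0d-5
      beta  = 1.0d-8
      do 1 m = 1, 10001, 2500
         t = alpha + beta * dble(m)
         print *, m, t
 1    continue
c     the two terms balance at m = alpha/beta words
      print *, 'crossover length:', alpha / beta
      end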
The necessity of latency can be seen from the physical mail system, in which we use envelopes to surround messages. The envelope usually contains little important information and is just discarded once the message arrives. The weight of the envelope can easily exceed the weight of the paper carrying the message.
Figure 3.15: A distributed-memory multicomputer with processors connected to the network with a separate communication system (direct-connect approach); each node has a computational processor, memory, and a communication processor.
There are at least two distinct types of message systems. In the "store and forward" approach, each message is stored temporarily by each processor in the path from the sender to the receiver (see Figure 3.9). This makes the message-passing system simple to implement, as each processor deals with messages in a uniform way. However, it adds to the latency and makes $\alpha$ and $\beta$ depend strongly on the length of the path. Although utilized in first-generation hypercube-connected multiprocessors [43], this approach has been abandoned in favor of the "direct-connect" scheme which effectively employs special processors to handle messaging (Figure 3.15). This is similar to the routing of the long-distance phone system. Once a connection is established, whose route could vary from call to call even for the same pair of phone numbers, the particular routing is held fixed for the duration of the call. In the "direct-connect" message-passing system, a similar connection is established, and then data moves rapidly in an amount of time that is essentially independent of the distance between source and destination. To implement this requires a separate processor and buffer memory to handle the message system.
3.4 Pipelined and vector processors
Parallelism in supercomputers is not a new concept. One can trace this concept quite far back, but we will consider in detail only two of the more recent examples of this. The technique of pipelining was used extensively in the Control Data Corporation (CDC) 6000-series (and later) machines which provided at the time a distinctly higher level of floating point computation. The later Cray Research Corporation computers utilized vector processors to increase the level again.
A pipelined processor is one in which the functional units which execute basic (e.g., arithmetic) instructions are broken into smaller sub-atomic parts which can execute simultaneously [82]. It is analogous to an assembly line in a factory. The operands proceed down a "pipe" which has separate compartments. Once the first sub-task is completed the information is transferred to the next compartment, and the first compartment is now free to be used for a different set of operands. As a result of this design, the speed of operation can be limited only by the time it takes for the longest subtask to complete. In principle, one new operation can be initiated with this frequency, even though previous ones are still in the "pipeline." On the other hand, this also means that it may take several cycles for an operation to complete. Later operations which depend on the result may have to be postponed (through the insertion of no-ops). Thus a greater potential speed is traded in return for an increased complexity. The type of dependence that can cause pipelined computations to perform less than optimally is quite similar to the type of dependence we saw in Chapter 1.
The original pipelined processors (in the CDC 6000 series computers) would initiate at most one floating point operation per cycle. However, if there are multiple functional (e.g., floating point) units (which the CDC 6000 series had), it is conceivable to have all of them initiating instructions at each cycle. Such a design has emerged with the moniker super-scalar. Currently, the fastest microprocessors use this type of architecture.
The immediate successors to the pipelined machines were vector processors which utilize pipelines as the main architecture but make additional restrictions on the sequencing of the functional units, based on the assumption that "vector" operations will be done. In a pipelined system, at each cycle a different floating-point or logical operation could be initiated [82]. By restricting to more limited vector operations, greater performance was achieved, albeit on more restricted types of calculations.
In a super-scalar design, arbitrary sets of operations can be initiated at each clock cycle. This extra flexibility has made this design the architecture of choice for the current generation of computer chips being used in everything from high-end PC's to workstations to the individual processors in the nodes of MPP systems.
It is not our intention to include complete descriptions and analyses of pipelined, super-scalar and vector processors, but rather to note that they can be viewed as parallel processors. A pipelined processor can be working on multiple operations in any one cycle. A super-scalar processor can further initiate work on multiple operations in any one cycle, while continuing to work on many more. A vector processor can appear to apply a single operation to multiple data values (e.g., add two vectors) simultaneously. The granularity of parallelism in these architectures is quite fine, often referred to as instruction-level parallelism. But the essential workings of these architectures can be analyzed in a way that is similar to the more coarse-grained type of parallelism we are considering at length here.
3.5 Comparison of architectures
Figure 3.16 depicts quantitative and qualitative aspects of four different commercial designs. The shaded areas depict the pathway from processors (including local memories, e.g. caches) to main memory. The width of the pathway is intended to indicate relative bandwidths in the various designs. It is roughly proportional to the logarithm of the bandwidth, but no attempt has been made to make this exact. The length of the pathway also indicates to some extent the latencies for these various designs. The vector supercomputer has a relatively low latency, whereas the distributed-memory computer or network of workstations has a relatively high latency.
This comparison allows us to make broad assessments of the different designs. The vector supercomputer has a relatively small local memory (its vector registers) but makes up for this with a very high bandwidth and low latency to main memory. The COMA design (page 54) provides a large local memory without sacrificing the simplicity of shared memory. The shared memory multiprocessor also retains this but does not have a very large local memory. Finally, the distributed-memory computer or a network of workstations has a large local memory similar to the COMA design but cannot match the bandwidth to main memory. Instead of the COMA engine, one simply has a network to transmit messages.
No absolute comparisons can be made regarding the relative performance of the different designs. Different algorithms will perform differently on different architectures. However, the shared memory multiprocessor will not perform well on problems with limited amounts of computation done per memory reference, that is, ones for which the work/memory ratio, $W_M$, is small (see Definition 1.2). Similarly, a network of workstations (Section 3.3.4) cannot perform well on computations that involve a large number of small memory transfers to different processors. We now try to quantify this with particular algorithms.
3.5.1 Summation
We can assess the behavior of different designs quantitatively by considering some basic algorithms. To begin with, consider the norm evaluation

$\sum_{i=1}^{n} x_i^2$    (3.6)
which occurs frequently in scientific computation. This has the same form as the summation problem studied in Section 1.4.1. Here there are n memory references (to the array quantities $x_i$ for $i = 1, \ldots, n$) and 2n floating point operations. This algorithm therefore has a constant work/memory ratio $W_M$ (defined in Definition 1.2); see Exercise 3.12.
If an individual processor can do F floating point operations per time unit and W words per time unit can be transmitted from memory, then the norm requires at least the maximum of n/W and 2n/F time units to be completed. Memory references and computation can frequently be interleaved, so the maximum of these could easily be the execution time. The ratio of the two times, the ratio of communication time to computation time, is F/2W for computing a norm. Note that the ratio F/W has units of floating point operations per memory reference and may be thought of as the number of floating point operations that can be done in the time it takes to store to, or retrieve from, memory one word of data. If F/W is too large, the computer will not be efficient at computing a norm because processors will be idle waiting for data. Similar linear algebraic calculations, such as a dot product (Exercise 3.13), can be analyzed similarly.
Figure 3.16: Comparison of various architectures (the panels show a vector supercomputer with vector registers and cache, a COMA design with no main memory, a bus-based shared-memory multiprocessor, and a distributed-memory computer connected by a switch/network). The shaded areas indicate the bandwidth to memory. Wider paths indicate faster speeds.
In the basic shared-memory computer with P processors each computing norms, the computation time stays the same, since each norm can be done independently of the others, but the communication time becomes P times larger, namely Pn/W, if all of the $x_i$'s are different for each processor. This is because they all must pass through the same memory path, whose bandwidth is fixed. The ratio of the communication time to computation time is PF/2W for computing P norms on a basic shared-memory computer. Since this increases with P, this algorithm does not scale well for this architecture. The addition of a cache at each processor can help, but only if the $x_i$'s are already in cache.
Each entry of the product C = AB of two n × n matrices is a sum

$\sum_{k=1}^{n} a_{ik} b_{kj}$    (3.7)

and therefore involves 2n (or 2n - 1, see Exercise 3.14) floating point operations. Since this must be done $n^2$ times, the total computational work is, say, $2n^3$ floating point operations. On the other hand, only $3n^2$ memory references are involved, $n^2$ each to bring A and B from memory, respectively, and another $n^2$ to store the result, C. The work/memory ratio $W_M$ (Definition 1.2) for this algorithm grows linearly with matrix dimension n (see Exercise 1.2).
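The linear growth of the work/memory ratio is easy to tabulate (a sketch, using only the operation and reference counts just derived):

      program matwm
c     Sketch: work/memory ratio for n x n matrix multiplication,
c     2n^3 operations against 3n^2 memory references, giving
c     W_M = 2n/3, which grows linearly with n.
      integer n
      real*8 work, mem
      do 1 n = 100, 1000, 300
         work = 2.0d0 * dble(n)**3
         mem  = 3.0d0 * dble(n)**2
         print *, n, work / mem
 1    continue
      end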
Figure 3.17: Flynn's taxonomy provides a computational model which differentiates computer architectures based on the flow of information (single or multiple instruction stream versus single or multiple data stream: SISD, SIMD, MIMD).
The ratio of communication time to computation time is 3F/2Wn for computing one matrix multiplication, and 3PF/2Wn for computing P matrix multiplications independently on a basic shared-memory computer. In this case, we see that there is a chance for scalability: P = n processors could be efficiently computing matrix multiplications independently on a basic shared-memory computer, as long as the ratio F/W is small enough.
The considerations are similar for distributed-memory computers. Suppose we consider the case P = 2 and a norm calculation. Let us suppose we divide the work and the array $x_i$ in half and give one-half to each processor. One processor will compute $s_1 = \sum_{i=1}^{n/2} x_i^2$ and the other will compute $s_2 = \sum_{i=n/2+1}^{n} x_i^2$. The two processors will then have to communicate in some way so that $\sum_{i=1}^{n} x_i^2 = s_1 + s_2$ can be determined. This latter step will take at least $\alpha$ units of time, whereas the partial sums require only n/F units of time. The ratio of communication time to computation time is $\alpha F/n$. This is indeed favorable since it becomes small as n increases, but note that it also implies that it is inefficient if n is less than $\alpha F$. The latter number (whose units are "floating point operations" without any time units) measures the number of floating point operations that can be done while waiting for a null message to complete. In some cases, this could be thousands of floating point operations.
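A sketch of this break-even point, again with hypothetical $\alpha$ and F (not measured values): the vector length must well exceed $\alpha F$ before the two-processor norm is efficient.

      program breakev
c     Sketch: for the two-processor norm, communication costs at
c     least alpha while the partial sums cost n/F, so efficiency
c     requires n to well exceed alpha*F (hypothetical values below).
      real*8 alpha, f
      alpha = 1.0d-5
      f = 1.0d9
      print *, 'break-even length alpha*F =', alpha * f
      end

With these made-up numbers, $\alpha F = 10^4$, i.e., ten thousand floating point operations could be done while waiting for the null message.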
3.6 Taxonomies
A taxonomy is a model based on a discrete set of characteristics, such as color, gender or atomic weight. Since the model variables are discrete, a taxonomy puts items in neat "pigeon holes" even if they don't quite fit. Like any model, they provide a condensation of information that is often useful in comparing different things.
Perhaps the most famous taxonomy of parallel computing is due to Flynn (see Figure 3.17). It is based only on the logical flow of information (instructions and data) in a computer, with only two values for flow type: single and multiple. The instruction stream refers to the progression of instructions that are being carried out in a computer. If there are several being done at once, this instruction stream is called multiple. If there is a unique instruction at each instant, it is called single. Similarly, the data stream is the progression of data values being brought from memory into the cpu. If several are being brought in simultaneously, the data stream is multiple, otherwise it is said to be single. The conventional von Neumann or Harvard architecture computers (Section 3.1.1) both have a single instruction-stream and a single data-stream. Such a system is called SISD in Flynn's Taxonomy. Note that this taxonomy does not distinguish the architectural differences between von Neumann and Harvard.
Figure 3.18: Another taxonomy provides a computational model which differentiates different computers based on the physical memory system used instead of the flow of data (shared memory multiprocessors such as Sequent and Cray; distributed memory multicomputers such as Ametek and HEP; "SIMD" arrays; vector processors).
A parallel computer almost by definition should have multiple pieces of data being used at each instant. However, it is possible to have either a single instruction or multiple instructions at any instant. The former is referred to as a single instruction-stream, multiple data-stream (SIMD) computer, while the latter is referred to as a multiple instruction-stream, multiple data-stream (MIMD) computer. Doing the same operation to different pieces of data (SIMD) can certainly produce interesting results, but in Flynn's taxonomy, the concept of an MISD computer is not allowed. It theoretically could exist, and it is curious to contemplate a computer performing multiple operations on a single data-stream. However, there are no examples of such a computer that we are aware of, and the MISD computer apparently has not been missed so far.
Other taxonomies (models) could be constructed based on different variables, such as depicted in Figure 3.18. Here we have substituted the data flow variable for a variable which differentiates computers based on the physical characteristics of the memory system. One can imagine this arising by rotating in three-dimensional space, keeping the vertical axis (instruction stream) fixed. With the data-stream axis pointing out of the page, new information (the memory system) is exposed. Shared-memory and distributed-memory processors appear to move apart, whereas vector processors merge with sequential processors, leaving massively parallel SIMD computers in a separate box.
Another taxonomy can be constructed, such as depicted in Figure 3.19, based on the physical characteristics of the processors and memory system. Thus there are at least four dimensions of data that could be of interest in describing different architectures.
All of these models have limitations in that they fail to capture all of the important characteristics of some machines. For example, some machines can be viewed as both shared and distributed memory machines (KSR, BBN Butterfly, DASH).
3.7 Current trends
It is dangerous in a book to discuss the future, and it dates a book to discuss current events. However, several trends can be seen at the moment that will continue to affect high-performance computing in the future. We will mention some of them here in the hope that the pointers to more up to date information may be useful.
On the chip level, one trend that is expected to continue is the exploitation of instruction-level parallelism. This is being done by having more functional units executing simultaneously. Another trend is multiple levels of memory. This is happening even on the chip level. Due to the increasing density of transistors on chips, and the increasing speed at which they operate, it is beginning to take several cycles to access memory even on the cpu chip. Therefore, the number of levels of cache can be expected to grow rather than decline. Both of these trends have significant implications for development of algorithms, whether sequential or parallel.

Figure 3.19: Another taxonomy provides a computational model which differentiates different computers based on both the physical memory system and processor system used (each single or multiple) instead of the flow of information.
One significant success in parallel computing architectures is the pervasiveness of small shared-memory processors today. These are now routinely constructed on a single board. Such systems are even available as commodity PC boards. As a result, parallel systems are being constructed which involve a network of such boards. Thus they represent a distributed-memory architecture globally, with subsets of processors which have hardware support for shared memory. For example, it is possible to construct low-cost networks of PC's which have this topology. This complicates the programming methodology substantially, since it mixes two major types of architecture in a non-homogeneous way. However, one can expect this trend to continue, toward larger numbers of processors per board, and with multiple processors per chip.
3.8 Exercises
Exercise 3.1 Draw and explain the PMS diagram of a personal computer with only a cpu, RAM and floppy disk, connected via a bus.

Exercise 3.2 Multiple memory banks are a way to increase the through-put of memory systems. They are designed to ameliorate the fact that standard DRAM memory chips take multiple cycles to complete a memory reference. Give a PMS diagram of a single processor with a memory system comprised of multiple memory banks. (Hint: reverse the roles of P and M in the basic shared-memory multiprocessor architecture in Figure 3.7.)

Exercise 3.3 The Silicon Graphics "Power Challenge" is a machine with 36 processors each having a peak floating point rating of 300 megaflops, but only 1.28 giga-bytes per second bus bandwidth. Assuming 8 bytes per word, how many operations must be done by each processor (working within its local cache) per data transfer on the bus in order for the maximum rating of 10.8 giga-flops to be achieved? (Assume the bus use is distributed uniformly among the processors.)

Exercise 3.4 Derive the bounds presented in Section 3.2.1 in terms of the work/memory ratio $W_M$ defined in Definition 1.2. That is, prove a general result giving an upper-bound on the number of processors that can efficiently use the bus in terms of $W_M$ for that algorithm. Then apply this result to the hypothetical binary operation considered in Section 3.2.1.
Exercise 3.5 Draw two-dimensional and three-dimensional versions of a "cube-connected cycles" graph. How many processors are required in these cases? How many processors are required in the four-dimensional case?

Exercise 3.6 Derive the formula for the number of processors for a d-dimensional "cube-connected cycles" graph.

Exercise 3.7 Derive the Gray code for the integers zero through fifteen defined by Lemma 3.3. (Hint: just use the formula (3.4).)

Exercise 3.8 Draw the Gray code for the integers zero through fifteen defined by Lemma 3.3 on a copy of Figure 3.14. (Hint: first do Exercise 3.7.)

Exercise 3.9 Prove that a ring of size jk can be imbedded in a j × k toroidal mesh. (Hint: define a numbering scheme for a mesh based on Cartesian coordinates.)

Exercise 3.10 The total memory bandwidth of a distributed memory machine depends on the ratio of the number of edges to the number of vertices in the graph representation of the network. For example, this ratio is one for a ring. Determine this ratio for a two-dimensional toroidal mesh and for a three-dimensional toroidal mesh. (Hint: consider the number of edges emanating from each vertex, and then account for redundancies.)

Exercise 3.11 Determine the ratio of the number of edges to the number of vertices in the graph representation of a hypercube network of dimension d. (Hint: see Exercise 3.10.)

Exercise 3.12 Determine the work/memory ratio $W_M$ (defined in Definition 1.2) for the norm evaluation algorithm described in Section 3.5.1.

Exercise 3.13 The dot product of two vectors (x : 1, ..., n) and (y : 1, ..., n) is a simple variant of a norm evaluation: $\sum_{i=1}^{n} x_i y_i$. Determine the work/memory ratio $W_M$ (defined in Definition 1.2) for the evaluation of a dot product.

Exercise 3.14 Just how many arithmetic operations are required to compute each sum in (3.7)? Show that there are two different ways: one takes 2n FLOPs and the other one less. (Hint: the difference is in the initialization of the summation loop.)

Exercise 3.15 Show that there are fewer communication patterns with contention in a 4 × 4 cross-bar switch than in a 2-stage multi-stage interconnect switch using 2 × 2 cross-bar switches. (Hint: any permutation is available with the former, and only certain permutations can be done with the latter.)

Exercise 3.16 Determine the number of permutations in Figure 3.12 which result in contention (cf. Figure 3.13). Which ones are in contention? How many are contention free?
Exercise 3.17 Prove that a $2^i \times 2^j$ mesh can be imbedded in a d-dimensional hypercube if $i + j \le d$.

Exercise 3.18 Prove that a $2^i \times 2^j \times 2^k$ mesh can be imbedded in a d-dimensional hypercube if $i + j + k \le d$.

Exercise 3.19 Prove that a complete binary tree of depth $d - 1$ (which has $2^d - 1$ nodes) cannot be imbedded in a d-dimensional hypercube if $d \ge 3$. (Hint: consider the parity of the binary representation of the nodes of the hypercube. In going from one level to another in the tree, show that the parity must change. Count the number of nodes of one parity and prove that a hypercube must have the same number of even and odd parity nodes.)
Exercise 3.20 A cross-bar switch can be used as a basic building block for many networks. Show how one can implement a network similar to a four-dimensional cube-connected cycles network, with P = 64, using sixteen 4 × 4 cross-bar switches.

Exercise 3.21 An ethernet "hub" is a device that merges network connections via its "ports." An 8-port hub can be used much like a 4 × 4 cross-bar switch as a basic building block for many networks. Show how one can implement a network similar to a four-dimensional cube-connected cycles network, with P = 64, using sixteen 8-port hubs.
Chapter 4

Dependences

"I wish I didn't know now what I didn't know then" (from the song)
Dependences between different program parts are the major obstacle to achieving high performance on modern computer architectures. There are different sources of dependences in computer programs. In "imperative" programming languages like Fortran and C, dependences are indicated by
4.1 Data Dependences
Dependences force order among computations. Consider the two fragments of code shown in Program 4.1. The last two statements in the left column can be interchanged without changing the resulting values of x or y. However the last two statements in the right column cannot be interchanged without changing the resulting value of y. In either case, the assignment to the variable z (the first line of code in both columns) must come first. The requirement to maintain execution order results since the variables used in one line depend on the assigned values in another line. Our purpose here is to formalize the definitions of such dependences.
We will refer to the operations corresponding to a fragment of code such as in Program 4.1 as a computation. We will then define the dependence of one computation on another. We will be
1. Some language constructs, such as HPF's !HPF$Independent, permit the programmer to assert a lack of dependences in a section of code. See Section 4.7.
      z = 3.14159          z = 3.14159
      x = z + 3            z = z + 3
      y = z + 6            y = z + 6

Program 4.1: Two fragments of code (left and right columns).
would not be well formed in any of these languages, since there is no assignment involved. Similarly, a code fragment which entered a subprogram, but did not return from it, would not be well formed. We leave the notions of operation, instructions, code and machine informal, although they can be made quite formal and precise. The reader can imagine the operations carried out as specified in a typical Fortran, C, or C++ code compiled for an actual computing machine like a favorite workstation. We assume that our "machine" has memory locations which store the variables described in the codes we will consider. Among other things, a computation C reads data from, and writes data to, memory locations.
Definition 4.2 The read set R(C) of a computation C is the set of memory locations from which data are read during this computation. The write set W(C) of a computation C is the set of memory locations to which data are written during this computation. The access set A(C) is defined to be $R(C) \cup W(C)$.

The read set R(C) is sometimes called the Use or In set, and the write set W(C) is called the Def or Out set.
Definition 4.3 Suppose that C and D are computations in a code. There is a (direct) data dependence between C and D if one of the following Bernstein conditions holds:

$W(C) \cap R(D) \ne \emptyset$, or    (4.1)

$R(C) \cap W(D) \ne \emptyset$, or    (4.2)

$W(C) \cap W(D) \ne \emptyset$.    (4.3)
We are considering codes in languages which have a specified order of computation. In Fortran, C, or C++, the order of computation is specified (in part) by the order of the lines of code in various files, together with rules for special instructions such as branches, e.g., goto in Fortran. This order induces a partial order, which we denote by $\prec$, on the set of computations that a code can specify. In particular, we say that the computation C occurs before the computation D, and write $C \prec D$, if all of the lines of C precede all of the lines of D (with no overlap) in the standard order of execution.
Note that the order of C and D in Definition 4.3 does not affect whether we declare a data dependence between them. That is, if there is a data dependence for $C \prec D$, then there is also a dependence for $D \prec C$. However, the specific dependences (4.1) and (4.2) do depend on the order. If (4.1) holds for $C \prec D$, then (4.2) holds for $D \prec C$, and vice versa. The condition (4.3) is independent of order. It is important to distinguish between the different types of dependences in some cases, and so special names are given for the different sub-cases, as follows.
Suppose that the computation C occurs before the computation D. If (4.1) holds, then we say there is a true dependence, flow dependence or forward dependence (all equivalent terminology) from C to D.
Example 4.1 Consider the fragment of code in the left-hand side of Program 4.1. Let C denote the second line and D denote the third line. Then $W(C) = \{x\}$ and $R(D) = \{z\}$. Thus $W(C) \cap R(D) = \emptyset$, and there is no forward dependence from C to D. But in the code fragment in the right column there is a forward dependence. Again, let C denote the second line and D denote the third line. Then $W(C) = \{z\}$ and $R(D) = \{z\}$. Thus $W(C) \cap R(D) = \{z\}$ is not empty.
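The read and write sets of Example 4.1 can be annotated directly on the right-hand fragment of Program 4.1 (a sketch; the comments are ours):

      z = 3.14159
c                      W = {z}, R = { }
      z = z + 3
c     computation C:   W = {z}, R = {z}
      y = z + 6
c     computation D:   W = {y}, R = {z}
c     W(C) and R(D) intersect in {z}: a forward dependence from C to D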
Lemma 4.1 If a variable x does not appear in the write set of either computation C or D, then it cannot contribute to a dependence between C and D.

The proof of this result is immediate, since the hypothesis means that x cannot appear in any of the three intersections (4.1), (4.2) or (4.3), since all three involve at least one write set.
When there is any dependence between computations, we cannot execute them in parallel, since the result will depend on the exact order of computation. We will therefore be most interested in proving that there are not dependences between computations in many cases. However, proving that there are dependences between computations can be useful too: it can halt a futile effort to parallelize code before it starts. The main reason for studying dependences is the following, which we state without proof.
Theorem 4.1 If two computations C and D have no data dependences, then they can be computed independently (in parallel) without changing the result of the overall computation.
There is one more bit of formalism we need to introduce to minimize possible confusion. In the fragments at the beginning of the section, all of the variables appearing there are initialized in that fragment itself. However, it would be awkward (and in fact not useful) to limit our definition of computations to such situations. In fact, dependences can be different depending on different executions of the code. For example, the dependences in the code Program 4.2 can only be determined once the data i and j are read in. Different executions of the code could clearly lead to different values for i and j. For this reason, we will add a subscript to a particular computation S to indicate the particular execution of it, e.g., $S_e$ where the subscript e denotes the particular execution.
      read i,j
      x(i)=y(j)
      y(i)=x(j)

Program 4.2: Code whose dependences can only be determined at execution time.
However, the only significant use of this concept we will make is to indicate the execution of a particular text of code in different iterations of a loop. In this case, we will use the loop indices as subscripts to indicate this concept.
4.2 Loop-carried dependences
One especially important form of parallelism is loop-level parallelism in which each loop iteration (or group of iterations) is executed in parallel. The primary consideration for studying dependences here is that they inhibit parallelism. However, dependences in loops lead to inefficiencies even on current scalar processors, such as those with long pipelines (Section 3.4). Thus it is very important to know whether there are dependences between computations in different loop iterations. Such dependences are called loop-carried data dependences. We begin the discussion with an example to clarify the issues.
Consider the computation of a norm (3.6). As observed there, this has a form similar to the summation problem studied in Section 1.4.1. However, it has twice as much computation per memory reference, and it is quite a common sub-task in scientific computation. In Fortran this might be written as shown in Program 4.3. In a pipelined processor, we see that it will be difficult to achieve much performance because we need to complete one iteration of the loop before we can start the next: we must wait for the new value of sum to be computed. We will see that this is caused by a loop-carried data dependence.
      sum = x(1)*x(1)
      DO 1 I=2,N
 1       sum = sum + x(I)*x(I)

Program 4.3: Fortran code for the norm evaluation (3.6).
      sum1 = x(1)*x(1)
      sum2 = x(2)*x(2)
      DO 1 I=2,N/2
         sum1 = sum1 + x(2*I-1)*x(2*I-1)
 1       sum2 = sum2 + x(2*I)*x(2*I)
      sum = sum1 + sum2

Program 4.4: Code for norm evaluation with the loop split in two.
      DO 1 j=1,k
 1       summ(j) = x(j)*x(j)
      DO 3 i=2,N/k
         DO 2 j=1,k
            summ(j) = summ(j) + x(k*(i-1)+j)*x(k*(i-1)+j)
 2       continue
 3    continue
      sum = 0.0
      DO 4 j=1,k
 4       sum = sum + summ(j)

Program 4.5: Code for norm evaluation with an arbitrary number of loop splittings. Note that this is a cyclic decomposition of the summation problem.
We can describe loop splitting in the terminology of parallelism as a cyclic (or modulo) decomposition of (3.6) (see Definition 1.8).
Definition 4.4 A loop is in normalized form if its index increases from zero to its limit by one.

Any loop can be converted into normalized form by an affine change of index variables. For example, the loop

      do 1 I=11,31,3
 1       A(I)=B(2*I)

becomes

      do 1 J=0,6
 1       A(11+3*J)=B(2*(11+3*J))

That is, all "strides" are eliminated by the obvious change of loop index, and loops with decreasing index are reversed. See Exercise 4.3.
The set of loop indices for normalized nested loops are multi-indices, i.e., n-tuples of non-negative integers, $I := (i_1, i_2, \ldots, i_n)$, where n is the number of loops. In Program 4.5 the DO 2 and DO 3 loops are a nested pair of loops (n = 2). By convention, $i_1$ is the loop index for the outer-most loop, and so on with $i_n$ being the loop index for the inner-most loop. We use the symbol $\vec{0}$ to denote the n-tuple consisting of n zeros. There is a natural total order "<" on multi-indices, lexicographical order, that is, the order used in dictionaries.
Definition 4.5 The lexicographical order for all n-tuples of integers is defined by

$(i_1, i_2, \ldots, i_n) < (j_1, j_2, \ldots, j_n)$

whenever $i_k < j_k$ for some k, and $i_\ell = j_\ell$ for all $\ell < k$ if $k > 1$.
Figure 4.1: Performance (MFlops/sec) of norm evaluation for vectors of length n = 16,000 on various computers (Cray C90, HP735/125, RS6000, Alpha, Sparc20) as a function of the number of loop splittings.
We can then write $I \le J$ if either I < J or I = J, and similarly we write J > I if I < J. The standard order of evaluation of nested (normalized) loops in Fortran, C, C++ and other languages provides the same total order on the set of loop indices (i.e., on multi-indices) as the lexicographical order < defined in Definition 4.5. Indeed, the loop execution for index I comes before the loop execution for index J if and only if I < J. See Exercise 4.24 for another possible loop ordering.
If S is a computation that is enclosed in at least $n \ge 1$ definite loops with main index variables $l_1, l_2, \ldots, l_n$, then $S_I$ denotes the execution of S for which

$(l_1, l_2, \ldots, l_n) = (i_1, i_2, \ldots, i_n) =: I$.
Definition 4.6 There is a loop-carried data dependence between parts S and T of a program if there is a data dependence between the executions $S_I$ and $T_J$ for some multi-indices I and J, with I < J.

A loop-carried dependence can be described more precisely as forward (4.1), backward (4.2), or output (4.3) depending on which of the Bernstein conditions hold. Note that we do not assume that the parts S and T are disjoint or in any particular order. The ordering required following Definition 4.3 is enforced by the assumption I < J. In view of this, $S_I \prec T_J$ in Definition 4.6.
Suppose I < J and write $(h_1, \ldots, h_n) := J - I$. Then

$h_\ell = 0$ for all $\ell < k$    (4.4)

for some k, and the first non-zero element $h_k$ will be positive. However, subsequent entries could have either sign. Symbolically, we can write the typical difference as

$I < J \implies J - I = (0, \ldots, 0, +, \pm, \ldots, \pm)$.    (4.5)

Such n-tuples have a special role so we give them a special name.
Definition 4.7 An n-tuple I of integers is lexicographically positive if the first non-zero entry is positive, i.e., if $i_k > 0$ and $i_\ell = 0$ for all $\ell < k$, for some $k \ge 1$. In this case we write $I > \vec{0}$, and the index k is called the carrier index of I.
Definition 4.8 Suppose $I := (i_1, i_2, \ldots, i_n)$ and $J := (j_1, j_2, \ldots, j_n)$ are multi-indices with I < J. If there is a dependence between $S_{(i_1, i_2, \ldots, i_n)}$ and $T_{(j_1, j_2, \ldots, j_n)}$, then

$(j_1 - i_1, j_2 - i_2, \ldots, j_n - i_n)$    (4.6)

is called a dependence distance vector.
Example 4.3 In the original loop for the norm evaluation (3.6) the set of distance vectors consists of all positive integers. In the nested loop pair in Program 4.5, the set of distance vectors consists of all multi-indices of the form (i, j) where i is a positive integer. In particular, there is no dependence vector of the form (0, j) for any j.
As noted in (4.4) or (4.5), the difference of lexicographically ordered multi-indices is lexicographically positive. The position of the (positive) first non-zero, i.e., the carrier index, is important for distance vectors, so we give it a special name.
Definition 4.9 The loop corresponding to the first (from the left) nonzero element of a dependence vector is called the carrier of that dependence.
The carrier of a dependence corresponds to the carrier index (Definition 4.7) of the dependence distance vector.
Example 4.4 If we re-write the nested loop pair in Program 4.5, as shown in Program 4.6, then the set of distance vectors consists of all multi-indices of the form (0, j) where j is a positive integer. Thus the second index is the carrier of these dependences. Note that we can execute each of the i loop instances independently (in parallel). Note that Program 4.6 represents a block decomposition of Program 4.3.
Again, we state without proof the main theorem about dependence vectors.
Theorem 4.2 If all dependence vectors have carrier indices greater than j, then the outer-most j loops can be executed in parallel.
      DO 3 i=1,N/k
         summ(i) = x(k*(i-1)+1)*x(k*(i-1)+1)
         DO 2 j=2,k
            summ(i) = summ(i) + x(k*(i-1)+j)*x(k*(i-1)+j)
 2       continue
 3    continue
      sum = 0.0
      DO 4 j=1,N/k
 4       sum = sum + summ(j)

Program 4.6: Block decomposition of the norm evaluation; the final accumulation runs over the N/k partial sums.
      DO 1 I=1,100
         TEMP = I
 1       A(I) = 1.0/TEMP

Program 4.7: Simple code with a dependence caused by the use of a temporary variable.
There is a dependence between the second and third lines, and more importantly, a loop-carried dependence for the entire loop body. However, it is clear that these are not essential dependences. If we write this as in Program 4.8 there is no longer a dependence. The addition of a line of the form

C*private TEMP

to the code in Program 4.7 is intended to produce the equivalent result as if we had explicitly made TEMP a different variable for each index as is done in Program 4.8.
      DO 1 I=1,100
         TEMP(I) = I
 1       A(I) = 1.0/TEMP(I)

Program 4.8: The code of Program 4.7 with the dependence removed by expanding the temporary variable into an array.
Privatizing a variable in a loop is therefore a very simple concept. However, it is clear that not all variables can be made private without destroying the correctness of the loop. The following simple results regard correct privatization of variables.
If a variable V is not in the write-set of any of the statements in a loop, then it cannot cause a loop-carried dependence due to Lemma 4.1. Thus when discussing privatization of variables, we are only concerned with ones which do appear in some write-set (that is, in an assignment). If such a variable V is in the write-set of some statement, but it is only in the read set in the first executed statement in the loop in which it occurs, then it cannot be safely privatized. This is because the first read in the I-th iteration will be referring to the value set in the (I - 1)-st iteration. Note that the first statement in which a variable occurs may not be the first executed statement, since it may be preceded by a conditional branch.
We do not attempt here a statement of a complete theorem regarding the possibility of privatization, but suffice it to say that it is not possible to get necessary and sufficient conditions based on static code analysis alone. Consider the code shown in Program 4.9. Let us assume that PARITY is a user-supplied function that (correctly) computes whether a number is even or odd, and returns 0 or 1 accordingly. Then we can prove that TEMP will always be initialized before its use in the statement labeled "1" but a compiler would have to assume that there might be cases when PARITY is neither 0 nor 1, so that "1" becomes the first executed statement in which TEMP occurs.
      DO 1 I=1,100
         IF( PARITY(I) .EQ. 1) THEN
            TEMP = I
         ENDIF
         IF( PARITY(I) .EQ. 0) THEN
            TEMP = 0
         ENDIF
 1       A(I) = TEMP
Program 4.9: Simple code with apparent difficulty for automatic privatization.
To indicate that a variable is to be considered private to each loop iteration, we will use a notation such as

C*private TEMP

which can be read as a comment (in Fortran) if not being interpreted as a privatization command. Many compilers for shared-memory multi-processors recognize a directive like this. We take it to mean that TEMP is private regardless of correctness.
4.3 Dependence Examples
We now review some of the numerical examples introduced in Chapter 1 and consider dependences related to such codes. In this section, we consider mainly ones in which "privatization" (Section 4.2.3) of temporaries can remove dependences.
The symmetric, tridiagonal matrix for a two-point boundary-value problem on a general mesh (1.24) of mesh size h_i = H(I) is initialized by the code shown in Program 4.10. Here DIAG is an array holding the diagonal terms a_{i,i} in (1.24), and OFFDIAG is an array holding the off-diagonal terms a_{i,i+1}. We assume the array H(I) has been previously computed. Due to symmetry of the matrix, only one off-diagonal array is necessary. Since each array element on the left-hand side of the assignments (denoted by the "=" sign) depends only on data previously computed, each element can be computed independently of all others (see Exercise 4.4).
for I=1,N
DIAG(I) = (1.0/H(I)) + (1.0/H(I+1))
OFFDIAG(I) = - (1.0/H(I+1))
endfor(I)
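Program 4.10: Simple code for initialization of finite difference matrix.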
In the version shown in Program 4.11, which computes the same values using temporary variables, there is a loop-carried dependence (see Exercise 4.5) because the temporary variables DXO and DXI will be overwritten by different loop iterations. Declaring DXO and DXI to be "private" will remove the loop-carried dependences in the code while preserving its correctness (see Exercise 4.6).
for I=1,N
DXO=1.0/H(I)
DXI=1.0/H(I+1)
DIAG(I) = DXO + DXI
OFFDIAG(I) = - DXI
endfor(I)
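Program 4.11: Code for initialization of finite difference matrix using temporary variables.

The number of divisions can be reduced by reusing the reciprocal computed in the previous iteration; this is the code that Program 4.13 below reduces to when P = 1:

DXI = 1.0/H(1)
for I=1,N
   DXO = DXI
   DXI = 1.0/H(I+1)
   DIAG(I) = DXO + DXI
   OFFDIAG(I) = - DXI
endfor(I)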
Program 4.12: More complex code for initialization of finite difference matrix.
These dependences can be removed correctly by expanding the loop by hand. We subdivide the iteration into P parts and initialize each one separately, as shown in Program 4.13. This code computes the same set of values, although it requires P - 1 additional divisions. If DXO and DXI are declared as "private" in the outer loop then there are no loop-carried dependences in the "for IP" loop, and the code is correct (see Exercise 4.8). Program 4.13 represents a block decomposition (see Definition 1.8) of Program 4.12. For P = 1, Program 4.13 and Program 4.12 are identical.
for IP=0,P-1
DXI=1.0/H(IP*(N/P)+1)
for I=IP*(N/P)+1, IP*(N/P) + N/P
DXO = DXI
DXI=1.0/H(I+1)
DIAG(I) = DXO + DXI
OFFDIAG(I) = - DXI
endfor(I)
endfor(IP)
Program 4.13: Explicitly parallel code for initialization of finite difference matrix.
4.4 Dependence Tests
Potential loop-carried dependences can be caused by the use of indirection arrays. Suppose l1, l2, ..., ln are loop indices, and we have a code containing the following statement:
(S)   A(F(l1,...,ln)) = somefun( A(G(l1,...,ln)) )

A dependence involving (S) requires that
\[ F(i_1, i_2, \ldots, i_n) = G(j_1, j_2, \ldots, j_n) \qquad (4.7) \]
for some values of the loop indices i_1, i_2, \ldots, i_n and j_1, j_2, \ldots, j_n which are within loop bounds. Such equations are called dependence equations.
There are no dependences possible involving (S) if (4.7) can be shown to have no solutions for the given array expressions. Solutions to (4.7) must satisfy loop bounds and be integers. We can solve such inequality-constrained discrete equations a priori if F and G are simple enough. However, in general the solution set of such equations is difficult to analyze. Many dependence tests are based on the following theorem [35]:
Theorem 4.3 Suppose F and G are continuous. Then the equation F(x) = G(y) has a real solution x, y satisfying given bounds if and only if
\[ \min_{x,y}(F(x) - G(y)) \le 0 \le \max_{x,y}(F(x) - G(y)), \]
where the minimum and maximum are taken over x and y satisfying the bounds.
Consider the simple code shown in Program 4.14. Suppose, for example, that
\[ F(l) = a_0 + a_1 l \quad\text{and}\quad G(l) = b_0 + b_1 l. \qquad (4.8) \]
Then a dependence requires integer solutions i and j of
\[ a_0 + a_1 i = b_0 + b_1 j, \qquad (4.9) \]
that is,
\[ a_1 i - b_1 j = b_0 - a_0. \qquad (4.10) \]
The GCD test determines that there is no dependence if \gcd(a_1, b_1) does not divide b_0 - a_0. We summarize this as Theorem 4.4.
      do l = 1,n
(S)      A(F(l)) = somefun( A(G(l)) )
      enddo

Program 4.14: Simple code with potential loop-carried dependences.
Theorem 4.4 Suppose F and G are given by (4.8). Then the code (S) in Program 4.14 does not have a loop-carried dependence if \gcd(a_1, b_1) does not divide b_0 - a_0.
As an example, consider the simple code in Program 4.15. We can never have
\[ 2k = 2\ell + 3 \]
for any integers k and \ell, since this would imply that two divides three. Thus the GCD test allows us to conclude that there are no dependences possible in Program 4.15.
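The GCD test is easy to mechanize; here is a minimal sketch in Fortran (our code, assuming a1 and b1 are not both zero), which returns .TRUE. exactly when Theorem 4.4 precludes a dependence:

      LOGICAL FUNCTION NODEP(A0, A1, B0, B1)
C     GCD test (Theorem 4.4) for A(a0+a1*l) and A(b0+b1*l):
C     no dependence is possible when gcd(a1,b1) does not
C     divide b0 - a0.  Assumes a1, b1 are not both zero.
      INTEGER A0, A1, B0, B1, G, R, T
      G = ABS(A1)
      R = ABS(B1)
C     Euclidean algorithm: afterwards G = gcd(|a1|,|b1|)
 10   IF (R .NE. 0) THEN
         T = MOD(G, R)
         G = R
         R = T
         GOTO 10
      ENDIF
      NODEP = MOD(B0 - A0, G) .NE. 0
      RETURN
      END

For Program 4.15 we have a_0 = 0, a_1 = 2, b_0 = 3, b_1 = 2, and NODEP(0, 2, 3, 2) yields .TRUE., since \gcd(2, 2) = 2 does not divide 3.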
The GCD test is for unconstrained equality; that is, it does not include the effects of loop bounds, and so is only a sufficient test to preclude dependences, not a necessary condition. That is, we can have \gcd(a_1, b_1) \mid b_0 - a_0 but still have no solutions satisfying the loop bounds.
The GCD test can be overly pessimistic regarding potential dependences since it ignores loop bounds. A diametrically opposed dependence test is based on checking only the loop bounds, but ignoring whether or not the solutions are integers. If
\[ F(x) - G(y) = 0 \qquad (4.11) \]
do l = 1,n
A(2*l) = A(2*l+3) + 1
enddo
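Program 4.15: Simple code with no dependences possible, by the GCD test.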
has a real solution x and y satisfying the loop bounds, then Theorem 4.3 shows that
\[ \min_{x,y}(F(x) - G(y)) \le 0 \le \max_{x,y}(F(x) - G(y)). \qquad (4.12) \]
For a loop-carried dependence in Program 4.14, x and y must moreover correspond to different iterations, say
\[ 1 \le x < n, \quad 1 < y \le n \quad\text{and}\quad x < y. \qquad (4.13) \]
Write a^+ = \max(a, 0) and a^- = \max(-a, 0) for the positive and negative parts of a. With F and G as in (4.8), a computation shows that, subject to (4.13),
\[ \min_{x<y}(F(x) - G(y)) = a_0 + a_1 - b_0 - 2b_1 - (a_1^- + b_1)^+ (n - 2), \]
\[ \max_{x<y}(F(x) - G(y)) = a_0 + a_1 - b_0 - 2b_1 + (a_1^+ - b_1)^+ (n - 2). \qquad (4.14) \]
Therefore, by Theorem 4.3, a real solution satisfying (4.13) can exist only if
\[ -(a_1^- + b_1)^+ (n - 2) \le b_0 + 2b_1 - a_0 - a_1 \le (a_1^+ - b_1)^+ (n - 2). \qquad (4.15) \]
Banerjee's dependence test consists of determining whether (4.15) holds. If it does not, there can be no dependence. We summarize this as Theorem 4.5.
Theorem 4.5 The code (S) in Program 4.14, where F and G are given by (4.8), does not have a loop-carried dependence if (4.15) does not hold.
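Like the GCD test, Banerjee's test is a short computation; a minimal sketch (our code and naming) returns .TRUE. exactly when inequality (4.15) fails, so that Theorem 4.5 precludes a dependence:

      LOGICAL FUNCTION NODEPB(A0, A1, B0, B1, N)
C     Banerjee test (Theorem 4.5): no loop-carried dependence
C     in Program 4.14 when (4.15) fails, i.e., when the middle
C     quantity M falls outside the interval [LO, HI].
      INTEGER A0, A1, B0, B1, N, M, LO, HI
      M  = B0 + 2*B1 - A0 - A1
      LO = -MAX(MAX(-A1, 0) + B1, 0) * (N - 2)
      HI =  MAX(MAX( A1, 0) - B1, 0) * (N - 2)
      NODEPB = (M .LT. LO) .OR. (M .GT. HI)
      RETURN
      END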
The general problem of determining whether (4.7) has integer solutions falls in the mathematical realm of number theory. See [87] for results that generalize both Theorem 4.4 and Theorem 4.5 by combining a characterization of solutions of (linear) Diophantine equations together with inequalities coming from loop bounds. Exercise 4.22 gives an example involving a quadratic Diophantine equation in which one can prove a dependence by exhibiting a simple solution to a well known quadratic equation. The question of deriving an algorithm for solving nonlinear Diophantine equations is Hilbert's Tenth Problem [67], which is in general unsolvable. See Exercise 4.23 for an exotic application of number theory to the study of dependences.
4.5 Loop Transformations
Recall (Definition 1.7) that the iteration space (or iteration set) of a group of nested loops is the set of index tuples which satisfy the loop bounds. Each point represents the actions of the body of the loop corresponding to the coordinates of the point. When the loops are normalized (Definition 4.4), as we now assume, the points of the iteration space are multi-indices. The iteration space of any double loop of the form in Program 4.16 is
\[ \{ (i, j) \in \mathbf{Z}^2 : 1 \le i \le j \le 10 \}. \]
Points of iteration sets can serve as points of a dependence graph if we only care about the loop body as a whole. That is, each point in the graph corresponds to the entire loop body for a given tuple
do I = 1 , 10
do J = I , 10
...
enddo
enddo
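Program 4.16: A double loop with a triangular iteration space.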
Figure 4.2: Dependence graph for the double loop in Program 4.17 with ten iterations in each loop.
of loop indices. This is a coarse view of possible dependences, but in many cases it is sufficient. We can indicate the loop-carried dependences by drawing any distance vectors (Definition 4.8) from each iteration multi-index to its corresponding iteration multi-index where the dependence occurs. For example, consider the code shown in Program 4.17. Figure 4.2 shows the corresponding dependence graph for the case m = n = 10, cf. Exercise 4.13. It is interesting to note that there are only two distance vectors in each case, \(\binom{1}{0}\) and \(\binom{0}{1}\). This is typical for loops with very regular structure. Unfortunately, each distance vector has a different carrier index, so no parallelism is available directly (see Theorem 4.8).
do i = 1, m
do j = 1, n
A( i, j ) = A( i , j-1 ) + 1
B( i, j ) = B( i-1, j ) + 1
enddo
enddo
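Program 4.17: Double loop with two loop-carried dependences.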
Consider new loop indices ii and jj defined by
\[ \begin{pmatrix} jj \\ ii \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} i \\ i + j \end{pmatrix}, \]
so that
\[ \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} jj \\ ii \end{pmatrix} = \begin{pmatrix} jj \\ ii - jj \end{pmatrix}, \]
using the matrix \(\bigl(\begin{smallmatrix} 1 & 0 \\ -1 & 1 \end{smallmatrix}\bigr)\), which is the inverse of \(\bigl(\begin{smallmatrix} 1 & 0 \\ 1 & 1 \end{smallmatrix}\bigr)\). In terms of the new loop indices, the loop becomes as shown in Program 4.18. The dependence graph for the transformed loop is depicted in Figure 4.3.
do jj = 1,m
do ii = 1 + jj, n + jj
A( jj, ii-jj ) = A( jj , ii-jj-1 ) + 1
B( jj, ii-jj ) = B( jj-1, ii-jj ) + 1
enddo
enddo
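Program 4.18: The code of Program 4.17 in terms of the transformed loop indices ii and jj.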
The distance vectors of the transformed loop are obtained by applying the transformation matrix to the original distance vectors:
\[ \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (4.16) \]
The first of these dependence vectors is easier to identify if we write the second line of the loop as
B( jj, ii-jj ) = B( jj-1, (ii-1) - (jj-1) ) + 1 .
If we want the transformed loop to be normalized, we must require loop transformation matrices to have inverses with integer entries. Otherwise the resulting loop indices would take on non-integer values (see Exercise 4.14). Such matrices are sometimes called unimodular. We recall the following result on unimodular matrices (see Exercise 4.15):
Theorem 4.6 An integer matrix T has an integer inverse if and only if |\det(T)| = 1.
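For 2 x 2 integer matrices, the criterion of Theorem 4.6 is a one-line determinant check; a minimal sketch (our code and naming):

      LOGICAL FUNCTION UNIMOD(T11, T12, T21, T22)
C     Theorem 4.6: an integer matrix has an integer inverse
C     (i.e., is unimodular) exactly when |det T| = 1.
      INTEGER T11, T12, T21, T22
      UNIMOD = ABS(T11*T22 - T12*T21) .EQ. 1
      RETURN
      END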
It is simple to describe the unimodular matrices for standard loop transformations. For example, interchanging the loops in a nested pair of loops is achieved by multiplying by the matrix
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (4.17) \]
Applying this matrix to the transformed loop above results in the same thing as doing one transformation using the matrix product of the two matrices, i.e.,
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (4.18) \]
The resulting distance vectors would be \(\binom{1}{1}\) and \(\binom{1}{0}\), and the corresponding dependence graph would just be the reflection of Figure 4.3 with respect to the diagonal ii = jj. The resulting transformed code appears in Program 4.19. It now becomes apparent (Exercise 4.21) that the inner loop can be executed in parallel, as it has no loop-carried dependences. This can also be determined from the reflected version of Figure 4.3. All distance vectors point away from the lines of constant ii (with jj varying). Moreover, the inner loop is not a carrier for either dependence distance vector (see Theorem 4.8).
Figure 4.3: Dependence graph for a transformed double loop with ten iterations in each loop.
do ii = 2 , m + n
do jj = max{ 1 , ii - n }, min{ ii - 1 , m }
A( jj, ii-jj ) = A( jj , (ii-1) - jj ) + 1
B( jj, ii-jj ) = B( jj-1, (ii-1) - (jj-1) ) + 1
enddo
enddo
Program 4.19: Another transformation of the code in Program 4.17 which eliminates loop-carried dependences in the inner loop.
A linear loop transformation T maps the set of distance vectors D to the set TD = \{ Td : d \in D \}. In our example in Program 4.17, we started with a loop having distance vectors
\[ D = \left\{ \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \qquad (4.19) \]
Using the final transformation (4.18) in the previous section, T = \(\bigl(\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\bigr)\), (4.19) becomes
\[ TD = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \]
Theorem 4.7 A linear loop transformation produces correct code if and only if every dependence vector of the original (normalized) loop is transformed into an integer vector that is lexicographically positive (Definition 4.7).

A principal use of linear loop transformations is to yield parallel code. The following theorem (whose proof we also omit) tells us when this is possible.
Theorem 4.8 In a loop nest with index vector I = (i_1, i_2, \ldots, i_n) and dependence distance vector set D, we can run the k-th loop from the outer-most (with index i_k) in parallel if and only if the k-th loop is not the carrier of any dependence.
The condition of Theorem 4.8 is satisfied if and only if for every d \in D, k is not the carrier index (see Definition 4.7) for d. This is true if, for all d \in D, either d_j > 0 for some j < k or d_k = 0.
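This characterization translates directly into code; the following sketch (our code and naming, with one distance vector per column of D) reports whether loop k may run in parallel:

      LOGICAL FUNCTION KPAR(D, N, M, K)
C     Theorem 4.8: loop K (counting from the outer-most) can run
C     in parallel iff no column of D has carrier index K.
C     D is N x M: N loop indices, M distance vectors.
      INTEGER N, M, K, D(N,M), I, J, C
      KPAR = .TRUE.
      DO 2 J = 1, M
C        C = carrier index of column J (first nonzero from the top)
         C = 0
         DO 1 I = 1, N
            IF (C .EQ. 0 .AND. D(I,J) .NE. 0) C = I
 1       CONTINUE
         IF (C .EQ. K) KPAR = .FALSE.
 2    CONTINUE
      RETURN
      END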
In our example, we had three sets of distance vectors:
\[ D_1 = \left\{ \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}, \quad D_2 = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\}, \quad D_3 = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \qquad (4.20) \]
For the first two sets (D_1 and D_2), there are carriers in both loops. The carrier index for \(\binom{1}{0}\) is one, the carrier index for \(\binom{0}{1}\) is two, and the carrier index for \(\binom{1}{1}\) is again one. In the last one (D_3), there is no carrier in the second (inner) loop; the carrier index for both distance vectors is one.
In the case m = n = 10 of Program 4.17, the original loop bounds are
\[ i_{\min} = 1, \quad i_{\max} = 10, \quad j_{\min} = 1 \quad\text{and}\quad j_{\max} = 10, \qquad (4.21) \]
and correspondingly, for the transformed loop in Program 4.18,
\[ jj_{\min} = 1, \quad jj_{\max} = 10, \quad ii_{\min} = 2 \quad\text{and}\quad ii_{\max} = 20; \]
that is, 1 \le jj \le 10 and 1 + jj \le ii \le 10 + jj.

4.6
In a time-stepping scheme (e.g. (1.16) or (1.18)) for an ordinary differential equation (1.11), one cannot parallelize so easily. Consider the code in Program 4.20 for the explicit Euler method (1.18) in the special case that f is independent of t. We cannot parallelize Program 4.20 in a simple way because the computation of the I-th iteration requires the previous iteration. The simplest alternative is to use a different time-stepping scheme as described in Section 1.5.1 (see (1.19)). In the special case that f is linear and independent of t, we will see that there are alternative parallelizations in Section 13.3.
OLDSOL = SOMETHING
for I=1,N
SOLN(I) = OLDSOL + H(I)*F(OLDSOL)
OLDSOL = SOLN(I)
endfor(I)
Program 4.20: Simple code for an ordinary differential equation, where H(I) is the mesh size of the I-th interval.
Compared with the Jacobi iteration in Program 4.21, the Gauss-Seidel iteration in Program 4.22 is much more efficient and easier to program, but (unfortunately) essentially sequential.
for I=1,N
SNEW(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
for I=1,N
SOLN(I) = SNEW(I)
endfor(I)
Program 4.21: Simple code for Jacobi iteration for a two-point boundary value problem for an ordinary differential equation.
for I=1,N
   SOLN(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
Program 4.22: Simple code for Gauss-Seidel iteration for a two-point boundary value problem for an ordinary differential equation.
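Since neither loop in Program 4.21 carries a dependence, the Jacobi sweep parallelizes directly; a minimal sketch with OpenMP directives (our code; the array bounds are an assumption about the boundary handling):

      SUBROUTINE JACOBI(SOLN, SNEW, F, DIAG, OFFDIAG, N)
C     One Jacobi sweep of Program 4.21 with both dependence-free
C     loops marked parallel.
      INTEGER N, I
      REAL SOLN(0:N+1), SNEW(N), F(N), DIAG(N), OFFDIAG(0:N)
!$OMP PARALLEL DO
      DO 1 I = 1, N
         SNEW(I) = (F(I) - SOLN(I+1)*OFFDIAG(I)
     &            - SOLN(I-1)*OFFDIAG(I-1)) / DIAG(I)
 1    CONTINUE
!$OMP PARALLEL DO
      DO 2 I = 1, N
         SOLN(I) = SNEW(I)
 2    CONTINUE
      RETURN
      END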
4.6.2 Pipelining
Because the matrix is tightly banded, a "pipelining" approach can be used to achieve some parallelism for the Gauss-Seidel iteration. Suppose the overall iteration is Program 4.23. Note we have changed the iteration limits slightly, as well as including explicitly the outer iteration. The dependence vectors for Program 4.23 are \(\binom{1}{-1}\) and \(\binom{0}{1}\). Since there is a carrier in both loops, we cannot execute it in parallel. However, the loop transformation
\[ \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \qquad (4.22) \]
maps the distance vectors to \(\binom{1}{0}\) and \(\binom{1}{1}\). As we have seen before, the carrier index of both of these vectors is one. This means that the inner loop in the correspondingly transformed code can be executed in parallel.
The inverse loop transformation to (4.22) is
\[ \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix} \qquad (4.23) \]
for K=0,somelimit
for I=0,N
SOLN(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
endfor(K)
Program 4.23: Complete code for Gauss-Seidel iteration with explicit outer iteration.
and allows us to write I = 2*II - KK, where II and KK are the new loop variables. Thus the code in Program 4.23 takes the form shown in Program 4.24.
The loop bounds also must be transformed. Using the geometric approach from Section 4.5.2, we see that the new iteration set is a parallelogram with vertices at
(0,0), (N,N), (2*somelimit,somelimit) and (2*somelimit+N,somelimit+N).
Thus KK ranges from zero to 2*somelimit+N, and II ranges within the parallelogram. The limits on II can be determined from the algebraic approach in Section 4.5.2 as follows. Let i' denote the numerical values of II, let k' denote the numerical values of KK, and let s denote somelimit. We have
\[ 0 \le 2i' - k' \le N \quad\text{and}\quad 0 \le k' - i' \le s. \]
These inequalities are equivalent to
\[ k' \le 2i' \le N + k' \quad\text{and}\quad k' - s \le i' \le k'. \]
Combining, we find
\[ \max\{ \tfrac{1}{2}k', \, k' - s \} \le i' \le \min\{ \tfrac{1}{2}(N + k'), \, k' \}, \]
which provides the limits for the II loop in Program 4.24.
for KK = 0, 2*somelimit + N
   for II = max((KK+1)/2, KK-somelimit), min((N+KK)/2, KK)
      SOLN(2*II-KK) = (F(2*II-KK) - SOLN(2*II-KK+1)*OFFDIAG(2*II-KK)
                      - SOLN(2*II-KK-1)*OFFDIAG(2*II-KK-1) )/DIAG(2*II-KK)
   endfor(II)
endfor(KK)
Program 4.24: Transformed code for Gauss-Seidel iteration with explicit outer iteration.
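Since the II loop carries no dependence, it may be marked parallel; a minimal sketch with an OpenMP directive (our code; SOMELIM stands for somelimit, and the array bounds are an assumption about the boundary handling):

      SUBROUTINE PIPEGS(SOLN, F, DIAG, OFFDIAG, N, SOMELIM)
C     Pipelined Gauss-Seidel as in Program 4.24: the outer KK loop
C     is sequential, the inner II loop is dependence-free.
      INTEGER N, SOMELIM, KK, II
      REAL SOLN(-1:N+1), F(0:N), DIAG(0:N), OFFDIAG(-1:N)
      DO 2 KK = 0, 2*SOMELIM + N
!$OMP PARALLEL DO
         DO 1 II = MAX((KK+1)/2, KK-SOMELIM), MIN((N+KK)/2, KK)
            SOLN(2*II-KK) = (F(2*II-KK)
     &         - SOLN(2*II-KK+1)*OFFDIAG(2*II-KK)
     &         - SOLN(2*II-KK-1)*OFFDIAG(2*II-KK-1)) / DIAG(2*II-KK)
 1       CONTINUE
 2    CONTINUE
      RETURN
      END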
The transformed code in Program 4.24 is referred to as a pipelined version of Program 4.23 because we start the computation of later iterations of the original K index as soon as the information required is available. Otherwise said, computed information is "piped" into the later iterations as needed. There are at least two drawbacks to this approach. A small one is that the amount of parallelism is variable (the size of the II loop); however, this only leads to a load balancing problem (see Definition 1.10). A more serious problem is the fact that somelimit is often determined adaptively by some termination criterion on the vector SOLN. This would typically be done after a complete iteration of the original I loop.
4.7 Chapter Comments
In Chapter 8, the High Performance Fortran (HPF) language will be introduced. This language has a directive (see Chapter 9) which asserts independence of loops. For example, consider the following program fragment:
      integer a(100), offset
      read(5,*) offset
!HPF$Independent
      do i = 1, 10
         a(i) = b(i) + a(i+offset)
      enddo
The !HPF$Independent directive claims that the iterations of the loop can be performed in any order. This is true only if offset \ge 10. For executions where offset < 10, this program will not be HPF conforming and will produce potentially erroneous results. Although a program model can assert various properties, such as freedom from deadlock and sequential consistency, it is sometimes possible to sidestep the model's intent with unpleasant consequences.
4.8 Exercises
Exercise 4.1 Consider the two code fragments in Program 4.1:

      z = 3.14159         z = 3.14159
C:    x = z + 3           z = z + 3
D:    y = z + 6           y = z + 6

Prove that there is no backward dependence (4.2) and no output dependence (4.3) between C and D, where C denotes the second line and D the third, in the codes in either column. (Hint: see Example 4.1.)
Exercise 4.2 Suppose the loop bounds, instead of (4.13), take the form
\[ 1 \le x < m \quad\text{and}\quad 1 < y \le n. \]
Derive the corresponding form of the Banerjee inequality (4.15).
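Exercise 4.3 Normalize (see Definition 4.4) the following loops: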
      do 1 I=3,9,2
 1       A(I) = B(I+1)

      do 1 I=9,2,-1
 1       B(I) = C(I+1)
Exercise 4.4 Prove that there are no loop-carried dependences in the code in Program 4.10.
Exercise 4.5 Determine all of the loop-carried dependences in the code in Program 4.11. Give all of the details of your derivation (i.e., give a complete proof).
Exercise 4.6 Suppose that DXO and DXI are declared local in the "for I" loop in Program 4.11. Prove there are no loop-carried dependences in the resulting code. (Hint: write the code with loop indices added to DXO and DXI and study the resulting dependences.)
Exercise 4.7 Prove that the codes in Program 4.12 and Program 4.13 produce exactly the same values for the arrays DIAG and OFFDIAG. (Hint: use induction to show both codes are equivalent to Program 4.10.)
Exercise 4.8 Suppose that DXO and DXI are declared local in the "for IP" loop in Program 4.13. Prove there are no loop-carried dependences in the resulting code, and that the code is correct. (Hint: write the code with loop indices added to DXO and DXI and study the resulting dependences.)
Exercise 4.9 Prove that an n-tuple I of integers is lexicographically positive if and only if I > \vec{0}, where \vec{0} denotes the n-tuple consisting of n zeros.
Exercise 4.10 Prove that any n-tuple I of integers is either lexicographically positive, lexicographically negative (i.e., I < \vec{0}) or \vec{0}. Prove that I is lexicographically positive if and only if -I is lexicographically negative. (Hint: characterize what it means to be lexicographically negative, cf. Definition 4.7.)
Exercise 4.11 Use the GCD test (Theorem 4.4) to determine whether there are any possible dependences in the code
do l = 1,n
A(9*l) = A(6*l+2) + 1
enddo
Exercise 4.14 Prove that a necessary condition for the transformed loop to be normalized is that the inverse matrix for the transformation have integer entries. (Hint: normalized loop indices get incremented by a unit vector at each step. The transforms of this vector correspond to a way of incrementing the transformed loop indices. Write the original loop indices in terms of the transformed loop indices times the inverse matrix. Since the transformed loop indices are normalized, show that a fractional inverse matrix would lead to fractional loop indices in the original variables.)
Exercise 4.15 Prove Theorem 4.6. (Hint: use a formula for the inverse matrix to show that the inverse of an integer matrix with determinant one must be integer. Then write I = T T^{-1} and use the fact that 1 = \det T \det T^{-1}. Note that the determinant of an integer matrix must be an integer.)
Exercise 4.16 Prove that the set of unimodular matrices is closed with respect to multiplication. (Hint: use the fact that \det UT = \det U \det T and Exercise 4.15.)
Exercise 4.17 Find the read and write sets of S(i,j) for the following loop nest:

      do i = 1 , m
         do j = 2 , n-2
S:          a(j+1) = .33 * (a(j) + a(j+1) + a(j+2))
         enddo
      enddo
Find the dependence distance vectors of the loop nest and draw its dependence graph. Which loop is the carrier of the dependences? Can either the i or the j loop be run in parallel? Why?
Exercise 4.18 Find a linear loop transformation for the loop in Exercise 4.17 so that the innermost loop of the transformed loop nest does not carry a dependence (and so can be run in parallel), and draw the dependence graph of the resulting loop nest.
Exercise 4.19 Apply the transformation you found in Exercise 4.18 to the loop nest in Exercise 4.17.
Exercise 4.20 Write a small program to verify that the transformed loop (with m = 3, n = 20 and a(0
DO 1 I = 1, MAX
DO 1 J = 1, MAX
DO 1 K = 1, MAX
A(K**2) = 1/(1 + A(I**2 + J**2) )
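Program 4.25: Code whose potential dependences involve a quadratic Diophantine equation.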
Exercise 4.21 Prove there are no loop-carried dependences in the inner loop in Program 4.19.
Exercise 4.22 Prove there are loop-carried dependences in the code in Program 4.25 for MAX sufficiently large.
Exercise 4.23 Prove there are no loop-carried dependences in the code in Program 4.26 for any value of MAX. (Hint: use Fermat's Last Theorem.)
DO 1 I = 1, MAX
DO 1 J = 1, MAX
DO 1 K = 1, MAX
DO 1 N = 3, MAX
A(K**N) = 1/(1 + A(I**N + J**N) )
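Program 4.26: Code whose potential dependences involve the Fermat equation.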
Exercise 4.24 There is another natural partial order "\le" on multi-indices, which is defined componentwise: (i_1, i_2, \ldots, i_n) \le (j_1, j_2, \ldots, j_n) if and only if i_k \le j_k for all k = 1, \ldots, n.
Exercise 4.25 In the case of n nested loops with exactly n distance vectors, one can try to determine a transformation to put all of the carriers in the first loop as follows. Let D be the matrix whose columns are the distance vectors. Suppose that D is unimodular. We seek a transformation T such that U = TD consists of column vectors, all of which have carrier index equal to one, so that the transformed loops have dependences carried only in the outer-most loop. Determine a matrix U with this property, and prove that the resulting matrix T exists (give some formula). Give conditions under which this defines a unimodular matrix T. Give an example for n = 2.
Exercise 4.26 Extend the GCD and Banerjee dependence tests to nested loops. (Hint: the algebra must be extended to n dimensions.)