3.1 P-M-S Notation
We utilize the "P-M-S" notation [79] to describe key components of a computer system. The letters stand for Processor, Memory and Switch.
Figure 3.1: The PMS notation for a hypothetical cpu with only two registers and two functional units connected by a switch.
Figure 3.2: Cartoon illustration of how a computer "bus" moves data from one component on the bus to another.
Figure 3.3: The PMS notation for the "von Neumann" architecture, with a single memory holding both data and instructions.
Figure 3.4: The PMS notation for the "Harvard" architecture, with separate data and instruction memories attached to the processor (cpu).
memory and processor would in fact go through a switch, just as in the diagram of a simple cpu in Figure 3.1. Even a personal computer (PC) includes a bus which connects the processor, memory and peripherals. The PMS diagram of a basic PC is left as Exercise 3.1.
Figure 3.5: Depiction of a possible mapping between a cache and main memory. The lightly shaded areas in main memory depict regions that a given cache block could "mirror" whereas the dark area depicts the current locations being held by the cache. In some caches, temporary inconsistencies may be allowed between the cache value and the corresponding memory location.
Figure 3.6: The PMS notation for a basic processor with a cache and memory management logic: the cpu (P) and cache (M) are connected by a bus (S) to the main memory (M).
Figure 3.7: The PMS notation for a basic shared-memory multiprocessor architecture.
with the fact that the cache value is more recent than the value in main memory. We do not intend to survey the different techniques for dealing with this here. Similarly, we do not consider in detail what happens on a write when the location is not in cache. But suffice it to say that one can assume that the location is brought into cache on a write and the new value is at least placed at that time in the corresponding cache location.
Strictly speaking, the MMU is another processor (a "P" in the notation) that handles memory requests from the main processor and has the cache as its own separate memory, as shown in Figure 3.6. A key point about a cache is that the contents are frequently destroyed by a memory reference. Since multiple locations in main memory map onto a given cache location, this is a necessary evil.
Another significant point is that caches are often loaded with chunks of contiguous data (an entire cache line or block) whenever there is a cache fault. This is done to increase performance since it is often faster to load several contiguous memory locations at one time than it would be to load them individually, due to the design of current memory chips and other factors. Loading several nearby memory locations at one time is motivated by the assumption of locality of reference, namely that if one memory location is read now, then other nearby memory locations will likely be read soon. Thus, loading a cache line is presumed to be fetching useful data in advance.
3.2 Shared-memory multiprocessors
We now use the PMS notation to describe one of the two principal types of parallel computer architectures. The shared-memory multiprocessor was the first parallel system commercially available, and it is by far the most successful from a commercial point of view. Essentially every workstation vendor now offers shared-memory multiprocessor versions of their advanced workstations, and it is common to find commodity multiprocessor PC's in use.
The bus, which is the switch in this design, can be a bottleneck since all memory references must flow through it. The following definition makes this notion more precise.
Definition 3.1 We say there is contention for a switch (or more precisely for the link between the switch and memory) if there are two or more processors trying to read from or write to memory. More generally, we can define contention for any device to be when two or more agents attempt to use it simultaneously.
Performance estimates can be given based on this simple model once we know basic quantitative information about the performance of individual parts. Let us assume that each processor can do F binary operations per unit of time. Let us assume that generically these would require two loads from memory and a store to memory at the completion of the operation. If there is no memory at all in each processor, then every operation would presumably involve three memory references. If the bus can transfer a maximum of W words of data per unit of time, then the full potential of the shared-memory architecture would be realized only if $W \ge 3PF$, where P is the number of processors. This means that there would be a limit in efficiency unless

$P \le \frac{W}{3F}$.    (3.1)
We can quantify the effect of the bottleneck by estimating the time for execution of a hypothetical computation for large P. Once the bus has saturated, so that W words per time unit are being transmitted, then at most W/3 operations can be executed per time unit, no matter how many processors there are. So the parallel time $T_P$ to do a total of N of such operations could not be less than 3N/W. We pause to record this observation as the following slightly more general result.
Theorem 3.1 Assume that an arithmetic computation requires $\ell$ memory references (loads from memory or stores to memory), and that the bus can transfer a maximum of W words of data per unit of time. Then the parallel time $T_P$ to do a total of N of such computations on a basic shared-memory multiprocessor could not be less than $\ell N/W$ times the time to do one of them, regardless of the number of processors P.
On a single processor, suppose we can do F binary operations per time unit (assume the bus is fast enough to allow this rate). So the sequential time to do a total of N of such operations would be $T_1 = N/F$. Therefore the speed-up with P processors must be bounded by

$S_P \le \frac{W}{3F}$    (3.2)

no matter how large P becomes. In general, we would have the following result.
Theorem 3.2 Assume that an arithmetic computation requires $\ell$ memory references (loads from memory or stores to memory), that F of them can be done per unit of time, and that the bus can transfer a maximum of W words of data per unit of time. Then the parallel speed-up $S_P$ to do a total of N of such computations on a basic shared-memory multiprocessor is limited by

$S_P \le \frac{W}{\ell F}$.    (3.3)
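As a sketch of how the bound (3.3) might be applied (the machine parameters below are hypothetical, not measurements from any system in this chapter), one can simply evaluate $W/(\ell F)$:

      program busbound
c     Sketch: evaluate the speed-up bound S_P <= W/(l*F) of (3.3)
c     for hypothetical machine parameters.
      real*8 f, w, bound
      integer l
      f = 1.0d9
      w = 2.0d8
      l = 3
      bound = w / (dble(l) * f)
      print *, 'speed-up is bounded by', bound
      end

With these made-up numbers the bound is well below one, i.e., the bus would keep even a single processor from running at full speed; this is one motivation for the local caches introduced in Section 3.2.2.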
Figure 3.8: The PMS notation for a shared-memory multiprocessor with a local cache for each processor.
3.2.2 Adding a local cache
Typical numbers for commodity microprocessors today would have F measured in billions of instructions per second, whereas buses able to transmit billions of words of data per second are rare. So a typical value for W/F could be significantly less than one. (See Exercise 3.3 for an example.) Thus real systems often interpose a local memory (e.g. a cache, Section 3.1.2) as shown in Figure 3.8. This changes completely the performance assessment made previously, since it could be the case that substantial computations are done only on data residing completely in cache, with little traffic over the bus. As a general rule, algorithms with a larger $W_M$ will favor such machines.
Although this architecture is more difficult to model at an abstract level, we will see that assessments can be made for particular algorithms. Moreover, this architecture has proved to be a huge commercial success. Almost all of the workstation-class computers sold today have a parallel version with an architecture essentially of this type. Even PC's have evolved to incorporate such a multiprocessor design. These are often referred to as symmetric multiprocessors (SMP) reflecting the fact that all of the processors have the same role. This is meant to be in contrast to a "master-slave" relationship among processors. Symmetry is also a feature of many distributed-memory parallel computers (see Section 3.3), so it is not a distinguishing feature. The SMP acronym of course also could stand for "shared-memory multiprocessor" which would be a better characterization.
Although the introduction of a local cache at each processor increases the potential performance of the overall multiprocessor, it also complicates the design substantially. The major problem is insuring cache coherence, that is the agreement of cache values which are mirroring the same values in shared memory: Coherency addresses the issue of what memory is updated, whereas consistency addresses when a memory location is updated. It is well beyond the scope of the book to explain the various techniques used to insure the correctness of cache values, but suffice it to say that all of the designs are done to insure that the multiple caches behave as if there were no cache at all, except hopefully everything works faster in the typical case. The interested reader is referred to [32, 61].
The SGI Origin series, a type of distributed-shared memory computer, provides hardware support for coherent access of a global address space utilizing all memory. However, not all parallel computers implement coherent memories. The Cray "T3" series of massively parallel computers includes hardware support to view all of memory in a global address space (i.e., including memory at other processors) but there is no restriction on the access.
The cache-only memory architecture (COMA) is a shared-memory computer with typically a large local cache (the KSR-1 computer [52] had 32 Megabytes of local cache for each processor). However, the main memory consists simply of the memory in the other processors' caches (see Figure 3.16). Any reference to data not in the local cache of a given processor results in a cache miss, together with a cache replacement in the local cache. As one might imagine, the difficult part of this architecture is to insure cache coherence (Section 3.2, page 54), but the benefit is a shared-memory computer with essentially the largest possible local cache.
3.3 Distributed-memory multicomputers
Bandwidth limitations render a single bus multiprocessor impractical for large P. Such machines have rarely exceeded thirty-odd processors. That, in addition to the limited bandwidth of a single memory module, or even a bank of interleaved memory, motivates the distributed-memory multicomputer. These are called massively-parallel processor (MPP) systems when P is sufficiently large, say a thousand or more. Since it is never clear what "massive" means, we can also take MPP to mean moderately parallel processor. Roughly speaking, the range 32 < P < 1000 might characterize this range. Million processor machines have been announced (such as the IBM Blue Gene machine), so the letter M can be taken to mean all of these things.
We note the distinction between multiprocessor and multicomputer in the definitions of shared-memory and distributed-memory architectures, respectively. The latter is intended to evoke the impression of a collection of computers, that is, a collection of complete computer systems. The term multiprocessor highlights the fact that the processors in a shared-memory multiprocessor are "multi" but the memory is not.
The individual computers in a distributed-memory multicomputer are often called nodes. The nodes in a distributed-memory multicomputer are made of, at a minimum, a processor (cpu) and some memory. They may also have disks connected via SCSI buses, individual ethernet connections, individual serial lines, graphics output devices, and so forth (anything a single computer might have).
Figure 3.9: A distributed-memory multicomputer with nodes (consisting of processor-memory pairs) connected to the network without a separate communication system (store-and-forward approach).
Figure 3.10: A distributed-memory multicomputer with six processors connected by a network with a ring topology.
Figure 3.11: The two possible states for a 2 × 2 cross-bar switch. Dotted lines indicate paths not in use, and solid lines indicate connections being used.
Figure 3.12: A 4 × 4 two-stage interconnect switch using four 2 × 2 cross-bar switches (the grey boxes) as basic building blocks.
Delta prototype machine and then later in the commercial follow-on, the Paragon. These meshes incorporate a wrap-around and thus should be referred to as a torus mesh. The Cray T3D uses a three-dimensional mesh. It is also a (three-dimensional) torus, hence the "T" in the name "T3D".
Many other graphs have been proposed, and some of them have found their way into MPP systems. These include trees, fat trees [60], and hypercubes (see Section 3.3.2). Such graphs (or networks) will also be discussed in Chapter 7 in the context of algorithms for collective, or aggregate, operations.
Another network of current interest is based on using a cross-bar switch. The resulting graph can be thought of as the complete graph on P nodes, i.e., all of the nodes are directly connected. However, the connections in this case can not be accurately represented by a static graph. More precisely, a cross-bar switch can be defined as a switch which can implement simultaneous (contention free) communications between processors i and $\pi(i)$ for all processor numbers i, where $\pi$ denotes an arbitrary permutation. (A permutation $\pi$ of $\{1, \ldots, P\}$ is a one-to-one mapping of this set onto itself. In particular, if $\pi(i) = \pi(j)$ then $i = j$.) Any of the P nodes can be effectively directly connected to any other node at any given time, but the set of simultaneous connections is at most P. For a 2 × 2 cross-bar switch, there are only two states for the switch, as indicated in Figure 3.11.
A similar type of switch effectively allows arbitrary connections through the use of a multi-stage interconnect. The basic building block of a multi-stage interconnect is a cross-bar switch, a 2 × 2 cross-bar switch in the simplest case. In a typical multi-stage interconnect, at most P paths could be in use and conflicts could reduce this number substantially. In particular, not all permutations communicate without contention simultaneously. Figure 3.12 shows a 4 × 4 multi-stage interconnect with two stages based on 2 × 2 cross-bar switches. Figure 3.13 indicates contention between two message routes that is typical in a 4 × 4 multi-stage interconnect.
Cross-bar switches are utilized on both the IBM SP2 machine and on the NEC research prototype Cenju-3 machine [55]. Their networks consist of a multi-stage interconnect switch using 4 × 4 cross-bar switches. Thus it takes only three stages to connect 64 processors, instead of the six stages that would be required using 2 × 2 cross-bar switches. Using larger cross-bar switches as the basic building blocks not only allows fewer stages, but it also decreases the number of communications with contention (see Exercise 3.15).
Figure 3.13: Contention in a 4 × 4 two-stage interconnect switch between messages from 2 to 0 (dark dashed line) and from 0 to 1 (dark solid line) which would have to use the same communication wire at the top (where the dark solid and dashed lines coincide).
Figure 3.14: Hypercubes of dimension d = 2, 3, 4 constructed inductively. Dotted lines indicate links joining matching nodes from two identical copies of the lower-dimensional hypercube.
Corporation, founded in 1983, was one of the first commercial enterprises to make hypercube-connected parallel computers.
in this approach. The first Connection Machine [82] used a variant of this network in which the ring is replaced by a mesh.
Hypercubes have one feature not available with some other networks, in particular rings and meshes. Appropriate sub-cubes of a hypercube are themselves hypercubes of the appropriate dimension. Sub-meshes of a toroidal mesh can never retain the toroidal topology, so algorithms using this feature can have different behavior for sub-meshes.
Another useful feature of hypercubes is that many other networks can be imbedded in them. By a graph embedding we mean a one-to-one mapping of the vertices of one graph into another (i.e., no vertex is mapped to another vertex more than once) in which each edge is mapped to an edge (hence neighboring vertices get mapped to neighboring vertices).
It is possible to imbed a ring in a hypercube using what is known as a Gray code [?]. Such codes are not unique, but are defined by the property of being minimal switching codes in their binary representation. That is, a Gray code is a sequence of integers $g_0, g_1, \ldots$ having the property that the binary representations of $g_i$ and $g_{i+1}$ differ by at most one bit for all i. They were created initially to minimize the energy expended in switching relays from one state to another.
The natural representation of a hypercube has a similar property. That is, if the number of a node has binary representation $i_d i_{d-1} \cdots i_2 i_1$ where the j-th coordinate of the vertex is $i_j$, then the neighbors of any node differ by at most one bit. In particular, they differ precisely in the coordinate direction corresponding to the edge between them. Therefore, any embedding of a ring onto a hypercube naturally defines a Gray code. Similarly, a Gray code provides an imbedding of a ring of $2^d$ nodes into a d-dimensional hypercube.
Lemma 3.3 A Gray code for the (d + 1)-dimensional hypercube can be constructed from one for the d-dimensional hypercube, starting from $g_0 = 0$ and $g_1 = 1$, via

$g_{P+i} = g_{P-1-i} + 2^d$ for $0 \le i \le P - 1$, where $P = 2^d$.    (3.4)
The sequence defined by (3.4) is known as the binary reflected Gray code since it traverses the paired (i - 1)-dimensional hypercubes in a reflected order while traversing dimension i (see Figure 3.14).
Toroidal meshes can be imbedded into hypercubes by viewing them as Cartesian products of rings. That is, a $2^j \times 2^k$ mesh naturally imbeds into a (j + k)-dimensional hypercube, using a pair of Gray codes, one for each coordinate. Similarly, a three-dimensional toroidal mesh of size $2^i \times 2^j \times 2^k$ naturally imbeds into a (i + j + k)-dimensional hypercube.
The time to send a message can be modeled as

$t(m) = \alpha + \beta m$,    (3.5)

where m is the number of words being sent, $\alpha$ is the "latency" corresponding to the cost of sending a null message, i.e. a message of no length, and $\beta$ is the incremental time required to send an additional word in a message. This model does not account for contention in the network that can occur when simultaneous communications are attempted that involve intersecting links in the network.
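A sketch of how the model (3.5) behaves, with hypothetical values of $\alpha$ and $\beta$ (not measurements): for short messages the latency term dominates, and the message length at which the two terms balance, $m = \alpha/\beta$, is a useful rule of thumb.

      program msgtime
c     Sketch: the communication model (3.5), t(m) = alpha + beta*m,
c     with hypothetical latency and per-word transfer costs.
      real*8 alpha, beta, t
      integer m
      alpha = 1.0d-5
      beta  = 1.0d-8
      do 1 m = 1, 10001, 2500
         t = alpha + beta * dble(m)
         print *, m, t
 1    continue
c     the two terms balance at m = alpha/beta words
      print *, 'crossover length:', alpha / beta
      end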
The necessity of latency can be seen from the physical mail system, in which we use envelopes to surround messages. The envelope usually contains little important information and is just discarded once the message arrives. The weight of the envelope can easily exceed the weight of the paper carrying the message.
Figure 3.15: A distributed-memory multicomputer with processors connected to the network with a separate communication system (direct-connect approach); each node has a computational processor, memory, and a communication processor.
There are at least two distinct types of message systems. In the "store and forward" approach, each message is stored temporarily by each processor in the path from the sender to the receiver (see Figure 3.9). This makes the message-passing system simple to implement, as each processor deals with messages in a uniform way. However, it adds to the latency and makes $\alpha$ and $\beta$ depend strongly on the length of the path. Although utilized in first-generation hypercube-connected multiprocessors [43], this approach has been abandoned in favor of the "direct-connect" scheme which effectively employs special processors to handle messaging (Figure 3.15). This is similar to the routing of the long-distance phone system. Once a connection is established, whose route could vary from call to call even for the same pair of phone numbers, the particular routing is held fixed for the duration of the call. In the "direct-connect" message-passing system, a similar connection is established, and then data moves rapidly in an amount of time that is essentially independent of the distance between source and destination. To implement this requires a separate processor and buffer memory to handle the message system.
3.4 Pipelined and vector processors
Parallelism in supercomputers is not a new concept. One can trace this concept quite far back, but we will consider in detail only two of the more recent examples of this. The technique of pipelining was used extensively in the Control Data Corporation (CDC) 6000-series (and later) machines which provided at the time a distinctly higher level of floating point computation. The later Cray Research Corporation computers utilized vector processors to increase the level again.
A pipelined processor is one in which the functional units which execute basic (e.g., arithmetic) instructions are broken into smaller sub-atomic parts which can execute simultaneously [82]. It is analogous to an assembly line in a factory. The operands proceed down a "pipe" which has separate compartments. Once the first sub-task is completed the information is transferred to the next compartment, and the first compartment is now free to be used for a different set of operands. As a result of this design, the speed of operation can be limited only by the time it takes for the longest subtask to complete. In principle, one new operation can be initiated with this frequency, even though previous ones are still in the "pipeline." On the other hand, this also means that it may take several cycles for an operation to complete. Later operations which depend on the result may have to be postponed (through the insertion of no-ops). Thus a greater potential speed is traded in return for an increased complexity. The type of dependence that can cause pipelined computations to perform less than optimally is quite similar to the type of dependence we saw in Chapter 1.
The original pipelined processors (in the CDC 6000 series computers) would initiate at most one floating point operation per cycle. However, if there are multiple functional (e.g., floating point) units (which the CDC 6000 series had), it is conceivable to have all of them initiating instructions at each cycle. Such a design has emerged with the moniker super-scalar. Currently, the fastest microprocessors use this type of architecture.
The immediate successors to the pipelined machines were vector processors which utilize pipelines as the main architecture but make additional restrictions on the sequencing of the functional units, based on the assumption that "vector" operations will be done. In a pipelined system, at each cycle a different floating-point or logical operation could be initiated [82]. By restricting to more limited vector operations, greater performance was achieved, albeit on more restricted types of calculations.
In a super-scalar design, arbitrary sets of operations can be initiated at each clock cycle. This extra flexibility has made this design the architecture of choice for the current generation of computer chips being used in everything from high-end PC's to workstations to the individual processors in the nodes of MPP systems.
It is not our intention to include complete descriptions and analyses of pipelined, super-scalar and vector processors, but rather to note that they can be viewed as parallel processors. A pipelined processor can be working on multiple operations in any one cycle. A super-scalar processor can further initiate work on multiple operations in any one cycle, while continuing to work on many more. A vector processor can appear to apply a single operation to multiple data values (e.g., add two vectors) simultaneously. The granularity of parallelism in these architectures is quite fine, often referred to as instruction-level parallelism. But the essential workings of these architectures can be analyzed in a way that is similar to the more coarse-grained type of parallelism we are considering at length here.
3.5 Comparison of architectures
Figure 3.16 depicts quantitative and qualitative aspects of four different commercial designs. The shaded areas depict the pathway from processors (including local memories, e.g. caches) to main memory. The width of the pathway is intended to indicate relative bandwidths in the various designs. It is roughly proportional to the logarithm of the bandwidth, but no attempt has been made to make this exact. The length of the pathway also indicates to some extent the latencies for these various designs. The vector supercomputer has a relatively low latency, whereas the distributed-memory computer or network of workstations has a relatively high latency.
This comparison allows us to make broad assessments of the different designs. The vector supercomputer has a relatively small local memory (its vector registers) but makes up for this with a very high bandwidth and low latency to main memory. The COMA design (page 54) provides a large local memory without sacrificing the simplicity of shared memory. The shared memory multiprocessor also retains this but does not have a very large local memory. Finally, the distributed-memory computer or a network of workstations has a large local memory similar to the COMA design but cannot match the bandwidth to main memory. Instead of the COMA engine, one simply has a network to transmit messages.
No absolute comparisons can be made regarding the relative performance of the different designs. Different algorithms will perform differently on different architectures. However, the shared memory multiprocessor will not perform well on problems with limited amounts of computation done per memory reference, that is, ones for which the work/memory ratio, $W_M$, is small (see Definition 1.2). Similarly, a network of workstations (Section 3.3.4) cannot perform well on computations that involve a large number of small memory transfers to different processors. We now try to quantify this with particular algorithms.
3.5.1 Summation
We can assess the behavior of different designs quantitatively by considering some basic algorithms. To begin with, consider the norm evaluation

$\sum_{i=1}^{n} x_i^2$    (3.6)
which occurs frequently in scientific computation. This has the same form as the summation problem studied in Section 1.4.1. Here there are n memory references (to the array quantities $x_i$ for $i = 1, \ldots, n$) and 2n floating point operations. This algorithm therefore has a constant work/memory ratio $W_M$ (defined in Definition 1.2); see Exercise 3.12.
If an individual processor can do F floating point operations per time unit and W words per time unit can be transmitted from memory, then the norm requires at least the maximum of n/W and 2n/F time units to be completed. Memory references and computation can frequently be interleaved, so the maximum of these could easily be the execution time. The ratio of the two times, the ratio of communication time to computation time, is F/2W for computing a norm. Note that the ratio F/W has units of floating point operations per memory reference and may be thought of as the number of floating point operations that can be done in the time it takes to store to, or retrieve from, memory one word of data. If F/W is too large, the computer will not be efficient at computing a norm because processors will be idle waiting for data. Similar linear algebraic calculations, such as a dot product (Exercise 3.13), can be analyzed similarly.
Figure 3.16: Comparison of various architectures (the panels show a vector supercomputer with vector registers and cache, a COMA design with no main memory, a bus-based shared-memory multiprocessor, and a distributed-memory computer connected by a switch/network). The shaded areas indicate the bandwidth to memory. Wider paths indicate faster speeds.
In the basic shared-memory computer with P processors each computing norms, the computation time stays the same, since each norm can be done independently of the others, but the communication time becomes P times larger, namely Pn/W, if all of the $x_i$'s are different for each processor. This is because they all must pass through the same memory path, whose bandwidth is fixed. The ratio of the communication time to computation time is PF/2W for computing P norms on a basic shared-memory computer. Since this increases with P, this algorithm does not scale well for this architecture. The addition of a cache at each processor can help, but only if the $x_i$'s are already in cache.
Each entry of the product C = AB of two n × n matrices is a sum

$\sum_{k=1}^{n} a_{ik} b_{kj}$    (3.7)

and therefore involves 2n (or 2n - 1, see Exercise 3.14) floating point operations. Since this must be done $n^2$ times, the total computational work is, say, $2n^3$ floating point operations. On the other hand, only $3n^2$ memory references are involved, $n^2$ each to bring A and B from memory, respectively, and another $n^2$ to store the result, C. The work/memory ratio $W_M$ (Definition 1.2) for this algorithm grows linearly with matrix dimension n (see Exercise 1.2).
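The linear growth of the work/memory ratio is easy to tabulate (a sketch, using only the operation and reference counts just derived):

      program matwm
c     Sketch: work/memory ratio for n x n matrix multiplication,
c     2n^3 operations against 3n^2 memory references, giving
c     W_M = 2n/3, which grows linearly with n.
      integer n
      real*8 work, mem
      do 1 n = 100, 1000, 300
         work = 2.0d0 * dble(n)**3
         mem  = 3.0d0 * dble(n)**2
         print *, n, work / mem
 1    continue
      end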
Figure 3.17: Flynn's taxonomy provides a computational model which differentiates computer architectures based on the flow of information (single or multiple instruction stream versus single or multiple data stream: SISD, SIMD, MIMD).
The ratio of communication time to computation time is 3F/2Wn for computing one matrix multiplication, and 3PF/2Wn for computing P matrix multiplications independently on a basic shared-memory computer. In this case, we see that there is a chance for scalability: P = n processors could be efficiently computing matrix multiplications independently on a basic shared-memory computer, as long as the ratio F/W is small enough.
The considerations are similar for distributed-memory computers. Suppose we consider the case P = 2 and a norm calculation. Let us suppose we divide the work and the array $x_i$ in half and give one-half to each processor. One processor will compute $s_1 = \sum_{i=1}^{n/2} x_i^2$ and the other will compute $s_2 = \sum_{i=n/2+1}^{n} x_i^2$. The two processors will then have to communicate in some way so that $\sum_{i=1}^{n} x_i^2 = s_1 + s_2$ can be determined. This latter step will take at least $\alpha$ units of time, whereas the partial sums require only n/F units of time. The ratio of communication time to computation time is $\alpha F/n$. This is indeed favorable since it becomes small as n increases, but note that it also implies that it is inefficient if n is less than $\alpha F$. The latter number (whose units are "floating point operations" without any time units) measures the number of floating point operations that can be done while waiting for a null message to complete. In some cases, this could be thousands of floating point operations.
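A sketch of this break-even point, again with hypothetical $\alpha$ and F (not measured values): the vector length must well exceed $\alpha F$ before the two-processor norm is efficient.

      program breakev
c     Sketch: for the two-processor norm, communication costs at
c     least alpha while the partial sums cost n/F, so efficiency
c     requires n to well exceed alpha*F (hypothetical values below).
      real*8 alpha, f
      alpha = 1.0d-5
      f = 1.0d9
      print *, 'break-even length alpha*F =', alpha * f
      end

With these made-up numbers, $\alpha F = 10^4$, i.e., ten thousand floating point operations could be done while waiting for the null message.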
3.6 Taxonomies
A taxonomy is a model based on a discrete set of characteristics, such as color, gender or atomic weight. Since the model variables are discrete, a taxonomy puts items in neat "pigeon holes" even if they don't quite fit. Like any model, they provide a condensation of information that is often useful in comparing different things.
Perhaps the most famous taxonomy of parallel computing is due to Flynn (see Figure 3.17). It is based only on the logical flow of information (instructions and data) in a computer, with only two values for flow type: single and multiple. The instruction stream refers to the progression of instructions that are being carried out in a computer. If there are several being done at once, this instruction stream is called multiple. If there is a unique instruction at each instant, it is called single. Similarly, the data stream is the progression of data values being brought from memory into the cpu. If several are being brought in simultaneously, the data stream is multiple, otherwise it is said to be single. The conventional von Neumann or Harvard architecture computers (Section 3.1.1) both have a single instruction-stream and a single data-stream. Such a system is called SISD in Flynn's Taxonomy. Note that this taxonomy does not distinguish the architectural differences between von Neumann and Harvard.
Figure 3.18: Another taxonomy provides a computational model which differentiates different computers based on the physical memory system used instead of the flow of data (shared memory multiprocessors such as Sequent and Cray; distributed memory multicomputers such as Ametek and HEP; "SIMD" arrays; vector processors).
A parallel computer almost by definition should have multiple pieces of data being used at each instant. However, it is possible to have either a single instruction or multiple instructions at any instant. The former is referred to as a single instruction-stream, multiple data-stream (SIMD) computer, while the latter is referred to as a multiple instruction-stream, multiple data-stream (MIMD) computer. Doing the same operation to different pieces of data (SIMD) can certainly produce interesting results, but in Flynn's taxonomy, the concept of an MISD computer is not allowed. It theoretically could exist, and it is curious to contemplate a computer performing multiple operations on a single data-stream. However, there are no examples of such a computer that we are aware of, and the MISD computer apparently has not been missed so far.
Other taxonomies (models) could be constructed based on different variables, such as depicted in Figure 3.18. Here we have substituted the data flow variable for a variable which differentiates computers based on the physical characteristics of the memory system. One can imagine this arising by rotating in three-dimensional space, keeping the vertical axis (instruction stream) fixed. With the data-stream axis pointing out of the page, new information (the memory system) is exposed. Shared-memory and distributed-memory processors appear to move apart, whereas vector processors merge with sequential processors, leaving massively parallel SIMD computers in a separate box.
Another taxonomy can be constructed, such as depicted in Figure 3.19, based on the physical characteristics of the processors and memory system. Thus there are at least four dimensions of data that could be of interest in describing different architectures.
All of these models have limitations in that they fail to capture all of the important characteristics of some machines. For example, some machines can be viewed as both shared and distributed memory machines (KSR, BBN Butterfly, DASH).
3.7 Current trends
It is dangerous in a book to discuss the future, and it dates a book to discuss current events. However, several trends can be seen at the moment that will continue to affect high-performance computing in the future. We will mention some of them here in the hope that the pointers to more up to date information may be useful.
On the chip level, one trend that is expected to continue is the exploitation of instruction-level parallelism. This is being done by having more functional units executing simultaneously. Another trend is multiple levels of memory. This is happening even on the chip level. Due to the increasing density of transistors on chips, and the increasing speed at which they operate, it is beginning to take several cycles to access memory even on the cpu chip. Therefore, the number of levels of cache can be expected to grow rather than decline. Both of these trends have significant implications for development of algorithms, whether sequential or parallel.

Figure 3.19: Another taxonomy provides a computational model which differentiates different computers based on both the physical memory system and processor system used (each single or multiple) instead of the flow of information.
One significant success in parallel computing architectures is the pervasiveness of small shared-memory processors today. These are now routinely constructed on a single board. Such systems are even available as commodity PC boards. As a result, parallel systems are being constructed which involve a network of such boards. Thus they represent a distributed-memory architecture globally, with subsets of processors which have hardware support for shared memory. For example, it is possible to construct low-cost networks of PC's which have this topology. This complicates the programming methodology substantially, since it mixes two major types of architecture in a non-homogeneous way. However, one can expect this trend to continue, toward larger numbers of processors per board, and with multiple processors per chip.
3.8 Exercises
Exercise 3.1 Draw and explain the PMS diagram of a personal computer with only a cpu, RAM and floppy disk, connected via a bus.

Exercise 3.2 Multiple memory banks are a way to increase the through-put of memory systems. They are designed to ameliorate the fact that standard DRAM memory chips take multiple cycles to complete a memory reference. Give a PMS diagram of a single processor with a memory system comprised of multiple memory banks. (Hint: reverse the roles of P and M in the basic shared-memory multiprocessor architecture in Figure 3.7.)

Exercise 3.3 The Silicon Graphics "Power Challenge" is a machine with 36 processors each having a peak floating point rating of 300 megaflops, but only 1.28 giga-bytes per second bus bandwidth. Assuming 8 bytes per word, how many operations must be done by each processor (working within its local cache) per data transfer on the bus in order for the maximum rating of 10.8 giga-flops to be achieved? (Assume the bus use is distributed uniformly among the processors.)

Exercise 3.4 Derive the bounds presented in Section 3.2.1 in terms of the work/memory ratio $W_M$ defined in Definition 1.2. That is, prove a general result giving an upper-bound on the number of processors that can efficiently use the bus in terms of $W_M$ for that algorithm. Then apply this result to the hypothetical binary operation considered in Section 3.2.1.
Exercise 3.5 Draw two-dimensional and three-dimensional versions of a "cube-connected cycles" graph. How many processors are required in these cases? How many processors are required in the four-dimensional case?

Exercise 3.6 Derive the formula for the number of processors for a d-dimensional "cube-connected cycles" graph.

Exercise 3.7 Derive the Gray code for the integers zero through fifteen defined by Lemma 3.3. (Hint: just use the formula (3.4).)

Exercise 3.8 Draw the Gray code for the integers zero through fifteen defined by Lemma 3.3 on a copy of Figure 3.14. (Hint: first do Exercise 3.7.)

Exercise 3.9 Prove that a ring of size jk can be imbedded in a j × k toroidal mesh. (Hint: define a numbering scheme for a mesh based on Cartesian coordinates.)

Exercise 3.10 The total memory bandwidth of a distributed memory machine depends on the ratio of the number of edges to the number of vertices in the graph representation of the network. For example, this ratio is one for a ring. Determine this ratio for a two-dimensional toroidal mesh and for a three-dimensional toroidal mesh. (Hint: consider the number of edges emanating from each vertex, and then account for redundancies.)

Exercise 3.11 Determine the ratio of the number of edges to the number of vertices in the graph representation of a hypercube network of dimension d. (Hint: see Exercise 3.10.)

Exercise 3.12 Determine the work/memory ratio $W_M$ (defined in Definition 1.2) for the norm evaluation algorithm described in Section 3.5.1.

Exercise 3.13 The dot product of two vectors (x : 1, ..., n) and (y : 1, ..., n) is a simple variant of a norm evaluation: $\sum_{i=1}^{n} x_i y_i$. Determine the work/memory ratio $W_M$ (defined in Definition 1.2) for the evaluation of a dot product.

Exercise 3.14 Just how many arithmetic operations are required to compute each sum in (3.7)? Show that there are two different ways: one takes 2n FLOPs and the other one less. (Hint: the difference is in the initialization of the summation loop.)

Exercise 3.15 Show that there are fewer communication patterns with contention in a 4 × 4 cross-bar switch than in a 2-stage multi-stage interconnect switch using 2 × 2 cross-bar switches. (Hint: any permutation is available with the former, and only certain permutations can be done with the latter.)

Exercise 3.16 Determine the number of permutations in Figure 3.12 which result in contention (cf. Figure 3.13). Which ones are in contention? How many are contention free?
Exercise 3.17 Prove that a $2^i \times 2^j$ mesh can be imbedded in a d-dimensional hypercube if $i + j \le d$.

Exercise 3.18 Prove that a $2^i \times 2^j \times 2^k$ mesh can be imbedded in a d-dimensional hypercube if $i + j + k \le d$.

Exercise 3.19 Prove that a complete binary tree of depth $d - 1$ (which has $2^d - 1$ nodes) cannot be imbedded in a d-dimensional hypercube if $d \ge 3$. (Hint: consider the parity of the binary representation of the nodes of the hypercube. In going from one level to another in the tree, show that the parity must change. Count the number of nodes of one parity and prove that a hypercube must have the same number of even and odd parity nodes.)
Exercise 3.20 A cross-bar switch can be used as a basic building block for many networks. Show how one can implement a network similar to a four-dimensional cube-connected cycles network, with P = 64, using sixteen 4 × 4 cross-bar switches.

Exercise 3.21 An ethernet "hub" is a device that merges network connections via its "ports." An 8-port hub can be used much like a 4 × 4 cross-bar switch as a basic building block for many networks. Show how one can implement a network similar to a four-dimensional cube-connected cycles network, with P = 64, using sixteen 8-port hubs.
Chapter 4

Dependences

"I wish I didn't know now what I didn't know then" (from the song)
Dependences between different program parts are the major obstacle to achieving high performance on modern computer architectures. There are different sources of dependences in computer programs. In "imperative" programming languages like Fortran and C, dependences are indicated by
4.1 Data Dependences
Dependences force order among computations. Consider the two fragments of code shown in Program 4.1. The last two statements in the left column can be interchanged without changing the resulting values of x or y. However the last two statements in the right column cannot be interchanged without changing the resulting value of y. In either case, the assignment to the variable z (the first line of code in both columns) must come first. The requirement to maintain execution order results since the variables used in one line depend on the assigned values in another line. Our purpose here is to formalize the definitions of such dependences.
We will refer to the operations corresponding to a fragment of code such as in Program 4.1 as a computation. We will then define the dependence of one computation on another. We will be
1. Some language constructs, such as HPF's !HPF$Independent, permit the programmer to assert a lack of dependences in a section of code. See Section 4.7.
      z = 3.14159          z = 3.14159
      x = z + 3            z = z + 3
      y = z + 6            y = z + 6

Program 4.1: Two fragments of code (left and right columns).
would not be well formed in any of these languages, since there is no assignment involved. Similarly, a code fragment which entered a subprogram, but did not return from it, would not be well formed. We leave the notions of operation, instructions, code and machine informal, although they can be made quite formal and precise. The reader can imagine the operations carried out as specified in a typical Fortran, C, or C++ code compiled for an actual computing machine like a favorite workstation. We assume that our "machine" has memory locations which store the variables described in the codes we will consider. Among other things, a computation C reads data from, and writes data to, memory locations.
Definition 4.2 The read set R(C) of a computation C is the set of memory locations from which data are read during this computation. The write set W(C) of a computation C is the set of memory locations to which data are written during this computation. The access set A(C) is defined to be $R(C) \cup W(C)$.

The read set R(C) is sometimes called the Use or In set, and the write set W(C) is called the Def or Out set.
Definition 4.3 Suppose that C and D are computations in a code. There is a (direct) data dependence between C and D if one of the following Bernstein conditions holds:

$W(C) \cap R(D) \ne \emptyset$, or    (4.1)

$R(C) \cap W(D) \ne \emptyset$, or    (4.2)

$W(C) \cap W(D) \ne \emptyset$.    (4.3)
We are considering codes in languages which have a specified order of computation. In Fortran, C, or C++, the order of computation is specified (in part) by the order of the lines of code in various files, together with rules for special instructions such as branches, e.g., goto in Fortran. This order induces a partial order, which we denote by $\prec$, on the set of computations that a code can specify. In particular, we say that the computation C occurs before the computation D, and write $C \prec D$, if all of the lines of C precede all of the lines of D (with no overlap) in the standard order of execution.
Note that the order of C and D in Definition 4.3 does not affect whether we declare a data dependence between them. That is, if there is a data dependence for $C \prec D$, then there is also a dependence for $D \prec C$. However, the specific dependences (4.1) and (4.2) do depend on the order. If (4.1) holds for $C \prec D$, then (4.2) holds for $D \prec C$, and vice versa. The condition (4.3) is independent of order. It is important to distinguish between the different types of dependences in some cases, and so special names are given for the different sub-cases, as follows.
Suppose that the computation C occurs before the computation D. If (4.1) holds, then we say there is a true dependence, flow dependence or forward dependence (all equivalent terminology) from C to D.
Example 4.1 Consider the fragment of code in the left-hand side of Program 4.1. Let C denote the second line and D denote the third line. Then $W(C) = \{x\}$ and $R(D) = \{z\}$. Thus $W(C) \cap R(D) = \emptyset$, and there is no forward dependence from C to D. But in the code fragment in the right column there is a forward dependence. Again, let C denote the second line and D denote the third line. Then $W(C) = \{z\}$ and $R(D) = \{z\}$. Thus $W(C) \cap R(D) = \{z\}$ is not empty.
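The read and write sets of Example 4.1 can be annotated directly on the right-hand fragment of Program 4.1 (a sketch; the comments are ours):

      z = 3.14159
c                      W = {z}, R = { }
      z = z + 3
c     computation C:   W = {z}, R = {z}
      y = z + 6
c     computation D:   W = {y}, R = {z}
c     W(C) and R(D) intersect in {z}: a forward dependence from C to D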
Lemma 4.1 If a variable x does not appear in the write set of either computation C or D, then it cannot contribute to a dependence between C and D.

The proof of this result is immediate, since the hypothesis means that x cannot appear in any of the three intersections (4.1), (4.2) or (4.3), since all three involve at least one write set.
When there is any dependence between computations, we cannot execute them in parallel, since the result will depend on the exact order of computation. We will therefore be most interested in proving that there are not dependences between computations in many cases. However, proving that there are dependences between computations can be useful too: it can halt a futile effort to parallelize code before it starts. The main reason for studying dependences is the following, which we state without proof.
Theorem 4.1 If two computations C and D have no data dependences, then they can be computed independently (in parallel) without changing the result of the overall computation.
There is one more bit of formalism we need to introduce to minimize possible confusion. In the fragments at the beginning of the section, all of the variables appearing there are initialized in that fragment itself. However, it would be awkward (and in fact not useful) to limit our definition of computations to such situations. In fact, dependences can be different depending on different executions of the code. For example, the dependences in the code Program 4.2 can only be determined once the data i and j are read in. Different executions of the code could clearly lead to different values for i and j. For this reason, we will add a subscript to a particular computation S to indicate the particular execution of it, e.g., $S_e$ where the subscript e denotes the particular execution.
      read i,j
      x(i)=y(j)
      y(i)=x(j)

Program 4.2: Code whose dependences can only be determined at execution time.
However, the only significant use of this concept we will make is to indicate the execution of a particular text of code in different iterations of a loop. In this case, we will use the loop indices as subscripts to indicate this concept.
4.2 Loop-carried dependences
One especially important form of parallelism is loop-level parallelism in which each loop iteration (or group of iterations) is executed in parallel. The primary consideration for studying dependences here is that they inhibit parallelism. However, dependences in loops lead to inefficiencies even on current scalar processors, such as those with long pipelines (Section 3.4). Thus it is very important to know whether there are dependences between computations in different loop iterations. Such dependences are called loop-carried data dependences. We begin the discussion with an example to clarify the issues.
Consider the computation of a norm (3.6). As observed there, this has a form similar to the summation problem studied in Section 1.4.1. However, it has twice as much computation per memory reference, and it is quite a common sub-task in scientific computation. In Fortran this might be written as shown in Program 4.3. In a pipelined processor, we see that it will be difficult to achieve much performance because we need to complete one iteration of the loop before we can start the next: we must wait for the new value of sum to be computed. We will see that this is caused by a loop-carried data dependence.
      sum = x(1)*x(1)
      DO 1 I=2,N
 1       sum = sum + x(I)*x(I)

Program 4.3: Fortran code for the norm evaluation (3.6).
      sum1 = x(1)*x(1)
      sum2 = x(2)*x(2)
      DO 1 I=2,N/2
         sum1 = sum1 + x(2*I-1)*x(2*I-1)
 1       sum2 = sum2 + x(2*I)*x(2*I)
      sum = sum1 + sum2

Program 4.4: Code for norm evaluation with the loop split in two.
      DO 1 j=1,k
 1       summ(j) = x(j)*x(j)
      DO 3 i=2,N/k
         DO 2 j=1,k
            summ(j) = summ(j) + x(k*(i-1)+j)*x(k*(i-1)+j)
 2       continue
 3    continue
      sum = 0.0
      DO 4 j=1,k
 4       sum = sum + summ(j)

Program 4.5: Code for norm evaluation with an arbitrary number of loop splittings. Note that this is a cyclic decomposition of the summation problem.
We can describe loop splitting in the terminology of parallelism as a cyclic (or modulo) decomposition of (3.6) (see Definition 1.8).
Definition 4.4 A loop is in normalized form if its index increases from zero to its limit by one.

Any loop can be converted into normalized form by an affine change of index variables. For example, the loop

      do 1 I=11,31,3
 1       A(I)=B(2*I)

becomes

      do 1 J=0,6
 1       A(11+3*J)=B(2*(11+3*J))

That is, all "strides" are eliminated by the obvious change of loop index, and loops with decreasing index are reversed. See Exercise 4.3.
The set of loop indices for normalized nested loops are multi-indices, i.e., n-tuples of non-negative integers, $I := (i_1, i_2, \ldots, i_n)$, where n is the number of loops. In Program 4.5 the DO 2 and DO 3 loops are a nested pair of loops (n = 2). By convention, $i_1$ is the loop index for the outer-most loop, and so on with $i_n$ being the loop index for the inner-most loop. We use the symbol $\vec{0}$ to denote the n-tuple consisting of n zeros. There is a natural total order "<" on multi-indices, lexicographical order, that is, the order used in dictionaries.
Definition 4.5 The lexicographical order for all n-tuples of integers is defined by

$(i_1, i_2, \ldots, i_n) < (j_1, j_2, \ldots, j_n)$

whenever $i_k < j_k$ for some k, and $i_\ell = j_\ell$ for all $\ell < k$ if $k > 1$.
Figure 4.1: Performance (MFlops/sec) of norm evaluation for vectors of length n = 16,000 on various computers (Cray C90, HP735/125, RS6000, Alpha, Sparc20) as a function of the number of loop splittings.
We can then write $I \le J$ if either I < J or I = J, and similarly we write J > I if I < J. The standard order of evaluation of nested (normalized) loops in Fortran, C, C++ and other languages provides the same total order on the set of loop indices (i.e., on multi-indices) as the lexicographical order < defined in Definition 4.5. Indeed, the loop execution for index I comes before the loop execution for index J if and only if I < J. See Exercise 4.24 for another possible loop ordering.
If S is a computation that is enclosed in at least $n \ge 1$ definite loops with main index variables $l_1, l_2, \ldots, l_n$, then $S_I$ denotes the execution of S for which

$(l_1, l_2, \ldots, l_n) = (i_1, i_2, \ldots, i_n) =: I$.
Definition 4.6 There is a loop-carried data dependence between parts S and T of a program if there is a data dependence between the executions $S_I$ and $T_J$ for some multi-indices I and J, with I < J.

A loop-carried dependence can be described more precisely as forward (4.1), backward (4.2), or output (4.3) depending on which of the Bernstein conditions hold. Note that we do not assume that the parts S and T are disjoint or in any particular order. The ordering required following Definition 4.3 is enforced by the assumption I < J. In view of this, $S_I \prec T_J$ in Definition 4.6.
Suppose I < J and write $(h_1, \ldots, h_n) := J - I$. Then

$h_\ell = 0$ for all $\ell < k$    (4.4)

for some k, and the first non-zero element $h_k$ will be positive. However, subsequent entries could have either sign. Symbolically, we can write the typical difference as

$I < J \implies J - I = (0, \ldots, 0, +, \pm, \ldots, \pm)$.    (4.5)

Such n-tuples have a special role so we give them a special name.
Definition 4.7 An n-tuple I of integers is lexicographically positive if the first non-zero entry is positive, i.e., if $i_k > 0$ and $i_\ell = 0$ for all $\ell < k$, for some $k \ge 1$. In this case we write $I > \vec{0}$, and the index k is called the carrier index of I.
Definition 4.8 Suppose $I := (i_1, i_2, \ldots, i_n)$ and $J := (j_1, j_2, \ldots, j_n)$ are multi-indices with I < J. If there is a dependence between $S_{(i_1, i_2, \ldots, i_n)}$ and $T_{(j_1, j_2, \ldots, j_n)}$, then

$(j_1 - i_1, j_2 - i_2, \ldots, j_n - i_n)$    (4.6)

is called a dependence distance vector.
Example 4.3 In the original loop for the norm evaluation (3.6) the set of distance vectors consists of all positive integers. In the nested loop pair in Program 4.5, the set of distance vectors consists of all multi-indices of the form (i, j) where i is a positive integer. In particular, there is no dependence vector of the form (0, j) for any j.
As noted in (4.4) or (4.5), the difference of lexicographically ordered multi-indices is lexicographically positive. The position of the (positive) first non-zero, i.e., the carrier index, is important for distance vectors, so we give it a special name.
Definition 4.9 The loop corresponding to the first (from the left) nonzero element of a dependence vector is called the carrier of that dependence.
The carrier of a dependence corresponds to the carrier index (Definition 4.7) of the dependence distance vector.
Example 4.4 If we re-write the nested loop pair in Program 4.5, as shown in Program 4.6, then the set of distance vectors consists of all multi-indices of the form (0, j) where j is a positive integer. Thus the second index is the carrier of these dependences. Note that we can execute each of the i loop instances independently (in parallel). Note that Program 4.6 represents a block decomposition of Program 4.3.
Again, we state without proof the main theorem about dependence vectors.
Theorem 4.2 If all dependence vectors have carrier indices greater than j, then the outer-most j loops can be executed in parallel.
      DO 3 i=1,N/k
         summ(i) = x(k*(i-1)+1)*x(k*(i-1)+1)
         DO 2 j=2,k
            summ(i) = summ(i) + x(k*(i-1)+j)*x(k*(i-1)+j)
 2       continue
 3    continue
      sum = 0.0
      DO 4 j=1,N/k
 4       sum = sum + summ(j)

Program 4.6: Block decomposition of the norm evaluation; the final accumulation runs over the N/k partial sums.
      DO 1 I=1,100
         TEMP = I
 1       A(I) = 1.0/TEMP

Program 4.7: Simple code with a dependence caused by the use of a temporary variable.
There is a dependence between the second and third lines, and more importantly, a loop-carried dependence for the entire loop body. However, it is clear that these are not essential dependences. If we write this as in Program 4.8 there is no longer a dependence. The addition of a line of the form

C*private TEMP

to the code in Program 4.7 is intended to produce the equivalent result as if we had explicitly made TEMP a different variable for each index as is done in Program 4.8.
      DO 1 I=1,100
         TEMP(I) = I
 1       A(I) = 1.0/TEMP(I)

Program 4.8: The code of Program 4.7 with the dependence removed by expanding the temporary variable into an array.
Privatizing a variable in a loop is therefore a very simple concept. However, it is clear that not all variables can be made private without destroying the correctness of the loop. The following simple results regard correct privatization of variables.
If a variable V is not in the write-set of any of the statements in a loop, then it cannot cause a loop-carried dependence due to Lemma 4.1. Thus when discussing privatization of variables, we are only concerned with ones which do appear in some write-set (that is, in an assignment). If such a variable V is in the write-set of some statement, but it is only in the read set in the first executed statement in the loop in which it occurs, then it cannot be safely privatized. This is because the first read in the I-th iteration will be referring to the value set in the (I - 1)-st iteration. Note that the first statement in which a variable occurs may not be the first executed statement, since it may be preceded by a conditional branch.
We do not attempt here a statement of a complete theorem regarding the possibility of privatization, but suffice it to say that it is not possible to get necessary and sufficient conditions based on static code analysis alone. Consider the code shown in Program 4.9. Let us assume that PARITY is a user-supplied function that (correctly) computes whether a number is even or odd, and returns 0 or 1 accordingly. Then we can prove that TEMP will always be initialized before its use in the statement labeled "1" but a compiler would have to assume that there might be cases when PARITY is neither 0 nor 1, so that "1" becomes the first executed statement in which TEMP occurs.
      DO 1 I=1,100
         IF( PARITY(I) .EQ. 1) THEN
            TEMP = I
         ENDIF
         IF( PARITY(I) .EQ. 0) THEN
            TEMP = 0
         ENDIF
 1       A(I) = TEMP
Program 4.9: Simple code with apparent difficulty for automatic privatization.
To indicate that a variable is to be considered private to each loop iteration, we will use a notation such as

C*private TEMP

which can be read as a comment (in Fortran) if not being interpreted as a privatization command. Many compilers for shared-memory multi-processors recognize a directive like this. We take it to mean that TEMP is private regardless of correctness.
4.3 Dependence Examples
We now review some of the numerical examples introduced in Chapter 1 and consider dependences related to such codes. In this section, we consider mainly ones in which "privatization" (Section 4.2.3) of temporaries can remove dependences.
The symmetric, tridiagonal matrix for a two-point boundary-value problem on a general mesh (1.24) of mesh size h_i = H(I) is initialized by the code shown in Program 4.10. Here DIAG is an array holding the diagonal terms a_{i,i} in (1.24), and OFFDIAG is an array holding the off-diagonal terms a_{i,i+1}. We assume the array H(I) has been previously computed. Due to symmetry of the matrix, only one off-diagonal array is necessary. Since each array element on the left-hand side of the assignments (denoted by the "=" sign) depends only on data previously computed, each element can be computed independently of all others (see Exercise 4.4).
for I=1,N
DIAG(I) = (1.0/H(I)) + (1.0/H(I+1))
OFFDIAG(I) = - (1.0/H(I+1))
endfor(I)
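Program 4.10: Simple code for initialization of finite difference matrix.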
In the version shown in Program 4.11, which computes the same values using temporary variables, there is a loop-carried dependence (see Exercise 4.5) because the temporary variables DXO and DXI will be overwritten by different loop iterations. Declaring DXO and DXI to be "private" will remove the loop-carried dependences in the code while preserving its correctness (see Exercise 4.6).
for I=1,N
DXO=1.0/H(I)
DXI=1.0/H(I+1)
DIAG(I) = DXO + DXI
OFFDIAG(I) = - DXI
endfor(I)
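Program 4.11: Code for initialization of finite difference matrix using temporary variables.

The number of divisions can be reduced by reusing the reciprocal computed in the previous iteration; this is the code that Program 4.13 below reduces to when P = 1:

DXI = 1.0/H(1)
for I=1,N
   DXO = DXI
   DXI = 1.0/H(I+1)
   DIAG(I) = DXO + DXI
   OFFDIAG(I) = - DXI
endfor(I)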
Program 4.12: More complex code for initialization of finite difference matrix.
These dependences can be removed correctly by expanding the loop by hand. We subdivide the iteration into P parts and initialize each one separately, as shown in Program 4.13. This code computes the same set of values, although it requires P - 1 additional divisions. If DXO and DXI are declared as "private" in the outer loop then there are no loop-carried dependences in the "for IP" loop, and the code is correct (see Exercise 4.8). Program 4.13 represents a block decomposition (see Definition 1.8) of Program 4.12. For P = 1, Program 4.13 and Program 4.12 are identical.
for IP=0,P-1
DXI=1.0/H(IP*(N/P)+1)
for I=IP*(N/P)+1, IP*(N/P) + N/P
DXO = DXI
DXI=1.0/H(I+1)
DIAG(I) = DXO + DXI
OFFDIAG(I) = - DXI
endfor(I)
endfor(IP)
Program 4.13: Explicitly parallel code for initialization of finite difference matrix.
4.4 Dependence Tests
Potential loop-carried dependences can be caused by the use of indirection arrays. Suppose l1, l2, ..., ln are loop indices, and we have a code containing the following statement:
(S)   A(F(l1,...,ln)) = somefun( A(G(l1,...,ln)) )

A dependence involving (S) requires that
\[ F(i_1, i_2, \ldots, i_n) = G(j_1, j_2, \ldots, j_n) \qquad (4.7) \]
for some values of the loop indices i_1, i_2, \ldots, i_n and j_1, j_2, \ldots, j_n which are within loop bounds. Such equations are called dependence equations.
There are no dependences possible involving (S) if (4.7) can be shown to have no solutions for the given array expressions. Solutions to (4.7) must satisfy loop bounds and be integers. We can solve such inequality-constrained discrete equations a priori if F and G are simple enough. However, in general the solution set of such equations is difficult to analyze. Many dependence tests are based on the following theorem [35]:
Theorem 4.3 Suppose F and G are continuous. Then the equation F(x) = G(y) has a real solution x, y satisfying given bounds if and only if
\[ \min_{x,y}(F(x) - G(y)) \le 0 \le \max_{x,y}(F(x) - G(y)), \]
where the minimum and maximum are taken over x and y satisfying the bounds.
Consider the simple code shown in Program 4.14. Suppose, for example, that
\[ F(l) = a_0 + a_1 l \quad\text{and}\quad G(l) = b_0 + b_1 l. \qquad (4.8) \]
Then a dependence requires integer solutions i and j of
\[ a_0 + a_1 i = b_0 + b_1 j, \qquad (4.9) \]
that is,
\[ a_1 i - b_1 j = b_0 - a_0. \qquad (4.10) \]
The GCD test determines that there is no dependence if \gcd(a_1, b_1) does not divide b_0 - a_0. We summarize this as Theorem 4.4.
      do l = 1,n
(S)      A(F(l)) = somefun( A(G(l)) )
      enddo

Program 4.14: Simple code with potential loop-carried dependences.
Theorem 4.4 Suppose F and G are given by (4.8). Then the code (S) in Program 4.14 does not have a loop-carried dependence if \gcd(a_1, b_1) does not divide b_0 - a_0.
As an example, consider the simple code in Program 4.15. We can never have
\[ 2k = 2\ell + 3 \]
for any integers k and \ell, since this would imply that two divides three. Thus the GCD test allows us to conclude that there are no dependences possible in Program 4.15.
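The GCD test is easy to mechanize; here is a minimal sketch in Fortran (our code, assuming a1 and b1 are not both zero), which returns .TRUE. exactly when Theorem 4.4 precludes a dependence:

      LOGICAL FUNCTION NODEP(A0, A1, B0, B1)
C     GCD test (Theorem 4.4) for A(a0+a1*l) and A(b0+b1*l):
C     no dependence is possible when gcd(a1,b1) does not
C     divide b0 - a0.  Assumes a1, b1 are not both zero.
      INTEGER A0, A1, B0, B1, G, R, T
      G = ABS(A1)
      R = ABS(B1)
C     Euclidean algorithm: afterwards G = gcd(|a1|,|b1|)
 10   IF (R .NE. 0) THEN
         T = MOD(G, R)
         G = R
         R = T
         GOTO 10
      ENDIF
      NODEP = MOD(B0 - A0, G) .NE. 0
      RETURN
      END

For Program 4.15 we have a_0 = 0, a_1 = 2, b_0 = 3, b_1 = 2, and NODEP(0, 2, 3, 2) yields .TRUE., since \gcd(2, 2) = 2 does not divide 3.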
The GCD test is for unconstrained equality; that is, it does not include the effects of loop bounds, and so is only a sufficient test to preclude dependences, not a necessary condition. That is, we can have \gcd(a_1, b_1) \mid b_0 - a_0 but still have no solutions satisfying the loop bounds.
The GCD test can be overly pessimistic regarding potential dependences since it ignores loop bounds. A diametrically opposed dependence test is based on checking only the loop bounds, but ignoring whether or not the solutions are integers. If
\[ F(x) - G(y) = 0 \qquad (4.11) \]
do l = 1,n
A(2*l) = A(2*l+3) + 1
enddo
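Program 4.15: Simple code with no dependences possible, by the GCD test.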
has a real solution x and y satisfying the loop bounds, then Theorem 4.3 shows that
\[ \min_{x,y}(F(x) - G(y)) \le 0 \le \max_{x,y}(F(x) - G(y)). \qquad (4.12) \]
For a loop-carried dependence in Program 4.14, x and y must moreover correspond to different iterations, say
\[ 1 \le x < n, \quad 1 < y \le n \quad\text{and}\quad x < y. \qquad (4.13) \]
Write a^+ = \max(a, 0) and a^- = \max(-a, 0) for the positive and negative parts of a. With F and G as in (4.8), a computation shows that, subject to (4.13),
\[ \min_{x<y}(F(x) - G(y)) = a_0 + a_1 - b_0 - 2b_1 - (a_1^- + b_1)^+ (n - 2), \]
\[ \max_{x<y}(F(x) - G(y)) = a_0 + a_1 - b_0 - 2b_1 + (a_1^+ - b_1)^+ (n - 2). \qquad (4.14) \]
Therefore, by Theorem 4.3, a real solution satisfying (4.13) can exist only if
\[ -(a_1^- + b_1)^+ (n - 2) \le b_0 + 2b_1 - a_0 - a_1 \le (a_1^+ - b_1)^+ (n - 2). \qquad (4.15) \]
Banerjee's dependence test consists of determining whether (4.15) holds. If it does not, there can be no dependence. We summarize this as Theorem 4.5.
Theorem 4.5 The code (S) in Program 4.14, where F and G are given by (4.8), does not have a loop-carried dependence if (4.15) does not hold.
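Like the GCD test, Banerjee's test is a short computation; a minimal sketch (our code and naming) returns .TRUE. exactly when inequality (4.15) fails, so that Theorem 4.5 precludes a dependence:

      LOGICAL FUNCTION NODEPB(A0, A1, B0, B1, N)
C     Banerjee test (Theorem 4.5): no loop-carried dependence
C     in Program 4.14 when (4.15) fails, i.e., when the middle
C     quantity M falls outside the interval [LO, HI].
      INTEGER A0, A1, B0, B1, N, M, LO, HI
      M  = B0 + 2*B1 - A0 - A1
      LO = -MAX(MAX(-A1, 0) + B1, 0) * (N - 2)
      HI =  MAX(MAX( A1, 0) - B1, 0) * (N - 2)
      NODEPB = (M .LT. LO) .OR. (M .GT. HI)
      RETURN
      END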
The general problem of determining whether (4.7) has integer solutions falls in the mathematical realm of number theory. See [87] for results that generalize both Theorem 4.4 and Theorem 4.5 by combining a characterization of solutions of (linear) Diophantine equations together with inequalities coming from loop bounds. Exercise 4.22 gives an example involving a quadratic Diophantine equation in which one can prove a dependence by exhibiting a simple solution to a well known quadratic equation. The question of deriving an algorithm for solving nonlinear Diophantine equations is Hilbert's Tenth Problem [67], which is in general unsolvable. See Exercise 4.23 for an exotic application of number theory to the study of dependences.
4.5 Loop Transformations
Recall (Definition 1.7) that the iteration space (or iteration set) of a group of nested loops is the set of index tuples which satisfy the loop bounds. Each point represents the actions of the body of the loop corresponding to the coordinates of the point. When the loops are normalized (Definition 4.4), as we now assume, the points of the iteration space are multi-indices. The iteration space of any double loop of the form in Program 4.16 is
\[ \{ (i, j) \in \mathbf{Z}^2 : 1 \le i \le j \le 10 \}. \]
Points of iteration sets can serve as points of a dependence graph if we only care about the loop body as a whole. That is, each point in the graph corresponds to the entire loop body for a given tuple
do I = 1 , 10
do J = I , 10
...
enddo
enddo
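Program 4.16: A double loop with a triangular iteration space.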
Figure 4.2: Dependence graph for the double loop in Program 4.17 with ten iterations in each loop.
of loop indices. This is a coarse view of possible dependences, but in many cases it is sufficient. We can indicate the loop-carried dependences by drawing any distance vectors (Definition 4.8) from each iteration multi-index to its corresponding iteration multi-index where the dependence occurs. For example, consider the code shown in Program 4.17. Figure 4.2 shows the corresponding dependence graph for the case m = n = 10, cf. Exercise 4.13. It is interesting to note that there are only two distance vectors in each case, \(\binom{1}{0}\) and \(\binom{0}{1}\). This is typical for loops with very regular structure. Unfortunately, each distance vector has a different carrier index, so no parallelism is available directly (see Theorem 4.8).
do i = 1, m
do j = 1, n
A( i, j ) = A( i , j-1 ) + 1
B( i, j ) = B( i-1, j ) + 1
enddo
enddo
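Program 4.17: Double loop with two loop-carried dependences.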
Consider new loop indices ii and jj defined by
\[ \begin{pmatrix} jj \\ ii \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} i \\ i + j \end{pmatrix}, \]
so that
\[ \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} jj \\ ii \end{pmatrix} = \begin{pmatrix} jj \\ ii - jj \end{pmatrix}, \]
using the matrix \(\bigl(\begin{smallmatrix} 1 & 0 \\ -1 & 1 \end{smallmatrix}\bigr)\), which is the inverse of \(\bigl(\begin{smallmatrix} 1 & 0 \\ 1 & 1 \end{smallmatrix}\bigr)\). In terms of the new loop indices, the loop becomes as shown in Program 4.18. The dependence graph for the transformed loop is depicted in Figure 4.3.
do jj = 1,m
do ii = 1 + jj, n + jj
A( jj, ii-jj ) = A( jj , ii-jj-1 ) + 1
B( jj, ii-jj ) = B( jj-1, ii-jj ) + 1
enddo
enddo
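Program 4.18: The code of Program 4.17 in terms of the transformed loop indices ii and jj.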
The distance vectors of the transformed loop are obtained by applying the transformation matrix to the original distance vectors:
\[ \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (4.16) \]
The first of these dependence vectors is easier to identify if we write the second line of the loop as
B( jj, ii-jj ) = B( jj-1, (ii-1) - (jj-1) ) + 1 .
If we want the transformed loop to be normalized, we must require loop transformation matrices to have inverses with integer entries. Otherwise the resulting loop indices would take on non-integer values (see Exercise 4.14). Such matrices are sometimes called unimodular. We recall the following result on unimodular matrices (see Exercise 4.15):
Theorem 4.6 An integer matrix T has an integer inverse if and only if |\det(T)| = 1.
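For 2 x 2 integer matrices, the criterion of Theorem 4.6 is a one-line determinant check; a minimal sketch (our code and naming):

      LOGICAL FUNCTION UNIMOD(T11, T12, T21, T22)
C     Theorem 4.6: an integer matrix has an integer inverse
C     (i.e., is unimodular) exactly when |det T| = 1.
      INTEGER T11, T12, T21, T22
      UNIMOD = ABS(T11*T22 - T12*T21) .EQ. 1
      RETURN
      END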
It is simple to describe the unimodular matrices for standard loop transformations. For example, interchanging the loops in a nested pair of loops is achieved by multiplying by the matrix
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (4.17) \]
Applying this matrix to the transformed loop above results in the same thing as doing one transformation using the matrix product of the two matrices, i.e.,
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (4.18) \]
The resulting distance vectors would be \(\binom{1}{1}\) and \(\binom{1}{0}\), and the corresponding dependence graph would just be the reflection of Figure 4.3 with respect to the diagonal ii = jj. The resulting transformed code appears in Program 4.19. It now becomes apparent (Exercise 4.21) that the inner loop can be executed in parallel, as it has no loop-carried dependences. This can also be determined from the reflected version of Figure 4.3. All distance vectors point away from the lines of constant ii (with jj varying). Moreover, the inner loop is not a carrier for either dependence distance vector (see Theorem 4.8).
Figure 4.3: Dependence graph for a transformed double loop with ten iterations in each loop.
do ii = 2 , m + n
do jj = max{ 1 , ii - n }, min{ ii - 1 , m }
A( jj, ii-jj ) = A( jj , (ii-1) - jj ) + 1
B( jj, ii-jj ) = B( jj-1, (ii-1) - (jj-1) ) + 1
enddo
enddo
Program 4.19: Another transformation of the code in Program 4.17 which eliminates loop-carried dependences in the inner loop.
A linear loop transformation T maps the set of distance vectors D to the set TD = \{ Td : d \in D \}. In our example in Program 4.17, we started with a loop having distance vectors
\[ D = \left\{ \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \qquad (4.19) \]
Using the final transformation (4.18) in the previous section, T = \(\bigl(\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\bigr)\), (4.19) becomes
\[ TD = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \]
Theorem 4.7 A linear loop transformation produces correct code if and only if every dependence vector of the original (normalized) loop is transformed into an integer vector that is lexicographically positive (Definition 4.7).

A principal use of linear loop transformations is to yield parallel code. The following theorem (whose proof we also omit) tells us when this is possible.
Theorem 4.8 In a loop nest with index vector I = (i_1, i_2, \ldots, i_n) and dependence distance vector set D, we can run the k-th loop from the outer-most (with index i_k) in parallel if and only if the k-th loop is not the carrier of any dependence.
The condition of Theorem 4.8 is satisfied if and only if for every d \in D, k is not the carrier index (see Definition 4.7) for d. This is true if, for all d \in D, either d_j > 0 for some j < k or d_k = 0.
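This characterization translates directly into code; the following sketch (our code and naming, with one distance vector per column of D) reports whether loop k may run in parallel:

      LOGICAL FUNCTION KPAR(D, N, M, K)
C     Theorem 4.8: loop K (counting from the outer-most) can run
C     in parallel iff no column of D has carrier index K.
C     D is N x M: N loop indices, M distance vectors.
      INTEGER N, M, K, D(N,M), I, J, C
      KPAR = .TRUE.
      DO 2 J = 1, M
C        C = carrier index of column J (first nonzero from the top)
         C = 0
         DO 1 I = 1, N
            IF (C .EQ. 0 .AND. D(I,J) .NE. 0) C = I
 1       CONTINUE
         IF (C .EQ. K) KPAR = .FALSE.
 2    CONTINUE
      RETURN
      END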
In our example, we had three sets of distance vectors:
\[ D_1 = \left\{ \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}, \quad D_2 = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\}, \quad D_3 = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\}. \qquad (4.20) \]
For the first two sets (D_1 and D_2), there are carriers in both loops. The carrier index for \(\binom{1}{0}\) is one, the carrier index for \(\binom{0}{1}\) is two, and the carrier index for \(\binom{1}{1}\) is again one. In the last one (D_3), there is no carrier in the second (inner) loop; the carrier index for both distance vectors is one.
In the case m = n = 10 of Program 4.17, the original loop bounds are
\[ i_{\min} = 1, \quad i_{\max} = 10, \quad j_{\min} = 1 \quad\text{and}\quad j_{\max} = 10, \qquad (4.21) \]
and correspondingly, for the transformed loop in Program 4.18,
\[ jj_{\min} = 1, \quad jj_{\max} = 10, \quad ii_{\min} = 2 \quad\text{and}\quad ii_{\max} = 20; \]
that is, 1 \le jj \le 10 and 1 + jj \le ii \le 10 + jj.

4.6
In a time-stepping scheme (e.g. (1.16) or (1.18)) for an ordinary differential equation (1.11), one cannot parallelize so easily. Consider the code in Program 4.20 for the explicit Euler method (1.18) in the special case that f is independent of t. We cannot parallelize Program 4.20 in a simple way because the computation of the I-th iteration requires the previous iteration. The simplest alternative is to use a different time-stepping scheme as described in Section 1.5.1 (see (1.19)). In the special case that f is linear and independent of t, we will see that there are alternative parallelizations in Section 13.3.
OLDSOL = SOMETHING
for I=1,N
SOLN(I) = OLDSOL + H(I)*F(OLDSOL)
OLDSOL = SOLN(I)
endfor(I)
Program 4.20: Simple code for an ordinary differential equation, where H(I) is the mesh size of the I-th interval.
Compared with the Jacobi iteration in Program 4.21, the Gauss-Seidel iteration in Program 4.22 is much more efficient and easier to program, but (unfortunately) essentially sequential.
for I=1,N
SNEW(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
for I=1,N
SOLN(I) = SNEW(I)
endfor(I)
Program 4.21: Simple code for Jacobi iteration for a two-point boundary value problem for an ordinary differential equation.
for I=1,N
   SOLN(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
Program 4.22: Simple code for Gauss-Seidel iteration for a two-point boundary value problem for an ordinary differential equation.
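Since neither loop in Program 4.21 carries a dependence, the Jacobi sweep parallelizes directly; a minimal sketch with OpenMP directives (our code; the array bounds are an assumption about the boundary handling):

      SUBROUTINE JACOBI(SOLN, SNEW, F, DIAG, OFFDIAG, N)
C     One Jacobi sweep of Program 4.21 with both dependence-free
C     loops marked parallel.
      INTEGER N, I
      REAL SOLN(0:N+1), SNEW(N), F(N), DIAG(N), OFFDIAG(0:N)
!$OMP PARALLEL DO
      DO 1 I = 1, N
         SNEW(I) = (F(I) - SOLN(I+1)*OFFDIAG(I)
     &            - SOLN(I-1)*OFFDIAG(I-1)) / DIAG(I)
 1    CONTINUE
!$OMP PARALLEL DO
      DO 2 I = 1, N
         SOLN(I) = SNEW(I)
 2    CONTINUE
      RETURN
      END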
4.6.2 Pipelining
Because the matrix is tightly banded, a "pipelining" approach can be used to achieve some parallelism for the Gauss-Seidel iteration. Suppose the overall iteration is Program 4.23. Note we have changed the iteration limits slightly, as well as including explicitly the outer iteration. The dependence vectors for Program 4.23 are \(\binom{1}{-1}\) and \(\binom{0}{1}\). Since there is a carrier in both loops, we cannot execute it in parallel. However, the loop transformation
\[ \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \qquad (4.22) \]
maps the distance vectors to \(\binom{1}{0}\) and \(\binom{1}{1}\). As we have seen before, the carrier index of both of these vectors is one. This means that the inner loop in the correspondingly transformed code can be executed in parallel.
The inverse loop transformation to (4.22) is
\[ \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix} \qquad (4.23) \]
for K=0,somelimit
for I=0,N
SOLN(I) = (F(I) - SOLN(I+1)*OFFDIAG(I) - SOLN(I-1)*OFFDIAG(I-1) )/DIAG(I)
endfor(I)
endfor(K)
Program 4.23: Complete code for Gauss-Seidel iteration with explicit outer iteration.
and allows us to write I = 2*II - KK, where II and KK are the new loop variables. Thus the code in Program 4.23 takes the form shown in Program 4.24.
The loop bounds also must be transformed. Using the geometric approach from Section 4.5.2, we see that the new iteration set is a parallelogram with vertices at
(0,0), (N,N), (2*somelimit,somelimit) and (2*somelimit+N,somelimit+N).
Thus KK ranges from zero to 2*somelimit+N, and II ranges within the parallelogram. The limits on II can be determined from the algebraic approach in Section 4.5.2 as follows. Let i' denote the numerical values of II, let k' denote the numerical values of KK, and let s denote somelimit. We have
\[ 0 \le 2i' - k' \le N \quad\text{and}\quad 0 \le k' - i' \le s. \]
These inequalities are equivalent to
\[ k' \le 2i' \le N + k' \quad\text{and}\quad k' - s \le i' \le k'. \]
Combining, we find
\[ \max\{ \tfrac{1}{2}k', \, k' - s \} \le i' \le \min\{ \tfrac{1}{2}(N + k'), \, k' \}, \]
which provides the limits for the II loop in Program 4.24.
for KK = 0, 2*somelimit + N
   for II = max((KK+1)/2, KK-somelimit), min((N+KK)/2, KK)
      SOLN(2*II-KK) = (F(2*II-KK) - SOLN(2*II-KK+1)*OFFDIAG(2*II-KK)
                      - SOLN(2*II-KK-1)*OFFDIAG(2*II-KK-1) )/DIAG(2*II-KK)
   endfor(II)
endfor(KK)
Program 4.24: Transformed code for Gauss-Seidel iteration with explicit outer iteration.
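Since the II loop carries no dependence, it may be marked parallel; a minimal sketch with an OpenMP directive (our code; SOMELIM stands for somelimit, and the array bounds are an assumption about the boundary handling):

      SUBROUTINE PIPEGS(SOLN, F, DIAG, OFFDIAG, N, SOMELIM)
C     Pipelined Gauss-Seidel as in Program 4.24: the outer KK loop
C     is sequential, the inner II loop is dependence-free.
      INTEGER N, SOMELIM, KK, II
      REAL SOLN(-1:N+1), F(0:N), DIAG(0:N), OFFDIAG(-1:N)
      DO 2 KK = 0, 2*SOMELIM + N
!$OMP PARALLEL DO
         DO 1 II = MAX((KK+1)/2, KK-SOMELIM), MIN((N+KK)/2, KK)
            SOLN(2*II-KK) = (F(2*II-KK)
     &         - SOLN(2*II-KK+1)*OFFDIAG(2*II-KK)
     &         - SOLN(2*II-KK-1)*OFFDIAG(2*II-KK-1)) / DIAG(2*II-KK)
 1       CONTINUE
 2    CONTINUE
      RETURN
      END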
The transformed code in Program 4.24 is referred to as a pipelined version of Program 4.23 because we start the computation of later iterations of the original K index as soon as the information required is available. Otherwise said, computed information is "piped" into the later iterations as needed. There are at least two drawbacks to this approach. A small one is that the amount of parallelism is variable (the size of the II loop); however, this only leads to a load balancing problem (see Definition 1.10). A more serious problem is the fact that somelimit is often determined adaptively by some termination criterion on the vector SOLN. This would typically be done after a complete iteration of the original I loop.
4.7 Chapter Comments
In Chapter 8, the High Performance Fortran (HPF) language will be introduced. This language has a directive (see Chapter 9) which asserts independence of loops. For example, consider the following program fragment:
      integer a(100), offset
      read(5,*) offset
!HPF$Independent
      do i = 1, 10
         a(i) = b(i) + a(i+offset)
      enddo
The !HPF$Independent directive claims that the iterations of the loop can be performed in any order. This is true only if offset \ge 10. For executions where offset < 10, this program will not be HPF conforming and will produce potentially erroneous results. Although a program model can assert various properties, such as freedom from deadlock and sequential consistency, it is sometimes possible to sidestep the model's intent with unpleasant consequences.
4.8 Exercises
Exercise 4.1 Consider the two code fragments in Program 4.1:

      z = 3.14159         z = 3.14159
C:    x = z + 3           z = z + 3
D:    y = z + 6           y = z + 6

Prove that there is no backward dependence (4.2) and no output dependence (4.3) between C and D, where C denotes the second line and D the third, in the codes in either column. (Hint: see Example 4.1.)
Exercise 4.2 Suppose the loop bounds, instead of (4.13), take the form
\[ 1 \le x < m \quad\text{and}\quad 1 < y \le n. \]
Derive the corresponding form of the Banerjee inequality (4.15).
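Exercise 4.3 Normalize (see Definition 4.4) the following loops: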
      do 1 I=3,9,2
 1       A(I) = B(I+1)

      do 1 I=9,2,-1
 1       B(I) = C(I+1)
Exercise 4.4 Prove that there are no loop-carried dependences in the code in Program 4.10.
Exercise 4.5 Determine all of the loop-carried dependences in the code in Program 4.11. Give all of the details of your derivation (i.e., give a complete proof).
Exercise 4.6 Suppose that DXO and DXI are declared local in the "for I" loop in Program 4.11. Prove there are no loop-carried dependences in the resulting code. (Hint: write the code with loop indices added to DXO and DXI and study the resulting dependences.)
Exercise 4.7 Prove that the codes in Program 4.12 and Program 4.13 produce exactly the same values for the arrays DIAG and OFFDIAG. (Hint: use induction to show both codes are equivalent to Program 4.10.)
Exercise 4.8 Suppose that DXO and DXI are declared local in the "for IP" loop in Program 4.13. Prove there are no loop-carried dependences in the resulting code, and that the code is correct. (Hint: write the code with loop indices added to DXO and DXI and study the resulting dependences.)
Exercise 4.9 Prove that an n-tuple I of integers is lexicographically positive if and only if I > \vec{0}, where \vec{0} denotes the n-tuple consisting of n zeros.
Exercise 4.10 Prove that any n-tuple I of integers is either lexicographically positive, lexicographically negative (i.e., I < \vec{0}) or \vec{0}. Prove that I is lexicographically positive if and only if -I is lexicographically negative. (Hint: characterize what it means to be lexicographically negative, cf. Definition 4.7.)
Exercise 4.11 Use the GCD test (Theorem 4.4) to determine whether there are any possible dependences in the code
do l = 1,n
A(9*l) = A(6*l+2) + 1
enddo
Exercise 4.14 Prove that a necessary condition for the transformed loop to be normalized is that the inverse matrix for the transformation have integer entries. (Hint: normalized loop indices get incremented by a unit vector at each step. The transforms of this vector correspond to a way of incrementing the transformed loop indices. Write the original loop indices in terms of the transformed loop indices times the inverse matrix. Since the transformed loop indices are normalized, show that a fractional inverse matrix would lead to fractional loop indices in the original variables.)
Exercise 4.15 Prove Theorem 4.6. (Hint: use a formula for the inverse matrix to show that the inverse of an integer matrix with determinant one must be integer. Then write I = T T^{-1} and use the fact that 1 = \det T \det T^{-1}. Note that the determinant of an integer matrix must be an integer.)
Exercise 4.16 Prove that the set of unimodular matrices is closed with respect to multiplication. (Hint: use the fact that \det UT = \det U \det T and Exercise 4.15.)
Exercise 4.17 Find the read and write sets of S(i,j) for the following loop nest:

      do i = 1 , m
         do j = 2 , n-2
S:          a(j+1) = .33 * (a(j) + a(j+1) + a(j+2))
         enddo
      enddo
Find the dependence distance vectors of the loop nest and draw its dependence graph. Which loop is the carrier of the dependences? Can either the i or the j loop be run in parallel? Why?
Exercise 4.18 Find a linear loop transformation for the loop in Exercise 4.17 so that the innermost loop of the transformed loop nest does not carry a dependence (and so can be run in parallel), and draw the dependence graph of the resulting loop nest.
Exercise 4.19 Apply the transformation you found in Exercise 4.18 to the loop nest in Exercise 4.17.
Exercise 4.20 Write a small program to verify that the transformed loop (with m = 3, n = 20 and a(0
DO 1 I = 1, MAX
DO 1 J = 1, MAX
DO 1 K = 1, MAX
A(K**2) = 1/(1 + A(I**2 + J**2) )
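Program 4.25: Code whose potential dependences involve a quadratic Diophantine equation.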
Exercise 4.21 Prove there are no loop-carried dependences in the inner loop in Program 4.19.
Exercise 4.22 Prove there are loop-carried dependences in the code in Program 4.25 for MAX sufficiently large.
Exercise 4.23 Prove there are no loop-carried dependences in the code in Program 4.26 for any value of MAX. (Hint: use Fermat's Last Theorem.)
DO 1 I = 1, MAX
DO 1 J = 1, MAX
DO 1 K = 1, MAX
DO 1 N = 3, MAX
A(K**N) = 1/(1 + A(I**N + J**N) )
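Program 4.26: Code whose potential dependences involve the Fermat equation.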
Exercise 4.24 There is another natural partial order "\le" on multi-indices, which is defined componentwise: (i_1, i_2, \ldots, i_n) \le (j_1, j_2, \ldots, j_n) if and only if i_k \le j_k for all k = 1, \ldots, n.
Exercise 4.25 In the case of n nested loops with exactly n distance vectors, one can try to determine a transformation to put all of the carriers in the first loop as follows. Let D be the matrix whose columns are the distance vectors. Suppose that D is unimodular. We seek a transformation T such that U = TD consists of column vectors, all of which have carrier index equal to one, so that the transformed loops have dependences carried only in the outer-most loop. Determine a matrix U with this property, and prove that the resulting matrix T exists (give some formula). Give conditions under which this defines a unimodular matrix T. Give an example for n = 2.
Exercise 4.26 Extend the GCD and Banerjee dependence tests to nested loops. (Hint: the algebra must be extended to n dimensions.)