
Intro to communication

9/8/09

Hypercubes
• Advantages:
– Best diameter and bisection width we have seen so far.
– Practically, these are very useful bounds and multiple systems
use a hypercube network topology.

• Disadvantages:
– Number of links per node is not constant
– p = 2^d, so there are d links per node = Θ(log p)

Bit labeling
• Because the number of processors along
any given dimension is 2, we can
represent a single coordinate by a single
bit: 0 or 1

• Each processor, therefore, has an identifier of d bits.
– For example, (0,1,1) in an 8-processor hypercube

Helpful facts
• Flipping any of the d bits returns a neighbor of a
given processor.
– Each flip corresponds to a link in the network.

• Because at most d bits can be flipped, a processor has d neighbors.

• This also gives us a quick algorithm to check if any two processors are neighbors.
– Hamming distance of 1
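
To make this concrete, here is a small sketch in Python (not from the slides; the helper names are mine) that lists a processor's neighbors by bit flips and checks the Hamming-distance-1 condition:

def neighbors(proc_id, d):
    # Flip each of the d bits once: every flip follows one link,
    # so these are exactly the d neighbors in the hypercube.
    return [proc_id ^ (1 << j) for j in range(d)]

def are_neighbors(a, b):
    # Two processors are neighbors iff their ids differ in exactly
    # one bit, i.e., their Hamming distance is 1.
    return bin(a ^ b).count("1") == 1

# Processor (0,1,1) = 0b011 in an 8-processor (d = 3) hypercube:
print(neighbors(0b011, 3))           # [2, 1, 7]
print(are_neighbors(0b011, 0b111))   # True: ids differ only in the top bit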

Communication paths
• First, find the bits in which the source and destination processor ids differ.

• Route the message to a neighbor that differs in one of those bits, and continue until the message arrives.

• If the processor ids differ in k bits, there are k! possible shortest paths.
– High connectivity = good
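
A minimal routing sketch in the same spirit (Python, illustrative only): it flips the differing bits from lowest to highest, which picks one of the k! shortest paths.

def route(src, dst):
    # Flip, one at a time, each bit in which src and dst differ.
    # Each flip is one hop to a neighbor, so the path has k hops.
    path, cur, diff, j = [src], src, src ^ dst, 0
    while diff:
        if diff & 1:             # ids differ in bit j: flip it
            cur ^= (1 << j)
            path.append(cur)
        diff >>= 1
        j += 1
    return path

print(route(0b001, 0b110))       # [1, 0, 2, 6]: 3 differing bits, 3 hops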

Our model of parallel computation


• As described previously, our model will be:

– p processors connected by an interconnection network
– Each processor has its own local memory
– Accessing remote memory requires using the network

• We want processors to communicate simultaneously
– Using MPI function calls for your assignments
– Further complicated because accessing remote
memory often requires a send/receive

Details
• We will assume that only one message is being
routed per processor.

• Before a message is sent, it must be prepared: a destination attached, error correction applied, and any other preprocessing done.

• This is called setup time, and we will use τ to model this parameter.

More details
• The transmission of the message can be
performed at the rate of µ time per word.

• It follows that total communication time can be modeled as:
– τ + µm, where m is the size of the message

• To be comparable to computation, we usually measure this in floating point operations, or flops.
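
As a tiny sketch of the model (the numbers below are placeholders in flop units, not measurements from the course):

def comm_time(m, tau, mu):
    # Time to send one message of m words: setup cost tau,
    # plus mu per word, both expressed in units of one flop.
    return tau + mu * m

# With tau = 3000 and mu = 50, even a 1-word message costs
# about as much as 3050 floating point operations.
print(comm_time(1, 3000, 50))        # 3050
print(comm_time(10000, 3000, 50))    # 503000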

Supporting simultaneous
communication
• A processor can send and receive a message at
any given time step.

• It is clear that a processor cannot perform multiple sends and receives at the same time.

• Therefore, we would like communication steps such that no two destinations are the same:
– We will call this a permutation network

Example
• This permutation (2 3 1 0) implies:
– Processor 0 sends a message to 2 (0, 2)
– Processor 1 sends a message to 3 (1, 3)
– Processor 2 sends a message to 1 (2, 1)
– Processor 3 sends a message to 0 (3, 0)

• Note that any step where a processor is not involved can be represented by communication with itself, e.g., (0,0).
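
A quick check that a proposed step really is a permutation (a sketch; the function name is mine):

def is_permutation(dest):
    # dest[i] is the processor that processor i sends to; the step is
    # a permutation iff no two sources share a destination.
    return sorted(dest) == list(range(len(dest)))

print(is_permutation([2, 3, 1, 0]))  # True: the example above
print(is_permutation([2, 2, 1, 0]))  # False: processors 0 and 1 both target 2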

Multistage communication networks

• Multistage interconnection networks (details next class and in text) are topologically equivalent.

• It can be shown that these can achieve a permutation in at most 3 steps.

Take home messages


• We make the theoretical assumption that
any permutation can be supported in
parallel (with some constant overhead).

• We will use permutation networks as our interconnection network in this course.
– Simplifies algorithm design and analysis

In class example
• In small groups, what are some
permutations that can be achieved using a
ring network?

• What about a hypercube? Come up with one example for a 3D cube of 8 processors.

Often used permutations


• The permutation network, by definition,
allows any communication where no two
destinations are the same in one step

• Even so, we usually only use the following:


– Shift permutations
– Hypercubic permutations

Shift permutations
• A shift permutation is one such that each
processor communicates with its neighbor

• Left shift:
– Processor i communicates with processor (i - 1) mod p

• Right shift:
– Processor i communicates with processor (i + 1) mod p
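
In code (a sketch; the destinations follow the definitions above, the helper names are mine):

def left_shift_dest(i, p):
    # Left shift: processor i sends to (i - 1) mod p.
    return (i - 1) % p

def right_shift_dest(i, p):
    # Right shift: processor i sends to (i + 1) mod p.
    return (i + 1) % p

p = 4
print([left_shift_dest(i, p) for i in range(p)])   # [3, 0, 1, 2]
print([right_shift_dest(i, p) for i in range(p)])  # [1, 2, 3, 0]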

Hypercubic permutations
• Let d = ceiling (log p)

• Just as in a static hypercube, each processor’s id can be viewed as a string of d bits

• Fix a bit position j. The communication in which each processor pairs with the processor whose id differs only in bit j is a hypercubic permutation.

• There are d hypercubic permutations
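
A sketch of the j-th hypercubic permutation as a destination list (Python; assumes p = 2^d, and the helper name is mine):

def hypercubic_perm(j, d):
    # Processor i exchanges with the processor whose id differs only in bit j.
    p = 1 << d
    return [i ^ (1 << j) for i in range(p)]

d = 3
for j in range(d):                    # exactly d such permutations
    print(j, hypercubic_perm(j, d))
# 0 [1, 0, 3, 2, 5, 4, 7, 6]
# 1 [2, 3, 0, 1, 6, 7, 4, 5]
# 2 [4, 5, 6, 7, 0, 1, 2, 3]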

Another reason why
hypercubes are cool

• Obviously, these can be simultaneously routed on any hypercube network.

• This is better than many multistage networks, some of which we will discuss next class.

Tips for parallel algorithm analysis


• In any algorithm we design, we will
separate its computation and
communication costs.

• For example:
– Computation time = O(f(n,p))
– Communication time = O(τg(n,p) + µh(n,p))

A simple algorithm example
• Suppose we’d like to add n numbers,
where n >> p.

• For simplicity, let’s assume p evenly divides n.

• What is one algorithm for this?

An approach
• It is clear that computing the sum of n/p
integers on a single processor requires n/p
time.

• Adding the p resulting partial sums can be performed in log p steps.

• Computation time = O(n/p + log p)
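
A sequential sketch of this two-phase algorithm (Python, simulating the p processors in a loop; illustrative only, not the MPI code you would write for an assignment, and it assumes p is a power of two that divides n):

def parallel_sum(nums, p):
    n = len(nums)
    chunk = n // p
    # Phase 1: each "processor" sums its n/p local numbers.
    partial = [sum(nums[i * chunk:(i + 1) * chunk]) for i in range(p)]
    # Phase 2: combine the p partial sums in log p pairwise steps.
    step = 1
    while step < p:
        for i in range(0, p, 2 * step):   # processor i receives from i + step
            partial[i] += partial[i + step]
        step *= 2
    return partial[0]

print(parallel_sum(list(range(32)), p=8))  # 496, matches sum(range(32))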

Communication time
• Communication time = O((τ + µ) log p)

• Spoiler: Can this be done on a hypercube?
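
Plugging the model into the algorithm above gives a rough time estimate (a sketch with placeholder τ and µ, not course data):

import math

def parallel_sum_time(n, p, tau, mu):
    # n/p local additions, plus ceil(log2 p) combining steps that each
    # send one word at cost (tau + mu).
    return n / p + math.ceil(math.log2(p)) * (tau + mu)

# With tau = 3000 and mu = 50 (flop units), p = 64:
print(parallel_sum_time(10**6, 64, 3000, 50))  # ~33925 vs. ~10**6 sequentially
print(parallel_sum_time(10**4, 64, 3000, 50))  # ~18456: slower than sequential

This also previews the next slides: asymptotically the algorithm is efficient once n is large enough, but the constants τ and µ decide where the crossover actually happens.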

How do we know if this is good?


• The sequential runtime of adding n
integers is Θ(n).

• Therefore, a parallel algorithm must take at least Ω(n/p) time; ours is O(n/p + log p)

• As a result, we are efficient as long as:
– n > p log p (in class), since the log p term is then dominated by the n/p term

But why?
• Note τ and µ are constants, and we are
using O notation here, not Θ

• Why isn't the algorithm's running time simply:
– max{f(n,p), g(n,p), h(n,p)}?

Older values of communication constants (relative to one flop)
Machine          τ        µ        τ/µ
IBM BG/L         3000     50-100   15-30
Intel Delta      4650     87       54
Intel Paragon    7800     9        867
Cray T3D         27000    9        3000
IBM SP1          28000    50       560
CM5              450      3        113

An example
• Suppose our algorithm achieves ideal speedup

• Further,
– f(n,p) = g(n,p) = h(n,p) = T(n, 1) / p

• If τ, however, is 1000, our speedup would be p/1000
– This implies 1000 processors would be as fast as 1

Rules of thumb
• We would like f(n,p) to be as close to
O(T(n,1)/p) as possible

• Ideally, h(n,p) is a slower-growing function than f(n,p), and g(n,p) grows even slower

