Wall clock time: the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether?
How much faster is the parallel version? This begs the obvious follow-up question: what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look better?
Raw FLOP count: what good are FLOP counts when they don't solve a problem?
Sources of Overhead in Parallel Programs
[Figure: execution profile of a parallel program across processing elements P0 through P7.]
The parallel runtime is the time that elapses from the moment
the first processor starts to the moment the last processor
finishes execution.
Let T_all = p T_P be the total time collectively spent by all the processing elements. The total overhead T_o is the difference between T_all and the serial runtime T_S:

T_o = p T_P − T_S.    (1)
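As a minimal sketch (Python, with made-up timing numbers), the total overhead follows directly from measured serial and parallel runtimes:

```python
def total_overhead(p, t_parallel, t_serial):
    """Total overhead T_o = p*T_P - T_S: time collectively spent by all
    p processing elements beyond what the serial program needs."""
    return p * t_parallel - t_serial

# Hypothetical measurements: the serial run takes 100 s, and on p = 8
# processing elements the parallel run takes 16 s.
print(total_overhead(8, 16.0, 100.0))  # 8*16 - 100 = 28 s of overhead
```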
Performance Metrics for Parallel Systems: Speedup
[Figure: computing the global sum on 16 processing elements, shown step by step; panel (e): accumulation of the sum at processing element 0 after the final communication.]
Mathematically, it is given by

E = S / p.    (2)
For the example of adding n numbers on n processing elements, efficiency E is given by

E = (n / log n) / n = 1 / log n.
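A quick numeric check of these expressions (Python sketch; p = n processing elements for adding n numbers, log taken base 2):

```python
import math

def speedup_add(n):
    # Serial addition takes Theta(n) steps; the parallel algorithm on
    # n processing elements takes Theta(log n) steps, so S = n / log n.
    return n / math.log2(n)

def efficiency_add(n):
    # E = S / p with p = n, i.e. 1 / log n.
    return speedup_add(n) / n

print(speedup_add(16), efficiency_add(16))  # 4.0 0.25
```

Note how efficiency decays as n grows: the algorithm is fast but not cost-optimal when run on n processing elements.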
Parallel Time, Speedup, and Efficiency Example
[Figure: (a) an image partitioned among processing elements; (b) and (c) 3×3 templates applied to each pixel.]
T_P = 9 t_c n^2 / p + 2 (t_s + t_w n).
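Plugging illustrative machine constants into this expression (the values of t_c, t_s, and t_w below are assumptions, not measurements) gives a feel for the resulting speedup and efficiency:

```python
def t_serial(n, t_c=1e-9):
    # 9 operations per pixel over an n x n image.
    return 9 * t_c * n * n

def t_parallel(n, p, t_c=1e-9, t_s=1e-6, t_w=1e-8):
    # T_P = 9 t_c n^2 / p + 2 (t_s + t_w n); constants are illustrative.
    return 9 * t_c * n * n / p + 2 * (t_s + t_w * n)

n, p = 1024, 16
S = t_serial(n) / t_parallel(n, p)
E = S / p
print(round(S, 2), round(E, 3))
```

With these constants the communication term is small relative to the computation, so efficiency stays above 0.95.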
Cost reflects the sum of the time that each processing element
spends solving the problem.
The first log p of the log n steps of the original algorithm are
simulated in (n/p) log p steps on p processing elements.
[Figure: computing the sum of 16 numbers on four processing elements, panels (a)–(d); the final result accumulates at processing element 0.]
[Plot: speedup S of the binary exchange, 2-D transpose, and 3-D transpose algorithms as a function of problem size.]
E = S / p = T_S / (p T_P)

or

E = 1 / (1 + T_o / T_S).    (4)
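The two forms of E are algebraically identical; a small Python sketch with hypothetical timings confirms they agree:

```python
def efficiency_direct(p, t_p, t_s):
    # E = S / p = T_S / (p * T_P)
    return t_s / (p * t_p)

def efficiency_via_overhead(p, t_p, t_s):
    # E = 1 / (1 + T_o / T_S), with T_o = p*T_P - T_S
    t_o = p * t_p - t_s
    return 1.0 / (1.0 + t_o / t_s)

p, t_p, t_s = 8, 16.0, 100.0    # made-up measurements
print(efficiency_direct(p, t_p, t_s), efficiency_via_overhead(p, t_p, t_s))
```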
[Plot: speedup S versus number of processing elements p for n = 64, 192, 320, and 512.]
[Plots: (a) efficiency E versus p; (b) efficiency E versus problem size W.]
What is the rate at which the problem size must increase with
respect to the number of processing elements to keep the
efficiency fixed?
This rate determines the scalability of the system. The slower this
rate, the better.
T_P = (W + T_o(W, p)) / p.    (8)

E = 1 / (1 + T_o(W, p) / W),

T_o(W, p) / W = (1 − E) / E,

W = (E / (1 − E)) · T_o(W, p).    (11)
W = K p^{3/2}.    (14)

W = K p^{3/4} W^{3/4}
W^{1/4} = K p^{3/4}
W = K^4 p^3.    (15)
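When T_o has several terms, each term yields its own isoefficiency relation, and the overall isoefficiency function is the largest of them. A sketch, assuming T_o = p^{3/2} + p^{3/4} W^{3/4} and K = 1:

```python
def required_w(p, K=1.0):
    # Balance W against each overhead term separately and keep the max:
    w1 = K * p**1.5        # from the p^(3/2) term:         W = K p^(3/2)
    w2 = K**4 * p**3       # from the p^(3/4) W^(3/4) term: W = K^4 p^3
    return max(w1, w2)

print(required_w(4))  # the p^3 term dominates: 64 vs. 8
```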
p T_P = Θ(W).    (16)

W + T_o(W, p) = Θ(W), i.e.,

T_o(W, p) = O(W)    (17)
W = Ω(T_o(W, p)).    (18)
dT_P / dp = 0.    (19)

T_P = n/p + 2 log p.    (20)
(One may verify that this is indeed a minimum by checking that the second derivative is positive.)
Note that both T_P^{min} and T_P^{cost_opt} for adding n numbers are Θ(log n). This may not always be the case.
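A numeric check of this minimization (Python sketch; natural log is used so that the derivative condition dT_P/dp = −n/p^2 + 2/p = 0, i.e. p = n/2, holds exactly):

```python
import math

def t_p(n, p):
    # T_P = n/p + 2 log p  (natural log here, so the calculus is exact)
    return n / p + 2 * math.log(p)

n = 1024
best_p = min(range(1, n + 1), key=lambda p: t_p(n, p))
print(best_p)  # p = n/2 = 512 minimizes T_P
```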
Asymptotic Analysis of Parallel Programs
Algorithm    A1          A2        A3            A4
p            n^2         log n     n             √n
T_P          1           n         √n            √n log n
S            n log n     log n     √n log n      √n
E            (log n)/n   1         (log n)/√n    1

(The entries are consistent with a problem whose serial runtime is T_S = n log n, with S = T_S/T_P and E = S/p.)
S = t_c n^2 / (t_c n^2/p + t_s log p + t_w n).    (24)

For scaled problems with n^2 = c p, this becomes

S' = t_c c p / (t_c c p / p + t_s log p + t_w √(cp)) = O(√p).
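The O(√p) behavior can be checked numerically. A sketch with illustrative constants (c, t_c, t_s, t_w below are assumptions chosen so the t_w √(cp) term dominates, not measurements):

```python
import math

def scaled_speedup(p, c=1024.0, t_c=1.0, t_s=10.0, t_w=10.0):
    # Memory-constrained scaling of matrix-vector multiplication: n^2 = c*p.
    n2 = c * p
    t_par = t_c * n2 / p + t_s * math.log2(p) + t_w * math.sqrt(n2)
    return t_c * n2 / t_par

for p in (64, 256, 1024):
    print(p, round(scaled_speedup(p), 1))
```

Quadrupling p multiplies S' by a bit over 2, consistent with S' = O(√p).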
We have T_P = O(n^2/p).
T_P = t_c n^3/p + t_s log p + 2 t_w n^2/√p.

S = t_c n^3 / (t_c n^3/p + t_s log p + 2 t_w n^2/√p).    (25)
Scaled Speedup: Example (continued)
Under memory-constrained scaling (n^2 = c p, so n^3 = (cp)^{3/2}):

S' = t_c (cp)^{3/2} / (t_c (cp)^{3/2}/p + t_s log p + 2 t_w cp/√p) = O(p).

Under time-constrained scaling (n^3 = c p):

S'' = t_c c p / (t_c c p/p + t_s log p + 2 t_w (cp)^{2/3}/√p) = O(p^{5/6}).
W = T_ser + T_par.

T_P = T_ser + T_par / p.

T_P = T_ser + (W − T_ser) / p.    (26)
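This is Amdahl's-law-style behavior: the serial component bounds the achievable speedup by W/T_ser. A sketch with an assumed 10% serial part:

```python
def amdahl_time(w, t_ser, p):
    # T_P = T_ser + (W - T_ser)/p: serial part plus perfectly parallel rest.
    return t_ser + (w - t_ser) / p

w, t_ser = 100.0, 10.0          # illustrative: 10% of the work is serial
for p in (1, 10, 100, 10**6):
    print(p, round(w / amdahl_time(w, t_ser, p), 3))
```

Even with a million processing elements the speedup cannot exceed W/T_ser = 10.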
Serial Fraction f
f = T_ser / W.

Therefore, we have:

T_P = f W + (W − f W) / p,

T_P / W = f + (1 − f) / p.
1/S = f + (1 − f)/p.
We have

1/S = (t_c n^2/p + t_s log p + t_w n) / (t_c n^2),

so that

f = (1/S − 1/p) / (1 − 1/p)    (28)

or

f = (t_s p log p + t_w n p) / (t_c n^2 (p − 1)).

For large p, this gives

f ≈ (t_s log p + t_w n) / (t_c n^2).
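Given a measured speedup S on p processing elements, the serial fraction can be computed directly from 1/S = f + (1 − f)/p. A sketch with hypothetical measurements:

```python
def serial_fraction(speedup, p):
    # Solve 1/S = f + (1 - f)/p for f:  f = (1/S - 1/p) / (1 - 1/p)
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Hypothetical measured speedups:
print(serial_fraction(6.0, 8))   # S = 6 on p = 8
print(serial_fraction(8.0, 8))   # perfect speedup implies f = 0
```

A rising f as p grows signals growing overhead rather than a truly serial bottleneck.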