
Papers by Hala ElAarag on
Synchronization of Shared Memory Multiprocessors

1
Minimum-Length Critical Path
Algorithms for Barrier
Synchronization in Shared
Memory Parallel Machines

2
Barrier Synchronization
• Introduction
• MCS Algorithm
• Proposal
– Motivation
– Implementation details
– VFO Algorithm
– Results

3
Introduction
• Why do we use Barriers?

• Types of Barriers.
– Centralized vs. Tree-based Algorithms

• What is a “HOT SPOT”?

4
Centralized Barriers
shared count : integer := P
shared sense : Boolean := true
processor private local_sense : Boolean := true

procedure central_barrier
    local_sense := not local_sense        // each processor toggles its own sense
    if fetch_and_decrement(&count) = 1
        count := P                        // last arriver resets the count
        sense := local_sense              // last processor toggles the global sense
    else
        repeat until sense = local_sense  // spin until the global sense flips
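A runnable sketch of this sense-reversing barrier in Python, with a lock standing in for fetch_and_decrement (the class and variable names are illustrative, not from the papers):

```python
import threading

class CentralBarrier:
    """Sense-reversing centralized barrier for p threads (pseudocode above)."""
    def __init__(self, p):
        self.p = p
        self.count = p
        self.sense = True
        self._lock = threading.Lock()      # stands in for fetch_and_decrement
        self._tls = threading.local()      # holds each thread's local_sense

    def wait(self):
        if not hasattr(self._tls, "sense"):
            self._tls.sense = True
        self._tls.sense = not self._tls.sense      # toggle private sense
        with self._lock:                           # atomic decrement
            self.count -= 1
            last = (self.count == 0)
        if last:
            self.count = self.p                    # reset for the next episode
            self.sense = self._tls.sense           # toggle global sense: release
        else:
            while self.sense != self._tls.sense:   # spin on the global sense
                pass
```

The sense flag is what makes the barrier immediately reusable: spinners from episode k cannot be confused with arrivals for episode k+1, even though they share one counter.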

5
Tree-Based Barriers

[Figure: processors arranged hierarchically for a tree-based barrier — 1 group of 10 at the root, 10 groups of 10 below it, 100 groups at the next level, and so on]
6
MCS Algorithm
Arrival tree (fan-in = 4)            Wakeup tree (fan-out = 2)

[Figure: an arrival tree with fan-in 4 and a binary wakeup tree over processors P0–P14, both rooted at P0]
7
MCS Algorithm
Format of a tree node
parentsense | parentpointer | childpointers[0..1] | havechild[0..3] | childnotready[0..3] | dummy
(in the figure, parentsense = F and the remaining flags = T)

[Figure: the arrival and wakeup trees of the previous slide, each node holding the record above]
8
Proposal
• Motivation: Variable Fan-Out (VFO) wakeup tree

[Figure: VFO wakeup tree for P = 16 — P0 at the root with children P1, P3, P7, P15; the fan-out varies from node to node]
9
Wakeup time for P=16 with VFO tree

T=0 P0

T=1 P1

T=2 P2 P3

T=3 P4 P5 P6 P7

T=4 P8 P9 P10 P11 P12 P13 P14 P15

10
Implementation Details
Approach I
leftchildpointer and nextsiblingpointer

[Figure: the VFO wakeup tree for P = 16, stored in left-child / next-sibling form]

11
Procedure to build a VFO wakeup tree of 2^k nodes:

CreateVFOTree (k)
    if (k = 0)
        { allocate a new node and let R be its pointer;
          R->leftchildpointer := null; R->nextsiblingpointer := null;
          return (R) }
    R1 := CreateVFOTree (k-1);
    /* R1 is the root of a VFO tree of half the required size */
    R2 := CreateVFOTree (k-1);
    R2->nextsiblingpointer := R1->leftchildpointer;
    R1->leftchildpointer := R2;
    return (R1)

12
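The same construction in runnable Python (a sketch; `Node` and the helper names are mine, mirroring the pseudocode's leftchildpointer/nextsiblingpointer fields):

```python
class Node:
    """Tree node in left-child / next-sibling form."""
    def __init__(self):
        self.leftchild = None      # leftchildpointer in the pseudocode
        self.nextsibling = None    # nextsiblingpointer in the pseudocode

def create_vfo_tree(k):
    """Build a VFO wakeup tree of 2**k nodes and return its root."""
    if k == 0:
        return Node()              # a single node, no children
    r1 = create_vfo_tree(k - 1)    # root of a VFO tree of half the size
    r2 = create_vfo_tree(k - 1)
    r2.nextsibling = r1.leftchild  # splice r2 in as the new leftmost child
    r1.leftchild = r2
    return r1

def count_nodes(root):
    if root is None:
        return 0
    return 1 + count_nodes(root.leftchild) + count_nodes(root.nextsibling)

def wakeup_depth(root):
    """Cycles to wake the subtree if a node signals one child per cycle."""
    t, worst, child = 0, 0, root.leftchild
    while child is not None:
        t += 1                                  # one cycle per signalled child
        worst = max(worst, t + wakeup_depth(child))
        child = child.nextsibling
    return worst
```

The result is a binomial tree: 2^k nodes in all, and every node is awake after exactly k cycles — the log_2 P wakeup time claimed in Lemma 1.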
Approach I

[Figure: the VFO wakeup tree for P = 16 built by CreateVFOTree]

13
Approach II
• For a non-root node j:
  – leftmost child = 2*j (the root's leftmost child is node 1)
  – next sibling = 2*j+1
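With this numbering the whole wakeup schedule can be computed without pointers. The sketch below (function name is mine) derives each processor's wakeup cycle from the 2*j / 2*j+1 rule, reproducing the wakeup-time table for P = 16 shown earlier:

```python
def vfo_wakeup_times(P):
    """Wakeup cycle of each processor under the Approach II numbering.

    Node j signals its children one per cycle by walking
    childid = 2*j (or 1 for the root), then childid = 2*childid + 1.
    """
    t = [0] * P
    pending = [0]
    for j in pending:                  # breadth-first over the implicit tree
        childid, step = max(1, 2 * j), 1
        while childid < P:
            t[childid] = t[j] + step   # signalled `step` cycles after j wakes
            pending.append(childid)
            childid, step = 2 * childid + 1, step + 1
    return t
```

For P = 16 this yields exactly the T = 0..4 schedule of slide 10, with the last processor awake after log_2 P cycles.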
[Figure: the VFO wakeup tree for P = 16 labelled with the Approach II indices]

14
VFO Algorithm
type treenode = record
    parentsense : Boolean
    parentpointer : ^Boolean               // pointer to parent in the arrival tree
    havechild : array [0..3] of Boolean
    childnotready : array [0..3] of Boolean
    dummy : Boolean                        // pseudo-data

// Shared variables:
nodes : shared array [0..P-1] of treenode
    // nodes[j] is allocated in memory module #j, which is
    // locally accessible to the processor whose index is j

// Local variables for each processor:
myid : integer                             // processor index
sense : Boolean                            // initially true
childspinpointer : ^Boolean
childid : integer

// Initial values in nodes[i]:
// parentpointer = &nodes[floor((i-1)/4)].childnotready[(i-1) mod 4],
//                 or &dummy if i = 0
// childnotready = havechild
// parentsense = false
// havechild[j] = true if 4*i + j + 1 < P; otherwise false
//                (child j of node i in the fan-in-4 arrival tree is node 4*i + j + 1)

15
VFO Algorithm (cont.)
Algorithm VFO_barrier(myid)   // myid is the index of the calling processor
    with nodes[myid] do
        repeat until childnotready = {false, false, false, false}
        childnotready := havechild           // prepare for the next barrier
        parentpointer^ := false              // remote access to tell my parent I'm ready
        // if not root, wait until my parent signals wakeup
        if myid != 0
            repeat until parentsense = sense
        // now signal my children in the wakeup tree
        childid := Max {1, 2*myid}           // leftmost child of the root is 1
        repeat while childid < P
            childspinpointer := &nodes[childid].parentsense
            childspinpointer^ := sense       // remote access
            childid := 2 * childid + 1
        // prepare for the next barrier
        sense := not sense                   // sense is not shared but is retained
                                             // across barrier calls

16
Results
Lemma 1:
The VFO barrier algorithm has a worst-case
extra arrival overhead of log_4 P, a best
case of O(1), and a wakeup time of log_2 P.

17
Results
Lemma 2:
The VFO algorithm requires only O(P) shared
space and performs the minimum number of
remote accesses, 2P-2, for a barrier of P
processors.
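One way to sanity-check the 2P-2 figure: each non-root processor performs exactly one remote write on arrival (to its parent's childnotready slot), and each wakeup-tree edge carries exactly one remote write of parentsense. A small counting sketch (function name is mine) using the child rule from the VFO algorithm:

```python
def remote_accesses(P):
    """Remote writes in one VFO barrier episode with P processors."""
    arrival = P - 1                    # one write per non-root on arrival
    wakeup = 0
    for j in range(P):                 # wakeup writes = tree edges out of j
        childid = max(1, 2 * j)        # leftmost child (the root's is node 1)
        while childid < P:
            wakeup += 1                # one remote write per signalled child
            childid = 2 * childid + 1
    return arrival + wakeup
```

Since every processor 1..P-1 is signalled exactly once, the wakeup half also contributes P-1 writes, giving 2P-2 in total.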

18
Expanded Fetch&: A Scalable
Approach to Reducing Traffic in a
Multistage Interconnection Network

19
Fetch and Phi Operations
The Hardware Solution
• Generalized Fetch and Phi
• Restricted Fetch and Phi
• Proposal
– Motivation
– Basic idea
– Analysis

20
Previous Work
• [GOT83] designed the NYU Ultracomputer and introduced
the Generalized Fetch and Phi operations
• [SOH89] introduced the Restricted Fetch and Phi
operations
• [DIC92] designed scalable combining switches that are
much cheaper than the ones in the NYU Ultracomputer
• [TZE91] proposed an alternative combining architecture
that has much lower hardware complexity than previous
ones
• [HAN96] proposed a novel MIN architecture which is the
Simple Serial Synchronized MIN
21
Generalized Fetch and Phi
• Example
  Assuming V is a shared variable,
      PEi: ANSi ← F&A(V, ei)
      PEj: ANSj ← F&A(V, ej)
  then either
      ANSi ← V       and  ANSj ← V + ei
  or
      ANSi ← V + ej  and  ANSj ← V
  and, in either case, V = V + ei + ej.

22
Restricted Fetch and Phi

• RF(X, C), where C is a constant for all


requesting processors
S4

S3

S2

S1

P1 P2 P3 P4

# of lines = log A + B + C
where A = # of addressable memory locations 23
B = # of bits needed to represent C
N = # of processors in the system
Restricted Fetch and Phi
Suppose
• P1, P3, and P4 execute RF(X, C)
• priority: P1 > P2 > P3 > P4

The three values of X that represent a serialization of the instruction:
    V1 = X Φ C
    V2 = X Φ C Φ C
    V3 = X Φ C Φ C Φ C
24
Proposal

• Motivation
  – In a multiprocessor system, each processor computes a
    small part of a much larger task.
  – With the task distributed among N processors, each one is
    likely to be doing the same, or nearly the same, set of
    operations on the same or similar data.
• Idea
  – Maximize the number of cycles for which the bus is held, so
    that a large number of processors join each operation.

25
Model and Assumption
• Let N = Number of processors
• Assumption: once an MF occurs, each of the
other N - 1 processors joins independently with
identical probability p during each cycle.
• Thus the number of processors joining in each
cycle follows a binomial distribution with
parameter p.

26
Analysis
The expected number of processors joining in the first cycle is: (N-1)p

In the second cycle:
    (N - (N-1)p - 1)p = (N - Np + p - 1)p = (N(1-p) - (1-p))p = (N-1)(1-p)p

In the third cycle:
    (N - (N-1)p - (N-1)(1-p)p - 1)p = (N-1)(1 - p - (1-p)p)p = (N-1)(1-p)^2 p

In the m-th cycle:
    (N - (N-1)p - (N-1)(1-p)p - (N-1)(1-p)^2 p - ... - (N-1)(1-p)^(m-2)p - 1)p
  = (N-1)(1 - p - (1-p)p - (1-p)^2 p - ... - (1-p)^(m-2)p)p
  = (N-1)(1-p)^(m-1) p
27
Analysis
Thus the total number of processors expected
to have joined the MF by m cycles is:

M = E[# joining in 1st cycle] + E[# joining in 2nd cycle]
    + ... + E[# joining in m-th cycle]
  = (N-1)(1 - (1-p)^m)
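The closed form can be checked against the cycle-by-cycle recurrence from the previous slide (a quick numeric sketch, not from the paper):

```python
def expected_joiners(N, p, m):
    """Sum the per-cycle expectations: E_k = (expected remaining) * p."""
    remaining = N - 1                  # processors that have not yet joined
    total = 0.0
    for _ in range(m):
        joined = remaining * p         # expected joiners this cycle
        total += joined
        remaining -= joined            # (N-1)(1-p)^k remain after k cycles
    return total
```

Summing the geometric series (N-1)(1-p)^(k-1) p over k = 1..m collapses to (N-1)(1 - (1-p)^m), which the recurrence reproduces.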

28
Expected number of joining processors for various p

[Figure: M = (N-1)(1 - (1-p)^m) plotted against m for several values of p]
29
Analysis
Setting the marginal expected gain of holding the bus one more
cycle equal to one processor:

    d/dm [ (N-1)(1 - (1-p)^m) ] = 1
    -(N-1)(1-p)^m ln(1-p) = 1
    (1-p)^m = -1 / ((N-1) ln(1-p))
    m log(1-p) = log( -1 / ((N-1) ln(1-p)) )
    m = log( -1 / ((N-1) ln(1-p)) ) / log(1-p)
      = log_(1-p)( -1 / ((N-1) ln(1-p)) )

30
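A numeric check of this optimum (function names are mine): at the m given by the closed form, the marginal gain d/dm of M should be exactly one extra joiner per cycle.

```python
import math

def optimal_m(N, p):
    """m at which one more cycle of bus holding gains exactly one joiner."""
    return math.log(-1.0 / ((N - 1) * math.log(1.0 - p)), 1.0 - p)

def marginal_gain(N, p, m):
    """d/dm of M = (N-1)(1 - (1-p)**m)."""
    return -(N - 1) * (1.0 - p) ** m * math.log(1.0 - p)
```

Note that since 0 < p < 1, ln(1-p) is negative, so the argument of the outer log is positive and the formula is well defined.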
Expected number of joining processors for various N

[Figure: M plotted against m for several values of N]
31
Tradeoff
• A large m allows many processors to join one MF.
• A small m frees the bus sooner for other MFs.

• Let q = mean number of cycles between
  different MFs.
• Hold the bus for min(m, q-1) cycles.

32
