Minimum-Length Critical Path Algorithms for Barrier Synchronization in Shared Memory Parallel Machines
Barrier Synchronization
• Introduction
• MCS Algorithm
• Proposal
– Motivation
– Implementation details
– VFO Algorithm
– Results
Introduction
• Why do we use Barriers?
• Types of Barriers.
– Centralized vs. Tree-based Algorithms
Centralized Barriers
shared count : integer := P
shared sense : Boolean := true
processor private local_sense : Boolean := true

procedure central_barrier
    local_sense := not local_sense        // each processor toggles its own sense
    if fetch_and_decrement(&count) = 1
        count := P
        sense := local_sense              // last processor toggles global sense
    else
        repeat until sense = local_sense
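The pseudocode above translates almost directly into a runnable sketch. The following Python version (class and method names are my own; a lock stands in for the atomic fetch_and_decrement instruction) shows the sense-reversing idea:

```python
import threading

class CentralBarrier:
    """Sense-reversing centralized barrier, following the pseudocode above.
    The lock stands in for an atomic fetch_and_decrement instruction."""

    def __init__(self, p):
        self.p = p
        self.count = p                 # shared count := P
        self.sense = True              # shared sense := true
        self.lock = threading.Lock()
        self.private = threading.local()

    def wait(self):
        # local_sense := not local_sense (each thread toggles its own sense)
        local_sense = not getattr(self.private, "sense", True)
        self.private.sense = local_sense
        with self.lock:                # fetch_and_decrement(&count)
            self.count -= 1
            last = self.count == 0
        if last:
            self.count = self.p        # reset for the next barrier episode
            self.sense = local_sense   # last arriver toggles the global sense
        else:
            while self.sense != local_sense:
                pass                   # spin until released
```

Because every waiter spins on the single shared sense flag, all P-1 processors generate traffic to one location, which is exactly the scalability problem the tree-based schemes below address.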
Tree-Based Barriers
[Diagram: hierarchical grouping with fan-in 10 — 1 group of 10 at the top, 10 groups of 10 below it, 100 groups below that, and so on]
MCS Algorithm
Arrival tree (fan-in = 4): P0 at the root with children P1, P2, P3, P4
Wakeup tree (fan-out = 2): P0 at the root with children P1, P2 and grandchildren P3, P4, P5, P6
MCS Algorithm
Format of a tree node: F T T T T T T T T
[Diagram: the arrival and wakeup trees, and the VFO wakeup tree for P = 16 with P0 at the root and children P1, P3, P7, P15]
Wakeup time for P = 16 with VFO tree
T=0: P0
T=1: P1
T=2: P2 P3
T=3: P4 P5 P6 P7
T=4: P8 P9 P10 P11 P12 P13 P14 P15
Implementation Details
Approach I
• Leftchildpointer and nextsiblingpointer per node
[Diagram: VFO wakeup tree for P = 16 — P0 at the root with children P1, P3, P7, P15]
Procedure to build a VFO wakeup tree of 2^k nodes:

CreateVFOTree (k)
    if (k = 0)
        allocate a new node and let R be its pointer
        R->leftchildpointer := null; R->nextsiblingpointer := null
        return (R)
    R1 := CreateVFOTree (k-1)
    /* R1 is the root of a VFO tree of half the required size */
    R2 := CreateVFOTree (k-1)
    R2->nextsiblingpointer := R1->leftchildpointer
    R1->leftchildpointer := R2
    return (R1)
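CreateVFOTree transcribes directly into Python (the class and helper names below are my own), with a helper that counts nodes through the left-child/next-sibling links:

```python
class VFONode:
    """Tree node with the two pointers used in Approach I."""
    def __init__(self):
        self.leftchild = None      # leftchildpointer
        self.nextsibling = None    # nextsiblingpointer

def create_vfo_tree(k):
    """Build a VFO wakeup tree of 2**k nodes, as in CreateVFOTree(k)."""
    if k == 0:
        return VFONode()
    r1 = create_vfo_tree(k - 1)    # root of a VFO tree of half the required size
    r2 = create_vfo_tree(k - 1)
    r2.nextsibling = r1.leftchild  # splice r2 in as r1's new leftmost child
    r1.leftchild = r2
    return r1

def count_nodes(node):
    """Count nodes reachable through leftchild/nextsibling links."""
    if node is None:
        return 0
    return 1 + count_nodes(node.leftchild) + count_nodes(node.nextsibling)
```

Each doubling step reuses the half-size tree's root, so the root of a 2^k-node tree ends up with k children, the largest subtree first.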
Approach II
• For a non-root node j:
  – leftmost child = 2*j
  – next sibling = 2*j + 1
[Diagram: the same VFO wakeup tree for P = 16, with node j's children at indices 2j, 4j+1, 8j+3, …]
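The indexing scheme of Approach II makes the wakeup schedule easy to simulate with plain arithmetic. A sketch (my own function name; node 1 is taken as the root's leftmost child, matching childid = Max{1, 2*myid} in the VFO algorithm):

```python
def vfo_wake_times(P):
    """Return wake_time[j] for each processor j, assuming each awake
    processor signals its wakeup-tree children one per cycle, leftmost first."""
    times = [None] * P

    def wake(j, t):
        times[j] = t
        child = max(1, 2 * j)      # leftmost child (node 1 for the root)
        delay = 1                  # one cycle per signal
        while child < P:
            wake(child, t + delay)
            child = 2 * child + 1  # next sibling
            delay += 1
        # so node j's children are 2j, 4j+1, 8j+3, ...

    wake(0, 0)
    return times
```

For P = 16 this reproduces the schedule on the earlier slide: one processor wakes at T=1, two at T=2, four at T=3, and eight at T=4, for a wakeup time of log₂ P.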
VFO Algorithm
type treenode = record
    parentsense : Boolean
    parentpointer : ^Boolean       // pointer to parent in the arrival tree
    havechild : array [0..3] of Boolean
    childnotready : array [0..3] of Boolean
    dummy : Boolean                // pseudo-data

// Shared variables:
nodes : shared array [0..P-1] of treenode
    // nodes[j] is allocated in memory module #j, which is
    // locally accessible to the processor whose index is j

// Local variables for each processor:
myid : integer                     // processor index
sense : Boolean                    // initially true
childspinpointer : ^Boolean
childid : integer
VFO Algorithm (cont.)
Algorithm VFO_barrier(myid)            // myid is the index of the calling processor
    with nodes[myid] do
        repeat until childnotready = {false, false, false, false}
        childnotready := havechild     // prepare for next barrier
        parentpointer^ := false        // remote access to tell my parent I'm ready
        // if not root, wait until my parent signals wakeup
        if myid != 0
            repeat until parentsense = sense
        // now signal my children in the wakeup tree
        childid := Max {1, 2*myid}     // leftmost child of the root is node 1
        repeat while childid < P
            childspinpointer := &nodes[childid].parentsense
            childspinpointer^ := sense // remote access
            childid := 2*childid + 1
        // prepare for next barrier
        sense := not sense             // sense is not shared but is retained
                                       // across barrier calls
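The algorithm can be exercised with threads. Below is a minimal Python sketch under two stated assumptions: the arrival tree uses the indexing in which node j's arrival children are 4j+1 .. 4j+4 (the slides leave this implicit), and plain attribute spins stand in for spinning on locally allocated flags, so the sketch ignores the memory locality that motivates the real algorithm:

```python
import threading

class VFOBarrier:
    """Sketch of the VFO barrier: fan-in-4 arrival tree, VFO wakeup tree."""

    class Node:
        def __init__(self, j, P):
            # havechild[c] is true iff arrival-tree child 4j+c+1 exists
            self.havechild = [4 * j + c + 1 < P for c in range(4)]
            self.childnotready = list(self.havechild)
            self.parentsense = False

    def __init__(self, P):
        self.P = P
        self.nodes = [self.Node(j, P) for j in range(P)]
        self.sense = [True] * P            # private sense per processor

    def wait(self, myid):
        node = self.nodes[myid]
        # wait until all arrival-tree children have checked in
        while any(node.childnotready):
            pass
        node.childnotready = list(node.havechild)   # prepare for next barrier
        if myid != 0:
            # tell my arrival-tree parent that I'm ready
            parent = (myid - 1) // 4
            self.nodes[parent].childnotready[(myid - 1) % 4] = False
            # wait until my wakeup-tree parent signals me
            while node.parentsense != self.sense[myid]:
                pass
        # signal my children in the wakeup tree
        childid = max(1, 2 * myid)
        while childid < self.P:
            self.nodes[childid].parentsense = self.sense[myid]
            childid = 2 * childid + 1
        self.sense[myid] = not self.sense[myid]     # prepare for next barrier
```

As in the pseudocode, each node resets childnotready before notifying its parent, which is what makes back-to-back barrier episodes safe.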
Results
Lemma 1:
The VFO barrier algorithm has a worst-case extra arrival overhead of log₄ P, a best case of O(1), and a wakeup time of log₂ P.
Results
Lemma 2:
The VFO algorithm requires only O(P) shared
space and performs the minimum number of
remote accesses, 2P-2, for a barrier of P
processors.
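Lemma 2's remote-access count can be sanity-checked from the algorithm itself: each non-root processor performs one remote write on arrival (parentpointer^ := false), and the wakeup loops together perform one remote write per wakeup-tree edge. A small counting sketch (function name is my own):

```python
def remote_accesses(P):
    """Count the remote writes of one VFO barrier episode:
    one arrival write per non-root, one wakeup write per tree edge."""
    arrival = P - 1                  # parentpointer^ := false, each non-root
    wakeup = 0
    for myid in range(P):
        childid = max(1, 2 * myid)   # leftmost wakeup-tree child
        while childid < P:
            wakeup += 1              # childspinpointer^ := sense
            childid = 2 * childid + 1
    return arrival + wakeup
```

Every node other than the root is signaled exactly once, so the wakeup loops contribute P-1 writes and the total is 2P-2.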
Expanded Fetch&: A Scalable
Approach to Reducing Traffic in a
Multistage Interconnection Network
Fetch and Phi Operations
The Hardware Solution
• Generalized Fetch and Phi
• Restricted Fetch and Phi
• Proposal
– Motivation
– Basic idea
– Analysis
Previous Work
• [GOT83] designed the NYU Ultracomputer and introduced
the Generalized Fetch and Phi operations
• [SOH89] introduced the Restricted Fetch and Phi
operations
• [DIC92] designed scalable combining switches that are
much cheaper than the ones in the NYU Ultracomputer
• [TZE91] proposed an alternative combining architecture
that has much lower hardware complexity than previous
ones
• [HAN96] proposed a novel MIN architecture which is the
Simple Serial Synchronized MIN
Generalized Fetch and Phi
• Example
Assuming V is a shared variable, and PEi and PEj simultaneously execute
    PEi: ANSi ← F&A(V, ei)
    PEj: ANSj ← F&A(V, ej)
then either
    ANSi ← V and ANSj ← V + ei, or
    ANSi ← V + ej and ANSj ← V
and, in either case, V = V + ei + ej.
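The combining behavior can be sketched in a few lines of Python (function names are my own): a switch that receives two F&A requests for the same variable forwards a single combined request to memory, then de-combines the reply:

```python
def fetch_and_add(memory, var, inc):
    """The memory module's atomic F&A: return the old value, then add inc."""
    old = memory[var]
    memory[var] = old + inc
    return old

def combining_switch(memory, var, e_i, e_j):
    """Combine F&A(V, e_i) from PEi and F&A(V, e_j) from PEj into one
    memory access, then synthesize the two replies (PEi ordered first)."""
    old = fetch_and_add(memory, var, e_i + e_j)   # single combined access
    ans_i = old            # PEi appears to have gone first
    ans_j = old + e_i      # PEj sees PEi's increment already applied
    return ans_i, ans_j
```

This realizes the first case above; combining in the other order gives the second case, and the final memory value is V + ei + ej either way.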
Restricted Fetch and Phi
[Diagram: processors P1–P4 connected through switch stages S1, S2, S3]
# of lines = log A + B + C
    where A = # of addressable memory locations
          B = # of bits needed to represent C
          N = # of processors in the system
Restricted Fetch and Phi
Suppose
• P1, P3, P4 execute RF(X, C)
• priority: P1 > P2 > P3 > P4
then
V1 = X(C)
V2 = X(C ∘ C)
V3 = X(C ∘ C ∘ C)
Proposal
• Motivation
– In a multiprocessor system, each processor computes a small part of a much larger task.
– With the task distributed among N processors, each one is likely to be doing the same or nearly the same set of operations on the same or similar data.
• Idea
– Try to maximize the number of cycles the bus is held, so that a large number of processors join each operation.
Model and Assumption
• Let N = number of processors.
• Assumption: once an MF occurs, each of the other N - 1 processors joins with independent, identically distributed probability p during each cycle.
• Thus the number of processors joining in each cycle follows a binomial distribution with parameter p.
Analysis
The expected number of processors joining in the first cycle is: (N - 1)p

In the second cycle:
(N - (N - 1)p - 1)p = (N - Np + p - 1)p = (N(1 - p) - (1 - p))p = (N - 1)(1 - p)p

In the third cycle:
(N - (N - 1)p - (N - 1)(1 - p)p - 1)p = (N - 1)(1 - p - (1 - p)p)p = (N - 1)(1 - p)^2 p

In the mth cycle:
(N - (N - 1)p - (N - 1)(1 - p)p - (N - 1)(1 - p)^2 p - … - (N - 1)(1 - p)^(m-2) p - 1)p
= (N - 1)(1 - p - (1 - p)p - (1 - p)^2 p - … - (1 - p)^(m-2) p)p = (N - 1)(1 - p)^(m-1) p
Analysis
Thus the total number of processors expected to join the MF by m cycles is:
(N - 1)p + (N - 1)(1 - p)p + … + (N - 1)(1 - p)^(m-1) p = (N - 1)(1 - (1 - p)^m)
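The per-cycle expectations and the closed form can be cross-checked numerically. A sketch (function name is my own; N = 100, p = 0.1 are arbitrary example values):

```python
def expected_joiners(N, p, m):
    """Expected number of joiners in each of m cycles: N-1 processors
    start out unjoined, and each joins with probability p per cycle."""
    remaining = N - 1
    per_cycle = []
    for _ in range(m):
        joins = remaining * p
        per_cycle.append(joins)
        remaining -= joins
    return per_cycle
```

per_cycle[k] should equal (N - 1)(1 - p)^k p, and the running sum should equal (N - 1)(1 - (1 - p)^m).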
[Chart: expected number of joining processors for various p]
Analysis
Choose m where the marginal expected gain drops to one processor per cycle:

d/dm [ (N - 1)(1 - (1 - p)^m) ] = 1
-(N - 1)(1 - p)^m ln(1 - p) = 1
(1 - p)^m = -1 / ((N - 1) ln(1 - p))
m log(1 - p) = log( -1 / ((N - 1) ln(1 - p)) )
m = log( -1 / ((N - 1) ln(1 - p)) ) / log(1 - p) = log_(1-p)( -1 / ((N - 1) ln(1 - p)) )
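The closed form for m can be verified by confirming that the derivative of the expected total really equals 1 there (function names are my own; N = 100, p = 0.1 are arbitrary example values):

```python
import math

def optimal_m(N, p):
    """m at which the marginal expected gain drops to 1 processor per cycle:
    m = log_(1-p)( -1 / ((N-1) ln(1-p)) )."""
    return math.log(-1.0 / ((N - 1) * math.log(1 - p))) / math.log(1 - p)

def marginal_gain(N, p, m):
    """d/dm of (N-1)(1 - (1-p)**m), i.e. -(N-1)(1-p)**m ln(1-p)."""
    return -(N - 1) * (1 - p) ** m * math.log(1 - p)
```

Note that ln(1 - p) < 0, so the argument of the outer log is positive and the formula is well defined for 0 < p < 1.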
[Chart: expected number of joining processors for various N]
Tradeoff
• Large m allows many processors to join each operation.
• Small m releases the bus sooner for other MF operations.