Minimum-Length Critical Path Algorithms for Barrier Synchronization in Shared Memory Parallel Machines
Barrier Synchronization
• Introduction
• MCS Algorithm
• Proposal
– Motivation
– Implementation details
– VFO Algorithm
– Results
Introduction
• Why do we use Barriers?
• Types of Barriers.
– Centralized vs. Tree-based Algorithms
Centralized Barriers
shared count : integer := P
shared sense : Boolean := true
processor private local_sense : Boolean := true

procedure central_barrier
    local_sense := not local_sense        // each processor toggles its own sense
    if fetch_and_decrement(&count) = 1
        count := P
        sense := local_sense              // last processor toggles global sense
    else
        repeat until sense = local_sense
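The pseudocode above translates almost directly into a runnable sketch. The following Python version (class and method names are my own; a lock stands in for the atomic fetch_and_decrement instruction) shows the sense-reversing idea:

```python
import threading

class CentralBarrier:
    """Sense-reversing centralized barrier, following the pseudocode above.
    The lock stands in for an atomic fetch_and_decrement instruction."""

    def __init__(self, p):
        self.p = p
        self.count = p                 # shared count := P
        self.sense = True              # shared sense := true
        self.lock = threading.Lock()
        self.private = threading.local()

    def wait(self):
        # local_sense := not local_sense (each thread toggles its own sense)
        local_sense = not getattr(self.private, "sense", True)
        self.private.sense = local_sense
        with self.lock:                # fetch_and_decrement(&count)
            self.count -= 1
            last = self.count == 0
        if last:
            self.count = self.p        # reset for the next barrier episode
            self.sense = local_sense   # last arriver toggles the global sense
        else:
            while self.sense != local_sense:
                pass                   # spin until released
```

Because every waiter spins on the single shared sense flag, all P-1 processors generate traffic to one location, which is exactly the scalability problem the tree-based schemes below address.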
Tree-Based Barriers
[Diagram: hierarchical grouping with fan-in 10 — 1 group of 10 at the top, 10 groups of 10 below it, 100 groups below that, and so on]
MCS Algorithm
Arrival tree (fan-in = 4): P0 at the root with children P1, P2, P3, P4
Wakeup tree (fan-out = 2): P0 at the root with children P1, P2 and grandchildren P3, P4, P5, P6
MCS Algorithm
Format of a tree node: F T T T T T T T T
[Diagram: the arrival and wakeup trees, and the VFO wakeup tree for P = 16 with P0 at the root and children P1, P3, P7, P15]
Wakeup time for P = 16 with VFO tree
T=0: P0
T=1: P1
T=2: P2 P3
T=3: P4 P5 P6 P7
T=4: P8 P9 P10 P11 P12 P13 P14 P15
Implementation Details
Approach I
• Leftchildpointer and nextsiblingpointer per node
[Diagram: VFO wakeup tree for P = 16 — P0 at the root with children P1, P3, P7, P15]
Procedure to build a VFO wakeup tree of 2^k nodes:

CreateVFOTree (k)
    if (k = 0)
        allocate a new node and let R be its pointer
        R->leftchildpointer := null; R->nextsiblingpointer := null
        return (R)
    R1 := CreateVFOTree (k-1)
    /* R1 is the root of a VFO tree of half the required size */
    R2 := CreateVFOTree (k-1)
    R2->nextsiblingpointer := R1->leftchildpointer
    R1->leftchildpointer := R2
    return (R1)
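CreateVFOTree transcribes directly into Python (the class and helper names below are my own), with a helper that counts nodes through the left-child/next-sibling links:

```python
class VFONode:
    """Tree node with the two pointers used in Approach I."""
    def __init__(self):
        self.leftchild = None      # leftchildpointer
        self.nextsibling = None    # nextsiblingpointer

def create_vfo_tree(k):
    """Build a VFO wakeup tree of 2**k nodes, as in CreateVFOTree(k)."""
    if k == 0:
        return VFONode()
    r1 = create_vfo_tree(k - 1)    # root of a VFO tree of half the required size
    r2 = create_vfo_tree(k - 1)
    r2.nextsibling = r1.leftchild  # splice r2 in as r1's new leftmost child
    r1.leftchild = r2
    return r1

def count_nodes(node):
    """Count nodes reachable through leftchild/nextsibling links."""
    if node is None:
        return 0
    return 1 + count_nodes(node.leftchild) + count_nodes(node.nextsibling)
```

Each doubling step reuses the half-size tree's root, so the root of a 2^k-node tree ends up with k children, the largest subtree first.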
Approach II
• For a non-root node j:
  – leftmost child = 2*j
  – next sibling = 2*j + 1
[Diagram: the same VFO wakeup tree for P = 16, with node j's children at indices 2j, 4j+1, 8j+3, …]
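The indexing scheme of Approach II makes the wakeup schedule easy to simulate with plain arithmetic. A sketch (my own function name; node 1 is taken as the root's leftmost child, matching childid = Max{1, 2*myid} in the VFO algorithm):

```python
def vfo_wake_times(P):
    """Return wake_time[j] for each processor j, assuming each awake
    processor signals its wakeup-tree children one per cycle, leftmost first."""
    times = [None] * P

    def wake(j, t):
        times[j] = t
        child = max(1, 2 * j)      # leftmost child (node 1 for the root)
        delay = 1                  # one cycle per signal
        while child < P:
            wake(child, t + delay)
            child = 2 * child + 1  # next sibling
            delay += 1
        # so node j's children are 2j, 4j+1, 8j+3, ...

    wake(0, 0)
    return times
```

For P = 16 this reproduces the schedule on the earlier slide: one processor wakes at T=1, two at T=2, four at T=3, and eight at T=4, for a wakeup time of log₂ P.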
VFO Algorithm
type treenode = record
    parentsense : Boolean
    parentpointer : ^Boolean       // pointer to parent in the arrival tree
    havechild : array [0..3] of Boolean
    childnotready : array [0..3] of Boolean
    dummy : Boolean                // pseudo-data

// Shared variables:
nodes : shared array [0..P-1] of treenode
    // nodes[j] is allocated in memory module #j, which is
    // locally accessible to the processor whose index is j

// Local variables for each processor:
myid : integer                     // processor index
sense : Boolean                    // initially true
childspinpointer : ^Boolean
childid : integer
VFO Algorithm (cont.)
Algorithm VFO_barrier(myid)            // myid is the index of the calling processor
    with nodes[myid] do
        repeat until childnotready = {false, false, false, false}
        childnotready := havechild     // prepare for next barrier
        parentpointer^ := false        // remote access to tell my parent I'm ready
        // if not root, wait until my parent signals wakeup
        if myid != 0
            repeat until parentsense = sense
        // now signal my children in the wakeup tree
        childid := Max {1, 2*myid}     // leftmost child of the root is node 1
        repeat while childid < P
            childspinpointer := &nodes[childid].parentsense
            childspinpointer^ := sense // remote access
            childid := 2*childid + 1
        // prepare for next barrier
        sense := not sense             // sense is not shared but is retained
                                       // across barrier calls
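The algorithm can be exercised with threads. Below is a minimal Python sketch under two stated assumptions: the arrival tree uses the indexing in which node j's arrival children are 4j+1 .. 4j+4 (the slides leave this implicit), and plain attribute spins stand in for spinning on locally allocated flags, so the sketch ignores the memory locality that motivates the real algorithm:

```python
import threading

class VFOBarrier:
    """Sketch of the VFO barrier: fan-in-4 arrival tree, VFO wakeup tree."""

    class Node:
        def __init__(self, j, P):
            # havechild[c] is true iff arrival-tree child 4j+c+1 exists
            self.havechild = [4 * j + c + 1 < P for c in range(4)]
            self.childnotready = list(self.havechild)
            self.parentsense = False

    def __init__(self, P):
        self.P = P
        self.nodes = [self.Node(j, P) for j in range(P)]
        self.sense = [True] * P            # private sense per processor

    def wait(self, myid):
        node = self.nodes[myid]
        # wait until all arrival-tree children have checked in
        while any(node.childnotready):
            pass
        node.childnotready = list(node.havechild)   # prepare for next barrier
        if myid != 0:
            # tell my arrival-tree parent that I'm ready
            parent = (myid - 1) // 4
            self.nodes[parent].childnotready[(myid - 1) % 4] = False
            # wait until my wakeup-tree parent signals me
            while node.parentsense != self.sense[myid]:
                pass
        # signal my children in the wakeup tree
        childid = max(1, 2 * myid)
        while childid < self.P:
            self.nodes[childid].parentsense = self.sense[myid]
            childid = 2 * childid + 1
        self.sense[myid] = not self.sense[myid]     # prepare for next barrier
```

As in the pseudocode, each node resets childnotready before notifying its parent, which is what makes back-to-back barrier episodes safe.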
Results
Lemma 1:
The VFO barrier algorithm has a worst-case extra arrival overhead of log₄ P, a best case of O(1), and a wakeup time of log₂ P.
Results
Lemma 2:
The VFO algorithm requires only O(P) shared
space and performs the minimum number of
remote accesses, 2P-2, for a barrier of P
processors.
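Lemma 2's remote-access count can be sanity-checked from the algorithm itself: each non-root processor performs one remote write on arrival (parentpointer^ := false), and the wakeup loops together perform one remote write per wakeup-tree edge. A small counting sketch (function name is my own):

```python
def remote_accesses(P):
    """Count the remote writes of one VFO barrier episode:
    one arrival write per non-root, one wakeup write per tree edge."""
    arrival = P - 1                  # parentpointer^ := false, each non-root
    wakeup = 0
    for myid in range(P):
        childid = max(1, 2 * myid)   # leftmost wakeup-tree child
        while childid < P:
            wakeup += 1              # childspinpointer^ := sense
            childid = 2 * childid + 1
    return arrival + wakeup
```

Every node other than the root is signaled exactly once, so the wakeup loops contribute P-1 writes and the total is 2P-2.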
Expanded Fetch&: A Scalable
Approach to Reducing Traffic in a
Multistage Interconnection Network
Fetch and Phi Operations
The Hardware Solution
• Generalized Fetch and Phi
• Restricted Fetch and Phi
• Proposal
– Motivation
– Basic idea
– Analysis
Previous Work
• [GOT83] designed the NYU Ultracomputer and introduced
the Generalized Fetch and Phi operations
• [SOH89] introduced the Restricted Fetch and Phi
operations
• [DIC92] designed scalable combining switches that are
much cheaper than the ones in the NYU Ultracomputer
• [TZE91] proposed an alternative combining architecture
that has much lower hardware complexity than previous
ones
• [HAN96] proposed a novel MIN architecture which is the
Simple Serial Synchronized MIN
Generalized Fetch and Phi
• Example
Assuming V is a shared variable, and PEi and PEj simultaneously execute
    PEi: ANSi ← F&A(V, ei)
    PEj: ANSj ← F&A(V, ej)
then either
    ANSi ← V and ANSj ← V + ei, or
    ANSi ← V + ej and ANSj ← V
and, in either case, V = V + ei + ej.
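The combining behavior can be sketched in a few lines of Python (function names are my own): a switch that receives two F&A requests for the same variable forwards a single combined request to memory, then de-combines the reply:

```python
def fetch_and_add(memory, var, inc):
    """The memory module's atomic F&A: return the old value, then add inc."""
    old = memory[var]
    memory[var] = old + inc
    return old

def combining_switch(memory, var, e_i, e_j):
    """Combine F&A(V, e_i) from PEi and F&A(V, e_j) from PEj into one
    memory access, then synthesize the two replies (PEi ordered first)."""
    old = fetch_and_add(memory, var, e_i + e_j)   # single combined access
    ans_i = old            # PEi appears to have gone first
    ans_j = old + e_i      # PEj sees PEi's increment already applied
    return ans_i, ans_j
```

This realizes the first case above; combining in the other order gives the second case, and the final memory value is V + ei + ej either way.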
Restricted Fetch and Phi
[Diagram: processors P1–P4 connected through switch stages S1, S2, S3]
# of lines = log A + B + C
    where A = # of addressable memory locations
          B = # of bits needed to represent C
          N = # of processors in the system
Restricted Fetch and Phi
Suppose
• P1, P3, P4 execute RF(X, C)
• priority: P1 > P2 > P3 > P4
then
V1 = X(C)
V2 = X(C ∘ C)
V3 = X(C ∘ C ∘ C)
Proposal
• Motivation
– In a multiprocessor system, each processor computes a small part of a much larger task.
– With the task distributed among N processors, each one is likely to be doing the same or nearly the same set of operations on the same or similar data.
• Idea
– Try to maximize the number of cycles the bus is held, so that a large number of processors join each operation.
Model and Assumption
• Let N = number of processors.
• Assumption: once an MF occurs, each of the other N - 1 processors joins with independent, identically distributed probability p during each cycle.
• Thus the number of processors joining in each cycle follows a binomial distribution with parameter p.
Analysis
The expected number of processors joining in the first cycle is: (N - 1)p

In the second cycle:
(N - (N - 1)p - 1)p = (N - Np + p - 1)p = (N(1 - p) - (1 - p))p = (N - 1)(1 - p)p

In the third cycle:
(N - (N - 1)p - (N - 1)(1 - p)p - 1)p = (N - 1)(1 - p - (1 - p)p)p = (N - 1)(1 - p)^2 p

In the mth cycle:
(N - (N - 1)p - (N - 1)(1 - p)p - (N - 1)(1 - p)^2 p - … - (N - 1)(1 - p)^(m-2) p - 1)p
= (N - 1)(1 - p - (1 - p)p - (1 - p)^2 p - … - (1 - p)^(m-2) p)p = (N - 1)(1 - p)^(m-1) p
Analysis
Thus the total number of processors expected to join the MF by m cycles is:
(N - 1)p + (N - 1)(1 - p)p + … + (N - 1)(1 - p)^(m-1) p = (N - 1)(1 - (1 - p)^m)
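The per-cycle expectations and the closed form can be cross-checked numerically. A sketch (function name is my own; N = 100, p = 0.1 are arbitrary example values):

```python
def expected_joiners(N, p, m):
    """Expected number of joiners in each of m cycles: N-1 processors
    start out unjoined, and each joins with probability p per cycle."""
    remaining = N - 1
    per_cycle = []
    for _ in range(m):
        joins = remaining * p
        per_cycle.append(joins)
        remaining -= joins
    return per_cycle
```

per_cycle[k] should equal (N - 1)(1 - p)^k p, and the running sum should equal (N - 1)(1 - (1 - p)^m).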
[Chart: expected number of joining processors for various p]
Analysis
Choose m where the marginal expected gain drops to one processor per cycle:

d/dm [ (N - 1)(1 - (1 - p)^m) ] = 1
-(N - 1)(1 - p)^m ln(1 - p) = 1
(1 - p)^m = -1 / ((N - 1) ln(1 - p))
m log(1 - p) = log( -1 / ((N - 1) ln(1 - p)) )
m = log( -1 / ((N - 1) ln(1 - p)) ) / log(1 - p) = log_(1-p)( -1 / ((N - 1) ln(1 - p)) )
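The closed form for m can be verified by confirming that the derivative of the expected total really equals 1 there (function names are my own; N = 100, p = 0.1 are arbitrary example values):

```python
import math

def optimal_m(N, p):
    """m at which the marginal expected gain drops to 1 processor per cycle:
    m = log_(1-p)( -1 / ((N-1) ln(1-p)) )."""
    return math.log(-1.0 / ((N - 1) * math.log(1 - p))) / math.log(1 - p)

def marginal_gain(N, p, m):
    """d/dm of (N-1)(1 - (1-p)**m), i.e. -(N-1)(1-p)**m ln(1-p)."""
    return -(N - 1) * (1 - p) ** m * math.log(1 - p)
```

Note that ln(1 - p) < 0, so the argument of the outer log is positive and the formula is well defined for 0 < p < 1.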
[Chart: expected number of joining processors for various N]
Tradeoff
• Large m allows many processors to join each operation.
• Small m releases the bus sooner for other MF operations.