
6. Distributed Shared Memory


On-Chip Memory

[Figure: a single-chip computer, in which the CPU and the memory sit in one chip package connected by address and data lines; and a hypothetical shared-memory multiprocessor, in which the on-chip memory is extended and shared by several CPUs.]
Bus-Based Multiprocessors

[Figure: a multiprocessor: several CPUs and a memory module connected by a common bus; and a multiprocessor with caching: each CPU has a private cache between it and the bus.]
The write-through cache consistency protocol:

Event      | Action taken by a cache in response to its own CPU's operation | Action taken by a cache in response to a remote CPU's operation
Read miss  | Fetch data from memory and store in cache                       | No action
Read hit   | Fetch data from local cache                                     | No action
Write miss | Update data in memory and store in cache                        | No action
Write hit  | Update memory and cache                                         | Invalidate cache entry
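The same rules can be restated as a small C sketch (mine, not from the slides); the memory and cache arrays here are stand-ins for the real hardware:

    #include <stdio.h>

    enum event { READ_MISS, READ_HIT, WRITE_MISS, WRITE_HIT };

    static int memory[1024];        /* stand-in for main memory */
    static int cache[1024];         /* stand-in for this CPU's cache */
    static int cache_valid[1024];

    /* Action taken by a cache in response to its own CPU's operation. */
    static void own_action(enum event e, int addr, int value) {
        switch (e) {
        case READ_MISS:  cache[addr] = memory[addr]; cache_valid[addr] = 1; break;
        case READ_HIT:   /* served from the local cache */                  break;
        case WRITE_MISS: memory[addr] = value; cache[addr] = value; cache_valid[addr] = 1; break;
        case WRITE_HIT:  memory[addr] = value; cache[addr] = value; break;  /* write through */
        }
    }

    /* Action taken when a remote CPU's operation is snooped on the bus; per the
     * table, only a remote write hit forces the local copy to be invalidated. */
    static void snoop_action(enum event e, int addr) {
        if (e == WRITE_HIT)
            cache_valid[addr] = 0;
    }

    int main(void) {
        own_action(WRITE_MISS, 7, 42);
        snoop_action(WRITE_HIT, 7);
        printf("valid=%d value_in_memory=%d\n", cache_valid[7], memory[7]);
        return 0;
    }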
The Write-Once Protocol

• This protocol manages cache blocks, each of which can be in one of the following three states:
  - INVALID: This cache block does not contain valid data.
  - CLEAN: Memory is up-to-date; the block may be in other caches.
  - DIRTY: Memory is incorrect; no other cache holds the block.
• The basic idea of the protocol is that a word that is being read by multiple CPUs is allowed to be present in all their caches. A word that is being heavily written by only one machine is kept in its cache and not written back to memory on every write, in order to reduce bus traffic.
For example, with CPUs A, B, and C on the bus and a word W taking successive values W1, W2, W3:

(a) Initial state: word W, containing value W1, is in memory and is also cached by B. (B: W1, CLEAN; memory is correct.)
(b) A reads word W and gets W1. B does not respond to the read, but the memory does. (A: W1, CLEAN; B: W1, CLEAN; memory is correct.)
(c) A writes a value W2. B snoops on the bus, sees the write, and invalidates its entry. A's copy is marked DIRTY. (A: W2, DIRTY; B: INVALID; memory is not updated.)
(d) A writes W again. This and subsequent writes by A are done locally, without any bus traffic. (A: W3, DIRTY; B: INVALID; memory is not updated.)
(e) C reads or writes W. A sees the request by snooping on the bus, provides the value, and invalidates its own entry. C now has the only valid copy. (A: INVALID; B: INVALID; C: W3, DIRTY; memory is not updated.)
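The transitions above can be summarized in a rough C sketch (my own reading of the protocol, not an authoritative implementation); the bus_* helpers are hypothetical stand-ins for the snooping hardware:

    /* Per-block state, as listed above. */
    typedef enum { INVALID, CLEAN, DIRTY } state_t;

    /* Hypothetical bus helpers standing in for the snooping hardware. */
    static void bus_broadcast_invalidate(void) {}
    static void bus_supply_block(void) {}

    /* Transition when the local CPU writes the block (a write miss would also
     * fetch the block first; that step is omitted here). */
    state_t on_local_write(state_t s) {
        if (s != DIRTY)
            bus_broadcast_invalidate();   /* step (c): other caches snoop this and go INVALID */
        return DIRTY;                     /* step (d): later writes cause no bus traffic */
    }

    /* Transition when a remote CPU's access to this block is snooped on the bus. */
    state_t on_snooped(state_t s, int remote_is_write) {
        if (s == DIRTY) {
            bus_supply_block();           /* step (e): the holder of the only valid copy answers */
            return INVALID;
        }
        if (remote_is_write)
            return INVALID;               /* a remote write invalidates CLEAN copies */
        return s;                         /* a remote read of a CLEAN block needs no action */
    }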
Ring-Based Multiprocessors: Memnet

[Figure: in Memnet, each machine's CPU, cache, and MMU are attached to a Memnet device on a token ring; each machine's memory is divided into home memory for the shared blocks it hosts and private memory. The block table kept by each Memnet device has, per block, the fields Valid, Exclusive, Home, Interrupt, and Location.]
Protocol

• Read
  - When the CPU wants to read a word from shared memory, the memory address to be read is passed to the Memnet device, which checks the block table to see if the block is present. If so, the request is satisfied. If not, the Memnet device waits until it captures the circulating token, then puts a request packet onto the ring. As the packet passes around the ring, each Memnet device along the way checks to see if it has the block needed. If so, it puts the block in the dummy field and modifies the packet header to inhibit subsequent machines from doing so.
  - If the requesting machine has no free space in its cache to hold the incoming block, then to make space it picks a cached block at random and sends it home. Blocks whose Home bit is set are never chosen, because they are already home. (A rough C sketch of this read path appears after this slide.)
• Write
  - If the block containing the word to be written is present and is the only copy in the system (i.e., the Exclusive bit is set), the word is just written locally.
  - If the needed block is present but it is not the only copy, an invalidation packet is first sent around the ring to force all other machines to discard their copies of the block about to be written. When the invalidation packet arrives back at the sender, the Exclusive bit is set for that block and the write proceeds locally.
  - If the block is not present, a packet is sent out that combines a read request and an invalidation request. The first machine that has the block copies it into the packet and discards its own copy. All subsequent machines just discard the block from their caches. When the packet comes back to the sender, it is stored there and written.
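A rough sketch of the read path in C (my interpretation of the description above, not the Memnet hardware); all helper routines and sizes here are hypothetical:

    #include <stdlib.h>

    #define NUM_BLOCKS  1024
    #define CACHE_SLOTS 64

    /* Per-block entry in the block table; the fields follow the figure above. */
    struct block_entry { int valid, exclusive, home; };
    static struct block_entry table[NUM_BLOCKS];
    static int free_slots = CACHE_SLOTS;

    /* Stand-in for capturing the token and sending a request packet around the
     * ring; the first device holding the block fills in the packet's dummy field. */
    static void ring_fetch(int block) { table[block].valid = 1; }

    /* Stand-in for sending an evicted block back to its home machine. */
    static void send_home(int block) { table[block].valid = 0; free_slots++; }

    /* Pick a random cached block to evict; blocks whose Home bit is set are
     * never chosen, because they are already home. */
    static int pick_victim(void) {
        int b;
        do { b = rand() % NUM_BLOCKS; } while (!table[b].valid || table[b].home);
        return b;
    }

    void memnet_read(int block) {
        if (table[block].valid)          /* block present: request satisfied locally */
            return;
        if (free_slots == 0)             /* no space: make room first */
            send_home(pick_victim());
        ring_fetch(block);
        free_slots--;
    }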
Switched Multiprocessors

• Two approaches can be taken to attack the problem of not enough bandwidth:
  - Reduce the amount of communication, e.g. by caching.
  - Increase the communication capacity, e.g. by changing the topology.
• One method is to build the system as a hierarchy: build it as multiple clusters and connect the clusters using an intercluster bus. As long as most CPUs communicate primarily within their own cluster, there will be relatively little intercluster traffic. If still more bandwidth is needed, collect a bus, tree, or grid of clusters together into a supercluster, and break the system into multiple superclusters.
Dash

• A Dash consists of 16 clusters, each cluster containing a bus, four CPUs, 16M of the global memory, and some I/O equipment. Each CPU is able to snoop on its local bus, but not on other buses.
• The total address space is 256M, divided up into 16 regions of 16M each. The global memory of cluster 0 holds addresses 0 to 16M, the global memory of cluster 1 holds addresses 16M to 32M, and so on.
[Figure: three clusters, each consisting of CPUs with caches and a block of global memory on a local bus, connected by an intercluster bus to form one supercluster; and two such superclusters connected by a supercluster bus.]
Caching

• Caching is done on two levels: a first-level cache and a larger second-level cache.
• Each cache block can be in one of the following three states:
  - UNCACHED: The only copy of the block is in this memory.
  - CLEAN: Memory is up-to-date; the block may be in several caches.
  - DIRTY: Memory is incorrect; only one cache holds the block.

NUMA Multiprocessors

• Like a traditional UMA (Uniform Memory Access) multiprocessor, a NUMA machine has a single virtual address space that is visible to all CPUs. When any CPU writes a value to location a, a subsequent read of a by a different processor will return the value just written.
• The difference between UMA and NUMA machines lies in the fact that on a NUMA machine, access to a remote memory is much slower than access to a local memory.

Examples of NUMA Multiprocessors

[Figure: Cm*. Each cluster contains CPUs with their local memory and I/O on a local bus, plus a microprogrammed MMU; the clusters are joined by an intercluster bus.]

[Figure: the BBN Butterfly. Each CPU can reach every other CPU's memory through a switching network.]

Properties of NUMA Multiprocessors

• Access to remote memory is possible.
• Accessing remote memory is slower than accessing local memory.
• Remote access times are not hidden by caching.

NUMA Algorithms

• In NUMA, it matters a great deal which page is located in which memory. The key issue in NUMA software is the decision of where to place each page to maximize performance.
• A page scanner running in the background can gather usage statistics about local and remote references. If the usage statistics indicate that a page is in the wrong place, the page scanner unmaps the page so that the next reference causes a page fault.
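A minimal sketch of such a page scanner (assumed bookkeeping, not any particular NUMA operating system): when remote references dominate during a scan interval, the page is unmapped so that the next reference faults and placement can be reconsidered.

    #include <stddef.h>

    /* Hypothetical per-page usage counters maintained by the OS. */
    struct page_stats { unsigned local_refs, remote_refs; int mapped; };

    void scan_pages(struct page_stats *pages, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (pages[i].mapped && pages[i].remote_refs > pages[i].local_refs)
                pages[i].mapped = 0;   /* unmap: the next reference faults and placement is revisited */
            pages[i].local_refs = pages[i].remote_refs = 0;   /* start a fresh usage window */
        }
    }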



Comparison of Shared Memory Systems

The spectrum of shared memory machines, from tightly coupled to loosely coupled:

• Single-bus multiprocessor (e.g. Sequent, Firefly): managed by the MMU; transfer unit: cache block.
• Switched multiprocessor (e.g. Dash, Alewife): managed by the MMU; transfer unit: cache block.
• NUMA machine (e.g. Cm*, Butterfly): managed by the OS; transfer unit: page.
• Page-based DSM (e.g. Ivy, Mirage): managed by the OS; transfer unit: page.
• Shared-variable DSM (e.g. Munin, Midway): managed by the language runtime system; transfer unit: data structure.
• Object-based DSM (e.g. Linda, Orca): managed by the language runtime system; transfer unit: object.

Caching is hardware-controlled at the tightly coupled end and software-controlled at the loosely coupled end; remote access is done in hardware for the first three designs and in software for the three DSM designs.
Comparison of the six kinds of shared memory systems (the first three are multiprocessors, the last three are DSM):

Item                                             | Single bus | Switched | NUMA     | Page based | Shared variable | Object based
Linear, shared virtual address space?            | Yes        | Yes      | Yes      | Yes        | No              | No
Possible operations                              | R/W        | R/W      | R/W      | R/W        | R/W             | General
Encapsulation and methods?                       | No         | No       | No       | No         | No              | Yes
Is remote access possible in hardware?           | Yes        | Yes      | Yes      | No         | No              | No
Is unattached memory possible?                   | Yes        | Yes      | Yes      | No         | No              | No
Who converts remote memory accesses to messages? | MMU        | MMU      | MMU      | OS         | Runtime system  | Runtime system
Transfer medium                                  | Bus        | Bus      | Bus      | Network    | Network         | Network
Data migration done by                           | Hardware   | Hardware | Software | Software   | Software        | Software
Transfer unit                                    | Block      | Block    | Page     | Page       | Shared variable | Object

Consistency Models

Strict Consistency

Definition: Any read to a memory location x returns the value stored by the most recent write operation to x.

P1: W(x)1
P2:        R(x)1
Strict consistency.

P1: W(x)1
P2:        R(x)0   R(x)1
Not strict consistency: the first read occurs after the write yet still returns the old value.
Sequential Consistency

Definition: The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

The following is correct; these are two possible results of running the same program, and both are sequentially consistent:

(a) P1: W(x)1
    P2:        R(x)0   R(x)1

(b) P1: W(x)1
    P2:        R(x)1   R(x)1

Example: three parallel processes P1, P2, P3:

P1: a = 1; print(b, c);
P2: b = 1; print(a, c);
P3: c = 1; print(a, b);

The concatenated output forms a 6-digit signature. Not every signature can occur:
• Signature 000000 is not permitted (each process assigns its own variable before it prints).
• Signature 001001 is not permitted.

All processes see all shared accesses in the same order.
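For concreteness, here is the three-process program written with POSIX threads standing in for the processes (my sketch; ordinary int accesses on real hardware are not guaranteed to be sequentially consistent, so this only shows the shape of the program):

    #include <pthread.h>
    #include <stdio.h>

    int a, b, c;

    void *p1(void *arg) { (void)arg; a = 1; printf("%d%d", b, c); return NULL; }
    void *p2(void *arg) { (void)arg; b = 1; printf("%d%d", a, c); return NULL; }
    void *p3(void *arg) { (void)arg; c = 1; printf("%d%d", a, b); return NULL; }

    int main(void) {
        pthread_t t1, t2, t3;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_create(&t3, NULL, p3, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        printf("\n");             /* the six digits printed form the signature */
        return 0;
    }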


Causal Consistency

• Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

P1: W(x)1                        W(x)3
P2:       R(x)1   W(x)2
P3:       R(x)1                          R(x)3   R(x)2
P4:       R(x)1                          R(x)2   R(x)3

This sequence is allowed with causally consistent memory, but not with sequentially consistent memory or strictly consistent memory.

P1: W(x)1
P2:       R(x)1   W(x)2
P3:                       R(x)2   R(x)1
P4:                       R(x)1   R(x)2

A violation of causal memory: W(x)1 and W(x)2 are causally related (P2 read x before writing it), so every process must see them in the same order.

P1: W(x)1
P2: W(x)2
P3:       R(x)2   R(x)1
P4:       R(x)1   R(x)2

A correct sequence of events in causal memory: here the two writes are concurrent, so different processes may see them in different orders.

All processes see all causally-related shared accesses in the same order.
PRAM Consistency

• Definition: Writes done by a single process are received by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

P1: W(x)1
P2:       R(x)1   W(x)2
P3:                       R(x)1   R(x)2
P4:                       R(x)2   R(x)1

A valid sequence of events for PRAM consistency, but not for the stronger models above.

Processor Consistency

• Processor consistency is PRAM consistency plus memory coherence. That is, for every memory location x, there must be global agreement about the order of writes to x. Writes to different locations need not be viewed in the same order by different processes.
Weak Consistency

Weak consistency has three properties:
1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.

P1: W(x)1   W(x)2   S
P2:                      R(x)1   R(x)2   S
P3:                      R(x)2   R(x)1   S

A valid sequence of events for weak consistency.

P1: W(x)1   W(x)2   S
P2:                      S   R(x)1

An invalid sequence for weak consistency: P2 must get 2 instead of 1, because it has already synchronized.

Shared data can only be counted on to be consistent after a synchronization is done.
Release Consistency

• Release consistency provides acquire and release accesses. Acquire accesses are used to tell the memory system that a critical region is about to be entered. Release accesses say that a critical region has just been exited.

P1: Acq(L)   W(x)1   W(x)2   Rel(L)
P2:                                   Acq(L)   R(x)2   Rel(L)
P3:                                   R(x)1

A valid event sequence for release consistency: P3 does not do an acquire, so the value it reads is not guaranteed.

• Shared data are made consistent when a critical region is exited.
• In lazy release consistency, at the time of a release, nothing is sent anywhere. Instead, when an acquire is done, the processor trying to do the acquire has to get the most recent values of the variables from the machine or machines holding them.
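A sketch of how a program uses these accesses, assuming a DSM runtime that exports acquire() and release() (hypothetical names; the stubs below exist only so the sketch compiles):

    /* Stub acquire/release; a real release-consistent DSM runtime would
     * propagate the protected data at release time, or at the next acquire
     * under lazy release consistency. */
    static void acquire(int lock) { (void)lock; }
    static void release(int lock) { (void)lock; }

    static int x;   /* shared variable, consistent only around acquire/release */

    void writer(void) {
        acquire(1);            /* about to enter the critical region */
        x = 1;
        x = 2;
        release(1);            /* leaving it: x must become visible to the next acquirer */
    }

    void reader_with_acquire(void) {
        acquire(1);
        int v = x;             /* guaranteed to see 2 */
        release(1);
        (void)v;
    }

    void reader_without_acquire(void) {
        int v = x;             /* like P3 above: no acquire, so no guarantee */
        (void)v;
    }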
Entry Consistency

• Shared data pertaining to a critical region are made consistent when a critical region is entered.
• Formally, a memory exhibits entry consistency if it meets all the following conditions:
  1. An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
  2. Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
  3. After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
Page-Based Distributed Shared Memory

• These systems are built on top of multicomputers, that is, processors connected by a specialized message-passing network, workstations on a LAN, or similar designs. The essential element here is that no processor can directly access any other processor's memory.
[Figure: a shared global address space consisting of 16 chunks (0 to 15), distributed among four machines with four chunks each; the situation after CPU 1 references chunk 10, which migrates to CPU 1's machine; and the situation if chunk 10 is read only and replication is used, so that a copy also remains at its original machine.]
Replication

• One improvement to the basic system that can improve performance considerably is to replicate chunks that are read only.
• Another possibility is to replicate not only read-only chunks, but all chunks. This makes no difference for reads, but if a replicated chunk is suddenly modified, special action has to be taken to prevent having multiple, inconsistent copies in existence.
Granularity

• When a nonlocal memory word is referenced, a chunk of memory containing the word is fetched from its current location and put on the machine making the reference. An important design issue is how big the chunk should be: a word, a block, a page, or a segment (multiple pages).

False sharing

[Figure: processor 1 runs code using A while processor 2 runs code using B; A and B are two unrelated shared variables that happen to lie on the same shared page.]
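A small C illustration of false sharing (my own): A and B are unrelated, but because they live on the same shared page, heavy use of either one drags the whole page back and forth.

    struct shared_page {
        int A;   /* only ever used by processor 1 */
        int B;   /* only ever used by processor 2 */
    };

    /* Each processor touches only its own variable, yet every write may force
     * the page to migrate or be invalidated on the other machine. */
    void code_using_A(struct shared_page *p) { for (int i = 0; i < 1000; i++) p->A++; }
    void code_using_B(struct shared_page *p) { for (int i = 0; i < 1000; i++) p->B++; }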
Finding the Owner

• The simplest solution is to do a broadcast, asking the owner of the specified page to respond.
  Drawback: broadcasting wastes network bandwidth and interrupts each processor, forcing it to inspect the request packet.
• Another solution is to designate one process as the page manager. It is the job of the manager to keep track of who owns each page. When a process P wants to read or write a page it does not have, it sends a message to the page manager. The manager sends back a message telling who the owner is. P then contacts the owner for the page.
• An optimization is to let the manager forward the request directly to the owner, which then replies directly back to P.
  Drawback: heavy load on the page manager.
• Solution: have more page managers. Then how do we find the right manager? One solution is to use the low-order bits of the page number as an index into a table of managers. Thus with eight page managers, all pages that end with 000 are handled by manager 0, all pages that end with 001 are handled by manager 1, and so on. A different mapping is to use a hash function. (A sketch of this mapping follows.)
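A sketch of the low-order-bits mapping from the text (trivial, but it shows the idea; NUM_MANAGERS is an assumed constant):

    #define NUM_MANAGERS 8   /* must be a power of two for the mask below */

    /* With eight managers, the three low-order bits of the page number pick
     * the manager: pages ending in 000 go to manager 0, 001 to manager 1, and
     * so on. A hash function over the page number would serve the same role. */
    int manager_for_page(unsigned page_number) {
        return (int)(page_number & (NUM_MANAGERS - 1));
    }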
[Figure: in the four-message protocol, P sends a request to the page manager (1), the manager replies with the owner's identity (2), P sends a request to the owner (3), and the owner replies with the page (4). In the optimized three-message protocol, P sends a request to the page manager (1), the manager forwards it to the owner (2), and the owner replies directly to P (3).]
Finding the Copies

• The first possibility is to broadcast a message giving the page number and ask all processors holding the page to invalidate it. This approach works only if broadcast messages are totally reliable and can never be lost.
• The second possibility is to have the owner or page manager maintain a list, or copyset, telling which processors hold which pages. When a page must be invalidated, the old owner, new owner, or page manager sends a message to each processor holding the page and waits for an acknowledgement. When each message has been acknowledged, the invalidation is complete.
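A sketch of copyset-based invalidation (the message primitives are hypothetical stubs standing in for the DSM runtime):

    #include <stdbool.h>

    #define MAX_CPUS 64

    struct page_meta {
        bool copyset[MAX_CPUS];   /* copyset[i] is true if CPU i holds a copy of the page */
    };

    /* Hypothetical message primitives of the DSM runtime. */
    static void send_invalidate(int cpu, int page) { (void)cpu; (void)page; }
    static void wait_for_ack(int cpu, int page)    { (void)cpu; (void)page; }

    /* Invalidate every copy listed in the copyset and wait for all acknowledgements. */
    void invalidate_copies(struct page_meta *m, int page) {
        for (int cpu = 0; cpu < MAX_CPUS; cpu++)
            if (m->copyset[cpu])
                send_invalidate(cpu, page);
        for (int cpu = 0; cpu < MAX_CPUS; cpu++)
            if (m->copyset[cpu]) {
                wait_for_ack(cpu, page);
                m->copyset[cpu] = false;   /* invalidation complete for this holder */
            }
    }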
Page Replacement

• If there is no free page frame in memory to hold the needed page, which page should be evicted, and where should it be put? (A small sketch of this preference order follows this slide.)
  - One option is to use traditional virtual memory algorithms, such as LRU.
  - In DSM, a replicated page that another process owns is always a prime candidate to evict, because another copy exists.
  - The second best choice is a replicated page that the evicting process owns; ownership can be passed to another process that holds a copy.
  - If none of the above apply, a nonreplicated page must be chosen. One possibility is to write it to disk. Another is to hand it off to another processor.
• Choosing a processor to hand a page off to can be done in several ways:
  1. Send it to its home machine, which must accept it.
  2. The number of free page frames can be piggybacked on each message sent, with each processor building up an idea of how free memory is distributed around the network.
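A sketch of the eviction preference described above (the page_info fields are assumed bookkeeping, not any particular system's):

    /* Hypothetical per-page bookkeeping kept by the DSM runtime. */
    struct page_info { int replicated, owned_here; };

    /* Lower score = better eviction candidate, following the preference above. */
    int eviction_score(const struct page_info *p) {
        if (p->replicated && !p->owned_here) return 0;  /* another copy exists elsewhere */
        if (p->replicated &&  p->owned_here) return 1;  /* evict, but pass ownership to a copy holder */
        return 2;                                       /* nonreplicated: write to disk or hand it off */
    }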

Synchronization

• In a DSM system, as in a multiprocessor, processes often need to synchronize their actions. A common example is mutual exclusion, in which only one process at a time may execute a certain part of the code. In a multiprocessor, the TEST-AND-SET-LOCK (TSL) instruction is often used to implement mutual exclusion. In a DSM system, this code is still correct, but it is a potential performance disaster.
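For reference, here is a minimal spinlock built on an atomic test-and-set, the multiprocessor technique mentioned above (written with C11 atomics rather than a literal TSL instruction). On a page-based DSM the same loop would still be correct, but every spin would drag the lock's page across the network, which is exactly the performance disaster referred to.

    #include <stdatomic.h>

    static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    void enter_region(void) {
        while (atomic_flag_test_and_set(&lock_flag))
            ;                              /* spin until the previous value was clear */
    }

    void leave_region(void) {
        atomic_flag_clear(&lock_flag);
    }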
Shared-Variable Distributed Shared Memory

• Share only certain variables and data structures that are needed by more than one process.
• Two examples of such systems are Munin and Midway.

Munin

• Munin is a DSM system that is fundamentally based on software objects, but which can place each object on a separate page so that the hardware MMU can be used for detecting accesses to shared objects.
• Munin is based on a software implementation of release consistency.
[Figure: release consistency in Munin. Process 1 executes Lock(L), assigns the shared variables a, b, and c inside the critical region, and then executes Unlock(L); at the release, its changes to a, b, and c are sent over the network to the other processes that use them.]

[Figure: use of twin pages in Munin. The page starts out in read-only mode. At the first write a trap occurs, a copy of the page (the twin) is made, and the page is set to read-write. A later write changes one word (a 6 becomes an 8). At release, the page is compared with its twin and a message carrying only the changed word is sent.]
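A sketch of the twin-page mechanism as described above (my reading, not Munin source code): copy the page at the first write trap, then at release compare it with the twin and send only the words that changed.

    #include <string.h>

    #define PAGE_WORDS 1024

    /* Hypothetical Munin-style update message: word index plus new value. */
    static void send_update(int word_index, int new_value) { (void)word_index; (void)new_value; }

    /* At the first write trap, copy the page so we later know what changed. */
    void make_twin(const int *page, int *twin) {
        memcpy(twin, page, PAGE_WORDS * sizeof(int));
    }

    /* At release, compare the page with its twin and send only the changed words. */
    void release_page(const int *page, const int *twin) {
        for (int i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i])
                send_update(i, page[i]);
    }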
Directories

• Munin uses directories to locate pages containing shared variables.

[Figure: Munin's directory structure, traversed from the root to find the page that holds a given shared variable.]

Midway

• Midway is a distributed shared memory system that is based on sharing individual data structures. It is similar to Munin in some ways, but has some interesting new features of its own.
• Midway supports entry consistency.
Object-Based Distributed Shared Memory

• In an object-based distributed shared memory, processes on multiple machines share an abstract space filled with shared objects.

[Figure: an object space containing shared objects, accessed by processes running on different machines.]
Linda

Tuple Space

• A tuple is like a structure in C or a record in Pascal. It consists of one or more fields, each of which is a value of some type supported by the base language. For example:

  ("abc", 2, 5)
  ("matrix-1", 1, 6, 3.14)
  ("family", "is-sister", "Carolyn", "Elinor")

• Operations on tuples:
  - out puts a tuple into the tuple space, e.g. out("abc", 2, 5);
  - in retrieves a tuple from the tuple space, e.g. in("abc", 2, ?j);
    If a match is found, j is assigned a value.


• A common programming paradigm in Linda is the replicated worker model. This model is based on the idea of a task bag full of jobs to be done. The main process starts out by executing a loop containing

      out("task-bag", job);

  in which a different job description is output to the tuple space on each iteration. Each worker starts out by getting a job description tuple using

      in("task-bag", ?job);

  which it then carries out. When it is done, it gets another. New work may also be put into the task bag during execution. In this simple way, work is dynamically divided among the workers, and each worker is kept busy all the time, all with little overhead.
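The replicated worker model in a C-flavored sketch (the job type and do_work() are hypothetical; in real C-Linda the formal parameter of in() is written ?job, while here an ordinary pointer is used so the fragment reads as plain C):

    struct job { int id; };               /* hypothetical job description */

    /* The Linda primitives, declared here as ordinary prototypes. */
    void out(const char *tag, struct job j);
    void in(const char *tag, struct job *j);

    void do_work(struct job *j);          /* hypothetical: carry out one job */

    void master(int njobs) {
        for (int i = 0; i < njobs; i++) {
            struct job j = { i };
            out("task-bag", j);           /* drop one job description into tuple space */
        }
    }

    void worker(void) {
        struct job j;
        for (;;) {
            in("task-bag", &j);           /* block until a job tuple matches, then remove it */
            do_work(&j);                  /* workers may also out() new jobs while running */
        }
    }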

Orca

• Orca is a parallel programming system that allows processes on different machines to have controlled access to a distributed shared memory consisting of protected objects.
• These objects are a more powerful form of the Linda tuple, supporting arbitrary operations instead of just in and out.
A simplified stack object:

    object implementation stack;
        top: integer;
        stack: array [integer 0..N-1] of integer;

        operation push(item: integer);
        begin
            stack[top] := item;
            top := top + 1;
        end;

        operation pop(): integer;
        begin
            guard top > 0 do
                top := top - 1;
                return stack[top];
            od;
        end;
    begin
        top := 0;
    end;

• Orca has a fork statement to create a new process on a user-specified processor.
• E.g.: for i in 1..n do fork foobar(s) on i; od;