[Figure: a multiprocessor — several CPUs and a memory module connected by a common bus.]
[Figure: a multiprocessor with caching — each CPU has a private cache between it and the bus.]
Event        Action taken by a cache in           Action taken by a cache in
             response to its own CPU's            response to a remote CPU's
             operation                            operation
Read miss    Fetch data from memory               No action
             and store in cache
Read hit     Fetch data from local cache          No action
Write miss   Update data in memory                No action
             and store in cache
Snooping-cache example. CPUs A, B, and C share a bus; W is a word in memory:

(a) Initial state: word W, containing value W1, is in memory and is also cached by B. Memory is correct.
(b) A reads word W and gets W1. B does not respond to the read, but the memory does. Memory is correct.
(c) A writes a value W2. B snoops on the bus, sees the write, and invalidates its entry. A's copy is marked DIRTY. Memory is not updated.
(d) A writes W again (W3). This and subsequent writes by A are done locally, without any bus traffic. A's copy remains DIRTY; B's remains INVALID. Memory is not updated.
(e) C reads or writes W. A sees the request by snooping on the bus, provides the value, and invalidates its own entry. C now has the only valid copy (DIRTY); A's and B's copies are INVALID, and memory is not updated.
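As a sanity check, the five panels above can be replayed with a toy state machine (a minimal sketch with invented names; a real protocol also distinguishes write hits from write misses and handles write-back on eviction):

```python
# Toy model of the snooping protocol from panels (a)-(e).
# Per-(cpu, word) states: CLEAN, DIRTY, or INVALID.
caches = {}                 # (cpu, word) -> [state, value]
memory = {'W': 'W1'}        # panel (a): memory holds W1 ...
caches[('B', 'W')] = ['CLEAN', 'W1']   # ... and B has a CLEAN copy

def read(cpu, word):
    entry = caches.get((cpu, word))
    if entry and entry[0] != 'INVALID':
        return entry[1]                      # read hit: no bus traffic
    for key, (state, val) in list(caches.items()):
        if key[1] == word and state == 'DIRTY':
            caches[key] = ['INVALID', None]  # dirty holder gives up its copy
            caches[(cpu, word)] = ['DIRTY', val]
            return val                       # panel (e): value comes from the cache
    val = memory[word]                       # panel (b): memory responds
    caches[(cpu, word)] = ['CLEAN', val]
    return val

def write(cpu, word, val):
    for key in list(caches):                 # other caches snoop the write
        if key[1] == word and key[0] != cpu:
            caches[key] = ['INVALID', None]  # panel (c): B invalidates
    caches[(cpu, word)] = ['DIRTY', val]     # memory is NOT updated

print(read('A', 'W'))        # (b) A gets W1 from memory
write('A', 'W', 'W2')        # (c) B invalidated, A's copy DIRTY
write('A', 'W', 'W3')        # (d) local write
print(read('C', 'W'))        # (e) A supplies W3 and invalidates itself
```

Running the script reproduces the final state of panel (e): C holds the only valid (DIRTY) copy, and memory still holds the stale W1.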
[Figure: Memnet memory management unit; clusters of CPUs (C) and memories (M) joined by an intracluster bus, with clusters connected by an intercluster bus.]
[Figure: two superclusters connected by a supercluster bus — each cluster holds CPUs with caches and a share of the global memory.]
Caching
Caching is done on two levels: a first-level cache and a larger second-level cache.
Each cache block can be in one of the following three states:
1. UNCACHED — the only copy of the block is in its home memory.
2. CLEAN — memory is up to date; the block may be in several caches.
3. DIRTY — memory is incorrect; only one cache holds the block.
[Figure: a NUMA node — CPU (C), memory (M), and I/O on a local bus with local memory and a microprogrammed MMU; nodes are connected through a switch.]
Access to remote memory is possible.
Accessing remote memory is slower than accessing local memory.
Remote access times are not hidden by caching.
In NUMA, it matters a great deal which page is located in which memory. The key issue in NUMA software is the decision of where to place each page to maximize performance.
A page scanner running in the background can gather usage statistics about local and remote references. If the usage statistics indicate that a page is in the wrong place, the page scanner unmaps the page so that the next reference causes a page fault.
Comparison of shared memory systems (the first three use hardware-controlled caching, the last three software-controlled caching):

                                           Single-bus   Switched    NUMA      Page-based  Shared-variable  Object-based
                                           multiproc.   multiproc.  machine   DSM         DSM              DSM
Encapsulation and methods?                 No           No          No        No          No               Yes
Is remote access possible in hardware?     Yes          Yes         Yes       No          No               No
Is unattached memory possible?             Yes          Yes         Yes       No          No               No
Who converts remote memory accesses
  to messages?                             MMU          MMU         MMU       OS          Runtime system   Runtime system
Transfer medium                            Bus          Bus         Bus       Network     Network          Network
Strict consistency:
P1: W(x)1
P2:        R(x)1

Not strict consistency:
P1: W(x)1
P2:      R(x)0  R(x)1
Sequential consistency
The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
The following sequences are both correct (sequentially consistent):
P1: W(x)1              P1: W(x)1
P2:   R(x)0  R(x)1     P2:   R(x)1  R(x)1
Two possible results of running the same program:

Process 1: a = 1; print(b, c);
Process 2: b = 1; print(a, c);
Process 3: c = 1; print(a, b);

The signature 000000 is not permitted.
The signature 001001 is not permitted.
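These claims can be checked by brute force: enumerate every interleaving of the six statements that respects each process's program order and collect the printed six-digit signatures. A minimal Python sketch (prints treated as atomic; all names invented):

```python
from itertools import permutations

# Process i assigns its variable, then prints the other two (in this order):
procs = [('a', 'bc'), ('b', 'ac'), ('c', 'ab')]
ops = [(i, step) for i in range(3) for step in (0, 1)]   # 0 = assign, 1 = print

signatures = set()
for order in permutations(ops):
    # Keep only interleavings that respect each process's program order.
    done = set()
    valid = True
    for op in order:
        if op[1] == 1 and (op[0], 0) not in done:
            valid = False
            break
        done.add(op)
    if not valid:
        continue
    mem = {'a': 0, 'b': 0, 'c': 0}
    printed = {}
    for i, step in order:
        if step == 0:
            mem[procs[i][0]] = 1                          # the assignment
        else:
            printed[i] = ''.join(str(mem[v]) for v in procs[i][1])
    signatures.add(printed[0] + printed[1] + printed[2])

print('000000' in signatures, '001001' in signatures)  # -> False False
print('101011' in signatures)                          # -> True
```

000000 is impossible because whichever process prints first has already completed its own assignment, so at least one later print must show a 1.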
This sequence is allowed with causally consistent memory, but not with sequentially consistent memory or strictly consistent memory:

P1: W(x)1                 W(x)3
P2:    R(x)1  W(x)2
P3:    R(x)1        R(x)3  R(x)2
P4:    R(x)1        R(x)2  R(x)3

(W(x)2 and W(x)3 are concurrent, so P3 and P4 may see them in different orders.)

P1: W(x)1
P2:        W(x)2
P3:    R(x)2  R(x)1
P4:    R(x)1  R(x)2

(Here W(x)1 and W(x)2 are concurrent as well — P2 never read x — so this is also correct for causal memory.)

All processes see all causally-related shared accesses in the same order.
PRAM consistency
Writes done by a single process are received by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

P1: W(x)1
P2:    R(x)1  W(x)2
P3:           R(x)1  R(x)2
P4:           R(x)2  R(x)1

A valid sequence of events for PRAM consistency, but not for the above stronger models.
Processor consistency is PRAM consistency + memory coherence. That is, for every memory location, x, there must be global agreement about the order of writes to x. Writes to different locations need not be viewed in the same order by different processes.
Weak consistency
Weak consistency has three properties:
1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.
P1: W(x)1  W(x)2  S
P2:    R(x)1  R(x)2  S
P3:    R(x)2  R(x)1  S

A valid event sequence for weak consistency: until the synchronization, processes may still see the writes in different orders.

P1: Acq(L)  W(x)1  W(x)2  Rel(L)
P2:                        Acq(L)  R(x)2  Rel(L)
P3:                        R(x)1

A valid event sequence for release consistency. P3 does not acquire, so the result it reads is not guaranteed.
Shared data are made consistent when a
critical region is exited.
In lazy release consistency, at the time of a release, nothing is sent anywhere. Instead, when an acquire is done, the processor trying to do the acquire has to get the most recent values of the variables from the machine or machines holding them.
Shared data pertaining to a critical region
are made consistent when a critical region
is entered.
Formally, a memory exhibits entry
consistency if it meets all the following
conditions:
1. An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
2. Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
3. After an exclusive mode access to a synchronization variable has been performed, any other process' next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

Page-Based Distributed Shared Memory
These systems are built on top of multicomputers, that is, processors connected by a specialized message-passing network, workstations on a LAN, or similar designs. The essential element here is that no processor can directly access any other processor's memory.
[Figure: chunks of a shared global address space distributed among four machines (CPU 1 to CPU 4).
(a) The initial distribution of the chunks over the four memories.
(b) Situation after CPU 1 references chunk 10: the chunk migrates to CPU 1's memory.
(c) Situation if chunk 10 is read only and replication is used: CPU 1 gets a copy while the original stays in place.]
One improvement to the basic system that can improve performance considerably is to replicate chunks that are read only.
Another possibility is to replicate not only read-only chunks, but all chunks. This makes no difference for reads, but if a replicated chunk is suddenly modified, special action has to be taken to prevent having multiple, inconsistent copies in existence.
Granularity
When a nonlocal memory word is referenced, a chunk of memory containing the word is fetched from its current location and put on the machine making the reference. An important design issue is how big the chunk should be: a word, a block, a page, or a segment (multiple pages)?
False sharing
[Figure: a shared page contains two unrelated shared variables, A and B; processor 1 runs code using A while processor 2 runs code using B, yet the whole page must move between them.]
Finding the owner
The simplest solution is to do a broadcast, asking the owner of the specified page to respond.
Drawback: broadcasting wastes network bandwidth and interrupts each processor, forcing it to inspect the request packet.
Solution: have multiple page managers. Then how is the right manager found? One solution is to use the low-order bits of the page number as an index into a table of managers. Thus with eight page managers, all pages that end with 000 are handled by manager 0, all pages that end with 001 are handled by manager 1, and so on. A different mapping is to use a hash function.
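Both mappings amount to a one-line function (a small sketch; `NUM_MANAGERS` and the function names are invented):

```python
NUM_MANAGERS = 8   # a power of two, so the low-order bits form a mask

def manager_for(page_number):
    # Pages whose numbers end in 000 (binary) go to manager 0,
    # pages ending in 001 go to manager 1, and so on.
    return page_number & (NUM_MANAGERS - 1)

def manager_for_hashed(page_number):
    # A hash spreads the load even when page numbers are clustered.
    import hashlib
    digest = hashlib.sha256(str(page_number).encode()).digest()
    return digest[0] % NUM_MANAGERS

print(manager_for(0b10110_000), manager_for(0b10110_001))  # -> 0 1
```

The mask works for any power-of-two manager count; the hash variant trades a little computation for better balance when page usage is skewed.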
[Figure: locating the owner through a central page manager.
(a) The process sends a request to the page manager (1), which replies with the owner's identity (2); the process then sends its request to the owner (3), which replies (4).
(b) Alternatively, the page manager forwards the request to the owner directly (2), and the owner replies to the process (3).]
Finding the copies
The first possibility is to broadcast a message giving the page number and ask all processors holding the page to invalidate it. This approach works only if broadcast messages are totally reliable and can never be lost.
The second possibility is to have the owner or page manager maintain a list, or copyset, telling which processors hold which pages. When a page must be invalidated, the old owner, new owner, or page manager sends a message to each processor holding the page and waits for an acknowledgement. When each message has been acknowledged, the invalidation is complete.
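The copyset scheme can be sketched like this (hypothetical names; `send` and `await_ack` stand in for the real message layer):

```python
# copyset: page number -> set of processors currently holding a copy
copyset = {7: {0, 2, 5}}

def invalidate(page, new_owner, send, await_ack):
    for cpu in sorted(copyset.get(page, set())):
        if cpu != new_owner:
            send(cpu, ('invalidate', page))   # tell the holder to drop its copy
            await_ack(cpu)                    # wait for its acknowledgement
    copyset[page] = {new_owner}               # invalidation is now complete

sent = []
invalidate(7, new_owner=2,
           send=lambda cpu, msg: sent.append(cpu),
           await_ack=lambda cpu: None)
print(sent, copyset[7])   # messages went to CPUs 0 and 5; only CPU 2 remains
```

Unlike the broadcast scheme, only the processors that actually hold a copy are interrupted, and the acknowledgements make lost messages detectable.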
Page replacement
If there is no free page frame in memory to hold the needed page, which page should be evicted, and where should it be put? One approach is to use traditional virtual memory algorithms, such as LRU.
In DSM, a replicated page that another process owns is always a prime candidate to evict, because another copy exists.
The second best choice is a replicated page that the evicting process owns. Ownership can be passed to another process that holds a copy.
If neither is available, a nonreplicated page must be chosen. One possibility is to write it to disk. Another is to hand it off to another processor.
Choosing a processor to hand a page off to can be done in several ways:
1. Send it to its home machine, which must accept it.
2. The number of free page frames can be piggybacked on each message sent, with each processor building up an idea of how free memory is distributed around the network.
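The eviction preference order described above can be written as a small ranking function (a sketch; the frame fields are invented):

```python
def eviction_candidate(frames):
    # Lower rank = better victim:
    # 0: replicated page owned by someone else (another copy exists)
    # 1: replicated page we own (ownership can be passed to another holder)
    # 2: nonreplicated page (must be written to disk or handed off)
    def rank(frame):
        if frame['replicated']:
            return 0 if not frame['owned'] else 1
        return 2
    return min(frames, key=rank)

frames = [{'page': 3, 'replicated': False, 'owned': True},
          {'page': 8, 'replicated': True, 'owned': True},
          {'page': 5, 'replicated': True, 'owned': False}]
print(eviction_candidate(frames)['page'])   # -> 5
```

Within each rank, a real system would still apply LRU or a similar policy to break ties.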
Synchronization
In a DSM system, as in a multiprocessor, processes often need to synchronize their actions. A common example is mutual exclusion, in which only one process at a time may execute a certain part of the code. In a multiprocessor, the TEST-AND-SET-LOCK (TSL) instruction is often used to implement mutual exclusion. In a DSM system, this code is still correct, but it is a potential performance disaster.
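The spin lock in question looks roughly like this (a thread-based sketch; the `guard` lock merely simulates the hardware's atomic TSL instruction). On a DSM system every failed probe in the `while` loop may drag the page holding the lock word across the network, which is why spinning is a disaster there:

```python
import threading

lock_word = [0]
guard = threading.Lock()     # stands in for the bus-atomic TSL instruction

def tsl():
    # TEST-AND-SET-LOCK: atomically read the old value and set the word to 1.
    with guard:
        old = lock_word[0]
        lock_word[0] = 1
        return old

def acquire():
    while tsl() == 1:
        pass    # busy-wait; on DSM each probe may fetch the lock's page

def release():
    lock_word[0] = 0

counter = 0
def worker():
    global counter
    for _ in range(1000):
        acquire()
        counter += 1          # critical region
        release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)   # -> 4000: mutual exclusion preserved
```

The result is correct, which is the point: the code's semantics survive on DSM; only its cost model does not.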
Shared-Variable Distributed Shared Memory
These systems share only certain variables and data structures that are needed by more than one process.
Two examples of such systems are Munin and Midway.
Munin is a DSM system that is fundamentally based on software objects, but which can place each object on a separate page so the hardware MMU can be used for detecting accesses to shared objects.
Munin is based on a software implementation of release consistency.
Release consistency in Munin
[Figure: process 1 gets exclusive access to a critical region and executes
    Lock(L);
    a = 1;
    b = 2;
    c = 3;
    Unlock(L);
At the unlock, the changes to the shared variables a, b, and c are sent over the network to processes 2 and 3.]
Use of twin pages in Munin
[Figure: a shared page starts in read-only mode; on the first write Munin makes a twin copy and switches the page to read-write (RW). At release, the page is compared with its twin word by word, and only a modified word (here 6 -> 8) is sent in a message.]
Midway is a distributed shared memory system that is based on sharing individual data structures. It is similar to Munin in some ways, but has some interesting new features of its own.
Midway supports entry consistency.
Object-Based Distributed Shared Memory
In an object-based distributed shared memory, processes on multiple
machines share an abstract space filled with shared objects.
[Figure: processes, each on its own machine, operate on shared objects in a common object space.]

Linda
A tuple is like a structure in C or a record in Pascal. It consists of one or more fields, each of which is a value of some type supported by the base language.
For example:
("abc", 2, 5)
("matrix-1", 1, 6, 3.14)
("family", "is-sister", "Carolyn", "Elinor")
Operations on Tuples
out puts a tuple into the tuple space, e.g. out("abc", 2, 5);
in retrieves a tuple that matches a template, e.g. in("abc", 2, ?j); — the formal parameter ?j is bound to the value of the matching field.
In the replicated-worker model, work is posted with out("task-bag", job); and an idle worker fetches a job with in("task-bag", ?job); which it then carries out. When it is done, it gets another. New work may also be put into the task bag during execution. In this simple way, work is dynamically divided among the workers, and each worker is kept busy all the time, all with little overhead.
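A toy in-process tuple space illustrates out and in (a sketch only: `None` plays the role of a ? formal, and a real Linda implementation matches on types and distributes the space across machines):

```python
import threading

class TupleSpace:
    """Toy Linda-style tuple space; None in a template acts as a ? formal."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        with self._cond:
            self._tuples.append(fields)          # add the tuple to the space
            self._cond.notify_all()

    def in_(self, *template):
        with self._cond:
            while True:                          # block until a match appears
                for t in self._tuples:
                    if len(t) == len(template) and all(
                            p is None or p == f
                            for p, f in zip(template, t)):
                        self._tuples.remove(t)   # in removes the tuple
                        return t
                self._cond.wait()

ts = TupleSpace()
ts.out("task-bag", 31)          # a master drops a job into the bag
job = ts.in_("task-bag", None)  # a worker picks up any matching job
print(job)                      # -> ('task-bag', 31)
```

Because in_ blocks until a match exists, the task-bag pattern needs no extra synchronization: idle workers simply wait inside in_ until work appears.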
Orca is a parallel programming system that allows processes on different machines to have controlled access to a distributed shared memory consisting of protected objects.
These objects are a more powerful form of the Linda tuple, supporting arbitrary operations instead of just in and out.
A simplified stack object:

object implementation stack;
    top: integer;                          # storage for the stack pointer
    stack: array [integer 0..N-1] of integer;

    operation push(item: integer);         # function returning nothing
    begin
        stack[top] := item;                # push item onto the stack
        top := top + 1;                    # increment the stack pointer
    end;

    operation pop(): integer;              # function returning an integer
    begin
        guard top > 0 do                   # suspend if the stack is empty
            top := top - 1;                # decrement the stack pointer
            return stack[top];             # return the top item
        od;
    end;
begin
    top := 0;                              # initialization
end;
Orca has a fork statement to create a new
process on a user-specified processor.