Recoverable Distributed Shared Virtual Memory

Kun-Lung Wu and W. Kent Fuchs
Abstract-This paper examines the problem of rollback recovery in distributed shared virtual memory environments, in which the shared memory is implemented in software in a loosely coupled distributed multicomputer system. A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory. The checkpointing scheme is unique in that it can be integrated with the memory coherence protocol for managing the shared virtual memory. The twin-page disk design allows checkpointing to proceed in an incremental fashion without an explicit undo at the time of recovery. The recoverable distributed shared virtual memory allows the system to restart computation from a checkpoint without a global restart.

Index Terms-Distributed shared virtual memory, memory coherence, paging disk management, rollback recovery.

I. INTRODUCTION

DISTRIBUTED shared virtual memory has recently been developed to support a shared memory programming model in loosely coupled distributed multicomputers [1]-[5]. Potential benefits of distributed shared virtual memory include: ease in process migration, ease in passing complex data structures between processors, and ease in object invocation and object synchronization in object-oriented systems [1], [2], [5]. In contrast to shared physical memory in tightly coupled parallel architectures [6]-[9], distributed shared virtual memory exists only virtually and is generally implemented in software.

In the typical implementation of distributed shared virtual memory, a memory mapping routine in each processor maps the local memory onto the shared virtual address space. Memory pages are paged not only between physical memories and the paging disk, but also between the physical memories of different processors. Since multiple processors may contain copies of the same page, memory coherence must be maintained in distributed shared virtual memory; otherwise, a processor may read stale data from its local memory.

It is desirable, from a reliability perspective, that the distributed shared virtual memory be recoverable without a global restart. This is particularly important for large, complex applications which require long computation times. Although various memory coherence algorithms have been proposed [1], [4], [5], very few existing checkpointing and recovery techniques are appropriate for distributed shared virtual memory environments. The difficulty in checkpointing and recovery is that a global restart may be required at the point of recovery, since each processor maintains part of the shared address space.

This paper presents a user-transparent checkpointing recovery scheme and a new paging disk management technique for implementing recoverable distributed shared virtual memory, allowing the system to restart computation from a checkpoint without a global restart. The checkpointing process can be integrated with the memory coherence protocol. With a consistent checkpoint state maintained on the paging store, after a single node failure (a processor and its associated memory), the affected processes can be restarted from the checkpoint. A new twin-page disk storage management technique is also developed to allow incremental checkpointing and rapid rollback. Incremental checkpointing reduces overhead by potentially spreading the disk activity required by a checkpoint over the checkpoint interval. Rapid rollback is realized by not explicitly undoing those disk pages updated by a failure-interrupted process since the last checkpoint. Our emphasis in this paper is on integrating user-transparent checkpointing recovery techniques with the memory coherence protocol and on providing disk storage management which allows incremental checkpointing and rapid rollback recovery.

A large body of literature exists concerning checkpointing and recovery, primarily for message passing systems rather than the distributed shared virtual memory systems of this paper. In pure message passing systems, there is a separate address space for each process, and interprocess communication is through clear message-sends and message-receives. By contrast, in distributed shared virtual memory systems, all processes share a single address space; shared variables are usually accessible by all processes; and the sender-receiver relationship is generally not clear. Approaches to coordinating recovering concurrent processes communicating through monitor calls have been proposed by Kim [10]. A hardware recovery block approach has also been proposed by Lee and Shin to implement a fault-tolerant multiprocessor [11]. Both approaches maintain multiple checkpoints and the history of process interactions, and also resolve rollback propagation at the point of recovery. In a distributed shared virtual memory environment, maintaining multiple virtual memory checkpoints complicates the page-mapping implementation.

Manuscript received June 29, 1989; revised November 30, 1989. This work was supported in part by SDIO/IST and managed by the Office of Naval Research under Contract N00014-88-K-0656, by a grant from Texas Instruments Incorporated, and by the National Aeronautics and Space Administration (NASA) under Contract NASA NAG 1-613 in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS). An earlier version of this manuscript was presented at the IEEE 19th International Symposium on Fault-Tolerant Computing, 1989.

The authors are with the Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801.

IEEE Log Number 8933888.




Fig. 1. Hardware architecture of the distributed multicomputer system.

Fig. 2. A possible rollback propagation scenario.
Shadow paging is a well-known technique for implementing transaction recovery in database applications [12]-[14]. Shadow paging has also been used in the implementation of checkpointing and rollback recovery in a uniprocessor system [15]. A variation of shadow paging, called TWIST, which provides fast undo logging during normal processing but requires explicit invalidation involving slow disk I/O's at the time of recovery, has been proposed for transaction recovery [14]. The new twin-page disk storage management of this paper uses two disk pages for each shared virtual page. However, our approach is not for transaction recovery; it provides incremental checkpointing of parallel processes and rapid rollback recovery.

The paper is organized as follows. Section II describes the model of computation. Section III demonstrates the integration of the checkpointing and recovery algorithms into the shared virtual memory coherence protocol. Section IV presents the data structures and algorithms for the twin-page disk management for incremental checkpointing.

II. COMPUTATION MODEL

The distributed system considered in this paper, as shown in Fig. 1, may contain a number of diskless workstations connected to each other and to the disk servers by a local area network (LAN), such as an Ethernet or token ring. There is no physically shared memory between different processors except the disk, which serves as the paging store for the shared virtual memory. Disk servers execute paging store management routines for the distributed shared virtual memory and maintain a copy of a consistent shared virtual memory state on disk. The single shared virtual address space distinguishes this system from standard commercial distributed workstation environments.

In this paper, we illustrate the application of our approach to the distributed shared virtual memory developed by Li [3], in which a "page" of size 1K bytes was used as the unit of memory coherence maintenance and data transfer between processors. The system has homogeneous processors, a fast communication link, and a memory management unit with a page-level access-protection mechanism. The access-protection mechanism allows programs to set the access mode for each page so that a page fault is generated when an illegal memory reference occurs. Memory coherence algorithms proposed by Li employ an invalidation strategy [1]: before a processor writes to a page, the processor first obtains the write privilege by invalidating all copies of the same page on other processors. The accessibility of each page is appropriately maintained by the memory coherence algorithm.

An application program on a shared virtual memory system consists of multiple processes communicating with each other through accessing the shared virtual address space. A process in this context refers to a lightweight process executing in the shared virtual memory space [3]; a lightweight process is sometimes referred to as a task or thread. Each process has a unique identifier. Processes can be dynamically created or destroyed, provided the identifier of a destroyed process is not reused.

The local area network is assumed to provide reliable communication between different processors, and between a processor and the disk. Disk storage is assumed to be reliable, and a single disk-page update is atomic. Each processor is assumed to be fail-stop: a processor stops processing in response to an internal error [16]. Also, the checkpoint is not corrupted by undetected errors. After a node failure, the contents of the physical memory in that node are considered to be invalid. However, the necessary process state for restarting execution, such as the program counter, the process identifier, register contents, etc., is reliably maintained in the PCB (process control block) and stored in the checkpoint. As a result, all the processes affected by the failed processor resume execution from their checkpoints. The disk server processor has to resume fault-free execution after a failure in order to provide access to the checkpoint on disk.

III. RECOVERABLE MEMORY COHERENCE MANAGEMENT

A. Shared Memory Checkpointing

Checkpointing by a process in this paper is a matter of flushing the dirty pages modified by the process since the last checkpoint, and storing the process state onto disk. Due to the shared-virtual-memory accesses by different processes, rollback propagation might be required if each process simply checkpointed itself independently. For example, two checkpoints, C_p of process p and C_q of process q, might be taken in such a way that shared data were updated by p after C_p and those modified data were read by q before C_q, as shown in Fig. 2. A rollback of process p to C_p would require process q to roll back to a checkpoint C_q' taken before C_q, since p might write different values to the shared data after rollback and therefore q could have read invalid data from p.

Rollback propagation requires not only the maintenance of multiple checkpoints, which complicates the shared virtual memory management, but also the recording of process interactions during normal processing. On the other hand, a globally coordinated checkpointing scheme based on explicit interprocess communication can be used to maintain a consistent checkpoint so that no rollback propagation may be required. However, both approaches require the maintenance of the history of process interactions. Unlike the explicit message sends and receives in message passing systems, it is not obvious when processes communicate in a shared memory system.

In this paper, we propose a checkpointing scheme which maintains a shared memory checkpoint without the requirement of rollback propagation should any process perform a rollback. In our approach, a process p is required to checkpoint whenever one of the pages modified by p since p's last checkpoint is to be read by another process q. A modified page sent to another process on a write page fault is treated as if the process is to read this dirty page. If no other process reads any of the pages modified since process p's last checkpoint, then p does not checkpoint; this is what prevents rollback propagation.
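The rule can be stated compactly in code. The following C sketch shows the owner-side check; the names (PageState, take_checkpoint, send_page) are hypothetical, since the paper prescribes only the policy, not an interface:

    #include <stdio.h>

    typedef struct {
        int owner_pid;          /* last process to write the page */
        int dirty_since_ckpt;   /* modified since that process's checkpoint? */
    } PageState;

    static void take_checkpoint(int pid)   /* stub: flush dirty pages and */
    {                                      /* the PCB of pid to disk      */
        printf("checkpoint process %d\n", pid);
    }

    static void send_page(PageState *pg, int reader)
    {
        printf("page sent to process %d\n", reader);
    }

    /* Owner-side handler: before a page modified since the writer's last
       checkpoint is handed to a reader, the writer is checkpointed. */
    static void serve_read_request(PageState *pg, int reader_pid)
    {
        if (pg->dirty_since_ckpt) {
            take_checkpoint(pg->owner_pid);
            pg->dirty_since_ckpt = 0;   /* the page is now checkpointed */
        }
        send_page(pg, reader_pid);
    }

If no reader ever touches the page before the writer's next checkpoint, serve_read_request is never invoked for it and no extra checkpoint is taken.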
B. Implementation of Memory Coherence Management with Checkpointing

The checkpointing scheme of this paper can be integrated with the shared virtual memory management protocol. Since process p invalidates other copies before writing into a page, the event that a page modified by p since its last checkpoint is to be read by a different process q executing on another processor is manifested by a page fault generated by q. Before the requested page is sent, if it has been modified by p since p's last checkpoint, then a checkpoint of p is initiated.

However, if processes p and q are scheduled on the same processor, no page fault may be generated on such an occasion. In this case, the system-kernel implementation of process synchronization can be modified to implement the checkpointing. For example, a lock variable is usually used to protect shared data and to enforce mutual exclusion. After process q obtains the lock and before it accesses the protected data, a checkpoint of process p is initiated if process p has modified the data since p's last checkpoint. An alternative approach is to generate a page fault if process q accesses a memory-resident page modified by process p since p's last checkpoint. Such an alternative requires hardware support similar to the 801 storage architecture [17], in which the memory management unit detects access violations by a database transaction process at the page level.

Two pieces of information concerning a page are crucial in the memory coherence protocol: owner and copyset. The owner indicates which processor currently owns the page, and the copyset maintains the identifiers of the processors which have read-only copies of the page. Two classes of algorithms based on the maintenance of page ownerships have been proposed for memory coherence: centralized manager and distributed managers [1]. The monitor-like centralized manager algorithm is used to demonstrate the integration of our checkpointing recovery approach.

In the monitor-like centralized manager algorithm [1], the centralized manager maintains both the owner and copyset in a data structure called Info. Every processor also keeps its own local page table, indicating its access right to each page in the shared virtual address space. To handle a page fault, the faulting processor sends a request to the centralized manager. On a write page fault, the centralized manager sends invalidation messages to all the processors having a read-only copy of the page, sets the copyset to null, and asks the owner of the page to send a copy to the requesting processor. On a read page fault, the centralized manager includes the identifier of the requesting processor in the copyset of the page, and asks the owner of the page to send a copy to the requesting processor. After sending the page, the owner relinquishes both the read and write access privileges for the page on serving a write page fault, ensuring only a single owner for each page. By contrast, on serving a read page fault, the owner retains the read access privilege while relinquishing the write access privilege for the page.

To handle a page fault with integrated checkpointing, the owner simply takes a checkpoint before sending the page if it has been modified since the last checkpoint. The enhanced protocol for handling a read page fault with the checkpointing scheme embedded is illustrated in Fig. 3. Compared to the original protocol [1], the protocol of Fig. 3 differs only in step 3, where a checkpoint is taken if the requested page has been modified since the last checkpoint.

Fig. 3. Protocol for handling a read page fault with checkpointing: the faulting processor sends a read request to the manager; the manager forwards the request to the owner; the owner checkpoints if page pp has been modified since the last checkpoint and then sends page pp; a confirmation is sent to the manager.
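The two tables can be written down directly. A minimal C sketch of the manager's Info and a processor's local page table follows; NPROC and NPAGE are assumed configuration constants, and the null field anticipates the recovery algorithms of the Appendix:

    #define NPROC 64     /* processors in the system (assumed)          */
    #define NPAGE 1024   /* pages in the shared address space (assumed) */

    typedef enum { ACC_NIL, ACC_READ, ACC_WRITE } Access;

    /* Manager-side table ("Info"), one entry per shared page. */
    struct info_entry {
        int           owner;            /* processor currently owning the page       */
        unsigned char copyset[NPROC];   /* copyset[i] != 0: i holds a read-only copy */
        int           null;             /* nonzero: entry lost, rebuild on demand    */
    } Info[NPAGE];

    /* Per-processor page table: the local access right for each page. */
    struct ptable_entry {
        Access access;                  /* nil, read, or write */
    } PTable[NPAGE];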
It should be noted that checkpoints can be initiated, if desired, in addition to those established by the memory coherence protocol. Additional checkpoints can be initiated by the operating system or the application. The memory coherence-based checkpointing scheme simply guarantees that rollback propagation will not occur.

C. Memory Coherence Management and Rollback Recovery

1) Transient Failures: If a processor and its associated physical memory experience a transient failure, then it may be possible to restart computation on that processor. After a transient failure of the processor, the contents of its local physical memory are considered to be invalid. As a result, the key data structures for memory coherence, owner and copyset, may be lost if the processor executes the centralized manager routine. We demonstrate the procedures for reconstructing the data structures on demand.

Case 1 (The restarting processor does not execute the centralized manager): Initially, all the memory pages are invalid and will be fetched on demand after restart, i.e., requests will be sent to the centralized manager and subsequently forwarded to the owner of the page. When the restarting processor is the owner of the requested page, the page is fetched from disk.
Case 2 (The restarting processor also executes the centralized manager): In this case, the Info (containing owner and copyset) needs to be reconstructed by the restarting processor. Upon receiving a page-fault request after restart, the restarting processor executes a ReconstructInfo algorithm, which broadcasts a special message, LocateOwnership, to ask the owner to send a copy to the requesting processor, and then reconstructs the Info accordingly. On receiving a LocateOwnership message, if the processor is the owner, a copy of the requested page is sent to the requesting processor and an Ownership message is sent to the restarting processor; otherwise, a Copyship message is sent to the restarting processor, indicating whether this processor has a read-only copy of the page. If no processor indicates itself as the owner of the page, then the restarting processor assumes the ownership and the page is fetched from disk. For a practical implementation, each of the LocateOwnership exchanges can handle several pages simultaneously.
2) Permanent Failures: Since checkpoints are maintained on disk, it is not possible for the system to survive a permanent server failure. However, in the event of a permanent failure of a client processor, the system is able to recover by restarting the processes previously executing on the faulty processor on a fault-free processor. If the faulty processor is not the centralized manager node, the centralized manager node will become the new owner of the pages previously owned by the faulty processor. On serving a page-fault request, the centralized manager fetches the requested page from disk if the page was previously owned by the faulty processor. If the faulty processor happens to be the processor which executed the centralized manager, then a new fault-free processor can be arbitrarily chosen to execute the centralized manager. This new processor, chosen to execute the centralized manager, uses the ReconstructInfo algorithm to find the owner of the requested page and reconstruct Info, similar to the Case 2 scenario discussed above.
In the implementation, every page has an owner, but this does not imply that every page must exist in the main memory of some processor. When a page is requested and is not currently in the memory of the owner processor, it is fetched from disk. In addition, when the recoverable memory coherence algorithms are executed, every processor except the permanently faulty ones, whose identifiers are known to the centralized manager, is functioning. With the exchanges of LocateOwnership messages between the centralized manager and every other node, a failure during recovery can also be properly handled. The recoverable page-fault handling routines based on Li's MonitorCentralManager algorithm [3] are presented in the Appendix.
D. Synchronization

In our checkpointing and rollback scheme, a process that rolls back may read a different value from a memory location than it read before the rollback, if another process updates the same memory location after the first read. If deterministic ordering is not enforced, then such a scenario is similar to one which could occur due to nondeterministic read/write accesses to a page in normal execution, and may be considered correct. However, if the ordering of accesses to a shared variable needs to be deterministic, then the application programs must employ specific synchronization protocols to enforce the ordering in our approach to checkpointing and rollback. If these synchronization protocols are implemented by atomic read/write accesses to shared synchronization variables in the shared virtual address space, then the checkpointing and rollback will maintain the deterministic ordering. Since memory coherence is always maintained, the necessary checkpoints will be issued to ensure the correct ordering in the presence of rollback recovery.

As an example, suppose process p has to read the shared variable VAR before process q can write to it. This must be implemented in such a way that process p writes to a synchronization variable, syn, after reading VAR, and process q reads the modified syn before it writes to VAR. When process q reads the modified syn, process p will be checkpointed. As a result, the correct ordering of accesses to VAR will be preserved even if process p rolls back.
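A minimal C sketch of this pattern follows; the shared-memory operations are shown as plain variable accesses, and the written value 42 and the function names are illustrative only:

    int VAR;                 /* shared data in the shared virtual space */
    volatile int syn = 0;    /* shared synchronization variable         */

    void process_p(void)     /* must read VAR before q overwrites it    */
    {
        int v = VAR;         /* the protected read                      */
        syn = 1;             /* written AFTER the read: dirties the     */
                             /* page holding syn                        */
        (void)v;             /* ... use v ...                           */
    }

    void process_q(void)
    {
        while (syn == 0)     /* q's read of the modified syn causes a   */
            ;                /* page fault, which checkpoints p first   */
        VAR = 42;            /* p's earlier read is now part of p's     */
    }                        /* checkpoint and survives any rollback    */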
E. Checkpointing Frequency

The frequency of checkpointing in this approach is sensitive to the patterns of data sharing between processes and to the size of a page, the unit of memory coherence maintenance. The checkpointing frequency is small if there exists reference locality on the shared pages. Frequent checkpointing may occur if there exists intensive read/write sharing on the same page. However, this is not unique to our scheme. Frequent interprocess communication may also result in significant overhead in other checkpointing alternatives, since process interactions during normal processing are still required to be recorded. Frequent checkpointing may be a tradeoff for simple recovery, no requirement of recording process interactions during normal processing, and simple shared virtual memory management. Coincidentally, good parallel algorithm designs typically also minimize interprocess communication.

Previous studies of shared memory reference characteristics based on parallel program traces indicate that typically only a small fraction of memory references are to shared data [18]-[20]. For instance, the ratio of the number of write references to shared data over the total memory references ranges only from 0.003 to 0.019 in the four traces studied by Eggers and Katz [19]. Moreover, there is temporal locality as well as processor locality for shared data references [18], [19], indicating that for large time periods shared memory addresses can be considered private and no traffic is generated in maintaining memory coherence. These measurements suggest that the memory coherence-based checkpointing scheme will not result in excessive checkpoints due to shared memory references in the traces analyzed.

Checkpointing frequency is also dependent on the page size. In distributed shared virtual memory, the unit of memory coherence maintenance (a page) is typically larger than the unit of actual sharing between processes (e.g., as small as a word of 4 bytes). Checkpoints may therefore be issued due to external page faults on modified pages rather than due to true interprocess communication. As an example, suppose page A in processor i contains variables x, y, and z. Page A, with only x and y updated, is requested by another processor j due to a read/write miss on variable z. A checkpoint of the process running on processor i is required since page A is modified, even though variable z is not. However, the checkpoint in processor i would not be an additional one if processor j subsequently references variable x or y, since such a checkpoint would be required anyway when x or y is referenced. In fact, it is possible that no additional checkpoints may be issued if memory references to shared data demonstrate spatial locality. From the trace analysis of Eggers and Katz [19], two of the four programs exhibited a high degree of spatial locality in references to shared addresses.

It should be noted that our discussion is limited to the effects of data sharing and page size on the checkpointing frequency, not the actual performance degradation, which is dependent on many factors, including the checkpoint frequency, the disk access speed, the number of dirty pages required to be flushed, and the speed of servicing page faults.
IV. TWIN-PAGE DISK STORAGE FOR INCREMENTAL CHECKPOINTING

A. Approach

This section describes a twin-page disk storage management technique for implementing incremental checkpointing and rapid rollback. A simplified system model is first described, in which the distributed shared virtual memory system is implemented on a number of diskless workstations and the paging store is provided by a separate disk server processor.

Two physically contiguous disk pages are allocated for each writable page in the shared virtual address space. When a page is fetched from disk, both disk pages are transferred and only one of them is retained, according to a simple selection algorithm. Dirty memory pages can be written onto disk at any instant. Thus, no modification to the virtual memory replacement strategy is required. In the event of a process restart after a rollback, no explicit undo is required; the undo is implicitly performed by not retaining the invalid data when fetching a page from disk.

Because of the incremental checkpointing and the absence of an explicit undo at rollback, there are four possible versions of data in each of the twin disk pages: working, invalid, out-of-date, and checkpoint. A working version represents the data written onto disk during the current checkpoint interval; an invalid version represents the data updated on disk by a process after a checkpoint and before a failure; an out-of-date version was the checkpoint version of the page in a previous checkpoint interval. At any instant, one of the twin disk pages maintains a checkpoint of the page; for example, if X0 and X1 represent copies of the data stored on the twin disk pages for page X, either X0 or X1 must be a checkpoint version of X. This is achieved by writing into the disk page containing a noncheckpoint version when a page is updated.

After a checkpoint, the previous working version becomes the checkpoint version and the previous checkpoint becomes an out-of-date version. After a rollback, the previous working version becomes invalid. Fig. 4 shows the state transitions of X0 and X1. Initially, every twin page starts from either state α or δ and remains in that state until the page is written to disk; the state then changes to β or ε, respectively. Notice that, in Fig. 4, checkpoint and rollback only result in transitions from states β and ε. States α, δ, γ, and ζ all remain unchanged on a checkpoint or rollback.

Fig. 4. State transitions for twin pages. (O: out-of-date; W: working; C: checkpoint; I: invalid.)
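The per-copy version labels and the three events of Fig. 4 can be captured in a few lines of C. This is a sketch of the logical state machine only; the enum and function names are not from the paper, and the actual system derives these states from the CRvector and CRtable (Figs. 5-7) rather than storing them:

    typedef enum { OUT_OF_DATE, WORKING, CHECKPOINT, INVALID } Version;
    typedef struct { Version v[2]; } TwinPage;   /* states of X0 and X1 */

    static int working_copy(const TwinPage *t)   /* -1 if no working copy */
    {
        return t->v[0] == WORKING ? 0 : t->v[1] == WORKING ? 1 : -1;
    }

    void on_disk_write(TwinPage *t)   /* a write goes to the noncheckpoint copy */
    {
        int c = (t->v[0] == CHECKPOINT) ? 0 : 1;  /* exactly one checkpoint */
        t->v[1 - c] = WORKING;
    }

    void on_checkpoint(TwinPage *t)   /* working -> checkpoint,             */
    {                                 /* old checkpoint -> out-of-date      */
        int w = working_copy(t);
        if (w < 0) return;            /* states without a working copy stay put */
        t->v[w] = CHECKPOINT;
        t->v[1 - w] = OUT_OF_DATE;
    }

    void on_rollback(TwinPage *t)     /* the working version becomes invalid */
    {
        int w = working_copy(t);
        if (w >= 0) t->v[w] = INVALID;
    }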
B. Data Structures

To implement incremental checkpointing and rapid rollback, two data structures are used: the checkpoint recovery vector, CRvector, and the checkpoint recovery sequence table, CRtable. The CRvector, as described in Fig. 5, is stored in the header of each disk page. The pid records the identifier of the last process updating the page; the timestamp indicates the time (server local time) when the disk page was updated; the cs records the checkpoint sequence number; and the rs records the recovery sequence number of the last process that updated the page. The CRtable, as shown in Fig. 6, is an array of entry structures, with one entry for each process. The CRtable is used to determine the state of a twin page. When a new process is dynamically created, a new entry in the CRtable is also created; however, when a process is destroyed in the middle of the execution, its corresponding entry remains in the CRtable. An entry contains the current checkpoint sequence number, CS; the current recovery sequence number, RS; and a header, inv, pointing to a list of invalid (cs, rs) pairs of a process. A monotonically increasing timer, Time, which must be reliable and capable of surviving a server failure, is used by the disk server to generate the timestamp for each disk-page write.

    struct vector {
        int pid;        /* process identifier         */
        int timestamp;  /* timestamp for disk write   */
        int cs;         /* checkpoint sequence number */
        int rs;         /* recovery sequence number   */
    } CRvector;

Fig. 5. Data structure for CRvector.

    struct entry {
        int CS;              /* current checkpoint sequence number */
        int RS;              /* current recovery sequence number   */
        struct invalid *inv; /* pointer to invalid list            */
    } CRtable[Nprocess];
    /* Nprocess is the total number of processes */

    struct invalid {         /* invalid <cs, rs> pair */
        int cs;
        int rs;
        struct invalid *inv;
    };

Fig. 6. Data structure for CRtable.

The timestamp in the CRvector is used to distinguish which of the twin disk pages contains the most recent data. The CS is incremented on a checkpoint of a process, while the RS is incremented on a rollback of a process. The cs in the CRvector, by comparing it to the corresponding CS in the CRtable, is used to separate a working version from a checkpoint version of a twin page. With the use of rs, which records the incarnation of the process that modified the disk page, the invalid data can be identified. Disk pages with the same (cs, rs) represent the shared memory pages modified by an incarnation of a process in a particular checkpoint interval.

Initially, the cs and rs are assigned zeros in each CRvector. The timestamps of the CRvectors of each twin page are assigned -1 and -2, respectively, although the twin disk pages contain the same data at first. The CS is assigned 1, the RS is assigned zero, and the inv is assigned nil in every CRtable entry, indicating that the initial state is a checkpoint.
C. Incremental Checkpointing and Rollback

At a checkpoint, all pages modified since the last checkpoint by the process being checkpointed have to be flushed onto disk, along with the processor registers. After that, the disk manager increments the CS of process p in the corresponding CRtable entry on disk. Checkpointing of a process is not complete until the corresponding CS is successfully incremented. A dirty page can be written onto disk at any instant before the increment of CS; this is what makes the checkpointing incremental. When a disk page is updated, the CRvector of that page is also updated, with the cs and rs taken from the CS and RS of the corresponding CRtable entry.

In the event of a rollback by process p, the disk manager first inserts a (cs, rs) pair, with its values from CRtable[p].CS and CRtable[p].RS, respectively, and then increments the RS of process p. The RS has to be incremented after the (cs, rs) pair has been inserted. The recovering process cannot be restarted until the RS is successfully incremented. Since repeated insertions of (cs, rs) pairs can be easily detected, disk server failures occurring between the insertion of (cs, rs) and the incrementing of RS can be handled transparently. The recovering process can be restarted on any available processor, with the same pid for efficiency reasons. There is no requirement for undoing disk pages updated since the last checkpoint of process p. As a result, not only is the restart of the recovering process expedited, but the service requests from other clients are also not delayed.
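The ordering constraints above can be summarized in two short routines. In this C sketch, the helper functions are hypothetical stand-ins for the disk manager's stable-storage operations; only the order of the steps is prescribed by the text:

    extern struct entry CRtable[];           /* Fig. 6 */
    void flush_dirty_pages(int p);           /* pages may also trickle out  */
                                             /* earlier: incremental        */
    void save_process_state(int p);          /* registers and PCB           */
    void write_CS_to_disk(int p, int cs);    /* atomic disk-page update     */
    void insert_invalid_pair(int p, int cs, int rs);
    void write_RS_to_disk(int p, int rs);

    void checkpoint_process(int p)
    {
        flush_dirty_pages(p);
        save_process_state(p);
        write_CS_to_disk(p, CRtable[p].CS + 1);   /* the commit point */
    }

    void rollback_process(int p)
    {
        /* mark the failed incarnation invalid first ... */
        insert_invalid_pair(p, CRtable[p].CS, CRtable[p].RS);
        /* ... then increment RS; restart only after this is on disk */
        write_RS_to_disk(p, CRtable[p].RS + 1);
    }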

D. Selection Algorithms for Disk-Page Writes and Fetches

In order to fetch from and write into the correct copy of the twin disk pages, the state of each disk page has to be identified first. Algorithm Identify, shown in Fig. 7, determines the state of each of the twin pages, and is executed on a disk-page fetch and on a disk-page write. On a disk-page fetch, the working version is returned if there is one; otherwise, the checkpoint version is returned. On a disk-page write, the data are written into the disk page containing a noncheckpoint version of the page. To prevent the necessity of an extra disk access on a disk-page write, the disk manager can maintain a copy of all the CRvectors in a buffer. The buffered copies also provide a saving in read operations, since they allow the identification of the disk page to be fetched without reading both pages.

    1   let Xi be the most recent version;
    2   if (CRvector.rs(Xi) == CRtable[CRvector.pid(Xi)].RS)
    3       /* CRvector.rs(Xi) represents the CRvector.rs of page Xi */
    4       if (CRvector.cs(Xi) == CRtable[CRvector.pid(Xi)].CS) {
    5           Xi is a working version;
    6           X(1-i) is a checkpoint version;
    7       }
    8       else {
    9           /* CRvector.cs(Xi) < CRtable[CRvector.pid(Xi)].CS */
    10          Xi is a checkpoint version;
    11          X(1-i) is an out-of-date version;
    12      }
    13  else if (<CRvector.cs(Xi), CRvector.rs(Xi)> is invalid)
    14  {
    15      Xi is an invalid version;
    16      X(1-i) is a checkpoint version;
    17  }
    18  else {
    19      Xi is a checkpoint version;
    20      X(1-i) is an out-of-date version;
    21  }

Fig. 7. Algorithm Identify: identifies the states of the twin pages.
Property 1: If Xi, where i = 0 or 1, is an invalid version of X, then the timestamp of Xi is larger than that of X(1-i). In addition, the rs of Xi is less than the current RS in the corresponding CRtable entry.
Proof: An invalid Xi can only exist after a rollback of the process which updated Xi. The working version is always the most recent version of a page. The property follows since an invalid version was a working version before the rollback, and the RS in the corresponding CRtable entry is incremented on a rollback by a process. □

Property 2: Algorithm Identify correctly recognizes the states of Xi and X(1-i), where i = 0 or 1, on a disk access to page X.

Proof: From Property 1, if Xi is the most recent version, then X(1-i) cannot be an invalid version. If the rs in the CRvector of Xi is equal to the RS in the corresponding CRtable entry, then from Property 1, Xi is not an invalid version either. From Fig. 4, Xi is then either a working version or a checkpoint version. If Xi is a working version, then the cs of its CRvector must be equal to the corresponding CS.

If the rs of its CRvector is less than the corresponding RS, then Xi cannot be a working version; Xi is either an invalid version or a checkpoint version. The invalid list pointed to by the inv in the corresponding CRtable entry is examined. If the pair (cs, rs) is in the invalid list, then Xi is an invalid version; otherwise, Xi is a checkpoint version. □
By correctly recognizing the state of a page, the occasion when a checkpoint is required can be easily identified. In the recoverable memory coherence algorithm described in Section III, before the owner sends a page, a checkpoint of a process is taken if the page has been modified by the process since its last checkpoint. This occasion is indicated by the fact that the page is a working version.
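A direct C rendering of Algorithm Identify is sketched below, assuming the declarations of Figs. 5 and 6 are in scope and that the two CRvectors are already buffered in memory; the function returns the index of the checkpoint copy, which is all the fetch and write selection rules of this section need:

    #include <stdbool.h>
    #include <stddef.h>

    extern struct entry CRtable[];   /* Fig. 6 */

    static bool pair_is_invalid(const struct entry *e, int cs, int rs)
    {
        for (const struct invalid *q = e->inv; q != NULL; q = q->inv)
            if (q->cs == cs && q->rs == rs)
                return true;
        return false;
    }

    /* Returns 0 or 1: the index of the checkpoint version of page X. */
    int identify_checkpoint(const struct vector v[2])
    {
        /* line 1: the copy with the larger timestamp is most recent */
        int i = (v[0].timestamp > v[1].timestamp) ? 0 : 1;
        const struct entry *e = &CRtable[v[i].pid];

        if (v[i].rs == e->RS) {                    /* lines 2-12  */
            if (v[i].cs == e->CS)
                return 1 - i;   /* Xi is working, X(1-i) is the checkpoint */
            return i;           /* Xi is the checkpoint                    */
        }
        if (pair_is_invalid(e, v[i].cs, v[i].rs))  /* lines 13-17 */
            return 1 - i;       /* Xi is invalid, X(1-i) is the checkpoint */
        return i;               /* lines 18-21 */
    }

A disk-page write always targets the noncheckpoint copy, 1 - identify_checkpoint(v); a fetch returns that copy when Identify classifies it as a working version, and the checkpoint copy otherwise.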
E. Systems with Multiple Servers

In a system with multiple disk servers, each disk manager on a server processor maintains a CRtable on disk. Checkpointing of a process is complete after all the servers successfully update the corresponding CS on disk. This can be simply implemented with a standard two-phase protocol among the servers. On a process rollback, all the disk managers insert the invalid (cs, rs) pair and increment the corresponding RS on disk, again following a two-phase protocol. Only the CRtable is replicated on the disk of each server.
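A sketch of the two-phase CS update across servers follows; the message primitives are assumptions (the paper only states that a standard two-phase protocol is used), and the same shape applies to the RS update on rollback:

    int send_prepare(int server, int p);  /* server stages CS+1 on disk; returns vote */
    void send_commit(int server, int p);  /* server makes the staged CS current       */
    void send_abort(int server, int p);   /* server discards the staged CS            */

    /* Returns 1 when the checkpoint of process p is durable on all servers. */
    int commit_checkpoint(int p, int nservers)
    {
        for (int s = 0; s < nservers; s++)       /* phase 1: prepare */
            if (!send_prepare(s, p)) {
                for (int t = 0; t < s; t++)      /* back out prepared servers */
                    send_abort(t, p);
                return 0;
            }
        for (int s = 0; s < nservers; s++)       /* phase 2: commit */
            send_commit(s, p);
        return 1;
    }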


It is possible that the server processor may fail while executing a process in the shared virtual memory. In the event of a server failure, the disk space backing up the shared virtual memory is reliably maintained by the twin-page approach. If service is to resume, the server must resume fault-free execution after a rollback. After restart, the disk manager can resume service to clients after all the disk managers in the other servers successfully update their corresponding CRtable entries on disk for the affected processes.
F. Disk Access Performance

The twin-page approach to incremental checkpointing and rollback recovery has the potential of enhancing the speed of rollback recovery, without using disk accesses to invalidate disk pages at recovery. However, in order to check whether the (cs, rs) pair of a disk page is invalid, a list traversal may be required. It is important that the average number of (cs, rs) pairs visited per disk access not be large, since Algorithm Identify might have to be executed on every disk access.

If there is no failure in the system, then only lines 1, 2, and 4 of Algorithm Identify are executed, with no visit to the invalid list required. After a process p rolls back and restarts execution, all the disk pages modified by p will have an rs less than the RS of process p, and a visit to the invalid list is performed on the first subsequent disk access. In the worst case, the entire invalid list has to be visited. In fact, the actual number of (cs, rs) pairs visited by Algorithm Identify is equal to the difference between the rs of the most recent version of the page and the current RS.

In order to understand the average number of (cs, rs) pairs visited on a disk access under different numbers of rollbacks by a process, we simulated a single process accessing 4000 pages at random. Total disk accesses of 20 000 and 40 000, with either 25% or 50% being disk writes, were simulated. Figs. 8 and 9 show the average number of (cs, rs) pairs visited per disk access versus the incarnation length, the number of disk accesses before a process performs a rollback. The points in Figs. 8 and 9 provide results based on Algorithm Identify and results based on an optimized algorithm, with the nonoptimized algorithm visiting the entire invalid list. Since the pages not modified by the process will have the initial rs, the invalid list will always be visited after the process experiences a rollback. The optimized algorithm checks first whether the page has never been modified; no visit is performed if this condition is true.

Fig. 8. Average number of (cs, rs) pairs visited per access versus incarnation length (number of disk accesses), for 20 000 total accesses; curves for 25% and 50% writes, nonoptimized and optimized.

Fig. 9. The same measurement for 40 000 total accesses.

Although the average number of (cs, rs) pairs visited is considerably reduced by the optimization in our simulation, the reduction may not be as large in real applications. Due to the random-access assumption, the never-modified pages may be referenced more often by the simulated process than by a real process, which typically exhibits reference locality. Also, the demand for optimization disappears if all the pids of the disk pages are initialized to process dummy, a fictitious process which never fails. In other words, process dummy owns all the never-modified disk pages.
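The optimized test amounts to a single comparison before any list traversal. A sketch follows (the predicate name is an assumption; it exploits the initialization of Section IV-B, under which an unmodified copy still carries timestamp -1 or -2):

    /* True if this disk copy has never been written since initialization,
       in which case the invalid list need not be searched at all. */
    static int never_modified(const struct vector *v)
    {
        return v->timestamp < 0;   /* still the initial -1 or -2 */
    }

Initializing every pid to a dummy process identifier whose CRtable entry never changes achieves the same effect without the extra test.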
Incremental checkpointing without an explicit undo is accomplished at the cost of disk space and time. Since both contiguous disk pages are fetched, extra transfer time is required. This extra transfer time, however, is much smaller than the seek time, which represents the largest part of the time for a random disk-page access. The execution time of Algorithm Identify is a small number of comparisons, as seen from Fig. 7, plus a small number of visits to the (cs, rs) pairs, fewer than four based on the simulation shown in Figs. 8 and 9. Compared to the disk access time, which is on the order of a few milliseconds, the execution time of Algorithm Identify is negligible.

Disk space is required for keeping the CRvector, the CRtable, and the twin pages. The overhead for storing the twin pages is equivalent to the size of the writable pages in the shared virtual address space. Since the invalid lists in the CRtable are monotonically increasing in size, mechanisms to reduce them may be required if failure rates are high. A cleanup procedure executing in the background can be invoked when the size of the invalid lists reaches a certain threshold. To implement this procedure, a color bit is associated with each (cs, rs) pair, and a color flag, also a single bit, is maintained on disk. When a new (cs, rs) pair is inserted, the color bit is assigned the value of the color flag. When the cleanup procedure is invoked, the color flag is changed. After that, the timestamps of all the invalid pages indicated by the (cs, rs) pairs whose color bits differ from the current color flag are changed to -1. Upon completion, these (cs, rs) pairs can be deleted. Although a sequential access is required to finish the cleanup, the execution of the procedure does not interfere with normal processing in accessing a twin page; new (cs, rs) pairs can be inserted as well. Since it is executed in the background and, in most applications where failure rates are not high, is rarely invoked, no significant performance penalty will be incurred.
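The cleanup can be sketched as follows, assuming struct invalid (Fig. 6) is extended with an int color field, the lists are manipulated through an in-memory copy, and a helper rewrites the timestamps of the affected disk pages; all names beyond those of Fig. 6 are assumptions:

    #include <stdlib.h>

    extern int color_flag;                         /* single bit, kept on disk   */
    void reset_timestamps(int p, int cs, int rs);  /* set matching invalid pages */
                                                   /* to timestamp -1            */
    void cleanup_invalid_lists(struct entry table[], int nproc)
    {
        color_flag ^= 1;   /* pairs inserted from now on carry the new color */
        for (int p = 0; p < nproc; p++) {
            struct invalid **pp = &table[p].inv;
            while (*pp != NULL) {
                if ((*pp)->color != color_flag) {  /* old-color pair */
                    struct invalid *dead = *pp;
                    reset_timestamps(p, dead->cs, dead->rs);
                    *pp = dead->inv;               /* unlink and delete */
                    free(dead);
                } else
                    pp = &(*pp)->inv;
            }
        }
    }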
V. SUMMARY

In this paper, a user-transparent checkpointing and recovery scheme and a twin-page disk storage management technique have been presented for the design of a recoverable distributed shared virtual memory system. The checkpointing recovery scheme, maintaining a single shared memory state, prevents rollback propagation and simplifies the management of the shared virtual memory. Memory coherence is achieved after restart from a processor failure despite the loss of key data structures for coherence maintenance in the node which fails.

The twin-page disk management reduces recovery overhead by allowing incremental checkpointing without an explicit undo. Memory pages can be written to the disk at any time before checkpointing, and no explicit disk I/O activities are required at rollback to restore the disk state to a previous checkpoint. This is accomplished at the cost of disk space for maintaining two versions of a page as well as the CRtable and CRvector. The disk management routine, Algorithm Identify, recognizes the correct page to fetch from and identifies the correct page to overwrite. The execution overhead of Algorithm Identify at the disk server, as shown in Section IV, is insignificant compared to the disk access time.
APPENDIX
RECOVERABLE MEMORY COHERENCE ALGORITHMS

Figs. 10-15 show the recoverable memory coherence algorithms based on the MonitorCentralManager protocol [3]. An additional bit, null, is associated with each entry of the Info, indicating the validity of the entry. The ReconstructInfo server algorithm, executed by the centralized manager, handles both local and remote read-fault and write-fault requests after restart. The LocateOwnership server algorithm, executed by a nonmanager node, responds to the message broadcast by the centralized manager.

    Lock(PTable[pp].lock);
    ask manager for read access to pp;
    if (I am NOT manager) {
        receive a copy of pp;
        send confirmation to manager;
    };
    PTable[pp].access = read;
    Unlock(PTable[pp].lock);

Fig. 10. Recoverable read fault handler algorithm.

    Lock(PTable[pp].lock);
    if (I am owner) {
        if (PTable[pp].access == nil)
            fetch page pp from disk;
        if (pp has been modified by a process since its last checkpoint)
            initiate a checkpoint of the last process writing to pp
                and wait until completion;
        PTable[pp].access = read;
        send a copy of pp to RequestNode;
    };
    Unlock(PTable[pp].lock);
    if (I am manager) {
        Lock(Info[pp].lock);
        if (Info[pp].null == Yes)
            ReconstructInfo(pp, read, RequestNode);
        else {
            Info[pp].copyset = Info[pp].copyset ∪ {RequestNode};
            if (I am RequestNode)
                receive page pp from Info[pp].owner;
            else {
                ask Info[pp].owner to send page pp to RequestNode;
                receive confirmation from RequestNode;
            };
        };
        Unlock(Info[pp].lock);
    };

Fig. 11. Recoverable read fault server algorithm.

    Lock(PTable[pp].lock);
    ask manager for write access to pp;
    if (I am NOT manager) {
        receive a copy of pp;
        send confirmation to manager;
    };
    PTable[pp].access = write;
    Unlock(PTable[pp].lock);

Fig. 12. Recoverable write fault handler algorithm.

    Lock(PTable[pp].lock);
    if (I am owner) {
        if (PTable[pp].access == nil)
            fetch page pp from disk;
        if (pp has been modified by a process since its last checkpoint)
            initiate a checkpoint of the last process writing to pp
                and wait until completion;
        send a copy of pp to RequestNode;
        PTable[pp].access = nil;
    };
    Unlock(PTable[pp].lock);
    if (I am manager) {
        Lock(Info[pp].lock);
        if (Info[pp].null == Yes)
            ReconstructInfo(pp, write, RequestNode);
        else {
            Invalidate(pp, Info[pp].copyset);
            if (I am RequestNode)
                receive page pp from Info[pp].owner;
            else {
                ask Info[pp].owner to send page pp to RequestNode;
                receive confirmation from RequestNode;
            };
        };
        Info[pp].copyset = {};
        Info[pp].owner = RequestNode;
        Unlock(Info[pp].lock);
    };

Fig. 13. Recoverable write fault server algorithm.

    if (I am manager) {
        broadcast LocateOwnership(pp, req, ReqProc);
        receive all the Copyship(pp) messages;
        receive Ownership(pp), if any;
        if (req == read) {
            Info[pp].copyset = {ReqProc} ∪
                {processors indicated in a Copyship(pp) with a Yes};
            if (there is an Ownership(pp) returned)
                Info[pp].owner = the processor that responded;
            else
                Info[pp].owner = ManagerNode;
            if (I am ReqProc) {
                if (Info[pp].owner == ManagerNode)
                    fetch page pp from disk;
                else
                    receive page pp from Info[pp].owner;
            }
            else
                receive confirmation from ReqProc;
        }
        else {  /* req is a write fault */
            if (I am ReqProc) {
                if (there is an Ownership(pp) returned)
                    receive page pp from this owner;
                else
                    fetch page pp from disk;
            }
            else
                receive confirmation from ReqProc;
        };
        Info[pp].null = No;
    };

Fig. 14. ReconstructInfo server algorithm.

    if (I am NOT ReqProc) {
        Lock(PTable[pp].lock);
        if (PTable[pp].access == nil)
            send Copyship(pp) to ManagerNode with a No;
        else if (PTable[pp].access == read) {
            send Copyship(pp) to ManagerNode with a Yes;
            if (req == write)
                PTable[pp].access = nil;
        }
        else {
            /* PTable[pp].access == write */
            send a copy of pp to ReqProc;
            send Ownership(pp) to ManagerNode;
            if (req == read)
                PTable[pp].access = read;
            else
                PTable[pp].access = nil;
        };
        Unlock(PTable[pp].lock);
    };

Fig. 15. LocateOwnership server algorithm.

ACKNOWLEDGMENT

The authors wish to express their sincere appreciation to the referees for their detailed and helpful comments. The authors would also like to thank A. Gupta and J.-S. Long for their discussions regarding this research.

REFERENCES
[1] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," in Proc. 5th ACM Symp. Principles Distributed Comput., 1986, pp. 229-239.
[2] K. Li, "IVY: A shared virtual memory system for parallel computing," in Proc. 1988 Int. Conf. Parallel Processing, 1988, pp. 94-101.
[3] K. Li, "Shared virtual memory on loosely coupled multiprocessors," Ph.D. dissertation, Tech. Rep. YALEU/DCS/RR-492, Dep. Comput. Sci., Yale Univ., Sept. 1986.
[4] R. Bisiani, A. Nowatzyk, and M. Ravishankar, "Coherent shared memory on a distributed memory machine," in Proc. 1989 Int. Conf. Parallel Processing, Vol. I: Architecture, 1989, pp. I-133-I-141.
[5] U. Ramachandran, M. Ahamad, and M. Y. A. Khalidi, "Coherence of distributed shared memory: Unifying synchronization and data transfer," in Proc. 1989 Int. Conf. Parallel Processing, Vol. II: Software, 1989, pp. II-160-II-169.
[6] C. P. Thacker, L. C. Stewart, and E. H. Satterthwaite, Jr., "Firefly: A multiprocessor workstation," IEEE Trans. Comput., vol. 37, pp. 909-920, Aug. 1988.
[7] Balance 8000 Technical Summary, Sequent Computer Systems, Inc., Nov. 1984.
[8] G. F. Pfister, W. C. Brantley, et al., "The IBM research parallel processor prototype (RP3): Introduction and architecture," in Proc. 1985 Int. Conf. Parallel Processing, 1985, pp. 764-770.
[9] D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, "Cedar: A large scale multiprocessor," in Proc. 1983 Int. Conf. Parallel Processing, 1983, pp. 524-529.
[10] K. H. Kim, "Programmer-transparent coordination of recovering concurrent processes: Philosophy and rules for efficient implementation," IEEE Trans. Software Eng., vol. 14, pp. 810-821, June 1988.
[11] Y.-H. Lee and K. G. Shin, "Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks," IEEE Trans. Comput., vol. C-33, pp. 113-124, Feb. 1984.
[12] J. Kent and H. Garcia-Molina, "Optimizing shadow recovery algorithms," IEEE Trans. Software Eng., vol. 14, pp. 155-168, Feb. 1988.
[13] R. A. Lorie, "Physical integrity in a large segmented database," ACM Trans. Database Syst., vol. 2, pp. 91-104, Mar. 1977.
[14] A. Reuter, "A fast transaction-oriented logging scheme for UNDO recovery," IEEE Trans. Software Eng., vol. SE-6, pp. 348-356, July 1980.
[15] S. M. Thatte, "Persistent memory: A storage architecture for object-oriented database systems," in Proc. 1986 Int. Workshop Object-Oriented Database Syst., 1986, pp. 148-159.
[16] R. D. Schlichting and F. B. Schneider, "Fail-stop processors: An approach to designing fault-tolerant computing systems," ACM Trans. Comput. Syst., vol. 1, pp. 222-238, Aug. 1983.
[17] A. Chang and M. F. Mergen, "801 Storage: Architecture and programming," ACM Trans. Comput. Syst., vol. 6, pp. 28-50, Feb. 1988.
[18] A. Agarwal and A. Gupta, "Memory-reference characteristics of multiprocessor applications under MACH," in Proc. 1988 ACM SIGMETRICS Conf. Measurement Modeling Comput. Syst., 1988, pp. 215-225.
[19] S. J. Eggers and R. H. Katz, "A characterization of sharing in parallel programs and its application to coherency protocol evaluation," in Proc. 15th Annu. Int. Symp. Comput. Architecture, 1988, pp. 373-382.
[20] F. Darema-Rogers, G. F. Pfister, and K. So, "Memory access patterns of parallel scientific programs," in Proc. 1987 ACM SIGMETRICS Conf. Measurement Modeling Comput. Syst., 1987, pp. 46-58.

Kun-Lung Wu (S'85) received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1982 and the M.S. degree in computer science from the University of Illinois at Urbana-Champaign in 1986. From August 1982 to August 1984, he was on military service at Kaohsiung, Taiwan. He is currently working toward the Ph.D. degree in computer science. Since 1985, he has been a Research Assistant at the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. In the summer of 1986, he also worked as a consultant in the Database Systems Branch of the Artificial Intelligence Laboratory, Texas Instruments Inc., Dallas, TX. His research interests include parallel and distributed processing, database transaction management, fault-tolerant computing, and computer architecture. Mr. Wu is a student member of the Association for Computing Machinery and a member of Phi Kappa Phi.

W. Kent Fuchs (S'80-M'85) received the B.S.E. degree in electrical engineering from Duke University, Durham, NC, in 1977 and the M.S. degree in electrical engineering from the University of Illinois, Urbana, in 1982. In 1984 he received the M.Div. degree from Trinity Evangelical Divinity School in Deerfield, IL, and in 1985 the Ph.D. degree in electrical engineering from the University of Illinois. He is currently an Associate Professor in the Departments of Electrical and Computer Engineering and Computer Science, and the Coordinated Science Laboratory, University of Illinois. He joined the University of Illinois as an Assistant Professor in 1985 and was promoted to Associate Professor in 1989. His research interests include all aspects of VLSI system design with emphasis on reliable computing. Dr. Fuchs's recent awards include appointment as Fellow in the Center for Advanced Studies, University of Illinois, 1989; the Xerox Faculty Award for Excellence in Research, College of Engineering, University of Illinois, 1987; the Digital Equipment Corporation Incentives for Excellence Faculty Award, 1986-1988; the Best Paper Award, IEEE/ACM Design Automation Conference (DAC) 1986, simulation and test category; and nomination for the Best Paper Award, DAC 1987, simulation and test category.
