IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 4, APRIL 1990
Abstract-This paper examines the problem of rollback recovery in distributed shared virtual memory environments, in which the shared memory is implemented in software in a loosely coupled distributed multicomputer system. A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory. The checkpointing scheme is unique in that it can be integrated with the memory coherence protocol for managing the shared virtual memory. The twin-page disk design allows checkpointing to proceed in an incremental fashion without an explicit undo at the time of recovery. The recoverable distributed shared virtual memory allows the system to restart computation from a checkpoint without a global restart.

Index Terms-Distributed shared virtual memory, memory coherence, paging disk management, rollback recovery.

I. INTRODUCTION

…tributed shared virtual memory be recoverable without a global restart. This is particularly important for large, complex applications which require long computation time. Although various memory coherence algorithms have been proposed [1], [4], [5], very few existing checkpointing and recovery techniques are appropriate for distributed shared virtual memory environments. The difficulty in checkpointing and recovery is that a global restart may be required at the point of recovery since each processor maintains part of the shared address space.

This paper presents a user-transparent checkpointing recovery scheme and a new paging disk management technique for implementing recoverable distributed shared virtual memory, allowing the system to restart computation from a checkpoint without a global restart. The checkpointing process can be integrated with the memory coherence protocol. With a consistent checkpoint state maintained on the paging store, after a single node failure (a processor and its associated memory),
…point whenever one of the pages modified by p since p's last checkpoint is to be read by another process q. A modified page sent to another process on a write page fault is treated as if the process is to read this dirty page. If no other process reads any of the pages modified since process p's last checkpoint, then p does not checkpoint, to prevent rollback propagation.

Fig. 3. Protocol for handling a read page fault with checkpointing. (Checkpointing step: if page pp has been modified since the last checkpoint, checkpoint, then send page pp.)

B. Implementation of Memory Coherence Management with Checkpointing

The checkpointing scheme of this paper can be integrated with the shared virtual memory management protocol. Since process p invalidates other copies before writing into a page, the event that a page modified by p since its last checkpoint is to be read by a different process q executing on another processor is manifested by a page fault generated by q. Before the requested page is sent, if it has been modified by p since p's last checkpoint, then a checkpoint of p is initiated.

However, if processes p and q are scheduled on the same processor, no page fault may be generated on such an occasion. In this case, the system-kernel implementation of process synchronization can be modified to implement the checkpointing. For example, a lock variable is usually used to protect shared data and to enforce mutual exclusion. After process q obtains the lock and before it accesses the protected data, a checkpoint of process p is initiated if process p has modified the data since p's last checkpoint. An alternative approach is to generate a page fault if process q accesses a memory-resident page modified by process p since p's last checkpoint. Such an alternative requires hardware support similar to the 801 storage architecture [17], in which the memory management unit detects access violations by a database transaction process at the page level.

Two pieces of information concerning a page are crucial in the memory coherence protocol: owner and copyset. The owner indicates which processor currently owns the page and the copyset maintains the identifiers of the processors which have read-only copies of the page. Two classes of algorithms based on the maintenance of page ownerships have been proposed for memory coherence: centralized manager and distributed managers [1]. The monitor-like centralized manager algorithm is used to demonstrate the integration of our checkpointing recovery approach.

In the monitor-like centralized manager algorithm [1], the centralized manager maintains both the owner and copyset in a data structure called Info. Every processor also keeps its own local page table, indicating its access right to each page in the shared virtual address space. To handle a page fault, the faulting processor sends a request to the centralized manager. On a write page fault, the centralized manager sends invalidation messages to all the processors having a read-only copy of the page, sets the copyset to null, and asks the owner of the page to send a copy to the requesting processor. On a read page fault, the centralized manager includes the identifier of the requesting processor into the copyset of the page, and asks the owner of the page to send a copy to the requesting processor. After sending the page, the owner relinquishes both the read and write access privileges for the page on serving a write page fault, ensuring only a single owner for each page. By contrast, on serving a read page fault, the owner retains the read access privilege while relinquishing the write access privilege for the page.

To handle a page fault with integrated checkpointing, the owner simply takes a checkpoint before sending the page if it has been modified since the last checkpoint. The enhanced protocol for handling a read page fault with the checkpointing scheme embedded is illustrated in Fig. 3. Compared to the original protocol [1], the protocol of Fig. 3 differs only in step 3, where a checkpoint is taken if the requested page has been modified since the last checkpoint.

It should be noted that checkpoints can be initiated, if desired, in addition to those established by the memory coherence protocol. Additional checkpoints can be initiated by the operating system or application. The memory coherence-based checkpointing scheme simply guarantees that rollback propagation will not occur.

C. Memory Coherence Management and Rollback Recovery

1) Transient Failures: If a processor and its associated physical memory experience a transient failure, then it may be possible to restart computation on that processor. After a transient failure of the processor, the contents of its local physical memory are considered to be invalid. As a result, the key data structures for memory coherence, owner and copyset, may be lost if the processor executes the centralized manager routine. We demonstrate the procedures for reconstruction of the data structures on demand.

Case 1 (The restarting processor does not execute the centralized manager): Initially, all the memory pages are invalid and will be fetched on demand after restart, i.e., requests will be sent to the centralized manager and subsequently forwarded to the owner of the page. When the restarting processor is the owner of the requested page, the page is fetched from disk.

Case 2 (The restarting processor also executes the centralized manager): In this case, the Info (containing owner and copyset) needs to be reconstructed by the restarting processor. Upon receiving a page-fault request after restart, the restarting processor executes a ReconstructInfo algorithm, which broadcasts a special message, LocateOwnership, to ask the owner to send a copy to the requesting processor, and then reconstructs the Info accordingly. On receiving a LocateOwnership message, if the processor is the owner, a copy of the requested page is sent to the requesting processor and an Ownership message is sent to the restarting processor; otherwise, a Copyship message is sent to the restarting processor, indicating whether this processor has a read-only copy of the page. If no processor indicates itself as the owner of the page, then the restarting processor assumes the ownership and the page is fetched from disk. For practical implementation, each of the LocateOwnership exchanges can handle several pages simultaneously.

2) Permanent Failures: Since checkpoints are maintained on disk, it is not possible for the system to survive a permanent server failure. However, in the event of a permanent failure by a client processor, the system is able to recover by restarting processes previously executing on the faulty processor on a fault-free processor. If the faulty processor is not the centralized manager node, the centralized manager node will become the new owner of the pages previously owned by the faulty processor. On serving a page-fault request, the centralized manager fetches the requested page from disk if this page was previously owned by the faulty processor. If the faulty processor happens to be the processor which executed the centralized manager, then a new fault-free processor can be arbitrarily chosen to execute the centralized manager. This new processor, chosen to execute the centralized manager, uses the ReconstructInfo algorithm to find the owner of the requested page and reconstruct Info, similar to the Case 2 scenario discussed above.

In the implementation, every page has an owner, but this does not imply that every page must exist in the main memory of some processor. When a page is requested and is not currently in the memory of the owner processor, it is fetched from disk. In addition, when the recoverable memory coherence algorithms are executed, every processor except the permanently faulty ones, whose identifiers are known to the centralized manager, is functioning. With the exchanges of LocateOwnership messages between the centralized manager and every other node, a failure during recovery can also be properly handled. The recoverable page-fault handling routine based on Li's MonitorCentralManager algorithm [3] is presented in the Appendix.

D. Synchronization

In our checkpointing and rollback scheme, a process that rolls back may read a different value from a memory location than it read before the rollback, if another process updates the same memory location after the first read. If deterministic ordering is not enforced, then such a scenario is similar to that which would occur due to nondeterministic read/write accesses to a page in normal execution, and may be considered correct. However, if the ordering of accesses to a shared variable needs to be deterministic, then the application programs must employ specific synchronization protocols to enforce the ordering in our approach to checkpointing and rollback. If these synchronization protocols are implemented by atomic read/write accesses to shared synchronization variables in the shared virtual address space, then the checkpointing and rollback will maintain the deterministic ordering. Since memory coherence is always maintained, necessary checkpoints will be issued to ensure the correct ordering in the presence of rollback recovery.

As an example, suppose process p has to read the shared variable VAR before process q can write to it. This must be implemented in such a way that process p writes to a synchronization variable, syn, after reading VAR, and process q reads the modified syn before it writes to VAR. When process q reads the modified syn, process p will be checkpointed. As a result, the correct ordering of accesses to VAR will be preserved even if process p rolls back.

E. Checkpointing Frequency

The frequency of checkpointing in this approach is sensitive to the patterns of data sharing between processes and the size of a page, the unit of memory coherence maintenance. The checkpointing frequency is small if there exists reference locality on the shared pages. Frequent checkpointing may occur if there exists intensive read/write sharing on the same page. However, this is not unique to our scheme. Frequent interprocess communication may also result in significant overhead in other checkpointing alternatives, since process interactions during normal processing are still required to be recorded. Frequent checkpointing may be a tradeoff for simple recovery, no requirement of recording process interactions during normal processing, and simple shared virtual memory management. Coincidentally, good parallel algorithm designs typically also minimize interprocess communication.

Previous studies of shared memory reference characteristics based on parallel program traces indicate that typically only a small fraction of memory references are to shared data [18]-[20]. For instance, the ratio of the number of write references to shared data over the total memory references ranges only from 0.003 to 0.019 in the four traces studied by Eggers and Katz [19]. Moreover, there is temporal locality as well as processor locality for shared data references [18], [19], indicating that for large time periods shared memory addresses can be considered as private and no traffic is generated in maintaining memory coherence. These measurements suggest that the memory coherence-based checkpointing scheme will not result in excessive checkpoints due to shared memory references in the traces analyzed.

Checkpointing frequency is also dependent on the page size. In distributed shared virtual memory, the unit of memory coherence maintenance (a page) is typically larger than that of actual sharing between processes (e.g., as small as a word of 4 bytes). Checkpoints may potentially be issued due to external page faults on modified pages but not due to interprocess communication. As an example, suppose page A in processor i contains variables x, y, and z. Page A, with only x and y updated, is requested by another processor j due to a read/write miss on variable z. A checkpoint of the process running on processor i is required since page A is modified even though variable z is not. However, the checkpoint in
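The page-A scenario can be sketched in a few lines. The model below is our own (the paper gives no code for it); names such as `Owner` and `serve_remote_fault` are hypothetical, and a single dirty flag per page stands in for the real dirty-page bookkeeping. It shows that a remote fault on an unmodified variable still forces a checkpoint when the enclosing page is dirty, and that a second fault on the now-clean page does not.

```python
# Minimal sketch (our names, not the paper's): the owner checkpoints before
# servicing a remote fault on any page modified since its last checkpoint.

class Owner:
    def __init__(self):
        self.checkpoints = 0
        self.dirty = set()          # pages modified since the last checkpoint

    def write(self, page, var):
        self.dirty.add(page)        # any write marks the whole page dirty

    def serve_remote_fault(self, page):
        """Another processor faulted on `page`; checkpoint first if dirty."""
        if page in self.dirty:
            self.checkpoints += 1   # take a checkpoint before sending
            self.dirty.clear()      # the checkpoint covers all dirty pages
        # ... then send the page to the requesting processor ...

p = Owner()
p.write("A", "x")
p.write("A", "y")
p.serve_remote_fault("A")   # processor j faults on z of page A
assert p.checkpoints == 1   # checkpoint forced, although z was never written
p.serve_remote_fault("A")   # page A unmodified since that checkpoint
assert p.checkpoints == 1   # no new checkpoint, hence no rollback propagation
```

The second assertion is the point of the scheme: once the dirty set is cleared by a checkpoint, further remote reads are free.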
IV. TWIN-PAGE DISK STORAGE FOR INCREMENTAL CHECKPOINTING

A. Approach

This section describes a twin-page disk storage management technique for implementing incremental checkpointing and rapid rollback. A simplified system model is first described, in which the distributed shared virtual memory system is implemented on a number of diskless workstations and the paging store is provided by a separate disk server processor.

Two physically contiguous disk pages are allocated for each writable page in the shared virtual address space. When a page is fetched from disk, both disk pages are transferred and only one of them is retained, according to a simple selection algorithm. Dirty memory pages can be written onto disk at any instant. Thus, no modification to the virtual memory replacement strategy is required. In the event of a process restart after a rollback, no explicit undo is required; the undo is implicitly performed by not retaining the invalid data when fetching a page from disk.

Because of the incremental checkpointing and no explicit undo at rollback, there are four possible versions of data in each of the twin disk pages: working, invalid, out-of-date, and checkpoint. A working version represents the data written onto disk during the current checkpoint interval; an invalid version represents the data updated on disk by a process after a checkpoint and before a failure; an out-of-date version was the checkpoint version of the page in a previous checkpoint interval. At any instant, one of the twin disk pages maintains a checkpoint of the page; for example, if X0 and X1 represent copies of the data stored on the twin disk pages for page X, either X0 or X1 must be a checkpoint version of X. This is achieved by writing into the disk page containing a noncheckpoint version when a page is updated.

After a checkpoint, the previous working version becomes the checkpoint version and the previous checkpoint becomes an out-of-date version. After a rollback, the previous working version becomes invalid. Fig. 4 shows the state transitions of X0 and X1. Initially, every twin page starts from either state α or δ and remains in that state until the page is written to disk; then the state changes to β or ε, respectively. Notice that, in Fig. 4, checkpoint and rollback only result in transitions from states β and ε. States α, δ, γ, and ζ all remain unchanged on a checkpoint or rollback.

B. Data Structures

To implement incremental checkpointing and rapid rollback, two data structures are used: the checkpoint recovery vector, CRvector, and the checkpoint recovery sequence table, CRtable.

    struct vector {
        int pid;        /* process identifier */
        int timestamp;  /* timestamp for disk write */
        int cs;         /* checkpoint sequence number */
        int rs;         /* recovery sequence number */
    } CRvector;

    Fig. 5. Data structure for CRvector.

    struct entry {
        int CS;               /* current checkpoint sequence number */
        int RS;               /* current recovery sequence number */
        struct invalid *inv;  /* pointer to invalid list */
    } CRtable[Nprocess];
    /* Nprocess is the total number of processes */

    struct invalid {          /* invalid <cs, rs> pair */
        int cs;
        int rs;
        struct invalid *inv;
    };

    Fig. 6. Data structure for CRtable.

The CRvector, as described in Fig. 5, is stored at the header of each disk page. The pid records the identifier of the last process updating this page; the timestamp indicates the time (server local time) when the disk page was updated; the cs records the checkpoint sequence number; and the rs records the recovery sequence number of the last process that updated the page. The CRtable, as shown in Fig. 6, is an array of entrys, with one entry for each process. The CRtable is used to determine the state of a twin page. When a new process is dynamically created, a new entry in the CRtable is also created; however, when a process is destroyed in the middle of the execution, its corresponding entry remains in the CRtable. An entry contains the current checkpoint sequence number, CS; the current recovery sequence number, RS; and a header, inv, pointing to a list of invalid (cs, rs) pairs of a process. A monotonically increasing timer, Time, which must be reliable and capable of surviving a server failure, is used by the disk server to generate the timestamp for each disk-page write.

The timestamp in the CRvector is used to distinguish
which of the twin disk pages contains the most recent data. The CS is incremented on a checkpoint of a process while the RS is incremented on a rollback of a process. The cs in the CRvector, by comparing it to the corresponding CS in the CRtable, is used to separate a working version from a checkpoint version of a twin page. With the use of rs, which records the incarnation of the process that modified that disk page, the invalid data can be identified. Disk pages with the same (cs, rs) represent the shared memory pages modified by an incarnation of a process in a particular checkpoint interval.

Initially, the cs and rs are assigned zeros for each CRvector. The timestamps of the CRvectors for each twin page are assigned -1 and -2, respectively, although the twin disk pages contain the same data at first. The CS is assigned 1, the RS is assigned zero, and the inv is assigned nil in every CRtable entry, indicating that the initial state is a checkpoint.

C. Incremental Checkpointing and Rollback

At a checkpoint, all pages modified since the last checkpoint by the process being checkpointed have to be flushed onto disk along with the processor registers. After that, the disk manager increments the CS of process p in the corresponding CRtable entry on disk. Checkpointing of a process is not complete until the corresponding CS is successfully incremented. A dirty page can be written onto disk at any instant before the increment of CS; this is what makes the checkpointing incremental. When a disk page is updated, the CRvector of that page is also updated, with the cs and rs taken from the CS and RS of the corresponding CRtable entry.

In the event of a rollback by process p, the disk manager first inserts a (cs, rs) pair, with its values from CRtable[p].CS and CRtable[p].RS, respectively, and then increments the RS of process p. The RS has to be incremented after the (cs, rs) has been inserted. The recovering process cannot be restarted until the RS is successfully incremented. Since repeated insertions of (cs, rs)'s can be easily detected, disk server failures occurring between the insertion of (cs, rs) and the incrementing of RS can be handled transparently. The recovering process can be restarted on any available processor, with the same pid for efficiency reasons. There is no requirement for undoing disk pages updated since the last checkpoint of process p. As a result, not only is the restart of the recovering process expedited, but also the service requests from other clients are not delayed.

D. Selection Algorithms for Disk-Page Writes and Fetches

In order to fetch from and write into the correct copy of the twin disk pages, the state of each disk page has to be identified first. Algorithm Identify, as shown in Fig. 7, determines the state of each of the twin pages, and is executed on a disk-page fetch and disk-page write. On a disk-page fetch, the working version is returned (if there is one); otherwise the checkpoint version is returned. On a disk-page write, the data are written into the disk page containing a noncheckpoint version of the page. To prevent the necessity of an extra disk access on a disk-page write, the disk manager can maintain a copy of all the CRvectors in a buffer. The buffered copies will also provide a saving in read operations since they allow the identification of the disk page to be fetched without reading both pages.

Property 1: If Xi, where i = 0 or 1, is an invalid version of X, then the timestamp of Xi is larger than that of X1-i. In addition, the rs of Xi is less than the current RS in the corresponding CRtable entry.

Proof: An invalid Xi can only exist after a rollback of the process which updated Xi. The working version is always the most recent version of a page. The property follows since an invalid version was a working version before the rollback and the RS in the corresponding CRtable is incremented on a rollback by a process. □

Property 2: Algorithm Identify correctly recognizes the states of Xi and X1-i, where i = 0 or 1, on a disk access to page X.

Proof: From Property 1, if Xi is the most recent version, then X1-i cannot be an invalid version. If the rs in the CRvector of Xi is equal to the RS in the corresponding
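Fig. 7 itself is not reproduced in this excerpt, so the following is our own reconstruction of the selection logic from the text, with our names (`state`, `identify`) and dictionaries standing in for the CRvector and CRtable entry. A twin is invalid if its (cs, rs) pair appears in the process's invalid list, working if its (cs, rs) matches the current (CS, RS), and otherwise committed; among committed twins the one with the larger timestamp is the checkpoint, which is never overwritten.

```python
# Sketch (ours) of the twin-page selection on a disk access. Each twin x
# carries {"timestamp", "cs", "rs"}; entry is {"CS", "RS", "invalid": [...]}.

def state(v, entry):
    """Classify one twin: 'invalid', 'working', or 'committed'
    (a committed version is the checkpoint or an out-of-date copy)."""
    if (v["cs"], v["rs"]) in entry["invalid"]:
        return "invalid"                 # written, then rolled back
    if v["cs"] == entry["CS"] and v["rs"] == entry["RS"]:
        return "working"                 # current checkpoint interval
    return "committed"

def identify(x0, x1, entry):
    """Return (twin to fetch, twin to overwrite) for page X."""
    pairs = [(x0, state(x0, entry)), (x1, state(x1, entry))]
    # Invariant of the scheme: at least one twin is committed, and the
    # committed twin with the larger timestamp is the checkpoint.
    committed = [v for v, s in pairs if s == "committed"]
    checkpoint = max(committed, key=lambda v: v["timestamp"])
    working = next((v for v, s in pairs if s == "working"), None)
    fetch = working if working is not None else checkpoint
    write = x1 if checkpoint is x0 else x0   # never overwrite the checkpoint
    return fetch, write

# Process history: checkpointed once (CS=2), then wrote X1, then rolled
# back, so (2, 0) is invalid and RS became 1.
entry = {"CS": 2, "RS": 1, "invalid": [(2, 0)]}
x0 = {"timestamp": 5, "cs": 1, "rs": 0}   # checkpoint version
x1 = {"timestamp": 9, "cs": 2, "rs": 0}   # invalid (rolled back)
fetch, write = identify(x0, x1, entry)
assert fetch is x0    # rollback undone implicitly: invalid twin not fetched
assert write is x1    # the next write overwrites the invalid twin
```

Note how the rollback needs no explicit undo: the invalid twin simply loses both selections until it is overwritten by the restarted incarnation.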
    if (I am manager) {
        broadcast LocateOwnership(pp, req, ReqProc);
        receive all the Copyship(pp) messages;
        receive Ownership(pp), if any;
        if (req == read) {
            Info[pp].copyset =
                {ReqProc} U
                {processors indicated in the
                 Copyship(pp) with a Yes};
            if (there is an Ownership(pp) returned)
                Info[pp].owner = the processor that responded;
            else
                Info[pp].owner = ManagerNode;
            if (I am ReqProc)
                if (Info[pp].owner == ManagerNode)
                    fetch page pp from disk;
                else
                    receive page pp from Info[pp].owner;
            else
                receive confirmation from ReqProc;
        }
        else {  /* req is a write fault */
            if (I am ReqProc)
                if (there is an Ownership(pp) returned)
                    receive page pp from this owner;
                else
                    fetch page pp from disk;
            else
                receive confirmation from ReqProc;
        }
        Info[pp].null = No;
    };

    Fig. 14. ReconstructInfo server algorithm.
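The reconstruction step of the read-fault branch above can be condensed into a small executable sketch. The function below is ours (the paper's algorithm is message-passing pseudocode); `replies` is a hypothetical summary of the Copyship/Ownership responses, mapping each responding processor to "Yes", "No", or "Owner".

```python
# Sketch (ours) of rebuilding Info[pp] from LocateOwnership replies,
# for the read-fault case of the ReconstructInfo server algorithm.

def reconstruct_info(req_proc, replies, manager_node):
    """replies: {processor: 'Yes' | 'No' | 'Owner'} for page pp."""
    # Copyset = requester plus every processor holding a read-only copy.
    copyset = {req_proc} | {p for p, r in replies.items() if r == "Yes"}
    owners = [p for p, r in replies.items() if r == "Owner"]
    # If nobody claims ownership, the manager assumes it; the page will
    # then be fetched from disk rather than from a remote owner.
    owner = owners[0] if owners else manager_node
    return {"owner": owner, "copyset": copyset, "null": False}

info = reconstruct_info(0, {1: "Yes", 2: "No", 3: "Owner"}, 9)
assert info["owner"] == 3
assert info["copyset"] == {0, 1}

info2 = reconstruct_info(0, {1: "No", 2: "No"}, 9)
assert info2["owner"] == 9    # ownership falls back to the manager node
```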
    if (I am NOT ReqProc) {
        Lock(PTable[pp].lock);
        if (PTable[pp].access == nil)
            send Copyship(pp) to ManagerNode with a No;
        else if (PTable[pp].access == read) {
            send Copyship(pp) to ManagerNode with a Yes;
            if (req == write)
                PTable[pp].access = nil;
        }
        else {
            /* PTable[pp].access == write */
            send a copy of pp to ReqProc;
            send Ownership(pp) to ManagerNode;
            if (req == read)
                PTable[pp].access = read;
            else
                PTable[pp].access = nil;
        };
        Unlock(PTable[pp].lock);
    };

    Fig. 15. LocateOwnership server algorithm.
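The responder logic of Fig. 15 can be restated as a pure function, which makes the access-right transitions easy to check: a node with no copy answers Copyship:No; a read-copy holder answers Copyship:Yes and drops its copy only on a write fault; the owner answers Ownership and, as in the base protocol, keeps read access on a read fault but relinquishes everything on a write fault. The function and its naming are ours, not the paper's.

```python
# Sketch (ours) of the Fig. 15 responder: given this node's current access
# right to page pp (None, "read", or "write") and the fault kind `req`,
# return (message sent to the manager/requester, new access right).

def respond(access, req):
    if access is None:
        return "Copyship:No", None            # no copy held
    if access == "read":
        new = None if req == "write" else "read"
        return "Copyship:Yes", new            # read copy; invalidated on write
    # access == "write": this node is the owner and sends the page itself
    new = "read" if req == "read" else None
    return "Ownership", new

assert respond(None, "read") == ("Copyship:No", None)
assert respond("read", "read") == ("Copyship:Yes", "read")
assert respond("read", "write") == ("Copyship:Yes", None)
assert respond("write", "read") == ("Ownership", "read")   # owner keeps read
assert respond("write", "write") == ("Ownership", None)    # single new owner
```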
…checkpoint. This is accomplished at the cost of disk space for maintaining two versions of a page as well as the CRtable and CRvector. The disk management routine, Algorithm Identify, recognizes the correct page to fetch from and identifies the correct page to overwrite. Execution overhead of Algorithm Identify by the disk server, as shown in Section IV, is insignificant compared to the disk access time.

APPENDIX
RECOVERABLE MEMORY COHERENCE ALGORITHMS

Figs. 10-15 show the recoverable memory coherence algorithms based on the MonitorCentralManager protocol [3]. An additional bit, null, is associated with each entry of the Info, indicating the validity of the entry. The ReconstructInfo server algorithm, executed by the centralized manager, handles both local and remote read-fault and write-fault requests after restart. The LocateOwnership server algorithm, executed by a nonmanager node, responds to the message broadcast by the centralized manager.

ACKNOWLEDGMENT

The authors wish to express their sincere appreciation to the referees for their detailed and helpful comments. The authors also would like to thank A. Gupta and J.-S. Long for their discussions regarding this research.

REFERENCES

[1] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," in Proc. 5th ACM Symp. Principles Distributed Comput., 1986, pp. 229-239.
[2] K. Li, "IVY: A shared virtual memory system for parallel computing," in Proc. 1988 Int. Conf. Parallel Processing, 1988, pp. 94-101.
[3] K. Li, "Shared virtual memory on loosely coupled multiprocessors," Ph.D. dissertation, Tech. Rep. YALEU/DCS/RR-492, Dep. Comput. Sci., Yale Univ., Sept. 1986.
[4] R. Bisiani, A. Nowatzyk, and M. Ravishankar, "Coherent shared memory on a distributed memory machine," in Proc. 1989 Int. Conf. Parallel Processing, Vol. I Architecture, 1989, pp. I-133-I-141.
[5] U. Ramachandran, M. Ahamad, and M. Y. A. Khalidi, "Coherence of distributed shared memory: Unifying synchronization and data transfer," in Proc. 1989 Int. Conf. Parallel Processing, Vol. II Software, 1989, pp. II-160-II-169.
[6] C. P. Thacker, L. C. Stewart, and E. H. Satterthwaite, Jr., "Firefly: A multiprocessor workstation," IEEE Trans. Comput., vol. 37, pp. 909-920, Aug. 1988.
[7] Balance 8000 Technical Summary, Sequent Computer Systems, Inc., Nov. 1984.
[8] G. F. Pfister, W. C. Brantley, et al., "The IBM research parallel processor prototype (RP3): Introduction and architecture," in Proc. 1985 Int. Conf. Parallel Processing, 1985, pp. 764-770.
[9] D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, "Cedar: A large scale multiprocessor," in Proc. 1983 Int. Conf. Parallel Processing, 1983, pp. 524-529.
[10] K. H. Kim, "Programmer-transparent coordination of recovering concurrent processes: Philosophy and rules for efficient implementation," IEEE Trans. Software Eng., vol. 14, pp. 810-821, June 1988.
[11] Y.-H. Lee and K. G. Shin, "Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks," IEEE Trans. Comput., vol. C-33, pp. 113-124, Feb. 1984.
[12] J. Kent and H. Garcia-Molina, "Optimizing shadow recovery algorithms," IEEE Trans. Software Eng., vol. 14, pp. 155-168, Feb. 1988.
[13] R. A. Lorie, "Physical integrity in a large segmented database," ACM Trans. Database Syst., vol. 2, pp. 91-104, Mar. 1977.
[14] A. Reuter, "A fast transaction-oriented logging scheme for UNDO recovery," IEEE Trans. Software Eng., vol. SE-6, pp. 348-356, July 1980.
[15] S. M. Thatte, "Persistent memory: A storage architecture for object-oriented database systems," in Proc. 1986 Int. Workshop Object-Oriented Database Syst., 1986, pp. 148-159.
[16] R. D. Schlichting and F. B. Schneider, "Fail-stop processors: An approach to designing fault-tolerant computing systems," ACM Trans. Comput. Syst., vol. 1, pp. 222-238, Aug. 1983.
[17] A. Chang and M. F. Mergen, "801 Storage: Architecture and programming," ACM Trans. Comput. Syst., vol. 6, pp. 28-50, Feb. 1988.
[18] A. Agarwal and A. Gupta, "Memory-reference characteristics of multiprocessor applications under MACH," in Proc. 1988 ACM SIGMETRICS Conf. Measurement Modeling Comput. Syst., 1988, pp. 215-225.
[19] S. J. Eggers and R. H. Katz, "A characterization of sharing in parallel programs and its application to coherency protocol evaluation," in Proc. 15th Annu. Int. Symp. Comput. Architecture, 1988, pp. 373-382.
[20] F. Darema-Rogers, G. F. Pfister, and K. So, "Memory access patterns of parallel scientific programs," in Proc. 1987 ACM SIGMETRICS Conf. Measurement Modeling Comput. Syst., 1987, pp. 46-58.

Kun-Lung Wu (S'85) received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1982 and the M.S. degree in computer science from the University of Illinois at Urbana-Champaign in 1986. From August 1982 to August 1984, he was on his military service at Kaohsiung, Taiwan. Currently he is working toward the Ph.D. degree in computer science. Since 1985, he has been a Research Assistant at the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. In the summer of 1986, he also worked as a consultant in the Database Systems Branch of the Artificial Intelligence Laboratory, Texas Instruments Inc., Dallas, TX. His research interests include parallel and distributed processing, database transaction management, fault-tolerant computing, and computer architecture.

Mr. Wu is a student member of the Association for Computing Machinery and also a member of Phi Kappa Phi.

W. Kent Fuchs (S'80-M'85) received the B.S.E. degree in electrical engineering from Duke University, Durham, NC, in 1977 and the M.S. degree in electrical engineering from the University of Illinois, Urbana, in 1982. In 1984 he received the M.Div. degree from Trinity Evangelical Divinity School in Deerfield, IL, and in 1985 the Ph.D. degree in electrical engineering from the University of Illinois.

He is currently an Associate Professor in the Departments of Electrical and Computer Engineering, Computer Science, and the Coordinated Science Laboratory, University of Illinois. He joined the University of Illinois as an Assistant Professor in 1985 and was promoted to Associate Professor in 1989. His research interests include all aspects of VLSI system design with emphasis on reliable computing.

Dr. Fuchs's recent awards include appointment as Fellow in the Center for Advanced Studies, University of Illinois, 1989; the Xerox Faculty Award for Excellence in Research 1987, College of Engineering, University of Illinois; the Digital Equipment Corporation Incentives for Excellence Faculty Award 1986-1988; the Best Paper Award, IEEE/ACM Design Automation Conference (DAC) 1986, simulation and test category; and nomination for the Best Paper Award, DAC 1987, simulation and test category.