C. MOHAN
IBM Almaden Research Center
and
DON HADERLE
IBM Santa Teresa Laboratory
and
BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ
IBM Almaden Research Center
In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiently. By enabling parallelism during restart, page-oriented redo, and logical undo, it enhances concurrency and performance. We show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL-based recovery methods of DB2™, IMS, and Tandem™ systems. ARIES is applicable not only to database management systems but also to persistent object-oriented languages, recoverable file systems and transaction-based operating systems. ARIES has been implemented, to varying degrees, in IBM's OS/2™ Extended Edition Database Manager, DB2, Workstation Data Save Facility/VM, Starburst and QuickSilver, and in the University of Wisconsin's EXODUS and Gamma database machine.

Categories and Subject Descriptors: D.4.5 [Operating Systems]: Reliability - backup procedures, checkpoint/restart, fault tolerance; E.5 [Data]: Files - backup/recovery; H.2.2 [Database Management]: Physical Design - recovery and restart; H.2.4 [Database Management]: Systems - concurrency, transaction processing; H.2.7 [Database Management]: Database Administration - logging and recovery

General Terms: Algorithms, Design, Performance

Authors' addresses: C. Mohan, Data Base Technology Institute, IBM Almaden Research Center, San Jose, CA 95120; D. Haderle, Data Base Technology Institute, IBM Santa Teresa Laboratory, San Jose, CA 95150; B. Lindsay, H. Pirahesh, and P. Schwarz, IBM Almaden Research Center, San Jose, CA 95120. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1992 0362-5915/92/0300-0094 $1.50. ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992, Pages 94-162.
1. INTRODUCTION

In this section, first we introduce some basic concepts relating to recovery, concurrency control, and buffer management, and then we outline the organization of the rest of the paper.

1.1 Logging, Failures, and Recovery Methods

The transaction concept, which is well understood by now, has been around for a long time. It encapsulates the ACID (Atomicity, Consistency, Isolation and Durability) properties [36]. The application of the transaction concept is not limited to the database area [6, 17, 22, 23, 30, 39, 40, 51, 74, 88, 90, 101]. Guaranteeing the atomicity and durability of transactions, in the face of concurrent execution of multiple transactions and various failures, is a very important problem in transaction processing. While many methods have been developed in the past to deal with this problem, the assumptions, performance characteristics, and the complexity and ad hoc nature of such methods have not always been acceptable. Solutions to this problem may be judged using several metrics: degree of concurrency supported within a page and across pages, complexity of the resulting logic, space overhead on nonvolatile storage and in memory for data and the log, overhead in terms of the number of synchronous and asynchronous I/Os required during restart recovery and normal processing, kinds of functionality supported (partial transaction rollbacks, etc.), amount of processing performed during restart recovery, degree of concurrent processing supported during restart recovery, extent of system-induced transaction rollbacks caused by deadlocks, restrictions placed
™ AS/400, DB2, IBM, and OS/2 are trademarks of the International Business Machines Corp. Encompass, NonStop SQL and Tandem are trademarks of Tandem Computers, Inc. DEC, VAX DBMS, VAX and Rdb/VMS are trademarks of Digital Equipment Corp. Informix is a registered trademark of Informix Software, Inc.
on stored data (e.g., requiring unique keys for all records, restricting maximum size of objects to the page size, etc.), ability to support novel lock modes which allow the concurrent execution, based on commutativity and other properties [2, 26, 38, 45, 88, 89], of operations like increment/decrement on the same data by different transactions, and so on.

In this paper we introduce a new recovery method, called ARIES¹ (Algorithm for Recovery and Isolation Exploiting Semantics), which fares very well with respect to all these metrics. It also provides a great deal of flexibility to take advantage of some special characteristics of a class of applications for better performance (e.g., the kinds of applications that IMS Fast Path [28, 42] supports efficiently).

To meet transaction and data recovery guarantees, ARIES records in a log the progress of a transaction, and its actions which cause changes to recoverable data objects. The log becomes the source for ensuring either that the transaction's committed actions are reflected in the database despite various types of failures, or that its uncommitted actions are undone (i.e., rolled back). When the logged actions reflect data object content, then those log records also become the source for reconstruction of damaged or lost data (i.e., media recovery). Conceptually, the log can be thought of as an ever growing sequential file. In the actual implementation, multiple physical files may be used in a serial fashion to ease the job of archiving log records [15]. Every log record is assigned a unique log sequence number (LSN) when that record is appended to the log. The LSNs are assigned in ascending sequence.
Typically, they are the logical addresses of the corresponding log records. At times, version numbers or timestamps are also used as LSNs [67]. If more than one log is used for storing the log records relating to different pieces of data, then a form of two-phase commit protocol (e.g., the current industry-standard Presumed Abort protocol [63, 64]) must be used.

The nonvolatile version of the log is stored on what is generally called stable storage. Stable storage means nonvolatile storage which remains intact and available across system failures. Disk is an example of nonvolatile storage and its stability is generally improved by maintaining synchronously two identical copies of the log on different devices. We would expect the online log records stored on direct access storage devices to be archived to a cheaper and slower medium like tape at regular intervals. The archived log records may be discarded once the appropriate image copies (archive dumps) of the database have been produced and those log records are no longer needed for media recovery.

¹ The choice of the name ARIES, besides its use as an acronym that describes certain features of our recovery method, is also supposed to convey the relationship of our work to the Starburst project at IBM, since Aries is the name of a constellation.

Whenever log records are written, they are placed first only in the volatile storage (i.e., virtual storage) buffers of the log file. Only at certain times (e.g., at commit time) are the log records up to a certain point (LSN) written, in log page sequence, to stable storage. This is called forcing the log up to that LSN. Besides forces caused by transaction and buffer manager activities, a system process may, in the background, periodically force the log buffers as they fill up.
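The log behavior described above (ascending LSNs, volatile log buffers, and forcing the log up to a given LSN) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names (`Log`, `append`, `force`, `crash`) are hypothetical, and a simple integer counter stands in for the LSN, whereas the text notes that LSNs are typically logical addresses of the log records.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    lsn: int        # assigned when the record is appended to the log
    payload: str    # stand-in for the real contents of the record


@dataclass
class Log:
    """Toy log: volatile buffers in front of a list that stands in for stable storage."""
    stable: list = field(default_factory=list)   # records already on stable storage
    buffer: list = field(default_factory=list)   # records still only in volatile storage
    next_lsn: int = 1                            # assumption: LSN is a simple counter

    def append(self, payload: str) -> int:
        """Assign the next LSN in ascending sequence and buffer the record."""
        rec = LogRecord(self.next_lsn, payload)
        self.next_lsn += 1
        self.buffer.append(rec)
        return rec.lsn

    def flushed_lsn(self) -> int:
        """Highest LSN on stable storage (0 if nothing has been forced yet)."""
        return self.stable[-1].lsn if self.stable else 0

    def force(self, up_to_lsn: int) -> None:
        """Force the log: move buffered records with LSN <= up_to_lsn, in order."""
        while self.buffer and self.buffer[0].lsn <= up_to_lsn:
            self.stable.append(self.buffer.pop(0))

    def crash(self) -> None:
        """Simulate a system failure: the volatile buffers are lost."""
        self.buffer.clear()
```

In this sketch, a record appended after the last force does not survive `crash()`, which is why commit processing and the WAL protocol both rely on forcing the log up to a specific LSN.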
For ease of exposition, we assume that each log record describes the update performed to only a single page. This is not a requirement of ARIES. In fact, in the Starburst [87] implementation of ARIES, sometimes a single log record might be written to describe updates to two pages. The undo (respectively, redo) portion of a log record provides information on how to undo (respectively, redo) changes performed by the transaction. A log record which contains both the undo and the redo information is called an undo-redo log record. Sometimes, a log record may be written to contain only the redo information or only the undo information. Such a record is called a redo-only log record or an undo-only log record, respectively. Depending on the action that is performed, the undo-redo information may be recorded physically (e.g., before the update and after the update images or values of specific fields within the object) or operationally (e.g., add 5 to field 3 of record 15, subtract 3 from field 4 of record 10). Operation logging permits the use of high concurrency lock modes, which exploit the semantics of the operations performed on the data. For example, with certain operations, the same field of a record could have uncommitted updates of many transactions. These permit more concurrency than what is permitted by the strict executions property of the model of [3], which essentially says that modified objects must be locked exclusively (X mode) for commit duration.

ARIES uses the widely accepted write-ahead logging (WAL) protocol. Some of the commercial and prototype systems based on WAL are IBM's AS/400™ [9, 21], CMU's Camelot [23, 90], IBM's DB2™ [1, 10, 11, 12, 13, 14, 15, 19, 35, 96], Unisys's DMS/1100 [27], Tandem's Encompass™ [4, 37], IBM's IMS [42, 43, 53, 76, 80, 94], Informix's Informix-Turbo™ [16], Honeywell's MRDS [91], Tandem's NonStop SQL™ [95], MCC's ORION [29], IBM's OS/2 Extended Edition™ Database Manager [7], IBM's QuickSilver [40], IBM's Starburst [87], SYNAPSE [78], IBM's System/38 [99], and DEC's VAX DBMS™ and VAX Rdb/VMS™ [81].

In WAL-based systems, an updated page is written back to the same nonvolatile storage location from where it was read. That is, in-place updating is performed on nonvolatile storage. Contrast this with what happens in the shadow page technique which is used in systems such as System R [31] and SQL/DS [5] and which is illustrated in Figure 1. There the updated version of the page is written to a different location on nonvolatile storage and the previous version of the page is used for performing database recovery if the system were to fail before the next checkpoint.

The WAL protocol asserts that the log records representing changes to some data must already be on stable storage before the changed data is allowed to replace the previous version of that data on nonvolatile storage. That is, the system is not allowed to write an updated page to the nonvolatile storage version of the database until at least the undo portions of the log records which describe the updates to the page have been written to stable storage. To enable the enforcement of this protocol, systems using the WAL method of recovery store in every page the LSN of the log record that describes the most recent update performed on that page. The reader is
referred to [31, 97] for discussions about why the WAL technique is considered to be better than the shadow page technique. There are also methods in which shadowing is performed using a separate log. While these avoid some of the problems of the original shadow page approach, they still retain some of the important drawbacks and they introduce some new ones. Similar comments apply to the methods suggested in [82, 88]. Later, in Section 10, we show why some of the recovery paradigms of System R, which were based on the shadow page technique, are inappropriate in the WAL context, when we need support for high levels of concurrency and various other features that are described in Section 2.

Fig. 1. Shadow page technique (page map with current and shadow versions of a page; on a failure, recovery uses the shadow version).

Transaction status is also stored in the log and no transaction can be considered complete until its committed status and all its log data are safely recorded on stable storage by forcing the log up to the transaction's commit log record's LSN. This allows a restart recovery procedure to recover any transactions that completed successfully but whose updated pages were not physically written to nonvolatile storage before the failure of the system. This means that a transaction is not permitted to complete its commit processing (see [63, 64]) until the redo portions of all log records of that transaction have been written to stable storage.

We deal with three types of failures: transaction or process, system, and media or device. When a transaction or process failure occurs, typically the transaction would be in such a state that its updates would have to be undone. It is possible that the transaction had corrupted some pages in the buffer pool if it was in the middle of performing some updates when the process disappeared. When a system failure occurs, typically the virtual storage contents would be lost and the transaction system would have to be restarted and recovery performed using the nonvolatile storage versions of the database and the log. When a media or device failure occurs, typically the contents of that media would be lost and the lost data would have to be recovered using an image copy (archive dump) version of the lost data and the log.
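The two forcing rules described above, the WAL rule for page writes and the rule for commit, can be illustrated with a small sketch that also uses operation logging for the undo and redo information. All names here (`Log`, `Page`, `add_to_field`, `write_page`, `commit`) are hypothetical, the LSN is assumed to be a simple counter, and "stable storage" is reduced to a single `flushed_lsn` watermark.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int
    txn_id: str
    kind: str            # "update" or "commit"
    page_id: int = -1
    redo: tuple = ()     # operational redo info, e.g. ("add", "balance", 5)
    undo: tuple = ()     # operational undo info, e.g. ("add", "balance", -5)


class Log:
    def __init__(self):
        self.records = []      # all log records, in LSN order
        self.flushed_lsn = 0   # records with lsn <= flushed_lsn are on stable storage

    def append(self, **fields) -> LogRecord:
        rec = LogRecord(lsn=len(self.records) + 1, **fields)
        self.records.append(rec)
        return rec

    def force(self, lsn: int) -> None:
        self.flushed_lsn = max(self.flushed_lsn, lsn)


class Page:
    def __init__(self, page_id: int):
        self.page_id = page_id
        self.page_lsn = 0      # LSN of the record describing the most recent update
        self.fields = {}


def add_to_field(log, page, txn_id, name, delta):
    """Write an undo-redo record, apply the update, and stamp the page with the LSN."""
    rec = log.append(txn_id=txn_id, kind="update", page_id=page.page_id,
                     redo=("add", name, delta), undo=("add", name, -delta))
    page.fields[name] = page.fields.get(name, 0) + delta
    page.page_lsn = rec.lsn
    return rec.lsn


def write_page(log, page, disk):
    """WAL rule: the log must be stable up to page_lsn before the disk copy is replaced."""
    if page.page_lsn > log.flushed_lsn:
        log.force(page.page_lsn)
    disk[page.page_id] = (page.page_lsn, dict(page.fields))


def commit(log, txn_id):
    """Commit rule: force the log up to the transaction's commit record."""
    rec = log.append(txn_id=txn_id, kind="commit")
    log.force(rec.lsn)
    return rec.lsn
```

Comparing `page_lsn` with `flushed_lsn` is what storing an LSN in every page buys: the page itself records how far the log must be forced before the page may overwrite its nonvolatile version.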
Forward processing refers to the updates performed when the system is in normal (i.e., not restart recovery) processing and the transaction is updating the database because of the data manipulation (e.g., SQL) calls issued by the user or the application program. That is, the transaction is not rolling back and using the log to generate the (undo) update calls. Partial rollback refers to the ability to set up savepoints during the execution of a transaction and later in the transaction request the rolling back of the changes performed by the transaction since the establishment of a previous savepoint. This is to be contrasted with total rollback in which all updates of the transaction are undone and the transaction is terminated. Whether or not the savepoint concept is exposed at the application level is immaterial to us since this paper deals only with database recovery. A nested rollback is said to have taken place if a partial rollback were to be later followed by a total rollback or another partial rollback whose point of termination is an earlier point in the transaction than the point of termination of the first rollback. Normal undo refers to total or partial transaction rollback when the system is in normal operation. A normal undo may be caused by a transaction request to rollback or it may be system initiated because of deadlocks or errors (e.g., integrity constraint violations). Restart undo refers to transaction rollback during restart recovery after a system failure. To make partial or total rollback efficient and also to make debugging easier, all the log records written by a transaction are linked via the PrevLSN field of the log records in reverse chronological order. That is, the most recently written log record of the transaction would point to the previous most recent log record written by that transaction, if there is such a log record.² In many WAL-based systems, the updates performed during a rollback are logged using what are called
compensation log records (CLRs) [15]. Whether a CLR's update is undone, should that CLR be encountered during a rollback, depends on the particular system. As we will see later, in ARIES, a CLR's update is never undone and hence CLRs are viewed as redo-only log records.

Page-oriented redo is said to occur if the log record whose update is being redone describes which page of the database was originally modified during normal processing and if the same page is modified during the redo processing. No internal descriptors of tables or indexes need to be accessed to redo the update. That is, no other page of the database needs to be examined. This is to be contrasted with logical redo which is required in System R, SQL/DS and AS/400 for indexes [21, 62]. In those systems, since index changes are not logged separately but are redone using the log records for the data pages, performing a redo requires accessing several descriptors and pages of the database. The index tree would have to be retraversed to determine the page(s) to be modified and, sometimes, the index page(s) modified because of this redo operation may be different from the index page(s) originally modified during normal processing. Being able to perform page-oriented redo allows the system to provide recovery independence amongst objects. That is, the recovery of one page's contents does not require accesses to any other (data or catalog) pages of the database. As we will describe later, this makes media recovery very simple.

In a similar fashion, we can define page-oriented undo and logical undo. Being able to perform logical undos allows the system to provide higher levels of concurrency than what would be possible if the system were to be restricted only to page-oriented undos. This is because the former, with appropriate concurrency control protocols, would permit uncommitted updates of one transaction to be moved to a different page by another transaction. If one were restricted to only page-oriented undos, then the latter transaction would have had to wait for the former to commit. Page-oriented redo and page-oriented undo permit faster recovery since pages of the database other than the pages mentioned in the log records are not accessed. In the interest of efficiency, ARIES supports page-oriented redo and, in the interest of high concurrency, it supports logical undos. In [62], we introduce the ARIES/IM method for concurrency control and recovery in B+-tree indexes and show the advantages of being able to perform logical undos by comparing ARIES/IM with other index methods.

² The AS/400, Encompass and NonStop SQL do not explicitly link all the log records written by a transaction. This makes undo inefficient since a sequential backward scan of the log must be performed to retrieve all the desired log records of a transaction.

1.2 Latches and Locks

Normally latches and locks are used to control access to shared information. Locking has been discussed to a great extent in the literature. Latches, on the other hand, have not been discussed that much. Latches are like semaphores. Usually, latches are used to guarantee physical consistency of data, while locks are used to assure logical consistency of data. We need to worry about physical consistency since we need to support a multiprocessor environment. Latches are usually held for a much shorter period than are locks. Also, the deadlock detector is not informed about latch waits. Latches are requested in such a manner so as to avoid deadlocks involving latches alone, or involving latches and locks. Acquiring and releasing a latch is much cheaper than acquiring and
releasing a lock. In the no-conflict case, the overhead amounts to 10s of instructions for the former versus 100s of instructions for the latter. Latches are cheaper because the latch control information is always in virtual memory in a fixed place, and direct addressability to the latch information is possible given the latch name. As the protocols presented later in this paper and those in [57, 62] show, each transaction holds at most two or three latches simultaneously. As a result, the latch request blocks can be permanently allocated to each transaction and initialized with transaction ID, etc. right at the start of that transaction. On the other hand, typically, storage for individual locks has to be acquired, formatted and released dynamically, causing more instructions to be executed to acquire and release locks. This is advisable because, in most systems, the number of lockable objects is many orders of magnitude greater than the number of latchable objects. Typically, all information relating to locks currently held or requested by all the transactions is stored in a single, central hash table. Addressability to a particular lock's information is gained by first hashing the lock name to get the address of the hash anchor and then, possibly, following a chain of pointers. Usually, in the process of trying to locate the lock control block, because multiple transactions may be simultaneously reading and modifying the contents of the lock table, one or more latches will be acquired and released: one latch on the hash anchor and, possibly, one on the specific lock's chain of holders and waiters.

Locks may be obtained in different modes such as S (Shared), X (eXclusive), IX (Intention eXclusive), IS (Intention Shared) and SIX (Shared Intention eXclusive), and at different granularities such as record (tuple), table (relation), and file (tablespace) [32]. The S and X locks are the most common ones. S provides the read privilege and X provides the read and write privileges. Locks on a given object can be held simultaneously by different transactions only if those locks' modes are compatible. The compatibility relationships amongst the above modes of locking are shown in Figure 2. A check mark (✓) indicates that the corresponding modes are compatible. With hierarchical locking, the intention locks (IX, IS, and SIX) are generally obtained on the higher levels of the hierarchy (e.g., table), and the S and X locks are obtained on the lower levels (e.g., record). The nonintention mode locks (S and X), when obtained on an object at a certain level of the hierarchy, implicitly grant locks of the corresponding mode on the lower level objects of that higher level object. The intention mode locks, on the other hand, only give the privilege of requesting the corresponding intention or nonintention mode locks on the lower level objects. For example, SIX on a table implicitly grants S on all the records of that table, and it allows X to be requested explicitly on the records. Additional, semantically rich lock modes have been defined in the literature [2, 38, 45, 55] and ARIES can accommodate them.
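The compatibility check for the five lock modes above can be sketched as a small lookup table. The entries below follow the standard hierarchical-locking matrix for S, X, IS, IX and SIX; that Figure 2 contains exactly these entries is an assumption of this sketch, and the names `COMPATIBLE`, `compatible` and `grantable` are illustrative only.

```python
# For each held mode, the set of modes another transaction may be granted
# on the same object (standard hierarchical-locking matrix; assumed here).
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}


def compatible(held: str, requested: str) -> bool:
    """True if `requested` can coexist with a lock already held in mode `held`."""
    return requested in COMPATIBLE[held]


def grantable(requested: str, held_by_others: list) -> bool:
    """A request is grantable only if it is compatible with every other holder."""
    return all(compatible(held, requested) for held in held_by_others)
```

For example, `grantable("IX", ["IS", "IX"])` is true, so two transactions can both hold IX on a table and then lock different records in X mode, while `grantable("S", ["IX"])` is false because a table-level S lock would implicitly grant S on records that the IX holder may be updating.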
Lock requests may be made with the conditional or the unconditional option. A conditional request means that the requestor is not willing to wait if, when the request is processed, the lock is not grantable immediately. An unconditional request means that the requestor is willing to wait until the lock becomes grantable. Locks may be held for different durations. An unconditional request for an instant duration lock means that the lock is not to be actually granted, but the lock manager has to delay returning the lock call with the success status until the lock becomes grantable. Manual duration locks are released some time after they are acquired and, typically, long before transaction termination. Commit duration locks are released only when the transaction terminates, i.e., after commit or rollback is completed. The above discussions concerning conditional requests, different modes, and durations, except for commit duration, apply to latches also.
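The request options and durations just described can be sketched with a toy lock manager. This is a simplified illustration under stated assumptions: only S and X modes are modeled, the class and method names are hypothetical, and an unconditional request that cannot be granted simply returns the string "wait" instead of actually blocking the caller.

```python
COMPATIBLE = {("S", "S")}   # with only S and X modeled, S with S is the sole compatible pair


class LockManager:
    def __init__(self):
        self.holders = {}   # object name -> list of (txn_id, mode)

    def _grantable(self, name, txn_id, mode):
        # Compare the requested mode against locks held by other transactions.
        return all((held, mode) in COMPATIBLE
                   for other, held in self.holders.get(name, [])
                   if other != txn_id)

    def request(self, name, txn_id, mode, conditional=False, duration="commit"):
        """Return 'granted', 'denied' (conditional requests only), or 'wait'."""
        if not self._grantable(name, txn_id, mode):
            return "denied" if conditional else "wait"
        if duration != "instant":
            # Manual and commit duration locks are actually recorded as held;
            # an instant duration request only reports that the lock was grantable.
            self.holders.setdefault(name, []).append((txn_id, mode))
        return "granted"

    def release(self, name, txn_id):
        """Manual duration: the holder releases the lock before it terminates."""
        self.holders[name] = [(t, m) for t, m in self.holders.get(name, [])
                              if t != txn_id]

    def end_transaction(self, txn_id):
        """Commit duration: release everything once commit or rollback completes."""
        for name in list(self.holders):
            self.release(name, txn_id)
```

A real lock manager would queue waiters and wake them when the lock becomes grantable; the return value "wait" only marks the point where that would happen.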
1.3 Fine-Granularity Locking

Fine-granularity (e.g., record) locking has been supported by nonrelational database systems (e.g., IMS [53, 76, 80]) for a long time. Surprisingly, only a few of the commercially available relational systems provide fine-granularity locking, even though IBM's System R [32], S/38 [99] and SQL/DS [5], and Tandem's Encompass [37] supported record and/or key locking from the beginning.³ Although many interesting problems relating to providing
fine-granularity locking in the context of WAL remain to be solved, the research community has not been paying enough attention to this area [3, 75, 88]. Some of the System R solutions worked only because of the use of the shadow page recovery technique in combination with locking (see Section 10). Supporting fine-granularity locking and variable length records in a flexible fashion requires addressing some interesting storage management issues which have never really been discussed in the database literature. Unfortunately, some of the interesting techniques that were developed for System R and which are now part of SQL/DS did not get documented in the literature. At the expense of making this paper long, we will be discussing here some of those problems and their solutions.

³ Encompass and S/38 had only X locks for records and no locks were acquired automatically by these systems for reads.

Fig. 2. Lock mode compatibility matrix.

As supporting high concurrency gains importance (see [79] for the description of an application requiring very high concurrency) and as object-oriented systems gain in popularity, it becomes necessary to invent concurrency control and recovery methods that take advantage of the semantics of the operations on the data [2, 26, 38, 88, 89], and that support fine-granularity locking efficiently. Object-oriented systems may tend to encourage users to define a large number of small objects and users may expect object instances to be the appropriate granularity of locking. In the object-oriented logical view of the database, the concept of a page, with its physical orientation as the container of objects, becomes unnatural to think about as the unit of locking during object accesses and modifications. Also, object-oriented system users may tend to have many terminal interactions during the course of a transaction, thereby increasing the lock hold times. If the unit of locking were to be a page, lock wait times and deadlock possibilities will be aggravated. Other discussions concerning transaction management in an object-oriented environment can be found in [22, 29].

As more and more customers adopt relational systems for production applications, it becomes ever more important to handle hot-spots [28, 34, 68, 77, 79, 83] and storage management without requiring too much tuning by the system users or administrators. Since relational systems have been welcomed to a great extent because of their ease of use, it is important that we pay greater attention to this area than what has been done in the context of the nonrelational systems. Apart from the need for high concurrency for user data, the ease with which online data definition operations can be performed in relational systems by even ordinary users requires the support for high concurrency of access to, at least, the catalog data. Since a leaf page in an index typically describes data in hundreds of data pages, page-level locking of index data is just not acceptable. A flexible recovery method that
allows the support of high levels of concurrency during index accesses is needed.

The above facts argue for supporting semantically rich modes of locking such as increment/decrement which allow multiple transactions to concurrently modify even the same piece of data. In funds-transfer applications, increment and decrement operations are frequently performed on the branch and teller balances by numerous transactions. If those transactions are forced to use only X locks, then they will be serialized, even though their operations commute.

1.4 Buffer Management

The buffer manager (BM) is the component of the transaction system that manages the buffer pool and does I/Os to read/write pages from/to the nonvolatile storage version of the database. The fix primitive of the BM may be used to request the buffer address of a logical page in the database. If the requested page is not in the buffer pool, BM allocates a buffer slot and reads the page in. There may be instances (e.g., during a B+-tree page split, when the new page is allocated) where the current contents of a page on nonvolatile storage are not of interest. In such a case, the fix_new primitive
may be used to make the BM allocate a free slot and return the address of that slot, if BM does not find the page in the buffer pool. The fix_new invoker will then format the page as desired. Once a page is fixed in the buffer pool, the corresponding buffer slot is not available for page replacement until the unfix primitive is issued by the data manipulative component. Actually, for each page, BM keeps a fix count which is incremented by one during every fix operation and which is decremented by one during every unfix operation. A page in the buffer pool is said to be dirty if the buffer version of the page has some updates which are not yet reflected in the nonvolatile storage version of the same page. The fix primitive is also used to communicate the intention to modify the page. Dirty pages can be written back to nonvolatile storage when no fix with the modification intention is held, thus allowing read accesses to the page while it is being written out. [96] discusses the role of BM in writing in the background, on a continuous basis, dirty pages to nonvolatile storage to reduce the amount of redo work that would be needed if a system failure were to occur and also to keep a certain percentage of the buffer pool pages in the nonmodified state so that they may be replaced with other pages without synchronous write I/Os having to be performed at the time of replacement. While performing those writes, BM ensures that the WAL protocol is obeyed. As a consequence, BM may have to force the log up to the LSN of the dirty page before writing the page to nonvolatile storage. Given the large buffer pools that are common today, we would expect a force of this nature to be very rare and most log forces to occur because of transactions committing or entering the prepare state.
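The fix, fix_new and unfix primitives, the fix count, and the write-back conditions described above can be sketched as follows. This is an illustrative toy, not the BM of any particular system: `Frame`, `BufferManager`, `mark_dirty`, `replaceable` and `write_back` are hypothetical names, a Python dict stands in for nonvolatile storage, and the `force_log` callback is an assumed hook into the log manager.

```python
class Frame:
    """One buffer slot together with its control information."""
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = data
        self.fix_count = 0      # incremented by every fix, decremented by every unfix
        self.modify_fixes = 0   # fixes currently held with the intention to modify
        self.dirty = False
        self.page_lsn = 0


class BufferManager:
    def __init__(self, disk, force_log):
        self.disk = disk            # dict: page_id -> contents (nonvolatile version)
        self.force_log = force_log  # callable(lsn): force the log up to lsn
        self.frames = {}

    def _pin(self, frame, modify):
        frame.fix_count += 1
        frame.modify_fixes += 1 if modify else 0
        return frame

    def fix(self, page_id, modify=False):
        """Return the frame for page_id, reading the page in if it is not buffered."""
        if page_id not in self.frames:
            self.frames[page_id] = Frame(page_id, dict(self.disk[page_id]))
        return self._pin(self.frames[page_id], modify)

    def fix_new(self, page_id):
        """Like fix, but skip the read: the caller will format the page itself."""
        if page_id not in self.frames:
            self.frames[page_id] = Frame(page_id, {})
        return self._pin(self.frames[page_id], modify=True)

    def unfix(self, page_id, modify=False):
        frame = self.frames[page_id]
        frame.fix_count -= 1
        frame.modify_fixes -= 1 if modify else 0

    def mark_dirty(self, page_id, lsn):
        frame = self.frames[page_id]
        frame.dirty, frame.page_lsn = True, lsn

    def replaceable(self, page_id):
        """A slot is available for replacement only when nobody has the page fixed."""
        return self.frames[page_id].fix_count == 0

    def write_back(self, page_id):
        """Write the page out unless a modifier holds it, forcing the log first (WAL)."""
        frame = self.frames[page_id]
        if frame.modify_fixes > 0:
            return False
        self.force_log(frame.page_lsn)
        self.disk[page_id] = dict(frame.data)
        frame.dirty = False
        return True
```

Tracking modification-intent fixes separately from the overall fix count is one way to allow readers to keep a page fixed while it is being written out, as the text describes.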
BM also implements the support for latching pages. To provide direct addressability to page latches and to reduce the storage associated with those latches, the latch on a logical page is actually the latch on the corresponding buffer slot. This means that a logical page can be latched only after it is fixed in the buffer pool and the latch has to be released before the page is unfixed. These are highly acceptable conditions. The latch control information is stored in the buffer control block (BCB) for the corresponding buffer slot. The BCB also contains the identity of the logical page, what the fix count is, the dirty status of the page, etc.
Buffer management policies differ among the many systems in existence (see Section 11, "Other WAL-Based Methods"). If a page modified by a transaction is allowed to be written to the permanent database on nonvolatile storage before that transaction commits, then the steal policy is said to be followed by the buffer manager (see [36] for such terminologies). Otherwise, a no-steal policy is said to be in effect. Steal implies that during normal or restart rollback, some undo work might have to be performed on the nonvolatile storage version of the database. If a transaction is not allowed to commit until all pages modified by it are written to the permanent version of the database, then a force policy is said to be in effect. Otherwise, a no-force policy is said to be in effect. With a force policy, during restart recovery, no redo work will be necessary for committed transactions. Deferred updating is said to occur if, even in the virtual storage database buffers, the updates are not performed in-place when the transaction issues the corresponding database calls. The updates are kept in a pending list elsewhere and are performed in-place, using the pending list information, only after it is determined that the transaction is definitely committing. If the transaction needs to be rolled back, then the pending list is discarded or ignored. The deferred updating policy has implications on whether a transaction can "see" its own updates or not, and on whether partial rollbacks are possible or not. For more discussions concerning buffer management, see [8, 15, 24, 96].

1.5 Organization

The rest of the paper is organized as follows. After
stating
our
goals
in
Section 2 and giving an overview of the new recovery method ARIES in Section 3, we present, in Section 4, the important data structures used by ARIES during normal and restart recovery processing. Next, in Section 5, the protocols followed during normal processing are presented followed, in Section 6, by the description of the processing performed during latter section also presents ways to exploit parallelism methods for performing recovery selectively some of the data. checkpoints during impact of failures description of how Section 9 introduces
a method tiques context caused detail in for some of the by using the of the those
restart during
The and of
or postponing
Then, in Section 7, algorithms are described for taking the different log passes of restart recovery to reduce the during recovery. This is followed, in Section 8, by the fuzzy image copying and media the significant notion of nested
them technique of many such as efficiently. paradigms and of the IMS, System WAL-based Encompass WAL context. Section which R. We existing page paradigms recovery in the
implementing shadow
problems in use
11 describes
characteristics systems
different
DB2,
ACM Transactions
105
Section 12 outlines the many different properties of ARIES. We conclude by summarizing, in Section 13, the features of ARIES which provide flexibility and efficiency, and by describing the extensions and the current status of the implementations of ARIES. Besides presenting a new recovery method, by way of motivation for our work, we also describe some previously unpublished aspects of recovery in System R. For comparison purposes, we also do a survey of the recovery methods used by other WAL-based systems and collect information appearing in several publications, many of which are not widely available. One of our aims in this paper is to show the intricate and unobvious interactions resulting from the different choices made for the recovery technique, the granularity of locking and the storage management scheme. One cannot make arbitrarily independent choices for these and still expect the combination to function together correctly and efficiently. This point needs to be emphasized as it is not always dealt with adequately in most papers and books on concurrency control and recovery. In this paper, we have tried to cover, as much as possible, all the interesting recovery-related problems that one encounters in building and operating an industrial-strength transaction system.
2. GOALS
This section lists the goals of our work and outlines the difficulties involved in designing a recovery method that supports the features that we aimed for. The goals relate to the metrics for comparison of recovery methods that we discussed earlier, in Section 1.1.
Simplicity. Concurrency and recovery are complex subjects to think about and program for, compared with other aspects of data management. The algorithms are bound to be error-prone if they are complex. Hence, we strived for a simple, yet powerful and flexible, algorithm. Although this paper is long because of its discussion of numerous problems that are mostly ignored elsewhere, the main algorithm itself is simple. The overview in Section 3 gives the reader that feeling.
Operation logging. The recovery method had to permit operation logging (and value logging) so that semantically rich lock modes could be supported. This would let one transaction modify the same data that was modified earlier by another transaction which has not yet committed, when the two transactions' actions are semantically compatible (e.g., increment/decrement operations; see [2, 26, 45, 88]). As should be clear, recovery methods which always perform value or state logging (i.e., logging before-images and after-images of modified data) cannot support operation logging. This includes systems that do very physical, byte-oriented logging of all changes to a page [6, 76, 81]. The difficulty in supporting operation logging is that we need to track precisely, using a concept like the LSN, the exact state of a page with respect to logged actions relating to that page. An undo or a redo of an update should not be performed without being sure that the original update is present or is not present, respectively. This also means that, if one or more transactions that had previously modified a page start rolling back, then we need to know precisely how the page has been affected during the rollbacks and how much of each of the rollbacks had been accomplished so far. This requires that updates performed during rollbacks also be logged via the so-called compensation log records (CLRs). The LSN concept lets us avoid attempting to redo an operation when the operation's effect is already present in the page. It also lets us avoid attempting to undo an operation when the operation's effect is not present in the page. Operation logging lets us perform, if found desirable, logical logging, which means that not everything that was changed on a page needs to be logged explicitly, thereby saving log space. For example, changes of control information, like the amount of free space on the page, need not be logged. The redo and the undo operations can be performed logically.
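The interplay of operation logging and the LSN can be illustrated with a minimal sketch. The names `Page`, `OpRecord` and `redo` are hypothetical and are not taken from the paper; the sketch assumes a page holding a single counter that is updated only through logged increment/decrement operations.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_lsn: int = 0   # LSN of the latest logged update applied to this page
    counter: int = 0    # a field changed only by increment/decrement operations


@dataclass
class OpRecord:
    lsn: int
    delta: int          # operation logging: the amount, not before/after images


def redo(page: Page, rec: OpRecord) -> bool:
    """Reapply rec only if its effect is not yet present in the page."""
    if page.page_lsn >= rec.lsn:
        return False    # effect already present; redoing would apply it twice
    page.counter += rec.delta
    page.page_lsn = rec.lsn
    return True


log = [OpRecord(1, +5), OpRecord(2, -2), OpRecord(3, +10)]
# Assume the page reached nonvolatile storage after the update with LSN 2.
page = Page(page_lsn=2, counter=3)
applied = [rec.lsn for rec in log if redo(page, rec)]
```

Because an increment is not idempotent, the comparison against `page_lsn` is what keeps the first two operations from being applied a second time; only the record with LSN 3 is redone.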
Flexible storage management. Efficient support for the storage and manipulation of varying length data is important. In contrast to systems like IMS, the intent here is to be able to avoid the need for off-line reorganization of the data to garbage collect any space that might have been freed up because of deletions and updates that caused data shrinkage. It is desirable that the recovery method and the concurrency control method be such that the logging and locking is logical in nature so that movements of the data within a page for garbage collection reasons do not cause the moved data to be locked or the movements to be logged. For an index, this also means that one transaction must be able to split a leaf page even if that page currently has some uncommitted data inserted by another transaction. This may lead to problems in performing page-oriented undos using the log; logical undos may be necessary. Further, we would like to be able to let a transaction that has freed up some space be able to use, if necessary, that space during its later insert activity [50]. System R, for example, does not permit this in data pages.

Partial rollbacks. The recovery method had to support the concept of savepoints and rollbacks to savepoints (i.e., partial rollbacks). This is crucial for handling, in a user-friendly fashion (i.e., without requiring a total rollback of the transaction), integrity constraint violations (see [1, 31]), and problems arising from using obsolete cached information (see [49]).

Flexible buffer management. The recovery method should make the least number of restrictive assumptions about the buffer management policies (steal, force, etc.) in effect. At the same time, the method must be able to take advantage of the characteristics of any specific policy that is in effect (e.g., with a force policy there is no need to perform any redos for committed transactions). This flexibility could result in increased concurrency, decreased I/Os and efficient usage of buffer storage. Depending on the policies, the work that needs to be performed during restart recovery after a system failure or during media recovery may be more or less complex. Even with large main memories, it must be noted that a steal policy is still very desirable. This is because, with a no-steal policy, a page may never get written to nonvolatile storage if the page always contains uncommitted updates due to fine-granularity locking and overlapping transactions' updates to that page. The situation would be further aggravated if there are long-running transactions. Under those conditions, either the system would have to frequently reduce concurrency by quiescing all activities on the page (i.e., by locking all the objects on the page) and then writing the page to nonvolatile storage, or do nothing special and then pay a huge restart redo recovery cost if the system were to fail. Also, a no-steal policy incurs additional bookkeeping overhead to track whether a page contains any uncommitted updates. We believe that, given our goal of supporting semantically rich lock modes, partial rollbacks and varying length objects efficiently, in the general case, we need to perform undo logging and in-place updating. Hence, methods like the transaction workspace model of AIM [46] are not general enough for our purposes. Other problems relating to no-steal are discussed in Section 11 with reference to IMS Fast Path.

Recovery independence. It should be possible to image copy (archive dump), and perform media recovery or restart recovery at different granularities, rather than only at the entire database level. The recovery of one object should not force the concurrent or lock-step recovery of another object. Contrast this with what happens in the shadow page technique as implemented in System R, where index and space management information are recovered lock-step with user and catalog table (relation) data by starting from an internally consistent state of the whole database and redoing changes to all the related objects of the database simultaneously, as in normal processing. Recovery independence means that, during the restart recovery of some object, catalog information in the database cannot be accessed for descriptors of that object and its related objects, since that information itself may be undergoing recovery in parallel with the object being recovered and the two may be out of synchronization [14]. During restart recovery, it should be possible to do selective recovery and defer recovery of some objects to a later point in time to speed up restart and also to accommodate some offline devices. Page-oriented recovery means that even if one page in the database is corrupted because of a process failure or a media problem, it should be possible to recover that page alone. To be able to do this efficiently, we need to log every page's change individually, even if the object being updated spans multiple pages and the update affects more than one page. This, in conjunction with the writing of CLRs for updates performed during rollbacks, will make media recovery very simple (see Section 8). This will also permit image copying of different objects to be performed independently and at different frequencies.

Logical undo. This relates to the ability, during undo, to affect a page that is different from the one modified during forward processing, as is needed in the earlier-mentioned context of the split by one transaction of an index page containing uncommitted data of another transaction. Being able to perform logical undos allows higher levels of concurrency to be supported, especially in search structures [57, 59, 62]. If logging is not performed during rollback processing, logical undos would be very difficult to support, if we also desired recovery independence and page-oriented recovery. System R and SQL/DS support logical undos, but at the expense of recovery independence.

Parallelism and fast recovery. With multiprocessors becoming very common and greater data availability becoming increasingly important, the recovery method has to be able to exploit parallelism during the different stages of restart recovery and during media recovery. It is also important that the recovery method be such that recovery can be very fast, if in fact a hot-standby approach is going to be used (a la IBM's IMS/VS and Tandem's NonStop [4, 37]). This means that redo processing and, whenever possible, undo processing should be page-oriented (cf. always logical redos and undos in System R and SQL/DS for indexes and space management). It should also be possible to let the backup system start processing new transactions, even before the undo processing for the interrupted transactions completes. This is necessary because undo processing may take a long time if there were long update transactions.

Minimal overhead. Our goal is to have good performance both during normal and restart recovery processing. The overhead (log data volume, etc.) imposed by the recovery method in virtual and nonvolatile storages for accomplishing the above goals should be minimal. Contrast this with the space overhead caused by the shadow page technique. This goal also implied that we should minimize the number of pages that are modified (dirtied) during restart. The idea is to reduce the number of pages that have to be written back to nonvolatile storage and also to reduce CPU overhead. This rules out methods which, during restart recovery, first undo some committed changes that had already reached the nonvolatile storage before the failure and then redo them (see, e.g., [16, 21, 72, 78, 88]). It also rules out methods in which updates that are not present in a page on nonvolatile storage are undone unnecessarily (see, e.g., [41, 71, 88]). The method should not cause deadlocks involving transactions that are already rolling back. Further, the writing of CLRs should not result in an unbounded number of log records having to be written for a transaction because of the undoing of CLRs, if there were nested rollbacks or repeated system failures during rollbacks. It should also be possible to take checkpoints and image copies without quiescing significant activities in the system. The impact of these operations on other activities should be minimal. To contrast, checkpointing and image copying in System R cause major perturbations in the rest of the system [31].

As the reader will have realized by now, some of these goals are contradictory. Based on our knowledge of the features of different developers' existing systems, experiences with existing transaction systems and contacts with customers, we made the necessary tradeoffs. We were keen on learning from the past successes and mistakes involving many prototypes and products.
3. OVERVIEW OF ARIES
The aim of this section is to provide a brief overview of the new recovery method ARIES, which satisfies quite reasonably the goals that we set forth in Section 2. Issues like deferred and selective restart, parallelism during restart recovery, and so on will be discussed in the later sections of the paper.

ARIES guarantees the atomicity and durability properties of transactions in the face of process, transaction, system and media failures. For this purpose, ARIES keeps track of the changes made to the database by using a log and it does write-ahead logging (WAL). Besides logging, on a per-affected-page basis, update activities performed during forward processing of transactions, ARIES also logs, typically using compensation log records (CLRs), updates performed during partial or total rollbacks of transactions during both normal and restart processing. Figure 3 gives an example of a partial rollback in which a transaction, after performing three updates, rolls back two of them and then starts going forward again. Because of the undo of the two updates, two CLRs are written. In ARIES, CLRs have the property that they are redo-only log records. By appropriate chaining of the CLRs to log records written during forward processing, a bounded amount of logging is ensured during rollbacks, even in the face of repeated failures during restart or of nested rollbacks. This is to be contrasted with what happens in IMS, which may undo the same non-CLR multiple times, and in AS/400, DB2 and NonStop SQL, which, besides undoing the same non-CLR multiple times, may also undo CLRs one or more times (see Figure 4). These have caused severe problems in real-life customer situations.

In ARIES, as Figure 5 shows, when the undo of a log record causes a CLR to be written, the CLR, besides containing a description of the compensating action for redo purposes, is made to contain the UndoNxtLSN pointer which points to the predecessor of the just undone log record. The predecessor information is readily available since every log record, including a CLR, contains the PrevLSN pointer which points to the most recent preceding log record written by the same transaction. The UndoNxtLSN pointer allows us to determine precisely how much of the transaction has not been undone so far. In Figure 5, log record 3', which is the CLR for log record 3, points to log record 2, which is the predecessor of log record 3. Thus, during rollback, the UndoNxtLSN field of the most recently written CLR keeps track of the progress of rollback. It tells the system from where to continue the rollback of the transaction, if a system failure were to interrupt the completion of the rollback or if a nested rollback were to be performed. It lets the system bypass those log records that had already been undone. Since CLRs are available to describe what actions are actually performed during the undo of an original action, the undo action need not be, in terms of which page(s) is affected, the exact inverse of the original action. That is, logical undo which allows very high concurrency to be supported is made possible. For example, a key inserted on page 10 of a B-tree by one transaction may be moved to page 20 by another transaction before the key insertion is committed. Later, if the first transaction were to roll back, then the key will be located on page 20 by retraversing the tree and deleted from there. A CLR will be written to describe the key deletion on page 20. This permits page-oriented redo which is very efficient. ARIES/LHS and ARIES/IM, which exploit this logical undo feature, are described in [59, 62].

Fig. 3. Partial rollback example. After performing 3 actions, the transaction performs a partial rollback by undoing actions 3 and 2, writing the compensation log records 3' and 2', and then starts going forward again and performs actions 4 and 5.

Fig. 4. (Diagram not recoverable from the source; it shows the log before a failure and during restart.)

ARIES uses a single LSN on each page to track the page's state. Whenever a page is updated and a log record is written, the LSN of the log record is placed in the page_LSN field of the updated page. This tagging of the page with the LSN allows ARIES to precisely track, for restart- and media-recovery purposes, the state of the page with respect to logged updates for that page. It allows ARIES to support novel lock modes, using which, before an update performed on a record's field by one transaction is committed, another transaction may be permitted to modify the same data for specified operations.

Periodically during normal processing, ARIES takes checkpoints. The checkpoint log records identify the transactions that are active, their states, and the LSNs of their most recently written log records, and also the modified data (dirty data) that is in the buffer pool. The latter information is needed to determine from where the redo pass of restart recovery should begin its processing.
Fig. 5. (Diagram not recoverable from the source; it shows the log before a failure and during restart.)

During restart recovery, ARIES first scans the log, starting from the first record of the last checkpoint, up to the end of the log. During this analysis pass, information about dirty pages and transactions that were in progress at the time of the checkpoint is brought up to date as of the end of the log. The analysis pass uses the dirty pages information to determine the starting point (RedoLSN) for the log scan of the immediately following redo pass. The analysis pass also determines the list of transactions that are to be rolled back in the undo pass. For each in-progress transaction, the LSN of the most recently written log record will also be determined.

Then, during the redo pass, ARIES repeats history, with respect to those updates logged on stable storage, but whose effects on the database pages did not get reflected on nonvolatile storage before the failure of the system. This is done for the updates of all transactions, including the updates of those transactions that had neither committed nor reached the in-doubt state of two-phase commit by the time of the system failure (i.e., even the missing updates of the so-called loser transactions are redone). This essentially reestablishes the state of the database as of the time of the system failure. A log record's update is redone if the affected page's page_LSN is less than the log record's LSN. No logging is performed when updates are redone. The redo pass obtains the locks needed to protect the uncommitted updates of those distributed transactions that will remain in the in-doubt (prepared) state [63, 64] at the end of restart recovery.

Fig. 6. Restart processing in different methods (panels include DB2, IMS (FP updates) and ARIES).

The next log pass is the undo pass during which all loser transactions' updates are rolled back, in reverse chronological order, in a single sweep of the log. This is done by continually taking the maximum of the LSNs of the next log record to be processed for each of the yet-to-be-completely-undone loser transactions, until no transaction remains to be undone. Unlike during the redo pass, performing undos is not a conditional operation during the undo pass (and during normal undo). That is, ARIES does not compare the page_LSN of the affected page to the LSN of the log record to decide whether or not to undo the update. When a non-CLR is encountered for a transaction during the undo pass, if it is an undo-redo or undo-only log record, then its update is undone. In any case, the next record to process for that transaction is determined by looking at the PrevLSN of that non-CLR. Since CLRs are never undone (i.e., CLRs are not compensated; see Figure 5), when a CLR is encountered during undo, it is used just to determine the next log record to process by looking at the UndoNxtLSN field of the CLR.

For those transactions which were already rolling back at the time of the system failure, ARIES will roll back only those actions that had not already been undone. This is possible since history is repeated for such transactions and since the last CLR written for each transaction points (directly or indirectly) to the next non-CLR record that is to be undone. The net result is that, if only page-oriented undos are involved or logical undos generate only CLRs, then, for rolled back transactions, the number of CLRs written will be exactly equal to the number of undoable log records written during forward processing of those transactions. This will be the case even if there are repeated failures during restart or if there are nested rollbacks.
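The undo pass and the bound on the number of CLRs can be sketched as follows. This is a simplified illustration, not the paper's routine: the names `Rec`, `analysis` and `undo_pass` are hypothetical, LSNs are assumed to be consecutive integers starting at 1, every transaction in the log is assumed to be a loser, and no page contents are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rec:
    lsn: int
    trans: str
    kind: str              # "update" (undoable non-CLR) or "clr"
    prev_lsn: int          # 0 means the transaction has no earlier record
    undo_nxt_lsn: int = 0  # meaningful only for CLRs


def analysis(log: List[Rec]) -> Dict[str, int]:
    """For each transaction, find the next record to process during undo."""
    nxt: Dict[str, int] = {}
    for r in log:  # the log is assumed to be in LSN order
        nxt[r.trans] = r.undo_nxt_lsn if r.kind == "clr" else r.lsn
    return nxt


def undo_pass(log: List[Rec], losers: Dict[str, int],
              crash_after: Optional[int] = None) -> int:
    """Undo losers in one backward sweep; return the number of CLRs written.

    crash_after simulates a system failure after that many CLRs.
    """
    by_lsn = {r.lsn: r for r in log}
    last_lsn = {r.trans: r.lsn for r in log}
    todo = {t: lsn for t, lsn in losers.items() if lsn != 0}
    written = 0
    while todo:
        trans = max(todo, key=todo.get)      # largest outstanding LSN first
        rec = by_lsn[todo[trans]]
        if rec.kind == "clr":
            nxt = rec.undo_nxt_lsn           # CLRs are never undone, only followed
        else:
            clr = Rec(len(log) + 1, trans, "clr", last_lsn[trans], rec.prev_lsn)
            log.append(clr)
            by_lsn[clr.lsn] = clr
            last_lsn[trans] = clr.lsn
            written += 1
            nxt = rec.prev_lsn
        if nxt == 0:
            del todo[trans]
        else:
            todo[trans] = nxt
        if crash_after is not None and written >= crash_after:
            break
    return written


log = [Rec(1, "T1", "update", 0), Rec(2, "T2", "update", 0),
       Rec(3, "T1", "update", 1), Rec(4, "T1", "update", 3)]
first = undo_pass(log, analysis(log), crash_after=2)  # failure during restart
second = undo_pass(log, analysis(log))                # restart again
```

In this run the first attempt writes CLRs for records 4 and 3 before the simulated failure; the second attempt picks up from the UndoNxtLSN of the last CLR and writes CLRs only for records 2 and 1. Four undoable records therefore yield four CLRs in total, despite the interruption.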
4. DATA STRUCTURES
This section describes the major data structures that are used by ARIES.

4.1 Log Records
Below, we describe the important fields that may be present in different types of log records.

LSN. Address of the first byte of the log record in the ever-growing log address space. This is a monotonically increasing value. This is shown here as a field only to make it easier to describe ARIES. The LSN need not actually be stored in the record.

Type. Indicates whether this is a compensation record ("compensation"), a regular update record ("update"), a commit protocol-related record (e.g., "prepare"), or a nontransaction-related record.

TransID. Identifier of the transaction, if any, that wrote the log record.

PrevLSN. LSN of the preceding log record written by the same transaction. This field has a value of zero in nontransaction-related records and in the first log record of a transaction, thus avoiding the need for an explicit begin transaction log record.

PageID. Present only in records of type "update" or "compensation". The identifier of the page to which the updates of this record were applied. This PageID will normally consist of two parts: an objectID (e.g., tablespaceID), and a page number within that object. ARIES can deal with a log record that contains updates for multiple pages. For ease of exposition, we assume that only one page is involved.

UndoNxtLSN. Present only in CLRs. It is the LSN of the next log record of this transaction that is to be processed during rollback. That is, UndoNxtLSN is the value of PrevLSN of the log record that the current log record is compensating. If there are no more log records to be undone, then this field contains a zero.

Data. This is the redo and/or undo data that describes the update that was performed. CLRs contain only redo information since they are never undone. Updates can be logged in a logical fashion. Changes to some fields (e.g., amount of free space) of that page need not be logged since they can be easily derived. The undo information and the redo information for the entire object need not be logged. It suffices if the changed fields alone are logged. For increment or decrement types of operations, before and after-images of the field are not needed. Information about the type of operation and the decrement or increment amount is enough. The information here would also be used to determine the appropriate action routine to be used to perform the redo and/or undo of this log record.

4.2 Page Structure
One of the fields in every page of the database is the page_LSN field. It contains the LSN of the log record that describes the latest update to the page. This record may be a regular update record or a CLR. ARIES expects the buffer manager to enforce the WAL protocol. Except for this, ARIES does not place any restrictions on the buffer page replacement policy. The steal buffer management policy may be used. In-place updating is performed on nonvolatile storage. Updates are applied immediately and directly to the buffer version of the page containing the object. That is, no deferred updating as in INGRES [86] is performed. If it is found desirable, deferred updating and, consequently, deferred logging can be implemented. ARIES is flexible enough not to preclude those policies from being implemented.
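The log record fields of Section 4.1 can be sketched as a small data type. The Python names below (`LogRecord`, `RecType`, `compensate`) and the layout of the `data` dictionary are assumptions made for illustration; the paper prescribes the fields, not this representation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class RecType(Enum):
    UPDATE = "update"
    COMPENSATION = "compensation"
    PREPARE = "prepare"


@dataclass(frozen=True)
class LogRecord:
    lsn: int                                     # shown as a field for exposition only
    type: RecType
    trans_id: Optional[str]                      # None for nontransaction-related records
    prev_lsn: int                                # 0 for a transaction's first record
    page_id: Optional[Tuple[str, int]] = None    # (objectID, page number)
    undo_nxt_lsn: Optional[int] = None           # present only in CLRs
    data: Optional[dict] = None                  # hypothetical {"redo": ..., "undo": ...}


def compensate(rec: LogRecord, clr_lsn: int, last_lsn: int) -> LogRecord:
    """Build the CLR for a page-oriented undo of an update record.

    A logical undo could name a different page; that case is not modeled here.
    """
    assert rec.type is RecType.UPDATE
    return LogRecord(
        lsn=clr_lsn,
        type=RecType.COMPENSATION,
        trans_id=rec.trans_id,
        prev_lsn=last_lsn,                   # latest record of the same transaction
        page_id=rec.page_id,
        undo_nxt_lsn=rec.prev_lsn,           # predecessor of the record being undone
        data={"redo": rec.data["undo"]},     # CLRs carry redo information only
    )


first = LogRecord(10, RecType.UPDATE, "T1", 0, ("tablespace1", 7),
                  data={"redo": {"op": "add", "amount": 5},
                        "undo": {"op": "add", "amount": -5}})
second = LogRecord(11, RecType.UPDATE, "T1", 10, ("tablespace1", 7),
                   data={"redo": {"op": "add", "amount": 2},
                         "undo": {"op": "add", "amount": -2}})
clr = compensate(second, clr_lsn=12, last_lsn=11)
```

Here the CLR's PrevLSN (11) keeps the per-transaction chain intact, while its UndoNxtLSN (10) skips over the record that has just been undone.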
4.3 Transaction Table
A table called the transaction table is used during restart recovery to track the state of active transactions. The table is initialized during the analysis pass from the most recent checkpoint's record(s) and is modified during the analysis of the log records written after the beginning of that checkpoint. During the undo pass, the entries of the table are also modified. If a checkpoint is taken during restart recovery, then the contents of the table will be included in the checkpoint record(s). The same table is also used during normal processing by the transaction manager. A description of the important fields of the transaction table follows:

TransID. Transaction ID.

State. Commit state of the transaction: prepared ("P", also called in-doubt) or unprepared ("U").

LastLSN. The LSN of the latest log record written by the transaction.

UndoNxtLSN. The LSN of the next record to be processed during rollback. If the most recent log record written or seen for this transaction is an undoable non-CLR log record, then this field's value will be set to the LSN of that log record. If that most recent log record is a CLR, then this field's value is set to the UndoNxtLSN value from that CLR.
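The maintenance rule for the transaction table fields can be sketched as follows. `TransEntry` and `note_record` are hypothetical names, and leaving UndoNxtLSN unchanged for a prepare record is an assumption of this sketch rather than something stated above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TransEntry:
    state: str = "U"        # "P" = prepared (in-doubt), "U" = unprepared
    last_lsn: int = 0
    undo_nxt_lsn: int = 0


def note_record(table: Dict[str, TransEntry], trans_id: str, lsn: int,
                kind: str, undo_nxt_lsn: int = 0) -> None:
    """Update the table for one log record written or seen for trans_id."""
    entry = table.setdefault(trans_id, TransEntry())
    entry.last_lsn = lsn
    if kind == "compensation":
        entry.undo_nxt_lsn = undo_nxt_lsn   # take the value carried by the CLR
    elif kind == "update":
        entry.undo_nxt_lsn = lsn            # an undoable non-CLR record
    elif kind == "prepare":
        entry.state = "P"                   # assumption: UndoNxtLSN stays as it was


table: Dict[str, TransEntry] = {}
note_record(table, "T1", 1, "update")
note_record(table, "T1", 2, "update")
note_record(table, "T1", 3, "compensation", undo_nxt_lsn=1)  # CLR undoing LSN 2
note_record(table, "T2", 4, "update")
note_record(table, "T2", 5, "prepare")
```

After these calls, T1's entry says that rollback should continue at LSN 1, and T2 is in the prepared state with LSN 4 as the next record to undo should it later be rolled back.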
4.4 Dirty_Pages Table
A table called the dirty_pages table is used to represent information about dirty buffer pages during normal processing. This table is also used during restart recovery. The actual implementation of this table may be done using hashing or via the deferred-writes queue mechanism of [96]. Each entry in the table consists of two fields: PageID and RecLSN (recovery LSN). During normal processing, when a nondirty page is being fixed in the buffers with the intention to modify, the buffer manager records in the buffer pool (BP) dirty_pages table, as RecLSN, the current end-of-log LSN, which will be the LSN of the next log record to be written. The value of RecLSN indicates from what point in the log there may be updates which are, possibly, not yet in the nonvolatile storage version of the page. Whenever pages are written back to nonvolatile storage, the corresponding entries in the BP dirty_pages table are removed. The contents of this table are included in the checkpoint record(s) that is written during normal processing. The restart dirty_pages table is initialized from the latest checkpoint's record(s) and is modified during the analysis of the other records during the analysis pass. The minimum RecLSN value in the table gives the starting point for the redo pass during restart recovery.
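A minimal sketch of the dirty_pages bookkeeping is given below, assuming a plain dictionary keyed by PageID (the text allows hashing or other implementations). The class and method names are hypothetical.

```python
from typing import Dict, Optional


class DirtyPages:
    """Buffer pool dirty_pages table: PageID -> RecLSN."""

    def __init__(self) -> None:
        self.rec_lsn: Dict[str, int] = {}

    def fix_for_update(self, page_id: str, end_of_log_lsn: int) -> None:
        # Record RecLSN only when a nondirty page is about to be modified;
        # a page that is already dirty keeps its earlier RecLSN.
        self.rec_lsn.setdefault(page_id, end_of_log_lsn)

    def written_back(self, page_id: str) -> None:
        # The nonvolatile version is now current, so the entry is dropped.
        self.rec_lsn.pop(page_id, None)

    def redo_start(self) -> Optional[int]:
        # Minimum RecLSN: the starting point for the redo pass, if any.
        return min(self.rec_lsn.values(), default=None)


dp = DirtyPages()
dp.fix_for_update("P1", 10)
dp.fix_for_update("P2", 15)
dp.fix_for_update("P1", 20)      # P1 is still dirty; RecLSN stays 10
start_before = dp.redo_start()
dp.written_back("P1")
start_after = dp.redo_start()
```

Writing P1 back moves the redo starting point forward from 10 to 15, which shows how frequent page writes shorten the log scan needed at restart.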
5. NORMAL PROCESSING
This section discusses the actions that are performed as part of normal transaction processing. Section 6 discusses the actions that are performed as part of recovering from a system failure.
5.1 Updates
During normal processing, transactions may be in forward processing, partial rollback or total rollback. The rollbacks may be system- or application-initiated. The causes of rollbacks may be deadlocks, error conditions, integrity constraint violations, unexpected database state, etc.

If the granularity of locking is a record, then, when an update is to be performed on a record in a page, after the record is locked, that page is fixed in the buffer and latched in the X mode, the update is performed, a log record is appended to the log, the LSN of the log record is placed in the page_LSN field of the page and in the transaction table, and the page is unlatched and unfixed. The page latch is held during the call to the logger. This is done to ensure that the order of logging of updates of a page is the same as the order in which those updates are performed on the page. This is very important if some of the redo information is going to be logged physically (e.g., the amount of free space in the page) and repetition of history has to be guaranteed for the physical redo to work correctly. The page latch must be held during read and update operations to ensure physical consistency of the page contents. This is necessary because inserters and updaters of records might move records around within a page to do garbage collection. When such garbage collection is going on, no other transaction should be allowed to look at the page since they might get confused. Readers of pages latch in the S mode and modifiers latch in the X mode.

The data page latch is not held while any necessary index operations are performed. At most two page latches are held simultaneously (also see [57, 62]). This means that two transactions, T1 and T2, that are modifying different pieces of data may modify a particular data page in one order (T1, T2) and a particular index page in another order (T2, T1).4 This scenario is impossible in System R and SQL/DS since in those systems, locks, instead of latches, are used for providing physical consistency. Typically, all the (physical) page locks are released only at the end of the RSS (data manager) call. A single RSS call deals with modifying the data and all relevant indexes. This may involve waiting for many I/Os and locks. This means that deadlocks involving (physical) page locks alone or (physical) page locks and (logical) record/key locks are possible. They have been a major problem in System R and SQL/DS.

4 The situation gets very complicated if operations like increment/decrement are supported with high concurrency lock modes and indexes are allowed to be defined on fields on which such operations are supported. We are currently studying those situations.

Figure 7 depicts a situation at the time of a system failure which followed the commit of two transactions. The dotted lines show how up to date the states of pages P1 and P2 are on nonvolatile storage with respect to logged updates of those pages. During restart recovery, it must be realized that the most recent log record written for P1, which was written by a transaction which later committed, needs to be redone, and that there is nothing to be redone for P2. This situation points to the need for having the LSN to relate the state of a page on nonvolatile storage to a particular position in the log and the need for knowing where the restart redo pass should begin by noting some information in the checkpoint record (see Section 5.4). For the example scenario, the restart redo log scan should begin at least from the log record representing the most recent update of P1 by T2, since that update needs to be redone.

It is not assumed that a single log record can always accommodate all the information needed to redo or undo the update operation. There may be instances when more than one record needs to be written for this purpose.
For example, one record may be written with the undo information and another one with the redo information. In such cases, (1) the undo-only log record should be written before the redo-only log record is written, and (2) it is the LSN of the redo-only log record field. The first condition is enforced situation in which the redo-only written to stable storage the redo of that redo-only history feature) only that should be placed in the page.LSN to make sure that we do not have and not the undo-only restart of the record recovery, repeating record to a
record
gets
before a failure, and that during log record is performed (because later that there isnt
to realize
an undo-only
undo the effect of that operation. Given that the undo-only record is written before the redo-only record, the second condition ensures that we do not have a situation in which even though the page in nonvolatile storage already contains the unnecessarily the undo-only redo could update during record of the redo-only record, that same update gets redone restart recovery because the page contained the L SN of instead of that of the redo-only record. This unnecessary problems if operation logging is being performed. that etc. during forward processing free space inventory update,
cause
integrity
There may be some log records written during forward processing that cannot or should not be undone (prepare, free space inventory update, etc. records). These are identified as redo-only log records. See Section 10.3 for a discussion of this kind of situation for free space inventory updates.

Sometimes, the identity of the (data) record to be modified or read may not be known before a (data) page is examined. For example, during an insert, the record ID is not determined until the page is examined to find an empty slot. In such cases, the record lock must be obtained after the page is latched. To avoid waiting for a lock while holding a latch, which could lead to an undetected deadlock, the lock is requested conditionally, and if it is not granted, then the latch is released and the lock is requested unconditionally. Once the unconditionally requested lock is granted, the page is latched again, and any previously verified conditions are rechecked. This rechecking is required because, after the page was unlatched, the conditions could have changed. The page_LSN value at the time of unlatching could be remembered to detect quickly, on relatching, if any changes could possibly have occurred. If the conditions are found to be still satisfied for performing the update, it is performed as described above. Otherwise, corrective actions are taken. If the conditionally requested lock is granted immediately, then the update can proceed as before.

Fig. 7. Database state as of a failure.

If the granularity of locking is a page or something coarser than a page, then there is no need to perform any latching of the page, since the lock on the page will be sufficient to isolate the executing transaction. Except for this change, the actions taken during normal processing are the same as with record locking. But, if the system is to support unlocked or dirty reads, then, even with page locking, the transaction that is updating a page should be made to hold the X latch on the page so that readers who are not acquiring locks are assured physical consistency if they hold an S latch while reading the page. Unlocked reads may also be performed by the image copy utility in the interest of causing the least amount of interference to normal transaction processing.

Applicability of ARIES is not restricted to only those systems in which locking is used as the concurrency control mechanism. Even other concurrency control schemes that are similar to locking could be used with ARIES.
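The conditional-then-unconditional lock request described above can be sketched as follows. This is a minimal illustration, not the paper's code: `Page`, `lock_record_under_latch` and the use of `threading.Lock` for both the latch and the record lock are assumptions made for the example.

```python
import threading

class Page:
    """Hypothetical page: one lock stands in for the X latch."""
    def __init__(self):
        self.latch = threading.Lock()
        self.page_lsn = 0

def lock_record_under_latch(page, record_lock):
    """Caller holds page.latch. Returns True if earlier checks must be redone."""
    # Conditional request: never wait for a lock while holding a latch.
    if record_lock.acquire(blocking=False):
        return False                      # granted immediately, proceed as before
    lsn_at_unlatch = page.page_lsn        # remember page_LSN before unlatching
    page.latch.release()
    record_lock.acquire()                 # unconditional request, may wait
    page.latch.acquire()                  # relatch the page
    # A changed page_LSN means the page was updated while unlatched.
    return page.page_lsn != lsn_at_unlatch
```

Comparing the remembered page_LSN is only a quick filter: if it is unchanged, the previously verified conditions still hold and the recheck can be skipped.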
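5.2 Total or Partial Rollbacks

To provide flexibility in limiting the extent of transaction rollbacks, the notion of a savepoint is supported. At any point during the execution of a transaction, a savepoint can be established. Any number of savepoints could be outstanding at a point in time. Typically, in a system like DB2, a savepoint is established before every SQL data manipulation command that might perform updates to the data. This is needed to support SQL statement-level atomicity. After executing for a while, the transaction or the system can request the undoing of all the updates performed after the establishment of a still outstanding savepoint. After such a partial rollback, the transaction can continue execution and start going forward again (see Figure 3). A particular savepoint is no longer outstanding if a rollback has been performed to that savepoint or to a preceding one. When a savepoint is established, the LSN of the latest log record written by the transaction, called SaveLSN, is remembered in virtual storage. If the savepoint is being established at the beginning of the transaction (i.e., when it has not yet written a log record), SaveLSN is set to zero. When the transaction desires to roll back to a savepoint, it supplies the remembered SaveLSN. If the savepoint concept were to be exposed at the user level, then we would expect the system not to expose the SaveLSNs to the user but to use some symbolic values or sequence numbers and do the mapping to LSNs internally, as is done in IMS [42] and INGRES [18].

Figure 8 describes the routine ROLLBACK which is used for rolling back to a savepoint. The input to the routine is the SaveLSN and the TransID. No locks are acquired during rollback, even though a latch is acquired during undo activity on a page. Since we have always ensured that latches do not get involved in deadlocks, a rolling back transaction cannot get involved in a deadlock, as in System R and R* [31, 64] and in the algorithms of [100]. During the rollback, the log records are undone in reverse chronological order and, for each log record that is undone, a CLR is written. For ease of exposition, assume that all the information about the undo action will fit in a single CLR. It is easy to extend ARIES to the case where multiple CLRs need to be written. It is possible that, when a logical undo is performed, some non-CLRs are sometimes written. As mentioned before, when a CLR is written, its UndoNxtLSN field is made to contain the PrevLSN value in the log record whose undo caused this CLR to be written. Since CLRs will never be undone, they don't have to contain undo information (e.g., before-images). Redo-only log records are ignored during rollback. When a non-CLR is encountered, after it is processed, the next record to process is determined by looking up its PrevLSN field. When a CLR is encountered during rollback, the UndoNxtLSN field of that record is looked up to determine the next log record to be processed. Thus, the UndoNxtLSN pointer helps us skip over already undone log records. This means that if a nested rollback were to occur, then, because of the UndoNxtLSN in CLRs, during the second rollback none of the log records that were undone during the first rollback would be processed again. Even though Figures 4, 5 and 13 describe partial rollback scenarios in conjunction with restart undos in the various recovery methods, it should be easy to see how nested rollbacks are handled efficiently by ARIES.

Being able to describe, via CLRs, the actions performed during undo gives us the flexibility of not having to force the undo actions to be the exact inverses of the original actions. In particular, the undo action could affect a page which was not involved in the original action. Such logical undo situations are possible in, for example, index management and space management (see Section 10.3).

ARIES' guarantee of a bounded amount of logging during undo allows us to deal safely with small computer system situations in which a circular online log might be used and log space is at a premium. Knowing the bound, we can keep in reserve enough log space to be able to roll back all currently running transactions under critical conditions (e.g., log space shortage). The implementation of ARIES in the OS/2 Extended Edition Database Manager takes advantage of this.

When a transaction rolls back, the locks obtained after the establishment of the savepoint which is the target of the rollback may be released after the partial or total rollback is completed. In fact, systems like DB2 do not and cannot release any of the locks after a partial rollback because, after such a lock release, a later rollback may still cause the same updates to be undone again, thereby causing data inconsistencies. System R does release locks after a partial rollback completes. But, because ARIES never undoes CLRs nor ever undoes a particular non-CLR more than once, thanks to the chaining of the CLRs using the UndoNxtLSN field, during a (partial) rollback, when the transaction's very first update to a particular object is undone and a CLR is written for it, the system can release the lock on that object. This makes it possible to consider resolving deadlocks using partial rollbacks rather than always resorting to total rollbacks.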
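The CLR chaining described in this section can be sketched for a single transaction as follows. This is a simplified illustration under stated assumptions: `LogRec`, `TxnLog` and their fields are hypothetical names, LSNs are consecutive integers, and no pages are actually modified.

```python
from dataclasses import dataclass

@dataclass
class LogRec:
    lsn: int
    kind: str              # "update" or "CLR"
    prev_lsn: int          # previous record of this transaction (0 = none)
    undo_nxt_lsn: int = 0  # CLRs only: next record to undo

class TxnLog:
    """Hypothetical in-memory log of one transaction."""
    def __init__(self):
        self.recs = {}
        self.last_lsn = 0

    def _append(self, kind, undo_nxt_lsn=0):
        lsn = len(self.recs) + 1
        self.recs[lsn] = LogRec(lsn, kind, self.last_lsn, undo_nxt_lsn)
        self.last_lsn = lsn
        return lsn

    def update(self):
        return self._append("update")

    def rollback(self, save_lsn):
        """Undo everything after save_lsn; return LSNs undone, in order."""
        undone = []
        nxt = self.last_lsn
        while nxt > save_lsn:
            rec = self.recs[nxt]
            if rec.kind == "CLR":
                nxt = rec.undo_nxt_lsn      # skip already undone records
            else:
                # The CLR points at the predecessor of the record it undoes.
                self._append("CLR", undo_nxt_lsn=rec.prev_lsn)
                undone.append(rec.lsn)
                nxt = rec.prev_lsn
        return undone
```

With four updates, a partial rollback to a savepoint taken after the second update, two more updates and then a total rollback, the second rollback skips the two updates that were already compensated, so every update ends up with exactly one CLR.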
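5.3 Transaction Termination

Assume that some form of two-phase commit protocol (e.g., Presumed Abort or Presumed Commit (see [63, 64])) is used to terminate transactions and that the prepare record, which is synchronously written to the log as part of the prepare processing, includes the list of update-type locks held by the transaction. The logging of the locks is done to ensure that if a system failure were to occur after a transaction enters the in-doubt state, then those locks could be reacquired, during restart recovery, to protect the uncommitted updates of the in-doubt transaction (see footnote 5). When the prepare record is written, the read locks (e.g., S and IS) could be released, if no new locks would be acquired later as part of getting into the prepare state in some other part of the distributed transaction (at the same site or a different site). To deal with actions (such as the dropping of objects) which may cause files to be erased, for the sake of avoiding the logging of such objects' complete contents, we postpone performing actions like erasing files until we are sure that the transaction is going to commit [19]. We need to log these pending actions in the prepare record.

Once a transaction enters the in-doubt state, it is committed by writing an end record and releasing its locks. Once the end record is written, if there are any pending actions, then they must be performed. For each pending action which involves erasing or returning a file to the operating system, we write an OSfile_return redo-only log record. For ease of exposition, we assume that this log record is not associated with any particular transaction and that this action does not take place when a checkpoint is in progress.

A transaction in the in-doubt state is rolled back by writing a rollback record, rolling back the transaction to its beginning, discarding the pending actions list, releasing its locks, and then writing the end record. Whether or not the rollback and end records are synchronously written will depend on the type of two-phase commit protocol used. Also, the writing of the prepare record may be avoided if the transaction is not a distributed one or is read-only.

Footnote 5: Another possibility is not to log the locks, but to regenerate the lock names during restart recovery by examining all the log records written by the in-doubt transaction; see Sections 6.1 and 6.4, and item 18 (Section 12) for further ramifications of this approach.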
5.4 Checkpoints

Periodically, checkpoints are taken to reduce the amount of work that needs to be performed during restart recovery. The work may relate to the extent of the log that needs to be examined, the number of data pages that have to be read from nonvolatile storage, etc. Checkpoints can be taken asynchronously (i.e., while transaction processing, including updates, is going on). Such a fuzzy checkpoint is initiated by writing a begin_chkpt record. Then the end_chkpt record is constructed by including in it the contents of the normal transaction table, the BP dirty_pages table, and any file mapping information for the objects (like tablespace, indexspace, etc.) that are open. Only for simplicity of exposition, we assume that all the information can be accommodated in a single end_chkpt record. It is easy to deal with the case where multiple records are needed to log this information. Once the end_chkpt record is constructed, it is written to the log. Once that record reaches stable storage, the LSN of the begin_chkpt record is stored in the master record, which is in a well-known place on stable storage. If a failure were to occur before the end_chkpt record migrates to stable storage, but after the begin_chkpt record migrates to stable storage, then that checkpoint is considered an incomplete checkpoint. Between the begin_chkpt and end_chkpt log records, transactions might have written other log records. If one or more transactions are likely to remain in the in-doubt state for a long time because of prolonged loss of contact with the commit coordinator, then it is a good idea to include in the end_chkpt record information about the update-type locks (e.g., X, IX and SIX) held by those transactions. This way, if a failure were to occur, then, during restart recovery, those locks could be reacquired without having to access the prepare records of those transactions.

Since latches may need to be acquired to read the dirty_pages table correctly while gathering the needed information, it is a good idea to gather the information a little at a time to reduce contention on the tables. For example, if the dirty_pages table has 1000 rows, during each latch acquisition 100 entries can be examined. If the already examined entries change before the end of the checkpoint, the recovery algorithms remain correct (see Figure 10). This is because, in computing the restart redo point, besides taking into account the minimum of the RecLSNs of the dirty pages included in the end_chkpt record, ARIES also takes into account the log records that were written by transactions since the beginning of the checkpoint. This is important because the effect of some of the updates that were performed since the initiation of the checkpoint might not be reflected in the dirty page list that is recorded as part of the checkpoint.

ARIES does not require that any dirty pages be forced to nonvolatile storage during a checkpoint. The assumption is that the buffer manager is, on a continuous basis, writing out dirty pages in the background using system processes. The buffer manager can batch the writes and write multiple pages in one I/O operation. [96] gives details about how DB2 manages its buffer pools in this fashion. Even if there are some hot-spot pages which are frequently modified, the buffer manager has to ensure that those pages are written to nonvolatile storage reasonably often to reduce restart redo work, just in case a system failure were to occur. To avoid the prevention of updates to such hot-spot pages during an I/O operation, the buffer manager could make a copy of each of those pages and perform the I/O from the copy. This minimizes the data unavailability time for writes.
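The order of events in a fuzzy checkpoint can be sketched as below. This is an illustrative sketch only: the list-based log, the `master` dictionary, and the `concurrent_work` hook (standing in for transactions that keep logging during the checkpoint) are assumptions of the example, and "reaching stable storage" is not modeled.

```python
def take_fuzzy_checkpoint(log, master, trans_table, dirty_pages,
                          concurrent_work=None):
    """Append begin_chkpt/end_chkpt to `log` (a list) and update `master`."""
    begin_lsn = len(log) + 1
    log.append({"lsn": begin_lsn, "type": "begin_chkpt"})

    # Transaction processing continues; other records may be logged here.
    if concurrent_work is not None:
        concurrent_work(log)

    # end_chkpt carries copies of the tables as gathered during the checkpoint.
    log.append({
        "lsn": len(log) + 1,
        "type": "end_chkpt",
        "trans_table": {t: dict(e) for t, e in trans_table.items()},
        "dirty_pages": dict(dirty_pages),
    })

    # Only after end_chkpt is stable is the master record pointed at the
    # begin_chkpt record; before that the checkpoint counts as incomplete.
    master["chkpt_lsn"] = begin_lsn
    return begin_lsn
```

Note that no dirty page is written by this routine; the checkpoint only records which pages might be dirty and from which RecLSN.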
6. RESTART PROCESSING

When the transaction system restarts after a failure, recovery needs to be performed to bring the data to a consistent state and ensure the atomicity and durability properties of transactions. Figure 9 describes the RESTART routine that gets invoked at the beginning of the restart of a failed system. The input to this routine is the LSN of the master record, which contains the pointer to the begin_chkpt record of the last complete checkpoint taken before site failure or shutdown. This routine invokes the routines for the analysis pass, the redo pass and the undo pass, in that order. The buffer pool dirty_pages table is updated appropriately. At the end of restart recovery, a checkpoint is taken.

For high availability, the duration of restart processing must be as short as possible. One way of accomplishing this is by exploiting parallelism during the redo and undo passes. Only if parallelism is going to be employed is it necessary to latch pages before they are modified during restart recovery. Ideas for improving data availability by allowing new transaction processing during recovery are explored in [60].
6.1 Analysis Pass

The first pass of the log that is made during restart recovery is the analysis pass. Figure 10 describes the RESTART_ANALYSIS routine that implements the analysis pass actions. The input to this routine is the LSN of the master record. The outputs of this routine are the transaction table, which contains the list of transactions which were in the in-doubt or unprepared state at the time of system failure or shutdown; the dirty_pages table, which contains the list of pages that were potentially dirty in the buffers when the system failed or was shut down; and the RedoLSN, which is the location on the log from which the redo pass must start processing the log. The only log records that may be written by this routine are end records for transactions that had totally rolled back before system failure, but for whom end records are missing.

    RESTART(Master_Addr);
    Restart_Analysis(Master_Addr, Trans_Table, Dirty_Pages, RedoLSN);
    Restart_Redo(RedoLSN, Trans_Table, Dirty_Pages);
    buffer pool Dirty_Pages table := Dirty_Pages;
    remove entries for non-buffer-resident pages from the buffer pool Dirty_Pages table;
    Restart_Undo(Trans_Table);
    reacquire locks for prepared transactions;
    checkpoint();
    RETURN;

Fig. 9. Pseudocode for restart.

During this pass, if a log record is encountered for a page whose identity does not already appear in the dirty_pages table, then an entry is made in the table with the current log record's LSN as the page's RecLSN. The transaction table is modified to track the state changes of transactions and also to note the LSN of the most recent log record that would need to be undone if it were determined ultimately that the transaction had to be rolled back. If an OSfile_return log record is encountered, then any pages belonging to that file which are in the dirty_pages table are removed from the latter in order to make sure that no page belonging to that version of that file is accessed during the redo pass. The same file may be recreated and updated later, once the original operation causing the file erasure is committed. In that case, some pages of the recreated file will reappear in the dirty_pages table later with RecLSN values greater than the end-of-log LSN when the file was erased. The RedoLSN is the minimum RecLSN from the dirty_pages table at the end of the analysis pass. The redo pass can be skipped if there are no pages in the dirty_pages table.

It is not necessary that there be a separate analysis pass and, in fact, in the ARIES implementation in the OS/2 Extended Edition Database Manager there is no analysis pass. This is especially because, as we mentioned before (see also Section 6.2), in the redo pass ARIES unconditionally redoes all missing updates. That is, it redoes them irrespective of whether they were logged by loser or nonloser transactions, unlike System R, SQL/DS and DB2. Hence, redo does not need to know the loser or nonloser status of a transaction. That information is, strictly speaking, needed only for the undo pass. This would not be true for a system (like DB2) in which for in-doubt transactions their update locks are reacquired by inferring the lock names from the log records of the in-doubt transactions, as they are encountered during the redo pass. This technique for reacquiring locks forces the RedoLSN computation to consider the Begin_LSNs of in-doubt transactions, which in turn requires that we know, before the start of the redo pass, the identities of the in-doubt transactions.

Without the analysis pass, the transaction table could be constructed from the checkpoint record and the log records encountered during the redo pass. The RedoLSN would have to be minimum(minimum(RecLSN from the dirty_pages table in the end_chkpt record), LSN(begin_chkpt record)). Suppression of the analysis pass would also require that other methods be used to avoid processing updates to files which have been returned to the operating system. Another consequence is that the dirty_pages table used during the redo pass cannot be used to filter update log records which occur after the begin_chkpt record.

    RESTART_ANALYSIS(Master_Addr, Trans_Table, Dirty_Pages, RedoLSN);
    initialize the tables Trans_Table and Dirty_Pages to empty;
    Master_Rec := Read_Disk(Master_Addr);
    Open_Log_Scan(Master_Rec.ChkptLSN);      /* open log scan at Begin_Chkpt record */
    LogRec := Next_Log();                    /* read in the Begin_Chkpt record */
    LogRec := Next_Log();                    /* read log record following Begin_Chkpt */
    WHILE NOT(End_of_Log) DO;
      IF trans related record & LogRec.TransID NOT IN Trans_Table THEN
                                             /* not chkpt/OSfile_return log record */
        insert (LogRec.TransID, 'U', LogRec.LSN, LogRec.PrevLSN) into Trans_Table;
      SELECT(LogRec.Type)
        WHEN('update' | 'compensation') DO;
          Trans_Table[LogRec.TransID].LastLSN := LogRec.LSN;
          IF LogRec.Type = 'update' THEN
            IF LogRec is undoable THEN
              Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.LSN;
          ELSE Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.UndoNxtLSN;
                                             /* next record to undo is the one pointed to by this CLR */
          IF LogRec is redoable & LogRec.PageID NOT IN Dirty_Pages THEN
            insert (LogRec.PageID, LogRec.LSN) into Dirty_Pages;
        END;                                 /* WHEN('update' | 'compensation') */
        WHEN('Begin_Chkpt') ;                /* found an incomplete checkpoint's Begin_Chkpt record. ignore it */
        WHEN('End_Chkpt') DO;
          FOR each entry in LogRec.Tran_Table DO;
            IF TransID NOT IN Trans_Table THEN DO;
              insert entry (TransID, State, LastLSN, UndoNxtLSN) in Trans_Table;
            END;
          END;                               /* FOR */
          FOR each entry in LogRec.Dirty_PagLst DO;
            IF PageID NOT IN Dirty_Pages THEN insert entry (PageID, RecLSN) in Dirty_Pages;
            ELSE set RecLSN of Dirty_Pages entry to RecLSN in Dirty_PagLst;
          END;                               /* FOR */
        END;                                 /* WHEN('End_Chkpt') */
        WHEN('prepare' | 'rollback') DO;
          IF LogRec.Type = 'prepare' THEN Trans_Table[LogRec.TransID].State := 'P';
          ELSE Trans_Table[LogRec.TransID].State := 'U';
          Trans_Table[LogRec.TransID].LastLSN := LogRec.LSN;
        END;                                 /* WHEN('prepare' | 'rollback') */
        WHEN('end') delete Trans_Table entry for which TransID = LogRec.TransID;
        WHEN('OSfile_return') delete from Dirty_Pages all pages of returned file;
      END;                                   /* SELECT */
      LogRec := Next_Log();
    END;                                     /* WHILE */
    FOR EACH Trans_Table entry with (State = 'U') & (UndoNxtLSN = 0) DO;
                                             /* rolled back trans with missing end record */
      write end record and remove entry from Trans_Table;
    END;                                     /* FOR */
    RedoLSN := minimum(Dirty_Pages.RecLSN);  /* return start position for redo */
    RETURN;

Fig. 10. Pseudocode for restart analysis.
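The analysis pass can be mirrored in a few lines of Python. This is a hedged sketch rather than a faithful port of Figure 10: the log is assumed to be a dictionary from LSN to record, the record fields (`type`, `trans_id`, `page_id`, `prev_lsn`, `undo_nxt_lsn`) are hypothetical names, OSfile_return handling is omitted, and the missing end records are not actually written.

```python
def restart_analysis(log, chkpt_lsn):
    """Scan `log` after the begin_chkpt at chkpt_lsn.

    Returns (trans_table, dirty_pages, redo_lsn).
    """
    trans, dirty = {}, {}
    for lsn in sorted(l for l in log if l > chkpt_lsn):
        rec = log[lsn]
        kind, tid = rec["type"], rec.get("trans_id")
        if tid is not None and tid not in trans:
            trans[tid] = {"state": "U", "last_lsn": lsn,
                          "undo_nxt_lsn": rec.get("prev_lsn", 0)}
        if kind in ("update", "compensation"):
            trans[tid]["last_lsn"] = lsn
            if kind == "update":
                if rec.get("undoable", True):
                    trans[tid]["undo_nxt_lsn"] = lsn
            else:  # a CLR points at the next record to undo
                trans[tid]["undo_nxt_lsn"] = rec["undo_nxt_lsn"]
            if rec.get("redoable", True) and rec["page_id"] not in dirty:
                dirty[rec["page_id"]] = lsn          # RecLSN of the page
        elif kind == "end_chkpt":
            for t, entry in rec["trans_table"].items():
                trans.setdefault(t, dict(entry))
            for pid, rec_lsn in rec["dirty_pages"].items():
                dirty[pid] = rec_lsn                 # checkpoint's RecLSN wins
        elif kind in ("prepare", "rollback"):
            trans[tid]["state"] = "P" if kind == "prepare" else "U"
            trans[tid]["last_lsn"] = lsn
        elif kind == "end":
            del trans[tid]
    # Totally rolled back transactions whose end record is missing.
    for t in [t for t, e in trans.items()
              if e["state"] == "U" and e["undo_nxt_lsn"] == 0]:
        del trans[t]
    redo_lsn = min(dirty.values(), default=None)     # None: redo can be skipped
    return trans, dirty, redo_lsn
```

In the usage below, the checkpoint's dirty page list pulls the RedoLSN back before the begin_chkpt record, which is why the redo scan may start earlier than the checkpoint.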
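6.2 Redo Pass

The second pass of the log that is made during restart recovery is the redo pass. Figure 11 describes the RESTART_REDO routine that implements the redo pass actions. The inputs to this routine are the RedoLSN and the dirty_pages table supplied by the restart analysis routine. No log records are written by this routine. The redo pass starts scanning the log records from the RedoLSN point. When a redoable log record is encountered, a check is made to see if the referenced page appears in the dirty_pages table. If it does and if the log record's LSN is greater than or equal to the RecLSN for the page in the table, then it is suspected that the page state might be such that the log record's update might have to be redone. To resolve this suspicion, the page is accessed. If the page's LSN is found to be less than the log record's LSN, then the update is redone. Thus, the RecLSN information serves to limit the number of pages which have to be examined. This routine reestablishes the database state as of the time of system failure. Even updates performed by loser transactions are redone. The rationale behind this repeating of history is explained in Section 10.1. It turns out that some of that redo of loser transactions' log records may be unnecessary. In [69] we have explored further the idea of restricting the repeating of history to possibly reduce the number of log records redone.

    RESTART_REDO(RedoLSN, Trans_Table, Dirty_Pages);
    Open_Log_Scan(RedoLSN);                  /* open log scan and position at restart pt */
    LogRec := Next_Log();                    /* read log record at restart redo point */
    WHILE NOT(End_of_Log) DO;                /* look at all records till end of log */
      IF LogRec.Type = ('update' | 'compensation') & LogRec is redoable &
         LogRec.PageID IN Dirty_Pages &
         LogRec.LSN >= Dirty_Pages[LogRec.PageID].RecLSN
      THEN DO;      /* a redoable page update. updated page might not have made it to */
                    /* disk before sys failure. need to access page and check its LSN */
        Page := fix&latch(LogRec.PageID, 'X');
        IF Page.LSN < LogRec.LSN THEN DO;    /* update not on page. need to redo it */
          Redo_Update(Page, LogRec);         /* redo update */
          Page.LSN := LogRec.LSN;
        END;                                 /* redid update */
        ELSE Dirty_Pages[LogRec.PageID].RecLSN := Page.LSN+1;
                    /* update already on page. update dirty page list with correct info. */
                    /* this will happen if this page was written to disk after the */
                    /* checkpt but before sys failure */
        unfix&unlatch(Page);
      END;                                   /* LSN on page has to be checked */
      LogRec := Next_Log();                  /* read next log record */
    END;                                     /* reading till end of log */
    RETURN;

Fig. 11. Pseudocode for restart redo.

Since redo is page-oriented, only the pages with entries in the dirty_pages table may get modified during the redo pass. Only those pages might have to be read and examined during this pass. Not all the pages that are read may require redo. This is because some of the pages that were dirty at the time of the last checkpoint or which became dirty later might have been written to nonvolatile storage before the system failure. Because of reasons like reducing log volume and saving some CPU overhead, we do not expect systems to write log records that identify the dirty pages that were written to nonvolatile storage, although that option is available and such log records can be used to eliminate the corresponding pages from the dirty_pages table when those log records are encountered during the analysis pass. Even if such records were always to be written after I/Os complete, a system failure in a narrow window could prevent them from being written. The corresponding pages will not get modified during this pass.

For brevity, we do not discuss here how, if a failure were to occur after the logging of the end record of a transaction, but before the execution of all the pending actions of that transaction, the remaining pending actions are redone during the redo pass.

For exploiting parallelism, the availability of the information in the dirty_pages table gives us the possibility of initiating asynchronous I/Os in parallel to read in all these pages so that they may be available in the buffers, possibly before the corresponding log records are encountered in the redo pass. Since updates performed during the redo pass are not logged, we can also perform sophisticated things like building in-memory queues of log records which potentially need to be reapplied (as dictated by the information in the dirty_pages table) on a per page or group of pages basis and, as the asynchronously initiated I/Os complete and pages come into the buffer pool, processing the corresponding log record queues using multiple processes. This requires that each queue be dealt with by only one process. Updates to different pages may get applied in different orders from the order represented in the log. This does not violate any correctness properties since for a given page all its missing updates are reapplied in the same order as before. These parallelism ideas are also applicable to the context of supporting disaster recovery via remote backups [73].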
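The two-level filter of the redo pass (first RecLSN, then the page's own LSN) can be sketched as follows. The dictionary-based log and pages and the `delta` field (an increment-style operation log record) are assumptions made for this example only.

```python
def restart_redo(log, redo_lsn, dirty_pages, pages):
    """Repeat history from redo_lsn; return the LSNs that were redone.

    log:         dict LSN -> {"type", "page_id", "delta"}
    dirty_pages: dict page_id -> RecLSN (may be tightened in place)
    pages:       dict page_id -> {"page_lsn", "value"}
    """
    redone = []
    for lsn in sorted(l for l in log if l >= redo_lsn):
        rec = log[lsn]
        if rec["type"] not in ("update", "compensation"):
            continue
        pid = rec["page_id"]
        # Cheap filter: page not dirty, or record older than its RecLSN.
        if pid not in dirty_pages or lsn < dirty_pages[pid]:
            continue
        page = pages[pid]                    # the page must now be examined
        if page["page_lsn"] < lsn:
            page["value"] += rec["delta"]    # update is missing: redo it
            page["page_lsn"] = lsn
            redone.append(lsn)
        else:
            # Page was written after the checkpoint; correct its RecLSN so
            # later records up to page_lsn are filtered without examination.
            dirty_pages[pid] = page["page_lsn"] + 1
    return redone
```

Because the logged operation here is an increment, redoing a record that is already reflected in the page would corrupt the value, which illustrates why the page LSN comparison matters for operation logging.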
6.3 Undo Pass

The third pass of the log that is made during restart recovery is the undo pass. Figure 12 describes the RESTART_UNDO routine that implements the undo pass actions. The input to this routine is the restart transaction table. The dirty_pages table is not consulted during this undo pass. Also, since history is repeated before the undo pass is initiated, the LSN on the page is not consulted to determine whether an undo operation should be performed or not. Contrast this with what we describe in Section 10.1 for systems like DB2 that do not repeat history but perform selective redo.

The restart undo routine rolls back loser transactions, in reverse chronological order, in a single sweep of the log. This is done by continually taking the maximum of the LSNs of the next log record to be processed for each of the yet-to-be-completely-undone loser transactions, until no loser transaction remains to be undone. The next record to process for each transaction to be rolled back is determined by an entry in the transaction table for each of those transactions. The processing of the encountered log records is exactly as we described before in Section 5.2. In the process of rolling back the transactions, this routine writes CLRs. The buffer manager follows the usual WAL protocol while writing dirty pages to nonvolatile storage during the undo pass.

    RESTART_UNDO(Trans_Table);
    WHILE EXISTS(Trans with State = 'U' in Trans_Table) DO;
      UndoLSN := maximum(UndoNxtLSN) from Trans_Table entries with State = 'U';
                    /* pick up UndoNxtLSN of unprepared trans with maximum UndoNxtLSN */
      LogRec := Log_Read(UndoLSN);           /* read log record to be undone or a CLR */
      SELECT(LogRec.Type)
        WHEN('update') DO;
          IF LogRec is undoable THEN DO;     /* record needs undoing (not redo-only record) */
            Page := fix&latch(LogRec.PageID, 'X');
            Undo_Update(Page, LogRec);
            Log_Write('compensation', LogRec.TransID, Trans_Table[LogRec.TransID].LastLSN,
                      LogRec.PageID, LogRec.PrevLSN, ..., LgLSN, Data);
            Page.LSN := LgLSN;               /* store LSN of CLR in page */
            Trans_Table[LogRec.TransID].LastLSN := LgLSN;   /* store LSN of CLR in table */
            unfix&unlatch(Page);
          END;                               /* undoable record case */
          ELSE;                              /* record cannot be undone - ignore it */
          Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.PrevLSN;
                    /* next record to process is the one preceding this record in its backward chain */
          IF LogRec.PrevLSN = 0 THEN DO;     /* have undone completely - write end */
            Log_Write('end', LogRec.TransID, Trans_Table[LogRec.TransID].LastLSN, ...);
            delete Trans_Table entry where TransID = LogRec.TransID;   /* delete trans from table */
          END;                               /* trans fully undone */
        END;                                 /* WHEN('update') */
        WHEN('compensation') Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.UndoNxtLSN;
                                             /* pick up addr of next record to examine */
        WHEN('rollback' | 'prepare') Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.PrevLSN;
                                             /* pick up addr of next record to examine */
      END;                                   /* SELECT */
    END;                                     /* WHILE */
    RETURN;

Fig. 12. Pseudocode for restart undo.

To exploit parallelism, the undo pass can also be performed using multiple processes. It is important that each transaction be dealt with by a single process because of the UndoNxtLSN chaining in the CLRs. This still leaves open the possibility of writing the CLRs first, without applying the undos to the pages (see Section 6.4 for problems in accomplishing this for objects that may require logical undos), and then redoing the CLRs in parallel, as explained in Section 6.2. In this fashion, the undo work of actually applying the changes to the pages can be performed in parallel, even for a single transaction.

Figure 13 depicts an example restart recovery scenario using ARIES. Here, all the log records describe updates to the same page. Before the failure, the page was written to disk after the second update. After that disk write, a partial rollback was performed (undo of log records 4 and 3) and then the transaction went forward (updates 5 and 6). During restart recovery, the missing updates (3, 4, 4', 3', 5 and 6) are first redone and then the undos (of 6, 5, 2 and 1) are performed. Each update log record will be matched with one CLR, no matter how many times restart recovery is performed.

Fig. 13. Restart recovery example with ARIES. (Log: 1 2 3 4 4' 3' 5 6; REDO: 3 4 4' 3' 5 6; UNDO: 6 5 2 1.)

With ARIES, we have the option of allowing the continuation of loser transactions after restart recovery is completed. Since ARIES repeats history and supports the savepoint concept, we could, in the undo pass, roll back each loser only to its latest savepoint, instead of totally rolling back the loser transactions. Later, we could resume the transaction by invoking its application at a special entry point and passing enough information about the savepoint from which execution is to be resumed. Doing this correctly would require the ability to generate lock names from the transaction's log records for its uncommitted, not undone updates, reacquiring those locks before completing restart recovery, and logging enough information when savepoints are established so that the system can restore cursor positions, application program state, and so on.
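The single backward sweep of the undo pass can be sketched as below, using the same simplified record layout assumed elsewhere in these examples (dictionary log, hypothetical field names, no page access). One deliberate deviation from Figure 12: the sketch checks for a zero UndoNxtLSN after every step, not only after undoing an update.

```python
def restart_undo(log, trans_table):
    """Roll back all transactions in state 'U'; return LSNs undone, in order.

    log:         dict LSN -> record, extended in place with CLRs/end records
    trans_table: dict trans_id -> {"state", "last_lsn", "undo_nxt_lsn"}
    """
    undone = []
    next_lsn = max(log) + 1
    while any(e["state"] == "U" for e in trans_table.values()):
        # Always process the loser with the largest UndoNxtLSN next.
        tid, entry = max(
            ((t, e) for t, e in trans_table.items() if e["state"] == "U"),
            key=lambda te: te[1]["undo_nxt_lsn"])
        lsn = entry["undo_nxt_lsn"]
        rec = log[lsn]
        if rec["type"] == "compensation":
            entry["undo_nxt_lsn"] = rec["undo_nxt_lsn"]   # skip undone records
        else:
            if rec["type"] == "update" and rec.get("undoable", True):
                log[next_lsn] = {"type": "compensation", "trans_id": tid,
                                 "prev_lsn": entry["last_lsn"],
                                 "undo_nxt_lsn": rec["prev_lsn"]}
                entry["last_lsn"] = next_lsn
                next_lsn += 1
                undone.append(lsn)
            entry["undo_nxt_lsn"] = rec["prev_lsn"]
        if entry["undo_nxt_lsn"] == 0:                    # fully undone
            log[next_lsn] = {"type": "end", "trans_id": tid,
                             "prev_lsn": entry["last_lsn"]}
            next_lsn += 1
            del trans_table[tid]
    return undone
```

Fed the log of Figure 13 (with 4' and 3' stored at LSNs 5 and 6, and updates 5 and 6 at LSNs 7 and 8), the sweep undoes LSNs 8, 7, 2 and 1, matching the figure's UNDO sequence 6, 5, 2, 1.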
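6.4 Selective or Deferred Restart

Sometimes, after a system failure, we may wish to restart the processing of new transactions as soon as possible. Hence, we may wish to defer doing some recovery work to a later point in time. This is usually done to reduce the amount of time during which some critical data is unavailable. It is accomplished by recovering such data first and then opening the system for the processing of new transactions. In DB2, for example, it is possible to perform restart recovery even when some of the objects for which redo and/or undo work needs to be performed are offline when the system is brought up. If some undo work needs to be performed for some loser transactions on those offline objects, then DB2 is able to write the CLRs alone and finish handling the transactions. This is possible because the CLRs can be generated based solely on the information in the non-CLR records written during the forward processing of the transactions [15]. Because page (or minipage, for indexes) is the smallest granularity of locking, the undo actions will be exact inverses of the original actions. That is, there are no logical undos in DB2. DB2 remembers, in an exceptions table (called the database allocation (DBA) table) that is maintained in the log and in virtual storage, the fact that those offline objects need to be recovered when they are brought online, before they are made accessible to other transactions. The LSN ranges of log records to be applied are also remembered. Unless there are some in-doubt transactions with uncommitted updates to those objects, no locks need to be acquired to protect those objects, since accesses to those objects will not be permitted until recovery is completed. When those objects are brought online, recovery is performed efficiently by rolling forward using the log records in the remembered ranges. Even during normal rollbacks, CLRs may be written for offline objects.

In ARIES also, we can take similar actions, provided none of the loser transactions has modified one or more of the offline objects that may require logical undos. This is because logical undos are based on the current state of the object. Redos are not at issue since they are always page-oriented. For logical undos involving space management (see Section 10.3), generally we can take a conservative approach and generate the appropriate CLRs. For example, during the undo of an insert record operation, we can write a CLR for the space-related update stating that the page is 0% full. But for high concurrency index management methods this is not possible, since the effect of the logical undo (e.g., retraversing the index tree to do a key deletion), in terms of which page may be affected, is unpredictable; in fact, we cannot even predict when page-oriented undo will not work and hence logical undo is necessary.

It is not possible to handle the undos of some of the records of a transaction during restart recovery and handle the undos (possibly, logical) of the rest of the records at a later point in time, if the two sets of records are interspersed. Remember that in all the recovery methods, undo of a transaction is done in reverse chronological order. Hence, it is enough to remember, for each transaction, the next record to be processed during the undo; from that record, the PrevLSN and/or the UndoNxtLSN chain leads us to all the other records to be processed.

Even under the circumstances where one or more of the loser transactions have to perform, potentially, logical undos on some offline objects, if deferred restart needs to be supported, then we suggest the following algorithm:

1. Perform the repeating of history for the online objects, as usual; postpone it for the offline objects and remember the log ranges.
2. Proceed with the undo pass as usual, but stop undoing a loser transaction when one of its log records is encountered for which a CLR cannot be generated for the above reasons. Call such a transaction a stopped transaction. But continue undoing the other, unstopped transactions.
3. For the stopped transactions, acquire locks to protect their updates which have not yet been undone. This could be done as part of the undo pass by continuing to follow the pointers, as usual, even for the stopped transactions and acquiring locks based on the encountered non-CLRs that were written by the stopped transactions.
4. When restart recovery is completed and later the previously offline objects are made online, first repeat history based on the remembered log ranges and then continue with the undoing of the stopped transactions. After each of the stopped transactions is totally rolled back, release its still held locks.
5. Whenever an offline object becomes online, when the repeating of history is completed for that object, new transactions can be allowed to access that object in parallel with the further undoing of all of the stopped transactions that can make progress.

The above requires the ability to generate lock names based on the information in the update (non-CLR) log records. DB2 is doing that already for in-doubt transactions.

Even if none of the objects to be recovered is offline, but it is desired that the processing of new transactions begin before the rollbacks of the loser transactions are completed, then we can accommodate it by doing the following: (1) first repeat history and reacquire, based on their log records, the locks for the uncommitted updates of the loser and in-doubt transactions, and (2) then start processing new transactions, even as the rollbacks of the loser transactions are performed in parallel. The locks acquired in step (1) are released as each loser transaction's rollback completes. Performing step (1) requires that the restart RedoLSN be adjusted appropriately to ensure that all the log records of the loser transactions are encountered during the redo pass. If a loser transaction was already rolling back at the time of the system failure, then, with the information obtained during the analysis pass for such a transaction, it will be known which log records remain to be undone. These are the log records whose LSNs are less than or equal to the UndoNxtLSN of the transaction's last CLR. Locks need to be obtained during the redo pass only for those updates that have not yet been undone.

If a long transaction is being rolled back and we would like to release some of its locks as soon as possible, then we can mark specially those log records which represent the first update by that transaction on the corresponding object (e.g., record, if record locking is in effect) and then release that object's lock as soon as the corresponding log record is undone. This works only because we do not undo CLRs and because we do not undo the same non-CLR more than once; hence, it will not work in systems that undo CLRs or that undo the same non-CLR more than once. This early release of locks can be performed in ARIES during normal transaction undo to possibly permit resolution of deadlocks using partial rollbacks.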
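7. CHECKPOINTS DURING RESTART

In this section, we describe how the impact of failures on CPU processing and I/O can be reduced by, optionally, taking checkpoints during different stages of restart recovery processing.

Analysis pass. By taking a checkpoint at the end of the analysis pass, we can save some work if a failure were to occur during recovery. The entries of the transaction table of this checkpoint will be the same as the entries of the transaction table at the end of the analysis pass. The entries of the dirty_pages list of this checkpoint will be the same as the entries that the restart dirty_pages table contains at the end of the analysis pass. This is different from what happens during a normal checkpoint. For the latter, the dirty_pages list is obtained from the buffer pool (BP) dirty_pages table.

Redo pass. At the beginning of the redo pass, the buffer manager (BM) is notified so that, whenever it writes out a modified page to nonvolatile storage during the redo pass, it will change the restart dirty_pages table entry for that page by making the RecLSN be equal to the LSN of that log record such that all log records up to that log record had been processed by then. It is enough if BM manipulates the restart dirty_pages table in this fashion. BM does not have to maintain its own dirty_pages table as it does during normal processing. Of course, it should still be keeping track of what pages are currently in the buffers. The above allows checkpoints to be taken any time during the redo pass to reduce the amount of the log that would need to be redone if a failure were to occur before the end of the redo pass. The entries of the dirty_pages list of this checkpoint will be the same as the entries of the restart dirty_pages table at the time of the checkpoint. The entries of the transaction table of this checkpoint will be the same as the entries of the transaction table at the end of the analysis pass. This checkpointing is not affected by whether or not parallelism is employed in the redo pass.

Undo pass. At the beginning of the undo pass, the restart dirty_pages table becomes the BP dirty_pages table. At this point, the table is cleaned up by removing those entries for which the corresponding pages are no longer in the buffers. From then onwards, the BP dirty_pages table is maintained as during normal processing, removing entries when pages are written to nonvolatile storage and adding entries when pages are about to become dirty. During the undo pass, the entries of the transaction table are modified as during normal undo. If a checkpoint is taken any time during the undo pass, then the entries of the dirty_pages list of that checkpoint are the same as the entries of the BP dirty_pages table at that time. The entries of the transaction table of this checkpoint will be the same as the entries of the transaction table at that time.

In System R, during restart recovery, sometimes it may be required that a checkpoint be taken to free up some physical pages (the shadow pages) for more undo or redo work to be performed. This is another consequence of the fact that history is not repeated in System R, and it complicates the restart logic. ARIES is able to easily accommodate checkpoints during restart. While these intermediate checkpoints are optional in our case, they may be forced to take place in System R.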
8. MEDIA
We some called performed tions. might Of With will
RECOVERY
that media recovery tablespace, will etc.) be required entity. involving to image in easily that version the copy contrast an image produce the of the A at the fuzzy such entity method, to the image copying entity. This level of a file or
such a
DBspace,
image
an by entity other the method copy is
copy (also
can transaccopy of [52]. with performed means that no be
archive
a high we
dump)
with
concurrency could us
image
contain
uncommitted Let
course,
uncommitted directly
nonvolatile
132
C. Mohan et al.
versions systems of some of the copied directly be such it much a copy Since may pages from more the may the efficient and more in it. Of be present in the
more
recent
transaction version geometry manager have copying (e.g., easy case, latching When begin. to to to
nonvolatile since since system convenient is found [131), course, For the the
object
overheads
be eliminated. copying, systems image method but copy most image assertion all the of updates image the storage image-copied in the image by log. into recovery
transaction incremental the page presented amount level, image of with the the The that in record
buffers. copying, to
latter
as described will
some
be needed.
example,
the
location is noted
of
chkpt
remembered
image
point with
copy checkpoint.
information LSNS less entity to is than
had
minimum(minimum(RecL
pages
checkpoints checkpoint)) the that the point fuzzy entity point LSN of the call time
copy
have
been operaas
version
be at least
media
begin. same the
recovery
chkpt one redo is
redo point.
record given point. When reloaded redo being unless or the a log records image pared end pass such until Since, an page point. in in
reason
taking media
of the
computation
image-copied starting log records corresponding checkpoint Unlike the be update entity about Section dirt in must the
then During
is initiated
recovery the entity applied, list if log comthe undo such table by log the redo, the
recovered the LSN record LSN copy to the of the log on the refers
information page to
.pages and
it unnecessary.
restart
record its LSN
is greater checkpoint, log that may records had recovery. be kept table log. DBA an of the
of the if the
beginchkpt accessed must are the (e.g., 6.4) complete independence is logged nonvolatile or
of the
and
is reached,
any
in-progress
those
to the
separately
somewhere last
obtained in
ARIES,
update in the
arbitrary
database recovery,
accomplished
ACM Transactions
.
forward
133
that (e. g.,
version
page
as described which,
is to be contrasted updates
space to
records the
recovery of reconeven not by log when logging written starting records to had being in
expensive the complete for R), state partial pages up or If scans changes any of
entire explicitly
one page
of an index
even
which
System paying
if CLRS
image
attention
representing determine rolled would recovered useless tion would back recovery Individual of media the the had back be
should
undone.
transactions
scans rolled
work not
performed, any changes the 10.2 of but gets the also a chance is executed log in and
turns
out page
back An
forward R during
to skip pass
as it is done
System
of restart because
termination pool which abnormal by hitting that every the the page scan the the page cornonstate of the buffer and is the changes.
is actively code
changes the
database
application
performance-conscious may key) to Given page is storage all is relevant from does to bit first page 1 DB2 is set is the started put all to had occur or due the
terminations the attention process operation update. rupted volatile using log
users limit.
systems
exhausted
expensive
process the
uninterruptable an date page, by efficient rolling The for recovery by using and update whenever value an version
recover
and log
roll-forward operation a bit X-latched. logged a page is equal availability system redo missing problem state by in
RecLSN
automatically the Once and to l, page the page header. update LSN for
corruption complete
of a page
is detected
is reset this
is latched,
to see if its From restart but the entire page storage. left in
recovery situation
is initiated. letting
is unacceptable updates
to bring
page
were page
in the that
corrupted were
in the the
uncorrupted abnormally
on nonvolatile
is to make
pages
terminating leaving and latch, clean-ups. For CLRS This supports
process, unfix process calls around aids are system issued by the transaction operations in performing the system. like fix, necessary By unfix
enough the
footprints
user
before
performing processes
the
variety
mentioned
in this system
section
and
writing locking.
is a very only
even if the
with the
is supporting approach,
is to be contrasted page
no-CLRs
suggested
[521, which
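9. NESTED TOP ACTIONS

There are times when we would like some updates of a transaction to be committed, irrespective of whether later on the transaction as a whole commits or not. We do need the atomicity property for these updates themselves. This is illustrated in the context of file extension. After a transaction extends a file, which causes updates to some system data in the database, other transactions may be allowed to use the extended area prior to the commit of the extending transaction. If the extending transaction were to roll back, then it would not be acceptable to undo the effects of the extension. Such an undo might very well lead to a loss of updates performed by the other transactions. On the other hand, if the extension-related updates to the system data in the database were themselves interrupted by a failure before their completion, it would be necessary to undo them. These kinds of actions have traditionally been performed by starting independent transactions, called top actions [51]. A transaction initiating such an independent transaction waits until that independent transaction commits before proceeding. The independent transaction mechanism is, of course, vulnerable to lock conflicts between the initiating transaction and the independent transaction, which would be unacceptable.

In ARIES, using the concept of a nested top action, we are able to support the above requirement efficiently without having to initiate independent transactions to perform the actions. A nested top action, for our purposes, is taken to mean any subsequence of actions of a transaction which should not be undone once the sequence is complete (i.e., once some later action which is dependent on the nested top action is logged to stable storage), irrespective of the outcome of the enclosing transaction.

A transaction execution performing a sequence of actions which define a nested top action consists of the following steps:

(1) ascertaining the position of the current transaction's last log record;
(2) logging the redo and undo information associated with the actions of the nested top action; and
(3) on completion of the nested top action, writing a dummy CLR whose UndoNxtLSN points to the log record whose position was remembered in step (1).

We assume that the effects of any actions like creating a file, and their associated updates to system data normally resident outside the database, are externalized before the dummy CLR is written. When we discuss redo, we are referring to only the system data that is resident in the database itself.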
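The three steps of a nested top action can be illustrated with a small sketch. It is not the paper's implementation: the class and method names (`Transaction`, `begin_nested_top_action`, `end_nested_top_action`) are hypothetical, the log is kept in memory, and the CLRs that a real rollback would write are omitted. It only shows how a dummy CLR whose UndoNxtLSN points back to the remembered log record makes rollback skip the completed nested top action.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogRecord:
    lsn: int
    kind: str                           # "update" or "dummy_clr"
    prev_lsn: Optional[int]             # previous log record of this transaction
    undo_nxt_lsn: Optional[int] = None  # used only by the dummy CLR


class Transaction:
    """Hypothetical in-memory model of one transaction's log chain."""

    def __init__(self) -> None:
        self.log: Dict[int, LogRecord] = {}
        self.last_lsn: Optional[int] = None

    def _append(self, kind: str, undo_nxt_lsn: Optional[int] = None) -> int:
        lsn = len(self.log) + 1
        self.log[lsn] = LogRecord(lsn, kind, self.last_lsn, undo_nxt_lsn)
        self.last_lsn = lsn
        return lsn

    def update(self) -> int:
        # Step (2): ordinary undo-redo log record.
        return self._append("update")

    def begin_nested_top_action(self) -> Optional[int]:
        # Step (1): remember the position of the last log record.
        return self.last_lsn

    def end_nested_top_action(self, remembered: Optional[int]) -> int:
        # Step (3): dummy CLR pointing at the remembered record.
        return self._append("dummy_clr", remembered)

    def rollback(self) -> List[int]:
        """Return the LSNs that would be undone, newest first."""
        undone: List[int] = []
        cur = self.last_lsn
        while cur is not None:
            rec = self.log[cur]
            if rec.kind == "dummy_clr":
                cur = rec.undo_nxt_lsn   # jump over the nested top action
            else:
                undone.append(rec.lsn)
                cur = rec.prev_lsn
        return undone


def run(complete: bool) -> List[int]:
    t = Transaction()
    t.update()
    t.update()                                   # records 1 and 2
    remembered = t.begin_nested_top_action()     # remembers record 2
    t.update()
    t.update()                                   # records 3 and 4
    if complete:
        t.update()                               # record 5
        t.end_nested_top_action(remembered)      # record 6: dummy CLR
        t.update()                               # record 7
    return t.rollback()
```

In this sketch, `run(True)` rolls back records 7, 2 and 1 and leaves 3 to 5 alone, while `run(False)` (a failure before the dummy CLR is written) undoes the incomplete nested top action along with everything else.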
Fig. 14. Nested top action example.
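If a system failure were to occur before the dummy CLR is written, then the incomplete nested top action will be undone, since the nested top action's log records are written as undo-redo (as opposed to redo-only) log records. This provides the desired atomicity property for the nested top action itself. Unlike for normal CLRs, there is nothing to redo when a dummy CLR is encountered during the redo pass. The dummy CLR can in a sense be thought of as the commit record for the nested top action. The advantage of our approach is that the enclosing transaction need not wait for this record to be forced to stable storage before proceeding with its subsequent actions (see footnote 6). Also, we do not pay the price of starting a new transaction. Nor do we run into lock conflict problems. Contrast this approach with the costly independent-transaction approach.

Figure 14 gives an example of a nested top action consisting of the actions 3, 4 and 5. Log record 6' acts as the dummy CLR. Even though the enclosing transaction's activity is interrupted by a failure and hence it needs to be rolled back, 6' ensures that the nested top action is not undone.

It should be emphasized that the nested top action implementation relies on repeating history. If the nested top action consists of only a single update, then we can log that update using a single redo-only log record and avoid writing the dummy CLR. Applications of the nested top action concept in the context of a hash-based storage method and index management can be found in [59, 62].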
10. RECOVERY PARADIGMS
This section
some can be of the found the problems and in need methods handling [97]. for Our certain some caused associated transaction aim is to us difficulties features of the of with providing rollbacks. show which recovery the how we fineSome certain had to
record)
locking recovery
of the goals
existing and
in accomplish-
include of
in ARIES. R,
we show developed in
why the
System
which
context
6 The dummy CLR may have to be forced if some unlogged updates may be performed later by other transactions which depended on the nested top action having completed.
technique, high
is a need
System design 82,
for R of 881.
levels
more
limitations
preceding of updates
performed
during
CLRS). no logging no tracking of index of page on pages). and state space management itself information to relate it changes. to logged updates (i.e.,
on page
no LSNS
10.1 Selective Redo
The has The
goal been aim of this in
subsection in why systems updates 6). is to many ARIES restart in System later, 2 introduce systems repeats after passes the and locking history. failures, of the log: the they a undo generally redo pass of the (i.e., pass and perform and then an the concept to show with of selective the problems WAL-based redo that recovery. that it
implemented supporting is to motivate transaction recovery (see Figure As we other the call will
fine-granularity
show
R paradigm
undo
The
preceding
System transacof many and in a before. record records than is page set the to is LSN has
redo
DB2,
is incorrect
only We
with
hand, actions this
WAL
and
fine-granularity
prepared the
locking.
During
WAL-based pass,
does just
redo
R redoes tions System pitfalls, Some perform such WAL During describing update log the log records [311.
of committed
in-doubt) redo it
selectiue
below,
redo.
selective
paradigm
R intuitively as we discuss WAL-based selective systems, technique the redo an needs records than undo is always the on the needs that written, updates if redo record in update to LSN,
locking
approach were page LSN page to is 15). record undo been the and page not when also to
inconsistencies Let us consider described of a log the is log less LSN if the no undo page. a CLR of the being the rolled back of the
locking the to
implemented.
which
pass,
LSN
be reapplied then LSN L SN page. to the of the be would when even Writing when the simpler (see Figure
to
the
page. During
redone
undo
Otherwise,
operback. just to on
CLR
update,
to handle
updates system
in a special
CLR
an undo
performed
to be necessary
handling
Fig. 15. Selective redo scenario. [Figure labels: T1 is a nonloser; T2 is a loser; updates with LSNs 20 and 30.]
This
will
happen, but in
if there there U1
was an for
U2 update
for
was (CLll
U1
written that,
LSN
of l.Jl
LSN
of U2).
storage then, during written, that with selective state in (say, and been
interrupts would appear it. be On any only under modiwas with pushed the to be
of this
update U2 be had
U2
be made arises
there
locking these
properties
of the
to a losing
or in-rollback) losing
subsequently LSN the time undone redo the the undo present value to page_LSN history, page. Undoing harmless oriented DBMS reuse data effect and an only locking the and update update pass in 30 LSN by
update have
had
of the to not. or
established not the to former undo update log records an to know illustrate In
comes
Figures
fine-granularity with with to the is LSN LSN perform page. greater 20 the This whether than 30 since
since undo or or
it
it belongs
to a nonloser
causes
is because equal
determine
should
is no longer even certain logging, [81], unique page. and will in the when as
current in with in
present
example, [6],
physical/byteVAX is no automatic
VAX
Rdb/VMS
there
of freed is not
operation operation
logging, whose
inconsistencies
be caused
undoing
[Figure: selective redo problem scenario. Labels: T1 is a nonloser; T2 is a loser; log records with LSNs 10, 20 and 30; a commit record; update 20 is undone even though it is not on the page.]
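The problem that selective redo causes under WAL with fine-granularity locking can be reproduced with a small simulation. This is a sketch under stated assumptions, not code from the paper: the page holds two counters, updates are logged as increments (operation logging), the redo test is "page LSN less than log record LSN", and the undo pass decides whether an update is present by checking "page LSN greater than or equal to log record LSN". All names (`Page`, `Update`, `recover`) are hypothetical, and CLR writing is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Page:
    page_lsn: int
    records: Dict[str, int]


@dataclass(frozen=True)
class Update:
    lsn: int
    txn: str
    key: str
    delta: int   # operation logging: "add delta to the record"


def redo_pass(page: Page, log: List[Update], losers: Set[str], selective: bool) -> None:
    for rec in log:
        if selective and rec.txn in losers:
            continue                     # selective redo skips losers' updates
        if page.page_lsn < rec.lsn:      # update not yet reflected on the page
            page.records[rec.key] += rec.delta
            page.page_lsn = rec.lsn


def undo_pass(page: Page, log: List[Update], losers: Set[str]) -> None:
    for rec in reversed(log):
        # The page LSN is the only evidence of whether the update is present.
        if rec.txn in losers and page.page_lsn >= rec.lsn:
            page.records[rec.key] -= rec.delta


def recover(selective: bool) -> Dict[str, int]:
    # Only update 10 reached nonvolatile storage before the failure.
    page = Page(page_lsn=10, records={"a": 1, "b": 0})
    log = [
        Update(10, "T1", "a", 1),   # T1 is a nonloser (committed)
        Update(20, "T2", "b", 5),   # T2 is a loser
        Update(30, "T1", "a", 2),
    ]
    losers = {"T2"}
    redo_pass(page, log, losers, selective)
    undo_pass(page, log, losers)
    return page.records
```

With `selective=True`, redoing update 30 pushes the page LSN past 20, so the undo pass wrongly concludes that the loser's update 20 is on the page and subtracts it from record `b`. With `selective=False` (repeating history), update 20 is first redone and then undone, and record `b` returns to its correct value.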
Reversing the pass need become of that update would redoing The to have be during problem were
the
order
selective
and is
the
undo
solve undo
either. to precede
incorrect
redo pass, 15,
might of 20
of which
actions LSN
to be redone. greater CLRS is redone not that use the redo than LSN
In
Figure
undo
a log
only 30
records on the
even
update durability by
present and
would
atomicity R makes it
properties unnecessary needs technique, called the restart are are and checkpoint in functions
of page.LSN needs an version (see is not. all version, there are and is one are to action
to determine the shadow of the Updates thus restart, even which after the recovery is that the
what page
to the
undone
redone.
a checkpoint,
database, between
shadow
points
uersion, create
of the
is saved
two checkcurrent
recovin the are the is performed
a new
version
from ery. not the As in
database
shadow
a result, and which database, This with
is done
updates redo.
reason not
y even
selective
space 8
changes
or undone
logically.
7 This simple view, as it is depicted in Figure 17, is not completely accurate; see Section 10.2.
8 In fact, if index changes had been logged, then selective redo would not have worked. The problem would have come from structure modifications (like page split) which were performed after the last checkpoint by loser transactions and which were taken advantage of later by transactions which ultimately committed. Even if logical undo were performed (if necessary), if redo was page oriented, selective redo would have caused problems. To make it work, the structure modifications could have been performed using separate transactions. Of course, this would have been very expensive. For an alternate, efficient solution, see [62].
redo,
but
repeats
commit
does us to
not
perform
selective
support effect.
locking, ability to
repeating
side irrespective
of whether Section 9.
transaction
ultimately
commits
described
in
10.2 Rollback State
The backs writing for them. not
goal
subsection their has been there and is to discuss how in the writing of the many been, to them role in that difficulties CLRS problems. systems the and the they introduced that While and literature, advantages play In fact, these and the its will and back. the whether would has describe the been by roll-
of this
progress solves
and
some really
a significant of writing have undone present in this of numrollentire partial level, is Since of the written of only time track some performed restart. last the are those partial the need with occurred System checkpoint R. a wanted of at the a in recovery
discussion been
problems
by the and
actions were paper, writing A ber back rollback very effects left
what in [56].
additional In this
as
questions
elsewhere
to note roll
advantages
Section
a unique causing a if
transaction.
Supporting application
a transaction of the
during a way
to nonvolatile transaction time of the which time after That we next may the is,
to keep easy to
rollback. care is record already last restart before at the about taken.
the So,
a checkpoint
checkpoint for
System
keeps
to be failure
undone
of a system
is unimportant
since
uisible
the failure. this
the
failure
system
special
passes when
some
to have
them
Fig. 17. [Caption text not recoverable.]
Fig. 18. Partial rollback handling in System R. [Figure labels: Log, Last Checkpoint, log records 1 to 9.]
the
for log
T1 record
points
to log
does that the not
record
2 since been CLRS, rollback in the written by follow that log this 4 and
by
the
time
the of a
3 had
already write
because also does of the a transaction transaction record protocol. notice preceding
R not to
record inferred
a partial breakage
place.
record written
completion
rollback 1, instead
does pass,
examine,
analysis
3, we conclude with the undo needs log or not analysis hence 6, the 7, recovery checkpoint, the 5 and records during the record
partial Since,
the
undo state
to be performed 2 definitely depend pass and pass 8. pass, in it To the log the pass is
database Whether transaction 9 points caused records a forward to log pass, 5 will and in record log had
record will
to be undone. is a losing log rolled by it point the undo 4 and undo redo pass rollback putting that the record back
to be undone During record of log redone during If log will Here, pass.
it is concluded redo
pointer
analysis and
record
then, pass
during log
transaction the
is involved
both
see why
to precede
pass
9 In the other systems, because of the fact that CLRs are written and that, sometimes, page LSNs are compared with log records' LSNs to determine whether redo needs to be performed or not, the redo pass precedes the undo pass; see Section 10.1 (Selective Redo) and Figure 6.
a record the
is
same of that is
a record to be in
because and
records dealt
reused To
transaction to the
repeat
respect
sequence redo
of actions
be fore
failure,
be performed
the
is performed. a commit to not undo as across happened be In written record a loser the redo in actions nor and pass, System were pages normal logging 8). as a the also the created has adds for value 2, T1 rolls redo and is not a prepare during none R and known, record, the of the hence the undo records the with may then pass will exact other be the log way for quite transaction records in which a given different 2 be redone.
If 9 is neither will and be Since one page from forward determined 1 will
be undone. are
CLRS
transactions
or undo
processing,
(i.e.,
repeating
cause occur
history
further some
is
to guarantee).
to this (see problems being 5.4). Not done A piece T1 the adds the required writing physically
in System
R also
potentially did not or undo logging performed operation). the and last T2
space normal
during
redo
(see information
also
being
on an
object
to an
after-image of data
us consider
1, T2
T2
operation recovery
these the
problem ln this its would to will does or the the let not not
have is
instead by lock
System course,
being
accomplished
redoing which
information using will high these. WAL-based during the being which rollbacks. locking. were once more started and, than data Section
dumb
depend 10.3).
logging;
Allowing
logging
information
logically
concurrency
examples). logging
is concerned, some original suggested is pushed (or coarser are also system. which undone undone, is that,
rolled the
as denoted
during
immediate
if a transaction possibly
is illustrated before
a transaction Then,
rolling
recovery, CLRS the lock 22, the are idea Section next
the
are such
undone has
and
already while
nondeaditem in 8. do not
a situation, benefits
retaining
CLRS
management
early [691. We
of locks
objects in
Section
of CLRS
recovery rollbacks.
suggested
is an important
drawback
of such
10.3 Space Management
The length A record
goal
subsection finer than is to page point level efficiently. in that on We a doing the data do not record space page deal reader concurrency, from The approach a goal, logging did for The slot # way locking by is not with consumed solutions to to [50]. do not flexible by This this For to storage during another problem space index preout the problems involved and in space
granularity
of locking
varying
to be dealt
is to make or the briefly problem until update in here,
management transaction is discussed reservation updates, vent before such the the in
released
a transaction
transaction
interested
the commit
space
consumed is dealt
another under
circumstances flexible (i.e., systems first byte the have like then how garbage or log flexibility variable run quite availability 19 shows e.g., storing to to be the (by, the the to
is described it not the on the was want record. page. records identifies record. not of data within
to
do as
a page,
of a record
to identify
logging name
which
of the got
describes is that to lock us the and have reduce Figure state and
changed. within
collection
on a page within
a page IMS,
modify
length
efficiently. deal
systems
frequently y of data
fragmentation. track of the version in same has tracking the actual of the log
to users. which from same the keeping earlier is and page page) to all an of state in the nonvolatile storage point used. the which exact
leads that
Assuming only of
is attempted
Fig. 19. [Caption fragment: "... problem".] Figure labels: Delete R1, free 200 bytes; Insert R2, consume 200 bytes; Delete R2, free 200 bytes; Insert R3, consume 100 bytes; Commit.
an
LSN
to
each
avoid
attempting
to
redo
operations
which
are
already
to the page.
file free in data records one in it containing space DB2. or with for inventory Each index the FSIPS inserting records pages FSIP pages. obtained same are key the of one (FSIPS). describes During from consulted new such make page To or more They the a record a clustering related The at not an special to provide to identify record. as that sure that requires avoid also relations are space insert index keys) a data FSIP least every update called has space a called
(SMPS) many on
information operation, about as that page keeps 25% with only of the to the of the of
information or more
(or closely
new
enough
space
information at least 5090 -consuming in updates T1 thereby full. an Later, update to would current the undos might
space-re-
information of the
corresponding during redo and must space update the Now, this FSIPS the an FSIP. full FSIP the to and
handling recovery
recovery
FSIPS
and
independence, Transaction to full not space FSIP. then wrong, need need That whether does the ing, ing for 27% full,
to the cause
be logged. page FSIP were not record to say This to change to to cause as O% full, scenario changes inventory has change full, back, an update roll from it 23% from then to would and for full O% does the the be the
cause
space
to go to 35%
which
change had
3 l% its
written
a redoiundo which
record,
given
data
points
to the updates.
changes
redo-only
space system
to do logical
is, while that to
to the
update, space FSIP
to determine
and which processcan also processif it a describes in We
causes
then perform in which inverse We
to change
a CLR which forward forward example rollback. during rollback.
a change, does
update
during
needs the
to perform exact
an example
performed
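The FSIP example in this subsection uses the figures 23%, 27%, 35% and 31% full for a data page, with FSIP entries that only record coarse fullness levels. The following sketch works through that arithmetic. It assumes, for illustration only, that an FSIP entry stores the page fullness rounded down to a multiple of 25%; the function names are hypothetical. It contrasts restoring the FSIP entry from a before-image (as an undo-redo log record would) with recomputing it logically from the data page's current state.

```python
def fsip_bucket(percent_full: int, step: int = 25) -> int:
    """Coarse fullness level kept in the FSIP (assumed: rounded down to a multiple of step)."""
    return (percent_full // step) * step


def simulate(logical_undo: bool) -> int:
    """Return the FSIP entry for the data page after T1 rolls back."""
    page_full = 23
    fsip = fsip_bucket(page_full)            # 0% level

    # T1 consumes space: 23% -> 27%, which crosses the 25% level.
    t1_delta = 4
    t1_fsip_before_image = fsip
    page_full += t1_delta
    fsip = fsip_bucket(page_full)            # FSIP entry becomes 25

    # T2 consumes space: 27% -> 35%, which needs no FSIP change.
    page_full += 8
    assert fsip_bucket(page_full) == fsip

    # T1 rolls back its data page update: 35% -> 31%.
    page_full -= t1_delta
    if logical_undo:
        # Re-derive the entry from the data page's current state; a redo-only
        # record for the FSIP would be written only if the level changed.
        fsip = fsip_bucket(page_full)
    else:
        # Restore the before-image from T1's FSIP log record.
        fsip = t1_fsip_before_image
    return fsip
```

Restoring the before-image leaves the FSIP entry at the 0% level although the page is 31% full; the logical undo leaves it at the 25% level, which matches the page.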
10.4 Multiple LSNs
Noticing support objects explain DB2 This
the record
problems locking, precisely supports in the caused it by may idea. of locking where the that user of the is less has [10, each than the 12]. actions a page. option into The of way by be having tempting one LSN per page when trying
to
state
why already happens DB2 and
assigning a granularity
it is not a good
case
of
requiring minipages DB2 does transactions state an LSN the LSN the for able
up each
leaf page
despite not DB2 each
index
2 to 16
recovery during by leaf log equal LSN This the storing and
on such
of loser
is as follows. LSN with Whenever in the minipage that incurring (and not when LSNS for carry
minipages having the page it is on The undo, log undone overhead availcase to be at even media during turns divides is needed in (atoms out up a to [61] in to the have page,
besides field.
as a whole. is stored
is updated, During
minipage LSNS.
maximum
not the log page records
of the LSN
is compared needs too therefore over length objects (LSN) recovery, of repeating of loser DB2 like for much
to the
space
records
to determine
update
to be actually
space
conveniently
locking, to have is
varying deleted
Maintaining
state
done,
minipage
variable to make
performing seen of
transactions
in ARIES.
physically
technique the length one
a fixed
minipages, problem.
special
proposed objects
fine-granularity terminology
do not
varying
paper).
11. OTHER WAL-BASED METHODS
In
the
we which page
summarize also use (like space sections introduce in been of lack it here. 17, No. 1, March 1992 dimensions. has the the that overhead of data, of this the this We paper properties WAL protocol. of System e.g., for and the extra and Next, been with of some Recovery R) are very 1/0s [31] we for and not costly involving additional recovery the the that the other significant based here of page data, map
following, methods shadow of their the (see First, we will along method But, unable the nonvolatile
recovery on the because extra blocks sions). which methods recovery by we are disturbing
technique storage
well-known clustering
disadvantages,
checkpoints,
shadow
systems
be examining
compare
implemented of information
modifications implementation,
Siemens.
because
to include
a hierarchical (FF), which is but IMS and In FF, of the storage only efficient A single recovery
consists
IMS
no support FF the types supports entry provides minimum DEDBs. for and two and
have operations, of
many
kinds the
databases:
databases
(DEDBs). mechanisms possible But, for DEDBs database via global each
field
many
calls)
to
be the
availability XRF, ports pOOk DB2 Limited recovery different minipage provides data [80,
and [431.
sharing 941.
different
system in
the 13,
available
presented
table logging
stability,
and data
repeatable
for tables reorganizing with has dem within protocol (file, able key read,
read)
and
to be turned
operations both
data. The
access
Encompass in
incorporated hot-standby support 64]. and unlocked even [881 (a la as IMS) will less in
Tandems support
NonStop
NonStop
access.
transaction
Presumed
two-phase locking
supports
granularities repeat-
prefix and
(cursor be turned
stability,
off temporarily
on files. methods two methods logging based have method on value several (VLM), has The
presents
is much implemented
complex CMUS
logging
method
(OLM),
Buffer
have and written ing dirty failure. also OLM
management.
the write and back in DB2 an OLM that has steal a
NonStop policies.
VLM
and
DB2 VLM
adopted
fetch
storage
end-write
alone. might
dirty
to nonvolatile
restart
identifying 961,
pages
buffer manager
a sophisticated
whenever such the DB2s MSDBS, not see its at commit to the log dirty all
or an closed.
is opened,
operation written
and is back
record only
records This
to bring means
information For does writes, call are is locks system to stable the all to 1/0s, updates policy pages the lelism that logging the pages release being is used next for were nonvolatile
up to date
as of the
updating. For records the are how is given group DEDBs, for log applied even FP The commit the modified processes does not
that policy
MSDB all
time,
the
log
MSDB commit
released.
released
storage.
records. (i.e.,
transferred records
processes.
time
after
were This system
committed),
of the
completion any
uncommitted with a no-steal the to gain DEDB with paralBefore the pages is on this locking placed in
to nonvolatile The processing IMS by that FF IMS may Of during result all commit Normal in the restart and similar use
storage of separate
since
for
DEDBs. storage
writing also
to let the
some
storage.
course,
recovery
considered
processing. checkpoints recovery similar consistent) to those checkpoints DB2s major the object on dirty one The writes for Since will partial since the any no each alternately deferred be present committed updates
Normal
when all the activity
checkpointing.
system in (not is not the necessarily record do going _pages with described are we dirty system
are the
mode. to
ones
are
taken quiesce
take,
consistent of the SQL, logging similar of IMS volatile MSDBS, version. tion commit have not are writing writes and
of ARIES. even
Encompass
update actions
concurrently.
checkpoint difference objects [961. updating in are their changes updated included For of two
indexspaces,
complete
is performed checkpointed
is ensured Care
present. been
applied
is written. written
DEDBs, nonvolatile
committed
storage are
restart
records NonStop
and during
be written
nonvolatile dirtying may pages. SQL, Version concept only is excluded deferred for
before
Because for
completion
be
delayed
completion
Partial
port partial program access undo FP data partial
rollbacks.
transaction In This The its log rollbacks. level. data. in DB2
NonStop From
OLM
and
VLM
do not
sup-
2 Release is exposed
1, IMS at the
supports
savepoint is available
to those
applications FP
FP and
data
does
not
MSDBS. provide
supports
internal
statement-level
Compensation
and IMS for IMS FP FF does FP to get the Since modified time. during some with log some when none of transaction write not to
log records.
CLRS write such data the during CLRS until
SQL, During written rollback commit updating are locking from IMS
VLM, log
rollback,
changes
made. hence
DEDBs, pool at
Encompass, restart records must of its the the rollbacks written have log
and
write might
transaction. to comto nonvolatile of the been no-steal to FP log the nonto Too it IMS the FP on
commit
processingi.e., been
records went
storage policy, nonvolatile writes records undo volatile the rollbacks, often, has VLM amount repeated rollbacks. media many
system and
Even updates
corresponding hence
written
would
to be undone, [931. Since for data supporting at restart problems. As written a result, even only with performed in for
CLRS contain
media just
information,
to write
these
CLRS,
which
is accessed
many rollbacks.
a rolled some
back negative
of to
restart.
CLRS
respect during
CLRS
for
undos
and
redos
restart done modify rupt During causing CLRS worst grows ignores The might written records IMS will net to
undomodify
with failures redomodify processing. recovery, writing
and during
redomodify
restart. for a given are and and
This undointer-
is
records No
CLRS
identical In the
a given the
or restart during repeated avoids not CLR case, Because during media like
number
exponentially. CLRS result wind during written need up during is writing forward by
ARIES does
multiple
IMS
linearly.
of its
to redo
Log record
of records) (or logs undo providing its log objects. page. to reduce also OLM and CLRS log logs DB2 state) both
contents.
of its and
information before,
after-image does value FF not For in updated recovery and DB2 VLM and fields. their
no-steal physical
mentioned
locking
(see Since
[761). IMS
Ihls does
information CLRS hot-standby the backup logs the is used of redo and the both also
information. only track the the buffer updates. of updated of The undo redo
need
includes
a modified
takeover
NonStop
of the the
might
OLM
but the
periodically
undomodify only the redomodify
logs an operation
redomodify of the parts undomodify where L SNS and
consistent
snapno modify
shot
redo a page reside.
of each
or undo But map
records.
set of pages
modified
Page overhead. Encompass and NonStop SQL use one LSN on each page to keep track of the state of the page. VLM uses no LSNs, but OLM uses one LSN. DB2 uses one LSN and IMS FF no LSN. Not having the LSN in IMS FF and VLM to know the exact state of a page does not cause any problems because of IMS's and VLM's value logging and physical locking attributes. It is acceptable to redo an already present update or undo an absent update. IMS FP uses a field in the pages of DEDBs as a version number to correctly handle redos after all the data sharing systems have failed [67]. When DB2 divides an index leaf page into minipages, it uses one LSN for each minipage, besides one LSN for the page as a whole.
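The remark that, with value logging, it is acceptable to redo an already present update or undo an absent update can be checked with a few lines of code. This is only an illustrative sketch: a page is modeled as a Python list of integers, and the function names are hypothetical. It contrasts value logging (before- and after-images) with operation logging (an increment), which is not idempotent and therefore needs something like a page LSN to tell whether the update is already present.

```python
from typing import List, Tuple


def redo_value(page: List[int], slot: int, after_image: int) -> None:
    page[slot] = after_image      # overwrite with the logged after-image


def undo_value(page: List[int], slot: int, before_image: int) -> None:
    page[slot] = before_image     # overwrite with the logged before-image


def redo_increment(page: List[int], slot: int, delta: int) -> None:
    page[slot] += delta           # operation logging: depends on current state


def demo() -> Tuple[int, int, int]:
    # The logged update changed slot 0 from 10 to 15 (an increment of 5).
    page = [15]                   # update already present on the page
    redo_value(page, 0, 15)
    value_redo = page[0]          # unchanged, so the repeated redo is harmless

    page = [10]                   # update absent from the page
    undo_value(page, 0, 10)
    value_undo = page[0]          # unchanged, so undoing an absent update is harmless

    page = [15]                   # update already present on the page
    redo_increment(page, 0, 5)
    op_redo = page[0]             # incremented twice, which is wrong
    return value_redo, value_undo, op_redo
```

The third value returned by `demo()` shows why a method that supports operation logging cannot blindly reapply log records and must track page state.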
Log passes during restart recovery. Encompass and NonStop SQL two passes (redo and then undo), and DB2 makes three passes (analysis, and redo This dirty then undo see Figure from within the two because 6). Encompass of the after and NonStop policy became SQL of writing dirty. start passes page beginning checkpoints penultimate the page successful
is sufficient
of the buffer
management
seem to repeat history before performing the undo pass. They do not seem to repeat history if a backup system takes over when a primary system fails [41. In the case of a takeover by a hot-standby, locks are first reacquired for the losers updates and then the rollbacks with the processing of new transactions. using that a separate point, which process is to gain determined of the losers are performed in parallel Each loser transaction is rolled back DB2 information starts its redo in scan from the last before, recorded
parallelism. using
successful checkpoint, as modified by the analysis DB2 does selective redo (see Section 10.1). VLM makes one backward undo, and then redo). Many
pass. As mentioned
pass and OLM makes three passes (analysis, lists are maintained during OLMS and VLMS
passes. The undomodify and redomodify log records of OLM are used only to modify these lists, unlike in the case of the CLRS written in the other systems. In VLM, the one backward pass is used to undo uncommitted changes on nonvolatile storage and also to redo missing committed changes. No log records are written during these operations. In OLM, during the undo pass, for each object to be recovered, if an operation consistent version of the object does not exist on nonvolatile storage, then it restores a snapshot of the object from the snapshot log record version of the object, (1) in the remainder updates that precede the snapshot so that, starting from a consistent of the undo pass any to-be-undone can be undone logically, and (2) records only) that is similar to the
log record
in the redo pass any committed or in-doubt updates (modify follow the snapshot record can be redone logically. This shadowing performed in [16, 781 the database-wide checkpointing the use of a single log instead of IMS first reloads MSDBS from the that latest were successful included of buffers checkpoint This cannot means
using a separate logthe difference is that is replaced by object-level checkpointing and two logs. the file that received their contents during before the failure. the The restart dirty after just DEDB into buffers the same the pass records during Then, are also reloaded it makes
buffers number
as before.
a failure,
one forward
over the log (see Figure 6). During that pass, it accumulates log records in memory on a per-transaction basis and redoes, if necessary, completed transactions' FP updates. Multiple processes are used in parallel to redo the DEDB updates. As far as FP is concerned, only the updates starting from the last checkpoint before the failure are of interest. At the end of that one pass, in-progress transactions' FP updates are undone (using the log records in memory), in parallel, using one process per transaction. If the space allocated in memory for a transaction's log records is not enough, then a backward scan of the log will be performed to fetch the needed records during that transaction's rollback. In the XRF context, when a hot-standby IMS takes over, the handling of the loser transactions is similar to the way Tandem does it. That is, rollbacks are performed in parallel with new transaction processing.

Page forces during restart. OLM, VLM and DB2 force all dirty pages at the end of restart. Information on Encompass and NonStop SQL is not available.

Restart checkpoints. IMS, DB2, OLM and VLM take checkpoints only at the end of recovery. Information on Encompass and NonStop SQL is not available.

Restrictions on data. Encompass and NonStop SQL require that every record have a unique key. This unique key is used to guarantee that if an attempt is made to undo a logged action which was never applied to the nonvolatile storage version of the data, then the latter is realized and the undo fails. In other words, idempotence of operations is achieved using the unique key. IMS in effect does byte-range locking and logging and hence does not allow records to be moved around freely within a page. This results in fragmentation and less efficient usage of free space. IMS imposes some additional constraints with respect to FP data. VLM requires that an object's representation be divided into fixed-length (less than one page sized), unrelocatable quanta. The consequences of these restrictions are similar to those for IMS.

[2, 26, 56] do not discuss recovery from system failures, while the theory of [33] does not include semantically rich modes of locking (i.e., operation logging). In other sections of this paper, we have pointed out the problems with some of the other approaches that have been proposed in the literature.

12. ATTRIBUTES OF ARIES
ARIES makes few assumptions and has several advantages over other recovery methods. While ARIES is simple, it possesses several interesting and useful properties. Most of these properties have been demonstrated individually in one or more existing or proposed systems, as summarized in the last section. However, we know of no single system, proposed or real, which has all of these properties. Some of these properties of ARIES are:

(1) Support for finer than page-level concurrency control and multiple granularities of locking. ARIES supports page-level and record-level locking in a uniform fashion. Recovery is not affected by what the granularity of locking is. Depending on the expected contention, the appropriate level of locking can be chosen. It also allows multiple granularities of locking (e.g., record, table, and tablespace-level) for the same object (e.g., tablespace). Concurrency control schemes other than locking (e.g., the schemes of [2]) can also be used.

(2) Flexible buffer management during restart and normal processing. As long as the write-ahead logging protocol is followed, the buffer manager is free to use any page replacement policy. In particular, dirty pages of incomplete transactions can be written to nonvolatile storage before those transactions commit (steal policy). Also, it is not required that all pages dirtied by a transaction be written back to nonvolatile storage before the transaction is allowed to commit (i.e., no-force policy). These properties lead to reduced demands for buffer storage and fewer I/Os involving frequently updated (hot-spot) pages. ARIES does not preclude the possibilities of using deferred-updating and force-at-commit policies and benefiting from them. ARIES is quite flexible in these respects.

(3) Minimal space overhead: only one LSN per page. The permanent (excluding log) space overhead of this scheme is limited to the storage required on each page to store the LSN of the last logged action performed on the page. The LSN of a page is a monotonically increasing value.

(4) No constraints on data to guarantee idempotence of redo or undo of logged actions. There are no restrictions on the data with respect to unique keys, etc. Records can be of variable length. Data can be moved around within a page for garbage collection. Idempotence is ensured since the LSN on each page is used to determine whether an operation should be redone or not.

(5) Actions taken during the undo of an update need not necessarily be the exact inverses of the actions taken during the original update. Since CLRs are written during undos, any differences between the inverses of the original actions and the actions actually taken during undo can be recorded in the CLRs. An example of when the inverse might not be correct is the one that relates to the free space information (like at least 10% free, 20% free) about data pages that is maintained in space map pages. Because of finer than page-level granularity locking, while no free space information change takes place during the initial update of a page by a transaction, a free space information change might occur during the undo (from 20% free to 10% free) of that original change because of intervening update activities of other transactions (see Section 10.3). Other benefits of this attribute in the context of hash-based storage methods and index management can be found in [59, 62].

(6) Support for operation logging and novel lock modes. The changes made to a page can be logged in a logical fashion. The undo information and the redo information for the entire object need not be logged. It suffices if the changed fields alone are logged. Since history is repeated, for increment or decrement kinds of operations before- and after-images of the field are not needed. Information about the type of operation and the decrement or increment amount is enough. Garbage collection actions and changes to some fields (e.g., amount of free space) of that page need not be logged. Novel lock modes based on commutativity and other properties of operations can be supported [2, 26, 88].

(7) Even redo-only and undo-only records are accommodated. While it may be efficient (single call to the log component) sometimes to include the undo and redo information about an update in the same log record, at other times it may be efficient (from the original data, the undo record can be constructed and, after the update is performed in-place in the data record, from the updated data, the redo record can be constructed) and/or necessary (because of log record size restrictions) to log the information in two different records. ARIES can handle both situations. Under these conditions, the undo record must be logged before the redo record.

(8) Support for partial and total transaction rollback. Besides allowing a transaction to be rolled back totally, ARIES allows the establishment of savepoints and the partial rollback of transactions to such savepoints. Without the support for partial rollbacks, recoverable errors (e.g., unique key violation, out-of-date information in a distributed database system) will require total rollbacks and result in wasted work.
(9) Support for objects spanning multiple pages. Objects can span multiple pages (e.g., an IMS "record" which consists of multiple segments may be scattered over many pages). When an object is modified, if log records are written for every page affected by that object's update, ARIES works fine. ARIES itself does not treat multipage objects in any special way.
(10) Allows files to be acquired or returned, any time, from or to the operating system. ARIES provides the flexibility of being able to return files dynamically and permanently to the operating system (see [19] for the detailed description of a technique to accomplish this). Such an action is considered to be one that cannot be undone. It does not prevent the same file from being reallocated to the database system. Mappings between objects (table spaces, etc.) and files are not required to be defined statically as in System R.

(11) Some actions of a transaction may be committed even if the transaction as a whole is rolled back. This refers to the technique of using the concept of a dummy CLR to implement nested top actions. File extension has been given as an example situation which could benefit from this. Other applications of this technique, in the context of hash-based storage methods and index management, can be found in [59, 62].
(12) Efficient checkpoints (including during restart recovery). By supporting fuzzy checkpointing, ARIES makes taking a checkpoint an efficient operation. Checkpoints can be taken even when update activities and logging are going on concurrently. Permitting checkpoints even during restart processing will help reduce the impact of failures during restart recovery. The dirty_pages information written during checkpointing helps reduce the number of pages which are read from nonvolatile storage during the redo pass.
(13) Simultaneous processing of multiple transactions in forward processing and/or in rollback accessing the same page. Since many transactions could simultaneously be going forward or rolling back on a given page, the level of concurrent access supported could be quite high. Except for the short duration latching which has to be performed any time a page is being physically modified or examined, be it during forward processing or during rollback, rolling back transactions do not affect one another in any unusual fashion.

(14) No locking or deadlocks during transaction rollback. Since no locking is required during transaction rollback, no deadlocks will involve transactions that are rolling back. Avoiding locking during rollbacks simplifies not only the rollback logic, but also the deadlock detector logic. The deadlock detector need not worry about making the mistake of choosing a rolling back transaction as a victim in the event of a deadlock (cf. System R and R* [31, 49, 64]).

(15) Bounded logging during restart in spite of repeated failures or of nested rollbacks. Even if repeated failures occur during restart, the number of CLRs written is unaffected. This is also true if partial rollbacks are nested. The number of log records written will be the same as that written at the time of transaction rollback during normal processing. The latter again is a fixed number and is, usually, equal to the number of undoable records written during the forward processing of the transaction. No log records are written during the redo pass of restart.

(16) Permits exploitation of parallelism and selective/deferred processing for faster restart. Restart can be made faster by not doing all the needed I/Os synchronously one at a time while processing the corresponding log record. ARIES permits the early identification of the pages needing recovery and the initiation of asynchronous parallel I/Os for the reading in of those pages. The pages can be processed concurrently as they are brought into the buffer pool during the redo pass. Undo parallelism requires complete handling of a given transaction by a single process. Some of the restart processing can be deferred to speed up restart or to accommodate offline devices. If desired, undo of loser transactions can be performed in parallel with new transaction processing.

(17) Fuzzy image copying (archive dumping) for media recovery. Media recovery and image copying of the data are supported very efficiently. To take advantage of device geometry, the actual act of copying can even be performed outside the transaction system (i.e., without going through the buffer pool). This can happen even while the latter is accessing and modifying the information being copied. During media recovery only one forward traversal of the log is made.

(18) Continuation of loser transactions after a system restart. Since ARIES repeats history and supports the savepoint concept, we could, in the undo pass, instead of totally rolling back the loser transactions, roll back each loser only to its latest savepoint. Locks must be acquired to protect the transaction's uncommitted, not undone updates. Later, we could resume the transaction by invoking its application at a special entry point and passing enough information about the savepoint from which execution is to be resumed.

(19) Only one backward traversal of log during restart or media recovery. Both during media recovery and restart recovery, one backward traversal of the log is sufficient. This is especially important if any portion of the log is likely to be stored in a slow medium like tape.

(20) Need only redo information in compensation records. Since compensation records are never undone, they need to contain only redo information. So, on the average, the amount of log space consumed during a transaction rollback will be half the space consumed during the forward processing of that transaction.

(21) Support for distributed transactions. ARIES accommodates distributed transactions. Whether a given site is a coordinator or a subordinate site does not affect ARIES.

(22) Early release of locks during transaction rollback and deadlock resolution using partial rollbacks. Because ARIES never undoes CLRs and because it never undoes a particular non-CLR more than once, during a (partial) rollback, when the transaction's very first update to a particular object is undone and a CLR is written for it, the system can release the lock on that object. This makes it possible to consider resolving deadlocks using partial rollbacks.

It should be noted that ARIES does not prevent the shadow page technique from being used for selected portions of the data to avoid logging of only undo information or both undo and redo information. This may be useful for long fields, as is the case in the OS/2 Extended Edition Database Manager. In such instances, for such data, the modified pages will have to be forced to nonvolatile storage before commit. Whether or not media recovery and partial rollbacks can be supported will depend on what is logged and for which updates shadowing is done.
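Several of the attributes above (one LSN per page, idempotent redo, operation logging of increments, CLRs that are never undone, and bounded logging across repeated failures) can be seen together in a small simulation. The following Python sketch is illustrative only and is not the paper's implementation: the class MiniAries, its method names, the single-counter page model, and the assumption that the log is always stable and that the loser list is supplied by the caller (instead of by an analysis pass) are all simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    lsn: int
    txn: str
    kind: str                  # "update" or "CLR"
    page: str
    delta: int                 # operation logging: amount added when (re)applied
    prev_lsn: Optional[int]    # previous log record of the same transaction
    undo_next_lsn: Optional[int] = None  # CLRs only: next record left to undo


class MiniAries:
    """Toy model: each page is one counter plus a page_LSN (hypothetical)."""

    def __init__(self):
        self.log = []       # assumed stable; the record with LSN n is log[n - 1]
        self.disk = {}      # page -> (value, page_LSN) on nonvolatile storage
        self.buffer = {}    # volatile page copies, lost at a crash
        self.last_lsn = {}  # txn -> LSN of its latest log record

    def _page(self, page):
        if page not in self.buffer:
            self.buffer[page] = list(self.disk.get(page, (0, 0)))
        return self.buffer[page]

    def _apply(self, rec):
        pg = self._page(rec.page)
        pg[0] += rec.delta
        pg[1] = rec.lsn     # page_LSN records the last logged action applied

    def _append(self, txn, kind, page, delta, undo_next=None):
        rec = LogRecord(len(self.log) + 1, txn, kind, page, delta,
                        self.last_lsn.get(txn), undo_next)
        self.log.append(rec)
        self.last_lsn[txn] = rec.lsn
        self._apply(rec)

    def increment(self, txn, page, amount):
        self._append(txn, "update", page, amount)

    def flush(self, page):
        # Steal policy: a page with uncommitted updates may reach the disk.
        self.disk[page] = tuple(self._page(page))

    def crash(self):
        self.buffer.clear()
        self.last_lsn.clear()

    def undo(self, txn, max_steps=None):
        """Undo txn's updates, one CLR each; max_steps simulates an interruption."""
        lsn, steps = self.last_lsn.get(txn), 0
        while lsn is not None and (max_steps is None or steps < max_steps):
            rec = self.log[lsn - 1]
            if rec.kind == "CLR":
                lsn = rec.undo_next_lsn   # skip work that is already compensated
            else:
                self._append(txn, "CLR", rec.page, -rec.delta,
                             undo_next=rec.prev_lsn)
                lsn, steps = rec.prev_lsn, steps + 1

    def restart(self, losers):
        for rec in self.log:              # redo pass: repeat history, CLRs too
            self.last_lsn[rec.txn] = rec.lsn
            if self._page(rec.page)[1] < rec.lsn:   # idempotence via page_LSN
                self._apply(rec)
        for txn in losers:                # undo pass
            self.undo(txn)


db = MiniAries()
db.increment("T1", "P1", 5)
db.increment("T1", "P1", 7)
db.increment("T1", "P2", 3)
db.flush("P1")
db.undo("T1", max_steps=1)    # rollback is interrupted after one CLR
db.crash()
db.restart(losers=["T1"])     # finishes the rollback via UndoNxtLSN
db.crash()
db.restart(losers=["T1"])     # a repeated failure adds no further log records
clr_count = sum(r.kind == "CLR" for r in db.log)
```

In this run the transaction wrote three updates, and exactly three CLRs exist at the end regardless of the interrupted rollback and the two restarts, which is the bounded-logging property in miniature. The redo pass skips the two updates already reflected in the flushed copy of P1 purely by comparing page_LSN with the record's LSN.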
13. SUMMARY

In this paper, we presented the ARIES recovery method and showed why some of the recovery paradigms of System R are inappropriate in the WAL context. We dealt with a variety of features that are very important in building and operating an industrial-strength transaction processing system. Several issues regarding operation logging, fine-granularity locking, space management, and flexible recovery were discussed. In brief, ARIES accomplishes the goals that we set out with by logging all updates on a per-page basis, using an LSN on every page for tracking page state, repeating history during restart recovery before undoing the loser transactions, and chaining the CLRs to the predecessors of the log records that they compensated. Use of ARIES is not restricted to the database area alone. It can also be used for implementing persistent object-oriented languages, recoverable file systems and transaction-based operating systems. In fact, it is being used in the QuickSilver distributed operating system [40] and in a system to aid the backing up of workstation data.

In this section, we summarize which specific features of ARIES lead to which specific attributes that give us flexibility and efficiency.

Repeating history exactly, which in turn implies using LSNs and writing CLRs during undos, permits the following, irrespective of whether CLRs are chained using the UndoNxtLSN field or not:

(1) Record level locking to be supported and records to be moved around within a page to avoid storage fragmentation without the moved records having to be locked and without the movements having to be logged.

(2) Use of only one state variable, a log sequence number, per page.
(3) Reuse of storage released by one transaction for the same transaction's later actions or for other transactions' actions once the former commits, thereby leading to the preservation of clustering of records and the efficient usage of storage.
(4) The inverse of an action originally performed during forward processing of a transaction to be different from the action(s) performed during the undo of that original action (e.g., class changes in the space map pages). That is, logical undo with recovery independence is made possible.

(5) Multiple transactions may undo on the same page concurrently with transactions going forward.

(6) Recovery of each page independently of other pages or of log records relating to transaction state, especially during media recovery.

(7) If necessary, the continuation of transactions which were in progress at the time of system failure.

(8) Selective or deferred restart, and undo of losers concurrently with new transaction processing to improve data availability.

(9) Partial rollback of transactions.
(10) Operation logging and logical logging of changes within a page. For example, decrement and increment operations may be logged, rather than the before- and after-images of modified data.

Chaining, using the UndoNxtLSN field, CLRs to log records written during forward processing permits the following, provided the protocol of repeating history is also followed:

(1) The avoidance of undoing CLRs' actions, thus avoiding writing CLRs for CLRs. This also makes it unnecessary to store undo information in CLRs.

(2) The avoidance of the undo of the same log record written during forward processing more than once.

(3) As a transaction is being rolled back, the ability to release the lock on an object when all the updates to that object had been undone. This may be important while rolling back a long transaction or while resolving a deadlock by partially rolling back the victim.

(4) Handling partial rollbacks without any special actions like patching the log, as in System R.

(5) Making permanent, if necessary via nested top actions, some of the changes made by a transaction, irrespective of whether the transaction itself subsequently rolls back or commits.

Performing the analysis pass before repeating history permits the following:

(1) Checkpoints to be taken any time during the redo and undo passes of recovery.

(2) Files to be returned to the operating system dynamically, thereby allowing dynamic binding between database objects and files.

(3) Recovery of file-related information concurrently with the recovery of user data, without requiring special treatment for the former.

(4) Identifying pages possibly requiring redo, so that asynchronous parallel I/Os could be initiated for them even before the redo pass starts.

(5) Exploiting opportunities to avoid redos on some pages by eliminating those pages from the dirty_pages table on noticing, e.g., that some empty pages have been freed.

(6) Exploiting opportunities to avoid reading some pages during redo, e.g., by writing end_write records after dirty pages have been written to nonvolatile storage and by eliminating those pages from the dirty_pages table when the end_write records are encountered.

(7) Identifying the transactions in the in-doubt and in-progress states so that locks could be reacquired for them during the redo pass to support selective or deferred restart, the continuation of loser transactions after restart, and undo of loser transactions in parallel with new transaction processing.

13.1 Implementations and Extensions

ARIES forms the basis of the recovery algorithms used in the IBM Research prototype systems Starburst [87] and QuickSilver [40], in the University of Wisconsin's EXODUS and Gamma database machine [20], and in the IBM program products OS/2 Extended Edition Database Manager [7] and Workstation Data Save Facility/VM [44]. One feature of ARIES, namely repeating history, has been implemented in DB2 Version 2 Release 1 to use the concept
of nested top action for supporting segmented tablespaces.

A simulation study of the performance of ARIES is reported in [98]. The following conclusions from that study are worth noting: "Simulation results indicate the success of the ARIES recovery method in providing fast recovery from failures, caused by long intercheckpoint intervals, efficient use of page LSNs, log LSNs, and RecLSNs avoids redoing updates unnecessarily, and the actual recovery load is reduced skillfully. Besides, the overhead incurred by the concurrency control and recovery algorithms on transactions is very low, as indicated by the negligibly small difference between the mean transaction response time and the average duration of a transaction if it ran alone in a never failing system. This observation also emerges as evidence that the recovery method goes well with concurrency control through fine-granularity locking, an important virtue."
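The remark about page LSNs and RecLSNs avoiding unnecessary redo work can be illustrated with a small sketch of redo-pass screening. This is a simplified reading, not the paper's code: the function redo_pass, the tuple layout of the log records, and the dictionary standing in for the dirty_pages table are assumptions made for the example. A log record is skipped without reading its page when the page is absent from the table or when the record's LSN is below the page's RecLSN. Otherwise the page is read and its page_LSN decides whether the update is reapplied.

```python
def redo_pass(log, dirty_pages, disk):
    """Redo screening sketch.

    log:         list of (lsn, page_id, delta) tuples in increasing LSN order
    dirty_pages: page_id -> RecLSN, as an analysis pass might produce it
    disk:        page_id -> [value, page_lsn], updated in place
    Returns (pages_read, updates_redone).
    """
    if not dirty_pages:
        return 0, 0
    redo_lsn = min(dirty_pages.values())   # earliest point that may need redo
    cache = {}
    pages_read = 0
    redone = 0
    for lsn, page_id, delta in log:
        if lsn < redo_lsn:
            continue
        rec_lsn = dirty_pages.get(page_id)
        if rec_lsn is None or lsn < rec_lsn:
            continue                       # screened out without any page I/O
        if page_id not in cache:
            cache[page_id] = disk[page_id]  # simulated page read
            pages_read += 1
        page = cache[page_id]
        if page[1] < lsn:                  # page_LSN shows the update is missing
            page[0] += delta
            page[1] = lsn
            redone += 1
    return pages_read, redone


log = [(10, "A", 1), (20, "B", 2), (30, "A", 3), (40, "C", 4), (50, "B", 5)]
dirty_pages = {"A": 30, "B": 20}
disk = {"A": [1, 10], "B": [7, 50], "C": [4, 40]}
result = redo_pass(log, dirty_pages, disk)
```

Here page C is never read because it is not in the table, page B is read once but needs no redo because its page_LSN already covers both of its log records, and only the update with LSN 30 is reapplied to page A.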
We have extended ARIES to make it work in the context of the nested transaction model (see [70, 85]). Based on ARIES, we have developed new methods, called ARIES/KVL, ARIES/IM and ARIES/LHS, to efficiently provide high concurrency and recovery for B+-tree indexes [57, 62] and for hash-based storage structures [59]. We have also extended ARIES to restrict the amount of repeating of history that takes place for the loser transactions [69]. We have designed concurrency control and recovery algorithms, based on ARIES, for the N-way data sharing (i.e., shared disks) environment [65, 66, 67, 68]. Commit_LSN, a method which takes advantage of the page_LSN that exists in every page to reduce locking, latching and predicate reevaluation overheads, and also to improve concurrency, has been presented in [54, 58, 60]. Although messages are an important part of transaction processing, we did not discuss message logging and recovery in this paper.
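As a rough illustration of how the page_LSN could be used to reduce locking in the Commit_LSN approach mentioned above, the sketch below assumes that Commit_LSN is taken to be the LSN of the first log record of the oldest transaction still in progress, so that any page whose page_LSN is smaller holds only committed data. That definition, the function names, and the treatment of the case with no active transactions are assumptions made here for illustration. The method itself is described in [58], not in this paper.

```python
def commit_lsn(first_lsn_of_active_txns, end_of_log_lsn):
    """Assumed definition: smallest first-record LSN among active transactions.

    With no active transactions, every logged update is committed, so a value
    just past the end of the log is returned (an assumption of this sketch).
    """
    return min(first_lsn_of_active_txns.values(), default=end_of_log_lsn + 1)


def needs_lock(page_lsn, c_lsn):
    """A reader may skip locking when the page predates all active transactions."""
    return page_lsn >= c_lsn


active = {"T7": 120, "T9": 95}   # hypothetical transaction table contents
c_lsn = commit_lsn(active, end_of_log_lsn=200)
old_page_needs_lock = needs_lock(80, c_lsn)
hot_page_needs_lock = needs_lock(130, c_lsn)
```

The test is conservative: a page with a recent page_LSN may still contain only committed data, in which case the reader simply falls back to normal locking.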
ACKNOWLEDGMENTS
We have benefited immensely from the work that was performed in the System R project and in the DB2 and IMS product groups. We have learned valuable lessons by looking at the experiences with those systems. Access to the source code and internal documents of those systems was very helpful. The Starburst project gave us the opportunity to begin from scratch and design some of the fundamental algorithms of a transaction system, taking into account experiences with the prior systems. We would like to acknowledge the contributions of the designers of the other systems. We would also like to thank our colleagues in the research and product groups that have adopted our research results. Our thanks also go to Klaus Kuespert, Brian Oki, Erhard Rahm, Andreas Reuter, Pat Selinger, Dennis Shasha, and Irv Traiger for their detailed comments on the paper.
REFERENCES

1. BAKER, J., CRUS, R., AND HADERLE, D. Method for assuring atomicity of multi-row update operations in a database system. U.S. Patent 4,498,145, IBM, Feb. 1985.
2. BADRINATH, B. R., AND RAMAMRITHAM, K. Semantics-based concurrency control: Beyond commutativity. In Proceedings 3rd IEEE International Conference on Data Engineering (Feb. 1987).
3. BERNSTEIN, P., HADZILACOS, V., AND GOODMAN, N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass., 1987.
4. BORR, A. Robustness to crash in a distributed database: A non-shared-memory multiprocessor approach. In Proceedings 10th International Conference on Very Large Data Bases (Singapore, Aug. 1984).
5. CHAMBERLAIN, D., GILBERT, A., AND YOST, R. A history of System R and SQL/Data System. In Proceedings 7th International Conference on Very Large Data Bases (Cannes, Sept. 1981).
6. CHANG, A., AND MERGEN, M. 801 storage: Architecture and programming. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 28-50.
7. CHANG, P. Y., AND MYRE, W. W. OS/2 EE database manager: Overview and technical highlights. IBM Syst. J. 27, 2 (1988).
8. COPELAND, G., KHOSHAFIAN, S., SMITH, M., AND VALDURIEZ, P. Buffering schemes for permanent data. In Proceedings International Conference on Data Engineering (Los Angeles, Feb. 1986).
9. CLARK, B. E., AND CORRIGAN, M. J. Application System/400 performance characteristics. IBM Syst. J. 28, 3 (1989).
10. CHENG, J., LOOSELY, C., SHIBAMIYA, A., AND WORTHINGTON, P. IBM Database 2 performance: Design, implementation, and tuning. IBM Syst. J. 23, 2 (1984).
11. CRUS, R., HADERLE, D., AND HERRON, H. Method for managing lock escalation in a multiprocessing, multiprogramming environment. U.S. Patent 4,716,528, IBM, Dec. 1987.
12. CRUS, R., MALKEMUS, T., AND PUTZOLU, G. R. Index mini-pages. IBM Tech. Disclosure Bull. 26, 4 (April 1983), 5460-5463.
13. CRUS, R., PUTZOLU, F., AND MORTENSON, J. A. Incremental data base log image copy. IBM Tech. Disclosure Bull. 25, 7B (Dec. 1982), 3730-3732.
14. CRUS, R., AND PUTZOLU, F. Data base allocation table. IBM Tech. Disclosure Bull. 25, 7B (Dec. 1982), 3722-2724.
15. CRUS, R. Data recovery in IBM Database 2. IBM Syst. J. 23, 2 (1984).
16. CURTIS, R. Informix-Turbo. In Proceedings IEEE Compcon Spring '88 (Feb.-March 1988).
17. DASGUPTA, P., LEBLANC, R., JR., AND APPELBE, W. The Clouds distributed operating system. In Proceedings 8th International Conference on Distributed Computing Systems (San Jose, Calif., June 1988).
18. DATE, C. A Guide to INGRES. Addison-Wesley, Reading, Mass., 1987.
19. DEY, R., SHAN, M., AND TRAIGER, I. Method for dropping data sets. IBM Tech. Disclosure Bull. 25, 11A (April 1983), 5453-5455.
20. DEWITT, D., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H.-I., AND RASMUSSEN, R. The Gamma database machine project. IEEE Trans. Knowledge Data Eng. 2, 1 (March 1990).
21. DELORME, D., HOLM, M., LEE, W., PASSE, P., RICARD, G., TIMMS, G., JR., AND YOUNGREN, L. Database index journaling for enhanced recovery. U.S. Patent 4,819,156, IBM, April 1989.
22. DIXON, G. N., PARRINGTON, G. D., SHRIVASTAVA, S., AND WHEATER, S. M. The treatment of persistent objects in Arjuna. Comput. J. 32, 4 (1989).
23. DUCHAMP, D. Transaction management. Ph.D. dissertation, Tech. Rep. CMU-CS-88-192, Carnegie-Mellon Univ., Dec. 1988.
24. EFFELSBERG, W., AND HAERDER, T. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec. 1984).
25. ELHARDT, K., AND BAYER, R. A database cache for high performance and fast restart in database systems. ACM Trans. Database Syst. 9, 4 (Dec. 1984).
26. FEKETE, A., LYNCH, N., MERRITT, M., AND WEIHL, W. Commutativity-based locking for nested transactions. Tech. Rep. MIT/LCS/TM-370.b, MIT, July 1989.
27. FOSSUM, B. Data base integrity as provided for by a particular data base management system. In Data Base Management, J. W. Klimbie and K. L. Koffeman, Eds., North-Holland, Amsterdam, 1974.
28. GAWLICK, D., AND KINKADE, D. Varieties of concurrency control in IMS/VS Fast Path. IEEE Database Eng. 8, 2 (June 1985).
29. GARZA, J., AND KIM, W. Transaction management in an object-oriented database system. In Proceedings ACM-SIGMOD International Conference on Management of Data (Chicago, June 1988).
30. GHEITH, A., AND SCHWAN, K. CHAOS-art: Support for real-time atomic transactions. In Proceedings 19th International Symposium on Fault-Tolerant Computing (Chicago, June 1989).
31. GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June 1981).
32. GRAY, J. Notes on data base operating systems. In Operating Systems - An Advanced Course, R. Bayer, R. Graham, and G. Seegmuller, Eds., LNCS Vol. 60, Springer-Verlag, New York, 1978.
33. HADZILACOS, V. A theory of reliability in database systems. J. ACM 35, 1 (Jan. 1988), 121-145.
34. HAERDER, T. Handling hot spot data in DB-sharing systems. Inf. Syst. 13, 2 (1988), 155-166.
35. HADERLE, D., AND JACKSON, R. IBM Database 2 overview. IBM Syst. J. 23, 2 (1984).
36. HAERDER, T., AND REUTER, A. Principles of transaction oriented database recovery - A taxonomy. ACM Comput. Surv. 15, 4 (Dec. 1983).
37. HELLAND, P. The TMF application programming interface: Program to program communication, transactions, and concurrency in the Tandem NonStop system. Tandem Tech. Rep. TR89.3, Tandem Computers, Feb. 1989.
38. HERLIHY, M., AND WEIHL, W. Hybrid concurrency control for abstract data types. In Proceedings 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Austin, Tex., March 1988).
39. HERLIHY, M., AND WING, J. M. Avalon: Language support for reliable distributed systems. In Proceedings 17th International Symposium on Fault-Tolerant Computing (Pittsburgh, Pa., July 1987).
40. HASKIN, R., MALACHI, Y., SAWDON, W., AND CHAN, G. Recovery management in QuickSilver. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 82-108.
41. IMS/VS Version 1 Release 3 Recovery/Restart. Doc. GG24-1652, IBM, April 1984.
42. IMS/VS Version 2 Application Programming. Doc. SC26-4178, IBM, March 1986.
43. IMS/VS Extended Recovery Facility (XRF): Technical Reference. IBM, April 1987.
44. Workstation Data Save Facility/VM: General Information. IBM.
45. KORTH, H. Locking primitives in a database system. J. ACM 30, 1 (Jan. 1983), 55-79.
46. LUM, V., DADAM, P., ERBE, R., GUENAUER, J., PISTOR, P., WALCH, G., WERNER, H., AND WOODFILL, J. Design of an integrated DBMS to support advanced applications. In Proceedings International Conference on Foundations of Data Organization (Kyoto, May 1985).
47. LEVINE, F., AND MOHAN, C. Method for concurrent record access, insertion, deletion and alteration using an index tree. U.S. Patent 4,914,569, IBM, April 1990.
48. LEWIS, R. Z. IMS Program Isolation Locking. Doc. GG66-3193, IBM Dallas Systems Center, Dec. 1990.
49. LINDSAY, B., HAAS, L., MOHAN, C., WILMS, P., AND YOST, R. Computation and communication in R*: A distributed database manager. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984). Also in Proceedings 9th ACM Symposium on Operating Systems Principles (Bretton Woods, Oct. 1983). Also available as IBM Res. Rep. RJ3740, San Jose, Calif., Jan. 1983.
50. LINDSAY, B., MOHAN, C., AND PIRAHESH, H. Method for reserving space needed for rollback actions. IBM Tech. Disclosure Bull. 29, 6 (Nov. 1986).
51. LISKOV, B., AND SCHEIFLER, R. Guardians and actions: Linguistic support for robust, distributed programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983).
52. LINDSAY, B., SELINGER, P., GALTIERI, C., GRAY, J., LORIE, R., PUTZOLU, F., TRAIGER, I., AND WADE, B. Notes on distributed databases. IBM Res. Rep. RJ2571, San Jose, Calif., July 1979.
53. MCGEE, W. C. The information management system IMS/VS - Part II: Data base facilities; Part V: Transaction processing facilities. IBM Syst. J. 16, 2 (1977).
54. MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J. Single table access using multiple indexes: Optimization, execution, and concurrency control techniques. In Proceedings International Conference on Extending Data Base Technology (Venice, March 1990). An expanded version of this paper is available as IBM Res. Rep. RJ7341, IBM Almaden Research Center, March 1990.
55. MOHAN, C., FUSSELL, D., AND SILBERSCHATZ, A. Compatibility and commutativity of lock modes. Inf. Control 61, 1 (April 1984). Also available as IBM Res. Rep. RJ3948, San Jose, Calif., July 1983.
56. MOSS, E., GRIFFETH, N., AND GRAHAM, M. Abstraction in recovery management. In Proceedings ACM SIGMOD International Conference on Management of Data (Washington, D.C., May 1986).
57. MOHAN, C. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In Proceedings 16th International Conference on Very Large Data Bases (Brisbane, Aug. 1990). Another version of this paper is available as IBM Res. Rep. RJ7008, IBM Almaden Research Center, Sept. 1989.
58. MOHAN, C. Commit_LSN: A novel and simple method for reducing locking and latching in transaction processing systems. In Proceedings 16th International Conference on Very Large Data Bases (Brisbane, Aug. 1990). Also available as IBM Res. Rep. RJ7344, IBM Almaden Research Center, Feb. 1990.
59. MOHAN, C. ARIES/LHS: A concurrency control and recovery method using write-ahead logging for linear hashing with separators. IBM Res. Rep., IBM Almaden Research Center, Nov. 1990.
60. MOHAN, C. A cost-effective method for providing improved data availability during DBMS restart recovery after a failure. In Proceedings of the 4th International Workshop on High Performance Transaction Systems (Asilomar, Calif., Sept. 1991). Also available as IBM Res. Rep. RJ8114, IBM Almaden Research Center, April 1991.
61. MOSS, E., LEBAN, B., AND CHRYSANTHIS, P. Fine grained concurrency for the database cache. In Proceedings 3rd IEEE International Conference on Data Engineering (Los Angeles, Feb. 1987).
62. MOHAN, C., AND LEVINE, F. ARIES/IM: An efficient and high concurrency index management method using write-ahead logging. IBM Res. Rep. RJ6846, IBM Almaden Research Center, Aug. 1989.
63. MOHAN, C., AND LINDSAY, B. Efficient commit protocols for the tree of processes model of distributed transactions. In Proceedings 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing (Montreal, Aug. 1983). Also available as IBM Res. Rep. RJ3881, IBM San Jose Research Laboratory, June 1983.
64. MOHAN, C., LINDSAY, B., AND OBERMARCK, R. Transaction management in the R* distributed database management system. ACM Trans. Database Syst. 11, 4 (Dec. 1986).
65. MOHAN, C., AND NARANG, I. Recovery and coherency-control protocols for fast intersystem page transfer and fine-granularity locking in a shared disks transaction environment. In Proceedings 17th International Conference on Very Large Data Bases (Barcelona, Sept. 1991). A longer version is available as IBM Res. Rep. RJ8017, IBM Almaden Research Center, March 1991.
66. MOHAN, C., AND NARANG, I. Efficient locking and caching of data in the multisystem shared disks transaction environment. In Proceedings of the International Conference on Extending Database Technology (Vienna, Mar. 1992). Also available as IBM Res. Rep. RJ8301, IBM Almaden Research Center, Aug. 1991.
67. MOHAN, C., NARANG, I., AND PALMER, J. A case study of problems in migrating to distributed computing: Page recovery using multiple logs in the shared disks environment. IBM Res. Rep. RJ7343, IBM Almaden Research Center, March 1990.
68. MOHAN, C., NARANG, I., AND SILEN, S. Solutions to hot spot problems in a shared disks transaction environment. In Proceedings of the 4th International Workshop on High Performance Transaction Systems (Asilomar, Calif., Sept. 1991). Also available as IBM Res. Rep. 8281, IBM Almaden Research Center, Aug. 1991.
69. MOHAN, C., AND PIRAHESH, H. ARIES-RRH: Restricted repeating of history in the ARIES transaction recovery method. In Proceedings 7th International Conference on Data Engineering (Kobe, April 1991). Also available as IBM Res. Rep. RJ7342, IBM Almaden Research Center, Feb. 1990.
70. MOHAN, C., AND ROTHERMEL, K. Recovery protocol for nested transactions using write-ahead logging. IBM Tech. Disclosure Bull. 31, 4 (Sept. 1988).
71. MOSS, E. Checkpoint and restart in distributed transaction systems. In Proceedings 3rd Symposium on Reliability in Distributed Software and Database Systems (Clearwater Beach, Oct. 1983).
72. MOSS, E. Log-based recovery for nested transactions. In Proceedings 13th International Conference on Very Large Data Bases (Brighton, Sept. 1987).
73. MOHAN, C., TREIBER, K., AND OBERMARCK, R. Algorithms for the management of remote backup databases for disaster recovery. IBM Res. Rep. RJ7885, IBM Almaden Research Center, Nov. 1990.
74. NETT, E., KAISER, J., AND KROGER, R. Providing recoverability in a transaction oriented distributed operating system. In Proceedings 6th International Conference on Distributed Computing Systems (Cambridge, May 1986).
161
75. [...] The commit/abort problem [...] locking. GMD Tech. Rep. 267, GMD mbH, Sankt Augustin, Sept. 1987.
76. OBERMARCK, R. IMS/VS [...] feature. IBM [...] Calif., July 1980.
77. ONEILL, P. The Escrow transaction method. ACM [...] (Dec. 1986).
78. ONG, K. SYNAPSE approach to database recovery. In Proceedings 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (Waterloo, April 1984).
79. PEINL, P., REUTER, A., AND SAMMER, H. High contention in a stock trading database: A case study. In Proceedings ACM SIGMOD International Conference on Management of Data (Chicago, June 1988).
80. PETERSON, R. J., AND STRICKLAND, J. P. Log write-ahead protocols and IMS/VS logging. In Proceedings 2nd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (Atlanta, Ga., March 1983).
81. RENGARAJAN, T. K., SPIRO, P., AND WRIGHT, W. [...] DBMS software. Digital Tech. J. 8 (Feb. 1989).
82. REUTER, A. [...] logging [...] IEEE Trans. Softw. Eng. SE-6, [...]
83. REUTER, A. [...] data elements. [...] SIGMOD Symposium on Principles of Database Systems [...]
84. [...] of recovery techniques. [...]
85. ROTHERMEL, K., AND MOHAN, C. ARIES/NT: A recovery method based on write-ahead logging for nested transactions. In Proceedings 15th International Conference on Very Large Data Bases (Amsterdam, Aug. 1989). A longer version of this paper is available as IBM Res. Rep. RJ6650, IBM Almaden Research Center, Jan. 1989.
86. ROWE, L., AND STONEBRAKER, M. The commercial INGRES epilogue. Ch. 3 in The INGRES Papers, Stonebraker, M., Ed., Addison-Wesley, Reading, Mass., 1986.
87. SCHWARZ, P., CHANG, W., FREYTAG, J., LOHMAN, G., MCPHERSON, J., MOHAN, C., AND PIRAHESH, H. Extensibility in the Starburst database system. In Proceedings Workshop on Object-Oriented Data Base Systems (Asilomar, Sept. 1986). Also available as IBM Res. Rep. RJ5311, San Jose, Calif., Sept. 1986.
88. SCHWARZ, P. Transactions on typed objects. Ph.D. dissertation, Tech. Rep. CMU-CS-84-166, Carnegie Mellon Univ., Dec. 1984.
89. SHASHA, D., AND GOODMAN, N. Concurrent search structure algorithms. ACM Trans. Database Syst. 13, 1 (March 1988).
90. SPECTOR, A., PAUSCH, R., AND BRUELL, G. Camelot: A flexible, distributed transaction processing system. In Proceedings IEEE Compcon Spring 88 (San Francisco, Calif., March 1988).
91. SPRATT, L. The transaction resolution journal: Extending the before journal. ACM Oper. Syst. Rev. 19, 3 (July 1985).
92. STONEBRAKER, M. The design of the POSTGRES storage system. In Proceedings 13th International Conference on Very Large Data Bases (Brighton, Sept. 1987).
93. STILLWELL, J. W., AND RADER, P. M. IMS/VS Version 1 Release 3 Fast Path [...] Doc. G320-0149-0, IBM, Sept. 1984.
94. STRICKLAND, J., UHROWCZIK, P., AND WATTS, V. IMS/VS: An evolving system. [...] J. 21, 4 (1982).
95. THE TANDEM DATABASE GROUP. NonStop SQL: A distributed, high-performance, high-availability implementation of SQL. In Lecture Notes in Computer Science Vol. 359, D. Gawlick, M. Haynie, and A. Reuter, Eds., Springer-Verlag, New York, 1989.
96. [...] Managing IBM Database 2 buffers to maximize performance. [...]
97. [...] memory management [...] ACM Oper. Syst. Rev. 16, [...]
98. [...] A simulation study for the performance [...] recovery method. M.Sc. thesis, Middle East Technical [...]
99. WATSON, C. T., AND ABERLE, G. F. System/38 machine database support. In IBM Syst. 38/Tech. Dev., Doc. G580-0237, IBM, July 1980.
100. WEIKUM, G. Principles and realization strategies of multi-level transaction management. ACM Trans. Database Syst. 16, 1 (Mar. 1991).
101. WEINSTEIN, M., PAGE, T., JR., LIVEZEY, B., AND POPEK, G. Transactions and synchronization in a distributed operating system. In Proceedings 10th ACM Symposium on Operating Systems Principles (Orcas Island, Dec. 1985).
Received January [...] 1991