
CHAPTER 3.F.

J. N. Gray IBM Research Laboratory San Jose, Ca., USA

Notes on Data Base Operating Systems

Abstract: This paper is a compendium of data base management operating systems folklore. It is an early paper and is still in draft form. It is intended as a set of course notes for a class on data base operating systems. After a brief overview of what a data management system is, it focuses on particular issues unique to the transaction management component, especially locking and recovery.


Notes on Data Base Operating Systems

Jim Gray
IBM Research Laboratory
San Jose, California 95193
Summer 1977

ACKNOWLEDGMENTS

This paper plagiarizes the work of the large and anonymous army of people working in the field. Because of the state of the field, there are few references to the literature (much of the "literature" is in internal memoranda, private correspondence, program logic manuals and prologues to the source language listings of various systems.)

The section on data management largely reflects the ideas of Don Chamberlin, Ted Codd, Chris Date, Dieter Gawlick, Andy Heller, Frank King, Franco Putzolu, and Bob Taylor. The discussion of views is abstracted from a paper co-authored with Don Chamberlin and Irv Traiger.

The section on data communications stems from conversations with Denny Anderson, Homer Leonard, and Charlie Sanders.

The section on transaction scheduling derives from discussions with Bob Jackson and Thomas Work. The ideas on distributed transaction management are an amalgam of discussions with Homer Leonard. Bruce Lindsay motivated the discussion of exception handling.

The presentation of consistency and locking derives from discussions and papers co-authored with Kapali Eswaran, Raymond Lorie, Franco Putzolu and Irving Traiger. Also Ron Obermark (IMS program isolation), Phil Macri (DMS 1100), and Paul Roever have clarified many locking issues for me.

The presentation of recovery is co-authored with Paul McJones and John Nauman. Dieter Gawlick made many valuable suggestions. It reflects the ideas of Mike Blasgen, Dar Busa, Ron Obermark, Earl Jenner, Tom Price, Franco Putzolu, Butler Lampson, Howard Sturgis and Steve Weick.

All members of the System R group (IBM Research, San Jose) have contributed materially to this paper.

I am indebted to Mike Blasgen, Dieter Gawlick, Jerry Saltzer and especially to John Nauman, each of whom made many constructive suggestions about earlier drafts of these notes.

If you feel your ideas or work are inadequately or incorrectly plagiarized, please annotate this manuscript and return it to me.


1. INTRODUCTION

Most large institutions have now heavily invested in a data base system. In general they have automated such clerical tasks as inventory control, order entry, or billing. These systems often support a world-wide network of hundreds of terminals. Their purpose is to reliably store and retrieve large quantities of data. The life of many institutions is critically dependent on such systems; when the system is down the corporation has amnesia. This puts an enormous burden on the implementors and operators of such systems. The systems must on the one hand be very high performance and on the other hand they must be very reliable.

1.1. A SAMPLE SYSTEM

Perhaps it is best to begin by giving an example of such a system. A large bank may have one thousand teller terminals (several have 20,000 tellers but at present no single system supports such a large network). For each teller, there is a record describing the teller's cash drawer and for each branch there is a record describing the cash position of that branch (bank general ledger). It is likely to have several million demand deposit accounts (say 10,000,000 accounts). Associated with each account is a master record giving the account owner, the account balance, and a list of recent deposits and withdrawals applicable to this account. Also there are records describing the account owners. This data base occupies over 10,000,000,000 bytes and must all be on-line at all times.

The data base is manipulated with application dependent transactions which were written for this application when it was installed. There are many transactions defined on this data base to query it and update it. A particular user is allowed to invoke a subset of these transactions. Invoking a transaction consists of typing a message and pushing a button. The teller terminal appends the transaction identity, teller identity and terminal identity to the message and transmits it to the central data manager. The data communication manager receives the message and translates it to some canonical form. It then passes the message to the transaction manager which validates the teller's authorization to invoke the specified transaction and then allocates and dispatches an instance of the transaction. The transaction processes the message, generates a response and terminates. Data communications delivers the message to the teller.
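The path just described (terminal, then data communications, then transaction manager, then a transaction instance) can be sketched in a few lines. This is a minimal illustration, not any actual system's interface; all component and function names here are invented.

```python
# Illustrative sketch of the message flow described above. All names are
# invented for this example; no real system's API is being depicted.

def data_communications(raw_message):
    """Translate the incoming message to some canonical form."""
    return raw_message.strip().upper()

def transaction_manager(message, authorized, handlers):
    """Validate authorization, then dispatch an instance of the transaction."""
    tran_id = message.split()[0]
    if tran_id not in authorized:
        return "NOT AUTHORIZED"
    return handlers[tran_id](message)     # allocate/dispatch the transaction

def debit_credit(message):
    """The transaction itself: process the message, generate a response."""
    return "NEW BALANCE = ..."

handlers = {"DEBIT_CREDIT": debit_credit}
response = transaction_manager(
    data_communications("debit_credit 12345 +100"),
    authorized={"DEBIT_CREDIT"},
    handlers=handlers)
print(response)   # the response data communications would deliver to the teller
```

Note the layering: the transaction itself never sees raw terminal traffic, and authorization is checked before any transaction code runs.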
Perhaps the most common transaction in this environment is the DEBIT_CREDIT transaction which takes in a message from any teller, debits or credits the appropriate account (after running some validity checks), adjusts the teller cash drawer and branch balance, and then sends a response message to the teller. The transaction flow is:


DEBIT_CREDIT:
    BEGIN_TRANSACTION;
    GET MESSAGE;
    EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
    FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
    IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
        PUT NEGATIVE RESPONSE;
    ELSE DO;
        ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
        POST HISTORY RECORD ON ACCOUNT (DELTA);
        CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
        BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
        PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
        END;
    COMMIT;

At peak periods the system runs about thirty transactions per second with a response time of two seconds.

The DEBIT_CREDIT transaction is very "small". There is another class of transactions which behave rather differently. For example, once a month a transaction is run which produces a summary statement for each account. This transaction might be described by:

MONTHLY_STATEMENT:
    ANSWER ::= SELECT
        FROM ACCOUNT, HISTORY
        WHERE ACCOUNT.ACCOUNT_NUMBER = HISTORY.ACCOUNT_NUMBER
            AND HISTORY_DATE > LAST_REPORT
        GROUPED BY ACCOUNT.ACCOUNT_NUMBER,
        ASCENDING BY ACCOUNT.ACCOUNT_ADDRESS;

That is, collect all recent history records for each account and place them clustered with the account record into an answer file. The answers appear sorted by mailing address. If each account has about fifteen transactions against it per month then this transaction will read 160,000,000 records and write a similar number of records. A naive implementation of this transaction will take 80 days to execute (50 milliseconds per disk seek implies two million seeks per day); however, the system must run this transaction once a month and it must complete within a few hours.

There is a broad spread of transactions between these two types. Two particularly interesting types of transactions are conversational transactions which carry on a dialogue with the user and distributed transactions which access data or terminals at several nodes of a computer network.

Systems of 10,000 terminals or 100,000,000,000 bytes of on-line data or 150 transactions per second are generally considered to be the limit of present technology (software and hardware).
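The 80-day figure follows directly from the stated numbers; a quick back-of-envelope check, using only figures from the text:

```python
# Back-of-envelope check of the naive monthly-statement estimate:
# 160,000,000 record reads at one 50-millisecond disk seek each.
ms_per_day = 24 * 60 * 60 * 1000
seeks_per_day = ms_per_day // 50      # 1,728,000 -- roughly "two million"
days = 160_000_000 / 2_000_000        # at the text's rounded two million/day
print(seeks_per_day, days)            # 1728000 80.0
```

This is exactly why the naive one-seek-per-record plan fails and the answer must instead be produced with sequential (clustered) access.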

1.2. RELATIONSHIP TO OPERATING SYSTEM

If one tries to implement such an application on top of a general purpose operating system, it quickly becomes clear that many necessary functions are absent from the operating system. Historically, two approaches have been taken to this problem:


Write a new, simpler and "vastly superior" operating system.

Extend the basic operating system to have the desired function.

The first approach was very popular in the mid-sixties and is having a renaissance with the advent of minicomputers. The initial cost of a data management system is so low that almost any large customer can justify "rolling his own". The performance of such tailored systems is often ten times better than one based on a general purpose system. One must trade this off against the problems of maintaining the system as it grows to meet new needs and applications. Groups which followed this path now find themselves maintaining a rather large operating system which must be modified to support new devices (faster disks, tape archives,...) and new protocols (e.g. networks and displays.) Gradually, these systems have grown to include all the functions of a general purpose operating system. Perhaps the most successful approach to this has been to implement a hypervisor which runs both the data management operating system and some "standard" operating system. The "standard" operating system runs when the data manager is idle. The hypervisor is simply an interrupt handler which dispatches one or another system.

The second approach of extending the basic operating system is plagued with a different set of difficulties. The principal problem is the performance penalty of a general purpose operating system. Very few systems are designed to deal with very large files, or with networks of thousands of nodes. To take a specific example, consider the process structure of a general purpose system: the allocation and deallocation of a process should be very fast (500 instructions for the pair is expensive) because we want to do it 100 times per second. The storage occupied by the process descriptor should also be small (less than 1000 bytes.) Lastly, preemptive scheduling of processes makes no sense since they are not CPU bound (they do a lot of I/O). A typical system uses 16,000 bytes to represent a process and requires 200,000 instructions to allocate and deallocate this structure (systems without protection do it cheaper.)

Another problem is that the general purpose systems have been designed for batch and time sharing operation. They have not paid sufficient attention to issues such as continuous operation: keeping the system up for weeks at a time and gracefully degrading in case of some hardware or software error.
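The instruction counts above make the point by themselves. The following arithmetic assumes, purely for illustration, a machine executing about one million instructions per second (a figure typical of the period, not one stated in the text):

```python
# Cost of process create/destroy at 100 transactions per second.
# The 1-MIPS processor speed is an assumption for illustration only.
mips = 1_000_000
heavy = 200_000 * 100 / mips   # typical general purpose system
cheap = 500 * 100 / mips       # the budget the text argues for
print(heavy, cheap)            # 20.0 0.05
```

At 200,000 instructions per pair, process bookkeeping alone would consume twenty processors' worth of cycles; at 500 instructions it is five percent of one processor.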

1.3. GENERAL STRUCTURE OF DATA MANAGEMENT SYSTEMS

These notes try to discuss issues which are independent of which operating system strategy is adopted. No matter how the system is structured, there are certain problems it must solve. The general structure common to several data management systems is presented. Then two particular problems within the transaction management component are discussed in detail: concurrency control (locking) and system reliability (recovery). This presentation decomposes the system into four major components:

Dictionary: the central repository of the description and definition of all persistent system objects.

Data Communications: manages teleprocessing lines and message traffic.

Data Base manager: manages the information stored in the system.

Transaction Management: manages system resources and system services such as locking and recovery.

Each of these components calls the others and in turn depends on the basic operating system for services.

1.4. BIBLIOGRAPHY

These notes are rather nitty-gritty; they are aimed at system implementors rather than at users. If this is the wrong level of detail for you (it is too detailed) then you may prefer the very readable books:

Martin, Computer Data-base Organization, Prentice Hall, 1977. (What every DP vice president should know.)

Martin, Computer Data-base Organization, (2nd edition), Prentice Hall, 1976. (What every application programmer should know.)

The following is a brief list of some of the more popular general purpose data management systems which are commercially available:

Airlines Control Program, International Business Machines Corporation.

Customer Information Control System, International Business Machines Corporation.

Data Management System 1100, Sperry Univac Corporation.

Extended Data Management System, Xerox Corporation.

Information Management System / Virtual Systems, International Business Machines Corporation.

Integrated Database Management System, Cullinane Corporation.

Integrated Data Store/1, Honeywell Information Systems Inc.

Model 204 Data Base Management System, Computer Corporation of America.

System 2000, MRI Systems Corporation.

Total, Cincom Systems Corporation.

Each of these manufacturers will be pleased to provide you with extensive descriptions of their systems.


Several experimental systems are under construction at present. Some of the more interesting are:

Astrahan et. al., "System R: a Relational Approach to Database Management", ACM Transactions on Database Systems, Vol. 1, No. 2, June 1976.

Marill and Stern, "The Datacomputer: A Network Data Utility.", Proc. 1975 National Computer Conference, AFIPS Press, 1975.

Stonebraker et. al., "The Design and Implementation of INGRES.", ACM Transactions on Database Systems, Vol. 1, No. 3, Sept 1976.

There are very few publicly available case studies of data base usage. The following are interesting but may not be representative:

IBM Systems Journal, Vol. 16, No. 2, June 1977. (describes the facilities and use of IMS and ACP).

"IMS/VS Primer," IBM World Trade Systems Center, Palo Alto, California, form number S320-5767-1, January 1977.

"Share Guide IMS User Profile, A Summary of Message Processing Program Activity in Online IMS Systems," IBM Palo Alto-Raleigh Systems Center Bulletin, form number G320-6005, January 1977.

Also there is one "standard" (actually "proposed standard") system:

CODASYL Data Base Task Group Report, April 1971. Available from ACM.


2. DICTIONARY

2.1. WHAT IT IS

The description of the system, the data bases, the transactions, the telecommunications network, and of the users are all collected in the dictionary. This repository:

Defines the attributes of objects such as data bases and terminals.

Cross-references these objects.

Records natural language (e.g. German) descriptions of the meaning and use of objects.

When the system arrives, the dictionary contains only a very few definitions of transactions (usually utilities), defines a few distinguished users (operator, data base administrator,...), and defines a few special terminals (master console). The system administrator proceeds to define new terminals, transactions, users, and data bases. (The system administrator function includes data base administration (DBA) and data communications (network) administration (DCA).) Also, the system administrator may modify existing definitions to match the actual system or to reflect changes.

This addition and modification process is treated as an editing operation. For example, one defines a new user by entering the "define" transaction and selecting USER from the menu of definable types. This causes a form to be displayed which has a field for each attribute of a user. The definer fills in this form and submits it to the dictionary. If the form is incorrectly filled out, it is redisplayed and the definer corrects it. Redefinition follows a similar pattern: the current form is displayed, edited and then submitted. (There is also a non-interactive interface to the dictionary for programs rather than people.) All changes are validated by the dictionary for syntactic and semantic correctness. The ability to establish the correctness of a definition is similar to the ability of a compiler to detect the correctness of a program. That is, many semantic errors go undetected. These errors are a significant problem.
Aside from validating and storing definitions, the dictionary provides a query facility which answers questions such as: "Which transactions use record type A of file B?" or, "What are the attributes of terminal 3426?".

The dictionary performs one further service, that of compiling the definitions into a "machine readable" form more directly usable by the other system components. For example, a terminal definition is converted from a variable length character string to a fixed format "descriptor" giving the terminal attributes in non-symbolic form.

The dictionary is a data base along with a set of transactions to manipulate this data base. Some systems integrate the dictionary with the data management system so that the data definition and data manipulation interface are homogeneous. This has the virtue of sharing large bodies of code and of providing a uniform interface to the user. INGRES and System R are examples of such systems.
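The compilation of a symbolic definition into a fixed-format descriptor can be sketched as follows. The particular field layout (terminal id, line speed, screen lines, model) is invented for this example; the point is only the variable-length-string to fixed-binary conversion:

```python
import struct

# Sketch of compiling a symbolic terminal definition into a fixed-format
# descriptor. The layout (id, line speed, screen lines, model) is an
# invented example, not the layout of any actual system.
def compile_terminal_descriptor(definition):
    return struct.pack(">IHH8s",
                       definition["terminal_id"],
                       definition["line_speed"],
                       definition["screen_lines"],
                       definition["model"].encode().ljust(8))

d = compile_terminal_descriptor(
    {"terminal_id": 3426, "line_speed": 9600,
     "screen_lines": 24, "model": "3277"})
print(len(d))   # 16 -- every descriptor has the same size and shape
```

Because every descriptor has identical size and field offsets, the other components can index into it directly rather than re-parse a character string on every reference.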


Historically, the argument against using the data base for the dictionary has been performance. There is very high read traffic on the dictionary during the normal operation of the system. A user logon requires examining the definitions of the user, his terminal, his category, and of the session that his logon establishes. The invocation of a transaction requires examining his authorization, the transaction, and the transaction descriptor (to build the transaction.) In turn the transaction definition may reference data bases and queues which may in turn reference files, records and fields. The performance of these accesses is critical because they appear in the processing of each transaction. These performance constraints combined with the fact that the accesses are predominantly read-only have caused most systems to special-case the dictionary.

The dictionary definitions and their compiled descriptors are stored by the data base management component. The dictionary compiled descriptors are stored on a special device and a cache of them is maintained in high-speed storage on an LRU (Least Recently Used) basis. This mechanism generally uses a coarse granularity of locks and because operations are read only it keeps no log. Updates to the descriptors are made periodically while the system is quiesced.

The descriptors in the dictionary are persistent. During operation, many other short-lived descriptors are created for short-lived objects such as cursors, processes, and messages. Many of these descriptors are also kept in the descriptor cache.

The dictionary is the natural extension of the catalog or file system present in operating systems. The dictionary simply attaches more semantics to the objects it stores and more powerful operators on these objects. Readers familiar with the literature may find a striking similarity between the dictionary and the notion of conceptual schema which is "a model of the enterprise".
The dictionary is the conceptual schema without its artificial intelligence aspects. In time the dictionary component will evolve in the direction suggested by papers on conceptual schema.
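The read-only LRU descriptor cache described earlier is a small, self-contained mechanism; a minimal sketch (names invented for illustration):

```python
from collections import OrderedDict

# Minimal sketch of a read-only LRU descriptor cache: descriptors are
# fetched from the special device on a miss, and the least recently used
# entry is evicted when the cache is full. No log is kept -- reads only.
class DescriptorCache:
    def __init__(self, capacity, fetch_from_device):
        self.capacity = capacity
        self.fetch = fetch_from_device        # called on a cache miss
        self.cache = OrderedDict()            # insertion order = LRU order

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)      # mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[name] = self.fetch(name)
        return self.cache[name]

cache = DescriptorCache(2, fetch_from_device=lambda n: f"<descriptor {n}>")
cache.get("USER"); cache.get("TERMINAL"); cache.get("USER")
cache.get("QUEUE")            # evicts TERMINAL, the least recently used
print(list(cache.cache))      # ['USER', 'QUEUE']
```

Because the traffic is predominantly read-only, no locking or logging appears in the fast path; updates happen only while the system is quiesced.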

2.2. BIBLIOGRAPHY

DB/DC Data Dictionary General Information Manual, IBM, form number GH20-9104-1, May 1977.

UCC TEN, Technical Information Manual, University Computing Corporation, 1976.

Lefkovits, Data Dictionary Systems, Q.E.D. Information Sciences Inc., 1977. (A buyer's guide for data dictionaries.)

Nijssen (editor), Modeling in Data Base Management Systems, North Holland, 1976. (All you ever wanted to know about conceptual schema.)

"SEQUEL 2: A Unified Approach to Data Definition, Manipulation, and Control." Chamberlin et. al., IBM Journal of Research and Development, Vol. 20, No. 6, November 1976. (presents a unified data definition, data manipulation facility.)


3. DATA MANAGEMENT

The data management component stores and retrieves sets of records. It implements the objects: network, set of records, cursor, record, field, and view.

3.1. RECORDS AND FIELDS

A record type is a sequence of field types, and a record instance is a corresponding sequence of field instances. Record types and instances are persistent objects. Record instances are the atomic units of insertion and retrieval. Fields are sub-objects of records and are the atomic units of update. Fields have the attributes of atoms (e.g. FIXED(31) or CHAR(*)) and field instances have atomic values (e.g. "3" or "BUTTERFLY"). Each record instance has a unique name called a record identifier (RID).

A field type constrains the type and values of instances of a field and defines the representation of such instances. The record type specifies what fields occur in instances of that record type. A typical record might have ten fields and occupy 256 bytes although records often have hundreds of fields (e.g. a record giving statistics on a census tract has over 600 fields), and may be very large (several thousand bytes). A simple record (nine fields and about eighty characters) might be described by:

DECLARE 1 PHONE_BOOK RECORD,
          2 PERSON_NAME CHAR(*),
          2 ADDRESS,
            3 STREET_NUMBER CHAR(*),
            3 STREET_NAME CHAR(*),
            3 CITY CHAR(*),
            3 STATE CHAR(*),
            3 ZIP_CODE CHAR(5),
          2 PHONE_NUMBER,
            3 AREA_CODE CHAR(3),
            3 PREFIX CHAR(3),
            3 STATION CHAR(4);

The operators on records include INSERT, DELETE, FETCH, and UPDATE. Records can be CONNECTED to and DISCONNECTED from membership in a set (see below). These operators actually apply to cursors which in turn point to records. The notions of record and field correspond very closely to the notions of record and element in COBOL or structure and field in PL/I. Records are variously called entities, segments, tuples, and rows by different subcultures. Most systems have similar notions of records although they may or may not support variable length fields, optional fields (nulls), or repeated fields.
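The PHONE_BOOK declaration above maps naturally onto nested structures in most languages; here is one possible transcription (the sample data is invented), which also shows fields as the atomic unit of update:

```python
from dataclasses import dataclass

# The PHONE_BOOK record type above, transcribed as nested dataclasses.
# Records are the unit of insertion/retrieval; fields are the unit of update.
@dataclass
class Address:
    street_number: str
    street_name: str
    city: str
    state: str
    zip_code: str        # CHAR(5)

@dataclass
class PhoneNumber:
    area_code: str       # CHAR(3)
    prefix: str          # CHAR(3)
    station: str         # CHAR(4)

@dataclass
class PhoneBookRecord:
    person_name: str
    address: Address
    phone_number: PhoneNumber

# Invented sample data.
rec = PhoneBookRecord("SMITH, J.",
                      Address("123", "MAIN ST", "SAN JOSE", "CA", "95193"),
                      PhoneNumber("408", "555", "0000"))
rec.address.zip_code = "95192"   # an UPDATE touches one field, not the record
print(rec.address.zip_code)
```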

3.2. SETS

A set is a collection of records. This collection is represented by and implemented as an "access path" that runs through the collection of records. Sets perform the functions of:


relating the records of the set.

in some instances, directing the physical clustering of records in physical storage.

A record instance may occur in many different sets but it may occur at most once in a particular set.
There are three set types of interest:

Sequential set: the records in the set form a single sequence. The records in the set are ordered either by order of arrival (entry sequenced (ES)), by cursor position at insert (CS), or are ordered (ascending or descending) by some subset of field values (key sequenced (KS)). Sequential sets model indexed-sequential files (ISAM, VSAM).

Partitioned set: the records in the set form a sequence of disjoint groups of sequential sets. Cursor operators allow one to point at a particular group. Thereafter the sequential set operators are used to navigate within the group. The set is thus major ordered by hash and minor ordered (ES, CS or KS) within a group. Hashed files in which each group forms a hash bucket are modeled by partitioned sets.

Parent-child set: the records of the set are organized into a two level hierarchy. Each record instance is either a parent or a child (but not both). Each child has a unique parent and no children. Each parent has a (possibly null) list of children. Using parent-child sets one can build networks and hierarchies. Positional operators on parent-child sets include the operators to locate parents, as well as operations to navigate on the sequential set of children of a parent. The CONNECT and DISCONNECT operators explicitly relate a child to a parent. One obtains implicit connect and disconnect by asserting that records inserted in one set should also be connected to another. (Similar rules apply for connect, delete and update.) Parent-child sets can be used to support hierarchical and network data models.
A partitioned set is a degenerate form of a parent-child set (the partitions have no parents), and a sequential set is a degenerate form of a partitioned set (there is only one partition.) In this discussion care has been taken to define the operators so that they also subset. This has the consequence that if the program uses the simplest model it will be able to run on any data and also allows for subset implementations on small computers.

Inserting a record in one set may trigger its connection to several other sets. If set "I" is an index for set "F", then an insert, delete and update of a record in "F" may trigger a corresponding insert, delete, or update in set "I". In order to support this, data manager must know:

that insertion, update or deletion of a record causes connection to, movement in, or disconnection from other sets.

where to insert the new record in the new set:

For sequential sets, the ordering must be either key sequenced or entry sequenced.

For partitioned sets, data manager must know the partitioning rule and know that the partitions are entry sequenced or key sequenced.

For parent-child sets, the data manager must know that certain record types are parents and that others are children. Further, in the case of children, data manager must be able to deduce the parent of the child.

We will often use the term "file" as a synonym for set.
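The triggered-connection rule (an insert into file "F" forcing a corresponding insert into its index set "I") can be sketched as follows; the class and field names are invented for illustration:

```python
# Sketch of triggered index maintenance: inserting a record into file "F"
# automatically connects it to each index set "I". All names are invented.
class File:
    def __init__(self):
        self.records = {}       # RID -> record instance
        self.indexes = []       # sets that must be kept in step
        self.next_rid = 0

    def insert(self, record, key_field):
        rid = self.next_rid     # each record instance gets a unique RID
        self.next_rid += 1
        self.records[rid] = record
        for index in self.indexes:          # the triggered connections
            index.connect(record[key_field], rid)
        return rid

class Index:
    def __init__(self):
        self.entries = {}       # key value -> list of RIDs (key sequenced)

    def connect(self, key, rid):
        self.entries.setdefault(key, []).append(rid)

f, i = File(), Index()
f.indexes.append(i)             # "I" is an index for "F"
f.insert({"NAME": "SMITH", "BALANCE": 100}, key_field="NAME")
print(i.entries)                # {'SMITH': [0]}
```

Delete and update would follow the same pattern: the operation on "F" drives a corresponding disconnect or move in "I".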

3.3. CURSORS

A cursor is "opened" on a specific set and thereafter points exclusively to records in that set. After a cursor is opened it may be moved, copied, or closed. While a cursor is opened it may be used to manipulate the record it addresses. Records are addressed by cursors. Cursors serve the functions of:

pointing at a record.

enumerating all records in a set.

translating between the stored record format and the format visible to the cursor user. A simple instance of this might be a cursor which hides some fields of a record. This aspect will be discussed with the notion of view.

A cursor is an ephemeral object which is created from a descriptor when a transaction is initiated or during transaction execution by an explicit OPEN_CURSOR command. Also one may COPY_CURSOR a cursor to make another instance of the cursor with independent positioning. A cursor is opened on a specific set (which thereby defines the enumeration order (next) of the cursor.) A cursor is destroyed by the CLOSE_CURSOR command.

3.3.2. OPERATIONS ON CURSORS

Operators on cursors include:

FETCH (<cursor> [,<position>]) [HOLD] RETURNS (<record>)
Retrieves the record pointed at by the named cursor. The record is moved to the specified target. If the position is specified the cursor is first positioned. If HOLD is specified the record is locked for update (exclusive), otherwise the record is locked in share mode.

INSERT (<cursor> [,<position>] ,<record>)
Inserts the specified record into the set specified by cursor. If the set is key sequenced or entry sequenced then the cursor is moved to the correct position before the record is inserted, otherwise the record is inserted at (after) the current position of the cursor in the set. If the record type automatically appears in other sets, it is also inserted in them.


UPDATE (<cursor> [,<position>] ,<new-record>)
If position is specified the cursor is first positioned. The new record is then inserted in the set at the cursor position, replacing the record pointed at by the cursor. If the set is sequenced by the updated fields, this may cause the record and cursor to move in the set.

DELETE (<cursor> [,<position>])
Deletes the record pointed at by the cursor after optionally repositioning the cursor.

MOVE_CURSOR (<cursor> ,<position>) [HOLD]
Repositions the cursor in the set.
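The INSERT behavior on a key-sequenced set (move the cursor to the correct position first, then insert there) can be sketched with a few lines; the class is an invented illustration, not any system's interface:

```python
import bisect

# Sketch of INSERT on a key-sequenced set: the cursor is first moved to the
# correct position for the key, then the record goes in at that position,
# leaving the cursor pointing at the new record. Names are invented.
class Cursor:
    def __init__(self, records):
        self.records = records            # list of (key, record), key order
        self.position = 0

    def insert(self, key, record):
        keys = [k for k, _ in self.records]
        self.position = bisect.bisect_left(keys, key)   # position the cursor
        self.records.insert(self.position, (key, record))

    def fetch(self):
        return self.records[self.position][1]

c = Cursor([("ADAMS", 1), ("JONES", 2)])
c.insert("BROWN", 3)                # lands between ADAMS and JONES
print([k for k, _ in c.records])    # ['ADAMS', 'BROWN', 'JONES']
print(c.fetch())                    # 3 -- the cursor points at the new record
```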

3.3.3. CURSOR POSITIONING

A cursor is opened to traverse a particular set. Positioning expressions have the syntax:

<position> ::= [ <RID> | FIRST | N-TH | LAST | NEXT | PREVIOUS ]
               [ CHILD | PARENT | GROUP ]
               [ <SELECTION EXPRESSION> ]

where <RID>, FIRST, N-TH, and LAST specify specific record occurrences while the other options specify the address relative to the current cursor position. It is also possible to set a cursor from another cursor. The selection expression may be any boolean expression valid for all record types in the set. The selection expression includes the relational operators: =, ¬=, >, <, <=, >=, and for character strings a "matches-prefix" operator sometimes called generic key. If NEXT or PREVIOUS is specified, the set must be searched sequentially because the current position is relevant; otherwise, the search can employ hashing or indices to locate the record. The selection expression search may be performed via an index which maps field values into RIDs. Examples of commands are:

FETCH (CURSOR1, NEXT NAME='SMITH') HOLD RETURNS (POP);
DELETE (CURSOR1, NEXT NAME='JOE' CHILD);
INSERT (CURSOR1, , NEWCHILD);

For partitioned sets one may point the cursor at a specific partition by qualifying these operators with the modifier GROUP. A cursor on a parent-child (or partitioned) set points to both a parent record and a child record (or group and child within group). Cursors on such sets have two components: the parent or group cursor and the child cursor. Moving the parent cursor positions the child cursor to the first record in the group or under the parent. For parent-child sets one qualifies the position operator with the modifier NEXT PARENT in order to locate the first child of the next parent, or with the modifier WITHIN PARENT if the search is to be restricted to children of the current parent or group. Otherwise positional operators operate on children of the current parent. There are rather obscure issues associated with cursor positioning.


The following is a good set of rules. A cursor can have the following positions:

-  Null.
-  Before the first record.
-  At a record.
-  Between two records.
-  After the last record.

If the cursor points at a null set, then it is null. If the cursor points to a non-null set then it is always non-null.

-  Initially the cursor is before the first record unless OPEN_CURSOR specifies a position.
-  An INSERT operation leaves the cursor pointing at the new record.
-  A DELETE operation leaves the cursor between the two adjacent records, or at the top if there is no previous record, or at the bottom if there is a previous but no successor record.
-  An UPDATE operation leaves the cursor pointing at the updated record.
-  If an operation fails the cursor is not altered.
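These rules can be sketched as a small state model. The following is a hypothetical Python sketch; the position encodings and method signatures are invented for illustration, not any system's API:

```python
class Cursor:
    # Possible positions: ("null",), ("before_first",), ("at", i),
    # ("between", i-1, i), ("after_last",)  -- the encoding is hypothetical.
    def __init__(self, records):
        self.records = records
        # a cursor on a null set is null; otherwise initially before the first record
        self.pos = ("null",) if not records else ("before_first",)

    def insert(self, i, record):
        self.records.insert(i, record)
        self.pos = ("at", i)                  # INSERT leaves the cursor at the new record

    def update(self, i, record):
        self.records[i] = record
        self.pos = ("at", i)                  # UPDATE leaves the cursor at the updated record

    def delete(self, i):
        del self.records[i]
        if not self.records:
            self.pos = ("null",)              # the set became null
        elif i == 0:
            self.pos = ("before_first",)      # at the top: no previous record
        elif i == len(self.records):
            self.pos = ("after_last",)        # at the bottom: no successor record
        else:
            self.pos = ("between", i - 1, i)  # between the two adjacent records
```

A failed operation would simply raise before touching `self.pos`, leaving the cursor unaltered.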

3.4. VARIOUS DATA MODELS

Data models differ in their notion of set.

3.4.1. RELATIONAL DATA MODEL
The relational model restricts itself to homogeneous (only one record type) sequential sets. The virtue of this approach is its simplicity and the ability to define operators that "distribute" over the set, applying uniformly to each record of the set. Since much of data processing involves repetitive operations on large volumes of data, this distributive property provides a concise language to express such algorithms. There is a strong analogy here with APL, which uses the simple data structure of array and therefore is able to define powerful operators which work for all arrays. APL programs are very short and much of the control structure of the program is hidden inside of the operators. To give an example of this, a "relational" program to find all overdue accounts in an invoice file might be:

    SELECT ACCOUNT_NUMBER
    FROM   INVOICE
    WHERE  DUE_DATE < TODAY;
This should be compared to a PL/1 program with a loop to get the next record and tests for DUE_DATE and END_OF_FILE. The MONTHLY_STATEMENT transaction described in the introduction is another instance of the power and usefulness of relational operators. On the other hand, if the work to be done does not involve processing many records, then the relational model seems to have little advantage over other models. Consider the DEBIT_CREDIT transaction which (1) reads a message from a terminal, (2) finds an account, (3) updates the account, (4) posts a history record, (5) updates the teller cash drawer, (6) updates the branch balance, and (7) puts a message to the terminal. Such a transaction would benefit little from relational operators (each operation touches only one record.) One can define aggregate operators that distribute over hierarchies or networks. For example, the MAPLIST function of LISP distributes an arbitrary function over an arbitrary data structure.
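The contrast between the set-oriented query and the record-at-a-time loop can be sketched in Python (the invoice data is invented; field names follow the example above). The set-oriented form hides the loop inside the operator:

```python
import datetime

TODAY = datetime.date(1977, 7, 1)   # an arbitrary "today" for the sketch
invoices = [
    {"ACCOUNT_NUMBER": 1, "DUE_DATE": datetime.date(1977, 6, 1)},
    {"ACCOUNT_NUMBER": 2, "DUE_DATE": datetime.date(1977, 8, 1)},
]

# Set-oriented (relational) style: the loop is hidden inside the operator.
overdue = [r["ACCOUNT_NUMBER"] for r in invoices if r["DUE_DATE"] < TODAY]

# Record-at-a-time (PL/1) style: explicit loop, explicit end-of-file test.
overdue2 = []
for r in invoices:                  # "get next record" until END_OF_FILE
    if r["DUE_DATE"] < TODAY:
        overdue2.append(r["ACCOUNT_NUMBER"])
```

Both produce the same answer; the relational form is shorter because the control structure lives inside the operator.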

3.4.2. HIERARCHICAL DATA MODEL
Hierarchical models use parent-child sets in a stylized way to produce a forest (collection of trees) of records. A typical application might use the three record types LOCATIONS, ACCOUNTS, and INVOICES and two parent-child sets to construct the following hierarchy: All the accounts at a location are clustered together and all outstanding invoices of an account are clustered with the account. That is, a location has its accounts as children and an account has its invoices as children. This may be depicted schematically by:

    +-----------+
    | LOCATIONS |
    +-----------+
          |
    +-----------+
    | ACCOUNTS  |
    +-----------+
          |
    +-----------+
    | INVOICES  |
    +-----------+

This structure has the advantage that records used together may appear clustered together in physical storage and that information common to all the children can be factored into the parent record. Also, one may quickly find the first record under a parent and deduce when the last has been seen without scanning the rest of the data base. Finding all invoices for an account received on a certain day involves positioning a cursor on the location, another cursor on the account number under that location, and a third cursor to scan over the invoices:

    SET CURSOR1 TO LOCATION=NAPA;
    SET CURSOR2 TO ACCOUNT=FREEMARK_ABBEY;
    SET CURSOR3 BEFORE FIRST CHILD (CURSOR2);
    DO WHILE (¬END_OF_CHILDREN):
        FETCH (CURSOR3) NEXT CHILD;
        DO_SOMETHING;
    END;

Because this is such a common phenomenon, and because in a hierarchy there is only one path to a record, most hierarchical systems abbreviate the cursor setting operation to setting the lowest cursor in the hierarchy by specifying a "fully qualified key" or path from the root to the leaf (the other cursors are set implicitly.) In the above example:

408

    SET CURSOR3 TO LOCATION=NAPA, ACCOUNT=FREEMARK_ABBEY, INVOICE=ANY;
    DO WHILE (¬END_OF_CHILDREN):
        FETCH (CURSOR3) NEXT CHILD;
        DO_SOMETHING;
    END;

which implicitly sets up cursors one and two.
The implicit record naming of the hierarchical model makes programming much simpler than for a general network. If the data can be structured as a hierarchy in some application then it is desirable to use this model to address it.
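A fully qualified key amounts to following the path from the root, setting the intermediate cursors implicitly. A toy sketch (the in-memory forest and the `children` helper are hypothetical; only the NAPA/FREEMARK_ABBEY names come from the example above):

```python
# Hypothetical in-memory forest: location -> account -> list of invoices.
forest = {
    "NAPA": {
        "FREEMARK_ABBEY": ["invoice-1", "invoice-2"],
        "OTHER_ACCOUNT": ["invoice-3"],
    },
}

def children(path):
    """Follow a fully qualified key from the root down the hierarchy.
    The intermediate cursors (location, account) are set implicitly
    by walking one level per key."""
    node = forest
    for key in path:
        node = node[key]
    return node

# SET CURSOR3 TO LOCATION=NAPA, ACCOUNT=FREEMARK_ABBEY, INVOICE=ANY;
for invoice in children(["NAPA", "FREEMARK_ABBEY"]):
    pass  # DO_SOMETHING with each child
```

Because there is only one path to each record, the path itself is an unambiguous name.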

3.4.3. NETWORK DATA MODEL
Not all problems conveniently fit a hierarchical model. If nothing else, different users may want to see the same information in a different hierarchy. For example an application might want to see the hierarchy "upside-down" with invoice at the top and location at the bottom. Support for logical hierarchies (views) requires that the data management system support a general network. The efficient implementation of certain relational operators (sort-merge join) also requires the full capability of the network data model. The general statement is that if all relationships are nested one-to-many mappings then the data can be expressed as a hierarchy. If there are many-to-many mappings then a network is required. To consider a specific example of the need for networks, imagine that several locations may service the same account and that each location services several accounts. Then the hierarchy introduced in the previous section would require either that locations be subsidiary to accounts and be duplicated, or that the accounts record be duplicated in the hierarchy under the two locations. This will give rise to complexities about the account having two balances..... A network model would allow one to construct the structure:

    +----------+        +----------+
    | LOCATION |        | LOCATION |
    +----------+        +----------+
        |      \       /      |
        v       v     v       v
    +----------+        +----------+
    | ACCOUNT  |        | ACCOUNT  |
    +----------+        +----------+
        |                     |
        v                     v
    +----------+        +----------+
    | INVOICE  |        | INVOICE  |
    +----------+        +----------+

    A network built out of two parent-child sets.

3.4.4. COMPARISON OF DATA MODELS

By using "symbolic" pointers (keys), one may map any network data structure into a relational structure. In that sense all three models are equivalent and the relational model is completely general. However, there are substantial differences in the style and convenience of the different models. Analysis of specific cases usually indicates that associative pointers (keys) cost three page faults to follow (for a multi-megabyte set) whereas following a direct pointer costs only one page fault. This performance difference explains why the equivalence of the three data models is irrelevant in practice. If there is heavy traffic between sets then pointers must be used. (High level languages can hide the use of these pointers.) It is my bias that one should resort to the more elaborate model only when the simpler model leads to excessive complexity or to poor performance.

3.5. VIEWS
Records, sets, and networks which are actually stored are called base objects. Any query evaluates to a virtual set of records which may be displayed on the user's screen, fed to a further query, deleted from an existing set, inserted into an existing set, or copied to form a new base set. More importantly for this discussion, the query definition may be stored as a named view. The principal difference between a copy and a view is that updates to the original sets which produced the virtual set will be reflected in a view but will not affect a copy. A view is a dynamic picture of a query, whereas a copy is a static picture. There is a need for both views and copies. Someone wanting to record the monthly sales volume of each department might run the following transaction at the end of each month (an arbitrary syntax):

    MONTHLY_VOLUME = DEPARTMENT, SUM(VOLUME)
                     FROM SALES
                     GROUPED BY DEPARTMENT;

The new base set MONTHLY_VOLUME is defined to hold the answer. On the other hand, the current volume can be gotten by the view:

    DEFINE CURRENT_VOLUME (DEPARTMENT, VOLUME) VIEW AS:
                     DEPARTMENT, SUM(VOLUME)
                     FROM SALES
                     GROUPED BY DEPARTMENT;

Thereafter, any updates to the SALES set will be reflected in the CURRENT_VOLUME view. Again, CURRENT_VOLUME may be used in the same ways base sets can be used. For example one can compute the difference between the current and monthly volume. The semantics of views are quite simple. Views can be supported by a process of substitution in the abstract syntax (parse tree) of the statement. Each time a view is mentioned, it is replaced by its definition. To summarize, any query evaluates to a virtual set. Naming this virtual set makes it a view. Thereafter, this view can be used as a set. This allows views to be defined as field and record subsets of sets, statistical summaries of sets and more complex combinations of sets.
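The substitution process can be sketched as macro expansion over a toy parse tree. The tree encoding and the `expand` helper below are invented for illustration; a real system would work on its own abstract syntax:

```python
# A view definition maps a view name to its defining query's parse tree.
# The tuple encoding ("QUERY", set_name, operator) is purely hypothetical.
view_definitions = {
    "CURRENT_VOLUME": ("QUERY", "SALES", "GROUP-SUM"),
}

def expand(tree):
    """Each time a view name is mentioned, replace it by its definition."""
    if isinstance(tree, str):
        # a set name: substitute (and recursively expand) if it names a view
        return expand(view_definitions[tree]) if tree in view_definitions else tree
    return tuple(expand(child) for child in tree)

# A query over a view expands into a query over base sets only.
query = ("DIFFERENCE", "CURRENT_VOLUME", "MONTHLY_VOLUME")
expanded = expand(query)
```

After expansion, only base set names remain, so the rest of the system never needs to know views exist for read operations.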


There are three major reasons for defining views:

-  Data independence: giving programs a logical view of data, thereby isolating them from data reorganization.

-  Data isolation: giving the program exactly that subset of the data it needs, thereby minimizing error propagation.

-  Authorization: hiding sensitive information from a program, its authors and users.
As the data base evolves, records and sets are often "reorganized". Changing the underlying data should not cause all the programs to be recompiled or rewritten so long as the semantics of the data is not changed. Old programs should be able to see the data in the old way. Views are used to achieve this. Typical reorganization operations include:

-  Adding fields to records.
-  Splitting records.
-  Combining records.
-  Adding or dropping access paths.

Simple views of base records may be obtained by:

-  Renaming or permuting fields.
-  Converting the representation of a field.

Simple variations of base sets may be obtained by:

-  Selecting that subset of the records of a set which satisfy some predicate.
-  Projecting out some fields or records in the set.
-  Combining existing sets together into new virtual sets which can be viewed as a single larger set.

Consider the example of a set of records of the form:

    | NAME | ADDRESS | TELEPHONE_NUMBER | ACCOUNT_NUMBER |
Some applications might be interested only in the name and telephone number, others might want name and address, while others might want name and account number; and of course one application would like to see the whole record. A view can appropriately subset the base set. If the set owner decides to partition the record into two new record sets:

    PHONE_BOOK: | NAME | ADDRESS | PHONE_NUMBER |
    ACCOUNTS:   | NAME | ACCOUNT_NUMBER |

Programs which used views will now access base sets (records) and programs which accessed the entire larger set will now access a view (logical set/record). This larger view is defined by:

    DEFINE VIEW WHOLE_THING:
        NAME, ADDRESS, PHONE_NUMBER, ACCOUNT_NUMBER
        FROM PHONE_BOOK, ACCOUNTS
        WHERE PHONE_BOOK.NAME = ACCOUNTS.NAME;

3.5.1. VIEWS AND UPDATE

Any view can support read operations; however, since only base sets are actually stored, only base sets can actually be updated. To make an update via a view, it must be possible to propagate the updates down to the underlying base set. If the view is very simple (e.g., record subset) then this propagation is straightforward. If the view is a one-to-one mapping of records in some base set but some fields of the base are missing from the view, then update and delete present no problem, but insert requires that the unspecified ("invisible") fields of the new records in the base set be filled in with the "undefined" value. This may or may not be allowed by the integrity constraints on the base set. Beyond these very simple rules, propagation of updates from views to base sets becomes complicated, dangerous, and sometimes impossible. To give an example of the problems, consider the WHOLE_THING view mentioned above. Deletion of a record may be implemented by a deletion from one or both of the constituent sets (PHONE_BOOK and ACCOUNTS). The correct deletion rule is dependent on the semantics of the data. Similar comments apply to insert and update. My colleagues and I have resigned ourselves to the idea that there is no elegant solution to the view update problem. (Materialization (reading) is not a problem!) Existing systems use either very restrictive view mechanisms (subset only), or they provide incredibly ad hoc view update facilities. We propose that simple views (subsets) be done automatically and that a technique akin to that used for abstract data types be used for complex views: the view definer will specify the semantics of the operators NEXT, FETCH, INSERT, DELETE, and UPDATE.
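The simple one-to-one case can be sketched as follows. Field names follow the PHONE_BOOK example above; the helpers and the `UNDEFINED` encoding are hypothetical:

```python
# Update through a one-to-one field-subset view: visible fields propagate to
# the base record; on insert, invisible fields get the "undefined" value.
UNDEFINED = None
BASE_FIELDS = ("NAME", "ADDRESS", "PHONE_NUMBER")
VIEW_FIELDS = ("NAME", "PHONE_NUMBER")        # the view hides ADDRESS

base = [{"NAME": "SMITH", "ADDRESS": "NAPA", "PHONE_NUMBER": "555"}]

def view_insert(view_record):
    rec = {f: UNDEFINED for f in BASE_FIELDS}  # invisible fields stay undefined
    rec.update(view_record)
    base.append(rec)                           # integrity checks would go here

def view_update(i, view_record):
    assert set(view_record) <= set(VIEW_FIELDS)
    base[i].update(view_record)                # one-to-one: propagation is direct

view_insert({"NAME": "JONES", "PHONE_NUMBER": "556"})
view_update(0, {"PHONE_NUMBER": "557"})
```

For a join view like WHOLE_THING no such mechanical rule exists, which is exactly the difficulty described above.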

3.6. STRUCTURE OF DATA MANAGER

Data manager is large enough to be subdivided into several components:
View component: is responsible for interpreting the request and calling the other components to do the actual work. The view component implements cursors and uses them to communicate between the internal and external representations of the view.

Record component: stores logical records on "pages", manages the contents of pages and the problems of variable length and overflow records.

Index component: implements sequential and associative access to sets. If only associative access is required, hashing should be used. If both sequential and associative access are required then indices implemented as B-trees should be used (see Knuth Vol. 3 or IBM's Virtual Sequential Access Method.)

412

Buffer manager: maps the data "pages" on secondary storage to a primary storage buffer pool. If the operating system provided a really fancy page manager (virtual memory) then the buffer manager might not be needed. But issues such as double buffering of sequential I/O, the Write Ahead Log protocol (see recovery section), checkpoint, and locking seem to argue against using the page managers of existing systems. If you are looking for a hard problem, here is one: define an interface to page management which is useable by data management in lieu of buffer management.
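A minimal sketch of such a buffer pool, assuming an LRU replacement policy and ignoring the write-ahead-log and checkpoint complications just mentioned (the class and its interface are invented for illustration):

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer pool: maps pages on secondary storage into a primary-storage
    pool, evicting the least recently used page when the pool is full."""
    def __init__(self, disk, capacity):
        self.disk = disk                  # page_id -> contents ("secondary storage")
        self.capacity = capacity
        self.pool = OrderedDict()         # page_id -> contents, in LRU order
        self.reads = 0                    # count of physical reads

    def fetch(self, page_id):
        if page_id in self.pool:          # hit: return the buffered page immediately
            self.pool.move_to_end(page_id)
            return self.pool[page_id]
        if len(self.pool) >= self.capacity:
            victim, contents = self.pool.popitem(last=False)  # steal the LRU buffer
            self.disk[victim] = contents                       # write it back first
        self.reads += 1                   # miss: read the page from disk
        self.pool[page_id] = self.disk[page_id]
        return self.pool[page_id]
```

A real buffer manager would also fix/unfix pages and consult the log manager before writing a dirty page, which is where the Write Ahead Log protocol enters.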

3.7. A SAMPLE DATA BASE DESIGN

The introduction described a very simple data base and a simple transaction which uses it. We discuss how that data base could be structured and how the transaction would access it. The data base consists of the records:

    ACCOUNT  (ACCOUNT_NUMBER, CUSTOMER_NUMBER, ACCOUNT_BALANCE, HISTORY)
    CUSTOMER (CUSTOMER_NUMBER, CUSTOMER_NAME, ADDRESS, .....)
    HISTORY  (TIME, TELLER, CODE, ACCOUNT_NUMBER, CHANGE, PREV_HISTORY)
    CASH_DRAWER (TELLER_NUMBER, BALANCE)
    BRANCH_BALANCE (BRANCH, BALANCE)
    TELLER (TELLER_NUMBER, TELLER_NAME, ......)

This is a very cryptic description which says that a customer record has fields giving the customer number, customer name, address and other attributes. The CASH_DRAWER, BRANCH_BALANCE and TELLER files (sets) are rather small (less than 100,000 bytes). The ACCOUNT and CUSTOMER files are large (about 1,000,000,000 bytes). The history file is extremely large. If there are fifteen transactions against each account per month and if each history record is fifty bytes then the history file grows 7,500,000,000 bytes per month. Traffic on BRANCH_BALANCE and CASH_DRAWER is high and is by BRANCH and TELLER_NUMBER respectively. Therefore these two sets are kept in high speed storage and are accessed via a hash on these attributes. That is, these sets are implemented as partitioned sets. Traffic on the ACCOUNT file is high but random. Most accesses are via ACCOUNT_NUMBER but some are via CUSTOMER_NUMBER. Therefore, the file is hashed on ACCOUNT_NUMBER (partitioned set). A sequential set, NAMES, is maintained on these records which gives a sequential and associative access path to the records ascending by customer name.
CUSTOMER is treated similarly (having a hash on customer number and an index on customer name.) The TELLER file is organized as a sequential set. The HISTORY file is the most interesting. These records are written once and thereafter are only read. Almost every transaction generates such a record and for legal reasons the file must be maintained forever. This causes it to be kept as an entry sequenced set. New records are inserted at the end of the set. To allow all recent history records for a specific account to be quickly located, a parent-child set is defined to link each ACCOUNT record (parent) to its HISTORY records (children). Each ACCOUNT record points to its most recent HISTORY record. Each HISTORY record points to the previous history record for that ACCOUNT. Given this structure, we can discuss the execution of the CREDIT_DEBIT transaction outlined in the introduction. We will assume that the locking is done at the granularity of a page and that recovery is achieved by keeping a log (see section on transaction management.)
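The history chain can be sketched in a few lines. Field names follow the record design above; the in-memory representation (a Python list standing in for the entry sequenced set) is hypothetical:

```python
# Entry-sequenced HISTORY file with a parent-child chain: each ACCOUNT points
# to its most recent HISTORY record; each HISTORY record points to the
# previous history record for that account.
history = []                                   # entry sequenced: append only

def post_history(account, change):
    rid = len(history)                         # new records go at the end
    history.append({"ACCOUNT_NUMBER": account["ACCOUNT_NUMBER"],
                    "CHANGE": change,
                    "PREV_HISTORY": account["HISTORY"]})
    account["HISTORY"] = rid                   # parent points at newest child

def recent_history(account):
    """Walk the chain from the account's newest HISTORY record backwards."""
    rid, changes = account["HISTORY"], []
    while rid is not None:
        changes.append(history[rid]["CHANGE"])
        rid = history[rid]["PREV_HISTORY"]
    return changes                             # newest first

acct = {"ACCOUNT_NUMBER": 7, "HISTORY": None}
post_history(acct, +100)
post_history(acct, -30)
```

The chain lets a transaction find an account's recent activity without scanning the enormous HISTORY file.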

413

At initiation, the data manager allocates the cursors for the transaction on the ACCOUNTS, HISTORY, BRANCH, and CASH_DRAWER sets. In each instance it gets a lock on the set to insure that the set is available for update (this is an IX mode lock as explained in the locking section 5.7.6.2.) Locking at a finer granularity will be done during transaction execution (see locking section). The first call the data manager sees is a request to find the ACCOUNT record with a given account number. This is done by hashing the account number, thereby computing an anchor for the hash chain. Buffer manager is called to bring that page of the file into fast storage. Buffer manager looks in the buffer pool to see if the page is there. If the page is present, buffer manager returns it immediately. Otherwise, it finds a free buffer page, reads the page into that buffer and returns the buffer. Data manager then locks the page in share mode (so that no one else modifies it). This lock will be held to the end of the transaction. The record component searches the page for a record with that account number. If the record is found, its value is returned to the caller and the cursor is left pointing at the record. The next request updates the account balance of the record addressed by the cursor. This requires converting the share mode lock acquired by the previous call to a lock on the page in exclusive mode, so that no one else sees the new account balance until the transaction successfully completes. Also the record component must write a log record that allows it to undo or redo this update in case of a transaction or system abort (see section on recovery).
Further, the transaction must note that the page depends on a certain log record so that buffer manager can observe the write ahead log protocol (see recovery section.) Lastly, the record component does the update to the balance of the record. Next the transaction fabricates the history record and inserts it in the history file as a child of the fetched account. The record component calls buffer manager to get the last page of the history file (since it is an entry sequenced set the record goes on the last page.) Because there is a lot of insert activity on the HISTORY file, the page is likely to be in the buffer pool. So buffer manager returns it, the record component locks it and updates it. Next, the record component updates the parent-child set so that the new history record is a child of the parent account record. All of these updates are recorded in the system log in case of error. The next call updates the teller cash drawer. This requires locking the appropriate CASH_DRAWER record in exclusive mode (it is located by hash). An undo-redo log record is written and the update is made. A similar scenario is performed for the BRANCH_BALANCE file. When the transaction ends, data manager releases all its locks and puts the transaction's pages in the buffer manager's chain of pages eligible for write to disk. If data manager or any other component detects an error at any point, it issues an ABORT_TRANSACTION command which initiates transaction undo (see recovery section.) This causes data manager to undo all its updates to records on behalf of this user and then to release all its locks and buffer pages.

The recovery and locking aspects of data manager are elaborated in later sections. I suggest the reader design and evaluate the performance of the MONTHLY_STATEMENT transaction described in the introduction as well as a transaction which, given two dates and an account number, will display the history of that account.

3.8. COMPARISON TO FILE ACCESS METHODS

From the example above, it should be clear that data manager is a lot more fancy than the typical file access methods (indexed sequential files). File systems usually do not support partitioned or parent-child sets. Some support the notion of record, but none support the notions of field, network or view. They generally lock at the granularity of a file rather than at the granularity of a record. File systems generally do recovery by taking periodic image dumps of the entire file. This does not work well for a transaction environment or for very large files. In general, data manager builds upon the operating system file system so that:

-  The operating system is responsible for device support.

-  The operating system utilities for allocation, import, export and accounting are useable.

-  The data is available to programs outside of the data manager.

3.9. BIBLIOGRAPHY

Chamberlin et al., "Views, Authorization, and Locking in a Relational Data Base System," 1975 NCC, Spartan Press, 1975. (Explains what views are and the problems associated with them.)

Computing Surveys, Vol. 8, No. 1, March 1976. (A good collection of papers giving current trends and issues related to the data management component of data base systems.)

Date, An Introduction to Database Systems, Addison Wesley, 1975. (The seminal book on the data management part of data management systems.)

Date, "An Architecture for High Level Language Database Extensions," Proceedings of 1976 SIGMOD Conference, ACM, 1976. (Unifies the relational, hierarchical and network models.)

Knuth, The Art of Computer Programming: Sorting and Searching, Vol. 3, Addison Wesley, 1975. (Explains all about B-trees among other things.)

McGee, IBM Systems Journal, Vol. 16, No. 2, 1977, pp. 84-168. (A very readable tutorial on IMS, what it does, how it works, and how it is used.)

Senko, "Data Structures and Data Accessing in Data Base Systems, Past, Present, Future," IBM Systems Journal, Vol. 16, No. 3, 1977, pp. 208-257. (A short tutorial on data models.)
415

4. DATA COMMUNICATIONS

The area of data communications is the least understood aspect of DB/DC systems. It must deal with evolving network managers, evolving intelligent terminals and in general seems to be in a continuing state of chaos. Do not feel too bad if you find this section bewildering.

Data communications is responsible for the flow of messages. Messages may come via telecommunications lines from terminals and from other systems, or messages may be generated by processes running within the system. Messages may be destined for external endpoints, for buffer areas called queues, or for executing processes. Data communications externally provides the functions of:

-  Routing messages.

-  Buffering messages.

-  Message mapping so sender and receiver can each be unaware of the "physical" characteristics of the other.

Internally data communications provides:

-  Message transformation which maps messages to and from an "external" format palatable to network manager.

-  Device control of terminals.

-  Message recovery in the face of transmission errors and system errors.
4.1. MESSAGES, SESSIONS, AND RELATIONSHIP TO NETWORK MANAGER
Messages and endpoints are the fundamental objects of data communications. A message consists of a set of records. Records in turn consist of a set of fields. Messages therefore look very much like data base sequential sets. Messages are defined by message descriptors. Typical unformatted definitions might be:

-  A line from a typewriter terminal is a one field, one record message.

-  A screen image for a display is a two field (control and data), one record message.

-  A multi-screen display image is a multi-field, multi-record message.

Data communications depends heavily on the network manager provided by the base operating system. ARPANET, DECNET, and SNA (embodied in NCP and VTAM) are examples of such network managers. The network manager provides the notion of endpoint, which is the smallest addressable network unit. A work station, a queue, a process, and a card reader are each examples of endpoints. Network manager transmits rather stylized transmission records (TRs). These are simply byte strings. Network manager makes a best effort to deliver these byte strings to their destination. It is the responsibility of data communications to package messages (records and fields) into transmission records and then reconstruct the message from transmission records when they arrive at the other end. The following figure summarizes this: application and terminal control programs see messages via sessions. DC in the host and terminal map these messages into transmission records which are carried by the network manager.

    TRANSACTION                                TERMINAL CONTROL PROGRAM
        | message                                  | message
    DC IN HOST        <------session------>    DC IN TERMINAL
        | transmission record                      | transmission record
    NETWORK MANAGER   <-----connection---->    NETWORK MANAGER

              The three main layers of a session.

There are two ways to send messages. A one-shot message can be sent to an endpoint in a canonical form with a very rigid protocol. Logon messages (session initiation) are often of this form. The second way to send messages is via an established session between the two endpoints. When the session is established, certain protocols are agreed to (e.g. messages will be recoverable, session will be half duplex, ...). Thereafter, messages sent via the session obey these protocols. Sessions:

-  Establish the message formats desired by the sender and receiver.

-  Allow sender and receiver to validate one another's identity once rather than revalidating each message.

-  Allow a set of messages to be related together (see conversations).

-  Establish recovery, routing, and pacing protocols.
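The packaging responsibility can be sketched as follows. The TR payload size and the (sequence, payload) encoding are invented for illustration; real network managers carry considerably more header information:

```python
TR_SIZE = 8   # hypothetical transmission-record payload size, in bytes

def to_transmission_records(message: bytes):
    """Package a message byte string into sequence-numbered transmission records."""
    return [(seq, message[i:i + TR_SIZE])
            for seq, i in enumerate(range(0, len(message), TR_SIZE))]

def reassemble(trs):
    """Reconstruct the message at the other end, whatever order TRs arrive in."""
    return b"".join(payload for _, payload in sorted(trs))

msg = b"DEBIT ACCOUNT 12345 BY 100"
trs = to_transmission_records(msg)
trs.reverse()    # network manager makes only a best effort at delivery order
```

The sequence numbers let the receiving DC component rebuild the message even though the network manager treats each TR as an independent byte string.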

The network operating system provides connections between endpoints. A connection should be thought of as a piece of wire which can carry messages blocked (by data communications) into transmission records. Sessions map many to one onto connections. At any instant, a session uses a particular connection. But if the connection fails or if an endpoint fails, the session may be transparently mapped to a new connection. For example, if a terminal breaks, the operator may move the session to a new terminal. Similarly, if a connection breaks, an alternate connection may be established. Connections hide the problems of transmission management (an SNA term):

-  Transmission control: blocking and deblocking transmission records (TRs), managing TR sequence numbers and first level retry logic.

-  Path control: routing of TRs through the network.

-  Link control: sending TRs over teleprocessing lines.

-  Pacing: dividing the bandwidth and buffer pools of the network among connections.

The data communications component and the network manager cooperate in implementing the notion of session.
417

4.2. SESSION MANAGEMENT

The principal purpose of the session notion is to:

-  Give device independence: the session makes transparent whether the endpoint is ASCII, EBCDIC, one-line, multi-line, program or terminal.

-  Manage the high level protocols for recovery, related messages and conversations.
Session creation specifies the protocols to be used on the session by each participant. One participant may be an ASCII typewriter and the other participant may be a sophisticated system. In this case the sophisticated system has lots of logic to handle the session protocol and errors on the session. On the other hand if the endpoint is an intelligent terminal and if the other endpoint is willing to accept the terminal's protocol, the session management is rather simple. (Note: The above is the way it is supposed to work. In practice, sessions with intelligent terminals are very complex and the programs are much more subtle because intelligent terminals can make such complex mistakes. Typically, it is much easier to handle a master-slave session than to handle a symmetric session.) Network manager simply delivers transmission records to endpoints. So it is the responsibility of data communications to "know" about the device characteristics and to control the device. This means that DC must implement all the code to provide the terminal appearance. There is a version of this code for each device type (display, printer, typewriter, ...). This causes the DC component to be very big in terms of thousands (K) of Lines Of Code (KLOC). If the network manager defines a generally useful endpoint model, then the DC manager can use this model for endpoints which fit the model. This is the justification for the TYPE1, TYPE2, ... Logical Units (endpoints) of SNA and for the attempts to define a network logical terminal for ARPANET. Sessions with dedicated terminals and peer nodes of the network are automatically (re)established when the system is (re)started.
Of course, the operator of the terminal must re-establish his i d e n t i t y so that security will not be violated. Sessions for s w i t c h e d lines are created dynamically as the terminals connect to the system. When DC c r e a t e s the session, it s p e c i f i e s what protocols are to be used to t r a n s l a t e message format-s so that the session user is not aware of the c h a r a c t e r i s t i c s of the device at the other end point.
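As a concrete illustration of this translation idea, here is a minimal Python sketch in which session creation fixes a per-device translator that delivers a canonical string to the session user. The device names, codecs and function names are invented for illustration; they are not taken from any real DC component.

```python
# Sketch of session-time protocol selection: DC picks a translator per
# device type at session creation, so the session user sees only a
# canonical character string. All names here are illustrative.

def ascii_to_canonical(raw: bytes) -> str:
    # Hypothetical ASCII typewriter endpoint: strip line-control characters.
    return raw.decode("ascii").rstrip("\r\n")

def ebcdic_to_canonical(raw: bytes) -> str:
    # Hypothetical EBCDIC endpoint: translate to the canonical character set
    # (cp500 is a standard Python EBCDIC codec, used here as a stand-in).
    return raw.decode("cp500").rstrip()

TRANSLATORS = {"ascii": ascii_to_canonical, "ebcdic": ebcdic_to_canonical}

def create_session(device_type: str):
    # Session creation fixes the protocol; afterwards the session user
    # never sees the device characteristics.
    return TRANSLATORS[device_type]

receive = create_session("ascii")
print(receive(b"DEPOSIT 100\r\n"))   # -> DEPOSIT 100
```

The point of the sketch is only that the choice of translator is made once, at session creation, rather than by the session user on every message.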

4.2. QUEUES

As mentioned before a session may be:

FROM: program or terminal or queue
TO:   program or terminal or queue

Queues allow buffered transmission between endpoints. Queues are associated (by DC) with users, endpoints and transactions. If a user is not logged on or if a process is doing something else, a queue can be used to hold one or more messages thereby freeing the session for further work or for termination. At a later time the program or endpoint may poll the queue and obtain the message.


Queues are actually passive so one needs to associate an algorithm with a queue. Typical algorithms are:

Allocate N servers for this queue.

Schedule a transaction when a message arrives in this queue.

Schedule a transaction when N messages appear in the queue.

Schedule a transaction at specified intervals.

Further, queues may be declared to be recoverable in which case DC is responsible for reconstructing the queue and its messages if the system crashes or if the message consumer aborts.
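The triggering algorithms above can be sketched as follows. `TriggeredQueue`, its field names, and the single batching policy shown are hypothetical illustrations, not an actual DC interface; the interval-driven policy would need a timer and is omitted.

```python
from collections import deque

# Illustrative sketch of a passive queue with an associated scheduling
# algorithm: a server transaction is scheduled when N messages appear.
class TriggeredQueue:
    def __init__(self, batch_threshold=1, schedule=print):
        self.messages = deque()
        self.batch_threshold = batch_threshold  # "when N messages appear"
        self.schedule = schedule                # starts a server transaction

    def enqueue(self, msg):
        self.messages.append(msg)
        # Schedule a transaction when N messages appear in the queue.
        if len(self.messages) >= self.batch_threshold:
            self.schedule(list(self.messages))

    def dequeue(self):
        return self.messages.popleft()

    def replace(self, msg):
        # A recoverable queue puts the message back if its consumer aborts.
        self.messages.appendleft(msg)
```

With `batch_threshold=1` this degenerates to "schedule a transaction when a message arrives".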

4.3. MESSAGE RECOVERY

A session may be designated as recoverable in which case all messages traveling on the session are sequence numbered and logged. If the transmission fails (positive acknowledge not received by sender), then the session endpoints resynchronize back to that message sequence number and the lost and subsequent messages are re-presented by the sender endpoint. If one of the endpoints fails, when it is restarted the session will be reestablished and the communication resumed. This requires that the endpoints be "recoverable" although one endpoint of a session may assume recovery responsibility for the other. If a message ends up in a recoverable queue then:

If the dequeuer of the message (session or process) aborts, the message will be replaced in the queue.

If the system crashes, the queue will be reconstructed (using the log or some other recovery mechanism).

If the session or queue is not recoverable, then the message is lost if the transmission, dequeuer or system fail.

It is the responsibility of the data communications component to assure that a recoverable message is "successfully" processed or presented exactly once. It does this by requiring that receipt of recoverable messages be acknowledged. A transaction "acknowledges" receipt of a message after it has processed it, at commit time (see recovery section.)
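A minimal sketch of this exactly-once discipline, assuming a toy in-memory queue (all names invented): the message is acknowledged, and so consumed for good, only at commit; an abort re-presents it.

```python
from collections import deque

# Sketch of "exactly once" processing from a recoverable queue: the
# dequeued message stays "in flight" until the consuming transaction
# commits; an abort replaces it in the queue. Toy model, not a real DC.
class RecoverableQueue:
    def __init__(self):
        self.queue = deque()
        self.in_flight = None   # dequeued but not yet acknowledged

    def put(self, msg):
        self.queue.append(msg)

    def get(self):
        self.in_flight = self.queue.popleft()
        return self.in_flight

    def commit(self):
        self.in_flight = None                 # acknowledge: message consumed

    def abort(self):
        self.queue.appendleft(self.in_flight) # re-present the message
        self.in_flight = None

q = RecoverableQueue()
q.put("withdraw $10")
msg = q.get()
q.abort()            # transaction failed: message goes back on the queue
assert q.get() == "withdraw $10"
q.commit()           # now it has been processed exactly once
```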

4.4. RESPONSE MODE PROCESSING.

The default protocol for recoverable messages consists of the scenario:


TERMINAL                              SYSTEM

-------- request -------->            log input message (forced)
<------- acknowledge -----
                                      process message
                                      COMMIT (log reply forced)
<------- reply -----------
-------- acknowledge ---->            log reply acknowledged

This implies four entries to the network manager (at each endpoint).


Each of these passes requires several thousand instructions (in typical implementations.) If one is willing to sacrifice the recoverability of the input message then the logging and acknowledgment of the input message can be eliminated and the reply sequence used as acknowledgment. This reduces line traffic and interrupt handling by a factor of two. This is the response mode message processing. The output (commit) message is logged. If something goes wrong before commit, it is as though the message was never received. If something goes wrong after commit then the log is used to re-present the message. However, the sender must be able to match responses to requests (he may send five requests and get three responses). The easiest way to do this is to insist that there is at most one message outstanding at a time (i.e. lock the keyboard). Another scheme is to acknowledge a batch of messages with a single acknowledge. One does this by tagging the acknowledge with the sequence number of the latest message received. If messages are recoverable, then the sender must retain the message until it is acknowledged and so acknowledges should be sent fairly frequently.
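The batched-acknowledgment scheme can be sketched as follows (class and field names invented for illustration): the sender retains each message until an acknowledge tagged with a sequence number at least that high arrives.

```python
# Sketch of batched acknowledgment: the receiver acks only the highest
# sequence number seen, and the sender releases every retained message
# up to and including that number. Illustrative only.

class Sender:
    def __init__(self):
        self.next_seq = 1
        self.retained = {}          # unacknowledged messages, by sequence no.

    def send(self, body):
        seq = self.next_seq
        self.next_seq += 1
        self.retained[seq] = body   # retain until acknowledged
        return seq, body

    def acknowledge(self, high_seq):
        # One ack covers every message with sequence number <= high_seq.
        for seq in [s for s in self.retained if s <= high_seq]:
            del self.retained[seq]

s = Sender()
for body in ("a", "b", "c"):
    s.send(body)
s.acknowledge(2)                   # acks messages 1 and 2 together
assert sorted(s.retained) == [3]   # only message 3 is still retained
```

The sketch also shows why acknowledges should be frequent: everything in `retained` is storage the sender cannot release.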

4.5. CONVERSATIONS

A conversation is a sequence of messages. Messages are usually grouped for recovery purposes so that all are processed or ignored together. A simple conversation might consist of a clerk filling out a form. Each line the operator enters is checked for syntactic correctness and checked to see that the airline flight is available or that the required number of widgets are in stock. If the form passes the test it is redisplayed by the transaction with the unit price and total price for the line item filled in. At any time the operator can abort the conversation. However, if the system backs up the user (because of deadlock) or if the system crashes then when it restarts it would be nice to re-present the messages of the conversation so that the operator's typing is saved. This requires that the group of messages be identified to the data communications component as a conversation so that it can manage this recovery process. (The details of this protocol are an unsolved problem so far as I know.)

4.6. MESSAGE MAPPING

One of the features provided by DC is to insulate each endpoint from the characteristics of the other. There are two levels of mapping to do this: One level maps transmission records into messages. The next level maps the message into a structured message. The first level of mapping is defined by the session, the second level of mapping is defined by the recipient (transaction). The first level of mapping converts all messages into some canonical form (e.g. a byte string of EBCDIC characters.) This transformation may handle such matters as pagination on a screen (if the message will not fit on one screen image). The second level of mapping transforms the message from an uninterpreted string of bytes into a message-record-field structure. When one writes a transaction, one also writes a message mapping description which makes these transformations. For example, an airlines reservation transaction might have a mapping program that first displays a blank ticket. On input, the mapping program extracts the fields entered by the terminal operator and puts them in a set of (multi-field) records. The transaction reads these records (much in the style data base records are read) and then puts out a set of records to be displayed. The mapping program fills in the blank ticket with these records and passes the resulting byte string to session management.

4.7. TOPICS NOT COVERED

A complete discussion of DC should include:

More detail on network manager (another lecturer will cover that).

Authorization to terminals (another lecturer will cover that).

More detail on message mapping.

The Logon-Signon process.

4.8. BIBLIOGRAPHY.

Kimbleton, Schneider, "Computer Communication Networks: Approaches, Objectives, and Performance Considerations," Computing Surveys, Vol. 7, No. 3, Sept. 1975. (A survey paper on network managers.)

"Customer Information Control System / Virtual Storage (CICS/VS) System/Application Design Guide," IBM, form number SC33-0068, 1977. (An eminently readable manual on all aspects of data management systems. Explains various session management protocols and explains a rather nice message mapping facility.)

Eade, Homan, Jones, "CICS/VS and its Role in Systems Network Architecture," IBM Systems Journal, Vol. 16, No. 3, 1977. (Tells how CICS joined SNA and what SNA did for it.)

IBM Systems Journal, Vol. 15, No. 1, Jan. 1976. (All about SNA, IBM's network manager architecture.)


5. TRANSACTION MANAGEMENT

The transaction management system is responsible for scheduling system activity, managing physical resources, and managing system shutdown and restart. It includes components which perform scheduling, recovery, logging, and locking. In general transaction management performs those operating system functions not available from the basic operating system. It does this either by extending the operating system objects (e.g. enhancing processes to have recovery and logging) or by providing entirely new facilities (e.g. independent recovery management.) As these functions become better understood, the duties of transaction management will gradually migrate into the operating system. Transaction management implements the following objects:

Transaction descriptor: A transaction descriptor is a prototype for a transaction giving instructions on how to build an instance of the transaction. The descriptor describes how to schedule the transaction, what recovery and locking options to use, what data base views the transaction needs, what program the transaction runs, and how much space and time it requires.

Process: A process (domain) which is capable of running or is running a transaction. A process is bound to a program and to other resources. A process is a unit of scheduling and resource allocation. Over time a process may execute several transaction instances although at any instant a process is executing on behalf of at most one transaction instance. Conversely, a transaction instance may involve several processes. Multiple concurrent processes executing on behalf of a single transaction instance are called cohorts. Data management system processes are fancier than operating system processes since they understand locking, recovery and logging protocols but we will continue to use the old (familiar) name for them.

Transaction instance: A process or collection of processes (cohorts) executing a transaction. A transaction instance is the unit of locking and recovery.

In what follows, we shall blur these distinctions and generically call each of these objects transactions unless a more precise term is needed. The life of a transaction instance is fairly simple. A message or request arrives which causes a process to be built from the transaction descriptor. The process issues a BEGIN_TRANSACTION action which establishes a recovery unit. It then issues a series of actions against the system state. Finally it issues the COMMIT_TRANSACTION action which causes the outputs of the transaction to be made public (both updates and output messages.) Alternatively, if the transaction runs into trouble, it may issue the ABORT_TRANSACTION action which cancels all actions performed by this transaction.

The system provides a set of objects and actions on these objects along with a set of primitives which allow groups of actions to be collected into atomic transactions. It guarantees no consistency on the objects beyond the atomicity of the actions. That is, an action will either successfully complete or it will not modify the system state at all.


Further, if two actions are performed on an object then the result will be equivalent to the serial execution of the two actions. (As explained below this is achieved by using locking within system actions.) The notion of transaction is introduced to provide a similar abstraction above the system interface. Transactions are an all or nothing thing, either they happen completely or all trace of them (except in the log) is erased. Before a transaction completes, it may be aborted and its updates to recoverable data may be undone. The abort can come either from the transaction itself (suicide: bad input data, operator cancel,...) or from outside (murder: deadlock, timeout, system crash,...) However, once a transaction commits (successfully completes), the effects of the transaction cannot be blindly undone. Rather, to undo a committed transaction, one must resort to compensation: running a new transaction which corrects the errors of its predecessor. Compensation is usually highly application dependent and is not provided by the system. These definitions may be clarified by a few examples. The following is a picture of the three possible destinies of a transaction.

BEGIN             BEGIN             BEGIN
action            action            action
action            action            action
action            action            action
COMMIT            ABORT             <=ABORT

A successful      A suicidal        A murdered
transaction       transaction       transaction
A simple transaction takes in a single message, does something, and then produces a single message. Simple transactions typically make fifteen data base calls; almost all transactions are simple at present (see Guide/Share Profile of IMS users). About half of all simple transactions are read-only (make no changes to the data base.) For simple transactions, the notions of process, recovery unit and message coincide.

If a transaction sends and receives several synchronous messages it is called a conversational transaction. A conversational transaction has several messages per process and transaction instance. Conversational transactions are likely to last for a long time (minutes while the operator thinks and types) and hence pose special resource management problems.

The term batch transaction is used to describe a transaction which is "unusually big". In general such transactions are not on-line; rather they are usually started by a system event (timer driven) and run for a long time as a "background" job. Such a transaction usually performs thousands of data management calls before terminating. Often, the process will commit some of its work before the entire operation is complete. This is an instance of multiple (related) recovery units per process.

If a transaction does work at several nodes of a network then it will require a process structure (cohort) to represent its work at each participating node. Such a transaction is called distributed.


The following table summarizes the possibilities and shows the independence of the notions of process, message and transaction instance (commit). Cohorts communicate with one another via the session-message facilities provided by data communications.

+----------------+-----------+--------------------+---------+
|                | PROCESSES | MESSAGES           | COMMITS |
+----------------+-----------+--------------------+---------+
| SIMPLE         | 1         | 1 in, 1 out        | 1       |
| CONVERSATIONAL | 1         | many in, many out  | 1       |
| BATCH          | 1         | none (?)           | many    |
| DISTRIBUTED    | many      | many among cohorts | 1       |
+----------------+-----------+--------------------+---------+

We introduce an additional notion of save point within the notion of transaction. A save point is a fire-wall which allows a transaction to stop short of total backup. If a transaction gets into trouble (e.g. deadlock, resource limit) it may be sufficient to back up to such an intermediate save point rather than undoing all the work of the transaction. For example a conversational transaction which involves several user interactions might establish a save point at each user message thereby minimizing retyping by the user. Save points do not commit any of the transaction's updates. Each save point is numbered, the beginning of the transaction is save point 1 and successive save points are numbered 2, 3, .... The user is allowed to save some data at each save point and to retrieve this data if he returns to that point. Backing up to save point 1 resets the transaction instance to its initial state.

The recovery component provides the actions:

BEGIN_TRANSACTION: designates the beginning of a transaction.

SAVE_TRANSACTION: designates a fire-wall within the transaction. If an incomplete transaction is backed-up, undo may stop at such a point rather than undoing the entire transaction.

BACKUP_TRANSACTION: undoes the effects of a transaction to an earlier save point.

COMMIT_TRANSACTION: signals successful completion of the transaction and causes outputs to be committed.

ABORT_TRANSACTION: causes undo of a transaction.

Using these primitives, application programs can construct groups of actions which are atomic. It is interesting that this one level of recovery is adequate to support multiple levels of transactions by using the notion of save point.

The recovery component supports two actions which deal with system recovery rather than transaction recovery:

CHECKPOINT: Coordinates the recording of the system state in the log.

RESTART: Coordinates system restart, reading the log and the checkpoint record and using the log to either redo committed transactions or undo transactions which were uncommitted at the time of shutdown or crash.
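The transaction-level primitives can be illustrated with a toy undo log over an in-memory key-value store. This is a sketch of the idea only, not the recovery implementation of any real system; all names are invented, and save point 1 stands for the beginning of the transaction as in the text.

```python
# Sketch of BEGIN/SAVE/BACKUP/COMMIT/ABORT using an undo log.
# Save point 1 is the beginning of the transaction.

class Transaction:
    def __init__(self, db):           # BEGIN_TRANSACTION
        self.db = db
        self.undo_log = []            # (key, old_value) pairs
        self.save_points = {1: 0}     # save point number -> undo-log length

    def update(self, key, value):
        self.undo_log.append((key, self.db.get(key)))
        self.db[key] = value

    def save(self):                   # SAVE_TRANSACTION: new fire-wall
        n = max(self.save_points) + 1
        self.save_points[n] = len(self.undo_log)
        return n

    def backup(self, save_point):     # BACKUP_TRANSACTION: undo to fire-wall
        mark = self.save_points[save_point]
        while len(self.undo_log) > mark:
            key, old = self.undo_log.pop()
            if old is None:
                del self.db[key]      # key did not exist before the update
            else:
                self.db[key] = old

    def commit(self):                 # COMMIT_TRANSACTION: updates are final
        self.undo_log.clear()

    def abort(self):                  # ABORT_TRANSACTION: back to save point 1
        self.backup(1)

db = {"balance": 100}
t = Transaction(db)
t.update("balance", 90)
sp = t.save()
t.update("balance", 40)
t.backup(sp)                          # partial backup: second update undone
assert db["balance"] == 90
t.abort()                             # total backup to the initial state
assert db["balance"] == 100
```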

5.1. TRANSACTION SCHEDULING.

The scheduling problem can be broken into many components: listening for new work, allocating resources for new work, scheduling (maintaining the dispatcher list), and dispatching.

The listener is event driven. It receives messages from data communications and from dispatched processes.

A distinguished field of the message specifies a transaction name. Often, this field has been filled in by data communications which resolved the transaction name to a reference to a transaction descriptor. Sometimes this field is symbolic in which case the listener uses the name in a directory call to get a reference to the transaction descriptor. (The directory may be determined by the message source.) If the name is bad or if the sender is not authorized to invoke the transaction then the message is discarded and a negative acknowledge is sent to the source of the message.

If the sender is authorized to invoke the named transaction, then the allocator examines the transaction descriptor and the current system state and decides whether to put this message in a work-to-do list or to allocate the transaction right away. Criteria for this are:

The system may be overloaded ("full".)

There may be a limit on the number of transactions of this type which can run concurrently.

There may be a threshold, N, such that N messages of this type must arrive at which point a server is allocated and the messages are batched to this server.

The transaction may have an affinity to resources which are unavailable.

The transaction may run at special time (overnight, off-shift,...)

If the transaction can run immediately, then the allocator either allocates a new process to process the message or gives the message to a primed transaction which is waiting for input. If a new process is to be created, a process (domain) is allocated and all objects mentioned in the transaction descriptor are allocated as part of the domain. Program management sets up the address space to hold the programs, data management will allocate the cursors of the transaction for the process, data communication allocates the necessary queues, the recovery component allocates a log cursor and writes a begin transaction log record, and so on. The process is then set up with a pointer to the input message.

This allocated process is given to the scheduler which eventually places it on the dispatcher queue. The dispatcher eventually runs the process.
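The allocator's queue-or-allocate decision might be sketched as follows. The descriptor fields, state fields and return values are hypothetical, invented only to show the listed criteria applied in order.

```python
# Sketch of the allocator decision, applying the criteria listed above.
# Descriptor and state layouts are invented for illustration.

def allocate_or_queue(descriptor, state):
    if state["load"] >= state["capacity"]:
        return "queue"                              # system overloaded ("full")
    if state["running"][descriptor["name"]] >= descriptor["max_concurrent"]:
        return "queue"                              # per-type concurrency limit
    if state["pending"][descriptor["name"]] + 1 < descriptor.get("batch_n", 1):
        return "queue"                              # wait for N messages to batch
    if not descriptor["resources"] <= state["available_resources"]:
        return "queue"                              # resource affinity unmet
    if state["shift"] not in descriptor["run_shifts"]:
        return "queue"                              # off-shift transaction
    return "allocate"
```

A real allocator would of course fold the queued messages back in when conditions change; the sketch only shows the yes/no decision.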


Once the process is dispatched by the transaction scheduler, the operating system scheduler is responsible for scheduling the process against the physical resources of the system. When the transaction completes, it returns to the scheduler. The scheduler may or may not collapse the process structure depending on whether the transaction is batched or primed. If the transaction has released resources needed by waiting unscheduled transactions, the scheduler will now dispatch these transactions.

Primed transactions are an optimization which dramatically reduce allocation and deallocation overhead. Process allocation can be an expensive operation and so transactions which are executed frequently are often primed. A primed transaction has a large part of the domain already built. In particular programs are loaded, cursors are allocated and the program prolog has been executed. The transaction (process) is waiting for input. The scheduler need only pass the message to the transaction (process). Often the system administrator or operator will prime several instances of a transaction. A banking system doing three withdrawals and five deposits per second might have two withdrawal transactions and four deposit transactions primed.

Yet another variant has the process ask for a message after it completes. If a new message has arrived for that transaction type, then the process processes it. If there is no work for the transaction, then the process disappears. This is called batching messages as opposed to priming. It is appropriate if message traffic is "bursty" (not uniformly distributed in time). It avoids keeping a process allocated when there is no work for it to do.
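The difference between priming and batching can be sketched as two toy message loops. The function names are invented, and real systems do this with long-lived processes rather than function calls; the sketch only contrasts when the server structure goes away.

```python
from collections import deque

# Primed server: the process stays allocated and repeatedly waits for the
# next message; the prolog cost (program load, cursor allocation) is paid
# once, at priming time.
def primed_server(inbox: deque, handle, polls: int):
    for _ in range(polls):
        if inbox:
            handle(inbox.popleft())  # process survives between messages

# Batching server: after each message the process asks for another; when
# the queue drains, the process disappears (its structure is collapsed).
def batching_server(inbox: deque, handle):
    while inbox:
        handle(inbox.popleft())
    # no work left: return, i.e. the process goes away
```

Priming keeps an idle process around (cheap dispatch, wasted space when traffic is low); batching does the opposite, which is why the text recommends it for "bursty" traffic.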

5.2. DISTRIBUTED TRANSACTION MANAGEMENT.


A distributed system is assumed to consist of a collection of autonomous nc~es which are tied together with a distributed data communication system in the style of high level ARPANET, DECNET, or SNA protocols. Resources are assumed to be partitioned in the sense that a resource is owned by only one node. The system s h o u l d be: Inhomogeneous (nodes are small, medium, large, ours, theirs,...)

U n a f f e c t e d by the loss of messages. Unaffected by the loss of nodes (i.e. r e q u e s t s to that for the node to return, ether nodes continue working.) node wait

Each node may i ~plement whatever data management and transaction management system i t wants to. We only r e q u i r e that it obey the network protocols. So one node might be a minicomputer running a fairly simple data m a n a g e m e n t system and using an o l d - m a s t e r n e w - m a s t e r recovery protocol. Another node might be running a very s o p h i s t i c a t e d data management system with many concurrent t r a n s a c t i o n s and fancy recovery. If one t r a n s a c t i o n may a c c e s s resources in many n o d e s of a network then a part of the t r a n s a c t i o n must "run" in each node. We already have an

426

Each node

will

want to local actions of the process. for the process.

Ruthorize Build Track

an e x e c u t i o n local

envircnment

resources

h e l d by the process. mechanism section). to undo the local u p d a t e s of that

Establish a recovery process (see recover~ -

Observe the two phase commit p r o t o c o l (in c o o p e r a t i o n c o h o r t s (see section on recolery)). entity w h i c h r e p r e s e n t s t r a n s a c t i o n instances: processes. Therefore, the structure n e e d e d for a process is almost identical to the structure needed c e n t r a l i z e d system.

with its

in a distributed system by a t r a n s a c t i o n in a

This latter observation is key. That is why I advocate viewing each node as a transaction processor. (This is a minority view.) To install a distributed transaction, one must install prototypes for its cohorts in the various nodes. This allows each node to control access by distributed transactions in the same way it controls access by terminals. If a node wants to give away the keys to its kingdom it can install a universal cohort (transaction) which has access to all data and which performs all requests. If a transaction wants to initiate a process (cohort) in a new node, some process of the transaction must request that the node construct a cohort and that the cohort go into session with the requesting process (see data communications section for a discussion of sessions). The picture below shows this.
        NODE1                              NODE2
  +--------------+                   +--------------+
  |   ********   |                   |   ********   |
  |   * T1P2 *===|======SESSION======|===* T1P6 *   |
  |   ********   |                   |   ********   |
  +--------------+                   +--------------+

     Two cohorts of a distributed transaction in session.

A process carries both the transaction name T1 and the process name (in NODE1 the cohort of T1 is process P2 and in NODE2 the cohort of T1 is process P6.)


The two processes can now converse and carry out the work of the transaction. If one process aborts, they should both abort, and if one process commits they should both commit. Thus they need to:

obey the lock protocol of holding locks to end of transaction (see section on locking).

observe the two phase commit protocol (see recovery section).

These comments obviously generalize to transactions of more than two cohorts.
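A toy sketch of two phase commit among cohorts, assuming in-memory cohort objects (all names invented). A real protocol also logs each step so that the common outcome survives node crashes; the sketch shows only the all-or-nothing voting structure.

```python
# Toy sketch of two phase commit: a coordinator collects "prepared" votes,
# then broadcasts one common outcome, so the cohorts either all commit or
# all abort. No logging or message loss is modeled here.

class Cohort:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        return self.can_commit        # phase 1: vote on the outcome

    def finish(self, outcome):
        self.state = outcome          # phase 2: adopt the common outcome

def two_phase_commit(cohorts):
    # Phase 1: every cohort must agree before anyone commits.
    outcome = "commit" if all(c.prepare() for c in cohorts) else "abort"
    # Phase 2: the common outcome is broadcast to every cohort.
    for c in cohorts:
        c.finish(outcome)
    return outcome

a, b = Cohort(), Cohort(can_commit=False)
assert two_phase_commit([a, b]) == "abort"
assert a.state == b.state == "abort"   # no cohort commits unilaterally
```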

5.3. THE DATA MANAGEMENT SYSTEM AS A SUBSYSTEM.

It has been the recent experience of general purpose operating systems that the operating system is extended or enhanced by some "application program" like a data management system, or a network management system. Each of these systems often has very clear ideas about resource management and scheduling. It is almost impossible to write such systems unless the basic operating system:

allows the subsystem to appear to users as an extension of the basic operating system.

allows the subsystem to participate in major system events such as system shutdown/restart, process termination, ....

To cope with these problems, operating systems have either made system calls indistinguishable from other calls (e.g. MULTICS) or they have reserved a set of operating system calls for subsystems (e.g. user SVCs in OS/360.) These two approaches address only the first of the two problems above. The notion of subsystem is introduced to capture the second notion. For example, IBM's operating system VS release 2.2 notifies each known subsystem at important system events (e.g. startup, memory failure, checkpoint,...) Typically a user might install a Job Entry Subsystem, a Network Subsystem, a Text Processing Subsystem, and perhaps several different Data Management Subsystems on the same operating system. The basic operating system serves as a co-ordinator among these sub-systems. It passes calls from users to these subsystems. It broadcasts events to all subsystems.

The data manager acts as a subsystem of the host operating system, extending its basic facilities.

The data management component is in turn comprised of components. The following is a partial list of the components in the bottom half of the data base component of System R:

Catalog manager: maintains directories of system objects.

Call analyzer: regulates system entry-exit.

Record manager: extracts records from pages.

Index component: maintains indices on the data base.

Sort component: maintains sorted versions of sets.

Loader: performs bulk insertion of records into a file.

Buffer manager: maps data base pages to and from secondary storage.

Performance monitor: keeps statistics about system performance and state.

Lock component: maintains the locks (synchronization primitives).

Recovery manager: implements the notion of transaction COMMIT, ABORT, and handles system restart.

Log manager: maintains the system log.

Notice that primitive forms of these functions are present in most general purpose operating systems. In the future one may expect to see the operating system subsume most of these data management functions.

5.4. EXCEPTION HANDLING

The protocol for handling synchronous errors (errors generated by the process) is another issue defined by transaction management (extending the basic operating system facilities). In general the data management system wants to abort the transaction if the application program fails. This is generally handled by organizing the exceptions into a hierarchy. If a lower level of the hierarchy fails to handle the error, it is passed to a higher node of the hierarchy. The data manager usually has a few handlers very near the top of the hierarchy (the operating system gets the root of the hierarchy.) Either the process or the data management system (or both) may establish an exception handler to field errors. When an exception is detected then the exception is signaled.

Exception handlers are invoked in some fixed order (usually order of establishment) until one successfully corrects the error. This operation is called percolation. PL/I ON units or the IBM Operating System set-task-abnormal-exit (STAE) are instances of this mechanism. Examples of exception conditions are: arithmetic exception conditions (i.e., overflow), invalid program reference (i.e., to protected storage), wild branches, infinite loops, deadlock, ... and attempting to read beyond end of file. There may be several exception handlers active for a process at a particular instant. The program's handler is usually given the first try at recovery if the program has established a handler. The handler will, in general, diagnose the failure as one that was expected (overflow), one that was unexpected but can be handled (invalid program reference), or one that is unexpected and cannot be dealt with by the handler (infinite loop). If the failure can be corrected, the handler makes the correction and continues processing the program (perhaps at a different point of execution). If the failure cannot be corrected by


this handler, then the exception will percolate to the next exception handler for that process. The system generally aborts any process which percolates to the system recovery routine or does not participate in recovery. This involves terminating all processing being done on behalf of the process, restoring all nonconsumable resources in use by the process to operating system control (i.e., storage), and removing to the greatest extent possible the effects of the transaction.
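Percolation can be sketched as an ordered scan of handlers. All names here are invented; PL/I ON units and STAE are the real mechanisms being imitated, and a real system would pass much richer exception state than a string.

```python
# Sketch of percolation: handlers are tried in order (usually order of
# establishment) until one corrects the exception; if none does, the
# exception reaches the system recovery routine and the process is aborted.

def percolate(exception, handlers):
    for handler in handlers:
        if handler(exception):
            return "continue"          # a handler corrected the error
    return "abort process"             # percolated all the way to the system

def overflow_handler(exc):
    return exc == "overflow"           # expected failure: can be corrected

def storage_handler(exc):
    return exc == "invalid program reference"  # unexpected but handled

assert percolate("overflow", [overflow_handler, storage_handler]) == "continue"
assert percolate("infinite loop",
                 [overflow_handler, storage_handler]) == "abort process"
```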
5.5. OTHER COMPONENTS WITHIN TRANSACTION MANAGEMENT

We mention in passing that the transaction management component must also support the following notions:

Timer services: Performing operations at specified times. This involves running transactions at specified times or intervals and providing a timed wait if it is not available from the base operating system.

Directory management: Management of the directories used by transaction management and other components of the system. This is a high-performance low-function in-core data management system. Given a name and a type (queue, transaction, endpoint, ...) it returns a reference to the object of that name and type. (This is where the cache of dictionary descriptors is kept.)

Authorization control: Regulates the building and use of transactions.

These topics will be discussed by other lecturers.

5.6. BIBLIOGRAPHY.

Stonebraker, Neuhold, "A Distributed Data Base Version of INGRES", Proceedings of Second Berkeley Workshop on Networks and Distributed Data, Lawrence Livermore Laboratory, (1977). (Gives another approach to distributed transaction management.)

"Information Management System / Virtual Storage (IMS/VS) System Manual Vol. I: Logic.", IBM, form number LY20-8004-2. (Tells all about IMS. The discussion of scheduling presented here is in the tradition of IMS/VS pp 3.36-3.41.)

"OS/VS2 System Logic Library.", IBM, form number SY28-0763. (Documents the subsystem interface of OS/VS2 pp. 3.159-3.168.)

"OS/VS2 MVS Supervisor Services and Macro Instructions.", IBM, form number GC28-0756. (Explains OS percolation on pages 53-62.)


5.7. LOCK MANAGEMENT.

This section derives from papers co-authored with Irv Traiger and Franco Putzolu.

The system consists of objects which are related in certain ways. These relationships are best thought of as assertions about the objects. Examples of such assertions are:

'Names is an index for Telephone_numbers.'
'Count of x is the number of employees in department x.'

The system state is said to be consistent if it satisfies all its assertions. In some cases, the data base must become temporarily inconsistent in order to transform it to a new consistent state. For example, adding a new employee involves several atomic actions and the updating of several fields. The data base may be inconsistent until all these updates have been completed.

To cope with these temporary inconsistencies, sequences of atomic actions are grouped to form transactions. Transactions are the units of consistency. They are larger atomic actions on the system state which transform it from one consistent state to a new consistent state. Transactions preserve consistency. If some action of a transaction fails then the entire transaction is 'undone', thereby returning the data base to a consistent state. Thus transactions are also the units of recovery. Hardware failure, system error, deadlock, protection violations and program error are each a source of such failure.

5.7.1. PROS AND CONS OF CONCURRENCY

If transactions are run one at a time then each transaction will see the consistent state left behind by its predecessor. But if several transactions are scheduled concurrently then the inputs of some transaction may be inconsistent even though each transaction in isolation is consistent.

Concurrency is introduced to improve system response and utilization. It should not cause programs to malfunction. Concurrency control should not consume more resources than it 'saves'.

If the data base is read-only then no concurrency control is needed. However, if transactions update shared data then their concurrent execution needs to be regulated so that they do not update the same item at the same time.

If all transactions are simple and all data are in primary storage then there is no need for concurrency. However, if any transaction runs for a long time or does I/O then concurrency may be needed to improve responsiveness and utilization of the system. If concurrency is allowed, then long-running transactions will (usually) not delay short ones.

Concurrency must be regulated by some facility which regulates access to shared resources. Data management systems typically use locks for this purpose.

The simplest lock protocol associates a lock with each object. Whenever using the object, the transaction acquires the lock and holds


it until the transaction is complete. The lock is a serialization mechanism which insures that only one transaction accesses the object at a time. It has the effect of: notifying others that the object is busy; and of protecting the lock requestor from modifications of others. This protocol varies from the serially reusable resource protocol common to most operating systems (and recently renamed monitors) in that the lock protocol holds locks to transaction commit. It will be argued below that this is a critical difference.

Responsibility for requesting and releasing locks can either be assumed by the user or be delegated to the system. User controlled locking results in potentially fewer locks due to the user's knowledge of the semantics of the data. On the other hand, user controlled locking requires difficult and potentially unreliable application programming. Hence the approach taken by most data base systems is to use automatic lock protocols which insure protection from inconsistency, while still allowing the user to specify alternative lock protocols as an optimization.
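The transaction-duration lock protocol just described can be sketched as follows. This is a hypothetical illustration, not code from any of the systems discussed; the names (LockManager, lock, release_all) are invented, and a real lock manager would add share modes, wait queues and deadlock detection.

```python
import threading

class LockManager:
    """Toy lock manager: exclusive locks held until transaction end."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = {}            # object name -> transaction id

    def lock(self, txn, obj):
        with self._mutex:
            holder = self._owner.get(obj)
            if holder is not None and holder != txn:
                return False        # object is busy: caller must wait or abort
            self._owner[obj] = txn
            return True

    def release_all(self, txn):
        """Called only at COMMIT or ABORT, never earlier."""
        with self._mutex:
            for obj in [o for o, t in self._owner.items() if t == txn]:
                del self._owner[obj]

lm = LockManager()
assert lm.lock("T1", "A")          # T1 acquires A
assert not lm.lock("T2", "A")      # T2 is notified that A is busy
lm.release_all("T1")               # T1 commits, releasing its locks
assert lm.lock("T2", "A")          # now T2 may proceed
```

The point of the protocol is the placement of release_all: unlike a serially reusable resource, the lock is not released when the transaction is done touching the object, but only at commit.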

5.7.2. CONCURRENCY PROBLEMS

Locking is intended to eliminate three forms of inconsistency due to concurrency.

Lost Updates: If transaction T2 updates a record previously updated by transaction T1 then undoing T1 will also undo the update of T2. (i.e. if transaction T1 updates record A from 100 to 101 and then transaction T2 updates A from 101 to 151 then backing up T1 will set A back to the original value of 100, losing the update of T2.) This is called a Write -> Write dependency.

Dirty Read: If transaction T1 updates a record which is read by T2, then if T1 aborts, T2 will have read a record which never existed. (i.e. T1 updates A to 100,000,000, T2 reads this value, T1 then aborts and the record returns to the value 100.) This is called a Write -> Read dependency.

Un-repeatable Read: If transaction T1 reads a record which is then altered and committed by T2 and if T1 re-reads the record then T1 will see two different committed values for the same record. Such a dependency is called a Read -> Write dependency.

If there were no concurrency then none of these anomalous cases would arise.

Note that the order in which reads occur does not affect concurrency. In particular, reads commute. That is why we do not care about Read -> Read dependencies.
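The lost-update anomaly above can be reproduced in a few lines. This is an illustrative sketch (the variable names are invented): without locks, undoing T1 by restoring its before-image wipes out T2's later update.

```python
# Unlocked interleaving reproducing the lost-update anomaly:
# T1 updates A from 100 to 101, T2 updates A from 101 to 151,
# then T1 is backed up by restoring its before-image.
db = {"A": 100}

before_image_T1 = db["A"]    # T1 reads A = 100
db["A"] = 101                # T1 writes A = 101 (uncommitted)
db["A"] = 151                # T2 writes A = 151 on top of T1's dirty data
db["A"] = before_image_T1    # undo T1: restore its before-image

assert db["A"] == 100        # T2's update of A has been lost
```

Had T1 held an exclusive lock on A to commit point, T2's write could not have occurred between T1's write and T1's undo.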

5.7.3. MODEL OF CONSISTENCY AND LOCK PROTOCOLS

A fairly formal model is required in order to make precise statements about the issues of locking and recovery. Because the problems are so complex one must either accept many simplifying assumptions or accept a less formal approach. A compromise is adopted here. First we will introduce a fairly formal model of transactions, locks and recovery which will allow us to discuss the issues of lock management and recovery management. After this presentation, the implementation


issues associated with locking and recovery will be discussed.

5.7.3.1. Several Definitions of Consistency

Several equivalent definitions of consistency are presented. The first definition is an operational and intuitive one useful in describing the system behavior to users. The second definition is a procedural one in terms of lock protocols, it is useful in explaining the system implementation. The third definition is in terms of a trace of the system actions, it is useful in formally stating and proving consistency properties.

5.7.3.1.1. Informal Definition of Consistency

An output (write) of a transaction is committed when the transaction abdicates the right to 'undo' the write, thereby making the new value available to all other transactions (i.e. commits). Outputs are said to be uncommitted or dirty if they are not yet committed by the writer. Concurrent execution raises the problem that reading or writing other transactions' dirty data may yield inconsistent data. Using this notion of dirty data, consistency may be defined as:

Definition 1: Transaction T sees a consistent state if:
(a) T does not overwrite dirty data of other transactions.
(b) T does not commit any writes until it completes all its writes (i.e. until the end of transaction (EOT)).
(c) T does not read dirty data from other transactions.
(d) Other transactions do not dirty any data read by T before T completes.

Clauses (a) and (b) insure that there are no lost updates.

Clause (c) isolates a transaction from the uncommitted data of other transactions. Without this clause, a transaction might read uncommitted values which are subsequently updated or are undone. If clause (c) is observed, no uncommitted values are read.

Clause (d) insures repeatable reads. For example, without clause (d) a transaction may read two different (committed) values if it reads the same entity twice. This is because a transaction which updates the entity could begin, update and commit in the interval between the two reads. More elaborate kinds of anomalies due to concurrency are possible if one updates an entity after reading it or if more than one entity is involved (see example below).

The rules specified have the properties that:

1. If all transactions observe the consistency protocols then any execution of the system is equivalent to some "serial" execution of the transactions (i.e. it is as though there was no concurrency.)

2. If all transactions observe the consistency protocols, then each transaction sees a consistent state.

3. If all transactions observe the consistency protocols then system backup (undoing all in-progress transactions) loses no updates of completed transactions.

4. If all transactions observe the consistency protocols then transaction backup (undoing any in-progress transaction) produces a consistent state.

Assertions 1 and 2 are proved in the paper "On the Notions of Consistency and Predicate Locks", CACM Vol. 19, No. 11, Nov. 1976. Proving the second two assertions is a good research problem. It requires extending the model used for the first two assertions and reviewed here to include recovery notions.

5.7.3.1.2. Schedules: Formalize Dirty and Committed Data

The definition of what it means for a transaction to see a consistent state was given in terms of dirty data. In order to make the notion of dirty data explicit it is necessary to consider the execution of a transaction in the context of a set of concurrently executing transactions. To do this we introduce the notion of a schedule for a set of transactions. A schedule can be thought of as a history or audit trail of the actions performed by the set of transactions. Given a schedule, the notion of a particular entity being dirtied by a particular transaction is made explicit and hence the notion of seeing a consistent state is formalized. These notions may then be used to connect the various definitions of consistency and show their equivalence.

The system directly supports objects and actions. Actions are categorized as begin actions, end actions, abort actions, share lock actions, exclusive lock actions, unlock actions, read actions, and write actions. Commit and abort actions are presumed to unlock any locks held by the transaction but not explicitly unlocked by the transaction. For the purposes of the following definitions, share lock actions and their corresponding unlock actions are additionally considered to be read actions, and exclusive lock actions and their corresponding unlock actions are additionally considered to be write actions.

For the purposes of this model, a transaction is any sequence of actions beginning with a begin action and ending with a commit or abort action and not containing other begin, commit or abort actions. Here are two trivial transactions:

T1: BEGIN, SHARE LOCK A, EXCLUSIVE LOCK B, READ A, WRITE B, COMMIT
T2: BEGIN, SHARE LOCK B, READ B, SHARE LOCK A, READ A, ABORT

Any (sequence preserving) merging of the actions of a set of transactions into a single sequence is called a schedule for the set of transactions. A schedule is a history of the order in which actions were successfully executed (it does not record actions which were undone due to backup; this aspect of the model needs to be generalized to prove assertions 3 and 4 above). The simplest schedules run all actions of one transaction and then all actions of another transaction, and so on. Such one-transaction-at-a-time schedules are called serial because they have no concurrency among transactions. Clearly, a serial schedule has no concurrency induced inconsistency and no transaction sees dirty data.

Locking constrains the set of allowed schedules. In particular, a schedule is legal only if it does not schedule a lock action on an entity for one transaction when that entity is already locked by some


other transaction in a conflicting mode. The following table shows the compatibility among the simple lock modes.

                        MODE OF LOCK
                     SHARE        EXCLUSIVE
MODE OF   SHARE      COMPATIBLE   CONFLICT
REQUEST   EXCLUSIVE  CONFLICT     CONFLICT

The following are three example schedules of two transactions. The first schedule is legal, the second is serial and legal, and the third schedule is not legal since T1 and T2 have conflicting locks on the object A.

Schedule 1 (legal & not serial):
T1 BEGIN
T2 BEGIN
T2 SHARE LOCK B
T2 READ B
T1 SHARE LOCK A
T2 SHARE LOCK A
T2 READ A
T2 ABORT
T1 EXCLUSIVE LOCK B
T1 READ A
T1 WRITE B
T1 COMMIT

Schedule 2 (legal & serial):
T1 BEGIN
T1 SHARE LOCK A
T1 EXCLUSIVE LOCK B
T1 READ A
T1 WRITE B
T1 COMMIT
T2 BEGIN
T2 SHARE LOCK B
T2 READ B
T2 SHARE LOCK A
T2 READ A
T2 ABORT

Schedule 3 (not legal & not serial):
T2 BEGIN
T1 BEGIN
T1 EXCLUSIVE LOCK A
T2 SHARE LOCK B
T2 READ B
T2 SHARE LOCK A
T2 READ A
T2 ABORT
T1 SHARE LOCK B
T1 READ A
T1 WRITE B
T1 COMMIT

These illustrate the three varieties of schedules (a serial but not legal schedule is impossible).
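The legality rule can be checked mechanically. The following sketch (names invented, lock modes reduced to SHARE and EXCLUSIVE) scans a schedule and rejects any lock action on an entity already locked by another transaction in a conflicting mode.

```python
# Compatibility of a held mode (first) against a requested mode (second).
COMPAT = {("SHARE", "SHARE"): True, ("SHARE", "EXCLUSIVE"): False,
          ("EXCLUSIVE", "SHARE"): False, ("EXCLUSIVE", "EXCLUSIVE"): False}

def is_legal(schedule):
    """schedule: list of (txn, action, entity); COMMIT/ABORT free all locks."""
    held = {}                                    # entity -> {txn: mode}
    for txn, action, entity in schedule:
        if action in ("COMMIT", "ABORT"):
            for owners in held.values():         # implicit unlock at EOT
                owners.pop(txn, None)
        elif action in ("SHARE", "EXCLUSIVE"):   # a lock request
            owners = held.setdefault(entity, {})
            for other, mode in owners.items():
                if other != txn and not COMPAT[(mode, action)]:
                    return False                 # conflicting lock held
            owners[txn] = action
    return True                                  # BEGIN/READ/WRITE ignored

legal = [("T1","BEGIN",None), ("T2","BEGIN",None),
         ("T2","SHARE","B"), ("T2","READ","B"),
         ("T1","SHARE","A"), ("T2","SHARE","A"), ("T2","READ","A"),
         ("T2","ABORT",None), ("T1","EXCLUSIVE","B"),
         ("T1","READ","A"), ("T1","WRITE","B"), ("T1","COMMIT",None)]
assert is_legal(legal)

illegal = [("T1","EXCLUSIVE","A"), ("T2","SHARE","A")]  # conflict on A
assert not is_legal(illegal)
```

The first test case is Schedule 1 above; the second captures the conflict that makes Schedule 3 illegal.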

An initial state and a schedule completely define the system's behavior. At each step of the schedule one can deduce which entity values have been committed and which are dirty: if locking is used, updated data is dirty until it is unlocked.

One transaction instance is said to depend on another if the first takes some of its inputs from the second. The notion of dependency can be useful in comparing two schedules of the same set of transactions. Each schedule, S, defines a ternary dependency relation on the set: TRANSACTIONS X OBJECTS X TRANSACTIONS, as follows. Suppose that transaction T performs action a on entity e at some step in the schedule, and that transaction T' performs action a' on entity e at a later step in the schedule. Further suppose that T and T' are distinct. Then:

(T,e,T') is in DEP(S) if
     a is a write action and a' is a write action
  or a is a write action and a' is a read action
  or a is a read action and a' is a write action

The dependency set of a schedule completely defines the inputs and outputs each transaction "sees". If two distinct schedules have the same dependency set then they provide each transaction with the same inputs and outputs. Hence we say two schedules are equivalent if they have the same dependency sets. If a schedule is equivalent to a serial schedule, then that schedule must be consistent since in a serial


schedule there are no inconsistencies due to concurrency. On the other hand, if a schedule is not equivalent to a serial schedule then it is possible that some transaction sees an inconsistent state. Hence,

Definition 2: A schedule is consistent if it is equivalent to some serial schedule.

The following argument may clarify the inconsistency of schedules not equivalent to serial schedules. Define the relation <<< on the set of transactions by:

T <<< T' if for some entity e, (T,e,T') is in DEP(S).

Let <<<* be the transitive closure of <<<, then define:

BEFORE(T) = {T' | T' <<<* T}
AFTER(T)  = {T' | T <<<* T'}

The obvious interpretation of this is that BEFORE(T) is the set of transactions which contribute inputs to T, and AFTER(T) is the set of transactions which take their inputs from T.

If some transaction is both before T and after T in some schedule then no serial schedule could give such results. In this case, concurrency has introduced inconsistency. On the other hand, if all relevant transactions are either before or after T (but not both) then T will see a consistent state. If all transactions dichotomize others in this way then the relation <<<* will be a partial order and the whole schedule will provide consistency.

The above definitions can be related as follows:

Assertion: A schedule is consistent if and only if (by definition) the schedule is equivalent to a serial schedule if and only if the relation <<<* is a partial order.
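The dependency relation and the partial-order test can also be sketched directly. The following is an illustrative reconstruction (not from the text): it computes DEP(S) from read/write actions only, takes the transitive closure of <<<, and declares the schedule consistent exactly when no transaction is both before and after another (no cycle through itself).

```python
from itertools import product

def dep(schedule):
    """DEP(S): (T, e, T') whenever T acts on e before T' does and at
    least one of the two actions is a write. schedule: list of
    (txn, op, entity) with op in {"R", "W"}; lock actions omitted."""
    d = set()
    for i, (t, a, e) in enumerate(schedule):
        for t2, a2, e2 in schedule[i+1:]:
            if e == e2 and t != t2 and "W" in (a, a2):
                d.add((t, e, t2))
    return d

def is_consistent(schedule):
    """Consistent iff <<<* (transitive closure of <<<) is a partial
    order, i.e. no transaction is both BEFORE and AFTER another."""
    closure = {(t, t2) for (t, e, t2) in dep(schedule)}
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for (a, b), (c, d2) in product(list(closure), list(closure)):
            if b == c and (a, d2) not in closure:
                closure.add((a, d2))
                changed = True
    txns = {t for t, _, _ in schedule}
    return all((t, t) not in closure for t in txns)

serial_like = [("T1","R","A"), ("T1","W","B"), ("T2","R","B")]
assert is_consistent(serial_like)        # equivalent to T1 then T2

cyclic = [("T1","R","A"), ("T2","W","A"), ("T2","R","B"), ("T1","W","B")]
assert not is_consistent(cyclic)         # T1 <<< T2 and T2 <<< T1
```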

5.7.3.1.3. Lock Protocol Definition of Consistency

Whether an instantiation of a transaction sees a consistent state depends on the actions of other concurrent transactions. All transactions agree to certain lock protocols so that they can all be guaranteed consistency. Since the lock system allows only legal schedules we want a lock protocol such that: every legal schedule is a consistent schedule.

Consistency can be procedurally defined by the lock protocol which produces it: A transaction locks its inputs to guarantee their consistency and locks its outputs to mark them as dirty (uncommitted).

For this section, locks are dichotomized as share mode locks which allow multiple readers of the same entity and exclusive mode locks which reserve exclusive access to an entity. The lock protocol refined to these modes is:


Definition 3: Transaction T observes the consistency lock protocol if:
(a) T sets an exclusive lock on any data it dirties.
(b) T sets a share lock on any data it reads.
(c) T holds all locks to EOT.

These lock protocol definitions can be stated more precisely and tersely with the introduction of the following notation. A transaction is well formed if it always locks an entity in exclusive (shared or exclusive) mode before writing (reading) it. A transaction is two phase if it does not (share or exclusive) lock an entity after unlocking some entity. A two phase transaction has a growing phase during which it acquires locks and a shrinking phase during which it releases locks. The lock consistency protocol can be redefined as:

Definition 3': Transaction T observes the consistency lock protocol if it is well formed and two phase.
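Definition 3' is easy to test mechanically. The sketch below (the action names are invented) checks that a single transaction's action sequence is well formed and two phase.

```python
def observes_consistency_protocol(actions):
    """Check Definition 3': well formed (lock before read/write) and
    two phase (no lock after any unlock). actions: list of (op, entity)
    with op in {"SLOCK", "XLOCK", "UNLOCK", "READ", "WRITE"}.
    Releases at COMMIT/ABORT would count as the shrinking phase."""
    held = {}                  # entity -> "S" or "X"
    shrinking = False
    for op, e in actions:
        if op in ("SLOCK", "XLOCK"):
            if shrinking:
                return False            # not two phase
            held[e] = "S" if op == "SLOCK" else "X"
        elif op == "UNLOCK":
            shrinking = True            # growing phase has ended
            held.pop(e, None)
        elif op == "READ":
            if e not in held:
                return False            # not well formed
        elif op == "WRITE":
            if held.get(e) != "X":
                return False            # not well formed
    return True

ok = [("SLOCK","A"), ("XLOCK","B"), ("READ","A"), ("WRITE","B"),
      ("UNLOCK","A"), ("UNLOCK","B")]
assert observes_consistency_protocol(ok)

bad = [("SLOCK","A"), ("READ","A"), ("UNLOCK","A"),
       ("XLOCK","B"), ("WRITE","B")]    # locks after an unlock
assert not observes_consistency_protocol(bad)
```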

Definition 3 was too restrictive in the sense that consistency does not require that a transaction hold all locks to the EOT (i.e. the EOT is the shrinking phase). Rather, the constraint that the transaction be two phase is adequate to insure consistency. On the other hand, once a transaction unlocks an updated entity, it has committed that entity and so cannot be undone without cascading backup to any transactions which may have subsequently read the entity. For that reason, the shrinking phase is usually deferred to the end of the transaction; thus, the transaction is always recoverable and all updates are committed together.

5.7.3.2. Relationship Among the Definitions

These definitions may be related as follows:

Assertion:
(a) If each transaction observes the consistency lock protocol (Definition 3') then any legal schedule is consistent (Definition 2) (i.e. each transaction sees a consistent state in the sense of Definition 1).
(b) Unless transaction T observes the consistency lock protocol, it is possible to define another transaction T' which does observe the consistency lock protocol such that T and T' have a legal schedule S, but T does not see a consistent state in S.

This says that if a transaction observes the consistency lock protocol definition of consistency (Definition 3') then it is assured of the definition of consistency based on committed and dirty data (Definition 1 or 2). Unless a transaction actually sets the locks prescribed by consistency one can construct transaction mixes and schedules which will cause the transaction to see an inconsistent state. However, in particular cases such transaction mixes may never occur due to the structure or use of the system. In these cases an apparently inadequate locking may actually provide consistency. For example, a data base reorganization usually need do no locking since it is run as an off-line utility which is never run concurrently with other transactions.


5.7.4. LOCKING, TRANSACTION BACKUP AND SYSTEM RECOVERY

To repeat, there is no nice formal model of recovery (Lampson and Sturgis have a model in their forthcoming CACM paper on two phase commit processing but the model in the version I saw was rather vague.) Here, we will limp along with an (even more) vague model.

A transaction T is said to be recoverable if it can be undone before 'EOT' without undoing other transactions' updates. A transaction T is said to be repeatable if it will reproduce the original output if rerun following recovery, assuming that no locks were released in the backup process. Recoverability requires update locks be held to commit point. Repeatability requires that all transactions observe the consistency lock protocol.

The normal (i.e. trouble free) operation of a data base system can be described in terms of an initial consistent state S0 and a schedule of transactions mapping the data base into a final consistent state S3 (see figure). S1 is a checkpoint state; since transactions are in progress, S1 may be inconsistent. A system crash leaves the data base in state S2. Since transactions T3 and T5 were in progress at the time of the crash, S2 is potentially inconsistent. System recovery amounts to bringing the data base to a new consistent state in one of the following ways:

(a) Starting from state S2, undo all actions of transactions in progress at the time of the crash.

(b) Starting from state S1, first undo all actions of transactions in progress at the time of the crash (i.e. actions of T3 before S1) and then redo all actions of transactions which completed after S1 and before the crash (i.e. actions of T2 and T4 after S1).

(c) Starting at S0, redo all transactions which completed before the crash.

Observe that (a) and (c) are degenerate cases of (b).
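Recovery strategy (b) can be sketched with a hypothetical value log of (transaction, entity, old, new) records written after the checkpoint. This is a simplification of the scheme in the text (real systems log before and after images on stable storage and must also undo in-progress actions that precede the checkpoint):

```python
def recover(checkpoint_state, log, committed):
    """log: list of (txn, entity, old_value, new_value) records made
    after checkpoint S1; committed: set of transactions that reached
    COMMIT before the crash. Returns the recovered state."""
    state = dict(checkpoint_state)
    # UNDO: remove effects of in-progress (loser) transactions,
    # scanning the log backwards and restoring before-images.
    for txn, entity, old, new in reversed(log):
        if txn not in committed:
            state[entity] = old
    # REDO: reapply effects of committed (winner) transactions,
    # scanning the log forwards and installing after-images.
    for txn, entity, old, new in log:
        if txn in committed:
            state[entity] = new
    return state

s1 = {"A": 1, "B": 1}
log = [("T3", "A", 1, 2),      # T3 was in progress at the crash: undo
       ("T2", "B", 1, 9)]      # T2 committed before the crash: redo
assert recover(s1, log, committed={"T2"}) == {"A": 1, "B": 9}
```

Strategies (a) and (c) fall out as the degenerate cases: (a) is the undo pass alone starting from S2, and (c) is the redo pass alone starting from S0.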

T1 |----|
T2          |-------|
T3       |--------------------
T4             |--------|
T5                  |---------
---+------------+---------+-------------+---> time
   S0           S1        S2            S3

Figure: System states. S0 is the initial state, S1 is the checkpoint state, S2 is a crash and S3 is the state that results in the absence of a crash.

If some transaction does not hold update locks to commit point then:

Backing up the transaction may deadlock (because backup must reacquire the locks in order to perform undo.)

Backing up a transaction may lose updates (because an update may have been applied to the output of the undone transaction but backup will restore the entity to its original value.)


Consequently, backup may cascade: backing up one transaction may require backing up another. (Randell calls this the domino effect.) If, for example, T3 writes a record r and then T4 further updates r, then undoing T3 will cause the update of T4 to be lost. This situation can only arise if some transaction does not hold its write locks to commit point. For these reasons, all known data management systems (which support concurrent updaters) require that all transactions hold their update locks to commit point.

On the other hand, if all the transactions hold all update locks to commit point then system recovery loses no updates of complete transactions. However, there may be no schedule which would give the same result because transactions may have read outputs of undone transactions. If all the transactions observe the consistency lock protocol then the recovered state is consistent and derives from the schedule obtained from the original system schedule by deleting incomplete transactions. Note that consistency prevents read dependencies on transactions which might be undone by system recovery. The schedule obtained by considering only the actions of completed transactions produces the recovered state.

Transaction crash gives rise to transaction backup, which has properties analogous to system recovery.

5.7.5. LOWER DEGREES OF CONSISTENCY

Most systems do not provide consistency as outlined here. Typically they do not hold read locks to EOT so that R->W->R dependencies are not precluded. Very primitive systems sometimes set no read locks at all; rather they only set update locks so as to avoid lost update and deadlock during backout. We have characterized these lock protocols as degree 2 and degree 1 consistency respectively and have studied them extensively (see "Granularity of Locks and Degrees of Consistency in a Shared Data Base", Gray, Lorie, Putzolu, and Traiger, in Modelling in Data Base Management Systems, North Holland Publishing (1976).) I believe that the lower degrees of consistency are a bad idea but several of my colleagues disagree. The motivation of the lower degrees is performance. If less is locked then less computation and storage is consumed. Further, if less is locked, concurrency is increased since fewer conflicts appear. (Note that the granularity lock scheme of the next section is motivated by minimizing the number of explicit locks set.)

5.7.6. LOCK GRANULARITY

An important issue which arises in the design of a system is the choice of lockable units, i.e. the data aggregates which are atomically locked to insure consistency. Examples of lockable units are areas, files, individual records, field values, and intervals of field values.

The choice of lockable units presents a tradeoff between concurrency and overhead, which is related to the size or granularity of the units themselves. On the one hand, concurrency is increased if a fine lockable unit (for example a record or field) is chosen. Such a unit is appropriate for a "simple" transaction which accesses few records. On


the other hand a fine unit of locking would be costly for a "complex" transaction which accesses a large number of records. Such a transaction would have to set and reset a large number of locks, incurring the computational overhead of many invocations of the lock manager, and the storage overhead of representing many locks. A coarse lockable unit (for example a file) is probably convenient for a transaction which accesses many records. However, such a coarse unit discriminates against transactions which only want to lock one member of the file. From this discussion it follows that it would be desirable to have lockable units of different granularities coexisting in the same system.

The following presents a lock protocol satisfying these requirements and discusses the related implementation issues of scheduling, granting and converting lock requests.

5.7.6.1. Hierarchical Locks

We will first assume that the set of resources to be locked is organized in a hierarchy. Note that this hierarchy is used in the context of a collection of resources and has nothing to do with the data model used in a data base system. The hierarchy of the following figure may be suggestive. We adopt the notation that each level of the hierarchy is given a node type which is a generic name for all the node instances of that type. For example, the data base has nodes of type area as its immediate descendants, each area in turn has nodes of type file as its immediate descendants and each file has nodes of type record as its immediate descendants in the hierarchy. Since it is a hierarchy, each node has a unique parent.
DATA BASE
    |
  AREAS
    |
  FILES
    |
 RECORDS

Figure 1: A sample lock hierarchy.

Each node of the hierarchy can be locked. If one requests exclusive access (X) to a particular node, then when the request is granted, the requestor has exclusive access to that node and implicitly to each of its descendants. If one requests shared access (S) to a particular node, then when the request is granted, the requestor has shared access to that node and implicitly to each descendant of that node. These two access modes lock an entire subtree rooted at the requested node.

Our goal is to find some technique for implicitly locking an entire subtree. In order to lock a subtree rooted at node R in share or exclusive mode it is important to prevent locks on the ancestors of R which might implicitly lock R and its descendants in an incompatible mode. Hence a new access mode, intention mode (I), is introduced. Intention mode is used to "tag" (lock) all ancestors of a node to be locked in share or exclusive mode. These tags signal the fact that locking is being done at a "finer" level and thereby prevent implicit or explicit exclusive or share locks on the ancestors.

The protocol to lock a subtree rooted at node R in exclusive or share mode is to first lock all ancestors of R in intention mode and then to lock node R in exclusive or share mode. For example, using the figure above, to lock a particular file one should obtain intention access to the data base, to the area containing the file and then request exclusive (or share) access to the file itself. This implicitly locks all records of the file in exclusive (or share) mode.

5.7.6.2. Access Modes and Compatibility

We say that two lock requests for the same node by two different transactions are compatible if they can be granted concurrently. The mode of the request determines its compatibility with requests made by other transactions. The three modes X, S and I are incompatible with one another but distinct S requests may be granted together and distinct I requests may be granted together.

The compatibilities among modes derive from their semantics. Share mode allows reading but not modification of the corresponding resource by the requestor and by other transactions. The semantics of exclusive mode is that the grantee may read and modify the resource but no other transaction may read or modify the resource while the exclusive lock is set. The reason for dichotomizing share and exclusive access is that several share requests can be granted concurrently (are compatible) whereas an exclusive request is not compatible with any other request. Intention mode was introduced to be incompatible with share and exclusive mode (to prevent share and exclusive locks). However, intention mode is compatible with itself since two transactions having intention access to a node will explicitly lock descendants of the node in X, S or I mode and thereby will either be compatible with one another or will be scheduled on the basis of their requests at the finer level. For example, two transactions can simultaneously be granted the data base and some area and some file in intention mode. In this case their explicit locks on particular records in the file will resolve any conflicts among them.

The notion of intention mode is refined to intention share mode (IS) and intention exclusive mode (IX) for two reasons: the intention share mode only requests share or intention share locks at the lower nodes of the tree (i.e. never requests an exclusive lock below the intention share node), hence IS is compatible with S mode. Since read only is a common form of access it will be profitable to distinguish this for greater concurrency. Secondly, if a transaction has an intention share lock on a node it can convert this to a share lock at a later time, but one cannot convert an intention exclusive lock to a share lock on a node. Rather, to get the combined rights of share mode and intention exclusive mode one must obtain an X or SIX mode lock. (This issue is discussed in the section on rerequests below.)

We recognize one further refinement of modes, namely share and intention exclusive mode (SIX). Suppose one transaction wants to read an entire subtree and to update particular nodes of that subtree. Using the modes provided so far it would have the options of: (a) requesting exclusive access to the root of the subtree and doing no further locking or (b) requesting intention exclusive access to the root of the subtree and explicitly locking the lower nodes in intention, share or exclusive mode. Alternative (a) has low concurrency. If only a small fraction of the read nodes are updated then alternative (b) has high locking overhead. The correct access mode would be share access to the subtree thereby allowing the transaction to read all nodes of the subtree without further locking and intention exclusive access to the subtree thereby allowing the transaction to set exclusive locks on those nodes in the subtree which are to be updated and IX or SIX locks on the intervening nodes. Since this is a common case, SIX mode is introduced. It is compatible with IS mode since other transactions requesting IS mode will explicitly lock lower nodes in IS or S mode thereby avoiding any updates (IX or X mode) produced by the SIX mode transaction. However SIX mode is not compatible with IX, S, SIX or X mode requests. The table below gives the compatibility of the request modes, where null mode (NL) represents the absence of a request.

      |  NL |  IS |  IX |  S  | SIX |  X  |
 -----+-----+-----+-----+-----+-----+-----+
  NL  | YES | YES | YES | YES | YES | YES |
  IS  | YES | YES | YES | YES | YES | NO  |
  IX  | YES | YES | YES | NO  | NO  | NO  |
  S   | YES | YES | NO  | YES | NO  | NO  |
  SIX | YES | YES | NO  | NO  | NO  | NO  |
  X   | YES | NO  | NO  | NO  | NO  | NO  |

Table 1. Compatibilities among the six access modes.

To summarize, we recognize six modes of access to a resource:

NL: Gives no access to a node, i.e. represents the absence of a request for a resource.

IS: Gives intention share access to the requested node and allows the requestor to lock descendant nodes in S or IS mode. (It does no implicit locking.)

IX: Gives intention exclusive access to the requested node and allows the requestor to explicitly lock descendants in X, S, SIX, IX or IS mode. (It does no implicit locking.)

S: Gives share access to the requested node and to all descendants of the requested node without setting further locks. (It implicitly sets S locks on all descendants of the requested node.)

SIX: Gives share and intention exclusive access to the requested node. (In particular it implicitly locks all descendants of the node in share mode and allows the requestor to explicitly lock descendant nodes in X, SIX or IX mode.)

X: Gives exclusive access to the requested node and to all descendants of the requested node without setting further locks. (It implicitly sets X locks on all descendants. Locking lower nodes in S or IS mode would give no increased access.)

IS mode is the weakest non-null form of access to a resource. It carries fewer privileges than IX or S modes. IX mode allows IS, IX, S, SIX and X mode locks to be set on descendant nodes while S mode allows read only access to all descendants of the node without further locking. SIX mode carries the privileges of S and of IX mode (hence the name SIX). X mode is the most privileged form of access and allows reading and writing of all descendants of a node without further locking. Hence the modes can be ranked in the partial order of privileges shown in the figure below. Note that it is not a total order since IX and S are incomparable.
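Table 1 is small enough to transcribe directly into code. The following sketch (purely illustrative; the names COMPAT and compatible are ours, not System R's) encodes the matrix as a lookup table; note that it is symmetric, since compatibility of two requests does not depend on their order:

```python
# Table 1 as a lookup: COMPAT[m1][m2] is True when a request in mode m1 by
# one transaction can be granted alongside a held mode m2 of another.
MODES = ["NL", "IS", "IX", "S", "SIX", "X"]

COMPAT = {
    "NL":  {"NL": True, "IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": True},
    "IS":  {"NL": True, "IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"NL": True, "IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"NL": True, "IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"NL": True, "IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"NL": True, "IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def compatible(m1: str, m2: str) -> bool:
    """True if requests in modes m1 and m2 can be granted concurrently."""
    return COMPAT[m1][m2]
```

A scheduler would consult this table to decide whether a new request may join the granted group of a resource.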

          X
          |
         SIX
        /   \
       S     IX
        \   /
         IS
          |
         NL

Figure 2. The partial ordering of modes by their privileges.
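The supremum of two modes under this partial order can be computed mechanically. A sketch (the DOMINATES table and supremum function are our own names, not part of the original notes) represents each mode by the set of modes whose privileges it carries:

```python
# Privileges implied by each mode, per the partial order of Figure 2:
# NL < IS, IS < IX, IS < S, IX < SIX, S < SIX, SIX < X; IX and S incomparable.
DOMINATES = {
    "NL":  {"NL"},
    "IS":  {"NL", "IS"},
    "IX":  {"NL", "IS", "IX"},
    "S":   {"NL", "IS", "S"},
    "SIX": {"NL", "IS", "IX", "S", "SIX"},
    "X":   {"NL", "IS", "IX", "S", "SIX", "X"},
}

def supremum(m1: str, m2: str) -> str:
    """Least mode carrying the privileges of both m1 and m2."""
    candidates = [m for m, below in DOMINATES.items()
                  if m1 in below and m2 in below]
    # the least upper bound is the candidate with the fewest privileges
    return min(candidates, key=lambda m: len(DOMINATES[m]))
```

In particular supremum("IX", "S") is "SIX", which is exactly why SIX mode was introduced.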

5.7.6.3. Rules for Requesting Nodes

The implicit locking of nodes will not work if transactions are allowed to leap into the middle of the tree and begin locking nodes at random. The implicit locking implied by the S and X modes depends on all transactions obeying the following protocol:

(a) Before requesting an S or IS lock on a node, all ancestor nodes of the requested node must be held in IX or IS mode by the requestor.

(b) Before requesting an X, SIX or IX lock on a node, all ancestor nodes of the requested node must be held in SIX or IX mode by the requestor.

(c) Locks should be released either at the end of the transaction (in any order) or in leaf to root order. In particular, if locks are not held to end of transaction, one should not hold a lock after releasing its ancestors.

To paraphrase this, locks are requested root to leaf, and released leaf to root. Notice that leaf nodes are never requested in intention mode since they have no descendants, and that once a node is acquired in S or X mode, no further explicit locking is required at lower levels.
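Rules (a) and (b) can be mechanized. The sketch below (class and function names are ours, purely illustrative) computes the set of locks a single request implies: the ancestors receive the proper intention mode, requested root to leaf. For simplicity it does not upgrade an intention mode an ancestor already holds.

```python
# A sketch of the hierarchical protocol: an S/IS request needs IS on all
# ancestors (rule a); an X/SIX/IX request needs IX on all ancestors (rule b).

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]              # root first: locks are requested root to leaf

def lock(node, mode, granted):
    """granted maps node name -> mode already held by this transaction."""
    intent = "IS" if mode in ("S", "IS") else "IX"   # rules (a) and (b)
    for n in path_to_root(node)[:-1]:
        granted.setdefault(n.name, intent)           # no upgrade of held modes
    granted[node.name] = mode
    return granted

# the hierarchy of Figure 1
db = Node("database")
area = Node("area1", db)
f = Node("file1", area)
rec = Node("record1", f)
```

Locking rec in X mode, for example, yields IX on database, area1 and file1 and X on record1, matching the write example given below.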

5.7.6.4. Several Examples

To lock record R for read:
    lock data-base with mode = IS
    lock area containing R with mode = IS
    lock file containing R with mode = IS
    lock record R with mode = S
Don't panic, the transaction probably already has the data base, area and file lock.

To lock record R for write-exclusive access:
    lock data-base with mode = IX
    lock area containing R with mode = IX
    lock file containing R with mode = IX
    lock record R with mode = X
Note that if the records of this and the previous example are distinct, each request can be granted simultaneously to different transactions even though both refer to the same file.


To lock a file F for read and write access:
    lock data-base with mode = IX
    lock area containing F with mode = IX
    lock file F with mode = X
Since this reserves exclusive access to the file, if this request uses the same file as the previous two examples it or the other transactions will have to wait. Unlike examples 1, 2 and 4, no additional locking need be done (at the record level).

To lock a file F for complete scan and occasional update:
    lock data-base with mode = IX
    lock area containing F with mode = IX
    lock file F with mode = SIX
Thereafter, particular records in F can be locked for update by locking records in X mode. Notice that (unlike the previous example) this transaction is compatible with the first example. This is the reason for introducing SIX mode.

To quiesce the data base:
    lock data base with mode = X.
Note that this locks everyone else out.

5.7.6.5. Directed Acyclic Graphs of Locks

The notions so far introduced can be generalized to work for directed acyclic graphs (DAG) of resources rather than simply hierarchies of resources. A tree is a simple DAG. The key observation is that to implicitly or explicitly lock a node, one should lock all the parents of the node in the DAG and so by induction lock all ancestors of the node. In particular, to lock a subgraph one must implicitly or explicitly lock all ancestors of the subgraph in the appropriate mode (for a tree there is only one parent). To give an example of a non-hierarchical structure, imagine the locks are organized as:

              DATA BASE
                  |
                AREAS
                /    \
            FILES    INDICES
                \    /
               RECORDS

Figure 3. A non-hierarchical lock graph.

We postulate that areas are "physical" notions and that files, indices and records are logical notions. The data base is a collection of areas. Each area is a collection of files and indices. Each file has a corresponding index in the same area. Each record belongs to some file and to its corresponding index. A record is comprised of field values and some field is indexed by the index associated with the file containing the record. The file gives a sequential access path to the records and the index gives an associative access path to the records based on field values. Since individual fields are never locked, they do not appear in the lock graph.


To write a record R in file F with index I:
    lock data base with mode = IX
    lock area containing F with mode = IX
    lock file F with mode = IX
    lock index I with mode = IX
    lock record R with mode = X
Note that all paths to record R are locked. Alternatively, one could lock F and I in exclusive mode thereby implicitly locking R in exclusive mode.

To give a more complete explanation we observe that a node can be locked explicitly (by requesting it) or implicitly (by appropriate explicit locks on the ancestors of the node) in one of five modes: IS, IX, S, SIX, X. However, the definition of implicit locks and the protocols for setting explicit locks have to be extended for DAG's as follows:

A node is implicitly granted in S mode to a transaction if at least one of its parents is (implicitly or explicitly) granted to the transaction in S, SIX or X mode. By induction that means that at least one of the node's ancestors must be explicitly granted in S, SIX or X mode to the transaction.

A node is implicitly granted in X mode if all of its parents are (implicitly or explicitly) granted to the transaction in X mode. By induction, this is equivalent to the condition that all nodes in some cut set of the collection of all paths leading from the node to the roots of the graph are explicitly granted to the transaction in X mode and all ancestors of nodes in the cut set are explicitly granted in IX or SIX mode.

By examination of the partial order of modes (see figure above), a node is implicitly granted in IS mode if it is implicitly granted in S mode, and a node is implicitly granted in IS, IX, S and SIX mode if it is implicitly granted in X mode.

5.7.6.6. The Protocol For Requesting Locks On a DAG

(a) Before requesting an S or IS lock on a node, one should request at least one parent (and by induction a path to a root) in IS (or greater) mode. As a consequence none of the ancestors along this path can be granted to another transaction in a mode incompatible with IS.

(b) Before requesting IX, SIX or X mode access to a node, one should request all parents of the node in IX (or greater) mode. As a consequence all ancestors will be held in IX (or greater) mode and cannot be held by other transactions in a mode incompatible with IX (i.e. S, SIX, X).

(c) Locks should be released either at the end of the transaction (in any order) or in leaf to root order. In particular, if locks are not held to the end of transaction, one should not hold a lower lock after releasing its ancestors.

To give an example using the non-hierarchical lock graph in the figure above, a sequential scan of all records in file F need not use an index so one can get an implicit share lock on each record in the file by:

    lock data base with mode = IS
    lock area containing F with mode = IS


    lock file F with mode = S

This gives implicit S mode access to all records in F. Conversely, to read a record in a file via the index I for file F, one need not get an implicit or explicit lock on file F:

    lock data base with mode = IS
    lock area containing I with mode = IS
    lock index I with mode = S

This again gives implicit S mode access to all records in index I (in file F). In both these cases, only one path was locked for reading.

But to insert, delete or update a record R in file F with index I one must get an implicit or explicit lock on all ancestors of R.

The first example of this section showed how an explicit X lock on a record is obtained. To get an implicit X lock on all records in a file one can simply lock the index and file in X mode, or lock the area in X mode. The latter examples allow bulk load or update of a file without further locking since all records in the file are implicitly granted in X mode.
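The two implicit-grant definitions above translate directly into recursive tests over the parent lists. The following sketch (identifiers are ours) uses the graph of Figure 3; locking both the file and the index in X mode implicitly grants every record in X, while a lock on the index alone grants records only in S:

```python
# Implicit grants on a DAG: a node is granted X when every parent is
# (implicitly or explicitly) granted X; it is granted S when some ancestor
# (including the node itself) holds S, SIX or X.

def granted_x(node, parents, locks):
    """Explicitly X, or a non-source node all of whose parents are granted X."""
    if locks.get(node) == "X":
        return True
    ps = parents[node]
    return bool(ps) and all(granted_x(p, parents, locks) for p in ps)

def granted_s(node, parents, locks):
    """Explicitly S/SIX/X, or some parent (recursively) granted S."""
    if locks.get(node) in ("S", "SIX", "X"):
        return True
    return any(granted_s(p, parents, locks) for p in parents[node])

# parent lists for the non-hierarchical lock graph of Figure 3
PARENTS = {"db": [], "area": ["db"], "file": ["area"],
           "index": ["area"], "record": ["file", "index"]}

# locking file AND index in X implicitly locks every record in X
locks = {"db": "IX", "area": "IX", "file": "X", "index": "X"}
```

Note that X on the file alone does not implicitly grant the records in X, because the path through the index bypasses the file; this is exactly the node-slice condition of the next section.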

5.7.6.7. Proof of Equivalence of the Lock Protocol

We will now prove that the described lock protocol is equivalent to a conventional one which uses only two modes (S and X), and which explicitly locks atomic resources (the leaves of a tree or sinks of a DAG).
Let G = (N,A) be a finite (directed acyclic) graph where N is the set of nodes and A is the set of arcs. G is assumed to be without circuits (i.e. there is no non-null path leading from a node n to itself). A node p is a parent of a node n and n is a child of p if there is an arc from p to n. A node n is a source (sink) if n has no parents (no children). Let Q be the set of sinks of G. An ancestor of node n is any node (including n) in a path from a source to n. A node-slice of a sink n is a collection of nodes such that each path from a source to n contains at least one node of the slice.

We also introduce the set of lock modes M = {NL,IS,IX,S,SIX,X} and the compatibility matrix C : MxM -> {YES,NO} described in Table 1. Let c : mxm -> {YES,NO} be the restriction of C to m = {NL,S,X}.

A lock-graph is a mapping L : N -> M such that:

(a) if L(n) ∈ {IS,S} then either n is a source or there exists a parent p of n such that L(p) ∈ {IS,IX,S,SIX,X}. By induction there exists a path from a source to n such that L takes only values in {IS,IX,S,SIX,X} on it. Equivalently, L is not equal to NL on the path.

(b) if L(n) ∈ {IX,SIX,X} then either n is a source or for all parents p1...pk of n we have L(pi) ∈ {IX,SIX,X} (i=1...k). By induction L takes only values in {IX,SIX,X} on all the ancestors of n.

The interpretation of a lock-graph is that it gives a map of the explicit locks held by a particular transaction observing the six state lock protocol described above. The notion of projection of a lock-graph is now introduced to model the set of implicit locks on atomic resources acquired by a transaction.


The projection of a lock-graph L is the mapping l : Q -> m constructed as follows:

(a) l(n)=X if there exists a node-slice {n1...ns} of n such that L(ni)=X for each node in the slice.

(b) l(n)=S if (a) is not satisfied and there exists an ancestor na of n such that L(na) ∈ {S,SIX,X}.

(c) l(n)=NL if (a) and (b) are not satisfied.

Two lock-graphs L1 and L2 are said to be compatible if C(L1(n),L2(n))=YES for all n ∈ N. Similarly two projections l1 and l2 are compatible if c(l1(n),l2(n))=YES for all n ∈ Q.

Theorem: If two lock-graphs L1 and L2 are compatible then their projections l1 and l2 are compatible. In other words, if the explicit locks set by two transactions do not conflict then also the three-state locks implicitly acquired do not conflict.

Proof: Assume that l1 and l2 are incompatible. We want to prove that L1 and L2 are incompatible. By definition of compatibility there must exist a sink n such that l1(n)=X and l2(n) ∈ {S,X} (or vice versa). By definition of projection there must exist a node-slice {n1...ns} of n such that L1(n1)=...=L1(ns)=X. Also there must exist an ancestor n0 of n such that L2(n0) ∈ {S,SIX,X}. From the definition of lock-graph there is a path P1 from a source to n0 on which L2 does not take the value NL. If P1 intersects the node-slice at ni then L1 and L2 are incompatible since L1(ni)=X which is incompatible with the non-null value of L2(ni). Alternatively there is a path P2 from n0 to the sink n which intersects the node-slice at ni. From the definition of lock-graph L1 takes a value in {IX,SIX,X} on all ancestors of ni. In particular L1(n0) ∈ {IX,SIX,X}. Since L2(n0) ∈ {S,SIX,X} we have C(L1(n0),L2(n0))=NO. Hence the theorem is proved. Q.E.D.
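The theorem can also be checked exhaustively on a small instance. The sketch below (our own construction, not part of the original argument) enumerates every legal lock-graph over the diamond db -> {file, index} -> record, computes the projection onto the single sink, and confirms that compatible lock-graphs always have compatible projections:

```python
from itertools import product

MODES = ["NL", "IS", "IX", "S", "SIX", "X"]
NODES = ["db", "file", "index", "record"]
PARENTS = {"db": [], "file": ["db"], "index": ["db"],
           "record": ["file", "index"]}

# the incompatible pairs of Table 1, stored order-insensitively
INCOMPAT = {frozenset(p) for p in
            [("IS", "X"), ("IX", "S"), ("IX", "SIX"), ("IX", "X"),
             ("S", "SIX"), ("S", "X"), ("SIX", "SIX"), ("SIX", "X"),
             ("X", "X")]}

def compatible(a, b):
    return frozenset((a, b)) not in INCOMPAT

def valid(L):
    """Conditions (a) and (b) of the lock-graph definition."""
    for n in NODES:
        ps = PARENTS[n]
        if not ps:
            continue
        if L[n] in ("IS", "S") and all(L[p] == "NL" for p in ps):
            return False
        if L[n] in ("IX", "SIX", "X") and \
                any(L[p] not in ("IX", "SIX", "X") for p in ps):
            return False
    return True

def granted_x(n, L):           # explicit X, or all parents granted X
    return L[n] == "X" or (bool(PARENTS[n]) and
                           all(granted_x(p, L) for p in PARENTS[n]))

def granted_s(n, L):           # some ancestor (incl. n) holds S, SIX or X
    return L[n] in ("S", "SIX", "X") or \
        any(granted_s(p, L) for p in PARENTS[n])

def projection(L):             # implicit lock on the single sink "record"
    if granted_x("record", L):
        return "X"
    return "S" if granted_s("record", L) else "NL"

graphs = [L for m in product(MODES, repeat=len(NODES))
          if valid(L := dict(zip(NODES, m)))]
projs = [projection(L) for L in graphs]

violations = sum(
    1
    for i, L1 in enumerate(graphs) for j, L2 in enumerate(graphs)
    if all(compatible(L1[n], L2[n]) for n in NODES)
    and not compatible(projs[i], projs[j]))
```

On this graph `violations` comes out zero, as the theorem predicts; the check is of course no substitute for the proof, which covers arbitrary finite DAGs.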

5.7.7. LOCK MANAGEMENT PRAGMATICS

Thus far we have discussed when to lock (lock before access and hold locks to commit point), why to lock (to guarantee consistency and to make recovery possible without cascading transaction backup), and what to lock (lock at a granularity that balances concurrency against instruction overhead in setting locks). The remainder of this section will discuss issues associated with how to implement a lock manager.

5.7.7.1. The Lock Manager Interface

This is a simple version of the System R lock manager.

5.7.7.1.1. Lock Actions

Lock manager has two basic calls:

LOCK <lock>, <mode>, <class>, <control>
    Where <lock> is the resource name (in System R for example an eight byte name). <mode> is one of the modes specified above (S | X | SIX | IX | IS). <class> is a notion described below. <control> can be either WAIT in which case the call is synchronous and waits until the request is granted or is cancelled by the deadlock detector, or <control> can be TEST in which case the request is canceled if it cannot be granted immediately.

UNLOCK <lock>, <class>
    Releases the specified lock in the specified class. If the <lock> is not specified, all locks held in the specified class are released.

5.7.7.1.2. Lock Names

The association between lock names and objects is purely a convention. Lock manager associates no semantics with names. Generally the first byte is reserved for the subsystem (component) identifier and the remaining seven bytes name the object. For example, data manager might use bytes (2...4) for the file name and bytes (5...8) for the record name in constructing names for record locks. Since there are so many locks, one only allocates those with non-null queue headers (i.e. free locks occupy no space). Setting a lock consists of hashing the lock name into a table. If the header already exists, the request enqueues on it, otherwise the request allocates the lock header and places it in the hash table. When the queue of a lock becomes empty, the header is deallocated (by the unlock operation).
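The hash-table discipline and the <class, count> bookkeeping of the next subsection can be sketched together. The toy below (all identifiers are ours; it omits modes, queueing, waiting and compatibility checking entirely) shows the two invariants the text describes: a header exists only while the lock is held, and a lock is released only when every class count reaches zero.

```python
from collections import defaultdict

class LockManager:
    def __init__(self):
        self.table = {}                       # hashed lock name -> header

    def lock(self, name, mode, cls):
        hdr = self.table.get(name)
        if hdr is None:                       # free locks occupy no space
            hdr = {"mode": mode, "classes": defaultdict(int)}
            self.table[name] = hdr            # allocate header on first use
        hdr["classes"][cls] += 1              # bump the <class, count> pair
        return hdr

    def unlock(self, name, cls):
        hdr = self.table[name]
        hdr["classes"][cls] -= 1              # decrement the class count
        if hdr["classes"][cls] <= 0:
            del hdr["classes"][cls]
        if not hdr["classes"]:                # all counts zero: lock not held
            del self.table[name]              # deallocate the header
```

A real lock manager would of course hash into a fixed array of chains and keep a request queue per header; Python's dict stands in for both here.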

5.7.7.1.3. Lock Classes


Many operations acquire a set of locks. If the operation is successful, the locks should be retained. If the operation is unsuccessful, or when the operation commits, the locks should be released. In order to avoid double book-keeping, the lock manager allows users to name sets of locks (in the new DBTG proposal these are called keep lists; in IMS program isolation these are called 'Q' class locks).
For each lock held by each process, lock manager keeps a list of <class, count> pairs. Each lock request for a class increments the count for that class. Each unlock request decrements the count. When all counts for all the lock's classes are zero then the lock is not held by the process.

5.7.7.1.4. Latches

Lock manager needs a serialization mechanism to perform its function (e.g. inserting elements in a queue or hash chain). It does this by implementing a lower level primitive called latches. Latches are semaphores. They provide a cheap serialization mechanism without providing deadlock detection, class tracking, modes of sharing (beyond S or X), ... They are used by lock manager and by other performance critical managers (notably buffer manager and log manager).

5.7.7.1.5. Performance of Lock Manager

Lock manager is about 3,900 lines of (PL/1 like) source code. It depends critically on the Compare and Swap logic provided by the multiprocessor feature of System 370. It comprises three percent of the code and about ten percent of the instruction execution of a program in System R (this may vary a great deal). A lock-unlock pair currently costs 350 instructions but if these notes are ever finished, this will be reduced to 120 instructions (this should reduce its slice of the execution pie). A latch-unlatch pair requires 10 instructions (they expand in-line). (Initially they required 120 instructions but a careful redesign improved this dramatically.)

5.7.7.2. Scheduling and Granting Requests

Thus far we have described the semantics of the various request modes and have described the protocol which requestors must follow. To complete the discussion we discuss how requests are scheduled and granted.

The set of all requests for a particular resource are kept in a queue sorted by some fair scheduler. By "fair" we mean that no particular transaction will be delayed indefinitely. First-in first-out is the simplest fair scheduler and we adopt such a scheduler for this discussion modulo deadlock preemption decisions.

The group of mutually compatible requests for a resource appearing at the head of the queue is called the granted group. All these requests can be granted concurrently. Assuming that each transaction has at most one request in the queue then the compatibility of two requests by different transactions depends only on the modes of the requests and may be computed using Table 1. Associated with the granted group is a group mode which is the supremum mode of the members of the group which is computed using Figure 2 or Table 3. Table 2 gives a list of the possible types of requests that can coexist in a group and the corresponding mode of the group.

  MODES OF REQUESTS IN GROUP | MODE OF GROUP
  ---------------------------+--------------
  {X}                        |      X
  {SIX, {IS}}                |     SIX
  {S, {S}, {IS}}             |      S
  {IX, {IX}, {IS}}           |     IX
  {IS, {IS}}                 |     IS

Table 2. Possible request groups and their group mode. Set brackets indicate that several such requests may be present.
The figure below depicts the queue for a particular resource, showing the requests and their modes. The granted group consists of five requests and has group mode IX. The next request in the queue is for S mode which is incompatible with the group mode IX and hence must wait.

       GRANTED GROUP: GROUPMODE = IX
|IS|--|IX|--|IS|--|IS|--|IS|--*--|S|--|IS|--|X|--|IS|--|IX|

Figure 5. The queue of requests for a resource.

When a new request for a resource arrives, the scheduler appends it to the end of the queue. There are two cases to consider: either someone is already waiting or all outstanding requests for this resource are granted (i.e. no one is waiting). If waiters exist, then the request cannot be granted and the new request must wait. If no one is waiting and the new request is compatible with the granted group mode then the new request can be granted immediately. Otherwise the new request must wait its turn in the queue and in the case of deadlock it may preempt some incompatible requests in the queue. (Alternatively the new request could be canceled. In Figure 5 all the requests decided to wait.)
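The queue discipline just described can be sketched directly: under FIFO the granted group is the longest mutually compatible prefix of the queue (once one request waits, all later ones wait too), and the group mode is the supremum of the granted modes. All identifiers below are ours, purely illustrative.

```python
# the incompatible pairs of Table 1, stored order-insensitively
INCOMPAT = {frozenset(p) for p in
            [("IS", "X"), ("IX", "S"), ("IX", "SIX"), ("IX", "X"),
             ("S", "SIX"), ("S", "X"), ("SIX", "SIX"), ("SIX", "X"),
             ("X", "X")]}

def compatible(a, b):
    return frozenset((a, b)) not in INCOMPAT

# privileges implied by each mode, per the partial order of Figure 2
DOMINATES = {"NL": {"NL"}, "IS": {"NL", "IS"}, "IX": {"NL", "IS", "IX"},
             "S": {"NL", "IS", "S"}, "SIX": {"NL", "IS", "IX", "S", "SIX"},
             "X": {"NL", "IS", "IX", "S", "SIX", "X"}}

def supremum(modes):
    cands = [m for m, below in DOMINATES.items()
             if all(x in below for x in modes)]
    return min(cands, key=lambda m: len(DOMINATES[m]))

def schedule(queue):
    """Split a FIFO queue of (txn, mode) pairs into the granted group, its
    group mode, and the waiters."""
    granted = []
    for req in queue:
        if all(compatible(req[1], g[1]) for g in granted):
            granted.append(req)
        else:
            break                  # FIFO: everyone behind a waiter waits
    modes = [m for _, m in granted] or ["NL"]
    return granted, supremum(modes), queue[len(granted):]
```

Running this on the queue of Figure 5 yields a granted group of five requests with group mode IX; deleting the IX request reproduces Figure 6, where the S and the following IS join the group and the group mode drops to S.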


When a particular request leaves the granted group the group mode of the group may change. If the mode of the first waiting request in the queue is compatible with the new mode of the granted group, then the waiting request is granted. In Figure 5, if the IX request leaves the group, then the group mode becomes IS which is compatible with S and so the S may be granted. The new group mode will be S and since this is compatible with IS mode the IS request following the S request may also join the granted group. This produces the situation depicted in Figure 6:

       GRANTED GROUP: GROUPMODE = S
|IS|--|IS|--|IS|--|IS|--|S|--|IS|--*--|X|--|IS|--|IX|

Figure 6. The queue after the IX request is released.
The X request of Figure 6 will not be granted until all requests leave the granted group since it is not compatible with any of them.

5.7.7.3. Conversions

A transaction might re-request the same resource for several reasons: Perhaps it has forgotten that it already has access to the record; after all, if it is setting many locks it may be simpler to just always request access to the record rather than first asking itself "have I seen this record before". The lock manager has all the information to answer this question and it seems wasteful to duplicate. Alternatively, the transaction may know it has access to the record, but want to increase its access mode (for example from S to X mode if it is in a read, test, and sometimes update scan of a file). So the lock manager must be prepared for re-requests by a transaction for a lock. We call such re-requests conversions.

When a request is found to be a conversion, the old (granted) mode of the requestor to the resource and the newly requested mode are compared using Table 3 to compute the new mode which is the supremum of the old and the requested mode (ref. Figure 2).

l
I__ I IS I IX | S I SIX l_/X ! Is I Is I IX I S I SIX ! x

NEW MODE Ix s Ix s IX SIX SiX S SIX SIX x . . .x. . . .

I
s i x .....x .... six x SiX X SIX X SIX X x x I I I I } I

So for example, if one has mode is SIX.

IX m o ~ e and

r e q u e s t s S mode then

the new

If the new mode is equal to the old mode (note it is never less than the old mode) then the request can be granted immediately and the granted mode is unchanged. If the new mode is compatible with the group mode of the other members of the granted group (a requestor is always compatible with himself) then again the request can be granted immediately. The granted mode is the new mode and the group mode is recomputed using Table 2. In all other cases, the requested conversion must wait until the group mode of the other granted requests is compatible with the new mode. Note that this immediate granting of conversions over waiting requests is a minor violation of fair scheduling.
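The conversion computation can be sketched as a table lookup. The following Python fragment (our own notation, not part of any system described here) encodes Table 3 as the supremum of two modes:

```python
# Table 3 as a lookup: the new mode of a conversion is the supremum of the
# old (granted) mode and the requested mode in the lattice of Figure 2.
# The operation is symmetric, so only one triangle of the table is stored.
SUP = {
    ("IS", "IS"): "IS",   ("IS", "IX"): "IX",   ("IS", "S"): "S",
    ("IS", "SIX"): "SIX", ("IS", "X"): "X",
    ("IX", "IX"): "IX",   ("IX", "S"): "SIX",   ("IX", "SIX"): "SIX",
    ("IX", "X"): "X",
    ("S", "S"): "S",      ("S", "SIX"): "SIX",  ("S", "X"): "X",
    ("SIX", "SIX"): "SIX", ("SIX", "X"): "X",
    ("X", "X"): "X",
}

def new_mode(old, requested):
    """New mode after a conversion request: sup(old, requested)."""
    return SUP.get((old, requested)) or SUP[(requested, old)]
```

For instance, new_mode("IX", "S") yields "SIX", matching the example of Table 3.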

If two conversions are waiting, each of which is incompatible with an already granted request of the other transaction, then a deadlock exists and the already granted access of one must be preempted. Otherwise there is a way of scheduling the waiting conversions: namely, grant a conversion when it is compatible with all other granted modes in the granted group. (Since there is no deadlock cycle this is always possible.)

The following example may help to clarify these points. Suppose the queue for a particular resource is:

    ******************************
    *      GROUPMODE = IS        *
    *      |IS|---|IS|           *
    ******************************

    Figure 7. A simple queue.

Now suppose the first transaction wants to convert to X mode. It must wait for the second (already granted) request to leave the queue. If it decides to wait then the situation becomes:

    ******************************
    *      GROUPMODE = IS        *
    *      |IS<-X|---|IS|        *
    ******************************

    Figure 8. A conversion to X mode waits.
No new request may enter the granted group since there is now a conversion request waiting. In general, conversions are scheduled before new requests. If the second transaction now converts to IX, SIX, or S mode it may be granted immediately since this does not conflict with the granted (IS) mode of the first transaction. When the second transaction eventually leaves the queue, the first conversion can be made:

    ******************************
    *      GROUPMODE = X         *
    *      |X|                   *
    ******************************

    Figure 9. One transaction leaves and the conversion is granted.

However, if the second transaction tries to convert to exclusive mode one obtains the queue:

    ******************************
    *      GROUPMODE = IS        *
    *      |IS<-X|---|IS<-X|     *
    ******************************

    Figure 10. Two conflicting conversions are waiting.

Since X is incompatible with IS (see Table 1), this situation implies that each transaction is waiting for the other to leave the queue (i.e. deadlock) and so one transaction must be preempted. In all other cases (i.e. when no cycle exists) there is a way to schedule the conversions so that no already granted access is violated.
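The rule just stated, that a waiting conversion may be granted exactly when its new mode is compatible with the granted modes of all other members of the granted group, can be sketched as follows. This is a hedged illustration: COMPAT transcribes the compatibility relation of Table 1, and the function name is ours.

```python
# Compatibility relation of Table 1: COMPAT[m] is the set of modes that may
# be concurrently granted with mode m.
COMPAT = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

def may_grant_conversion(requestor, new_mode, granted):
    """granted maps each member of the granted group to its granted mode.
    A requestor is always compatible with himself, so he is skipped."""
    return all(mode in COMPAT[new_mode]
               for who, mode in granted.items() if who != requestor)
```

In Figure 10, neither conversion to X can be granted: may_grant_conversion("T1", "X", {"T1": "IS", "T2": "IS"}) is False, and symmetrically for T2, so the two transactions deadlock; a conversion by T2 to SIX, by contrast, would be granted at once.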


5.7.7.4. Deadlock Detection

One issue the lock manager must deal with is deadlock. Deadlock consists of each member of a set of transactions waiting for some other member of the set to give up a lock. Standard lore has it that one can have timeout or deadlock prevention or deadlock detection. Timeout causes waits to be denied after some specified interval. It has the property that as the system becomes more congested, more and more transactions time out (because time runs slower and because more resources are in use so that one waits more). Also timeout puts an upper limit on the duration of a transaction. In general the dynamic properties of timeout make it acceptable for a lightly loaded system but inappropriate for a congested system.

Deadlock prevention is achieved by: requesting all locks at once, or requesting locks in a specified order, or never waiting for a lock, or ... In general deadlock prevention is a bad deal because one rarely knows what locks are needed in advance (consider looking something up in an index) and consequently, one locks too much in advance. Although some situations allow deadlock prevention, general systems tend to require deadlock detection. IMS, for example, started with a deadlock prevention scheme (intent scheduling) but was forced to introduce a deadlock detection scheme to increase concurrency (Program Isolation).

Deadlock detection and resolution is no big deal in a data management system environment. The system already has lots of facilities for transaction backup so that it can deal with other sorts of errors. Deadlock simply becomes another (hopefully infrequent) source of backup.
As will be seen, the algorithms for detecting and resolving deadlock are not complicated or time consuming. The deadlock detection-resolution scenario is:

    Detect a deadlock.
    Pick a victim (a lock to preempt from a process).
    Back up victim, which will release the lock.
    Grant a waiter.
    Restart victim (optionally).

Lock manager is only responsible for deadlock detection and victim selection. Recovery management implements transaction backup and controls restart logic.

5.7.7.4.1. How to Detect Deadlock

There are many heuristic ways of detecting deadlock (e.g. linearly order resources or processes and declare deadlock if the ordering is violated by a wait request). Here we restrict ourselves to algorithmic solutions.

The detection of deadlock may be cast in graph-theoretic terms. We introduce the notion of the wait-for graph.


The nodes of the graph are transactions and locks. The edges of the graph are directed and are constructed as follows:

    If lock L is granted to transaction T then draw an edge from L to T.

    If transaction T is waiting for lock L then draw an edge from T to L.
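The construction above, together with the cycle test discussed next, can be sketched as follows. This is a toy illustration in Python; the graph representation is ours.

```python
# Wait-for graph as a successor map: an edge L -> T records that lock L is
# granted to transaction T; an edge T -> L records that T waits for L.
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph.
    edges maps each node to an iterable of its successors."""
    nodes = set(edges)
    for succs in edges.values():
        nodes |= set(succs)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on stack / finished
    color = dict.fromkeys(nodes, WHITE)

    def visit(n):
        color[n] = GREY
        for m in edges.get(n, ()):
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True               # a back edge closes a cycle
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)
```

With T1 waiting for L1, L1 granted to T2, T2 waiting for L2, and L2 granted to T1, has_cycle({"T1": ["L1"], "L1": ["T2"], "T2": ["L2"], "L2": ["T1"]}) reports a deadlock.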

At any instant, there is a deadlock if and only if the wait-for graph has a cycle. Hence deadlock detection becomes an issue of building the wait-for graph and searching it for cycles. Often this 'transaction waits for lock waits for transaction' graph can be reduced to a smaller 'transaction waits for transaction' graph. The larger graph need be maintained only if the identity of the locks in the cycle is relevant. I know of no case where this is required.

5.7.7.4.2. When to Look for Deadlock

One could opt to look for deadlock:

    Whenever anyone waits.
    Periodically.
    Never.

One could look for deadlock continuously. Releasing a lock or being granted a lock never creates a deadlock. So one should never look for deadlock more frequently than when a wait occurs. The cost of looking for deadlock every time anyone waits is:

    Continual maintenance of the wait-for graph.
    Almost certain failure since deadlock is (should be) rare (i.e. wasted instructions).

The cost of periodic deadlock detection is:

    Detecting deadlocks late.

By increasing the period one decreases the cost of deadlock detection and raises the probability of successfully finding one. For each situation there should be an optimal detection period.

    [Figure: total COST versus detection PERIOD. The cost of detection
    decreases with the period, the cost of detecting late increases, and
    their sum is minimized at an optimal detection period.]

Never testing for deadlocks is much like periodic deadlock detection with a very long period. All systems have a mechanism to detect dead programs (infinite loops, wait for lost interrupt, ...). This is usually a part of allocation and resource scheduling. It is probably outside and above deadlock detection. Similarly, if deadlock is very frequent, the system is thrashing and the transaction scheduler should stop scheduling new work and perhaps abort some current work to reduce this thrashing. Otherwise the system is likely to spend the majority of its time backing up.

5.7.7.4.3. What To Do When Deadlock is Detected.

All transactions in a deadlock are waiting. The only way to get things going again is to grant some waiter. But this can only be achieved after a lock is preempted from some holder. Since the victim is waiting, he will get the 'deadlock' response from lock manager rather than the 'granted' response. In breaking the deadlock some set of victims will be preempted. We want to minimize the amount of work lost by these preemptions. Therefore, deadlock resolution wants to pick a minimum cost set of victims to break deadlocks. Transaction management must associate a cost with each transaction. In the absence of policy decisions, the cost of a victim is the cost of undoing his work and then redoing it. The length of the transaction log is a crude estimate of this cost. At any rate, transaction management must provide lock management with an estimate of the cost of each transaction. Lock manager may implement either of the following two protocols:

    For each cycle, choose the minimum cost victim in that cycle.

    Choose the minimum cost cut-set of the deadlock graph.

The difference between these two options is best visualized by the picture:

    T1 ---> L1 ---+          +--- L2 <--- T3
     ^            |          |            ^
     |            v          v            |
     |            +--> T2 <--+            |
     |                 |                  |
     |                 v                  |
     +---------------- L3 ----------------+

    (An edge from a lock to a transaction denotes a grant; from a
    transaction to a lock, a wait. L3 is granted to both T1 and T3.)

If T1 and T3 have a cost of 2 and T2 has a cost of 3 then a cycle-at-a-time algorithm will choose T1 and T3 as victims; whereas, a minimal cut set algorithm will choose T2 as a victim.

The cost of finding a minimal cut set is considerably greater (it seems to be NP-complete) than the cycle-at-a-time scheme. If there are N common cycles the cycle-at-a-time scheme is at most N times worse than the minimal cut set scheme. So it seems that the cycle-at-a-time scheme is better.
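The cycle-at-a-time scheme is easy to state precisely. Here is a hedged sketch (our own formulation, not a system interface) that assumes the cycles have already been enumerated and that transaction management has supplied the costs:

```python
def pick_victims(cycles, cost):
    """cycles: list of deadlock cycles, each a list of transaction names.
    cost: dict mapping transaction name to its preemption cost.
    For each cycle not already broken by an earlier victim, preempt the
    minimum cost transaction on that cycle."""
    victims = set()
    for cycle in cycles:
        if victims.isdisjoint(cycle):      # cycle not yet broken
            victims.add(min(cycle, key=cost.__getitem__))
    return victims
```

On the two cycles of the picture above, pick_victims([["T1", "T2"], ["T3", "T2"]], {"T1": 2, "T2": 3, "T3": 2}) chooses {"T1", "T3"} at a total cost of 4, while the minimal cut set {"T2"} costs only 3.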


5.7.7.5. Lock Management in a Distributed System.

To repeat the discussion in the section on distributed transaction management, if a transaction wants to do work at a new node, some process of the transaction must request that the node construct a cohort and that the cohort go into session with the requesting process (see section on data communications for a discussion of sessions). The picture below shows this.
          NODE1                            NODE2
    +-------------+                  +-------------+
    |   T1, P2    |=== SESSION1 =====|   T1, P6    |
    +-------------+                  +-------------+
A cohort carries both the transaction name T1 and the process name (in NODE1 the cohort of T1 is process P2 and in NODE2 the cohort of T1 is process P6). The two processes can now converse and carry out the work of the transaction. If one process aborts, they should both abort, and if one process commits they should both commit.

The lock manager of each node can keep its lock tables in any form it desires. Further, deadlock detectors running in each node may use any technique they like to detect deadlocks among transactions which run exclusively in that node. We call such deadlocks local deadlocks. However, just because there are no cycles in the local wait-for graphs does not mean that there are no cycles. Gluing acyclic local graphs together might produce a graph with cycles. (See the example below.) So, the deadlock detectors of each node will have to agree on a common protocol in order to handle deadlocks involving distributed transactions. We call such deadlocks global deadlocks.

Inspection of the following figure may help to understand the nature of global deadlocks. Note that transaction T1 has two processes P1 and P2 in nodes 1 and 2 respectively. P1 is session-waiting for its cohort P2 to do some work. P2, in the process of doing this work, needed access to FILE2 in NODE2. But FILE2 is locked exclusive by another process (P4 of NODE2) so P2 is in lock wait state. Thus the transaction T1 is waiting for FILE2. Now transaction T2 is in a similar state: one of its cohorts is session waiting for the other, which in turn is lock waiting for FILE1. In fact transaction T1 is waiting for FILE2 which is granted to transaction T2 which is waiting for file FILE1 which is granted to transaction T1. A global deadlock if you ever saw one.

         NODE1                             NODE2

    T1P1 ====== SESSION1 ======> T1P2
      ^      (session wait)        |
      |                            | lock wait
      | lock grant                 v
    FILE1                        FILE2
      ^                            |
      | lock wait                  | lock grant
      |                            v
    T2P3 <====== SESSION2 ====== T2P4
             (session wait)
The notion of wait-for graph must be generalized to handle global deadlock. The nodes of the graph are processes and resources (sessions are resources). The edges of the graph are constructed as follows:

    Draw a directed edge from a process to a resource if the process is in lock wait for the resource, or the process is in session-wait for the resource (session).

    Draw a directed edge from a resource to a process if the resource is a lock granted to the process, or it is a session of the process and the process is not in session-wait on it.

A local deadlock is a cycle:

    lockwait -> ... -> lockwait

A global deadlock is a cycle:

    lockwait -> ... -> sessionwait -> lockwait -> ... -> sessionwait ->
5.7.7.5.1. How to Find Global Deadlocks

The finding of local deadlocks has already been described. To find global deadlocks, a distinguished task, called the global deadlock detector, is started in some node. This global task is in session with all local deadlock detectors and coordinates their activities. This global deadlock detector can run in any node, but probably should be located to minimize its communication distance to the lock managers.

Each local deadlock detector needs to find all potential global deadlock paths in his node. In the previous section it was shown that a global deadlock cycle has the form:

    lockwait -> ... -> sessionwait -> lockwait -> ... -> sessionwait ->

So each local deadlock detector periodically enumerates all

    sessionwait -> lockwait -> ... -> sessionwait

paths in his node by working backwards from processes which are in session-wait (as opposed to console wait, disk wait, processor wait, ...). Starting at such a process it sees if some local process is lock waiting for this process. If so, the deadlock detector searches backwards looking for some process which has a session in progress. When such a path is found, the following information is sent to the global deadlock detector:

    Sessions and transactions at the endpoints of the path and their local preemption costs.

    The minimum cost transaction in the path and his local preemption cost.

(It may make sense to batch this information to the global detector.)

Periodically, the global deadlock detector:

    collects these messages,
    glues all these paths together by matching up sessions,
    enumerates cycles and selects victims just as in the local deadlock detector case.

One tricky point is that the cost of a distributed transaction is the sum of the costs of its cohorts. The global deadlock detector approximates this cost by summing the costs of the cohorts of the transaction known to it (not all cohorts of a deadlocked transaction will be known to the global deadlock detector). When a victim is selected, the lock manager of the node the victim is waiting in is informed of the deadlock. The local lock manager in turn informs the victim with a deadlock return.

The use of periodic deadlock detection (as opposed to detection every time anyone waits) is even more important for a distributed system than for a centralized system. The cost of detection is much higher in a distributed system. This will alter the intersection of the cost of detection and cost of detecting late curves. If the network is really large the deadlock detector can be staged. That is, we can look for deadlock among four nodes, then among sixteen nodes, and so on.

If one node crashes, then its partition of the system is unavailable.


In this case, its cohorts in other nodes can wait for it to recover or they can abort. If the down node happens to house the global lock manager then no global deadlocks will be detected until the node recovers. If this is uncool, then the lock managers can nominate a new global lock manager whenever the current one crashes. The new manager can run in any node which can be in session with all other nodes. The new global lock manager collects the local graphs and goes about gluing them together, finding cycles, and picking victims.

5.7.7.6. Relationship to Operating System Lock Manager

Most operating systems provide a lock manager to regulate access to files and other system resources. This lock manager usually supports a limited set of lock names, the modes share, exclusive and beware, and has some form of deadlock detection. These lock managers are usually not prepared for the demands of a data management system (fast calls, lots of locks, many modes, lock classes, ...). The basic lock manager could be extended and refined, and in time that is what will happen. There is a big problem about having two lock managers in the same host. Each may think it has no deadlock but if their graphs are glued together a "global" deadlock exists. This makes it very difficult to build on top of the basic lock manager.

5.7.7.7. The Convoy Phenomenon: Preemptive Scheduling is Bad

Lock manager has strong interactions with the scheduler. Suppose that there are certain high traffic shared system resources. Operating on these resources consists of locking them, altering them and then unlocking them (the buffer pool and log are examples of this). These operations are designed to be very fast so that the resource is almost always free. In particular the resource is never held during an I/O operation. For example, the buffer manager latch is acquired every 1000 instructions and is held for about 50 instructions. If the system has no preemptive scheduling then on a uni-processor when a process begins the resource is free and when he completes the resource is free (because he does not hold it when he does I/O or yields the processor). On a multi-processor, if the resource is busy, the process can sit in a busy wait until the resource is free because the resource is known to be held by others for only a short time.

If the basic system has a preemptive scheduler, and if that scheduler preempts a process holding a critical resource (e.g. the log latch), then terrible things happen: all other processes waiting for the latch are dispatched and, because the resource is high traffic, each of these processes requests and waits for the resource. Ultimately the holder of the resource is redispatched and he almost immediately grants the latch to the next waiter. But because it is high traffic, the process almost immediately rerequests the latch (i.e. about 1000 instructions later). Fair scheduling requires that he wait, so he goes on the end of the queue waiting for those ahead of him. This queue of waiters is called a convoy. It is a stable phenomenon: once a convoy is established it persists for a very long time.
We (System R) have found several solutions to this problem. The obvious solution is to eliminate such resources. That is a good idea and can be achieved to some degree by refining the granularity of the lockable unit (e.g. twenty buffer manager latches rather than just one). However, if a convoy ever forms on any of these latches it will be stable, so that is not a solution. I leave it as an exercise for the reader to find a better solution to the problem.


5.7.8. BIBLIOGRAPHY

Engles, "Currency and Concurrency in the COBOL Data Base Facility", in Modeling in Data Base Management Systems, Nijssen editor, North Holland, 1976. (A nice discussion of how locks are used.)

Eswaran et al., "On the Notions of Consistency and Predicate Locks in a Relational Database System", CACM, Vol. 19, No. 11, November 1976. (Introduces the notion of consistency; ignore the stuff on predicate locks.)

"Granularity of Locks and Degrees of Consistency in a Shared Data Base", in Modeling in Data Base Management Systems, Nijssen editor, North Holland, 1976. (This section is a condensation and then elaboration of this paper. Hence Franco Putzolu and Irv Traiger should be considered co-authors of this section.)


5.8. RECOVERY MANAGEMENT

5.8.1. MODEL OF ERRORS

In order to design a recovery system, it is important to have a clear notion of what kinds of errors can be expected and what their probabilities are. The model of errors below is inspired by the presentation by Lampson and Sturgis in "Crash Recovery in a Distributed Data Storage System", which may someday appear in the CACM.

We first postulate that all errors are detectable. That is, if no one complains about a situation, then it is OK.

5.8.1.1. Model of Storage Errors

Storage comes in three flavors with independent failure modes and increasing reliability:

    Volatile storage: paging space and main memory.

    On-line non-volatile storage: disks, usually survive crashes. More reliable than volatile storage.

    Off-line non-volatile storage: tape archive. Even more reliable than disks.

To repeat, we assume that these three kinds of storage have independent failure modes. The storage is blocked into fixed length units called pages which are the unit of allocation and transfer. Any page transfer can have one of three outcomes:

    Success (target gets new value).
    Partial failure (target is a mess).
    Total failure (target is unchanged).

Any page may spontaneously fail. That is, a speck of dust may settle on it or a black hole may pass through it so that it no longer retains its original information. One can always detect whether a transfer failed or a page spontaneously failed by reading the target page at a later time. (This can be made more and more certain by adding redundancy to the page.) Lastly, the probability that N "independent" archive pages fail is negligible. Here we choose N=2. (This can be made more certain by choosing larger and larger N.)

5.8.1.2. Model of Data Communication Errors

Communication traffic is broken into units called messages which travel via sessions. The transmission of a message has one of three possible outcomes:

    Successfully received.
    Incorrectly received.
    Not received.

The receiver of the message can detect whether he has received a particular message and whether it was correctly received. For each message transmitted, there is a non-zero probability that it will be successfully received.

It is the job of recovery manager to deal with these storage and transmission errors and correct them. This model of errors is implicit in what follows and will appear again in the examples at the end of the section.

5.8.2. OVERVIEW OF RECOVERY MANAGEMENT.

A transaction is begun explicitly when a process is allocated or when an existing process issues BEGIN_TRANSACTION. When a transaction is initiated, recovery manager is invoked to allocate the recovery structure necessary to recover the transaction. This process places a capability for the COMMIT, SAVE, and BACKUP calls of recovery manager in the transaction's capability list. Thereafter, all actions by the transaction on recoverable data are recorded in the recovery log using log manager. In general, each action performing an update operation should write an undo-log record and a redo-log record in the transaction's log. The undo log record gives the old value of the object and the redo log record gives the new value (see below).

At a transaction save point, recovery manager records the save point identifier and enough information so that each component of the system could be backed up to this point. In the event of a minor error, the transaction may be undone to a save point, in which case the application (on its next or pending call) is given feedback indicating that the data base system has amnesia about all recoverable actions since that save point. If the transaction is completely backed-up (aborted), it may or may not be restarted depending on the attributes of the transaction and of its initiating message. If the transaction completes successfully (commits), then (logically) it is always redone in case of a crash. On the other hand, if it is in-progress at the time of the local or system failure, then the transaction is logically undone (aborted).
Recovery manager must also respond to the following kinds of failures:

    Action failure: a particular call cannot complete due to a foreseen condition. In general the action undoes itself (cleans up its component) and then returns to the caller. Examples of this are bad parameters, resource limits, and data not found.


    Transaction failure: a particular transaction cannot proceed and so is aborted. The transaction may be reinitiated in some cases. Examples of such errors are deadlock, timeout, protection violation, and transaction-local system errors.

    System failure: a serious error is detected below the action interface. The system is stopped and restarted. Errors in critical tables, wild branches by trusted processes, operating system downs and hardware downs are sources of system failure. Most nonvolatile storage is presumed to survive a system failure.

    Media failure: a nonrecoverable error is detected on some usually reliable (nonvolatile) storage device. The recovery of recoverable data from a media failure is the responsibility of the component which implements it. If the device contained recoverable data the manager must reconstruct the data from an archive copy using the log and then place the result on an alternate device. Media failures do not generally force system failure. Parity error, head crash, dust on magnetic media, and lost tapes are typical media failures. Software errors which make the media unreadable are also regarded as media errors, as are catastrophes such as fire, flood, insurrection, and operator error.

The system periodically makes copies of each recoverable object and keeps these copies in a safe place (archive). In case the object suffers a media error, all transactions with locks outstanding against the object are aborted. A special transaction (a utility) acquires the object in exclusive mode. (This takes the object "off-line".) This transaction merges an accumulation of changes to the object since the object copy was made and a recent archive version of the object to produce the most recent committed version.
This accumulation of changes may take two forms: it may be the REDO-log portion of the system log, or it may be a change accumulation log which was constructed from the REDO-log portion of the system log when the system log is compressed. After media recovery, the data is unlocked and made public again.

The process of making an archive copy of an object has many varieties. Certain objects, notably IMS queue space, are recovered from scratch using an infinite redo log. Other objects, notably data bases, get copied to some external media which can be used to restore the object to a consistent state if a failure occurs. (The resource may or may not be off-line while the copy is being made.)

Recovery manager also periodically performs a system checkpoint, recording critical parts of the system state in a safe spot in nonvolatile storage (sometimes called the warm start file).

Recovery manager coordinates the process of system restart and system shutdown. In performing system restart, it chooses among:

    Warm start: system shut down in a controlled manner. Recovery need only locate the last checkpoint record and rebuild control structure.

    Emergency restart: system failed in an uncontrolled manner. Non-volatile storage contains a recent state consistent with the log. However, some transactions were in progress at the time of failure and must be redone or undone to obtain the most recent consistent state.


    Cold start: the system is being brought up with amnesia about prior incarnations. The log is not referenced to determine previous state.

5.8.3. RECOVERY PROTOCOLS

All participants in a transaction, including all components, understand and obey the following protocols when operating on recoverable objects:

    Consistency lock protocol.
    The DO-UNDO-REDO paradigm for log records.
    Write Ahead Log protocol (WAL).
    Two phase commit protocol.

The consistency lock protocol was discussed in the section on lock management. The remaining protocols are discussed below.

5.8.3.1. Log and the DO-UNDO-REDO Paradigm

Perhaps the simplest and easiest to implement recovery technique is based on the old-master new-master dichotomy common to most batch data processing systems: if the run fails, one goes back to the old-master and tries again. Unhappily, this technique does not seem to generalize to concurrent transactions. If several transactions concurrently access an object, then making a new-master object or returning to the old-master may be inappropriate because it commits or backs up all updates to the object by all transactions. It is desirable to be able to commit or undo updates on a per-transaction basis. Given an action consistent state and a collection of in-progress transactions (i.e. commit not yet executed) one wants to be able to selectively undo a subset of the transactions without affecting the others. Such a facility is called transaction backup.

A second shortcoming of versions is that in the event of a media error, one must reconstruct the most recent consistent state. For example, if a page or collection of pages is lost from non-volatile storage then they must be reconstructed from some redundant information. Doubly-recording the versions on independent devices is quite expensive for large objects. However, this is the technique used for some small objects such as the warm start file. Lastly, writing a new version of a large data base often consumes large amounts of storage and bandwidth.

Having abandoned the notion of versions, we adopt the approach of updating in place and of keeping an incremental log of changes to the system state.
(Logs are sometimes called audit trails or journals.) Each action which modifies a recoverable object writes a log record giving the old and new value of the updated object. Read operations need generate no log records, but update operations must record enough information in the log so that given the record at a later time the operation can be completely undone or redone. These records will be aggregated by transaction and collected in a common system log which resides in nonvolatile storage and will itself be duplexed and have independent failure modes. In what follows we assume that the log never fails. By duplexing, triplexing,... the log one can make this assumption less false.

Every recoverable operation must have:

     A DO entry which does the action and also records a log record
     sufficient to undo and to redo the operation.

     An UNDO entry which undoes the action given the log record
     written by the DO action.

     A REDO entry which redoes the action given the log record
     written by the DO action.

     Optionally a DISPLAY entry which translates the log into a
     human-readable format.

To give an example of an action and the log record it must write, consider the data base record update operator. This action must record in the log:

     (1) the record name,
     (2) the old record value (used for UNDO),
     (3) the new record value (used for REDO).

The log subsystem augments this with the additional fields:

     (4) transaction identifier,
     (5) action identifier,
     (6) length of log record,
     (7) pointer to previous log record of this transaction.

DECLARE 1 UPDATE_LOG_RECORD BASED,
    2 LENGTH FIXED(16),             /* length of log record            */
    2 TYPE FIXED(16),               /* code assigned to update log recs*/
    2 TRANSACTION FIXED(48),        /* name of transaction             */
    2 PREV_LOG_REC POINTER(31),     /* relative address of prev log    */
                                    /* record of this transaction      */
    2 SET FIXED(32),                /* name of updated set             */
    2 RECORD FIXED(32),             /* name of updated record          */
    2 NFIELDS FIXED(16),            /* number of updated fields        */
    2 CHANGES (NFIELDS),            /* for each changed field:         */
        3 FIELD FIXED(16),          /* name of field                   */
        3 OLD_VALUE,                /* old value of field              */
            4 F_LENGTH FIXED(16),   /* length of old field value       */
            4 F_ATOM CHAR(F_LENGTH),/* value in old field              */
        3 NEW_VALUE LIKE OLD_VALUE, /* new value of field              */
    2 LENGTH_AT_END FIXED(16);      /* allows reading log backwards    */

The data manager's undo operation restores the record to its old value, appropriately updating indices and sets. The redo operation restores the record to its new value. The display operation returns a text string giving a symbolic display of the log record.

The log itself is recorded on a dedicated media (disk, tape,...). Once a log record is recorded, it cannot be updated. However, the log component provides a facility to open read cursors on the log which will traverse the system log or will traverse the log of a particular transaction in either direction.

The UNDO operation must face a rather difficult problem at restart: the undo operation may be performed more than once if restart itself is redone several times (i.e. if the system fails during restart.) Also one may be called upon to undo operations which were never reflected in nonvolatile storage (i.e. the log write occurred but the object write did not.) Similar problems exist for REDO. One may have to REDO an already done action if the updated object was recorded in non-volatile storage before the crash or if restart is restarted. The write ahead log protocol and high water marks (see below) solve these problems.

5.8.3.2. Write Ahead Log Protocol

The recovery system postulates that memory comes in two flavors: volatile and nonvolatile storage. Volatile storage does not survive a system restart and nonvolatile storage usually survives a system restart.

Suppose an object is recorded in non-volatile storage before the log records for the object are recorded in the non-volatile log. If the system crashes at such a point, then one cannot undo the update. Similarly, if the new object is one of a set which are committed together and if a media error occurs on the object, then a mutually consistent version of the set of objects cannot be constructed from their non-volatile versions. Analysis of these two examples indicates that the log should be written to non-volatile storage before the object is written. Actions are required to write log records whenever modifying recoverable objects. The log (once recorded in nonvolatile storage) is considered to be very reliable. In general the log is dual recorded on physical media with independent failure modes (e.g. dual tapes or spindles) although single logging is a system option. The Write Ahead Log protocol (WAL) is:

     Before over-writing a recoverable object in nonvolatile storage
     with uncommitted updates, a transaction (process) should first
     force its undo log for the relevant updates to nonvolatile log
     space.

     Before committing an update to a recoverable object, the
     transaction coordinator (see below) must force the redo and undo
     log to nonvolatile storage so that it can go either way on the
     transaction commit. (This is guaranteed by recovery management,
     which will synchronize the commit process with the writing of the
     PHASE12 log transition record at the end of phase 1 of commit
     processing. This point cannot be understood before the section on
     two phase commit processing is read.)

This protocol needs to be interpreted broadly in the case of messages: one should not send a recoverable message before it is logged (so that the message can be canceled or retransmitted.) In this case, the wires of the network are the "non-volatile storage".

The write ahead log protocol is implemented as follows. Every log record has a unique sequence number. Every recoverable object has a "high water mark" which is the largest log sequence number that applies to it. Whenever an object is updated, its high water mark is set to the log sequence number of the new log record. The object cannot be written to non-volatile storage before the log has been written past the object's high water mark. Log manager provides a synchronous call


to force out all log records up to a certain sequence number. At system restart a transaction may be undone or redone. If an error occurs the restart may be repeated. This means that an operation may be undone or redone more than once. Also, since the log is "ahead of" non-volatile storage, the first undo may apply to an already undone (not-yet-done) change. Similarly, the first redo may redo an already done change. This requires that the redo and undo operators be repeatable (idempotent) in the sense that doing them once produces the same result as doing them several times. Undo or redo may be invoked repeatedly if restart is retried several times or if the failure occurs during phase 2 of commit processing. Here again, the high water mark is handy. If the high water mark is recorded with the object, and if the movement of the object to nonvolatile storage is atomic (this is true for pages and for messages), then one can read the high water mark to see if undo or redo is necessary. This is a simple way to make the undo and redo operators idempotent. Message sequence numbers on a session perform the function of high water marks. That is, the recipient can discard messages below the last sequence number received.

As a historical note, the need for WAL only became apparent with the widespread use of LSI memories. Prior to that time the log buffers resided in core storage which survived software errors, hardware errors and power failure. This allowed the system to treat the log buffers in core as non-volatile storage. At power shutdown, an exception handler in the data management dumps the log buffers. If this fails, a scavenger is run which reads them out of core to storage. In general the contents of LSI storage does not survive power failures.
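The high water mark test described above can be sketched in a few lines. This is an illustrative model only: it assumes each page carries its log sequence number (LSN) and moves to nonvolatile storage atomically, and backing the LSN up on undo is a simplification made for the sketch.

```python
# Idempotent redo and undo using a per-page high water mark (LSN).
# An operation is applied only when the page's LSN shows it is needed,
# so doing it once or several times produces the same result.
# Illustrative sketch, not any system's actual code.

def redo(page, rec):
    if page["lsn"] < rec["lsn"]:      # update is not yet on the page
        page["value"] = rec["new"]
        page["lsn"] = rec["lsn"]      # advance the high water mark

def undo(page, rec):
    if page["lsn"] >= rec["lsn"]:     # update is on the page
        page["value"] = rec["old"]
        page["lsn"] = rec["lsn"] - 1  # simplification for this sketch

page = {"lsn": 10, "value": "A"}
rec = {"lsn": 11, "old": "A", "new": "B"}
redo(page, rec)
redo(page, rec)   # repeated redo is harmless
undo(page, rec)
undo(page, rec)   # repeated undo is harmless
```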
To guard against power failure, memory failure and wild stores by the software, most systems have opted for the WAL protocol.

5.8.3.3. The Two Phase Commit Protocol

5.8.3.3.1. The Generals Paradox

In order to understand the problem that the two phase commit protocol solves, it is useful to analyze the generals paradox. There are two generals on campaign. They have an objective (a hill) which they want to capture. If they simultaneously march on the objective they are assured of success. If only one marches, he will be annihilated. The generals are encamped only a short distance apart, but due to technical difficulties, they can communicate only via runners. These messengers have a flaw: every time they venture out of camp they stand some chance of getting lost (they are not very smart.) The problem is to find some protocol which allows the generals to march together even though some messengers get lost. There is a simple proof that no fixed length protocol exists: let P be the shortest such protocol. Suppose the last messenger in P gets lost. Then either this messenger is useless or one of the generals doesn't get a needed message. By the minimality of P, the last message is not useless, so one of the generals doesn't march if the last message is lost. This contradiction proves that no such protocol P exists.


The generals paradox (which as you now see is not a paradox) has strong analogies to problems faced by data recovery management when doing commit processing. Imagine that one of the generals is a computer in Tokyo and that the other general is a cash dispensing terminal in Fuessen, Germany. The goal is to open a cash drawer with a million Marks in it (at Fuessen) and debit the appropriate account in the non-volatile storage of the Tokyo computer. If only one thing happens, either the Germans or the Japanese will destroy the general that did not "march".

5.8.3.3.2. The Two Phase Commit Protocol

As explained above, there is no solution to the two generals problem. If, however, the restriction that the protocol have some finite fixed maximum length is relaxed, then a solution is possible. The protocol about to be described may require arbitrarily many messages. Usually it requires only a few messages, sometimes it requires more, and in some cases (a set of measure zero) it requires an infinite number of messages. The protocol works by introducing a commit coordinator. The commit coordinator has a communication path to all participants. Participants are either cohorts (processes) at several nodes, or are autonomous components within a process (like DB and DC), or are both. The commit coordinator asks all the participants to go into a state such that, no matter what happens, the participant can either redo or undo the transaction (this means writing the log in a very safe place). Once the coordinator gets the votes from everyone:

     If anyone aborted, the coordinator broadcasts abort to all
     participants, records abort in his log, and terminates. In this
     case all participants will abort.

     If all participants voted yes, the coordinator synchronously
     records a commit record in the log, then broadcasts commit to all
     participants, and when an acknowledge is received from each
     participant, the coordinator terminates.

The key to the success of this approach is that the decision to commit has been centralized in a single place and is not time constrained. The following diagrams show the possible interactions between a coordinator and a participant. Note that a coordinator may abort a participant which agrees to commit. This may happen because another participant has aborted.


COORDINATOR                              PARTICIPANT

               request commit
             -------------------->
                    agree
             <--------------------
                    commit
             -------------------->
                     yes
             <--------------------

        (1) Successful commit exchange.

               request commit
             -------------------->
                     no
             <--------------------
                    abort
             -------------------->

        (2) Participant aborts commit.

               request commit
             -------------------->
                    agree
             <--------------------
                    abort
             -------------------->

        (3) Coordinator aborts commit.

Three possible two phase commit scenarios.
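The three scenarios reduce to one centralized decision. The following toy sketch (invented names, not any system's code) condenses them: a missing or negative vote yields a global abort, and every participant obeys the single verdict.

```python
# Toy two phase commit decision logic. Phase 1 collects votes; any
# 'abort' vote, or a silent (timed out) participant, forces a global
# abort. Phase 2 broadcasts the one centralized decision.
# Illustrative only.

def two_phase_commit(votes):
    # votes: one entry per participant: 'agree', 'abort', or None (no reply)
    decision = "commit" if all(v == "agree" for v in votes) else "abort"
    outcomes = [decision] * len(votes)  # phase 2: everyone obeys the verdict
    return decision, outcomes

d1, o1 = two_phase_commit(["agree", "agree", "agree"])  # scenario (1)
d2, o2 = two_phase_commit(["agree", "abort"])           # scenario (2)
d3, o3 = two_phase_commit(["agree", None])              # timeout: abort
```

Whatever happens, no mixture of outcomes is possible: either every participant commits or every participant aborts.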


The logic for the coordinator is best described by a simple program:

COORDINATOR: PROCEDURE;
    VOTE='COMMIT';                      /* collect votes             */
    DO FOR EACH PARTICIPANT WHILE (VOTE='COMMIT');
        SEND HIM REQUEST_COMMIT;
        IF REPLY ^= 'AGREE' THEN VOTE='ABORT';
        END;
    IF VOTE='COMMIT' THEN               /* if all agree then commit  */
        DO;
        WRITE_LOG(PHASE12_COMMIT) FORCE;
        FOR EACH PARTICIPANT
            DO UNTIL (+ACK);
            SEND HIM COMMIT;
            WAIT +ACKNOWLEDGE;
            IF TIMELIMIT THEN RETRANSMIT;
            END;
        END;
    ELSE DO;                            /* if any abort, then abort  */
        FOR EACH PARTICIPANT
            DO UNTIL (+ACK);
            SEND MESSAGE ABORT;
            WAIT +ACKNOWLEDGE;
            IF TIMELIMIT THEN RETRANSMIT;
            END;
        END;
    WRITE_LOG(COORDINATOR_COMPLETE);    /* common exit               */
    RETURN;
    END COORDINATOR;

The protocol for the participant is simpler:

PARTICIPANT: PROCEDURE;
    WAIT_FOR REQUEST_COMMIT;            /* phase 1                   */
    FORCE UNDO REDO LOG TO NONVOLATILE STORE;
    IF SUCCESS THEN                     /* writes AGREE in log       */
        REPLY 'AGREE';
    ELSE REPLY 'ABORT';
    WAIT FOR VERDICT;                   /* phase 2                   */
    IF VERDICT = 'COMMIT' THEN
        DO;
        RELEASE RESOURCES & LOCKS;
        REPLY +ACKNOWLEDGE;
        END;
    ELSE DO;
        UNDO PARTICIPANT;
        REPLY +ACKNOWLEDGE;
        END;
    END PARTICIPANT;

There is a last piece of logic that needs to be included: in the event of restart, recovery manager has only the log and the nonvolatile store. If the coordinator crashed before the PHASE12_COMMIT record appeared in the log, then restart will broadcast abort to all participants. If the transaction's PHASE12_COMMIT record appeared and the COORDINATOR_COMPLETE record did not appear, then restart will re-broadcast the COMMIT message. If the transaction's COORDINATOR_COMPLETE record appears in the log, then restart will


ignore the transaction. Similarly, transactions will be aborted if the log has not been forced with AGREE. If the AGREE record appears, then restart asks the coordinator whether the transaction committed or aborted and acts accordingly (redo or undo.)

Examination of this protocol shows that transaction commit has two phases:

     before its PHASE12_COMMIT or AGREE_COMMIT log record has been
     written and,

     after its PHASE12_COMMIT or AGREE_COMMIT log record has been
     written.

This is the reason it is called a two phase commit protocol. A fairly lengthy analysis is required to convince oneself that a crash or lost message will not cause one participant to "march" the wrong way. Let us consider a few cases. If any participant aborts or crashes in his phase 1, then the entire transaction will be aborted (because the coordinator will sense that he is not replying, using timeout). If a participant crashes in his phase 2, then recovery manager, as a part of restart of that participant, will ask the coordinator whether or not to redo or undo the transaction instance. Since the participant wrote enough information for this in the log during phase 1, recovery manager can go either way on completing this participant. This requires that the undo and redo be idempotent operations. Conversely, if the coordinator crashes before it writes the PHASE12_COMMIT log record, then restart will broadcast abort to all participants. No participant has committed because the coordinator's PHASE12_COMMIT record is synchronously written before any commit messages are sent to participants. On the other hand, if the coordinator's PHASE12_COMMIT record is found in the log at restart, then the recovery manager broadcasts commit to all participants and waits for acknowledge. This redoes the transaction (coordinator). This rather sloppy argument can be (has been) made more precise. The net effect of the algorithm is that either all the participants commit or that none of them commit (all abort.)

5.8.3.3.3. Nested Two Phase Commit Protocol

Many optimizations of the two phase commit protocol are possible. As described above, commit requires 4N messages if there are N participants: the coordinator invokes each participant once to take the vote and once to broadcast the result. If invocation and return are expensive (e.g. go over thin wires), then a more economical protocol may be desired. If the participants can be linearly ordered, then a simpler and faster commit protocol which has 2N calls and returns is possible. This protocol is called the nested two phase commit. The protocol works as follows: each participant is given a sequence number in the commit call order. In particular, each participant knows the name of the next participant and the last participant knows that he is the last.

Commit consists of participants successively calling one another (N-1 calls) after performing phase 1 commit. At the end of the calling sequence each participant will have successfully completed phase 1, or some participant will have broken the call chain. So the last participant can perform phase 2 and return success. Each participant keeps this up so that in the end there are N-1 returns, to give a grand total of 2(N-1) calls and returns on a successful commit. There is one last call required to signal the coordinator (last participant) that the commit completed so that restart can ignore redoing this transaction. If some participant does not succeed in phase 1, then he issues abort and transaction undo is started. The following is the algorithm of each participant:

COMMIT: PROCEDURE;
    PERFORM PHASE 1 COMMIT;
    IF FAIL THEN RETURN FAILURE;
    IF I_AM_LAST THEN WRITE_LOG(PHASE12) FORCE;
    ELSE DO;
        CALL COMMIT(I+1);
        IF FAIL THEN DO;
            ABORT;
            RETURN FAILURE;
            END;
        END;
    PERFORM PHASE 2 COMMIT;
    IF I_AM_FIRST THEN INFORM LAST THAT COMMIT COMPLETED;
    RETURN SUCCESS;
    END;

The following gives a picture of a three deep nest:


        commit
       ---------> R1 --- PHASE1 ---> R2 --- PHASE1 ---> R3
                                                         |
                  R1 <-- PHASE2 ---- R2 <--- PHASE2 -----+
         yes
       <---------

        (a) a successful commit.

        commit
       ---------> R1 --- PHASE1 ---> R2 --- PHASE1 ---> R3
                                                         |
                  R1 <-- ABORT ----- R2 <--- ABORT ------+
         no
       <---------

        (b) an unsuccessful commit.

5.8.3.3.4. Comparison Between General and Nested Protocols

The nested protocol is appropriate for a system in which:

     The send-receive message cost is high and broadcast is not
     available.

     The need for concurrency within phase 1 and concurrency within
     phase 2 is low.

     The participant and cohort structure of the transaction is static
     or universally known.
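The nested protocol's call chain can be simulated in a few lines. The sketch below uses invented names and is only an illustration of the calling pattern: phase 1 is performed going down the chain and phase 2 coming back.

```python
# Linear (nested) two phase commit: participant i performs phase 1,
# calls participant i+1, and performs phase 2 on the successful return.
# The last participant is the point where phase 1 flips to phase 2.
# Illustrative sketch only.

def nested_commit(phase1_ok, i=0, trace=None):
    # phase1_ok[i]: whether participant i completes phase 1
    if trace is None:
        trace = []
    if not phase1_ok[i]:
        trace.append(("abort", i))       # break the call chain
        return False, trace
    trace.append(("phase1", i))
    if i + 1 < len(phase1_ok):           # call the next participant
        ok, trace = nested_commit(phase1_ok, i + 1, trace)
        if not ok:
            trace.append(("undo", i))    # abort propagates back
            return False, trace
    trace.append(("phase2", i))          # performed on the way back
    return True, trace

ok, trace = nested_commit([True, True, True])
```

For N participants a successful run makes N-1 nested calls and N-1 returns, the 2(N-1) figure given above.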

Most data management systems have opted for the nested commit protocol for these reasons. On the other hand, the general two phase commit protocol is appropriate if:

     Broadcast is the normal mode of interprocess communication (in
     that case the coordinator sends two messages and each process
     sends two messages for a total of 2N messages.) Aloha net,
     Ethernet, ring-nets, and space-satellite nets have this property.

     Parallelism among the cohorts of a transaction is desirable (the
     nested protocol has only one process active at a time during
     commit processing.)

5.8.3.4. Summary of Recovery Protocols

The consistency lock protocol isolates the transaction from inconsistencies due to concurrency. The DO-REDO-UNDO log record protocol allows for recovery of committed and uncommitted actions. The write ahead log protocol insures that the log is ahead of nonvolatile storage so that undo and redo can always be performed.


The two phase commit protocol coordinates the commitment of autonomous participants (or cohorts) within a transaction. The following table explains the virtues of the write ahead log and two phase commit protocols. It examines the possible situations after a crash. The relevant issues are whether an update to the object survived (was written to nonvolatile storage), and whether the log record corresponding to the update survived. One will never have to redo an update whose log record is not written because: only committed transactions are redone, and COMMIT writes out the transaction's log records before the commit completes. So the (no, no, redo) case is precluded by two phase commit. Similarly, write ahead log (WAL) precludes the (no, yes, *) cases because an update is never written before its log record. The other cases should be obvious.

                  |          LOG RECORD WRITTEN           |
                  |        NO         |        YES        |
  ----------------+-------------------+-------------------+
  OBJECT      NO  | REDO: IMPOSSIBLE  | REDO: REDO        |
  WRITTEN         |  BECAUSE OF TWO   | UNDO: NONE        |
                  |  PHASE COMMIT     |                   |
                  | UNDO: NONE        |                   |
  ----------------+-------------------+-------------------+
              YES | IMPOSSIBLE        | REDO: NONE        |
                  |  BECAUSE OF WRITE | UNDO: UNDO        |
                  |  AHEAD LOG        |                   |
  ----------------+-------------------+-------------------+
5.8.~.

STRUCTURE

OF R E C O V B R Y

MANAGER components.

Recovery e

management c o n s i s t s of two

Recovery manager which is responsible for the tracking of t r a n s a c t i o n s and the c o o r d i n a t i o n of t r a n s a c t i o n COMMIT and ABORT, and system C H E C K P O I N T and RESTART (see below). Log manager w h i c h is used Dy r e c o v e r y manager and other c o m p o n e n t s to record i n f o r m a t i o n in the s y s t e m log for the t r a n s a c t i a ~ or system.

473

   +-------------------+          +---------------+
   | RECOVERY MANAGER  |          | OTHER ACTIONS |
   |  UNDO   |   REDO  |          |               |
   +---------+---------+          +-------+-------+
             |                            |
             +------ READ/WRITE LOG ------+
                           |
                           V
                    +-------------+
                    | LOG MANAGER |
                    +-------------+

Relationship between log manager and component actions.

The purpose of the recovery system is two-fold: First, the recovery system allows an in-progress transaction to be "undone" in the event of a "minor" error without affecting other transactions. Examples of such errors are operator cancellation of the transaction, deadlock, timeout, protection or integrity violation, resource limit, .... Second, in the event of a "serious" error, the recovery subsystem minimizes the amount of work that is lost by restoring all data to its most recent committed state. It does this by periodically recording copies of key portions of the system state in nonvolatile storage and by continuously maintaining a log of changes to the state, as they occur. In the event of a catastrophe, the most recent transaction consistent version of the state is reconstructed from the current state on nonvolatile storage by using the log to undo any transactions which were incomplete at the time of the crash and redoing any transactions which completed in the interval between the checkpoint and the crash.

In the case that on-line nonvolatile storage does not survive, one must start with an archival version of the state and reconstruct the most recent consistent state from it. This process requires:

     Periodically making complete archive copies of objects within the
     system.

     Running a change accumulation utility against the logs written
     since the dump. This utility produces a much smaller list of
     updates which will bring the image dump up to date. Also this
     list is sorted by physical address so that adding it to the image
     dump is a sequential operation.

The change accumulation is merged with the image dump to reconstruct the most recent consistent state. Other reasons for keeping a log of the actions of transactions include auditing and performance monitoring, since the log is a trace of system activity.

There are three separate recovery mechanisms:

(a) Incremental log of updates to the state.


(b) Current on-line version of the state.

(c) Archive versions of the state.

5.8.4.1. Transaction Save Logic

When the transaction invokes SAVE, a log record is recorded which describes the current state of the transaction. Each component involved in the transaction must then record whatever it needs to restore its recoverable objects to their state at this point. For example, the terminal handler might record the current state of the session so that if the transaction backs up to this point, the terminal can be reset to this point. Similarly, data base manager might record the positions of cursors. The application program may also record log records at a save point. A save point does not commit any resources or release any locks.

5.8.4.2. Transaction Commit Logic

When the transaction issues COMMIT, recovery manager invokes each component (participant) to perform commit processing. The details of commit processing were discussed under the topic of recovery protocols above. Briefly, commit is a two phase process. During phase 1, each manager writes a log record which allows it to go either way on the transaction (undo or redo). If all resource managers agree to commit, then recovery manager forces the log to secondary storage and enters phase 2 of commit. Phase 2 consists of committing updates: sending messages, writing updates to nonvolatile storage and releasing locks. In phase 1 any resource manager can unilaterally abort the transaction, thereby causing the commit to fail. Once a resource manager agrees to phase 1 commit, that resource manager must be willing to accept either abort or commit from recovery manager.

5.8.4.3. Transaction Backup Logic

The effect of any incomplete transaction can be undone by reading the log of that transaction backwards, undoing each action in turn. Given the log of a transaction T:

UNDO(T): PROCEDURE;
    DO WHILE (LOG(T) ^= NULL);
        LOG_RECORD = LAST_RECORD(LOG(T));
        UNDOER = WHO_WROTE(LOG_RECORD);
        CALL UNDOER(LOG_RECORD);
        INVALIDATE(LOG_RECORD);
        END;
    END UNDO;

Clearly, this process can be stopped half-way, thereby returning the transaction to an intermediate save point. Transaction save points allow the transaction to backtrack in case of some error and yet salvage all successful work. From this discussion it follows that a transaction's log is a push down stack, and that writing a new record pushes it onto the stack while undoing a record pops it off the stack (invalidates it). For efficiency reasons, all transaction logs are merged into one system log which is then mapped into a log file. But the log records of a particular transaction are threaded together and anchored off of the process executing the transaction.

Notice that UNDO requires that while the transaction is active, the log


be directly addressable. This is the reason that at least one version of the log should be on some direct access device. A tape based log would not be convenient for in-progress transaction undo.

The undo logic of recovery manager is very simple. It reads a record, looks at the name of the operation that wrote the record, and calls the undo entry point of that operation using the record type. Thus recovery manager is table driven and therefore it is fairly easy to add new operations to the system.

Another alternative is to defer updates until phase 2 of commit processing. Once a transaction gets to phase 2 it must complete successfully, thus if all updates are done in phase 2 no undo is ever required (only redo logic is required.) IMS data communications and IMS Fast Path use this protocol.

5.8.4.4. System Checkpoint Logic

System checkpoints may be triggered by operator commands, timer facilities, or counters such as the number of bytes of log record since last checkpoint. The general idea is to minimize the distance one must travel in the log in the event of a catastrophe. This must be balanced against the cost of taking frequent checkpoints. Five minutes is a typical checkpoint interval. Checkpoint algorithms which require a system quiesce should be avoided because they imply that checkpoints will be taken infrequently, thereby making restart expensive.

The checkpoint process consists of writing a BEGIN CHECKPOINT record in the log, then invoking each component of the system so that it can contribute to the checkpoint, and then writing an END CHECKPOINT record in the log. These records bracket the checkpoint records of the other system components. Each component may write one or more log records so that it will be able to restart from the checkpoint.
For example, buffer manager will record the names of the buffers in the buffer pool, file manager might record the status of files, network manager may record the network status, and transaction manager will record the names of all transactions active at the checkpoint. After the checkpoint log records have been written to non-volatile storage, recovery manager records the address of the most recent checkpoint in a warm start file. This allows restart to quickly locate the checkpoint record (rather than sequentially searching the log for it.) Because this is such a critical resource, the restart file is duplexed (two copies are kept) and writes to it are alternated so that one file points to the current and another points to the previous checkpoint log record.

At system restart, the programs are loaded and the transaction manager invokes each component to re-initialize itself. Data communications begins network-restart and the data base manager reacquires the data base from the operating system (opens the files).
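The checkpoint bracketing and the alternating warm start file can be sketched as follows (an illustrative model; the record contents and log addresses are invented):

```python
# Sketch of system checkpoint writing: BEGIN/END records bracket each
# component's checkpoint records, and the duplexed warm start file,
# written alternately, remembers where the latest checkpoint begins.
# Illustrative only.

log = []
warm_start = [None, None]   # two copies, written alternately
ckpt_count = 0

def checkpoint(component_records):
    global ckpt_count
    begin_addr = len(log)               # log address of BEGIN CHECKPOINT
    log.append("BEGIN_CHECKPOINT")
    log.extend(component_records)       # each component's restart records
    log.append("END_CHECKPOINT")
    # Alternate writes between the two copies, so if this write fails
    # the other copy still points at the previous checkpoint.
    warm_start[ckpt_count % 2] = begin_addr
    ckpt_count += 1

checkpoint(["buffer pool names", "file status"])
checkpoint(["buffer pool names", "file status"])
```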

Recovery manager is then given control. Recovery manager examines the most recent warm start file written by checkpoint to discover the location of the most recent system checkpoint in the log. Recovery manager then examines the most recent checkpoint record in the log. If there was no work in progress at the system checkpoint and the system checkpoint is the last record in the log, then the system is restarting from a shutdown in a quiesced state. This is a warm start and no transactions need be undone or redone. In this case, recovery


manager writes a restart record in the log and returns to the scheduler which opens the system for general use.

On the other hand, if there was work in progress at the system checkpoint, or if there are further log records, then this is a restart from a crash (emergency restart). The following figure will help to explain the emergency restart logic:

    T1  ------+
    T2  ------------+-------+
    T3              |  +--------+
    T4  ------------+--------------------
    T5              |     +--------------
                    |                   |
               CHECKPOINT         SYSTEM CRASH

Five transaction types with respect to the most recent system checkpoint and the crash point.

Transactions T1, T2 and T3 have committed and must be redone. Transactions T4 and T5 have not committed and so must be undone. Let's call transactions like T1, T2 and T3 winners and let's call transactions like T4 and T5 losers. Then the restart logic is:

RESTART: PROCEDURE;
    DICHOTOMIZE WINNERS AND LOSERS;
    REDO THE WINNERS;
    UNDO THE LOSERS;
    END RESTART;

It is important that the REDOs occur before the UNDOs. (Do you see why?)

As it stands, this implies reading every log record ever written because redoing the winners requires going back to redo almost all transactions ever run.
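The RESTART procedure above can be sketched concretely. The log-record layout here is hypothetical; the point is the two passes: a forward scan to dichotomize winners and losers, then redo of the winners before undo of the losers.

```python
# Minimal sketch of RESTART: scan the log to classify transactions as
# winners (a COMMIT record exists) or losers, then redo the winners
# forward before undoing the losers backward.

def restart(log):
    winners, losers = set(), set()
    for rec in log:                       # forward scan of the whole log
        if rec["type"] == "BEGIN":
            losers.add(rec["txn"])        # tentatively a loser
        elif rec["type"] == "COMMIT":
            losers.discard(rec["txn"])
            winners.add(rec["txn"])       # promoted to winner

    actions = []
    for rec in log:                       # REDO the winners, forward
        if rec["type"] == "UPDATE" and rec["txn"] in winners:
            actions.append(("redo", rec["txn"], rec["lsn"]))
    for rec in reversed(log):             # then UNDO the losers, backward
        if rec["type"] == "UPDATE" and rec["txn"] in losers:
            actions.append(("undo", rec["txn"], rec["lsn"]))
    return winners, losers, actions

log = [
    {"type": "BEGIN",  "txn": "T1", "lsn": 1},
    {"type": "UPDATE", "txn": "T1", "lsn": 2},
    {"type": "BEGIN",  "txn": "T4", "lsn": 3},
    {"type": "UPDATE", "txn": "T4", "lsn": 4},
    {"type": "COMMIT", "txn": "T1", "lsn": 5},
]                                         # crash here: T4 never committed
winners, losers, actions = restart(log)
assert winners == {"T1"} and losers == {"T4"}
assert actions == [("redo", "T1", 2), ("undo", "T4", 4)]
```

Note that this naive version scans the entire log from the beginning, which is exactly the cost the checkpoint-based scheme described next is designed to avoid.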

Much of the sophistication of the restart process is dedicated to minimizing the amount of work that must be done so that restart can be as quick as possible. (We are describing here one of the more trivial workable schemes.) In general restart discovers a time T such that redo log records written prior to time T are not relevant to restart.

To see how to compute the time T we first consider a particular object: a data base page P. Because this is a restart from a crash, the most recent version of P may or may not have been recorded on non-volatile storage. Suppose page P was written out with high water mark LSN(P). If winners wrote updates to P which were logged after LSN(P), then these updates to P must be redone. Conversely, if P was written out with a loser's update then these updates must be undone. (Similarly, message M may or may not have been sent to its destination. If it was generated by a transaction which is to be undone then the message should be canceled. If M was generated by a committed transaction but not sent then it should be retransmitted.)

The figure below illustrates the five possible types of transactions at this point: T1 began and committed before LSN(P), T2 began before LSN(P) and ended before the crash, T3 began after LSN(P) and ended before the crash, T4 began before LSN(P) but no COMMIT record appears in the log, and T5 began after LSN(P) and apparently never ended. To honor the commit of T1, T2 and T3 requires that their updates be


added to page P (redone). But T4 and T5 have not committed and so must be undone.

    T1  ------+
    T2  ------------+-------+
    T3              |  +--------+
    T4  ------------+--------------------
    T5              |     +--------------
                    |                   |
              wrote page P        SYSTEM CRASH
                 LSN(P)

Five transaction types with respect to the most recent write of page P and the crash point.

Notice that none of the updates of T5 are reflected in this state so it is already undone. Notice also that all of the updates of T1 are in the state so it need not be redone. So only T2, T3, and T4 remain. T2 and T3 must be redone from LSN(P) forward. The updates of the first half of T2 are already reflected in the page P because it has log sequence number LSN(P). On the other hand, T4 must be undone from LSN(P) backwards. (Here we are skipping over the following anomaly: if after LSN(P), T2 backs up to a point prior to LSN(P) then some undo work is required for T2. This problem is not difficult, just annoying.) Therefore the oldest redo log record relevant to P is at or after LSN(P). (The write ahead log protocol is relevant here.)

At system checkpoint, data manager records LSNMIN, the log sequence number of the oldest page not yet written (the minimum LSN(P) of all pages, P, not yet written.) Similarly, transaction manager records the name of each transaction active at the checkpoint. Restart chooses T as the LSNMIN of the most recent checkpoint.

Restart proceeds as follows: It reads the system checkpoint log record and puts each transaction active at the checkpoint into the loser set. It then scans the log forward to the end. If a COMMIT log record is encountered, that transaction is promoted to the winners set. If a BEGIN TRANSACTION record is found, the transaction is tentatively added to the loser set. When the end of the log is encountered, the winners and losers have been computed. The next thing is to read the log backwards, undoing the losers, and then, starting from time T, read the log forward redoing the winners.

This discussion of restart is very simplistic. Many systems have added mechanisms to speed restart by:

Preventing the writing of uncommitted objects to non-volatile storage (no stealing) so that undo is never required.

Writing committed objects to secondary storage at phase 2 of commit (forcing), so that redo is only rarely required.

Logging the successful completion of the movement of an object to secondary storage. This minimizes redo.

Forcing all objects at system checkpoint to minimize "T".

5.8.4.5. Media Failure Logic

In the event of a hard system error (one which causes a loss of non-volatile storage integrity), it must be possible to continue with a minimum of lost work. Redundant copies of an object must be


maintained, for example on magnetic tape reels which are stored in a vault. It is important that the archive mechanism have independent failure modes from the regular storage subsystem. Thus using doubly redundant disk storage would protect against a disk head crash, but wouldn't protect against a bug in the disk driver routine or a fire in the machine room.

The archive mechanism periodically writes a checkpoint of the data base contents to magnetic tape, and writes a redo log of all update actions to magnetic tape. Then recovering from a hard failure is accomplished by locating the most recent surviving version on tape, loading it back into the system, and then redoing all updates from that point forward using the surviving log tapes.

While performing a system checkpoint causes relatively few disk writes and takes only a few seconds, copying the entire data base to tape is potentially a lengthy operation. Fortunately there is a (little used) trick: one can take a fuzzy dump of an object by writing it to archive with an idle task. After the dump is taken, the log generated during the fuzzy dump is merged with the fuzzy dump to produce a sharp dump. The details of this algorithm are left as an exercise for the reader.

5.8.4.6. Cold Start

Cold start is too horrible to contemplate. Since we assumed that the log never fails, cold start is never required. The system should be cold started once: when its first version is created by the implementors. Thereafter, it should be restarted. In particular, moving to new hardware or to a new release of the system should not require a cold start. (i.e. all data should survive.) Note that this requires that the format of the log never change; it can only be extended by adding new types of log records.

5.8.5. LOG MANAGEMENT

The log is a large linear byte space. It is very convenient if the log is write-once and then read-only and if space in the log is never re-written. This allows one to identify log records by the relative byte address of the last byte of the record.

A typical (small) transaction writes about 500 bytes of log. One can run about one hundred such transactions per second on current hardware. There are almost 100,000 seconds in a day. So the log can grow at 5 billion bytes per day. (More typically, systems write four log tapes a day at 50 megabytes per tape.) Given these statistics the log addresses should be about 48 bits long (good for 200 years on current hardware.)

Log manager must map this semi-infinite logical file (log) into the rather finite files (32 bit addresses) provided by the basic operating system. As one file is filled, another is allocated and the old one is archived. Log manager provides other resource managers with the operations:

WRITE_LOG: causes the identified log record to be written to the log. Once a log record is written, it can only be read, it cannot be edited. WRITE_LOG is the basic command used by all resource managers to record log records. It returns the address of the last byte of the written log record.


FORCE_LOG: causes the identified log record and all prior log records to be recorded in non-volatile storage. When it returns, the writes have completed.

OPEN_LOG: indicates that the issuer wishes to read the log of some transaction or the entire log in sequential order. It creates a read cursor on the log.

SEARCH_LOG: moves the cursor a designated number of bytes or until a log record satisfying some criterion is located.

READ_LOG: requests that the log record currently selected by the log cursor be read.

CHECK_LOG: allows the issuer to test whether a record has been placed in the non-volatile log and optionally to wait until the log record has been written out.

GET_CURSOR: causes the current value of the write cursor to be returned to the issuer. The RBA returned may be used at a later time to position a read cursor.

CLOSE_LOG: indicates the issuer is finished reading the log.
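A toy, in-memory sketch of some of these operations may make the interface concrete. This is an illustration under stated assumptions, not any real log manager's interface; record addressing by last-byte RBA follows the convention described above.

```python
# Toy log manager: WRITE_LOG returns the relative byte address (RBA)
# of the last byte of the record, FORCE_LOG hardens everything up to a
# given RBA, CHECK_LOG tests whether a record is non-volatile yet, and
# a read cursor supports OPEN_LOG / READ_LOG.

class LogManager:
    def __init__(self):
        self.records = []        # list of (last_byte_rba, payload)
        self.end_rba = 0         # write cursor (what GET_CURSOR returns)
        self.forced_rba = 0      # high-water mark on non-volatile storage

    def write_log(self, payload):
        self.end_rba += len(payload)
        self.records.append((self.end_rba, payload))
        return self.end_rba                  # address of record's last byte

    def force_log(self, rba):
        self.forced_rba = max(self.forced_rba, rba)   # harden through rba

    def check_log(self, rba):
        return rba <= self.forced_rba        # already non-volatile?

    def open_log(self):
        return {"pos": 0}                    # a read cursor

    def read_log(self, cursor):
        rec = self.records[cursor["pos"]]
        cursor["pos"] += 1
        return rec

lm = LogManager()
rba1 = lm.write_log(b"begin T1")
rba2 = lm.write_log(b"update T1")
assert rba2 > rba1                           # addresses grow monotonically
assert not lm.check_log(rba2)                # not yet forced
lm.force_log(rba2)
assert lm.check_log(rba1) and lm.check_log(rba2)
```

Because FORCE_LOG hardens all prior records as well, forcing a commit record implicitly forces every log record the transaction wrote, which is exactly what the write ahead log protocol needs.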

The write log operation moves a new log record to the end of the current log buffer. If the buffer fills, another is allocated and the write continues into the new buffer. When a log buffer fills or when a synchronous log write is issued, a log daemon writes the buffer to non-volatile storage.

Traditionally, logs have been recorded on magnetic tape because it is so inexpensive to store and because the transfer rate is quite high. In the future disk, CCD (non-volatile?) or magnetic bubbles may be attractive as a staging device for the log. This is especially true because an on-line version of the log is very desirable for transaction undo and for fast restart.

It is important to doubly record the log. If the log is not doubly recorded, then a media error on the log device will produce a cold start of the system. The dual log devices should be on separate paths so that if one device or path fails the system can continue in degraded mode (this is only appropriate for applications requiring high availability.)

The following problem is left as an exercise for the reader: We have decided to log to dedicated dual disk drives. When a drive fills it will be archived to a mass storage device. This archive process makes the disk unavailable to the log manager (because of arm contention.) Describe a scheme which:

minimizes the number of drives required,

always has a large reserve of free disk space, and

always has a large fraction of the recent section of the log on line.

5.8.5.1. Log Archiving and Change Accumulation

When the log is archived, it can be compressed so that it is convenient for media recovery. For disk objects, log records can be sorted by cylinder, then track, then sector, then time. Probably, all the records


in the archived log belong to completed transactions. So one only needs to keep redo records of committed (not aborted) transactions. Further, only the most recent redo record (new value) need be recorded. This compressed redo log is called a change accumulation log. Since it is sorted by physical address, media recovery becomes a merge of the image dump of the object and its change accumulation tape.
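The compression step just described can be sketched as follows. The record shapes are hypothetical; in this toy model a single page address stands in for the (cylinder, track, sector) sort key.

```python
# Sketch of building a change accumulation log: keep only redo records
# of committed transactions, keep only the most recent new value for
# each page, and sort by physical address so that media recovery is a
# sequential merge.

def change_accumulation(log_records, committed):
    latest = {}                                   # page address -> newest redo
    for rec in log_records:                       # log is in time order
        if rec["txn"] in committed:
            latest[rec["page"]] = rec["new_value"]
    return sorted(latest.items())                 # sorted by physical address

log_records = [
    {"txn": "T1", "page": 7, "new_value": "a"},
    {"txn": "T9", "page": 3, "new_value": "x"},   # T9 aborted: dropped
    {"txn": "T2", "page": 7, "new_value": "b"},   # newer value for page 7
    {"txn": "T2", "page": 3, "new_value": "c"},
]
acc = change_accumulation(log_records, committed={"T1", "T2"})
assert acc == [(3, "c"), (7, "b")]               # sorted, newest values only
```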
FAST_MEDIA_RECOVERY: PROCEDURE(IMAGE, CHANGE_ACCUM);
    DO WHILE (¬End_Of_File(IMAGE));
        READ IMAGE PAGE;
        UPDATE WITH REDO RECORDS FROM CHANGE_ACCUM;
        WRITE IMAGE PAGE TO DISK;
        END;
    END;

This is a purely sequential process (sequential on input files and sequential on disk being recovered) and so is limited only by the transfer rates of the devices. The construction of the change accumulation file can be done off-line as an idle task.
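A runnable sketch of the FAST_MEDIA_RECOVERY merge above follows; the data shapes are illustrative assumptions.

```python
# Sequentially read the image dump and apply the matching redo records
# from the (physically sorted) change accumulation file.

def fast_media_recovery(image, change_accum):
    patches = dict(change_accum)          # {page address: newest redo value}
    recovered = []
    for page_no, contents in image:       # sequential scan of the dump
        contents = patches.get(page_no, contents)  # apply newest redo value
        recovered.append((page_no, contents))      # "write page to disk"
    return recovered

image = [(0, "old0"), (1, "old1"), (2, "old2")]
change_accum = [(1, "new1"), (2, "new2")]
assert fast_media_recovery(image, change_accum) == [
    (0, "old0"), (1, "new1"), (2, "new2")]
```

Because both inputs are sorted by page address, a real implementation can stream both files with a single sequential pass rather than materializing the patch dictionary in memory.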

If media errors are rare and availability of the data is not a critical problem then one may run the change accumulation utilities only when needed. This may save building change accumulation files which are never used.

5.8.6. EXAMPLES OF RECOVERY ROUTINES

5.8.6.1. How to Get Perfectly Reliable Data Communications (a coming attraction)

watch this space

5.8.6.2. How to Get Perfectly Reliable Data Management (a coming attraction)

watch this space

5.8.7. HISTORICAL NOTE ON RECOVERY MANAGEMENT

Most of my understanding of the topic of recovery derives from the experience of the IMS developers and from the development of System R. Unfortunately, both these groups have been more interested in writing code and understanding the problems than in writing papers. Hence there is very little public literature which I can cite.

Ron Obermark seems to have discovered the notion of write ahead log in 1974. He also implemented the nested two phase commit protocol (almost). This work is known as the IMS Program Isolation Feature. Earl Jenner and Steve Weick first documented the two phase commit protocol in 1975 although it seems to have its roots in some systems built by Niko Garzado in 1970. Subsequently, the SNA architects, the IMS group and the System R group have explored various implementations of these ideas.

Paul McJones (now at Xerox) and I were stimulated by Warren Teitelman's history file in INTERLISP to implement the DO-UNDO-REDO paradigm for System R. The above presentation of recovery derives from drafts of various (unpublished) papers co-authored with John Nauman, Paul McJones, and Homer Leonard. The two phase commit protocol was independently discovered by Lampson and Sturgis (see below) and the nested commit protocol was independently discovered by Lewis, Stearns, and Rosenkrantz (see below.)


5.8.8. BIBLIOGRAPHY

Alsberg, "A Principle for Resilient Sharing of Distributed Resources," Second National Conference on Software Engineering, IEEE Cat. No. 76CH1125-4C, 1976, pp. 562-570. (A novel proposal (not covered in these notes) which proposes a protocol whereby multiple hosts can cooperate to provide very reliable transaction processing. It is the first believable proposal for system duplexing or triplexing I have yet seen. Merits further study and development.)

Bjork, "Recovery Scenario for a DB/DC System," Proceedings ACM National Conference, 1973, pp. 142-146.

Davies, "Recovery Semantics for a DB/DC System," Proceedings ACM National Conference, 1973, pp. 136-141. (The above two companion papers are the seminal work in the field. Anyone interested in the topic of software recovery should read them both at least three times and then once a year thereafter.)

Lampson, Sturgis, "Crash Recovery in a Distributed System," Xerox Palo Alto Research Center, 1976. To appear CACM. (A very nice paper which suggests the model of errors presented in section 5.8.1 and goes on to propose a three phase commit protocol. This three phase commit protocol is an elaboration of the two phase commit protocol. This is the first (only) public mention of the two phase commit protocol.)

Rosenkrantz, Stearns, Lewis, "System Level Concurrency Control for Data Base Systems," General Electric Research, Proceedings of Second Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Laboratory, LBL-6146, 1977, pp. 132-145. Also to appear in Transactions on Database Systems, ACM. (Presents a form of nested commit protocol; allows only one cohort at a time to execute.)

"Information Management System/Virtual Storage (IMS/VS), System Programming Reference Manual," IBM Form No. SH20-9027-2, p. 5-2. (Briefly describes WAL.)
