Advanced Computer Systems (ACS)
DIKU Course Compendium
Block 2, 2015/16
Contents

Preface
Learning Goals
Source List
1 Fundamental Abstractions
3 Techniques for Performance
4 Concurrency Control
6 Experimental Design
8 Topics in Distributed Coordination and Distributed Transactions
10 Data Processing - External Sorting
11 Data Processing - Basic Relational Operators and Joins
Preface

This compendium has been designed for the course Advanced Computer Systems (ACS), taught at the Department of Computer Science (DIKU), University of Copenhagen. The contents of the compendium are in correspondence with the rules of use at academic courses defined by CopyDan. The compendium is organized into 12 parts, each containing textbook chapters or reference papers related to topics covered in the course. Each part is also prefaced by a short description of the learning expectations with respect to the readings.
The compendium starts with a review of fundamental abstractions in computer systems, namely interpreters, memory, and communication links (Part 1). The course explores multiple properties that may be attached to these abstractions, and exposes principled design and implementation techniques to obtain these properties while respecting interfaces and achieving high performance. A first property is the notion of strong modularity, achieved by organization of interpreters into clients and services and use of remote procedure call (RPC) mechanics (Part 2).
After a brief review of general techniques for performance (Part 3), the properties of atomicity and durability are explored. The former property of atomicity may be understood with respect to before-or-after, or alternatively all-or-nothing semantics. Multiple different concurrency control protocols to achieve before-or-after atomicity over a memory abstraction are first introduced (Part 4). Following concurrency control, recovery protocols for all-or-nothing atomicity and durability are discussed (Part 5).
The text then ventures into a brief foray on techniques for experimental design (Part 6), which allow performance characteristics of different designs and implementations of a given abstraction to be analyzed. After this foray, the compendium then turns to the property of high availability in the presence of faults, achieved by a combination of techniques. First, general techniques for reliability, in particular replication techniques, are discussed (Part 7). Distribution of system functionality and replication introduce the problem of maintaining consistency. So, solutions for achieving high degrees of consistency in distributed scenarios, including ordered multicast, two-phase commit, and state-machine replication, are then discussed (Part 8). Finally, communication schemes that decouple system functions are discussed, along with the classic end-to-end argument (Part 9).
The text finally explores the property of scalability with large data volumes, and reviews design and implementation techniques for data processing operators, including external sorting, basic relational operators and joins, as well as parallelism (Parts 10, 11, and 12, respectively).

We hope you enjoy your readings!
Learning Goals

Knowledge

Skills

Competences
Source List

D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), pp. 85-98, 1992. doi:10.1145/129888.129894

F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22(4), pp. 299-319, 1990. doi:10.1145/98163.98167
OVERVIEW
Although the number of potential abstractions for computer system components is unlimited, remarkably the vast majority that actually appear in practice fall into one of three well-defined classes: the memory, the interpreter, and the communication link. These three abstractions are so fundamental that theoreticians compare computer algorithms in terms of the number of data items they must remember, the number of steps their interpreter must execute, and the number of messages they must communicate. Designers use these three abstractions to organize physical hardware structures, not because they are the only ways to interconnect gates, but rather because
To meet the many requirements of different applications, system designers build layers on this fundamental base, but in doing so they do not routinely create completely different abstractions. Instead, they elaborate the same three abstractions, rearranging and repackaging them to create features that are useful and interfaces that are convenient for each application. Thus, for example, the designer of a general-purpose system such as a personal computer or a network server develops interfaces that exhibit highly refined forms of the same three abstractions. The user, in turn, may see the memory in the form of an organized file or database system, the interpreter in the form of a word processor, a game-playing system, or a high-level programming language, and the communication link in the form of instant messaging or the World Wide Web. On examination, underneath each of these abstractions is a series of layers built on the basic hardware versions of those same abstractions.
A primary method by which the abstract components of a computer system interact is reference. What that means is that the usual way for one component to connect to another is by name. Names appear in the interfaces of all three of the fundamental abstractions as well as the interfaces of their more elaborate higher-layer counterparts. The memory stores and retrieves objects by name, the interpreter manipulates named objects, and names identify communication links. Names are thus the glue that interconnects the abstractions. Named interconnections can, with proper design, be easy to change. Names also allow the sharing of objects, and they permit finding previously created objects at a later time.
This chapter briefly reviews the architecture and organization of computer systems in the light of abstraction, naming, and layering. Some parts of this review will be familiar to the reader with a background in computer software or hardware, but the systems perspective may provide some new insights into those familiar concepts, and it lays the foundation for coming chapters. Section 2.1 describes the three fundamental abstractions, Section 2.2 presents a model for naming and explains how names are used in computer systems, and Section 2.3 discusses how a designer combines the abstractions, using names and layers, to create a typical computer system, presenting the file system as a concrete example of the use of naming and layering for the memory abstraction. Section 2.4 looks at how the rest of this book will consist of designing some higher-level version of one or more of the three fundamental abstractions, using names for interconnection and built up in layers. Section 2.5 is a case study showing how abstractions, naming, and layering are applied in a real file system.

2.1 The Three Fundamental Abstractions
2.1.1 Memory
Memory, sometimes called storage, is the system component that remembers data values for use in computation. Although memory technology is wide-ranging, as suggested by the list of examples in Figure 2.1, all memory devices fit a simple abstract model that has two operations, named WRITE and READ:

    write(name, value)
    value ← read(name)

The WRITE operation specifies in value a value to be remembered and in name a name by which one can recall that value in the future. The READ operation specifies in name the name of some previously remembered value, and the memory device returns that value. A later call to WRITE that specifies
the same name updates the value associated with that name.

FIGURE 2.1 Some examples of memory devices that may be familiar. Hardware memory devices: RAM chip, flash memory, magnetic tape, magnetic disk, CD-R and DVD-R. Higher-level memory systems: RAID, file system, database management system.

Memories can be either volatile or non-volatile. A volatile memory is one whose mechanism of retaining information consumes energy; if its power supply is interrupted for some reason, it forgets its information content. When one turns off the power to a non-volatile memory (sometimes called "stable storage"), it retains its content, and when power is again available, read operations return the same values as before. By connecting a volatile memory to a battery or an uninterruptible power supply, it can be made durable, which means that it is designed to remember things for at least some specified period, known as its durability. Even non-volatile memory devices are subject to eventual deterioration, known as decay, so they usually also have a specified durability, perhaps measured in years. We will revisit durability in Chapters 8 [on-line] and 10 [on-line], where we will see methods of obtaining different levels of durability. Sidebar 2.1 compares the meaning of durability with two other, related words.

Sidebar 2.1: Thus, the current chapter suggests that files be placed in a durable storage medium, that is, they should survive system shutdown and remain intact for as long as they are needed. Chapter 8 [on-line] revisits durability specifications and classifies applications according to their durability requirements. This chapter introduces the concept of stable bindings for names, which, once determined, never again change.
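The two-operation model is small enough to capture directly in code. The following Python sketch is illustrative only (the dictionary-backed store and the class name are assumptions, not the book's design); it shows the interface, including the rule that a later WRITE to the same name replaces the association:

    # A minimal sketch of the two-operation memory abstraction.
    # The backing dict is an illustrative assumption; real devices
    # name fixed-size cells, not arbitrary Python values.

    class Memory:
        def __init__(self):
            self._cells = {}            # name -> remembered value

        def write(self, name, value):
            """Remember value under name; a later WRITE to the
            same name replaces the association."""
            self._cells[name] = value

        def read(self, name):
            """Return the value most recently written under name."""
            return self._cells[name]

    mem = Memory()
    mem.write("x", 42)
    assert mem.read("x") == 42
    mem.write("x", 7)                   # later WRITE updates the value
    assert mem.read("x") == 7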
At the physical level, a memory system does not normally name, read, or write values of arbitrary size. Instead, hardware layer memory devices read and write contiguous arrays of bits, usually fixed in length, known by various terms such as bytes (usually 8 bits, but one sometimes encounters architectures with 6-, 7-, or 9-bit bytes), words (a small integer number of bytes, typically 2, 4, or 8), lines (several words), and blocks (a number of bytes, usually a power of 2, that can measure in the thousands). Whatever the size of the array, the unit of physical layer memory written or read is known as a memory (or storage) cell. In most cases, the name argument in the READ and WRITE calls is actually the name of a cell. Higher-layer memory systems also read and write contiguous arrays of bits, but these arrays usually can be of any convenient length, and are called by terms such as record, segment, or file.
Before-or-after atomicity means that the result of every READ or WRITE is as if that READ or WRITE occurred either completely before or completely after any other READ or WRITE. Although it might seem that a designer should be able simply to assume these two properties, that assumption is risky and often wrong. There are a surprising number of threats to read/write coherence and before-or-after atomicity:

■ Concurrency. In systems where different actors can perform READ and WRITE operations concurrently, they may initiate two such operations on the same named cell at about the same time. There needs to be some kind of arbitration that decides which one goes first and to ensure that one operation completes before the other begins.
■ Remote storage. When the memory device is physically distant, the same concerns arise, but they are amplified by delays, which make the question of "which WRITE was most recent?" problematic, and by additional forms of failure introduced by the communication link.
■ Cell size incommensurate with value size. A large value may occupy multiple memory cells, in which case before-or-after atomicity requires special attention. The problem is that both reading and writing of a multiple-cell value is usually done one cell at a time. A reader running concurrently with a writer that is updating the same multiple-cell value may end up with a mixed bag of cells, only some of which have been updated. Computer architects call this hazard write tearing. Failures that occur in the middle of writing multiple-cell values can further complicate the situation. To restore before-or-after atomicity, concurrent readers and writers must somehow be coordinated, and a failure in the middle of an update must leave either all or none of the intended update intact. When these conditions are met, the READ or WRITE is said to be atomic. A closely related risk arises when a small value shares a memory cell with other small values. The risk is that if two writers concurrently update different values that share the same cell, one may overwrite the other's update. Atomicity can also solve this problem. Chapter 5 begins the study of atomicity by exploring methods of coordinating concurrent activities. Chapter 9 [on-line] expands the study of atomicity to also encompass failures.
Often, the designer of a system must cope with not just one but several of these threats simultaneously. The combination of replication and remoteness is particularly challenging. It can be surprisingly difficult to design memories that are both efficient and also read/write coherent and atomic. To simplify the design or achieve higher performance, designers sometimes build memory systems that have weaker coherence specifications. For example, a multiple processor system might specify: "The result of a READ will be the value of the latest WRITE if that WRITE was performed by the same processor." There is an entire literature of "data consistency models" that explores the detailed properties of different memory coherence specifications. In a layered memory system, it is essential that the designer of a layer know precisely the coherence and atomicity specifications of any lower layer memory that it uses. In turn, if the layer being designed provides memory for higher layers, the designer must specify precisely these two properties that higher layers can expect and depend on. Unless otherwise mentioned, we will assume that physical memory devices provide read/write coherence for individual cells, but that before-or-after atomicity for multicell values (for example, files) is separately provided by the layer that implements them.
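To make the write-tearing hazard and one remedy concrete, here is a hedged Python sketch; the lock-based coordination is an illustrative mechanism chosen for brevity, not the book's own protocol:

    # Sketch: before-or-after atomicity for a multiple-cell value.
    # A value spanning several cells is written one cell at a time,
    # so a concurrent reader may observe a torn (partially updated)
    # value unless reads and writes are coordinated.

    import threading

    cells = ["old", "old", "old"]       # a three-cell value
    guard = threading.Lock()

    def write_value(new):
        with guard:                     # readers see all cells updated or none
            for i, part in enumerate(new):
                cells[i] = part

    def read_value():
        with guard:                     # every WRITE appears completely
            return list(cells)

    writer = threading.Thread(target=write_value, args=(["new"] * 3,))
    writer.start()
    snapshot = read_value()             # either all "old" or all "new"
    writer.join()
    assert snapshot in (["old"] * 3, ["new"] * 3)

Without the lock, the reader could legitimately observe ["new", "old", "old"], which is exactly the mixed bag of cells the text describes.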
Sidebar 2.2 How Magnetic Disks Work Magnetic disks consist of rotating circular platters coated on both sides with a magnetic material such as ferric oxide. An electromagnet called a disk head records information by aligning the magnetic field of the particles in a small region on the platter's surface. The same disk head reads the data by sensing the polarity of the aligned particles as the platter spins by. The disk spins continuously at a constant rate, and the disk head actually floats just a few nanometers above the disk surface on an air cushion created by the rotation of the platter.

From a single position above a platter, a disk head can read or write a set of bits, called a track, located a constant distance from the center. In the top view below, the shaded region identifies a track. Tracks are formatted into equal-sized blocks, called sectors, by writing separation marks periodically around the track. Because all sectors are the same size, the outer tracks have more sectors than the inner ones.

A typical modern disk module, known as a "hard drive" because its platters are made of a rigid material, contains several platters spinning on a common axis called a spindle, as in the side view above. One disk head per platter surface is mounted on a comb-like structure that moves the heads in unison across the platters. Movement to a specific track is called seeking, and the comb-like structure is known as a seek arm. The set of tracks that can be read or written when the seek arm is in one position (for example, the shaded regions of the side view) is called a cylinder. Tracks, platters, and sectors are each numbered. A sector is thus addressed by geometric coordinates: track number, platter number, and rotational position. Modern disk controllers typically do the geometric mapping internally and present their clients with an address space consisting of consecutively numbered sectors.

To read or write a particular sector, the disk controller first seeks the desired track. Once the seek arm is in position, the controller waits for the beginning of the desired sector to rotate under the disk head, and then it activates the head on the desired platter. Physically encoding digital data in analog magnetic domains usually requires that the controller write complete sectors.

The time required for disk access is called latency, a term defined more precisely in Chapter 6. Moving a seek arm takes time. Vendors quote seek times of 5 to 10 milliseconds, but that is an average over all possible seek arm moves. A move from one cylinder to the next may require only 1/20 of the time of a move from the innermost to the outermost track. It also takes time for a particular sector to rotate under the disk head. A typical disk rotation rate is 7200 rpm, for which the platter rotates once in 8.3 milliseconds. The time to transfer the data depends on the magnetic recording density, the rotation rate, the cylinder number (outer cylinders may transfer at higher rates), and the number of bits read or written. A platter that holds 40 gigabytes transfers data at rates between 300 and 600 megabits per second; thus a 1-kilobyte sector transfers in roughly 15 to 30 microseconds. Seek time and rotation delay are limited by mechanical engineering considerations and tend to improve only slowly, but magnetic recording density depends on materials technology, which has improved both steadily and rapidly for many years.
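A back-of-the-envelope calculation with the sidebar's numbers shows why the mechanical delays dominate. The figures below are illustrative mid-range values, not a specific drive:

    # Rough disk access time from the sidebar's numbers.
    seek_ms = 7.5                        # average seek, 5-10 ms quoted
    rotation_ms = 8.3 / 2                # half a revolution at 7200 rpm, on average
    sector_bits = 1024 * 8               # a 1-kilobyte sector
    transfer_ms = sector_bits / 450e6 * 1000   # ~450 megabits/s mid-range rate

    total_ms = seek_ms + rotation_ms + transfer_ms
    print(f"seek {seek_ms} ms + rotation {rotation_ms:.2f} ms "
          f"+ transfer {transfer_ms:.3f} ms = {total_ms:.2f} ms")
    # Seek plus rotation cost ~12 ms; the transfer itself is only
    # tens of microseconds, a thousand times smaller.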
Early disk systems stored between 20 and 80 megabytes. In the 1970s Kenneth Haughton, an IBM inventor, described a new technique of placing disk platters in a sealed enclosure to avoid contamination. The initial implementation stored 30 megabytes on each of two spindles, in a configuration known as a 30-30 drive. Haughton nicknamed it the "Winchester", after the Winchester 30-30 rifle. The code name stuck, and for many years hard drives were known as Winchester drives. Over the years, Winchester drives have gotten physically smaller while simultaneously evolving to larger capacities.
a delay that is a thousand times longer, waiting for that second sector to again rotate under the read head. Thus the maximum rate at which one can transfer data to or from a disk is dramatically larger than the rate one would achieve when choosing sectors at random. A random access memory (RAM) is one for which the latency for memory cells chosen at random is approximately the same as the latency for cells chosen in the pattern best suited for that memory device. An electronic memory chip is usually configured for random access. Memory devices that involve mechanical movement, such as optical disks (CDs and DVDs) and magnetic tapes and disks, are not.

For devices that do not provide random access, it is usually a good idea, having paid the cost in delay of moving the mechanical components into position, to read or write a large block of data. Large-block READ and WRITE operations are sometimes relabeled GET and PUT, respectively, and this book uses that convention. Traditionally, the unqualified term memory meant random-access volatile memory, and the term storage was used for non-volatile memory that is read and written in large blocks with GET and PUT. In practice, there are enough exceptions to this naming rule that the words "memory" and "storage" have become almost interchangeable.
FIGURE 2.2 An associative memory implemented in two layers. The associativity layer maps the unconstrained names of its arguments to the consecutive integer addresses required by the physical layer location-addressed memory.
making identical copies of the data on different disks. Exercise 8.8 [on-line] explores a simple but elegant performance optimization known as RAID 5. These and several other RAID configurations were originally described in depth in a paper by Randy Katz, Garth Gibson, and David Patterson, who also assigned the traditional numbers to the different configurations [see Suggestions for Further Reading 10.2.2].
2.1.2 Interpreters
Interpreters are the active elements of a computer system; they perform the actions that constitute computations. Figure 2.4 lists some examples of interpreters that may be familiar. As with memory, interpreters also come in a wide range of physical manifestations. However, they too can be described with a simple abstraction, consisting of just three components:

1. An instruction reference, which tells the interpreter where to find its next instruction

2. A repertoire, which defines the set of actions the interpreter is prepared to perform when it retrieves an instruction from the location named by the instruction reference

3. An environment reference, which tells the interpreter where to find its environment, the current state on which the interpreter should perform the action of the current instruction
1 procedure INTERPRET()
2     do forever
3         instruction ← READ(instruction_reference)
4         perform instruction in the context of environment_reference
5         if interrupt_signal = TRUE then
6             instruction_reference ← entry point of interrupt_handler
7             environment_reference ← environment reference of interrupt_handler
FIGURE 2.5 Structure of, and pseudocode for, an abstract interpreter. Solid arrows show control flow, and dashed arrows suggest information flow. Sidebar 2.3 describes this book's conventions for expressing pseudocode.
Sidebar 2.3 Representation: Pseudocode and Messages This book presents many examples of program fragments. Most of them are represented in pseudocode, an imaginary programming language that adopts familiar features from different existing programming languages as needed and that occasionally intersperses English text to characterize some step whose exact detail is unimportant. The pseudocode has some standard features, several of which this brief example shows.

The line numbers on the left are not part of the pseudocode; they are there simply to allow the text to refer to lines in the program. Procedures are explicitly declared (as in line 1), and indentation groups blocks of statements together. Program variables are set in italic, program key words in bold, and literals such as the names of procedures and built-in constants in small caps. The left arrow denotes substitution or assignment (line 3) and the symbol "=" denotes equality in conditional expressions. The double slash precedes comments that are not part of the pseudocode. Various forms of iteration (while, until, for each, do occasionally), conditionals (if), set operations (is in), and case statements (do case) appear when they are helpful in expressing an example. The construction for j from 0 to 3 iterates four times; array indices start at 0 unless otherwise mentioned. The construction y.x means the element named x in the structure named y. To minimize clutter, the pseudocode omits declarations wherever the meaning is reasonably apparent from the context. Procedure parameters are passed by value unless the declaration reference appears. Section 2.2.1 of this chapter discusses the distinction between use by value and use by reference. When more than one variable uses the same structure, the declaration structure_name instance variable_name may be used.

The notation a(11...15) denotes extraction of bits 11 through 15 from the string a (or from the variable a considered as a string). Bits are numbered left to right starting with zero, with the most significant bit of integers first (using big-endian notation, as described in Sidebar 4.3). The + operator, when applied to strings, concatenates the strings.
2.1.2.1 Processors

A general-purpose processor is an implementation of an interpreter. For purposes of concrete discussion throughout this book, we use a typical reduced instruction set processor. The processor's instruction reference is a program counter, stored in a fast memory register inside the processor. The program counter contains the address of the memory location that stores the next instruction of the current program. The environment reference of the processor consists in part of a small amount of built-in location-addressed memory in the form of named (by number) registers for fast access to temporary results of computations.
Our general-purpose processor may be directly wired to a memory, which is also part of its environment. The addresses in the program counter and in instructions are then names in the address space of that memory, so this part of the environment reference is wired in and unchangeable. When we discuss virtualization in Chapter 5, we will extend the processor to refer to memory indirectly via one or more registers. With that change, the environment reference is maintained in those registers, thus allowing addresses issued by the processor to map to different names in the address space of the memory.
The repertoire of our general-purpose processor includes instructions for expressing computations such as adding two numbers (ADD), subtracting one number from another (SUB), comparing two numbers (CMP), and changing the program counter to the address of another instruction (JMP). These instructions operate on values stored in the named registers of the processor; their short mnemonic names are colloquially called "op-codes".
The repertoire also includes instructions to move data between processor registers and memory. To distinguish program instructions from memory operations, we use the name LOAD for the instruction that reads a value from a named memory cell into a register of the processor and STORE for the instruction that writes the value from a register into a named memory cell. These instructions take two integer arguments, the name of a memory cell and the name of a processor register.
The general-purpose processor provides a stack, a push-down data structure that is stored in memory and used to implement procedure calls. When calling a procedure, the caller pushes arguments of the called procedure (the callee) on the stack. When the callee returns, the caller pops the stack back to its previous size. This implementation of procedures supports recursive calls because every invocation of a procedure always finds its arguments at the top of the stack. We dedicate one register for implementing stack operations efficiently. This register, known as the stack pointer, holds the memory address of the top of the stack.
As part of interpreting an instruction, the processor increments the program counter so that, when that instruction is complete, the program counter contains the address of the next instruction of the program. If the instruction being interpreted is a JMP, that instruction loads a new value into the program counter. In both cases, the flow of instruction interpretation is under control of the running program.
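Even a toy interpreter exhibits the three components. In the following Python sketch, the instruction encoding and register names are invented for illustration (they are not the book's): the program counter is the instruction reference, the dispatch on op is the repertoire, and the registers plus memory form the environment.

    # A toy fetch-execute loop. Instructions and data share one memory,
    # as in the directly wired processor described above.
    memory = {0: ("load", "r0", 100),    # r0 <- M[100]
              1: ("load", "r1", 101),    # r1 <- M[101]
              2: ("add", "r0", "r1"),    # r0 <- r0 + r1
              3: ("store", "r0", 102),   # M[102] <- r0
              4: ("halt",),
              100: 2, 101: 3}
    registers = {"r0": 0, "r1": 0}       # environment: named registers
    pc = 0                               # instruction reference

    while True:
        op, *args = memory[pc]
        pc += 1                          # increment before executing, as in the text
        if op == "load":
            registers[args[0]] = memory[args[1]]
        elif op == "store":
            memory[args[1]] = registers[args[0]]
        elif op == "add":
            registers[args[0]] += registers[args[1]]
        elif op == "jmp":
            pc = args[0]                 # JMP loads a new program counter value
        elif op == "halt":
            break

    assert memory[102] == 5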
The processor also implements interrupts. An interrupt can occur because the processor has detected some problem with the running program (e.g., the program attempted to execute an instruction that the interpreter does not or cannot
implement, such as dividing by zero). An interrupt can also occur because a signal arrives from outside the processor, indicating that some external device needs attention (e.g., the keyboard signals that a key press is available). In the first case, the interrupt mechanism may transfer control to an exception handler elsewhere in the program. In the second case, the interrupt handler may do some work and then return control to the original program. We shall return to the subject of interrupts and the distinction between interrupt handlers and exception handlers in the discussion of threads in Chapter 5.
In addition to general-purpose processors, computer systems typically also have special-purpose processors, which have a limited repertoire. For example, a clock chip is a simple, hard-wired interpreter that just counts: at some specified frequency, it executes an ADD instruction, which adds 1 to the contents of a register or memory location that corresponds to the clock. All processors, whether general-purpose or specialized, are examples of interpreters. However, they may differ substantially in the repertoire they provide. One must consult the device manufacturer's manual to learn the repertoire.
FIGURE 2.6 The model for a layered interpreter. Each layer interface, shown as a dashed line, represents an abstraction barrier, across which an upper layer procedure requests execution of instructions from the repertoire of the lower layer. The lower layer procedure typically implements an instruction by performing several instructions from the repertoire of a next lower layer interface.
[Figure: an example layered interpreter. A human user generating requests sits above a Java language layer interface, which in turn sits above a machine language layer interface; a typical instruction across the upper interface is NEXTCH.]
One goal in the design of a layered interpreter is to ensure that the designer of each layer can be confident that the layer below either completes each instruction successfully or does nothing at all. Half-finished instructions should never be a concern, even if there is a catastrophic failure. That goal is another example of atomicity, and achieving it is relatively difficult. For the moment, we simply assume that interpreters are atomic, and we defer the discussion of how to achieve atomicity to Chapter 9 [on-line].
2.1.3 Communication Links

The SEND operation specifies an array of bits, called a message, to be sent over the communication link identified by link_name (for example, a wire). The argument outgoing_message_buffer identifies the message to be sent, usually by giving the address and size of a buffer in memory that contains the message. The RECEIVE operation accepts an incoming message, again usually by designating the address and size of a buffer in memory to hold the incoming message. Once the lowest layer of a system has received a message, higher layers may acquire the message by calling a RECEIVE interface of the lower layer, or the lower layer may "upcall" to the higher layer, in which case the interface might be better characterized as DELIVER(incoming_message).
Names connect systems to communication links in two different ways. First, the link_name arguments of SEND and RECEIVE identify one of possibly several available communication links attached to the system. Second, some communication links are actually multiply-attached networks of links, and some additional method is needed to name which of several possible recipients should receive the message. The name of the intended recipient is typically one of the components of the message.

FIGURE 2.8 Some examples of communication links. Hardware technology: twisted pair, coaxial cable, optical fiber. Higher level: Ethernet, Universal Serial Bus (USB), the Internet, the telephone system, a UNIX pipe.

At first glance, it might appear that sending and receiving a message is just an example of copying an array of bits from one memory to another memory over a wire using a sequence of READ and WRITE operations,
Programs that invoke SEND and RECEIVE must take these different semantics explicitly into account. On the other hand, some communication link implementations do provide a layer that does its best to hide a SEND/RECEIVE interface behind a READ/WRITE interface.
Just as with memory and interpreters, designers organize and implement communication links in layers. Rather than continuing a detailed discussion of communication links here, we defer that discussion to Section 7.2 [on-line], which describes a three-layer model that organizes communication links into systems called networks. Figure 7.18 [on-line] illustrates this three-layer network model, which comprises a link layer, a network layer, and an end-to-end layer.
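As a concrete sketch of the SEND/RECEIVE interface (illustrative only; a Python queue stands in for the physical link, and the names links, send, and receive are assumptions, not a real networking API):

    # Sketch of the send/receive abstraction over an in-process "link".
    # link_name selects among several named, attached links.

    import queue

    links = {"wire0": queue.Queue()}     # named communication links

    def send(link_name, outgoing_message_buffer):
        links[link_name].put(bytes(outgoing_message_buffer))

    def receive(link_name):
        return links[link_name].get()    # blocks until a message arrives

    send("wire0", b"hello")
    incoming_message = receive("wire0")
    assert incoming_message == b"hello"

Note that, unlike a memory READ, RECEIVE returns only when a message arrives, if one arrives at all; this is one of the semantic differences the surrounding text alludes to.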
■ create a copy of the component object and include the copy in the using object (use by value), or

■ choose a name for the component object and include just that name in the using object (use by reference). The component object is said to export the name.
Chapter 2

Note that the course assignments will be instrumental in achieving the goals in this as well as other parts of the course.
example, the client and service have to construct messages and convert numbers into bit strings and the like. Programming these conversions is tedious and error prone. Stubs remove this burden from the programmer (see Figure 4.7). A stub is a procedure that hides the marshaling and communication details from the caller and callee. An RPC system can use stubs as follows. The client module invokes a remote procedure, say GET_TIME, in the same way that it would call any other procedure. However, GET_TIME is actually just the name of a stub procedure that runs inside the client module (see Figure 4.8). The stub marshals the arguments of a call into a message, sends the message, and waits for a response. On arrival of the response, the client stub unmarshals the response and returns to the caller.
Similarly, a service stub waits for a message, unmarshals the arguments, and calls the procedure that the client requests (GET_TIME in the example). After the procedure returns, the service stub marshals the results of the procedure call into a message and sends it in a response to the client stub.
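A minimal sketch of the stub pattern in Python may help. JSON stands in for the marshaling format and a direct function call stands in for the network; both are illustrative assumptions, not the book's design:

    # Client and service stubs for a GET_TIME remote procedure.
    import json, time

    def get_time_service(request_message):        # service stub
        args = json.loads(request_message)        # unmarshal the arguments
        result = time.time() if args["proc"] == "get_time" else None
        return json.dumps({"result": result})     # marshal the response

    def get_time():                               # client stub
        request = json.dumps({"proc": "get_time", "args": []})
        response = get_time_service(request)      # "send" and wait for the reply
        return json.loads(response)["result"]     # unmarshal and return

    now = get_time()    # to the caller, this looks like a local procedure call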
FIGURE 4.7 Implementation of a remote procedure call using stubs. The stubs hide all remote communication from the caller and callee.

FIGURE 4.8 GET_TIME client and service using stubs.

Writing stubs that convert more complex objects into an appropriate on-wire representation becomes quite tedious. Some high-level programming languages such as Java can generate these stubs automatically from an interface specification [Suggestions for Further Reading 4.1.3], simplifying client/service programming even further. Figure 4.9 shows the client for such an RPC system. The RPC system would generate a procedure similar to the GET_TIME stub in Figure 4.8. The client program of Figure 4.9 looks almost identical to the one using a local procedure call on page 149,
except that it handles an additional error because remote procedure calls are not identical to procedure calls (as discussed below). The procedure that the service calls on line 7 is just the original procedure GET_TIME on page 149.

Whether a system uses RPC with automatic stub generation is up to the implementers. For example, some implementations of Sun's Network File System (see Section 4.5) use automatic stub generation, but others do not.
example, the programmer probably would not think of setting an interval timer when invoking SQRT(x), even though SQRT internally has a successive-approximation loop that, if programmed wrong, might not terminate.

But now consider calling SQRT with an RPC. An interval timer suddenly becomes essential because the network between client and service can lose a message, or the other computer can crash independently. To avoid fate sharing, the RPC programmer must adjust the code to prepare for and handle this failure. When the client receives a "service failure" signal, the client may be able to recover by, for example, trying a different service or choosing an alternative algorithm that doesn't use a remote service.
The second difference between ordinary procedure calls and RPCs is that RPCs introduce a new failure mode, the "no response" failure. When there is no response from a service, the client cannot tell which of two things went wrong: (1) some failure occurred before the service had a chance to perform the requested action, or (2) the service performed the action and then a failure occurred, causing just the response to be lost.
Most RPC designs handle the no-response case by choosing one of three implementation strategies:

■ At-least-once RPC. If the client stub doesn't receive a response within some specific time, the stub resends the request as many times as necessary until it receives a response from the service (see the sketch after this list). This implementation may cause the service to execute a request more than once. For applications that call SQRT, executing the request more than once is harmless because with the same argument SQRT should always produce the same answer. In programming language terms, the SQRT service has no side effects. Such side-effect-free operations are also idempotent: repeating the same request or sequence of requests several times has the same effect as doing it just once. An at-least-once implementation does not provide the guarantee implied by its name. For example, if the service was located in a building that has been blown away by a hurricane, retrying doesn't help. To handle such cases, an at-least-once RPC implementation will give up after some number of retries. When that happens, the request may have been executed more than once or not at all.
■ At-most-once RPC. If the client stub doesn't receive a response within some specific time, then the client stub returns an error to the caller, indicating that the service may or may not have processed the request. At-most-once semantics may be more appropriate for requests that do have side effects. For example, in a banking application, using at-least-once semantics for a request to transfer $100 from one account to another could result in multiple $100 transfers. Using at-most-once semantics assures that either zero or one transfers take place, a somewhat more controlled outcome. Implementing at-most-once RPC is harder than it sounds because the underlying network may duplicate the request message without the client stub's knowledge. Chapter 7 [on-line] describes an at-most-once implementation, and Birrell and Nelson's paper gives another.
■ Exactly-once RPC. These semantics are the ideal, but because the client and service are independent it is in principle impossible to guarantee. As in the case of at-least-once, if the service is in a building that was blown away by a hurricane, the best the client stub can do is return error status. On the other hand, by adding the complexity of extra message exchanges and careful record keeping, one can approach exactly-once semantics closely enough to satisfy some applications. The general idea is that, if the RPC requesting transfer of $100 from account A to B produces a "no response" failure, the client stub sends a separate RPC request to the service to ask about the status of the request that got no response. This solution requires that both the client and the service stubs keep careful records of each remote procedure call request and response. These records must be fault tolerant because the computer running the service might fail and lose its state between the original RPC and the inquiry to check on the RPC's status. Chapters 8 [on-line] through 10 [on-line] introduce the necessary techniques.
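Here is the hedged sketch of an at-least-once client stub promised above. The function transport_send_and_wait and its coin-flip failure simulation are invented placeholders, not a real transport API:

    # Sketch: at-least-once RPC. Resend on timeout; give up after a
    # bounded number of retries, at which point the request may have
    # been executed more than once or not at all.

    import random

    def transport_send_and_wait(request, timeout):
        # Placeholder: pretend the network drops about half the messages.
        return "response" if random.random() < 0.5 else None

    def at_least_once_call(request, retries=5, timeout=1.0):
        for _ in range(retries):
            response = transport_send_and_wait(request, timeout)
            if response is not None:
                return response            # the service may have run it several times
        raise RuntimeError("service failure")  # zero or more executions happened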
The programmer must be aware that RPC semantics differ from those of ordinary procedure calls, and because different RPC systems handle the no-response case in different ways, it is important to understand just which semantics any particular RPC system tries to provide. Even if the name of the implementation implies a guarantee (e.g., at-least-once), we have seen that there are cases in which the implementation cannot deliver it. One cannot simply take a collection of legacy programs and arbitrarily separate the modules with RPC. Some thought and reprogramming is inevitably required. Problem set 2 explores the effects of different RPC semantics in the context of a simple client/service application.
The third difference is that calling a local procedure typically takes much less time than making a remote procedure call. For example, invoking a remote SQRT is likely to be more expensive than the computation for SQRT itself because the overhead of a remote procedure call is much higher than the overhead of following the procedure calling conventions. To hide the cost of a remote procedure call, a client stub may deploy various performance-enhancing techniques (see Chapter 6), such as caching results and pipelining requests (as is done in the X Window System of Sidebar 4.4). These techniques increase complexity and can introduce new problems (e.g., how to ensure that the cache at the client stays consistent with the one at the service). The performance difference between procedure calls and remote procedure calls requires the designer to consider carefully which procedure calls should be remote ones and which ones should be ordinary, local procedure calls.
A final difference between procedure calls and RPCs is that some programming language features don't combine well with RPC. For example, a procedure that communicates with another procedure through global variables cannot typically be executed remotely because separate computers usually have separate address spaces. Similarly, other language constructs that use explicit addresses won't work. Arguments consisting of data structures that contain pointers, for example, are a problem because pointers to objects in the client computer are local addresses that have different bindings when resolved in the service computer. It is possible to design systems that use global references for objects that are passed by reference to remote procedure calls, but such references require significant additional machinery and introduce new problems. For example, a new plan is needed for determining whether an object can be deleted locally because a remote computer might still have a reference to the object. Solutions exist, however; see, for example, the article on Network Objects [Suggestions for Further Reading 4.1.2].
Since RPCs don't provide the same semantics as procedure calls, the word "procedure" in "remote procedure call" can be misleading. Over the years the concept of RPC has evolved from its original interpretation as an exact simulation of an ordinary procedure call to instead mean any client/service interaction in which the request is followed by a response. This text uses this modern interpretation.
Chapter 3
OVERVIEW
The specification of a computer system typically includes explicit (or implicit) performance goals. For example, the specification may indicate how many concurrent users the system should be able to support. Typically, the simplest design fails to meet these goals because the design has a bottleneck, a stage in the computer system that takes longer to perform its task than any of the other stages. To overcome bottlenecks, the system designer faces the task of creating a design that performs well, yet is simple and modular.
This chapter describes techniques to avoid or hide performance bottlenecks. Section 6.1 presents ways to identify bottlenecks and the general approaches to handle them, including exploiting workload properties, concurrent execution of operations, speculation, and batching. Section 6.2 examines specific versions of the general techniques to attack the common problem of implementing multilevel memory systems efficiently. Section 6.3 presents scheduling algorithms for services to choose which request to process first, if there are several waiting for service.
for each individual client request, and has little overhead so that it can serve many clients. As we will see, it is impossible to maximize all of these goals simultaneously, and thus a designer must make trade-offs. Trade-offs may favor one class of requests over another and may result in bottlenecks for the unfavored classes of requests.
Designing for performance creates two major challenges in computer systems. First, one must consider the benefits of optimization in the context of technology improvements. Some bottlenecks are intrinsic ones; they require careful thinking to ensure that the system runs faster than the performance of the slowest stage. Some bottlenecks are technology dependent; time may eliminate these, as technology improves. Unfortunately, it is sometimes difficult to decide whether or not a bottleneck is intrinsic. Not uncommonly, a performance optimization for the next product release is irrelevant by the time the product ships because technology improvements have removed the bottleneck completely. This phenomenon is so common in computer design that it has led to formulation of the design hint: when in doubt use brute force. Sidebar 6.1 discusses this hint.
Sidebar 6.1 Design Hint: When in Doubt Use Brute Force This chapter describes a few design hints that help a designer resolve trade-offs in the face of limits. These design hints are hints because they often guide the designer in the right direction, but sometimes they don't. In this book we cover only a few, but the interested reader should digest Hints for Computer System Design by B. Lampson, which presents many more practical guidelines in the form of hints [Suggestions for Further Reading 1.5.4].

The design hint "when in doubt use brute force" is a direct corollary of the d(technology)/dt curve (see Section 1.4). Given computing technology's historical rate of improvement, it is typically wiser to choose simple algorithms that are well understood rather than complex, badly characterized algorithms. By the time the complex algorithm is fully understood, implemented, and debugged, new hardware might be able to execute the simple algorithm fast enough. Thompson and Ritchie used a fixed-size table of processes in the UNIX system and searched the table linearly because a table was simple to implement and the number of processes was small. With Joe Condon, Thompson also built the Belle chess machine that relied mostly on special-purpose hardware to search many positions per second rather than on sophisticated algorithms. Belle won the world computer chess championships several times in the late 1970s and early 1980s and achieved an ELO rating of 2250. (ELO is a numerical rating system used by the World Chess Federation (FIDE) to rank chess players; a rating of 2250 makes one a strong competitive player.) Later, as technology marched on, programs that performed brute-force searching algorithms on an off-the-shelf PC conquered the world computer chess championships. As of August 2005, the Hydra supercomputer (64 PCs, each with a chess coprocessor) is estimated by its creators to have an ELO rating of 3200, which is better than that of the best human player.
A second challenge in designing for performance is maintaining the simplicity of the design. For example, if the design uses different devices with approximately the same high-level function but radically different performance, a challenge is to abstract the devices such that they can be used through a simple uniform interface. In this chapter, we see how a clever implementation of the READ and WRITE interface for memory can transparently extend the effective size of RAM to the size of a magnetic disk.
6.1.1.2 Latency

Latency is the delay between a change at the input to a system and the corresponding change at its output. From the client/service perspective, the latency of a request is the time from issuing the request until the time the response is received from the service.
FIGURE 6.1 A request traversing the stages of a pipeline in a client/service system.

This latency has several components: the latency of sending a message to the service, the latency of processing the request, and the latency of sending a response back. If a task, such as asking a service to perform a request, is a sequence of subtasks, we can think of the complete task as traversing stages of a pipeline, where each stage of the pipeline performs a subtask (see Figure 6.1). In our example, the first stage in the pipeline is sending the request, the second stage is the service digitizing the frame, the third stage is the file service storing the frame, and the final stage is sending a response back to the client.

With this pipeline model in mind, it is easy to see that the latency of a pipeline with stages A and B is greater than or equal to the sum of the latencies for each stage in the pipeline:

    latency(pipeline) ≥ latency(A) + latency(B)
6.1.1.3 Throughput

Throughput is a measure of the rate of useful work done by a service for some given workload of requests. In the camera example, the throughput we might care about is how many frames per second the system can process because it may determine what quality camera we want to buy.

The throughput of a system with pipelined stages is less than or equal to the minimum of the throughput for each stage:

    throughput(pipeline) ≤ min(throughput(stage 1), ..., throughput(stage n))
Again, if the stages are of a single service, passing the request from one stage to another usually adds little overhead and has little impact on total throughput. Thus, for first-order analysis that overhead can be ignored, and the relation is usually close to equality.

Consider a computer system with two stages: one that is able to process data at a rate of 1,000 kilobytes per second and a second one at a rate of 100 kilobytes per second. If the fast stage generates one byte of output for each byte of input, the overall throughput must be less than or equal to 100 kilobytes per second. If there is negligible overhead in passing requests between the two stages, then the throughput of the system is equal to the throughput of the bottleneck stage, 100 kilobytes per second. In this case, the utilization of stage 1 is 10% and that of stage 2 is 100%.
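The same arithmetic in a few lines of Python (the stage names and rates simply restate the example above):

    # Bottleneck throughput and per-stage utilization for the
    # two-stage example.
    stages = {"stage 1": 1000, "stage 2": 100}   # kilobytes/second
    throughput = min(stages.values())            # bounded by the slowest stage
    for name, capacity in stages.items():
        print(name, f"utilization {throughput / capacity:.0%}")
    # stage 1 utilization 10%, stage 2 utilization 100%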
When a stage processes requests serially, the throughput and the latency of a stage are directly related. The average number of requests a stage handles is inversely proportional to the average time to process a single request:

    throughput = 1 / latency

If all stages process requests serially, the average throughput of the complete pipeline is inversely proportional to the average time a request spends in the pipeline. In these pipelines, reducing latency improves throughput, and the other way around.
When a stage processes requests concurrently, as we will see later in this chapter, there is no direct relationship between latency and throughput. For stages that process requests concurrently, an increase in throughput may not lead to a decrease in latency. A useful analogy is pipes through which water flows with a constant velocity. One can have several parallel pipes (or one fatter pipe), which improves throughput but doesn't change latency.
1. Measure the system to find out whether or not a performance enhancement is needed. If performance is a problem, identify which aspect of performance (throughput or latency) is the problem. For multistage pipelines in which stages process requests concurrently, there is no direct relationship between latency and throughput, so improving latency and improving throughput might require different techniques.

2. Measure again, this time to identify the performance bottleneck. The bottleneck may not be in the place the designer expected and may shift from one design iteration to another.

3. Predict the impact of the proposed performance enhancement with a simple back-of-the-envelope model. (We introduce a few simple models in this chapter.) This prediction includes determining where the next bottleneck will be. A quick way to determine the next bottleneck is to unrealistically assume that the planned performance enhancement will remove the current bottleneck and result in a stage with zero latency and infinite throughput. Under this assumption, determine the next bottleneck and calculate its performance. This calculation will result in one of two conclusions:

a. Removing the current bottleneck doesn't improve system performance significantly. In this case, stop iterating, and reconsider the whole design or revisit the requirements. Perhaps the designer can adjust the interfaces between stages with the goal of tolerating costly operations. We will discuss several approaches in the next sections.

b. Removing the current bottleneck is likely to improve the system performance. In this case, focus attention on the bottleneck stage. Consider brute-force methods of relieving the bottleneck stage (e.g., add more memory). Taking advantage of the d(technology)/dt curve may be less expensive than being clever. If brute-force methods won't relieve the bottleneck, be smart. For example, try to exploit properties of the workload or find better algorithms.

4. Measure the new implementation to verify that the change has the predicted impact. If not, revisit steps 1-3 and determine what went wrong.

5. Iterate. Repeat steps 1-5 until the performance meets the required level.
The rest of this chapter introduces various systems approaches to reducing latency and increasing throughput, as well as simple performance models to predict the resulting performance.
Sidebar 6.2 Design Hint: Optimize for the Common Case A cache (see Section 2.1.1.3) is the most common example of optimizing for the most frequent cases. We saw caches in the case study of the Domain Name System (in Section 4.4). As another example, consider a Web browser. Most Web browsers maintain a cache of recently accessed Web pages. This cache is indexed by the name of the Web page (e.g., http://www.Scholarly.edu) and returns the page for that name. If the user asks to view the same page again, then the cache can return the cached copy of the page immediately (a fast path); only the first access requires a trip to the service (a slow path). In addition to improving the user's interactive experience, the cache helps reduce the load on services and the load on the network. Because caches are so effective, many applications use several of them. For example, in addition to caching Web pages, many Web browsers have a cache to store the results of looking up names, such as "www.Scholarly.edu", so that the next request to "www.Scholarly.edu" doesn't require a DNS lookup.

The design of multilevel memory in Section 6.2 is another example of how well a designer can exploit non-uniformity in a workload. Because applications have locality of reference, one can build large and fast memory systems out of a combination of a small but fast memory and a large but slow memory.
works so well that it has led to the design hint optimize for the common case (see Sidebar 6.2).
To evaluate the performance of systems with a fast and slow path, designers typically compute the average latency. If we know the latency of the fast and slow paths, and the frequency with which the system will take the fast path, then the average latency is:
AverageLatency = Frequency_fast × Latency_fast + Frequency_slow × Latency_slow
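The fast-path/slow-path average is easy to check directly; a minimal Python sketch (the function name and the example numbers are ours, not the text's) applies it to a cache with a 90% hit rate:

```python
def average_latency(f_fast, lat_fast, lat_slow):
    """Average latency when the fast path is taken with frequency f_fast."""
    return f_fast * lat_fast + (1 - f_fast) * lat_slow

# e.g., a 90% cache hit rate, 1 ms on the fast path, 100 ms on the slow path:
print(average_latency(0.9, 1.0, 100.0))  # 10.9 ms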
those subtasks in parallel. The method can be applied either within a multiprocessor system or (if the subtasks aren't too entangled) with completely separate computers.
If the processing parallelizes perfectly (i.e., each subtask can run without any coordination with other subtasks and each subtask requires the same amount of work), then this plan can, in principle, speed up the processing by a factor n, where n is the number of subtasks executing in parallel. In practice, the speedup is usually less than n because there is overhead in parallelizing a computation—the subtasks need to communicate with each other, for example, to exchange intermediate results; because the subtasks do not require an equal amount of work; because the computation cannot be executed completely in parallel, so some fraction of the computation must be executed sequentially; or because the subtasks interfere with each other (e.g., they contend for a shared resource such as a lock, a shared memory, or a shared communication network).
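The cost of the sequential fraction is commonly formalized as Amdahl's law; the text does not derive it, but a small sketch under that standard model (the parameters n and s are illustrative) shows how quickly the ideal factor-of-n speedup erodes:

```python
def speedup(n, s):
    """Speedup with n parallel subtasks when a fraction s must run sequentially."""
    return 1.0 / (s + (1.0 - s) / n)

print(speedup(10, 0.05))  # about 6.9, well short of the ideal factor of 10
```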
Consider the processing that a search engine needs to perform in order to respond to a user search query. An early version of Google's search engine—described in more detail in Suggestions for Further Reading 3.2.4—parallelized this processing as follows. The search engine splits the index of the Web up in n pieces, each piece stored on a separate machine. When a front end receives a user query, it sends a copy of the query to each of the n machines. Each machine runs the query against its part of the index and sends the results back to the front end. The front end accumulates the results from the n machines, chooses a good order in which to display them, generates a Web page, and sends it to the user. This plan can give good speedup if the index is large and each of the n machines must perform a substantial, similar amount of computation. It is unlikely to achieve a full speedup of a factor n because there is parallelization overhead (to send the query to the n machines, receive n partial results, and merge them); because the amount of work is not balanced perfectly across the n machines and the front end must wait until the slowest responds; and because the work done by the front end in farming out the query and merging hasn't been parallelized.
Although parallelizing can improve performance, several challenges must be overcome. First, many applications are difficult to parallelize. Applications such as search have exploitable parallelism, but other computations don't split easily into n mostly independent pieces. Second, developing parallel applications is difficult because the programmer must manage the concurrency and coordinate the activities of the different subtasks. As we saw in Chapter 5, it is easy to get this wrong and introduce race conditions and deadlocks. Systems have been developed to make development of parallel applications easier, but they are often limited to a particular domain. The paper by Dean and Ghemawat [Suggestions for Further Reading 6.4.3] provides an example of how the programming and management effort can be minimized for certain stylized applications running in parallel on hundreds of machines. In general, however, programmers must often struggle with threads and locks, or explicit message passing, to obtain concurrency.
Because of these two challenges in parallelizing applications, designers traditionally have preferred to rely on continuous technology improvements to reduce application latency. However, physical and engineering limitations (primarily the problem of heat dissipation) are now leading processor manufacturers away from making processors
faster and toward placing several (and soon, probably, several hundred or even several thousand, as some are predicting [Suggestions for Further Reading 1.6.4]) processors on a single chip. This development means that improving performance by using concurrency will inevitably increase in importance.
FIGURE 6.3
A simple service composed of several stages, with each stage operating concurrently using
threads.
When a designer is faced with such intrinsic limits, the only option is to design systems that hide latency and try to exploit performance dimensions that do follow d(technology)/dt. For example, transmission rates for data networks have improved dramatically, and so if a designer can organize the system such that communication can be overlapped with useful computation and many network requests can be batched into a large request, then the large request can be transferred efficiently. Many Web browsers use this strategy: while a large transfer runs in the background, users can continue browsing Web pages, hiding the latency of the transfer.
A complication in issuing multiple requests asynchronously is that the client must then match the responses with the outstanding requests.
Once the system is organized to have many requests in flight concurrently, a designer may be able to improve throughput further by using interleaving. The idea is to make n instances of the bottleneck stage and run those n instances concurrently (see Figure 6.4). Stage 1 feeds the first request to instance 1, the second request to instance 2, and so on. If the throughput of a single instance is t, then the throughput using interleaving is n × t, assuming enough requests are available to run all instances
FIGURE 6.4
Interleaving requests.
concurrently at full speed and the requests don't interfere with each other. The cost of interleaving is additional copies of the bottleneck stage.
RAID (see Section 2.1.1.4) interleaves several disks to achieve a high aggregate disk throughput. RAID 0 stripes the data across the disks: it stores block 0 on disk 0, block 1 on disk 1, and so on. If requests arrive for blocks on different disks, the RAID controller can serve those requests concurrently, improving throughput. In a similar style one can interleave memory chips to improve throughput. If the current instruction is stored in memory chip 0 and the next one is in memory chip 1, the processor can retrieve them concurrently. The cost of this design is the additional disks and memory chips, but often systems already have several memory chips or disks, in which case the added cost of interleaving can be small in comparison with the performance benefit.
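The RAID 0 striping rule fits in two lines; a minimal sketch, assuming blocks are striped round-robin across n disks (the function name is ours):

```python
def raid0_location(block, n_disks):
    """Map a logical block number to (disk, position-on-disk) under RAID 0."""
    return block % n_disks, block // n_disks

print(raid0_location(5, 4))  # block 5 lives on disk 1, at position 1
```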
*The textbook by Jain is an excellent source to learn about queuing theory and how to reason
about performance in computer systems [Suggestions for Further Reading 1.1.2].
In some constrained cases, where the designer can plan the system so that the capacity just matches the offered load of requests, it is possible to calculate the degree of concurrency necessary to achieve high throughput and the maximum length of the queue needed between stages. For example, suppose we have a processor that performs one instruction per nanosecond using a memory that takes 10 nanoseconds to respond. To avoid having the processor wait for the memory, it must make a memory request 10 instructions in advance of the instruction that needs it. If every instruction makes a request of memory, then by the time the memory responds, the processor will have issued 9 more. To avoid being a bottleneck, the memory therefore must be prepared to serve 10 requests concurrently.
If half of the instructions make a request of memory, then on average there will be five outstanding requests. Thus, a memory that can serve five requests concurrently would have enough capacity to keep up. The maximum length of the queue needed for this case depends on the application's pattern of memory references. For example, if every second instruction makes a memory request, a fixed-size queue of size five is sufficient to ensure that the queue never overflows. If the processor performs five instructions that make memory references followed by five that don't, then a fixed-size queue of size five will work, but the queue length will vary and the throughput will be different. If the requests arrive randomly, the queue can grow, in principle, without limit. If we were to use a memory that can handle 10 requests concurrently for this random pattern of memory references, then the memory would be utilized at 50% of capacity, and the average queue length would be 1/(1 − 0.5) = 2. With this configuration, the processor observes latencies for some memory requests of 20 or more instruction cycles, and it is running much slower than the designer expected. This example illustrates that a designer must understand non-uniform patterns in the references to memory and exploit them to achieve good performance.
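The queuing estimate used in this example can be checked directly; a sketch of the chapter's 1/(1 − utilization) rule (the function name is ours):

```python
def avg_queue_length(utilization):
    """The chapter's estimate for a randomly loaded stage: 1 / (1 - utilization)."""
    return 1.0 / (1.0 - utilization)

print(avg_queue_length(0.5))  # 2.0: the half-utilized memory of the example
print(avg_queue_length(0.9))  # 10.0: queues blow up as utilization nears 1
```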
In many computer systems, the designer cannot plan the offered load that precisely, and thus stages will experience periods of overload. For example, an application may have several threads that become runnable all at the same time and there may not be enough processors available to run them. In such cases, at least occasional overload is inevitable. The significance of overload depends critically on how long it lasts. If the duration is comparable to the service time, then a queue is simply an orderly way to delay some requests for service until a later time when the offered load drops below the capacity of the service. Put another way, a queue handles short bursts of too much demand by time-averaging with adjacent periods when there is excess capacity.
If overload persists over long periods of time, the system designer has only two choices:
1. Increase the capacity of the system. If the system must meet the offered load, one approach is to design a system that has less overhead so that it can perform more useful work or purchase a better computer system with higher capacity. In computer systems, it is typically less expensive to buy the next generation of the computer system that has higher capacity because of technology improvements than trying to squeeze the last ounce out of the implementation through complex algorithms.
2. Shed load. If purchasing a computer system with higher capacity isn't an option and system performance cannot be improved, the preferred method is to shed load by reducing or limiting the offered load until the load is less than the capacity of the system.
One approach to control the offered load is to use a bounded buffer (see Figure 5.5) between stages. When the bounded buffer ahead of the bottleneck stage is full, then the stage before it must wait until the bounded buffer empties a slot. Because the previous stage is waiting, its bounded buffer may fill up too, which may cause the stage before it to wait, and so on. The bottleneck may be pushed all the way back to the beginning of the pipeline. If this happens, the system cannot accept any more input, and what happens next depends on how the system is used.
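A sketch of this backpressure mechanism, using a bounded buffer between two stages (the buffer size, stage names, and sentinel convention are illustrative, not from the text):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=8)  # bounded buffer between stage 1 and stage 2

def stage1():
    for req in range(100):
        buf.put(req)          # blocks when the buffer is full: backpressure
    buf.put(None)             # sentinel: no more input

def stage2():
    while True:
        req = buf.get()       # blocks when the buffer is empty
        if req is None:
            break
        time.sleep(0.01)      # stand-in for the bottleneck stage's work

t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2)
t1.start(); t2.start()
t1.join(); t2.join()
```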
If the source of the load needs the results of the output to generate the next request, then the load will be self-managing. This model of use applies to some interactive systems, in which the users cannot type the next command until the previous one finishes. This same idea will be used in Chapter 7 [on-line] in the implementation of self-pacing network protocols.
If the source of the load decides not to make the request at all, then the offered load decreases. If the source, however, simply holds on to the request and resubmits it later, then the offered load doesn't decrease, but some requests are just deferred, perhaps to a time when the system isn't overloaded.
A crude approach to limiting a source is to put a quota on how many requests a source may have outstanding. For example, some systems enforce a rule that an application may not create more than some fixed number of active threads at the same time and may not have more than some fixed number of open files. If a source has reached its quota for a given service, the system denies the next request, limiting the offered load on the system.
An alternative to limiting the offered load is reducing it when a stage becomes overloaded. We will see one example of this approach in Section 6.2. If the address spaces of a number of applications cannot fit in memory, the virtual memory manager can swap out a complete address space of one or more applications so that the remaining applications fit in memory. When the offered load decreases to normal levels, the virtual memory manager can swap in some of the applications that were swapped out.
6.1.7.1 Batching
Batching is performing several requests as a group to avoid the setup overhead of doing them one at a time. Opportunities for batching arise naturally at a bottleneck stage, which may have a queue of requests waiting to be processed. For example, if a stage has several requests to send to the next stage, the stage can combine all of the messages into a single message and send that one message to the next stage. This use of batching divides the overhead of an expensive operation (e.g., sending a message) over the several messages. More generally, batching works well when processing a request has a fixed delay (e.g., transmitting the request) and a variable delay (e.g., performing the operation specified in the request). Without batching, processing n requests takes n × (f + v), where f is the fixed delay and v is the variable delay. With batching, processing n requests takes f + n × v.
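The two cost formulas can be compared directly; a minimal sketch (function names and example numbers are ours):

```python
def no_batching(n, f, v):
    return n * (f + v)   # each request pays the fixed cost f

def batching(n, f, v):
    return f + n * v     # the fixed cost is paid once for the whole batch

# e.g., 10 messages, 5 ms setup per send, 1 ms per message of payload work:
print(no_batching(10, 5, 1), batching(10, 5, 1))  # 60 vs. 15
```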
Once a stage performs batching, the potential arises for additional performance wins. Batching may create opportunities for the stage to avoid work. If two or more write requests in a batch are for the same disk block, then the stage can perform just the last one.
Batching may also provide opportunities to improve latency by reordering the processing of requests. As we will see in Section 6.3.4, if a disk controller receives a batch of requests, it can schedule them in an order that reduces the movement of the disk arm, reducing the total latency for the batch of requests.
6.1.7.2 Dallying
Dallying is delaying a request on the chance that the operation won't be needed, or to create more opportunities for batching. For example, a stage may delay a request that overwrites a disk block in the hope that a second one will come along for the same block. If a second one comes along, the stage can delete the first request and perform just the second one. As applied to writes, this benefit is sometimes called write absorption.
Dallying also increases the opportunities for batching. It purposely increases the latency of some requests in the hope that more requests will come along that can be combined with the delayed requests to form a batch. In this case, dallying increases the latency of some requests to improve the average latency of all requests.
A key design question in dallying is to decide how long to wait. There is no generic answer to this question. The costs and benefits of dallying are application and system specific.
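A sketch of dallying with write absorption, assuming a simple block-to-data map and an arbitrarily chosen dally window (all names are illustrative):

```python
import time

pending = {}          # block number -> latest data queued for that block
DALLY_SECONDS = 0.05  # how long to wait: application-specific, chosen arbitrarily

def write(block, data):
    pending[block] = data      # a later write to the same block absorbs this one

def flush():
    time.sleep(DALLY_SECONDS)  # dally, hoping more writes arrive
    batch = list(pending.items())
    pending.clear()
    return batch               # one batched disk operation instead of many

write(7, b"first")
write(7, b"second")            # absorbs the first write to block 7
print(flush())                 # [(7, b'second')]
```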
6.1.7.3 Speculation
Speculation is performing an operation in advance of receiving a request on the chance that it will be requested. The goal is that the results can be delivered with less latency and perhaps with less setup overhead. Speculation can achieve this goal in two different ways. First, speculation can perform operations using otherwise idle resources. In this case, even if the speculation is wrong, performing the additional operations has no downside. Second, speculation can use a busy resource to do an
operation that has a long lead time so that the result of the operation can be available without waiting if it turns out to be needed. In this case, speculation might increase the delay and overhead of other requests without benefit because the prediction that the results may be needed might turn out to be wrong.
Speculating may sound bewildering because how can a computer system predict the input of an operation if it hasn't received the request yet, and how can it predict if the result of the operation will be useful in the future? Fortunately, many applications have request patterns that a system designer can exploit to predict an input. In some cases, the input value is evident; for example, a future instruction may add register 5 to register 9, and these register values may be available now. In some cases, the input values can be predicted accurately; for example, a program that asks to read byte n is likely to want to read bytes n + 1, n + 2, and so on, too. Similarly, for many applications a system can predict what results will be useful in the future. If a program performs instruction n, it will likely soon need the result of instruction n + 1; only when the instruction n is a jmp will the prediction be wrong.
Sometimes a system can use speculation even if the system cannot predict accurately what the input to an operation is or whether the result will be useful. For example, if an input has only two values, then the system might create a new thread and have the main thread run with one input value and the second thread with the other input value. Later, when the system knows the value of the input, it terminates the thread that is computing with the wrong value and undoes any changes that thread might have made. This use of speculation becomes challenging when it involves shared state that is updated by different threads, but using techniques presented in Chapter 9 [on-line] it is possible to undo the operations of a thread, even when shared state is involved.
Speculation creates more opportunities for batching and dallying. If the system speculates that a read request for block n will be followed by read requests for blocks n + 1 through n + 8, then the system can batch those read requests. If a write request might soon be followed by another write request, the system can dally for a while to see if any others come in and, if so, batch all the writes together.
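A sketch combining speculation with batching in the read-ahead style just described (the cache layout, read-ahead depth, and function names are illustrative):

```python
cache = {}

def read_block_from_disk(n):
    return f"<data for block {n}>"   # stand-in for a real disk read

def read(n, readahead=8):
    if n not in cache:
        for b in range(n, n + readahead):       # speculate on sequential access
            cache[b] = read_block_from_disk(b)  # one batch, one setup cost
    return cache[n]

read(0)         # slow path: fetches blocks 0..7 in one batch
print(read(3))  # fast path: already cached by the speculation
```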
Key design questions associated with speculation are when to speculate and how much. Speculation can increase the load on later stages. If this increase in load results in a load higher than the capacity of a later stage, then requests must wait and latency will increase. Also, any work done that turns out to be not useful is overhead, and performing this unnecessary work may slow down other requests. There is no generic answer to this design question; instead, a designer must evaluate the benefits and cost of speculation in the context of the system.
techniques with discipline. There is always the risk that by the time the designer has worked out the concurrency problems and the system has made it through the system tests, technology improvements will have made the extra complexity unnecessary. Problem set 14 explores several performance-enhancing techniques and their challenges with a simple multithreaded service.
Chapter 4
Concurrency Control
Apply deadlock detection using a waits-for graph to transaction schedules.
1. Users should be able to regard the execution of each transaction as atomic: Either all actions are carried out or none are. Users should not have to worry about the effect of incomplete transactions (say, when a system crash occurs).
4. Once the DBMS informs the user that a transaction has been successfully completed, its effects should persist even if the system crashes before all its changes are reflected on disk. This property is called durability.
The acronym ACID is sometimes used to refer to these four properties of transactions: atomicity, consistency, isolation and durability. We now consider how each of these properties is ensured in a DBMS.
Users are responsible for ensuring transaction consistency. That is, the user who submits a transaction must ensure that, when run to completion by itself against a 'consistent' database instance, the transaction will leave the database in a 'consistent' state. For example, the user may (naturally) have the consistency criterion that fund transfers between bank accounts should not change the total amount of money in the accounts. To transfer money from one account to another, a transaction must debit one account, temporarily leaving the database inconsistent in a global sense, even though the new account balance may satisfy any integrity constraints with respect to the range of acceptable account balances. The user's notion of a consistent database is preserved when the second account is credited with the transferred amount. If a faulty transfer program always credits the second account with one dollar less than the amount debited from the first account, the DBMS cannot be expected to detect inconsistencies due to such errors in the user program's logic.
how the DBMS implements this guarantee in Section 16.4.) For example, if two transactions T1 and T2 are executed concurrently, the net effect is guaranteed to be equivalent to executing (all of) T1 followed by executing T2 or executing T2 followed by executing T1. (The DBMS provides no guarantees about which of these orders is effectively chosen.) If each transaction maps a consistent database instance to another consistent database instance, executing several transactions one after the other (on a consistent initial database instance) results in a consistent final database instance.
The DBMS component that ensures atomicity and durability, called the recovery manager, is discussed further in Section 16.7.
In addition to reading and writing, each transaction must specify as its final action either commit (i.e., complete successfully) or abort (i.e., terminate and undo all the actions carried out thus far). Abort_T denotes the action of T aborting, and Commit_T denotes T committing.
1. Transactions interact with each other only via database read and write operations; for example, they are not allowed to exchange messages.
If the first assumption is violated, the DBMS has no way to detect or prevent inconsistencies caused by such external interactions between transactions, and it is up to the writer of the application to ensure that the program is well-behaved. We relax the second assumption in Section 16.6.2.
T1                T2
R(A)
W(A)
                  R(B)
                  W(B)
R(C)
W(C)
Note that the schedule in Figure 16.1 does not contain an abort or commit action for either transaction. A schedule that contains either an abort or a commit for each transaction whose actions are listed in it is called a complete schedule. A complete schedule must contain all the actions of every transaction that appears in it. If the actions of different transactions are not interleaved—that is, transactions are executed from start to finish, one by one—we call the schedule a serial schedule.
16.3.2 Serializability
T1                T2
R(A)
W(A)
                  R(A)
                  W(A)
R(B)
W(B)
                  R(B)
                  W(B)
Commit
                  Commit
The preceding definition of a serializable schedule does not cover the case of schedules containing aborted transactions. We extend the definition of serializable schedules to cover aborted transactions in Section 16.3.4.
that the actions are interleaved so that (1) the account transfer program T1 deducts $100 from account A, then (2) the interest deposit program T2 reads the current values of accounts A and B and adds 6% interest to each, and then (3) the account transfer program credits $100 to account B. The corresponding schedule, which is the view the DBMS has of this series of events, is illustrated in Figure 16.4. The result of this schedule is different from any result that we would get by running one of the two transactions first and then the other. The problem can be traced to the fact that the value of A written by T1 is read by T2 before T1 has completed all its changes.
T1                T2
R(A)
W(A)
                  R(A)
                  W(A)
                  R(B)
                  W(B)
                  Commit
R(B)
W(B)
Commit
The general problem illustrated here is that T1 may write some value into A
that makes the database inconsistent. As long as T1 overwrites this value with
a ‘correct’value of A before committing, no harm is done if T1 and T 2 run in
some serial order, because T2 would then not see the (temporary) inconsistency.
On the other hand, interleaved execution can expose this inconsistency and lead
to an inconsistent final database state.
The second way in which anomalous behavior could result is that a transaction T2 could change the value of an object A that has been read by a transaction T1, while T1 is still in progress.
If T1 tries to read the value of A again, it will get a different result, even though it has not modified A in the meantime. This situation could not arise in a serial execution of two transactions; it is called an unrepeatable read.
To see why this can cause problems, consider the following example. Suppose that A is the number of available copies for a book. A transaction that places an order first reads A, checks that it is greater than 0, and then decrements it. Transaction T1 reads A and sees the value 1. Transaction T2 also reads A and sees the value 1, decrements A to 0 and commits. Transaction T1 then tries to decrement A and gets an error (if there is an integrity constraint that prevents A from becoming negative).
This situation can never arise in a serial execution of T1 and T2; the second transaction would read A and see 0 and therefore not proceed with the order (and so would not attempt to decrement A).
Suppose that Harry and Larry are two employees, and their salaries must be kept equal. Transaction T1 sets their salaries to $2000 and transaction T2 sets their salaries to $1000. If we execute these in the serial order T1 followed by T2, both receive the salary $1000; the serial order T2 followed by T1 gives each the salary $2000. Either of these is acceptable from a consistency standpoint (although Harry and Larry may prefer a higher salary!). Note that neither transaction reads a salary value before writing it—such a write is called a blind write, for obvious reasons.
T1                T2
R(A)
W(A)
                  R(A)
                  W(A)
                  R(B)
                  W(B)
                  Commit
Abort
2We must also consider incomplete transactions for a rigorous discussion of system failures, because
transactions that are active when the system fails are neither aborted nor committed. However, system
recovery usually begins by aborting all active transactions, and for our informal discussion, considering
schedules involving committed and aborted transactions is sufficient.
Now, T2 has read a value for A that should never have been there. (Recall that aborted transactions' effects are not supposed to be visible to other transactions.) If T2 had not yet committed, we could deal with the situation by cascading the abort of T1 and also aborting T2; this process recursively aborts any transaction that read data written by T2, and so on. But T2 has already committed, and so we cannot undo its actions. We say that such a schedule is unrecoverable. In a recoverable schedule, transactions commit only after (and if!) all transactions whose changes they read commit. If transactions read only the changes of committed transactions, not only is the schedule recoverable, but also aborting a transaction can be accomplished without cascading the abort to other transactions. Such a schedule is said to avoid cascading aborts.
Of course, a transaction that has an exclusive lock can also read the object; an additional shared lock is not required. A transaction that requests a lock is suspended until the DBMS is able to grant it the requested lock. The DBMS keeps track of the locks it has granted and ensures that if a transaction holds an exclusive lock on an object, no other transaction holds a shared or exclusive lock on the same object. The second rule in Strict 2PL is that all locks held by a transaction are released when the transaction is completed.
Requests to acquire and release locks can be automatically inserted into transactions by the DBMS; users need not worry about these details. (We discuss how application programmers can select properties of transactions and control locking overhead in Section 16.6.3.)
If the Strict 2PL protocol is used, such interleaving is disallowed. Let us see why. Assuming that the transactions proceed at the same relative speed as before, T1 would obtain an exclusive lock on A first and then read and write A (Figure 16.6). Then, T2 would request a lock on A. However, this request
T1                T2
X(A)
R(A)
W(A)
cannot be granted until T1 releases its exclusive lock on A, and the DBMS therefore suspends T2. T1 now proceeds to obtain an exclusive lock on B, reads and writes B, then finally commits, at which time its locks are released. T2's lock request is now granted, and it proceeds. In this example the locking protocol results in a serial execution of the two transactions, shown in Figure 16.7.
T1                T2
X(A)
R(A)
W(A)
X(B)
R(B)
W(B)
Commit
                  X(A)
                  R(A)
                  W(A)
                  X(B)
                  R(B)
                  W(B)
                  Commit
It can be shown that the Strict 2PL algorithm allows only serializable schedules. None of the anomalies discussed in Section 16.3.3 can arise if the DBMS implements Strict 2PL.
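As an illustration of the lock bookkeeping behind Strict 2PL, here is a minimal single-threaded Python sketch (the class and method names are ours; a real DBMS would also queue and suspend blocked transactions):

```python
from collections import defaultdict

class LockManager:
    """Bookkeeping for Strict 2PL; 'S' = shared, 'X' = exclusive."""
    def __init__(self):
        self.locks = defaultdict(lambda: {"mode": None, "holders": set()})

    def acquire(self, txn, obj, mode):
        """Return True if granted; False means txn must block and wait."""
        entry = self.locks[obj]
        if not entry["holders"]:
            entry["mode"], entry["holders"] = mode, {txn}
            return True
        if entry["holders"] == {txn}:
            if mode == "X":
                entry["mode"] = "X"      # upgrade is safe: sole holder
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)    # shared locks are compatible
            return True
        return False                     # conflict: requester is suspended

    def release_all(self, txn):
        """Strict 2PL: all locks are released only at commit or abort."""
        for entry in self.locks.values():
            entry["holders"].discard(txn)
            if not entry["holders"]:
                entry["mode"] = None

lm = LockManager()
print(lm.acquire("T1", "A", "X"))  # True: T1 holds X(A)
print(lm.acquire("T2", "A", "X"))  # False: T2 must wait, as in Figure 16.6
lm.release_all("T1")               # T1 commits; T2's request could now be granted
```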
T1                T2
S(A)
R(A)
                  S(A)
                  R(A)
                  X(B)
                  R(B)
                  W(B)
                  Commit
X(C)
R(C)
W(C)
Commit
16.4.2 Deadlocks
Throughput can be increased in three ways (other than buying a faster system):
■ By locking the smallest sized objects possible (reducing the likelihood that two transactions need the same lock).
■ By reducing the time that transactions hold locks (so that other transactions are blocked for a shorter time).
■ By reducing hot spots, that is, objects that are frequently accessed and modified and therefore force many transactions to wait.
Even with the use of savepoints, certain applications might require us to run several transactions one after the other. To minimize the overhead in such situations, SQL:1999 introduces another feature, called chained transactions. We can commit or roll back a transaction and immediately initiate another transaction. This is done by using the optional keywords AND CHAIN in the COMMIT and ROLLBACK statements.
Suppose that this query runs as part of transaction T1 and an SQL statement that modifies the age of a given sailor, say Joe, with rating=8 runs as part of transaction T2. What 'objects' should the DBMS lock when executing these transactions? Intuitively, we must detect a conflict between these transactions.
The DBMS could set a shared lock on the entire Sailors table for T1 and set an exclusive lock on Sailors for T2, which would ensure that the two transactions are executed in a serializable manner. However, this approach yields low concurrency, and we can do better by locking smaller objects, reflecting what each transaction actually accesses. Thus, the DBMS could set a shared lock on every row with rating=8 for transaction T1 and set an exclusive lock on just the row for the modified tuple for transaction T2. Now, other read-only transactions that do not involve rating=8 rows can proceed without waiting for T1 or T2.
a new sailor with rating=8 and runs as transaction T3. (Observe that this example violates our assumption of a fixed number of objects in the database, but we must obviously deal with such situations in practice.)
Suppose that the DBMS sets shared locks on every existing Sailors row with rating=8 for T1. This does not prevent transaction T3 from creating a brand new row with rating=8 and setting an exclusive lock on this row. If this new row has a smaller age value than existing rows, T1 returns an answer that depends on when it executed relative to T3. However, our locking scheme imposes no relative order on these two transactions.
It may well be that the application invoking T1 can accept the potential inaccuracy due to phantoms. If so, the approach of setting shared locks on existing tuples for T1 is adequate, and offers better performance. SQL allows a programmer to make this choice—and other similar choices—explicitly, as we see next.
The highest degree of isolation from the effects of other transactions is achieved
by setting the isolation level for a transaction T to SERIALIZABLE. This isolation
level ensures that T reads only the changes made by committed transactions,
that no value read or written by T is changed by any other transaction until T
is complete, and that if T reads a set of values based on some search condition,
this set is not changed by other transactions until T is complete (i.e., T avoids
the phantom phenomenon).
REPEATABLE READ ensures that T reads only the changes made by committed transactions and no value read or written by T is changed by any other transaction until T is complete. However, T could experience the phantom phenomenon; for example, while T examines all Sailors records with rating=1, another transaction might add a new such Sailors record, which is missed by T.
READ COMMITTED ensures that T reads only the changes made by committed transactions, and that no value written by T is changed by any other transaction until T is complete. However, a value read by T may well be modified by another transaction while T is still in progress.
A READ COMMITTED transaction obtains exclusive locks before writing objects and holds these locks until the end. It also obtains shared locks before reading objects, but these locks are released immediately; their only effect is to guarantee that the transaction that last modified the object is complete. (This guarantee relies on the fact that every SQL transaction obtains exclusive locks before writing objects and holds exclusive locks until the end.)
A READ UNCOMMITTED transaction does not obtain shared locks before reading objects. This mode represents the greatest exposure to uncommitted changes of other transactions; so much so that SQL prohibits such a transaction from making any changes itself—a READ UNCOMMITTED transaction is required to have an access mode of READ ONLY. Since such a transaction obtains no locks for reading objects and it is not allowed to write objects (and therefore never requests exclusive locks), it never makes any lock requests.
The SERIALIZABLE isolation level is generally the safest and is recommended for most transactions. Some transactions, however, can run with a lower isolation level, and the smaller number of locks requested can contribute to improved system performance. For example, a statistical query that finds the average sailor age can be run at the READ COMMITTED level or even the READ UNCOMMITTED level, because a few incorrect or missing values do not significantly affect the result if the number of sailors is large.
The isolation level and access mode can be set using the SET TRANSACTION command. For example, the following command declares the current transaction to be SERIALIZABLE and READ ONLY:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY
When a DBMS is restarted after crashes, the recovery manager is given control
and must bring the database to a consistent state. The recovery manager is
also responsible for undoing the actions of an aborted transaction. To see what
it takes to implement a recovery manager, it is necessary to understand what
happens during normal execution.
This implies that the system does not crash while a write is in progress, which is unrealistic. In practice, disk writes do not have this property, and steps must be taken during restart after a crash (Section 18.6) to verify that the most recent write to a given page was completed successfully, and to deal with the consequences if not.
2. When a transaction commits, must we ensure that all the changes it has made to objects in the buffer pool are immediately forced to disk? If so, we say that a force approach is used.
A concurrency control technique that does not involve locking could be used instead, but we assume that locking is used.
However, these policies have important drawbacks. The no-steal approach assumes that all pages modified by ongoing transactions can be accommodated in the buffer pool, and in the presence of large transactions (typically run in batch mode, e.g., payroll processing), this assumption is unrealistic. The force approach results in excessive page I/O costs. If a highly used page is updated in succession by 20 transactions, it would be written to disk 20 times. With a no-force approach, on the other hand, the in-memory copy of the page would be successively modified and written to disk just once, reflecting the effects of all 20 updates, when the page is eventually replaced in the buffer pool (in accordance with the buffer manager's page replacement policy).
For these reasons, most systems use a steal, no-force approach. Thus, if a frame is dirty and chosen for replacement, the page it contains is written to disk even if the modifying transaction is still active (steal); in addition, pages in the buffer pool that are modified by a transaction are not forced to disk when the transaction commits (no-force).
The log enables the recovery manager to undo the actions of aborted and
incomplete transactions and redo the actions of committed transactions. For
example, a transaction that committed before the crash may have made updates
6Nothing in life is really guaranteed except death and taxes. However, we can reduce the chance
of log failure to be vanishingly small by taking steps such as duplexing the log and storing the copies
in different secure locations.
to a copy (of a database object) in the buffer pool, and this change may not have
been written to disk before the crash, because of a no-force approach. Such
changes must be identified using the log and written to disk. Further, changes
of transactions that did not commit prior to the crash might have been written
to disk because of a steal approach. Such changes must be identified using the
log and then undone.
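The practical consequence of steal/no-force for the recovery manager can be summarized in a few lines; a sketch (all names are ours, not the text's) of the resulting undo/redo classification for a single logged update:

```python
def recovery_action(committed, on_disk):
    """What the recovery manager must do for one logged update (illustrative)."""
    if committed and not on_disk:
        return "redo"    # no-force: a committed change may not have reached disk
    if not committed and on_disk:
        return "undo"    # steal: an uncommitted change may already be on disk
    return "nothing"

print(recovery_action(committed=True, on_disk=False))  # redo
print(recovery_action(committed=False, on_disk=True))  # undo
```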
17
CONCURRENCY CONTROL
Pooh was sitting in his house one day, counting his pots of honey, when there came a knock on the door.
“Fourteen,” said Pooh. “Come in. Fourteen. Or was it fifteen? Bother. That’s muddled me.”
“Hallo, Pooh,” said Rabbit.
“Hallo, Rabbit. Fourteen, wasn’t it?”
“What was?”
“My pots of honey what I was counting.”
“Fourteen, that’s right.”
“Are you sure?”
“No,” said Rabbit. “Does it matter?”
As we saw in Section 16.3.3, two actions conflict if they operate on the same data object and at least one of them is a write. The outcome of a schedule depends only on the order of conflicting operations; we can interchange any pair of nonconflicting operations without altering the effect of the schedule on the database. If two schedules are conflict equivalent, it is easy to see that they have the same effect on a database. Indeed, because they order all pairs of conflicting operations in the same way, we can obtain one of them from the other by repeatedly swapping pairs of nonconflicting actions, that is, by swapping pairs of actions whose relative order does not alter the outcome.
T1                T2                T3
R(A)
                  W(A)
                  Commit
W(A)
Commit
                                    W(A)
                                    Commit
T3, but it is not conflict equivalent to this serial schedule because the writes of T1 and T2 are ordered differently.
It is useful to capture all potential conflicts between the transactions in a schedule in a precedence graph, also called a serializability graph. The precedence graph for a schedule S contains:
1. A node for each committed transaction in S.
2. An arc from Ti to Tj if an action of Ti precedes and conflicts with one of Tj's actions.
The precedence graphs for the schedules shown in Figures 16.7, 16.8, and 17.1 are shown in Figure 17.2 (parts a, b, and c, respectively).
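A precedence graph can be built and tested for cycles mechanically; a minimal Python sketch, representing a schedule as (transaction, action, object) triples (the encoding and function names are ours):

```python
def precedence_edges(schedule):
    """Arc Ti -> Tj if an action of Ti precedes and conflicts with one of Tj's."""
    edges = set()
    for i, (ti, ai, oi) in enumerate(schedule):
        for tj, aj, oj in schedule[i + 1:]:
            if oi == oj and ti != tj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def serializable(schedule):
    """Conflict serializable iff the precedence graph is acyclic (DFS check)."""
    edges = precedence_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    visiting, done = set(), set()

    def dfs(t):
        visiting.add(t)
        for a, b in edges:
            if a == t and (b in visiting or (b not in done and dfs(b))):
                return True              # back edge: cycle found
        visiting.discard(t)
        done.add(t)
        return False

    return not any(dfs(t) for t in nodes if t not in done)

# an interleaving on A and B in the style of the anomalies of Section 16.3.3:
s = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
     ("T2", "R", "B"), ("T2", "W", "B"), ("T1", "R", "B"), ("T1", "W", "B")]
print(serializable(s))  # False: T1 -> T2 (on A) and T2 -> T1 (on B)
```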
1. A schedule S is conflict serializable if and only if its precedence graph is acyclic.
2. Strict 2PL ensures that the precedence graph for any schedule that it allows is acyclic.
It can be shown that even nonstrict 2PL ensures acyclicity of the precedence graph and therefore allows only conflict serializable schedules. Intuitively, an equivalent serial order of transactions is given by the order in which transactions enter their shrinking phase: If T2 reads or writes an object written by T1, T1 must have released its lock on the object before T2 requested a lock on this object. Thus, T1 precedes T2. (A similar argument shows that T1 precedes T2 if T2 writes an object previously read by T1. A formal proof of the claim would have to show that there is no cycle of transactions that 'precede' each other by this argument.)
The reader is invited to revisit the examples in Section 16.3.3 to see how the corresponding schedules are disallowed by Strict 2PL and 2PL. Similarly, it would be instructive to work out how the schedules for the examples in Section 16.3.4 are disallowed by Strict 2PL but not by 2PL.
1. If Ti reads the initial value of object A in S1, it must also read the initial value of A in S2.
2. If Ti reads a value of A written by Tj in S1, it must also read the value of A written by Tj in S2.
3. For each data object A, the transaction (if any) that performs the final write on A in S1 must also perform the final write on A in S2.
17.2 INTRODUCTION TO LOCK MANAGEMENT
When a transaction aborts or commits, it releases all its locks. When a lock on an object is released, the lock manager updates the lock table entry for the object and examines the lock request at the head of the queue for this object. If this request can now be granted, the transaction that made the request is woken up and given the lock. Indeed, if several requests for a shared lock on the object are at the front of the queue, all of these requests can now be granted together.
The implementation of lock and unlock commands must ensure that these are atomic operations. To ensure atomicity of these operations when several instances of the lock manager code can execute concurrently, access to the lock table has to be guarded by an operating system synchronization mechanism such as a semaphore.
In addition to locks, which are held over a long duration, a DBMS also supports short-duration latches. Setting a latch before reading or writing a page ensures that the physical read or write operation is atomic; otherwise, two read/write operations might conflict if the objects being locked do not correspond to disk pages (the units of I/O). Latches are unset immediately after the physical read or write operation is completed.
17.3 LOCK CONVERSIONS
A better approach is to avoid the need for lock upgrades altogether by obtaining exclusive locks initially, and downgrading to a shared lock once it is clear that this is sufficient. In our example of an SQL update statement, rows in a table are locked in exclusive mode first. If a row does not satisfy the condition for being updated, the lock on the row is downgraded to a shared lock. Does the downgrade approach violate the 2PL requirement? On the surface, it does, because downgrading reduces the locking privileges held by a transaction, and the transaction may go on to acquire other locks. However, this is a special case, because the transaction did nothing but read the object that it downgraded, even though it conservatively obtained an exclusive lock. We can safely expand our definition of 2PL from Section 17.1 to allow lock downgrades in the growing phase, provided that the transaction has not modified the object.
17.4 DEALING WITH DEADLOCKS
Deadlocks tend to be rare and typically involve very few transactions. In practice, therefore, database systems periodically check for deadlocks. When a transaction Ti is suspended because a lock that it requests cannot be granted, it must wait until all transactions Tj that currently hold conflicting locks release them. The lock manager maintains a structure called a waits-for graph to detect deadlock cycles. The nodes correspond to active transactions, and there is an arc from Ti to Tj if (and only if) Ti is waiting for Tj to release a lock. The lock manager adds edges to this graph when it queues lock requests and removes edges when it grants lock requests.
Consider the schedule shown in Figure 17.3. The last step, shown below the line, creates a cycle in the waits-for graph. Figure 17.4 shows the waits-for graph before and after this step.
T1        T2        T3        T4
S(A)
R(A)
          X(B)
          W(B)
                    S(B)
                              S(C)
                              R(C)
          X(C)
X(B)
--------------------------------------
                              X(A)
Observe that the waits-for graph describes all active transactions, some of which eventually abort. If there is an edge from Ti to Tj in the waits-for graph, and both Ti and Tj eventually commit, there is an edge in the opposite direction (from Tj to Ti) in the precedence graph (which involves only committed transactions).
The waits-for graph is periodically checked for cycles, which indicate deadlock. A deadlock is resolved by aborting a transaction that is on a cycle and releasing its locks; this action allows some of the waiting transactions to proceed. The choice of which transaction to abort can be made using several criteria: the one with the fewest locks, the one that has done the least work, the one that is farthest from completion, and so on. Further, a transaction might have been repeatedly restarted; if so, it should eventually be favored during deadlock detection and allowed to complete.
Empirical results indicate that deadlocks are relatively infrequent, and detection-based schemes work well in practice. However, if there is a high level of contention for locks and therefore an increased likelihood of deadlocks, prevention-based schemes could perform better. We can prevent deadlocks by giving each transaction a priority and ensuring that lower-priority transactions are not allowed to wait for higher-priority transactions (or vice versa). One way to assign priorities is to give each transaction a timestamp when it starts up. The lower the timestamp, the higher is the transaction's priority; that is, the oldest transaction has the highest priority.
If a transaction Ti requests a lock and transaction Tj holds a conflicting lock, the lock manager can use one of the following two policies:
■ Wait-die: If Ti has the higher priority, it is allowed to wait; otherwise, it is aborted.
■ Wound-wait: If Ti has the higher priority, abort Tj; otherwise, Ti waits.
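The two policies reduce to one-line rules; a minimal sketch, assuming priority is given by a smaller start timestamp (the function names are ours):

```python
def wait_die(ts_requester, ts_holder):
    """Requester waits only if it is older (higher priority) than the holder."""
    return "wait" if ts_requester < ts_holder else "abort requester"

def wound_wait(ts_requester, ts_holder):
    """An older requester preempts (wounds) the holder; a younger one waits."""
    return "abort holder" if ts_requester < ts_holder else "wait"

print(wait_die(1, 5))    # older requester: wait
print(wait_die(5, 1))    # younger requester: abort requester (die)
print(wound_wait(1, 5))  # older requester: abort holder (wound)
```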
A subtle point is that we must also ensure that no transaction is perennially aborted because it never has a sufficiently high priority. (Note that, in both schemes, the higher-priority transaction is never aborted.) When a transaction is aborted and restarted, it should be given the same timestamp it had originally. Reissuing timestamps in this way ensures that each transaction will eventually become the oldest transaction, and therefore the one with the highest priority, and will get all the locks it requires.
The wait-die scheme is nonpreemptive; only a transaction requesting a lock can be aborted. As a transaction grows older (and its priority increases), it tends to wait for more and more younger transactions. A younger transaction that conflicts with an older transaction may be repeatedly aborted (a disadvantage with respect to wound-wait), but on the other hand, a transaction that has all the locks it needs is never aborted for deadlock reasons (an advantage with respect to wound-wait, which is preemptive).
17.5 SPECIALIZED LOCKING TECHNIQUES
Thus far we have treated a database as a fixed collection of independent data objects in our presentation of locking protocols. We now relax each of these restrictions and discuss the consequences.
If the collection of database objects is not fixed, but can grow and shrink through the insertion and deletion of objects, we must deal with a subtle complication known as the phantom problem, which was illustrated in Section 16.6.2. We discuss this problem in Section 17.5.1.
Although treating a database as an independent collection of objects is adequate for a discussion of serializability and recoverability, much better performance can sometimes be obtained using protocols that recognize and exploit the relationships between objects. We discuss two such cases, namely, locking in tree-structured indexes (Section 17.5.2) and locking a collection of objects with containment relationships between them (Section 17.5.3).
Consider the following example: Transaction T1 scans the Sailors relation to find the oldest sailor for each of the rating levels 1 and 2. First, T1 identifies and locks all pages (assuming that page-level locks are set) containing sailors with rating 1 and then finds the age of the oldest sailor, which is, say, 71. Next, transaction T2 inserts a new sailor with rating 1 and age 96. Observe that this new Sailors record can be inserted onto a page that does not contain other sailors with rating 1; thus, an exclusive lock on this page does not conflict with any of the locks held by T1. T2 also locks the page containing the oldest sailor with rating 2 and deletes this sailor (whose age is, say, 80). T2 then commits and releases its locks. Finally, transaction T1 identifies and locks pages containing (all remaining) sailors with rating 2 and finds the age of the oldest such sailor, which is, say, 63.
The result of the interleaved execution is that ages 71 and 63 are printed in response to the query. If T1 had run first, then T2, we would have gotten the ages 71 and 80; if T2 had run first, then T1, we would have gotten the ages 96 and 63. Thus, the result of the interleaved execution is not identical to any serial execution of T1 and T2, even though both transactions follow Strict 2PL and commit. The problem is that T1 assumes that the pages it has locked include all pages containing Sailors records with rating 1, and this assumption is violated when T2 inserts a new such sailor on a different page.
The flaw is not in the Strict 2PL protocol. Rather, it is in T1's implicit assumption that it has locked the set of all Sailors records with rating value 1. T1's semantics requires it to identify all such records, but locking pages that contain such records at a given time does not prevent new "phantom" records from being added on other pages. T1 has therefore not locked the set of desired Sailors records.
A closer look at how a transaction identifies pages containing Sailors records with rating 1 suggests how the problem can be handled:

■ If there is no index and all pages in the file must be scanned, T1 must somehow ensure that no new pages are added to the file, in addition to locking all existing pages.

■ If there is an index on the rating field, T1 can obtain a lock on the index page (again, assuming that physical locking is done at the page level) that contains a data entry with rating=1. If there are no such data entries, that is, no records with this rating value, the page that would contain a data entry for rating=1 is locked to prevent such a record from being inserted. Any transaction that tries to insert a record with rating=1 into the Sailors relation must insert a data entry pointing to the new record into this index page and is blocked until T1 releases its locks. This technique is called index locking.

We note that index locking is a special case of a more general concept called predicate locking. In our example, the lock on the index page implicitly locked all Sailors records that satisfy the logical predicate rating=1. More generally, we can support implicit locking of all records that match an arbitrary predicate. General predicate locking is expensive to implement and therefore not commonly used.
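To make the mechanism concrete, here is a minimal Python sketch of the special case of predicate locking for simple equality predicates such as rating=1; the class and method names are illustrative, not part of any real DBMS:

# Minimal sketch of equality-predicate locking (hypothetical API).
# A reader locks the predicate it evaluates; an inserter checks the
# record it wants to add against every predicate lock currently held.

class PredicateLockTable:
    def __init__(self):
        self.locks = []  # list of (txn_id, field, value), shared mode

    def lock_predicate(self, txn_id, field, value):
        # T1 would call this before scanning for rating == 1.
        self.locks.append((txn_id, field, value))

    def may_insert(self, txn_id, record):
        # An insert conflicts if the new record satisfies a predicate
        # locked by another transaction.
        for owner, field, value in self.locks:
            if owner != txn_id and record.get(field) == value:
                return False  # inserter must block until the lock is freed
        return True

table = PredicateLockTable()
table.lock_predicate("T1", "rating", 1)
print(table.may_insert("T2", {"sname": "new sailor", "rating": 1}))  # False
print(table.may_insert("T2", {"sname": "new sailor", "rating": 2}))  # True

Index locking achieves the same effect far more cheaply by letting the index page that holds (or would hold) the matching data entries stand in for the predicate.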
A straightforward approach to concurrency control for B+ trees and ISAM indexes is to ignore the index structure, treat each page as a data object, and use some version of 2PL. This simplistic locking strategy would lead to very high lock contention in the higher levels of the tree, because every tree search begins at the root and proceeds along some path to a leaf node. Fortunately, much more efficient locking protocols that exploit the hierarchical structure of a tree index have been developed.
Two observations provide the necessary insight:

1. The higher levels of the tree only direct searches to the correct leaf; they do not themselves contain data.

2. For inserts, a node must be locked in exclusive mode only if a split can propagate up to it from the modified leaf.

Searches should obtain shared locks on nodes, starting at the root and proceeding along a path to the desired leaf. The first observation suggests that a lock on a node can be released as soon as a lock on a child node is obtained, because searches never go back up the tree.
A conservative locking strategy for inserts would be to obtain exclusive locks on all nodes as we go down from the root to the leaf node to be modified, because splits can propagate all the way from a leaf to the root. However, once we lock the child of a node, the lock on the node is required only in the event that a split propagates back to it. In particular, if the child of this node (on the path to the modified leaf) is not full when it is locked, any split that propagates up to the child can be resolved at the child, and does not propagate further to the current node. Therefore, when we lock a child node, we can release the lock on the parent if the child is not full. The locks held thus by an insert force any other transaction following the same path to wait at the earliest point (i.e., the node nearest the root) that might be affected by the insert. The technique of locking a child node and (if possible) releasing the lock on the parent is called lock-coupling, or crabbing (think of how a crab walks, and compare it to how we proceed down a tree, alternately releasing a lock on a parent and setting a lock on a child).
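The following is a rough Python sketch of crabbing on a toy B+ tree, under the assumption that each node carries its own lock and can report whether it is full; the node layout and helper names are invented for illustration:

import bisect
import threading

class Node:
    def __init__(self, is_leaf=False, capacity=4):
        self.lock = threading.Lock()
        self.is_leaf = is_leaf
        self.capacity = capacity
        self.keys, self.children = [], []

    def is_full(self):
        return len(self.keys) >= self.capacity

def child_for(node, key):
    # Hypothetical helper: pick the subtree that could contain `key`.
    return node.children[bisect.bisect_right(node.keys, key)]

def search_descend(root, key):
    # Searches: lock a child, then immediately release the parent,
    # because a search never goes back up the tree.
    node = root
    node.lock.acquire()
    while not node.is_leaf:
        child = child_for(node, key)
        child.lock.acquire()
        node.lock.release()
        node = child
    return node  # caller reads the leaf, then releases node.lock

def insert_descend(root, key):
    # Conservative inserts: hold ancestor locks, but release them all
    # once the current child is not full (a split cannot pass it).
    held = []
    node = root
    node.lock.acquire()
    held.append(node)
    while not node.is_leaf:
        child = child_for(node, key)
        child.lock.acquire()
        if not child.is_full():
            for n in held:
                n.lock.release()
            held = []
        held.append(child)
        node = child
    return held  # locks still held cover every node a split could touch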
Ti always maintains a lock on one node in the path, to force new transactions that want to read or modify nodes on the same path to wait until the current transaction is done. If transaction Tj wants to delete 38*, for example, it must also traverse the path from the root to node D and is forced to wait until Ti
is done. Of course, if some transaction Tk holds a lock on, say, node C before Ti reaches this node, Ti is similarly forced to wait for Tk to complete.

To insert data entry 45*, a transaction must obtain an S lock on node A, obtain an S lock on node B and release the lock on A, then obtain an S lock on node C (observe that the lock on B is not released, because C is full), then obtain an X lock on node E and release the locks on C and then B. Because node E has space for the new entry, the insert is accomplished by modifying this node.

In contrast, consider the insertion of data entry 25*. Proceeding as for the insert of 45*, we obtain an X lock on node H. Unfortunately, this node is full and must be split. Splitting H requires that we also modify the parent, node F, but the transaction has only an S lock on F. Thus, it must request an upgrade of this lock to an X lock. If no other transaction holds an S lock on F, the upgrade is granted, and since F has space, the split does not propagate further and the insertion of 25* can proceed (by splitting H and locking G to modify the sibling pointer in I to point to the newly created node). However, if another transaction holds an S lock on node F, the first transaction is suspended until this transaction releases its S lock.
would have to lock its parent, node C (and possibly ancestors of C, in order to lock C).
Except for the locks on intermediate nodes that we indicated could be released early, some variant of 2PL must be used to govern when locks can be released, to ensure serializability and recoverability.

This approach improves considerably on the naive use of 2PL, but several exclusive locks are still set unnecessarily and, although they are quickly released, affect performance substantially. One way to improve performance is for inserts to obtain shared locks instead of exclusive locks, except for the leaf, which is locked in exclusive mode. In the vast majority of cases, a split is not required and this approach works very well. If the leaf is full, however, we must upgrade from shared locks to exclusive locks for all nodes to which the split propagates. Note that such lock upgrade requests can also lead to deadlocks.

The tree locking ideas that we describe illustrate the potential for efficient locking protocols in this very important special case, but they are not the current state of the art. The interested reader should pursue the leads in the bibliography.
A common situation is that a transaction needs to read an entire file and modify a few of the records in it; that is, it needs an S lock on the file and an IX lock so that it can subsequently lock some of the contained objects in X mode. It is useful to define a new kind of lock, called an SIX lock, that is logically equivalent to holding an S lock and an IX lock. A transaction can obtain a single SIX lock (which conflicts with any lock that conflicts with either S or IX) instead of an S lock and an IX lock.
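The resulting compatibilities can be tabulated directly. A small sketch, using the standard compatibility rules for IS, IX, S, SIX, and X implied by the discussion above (SIX is compatible with a mode exactly when both S and IX are, which leaves only IS):

# Lock compatibility for multiple-granularity locking.
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def six_compatible(mode):
    # SIX = S + IX: compatible only where both components are.
    return COMPAT["S"][mode] and COMPAT["IX"][mode]

assert all(six_compatible(m) == COMPAT["SIX"][m] for m in COMPAT)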
A subtle point is that locks must be released in leaf-to-root order for this protocol to work correctly. To see this, consider what happens when a transaction Ti locks all nodes on a path from the root (corresponding to the entire database) to the node corresponding to some page p in IS mode, locks p in S mode, and then releases the lock on the root node. Another transaction Tj could now obtain an X lock on the root. This lock implicitly gives Tj an X lock on page p, which conflicts with the S lock currently held by Ti.
a certain number of locks at that granularity, to start obtaining locks at the next higher granularity (e.g., at the page level). This procedure is called lock escalation.
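A minimal sketch of the escalation bookkeeping, with an invented threshold and data structures (real systems choose thresholds and granularities differently):

# Sketch of lock escalation: once a transaction holds more than a
# threshold of row locks on one table, trade them for one table lock.

class Transaction:
    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.row_locks = {}       # table -> set of locked row ids
        self.table_locks = set()  # tables locked as a whole

    def lock_row(self, table, row_id):
        if table in self.table_locks:
            return  # already covered by the coarser lock
        rows = self.row_locks.setdefault(table, set())
        rows.add(row_id)
        if len(rows) > self.threshold:
            # Escalate: one table-level lock replaces many row locks.
            self.table_locks.add(table)
            del self.row_locks[table]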
17.6 CONCURRENCY CONTROL WITHOUT LOCKING

Locking protocols take a pessimistic approach to conflicts between transactions and use either transaction abort or blocking to resolve conflicts. In a system with relatively light contention for data objects, the overhead of obtaining locks and following a locking protocol must nonetheless be paid.

In optimistic concurrency control, the basic premise is that most transactions do not conflict with other transactions, and the idea is to be as permissive as possible in allowing transactions to execute. Transactions proceed in three phases:

1. Read: The transaction executes, reading values from the database and writing them to a private workspace.
2. Validation: If the transaction decides that it wants to commit, the DBMS checks whether the transaction could possibly have conflicted with any other concurrently executing transaction. If there is a possible conflict, the transaction is aborted; its private workspace is cleared and it is restarted.

3. Write: If validation determines that there are no possible conflicts, the changes to data objects made by the transaction in its private workspace are copied into the database.
If, indeed, there are few conflicts, and validation can be done efficiently, this approach should lead to better performance than locking. If there are many conflicts, the cost of repeatedly restarting transactions hurts performance significantly.
1. Ti completes (all three phases) before Tj begins.

2. Ti completes before Tj starts its Write phase, and Ti does not write any database object read by Tj.

3. Ti completes its Read phase before Tj completes its Read phase, and Ti does not write any database object that is either read or written by Tj.
Further, the first condition allows Tj to see some of Ti's changes, but clearly, they execute completely in serial order with respect to each other. The second condition allows Tj to read objects while Ti is still modifying objects, but there is no conflict because Tj does not read any object modified by Ti. Although Tj might overwrite some objects written by Ti, all of Ti's writes precede all of Tj's writes. The third condition allows Ti and Tj to write objects at the same time and thus have even more overlap in time than the second condition, but the sets of objects written by the two transactions cannot overlap. Thus, no RW, WR, or WW conflicts are possible if any of these three conditions is met.
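A compact sketch of these three tests, assuming each transaction records the timestamps of its phase boundaries together with its read and write sets (the field names are illustrative):

from dataclasses import dataclass, field

@dataclass
class Txn:
    start_ts: int
    read_end_ts: int       # end of the Read phase
    write_start_ts: int    # start of the Write phase
    finish_ts: int         # end of all three phases
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def validates_against(ti, tj):
    # Tj is being validated against an earlier transaction Ti.
    if ti.finish_ts < tj.start_ts:
        return True   # condition 1: fully serial
    if ti.finish_ts < tj.write_start_ts and not (ti.write_set & tj.read_set):
        return True   # condition 2: Ti wrote nothing that Tj read
    if (ti.read_end_ts < tj.read_end_ts
            and not (ti.write_set & (tj.read_set | tj.write_set))):
        return True   # condition 3: Ti wrote nothing Tj read or wrote
    return False      # possible conflict: Tj is aborted and restarted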
can make the Write phase long. An alternative approach (which carries the penalty of poor physical locality of objects, such as B+ tree leaf pages, that must be clustered) is to use a level of indirection. In this scheme, every object is accessed via a logical pointer, and in the Write phase, we simply switch the logical pointer to point to the version of the object in the private workspace, instead of copying the object.

Clearly, it is not the case that optimistic concurrency control has no overheads; rather, the locking overheads of lock-based approaches are replaced with the overheads of recording read-lists and write-lists for transactions, checking for conflicts, and copying changes from the private workspace. Similarly, the implicit cost of blocking in a lock-based approach is replaced by the implicit cost of the work wasted by restarted transactions.
Optimistic Concurrency Control using the three validation conditions described earlier is often overly conservative and unnecessarily aborts and restarts transactions. In particular, according to the validation conditions, Ti cannot write any object read by Tj. However, since the validation is aimed at ensuring that Ti logically executes before Tj, there is no harm if Ti writes all data items required by Tj before Tj reads them.

The basic idea is that each transaction in the Read phase tells the DBMS about items it is reading, and when a transaction Ti is committed (and its writes are accepted), the DBMS checks whether any of the items written by Ti are being read by any (yet to be validated) transaction Tj. If so, we know that Tj's validation must eventually fail. We can either allow Tj to discover this when it is validated (the die policy) or kill it and restart it immediately (the kill policy).
hash bucket containing the entry, and the lock is held while the read data item is copied from the database buffer into the private workspace of the transaction.

It seems that the 'kill' policy is always better than the 'die' policy, because it reduces the overall response time and wasted processing. However, executing T to the end has the advantage that all of the data items required for its execution are prefetched into the database buffer, and restarted executions of T will not require disk I/O for reads. This assumes that the database buffer is large enough that prefetched pages are not replaced, and, more important, that access invariance prevails; that is, successive executions of T require the same data for execution. When T is restarted, its execution time is much shorter than before because no disk I/O is required, and thus its chances of validation are higher. (Of course, if a transaction has already completed its Read phase once, subsequent conflicts should be handled using the 'kill' policy because all its data objects are already in the buffer pool.)
Timestamps can also be used in another way: Each transaction can be assigned a timestamp at startup, and we can ensure, at execution time, that if action ai of transaction Ti conflicts with action aj of transaction Tj, ai occurs before aj if TS(Ti) < TS(Tj). If an action violates this ordering, the transaction is aborted and restarted.
1. If TS(T) < RTS(O), the write action conflicts with the most recent read action of O, and T is therefore aborted and restarted.

2. If TS(T) < WTS(O), the write action is outdated. According to the Thomas Write Rule, it can safely be ignored; if the rule is not used, T is aborted and restarted.

3. Otherwise, T writes O and WTS(O) is set to TS(T).

We now consider the justification for the Thomas Write Rule. If TS(T) < WTS(O), the current write action has, in effect, been made obsolete by the most recent write of O, which follows the current write according to the timestamp ordering. We can think of T's write action as if it had occurred immediately before the most recent write of O and was never read by anyone.
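As a sketch, the write rule including the Thomas Write Rule can be written as follows, with RTS and WTS kept per object and restart handling omitted:

# Sketch of the timestamp-ordering write rule with the Thomas Write Rule.
class Abort(Exception):
    pass

class Obj:
    def __init__(self, value=None):
        self.rts = 0      # timestamp of the most recent reader of O
        self.wts = 0      # timestamp of the most recent writer of O
        self.value = value

def write(obj, ts, value):
    if ts < obj.rts:
        raise Abort("case (1): a later transaction already read O")
    if ts < obj.wts:
        return            # case (2), Thomas Write Rule: ignore obsolete write
    obj.value = value     # case (3): install the write
    obj.wts = ts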
If the Thomas Write Rule is not used, that is, T is aborted in case (2), the timestamp protocol, like 2PL, allows only conflict serializable schedules. If the Thomas Write Rule is used, some schedules are permitted that are not conflict serializable, as illustrated by the schedule in Figure 17.6.² Because T2's write follows T1's read and precedes T1's write of the same object, this schedule is not conflict serializable.
² In the other direction, 2PL permits some schedules that are not allowed by the timestamp algorithm with the Thomas Write Rule; see Exercise 17.7.
T1              T2
R(A)
                W(A)
                Commit
W(A)
Commit

T1              T2
R(A)
Commit
                W(A)
                Commit
Recoverability
T1              T2
W(A)
                R(A)
                W(B)
                Commit
This protocol represents yet another way of using timestamps, assigned at startup time, to achieve serializability. The goal is to ensure that a transaction never has to wait to read a database object, and the idea is to maintain several versions of each database object, each with a write timestamp, and let transaction Ti read the most recent version whose timestamp precedes TS(Ti).

To check this condition, every object also has an associated read timestamp, and whenever a transaction reads the object, the read timestamp is set to the maximum of the current read timestamp and the reader's timestamp. If Ti wants to write an object O and TS(Ti) < RTS(O), Ti is aborted and restarted with a new, larger timestamp. Otherwise, Ti creates a new version of O and sets the read and write timestamps of the new version to TS(Ti).

The drawbacks of this scheme are similar to those of timestamp concurrency control, and in addition, there is the cost of maintaining versions. On the other hand, reads are never blocked, which can be important for workloads dominated by transactions that only read values from the database.
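A small sketch of these read and write rules, with versions kept as (write timestamp, read timestamp, value) triples; garbage collection of old versions and transaction restart are omitted:

import bisect

class Abort(Exception):
    pass

class MVObject:
    def __init__(self, initial):
        self.versions = [[0, 0, initial]]  # sorted by write timestamp

    def _latest_before(self, ts):
        # Most recent version whose write timestamp precedes ts.
        i = bisect.bisect_right([v[0] for v in self.versions], ts) - 1
        return self.versions[i]

    def read(self, ts):
        v = self._latest_before(ts)
        v[1] = max(v[1], ts)  # bump the version's read timestamp
        return v[2]

    def write(self, ts, value):
        v = self._latest_before(ts)
        if ts < v[1]:
            # A later-timestamped transaction already read v; this
            # write would invalidate that read.
            raise Abort("restart with a larger timestamp")
        self.versions.insert(self.versions.index(v) + 1, [ts, ts, value])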
17.7 REVIEW QUESTIONS

■ When are two schedules conflict equivalent? What is a conflict serializable schedule? What is a strict schedule? (Section 17.1)

■ What does the lock manager do? Describe the lock table and transaction table data structures and their role in lock management. (Section 17.2)

■ Discuss the relative merits of lock upgrades and lock downgrades. (Section 17.3)

■ Describe and compare deadlock detection and deadlock prevention schemes. Why are detection schemes more commonly used? (Section 17.4)

■ If the collection of database objects is not fixed, but can grow and shrink through insertion and deletion of objects, we must deal with a subtle complication known as the phantom problem. Describe this problem and the index locking approach to solving the problem. (Section 17.5.1)

■ In tree index structures, locking higher levels of the tree can become a performance bottleneck. Explain why. Describe specialized locking techniques that address the problem, and explain why they work correctly despite not being two-phase. (Section 17.5.2)

■ In optimistic concurrency control, no locks are set and transactions read and modify data objects in a private workspace. How are conflicts between transactions detected and resolved in this approach? (Section 17.6.1)

■ In timestamp-based concurrency control, transactions are assigned a timestamp at startup; how is it used to ensure serializability? How does the Thomas Write Rule improve concurrency? (Section 17.6.2)

■ Explain why timestamp-based concurrency control allows schedules that are not recoverable. Describe how it can be modified through buffering to disallow such schedules. (Section 17.6.2)

■ Describe multiversion concurrency control. What are its benefits and disadvantages in comparison to locking? (Section 17.6.3)
EXERCISES

1. Describe how a typical lock manager is implemented. Why must lock and unlock be atomic operations? What is the difference between a lock and a latch? What are convoys and how should a lock manager handle them?

2. Compare lock downgrades with upgrades. Explain why downgrades violate 2PL but are nonetheless acceptable. Discuss the use of update locks in conjunction with lock downgrades.

3. Contrast the timestamps assigned to restarted transactions when timestamps are used for deadlock prevention versus when timestamps are used for concurrency control.

4. State and justify the Thomas Write Rule.

5. Show that, if two schedules are conflict equivalent, then they are view equivalent.

6. Give an example of a serializable schedule that is not strict.

7. Give an example of a strict schedule that is not serializable.

8. Motivate and describe the use of locks for improved conflict resolution in Optimistic Concurrency Control.

The actions are listed in the order they are scheduled and prefixed with the transaction name. If a commit or abort is not shown, the schedule is incomplete; assume that abort or commit must follow all the listed actions.
Exercise 17.3 Consider the following concurrency control protocols: 2PL, Strict 2PL, Conservative 2PL, Optimistic, Timestamp without the Thomas Write Rule, Timestamp with the Thomas Write Rule, and Multiversion. For each of the schedules in Exercise 17.2, state which of these protocols allows it, that is, allows the actions to occur in exactly the order shown. For the timestamp-based protocols, assume that the timestamp for transaction Ti is i and that a version of the protocol that ensures recoverability is used. Further, if the Thomas Write Rule is used, show the equivalent serial schedule.
Exercise 17.4 Consider the following sequences of actions, listed in the order they are submitted to the DBMS:

For each sequence and for each of the following concurrency control mechanisms, describe how the concurrency control mechanism handles the sequence.

Assume that the timestamp of transaction Ti is i. For lock-based concurrency control mechanisms, add lock and unlock requests to the previous sequence of actions as per the locking protocol. The DBMS processes actions in the order shown. If a transaction is blocked, assume that all its actions are queued until it is resumed; the DBMS continues with the next action (according to the listed sequence) of an unblocked transaction.
Chapter 5

Recovery

• Predict what portions of the log and database are necessary for recovery under different failure scenarios.

• Predict how recovery metadata is updated during normal operation.

• Interpret the contents of the log resulting from ARIES normal operation.

• Explain the three phases of ARIES crash recovery: analysis, redo, and undo.

• Predict how recovery metadata, system state, and the log are updated during recovery.
18
CRASH RECOVERY

■ What steps are taken in the ARIES method to recover from a DBMS crash?

■ How is the log maintained during normal operation?

■ How is the log used to recover from a crash?

■ What information in addition to the log is used during recovery?

■ What is a checkpoint and why is it used?

■ What happens if repeated crashes occur during recovery?

■ How is media failure handled?

■ How does the recovery algorithm interact with concurrency control?

Key concepts: steps in recovery, analysis, redo, undo; ARIES, repeating history; log, LSN, forcing pages, WAL; types of log records, update, commit, abort, end, compensation; transaction table, lastLSN; dirty page table, recLSN; checkpoint, fuzzy checkpointing, master log record; media recovery; interaction with concurrency control; shadow paging
We discuss recovery from a crash in Section 18.6. Aborting (or rolling back) a single transaction is a special case of Undo, discussed in Section 18.6.3. We discuss media failures in Section 18.7, and conclude in Section 18.8 with a discussion of the interaction of concurrency control and recovery and other approaches to recovery. In this chapter, we consider recovery only in a centralized DBMS; recovery in a distributed DBMS is discussed in Chapter 22.
18.1 INTRODUCTION TO ARIES

1. Analysis: Identifies dirty pages in the buffer pool (i.e., changes that have not been written to disk) and active transactions at the time of the crash.

2. Redo: Repeats all actions, starting from an appropriate point in the log, and restores the database state to what it was at the time of the crash.

3. Undo: Undoes the actions of transactions that did not commit, so that the database reflects only the actions of committed transactions.
Consider the simple execution history illustrated in Figure 18.1. When the system is restarted, the Analysis phase identifies T1 and T3 as transactions that were active at the time of the crash.
LSN    LOG
30     T2 commit
40     T2 end
60     update: T3 writes P3
       CRASH, RESTART
The second point distinguishes ARIES from other recovery algorithms and is
the basis for much of its simplicity and flexibility. In particular, ARIES can
support concurrency control protocols that involve locks of finer granularity
than a page (e.g., record-level locks). The second and third points are also
important in dealing with operations where redoing and undoing the operation are not exact inverses of each other. We discuss the interaction between concurrency control and crash recovery in Section 18.8, where we also discuss other approaches to recovery briefly.
18.2 THE LOG

The most recent portion of the log, called the log tail, is kept in main memory and is periodically forced to stable storage. This way, log records and data records are written to disk at the same granularity (pages or sets of pages).

For recovery purposes, every page in the database contains the LSN of the most recent log record that describes a change to this page. This LSN is called the pageLSN.
Additional fields depend on the type of the log record. We already mentioned
the additional contents of the various log record types, with the exception of
the update and compensation log record types, which we describe next.
Update Log Records

(Figure 18.2: the fields common to all log records, together with the additional fields for update log records.)
change are also included. The before-image is the value of the changed bytes before the change; the after-image is the value after the change. An update log record that contains both before- and after-images can be used to redo the change and undo it. In certain contexts, which we do not discuss further, we can recognize that the change will never be undone (or, perhaps, redone). A redo-only update log record contains just the after-image; similarly an undo-only update record contains just the before-image.
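As an illustration, an update log record can be modeled as a small structure holding the common fields mentioned earlier plus the update-specific ones; the field names below are illustrative:

from dataclasses import dataclass
from typing import Optional

@dataclass
class UpdateLogRecord:
    lsn: int                  # position of this record in the log
    prev_lsn: Optional[int]   # previous record of the same transaction
    trans_id: int
    page_id: int
    offset: int               # where on the page the change starts
    length: int               # number of changed bytes
    before_image: bytes       # used to undo the change
    after_image: bytes        # used to redo the change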
Compensation Log Records
As an example, consider the fourth update log record shown in Figure 18.3.
If this update is undone, a CLR would be written, and the information in it
would include the transID, pageID, length, offset, and before-image fields from
the update record. Notice that the CLR records the (undo) action of changing
the affected bytes back to the before-image value; thus, this value and the
location of the affected bytes constitute the redo information for the action
described by the CLR. The undoNextLSN field is set to the LSN of the first
log record in Figure 18.3.
Unlike an update log record, a CLR describes an action that will never be undone, that is, we never undo an undo action. The reason is simple: An update log record describes a change made by a transaction during normal execution and the transaction may subsequently be aborted, whereas a CLR describes an action taken to roll back a transaction for which the decision to abort has already been made. Therefore, the transaction must be rolled back, and the undo action described by the CLR must be carried out.
A CLR may be written to stable storage (following WAL, of course) but the undo action it describes may not yet have been written to disk when the system crashes again. In this case, the undo action described in the CLR is reapplied during the Redo phase, just like the action described in update log records.

For these reasons, a CLR contains the information needed to reapply, or redo, the change described but not to reverse it.
In addition to the log, the following two tables contain important recovery-related information:

■ Transaction table: One entry for each active transaction, including the transaction id, the status, and a field called lastLSN, the LSN of the most recent log record for the transaction.

■ Dirty page table: One entry for each dirty page in the buffer pool, including a field called recLSN, the LSN of the first log record that caused the page to become dirty.

During normal operation, these are maintained by the transaction manager and the buffer manager, respectively, and during restart after a crash, these tables are reconstructed in the Analysis phase of restart.
Consider the following simple example. Transaction T1000 changes the value of bytes 21 to 23 on page P500 from 'ABC' to 'DEF', transaction T2000 changes 'HIJ' to 'KLM' on page P600, transaction T2000 changes bytes 20 through 22 from 'GDE' to 'QRS' on page P500, then transaction T1000 changes 'TUV' to 'WXY' on page P505. The dirty page table (with its pageID and recLSN columns), the transaction table,³ and the log at this instant are shown in Figure 18.3. Observe that the log is shown growing from top to bottom; older records are at the top. Although the records for each transaction are linked using the prevLSN field, the log as a whole also has a sequential order that is important; for example, T2000's change to page P500 follows T1000's change to page P500, and in the event of a crash, these changes must be redone in the same order.

³ The status field is not shown in the figure for space reasons; all transactions are in progress.
18.4 THE WRITE-AHEAD LOG PROTOCOL
Before writing a page to disk, every update log record that describes a change
to this page must be forced to stable storage. This is accomplished by forcing
all log records up to and including the one with LSN equal to the pageLSN to
stable storage before writing the page to disk.
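A minimal sketch of enforcing this rule, with an invented flushed_lsn marker for the largest LSN already on stable storage and stand-ins for the actual I/O:

class Log:
    def __init__(self):
        self.records = []
        self.flushed_lsn = -1  # largest LSN already on stable storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the new record's LSN

    def flush(self, up_to_lsn):
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)  # force log tail

class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = -1  # LSN of the most recent change to this page

def disk_write(page):
    pass  # stand-in for the actual page I/O

def write_page_to_disk(page, log):
    if page.page_lsn > log.flushed_lsn:
        log.flush(page.page_lsn)  # WAL: log records first, then the page
    disk_write(page)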
When a transaction is committed, the log tail is forced to stable storage, even
if a no-force approach is being used. It is worth contrasting this operation with
the actions taken under a force approach: If a force approach is used, all the
pages modified by the transaction, rather than a portion of the log that includes
all its records, must be forced to disk when the transaction commits. The set of
all changed pages is typically much larger than the log tail because the size of
an update log record is close to twice the size of the changed bytes, which is
likely to be much smaller than the page size. Further, the log is maintained as a
sequential file, and all writes to the log are sequential writes. Consequently, the
cost of forcing the log tail is much smaller than the cost of writing all changed
pages to disk.
18.5 CHECKPOINTING
When the system comes back up after a crash, the restart process begins by
locating the most recent checkpoint record. For uniformity, the system always
begins normal execution by taking a checkpoint, in which the transaction table
and dirty page table are both empty.
18.6 RECOVERING FROM A SYSTEM CRASH

When the system is restarted after a crash, the recovery manager proceeds in three phases, as shown in Figure 18.4.
(Figure 18.4: the three phases of restart. Point A marks the oldest log record of transactions active at the crash, point B the smallest recLSN in the dirty page table at the end of Analysis, and point C the most recent checkpoint. Analysis starts at C, Redo starts at B, and Undo works backward to A.)
Observe that the relative order of the three points A, B, and C in the log may differ from that shown in Figure 18.4. The three phases of restart are described in more detail in the following sections.
1. It determines the point in the log at which to start the Redo pass.
2. It determines (a conservative superset of the) pages in the buffer pool that
were dirty at the time of the crash.
3. It identifies transactions that were active at the time of the crash and must
be undone.
Analysis begins by examining the most recent begin_checkpoint log record and
initializing the dirty page table and transaction table to the copies of those
structures in the next end_checkpoint record. Thus, these tables are initialized
to the set of dirty pages and active transactions at the time of the checkpoint.
(If additional log records are between the begin_checkpoint and end_checkpoint records, the tables must be adjusted to reflect the information in these records, but we omit the details of this step. See Exercise 18.9.) Analysis then scans the log in the forward direction until it reaches the end of the log:

■ If an end log record for a transaction T is encountered, T is removed from the transaction table.

■ If any other log record for a transaction T is encountered, T is added to the transaction table if it is not already there, and the lastLSN field for T is set to the LSN of this record; if the record is a commit record, the status is set to C.

■ If the log record is redoable and affects a page P that is not in the dirty page table, P is added to the table with recLSN equal to the LSN of this record.
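A sketch of this forward scan, using simplified log records and plain dictionaries for the two tables (checkpoint handling and transaction statuses other than U and C are omitted):

from collections import namedtuple

LogRec = namedtuple("LogRec", "type trans_id page_id")

def analysis(log, start_lsn):
    txn_table, dirty_page_table = {}, {}
    for lsn in range(start_lsn, len(log)):
        rec = log[lsn]
        if rec.type == "end":
            txn_table.pop(rec.trans_id, None)
            continue
        entry = txn_table.setdefault(rec.trans_id, {"status": "U"})
        entry["lastLSN"] = lsn
        if rec.type == "commit":
            entry["status"] = "C"
        if rec.type in ("update", "CLR") and rec.page_id is not None:
            dirty_page_table.setdefault(rec.page_id, lsn)  # recLSN
    return txn_table, dirty_page_table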
At the end of the Analysis phase, the transaction table contains an accurate list of all transactions that were active at the time of the crash; this is the set of transactions with status U. The dirty page table includes all pages that were dirty at the time of the crash but may also contain some pages that were written to disk. If an end_write log record were written at the completion of each write operation, the dirty page table constructed during Analysis could be made more accurate, but in ARIES, the additional cost of writing end_write log records is not considered to be worth the gain.
The dirty page table and the transaction table, held in memory, are lost in the crash. The most recent checkpoint was taken at the beginning of the execution, with an empty transaction table and dirty page table; it is not shown in Figure 18.3. After examining this log record, which we assume is just before the first log record shown in the figure, Analysis initializes the two tables to be empty. Scanning forward in the log, T1000 is added to the transaction table; in addition, P500 is added to the dirty page table with recLSN equal to the LSN of the first shown log record. Similarly, T2000 is added to the transaction table and P600 is added to the dirty page table. There is no change based on the third log record, and the fourth record results in the addition of P505 to
the dirty page table. The commit record for T2000 (not in the figure) is now encountered, and T2000 is removed from the transaction table.

The Analysis phase is now complete, and it is recognized that the only active transaction at the time of the crash is T1000, with lastLSN equal to the LSN of the fourth record in Figure 18.3. The dirty page table reconstructed in the Analysis phase is identical to that shown in the figure. The update log record for the change to P700 is lost in the crash and not seen during the Analysis pass. Thanks to the WAL protocol, however, all is well: the corresponding change to page P700 cannot have been written to disk either!
The Redo phase begins with the log record that has the smallest recLSN of all pages in the dirty page table constructed by the Analysis pass because this log record identifies the oldest update that may not have been written to disk prior to the crash. Starting from this log record, Redo scans forward until the end of the log. For each redoable log record (update or CLR) encountered, Redo checks whether the logged action must be redone. The action must be redone unless one of the following conditions holds:
■ The affected page is not in the dirty page table.

■ The affected page is in the dirty page table, but the recLSN for the entry is greater than the LSN of the log record being checked.

■ The pageLSN stored on the page itself (which must be retrieved to check this condition) is greater than or equal to the LSN of the log record being checked.
The first condition obviously means that all changes to this page have been written to disk. Because the recLSN is the first update to this page that may
not have been written to disk, the second condition means that the update being checked was indeed propagated to disk. The third condition, which is checked last because it requires us to retrieve the page, also ensures that the update being checked was written to disk, because either this update or a later update to the page was written. (Recall our assumption that a write to a page is atomic; this assumption is important here!)
If the logged action must be redone:

1. The logged action is reapplied.

2. The pageLSN on the page is set to the LSN of the redone log record. No additional log record is written at this time.
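Putting the skip conditions and the redo action together, a sketch of the Redo pass might look as follows; fetch_page and apply_change are hypothetical stand-ins for the buffer manager and for reapplying the logged change:

def apply_change(page, rec):
    pass  # stand-in: reapply the after-image described by rec

def must_redo(lsn, rec, dirty_page_table, fetch_page):
    if rec.page_id not in dirty_page_table:
        return False                           # condition 1
    if dirty_page_table[rec.page_id] > lsn:
        return False                           # condition 2: recLSN > LSN
    page = fetch_page(rec.page_id)             # condition 3 needs the page
    return page.page_lsn < lsn

def redo_pass(log, dirty_page_table, fetch_page):
    for lsn in range(min(dirty_page_table.values()), len(log)):
        rec = log[lsn]
        if rec.type in ("update", "CLR") and must_redo(lsn, rec, dirty_page_table, fetch_page):
            page = fetch_page(rec.page_id)
            apply_change(page, rec)
            page.page_lsn = lsn                # no new log record is written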
Let us continue with the example discussed in Section 18.6.1. From the dirty page table, the smallest recLSN is seen to be the LSN of the first log record shown in Figure 18.3. Clearly, the changes recorded by earlier log records (there happen to be none in this example) have been written to disk. Now, Redo fetches the affected page, P500, and compares the LSN of this log record with the pageLSN on the page and, because we assumed that this page was not written to disk before the crash, finds that the pageLSN is less. The update is therefore reapplied; bytes 21 through 23 are changed to 'DEF', and the pageLSN is set to the LSN of this update log record.
Redo then examines the second log record. Again, the affected page, P600, is fetched and the pageLSN is compared to the LSN of the update log record. In this case, because we assumed that P600 was written to disk before the crash, they are equal, and the update does not have to be redone.
The remaining log records are processed similarly, bringing the system back to the exact state it was in at the time of the crash. Note that the first two conditions indicating that a redo is unnecessary never hold in this example. Intuitively, they come into play when the dirty page table contains a very old recLSN, going back to before the most recent checkpoint. In this case, as Redo scans forward from the log record with this LSN, it encounters log records for pages that were written to disk prior to the checkpoint and therefore not in the dirty page table in the checkpoint. Some of these pages may be dirtied again after the checkpoint; nonetheless, the updates to these pages prior to the checkpoint need not be redone. Although the third condition alone is sufficient to recognize that these updates need not be redone, it requires us to fetch the affected page. The first two conditions allow us to recognize this situation without fetching the page. (The reader is encouraged to construct examples that illustrate the use of each of these conditions; see Exercise 18.8.)
At the end of the Redo phase, end type records are written for all transactions with status C, which are removed from the transaction table.
The Undo phase, unlike the other two phases, scans backward from the end of the log. The goal of this phase is to undo the actions of all transactions active at the time of the crash, that is, to effectively abort them. This set of transactions is identified in the transaction table constructed by the Analysis phase.

Undo begins with the transaction table constructed by the Analysis phase, which identifies all transactions active at the time of the crash, and includes the LSN of the most recent log record (the lastLSN field) for each such transaction. Such transactions are called loser transactions. All actions of losers must be undone, and further, these actions must be undone in the reverse of the order in which they appear in the log.
Consider the set of lastLSN values for all loser transactions. Let us call this set ToUndo. Undo repeatedly chooses the largest (i.e., most recent) LSN value in this set and processes it, until ToUndo is empty. To process a log record:

1. If it is a CLR and the undoNextLSN value is not null, the undoNextLSN value is added to the set ToUndo; if the undoNextLSN is null, an end record is written for the transaction because it is completely undone, and the CLR is discarded.

2. If it is an update record, a CLR is written and the corresponding action is undone, and the prevLSN value in the update log record is added to the set ToUndo (if the prevLSN is null, the transaction is completely undone and an end record is written).
When the set ToUndo is empty, the Undo phase is complete. Restart is now complete, and the system can proceed with normal operations.
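A sketch of this loop, with simplified log records and a stand-in for restoring the before-image:

from collections import namedtuple

Rec = namedtuple("Rec", "type trans_id prev_lsn undo_next_lsn")

def undo_change(rec):
    pass  # stand-in: restore the before-image of the undone update

def undo_pass(log, loser_last_lsns):
    to_undo = set(loser_last_lsns)  # lastLSN of each loser transaction
    while to_undo:
        lsn = max(to_undo)          # most recent action is undone first
        to_undo.discard(lsn)
        rec = log[lsn]
        if rec.type == "CLR":
            if rec.undo_next_lsn is not None:
                to_undo.add(rec.undo_next_lsn)
            else:                   # loser is completely undone
                log.append(Rec("end", rec.trans_id, None, None))
        elif rec.type == "update":
            undo_change(rec)
            # CLR whose undoNextLSN is the prevLSN of the undone record.
            log.append(Rec("CLR", rec.trans_id, None, rec.prev_lsn))
            if rec.prev_lsn is not None:
                to_undo.add(rec.prev_lsn)
            else:
                log.append(Rec("end", rec.trans_id, None, None))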
Let us continue with the scenario discussed in Sections 18.6.1 and 18.6.2. The only active transaction at the time of the crash was determined to be T1000. From the transaction table, we get the LSN of its most recent log record, which is the fourth update log record in Figure 18.3. The update is undone, and a CLR is written with undoNextLSN equal to the LSN of the first log record in the figure. The next record to be undone for transaction T1000 is the first log record in the figure. After this is undone, a CLR and an end log record for T1000 are written, and the Undo phase is complete.
In this example, undoing the action recorded in the first log record causes the action of the third log record, which is due to a committed transaction, to be overwritten and thereby lost! This situation arises because T2000 overwrote a data item written by T1000 while T1000 was still active; if Strict 2PL were followed, T2000 would not have been allowed to overwrite this data item.
Aborting a Transaction
LSN    LOG
10     update: T1 writes P5
30     T1 abort
60     update: T2 writes P5
       CRASH, RESTART
       CRASH, RESTART
The log shows the order in which the DBMS executed various actions; note that the LSNs are in ascending order, and that each log record for a transaction has a prevLSN field that points to the previous log record for that transaction. We have not shown null prevLSNs, that is, some special value used in the prevLSN field of the first log record for a transaction to indicate that there is no previous log record. We also compacted the figure by occasionally displaying two log records (separated by a comma) on a single line.
Log record (with LSN) 30 indicates that T1 aborts. All actions of this transaction should be undone in reverse order, and the only action of T1, described by the update log record 10, is indeed undone as indicated by CLR 40.
After the first crash, Analysis identifies P1 (with recLSN 50), P3 (with recLSN 20), and P5 (with recLSN 10) as dirty pages. Log record 45 shows that T1 is a completed transaction; hence, the transaction table identifies T2 (with lastLSN 60) and T3 (with lastLSN 50) as active at the time of the crash. The Redo phase begins with log record 10, which is the minimum recLSN in the dirty page table, and reapplies all actions (for the update and CLR records), as per the Redo algorithm presented in Section 18.6.2.
The ToUndo set consists of LSNs 60, for T2, and 50, for T3. The Undo phase now begins by processing the log record with LSN 60 because 60 is the largest LSN in the ToUndo set. The update is undone, and a CLR (with LSN 70) is written to the log. This CLR has undoNextLSN equal to 20, which is the prevLSN value in log record 60; 20 is the next action to be undone for T2. Now the largest remaining LSN in the ToUndo set is 50. The write corresponding to log record 50 is now undone, and a CLR describing the change is written. This CLR has LSN 80, and its undoNextLSN field is null because 50 is the only log record for transaction T3. Therefore T3 is completely undone, and an end record is written. Log records 70, 80, and 85 are written to stable storage before the system crashes a second time; however, the changes described by these records may not have been written to disk.
When the system is restarted after the second crash, Analysis determines that the only active transaction at the time of the crash was T2; in addition, the dirty page table is identical to what it was during the previous restart. Log records 10 through 85 are processed again during Redo. (If some of the changes made during the previous Redo were written to disk, the pageLSNs on the affected pages are used to detect this situation and avoid writing these pages again.) The Undo phase considers the only LSN in the ToUndo set, 70, and processes it by adding the undoNextLSN value (20) to the ToUndo set. Next, log record 20 is processed by undoing T2's write of page P3, and a CLR is written (LSN 90). Because 20 is the first of T2's log records (and therefore the last of its records to be undone), T2 is completely undone, and an end record is written.
Recovery is now complete, and normal execution can resume with the writing of a checkpoint record.
This example illustrated repeated crashes during the Undo phase. For completeness, let us consider what happens if the system crashes while Restart is in the Analysis or Redo phase. If a crash occurs during the Analysis phase, all the work done in this phase is lost, and on restart the Analysis phase starts afresh with the same information as before. If a crash occurs during the Redo phase, the only effect that survives the crash is that some of the changes made during Redo may have been written to disk prior to the crash. Restart starts again with the Analysis phase and then the Redo phase, and some update log records that were redone the first time around will not be redone a second time because the pageLSN is now equal to the update record's LSN (although the pages have to be fetched again to detect this).

We can take checkpoints during Restart to minimize repeated work in the event of a crash, but we do not discuss this point.
18.7 MEDIA RECOVERY

When a database object such as a file or a page is corrupted, the copy of that object is brought up-to-date by using the log to identify and reapply the changes of committed transactions and undo the changes of uncommitted transactions (as of the time of the media recovery operation).
Finally, the updates of transactions that are incomplete at the time of media recovery or that were aborted after the fuzzy copy was completed need to be undone to ensure that the page reflects only the actions of committed transactions. The set of such transactions can be identified as in the Analysis pass, and we omit the details.
Like ARIES, the most popular alternative recovery algorithms also maintain a log of database actions according to the WAL protocol. A major distinction between ARIES and these variants is that the Redo phase in ARIES repeats history, that is, redoes the actions of all transactions, not just the non-losers. Other algorithms redo only the non-losers, and the Redo phase follows the Undo phase, in which the actions of losers are rolled back.
changes; however, other transactions continue to see the original page table, and therefore the original page, until this transaction commits. Aborting a transaction is simple: Just discard its shadow versions of the page table and the data pages. Committing a transaction involves making its version of the page table public and discarding the original data pages that are superseded by shadow pages.

This scheme suffers from a number of problems. First, data becomes highly fragmented due to the replacement of pages by shadow versions, which may be located far from the original page. This phenomenon reduces data clustering and makes good garbage collection imperative. Second, the scheme does not yield a sufficiently high degree of concurrency. Third, there is a substantial storage overhead due to the use of shadow pages. Fourth, the process aborting a transaction can itself run into deadlocks, and this situation must be specially handled because the semantics of aborting an abort transaction gets murky.

For these reasons, even in System R, shadow paging was eventually superseded by WAL-based recovery techniques.
18.9 REVIEW QUESTIONS

■ What are the different types of log records and when are they written? (Section 18.2)

■ What information is maintained in the transaction table and the dirty page table? (Section 18.3)

■ In which direction does the Analysis phase of recovery scan the log? At which point in the log does it begin and end the scan? (Section 18.6.1)

■ Describe what information is gathered in the Analysis phase and how. (Section 18.6.1)
■ What is a redoable log record? Under what conditions is the logged action redone? What steps are carried out when a logged action is redone? (Section 18.6.2)

■ What are loser transactions? How are they processed in the Undo phase and in what order? (Section 18.6.3)

■ Explain what happens if there are crashes during the Undo phase of recovery. What is the role of CLRs? What if there are crashes during the Analysis and Redo phases? (Section 18.6.3)

■ How does a DBMS recover from media failure without reading the complete log? (Section 18.7)
EXERCISES
1. How does the recovery manager ensure atomicity of transactions? How does it ensure
durability?
2. What is the difference between stable storage and disk?
3. What is the difference between a system crash and a media failure?
4. Explain the WAL protocol.
5. Describe the steal and no-force policies.
1. What are the roles of the Analysis, Redo, and Undo phases in ARIES?
2. Consider the execution shown in Figure 18.6.
LSN    LOG
00     begin_checkpoint
10     end_checkpoint
20     update: T1 writes P5
30     update: T2 writes P3
40     T2 commit
50     T2 end
60     update: T3 writes P3
70     T1 abort
       CRASH, RESTART
LSN    LOG
00     update: T1 writes P2
10     update: T1 writes P1
20     update: T2 writes P5
30     update: T3 writes P3
40     T3 commit
50     update: T2 writes P5
60     update: T2 writes P3
70     T2 abort
(a) What is done during Analysis? (Be precise about the points at which Analysis
begins and ends and describe the contents of any tables constructed in this phase.)
(b) What is done during Redo? (Be precise about the points at which Redo begins and
ends.)
(c) What is done during Undo? (Be precise about the points at which Undo begins
and ends.)
LSN    LOG
00     begin_checkpoint
10     end_checkpoint
50     T2 commit
70     T2 end
80     update: T1 writes P5
90     T3 abort
       CRASH, RESTART
3. Show the log after T2 is rolled back, including all prevLSN and undoNextLSN values in log records.
Exercise 18.5 Consider the execution shown in Figure 18.8. In addition, the system crashes during recovery after writing two log records to stable storage and again after writing another two log records.
1. What is the value of the LSN stored in the master log record?
2. What is done during Analysis?
3. What is done during Redo?
4. What is done during Undo?
5. Show the log when recovery is complete, including all non-null prevLSN and undoNextLSN values in log records.
LSN    LOG
00     begin_checkpoint
10     update: T1 writes P1
20     T1 commit
30     update: T2 writes P2
40     T1 end
50     T2 abort
60     update: T3 writes P3
70     end_checkpoint
80     T3 commit
       CRASH, RESTART
6. Give an example that illustrates how the paradigm of repeating history and the use of
CLRs allow ARIES to support locks of finer granularity than a page.
1. If the system fails repeatedly during recovery, what is the maximum number of log
records that can be written (as a function of the number of update and other log records
written before the crash) before restart completes successfully?
2. What is the oldest log record we need to retain?
3. If a bounded amount of stable storage is used for the log, how can we always ensure
enough stable storage to hold all log records written during restart?
Exercise 18.8 Consider the three conditions under which a redo is unnecessary (Section 18.6.2).
Exercise 18.9 The description in Section 18.6.1 of the Analysis phase made the simplifying assumption that no log records appeared between the begin_checkpoint and end_checkpoint records for the most recent complete checkpoint. The following questions explore how such records should be handled.
1. Explain why log records could be written between the begin_checkpoint and end_checkpoint records.

2. Describe how the Analysis phase could be modified to handle such records.

3. Consider the execution shown in Figure 18.9. Show the contents of the end_checkpoint record.

4. Illustrate your modified Analysis phase on the execution shown in Figure 18.9.
Chapter 6

Experimental Design

Chapter 6 of Lilja's book is given to deepen understanding of available measurement strategies; however, it is to be considered as an additional reading and not fundamental to the attainment of the learning goals above.
Introduction
If the automobile industry had followed the same development cycles as the computer industry, it has been speculated that a Rolls Royce car would cost less than $100 with an efficiency of more than 200 miles per gallon of gasoline. While we certainly get more car for our money now than we did twenty years ago, no other industry has ever changed at the incredible rate of the computer and electronics industry.
Computer systems have gone from being the exclusive domain of a few scientists and engineers who used them to speed up some esoteric computations, such as calculating the trajectory of artillery shells, for instance, to being so common that they go unnoticed. They have replaced many of the mechanical control systems in our cars, thereby reducing cost while improving efficiency, reliability, and performance. They have made possible such previously science-fiction-like devices as cellular phones. They have provided countless hours of entertainment for children ranging in age from one to one hundred. They have even brought sound to the common greeting card. One constant throughout this proliferation and change, however, has been the need for system developers and users to understand the performance of these computer-based devices.
While measuring the cost of a system is usually relatively straightforward (except for the confounding effects of manufacturers' discounts to special customers), determining the performance of a computer system can oftentimes seem like an exercise in futility. Surprisingly, one of the main difficulties in measuring performance is that reasonable people often disagree strongly on how performance should be measured or interpreted, and even on what 'performance' actually means.
The goals of any analysis of the performance of a computer system, or one of its components, will depend on the specific situation and the skills, interests, and abilities of the analyst. However, we can identify several different typical goals of performance analysis that are useful both to computer-system designers and to users.
simulation can still provide useful insights into the effect of the memory system on the performance of a specific application program.

Finally, a simple analytical model of the memory system can be developed as follows. Let tc be the time delay observed by a memory reference if the memory location being referenced is in the cache. Also, let tm be the corresponding delay if the referenced location is not in the cache. The cache hit ratio, denoted h, is the fraction of all memory references issued by the processor that are satisfied by the cache. The fraction of references that miss in the cache and so must also access the memory is 1 - h. Thus, the average time required for all cache hits is h tc while the average time required for all cache misses is (1 - h)tm. A simple model of the overall average memory-access time observed by an executing program then is

tavg = h tc + (1 - h)tm.
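For example, with illustrative values tc = 2 ns, tm = 50 ns, and h = 0.95, the model gives tavg = 0.95 × 2 + 0.05 × 50 = 4.4 ns; note that the 5% of references that miss contribute more than half of the average.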
The key is choosing the most appropriate solution technique for a given situation. The following chapters are designed to help you develop precisely this skill.
1.4 Summary
1.5 Exercises
Metrics of performance
'Time is a great teacher, but unfortunately it kills all its pupils.'
Hector Berlioz
For instance, we may need to count how many times a processor initiates an input/output request. We may also be interested in how long each of these requests takes. Finally, it is probably also useful to determine the number of bits transmitted and stored.

From these types of measured values, we can derive the actual value that we wish to use to describe the performance of the system. This value is called a performance metric.
If we are interested specifically in the time, count, or size value measured, we can use that value directly as our performance metric. Often, however, we are interested in normalizing event counts to a common time basis to provide a speed metric such as operations executed per second. This type of metric is called a rate metric or throughput and is calculated by dividing the count of the number of events that occur in a given interval by the time interval over which the events occur. Since a rate metric is normalized to a common time basis, such as seconds, it is useful for comparing different measurements made over different time intervals.

Choosing an appropriate performance metric depends on the goals for the specific situation and the cost of gathering the necessary information. For
example, suppose that you need to choose between two different computer systems to use for a short period of time for one specific task, such as choosing between two systems to do some word processing for an afternoon. Since the penalty for being wrong in this case, that is, choosing the slower of the two machines, is very small, you may decide to use the processors' clock frequencies as the performance metric. Then you simply choose the system with the fastest clock. However, since the clock frequency is not a reliable performance metric (see Section 2.3.1), you would want to choose a better metric if you are trying to decide which system to buy when you expect to purchase hundreds of systems for your company. Since the consequences of being wrong are much larger in this case (you could lose your job, for instance!), you should take the time to perform a rigorous comparison using a better performance metric. This situation then begs the question of what constitutes a good performance metric.
There are many different metrics that have been used to describe a computer system's performance. Some of these metrics are commonly used throughout the field, such as MIPS and MFLOPS (which are defined later in this chapter), whereas others are invented for new situations as they are needed. Experience has shown that not all of these metrics are 'good' in the sense that sometimes using a particular metric can lead to erroneous or misleading conclusions. Consequently, it is useful to understand the characteristics of a 'good' performance metric. This understanding will help when deciding which of the existing performance metrics to use for a particular situation, and when developing a new performance metric.
A performance metric that satisfies all of the following requirements is generally useful to a performance analyst in allowing accurate and detailed comparisons of different measurements. These criteria have been developed by observing the results of numerous performance analyses over many years. While they should not be considered absolute requirements of a performance metric, it has been observed that using a metric that does not satisfy these requirements can often lead the analyst to make erroneous conclusions.
1. Linearity. Since humans intuitively tend to think in linear terms, the value of
the metric should be linearly proportional to the actual performance of the
machine. That is, if the value of the metric changes by a certain ratio, the
actual performance of the machine should change by the same ratio. This
proportionality characteristic makes the metric intuitively appealing to most
people. For example, suppose that you are upgrading your system to a system
whose speed metric (i.e. execution-rate metric) is twice as large as the same
metric on your current system. You then would expect the new system to be
able to run your application programs in half the time taken by your old
system. Similarly, if the metric for the new system were three times larger than
that of your current system, you would expect to see the execution times
reduced to one-third of the original values.
it is that the metric will be determined incorrectly. The only thing worse than a
bad metric is a metric whose value is measured incorrectly.
5. Consistency. A consistent performance metric is one for which the units of the
metric and its precise definition are the same across different systems and
different configurations of the same system. If the units of a metric are not
consistent, it is impossible to use the metric to compare the performances of
the different systems. While the necessity for this characteristic would also
seem obvious, it is not satisfied by many popular metrics, such as MIPS
(Section 2.3.2) and MFLOPS (Section 2.3.3).
A wide variety of performance metrics has been proposed and used in the
computer field. Unfortunately, many of these metrics are not good in the sense
defined above, or they are often used and interpreted incorrectly. The following
subsections describe many of these common metrics and evaluate them against
the above characteristics of a good performance metric.
2.3.2 MIPS
formed by an instruction are at the heart of the differences between RISC and
CISC processors and render MIPS essentially useless as a performance metric.
Another derisive explanation of the MIPS acronym is meaningless indicator of
performance, since it is really no better a measure of overall performance than is
the processor's clock frequency.
2.3.3 MFLOPS

$$\text{MFLOPS} = \frac{F}{t_e \times 10^6} \qquad (2.2)$$

where $F$ is the number of floating-point operations executed by the program and $t_e$ is its execution time in seconds.
2.3.4 SPEC
1. Measure the time required to execute each program in the set on the system
being tested.
2. Divide the time measured for each program in the first step by the time
required to execute each program on a standard basis machine to normalize
the execution times.
3. Average together all of these normalized values using the geometric mean (see
Section 3.3.4) to produce a single-number performance metric; a small sketch of
this calculation appears below.
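The following minimal C sketch illustrates the three steps just listed; the base-machine and measured times are hypothetical values, and real SPEC ratings involve many additional rules.

    #include <stdio.h>
    #include <math.h>

    /* A minimal sketch of the SPEC-style calculation: normalize each
       measured time by the basis machine's time, then take the geometric
       mean of the normalized values. All timings are hypothetical. */
    int main(void)
    {
        double base[] = { 100.0, 250.0, 80.0, 400.0 };  /* basis machine */
        double test[] = {  40.0, 125.0, 50.0, 160.0 };  /* system tested */
        int n = 4;
        double log_sum = 0.0;

        for (int i = 0; i < n; i++) {
            double ratio = test[i] / base[i];  /* step 2: normalized time */
            log_sum += log(ratio);
        }
        /* step 3: geometric mean of the normalized values */
        printf("geometric mean of normalized times = %.3f\n",
               exp(log_sum / n));
        return 0;
    }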
While the SPEC methodology is certainly more rigorous than is using MIPS or
MFLOPS as a measure of performance, it still produces a problematic
performance metric. One shortcoming is that averaging together the individual
normalized results with the geometric mean produces a metric that is not linearly related
to a program's actual execution time. Thus, the SPEC metric is not intuitive
(characteristic 1). Furthermore, and more importantly, it has been shown to be
an unreliable metric (characteristic 2) in that a given program may execute faster
on a system that has a lower SPEC rating than it does on a competing system
with a higher rating.
Finally, although the defined methodology appears to make the metric
independent of outside influences (characteristic 6), it is actually subject to a wide
range of tinkering. For example, many compiler developers have used these
benchmarks as practice programs, thereby tuning their optimizations to the
characteristics of this collection of applications. As a result, the execution times of the
collection of programs in the SPEC suite can be quite sensitive to the particular
selection of optimization flags chosen when the program is compiled. Also, the
selection of specific programs that comprise the SPEC suite is determined by a
committee of representatives from the manufacturers within the cooperative. This
committee is subject to numerous outside pressures since each manufacturer has a
strong interest in advocating application programs that will perform well on their
machines. Thus, while SPEC is a significant step in the right direction towards
defining a good performance metric, it still falls short of the goal.
2.3.5 QUIPS
The QUIPS metric, which was developed in conjunction with the HINT
benchmark program, is a fundamentally different type of performance metric. (The
details of the HINT benchmark and the precise definition of QUIPS are given in
Section 7.2.3.) Instead of defining the effort expended to reach a certain result as
the measure of what is accomplished, the QUIPS metric defines the quality of the
solution as a more meaningful indication of a user's final goal. The quality is
rigorously defined on the basis of mathematical characteristics of the problem
being solved. Dividing this measure of solution quality by the time required to
achieve that level of quality produces QUIPS, or quality improvements per
second.
This new performance metric has several of the characteristics of a good
performance metric. The mathematically precise definition of 'quality' for the
defined problem makes this metric insensitive to outside influences (characteristic
6) and makes it entirely self-consistent when it is ported to different machines
(characteristic 5). It is also easily repeatable (characteristic 3) and it is linear
(characteristic 1) since, for the particular problem chosen for the HINT
benchmark, the resulting measure of quality is linearly related to the time required to
obtain the solution.
Given the positive aspects of this metric, it still does present a few potential
difficulties when used as a general-purpose performance metric. The primary
potential difficulty is that it need not always be a reliable metric (characteristic
2) due to its narrow focus on floating-point and memory system performance. It
is generally a very good metric for predicting how a computer system will
perform when executing numerical programs. However, it does not exercise some
aspects of a system that are important when executing other types of application
programs, such as the input/output subsystem, the instruction cache, and the
operating system's ability to multiprogram, for instance. Furthermore, while the
developers have done an excellent job of making the HINT benchmark easy to
measure (characteristic 4) and portable to other machines, it is difficult to change
the quality definition. A new problem must be developed to focus on other
aspects of a system's performance since the definition of quality is tightly coupled
to the problem being solved. Developing a new problem to more broadly exercise
the system could be a difficult task since it must maintain all of the characteristics
described above.
Despite these difficulties, QUIPS is an important new type of metric that
rigorously defines interesting aspects of performance while providing enough
flexibility to allow new and unusual system architectures to demonstrate their
capabilities. While it is not a completely general-purpose metric, it should prove
to be very useful in measuring a system's numerical processing capabilities.
It also should be a strong stimulus for greater rigor in defining future
performance metrics.
main()
{
    int i;
    float a;

    init_timer();
    /* Stuff to be measured */
    for (i = 0; i < 1000; i++) {
        a = i * a / 10;
    }
2.5 Speedup and relative change
Speedup and relative change are useful metrics for comparing systems since they
normalize performance to a common basis. Although these metrics are defined in
terms of throughput or speed metrics, they are often calculated directly from
execution times, as described below.
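For reference, the speedup definition that the following discussion relies on falls outside this excerpt; in the notation already introduced (and presumably numbered (2.3) in the original), with $R_1$, $R_2$ the throughputs and $T_1$, $T_2$ the execution times of the two systems running the same program:

$$\text{Speedup of system 2 w.r.t. system 1} = S_{2,1} = \frac{R_2}{R_1} = \frac{T_1}{T_2} \qquad (2.3)$$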
If system 2 is faster than system 1, then $T_2 < T_1$ and the speedup ratio will be
larger than 1. If system 2 is slower than system 1, however, the speedup ratio will
be less than 1. This situation is often referred to as a slowdown instead of a
speedup.
Relative change. Another technique for normalizing performance is to express
the performance of a system as a percent change relative to the performance of
another system. We again use the throughput metrics $R_1$ and $R_2$ as measures of
the speeds of systems 1 and 2, respectively. The relative change of system 2 with
respect to system 1, denoted $\Delta_{2,1}$ (that is, using system 1 as the basis), is then
defined to be

$$\text{Relative change of system 2 w.r.t. system 1} = \Delta_{2,1} = \frac{R_2 - R_1}{R_1} \qquad (2.4)$$

Again assuming that the execution time of each system is measured when
executing the same program, the 'distance traveled' by each system is the same,
so that $R_1 = D/T_1$ and $R_2 = D/T_2$. Thus,

$$\Delta_{2,1} = \frac{R_2 - R_1}{R_1} = \frac{D/T_2 - D/T_1}{D/T_1} = \frac{T_1 - T_2}{T_2} \qquad (2.5)$$

Typically, the value of $\Delta_{2,1}$ is multiplied by 100 to express the relative change
as a percentage with respect to a given basis system. This definition of relative
change will produce a positive value if system 2 is faster than system 1, whereas a
negative value indicates that the basis system is faster.
Example. As an example of how to apply these two different normalization
techniques, the speedup and relative change of the systems shown in Table 2.1
are found using system 1 as the basis. From the raw execution times, we can
easily see that system 4 is the fastest, followed by systems 2, 1, and 3, in that
order. However, the speedup values give us a more precise indication of exactly
how much faster one system is than the others. For instance, system 2 has a
speedup of 1.33 compared with system 1 or, equivalently, it is 33% faster. System
4 has a speedup ratio of 2.29 compared with system 1 (or it is 129% faster). We
also see that system 3 is actually 11% slower than system 1, giving it a slowdown
factor of 0.89. ◊

Table 2.1. An example of calculating speedup and relative change using system 1 as
the basis

System   Execution time   Speedup   Relative change (%)
1        480              1         0
2        360              1.33      +33
3        540              0.89      -11
4        210              2.29      +129
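The entries in Table 2.1 can be verified with a few lines of C; the program below applies the speedup and relative-change definitions to the execution times in the table.

    #include <stdio.h>

    /* A minimal sketch reproducing Table 2.1: speedup and relative change
       of each system using system 1 as the basis. */
    int main(void)
    {
        double t[] = { 480.0, 360.0, 540.0, 210.0 };  /* systems 1..4 */
        double t_basis = t[0];

        for (int i = 0; i < 4; i++) {
            double speedup = t_basis / t[i];              /* S_i,1 = T_1/T_i */
            double rel_change = (t_basis - t[i]) / t[i];  /* Delta_i,1     */
            printf("system %d: speedup %.2f, relative change %+.0f%%\n",
                   i + 1, speedup, 100.0 * rel_change);
        }
        return 0;
    }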
2.6 Means versus ends metrics

s = 0;
for (i = 1; i < N; i++)
    s = s + x[i] * y[i];

Figure 2.2. A vector dot-product example program.
$$R_1 = \frac{2N}{N(t_+ + t_*)} = \frac{2}{t_+ + t_*}\ \text{FLOPS/cycle} \qquad (2.6)$$

$$R_2 = \frac{2Nf}{N[t_{\mathrm{if}} + f(t_+ + t_*)]} = \frac{2f}{t_{\mathrm{if}} + f(t_+ + t_*)}\ \text{FLOPS/cycle} \qquad (2.7)$$

If $t_{\mathrm{if}}$ is four cycles, $t_+$ is five cycles, $t_*$ is ten cycles, $f$ is 10%, and the processor's
clock rate is 250 MHz (i.e. one cycle is 4 ns), then $t_1 = N(5 + 10) \times 4\ \text{ns} = 60N$ ns and
$t_2 = N[4 + 0.1(5 + 10)] \times 4\ \text{ns} = 22N$ ns. The speedup of program 2 relative
to program 1 then is found to be $S_{2,1} = 60N/22N = 2.73$.
Calculating the execution rates realized by each program with these assumptions
produces $R_1 = 2/(60\ \text{ns}) = 33$ MFLOPS and $R_2 = 2(0.1)/(22\ \text{ns}) = 9.09$ MFLOPS.
Thus, even though we have reduced the total execution time
from $t_1 = 60N$ ns to $t_2 = 22N$ ns, the means-based metric (MFLOPS) shows
that program 2 is 72% slower than program 1. The ends-based metric (execution
time), however, shows that program 2 is actually 173% faster than program 1.
We reach completely different conclusions when using these two different types
of metrics because the means-based metric unfairly gives program 1 credit for all
of the useless operations of multiplying and adding zero. This example highlights
the danger of using the wrong metric to reach a conclusion about computer-
system performance.
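The arithmetic of this example is easy to check in code. The following C sketch recomputes the per-element times and both metrics under the stated assumptions.

    #include <stdio.h>

    /* A minimal sketch of the arithmetic above: t_plus = 5, t_mult = 10,
       and t_if = 4 cycles, f = 0.1, one cycle = 4 ns at 250 MHz. */
    int main(void)
    {
        double cycle_ns = 4.0, t_plus = 5.0, t_mult = 10.0, t_if = 4.0, f = 0.1;

        double t1 = (t_plus + t_mult) * cycle_ns;              /* ns/element, program 1 */
        double t2 = (t_if + f * (t_plus + t_mult)) * cycle_ns; /* ns/element, program 2 */

        printf("speedup S_2,1 = %.2f\n", t1 / t2);        /* 60/22 = 2.73 */
        printf("R1 = %.1f MFLOPS\n", 2.0 / t1 * 1e3);     /* 2 FLOPs per 60 ns */
        printf("R2 = %.2f MFLOPS\n", 2.0 * f / t2 * 1e3); /* 0.2 FLOPs per 22 ns */
        return 0;
    }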
s = 0;
for (i = 1; i < N; i++)
    if (x[i] != 0 && y[i] != 0)
        s = s + x[i] * y[i];

Figure 2.3. The vector dot-product example program of Figure 2.2 modified to calculate
only nonzero elements.
2.7 Summary

2.8 For further reading
• The following paper argues strongly for total execution time as the best
measure of performance:
James E. Smith, 'Characterizing Computer Performance with a Single
Number,' Communications of the ACM, October 1988, pp. 1202-1206.
• The QUIPS metric is described in detail in the following paper, which also
introduced the idea of means-based versus ends-based metrics:
• Some of the characteristics of the SPEC metric are discussed in the following
papers:
• Parallel computing systems are becoming more common. They present some
interesting performance measurement problems, though, as discussed in
Lawrence A. Crowl, 'How to Measure, Present, and Compare Parallel
Performance,' IEEE Parallel and Distributed Technology, Spring 1994,
pp. 9-25.
2.9 Exercises
Measurement tools and techniques

'When the only tool you have is a hammer, every problem begins to resemble a nail.'
Abraham Maslow
The previous chapters have discussed what performance metrics may be useful
for the performance analyst, how to summarize measured data, and how to
understand and quantify the systematic and random errors that affect our
measurements. Now that we know what to do with our measured values,
this chapter presents several tools and techniques for actually measuring the
values we desire.
The focus of this chapter is on fundamental measurement concepts. The goal is
not to teach you how to use specific measurement tools, but, rather, to help you
understand the strengths and limitations of the various measurement techniques.
By the end of this chapter, you should be able to select an appropriate
measurement technique to determine the value of a desired performance metric. You also
should have developed some understanding of the trade-offs involved in using
the various types of tools and techniques.

6.1 Events and measurement strategies

There are many different types of performance metrics that we may wish to
measure. The different strategies for measuring the values of these metrics are
typically based around the idea of an event, where an event is some predefined
change in the system state. The precise definition of a specific event is up to the
performance analyst and depends on the metric being measured. For instance, an
event may be defined to be a memory reference, a disk access, a network
communication operation, a change in a processor's internal state, or some pattern or
combination of other subevents.
The different types of metrics that a performance analyst may wish to measure
can be classified into the following categories based on the type of event or events
that comprise the metric.
1. Event-count metrics. Metrics that fall into this category are those that are
simple counts of the number of times a specific event occurs. Examples of
event-count metrics include the number of page faults in a system with
virtual memory, and the number of disk input/output requests made by a
program.
2. Secondary-event metrics. These types of metrics record the values of some
secondary parameters whenever a given event occurs. For instance, to
determine the average number of messages queued in the send buffer of a
communication port, we would need to record the number of messages in the
queue each time a message was added to, or removed from, the queue. Thus,
the triggering event is a message-enqueue or -dequeue operation, and the
metrics being recorded are the number of messages in the queue and the total
number of queue operations. We may also wish to record the size (e.g. the
number of bytes) of each message sent to later determine the average
message size.
3. Profiles. A profile is an aggregate metric used to characterize the overall
behavior of an application program or of an entire system. Typically, it is
used to identify where the program or system is spending its execution time.
6.2 Interval timers

Figure 6.1 A hardware-based interval timer uses a free-running clock source to continuously
increment an n-bit counter. This counter can be read directly by the operating system or by
an application program. The period of the clock, Tc, determines the resolution of the timer.
Figure 6.2 A software interrupt-based timer divides down a free-running clock (through a
divide-by-m prescaler) to produce a processor interrupt with the period Tc. The interrupt-
service routine then maintains a counter variable in memory that it increments each time
the interrupt occurs.
In the software-based approach, the counter variable read by an application
is not directly incremented by the free-running clock. Instead, the hardware
clock is used to generate a processor interrupt at regular intervals. The interrupt-
service routine then increments a counter variable it maintains, which is the value
actually read by an application program. The value of this variable then is a
count of the number of interrupts that have occurred since the count variable
was last initialized. Some systems allow an application program to reset this
counter. This feature allows the timer to always start from zero when timing
the duration of an event.
The period of the interrupts in the software-based approach corresponds to
the period of the timer. As before, we denote this period $T_c$ so that the total time
elapsed between two readings of the software counter value is again
$T_e = (x_2 - x_1)T_c$. The processor interrupt is typically derived from a free-running
clock source that is divided by m through a prescaling counter, as shown in
Figure 6.2. This prescaler is necessary in order to reduce the frequency of the
interrupt signal fed into the processor. Interrupts would occur much too often,
and thus would generate a huge amount of processor overhead, if this prescaling
were not done.
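As a concrete illustration of this software-based approach, the following C sketch uses the POSIX setitimer() facility to stand in for the divided-down hardware clock: the operating system delivers a periodic signal, and the handler maintains the tick count in memory. The 10 ms period and the busy-loop "event" are arbitrary choices for the sketch.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Tick counter maintained by the interrupt (signal) handler. */
    static volatile sig_atomic_t ticks = 0;

    static void on_tick(int sig) { (void)sig; ticks++; }

    int main(void)
    {
        /* it_interval and it_value both 10 ms, so Tc = 10 ms. */
        struct itimerval tv = { {0, 10000}, {0, 10000} };
        long x1, x2;

        signal(SIGALRM, on_tick);
        setitimer(ITIMER_REAL, &tv, NULL);

        x1 = ticks;                           /* first timer reading  */
        for (volatile long i = 0; i < 100000000L; i++)
            ;                                 /* event being measured */
        x2 = ticks;                           /* second timer reading */

        printf("Te = (x2 - x1) * Tc = %ld ticks * 10 ms\n", x2 - x1);
        return 0;
    }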
Timer rollover. One important consideration with these types of interval timers
is the number of bits available for counting. This characteristic directly
determines the longest interval that can be measured. (The complementary issue of the
shortest interval that can be measured is discussed in Section 6.2.2.) A binary
counter used in a hardware timer, or the equivalent count variable used in a
software implementation, is said to 'roll over' to zero as its count undergoes a
transition from its maximum value of $2^n - 1$ to the zero value, where n is the
number of bits in the counter.
If the counter rolls over between the reading of the counter at the start of the
interval being measured and the reading of the counter at the end, the difference
of the count values, $x_2 - x_1$, will be a negative number. This negative value is
obviously not a valid measurement of the time interval. Any program that uses
an interval timer must take care to ensure that this type of rollover can never
occur, or it must detect and, possibly, correct the error. Note that a negative
value that occurs due to a single rollover of the counter can be converted to the
appropriate value by adding the maximum count value, $2^n$, to the negative value
obtained when subtracting $x_1$ from $x_2$. Table 6.1 shows the maximum time
between timer rollovers for various counter widths and input clock periods.
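A single rollover can be corrected in code along the lines of the following C sketch, which assumes an n-bit binary counter read as an unsigned value.

    #include <stdint.h>
    #include <stdio.h>

    /* A minimal sketch of single-rollover correction for an n-bit timer.
       If x2 < x1 because the counter wrapped once between the two reads,
       adding 2^n restores the true elapsed tick count. */
    static uint32_t elapsed_ticks(uint32_t x1, uint32_t x2, unsigned n)
    {
        uint64_t max_count = (uint64_t)1 << n;   /* 2^n */
        if (x2 >= x1)
            return x2 - x1;
        return (uint32_t)(x2 + max_count - x1);  /* correct one rollover */
    }

    int main(void)
    {
        /* A 16-bit counter that wrapped: read 65000, then 500. */
        printf("%u ticks\n", elapsed_ticks(65000u, 500u, 16)); /* 1036 */
        return 0;
    }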
When it is used in this way, we can see that the time we actually measure
includes more than the time required by the event itself. Specifically, accessing
the timer requires a minimum of one memory-read operation. In some
implementations, reading the timer may require as much as a call to the operating-
system kernel, which can be very time-consuming. Additionally, the value read
from the timer must be stored somewhere before the event being timed begins.
This requires at least one store operation, and, in some systems, it could
require substantially more. These operations must be performed twice, once
at the start of the event, and once again at the end. Taken altogether, these
operations can add up to a significant amount of time relative to the duration
of the event itself.
To obtain a better understanding of this timer overhead, consider the time
line shown in Figure 6.3. Here, $T_1$ is the time required to read the value of the
interval timer's counter. It may be as short as a single memory read, or as long
as a call into the operating-system kernel. Next, $T_2$ is the time required to store
the current time. This time includes any time in the kernel after the counter has
been read, which would include, at a minimum, the execution of the return
instruction. Time $T_3$ is the actual duration of the event we are trying to
measure. Finally, the time from when the event ends until the program actually
reads the counter value again is $T_4$. Note that reading the counter this second
time involves the same set of operations as the first read of the counter, so that
$T_4 = T_1$.
Assigning these times to each of the components in the timing operation now
allows us to compare the timer overhead with the time of the event itself, which is
what we actually want to know. This event time, $T_e$, is time $T_3$ in our time line, so
that $T_e = T_3$. What we measure, however, is $T_m = T_2 + T_3 + T_4$. Thus, our
measured value overestimates the actual event time by the timer overhead $T_2 + T_4$.
Table 6.1 The maximum time available before a binary interval timer with n bits and an
input clock with a period of Tc rolls over is Tc × 2^n.
[Table entries give Tc × 2^n for counter widths n = 16, 24, 32, 48, and 64 over a range of
input clock periods Tc.]
Figure 6.3 The overhead incurred when using an interval timer to measure the execution
time of any portion of a program can be understood by breaking down the operations
necessary to use the timer into the components shown here.
The smallest change that can be detected and displayed by an interval timer is
its resolution. This resolution is a single clock tick, which, in terms of time, is the
period of the timer's clock input, $T_c$. This finite resolution introduces a random
quantization error into all measurements made using the timer.
For instance, consider an event whose duration is n ticks of the clock input,
plus a little bit more. That is, $T_e = nT_c + \Delta$, where n is a positive integer and
$0 < \Delta < T_c$. If, when one is measuring this event, the timer value is read
shortly after the event has actually begun, as shown in Figure 6.4(a), the
timer will count n clock ticks before the end of the event. The total execution
time reported then will be $nT_c$. If, on the other hand, there is slightly less time
between the actual start of the event and the point at which the timer value is
read, as shown in Figure 6.4(b), the timer will count n + 1 clock ticks before
the end of the event is detected. The total time reported in this case will then be
$(n + 1)T_c$.
In general, the actual event time is within the range $nT_c < T_e < (n + 1)T_c$.
Thus, the fact that events are typically not exact whole-number multiples of the
timer's clock period causes the time value reported to be rounded either up or
down by one clock period. This rounding is completely unpredictable and is one
readily identifiable (albeit possibly small) source of random errors in our
measurements (see Section 4.2). Looking at this quantization effect another way, if we
made ten measurements of the same event, we would expect that approximately
five of them would be reported as $nT_c$ with the remainder reported as $(n + 1)T_c$. If
$T_c$ is large relative to the event being measured, this quantization effect can make
it impossible to directly measure the duration of the event. Consequently, we
typically would like $T_c$ to be as small as possible, within the constraints imposed
by the number of bits available in the timer (see Table 6.1).
(a) Interval timer reports an event duration of n = 13 clock ticks.

Figure 6.4 The finite resolution of an interval timer causes quantization of the reported
duration of the events measured.
Owing to the above quantization effect, we cannot directly measure events whose
durations are less than the resolution of the timer. Similarly, quantization makes
it difficult to accurately measure events with durations that are only a few times
larger than the timer's resolution. We can, however, make many measurements
of a short-duration event to obtain a statistical estimate of the event's duration.
Consider an event whose duration is smaller than the timer's resolution, that
is, $T_e < T_c$. If we measure this interval once, there are two possible outcomes. If
we happen to start our measurement such that the event straddles the active edge
of the clock that drives the timer's internal counter, as shown in Figure 6.5(a), we
will see the clock advance by one tick. On the other hand, since $T_e < T_c$, it is
entirely possible that the event will begin and end within one clock period, as
shown in Figure 6.5(b). In this case, the timer will not advance during this
measurement. Thus, we have a Bernoulli experiment whose outcome is 1 with
probability p, which corresponds to the timer advancing by one tick while we are
measuring the event. If the clock does not advance, though, the outcome is 0 with
probability 1 - p.
Repeating this measurement n times produces a distribution that approximates
a binomial distribution. (It is only approximate since, for a true binomial
distribution, the individual measurements would have to be fully independent trials
with the same probability p.)
(b) Event Te begins and ends within the resolution of the interval timer.

Figure 6.5 When one is measuring an event whose duration is less than the resolution of
the interval timer, that is, Te < Tc, there are two possible outcomes for each measurement.
Either the event happens to straddle the active edge of the timer's clock, in which case the
counter advances by one tick, or the event begins and completes between two clock edges.
In the latter case, the interval timer will show the same count value both before and after
the event. Measuring this event multiple times approximates a binomial distribution.
We can then use the technique for calculating a confidence interval for a proportion
(see Section 4.4.3) to obtain a confidence interval for this average event time.¹
Example. We wish to measure an event whose duration we suspect is less than
the 40 µs resolution of our interval timer. Out of n = 10,482 measurements of
this event, we find that the clock actually advances by one tick during m = 852 of
them. For a 95% confidence level, we construct the interval for the ratio m/n =
852/10,482 as follows:

$$(c_1, c_2) = \frac{852}{10{,}482} \mp 1.96\sqrt{\frac{\frac{852}{10{,}482}\left(1 - \frac{852}{10{,}482}\right)}{10{,}482}} = (0.0786,\ 0.0840) \qquad (6.2)$$

Scaling this interval by the timer's clock period gives us the 95% confidence
interval (3.14, 3.36) µs for the duration of this event. ◊
¹ The basic idea behind this technique was first suggested by Peter H. Danzig and Steve Melvin in an
unpublished technical report from the University of Southern California.
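The estimate and its confidence interval are easy to compute. The following C sketch applies the standard normal-approximation interval for a proportion to the numbers in this example; the interval it prints follows directly from the formula, so its last digits may not match the rounded figures printed above.

    #include <math.h>
    #include <stdio.h>

    /* A minimal sketch of estimating a sub-resolution event duration:
       out of n trials the timer advanced during m, so Te ~ (m/n)*Tc. */
    int main(void)
    {
        double n = 10482.0, m = 852.0, Tc_us = 40.0, z = 1.96;
        double p = m / n;                        /* fraction of advancing trials */
        double half = z * sqrt(p * (1.0 - p) / n);

        printf("Te estimate = %.2f us\n", p * Tc_us);
        printf("95%% CI: (%.2f, %.2f) us\n",
               (p - half) * Tc_us, (p + half) * Tc_us);
        return 0;
    }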
6.3 Program profiling
6.3.1 PC sampling
$$(c_1, c_2) = \frac{12}{800} \mp 2.576\sqrt{\frac{\frac{12}{800}\left(1 - \frac{12}{800}\right)}{800}} = (0.0039,\ 0.0261) \qquad (6.3)$$
So, with 99% confidence, we estimate that the program spends between 0.39%
and 2.6% of its time executing subroutine X. Multiplying by the period of the
interrupt, we estimate that, out of the 8 s the program was executing, there is a
99% chance that it spent between 31 (0.0039 x 8) and 210 (0.0261 x 8) ms
executing subroutine X. ◊
The confidence interval calculated in the above example produces a rather
large range of times that the program could be spending in subroutine X. Put
in other terms, if we were to repeat this experiment several times, we would
expect that, in 99% of the experiments, from three to 21 of the 800 samples
would come from subroutine X. While this 7 : 1 range of possible execution
times appears large, we estimate that subroutine X still accounts for less than
3% of the total execution time. Thus, we most likely would start our program
tuning efforts on a routine that consumes a much larger fraction of the total
execution time.
This example does demonstrate the importance of having a sufficient number
of samples in each state to produce reliable information, however. To reduce the
size of the confidence interval in this example we need more samples of each
event. Obtaining more samples per event requires either sampling for a longer
period of time, or increasing the sampling rate. In some situations, we can simply
let the program execute for a longer period of time. This will increase the total
number of samples and, hence, the number of samples obtained for each
subroutine.
Some programs have a fixed duration, however, and cannot be forced to
execute for a longer period. In this situation, we can run the program multiple
times and simply add the samples from each run. The alternative of increasing
the sampling frequency will not always be possible, since the interrupt period is
often fixed by the system or the profiling tool itself. Furthermore, increasing the
sampling frequency increases the number of times the interrupt-service routine is
executed, which increases the perturbation to the program. Of course, each run
of the program must be performed under identical conditions. Otherwise, if the
test conditions are not identical, we are testing two essentially different systems.
Consequently, in this case, the two sets of samples cannot be simply added
together to form one larger sample set.
It is also important to note that this sampling procedure implicitly assumes
that the interrupt occurs completely asynchronously with respect to any events in
the program being profiled. Although the interrupts occur at fixed, predefined
intervals, if the program events and the interrupt are asynchronous, the
interrupts will occur at random points in the execution of the program being sampled.
Thus, the samples taken at these points are completely independent of each
other. This sample independence is critical to obtaining accurate results with
this technique since any synchronism between the events in the program and
the interrupt will cause some areas of the program to be sampled more often than
they should, given their actual frequency of occurrence.
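A minimal sketch of a PC-sampling profiler is shown below, assuming a POSIX system: ITIMER_PROF delivers SIGPROF at fixed intervals of process CPU time, and the handler charges one sample to a histogram bucket for the interrupted program counter. Extracting the actual PC from the signal context is platform-specific, so it is reduced here to a hypothetical pc_of() helper that a real profiler would fill in.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NBUCKETS 1024
    static unsigned long hist[NBUCKETS];

    /* Hypothetical helper: a real profiler would extract the interrupted
       PC from the ucontext (e.g., from uc_mcontext on Linux). */
    static unsigned long pc_of(void *ucontext) { (void)ucontext; return 0; }

    static void on_prof(int sig, siginfo_t *si, void *uc)
    {
        (void)sig; (void)si;
        hist[(pc_of(uc) >> 4) % NBUCKETS]++;  /* one sample for this bucket */
    }

    int main(void)
    {
        struct sigaction sa;
        struct itimerval tv = { {0, 10000}, {0, 10000} };  /* 10 ms period */

        sa.sa_sigaction = on_prof;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGPROF, &sa, NULL);
        setitimer(ITIMER_PROF, &tv, NULL);  /* SIGPROF every 10 ms of CPU time */

        for (volatile long i = 0; i < 500000000L; i++)
            ;                               /* the program being profiled */

        printf("samples collected in bucket 0: %lu\n", hist[0]);
        return 0;
    }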
1. $37: la   $25, __iob
2.      lw   $15, 0($25)
3.      addu $9, $15, -1
4.      sw   $9, 0($25)
5.      la   $8, __iob
6.      lw   $11, 0($8)
7.      bge  $11, 0, $38
8.      move $4, $8
9.      jal  __filbuf
10.     move $17, $2
11. $38: la  $12, __iob

Figure 6.6 A basic block is a sequence of instructions with no branches into or out of the
block. In this example, one basic block begins at statement 1 and ends at statement 7. A
second basic block begins at statement 8 and ends at statement 9. Statement 10 is a basic
block consisting of only one instruction. Statement 11 begins another basic block since it is
the target of an instruction that branches to label $38.
One of the key differences between this basic-block profile and a profile
generated through sampling is that the basic-block profile shows the exact execution
frequencies of all of the instructions executed by a program. The sampling
profile, on the other hand, is only a statistical estimate of the frequencies. Hence, if a
sampling experiment is run a second time, the precise execution frequencies will
most likely be at least slightly different. A basic-block profile, however, will
produce exactly the same frequencies whenever the program is executed with
the same inputs.
Although the repeatability and exact frequencies of basic-block counting
would seem to make it the obvious profiling choice over a sampling-based
profile, modifying a program to count its basic-block executions can add a
substantial amount of run-time overhead. For instance, instrumenting a program for
basic-block counting requires adding to each basic block at least one instruction
that increments the appropriate counter when the block begins executing.
Since the counters that need to be incremented must be unique for each
basic block, it is likely that additional instructions to calculate the appropriate
offset for the current block into the array of counters will be necessary.
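In outline, the inserted instrumentation amounts to something like the following C sketch; the block numbering is illustrative.

    /* A minimal sketch of basic-block instrumentation: every basic block
       gets a unique index into a counter array, and one added statement
       per block increments its counter. */
    #define NBLOCKS 3
    unsigned long bb_count[NBLOCKS];

    void instrumented(int i, int *a, int *b)
    {
        bb_count[0]++;            /* block 0: function entry */
        if (i > 5) {
            bb_count[1]++;        /* block 1: then-branch */
            *a = *a + i;
        } else {
            bb_count[2]++;        /* block 2: else-branch */
            *b = *b + 1;
        }
    }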
In most programs, the number of instructions in a basic block is typically
between three and 20. Thus, the number of instructions executed by the
instrumented program is likely to increase by at least a few percent and possibly by as
much as 100% compared with the uninstrumented program. These additional
instructions can substantially increase the total running time of the program.
Furthermore, the additional memory required to store the counter array, plus the
execution of the additional instructions, can cause other substantial
perturbations. For instance, these changes to the program can significantly alter its
memory behavior.
So, while basic-block counting provides exact profile information, it does so at
the expense of substantial overhead. Sampling, on the other hand, distributes its
perturbations randomly throughout a program's execution. Also, the total
perturbation due to sampling can be controlled somewhat by varying the period of
the sampling interrupt interval. Nevertheless, basic-block counting can be a
useful tool for precisely characterizing a program's execution profile. Many
compilers, in fact, have compile-time flags a user can set to automatically insert
appropriate code into a program as it is compiled to generate the desired basic-
block counts when it is subsequently executed.
6.4 Event tracing

Figure 6.7 The overall process used to generate, store, and consume a program trace.
sum_x = 0.0;
trace(1);
sum_xx = 0.0;
trace(2);
for (i = 1; i <= n; i++)
{
    trace(3);
    sum_x += x[i];
    trace(4);
    sum_xx += (x[i] * x[i]);
    trace(5);
}
mean = sum_x / n;
trace(6);
var = ((n * sum_xx) - (sum_x * sum_x)) / (n * (n-1));
trace(7);
std_dev = sqrt(var);
trace(8);
z_p = unit_normal(1 - (0.5 * alpha));
trace(9);
half_int = z_p * std_dev / sqrt(n);
trace(10);
c1 = mean - half_int;
trace(11);
c2 = mean + half_int;
trace(12);

(a) The original source program with calls to the tracing routine inserted.

trace(i)
{ print(i, time); }

(b) The trace routine simply prints the statement number, i, and the current time.

Figure 6.8 Program tracing can be performed by inserting additional statements into the
source code to call a tracing subroutine at appropriate points.
each item in the trace requires 16 bits to encode the necessary information, our
tracing will produce more than 190 Mbytes of data per second of the program's
uninstrumented execution time, or more than 11 Gbytes per minute! In addition
to obtaining the disks necessary to store this amount of data, the input/output
operations required to move this large volume of data from the traced program to the
disks create additional perturbations. Thus, it is desirable to reduce the amount
of information that must be stored.
1. if (i > 5)
2.    then a = a + i;
3.    else b = b + 1;
4. i = i + 1;

Figure 6.9 A code fragment to be processed using the abstract execution tracing technique.
Figure 6.10 The control flow graph corresponding to the program fragment shown in
Figure 6.9.
Figure 6.11 In trace sampling, k consecutive events comprise one sample of the trace. A
new sample is taken every P events (P is called the sampling interval).
time the processor must have been busy executing real application programs
during the given measurement interval.
Specifically, consider an 'idle' program that simply counts up from zero for a
fixed period of time. If this program is the only application running on a single
processor of a time-shared system, the final count value at the end of the
measurement interval is the value that indirectly corresponds to an unloaded
processor. If two applications are executed simultaneously and evenly share the
processor, however, the processor will run our idle measurement program half
as often as when it was the only application running. Consequently, if we allow
both programs to run for the same time interval as when we ran the idle program
by itself, its total count value at the end of the interval should be half of the value
observed when only a single copy was executed.
Similarly, if three applications are executed simultaneously and equally share
the processor for the same measurement interval, the final count value in our idle
program should be one-third of the value observed when it was executed by
itself. This line of thought can be further extended to n application programs
simultaneously sharing the processor. After calibrating the counter process by
running it by itself on an otherwise unloaded system, it can be used to indirectly
measure the system load.
Example. In a time-shared system, the operating system will share a single
processor evenly among all of the jobs executing in the system. Each available
job is allowed to run for the time slice Ts. After this interval, the currently
executing job is temporarily put to sleep, and the next ready job is switched in
to run. Indirect load monitoring takes advantage of this behavior to estimate the
system load. Initially, the load-monitor program is calibrated by allowing it to
run by itself for a time T, as shown in Figure 6.12(a). At the end of this time, its
counter value, n, is recorded. If the load monitor and another application are run
simultaneously so that in total two jobs are sharing the processor, as shown in
Figure 6.12(b), each job would be expected to be executing for half of the total
time available. Thus, if the load monitor is again allowed to run for time T, we
would expect its final count value to be n/2. Similarly, running the load monitor
with two other applications for time T would result in a final count value of n/3,
as shown in Figure 6.12(c). Consequently, knowing the value of the count after
running the load monitor for time T allows us to deduce what the average load
during the measurement interval must have been. ◊
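A minimal C sketch of such a load monitor appears below; it counts for a fixed wall-clock interval, and the resulting count is compared against a calibration run on an unloaded system.

    #include <stdio.h>
    #include <time.h>

    /* A minimal sketch of the indirect load monitor: count as fast as
       possible for a fixed wall-clock interval T; the average load is then
       estimated as calibration_count / count. */
    int main(void)
    {
        const double T = 10.0;    /* measurement interval, seconds */
        unsigned long count = 0;
        time_t start = time(NULL);

        while (difftime(time(NULL), start) < T)
            count++;              /* the 'idle' counting loop */

        printf("count after %.0f s: %lu\n", T, count);
        /* load estimate = calibration_count / count */
        return 0;
    }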
6.6 Perturbations due to measuring

One of the curious (and certainly most annoying!) aspects of developing tools to
measure computer-systems performance is that instrumenting a system or program
perturbs the very thing we are trying to measure.

Figure 6.12 Calibrating and using the indirect load monitor: run alone for time T it
reaches Count = n; sharing the processor with one other job it reaches Count = n/2, and
with two other jobs Count = n/3.
the additional memory locations necessary for the instrumentation could change
the pattern of conflict misses in the cache in such a way as to actually improve
the cache performance perceived by the application. The bottom line is that
the effects of adding instrumentation to a system being tested are entirely
unpredictable.
Besides these direct changes to a program's performance, instrumenting a
program can cause more subtle indirect perturbations. For example, an
instrumented program will take longer to execute than will the uninstrumented
program. This increase in execution time will then cause it to experience more
context switches than it would have experienced if it had not been instrumented.
These additional context switches can substantially alter the program's paging
behavior, for instance, making the instrumented program behave substantially
differently than the uninstrumented program.
6.7 Summary
Chapter 7
• Explain techniques for redundancy, such as n-version programming, error
coding, duplicated components, and replication.
• Categorize main variants of replication techniques and implement simple
replication protocols.
Overview
Construction of reliable systems from unreliable components is one of the most important
applications of modularity. There are, in principle, three basic steps to building
reliable systems:
1. Error detection: discovering that there is an error in a data value or control signal.
Error detection is accomplished with the help of redundancy, extra information
that can verify correctness.
2. Error containment: limiting how far the effects of an error propagate. Error
containment comes from careful application of modularity. When discussing
reliability, a module is usually taken to be the unit that fails independently of other
such units. It is also usually the unit of repair and replacement.
3. Error masking: ensuring correct operation despite the error. Error masking is
accomplished by providing enough additional redundancy that it is possible to
discover correct, or at least acceptably close, values of the erroneous data or control
signal. When masking involves changing incorrect values to correct ones, it is
usually called error correction.
Since these three steps can overlap in practice, one sometimes finds a single error-handling
mechanism that merges two or even all three of the steps.
In earlier chapters each of these ideas has already appeared in specialized forms:
• A primary purpose of enforced modularity, as provided by client/server
architecture, virtual memory, and threads, is error containment.
• Network links typically use error detection to identify and discard damaged
frames.
• Some end-to-end protocols time out and resend lost data segments, thus
masking the loss.
• Routing algorithms find their way around links that fail, masking those failures.
• Some real-time applications fill in missing data by interpolation or repetition,
thus masking loss.
and, as we will see in Chapter 11 [on-line], secure systems use a technique called defense
in depth both to contain and to mask errors in individual protection mechanisms. In this
chapter we explore systematic application of these techniques to more general problems,
as well as learn about both their power and their limitations.
• Hardware fault: A gate whose output is stuck at the value ZERO. Until something
depends on the gate correctly producing the output value ONE, nothing goes wrong.
If you publish a paper with an incorrect sum that was calculated by this gate, a
failure has occurred. Furthermore, the paper now contains a fault that may lead
some reader to do something that causes a failure elsewhere.
• Design fault: A miscalculation that has led to installing too little memory in a
telephone switch. It may be months or years until the first time that the presented
load is great enough that the switch actually begins failing to accept calls that its
specification says it should be able to handle.
• Implementation fault: Installing less memory than the design called for. In this
case the failure may be identical to the one in the previous example of a design
fault, but the fault itself is different.
• Operations fault: The operator responsible for running the weekly payroll ran the
payroll program twice last Friday. Even though the operator shredded the extra
checks, this fault has probably filled the payroll database with errors such as wrong
values for year-to-date tax payments.
• Environment fault: Lightning strikes a power line, causing a voltage surge. The
computer is still running, but a register that was being updated at that instant now
has several bits in error. Environment faults come in all sizes, from bacteria
contaminating ink-jet printer cartridges to a storm surge washing an entire
building out to sea.
Some of these examples suggest that a fault may either be latent, meaning that it isn't
affecting anything right now, or active. When a fault is active, wrong results appear in
data values or control signals. These wrong results are errors. If one has a formal
specification for the design of a module, an error would show up as a violation of some assertion
or invariant of the specification. The violation means that either the formal specification
is wrong (for example, someone didn't articulate all of the assumptions) or a module that
this component depends on did not meet its own specification. Unfortunately, formal
specifications are rare in practice, so discovery of errors is more likely to be somewhat ad
hoc.
If an error is not detected and masked, the module probably does not perform to its
specification. Not producing the intended result at an interface is the formal definition
of a failure. Thus, the distinction between fault and failure is closely tied to modularity
and the building of systems out of well-defined subsystems. In a system built of
subsystems, the failure of a subsystem is a fault from the point of view of the larger subsystem
that contains it. That fault may cause an error that leads to the failure of the larger
subsystem, unless the larger subsystem anticipates the possibility of the first one failing,
detects the resulting error, and masks it. Thus, if you notice that you have a flat tire, you
have detected an error caused by failure of a subsystem you depend on. If you miss an
appointment because of the flat tire, the person you intended to meet notices a failure of
a larger subsystem. If you change to a spare tire in time to get to the appointment, you
have masked the error within your subsystem. Fault tolerance thus consists of noticing
active faults and component subsystem failures and doing something helpful in response.
One such helpful response is error containment, which is another close relative of
modularity and the building of systems out of subsystems. When an active fault causes
an error in a subsystem, it may be difficult to confine the effects of that error to just a
portion of the subsystem. On the other hand, one should expect that, as seen from
outside that subsystem, the only effects will be at the specified interfaces of the subsystem.
In consequence, the boundary adopted for error containment is usually the boundary of
the smallest subsystem inside which the error occurred. From the point of view of the
next higher-level subsystem, the subsystem with the error may contain the error in one
of four ways:
1. Mask the error, so the higher-level subsystem does not realize that anything went
wrong. One can think of failure as falling off a cliff and masking as a way of
providing some separation from the edge.
2. Detect and report the error at its interface, producing what is called a fail-fast
design. Fail-fast subsystems simplify the job of detection and masking for the next
higher-level subsystem. If a fail-fast module correctly reports that its output is
questionable, it has actually met its specification, so it has not failed. (Fail-fast
modules can still fail, for example by not noticing their own errors.)
3. Immediately stop dead, thereby hoping to limit propagation of bad values, a
technique known as fail-stop. Fail-stop subsystems require that the higher-level
subsystem take some additional measure to discover the failure, for example by
setting a timer and responding to its expiration. A problem with fail-stop design is
that it can be difficult to distinguish a stopped subsystem from one that is merely
running more slowly than expected. This problem is particularly acute in
asynchronous systems.
4. Do nothing, simply failing without warning. At the interface, the error may have
contaminated any or all output values. (Informally called a "crash" or perhaps
"fail-thud".)
Another useful distinction is that of transient versus persistent faults. A transient fault,
also known as a single-event upset, is temporary, triggered by some passing external event
such as lightning striking a power line or a cosmic ray passing through a chip. It is usually
possible to mask an error caused by a transient fault by trying the operation again. An
error that is successfully masked by retry is known as a soft error. A persistent fault
continues to produce errors, no matter how many times one retries, and the corresponding
errors are called hard errors. An intermittent fault is a persistent fault that is active only
occasionally, for example, when the noise level is higher than usual but still within
specifications. Finally, it is sometimes useful to talk about latency, which in reliability
terminology is the time between when a fault causes an error and when the error is
detected or causes the module to fail. Latency can be an important parameter because
some error-detection and error-masking mechanisms depend on there being at most a
small fixed number of errors (often just one) at a time. If the error latency is large,
there may be time for a second error to occur before the first one is detected and masked,
in which case masking of the first error may not succeed. Also, a large error latency gives
time for the error to propagate and may thus complicate containment.
Using this terminology, an improperly fabricated stuck-at-ZERO bit in a memory chip
is a persistent fault: whenever the bit should contain a ONE the fault is active and the value
of the bit is in error; at times when the bit is supposed to contain a ZERO, the fault is latent.
If the chip is a component of a fault tolerant memory module, the module design probably
includes an error-correction code that prevents that error from turning into a failure
of the module. If a passing cosmic ray flips another bit in the same chip, a transient fault
has caused that bit also to be in error, but the same error-correction code may still be able
to prevent this error from turning into a module failure. On the other hand, if the error-
correction code can handle only single-bit errors, the combination of the persistent and
the transient fault might lead the module to produce wrong data across its interface, a
failure of the module. If someone were then to test the module by storing new data in it
and reading it back, the test would probably not reveal a failure because the transient
fault does not affect the new data. Because simple input/output testing does not reveal
successfully masked errors, a fault tolerant module design should always include some
way to report that the module masked an error. If it does not, the user of the module may
not realize that persistent errors are accumulating but hidden.
Be explicit
Get all of the assumptions out on the table.

The primary purpose of creating a fault-tolerance model is to expose and document the
assumptions and articulate them explicitly. The designer needs to have these assumptions
not only for the initial design, but also in order to respond to field reports of
the system cannot be used until it is repaired, perhaps by replacing the failed component,
so we are equally interested in the time to repair (TTR). If we observe a system through
N run-fail-repair cycles and observe in each cycle i the values of TTF_i and TTR_i, we can
calculate the fraction of time it operated properly, a useful measure known as availability:

$$\text{Availability} = \frac{\text{time system was running}}{\text{time system should have been running}} = \frac{\sum_{i=1}^{N} TTF_i}{\sum_{i=1}^{N} (TTF_i + TTR_i)} \qquad \text{Eq. 8-1}$$
By separating the denominator of the availability expression into two sums and dividing
each by N (the number of observed failures) we obtain two time averages that are
frequently reported as operational statistics: the mean time to failure (MTTF) and the mean
time to repair (MTTR):

$$MTTF = \frac{1}{N}\sum_{i=1}^{N} TTF_i \qquad MTTR = \frac{1}{N}\sum_{i=1}^{N} TTR_i \qquad \text{Eq. 8-2}$$

The sum of these two statistics is usually called the mean time between failures (MTBF).
Thus availability can be variously described as

$$\text{Availability} = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR} = \frac{MTBF - MTTR}{MTBF} \qquad \text{Eq. 8-3}$$
In some situations, it is more useful to measure the fraction of time that the system is not
working, known as its down time.
One thing that the definition of down time makes clear is that MTTR and MTBF are
in some sense equally important. One can reduce down time either by reducing MTTR
or by increasing MTBF.
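The following C sketch computes these statistics from a set of observed run-fail-repair cycles (equations 8-1 through 8-3); the TTF and TTR values are hypothetical and given in hours.

    #include <stdio.h>

    /* A minimal sketch of equations 8-1 through 8-3 over N observed
       run-fail-repair cycles. */
    int main(void)
    {
        double ttf[] = { 900.0, 1200.0, 600.0 };  /* times to failure */
        double ttr[] = {   4.0,    8.0,   6.0 };  /* times to repair  */
        int n = 3;
        double sum_ttf = 0.0, sum_ttr = 0.0;

        for (int i = 0; i < n; i++) {
            sum_ttf += ttf[i];
            sum_ttr += ttr[i];
        }
        double mttf = sum_ttf / n, mttr = sum_ttr / n;
        double mtbf = mttf + mttr;

        printf("MTTF = %.1f h, MTTR = %.1f h, MTBF = %.1f h\n",
               mttf, mttr, mtbf);
        /* Equals sum TTF / sum (TTF + TTR), per equation 8-1. */
        printf("Availability = %.4f\n", mttf / mtbf);
        return 0;
    }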
Components are often repaired by simply replacing them with new ones. When failed
components are discarded rather than fixed and returned to service, it is common to use
a slightly different method to measure MTTF. The method is to place a batch of N
components in service in different systems (or in what is hoped to be an equivalent test
environment), run them until they have all failed, and use the set of failure times as the
TTF_i in equation 8-2. This procedure substitutes an ensemble average for the time
average. We could use this same procedure on components that are not usually discarded
when they fail, in the hope of determining their MTTF more quickly, but we might
obtain a different value for the MTTF. Some failure processes do have the property that
the ensemble average is the same as the time average (processes with this property are
called ergodic), but other failure processes do not. For example, the repair itself may cause
wear, tear, and disruption to other parts of the system, in which case each successive
system failure might on average occur sooner than did the previous one. If that is the case,
an MTTF calculated from an ensemble-average measurement might be too optimistic.
As we have defined them, availability, MTTF, MTTR, and MTBF are backward-
looking measures. They are used for two distinct purposes: (1) for evaluating how the
system is doing (compared, for example, with predictions made when the system was
designed) and (2) for predicting how the system will behave in the future. The first
purpose is concrete and well defined. The second requires that one take on faith that samples
from the past provide an adequate predictor of the future, which can be a risky
assumption. There are other problems associated with these measures. While MTTR can usually
be measured in the field, the more reliable a component or system the longer it takes to
evaluate its MTTF, so that measure is often not directly available. Instead, it is common
to use and measure proxies to estimate its value. The quality of the resulting estimate of
availability then depends on the quality of the proxy.
A typical 3.5-inch magnetic disk comes with a reliability specification o f 300,000
hours “ M TTF” , which is about 34 years. Since the company quoting this number has
probably not been in business that long, it is apparent that whatever they are calling
“M TTF”is not the same as either the time-average or the ensemble-average MTTF that
we just defined. It is actually a quite different statistic, which is why we put quotes
around its name. Sometimes this “ M TTF”is a theoretical prediction obtained by mod
eling the ways that the components o f the disk might be expected to fail and calculating
an expected time to failure.
A more likely possibility is that the manufacturer measured this "MTTF" by running an array of disks simultaneously for a much shorter time and counting the number of failures. For example, suppose the manufacturer ran 1,000 disks for 3,000 hours (about four months) each, and during that time 10 of the disks failed. The observed failure rate of this sample is 1 failure for every 300,000 hours of operation. The next step is to invert the failure rate to obtain 300,000 hours of operation per failure and then quote this number as the "MTTF". But the relation between this sample observation of failure rate and the real MTTF is problematic. If the failure process were memoryless (meaning that the failure rate is independent of time; Section 8.2.2, below, explores this idea more thoroughly), we would have the special case in which the MTTF really is the inverse of the failure rate. A good clue that the disk failure process is not memoryless is that the disk specification may also mention an "expected operational lifetime" of only 5 years. That statistic is probably the real MTTF, though even that may be a prediction based on modeling rather than a measured ensemble average. An appropriate re-interpretation of the 34-year "MTTF" statistic is to invert it and identify the result as a short-term failure rate that applies only within the expected operational lifetime. The paragraph discussing equation 8-9 on page 8-13 describes a fallacy that sometimes leads to miscalculation of statistics such as the MTTF.
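As a quick check of this arithmetic, the following Python sketch reproduces the hypothetical disk-population experiment above. The variable names are our own, and the final comment restates the caveat from the text: inverting an observed failure rate yields the true MTTF only for a memoryless failure process.

    # Hypothetical reproduction of the disk-population experiment above.
    disks = 1000          # number of disks in the test
    hours_each = 3000     # hours each disk was run (about four months)
    failures = 10         # failures observed during the test

    device_hours = disks * hours_each        # 3,000,000 device-hours
    failure_rate = failures / device_hours   # 1 failure per 300,000 hours
    quoted_mttf = 1 / failure_rate           # 300,000 hours

    print(round(quoted_mttf / (24 * 365), 1))   # ~34.2 "years"
    # Inverting the rate gives the true MTTF only if the failure process
    # is memoryless; within a 5-year operational lifetime it is better
    # read as a short-term failure rate.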
Magnetic disks, light bulbs, and many other components exhibit a time-varying statistical failure rate known as a bathtub curve, illustrated in Figure 8.1 and defined more carefully in Section 8.2.2, below. When components come off the production line, a certain fraction fail almost immediately because of gross manufacturing defects. Those components that survive this initial period usually run for a long time with a relatively uniform failure rate. Eventually, accumulated wear and tear cause the failure rate to increase again, often quite rapidly, producing a failure rate plot that resembles the shape of a bathtub.
Several other suggestive and colorful terms describe these phenomena. Components that fail early are said to be subject to infant mortality, and those that fail near the end of their expected lifetimes are said to burn out. Manufacturers sometimes burn in such components by running them for a while before shipping, with the intent of identifying and discarding the ones that would otherwise fail immediately upon being placed in service. When a vendor quotes an "expected operational lifetime," it is probably the mean time to failure of those components that survive burn in, while the much larger "MTTF" number is probably the inverse of the observed failure rate at the lowest point of the bathtub. (The published numbers also sometimes depend on the outcome of a debate between the legal department and the marketing department, but that gets us into a different topic.) A chip manufacturer describes the fraction of components that survive the burn-in period as the yield of the production line. Component manufacturers usually exhibit a phenomenon known informally as a learning curve, which simply means that the first components coming out of a new production line tend to have more failures than later ones. The reason is that manufacturers design for iteration: upon seeing and analyzing failures in the early production batches, the production line designer figures out how to refine the manufacturing process to reduce the infant mortality rate.
One job of the system designer is to exploit the nonuniform failure rates predicted by the bathtub and learning curves. For example, a conservative designer exploits the learning curve by avoiding the latest generation of hard disks in favor of slightly older designs that have accumulated more field experience. One can usually rely on other designers, who may be concerned more about cost or performance than availability, to shake out the bugs in the newest generation of disks.
FIGURE 8.1: A bathtub curve, showing how the conditional failure rate of a component changes with time.
The 34-year "MTTF" disk drive specification may seem like public relations puffery in the face of the specification of a 5-year expected operational lifetime, but these two numbers actually are useful as a measure of the nonuniformity of the failure rate. This nonuniformity is also susceptible to exploitation, depending on the operation plan. If the operation plan puts the component in a system such as a satellite, in which it will run until it fails, the designer would base system availability and reliability estimates on the 5-year figure. On the other hand, the designer of a ground-based storage system, mindful that the 5-year operational lifetime identifies the point where the conditional failure rate starts to climb rapidly at the far end of the bathtub curve, might include a plan to replace perfectly good hard disks before burn-out begins to dominate the failure rate, in this case, perhaps every 3 years. Since one can arrange to do scheduled replacement at convenient times, for example, when the system is down for another reason, or perhaps even without bringing the system down, the designer can minimize the effect on system availability. The manufacturer's 34-year "MTTF", which is probably the inverse of the observed failure rate at the lowest point of the bathtub curve, then can be used as an estimate of the expected rate of unplanned replacements, although experience suggests that this specification may be a bit optimistic. Scheduled replacements are an example of preventive maintenance, which is active intervention intended to increase the mean time to failure of a module or system and thus improve availability.
For some components, observed failure rates are so low that MTTF is estimated by accelerated aging. This technique involves making an educated guess about what the dominant underlying cause of failure will be and then amplifying that cause. For example, it is conjectured that failures in recordable Compact Disks are heat-related. A typical test scenario is to store batches of recorded CDs at various elevated temperatures for several months, periodically bringing them out to test them and count how many have failed. One then plots these failure rates versus temperature and extrapolates to estimate what the failure rate would have been at room temperature. Again making the assumption that the failure process is memoryless, that failure rate is then inverted to produce an MTTF. Published MTTFs of 100 years or more have been obtained this way. If the dominant fault mechanism turns out to be something else (such as bacteria munching on the plastic coating) or if after 50 years the failure process turns out not to be memoryless after all, an estimate from an accelerated aging study may be far wide of the mark. A designer must use such estimates with caution and understanding of the assumptions that went into them.
Availability is sometimes discussed by counting the number of nines in the numerical representation of the availability measure. Thus a system that is up and running 99.9% of the time is said to have 3-nines availability. Measuring by nines is often used in marketing because it sounds impressive. A more meaningful number is usually obtained by calculating the corresponding down time. A 3-nines system can be down nearly 1.5 minutes per day or 8 hours per year, a 5-nines system 5 minutes per year, and a 7-nines system only 3 seconds per year. Another problem with measuring by nines is that it tells only about availability, without any information about MTTF. One 3-nines system may have a brief failure every day, while a different 3-nines system may have a single eight-hour outage once a year. Depending on the application, the difference between those two systems could be important. Any single measure should always be suspect.
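The conversion from nines to downtime is simple arithmetic; a short Python sketch (with the nines values from the text) makes the comparison explicit:

    # Downtime allowed per year for k-nines availability.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in (3, 5, 7):
        down = 10 ** -nines * SECONDS_PER_YEAR   # seconds of downtime per year
        print(f"{nines} nines: {down / 3600:8.2f} h/yr = {down / 60:9.2f} min/yr")
    # 3 nines:     8.76 h/yr =    525.60 min/yr   (~1.4 minutes per day)
    # 5 nines:     0.09 h/yr =      5.26 min/yr
    # 7 nines:     0.00 h/yr =      0.05 min/yr   (~3.2 seconds per year)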
Finally, availability can be a more fine-grained concept. Some systems are designed so that when they fail, some functions (for example, the ability to read data) remain available, while others (the ability to make changes to the data) are not. Systems that continue to provide partial service in the face of failure are called fail-soft, a concept defined more carefully in Section 8.3.
R(t) = Pr(the module has not yet failed at time t, given that the module was operating at time 0)    (Eq. 8-5)
(The bathtub curve and these two reliability functions are three ways of presenting the same information. If you are rusty on probability, a brief reminder of how they are related appears in Sidebar 8.1.) Once f(t) is at hand, one can directly calculate the MTTF as the expected time to failure:

MTTF = ∫ t f(t) dt, integrated from 0 to ∞.    (Eq. 8-7)
Sidebar 8.1: Reliability functions. The failure rate function, the reliability function, and the bathtub curve (which in probability texts is called the conditional failure rate function, and which in operations research texts is called the hazard function) are actually three mathematically related ways of describing the same information. The failure rate function f(t), as defined in equation 8-6, is a probability density function, which is everywhere non-negative and whose integral over all time is 1. Integrating the failure rate function from the time the component was created (conventionally taken to be t = 0) to the present time yields

F(t) = ∫ f(t) dt, integrated from 0 to t.

F(t) is the cumulative probability that the component has failed by time t. The cumulative probability that the component has not failed is the probability that it is still operating at time t given that it was operating at time 0, which is exactly the definition of the reliability function, R(t). That is,

R(t) = 1 − F(t).
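For the memoryless special case (the distribution of eq. 8-8, below), all three functions have simple closed forms. The following Python sketch, assuming an arbitrary failure rate of one per 6,000 hours, writes them out and checks numerically that the mean of the density is 1/λ:

    import math

    lam = 1 / 6000.0   # arbitrary failure rate: 1 failure per 6,000 hours

    f = lambda t: lam * math.exp(-lam * t)   # failure density f(t)
    F = lambda t: 1 - math.exp(-lam * t)     # cumulative failure probability F(t)
    R = lambda t: math.exp(-lam * t)         # reliability R(t) = 1 - F(t)
    h = lambda t: f(t) / R(t)                # hazard rate: constant, equal to lam

    # Numerical check: MTTF = integral of t * f(t) dt = 1/lam
    dt, mttf = 1.0, 0.0
    for i in range(120_000):                 # integrate out to 20 MTTFs
        t = (i + 0.5) * dt
        mttf += t * f(t) * dt
    print(round(mttf), round(1 / lam))       # both about 6000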
conditional failure rate does change with time. This misappropriation starts with a fallacy: an assumption that the MTTF, as defined in eq. 8-7, can be calculated by inverting the measured failure rate. The fallacy arises because in general

E(1/t) ≠ 1/E(t).    (Eq. 8-9)

That is, the expected value of the inverse is not equal to the inverse of the expected value, except in certain special cases. The important special case in which they are equal is the memoryless distribution of eq. 8-8. When a random process is memoryless, calculations and measurements are so much simpler that designers sometimes forget that the same simplicity does not apply everywhere.
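A small Monte Carlo sketch illustrates the inequality; the uniform lifetime distribution here is an arbitrary stand-in for any process that is not memoryless:

    import random

    random.seed(1)
    # Lifetimes drawn from a distribution that is NOT memoryless
    # (uniform between 1 and 9 hours, mean 5 hours).
    samples = [random.uniform(1, 9) for _ in range(100_000)]

    mean_lifetime = sum(samples) / len(samples)              # E(t)   ~ 5.0
    mean_rate = sum(1 / t for t in samples) / len(samples)   # E(1/t) ~ 0.27

    print(round(1 / mean_lifetime, 3))   # ~0.200: inverse of the expected value
    print(round(mean_rate, 3))           # ~0.275: expected value of the inverse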
Just as availability is sometimes expressed in an oversimplified way by counting the number of nines in its numerical representation, reliability in component manufacturing is sometimes expressed in an oversimplified way by counting standard deviations in the observed distribution of some component parameter, such as the maximum propagation time of a gate. The usual symbol for standard deviation is the Greek letter σ (sigma), and the standard normal distribution has a standard deviation of 1.0, so saying that a component has "4.5 σ reliability" is a shorthand way of saying that the production line controls variations in that parameter well enough that the specified tolerance is 4.5 standard deviations away from the mean value, as illustrated in Figure 8.2. Suppose, for example, that a production line is manufacturing gates that are specified to have a mean propagation time of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds with 4.5 σ reliability. The difference between the mean and the maximum, 1.8 nanoseconds, is the tolerance. For that tolerance to be 4.5 σ, σ would have to be no more than 0.4 nanoseconds. To meet the specification, the production line designer would measure the actual propagation times of production line samples and, if the observed standard deviation is greater than 0.4 ns, look for ways to reduce the variation to that level.
Another way of interpreting "4.5 σ reliability" is to calculate the expected fraction of components that are outside the specified tolerance. That fraction is the integral of one tail of the normal distribution from 4.5 σ to ∞, which is about 3.4 × 10⁻⁶, so in our example no more than 3.4 out of each million gates manufactured would have delays greater than 11.8 nanoseconds. Unfortunately, this measure describes only the failure rate of the production line; it does not say anything about the failure rate of the component after it is installed in a system.
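The tail fraction can be computed directly from the complementary error function; a sketch using Python's standard math library:

    import math

    def tail_fraction(sigmas):
        # One-sided tail of the standard normal distribution beyond `sigmas`.
        return 0.5 * math.erfc(sigmas / math.sqrt(2))

    print(tail_fraction(4.5))   # about 3.4e-06: ~3.4 gates per million out of spec
    # The gate example: a tolerance of 1.8 ns at 4.5 sigma requires
    # sigma <= 1.8 / 4.5 = 0.4 ns.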
A currently popular quality control method, known as "Six Sigma", is an application of two of our design principles to the manufacturing process. The idea is to use measurement, feedback, and iteration (design for iteration: "you won't get it right the first time") to reduce the variance (the robustness principle: "be strict on outputs") of production-line manufacturing. The "Six Sigma" label is somewhat misleading because in the application of the method, the number 6 is allocated to deal with two quite different effects. The method sets a target of controlling the production line variance to the level of 4.5 σ, just as in the gate example of Figure 8.2. The remaining 1.5 σ is the amount that the mean output value is allowed to drift away from its original specification over the life of the production line.
FIGURE 8.2: The normal probability density function applied to production of gates that are specified to have a mean propagation time of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds. The upper numbers on the horizontal axis measure the distance from the mean in units of the standard deviation, σ. The lower numbers depict the corresponding propagation times. The integral of the tail from 4.5 σ to ∞ is so small that it is not visible in the figure.
So even though the production line may start 6 σ away from the tolerance limit, after it has been operating for a while one may find that the failure rate has drifted upward to the same 3.4 in a million calculated for the 4.5 σ case.

In the manufacturing quality control literature, these applications of the two design principles are known as Taguchi methods, after their popularizer, Genichi Taguchi.
8.3 Tolerating Active Faults
2. For each undetectable error, evaluate the probability of its occurrence. If that probability is not negligible, modify the system design in whatever way necessary to make the error reliably detectable.

3. For each detectable error, implement a detection procedure and reclassify the module in which it is detected as fail-fast.

4. For each detectable error, try to devise a way of masking it. If there is a way, reclassify this error as a maskable error.

5. For each maskable error, evaluate its probability of occurrence, the cost of failure, and the cost of the masking method devised in the previous step. If the evaluation indicates it is worthwhile, implement the masking method and reclassify this error as a tolerated error.
When finished developing such a model, the designer should have a useful fault tolerance specification for the system. Some errors, which have negligible probability of occurrence or for which a masking measure would be too expensive, are identified as untolerated. When those errors occur the system fails, leaving its users to cope with the result. Other errors have specified recovery algorithms, and when those occur the system should continue to run correctly. A review of the system recovery strategy can now focus separately on two distinct questions:

• Is the designer's list of potential error events complete, and is the assessment of the probability of each error realistic?

• Is the designer's set of algorithms, procedures, and implementations that are supposed to detect and mask the anticipated errors complete and correct?
These two questions are different. The first is a question of models of the real world. It addresses an issue of experience and judgment about real-world probabilities and whether all real-world modes of failure have been discovered or some have gone unnoticed. Two different engineers, with different real-world experiences, may reasonably disagree on such judgments; they may have different models of the real world. The evaluation of modes of failure and of probabilities is a point at which a designer may easily go astray because such judgments must be based not on theory but on experience in the field, either personally acquired by the designer or learned from the experience of others. A new technology, or an old technology placed in a new environment, is likely to create surprises. A wrong judgment can lead to wasted effort devising detection and masking algorithms that will rarely be invoked rather than the ones that are really needed. On the other hand, if the needed experience is not available, all is not lost: the iteration part of the design process is explicitly intended to provide that experience.
The second question is more abstract and also more absolutely answerable, in that an argument for correctness (unless it is hopelessly complicated) or a counterexample to that argument should be something that everyone can agree on. In system design, it is helpful to follow design procedures that distinctly separate these classes of questions. When someone questions a reliability feature, the designer can first ask, "Are you questioning the correctness of my recovery algorithm or are you questioning my model of what may fail?" and thereby properly focus the discussion or argument.
Creating a fault tolerance model also lays the groundwork for the iteration part of the fault tolerance design process. If a system in the field begins to fail more often than expected, or completely unexpected failures occur, analysis of those failures can be compared with the fault tolerance model to discover what has gone wrong. By again asking the two questions marked with bullets above, the model allows the designer to distinguish between, on the one hand, failure probability predictions being proven wrong by field experience, and on the other, inadequate or misimplemented masking procedures. With this information the designer can work out appropriate adjustments to the model and the corresponding changes needed for the system.
Iteration and review of fault tolerance models is also important to keep them up to date in the light of technology changes. For example, the Network File System described in Section 4.4 was first deployed using a local area network, where packet loss errors are rare and may even be masked by the link layer. When later users deployed it on larger networks, where lost packets are more common, it became necessary to revise its fault tolerance model and add additional error detection in the form of end-to-end checksums. The processor time required to calculate and check those checksums caused some performance loss, which is why its designers did not originally include checksums. But loss of data integrity outweighed loss of performance, and the designers reversed the trade-off.
To illustrate, an example of a fault tolerance model applied to a popular kind of memory device, RAM, appears in Section 8.7. This fault tolerance model employs error detection and masking techniques that are described below in Section 8.4 of this chapter, so the reader may prefer to delay detailed study of that section until completing Section 8.4.
8.4 Systematically Applying Redundancy
Suppose we create an encoding in which the Hamming distance between every pair of legitimate data patterns is 2. Then, if one bit changes accidentally, since no legitimate data item can have that pattern, we can detect that something went wrong, but it is not possible to figure out what the original data pattern was. Thus, if the two patterns above were two members of the code and the first bit of the upper pattern were flipped from one to zero, there is no way to tell that the result, 000101, is not the result of flipping the

1-bit errors and detect 2-bit errors. But a 3-bit error would look just like a 1-bit error in some other code pattern, so it would decode to a wrong value. More generally, if the Hamming distance of a code is d, a little analysis reveals that one can detect d − 1 errors and correct ⌊(d − 1)/2⌋ errors. The reason that this form of redundancy is named "forward" error correction is that the creator of the data performs the coding before storing or transmitting it, and anyone can later decode the data without appealing to the creator. (Chapter 7 [on-line] described the technique of asking the sender of a lost frame, packet, or message to retransmit it. That technique goes by the name of backward error correction.)
The systematic construction of forward error-detection and error-correction codes is a large field of study, which we do not intend to explore. However, two specific examples of commonly encountered codes are worth examining.
The first example is a simple parity check on a 2-bit value, in which the parity bit is the XOR of the 2 data bits. The coded pattern is 3 bits long, so there are 2³ = 8 possible patterns for this 3-bit quantity, only 4 of which represent legitimate data. As illustrated in Figure 8.4, the 4 "correct" patterns have the property that changing any single bit transforms the word into one of the 4 illegal patterns. To transform the coded quantity into another legal pattern, at least 2 bits must change (in other words, the Hamming distance of this code is 2). The conclusion is that a simple parity check can detect any single error, but it doesn't have enough information to correct errors.

FIGURE 8.4: Patterns for a simple parity-check code. Each line connects patterns that differ in only one bit; bold-face patterns are the legitimate ones.
The second example, in Figure 8.5, shows a forward error-correction code that can correct 1-bit errors in a 4-bit data value, by encoding the 4 bits into 7-bit words. In this code, bits P7, P6, P5, and P3 carry the data, while bits P4, P2, and P1 are calculated from the data bits. (This out-of-order numbering scheme creates a multidimensional binary coordinate system with a use that will be evident in a moment.) We could analyze this code to determine its Hamming distance, but we can also observe that three extra bits can carry exactly enough information to distinguish 8 cases: no error, an error in bit 1, an error in bit 2, ... or an error in bit 7. Thus, it is not surprising that an error-correction code can be created. This code calculates bits P1, P2, and P4 as follows:

P1 = P7 ⊕ P5 ⊕ P3
P2 = P7 ⊕ P6 ⊕ P3
P4 = P7 ⊕ P6 ⊕ P5
Now, suppose that the array of bits P1 through P7 is sent across a network and noise causes bit P5 to flip. If the recipient recalculates P1, P2, and P4, the recalculated values of P1 and P4 will be different from the received bits P1 and P4. The recipient then writes P4 P2 P1 in order, representing the troubled bits as ones and untroubled bits as zeros, and notices that their binary value is 101₂ = 5, the position of the flipped bit. In this code, whenever there is a one-bit error, the troubled parity bits directly identify the bit to correct. (That was the reason for the out-of-order bit-numbering scheme, which created a 3-dimensional coordinate system for locating an erroneous bit.)
The use of 3 check bits for 4 data bits suggests that an error-correction code may not be efficient, but in fact the apparent inefficiency of this example is only because it is so small. Extending the same reasoning, one can, for example, provide single-error correction for 56 data bits using 7 check bits in a 63-bit code word.
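This code is compact enough to exercise end to end. The following Python sketch (our own illustration, using the bit numbering of Figure 8.5) encodes 4 data bits, injects a 1-bit error, and corrects it via the recomputed check bits:

    # Bits are indexed P7..P1 as in Figure 8.5; P7, P6, P5, P3 carry data.
    def encode(d7, d6, d5, d3):
        p1 = d7 ^ d5 ^ d3
        p2 = d7 ^ d6 ^ d3
        p4 = d7 ^ d6 ^ d5
        return {7: d7, 6: d6, 5: d5, 4: p4, 3: d3, 2: p2, 1: p1}

    def correct(bits):
        # Recompute the check bits; the troubled ones spell out the bad position.
        s1 = bits[1] ^ bits[7] ^ bits[5] ^ bits[3]
        s2 = bits[2] ^ bits[7] ^ bits[6] ^ bits[3]
        s4 = bits[4] ^ bits[7] ^ bits[6] ^ bits[5]
        syndrome = 4 * s4 + 2 * s2 + s1          # 0 means no error detected
        if syndrome:
            bits[syndrome] ^= 1                  # flip the erroneous bit back
        return bits

    word = encode(1, 0, 1, 1)
    word[5] ^= 1                                 # noise flips P5 in transit
    fixed = correct(dict(word))
    print(fixed[7], fixed[6], fixed[5], fixed[3])   # 1 0 1 1: data recovered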
In both of these examples of coding, the assumed threat to integrity is that an unidentified bit out of a group may be in error. Forward error correction can also be effective against other threats. A different threat, called erasure, is also common in digital systems. An erasure occurs when the value of a particular, identified bit of a group is unintelligible or perhaps even completely missing. Since we know which bit is in question, the simple parity-check code, in which the parity bit is the XOR of the other bits, becomes a forward error correction code. The unavailable bit can be reconstructed simply by calculating the XOR of the unerased bits. Returning to the example of Figure 8.4, if we find a pattern in which the first and last bits have values 0 and 1 respectively, but the middle bit is illegible, the only possibilities are 001 and 011. Since 001 is not a legitimate code pattern, the original pattern must have been 011. The simple parity check allows correction of only a single erasure. If there is a threat of multiple erasures, a more complex coding scheme is needed. Suppose, for example, we have 4 bits to protect, and they are coded as in Figure 8.5. In that case, if as many as 3 bits are erased, the remaining 4 bits are sufficient to reconstruct the values of the 3 that are missing.
FIGURE 8.5: A single-error-correction code. In the table, the symbol ⊕ marks the bits that participate in the calculation of one of the redundant bits. The payload bits are P7, P6, P5, and P3, and the redundant bits are P4, P2, and P1. The "every other" notes describe a 3-dimensional coordinate system that can locate an erroneous bit.

Since erasure, in the form of lost packets, is a threat in a best-effort packet network, this same scheme of forward error correction is applicable. One might, for example, send four numbered, identical-length packets of data followed by a parity packet that contains as its payload the bit-by-bit XOR of the payloads of the previous four. (That is, the first bit of the parity packet is the XOR of the first bit of each of the other four packets; the second bits are treated similarly, etc.) Although the parity packet adds 25% to the network load, as long as any four of the five packets make it through, the receiving side can reconstruct all of the payload data perfectly without having to ask for a retransmission. If the network is so unreliable that more than one packet out of five typically gets lost, then one might send seven packets, of which four contain useful data and the remaining three are calculated using the formulas of Figure 8.5. (Using the numbering scheme of that figure, the payload of packet 4, for example, would consist of the XOR of the payloads of packets 7, 6, and 5.) Now, if any four of the seven packets make it through, the receiving end can reconstruct the data.
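A sketch of the five-packet scheme, with toy four-byte payloads standing in for real packets:

    from functools import reduce

    def xor_packets(packets):
        # Bit-by-bit XOR of equal-length packets.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*packets))

    data = [b"ACS1", b"ACS2", b"ACS3", b"ACS4"]   # four equal-length payloads
    parity = xor_packets(data)                    # fifth, redundant packet

    # Suppose packet 2 (index 1) is lost in transit: the XOR of the four
    # surviving packets reconstructs it exactly.
    survivors = [data[0], data[2], data[3], parity]
    print(xor_packets(survivors))                 # b'ACS2'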
Forward error correction is especially useful in broadcast protocols, where the existence of a large number of recipients, each of which may miss different frames, packets, or stream segments, makes the alternative of backward error correction by requesting retransmission unattractive. Forward error correction is also useful when controlling jitter in stream transmission because it eliminates the round-trip delay that would be required in requesting retransmission of missing stream segments. Finally, forward error correction is usually the only way to control errors when communication is one-way or round-trip delays are so long that requesting retransmission is impractical, for example, when communicating with a deep-space probe. On the other hand, using forward error correction to replace lost packets may have the side effect of interfering with congestion control techniques in which an overloaded packet forwarder tries to signal the sender to slow down by discarding an occasional packet.
Another application of forward error correction to counter erasure is in storing data on magnetic disks. The threat in this case is that an entire disk drive may fail, for example because of a disk head crash. Assuming that the failure occurs long after the data was originally written, this example illustrates one-way communication in which backward error correction (asking the original writer to write the data again) is not usually an option. One response is to use a RAID array (see Section 2.1.1.4) in a configuration known as RAID 4. In this configuration, one might use an array of five disks, with four of the disks containing application data and each sector of the fifth disk containing the bit-by-bit XOR of the corresponding sectors of the first four. If any of the five disks fails, its identity will quickly be discovered because disks are usually designed to be fail-fast and report failures at their interface. After replacing the failed disk, one can restore its contents by reading the other four disks and calculating, sector by sector, the XOR of their data (see exercise 8.9). To maintain this strategy, whenever anyone updates a data sector, the RAID 4 system must also update the corresponding sector of the parity disk, as shown in Figure 8.6. That figure makes it apparent that, in RAID 4, forward error correction has an identifiable read and write performance cost, in addition to the obvious increase in the amount of disk space used. Since loss of data can be devastating, there is considerable interest in RAID, and much ingenuity has been devoted to devising ways of minimizing the performance penalty.
FIGURE 8.6: Update of a sector on disk 2 of a five-disk RAID 4 system. The old parity sector contains parity ← data 1 ⊕ data 2 ⊕ data 3 ⊕ data 4. To construct a new parity sector that includes the new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and perform three more XORs. But a faster way is to read just the old parity sector and the old data 2 sector and compute the new parity sector as new parity ← old parity ⊕ old data 2 ⊕ new data 2.
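The fast update path of Figure 8.6 is just two XORs; a sketch with one-byte sectors (real sectors would be kilobytes, but XOR is bytewise either way):

    # The fast RAID 4 parity update: read old data and old parity,
    # write new data and new parity.
    def update_parity(old_data, new_data, old_parity):
        # new parity <- old parity XOR old data XOR new data
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

    d1, d2, d3, d4 = b"\x10", b"\x22", b"\x33", b"\x44"
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d1, d2, d3, d4))

    new_d2 = b"\x99"
    parity = update_parity(d2, new_d2, parity)

    # Check against the slow way (re-XOR of all four data sectors):
    assert parity == bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d1, new_d2, d3, d4))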
The corresponding way of building a reliable system out of unreliable discrete components is to acquire multiple copies of each component. Identical multiple copies are called replicas, and the technique is called replication. There is more to it than just making copies: one must also devise a plan to arrange or interconnect the replicas so that a failure in one replica is automatically masked with the help of the ones that don't fail. For example, if one is concerned about the possibility that a diode may fail by either shorting out or creating an open circuit, one can set up a network of four diodes as in Figure 8.7, creating what we might call a "superdiode". This interconnection scheme, known as a quad component, was developed by Claude E. Shannon and Edward F. Moore in the 1950s as a way of increasing the reliability of relays in telephone systems. It can also be used with resistors and capacitors in circuits that can tolerate a modest range of component values. This particular superdiode can tolerate a single short circuit and a single open circuit in any two component diodes, and it can also tolerate certain other multiple failures, such as open circuits in both upper diodes plus a short circuit in one of the lower diodes. If the bridging connection of the figure is added, the superdiode can tolerate additional multiple open-circuit failures (such as one upper diode and one lower diode), but it will be less tolerant of certain short-circuit failures (such as one left diode and one right diode).
A serious problem with this superdiode is that it masks failures silently. There is no
easy way to determine how much failure tolerance remains in the system.
8.4.3 Voting
Although there have been attempts to extend quad-component methods to digital logic, the intricacy of the required interconnections grows much too rapidly. Fortunately, there is a systematic alternative that takes advantage of the static discipline and level regeneration that are inherent properties of digital logic. In addition, it has the nice feature that it can be applied at any level of module, from a single gate on up to an entire computer. The technique is to substitute in place of a single module a set of replicas of that same module, all operating in parallel with the same inputs, and compare their outputs with a device known as a voter. This basic strategy is called N-modular redundancy, or NMR. When N has the value 3 the strategy is called triple-modular redundancy, abbreviated TMR. When other values are used for N, the strategy is named by replacing the N of NMR with the number, as in 5MR.
FIGURE 8.7: A quad-component superdiode. The dotted line represents an optional bridging connection, which allows the superdiode to tolerate a different set of failures, as described in the text.
The combination of N replicas of some module and the voting system is sometimes called a supermodule. Several different schemes exist for interconnection and voting, only a few of which we explore here.

The simplest scheme, called fail-vote, consists of NMR with a majority voter. One assembles N replicas of the module and a voter that consists of an N-way comparator and some counting logic. If a majority of the replicas agree on the result, the voter accepts that result and passes it along to the next system component. If any replicas disagree with the majority, the voter may in addition raise an alert, calling for repair of the replicas that were in the minority. If there is no majority, the voter signals that the supermodule has failed. In failure-tolerance terms, a triply-redundant fail-vote supermodule can mask the failure of any one replica, and it is fail-fast if any two replicas fail in different ways.
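A fail-vote voter reduces to a majority count. The following sketch (an illustration, not a hardware design) returns the majority value plus the identities of any minority replicas, and signals supermodule failure when no majority exists:

    from collections import Counter

    def fail_vote(outputs):
        # NMR majority voter: accept a majority value, flag minority replicas
        # for repair, and signal supermodule failure when there is no majority.
        value, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("supermodule failed: no majority")
        minority = [i for i, v in enumerate(outputs) if v != value]
        return value, minority

    print(fail_vote([42, 42, 7]))   # (42, [2]): masks one failed replica
    # fail_vote([1, 2, 3]) would raise: two replicas failing in different ways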
If the reliability, as was defined in Section 8.2.2, of a single replica module is R and the underlying fault mechanisms are independent, a TMR fail-vote supermodule will operate correctly if all 3 modules are working (with reliability R³) or if 1 module has failed and the other 2 are working (with reliability R²(1 − R)). Since a single-module failure can happen in 3 different ways, the reliability of the supermodule is the sum

R_supermodule = R³ + 3R²(1 − R).    (Eq. 8-10)
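Eq. 8-10 is easy to tabulate; note from the sketch below that voting helps only when each replica is itself reasonably reliable:

    def r_tmr(r):
        # Eq. 8-10: all three work, or exactly one of three has failed.
        return r**3 + 3 * r**2 * (1 - r)

    for r in (0.999, 0.99, 0.9, 0.5):
        print(r, "->", round(r_tmr(r), 6))
    # 0.999 -> 0.999997: TMR helps greatly when each replica is reliable,
    # but r_tmr(0.5) = 0.5, and below R = 0.5 voting actually hurts.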
These calculations assume that the voter is perfectly reliable. Rather than trying to create perfect voters, the obvious thing to do is replicate them, too. In fact, everything (modules, inputs, outputs, sensors, actuators, etc.) should be replicated, and the final vote should be taken by the client of the system. Thus, three-engine airplanes vote with their propellers: when one engine fails, the two that continue to operate overpower the inoperative one. On the input side, the pilot's hand presses forward on three separate throttle levers. A fully replicated TMR supermodule is shown in Figure 8.8. With this interconnection arrangement, any measurement or estimate of the reliability, R, of a component module should include the corresponding voter. It is actually customary (and more logical) to consider a voter to be a component of the next module in the chain rather than, as the diagram suggests, the previous module. This fully replicated design is sometimes described as recursive.
FIGURE 8.8: A triple-modular redundant supermodule, with three inputs, three voters, and three outputs.
hypothetical airplane during its 6 hours of flight, which amounts to only 0.001 of the single-engine MTTF: the mission time is very short compared with the MTTF, and the reliability is far higher. Figure 8.10 shows the same curve, but for flight times that are comparable with the MTTF. In this region, if the plane tried to keep flying for 8,400 hours (about 1.4 times the single-engine MTTF), a single-engine plane would fail to complete the flight in 3 out of 4 tries, but the 3-engine plane would fail to complete the flight in 5 out of 6 tries. (One should be wary of these calculations because the assumptions of independence and memoryless operation may not be met in practice. Sidebar 8.2 elaborates.)
FIGURE 8.9: Reliability with triple modular redundancy, for mission times much less than the MTTF of 6,000 hours. The vertical dotted line represents a six-hour flight.

FIGURE 8.10: Reliability with triple modular redundancy, for mission times comparable to the MTTF of 6,000 hours. The two vertical dotted lines represent mission times of 6,000 hours (left) and 8,400 hours (right).
10 software failures
30 hardware failures

New hardware is installed, identical to the old except that it never fails. The MTTF should jump to 6 weeks because the only remaining failures are software, right?

Perhaps, but only if the software failure process is independent of the hardware failure process. Suppose the software failure occurs because there is a bug (fault) in a clock-updating procedure: the bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is started, if it gets a chance to run that long. The old hardware was causing crashes so often that the software bug only occasionally had a chance to do its thing, only about once every 6 weeks. Most of the time, the recovery from a hardware failure, which requires restarting the system, had the side effect of resetting the process that triggered the software bug. So, when the new hardware is installed, the system has an MTTF of only 2.5 weeks, much less than hoped.

MTTFs are useful, but one must be careful to understand what assumptions go into their measurement and use.
If we had assumed that the plane could limp home with just one engine, the MTTF would have increased, rather than decreased, but only modestly. Replication provides a dramatic improvement in reliability for missions of duration short compared with the MTTF, but the MTTF itself changes much less. We can verify this claim with a little more analysis, again assuming memoryless failure processes to make the mathematics tractable.
Suppose we have an NMR system with the property that it somehow continues to be useful as long as at least one replica is still working. (This system requires using fail-fast replicas and a cleverer voter, as described in Section 8.4.4 below.) If a single replica has an MTTF_replica = 1, there are N independent replicas, and the failure process is memoryless, the expected time until the first failure is MTTF_replica/N, the expected time from then until the second failure is MTTF_replica/(N − 1), etc., and the expected time until the system of N replicas fails is the sum of these times,

MTTF_system = 1 + 1/2 + 1/3 + ... + 1/N    (Eq. 8-11)
which for large N is approximately ln(N). As we add to the cost by adding more replicas, MTTF_system grows disappointingly slowly, proportional to the logarithm of the cost. To multiply MTTF_system by K, the number of replicas required is e^K; the cost grows exponentially. The significant conclusion is that in systems for which the mission time is long compared with MTTF_replica, simple replication escalates the cost while providing little benefit. On the other hand, there is a way of making replication effective for long missions, too. The method is to enhance replication by adding repair.
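A sketch of eq. 8-11 and its consequence; the numbers confirm how slowly the harmonic sum grows:

    import math

    def mttf_system(n):
        # Eq. 8-11 with MTTF_replica = 1: expected time until all n replicas fail.
        return sum(1 / k for k in range(1, n + 1))

    print(round(mttf_system(10), 2))    # ~2.93: ten replicas, not even 3x one
    # To multiply the MTTF by K = 4 we need roughly e**4 replicas:
    print(math.ceil(math.exp(4)))       # 55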
8.4.4 Repair
Let us return now to a fail-vote TMR supermodule (that is, one that requires that at least two replicas be working) in which the voter has just noticed that one of the three replicas is producing results that disagree with the other two. Since the voter is in a position to report which replica has failed, suppose that it passes such a report along to a repair person who immediately examines the failing replica and either fixes or replaces it. For this approach, the mean time to repair (MTTR) measure becomes of interest. The supermodule fails if either the second or third replica fails before the repair to the first one can be completed. Our intuition is that if the MTTR is small compared with the combined MTTF of the other two replicas, the chance that the supermodule fails will be similarly small.
The exact effect on the chances of supermodule failure depends on the shape of the reliability function of the replicas. In the case where the failure and repair processes are both memoryless, the effect is easy to calculate. Since the rate of failure of 1 replica is 1/MTTF, the rate of failure of 2 replicas is 2/MTTF. If the repair time is short compared with MTTF, the probability of a failure of 1 of the 2 remaining replicas while waiting a time T for repair of the one that failed is approximately 2T/MTTF. Since the mean time to repair is MTTR, we have

Pr(supermodule fails while waiting for repair) = 2 × MTTR / MTTF    (Eq. 8-12)
Continuing our airplane example and temporarily suspending disbelief, suppose that during a long flight we send a mechanic out on the airplane's wing to replace a failed engine. If the replacement takes 1 hour, the chance that one of the other two engines fails during that hour is approximately 1/3000. Moreover, once the replacement is complete, we expect to fly another 2,000 hours until the next engine failure. Assuming further that the mechanic is carrying an unlimited supply of replacement engines, completing a 10,000-hour flight, or even a longer one, becomes plausible. The general formula for the MTTF of a fail-vote TMR supermodule with memoryless failure and repair processes comes out of the analysis of continuous-transition birth-and-death Markov processes, an advanced probability technique that is beyond our scope; the result is

MTTF_supermodule = (MTTF_replica / 3) × (MTTF_replica / 2) / MTTR = (MTTF_replica)² / (6 × MTTR)    (Eq. 8-13)
Thus, our 3-engine plane with hypothetical in-flight repair has an MTTF of 6 million hours, an enormous improvement over the 6,000 hours of a single-engine plane. This equation can be interpreted as saying that, compared with an unreplicated module, the MTTF has been reduced by the usual factor of 3 because there are 3 replicas, but at the same time the availability of repair has increased the MTTF by a factor equal to the ratio of the MTTF of the remaining 2 engines to the MTTR.
Replacing an airplane engine in flight may be a fanciful idea, but replacing a magnetic disk in a computer system on the ground is quite reasonable. Suppose that we store 3 replicas of a set of data on 3 independent hard disks, each of which has an MTTF of 5 years (using as the MTTF the expected operational lifetime, not the "MTTF" derived from the short-term failure rate). Suppose also that if a disk fails, we can locate, install, and copy the data to a replacement disk in an average of 10 hours. In that case, by eq. 8-13, the MTTF of the data is (43,800 hours)² / (6 × 10 hours), which is about 32 million hours, or roughly 3,650 years.
Each of these concerns acts to reduce the reliability below what might be expected from our overly simple analysis. Nevertheless, NMR with repair remains a useful technique, and in Chapter 10 [on-line] we will see ways in which it can be applied to disk storage.
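Both the airplane and the disk numbers drop out of eq. 8-13 directly; a sketch (assuming 8,760 hours per year):

    def mttf_with_repair(mttf_replica, mttr):
        # Eq. 8-13 for a fail-vote TMR supermodule with memoryless
        # failure and repair processes.
        return mttf_replica**2 / (6 * mttr)

    print(mttf_with_repair(6000, 1))          # 6,000,000 hours: the airplane
    hours = mttf_with_repair(5 * 8760, 10)    # three disks, 10-hour repairs
    print(round(hours / 8760))                # ~3,650 years of data MTTF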
One of the most powerful applications of NMR is in the masking of transient errors. When a transient error occurs in one replica, the NMR voter immediately masks it. Because the error is transient, the subsequent behavior of the supermodule is as if repair happened by the next operation cycle. The numerical result is little short of extraordinary. For example, consider a processor arithmetic logic unit (ALU) with a 1 gigahertz clock, triply replicated, with voters checking its output at the end of each clock cycle. In equation 8-13 we have MTTR_replica = 1 cycle (in this application, equation 8-13 is only an approximation because the time to repair is a constant rather than the result of a memoryless process), and MTTF_supermodule = (MTTF_replica)²/6 cycles. If MTTF_replica is 10¹⁰ cycles (1 error in 10 billion cycles, which at this clock speed means one error every 10 seconds), MTTF_supermodule is 10²⁰/6 cycles, about 500 years. TMR has taken three ALUs that were for practical use nearly worthless and created a super-ALU that is almost infallible.
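The same eq. 8-13 arithmetic, in cycles; the figures below use the clock rate and error rate assumed in the text:

    clock_hz = 1e9        # 1 GHz ALU, voted every cycle
    mttf_replica = 1e10   # one transient error per 10 billion cycles

    mttf_super = mttf_replica**2 / 6           # eq. 8-13 with MTTR = 1 cycle
    seconds = mttf_super / clock_hz
    print(round(seconds / (3600 * 24 * 365)))  # ~528 years: "about 500 years"
    # If transient errors are 20x more frequent, the result shrinks by 400x:
    print(round(seconds / 400 / (3600 * 24 * 365), 1))   # ~1.3 years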
The reason things seem so good is that we are evaluating the chance that two transient errors occur in the same operation cycle. If transient errors really are independent, that chance is small. This effect is powerful, but the leverage works in both directions, thereby creating a potential hazard: it is especially important to keep track of the rate at which transient errors actually occur. If they are happening, say, 20 times as often as hoped, MTTF_supermodule will be 1/400 of the original prediction: the super-ALU is likely to fail once per year. That may still be acceptable for some applications, but it is a big change. Also, as usual, the assumption of independence is absolutely critical. If all the ALUs came from the same production line, it seems likely that they will have at least some faults in common, in which case the super-ALU may be just as worthless as the individual ALUs.
Several variations on the simple fail-vote structure appear in practice:

• Purging. In an NMR design with a voter, whenever the voter detects that one replica disagrees with the majority, the voter calls for its repair and in addition marks that replica DOWN and ignores its output until hearing that it has been repaired. This technique doesn't add anything to a TMR design, but with higher levels of replication, as long as replicas fail one at a time and any two replicas continue to operate correctly, the supermodule works.

• Pair-and-compare. Create a fail-fast module by taking two replicas, giving them the same inputs, and connecting a simple comparator to their outputs. As long as the comparator reports that the two replicas of a pair agree, the next stage of the system accepts the output. If the comparator detects a disagreement, it reports that the module has failed. The major attraction of pair-and-compare is that it can be used to create fail-fast modules starting with easily available commercial, off-the-shelf components, rather than commissioning specialized fail-fast versions. Special high-reliability components typically have a cost that is much higher than off-the-shelf designs, for two reasons. First, since they take more time to design and test,
the ones that are available are typically of an older, more expensive technology. Second, they are usually low-volume products that cannot take advantage of economies of large-scale production. These considerations also conspire to produce long delivery cycles, making it harder to keep spares in stock. An important aspect of using standard, high-volume, low-cost components is that one can afford to keep a stock of spares, which in turn means that MTTR can be made small: just replace a failing replica with a spare (the popular term for this approach is pair-and-spare) and do the actual diagnosis and repair at leisure.
• NMR with fail-fast replicas. If each of the replicas is itself a fail-fast design (perhaps using pair-and-compare internally), then a voter can restrict its attention to the outputs of only those replicas that claim to be producing good results and ignore those that are reporting that their outputs are questionable. With this organization, a TMR system can continue to operate even if 2 of its 3 replicas have failed, since the 1 remaining replica is presumably checking its own results. An NMR system with repair and constructed of fail-fast replicas is so robust that it is unusual to find examples for which N is greater than 2.
Figure 8.11 compares the ability to continue operating until repair arrives of 5MR designs that use fail-vote, purging, and fail-fast replicas. The observant reader will note that this chart can be deemed guilty of a misleading comparison, since it claims that the 5MR system continues working when only one fail-fast replica is still running. But if that fail-fast replica is actually a pair-and-compare module, it might be more accurate to say that there are two still-working replicas at that point.

Another technique that takes advantage of repair, can improve availability, and can degrade gracefully (in other words, it can be fail-soft) is called partition. If there is a choice of purchasing a system that has either one fast processor or two slower processors, the two-processor system has the virtue that when one of its processors fails, the system
FIGURE 8.11: Failure points of three different 5MR supermodule designs, if repair does not happen in time. The chart plots the number of replicas still working correctly (vertical axis) against time (horizontal axis), showing the points at which 5MR with fail-vote, 5MR with purging, and 5MR with fail-fast replicas each fails.
can continue to operate with half of its usual capacity until someone can repair the failed processor. An electric power company, rather than installing a single generator of capacity K megawatts, may install N generators of capacity K/N megawatts each.

When equivalent modules can easily share a load, partition can extend to what is called N + 1 redundancy. Suppose a system has a load that would require the capacity of N equivalent modules. The designer partitions the load across N + 1 or more modules. Then, if any one of the modules fails, the system can carry on at full capacity until the failed module can be repaired.
N + 1 redundancy is most applicable to modules that are completely interchangeable, can be dynamically allocated, and are not used as storage devices. Examples are processors, dial-up modems, airplanes, and electric generators. Thus, one extra airplane located at a busy hub can mask the failure of any single plane in an airline's fleet. When modules are not completely equivalent (for example, electric generators come in a range of capacities, but can still be interconnected to share load), the design must ensure that the spare capacity is greater than the capacity of the largest individual module. For devices that provide storage, such as a hard disk, it is also possible to apply partition and N + 1 redundancy with the same goals, but it requires a greater level of organization to preserve the stored contents when a failure occurs, for example by using RAID, as was described in Section 8.4.1, or some more general replica management system such as those discussed in Section 10.3.7.
For some applications an occasional interruption of availability is acceptable, while in others every interruption causes a major problem. When repair is part of the fault tolerance plan, it is sometimes possible, with extra care and added complexity, to design a system to provide continuous operation. Adding this feature requires that when failures occur, one can quickly identify the failing component, remove it from the system, repair it, and reinstall it (or a replacement part), all without halting operation of the system. The design required for continuous operation of computer hardware involves connecting and disconnecting cables and turning off power to some components but not others, without damaging anything. When hardware is designed to allow connection and disconnection from a system that continues to operate, it is said to allow hot swap.
In a computer system, continuous operation also has significant implications for the software. Configuration management software must anticipate hot swap so that it can stop using hardware components that are about to be disconnected, as well as discover newly attached components and put them to work. In addition, maintaining state is a challenge. If there are periodic consistency checks on data, those checks (and repairs to data when the checks reveal inconsistencies) must be designed to work correctly even though the system is in operation and the data is perhaps being read and updated by other users at the same time.
Overall, continuous operation is not a feature that should be casually added to a list of system requirements. When someone suggests it, it may be helpful to point out that it is much like trying to keep an airplane flying indefinitely. Many large systems that appear to provide continuous operation are actually designed to stop occasionally for maintenance.
8.6 Wrapping up Reliability
Whereas redundancy can provide masking of errors, redundant components that are used only when failures occur are much more likely to cause trouble than redundant components that are regularly exercised in normal operation. The reason is that failures in regularly exercised components are likely to be immediately noticed and fixed. Failures in unused components may not be noticed until a failure somewhere else happens. But then there are two failures, which may violate the design assumptions of the masking plan. This observation is especially true for software, where rarely-used recovery procedures often accumulate unnoticed bugs and incompatibilities as other parts of the system evolve. The alternative of periodic testing of rarely-used components to lower their failure latency is a band-aid that rarely works well.

In applying these design principles, it is important to consider the threats, the consequences, the environment, and the application. Some faults are more likely than others,
some failures are more disruptive than others, and different techniques may be appropriate in different environments. A computer-controlled radiation therapy machine, a deep-space probe, a telephone switch, and an airline reservation system all need fault tolerance, but in quite different forms. The radiation therapy machine should emphasize fault detection and fail-fast design, to avoid injuring patients. Masking faults may actually be a mistake; it is likely to be safer to stop, find their cause, and fix them before continuing operation. The deep-space probe, once the mission begins, needs to concentrate on failure masking to ensure mission success. The telephone switch needs many nines of availability because customers expect to always receive a dial tone, but if it occasionally disconnects one ongoing call, that customer will simply redial without thinking much about it. Users of the airline reservation system might tolerate short gaps in availability, but the durability of its storage system is vital. At the other extreme, most people find that a digital watch has an MTTF that is long compared with the time until the watch is misplaced, becomes obsolete, goes out of style, or is discarded. Consequently, no provision for either error masking or repair is really needed. Some applications have built-in redundancy that a designer can exploit. In a video stream, it is usually possible to mask the loss of a single video frame by just repeating the previous frame.
These two points interact: when an error propagates it can contaminate otherwise correct data, which can increase the cost of masking and perhaps even render masking impossible. The result is that when the cost is small, error masking is usually done locally. (That is assuming that masking is done at all; many personal computer designs omit memory error masking. Section 8.8.1 discusses some of the reasons for this design decision.)

A closely related observation is that when a lower layer masks a fault it is important that it also report the event to a higher layer, so that the higher layer can keep track of how much masking is going on and thus how much failure tolerance there remains. Reporting to a higher layer is a key aspect of the safety margin principle.
that can tolerate Byzantine faults. Because the tolerance algorithms can be quite complex, we defer the topic to advanced study.
We also have not explored the full range of reliability techniques that one might encounter in practice. For an example that has not yet been mentioned, Sidebar 8.4 describes the heartbeat, a popular technique for detecting failures of active processes.

This chapter has oversimplified some ideas. For example, the definition of availability proposed in Section 8.2 of this chapter is too simple to adequately characterize many large systems. If a bank has hundreds of automatic teller machines, there will probably always be a few teller machines that are not working at any instant. For this case, an availability measure based on the percentage of transactions completed within a specified response time would probably be more appropriate.
A rapidly moving but in-depth discussion of fault tolerance can be found in Chapter
3 o f the book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas
Reuter. A broader treatment, with case studies, can be found in the book Reliable Com
puter Systems: Design and Evaluation, by Daniel P. Siewiorek and Robert S. Swarz.
Byzantine faults are an area o f ongoing research and development, and the best source is
current professional literature.
This chapter has concentrated on general techniques for achieving reliability that are
applicable to hardware, software, and complete systems. Looking ahead, Chapters 9[on
line] and 10[on-line] revisit reliability in the context o f specific software techniques that
permit reconstruction o f stored state following a failure when there are several concur
rent activities. Chapter 11 [on-line], on securing systems against malicious attack,
introduces a redundancy scheme known as defense in depth that can help both to contain
and to mask errors in the design or implementation o f individual security mechanisms.
Sidebar 8.4: Detecting failures with heartbeats. An activity such as a Web server is usually intended to keep running indefinitely. If it fails (perhaps by crashing) its clients may notice that it has stopped responding, but clients are not typically in a position to restart the server. Something more systematic is needed to detect the failure and initiate recovery. One helpful technique is to program the thread that should be performing the activity to send a periodic signal to another thread (or a message to a monitoring service) that says, in effect, “I'm still OK”. The periodic signal is known as a heartbeat and the observing thread or service is known as a watchdog.

The watchdog service sets a timer, and on receipt of a heartbeat message it restarts the timer. If the timer ever expires, the watchdog assumes that the monitored service has gotten into trouble and it initiates recovery. One limitation of this technique is that if the monitored service fails in such a way that the only thing it does is send heartbeat signals, the failure will go undetected.

As with all fixed timers, choosing a good heartbeat interval is an engineering challenge. Setting the interval too short wastes resources sending and responding to heartbeat signals. Setting the interval too long delays detection of failures. Since detection is a prerequisite to repair, a long heartbeat interval increases MTTR and thus reduces availability.
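To make the sidebar's technique concrete, here is a minimal sketch in Python (the names Watchdog and monitored_activity are our own, not from the text): the watchdog restarts a timer on every heartbeat and invokes a recovery action if the timer ever expires.

    import threading
    import time

    class Watchdog:
        # Observes heartbeats; calls on_failure if none arrives within `interval`.
        def __init__(self, interval, on_failure):
            self.interval = interval
            self.on_failure = on_failure
            self.timer = None

        def heartbeat(self):
            # On receipt of a heartbeat, restart the timer.
            if self.timer is not None:
                self.timer.cancel()
            self.timer = threading.Timer(self.interval, self.on_failure)
            self.timer.daemon = True
            self.timer.start()

    def monitored_activity(watchdog):
        # The monitored thread periodically says "I'm still OK". Note the
        # sidebar's caveat: if the activity fails in a way that leaves this
        # loop running, the failure goes undetected.
        while True:
            watchdog.heartbeat()
            time.sleep(1.0)          # one unit of real work per heartbeat

    wd = Watchdog(interval=3.0,
                  on_failure=lambda: print("no heartbeat: initiating recovery"))
    threading.Thread(target=monitored_activity, args=(wd,), daemon=True).start()
    time.sleep(5.0)                  # let the demonstration run briefly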
Chapter 8

Topics in Distributed Coordination and Distributed Transactions
• Explain the notion of state-machine replication and the ISIS approach to
totally ordered multicast among replicas.
In this chapter, we introduce some topics and algorithms related to the issue of how processes coordinate their actions and agree on shared values in distributed systems, despite failures. The chapter begins with algorithms to achieve mutual exclusion among a collection of processes, so as to coordinate their accesses to shared resources. It goes on to examine how an election can be implemented in a distributed system. That is, it describes how a group of processes can agree on a new coordinator of their activities after the previous coordinator has failed.

The second half examines the related problems of multicast communication, consensus, byzantine agreement and interactive consistency. In multicast, the issue is how to agree on such matters as the order in which messages are to be delivered. Consensus and the other problems generalize from this: how can any collection of processes agree on some value, no matter what the domain of the values in question? We encounter a fundamental result in the theory of distributed systems: that under certain conditions - including surprisingly benign failure conditions - it is impossible to guarantee that processes will reach consensus.
11.1 Introduction
This chapter introduces a collection of algorithms whose goals vary but that share an aim
that is fundamental in distributed systems: for a set of processes to coordinate their
actions or to agree on one or more values. For example, in the case of a complex piece
of machinery such as a spaceship, it is essential that the computers controlling it agree
on such conditions as whether the spaceship’s mission is proceeding or has been aborted. Furthermore, the computers must coordinate their actions correctly with respect to shared resources (the spaceship’s sensors and actuators). The computers must be able
to do so even where there is no fixed master-slave relationship between the components
(which would make coordination particularly simple). The reason for avoiding fixed
master-slave relationships is that we often require our systems to keep working correctly
even if failures occur, so we need to avoid single points of failure, such as fixed masters.
An important distinction for us, as in Chapter 10, will be whether the distributed
system under study is asynchronous or synchronous. In an asynchronous system we can
make no timing assumptions. In a synchronous system, we shall assume that there are
bounds on the maximum message transmission delay, on the time to execute each step
of a process, and on clock drift rates. The synchronous assumptions allow us to use
timeouts to detect process crashes.
Another important aim of the chapter while discussing algorithms is to consider
failures, and how to deal with them when designing algorithms. Section 2.3.2 introduced
a failure model, which we shall use in this chapter. Coping with failures is a subtle
business, so we begin by considering some algorithms that tolerate no failures and
progress through benign failures until we consider how to tolerate arbitrary failures. We
encounter a fundamental result in the theory of distributed systems. Even under
surprisingly benign failure conditions, it is impossible to guarantee in an asynchronous system that a collection of processes can agree on a shared value - for example, for all of a spaceship’s controlling processes to agree ‘mission proceed’ or ‘mission abort’.
Section 11.2 examines the problem of distributed mutual exclusion. This is the
extension to distributed systems of the familiar problem of avoiding race conditions in
kernels and multi-threaded applications. Since much of what occurs in distributed
systems is resource sharing, this is an important problem to solve. Next, Section 11.3
introduces a related but more general issue of how to ‘elect’ one of a collection of
processes to perform a special role. For example, in Chapter 10 we saw how processes
synchronized their clocks to a designated time server. If this server fails and several
surviving servers can fulfil that role, then for the sake of consistency it is necessary to
choose just one server to take over.
Multicast communication is the subject of Section 11.4. As Section 4.5.1
explained, multicast is a very useful communication paradigm, with applications from
locating resources to coordinating the updates to replicated data. Section 11.4 examines
multicast reliability and ordering semantics, and gives algorithms to achieve the
variations. Multicast delivery is essentially a problem of agreement between processes:
the recipients agree on which messages they will receive, and in which order they will
receive them. Section 11.5 discusses the problem of agreement more generally,
primarily in the forms known as consensus and byzantine agreement.
The treatment followed in this chapter involves stating the assumptions and the
goals to be met, and giving an informal account of why the algorithms presented are
correct. There is insufficient space to provide a more rigorous approach. For that, we
refer the reader to a text that gives a thorough account of distributed algorithms, such as
Attiya and Welch [1998] and Lynch [1996].
Before presenting the problems and algorithms, we discuss failure assumptions
and the practical matter of detecting failures in distributed systems.
The chapter assumes, unless we state otherwise, that processes only fail by
crashing - an assumption that is good enough for many systems. In Section 11.5, we
shall consider how to treat the cases where processes have arbitrary (byzantine) failures.
Whatever the type of failure, a correct process is one that exhibits no failures at any point in the execution under consideration. Note that correctness applies to the whole execution, not just to a part of it. So a process that suffers a crash failure is ‘non-failed’ before that point, not ‘correct’ before that point.
One of the problems in the design of algorithms that can overcome process crashes
is that of deciding when a process has crashed. A failure detector [Chandra and Toueg
1996, Stelling et al. 1998] is a service that processes queries about whether a particular
process has failed. It is often implemented by an object local to each process (on the
same computer) that runs a failure-detection algorithm in conjunction with its
counterparts at other processes. The object local to each process is called a local failure
detector. We shall outline how to implement failure detectors shortly, but first we shall
concentrate on some of the properties of failure detectors.
A failure ‘detector’ is not necessarily accurate. Most fall into the category of
unreliable failure detectors. An unreliable failure detector may produce one of two
values when given the identity of a process: Unsuspected or Suspected. Both of these
results are hints, which may or may not accurately reflect whether the process has
actually failed. A result of Unsuspected signifies that the detector has recently received
evidence suggesting that the process has not failed; for example, a message was recently
received from it. But of course the process can have failed since then. A result of
Suspected signifies that the failure detector has some indication that the process may
have failed. For example, it may be that no message from the process has been received
for more than a nominal maximum length of silence (even in an asynchronous system,
practical upper bounds can be used as hints). The suspicion may be misplaced: for
example, the process could be functioning correctly, but on the other side of a network
partition; or it could be running more slowly than expected.
A reliable failure detector is one that is always accurate in detecting a process’s failure. It answers processes’ queries with either a response of Unsuspected - which, as
before, can only be a hint - or Failed. A result of Failed means that the detector has
determined that the process has crashed. Recall that a process that has crashed stays that
way, since by definition a process never takes another step once it has crashed.
It is important to realize that, although we speak of one failure detector acting for
a collection of processes, the response that the failure detector gives to a process is only
as good as the information available at that process. A failure detector may sometimes
give different responses to different processes, since communication conditions vary
from process to process.
We can implement an unreliable failure detector using the following algorithm. Each process p sends a ‘p is here’ message to every other process, and it does this every T seconds. The failure detector uses an estimate of the maximum message transmission time of D seconds. If the local failure detector at process q does not receive a ‘p is here’ message within T + D seconds of the last one, then it reports to q that p is Suspected. However, if it subsequently receives a ‘p is here’ message, then it reports to q that p is OK.
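As a sketch (ours, not the book's; Python, with the message transport elided), the local failure detector reduces to a table of arrival times and a query rule:

    import time

    T = 5.0     # period of the 'p is here' messages (seconds)
    D = 2.0     # estimated maximum message transmission time (seconds)

    class LocalFailureDetector:
        # An unreliable failure detector: its answers are hints, not facts.
        def __init__(self):
            self.last_seen = {}     # process id -> arrival time of the last
                                    # 'p is here' message from that process

        def on_p_is_here(self, p):
            # Invoked by the transport whenever a 'p is here' message arrives.
            self.last_seen[p] = time.time()

        def query(self, p):
            # Suspected means only that no evidence of life arrived in time:
            # p may be slow or partitioned rather than crashed. In a synchronous
            # system, where D is an absolute bound, this answer could be
            # strengthened to Failed.
            last = self.last_seen.get(p)
            if last is None or time.time() - last > T + D:
                return "Suspected"
            return "Unsuspected"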
In a real distributed system, there are practical limits on message transmission
times. Even email systems give up after a few days, since it is likely that communication
links and routers will have been repaired in that time. If we choose small values for T
and D (so that they total 0.1 second, say), then the failure detector is likely to suspect
non-crashed processes many times, and much bandwidth will be taken up with ‘p is here’ messages. If we choose a large total timeout value (a week, say) then crashed
processes will often be reported as Unsuspected.
A practical solution to this problem is to use timeout values that reflect the
observed network delay conditions. If a local failure detector receives a ‘p is here’ in 20
seconds instead of the expected maximum of 10 seconds, then it could reset its timeout
value for p accordingly. The failure detector remains unreliable, and its answers to
queries are still only hints, but the probability of its accuracy increases.
In a synchronous system, our failure detector can be made into a reliable one. We
can choose D so that it is not an estimate but an absolute bound on message transmission
times; the absence of a ‘p is here’ message within T + D seconds entitles the local
failure detector to conclude that p has crashed.
The reader may wonder whether failure detectors are of any practical use.
Unreliable failure detectors may suspect a process that has not failed (they may be
inaccurate); and they may not suspect a process that has in fact failed (they may be
incomplete). Reliable failure detectors, on the other hand, require that the system is
synchronous (and few practical systems are).
We have introduced failure detectors because they help us to think about the
nature of failures in a distributed system. And any practical system that is designed to
cope with failures must detect them - however imperfectly. But it turns out that even
unreliable failure detectors with certain well-defined properties can help us to provide
practical solutions to the problem of coordinating processes in the presence of failures.
We return to this point in Section 11.5.
Taking the example just given, suppose that p3 either had not failed but was running unusually slowly (that is, the assumption that the system is synchronous is incorrect), or that p3 had failed but is then replaced. Just as p2 sends its coordinator message, p3 (or its replacement) does the same. p2 receives p3’s coordinator message after it sent its own and so sets elected2 = p3. Due to variable message transmission delays, p1 receives p2’s coordinator message after p3’s and so eventually sets elected1 = p2. Condition E1 has been broken.

With regard to the performance of the algorithm, in the best case the process with the second highest identifier notices the coordinator’s failure. Then it can immediately elect itself and send N - 2 coordinator messages. The turnaround time is one message. The bully algorithm requires O(N^2) messages in the worst case - that is, when the process with the least identifier first detects the coordinator’s failure. For then N - 1 processes altogether begin elections, each sending messages to processes with higher identifiers.
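The election rule itself is small. The following single-process simulation (our sketch; the answer/election message exchange is collapsed into direct calls, and timeouts are omitted) captures the essence: contact every higher-numbered process, and if none is alive to answer, announce yourself as coordinator.

    class Process:
        def __init__(self, pid, registry):
            self.pid = pid
            self.registry = registry   # pid -> Process; stands in for the network
            self.elected = None

        def start_election(self):
            # Contact every process with a higher identifier.
            higher = [q for pid, q in self.registry.items() if pid > self.pid]
            if not higher:
                self.announce()        # no higher process answered: we win
            else:
                # A higher process answers and takes over the election.
                min(higher, key=lambda q: q.pid).start_election()

        def announce(self):
            for q in self.registry.values():
                q.elected = self.pid   # the coordinator message

    registry = {}
    for pid in (1, 2, 3, 4):
        registry[pid] = Process(pid, registry)
    del registry[4]                    # the coordinator (highest id) crashes
    registry[1].start_election()       # the lowest process detects the failure
    print(registry[2].elected)         # prints 3: highest surviving identifier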
Efficiency. The information that the same message is to be delivered to all processes
in a group allows the implementation to be efficient in its utilization of bandwidth. It
can take steps to send the message no more than once over any communication link,
by sending the message over a distribution tree; and it can use network hardware
support for multicast where this is available. The implementation can also minimize
the total time taken to deliver the message to all destinations, instead of transmitting
it separately and serially.
To see these advantages, compare the bandwidth utilization and the total transmission time taken when sending the same message from a computer in London to two computers on the same Ethernet in Palo Alto: (a) by two separate UDP sends, and (b) by a single IP-multicast operation. In the former case, two copies of the message are sent independently, and the second is delayed by the first. In the latter
case, a set of multicast-aware routers forward a single copy of the message from
London to a router on the destination LAN. The final router then uses hardware
multicast (provided by the Ethernet) to deliver the message to the destinations,
instead of sending it twice.
System model ◊ The system contains a collection of processes, which can communicate reliably over one-to-one channels. As before, processes may fail only by crashing.
The processes are members of groups, which are the destinations of messages sent
with the multicast operation. It is generally useful to allow processes to be members of
several groups simultaneously - for example, to enable processes to receive information
from several sources by joining several groups. But to simplify our discussion of
ordering properties, we shall sometimes restrict processes to being members of at most
one group at a time.
The operation multicast(g, m) sends the message m to all members of the group g of processes. Correspondingly, there is an operation deliver(m) that delivers a message sent by multicast to the calling process. We use the term deliver rather than receive to make clear that a multicast message is not always handed to the application layer inside the process as soon as it is received at the process’s node. This is explained when we discuss multicast delivery semantics shortly.

Every message m carries the unique identifier of the process sender(m) that sent it, and the unique destination group identifier group(m). We assume that processes do not lie about the origin or destinations of messages.
A group is said to be closed if only members of the group may multicast to it
(Figure 11.9). A process in a closed group delivers to itself any message that it
multicasts to the group. A group is open if processes outside the group may send to it. (The categories ‘open’ and ‘closed’ also apply with analogous meanings to mailing lists.) Closed groups of processes are useful, for example, for cooperating servers to send messages to one another that only they should receive. Open groups are useful, for example, for delivering events to groups of interested processes.

Some algorithms assume that groups are closed. The same effect as openness can be achieved with a closed group by picking a member of the group and sending it a message (one-to-one) for it to multicast to its group. Rodrigues et al. [1998] discuss multicast to open groups.
network bandwidth. A more practical basic multicast service can be built using IP
multicast, and we invite the reader to show this.
Agreement: If a correct process delivers message m, then all other correct processes
in group(m) will eventually deliver m.
The integrity property is analogous to that for reliable one-to-one communication. The
validity property guarantees liveness for the sender. This may seem an unusual property,
because it is asymmetric (it mentions only one particular process). But notice that
validity and agreement together amount to an overall liveness requirement: if one
process (the sender) eventually delivers a message m then, since the correct processes
agree on the set of messages they deliver, it follows that m will eventually be delivered
to all the group’s correct members.
The advantage of expressing the validity condition in terms of self-delivery is
simplicity. What we require is that the message be delivered eventually by some correct
member of the group.
The agreement condition is related to atomicity, the property of ‘all or nothing’, applied to delivery of messages to a group. If a process that multicasts a message crashes before it has delivered it, then it is possible that the message will not be delivered to any process in the group; but if it is delivered to some correct process, then all other correct processes will deliver it. Many papers in the literature use the term ‘atomic’ to include a total ordering condition; we define this shortly.
[Figure 11.11: the hold-back queue for arriving (incoming) multicast messages.]
For p to R-multicast a message to group g, it piggybacks onto the message the value Sp and acknowledgements, of the form <q, Rq>. An acknowledgement states, for some sender q, the sequence number of the latest message from q destined for g that p has delivered since it last multicast a message. The multicaster p then IP-multicasts the message with its piggybacked values to g, and increments Sp by one.

The piggybacked values in a multicast message enable the recipients to learn about messages that they have not received. A process R-delivers a message destined for g bearing the sequence number S from p if and only if S = Rp + 1, and it increments Rp by one immediately after delivery. If an arriving message has S ≤ Rp, then the process has delivered the message before and it discards it. If S > Rp + 1, or if R > Rq for an enclosed acknowledgement <q, R>, then there are one or more messages that it has not yet received (and which are likely to have been dropped, in the first case). It keeps any message for which S > Rp + 1 in a hold-back queue (Figure 11.11) - such queues are often used to meet message delivery guarantees. It requests missing messages by sending negative acknowledgements - to the original sender or to a process q from which it has received an acknowledgement <q, Rq> with Rq no less than the required sequence number.
The hold-back queue is not strictly necessary for reliability but it simplifies the protocol by enabling us to use sequence numbers to represent sets of delivered messages. It also provides us with a guarantee of delivery order (see Section 11.4.3).

The integrity property follows from the detection of duplicates and the underlying properties of IP multicast (which uses checksums to expunge corrupted messages). The validity property holds because IP multicast has that property. For agreement we require, first, that a process can always detect missing messages. That in turn means that it will always receive a further message that enables it to detect the omission. As this simplified protocol stands, we guarantee detection of missing messages only in the case where correct processes multicast messages indefinitely. Second, the agreement property requires that there is always an available copy of any message needed by a process that did not receive it. We therefore assume that processes retain copies of the messages they have delivered - indefinitely, in this simplified protocol.
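A sketch of the receiving side (ours; Python, omitting the piggybacked acknowledgements and the negative acknowledgements) shows how the sequence numbers and the hold-back queue interact:

    class RMulticastReceiver:
        # Per-group receiver state for the sequence-numbered protocol.
        def __init__(self):
            self.R = {}         # sender p -> seq. number of latest delivered msg
            self.holdback = {}  # sender p -> {seq. number: message} held back

        def receive(self, p, S, message, r_deliver):
            r = self.R.get(p, 0)
            if S <= r:
                return                          # already delivered: discard it
            if S == r + 1:
                r_deliver(p, message)           # the next expected message
                self.R[p] = S
                queued = self.holdback.setdefault(p, {})
                while self.R[p] + 1 in queued:  # a delivery may release successors
                    self.R[p] += 1
                    r_deliver(p, queued.pop(self.R[p]))
            else:
                # S > r + 1: one or more messages are missing; hold this one
                # back and, in the full protocol, send a negative acknowledgement.
                self.holdback.setdefault(p, {})[S] = message

    rx = RMulticastReceiver()
    rx.receive('p', 2, 'second', lambda q, m: print(q, m))  # held back
    rx.receive('p', 1, 'first', lambda q, m: print(q, m))   # delivers both, in order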
[Figure 11.12: notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3 - and the otherwise arbitrary delivery ordering of the remaining messages.]
Causal ordering implies FIFO ordering, since any two multicasts by the same process are related by happened-before. Note that FIFO ordering and causal ordering are only partial orderings: not all messages are sent by the same process, in general; similarly, some multicasts are concurrent (not ordered by happened-before).
Figure 11.12 illustrates the orderings for the case of three processes. Close inspection of the figure shows that the totally ordered messages are delivered in the opposite order to the physical time at which they were sent. In fact, the definition of total
ordering allows message delivery to be ordered arbitrarily, as long as the order is the
same at different processes. Since total ordering is not necessarily also a FIFO or causal
ordering, we define the hybrid of FIFO-total ordering as one for which message delivery
obeys both FIFO and total ordering; similarly, under causal-total ordering message
delivery obeys both causal and total ordering.
The definitions of ordered multicast do not assume or imply reliability. For
example, the reader should check that, under total ordering, if correct process p delivers
message m and then delivers m', then a correct process q can deliver m without also
delivering m' or any other message ordered after m.
We can also form hybrids of ordered and reliable protocols. A reliable totally
ordered multicast is often referred to in the literature as an atomic multicast. Similarly,
we may form reliable FIFO multicast, reliable causal multicast and reliable versions of
the hybrid ordered multicasts.
Ordering the delivery of multicast messages, as we shall see, can be expensive in terms of delivery latency and bandwidth consumption. The ordering semantics that we
have described may delay the delivery of messages unnecessarily. That is, at the
application level, a message may be delayed for another message that it does not in fact
depend upon. For this reason, some have proposed multicast systems that use the
application-specific message semantics alone to determine the order of message
delivery [Cheriton and Skeen 1993, Pedone and Schiper 1999].
The example of the bulletin board ◊ To make multicast delivery semantics more concrete, consider an application in which users post messages to bulletin boards. Each user runs a bulletin-board application process. Every topic of discussion has its own process group. When a user posts a message to a bulletin board, the application multicasts the user’s posting to the corresponding group. Each user’s process is a member of the group for the topic in which he or she is interested, so that the user will receive just the postings concerning that topic.

Reliable multicast is required if every user is to receive every posting eventually. The users also have ordering requirements. Figure 11.13 shows the postings as they appear to a particular user. At a minimum, FIFO ordering is desirable, since then every posting from a given user - ‘A.Hanlon’, say - will be received in the same order, and users can talk consistently about A.Hanlon’s second posting.
Note that the messages whose subjects are ‘Re: Microkernels’ (25) and ‘Re: Mach’ (27) appear after the messages to which they refer. A causally ordered multicast is needed to guarantee this relationship. Otherwise, arbitrary message delays could mean that, say, a message ‘Re: Mach’ could appear before the original message about Mach.

If the multicast delivery was totally ordered, then the numbering in the left-hand column would be consistent between users. Users could refer unambiguously, for example, to ‘message 24’.

In practice, the USENET bulletin board system implements neither causal nor total ordering. The communication costs of achieving these orderings on a large scale outweigh their advantages.
Implementing FIFO ordering ◊ FIFO-ordered multicast (with operations FO-multicast and FO-deliver) is achieved with sequence numbers, much as we would achieve it for one-to-one communication. We shall consider only non-overlapping groups. The reader should verify that the reliable multicast protocol that we defined on top of IP multicast in Section 11.4.2 also guarantees FIFO ordering, but we shall show how to construct a FIFO-ordered multicast on top of any given basic multicast. We use the variables Sp and Rq held at process p from the reliable multicast protocol of Section 11.4.2: Sp is a count of how many messages p has sent to g and, for each q, Rq is the sequence number of the latest message p has delivered from process q that was sent to group g.

For p to FO-multicast a message to group g, it piggybacks the value Sp onto the message, B-multicasts the message to g and then increments Sp by 1. Upon receipt of a message from q bearing the sequence number S, p checks whether S = Rq + 1. If so, this message is the next one expected from the sender q and p FO-delivers it, setting Rq := S. If S > Rq + 1, it places the message in the hold-back queue until the intervening messages have been delivered and S = Rq + 1.
Since all messages from a given sender are delivered in the same sequence, and since a message’s delivery is delayed until its sequence number has been reached, the condition for FIFO ordering is clearly satisfied. But this is so only under the assumption that groups are non-overlapping.

Note that we can use any implementation of B-multicast in this protocol. Moreover, if we use a reliable R-multicast primitive instead of B-multicast, then we obtain a reliable FIFO multicast.
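Stated as code (our sketch, Python; B-multicast is passed in as a function, matching the note that any implementation of it will do):

    class FifoMulticast:
        def __init__(self, pid, b_multicast, fo_deliver):
            self.pid = pid
            self.S = 0            # count of messages this process has sent to g
            self.R = {}           # q -> seq. number of latest delivered from q
            self.holdback = {}    # q -> {seq. number: message}
            self.b_multicast = b_multicast
            self.fo_deliver = fo_deliver

        def fo_multicast(self, g, m):
            self.S += 1
            self.b_multicast(g, (self.pid, self.S, m))    # piggyback Sp

        def on_b_deliver(self, q, S, m):
            r = self.R.get(q, 0)
            if S == r + 1:
                self.fo_deliver(q, m)                     # next expected from q
                self.R[q] = S
                queued = self.holdback.get(q, {})
                while self.R[q] + 1 in queued:            # release successors
                    self.R[q] += 1
                    self.fo_deliver(q, queued.pop(self.R[q]))
            elif S > r + 1:
                self.holdback.setdefault(q, {})[S] = m    # intervening msgs missing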
Implementing total ordering ◊ The basic approach to implementing total ordering is to assign totally ordered identifiers to multicast messages so that each process makes the same ordering decision based upon these identifiers. The delivery algorithm is very similar to the one we described for FIFO ordering; the difference is that processes keep group-specific sequence numbers rather than process-specific sequence numbers. We only consider how to totally order messages sent to non-overlapping groups. We call the multicast operations TO-multicast and TO-deliver.

We discuss two main methods for assigning identifiers to messages. The first of these is for a process called a sequencer to assign them (Figure 11.14). A process wishing to TO-multicast a message m to group g attaches a unique identifier id(m) to it. The messages for g are sent to the sequencer for g, sequencer(g), as well as to the members of g. (The sequencer may be chosen to be a member of g.) The process sequencer(g) maintains a group-specific sequence number s, which it uses to assign increasing and consecutive sequence numbers to the messages that it B-delivers. It announces the sequence numbers by B-multicasting order messages to g.
group receive only one message per multicast; its disadvantage is increased bandwidth
utilization. The protocol is described in full at www.cdk3.net/coordination.
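Under the description above (the order messages are also referred to in the causal-plus-total discussion later in this section), a sequencer-based implementation might be sketched as follows (ours; Python, B-multicast abstracted, one group):

    class Sequencer:
        # Assigns consecutive sequence numbers to the messages it B-delivers.
        def __init__(self, b_multicast):
            self.s = 0
            self.b_multicast = b_multicast

        def on_b_deliver(self, g, msg_id):
            self.s += 1
            self.b_multicast(g, ("order", msg_id, self.s))

    class Member:
        # Holds a message back until both the message and its order number
        # have arrived and that number is the next one in the group sequence.
        def __init__(self, to_deliver):
            self.next_seq = 1
            self.messages = {}    # msg_id -> message body
            self.orders = {}      # sequence number -> msg_id
            self.to_deliver = to_deliver

        def on_b_deliver(self, kind, a, b):
            if kind == "msg":
                self.messages[a] = b          # (msg_id, message)
            else:
                self.orders[b] = a            # ("order", msg_id, seq)
            while (self.next_seq in self.orders
                   and self.orders[self.next_seq] in self.messages):
                mid = self.orders.pop(self.next_seq)
                self.to_deliver(self.messages.pop(mid))   # TO-deliver in order
                self.next_seq += 1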
The second method that we examine for achieving totally ordered multicast is one in which the processes collectively agree on the assignment of sequence numbers to messages in a distributed fashion. A simple algorithm - similar to one that was originally developed to implement totally ordered multicast delivery for the ISIS toolkit [Birman and Joseph 1987a] - is shown in Figure 11.15. Once more, a process B-multicasts its message to the members of the group. The group may be open or closed. The receiving processes propose sequence numbers for messages as they arrive and return these to the sender, which uses them to generate agreed sequence numbers.

Each process q in group g keeps Aq,g, the largest agreed sequence number it has observed so far for group g, and Pq, its own largest proposed sequence number. The algorithm for process p to multicast a message m to group g is as follows:
1. p B-multicasts <m, i> to g, where i is a unique identifier for m.

2. Each process q replies to the sender p with a proposal for the message’s agreed sequence number of Pq := Max(Aq,g, Pq) + 1. In reality, we must include process identifiers in the proposed values Pq to ensure a total order, since otherwise different processes could propose the same integer value; but for the sake of simplicity we shall not make that explicit here. Each process provisionally assigns the proposed sequence number to the message and places it in its hold-back queue, which is ordered with the smallest sequence number at the front.

3. p collects all the proposed sequence numbers and selects the largest one a as the next agreed sequence number. It then B-multicasts <i, a> to g. Each process q in g sets Aq,g := Max(Aq,g, a) and attaches a to the message (which is identified by i). It reorders the message in the hold-back queue if the agreed sequence number differs from the proposed one. When the message at the front of the hold-back queue has been assigned its agreed sequence number, it is transferred to the tail of the delivery queue. Messages that have been assigned their agreed sequence number but are not at the head of the hold-back queue are not yet transferred, however.
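Steps 2 and 3 on the receiving side can be sketched as follows (ours; Python, with the process-identifier tie-breaking omitted, as in the text):

    import heapq

    class IsisReceiver:
        def __init__(self):
            self.A = 0        # largest agreed sequence number observed (Aq,g)
            self.P = 0        # largest sequence number proposed here (Pq)
            self.queue = []   # hold-back queue ordered by sequence number

        def propose(self, msg_id, message):
            # Step 2: propose Max(A, P) + 1; hold the message back provisionally.
            self.P = max(self.A, self.P) + 1
            heapq.heappush(self.queue, [self.P, False, msg_id, message])
            return self.P                        # returned to the multicaster

        def agree(self, msg_id, a, deliver):
            # Step 3: the sender chose a as the maximum of all the proposals.
            self.A = max(self.A, a)
            for entry in self.queue:
                if entry[2] == msg_id:
                    entry[0], entry[1] = a, True  # attach the agreed number
            heapq.heapify(self.queue)             # reorder after the change
            # Transfer messages from the front of the hold-back queue once
            # they carry their agreed (final) sequence number.
            while self.queue and self.queue[0][1]:
                deliver(heapq.heappop(self.queue)[3])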
If every process agrees on the same set of sequence numbers and delivers them in the corresponding order, then total ordering is satisfied. It is clear that correct processes ultimately agree on the same set of sequence numbers, but we must show that they are monotonically increasing and that no correct process can deliver a message prematurely.

Assume that a message m1 has been assigned an agreed sequence number and has reached the front of the hold-back queue. By construction, a message that is received after this stage (and that should be delivered after m1) will have a larger proposed sequence number and thus a larger agreed sequence number than m1. So let m2 be any other message that has not yet been assigned its agreed sequence number but which is on the same queue. We have that the agreed sequence number of m2 is at least its proposed sequence number, since the agreed number is the maximum of the proposals; and the proposed sequence number of m2 exceeds the agreed sequence number of m1, since each proposal is made after m1's agreed number has been observed. Therefore the agreed sequence number of m2 exceeds that of m1, and m2 cannot be delivered before m1.
time it multicast the message. Both of those conditions can be detected by examining vector timestamps, as shown in Figure 11.16. Note that a process can immediately CO-deliver to itself any message that it CO-multicasts, although this is not described in Figure 11.16.

Each process updates its vector timestamp upon delivering any message, to maintain the count of causally precedent messages. It does this by incrementing the jth entry in its timestamp by one, where j is the index of the message's sender. This is an optimization of the merge operation that appears in the rules for updating vector clocks in Section 10.4. We can make the optimization in view of the delivery condition in the algorithm of Figure 11.16, which guarantees that only the jth entry will increase.
We outline the proof of the correctness of this algorithm as follows. Suppose that multicast(g, m) → multicast(g, m′). Let V and V′ be the vector timestamps of m and m′, respectively. It is straightforward to prove inductively from the algorithm that V ≤ V′. In particular, if process pk multicast m, then V[k] ≤ V′[k].

Consider what happens when some correct process pi B-delivers m′ (as opposed to CO-delivering it) without first CO-delivering m. By the algorithm, Vi[k] can increase only when pi delivers a message from pk, when it increases by 1. But pi has not received m, and therefore Vi[k] cannot increase beyond V[k] - 1. It is therefore not possible for pi to CO-deliver m′, since this would require that Vi[k] ≥ V′[k], and therefore that Vi[k] ≥ V[k].
The reader should check that if we substitute the reliable R-multicast primitive in
place of B-multicast, then we obtain a multicast that is both reliable and causally
ordered.
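A vector-timestamp sketch in the style of the algorithm of Figure 11.16 (our Python rendering; B-multicast abstracted, and the sender's immediate self-delivery omitted):

    class CausalMulticast:
        def __init__(self, i, n, b_multicast, co_deliver):
            self.i = i                 # this process's index
            self.V = [0] * n           # V[j]: messages from pj delivered so far
            self.holdback = []
            self.b_multicast = b_multicast
            self.co_deliver = co_deliver

        def co_multicast(self, g, m):
            self.V[self.i] += 1
            self.b_multicast(g, (self.i, list(self.V), m))

        def on_b_deliver(self, j, Vj, m):
            self.holdback.append((j, Vj, m))
            self._try_deliver()

        def _ready(self, j, Vj):
            # Deliver only after (a) the previous message from pj and (b) every
            # message that pj had delivered when it multicast this one.
            return (Vj[j] == self.V[j] + 1 and
                    all(Vj[k] <= self.V[k] for k in range(len(Vj)) if k != j))

        def _try_deliver(self):
            progressed = True
            while progressed:
                progressed = False
                for entry in list(self.holdback):
                    j, Vj, m = entry
                    if self._ready(j, Vj):
                        self.holdback.remove(entry)
                        self.co_deliver(j, m)
                        self.V[j] += 1      # only entry j need be updated
                        progressed = True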
Furthermore, if we combine the protocol for causal multicast with the sequencer-
based protocol for totally ordered delivery, then we obtain message delivery that is both
total and causal. The sequencer delivers messages according to the causal order and
multicasts the sequence numbers for the messages in the order in which it receives them.
The processes in the destination group do not deliver a message until they have received
an order message from the sequencer and the message is next in the delivery sequence.
Since the sequencer delivers messages in causal order, and since all other processes deliver messages in the same order as the sequencer, the ordering is indeed both total and causal.
Global total ordering: Let ‘<’ be the relation of ordering between delivery events. We require that ‘<’ obeys pairwise total ordering and that it is acyclic - under pairwise total ordering, ‘<’ is not acyclic by default.
One way of implementing these orders would be to multicast each message m to the
group of all processes in the system. Each process either discards or delivers the
message according to whether it belongs to group(m). This would be an inefficient and
unsatisfactory implementation: a multicast should involve as few processes as possible
beyond the members of the destination group. Alternatives are explored in Birman et al.
[1991], Garcia-Molina and Spauster [1991], Hadzilacos and Toueg [1994], Kindberg
[1995] and Rodrigues et al. [1998].
This section introduces the problem of consensus [Pease et al. 1980, Lamport et al.
1982] and the related problems of byzantine generals and interactive consistency. We
shall refer to these collectively as problems of agreement. Roughly speaking, the
problem is for processes to agree on a value after one or more of the processes has
proposed what that value should be.
For example, in Chapter 2 we described a situation in which two armies should decide consistently to attack or retreat. Similarly, we may require that all the correct computers controlling a spaceship’s engines should decide ‘proceed’, or all of them decide ‘abort’, after each has proposed one action or the other. In a transaction to transfer funds from one account to another, the computers involved must consistently agree to perform the respective debit and credit. In mutual exclusion, the processes agree on which process can enter the critical section. In an election, the processes agree on which is the elected process. In totally ordered multicast, the processes agree on the order of message delivery.
Protocols exist that are tailored to these individual types of agreement. We described some of them above, and Chapters 12 and 13 examine transactions. But it is useful for us to consider more general forms of agreement, in a search for common characteristics and solutions.

This section defines consensus more precisely and relates it to three related agreement problems: byzantine generals, interactive consistency and totally ordered multicast. We go on to examine under what circumstances the problems can be solved, and sketch some solutions. In particular, we shall discuss the well-known impossibility result of Fischer et al. [1985], which states that in an asynchronous system a collection of processes containing only one faulty process cannot be guaranteed to reach consensus. Finally, we consider how it is that practical algorithms exist despite the impossibility result.
the definition of majority, and the integrity property of a reliable multicast. Every
process receives the same set of proposed values, and every process evaluates the same
function of those values. So they must all agree, and if every process proposed the same
value, then they all decide on this value.
Note that majority is only one possible function that the processes could use to
agree upon a value from the candidate values. For example, if the values are ordered then
the functions minimum and maximum may be appropriate.
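In code, the per-process decision step in this failure-free setting is just the shared function applied to the common set of values (our sketch; Python, with None standing for 'no majority'):

    from collections import Counter

    def decide(proposed_values):
        # Every correct process receives the same multiset of proposals (via
        # reliable multicast) and applies the same function: here, majority.
        # For ordered values, min or max would serve equally well.
        value, count = Counter(proposed_values).most_common(1)[0]
        return value if count > len(proposed_values) / 2 else None

    print(decide(['attack', 'attack', 'retreat']))   # prints: attack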
If processes can crash then this introduces the complication of detecting failures,
and it is not immediately clear that a run of the consensus algorithm can terminate. In
fact, if the system is asynchronous then it may not; we shall return to this point shortly.
If processes can fail in arbitrary (byzantine) ways, then faulty processes can in
principle communicate random values to the others. This may seem unlikely in practice,
but it is not beyond the bounds of possibility for a process with a bug to fail in this way.
Moreover, the fault may not be accidental but the result of mischievous or malevolent
operation. Someone could deliberately make a process send different values to different
peers in an attempt to thwart the others, which are trying to reach consensus. In case of
inconsistency, correct processes must compare what they have received with what other
processes claim to have received.
The byzantine generals problem ◊ In the informal statement of the byzantine generals problem [Lamport et al. 1982], three or more generals are to agree to attack or to retreat. One, the commander, issues the order. The others, lieutenants to the commander, are to decide to attack or retreat. But one or more of the generals may be ‘treacherous’ - that is, faulty. If the commander is treacherous, he proposes attacking to one general and retreating to another. If a lieutenant is treacherous, he tells one of his peers that the commander told him to attack and another that they are to retreat.
The byzantine generals problem differs from consensus in that a distinguished
process supplies a value that the others are to agree upon, instead of each of them
proposing a value. The requirements are:
Termination: Eventually each correct process sets its decision variable.

Agreement: The decision value of all correct processes is the same: if pi and pj are correct and have entered the decided state, then di = dj (i, j = 1, 2, ..., N).

Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed.

Note that, for the byzantine generals problem, integrity implies agreement when the commander is correct; but the commander need not be correct.
Interactive consistency ◊ The interactive consistency problem is another variant of consensus, in which every process proposes a single value. The goal of the algorithm is for the correct processes to agree on a vector of values, one for each process. We shall call this the ‘decision vector’. For example, the goal could be for each of a set of processes to obtain the same information about their respective states.

The requirements for interactive consistency are:

Termination: Eventually each correct process sets its decision variable.

Agreement: The decision vector of all correct processes is the same.
• The commander pj sends its proposed value v to itself and each of the remaining processes;

• All processes run C with the values v1, v2, ..., vN that they receive (pj may be faulty);
multicasts the set of values that it has not sent in previous rounds. It then takes delivery of similar multicast messages from other processes and records any new values. Although this is not shown in Figure 11.18, the duration of a round is limited by setting a timeout based on the maximum time for a correct process to multicast a message. After f + 1 rounds, each process chooses the minimum value it has received as its decision value.
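A round-by-round simulation of this algorithm (our sketch; Python; the crash_round argument is a hypothetical test hook, and a process crashing here stops sending entirely in its crash round, whereas a real crash may deliver to only some peers):

    def consensus(proposals, f, crash_round=None):
        # proposals: pid -> proposed value; crash_round: pid -> crash round.
        crash_round = crash_round or {}
        alive = lambda p, r: crash_round.get(p, f + 2) > r
        values = {p: {v} for p, v in proposals.items()}   # values known to p
        sent = {p: set() for p in proposals}
        for r in range(1, f + 2):                         # rounds 1 .. f + 1
            round_msgs = []
            for p in proposals:
                if alive(p, r):
                    new = values[p] - sent[p]   # multicast only unsent values
                    sent[p] |= new
                    round_msgs.append(new)
            for p in proposals:
                if alive(p, r):
                    for new in round_msgs:
                        values[p] |= new        # record any newly seen values
        # After f + 1 rounds, each surviving process decides on the minimum.
        return {p: min(values[p]) for p in proposals if alive(p, f + 1)}

    print(consensus({1: 'b', 2: 'a', 3: 'c'}, f=1))       # all decide 'a'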
Termination is obvious from the fact that the system is synchronous. To check the correctness of the algorithm, we must show that each process arrives at the same set of values at the end of the final round. Agreement and integrity will then follow, because the processes apply the minimum function to this set.

Assume, to the contrary, that two processes differ in their final set of values. Without loss of generality, some correct process pi possesses a value v that another correct process pj (i ≠ j) does not possess. The only explanation for pi possessing a proposed value v at the end that pj does not possess is that any third process, pk say, that managed to send v to pi crashed before v could be delivered to pj. In turn, any process sending v in the previous round must have crashed, to explain why pk possesses v in that round but pj did not receive it. Proceeding in this way, we have to posit at least one crash in each of the preceding rounds. But we have assumed that at most f crashes can occur, and there are f + 1 rounds. We have arrived at a contradiction.
It turns out that any algorithm to reach consensus despite up to f crash failures requires at least f + 1 rounds of message exchanges, no matter how it is constructed [Dolev and Strong 1983]. This lower bound also applies in the case of byzantine failures [Fischer and Lynch 1980].
Impossibility with three processes ◊ Figure 11.19 shows two scenarios in which just one of three processes is faulty. In the left configuration one of the lieutenants, p3, is faulty; on the right the commander, p1, is faulty. Each scenario in Figure 11.19 shows two rounds of messages: the values the commander sends, and the values that the lieutenants subsequently send to each other. The numeric prefixes serve to specify the sources of messages and to show the different rounds. Read the symbol ‘:’ in messages as ‘says’; for example, ‘3:1:u’ is the message ‘3 says 1 says u’.

In the left-hand scenario, the commander correctly sends the same value v to each of the other two processes, and p2 correctly echoes this to p3. However, p3 sends a value u ≠ v to p2. All p2 knows at this stage is that it has received differing values; it cannot tell which were sent out by the commander.

In the right-hand scenario, the commander is faulty and sends differing values to the lieutenants. After p3 has correctly echoed the value x that it received, p2 is in the same situation as it was in when p3 was faulty: it has received two differing values.
If a solution exists, then process p2 is bound to decide on value v when the commander is correct, by the integrity condition. If we accept that no algorithm can possibly distinguish between the two scenarios, p2 must also choose the value sent by the commander in the right-hand scenario.

Following exactly the same reasoning for p3, assuming that it is correct, we are forced to conclude, by symmetry, that p3 also chooses the value sent by the commander as its decision value. But this contradicts the agreement condition (the commander sends differing values if it is faulty). So no solution is possible.
Note that this argument rests on our intuition that nothing can be done to improve a correct general’s knowledge beyond the first stage, where it cannot tell which process is faulty. It is possible to prove the correctness of this intuition [Pease et al. 1980]. Byzantine agreement can be reached for three generals, with one of them faulty, if the generals digitally sign their messages.
Impossibility with N ≤ 3f ◊ Pease et al. generalized the basic impossibility result for three processes, to prove that no solution is possible if N ≤ 3f. In outline, the argument is as follows. Assume that a solution exists with N ≤ 3f. Let each of three processes p1, p2 and p3 use the solution to simulate the behaviour of n1, n2 and n3 generals, respectively, where n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3. We assume, furthermore, that one of the three processes is faulty. Those of p1, p2 and p3 that are correct simulate correct generals: they simulate the interactions of their own generals internally and send messages from their generals to those simulated by other processes. The faulty process’s simulated generals are faulty: the messages that it sends as part of the simulation to the other two processes may be spurious. Since N ≤ 3f and n1, n2, n3 ≤ N/3, at most f simulated generals are faulty.
Because the algorithm that the processes run is assumed to be correct, the
simulation terminates. The correct simulated generals (in the two correct processes)
agree and satisfy the integrity property. But now we have a means for the two correct
processes out of the three to reach consensus: each decides on the value chosen by all of
their simulated generals. This contradicts our impossibility result for three processes,
with one faulty.
Solution with one faulty process ◊ There is not sufficient space to describe fully the algorithm of Pease et al. that solves the byzantine generals problem in a synchronous system with N ≥ 3f + 1. Instead, we give the operation of the algorithm for the case N ≥ 4, f = 1 and illustrate it for N = 4, f = 1.

The correct generals reach agreement in two rounds of messages:

• In the first round, the commander sends a value to each of the lieutenants.

• In the second round, each of the lieutenants sends the value it received to its peers.

A lieutenant receives a value from the commander, plus N - 2 values from its peers. If the commander is faulty, then all the lieutenants are correct and each will have gathered exactly the set of values that the commander sent out. Otherwise, one of the lieutenants is faulty; each of its correct peers receives N - 2 copies of the value that the commander sent, plus a value that the faulty lieutenant sent to it.

In either case, the correct lieutenants need only apply a simple majority function to the set of values they receive. Since N ≥ 4, N - 2 ≥ 2. Therefore, the majority function will ignore any value that a faulty lieutenant sent, and it will produce the value that the commander sent if the commander is correct.
We now illustrate the algorithm that we have just outlined for the case of four generals. Figure 11.20 shows two scenarios similar to those in Figure 11.19, but in this case there are four processes, one of which is faulty. As in Figure 11.19, in the left-hand configuration one of the lieutenants, p3, is faulty; on the right, the commander, p1, is faulty.

In the left-hand case, the two correct lieutenant processes agree, deciding on the commander’s value:

p2 decides on majority(v, u, v) = v

p4 decides on majority(v, v, w) = v

In the right-hand case the commander is faulty, but the three correct processes agree:

p2, p3 and p4 decide on majority(u, v, w) = ⊥ (the special value ⊥ applies where no majority of values exists).

The algorithm takes account of the fact that a faulty process may omit to send a message. If a correct process does not receive a message within a suitable time limit (the system is synchronous), it proceeds as though the faulty process had sent it the value ⊥.
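The right-hand scenario can be checked directly (our sketch; Python, with None standing for the special value ⊥):

    from collections import Counter

    def majority(values):
        # Returns the majority value, or None (the special value) when no
        # strict majority exists among the gathered values.
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else None

    # The faulty commander p1 sends u, v and w to the three lieutenants; each
    # correct lieutenant then echoes what it received to its peers.
    from_commander = {'p2': 'u', 'p3': 'v', 'p4': 'w'}
    for name, own in from_commander.items():
        echoes = [v for peer, v in from_commander.items() if peer != name]
        print(name, 'decides', majority([own] + echoes))  # all decide None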
Discussion ◊ We can measure the efficiency of a solution to the byzantine generals problem - or any other agreement problem - by asking:
[Figure 11.20: the two four-general scenarios; faulty processes are shown shaded.]
• How many message rounds does it take? (This is a factor in how long it takes for
the algorithm to terminate.)
• How many messages are sent, and of what size? (This measures the total
bandwidth utilization and has an impact on the execution time.)
In the general case (f > 1) the Lamport et al. algorithm for unsigned messages operates over f + 1 rounds. In each round, a process sends to a subset of the other processes the values that it received in the previous round. The algorithm is very costly: it involves sending O(N^(f+1)) messages.
Fischer and Lynch [1982] proved that any deterministic solution to consensus assuming byzantine failures (and hence to the byzantine generals problem, as Section 11.5.1 showed) will take at least f + 1 message rounds. So no algorithm can operate faster in this respect than that of Lamport et al. But there have been improvements in the message complexity, for example Garay and Moses [1993].

Several algorithms, such as that of Dolev and Strong [1983], take advantage of signed messages. Dolev and Strong’s algorithm again takes f + 1 rounds, but the number of messages sent is only O(N^2).
The complexity and cost of the solutions suggest that they are applicable only
where the threat is great. If faulty hardware is the source of the threat, then the likelihood
of truly arbitrary behaviour is small. Solutions that are based on more detailed
knowledge of the fault model may be more efficient [Barborak et al. 1993]. If malicious
users are the source of the threat, then a system to counter them is likely to use digital
signatures; a solution without signatures is impractical.
assume that a faulty process has not sent them a message within the round, because the
maximum delay has been exceeded.
Fischer et al. [1985] proved that no algorithm can guarantee to reach consensus in
an asynchronous system, even with one process crash failure. In an asynchronous
system, processes can respond to messages at arbitrary times, so a crashed process is
indistinguishable from a slow one. Their proof, which is beyond the scope of this book,
involves showing that there is always some continuation of the processes’ execution that
avoids consensus being reached.
We immediately know from the result of Fischer et al. that there is no guaranteed
solution in an asynchronous system to the byzantine generals problem, to interactive
consistency or to totally ordered and reliable multicast. If there were such a solution
then, by the results of Section 11.5.1, we would have a solution to consensus -
contradicting the impossibility result.
Note the word ‘guarantee’ in the statement of the impossibility result. The result
does not mean that processes can never reach distributed consensus in an asynchronous
system if one is faulty. It allows that consensus can be reached with some probability
greater than zero, confirming what we know in practice. For example, despite the fact
that our systems are often effectively asynchronous, transaction systems have been
reaching consensus regularly for many years.
One approach to working around the impossibility result is to consider partially
synchronous systems, which are sufficiently weaker than synchronous systems to be
useful as models of practical systems, and sufficiently stronger than asynchronous
systems for consensus to be solvable in them [Dwork et al. 1988]. That approach is
beyond the scope of this book. However, three other techniques for working around the
impossibility result that we shall now outline are fault masking, and reaching consensus by exploiting failure detectors and by randomizing aspects of the processes’ behaviour.
Masking faults ◊ The first technique is to avoid the impossibility result altogether by masking any process failures that occur (see Section 2.3.2 for an introduction to fault masking). For example, transaction systems employ persistent storage, which survives crash failures. If a process crashes, then it is restarted (automatically, or by an administrator). The process places sufficient information in persistent storage at critical points in its program so that if it should crash and be restarted, it will find sufficient data to be able to continue correctly with its interrupted task. In other words, it will behave like a process that is correct, but which sometimes takes a long time to perform a processing step.

Of course, fault masking is generally applicable in system design. Chapter 13 discusses how transactional systems take advantage of persistent storage. Chapter 14 describes how process failures can also be masked by replicating software components.
Consensus using failure detectors ◊ Another method for circumventing the impossibility result is to employ failure detectors. Some practical systems employ ‘perfect by design’ failure detectors to reach consensus. No failure detector in an asynchronous system that works solely by message passing can really be perfect. However, processes can agree to deem a process that has not responded for more than a bounded time to have failed. An unresponsive process may not really have failed, but the remaining processes act as if it had done. They make the failure ‘fail-silent’ by discarding any subsequent messages that they do in fact receive from a ‘failed’ process.
probabilistic algorithm that solves consensus even with byzantine failures can be found
in Canetti and Rabin [1993].
11.6 Summary
The chapter began by discussing the need for processes to access shared resources under
conditions of mutual exclusion. Locks are not always implemented by the servers that
manage the shared resources, and a separate distributed mutual exclusion service is then
required. Three algorithms were considered that achieve mutual exclusion: one
employing a central server, a ring-based algorithm, and a multicast-based algorithm
using logical clocks. None of these mechanisms can withstand failure as we described
them, although they can be modified to tolerate some faults.
Then the chapter considered a ring-based algorithm and the bully algorithm,
whose common aim is to elect a process uniquely from a given set - even if several
elections take place concurrently. The Bully algorithm could be used, for example, to
elect a new master time server, or a new lock server, when the previous one fails.
The chapter described multicast communication. It discussed reliable multicast, in
which the correct processes agree on the set of messages to be delivered; and multicast
with FIFO, causal and total delivery ordering. We gave algorithms for reliable multicast
and for all three types of delivery ordering.
Finally, we described the three problems of consensus, byzantine generals and
interactive consistency. We defined the conditions for their solution and we showed
relationships between these problems - including the relationship between consensus
and reliable, totally ordered multicast.
Solutions exist in a synchronous system, and we described some of them. In fact,
solutions exist even when arbitrary failures are possible. We outlined part of the solution
to the byzantine generals problem of Lamport et al. More recent algorithms have lower
complexity, but in principle none can better the f + 1 rounds taken by this algorithm,
unless messages are digitally signed.
The chapter ended by describing the fundamental result of Fischer et al.
concerning the impossibility of guaranteeing consensus in an asynchronous system. We
discussed how it is that, nonetheless, systems regularly do reach agreement in
asynchronous systems.
EXERCISES
11.2 If all client processes are single-threaded, is mutual exclusion condition ME3, which
specifies entry in happened-before order, relevant? page 425
11.3 Give a formula for the maximum throughput of a mutual exclusion system in terms of
the synchronization delay. page 425
11.4 In the central server algorithm for mutual exclusion, describe a situation in which two
requests are not processed in happened-before order. page 426
11.5 Adapt the central server algorithm for mutual exclusion to handle the crash failure of any
client (in any state), assuming that the server is correct and given a reliable failure
detector. Comment on whether the resultant system is fault tolerant. What would happen
if a client that possesses the token is wrongly suspected to have failed? page 426
11.6 Give an example execution of the ring-based algorithm to show that processes are not
necessarily granted entry to the critical section in happened-before order. page 427
11.7 In a certain system, each process typically uses a critical section many times before
another process requires it. Explain why Ricart and Agrawala’s multicast-based mutual
exclusion algorithm is inefficient for this case, and describe how to improve its
performance. Does your adaptation satisfy liveness condition ME2? page 429
11.8 In the Bully algorithm, a recovering process starts an election and will become the new
coordinator if it has a higher identifier than the current incumbent. Is this a necessary
feature of the algorithm? page 434
11.9 Suggest how to adapt the Bully algorithm to deal with temporary network partitions
(slow communication) and slow processes. page 436
11.10 Devise a protocol for basic multicast over IP multicast. page 438
11.11 How, if at all, should the definitions of integrity, agreement and validity for reliable
multicast change for the case of open groups? page 439
11.12 Explain why reversing the order of the lines ‘R-deliver m’ and ‘if (q ≠ p) then
B-multicast(g, m); end if’ in Figure 11.10 makes the algorithm no longer satisfy uniform
agreement. Does the reliable multicast algorithm based on IP multicast satisfy uniform
agreement? page 440
11.13 Explain whether the algorithm for reliable multicast over IP multicast works for open as
well as closed groups. Given any algorithm for closed groups, how, simply, can we
derive an algorithm for open groups? page 440
11.14 Consider how to address the impractical assumptions we made in order to meet the
validity and agreement properties for the reliable multicast protocol based on IP
multicast. Hint: add a rule for deleting retained messages when they have been delivered
everywhere; and consider adding a dummy ‘heartbeat’ message, which is never
delivered to the application, but which the protocol sends if the application has no
message to send. page 440
11.15 Show that the FIFO-ordered multicast algorithm does not work for overlapping groups,
by considering two messages sent from the same source to two overlapping groups, and
considering a process in the intersection of those groups. Adapt the protocol to work for
this case. Hint: processes should include with their messages the latest sequence
numbers of messages sent to all groups. page 445
11.16 Show that, if the basic multicast that we use in the algorithm of Figure 11.14 is also
FIFO-ordered, then the resultant totally-ordered multicast is also causally ordered. Is it
the case that any multicast that is both FIFO-ordered and totally ordered is thereby
causally ordered? page 446
11.17 Suggest how to adapt the causally ordered multicast protocol to handle overlapping
groups. page 449
11.18 In discussing Maekawa’s mutual exclusion algorithm, we gave an example of three
subsets of a set of three processes that could lead to a deadlock. Use these subsets as
multicast groups to show how a pairwise total ordering is not necessarily acyclic.
page 450
11.19 Construct a solution to reliable, totally ordered multicast in a synchronous system, using
a reliable multicast and a solution to the consensus problem. page 450
11.20 We gave a solution to consensus from a solution to reliable and totally ordered multicast,
which involved selecting the first value to be delivered. Explain from first principles
why, in an asynchronous system, we could not instead derive a solution by using a
reliable but not totally ordered multicast service and the ‘majority’ function. (Note that,
if we could, then this would contradict the impossibility result of Fischer et al.!) Hint:
consider slow/failed processes. page 455
11.21 Show that byzantine agreement can be reached for three generals, with one of them
faulty, if the generals digitally sign their messages. page 457
11.22 Explain how to adapt the algorithm for reliable multicast over IP multicast to eliminate
the hold-back queue - so that a received message that is not a duplicate can be delivered
immediately, but without any ordering guarantees. Hint: use sets instead of sequence
numbers to represent the messages that have been delivered so far. page 441
13 DISTRIBUTED TRANSACTIONS
13.1 Introduction
13.2 Flat and nested distributed transactions
13.3 Atomic commit protocols
13.4 Concurrency control in distributed transactions
13.5 Distributed deadlocks
13.6 Transaction recovery
13.7 Summary
This chapter introduces distributed transactions - those that involve more than one
server. Distributed transactions may be either flat or nested.
An atomic commit protocol is a cooperative procedure used by a set of servers
involved in a distributed transaction. It enables the servers to reach a joint decision as to
whether a transaction can be committed or aborted. This chapter describes the two-phase
commit protocol, which is the most commonly used atomic commit protocol.
The section on concurrency control in distributed transactions discusses how
locking, timestamp ordering and optimistic concurrency control may be extended for use
with distributed transactions.
The use of locking schemes can lead to distributed deadlocks. Distributed deadlock
detection algorithms are discussed.
Servers that provide transactions include a recovery manager whose concern is to
ensure that the effects of transactions on the objects managed by a server can be
recovered when it is replaced after a failure. The recovery manager saves the objects in
permanent storage together with intentions lists and information about the status of each
transaction.
13.1 Introduction
In Chapter 12, we discussed flat and nested transactions that accessed objects at a single
server. In the general case, a transaction, whether flat or nested, will access objects
located in several different computers. We use the term distributed transaction to refer
to a flat or nested transaction that accesses objects managed by multiple servers.
When a distributed transaction comes to an end, the atomicity property of
transactions requires that either all of the servers involved commit the transaction or all
of them abort the transaction. To achieve this, one of the servers takes on a coordinator
role, which involves ensuring the same outcome at all of the servers. The manner in
which the coordinator achieves this depends on the protocol chosen. A protocol known
as the ‘two-phase commit protocol’ is the most commonly used. This protocol allows
the servers to communicate with one another to reach a joint decision as to whether to
commit or abort.
Concurrency control in distributed transactions is based on the methods discussed
in Chapter 12. Each server applies local concurrency control to its own objects, which
ensures that transactions are serialized locally. Distributed transactions must be
serialized globally. How this is achieved depends on whether locking, timestamp
ordering or optimistic concurrency control is in use. In some cases, the transactions may
be serialized at the individual servers, but at the same time a cycle of dependencies
between the different servers may occur and a distributed deadlock arise.
Transaction recovery is concerned with ensuring that all the objects involved in
transactions are recoverable. In addition to that, it guarantees that the values of the
objects reflect all the changes made by committed transactions and none of those made
by aborted ones.
13.2 Flat and nested distributed transactions
Figure 13.2 A nested banking transaction (reconstructed from the figure residue):

    T = openTransaction
        openSubTransaction  a.withdraw(10);
        openSubTransaction  b.withdraw(20);
        openSubTransaction  c.deposit(10);
        openSubTransaction  d.deposit(20);
    closeTransaction
[Figure 13.3: a client’s distributed banking transaction involving servers BranchX,
BranchY and BranchZ - the figure itself is not recoverable from the scan]
The coordinator records the new participant in its participant list. The fact that the
coordinator knows all the participants and each participant knows the coordinator will
enable them to collect the information that will be needed at commit time.
Figure 13.3 shows a client whose (flat) banking transaction involves accounts A,
B, C and D at servers BranchX, BranchY and BranchZ. The client’s transaction, T,
transfers $4 from account A to account C and then transfers $3 from account B to
account D. The transaction described on the left is expanded to show that
openTransaction and closeTransaction are directed to the coordinator, which would be
situated in one of the servers involved in the transaction. Each server is shown with a
participant, which joins the transaction by invoking the join method in the coordinator.
When the client invokes one of the methods in the transaction, for example
b.withdraw(T, 3), the object receiving the invocation (B at BranchY in this case) informs
its participant object that the object belongs to the transaction T. If it has not already
informed the coordinator, the participant object uses the join operation to do so. In this
example, we show the transaction identifier being passed as an additional argument so
that the recipient can pass it on to the coordinator. By the time the client calls
closeTransaction, the coordinator has references to all of the participants.
Note that it is possible for a participant to call abortTransaction in the coordinator
if for some reason it is unable to continue with the transaction.
13.3 Atomic commit protocols

Transaction commit protocols were devised in the early 1970s, and the two-phase
commit protocol appeared in Gray [1978]. The atomicity of transactions requires that
when a distributed transaction comes to an end, either all of its operations are carried out
or none of them. In the case of a distributed transaction, the client has requested the
operations at more than one server. A transaction comes to an end when the client
requests that a transaction be committed or aborted. A simple way to complete the
transaction in an atomic manner is for the coordinator to communicate the commit or
abort request to all of the participants in the transaction and to keep on repeating the
request until all of them have acknowledged that they have carried it out. This is an
example of a one-phase atomic commit protocol.
This simple one-phase atomic commit protocol is inadequate because, in the case
when the client requests a commit, it does not allow a server to make a unilateral
decision to abort a transaction. Reasons that prevent a server from being able to commit
its part of a transaction generally relate to issues of concurrency control. For example,
if locking is in use, the resolution of a deadlock can lead to the aborting of a transaction
without the client being aware unless it makes another request to the server. If optimistic
concurrency control is in use, the failure of validation at a server would cause it to decide
to abort the transaction. The coordinator may not know when a server has crashed and
been replaced during the progress of a distributed transaction - such a server will need
to abort the transaction.
The two-phase commit protocol is designed to allow any participant to abort its
part of a transaction. Due to the requirement for atomicity, if one part of a transaction is
aborted, then the whole transaction must also be aborted. In the first phase of the
protocol, each participant votes for the transaction to be committed or aborted. Once a
participant has voted to commit a transaction, it is not allowed to abort it. Therefore,
before a participant votes to commit a transaction, it must ensure that it will eventually
be able to carry out its part of the commit protocol, even if it fails and is replaced in the
interim. A participant in a transaction is said to be in a prepared state for a transaction
if it will eventually be able to commit it. To make sure of this, each participant saves in
permanent storage all of the objects that it has altered in the transaction, together with
its status - prepared.
In the second phase of the protocol, every participant in the transaction carries out
the joint decision. If any one participant votes to abort, then the decision must be to abort
the transaction. If all the participants vote to commit, then the decision is to commit the
transaction.
The problem is to ensure that all of the participants vote and that they all reach the
same decision. This is fairly simple if no errors occur, but the protocol must work
correctly even when some of the servers fail, messages are lost or servers are temporarily
unable to communicate with one another.
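The two phases can be summarized in a sketch (ours, not the book’s pseudocode; synchronous calls stand in for messages, and the timeout and logging machinery discussed below is omitted):

    def two_phase_commit(participants):
        # Phase 1: collect votes. A participant may only vote Yes once it has
        # saved its altered objects and its prepared status in permanent storage.
        votes = [p.can_commit() for p in participants]

        # Phase 2: every participant carries out the joint decision.
        if all(v == 'Yes' for v in votes):
            for p in participants:
                p.do_commit()
            return 'Commit'
        else:
            # A single No vote forces the whole transaction to abort.
            for p in participants:
                p.do_abort()
            return 'Abort'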
Failure model for the commit protocols ◊ Section 12.1.2 presents a failure model for
transactions that applies equally to the two-phase (or any other) commit protocol.
Commit protocols are designed to work in an asynchronous system in which servers
may crash and messages may be lost. It is assumed that an underlying request-reply
protocol removes corrupt and duplicated messages. There are no byzantine faults -
servers either crash or else they obey the messages they are sent.
The two-phase commit protocol is an example of a protocol for reaching a
consensus. Chapter 11 asserts that consensus cannot be reached in an asynchronous
system if processes sometimes fail. However, the two-phase commit protocol does reach
consensus under those conditions. This is because crash failures of processes are masked
by replacing a crashed process with a new process whose state is set from information
saved in permanent storage and information held by other processes.
Timeout actions in the two-phase commit protocol ◊ There are various stages in the
protocol at which the coordinator or a participant cannot progress its part of the protocol
until it receives another request or reply from one of the others.
Consider first the situation where a participant has voted Yes and is waiting for the
coordinator to report on the outcome of the vote by telling it to commit or abort the
transaction. See step (2) in Figure 13.6. Such a participant is uncertain of the outcome
and cannot proceed any further until it gets the outcome of the vote from the coordinator.
The participant cannot decide unilaterally what to do next, and meanwhile the objects
used by its transaction cannot be released for use by other transactions. The participant
makes a getDecision request to the coordinator to determine the outcome of the
transaction. When it gets the reply it continues the protocol at step (4) in Figure 13.5. If
Figure 13.6 Communication in the two-phase commit protocol (only the first exchange
is recoverable from the figure residue):

    Coordinator                                    Participant
    step  status                                   step  status
    1     prepared to commit    --canCommit?-->
          (waiting for votes)   <-----Yes------    2     prepared to commit
the coordinator has failed, the participant will not be able to get the decision until the
coordinator is replaced, which can result in extensive delays for participants in the
uncertain state.
Alternative strategies are available for the participants to obtain a decision
cooperatively instead of contacting the coordinator. These strategies have the advantage
that they may be used when the coordinator has failed. See Exercise 13.5 and Bernstein
et al. [1987] for details. However, even with a cooperative protocol, if all the
participants are in the uncertain state, they will be unable to get a decision until the
coordinator or a participant with the knowledge is available.
Another point at which a participant may be delayed is when it has carried out all
its client requests in the transaction but has not yet received a canCommit? call from the
coordinator. As the client sends the closeTransaction to the coordinator, a participant
can only detect such a situation if it notices that it has not had a request in a particular
transaction for a long time, for example by a timeout period on a lock. As no decision
has been made at this stage, the participant can decide to abort unilaterally after some
period of time.
The coordinator may be delayed when it is waiting for votes from the participants.
As it has not yet decided the fate of the transaction it may decide to abort the transaction
after some period o f time. It must then announce doAbort to the participants who have
already sent their votes. Some tardy participants may try to vote Yes after this, but their
votes will be ignored and they will enter the uncertain state as described above.
Performance of the two-phase commit protocol ◊ Provided that all goes well - that is,
that the coordinator and participants and the communication between them do not fail -
the two-phase commit protocol involving N participants can be completed with N
canCommit? messages and replies, followed by N doCommit messages. That is, the cost
in messages is proportional to 3N, and the cost in time is three rounds of messages. The
haveCommitted messages are not counted in the estimated cost of the protocol, which
can function correctly without them - their role is to enable servers to delete stale
coordinator information.
In the worst case, there may be arbitrarily many server and communication
failures during the two-phase commit protocol. However, the protocol is designed to
tolerate a succession of failures (server crashes or lost messages) and is guaranteed to
complete eventually, although it is not possible to specify a time limit within which it
will be completed.
As noted in the section on timeouts, the two-phase commit protocol can cause
considerable delays to participants in the uncertain state. These delays occur when the
coordinator has failed and cannot reply to getDecision requests from participants. Even
if a cooperative protocol allows participants to make getDecision requests to other
participants, delays will occur if all the active participants are uncertain.
Three-phase commit protocols have been designed to alleviate such delays. They
are more expensive in the number of messages and the number of rounds required for
the normal (failure-free) case. For a description of three-phase commit protocols, see
Exercise 13.2 and Bernstein et al. [1987].
In our example, subtransactions T21 and T22 are orphans because their parent aborted
without passing information about them to the top-level transaction. Their coordinator
can, however, make enquiries about the status of their parent by using the getStatus
operation. A provisionally committed subtransaction of an aborted transaction should be
aborted, irrespective of whether the top-level transaction eventually commits.
The top-level transaction plays the role of coordinator in the two-phase commit
protocol, and the participant list consists of the coordinators of all the subtransactions in
the tree that have provisionally committed but do not have aborted ancestors. By this
stage, the logic of the program has determined that the top-level transaction should try
to commit whatever is left, in spite of some aborted subtransactions. In Figure 13.8, the
coordinators of T, T1 and T12 are participants and will be asked to vote on the outcome.
If they vote to commit, then they must prepare their transactions by saving the state of
the objects in permanent storage. This state is recorded as belonging to the top-level
transaction of which it will form a part. The two-phase commit protocol may be
performed in either a hierarchic manner or in a flat manner.
The second phase of the two-phase commit protocol is the same as for the non-
nested case. The coordinator collects the votes and then informs the participants as to
the outcome. When it is complete, coordinator and participants will have committed or
aborted their transactions.
Figure 13.10 canCommit? for the hierarchic two-phase commit protocol

    canCommit?(trans, subTrans) -> Yes / No
    Call from a coordinator to the coordinator of a child subtransaction, asking whether
    it can commit the subtransaction subTrans. The first argument, trans, is the transaction
    identifier of the top-level transaction. The participant replies with its vote, Yes or No.
matching the TID in the second argument. For example, the coordinator of T12 is also
the coordinator of T21, since they run in the same server, but when it receives the
canCommit? call, the second argument will be T1 and it will deal only with T12.
If a participant finds any subtransactions that match the second argument, it
prepares the objects and replies with a Yes vote. If it fails to find any, then it must have
crashed since it performed the subtransaction and it replies with a No vote.
Flat two-phase commit protocol ◊ In this approach, the coordinator of the top-level
transaction sends canCommit? messages to the coordinators of all of the subtransactions
in the provisional commit list - in our example, to the coordinators of T1 and T12. During
the commit protocol, the participants refer to the transaction by its top-level TID. Each
participant looks in its transaction list for any transaction or subtransaction matching
that TID. For example, the coordinator of T12 is also the coordinator of T21, since they
run in the same server (N).
Unfortunately, this does not provide sufficient information to enable correct
actions by participants such as the coordinator at server N that have a mix of
provisionally committed and aborted subtransactions. If N’s coordinator is just asked to
commit T, it will end up committing both T12 and T21 because, according to its local
information, both have provisionally committed. This is wrong in the case of T21,
because its parent, T2, has aborted. To allow for such cases, the canCommit? operation
for the flat commit protocol has a second argument that provides a list of aborted
subtransactions, as shown in Figure 13.11. A participant can commit descendants of the
top-level transaction unless they have aborted ancestors. When a participant receives a
canCommit? request, it does the following:
• check that they do not have aborted ancestors in the abortList, then prepare to
commit (by recording the transaction and its objects in permanent storage);
13.4 Concurrency control in distributed transactions

Each server manages a set of objects and is responsible for ensuring that they remain
consistent when accessed by concurrent transactions. Therefore, each server is
responsible for applying concurrency control to its own objects. The members of a
collection of servers of distributed transactions are jointly responsible for ensuring that
they are performed in a serially equivalent manner.
This implies that if transaction T is before transaction U in their conflicting access
to objects at one of the servers then they must be in that order at all of the servers whose
objects are accessed in a conflicting manner by both T and U.
13.4.1 Locking
In a distributed transaction, the locks on an object are held locally (in the same server).
The local lock manager can decide whether to grant a lock or make the requesting
transaction wait. However, it cannot release any locks until it knows that the transaction
has been committed or aborted at all the servers involved in the transaction. When
locking is used for concurrency control, the objects remain locked and are unavailable
for other transactions during the atomic commit protocol, although an aborted
transaction releases its locks after phase 1 of the protocol.
As lock managers in different servers set their locks independently of one another,
it is possible that different servers may impose different orderings on transactions.
Consider the following interleaving of transactions T and U at servers X and Y:
    T                                     U
    Write(A) at X    locks A
                                          Write(B) at Y    locks B
The transaction T locks object A at server X, and then transaction U locks object B at
server Y. After that, T tries to access B at server Y and waits for U’s lock. Similarly,
transaction U tries to access A at server X and has to wait for T’s lock. Therefore, we
have T before U in one server and U before T in the other. These different orderings can
lead to cyclic dependencies between transactions, and a distributed deadlock situation arises.
arises. The detection and resolution of distributed deadlocks is discussed in the next
section of this chapter. When a deadlock is detected, a transaction is aborted to resolve
the deadlock. In this case, the coordinator will be informed and will abort the transaction
at the participants involved in the transaction.
13.4.2 Timestamp ordering concurrency control

corresponds to the order in which they are started in real time. Timestamps can be kept
roughly synchronized by the use of synchronized local physical clocks (see Chapter 10).
When timestamp ordering is used for concurrency control, conflicts are resolved
as each operation is performed. If the resolution of a conflict requires a transaction to be
aborted, the coordinator will be informed and it will abort the transaction at all the
participants. Therefore, any transaction that reaches the client request to commit should
always be able to commit, and a participant in the two-phase commit protocol
will normally agree to commit. The only situation in which a participant will not agree
to commit is if it has crashed during the transaction.
13.4.3 Optimistic concurrency control

    T                                     U
    Read(A)  at X                         Read(B)  at Y
    Write(A)                              Write(B)
    Read(B)  at Y                         Read(A)  at X
    Write(B)                              Write(A)
The transactions access the objects in the order T before U at server X and in the order
U before T at server Y. Now suppose that T and U start validation at about the same time,
but server X validates T first and server Y validates U first. Recall that Section 12.5
recommends a simplification of the validation protocol that makes a rule that only one
transaction may perform validation and update phases at a time. Therefore each server
will be unable to validate the other transaction until the first one has completed. This is
an example of commitment deadlock.
The validation rules in Section 12.5 assume that validation is fast, which is true
for single-server transactions. However, in a distributed transaction, the two-phase
commit protocol may take some time and will delay other transactions from entering
validation until a decision on the current transaction has been obtained. In distributed
optimistic transactions, each server applies a parallel validation protocol. This is an
extension of either backward or forward validation to allow multiple transactions to be
in the validation phase at the same time. In this extension, rule 3 must be checked as well
as rule 2 for backward validation. That is, the write set of the transaction being validated
must be checked for overlaps with the write set of earlier overlapping transactions. Kung
and Robinson [1981] describe parallel validation in their paper.
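A sketch of the extended backward validation check (ours; transactions are assumed to carry read_set and write_set attributes as Python sets):

    def backward_validate(tv, overlapping):
        # `overlapping` holds earlier transactions that overlapped tv in time,
        # split into those already committed and those still validating.
        for t in overlapping['committed']:
            # Rule 2: tv must not have read anything an earlier transaction wrote.
            if tv.read_set & t.write_set:
                return False
        for t in overlapping['validating']:
            # Rule 3 (parallel validation): tv's writes must not overlap the
            # writes of earlier transactions still in validation or update.
            if tv.write_set & t.write_set:
                return False
        return True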
13.5 Distributed deadlocks
The discussion of deadlocks in Section 12.4 shows that deadlocks can arise within a
single server when locking is used for concurrency control. Servers must either prevent
or detect and resolve deadlocks. Using timeouts to resolve possible deadlocks is a
clumsy approach - it is difficult to choose an appropriate timeout interval, and
transactions are aborted unnecessarily. With deadlock detection schemes, a transaction
is aborted only when it is involved in a deadlock. Most deadlock detection schemes
operate by finding cycles in the transaction wait-for graph. In a distributed system
involving multiple servers being accessed by multiple transactions, a global wait-for
graph can in theory be constructed from the local ones.

Figure 13.12 Interleavings of the transactions U, V and W (reconstructed):

    U                        V                        W
    d.deposit(10)  lock D at Z
                             b.deposit(10)  lock B at Y
    a.deposit(20)  lock A at X
                                                      c.deposit(30)  lock C at Z
    b.withdraw(30) wait at Y
                             c.withdraw(20) wait at Z
                                                      a.withdraw(20) wait at X

There can be a cycle in the global
wait-for graph that is not in any single local one - that is, there can be a distributed
deadlock. Recall that the wait-for graph is a directed graph in which nodes represent
transactions and objects, and edges represent either an object held by a transaction or a
transaction waiting for an object. There is a deadlock if and only if there is a cycle in the
wait-for graph.
Figure 13.12 shows the interleavings of the transactions U, V and W involving the
objects A and B managed by servers X and Y and objects C and D managed by server Z.
The complete wait-for graph in Figure 13.13(a) shows that a deadlock cycle
consists of alternate edges, which represent a transaction waiting for an object and an
object held by a transaction. As any transaction can only be waiting for one object at a
time, objects can be left out of wait-for graphs, as shown in Figure 13.13(b).
Detection of a distributed deadlock requires a cycle to be found in the global
transaction wait-for graph that is distributed among the servers that were involved in the
transactions. Local wait-for graphs can be built by the lock manager at each server, as
discussed in Chapter 12. In the above example, the local wait-for graphs of the servers
are:

    server Y:  U → V  (added when U requests b.withdraw(30))
    server Z:  V → W  (added when V requests c.withdraw(20))
    server X:  W → U  (added when W requests a.withdraw(20))
As the global wait-for graph is held in part by each of the several servers involved,
communication between these servers is required to find cycles in the graph.
A simple solution is to use centralized deadlock detection, in which one server
takes on the role of global deadlock detector. From time to time, each server sends the
latest copy of its local wait-for graph to the global deadlock detector, which
amalgamates the information in the local graphs in order to construct a global wait-for
graph. The global deadlock detector checks for cycles in the global wait-for graph.
[Figure 13.13: (a) the complete wait-for graph, with both transactions and objects;
(b) the same graph with the objects omitted]
When it finds a cycle, it makes a decision on how to resolve the deadlock and informs
the servers as to the transaction to be aborted to resolve the deadlock.
Centralized deadlock detection is not a good idea, because it depends on a single
server to carry it out. It suffers from the usual problems associated with centralized
solutions in distributed systems - poor availability, lack of fault tolerance and no ability
to scale. In addition, the cost of the frequent transmission of local wait-for graphs is
high. If the global graph is collected less frequently, deadlocks may take longer to be
detected.
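The detector’s cycle check itself is ordinary graph search. A sketch (ours), with the amalgamated graph represented as a dict from each transaction to the transactions it waits for:

    def find_cycle(wait_for):
        # e.g. wait_for = {'U': ['V'], 'V': ['W'], 'W': ['U']}
        visiting, done = set(), set()

        def dfs(t, path):
            visiting.add(t)
            path.append(t)
            for u in wait_for.get(t, []):
                if u in visiting:              # back edge: a cycle
                    return path[path.index(u):] + [u]
                if u not in done:
                    cycle = dfs(u, path)
                    if cycle:
                        return cycle
            visiting.discard(t)
            done.add(t)
            path.pop()
            return None

        for t in list(wait_for):
            if t not in done:
                cycle = dfs(t, [])
                if cycle:
                    return cycle               # e.g. ['U', 'V', 'W', 'U']
        return None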
Phantom deadlocks ◊ A deadlock that is ‘detected’ but is not really a deadlock is called
a phantom deadlock. In distributed deadlock detection, information about wait-for
relationships between transactions is transmitted from one server to another. If there is
a deadlock, the necessary information will eventually be collected in one place and a
cycle will be detected. As this procedure will take some time, there is a chance that one
of the transactions that holds a lock will meanwhile have released it, in which case the
deadlock will no longer exist.
Consider the case of a global deadlock detector that receives local wait-for graphs
from servers X and Y, as shown in Figure 13.14. Suppose that transaction U then releases
an object at server X and requests the one held by V at server Y. Suppose also that the
global detector receives server Y’s local graph before server X’s. In this case, it would
detect a cycle T → U → V → T, although the edge T → U no longer exists. This is an
example of a phantom deadlock.
The observant reader will have realized that if transactions are using two-phase
locks, they cannot release objects and then obtain more objects, and phantom deadlock
cycles cannot occur in the way suggested above. Consider the situation in which a cycle
T → U → V → T is detected: either this represents a deadlock or each of the transactions
T, U and V must eventually commit. It is actually impossible for any of them to commit,
because each of them is waiting for an object that will never be released.
A phantom deadlock could be detected if a waiting transaction in a deadlock cycle
aborts during the deadlock detection procedure. For example, if there is a cycle
T → U → V → T and U aborts after the information concerning U has been collected,
then the cycle has been broken already and there is no deadlock.
Edge chasing ◊ A distributed approach to deadlock detection uses a technique called
edge chasing or path pushing. In this approach, the global wait-for graph is not
constructed, but each of the servers involved has knowledge about some of its edges.
The servers attempt to find cycles by forwarding messages called probes, which follow
the edges of the graph throughout the distributed system. A probe message consists of
transaction wait-for relationships representing a path in the global wait-for graph.
The question is: when should a server send out a probe? Consider the situation at
server X in Figure 13.13. This server has just added the edge W → U to its local wait-for
graph and at this time, transaction U is waiting to access object B, which transaction V
holds at server Y. This edge could possibly be part of a cycle such as
V → T1 → T2 → ... → W → U → V involving transactions using objects at other
servers. This indicates that there is a potential distributed deadlock cycle, which could
be found by sending out a probe to server Y.
Now consider the situation a little earlier when server Z added the edge V → W to
its local graph: at this point in time, W is not waiting. Therefore, there would be no point
in sending out a probe.
Each distributed transaction starts at a server (called the coordinator of the
transaction) and moves to several other servers (called participants in the transaction),
which can communicate with the coordinator. At any point in time, a transaction can be
either active or waiting at just one of these servers. The coordinator is responsible for
recording whether the transaction is active or is waiting for a particular object, and
participants can get this information from their coordinator. Lock managers inform
coordinators when transactions start waiting for objects and when transactions acquire
objects and become active again. When a transaction is aborted to break a deadlock, its
coordinator will inform the participants and all of its locks will be removed, with the
effect that all edges involving that transaction will be removed from the local wait-for
graphs.
Initiation. When a server notes that a transaction T starts waiting for another
transaction U, where U is waiting to access an object at another server, it initiates
detection by sending a probe containing the edge < T → U > to the server of the
object at which transaction U is blocked. If U is sharing a lock, probes are sent to all
the holders of the lock. Sometimes further transactions may start sharing the lock
later on, in which case probes can be sent to them too.
In our example, the following steps describe how deadlock detection is initiated and the
probes that are forwarded during the corresponding detection phase.
• Server X initiates detection by sending probe < W → U > to the server of B
(server Y).
• Server Y receives probe < W → U >, notes that B is held by V and appends V to
the probe to produce < W → U → V >. It notes that V is waiting for C at server Z.
This probe is forwarded to server Z.
• Server Z receives probe < W → U → V >, notes that C is held by W and appends
W to the probe to produce < W → U → V → W >.
This path contains a cycle. The server detects a deadlock. One of the transactions in the
cycle must be aborted to break the deadlock. The transaction to be aborted can be chosen
according to transaction priorities, which are described shortly.
Figure 13.15 shows the progress of the probe messages from the initiation by the
server of A to the deadlock detection by the server of C. Probes are shown as heavy
arrows, objects as circles and transaction coordinators as rectangles. Each probe is
shown as going directly from one object to another. In reality, before a server transmits
a probe to another server, it consults the coordinator of the last transaction in the path to
find out whether the latter is waiting for another object elsewhere. For example, before
the server of B transmits the probe < W → U → V >, it consults the coordinator of V to
find out that V is waiting for C. In most of the edge-chasing algorithms, the servers of
objects send probes to transaction coordinators, which then forward them (if the
transaction is waiting) to the server of the object the transaction is waiting for. In our
example, the server of B transmits the probe < W → U → V > to the coordinator of V,
which then forwards it to the server of C. This shows that when a probe is forwarded,
two messages are required.
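A sketch of the forwarding step at a server of objects (ours; the holder, coordinator and messaging helpers are assumed interfaces):

    def receive_probe(probe, obj, coordinator_of, send_probe, resolve_deadlock):
        # `probe` is a path such as ['W', 'U'], arriving at the server of the
        # object `obj` that the last transaction in the path is waiting for.
        holder = obj.holder                   # e.g. V holds B
        new_probe = probe + [holder]          # extend the path: W -> U -> V
        if holder == probe[0]:
            resolve_deadlock(new_probe)       # the path is a cycle: deadlock
            return
        # Consult the holder's coordinator: is the holder itself waiting?
        waited_on = coordinator_of(holder).waiting_for()
        if waited_on is not None:
            send_probe(waited_on.server, new_probe)   # two messages in reality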
The above algorithm should find any deadlock that occurs, provided that waiting
transactions do not abort and there are no failures such as lost messages or servers
crashing. To understand this, consider a deadlock cycle in which the last transaction, W,
starts waiting and completes the cycle. When W starts waiting for an object, the server
initiates a probe that goes to the server of the object held by each transaction that W is
waiting for. The recipients extend and forward the probes to the servers of objects
requested by all waiting transactions they find. Thus every transaction that W waits for
directly or indirectly will be added to the probe unless a deadlock is detected. When
there is a deadlock, W is waiting for itself indirectly. Therefore, the probe will return to
the object that W holds.
It might appear that large numbers of messages are sent in order to detect
deadlock. In the above example, we see two probe messages to detect a cycle involving
three transactions. Each of the probe messages is in general two messages (from object
to coordinator and then from coordinator to object).
A probe that detects a cycle involving N transactions will be forwarded by (N - 1)
transaction coordinators via (N - 1) servers of objects, requiring 2(N - 1) messages.
Fortunately, the majority of deadlocks involve cycles containing only two transactions,
and there is no need for undue concern about the number of messages involved. This
observation has been made from studies of databases. It can also be argued by
considering the probability of conflicting access to objects. See Bernstein et al. [1987].
Kshemkalyani and Singhal [1994] argue that distributed deadlocks are not very well
understood because there is no global state or time in a distributed system. In fact, any
cycle that has been collected may contain sections recorded at different times. In
addition, sites may hear about deadlocks but may not hear that they have been resolved
until after random delays. The paper describes distributed deadlocks in terms of the
contents of distributed memory, using causal relationships between events at different
sites.
13.6 Transaction recovery

The atomic property of transactions requires that the effects of all committed
transactions and none of the effects of incomplete or aborted transactions are reflected
in the objects they accessed. This property can be described in terms of two aspects:
durability and failure atomicity. Durability requires that objects are saved in permanent
storage and will be available indefinitely thereafter. Therefore, an acknowledgment of a
client’s commit request implies that all the effects of the transaction have been recorded
in permanent storage as well as in the server’s (volatile) objects. Failure atomicity
requires that effects of transactions are atomic even when the server crashes. Recovery
is concerned with ensuring that a server’s objects are durable and that the service
provides failure atomicity.
Although file servers and database servers maintain data in permanent storage,
other kinds of servers of recoverable objects need not do so except for recovery
purposes. In this chapter, we assume that when a server is running it keeps all of its
objects in its volatile memory and records its committed objects in a recovery file or
files. Therefore, recovery consists of restoring the server with the latest committed
versions of its objects from permanent storage. Databases need to deal with large
volumes of data. They generally hold the objects in stable storage on disk with a cache
in volatile memory.
The two requirements for durability and for failure atomicity are not really
independent of one another and can be dealt with by a single mechanism - the recovery
manager. The task of a recovery manager is:

• to save objects in permanent storage (in a recovery file) for committed transactions;
• to restore the server’s objects after a crash;
• to reorganize the recovery file to improve the performance of recovery;
• to reclaim storage space (in the recovery file).
Intentions list ◊ Any server that provides transactions needs to keep track of the
objects accessed by clients’ transactions. Recall from Chapter 12 that when a client
opens a transaction, the server first contacted provides a new transaction identifier and
returns it to the client. Each subsequent client request within a transaction up to and
including the commit or abort request includes the transaction identifier as an argument.
During the progress of a transaction, the update operations are applied to a private set of
tentative versions of the objects belonging to the transaction.
At each server, an intentions list is recorded for all of its currently active
transactions - an intentions list of a particular transaction contains a list of the references
and the values of all the objects that are altered by that transaction. When a transaction
is committed, that transaction’s intentions list is used to identify the objects it affected.
The committed version of each object is replaced by the tentative version made by that
transaction, and the new value is written to the server’s recovery file. When a transaction
aborts, the server uses the intentions list to delete all the tentative versions of objects
made by that transaction.
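A sketch of this bookkeeping (ours; the recovery-file interface is an assumption, and the prepare step of the commit protocol is omitted here):

    intentions = {}   # transaction id -> list of (object reference, tentative value)

    def record_update(tid, ref, tentative_value):
        intentions.setdefault(tid, []).append((ref, tentative_value))

    def commit(tid, objects, recovery_file):
        # The tentative versions become the committed versions, and the new
        # values are written to the server's recovery file.
        for ref, value in intentions.pop(tid, []):
            objects[ref] = value
            recovery_file.append_object(ref, value)
        recovery_file.append_status(tid, 'committed')

    def abort(tid):
        # Discard all tentative versions made by the transaction.
        intentions.pop(tid, None)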
Recall also that a distributed transaction must carry out an atomic commit protocol
before it can be committed or aborted. Our discussion of recovery is based on the two-
phase commit protocol, in which all the participants involved in a transaction first say
whether they are prepared to commit and then, later on if all the participants agree, they
all carry out the actual commit actions. If the participants cannot agree to commit, they
must abort the transaction.
At the point when a participant says it is prepared to commit a transaction, its
recovery manager must have saved both its intentions list for that transaction and the
objects in that intentions list in its recovery file, so that it will be able to carry out the
commitment later on, even if it crashes in the interim.
When all the participants involved in a transaction agree to commit it, the
coordinator informs the client and then sends messages to the participants to commit
their part of the transaction. Once the client has been informed that a transaction has
committed, the recovery files of the participating servers must contain sufficient
information to ensure that the transaction is committed by all of the servers, even if some
o f them crash between preparing to commit and committing.
Entries in recovery file ◊ To deal with recovery of a server that can be involved in
distributed transactions, further information in addition to the values of the objects is
stored in the recovery file. This information concerns the status of each transaction -
whether it is committed, aborted or prepared to commit. In addition, each object in the
recovery file is associated with a particular transaction by saving the intentions list in
the recovery file. Figure 13.18 shows a summary of the types of entry included in a
recovery file.
The transaction status values relating to the two-phase commit protocol are
discussed in Section 13.6.4 on recovery of the two-phase commit protocol. We shall
now describe two approaches to the use of recovery files: logging and shadow versions.
13.6.1 Logging
In the logging technique, the recovery file represents a log containing the history of all
the transactions performed by a server. The history consists of values of objects,
transaction status entries and intentions lists of transactions. The order of the entries in
the log reflects the order in which transactions have prepared, committed and aborted at
that server. In practice, the recovery file will contain a recent snapshot of the values of
all the objects in the server followed by a history of transactions after the snapshot.
During the normal operation of a server, its recovery manager is called whenever
a transaction prepares to commit, commits or aborts. When the server is
prepared to commit a transaction, the recovery manager appends all the objects in its
intentions list to the recovery file, followed by the current status of that transaction
(prepared) together with its intentions list. When a transaction is eventually committed
or aborted, the recovery manager appends the corresponding status of the transaction to
its recovery file.
It is assumed that the append operation is atomic in the sense that it writes one or
more complete entries to the recovery file. If the server fails, only the last write can be
incomplete. To make efficient use of the disk, several subsequent writes can be buffered
and then written as a single write to disk. An additional advantage of the logging
technique is that sequential writes to disk are faster than writes to random locations.
After a crash, any transaction that does not have a committed status in the log is
aborted. Therefore, when a transaction commits, its committed status entry must be
forced to the log - that is, written to the log together with any other buffered entries.
The recovery manager associates a unique identifier with each object so that the
successive versions of an object in the recovery file may be associated with the server’s
objects. For example, a durable form of a remote object reference such as a CORBA
persistent reference will do as an object identifier.
Figure 13.19 illustrates the log mechanism for the banking service transactions T
and U in Figure 12.7. The log was recently reorganized, and entries to the left of the
double line represent a snapshot of the values of A, B and C before transactions T and U
started. In this diagram, we use the names A, B and C as unique identifiers for objects.
We show the situation when transaction T has committed and transaction U has prepared
but not committed. When transaction T prepares to commit, the values of objects A and
B are written at positions P1 and P2 in the log, followed by a prepared transaction status
entry for T with its intentions list (< A, P1 >, < B, P2 >). When transaction T commits, a
committed transaction status entry for T is put at position P4. Then when transaction U
Figure 13.19 Log for the banking service (reconstructed):

    P0                                 P1         P2         P3          P4          P5         P6         P7
    Object: A  Object: B  Object: C    Object: A  Object: B  Trans: T    Trans: T    Object: C  Object: B  Trans: U
    100        200        300          80         220        prepared    committed   278        242        prepared
                                                             < A, P1 >                                     < C, P5 >
                                                             < B, P2 >                                     < B, P6 >
                                                             P0          P3                                P4
    Checkpoint
prepares to commit, the values of objects C and B are written at positions P5 and P6 in
the log, followed by a prepared transaction status entry for U with its intentions list
(< C, P5 >, < B, P6 >).
Each transaction status entry contains a pointer to the position in the recovery file
o f the previous transaction status entry to enable the recovery manager to follow the
transaction status entries in reverse order through the recovery file. The last pointer in
the sequence of transaction status entries points to the checkpoint.
Recovery of objects ◊ When a server is replaced after a crash, it first sets default initial
values for its objects and then hands over to its recovery manager. The recovery
manager is responsible for restoring the server’s objects so that they include all the
effects of all the committed transactions performed in the correct order and none of the
effects of incomplete or aborted transactions.
The most recent information about transactions is at the end of the log. There are
two approaches to restoring the data from the recovery file. In the first, the recovery
manager starts at the beginning and restores the values of all of the objects from the most
recent checkpoint. It then reads in the values of each of the objects, associates them with
their intentions lists and for committed transactions replaces the values of the objects. In
this approach, the transactions are replayed in the order in which they were executed and
there could be a large number of them. In the second approach, the recovery manager
will restore a server’s objects by ‘reading the recovery file backwards’. The recovery file
has been structured so that there is a backwards pointer from each transaction status
entry to the next. The recovery manager uses transactions with committed status to
restore those objects that have not yet been restored. It continues until it has restored all
of the server’s objects. This has the advantage that each object is restored once only.
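A sketch of the backwards pass (ours), using the entry layout of Figure 13.19; the entry kinds and fields are assumptions:

    def recover(log):
        # `log` is a list of entries, oldest first. Status entries carry the
        # transaction's intentions list; object values live at log positions.
        objects, restored, committed = {}, set(), set()
        for entry in reversed(log):
            if entry.kind == 'status':
                if entry.status == 'committed':
                    committed.add(entry.tid)
                elif entry.status == 'prepared' and entry.tid in committed:
                    # Apply the intentions list of a committed transaction.
                    for ref, pos in entry.intentions:
                        if ref not in restored:
                            objects[ref] = log[pos].value
                            restored.add(ref)
            elif entry.kind == 'checkpoint':
                for ref, value in entry.values.items():
                    if ref not in restored:   # each object restored once only
                        objects[ref] = value
                        restored.add(ref)
        return objects

On the log of Figure 13.19, this restores A and B from T’s intentions list and C from the checkpoint, matching the walk-through below.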
To recover the effects of a transaction, a recovery manager gets the corresponding
intentions list from its recovery file. The intentions list contains the identifiers and
positions in the recovery file of the values of all the objects affected by the transaction.
If the server fails at the point reached in Figure 13.19, its recovery manager will
recover the objects as follows. It starts at the last transaction status entry in the log (at
P7) and concludes that transaction U has not committed and its effects should be
ignored. It then moves to the previous transaction status entry in the log (at P4) and
concludes that transaction T has committed. To recover the objects affected by
transaction T, it moves to the previous transaction status entry in the log (at P3) and finds
the intentions list for T (< A, P1 >, < B, P2 >). It then restores objects A and B from the
values at P1 and P2. As it has not yet restored C, it moves back to P0, which is a
checkpoint, and restores C.
To help with subsequent reorganization of the recovery file, the recovery manager
notes all the prepared transactions it finds during the process of restoring the server’s
objects. For each prepared transaction, it adds an aborted transaction status to the
recovery file. This ensures that in the recovery file, every transaction is eventually
shown as either committed or aborted.
The server could fail again during the recovery procedures. It is essential that
recovery be idempotent in the sense that it can be done any number of times with the
same effect. This is straightforward under our assumption that all the objects are restored
to volatile memory. In the case of a database, which keeps its objects in permanent
storage, with a cache in volatile memory, some of the objects in permanent storage will
be out of date when a server is replaced after a crash. Therefore, its recovery manager
has to restore the objects in permanent storage. If it fails during recovery, the partially
restored objects will still be there. This makes idempotence a little harder to achieve.
Reorganizing the recovery file ◊ A recovery manager is responsible for reorganizing its
recovery file so as to make the process of recovery faster and to reduce its use of space.
If the recovery file is never reorganized, then the recovery process must search
backwards through the recovery file until it has found a value for each of its objects.
Conceptually, the only information required for recovery is a copy of the committed
versions of all the objects in the server. This would be the most compact form for the
recovery file. The name checkpointing is used to refer to the process of writing the
current committed values of a server’s objects to a new recovery file, together with
transaction status entries and intentions lists of transactions that have not yet been fully
resolved (including information related to the two-phase commit protocol). The term
checkpoint is used to refer to the information stored by the checkpointing process. The
purpose of making checkpoints is to reduce the number of transactions to be dealt with
during recovery and to reclaim file space.
Checkpointing can be done immediately after recovery but before any new
transactions are started. However, recovery may not occur very often. Therefore,
checkpointing may need to be done from time to time during the normal activity of a
server. The checkpoint is written to a future recovery file, and the current recovery file
remains in use until the checkpoint is complete. Checkpointing consists of ‘adding a
mark’ to the recovery file when the checkpointing starts, writing the server’s objects to
the future recovery file and then copying (1) entries before the mark that relate to as-yet-
unresolved transactions and (2) all entries after the mark in the recovery file to the future
recovery file. When the checkpoint is complete, the future recovery file becomes the
recovery file.
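A sketch of that procedure (ours; the recovery-file handles and their methods are assumed interfaces):

    def make_checkpoint(current, future, committed_objects, unresolved):
        mark = current.add_mark()            # 'adding a mark' when we start
        future.write_checkpoint(committed_objects)
        # (1) earlier entries that concern as-yet-unresolved transactions:
        for entry in current.entries_before(mark):
            if entry.tid in unresolved:
                future.append(entry)
        # (2) everything logged after the mark, since the server kept running:
        for entry in current.entries_after(mark):
            future.append(entry)
        future.become_current()              # the future file takes over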
The recovery system can reduce its use of space by discarding the old recovery
file. When the recovery manager is carrying out the recovery process, it may encounter
a checkpoint in the recovery file. When this happens, it can restore immediately all
outstanding objects from the checkpoint.
13.6.2 Shadow versions

Figure 13.20 Shadow versions (only the version store row is recoverable from the scan):

    Version store    P0: 100   P0': 200   P0'': 300   P1: 80   P2: 220   P3: 278   P4: 242
    Checkpoint: P0 to P0''
during the normal activity of the system. This is because logging requires only a
sequence of append operations to the same file, whereas the shadow versions method
requires an additional stable storage write (involving two unrelated disk blocks).
Shadow versions on their own are not sufficient for a server that handles
distributed transactions. Transaction status entries and intentions lists are saved in a file
called the transaction status file. Each intentions list represents the part of the map that
will be altered by a transaction when it commits. The transaction status file may, for
example, be organized as a log.
The figure below shows the map and the transaction status file for our current
example when T has committed and U is prepared to commit.
There is a chance that a server may crash between the time when a committed status is
written to the transaction status file and the time when the map is updated - in which
case the client will not have been acknowledged. The recovery manager must allow for
this possibility when the server is replaced after a crash, for example by checking
whether the map includes the effects of the last committed transaction in the transaction
status file. If it does not, then the latter should be marked as aborted.
13.6.3 The need for transaction status and intentions list entries in a recovery file
It is possible to design a simple recovery file that does not include entries for transaction
status items and intentions lists. This sort of recovery file may be suitable when all
transactions are directed to a single server. The use of transaction status items and
intentions lists in the recovery file is essential for a server that is intended to participate
in distributed transactions. This approach can also be useful for servers of non-
distributed transactions for various reasons, including the following:
• Some recovery managers are designed to write the objects to the recovery file
early - under the assumption that transactions normally commit.
• If transactions use a large number of big objects, the need to write them
contiguously to the recovery file may complicate the design of a server. When
objects are referenced from intentions lists, they can be found wherever they are.
• In timestamp ordering concurrency control, a server sometimes knows that a
transaction will eventually be able to commit and acknowledges the client; at this
time the objects are written to the recovery file (see Chapter 12) to ensure their
permanence. However, the transaction may have to wait to commit until earlier
transactions have committed. In such situations, the corresponding transaction
status entries in the recovery file will be waiting to commit and then committed,
to ensure timestamp ordering of committed transactions in the recovery file. On
recovery, any waiting-to-commit transactions can be allowed to commit, because
the ones they were waiting for have either just committed or, if not, will have to
be aborted due to the failure of the server.
13.6.4 Recovery of the two-phase commit protocol

In phase 1 of the protocol, when the coordinator is prepared to commit (and has already
added a prepared status entry to its recovery file), its recovery manager adds a
coordinator entry to its recovery file. Before a participant can vote Yes, it must have
already prepared to commit (and must have already added a prepared status entry to its
recovery file). When it votes Yes, its recovery manager records a participant entry and
adds an uncertain transaction status to its recovery file as a forced write. When a
participant votes No, it adds an abort transaction status to its recovery file.
In phase 2 of the protocol, the recovery manager of the coordinator adds either a
committed or an aborted transaction status to its recovery file, according to the decision.
This must be a forced write. Recovery managers of participants add a commit or abort
transaction status to their recovery files according to the message received from the
coordinator. When a coordinator has received a confirmation from all of its participants,
its recovery manager adds a done transaction status to its recovery file; this need not be
forced. The done status entry is not part of the protocol but is used when the recovery
file is reorganized.

Figure 13.21 shows the entries in a log for transaction T, in which the server
played the coordinator role, and for transaction U, in which the server played the
participant role. For both transactions, the prepared transaction status entry comes first.
In the case of a coordinator it is followed by a coordinator entry and a committed
transaction status entry. The done transaction status entry is not shown in Figure 13.21.
In the case of a participant, the prepared transaction status entry is followed by a
participant entry whose state is uncertain and then a committed or aborted transaction
status entry.

Figure 13.22 summarizes the recovery actions for each role and status:

• Coordinator, prepared: No decision had been reached before the server failed. It
sends abortTransaction to all the servers in the participant list and adds the
transaction status aborted to its recovery file. The same action is taken for the
state aborted. If there is no participant list, the participants will eventually time
out and abort the transaction.

• Coordinator, committed: A decision to commit had been reached before the server
failed. It sends a doCommit to all of the participants in its participant list (in case
it had not done so before) and resumes the two-phase protocol at step 4 (see
Figure 13.5).

• Participant, uncertain: The participant failed before it knew the outcome of the
transaction. It cannot determine the status of the transaction until the coordinator
informs it of the decision. It will send a getDecision to the coordinator to
determine the status of the transaction; when it receives the reply, it will commit
or abort accordingly.

• Participant, prepared: The participant has not yet voted and can abort the
transaction.
When a server is replaced after a crash, the recovery manager has to deal with the
two-phase commit protocol in addition to restoring the objects. For any transaction
where the server has played the coordinator role, it should find a coordinator entry and
a set of transaction status entries. For any transaction where the server played the
participant role, it should find a participant entry and a set of transaction status entries.
In both cases, the most recent transaction status entry - that is, the one nearest the end
of the log - determines the transaction status at the time of failure. The action of the
recovery manager with respect to the two-phase commit protocol for any transaction
depends on whether the server was the coordinator or a participant and on its status at
the time of failure, as shown in Figure 13.22.
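The recovery actions in Figure 13.22 amount to a dispatch on the pair (role, most
recent status). The following Python fragment is a minimal sketch of that dispatch,
not the book's implementation; the transport helpers (send_abort, send_do_commit,
get_decision) are assumptions made for the example.

    def send_abort(participant):
        print("abortTransaction ->", participant)   # assumed transport stub

    def send_do_commit(participant):
        print("doCommit ->", participant)           # assumed transport stub

    def get_decision(coordinator):
        return "committed"                          # assumed getDecision stub

    def recover_two_phase_commit(role, status, participants, coordinator):
        """Dispatch on the server's role and its most recent status entry."""
        if role == "coordinator":
            if status in ("prepared", "aborted"):
                # No decision had been reached: abort at every participant.
                for p in participants:
                    send_abort(p)
                return "aborted"
            if status == "committed":
                # A commit decision had been reached: (re)send doCommit.
                for p in participants:
                    send_do_commit(p)
                return "committed"
        else:  # participant
            if status == "uncertain":
                # Ask the coordinator for the outcome and act on the reply.
                return get_decision(coordinator)
            if status == "prepared":
                # The participant has not yet voted, so it can abort.
                return "aborted"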
[Figure and accompanying discussion of the recovery of nested transactions are only partially reproduced here; the surviving fragment describes committed versions being pushed onto a stack, with transactions T12 and T2 acting in a similar way and finally leaving the result of T2 at the top of the stack.]
13.7 Summary
In the most general case, a client's transaction will request operations on objects in
several different servers. A distributed transaction is any transaction whose activity
involves several different servers. A nested transaction structure may be used to allow
additional concurrency and independent committing by the servers in a distributed
transaction.
The atomicity property of transactions requires that the servers participating in a
distributed transaction either all commit it or all abort it. Atomic commit protocols are
designed to achieve this effect, even if servers crash during their execution. The two-
phase commit protocol allows a server to decide to abort unilaterally. It includes timeout
actions to deal with delays due to servers crashing. The two-phase commit protocol can
take an unbounded amount of time to complete but is guaranteed to complete eventually.
Concurrency control in distributed transactions is modular - each server is
responsible for the serializability of transactions that access its own objects. However,
additional protocols are required to ensure that transactions are serializable globally.
Distributed transactions that use timestamp ordering require a means of generating an
agreed timestamp ordering between the multiple servers. Those that use optimistic
concurrency control require global validation or a means of forcing a global ordering on
committing transactions.
Distributed transactions that use two-phase locking can suffer from distributed
deadlocks. The aim of distributed deadlock detection is to look for cycles in the global
wait-for graph. If a cycle is found, one or more transactions must be aborted to resolve
the deadlock. Edge chasing is a non-centralized approach to the detection of distributed
deadlocks.
Transaction-based applications have strong requirements for the long life and
integrity of the information stored, but they do not usually have requirements for
immediate response at all times. Atomic commit protocols are the key to distributed
transactions, but they cannot be guaranteed to complete within a particular time limit.
Transactions are made durable by performing checkpoints and logging in a recovery
file, which is used for recovery when a server is replaced after a crash. Users of a
transaction service would experience some delay during recovery. Although it is
assumed that the servers of distributed transactions exhibit crash failures and run in an
asynchronous system, they are able to reach consensus about the outcome of
transactions because crashed servers are replaced with new processes that can acquire
all the relevant information from permanent storage or from other servers.
EXERCISES
13.6 Extend the definition of two-phase locking to apply to distributed transactions. Explain
how this is ensured by distributed transactions using strict two-phase locking locally.
page 528 and Chapter 12
13.7 Assuming that strict two-phase locking is in use, describe how the actions of the two-
phase commit protocol relate to the concurrency control actions of each individual
server. How does distributed deadlock detection fit in? pages 520 and 528
13.8 A server uses timestamp ordering for local concurrency control. What changes must be
made to adapt it for use with distributed transactions? Under what conditions could it be
argued that the two-phase commit protocol is redundant with timestamp ordering?
pages 520 and 529
13.9 Consider distributed optimistic concurrency control in which each server performs local
backward validation sequentially (that is, with only one transaction in the validate and
update phase at one time), in relation to your answer to Exercise 13.4. Describe the
possible outcomes when the two transactions attempt to commit. What difference does
it make if the servers use parallel validation? Chapter 12 and page 530
13.10 A centralized global deadlock detector holds the union of local wait-for graphs. Give an
example to explain how a phantom deadlock could be detected if a waiting transaction
in a deadlock cycle aborts during the deadlock detection procedure. page 533
13.11 Consider the edge-chasing algorithm (without priorities). Give examples to show that it
could detect phantom deadlocks. page 534
13.12 A server manages the objects a1, a2, ..., an. It provides two operations for its clients:
Read(i) returns the value of ai
Write(i, Value) assigns Value to ai
The transactions T, U and V are defined as follows:
T: x = Read(i); Write(j, 44);
U: Write(i, 55); Write(j, 66);
V: Write(k, 77); Write(k, 88);
Describe the information written to the log file on behalf of these three transactions if
strict two-phase locking is in use and U acquires ai and aj before T. Describe how the
recovery manager would use this information to recover the effects of T, U and V when
the server is replaced after a crash. What is the significance of the order of the commit
entries in the log file? pages 540-542
13.13 The appending of an entry to the log file is atomic, but append operations from different
transactions may be interleaved. How does this affect the answer to Exercise 13.12?
pages 540-542
13.14 The transactions T, U and V of Exercise 13.12 use strict two-phase locking and their
requests are interleaved as follows:
T                      U                      V
x = Read(i);
                                              Write(k, 77);
                       Write(i, 55);
Write(j, 44);
                                              Write(k, 88);
                       Write(j, 66);
Assuming that the recovery manager appends the data entry corresponding to each Write
operation to the log file immediately instead of waiting until the end of the transaction,
describe the information written to the log file on behalf of the transactions T, U and V.
Does early writing affect the correctness of the recovery procedure? What are the
advantages and disadvantages of early writing? pages 540-542
13.15 Transactions T and U are run with timestamp ordering concurrency control. Describe the
information written to the log file on behalf of T and U, allowing for the fact that U has
a later timestamp than T and must wait to commit after T. Why is it essential that the
commit entries in the log file be ordered by timestamps? Describe the effect of recovery
if the server crashes (i) between the two Commits and (ii) after both of them.
T                      U
x = Read(i);
                       Write(i, 55);
                       Write(j, 66);
Write(j, 44);
Commit
                       Commit
What are the advantages and disadvantages of early writing with timestamp ordering?
page 545
13.16 The transactions T and U in Exercise 13.15 are run with optimistic concurrency control
using backward validation and restarting any transactions that fail. Describe the
information written to the log file on their behalf. Why is it essential that the commit
entries in the log file be ordered by transaction numbers? How are the write sets of
committed transactions represented in the log file? pages 540-542
13.17 Suppose that the coordinator of a transaction crashes after it has recorded the intentions
list entry but before it has recorded the participant list or sent out the canCommit?
requests. Describe how the participants resolve the situation. What will the coordinator
do when it recovers? Would it be any better to record the participant list before the
intentions list entry? page 546
Implementing Fault-Tolerant Services Using the State Machine
Approach: A Tutorial
FRED B. SCHNEIDER
Department of Computer Science, Cornell University, Ithaca, New York 14853
The state machine approach is a general method for implementing fault-tolerant services
in distributed systems. This paper reviews the approach and describes protocols for two
different failure models—Byzantine and fail stop. System reconfiguration techniques for
removing faulty components and integrating repaired components are also discussed.
time varying, then the result would not satisfy the semantic characterization given
above and therefore would not be a state machine. This is because values sent to
actuator (the output of the state machine) would not depend solely on the requests
made to the state machine but would, in addition, depend on the execution speed of
the loop. In the structure used above, this problem has been avoided by moving the
loop into monitor.

In practice, having to structure a system in terms of state machines and clients does
not constitute a real restriction. Anything that can be structured in terms of procedures
and procedure calls can also be structured using state machines and clients: a state
machine implements the procedure, and requests implement the procedure calls. In
fact, state machines permit more flexibility in system structure than is usually available
with procedure calls. With state machines, a client making a request is not delayed
until that request is processed, and the output of a request can be sent someplace other
than to the client making the request. We have not yet encountered an application that
could not be programmed cleanly in terms of state machines and clients.

2. FAULT TOLERANCE

Before turning to the implementation of fault-tolerant state machines, we must
introduce some terminology concerning failures. A component is considered faulty once
its behavior is no longer consistent with its specification. In this paper, we consider two
representative classes of faulty behavior:

Byzantine Failures. The component can exhibit arbitrary and malicious behavior,
perhaps involving collusion with other faulty components [Lamport et al. 1982].

Fail-stop Failures. In response to a failure, the component changes to a state that
permits other components to detect that a failure has occurred and then stops
[Schneider 1984].

Byzantine failures can be the most disruptive, and there is anecdotal evidence that
such failures do occur in practice. Allowing Byzantine failures is the weakest possible
assumption that could be made about the effects of a failure. Since a design based on
assumptions about the behavior of faulty components runs the risk of failing if these
assumptions are not satisfied, it is prudent that life-critical systems tolerate Byzantine
failures. For most applications, however, it suffices to assume fail-stop failures.

A system consisting of a set of distinct components is t fault tolerant if it satisfies
its specification provided that no more than t of those components become faulty during
some interval of interest. (A t fault-tolerant system might continue to operate correctly
if more than t failures occur, but correct operation cannot be guaranteed.) Fault
tolerance traditionally has been specified in terms of mean time between failures
(MTBF), probability of failure over a given interval, and other statistical measures
[Siewiorek and Swarz 1982]. Although it is clear that such characterizations are
important to the users of a system, there are advantages in describing the fault tolerance
of a system in terms of the maximum number of component failures that can be
tolerated over some interval of interest. Asserting that a system is t fault tolerant makes
explicit the assumptions required for correct operation; MTBF and other statistical
measures do not. Moreover, t fault tolerance is unrelated to the reliability of the
components that make up the system and therefore is a measure of the fault tolerance
supported by the system architecture, in contrast to fault tolerance achieved simply by
using reliable components. MTBF and other statistical reliability measures of a t
fault-tolerant system can be derived from statistical reliability measures for the
components used in constructing that system, in particular the probability that there
will be t or more failures during the operating interval of interest. Thus, t is typically
chosen based on statistical measures of component reliability.

3. FAULT-TOLERANT STATE MACHINES

A t fault-tolerant version of a state machine can be implemented by replicating that
state machine and running a replica on each of the processors in a distributed system.
Provided each replica being run by a nonfaulty processor starts in the same initial state
and executes the same requests in the same order, then each will do the same thing and
produce the same output. Thus, if we assume that each failure can affect at most one
processor, hence one state machine replica, then by combining the output of the state
machine replicas of this ensemble, we can obtain the output for the t fault-tolerant state
machine.

When processors can experience Byzantine failures, an ensemble implementing a
t fault-tolerant state machine must have at least 2t + 1 replicas, and the output of the
ensemble is the output produced by the majority of the replicas. This is because with
2t + 1 replicas, the majority of the outputs remain correct even after as many as t
failures. If processors experience only fail-stop failures, then an ensemble containing
t + 1 replicas suffices, and the output of the ensemble can be the output produced by
any of its members. This is because only correct outputs are produced by fail-stop
processors, and after t failures one nonfaulty replica will remain among the t + 1
replicas.
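These two combination rules can be sketched in a few lines of Python. The sketch
below is an illustration under assumed interfaces (one output value per replica), not
code from the paper.

    from collections import Counter

    def combine_byzantine(outputs, t):
        """Majority output of a 2t + 1 ensemble; with at most t faulty
        replicas, at least t + 1 correct replicas agree on the result."""
        value, count = Counter(outputs).most_common(1)[0]
        assert count >= t + 1, "no majority value"
        return value

    def combine_fail_stop(outputs):
        """Any output of a t + 1 fail-stop ensemble is correct."""
        return next(iter(outputs))

    print(combine_byzantine([42, 42, 42, 7, 9], t=2))   # prints 42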
The key, then, for implementing a t fault-tolerant state machine is to ensure the
following:

Replica Coordination. All replicas receive and process the same sequence of
requests.

This can be decomposed into two requirements concerning dissemination of requests
to replicas in an ensemble.

Agreement. Every nonfaulty state machine replica receives every request.

Order. Every nonfaulty state machine replica processes the requests it receives in
the same relative order.

Notice that Agreement governs the behavior of a client in interacting with state
machine replicas and that Order governs the behavior of a state machine replica with
respect to requests from various clients. Thus, although Replica Coordination could
be partitioned in other ways, the Agreement-Order partitioning is a natural choice
because it corresponds to the existing separation of the client from the state machine
replicas.

Implementations of Agreement and Order are discussed in Sections 3.1 and 3.2.
These implementations make no assumptions about clients or commands. Although
this generality is useful, knowledge of commands allows Replica Coordination, hence
Agreement and Order, to be weakened and thus allows cheaper protocols to be used for
managing the replicas in an ensemble. Examples of two common weakenings follow.

First, Agreement can be relaxed for read-only requests when fail-stop processors are
being assumed. When processors are fail stop, a request r whose processing does not
modify state variables need only be sent to a single nonfaulty state machine replica.
This is because the response from this replica is, by definition, guaranteed to be
correct and, because r changes no state variables, the state of the replica that processes
r will remain identical to the states of replicas that do not.

Second, Order can be relaxed for requests that commute. Two requests r and r'
commute in a state machine sm if the sequence of outputs and final state of sm that
would result from processing r followed by r' is the same as would result from
processing r' followed by r. An example of a state machine where Order can be relaxed
appears in Figure 3. State machine tally determines which from among a set of
alternatives receives at least MAJ votes and sends this choice to SYSTEM. If clients
cannot vote more than once and the number of clients Cno satisfies 2MAJ > Cno, then
every request commutes with every other. Thus, implementing Order would be
unnecessary: different replicas of the state machine will produce the same outputs even
if they process requests in different orders. On the other hand, if clients can vote more
than once or 2MAJ < Cno, then reordering requests might change the outcome of the
election.

    tally: state_machine
        var votes: array[candidate] of integer init 0
        cast_vote: command(choice: candidate)
            votes[choice] := votes[choice] + 1;
            if votes[choice] ≥ MAJ → send choice to SYSTEM;
                                      halt
            □ votes[choice] < MAJ → skip
            fi
        end cast_vote
    end tally

    Figure 3. Election.

Theories for constructing state machine ensembles that do not satisfy Replica
Coordination are proposed in Aizikowitz
[1989] and Mancini and Pappalardo [1988]. Both theories are based on proving that an
ensemble of state machines implements the same specification as a single replica does.
The approach taken in Aizikowitz [1989] uses temporal logic descriptions of state
sequences, whereas the approach in Mancini and Pappalardo [1988] uses an algebra of
action sequences. A detailed description of this work is beyond the scope of this
tutorial.

3.1 Agreement

The Agreement requirement can be satisfied by using any protocol that allows a
designated processor, called the transmitter, to disseminate a value to some other
processors in such a way that:

IC1: All nonfaulty processors agree on the same value.

IC2: If the transmitter is nonfaulty, then all nonfaulty processors use its value as the
one on which they agree.

Protocols to establish IC1 and IC2 have received considerable attention in the
literature and are sometimes called Byzantine Agreement protocols, reliable broadcast
protocols, or simply agreement protocols. The hard part in designing such protocols is
coping with a transmitter that fails partway through an execution. See Strong and
Dolev [1983] for protocols that can tolerate Byzantine processor failures and Schneider
et al. [1984] for a (significantly cheaper) protocol that can tolerate (only) fail-stop
processor failures.

If requests are distributed to all state machine replicas by using a protocol that
satisfies IC1 and IC2, then the Agreement requirement is satisfied. Either the client
can serve as the transmitter or the client can send its request to a single state machine
replica and let that replica serve as the transmitter. When the client does not itself
serve as the transmitter, however, the client must ensure that its request is not lost or
corrupted by the transmitter before the request is disseminated to the state machine
replicas. One way to monitor for such corruption is by having the client be among the
processors that receive the request from the transmitter.

3.2 Order and Stability

The Order requirement can be satisfied by assigning unique identifiers to requests and
having state machine replicas process requests according to a total ordering relation
on these unique identifiers. This is equivalent to requiring the following, where a
request is defined to be stable at smi once no request from a correct client and bearing
a lower unique identifier can be subsequently delivered to state machine replica smi:

Order Implementation. A replica next processes the stable request with the
smallest unique identifier.

Further refinement of the Order Implementation requires selecting a method for
assigning unique identifiers to requests and devising a stability test for that assignment
method. Note that any method for assigning unique identifiers is constrained by O1
and O2 of Section 1, which imply that if request ri could have caused request rj to be
made, then uid(ri) < uid(rj) holds, where uid(r) is the unique identifier assigned to a
request r.

In the sections that follow, we give three refinements of the Order Implementation.
Two are based on the use of clocks; a third uses an ordering defined by the replicas of
the ensemble.
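As a reference point for the first, clock-based refinement below, here is a minimal
Python sketch of a logical clock in the style of Lamport [1978a]; the class and method
names are assumptions of this example, not code from the paper.

    class LogicalClock:
        """Lamport-style logical clock: timestamps respect potential causality."""

        def __init__(self):
            self.time = 0

        def local_event(self):
            self.time += 1                 # every event advances the clock
            return self.time

        def send(self):
            return self.local_event()      # the timestamp travels with the message

        def receive(self, msg_time):
            # Advance past both the local clock and the received timestamp.
            self.time = max(self.time, msg_time) + 1
            return self.time

Unique, totally ordered request identifiers can then be obtained by appending a
processor id to the timestamp as a tiebreaker.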
3.2.1 Using Logical Clocks

A logical clock [Lamport 1978a] is a mapping T from events to the integers. T(e),
Real-time Clock Stability Test Tolerating Byzantine Failures I. A request r
is stable at a state machine replica smi being executed by processor p if the local clock
at p reads τ and uid(r) < τ - Δ.

One disadvantage of this stability test is that it forces the state machine to lag
behind its clients by Δ, where Δ is proportional to the worst-case message delivery
delay. This disadvantage can be avoided. Due to property O1 of the total ordering on
request identifiers, if communications channels satisfy FIFO Channels, then a state
machine replica that has received a request r from a client c can subsequently receive
from c only requests with unique identifiers greater than uid(r). Thus, a request r is
also stable at a state machine replica provided a request with a larger unique identifier
has been received from every client.

Real-time Clock Stability Test Tolerating Byzantine Failures II. A request
r is stable at a state machine replica smi if a request with a larger unique identifier has
been received from every client.

This second stability test is never passed if a (faulty) processor refuses to make
requests. However, by combining the first and second tests so that a request is
considered stable when it satisfies either test, a stability test results that lags clients by
Δ only when faulty processors or network delays force it. Such a combined test is
discussed in [Gopal et al. 1990].

3.2.3 Using Replica-Generated Identifiers

In the previous two refinements of the Order Implementation, clients determine the
order in which requests are processed: the unique identifier uid(r) for a request r is
assigned by the client making that request. In the following refinement of the Order
Implementation, the state machine replicas determine this order. Unique identifiers are
computed in two phases. In the first phase, which can be part of the agreement protocol
used to satisfy the Agreement requirement, state machine replicas propose candidate
unique identifiers for a request. Then, in the second phase, one of these candidates is
selected and it becomes the unique identifier for that request.

The advantage of this approach to computing unique identifiers is that
communication between all processors in the system is not necessary. When logical
clocks or synchronized real-time clocks are used in computing unique request
identifiers, all processors hosting clients or state machine replicas must communicate.
In the case of logical clocks, this communication is needed in order for requests to
become stable; in the case of synchronized real-time clocks, this communication is
needed in order to keep the clocks synchronized. (This communications cost argument
illustrates an advantage of having a client forward its request to a single state machine
replica that then serves as the transmitter for disseminating the request. In effect, that
state machine replica becomes the client of the state machine, and so communication
need only involve those processors running state machine replicas.) In the
replica-generated identifier approach of this section, the only communication required
is among processors running the client and state machine replicas.

By constraining the possible candidates proposed in phase 1 for a request's unique
identifier, it is possible to obtain a simple stability test. To describe this stability test,
some terminology is required. We say that a state machine replica smi has seen a
request r once smi has received r and proposed a candidate unique identifier for r. We
say that smi has accepted r once that replica knows the ultimate choice of unique
identifier for r. Define cuid(smi, r) to be the candidate unique identifier proposed by
replica smi for request r. Two constraints that lead to a simple stability test are:

UID1: cuid(smi, r) ≤ uid(r).

UID2: If a request r' is seen by replica smi after r has been accepted by smi, then
uid(r) < cuid(smi, r').

If these constraints hold throughout execution, then the following test can be used
to determine whether a request is stable at a state machine replica:

Replica-Generated Identifiers Stability Test. A request r that has been
accepted by smi is stable provided there is no
request r' that has (i) been seen by smi, (ii) not been accepted by smi, and (iii) for which
cuid(smi, r') ≤ uid(r) holds.

To prove that this stability test works, we must show that once an accepted request
r is deemed stable at smi, no request with a smaller unique identifier will be
subsequently accepted at smi. Let r be a request that, according to the
Replica-Generated Identifiers Stability Test, is stable at replica smi. Due to UID2, for
any request r' that has not been seen by smi, uid(r) < cuid(smi, r') holds. Thus, by
transitivity using UID1, uid(r) < uid(r') holds, and we conclude that r' cannot have a
smaller unique identifier than r. Now consider the case in which request r' has been
seen but not accepted by smi and, because the stability test for r is satisfied,
uid(r) < cuid(smi, r') holds. Due to UID1, we conclude that uid(r) < uid(r') holds
and, therefore, r' does not have a smaller unique identifier than r. Thus, we have
shown that once a request r satisfies the Replica-Generated Identifiers Stability Test
at smi, any request r' that is accepted by smi will satisfy uid(r) < uid(r'), as desired.

Unlike clock-generated unique identifiers for requests, replica-generated ones do
not necessarily satisfy O1 and O2 of Section 1. Without further restrictions, it is
possible for a client to make a request r, send a message to another client causing
request r' to be issued, yet have uid(r') < uid(r). However, O1 and O2 will hold
provided that once a client starts disseminating a request to the state machine replicas,
the client performs no other communication until every state machine replica has
accepted that request. To see why this works, consider a request r being made by some
client and suppose some request r' was influenced by r. The delay ensures that r is
accepted by every state machine replica before r' is seen. Thus, from UID2 we
conclude uid(r) < cuid(smi, r') and, by transitivity with UID1, that uid(r) < uid(r'),
as required.

To complete this Order Implementation, we have only to devise protocols for
computing unique identifiers and candidate unique identifiers such that:

(1) UID1 and UID2 are satisfied.
(2) r ≠ r' implies uid(r) ≠ uid(r').
(3) Every request that is seen eventually becomes accepted.

One simple solution for a system of fail-stop processors is the following:

Replica-generated Unique Identifiers. In a system with N clients, each state
machine replica smi maintains two variables: SEENi is the largest cuid(smi, r)
assigned to any request r so far seen by smi, and ACCEPTi is the largest uid(r)
assigned to any request r so far accepted by smi.

Upon receipt of a request r, each replica smi computes

    cuid(smi, r) := max(⌊SEENi⌋, ⌊ACCEPTi⌋) + 1 + i/N.    (4)

(Notice, this means that all candidate unique identifiers are themselves unique.) The
replica then disseminates (using an agreement protocol) cuid(smi, r) to all other
replicas and awaits receipt of a candidate unique identifier for r from every nonfaulty
replica, participating in the agreement protocol for that value as well. Let NF be the
set of replicas from which candidate unique identifiers were received. Finally, the
replica computes

    uid(r) := max{cuid(smj, r) : smj ∈ NF}    (5)

and accepts r.

We prove that this protocol satisfies (1)-(3) as follows. UID1 follows from using
assignment (5) to compute uid(r), and UID2 follows from assignment (4) to compute
cuid(smi, r). To conclude that (2) holds, we argue as follows. Because an agreement
protocol is used to disseminate candidate unique identifiers, all replicas receive the
same values from the same replicas. Thus, all replicas will execute the same
assignment statement (5), and all will compute the same value for uid(r). To establish
that these uid(r) values are unique for each request, it suffices to observe that
maximums of disjoint subsets of a collection of unique values, the candidate unique
identifiers, are also unique. Finally, to establish (3), that every request that is seen is
eventually accepted, we must prove that for each replica smj, a replica smi eventually
learns cuid(smj, r) or learns that smj has failed. This follows trivially from the use of
an agreement protocol to distribute the cuid(smj, r) and the definition of a fail-stop
processor.
An optimization of our Replica-generated Unique Identifiers protocol is the basis
for the ABCAST protocol in the ISIS Toolkit [Birman and Joseph 1987] developed at
Cornell. In this optimization, candidate unique identifiers are returned to the client
instead of being disseminated to the other state machine replicas. The client then
executes assignment (5) to compute uid(r). Finally, an agreement protocol is used by
the client in disseminating uid(r) to the state machine replicas. Some unique replica
takes over for the client if the client fails.

It is possible to modify our Replica-generated Unique Identifiers protocol for use
in systems where processors can exhibit Byzantine failures, have synchronized
real-time clocks, and communications channels have bounded message-delivery delays:
the same environment as was assumed for using synchronized real-time clocks to
generate unique identifiers. The following changes are required. First, each replica smi
uses timeouts so that smi cannot be forever delayed waiting to receive and participate
in the agreement protocol for disseminating a candidate unique identifier from a faulty
replica smj. Second, if smi does determine that smj has timed out, smi disseminates
"smj timeout" to all replicas (by using an agreement protocol). Finally, NF is the set
of replicas in the ensemble less any smj for which "smj timeout" has been received from
t + 1 or more replicas. Notice, Byzantine failures that cause faulty replicas to propose
candidate unique identifiers not produced by (4) do not cause difficulty. This is because
candidate unique identifiers that are too small have no effect on the outcome of (5) at
nonfaulty replicas and those that are too large will satisfy UID1 and UID2.

4. TOLERATING FAULTY OUTPUT DEVICES

It is not possible to implement a t fault-tolerant system by using a single voter to
combine the outputs of an ensemble of state machine replicas into one output. This is
because a single failure, of the voter, can prevent the system from producing the correct
output. Solutions to this problem depend on whether the output of the state machine
implemented by the ensemble is to be used within the system or outside the system.

4.1 Outputs Used Outside the System

If the output of the state machine is sent to an output device, then that device is
already a single component whose failure cannot be tolerated. Thus, being able to
tolerate a faulty voter is not sufficient; the system must also be able to tolerate a faulty
output device. The usual solution to this problem is to replicate the output device and
voter. Each voter combines the output of each state machine replica, producing a
signal that drives one output device. Whatever reads the outputs of the system is
assumed to combine the outputs of the replicated devices. This reader, which is not
considered part of the computing system, implements the critical voter.

If output devices can exhibit Byzantine failures, then by taking the output produced
by the majority of the devices, 2t + 1-fold replication permits up to t faulty output
devices to be tolerated. For example, a flap on an airplane wing might be designed so
that when the 2t + 1 actuators that control it do not agree, the flap always moves in
the direction of the majority (rather than twisting). If output devices exhibit only
fail-stop failures, then only t + 1-fold replication is necessary to tolerate t failures
because any output produced by a fail-stop output device can be assumed correct. For
example, video display terminals usually present information with
enough redundancy so that they can be treated as fail stop: failure detection is
implemented by the viewer. With such an output device, a human user can look at one
of t + 1 devices, decide whether the output is faulty, and only if it is faulty, look at
another, and so on.

4.2 Outputs Used Inside the System

If the output of the state machine is to a client, then the client itself can combine the
outputs of state machine replicas in the ensemble. Here, the voter, a part of the client,
is faulty exactly when the client is, so the fact that an incorrect output is read by the
client due to a faulty voter is irrelevant. When Byzantine failures are possible, the
client waits until it has received t + 1 identical responses, each from a different
member of the ensemble, and takes that as the response from the t fault-tolerant state
machine. When only fail-stop failures are possible, the client can proceed as soon as
it has received a response from any member of the ensemble, since any output produced
by a replica must be correct.
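The client-side voter just described can be sketched as follows in Python; the
response-gathering interface (an iterable of (replica, value) pairs, at most one response
per replica) is an assumption of this example.

    from collections import Counter

    def client_vote_byzantine(responses, t):
        """Return a value once t + 1 identical responses, each from a
        different replica, have been received."""
        tally = Counter()
        for _replica_id, value in responses:
            tally[value] += 1
            if tally[value] >= t + 1:
                return value
        raise RuntimeError("fewer than t + 1 matching responses received")

    def client_vote_fail_stop(responses):
        """With fail-stop failures the first response is already correct."""
        for _replica_id, value in responses:
            return value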
When the client is executed on the same processor as one of the state machine
replicas, optimization of client-implemented voting is possible. (Care must be exercised
when analyzing the fault tolerance of such a system because a single processor failure
can now cause two system components to fail. Implicit in most of our discussions is
that system components fail independently. It is not always possible to transform a t
fault-tolerant system in which clients and state machine replicas have independent
failures to one in which they share processors.) This is because correctness of the
processor implies that both the state machine replica and client will be correct.
Therefore, the response produced by the state machine replica running locally can be
used as that client's response from the t fault-tolerant state machine. And, if the
processor is faulty, we are entitled to view the client as being faulty, so it does not
matter what state machine responses the client receives. Summarizing, we have the
following:

Dependent-Failures Output Optimization. If a client and a state machine
replica run on the same processor, then even when Byzantine failures are possible, the
client need not gather a majority of responses to its requests to the state machine. It
can use the single response produced locally.

5. TOLERATING FAULTY CLIENTS

Implementing a t fault-tolerant state machine is not sufficient for implementing a t
fault-tolerant system. Faults might result in clients making requests that cause the
state machine to produce erroneous output or that corrupt the state machine so that
subsequent requests from nonfaulty clients are incorrectly processed. Therefore, in this
section we discuss various methods for insulating the state machine from faults that
affect clients.

5.1 Replicating the Client

One way to avoid having faults affect a client is by replicating the client and running
each replica on hardware that fails independently. This replication, however, also
requires changes to state machines that handle requests from that client. This is
because after a client has been replicated N-fold, any state machine it interacts with
receives N requests, one from each client replica, when it formerly received a single
request. Moreover, corresponding requests from different client replicas will not
necessarily be identical. First, they will differ in their unique identifiers. Second,
unless the original client is itself a state machine and the methods of Section 3 are used
to coordinate the replicas, corresponding requests from different replicas can also differ
in their content. For example, if a client makes requests based on the value of some
time-varying sensor, then its replicas will each read their sensors at slightly different
times and, therefore, make different requests.

We first consider modifications to a state machine sm for the case in which requests
from different client replicas are known to differ only in their unique identifiers. For
this case, modifications are needed for coping with receiving N requests instead of a
single one. These modifications involve changing each command so that instead of
processing every request received, requests
are buffered until enough have been received; only then is the corresponding command
performed (a single time). (If Byzantine failures are possible, then a t fault-tolerant
client requires 2t + 1-fold replication and a command is performed after t + 1 requests
have been received. If failures are restricted to fail stop, then t + 1-fold replication will
suffice, and a command can be performed after a single request has been received.) In
effect, a voter is being added to sm to control invocation of its commands. Client
replication can be made invisible to the designer of a state machine by including such
a voter in the support software that receives requests, tests for stability, and orders
stable requests by unique identifier.

Modifying the state machine for the case in which requests from different client
replicas can also differ in their content typically requires exploiting knowledge of the
application. As before, the idea is to transform multiple requests into a single one. For
example, in a t fault-tolerant system, if 2t + 1 different requests are received, each
containing the value of a sensor, then a single request containing the median of those
values might be constructed and processed by the state machine. (Given at most t
Byzantine faults, the median of 2t + 1 values is a reasonable one to use because it is
bounded from above and below by a nonfaulty value.) A general method for
transforming multiple requests containing sensor values into a single request is
discussed in Marzullo [1989]. That method is based on viewing a sensor value as an
interval that includes the actual value being measured; a single interval (sensor) is
computed from a set of intervals by using a fault-tolerant intersection algorithm.
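The sensor example can be made concrete in a few lines of Python; the request
format is an assumption of this sketch.

    import statistics

    def merge_sensor_requests(values, t):
        """Combine 2t + 1 readings (at most t Byzantine) into one request:
        the median is bounded above and below by nonfaulty readings."""
        assert len(values) == 2 * t + 1
        return statistics.median(values)

    print(merge_sensor_requests([20.1, 19.8, 99.9, 20.0, -5.0], t=2))  # prints 20.0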
5.2 Defensive Programming

Sometimes a client cannot be made fault tolerant by using replication. In some
circumstances, due to the unavailability of sensors or processors, it simply might not
be possible to replicate the client. In other circumstances, the application semantics
might not afford a reasonable way to transform multiple requests from client replicas
into the single request needed by the state machine. In all of these circumstances,
careful design of state machines can limit the effects of requests from faulty clients.
For example, memory (Figure 1) permits any client to write to any location. Therefore,
a faulty client can overwrite all locations, destroying information. This problem could
be prevented by restricting write requests from each client to only certain memory
locations; the state machine can enforce this.

Including tests in commands is another way to design a state machine that cannot
be corrupted by requests from faulty clients. For example, mutex, as specified in
Figure 2, will execute a release command made by any client, even one that does not
have access to the resource. Consequently, a faulty client could issue such a request
and cause mutex to grant a second client access to the resource before the first has
relinquished access. A better formulation of mutex ignores release commands from all
but the client to which exclusive access has been granted. This is implemented by
changing the release in mutex to:

    release: command
        if user ≠ client → skip
        □ waiting = Φ ∧ user = client →
            user := Φ
        □ waiting ≠ Φ ∧ user = client →
            send OK to head(waiting);
            user := head(waiting);
            waiting := tail(waiting)
        fi
    end release

Sometimes, a faulty client not making a request can be just as catastrophic as one
making an erroneous request. For example, if a client of mutex failed and stopped while
it had exclusive access to the resource, then no client could be granted access to the
resource. Of course, unless we are prepared to bound the length of time that a correctly
functioning process can retain exclusive access to the resource, there is little we can do
about this problem. This is because there is no way for a state machine to distinguish
between a client that has stopped executing because it has failed and one that is
executing very slowly. However, given an upper bound B on the interval between an
acquire and the following release, the acquire command of mutex can automatically
schedule release on behalf of a client.
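The automatic release can be sketched as follows in Python; the bound B, the timer
mechanism, and the state representation are assumptions of this example, not the
paper's mutex.

    import threading

    def acquire(state, client, B):
        """Grant exclusive access and schedule a defensive release after B
        seconds, in case the client fails while holding the resource."""
        state["user"] = client
        timer = threading.Timer(B, release_on_behalf, args=(state, client))
        timer.daemon = True
        timer.start()

    def release_on_behalf(state, client):
        # Take effect only if the client still holds the resource.
        if state.get("user") == client:
            state["user"] = None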
more than a total of t actuator faults can be tolerated because faulty actuators cannot
be disabled.

The configuration of a system structured in terms of a state machine and clients can
be described using three sets: the clients C, the state machine replicas S, and the output
devices O. S is used by the agreement protocol and therefore must be known to clients
and state machine replicas. It can also be used by an output device to determine which
send operations made by state machine replicas should be ignored. C and O are used
by state machine replicas to determine from which clients requests should be processed
and to which devices output should be sent. Therefore, C and O must be available to
all state machine replicas.

Two problems must be solved to support changing the system configuration. First,
the values of C, S, and O must be available when required. Second, whenever a client,
state machine replica, or output device is added to the configuration, the state of that
element must be updated to reflect the current state of the system. These problems are
considered in the following two sections.

7.1 Managing the Configuration

The configuration of a system can be managed using the state machine in that system.
Sets C, S, and O are stored in state variables and changed by commands. Each
configuration is valid for a collection of requests: those requests r such that uid(r) is
in the range defined by two successive configuration-change requests. Thus, whenever
a client, state machine replica, or output device performs an action connected with
processing r, it uses the configuration that is valid for r. This means that a
configuration-change request must schedule the new configuration for some point far
enough in the future so that clients, state machine replicas, and output devices all find
out about the new configuration before it actually comes into effect.

There are various ways to make configuration information available to the clients
and output devices of a system. (The information is already available to the state
machine.) One is for clients and output devices to query the state machine periodically
for information about relevant pending configuration changes. Obviously,
communication costs for this scheme are reduced if clients and output devices share
processors with state machine replicas. Another way to make configuration
information available is for the state machine to include information about
configuration changes in messages it sends to clients and output devices in the course
of normal processing. Doing this requires periodic communication between the state
machine and clients and between the state machine and output devices.

Requests to change the configuration of the system are made by a failure/recovery
detection mechanism. It is convenient to think of this mechanism as a collection of
clients, one for each element of C, S, or O. Each of these configurators is responsible
for detecting the failure or repair of the single object it manages and, when such an
event is detected, for making a request to alter the configuration. A configurator is
likely to be part of an existing client or state machine replica and might be
implemented in a variety of ways.

When elements are fail stop, a configurator need only check the failure-detection
mechanism of that element. When elements can exhibit Byzantine failures, detecting
failures is not always possible. When it is possible, a higher degree of fault tolerance
can be achieved by reconfiguration. A nonfaulty configurator satisfies two safety
properties:

C1: Only a faulty element is removed from the configuration.

C2: Only a nonfaulty element is added to the configuration.

A configurator that does nothing satisfies C1 and C2. Changing the configuration
enhances fault tolerance only if F1 and F2 also hold. For F1 and F2 to hold, a
configurator must also (1) detect faults and cause elements to be removed and (2) detect
repairs and cause elements to be added. Thus, the degree to which a configurator
enhances fault tolerance is directly related to the degree to which (1) and (2) are
achieved.
Here, the semantics of the application can be helpful. For example, to infer that a
client is faulty, a state machine can compare requests made by different clients or by
the same client over a period of time. To determine that a processor executing a state
machine replica is faulty, the state machine can monitor messages sent by other state
machine replicas during execution of an agreement protocol. And, by monitoring
aspects of the environment being controlled by actuators, a state machine replica might
be able to determine that an output device is faulty. Some elements, such as
processors, have internal failure-detection circuitry that can be read to determine
whether that element is faulty or has been repaired and restarted. A configurator for
such an element can be implemented by having the state machine periodically poll this
circuitry.

In order to analyze the fault tolerance of a system that uses configurators, failure
of a configurator can be considered equivalent to the failure of the element that the
configurator manages. This is because with respect to the Combining Condition,
removal of a nonfaulty element from the system or addition of a faulty one is the same
as that element failing. Thus, in a t fault-tolerant system, the sum of the number of
faulty configurators that manage nonfaulty elements and the number of faulty
components with nonfaulty configurators must be bounded by t.

7.2 Integrating a Repaired Object

Not only must an element being added to a configuration be nonfaulty, it also must
have the correct state so that its actions will be consistent with those of the rest of the
system. Define e[ri] to be the state that a nonfaulty system element e should be in after
processing requests r0 through ri. An element e joining the configuration immediately
after request rjoin must be in state e[rjoin] before it can participate in the running
system.

An element is self-stabilizing [Dijkstra 1974] if its current state is completely
defined by the previous k inputs it has processed for some fixed k. Running such an
element long enough to ensure that it has processed k inputs is all that is required to
put it in state e[rjoin]. Unfortunately, the design of self-stabilizing state machines is
not always possible.

When elements are not self-stabilizing, processors are fail stop, and logical clocks
are implemented, cooperation of a single state machine replica smi is sufficient to
integrate a new element e into the system. This is because state information obtained
from any state machine replica smi must be correct. In order to integrate e at request
rjoin, replica smi must have access to enough state information so that e[rjoin] can be
assembled and forwarded to e.

• When e is an output device, e[rjoin] is likely to be only a small amount of
device-specific setup information: information that changes infrequently and can
be stored in state variables of smi.

• When e is a client, the information needed for e[rjoin] is frequently based on
recent sensor values read and can therefore be determined by using information
provided to smi by other clients.

• And, when e is a state machine replica, the information needed for e[rjoin] is
stored in the state variables and pending requests at smi.

The protocol for integrating a client or output device e is simple: e[rjoin] is sent to
e before the output produced by processing any request with a unique identifier larger
than uid(rjoin). The protocol for integrating a state machine replica smnew is a bit more
complex. It is not sufficient for replica smi simply to send the values of all its state
variables and copies of any pending requests to smnew. This is because some client
request might be received by smi after sending e[rjoin] but delivered to smnew before its
repair. Such a request would neither be reflected in the state information forwarded
by smi to smnew nor received by smnew directly. Thus, smi must, for a time, relay to
smnew requests received from clients. (Duplicate copies of some requests might be
received by smnew.) Since requests from a given client are received by smnew in the
order sent and in ascending order by request identifier,
once smnew has received a request directly (i.e., not relayed) from a client c, there is no
need for requests from c with larger identifiers to be relayed to smnew. If smnew informs
smi of the identifier on a request received directly from each client c, then smi can know
when to stop relaying to smnew requests from c.

The complete integration protocol is summarized in the following:

Integration with Fail-stop Processors and Logical Clocks. A state machine
replica smi can integrate an element e at request rjoin into a running system as follows:

If e is a client or output device, smi sends the relevant portions of its state variables
to e and does so before sending any output produced by requests with unique identifiers
larger than the one on rjoin.

If e is a state machine replica smnew, then smi

(1) sends the values of its state variables and copies of any pending requests to smnew,

...

by time τjoin + Δ according to its clock. Therefore, every request received by smi after
τjoin + Δ must also be received directly by smnew. Clearly, smi need not relay such
requests, and we have the following protocol:

Integration with Fail-stop Processors and Real-time Clocks. A state machine
replica smi can integrate an element e at request rjoin into a running system as follows:

If e is a client or output device, then smi sends the relevant portions of its state
variables to e and does so before sending any output produced by requests with unique
identifiers larger than the one on rjoin.

If e is a state machine replica smnew, then smi

(1) sends the values of its state variables and copies of any pending requests to smnew,

and then

(2) sends to smnew every request received during the next interval of duration Δ.
sm„ ew to consider requests received directly uses the Real-time Clock Stability Test.
from c stable only after no relayed requests The decentralized commit protocol of
from c can arrive. Thus, the stability test Skeen [1982] can be viewed as a straight
must be changed: forward application of the state machine
approach, whereas the two-phase commit
S ta b ility T e st D u r in g R estart. A re
protocol described in Gray [1978] can be
quest r received directly from a client c by
obtained from decentralized commit simply
a restarting state machine replica smnew is
by making restrictive assumptions about
stable only after the last request from c
failures and performing optimizations
relayed by another processor has been based on these assumptions. The Paxon
received by smnew. Synod commit protocol [Lamport 1989]
An obvious way to implement this new also can be understood in terms of the state
stability test is for a message to be sent to machine approach. It is similar to, but less
smnev, when no further requests from c will expensive to execute, than the standard
be relayed. three-phase commit protocol. Finally, the
method of implementing highly available
distributed services in Liskov and Ladin
8. RELATED WORK [1986] uses the state machine approach,
The state machine approach was first de with clever optimizations of the stability
scribed in Lamport [1978a] for environ test and agreement protocol that are pos
ments in which failures could not occur. It sible due to the semantics of the application
was generalized to handle fail-stop failures and the use of fail-stop processors.
in Schneider [1982], a class of failures A critique of the state machine approach
between fail-stop and Byzantine failures for transaction management in database
in Lamport [1978b], and full Byzantine systems appears in Garcia-Molina et al.
failures in Lamport [1984]. These various [1986]. Experiments evaluating the per
state machine implementations were first formance of various of the stability tests in
characterized using the Agreement and a network of SUN Workstations are re
Order requirements and a stability test in ported in Pittelli and Garcia-Molina
Schneider [1985]. [1989]. That study also reports on the per
The state machine approach has been formance of request batching, which is
used in the design of significant fault- possible when requests describe database
tolerant process control applications transactions, and the use of null requests
[Wensley et al. 1978]. It has also been used in the Logical Clock Stability Test Toler
in the design of distributed synchroniza ating Fail-stop Failures of Section 3.
tion— including read/write locks and dis Primitives to support the Agreement and
tributed semaphores [Schneider 1980], Order requirements for Replica Coordina
input/output guards for CSP and condi tion have been included in two operating
tional Ada SELECT statements [Schneider systems toolkits. The ISIS Toolkit [Birman
1982]— and in the design of a fail-stop pro 1985] provides ABCAST and CBCA ST for
cessor approximation using processors that allowing an applications programmer to
can exhibit arbitrary behavior in response control the delivery order of messages to
to a failure [Schlichting and Schneider the members of a process group (i.e., collec
1983; Schneider 1984]. A stable storage im tion of state machine replicas). ABCAST
plementation described in Bernstein [1985] ensures that all state machine replicas pro
exploits properties of a synchronous broad cess requests in the same order; CBCA ST
cast network to avoid explicit protocols for allows more flexibility in message ordering
Agreement and Order and uses Transmit and ensures that causally related requests
ting a Default Vote (as described in Sec are delivered in the correct relative order.
tion 7). The notion of A common storage, ISIS has been used to implement a number
suggested in Cristian et al. [1985], is a state of prototype applications. One example is
machine implementation of memory that the RNFS (replicated NFS) file system, a
network file system that is tolerant to fail-stop failures and runs on top of NFS, that was designed using the state machine approach [Marzullo and Schmuck 1988].

The Psync primitive [Peterson et al. 1989], which has been implemented in the x-kernel [Hutchinson and Peterson 1988], is similar to the CBCAST of ISIS. Psync, however, makes available to the programmer the graph of the message "potential causality" relation, whereas CBCAST does not. Psync is intended to be a low-level protocol that can be used to implement protocols like ABCAST and CBCAST; the ISIS primitives are intended for use by applications programmers and, therefore, hide the "potential causality" relation while at the same time include support for group management and failure reporting.

ACKNOWLEDGMENTS

This material is based on work supported in part by the Office of Naval Research under contract N00014-86-K-0092, the National Science Foundation under Grants Nos. DCR-8320274 and CCR-8701103, and Digital Equipment Corporation. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.

Discussions with Ö. Babaoglu, K. Birman, and L. Lamport over the past 5 years have helped me formulate the ideas in this paper. Useful comments on drafts of this paper were provided by J. Aizikowitz, Ö. Babaoglu, A. Bernstein, K. Birman, R. Brown, D. Gries, K. Marzullo, and B. Simons. I am very grateful to Sal March, managing editor of ACM Computing Surveys, for his thorough reading of this paper and many helpful comments.

REFERENCES

Aizikowitz, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040.

Bernstein, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.

Birman, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), ACM, pp. 79-86.

Birman, K. P., and Joseph, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.

Cristian, F., Aghili, H., Strong, H. R., and Dolev, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.

Dijkstra, E. W. 1974. Self stabilization in spite of distributed control. Commun. ACM 17, 11 (Nov.), 643-644.

Fischer, M., Lynch, N., and Paterson, M. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374-382.

Garcia-Molina, H., Pittelli, F., and Davidson, S. 1986. Applications of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47.

Gopal, A., Strong, R., Toueg, S., and Cristian, F. 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990).

Gray, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science, Vol. 60. Springer-Verlag, New York, pp. 393-481.

Halpern, J., Simons, B., Strong, R., and Dolev, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102.

Hutchinson, N., and Peterson, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88—Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75.

Lamport, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.

Lamport, L. 1978b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.

Lamport, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280.

Lamport, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.

Lamport, L., and Melliar-Smith, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 68-74.

Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401.
Liskov, B., and Ladin, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th ACM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39.

Mancini, L., and Pappalardo, G. 1988. Towards a theory of replicated processing. In Formal Techniques in Real-Time and Fault-Tolerant Systems, Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192.

Marzullo, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.

Marzullo, K., and Schmuck, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, Calif., June), IEEE Computer Society, pp. 447-455.

Peterson, L. L., Bucholz, N. C., and Schlichting, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246.

Pittelli, F. M., and Garcia-Molina, H. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60.

Schlichting, R. D., and Schneider, F. B. 1983. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS 1, 3 (Aug.), 222-238.

Schneider, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of the International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.

Schneider, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195.

Schneider, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154.

Schneider, F. B. 1985. Paradigms for distributed programs. In Distributed Systems: Methods and Tools for Specification, Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430.

Schneider, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.

Schneider, F. B., Gries, D., and Schlichting, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15.

Siewiorek, D. P., and Swarz, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.

Skeen, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.

Strong, H. R., and Dolev, D. 1983. Byzantine agreement. In Intellectual Leverage for the Information Society, Digest of Papers (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.

Wensley, J. H., Lamport, L., Goldberg, J., Green, M. W., Levitt, K. N., Melliar-Smith, P. M., Shostak, R. E., and Weinstock, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.
Explain the design and implementation of message-oriented middleware (MOM).
not necessary for the sending application to continue execution after submitting the message. Likewise, the receiving application need not be executing when the message is submitted.

In contrast, with transient communication, a message is stored by the communication system only as long as the sending and receiving application are executing. More precisely, in terms of Fig. 4-4, if the middleware cannot deliver a message due to a transmission interrupt, or because the recipient is currently not active, it will simply be discarded. Typically, all transport-level communication services offer only transient communication. In this case, the communication system consists of traditional store-and-forward routers. If a router cannot deliver a message to the next one or the destination host, it will simply drop the message.

Besides being persistent or transient, communication can also be asynchronous or synchronous. The characteristic feature of asynchronous communication is that a sender continues immediately after it has submitted its message for transmission. This means that the message is (temporarily) stored immediately by the middleware upon submission. With synchronous communication, the sender is blocked until its request is known to be accepted. There are essentially three points where synchronization can take place. First, the sender may be blocked until the middleware notifies that it will take over transmission of the request. Second, the sender may synchronize until its request has been delivered to the intended recipient. Third, synchronization may take place by letting the sender wait until its request has been fully processed, that is, up to the time that the recipient returns a response.

Various combinations of persistence and synchronization occur in practice. Popular ones are persistence in combination with synchronization at request submission, which is a common scheme for many message-queuing systems, which we discuss later in this chapter. Likewise, transient communication with synchronization after the request has been fully processed is also widely used. This scheme corresponds with remote procedure calls, which we also discuss below.

Besides persistence and synchronization, we should also make a distinction between discrete and streaming communication. The examples so far all fall in the category of discrete communication: the parties communicate by messages, each message forming a complete unit of information. In contrast, streaming involves sending multiple messages, one after the other, where the messages are related to each other by the order they are sent, or because there is a temporal relationship. We return to streaming communication extensively below.
Performing an RPC

The actual RPC is carried out transparently and in the usual way. The client stub marshals the parameters to the runtime library for transmission using the protocol chosen at binding time. When a message arrives at the server side, it is routed to the correct server based on the end point contained in the incoming message. The runtime library passes the message to the server stub, which unmarshals the parameters and calls the server. The reply goes back by the reverse route.

DCE provides several semantic options. The default is at-most-once operation, in which case no call is ever carried out more than once, even in the face of system crashes. In practice, what this means is that if a server crashes during an RPC and then recovers quickly, the client does not repeat the operation, for fear that it might already have been carried out once.

Alternatively, it is possible to mark a remote procedure as idempotent (in the IDL file), in which case it can be repeated multiple times without harm. For example, reading a specified block from a file can be tried over and over until it succeeds. When an idempotent RPC fails due to a server crash, the client can wait until the server reboots and then try again. Other semantics are also available (but rarely used), including broadcasting the RPC to all the machines on the local network. We return to RPC semantics in Chap. 8, when discussing RPC in the presence of failures.
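The client-side difference between the two semantics can be made concrete with a small Python sketch. This is our own illustration, not DCE code: the rpc callable is a placeholder, and using ConnectionError as the failure signal is an assumption.

    import time

    def call_idempotent(rpc, *args, retries=5, delay=2.0):
        # Retrying is safe only because the operation is idempotent,
        # e.g., reading a specified block from a file.
        for _ in range(retries):
            try:
                return rpc(*args)
            except ConnectionError:   # stand-in for "server crashed mid-call"
                time.sleep(delay)     # wait for the server to reboot, then retry
        raise RuntimeError("server did not recover in time")

    # Under at-most-once semantics, by contrast, the client must NOT retry
    # blindly, since the call might already have been carried out once.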
4.3 MESSAGE-ORIENTED COMMUNICATION
Many distributed systems and applications are built directly on top of the simple message-oriented model offered by the transport layer. To better understand and appreciate the message-oriented systems as part of middleware solutions, we first discuss messaging through transport-level sockets.
Berkeley Sockets
Special attention has been paid to standardizing the interface of the transport layer to allow programmers to make use of its entire suite of (messaging) protocols through a simple set of primitives. Also, standard interfaces make it easier to port an application to a different machine.

As an example, we briefly discuss the sockets interface as introduced in the 1970s in Berkeley UNIX. Another important interface is XTI, which stands for the X/Open Transport Interface, formerly called the Transport Layer Interface (TLI), and developed by AT&T. Sockets and XTI are very similar in their model of network programming, but differ in their set of primitives.

Conceptually, a socket is a communication end point to which an application can write data that are to be sent out over the underlying network, and from which incoming data can be read. A socket forms an abstraction over the actual communication end point that is used by the local operating system for a specific transport protocol. In the following text, we concentrate on the socket primitives for TCP, which are shown in Fig. 4-14.

Servers generally execute the first four primitives, normally in the order given. When calling the socket primitive, the caller creates a new communication end point for a specific transport protocol. Internally, creating a communication end point means that the local operating system reserves resources to accommodate sending and receiving messages for the specified protocol.

The bind primitive associates a local address with the newly-created socket. For example, a server should bind the IP address of its machine together with a (possibly well-known) port number to a socket. Binding tells the operating system that the server wants to receive messages only on the specified address and port.
Primitive   Meaning
Socket      Create a new communication end point
Bind        Attach a local address to a socket
Listen      Announce willingness to accept connections
Accept      Block caller until a connection request arrives
Connect     Actively attempt to establish a connection
Send        Send some data over the connection
Receive     Receive some data over the connection
Close       Release the connection

Figure 4-14. The socket primitives for TCP/IP.
Figure 4-15. Connection-oriented communication pattern using sockets.
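The pattern of Fig. 4-15 translates almost one-to-one into code. The Python sketch below is our own minimal example; it runs server and client in one process for brevity, and the loopback address and port number are arbitrary choices.

    import socket
    import threading

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
    s.bind(("127.0.0.1", 9090))                             # bind
    s.listen(1)                                             # listen

    def serve():
        conn, _ = s.accept()                                # accept (blocks)
        conn.sendall(conn.recv(1024))                       # receive, then send (echo)
        conn.close()                                        # close

    t = threading.Thread(target=serve)
    t.start()

    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket (client side)
    c.connect(("127.0.0.1", 9090))                          # connect (synchronization point)
    c.sendall(b"hello")                                     # send
    print(c.recv(1024))                                     # receive -> b'hello'
    c.close()                                               # close
    t.join()
    s.close()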
Primitive      Meaning
MPI_bsend      Append outgoing message to a local send buffer
MPI_send       Send a message and wait until copied to local or remote buffer
MPI_ssend      Send a message and wait until receipt starts
MPI_sendrecv   Send a message and wait for reply
MPI_isend      Pass reference to outgoing message, and continue
MPI_issend     Pass reference to outgoing message, and wait until receipt starts
MPI_recv       Receive a message; block if there is none
MPI_irecv      Check if there is an incoming message, but do not block

Figure 4-16. Some of the most intuitive message-passing primitives of MPI.
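As a taste of these primitives, here is a small sketch using the Python MPI binding mpi4py, assuming it and an MPI runtime are installed; it needs two processes, e.g., mpiexec -n 2 python demo.py.

    from mpi4py import MPI

    comm = MPI.COMm_WORLD if False else MPI.COMM_WORLD  # the standard world communicator
    if comm.Get_rank() == 0:
        req = comm.isend({"sample": 42}, dest=1, tag=0)  # like MPI_isend: hand over a
        # ... rank 0 can keep computing here ...          # reference and continue
        req.wait()                                        # outgoing data safely handed off
    else:
        print(comm.recv(source=0, tag=0))                 # like MPI_recv: block if none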
We now come to an important class of message-oriented middleware services, generally known as message-queuing systems, or just Message-Oriented Middleware (MOM). Message-queuing systems provide extensive support for persistent asynchronous communication. The essence of these systems is that they offer intermediate-term storage capacity for messages, without requiring either the sender or receiver to be active during message transmission. An important difference with Berkeley sockets and MPI is that message-queuing systems are typically targeted to support message transfers that are allowed to take minutes instead of seconds or milliseconds. We first explain a general approach to message-queuing systems, and conclude this section by comparing them to more traditional systems, notably the Internet e-mail systems.
Message-Queuing Model
Primitive   Meaning
Put         Append a message to a specified queue
Get         Block until the specified queue is nonempty, and remove the first message
Poll        Check a specified queue for messages, and remove the first. Never block
Notify      Install a handler to be called when a message is put into the specified queue

Figure 4-18. Basic interface to a queue in a message-queuing system.
The put primitive is called by senders to pass a message to the underlying system so that it is appended to the specified queue; it is a nonblocking call. The get primitive is a blocking call by which an authorized process can remove the longest pending message in the specified queue. The process is blocked only if the queue is empty. Variations on this call allow searching for a specific message in the queue, for example, using a priority, or a matching pattern. The nonblocking variant is given by the poll primitive. If the queue is empty, or if a specific message could not be found, the calling process simply continues.

Finally, most queuing systems also allow a process to install a handler as a callback function, which is automatically invoked whenever a message is put into the queue. Callbacks can also be used to automatically start a process that will fetch messages from the queue if no process is currently executing. This approach is often implemented by means of a daemon on the receiver's side that continuously monitors the queue for incoming messages and handles them accordingly.
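The four primitives are easy to mimic with a thread-safe queue. The toy Python class below is our own illustration of their blocking behavior; the names are invented and it stands for no real MOM product.

    import queue
    import threading

    class ToyQueue:
        def __init__(self):
            self._q = queue.Queue()
            self._handler = None

        def put(self, msg):                      # Put: nonblocking append
            self._q.put(msg)
            if self._handler:                    # Notify: fire installed callback
                threading.Thread(target=self._handler, args=(self,)).start()

        def get(self):                           # Get: block until nonempty
            return self._q.get()

        def poll(self):                          # Poll: never block
            try:
                return self._q.get_nowait()
            except queue.Empty:
                return None

        def notify(self, handler):               # Notify: install a handler
            self._handler = handler

    q = ToyQueue()
    q.notify(lambda q: print("got:", q.poll()))
    q.put("hello")                               # triggers the callback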
Let us now take a closer look at what a general message-queuing system looks like. One of the first restrictions that we make is that messages can be put only into queues that are local to the sender, that is, queues on the same machine, or no worse than on a machine nearby such as on the same LAN that can be efficiently reached through an RPC. Such a queue is called the source queue. Likewise, messages can be read only from local queues. However, a message put into a queue will contain the specification of a destination queue to which it should be transferred. It is the responsibility of a message-queuing system to provide queues to senders and receivers and take care that messages are transferred from their source to their destination queue.

It is important to realize that the collection of queues is distributed across multiple machines. Consequently, for a message-queuing system to transfer messages, it should maintain a mapping of queues to network locations. In practice, this means that it should maintain a (possibly distributed) database of queue names to network locations, as shown in Fig. 4-19. Note that such a mapping is completely analogous to the use of the Domain Name System (DNS) for e-mail in the Internet. For example, when sending a message to the logical mail address steen@cs.vu.nl, the mailing system will query DNS to find the network (i.e., IP) address of the recipient's mail server to use for the actual message transfer.
Another reason why relays are used is that they allow for secondary processing of messages. For example, messages may need to be logged for reasons of security or fault tolerance. A special form of relay that we discuss in the next section is one that acts as a gateway, transforming messages into a format that can be understood by the receiver.

Finally, relays can be used for multicasting purposes. In that case, an incoming message is simply put into each send queue.
Message Brokers
systems operate. A common message format makes sense only if the collection of processes that make use of that format indeed have enough in common. If the collection of applications that make up a distributed information system is highly diverse (which it often is), then the best common format may well be no more than a sequence of bytes.

Although a few common message formats for specific application domains have been defined, the general approach is to learn to live with different formats, and try to provide the means to make conversions as simple as possible. In message-queuing systems, conversions are handled by special nodes in a queuing network, known as message brokers. A message broker acts as an application-level gateway in a message-queuing system. Its main purpose is to convert incoming messages so that they can be understood by the destination application. Note that to a message-queuing system, a message broker is just another application, as shown in Fig. 4-21. In other words, a message broker is generally not considered to be an integral part of the queuing system.
Figure 4-21. The general organization of a message broker in a message-queuing system, with a repository of conversion rules and programs between source client and destination client.
A message broker can be as simple as a reformatter for messages. For example, assume an incoming message contains a table from a database, in which records are separated by a special end-of-record delimiter and fields within a record have a known, fixed length. If the destination application expects a different delimiter between records, and also expects that fields have variable lengths, a message broker can be used to convert messages to the format expected by the destination.
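Such a reformatter is little more than a loop over records. The Python sketch below is a hypothetical example: the field widths and delimiters are made up purely for illustration.

    FIELD_WIDTHS = [8, 12, 6]   # assumed fixed-length layout of source records
    EOR = b"\x1e"               # assumed end-of-record delimiter

    def convert(message: bytes) -> bytes:
        out = []
        for record in message.split(EOR):
            if not record:
                continue
            fields, pos = [], 0
            for width in FIELD_WIDTHS:              # slice the fixed-size fields
                fields.append(record[pos:pos + width].strip())
                pos += width
            out.append(b"|".join(fields))           # variable-length fields
        return b"\n".join(out)                      # different record delimiter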
In a more advanced setting, a message broker may act as an application-level gateway, such as one that handles the conversion between two different database applications. In such cases, frequently it cannot be guaranteed that all information contained in the incoming message can actually be transformed into something appropriate for the outgoing message.

However, more common is the use of a message broker for advanced enterprise application integration (EAI) as we discussed in Chap. 1. In this case, rather than (only) converting messages, a broker is responsible for matching applications based on the messages that are being exchanged. In such a model, called publish/subscribe, applications send messages in the form of publishing. In particular, they may publish a message on topic X, which is then sent to the broker. Applications that have stated their interest in messages on topic X, that is, who have subscribed to those messages, will then receive these messages from the broker. More advanced forms of mediation are also possible, but we will defer further discussion until Chap. 13.

At the heart of a message broker lies a repository of rules and programs that can transform a message of type T1 to one of type T2. The problem is defining the rules and developing the programs. Most message broker products come with sophisticated development tools, but the bottom line is still that the repository needs to be filled by experts. Here we see a perfect example where commercial products are often misleadingly said to provide "intelligence," where, in fact, the only intelligence is to be found in the heads of those experts.
4.3.3 Example: IBM's WebSphere Message-Queuing System

Overview
Queue managers can be linked into the same process as the application for which they manage the queues. In that case, the queues are hidden from the application behind a standard interface, but effectively can be directly manipulated by the application. An alternative organization is one in which queue managers and applications run on separate machines. In that case, the application is offered the same interface as when the queue manager is colocated on the same machine. However, the interface is implemented as a proxy that communicates with the queue manager using traditional RPC-based synchronous communication. In this way, MQ basically retains the model that only queues local to an application can be accessed.
Channels
Attribute           Description
Transport type      Determines the transport protocol to be used
FIFO delivery       Indicates that messages are to be delivered in the order they are sent
Message length      Maximum length of a single message
Setup retry count   Specifies maximum number of retries to start up the remote MCA
Delivery retries    Maximum times MCA will try to put received message into queue

Figure 4-22. Some attributes associated with message channel agents.
Message Transfer
Figure 4-23. The general organization of an IBM WebSphere message-queuing network; each queue manager's routing table maps destination queue managers (here QMA, QMB, and QMD) to local send queues such as SQ1.
The principle of using routing tables and aliases is shown in Fig. 4-24. For example, an application linked to queue manager QMA can refer to a remote queue manager using the local alias LA1. The queue manager will first look up the actual destination in the alias table to find it is queue manager QMC. The route to QMC is found in the routing table, which states that messages for QMC should be appended to the outgoing queue SQ1, which is used to transfer messages to queue manager QMB. The latter will use its routing table to forward the message to QMC.
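The alias-plus-routing lookup is easy to mimic. The Python sketch below follows the names of the example (alias LA1 for QMC, send queue SQ1 via QMB); the tables and the helper function are illustrative assumptions, not WebSphere's actual data structures.

    # Queue manager QMA's view of the network.
    alias_table = {"LA1": "QMC"}              # local alias -> destination queue manager
    routing_table = {"QMB": "SQ1",            # destination queue manager -> send queue
                     "QMC": "SQ1",            # QMC is reached via QMB
                     "QMD": "SQ1"}

    def send(dest, message, send_queues):
        qm = alias_table.get(dest, dest)       # resolve the alias first
        sq = routing_table[qm]                 # then look up the send queue
        send_queues[sq].append((qm, message))  # an MCA forwards it hop by hop

    send_queues = {"SQ1": []}
    send("LA1", "hello QMC", send_queues)      # ends up in SQ1, relayed via QMB
    print(send_queues)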
Following this approach of routing and aliasing leads to a programming interface that, fundamentally, is relatively simple, called the Message Queue Interface (MQI). The most important primitives of MQI are summarized in Fig. 4-25.
Primitive   Description
MQopen      Open a (possibly remote) queue
MQclose     Close a queue
MQput       Put a message into an opened queue
MQget       Get a message from a (local) queue

Figure 4-25. The most important primitives of MQI.
From the description so far, it should be clear that an important part of managing MQ systems is connecting the various queue managers into a consistent overlay network. Moreover, this network needs to be maintained over time. For small networks, this maintenance will not require much more than average administrative work, but matters become complicated when message queuing is used to integrate and disintegrate large existing systems.
4.4 STREAM-ORIENTED COMMUNICATION
Data Stream
can be transferred as a data stream, but it is mostly irrelevant exactly when the transfer of each item completes.

In synchronous transmission mode, there is a maximum end-to-end delay defined for each unit in a data stream. Whether a data unit is transferred much faster than the maximum tolerated delay is not important. For example, a sensor may sample temperature at a certain rate and pass it through a network to an operator. In that case, it may be important that the end-to-end propagation time through the network is guaranteed to be lower than the time interval between taking samples, but it cannot do any harm if samples are propagated much faster than necessary.

Finally, in isochronous transmission mode, it is necessary that data units are transferred on time. This means that data transfer is subject to a maximum and minimum end-to-end delay, also referred to as bounded (delay) jitter. Isochronous transmission mode is particularly interesting for distributed multimedia systems, as it plays a crucial role in representing audio and video. In this chapter, we consider only continuous data streams using isochronous transmission, which we will refer to simply as streams.

Streams can be simple or complex. A simple stream consists of only a single sequence of data, whereas a complex stream consists of several related simple streams, called substreams. The relation between the substreams in a complex stream is often also time dependent. For example, stereo audio can be transmitted by means of a complex stream consisting of two substreams, each used for a single audio channel. It is important, however, that those two substreams are continuously synchronized. In other words, data units from each stream are to be communicated pairwise to ensure the effect of stereo. Another example of a complex stream is one for transmitting a movie. Such a stream could consist of a single video stream, along with two streams for transmitting the sound of the movie in stereo. A fourth stream might contain subtitles for the deaf, or a translation into a different language than the audio. Again, synchronization of the substreams is important. If synchronization fails, reproduction of the movie fails. We return to stream synchronization below.

From a distributed systems perspective, we can distinguish several elements that are needed for supporting streams. For simplicity, we concentrate on streaming stored data, as opposed to streaming live data. In the latter case, data is captured in real time and sent over the network to recipients. The main difference between the two is that streaming live data leaves less opportunities for tuning a stream. Following Wu et al. (2001), we can then sketch a general client-server architecture for supporting continuous multimedia streams as shown in Fig. 4-26.

This general architecture reveals a number of important issues that need to be dealt with. In the first place, the multimedia data, notably video and to a lesser extent audio, will need to be compressed substantially in order to reduce the required storage and especially the network capacity. More important from the perspective of communication are controlling the quality of the transmission and synchronization issues. We discuss these issues next.
Figure 4-26. A general architecture for streaming stored multimedia data over a network.
2. The maximum delay until a session has been set up (i.e., when an application can start sending data).

3. The maximum end-to-end delay (i.e., how long it will take until a data unit makes it to a recipient).
Enforcing QoS
Given that the underlying system offers only a best-effort delivery service, a distributed system can try to conceal as much as possible of the lack of quality of service. Fortunately, there are several mechanisms that it can deploy.

First, the situation is not really so bad as sketched so far. For example, the Internet provides a means for differentiating classes of data by means of its differentiated services. A sending host can essentially mark outgoing packets as belonging to one of several classes, including an expedited forwarding class that essentially specifies that a packet should be forwarded by the current router with absolute priority (Davie et al., 2002). In addition, there is also an assured forwarding class, by which traffic is divided into four subclasses, along with three ways to drop packets if the network gets congested. Assured forwarding therefore effectively defines a range of priorities that can be assigned to packets, and as such allows applications to differentiate time-sensitive packets from noncritical ones.

Besides these network-level solutions, a distributed system can also help in getting data across to receivers. Although there are generally not many tools available, one that is particularly useful is to use buffers to reduce jitter. The principle is simple, as shown in Fig. 4-27. Assuming that packets are delayed with a certain variance when transmitted over the network, the receiver simply stores them in a buffer for a maximum amount of time. This will allow the receiver to pass packets to the application at a regular rate, knowing that there will always be enough packets entering the buffer to be played back at that rate.
which time the buffer will have been completely emptied. The result is a gap in the playback at the application. The only solution is to increase the buffer size. The obvious drawback is that the delay at which the receiving application can start playing back the data contained in the packets increases as well.
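A playout buffer is simple to model. In the Python sketch below, packet i is due for playback at i/rate plus the buffer delay; the arrival times and the delay value are made up for illustration.

    def playout(arrivals, buffer_delay, rate=1.0):
        played, gaps = [], []
        for i, arrival in enumerate(arrivals):   # arrival times, in order sent
            deadline = i / rate + buffer_delay   # when packet i must be played
            if arrival <= deadline:
                played.append(i)
            else:
                gaps.append(i)                   # arrived too late: playback gap
        return played, gaps

    # A 2.5-unit buffer hides most of the jitter; packet 2 still arrives too late.
    print(playout([0.8, 1.9, 4.9, 3.7, 5.2], buffer_delay=2.5))
    # -> ([0, 1, 3, 4], [2])

A larger buffer_delay removes the gap, at the price of a later playback start, exactly the trade-off described above.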
Other techniques can be used as well. Realizing that we are dealing with an underlying best-effort service also means that packets may be lost. To compensate for this loss in quality of service, we need to apply error correction techniques (Perkins et al., 1998; and Wah et al., 2000). Requesting the sender to retransmit a missing packet is generally out of the question, so that forward error correction (FEC) needs to be applied. A well-known technique is to encode the outgoing packets in such a way that any k out of n received packets is enough to reconstruct k correct packets.
One problem that may occur is that a single packet contains multiple audio and video frames. As a consequence, when a packet is lost, the receiver may actually perceive a large gap when playing out frames. This effect can be somewhat circumvented by interleaving frames, as shown in Fig. 4-28. In this way, when a packet is lost, the resulting gap in successive frames is distributed over time. Note, however, that this approach does require a larger receive buffer in comparison to noninterleaving, and thus imposes a higher start delay for the receiving application. For example, when considering Fig. 4-28(b), to play the first four frames, the receiver will need to have four packets delivered, instead of only one packet in comparison to noninterleaved transmission.
Figure 4-28. The effect of packet loss in (a) noninterleaved transmission and (b) interleaved transmission.
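Interleaving amounts to a simple reindexing of frames into packets, as the following sketch shows for the 16 frames and 4 packets of Fig. 4-28; the helper name is our own.

    def packetize(frames, per_packet=4, interleave=True):
        n = len(frames) // per_packet              # number of packets
        if interleave:
            # Packet j carries frames j, j+n, j+2n, ...: losing one packet
            # loses isolated frames rather than a consecutive run.
            return [frames[j::n] for j in range(n)]
        return [frames[k * per_packet:(k + 1) * per_packet] for k in range(n)]

    frames = list(range(1, 17))
    print(packetize(frames, interleave=False)[0])  # [1, 2, 3, 4]: a gap of 4 frames
    print(packetize(frames)[0])                    # [1, 5, 9, 13]: gaps spread out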
Synchronization Mechanisms
Figure 4-29. The principle of explicit synchronization on the level of data units.
For example, consider a movie that is presented as two input streams. The video stream contains uncompressed low-quality images of 320x240 pixels, each encoded by a single byte, leading to video data units of 76,800 bytes each. Assume that images are to be displayed at 30 Hz, or one image every 33 msec. The audio stream is assumed to contain audio samples grouped into units of 11,760 bytes, each corresponding to 33 ms of audio, as explained above. If the input process can handle 2.5 MB/sec, we can achieve lip synchronization by simply alternating between reading an image and reading a block of audio samples every 33 ms.

The drawback of this approach is that the application is made completely responsible for implementing synchronization while it has only low-level facilities available. A better approach is to offer an application an interface that allows it to more easily control streams and devices. Returning to our example, assume that the video display has a control interface that allows it to specify the rate at which images should be displayed. In addition, the interface offers the facility to register a user-defined handler that is called each time k new images have arrived. An analogous interface is offered by the audio device. With these control interfaces, an application developer can write a simple monitor program consisting of two handlers, one for each stream, that jointly check if the video and audio stream are sufficiently synchronized, and if necessary, adjust the rate at which video or audio units are presented.

This last example is illustrated in Fig. 4-30, and is typical for many multimedia middleware systems. In effect, multimedia middleware offers a collection of interfaces for controlling audio and video streams, including interfaces for controlling devices such as monitors, cameras, microphones, etc. Each device and
stream has its own high-level interfaces, including interfaces for notifying an application when some event occurred. The latter are subsequently used to write handlers for synchronizing streams. Examples of such interfaces are given in Blair and Stefani (1998).
Figure 4-30. The principle of synchronization as supported by high-level interfaces.
better approach is to merge the two substreams at the sender. The resulting stream consists of data units consisting of pairs of samples, one for each channel. The receiver now merely has to read in a data unit, and split it into a left and right sample. Delays for both channels are now identical.
4.5 MULTICAST COMMUNICATION
Overlay Construction
From the high-level description given above, it should be clear that although
building a tree by itself is not that difficult once we have organized the nodes into
an overlay, building an efficient tree may be a different story. Note that in our
description so far, the selection of nodes that participate in the tree does not take
into account any performance metrics: it is purely based on the (logical) routing of
messages through the overlay.
Figure 4-31. The relation between links in an overlay and actual network-level routes.
To understand the problem at hand, take a look at Fig. 4-31 which shows a
small set of four nodes that are organized in a simple overlay network, with node
A forming the root of a multicast tree. The costs for traversing a physical link are
also shown. Now, whenever A multicasts a message to the other nodes, it is seen
that this message will traverse each of the links <B, Rb>, <Ra, Rb>, <Rc, Rd>, and <D, Rd> twice. The overlay network would have been more efficient if we had not constructed an overlay link from B to D, but instead from A to C. Such a configuration would have saved the double traversal across links <Ra, Rb> and <Rc, Rd>.
The quality of an application-level multicast tree is generally measured by
three different metrics: link stress, stretch, and tree cost. Link stress is defined
per link and counts how often a packet crosses the same link (Chu et al., 2002). A
link stress greater than 1 comes from the fact that although at a logical level a
packet may be forwarded along two different connections, part of those connec
tions may actually correspond to the same physical link, as we showed in Fig. 4-
31.
The stretch or Relative Delay Penalty (RDP) measures the ratio in the delay between two nodes in the overlay, and the delay that those two nodes would experience in the underlying network. For example, in the overlay network, messages from B to C follow the route B -> Rb -> Ra -> Rc -> C, having a total cost of 59 units. However, messages would have been routed in the underlying network along the path B -> Rb -> Rd -> Rc -> C, with a total cost of 47 units, leading to a stretch of 59/47 = 1.255. Obviously, when constructing an overlay network, the goal is to minimize the aggregated stretch, or similarly, the average RDP measured over all node pairs.
Finally, the tree cost is a global metric, generally related to minimizing the aggregated link costs. For example, if the cost of a link is taken to be the delay between its two end nodes, then optimizing the tree cost boils down to finding a minimal spanning tree in which the total time for disseminating information to all nodes is minimal.
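The first two metrics are straightforward to compute once routes are known. The Python sketch below reproduces the stretch of the B-to-C example from Fig. 4-31 and counts link stress for two assumed physical routes; the routes in the second example are illustrative, not taken from the figure.

    def stretch(overlay_cost, network_cost):
        return overlay_cost / network_cost         # Relative Delay Penalty

    # B -> Rb -> Ra -> Rc -> C costs 59 in the overlay; the underlying
    # network path B -> Rb -> Rd -> Rc -> C costs 47.
    print(round(stretch(59, 47), 3))               # 1.255

    def link_stress(overlay_routes):
        stress = {}
        for route in overlay_routes:               # one physical route per overlay hop
            for link in route:
                stress[link] = stress.get(link, 0) + 1
        return stress                              # >1: a link is crossed repeatedly

    print(link_stress([[("A", "Ra"), ("Ra", "Rb"), ("Rb", "B")],
                       [("A", "Ra"), ("Ra", "Rb"), ("Rb", "Rd"), ("Rd", "D")]]))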
To simplify matters somewhat, assume that a multicast group has an associ
ated and well-known node that keeps track of the nodes that have joined the tree.
When a new node issues a join request, it contacts this rendezvous node to obtain
a (potentially partial) list of members. The goal is to select the best member that
can operate as the new node's parent in the tree. Who should it select? There are
many alternatives and different proposals often follow very different solutions.
Consider, for example, a multicast group with only a single source. In this
case, the selection of the best node is obvious: it should be the source (because in
that case we can be assured that the stretch will be equal to 1). However, in doing
so, we would introduce a star topology with the source in the middle. Although
simple, it is not difficult to imagine the source may easily become overloaded. In
other words, selection of a node will generally be constrained in such a way that
only those nodes may be chosen who have k or less neighbors, with k being a
design parameter. This constraint severely complicates the tree-establishment al
gorithm, as a good solution may require that part of the existing tree is reconfig
ured.
Tan et al. (2003) provide an extensive overview and evaluation of various
solutions to this problem. As an illustration, let us take a closer look at one specif
ic family, known as switch-trees (Helder and Jamin, 2002). The basic idea is
simple. Assume we already have a multicast tree with a single source as root. In
this tree, a node P can switch parents by dropping the link to its current parent in
favor of a link to another node. The only constraints imposed on switching links is
that the new parent can never be a member of the subtree rooted at P (as this
would partition the tree and create a loop), and that the new parent will not have
too many immediate children. The latter is needed to limit the load of forwarding
messages by any single node.
There are different criteria for deciding to switch parents. A simple one is to
optimize the route to the source, effectively minimizing the delay when a message
is to be multicast. To this end, each node regularly receives information on other
nodes (we will explain one specific way of doing this below). At that point, the
node can evaluate whether another node would be a better parent in terms of delay
along the route to the source, and if so, initiates a switch.
Another criterion could be whether the delay to the potential other parent is
lower than to the current parent. If every node takes this as a criterion, then the
aggregated delays of the resulting tree should ideally be minimal. In other words,
this is an example of optimizing the cost of the tree as we explained above. How
ever, more information would be needed to construct such a tree, but as it turns
out, this simple scheme is a reasonable heuristic leading to a good approximation
of a minimal spanning tree.
As an example, consider the case where a node P receives information on the neighbors of its parent. Note that the neighbors consist of P's grandparent, along
with the other siblings of P's parent. Node P can then evaluate the delays to each
of these nodes and subsequently choose the one with the lowest delay, say Q, as its new parent. To that end, it sends a switch request to Q. To prevent loops from
being formed due to concurrent switching requests, a node that has an outstanding
switch request will simply refuse to process any incoming requests. In effect, this
leads to a situation where only completely independent switches can be carried
out simultaneously. Furthermore, P will provide Q with enough information to
allow the latter to conclude that both nodes have the same parent, or that Q is the
grandparent.
An important problem that we have not yet addressed is node failure. In the
case of switch-trees, a simple solution is proposed: whenever a node notices that
its parent has failed, it simply attaches itself to the root. At that point, the optimi
zation protocol can proceed as usual and will eventually place the node at a good
point in the multicast tree. Experiments described in Helder and Jamin (2002)
show that the resulting tree is indeed close to a minimal spanning one.
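The switch operation itself is small. The Python sketch below enforces the two constraints just described; the Node class and the max_children bound (playing the role of k) are our own scaffolding.

    class Node:
        def __init__(self, name, parent=None, max_children=3):
            self.name, self.parent = name, parent
            self.children, self.max_children = set(), max_children

    def in_subtree(root, node):
        return node is root or any(in_subtree(c, node) for c in root.children)

    def switch_parent(p, q):
        # Refuse if q lies in p's subtree (would partition the tree and create
        # a loop) or if q already has too many immediate children.
        if in_subtree(p, q) or len(q.children) >= q.max_children:
            return False
        if p.parent is not None:
            p.parent.children.discard(p)
        p.parent = q
        q.children.add(p)
        return True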
As the name suggests, epidemic algorithms are based on the theory of epi
demics, which studies the spreading of infectious diseases. In the case of large-
scale distributed systems, instead of spreading diseases, they spread information.
Research on epidemics for distributed systems also aims at a completely different
goal: whereas health organizations will do their utmost best to prevent infectious
diseases from spreading across large groups of people, designers of epidemic al
gorithms for distributed systems will try to “ infect”all nodes with new informa
tion as fast as possible.
Using the terminology from epidemics, a node that is part of a distributed sys
tem is called infected if it holds data that it is willing to spread to other nodes. A
node that has not yet seen this data is called susceptible. Finally, an updated
node that is not willing or able to spread its data is said to be removed. Note that
we assume we can distinguish old from new data, for example, because it has
been timestamped or versioned. In this light, nodes are also said to spread updates.
A popular propagation model is that of anti-entropy. In this model, a node P picks another node Q at random, and subsequently exchanges updates with Q. There are three approaches to exchanging updates:

1. P only pushes its own updates to Q.

2. P only pulls in new updates from Q.

3. P and Q send updates to each other (a push-pull approach).
When it comes to rapidly spreading updates, only pushing updates turns out to
be a bad choice. Intuitively, this can be understood as follows. First, note that in a
pure push-based approach, updates can be propagated only by infected nodes.
However, if many nodes are infected, the probability of each one selecting a sus
ceptible node is relatively small. Consequently, chances are that a particular node
remains susceptible for a long period simply because it is not selected by an
infected node.
In contrast, the pull-based approach works much better when many nodes are
infected. In that case, spreading updates is essentially triggered by susceptible
nodes. Chances are large that such a node will contact an infected one to subse
quently pull in the updates and become infected as well.
It can be shown that if only a single node is infected, updates will rapidly
spread across all nodes using either form of anti-entropy, although push-pull
remains the best strategy (Jelasity et al., 2005a). Define a round as spanning a
period in which every node will at least once have taken the initiative to exchange
updates with a randomly chosen other node. It can then be shown that the number
of rounds to propagate a single update to all nodes takes O(log N), where N is
the number of nodes in the system. This indicates indeed that propagating updates
is fast, but above all scalable.
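A round-based simulation shows the logarithmic behavior directly. In this sketch, every node contacts one uniformly random peer per round; the node count and seed are arbitrary choices.

    import math
    import random

    def anti_entropy_rounds(n, mode="pushpull"):
        infected, rounds = {0}, 0
        while len(infected) < n:
            rounds += 1
            new = set(infected)
            for p in range(n):
                q = random.randrange(n)             # pick a random partner
                if mode in ("push", "pushpull") and p in infected:
                    new.add(q)                      # push the update to q
                if mode in ("pull", "pushpull") and q in infected:
                    new.add(p)                      # pull the update from q
            infected = new
        return rounds

    random.seed(1)
    print(anti_entropy_rounds(10_000), "rounds; log2(10,000) is about",
          round(math.log2(10_000), 1))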
One specific variant of this approach is called rumor spreading, or simply
gossiping. It works as follows. If node P has just been updated for data item x, it
contacts an arbitrary other node Q and tries to push the update to Q . However, it is
possible that Q was already updated by another node. In that case, P may lose
interest in spreading the update any further, say with probability 1/k. In other
words, it then becomes removed.
Gossiping is completely analogous to real life. When Bob has some hot news
to spread around, he may phone his friend Alice telling her all about it. Alice, like
Bob, will be really excited to spread the gossip to her friends as well. However,
she will become disappointed when phoning a friend, say Chuck, only to hear that
the news has already reached him. Chances are that she will stop phoning other
friends, for what good is it if they already know?
Gossiping turns out to be an excellent way of rapidly spreading news. How
ever, it cannot guarantee that all nodes will actually be updated (Demers et al.,
1987). It can be shown that when there is a large number of nodes that participate
in the epidemics, the fraction s of nodes that will remain ignorant of an update,
that is, remain susceptible, satisfies the equation:
s = e^(-(k+1)(1-s))
Figure 4-32. The relation between the fraction s of update-ignorant nodes and the parameter k in pure gossiping. The graph displays ln(s) as a function of k.
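Because the equation defines s only implicitly, a small fixed-point iteration is a handy way to evaluate it. This numerical sketch is our own addition; the starting value and iteration count are arbitrary.

    import math

    def ignorant_fraction(k, iterations=100):
        s = 0.5                                   # any start in (0, 1) works here
        for _ in range(iterations):
            s = math.exp(-(k + 1) * (1 - s))      # iterate s = e^(-(k+1)(1-s))
        return s

    for k in (1, 2, 4, 8):
        print(k, round(ignorant_fraction(k), 6))  # s drops rapidly as k grows

For k = 1 this yields s of roughly 0.20, and already for k = 4 the fraction of nodes missing the update falls below one percent.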
Removing Data
Applications
exchange. Every node i initially chooses an arbitrary number, say x_i. When node i contacts node j, they each update their value as:

x_i, x_j <- (x_i + x_j)/2

Obviously, after this exchange, both i and j will have the same value. In fact, it is not difficult to see that eventually all nodes will have the same value, namely the average of all initial values. Propagation speed is again exponential.

What use does computing the average have? Consider the situation that all nodes i have set x_i to zero, except for x_1, which has set it to 1. If there are N nodes, then eventually each node will compute the average, which is 1/N. As a consequence, every node i can estimate the size of the system as being 1/x_i. This information alone can be used to dynamically adjust various system parameters. For example, the size of the partial view (i.e., the number of neighbors that each node keeps track of) should be dependent on the total number of participating nodes. Knowing this number will allow a node to dynamically adjust the size of its partial view. Indeed, this can be viewed as a property of self-management.

Computing the average may prove to be difficult when nodes regularly join and leave the system. One practical solution to this problem is to introduce epochs. Assuming that node 1 is stable, it simply starts a new epoch now and then. When node i sees a new epoch for the first time, it resets its own variable x_i to zero and starts computing the average again.

Of course, other results can also be computed. For example, instead of having a fixed node (x_1) start the computation of the average, we can easily pick a random node as follows. Every node i initially sets x_i to a random number from the same interval, say [0,1], and also stores it permanently as m_i. Upon an exchange between nodes i and j, each changes its value to:

x_i, x_j <- max(x_i, x_j)

Each node i for which m_i < x_i will lose the competition for being the initiator in starting the computation of the average. In the end, there will be a single winner. Of course, although it is easy to conclude that a node has lost, it is much more difficult to decide that it has won, as it remains uncertain whether all results have come in. The solution to this problem is to be optimistic: a node always assumes it is the winner until proven otherwise. At that point, it simply resets the variable it is using for computing the average to zero. Note that by now, several different computations (in our example computing a maximum and computing an average) may be executing concurrently.
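The size-estimation trick can be checked with a few lines of simulation. The pairwise averaging below runs in shared memory purely for illustration; real nodes would of course gossip their values over a network.

    import random

    def gossip_average(values, rounds=50):
        x = list(values)
        for _ in range(rounds):
            for i in range(len(x)):
                j = random.randrange(len(x))
                x[i] = x[j] = (x[i] + x[j]) / 2   # pairwise averaging step
        return x

    random.seed(1)
    n = 100
    x = [0.0] * n
    x[0] = 1.0                                    # node 1 starts with 1, rest with 0
    print(round(1 / gossip_average(x)[42]))       # ~n: any node can estimate the size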
4.6 SUMMARY
PROBLEMS
1. In many layered protocols, each layer has its own header. Surely it would be more
efficient to have a single header at the front of each message with all the control in it
than all these separate headers. Why is this not done?
2. Why are transport-level communication services often inappropriate for building dis
tributed applications?
3. A reliable multicast service allows a sender to reliably pass messages to a collection of
receivers. Does such a service belong to a middleware layer, or should it be part of a
lower-level layer?
4. Consider a procedure incr with two integer parameters. The procedure adds one to
each parameter. Now suppose that it is called with the same variable twice, for ex
ample, as incr(i, i). If i is initially 0, what value will it have afterward if call-by-refer-
ence is used? How about if copy/restore is used?
5. C has a construction called a union, in which a field of a record (called a struct in C)
can hold any one of several alternatives. At run time, there is no sure-fire way to tell
which one is in there. Does this feature of C have any implications for remote proce
dure call? Explain your answer.
6. One way to handle parameter conversion in RPC systems is to have each machine
send parameters in its native representation, with the other one doing the translation, if
need be. The native system could be indicated by a code in the first byte. However,
since locating the first byte in the first word is precisely the problem, can this work?
7. Assume a client calls an asynchronous RPC to a server, and subsequently waits until
the server returns a result using another asynchronous RPC. Is this approach the same
as letting the client execute a normal RPC? What if we replace the asynchronous
RPCs with synchronous RPCs?
8. Instead of letting a server register itself with a daemon as in DCE, we could also
choose to always assign it the same end point. That end point can then be used in ref
erences to objects in the server’s address space. What is the main drawback of this
scheme?
9. Would it be useful also to make a distinction between static and dynamic RPCs?
10. Describe how connectionless communication between a client and a server proceeds
when using sockets.
11. Explain the difference between the primitives MPI_bsend and MPI_isend in MPI.
12. Suppose that you could make use of only transient asynchronous communication
primitives, including only an asynchronous receive primitive. How would you imple
ment primitives for transient synchronous communication?
13. Suppose that you could make use of only transient synchronous communication primi
tives. How would you implement primitives for transient asynchronous communica
tion?
15. In the text we stated that in order to automatically start a process to fetch messages
from an input queue, a daemon is often used that monitors the input queue. Give an
alternative implementation that does not make use of a daemon.
16. Routing tables in IBM WebSphere, and in many other message-queuing systems, are
configured manually. Describe a simple way to do this automatically.
17. With persistent communication, a receiver generally has its own local buffer where
messages can be stored when the receiver is not executing. To create such a buffer, we
may need to specify its size. Give an argument why this is preferable, as well as one
against specification of the size.
18. Explain why transient synchronous communication has inherent scalability problems,
and how these could be solved.
19. Give an example where multicasting is also useful for discrete data streams.
20. Suppose that in a sensor network measured temperatures are not timestamped by the
sensor, but are immediately sent to the operator. Would it be enough to guarantee only
a maximum end-to-end delay?
21. How could you guarantee a maximum end-to-end delay when a collection of com
puters is organized in a (logical or physical) ring?
22. How could you guarantee a minimum end-to-end delay when a collection of com
puters is organized in a (logical or physical) ring?
…tional groups across databases. Splitting data within functional areas across
multiple databases, or sharding, adds the second dimension to horizontal scaling.
The diagram in figure 1 illustrates horizontal data-scaling strategies.
As figure 1 illustrates, both approaches to horizontal scaling can be applied at
once. Users, products, and transactions can be in separate databases. Additionally,
each functional area can be split across multiple databases for transactional
capacity. As shown in the diagram, functional areas can be scaled independently
of one another.

FUNCTIONAL PARTITIONING
Functional partitioning is important for achieving high degrees of scalability.
Any good database architecture will decompose the schema into tables grouped
by functionality. Users, products, transactions, and communication are examples
of functional areas. Leveraging database concepts such as foreign keys is a
common approach for maintaining consistency across these functional areas.
Relying on database constraints to ensure consistency across functional groups
creates a coupling of the schema …

… are updated. Using an ACID-style transaction, the SQL would be as shown in
figure 3. The total bought and sold columns in the user table can be considered
a cache of the transaction table. It is present for efficiency of the system. Given
this, the constraint on consistency could be relaxed. The buyer and seller
expectations can be set so their running balances do not reflect the result of a
transaction immediately. This is not uncommon, and in fact people encounter
this delay between a transaction and their running balance regularly (e.g., ATM
withdrawals and cellphone calls).
How the SQL statements are modified to relax consistency depends upon how
the running balances are defined. If they are simply estimates, meaning that some
transactions can be missed, the changes are quite simple, as shown in figure 4.

FIG 4:
    Begin transaction
      Insert into transaction(id, seller_id, buyer_id, amount);
    End transaction
    Begin transaction
      Update user set amt_sold = amt_sold + $amount where id = $seller_id;
      Update user set amt_bought = amt_bought + $amount where id = $buyer_id;
    End transaction

We've now decoupled the updates to the user and transaction tables. Consistency
between the tables is not guaranteed. In fact, a failure between the first and second
transaction will result in the user table being permanently inconsistent, but if the
contract stipulates that the running totals are estimates, this may be adequate.
What if estimates are not acceptable, though? How can you still decouple the
user and transaction updates? Introducing a persistent message queue solves the
problem. There are several choices for implementing persistent messages. The
most critical factor in implementing the queue, however, is ensuring that the
backing persistence is on the same resource as the database. This is necessary
to allow the queue to be transactionally committed without involving a 2PC.
Now the SQL operations look a bit different, as shown in figure 5.

FIG 5:
    Begin transaction
      Insert into transaction(id, seller_id, buyer_id, amount);
      Queue message "update user('seller', seller_id, amount)";
      Queue message "update user('buyer', buyer_id, amount)";
    End transaction
    For each message in queue
      Begin transaction
        Dequeue message
        If message.balance == "seller"
          Update user set amt_sold = amt_sold + message.amount where id = message.id;
        Else
          Update user set amt_bought = amt_bought + message.amount where id = message.id;
        End if
      End transaction
    End for

This example takes some liberties with syntax and oversimplifies the logic to
illustrate the concept. By queuing a persistent message within the same transaction
as the insert, the information needed to update the running balances on the user
has been captured. The transaction is contained on a single database instance and
therefore will not impact system availability. A separate message-processing
component will … two transactions are necessary: one on the message queue and
one on the user database. Queue operations are not committed unless …

FIG 7:
    If transaction successful
      Remove message from queue
    End if
    End for

CONCLUSION
Scaling systems to dramatic transaction rates requires a new way of thinking
about managing resources. The traditional transactional models are problematic
when loads need to be spread across a large number of components. Decoupling
the operations and performing them in turn provides for improved availability
and scale at the cost of consistency. BASE provides a model for thinking about
this decoupling.
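The pattern of figures 5 and 7 can be sketched with any transactional store. The following Python sketch uses SQLite purely for illustration; all table and column names are invented, and because this single-database toy keeps the queue next to the user table, the dequeue and the balance update fit in one local transaction (with a physically separate queue, the two-transaction peek/remove pattern of figure 7 would apply instead):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE txn   (id INTEGER PRIMARY KEY, seller_id INT, buyer_id INT, amount INT);
        CREATE TABLE user  (id INTEGER PRIMARY KEY, amt_sold INT, amt_bought INT);
        CREATE TABLE queue (id INTEGER PRIMARY KEY, role TEXT, user_id INT, amount INT);
        INSERT INTO user VALUES (1, 0, 0), (2, 0, 0);
    """)

    # One local transaction: the insert and the queued balance updates
    # commit together, so no 2PC is involved.
    with db:
        db.execute("INSERT INTO txn (seller_id, buyer_id, amount) VALUES (1, 2, 100)")
        db.execute("INSERT INTO queue (role, user_id, amount) VALUES ('seller', 1, 100)")
        db.execute("INSERT INTO queue (role, user_id, amount) VALUES ('buyer', 2, 100)")

    # A separate processor drains the queue and applies the running balances.
    for mid, role, uid, amount in db.execute(
            "SELECT id, role, user_id, amount FROM queue").fetchall():
        with db:  # per-message transaction, as in figure 5's processing loop
            col = "amt_sold" if role == "seller" else "amt_bought"
            db.execute(f"UPDATE user SET {col} = {col} + ? WHERE id = ?", (amount, uid))
            db.execute("DELETE FROM queue WHERE id = ?", (mid,))

    print(db.execute("SELECT * FROM user").fetchall())  # [(1, 100, 0), (2, 0, 100)]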
This paper presents a design principle that helps guide placement of functions among the modules of
a distributed computer system. The principle, called the end-to-end argument, suggests that functions
placed at low levels of a system may be redundant or of little value when compared with the cost of
providing them at that low level. Examples discussed in the paper include bit-error recovery, security
using encryption, duplicate message suppression, recovery from system crashes, and delivery acknowl
edgment. Low-level mechanisms to support these functions are justified only as performance enhance
ments.
CR Categories and Subject Descriptors: C.0 [General] Computer System Organization—system
architectures; C.2.2 [Computer-Communication Networks]: Network Protocols—protocol archi
tecture; C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.7 [Operating
Systems]: Organization and Design—distributed systems
General Terms: Design
Additional Key Words and Phrases: Data communication, protocol design, design principles
1. INTRODUCTION
Choosing the proper boundaries between functions is perhaps the primary activity
of the computer system designer. Design principles that provide guidance in this
choice of function placement are among the most important tools of a system
designer. This paper discusses one class of function placement argument that
has been used for many years with neither explicit recognition nor much convic
tion. However, the emergence of the data communication network as a computer
system component has sharpened this line of function placement argument by
making more apparent the situations in which and the reasons why it applies.
This paper articulates the argument explicitly, so as to examine its nature and
to see how general it really is. The argument appeals to application requirements
and provides a rationale for moving a function upward in a layered system closer
to the application that uses the function. We begin by considering the commu
nication network version of the argument.
This is a revised version of a paper adapted from End-to-End Arguments in System Design by J. H.
Saltzer, D. P. Reed, and D. D. Clark from the 2nd International Conference on Distributed Computing
Systems (Paris, France, April 8-10, 1981), pp. 509-512. © IEEE 1981
This research was supported in part by the Advanced Research Projects Agency of the U.S.
Department of Defense and monitored by the Office of Naval Research under contract N00014-75-
C-0661.
Authors' address: J. H. Saltzer and D. D. Clark, M.I.T. Laboratory for Computer Science, 545
Technology Square, Cambridge, MA 02139. D. P. Reed, Software Arts, Inc., 27 Mica Lane, Wellesley,
MA 02181.
Permission to copy without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by permission of the Association
for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific
permission.
© 1984 ACM 0734-2071/84/1100-0277 $00.75
(1) At host A the file transfer program calls upon the file system to read the file
from the disk, where it resides on several tracks, and the file system passes
it to the file transfer program in fixed-size blocks chosen to be disk format
independent.
(2) Also at host A, the file transfer program asks the data communication system
to transmit the file using some communication protocol that involves splitting
the data into packets. The packet size is typically different from the file
block size and the disk track size.
(3) The data communication network moves the packets from computer A to
computer B.
(4) At host B, a data communication program removes the packets from the
data communication protocol and hands the contained data to a second part
of the file transfer application that operates within host B.
(5) At host B, the file transfer program asks the file system to write the received
data on the disk of host B.
With this model of the steps involved, the following are some of the threats to
the transaction that a careful designer might be concerned about:
(1) The file, though originally written correctly onto the disk at host A, if read
now may contain incorrect data, perhaps because of hardware faults in the
disk storage system.
(2) The software of the file system, the file transfer program, or the data
communication system might make a mistake in buffering and copying the
data of the file, either at host A or host B.
(3) The hardware processor or its local memory might have a transient error
while doing the buffering and copying, either at host A or host B.
(4) The communication system might drop or change the bits in a packet or
deliver a packet more than once.
(5) Either of the hosts may crash part way through the transaction after
performing an unknown amount (perhaps all) of the transaction.
How would a careful file transfer application then cope with this list of threats?
One approach might be to reinforce each of the steps along the way using
duplicate copies, time-out and retry, carefully located redundancy for error
detection, crash recovery, etc. The goal would be to reduce the probability of
each of the individual threats to an acceptably small value. Unfortunately,
systematic countering of threat (2) requires writing correct programs, which is
quite difficult. Also, not all the programs that must be correct are written by the
file transfer application programmer. If we assume further that all these threats
are relatively low in probability—low enough for the system to allow useful work
to be accomplished—brute force countermeasures, such as doing everything three
times, appear uneconomical.
The alternate approach might be called end-to-end check and retry. Suppose
that as an aid to coping with threat (1), stored with each file is a checksum that
has sufficient redundancy to reduce the chance of an undetected error in the file
to an acceptably negligible value. The application program follows the simple
steps above in transferring the file from A to B. Then, as a final additional step,
the part of the file transfer application residing in host B reads the transferred
file copy back from its disk storage system into its own memory, recalculates the
checksum, and sends this value back to host A, where it is compared with the
checksum of the original. Only if the two checksums agree does the file transfer
application declare the transaction committed. If the comparison fails, something
has gone wrong, and a retry from the beginning might be attempted.
If failures are fairly rare, this technique will normally work on the first try;
occasionally a second or even third try might be required. One would probably
consider two or more failures on the same file transfer attempt as indicating that
some part of this system is in need of repair.
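A minimal sketch of this end-to-end check and retry, with the entire five-step pipeline abstracted behind a hypothetical send function (the checksum algorithm, failure model, and retry limit are arbitrary choices for the illustration, not from the paper):

    import hashlib, random

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def flaky_send(data: bytes) -> bytes:
        # Stand-in for steps (1)-(5): returns whatever ended up on host B's
        # disk, dropping a byte 10% of the time to simulate threats (1)-(4).
        return data[:-1] if random.random() < 0.1 else data

    def careful_transfer(data: bytes, send=flaky_send, max_tries=3) -> bool:
        expected = checksum(data)               # checksum stored with the file
        for _ in range(max_tries):
            received = send(data)               # run the whole transfer pipeline
            if checksum(received) == expected:  # the end-to-end check
                return True                     # declare the transaction committed
        return False  # repeated failures: some part of the system needs repair

    print(careful_transfer(b"some file contents"))

Nothing in the sketch depends on how reliable send itself is; more internal reliability only lowers the retry frequency, which is exactly the point of the argument.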
Now let us consider the usefulness of a common proposal, namely, that the
communication system provide, internally, a guarantee of reliable data transmis
sion. It might accomplish this guarantee by providing selective redundancy in
the form of packet checksums, sequence number checking, and internal retry
mechanisms, for example. With sufficient care, the probability of undetected bit
errors can be reduced to any desirable level. The question is whether or not this
attempt to be helpful on the part of the communication system is useful to the
careful file transfer application.
The answer is that threat (4) may have been eliminated, but the careful file
transfer application must still counter the remaining threats; so it should still
provide its own retries based on an end-to-end checksum of the file. If it does,
the extra effort expended in the communication system to provide a guarantee
of reliable data transmission is only reducing the frequency of retries by the file
transfer application; it has no effect on inevitability or correctness of the outcome,
since correct file transmission is ensured by the end-to-end checksum and retry
whether or not the data transmission system is especially reliable.
Thus, the argument: In order to achieve careful file transfer, the application
program that performs the transfer must supply a file-transfer-specific, end-to-
end reliability guarantee—in this case, a checksum to detect failures and a retry-
commit plan. For the data communication system to go out of its way to be
extraordinarily reliable does not reduce the burden on the application program
to ensure reliability.
with the file length, and thus the expected time to transmit the file grows
exponentially with file length. Clearly, some effort at the lower levels to improve
network reliability can have a significant effect on application performance. But
the key idea here is that the lower levels need not provide "perfect" reliability.
Thus the amount of effort to put into reliability measures within the data
communication system is seen to be an engineering trade-off based on perform
ance, rather than a requirement for correctness. Note that performance has
several aspects here. If the communication system is too unreliable, the file
transfer application performance will suffer because of frequent retries following
failures of its end-to-end checksum. If the communication system is beefed up
with internal reliability measures, those measures also have a performance cost,
in the form of bandwidth lost to redundant data and added delay from waiting
for internal consistency checks to complete before delivering the data. There is
little reason to push in this direction very far, when it is considered that the end-
to-end check of the file transfer application must still be implemented no matter
how reliable the communication system becomes. The proper trade-off requires
careful thought. For example, one might start by designing the communication
system to provide only the reliability that comes with little cost and engineering
effort, and then evaluate the residual error level to ensure that it is consistent
with an acceptable retry frequency at the file transfer level. It is probably not
important to strive for a negligible error rate at any point below the application
level.
Using performance to justify placing functions in a low-level subsystem must
be done carefully. Sometimes, by examining the problem thoroughly, the same
or better performance enhancement can be achieved at the high level. Performing
a function at a low level may be more efficient, if the function can be performed
with a minimum perturbation of the machinery already included in the low-level
subsystem. But the opposite situation can occur—that is, performing the function
at the lower level may cost more—for two reasons. First, since the lower level
subsystem is common to many applications, those applications that do not need
the function will pay for it anyway. Second, the low-level subsystem may not
have as much information as the higher levels, so it cannot do the job as
efficiently.
Frequently, the performance trade-off is quite complex. Consider again the
careful file transfer on an unreliable network. The usual technique for increasing
packet reliability is some sort of per-packet error check with a retry protocol.
This mechanism can be implemented either in the communication subsystem or
in the careful file transfer application. For example, the receiver in the careful
file transfer can periodically compute the checksum of the portion of the file thus
far received and transmit this back to the sender. The sender can then restart
by retransmitting any portion that has arrived in error.
The end-to-end argument does not tell us where to put the early checks, since
either layer can do this performance-enhancement job. Placing the early retry
protocol in the file transfer application simplifies the communication system but
may increase overall cost, since the communication system is shared by other
applications and each application must now provide its own reliability enhance
ment. Placing the early retry protocol in the communication system may be more
as they pass into the target node and are fanned out to the target application.
Third, the authenticity of the message must still be checked by the application.
If the application performs end-to-end encryption, it obtains its required authen
tication check and can handle key management to its satisfaction, and the data
are never exposed outside the application.
Thus, to satisfy the requirements of the application, there is no need for the
communication subsystem to provide for automatic encryption of all traffic.
Automatic encryption of all traffic by the communication subsystem may be
called for, however, to ensure something else—that a misbehaving user or
application program does not deliberately transmit information that should not
be exposed. The automatic encryption of all data as they are put into the network
is one more firewall the system designer can use to ensure that information does
not escape outside the system. Note however, that this is a different requirement
from authenticating access rights of a system user to specific parts of the data.
This network-level encryption can be quite unsophisticated—the same key can
be used by all hosts, with frequent changes of the key. No per-user keys complicate
the key management problem. The use of encryption for application-level au
thentication and protection is complementary. Neither mechanism can satisfy
both requirements completely.
against making any function a permanent fixture of lower level modules; the
function may be provided by a lower level module, but it should always be
replaceable by an application's special version of the function. The reasoning is
that for any function that can be thought of, at least some applications will find
that, of necessity, they must implement the function themselves in order to meet
correctly their own requirements. This line of reasoning leads Lampson to
propose an "open" system in which the entire operating system consists of
replaceable routines from a library. Such an approach has only recently become
feasible in the context of computers dedicated to a single application. It may be
the case that the large quantity of fixed supervisor functions typical of large-
scale operating systems is only an artifact of economic pressures that have
demanded multiplexing of expensive hardware and therefore a protected super
visor. Most recent system “ kernelization”projects have, in fact, focused at least
in part on getting function out of low system levels [12,16]. Though this function
movement is inspired by a different kind of correctness argument, it has the side
effect of producing an operating system that is more flexible for applications,
which is exactly the main thrust of the end-to-end argument.
6. CONCLUSIONS
End-to-end arguments are a kind of "Occam's razor" when it comes to choosing
the functions to be provided in a communication subsystem. Because the com
munication subsystem is frequently specified before applications that use the
subsystem are known, the designer may be tempted to "help" the users by taking
on more function than necessary. Awareness of end-to-end arguments can help
to reduce such temptations.
It is fashionable these days to talk about layered communication protocols, but
without clearly defined criteria for assigning functions to layers. Such layerings
are desirable to enhance modularity. End-to-end arguments may be viewed as
part of a set of rational principles for organizing such layered systems. We hope
that our discussion will help to add substance to arguments about the "proper"
layering.
ACKNOWLEDGMENTS
Many people have read and commented on an earlier draft of this paper, including
David Cheriton, F. B. Schneider, and Liba Svobodova. The subject was also
discussed at the ACM Workshop in Fundamentals of Distributed Computing, in
Fallbrook, Calif., December 1980. Those comments and discussions were quite
helpful in clarifying the arguments.
REFERENCES
1. Bolt Beranek and Newman Inc. Specifications for the interconnection of a host and an
IMP. Tech. Rep. 1822, Bolt Beranek and Newman Inc., Cambridge, Mass., Dec. 1981.
2. Branstad, D.K. Security aspects of computer networks. AIAA Paper 73-427, AIAA Computer
Network Systems Conference, Huntsville, Ala., Apr. 1973.
3. Corbató, F.J., Daggett, M.M., Daley, R.C., Creasy, R.J., Helliwig, J.D., Orenstein, R.H.,
and Korn, L.K. The Compatible Time-Sharing System, A Programmer's Guide. Massachusetts
Institute of Technology Press, Cambridge, Mass., 1963.
Data Processing - External Sorting
request), even if this user is reading blocks belonging to a single relation, and
that relation is stored on a single cylinder of the disk. Later in this section we
shall discuss how to improve the performance of the system in various ways.
However, in all that follows, the following rule, which defines the I/O model of
computation, is assumed:
Dominance of I/O cost: If a block needs to be moved between
disk and main memory, then the time taken to perform the read
or write is much larger than the time likely to be used manip
ulating that data in main memory. Thus, the number of block
accesses (reads and writes) is a good approximation to the time
needed by the algorithm and should be minimized.
In examples, we shall assume that the disk is a Megatron 747, with 16K-
byte blocks and the timing characteristics determined in Example 11.5. In
particular, the average time to read or write a block is about 11 milliseconds.
Example 11.6: Suppose our database has a relation R and a query asks for
the tuple of R that has a certain key value k. As we shall see, it is quite desirable
that an index on R be created and used to identify the disk block on which the
tuple with key value k appears. However it is generally unimportant whether
the index tells us where on the block this tuple appears.
The reason is that it will take on the order of 11 milliseconds to read this
16K-byte block. In 11 milliseconds, a modern microprocessor can execute mil
lions of instructions. However, searching for the key value k once the block is
in main memory will only take thousands of instructions, even if the dumbest
possible linear search is used. The additional time to perform the search in
main memory will therefore be less than 1% of the block access time and can
be neglected safely. □
The records (tuples) of R will be divided into disk blocks of 16,384 bytes per
block. We assume that 100 records fit in one block. That is, records are about
160 bytes long. With the typical extra information needed to store records in a
block (as discussed in Section 12.2, e.g.), 100 records of this size is about what
can fit in one 16,384-byte block. Thus, R occupies 100,000 blocks totaling 1.64
billion bytes.
The machine on which the sorting occurs has one Megatron 747 disk and
100 megabytes of main memory available for buffering blocks of the relation.
The actual main memory is larger, but the rest of main-memory is used by the
system. The number of blocks that can fit in 100M bytes of memory (which,
recall, is really 100 × 2^20 bytes), is 100 × 2^20 / 2^14, or 6400 blocks. □
If the data fits in main memory, there are a number of well-known algorithms
that work well; variants of "Quicksort" are generally considered the fastest.
The preferred version of Quicksort sorts only the key fields, carrying pointers
to the full records along with the keys. Only when the keys and their pointers
were in sorted order, would we use the pointers to bring every record to its
proper position.
Unfortunately, these ideas do not work very well when secondary memory
is needed to hold the data. The preferred approaches to sorting, when the data
is mostly in secondary memory, involve moving each block between main and
secondary memory only a small number of times, in a regular pattern. Often,
these algorithms operate in a small number of passes; in one pass every record
is read into main memory once and written out to disk once. In Section 11.4.4,
we see one such algorithm.
11.4.3 Merge-Sort
You may be familiar with a main-memory sorting algorithm called Merge-Sort
that works by merging sorted lists into larger sorted lists. To merge two sorted
lists, we repeatedly compare the smallest remaining keys of each list, move the
record with the smaller key to the output, and repeat, until one list is exhausted.
At that time, the output, in the order selected, followed by what remains of the
nonexhausted list, is the complete set of records, in sorted order.
Figure 11.10: Merging two sorted lists to make one sorted list
step (7), when the second list is exhausted. At that point, the remainder of the
first list, which happens to be only one element, is appended to the output and
the merge is done. Note that the output is in sorted order, as must be the case,
because at each step we chose the smallest of the remaining elements. □
The time to merge in main memory is linear in the sum of the lengths of the
lists. The reason is that, because the given lists are sorted, only the heads of
the two lists are ever candidates for being the smallest unselected element, and
we can compare them in a constant amount of time. The classic merge-sort
algorithm sorts recursively, using log_2 n phases if there are n elements to be
sorted. It can be described as follows:
BASIS: If there is a list of one element to be sorted, do nothing, because the
list is already sorted.
IN D U C T IO N : If there is a list of more than one element to be sorted, then
divide the list arbitrarily into two lists that are either of the same length, or as
close as possible if the original list is of odd length. Recursively sort the two
sublists. Then merge the resulting sorted lists into one sorted list.
The analysis of this algorithm is well known and not too important here. Briefly,
T(n), the time to sort n elements, is some constant times n (to split the list and
merge the resulting sorted lists) plus the time to sort two lists of size n/2. That
is, T(n) = 2T(n/2) + an for some constant a. The solution to this recurrence
equation is T(n) = O(n log n), that is, proportional to n log n.
Our first observation is that with data on secondary storage, we do not want
to start with a basis to the recursion that is one record or a few records. The
reason is that Merge-Sort is not as fast as some other algorithms when the
records to be sorted fit in main memory. Thus, we shall begin the recursion
by taking an entire main memory full of records, and sorting them using an
appropriate main-memory sorting algorithm such as Quicksort. We repeat the
following process as many times as necessary:
1. Fill all available main memory with blocks from the original relation to
be sorted.
2. Sort the records that are in main memory.
3. Write the sorted records from main memory onto new blocks of secondary
memory, forming one sorted sublist.
At the end of this first phase, all the records of the original relation will have
been read once into main memory, and become part of a main-memory-size
sorted sublist that has been written onto disk.
Now, let us consider how we complete the sort by merging the sorted sublists.
We could merge them in pairs, as in the classical Merge-Sort, but that would
involve reading all data in and out of memory 2 log_2 n times if there were n
sorted sublists. For instance, the 16 sorted sublists of Example 11.9 would be
read in and out of secondary storage once to merge into 8 lists; another complete
reading and writing would reduce them to 4 sorted lists, and two more complete
read/write operations would reduce them to one sorted list. Thus, each block
would have 8 disk I/O's performed on it.
A better approach is to read the first block of each sorted sublist into a
main-memory buffer. For some huge relations, there would be too many sorted
sublists from phase one to read even one block per list into main memory, a
problem we shall deal with in Section 11.4.5. But for data such as that of
Example 11.7, there are relatively few lists, 16 in that example, and a block
from each list fits easily in main memory.
We also use a buffer for an output block that will contain as many of the
first elements in the complete sorted list as it can hold. Initially, the output
block is empty. The arrangement of buffers is suggested by Fig. 11.11. We
merge the sorted sublists into one sorted list with all the records as follows.
[Figure 11.11: one input buffer per sorted sublist plus an output buffer; the smallest unchosen element is selected for output.]
1. Find the smallest key among the first remaining elements of all the lists.
Since this comparison is done in main memory, a linear search is suffi
cient, taking a number of machine instructions proportional to the num
ber of sublists. However, if we wish, there is a method based on "priority
queues"6 that takes time proportional to the logarithm of the number of
sublists to find the smallest element.
2. Move the smallest element to the first available position of the output
block (a code sketch of this merge loop follows the list).
6See Aho, A. V. and J. D. Ullman Foundations o f Com puter S cien ce, Com puter Science
Press, 1992.
3. If the output block is full, write it to disk and reinitialize the same buffer
in main memory to hold the next output block.
4. If the block from which the smallest element was just taken is now ex
hausted of records, read the next block from the same sorted sublist into
the same buffer that was used for the block just exhausted. If no blocks
remain, then leave its buffer empty and do not consider elements from
that list in any further competition for smallest remaining elements.
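Steps (1) through (4) can be sketched compactly if Python iterators stand in for the buffered sublists; the heap below plays the role of the "priority queue" mentioned in step (1). This is only an illustration under that assumption and ignores block boundaries and buffer management:

    import heapq

    def multiway_merge(sublists):
        # Second-phase merge: one buffered head per sorted sublist; repeatedly
        # emit the smallest remaining key (step 1) and refill from the sublist
        # it came from (step 4).
        iters = [iter(s) for s in sublists]
        heap = []
        for i, it in enumerate(iters):
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, i))
        while heap:
            key, i = heapq.heappop(heap)   # smallest among all sublist heads
            yield key                      # steps 2-3: move to the output
            nxt = next(iters[i], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, i))

    runs = [[1, 5, 9], [2, 4], [3, 8]]    # already-sorted sublists from phase one
    print(list(multiway_merge(runs)))     # [1, 2, 3, 4, 5, 8, 9]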
In the second phase, unlike the first phase, the blocks are read in an unpre
dictable order, since we cannot tell when an input block will become exhausted.
However, notice that every block holding records from one of the sorted lists is
read from disk exactly once. Thus, the total number of block reads is 100,000
in the second phase, just as for the first. Likewise, each record is placed once in
an output block, and each of these blocks is written to disk. Thus, the number
of block writes in the second phase is also 100,000. As the amount of second-
phase computation in main memory can again be neglected compared to the
I/O cost, we conclude that the second phase takes another 37 minutes, or 74
If we need to sort more records, we can add a third pass. Use TPMMS to
sort groups of M²/RB records, turning them into sorted sublists. Then, in a
third phase, we merge up to (M/B) − 1 of these lists in a final multiway merge.
The third phase lets us sort approximately M³/RB² records occupying
M³/B³ blocks. For the parameters of Example 11.7, this amount is about
27 trillion records occupying 4.3 petabytes. Such an amount is unheard of to
day. Since even the 0.67 terabyte limit for TPMMS is unlikely to be carried
out in secondary storage, we suggest that the two-phase version of Multiway
Merge-Sort is likely to be enough for all practical purposes.
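These capacity claims are easy to check from the parameters of Example 11.7 (M = 100 × 2^20 bytes of memory, B = 2^14 bytes per block, R = 160 bytes per record); a few lines of Python reproduce the 0.67 terabytes, 27 trillion records, and 4.3 petabytes quoted above:

    M = 100 * 2**20   # bytes of main memory available for buffering
    B = 2**14         # bytes per block
    R = 160           # bytes per record

    print(M**2 / B / 1e12)           # two-phase limit: ~0.67 terabytes
    print(M**3 / (R * B**2) / 1e12)  # three-phase: ~27 trillion records
    print(M**3 / B**2 / 1e15)        # three-phase: ~4.3 petabytes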
… if so, we can find the tuple with that key value by using a standard binary search
technique. What is the maximum number of disk I/O's needed to find the tuple
with key K?
Data Processing - Basic Relational Operators and Joins

This chapter contains the book chapter:

• Apply external memory algorithms to the implementation of data pro
cessing operators.
Chapter 15
Query Execution
Previous chapters gave us data structures that allow efficient execution of basic
database operations such as finding tuples given a search key. We are now ready
to use these structures to support efficient algorithms for answering queries. The
broad topic of query processing will be covered in this chapter and Chapter 16.
The query processor is the group of components of a DBMS that turns user
queries and data-modification commands into a sequence of database operations
and executes those operations. Since SQL lets us express queries at a very high
level, the query processor must supply a lot of detail regarding how the query
is to be executed. Moreover, a naive execution strategy for a query may lead to
an algorithm for executing the query that takes far more time than necessary.
Figure 15.1 suggests the division of topics between Chapters 15 and 16.
In this chapter, we concentrate on query execution, that is, the algorithms
that manipulate the data of the database. We focus on the operations of the
extended relational algebra, described in Section 5.4. Because SQL uses a bag
model, we also assume that relations are bags, and thus use the bag versions of
the operators from Section 5.3.
We shall cover the principal methods for execution of the operations of rela
tional algebra. These methods differ in their basic strategy; scanning, hashing,
sorting, and indexing are the major approaches. The methods also differ on
their assumption as to the amount of available main memory. Some algorithms
assume that enough main memory is available to hold at least one of the re
lations involved in an operation. Others assume that the arguments of the
operation are too big to fit in memory, and these algorithms have significantly
different costs and structures.
Preview of Query Compilation
Query compilation is divided into the three major steps shown in Fig. 15.2.
b) Query rewrite, in which the parse tree is converted to an initial query plan,
which is usually an algebraic representation of the query. This initial plan
is then transformed into an equivalent plan that is expected to require less
time to execute.
c) Physical plan generation, where the abstract query plan from (b), often
called a logical query plan, is turned into a physical query plan by selecting
algorithms to implement each of the operators of the logical plan, and by
selecting an order of execution for these operators. The physical plan, like
the result of parsing and the logical plan, is represented by an expression
tree. The physical plan also includes details such as how the queried
relations are accessed, and when and if a relation should be sorted.
Parts (b) and (c) are often called the query optimizer, and these are the
hard parts of query compilation. Chapter 16 is devoted to query optimization;
we shall learn there how to select a “ query plan”that takes as little time as
possible. To select the best query plan we need to decide:
Each of these choices depends on the metadata about the database. Typical
metadata that is available to the query optimizer includes: the size of each
[Figure: an SQL query passes through query optimization before the chosen plan is executed.]
c) If R is too large to fit in main memory, then the multiway merging ap
proach covered in Section 11.4.3 is a good choice. However, instead of
storing the final sorted R back on disk, we produce one block of the
sorted R at a time, as its tuples are needed.

• We assume that the arguments of any operator are found on disk, but the
result of the operator is left in main memory.
• Sometimes, we also need to know the number of tuples in R, and we
denote this quantity by T(R), or just T if R is understood. If we need the
number of tuples of R that can fit in one block, we can use the ratio T/B.
Further, there are some instances where a relation is stored distributed
• We continue to use B or T as an estimate of the cost of accessing a
clustered or unclustered relation in its entirety, using an index.
Why Iterators?
We shall see in Section 16.7 how iterators support efficient execution when
they are composed within query plans. They contrast with a material
ization strategy, where the result of each operator is produced in its en
tirety, and either stored on disk or allowed to take up space in main
memory. When iterators are used, many operations are active at once. Tu
ples pass between operators as needed, thus reducing the need for storage.
Of course, as we shall see, not all physical operators support the iteration
approach, or "pipelining," in a useful way. In some cases, almost all the
work would need to be done by the Open function, which is tantamount
to materialization.
1. Open. This function starts the process of getting tuples, but does not get
a tuple. It initializes any data structures needed to perform the operation
and calls Open for any arguments of the operation.
2. GetNext. This function returns the next tuple in the result and adjusts
data structures as necessary to allow subsequent tuples to be obtained. In
getting the next tuple of its result, it typically calls GetNext one or more
times on its argument(s). If there are no more tuples to return, GetNext
returns a special value NotFound, which we assume cannot be mistaken
for a tuple.
Example 15.1: Perhaps the simplest iterator is the one that implements the
table-scan operator. The iterator is implemented by a class TableScan, and a
table-scan operator in a query plan is an instance of this class parameterized by
the relation R we wish to scan. Let us assume that R is a relation clustered in
some list of blocks, which we can access in a convenient way; that is, the notion
of "get the next block of R" is implemented by the storage system and need
not be described in detail. Further, we assume that within a block there is a
directory of records (tuples) so that it is easy to get the next tuple of a block
or tell that the last tuple has been reached.
Open() {
    b := the first block of R;
    t := the first tuple of block b;
}

GetNext() {
    IF (t is past the last tuple on block b) {
        increment b to the next block;
        IF (there is no next block)
            RETURN NotFound;
        ELSE /* b is a new block */
            t := first tuple on block b;
    } /* now we are ready to return t and increment */
    oldt := t;
    increment t to the next tuple of b;
    RETURN oldt;
}

Close() {
}
Then, GetNext can run a competition for the first remaining tuple at the heads
of all the sublists. If the block from the winning sublist is exhausted, GetNext
reloads its buffer. □
Open() {
    R.Open();
    CurRel := R;
}

GetNext() {
    IF (CurRel = R) {
        t := R.GetNext();
        IF (t <> NotFound) /* R is not exhausted */
            RETURN t;
        ELSE /* R is exhausted */ {
            S.Open();
            CurRel := S;
        }
    }
    /* here, we must read from S */
    RETURN S.GetNext();
    /* notice that if S is exhausted, S.GetNext()
       will return NotFound, which is the correct
       action for our GetNext as well */
}

Close() {
    R.Close();
    S.Close();
}
a) Some methods involve reading the data only once from disk. These are
the o n e - p a s s algorithms, and they are the topic of this section. Usually,
they work only when at least one of the arguments of the operation fits in
main memory, although there are exceptions, especially for selection and
projection as discussed in Section 15.2.1.
b) Some methods work for data that is too large to fit in available main
memory but not for the largest imaginable data sets. An example of such
elimination operator).
3. Full-relation, binary operations. All other operations are in this class:
set and bag versions of union, intersection, difference, joins, and prod
ucts. Except for bag union, each of these operations requires at least one
argument to be limited to size M, if we are to use a one-pass algorithm.
buffer, regardless of B.
The disk I/O requirement for this process depends only on how the argument
relation R is provided. If R is initially on disk, then the cost is whatever it
takes to perform a table-scan or index-scan of R . The cost was discussed in
Section 15.1.5; typically it is B if R is clustered and T if it is not clustered.
[Figure: a single input buffer and a single output buffer suffice for a one-pass unary operation.]
However, we should remind the reader again of the important exception when
the operation being performed is a selection, and the condition compares a
constant to an attribute that has an index. In that case, we can use the index
to retrieve only a subset of the blocks holding R , thus improving performance,
often markedly.
Duplicate Elimination
To eliminate duplicates, we can read each block of R one at a time, but for each
tuple we need to make a decision as to whether:
1. It is the first time we have seen this tuple, in which case we copy it to the
output, or
2. We have seen the tuple before, in which case we must not output this
tuple.
To support this decision, we need to keep in memory one copy of every tuple
we have seen, as suggested in Fig. 15.6. One memory buffer holds one block of
R's tuples, and the remaining M − 1 buffers can be used to hold a single copy
of every tuple seen so far.

[Figure 15.6: one input buffer for R, M − 1 buffers holding one copy of each tuple seen, and an output buffer.]
When storing the already-seen tuples, we must be careful about the main-
memory data structure we use. Naively, we might just list the tuples we have
seen. When a new tuple from R is considered, we compare it with all tuples
seen so far, and if it is not equal to any of these tuples we both copy it to the
output and add it to the in-memory list of tuples we have seen.
However, if there are n tuples in main memory, each new tuple takes pro
cessor time proportional to n, so the complete operation takes processor time
proportional to n². Since n could be very large, this amount of time calls into
serious question our assumption that only the disk I/O time is significant. Thus,
we need a main-memory structure that allows each of the operations:
space overhead in addition to the space needed to store the tuples; for instance,
a main-memory hash table needs a bucket array and space for pointers to link
the tuples in a bucket. However, the overhead tends to be small compared
with the space needed to store the tuples. We shall thus make the simplifying
assumption of no overhead space and concentrate on what is required to store
the tuples in main memory.
On this assumption, we may store in the M − 1 available buffers of main
memory as many tuples as will fit in M − 1 blocks of R. If we want one copy
of each distinct tuple of R to fit, an approximation to this rule, and the one
we shall generally use, is:

    B(δ(R)) ≤ M
Note that we cannot in general compute the size of δ(R) without computing
δ(R) itself. Should we underestimate that size, so B(δ(R)) is actually larger
than M, we shall pay a significant penalty due to thrashing, as the blocks
holding the distinct tuples of R must be brought into and out of main memory
frequently.
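A sketch of the one-pass δ algorithm with a hash-based set as the main-memory structure (blocks are modeled as Python lists, and the B(δ(R)) ≤ M condition is assumed to hold):

    def one_pass_distinct(blocks):
        # A hash set holds one copy of every tuple seen so far (the M - 1
        # buffers of Fig. 15.6), so each membership test takes constant
        # expected time rather than a linear scan of the tuples seen.
        seen = set()
        for block in blocks:          # one block of R in the input buffer at a time
            for t in block:
                if t not in seen:     # first time we have seen this tuple
                    seen.add(t)
                    yield t           # copy it to the output

    R = [[(1, 'a'), (2, 'b')], [(1, 'a'), (3, 'c')]]
    print(list(one_pass_distinct(R)))  # [(1, 'a'), (2, 'b'), (3, 'c')]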
Grouping
•For any COUNT aggregation, add one for each tuple of the group that is
seen.
•For SUM (a), add the value of attribute a to the accumulated sum for its
group.
•AVG(a) is the hard case. We must maintain two accumulations: the count
of the number of tuples in the group and the sum of the a-values of these
tuples. Each is computed as we would for a COUNT and SUM aggregation,
respectively. After all tuples of R are seen, we take the quotient of the
sum and count to obtain the average.
When all tuples of R have been read into the input buffer and contributed
to the aggregation(s) for their group, we can produce the output by writing the
tuple for each group. Note that until the last tuple is seen, we cannot begin to
create output for a γ operation. Thus, this algorithm does not fit the iterator
framework very well; the entire grouping has to be done by the Open function
before the first tuple can be retrieved by GetNext.
In order that the in-memory processing of each tuple be efficient, we need
to use a main-memory data structure that lets us find the entry for each group,
given values for the grouping attributes. As discussed above for the δ operation,
common main-memory data structures such as hash tables or balanced trees
will serve well. We should remember, however, that the search key for this
structure is the grouping attributes only.
The number of disk I/O's needed for this one-pass algorithm is B, as must
be the case for any one-pass algorithm for a unary operator. The number of
required memory buffers M is not related to B in any simple way, although
typically M will be less than B. The problem is that the entries for the groups
could be longer or shorter than tuples of R, and the number of groups could
be anything equal to or less than the number of tuples of R. However, in most
cases, group entries will be no longer than R's tuples, and there will be many
fewer groups than tuples.
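As an illustration, here is a one-pass grouping sketch for AVG(a), keeping the count and sum per group exactly as described above; attribute positions stand in for the grouping attribute, and the relation is modeled as a list of blocks:

    def one_pass_group_avg(blocks, key, val):
        # Per group we keep the two accumulations the text describes, a
        # count and a sum, and take quotients only after the last tuple --
        # which is why this operator is blocking.
        acc = {}                                # group value -> [count, sum]
        for block in blocks:
            for t in block:
                entry = acc.setdefault(t[key], [0, 0])
                entry[0] += 1
                entry[1] += t[val]
        return {g: s / c for g, (c, s) in acc.items()}

    R = [[('x', 10), ('y', 4)], [('x', 20)]]    # tuples (grouping attr, a)
    print(one_pass_group_avg(R, key=0, val=1))  # {'x': 15.0, 'y': 4.0}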
discussion of joins, we shall consider only the natural join. An equijoin can
be implemented the same way, after attributes are renamed appropriately, and
theta-joins can be thought of as a product or equijoin followed by a selection
for those conditions that cannot be expressed in an equijoin.
Bag union can be computed by a very simple one-pass algorithm. To com
pute R ∪_B S, we copy each tuple of R to the output and then copy every tuple
of S, as we did in Example 15.3. The number of disk I/O's is B(R) + B(S), as
it must be for a one-pass algorithm on operands R and S, while M = 1 suffices.
For the other binary operations, the approximate rule is:

    min(B(R), B(S)) ≤ M
This rule assumes that one buffer will be used to read the blocks of the larger
relation, while approximately M buffers are needed to house the entire smaller
relation and its main-memory data structure.
We shall now give the details of the various operations. In each case, we
assume R is the larger of the relations, and we house S in main memory.
Set Union

Set Intersection
Read S into M − 1 buffers and build a search structure with full tuples as the
search key. Read each block of R, and for each tuple t of R, see if t is also in
S. If so, copy t to the output, and if not, ignore t.
Set Difference

Bag Intersection
We read S into M − 1 buffers, but we associate with each distinct tuple a count,
which initially measures the number of times this tuple occurs in S. Multiple
copies of a tuple t are not stored individually. Rather we store one copy of t
and associate with it a count equal to the number of times t occurs.
This structure could take slightly more space than B(S) blocks if there were
few duplicates, although frequently the result is that S is compacted. Thus, we
shall continue to assume that B(S) < M is sufficient for a one-pass algorithm
to work, although the condition is only an approximation.
Next, we read each block of R, and for each tuple t of R we see whether t
occurs in S. If not we ignore t; it cannot appear in the intersection. However, if
t appears in S, and the count associated with t is still positive, then we output
t and decrement the count by 1. If t appears in S, but its count has reached 0,
then we do not output t; we have already produced as many copies of t in the
output as there were copies in S.
Bag Difference
To compute S −_B R, we read the tuples of S into main memory, and count the
number of occurrences of each distinct tuple, as we did for bag intersection.
When we read R, for each tuple t we see whether t occurs in S, and if so, we
decrement its associated count. At the end, we copy to the output each tuple
in main memory whose count is positive, and the number of times we copy it
equals that count.
To compute R −_B S, we also read the tuples of S into main memory and
count the number of occurrences of distinct tuples. We may think of a tuple t
with a count of c as c reasons not to copy t to the output as we read tuples of
R. That is, when we read a tuple t of R, we see if t occurs in S. If not, then we
copy t to the output. If t does occur in S, then we look at the current count c
associated with t. If c = 0, then copy t to the output. If c > 0, do not copy t
to the output, but decrement c by 1.
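The "c reasons" bookkeeping for R −_B S can be sketched with a counter keyed by tuple (blocks are again modeled as Python lists, and S is assumed to fit in memory per the one-pass condition):

    from collections import Counter

    def bag_difference(R_blocks, S_blocks):
        # Each counted copy of a tuple of S is one "reason" not to copy a
        # matching tuple of R to the output.
        reasons = Counter(t for block in S_blocks for t in block)
        for block in R_blocks:
            for t in block:
                if reasons[t] > 0:
                    reasons[t] -= 1     # cancel one copy against S
                else:
                    yield t             # c = 0: copy t to the output

    R = [[1, 1, 2, 3]]
    S = [[1, 2, 2]]
    print(list(bag_difference(R, S)))   # [1, 3]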
Product
However, the output size is also large, and the time per output tuple is small.
Natural Join

In this and other join algorithms, let us take the convention that R(X, Y) is
being joined with S(Y, Z), where Y represents all the attributes that R and S
have in common, X is all attributes of R that are not in the schema of S, and
Z is all attributes of S that are not in the schema of R. We continue to assume
that S is the smaller relation. To compute the natural join, do the following:
1. Read all the tuples of S and form them into a main-memory search struc
ture with the attributes of Y as the search key. As usual, a hash table or
balanced tree are good examples of such structures. Use M — 1 blocks of
memory for this purpose.
2. Read each block of R into the one remaining main-memory buffer. For
each tuple t of R , find the tuples of S that agree with t on all attributes
of Y, using the search structure. For each matching tuple of S, form a
tuple by joining it with t, and move the resulting tuple to the output.
Like all the one-pass, binary algorithms, this one takes B(R) + B(S) disk I/O's
to read the operands. It works as long as B(S) ≤ M − 1, or approximately,
B(S) ≤ M. Also as for the other algorithms we have studied, the space required
by the main-memory search structure is not counted but may lead to a small,
additional memory requirement.
We shall not discuss joins other than the natural join. Remember that an
equijoin is executed in essentially the same way as a natural join, but we must
account for the fact that "equal" attributes from the two relations may have
different names. A theta-join that is not an equijoin can be replaced by an
equijoin or product followed by a selection.
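A sketch of the two steps of the one-pass natural join, with a hash table on the Y attributes as the main-memory search structure; the tuple layouts and the key-extraction functions are assumptions of this example (it simplifies by treating Y as the single leading attribute of S):

    def one_pass_natural_join(R_blocks, S_tuples, y_of_r, y_of_s):
        # Step (1): build a main-memory hash table on Y from the smaller
        # relation S. Step (2): stream R one block at a time and probe.
        table = {}
        for s in S_tuples:
            table.setdefault(y_of_s(s), []).append(s)
        for block in R_blocks:
            for r in block:
                for s in table.get(y_of_r(r), []):
                    yield r + s[1:]     # X and Y from r, then Z from s

    R = [[('x1', 'y1'), ('x2', 'y2')]]            # R(X, Y)
    S = [('y1', 'z1'), ('y1', 'z2')]              # S(Y, Z)
    print(list(one_pass_natural_join(R, S, lambda r: r[1], lambda s: s[0])))
    # [('x1', 'y1', 'z1'), ('x1', 'y1', 'z2')]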
* a) Projection.
* b) Distinct (δ).
c) Grouping (γ).
* d) Set union.
e) Set intersection.
f) Set difference.
g) Bag intersection.
h) Bag difference.
i) Product.
j) Natural join.
Exercise 15.2.2: For each of the operators in Exercise 15.2.1, tell whether the
operator is blocking, by which we mean that the first output cannot be produced
until all the input has been read. Put another way, a blocking operator is one
whose only possible iterators have all the important work done by Open.
If we are careless about how we buffer the blocks of relations R and S, then
this algorithm could require as many as T(R)T(S) disk I/O's. However, there
are many situations where this algorithm can be modified to have much lower
cost. One case is when we can use an index on the join attribute or attributes
of R to find the tuples of R that match a given tuple of S, without having to
read the entire relation R. We discuss index-based joins in Section 15.6.3. A
second improvement looks much more carefully at the way tuples of R and S
are divided among blocks, and uses as much of the memory as it can to reduce
the number of disk I/O's as we go through the inner loop. We shall consider
this block-based version of nested-loop join in Section 15.3.3.
Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}

GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) { /* R is exhausted for
                               the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound;
                /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    }
    UNTIL (r and s join);
    RETURN the join of r and s;
}

Close() {
    R.Close();
    S.Close();
}
Point (1) makes sure that when we run through the tuples of R in the inner
loop, we use as few disk I/O's as possible to read R. Point (2) enables us to join
each tuple of R that we read with not just one tuple of S, but with as many
tuples of S as will fit in memory.
with search key equal to the common attributes of R and S, is created for the
tuples of S that are in main memory. Then we go through all the blocks of R,
reading each one in turn into the last block of memory. Once there, we compare
all the tuples of R's block with all the tuples in all the blocks of S that are
currently in main memory. For those that join, we output the joined tuple.
The nested-loop structure of this algorithm can be seen when we describe the
algorithm more formally, in Fig. 15.8.

The program of Fig. 15.8 appears to have three nested loops. However, there
really are only two loops if we look at the code at the right level of abstraction.
The first, or outer loop, runs through the tuples of S. The other two loops
run through the tuples of R. However, we expressed the process as two loops
to emphasize that the order in which we visit the tuples of R is not arbitrary.
Rather, we need to look at these tuples a block at a time (the role of the second
loop), and within one block, we look at all the tuples of that block before moving
on to the next block (the role of the third loop).
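A minimal sketch of the block-based nested-loop join in Java, assuming the relations are given as lists of blocks of tuples; the comma-separated tuple layout and the key() helper are illustrative assumptions (a real implementation would also build a hash index over each chunk of S, as the text describes):

    import java.util.ArrayList;
    import java.util.List;

    public class BlockNestedLoopJoin {
        // Assumed layout: the join key is the first comma-separated field.
        static String key(String tuple) { return tuple.split(",")[0]; }

        public static List<String> join(List<List<String>> rBlocks,
                                        List<List<String>> sBlocks, int m) {
            List<String> out = new ArrayList<>();
            int chunk = m - 1; // one buffer is reserved for the current block of R
            for (int i = 0; i < sBlocks.size(); i += chunk) {
                // "Load" the next M-1 blocks of S into memory.
                List<String> sChunk = new ArrayList<>();
                for (List<String> b : sBlocks.subList(i, Math.min(i + chunk, sBlocks.size())))
                    sChunk.addAll(b);
                // Rescan R once per chunk, one block at a time.
                for (List<String> rBlock : rBlocks)
                    for (String r : rBlock)
                        for (String s : sChunk)
                            if (key(r).equals(key(s)))
                                out.add(r + "|" + s);
            }
            return out;
        }
    }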
Example 15.4: Let B(R) = 1000, B(S) = 500, and M = 101. We shall use
100 blocks of memory to buffer S in 100-block chunks, so the outer loop of
Fig. 15.8 iterates five times. At each iteration, we do 100 disk I/O's to read the
chunk of S, and we must read R entirely in the second loop, using 1000 disk
I/O's. Thus, the total number of disk I/O's is 5500.

Notice that if we reversed the roles of R and S, the algorithm would use
slightly more disk I/O's. We would iterate 10 times through the outer loop and
do 600 disk I/O's at each iteration, for a total of 6000. In general, there is a
slight advantage to using the smaller relation in the outer loop. □
or

    B(S) + B(S)B(R)/(M − 1),

a loose approximation. For γ, M grows with the number of groups, and for δ,
M grows with the number of distinct tuples.
    Operators        Approximate M required   Disk I/O       Section
    σ, π             1                        B              15.2.1
    γ, δ             B                        B              15.2.2
    ∪, ∩, −, ×, ⋈    min(B(R), B(S))          B(R) + B(S)    15.2.3
    ⋈                any M ≥ 2                B(R)B(S)/M     15.3.3

Figure 15.9: Main memory and disk I/O requirements for one-pass and nested-
loop algorithms
! Exercise 15.3.5: The iterator of Fig. 15.7 will not work properly if either R
or S is empty. Rewrite the functions so they will work, even if one or both
relations are empty.
a) Two passes are usually enough, even for very large relations.

b) Generalizing to more than two passes is not hard; we discuss these extensions
in Section 15.8.
3. Write the sorted list into M blocks of disk. We shall refer to the contents
of these blocks as one of the sorted sublists of R.

More precisely, we look at the first unconsidered tuple from each block, and
we find among them the first in sorted order, say t. We make one copy of t in
the output, and we remove from the fronts of the various input blocks all copies
of t. If a block is exhausted, we bring into its buffer the next block from the
same sublist, and if there are t's on that block we remove them as well.
Example 15.5: Suppose for simplicity that tuples are integers, and only two
tuples fit on a block. Also, M = 3; i.e., there are three blocks in main memory.
The relation R consists of 17 tuples:

    2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3

We read the first six tuples into the three blocks of main memory, sort them,
and write them out as the sublist R1. Similarly, tuples seven through twelve
are then read in, sorted and written as the sublist R2. The last five tuples are
likewise sorted and become the sublist R3.
To start the second pass, we can bring into main memory the first block
(two tuples) from each of the three sublists. The situation is now:

Looking at the first tuples of the three blocks in main memory, we find that
1 is the first tuple in sorted order. We therefore make one copy of 1 on the
output, and we remove all 1's from the blocks in memory. When we do so, the
block from R3 is exhausted, so we bring in the next block, with tuples 2 and 3,
from that sublist. Had there been more 1's on this block, we would eliminate
them. The situation is now:
Now, 2 is the least tuple at the fronts of the lists, and in fact it happens
to appear on each list. We write one copy of 2 to the output and eliminate
2's from the in-memory blocks. The block from R1 is exhausted and the next
block from that sublist is brought to memory. That block has 2's, which are
eliminated, again exhausting the block from R1. The third block from that
sublist is brought to memory, and its 2 is eliminated. The present situation is:

Now, 3 is selected as the least tuple, one copy of 3 is written to the output,
and the blocks from R2 and R3 are exhausted and replaced from disk, leaving:
To complete the example, 4 is next selected, consuming most of list R2. At the
final step, each list happens to consist of a single 5, which is output once and
eliminated from the input buffers. □
3. B(R) to read each block from the sublists at the appropriate time.

Thus, the total cost of this algorithm is 3B(R), compared with B(R) for the
single-pass algorithm of Section 15.2.2.

On the other hand, we can handle much larger files using the two-pass
algorithm than with the one-pass algorithm. Assuming M blocks of memory
are available, we create sorted sublists of M blocks each. For the second pass,
we need one block from each sublist in main memory, so there can be no more
than M sublists, each M blocks long. Thus, B ≤ M² is required for the two-
pass algorithm to be feasible, compared with B ≤ M for the one-pass algorithm.
Put another way, to compute δ(R) with the two-pass algorithm requires only
√B(R) blocks of main memory, rather than B(R) blocks.
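A compact sketch of the second (merge) pass for δ in Java, assuming the sorted sublists already exist on disk and are presented here as sorted iterators; the Head record and the priority queue stand in for the one buffered block per sublist:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class TwoPassDistinct {
        // One entry per sorted sublist: the current front tuple plus its iterator.
        record Head(int value, Iterator<Integer> rest) {}

        public static List<Integer> mergeDistinct(List<Iterator<Integer>> sublists) {
            PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value(), b.value()));
            for (Iterator<Integer> it : sublists)
                if (it.hasNext()) heap.add(new Head(it.next(), it));

            List<Integer> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                int least = heap.peek().value();
                out.add(least); // emit one copy of the least remaining tuple
                // Remove every copy of 'least' from the fronts of all sublists.
                while (!heap.isEmpty() && heap.peek().value() == least) {
                    Head h = heap.poll();
                    if (h.rest().hasNext()) heap.add(new Head(h.rest().next(), h.rest()));
                }
            }
            return out;
        }
    }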
2. Use one main-memory buffer for each sublist, and initially load the first
block of each sublist into its buffer.

3. Repeatedly find the least value of the sort key (grouping attributes)
present among the first available tuples in the buffers. This value, v,
becomes the next group, for which we:

(a) Prepare to compute all the aggregates on list L for this group. As
in Section 15.2.2, use a count and sum in place of an average.
(b) Examine each of the tuples with sort key v, and accumulate the
needed aggregates.

(c) If a buffer becomes empty, replace it with the next block from the
same sublist.

When there are no more tuples with sort key v available, output a tuple
consisting of the grouping attributes of L and the associated values of the
aggregations we have computed for the group.

As for the δ algorithm, this two-pass algorithm for γ takes 3B(R) disk I/O's,
and will work as long as B(R) ≤ M².
1. Repeatedly bring M blocks of R into main memory, sort their tuples, and
write the resulting sorted sublist back to disk.
3. Use one main-memory buffer for each sublist of R and S. Initialize each
with the first block from the corresponding sublist.

4. Repeatedly find the first remaining tuple t among all the buffers. Copy
t to the output, and remove from the buffers all copies of t (if R and S
are sets there should be at most two copies). If a buffer becomes empty,
reload it with the next block from its sublist.
We observe that each tuple of R and S is read twice into main memory,
once when the sublists are being created, and the second time as part of one of
the sublists. The tuple is also written to disk once, as part of a newly formed
sublist. Thus, the cost in disk I/O's is 3(B(R) + B(S)).

The algorithm works as long as the total number of sublists among the two
relations does not exceed M, because we need one buffer for each sublist. Since
each sublist is M blocks long, that says the sizes of the two relations must not
exceed M²; that is, B(R) + B(S) ≤ M².
Example 15.6: Let us make the same assumptions as in Example 15.5: M =
3, tuples are integers, and two tuples fit in a block. The data will be almost
the same as in that example as well. However, here we need two arguments, so
we shall assume that R has 12 tuples and S has 5 tuples. Since main memory
can fit six tuples, in the first pass we get two sublists from R, which we shall
call R1 and R2, and only one sorted sublist from S, which we refer to as S1.²
After creating the sorted sublists (from unsorted relations similar to the data
from Example 15.5), the situation is:

Suppose we want to take the bag difference R −B S. We find that the least
tuple among the main-memory buffers is 1, so we count the number of 1's among
the sublists of R and among the sublists of S. We find that 1 appears once in R
and twice in S. Since 1 does not appear more times in R than in S, we do not
output any copies of tuple 1. Since the first block of S1 was exhausted counting
1's, we loaded the next block of S1, leaving the following situation:

² Since S fits in main memory, we could actually use the one-pass algorithms of Section
15.2.3, but we shall use the two-pass approach for illustration.
We now find that 2 is the least remaining tuple, so we count the number
of its occurrences in R, which is five occurrences, and we count the number of
its occurrences in S, which is one. We thus output tuple 2 four times. As we
perform the counts, we must reload the buffer for R1 twice, which leaves:

The analysis of this family of algorithms is the same as for the set-union
algorithm described in Section 15.4.3:
relation joins with every tuple of the other relation. In this situation, there is
really no choice but to take a nested-loop join of the two sets of tuples with a
common value in the join-attribute(s).

To avoid facing this situation, we can try to reduce main-memory use for
other aspects of the algorithm, and thus make available a large number of buffers
to hold the tuples with a given join-attribute value. In this section we shall
discuss the algorithm that makes the greatest possible number of buffers available
for joining tuples with a common value. In Section 15.4.7 we consider another
sort-based algorithm that uses fewer disk I/O's, but can present problems when
there are large numbers of tuples with a common join-attribute value.
Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main
memory for buffers, we do the following:

1. Sort R, using a two-phase, multiway merge sort, with Y as the sort key.

2. Sort S similarly.

3. Merge the sorted R and S. We generally use only two buffers, one for the
current block of R and the other for the current block of S. The following
steps are done repeatedly:

(a) Find the least value y of the join attributes Y that is currently at
the front of the blocks for R and S.

(b) If y does not appear at the front of the other relation, then remove
the tuple(s) with sort key y.

(c) Otherwise, identify all the tuples from both relations having sort key
y. If necessary, read blocks from the sorted R and/or S, until we are
sure there are no more y's in either relation. As many as M buffers
are available for this purpose.

(d) Output all the tuples that can be formed by joining tuples from R
and S with a common Y-value y.

(e) If either relation has no more unconsidered tuples in main memory,
reload the buffer for that relation.
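A minimal sketch of the merge phase (step 3) in Java, assuming R and S have already been sorted on Y and are presented as in-memory lists of (y, payload) pairs; buffering and reloading are elided, so this shows only the logic of steps (a)-(e):

    import java.util.ArrayList;
    import java.util.List;

    public class SortMergeJoin {
        record Tuple(int y, String payload) {} // illustrative shape of a sorted tuple

        public static List<String> merge(List<Tuple> r, List<Tuple> s) {
            List<String> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < r.size() && j < s.size()) {
                int y = Math.min(r.get(i).y(), s.get(j).y()); // least Y at the fronts
                if (r.get(i).y() != y) { j = skip(s, j, y); continue; } // y only in S
                if (s.get(j).y() != y) { i = skip(r, i, y); continue; } // y only in R
                // Gather all tuples with this common y from both relations...
                int i2 = skip(r, i, y), j2 = skip(s, j, y);
                // ...and output their pairwise joins.
                for (int a = i; a < i2; a++)
                    for (int b = j; b < j2; b++)
                        out.add(r.get(a).payload() + "|" + y + "|" + s.get(b).payload());
                i = i2; j = j2;
            }
            return out;
        }

        private static int skip(List<Tuple> t, int k, int y) {
            while (k < t.size() && t.get(k).y() == y) k++;
            return k;
        }
    }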
Example 15.7: Let us consider the relations R and S from Example 15.4.
Recall these relations occupy 1000 and 500 blocks, respectively, and there are
M = 101 main-memory buffers. When we use two-phase, multiway merge sort
on a relation, we do four disk I/O's per block, two in each of the two phases.
Thus, we use 4(B(R) + B(S)) disk I/O's to sort R and S, or 6000 disk I/O's.

When we merge the sorted R and S to find the joined tuples, we read each
block of R and S a fifth time, using another 1500 disk I/O's. In this merge we
generally need only two of the 101 blocks of memory. However, if necessary, we
could use all 101 blocks to hold the tuples of R and S that share a common
Y-value y. Thus, it is sufficient that for no y do the tuples of R and S that
have Y-value y together occupy more than 101 blocks.
If there is a Y-value y for which the number of tuples with this Y-value does
not fit in M buffers, then we need to modify the above algorithm.

1. If the tuples from one of the relations, say R, that have Y-value y fit in
M − 1 buffers, then load these blocks of R into buffers, and read the blocks
of S that hold tuples with y, one at a time, into the remaining buffer. In
effect, we do the one-pass join of Section 15.2.3 on only the tuples with
Y-value y.

Note that in either case, it may be necessary to read blocks from one relation
and then ignore them, having to read them later. For example, in case (1), we
might first read the blocks of S that have tuples with Y-value y and find that
there are too many to fit in M − 1 buffers. However, if we then read the tuples
of R with that Y-value we find that they do fit in M − 1 buffers.
1. Create sorted sublists of size M, using Y as the sort key, for both R and
S.

2. Bring the first block of each sublist into a buffer; we assume there are no
more than M sublists in all.

3. Repeatedly find the least Y-value y among the first available tuples of all
the sublists. Identify all the tuples of both relations that have Y-value
y, perhaps using some of the M available buffers to hold them, if there
are fewer than M sublists. Output the join of all tuples from R with all
tuples from S that share this common Y-value. If the buffer for one of
the sublists is exhausted, then replenish it from disk.
Example 15.8: Let us again consider the problem of Example 15.4: joining
relations R and S of sizes 1000 and 500 blocks, respectively, using 101 buffers.
We divide R into 10 sublists and S into 5 sublists, each of length 100, and
sort them.³ We then use 15 buffers to hold the current blocks of each of the
sublists. If we face a situation in which many tuples have a fixed Y-value, we
can use the remaining 86 buffers to store these tuples, but if there are more
tuples than that we must use a special algorithm such as was discussed at the
end of Section 15.4.5.

Assuming that we do not need to modify the algorithm for large groups of
tuples with the same Y-value, then we perform three disk I/O's per block of
data. Two of those are to create the sorted sublists. Then, every block of every
sorted sublist is read into main memory one more time in the multiway merging
process. Thus, the total number of disk I/O's is 4500. □
This sort-join algorithm is more efficient than the algorithm of Section 15.4.5
when it can be used. As we observed in Example 15.8, the number of disk I/O's
is 3(B(R) + B(S)). We can perform the algorithm on data that is almost as
large as that of the previous algorithm. The sizes of the sorted sublists are
M blocks, and there can be at most M of them among the two lists. Thus,
B(R) + B(S) ≤ M² is sufficient.

³ Technically, we could have arranged for the sublists to have length 101 blocks each, with
the last sublist of R having 91 blocks and the last sublist of S having 96 blocks, but the costs
would turn out exactly the same.
We might wonder whether we can avoid the trouble that arises when there
are many tuples with a common Y-value. Some important considerations are:

1. Sometimes we can be sure the problem will not arise. For example, if Y
is a key for R, then a given Y-value y can appear only once among all the
blocks of the sublists for R. When it is y's turn, we can leave the tuple
from R in place and join it with all the tuples of S that match. If blocks of
S's sublists are exhausted during this process, they can have their buffers
reloaded with the next block, and there is never any need for additional
space, no matter how many tuples of S have Y-value y. Of course, if Y
is a key for S rather than R, the same argument applies with R and S
switched.

2. If B(R) + B(S) is much less than M², we shall have many unused buffers
for storing tuples with a common Y-value, as we suggested in Example 15.8.

3. If all else fails, we can use a nested-loop join on just the tuples with a
common Y-value, using extra disk I/O's but getting the job done correctly.
This option was discussed in Section 15.4.5.
    Operators   Approximate M required   Disk I/O          Section
    γ, δ        √B                       3B                15.4.1, 15.4.2
    ∪, ∩, −     √(B(R) + B(S))           3(B(R) + B(S))    15.4.3, 15.4.4
    ⋈           √(max(B(R), B(S)))       5(B(R) + B(S))    15.4.5
    ⋈           √(B(R) + B(S))           3(B(R) + B(S))    15.4.7

Figure 15.11: Main memory and disk I/O requirements for sort-based algo-
rithms
b) Show the behavior of the two-pass grouping algorithm computing the
relation γa,AVG(b)(R). Relation R(a, b) consists of the thirty tuples t0
through t29, and the tuple ti has i modulo 5 as its grouping component
a, and i as its second component b.
Exercise 15.4.2: For each of the operations below, write an iterator that uses
the algorithm described in this section.

* a) Distinct (δ).
b) Grouping (γL).
* c) Set intersection.
d) Bag difference.
e) Natural join.
a) Set union.
! Exercise 15.4.5: In Example 15.7 we discussed the join of two relations R and
S, with 1000 and 500 blocks, respectively, and M = 101. However, we pointed
out that there would be additional disk I/O's if there were so many tuples with
a given value that neither relation's tuples could fit in main memory. Calculate
the total number of disk I/O's needed if:

* a) There are only two Y-values, each appearing in half the tuples of R and
half the tuples of S (recall Y is the join attribute or attributes).
* a) δ.
b) γ.
duplicates from each Ri in turn and write out the resulting unique tuples.
This method will work as long as the individual Ri's are sufficiently small to
fit in main memory and thus allow a one-pass algorithm. Since we assume the
hash function h partitions R into equal-sized buckets, each Ri will be approximately
B(R)/(M − 1) blocks in size. If that number of blocks is no larger than
M, i.e., B(R) ≤ M(M − 1), then the two-pass, hash-based algorithm will work.
In fact, as we discussed in Section 15.2.2, it is only necessary that the number
of distinct tuples in one bucket fit in M buffers, but we cannot be sure that
there are any duplicates at all. Thus, a conservative estimate, with a simple
form in which M and M − 1 are considered the same, is B(R) ≤ M², exactly
as for the sort-based, two-pass algorithm for δ.
The number of disk I/O's is also similar to that of the sort-based algorithm.
We read each block of R once as we hash its tuples, and we write each block
of each bucket to disk. We then read each block of each bucket again in the
one-pass algorithm that focuses on that bucket. Thus, the total number of disk
I/O's is 3B(R).
Example 15.9: Let us renew our discussion of the two relations R and S from
Example 15.4, whose sizes were 1000 and 500 blocks, respectively, and for which
101 main-memory buffers are made available. We may hash each relation to
100 buckets, so the average size of a bucket is 10 blocks for R, and 5 blocks
for S. Since the smaller number, 5, is much less than the number of available
buffers, we expect to have no trouble performing a one-pass join on each pair
of buckets.

The number of disk I/O's is 1500 to read each of R and S while hashing
into buckets, another 1500 to write all the buckets to disk, and a third 1500 to
read each pair of buckets into main memory again while taking the one-pass
join of corresponding buckets. Thus, the number of disk I/O's required is 4500,
just as for the efficient sort-join of Section 15.4.7. □
⁴ Sometimes, the term "hash-join" is reserved for the variant of the one-pass join algorithm
of Section 15.2.3 in which a hash table is used as the main-memory search structure. Then,
the two-pass hash-join algorithm described here is called "partition hash-join."
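A minimal sketch of the partitioning pass of this hash-join in Java, assuming the join key is extracted by a caller-supplied function (an illustrative assumption); each pair of corresponding buckets would then be joined with the one-pass algorithm, e.g. the OnePassJoin sketch shown earlier:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    public class HashPartition {
        // First pass: hash every tuple of a relation into M-1 buckets, using one
        // output buffer per bucket (the buffering itself is elided in this sketch).
        public static List<List<String>> partition(Iterable<String> relation, int m,
                                                   Function<String, String> key) {
            int buckets = m - 1; // one buffer is reserved for reading the input
            List<List<String>> out = new ArrayList<>();
            for (int i = 0; i < buckets; i++) out.add(new ArrayList<>());
            for (String t : relation) {
                int b = Math.floorMod(key.apply(t).hashCode(), buckets);
                out.get(b).add(t); // in a real system: a write to bucket b's file
            }
            return out;
        }
        // Second pass (not shown): for each i, one-pass join of R_i with S_i,
        // which works provided min(B(R_i), B(S_i)) fits in M-1 buffers.
    }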
The argument for the latter point is the same as for the other binary operations:
one of each pair of buckets must fit in M − 1 buffers.
Example 15.10: Consider the problem of Example 15.4, where we had to join
relations R and S, of 1000 and 500 blocks, respectively, using M = 101. If we
use a hybrid hash-join, then we want k, the number of buckets, to be about
500/101. Suppose we pick k = 5. Then the average bucket will have 100 blocks
of S's tuples. If we try to fit one of these buckets and four extra blocks for the
other four buckets, we need 104 blocks of main memory, and we cannot take
the chance that the in-memory bucket will overflow memory.

Thus, we are advised to choose k = 6. Now, when hashing S on the first
pass, we have five buffers for five of the buckets, and we have up to 96 buffers
for the in-memory bucket, whose expected size is 500/6 or 83. The number
of disk I/O's we use for S on the first pass is thus 500 to read all of S, and
500 − 83 = 417 to write five buckets to disk. When we process R on the first
pass, we need to read all of R (1000 disk I/O's) and write 5 of its 6 buckets
(833 disk I/O's).

On the second pass, we read all the buckets written to disk, or 417 + 833 =
1250 additional disk I/O's. The total number of disk I/O's is thus 1500 to read
R and S, 1250 to write 5/6 of these relations, and another 1250 to read those
tuples again, or 4000 disk I/O's. This figure compares with the 4500 disk I/O's
needed for the straightforward hash-join or sort-join. □
really depend on the number of duplicates and groups, respectively, rather than
on the number of tuples in the argument relation.
    Operators   Approximate M required   Disk I/O                      Section
    ∪, ∩, −     √B(S)                    3(B(R) + B(S))                15.5.4
    ⋈           √B(S)                    (3 − 2M/B(S))(B(R) + B(S))    15.5.6

Figure 15.13: Main memory and disk I/O requirements for hash-based algo-
rithms; for binary operations, assume B(S) ≤ B(R)
Notice that the requirements for sort-based and the corresponding hash-
based algorithms are almost the same. The significant differences between the
two approaches are:
! Exercise 15.5.5: Suppose that we are using a disk where the time to move
the head to a block is 100 milliseconds, and it takes 1/2 millisecond to read
one block of disk:
b) How much time does the disk I/O take if we use a hybrid hash-join as
described in Example 15.10?
c) How much time does a sort-based join take under the same conditions,
assuming we write sorted sublists to consecutive blocks of disk?
be spread all over the file unless the values of a and b are very closely correlated.
□

¹ Technically, if the index is on a key for the relation, so only one tuple with a given value
in the index key exists, then the index is always "clustering," even if the relation is not
clustered. However, if there is only one tuple per index-key value, then there is no advantage
from clustering, and the performance measure for such an index is the same as if it were
considered nonclustering.
Figure 15.14: A clustering index has all tuples with a fixed value packed into
(close to) the minimum possible number of blocks
1. Often, the index is not kept entirely in main memory, and therefore some
disk I/O's are needed to support the index lookup.

2. Even though all the tuples with a = v might fit in b blocks, they could
be spread over b + 1 blocks because they don't start at the beginning of
a block.

3. Although the index is clustering, the tuples with a = v may be spread
over several extra blocks. Two reasons why that situation might occur
are:

(a) We might not pack blocks of R as tightly as possible because we
want to leave room for growth of R, as discussed in Section 13.1.6.

(b) R might be stored with some other tuples that do not belong to R,
say in a clustered-file organization.
Notions of Clustering

We have seen three different, although related, concepts called "clustering"
or "clustered."
I/O's. That is, we must retrieve every block of R.

2. If R is not clustered and we do not use the index, then the cost is 20,000
disk I/O's.

3. If V(R, a) = 100 and the index is clustering, then the index-based algorithm
uses 1000/100 = 10 disk I/O's.

4. If V(R, a) = 10 and the index is nonclustering, then the index-based
algorithm uses 20,000/10 = 2000 disk I/O's. Notice that this cost is
higher than scanning the entire relation R, if R is clustered but the index
is not.

5. If V(R, a) = 20,000, i.e., a is a key, then the index-based algorithm takes 1
disk I/O plus whatever is needed to access the index, regardless of whether
the index is clustering or not.
□
Index-scan as an access method can help in several other kinds of selection
operations.
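The cases above can be captured in a small cost helper; here is a sketch in Java under the same assumptions (the method and parameter names are illustrative, not from the text):

    public class IndexScanCost {
        // Estimated disk I/O's for sigma_{a=v}(R) via an index on R.a.
        // b: B(R) blocks, t: T(R) tuples, v: V(R, a) distinct a-values.
        static long indexSelectCost(long b, long t, long v, boolean clustering) {
            return clustering
                    ? Math.max(1, b / v)   // ~B(R)/V(R,a) blocks hold the matches
                    : Math.max(1, t / v);  // ~one I/O per matching tuple
        }

        public static void main(String[] args) {
            System.out.println(indexSelectCost(1000, 20000, 100, true));   // 10
            System.out.println(indexSelectCost(1000, 20000, 10, false));   // 2000
            System.out.println(indexSelectCost(1000, 20000, 20000, true)); // 1
        }
    }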
can then perform an ordinary sort-join, but we do not have to perform the
intermediate step of sorting one of the relations on Y.

As an extreme case, if we have sorting indexes on Y for both R and S,
then we need to perform only the final step of the simple sort-based join of
Section 15.4.5. This method is sometimes called zig-zag join, because we jump
back and forth between the indexes finding Y-values that they share in common.
Notice that tuples from R with a Y-value that does not appear in S need never
be retrieved, and similarly, tuples of S whose Y-value does not appear in R
need not be retrieved.

If the indexes are B-trees, then we can scan the leaves of the two B-trees in
order from the left, using the pointers from leaf to leaf that are built into the
structure, as suggested in Fig. 15.15. If R and S are clustered, then retrieval of
all the tuples with a given key will result in a number of disk I/O's proportional
to the fractions of these two relations read. Note that in extreme cases, where
there are so many tuples from R and S that neither fits in the available main
memory, we shall have to use a fixup like that discussed in Section 15.4.5.
However, in typical cases, the step of joining all tuples with a common Y-value
can be carried out with only as many disk I/O's as it takes to read them.
Example 15.15: Let us continue with Example 15.13, to see how joins using
a combination of sorting and indexing would typically perform on this data.
First, assume that there is an index on Y for S that allows us to retrieve the
tuples of S sorted by Y. We shall, in this example, also assume both relations
and the index are clustered. For the moment, we assume there is no index on
R.

Assuming 101 available blocks of main memory, we may use them to create
10 sorted sublists for the 1000-block relation R. The number of disk I/O's is
2000 to read and write all of R. We next use 11 blocks of memory — 10 for
the sublists of R and one for a block of S's tuples, retrieved via the index. We
neglect disk I/O's and memory buffers needed to manipulate the index, but if
the index is a B-tree, these numbers will be small anyway. In this second pass,
we read all the tuples of R and S, using a total of 1500 disk I/O's, plus the small
amount needed for reading the index blocks once each. We thus estimate the
total number of disk I/O's at 3500, which is less than that for other methods
considered so far.

Now, assume that both R and S have indexes on Y. Then there is no need
to sort either relation. We use just 1500 disk I/O's to read the blocks of R
and S through their indexes. In fact, if we determine from the indexes alone
that a large fraction of R or S cannot match tuples of the other relation, then
the total cost could be considerably less than 1500 disk I/O's. However, in any
event we should add the small number of disk I/O's needed to read the indexes
themselves. □
c) δ(R).
Exercise 15.6.2: Suppose B(R) = 10,000 and T(R) = 500,000. Let there
be an index on R.a, and let V(R, a) = k for some number k. Give the cost
of σa=0(R), as a function of k, under the following circumstances. You may
neglect disk I/O's needed to access the index itself.
Exercise 15.6.3: Repeat Exercise 15.6.2 if the operation is the range query
σC≤a AND a≤D(R). You may assume that C and D are constants such that k/10
of the values are in the range.
! Exercise 15.6.4 : If R is clustered, but the index on R.a is not clustering, then
depending on k we may prefer to implement a query by performing a table-scan
of R or using the index. For what values of k would we prefer to use the index
if the relation and query are as in:
a) Exercise 15.6.2.
b) Exercise 15.6.3.
SELECT birthdate
FROM StarsIn, MovieStar
WHERE movieTitle = 'King Kong' AND starName = name;

and MovieStar, which can be implemented much as a natural join R ⋈ S. Since
there were only two movies named "King Kong," T(R) is very small. Suppose
that S, the relation MovieStar, has an index on name. Compare the cost of an
index-join for this R ⋈ S with the cost of a sort- or hash-based join.
Figure 15.16: The buffer manager responds to requests for main-memory access
to disk blocks
2. The buffer manager allocates buffers in virtual memory, allowing the operating
system to decide which buffers are actually in main memory at
any time and which are in the "swap space" on disk that the operating
system manages. Many "main-memory" DBMS's and "object-oriented"
DBMS's operate this way.
Whichever approach a DBMS uses, the same problem arises: the buffer
manager should limit the number of buffers in use so they fit in the available
main memory. When the buffer manager controls main memory directly, and
requests exceed available space, it has to select a buffer to empty, by returning
its contents to disk. If the buffered block has not been changed, then it may
simply be erased from main memory, but if the block has changed it must be
written back to its place on the disk. When the buffer manager allocates space
in virtual memory, it has the option to allocate more buffers than can fit in
main memory. However, if all these buffers are really in use, then there will
be "thrashing," a common operating-system problem, where many blocks are
moved in and out of the disk's swap space. In this situation, the system spends
most of its time swapping blocks, while very little useful work gets done.
Normally, the number of buffers is a parameter set when the DBMS is
initialized. We would expect that this number is set so that the buffers occupy
the available main memory, regardless of whether the buffers are allocated in
main or virtual memory. In what follows, we shall not concern ourselves with
which mode of buffering is used, and simply assume that there is a fixed-size
buffer pool, a set of buffers available to queries and other database actions.
The critical choice that the buffer manager must make is what block to throw
out of the buffer pool when a buffer is needed for a newly requested block. The
buffer-replacement strategies in common use may be familiar to you from other
applications of scheduling policies, such as in operating systems. These include:
First-In-First-Out (FIFO)

When a buffer is needed, under the FIFO policy the buffer that has been occupied
the longest by the same block is emptied and used for the new block.
In this approach, the buffer manager needs to know only the time at which the
block currently occupying a buffer was loaded into that buffer. An entry into a
table can thus be made when the block is read from disk, and there is no need
to modify the table when the block is accessed. FIFO requires less maintenance
than LRU, but it can make more mistakes. A block that is used repeatedly, say
the root block of a B-tree index, will eventually become the oldest block in a
buffer. It will be written back to disk, only to be reread shortly thereafter into
another buffer.
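A toy FIFO buffer pool in Java illustrating the policy; the class shape and block representation are illustrative assumptions, and dirty-page write-back is reduced to a comment:

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;

    public class FifoBufferPool {
        private final int capacity;
        private final ArrayDeque<Integer> loadOrder = new ArrayDeque<>(); // oldest first
        private final Map<Integer, byte[]> buffers = new HashMap<>();

        public FifoBufferPool(int capacity) { this.capacity = capacity; }

        public byte[] pin(int blockId) {
            byte[] b = buffers.get(blockId);
            if (b != null) return b; // a hit never reorders anything under FIFO
            if (buffers.size() == capacity) {
                int victim = loadOrder.poll(); // the block resident the longest
                buffers.remove(victim);        // (write back here if it were dirty)
            }
            b = readFromDisk(blockId);
            buffers.put(blockId, b);
            loadOrder.add(blockId); // recorded once, at load time, never updated
            return b;
        }

        private byte[] readFromDisk(int blockId) { return new byte[4096]; } // stub
    }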
System Control
The query processor or other components of a DBMS can give advice to the
buffer manager in order to avoid some of the mistakes that would occur with
Figure 15.17: The clock algorithm visits buffers in a round-robin fashion and
replaces 01···1 with 10···0
a strict policy such as LRU, FIFO, or Clock. Recall from Section 12.3.5 that
there are sometimes technical reasons why a block in main memory cannot
be moved to disk without first modifying certain other blocks that point to it.
These blocks are called "pinned," and any buffer manager has to modify its
buffer-replacement strategy to avoid expelling pinned blocks. This fact gives us
the opportunity to force other blocks to remain in main memory by declaring
them "pinned," even if there is no technical reason why they could not be
written to disk. For example, a cure for the problem with FIFO mentioned
above regarding the root of a B-tree is to "pin" the root, forcing it to remain in
memory at all times. Similarly, for an algorithm like a one-pass hash-join, the
query processor may "pin" the blocks of the smaller relation in order to assure
that it will remain in main memory during the entire time.
However, as we have seen, the buffer manager may not be willing or able to
guarantee the availability of these M buffers when the query is executed. There
are thus two related questions to ask about the physical operators:
2. When the expected M buffers are not available, and some blocks that are
expected to be in memory have actually been moved to disk by the buffer
manager, how does the buffer-replacement strategy used by the buffer
manager impact the number of additional I/O's that must be performed?
disk I/O's for R. Notice that even if k = 1 (i.e., no extra buffers are available
to R), we save one disk I/O per iteration. □
Other algorithms also are impacted by the fact that M can vary and by the
buffer-replacement strategy used by the buffer manager. Here are some useful
observations.
* a) A one-pass join.
! Exercise 15.7.2: How would the number of disk I/O's taken by a nested-loop
join improve if extra buffers became available and the buffer-replacement policy
were:
a) First-in-first-out.
or δ on R, then we modify the above so that at the final merge we perform the
operation on the tuples at the front of the sorted sublists. That is:

• For a δ, output one copy of each distinct tuple, and skip over copies of
the tuple.

• For a γ, sort on the grouping attributes only, and combine the tuples with
which is the size of each of the M pieces of R, cannot exceed s(M, k − 1). That
is:

    s(M, k) = M s(M, k − 1)

If we expand the above recursion, we find

    s(M, k) = M s(M, k − 1) = M² s(M, k − 2) = ··· = M^(k−1) s(M, 1) = M^k

since s(M, 1) = M.
disk I/O's to read the sorted sublists in the final pass. The result is a total of
(2k − 1)(B(R) + B(S)) disk I/O's.
memory and then read the second relation, one block at a time, into the Mth
buffer.
Let u(M, k) be the number of blocks in the largest relation that a k-pass hashing
algorithm can handle. We can define u recursively by:

BASIS: u(M, 1) = M, since the relation R must fit in M buffers; i.e., B(R) ≤
M.

INDUCTION: Each bucket must be handled in k − 1 passes; that is, the buckets
are of size u(M, k − 1). Since R is divided into M − 1 buckets, we conclude
that u(M, k) = (M − 1)u(M, k − 1).
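Unrolling both recurrences gives closed forms; a short derivation in LaTeX, consistent with the basis cases above:

    \[
    s(M,k) = M\,s(M,k-1),\quad s(M,1) = M
      \;\Longrightarrow\; s(M,k) = M^{k}
    \]
    \[
    u(M,k) = (M-1)\,u(M,k-1),\quad u(M,1) = M
      \;\Longrightarrow\; u(M,k) = M\,(M-1)^{k-1}
    \]

Thus a k-pass sort-based algorithm handles relations of up to about M^k blocks, and a k-pass hash-based algorithm up to M(M − 1)^(k−1), which is approximately M^k as well.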
! Exercise 15.8.2: There are several "tricks" we have discussed for improving
the performance of two-pass algorithms. For the following, tell whether the
trick could be used in a multipass algorithm, and if so, how?
Data Processing - Parallelism

This chapter contains the papers:

J. Dean and S. Ghemawat. MapReduce: a flexible data processing
tool. Commun. ACM 53, 1, pp. 72-77, 2010. DOI:
10.1145/1629175.1629198

• Explain the relationship between MapReduce and partitioned parallel
processing strategies.

Graefe's paper on interface design for parallel operators is given to deepen
understanding; however, it is to be considered as an additional reading and not
fundamental to the attainment of the learning goals above.
contributed articles

DOI: 10.1145/1629175.1629198

MapReduce: A Flexible Data Processing Tool

MapReduce advantages over parallel databases include storage-system
independence and fine-grain fault tolerance for large jobs.

BY JEFFREY DEAN AND SANJAY GHEMAWAT

MapReduce is a programming model for processing and generating large data
sets.4 Users specify a map function that processes a key/value pair to generate
a set of intermediate key/value pairs and a reduce function that merges all
intermediate values associated with the same intermediate key. We built a
system around this programming model in 2003 to simplify construction of the
inverted index for handling searches at Google.com. Since then, more than
10,000 distinct programs have been implemented using MapReduce at Google,
including algorithms for large-scale graph processing, text processing, machine
learning, and statistical machine translation. The Hadoop open source
implementation of MapReduce has been used extensively outside of Google by
a number of organizations.10,11

To help illustrate the MapReduce programming model, consider the problem
of counting the number of occurrences of each word in a large collection of
documents. The user would write code like the following pseudocode:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

The map function emits each word plus an associated count of occurrences
(just "1" in this simple example). The reduce function sums together all counts
emitted for a particular word.
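As a concrete, single-machine analogue of this pseudocode, here is a minimal Java sketch that simulates the map, shuffle, and reduce phases in memory; the phase structure mirrors the model, but the class and helper names are illustrative and not part of any MapReduce API:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordCount {
        public static Map<String, Integer> run(List<String> documents) {
            // Map phase: emit (word, 1) for every word of every document.
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String doc : documents)
                for (String w : doc.toLowerCase().split("\\W+"))
                    if (!w.isEmpty()) intermediate.add(Map.entry(w, 1));

            // Shuffle phase: group intermediate values by key.
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map.Entry<String, Integer> e : intermediate)
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

            // Reduce phase: sum the counts emitted for each word.
            Map<String, Integer> result = new HashMap<>();
            grouped.forEach((word, counts) ->
                    result.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
            return result;
        }

        public static void main(String[] args) {
            System.out.println(run(List.of("to be or not to be")));
            // prints something like {not=1, be=2, or=1, to=2}
        }
    }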
MapReduce automatically parallelizes and executes the program on a large
cluster of commodity machines. The runtime system takes care of the details
of partitioning the input data, scheduling the program's execution across a set
of machines, handling machine failures, and managing required inter-machine
communication. MapReduce allows programmers with no experience with
parallel and distributed systems to easily utilize the resources of a large
distributed system. A typical MapReduce computation processes many terabytes
of data on hundreds or thousands of machines. Programmers find the system
easy to use, and more than 100,000 MapReduce jobs are executed on Google's
clusters every day.

Compared to Parallel Databases

The query languages built into parallel database systems are also used to
express the type of computations supported by MapReduce. A 2009 paper
by Andrew Pavlo et al. (referred to here as the "comparison paper"13) compared
the performance of MapReduce and parallel databases. It evaluated
the open source Hadoop implementation10 of the MapReduce programming
model, DBMS-X (an unidentified commercial database system), and Vertica
(a column-store database system from a company co-founded by one of the
authors of the comparison paper). Earlier blog posts by some of the paper's
authors characterized MapReduce as "a major step backwards."5,6 In this
article, we address several misconceptions about MapReduce in these three
publications:

► MapReduce cannot use indices and implies a full scan of all input data;
► MapReduce input and outputs are always simple files in a file system; and
► MapReduce requires the use of inefficient textual data formats.

We also discuss other important issues:

► MapReduce is storage-system independent and can process data without
first requiring it to be loaded into a database. In many cases, it is possible to
run 50 or more separate MapReduce analyses in complete passes over the
data before it is possible to load the data into a database and complete a single
analysis;
► Complicated transformations are often easier to express in MapReduce
than in SQL; and
► Many conclusions in the comparison paper were based on implementation
and evaluation shortcomings not fundamental to the MapReduce model;
we discuss these shortcomings later in this article.

We encourage readers to read the original MapReduce paper4 and the
comparison paper13 for more context.

Heterogenous Systems

Many production environments contain a mix of storage systems. Customer
data may be stored in a relational database, and user requests may be logged
to a file system. Furthermore, as such environments evolve, data may migrate
to new storage systems. MapReduce provides a simple model for analyzing
data in such heterogenous systems. End users can extend MapReduce to
support a new storage system by defining simple reader and writer implementations
that operate on the storage system. Examples of supported storage
systems are files stored in distributed file systems,7 database query results,2,9
data stored in Bigtable,3 and structured input files (such as B-trees). A single
MapReduce operation easily processes and combines data from a variety of
storage systems.

Now consider a system in which a parallel DBMS is used to perform all
data analysis. The input to such analysis must first be copied into the parallel
DBMS. This loading phase is inconvenient. It may also be unacceptably slow,
especially if the data will be analyzed only once or twice after being loaded.
For example, consider a batch-oriented Web-crawling-and-indexing system
that fetches a set of Web pages and generates an inverted index. It seems
awkward and inefficient to load the set of fetched pages into a database just so
they can be read through once to generate an inverted index. Even if the cost of
loading the input into a parallel DBMS is acceptable, we still need an appropriate
loading tool. Here is another place MapReduce can be used; instead of
writing a custom loader with its own ad hoc parallelization and fault-tolerance
support, a simple MapReduce program can be written to load the data into the
parallel DBMS.

Indices

The comparison paper incorrectly said that MapReduce cannot take advantage
of pregenerated indices, leading to skewed benchmark results in the
paper. For example, consider a large data set partitioned into a collection
of nondistributed databases, perhaps using a hash function. An index can
be added to each database, and the result of running a database query using
this index can be used as an input to MapReduce. If the data is stored in
D database partitions, we will run D database queries that will become the
D inputs to the MapReduce execution. Indeed, some of the authors of Pavlo et
al. have pursued this approach in their more recent work.1

Another example of the use of indices is a MapReduce that reads from
Bigtable. If the data needed maps to a sub-range of the Bigtable row space, we
would need to read only that sub-range instead of scanning the entire Bigtable.
Furthermore, like Vertica and other column-store databases, we will read data
only from the columns needed for this analysis, since Bigtable can store data
segregated by columns.

Yet another example is the processing of log data within a certain date
range; see the Join task discussion in the comparison paper, where the Hadoop
benchmark reads through 155 million records to process the 134,000
records that fall within the date range of interest. Nearly every logging system
we are familiar with rolls over to a new log file periodically and embeds
the rollover time in the name of each log file. Therefore, we can easily run a
MapReduce operation over just the log files that may potentially overlap the
specified date range, instead of reading all log files.

Complex Functions

Map and Reduce functions are often fairly simple and have straightforward
SQL equivalents. However, in many cases, especially for Map functions, the
function is too complicated to be expressed easily in a SQL query, as in the
following examples:

► Extracting the set of outgoing links from a collection of HTML documents
and aggregating by target document;
► Stitching together overlapping satellite images to remove seams and to
select high-quality imagery for Google Earth;
► Generating a collection of inverted index files using a compression scheme
tuned for efficient support of Google search queries;
► Processing all road segments in the world and rendering map tile images
that display these segments for Google Maps; and
► Fault-tolerant parallel execution of programs written in higher-level languages
(such as Sawzall14 and Pig Latin12) across a collection of input data.

Conceptually, such user defined functions (UDFs) can be combined
with SQL queries, but the experience reported in the comparison paper indicates
that UDF support is either buggy (in DBMS-X) or missing (in Vertica).
These concerns may go away over the long term, but for now, MapReduce is a
better framework for doing more complicated tasks (such as those listed earlier)
than the selection and aggregation that are SQL's forte.
plicated tasks (such as those listed ear of protocol buffers uses an optim ized
lier) than the selection and aggregation binary representation that is more
that are SQL’ s forte. com pact and much faster to encode
and decode than the textual formats
Structured Data and Schemas used by the Hadoop benchmarks in the
Pavlo et al. did raise a good point that MapReduce is comparison paper. For example, the
schemas are helpful in allowing multi
ple applications to share the same data.
a highly effective automatically generated code to parse
a Rankings protocol buffer record
For example, consider the following and efficient runs in 20 nanoseconds per record as
schema from the comparison paper:
CREATE TABLE R a n k in g s (
tool for large-scale compared to the 1,731 nanoseconds
required per record to parse the tex
pageURL VARCHAR(IOO) fault-tolerant tual input format used in the Hadoop
PRIMARY KEY,
pa geR a n k INT.
data analysis. benchmark mentioned earlier. These
measurements were obtained on a JVM
a v g D u r a t io n INT ); running on a 2.4GHz Intel Core-2 Duo.
The Java code fragments used for the
The corresponding H adoop bench benchmark runs were:
marks in the com parison paper used
an inefficient and fragile textual for // F ragm en t 1: p r o t o c o l b u f
mat with different attributes separated f e r p a r s in g
by vertical bar characters: f o r ( in t i = 0; i < n u m ltera -
tio n s ; i++) {
1371h t t p ://www. s o m e h o s t .com/ r a n k i n g s .p a rse F ro m (v alu e);
i n d e x .h tm l 1602 p a g e r a n k = r a n k in g s , g e t -
P ageran k O ;
In contrast to ad hoc, inefficient }
formats, virtually all MapReduce op
erations at Google read and write data // F ra gm en t 2: t e x t f o r
in the Protocol Buffer format.8A high- mat p a r s i n g ( e x t r a c t e d from
level language describes the input and B ench m ark1j ava
output types, and compiler-generated // from t h e s o u r c e c o d e
code is used to hide the details o f en- p o s t e d b y P a v lo e t al.)
coding/decoding from application f o r ( in t i = 0; i < n u m ltera -
code. The corresponding protocol buf t i o n s ; i++) {
fer description for the Rankings data S t r i n g data[] = v a lu e . t o -
would be: S t r in g O .sp lit("\\| ” );
p a geran k = In teger.
m e s sa g e R a n k in g s { valu e0 f(d ata[0 ]);
r e q u i r e d s t r i n g p a g e u r l = 1: }
r e q u i r e d int32 p a gera n k = 2;
r e q u ire d int32 avgdu ration = 3; Given the factor o f an 80-fold dif
} ference in this record-parsing bench
mark, we suspect the absolute num
The following Map function frag bers for the H adoop benchmarks in
ment processes a Rankings record: the com parison paper are inflated and
cannot be used to reach conclusions
R a n k in g s r = new R a n k in g s 0) about fundamental differences in the
r .p a rse F ro m (v alu e) ; performance of MapReduce and paral
i f (r.getPagerankO > 10) { ... } lel DBMS.
Implementation tricks like batching, sorting, and grouping of intermediate data
and smart scheduling of reads are used by Google's MapReduce implementation
to mitigate these costs.

MapReduce implementations tend not to use a push model due to the
fault-tolerance properties required by Google's developers. Most MapReduce
executions over large data sets encounter at least a few failures; apart from

MapReduce provides fine-grain fault tolerance for large jobs; failure in the
middle of a multi-hour execution does not require restarting the job.

Reading unnecessary data. The comparison paper says, "MR is always forced
to start a query with a scan of the entire input file." MapReduce does not require
a full scan over the data; it requires only an implementation of its input
interface to yield a set of records that match some input specification. Examples of

format for structured data (protocol buffers) instead of inefficient textual
formats.

the benchmarks, starting with data in a collection of files on disk, it is possible
to run 50 separate MapReduce analyses over the data before it is possible to
load the data into a database and complete a single analysis. Long load times
may not matter if many queries will be run on the data after loading, but this
is often not the case; data sets are often generated, processed once or twice,
and then discarded. For example, the Web-search index-building system
described in the MapReduce paper4 is a sequence of MapReduce phases where
the output of most phases is consumed by one or two subsequent MapReduce
phases.

Conclusion

in a heterogenous system with many different storage systems. Third, MapReduce
provides a good framework for the execution of more complicated
functions than are supported directly in SQL.

References
1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A.
HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for
analytical workloads. In Proceedings of the Conference on Very Large Databases
(Lyon, France, 2009); http://db.cs.yale.edu/hadoopdb/
2. Aster Data Systems, Inc. In-Database MapReduce for Rich Analytics;
http://www.asterdata.com/product/mapreduce.php.
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M.,
Chandra, T., Fikes, A., and Gruber, R.E. Bigtable: A distributed storage system
for structured data. In Proceedings of the Seventh Symposium on Operating
System Design and Implementation (Seattle, WA, Nov. 6-8). Usenix Association,
2006; http://labs.google.com/papers/
Encapsulation of Parallelism
in the Volcano Query Processing System

Goetz Graefe
University of Colorado
Boulder, CO 80309-0430
graefe@boulder.colorado.edu
Abstract

Volcano is a new dataflow query processing system we have developed for database systems research and education.
The uniform interface between operators makes Volcano extensible by new operators. All operators are designed and coded as
if they were meant for a single-process system only. When attempting to parallelize Volcano, we had to choose between two
models of parallelization, called here the bracket and operator models. We describe the reasons for not choosing the bracket
model, introduce the novel operator model, and provide details of Volcano's exchange operator that parallelizes all other operators.
It allows intra-operator parallelism on partitioned datasets and both vertical and horizontal inter-operator parallelism. The
exchange operator encapsulates all parallelism issues and therefore makes implementation of parallel database algorithms significantly
easier and more robust. Included in this encapsulation is the translation between demand-driven dataflow within
processes and data-driven dataflow between processes. Since the interface between Volcano operators is similar to the one
used in "real," commercial systems, the techniques described here can be used to parallelize other query processing engines.
2. Previous Work
Since so many different system have been developed
to process large dataset efficiently, we only survey the sys
tems that have strongly influenced the design o f Volcano
At the start in 1987, we felt that some decisions in
WiSS [11] and GAMMA [12] were not optimal for perfor
mance or generality For instance, the decisions to protect
W iSS’
s butfer space by copying a data record in or out for
each request and to re-request a buffer page for every
record during a scan seemed to inflict too much overhead1
However, many of the design decisions in Volcano were
strongly influenced by experiences with WiSS and
GAMMA The design of the data exchange mechanism
between operators, the focus o f this paper, is one o f the
few radical departures from GAM M A’ s design
During the design o f the EXODUS storage manager
[10], many o f these issues were revisited Lessons learned
and tradeoffs explored in these discussions certainly helped
form the ideas behind Volcano The development o f E [24]
influenced the strong emphasis on iterators for query pro
cessing The design o f GENESIS [5] emphasized the
importance o f a uniform iterator interface
Finally, a number o f conventional (relational) and
extensible systems have influenced our design Without
further discussion, we mention Ingres [27], System R [3],
Bubba [2], Starburst [26], Postgres [28], and XPRS [29]
Furthermore, there has been a large amount o f research and
development in the database machine area, such that there
is an international workshop on the topic Almost all data
base machine proposals and unplementauons utilize parallel
ism in som e form We certainly have learned from this
work and tned to include its lessons in the design and Figure 1 Bracket M odel o f Parallelization.
implementation o f Volcano In particular, we have strived
for simplicity in the design, mechanisms that can support a
In the bracket model, there is a generic process tem
multitude o f policies, and efficiency in all details We
plate that can receive and send data and can execute
believe that the query execution engme should provide
exactly one operator at any point o f time A schematic
mechanisms, and that the query optimizer should incorporate
diagram o f such a template process is shown in Figure 1
and decide on policies
Independently of our work, Tandem Computers has designed an operator called the parallel operator which is very similar to Volcano's exchange operator. It has proven useful in Tandem's query execution engine [14], but is not yet documented in the open literature. We learned about this operator through one of the referees. Furthermore, the distributed database system R* used a technique similar to ours to transfer data between nodes [31]. However, this operation was used only to effect data transfer and did not support data or intra-operator parallelism.
2.1. The Bracket Model of Parallelization

When attempting to parallelize existing single-process Volcano software, we considered two paradigms or models of parallelization. The first one, which we call the bracket model, has been used in a number of systems, for example GAMMA [12] and Bubba [2]. The second one, which we call the operator model, is novel and is described in detail in Section 4.

[Figure 1: Bracket Model of Parallelization, showing a generic template process wrapping a single operator such as join or aggregation; graphic not reproduced in this scan.]

In the bracket model, there is a generic process template that can receive and send data and can execute exactly one operator at any point of time. A schematic diagram of such a template process is shown in Figure 1 with two possible operators, join and aggregation. The code that makes up the generic template invokes the operator, which then controls execution; network I/O on the receiving and sending sides is performed as a service to the operator on request, implemented as procedures to be called by the operator. The number of inputs that can be active at any point of time is limited to two, since there are only unary and binary operators in most database systems. The operator is surrounded by generic template code which shields it from its environment, for example the operator(s) that produce its input and consume its output.

One problem with the bracket model is that each locus of control needs to be created. This is typically done by a separate scheduler process, requiring software development beyond the actual operators, both initially and for each extension to the set of query processing algorithms. Thus, the bracket model seems unsuitable for an extensible system.
In a query processing system using the bracket model, operators are coded in such a way that network I/O is their only means of obtaining input and delivering output (with the exception of scan and store operators). The reason is that each operator is its own locus of control, and network flow control must be used to coordinate multiple operators, e.g., to match two operators' speed in a producer-consumer relationship. Unfortunately, this also means that passing a data item from one operator to another always involves expensive inter-process communication (IPC) system calls, even in the cases when an entire query is evaluated on a single machine (and could therefore be evaluated without IPC in a single process) or when data do not need to be repartitioned among the nodes in a network. An example of the latter is the three-way join query "joinCselAselB" in the Wisconsin Benchmark [6,9], which uses the same join attribute for both two-way joins. Thus, in queries with multiple operators (meaning almost all queries), IPC and its overhead are mandatory rather than optional.

In most (single-process) query processing engines, operators schedule each other much more efficiently by means of procedure calls rather than system calls. The concepts and methods needed for operators to schedule each other using procedure calls are the subject of the next section.
3. Volcano System Design

In this section, we provide an overview of the modules in Volcano. Volcano's file system is rather conventional. It includes modules to manage devices, buffer pools, files, records, and B+-trees. For a detailed discussion, we refer to [17].

The file system routines are used by the query processing routines to evaluate complex query plans. Queries are expressed as complex algebra expressions; the operators of this algebra are query processing algorithms. All algebra operators are implemented as iterators, i.e., they support a simple open-next-close protocol similar to conventional file scans.
Associated with each algorithm is a state record. The arguments for the algorithms are kept in the state record. All operations on records, e.g., comparisons and hashing, are performed by support functions which are given in the state records as arguments to the iterators. Thus, the query processing modules could be implemented without knowledge of or constraint on the internal structure of data objects.

In queries involving more than one operator (i.e., almost all queries), state records are linked together by means of input pointers. The input pointers are also kept in the state records. They are pointers to a QEP structure that consists of four pointers: the entry points of the three procedures implementing the operator (open, next, and close) and a pointer to a state record. All state information for an iterator is kept in its state record; thus, an algorithm may be used multiple times in a query by including more than one state record in the query plan. An operator does not need to know what kind of operator produces its input, or whether its input comes from a complex query tree or from a simple file scan. We call this concept anonymous inputs or streams. Streams are a simple but powerful abstraction that allows combining any number of operators to evaluate a complex query. Together with the iterator control paradigm, streams represent the most efficient execution model in terms of time (overhead for synchronizing operators) and space (number of records that must reside in memory at any point of time) for single-process query evaluation.
Calling open for the top-most operator results in instantiations of the associated state record, e.g., allocation of a hash table, and in open calls for all inputs. In this way, all iterators in a query are initiated recursively. In order to process the query, next for the top-most operator is called repeatedly until it fails with an end-of-stream indicator. Finally, the close call recursively "shuts down" all iterators in the query. This model of query execution matches very closely the one being included in the E programming language design [24] and the algebraic query evaluation system of the Starburst extensible relational database system [22].

The tree-structured query evaluation plan is used to execute queries by demand-driven dataflow. The return value of next is, besides a status value, a structure called NEXT_RECORD that consists of a record identifier and a record address in the buffer pool. This record is pinned (fixed) in the buffer. The protocol for fixing and unfixing records is as follows. Each record pinned in the buffer is owned by exactly one operator at any point in time. After receiving a record, the operator can hold on to it for a while, e.g., in a hash table, unfix it, e.g., when a predicate fails, or pass it on to the next operator. Complex operations like join that create new records have to fix them in the buffer before passing them on, and have to unfix their input records.
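To make the iterator convention concrete, the following minimal C sketch shows one way the QEP structure, state records, and the recursive open-next-close protocol could fit together. All names and the toy scan and filter operators are illustrative assumptions, not Volcano's actual declarations; real operators return NEXT_RECORD structures rather than integers.

/* A minimal, self-contained sketch of the open-next-close iterator
 * protocol described above. All names (QEP, scan, filter) are
 * assumptions for illustration, not Volcano's actual declarations. */
#include <stdio.h>

typedef struct qep QEP;
struct qep {
    int  (*open)(QEP *);
    int  (*next)(QEP *, int *out);  /* returns 0 on end-of-stream */
    void (*close)(QEP *);
    void *state;                    /* the operator's state record */
    QEP  *input;                    /* anonymous input stream      */
};

/* Scan: produces the integers 0..limit-1; its state record is a counter. */
typedef struct { int current, limit; } SCAN_STATE;
static int  scan_open(QEP *q)           { ((SCAN_STATE *)q->state)->current = 0; return 1; }
static int  scan_next(QEP *q, int *out) {
    SCAN_STATE *s = q->state;
    if (s->current >= s->limit) return 0;   /* end-of-stream indicator */
    *out = s->current++;
    return 1;
}
static void scan_close(QEP *q)          { (void)q; }

/* Filter: passes only even records. Its open and close recurse into
 * the input, so opening the top-most operator initiates the whole tree. */
static int  filter_open(QEP *q)  { return q->input->open(q->input); }
static int  filter_next(QEP *q, int *out) {
    while (q->input->next(q->input, out))
        if (*out % 2 == 0) return 1;   /* predicate holds: pass record on */
    return 0;                          /* input exhausted */
}
static void filter_close(QEP *q) { q->input->close(q->input); }

int main(void) {
    SCAN_STATE s = { 0, 10 };
    QEP scan   = { scan_open,   scan_next,   scan_close,   &s,   NULL  };
    QEP filter = { filter_open, filter_next, filter_close, NULL, &scan };
    int record;
    filter.open(&filter);                    /* opens recursively  */
    while (filter.next(&filter, &record))    /* demand-driven pull */
        printf("%d\n", record);
    filter.close(&filter);                   /* closes recursively */
    return 0;
}

Note how the filter never learns what produces its input; it only holds an anonymous input pointer, which is exactly what allows an exchange operator to be placed below it later without any change.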
For intermediate results, Volcano uses virtual devices. Pages of such a device exist only in the buffer, and are discarded when unfixed. Using this mechanism allows assigning unique RIDs to intermediate result records, and allows managing such records in all operators as if they resided on a real (disk) device. The operators are not affected by the use of virtual devices, and can be programmed as if all input comes from a disk-resident file and output is written to a disk file.

4. The Operator Model of Parallelization

When porting Volcano to a multi-processor machine, we felt it desirable to use the single-process query processing code described above without any change. The result is very clean, self-scheduling parallel processing. We call this novel approach the operator model of parallelizing a query evaluation engine. In this model, all issues of control are localized in one operator that uses and provides the standard iterator interface to the operators above and below in a query tree.

The module responsible for parallel execution and synchronization is called the exchange iterator in Volcano. Notice that it is an iterator with open, next, and close procedures; therefore, it can be inserted at any one place or at multiple places in a complex query tree. Figure 2 shows a complex query execution plan that includes data processing operators, e.g., file scan and join, and exchange operators.

This section describes how the exchange iterator implements vertical and horizontal parallelism, followed by a detailed example and a discussion of alternative modes of operation of Volcano's exchange operator.

4.1. Vertical Parallelism

The first function of exchange is to provide vertical parallelism or pipelining between processes. The open procedure creates a new process after creating a data structure in shared memory called a port for synchronization and data
exchange. The child process, created using the UNIX fork system call, is an exact duplicate of the parent process. The exchange operator then takes different paths in the parent and child processes.

[Figure 2: Operator Model of Parallelization, showing a query plan that combines print, join, and file scan operators with exchange (XCHG) operators inserted between them; graphic not reproduced in this scan.]

The parent process serves as the consumer and the child process as the producer in Volcano. The exchange operator in the consumer process acts as a normal iterator; the only difference from other iterators is that it receives its input via inter-process communication rather than iterator (procedure) calls. After creating the child process, open_exchange in the consumer is done. Next_exchange waits for data to arrive via the port and returns them a record at a time. Close_exchange informs the producer that it can close, waits for an acknowledgement, and returns.

The exchange operator in the producer process becomes the driver for the query tree below the exchange operator, using open, next, and close on its input. The output of next is collected in packets, which are arrays of NEXT_RECORD structures. The packet size is an argument in the exchange iterator's state record, and can be set between 1 and 32,000 records. When a packet is filled, it is inserted into a linked list originating in the port, and a semaphore is used to inform the consumer about the new packet. Records in packets are fixed in the shared buffer and must be unfixed by a consuming operator.
When its input is exhausted, the exchange operator in the producer process marks the last packet with an end-of-stream tag, passes it to the consumer, and waits until the consumer allows closing all open files. This delay is necessary in Volcano because files on virtual devices must not be closed before all their records are unpinned in the buffer. In other words, it is a peculiarity due to other design decisions in Volcano rather than inherent in the exchange iterator or the operator model of parallelization.

The alert reader has noticed that the exchange module uses a different dataflow paradigm than all other operators. While all other modules are based on demand-driven dataflow (iterators, lazy evaluation), the producer-consumer relationship of exchange uses data-driven dataflow (eager evaluation). There are two reasons for this change in paradigms. First, we intend to use the exchange operator also for horizontal parallelism, to be described below, which is easier to implement with data-driven dataflow. Second, this scheme removes the need for request messages. Even though a scheme with request messages, e.g., using a semaphore, would probably perform acceptably on a shared-memory machine, we felt that it creates unnecessary control overhead and delays. Since we believe that very high degrees of parallelism and very high-performance query evaluation require a closely tied network, e.g., a hypercube, of shared-memory machines, we decided to use a paradigm for data exchange that has been proven to perform well in a shared-nothing database machine [12,13].

A run-time switch of exchange enables flow control or back pressure using an additional semaphore. If the producer is significantly faster than the consumer, the producer may pin a significant portion of the buffer, thus impeding overall system performance. If flow control is enabled, after a producer has inserted a new packet into the port, it must request the flow control semaphore. After a consumer has removed a packet from the port, it releases the flow control semaphore. The initial value of the flow control semaphore, e.g., 4, determines how many packets the producers may get ahead of the consumers.
Notice that flow control and demand-driven dataflow are not the same. One significant difference is that flow control allows some "slack" in the synchronization of producer and consumer and therefore truly overlapped execution, while demand-driven dataflow is a rather rigid structure of request and delivery in which the consumer waits while the producer works on its next output. The second significant difference is that data-driven dataflow is easier to combine efficiently with horizontal parallelism and partitioning.
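The following compact C sketch illustrates the port mechanism just described: a producer fills fixed-size packets, tags the last one with end-of-stream, and signals the consumer through one semaphore, while a second semaphore implements the optional flow control. Volcano forks processes that share a buffer pool; for brevity, this sketch uses POSIX threads and plain arrays, and all names and sizes (PORT, PACKET, SLACK) are assumptions.

/* Sketch of a port with data-driven packets and flow control.
 * Compile with -pthread. All names and sizes are assumptions. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define PACKET_SIZE 4
#define NUM_PACKETS 3
#define SLACK       2   /* producer may run 2 packets ahead of consumer */

typedef struct { int records[PACKET_SIZE]; int count; int eos; } PACKET;

typedef struct {
    PACKET packets[NUM_PACKETS + 1];
    int    head, tail;
    sem_t  filled;      /* signals consumer: a new packet has arrived */
    sem_t  slack;       /* flow control / back pressure               */
} PORT;

static PORT port;

static void *producer(void *arg) {
    int next_record = 0;
    for (int p = 0; p <= NUM_PACKETS; p++) {
        sem_wait(&port.slack);              /* may we run ahead?        */
        PACKET *pk = &port.packets[port.tail++];
        pk->count = 0;
        pk->eos = (p == NUM_PACKETS);       /* last packet carries tag  */
        if (!pk->eos)
            while (pk->count < PACKET_SIZE)
                pk->records[pk->count++] = next_record++;
        sem_post(&port.filled);             /* inform the consumer      */
    }
    return arg;
}

int main(void) {
    sem_init(&port.filled, 0, 0);
    sem_init(&port.slack, 0, SLACK);
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    for (;;) {                              /* consumer side: next_exchange */
        sem_wait(&port.filled);             /* wait for data to arrive      */
        PACKET *pk = &port.packets[port.head++];
        if (pk->eos) break;                 /* end-of-stream tag            */
        for (int i = 0; i < pk->count; i++)
            printf("%d\n", pk->records[i]);
        sem_post(&port.slack);              /* release back pressure        */
    }
    pthread_join(t, NULL);
    return 0;
}

With SLACK set to 2, the producer can fill at most two packets that the consumer has not yet removed, which is exactly the bounded "getting ahead" behavior the flow control semaphore provides.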
record at a tune C lose ^exchange informs the producer that 4.2. Horizontal Parallelism
it can close, waits for an acknowledgement, and returns
There are two forms o f honzontal parallelism which
The exchange operator in the producer process we call bushy parallelism and intra-operator parallelism In
becom es the driver for the query tree below the exchange bushy parallelism, different CPU's execute different subtrees
operator using open, next, and close on its mput The out o f a complex query tree Bushy parallelism and vertical^
put o f next is collected in packets, which are arrays o f parallelism are forms of inter-operator parallelism Intra
NEXT_RECORD structures The packet size is an argument operator parallelism means that several CPU’ s perform the
in the exchange iterator’ s state record, and can be set same operator on different subsets o f a stored dataset or an
between 1 and 32,000 records When a packet is filled, it
Bushy parallelism can easily be implemented by inserting one or two exchange operators into a query tree. For example, in order to sort two inputs into a merge-join in parallel, the first or both inputs are separated from the merge-join by an exchange operation.³ The parent process turns to the second sort immediately after forking the child process that will produce the first input in sorted order. Thus, the two sort operations are working in parallel.

³ In general, sorted streams can be piped directly into the join, both in the single-process and the multi-process case. Volcano's sort operator includes a parameter "final merge fan-in" that allows sharing the merge space by two sort operators performing the final merge in an interleaved fashion, as requested by the merge-join operator.
Intra-operator parallelism requires data partitioning. Partitioning of stored datasets is achieved by using multiple files, preferably on different devices. Partitioning of intermediate results is implemented by including multiple queues in a port. If there are multiple consumer processes, each uses its own input queue. The producers use a support function to decide into which of the queues (or actually, into which of the packets being filled by the producer) an output record must go. Using a support function allows implementing round-robin, key-range, or hash partitioning.
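As a sketch, such a partitioning support function can simply be a function pointer kept in the exchange's state record, mapping a record to a consumer queue number. The two toy policies below (hash and key-range) and all names are assumptions for illustration only.

/* Sketch of a partitioning support function: record -> queue number.
 * The function pointer stands in for the support function passed to
 * the exchange iterator; all names and constants are assumptions. */
#include <stdio.h>

typedef struct { int key; } RECORD;

typedef int (*PARTITION_FN)(const RECORD *, int num_queues);

/* Hash partitioning: multiply by a large odd constant, then reduce. */
static int hash_partition(const RECORD *r, int n)  { return (r->key * 2654435761u) % n; }

/* Key-range partitioning with two fixed split points. */
static int range_partition(const RECORD *r, int n) { return r->key < 100 ? 0 : (r->key < 200 ? 1 : n - 1); }

int main(void) {
    PARTITION_FN partition = hash_partition;   /* chosen per exchange      */
    RECORD r = { 42 };
    printf("record %d -> queue %d\n", r.key, partition(&r, 3));
    partition = range_partition;               /* swap policy, same caller */
    printf("record %d -> queue %d\n", r.key, partition(&r, 3));
    return 0;
}

Because the exchange only ever calls through the function pointer, round-robin, key-range, and hash partitioning are interchangeable without touching the operator itself.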
If an operator or an operator subtree is executed in parallel by a group of processes, one of them is designated the master. When a query tree is opened, only one process is running, which is naturally the master. When a master forks a child process in a producer-consumer relationship, the child process becomes the master within its group. The first action of the master producer is to determine how many slaves are needed by calling an appropriate support function. If the producer operation is to run in parallel, the master producer forks the other producer processes.

Gerber pointed out that such a centralized scheme is suboptimal for high degrees of parallelism [15]. When we changed our initial implementation from forking all producer processes by the master to using a propagation tree scheme, we observed significant performance improvements. In such a scheme, the master forks one slave, then both fork a new slave each, then all four fork a new slave each, etc. This scheme has been used very effectively for broadcast communication and synchronization in binary hypercubes.
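A minimal sketch of the propagation-tree scheme, assuming UNIX fork and omitting error handling: in every round, each process already running forks one more, so 2^k processes exist after k rounds rather than after 2^k - 1 sequential forks by the master. The round count is an assumption for illustration.

/* Propagation-tree forking sketch: the process count doubles each round. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const int rounds = 3;               /* 2^3 = 8 processes afterwards     */
    for (int r = 0; r < rounds; r++)
        (void)fork();                   /* every live process forks once    */
    printf("process %d ready for work\n", (int)getpid());
    while (wait(NULL) > 0)              /* each process reaps its children  */
        ;
    return 0;
}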
Even after optimizing the forking scheme, its overhead is not negligible. We have considered using primed processes, i.e., processes that are always present and wait for work packets. Primed processes are used in many commercial database systems. Since portable distribution of compiled code (for support functions) is not trivial, we delayed this change and plan on using primed processes only when we move to an environment with multiple shared-memory machines.⁴ Others have also observed the high cost of process creation and have provided alternatives, in particular "light-weight" processes in various forms, e.g., in Mach [1].

⁴ In fact, this work is currently under way.
After all producer processes are forked, they run without further synchronization among themselves, with two exceptions. First, when accessing a shared data structure, e.g., the port to the consumers or a buffer table, short-term locks must be acquired for the duration of one linked-list insertion. Second, when a producer group is also a consumer group, i.e., there are at least two exchange operators and three process groups involved in a vertical pipeline, the processes that are both consumers and producers synchronize twice. During the (very short) interval between synchronizations, the master of this group creates a port which serves all processes in its group.

When a close request is propagated down the tree and reaches the first exchange operator, the master consumer's close_exchange procedure informs all producer processes that they are allowed to close down, using the semaphore mentioned above in the discussion on vertical parallelism. If the producer processes are also consumers, the master of the process group informs its producers, etc. In this way, all operators are shut down in an orderly fashion, and the entire query evaluation is self-scheduling.

4.3. An Example

Let us consider an example. Assume a query with four operators, A, B, C, and D, such that A calls B's, B calls C's, and C calls D's open, close, and next procedures. Now assume that this query plan is to be run in three process groups, called A, BC, and D. This requires an exchange operator between operators A and B, say X, and one between C and D, say Y. B and C continue to pass records via a simple procedure call to C's next procedure without crossing process boundaries. Assume further that A runs as a single process, A0, while BC and D run in parallel in processes BC0 to BC2 and D0 to D3, for a total of eight processes.

A calls X's open, close, and next procedures instead of B's (Figure 3a), without knowledge that a process boundary will be crossed, a consequence of anonymous inputs in Volcano. When X is opened, it creates a port with one input queue for A0 and forks BC0 (Figure 3b), which in turn forks BC1 and BC2 (Figure 3c). When the BC group opens Y, BC0 to BC2 synchronize and wait until the Y operator in process BC0 has initialized a port with three input queues. BC0 creates the port and stores its location at an address known only to the BC processes. Then BC0 to BC2 synchronize again, and BC1 and BC2 get the port information from that location. Next, BC0 forks D0 (Figure 3d), which in turn forks D1 to D3 (Figure 3e).

When the D operators have exhausted their inputs in D0 to D3, they return an end-of-stream indicator to the driver parts of Y. In each D process, Y flags its last packets to each of the BC processes (i.e., a total of 3x4 = 12 flagged packets) with an end-of-stream tag and then waits on a semaphore for permission to close.
[Figure 3: the example query's process structure during startup and shutdown, panels (a) through (h), showing processes A0, BC0 to BC2, and D0 to D3 with their X, B, C, and Y operators; graphic not reproduced in this scan.]
The copies of the Y operator in the BC processes count the number of tagged packets; after four tags (the number of producers or D processes), they have exhausted their inputs, and a call by C to Y's next procedure will return an end-of-stream indicator. In effect, the end-of-stream indicator has been propagated from the D operators to the C operators. In due turn, C, B, and then the driver part of X will receive an end-of-stream indicator. After receiving three tagged packets, X's next procedure in A0 will indicate end-of-stream to A.
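A small C sketch of the consumer-side counting just described: next_exchange keeps returning records until it has seen one tagged packet per producer, then reports end-of-stream. The array-based queue and all names are assumptions; in Volcano, packets arrive through the port rather than a prefilled array.

/* Sketch of end-of-stream detection by counting tagged packets. */
#include <stdio.h>

#define NUM_PRODUCERS 4

typedef struct { int record; int eos_tag; } PACKET;

typedef struct {
    const PACKET *queue;   /* packets as they arrive from all producers */
    int pos;
    int tags_seen;         /* tagged packets counted so far             */
} EXCHANGE_STATE;

/* Returns 1 and a record, or 0 once every producer has tagged off. */
static int next_exchange(EXCHANGE_STATE *x, int *out) {
    while (x->tags_seen < NUM_PRODUCERS) {
        const PACKET *p = &x->queue[x->pos++];
        if (p->eos_tag) { x->tags_seen++; continue; }   /* count the tag */
        *out = p->record;
        return 1;
    }
    return 0;                                           /* end of stream */
}

int main(void) {
    const PACKET arriving[] = {
        {10,0}, {0,1}, {20,0}, {0,1}, {30,0}, {0,1}, {40,0}, {0,1},
    };
    EXCHANGE_STATE x = { arriving, 0, 0 };
    int r;
    while (next_exchange(&x, &r))
        printf("%d\n", r);
    printf("end of stream after %d tags\n", x.tags_seen);
    return 0;
}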
When end-of-stream reaches the root operator of the query, A, the query tree is closed. Closing the exchange operator X includes releasing the semaphore that allows the BC processes to shut down (Figure 3f). The X driver in each BC process closes its input, operator B; B closes C, and C closes Y. Closing Y in BC1 and BC2 is an empty operation. When the process BC0 closes the exchange operator Y, Y permits the D processes to shut down by releasing a semaphore. After the processes of the D group have closed all files and deallocated all temporary data structures, e.g., hash tables, they indicate the fact to Y in BC0 using another semaphore, and Y's close procedure returns to its caller, C's close procedure, while the D processes terminate (Figure 3g). When all BC processes have closed down, X's close procedure indicates the fact to A0 and query evaluation terminates (Figure 3h).
4.4. Variants of the Exchange Operator

There are a number of situations for which the exchange operator described so far required some modifications or extensions. In this section, we outline additional capabilities implemented in Volcano's exchange operator.
For some operations, it is desirable to replicate or broadcast a stream to all consumers. For example, one of the two partitioning methods for hash-division [19] requires that the divisor be replicated and used with each partition of the dividend. Another example is Baru's parallel join algorithm, in which one of the two input relations is not moved at all while the other relation is sent through all processors [4]. To support these algorithms, the exchange operator can be directed (by setting a switch in the state record) to send all records to all consumers, after pinning them appropriately multiple times in the buffer pool. Notice that it is not necessary to copy the records since they reside in a shared buffer pool; it is sufficient to pin them such that each consumer can unpin them as if it were the only process using them. After we implemented this feature, parallelizing our hash-division programs using both divisor partitioning and quotient partitioning [19] took only about three hours and yielded not insignificant speedups.
When we implemented and benchmarked parallel sorting [21], we found it useful to add two more features to exchange. First, we wanted to implement a merge network in which some processors produce sorted streams that are merged concurrently by other processors. Volcano's sort iterator can be used to generate a sorted stream. A merge iterator was easily derived from the sort module. It uses a single-level merge, instead of the cascaded merge of runs used in sort. The input of a merge iterator is an exchange. Differently from other operators, the merge iterator needs to distinguish the input records by their producer. As an example, for a join operation it does not matter where the input records were created, and all inputs can be accumulated in a single input stream. For a merge operation, however, it is crucial to distinguish the input records by their producer in order to merge multiple sorted streams correctly.

We modified the exchange module such that it can keep the input records separated according to their producers, switched by setting an argument field in the state record. A third argument to next_exchange is used to communicate the required producer from the merge to the exchange iterator. Further modifications included increasing the number of input buffers used by exchange, the number of semaphores (including those for flow control) used between the producer and consumer parts of exchange, and the logic for end-of-stream. All these modifications were implemented in such a way that they support multi-level merge trees, e.g., a parallel binary merge tree as used in [7]. The merging paths are selected automatically such that the load is distributed as evenly as possible in each level.
Second, we implemented a sort algorithm that sorts data randomly partitioned over multiple disks into a range-partitioned file with sorted partitions, i.e., a sorted file distributed over multiple disks. When using the same number of processors and disks, we used two processes per CPU, one to perform the file scan and partition the records, and another one to sort them. We realized that creating and running more processes than processors inflicted a significant cost, since these processes competed for the CPUs and therefore required operating system scheduling. While the scheduling overhead may not be too significant, in our environment with a central run queue allowing processes to migrate freely and a large cache associated with each CPU, the frequent cache migration adds a significant cost.

In order to make better use of the available processing power, we decided to reduce the number of processes by half, effectively moving to one process per disk. This required modifications to the exchange operator. Until then, the exchange operator could "live" only at the top or the bottom of the operator tree in a process. Since the modification, the exchange operator can also be in the middle of a process' operator tree. When the exchange operator is opened, it does not fork any processes but establishes a communication port for data exchange. The next operation requests records from its input tree, possibly sending them off to other processes in the group, until a record for its own partition is found.

This mode of operation⁵ also makes flow control obsolete. A process runs a producer (and produces input for the other processes) only if it does not have input for the consumer. Therefore, if the producers are in danger of overrunning the consumers, none of the producer operators gets scheduled, and the consumers consume the available records.

⁵ Whether exchange forks new producer processes (the original exchange design described in Section 4.1) or uses the existing process group to execute the producer operations is a run-time switch.
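A sketch of the next operation in this middle-of-the-tree mode, under the behavior stated above: records are pulled from the local input subtree, records belonging to other partitions are shipped off, and only a record for the process' own partition is returned. Shipping is simulated by a counter, and all names are assumptions.

/* Sketch of a middle-of-the-tree exchange's next operation. */
#include <stdio.h>

#define NUM_PARTITIONS 3

typedef struct {
    int my_partition;
    int next_key;       /* stands in for the local input subtree */
    int limit;
    int shipped;        /* records sent off to other processes   */
} MID_EXCHANGE;

static int next_mid_exchange(MID_EXCHANGE *x, int *out) {
    while (x->next_key < x->limit) {            /* pull from input tree   */
        int key = x->next_key++;
        int part = key % NUM_PARTITIONS;        /* partitioning function  */
        if (part == x->my_partition) { *out = key; return 1; }
        x->shipped++;                           /* off to another process */
    }
    return 0;                                   /* input tree exhausted   */
}

int main(void) {
    MID_EXCHANGE x = { 1, 0, 10, 0 };
    int r;
    while (next_mid_exchange(&x, &r))
        printf("own record %d\n", r);
    printf("shipped %d records elsewhere\n", x.shipped);
    return 0;
}

Because a process only produces while its consumer has nothing to consume, back pressure falls out of the control flow itself, which is why the separate flow control semaphore becomes unnecessary in this mode.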
In summary, the operator model of parallel query evaluation provides for self-scheduling parallel query evaluation in an extensible database system. The most important properties of this novel approach are that the new module implements three forms of parallel processing within a single module, that it makes parallel query processing entirely self-scheduling, and that it did not require any changes in the existing query processing modules, thus leveraging significantly the time and effort spent on them and allowing easy parallel implementation of new algorithms.
5. Overhead and Performance

From the beginning of the Volcano project, we were very concerned about performance and overhead. In this section, we report on experimental measurements of the overhead induced by the exchange operator. This is not meant to be an extensive or complete analysis of the operator's performance and overhead; the purpose of this section is to demonstrate that the overhead can be kept within acceptable limits.
We measured elapsed times of a program that creates records, fills them with four random integers, passes the records over three process boundaries, and then unfixes the records in the buffer. The measurements are elapsed times on a Sequent Symmetry with twelve Intel 16 MHz 80386 CPUs. This is a shared-memory machine with a 64 KB cache for each CPU. Each CPU delivers about 4 MIPS in this machine. The times were measured using the hardware microsecond clock available on such machines. Sequent's DYNIX operating system provides exactly the same interface as Berkeley 4.2 BSD or System V UNIX and runs (i.e., executes system calls) on all processors.
First, we measured the program without any exchange operator. Creating 100,000 records and releasing them in the buffer took 20.28 seconds. Next, we measured the program with the exchange operator switched to the mode in which it does not create new processes. In other words, compared to the last experiment, we added the overhead of three procedure calls for each record. For this run, we measured 28.00 seconds. Thus, the three exchange operators in this mode added (28.00 sec - 20.28 sec) / 3 / 100,000 = 25.73 μsec of overhead per record and exchange operator.
When we switched the exchange operator to create new processes, thus creating a pipeline of four processes, we observed an elapsed time of 16.21 seconds with flow control enabled, or 16.16 seconds with flow control disabled. The fact that these times are less than the time for single-process program execution indicates that data transfer using the exchange operator is very fast, and that pipelined multi-process execution is warranted.
We were particularly concerned about the granularity of data exchange between processes and its impact on Volcano's performance. In a separate experiment, we reran the program multiple times, varying the number of records per exchange packet. Table 1 shows the performance for transferring 100,000 records from a producer process group through two intermediate process groups to a single consumer process. Each of these three groups included three processes; thus, each of the producer processes created 33,333 records. All these experiments were conducted with flow control enabled with three "slack" packets per exchange. We used different partitioning (hash) functions for each exchange iterator to ensure that records were passing along all possible data paths, not only along three independent pipelines.
Packet Size [Records]    Elapsed Time [Seconds]
          1                     176.4
          2                      97.6
          5                      45.27
         10                      27.67
         20                      20.15
         50                      15.71
        100                      13.76
        200                      12.87
        250                      12.73

Table 1. Exchange Performance
As can be seen in Table 1, the performance penalty for very small packets was significant. The elapsed time was almost cut in half when the packet size was increased from 1 to 2 records, from 176 seconds to 98 seconds. As the packet size was increased further, the elapsed time shrank accordingly, to 15.71 seconds for 50 records per packet and 12.73 seconds for 250 records per packet.

It seemed reasonable to speculate that for small packets, most of the elapsed time was spent on data exchange. To verify this hypothesis, we calculated regression and correlation coefficients of the number of data packets (100,000 divided by the packet size) and the elapsed times. We found an intercept (base time) of 12.18 seconds, a slope of 0.001654 seconds per packet, and a correlation of more than 0.99. Considering that we exchanged data over three process boundaries, and that on two of those boundaries there were three producers and three consumers, we estimate that the overhead was 1654 μsec / 1.667 = 992 μsec per packet and process boundary.
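Restated as formulas (a sketch of the arithmetic only, using exactly the figures reported above):

\[
\frac{28.00\,\text{s} - 20.28\,\text{s}}{3 \times 100{,}000} \approx 25.73\,\mu\text{s} \quad \text{per record and exchange operator},
\]
\[
T \approx 12.18\,\text{s} + 0.001654\,\text{s} \cdot n, \qquad \frac{1654\,\mu\text{s}}{1.667} \approx 992\,\mu\text{s} \quad \text{per packet and process boundary},
\]

where \(n = 100{,}000 / \text{packet size}\) is the number of data packets exchanged per boundary.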
Two conclusions can be drawn from these experiments. First, vertical parallelism can pay off even for very simple query plans if the overhead of data transfer is small. Second, since the packet size can be set to any value, the overhead of Volcano's exchange iterator is negligible.

6. Summary and Conclusions

We have described Volcano, a new query evaluation system, and how parallel query evaluation is encapsulated in a single module or operator. The system is operational on both single- and multi-processor systems, and has been used for a number of database query processing studies [19-21,23].
Volcano utilizes dataflow techniques within processes as well as between processes. Within a process, demand-driven dataflow is implemented by means of iterators. Between processes, data-driven dataflow is used to exchange data between producers and consumers efficiently. If necessary, Volcano's data-driven dataflow can be augmented with flow control or back pressure. Horizontal partitioning is used both on stored and intermediate datasets to allow intra-operator parallelism. The design of the exchange operator embodies the parallel execution mechanism for vertical, bushy, and intra-operator parallelism, and it performs the transitions from demand-driven to data-driven dataflow and back.

Using an operator to encapsulate parallelism as explored in the Volcano project has a number of advantages over the bracket model. First, it hides the fact that parallelism is used from all other operators. Thus, other operators can be implemented without consideration for parallelism. Second, since the exchange operator uses the same interface to its input and output, it can be placed anywhere in a tree and combined with any other operators. Hence, it can be used to parallelize new operators, and effectively combines extensibility and parallelism. Third, it does not require a separate scheduler process, since scheduling (including initialization, flow control, and final clean-up) is part of the operator and therefore performed within the standard open-next-close iterator paradigm. This turns into an advantage in two situations. When a new operator is integrated into the system, the scheduler and the template process would have to be modified, while the exchange operator does not require any modifications. When the system is ported to a new environment, only one module requires modifications, the exchange iterator, not two modules, the template process and the scheduler. Fourth, it does not require that operators in a parallel query evaluation system use IPC to exchange data. Thus, each process can execute an arbitrary subtree of a complex query evaluation plan. Fifth, a single process can have any number of inputs, not just one or two. Finally, the operator can be (and has been) implemented in such a way that it can multiplex a single process between a producer and a consumer. In some respects, it efficiently implements application-specific co-routines or threads.
We plan on several extensions of the exchange operator. First, we plan on extending our design and implementation to support both shared and distributed memory ("shared-nothing architecture") and to allow combining these concepts in a closely tied network of shared-memory multi-computers while maintaining the encapsulation properties. This might require using a pool of "primed" processes and interpreting support functions. We believe that in the long run, high-performance database machines, both for transaction and query processing, will employ this architecture. Second, we plan on devising an error and exception management scheme that makes exception notification and handling transparent across process and machine boundaries. Third, we plan on using the exchange operator to parallelize query processing in object-oriented database systems [16]. In our model, a complex object is represented in memory by a pointer to the root component (pinned in the buffer) with pointers to the sub-components (also pinned), and is passed between operators by passing the root component [18]. While the current design already allows passing complex objects in a shared-memory environment, more functionality is needed in a distributed-memory system where objects need to be packaged for network transfer.
Volcano is the first implemented query evaluation system that combines extensibility and parallelism. Encapsulating all parallelism issues into one module was essential to making this combination possible. The encapsulation of parallelism in Volcano allows new query processing algorithms to be coded for single-process execution but run in a highly parallel environment without modifications. We expect that this will speed parallel algorithm development and evaluation significantly. Since the operator model of parallel query processing and Volcano's exchange operator encapsulate parallelism and both use and provide an iterator interface similar to that of many existing database systems, the concepts explored and outlined in this paper may very well be useful in parallelizing other database query processing software.
Acknowledgements

A number of friends and colleagues were great sounding boards during the design and implementation of parallelism in Volcano, most notably Frank Symonds and Leonard Shapiro. Jerry Borgvedt implemented a prototype distributed-memory exchange operator. NSF supported this work with contracts DU-8805200 and IRI-8912618. Sequent Computer Systems provided machine time for experiments on a large machine.
References

1. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, "Mach: A New Kernel Foundation for UNIX Development", Summer Conference Proceedings, 1986.

2. W. Alexander and G. Copeland, "Process and Dataflow Control in Distributed Data-Intensive Systems", Proceedings of the ACM SIGMOD Conference, Chicago, IL, June 1988, 90-98.

3. M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade and V. Watson, "System R: A Relational Approach to Database Management", ACM Transactions on Database Systems 1, 2 (June 1976), 97-137.

4. C. K. Baru, O. Frieder, D. Kandlur and M. Segal, "Join on a Cube: Analysis, Simulation, and Implementation", Proceedings of the 5th International Workshop on Database Machines, 1987.

5. D. S. Batory, "GENESIS: A Project to Develop an Extensible Database Management System", Proceedings of the Int'l Workshop on Object-Oriented Database Systems, Pacific Grove, CA, September 1986, 207-208.

6. D. Bitton, D. J. DeWitt and C. Turbyfill, "Benchmarking Database Systems: A Systematic Approach", Proceedings of the Conference on Very Large Data Bases, Florence, Italy, October-November 1983, 8-19.

7. D. Bitton, H. Boral, D. J. DeWitt and W. K. Wilkinson, "Parallel Algorithms for the Execution of Relational Database Operations", ACM Transactions on Database Systems 8, 3 (September 1983), 324-353.

8. H. Boral and D. J. DeWitt, "Database Machines: An Idea Whose Time Has Passed? A Critique of the Future of Database Machines", Proceedings of the International Workshop on Database Machines, Munich, 1983.

9. H. Boral and D. J. DeWitt, "A Methodology for Database System Performance Evaluation", Proceedings of the ACM SIGMOD Conference, Boston, MA, June 1984, 176-185.

10. M. J. Carey, D. J. DeWitt, J. E. Richardson and E. J. Shekita, "Object and File Management in the EXODUS Extensible Database System", Proceedings of the Conference on Very Large Data Bases, Kyoto, Japan, August 1986, 91-100.

11. H. T. Chou, D. J. DeWitt, R. H. Katz and A. C. Klug, "Design and Implementation of the Wisconsin Storage System", Software - Practice and Experience 15, 10 (October 1985), 943-962.

12. D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar and M. Muralikrishna, "GAMMA: A High Performance Dataflow Database Machine", Proceedings of the Conference on Very Large Data Bases, Kyoto, Japan, August 1986, 228-237.

13. D. J. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H. I. Hsiao and R. Rasmussen, "The Gamma Database Machine Project", IEEE Transactions on Knowledge and Data Engineering 2, 1 (March 1990).

14. S. Englert, J. Gray, R. Kocher and P. Shah, "A Benchmark of NonStop SQL Release 2 Demonstrating Near-Linear Speedup and Scaleup on Large Databases", Tandem Computer Systems Technical Report 89.4 (May 1989).

15. R. Gerber, "Dataflow Query Processing using Multiprocessor Hash-Partitioned Algorithms", Ph.D. Thesis, Madison, October 1986.

16. G. Graefe and D. Maier, "Query Optimization in Object-Oriented Database Systems: A Prospectus", in Advances in Object-Oriented Database Systems, vol. 334, K. R. Dittrich (editor), Springer-Verlag, September 1988, 358-363.

17. G. Graefe, "Volcano: An Extensible and Parallel Dataflow Query Processing System", Oregon Graduate Center, Computer Science Technical Report, Beaverton, OR, June 1989.

18. G. Graefe, "Set Processing and Complex Object Assembly in Volcano and the REVELATION Project", Oregon Graduate Center, Computer Science Technical Report, Beaverton, OR, June 1989.

19. G. Graefe, "Relational Division: Four Algorithms and Their Performance", Proceedings of the IEEE Conference on Data Engineering, Los Angeles, CA, February 1989, 94-101.

20. G. Graefe and K. Ward, "Dynamic Query Evaluation Plans", Proceedings of the ACM SIGMOD Conference, Portland, OR, May-June 1989, 358.

21. G. Graefe, "Parallel External Sorting in Volcano", submitted for publication, February 1990.

22. L. M. Haas, W. F. Cody, J. C. Freytag, G. Lapis, B. G. Lindsay, G. M. Lohman, K. Ono and H. Pirahesh, "An Extensible Processor for an Extended Relational Query Language", Computer Science Research Report, San Jose, CA, April 1988.

23. T. Keller and G. Graefe, "The One-to-One Match Operator of the Volcano Query Processing System", Oregon Graduate Center, Computer Science Technical Report, Beaverton, OR, June 1989.

24. J. E. Richardson and M. J. Carey, "Programming Constructs for Database System Implementation in EXODUS", Proceedings of the ACM SIGMOD Conference, San Francisco, CA, May 1987, 208-219.

25. K. Salem and H. Garcia-Molina, "Disk Striping", Proceedings of the IEEE Conference on Data Engineering, Los Angeles, CA, February 1986, 336.

26. P. Schwarz, W. Chang, J. C. Freytag, G. Lohman, J. McPherson, C. Mohan and H. Pirahesh, "Extensibility in the Starburst Database System", Proceedings of the Int'l Workshop on Object-Oriented Database Systems, Pacific Grove, CA, September 1986, 85-92.

27. M. Stonebraker, E. Wong, P. Kreps and G. D. Held, "The Design and Implementation of INGRES", ACM Transactions on Database Systems 1, 3 (September 1976), 189-222.

28. M. Stonebraker and L. A. Rowe, "The Design of POSTGRES", Proceedings of the ACM SIGMOD Conference, Washington, DC, May 1986, 340-355.

29. M. Stonebraker, R. Katz, D. Patterson and J. Ousterhout, "The Design of XPRS", Proceedings of the Conference on Very Large Data Bases, Los Angeles, CA, August 1988, 318-330.

30. S. Torii, K. Kojima, Y. Kanada, A. Sakata, S. Yoshizumi and M. Takahashi, "Accelerating Nonnumerical Processing by an Extended Vector Processor", Proceedings of the IEEE Conference on Data Engineering, Los Angeles, CA, February 1988, 194-201.

31. P. Williams, D. Daniels, L. Haas, G. Lapis, B. Lindsay, P. Ng, R. Obermarck, P. Selinger, A. Walker, P. Wilms and R. Yost, "R*: An Overview of the Architecture", in Readings in Database Systems, M. Stonebraker (editor), Morgan-Kaufman, San Mateo, CA, 1988.