
UNIVERSITY OF COPENHAGEN

Advanced Computer Systems (ACS)


DIKU Course Compendium

Block 2, 2015/16
Advanced Computer Systems (ACS)
DIKU Course Compendium

DIKU, Department of Computer Science,


University of Copenhagen, Denmark

Block 2, 2015/16
Contents

Preface  v

Learning Goals  vii

Source List  ix

1  Fundamental Abstractions  1

2  Modularity through Clients and Services, RPC  19

3  Techniques for Performance  27

4  Concurrency Control  45

5  Recovery  101

6  Experimental Design  127

7  Notions of Reliability  179

8  Topics in Distributed Coordination and Distributed Transactions  221

9  Communication and End-to-End Argument  317

10  Data Processing - External Sorting  383

11  Data Processing - Basic Relational Operators and Joins  395

12  Data Processing - Parallelism  459


Preface

This compendium has been designed for the course Advanced Computer Systems (ACS), taught at the Department of Computer Science (DIKU), University of Copenhagen. The contents of the compendium comply with the rules for use in academic courses defined by CopyDan. The compendium is organized into 12 parts, each containing textbook chapters or reference papers related to topics covered in the course. Each part is also prefaced by a short description of the learning expectations with respect to the readings.

The compendium starts with a review of fundamental abstractions in computer systems, namely interpreters, memory, and communication links (Part 1). The course explores multiple properties that may be attached to these abstractions, and presents principled design and implementation techniques to obtain these properties while respecting interfaces and achieving high performance. A first property is the notion of strong modularity, achieved by organizing interpreters into clients and services and using remote procedure call (RPC) mechanisms (Part 2).

After a brief review of general techniques for performance (Part 3), the properties of atomicity and durability are explored. Atomicity may be understood with respect to before-or-after or, alternatively, all-or-nothing semantics. Several concurrency control protocols that achieve before-or-after atomicity over a memory abstraction are introduced first (Part 4). Following concurrency control, recovery protocols for all-or-nothing atomicity and durability are discussed (Part 5).

The text then ventures into a brief foray on techniques for experimental design (Part 6), which allow the performance characteristics of different designs and implementations of a given abstraction to be analyzed. After this foray, the compendium turns to the property of high availability in the presence of faults, achieved by a combination of techniques. First, general techniques for reliability, in particular replication techniques, are discussed (Part 7). Distribution of system functionality and replication introduce the problem of maintaining consistency, so solutions for achieving high degrees of consistency in distributed scenarios, including ordered multicast, two-phase commit, and state-machine replication, are then discussed (Part 8). Finally, communication schemes that decouple system functions are discussed, along with the classic end-to-end argument (Part 9).

The text finally explores the property of scalability with large data volumes, and reviews design and implementation techniques for data processing operators, including external sorting, basic relational operators and joins, as well as parallelism (Parts 10, 11, and 12, respectively).

We hope you enjoy your readings!

Learning Goals

The learning goals for ACS are listed below.

Knowledge

• Describe the design of transactional and distributed systems, including
  techniques for modularity, performance, and fault tolerance.

• Explain how to employ strong modularity through a client-service abstraction
  as a paradigm to structure computer systems, while hiding complexity of
  implementation from clients.

• Explain techniques for large-scale data processing.

Skills

• Implement systems that include mechanisms for modularity, atomicity, and
  fault tolerance.

• Structure and conduct experiments to evaluate a system's performance.

Competences

• Discuss design alternatives for a modular computer system, identifying
  desired system properties as well as describing mechanisms for improving
  performance while arguing for their correctness.

• Analyze protocols for concurrency control and recovery, as well as for
  distribution and replication.

• Apply principles of large-scale data processing to analyze concrete
  information-processing problems.
Source List

G. Coulouris, J. Dollimore, T. Kindberg. Distributed Systems: Concepts and
   Design. Third Edition. Chapters 11 (except 11.2 and 11.3) and 13, pp.
   419-423, 436-464, and 515-552 (72 of 772). Addison-Wesley, 2001. ISBN:
   0201-61918-0

J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool.
   Commun. ACM 53, 1, pp. 72-77 (6 of 159), 2010. Doi: 10.1145/1629175.1629198

D. DeWitt and J. Gray. Parallel database systems: the future of high
   performance database systems. Commun. ACM 35, pp. 85-98 (14 of 1868),
   1992. Doi: 10.1145/129888.129894

D. Lilja. Measuring Computer Performance: A Practitioner's Guide. Chapters
   1, 2, and 6, pp. 1-24 and 82-107 (50 of 261). Cambridge University
   Press, 2000. ISBN: 978-0-521-64105-0

H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete
   Book. Chapters 11.4 and 15, pp. 525-533 and 713-774 (71 of 1119).
   Prentice Hall, 2002. ISBN: 0-13-031995-3

G. Graefe. Encapsulation of parallelism in the Volcano query processing
   system. SIGMOD Rec. 19, 2, pp. 102-111 (10 of 632), 1990. Doi:
   10.1145/93605.98720

D. Pritchett. BASE: An Acid Alternative. Queue 6, 3 (May 2008), pp. 48-55
   (8 of 72), 2008. Doi: 10.1145/1394127.1394128

R. Ramakrishnan and J. Gehrke. Database Management Systems. Third
   Edition. Chapters 16-18, pp. 519-544, 549-575, and 579-600 (75 of
   1065). McGraw-Hill, 2003. ISBN: 978-0-07-246563-1

J. H. Saltzer and M. F. Kaashoek. Principles of Computer System Design: An
   Introduction. Part I. Sections 2.1, 4.2, and 6.1, pp. 44-60, 167-172, and
   300-316 (40 of 526). Morgan Kaufmann, 2009. ISBN: 978-0-12-374957-4

J. H. Saltzer and M. F. Kaashoek. Principles of Computer System Design:
   An Introduction. Part II. Chapters 8.1-4 and 8.6, pp. 8-2 - 8-35 and
   8-51 - 8-54 (38 of 826). Creative Commons License, 2009.

J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system
   design. ACM Trans. Comput. Syst. 2(4), pp. 277-288 (12 of 359), 1984.
   Doi: 10.1145/357401.357402

F. B. Schneider. Implementing fault-tolerant services using the state machine
   approach: a tutorial. ACM Comput. Surv. 22(4), pp. 299-319 (21 of
   409), 1990. Doi: 10.1145/98163.98167

A. S. Tanenbaum and M. V. Steen. Distributed Systems: Principles and
   Paradigms. Second Edition. Chapter 4, pp. 124-125 and 140-177 (40
   of 686). Pearson International Edition, 2007. ISBN: 0-13-613553-6
Chapter 1

Fundamental Abstractions

This chapter contains the book chapter:

J. H. Saltzer and M. F. Kaashoek. Principles of Computer System
   Design: An Introduction. Part I. Section 2.1, pp. 44-60 (17 of
   526). Morgan Kaufmann, 2009. ISBN: 978-0-12-374957-4

The chapter reviews the fundamental abstractions in computer systems:
memory, interpreters, and communication links. These abstractions manifest
themselves in hardware and in software, in centralized as well as distributed
systems. The ultimate goal of this portion of the material is to convey the
generality of these abstractions, and to stimulate us to reflect on how different
versions of these abstractions are implemented in terms of one another over
different system layers.

The learning goals for this portion of the material are listed below.

• Identify the fundamental abstractions in computer systems and their APIs,
  including memory, interpreters, and communication links.

• Explain how names are used in the fundamental abstractions.

• Design a top-level abstraction, respecting its correspondent API, based on
  lower-level abstractions.

• Discuss performance and fault-tolerance aspects of such a design.



OVERVIEW

Although the number of potential abstractions for computer system components is unlimited, remarkably the vast majority that actually appear in practice fall into one of three well-defined classes: the memory, the interpreter, and the communication link. These three abstractions are so fundamental that theoreticians compare computer algorithms in terms of the number of data items they must remember, the number of steps their interpreter must execute, and the number of messages they must communicate.

Designers use these three abstractions to organize physical hardware structures, not because they are the only ways to interconnect gates, but rather because

■ they supply fundamental functions of recall, processing, and communication,

■ so far, these are the only hardware abstractions that have proven both to be widely useful and to have understandably simple interface semantics.

To meet the many requirements of different applications, system designers build layers on this fundamental base, but in doing so they do not routinely create completely different abstractions. Instead, they elaborate the same three abstractions, rearranging and repackaging them to create features that are useful and interfaces that are convenient for each application. Thus, for example, the designer of a general-purpose system such as a personal computer or a network server develops interfaces that exhibit highly refined forms of the same three abstractions. The user, in turn, may see the memory in the form of an organized file or database system, the interpreter in the form of a word processor, a game-playing system, or a high-level programming language, and the communication link in the form of instant messaging or the World Wide Web. On examination, underneath each of these abstractions is a series of layers built on the basic hardware versions of those same abstractions.

A primary method by which the abstract components of a computer system interact is reference. What that means is that the usual way for one component to connect to another is by name. Names appear in the interfaces of all three of the fundamental abstractions as well as the interfaces of their more elaborate higher-layer counterparts. The memory stores and retrieves objects by name, the interpreter manipulates named objects, and names identify communication links. Names are thus the glue that interconnects the abstractions. Named interconnections can, with proper design, be easy to change. Names also allow the sharing of objects, and they permit finding previously created objects at a later time.

This chapter briefly reviews the architecture and organization of computer systems in the light of abstraction, naming, and layering. Some parts of this review will be familiar to the reader with a background in computer software or hardware, but the systems perspective may provide some new insights into those familiar concepts and


it lays the foundation for coming chapters. Section 2.1 describes the three fundamental abstractions, Section 2.2 presents a model for naming and explains how names are used in computer systems, and Section 2.3 discusses how a designer combines the abstractions, using names and layers, to create a typical computer system, presenting the file system as a concrete example of the use of naming and layering for the memory abstraction. Section 2.4 looks at how the rest of this book will consist of designing some higher-level version of one or more of the three fundamental abstractions, using names for interconnection and built up in layers. Section 2.5 is a case study showing how abstractions, naming, and layering are applied in a real file system.

2.1 THE THREE FUNDAMENTAL ABSTRACTIONS

We begin by examining, for each of the three fundamental abstractions, what the abstraction does, how it does it, its interfaces, and the ways it uses names for interconnection.

2.1.1 Memory

Memory, sometimes called storage, is the system component that remembers data values for use in computation. Although memory technology is wide-ranging, as suggested by the list of examples in Figure 2.1, all memory devices fit a simple abstract model that has two operations, named write and read:

    write(name, value)
    value ← read(name)

The write operation specifies in value a value to be remembered and in name a name by which one can recall that value in the future. The read operation specifies in name the name of some previously remembered value, and the memory device returns that
value. A later call to write that specifies the same name updates the value associated with that name.

Hardware memory devices:
  RAM chip
  Flash memory
  Magnetic tape
  Magnetic disk
  CD-R and DVD-R
Higher level memory systems:
  RAID
  File system
  Database management system

FIGURE 2.1
Some examples of memory devices that may be familiar.

Memories can be either volatile or non-volatile. A volatile memory is one whose mechanism of retaining information consumes energy; if its power supply is interrupted for some reason, it forgets its information content. When one turns off the power to a non-volatile memory (sometimes called "stable storage"), it retains its content, and when power is again available, read operations return the same values as before. By connecting a volatile memory to a battery or an


uninterruptible power supply, it can be made durable, which means that it is designed to remember things for at least some specified period, known as its durability. Even non-volatile memory devices are subject to eventual deterioration, known as decay, so they usually also have a specified durability, perhaps measured in years. We will revisit durability in Chapters 8 [on-line] and 10 [on-line], where we will see methods of obtaining different levels of durability. Sidebar 2.1 compares the meaning of durability with two other, related words.

Sidebar 2.1 Terminology: Durability, Stability, and Persistence Both in common English usage and in the professional literature, the terms durability, stability, and persistence overlap in various ways and are sometimes used almost interchangeably. In this text, we define and use them in a way that emphasizes certain distinctions.

  Durability    A property of a storage medium: the length of time it remembers.
  Stability     A property of an object: it is unchanging.
  Persistence   A property of an active agent: it keeps trying.

Thus, the current chapter suggests that files be placed in a durable storage medium—that is, they should survive system shutdown and remain intact for as long as they are needed. Chapter 8 [on-line] revisits durability specifications and classifies applications according to their durability requirements.

This chapter introduces the concept of stable bindings for names, which, once determined, never again change.

Chapter 7 [on-line] introduces the concept of a persistent sender, a participant in a message exchange who keeps retransmitting a message until it gets confirmation that the message was successfully received, and Chapter 8 [on-line] describes persistent faults, which keep causing a system to fail.
At the physical level, a memory system does not normally name, read, or write values of arbitrary size. Instead, hardware layer memory devices read and write contiguous arrays of bits, usually fixed in length, known by various terms such as bytes (usually 8 bits, but one sometimes encounters architectures with 6-, 7-, or 9-bit bytes), words (a small integer number of bytes, typically 2, 4, or 8), lines (several words), and blocks (a number of bytes, usually a power of 2, that can measure in the thousands). Whatever the size of the array, the unit of physical layer memory written or read is known as a memory (or storage) cell. In most cases, the name argument in the read and write calls is actually the name of a cell. Higher-layer memory systems also read and write contiguous arrays of bits, but these arrays usually can be of any convenient length, and are called by terms such as record, segment, or file.
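To make the two-operation memory abstraction concrete, the sketch below models a location-addressed memory of fixed-size cells with the write/read interface described above. It is an illustrative sketch only, not from the book; the class and constant names (Memory, CELL_SIZE) are our own choices.

    // A minimal sketch of the memory abstraction: fixed-size cells named by
    // consecutive integer addresses, with the two operations write and read.
    public class Memory {
        public static final int CELL_SIZE = 8;      // bytes per cell
        private final byte[][] cells;

        public Memory(int numberOfCells) {
            this.cells = new byte[numberOfCells][CELL_SIZE];
        }

        // write(name, value): remember value under the given cell name (address).
        public void write(int name, byte[] value) {
            if (value.length != CELL_SIZE) {
                throw new IllegalArgumentException("value must fill exactly one cell");
            }
            System.arraycopy(value, 0, cells[name], 0, CELL_SIZE);
        }

        // value <- read(name): return the most recently written value for that name.
        public byte[] read(int name) {
            return cells[name].clone();
        }
    }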

2.1.1.1 Read/Write Coherence and Atomicity

Two useful properties for a memory are read/write coherence and before-or-after atomicity. Read/write coherence means that the result of the read of a named cell is always the same as the most recent write to that cell. Before-or-after atomicity

means that the result of every read or write is as if that read or write occurred either completely before or completely after any other read or write. Although it might seem that a designer should be able simply to assume these two properties, that assumption is risky and often wrong. There are a surprising number of threats to read/write coherence and before-or-after atomicity:

■ Concurrency. In systems where different actors can perform read and write operations concurrently, they may initiate two such operations on the same named cell at about the same time. There needs to be some kind of arbitration that decides which one goes first and to ensure that one operation completes before the other begins.

■ Remote storage. When the memory device is physically distant, the same concerns arise, but they are amplified by delays, which make the question of "which write was most recent?" problematic, and by additional forms of failure introduced by communication links. Section 4.5 introduces remote storage, and Chapter 10 [on-line] explores solutions to before-or-after atomicity and read/write coherence problems that arise with remote storage systems.

■ Performance enhancements. Optimizing compilers and high-performance processors may rearrange the order of memory operations, possibly changing the very meaning of "the most recent write to that cell" and thereby destroying read/write coherence for concurrent read and write operations. For example, a compiler might delay the write operation implied by an assignment statement until the register holding the value to be written is needed for some other purpose. If someone else performs a read of that variable, they may receive an old value. Some programming languages and high-performance processor architectures provide special programming directives to allow a programmer to restore read/write coherence on a case-by-case basis. For example, the Java language has a synchronized declaration that protects a block of code from read/write incoherence, and Hewlett-Packard's Alpha processor architecture (among others) includes a memory barrier (mb) instruction that forces all preceding reads and writes to complete before going on to the next instruction. Unfortunately, both of these constructs create opportunities for programmers to make subtle mistakes.

■ Cell size incommensurate with value size. A large value may occupy multiple memory cells, in which case before-or-after atomicity requires special attention. The problem is that both reading and writing of a multiple-cell value is usually done one cell at a time. A reader running concurrently with a writer that is updating the same multiple-cell value may end up with a mixed bag of cells, only some of which have been updated. Computer architects call this hazard write tearing. Failures that occur in the middle of writing multiple-cell values can further complicate the situation. To restore before-or-after atomicity, concurrent readers and writers must somehow be coordinated, and a failure in the middle of an update must leave either all or none of the intended update intact. When these conditions are met, the read or write is said to be atomic. A closely related

risk arises when a small value shares a memory cell with other small values. The risk is that if two writers concurrently update different values that share the same cell, one may overwrite the other's update. Atomicity can also solve this problem. Chapter 5 begins the study of atomicity by exploring methods of coordinating concurrent activities. Chapter 9 [on-line] expands the study of atomicity to also encompass failures.

■ Replicated storage. As Chapter 8 [on-line] will explore in detail, reliability of storage can be increased by making multiple copies of values and placing those copies in distinct storage cells. Storage may also be replicated for increased performance, so that several readers can operate concurrently. But replication increases the number of ways in which concurrent read and write operations can interact and possibly lose either read/write coherence or before-or-after atomicity. During the time it takes a writer to update several replicas, readers of an updated replica can get different answers from readers of a replica that the writer hasn't gotten to yet. Chapter 10 [on-line] discusses techniques to ensure read/write coherence and before-or-after atomicity for replicated storage.

Often, the designer of a system must cope with not just one but several of these threats simultaneously. The combination of replication and remoteness is particularly challenging. It can be surprisingly difficult to design memories that are both efficient and also read/write coherent and atomic. To simplify the design or achieve higher performance, designers sometimes build memory systems that have weaker coherence specifications. For example, a multiple processor system might specify: "The result of a read will be the value of the latest write if that write was performed by the same processor." There is an entire literature of "data consistency models" that explores the detailed properties of different memory coherence specifications. In a layered memory system, it is essential that the designer of a layer know precisely the coherence and atomicity specifications of any lower layer memory that it uses. In turn, if the layer being designed provides memory for higher layers, the designer must specify precisely these two properties that higher layers can expect and depend on. Unless otherwise mentioned, we will assume that physical memory devices provide read/write coherence for individual cells, but that before-or-after atomicity for multicell values (for example, files) is separately provided by the layer that implements them.
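The write-tearing hazard, and one way to restore before-or-after atomicity, can be illustrated with the Java synchronized construct mentioned above. The sketch below is only illustrative (the class and value layout are invented here): without the locks, a concurrent reader could observe a multi-cell value of which only some cells have been updated.

    // Illustrative sketch: a two-cell value (e.g., a 128-bit record stored as two longs).
    // Without coordination, a concurrent reader may see one new cell and one old cell
    // ("write tearing"). Guarding both operations with the same lock makes each read
    // and write appear to occur entirely before or entirely after the other.
    public class TwoCellValue {
        private long cell0;
        private long cell1;

        // Before-or-after update of the whole multi-cell value.
        public synchronized void write(long newCell0, long newCell1) {
            cell0 = newCell0;   // first cell updated...
            cell1 = newCell1;   // ...second cell updated; no reader can observe the gap
        }

        // Before-or-after read of the whole multi-cell value.
        public synchronized long[] read() {
            return new long[] { cell0, cell1 };
        }
    }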

2.1.1.2 Memory Latency

An important property of a memory is the time it takes for a read or a write to complete, which is known as its latency (often called access time, though that term has a more precise definition that will be explained in Sidebar 6.4). In the magnetic disk memory (described in Sidebar 2.2) the latency of a particular sector depends on the mechanical state of the device at the instant the user requests access. Having read a sector, one may measure the time required to also read a different but nearby sector in microseconds—but only if the user anticipates the second read and requests it before the disk rotates past that second sector. A request just a few microseconds late may encounter

Sidebar 2.2 How Magnetic Disks Work Magnetic disks consist of rotating circular platters coated on both sides with a magnetic material such as ferric oxide. An electromagnet called a disk head records information by aligning the magnetic field of the particles in a small region on the platter's surface. The same disk head reads the data by sensing the polarity of the aligned particles as the platter spins by. The disk spins continuously at a constant rate, and the disk head actually floats just a few nanometers above the disk surface on an air cushion created by the rotation of the platter.

From a single position above a platter, a disk head can read or write a set of bits, called a track, located a constant distance from the center. In the top view below, the shaded region identifies a track. Tracks are formatted into equal-sized blocks, called sectors, by writing separation marks periodically around the track. Because all sectors are the same size, the outer tracks have more sectors than the inner ones.

A typical modern disk module, known as a "hard drive" because its platters are made of a rigid material, contains several platters spinning on a common axis called a spindle, as in the side view above. One disk head per platter surface is mounted on a comb-like structure that moves the heads in unison across the platters. Movement to a specific track is called seeking, and the comb-like structure is known as a seek arm. The set of tracks that can be read or written when the seek arm is in one position (for example, the shaded regions of the side view) is called a cylinder. Tracks, platters, and sectors are each numbered. A sector is thus addressed by geometric coordinates: track number, platter number, and rotational position. Modern disk controllers typically do the geometric mapping internally and present their clients with an address space consisting of consecutively numbered sectors.

To read or write a particular sector, the disk controller first seeks the desired track. Once the seek arm is in position, the controller waits for the beginning of the desired sector to rotate under the disk head, and then it activates the head on the desired platter. Physically encoding digital data in analog magnetic domains usually requires that the controller write complete sectors.

The time required for disk access is called latency, a term defined more precisely in Chapter 6. Moving a seek arm takes time. Vendors quote seek times of 5 to 10 milliseconds, but that is an average over all possible seek arm moves. A move from one
(Sidebar continues)


cylinder to the next may require only 1/20 of the time of a move from the innermost to the outermost track. It also takes time for a particular sector to rotate under the disk head. A typical disk rotation rate is 7200 rpm, for which the platter rotates once in 8.3 milliseconds. The time to transfer the data depends on the magnetic recording density, the rotation rate, the cylinder number (outer cylinders may transfer at higher rates), and the number of bits read or written. A platter that holds 40 gigabytes transfers data at rates between 300 and 600 megabits per second; thus a 1-kilobyte sector transfers in a microsecond or two. Seek time and rotation delay are limited by mechanical engineering considerations and tend to improve only slowly, but magnetic recording density depends on materials technology, which has improved both steadily and rapidly for many years.

Early disk systems stored between 20 and 80 megabytes. In the 1970s Kenneth Haughton, an IBM inventor, described a new technique of placing disk platters in a sealed enclosure to avoid contamination. The initial implementation stored 30 megabytes on each of two spindles, in a configuration known as a 30-30 drive. Haughton nicknamed it the "Winchester", after the Winchester 30-30 rifle. The code name stuck, and for many years hard drives were known as Winchester drives. Over the years, Winchester drives have gotten physically smaller while simultaneously evolving to larger capacities.

a delay that is a thousand times longer, waiting for that second sector to again rotate under the read head. Thus the maximum rate at which one can transfer data to or from a disk is dramatically larger than the rate one would achieve when choosing sectors at random. A random access memory (RAM) is one for which the latency for memory cells chosen at random is approximately the same as the latency for cells chosen in the pattern best suited for that memory device. An electronic memory chip is usually configured for random access. Memory devices that involve mechanical movement, such as optical disks (CDs and DVDs) and magnetic tapes and disks, are not.

For devices that do not provide random access, it is usually a good idea, having paid the cost in delay of moving the mechanical components into position, to read or write a large block of data. Large-block read and write operations are sometimes relabeled get and put, respectively, and this book uses that convention. Traditionally, the unqualified term memory meant random-access volatile memory, and the term storage was used for non-volatile memory that is read and written in large blocks with get and put. In practice, there are enough exceptions to this naming rule that the words "memory" and "storage" have become almost interchangeable.
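A back-of-the-envelope calculation with the figures quoted in Sidebar 2.2 shows why the gap between random and sequential access matters. This is our own illustration, not the book's code; the 10-microsecond figure for an anticipated nearby-sector read is an assumption in the spirit of the text's "in microseconds".

    // Rough magnitudes from Sidebar 2.2: 7200 rpm rotation, 5-10 ms average seek.
    public class DiskLatencySketch {
        public static void main(String[] args) {
            double revolutionMs = 60_000.0 / 7200.0;         // 8.33 ms per rotation
            double avgRotationalDelayMs = revolutionMs / 2;  // about 4.2 ms on average
            double avgSeekMs = 7.5;                          // midpoint of the 5-10 ms quote
            double randomAccessMs = avgSeekMs + avgRotationalDelayMs;
            double anticipatedNearbyMs = 0.010;              // assumed ~10 microseconds

            System.out.printf("Random sector access   ~ %.1f ms%n", randomAccessMs);
            System.out.printf("Anticipated nearby read ~ %.3f ms%n", anticipatedNearbyMs);
            System.out.printf("Ratio ~ %.0f x%n", randomAccessMs / anticipatedNearbyMs);
        }
    }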

2.1.1.3 Memory Names and Addresses

Physical implementations of memory devices nearly always name a memory cell by the geometric coordinates of its physical storage location. Thus, for example, an electronic memory chip is organized as a two-dimensional array of flip-flops, each holding one named bit. The access mechanism splits the bit name into two parts, which in

turn go to a pair of multiplexers. One multiplexer selects an x-coordinate, the other a y-coordinate, and the two coordinates in turn select the particular flip-flop that holds that bit. Similarly, in a magnetic disk memory, one component of the name electrically selects one of the recording platters, while a distinct component of the name selects the position of the seek arm, thereby choosing a specific track on that platter. A third name component selects a particular sector on that track, which may be identified by counting sectors as they pass under the read head, starting from an index mark that identifies the first sector.

It is easy to design hardware that maps geometric coordinates to and from sets of names consisting of consecutive integers (0, 1, 2, etc.). These consecutive integer names are called addresses, and they form the address space of the memory device. A memory system that uses names that are sets of consecutive integers is called a location-addressed memory. Because the addresses are consecutive, the size of the memory cell that is named does not have to be the same as the size of the cell that is read or written. In some memory architectures each byte has a distinct address, but reads and writes can (and in some cases must always) occur in larger units, such as a word or a line.

FIGURE 2.2
An associative memory implemented in two layers. The associativity layer maps the unconstrained names of its arguments to the consecutive integer addresses required by the physical layer location-addressed memory.

For most applications, consecutive integers are not exactly the names that one would choose for recalling data. One would usually prefer to be allowed to choose less constrained names. A memory system that accepts unconstrained names is called an associative memory. Since physical memories are generally location-addressed, a designer creates an associative memory by interposing an associativity layer, which may be implemented either with hardware or software, that maps unconstrained higher-level names to the constrained integer names of an underlying location-addressed memory, as in Figure 2.2. Examples of software associative memories, constructed on top of one or more underlying location-addressed memories, include personal telephone directories, file systems, and corporate database systems. A cache, a device that remembers the result of an expensive computation in the hope of not redoing that computation if it is needed again soon, is sometimes implemented as an


associative memory, either in software or hardware. (The design of caches is discussed in Section 6.2.)

Layers that provide associativity and name mapping figure strongly in the design of all memory and storage systems. For example, Table 2.2 on page 93 lists the layers of the UNIX file system. For another example of layering of memory abstractions, Chapter 5 explains how memory can be virtualized by adding a name-mapping layer.
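The associativity-layer idea of Figure 2.2 can be sketched as follows: an associative memory with unconstrained string names built on top of a location-addressed memory with consecutive integer addresses. It is only an illustration; the class name, the trivial allocation policy, and the long-valued cells are all invented here.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of an associativity layer (cf. Figure 2.2): unconstrained
    // string names are mapped onto the consecutive integer addresses of an
    // underlying location-addressed memory.
    public class AssociativeMemory {
        private final long[] locationAddressed;            // lower layer: cells 0..N-1
        private final Map<String, Integer> nameToAddress = new HashMap<>();
        private int nextFreeAddress = 0;                    // trivial allocation policy

        public AssociativeMemory(int numberOfCells) {
            this.locationAddressed = new long[numberOfCells];
        }

        // write with an unconstrained name: allocate an address on first use,
        // then write the underlying location-addressed cell.
        public void write(String name, long value) {
            int address = nameToAddress.computeIfAbsent(name, n -> nextFreeAddress++);
            locationAddressed[address] = value;
        }

        // read with an unconstrained name: translate the name, then read the cell.
        public long read(String name) {
            Integer address = nameToAddress.get(name);
            if (address == null) {
                throw new IllegalArgumentException("no value remembered under " + name);
            }
            return locationAddressed[address];
        }
    }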

2.1.1.4 Exploiting the Memory Abstraction: RAID

Returning to the subject of abstraction, a system known as RAID provides an illustration of the power of modularity and of how the storage abstraction can be applied to good effect. RAID is an acronym for Redundant Array of Independent (or Inexpensive) Disks. A RAID system consists of a set of disk drives and a controller configured with an electrical and programming interface that is identical to the interface of a single disk drive, as shown in Figure 2.3. The RAID controller intercepts read and write requests
coming across its interface, and it directs them to one or more of the disks. RAID has two distinct goals:

■ Improved performance, by reading or writing disks concurrently

■ Improved durability, by writing information on more than one disk

FIGURE 2.3
Abstraction in RAID. The electrical and programming interface of the RAID system, represented by the solid arrow, is identical to that of a single disk drive.

Different RAID configurations offer different trade-offs between these goals. Whatever trade-off the designer chooses, because the interface abstraction is that of a single disk, the programmer can take advantage of the improvements in performance and durability without reprogramming.

Certain useful RAID configurations are traditionally identified by (somewhat arbitrary) numbers. In later chapters, we will encounter several of these numbered configurations. The configuration known as RAID 0 (in Section 6.1.3) provides increased performance by allowing concurrent reading and writing. The configuration known as RAID 4 (shown in Figure 8.6 [on-line]) improves reliability by applying error-correction codes. Yet another configuration known as RAID 1 (in Section 8.5.4.6 [on-line]) provides high durability by


making identical copies of the data on different disks. Exercise 8.8 [on-line] explores a simple but elegant performance optimization known as RAID 5. These and several other RAID configurations were originally described in depth in a paper by Randy Katz, Garth Gibson, and David Patterson, who also assigned the traditional numbers to the different configurations [see Suggestions for Further Reading 10.2.2].
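To illustrate why the single-disk interface abstraction lets clients benefit without reprogramming, here is a sketch of a mirroring controller in the style of RAID 1. It is not from the book; the Disk interface, the class names, and the read-balancing policy are invented for the example, and error handling is omitted.

    // Illustrative sketch of the RAID idea: the controller exposes the same
    // read/write-sector interface as a single disk, so clients need not change.
    interface Disk {
        byte[] readSector(long sectorNumber);
        void writeSector(long sectorNumber, byte[] data);
    }

    // A RAID 1-style controller: every write goes to both disks (durability),
    // and reads may be served by either copy (performance).
    class MirroredDisk implements Disk {
        private final Disk primary;
        private final Disk mirror;
        private boolean readFromMirror = false;   // alternate reads between the copies

        MirroredDisk(Disk primary, Disk mirror) {
            this.primary = primary;
            this.mirror = mirror;
        }

        @Override
        public byte[] readSector(long sectorNumber) {
            readFromMirror = !readFromMirror;
            return (readFromMirror ? mirror : primary).readSector(sectorNumber);
        }

        @Override
        public void writeSector(long sectorNumber, byte[] data) {
            primary.writeSector(sectorNumber, data);   // write both copies
            mirror.writeSector(sectorNumber, data);
        }
    }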

2.1.2 Interpreters

Interpreters are the active elements of a computer system; they perform the actions that constitute computations. Figure 2.4 lists some examples of interpreters that may be familiar. As with memory, interpreters also come in a wide range of physical manifestations. However, they too can be described with a simple abstraction, consisting of just three components:

1. An instruction reference, which tells the interpreter where to find its next instruction

2. A repertoire, which defines the set of actions the interpreter is prepared to perform when it retrieves an instruction from the location named by the instruction reference

3. An environment reference, which tells the interpreter where to find its environment, the current state on which the interpreter should perform the action of the current instruction

The normal operation of an interpreter is to proceed sequentially through some program, as suggested by the diagram and pseudocode of Figure 2.5. Using the environment reference to find the current environment, the interpreter retrieves from that environment the program instruction indicated in the instruction reference. Again using the environment reference, the interpreter performs the action directed by the
program instruction. That action typically involves using and perhaps changing data in the environment, and also an appropriate update of the instruction reference. When it finishes performing the instruction, the interpreter moves on, taking as its next instruction the one now named by the instruction reference. Certain events, called interrupts, may catch the attention of the interpreter, causing it, rather than the program, to supply the next instruction. The original program no longer controls the interpreter; instead, a different program, the interrupt handler, takes control and handles the event. The interpreter may also change the environment reference to one that is appropriate for the interrupt handler.

Hardware:
  Pentium 4, PowerPC 970, UltraSPARC T1
  disk controller
  display controller
Software:
  Alice, AppleScript, Perl, Tcl, Scheme
  LISP, Python, Forth, Java bytecode
  JavaScript, Smalltalk
  TeX, LaTeX
  Safari, Internet Explorer, Firefox

FIGURE 2.4
Some common examples of interpreters. The disk controller example is explained in Section 2.3 and the Web browser examples are the subject of Exercise 4.5.


1 procedure interpret ()
2   do forever
3     instruction ← read (instruction_reference)
4     perform instruction in the context of environment_reference
5     if interrupt_signal = TRUE then
6       instruction_reference ← entry point of interrupt_handler
7       environment_reference ← environment reference of interrupt_handler

FIGURE 2.5
Structure of, and pseudocode for, an abstract interpreter. Solid arrows show control flow, and dashed arrows suggest information flow. Sidebar 2.3 describes this book's conventions for expressing pseudocode.

Sidebar 2.3 Representation: Pseudocode and Messages This book presents many examples of program fragments. Most of them are represented in pseudocode, an imaginary programming language that adopts familiar features from different existing programming languages as needed and that occasionally intersperses English text to characterize some step whose exact detail is unimportant. The pseudocode has some standard features, several of which this brief example shows.

1 procedure sum (a, b)    // Add two numbers.
2   total ← a + b
3   return total

The line numbers on the left are not part of the pseudocode; they are there simply to allow the text to refer to lines in the program. Procedures are explicitly declared
(Sidebar continues)


(as in line 1), and indentation groups blocks of statements together. Program variables are set in italic, program key words in bold, and literals such as the names of procedures and built-in constants in small caps. The left arrow denotes substitution or assignment (line 2) and the symbol "=" denotes equality in conditional expressions. The double slash precedes comments that are not part of the pseudocode. Various forms of iteration (while, until, for each, do occasionally), conditionals (if), set operations (is in), and case statements (do case) appear when they are helpful in expressing an example. The construction for j from 0 to 3 iterates four times; array indices start at 0 unless otherwise mentioned. The construction y.x means the element named x in the structure named y. To minimize clutter, the pseudocode omits declarations wherever the meaning is reasonably apparent from the context. Procedure parameters are passed by value unless the declaration reference appears. Section 2.2.1 of this chapter discusses the distinction between use by value and use by reference. When more than one variable uses the same structure, the declaration structure_name instance variable_name may be used.

The notation a(11 ... 15) denotes extraction of bits 11 through 15 from the string a (or from the variable a considered as a string). Bits are numbered left to right starting with zero, with the most significant bit of integers first (using big-endian notation, as described in Sidebar 4.3). The + operator, when applied to strings, concatenates the strings.

Some examples are represented in the instruction repertoire of an imaginary reduced instruction set computer (RISC). Because such programs are cumbersome, they appear only when it is essential to show how software interacts with hardware.

In describing and using communication links, the notation

X

2.1.2.1 Processors

A general-purpose processor is an implementation of an interpreter. For purposes of concrete discussion throughout this book, we use a typical reduced instruction set processor. The processor's instruction reference is a program counter, stored in a fast memory register inside the processor. The program counter contains the address of the memory location that stores the next instruction of the current program. The environment reference of the processor consists in part of a small amount of built-in location-addressed memory in the form of named (by number) registers for fast access to temporary results of computations.

Our general-purpose processor may be directly wired to a memory, which is also part of its environment. The addresses in the program counter and in instructions are then names in the address space of that memory, so this part of the environment reference is wired in and unchangeable. When we discuss virtualization in Chapter 5, we will extend the processor to refer to memory indirectly via one or more registers. With that change, the environment reference is maintained in those registers, thus allowing addresses issued by the processor to map to different names in the address space of the memory.

The repertoire of our general-purpose processor includes instructions for expressing computations such as adding two numbers (add), subtracting one number from another (sub), comparing two numbers (cmp), and changing the program counter to the address of another instruction (jmp). These instructions operate on values stored in the named registers of the processor, which is why they are colloquially called "op-codes".

The repertoire also includes instructions to move data between processor registers and memory. To distinguish program instructions from memory operations, we use the name load for the instruction that reads a value from a named memory cell into a register of the processor and store for the instruction that writes the value from a register into a named memory cell. These instructions take two integer arguments, the name of a memory cell and the name of a processor register.

The general-purpose processor provides a stack, a push-down data structure that is stored in memory and used to implement procedure calls. When calling a procedure, the caller pushes arguments of the called procedure (the callee) on the stack. When the callee returns, the caller pops the stack back to its previous size. This implementation of procedures supports recursive calls because every invocation of a procedure always finds its arguments at the top of the stack. We dedicate one register for implementing stack operations efficiently. This register, known as the stack pointer, holds the memory address of the top of the stack.

As part of interpreting an instruction, the processor increments the program counter so that, when that instruction is complete, the program counter contains the address of the next instruction of the program. If the instruction being interpreted is a jmp, that instruction loads a new value into the program counter. In both cases, the flow of instruction interpretation is under control of the running program.

The processor also implements interrupts. An interrupt can occur because the processor has detected some problem with the running program (e.g., the program attempted to execute an instruction that the interpreter does not or cannot

implement, such as dividing by zero). An interrupt can also occur because a signal arrives from outside the processor, indicating that some external device needs attention (e.g., the keyboard signals that a key press is available). In the first case, the interrupt mechanism may transfer control to an exception handler elsewhere in the program. In the second case, the interrupt handler may do some work and then return control to the original program. We shall return to the subject of interrupts and the distinction between interrupt handlers and exception handlers in the discussion of threads in Chapter 5.

In addition to general-purpose processors, computer systems typically also have special-purpose processors, which have a limited repertoire. For example, a clock chip is a simple, hard-wired interpreter that just counts: at some specified frequency, it executes an add instruction, which adds 1 to the contents of a register or memory location that corresponds to the clock. All processors, whether general-purpose or specialized, are examples of interpreters. However, they may differ substantially in the repertoire they provide. One must consult the device manufacturer's manual to learn the repertoire.
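The three interpreter components just described (instruction reference, repertoire, environment reference) can be made concrete with a toy register machine. The sketch below is our own illustration, not the book's processor: the encoding and the tiny opcode set (load, store, add, jmp, halt) are invented for the example.

    // A toy register-machine interpreter: the program counter is the instruction
    // reference, the switch over opcodes is the repertoire, and the registers plus
    // memory form the environment.
    public class ToyProcessor {
        enum Opcode { LOAD, STORE, ADD, JMP, HALT }

        static final class Instruction {
            final Opcode op; final int a; final int b;
            Instruction(Opcode op, int a, int b) { this.op = op; this.a = a; this.b = b; }
        }

        private final long[] registers = new long[8];   // environment: named registers
        private final long[] memory;                    // environment: location-addressed memory
        private int programCounter = 0;                 // instruction reference

        public ToyProcessor(long[] memory) { this.memory = memory; }

        public void run(Instruction[] program) {
            while (true) {
                Instruction instruction = program[programCounter];   // fetch
                programCounter++;                                    // default: next instruction
                switch (instruction.op) {                            // perform the action
                    case LOAD:  registers[instruction.a] = memory[instruction.b]; break;
                    case STORE: memory[instruction.b] = registers[instruction.a]; break;
                    case ADD:   registers[instruction.a] += registers[instruction.b]; break;
                    case JMP:   programCounter = instruction.a; break;
                    case HALT:  return;
                }
            }
        }
    }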

2.1.2.2 Interpreter Layers

Interpreters are nearly always organized in layers. The lowest layer is usually a hardware engine that has a fairly primitive repertoire of instructions, and successive layers provide an increasingly rich or specialized repertoire. A full-blown application system may involve four or five distinct layers of interpretation. Across any given layer interface, the lower layer presents some repertoire of possible instructions to the upper layer. Figure 2.6 illustrates this model.

FIGURE 2.6
The model for a layered interpreter. Each layer interface, shown as a dashed line, represents an abstraction barrier, across which an upper layer procedure requests execution of instructions from the repertoire of the lower layer. The lower layer procedure typically implements an instruction by performing several instructions from the repertoire of a next lower layer interface.

Consider, for example, a calendar management program. The person making requests by moving and clicking a mouse views the calendar program as an interpreter of the mouse gestures. The instruction reference tells the interpreter to obtain its next instruction from the keyboard and mouse. The repertoire of instructions is the set of available requests—to add a new event, to insert some descriptive text, to change the hour, or to print a list of the day's events. The environment is a set of files that remembers the calendar from day to day.

The calendar program implements each action requested by the user by invoking statements in some programming language such as Java. These statements—such as iteration statements, conditional statements, substitution statements, procedure calls—constitute the instruction repertoire of the next lower layer. The instruction reference keeps track of which statement is to be executed next, and the environment is the collection of named variables used by the program. (We are assuming here that the Java language program has not been compiled directly to machine language. If a compiler is used, there would be one less layer.)

The actions of the programming language are in turn implemented by hardware machine language instructions of some general-purpose processor, with its own instruction reference, repertoire, and environment reference.

Figure 2.7 illustrates the three layers just described. In practice, the layered structure may be deeper—the calendar program is likely to be organized with an internal upper layer that interprets the graphical gestures and a lower layer that manipulates the calendar data, the Java interpreter may have an intermediate byte-code interpreter layer, and some machine languages are implemented with a microcode interpreter layer on top of a layer of hardware gates.
FIGURE 2.7
The interpreter layers of the calendar example: a human user generating requests, the calendar manager layer interface (typical instruction across this interface: add new event on February 27), the Java language layer interface (typical instruction: nextch), and the machine language layer interface.

One goal in the design of a layered interpreter is to ensure that the designer of each layer can be confident that the layer below either completes each instruction successfully or does nothing at all. Half-finished instructions should never be a concern, even if there is a catastrophic failure. That goal is another example of atomicity, and achieving it is relatively difficult. For the moment, we simply assume that interpreters are atomic, and we defer the discussion of how to achieve atomicity to Chapter 9 [on-line].

2.1.3 Communication Links


A com m u n ication link provides a way for information to move betw een physically
separated components. Communication links, o f which a few examples are listed in
Figure 2.8, com e in a w ide range o f technologies, but, like mem ories and interpreters,
they can be described with a simple abstraction. The communication link abstraction
has tw o operations:
send ( link_name, outgoing_message_buffer)
receive ( Hnk_name, incoming_message_buffer)

The send operation specifies an array o f bits, called a message, to be sent over the
communication link identified by link_name (for example, a wire). The argument
outgoing_message_buffer identifies the message to be sent, usually by giving the
address and size o f a buffer in m em ory that contains the message. The receive opera­
tion accepts an incom ing message, again usually by designating the address and size o f
a buffer in m em ory to hold the incom ing message. O nce the low est layer o f a system
has received a message, higher layers may acquire the message by calling a receive
interface o f the low er layer, or the low er layer may “ upcall”to the higher layer, in
which case the interface might be better characterized as d e liv e r (incom ing_message) .
Names connect systems to communication links in tw o different ways. First, the
link_name arguments o f send and receive identify one o f possibly several available com ­
munication links attached to the system.
Second, som e communication links are
Hardware technology: actually multiply-attached networks o f
twisted pair links, and som e additional m ethod is
coaxial cable n eeded to name w hich o f several p o s­
optical fiber sible recipients should receive the mes­
Higher level sage. The name o f the intended recipient
Ethernet is typically one o f the com ponents o f the
Universal Serial Bus (USB) message.
the Internet At first glance, it might appear
the telephone system that sending and receiving a mes­
a unix pipe sage is just an example o f copying
an array o f bits from one m em ory to
FIGURE 2.8 _______ another m em ory over a wire using a
Some examples of communication links. sequence o f read and w rite operations,


so there is no need for a third abstraction. However, communication links involve more than simple copying—they have many complications, such as a wide range of operating parameters that makes the time to complete a send or receive operation unpredictable, a hostile environment that threatens integrity of the data transfer, asynchronous operation that leads to the arrival of messages whose size and time of delivery cannot be known in advance, and, most significant, the message may not even be delivered. Because of these complications, the semantics of send and receive are typically quite different from those associated with read and write. Programs that invoke send and receive must take these different semantics explicitly into account. On the other hand, some communication link implementations do provide a layer that does its best to hide a send/receive interface behind a read/write interface.

Just as with memory and interpreters, designers organize and implement communication links in layers. Rather than continuing a detailed discussion of communication links here, we defer that discussion to Section 7.2 [on-line], which describes a three-layer model that organizes communication links into systems called networks. Figure 7.18 [on-line] illustrates this three-layer network model, which comprises a link layer, a network layer, and an end-to-end layer.
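As a concrete rendering of the two-operation communication link abstraction, the sketch below defines a send/receive interface and a trivial in-process implementation. It is only an illustration under our own naming; a real link would exhibit exactly the complications listed above (loss, delay, and unpredictable timing) that this toy version hides.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch of the communication link abstraction with its two
    // operations, send and receive. Interface and class names are our own.
    interface CommunicationLink {
        void send(byte[] outgoingMessage) throws InterruptedException;
        byte[] receive() throws InterruptedException;
    }

    // A trivial in-process "link" backed by a queue. Unlike a real link, it never
    // loses, reorders, or corrupts messages, which is why send/receive semantics
    // differ from read/write semantics in practice.
    class InProcessLink implements CommunicationLink {
        private final BlockingQueue<byte[]> channel = new LinkedBlockingQueue<>();

        @Override
        public void send(byte[] outgoingMessage) throws InterruptedException {
            channel.put(outgoingMessage.clone());   // copy the message buffer onto the link
        }

        @Override
        public byte[] receive() throws InterruptedException {
            return channel.take();                  // blocks until a message arrives
        }
    }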

2.2 NAMING IN COMPUTER SYSTEMS

Computer systems use names in many ways in their construction, configuration, and operation. The previous section mentioned memory addresses, processor registers, and link names, and Figure 2.9 lists several additional examples, some of which are probably familiar, others of which will turn up in later chapters. Some system names resemble those of a programming language, whereas others are quite different. When building systems out of subsystems, it is essential to be able to use a subsystem without having to know details of how that subsystem refers to its components. Names are thus used to achieve modularity, and at the same time, modularity must sometimes hide names.

We approach names from an object point of view: the computer system manipulates objects. An interpreter performs the manipulation under control of a program or perhaps under the direction of a human user. An object may be structured, which means that it uses other objects as components. In a direct analogy with two ways in which procedures can pass arguments, there are two ways to arrange for one object to use another as a component:

■ create a copy of the component object and include the copy in the using object (use by value), or

■ choose a name for the component object and include just that name in the using object (use by reference). The component object is said to export the name.

When passing arguments to procedures, use by value enhances modularity, because if the callee accidentally modifies the argument it does not affect the original. But use by value can be problematic because it does not easily permit two or more objects to share a component object whose value changes. If both object A
Chapter 2

Modularity through Clients and Services, RPC

This chapter contains the book chapter:

J. H. Saltzer and M. F. Kaashoek. Principles of Computer System
   Design: An Introduction. Part I. Section 4.2, pp. 167-172 (6 of
   526). Morgan Kaufmann, 2009. ISBN: 978-0-12-374957-4

The chapter discusses how to organize interpreters in terms of clients and services. An important property of this organization is strong modularity, the capability of bounding failure propagation between these components. A classic mechanism to achieve strong modularity with clients and services is the remote procedure call (RPC). The ultimate goal of this portion of the material is to enable us to write strongly modular software with RPCs, and to reflect on how strong modularity affects program semantics but at the same time allows us to incorporate additional mechanisms (e.g., for scaling the number of connections) in between modular clients and services.

The learning goals for this portion of the material are listed below.

• Recognize and explain modular designs with clients and services.

• Predict the functioning of service calls under different RPC semantics and
  failure modes.

• Identify different mechanisms to achieve RPCs.

• Implement RPC services with an appropriate mechanism, such as web services.

Note that the course assignments will be instrumental in achieving the goals in this as well as other parts of the course.

19
This page has intentionally been left blank.

20
4.2 COMMUNICATION BETWEEN CLIENT AND SERVICE

This section describes two extensions to sending and receiving messages. First, it introduces remote procedure call (RPC), a stylized form of client/service interaction in which each request is followed by a response. The goal of RPC systems is to make a remote procedure call look like an ordinary procedure call. Because a service fails independently from a client, however, a remote procedure call can generally not offer identical semantics to procedure calls. As explained in the next subsection, some RPC systems provide various alternative semantics and the programmer must be aware of the details.

Second, in some applications it is desirable to be able to send messages to a recipient that is not on-line and to receive messages from a sender that is not on-line. For example, electronic mail allows users to send e-mail without requiring the recipient to be on-line. Using an intermediary for communication, we can implement these applications.

4.2.1 Remote Procedure Call (RPC)


In many of the examples in the previous section, the client and service interact in a stylized fashion: the client sends a request, and the service replies with a response after processing the client's request. This style is so common that it has received its own name: remote procedure call, or RPC for short.

RPCs come in many varieties, adding features to the basic request/response style of interaction. Some RPC systems, for example, simplify the programming of clients and services by hiding many of the details of constructing and formatting messages. In the time service example above, the programmer must call send_message and receive_message, convert results into numbers, and so on. Similarly, in the file service example, the client and service have to construct messages and convert numbers into bit strings and the like. Programming these conversions is tedious and error prone.

Stubs remove this burden from the programmer (see Figure 4.7). A stub is a procedure that hides the marshaling and communication details from the caller and callee. An RPC system can use stubs as follows. The client module invokes a remote procedure, say get_time, in the same way that it would call any other procedure. However, get_time is actually just the name of a stub procedure that runs inside the client module (see Figure 4.8). The stub marshals the arguments of a call into a message, sends the message, and waits for a response. On arrival of the response, the client stub unmarshals the response and returns to the caller.

Similarly, a service stub waits for a message, unmarshals the arguments, and calls the procedure that the client requests (get_time in the example). After the procedure returns, the service stub marshals the results of the procedure call into a message and sends it in a response to the client stub.
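To make the stub idea concrete, here is a minimal sketch, in Java, of what hand-written stubs for the get_time example might look like. It is an illustration only, not the book's implementation: the one-line text message format, the class names TimeClientStub and TimeServiceStub, and the use of a plain TCP socket per call are assumptions made for this sketch, and real stubs would add error handling and a proper wire format.

    import java.io.*;
    import java.net.*;

    // Client stub: marshals the request, sends it, waits for and unmarshals the response.
    class TimeClientStub {
        private final String host; private final int port;
        TimeClientStub(String host, int port) { this.host = host; this.port = port; }

        long getTime(String unit) throws IOException {
            try (Socket s = new Socket(host, port);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
                out.println("Get time " + unit);                  // marshal arguments into a message
                String response = in.readLine();                  // wait for the response
                if (response == null || !response.startsWith("ok "))
                    throw new IOException("service failed: " + response);
                return Long.parseLong(response.substring(3));     // unmarshal the result
            }
        }
    }

    // Service stub: waits for a message, unmarshals it, calls the real procedure, replies.
    class TimeServiceStub {
        void serve(int port) throws IOException {
            try (ServerSocket listener = new ServerSocket(port)) {
                while (true) {
                    try (Socket s = listener.accept();
                         BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        String request = in.readLine();
                        if (request != null && request.startsWith("Get time"))
                            out.println("ok " + System.currentTimeMillis() / 1000);  // the real get_time
                        else
                            out.println("Bad request");
                    }
                }
            }
        }
    }

An automatically generated stub, as in Figure 4.9 below, would play exactly the role that TimeClientStub.getTime plays in this sketch.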
Writing stubs that convert more complex objects into an appropriate on-wire representation becomes quite tedious. Some high-level programming languages


FIGURE 4.7  Implementation of a remote procedure call using stubs. The stubs hide all remote communication from the caller and callee.

Client program

1   procedure measure (func)
2       start <- get_time (SECONDS)
3       func ()                              // invoke the function
4       end <- get_time (SECONDS)
5       return end - start
6
7   procedure get_time (unit)                // the client stub for get_time
8       send_message (NameForTimeService, {"Get time", unit})
9       response <- receive_message (NameForClient)
10      return convert2internal (response)

Service program

1   procedure time_service ()                // the service stub for get_time
2       do forever
3           request <- receive_message (NameForTimeService)
4           opcode <- get_opcode (request)
5           unit <- get_argument (request)
6           if opcode = "Get time" and (unit = SECONDS or unit = MINUTES) then
7               response <- {"ok", get_time (unit)}
8           else
9               response <- {"Bad request"}
10          send_message (NameForClient, response)

FIGURE 4.8  get_time client and service using stubs.

such as Java can generate these stubs automatically from an interface specification [Suggestions for Further Reading 4.1.3], simplifying client/service programming even further. Figure 4.9 shows the client for such an RPC system. The RPC system would generate a procedure similar to the get_time stub in Figure 4.8. The client program of Figure 4.9 looks almost identical to the one using a local procedure call on page 149,


The client program

1   procedure measure (func)
2       try
3           start <- get_time (SECONDS)
4       catch (signal servicefailed)
5           return servicefailed
6       func ()                              // invoke the function
7       try
8           end <- get_time (SECONDS)
9       catch (signal servicefailed)
10          return servicefailed
11      return end - start

FIGURE 4.9  get_time client using a system that generates RPC stubs automatically.

except that it handles an additional error because remote procedure calls are not identical to procedure calls (as discussed below). The procedure that the service calls on line 7 is just the original procedure get_time on page 149. Whether a system uses RPC with automatic stub generation is up to the implementers. For example, some implementations of Sun's Network File System (see Section 4.5) use automatic stub generation, but others do not.

4.2.2 RPCs are not Identical to Procedure Calls


It is tempting to think that by using stubs one can make a remote procedure call behave exactly the same as an ordinary procedure call, so that a programmer doesn't have to think about whether the procedure runs locally or remotely. In fact, this goal was a primary one when RPC was originally proposed—hence the name remote "procedure call". However, RPCs are different from ordinary procedure calls in three important ways. First, RPCs can reduce fate sharing between caller and callee by exposing the failures of the callee to the caller so that the caller can recover. Second, RPCs introduce new failures that don't appear in procedure calls. These two differences change the semantics of remote procedure calls as compared with ordinary procedure calls, and the changes usually require the programmer to make adjustments to the surrounding code. Third, remote procedure calls take more time than procedure calls; the number of instructions to invoke a procedure (see Figure 4.2) is much less than the cost of invoking a stub, marshaling arguments, sending a request over a network, invoking a service stub, unmarshaling arguments, marshaling the response, receiving the response over the network, and unmarshaling the response.

To illustrate the first difference, consider writing a procedure call to the library program sqrt, which computes the square root of its argument x. A careful programmer would plan for the case that sqrt (x) will fail when x is negative by providing an explicit exception handler for that case. However, the programmer using ordinary procedure calls almost certainly doesn't go to the trouble of planning for certain possible failures because they have negligible probability. For


example, the programmer probably would not think of setting an interval timer when invoking sqrt (x), even though sqrt internally has a successive-approximation loop that, if programmed wrong, might not terminate.

But now consider calling sqrt with an RPC. An interval timer suddenly becomes essential because the network between client and service can lose a message, or the other computer can crash independently. To avoid fate sharing, the RPC programmer must adjust the code to prepare for and handle this failure. When the client receives a "service failure" signal, the client may be able to recover by, for example, trying a different service or choosing an alternative algorithm that doesn't use a remote service.

The second difference between ordinary procedure calls and RPCs is that RPCs introduce a new failure mode, the "no response" failure. When there is no response from a service, the client cannot tell which of two things went wrong: (1) some failure occurred before the service had a chance to perform the requested action, or (2) the service performed the action and then a failure occurred, causing just the response to be lost.
Most RPC designs handle the no-response case by choosing one of three implementation strategies:

■ At-least-once RPC. If the client stub doesn't receive a response within some specific time, the stub resends the request as many times as necessary until it receives a response from the service. This implementation may cause the service to execute a request more than once. For applications that call sqrt, executing the request more than once is harmless because with the same argument sqrt should always produce the same answer. In programming language terms, the sqrt service has no side effects. Such side-effect-free operations are also idempotent: repeating the same request or sequence of requests several times has the same effect as doing it just once. An at-least-once implementation does not provide the guarantee implied by its name. For example, if the service was located in a building that has been blown away by a hurricane, retrying doesn't help. To handle such cases, an at-least-once RPC implementation will give up after some number of retries. When that happens, the request may have been executed more than once or not at all.

■ At-most-once RPC. If the client stub doesn't receive a response within some specific time, then the client stub returns an error to the caller, indicating that the service may or may not have processed the request. At-most-once semantics may be more appropriate for requests that do have side effects. For example, in a banking application, using at-least-once semantics for a request to transfer $100 from one account to another could result in multiple $100 transfers. Using at-most-once semantics assures that either zero or one transfers take place, a somewhat more controlled outcome. Implementing at-most-once RPC is harder than it sounds because the underlying network may duplicate the request message without the client stub's knowledge. Chapter 7 [on-line] describes an at-most-once implementation, and Birrell and Nelson's paper gives

a nice, complete description of an RPC system that implements at-most-once [Suggestions for Further Reading 4.1.1].

■ Exactly-once RPC. These semantics are the ideal, but because the client and service are independent it is in principle impossible to guarantee. As in the case of at-least-once, if the service is in a building that was blown away by a hurricane, the best the client stub can do is return error status. On the other hand, by adding the complexity of extra message exchanges and careful record-keeping, one can approach exactly-once semantics closely enough to satisfy some applications. The general idea is that, if the RPC requesting transfer of $100 from account A to B produces a "no response" failure, the client stub sends a separate RPC request to the service to ask about the status of the request that got no response. This solution requires that both the client and the service stubs keep careful records of each remote procedure call request and response. These records must be fault tolerant because the computer running the service might fail and lose its state between the original RPC and the inquiry to check on the RPC's status. Chapters 8 [on-line] through 10 [on-line] introduce the necessary techniques.
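As a small illustration of the first strategy, the sketch below shows the retry loop that an at-least-once client stub might run. It is a sketch under stated assumptions: the transport function standing in for one send/receive exchange (returning null on timeout) and the fixed retry count are invented here for illustration.

    import java.util.function.Function;

    // At-least-once RPC sketch: resend the request until a response arrives or retries run out.
    // 'transport' stands in for one send/receive exchange and returns null on timeout; it is a
    // placeholder for real network code, introduced only for this example.
    class AtLeastOnceStub {
        private final Function<String, String> transport;
        AtLeastOnceStub(Function<String, String> transport) { this.transport = transport; }

        String call(String request, int maxAttempts) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                String response = transport.apply(request);   // may execute the request again at the service
                if (response != null)
                    return response;
            }
            // Giving up: the request may have executed many times or not at all.
            throw new RuntimeException("no response after " + maxAttempts + " attempts");
        }
    }

Note that every resend may run the requested operation again at the service, which is why the text restricts this strategy to idempotent operations such as sqrt.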

The programmer must be aware that RPC semantics differ from those of ordinary procedure calls, and because different RPC systems handle the no-response case in different ways, it is important to understand just which semantics any particular RPC system tries to provide. Even if the name of the implementation implies a guarantee (e.g., at-least-once), we have seen that there are cases in which the implementation cannot deliver it. One cannot simply take a collection of legacy programs and arbitrarily separate the modules with RPC. Some thought and reprogramming is inevitably required. Problem set 2 explores the effects of different RPC semantics in the context of a simple client/service application.
The third difference is that calling a local procedure typically takes much less time than calling a remote procedure. For example, invoking a remote sqrt is likely to be more expensive than the computation for sqrt itself because the overhead of a remote procedure call is much higher than the overhead of following the procedure calling conventions. To hide the cost of a remote procedure call, a client stub may deploy various performance-enhancing techniques (see Chapter 6), such as caching results and pipelining requests (as is done in the X Window System of Sidebar 4.4). These techniques increase complexity and can introduce new problems (e.g., how to ensure that the cache at the client stays consistent with the one at the service). The performance difference between procedure calls and remote procedure calls requires the designer to consider carefully which procedure calls should be remote ones and which ones should be ordinary, local procedure calls.

A final difference between procedure calls and RPCs is that some programming language features don't combine well with RPC. For example, a procedure that communicates with another procedure through global variables cannot typically be executed remotely because separate computers usually have separate address spaces. Similarly, other language constructs that use explicit addresses won't work. Arguments


consisting of data structures that contain pointers, for example, are a problem because pointers to objects in the client computer are local addresses that have different bindings when resolved in the service computer. It is possible to design systems that use global references for objects that are passed by reference to remote procedure calls, but such systems require significant additional machinery and introduce new problems. For example, a new plan is needed for determining whether an object can be deleted locally because a remote computer might still have a reference to the object. Solutions exist, however; see, for example, the article on Network Objects [Suggestions for Further Reading 4.1.2].

Since RPCs don't provide the same semantics as procedure calls, the word "procedure" in "remote procedure call" can be misleading. Over the years the concept of RPC has evolved from its original interpretation as an exact simulation of an ordinary procedure call to instead mean any client/service interaction in which the request is followed by a response. This text uses this modern interpretation.

4.2.3 Communicating through an Intermediary


Sending a message from a sender to a receiver requires that both parties be available at the same time. In many applications this requirement is too strict. For example, in electronic mail we desire that a user be able to send an e-mail to a recipient even if the recipient is not on-line at the time. The sender sends the message and the recipient receives the message some time later, perhaps when the sender is not on-line. We can implement such applications using an intermediary. In the case of communication, this intermediary doesn't have to be trusted because communication applications often consider the intermediary to be part of an untrusted network and have a separate plan for securing messages (as we will see in Chapter 11 [on-line]).

The primary purpose of the e-mail intermediary is to implement buffered communication. Buffered communication provides the send/receive abstraction but avoids the requirement that the sender and receiver be present simultaneously. It allows the delivery of a message to be shifted in time. The intermediary can hold messages until the recipient comes on-line. The intermediary might buffer messages in volatile memory or in non-volatile memory, such as a file system. The latter design allows the intermediary to buffer messages across power failures.

Once we have an intermediary, three interesting design opportunities arise. First, the sender and receiver may make different choices of whether to push or pull messages. Push is when the initiator of a data movement sends the data. Pull is when the initiator of a data movement asks the other end to send it the data. These definitions are independent of whether or not the system uses an intermediary, but in systems with intermediaries it is not uncommon to find both in a single system. For example, the sender in the Internet's e-mail system, Simple Mail Transfer Protocol (SMTP), pushes the mail to the service that holds the recipient's mailbox. On the other hand, the receiving client pulls messages to fetch mail from a mailbox: the user hits the

26
Chapter 3

Techniques for Performance

This chapter contains the book chapter:

J. Saltzer and M. F. Kaashoek. Principles of Computer System


Design: An Introduction. Part I. Section 6.1, pp. 300-316 (17 of
526). Morgan Kaufmann, 2009. ISBN: 978-0-12-374957-4

An important consequence of writing strongly modular software with clients and services is that service implementations may be optimized for performance or for reliability (or both) without affecting clients, as long as these optimizations do not affect semantics. Performance is a recurring theme in design and implementation of system abstractions and services. There are a number of general techniques for performance, among which concurrency is of special importance. The ultimate goal of this portion of the material is to introduce us to metrics to characterize performance, and provide us with general techniques to design services for performance.
The learning goals for this portion of the material are listed below.

• Explain performance metrics such as latency, throughput, overhead, utilization, capacity, and scalability.

• List common hardware parameters that affect performance.

• Apply performance improvement techniques, such as concurrency, batching, dallying, and fast-path coding.

27

OVERVIEW

The specification of a computer system typically includes explicit (or implicit) performance goals. For example, the specification may indicate how many concurrent users the system should be able to support. Typically, the simplest design fails to meet these goals because the design has a bottleneck, a stage in the computer system that takes longer to perform its task than any of the other stages. To overcome bottlenecks, the system designer faces the task of creating a design that performs well, yet is simple and modular.

This chapter describes techniques to avoid or hide performance bottlenecks. Section 6.1 presents ways to identify bottlenecks and the general approaches to handle them, including exploiting workload properties, concurrent execution of operations, speculation, and batching. Section 6.2 examines specific versions of the general techniques to attack the common problem of implementing multilevel memory systems efficiently. Section 6.3 presents scheduling algorithms for services to choose which request to process first, if there are several waiting for service.

6.1 DESIGNING FOR PERFORMANCE


Performance bottlenecks show up in computer systems for two reasons. First, limits imposed by physics, technology, or economics restrict the rate of improvement in some dimensions of technology, while other dimensions improve rapidly. An obvious class of limits are the physical ones. The speed of light limits how fast signals travel from one end of a chip to the other, how many memory elements can be within a given latency from the processor, and how fast a network message can travel in the Internet. Many other physical limits appear in computer systems, such as power and heat dissipation.

These limits force a designer to make trade-offs. For example, by shrinking a chip, a designer can make the chip faster, but it also reduces the area from which heat can be dissipated. Worse, the power dissipation increases as the designer speeds up the chip. A related trade-off is between the speed of a laptop and its power consumption. A designer wants to minimize a laptop's power consumption so that the battery lasts longer, yet customers want laptops with fast processors and large, bright screens. Physical limits are only a subset of the limits a designer faces; there are also algorithmic, reliability, and economic limits. More limits mean more trade-offs and a higher risk of bottlenecks.

The second reason bottlenecks surface in computer systems is that several clients may share a device. If a device is busy serving one client, other clients must wait until the device becomes available. This property forces the system designer to answer questions such as which client should receive the device first. Should the device first perform the request that requires little work, perhaps at the cost of delaying the request that requires a lot of work? The designer would like to devise a scheduling plan that doesn't starve some clients in favor of others, provides low turnaround time


for each individual client request, and has little overhead so that it can serve many clients. As we will see, it is impossible to maximize all of these goals simultaneously, and thus a designer must make trade-offs. Trade-offs may favor one class of requests over another and may result in bottlenecks for the unfavored classes of requests.

Designing for performance creates two major challenges in computer systems. First, one must consider the benefits of optimization in the context of technology improvements. Some bottlenecks are intrinsic ones; they require careful thinking to ensure that the system runs faster than the performance of the slowest stage. Some bottlenecks are technology dependent; time may eliminate these, as technology improves. Unfortunately, it is sometimes difficult to decide whether or not a bottleneck is intrinsic. Not uncommonly, a performance optimization for the next product release is irrelevant by the time the product ships because technology improvements have removed the bottleneck completely. This phenomenon is so common in computer design that it has led to formulation of the design hint: when in doubt use brute force. Sidebar 6.1 discusses this hint.

Sidebar 6.1 Design Hint: When in Doubt use Brute Force  This chapter describes a few design hints that help a designer resolve trade-offs in the face of limits. These design hints are hints because they often guide the designer in the right direction, but sometimes they don't. In this book we cover only a few, but the interested reader should digest Hints for computer system design by B. Lampson, which presents many more practical guidelines in the form of hints [Suggestions for Further Reading 1.5.4].

The design hint "when in doubt use brute force" is a direct corollary of the d(technology)/dt curve (see Section 1.4). Given computing technology's historical rate of improvement, it is typically wiser to choose simple algorithms that are well understood rather than complex, badly characterized algorithms. By the time the complex algorithm is fully understood, implemented, and debugged, new hardware might be able to execute the simple algorithm fast enough. Thompson and Ritchie used a fixed-size table of processes in the UNIX system and searched the table linearly because a table was simple to implement and the number of processes was small. With Joe Condon, Thompson also built the Belle chess machine that relied mostly on special-purpose hardware to search many positions per second rather than on sophisticated algorithms. Belle won the world computer chess championships several times in the late 1970s and early 1980s and achieved an ELO rating of 2250. (ELO is a numerical rating system used by the World Chess Federation (FIDE) to rank chess players; a rating of 2250 makes one a strong competitive player.) Later, as technology marched on, programs that performed brute-force searching algorithms on an off-the-shelf PC conquered the world computer chess championships. As of August 2005, the Hydra supercomputer (64 PCs, each with a chess coprocessor) is estimated by its creators to have an ELO rating of 3200, which is better than the best human player.


A second challenge in designing for performance is maintaining the simplicity of the design. For example, if the design uses different devices with approximately the same high-level function but radically different performance, a challenge is to abstract devices such that they can be used through a simple uniform interface. In this chapter, we see how a clever implementation of the read and write interface for memory can transparently extend the effective size of RAM to the size of a magnetic disk.

6.1.1 Performance Metrics


To understand bottlenecks more fully, recall that computer systems are organized in modules to achieve the benefits of modularity and that to process a request, the request may be handed from one module to another. For example, a camera may generate a continuous stream of requests containing video frames and send them to a service that digitizes each frame. The digitizing service in turn may send its output to a file service that stores the frames on a magnetic disk.

By describing this application in a client/service style, we can obtain some insights about important performance metrics. It is immediately clear that in a computer system such as this one, four metrics are of importance: the capacity of the service, its utilization, the time clients must wait for requests to complete, and throughput, the rate at which services can handle requests. We will discuss each metric in turn.

6.1.1.1 Capacity, Utilization, Overhead, and Useful Work


Every service has some capacity, a consistent measure of a service's size or amount of resources. Utilization is the percentage of capacity of a resource that is used for some given workload of requests. A simple measure of processor capacity is cycles. For example, the processor might be utilized 10% for the duration of some workload, which means that 90% of its processor cycles are unused. For a magnetic disk, the capacity is usually measured in sectors. If a disk is utilized 80%, then 80% of its sectors are used to store data.

In a layered system, each layer may have a different view of the capacity and utilization of the underlying resources. For example, a processor may be 95% utilized but delivering only 70% of its cycles to the application because the operating system uses 25%. Each layer considers what the layers below it do to be overhead in time and space, and what the layers above it do to be useful work. In the processor example, from the application point of view, the 25% of cycles used by the operating system is overhead and the 70% is useful work. In the disk example, if 10% of the disk is used for storing file system data structures, then from the application point of view that 10% used by the file system is overhead and only 90% is useful capacity.

6.1.1.2 Latency
Latency is the delay between a change at the input to a system and the corresponding change at its output. From the client/service perspective, the latency of a request is the time from issuing the request until the time the response is received from the service.


FIGURE 6.1  A simple service composed of several stages.

This latency has several components: the latency of sending a message to the service, the latency of processing the request, and the latency of sending a response back. If a task, such as asking a service to perform a request, is a sequence of subtasks, we can think of the complete task as traversing stages of a pipeline, where each stage of the pipeline performs a subtask (see Figure 6.1). In our example, the first stage in the pipeline is sending the request, the second stage is the service digitizing the frame, the third stage is the file service storing the frame, and the final stage is sending a response back to the client.

With this pipeline model in mind, it is easy to see that the latency of a pipeline with stages A and B is greater than or equal to the sum of the latencies for each stage in the pipeline:

    latency_{A+B} ≥ latency_A + latency_B

It is possibly greater because passing a request from one stage to another might add some latency. For example, if the stages correspond to different services, perhaps running on different computers connected by a network, then the overhead of passing requests from one stage to another may add enough latency that it cannot be ignored.

If the stages are of a single service, that additional latency is typically small (e.g., the overhead of invoking a procedure) and can usually be ignored for a first-order analysis of performance. Thus, in this case, to predict the latency of a service that isn't running yet but is expected to perform two functions, A and B, with known latencies, a designer can approximate the joint latency of A and B by adding the latency of A and the latency of B.

6.1.1.3 Throughput

Throughput is a measure of the rate of useful work done by a service for some given workload of requests. In the camera example, the throughput we might care about is how many frames per second the system can process because it may determine what quality camera we want to buy.

The throughput of a system with pipelined stages is less than or equal to the minimum of the throughput for each stage:

    throughput_{A+B} ≤ minimum(throughput_A, throughput_B)


Again, if the stages are of a single service, passing the request from one stage to another usually adds little overhead and has little impact on total throughput. Thus, for a first-order analysis that overhead can be ignored, and the relation is usually close to equality.

Consider a computer system with two stages: one that is able to process data at a rate of 1,000 kilobytes per second and a second one at a rate of 100 kilobytes per second. If the fast stage generates one byte of output for each byte of input, the overall throughput must be less than or equal to 100 kilobytes per second. If there is negligible overhead in passing requests between the two stages, then the throughput of the system is equal to the throughput of the bottleneck stage, 100 kilobytes per second. In this case, the utilization of stage 1 is 10% and that of stage 2 is 100%.

When a stage processes requests serially, the throughput and the latency of a stage are directly related. The average number of requests a stage handles is inversely proportional to the average time to process a single request:

    throughput = 1 / latency

If all stages process requests serially, the average throughput of the complete pipeline is inversely proportional to the average time a request spends in the pipeline. In these pipelines, reducing latency improves throughput, and the other way around.

When a stage processes requests concurrently, as we will see later in this chapter, there is no direct relationship between latency and throughput. For stages that process requests concurrently, an increase in throughput may not lead to a decrease in latency. A useful analogy is pipes through which water flows with a constant velocity. One can have several parallel pipes (or one fatter pipe), which improves throughput but doesn't change latency.
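These relations lend themselves to a quick back-of-the-envelope calculation. The sketch below, with made-up numbers for a two-stage pipeline, computes the latency lower bound (the sum of the stage latencies), the throughput upper bound (the minimum of the stage throughputs), and the throughput = 1/latency relation for a stage that handles requests serially; the particular numbers are assumptions chosen only for illustration.

    // Back-of-the-envelope pipeline model with invented numbers (not from the text).
    public class PipelineModel {
        public static void main(String[] args) {
            double[] stageLatencySec    = {0.001, 0.010};      // stage A: 1 ms, stage B: 10 ms
            double[] stageThroughputRps = {1000.0, 100.0};     // requests per second each stage can handle

            double latencyLowerBound = 0;                      // latency_{A+B} >= latency_A + latency_B
            double throughputUpperBound = Double.MAX_VALUE;    // throughput_{A+B} <= min(throughput_A, throughput_B)
            for (int i = 0; i < stageLatencySec.length; i++) {
                latencyLowerBound += stageLatencySec[i];
                throughputUpperBound = Math.min(throughputUpperBound, stageThroughputRps[i]);
            }
            System.out.printf("latency    >= %.3f s%n", latencyLowerBound);
            System.out.printf("throughput <= %.1f requests/s%n", throughputUpperBound);

            // For a single stage that handles one request at a time (serial processing):
            double serialStageLatencySec = 0.010;
            System.out.printf("serial stage throughput = 1/latency = %.1f requests/s%n",
                              1.0 / serialStageLatencySec);
        }
    }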

6.1.2 A Systems Approach to Designing for Performance


To gauge how much improvement we can hope for in reducing a bottleneck, we must identify and determine the performance of the slowest and the next-slowest bottleneck. To improve the throughput of a system in which all stages have equal throughput requires improving all stages. On the other hand, improving the stage that has a throughput that is 10 times lower than any other stage's throughput may result in a factor of 10 improvement in the throughput of the whole system. We might determine these bottlenecks by measurements or by using simple analytical calculations based on the performance characteristics of each bottleneck. In principle, the performance of any issue in a computer system can be explained, but sometimes it may require substantial digging to find the explanation; see, for example, the study by Perl and Sites on Windows NT's performance [Suggestions for Further Reading 6.4.1].

One should approach performance optimization from a systems point of view. This observation may sound trivial, but many person-years of work have disappeared in optimizing individual stages that resulted in small overall performance improvements. The reason that engineers are tempted to fine-tune a single stage is that

optimizations result in some measurable benefits. An individual engineer can design an optimization (e.g., replacing a slow algorithm with a faster algorithm, removing unnecessary expensive operations, reorganizing the code to have a fast path, etc.), implement it, and measure it, and can usually observe some performance improvement in that stage. This improvement stimulates the design of another optimization, which results in new benefits, and so on. Once one gets into this cycle, it is difficult to keep the law of diminishing returns in mind and realize that further improvements may result in little benefit to the system as a whole.

Since optimizing individual stages typically runs into the law of diminishing returns, an approach that focuses on overall performance is preferred. The iterative approach articulated in Section 1.5.2 achieves this goal because at each iteration the designer must consider whether or not the next iteration is worth performing. If the next iteration identifies a bottleneck that, if removed, shows diminished returns, the designer can stop. If the final performance is good enough, the designer's job is done. If the final performance doesn't meet the target, the designer may have to rethink the whole design or revisit the design specification.
The iterative approach for designing for performance has the following steps:

1. Measure the system to find out whether or not a performance enhancement is needed. If performance is a problem, identify which aspect of performance (throughput or latency) is the problem. For multistage pipelines in which stages process requests concurrently, there is no direct relationship between latency and throughput, so improving latency and improving throughput might require different techniques.

2. Measure again, this time to identify the performance bottleneck. The bottleneck may not be in the place the designer expected and may shift from one design iteration to another.

3. Predict the impact of the proposed performance enhancement with a simple back-of-the-envelope model. (We introduce a few simple models in this chapter.) This prediction includes determining where the next bottleneck will be. A quick way to determine the next bottleneck is to unrealistically assume that the planned performance enhancement will remove the current bottleneck and result in a stage with zero latency and infinite throughput. Under this assumption, determine the next bottleneck and calculate its performance. This calculation will result in one of two conclusions:

   a. Removing the current bottleneck doesn't improve system performance significantly. In this case, stop iterating, and reconsider the whole design or revisit the requirements. Perhaps the designer can adjust the interfaces between stages with the goal of tolerating costly operations. We will discuss several approaches in the next sections.

   b. Removing the current bottleneck is likely to improve the system performance. In this case, focus attention on the bottleneck stage. Consider brute-force methods of relieving the bottleneck stage (e.g., add more memory). Taking advantage of the d(technology)/dt curve may be less expensive than being clever. If brute-force methods won't relieve the bottleneck, be smart. For example, try to exploit properties of the workload or find better algorithms.

4. Measure the new implementation to verify that the change has the predicted impact. If not, revisit steps 1-3 and determine what went wrong.

5. Iterate. Repeat steps 1-5 until the performance meets the required level.

The rest of this chapter introduces various systems approaches to reducing latency and increasing throughput, as well as simple performance models to predict the resulting performance.

6.1.3 Reducing Latency by Exploiting Workload Properties


Reducing latency is difficult because the designer often runs into physical, algorithmic, and economic limits. For example, sending a message from a client on the east coast of the United States to a service on the west coast is dominated by the speed of light. Looking up an item in a hash table cannot go faster than the best algorithm for implementing hash tables. Building a very large memory that has uniform low latency is economically infeasible.

Once a designer has run into such limits, the common approach is to reduce the latency of some requests, perhaps even at the cost of increasing the latency for other requests. A designer may observe that certain requests are more common than other requests, and use that observation to improve the performance of the frequent operations by splitting the staged pipeline into a fast path for the frequent requests and a slow path for other requests (see Figure 6.2). For example, a service might remember the results of frequently asked requests so that when it receives a repeat of a recently handled request, it can return the remembered result immediately without having to recompute it. In practice, exploiting non-uniformity in applications

FIGURE 6.2  A simple service with a slow and fast path.


Sidebar 6.2 Design Hint: Optimize for the Common Case  A cache (see Section 2.1.1.3) is the most common example of optimizing for the most frequent cases. We saw caches in the case study of the Domain Name System (in Section 4.4). As another example, consider a Web browser. Most Web browsers maintain a cache of recently accessed Web pages. This cache is indexed by the name of the Web page (e.g., http://www.Scholarly.edu) and returns the page for that name. If the user asks to view the same page again, then the cache can return the cached copy of the page immediately (a fast path); only the first access requires a trip to the service (a slow path). In addition to improving the user's interactive experience, the cache helps reduce the load on services and the load on the network. Because caches are so effective, many applications use several of them. For example, in addition to caching Web pages, many Web browsers have a cache to store the results of looking up names, such as "www.Scholarly.edu", so that the next request to "www.Scholarly.edu" doesn't require a DNS lookup.

The design of multilevel memory in Section 6.2 is another example of how well a designer can exploit non-uniformity in a workload. Because applications have locality of reference, one can build large and fast memory systems out of a combination of a small but fast memory and a large but slow memory.

works so well that it has led to the design hint optimize for the common case (see Sidebar 6.2).

To evaluate the performance of systems with a fast and slow path, designers typically compute the average latency. If we know the latency of the fast and slow paths, and the frequency with which the system will take the fast path, then the average latency is:

    AverageLatency = Frequency_fast × Latency_fast + Frequency_slow × Latency_slow    (6.1)

Whether introducing a fast path is worth the effort depends on the relative difference in latency between the fast and slow path, and on the frequency with which the system can use the fast path, which is dependent on the workload. In addition, one might be able to change the design so that the fast path becomes faster at the cost of a slower slow path. If the frequency of taking the fast path is low, then introducing a fast path (and perhaps optimizing it at the cost of the slow path) is likely not worth the complexity. In practice, as we will see in Section 6.2, many workloads don't have a uniform distribution of requests, and introducing a fast path works well.
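As a quick numeric instance of Equation 6.1, suppose (purely for illustration) that 90% of requests take a 1-millisecond fast path, such as a cache hit, and 10% take a 100-millisecond slow path; then the average latency is 0.9 × 1 + 0.1 × 100 = 10.9 milliseconds. The sketch below carries out that calculation; the frequencies and latencies are assumptions chosen for the example.

    // Equation 6.1 with example numbers (the 0.9/0.1 split and the latencies are invented here).
    public class FastPathLatency {
        static double averageLatency(double freqFast, double latencyFast, double latencySlow) {
            return freqFast * latencyFast + (1 - freqFast) * latencySlow;
        }

        public static void main(String[] args) {
            // 90% of requests hit a 1 ms fast path, 10% fall through to a 100 ms slow path.
            double avgMs = averageLatency(0.90, 1.0, 100.0);
            System.out.printf("average latency = %.1f ms%n", avgMs);   // prints 10.9 ms
        }
    }

Note how the slow path dominates the average even when it is rare, which is why shaving the fast path further pays off only if the fast-path frequency is high.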

6.1.4 Reducing Latency using Concurrency


Another way to reduce latency that may require some intellectual effort but that can be effective is to parallelize a stage. We take the processing that a stage must do for a single request and divide that processing up into subtasks that can be performed concurrently. Then, whenever several processors are available they can be assigned to run

those subtasks in parallel. The method can be applied either within a multiprocessor system or (if the subtasks aren't too entangled) with completely separate computers.

If the processing parallelizes perfectly (i.e., each subtask can run without any coordination with other subtasks and each subtask requires the same amount of work), then this plan can, in principle, speed up the processing by a factor n, where n is the number of subtasks executing in parallel. In practice, the speedup is usually less than n because there is overhead in parallelizing a computation—the subtasks need to communicate with each other, for example, to exchange intermediate results; because the subtasks do not require an equal amount of work; because the computation cannot be executed completely in parallel, so some fraction of the computation must be executed sequentially; or because the subtasks interfere with each other (e.g., they contend for a shared resource such as a lock, a shared memory, or a shared communication network).

Consider the processing that a search engine needs to perform in order to respond to a user search query. An early version of Google's search engine—described in more detail in Suggestions for Further Reading 3.2.4—parallelized this processing as follows. The search engine splits the index of the Web up into n pieces, each piece stored on a separate machine. When a front end receives a user query, it sends a copy of the query to each of the n machines. Each machine runs the query against its part of the index and sends the results back to the front end. The front end accumulates the results from the n machines, chooses a good order in which to display them, generates a Web page, and sends it to the user. This plan can give good speedup if the index is large and each of the n machines must perform a substantial, similar amount of computation. It is unlikely to achieve a full speedup of a factor n because there is parallelization overhead (to send the query to the n machines, receive n partial results, and merge them); because the amount of work is not balanced perfectly across the n machines and the front end must wait until the slowest responds; and because the work done by the front end in farming out the query and merging hasn't been parallelized.
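A minimal sketch of this fan-out/fan-in pattern is shown below: a front end runs the same query against n index partitions in parallel and merges the partial results. The Partition type, the fixed thread pool, and the trivial merge step are simplifications invented for this example; a real search front end would also rank results and handle stragglers and failures.

    import java.util.*;
    import java.util.concurrent.*;

    // Fan-out/fan-in sketch: run the same query against n index partitions in parallel.
    public class ParallelSearch {
        // A partition just filters its own documents; an invented stand-in for a real index shard.
        record Partition(List<String> documents) {
            List<String> query(String term) {
                return documents.stream().filter(d -> d.contains(term)).toList();
            }
        }

        static List<String> search(List<Partition> partitions, String term) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
            try {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (Partition p : partitions)
                    tasks.add(() -> p.query(term));             // one subtask per partition
                List<String> merged = new ArrayList<>();
                for (Future<List<String>> f : pool.invokeAll(tasks))
                    merged.addAll(f.get());                     // the front end waits for the slowest partition
                return merged;
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            List<Partition> parts = List.of(
                new Partition(List.of("systems design", "rpc basics")),
                new Partition(List.of("pipeline throughput", "systems performance")));
            System.out.println(search(parts, "systems"));
        }
    }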
Although parallelizing can improve performance, several challenges must be overcome. First, many applications are difficult to parallelize. Applications such as search have exploitable parallelism, but other computations don't split easily into n mostly independent pieces. Second, developing parallel applications is difficult because the programmer must manage the concurrency and coordinate the activities of the different subtasks. As we saw in Chapter 5, it is easy to get this wrong and introduce race conditions and deadlocks. Systems have been developed to make development of parallel applications easier, but they are often limited to a particular domain. The paper by Dean and Ghemawat [Suggestions for Further Reading 6.4.3] provides an example of how the programming and management effort can be minimized for certain stylized applications running in parallel on hundreds of machines. In general, however, programmers must often struggle with threads and locks, or explicit message passing, to obtain concurrency.

Because of these two challenges in parallelizing applications, designers traditionally have preferred to rely on continuous technology improvements to reduce application latency. However, physical and engineering limitations (primarily the problem of heat dissipation) are now leading processor manufacturers away from making processors

faster and toward placing several (and soon, probably, several hundred or even several thousand, as some are predicting [Suggestions for Further Reading 1.6.4]) processors on a single chip. This development means that improving performance by using concurrency will inevitably increase in importance.

6.1.5 Improving Throughput: Concurrency


If the designer cannot reduce the latency of a request because of limits, an alternative approach is to hide the latency of a request by overlapping it with other requests. This approach doesn't improve the latency of an individual request, but it can improve system throughput. Because hiding latency is often much easier to achieve than improving latency, it has led to the hint: instead of reducing latency, hide it (see Sidebar 6.3). This section discusses how one can introduce concurrency in a multistage pipeline to increase throughput.

To overlap requests, we give each stage in the pipeline its own thread of computation so that it can compute concurrently, operating much like an assembly line (see Figure 6.3). If a stage has completed its task and has handed off the request to the next stage, then the stage can start processing the second request while the next stage processes the first request. In this fashion, the pipeline can work on several requests concurrently.

An implementation of this approach has two challenges. First, some stages of the pipeline may operate more slowly than other stages. As a result, one stage might not be able to hand off the request to the next stage because that next stage is still working on a previous request. As a result, a queue of requests may build up, while other stages might be idle. To ensure that a queue between two stages doesn't grow without bound, the stages are often coupled using a bounded buffer. We will discuss queuing in more detail in Section 6.1.6.

The second challenge is that several requests must be available. One natural source of multiple requests is if the system has several clients, each generating a request. A single client can also be a source of multiple requests if the client operates asynchronously. When an asynchronous client issues a request, rather than waiting for the response, it continues computing, perhaps issuing more requests. The main challenge

FIGURE 6.3  A simple service composed of several stages, with each stage operating concurrently using threads.
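One minimal way to build such an assembly line, sketched below, is to give each stage its own thread and couple the stages with a bounded buffer (here Java's ArrayBlockingQueue), so that a slow stage makes the stage before it wait instead of letting the queue grow without bound. The two stages and the work they do are invented for the example.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Two pipeline stages, each with its own thread, coupled by a bounded buffer.
    public class ThreadedPipeline {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> betweenStages = new ArrayBlockingQueue<>(4);   // bounded buffer

            Thread stage1 = new Thread(() -> {             // e.g., "digitize" a frame
                try {
                    for (int frame = 0; frame < 10; frame++)
                        betweenStages.put(frame * frame);  // blocks when the buffer is full
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread stage2 = new Thread(() -> {             // e.g., "store" the frame
                try {
                    for (int i = 0; i < 10; i++)
                        System.out.println("stored " + betweenStages.take());  // blocks when empty
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            stage1.start(); stage2.start();
            stage1.join(); stage2.join();
        }
    }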


Sidebar 6.3 Design Hint: Instead of Reducing Latency, Hide it  Latency is often not under the control of the designer but rather is imposed on the designer by physical properties such as the speed of light. Consider sending a message from the east coast of the United States to the west coast at the speed of light. This takes about 20 milliseconds (see Section 7.1 [on-line]); in the same time, a processor can execute millions of instructions. Worse, each new generation of processors gets faster every year, but the speed of light doesn't improve. As David Clark, a network researcher, put it succinctly: "One cannot bribe God." The speed of light shows up as an intrinsic barrier in many places of computer design, even when the distances are short. For example, dies are so large that for a signal to travel from one end of a chip to another is a bottleneck that limits the clock speed of a chip.

When a designer is faced with such intrinsic limits, the only option is to design systems that hide latency and try to exploit performance dimensions that do follow d(technology)/dt. For example, transmission rates for data networks have improved dramatically, and so if a designer can organize the system such that communication can be overlapped with useful computation and many network requests can be batched into a large request, then the large request can be transferred efficiently. Many Web browsers use this strategy: while a large transfer runs in the background, users can continue browsing Web pages, hiding the latency of the transfer.

in issuing multiple requests asynchronously is that the client must then match the responses with the outstanding requests.

Once the system is organized to have many requests in flight concurrently, a designer may be able to improve throughput further by using interleaving. The idea is to make n instances of the bottleneck stage and run those n instances concurrently (see Figure 6.4). Stage 1 feeds the first request to instance 1, the second request to instance 2, and so on. If the throughput of a single instance is t, then the throughput using interleaving is n × t, assuming enough requests are available to run all instances

FIGURE 6.4  Interleaving requests.


concurrently at full speed and the requests don't interfere with each other. The cost of interleaving is additional copies of the bottleneck stage.

RAID (see Section 2.1.1.4) interleaves several disks to achieve a high aggregate disk throughput. RAID 0 stripes the data across the disks: it stores block 0 on disk 0, block 1 on disk 1, and so on. If requests arrive for blocks on different disks, the RAID controller can serve those requests concurrently, improving throughput. In a similar style one can interleave memory chips to improve throughput. If the current instruction is stored in memory chip 0 and the next one is in memory chip 1, the processor can retrieve them concurrently. The cost of this design is the additional disks and memory chips, but often systems already have several memory chips or disks, in which case the added cost of interleaving can be small in comparison with the performance benefit.
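The address arithmetic behind RAID 0 striping is simple enough to show directly: block b lands on disk b mod n at offset b div n, so consecutive blocks sit on different disks and can be served concurrently. The sketch below prints this mapping for a three-disk array chosen only for illustration.

    // RAID 0 striping arithmetic: block b -> (disk b mod n, offset b / n).
    public class Raid0Mapping {
        public static void main(String[] args) {
            int nDisks = 3;                                   // example configuration
            for (int block = 0; block < 9; block++) {
                int disk = block % nDisks;
                int offsetOnDisk = block / nDisks;
                System.out.printf("block %d -> disk %d, offset %d%n", block, disk, offsetOnDisk);
            }
        }
    }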

6.1.6 Queuing and Overload


If a stage in Figure 6.3 operates at its capacity (e.g., all physical processors are running threads), then a new request must wait until the stage becomes available; a queue of requests builds up waiting for the busy stage, while other stages may run idle. For example, the thread manager of Section 5.5 maintains a table of threads, which records whether a thread is runnable; a runnable thread must wait until a processor is available to run it. The stage that runs with an input queue while other stages are running idle is a bottleneck.

Using queuing theory,* we can estimate the time that a request spends waiting in a queue for its turn to be processed (e.g., the time a thread spends in the ready queue). In queuing theory, the time that it takes to process a request (e.g., the time from when a thread starts running on the processor until it yields) is called the service time.

The simplest queuing theory model assumes that requests (e.g., a thread entering the ready queue) arrive according to a random, memoryless process and have independent, exponentially distributed service times. In that case, a well-known queuing theory result tells us that the average queuing delay, measured in units of the average service time and including the service time of this request, will be 1/(1 − p), where p is the service utilization. Thus, as the utilization approaches 1, the queuing delay will grow without bound.
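To see how sharply this factor grows, the sketch below tabulates 1/(1 − p) for a few utilizations; the particular values of p are chosen only for illustration. At 50% utilization a request spends on average two service times in the system, while at 99% utilization it spends a hundred.

    // Average delay (waiting plus service), in units of the mean service time: 1 / (1 - utilization).
    public class QueuingDelay {
        public static void main(String[] args) {
            double[] utilizations = {0.10, 0.50, 0.80, 0.90, 0.99};
            for (double p : utilizations)
                System.out.printf("utilization %.2f -> average delay %.1f service times%n",
                                  p, 1.0 / (1.0 - p));
        }
    }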
This same phenomenon applies to the delays for threads waiting for a processor and to the delays that customers experience in supermarket checkout lines. Any time the demand for a service comes from many statistically independent sources, there will be fluctuations in the arrival of load and thus in the length of the queue at the bottleneck stage and the time spent waiting for service. The rate of arrival of requests for service is known as the offered load. Whenever the offered load is greater than the capacity of a service for some duration, the service is said to be overloaded for that time period.

*The textbook by Jain is an excellent source to learn about queuing theory and how to reason
about performance in computer systems [Suggestions for Further Reading 1.1.2].


In some constrained cases, where the designer can plan the system so that the capacity just matches the offered load of requests, it is possible to calculate the degree of concurrency necessary to achieve high throughput and the maximum length of the queue needed between stages. For example, suppose we have a processor that performs one instruction per nanosecond using a memory that takes 10 nanoseconds to respond. To avoid having the processor wait for the memory, it must make a memory request 10 instructions in advance of the instruction that needs it. If every instruction makes a request of memory, then by the time the memory responds, the processor will have issued 9 more. To avoid being a bottleneck, the memory therefore must be prepared to serve 10 requests concurrently.

If half of the instructions make a request of memory, then on average there will be five outstanding requests. Thus, a memory that can serve five requests concurrently would have enough capacity to keep up. The maximum length of the queue needed for this case depends on the application's pattern of memory references. For example, if every second instruction makes a memory request, a fixed-size queue of size five is sufficient to ensure that the queue never overflows. If the processor performs five instructions that make memory references followed by five that don't, then a fixed-size queue of size five will work, but the queue length will vary and the throughput will be different. If the requests arrive randomly, the queue can grow, in principle, without limit. If we were to use a memory that can handle 10 requests concurrently for this random pattern of memory references, then the memory would be utilized at 50% of capacity, and the average queue length would be 1/(1 − 0.5) = 2. With this configuration, the processor observes latencies for some memory requests of 20 or more instruction cycles, and it is running much slower than the designer expected. This example illustrates that a designer must understand non-uniform patterns in the references to memory and exploit them to achieve good performance.
In many computer systems, the designer cannot plan the offered load that precisely,
and thus stages will experience periods of overload. For example, an application
may have several threads that become runnable all at the same time, and there
may not be enough processors available to run them. In such cases, at least occasional
overload is inevitable. The significance of overload depends critically on how long
it lasts. If the duration is comparable to the service time, then a queue is simply an
orderly way to delay some requests for service until a later time when the offered
load drops below the capacity of the service. Put another way, a queue handles short
bursts of too much demand by time-averaging with adjacent periods when there is
excess capacity.
If overload persists over long periods of time, the system designer has only two
choices:

1. Increase the capacity of the system. If the system must meet the offered load,
one approach is to design a system that has less overhead so that it can perform
more useful work, or purchase a better computer system with higher capacity.
In computer systems, it is typically less expensive to buy the next generation of
the computer system that has higher capacity because of technology improvements
than trying to squeeze the last ounce out of the implementation through
complex algorithms.

2. Shed load. If purchasing a computer system with higher capacity isn't an option
and system performance cannot be improved, the preferred method is to shed
load by reducing or limiting the offered load until the load is less than the
capacity of the system.

One approach to control the offered load is to use a bounded buffer (see Figure
5.5) between stages. When the bounded buffer ahead of the bottleneck stage is full,
then the stage before it must wait until the bounded buffer empties a slot. Because
the previous stage is waiting, its bounded buffer may fill up too, which may cause the
stage before it to wait, and so on. The bottleneck may be pushed all the way back to
the beginning of the pipeline. If this happens, the system cannot accept any more
input, and what happens next depends on how the system is used.
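
The following minimal Java sketch (our own illustration, not from the text; the stage names, buffer size, and service delay are assumptions) shows how a bounded buffer between two stages produces this back-pressure: put blocks when the buffer is full, so the upstream stage stalls until the bottleneck stage frees a slot.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch: a bounded buffer between a producer stage and a slower bottleneck stage.
public class BoundedBufferStages {
    public static void main(String[] args) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(8); // bounded capacity

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    buffer.put("request-" + i); // blocks when full, pushing the wait upstream
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread bottleneck = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    String request = buffer.take(); // empties a slot, unblocking the producer
                    Thread.sleep(10);               // slow service makes this stage the bottleneck
                    System.out.println("served " + request);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        bottleneck.start();
    }
}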
If the source of the load needs the results of the output to generate the next
request, then the load will be self-managing. This model of use applies to some
interactive systems, in which the users cannot type the next command until the previous
one finishes. This same idea will be used in Chapter 7 [on-line] in the implementation
of self-pacing network protocols.
If the source of the load decides not to make the request at all, then the offered
load decreases. If the source, however, simply holds on to the request and resubmits
it later, then the offered load doesn't decrease, but some requests are just deferred,
perhaps to a time when the system isn't overloaded.
A crude approach to limiting a source is to put a quota on how many requests a
source may have outstanding. For example, some systems enforce a rule that an
application may not create more than some fixed number of active threads at the same
time and may not have more than some fixed number of open files. If a source has
reached its quota for a given service, the system denies the next request, limiting the
offered load on the system.
An alternative to limiting the offered load is reducing it when a stage becomes
overloaded. We will see one example of this approach in Section 6.2. If the address
spaces of a number of applications cannot fit in memory, the virtual memory manager
can swap out a complete address space of one or more applications so that the
remaining applications fit in memory. When the offered load decreases to normal
levels, the virtual memory manager can swap in some of the applications that were
swapped out.

6.1.7 Fighting Bottlenecks


If the designer cannot remove a bottleneck with the techniques described above, it
may be possible instead to fight the bottleneck using one or more of three different
techniques: batching, dallying, and speculation.


6.1.7.1 Batching
Batching is performing several requests as a group to avoid the setup overhead of
doing them one at a time. Opportunities for batching arise naturally at a bottleneck
stage, which may have a queue of requests waiting to be processed. For example, if a
stage has several requests to send to the next stage, the stage can combine all of the
messages into a single message and send that one message to the next stage. This use
of batching divides the overhead of an expensive operation (e.g., sending a message)
over the several messages. More generally, batching works well when processing a
request has a fixed delay (e.g., transmitting the request) and a variable delay (e.g.,
performing the operation specified in the request). Without batching, processing n
requests takes n * (f + v), where f is the fixed delay and v is the variable delay. With
batching, processing n requests takes f + n * v.
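
As a concrete illustration, the Java sketch below (our own, with assumed class and method names) shows how a bottleneck stage can batch: it waits for one request, then drains whatever else has already queued up and hands the whole group to the next stage, so the fixed cost f of a send is paid once per batch rather than once per request.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal batching sketch: the bottleneck stage groups queued requests into one message.
public class BatchingStage {
    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();

    void submit(String request) throws InterruptedException {
        incoming.put(request);
    }

    // Called in a loop by the stage's own thread.
    void processOneBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>();
        batch.add(incoming.take());   // wait for at least one request
        incoming.drainTo(batch);      // grab anything else already queued, at no extra wait
        sendToNextStage(batch);       // one expensive send (the fixed cost f) covers the batch
    }

    private void sendToNextStage(List<String> batch) {
        System.out.println("sending batch of " + batch.size() + " requests");
    }
}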
Once a stage performs batching, the potential arises for additional performance
wins. Batching may create opportunities for the stage to avoid work. If two or more
write requests in a batch are for the same disk block, then the stage can perform just
the last one.
Batching may also provide opportunities to improve latency by reordering the
processing of requests. As we will see in Section 6.3.4, if a disk controller receives a
batch of requests, it can schedule them in an order that reduces the movement of the
disk arm, reducing the total latency for the batch of requests.

6.1.7.2 Dallying
Dallying is delaying a request on the chance that the operation won't be needed, or
to create more opportunities for batching. For example, a stage may delay a request
that overwrites a disk block in the hope that a second one will come along for the
same block. If a second one comes along, the stage can delete the first request and
perform just the second one. As applied to writes, this benefit is sometimes called
write absorption.
Dallying also increases the opportunities for batching. It purposely increases the
latency of some requests in the hope that more requests will come along that can be
combined with the delayed requests to form a batch. In this case, dallying increases
the latency of some requests to improve the average latency of all requests.
A key design question in dallying is to decide how long to wait. There is no generic
answer to this question. The costs and benefits of dallying are application and system
specific.
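
A minimal Java sketch of this idea (our own, under assumed names): writes to the same block are absorbed by keeping only the latest pending write per block, and the flush is dallied for a fixed interval to give later writes a chance to arrive and be batched.

import java.util.HashMap;
import java.util.Map;

// Minimal dallying sketch: buffer block writes, absorb overwrites, flush after a delay.
public class DallyingWriter {
    private final Map<Long, byte[]> pending = new HashMap<>(); // block number -> latest data
    private final long dallyMillis;

    DallyingWriter(long dallyMillis) {
        this.dallyMillis = dallyMillis;
    }

    // A later write to the same block simply replaces the earlier one (write absorption).
    synchronized void write(long blockNumber, byte[] data) {
        pending.put(blockNumber, data);
    }

    // Flush everything that has accumulated: one disk write per block, not per request.
    synchronized void flushPending() {
        for (Map.Entry<Long, byte[]> e : pending.entrySet()) {
            writeToDisk(e.getKey(), e.getValue());
        }
        pending.clear();
    }

    // Background flusher: the sleep is the dally, giving more writes a chance to arrive.
    void start() {
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(dallyMillis);
                    flushPending();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    private void writeToDisk(long blockNumber, byte[] data) {
        System.out.println("writing block " + blockNumber); // stand-in for a real disk write
    }
}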

6.1.7.3 Speculation
Speculation is performing an operation in advance of receiving a request on the
chance that it will be requested. The goal is that the results can be delivered with
less latency and perhaps with less setup overhead. Speculation can achieve this goal
in two different ways. First, speculation can perform operations using otherwise idle
resources. In this case, even if the speculation is wrong, performing the additional
operations has no downside. Second, speculation can use a busy resource to do an
operation that has a long lead time so that the result of the operation can be available
without waiting if it turns out to be needed. In this case, speculation might increase
the delay and overhead of other requests without benefit because the prediction that
the results may be needed might turn out to be wrong.
Speculating may sound bewildering: how can a computer system predict
the input of an operation if it hasn't received the request yet, and how can it predict
if the result of the operation will be useful in the future? Fortunately, many applications
have request patterns that a system designer can exploit to predict an input.
In some cases, the input value is evident; for example, a future instruction may add
register 5 to register 9, and these register values may be available now. In some cases,
the input values can be predicted accurately; for example, a program that asks to read
byte n is likely to want to read bytes n + 1, n + 2, and so on, too. Similarly, for many
applications a system can predict what results will be useful in the future. If a program
performs instruction n, it will likely soon need the result of instruction n + 1; only
when instruction n is a jmp will the prediction be wrong.
Sometimes a system can use speculation even if the system cannot predict accurately
what the input to an operation is or whether the result will be useful. For example,
if an input has only two values, then the system might create a new thread and
have the main thread run with one input value and the second thread with the other
input value. Later, when the system knows the value of the input, it terminates the
thread that is computing with the wrong value and undoes any changes that thread
might have made. This use of speculation becomes challenging when it involves
shared state that is updated by different threads, but using techniques presented in
Chapter 9 [on-line] it is possible to undo the operations of a thread, even when shared
state is involved.
Speculation creates more opportunities for batching and dallying. If the system
speculates that a read request for block n will be followed by read requests for blocks
n + 1 through n + 8, then the system can batch those read requests. If a write request
might soon be followed by another write request, the system can dally for a while to
see if any others come in and, if so, batch all the writes together.
Key design questions associated with speculation are when to speculate and how
much. Speculation can increase the load on later stages. If this increase in load results
in a load higher than the capacity of a later stage, then requests must wait and latency
will increase. Also, any work done that turns out to be not useful is overhead, and
performing this unnecessary work may slow down other requests. There is no generic
answer to this design question; instead, a designer must evaluate the benefits and costs
of speculation in the context of the system.
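
The read-ahead case can be sketched as follows (our own illustration; the block-read method and the 4-kilobyte block size are assumptions): after serving a read of block n, the stage speculatively schedules a read of block n + 1 on an otherwise idle thread, so that a sequential reader finds the next block already on its way.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal speculation sketch: after serving block n, prefetch block n + 1 in the background.
public class ReadAheadCache {
    private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();
    private long prefetchedBlock = -1;
    private Future<byte[]> prefetched = null;

    public synchronized byte[] read(long blockNumber) throws Exception {
        byte[] data;
        if (blockNumber == prefetchedBlock && prefetched != null) {
            data = prefetched.get();           // speculation paid off: the data is already coming
        } else {
            data = readFromDisk(blockNumber);  // first read, or the prediction was wrong
        }
        final long next = blockNumber + 1;     // predict a sequential access pattern
        prefetchedBlock = next;
        prefetched = prefetcher.submit(() -> readFromDisk(next));
        return data;
    }

    private byte[] readFromDisk(long blockNumber) {
        return new byte[4096];                 // stand-in for an actual disk read
    }
}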

6.1.7.4 Challenges with Batching, Dallying, and Speculation


Batching, dallying, and speculation introduce complexity because they introduce
concurrency. The designer must coordinate incoming requests with the requests
that are batched, dallied, or speculated. Furthermore, if the requested operations
share variables, the designer must coordinate the references to these variables. Since
coordination is difficult to get right, a designer must use these performance-enhancing
techniques with discipline. There is always the risk that by the time the designer has
worked out the concurrency problems and the system has made it through the system
tests, technology improvements will have made the extra complexity unnecessary.
Problem set 14 explores several performance-enhancing techniques and their
challenges with a simple multithreaded service.

6.1.8 An Example: The I/O Bottleneck


We illustrate design for performance using batching, dallying, and speculation through
a case study involving a magnetic disk such as was described in Sidebar 2.2. The
performance problem with disks is that they are made of mechanical components. As
a result, reading and writing data to a magnetic disk is slow compared to devices
that have no mechanical components, such as RAM chips. The disk is therefore a
bottleneck in many applications. This bottleneck is usually referred to as the I/O
bottleneck.
Recall from Sidebar 2.2 that the performance of reading and writing a disk block
is determined by (1) the time to move the head to the appropriate track (the seek
latency); (2) plus the time to wait until the requested sector rotates under the disk head
(the rotational latency); (3) plus the time to transfer the data from the disk to the
computer (the transfer latency).
The I/O bottleneck is getting worse over time. Seek latency and rotational latency
are not improving as fast as processor performance. Thus, from the perspective of
programs running on ever faster processors, I/O is getting slower over time. This
problem is an example of problems due to incommensurate rates of technology
improvement. Following the incommensurate scaling rule of Chapter 1, applications
and systems have been redesigned several times over the last few decades to cope
with the I/O bottleneck.
To build some intuition for the I/O bottleneck, consider a typical disk of the last
decade. The average seek latency (the time to move the head over one-third of the
disk) is about 8 milliseconds. The disks spin at 7,200 rotations per minute, which is
one rotation every 8.33 milliseconds. On average, the disk has to wait a half rotation
for the desired block to be under the disk head; thus, the average rotational latency is
4.17 milliseconds.
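
Plugging these numbers in makes the point concrete; the small sketch below is our own addition, and the 4-kilobyte request size is an assumption, while the 180 megabyte-per-second media transfer rate is the figure derived in the next paragraph. A random read of a small block pays roughly 12 milliseconds of mechanical latency before the (negligible) transfer even starts.

// Minimal sketch: mechanical latency dominates small random disk reads.
public class DiskLatency {
    public static void main(String[] args) {
        double seekMillis = 8.0;                               // average seek (from the text)
        double rotationMillis = 8.33 / 2;                      // half a rotation at 7,200 RPM
        double transferMillis = (4.0 / 1024.0) / 180.0 * 1000; // assumed 4 KB at 180 MB/s

        double total = seekMillis + rotationMillis + transferMillis;
        System.out.printf("positioning = %.2f ms, transfer = %.3f ms, total = %.2f ms%n",
                seekMillis + rotationMillis, transferMillis, total);
    }
}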
Bits read from a disk encounter two potential transfer rate limits, either of which
may become the bottleneck. The first limit is mechanical: the rate at which bits spin
under the disk heads on their way to a buffer. The second limit is electrical: the rate at
which the I/O channel or I/O bus can transfer the contents of the buffer to the
computer. A typical modern 400-gigabyte disk has 16,383 cylinders, or about 24
megabytes per cylinder. That disk would probably have 8 two-sided platters and thus 16
read/write heads, so there would be 24/16 = 1.5 megabytes per track. When rotating
at 7,200 revolutions per minute (120 revolutions per second), the bits will go by a
head at 120 * 1.5 = 180 megabytes per second. The I/O channel speed depends on
which standard bus connects the disk to the computer. For the Integrated Device
Electronics (IDE) bus, 66 megabytes per second is a common number in practice; for

Chapter 4

Concurrency Control

This chapter contains the book chapters:

R. Ramakrishnan and J. Gehrke. Database Management Systems. Third
Edition. Chapters 16-17, pp. 519-544 and 549-575 (53 of 1065).
McGraw-Hill, 2003. ISBN: 978-0-07-246563-1

While concurrency is a powerful technique for performance, it is unfortunately
hard to get concurrent software right. The properties of atomicity and
durability, and in particular the atomicity variant of before-or-after atomicity
(also called isolation), are crucial for correctness, but particularly challenging
to achieve when concurrency is employed. Fortunately, a general theory of how
to write correct, concurrent software, which respects before-or-after atomicity,
over a memory abstraction has been developed in the context of database systems.
This chapter discusses this theory, which is grounded on the notions of
transactions, the ACID properties, and the serializability of transaction schedules.
The chapter elaborates on multiple protocols for concurrency control,
based on locking, timestamps, and multi-versioning. The ultimate goal of this
portion of the material is to enable us to design our own strategies for achieving
atomicity when writing performance-optimized, concurrent services, and
reflect on their correctness by equivalence to a well-known concurrency control
protocol.
The learning goals for this portion of the material are listed below.

• Identify the multiple interpretations of the property of atomicity.

• Implement methods to ensure before-or-after atomicity, and argue for their correctness.

• Explain the variants of the two-phase locking (2PL) protocol, in particular the widely-used Strict 2PL.

• Discuss definitions of serializability and their implications, in particular conflict-serializability and view-serializability.

• Apply the conflict-serializability test using a precedence graph to transaction schedules.

• Explain deadlock prevention and detection techniques.

• Apply deadlock detection using a waits-for graph to transaction schedules.

• Explain situations where predicate locking is required.

• Explain the optimistic concurrency control and multi-version concurrency control models.

• Predict validation decisions under optimistic concurrency control.



1. Users should be able to regard the execution of each transaction as atomic:
Either all actions are carried out or none are. Users should not have to
worry about the effect of incomplete transactions (say, when a system crash
occurs).

2. Each transaction, run by itself with no concurrent execution of other transactions,
must preserve the consistency of the database. The DBMS assumes
that consistency holds for each transaction. Ensuring this property of a
transaction is the responsibility of the user.

3. Users should be able to understand a transaction without considering the
effect of other concurrently executing transactions, even if the DBMS interleaves
the actions of several transactions for performance reasons. This
property is sometimes referred to as isolation: Transactions are isolated,
or protected, from the effects of concurrently scheduling other transactions.

4. Once the DBMS informs the user that a transaction has been successfully
completed, its effects should persist even if the system crashes before all
its changes are reflected on disk. This property is called durability.

The acronym ACID is sometimes used to refer to these four properties of trans­
actions: atomicity, consistency, isolation and durability. We now consider how
each of these properties is ensured in a DBMS.

16.1.1 Consistency and Isolation

Users are responsible for ensuring transaction consistency. That is, the user
who submits a transaction must ensure that, when run to completion by itself
against a ‘consistent’database instance, the transaction will leave the database
in a ‘consistent’state. For example, the user may (naturally) have the consis­
tency criterion that fund transfers between bank accounts should not change
the total amount of money in the accounts. To transfer money from one ac­
count to another, a transaction must debit one account, temporarily leaving the
database inconsistent in a global sense, even though the new account balance
may satisfy any integrity constraints with respect to the range of acceptable
account balances. The user’ s notion of a consistent database is preserved when
the second account is credited with the transferred amount. If a faulty trans­
fer program always credits the second account with one dollar less than the
amount debited from the first account, the DBMS cannot be expected to de­
tect inconsistencies due to such errors in the user program ’s logic.

The isolation property is ensured by guaranteeing that, even though actions
of several transactions might be interleaved, the net effect is identical to executing
all transactions one after the other in some serial order. (We discuss
how the DBMS implements this guarantee in Section 16.4.) For example, if
two transactions T1 and T2 are executed concurrently, the net effect is guaranteed
to be equivalent to executing (all of) T1 followed by executing T2 or
executing T2 followed by executing T1. (The DBMS provides no guarantees
about which of these orders is effectively chosen.) If each transaction maps a
consistent database instance to another consistent database instance, executing
several transactions one after the other (on a consistent initial database
instance) results in a consistent final database instance.

Database consistency is the property that every transaction sees a consistent
database instance. Database consistency follows from transaction atomicity,
isolation, and transaction consistency. Next, we discuss how atomicity and
durability are guaranteed in a DBMS.

16.1.2 Atomicity and Durability

Transactions can be incomplete for three kinds of reasons. First, a transaction
can be aborted, or terminated unsuccessfully, by the DBMS because some
anomaly arises during execution. If a transaction is aborted by the DBMS for
some internal reason, it is automatically restarted and executed anew. Second,
the system may crash (e.g., because the power supply is interrupted) while one
or more transactions are in progress. Third, a transaction may encounter an
unexpected situation (for example, read an unexpected data value or be unable
to access some disk) and decide to abort (i.e., terminate itself).

Of course, since users think of transactions as being atomic, a transaction that


is interrupted in the middle may leave the database in an inconsistent state.
Therefore, a DBMS must find a way to remove the effects of partial transactions
from the database. That is, it must ensure transaction atomicity: Either all of a
transaction’s actions are carried out or none are. A DBMS ensures transaction
atomicity by undoing the actions of incomplete transactions. This means that
users can ignore incomplete transactions in thinking about how the database is
modified by transactions over time. To be able to do this, the DBMS maintains
a record, called the log, of all writes to the database. The log is also used to
ensure durability: If the system crashes before the changes made by a completed
transaction are written to disk, the log is used to remember and restore these
changes when the system restarts.

The DBMS component that ensures atomicity and durability, called the
recovery manager, is discussed further in Section 16.7.


16.2 TRANSACTIONS AND SCHEDULES

A transaction is seen by the DBMS as a series, or list, of actions. The actions
that can be executed by a transaction include reads and writes of database
objects. To keep our notation simple, we assume that an object O is always
read into a program variable that is also named O. We can therefore denote
the action of a transaction T reading an object O as R_T(O); similarly, we can
denote writing as W_T(O). When the transaction T is clear from the context,
we omit the subscript.

In addition to reading and writing, each transaction must specify as its final
action either commit (i.e., complete successfully) or abort (i.e., terminate
and undo all the actions carried out thus far). Abort_T denotes the action of T
aborting, and Commit_T denotes T committing.

We make two important assumptions:

1. Transactions interact with each other only via database read and write
operations; for example, they are not allowed to exchange messages.

2. A database is a fixed collection of independent objects. When objects are
added to or deleted from a database or there are relationships between
database objects that we want to exploit for performance, some additional
issues arise.

If the first assumption is violated, the DBMS has no way to detect or prevent
inconsistencies caused by such external interactions between transactions, and it
is up to the writer of the application to ensure that the program is well-behaved.
We relax the second assumption in Section 16.6.2.

A schedule is a list of actions (reading, writing, aborting, or committing)
from a set of transactions, and the order in which two actions of a transaction
T appear in a schedule must be the same as the order in which they appear in T.
Intuitively, a schedule represents an actual or potential execution sequence. For
example, the schedule in Figure 16.1 shows an execution order for actions of two
transactions T1 and T 2. We move forward in time as we go down from one row
to the next. We emphasize that a schedule describes the actions of transactions
as seen by the DBMS. In addition to these actions, a transaction may carry out
other actions, such as reading or writing from operating system files, evaluating
arithmetic expressions, and so on; however, we assume that these actions do
not affect other transactions; that is, the effect of a transaction on another
transaction can be understood solely in terms of the common database objects
that they read and write.


T1              T2
R(A)
W(A)
                R(B)
                W(B)
R(C)
W(C)

Figure 16.1  A Schedule Involving Two Transactions

Note that the schedule in Figure 16.1 does not contain an abort or commit action
for either transaction. A schedule that contains either an abort or a commit
for each transaction whose actions are listed in it is called a complete schedule.
A complete schedule must contain all the actions of every transaction
that appears in it. If the actions of different transactions are not interleaved
(that is, transactions are executed from start to finish, one by one), we call the
schedule a serial schedule.

16.3 CONCURRENT EXECUTION OF TRANSACTIONS

Now that we have introduced the concept of a schedule, we have a convenient


way to describe interleaved executions of transactions. The DBMS interleaves
the actions of different transactions to improve performance, but not all inter­
leavings should be allowed. In this section, we consider what interleavings, or
schedules, a DBMS should allow.

16.3.1 Motivation for Concurrent Execution

The schedule shown in Figure 16.1 represents an interleaved execution of the
two transactions. Ensuring transaction isolation while permitting such concurrent
execution is difficult but necessary for performance reasons. First, while
one transaction is waiting for a page to be read in from disk, the CPU can
process another transaction. This is because I/O activity can be done in parallel
with CPU activity in a computer. Overlapping I/O and CPU activity
reduces the amount of time disks and processors are idle and increases system
throughput (the average number of transactions completed in a given time).
Second, interleaved execution of a short transaction with a long transaction
usually allows the short transaction to complete quickly. In serial execution,
a short transaction could get stuck behind a long transaction, leading to unpredictable
delays in response time, or average time taken to complete a
transaction.


16.3.2 Serializability

A serializable schedule over a set S of committed transactions is a schedule
whose effect on any consistent database instance is guaranteed to be identical
to that of some complete serial schedule over S. That is, the database instance
that results from executing the given schedule is identical to the database instance
that results from executing the transactions in some serial order.1

As an example, the schedule shown in Figure 16.2 is serializable. Even though
the actions of T1 and T2 are interleaved, the result of this schedule is equivalent
to running T1 (in its entirety) and then running T2. Intuitively, T1's read and
write of B is not influenced by T2's actions on A, and the net effect is the same
if these actions are 'swapped' to obtain the serial schedule T1;T2.

T1              T2
R(A)
W(A)
                R(A)
                W(A)
R(B)
W(B)
                R(B)
                W(B)
Commit
                Commit

Figure 16.2  A Serializable Schedule

Executing transactions serially in different orders may produce different results,
but all are presumed to be acceptable; the DBMS makes no guarantees about
which of them will be the outcome of an interleaved execution. To see this,
note that the two example transactions from Figure 16.2 can be interleaved as
shown in Figure 16.3. This schedule, also serializable, is equivalent to the serial
schedule T2;T1. If T1 and T2 are submitted concurrently to a DBMS, either
of these schedules (among others) could be chosen.

The preceding definition of a serializable schedule does not cover the case of
schedules containing aborted transactions. We extend the definition of serial­
izable schedules to cover aborted transactions in Section 16.3.4.

1 If a transaction prints a value to the screen, this 'effect' is not directly captured in the database.
For simplicity, we assume that such values are also written into the database.


that the actions are interleaved so that (1) the account transfer program T 1
deducts $100 from account A, then (2) the interest deposit program T 2 reads
the current values of accounts A and B and adds 6% interest to each, and then
(3) the account transfer program credits $100 to account B. The corresponding
schedule, which is the view the DBMS has of this series of events, is illustrated
in Figure 16.4. The result of this schedule is different from any result that we
would get by running one of the two transactions first and then the other. The
problem can be traced to the fact that the value of A written by T1 is read by
T 2 before T1 has completed all its changes.

T1              T2
R(A)
W(A)
                R(A)
                W(A)
                R(B)
                W(B)
                Commit
R(B)
W(B)
Commit

Figure 16.4  Reading Uncommitted Data

The general problem illustrated here is that T1 may write some value into A
that makes the database inconsistent. As long as T1 overwrites this value with
a ‘correct’value of A before committing, no harm is done if T1 and T 2 run in
some serial order, because T2 would then not see the (temporary) inconsistency.
On the other hand, interleaved execution can expose this inconsistency and lead
to an inconsistent final database state.

Note that although a transaction must leave a database in a consistent state


after it completes, it is not required to keep the database consistent while it is
still in progress. Such a requirement would be too restrictive: To transfer money
from one account to another, a transaction must debit one account, temporarily
leaving the database inconsistent, and then credit the second account, restoring
consistency.


Unrepeatable Reads (RW Conflicts)

The second way in which anomalous behavior could result is that a transaction
T2 could change the value of an object A that has been read by a transaction
T1, while T1 is still in progress.

If T1 tries to read the value of A again, it will get a different result, even though
it has not modified A in the meantime. This situation could not arise in a serial
execution of two transactions; it is called an unrepeatable read.

To see why this can cause problems, consider the following example. Suppose
that A is the number of available copies for a book. A transaction that places
an order first reads A, checks that it is greater than 0, and then decrements it.
Transaction T1 reads A and sees the value 1. Transaction T2 also reads A and
sees the value 1, decrements A to 0 and commits. Transaction T1 then tries to
decrement A and gets an error (if there is an integrity constraint that prevents
A from becoming negative).

This situation can never arise in a serial execution of T1 and T2; the second
transaction would read A and see 0 and therefore not proceed with the order
(and so would not attempt to decrement A).

Overwriting Uncommitted Data (WW Conflicts)

The third source of anomalous behavior is that a transaction T2 could overwrite
the value of an object A, which has already been modified by a transaction T1,
while T1 is still in progress. Even if T2 does not read the value of A written
by T1, a potential problem exists as the following example illustrates.

Suppose that Harry and Larry are two employees, and their salaries must be
kept equal. Transaction T1 sets their salaries to $2000 and transaction T2 sets
their salaries to $1000. If we execute these in the serial order T1 followed by
T2, both receive the salary $1000; the serial order T2 followed by T1 gives each
the salary $2000. Either of these is acceptable from a consistency standpoint
(although Harry and Larry may prefer a higher salary!). Note that neither
transaction reads a salary value before writing it; such a write is called a
blind write, for obvious reasons.

Now, consider the following interleaving of the actions of T1 and T2: T2 sets
Harry's salary to $1000, T1 sets Larry's salary to $2000, T2 sets Larry's salary
to $1000 and commits, and finally T1 sets Harry's salary to $2000 and commits.
The result is not identical to the result of either of the two possible serial
executions, and the interleaved schedule is therefore not serializable. It violates
the desired consistency criterion that the two salaries must be equal.

The problem is that we have a lost update. The first transaction to commit,
T2, overwrote Larry's salary as set by T1. In the serial order T2 followed by
T1, Larry's salary should reflect T1's update rather than T2's, but T1's update
is 'lost'.

16.3.4 Schedules Involving Aborted Transactions

We now extend our definition of serializability to include aborted transactions.2
Intuitively, all actions of aborted transactions are to be undone, and we can
therefore imagine that they were never carried out to begin with. Using this
intuition, we extend the definition of a serializable schedule as follows: A
serializable schedule over a set S of transactions is a schedule whose effect on
any consistent database instance is guaranteed to be identical to that of some
complete serial schedule over the set of committed transactions in S.

This definition of serializability relies on the actions of aborted transactions
being undone completely, which may be impossible in some situations. For
example, suppose that (1) an account transfer program T1 deducts $100 from
account A, then (2) an interest deposit program T2 reads the current values of
accounts A and B and adds 6% interest to each, then commits, and then (3)
T1 is aborted. The corresponding schedule is shown in Figure 16.5.

T1              T2
R(A)
W(A)
                R(A)
                W(A)
                R(B)
                W(B)
                Commit
Abort

Figure 16.5  An Unrecoverable Schedule

2We must also consider incomplete transactions for a rigorous discussion of system failures, because
transactions that are active when the system fails are neither aborted nor committed. However, system
recovery usually begins by aborting all active transactions, and for our informal discussion, considering
schedules involving committed and aborted transactions is sufficient.


Now, T2 has read a value for A that should never have been there. (Recall
that aborted transactions' effects are not supposed to be visible to other transactions.)
If T2 had not yet committed, we could deal with the situation by
cascading the abort of T1 and also aborting T2; this process recursively aborts
any transaction that read data written by T2, and so on. But T2 has already
committed, and so we cannot undo its actions. We say that such a schedule
is unrecoverable. In a recoverable schedule, transactions commit only after
(and if!) all transactions whose changes they read commit. If transactions read
only the changes of committed transactions, not only is the schedule recoverable,
but also aborting a transaction can be accomplished without cascading
the abort to other transactions. Such a schedule is said to avoid cascading
aborts.

There is another potential problem in undoing the actions of a transaction.
Suppose that a transaction T2 overwrites the value of an object A that has been
modified by a transaction T1, while T1 is still in progress, and T1 subsequently
aborts. All of T1's changes to database objects are undone by restoring the
value of any object that it modified to the value of the object before T1's
changes. (We look at the details of how a transaction abort is handled in
Chapter 18.) When T1 is aborted and its changes are undone in this manner,
T2's changes are lost as well, even if T2 decides to commit. So, for example, if
A originally had the value 5, then was changed by T1 to 6, and by T2 to 7, if
T1 now aborts, the value of A becomes 5 again. Even if T2 commits, its change
to A is inadvertently lost. A concurrency control technique called Strict 2PL,
introduced in Section 16.4, can prevent this problem (as discussed in Section
17.1).

16.4 LOCK-BASED CONCURRENCY CONTROL

A DBMS must be able to ensure that only serializable, recoverable schedules
are allowed and that no actions of committed transactions are lost while undoing
aborted transactions. A DBMS typically uses a locking protocol to achieve
this. A lock is a small bookkeeping object associated with a database object.
A locking protocol is a set of rules to be followed by each transaction (and enforced
by the DBMS) to ensure that, even though actions of several transactions
might be interleaved, the net effect is identical to executing all transactions in
some serial order. Different locking protocols use different types of locks, such
as shared locks or exclusive locks, as we see next, when we discuss the Strict
2PL protocol.


16.4.1 Strict Two-Phase Locking (Strict 2PL)

The most widely used locking protocol, called Strict Two-Phase Locking, or
Strict 2PL, has two rules. The first rule is

1. If a transaction T wants to read (respectively, modify) an object, it
first requests a shared (respectively, exclusive) lock on the object.

Of course, a transaction that has an exclusive lock can also read the object;
an additional shared lock is not required. A transaction that requests a lock is
suspended until the DBMS is able to grant it the requested lock. The DBMS
keeps track of the locks it has granted and ensures that if a transaction holds
an exclusive lock on an object, no other transaction holds a shared or exclusive
lock on the same object. The second rule in Strict 2PL is

2. All locks held by a transaction are released when the transaction is
completed.

Requests to acquire and release locks can be automatically inserted into trans­
actions by the DBMS; users need not worry about these details. (We discuss
how application programmers can select properties of transactions and control
locking overhead in Section 16.6.3.)

In effect, the locking protocol allows only ‘safe’interleavings of transactions.


If two transactions access completely independent parts of the database, they
concurrently obtain the locks they need and proceed merrily on their ways. On
the other hand, if two transactions access the same object, and one wants to
modify it, their actions are effectively ordered serially— all actions of one of
these transactions (the one that gets the lock on the comm on object first) are
completed before (this lock is released and) the other transaction can proceed.

We denote the action of a transaction T requesting a shared (respectively, exclusive)
lock on object O as S_T(O) (respectively, X_T(O)) and omit the subscript
denoting the transaction when it is clear from the context. As an example,
consider the schedule shown in Figure 16.4. This interleaving could result in a
state that cannot result from any serial execution of the two transactions. For
instance, T1 could change A from 10 to 20, then T2 (which reads the value 20
for A) could change B from 100 to 200, and then T1 would read the value 200
for B. If run serially, either T1 or T2 would execute first, and read the values
10 for A and 100 for B: Clearly, the interleaved execution is not equivalent to
either serial execution.

If the Strict 2PL protocol is used, such interleaving is disallowed. Let us see
why. Assuming that the transactions proceed at the same relative speed as
before, T1 would obtain an exclusive lock on A first and then read and write
A (Figure 16.6). Then, T2 would request a lock on A. However, this request

T1              T2
X(A)
R(A)
W(A)

Figure 16.6  Schedule Illustrating Strict 2PL

cannot be granted until T1 releases its exclusive lock on A, and the DBMS
therefore suspends T2. T1 now proceeds to obtain an exclusive lock on B,
reads and writes B, then finally commits, at which time its locks are released.
T2's lock request is now granted, and it proceeds. In this example the locking
protocol results in a serial execution of the two transactions, shown in Figure
16.7.

T1              T2
X(A)
R(A)
W(A)
X(B)
R(B)
W(B)
Commit
                X(A)
                R(A)
                W(A)
                X(B)
                R(B)
                W(B)
                Commit

Figure 16.7  Schedule Illustrating Strict 2PL with Serial Execution

In general, however, the actions of different transactions could be interleaved.
As an example, consider the interleaving of two transactions shown in Figure
16.8, which is permitted by the Strict 2PL protocol.

It can be shown that the Strict 2PL algorithm allows only serializable schedules.
None of the anomalies discussed in Section 16.3.3 can arise if the DBMS
implements Strict 2PL.
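
To make the two rules concrete, here is a minimal lock-table sketch of our own (not the book's implementation; all names are illustrative): a transaction requests a shared or exclusive lock before each read or write, requests block while incompatible locks are held, and everything is released only when the transaction completes, as Strict 2PL requires.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal Strict 2PL sketch: locks are acquired before access and released only at completion.
public class StrictTwoPhaseLockManager {
    private static class Entry {
        final Set<Long> sharedHolders = new HashSet<>(); // transaction ids holding S locks
        Long exclusiveHolder = null;                      // transaction id holding the X lock
    }

    private final Map<String, Entry> table = new HashMap<>();

    public synchronized void lockShared(long txn, String object) throws InterruptedException {
        Entry e = table.computeIfAbsent(object, k -> new Entry());
        while (e.exclusiveHolder != null && e.exclusiveHolder != txn) {
            wait(); // suspend until the exclusive holder completes
        }
        e.sharedHolders.add(txn);
    }

    public synchronized void lockExclusive(long txn, String object) throws InterruptedException {
        Entry e = table.computeIfAbsent(object, k -> new Entry());
        while ((e.exclusiveHolder != null && e.exclusiveHolder != txn)
                || !isOnlySharedHolder(e, txn)) {
            wait(); // suspend while any other transaction holds a lock on the object
        }
        e.exclusiveHolder = txn;
    }

    private boolean isOnlySharedHolder(Entry e, long txn) {
        return e.sharedHolders.isEmpty()
                || (e.sharedHolders.size() == 1 && e.sharedHolders.contains(txn));
    }

    // Rule 2 of Strict 2PL: release everything only when the transaction commits or aborts.
    public synchronized void releaseAll(long txn) {
        for (Entry e : table.values()) {
            e.sharedHolders.remove(txn);
            if (e.exclusiveHolder != null && e.exclusiveHolder == txn) {
                e.exclusiveHolder = null;
            }
        }
        notifyAll(); // wake transactions waiting for locks this transaction held
    }
}

Note that this sketch simply blocks conflicting requests and does nothing about transactions that wait for each other forever; that situation, a deadlock, is the subject of the next subsection.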


T1              T2
S(A)
R(A)
                S(A)
                R(A)
                X(B)
                R(B)
                W(B)
                Commit
X(C)
R(C)
W(C)
Commit

Figure 16.8  Schedule Following Strict 2PL with Interleaved Actions

16.4.2 Deadlocks

Consider the following example. Transaction T1 sets an exclusive lock on object
A, T2 sets an exclusive lock on B, T1 requests an exclusive lock on B and is
queued, and T2 requests an exclusive lock on A and is queued. Now, T1 is
waiting for T2 to release its lock and T2 is waiting for T1 to release its lock.
Such a cycle of transactions waiting for locks to be released is called a deadlock.
Clearly, these two transactions will make no further progress. Worse, they
hold locks that may be required by other transactions. The DBMS must either
prevent or detect (and resolve) such deadlock situations; the common approach
is to detect and resolve deadlocks.

A simple way to identify deadlocks is to use a timeout mechanism. If a transaction
has been waiting too long for a lock, we can assume (pessimistically)
that it is in a deadlock cycle and abort it. We discuss deadlocks in more detail
in Section 17.2.
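
The common alternative to timeouts, treated in detail in Section 17.2, is to maintain a waits-for graph with an edge from Ti to Tj whenever Ti waits for a lock held by Tj; a cycle in the graph indicates a deadlock. A minimal sketch of our own (not the book's code):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal waits-for graph sketch: an edge Ti -> Tj means Ti is waiting for a lock held by Tj.
// A cycle in the graph means the transactions on it are deadlocked.
public class WaitsForGraph {
    private final Map<Long, Set<Long>> edges = new HashMap<>();

    public void addWait(long waiter, long holder) {
        edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    public void remove(long txn) {              // call when txn commits or is aborted
        edges.remove(txn);
        for (Set<Long> targets : edges.values()) {
            targets.remove(txn);
        }
    }

    public boolean hasDeadlock() {
        Set<Long> onPath = new HashSet<>();
        Set<Long> finished = new HashSet<>();
        for (Long start : edges.keySet()) {
            if (hasCycleFrom(start, onPath, finished)) {
                return true;
            }
        }
        return false;
    }

    private boolean hasCycleFrom(long node, Set<Long> onPath, Set<Long> finished) {
        if (onPath.contains(node)) return true;   // back edge: a waiting cycle, i.e., a deadlock
        if (finished.contains(node)) return false;
        onPath.add(node);
        for (long next : edges.getOrDefault(node, new HashSet<>())) {
            if (hasCycleFrom(next, onPath, finished)) return true;
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }
}

For the example above, addWait(1, 2) followed by addWait(2, 1) makes hasDeadlock() return true, and the DBMS would then pick one of the two transactions to abort.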

16.5 PERFORMANCE OF LOCKING

Lock-based schemes are designed to resolve conflicts between transactions and
use two basic mechanisms: blocking and aborting. Both mechanisms involve
a performance penalty: Blocked transactions may hold locks that force other
transactions to wait, and aborting and restarting a transaction obviously wastes
the work done thus far by that transaction. A deadlock represents an extreme
instance of blocking in which a set of transactions is forever blocked unless one
of the deadlocked transactions is aborted by the DBMS.


In practice, fewer than 1% of transactions are involved in a deadlock, and there
are relatively few aborts. Therefore, the overhead of locking comes primarily
from delays due to blocking.3 Consider how blocking delays affect throughput.
The first few transactions are unlikely to conflict, and throughput rises in proportion
to the number of active transactions. As more and more transactions
execute concurrently on the same number of database objects, the likelihood of
their blocking each other goes up. Thus, delays due to blocking increase with
the number of active transactions, and throughput increases more slowly than
the number of active transactions. In fact, there comes a point when adding
another active transaction actually reduces throughput; the new transaction is
blocked and effectively competes with (and blocks) existing transactions. We
say that the system thrashes at this point, which is illustrated in Figure 16.9.

Figure 16.9  Lock Thrashing

If a database system begins to thrash, the database administrator should reduce


the number of transactions allowed to run concurrently. Empirically, thrashing
is seen to occur when 30% of active transactions are blocked, and a DBA should
monitor the fraction of blocked transactions to see if the system is at risk of
thrashing.

Throughput can be increased in three ways (other than buying a faster system):

■ By locking the smallest sized objects possible (reducing the likelihood that
two transactions need the same lock).

■ By reducing the time that transactions hold locks (so that other transactions
are blocked for a shorter time).

3 Many common deadlocks can be avoided using a technique called lock downgrades, implemented
in most commercial systems (Section 17.3).


SQL:1999 Nested Transactions: The concept of a transaction as an
atomic sequence of actions has been extended in SQL:1999 through the
introduction of the savepoint feature. This allows parts of a transaction to
be selectively rolled back. The introduction of savepoints represents the
first SQL support for the concept of nested transactions, which have
been extensively studied in the research community. The idea is that a
transaction can have several nested subtransactions, each of which can
be selectively rolled back. Savepoints support a simple form of one-level
nesting.

In a long-running transaction, we may want to define a series of savepoints.
The savepoint command allows us to give each savepoint a name:

    SAVEPOINT <savepoint name>

A subsequent rollback command can specify the savepoint to roll back to:

    ROLLBACK TO SAVEPOINT <savepoint name>

If we define three savepoints A, B, and C in that order, and then roll back to
A, all operations since A are undone, including the creation of savepoints B
and C. Indeed, the savepoint A is itself undone when we roll back to it, and
we must re-establish it (through another savepoint command) if we wish to be
able to roll back to it again. From a locking standpoint, locks obtained after
savepoint A can be released when we roll back to A.
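
In application code, savepoints are usually reached through the database API rather than raw SQL. The following JDBC sketch is our own illustration, not from the book; the connection URL and the Sailors updates are assumed, and error handling is omitted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Savepoint;
import java.sql.Statement;

// Minimal JDBC sketch of one-level nesting with a savepoint.
public class SavepointExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:example:acs")) {
            conn.setAutoCommit(false);                    // one explicit transaction
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("UPDATE Sailors SET age = 30 WHERE sid = 1");
                Savepoint a = conn.setSavepoint("A");     // SAVEPOINT A
                stmt.executeUpdate("UPDATE Sailors SET age = 40 WHERE sid = 2");
                conn.rollback(a);                         // ROLLBACK TO SAVEPOINT A
                conn.commit();                            // only the first update persists
            }
        }
    }
}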

It is instructive to compare the use of savepoints with the alternative of execut­


ing a series of transactions (i.e., treat all operations in between two consecutive
savepoints as a new transaction). The savepoint mechanism offers two ad­
vantages. First, we can roll back over several savepoints. In the alternative
approach, we can roll back only the most recent transaction, which is equiv­
alent to rolling back to the most recent savepoint. Second, the overhead of
initiating several transactions is avoided.

Even with the use of savepoints, certain applications might require us to run
several transactions one after the other. To minimize the overhead in such
situations, SQL:1999 introduces another feature, called chained transactions.
We can commit or roll back a transaction and immediately initiate another
transaction. This is done by using the optional keywords AND CHAIN in the
COMMIT and ROLLBACK statements.


16.6.2 What Should We Lock?

Until now, we have discussed transactions and concurrency control in terms of


an abstract model in which a database contains a fixed collection of objects, and
each transaction is a series of read and write operations on individual objects.
An important question to consider in the context of SQL is what the DBMS
should treat as an object when setting locks for a given SQL statement (that is
part of a transaction).

Consider the following query:

SELECT S.rating, MIN (S.age)
FROM Sailors S
WHERE S.rating = 8

Suppose that this query runs as part of transaction T1 and an SQL statement
that modifies the age of a given sailor, say Joe, with rating=8 runs as part of
transaction T2. What 'objects' should the DBMS lock when executing these
transactions? Intuitively, we must detect a conflict between these transactions.

The DBMS could set a shared lock on the entire Sailors table for T1 and set
an exclusive lock on Sailors for T2, which would ensure that the two transactions
are executed in a serializable manner. However, this approach yields low
concurrency, and we can do better by locking smaller objects, reflecting what
each transaction actually accesses. Thus, the DBMS could set a shared lock
on every row with rating=8 for transaction T1 and set an exclusive lock on
just the row for the modified tuple for transaction T2. Now, other read-only
transactions that do not involve rating=8 rows can proceed without waiting for
T1 or T2.

As this example illustrates, the DBMS can lock objects at different granularities:
We can lock entire tables or set row-level locks. The latter approach is
taken in current systems because it offers much better performance. In practice,
while row-level locking is generally better, the choice of locking granularity is
complicated. For example, a transaction that examines several rows and modifies
those that satisfy some condition might be best served by setting shared
locks on the entire table and setting exclusive locks on those rows it wants to
modify. We discuss this issue further in Section 17.5.3.

A second point to note is that SQL statements conceptually access a collection
of rows described by a selection predicate. In the preceding example, transaction
T1 accesses all rows with rating=8. We suggested that this could be dealt with
by setting shared locks on all rows in Sailors that had rating=8. Unfortunately,
this is a little too simplistic. To see why, consider an SQL statement that inserts


a new sailor with rating=8 and runs as transaction T3. (Observe that this
example violates our assumption of a fixed number of objects in the database,
but we must obviously deal with such situations in practice.)

Suppose that the DBMS sets shared locks on every existing Sailors row with
rating=8 for T1. This does not prevent transaction T3 from creating a brand
new row with rating=8 and setting an exclusive lock on this row. If this new row
has a smaller age value than existing rows, T1 returns an answer that depends
on when it executed relative to T3. However, our locking scheme imposes no
relative order on these two transactions.

This phenomenon is called the phantom problem: A transaction retrieves
a collection of objects (in SQL terms, a collection of tuples) twice and sees
different results, even though it does not modify any of these tuples itself. To
prevent phantoms, the DBMS must conceptually lock all possible rows with
rating=8 on behalf of T1. One way to do this is to lock the entire table, at
the cost of low concurrency. It is possible to take advantage of indexes to do
better, as we will see in Section 17.5.1, but in general preventing phantoms can
have a significant impact on concurrency.

It may well be that the application invoking T1 can accept the potential inaccuracy
due to phantoms. If so, the approach of setting shared locks on existing
tuples for T1 is adequate, and offers better performance. SQL allows a programmer
to make this choice, and other similar choices, explicitly, as we see
next.

16.6.3 Transaction Characteristics in SQL

In order to give programmers control over the locking overhead incurred by
their transactions, SQL allows them to specify three characteristics of a transaction:
access mode, diagnostics size, and isolation level. The diagnostics
size determines the number of error conditions that can be recorded; we will
not discuss this feature further.

If the access mode is READ ONLY, the transaction is not allowed to modify
the database. Thus, INSERT, DELETE, UPDATE, and CREATE commands cannot
be executed. If we have to execute one of these commands, the access mode
should be set to READ WRITE. For transactions with READ ONLY access mode,
only shared locks need to be obtained, thereby increasing concurrency.

The isolation level controls the extent to which a given transaction is exposed
to the actions of other transactions executing concurrently. By choosing
one of four possible isolation level settings, a user can obtain greater concurrency
at the cost of increasing the transaction's exposure to other transactions'
uncommitted changes.

Isolation level choices are READ UNCOMMITTED, READ COMMITTED, REPEATABLE


READ, and SERIALIZABLE. The effect of these levels is summarized in Figure
16.10. In this context, dirty read and unrepeatable read are defined as usual.

Level               Dirty Read    Unrepeatable Read    Phantom

READ UNCOMMITTED    Maybe         Maybe                Maybe
READ COMMITTED      No            Maybe                Maybe
REPEATABLE READ     No            No                   Maybe
SERIALIZABLE        No            No                   No

Figure 16.10  Transaction Isolation Levels in SQL-92

The highest degree of isolation from the effects of other transactions is achieved
by setting the isolation level for a transaction T to SERIALIZABLE. This isolation
level ensures that T reads only the changes made by committed transactions,
that no value read or written by T is changed by any other transaction until T
is complete, and that if T reads a set of values based on some search condition,
this set is not changed by other transactions until T is complete (i.e., T avoids
the phantom phenomenon).

In terms of a lock-based implementation, a SERIALIZABLE transaction obtains


locks before reading or writing objects, including locks on sets of objects that
it requires to be unchanged (see Section 17.5.1) and holds them until the end,
according to Strict 2PL.

REPEATABLE READ ensures that T reads only the changes made by committed
transactions and no value read or written by T is changed by any other
transaction until T is complete. However, T could experience the phantom
phenomenon; for example, while T examines all Sailors records with rating=1,
another transaction might add a new such Sailors record, which is missed by
T.

A REPEATABLE READ transaction sets the same locks as a SERIALIZABLE transaction,
except that it does not do index locking; that is, it locks only individual
objects, not sets of objects. We discuss index locking in detail in Section 17.5.1.

READ COMMITTED ensures that T reads only the changes made by committed
transactions, and that no value written by T is changed by any other transaction
until T is complete. However, a value read by T may well be modified by


another transaction while T is still in progress, and T is exposed to the phantom


problem.

A READ COMMITTED transaction obtains exclusive locks before writing objects
and holds these locks until the end. It also obtains shared locks before reading
objects, but these locks are released immediately; their only effect is to
guarantee that the transaction that last modified the object is complete. (This
guarantee relies on the fact that every SQL transaction obtains exclusive locks
before writing objects and holds exclusive locks until the end.)

A READ UNCOMMITTED transaction T can read changes made to an object by an


ongoing transaction; obviously, the object can be changed further while T is in
progress, and T is also vulnerable to the phantom problem.

A READ UNCOMMITTED transaction does not obtain shared locks before reading
objects. This m ode represents the greatest exposure to uncommitted changes
of other transactions; so much so that SQL prohibits such a transaction from
making any changes itself—a READ UNCOMMITTED transaction is required to have
an access m ode of READ ONLY. Since such a transaction obtains no locks for
reading objects and it is not allowed to write objects (and therefore never
requests exclusive locks), it never makes any lock requests.

The SERIALIZABLE isolation level is generally the safest and is recommended for
most transactions. Some transactions, however, can run with a lower isolation
level, and the smaller number of locks requested can contribute to improved system
performance. For example, a statistical query that finds the average sailor
age can be run at the READ COMMITTED level or even the READ UNCOMMITTED
level, because a few incorrect or missing values do not significantly affect the
result if the number of sailors is large.

The isolation level and access mode can be set using the SET TRANSACTION command.
For example, the following command declares the current transaction
to be SERIALIZABLE and READ ONLY:

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY

When a transaction is started, the default is SERIALIZABLE and READ WRITE.
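
From application code these characteristics are typically set through the database API; the JDBC sketch below is our own illustration (the connection URL is an assumed placeholder), showing the READ ONLY access mode and the SERIALIZABLE isolation level for subsequent transactions on the connection.

import java.sql.Connection;
import java.sql.DriverManager;

// Minimal JDBC sketch: setting access mode and isolation level for transactions on a connection.
public class TransactionCharacteristics {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:example:acs")) {
            conn.setReadOnly(true);  // READ ONLY access mode: only shared locks are needed
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            // ... run the read-only, serializable transaction here ...
        }
    }
}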

16.7 INTRODUCTION TO CRASH RECOVERY

The recovery manager of a DBMS is responsible for ensuring transaction
atomicity and durability. It ensures atomicity by undoing the actions of transactions
that do not commit, and durability by making sure that all actions of
committed transactions survive system crashes (e.g., a core dump caused by
a bus error) and media failures (e.g., a disk is corrupted).

When a DBMS is restarted after crashes, the recovery manager is given control
and must bring the database to a consistent state. The recovery manager is
also responsible for undoing the actions of an aborted transaction. To see what
it takes to implement a recovery manager, it is necessary to understand what
happens during normal execution.

The transaction manager of a DBMS controls the execution of transactions. Before reading and writing objects during normal execution, locks must be acquired (and released at some later time) according to a chosen locking protocol.5 For simplicity of exposition, we make the following assumption:

Atomic Writes: Writing a page to disk is an atomic action.

This assumption implies that the system does not crash while a write is in progress, which is unrealistic. In practice, disk writes do not have this property, and steps must be taken during restart after a crash (Section 18.6) to verify that the most recent write to a given page was completed successfully, and to deal with the consequences if not.

16.7.1 Stealing Frames and Forcing Pages

With respect to writing objects, two additional questions arise:

1. Can the changes made to an object O in the buffer pool by a transaction T be written to disk before T commits? Such writes are executed when another transaction wants to bring in a page and the buffer manager chooses to replace the frame containing O; of course, this page must have been unpinned by T. If such writes are allowed, we say that a steal approach is used. (Informally, the second transaction 'steals' a frame from T.)

2. When a transaction commits, must we ensure that all the changes it has made to objects in the buffer pool are immediately forced to disk? If so, we say that a force approach is used.

From the standpoint of implementing a recovery manager, it is simplest to use a buffer manager with a no-steal, force approach. If a no-steal approach is used, we do not have to undo the changes of an aborted transaction (because these changes have not been written to disk), and if a force approach is used, we do not have to redo the changes of a committed transaction if there is a subsequent crash (because all these changes are guaranteed to have been written to disk at commit time).

5A concurrency control technique that does not involve locking could be used instead, but we assume that locking is used.

However, these policies have important drawbacks. The no-steal approach assumes that all pages modified by ongoing transactions can be accommodated in the buffer pool, and in the presence of large transactions (typically run in batch mode, e.g., payroll processing), this assumption is unrealistic. The force approach results in excessive page I/O costs. If a highly used page is updated in succession by 20 transactions, it would be written to disk 20 times. With a no-force approach, on the other hand, the in-memory copy of the page would be successively modified and written to disk just once, reflecting the effects of all 20 updates, when the page is eventually replaced in the buffer pool (in accordance with the buffer manager's page replacement policy).

For these reasons, most systems use a steal, no-force approach. Thus, if a frame is dirty and chosen for replacement, the page it contains is written to disk even if the modifying transaction is still active (steal); in addition, pages in the buffer pool that are modified by a transaction are not forced to disk when the transaction commits (no-force).
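The recovery consequences of the four policy combinations follow mechanically from the definitions above: a steal policy forces the recovery manager to support undo, and a no-force policy forces it to support redo. A minimal sketch (the class and field names are ours, for illustration only):

from dataclasses import dataclass

@dataclass
class BufferPolicy:
    steal: bool   # may pages dirtied by uncommitted transactions reach disk?
    force: bool   # are all of a transaction's dirty pages flushed at commit?

def recovery_requirements(p: BufferPolicy) -> dict:
    """Which recovery capabilities a given buffer-management policy requires."""
    return {
        # Steal: an uncommitted transaction's changes may already be on disk,
        # so aborts (and crashes) require UNDO.
        "undo required": p.steal,
        # No-force: a committed transaction's changes may not be on disk yet,
        # so crashes require REDO.
        "redo required": not p.force,
    }

if __name__ == "__main__":
    print(recovery_requirements(BufferPolicy(steal=True, force=False)))   # what most systems use
    print(recovery_requirements(BufferPolicy(steal=False, force=True)))   # simple but impractical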

16.7.2 Recovery-Related Steps during Normal Execution

The recovery manager of a DBMS maintains some information during normal execution of transactions to enable it to perform its task in the event of a failure. In particular, a log of all modifications to the database is saved on stable storage, which is guaranteed6 to survive crashes and media failures. Stable storage is implemented by maintaining multiple copies of information (perhaps in different locations) on nonvolatile storage devices such as disks or tapes.

As discussed earlier in Section 16.7, it is important to ensure that the log entries describing a change to the database are written to stable storage before the change is made; otherwise, the system might crash just after the change, leaving us without a record of the change. (Recall that this is the Write-Ahead Log, or WAL, property.)
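The WAL property reduces to one check in the buffer manager: a dirty page may be written to disk only after every log record describing a change to that page is on stable storage. A minimal sketch, assuming each page carries the LSN of the last log record that modified it (the pageLSN used by ARIES) and that the log tracks how far it has been flushed; class and method names are ours:

class Log:
    """Toy append-only log; records reach stable storage in LSN order."""
    def __init__(self):
        self.records = []       # in-memory tail of the log
        self.flushed_lsn = -1   # largest LSN known to be on stable storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1    # LSN of the new record

    def flush(self, up_to_lsn):
        # Force the log to stable storage at least up to up_to_lsn.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

class Page:
    def __init__(self):
        self.page_lsn = -1      # LSN of the last log record describing a change to this page
        self.dirty = False

def write_page_to_disk(page, log):
    """Enforce the Write-Ahead Log rule before a dirty page is written out."""
    if page.page_lsn > log.flushed_lsn:
        log.flush(page.page_lsn)        # WAL: log records first, then the data page
    # ... the actual disk write of the page would happen here ...
    page.dirty = False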

The log enables the recovery manager to undo the actions of aborted and
incomplete transactions and redo the actions of committed transactions. For
example, a transaction that committed before the crash may have made updates

6Nothing in life is really guaranteed except death and taxes. However, we can reduce the chance
of log failure to be vanishingly small by taking steps such as duplexing the log and storing the copies
in different secure locations.


Tuning the Recovery Subsystem: DBMS performance can be greatly affected by the overhead imposed by the recovery subsystem. A DBA can take several steps to tune this subsystem, such as correctly sizing the log and how it is managed on disk, controlling the rate at which buffer pages are forced to disk, choosing a good frequency for checkpointing, and so forth.

to a copy (of a database object) in the buffer pool, and this change may not have
been written to disk before the crash, because of a no-force approach. Such
changes must be identified using the log and written to disk. Further, changes
of transactions that did not commit prior to the crash might have been written
to disk because of a steal approach. Such changes must be identified using the
log and then undone.

The amount of work involved during recovery is proportional to the changes made by committed transactions that have not been written to disk at the time of the crash. To reduce the time to recover from a crash, the DBMS periodically forces buffer pages to disk during normal execution using a background process (while making sure that any log entries that describe changes to these pages are written to disk first, i.e., following the WAL protocol). A process called checkpointing, which saves information about active transactions and dirty buffer pool pages, also helps reduce the time taken to recover from a crash. Checkpoints are discussed in Section 18.5.

16.7.3 Overview of ARIES

ARIES is a recovery algorithm that is designed to work with a steal, no-force approach. When the recovery manager is invoked after a crash, restart proceeds in three phases. In the Analysis phase, it identifies dirty pages in the buffer pool (i.e., changes that have not been written to disk) and active transactions at the time of the crash. In the Redo phase, it repeats all actions, starting from an appropriate point in the log, and restores the database state to what it was at the time of the crash. Finally, in the Undo phase, it undoes the actions of transactions that did not commit, so that the database reflects only the actions of committed transactions. The ARIES algorithm is discussed further in Chapter 18.
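The control flow of restart can be sketched as follows. This is only an outline of the three phases named above; the record fields are simplified placeholders, and the real algorithm in Chapter 18 adds checkpoints, pageLSN comparisons, and compensation log records.

def aries_restart(log_records):
    """Sketch of ARIES restart: Analysis, then Redo, then Undo."""
    # Analysis: find pages that may be dirty and transactions active at the crash.
    dirty_pages, active_txns = {}, set()
    for rec in log_records:
        if rec["type"] == "update":
            active_txns.add(rec["txn"])
            dirty_pages.setdefault(rec["page"], rec["lsn"])  # first LSN that dirtied the page
        elif rec["type"] in ("commit", "abort"):
            active_txns.discard(rec["txn"])

    # Redo: repeat history from the earliest point that may have dirtied a page,
    # restoring the state as of the crash.
    redo_start = min(dirty_pages.values(), default=0)
    for rec in log_records:
        if rec["type"] == "update" and rec["lsn"] >= redo_start:
            apply_redo(rec)

    # Undo: roll back transactions that never committed, newest change first.
    for rec in reversed(log_records):
        if rec["type"] == "update" and rec["txn"] in active_txns:
            apply_undo(rec)

def apply_redo(rec):   # placeholder: reapply the logged change
    pass

def apply_undo(rec):   # placeholder: restore the before-image of the logged change
    pass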

16.7.4 Atomicity: Implementing Rollback

It is important to recognize that the recovery subsystem is also responsible for executing the ROLLBACK command, which aborts a single transaction. Indeed, rolling back a single transaction during normal execution uses the same log-based undo machinery that the recovery manager applies to incomplete transactions after a crash.

17
CONCURRENCY CONTROL

■ How does Strict 2PL ensure serializability and recoverability?
■ How are locks implemented in a DBMS?
■ What are lock conversions and why are they important?
■ How does a DBMS resolve deadlocks?
■ How do current systems deal with the phantom problem?
■ Why are specialized locking techniques used on tree indexes?
■ How does multiple-granularity locking work?
■ What is Optimistic concurrency control?
■ What is Timestamp-Based concurrency control?
■ What is Multiversion concurrency control?

Key concepts: two-phase locking (2PL), serializability, recoverability, precedence graph, strict schedule, view equivalence, view serializable, lock manager, lock table, transaction table, latch, convoy, lock upgrade, deadlock, waits-for graph, conservative 2PL, index locking, predicate locking, multiple-granularity locking, lock escalation, SQL isolation level, phantom problem, optimistic concurrency control, Thomas Write Rule

Pooh was sitting in his house one day, counting his pots of honey, when there came a knock on the door.
"Fourteen," said Pooh. "Come in. Fourteen. Or was it fifteen? Bother. That's muddled me."
"Hallo, Pooh," said Rabbit.
"Hallo, Rabbit. Fourteen, wasn't it?"
"What was?"
"My pots of honey what I was counting."
"Fourteen, that's right."
"Are you sure?"
"No," said Rabbit. "Does it matter?"

—A.A. Milne, The House at Pooh Corner

In this chapter, we look at concurrency control in more detail. We begin by looking at locking protocols and how they guarantee various important properties of schedules in Section 17.1. Section 17.2 is an introduction to how locking protocols are implemented in a DBMS. Section 17.3 discusses the issue of lock conversions, and Section 17.4 covers deadlock handling. Section 17.5 discusses three specialized locking protocols—for locking sets of objects identified by some predicate, for locking nodes in tree-structured indexes, and for locking collections of related objects. Section 17.6 examines some alternatives to the locking approach.

17.1 2PL, SERIALIZABILITY, AND RECOVERABILITY

In this section, we consider how locking protocols guarantee some important properties of schedules; namely, serializability and recoverability. Two schedules are said to be conflict equivalent if they involve the (same set of) actions of the same transactions and they order every pair of conflicting actions of two committed transactions in the same way.

As we saw in Section 16.3.3, two actions conflict if they operate on the same data object and at least one of them is a write. The outcome of a schedule depends only on the order of conflicting operations; we can interchange any pair of nonconflicting operations without altering the effect of the schedule on the database. If two schedules are conflict equivalent, it is easy to see that they have the same effect on a database. Indeed, because they order all pairs of conflicting operations in the same way, we can obtain one of them from the other by repeatedly swapping pairs of nonconflicting actions, that is, by swapping pairs of actions whose relative order does not alter the outcome.

A schedule is conflict serializable if it is conflict equivalent to some serial schedule. Every conflict serializable schedule is serializable, if we assume that the set of items in the database does not grow or shrink; that is, values can be modified but items are not added or deleted. We make this assumption for now and consider its consequences in Section 17.5.1. However, some serializable schedules are not conflict serializable, as illustrated in Figure 17.1. This schedule is equivalent to executing the transactions serially in the order T1, T2, T3, but it is not conflict equivalent to this serial schedule because the writes of T1 and T2 are ordered differently.

T1              T2              T3
R(A)
                W(A)
                Commit
W(A)
Commit
                                W(A)
                                Commit

Figure 17.1  Serializable Schedule That Is Not Conflict Serializable

It is useful to capture all potential conflicts between the transactions in a schedule in a precedence graph, also called a serializability graph. The precedence graph for a schedule S contains:

■ A node for each committed transaction in S.

■ An arc from Ti to Tj if an action of Ti precedes and conflicts with one of Tj's actions.

The precedence graphs for the schedules shown in Figures 16.7, 16.8, and 17.1 are shown in Figure 17.2 (parts a, b, and c, respectively).

Figure 17.2  Examples of Precedence Graphs

The Strict 2PL protocol (introduced in Section 16.4) allows only conflict serializable schedules, as is seen from the following two results:


1. A schedule S is conflict serializable if and only if its precedence graph is acyclic. (An equivalent serial schedule in this case is given by any topological sort over the precedence graph; a small sketch of this test appears after this list.)

2. Strict 2PL ensures that the precedence graph for any schedule that it allows is acyclic.
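Result (1) gives a direct, mechanical test for conflict serializability. The sketch below assumes a schedule is given as a list of (transaction, action, object) triples for committed transactions, with actions 'R' and 'W'; these conventions and function names are ours, chosen only for illustration.

from itertools import combinations

def precedence_graph(schedule):
    """Edges Ti -> Tj whenever an action of Ti precedes and conflicts with one of Tj's."""
    edges = set()
    for (t1, a1, o1), (t2, a2, o2) in combinations(schedule, 2):
        if t1 != t2 and o1 == o2 and "W" in (a1, a2):
            edges.add((t1, t2))
    return edges

def conflict_serial_order(schedule):
    """Return an equivalent serial order (a topological sort), or None if there is a cycle."""
    nodes = {t for t, _, _ in schedule}
    edges = precedence_graph(schedule)
    order = []
    while nodes:
        free = [n for n in nodes if not any(dst == n and src in nodes for src, dst in edges)]
        if not free:
            return None          # cycle in the precedence graph: not conflict serializable
        n = sorted(free)[0]
        order.append(n)
        nodes.remove(n)
    return order

if __name__ == "__main__":
    # The schedule of Figure 17.1 (commits omitted): T1 and T2's writes conflict both ways.
    fig_17_1 = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A"), ("T3", "W", "A")]
    print(conflict_serial_order(fig_17_1))   # None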

A widely studied variant of Strict 2PL, called Two-Phase Locking (2PL), relaxes the second rule of Strict 2PL to allow transactions to release locks before the end, that is, before the commit or abort action. For 2PL, the second rule is replaced by the following rule:

(2PL) (2) A transaction cannot request additional locks once it releases any lock.

Thus, every transaction has a 'growing' phase in which it acquires locks, followed by a 'shrinking' phase in which it releases locks.

It can be shown that even nonstrict 2PL ensures acyclicity of the precedence graph and therefore allows only conflict serializable schedules. Intuitively, an equivalent serial order of transactions is given by the order in which transactions enter their shrinking phase: If T2 reads or writes an object written by T1, T1 must have released its lock on the object before T2 requested a lock on this object. Thus, T1 precedes T2. (A similar argument shows that T1 precedes T2 if T2 writes an object previously read by T1. A formal proof of the claim would have to show that there is no cycle of transactions that 'precede' each other by this argument.)

A schedule is said to be strict if a value written by a transaction T is not read or overwritten by other transactions until T either aborts or commits. Strict schedules are recoverable, do not require cascading aborts, and actions of aborted transactions can be undone by restoring the original values of modified objects. (See the last example in Section 16.3.4.) Strict 2PL improves on 2PL by guaranteeing that every allowed schedule is strict in addition to being conflict serializable. The reason is that when a transaction T writes an object under Strict 2PL, it holds the (exclusive) lock until it commits or aborts. Thus, no other transaction can see or modify this object until T is complete.

The reader is invited to revisit the examples in Section 16.3.3 to see how the corresponding schedules are disallowed by Strict 2PL and 2PL. Similarly, it would be instructive to work out how the schedules for the examples in Section 16.3.4 are disallowed by Strict 2PL but not by 2PL.


17.1.1 View Serializability

Conflict serializability is sufficient but not necessary for serializability. A more general sufficient condition is view serializability. Two schedules S1 and S2 over the same set of transactions—any transaction that appears in either S1 or S2 must also appear in the other—are view equivalent under these conditions:

1. If Ti reads the initial value of object A in S1, it must also read the initial value of A in S2.

2. If Ti reads a value of A written by Tj in S1, it must also read the value of A written by Tj in S2.

3. For each data object A, the transaction (if any) that performs the final write on A in S1 must also perform the final write on A in S2.

A schedule is view serializable if it is view equivalent to some serial schedule. Every conflict serializable schedule is view serializable, although the converse is not true. For example, the schedule shown in Figure 17.1 is view serializable, although it is not conflict serializable. Incidentally, note that this example contains blind writes. This is not a coincidence; it can be shown that any view serializable schedule that is not conflict serializable contains a blind write.

As we saw in Section 17.1, efficient locking protocols allow us to ensure that only conflict serializable schedules are allowed. Enforcing or testing view serializability turns out to be much more expensive, and the concept therefore has little practical use, although it increases our understanding of serializability.

17.2 INTRODUCTION TO LOCK MANAGEMENT

The part of the DBMS that keeps track of the locks issued to transactions is called the lock manager. The lock manager maintains a lock table, which is a hash table with the data object identifier as the key. The DBMS also maintains a descriptive entry for each transaction in a transaction table, and among other things, the entry contains a pointer to a list of locks held by the transaction. This list is checked before requesting a lock, to ensure that a transaction does not request the same lock twice.

A lock table entry for an object—which can be a page, a record, and so on, depending on the DBMS—contains the following information: the number of transactions currently holding a lock on the object (this can be more than one if the object is locked in shared mode), the nature of the lock (shared or exclusive), and a pointer to a queue of lock requests.


17.2.1 Implementing Lock and Unlock Requests

According to the Strict 2PL protocol, before a transaction T reads or writes a database object O, it must obtain a shared or exclusive lock on O and must hold on to the lock until it commits or aborts. When a transaction needs a lock on an object, it issues a lock request to the lock manager:

1. If a shared lock is requested, the queue of requests is empty, and the object is not currently locked in exclusive mode, the lock manager grants the lock and updates the lock table entry for the object (indicating that the object is locked in shared mode, and incrementing the number of transactions holding a lock by one).

2. If an exclusive lock is requested and no transaction currently holds a lock on the object (which also implies the queue of requests is empty), the lock manager grants the lock and updates the lock table entry.

3. Otherwise, the requested lock cannot be immediately granted, and the lock request is added to the queue of lock requests for this object. The transaction requesting the lock is suspended.

When a transaction aborts or commits, it releases all its locks. When a lock on an object is released, the lock manager updates the lock table entry for the object and examines the lock request at the head of the queue for this object. If this request can now be granted, the transaction that made the request is woken up and given the lock. Indeed, if several requests for a shared lock on the object are at the front of the queue, all of these requests can now be granted together.
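The grant rules just described can be captured in a small sketch of a lock table. This is only an illustration of the queueing discipline above; field and method names are ours, and a real lock manager would also latch the table, record locks in the transaction table, and handle upgrades.

from collections import deque

class LockEntry:
    def __init__(self):
        self.mode = None         # 'S', 'X', or None if the object is unlocked
        self.holders = set()     # transactions currently holding the lock
        self.queue = deque()     # pending (txn, mode) requests, FIFO

class LockManager:
    def __init__(self):
        self.table = {}          # object id -> LockEntry (the lock table)

    def request(self, txn, obj, mode):
        """Return True if granted immediately; otherwise queue the request (caller suspends txn)."""
        e = self.table.setdefault(obj, LockEntry())
        if not e.queue and (e.mode is None or (mode == "S" and e.mode == "S")):
            e.mode = mode
            e.holders.add(txn)
            return True
        e.queue.append((txn, mode))
        return False

    def release(self, txn, obj):
        """Release txn's lock; return the transactions whose queued requests are now granted."""
        e = self.table[obj]
        e.holders.discard(txn)
        if e.holders:
            return []            # other holders remain, nothing new can be granted
        e.mode = None
        granted = []
        while e.queue:
            t, m = e.queue[0]
            if e.mode is None or (m == "S" and e.mode == "S"):
                e.queue.popleft()
                e.mode = m
                e.holders.add(t)
                granted.append(t)    # a run of S requests at the head is granted together
                if m == "X":
                    break
            else:
                break
        return granted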

Note that if T1 has a shared lock on O and T2 requests an exclusive lock, T2's request is queued. Now, if T3 requests a shared lock, its request enters the queue behind that of T2, even though the requested lock is compatible with the lock held by T1. This rule ensures that T2 does not starve, that is, wait indefinitely while a stream of other transactions acquire shared locks and thereby prevent T2 from getting the exclusive lock for which it is waiting.

Atomicity of Locking and Unlocking

The implementation of lock and unlock commands must ensure that these are atomic operations. To ensure atomicity of these operations when several instances of the lock manager code can execute concurrently, access to the lock table has to be guarded by an operating system synchronization mechanism such as a semaphore.


To understand why, suppose that a transaction requests an exclusive lock. The lock manager checks and finds that no other transaction holds a lock on the object and therefore decides to grant the request. But, in the meantime, another transaction might have requested and received a conflicting lock. To prevent this, the entire sequence of actions in a lock request call (checking to see if the request can be granted, updating the lock table, etc.) must be implemented as an atomic operation.

Other Issues: Latches, Convoys

In addition to locks, which are held over a long duration, a DBMS also supports short-duration latches. Setting a latch before reading or writing a page ensures that the physical read or write operation is atomic; otherwise, two read/write operations might conflict if the objects being locked do not correspond to disk pages (the units of I/O). Latches are unset immediately after the physical read or write operation is completed.

We concentrated thus far on how the DBMS schedules transactions based on their requests for locks. This interleaving interacts with the operating system's scheduling of processes' access to the CPU and can lead to a situation called a convoy, where most of the CPU cycles are spent on process switching. The problem is that a transaction T holding a heavily used lock may be suspended by the operating system. Until T is resumed, every other transaction that needs this lock is queued. Such queues, called convoys, can quickly become very long; a convoy, once formed, tends to be stable. Convoys are one of the drawbacks of building a DBMS on top of a general-purpose operating system with preemptive scheduling.

17.3 LOCK CONVERSIONS

A transaction may need to acquire an exclusive lock on an object for which it already holds a shared lock. For example, a SQL update statement could result in shared locks being set on each row in a table. If a row satisfies the condition (in the WHERE clause) for being updated, an exclusive lock must be obtained for that row.

Such a lock upgrade request must be handled specially by granting the exclusive lock immediately if no other transaction holds a shared lock on the object and inserting the request at the front of the queue otherwise. The rationale for favoring the transaction thus is that it already holds a shared lock on the object, and queuing it behind another transaction that wants an exclusive lock on the same object causes a deadlock. Unfortunately, while favoring lock upgrades helps, it does not prevent deadlocks caused by two conflicting upgrade


requests. For example, if two transactions that hold a shared lock on an object both request an upgrade to an exclusive lock, this leads to a deadlock.

A better approach is to avoid the need for lock upgrades altogether by obtaining exclusive locks initially, and downgrading to a shared lock once it is clear that this is sufficient. In our example of an SQL update statement, rows in a table are locked in exclusive mode first. If a row does not satisfy the condition for being updated, the lock on the row is downgraded to a shared lock. Does the downgrade approach violate the 2PL requirement? On the surface, it does, because downgrading reduces the locking privileges held by a transaction, and the transaction may go on to acquire other locks. However, this is a special case, because the transaction did nothing but read the object that it downgraded, even though it conservatively obtained an exclusive lock. We can safely expand our definition of 2PL from Section 17.1 to allow lock downgrades in the growing phase, provided that the transaction has not modified the object.

The downgrade approach reduces concurrency by obtaining write locks in some cases where they are not required. On the whole, however, it improves throughput by reducing deadlocks. This approach is therefore widely used in current commercial systems. Concurrency can be increased by introducing a new kind of lock, called an update lock, that is compatible with shared locks but not other update and exclusive locks. By setting an update lock initially, rather than exclusive locks, we prevent conflicts with other read operations. Once we are sure we need not update the object, we can downgrade to a shared lock. If we need to update the object, we must first upgrade to an exclusive lock. This upgrade does not lead to a deadlock because no other transaction can have an upgrade or exclusive lock on the object.
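Introducing update locks changes only the lock compatibility matrix. The sketch below (S = shared, U = update, X = exclusive) follows the description above, treating compatibility as symmetric; the table and function names are ours, for illustration.

# COMPATIBLE[held][requested]: can `requested` be granted while another
# transaction holds a lock in mode `held` on the same object?
COMPATIBLE = {
    "S": {"S": True,  "U": True,  "X": False},
    "U": {"S": True,  "U": False, "X": False},
    "X": {"S": False, "U": False, "X": False},
}

def can_grant(held_modes, requested):
    """A request is grantable only if it is compatible with every lock currently held."""
    return all(COMPATIBLE[h][requested] for h in held_modes)

if __name__ == "__main__":
    print(can_grant({"S"}, "U"))   # True: an update lock coexists with shared locks
    print(can_grant({"U"}, "U"))   # False: two U locks conflict, so no deadlock between upgraders
    print(can_grant({"U"}, "X"))   # False: only the U holder itself may upgrade to X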

17.4 DEALING WITH DEADLOCKS

Deadlocks tend to be rare and typically involve very few transactions. In practice, therefore, database systems periodically check for deadlocks. When a transaction Ti is suspended because a lock that it requests cannot be granted, it must wait until all transactions Tj that currently hold conflicting locks release them. The lock manager maintains a structure called a waits-for graph to detect deadlock cycles. The nodes correspond to active transactions, and there is an arc from Ti to Tj if (and only if) Ti is waiting for Tj to release a lock. The lock manager adds edges to this graph when it queues lock requests and removes edges when it grants lock requests.

Consider the schedule shown in Figure 17.3. The last step, shown below the line, creates a cycle in the waits-for graph. Figure 17.4 shows the waits-for graph before and after this step.


T1              T2              T3              T4
S(A)
R(A)
                X(B)
                W(B)
                                S(B)
                                                S(C)
                                                R(C)
                X(C)
X(B)
------------------------------------------------------------
                                                X(A)

Figure 17.3  Schedule Illustrating Deadlock

Figure 17.4  Waits-for Graph Before and After Deadlock


Observe that the waits-for graph describes all active transactions, some of which eventually abort. If there is an edge from Ti to Tj in the waits-for graph, and both Ti and Tj eventually commit, there is an edge in the opposite direction (from Tj to Ti) in the precedence graph (which involves only committed transactions).

The waits-for graph is periodically checked for cycles, which indicate deadlock. A deadlock is resolved by aborting a transaction that is on a cycle and releasing its locks; this action allows some of the waiting transactions to proceed. The choice of which transaction to abort can be made using several criteria: the one with the fewest locks, the one that has done the least work, the one that is farthest from completion, and so on. Further, a transaction might have been repeatedly restarted; if so, it should eventually be favored during deadlock detection and allowed to complete.

A simple alternative to maintaining a waits-for graph is to identify deadlocks through a timeout mechanism: If a transaction has been waiting too long for a lock, we assume (pessimistically) that it is in a deadlock cycle and abort it.
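Detection is a cycle search over the waits-for graph, followed by victim selection. A minimal sketch; the victim rule used here (fewest locks held) is just one of the criteria listed above, and the example graph is illustrative rather than taken from the figure.

def find_cycle(waits_for):
    """waits_for maps Ti to the set of Tj it waits for; return one cycle, or None."""
    def dfs(node, path, visited):
        visited.add(node)
        path.append(node)
        for nxt in waits_for.get(node, ()):
            if nxt in path:                      # back edge: a deadlock cycle
                return path[path.index(nxt):]
            if nxt not in visited:
                cycle = dfs(nxt, path, visited)
                if cycle:
                    return cycle
        path.pop()
        return None

    visited = set()
    for start in list(waits_for):
        if start not in visited:
            cycle = dfs(start, [], visited)
            if cycle:
                return cycle
    return None

def choose_victim(cycle, locks_held):
    """Abort the transaction on the cycle holding the fewest locks (one possible criterion)."""
    return min(cycle, key=lambda t: locks_held.get(t, 0))

if __name__ == "__main__":
    # Example graph: T3 waits for T2, and T1 -> T2 -> T4 -> T1 form a cycle.
    g = {"T1": {"T2"}, "T3": {"T2"}, "T2": {"T4"}, "T4": {"T1"}}
    cycle = find_cycle(g)
    print(cycle, choose_victim(cycle, {"T1": 2, "T2": 2, "T4": 1}))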

17.4.1 Deadlock Prevention

Empirical results indicate that deadlocks are relatively infrequent, and detection-based schemes work well in practice. However, if there is a high level of contention for locks and therefore an increased likelihood of deadlocks, prevention-based schemes could perform better. We can prevent deadlocks by giving each transaction a priority and ensuring that lower-priority transactions are not allowed to wait for higher-priority transactions (or vice versa). One way to assign priorities is to give each transaction a timestamp when it starts up. The lower the timestamp, the higher is the transaction's priority; that is, the oldest transaction has the highest priority.

If a transaction Ti requests a lock and transaction Tj holds a conflicting lock, the lock manager can use one of the following two policies:

■ Wait-die: If Ti has higher priority, it is allowed to wait; otherwise, it is aborted.

■ Wound-wait: If Ti has higher priority, abort Tj; otherwise, Ti waits.

In the wait-die scheme, lower-priority transactions can never wait for higher-priority transactions. In the wound-wait scheme, higher-priority transactions never wait for lower-priority transactions. In either case, no deadlock cycle develops.
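Both schemes reduce to one decision taken when a request conflicts with a held lock. A minimal sketch, using start timestamps as priorities (a lower timestamp means an older, higher-priority transaction); the function name and return strings are ours.

def resolve(policy, ts_requester, ts_holder):
    """What happens when the requester's lock request conflicts with the holder's lock.

    Returns 'wait', 'abort requester', or 'abort holder'.
    """
    requester_is_older = ts_requester < ts_holder   # older = higher priority
    if policy == "wait-die":
        # Higher-priority requesters wait; lower-priority requesters die (are aborted).
        return "wait" if requester_is_older else "abort requester"
    if policy == "wound-wait":
        # Higher-priority requesters wound (abort) the holder; lower-priority requesters wait.
        return "abort holder" if requester_is_older else "wait"
    raise ValueError(policy)

if __name__ == "__main__":
    print(resolve("wait-die", ts_requester=5, ts_holder=9))     # wait
    print(resolve("wait-die", ts_requester=9, ts_holder=5))     # abort requester
    print(resolve("wound-wait", ts_requester=5, ts_holder=9))   # abort holder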


A subtle point is that we must also ensure that no transaction is perennially aborted because it never has a sufficiently high priority. (Note that, in both schemes, the higher-priority transaction is never aborted.) When a transaction is aborted and restarted, it should be given the same timestamp it had originally. Reissuing timestamps in this way ensures that each transaction will eventually become the oldest transaction, and therefore the one with the highest priority, and will get all the locks it requires.

The wait-die scheme is nonpreemptive; only a transaction requesting a lock can be aborted. As a transaction grows older (and its priority increases), it tends to wait for more and more younger transactions. A younger transaction that conflicts with an older transaction may be repeatedly aborted (a disadvantage with respect to wound-wait), but on the other hand, a transaction that has all the locks it needs is never aborted for deadlock reasons (an advantage with respect to wound-wait, which is preemptive).

A variant of 2PL, called Conservative 2PL, can also prevent deadlocks. Under Conservative 2PL, a transaction obtains all the locks it will ever need when it begins, or blocks waiting for these locks to become available. This scheme ensures that there will be no deadlocks, and, perhaps more important, that a transaction that already holds some locks will not block waiting for other locks. If lock contention is heavy, Conservative 2PL can reduce the time that locks are held on average, because transactions that hold locks are never blocked. The trade-off is that a transaction acquires locks earlier, and if lock contention is low, locks are held longer under Conservative 2PL. From a practical perspective, it is hard to know exactly what locks are needed ahead of time, and this approach leads to setting more locks than necessary. It also has higher overhead for setting locks because a transaction has to release all locks and try to obtain them all over if it fails to obtain even one lock that it needs. This approach is therefore not used in practice.

17.5 SPECIALIZED LOCKING TECHNIQUES

Thus far we have treated a database as a fixed collection of independent data objects in our presentation of locking protocols. We now relax each of these restrictions and discuss the consequences.

If the collection of database objects is not fixed, but can grow and shrink through the insertion and deletion of objects, we must deal with a subtle complication known as the phantom problem, which was illustrated in Section 16.6.2. We discuss this problem in Section 17.5.1.


Although treating a database as an independent collection of objects is adequate for a discussion of serializability and recoverability, much better performance can sometimes be obtained using protocols that recognize and exploit the relationships between objects. We discuss two such cases, namely, locking in tree-structured indexes (Section 17.5.2) and locking a collection of objects with containment relationships between them (Section 17.5.3).

17.5.1 Dynamic Databases and the Phantom Problem

Consider the following example: Transaction T1 scans the Sailors relation to find the oldest sailor for each of the rating levels 1 and 2. First, T1 identifies and locks all pages (assuming that page-level locks are set) containing sailors with rating 1 and then finds the age of the oldest sailor, which is, say, 71. Next, transaction T2 inserts a new sailor with rating 1 and age 96. Observe that this new Sailors record can be inserted onto a page that does not contain other sailors with rating 1; thus, an exclusive lock on this page does not conflict with any of the locks held by T1. T2 also locks the page containing the oldest sailor with rating 2 and deletes this sailor (whose age is, say, 80). T2 then commits and releases its locks. Finally, transaction T1 identifies and locks pages containing (all remaining) sailors with rating 2 and finds the age of the oldest such sailor, which is, say, 63.

The result of the interleaved execution is that ages 71 and 63 are printed in response to the query. If T1 had run first, then T2, we would have gotten the ages 71 and 80; if T2 had run first, then T1, we would have gotten the ages 96 and 63. Thus, the result of the interleaved execution is not identical to any serial execution of T1 and T2, even though both transactions follow Strict 2PL and commit. The problem is that T1 assumes that the pages it has locked include all pages containing Sailors records with rating 1, and this assumption is violated when T2 inserts a new such sailor on a different page.

The flaw is not in the Strict 2PL protocol. Rather, it is in T1's implicit assumption that it has locked the set of all Sailors records with rating value 1. T1's semantics requires it to identify all such records, but locking pages that contain such records at a given time does not prevent new "phantom" records from being added on other pages. T1 has therefore not locked the set of desired Sailors records.

Strict 2PL guarantees conflict serializability; indeed, there are no cycles in the precedence graph for this example because conflicts are defined with respect to objects (in this example, pages) read/written by the transactions. However, because the set of objects that should have been locked by T1 was altered by the actions of T2, the outcome of the schedule differed from the outcome of any


serial execution. This example brings out an important point about conflict serializability: If new items are added to the database, conflict serializability does not guarantee serializability.

A closer look at how a transaction identifies pages containing Sailors records with rating 1 suggests how the problem can be handled:

■ If there is no index and all pages in the file must be scanned, T1 must somehow ensure that no new pages are added to the file, in addition to locking all existing pages.

■ If there is an index on the rating field, T1 can obtain a lock on the index page—again, assuming that physical locking is done at the page level—that contains a data entry with rating=1. If there are no such data entries, that is, no records with this rating value, the page that would contain a data entry for rating=1 is locked to prevent such a record from being inserted. Any transaction that tries to insert a record with rating=1 into the Sailors relation must insert a data entry pointing to the new record into this index page and is blocked until T1 releases its locks. This technique is called index locking.

Both techniques effectively give T1 a lock on the set of Sailors records with rating=1: Each existing record with rating=1 is protected from changes by other transactions, and additionally, new records with rating=1 cannot be inserted.

An independent issue is how transaction T1 can efficiently identify and lock the index page containing rating=1. We discuss this issue for the case of tree-structured indexes in Section 17.5.2.

We note that index locking is a special case of a more general concept called predicate locking. In our example, the lock on the index page implicitly locked all Sailors records that satisfy the logical predicate rating=1. More generally, we can support implicit locking of all records that match an arbitrary predicate. General predicate locking is expensive to implement and therefore not commonly used.

17.5.2 Concurrency Control in B+ Trees

A straightforward approach to concurrency control for B+ trees and ISAM indexes is to ignore the index structure, treat each page as a data object, and use some version of 2PL. This simplistic locking strategy would lead to very high lock contention in the higher levels of the tree, because every tree search begins at the root and proceeds along some path to a leaf node. Fortunately, much more efficient locking protocols that exploit the hierarchical structure of a tree


index are known to reduce the locking overhead while ensuring serializability and recoverability. We discuss some of these approaches briefly, concentrating on the search and insert operations.

Two observations provide the necessary insight:

1. The higher levels of the tree only direct searches. All the 'real' data is in the leaf levels (in the format of one of the three alternatives for data entries).

2. For inserts, a node must be locked (in exclusive mode, of course) only if a split can propagate up to it from the modified leaf.

Searches should obtain shared locks on nodes, starting at the root and proceeding along a path to the desired leaf. The first observation suggests that a lock on a node can be released as soon as a lock on a child node is obtained, because searches never go back up the tree.

A conservative locking strategy for inserts would be to obtain exclusive locks on all nodes as we go down from the root to the leaf node to be modified, because splits can propagate all the way from a leaf to the root. However, once we lock the child of a node, the lock on the node is required only in the event that a split propagates back to it. In particular, if the child of this node (on the path to the modified leaf) is not full when it is locked, any split that propagates up to the child can be resolved at the child, and does not propagate further to the current node. Therefore, when we lock a child node, we can release the lock on the parent if the child is not full. The locks held thus by an insert force any other transaction following the same path to wait at the earliest point (i.e., the node nearest the root) that might be affected by the insert. The technique of locking a child node and (if possible) releasing the lock on the parent is called lock-coupling, or crabbing (think of how a crab walks, and compare it to how we proceed down a tree, alternately releasing a lock on a parent and setting a lock on a child).
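Crabbing with the conservative exclusive-lock strategy can be sketched as a descent that keeps locks only on the chain of ancestors a split could still reach. The node methods used below (child_for, is_full, insert) and the lock calls are placeholders, not a particular B+ tree implementation.

def insert_with_crabbing(root, key, lock_X, unlock):
    """Descend from the root to a leaf, releasing ancestor locks whenever a non-full
    (safe) child guarantees that a split cannot propagate any higher."""
    locked = [root]                # ancestors on which we still hold X locks
    lock_X(root)
    node = root
    while not node.is_leaf:
        child = node.child_for(key)
        lock_X(child)
        if not child.is_full():    # a split cannot pass a non-full node,
            for n in locked:       # so every lock held above it can be released
                unlock(n)
            locked = []
        locked.append(child)
        node = child
    node.insert(key)               # any split is absorbed by the locked ancestors
    for n in locked:
        unlock(n)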

We illustrate B+ tree locking using the tree in Figure 17.5. To search for data entry 38*, a transaction Ti must obtain an S lock on node A, read the contents and determine that it needs to examine node B, obtain an S lock on node B and release the lock on A, then obtain an S lock on node C and release the lock on B, then obtain an S lock on node D and release the lock on C.

Ti always maintains a lock on one node in the path, to force new transactions that want to read or modify nodes on the same path to wait until the current transaction is done. If transaction Tj wants to delete 38*, for example, it must also traverse the path from the root to node D and is forced to wait until Ti


Figure 17.5  B+ Tree Locking Example

is done. Of course, if some transaction Tk holds a lock on, say, node C before Ti reaches this node, Ti is similarly forced to wait for Tk to complete.

To insert data entry 45*, a transaction must obtain an S lock on node A, obtain an S lock on node B and release the lock on A, then obtain an S lock on node C (observe that the lock on B is not released, because C is full), then obtain an X lock on node E and release the locks on C and then B. Because node E has space for the new entry, the insert is accomplished by modifying this node.

In contrast, consider the insertion of data entry 25*. Proceeding as for the insert of 45*, we obtain an X lock on node H. Unfortunately, this node is full and must be split. Splitting H requires that we also modify the parent, node F, but the transaction has only an S lock on F. Thus, it must request an upgrade of this lock to an X lock. If no other transaction holds an S lock on F, the upgrade is granted, and since F has space, the split does not propagate further and the insertion of 25* can proceed (by splitting H and locking G to modify the sibling pointer in I to point to the newly created node). However, if another transaction holds an S lock on node F, the first transaction is suspended until this transaction releases its S lock.

Observe that if another transaction holds an S lock on F and also wants to access node H, we have a deadlock because the first transaction has an X lock on H. The preceding example also illustrates an interesting point about sibling pointers: When we split leaf node H, the new node must be added to the left of H, since otherwise the node whose sibling pointer is to be changed would be node I, which has a different parent. To modify a sibling pointer on I, we


would have to lock its parent, node C (and possibly ancestors of C, in order to lock C).

Except for the locks on intermediate nodes that we indicated could be released early, some variant of 2PL must be used to govern when locks can be released, to ensure serializability and recoverability.

This approach improves considerably on the naive use of 2PL, but several exclusive locks are still set unnecessarily and, although they are quickly released, affect performance substantially. One way to improve performance is for inserts to obtain shared locks instead of exclusive locks, except for the leaf, which is locked in exclusive mode. In the vast majority of cases, a split is not required and this approach works very well. If the leaf is full, however, we must upgrade from shared locks to exclusive locks for all nodes to which the split propagates. Note that such lock upgrade requests can also lead to deadlocks.

The tree locking ideas that we describe illustrate the potential for efficient locking protocols in this very important special case, but they are not the current state of the art. The interested reader should pursue the leads in the bibliography.

17.5.3 Multiple-Granularity Locking

Another specialized locking strategy, called multiple-granularity locking, allows us to efficiently set locks on objects that contain other objects.

For instance, a database contains several files, a file is a collection of pages, and a page is a collection of records. A transaction that expects to access most of the pages in a file should probably set a lock on the entire file, rather than locking individual pages (or records) when it needs them. Doing so reduces the locking overhead considerably. On the other hand, other transactions that require access to parts of the file—even parts not needed by this transaction—are blocked. If a transaction accesses relatively few pages of the file, it is better to lock only those pages. Similarly, if a transaction accesses several records on a page, it should lock the entire page, and if it accesses just a few records, it should lock just those records.

The question to be addressed is how a lock manager can efficiently ensure that a page, for example, is not locked by a transaction while another transaction holds a conflicting lock on the file containing the page (and therefore, implicitly, on the page).

The idea is to exploit the hierarchical nature of the 'contains' relationship. A database contains a set of files, each file contains a set of pages, and each page contains a set of records. This containment hierarchy can be thought of as a tree of objects, where each node contains all its children. (The approach can easily be extended to cover hierarchies that are not trees, but we do not discuss this extension.) A lock on a node locks that node and, implicitly, all its descendants. (Note that this interpretation of a lock is very different from B+ tree locking, where locking a node does not lock any descendants implicitly.)

In addition to shared (S) and exclusive (X) locks, multiple-granularity locking protocols also use two new kinds of locks, called intention shared (IS) and intention exclusive (IX) locks. IS locks conflict only with X locks. IX locks conflict with S and X locks. To lock a node in S (respectively, X) mode, a transaction must first lock all its ancestors in IS (respectively, IX) mode. Thus, if a transaction locks a node in S mode, no other transaction can have locked any ancestor in X mode; similarly, if a transaction locks a node in X mode, no other transaction can have locked any ancestor in S or X mode. This ensures that no other transaction holds a lock on an ancestor that conflicts with the requested S or X lock on the node.

A common situation is that a transaction needs to read an entire file and modify a few of the records in it; that is, it needs an S lock on the file and an IX lock so that it can subsequently lock some of the contained objects in X mode. It is useful to define a new kind of lock, called an SIX lock, that is logically equivalent to holding an S lock and an IX lock. A transaction can obtain a single SIX lock (which conflicts with any lock that conflicts with either S or IX) instead of an S lock and an IX lock.
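The protocol amounts to a compatibility matrix over IS, IX, S, SIX, and X together with the rule that intention locks are taken on all ancestors first. The matrix below is the standard one implied by the rules above (the text spells out only the IS and IX conflicts explicitly); the table and function names are ours, for illustration.

# COMPAT[held][requested]: compatibility of multiple-granularity lock modes.
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def locks_to_acquire(path_to_object, mode):
    """Locks needed to lock the last node of path_to_object (listed root first) in S or X mode:
    an intention lock on every ancestor, then the requested lock on the object itself."""
    intention = "IS" if mode == "S" else "IX"
    return [(node, intention) for node in path_to_object[:-1]] + [(path_to_object[-1], mode)]

if __name__ == "__main__":
    # To X-lock one page, first lock the database and the file in IX mode:
    print(locks_to_acquire(["database", "file_F", "page_P"], "X"))
    # An SIX lock on a file coexists only with IS lockers of that file:
    print(COMPAT["SIX"]["IS"], COMPAT["SIX"]["IX"])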

A subtle point is that locks must be released in leaf-to-root order for this protocol to work correctly. To see this, consider what happens when a transaction Ti locks all nodes on a path from the root (corresponding to the entire database) to the node corresponding to some page p in IS mode, locks p in S mode, and then releases the lock on the root node. Another transaction Tj could now obtain an X lock on the root. This lock implicitly gives Tj an X lock on page p, which conflicts with the S lock currently held by Ti.

Multiple-granularity locking must be used with 2PL to ensure serializability. The 2PL protocol dictates when locks can be released. At that time, locks obtained using multiple-granularity locking can be released and must be released in leaf-to-root order.

Finally, there is the question of how to decide what granularity of locking is appropriate for a given transaction. One approach is to begin by obtaining fine granularity locks (e.g., at the record level) and, after the transaction requests


Lock Granularity: Some database systems allow programmers to override the default mechanism for choosing a lock granularity. For example, Microsoft SQL Server allows users to select page locking instead of table locking, using the keyword PAGLOCK. IBM's DB2 UDB allows for explicit table-level locking.

a certain number of locks at that granularity, to start obtaining locks at the next higher granularity (e.g., at the page level). This procedure is called lock escalation.

17.6 CONCURRENCY CONTROL WITHOUT LOCKING

Locking is the most widely used approach to concurrency control in a DBMS, but it is not the only one. We now consider some alternative approaches.

17.6.1 Optimistic Concurrency Control

Locking protocols take a pessimistic approach to conflicts between transactions and use either transaction abort or blocking to resolve conflicts. In a system with relatively light contention for data objects, the overhead of obtaining locks and following a locking protocol must nonetheless be paid.

In optimistic concurrency control, the basic premise is that most transactions do not conflict with other transactions, and the idea is to be as permissive as possible in allowing transactions to execute. Transactions proceed in three phases:

1. Read: The transaction executes, reading values from the database and writing to a private workspace.

2. Validation: If the transaction decides that it wants to commit, the DBMS checks whether the transaction could possibly have conflicted with any other concurrently executing transaction. If there is a possible conflict, the transaction is aborted; its private workspace is cleared and it is restarted.

3. Write: If validation determines that there are no possible conflicts, the changes to data objects made by the transaction in its private workspace are copied into the database.

If, indeed, there are few conflicts, and validation can be done efficiently, this approach should lead to better performance than locking. If there are many conflicts, the cost of repeatedly restarting transactions (thereby wasting the work they've done) hurts performance significantly.

Each transaction Ti is assigned a timestamp TS(Ti) at the beginning of its validation phase, and the validation criterion checks whether the timestamp-ordering of transactions is an equivalent serial order. For every pair of transactions Ti and Tj such that TS(Ti) < TS(Tj), one of the following validation conditions must hold:

1. Ti completes (all three phases) before Tj begins.

2. Ti completes before Tj starts its Write phase, and Ti does not write any database object read by Tj.

3. Ti completes its Read phase before Tj completes its Read phase, and Ti does not write any database object that is either read or written by Tj.

To validate Tj, we must check to see that one of these conditions holds with respect to each committed transaction Ti such that TS(Ti) < TS(Tj). Each of these conditions ensures that Tj's modifications are not visible to Ti.

Further, the first condition allows Tj to see some of Ti's changes, but clearly, they execute completely in serial order with respect to each other. The second condition allows Tj to read objects while Ti is still modifying objects, but there is no conflict because Tj does not read any object modified by Ti. Although Tj might overwrite some objects written by Ti, all of Ti's writes precede all of Tj's writes. The third condition allows Ti and Tj to write objects at the same time and thus have even more overlap in time than the second condition, but the sets of objects written by the two transactions cannot overlap. Thus, no RW, WR, or WW conflicts are possible if any of these three conditions is met.
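The three conditions translate directly into a check of the transaction being validated against every committed transaction with a smaller timestamp. A minimal sketch, assuming each transaction records its read set, write set, and the times at which its phases ended; for the transaction under validation, the start of its Write phase can be taken as the current time, since that phase has not begun yet. The field and function names are ours.

from dataclasses import dataclass, field

@dataclass
class Txn:
    start: float                         # when its Read phase began
    read_end: float                      # when its Read phase completed
    write_start: float                   # when its Write phase began (or "now" if still validating)
    finish: float                        # when all three phases completed
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def validate(tj, committed_older):
    """True if Tj may commit; committed_older holds every committed Ti with TS(Ti) < TS(Tj)."""
    for ti in committed_older:
        if ti.finish <= tj.start:
            continue                     # condition 1: Ti finished before Tj began
        if ti.finish <= tj.write_start and ti.write_set.isdisjoint(tj.read_set):
            continue                     # condition 2: Ti done before Tj's Write phase, no WR overlap
        if (ti.read_end <= tj.read_end and
                ti.write_set.isdisjoint(tj.read_set) and
                ti.write_set.isdisjoint(tj.write_set)):
            continue                     # condition 3: Read phases ordered; Ti writes nothing Tj reads or writes
        return False                     # no condition holds: Tj is aborted and restarted
    return True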

Checking these validation criteria requires us to maintain lists of objects read and written by each transaction. Further, while one transaction is being validated, no other transaction can be allowed to commit; otherwise, the validation of the first transaction might miss conflicts with respect to the newly committed transaction. The Write phase of a validated transaction must also be completed (so that its effects are visible outside its private workspace) before other transactions can be validated.

A synchronization mechanism such as a critical section can be used to ensure that at most one transaction is in its (combined) Validation/Write phases at any time. (When a process is executing a critical section in its code, the system suspends all other processes.) Obviously, it is important to keep these phases as short as possible in order to minimize the impact on concurrency. If copies of modified objects have to be copied from the private workspace, this can make the Write phase long. An alternative approach (which carries the penalty of poor physical locality of objects, such as B+ tree leaf pages, that must be clustered) is to use a level of indirection. In this scheme, every object is accessed via a logical pointer, and in the Write phase, we simply switch the logical pointer to point to the version of the object in the private workspace, instead of copying the object.

Clearly, it is not the case that optimistic concurrency control has no overheads; rather, the locking overheads of lock-based approaches are replaced with the overheads of recording read-lists and write-lists for transactions, checking for conflicts, and copying changes from the private workspace. Similarly, the implicit cost of blocking in a lock-based approach is replaced by the implicit cost of the work wasted by restarted transactions.

Improved Conflict Resolution1

Optimistic Concurrency Control using the three validation conditions described earlier is often overly conservative and unnecessarily aborts and restarts transactions. In particular, according to the validation conditions, Ti cannot write any object read by Tj. However, since the validation is aimed at ensuring that Ti logically executes before Tj, there is no harm if Ti writes all data items required by Tj before Tj reads them.

The problem arises because we have no way to tell when Ti wrote the object (relative to Tj's reading it) at the time we validate Tj, since all we have is the list of objects written by Ti and the list read by Tj. Such false conflicts can be alleviated by a finer-grain resolution of data conflicts, using mechanisms very similar to locking.

The basic idea is that each transaction in the Read phase tells the DBMS about items it is reading, and when a transaction Ti is committed (and its writes are accepted), the DBMS checks whether any of the items written by Ti are being read by any (yet to be validated) transaction Tj. If so, we know that Tj's validation must eventually fail. We can either allow Tj to discover this when it is validated (the die policy) or kill it and restart it immediately (the kill policy).

The details are as follows. Before reading a data item, a transaction T enters an access entry in a hash table. The access entry contains the transaction id, a data object id, and a modified flag (initially set to false), and entries are hashed on the data object id. A temporary exclusive lock is obtained on the hash bucket containing the entry, and the lock is held while the read data item is copied from the database buffer into the private workspace of the transaction.

1 We thank Alexander Thomasian for writing this section.

During validation of T, the hash buckets of all data objects accessed by T are again locked (in exclusive mode) to check if T has encountered any data conflicts. T has encountered a conflict if the modified flag is set to true in one of its access entries. (This assumes that the 'die' policy is being used; if the 'kill' policy is used, T is restarted when the flag is set to true.)

If T is successfully validated, we lock the hash bucket of each object modified by T, retrieve all access entries for this object, set the modified flag to true, and release the lock on the bucket. If the 'kill' policy is used, the transactions that entered these access entries are restarted. We then complete T's Write phase.

It seems that the 'kill' policy is always better than the 'die' policy, because it reduces the overall response time and wasted processing. However, executing T to the end has the advantage that all of the data items required for its execution are prefetched into the database buffer, and restarted executions of T will not require disk I/O for reads. This assumes that the database buffer is large enough that prefetched pages are not replaced, and, more important, that access invariance prevails; that is, successive executions of T require the same data for execution. When T is restarted, its execution time is much shorter than before because no disk I/O is required, and thus its chances of validation are higher. (Of course, if a transaction has already completed its Read phase once, subsequent conflicts should be handled using the 'kill' policy because all its data objects are already in the buffer pool.)
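
A minimal sketch of the access-entry hash table follows. The class name, the per-bucket locking with threading.Lock, and the method names are assumptions made for illustration; the structure mirrors the mechanism described above (entries hashed on object id, a modified flag per reader, and the 'die' check at validation).

import threading
from collections import defaultdict

class AccessTable:
    """Illustrative access-entry table for finer-grain conflict detection in OCC."""
    def __init__(self, n_buckets=64):
        self.locks = [threading.Lock() for _ in range(n_buckets)]
        self.entries = defaultdict(dict)          # obj_id -> {txn_id: modified flag}
        self.n = n_buckets

    def _lock_for(self, obj_id):
        return self.locks[hash(obj_id) % self.n]  # lock on the hash bucket of obj_id

    def record_read(self, txn_id, obj_id):
        # Entered before the transaction copies obj_id into its private workspace.
        with self._lock_for(obj_id):
            self.entries[obj_id][txn_id] = False

    def has_conflict(self, txn_id, read_set):
        # 'die' policy: at validation, T fails if any of its entries was flagged.
        for obj_id in read_set:
            with self._lock_for(obj_id):
                if self.entries[obj_id].get(txn_id, False):
                    return True
        return False

    def publish_writes(self, txn_id, write_set):
        # After T validates, flag every other reader of each object T modified
        # (under the 'kill' policy those readers would be restarted here).
        for obj_id in write_set:
            with self._lock_for(obj_id):
                for reader in self.entries[obj_id]:
                    if reader != txn_id:
                        self.entries[obj_id][reader] = True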

17.6.2 Timestamp-Based Concurrency Control

In lock-based concurrency control, conflicting actions of different transactions are ordered by the order in which locks are obtained, and the lock protocol extends this ordering on actions to transactions, thereby ensuring serializability. In optimistic concurrency control, a timestamp ordering is imposed on transactions, and validation checks that all conflicting actions occurred in the same order.

Timestamps can also be used in another way: Each transaction can be assigned a timestamp at startup, and we can ensure, at execution time, that if action ai of transaction Ti conflicts with action aj of transaction Tj, ai occurs before aj if TS(Ti) < TS(Tj). If an action violates this ordering, the transaction is aborted and restarted.


To implement this concurrency control scheme, every database object O is given a read timestamp RTS(O) and a write timestamp WTS(O). If transaction T wants to read object O, and TS(T) < WTS(O), the order of this read with respect to the most recent write on O would violate the timestamp order between this transaction and the writer. Therefore, T is aborted and restarted with a new, larger timestamp. If TS(T) > WTS(O), T reads O, and RTS(O) is set to the larger of RTS(O) and TS(T). (Note that a physical change, the change to RTS(O), is written to disk and recorded in the log for recovery purposes, even on reads. This write operation is a significant overhead.)

Observe that if T is restarted with the same timestamp, it is guaranteed to be aborted again, due to the same conflict. Contrast this behavior with the use of timestamps in 2PL for deadlock prevention, where transactions are restarted with the same timestamp as before to avoid repeated restarts. This shows that the two uses of timestamps are quite different and should not be confused.

Next, consider what happens when transaction T wants to write object O:

1. If TS(T) < RTS(O), the write action conflicts with the most recent read action of O, and T is therefore aborted and restarted.

2. If TS(T) < WTS(O), a naive approach would be to abort T because its write action conflicts with the most recent write of O and is out of timestamp order. However, we can safely ignore such writes and continue. Ignoring outdated writes is called the Thomas Write Rule.

3. Otherwise, T writes O and WTS(O) is set to TS(T).
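
The read and write rules above can be condensed into a small Python sketch. The class, its dictionaries for per-object values and RTS/WTS, and the Abort exception are illustrative names, not part of the original text.

class Abort(Exception):
    """Signals that the transaction must be restarted with a new, larger timestamp."""

class TimestampCC:
    """Minimal sketch of basic timestamp-based concurrency control
    with the Thomas Write Rule."""
    def __init__(self):
        self.val = {}    # object -> current value
        self.rts = {}    # object -> RTS(O)
        self.wts = {}    # object -> WTS(O)

    def read(self, t_ts, obj):
        if t_ts < self.wts.get(obj, 0):
            raise Abort("read arrives after a later write of O")   # restart T
        self.rts[obj] = max(self.rts.get(obj, 0), t_ts)
        return self.val.get(obj)

    def write(self, t_ts, obj, value):
        if t_ts < self.rts.get(obj, 0):
            raise Abort("write conflicts with a later read of O")  # restart T
        if t_ts < self.wts.get(obj, 0):
            return               # Thomas Write Rule: obsolete write, safely ignored
        self.val[obj] = value
        self.wts[obj] = t_ts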

The Thomas Write Rule

We now consider the justification for the Thomas Write Rule. If TS(T) < WTS(O), the current write action has, in effect, been made obsolete by the most recent write of O, which follows the current write according to the timestamp ordering. We can think of T's write action as if it had occurred immediately before the most recent write of O and was never read by anyone.

If the Thomas Write Rule is not used, that is, T is aborted in case (2), the timestamp protocol, like 2PL, allows only conflict serializable schedules. If the Thomas Write Rule is used, some schedules are permitted that are not conflict serializable, as illustrated by the schedule in Figure 17.6.2 Because T2's write follows T1's read and precedes T1's write of the same object, this schedule is not conflict serializable.

2In the other direction, 2PL permits some schedules that are not allowed by the timestamp algo­
rithm with the Thomas Write Rule; see Exercise 17.7.


T1                T2
R(A)
                  W(A)
                  Commit
W(A)
Commit

Figure 17.6   A Serializable Schedule That Is Not Conflict Serializable

The Thomas Write Rule relies on the observation that T2's write is never seen by any transaction and the schedule in Figure 17.6 is therefore equivalent to the serializable schedule obtained by deleting this write action, which is shown in Figure 17.7.

T1                T2
R(A)
                  Commit
W(A)
Commit

Figure 17.7   A Conflict Serializable Schedule

Recoverability

Unfortunately, the timestamp protocol just presented permits schedules that are not recoverable, as illustrated by the schedule in Figure 17.8. If TS(T1) = 1 and TS(T2) = 2, this schedule is permitted by the timestamp protocol (with or without the Thomas Write Rule). The timestamp protocol can be modified to disallow such schedules by buffering all write actions until the transaction commits. In the example, when T1 wants to write A, WTS(A) is updated to reflect this action, but the change to A is not carried out immediately; instead, it is recorded in a private workspace, or buffer. When T2 wants to read A subsequently, its timestamp is compared with WTS(A), and the read is seen to be permissible. However, T2 is blocked until T1 completes. If T1 commits, its change to A is copied from the buffer; otherwise, the changes in the buffer are discarded. T2 is then allowed to read A.

This blocking of T2 is similar to the effect of T1 obtaining an exclusive lock on A. Nonetheless, even with this modification, the timestamp protocol permits some schedules not permitted by 2PL; the two protocols are not quite the same. (See Exercise 17.7.)


T1                T2
W(A)
                  R(A)
                  W(B)
                  Commit

Figure 17.8   An Unrecoverable Schedule

Because recoverability is essential, such a modification must be used for the timestamp protocol to be practical. Given the added overhead this entails, on top of the (considerable) cost of maintaining read and write timestamps, timestamp concurrency control is unlikely to beat lock-based protocols in centralized systems. Indeed, it has been used mainly in the context of distributed database systems (Chapter 22).

17.6.3 Multiversion Concurrency Control

This protocol represents yet another way of using timestamps, assigned at startup time, to achieve serializability. The goal is to ensure that a transaction never has to wait to read a database object, and the idea is to maintain several versions of each database object, each with a write timestamp, and let transaction Ti read the most recent version whose timestamp precedes TS(Ti).

If transaction Ti wants to write an object, we must ensure that the object has not already been read by some other transaction Tj such that TS(Ti) < TS(Tj). If we allow Ti to write such an object, its change should be seen by Tj for serializability, but obviously Tj, which read the object at some time in the past, will not see Ti's change.

To check this condition, every object also has an associated read timestamp, and whenever a transaction reads the object, the read timestamp is set to the maximum of the current read timestamp and the reader's timestamp. If Ti wants to write an object O and TS(Ti) < RTS(O), Ti is aborted and restarted with a new, larger timestamp. Otherwise, Ti creates a new version of O and sets the read and write timestamps of the new version to TS(Ti).
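
The sketch below captures the scheme just described. For simplicity it keeps a single read timestamp per object (rather than per version) and stores versions as (WTS, value) pairs; these representation choices, the class name, and the reuse of an Abort exception are assumptions made for illustration.

class Abort(Exception):
    pass

class MultiversionCC:
    """Illustrative multiversion timestamp concurrency control."""
    def __init__(self):
        self.versions = {}   # obj -> list of (WTS, value), kept sorted by WTS
        self.rts = {}        # obj -> read timestamp RTS(O)

    def read(self, t_ts, obj):
        # Reads never block: return the newest version whose WTS precedes TS(Ti).
        visible = [v for w, v in self.versions.get(obj, []) if w <= t_ts]
        self.rts[obj] = max(self.rts.get(obj, 0), t_ts)
        return visible[-1] if visible else None

    def write(self, t_ts, obj, value):
        if t_ts < self.rts.get(obj, 0):
            raise Abort("a younger transaction already read O: restart Ti")
        vs = self.versions.setdefault(obj, [])
        vs.append((t_ts, value))           # new version with WTS = TS(Ti)
        vs.sort(key=lambda v: v[0])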

The drawbacks of this scheme are similar to those of timestamp concurrency control, and in addition, there is the cost of maintaining versions. On the other hand, reads are never blocked, which can be important for workloads dominated by transactions that only read values from the database.


What Do Real Systems Do? IBM DB2, Informix, Microsoft SQL Server, and Sybase ASE use Strict 2PL or variants (if a transaction requests a lower than SERIALIZABLE SQL isolation level; see Section 16.6). Microsoft SQL Server also supports modification timestamps so that a transaction can run without setting locks and validate itself (do-it-yourself Optimistic Concurrency Control!). Oracle 8 uses a multiversion concurrency control scheme in which readers never wait; in fact, readers never get locks and detect conflicts by checking if a block changed since they read it. All these systems support multiple-granularity locking, with support for table, page, and row level locks. All deal with deadlocks using waits-for graphs. Sybase ASIQ supports only table-level locks and aborts a transaction if a lock request fails; updates (and therefore conflicts) are rare in a data warehouse, and this simple scheme suffices.

17.7 REVIEW QUESTIONS

Answers to the review questions can be found in the listed sections.

■ When are two schedules conflict equivalent? What is a conflict serializable schedule? What is a strict schedule? (Section 17.1)

■ What is a precedence graph or serializability graph? How is it related to conflict serializability? How is it related to two-phase locking? (Section 17.1)

■ What does the lock manager do? Describe the lock table and transaction table data structures and their role in lock management. (Section 17.2)

■ Discuss the relative merits of lock upgrades and lock downgrades. (Section 17.3)

■ Describe and compare deadlock detection and deadlock prevention schemes. Why are detection schemes more commonly used? (Section 17.4)

■ If the collection of database objects is not fixed, but can grow and shrink through insertion and deletion of objects, we must deal with a subtle complication known as the phantom problem. Describe this problem and the index locking approach to solving the problem. (Section 17.5.1)

■ In tree index structures, locking higher levels of the tree can become a performance bottleneck. Explain why. Describe specialized locking techniques that address the problem, and explain why they work correctly despite not being two-phase. (Section 17.5.2)

■ Multiple-granularity locking enables us to set locks on objects that contain other objects, thus implicitly locking all contained objects. Why is this approach important and how does it work? (Section 17.5.3)

■ In optimistic concurrency control, no locks are set and transactions read and modify data objects in a private workspace. How are conflicts between transactions detected and resolved in this approach? (Section 17.6.1)

■ In timestamp-based concurrency control, transactions are assigned a timestamp at startup; how is it used to ensure serializability? How does the Thomas Write Rule improve concurrency? (Section 17.6.2)

■ Explain why timestamp-based concurrency control allows schedules that are not recoverable. Describe how it can be modified through buffering to disallow such schedules. (Section 17.6.2)

■ Describe multiversion concurrency control. What are its benefits and disadvantages in comparison to locking? (Section 17.6.3)

EXERCISES

E x ercise 17.1 Answer the following questions:

1. Describe how a typical lock manager is implemented. Why must lock and unlock be
atomic operations? What is the difference between a lock and a latch? What are convoys
and how should a lock manager handle them?
2. Compare lock downgrades with upgrades. Explain why downgrades violate 2PL but
are nonetheless acceptable. Discuss the use of update locks in conjunction with lock
downgrades.
3. Contrast the timestamps assigned to restarted transactions when timestamps are used
for deadlock prevention versus when timestamps are used for concurrency control.
4. State and justify the Thomas Write Rule.
5. Show that, if two schedules are conflict equivalent, then they are view equivalent.
6. Give an example of a serializable schedule that is not strict.
7. Give an example of a strict schedule that is not serializable.
8. Motivate and describe the use of locks for improved conflict resolution in Optimistic
Concurrency Control.

Exercise 17.2 Consider the following classes of schedules: serializable, conflict-serializable, view-serializable, recoverable, avoids-cascading-aborts, and strict. For each of the following schedules, state which of the preceding classes it belongs to. If you cannot decide whether a schedule belongs in a certain class based on the listed actions, explain briefly.

The actions are listed in the order they are scheduled and prefixed with the transaction name.
If a commit or abort is not shown, the schedule is incomplete; assume that abort or commit
must follow all the listed actions.

1. T1:R(X), T2:R(X), T1:W(X), T2:W(X)
2. T1:W(X), T2:R(Y), T1:R(Y), T2:R(X)
3. T1:R(X), T2:R(Y), T3:W(X), T2:R(X), T1:R(Y)
4. T1:R(X), T1:R(Y), T1:W(X), T2:R(Y), T3:W(Y), T1:W(X), T2:R(Y)
5. T1:R(X), T2:W(X), T1:W(X), T2:Abort, T1:Commit
6. T1:R(X), T2:W(X), T1:W(X), T2:Commit, T1:Commit
7. T1:W(X), T2:R(X), T1:W(X), T2:Abort, T1:Commit
8. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Commit
9. T1:W(X), T2:R(X), T1:W(X), T2:Commit, T1:Abort
10. T2:R(X), T3:W(X), T3:Commit, T1:W(Y), T1:Commit, T2:R(Y), T2:W(Z), T2:Commit
11. T1:R(X), T2:W(X), T2:Commit, T1:W(X), T1:Commit, T3:R(X), T3:Commit
12. T1:R(X), T2:W(X), T1:W(X), T3:R(X), T1:Commit, T2:Commit, T3:Commit

Exercise 17.3 Consider the following concurrency control protocols: 2PL, Strict 2PL, Conservative 2PL, Optimistic, Timestamp without the Thomas Write Rule, Timestamp with the Thomas Write Rule, and Multiversion. For each of the schedules in Exercise 17.2, state which of these protocols allows it, that is, allows the actions to occur in exactly the order shown.

For the timestamp-based protocols, assume that the timestamp for transaction Ti is i and
that a version of the protocol that ensures recoverability is used. Further, if the Thomas
Write Rule is used, show the equivalent serial schedule.

Exercise 17.4 Consider the following sequences of actions, listed in the order they are submitted to the DBMS:

■ Sequence S1: T1:R(X), T2:W(X), T2:W(Y), T3:W(Y), T1:W(Y), T1:Commit, T2:Commit, T3:Commit

■ Sequence S2: T1:R(X), T2:W(Y), T2:W(X), T3:W(Y), T1:W(Y), T1:Commit, T2:Commit, T3:Commit

For each sequence and for each of the following concurrency control mechanisms, describe
how the concurrency control mechanism handles the sequence.

Assume that the timestamp of transaction Ti is i. For lock-based concurrency control mech­
anisms, add lock and unlock requests to the previous sequence of actions as per the locking
protocol. The DBMS processes actions in the order shown. If a transaction is blocked, assume
that all its actions are queued until it is resumed; the DBMS continues with the next action
(according to the listed sequence) of an unblocked transaction.

1. Strict 2PL with timestamps used for deadlock prevention.
2. Strict 2PL with deadlock detection. (Show the waits-for graph in case of deadlock.)
3. Conservative (and Strict, i.e., with locks held until end-of-transaction) 2PL.
4. Optimistic concurrency control.
5. Timestamp concurrency control with buffering of reads and writes (to ensure recoverability) and the Thomas Write Rule.
6. Multiversion concurrency control.

Chapter 5

Recovery

This chapter contains the book chapter:

R. Ramakrishnan and J. Gehrke. Database Management Systems. Third Edition. Chapter 18, pp. 579-600 (22 of 1065). McGraw-Hill, 2003. ISBN: 978-0-07-246563-1

Atomicity and durability are challenging in important ways other than with concurrency. First, we must deal with the risk of partial execution of functions in our service, e.g., due to errors, leading to data corruption and compromising all-or-nothing atomicity. Second, if volatile memory is employed, even complete execution of a function may not guarantee durability of its effects. Third, performance optimizations, such as use of two-level memories, may complicate the determination of the current state of the system in the event of a failure. Fortunately, methodologies to achieve atomicity and durability in the face of these challenges have been developed in the context of database systems. Log-based recovery techniques allow us to survive both fail-stop and media failures. These techniques have been heavily studied, and we will focus on a family of recovery algorithms called ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). The ultimate goal of this portion of the material is to provide us with a clear conceptual framework to reflect about recovery strategies, and to predict how different recovery strategies behave in different scenarios.
The learning goals for this portion of the material are listed below.

• Explain the concepts of volatile, nonvolatile, and stable storage as well as the main assumptions underlying database recovery.

• Predict how force/no-force and steal/no-steal strategies for writes and buffer management influence the need for redo and undo.

• Explain the notion of logging and the concept of write-ahead logging.

• Predict what portions of the log and database are necessary for recovery under different failure scenarios.

• Explain how write-ahead logging is achieved in the ARIES protocol.

• Explain the functions of recovery metadata such as the transaction table and the dirty page table.

• Predict how recovery metadata is updated during normal operation.

• Interpret the contents of the log resulting from ARIES normal operation.

• Explain the three phases of ARIES crash recovery: analysis, redo, and undo.

• Predict how recovery metadata, system state, and the log are updated during recovery.
18
CRASH RECOVERY

■ What steps are taken in the ARIES method to recover from a DBMS crash?

■ How is the log maintained during normal operation?

■ How is the log used to recover from a crash?

■ What information in addition to the log is used during recovery?

■ What is a checkpoint and why is it used?

■ What happens if repeated crashes occur during recovery?

■ How is media failure handled?

■ How does the recovery algorithm interact with concurrency control?

Key concepts: steps in recovery, analysis, redo, undo; ARIES, repeating history; log, LSN, forcing pages, WAL; types of log records, update, commit, abort, end, compensation; transaction table, lastLSN; dirty page table, recLSN; checkpoint, fuzzy checkpointing, master log record; media recovery; interaction with concurrency control; shadow paging

Humpty Dumpty sat on a wall.
Humpty Dumpty had a great fall.
All the King's horses and all the King's men
Could not put Humpty together again.

Old nursery rhyme


The recovery manager of a DBMS is responsible for ensuring two important properties of transactions: atomicity and durability. It ensures atomicity by undoing the actions of transactions that do not commit and durability by making sure that all actions of committed transactions survive system crashes (e.g., a core dump caused by a bus error) and media failures (e.g., a disk is corrupted).

The recovery manager is one of the hardest components of a DBMS to design and implement. It must deal with a wide variety of database states because it is called on during system failures. In this chapter, we present the ARIES recovery algorithm, which is conceptually simple, works well with a wide range of concurrency control mechanisms, and is being used in an increasing number of database systems.

We begin with an introduction to ARIES in Section 18.1. We discuss the log, which is a central data structure in recovery, in Section 18.2, and other recovery-related data structures in Section 18.3. We complete our coverage of recovery-related activity during normal processing by presenting the Write-Ahead Logging protocol in Section 18.4, and checkpointing in Section 18.5.

We discuss recovery from a crash in Section 18.6. Aborting (or rolling back) a single transaction is a special case of Undo, discussed in Section 18.6.3. We discuss media failures in Section 18.7, and conclude in Section 18.8 with a discussion of the interaction of concurrency control and recovery and other approaches to recovery. In this chapter, we consider recovery only in a centralized DBMS; recovery in a distributed DBMS is discussed in Chapter 22.

18.1 INTRODUCTION TO ARIES

ARIES is a recovery algorithm designed to work with a steal, no-force approach. When the recovery manager is invoked after a crash, restart proceeds in three phases:

1. Analysis: Identifies dirty pages in the buffer pool (i.e., changes that have not been written to disk) and active transactions at the time of the crash.

2. Redo: Repeats all actions, starting from an appropriate point in the log, and restores the database state to what it was at the time of the crash.

3. Undo: Undoes the actions of transactions that did not commit, so that the database reflects only the actions of committed transactions.

LSN     LOG

10      update: T1 writes P5
20      update: T2 writes P3
30      T2 commit
40      T2 end
50      update: T3 writes P1
60      update: T3 writes P3
        CRASH, RESTART

Figure 18.1   Execution History with a Crash

Consider the simple execution history illustrated in Figure 18.1. When the system is restarted, the Analysis phase identifies T1 and T3 as transactions active at the time of the crash and therefore to be undone; T2 as a committed transaction, and all its actions therefore to be written to disk; and P1, P3, and P5 as potentially dirty pages. All the updates (including those of T1 and T3) are reapplied in the order shown during the Redo phase. Finally, the actions of T1 and T3 are undone in reverse order during the Undo phase; that is, T3's write of P3 is undone, T3's write of P1 is undone, and then T1's write of P5 is undone.

Three main principles lie behind the ARIES recovery algorithm:

■ Write-Ahead Logging: Any change to a database object is first recorded in the log; the record in the log must be written to stable storage before the change to the database object is written to disk.

■ Repeating History During Redo: On restart following a crash, ARIES retraces all actions of the DBMS before the crash and brings the system back to the exact state that it was in at the time of the crash. Then, it undoes the actions of transactions still active at the time of the crash (effectively aborting them).

■ Logging Changes During Undo: Changes made to the database while undoing a transaction are logged to ensure such an action is not repeated in the event of repeated (failures causing) restarts.

The second point distinguishes ARIES from other recovery algorithms and is the basis for much of its simplicity and flexibility. In particular, ARIES can support concurrency control protocols that involve locks of finer granularity than a page (e.g., record-level locks). The second and third points are also important in dealing with operations where redoing and undoing the operation are not exact inverses of each other. We discuss the interaction between concurrency control and crash recovery in Section 18.8, where we also discuss other approaches to recovery briefly.

Crash Recovery: IBM DB2, Informix, Microsoft SQL Server, Oracle 8, and Sybase ASE all use a WAL scheme for recovery. IBM DB2 uses ARIES, and the others use schemes that are actually quite similar to ARIES (e.g., all changes are re-applied, not just the changes made by transactions that are 'winners'), although there are several variations.

18.2 THE LOG

The log, sometimes called the trail or journal, is a history of actions executed by the DBMS. Physically, the log is a file of records stored in stable storage, which is assumed to survive crashes; this durability can be achieved by maintaining two or more copies of the log on different disks (perhaps in different locations), so that the chance of all copies of the log being simultaneously lost is negligibly small.

The most recent portion of the log, called the log tail, is kept in main memory and is periodically forced to stable storage. This way, log records and data records are written to disk at the same granularity (pages or sets of pages).

Every log record is given a unique id called the log sequence number (LSN). As with any record id, we can fetch a log record with one disk access given the LSN. Further, LSNs should be assigned in monotonically increasing order; this property is required for the ARIES recovery algorithm. If the log is a sequential file, in principle growing indefinitely, the LSN can simply be the address of the first byte of the log record.1

For recovery purposes, every page in the database contains the LSN of the most recent log record that describes a change to this page. This LSN is called the pageLSN.

A log record is written for each of the following actions:


1 In practice, various techniques are used to identify portions of the log that are 'too old' to be needed again to bound the amount of stable storage used for the log. Given such a bound, the log may be implemented as a 'circular' file, in which case the LSN may be the log record id plus a wrap-count.


■ Updating a Page: After modifying the page, an update type record (described later in this section) is appended to the log tail. The pageLSN of the page is then set to the LSN of the update log record. (The page must be pinned in the buffer pool while these actions are carried out.)

■ Commit: When a transaction decides to commit, it force-writes a commit type log record containing the transaction id. That is, the log record is appended to the log, and the log tail is written to stable storage, up to and including the commit record.2 The transaction is considered to have committed at the instant that its commit log record is written to stable storage. (Some additional steps must be taken, e.g., removing the transaction's entry in the transaction table; these follow the writing of the commit log record.)

■ Abort: When a transaction is aborted, an abort type log record containing the transaction id is appended to the log, and Undo is initiated for this transaction (Section 18.6.3).

■ End: As noted above, when a transaction is aborted or committed, some additional actions must be taken beyond writing the abort or commit log record. After all these additional steps are completed, an end type log record containing the transaction id is appended to the log.

■ Undoing an update: When a transaction is rolled back (because the transaction is aborted, or during recovery from a crash), its updates are undone. When the action described by an update log record is undone, a compensation log record, or CLR, is written.

Every log record has certain fields: prevLSN, transID, and type. The set of all log records for a given transaction is maintained as a linked list going back in time, using the prevLSN field; this list must be updated whenever a log record is added. The transID field is the id of the transaction generating the log record, and the type field obviously indicates the type of the log record.

Additional fields depend on the type of the log record. We already mentioned the additional contents of the various log record types, with the exception of the update and compensation log record types, which we describe next.

Update Log Records

The fields in an update log record are illustrated in Figure 18.2. The pageID field is the page id of the modified page; the length in bytes and the offset of the change are also included. The before-image is the value of the changed bytes before the change; the after-image is the value after the change. An update log record that contains both before- and after-images can be used to redo the change and undo it. In certain contexts, which we do not discuss further, we can recognize that the change will never be undone (or, perhaps, redone). A redo-only update log record contains just the after-image; similarly, an undo-only update record contains just the before-image.

prevLSN | transID | type | pageID | length | offset | before-image | after-image

(prevLSN, transID, and type are common to all log records; pageID, length, offset, before-image, and after-image are the additional fields for update log records.)

Figure 18.2   Contents of an Update Log Record

2 Note that this step requires the buffer manager to be able to selectively force pages to stable storage.

Compensation Log Records

A compensation log record (CLR) is written just before the change recorded in an update log record U is undone. (Such an undo can happen during normal system execution when a transaction is aborted or during recovery from a crash.) A compensation log record C describes the action taken to undo the actions recorded in the corresponding update log record and is appended to the log tail just like any other log record. The compensation log record C also contains a field called undoNextLSN, which is the LSN of the next log record that is to be undone for the transaction that wrote update record U; this field in C is set to the value of prevLSN in U.
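
A compact Python rendering of these record layouts follows; the field names mirror the text (prevLSN, transID, type, pageID, length, offset, before/after images, undoNextLSN), while the dataclass structure itself is only an illustration, not the on-disk format used by ARIES.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    lsn: int
    prev_lsn: Optional[int]        # None for the first record of a transaction
    trans_id: int
    type: str                      # 'update', 'commit', 'abort', 'end', or 'CLR'

@dataclass
class UpdateLogRecord(LogRecord):
    page_id: int
    length: int
    offset: int
    before_image: bytes            # omitted in a redo-only update record
    after_image: bytes             # omitted in an undo-only update record

@dataclass
class CompensationLogRecord(LogRecord):
    page_id: int
    length: int
    offset: int
    after_image: bytes             # redo information for the undo action
    undo_next_lsn: Optional[int]   # prevLSN of the update record being undone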

As an example, consider the fourth update log record shown in Figure 18.3. If this update is undone, a CLR would be written, and the information in it would include the transID, pageID, length, offset, and before-image fields from the update record. Notice that the CLR records the (undo) action of changing the affected bytes back to the before-image value; thus, this value and the location of the affected bytes constitute the redo information for the action described by the CLR. The undoNextLSN field is set to the LSN of the first log record in Figure 18.3.

Unlike an update log record, a CLR describes an action that will never be undone, that is, we never undo an undo action. The reason is simple: An update log record describes a change made by a transaction during normal execution and the transaction may subsequently be aborted, whereas a CLR describes an action taken to roll back a transaction for which the decision to abort has already been made. Therefore, the transaction must be rolled back, and the undo action described by the CLR is definitely required. This observation is very useful because it bounds the amount of space needed for the log during restart from a crash: The number of CLRs that can be written during Undo is no more than the number of update log records for active transactions at the time of the crash.

A CLR may be written to stable storage (following WAL, of course) but the undo action it describes may not yet have been written to disk when the system crashes again. In this case, the undo action described in the CLR is reapplied during the Redo phase, just like the action described in update log records.

For these reasons, a CLR contains the information needed to reapply, or redo, the change described but not to reverse it.

18.3 OTHER RECOVERY-RELATED STRUCTURES

In addition to the log, the following two tables contain important recovery-related information:

■ Transaction Table: This table contains one entry for each active transaction. The entry contains (among other things) the transaction id, the status, and a field called lastLSN, which is the LSN of the most recent log record for this transaction. The status of a transaction can be that it is in progress, committed, or aborted. (In the latter two cases, the transaction will be removed from the table once certain 'clean up' steps are completed.)

■ Dirty Page Table: This table contains one entry for each dirty page in the buffer pool, that is, each page with changes not yet reflected on disk. The entry contains a field recLSN, which is the LSN of the first log record that caused the page to become dirty. Note that this LSN identifies the earliest log record that might have to be redone for this page during restart from a crash.

During normal operation, these are maintained by the transaction manager and
the buffer manager, respectively, and during restart after a crash, these tables
are reconstructed in the Analysis phase of restart.
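
A minimal in-memory rendering of the two tables and of the bookkeeping done as log records are generated is sketched below; the dictionary layout and the helper name on_log_record are assumptions made for illustration.

transaction_table = {}   # transID -> {'lastLSN': int, 'status': 'in progress'/'committed'/'aborted'}
dirty_page_table  = {}   # pageID  -> recLSN of the first log record that dirtied the page

def on_log_record(lsn, trans_id, page_id=None):
    """Bookkeeping performed during normal operation when a record is appended to the log."""
    entry = transaction_table.setdefault(trans_id,
                                         {'lastLSN': None, 'status': 'in progress'})
    entry['lastLSN'] = lsn
    if page_id is not None:
        # recLSN is set only the first time the page becomes dirty.
        dirty_page_table.setdefault(page_id, lsn)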

Consider the following simple example. Transaction T1000 changes the value of bytes 21 to 23 on page P500 from 'ABC' to 'DEF', transaction T2000 changes 'HIJ' to 'KLM' on page P600, transaction T2000 changes bytes 20 through 22 from 'GDE' to 'QRS' on page P500, then transaction T1000 changes 'TUV' to 'WXY' on page P505. The dirty page table, the transaction table,3 and the log at this instant are shown in Figure 18.3. Observe that the log is shown growing from top to bottom; older records are at the top. Although the records for each transaction are linked using the prevLSN field, the log as a whole also has a sequential order that is important; for example, T2000's change to page P500 follows T1000's change to page P500, and in the event of a crash, these changes must be redone in the same order.

Figure 18.3   Instance of Log and Transaction Table (the figure shows the four update log records, the dirty page table with its pageID and recLSN columns, and the transaction table; the figure body is not reproduced in this scan)

3 The status field is not shown in the figure for space reasons; all transactions are in progress.

18.4 THE WRITE-AHEAD LOG PROTOCOL

Before writing a page to disk, every update log record that describes a change
to this page must be forced to stable storage. This is accomplished by forcing
all log records up to and including the one with LSN equal to the pageLSN to
stable storage before writing the page to disk.
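
The rule can be enforced in the buffer manager as sketched below. The callbacks flush_log and write_page, the page attribute page_lsn, and the flushed_lsn argument are assumptions standing in for the log and disk managers; they are not ARIES API names.

def write_page_to_disk(page, flush_log, write_page, flushed_lsn):
    """Sketch of WAL enforcement before a data page leaves the buffer pool."""
    if page.page_lsn > flushed_lsn:
        # Force the log tail, up to and including the record with LSN == pageLSN.
        flushed_lsn = flush_log(up_to=page.page_lsn)
    write_page(page)      # only now may the data page itself be written to disk
    return flushed_lsn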

The importance of the WAL protocol cannot be overemphasized. WAL is the fundamental rule that ensures that a record of every change to the database is available while attempting to recover from a crash. If a transaction made a change and committed, the no-force approach means that some of these changes may not have been written to disk at the time of a subsequent crash. Without a record of these changes, there would be no way to ensure that the changes of a committed transaction survive crashes. Note that the definition of a committed transaction is effectively 'a transaction all of whose log records, including a commit record, have been written to stable storage'.

When a transaction is committed, the log tail is forced to stable storage, even if a no-force approach is being used. It is worth contrasting this operation with the actions taken under a force approach: If a force approach is used, all the pages modified by the transaction, rather than a portion of the log that includes all its records, must be forced to disk when the transaction commits. The set of all changed pages is typically much larger than the log tail because the size of an update log record is close to (twice) the size of the changed bytes, which is likely to be much smaller than the page size. Further, the log is maintained as a sequential file, and all writes to the log are sequential writes. Consequently, the cost of forcing the log tail is much smaller than the cost of writing all changed pages to disk.

18.5 CHECKPOINTING

A checkpoint is like a snapshot of the DBMS state, and by taking checkpoints periodically, as we will see, the DBMS can reduce the amount of work to be done during restart in the event of a subsequent crash.

Checkpointing in ARIES has three steps. First, a begin_checkpoint record is written to indicate when the checkpoint starts. Second, an end_checkpoint record is constructed, including in it the current contents of the transaction table and the dirty page table, and appended to the log. The third step is carried out after the end_checkpoint record is written to stable storage: A special master record containing the LSN of the begin_checkpoint log record is written to a known place on stable storage. While the end_checkpoint record is being constructed, the DBMS continues executing transactions and writing other log records; the only guarantee we have is that the transaction table and dirty page table are accurate as of the time of the begin_checkpoint record.
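
The three steps can be summarized in a short sketch; the log and stable_store objects and their methods (append, flush, write_master_record) are assumed interfaces, not part of the text.

def take_fuzzy_checkpoint(log, stable_store, txn_table, dirty_page_table):
    """Sketch of the three ARIES checkpointing steps."""
    # 1. Mark the start of the checkpoint.
    begin_lsn = log.append({'type': 'begin_checkpoint'})
    # 2. Snapshot the current tables into an end_checkpoint record and append it.
    log.append({'type': 'end_checkpoint',
                'txn_table': dict(txn_table),
                'dirty_page_table': dict(dirty_page_table)})
    log.flush()                                  # end_checkpoint reaches stable storage
    # 3. Record the begin_checkpoint LSN in the well-known master record.
    stable_store.write_master_record(begin_lsn)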

This kind of checkpoint, called a fuzzy checkpoint, is inexpensive because it does not require quiescing the system or writing out pages in the buffer pool (unlike some other forms of checkpointing). On the other hand, the effectiveness of this checkpointing technique is limited by the earliest recLSN of pages in the dirty page table, because during restart we must redo changes starting from the log record whose LSN is equal to this recLSN. Having a background process that periodically writes dirty pages to disk helps to limit this problem.

When the system comes back up after a crash, the restart process begins by
locating the most recent checkpoint record. For uniformity, the system always
begins normal execution by taking a checkpoint, in which the transaction table
and dirty page table are both empty.

18.6 RECOVERING FROM A SYSTEM CRASH

When the system is restarted after a crash, the recovery manager proceeds in three phases, as shown in Figure 18.4.

A: oldest log record of transactions active at the crash (Undo works back to this point)
B: smallest recLSN in the dirty page table at the end of Analysis (starting point for Redo)
C: most recent checkpoint (starting point for Analysis)
CRASH (end of log)

Figure 18.4   Three Phases of Restart in ARIES

The Analysis phase begins by examining the most recent begin_checkpoint


record, whose LSN is denoted C in Figure 18.4, and proceeds forward in the
log until the last log record. The Redo phase follows Analysis and redoes all
changes to any page that might have been dirty at the time of the crash; this set
of pages and the starting point for Redo (the smallest recLSN of any dirty page)
are determined during Analysis. The Undo phase follows Redo and undoes the
changes of all transactions active at the time of the crash; again, this set of
transactions is identified during the Analysis phase. Note that Redo reapplies
changes in the order in which they were originally carried out; Undo reverses
changes in the opposite order, reversing the most recent change first.

Observe that the relative order of the three points A, B, and C in the log may
differ from that shown in Figure 18.4. The three phases of restart are described
in more detail in the following sections.

18.6.1 Analysis Phase

The Analysis phase performs three tasks:

1. It determines the point in the log at which to start the Redo pass.
2. It determines (a conservative superset of the) pages in the buffer pool that
were dirty at the time of the crash.
3. It identifies transactions that were active at the time of the crash and must
be undone.

Analysis begins by examining the most recent begin_checkpoint log record and
initializing the dirty page table and transaction table to the copies of those
structures in the next end-checkpoint record. Thus, these tables are initialized
to the set of dirty pages and active transactions at the time of the checkpoint.


(If additional log records are between the begin_checkpoint and end_checkpoint records, the tables must be adjusted to reflect the information in these records, but we omit the details of this step. See Exercise 18.9.) Analysis then scans the log in the forward direction until it reaches the end of the log, applying the rules below (a sketch of the scan follows the list):

■ If an end log record for a transaction T is encountered, T is removed from the transaction table because it is no longer active.

■ If a log record other than an end record for a transaction T is encountered, an entry for T is added to the transaction table if it is not already there. Further, the entry for T is modified:

  1. The lastLSN field is set to the LSN of this log record.

  2. If the log record is a commit record, the status is set to C, otherwise it is set to U (indicating that it is to be undone).

■ If a redoable log record affecting page P is encountered, and P is not in the dirty page table, an entry is inserted into this table with page id P and recLSN equal to the LSN of this redoable log record. This LSN identifies the oldest change affecting page P that may not have been written to disk.
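
The forward scan can be sketched as follows; the log.scan_forward iterator and the dictionary form of the log records are assumed interfaces introduced only for illustration.

def analysis(log, begin_checkpoint_lsn):
    """Sketch of the ARIES Analysis pass, scanning forward from the most
    recent begin_checkpoint record to the end of the log."""
    txn_table, dirty_page_table = {}, {}   # would be seeded from the end_checkpoint record
    for rec in log.scan_forward(begin_checkpoint_lsn):
        t = rec['transID']
        if rec['type'] == 'end':
            txn_table.pop(t, None)         # T is no longer active
            continue
        entry = txn_table.setdefault(t, {'lastLSN': None, 'status': 'U'})
        entry['lastLSN'] = rec['lsn']
        if rec['type'] == 'commit':
            entry['status'] = 'C'
        if rec['type'] in ('update', 'CLR') and rec['pageID'] not in dirty_page_table:
            dirty_page_table[rec['pageID']] = rec['lsn']   # recLSN of page P
    return txn_table, dirty_page_table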

At the end of the Analysis phase, the transaction table contains an accurate list of all transactions that were active at the time of the crash; this is the set of transactions with status U. The dirty page table includes all pages that were dirty at the time of the crash but may also contain some pages that were written to disk. If an end_write log record were written at the completion of each write operation, the dirty page table constructed during Analysis could be made more accurate, but in ARIES, the additional cost of writing end_write log records is not considered to be worth the gain.

As an example, consider the execution illustrated in Figure 18.3. Let us extend this execution by assuming that T2000 commits, then T1000 modifies another page, say, P700, and appends an update record to the log tail, and then the system crashes (before this update log record is written to stable storage).

The dirty page table and the transaction table, held in memory, are lost in the crash. The most recent checkpoint was taken at the beginning of the execution, with an empty transaction table and dirty page table; it is not shown in Figure 18.3. After examining this log record, which we assume is just before the first log record shown in the figure, Analysis initializes the two tables to be empty. Scanning forward in the log, T1000 is added to the transaction table; in addition, P500 is added to the dirty page table with recLSN equal to the LSN of the first shown log record. Similarly, T2000 is added to the transaction table and P600 is added to the dirty page table. There is no change based on the third log record, and the fourth record results in the addition of P505 to the dirty page table. The commit record for T2000 (not in the figure) is now encountered, and T2000 is removed from the transaction table.
The Analysis phase is now complete, and it is recognized that the only active transaction at the time of the crash is T1000, with lastLSN equal to the LSN of the fourth record in Figure 18.3. The dirty page table reconstructed in the Analysis phase is identical to that shown in the figure. The update log record for the change to P700 is lost in the crash and not seen during the Analysis pass. Thanks to the WAL protocol, however, all is well; the corresponding change to page P700 cannot have been written to disk either!

Some of the updates may have been written to disk; for concreteness, let us assume that the change to P600 (and only this update) was written to disk before the crash. Therefore P600 is not dirty, yet it is included in the dirty page table. The pageLSN on page P600, however, reflects the write because it is now equal to the LSN of the second update log record shown in Figure 18.3.

18.6.2 Redo Phase

During the Redo phase, ARIES reapplies the updates of all transactions, committed or otherwise. Further, if a transaction was aborted before the crash and its updates were undone, as indicated by CLRs, the actions described in the CLRs are also reapplied. This repeating history paradigm distinguishes ARIES from other proposed WAL-based recovery algorithms and causes the database to be brought to the same state it was in at the time of the crash.

The Redo phase begins with the log record that has the smallest recLSN of all pages in the dirty page table constructed by the Analysis pass because this log record identifies the oldest update that may not have been written to disk prior to the crash. Starting from this log record, Redo scans forward until the end of the log. For each redoable log record (update or CLR) encountered, Redo checks whether the logged action must be redone. The action must be redone unless one of the following conditions holds:

■ The affected page is not in the dirty page table.

■ The affected page is in the dirty page table, but the recLSN for the entry is greater than the LSN of the log record being checked.

■ The pageLSN (stored on the page, which must be retrieved to check this condition) is greater than or equal to the LSN of the log record being checked.

The first condition obviously means that all changes to this page have been written to disk. Because the recLSN is the first update to this page that may not have been written to disk, the second condition means that the update being checked was indeed propagated to disk. The third condition, which is checked last because it requires us to retrieve the page, also ensures that the update being checked was written to disk, because either this update or a later update to the page was written. (Recall our assumption that a write to a page is atomic; this assumption is important here!)

If the logged action must be redone:

1. The logged action is reapplied.

2. The pageLSN on the page is set to the LSN of the redone log record. No additional log record is written at this time.
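
Putting the three skip conditions and the reapply step together gives the following sketch of the Redo loop; the log scan, the fetch_page callback, and the page methods are assumed interfaces.

def redo(log, dirty_page_table, fetch_page):
    """Sketch of the ARIES Redo pass."""
    if not dirty_page_table:
        return
    for rec in log.scan_forward(min(dirty_page_table.values())):   # smallest recLSN
        if rec['type'] not in ('update', 'CLR'):
            continue                                    # only redoable records matter
        pid = rec['pageID']
        if pid not in dirty_page_table:
            continue                                    # condition 1: change already on disk
        if dirty_page_table[pid] > rec['lsn']:
            continue                                    # condition 2: recLSN is newer
        page = fetch_page(pid)
        if page.page_lsn >= rec['lsn']:
            continue                                    # condition 3: page already reflects it
        page.apply(rec)                                 # reapply the logged action
        page.page_lsn = rec['lsn']                      # no new log record is written here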

Let us continue with the example discussed in Section 18.6.1. From the dirty page table, the smallest recLSN is seen to be the LSN of the first log record shown in Figure 18.3. Clearly, the changes recorded by earlier log records (there happen to be none in this example) have been written to disk. Now, Redo fetches the affected page, P500, and compares the LSN of this log record with the pageLSN on the page and, because we assumed that this page was not written to disk before the crash, finds that the pageLSN is less. The update is therefore reapplied; bytes 21 through 23 are changed to 'DEF', and the pageLSN is set to the LSN of this update log record.

Redo then examines the second log record. Again, the affected page, P600, is fetched and the pageLSN is compared to the LSN of the update log record. In this case, because we assumed that P600 was written to disk before the crash, they are equal, and the update does not have to be redone.

The remaining log records are processed similarly, bringing the system back to the exact state it was in at the time of the crash. Note that the first two conditions indicating that a redo is unnecessary never hold in this example. Intuitively, they come into play when the dirty page table contains a very old recLSN, going back to before the most recent checkpoint. In this case, as Redo scans forward from the log record with this LSN, it encounters log records for pages that were written to disk prior to the checkpoint and therefore not in the dirty page table in the checkpoint. Some of these pages may be dirtied again after the checkpoint; nonetheless, the updates to these pages prior to the checkpoint need not be redone. Although the third condition alone is sufficient to recognize that these updates need not be redone, it requires us to fetch the affected page. The first two conditions allow us to recognize this situation without fetching the page. (The reader is encouraged to construct examples that illustrate the use of each of these conditions; see Exercise 18.8.)


At the end of the Redo phase, end type records are written for all transactions with status C, which are removed from the transaction table.

18.6.3 Undo Phase

The Undo phase, unlike the other two phases, scans backward from the end of the log. The goal of this phase is to undo the actions of all transactions active at the time of the crash, that is, to effectively abort them. This set of transactions is identified in the transaction table constructed by the Analysis phase.

The Undo Algorithm

Undo begins with the transaction table constructed by the Analysis phase, which identifies all transactions active at the time of the crash, and includes the LSN of the most recent log record (the lastLSN field) for each such transaction. Such transactions are called loser transactions. All actions of losers must be undone, and further, these actions must be undone in the reverse of the order in which they appear in the log.

Consider the set of lastLSN values for all loser transactions. Let us call this set ToUndo. Undo repeatedly chooses the largest (i.e., most recent) LSN value in this set and processes it, until ToUndo is empty. To process a log record:

1. If it is a CLR and the undoNextLSN value is not null, the undoNextLSN value is added to the set ToUndo; if the undoNextLSN is null, an end record is written for the transaction because it is completely undone, and the CLR is discarded.

2. If it is an update record, a CLR is written and the corresponding action is undone, as described in Section 18.2, and the prevLSN value in the update log record is added to the set ToUndo.

When the set ToUndo is empty, the Undo phase is complete. Restart is now complete, and the system can proceed with normal operations.
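
The loop over ToUndo can be sketched as follows; the log.fetch and log.append helpers, the dictionary log-record layout, and the undo_update callback (which restores the before-image) are assumed interfaces introduced only for illustration.

def undo(log, loser_last_lsns, undo_update):
    """Sketch of the ARIES Undo pass over the loser transactions."""
    to_undo = set(loser_last_lsns)                 # lastLSN of each loser transaction
    while to_undo:
        lsn = max(to_undo)                         # always process the most recent LSN
        to_undo.remove(lsn)
        rec = log.fetch(lsn)
        if rec['type'] == 'CLR':
            if rec['undoNextLSN'] is not None:
                to_undo.add(rec['undoNextLSN'])
            else:                                  # transaction completely undone
                log.append({'type': 'end', 'transID': rec['transID']})
        elif rec['type'] == 'update':
            log.append({'type': 'CLR', 'transID': rec['transID'],
                        'undoNextLSN': rec['prevLSN']})
            undo_update(rec)                       # restore the before-image
            if rec['prevLSN'] is not None:
                to_undo.add(rec['prevLSN'])
            else:                                  # no earlier record: transaction is done
                log.append({'type': 'end', 'transID': rec['transID']})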

Let us continue with the scenario discussed in Sections 18.6.1 and 18.6.2. The only active transaction at the time of the crash was determined to be T1000. From the transaction table, we get the LSN of its most recent log record, which is the fourth update log record in Figure 18.3. The update is undone, and a CLR is written with undoNextLSN equal to the LSN of the first log record in the figure. The next record to be undone for transaction T1000 is the first log record in the figure. After this is undone, a CLR and an end log record for T1000 are written, and the Undo phase is complete.


In this example, undoing the action recorded in the first log record causes the action of the third log record, which is due to a committed transaction, to be overwritten and thereby lost! This situation arises because T2000 overwrote a data item written by T1000 while T1000 was still active; if Strict 2PL were followed, T2000 would not have been allowed to overwrite this data item.

Aborting a Transaction

Aborting a transaction is just a special case of the Undo phase of Restart in which a single transaction, rather than a set of transactions, is undone. The example in Figure 18.5, discussed next, illustrates this point.

Crashes during Restart

It is important to understand how the Undo algorithm presented in Section 18.6.3 handles repeated system crashes. Because the details of precisely how the action described in an update log record is undone are straightforward, we discuss Undo in the presence of system crashes using an execution history, shown in Figure 18.5, that abstracts away unnecessary detail. This example illustrates how aborting a transaction is a special case of Undo and how the use of CLRs ensures that the Undo action for an update log record is not applied twice.
LSN     LOG
00, 05  begin_checkpoint, end_checkpoint
10      update: T1 writes P5
20      update: T2 writes P3
30      T1 abort
40, 45  CLR: Undo T1 LSN 10, T1 end
50      update: T3 writes P1
60      update: T2 writes P5
        CRASH, RESTART
70      CLR: Undo T2 LSN 60
80, 85  CLR: Undo T3 LSN 50, T3 end
        CRASH, RESTART
90, 95  CLR: Undo T2 LSN 20, T2 end

Figure 18.5  Example of Undo with Repeated Crashes



The log shows the order in which the DBMS executed various actions; note that
the LSNs are in ascending order, and that each log record for a transaction has
a prevLSN field that points to the previous log record for that transaction. We
have not shown null prevLSNs, that is, some special value used in the prevLSN
field of the first log record for a transaction to indicate that there is no previous
log record. We also compacted the figure by occasionally displaying two log
records (separated by a comma) on a single line.

Log record (with LSN) 30 indicates that T1 aborts. All actions of this transaction
should be undone in reverse order, and the only action of T1, described
by the update log record 10, is indeed undone as indicated by CLR 40.

After the first crash, Analysis identifies P1 (with recLSN 50), P3 (with recLSN
20), and P5 (with recLSN 10) as dirty pages. Log record 45 shows that T1 is a
completed transaction; hence, the transaction table identifies T2 (with lastLSN
60) and T3 (with lastLSN 50) as active at the time of the crash. The Redo
phase begins with log record 10, which is the minimum recLSN in the dirty
page table, and reapplies all actions (for the update and CLR records), as per
the Redo algorithm presented in Section 18.6.2.

The ToUndo set consists of LSNs 60, for T2, and 50, for T3. The Undo phase
now begins by processing the log record with LSN 60 because 60 is the largest
LSN in the ToUndo set. The update is undone, and a CLR (with LSN 70)
is written to the log. This CLR has undoNextLSN equal to 20, which is the
prevLSN value in log record 60; 20 is the next action to be undone for T2. Now
the largest remaining LSN in the ToUndo set is 50. The write corresponding
to log record 50 is now undone, and a CLR describing the change is written.
This CLR has LSN 80, and its undoNextLSN field is null because 50 is the
only log record for transaction T3. Therefore T3 is completely undone, and an
end record is written. Log records 70, 80, and 85 are written to stable storage
before the system crashes a second time; however, the changes described by
these records may not have been written to disk.

When the system is restarted after the second crash, Analysis determines that
the only active transaction at the time of the crash was T2; in addition, the dirty
page table is identical to what it was during the previous restart. Log records
10 through 85 are processed again during Redo. (If some of the changes made
during the previous Redo were written to disk, the pageLSNs on the affected
pages are used to detect this situation and avoid writing these pages again.)
The Undo phase considers the only LSN in the ToUndo set, 70, and processes it
by adding the undoNextLSN value (20) to the ToUndo set. Next, log record 20
is processed by undoing T2's write of page P3, and a CLR is written (LSN 90).
Because 20 is the first of T2's log records, and therefore the last of its records
to be undone, the undoNextLSN field in this CLR is null, an end record is
written for T2, and the ToUndo set is now empty.

Recovery is now complete, and normal execution can resume with the writing
of a checkpoint record.

This example illustrated repeated crashes during the Undo phase. For
completeness, let us consider what happens if the system crashes while Restart is
in the Analysis or Redo phase. If a crash occurs during the Analysis phase, all
the work done in this phase is lost, and on restart the Analysis phase starts
afresh with the same information as before. If a crash occurs during the Redo
phase, the only effect that survives the crash is that some of the changes made
during Redo may have been written to disk prior to the crash. Restart starts
again with the Analysis phase and then the Redo phase, and some update log
records that were redone the first time around will not be redone a second time
because the pageLSN is now equal to the update record's LSN (although the
pages have to be fetched again to detect this).
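
A minimal sketch of this pageLSN test in Python is shown below; fetch_page and
apply_update are hypothetical helpers, and the dirty-page-table conditions that
avoid fetching the page at all are omitted.

# Hypothetical sketch: deciding whether a logged update must be reapplied during Redo.
def maybe_redo(update_rec, buffer_pool):
    page = fetch_page(buffer_pool, update_rec.page_id)
    if page.page_lsn >= update_rec.lsn:
        # The page already reflects this update (for example, it was flushed
        # after an earlier Redo pass), so the action is not reapplied.
        return
    apply_update(page, update_rec)      # reapply the logged change
    page.page_lsn = update_rec.lsn      # the page now reflects this log record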

We can take checkpoints during Restart to minimize repeated work in the event
of a crash, but we do not discuss this point.

18.7 MEDIA RECOVERY

Media recovery is based on periodically making a copy of the database. Because
copying a large database object such as a file can take a long time, and
the DBMS must be allowed to continue with its operations in the meantime,
creating a copy is handled in a manner similar to taking a fuzzy checkpoint.

When a database object such as a file or a page is corrupted, the copy of that
object is brought up-to-date by using the log to identify and reapply the changes
of committed transactions and undo the changes of uncommitted transactions
(as of the time of the media recovery operation).

The begin_checkpoint LSN of the most recent complete checkpoint is recorded
along with the copy of the database object to minimize the work in reapplying
changes of committed transactions. Let us compare the smallest recLSN of
a dirty page in the corresponding end_checkpoint record with the LSN of the
begin_checkpoint record and call the smaller of these two LSNs I. We observe
that the actions recorded in all log records with LSNs less than I must be
reflected in the copy. Thus, only log records with LSNs greater than I need be
reapplied to the copy.
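
A small sketch of this cutoff computation is given below; the field names
(rec_lsn, lsn) are assumptions made for illustration.

# Hypothetical sketch: selecting the log records that must be reapplied to a
# database copy during media recovery.
def records_to_reapply(begin_checkpoint_lsn, dirty_pages_in_end_checkpoint, log):
    rec_lsns = [p.rec_lsn for p in dirty_pages_in_end_checkpoint]
    # I is the smaller of the begin_checkpoint LSN and the smallest recLSN.
    i = min([begin_checkpoint_lsn] + rec_lsns)
    # Records with LSNs less than I are already reflected in the copy.
    return [rec for rec in log if rec.lsn > i]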


Finally, the updates of transactions that are incomplete at the time of media
recovery or that were aborted after the fuzzy copy was completed need to be
undone to ensure that the page reflects only the actions of committed transactions.
The set of such transactions can be identified as in the Analysis pass,
and we omit the details.

18.8 OTHER APPROACHES AND INTERACTION WITH CONCURRENCY CONTROL

Like ARIES, the most popular alternative recovery algorithms also maintain a
log of database actions according to the WAL protocol. A major distinction
between ARIES and these variants is that the Redo phase in ARIES repeats
history, that is, redoes the actions of all transactions, not just the non-losers.
Other algorithms redo only the non-losers, and the Redo phase follows the
Undo phase, in which the actions of losers are rolled back.

Thanks to the repeating history paradigm and the use of CLRs, ARIES supports
fine-granularity locks (record-level locks) and logging of logical operations
rather than just byte-level modifications. For example, consider a transaction
T that inserts a data entry 15* into a B+ tree index. Between the time this
insert is done and the time that T is eventually aborted, other transactions may
also insert and delete entries from the tree. If record-level locks are set rather
than page-level locks, the entry 15* may be on a different physical page when
T aborts from the one that T inserted it into. In this case, the undo operation
for the insert of 15* must be recorded in logical terms because the physical
(byte-level) actions involved in undoing this operation are not the inverse of
the physical actions involved in inserting the entry.

Logging logical operations yields considerably higher concurrency, although the
use of fine-granularity locks can lead to increased locking activity (because more
locks must be set). Hence, there is a trade-off between different WAL-based
recovery schemes. We chose to cover ARIES because it has several attractive
properties, in particular, its simplicity and its ability to support fine-granularity
locks and logging of logical operations.

One of the earliest recovery algorithms, used in the System R prototype at
IBM, takes a very different approach. There is no logging and, of course,
no WAL protocol. Instead, the database is treated as a collection of pages
and accessed through a page table, which maps page ids to disk addresses.
When a transaction makes changes to a data page, it actually makes a copy
of the page, called the shadow of the page, and changes the shadow page.
The transaction copies the appropriate part of the page table and changes the
entry for the changed page to point to the shadow, so that it can see the
changes; however, other transactions continue to see the original page table,
and therefore the original page, until this transaction commits. Aborting a
transaction is simple: Just discard its shadow versions of the page table and
the data pages. Committing a transaction involves making its version of the
page table public and discarding the original data pages that are superseded
by shadow pages.
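
The following is a minimal sketch of the shadow-paging idea in Python; the class
and helper names are invented for illustration, and the sketch ignores concurrency
and disk-space management entirely.

# Hypothetical sketch of shadow paging: a transaction updates copies ("shadows")
# of pages and of the page-table entries that point to them.
class ShadowTransaction:
    def __init__(self, db):
        self.db = db
        self.shadow_table = dict(db.page_table)     # private copy of the page table

    def write(self, page_id, new_contents):
        shadow_addr = self.db.allocate_disk_page()  # write to a fresh shadow page
        self.db.disk[shadow_addr] = new_contents
        self.shadow_table[page_id] = shadow_addr    # only this transaction sees it

    def commit(self):
        # Make the shadow version of the page table public; the superseded
        # original pages can then be discarded.
        self.db.page_table = self.shadow_table

    def abort(self):
        # Simply discard the shadow versions of the page table and data pages.
        self.shadow_table = dict(self.db.page_table)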

This scheme suffers from a number of problems. First, data becomes highly
fragmented due to the replacement of pages by shadow versions, which may be
located far from the original page. This phenomenon reduces data clustering
and makes good garbage collection imperative. Second, the scheme does not
yield a sufficiently high degree of concurrency. Third, there is a substantial
storage overhead due to the use of shadow pages. Fourth, the process aborting
a transaction can itself run into deadlocks, and this situation must be specially
handled because the semantics of aborting an abort transaction gets murky.

For these reasons, even in System R, shadow paging was eventually superseded
by WAL-based recovery techniques.

18.9 REVIEW QUESTIONS

Answers to the review questions can be found in the listed sections.

■ What are the advantages of the ARIES recovery algorithm? (Section 18.1)

■ Describe the three steps in crash recovery in ARIES. What is the goal of
the Analysis phase? The Redo phase? The Undo phase? (Section 18.1)

■ What is the LSN of a log record? (Section 18.2)

■ What are the different types of log records and when are they written?
(Section 18.2)

■ What information is maintained in the transaction table and the dirty page
table? (Section 18.3)

■ What is Write-Ahead Logging? What is forced to disk at the time a
transaction commits? (Section 18.4)

■ What is a fuzzy checkpoint? Why is it useful? What is a master log record?
(Section 18.5)

■ In which direction does the Analysis phase of recovery scan the log? At
which point in the log does it begin and end the scan? (Section 18.6.1)

■ Describe what information is gathered in the Analysis phase and how.
(Section 18.6.1)

■ In which direction does the Redo phase of recovery process the log? At
which point in the log does it begin and end? (Section 18.6.2)

■ What is a redoable log record? Under what conditions is the logged action
redone? What steps are carried out when a logged action is redone?
(Section 18.6.2)

■ In which direction does the Undo phase of recovery process the log? At
which point in the log does it begin and end? (Section 18.6.3)

■ What are loser transactions? How are they processed in the Undo phase
and in what order? (Section 18.6.3)

■ Explain what happens if there are crashes during the Undo phase of
recovery. What is the role of CLRs? What if there are crashes during the
Analysis and Redo phases? (Section 18.6.3)

■ How does a DBMS recover from media failure without reading the complete
log? (Section 18.7)

■ Record-level logging increases concurrency. What are the potential problems,
and how does ARIES address them? (Section 18.8)

■ What is shadow paging? (Section 18.8)

EXERCISES

Exercise 18.1 Briefly answer the following questions:

1. How does the recovery manager ensure atomicity of transactions? How does it ensure
durability?
2. What is the difference between stable storage and disk?
3. What is the difference between a system crash and a media failure?
4. Explain the WAL protocol.
5. Describe the steal and no-force policies.

Exercise 18.2 Briefly answer the following questions:

1. What are the properties required of LSNs?
2. What are the fields in an update log record? Explain the use of each field.
3. What are redoable log records?
4. What are the differences between update log records and CLRs?

Exercise 18.3 Briefly answer the following questions:

1. What are the roles of the Analysis, Redo, and Undo phases in ARIES?
2. Consider the execution shown in Figure 18.6.


LSN   LOG
00    begin_checkpoint
10    end_checkpoint
20    update: T1 writes P5
30    update: T2 writes P3
40    T2 commit
50    T2 end
60    update: T3 writes P3
70    T1 abort
      CRASH, RESTART

Figure 18.6  Execution with a Crash

LSN   LOG
00    update: T1 writes P2
10    update: T1 writes P1
20    update: T2 writes P5
30    update: T3 writes P3
40    T3 commit
50    update: T2 writes P5
60    update: T2 writes P3
70    T2 abort

Figure 18.7  Aborting a Transaction

(a) What is done during Analysis? (Be precise about the points at which Analysis
begins and ends and describe the contents of any tables constructed in this phase.)
(b) What is done during Redo? (Be precise about the points at which Redo begins and
ends.)
(c) What is done during Undo? (Be precise about the points at which Undo begins
and ends.)

Exercise 18.4 Consider the execution shown in Figure 18.7.

1. Extend the figure to show prevLSN and undoNextLSN values.
2. Describe the actions taken to roll back transaction T2.


LSN   LOG
00    begin_checkpoint
10    end_checkpoint
20    update: T1 writes P1
30    update: T2 writes P2
40    update: T3 writes P3
50    T2 commit
60    update: T3 writes P2
70    T2 end
80    update: T1 writes P5
90    T3 abort
      CRASH, RESTART

Figure 18.8  Execution with Multiple Crashes

3. Show the log after T2 is rolled back, including all prevLSN and undoNextLSN values in
log records.

Exercise 18.5 Consider the execution shown in Figure 18.8. In addition, the system crashes
during recovery after writing two log records to stable storage and again after writing another
two log records.

1. What is the value of the LSN stored in the master log record?
2. What is done during Analysis?
3. What is done during Redo?
4. What is done during Undo?
5. Show the log when recovery is complete, including all non-null prevLSN and undoNextLSN
values in log records.

Exercise 18.6 Briefly answer the following questions:

1. How is checkpointing done in ARIES?
2. Checkpointing can also be done as follows: Quiesce the system so that only checkpointing
activity can be in progress, write out copies of all dirty pages, and include the dirty page
table and transaction table in the checkpoint record. What are the pros and cons of this
approach versus the checkpointing approach of ARIES?
3. What happens if a second begin_checkpoint record is encountered during the Analysis
phase?
4. Can a second end_checkpoint record be encountered during the Analysis phase?
5. Why is the use of CLRs important for the use of undo actions that are not the physical
inverse of the original update?


LSN   LOG
00    begin_checkpoint
10    update: T1 writes P1
20    T1 commit
30    update: T2 writes P2
40    T1 end
50    T2 abort
60    update: T3 writes P3
70    end_checkpoint
80    T3 commit
      CRASH, RESTART

Figure 18.9  Log Records between Checkpoint Records

6. Give an example that illustrates how the paradigm of repeating history and the use of
CLRs allow ARIES to support locks of finer granularity than a page.

Exercise 18.7 Briefly answer the following questions:

1. If the system fails repeatedly during recovery, what is the maximum number of log
records that can be written (as a function of the number of update and other log records
written before the crash) before restart completes successfully?
2. What is the oldest log record we need to retain?
3. If a bounded amount of stable storage is used for the log, how can we always ensure
enough stable storage to hold all log records written during restart?

Exercise 18.8 Consider the three conditions under which a redo is unnecessary (Section
18.6.2).

1. Why is it cheaper to test the first two conditions?
2. Describe an execution that illustrates the use of the first condition.
3. Describe an execution that illustrates the use of the second condition.

Exercise 18.9 The description in Section 18.6.1 of the Analysis phase made the simplifying
assumption that no log records appeared between the begin_checkpoint and end_checkpoint
records for the most recent complete checkpoint. The following questions explore how such
records should be handled.

1. Explain why log records could be written between the begin_checkpoint and end_checkpoint
records.
2. Describe how the Analysis phase could be modified to handle such records.
3. Consider the execution shown in Figure 18.9. Show the contents of the end_checkpoint
record.
4. Illustrate your modified Analysis phase on the execution shown in Figure 18.9.
Chapter 6

Experimental Design

This chapter contains the book chapters:

D. Lilja. Measuring Computer Performance: A Practitioner's Guide.
Chapters 1, 2, and 6, pp. 1-24 and 82-107 (50 of 261).
Cambridge University Press, 2000. ISBN: 978-0-521-64105-0

Properly comparing different designs and implementations of a given service
or system is a hard problem. A wealth of techniques have been developed
to measure or estimate the performance of computer systems, including
analytical modeling, simulation, and experimentation. Use of these techniques
requires care in assumption documentation, metric selection, benchmark design,
measurement approach, and experimental setup. While a comprehensive
treatment is beyond the scope of ACS, we intend to provide an overview of the
issues, as well as measurement exercises to develop competence in structuring
experiments. The ultimate goal of this portion of the material is to provide us
with basic concepts in measurement of computer systems, as well as allow us to
structure and carry out experiments to measure basic performance properties
of a system under study.
The learning goals for this portion of the material are listed below.

• Explain the three main methodologies for performance measurement and
modeling: analytical modeling, simulation, and experimentation.

• Design and execute experiments to measure the performance of a system.

Chapter 6 of Lilja's book is given to deepen understanding of available
measurement strategies; however, it is to be considered as an additional reading
and not fundamental to the attainment of the learning goals above.

Introduction

'Performance can be bad, but can it ever be wrong?'
Jim Kolm, SGI/Cray Research, Inc.

1.1 Measuring performance

If the automobile industry had followed the same development cycles as the
computer industry, it has been speculated that a Rolls Royce car would cost
less than $100 with an efficiency of more than 200 miles per gallon of gasoline.
While we certainly get more car for our money now than we did twenty years
ago, no other industry has ever changed at the incredible rate of the computer
and electronics industry.

Computer systems have gone from being the exclusive domain of a few scientists
and engineers who used them to speed up some esoteric computations, such
as calculating the trajectory of artillery shells, for instance, to being so common
that they go unnoticed. They have replaced many of the mechanical control
systems in our cars, thereby reducing cost while improving efficiency, reliability,
and performance. They have made possible such previously science-fiction-like
devices as cellular phones. They have provided countless hours of entertainment
for children ranging in age from one to one hundred. They have even brought
sound to the common greeting card. One constant throughout this proliferation
and change, however, has been the need for system developers and users to
understand the performance of these computer-based devices.

While measuring the cost of a system is usually relatively straightforward
(except for the confounding effects of manufacturers' discounts to special
customers), determining the performance of a computer system can oftentimes seem
like an exercise in futility. Surprisingly, one of the main difficulties in measuring
performance is that reasonable people often disagree strongly on how performance
should be measured or interpreted, and even on what 'performance'
actually means.


Performance analysis as applied to experimental computer science and
engineering should be thought of as a combination of measurement, interpretation,
and communication of a computer system's 'speed' or 'size' (sometimes referred
to as its 'capacity'). The terms speed and size are quoted in this context to
emphasize that their actual definitions often depend on the specifics of the
situation. Also, it is important to recognize that we need not necessarily be dealing
with complete systems. Often it is necessary to analyze only a small portion of
the system independent of the other components. For instance, we may be
interested in studying the performance of a certain computer system's network
interface independent of the size of its memory or the type of processor.
Unfortunately, the components of a computer system can interact in incredibly
complex, and frequently unpredictable, ways. One of the signs of an expert
computer performance analyst is that he or she can tease apart these interactions
to determine the performance effect due only to a particular component.

One of the most interesting tasks of the performance analyst can be figuring
out how to measure the necessary data. A large dose of creativity may be needed
to develop good measurement techniques that perturb the system as little as
possible while providing accurate, reproducible results. After the necessary
data have been gathered, the analyst must interpret the results using appropriate
statistical techniques. Finally, even excellent measurements interpreted in a
statistically appropriate fashion are of no practical use to anyone unless they are
communicated in a clear and consistent manner.

1.2 Common goals of performance analysis

The goals of any analysis of the performance of a computer system, or one of its
components, will depend on the specific situation and the skills, interests, and
abilities of the analyst. However, we can identify several different typical goals of
performance analysis that are useful both to computer-system designers and to
users.

• Compare alternatives. When purchasing a new computer system, you may be
confronted with several different systems from which to choose. Furthermore,
you may have several different options within each system that may impact
both cost and performance, such as the size of the main memory, the number
of processors, the type of network interface, the size and number of disk
drives, the type of system software (i.e., the operating system and compilers),
and on and on. The goal of the performance analysis task in this case is to
provide quantitative information about which configurations are best under
specific conditions.


• Determine the impact of a feature. In designing new systems, or in upgrading
existing systems, you often need to determine the impact of adding or removing
a specific feature of the system. For instance, the designer of a new processor
may want to understand whether it makes sense to add an additional
floating-point execution unit to the microarchitecture, or whether the size of
the on-chip cache should be increased instead. This type of analysis is often
referred to as a before-and-after comparison since only one well-defined
component of the system is changed.

• System tuning. The goal of performance analysis in system tuning is to find the
set of parameter values that produces the best overall performance. In time-shared
operating systems, for instance, it is possible to control the number of
processes that are allowed to actively share the processor. The overall performance
perceived by the system users can be substantially impacted both by
this number, and by the time quantum allocated to each process. Many other
system parameters, such as disk and network buffer sizes, for example, can
also significantly impact the system performance. Since the performance
impacts of these various parameters can be closely interconnected, finding
the best set of parameter values can be a very difficult task.

• Identify relative performance. The performance of a computer system typically
has meaning only in the context of its performance relative to something else,
such as another system or another configuration of the same system. The goal
in this situation may be to quantify the change in performance relative to
history, that is, relative to previous generations of the system. Another
goal may be to quantify the performance relative to a customer's expectations,
or to a competitor's systems, for instance.

• Performance debugging. Debugging a program for correct execution is a
fundamental prerequisite for any application program. Once the program is
functionally correct, however, the performance analysis task becomes one of
finding performance problems. That is, the program now produces the correct
results, but it may be much slower than desired. The goal of the performance
analyst at this point is to apply the appropriate tools and analysis techniques
to determine why the program is not meeting performance expectations. Once
the performance problems are identified, they can, it is to be hoped, be corrected.

• Set expectations. Users of computer systems may have some idea of what the
capabilities of the next generation of a line of computer systems should be.
The task of the performance analyst in this case is to set the appropriate
expectations for what a system is actually capable of doing.

In all of these situations, the effort involved in the performance-analysis task
should be proportional to the cost of making the wrong decision. For example, if
you are comparing different manufacturers' systems to determine which best
satisfies the requirements for a large purchasing decision, the financial cost of
making the wrong decision could be quite substantial, both in terms of the cost
of the system itself, and in terms of the subsequent impacts on the various parts
of a large project or organization. In this case, you will probably want to perform
a very detailed, thorough analysis. If, however, you are simply trying to choose a
system for your own personal use, the cost of choosing the wrong one is minimal.
Your performance analysis in this case may be correctly limited to reading a few
reviews from a trade magazine.

1.3 Solution techniques

When one is confronted with a performance-analysis problem, there are three
fundamental techniques that can be used to find the desired solution. These are
measurements of existing systems, simulation, and analytical modeling. Actual
measurements generally provide the best results since, given the necessary
measurement tools, no simplifying assumptions need to be made. This characteristic
also makes results based on measurements of an actual system the most believable
when they are presented to others. Measurements of real systems are not
very flexible, however, in that they provide information about only the specific
system being measured. A common goal of performance analysis is to characterize
how the performance of a system changes as certain parameters are varied. In
an actual system, though, it may be very difficult, if not impossible, to change
some of these parameters. Evaluating the performance impact of varying the
speed of the main memory system, for instance, is simply not possible in most
real systems. Furthermore, measuring some aspects of performance on an actual
system can be very time-consuming and difficult. Thus, while measurements of
real systems may provide the most compelling results, their inherent difficulties
and limitations produce a need for other solution techniques.

A simulation of a computer system is a program written to model important
features of the system being analyzed. Since the simulator is nothing more than a
program, it can be easily modified to study the impact of changes made to almost
any of the simulated components. The cost of a simulation includes both the time
and effort required to write and debug the simulation program, and the time
required to execute the necessary simulations. Depending on the complexity of
the system being simulated, and the level of detail at which it is modeled, these
costs can be relatively low to moderate compared with the cost of purchasing a
real machine on which to perform the corresponding experiments.
The primary limitation of a simulation-based performance analysis is that it is
impossible to model every small detail of the system being studied. Consequently,
simplifying assumptions are required in order to make it possible to write the
simulation program itself, and to allow it to execute in a reasonable amount of
time. These simplifying assumptions then limit the accuracy of the results by
lowering the fidelity of the model compared with how an actual system would
perform. Nevertheless, simulation enjoys tremendous popularity for computer-systems
analysis due to its high degree of flexibility and its relative ease of
implementation.
The third technique in the performance analyst's toolbox is analytical modeling.
An analytical model is a mathematical description of the system. Compared
with a simulation or a measurement of a real machine, the results of an analytical
model tend to be much less believable and much less accurate. A simple
analytical model, however, can provide some quick insights into the overall behavior
of the system, or one of its components. This insight can then be used to help
focus a more detailed measurement or simulation experiment. Analytical models
are also useful in that they provide at least a coarse level of validation of a
simulation or measurement. That is, an analytical model can help confirm
whether the results produced by a simulator, or the values measured on a real
system, appear to be reasonable.
Example. The delay observed by an application program when accessing memory
can have a significant impact on its overall execution time. Direct measurements
of this time on a real machine can be quite difficult, however, since the
detailed steps involved in the operation of a complex memory hierarchy structure
are typically not observable from a user's application program. A sophisticated
user may be able to write simple application programs that exercise specific
portions of the memory hierarchy to thereby infer important memory-system
parameters. For instance, the execution time of a simple program that repeatedly
references the same variable can be used to estimate the time required to access
the first-level cache. Similarly, a program that always forces a cache miss can be
used to indirectly measure the main memory access time. Unfortunately, the
impact of these system parameters on the execution time of a complete application
program is very dependent on the precise memory-referencing characteristics
of the program, which can be difficult to determine.

Simulation, on the other hand, is a powerful technique for studying memory-system
behavior due to its high degree of flexibility. Any parameter of the memory,
including the cache associativity, the relative cache and memory delays, the
sizes of the cache and memory, and so forth, can be easily changed to study its
impact on performance. It can be challenging, however, to accurately model in a
simulator the overlap of memory delays and the execution of other instructions
in contemporary processors that incorporate such performance-enhancing features
as out-of-order instruction issuing, branch prediction, and nonblocking
caches. Even with the necessary simplifying assumptions, the results of a detailed
simulation can still provide useful insights into the effect of the memory system
on the performance of a specific application program.
Finally, a simple analytical model of the memory system can be developed as
follows. Let tc be the time delay observed by a memory reference if the memory
location being referenced is in the cache. Also, let tm be the corresponding delay
if the referenced location is not in the cache. The cache hit ratio, denoted h, is the
fraction of all memory references issued by the processor that are satisfied by the
cache. The fraction of references that miss in the cache and so must also access
the memory is 1 - h. Thus, the average time required for all cache hits is htc while
the average time required for all cache misses is (1 - h)tm. A simple model of the
overall average memory-access time observed by an executing program then is

tavg = htc + (1 - h)tm.    (1.1)

To apply this simple model to a specific application program, we would need
to know the hit ratio, h, for the program, and the values of tc and tm for the
system. These memory-access-time parameters, tc and tm, may often be found in
the manufacturer's specifications of the system. Or, they may be inferred through
a measurement, as described above and as explored further in the exercises in
Chapter 6. The hit ratio, h, for an application program is typically more difficult
to obtain. It is often found through a simulation of the application, though.
Although this model will provide only a very coarse estimate of the average
memory-access time observed by a program, it can provide us with some insights
into the relative effects of increasing the hit ratio, or changing the memory-timing
parameters, for instance.
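
As a small illustration of equation (1.1), the sketch below evaluates the model for
one set of made-up parameter values; the numbers are invented for the example
and are not measurements of any real system.

# Illustrative use of the simple memory model in equation (1.1).
def avg_access_time(h, t_cache, t_memory):
    # h: cache hit ratio; t_cache, t_memory: access delays (for example, in ns)
    return h * t_cache + (1 - h) * t_memory

# Example: a 98% hit ratio, a 2 ns cache, and a 60 ns memory.
print(avg_access_time(0.98, 2.0, 60.0))   # about 3.16 ns on average
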
The key differences among these solution techniques are summarized in Table
1.1. The flexibility of a technique is an indication of how easy it is to change the
system to study different configurations. The cost corresponds to the time, effort,
and money necessary to perform the appropriate experiments using each technique.
The believability of a technique is high if a knowledgeable individual has a
high degree of confidence that the result produced using that technique is likely
to be correct in practice. It is much easier for someone to believe that the
execution time of a given application program will be within a certain range
when you can demonstrate it on an actual machine, for instance, than when
relying on a mere simulation. Similarly, most people are more likely to believe
the results of a simulation study than one that relies entirely on an analytical
model. Finally, the accuracy of a solution technique indicates how closely results
obtained when using that technique correspond to the results that would have
been obtained on a real system.

The choice of a specific solution technique depends on the problem being
solved. One of the skills that must be developed by a computer-systems
performance analyst is determining which technique is the most appropriate for the
given situation. The following chapters are designed to help you develop precisely
this skill.

Table 1.1. A comparison of the performance-analysis solution techniques

                    Solution technique
Characteristic      Analytical modeling    Simulation    Measurement
Flexibility         High                   High          Low
Cost                Low                    Medium        High
Believability       Low                    Medium        High
Accuracy            Low                    Medium        High

1.4 Summary

Computer-systems performance analysis often feels more like an art than a
science. Indeed, different individuals can sometimes reach apparently contradictory
conclusions when analyzing the same system or set of systems. While this
type of ambiguity can be quite frustrating, it is often due to misunderstandings of
what was actually being measured, or disagreements about how the data should
be analyzed or interpreted. These differences further emphasize the need to
clearly communicate all results and to completely specify the tools, techniques,
and system parameters used to collect and understand the data. As you study the
following chapters, my hope is that you will begin to develop an appreciation for
this art of measurement, interpretation, and communication in addition to
developing a deeper understanding of its mathematical and scientific underpinnings.

1.5 Exercises

1. Respond to the question quoted at the beginning of this chapter,
'Performance can be bad, but can it ever be wrong?'

2. Performance analysis should be thought of as a decision-making process.
Section 1.2 lists several common goals of a performance-analysis experiment.
List other possible goals of the performance-analysis decision-making
process. Who are the beneficiaries of each of these possible goals?


3. Table 1.1 compares the three main performance-analysis solution techniques
across several criteria. What additional criteria could be used to compare
these techniques?

4. Identify the most appropriate solution technique for each of the following
situations.
(a) Estimating the performance benefit of a new feature that an engineer is
considering adding to a computer system currently being designed.
(b) Determining when it is time for a large insurance company to upgrade to
a new system.
(c) Deciding the best vendor from which to purchase new computers for an
expansion to an academic computer lab.
(d) Determining the minimum performance necessary for a computer system
to be used on a deep-space probe with very limited available electrical
power.

Metrics of performance

'Time is a great teacher, but unfortunately it kills all its pupils.'
Hector Berlioz

2.1 What is a performance metric?

Before we can begin to understand any aspect of a computer system's performance,
we must determine what things are interesting and useful to measure. The
basic characteristics of a computer system that we typically need to measure are:

• a count of how many times an event occurs,
• the duration of some time interval, and
• the size of some parameter.

For instance, we may need to count how many times a processor initiates an
input/output request. We may also be interested in how long each of these
requests takes. Finally, it is probably also useful to determine the number of
bits transmitted and stored.

From these types of measured values, we can derive the actual value that we
wish to use to describe the performance of the system. This value is called a
performance metric.

If we are interested specifically in the time, count, or size value measured, we
can use that value directly as our performance metric. Often, however, we are
interested in normalizing event counts to a common time basis to provide a speed
metric such as operations executed per second. This type of metric is called a rate
metric or throughput and is calculated by dividing the count of the number of
events that occur in a given interval by the time interval over which the events
occur. Since a rate metric is normalized to a common time basis, such as seconds,
it is useful for comparing different measurements made over different time
intervals.
Choosing an appropriate performance metric depends on the goals for
the specific situation and the cost of gathering the necessary information. For
example, suppose that you need to choose between two different computer systems
to use for a short period of time for one specific task, such as choosing
between two systems to do some word processing for an afternoon. Since the
penalty for being wrong in this case, that is, choosing the slower of the two
machines, is very small, you may decide to use the processors' clock frequencies
as the performance metric. Then you simply choose the system with the fastest
clock. However, since the clock frequency is not a reliable performance metric
(see Section 2.3.1), you would want to choose a better metric if you are trying to
decide which system to buy when you expect to purchase hundreds of systems for
your company. Since the consequences of being wrong are much larger in this
case (you could lose your job, for instance!), you should take the time to perform
a rigorous comparison using a better performance metric. This situation then
begs the question of what constitutes a good performance metric.

2.2 Characteristics of a good performance metric

There are many different metrics that have been used to describe a computer
system's performance. Some of these metrics are commonly used throughout the
field, such as MIPS and MFLOPS (which are defined later in this chapter),
whereas others are invented for new situations as they are needed. Experience
has shown that not all of these metrics are 'good' in the sense that sometimes
using a particular metric can lead to erroneous or misleading conclusions.
Consequently, it is useful to understand the characteristics of a 'good' performance
metric. This understanding will help when deciding which of the existing
performance metrics to use for a particular situation, and when developing a new
performance metric.

A performance metric that satisfies all of the following requirements is generally
useful to a performance analyst in allowing accurate and detailed comparisons
of different measurements. These criteria have been developed by observing
the results of numerous performance analyses over many years. While they
should not be considered absolute requirements of a performance metric, it
has been observed that using a metric that does not satisfy these requirements
can often lead the analyst to make erroneous conclusions.

1. Linearity. Since humans intuitively tend to think in linear terms, the value of
the metric should be linearly proportional to the actual performance of the
machine. That is, if the value of the metric changes by a certain ratio, the
actual performance of the machine should change by the same ratio. This
proportionality characteristic makes the metric intuitively appealing to most
people. For example, suppose that you are upgrading your system to a system
whose speed metric (i.e. execution-rate metric) is twice as large as the same
metric on your current system. You then would expect the new system to be
able to run your application programs in half the time taken by your old
system. Similarly, if the metric for the new system were three times larger than
that of your current system, you would expect to see the execution times
reduced to one-third of the original values.

Not all types of metrics satisfy this proportionality requirement. Logarithmic
metrics, such as the dB scale used to describe the intensity of sound, for
example, are nonlinear metrics in which an increase of one in the value of
the metric corresponds to a factor of ten increase in the magnitude of the
observed phenomenon. There is nothing inherently wrong with these types of
nonlinear metrics, it is just that linear metrics tend to be more intuitively
appealing when interpreting the performance of computer systems.

2. Reliability. A performance metric is considered to be reliable if system A
always outperforms system B when the corresponding values of the metric for
both systems indicate that system A should outperform system B. For example,
suppose that we have developed a new performance metric called WIPS
that we have designed to compare the performance of computer systems when
running the class of word-processing application programs. We measure system
A and find that it has a WIPS rating of 128, while system B has a WIPS
rating of 97. We then can say that WIPS is a reliable performance metric for
word-processing application programs if system A always outperforms system
B when executing these types of applications.

While this requirement would seem to be so obvious as to be unnecessary to
state explicitly, several commonly used performance metrics do not in fact
satisfy this requirement. The MIPS metric, for instance, which is described
further in Section 2.3.2, is notoriously unreliable. Specifically, it is not unusual
for one processor to have a higher MIPS rating than another processor while
the second processor actually executes a specific program in less time than
does the processor with the higher value of the metric. Such a metric is
essentially useless for summarizing performance, and we say that it is unreliable.

3. Repeatability. A performance metric is repeatable if the same value of the
metric is measured each time the same experiment is performed. Note that this
also implies that a good metric is deterministic.

4. Easiness of measurement. If a metric is not easy to measure, it is unlikely that
anyone will actually use it. Furthermore, the more difficult a metric is to
measure directly, or to derive from other measured values, the more likely
it is that the metric will be determined incorrectly. The only thing worse than a
bad metric is a metric whose value is measured incorrectly.

5. Consistency. A consistent performance metric is one for which the units of the
metric and its precise definition are the same across different systems and
different configurations of the same system. If the units of a metric are not
consistent, it is impossible to use the metric to compare the performances of
the different systems. While the necessity for this characteristic would also
seem obvious, it is not satisfied by many popular metrics, such as MIPS
(Section 2.3.2) and MFLOPS (Section 2.3.3).

6. Independence. Many purchasers of computer systems decide which system to
buy by comparing the values of some commonly used performance metric. As
a result, there is a great deal of pressure on manufacturers to design their
machines to optimize the value obtained for that particular metric, and to
influence the composition of the metric to their benefit. To prevent corruption
of its meaning, a good metric should be independent of such outside influences.

2.3 Processor and system performance metrics

A wide variety of performance metrics has been proposed and used in the
computer field. Unfortunately, many of these metrics are not good in the sense
defined above, or they are often used and interpreted incorrectly. The following
subsections describe many of these common metrics and evaluate them against
the above characteristics of a good performance metric.

2.3.1 The clock rate

In many advertisements for computer systems, the most prominent indication of
performance is often the frequency of the processor's central clock. The implication
to the buyer is that a 250 MHz system must always be faster at solving the
user's problem than a 200 MHz system, for instance. However, this performance
metric completely ignores how much computation is actually accomplished in
each clock cycle, it ignores the complex interactions of the processor with the
memory subsystem and the input/output subsystem, and it ignores the not at all
unlikely fact that the processor may not be the performance bottleneck.

Evaluating the clock rate against the characteristics for a good performance
metric, we find that it is very repeatable (characteristic 3) since it is a constant for
a given system, it is easy to measure (characteristic 4) since it is most likely
stamped on the box, the value of MHz is precisely defined across all systems
so that it is consistent (characteristic 5), and it is independent of any sort of
manufacturers' games (characteristic 6). However, the unavoidable shortcomings
of using this value as a performance metric are that it is nonlinear (characteristic
1), and unreliable (characteristic 2). As many owners of personal computer
systems can attest, buying a system with a faster clock in no way assures that
their programs will run correspondingly faster. Thus, we conclude that the
processor's clock rate is not a good metric of performance.

2.3.2 MIPS

A throughput or execution-rate performance metric is a measure of the amount of
computation performed per unit time. Since rate metrics are normalized to a
common basis, such as seconds, they are very useful for comparing relative
speeds. For instance, a vehicle that travels at 50 m s^-1 will obviously traverse
more ground in a fixed time interval than will a vehicle traveling at 35 m s^-1.
The MIPS metric is an attempt to develop a rate metric for computer systems
that allows a direct comparison of their speeds. While in the physical world speed
is measured as the distance traveled per unit time, MIPS defines the computer
system's unit of 'distance' as the execution of an instruction. Thus, MIPS, which
is an acronym for millions of instructions executed per second, is defined to be

MIPS = n / (te × 10^6)    (2.1)

where te is the time required to execute n total instructions.


Defining the unit of 'distance' in this way makes MIPS easy to measure
(characteristic 4), repeatable (characteristic 3), and independent (characteristic 6).
Unfortunately, it does not satisfy any of the other characteristics of a good
performance metric. It is not linear since, like the clock rate, a doubling of the
MIPS rate does not necessarily cause a doubling of the resulting performance. It
also is neither reliable nor consistent since it really does not correlate well to
performance at all.

The problem with MIPS as a performance metric is that different processors
can do substantially different amounts of computation with a single instruction.
For instance, one processor may have a branch instruction that branches after
checking the state of a specified condition code bit. Another processor, on the
other hand, may have a branch instruction that first decrements a specified count
register, and then branches after comparing the resulting value in the register
with zero. In the first case, a single instruction does one simple operation,
whereas in the second case, one instruction actually performs several operations.
The failing of the MIPS metric is that each instruction corresponds to one unit of
'distance,' even though in this example the second instruction actually performs
more real computation. These differences in the amount of computation performed
by an instruction are at the heart of the differences between RISC and
CISC processors and render MIPS essentially useless as a performance metric.
Another derisive explanation of the MIPS acronym is meaningless indicator of
performance since it is really no better a measure of overall performance than is
the processor's clock frequency.

2.3.3 MFLOPS

The MFLOPS performance metric tries to correct the primary shortcoming of
the MIPS metric by more precisely defining the unit of 'distance' traveled by a
computer system when executing a program. MFLOPS, which is an acronym for
millions of floating-point operations executed per second, defines an arithmetic
operation on two floating-point (i.e. fractional) quantities to be the basic unit
of 'distance.' MFLOPS is thus calculated as

MFLOPS = f / (te × 10^6)    (2.2)

where f is the number of floating-point operations executed in te seconds. The
MFLOPS metric is a definite improvement over the MIPS metric since the results
of a floating-point computation are more clearly comparable across computer
systems than is the execution of a single instruction. An important problem with
this metric, however, is that the MFLOPS rating for a system executing a program
that performs no floating-point calculations is exactly zero. This program
may actually be performing very useful operations, though, such as searching a
database or sorting a large set of records.

A more subtle problem with MFLOPS is agreeing on exactly how to count the
number of floating-point operations in a program. For instance, many of the
Cray vector computer systems performed a floating-point division operation
using successive approximations involving the reciprocal of the denominator
and several multiplications. Similarly, some processors can calculate transcendental
functions, such as sin, cos, and log, in a single instruction, while others
require several multiplications, additions, and table look-ups. Should these
operations be counted as a single floating-point operation or multiple floating-point
operations? The first method would intuitively seem to make the most
sense. The second method, however, would increase the value of f in the
above calculation of the MFLOPS rating, thereby artificially inflating its
value. This flexibility in counting the total number of floating-point operations
causes MFLOPS to violate characteristic 6 of a good performance metric. It is
also unreliable (characteristic 2) and inconsistent (characteristic 5).
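
For concreteness, the sketch below evaluates equations (2.1) and (2.2) for one
made-up run; the instruction and operation counts are invented for illustration
only.

# Illustrative computation of the MIPS and MFLOPS rate metrics
# from equations (2.1) and (2.2).
def mips(n_instructions, exec_time_s):
    return n_instructions / (exec_time_s * 1e6)

def mflops(n_float_ops, exec_time_s):
    return n_float_ops / (exec_time_s * 1e6)

# A run that executes 4 billion instructions, 500 million of them
# floating-point operations, in 2 seconds:
print(mips(4e9, 2.0))     # 2000.0
print(mflops(5e8, 2.0))   # 250.0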


2.3.4 SPEC

To standardize the definition of the actual result produced by a computer system
in 'typical' usage, several computer manufacturers banded together to form the
System Performance Evaluation Cooperative (SPEC). This group identified a set
of integer and floating-point benchmark programs that was intended to reflect
the way most workstation-class computer systems were actually used.
Additionally, and, perhaps, most importantly, they also standardized the
methodology for measuring and reporting the performance obtained when executing
these programs.

The methodology defined consists of the following key steps.

1. Measure the time required to execute each program in the set on the system
being tested.

2. Divide the time measured for each program in the first step by the time
required to execute each program on a standard basis machine to normalize
the execution times.

3. Average together all of these normalized values using the geometric mean (see
Section 3.3.4) to produce a single-number performance metric, as sketched below.
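
The following is a minimal sketch of steps 2 and 3; the timing values are invented,
and the normalization follows the wording above (measured time divided by the
standard basis machine's time).

# Hypothetical sketch of the SPEC-style summarization described above.
from math import prod

def spec_style_metric(measured_times, reference_times):
    # Step 2: normalize each program's time to the standard basis machine.
    normalized = [t / ref for t, ref in zip(measured_times, reference_times)]
    # Step 3: combine the normalized values with the geometric mean.
    return prod(normalized) ** (1 / len(normalized))

# Three benchmark programs, with made-up times in seconds:
print(spec_style_metric([30.0, 120.0, 45.0], [60.0, 180.0, 90.0]))   # about 0.55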

While the SPEC methodology is certainly more rigorous than is using MIPS or
MFLOPS as a measure of performance, it still produces a problematic performance
metric. One shortcoming is that averaging together the individual normalized
results with the geometric mean produces a metric that is not linearly related
to a program's actual execution time. Thus, the SPEC metric is not intuitive
(characteristic 1). Furthermore, and more importantly, it has been shown to be
an unreliable metric (characteristic 2) in that a given program may execute faster
on a system that has a lower SPEC rating than it does on a competing system
with a higher rating.

Finally, although the defined methodology appears to make the metric independent
of outside influences (characteristic 6), it is actually subject to a wide
range of tinkering. For example, many compiler developers have used these
benchmarks as practice programs, thereby tuning their optimizations to the
characteristics of this collection of applications. As a result, the execution times of the
collection of programs in the SPEC suite can be quite sensitive to the particular
selection of optimization flags chosen when the program is compiled. Also, the
selection of specific programs that comprise the SPEC suite is determined by a
committee of representatives from the manufacturers within the cooperative. This
committee is subject to numerous outside pressures since each manufacturer has a
strong interest in advocating application programs that will perform well on their
machines. Thus, while SPEC is a significant step in the right direction towards
defining a good performance metric, it still falls short of the goal.


2.3.5 QUIPS

The QUIPS metric, which was developed in conjunction with the HINT bench-
mark program, is a fundamentally different type of performance metric. (The
details of the HINT benchmark and the precise definition of QUIPS are given in
Section 7.2.3.) Instead of defining the effort expended to reach a certain result as
the measure o f what is accomplished, the QUIPS metric defines the quality o f the
solution as a more meaningful indication o f a user’ s final goal. The quality is
rigorously defined on the basis o f mathematical characteristics o f the problem
being solved. Dividing this measure o f solution quality by the time required to
achieve that level o f quality produces QUIPS, or quality improvements per sec­
ond.
This new performance metric has several o f the characteristics o f a good
performance metric. The mathematically precise definition o f ‘ quality’for the
defined problem makes this metric insensitive to outside influences (characteristic
6) and makes it entirely self-consistent when it is ported to different machines
(characteristic 5). It is also easily repeatable (characteristic 3) and it is linear
(characteristic 1) since, for the particular problem chosen for the HINT bench­
mark, the resulting measure o f quality is linearly related to the time required to
obtain the solution.
Given the positive aspects o f this metric, it still does present a few potential
difficulties when used as a general-purpose performance metric. The primary
potential difficulty is that it need not always be a reliable metric (characteristic
2) due to its narrow focus on floating-point and memory system performance. It
is generally a very good metric for predicting how a computer system will per­
form when executing numerical programs. However, it does not exercise some
aspects o f a system that are important when executing other types o f application
programs, such as the input/output subsystem, the instruction cache, and the
operating system's ability to multiprogram, for instance. Furthermore, while the
developers have done an excellent job of making the HINT benchmark easy to
measure (characteristic 4) and portable to other machines, it is difficult to change
the quality definition. A new problem must be developed to focus on other
aspects o f a system’s performance since the definition o f quality is tightly coupled
to the problem being solved. Developing a new problem to more broadly exercise
the system could be a difficult task since it must maintain all o f the characteristics
described above.
Despite these difficulties, QUIPS is an important new type o f metric that
rigorously defines interesting aspects o f performance while providing enough
flexibility to allow new and unusual system architectures to demonstrate their
capabilities. While it is not a completely general-purpose metric, it should prove
to be very useful in measuring a system’ s numerical processing capabilities.


It also should be a strong stimulus for greater rigor in defining future perfor­
mance metrics.

2.3.6 Execution time

Since we are ultimately interested in how quickly a given program is executed,
the fundamental performance metric of any computer system is the time required
to execute a given application program. Quite simply, the system that produces
the smallest total execution time for a given application program has the highest
performance. We can compare times directly, or use them to derive appropriate
rates. However, without a precise and accurate measure o f time, it is impossible
to analyze or compare most any system performance characteristics.
Consequently, it is important to know how to measure the execution time o f a
program, or a portion o f a program, and to understand the limitations o f the
measuring tool.
The basic technique for measuring time in a computer system is analogous to
using a stopwatch to measure the time required to perform some event. Unlike a
stopwatch that begins measuring time from 0, however, a computer system
typically has an internal counter that simply counts the number o f clock ticks
that have occurred since the system was first turned on. (See also Section 6.2.) A
time interval then is measured by reading the value o f the counter at the start o f
the event to be timed and again at the end o f the event. The elapsed time is the
difference between the two count values multiplied by the period o f the clock
ticks.
As an example, consider the program example shown in Figure 2.1. In this
example, the init_timer() function initializes the data structures used to access
the system's timer. This timer is a simple counter that is incremented continu-
ously by a clock with a period defined in the variable clock_cycle. Reading the
address pointed to by the variable read_count returns the current count value of
the timer.
To begin timing a portion of a program, the current value in the timer is read
and stored in start_count. At the end of the portion of the program being
timed, the timer value is again read and stored in end_count. The difference
between these two values is the total number of clock ticks that occurred during
the execution of the event being measured. The total time required to execute this
event is this number of clock ticks multiplied by the period of each tick, which is
stored in the constant clock_cycle.
This technique for measuring the elapsed execution time o f any selected por­
tion o f a program is often referred to as the wall clock time since it measures the
total time that a user would have to wait to obtain the results produced by the
program. That is, the measurement includes the time spent waiting for input/


main()
{
    int i;
    float a;

    init_timer();

    /* Read the starting time. */
    start_count = read_count;

    /* Stuff to be measured */
    for (i = 0; i < 1000; i++) {
        a = i * a / 10;
    }

    /* Read the ending time. */
    end_count = read_count;

    elapsed_time = (end_count - start_count) * clock_cycle;
}
Figure 2.1. An example program showing how to measure the execution time o f a portion
o f a program.

output operations to complete, memory paging, and other system operations
performed on behalf of this application, all of which are integral components
of the program's execution. However, when the system being measured is time-
shared so that it is not dedicated to the execution of this one application
program, this elapsed execution time also includes the time the application spends
waiting while other users' applications execute.
Many researchers have argued that including this time-sharing overhead in the
program's total execution time is unfair. Instead, they advocate measuring per-
formance using the total time the processor actually spends executing the pro-
gram, called the total CPU time. This time does not include the time the program
is context-switched out while another application runs. Unfortunately, using
only this CPU time as the performance metric ignores the waiting time that is
inherent to the application as well as the time spent waiting on other programs.
A better solution is to report both the CPU time and the total execution time so
the reader can determine the significance of the time-sharing interference. The
point is to be explicit about what information you are actually reporting to allow
the reader to decide for themselves how believable your results are.
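
As a hedged sketch of this recommendation (not from the original text), the following C
program reports both measurements for the same loop, using the standard clock() function
for CPU time and time() for elapsed wall-clock time; on a heavily loaded time-shared
system the two values can differ noticeably.

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double a = 1.0;
    clock_t cpu_start  = clock();      /* CPU time consumed by this process         */
    time_t  wall_start = time(NULL);   /* elapsed (wall clock) time, 1 s resolution */

    for (long i = 0; i < 200000000L; i++)
        a = a * 1.0000001;             /* work being measured */

    clock_t cpu_end  = clock();
    time_t  wall_end = time(NULL);

    printf("CPU time:        %.2f s\n",
           (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC);
    printf("Wall-clock time: %.0f s\n", difftime(wall_end, wall_start));
    return 0;
}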


In addition to system-overhead effects, the measured execution time of an
application program can vary significantly from one run to another since the
program must contend with random events, such as the execution o f background
operating system tasks, different virtual-to-physical page mappings and cache
mappings from explicitly random replacement policies, variable system load in a
time-shared system, and so forth. As a result, a program’ s execution time is
nondeterministic. It is important, then, to measure a program’ s total elapsed
execution time several times and report at least the mean and variance o f the
times. Errors in measurements, along with appropriate statistical techniques to
quantify them, are discussed in more detail in Chapter 4.
When it is measured as described above, the elapsed (wall clock) time mea­
surement produces a performance metric that is intuitive, reliable, repeatable,
easy to measure, consistent across systems, and independent o f outside influ­
ences. Thus, since it satisfies all o f the characteristics o f a good performance
metric, program execution time is one o f the best metrics to use when analyzing
computer system performance.

2.4 Other types of performance metrics

In addition to the processor-centric metrics described above, there are many
other metrics that are commonly used in performance analysis. For instance,
the system response time is the amount o f time that elapses from when a user
submits a request until the result is returned from the system. This metric is often
used in analyzing the performance o f online transaction-processing systems, for
example. System throughput is a measure o f the number o f jobs or operations
that are completed per unit time. The performance o f a real-time video-proces­
sing system, for instance, may be measured in terms o f the number o f video
frames that can be processed per second. The bandwidth o f a communication
network is a throughput measure that quantifies the number o f bits that can be
transmitted across the network per second. Many other ad hoc performance
metrics are defined by performance analysts to suit the specific needs o f the
problem or system being studied.

2.5 Speedup and relative change

Speedup and relative change are useful metrics for comparing systems since they
normalize performance to a common basis. Although these metrics are defined in
terms o f throughput or speed metrics, they are often calculated directly from
execution times, as described below.


Speedup. The speedup of system 2 with respect to system 1 is defined to be a
value S2,1 such that R2 = S2,1 R1, where R1 and R2 are the 'speed' metrics being
compared. Thus, we can say that system 2 is S2,1 times faster than system 1. Since
a speed metric is really a rate metric (i.e. throughput), R1 = D1/T1, where D1 is
analogous to the 'distance traveled' in time T1 by the application program when
executing on system 1. Similarly, R2 = D2/T2. Assuming that the 'distance tra-
veled' by each system is the same, D1 = D2 = D, giving the following definition
for speedup:

    Speedup of system 2 w.r.t. system 1 = S2,1 = R2/R1 = (D/T2)/(D/T1) = T1/T2.    (2.3)

If system 2 is faster than system 1, then T2 < T1 and the speedup ratio will be
larger than 1. If system 2 is slower than system 1, however, the speedup ratio will
be less than 1. This situation is often referred to as a slowdown instead of a
speedup.
Relative change. Another technique for normalizing performance is to express
the performance of a system as a percent change relative to the performance of
another system. We again use the throughput metrics R1 and R2 as measures of
the speeds of systems 1 and 2, respectively. The relative change of system 2 with
respect to system 1, denoted Δ2,1 (that is, using system 1 as the basis), is then
defined to be

    Relative change of system 2 w.r.t. system 1 = Δ2,1 = (R2 - R1)/R1.    (2.4)

Again assuming that the execution time of each system is measured when
executing the same program, the 'distance traveled' by each system is the same
so that R1 = D/T1 and R2 = D/T2. Thus,

    Δ2,1 = (R2 - R1)/R1 = (D/T2 - D/T1)/(D/T1) = T1/T2 - 1.    (2.5)

Typically, the value of Δ2,1 is multiplied by 100 to express the relative change
as a percentage with respect to a given basis system. This definition of relative
change will produce a positive value if system 2 is faster than system 1, whereas a
negative value indicates that the basis system is faster.
Example. As an example o f how to apply these two different normalization
techniques, the speedup and relative change o f the systems shown in Table 2.1
are found using system 1 as the basis. From the raw execution times, we can
easily see that system 4 is the fastest, followed by systems 2, 1, and 3, in that
order. However, the speedup values give us a more precise indication o f exactly
how much faster one system is than the others. For instance, system 2 has a


Table 2.1. An example of calculating speedup and relative change using system 1 as
the basis

    System    Execution time    Speedup    Relative change
    x         Tx (s)            Sx,1       Δx,1 (%)

    1         480               1.00         0
    2         360               1.33       +33
    3         540               0.89       -11
    4         210               2.29       +129

speedup o f 1.33 compared with system 1 or, equivalently, it is 33% faster. System
4 has a speedup ratio o f 2.29 compared with system 1 (or it is 129% faster). We
also see that system 3 is actually 11% slower than system 1, giving it a slowdown
factor o f 0.89. O
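
As a small illustrative sketch (not part of the original text), the following C fragment
reproduces the speedup and relative-change values of Table 2.1 directly from the raw
execution times, using system 1 as the basis.

#include <stdio.h>

int main(void)
{
    double t[] = { 480.0, 360.0, 540.0, 210.0 };   /* execution times of systems 1-4 (s) */
    double basis = t[0];                           /* system 1 is the basis              */

    for (int x = 0; x < 4; x++) {
        double speedup    = basis / t[x];          /* S(x,1) = T1 / Tx                   */
        double rel_change = (speedup - 1.0) * 100; /* Δ(x,1), expressed as a percentage  */
        printf("System %d: speedup %.2f, relative change %+.0f%%\n",
               x + 1, speedup, rel_change);
    }
    return 0;
}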

2.6 Means versus ends metrics

One of the most important characteristics of a performance metric is that it be
reliable (characteristic 2). One of the problems with many of the metrics dis-
cussed above that makes them unreliable is that they measure what was done
cussed above that makes them unreliable is that they measure what was done
whether or not it was useful. What makes a performance metric reliable, however,
is that it accurately and consistently measures progress towards a goal. Metrics
that measure what was done, useful or not, have been called means-based metrics
whereas ends-based metrics measure what is actually accomplished.
To obtain a feel for the difference between these two types of metrics, consider
the vector dot-product routine shown in Figure 2.2. This program executes N
floating-point addition and N multiplication operations for a total of 2N floating-
point operations. If the time required to execute one addition is t+ cycles and one
multiplication requires t* cycles, the total time required to execute this program
is t1 = N(t+ + t*) cycles. The resulting execution rate then is

s = 0;
for (i = 1; i < N; i++)
    s = s + x[i] * y[i];

Figure 2.2. A vector dot-product example program.


    R1 = 2N / [N(t+ + t*)] = 2 / (t+ + t*)   FLOPS/cycle.    (2.6)

Since there is no need to perform the addition or multiplication operations for
elements whose value is zero, it may be possible to reduce the total execution
time if many elements of the two vectors are zero. Figure 2.3 shows the example
from Figure 2.2 modified to perform the floating-point operations only for those
nonzero elements. If the conditional if statement requires tif cycles to execute,
the total time required to execute this program is t2 = N[tif + f(t+ + t*)] cycles,
where f is the fraction of N for which both x[i] and y[i] are nonzero. Since the
total number of additions and multiplications executed in this case is 2Nf, the
execution rate for this program is

    R2 = 2Nf / (N[tif + f(t+ + t*)]) = 2f / [tif + f(t+ + t*)]   FLOPS/cycle.    (2.7)

If tif is four cycles, t+ is five cycles, t* is ten cycles, f is 10%, and the proces-
sor's clock rate is 250 MHz (i.e. one cycle is 4 ns), then t1 = 60N ns and
t2 = N[4 + 0.1(5 + 10)] x 4 ns = 22N ns. The speedup of program 2 relative
to program 1 then is found to be S2,1 = 60N/22N = 2.73.
Calculating the execution rates realized by each program with these assump-
tions produces R1 = 2/(60 ns) = 33 MFLOPS and R2 = 2(0.1)/(22 ns) =
9.09 MFLOPS. Thus, even though we have reduced the total execution time
from t1 = 60N ns to t2 = 22N ns, the means-based metric (MFLOPS) shows
that program 2 is 72% slower than program 1. The ends-based metric (execution
time), however, shows that program 2 is actually 173% faster than program 1.
We reach completely different conclusions when using these two different types
o f metrics because the means-based metric unfairly gives program 1 credit for all
o f the useless operations o f multiplying and adding zero. This example highlights
the danger o f using the wrong metric to reach a conclusion about computer-
system performance.

s = 0;
for (i = 1; i < N; i++)
    if (x[i] != 0 && y[i] != 0)
        s = s + x[i] * y[i];

Figure 2.3. The vector dot-product example program of Figure 2.2 modified to calculate
only nonzero elements.
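
To make the arithmetic above easy to reproduce, here is a small illustrative C program
(not from the original text) that plugs in the assumed cycle counts and clock rate and
prints the per-element execution times, the MFLOPS ratings, and the speedup for the two
dot-product versions.

#include <stdio.h>

int main(void)
{
    double t_add = 5, t_mul = 10, t_if = 4;   /* cycles, as assumed in the text        */
    double f = 0.10;                          /* fraction of nonzero element pairs     */
    double cycle_ns = 4.0;                    /* 250 MHz clock -> 4 ns per cycle       */

    /* per-element times in ns (the common factor N cancels in the ratios) */
    double t1 = (t_add + t_mul) * cycle_ns;               /* program 1: 60 ns */
    double t2 = (t_if + f * (t_add + t_mul)) * cycle_ns;  /* program 2: 22 ns */

    double mflops1 = 2.0 / t1 * 1000.0;       /* 2 FLOPs per element                   */
    double mflops2 = 2.0 * f / t2 * 1000.0;   /* only 2f useful FLOPs per element      */

    printf("t1 = %.0fN ns, t2 = %.0fN ns, speedup = %.2f\n", t1, t2, t1 / t2);
    printf("R1 = %.1f MFLOPS, R2 = %.2f MFLOPS\n", mflops1, mflops2);
    return 0;
}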


2.7 Summary

Fundamental to measuring computer-systems performance is defining an appro-
priate metric. This chapter identified several characteristics or criteria that are
important for a 'good' metric of performance. Several common performance
metrics were then introduced and analyzed in the context of these criteria. The
definitions of speedup and relative change were also introduced. Finally, the
concepts of ends-based versus means-based metrics were presented to clarify
what actually causes a metric to be useful in capturing the actual performance
of a computer system.

2.8 For further reading

• The following paper argues strongly for total execution time as the best mea-
sure of performance:

James E. Smith, 'Characterizing Computer Performance with a Single
Number,' Communications of the ACM, October 1988, pp. 1202-1206.

• The QUIPS metric is described in detail in the following paper, which also
introduced the idea of means-based versus ends-based metrics:

J. L. Gustafson and Q. O. Snell, 'HINT: A New Way to Measure Computer
Performance,' Hawaii International Conference on System Sciences, 1995,
pp. II:392-401.

• Some of the characteristics of the SPEC metric are discussed in the following
papers:

Ran Giladi and Niv Ahituv, 'SPEC as a Performance Evaluation Measure,'
IEEE Computer, Vol. 28, No. 8, August 1995, pp. 33-42.

Nikki Mirghafori, Margret Jacoby, and David Patterson, 'Truth in SPEC
Benchmarks,' ACM Computer Architecture News, Vol. 23, No. 5,
December 1995, pp. 34-42.

• Parallel computing systems are becoming more common. They present some
interesting performance measurement problems, though, as discussed in

Lawrence A. Crowl, 'How to Measure, Present, and Compare Parallel
Performance,' IEEE Parallel and Distributed Technology, Spring 1994,
pp. 9-25.


2.9 Exercises

1. (a) Write a simple benchmark program to estimate the maximum effective
MIPS rating of a computer system. Use your program to rank the per-
formance o f three different, but roughly comparable, computer systems.
(b) Repeat part (a) using the maximum effective MFLOPS rating as the
metric o f performance.
(c) Compare the rankings obtained in parts (a) and (b) with the ranking
obtained by comparing the clock frequencies o f the different systems.
(d) Finally, compare your rankings with those published by authors using
some standard benchmark programs, such as those available on the
SPEC website.
2. What makes a performance metric ‘ reliable?’
3. Classify each o f the following metrics as being either means-based or ends-
based; MIPS, MFLOPS, execution time, bytes o f available memory, quality
o f a final answer, arithmetic precision, system cost, speedup, and reliability o f
an answer.
4. Devise an experiment to determine the following performance metrics for a
computer system.
(a) The effective memory bandwidth between the processor and the data
cache if all memory references are cache hits.
(b) The effective memory bandwidth if all memory references are cache
misses.
5. What are the key differences between ‘ wall clock time’and ‘ CPU time?’
Under what conditions should each one be used? Is it possible for these
two different times to be the same?
6. The execution time required to read the current time from an interval counter
is a minimum o f at least one memory-read operation to obtain the current
time value and one memory-write operation to store the value for later use. In
some cases, it may additionally include a subroutine call and return operation.
How does this timer 'overhead' affect the time measured when using such an
interval timer to determine the duration o f some event, such as the total
execution time o f a program?
7. Calculate the speedup and relative change o f the four systems shown in Table
2.1 when using System 4 as the basis. How do your newly calculated values
affect the relative rankings o f the four systems?

Measurement tools and techniques

'When the only tool you have is a hammer, every problem begins to resemble a nail.’
Abraham Maslow

The previous chapters have discussed what performance metrics may be useful
for the performance analyst, how to summarize measured data, and how to
understand and quantify the systematic and random errors that affect our
measurements. Now that we know what to do with our measured values,
this chapter presents several tools and techniques for actually measuring the
values we desire.
The focus o f this chapter is on fundamental measurement concepts. The goal is
not to teach you how to use specific measurement tools, but, rather, to help you
understand the strengths and limitations o f the various measurement techniques.
By the end o f this chapter, you should be able to select an appropriate measure­
ment technique to determine the value o f a desired performance metric. You also
should have developed some understanding o f the trade-offs involved in using
the various types o f tools and techniques.

6.1 Events and measurement strategies

There are many different types o f performance metrics that we may wish to
measure. The different strategies for measuring the values o f these metrics are
typically based around the idea o f an event, where an event is some predefined
change in the system state. The precise definition o f a specific event is up to the
performance analyst and depends on the metric being measured. For instance, an
event may be defined to be a memory reference, a disk access, a network com ­
munication operation, a change in a processor’ s internal state, or some pattern or
combination o f other subevents.


6.1.1 Events-type classification

The different types o f metrics that a performance analyst may wish to measure
can be classified into the following categories based on the type o f event or events
that comprise the metric.
1. Event-count metrics. Metrics that fall into this category are those that are
simple counts o f the number o f times a specific event occurs. Examples o f
event-count metrics include the number o f page faults in a system with
virtual memory, and the number o f disk input/output requests made by a
program.
2. Secondary-event metrics. These types o f metrics record the values o f some
secondary parameters whenever a given event occurs. For instance, to deter­
mine the average number o f messages queued in the send buffer o f a com ­
munication port, we would need to record the number o f messages in the
queue each time a message was added to, or removed from, the queue. Thus,
the triggering event is a message-enqueue or -dequeue operation, and the
metrics being recorded are the number o f messages in the queue and the total
number o f queue operations. We may also wish to record the size (e.g. the
number o f bytes) o f each message sent to later determine the average mes­
sage size.
3. Profiles. A profile is an aggregate metric used to characterize the overall
behavior o f an application program or o f an entire system. Typically, it is
used to identify where the program or system is spending its execution time.

6.1.2 Measurement strategies

The above event-type classification can be useful in helping the performance
analyst decide on a specific strategy for measuring the desired metric, since
different types o f measurement tools are appropriate for measuring different
types o f events. These different measurement tools can be categorized on the
basis o f the fundamental strategy used to determine the actual values o f the
metrics being measured. One important concern with any measurement strategy
is how much it perturbs the system being measured. This aspect o f performance
measurement is discussed further in Section 6.6.
1. Event-driven. An event-driven measurement strategy records the information
necessary to calculate the performance metric whenever the preselected event
or events occur. The simplest type o f event-driven measurement tool uses a
simple counter to directly count the number o f occurrences o f a specific event.
For example, the desired metric may be the number o f page faults that occur
during the execution o f an application program. To find this value, the per­
formance analyst most likely would have to modify the page-fault-handling


routine in the operating system to increment a counter whenever the routine is
entered. At the termination of the program's execution, an additional
mechanism must be provided to dump the contents of the counter. A minimal
sketch of this kind of instrumentation appears after this list.
One o f the advantages o f an event-driven strategy is that the system overhead
required to record the necessary information is incurred only when the event
o f interest actually occurs. If the event never occurs, or occurs only infre­
quently, the perturbation to the system will be relatively small. This charac­
teristic can also be a disadvantage, however, when the events being monitored
occur very frequently.
When recording high-frequency events, a great deal o f overhead may be
introduced into a program’ s execution, which can significantly alter the pro­
gram’ s execution behavior compared with its uninstrumented execution. As a
result, what the measurement tool measures need not reflect the typical or
average behavior o f the system. Furthermore, the time between measurements
depends entirely on when the measured events occur so that the inter-event
time can be highly variable and completely unpredictable. This can increase
the difficulty o f determining how much the measurement tool actually per­
turbs the executing program. Event-driven measurement tools are usually
considered most appropriate for low-frequency events.
2. Tracing. A tracing strategy is similar to an event-driven strategy, except that,
rather than simply recording that fact that the event has occurred, some
portion o f the system state is recorded to uniquely identify the event. For
example, instead o f simply counting the number o f page faults, a tracing
strategy may record the addresses that caused each o f the page faults. This
strategy obviously requires significantly more storage than would a simple
count o f events. Additionally, the time required to save the desired state,
either by storing it within the system’ s memory or by writing to a disk, for
instance, can significantly alter the execution o f the program being measured.
3. Sampling. In contrast to an event-driven measurement strategy, a sampling
strategy records at fixed time intervals the portion o f the system state neces­
sary to determine the metric o f interest. As a result, the overhead due to this
strategy is independent o f the number o f times a specific event occurs. It is
instead a function o f the sampling frequency, which is determined by the
resolution necessary to capture the events o f interest.
The sampling o f the state o f the system occurs at fixed time intervals that are
independent o f the occurrence o f specific events. Thus, not every occurrence
o f the events o f interest will be recorded. Rather, a sampling strategy produces
a statistical summary o f the overall behavior o f the system. Consequently,
events that occur infrequently may be completely missed by this statistical
approach. Furthermore, each run o f a sampling-based experiment is likely to
produce a different result since the samples occur asynchronously with respect


to a program's execution. Nevertheless, while the exact behavior may differ,
the statistical behavior should remain approximately the same.
4. Indirect. An indirect measurement strategy must be used when the metric that
is to be determined is not directly accessible. In this case, you must find
another metric that can be measured directly, from which you then can
deduce or derive the desired performance metric. Developing an appropriate
indirect measurement strategy, and minimizing its overhead, relies almost
completely on the cleverness and creativity o f the performance analyst.
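
The following fragment is a hedged sketch (not from the original text) of the event-driven
counting strategy described in item 1: a hypothetical page-fault handler increments a simple
event counter, and a dump routine reports the count when the program terminates.

#include <stdio.h>

/* Illustrative instrumentation for an event-driven measurement tool. */
static unsigned long page_fault_count = 0;

/* Called from the (hypothetical) page-fault-handling routine. */
void record_page_fault(void)
{
    page_fault_count++;          /* overhead is paid only when the event occurs */
}

/* Called once at program termination to dump the counter. */
void dump_event_counts(void)
{
    printf("page faults: %lu\n", page_fault_count);
}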

The unique characteristics of these measurement strategies make them more or
less appropriate for different situations. Program tracing can provide the most
detailed information about the system being monitored. An event-driven mea­
surement tool, on the other hand, typically provides only a higher-level summary
o f the system behavior, such as overall counts or average durations. The infor­
mation supplied both by an event-driven measurement tool and by a tracing tool
is exact, though, such as the precise number o f times a certain subroutine is
executed. In contrast, the information provided by a sampling strategy is statis­
tical in nature. Thus, repeating the same experiment with an event-driven or
tracing tool will produce the same results each time whereas the results produced
with a sampling tool will vary slightly each time the experiment is performed.
The system resources consumed by the measurement tool itself as it collects
data will strongly affect how much perturbation the tool will cause in the system.
As mentioned above, the overhead o f an event-driven measurement tool is
directly proportional to the number o f occurrences o f the event being measured.
Events that occur frequently may cause this type o f tool to produce substantial
perturbation as a byproduct o f the measurement process. The overhead o f a
sampling-based tool, however, is independent o f the number o f times any specific
event occurs. The perturbation caused by this type o f tool is instead a function o f
the sampling interval, which can be controlled by the experimenter or the tool
builder. A trace-based tool consumes the largest amount o f system resources,
requiring both processor resources (i.e. time) to record each event and potentially
enormous amounts o f storage resources to save each event in the trace. As a
result, tracing tends to produce the largest system perturbation.
Each indirect measurement tool must be uniquely adapted to the particular
aspect o f the system performance it attempts to measure. Therefore, it is im pos­
sible to make any general statements about a measurement tool that makes use
o f an indirect strategy. The key to implementing a tool to measure a specific
performance metric is to match the characteristics o f the desired metric with the
appropriate measurement strategy. Several o f the fundamental techniques that
have been used for implementing the various measurement strategies are
described in the following sections.


6.2 Interval timers

One of the most fundamental measuring tools in computer-system performance
analysis is the interval timer. An interval timer is used to measure the execution
time o f an entire program or any section o f code within a program. It can also
provide the time basis for a sampling measurement tool. Although interval
timers are relatively straightforward to use, understanding how an interval
timer is constructed helps the performance analyst determine the limitations
inherent in this type o f measurement tool.
Interval timers are based on the idea o f counting the number o f clock pulses
that occur between two predefined events. These events are typically identified by
inserting calls to a routine that reads the current timer count value into a pro­
gram at the appropriate points, such as shown previously in the example in
Figure 2.1. There are two common implementations o f interval timers, one
using a hardware counter, and the other based on a software interrupt.
Hardware timers. The hardware-based interval timer shown in Figure 6.1
simply counts the number o f pulses it receives at its clock input from a free-
running clock source. The counter is typically reset to 0 when the system is first
powered up so that the value read from the counter is the number o f clock ticks
that have occurred since that time. This value is used within a program by
reading the memory location that has been mapped to this counter by the man­
ufacturer o f the system.
Assume that the value read at the start of the interval being measured is x1 and
the value read at the end of the interval is x2. Then the total time that has elapsed
between these two read operations is Te = (x2 - x1)Tc, where Tc is the period of
the clock input to the counter.
Software timers. The primary difference between a software-interrupt-based
interval timer, shown in Figure 6.2, and a hardware-based timer is that the
counter accessible to an application program in the software-based implementa-


Figure 6.1 A hardware-based interval timer uses a free-running clock source to continuously
increment an n-bit counter. This counter can be read directly by the operating system or by
an application program. The period o f the clock, Tc, determines the resolution o f the timer.


[Diagram: free-running clock -> prescaler (divide-by-m) -> processor's interrupt input]

Figure 6.2 A software interrupt-based timer divides down a free-running clock to produce
a processor interrupt with the period Tc. The interrupt service routine then maintains a
counter variable in memory that it increments each time the interrupt occurs.

tion is not directly incremented by the free-running clock. Instead, the hardware
clock is used to generate a processor interrupt at regular intervals. The interrupt-
service routine then increments a counter variable it maintains, which is the value
actually read by an application program. The value o f this variable then is a
count o f the number o f interrupts that have occurred since the count variable
was last initialized. Some systems allow an application program to reset this
counter. This feature allows the timer to always start from zero when timing
the duration o f an event.
The period o f the interrupts in the software-based approach corresponds to
the period o f the timer. As before, we denote this period Tc so that the total time
elapsed between two readings o f the software counter value is again
Te = (x2 - x1)Tc. The processor interrupt is typically derived from a free-run-
ning clock source that is divided by m through a prescaling counter, as shown in
Figure 6.2. This prescaler is necessary in order to reduce the frequency o f the
interrupt signal fed into the processor. Interrupts would occur much too often,
and thus would generate a huge amount o f processor overhead, if this prescaling
were not done.
Timer rollover. One important consideration with these types o f interval timers
is the number o f bits available for counting. This characteristic directly deter­
mines the longest interval that can be measured. (The complementary issue o f the
shortest interval that can be measured is discussed in Section 6.2.2.) A binary
counter used in a hardware timer, or the equivalent count variable used in a
software implementation, is said to 'roll over' to zero as its count undergoes a
transition from its maximum value of 2^n - 1 to the zero value, where n is the
number of bits in the counter.
If the counter rolls over between the reading of the counter at the start of the
interval being measured and the reading of the counter at the end, the difference
of the count values, x2 - x1, will be a negative number. This negative value is
obviously not a valid measurement of the time interval. Any program that uses
an interval timer must take care to ensure that this type of roll over can never
occur, or it must detect and, possibly, correct the error. Note that a negative
value that occurs due to a single roll over of the counter can be converted to the
appropriate value by adding the maximum count value, 2^n, to the negative value
obtained when subtracting x1 from x2. Table 6.1 shows the maximum time
between timer roll overs for various counter widths and input clock periods.
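
As a hedged sketch of this correction (assuming a hypothetical free-running n-bit counter),
the following C function detects a single roll over and adds 2^n back to the difference
before converting ticks to time.

#include <stdint.h>

/* Illustrative only: compute elapsed time from two reads of an n-bit
 * counter (n < 63), correcting for at most one roll over. */
double elapsed_seconds(uint64_t start_count, uint64_t end_count,
                       int n_bits, double clock_period)
{
    int64_t ticks = (int64_t)end_count - (int64_t)start_count;
    if (ticks < 0)                        /* x2 - x1 came out negative: one roll over */
        ticks += (int64_t)1 << n_bits;    /* add the maximum count value, 2^n         */
    return (double)ticks * clock_period;  /* Te = (x2 - x1) * Tc                      */
}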

6.2.1 Timer overhead

The implementation of an interval timer on a specific system determines how the
timer must be used. In general, though, we can think of using an interval timer to
measure any portion o f a program, much as we would use a stopwatch to time a
runner on a track, for instance. In particular, we typically would use an interval
timer within a program as follows:

x_start = read_timer();
<event being timed>
x_end = read_timer();
elapsed_time = (x_end - x_start) * t_cycle;

When it is used in this way, we can see that the time we actually measure
includes more than the time required by the event itself. Specifically, accessing
the timer requires a minimum o f one memory-read operation. In some imple­
mentations, reading the timer may require as much as a call to the operating-
system kernel, which can be very time-consuming. Additionally, the value read
from the timer must be stored somewhere before the event being timed begins.
This requires at least one store operation, and, in some systems, it could
require substantially more. These operations must be performed twice, once
at the start o f the event, and once again at the end. Taken altogether, these
operations can add up to a significant amount o f time relative to the duration
o f the event itself.
To obtain a better understanding of this timer overhead, consider the time
line shown in Figure 6.3. Here, T1 is the time required to read the value of the
interval timer's counter. It may be as short as a single memory read, or as long
as a call into the operating-system kernel. Next, T2 is the time required to store
the current time. This time includes any time in the kernel after the counter has
been read, which would include, at a minimum, the execution of the return
instruction. Time T3 is the actual duration of the event we are trying to
measure. Finally, the time from when the event ends until the program actually
reads the counter value again is T4. Note that reading the counter this second
time involves the same set of operations as the first read of the counter so that
T4 = T1.
Assigning these times to each of the components in the timing operation now
allows us to compare the timer overhead with the time of the event itself, which is
what we actually want to know. This event time, Te, is time T3 in our time line, so
that Te = T3. What we measure, however, is Tm = T2 + T3 + T4. Thus, our


Table 6.1 The maximum time available before a binary interval timer with n bits and an
input clock with a period of Tc rolls over is Tc * 2^n

                              Counter width, n
    Tc        16         24          32           48              64

    10 ns     655 µs     168 ms      42.9 s       32.6 days       58.5 centuries
    100 ns    6.55 ms    1.68 s      7.16 min     326 days        585 centuries
    1 µs      65.5 ms    16.8 s      1.19 h       9.15 years      5,850 centuries
    10 µs     655 ms     2.8 min     11.9 h       89.3 years      58,500 centuries
    100 µs    6.55 s     28.0 min    4.97 days    893 years       585,000 centuries
    1 ms      1.09 min   4.66 h      49.7 days    89.3 centuries  5,850,000 centuries

[Figure 6.3: time line showing the components T1, T2, T3, and T4, omitted.]

Figure 6.3 The overhead incurred when using an interval timer to measure the execution
time of any portion of a program can be understood by breaking down the operations
necessary to use the timer into the components shown here.

desired measurement is Te = Tm - (T2 + T4) = Tm - (T1 + T2), since T4 = T1.
We call T1 + T2 the timer overhead and denote it Tovhd.
If the interval being measured is substantially larger than the timer overhead,
then the timer overhead can simply be ignored. If this condition is not satisfied,
though, then the timer overhead should be carefully measured and subtracted
from the measurement of the event under consideration. It is important to
recognize, however, that variations in measurements of the timer overhead itself
can often be quite large relative to variations in the times measured for the event.
As a result, measurements of intervals whose duration is of the same order of
magnitude as the timer overhead should be treated with great suspicion. A good
rule of thumb is that the event duration, Te, should be 100-1,000 times larger
than the timer overhead, Tovhd.
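
One practical way to estimate Tovhd, sketched below under the same conventions as
Figure 2.1 (read_count and clock_cycle are assumed to exist as in that example), is to
time many back-to-back reads of the timer with nothing in between and average the
differences.

extern volatile long read_count;   /* timer count, read as in Figure 2.1 (assumed)    */
extern double clock_cycle;         /* timer clock period, as in Figure 2.1 (assumed)  */

/* Illustrative sketch: estimate the timer overhead by timing many
 * back-to-back reads of the timer with no event in between. */
double estimate_timer_overhead(int trials)
{
    double total = 0.0;
    for (int i = 0; i < trials; i++) {
        long start = read_count;           /* first read of the counter  */
        long end   = read_count;           /* immediate second read      */
        total += (end - start) * clock_cycle;
    }
    return total / trials;                 /* average overhead per measurement */
}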


6.2.2 Quantization errors

The smallest change that can be detected and displayed by an interval timer is
its resolution. This resolution is a single clock tick, which, in terms of time, is the
period of the timer's clock input, Tc. This finite resolution introduces a random
quantization error into all measurements made using the timer.
For instance, consider an event whose duration is n ticks of the clock input,
plus a little bit more. That is, Te = nTc + Δ, where n is a positive integer and
0 < Δ < Tc. If, when one is measuring this event, the timer value is read
shortly after the event has actually begun, as shown in Figure 6.4(a), the
timer will count n clock ticks before the end of the event. The total execution
time reported then will be nTc. If, on the other hand, there is slightly less time
between the actual start of the event and the point at which the timer value is
read, as shown in Figure 6.4(b), the timer will count n + 1 clock ticks before
the end of the event is detected. The total time reported in this case will then be
(n + 1)Tc.
In general, the actual event time is within the range nTc < Te < (n + 1)Tc.
Thus, the fact that events are typically not exact whole-number multiples of the
timer's clock period causes the time value reported to be rounded either up or
down by one clock period. This rounding is completely unpredictable and is one
readily identifiable (albeit possibly small) source of random errors in our mea-
surements (see Section 4.2). Looking at this quantization effect another way, if we
made ten measurements of the same event, we would expect that approximately
five of them would be reported as nTc with the remainder reported as (n + 1)Tc. If
Tc is large relative to the event being measured, this quantization effect can make
it impossible to directly measure the duration of the event. Consequently, we
typically would like Tc to be as small as possible, within the constraints imposed
by the number of bits available in the timer (see Table 6.1).

[Figure 6.4: clock and event timing diagrams, omitted.]
(a) Interval timer reports event duration of n = 13 clock ticks.
(b) Interval timer reports event duration of n = 14 clock ticks.

Figure 6.4 The finite resolution of an interval timer causes quantization of the reported
duration of the events measured.


6.2.3 Statistical measures of short intervals

Owing to the above quantization effect, we cannot directly measure events whose
durations are less than the resolution of the timer. Similarly, quantization makes
it difficult to accurately measure events with durations that are only a few times
larger than the timer's resolution. We can, however, make many measurements
of a short-duration event to obtain a statistical estimate of the event's duration.
Consider an event whose duration is smaller than the timer's resolution, that
is, Te < Tc. If we measure this interval once, there are two possible outcomes. If
we happen to start our measurement such that the event straddles the active edge
of the clock that drives the timer's internal counter, as shown in Figure 6.5(a), we
will see the clock advance by one tick. On the other hand, since Te < Tc, it is
entirely possible that the event will begin and end within one clock period, as
shown in Figure 6.5(b). In this case, the timer will not advance during this
measurement. Thus, we have a Bernoulli experiment whose outcome is 1 with
probability p, which corresponds to the timer advancing by one tick while we are
measuring the event. If the clock does not advance, though, the outcome is 0 with
probability 1 - p.
Repeating this measurement n times produces a distribution that approximates
a binomial distribution. (It is only approximate since, for a true binomial dis-

[Figure 6.5: timing diagrams showing Tc and Te, omitted.]
(a) Event Te straddles the active edge of the interval timer.
(b) Event Te begins and ends within the resolution of the interval timer.

Figure 6.5 When one is measuring an event whose duration is less than the resolution of
the interval timer, that is, Te < Tc, there are two possible outcomes for each measurement.
Either the event happens to straddle the active edge of the timer's clock, in which case the
counter advances by one tick, or the event begins and completes between two clock edges.
In the latter case, the interval timer will show the same count value both before and after
the event. Measuring this event multiple times approximates a binomial distribution.


tribution, each of the n measurements must be independent. However, in a real
system it is possible that obtaining an outcome of 0 in one measurement makes it
more likely that one will obtain a 0 in the next measurement, for instance.
Nevertheless, this approximation appears to work well in practice.) If the num-
ber of outcomes that produce 1 is m, then the ratio m/n should approximate the
ratio of the duration of the event being measured to the clock period, Te/Tc.
Thus, we can estimate the average duration of this event to be

    Te ≈ (m/n) Tc.    (6.1)

We can then use the technique for calculating a confidence interval for a propor-
tion (see Section 4.4.3) to obtain a confidence interval for this average event
time.1
Example. We wish to measure an event whose duration we suspect is less than
the 40 µs resolution of our interval timer. Out of n = 10,482 measurements of
this event, we find that the clock actually advances by one tick during m = 852 of
them. For a 95% confidence level, we construct the interval for the ratio m/n =
852/10,482 as follows:

    (c1, c2) = 852/10,482 ∓ 1.96 sqrt[(852/10,482)(1 - 852/10,482)/10,482]
             = (0.0786, 0.0840).    (6.2)

Scaling this interval by the timer's clock period gives us the 95% confidence
interval (3.14, 3.36) µs for the duration of this event. O
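
The following hedged C sketch (not from the original text) carries out this estimation for
arbitrary m, n, and Tc; the half-width uses the usual normal approximation for a proportion,
which is assumed here and may differ slightly from the exact formula of Section 4.4.3.

#include <math.h>
#include <stdio.h>

/* Illustrative sketch: estimate the duration of an event shorter than the
 * timer resolution tc from n repeated measurements, m of which saw the
 * counter advance by one tick. */
void estimate_short_event(long m, long n, double tc)
{
    double p          = (double)m / (double)n;              /* fraction of advances, m/n */
    double half_width = 1.96 * sqrt(p * (1.0 - p) / (double)n);

    printf("estimated duration: %g\n", p * tc);
    printf("95%% confidence interval: (%g, %g)\n",
           (p - half_width) * tc, (p + half_width) * tc);
}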

6.3 Program profiling

A profile provides an overall view of the execution behavior of an application
program. More specifically, it is a measurement of how much time, or the frac-
tion o f the total time, the system spends in certain states. A profile o f a program
can be useful for showing how much time the program spends executing each o f
its various subroutines, for instance. This type o f information is often used by a
programmer to identify those portions o f the program that consume the largest
fraction o f the total execution time. Once the largest time consumers have been
identified, they can, one assumes, be enhanced to thereby improve performance.
Similarly, when a profile o f an entire system multitasking among several dif­
ferent applications is taken, it can be used by a system administrator to find
system-level performance bottlenecks. This information can be used in turn to

1 The basic idea behind this technique was first suggested by Peter H. Danzig and Steve Melvin in an
unpublished technical report from the University of Southern California.


tune the performance o f the overall system by adjusting such parameters as


buffer sizes, time-sharing quanta, disk-access policies, and so forth.
There are two distinct techniques for creating a program profile - program-
counter (PC) sampling and basic-block counting. Sampling can also be used to
generate a profile o f a complete system.

6.3.1 PC sampling

Sampling is a general statistical measurement technique in which a subset (i.e. a
sample) of the members of a population being examined is selected at random.
The information o f interest is then gathered from this subset o f the total popula­
tion. It is assumed that, since the samples were chosen completely at random, the
characteristics o f the overall population will approximately follow the same
proportions as do the characteristics o f the subset actually measured. This
assumption allows conclusions about the overall population to be drawn on
the basis o f the complete information obtained from a small subset o f this
population.
While this traditional population sampling selects all o f the samples to be
tested at (essentially) the same time, a slightly different approach is required
when using sampling to generate a profile o f an executing program. Instead o f
selecting all o f the samples to be measured at once, samples o f the executing
program are taken at fixed points in time. Specifically, an external periodic signal
is generated by the system that interrupts the program at fixed intervals.
Whenever one o f these interrupts is detected, appropriate state information is
recorded by the interrupt-service routine.
For instance, when one is generating a profile for a single executing program,
the interrupt-service routine examines the return-address stack to find the
address of the instruction that was executing when the interrupt occurred.
Using symbol-table information previously obtained from the compiler or
assembler, this program-instruction address is mapped onto a specific subroutine
identifier, i. The value i is used to index into a single-dimensional array, H, to
then increment the element Hi by one. In this way, the interrupt-service routine
generates a histogram of the number of times each subroutine in the program
was being executed when the interrupt occurred.
The ratio Hi/n is the fraction of the program's total execution time that it
spent executing in subroutine i, where n is the total number of interrupts that
occurred during the program's execution. Multiplying the period of the interrupt
by these ratios provides an estimate of the total time spent executing in each
subroutine.
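
The following is a hedged sketch of this mechanism on a POSIX system, using the
SIGPROF signal as the periodic sampling interrupt; the helper map_context_to_subroutine
is hypothetical and stands in for the machine-dependent extraction of the interrupted
program counter and the symbol-table lookup described above.

#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define NUM_SUBROUTINES 128

static volatile unsigned long H[NUM_SUBROUTINES]; /* samples per subroutine, Hi    */
static volatile unsigned long n_samples;          /* total number of interrupts, n */

/* Hypothetical, machine-dependent helper: extract the interrupted program
 * counter from the saved context and map it, via symbol-table information,
 * onto a subroutine identifier in the range 0..NUM_SUBROUTINES-1. */
extern int map_context_to_subroutine(void *context);

static void profile_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)info;
    H[map_context_to_subroutine(context)]++;
    n_samples++;
}

void start_sampling(long period_usec)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = profile_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it;
    it.it_interval.tv_sec  = 0;
    it.it_interval.tv_usec = period_usec;      /* sampling period             */
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);         /* periodic SIGPROF interrupts */
}
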
It is important to remember that sampling is a statistical process in which the
characteristics o f an entire population (in our present situation, the execution


behavior of an entire program or system) are inferred from a randomly selected
subset of the overall population. The calculated values of these inferences are,
therefore, subject to random errors. Not surprisingly, we can calculate a con­
fidence interval for these proportions to obtain a feel for the precision o f our
sampling experiment.
Example. Suppose that we use a sampling tool that interrupts an executing
program every Tc = 10 ms. Including the time required to execute the interrupt-
service routine, the program executes for a total of 8 s. If H_X = 12 of the n = 800
samples find the program counter somewhere in subroutine X when the interrupt
occurred, what is the fraction of the total time the program spends executing this
subroutine?
Since there are 800 samples in total, we conclude that the program spends
1.5% (12/800 = 0.015) of its time in subroutine X. Using the procedure from
Section 4.4.3, we calculate a 99% confidence interval for this proportion to be

    (c1, c2) = 0.015 ∓ 2.576 sqrt[0.015(1 - 0.015)/800] = (0.0039, 0.0261).    (6.3)

So, with 99% confidence, we estimate that the program spends between 0.39%
and 2.6% o f its time executing subroutine X. Multiplying by the period o f the
interrupt, we estimate that, out o f the 8 s the program was executing, there is a
99% chance that it spent between 31 (0.0039 x 8) and 210 (0.0261 x 8) ms
executing subroutine X. O
The confidence interval calculated in the above example produces a rather
large range o f times that the program could be spending in subroutine X. Put
in other terms, if we were to repeat this experiment several times, we would
expect that, in 99% o f the experiments, from three to 21 o f the 800 samples
would come from subroutine X. While this 7 : 1 range o f possible execution
times appears large, we estimate that subroutine X still accounts for less than
3% o f the total execution time. Thus, we most likely would start our program­
tuning efforts on a routine that consumes a much larger fraction o f the total
execution time.
This example does demonstrate the importance o f having a sufficient number
o f samples in each state to produce reliable information, however. To reduce the
size o f the confidence interval in this example we need more samples o f each
event. Obtaining more samples per event requires either sampling for a longer
period o f time, or increasing the sampling rate. In some situations, we can simply
let the program execute for a longer period o f time. This will increase the total
number o f samples and, hence, the number o f samples obtained for each sub­
routine.
Some programs have a fixed duration, however, and cannot be forced to
execute for a longer period. In this situation, we can run the program multiple


times and simply add the samples from each run. The alternative o f increasing
the sampling frequency will not always be possible, since the interrupt period is
often fixed by the system or the profiling tool itself. Furthermore, increasing the
sampling frequency increases the number o f times the interrupt-service routine is
executed, which increases the perturbation to the program. O f course, each run
o f the program must be performed under identical conditions. Otherwise, if the
test conditions are not identical, we are testing two essentially different systems.
Consequently, in this case, the two sets o f samples cannot be simply added
together to form one larger sample set.
It is also important to note that this sampling procedure implicitly assumes
that the interrupt occurs completely asynchronously with respect to any events in
the program being profiled. Although the interrupts occur at fixed, predefined
intervals, if the program events and the interrupt are asynchronous, the inter­
rupts will occur at random points in the execution o f the program being sampled.
Thus, the samples taken at these points are completely independent o f each
other. This sample independence is critical to obtaining accurate results with
this technique since any synchronism between the events in the program and
the interrupt will cause some areas o f the program to be sampled more often than
they should, given their actual frequency o f occurrence.

6.3.2 Basic-block counting

The sampling technique described above provides a statistical profile of the
behavior of a program. An alternative approach is to produce an exact execution
profile by counting the number o f times each basic block is executed. A basic
block is a sequence o f processor instructions that has no branches into or out o f
the sequence, as shown in Figure 6.6. Thus, once the first instruction in a block
begins executing, it is assured that all o f the remaining instructions in the block
will be executed. The instructions in a basic block can be thought o f as a com ­
putation that will always be executed as a single unit.
A program's basic-block structure can be exploited to generate a profile by
inserting additional instructions into each basic block. These additional instruc-
tions simply count the number of times the block is executed. When the program
terminates, these values form a histogram of the frequency of the basic-block
executions. Just like the histogram produced with sampling, this basic-block
histogram shows which portions of the program are executed most frequently.
In this case, though, the resolution o f the information is at the basic-block level
instead o f the subroutine level. Since a basic block executes as an indivisible unit,
complete instruction-execution-frequency counts can also be obtained from these
basic-block counts.


1.  $37: la   $25, __iob
2.       lw   $15, 0($25)
3.       addu $9, $15, -1
4.       sw   $9, 0($25)
5.       la   $8, __iob
6.       lw   $11, 0($8)
7.       bge  $11, 0, $38
8.       move $4, $8
9.       jal  __filbuf
10.      move $17, $2
11. $38: la   $12, __iob

Figure 6.6 A basic block is a sequence o f instructions with no branches into or out of the
block. In this example, one basic block begins at statement 1 and ends at statement 7. A
second basic block begins at statement 8 and ends at statement 9. Statement 10 is a basic
block consisting o f only one instruction. Statement 11 begins another basic block since it is
the target o f an instruction that branches to label $38.

One o f the key differences between this basic-block profile and a profile gen­
erated through sampling is that the basic-block profile shows the exact execution
frequencies o f all o f the instructions executed by a program. The sampling pro­
file, on the other hand, is only a statistical estimate o f the frequencies. Hence, if a
sampling experiment is run a second time, the precise execution frequencies will
most likely be at least slightly different. A basic-block profile, however, will
produce exactly the same frequencies whenever the program is executed with
the same inputs.
Although the repeatability and exact frequencies o f basic-block counting
would seem to make it the obvious profiling choice over a sampling-based pro­
file, modifying a program to count its basic-block executions can add a substan­
tial amount of run-time overhead. For instance, instrumenting a program for
basic-block counting requires adding to each basic block at least one instruction
that increments the appropriate counter when the block begins executing. Since
the counters that need to be incremented must be unique for each
basic block, it is likely that additional instructions to calculate the appropriate
offset for the current block into the array o f counters will be necessary.
In most programs, the number o f instructions in a basic block is typically
between three and 20. Thus, the number o f instructions executed by the instru­
mented program is likely to increase by at least a few percent and possibly as
much as 100% compared with the uninstrumented program. These additional
instructions can substantially increase the total running time o f the program.


Furthermore, the additional memory required to store the counter array, plus the
execution o f the additional instructions, can cause other substantial perturba­
tions. For instance, these changes to the program can significantly alter its
memory behavior.
So, while basic-block counting provides exact profile information, it does so at
the expense o f substantial overhead. Sampling, on the other hand, distributes its
perturbations randomly throughout a program’ s execution. Also, the total per­
turbation due to sampling can be controlled somewhat by varying the period o f
the sampling interrupt interval. Nevertheless, basic-block counting can be a
useful tool for precisely characterizing a program’ s execution profile. Many
compilers, in fact, have compile-time flags a user can set to automatically insert
appropriate code into a program as it is compiled to generate the desired basic-
block counts when it is subsequently executed.

6.4 Event tracing

The information captured through a profiling tool provides a summary picture
of the overall execution of a program. An often-useful type of information that is
ignored in this type o f profile summary, however, is the time-ordering o f events.
A basic-block-counting profile can show the type and frequency o f each o f the
instructions executed, for instance, but it does not provide any information
about the order in which the instructions were executed. When this sequencing
information is important to the analysis being performed, a program trace is the
appropriate choice.
A trace o f a program is a dynamic list o f the events generated by the program
as it executes. The events that comprise a trace can be any events that you can
find a way to monitor, such as a time-ordered list o f all o f the instructions
executed by a program, the sequence o f memory addresses accessed by a pro­
gram, the sequence o f disk blocks referenced by the file system, the sizes and
destinations o f all messages sent over a network, and so forth. The level o f detail
provided in a trace is entirely determined by the performance analyst’ s ability to
gather the information necessary for the problem at hand.
Traces themselves can be analyzed to characterize the overall behavior o f a
program, much as a profile characterizes a program’ s behavior. However, traces
are probably more typically used as the input to drive a simulator. For instance,
traces o f the memory addresses referenced by a program are often used to drive
cache simulators. Similarly, traces o f the messages sent by an application pro­
gram over a communication network are often used to drive simulators for
evaluating changes to communication protocols.


6.4.1 Trace generation

The overall tracing process is shown schematically in Figure 6.7. A tracing
system typically consists of two main components. The first is the application
being traced, which is the component that actually generates the trace. The
second main component is the trace consumer. This is the program, such as a
simulator, that actually uses the information being generated. In between the
trace generator and the consumer is often a large disk file on which to store the
trace. Storing the trace allows the consumer to be run many times against an
unchanging trace to allow comparison experiments without the expense o f regen­
erating the trace. Since the trace can be quite large, however, it will not always be
possible or desirable to store the trace on an intermediate disk. In this case, it is
possible to consume the trace online as it is generated.
A wide range o f techniques have been developed for generating traces. Several
o f these approaches are summarized below.

1. Source-code modification. Perhaps the most straightforward approach for
generating a program trace is to modify the source code of the program to
be traced. For instance, the programmer may add additional tracing state­
ments to the source code, as shown in Figure 6.8. When the program is
subsequently compiled and executed, these additional program statements
will be executed, thereby generating the desired trace. One advantage o f
this approach is that the programmer can trace only the desired events.
This can help reduce the volume o f trace data generated. One major dis­
advantage is that inserting trace points is typically a manual process and is,
therefore, very time-consuming and prone to error.
2. Software exceptions. Some processors have been constructed with a mode that
forces a software exception just before the execution o f each instruction. The

Figure 6.7 The overall process used to generate, store, and consume a program trace.


sum_x = 0.0;
trace(1);
sum_xx = 0.0;
trace(2);
for (i = 1; i <= n; i++)
{
    trace(3);
    sum_x += x[i];
    trace(4);
    sum_xx += (x[i] * x[i]);
    trace(5);
}
mean = sum_x / n;
trace(6);
var = ((n * sum_xx) - (sum_x * sum_x)) / (n * (n-1));
trace(7);
std_dev = sqrt(var);
trace(8);
z_p = unit_normal(1 - (0.5 * alpha));
trace(9);
half_int = z_p * std_dev / sqrt(n);
trace(10);
c1 = mean - half_int;
trace(11);
c2 = mean + half_int;
trace(12);

(a) The original source program with calls to the tracing routine inserted.

trace(i)
{ print(i, time); }

(b) The trace routine simply prints the statement number, i, and the current time.

Figure 6.8 Program tracing can be performed by inserting additional statements into the
source code to call a tracing subroutine at appropriate points.

exception-processing routine can decode the instruction to determine its oper­
ands. The instruction type, address, and operand addresses and values can
then be stored for later use. This approach was implemented using the T-bit in
Digital Equipment Corporation’s VAX processor series and in the Motorola
68000 processor family. Executing with the trace mode enabled on these
processors slowed down a program’ s execution by a factor o f about 1,000.


3. Emulation. An emulator is a program that makes the system on which it
executes appear to the outside world as if it were something completely dif­
ferent. For example, the Java Virtual Machine is a program that executes
application programs written in the Java programming language by emulating
the operation o f a processor that implements the Java byte-code instruction
set. This emulation obviously slows down the execution o f the application
program compared with direct execution. Conceptually, however, it is a
straightforward task to modify the emulator program to trace the execution
o f any application program it executes.
4. Microcode modification. In the days when processors executed microcode to
execute their instruction sets through interpretation, it was possible to modify
the microcode to generate a trace o f each instruction executed. One important
advantage o f this approach was that it traced every instruction executed on
the processor, including operating-system code. This feature was especially
useful for tracing entire systems, including the interaction between the appli­
cation programs and the operating system. The lack o f microcode on current
processors severely limits the applicability o f this approach today.
5. Compiler modification. Another approach for generating traces is to modify
the executable code produced by the compiler. Similar to what must be done
for generating basic-block counts, extra instructions are added at the start o f
each basic block to record when the block is entered and which basic block is
being executed then. Details about the contents o f the basic blocks can be
obtained from the compiler and correlated to the dynamic basic-block trace to
produce a complete trace o f all o f the instructions executed by the application
program. It is possible to add this type o f tracing facility as a compilation
option, or to write a post-compilation software tool that modifies the execu­
table program generated by the compiler.

These trace-generation techniques are by no means the only ways in which
traces can be produced. Rather, they are intended to give you a flavor of the
types o f approaches that have been used successfully in other trace-generation
systems. Indeed, new techniques are limited only by the imagination and crea­
tivity o f the performance analyst.

6.4.2 Trace compression

One obvious concern when generating a trace is the execution-time slowdown
and other program perturbations caused by the execution of the additional tra­
cing instructions. Another concern is the volume o f data that can be produced in
a very short time. For example, say we wish to trace every instruction executed
by a processor that executes at an average rate of 10⁸ instructions per second. If


each item in the trace requires 16 bits to encode the necessary information, our
tracing will produce more than 190 Mbytes o f data per uninstrumented second o f
execution time, or more than 11 Gbytes per minute! In addition to obtaining the
disks necessary to store this amount o f data, the input/output operations
required to move this large volume o f data from the traced program to the
disks create additional perturbations. Thus, it is desirable to reduce the amount
o f information that must be stored.
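To make the arithmetic explicit: 10⁸ events per second at 16 bits (2 bytes) per event is 2 × 10⁸ bytes per second, or roughly 190 Mbytes of trace data for each second of uninstrumented execution; over a minute this accumulates to about 60 × 190 Mbytes ≈ 11 Gbytes.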

6.4.2.1 Online trace consumption


One approach for dealing with these large data volumes is to consume the trace
online. That is, instead o f storing the trace for later use, the program that will be
driven by the trace is run simultaneously with the application program being
traced. In this way, the trace is consumed as it is generated so that it never needs
to be stored on disk at all.
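One simple way to arrange this on a POSIX system is to connect the trace generator to the consumer through a pipe, so that records are consumed as they are produced and never touch the disk. The sketch below is only illustrative; the record layout and the consumer program name (cache_sim) are made up for the example.

/* Illustrative sketch: online trace consumption through a pipe (POSIX). */
#include <stdio.h>

struct mem_ref {
    unsigned long address;   /* referenced memory address */
    char type;               /* 'r' for read, 'w' for write */
};

int main(void)
{
    /* "cache_sim" is a hypothetical trace consumer, e.g. a cache simulator. */
    FILE *consumer = popen("./cache_sim", "w");
    if (consumer == NULL) {
        perror("popen");
        return 1;
    }
    for (unsigned long i = 0; i < 1000; i++) {
        struct mem_ref r = { 0x1000 + 8 * i, (i % 4) ? 'r' : 'w' };
        fwrite(&r, sizeof r, 1, consumer);   /* consumed as it is generated */
    }
    pclose(consumer);
    return 0;
}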
A potential problem with online trace consumption in a multitasked (i.e. time-
shared) system is the possibly indeterminate behavior of the program being
traced. Since system events occur asynchronously with respect to the traced
program, there is no assurance that the next time the program is traced the
exact same sequence o f events will occur in the same relative time order. This
is a particular concern for programs that must respond to real-time events, such
as system interrupts and user inputs.
This potential lack o f repeatability in generating the trace is a concern when
performing one-to-one comparison experiments. In this situation, the trace-con­
sumption program is driven once with the trace and its output values are
recorded. It is then modified in some way and then driven again with the same
trace. If the identical input trace is used both times, it is reasonable to conclude
that any change in performance observed is due to the change made to the trace-
consumption program. However, if it cannot be guaranteed that the trace is
identical from one run to the next, it is not possible to determine whether any
change in performance observed is due to the change made, or whether it is due
to a difference in the input trace itself.

6.4.2.2 Compression of data


A trace written to intermediate storage, such as a disk, can be viewed just like
any other type o f data file. Consequently, it is quite reasonable to apply a data-
compression algorithm to the trace data as it is written to the disk. For example,
any one o f the large number o f compression programs based on the popular
Lempel-Ziv algorithm is often able to reduce the size o f a trace file by 20-70%.
O f course, the tradeoff for this data compression is the additional time required
to execute the compression routine when the trace is generated and the time
required to uncompress the trace when it is consumed.
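As one concrete possibility, the zlib library's gzip interface can compress trace records incrementally as they are written, so the uncompressed trace never exists on disk. The record format below is invented for the example, and the sketch assumes zlib is available (link with -lz).

/* Illustrative sketch: compressing a trace as it is written, using zlib. */
#include <stdio.h>
#include <zlib.h>

struct trace_record {
    unsigned int event_id;     /* e.g., basic-block or statement number */
    unsigned long timestamp;   /* e.g., cycle count or wall-clock time */
};

int main(void)
{
    gzFile out = gzopen("trace.gz", "wb");
    if (out == NULL) {
        fprintf(stderr, "could not open trace.gz\n");
        return 1;
    }
    for (unsigned int i = 0; i < 1000; i++) {
        struct trace_record r = { i % 16, (unsigned long)i * 10 };
        gzwrite(out, &r, sizeof r);   /* records are compressed on the way out */
    }
    gzclose(out);
    return 0;
}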


6.4.2.3 Abstract execution


An interesting variation o f the basic trace-compression idea takes advantage o f
the semantic information within a program to reduce the amount o f informa­
tion that must be stored for a trace. This approach, called abstract execution,
separates the tracing process into two steps. The first step performs a compiler-
style analysis o f the program to be traced. This analysis identifies a small subset
o f the entire trace that is sufficient to later reproduce the full trace. Only this
smaller subset is actually stored. Later, the trace-consumption program must
execute some special trace-regeneration routines to convert this partial trace
information into the full trace. These regeneration routines are automatically
generated by the tracing tool when it performs the initial analysis o f the
program.
The data about the full trace that are actually stored when using the abstract-
execution model consist o f information describing only those transitions that
may change during run-time. For example, consider the code fragment extracted
from a program to be traced shown in Figure 6.9. The compiler-style analysis
that would be performed on this code fragment would produce the control flow
graph shown in Figure 6.10. From this control flow graph, the trace-generation
tool can determine that statement 1 always precedes both statements 2 and 3.
Furthermore, statement 4 always follows both statements 2 and 3. When this
program is executed, the trace through this sequence o f statements will be either
1-2-4, or 1-3-4. Thus, the only information that needs to be recorded during
run-time is which o f statements 2 and 3 actually occurred. The trace-regeneration
routine is then able to later reconstruct the full trace using the previously
recorded control flow graph.
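The following sketch illustrates the idea on the code fragment of Figure 6.9 (shown below). During execution only the outcome of the branch in statement 1 is written to the partial trace; a separate regeneration routine later expands each recorded outcome into the full statement sequence, using the knowledge that statement 1 always precedes statements 2 or 3 and statement 4 always follows. The encoding and routine names are assumptions made for the example, not those of any particular tool.

/* Illustrative sketch of the abstract-execution idea for Figure 6.9. */
#include <stdio.h>

/* During execution, record only which successor of statement 1 ran. */
static void traced_fragment(int *i, int *a, int *b, FILE *partial)
{
    int taken = (*i > 5);                 /* statement 1 */
    fputc(taken ? '2' : '3', partial);    /* the only information stored */
    if (taken) *a += *i;                  /* statement 2 */
    else       *b += 1;                   /* statement 3 */
    *i += 1;                              /* statement 4 */
}

/* Later, regenerate the full statement trace from the control flow graph. */
static void regenerate(FILE *partial)
{
    int c;
    while ((c = fgetc(partial)) != EOF)
        printf("1-%c-4\n", c);
}

int main(void)
{
    int i = 4, a = 0, b = 0;
    FILE *partial = tmpfile();            /* stands in for the stored partial trace */
    if (partial == NULL) return 1;
    for (int n = 0; n < 4; n++)
        traced_fragment(&i, &a, &b, partial);
    rewind(partial);
    regenerate(partial);                  /* prints 1-3-4, 1-3-4, 1-2-4, 1-2-4 */
    fclose(partial);
    return 0;
}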
Measurements o f the effectiveness o f this tracing technique have shown that it
slows down the execution o f the program being traced by a factor o f typically 2-
10. This slowdown factor is comparable to, or slightly better than, those o f most
other tracing techniques. More important, however, may be that, by recording
information only about the changes that actually occur during run-time, this
technique is able to reduce the size o f the stored traces by a factor o f ten to
several hundred.

1. if (i > 5)
2.    then a = a + i;
3.    else b = b + 1;
4. i = i + 1;

Figure 6.9 A code fragment to be processed using the abstract execution tracing technique.


Figure 6.10 The control flow graph corresponding to the program fragment shown in
Figure 6.9.

6.4.2.4 Trace sampling


Trace sampling is another approach that has been suggested for reducing the
amount o f information that must be collected and stored when tracing a pro­
gram. The basic idea is to save only relatively small sequences o f events from
locations scattered throughout the trace. The expectation is that these small
samples will be statistically representative of the entire program’s trace when
they are used. For instance, using these samples to drive a simulation should
produce overall results that are similar to what would be produced if the simula­
tion were to be driven with the entire trace.
Consider the sequence o f events from a trace shown in Figure 6.11. Each
sample from this trace consists o f k consecutive events. The number o f events
between the starts o f consecutive samples is the sampling interval, denoted by P.
Since only the samples from the trace are actually recorded, the total amount o f
storage required for the trace can be reduced substantially compared with storing
the entire raw trace.


Figure 6.11 In trace sampling, k consecutive events comprise one sample of the trace. A
new sample is taken every P events (P is called the sampling interval).
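A minimal sketch of this filtering step is shown below, assuming the analyst has already chosen values for k and P; the event encoding and the particular constants are placeholders for the example.

/* Illustrative sketch of trace sampling: keep the first k events of each
   sampling interval of P events. */
#include <stdio.h>

#define K 1000UL    /* events per sample */
#define P 10000UL   /* sampling interval */

static unsigned long event_index = 0;

/* Called for every event the tracer would normally emit. */
static void maybe_record(unsigned long event, FILE *trace)
{
    if (event_index % P < K)              /* inside the current sample window? */
        fwrite(&event, sizeof event, 1, trace);
    event_index++;
}

int main(void)
{
    FILE *trace = fopen("sampled_trace.bin", "wb");
    if (trace == NULL) return 1;
    for (unsigned long e = 0; e < 100000; e++)   /* stand-in for the event stream */
        maybe_record(e, trace);
    fclose(trace);
    return 0;   /* keeps 10 samples of 1,000 events, i.e. 10% of the full trace */
}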


Unfortunately, there is no solid theoretical basis to help the experimenter
determine how many events should be stored for each sample (k), or how large
the sampling interval (P) should be. The best choices for k and P typically must be
determined empirically (i.e. through experimentation). Furthermore, the choice
o f these parameters seems to be dependent on how the traces will be used. If the
traces are used to drive a simulation o f a cache to estimate cache-miss ratios, for
instance, it has been suggested (see Laha et al. (1988)) that, in a trace o f tens of
millions o f memory references, it is adequate to have several thousand events per
sample. The corresponding sampling interval then should be chosen to provide
enough samples such that 5-10% o f the entire trace is recorded. These results,
however, appear to be somewhat dependent on the size o f the cache being simu­
lated. The bottom line is that, while trace sampling appears to be a reasonable
technique for reducing the size o f the trace that must be stored, a solid theoretical
basis still needs to be developed before it can be considered ‘ standard practice.’

6.5 Indirect and ad hoc measurements

Sometimes the performance metric we need is difficult, if not impossible, to
measure directly. In this case, we have to rely on our ingenuity to develop an
ad hoc technique to somehow derive the information indirectly. For instance,
perhaps we are not able to directly measure the desired quantity, but we may be
able to measure another related value directly. We may then be able to deduce
the desired value from these other measured values.
For example, suppose that we wish to determine how much load a particular
application program puts on a system when it is executed. We then may want to
make changes to the program to see how they affect the system load. The first
question we need to confront in this experiment is that of establishing a definition for the ‘system load.’
There are many possible definitions o f the system load, such as the number o f
jobs on the run queue waiting to be executed, to name but one. In our case,
however, we are interested in how much o f the processor’ s available time is spent
executing our application program. Thus, we decide to define the average system
load to be the fraction o f time that the processor is busy executing users’appli­
cation programs.
If we had access to the source code o f the operating system, we could directly
measure this time by modifying the process scheduler. However, it is unlikely
that we will have access to this code. An alternative approach is to directly
measure how much time the processor spends executing an ‘ idle’process that
we create. We then use this direct measurement o f idle time to deduce how much


time the processor must have been busy executing real application programs
during the given measurement interval.
Specifically, consider an ‘idle’program that simply counts up from zero for a
fixed period o f time. If this program is the only application running on a single
processor o f a time-shared system, the final count value at the end o f the mea­
surement interval is the value that indirectly corresponds to an unloaded pro­
cessor. If two applications are executed simultaneously and evenly share the
processor, however, the processor will run our idle measurement program half
as often as when it was the only application running. Consequently, if we allow
both programs to run for the same time interval as when we ran the idle program
by itself, its total count value at the end o f the interval should be half o f the value
observed when only a single copy was executed.
Similarly, if three applications are executed simultaneously and equally share
the processor for the same measurement interval, the final count value in our idle
program should be one-third o f the value observed when it was executed by
itself. This line o f thought can be further extended to n application programs
simultaneously sharing the processor. After calibrating the counter process by
running it by itself on an otherwise unloaded system, it can be used to indirectly
measure the system load.
Example. In a time-shared system, the operating system will share a single
processor evenly among all of the jobs executing in the system. Each available
job is allowed to run for the time slice T_s. After this interval, the currently
executing job is temporarily put to sleep, and the next ready job is switched in
to run. Indirect load monitoring takes advantage of this behavior to estimate the
system load. Initially, the load-monitor program is calibrated by allowing it to
run by itself for a time T, as shown in Figure 6.12(a). At the end of this time, its
counter value, n, is recorded. If the load monitor and another application are run
simultaneously so that in total two jobs are sharing the processor, as shown in
Figure 6.12(b), each job would be expected to be executing for half of the total
time available. Thus, if the load monitor is again allowed to run for time T, we
would expect its final count value to be n/2. Similarly, running the load monitor
with two other applications for time T would result in a final count value of n/3,
as shown in Figure 6.12(c). Consequently, knowing the value of the count after
running the load monitor for time T allows us to deduce what the average load
during the measurement interval must have been.
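A minimal version of such a load-monitor program might look like the sketch below; the 10-second measurement interval is an arbitrary choice for the example, and a production monitor would use a more careful timing source and guard against the compiler optimizing the counting loop away.

/* Illustrative sketch of the indirect load monitor: count as fast as possible
   for a fixed wall-clock interval and report the final count. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    const double T = 10.0;                /* measurement interval in seconds */
    volatile unsigned long count = 0;     /* volatile: keep the loop's side effect */
    time_t start = time(NULL);

    while (difftime(time(NULL), start) < T)
        count++;

    /* Calibrate by running this alone (count = n); with one competing job the
       count drops to roughly n/2, with two to roughly n/3, and so on. */
    printf("count after %.0f seconds: %lu\n", T, count);
    return 0;
}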

6.6 Perturbations due to measuring

One o f the curious (and certainly most annoying!) aspects o f developing tools to
measure computer-systems performance is that instrumenting a system or pro-


Count = n

Count = n/2

Count = n/3

Figure 6.12 An example of using an indirect measurement technique to estimate the
average system load in a time-shared system. The solid lines indicate when each application
is running.

gram changes what we are trying to measure. Obtaining more information, or
obtaining higher resolution measurements, for instance, requires more instru­
mentation points in a program. However, more instrumentation causes there
to be more perturbations in the program than there are in its uninstrumented
execution behavior. These additional perturbations due to the additional instru­
mentation then make the data we collect less reliable. As a result, we are almost
always forced to use insufficient data to infer the behavior o f the system in which
we are interested.
T o further confound the situation, performance perturbations due to instru­
mentation are nonlinear and nonadditive. They are nonlinear in the sense that
doubling the amount o f instrumentation in a program will not necessarily double
its impact on performance, for instance. Similarly, instrumentation perturbation
is nonadditive in the sense that adding more instrumentation can cancel out the
perturbation effects o f other instrumentation. Or, in some situations, additional
instrumentation can multiplicatively increase the perturbations.
For example, adding code to an application program to generate an instruc­
tion trace can significantly change the spatial and temporal patterns o f its mem­
ory accesses. The trace-generation code will cause a large number o f extra store
instructions to be executed, for instance, which can cause the cache to be effec­
tively flushed at each trace point. These frequent cache flushes will then increase
the number o f caches missed, which will substantially impact the overall perfor­
mance. If additional instrumentation is added, however, it may be possible that


the additional memory locations necessary for the instrumentation could change
the pattern o f conflict misses in the cache in such a way as to actually improve
the cache performance perceived by the application. The bottom line is that
the effects o f adding instrumentation to a system being tested are entirely
unpredictable.
Besides these direct changes to a program’ s performance, instrumenting a
program can cause more subtle indirect perturbations. For example, an instru­
mented program will take longer to execute than will the uninstrumented pro­
gram. This increase in execution time will then cause it to experience more
context switches than it would have experienced if it had not been instrumented.
These additional context switches can substantially alter the program’ s paging
behavior, for instance, making the instrumented program behave substantially
differently than the uninstrumented program.

6.7 Summary

Event-driven measurement tools record information about the system being
tested whenever some predefined event occurs, such as a page fault or a network
operation, for instance. The information recorded may be a simple count o f the
number o f times the event occurred, or it may be a portion o f the system’ s state
at the time the event occurred. A time-ordered list o f this recorded state infor­
mation is called a trace. While event-driven tools record all occurrences o f the
defined events, sampling tools query some aspect o f the system’ s state at fixed
time intervals. Since this sampling approach will not record every event, it pro­
vides a statistical view o f the system. Indirect measurement tools are used to
deduce some aspect o f a system’ s performance that it is difficult or impossible to
measure directly.
Some perturbation o f a system’s behavior due to instrumentation is unavoid­
able. Furthermore, and more difficult to compensate for, perhaps, is the unpre­
dictable relationship between the instrumentation and its impact on
performance. Through experience and creative use o f measurement techniques,
the performance analyst can try to minimize the impact o f these perturbations, or
can sometimes compensate for their effects.
It is important to bear in mind, though, that measuring a system alters it.
While you would like to measure a completely uninstrumented program, what
you actually end up measuring is the instrumented system. Consequently, you
must always remain alert to how these perturbations may bias your measure­
ments and, ultimately, the conclusions you are able to draw from your
experiments.

Chapter 7

Notions of Reliability

This chapter contains the book chapter:

J. H. Saltzer and M. F. Kaashoek. Principles of Computer System Design: An Introduction. Part II. Chapters 8.1-4 and 8.6, pp. 8-2 - 8-35 and 8-51 - 8-54 (38 of 826). Creative Commons License, 2009.

An important challenge in providing high availability is architecting computer systems for reliability. The first aspect to be studied refers to characteriz­
ing basic reliability concepts, and estimating quantitative measures of reliability
of given system configurations. Many of these quantitative estimates are based
on underlying statistical assumptions, often of independent random failures. It
is important to discuss the relationship between these assumptions and real-
world behavior, so that the adequate scope and use of reliability metrics can
be determined. A second aspect to be studied are strategies and techniques for
fault-tolerance. Many strategies are based on the idea of selectively introducing
redundancy in data and components. An important class of techniques is the
class of replication techniques. When combined with asynchronous commu­
nication, replication can be an effective tool to increase availability; however,
this benefit is counterbalanced by the impact on the system property of atom­
icity. The ultimate goal of this portion of the material is to arm us with the
basic methodologies for designing services which include appropriate mecha­
nisms for fault-tolerance, including basic replication protocols, as well as allow
us to reflect on the applicability and limitations of the fault-tolerance strategies
employed in a given scenario.
The learning goals for this portion of the material are listed below.

• Identify the hardware scale of highly-available services that is common by current standards.

• Predict reliability of a component configuration based on metrics such as failure probability, MTTF/MTTR/MTBF, availability/downtime.

• Explain how assumptions of reliability models are at odds with observed events.

• Explain and apply common fault-tolerance strategies such as error detection, containment, and masking.

• Explain techniques for redundancy, such as n-version programming, error coding, duplicated components, replication.

• Categorize main variants of replication techniques and implement simple replication protocols.

8.6.1 Design Strategies and Design Principles .......................... 8-51
8.6.2 How about the End-to-End Argument? .............................. 8-52
8.6.3 A Caution on the Use of Reliability Calculations ............... 8-53
8.6.4 Where to Learn More about Reliable Systems ..................... 8-53
8.7 Application: A Fault Tolerance Model for CMOS RAM ................. 8-55
8.8 War Stories: Fault Tolerant Systems that Failed ................... 8-57
8.8.1 Adventures with Error Correction ................................ 8-57
8.8.2 Risks of Rarely-Used Procedures: The National Archives ......... 8-59
8.8.3 Non-independent Replicas and Backhoe Fade ....................... 8-60
8.8.4 Human Error May Be the Biggest Risk ............................. 8-61
8.8.5 Introducing a Single Point of Failure ........................... 8-63
8.8.6 Multiple Failures: The SOHO Mission Interruption ............... 8-63
Exercises ............................................................. 8-64
Glossary for Chapter 8 ................................................ 8-69
Index of Chapter 8 .................................................... 8-75
Last chapter page 8-77

Overview
Construction o f reliable systems from unreliable components is one of the most impor­
tant applications o f modularity. There are, in principle, three basic steps to building
reliable systems:
1. Error detection: discovering that there is an error in a data value or control signal.
Error detection is accomplished with the help of redundancy, extra information
that can verify correctness.
2. Error containment: limiting how far the effects of an error propagate. Error
containment comes from careful application o f modularity. When discussing
reliability, a module is usually taken to be the unit that fails independently o f other
such units. It is also usually the unit o f repair and replacement.
3. Error masking: ensuring correct operation despite the error. Error masking is
accomplished by providing enough additional redundancy that it is possible to
discover correct, or at least acceptably close, values o f the erroneous data or control
signal. When masking involves changing incorrect values to correct ones, it is
usually called error correction.
Since these three steps can overlap in practice, one sometimes finds a single error-han­
dling mechanism that merges two or even all three of the steps.
In earlier chapters each o f these ideas has already appeared in specialized forms:
• A primary purpose of enforced modularity, as provided by client/server
architecture, virtual memory, and threads, is error containment.


• Network links typically use error detection to identify and discard damaged
frames.
• Some end-to-end protocols time out and resend lost data segments, thus
masking the loss.
• Routing algorithms find their way around links that fail, masking those failures.
• Some real-time applications fill in missing data by interpolation or repetition,
thus masking loss.
and, as we will see in Chapter 11 [on-line], secure systems use a technique called defense
in depth both to contain and to mask errors in individual protection mechanisms. In this
chapter we explore systematic application o f these techniques to more general problems,
as well as learn about both their power and their limitations.

8.1 Faults, Failures, and Fault Tolerant Design

8.1.1 Faults, Failures, and Modules


Before getting into the techniques o f constructing reliable systems, let us distinguish
between concepts and give them separate labels. In ordinary English discourse, the three
words “fault,” “failure,” and “error” are used more or less interchangeably or at least with
strongly overlapping meanings. In discussing reliable systems, we assign these terms to
distinct formal concepts. The distinction involves modularity. Although common
English usage occasionally intrudes, the distinctions are worth maintaining in technical
settings.
A fault is an underlying defect, imperfection, or flaw that has the potential to cause
problems, whether it actually has, has not, or ever will. A weak area in the casing o f a tire
is an example o f a fault. Even though the casing has not actually cracked yet, the fault is
lurking. If the casing cracks, the tire blows out, and the car careens off a cliff, the resulting
crash is a failure. (That definition o f the term “ failure”by example is too informal; we
will give a more careful definition in a moment.) One fault that underlies the failure is
the weak spot in the tire casing. Other faults, such as an inattentive driver and lack o f a
guard rail, may also contribute to the failure.
Experience suggests that faults are commonplace in computer systems. Faults come
from many different sources: software, hardware, design, implementation, operations,
and the environment o f the system. Here are some typical examples:
• Software fault: A programming mistake, such as placing a less-than sign where
there should be a less-than-or-equal sign. This fault may never have caused any
trouble because the combination o f events that requires the equality case to be
handled correctly has not yet occurred. Or, perhaps it is the reason that the system
crashes twice a day. If so, those crashes are failures.


• Hardware fault: A gate whose output is stuck at the value ZERO. Until something
depends on the gate correctly producing the output value ONE, nothing goes wrong.
If you publish a paper with an incorrect sum that was calculated by this gate, a
failure has occurred. Furthermore, the paper now contains a fault that may lead
some reader to do something that causes a failure elsewhere.
• Design fault: A miscalculation that has led to installing too little memory in a
telephone switch. It may be months or years until the first time that the presented
load is great enough that the switch actually begins failing to accept calls that its
specification says it should be able to handle.
• Implementation fault: Installing less memory than the design called for. In this
case the failure may be identical to the one in the previous example o f a design
fault, but the fault itself is different.
• Operations fault: The operator responsible for running the weekly payroll ran the
payroll program twice last Friday. Even though the operator shredded the extra
checks, this fault has probably filled the payroll database with errors such as wrong
values for year-to-date tax payments.
• Environment fault: Lightning strikes a power line, causing a voltage surge. The
computer is still running, but a register that was being updated at that instant now
has several bits in error. Environment faults come in all sizes, from bacteria
contaminating ink-jet printer cartridges to a storm surge washing an entire
building out to sea.
Some o f these examples suggest that a fault may either be latent, meaning that it isn’ t
affecting anything right now, or active. When a fault is active, wrong results appear in
data values or control signals. These wrong results are errors. If one has a formal specifi­
cation for the design o f a module, an error would show up as a violation of some assertion
or invariant of the specification. The violation means that either the formal specification
is wrong (for example, someone didn’ t articulate all o f the assumptions) or a module that
this component depends on did not meet its own specification. Unfortunately, formal
specifications are rare in practice, so discovery of errors is more likely to be somewhat ad
hoc.
If an error is not detected and masked, the module probably does not perform to its
specification. Not producing the intended result at an interface is the formal definition
o f a failure. Thus, the distinction between fault and failure is closely tied to modularity
and the building o f systems out o f well-defined subsystems. In a system built o f sub­
systems, the failure o f a subsystem is a fault from the point o f view o f the larger subsystem
that contains it. That fault may cause an error that leads to the failure o f the larger sub­
system, unless the larger subsystem anticipates the possibility of the first one failing,
detects the resulting error, and masks it. Thus, if you notice that you have a flat tire, you
have detected an error caused by failure of a subsystem you depend on. If you miss an
appointment because o f the flat tire, the person you intended to meet notices a failure of


a larger subsystem. If you change to a spare tire in time to get to the appointment, you
have masked the error within your subsystem. Fault tolerance thus consists o f noticing
active faults and component subsystem failures and doing something helpful in response.
One such helpful response is error containment, which is another close relative o f
modularity and the building o f systems out o f subsystems. When an active fault causes
an error in a subsystem, it may be difficult to confine the effects o f that error to just a
portion o f the subsystem. On the other hand, one should expect that, as seen from out­
side that subsystem, the only effects will be at the specified interfaces o f the subsystem.
In consequence, the boundary adopted for error containment is usually the boundary of
the smallest subsystem inside which the error occurred. From the point o f view o f the
next higher-level subsystem, the subsystem with the error may contain the error in one
of four ways:
1. Mask the error, so the higher-level subsystem does not realize that anything went
wrong. One can think o f failure as falling off a cliff and masking as a way of
providing some separation from the edge.
2. Detect and report the error at its interface, producing what is called a fail-fast
design. Fail-fast subsystems simplify the job o f detection and masking for the next
higher-level subsystem. If a fail-fast module correctly reports that its output is
questionable, it has actually met its specification, so it has not failed. (Fail-fast
modules can still fail, for example by not noticing their own errors.)
3. Immediately stop dead, thereby hoping to limit propagation o f bad values, a
technique known as fail-stop. Fail-stop subsystems require that the higher-level
subsystem take some additional measure to discover the failure, for example by
setting a timer and responding to its expiration. A problem with fail-stop design is
that it can be difficult to distinguish a stopped subsystem from one that is merely
running more slowly than expected. This problem is particularly acute in
asynchronous systems.
4. Do nothing, simply failing without warning. At the interface, the error may have
contaminated any or all output values. (Informally called a “crash” or perhaps “fail-thud”.)
Another useful distinction is that of transient versus persistent faults. A transient fault,
also known as a single-event upset, is temporary, triggered by some passing external event
such as lightning striking a power line or a cosmic ray passing through a chip. It is usually
possible to mask an error caused by a transient fault by trying the operation again. An
error that is successfully masked by retry is known as a soft error. A persistent fault contin­
ues to produce errors, no matter how many times one retries, and the corresponding
errors are called hard errors. An intermittent fault is a persistent fault that is active only
occasionally, for example, when the noise level is higher than usual but still within spec­
ifications. Finally, it is sometimes useful to talk about latency, which in reliability
terminology is the time between when a fault causes an error and when the error is


detected or causes the module to fail. Latency can be an important parameter because
some error-detection and error-masking mechanisms depend on there being at most a
small fixed number o f errors— often just one— at a time. If the error latency is large,
there may be time for a second error to occur before the first one is detected and masked,
in which case masking o f the first error may not succeed. Also, a large error latency gives
time for the error to propagate and may thus complicate containment.
Using this terminology, an improperly fabricated stuck-at-ZERO bit in a memory chip
is a persistent fault: whenever the bit should contain a ONE the fault is active and the value
of the bit is in error; at times when the bit is supposed to contain a ZERO, the fault is latent.
If the chip is a component of a fault tolerant memory module, the module design prob­
ably includes an error-correction code that prevents that error from turning into a failure
of the module. If a passing cosmic ray flips another bit in the same chip, a transient fault
has caused that bit also to be in error, but the same error-correction code may still be able
to prevent this error from turning into a module failure. On the other hand, if the error-
correction code can handle only single-bit errors, the combination of the persistent and
the transient fault might lead the module to produce wrong data across its interface, a
failure of the module. If someone were then to test the module by storing new data in it
and reading it back, the test would probably not reveal a failure because the transient
fault does not affect the new data. Because simple input/output testing does not reveal
successfully masked errors, a fault tolerant module design should always include some
way to report that the module masked an error. If it does not, the user of the module may
not realize that persistent errors are accumulating but hidden.

8.1.2 The Fault-Tolerance Design Process


One way to design a reliable system would be to build it entirely of components that are
individually so reliable that their chance of failure can be neglected. This technique is
known as fault avoidance. Unfortunately, it is hard to apply this technique to every
component of a large system. In addition, the sheer number of components may defeat
the strategy. If all N of the components of a system must work, the probability of any one
component failing is p, and component failures are independent of one another, then the
probability that the system works is (1 - p)^N. No matter how small p may be, there is
some value of N beyond which this probability becomes too small for the system to be
useful.
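For example, even a seemingly negligible per-component failure probability of p = 10⁻⁴ does not survive scale: with N = 100,000 components, the probability that every component works is (1 - 10⁻⁴)^N ≈ e^(-Np) = e^(-10) ≈ 4.5 × 10⁻⁵, so the system as a whole almost certainly fails.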
The alternative is to apply various techniques that are known collectively by the name
fault tolerance. The remainder of this chapter describes several such techniques that are
the elements of an overall design process for building reliable systems from unreliable
components. Here is an overview of the fault-tolerance design process:
1. Begin to develop a fault-tolerance model, as described in Section 8.3:
• Identify every potential fault.
• Estimate the risk of each fault, as described in Section 8.2.
• Where the risk is too high, design methods to detect the resulting errors.


2. Apply modularity to contain the damage from the high-risk errors.


3. Design and implement procedures that can mask the detected errors, using the
techniques described in Section 8.4:
• Temporal redundancy. Retry the operation, using the same components.
• Spatial redundancy. Have different components do the operation.
4. Update the fault-tolerance model to account for those improvements.
5. Iterate the design and the model until the probability of untolerated faults is low
enough that it is acceptable.
6. Observe the system in the field:
• Check logs o f how many errors the system is successfully masking. (Always keep
track of the distance to the edge o f the cliff.)
• Perform postmortems on failures and identify all o f the reasons for each failure.
7. Use the logs o f masked faults and the postmortem reports about failures to revise
and improve the fault-tolerance model and reiterate the design.
The fault-tolerance design process includes some subjective steps, for example, decid­
ing that a risk o f failure is “
unacceptably high”or that the “
probability o f an untolerated
fault is low enough that it is acceptable.”It is at these points that different application
requirements can lead to radically different approaches to achieving reliability. A per­
sonal computer may be designed with no redundant components, the computer system
for a small business is likely to make periodic backup copies o f most o f its data and store
the backup copies at another site, and some space-flight guidance systems use five com­
pletely redundant computers designed by at least two independent vendors. The
decisions required involve trade-offs between the cost o f failure and the cost o f imple­
menting fault tolerance. These decisions can blend into decisions involving business
models and risk management. In some cases it may be appropriate to opt for a nontech­
nical solution, for example, deliberately accepting an increased risk o f failure and
covering that risk with insurance.
The fault-tolerance design process can be described as a safety-net approach to system
design. The safety-net approach involves application o f some familiar design principles
and also some not previously encountered. It starts with a new design principle:

Be explicit
Get all of the assumptions out on the table.

The primary purpose of creating a fault-tolerance model is to expose and document the
assumptions and articulate them explicitly. The designer needs to have these assump­
tions not only for the initial design, but also in order to respond to field reports of


unexpected failures. Unexpected failures represent omissions or violations o f the


assumptions.
Assuming that you won’ t get it right the first time, the second design principle of the
safety-net approach is the familiar design for iteration. It is difficult or impossible to antic­
ipate all o f the ways that things can go wrong. Moreover, when working with a fast­
changing technology it can be hard to estimate probabilities o f failure in components and
in their organization, especially when the organization is controlled by software. For
these reasons, a fault tolerant design must include feedback about actual error rates, eval­
uation o f that feedback, and update o f the design as field experience is gained. These two
principles interact: to act on the feedback requires having a fault tolerance model that is
explicit about reliability assumptions.
The third design principle o f the safety-net approach is also familiar: the safety margin
principle, described near the end of Section 1.3.2. An essential part o f a fault tolerant
design is to monitor how often errors are masked. When fault tolerant systems fail, it is
usually not because they had inadequate fault tolerance, but because the number o f fail­
ures grew unnoticed until the fault tolerance o f the design was exceeded. The key
requirement is that the system log all failures and that someone pay attention to the logs.
The biggest difficulty to overcome in applying this principle is that it is hard to motivate
people to expend effort checking something that seems to be working.
The fourth design principle o f the safety-net approach came up in the introduction
to the study of systems; it shows up here in the instruction to identify all o f the causes of
each failure: keep digging. Complex systems fail for complex reasons. When a failure o f a
system that is supposed to be reliable does occur, always look beyond the first, obvious
cause. It is nearly always the case that there are actually several contributing causes and
that there was something about the mind set o f the designer that allowed each o f those
causes to creep in to the design.
Finally, complexity increases the chances of mistakes, so it is an enemy o f reliability.
The fifth design principle embodied in the safety-net approach is to adopt sweeping sim ­
plifications. This principle does not show up explicitly in the description o f the fault-
tolerance design process, but it will appear several times as we go into more detail.
The safety-net approach is applicable not just to fault tolerant design. Chapter 11 [on­
line] will show that the safety-net approach is used in an even more rigorous form in
designing systems that must protect information from malicious actions.

8.2 Measures of Reliability and Failure Tolerance

8.2.1 Availability and Mean Time to Failure


A useful model o f a system or a system component, from a reliability point o f view, is
that it operates correctly for some period of time and then it fails. The time to failure
(TTF) is thus a measure of interest, and it is something that we would like to be able to
predict. If a higher-level module does not mask the failure and the failure is persistent,


the system cannot be used until it is repaired, perhaps by replacing the failed component,
so we are equally interested in the time to repair (TTR). If we observe a system through
N run-fail-repair cycles and observe in each cycle i the values of TTF_i and TTR_i, we can
calculate the fraction of time it operated properly, a useful measure known as availability:

Availability = \frac{\text{time system was running}}{\text{time system should have been running}}
             = \frac{\sum_{i=1}^{N} TTF_i}{\sum_{i=1}^{N} (TTF_i + TTR_i)}    (Eq. 8-1)
By separating the denominator o f the availability expression into two sums and dividing
each by N (the number o f observed failures) we obtain two time averages that are fre­
quently reported as operational statistics: the mean time to failure (MTTF) and the mean
time to repair (MTTR):

MTTF = \frac{1}{N} \sum_{i=1}^{N} TTF_i \qquad MTTR = \frac{1}{N} \sum_{i=1}^{N} TTR_i    (Eq. 8-2)

The sum of these two statistics is usually called the mean time between failures (MTBF).
Thus availability can be variously described as

Availability = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR} = \frac{MTBF - MTTR}{MTBF}    (Eq. 8-3)

In some situations, it is more useful to measure the fraction o f time that the system is not
working, known as its down time.

Down time = (1 - Availability) = \frac{MTTR}{MTBF}    (Eq. 8-4)

One thing that the definition of down time makes clear is that MTTR and MTBF are
in some sense equally important. One can reduce down time either by reducing MTTR
or by increasing MTBF.
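As a worked example, a component with MTTF = 1,000 hours and MTTR = 1 hour has MTBF = 1,001 hours and availability = 1,000/1,001 ≈ 0.999, that is, about 99.9%. Its down time is 1/1,001 ≈ 0.1%, which over a year of roughly 8,766 hours amounts to nearly 9 hours of unavailability; halving the MTTR or doubling the MTTF would each cut that figure roughly in half.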
Components are often repaired by simply replacing them with new ones. When failed
components are discarded rather than fixed and returned to service, it is common to use
a slightly different method to measure MTTF. The method is to place a batch of N com­
ponents in service in different systems (or in what is hoped to be an equivalent test
environment), run them until they have all failed, and use the set of failure times as the
TTF_i in equation 8-2. This procedure substitutes an ensemble average for the time aver­
age. We could use this same procedure on components that are not usually discarded
when they fail, in the hope of determining their MTTF more quickly, but we might
obtain a different value for the MTTF. Some failure processes do have the property that
the ensemble average is the same as the time average (processes with this property are


called ergodic), but other failure processes do not. For example, the repair itself may cause
wear, tear, and disruption to other parts o f the system, in which case each successive sys­
tem failure might on average occur sooner than did the previous one. If that is the case,
an MTTF calculated from an ensemble-average measurement might be too optimistic.
As we have defined them, availability, MTTF, MTTR, and MTBF are backward­
looking measures. They are used for two distinct purposes: (1) for evaluating how the
system is doing (compared, for example, with predictions made when the system was
designed) and (2 ) for predicting how the system will behave in the future. The first pur­
pose is concrete and well defined. The second requires that one take on faith that samples
from the past provide an adequate predictor o f the future, which can be a risky assump­
tion. There are other problems associated with these measures. While M TTR can usually
be measured in the field, the more reliable a component or system the longer it takes to
evaluate its MTTF, so that measure is often not directly available. Instead, it is common
to use and measure proxies to estimate its value. The quality o f the resulting estimate of
availability then depends on the quality o f the proxy.
A typical 3.5-inch magnetic disk comes with a reliability specification o f 300,000
hours “MTTF”, which is about 34 years. Since the company quoting this number has
probably not been in business that long, it is apparent that whatever they are calling
“MTTF” is not the same as either the time-average or the ensemble-average MTTF that
we just defined. It is actually a quite different statistic, which is why we put quotes
around its name. Sometimes this “MTTF” is a theoretical prediction obtained by mod­
eling the ways that the components o f the disk might be expected to fail and calculating
an expected time to failure.
A more likely possibility is that the manufacturer measured this “MTTF” by running
an array o f disks simultaneously for a much shorter time and counting the number of
failures. For example, suppose the manufacturer ran 1,000 disks for 3,000 hours (about
four months) each, and during that time 10 o f the disks failed. The observed failure rate
o f this sample is 1 failure for every 300,000 hours o f operation. The next step is to invert
the failure rate to obtain 300,000 hours of operation per failure and then quote this num­
ber as the “MTTF”. But the relation between this sample observation of failure rate and
the real MTTF is problematic. If the failure process were memoryless (meaning that the
failure rate is independent o f time; Section 8.2.2, below, explores this idea more thor­
oughly), we would have the special case in which the MTTF really is the inverse o f the
failure rate. A good clue that the disk failure process is not memoryless is that the disk
specification may also mention an “ expected operational lifetime”of only 5 years. That
statistic is probably the real M TTF— though even that may be a prediction based on
modeling rather than a measured ensemble average. An appropriate re-interpretation of
the 34-year “ M TTF”statistic is to invert it and identify the result as a short-term failure
rate that applies only within the expected operational lifetime. The paragraph discussing
equation 8-9 on page 8-13 describes a fallacy that sometimes leads to miscalculation of
statistics such as the MTTF.
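
As a small check of the arithmetic behind such a specification, the following Python
sketch computes the observed failure rate from a hypothetical batch test like the one
above and the “MTTF” obtained by inverting it; the inversion yields the real MTTF only
under the memoryless assumption discussed in Section 8.2.2.

    # Hypothetical batch test, matching the numbers in the example above.
    disks = 1000
    hours_per_disk = 3000
    failures = 10

    device_hours = disks * hours_per_disk      # 3,000,000 device-hours of operation
    failure_rate = failures / device_hours     # failures per device-hour
    quoted_mttf = 1 / failure_rate             # 300,000 hours, about 34 years

    print(f"observed failure rate: {failure_rate:.2e} failures per hour")
    print(f"quoted 'MTTF': {quoted_mttf:,.0f} hours (about {quoted_mttf / 8760:.0f} years)")
    # Inverting a short-term failure rate gives the real MTTF only if the failure
    # process is memoryless; the 5-year expected operational lifetime suggests it is not.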
Magnetic disks, light bulbs, and many other components exhibit a time-varying sta­
tistical failure rate known as a bathtub curve, illustrated in Figure 8.1 and defined more
carefully in Section 8.2.2, below. When components come off the production line, a cer­
tain fraction fail almost immediately because of gross manufacturing defects. Those
components that survive this initial period usually run for a long time with a relatively
uniform failure rate. Eventually, accumulated wear and tear cause the failure rate to
increase again, often quite rapidly, producing a failure rate plot that resembles the shape
o f a bathtub.
Several other suggestive and colorful terms describe these phenomena. Components
that fail early are said to be subject to infant mortality, and those that fail near the end of
their expected lifetimes are said to burn out. Manufacturers sometimes burn in such com­
ponents by running them for a while before shipping, with the intent o f identifying and
discarding the ones that would otherwise fail immediately upon being placed in service.
When a vendor quotes an “ expected operational lifetime,”it is probably the mean time
to failure o f those components that survive burn in, while the much larger “ M TTF”
number is probably the inverse o f the observed failure rate at the lowest point o f the bath­
tub. (The published numbers also sometimes depend on the outcome o f a debate
between the legal department and the marketing department, but that gets us into a dif­
ferent topic.) A chip manufacturer describes the fraction o f components that survive the
burn-in period as the yield of the production line. Component manufacturers usually
exhibit a phenomenon known informally as a learning curve, which simply means that
the first components coming out of a new production line tend to have more failures
than later ones. The reason is that manufacturers design for iteration: upon seeing and
analyzing failures in the early production batches, the production line designer figures
out how to refine the manufacturing process to reduce the infant mortality rate.
One job o f the system designer is to exploit the nonuniform failure rates predicted by
the bathtub and learning curves. For example, a conservative designer exploits the learn­
ing curve by avoiding the latest generation o f hard disks in favor of slightly older designs
that have accumulated more field experience. One can usually rely on other designers
who may be concerned more about cost or performance than availability to shake out the
bugs in the newest generation o f disks.

FIGURE 8.1
A bathtub curve, showing how the conditional failure rate of a component changes with time.

The 34-year “ M TTF”disk drive specification may seem like public relations puffery
in the face o f the specification of a 5-year expected operational lifetime, but these two
numbers actually are useful as a measure o f the nonuniformity o f the failure rate. This
nonuniformity is also susceptible to exploitation, depending on the operation plan. If the
operation plan puts the component in a system such as a satellite, in which it will run
until it fails, the designer would base system availability and reliability estimates on the
5-year figure. On the other hand, the designer o f a ground-based storage system, mindful
that the 5-year operational lifetime identifies the point where the conditional failure rate
starts to climb rapidly at the far end o f the bathtub curve, might include a plan to replace
perfectly good hard disks before burn-out begins to dominate the failure rate— in this
case, perhaps every 3 years. Since one can arrange to do scheduled replacement at conve­
nient times, for example, when the system is down for another reason, or perhaps even
without bringing the system down, the designer can minimize the effect on system avail­
ability. The manufacturer’ s 34-year “ M TTF” , which is probably the inverse o f the
observed failure rate at the lowest point o f the bathtub curve, then can be used as an esti­
mate o f the expected rate o f unplanned replacements, although experience suggests that
this specification may be a bit optimistic. Scheduled replacements are an example of pre­
ventive maintenance, which is active intervention intended to increase the mean time to
failure of a module or system and thus improve availability.
For some components, observed failure rates are so low that MTTF is estimated by
accelerated aging. This technique involves making an educated guess about what the
dominant underlying cause o f failure will be and then amplifying that cause. For exam­
ple, it is conjectured that failures in recordable Compact Disks are heat-related. A typical
test scenario is to store batches of recorded CDs at various elevated temperatures for sev­
eral months, periodically bringing them out to test them and count how many have
failed. One then plots these failure rates versus temperature and extrapolates to estimate
what the failure rate would have been at room temperature. Again making the assump­
tion that the failure process is memoryless, that failure rate is then inverted to produce
an MTTF. Published MTTFs o f 100 years or more have been obtained this way. If the
dominant fault mechanism turns out to be something else (such as bacteria munching
on the plastic coating) or if after 50 years the failure process turns out not to be memo­
ryless after all, an estimate from an accelerated aging study may be far wide of the mark.
A designer must use such estimates with caution and understanding o f the assumptions
that went into them.
Availability is sometimes discussed by counting the number of nines in the numerical
representation o f the availability measure. Thus a system that is up and running 99.9%
o f the time is said to have 3-nines availability. Measuring by nines is often used in mar­
keting because it sounds impressive. A more meaningful number is usually obtained by
calculating the corresponding down time. A 3-nines system can be down nearly 1.5 min­
utes per day or 8 hours per year, a 5-nines system 5 minutes per year, and a 7-nines
system only 3 seconds per year. Another problem with measuring by nines is that it tells
only about availability, without any information about MTTF. One 3-nines system may
have a brief failure every day, while a different 3-nines system may have a single eight
hour outage once a year. Depending on the application, the difference between those two
systems could be important. Any single measure should always be suspect.
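
The translation from nines to down time is a one-line calculation; a minimal Python
sketch (the helper name is ours):

    def downtime_minutes_per_year(nines):
        """Allowed down time, in minutes per year, for a system with N-nines availability."""
        unavailability = 10 ** (-nines)
        return unavailability * 365 * 24 * 60

    for n in (3, 5, 7):
        print(f"{n}-nines availability: about {downtime_minutes_per_year(n):.2f} minutes per year")
    # 3 nines: about 526 minutes (nearly 1.5 minutes per day, or about 8.8 hours per year);
    # 5 nines: about 5 minutes per year; 7 nines: about 3 seconds per year.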
Finally, availability can be a more fine-grained concept. Some systems are designed
so that when they fail, some functions (for example, the ability to read data) remain avail­
able, while others (the ability to make changes to the data) are not. Systems that continue
to provide partial service in the face o f failure are called fail-soft, a concept defined more
carefully in Section 8.3.

8.2.2 Reliability Functions


The bathtub curve expresses the conditional failure rate h(t) of a module, defined to be
the probability that the module fails between time t and time t + dt, given that the com-
ponent is still working at time t. The conditional failure rate is only one of several closely
related ways of describing the failure characteristics of a component, module, or system.
The reliability, R, of a module is defined to be

    R(t) = Pr(the module has not yet failed at time t,
              given that the module was operating at time 0)            Eq. 8-5

and the unconditional failure rate f(t) is defined to be

    f(t) dt = Pr(the module fails between t and t + dt)                 Eq. 8-6

(The bathtub curve and these two reliability functions are three ways of presenting the
same information. If you are rusty on probability, a brief reminder of how they are
related appears in Sidebar 8.1.) Once f(t) is at hand, one can directly calculate the
MTTF:

    MTTF = ∫_0^∞ t·f(t) dt                                              Eq. 8-7

One must keep in mind that this MTTF is predicted from the failure rate function f(t),
in contrast to the MTTF of eq. 8-2, which is the result of a field measurement. The two
MTTFs will be the same only if the failure model embodied in f(t) is accurate.
Some components exhibit relatively uniform failure rates, at least for the lifetime o f
the system o f which they are a part. For these components the conditional failure rate,
rather than resembling a bathtub, is a straight horizontal line, and the reliability function
becomes a simple declining exponential:

    R(t) = e^(−t/MTTF)                                                  Eq. 8-8


This reliability function is said to be memoryless, which simply means that the conditional
failure rate is independent of how long the component has been operating. Memoryless
failure processes have the nice property that the conditional failure rate is the inverse o f
the MTTF.
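
To make the relationship between f(t), R(t), and the MTTF concrete, the following
sketch evaluates the memoryless case numerically. The MTTF value, step size, and
integration horizon are arbitrary choices for the illustration.

    import math

    MTTF = 6000.0                       # hours; an assumed value for the illustration

    def R(t):                           # reliability: probability of surviving to time t
        return math.exp(-t / MTTF)

    def f(t):                           # unconditional failure rate (a probability density)
        return math.exp(-t / MTTF) / MTTF

    # Approximate MTTF = integral of t * f(t) dt from 0 out to a horizon where the
    # remaining tail is negligible (eq. 8-7).
    dt = 1.0
    horizon = int(30 * MTTF)
    estimate = sum(t * f(t) * dt for t in range(horizon))
    print(f"numerical MTTF estimate: {estimate:,.0f} hours")     # close to 6,000

    # For a memoryless process the conditional failure rate h(t) = f(t)/R(t) is constant
    # and equal to 1/MTTF, so inverting the failure rate recovers the MTTF.
    print(f"h(100) = {f(100) / R(100):.6f}   1/MTTF = {1 / MTTF:.6f}")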
Unfortunately, as we saw in the case of the disks with the 34-year “MTTF”, this prop-
erty is sometimes misappropriated to quote an MTTF for a component whose

Sidebar 8.1: Reliability functions The failure rate function, the reliability function, and the
bathtub curve (which in probability texts is called the conditional failure rate function, and
which in operations research texts is called the hazard function) are actually three
mathematically related ways of describing the same information. The failure rate function, f(t),
as defined in equation 8-6, is a probability density function, which is everywhere non-negative
and whose integral over all time is 1. Integrating the failure rate function from the time the
component was created (conventionally taken to be t = 0) to the present time yields

    F(t) = ∫_0^t f(t) dt

F(t) is the cumulative probability that the component has failed by time t. The cumulative
probability that the component has not failed is the probability that it is still operating at time
t given that it was operating at time 0, which is exactly the definition of the reliability function,
R(t). That is,

    R(t) = 1 − F(t)

The bathtub curve of Figure 8.1 reports the conditional probability h(t) that a failure occurs
between t and t + dt, given that the component was operating at time t. By the definition of
conditional probability, the conditional failure rate function is thus

    h(t) = f(t) / R(t)

conditional failure rate does change with time. This misappropriation starts with a fal­
lacy: an assumption that the MTTF, as defined in eq. 8-7, can be calculated by inverting
the measured failure rate. The fallacy arises because in general,

    E(1/t) ≠ 1 / E(t)                                                   Eq. 8-9

That is, the expected value o f the inverse is not equal to the inverse o f the expected value,
except in certain special cases. The important special case in which they are equal is the
memoryless distribution o f eq. 8- 8 . When a random process is memoryless, calculations
and measurements are so much simpler that designers sometimes forget that the same
simplicity does not apply everywhere.
Just as availability is sometimes expressed in an oversimplified way by counting the
number o f nines in its numerical representation, reliability in component manufacturing
is sometimes expressed in an oversimplified way by counting standard deviations in the
observed distribution o f some component parameter, such as the maximum propagation
time of a gate. The usual symbol for standard deviation is the Greek letter σ (sigma), and
a normal distribution has a standard deviation of 1.0, so saying that a component has
“4.5 σ reliability” is a shorthand way of saying that the production line controls varia-
tions in that parameter well enough that the specified tolerance is 4.5 standard deviations
away from the mean value, as illustrated in Figure 8.2. Suppose, for example, that a pro-
duction line is manufacturing gates that are specified to have a mean propagation time
of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds with 4.5 σ
reliability. The difference between the mean and the maximum, 1.8 nanoseconds, is the
tolerance. For that tolerance to be 4.5 σ, σ would have to be no more than 0.4 nanosec-
onds. To meet the specification, the production line designer would measure the actual
propagation times of production line samples and, if the observed standard deviation is
greater than 0.4 ns, look for ways to reduce it to that level.
Another way of interpreting “4.5 σ reliability” is to calculate the expected fraction of
components that are outside the specified tolerance. That fraction is the integral of one
tail of the normal distribution from 4.5 σ to ∞, which is about 3.4 × 10⁻⁶, so in our exam-
ple no more than 3.4 out of each million gates manufactured would have delays greater
than 11.8 nanoseconds. Unfortunately, this measure describes only the failure rate of the
production line; it does not say anything about the failure rate of the component after it
is installed in a system.
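
The 3.4 × 10⁻⁶ figure is just the area of one tail of the normal distribution beyond 4.5
standard deviations, which can be computed directly (a minimal sketch):

    import math

    def upper_tail(sigmas):
        """Area of one tail of the standard normal distribution beyond `sigmas`."""
        return 0.5 * math.erfc(sigmas / math.sqrt(2))

    print(f"fraction beyond 4.5 sigma: {upper_tail(4.5):.2e}")   # about 3.4e-06

    # Applied to the gate example: a tolerance of 11.8 - 10.0 = 1.8 ns at 4.5 sigma
    # requires sigma to be no more than 1.8 / 4.5 = 0.4 ns.
    print(f"required sigma: {(11.8 - 10.0) / 4.5:.2f} ns")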
A currently popular quality control method, known as “Six Sigma”, is an application
of two of our design principles to the manufacturing process. The idea is to use measure-
ment, feedback, and iteration (design for iteration: “you won’t get it right the first time”)
to reduce the variance (the robustness principle, “be strict on outputs”) of production-line
manufacturing. The “Six Sigma” label is somewhat misleading because in the application
of the method, the number 6 is allocated to deal with two quite different effects. The
method sets a target of controlling the production line variance to the level of 4.5 σ, just
as in the gate example of Figure 8.2. The remaining 1.5 σ is the amount that the mean
output value is allowed to drift away from its original specification over the life of the

FIGURE 8.2
The normal probability density function applied to production of gates that are specified to have
mean propagation time of 10 nanoseconds and maximum propagation time of 11.8 nanosec-
onds. The upper numbers on the horizontal axis measure the distance from the mean in units
of the standard deviation, σ. The lower numbers depict the corresponding propagation times.
The integral of the tail from 4.5 σ to ∞ is so small that it is not visible in this figure.

production line. So even though the production line may start 6 σ away from the toler-
ance limit, after it has been operating for a while one may find that the failure rate has
drifted upward to the same 3.4 in a million calculated for the 4.5 σ case.
In manufacturing quality control literature, these applications o f the two design prin­
ciples are known as Taguchi methods, after their popularizer, Genichi Taguchi.

8.2.3 Measuring Fault Tolerance


It is sometimes useful to have a quantitative measure o f the fault tolerance of a system.
One common measure, sometimes called the failure tolerance, is the number o f failures
o f its components that a system can tolerate without itself failing. Although this label
could be ambiguous, it is usually clear from context that a measure is being discussed.
Thus a memory system that includes single-error correction (Section 8.4 describes how
error correction works) has a failure tolerance o f one bit.
When a failure occurs, the remaining failure tolerance o f the system goes down. The
remaining failure tolerance is an important thing to monitor during operation o f the sys­
tem because it shows how close the system as a whole is to failure. One o f the most
common system design mistakes is to add fault tolerance but not include any monitoring
to see how much o f the fault tolerance has been used up, thus ignoring the safety margin
principle. When systems that are nominally fault tolerant do fail, later analysis invariably
discloses that there were several failures that the system successfully masked but that
somehow were never reported and thus were never repaired. Eventually, the total num­
ber o f failures exceeded the designed failure tolerance o f the system.
Failure tolerance is actually a single number in only the simplest situations. Some­
times it is better described as a vector, or even as a matrix showing the specific
combinations o f different kinds of failures that the system is designed to tolerate. For
example, an electric power company might say that it can tolerate the failure o f up to
15% of its generating capacity, at the same time as the downing o f up to two o f its main
transmission lines.

8.3 Tolerating Active Faults

8.3.1 Responding to Active Faults


In dealing with active faults, the designer o f a module can provide one o f several
responses:
• D o nothing. The error becomes a failure o f the module, and the larger system or
subsystem of which it is a component inherits the responsibilities both of
discovering and of handling the problem. The designer o f the larger subsystem
then must choose which o f these responses to provide. In a system with several
layers of modules, failures may be passed up through more than one layer before
being discovered and handled. As the number of do-nothing layers increases,
containment generally becomes more and more difficult.
• Be fail-fast. The module reports at its interface that something has gone wrong.
This response also turns the problem over to the designer o f the next higher-level
system, but in a more graceful way. Example: when an Ethernet transceiver detects
a collision on a frame it is sending, it stops sending as quickly as possible,
broadcasts a brief jamming signal to ensure that all network participants quickly
realize that there was a collision, and it reports the collision to the next higher level,
usually a hardware module o f which the transceiver is a component, so that the
higher level can consider resending that frame.
• Be fail-safe. The module transforms any value or values that are incorrect to values
that are known to be acceptable, even if not right or optimal. An example is a
digital traffic light controller that, when it detects a failure in its sequencer,
switches to a blinking red light in all directions. Chapter 11 [on-line] discusses
systems that provide security. In the event of a failure in a secure system, the safest
thing to do is usually to block all access. A fail-safe module designed to do that is
said to be fail-secure.
• Be fail-soft. The system continues to operate correctly with respect to some
predictably degraded subset of its specifications, perhaps with some features
missing or with lower performance. For example, an airplane with three engines
can continue to fly safely, albeit more slowly and with less maneuverability, if one
engine fails. A file system that is partitioned into five parts, stored on five different
small hard disks, can continue to provide access to 80% o f the data when one of
the disks fails, in contrast to a file system that employs a single disk five times as
large.
• Mask the error. Any value or values that are incorrect are made right and the
module meets its specification as if the error had not occurred.
We will concentrate on masking errors because the techniques used for that purpose can
be applied, often in simpler form, to achieving a fail-fast, fail-safe, or fail-soft system.
As a general rule, one can design algorithms and procedures to cope only with spe­
cific, anticipated faults. Further, an algorithm or procedure can be expected to cope only
with faults that are actually detected. In most cases, the only workable way to detect a
fault is by noticing an incorrect value or control signal; that is, by detecting an error.
Thus when trying to determine if a system design has adequate fault tolerance, it is help­
ful to classify errors as follows:
• A detectable error is one that can be detected reliably. If a detection procedure is
in place and the error occurs, the system discovers it with near certainty and it
becomes a detected error.

• A maskable error is one for which it is possible to devise a procedure to recover
correctness. If a masking procedure is in place and the error occurs, is detected,
and is masked, the error is said to be tolerated.
• Conversely, an untolerated error is one that is undetectable, undetected,
unmaskable, or unmasked.
An untolerated error usually leads to a failure of the system. (“Usually,” because
we could get lucky and still produce a correct output, either because the error values
didn’t actually matter under the current conditions, or some measure intended to
mask a different error incidentally masks this one, too.) This classification of errors is
illustrated in Figure 8.3.
A subtle consequence of the concept of a maskable error is that there must be a
well-defined boundary around that part of the system state that might be in error. The
masking procedure must restore all of that erroneous state to correctness, using infor-
mation that has not been corrupted by the error. The real meaning of detectable, then,
is that the error is discovered before its consequences have propagated beyond some
specified boundary. The designer usually chooses this boundary to coincide with that
of some module and designs that module to be fail-fast (that is, it detects and reports its
own errors). The system of which the module is a component then becomes
responsible for masking the failure of the module.

FIGURE 8.3
Classification of errors. Arrows lead from a category to mutually exclusive subcatego-
ries. For example, unmasked errors include both unmaskable errors and maskable errors
that the designer decides not to mask. (In the figure, errors divide into undetectable and
detectable; detectable errors into undetected and detected; detected errors into unmaskable
and maskable; and maskable errors into unmasked and masked, of which only masked
errors are tolerated.)

8.3.2 Fault Tolerance Models


The distinctions among detectable, detected, maskable, and tolerated errors allow us to
specify for a system a fault tolerance model, one of the components of the fault tolerance
design process described in Section 8.1.2, as follows:
1. Analyze the system and categorize possible error events into those that can be
reliably detected and those that cannot. At this stage, detectable or not, all errors
are untolerated.

2. For each undetectable error, evaluate the probability o f its occurrence. If that
probability is not negligible, modify the system design in whatever way necessary
to make the error reliably detectable.
3. For each detectable error, implement a detection procedure and reclassify the
module in which it is detected as fail-fast.
4. For each detectable error try to devise a way o f masking it. If there is a way,
reclassify this error as a maskable error.
5. For each maskable error, evaluate its probability o f occurrence, the cost o f failure,
and the cost o f the masking method devised in the previous step. If the evaluation
indicates it is worthwhile, implement the masking method and reclassify this error
as a tolerated error.
When finished developing such a model, the designer should have a useful fault tol­
erance specification for the system. Some errors, which have negligible probability of
occurrence or for which a masking measure would be too expensive, are identified as
untolerated. When those errors occur the system fails, leaving its users to cope with the
result. Other errors have specified recovery algorithms, and when those occur the system
should continue to run correctly. A review o f the system recovery strategy can now focus
separately on two distinct questions:
• Is the designer’s list o f potential error events complete, and is the assessment of
the probability o f each error realistic?
• Is the designer’ s set of algorithms, procedures, and implementations that are
supposed to detect and mask the anticipated errors complete and correct?
These two questions are different. The first is a question o f models of the real world.
It addresses an issue of experience and judgment about real-world probabilities and
whether all real-world modes o f failure have been discovered or some have gone unno­
ticed. Two different engineers, with different real-world experiences, may reasonably
disagree on such judgments— they may have different models o f the real world. The eval­
uation o f modes o f failure and of probabilities is a point at which a designer may easily
go astray because such judgments must be based not on theory but on experience in the
field, either personally acquired by the designer or learned from the experience o f others.
A new technology, or an old technology placed in a new environment, is likely to create
surprises. A wrong judgment can lead to wasted effort devising detection and masking
algorithms that will rarely be invoked rather than the ones that are really needed. On the
other hand, if the needed experience is not available, all is not lost: the iteration part of
the design process is explicitly intended to provide that experience.
The second question is more abstract and also more absolutely answerable, in that an
argument for correctness (unless it is hopelessly complicated) or a counterexample to that
argument should be something that everyone can agree on. In system design, it is helpful
to follow design procedures that distinctly separate these classes o f questions. When
someone questions a reliability feature, the designer can first ask, “Are you questioning
the correctness of my recovery algorithm or are you questioning my model o f what may
fail?”and thereby properly focus the discussion or argument.
Creating a fault tolerance model also lays the groundwork for the iteration part of the
fault tolerance design process. If a system in the field begins to fail more often than
expected, or completely unexpected failures occur, analysis o f those failures can be com­
pared with the fault tolerance model to discover what has gone wrong. By again asking
the two questions marked with bullets above, the model allows the designer to distin­
guish between, on the one hand, failure probability predictions being proven wrong by
field experience, and on the other, inadequate or misimplemented masking procedures.
With this information the designer can work out appropriate adjustments to the model
and the corresponding changes needed for the system.
Iteration and review o f fault tolerance models is also important to keep them up to
date in the light o f technology changes. For example, the Network File System described
in Section 4.4 was first deployed using a local area network, where packet loss errors are
rare and may even be masked by the link layer. When later users deployed it on larger
networks, where lost packets are more common, it became necessary to revise its fault
tolerance model and add additional error detection in the form o f end-to-end check­
sums. The processor time required to calculate and check those checksums caused some
performance loss, which is why its designers did not originally include checksums. But
loss of data integrity outweighed loss o f performance and the designers reversed the
trade-off.
T o illustrate, an example o f a fault tolerance model applied to a popular kind o f mem­
ory devices, RAM, appears in Section 8.7. This fault tolerance model employs error
detection and masking techniques that are described below in Section 8.4 o f this chapter,
so the reader may prefer to delay detailed study of that section until completing Section
8.4.

8.4 Systematically Applying Redundancy


The designer o f an analog system typically masks small errors by specifying design toler­
ances known as margins, which are amounts by which the specification is better than
necessary for correct operation under normal conditions. In contrast, the designer o f a
digital system both detects and masks errors of all kinds by adding redundancy, either in
time or in space. When an error is thought to be transient, as when a packet is lost in a
data communication network, one method of masking is to resend it, an example o f
redundancy in time. When an error is likely to be persistent, as in a failure in reading bits
from the surface o f a disk, the usual method o f masking is with spatial redundancy, hav­
ing another component provide another copy o f the information or control signal.
Redundancy can be applied either in cleverly small quantities or by brute force, and both
techniques may be used in different parts o f the same system.

8.4.1 Coding: Incremental Redundancy


The most common form of incremental redundancy, known as forward error correction,
consists o f clever coding o f data values. With data that has not been encoded to tolerate
errors, a change in the value o f one bit may transform one legitimate data value into
another legitimate data value. Encoding for errors involves choosing as the representa­
tion o f legitimate data values only some o f the total number o f possible bit patterns,
being careful that the patterns chosen for legitimate data values all have the property that
to transform any one of them to any other, more than one bit must change. The smallest
number o f bits that must change to transform one legitimate pattern into another is
known as the Hamming distance between those two patterns. The Hamming distance is
named after Richard Hamming, who first investigated this class o f codes. Thus the
patterns
100101
000111
have a Hamming distance o f 2 because the upper pattern can be transformed into the
lower pattern by flipping the values o f two bits, the first bit and the fifth bit. Data fields
that have not been coded for errors might have a Hamming distance as small as 1. Codes
that can detect or correct errors have a minimum Hamming distance between any two
legitimate data patterns o f 2 or more. The Hamming distance o f a code is the minimum
Hamming distance between any pair o f legitimate patterns o f the code. One can calcu­
late the Hamming distance between two patterns, A and B, by counting the number o f
ones in A ⊕ B, where ⊕ is the exclusive OR (XOR) operator.
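
The counting rule translates directly into code; a minimal sketch, assuming the two
patterns are given as equal-length bit strings:

    def hamming_distance(a, b):
        """Number of bit positions in which two equal-length bit patterns differ."""
        assert len(a) == len(b)
        # Equivalent to counting the ones in a XOR b.
        return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

    print(hamming_distance("100101", "000111"))   # 2, as in the example above
    print(hamming_distance("100101", "010111"))   # 3, as in the example below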

Suppose we create an encoding in which the Hamming distance between every pair
of legitimate data patterns is 2 . Then, if one bit changes accidentally, since no legitimate
data item can have that pattern, we can detect that something went wrong, but it is not
possible to figure out what the original data pattern was. Thus, if the two patterns above
were two members of the code and the first bit of the upper pattern were flipped from
one to zero, there is no way to tell that the result, 000101, is not the result of flipping the
fifth bit of the lower pattern.
fifth bit o f the lower pattern.


Next, suppose that we instead create an encoding in which the Hamming distance of
the code is 3 or more. Here are two patterns from such a code; bits 1, 2, and 5 are
different:
100101
010111
Now, a one-bit change will always transform a legitimate data pattern into an incor­
rect data pattern that is still at least 2 bits distant from any other legitimate pattern but
only 1 bit distant from the original pattern. A decoder that receives a pattern with a one-
bit error can inspect the Hamming distances between the received pattern and nearby
legitimate patterns and by choosing the nearest legitimate pattern correct the error. If 2
bits change, this error-correction procedure will still identify a corrected data value, but
it will choose the wrong one. If we expect 2-bit errors to happen often, we could choose
the code patterns so that the Hamming distance is 4, in which case the code can correct
1-bit errors and detect 2-bit errors. But a 3-bit error would look just like a 1-bit error in
some other code pattern, so it would decode to a wrong value. More generally, if the
Hamming distance of a code is d, a little analysis reveals that one can detect d − 1 errors
and correct ⌊(d − 1)/2⌋ errors. The reason that this form of redundancy is named
“forward”error correction is that the creator o f the data performs the coding before stor­
ing or transmitting it, and anyone can later decode the data without appealing to the
creator. (Chapter 7[on-line] described the technique o f asking the sender of a lost frame,
packet, or message to retransmit it. That technique goes by the name o f backward error
correction.)
The systematic construction o f forward error-detection and error-correction codes is
a large field o f study, which we do not intend to explore. However, two specific examples
o f commonly encountered codes are worth examining.
The first example is a simple parity check on a 2-bit value, in which the parity
bit is the XOR of the 2 data bits. The coded pattern is 3 bits long, so there are 2³ = 8
possible patterns for this 3-bit quantity, only 4 of which represent legitimate data.
As illustrated in Figure 8.4, the 4 “correct” patterns have the property that changing
any single bit transforms the word into one of the 4 illegal patterns. To transform the
coded quantity into another legal pattern, at least 2 bits must change (in other words,
the Hamming distance of this code is 2). The conclusion is that a simple parity
check can detect any single error, but it doesn’t have enough information to cor-
rect errors.

FIGURE 8.4
Patterns for a simple parity-check code. Each line connects patterns that differ in
only one bit; bold-face patterns are the legitimate ones.
The second example, in Figure 8.5, shows a forward error-correction code that can
correct 1-bit errors in a 4-bit data value, by encoding the 4 bits into 7-bit words. In this
code, bits P7, P6, P5, and P3 carry the data, while bits P4, P2, and P1 are calculated from
the data bits. (This out-of-order numbering scheme creates a multidimensional binary
coordinate system with a use that will be evident in a moment.) We could analyze this
code to determine its Hamming distance, but we can also observe that three extra bits
can carry exactly enough information to distinguish 8 cases: no error, an error in bit 1,
an error in bit 2, ... or an error in bit 7. Thus, it is not surprising that an error-correction
code can be created. This code calculates bits P1, P2, and P4 as follows:

    P1 = P7 ⊕ P5 ⊕ P3
    P2 = P7 ⊕ P6 ⊕ P3
    P4 = P7 ⊕ P6 ⊕ P5

Now, suppose that the array of bits P1 through P7 is sent across a network and noise
causes bit P5 to flip. If the recipient recalculates P1, P2, and P4, the recalculated values
of P1 and P4 will be different from the received bits P1 and P4. The recipient then writes
P4 P2 P1 in order, representing the troubled bits as ones and untroubled bits as zeros, and
notices that their binary value is 101₂ = 5, the position of the flipped bit. In this code,
whenever there is a one-bit error, the troubled parity bits directly identify the bit to cor-
rect. (That was the reason for the out-of-order bit-numbering scheme, which created a
3-dimensional coordinate system for locating an erroneous bit.)
The use o f 3 check bits for 4 data bits suggests that an error-correction code may not
be efficient, but in fact the apparent inefficiency o f this example is only because it is so
small. Extending the same reasoning, one can, for example, provide single-error correc­
tion for 56 data bits using 7 check bits in a 63-bit code word.
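
The encoding and correction procedure just described fits in a few lines. In this sketch
(the helper names are ours) a code word is kept as a dictionary indexed by bit position,
P7 down to P1; recomputing the three parity checks at the receiver yields the position of
a flipped bit directly.

    def encode(d7, d6, d5, d3):
        """Encode 4 data bits into a 7-bit single-error-correcting code word."""
        p = {7: d7, 6: d6, 5: d5, 3: d3}
        p[1] = p[7] ^ p[5] ^ p[3]
        p[2] = p[7] ^ p[6] ^ p[3]
        p[4] = p[7] ^ p[6] ^ p[5]
        return p

    def correct(p):
        """Locate and fix a single-bit error, returning the (possibly corrected) word."""
        s1 = p[1] ^ p[7] ^ p[5] ^ p[3]       # recomputed parity checks
        s2 = p[2] ^ p[7] ^ p[6] ^ p[3]
        s4 = p[4] ^ p[7] ^ p[6] ^ p[5]
        position = 4 * s4 + 2 * s2 + s1      # 0 means no single-bit error detected
        if position:
            p[position] ^= 1
        return p

    word = encode(1, 0, 1, 1)
    word[5] ^= 1                             # noise flips P5 in transit
    fixed = correct(word)
    print([fixed[i] for i in (7, 6, 5, 4, 3, 2, 1)])   # original code word restored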
In both of these examples of coding, the assumed threat to integrity is that an uni­
dentified bit out of a group may be in error. Forward error correction can also be effective
against other threats. A different threat, called erasure, is also common in digital systems.
An erasure occurs when the value o f a particular, identified bit o f a group is unintelligible
or perhaps even completely missing. Since we know which bit is in question, the simple
parity-check code, in which the parity bit is the XOR of the other bits, becomes a forward
error correction code. The unavailable bit can be reconstructed simply by calculating the
XOR of the unerased bits. Returning to the example of Figure 8.4, if we find a pattern in
which the first and last bits have values 0 and 1 respectively, but the middle bit is illegible,
the only possibilities are 001 and 011. Since 001 is not a legitimate code pattern, the
original pattern must have been 011. The simple parity check allows correction of only
a single erasure. If there is a threat o f multiple erasures, a more complex coding scheme
is needed. Suppose, for example, we have 4 bits to protect, and they are coded as in Fig­
ure 8.5. In that case, if as many as 3 bits are erased, the remaining 4 bits are sufficient to
reconstruct the values of the 3 that are missing.
Since erasure, in the form o f lost packets, is a threat in a best-effort packet network,
this same scheme o f forward error correction is applicable. One might, for example, send
four numbered, identical-length packets of data followed by a parity packet that contains

    bit                                                       P7   P6   P5   P4   P3   P2   P1
    Choose P1 so XOR of every other bit  (P7 ⊕ P5 ⊕ P3 ⊕ P1) is 0:  ⊕         ⊕         ⊕         ⊕
    Choose P2 so XOR of every other pair (P7 ⊕ P6 ⊕ P3 ⊕ P2) is 0:  ⊕    ⊕              ⊕    ⊕
    Choose P4 so XOR of every other four (P7 ⊕ P6 ⊕ P5 ⊕ P4) is 0:  ⊕    ⊕    ⊕    ⊕

FIGURE 8.5
A single-error-correction code. In the table, the symbol ⊕ marks the bits that participate in the
calculation of one of the redundant bits. The payload bits are P7, P6, P5, and P3, and the redun-
dant bits are P4, P2, and P1. The “every other” notes describe a 3-dimensional coordinate
system that can locate an erroneous bit.

as its payload the bit-by-bit XOR of the payloads of the previous four. (That is, the first bit
of the parity packet is the XOR of the first bit of each of the other four packets; the second
bits are treated similarly, etc.) Although the parity packet adds 25% to the network load,
as long as any four of the five packets make it through, the receiving side can reconstruct
all of the payload data perfectly without having to ask for a retransmission. If the network
is so unreliable that more than one packet out of five typically gets lost, then one might
send seven packets, of which four contain useful data and the remaining three are calcu-
lated using the formulas of Figure 8.5. (Using the numbering scheme of that figure, the
payload of packet 4, for example, would consist of the XOR of the payloads of packets 7,
6, and 5.) Now, if any four of the seven packets make it through, the receiving end can
reconstruct the data.
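
A minimal sketch of the four-data-packets-plus-parity scheme, with a lost packet
reconstructed by XORing the survivors (the packet contents are arbitrary examples):

    from functools import reduce

    def xor_packets(packets):
        """Bit-by-bit XOR of a list of equal-length byte strings."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*packets))

    data = [b"ACS1", b"ACS2", b"ACS3", b"ACS4"]     # four equal-length data packets
    parity = xor_packets(data)                       # the fifth, parity packet

    # Suppose the second data packet is lost; the XOR of the four survivors rebuilds it.
    survivors = [data[0], data[2], data[3], parity]
    print(xor_packets(survivors) == data[1])         # True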
Forward error correction is especially useful in broadcast protocols, where the exist­
ence o f a large number o f recipients, each o f which may miss different frames, packets,
or stream segments, makes the alternative o f backward error correction by requesting
retransmission unattractive. Forward error correction is also useful when controlling jit­
ter in stream transmission because it eliminates the round-trip delay that would be
required in requesting retransmission o f missing stream segments. Finally, forward error
correction is usually the only way to control errors when communication is one-way or
round-trip delays are so long that requesting retransmission is impractical, for example,
when communicating with a deep-space probe. On the other hand, using forward error
correction to replace lost packets may have the side effect o f interfering with congestion
control techniques in which an overloaded packet forwarder tries to signal the sender to
slow down by discarding an occasional packet.
Another application o f forward error correction to counter erasure is in storing data
on magnetic disks. The threat in this case is that an entire disk drive may fail, for example
because of a disk head crash. Assuming that the failure occurs long after the data was orig­
inally written, this example illustrates one-way communication in which backward error
correction (asking the original writer to write the data again) is not usually an option.
One response is to use a RAID array (see Section 2.1.1.4) in a configuration known as
RAID 4. In this configuration, one might use an array o f five disks, with four o f the disks
containing application data and each sector of the fifth disk containing the bit-by-bit XOR
o f the corresponding sectors o f the first four. If any o f the five disks fails, its identity will
quickly be discovered because disks are usually designed to be fail-fast and report failures
at their interface. After replacing the failed disk, one can restore its contents by reading
the other four disks and calculating, sector by sector, the XOR of their data (see exercise
8.9). To maintain this strategy, whenever anyone updates a data sector, the RAID 4 sys-
tem must also update the corresponding sector of the parity disk, as shown in Figure 8.6.
That figure makes it apparent that, in RAID 4, forward error correction has an identifi­
able read and write performance cost, in addition to the obvious increase in the amount
o f disk space used. Since loss o f data can be devastating, there is considerable interest in
RAID, and much ingenuity has been devoted to devising ways of minimizing the perfor­
mance penalty.
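
The parity-update rule of Figure 8.6 needs only two reads and two writes no matter how
many data disks the array has. A minimal sketch of the sector computation (the sector
contents are arbitrary):

    def xor_sectors(a, b):
        """Bit-by-bit XOR of two equal-length sectors."""
        return bytes(x ^ y for x, y in zip(a, b))

    old_data2  = bytes([0x11, 0x22, 0x33, 0x44])
    new_data2  = bytes([0xAA, 0x22, 0x33, 0x00])
    old_parity = bytes([0x0F, 0xF0, 0x55, 0xAA])     # parity over data 1 through data 4

    # Faster RAID 4 update: new parity <- old parity XOR old data 2 XOR new data 2
    new_parity = xor_sectors(xor_sectors(old_parity, old_data2), new_data2)
    print(new_parity.hex())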

Although it is an important and widely used technique, successfully applying incre­


mental redundancy to achieve error detection and correction is harder than one might
expect. The first case study o f Section 8.8 provides several useful lessons on this point.
In addition, there are some situations where incremental redundancy does not seem
to be applicable. For example, there have been efforts to devise error-correction codes for
numerical values with the property that the coding is preserved when the values are pro­
cessed by an adder or a multiplier. While it is not too hard to invent schemes that allow
a limited form o f error detection (for example, one can verify that residues are consistent,
using analogues o f casting out nines, which school children use to check their arith­
metic), these efforts have not yet led to any generally applicable techniques. The only
scheme that has been found to systematically protect data during arithmetic processing
is massive redundancy, which is our next topic.

8.4.2 Replication: Massive Redundancy


In designing a bridge or a skyscraper, a civil engineer masks uncertainties in the strength
of materials and other parameters by specifying components that are 5 or 10 times as
strong as minimally required. The method is heavy-handed, but simple and effective.

FIGURE 8.6
Update of a sector on disk 2 of a five-disk RAID 4 system. The old parity sector contains
parity ← data 1 ⊕ data 2 ⊕ data 3 ⊕ data 4. To construct a new parity sector that includes the
new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and per-
form three more XORs. But a faster way is to read just the old parity sector and the old data 2
sector and compute the new parity sector as
    new parity ← old parity ⊕ old data 2 ⊕ new data 2

The corresponding way o f building a reliable system out of unreliable discrete compo­
nents is to acquire multiple copies of each component. Identical multiple copies are
called replicas, and the technique is called replication. There is more to it than just making
copies: one must also devise a plan to arrange or interconnect the replicas so that a failure
in one replica is automatically masked with the help o f the ones that don’ t fail. For exam­
ple, if one is concerned about the possibility that a diode may fail by either shorting out
or creating an open circuit, one can set up a network o f four diodes as in Figure 8.7, cre­
ating what we might call a “ superdiode” . This interconnection scheme, known as a quad
component, was developed by Claude E. Shannon and Edward F. Moore in the 1950s as
a way of increasing the reliability of relays in telephone systems. It can also be used with
resistors and capacitors in circuits that can tolerate a modest range o f component values.
This particular superdiode can tolerate a single short circuit and a single open circuit in
any two component diodes, and it can also tolerate certain other multiple failures, such
as open circuits in both upper diodes plus a short circuit in one o f the lower diodes. If
the bridging connection o f the figure is added, the superdiode can tolerate additional
multiple open-circuit failures (such as one upper diode and one lower diode), but it will
be less tolerant o f certain short-circuit failures (such as one left diode and one right
diode).
A serious problem with this superdiode is that it masks failures silently. There is no
easy way to determine how much failure tolerance remains in the system.

8.4.3 Voting
Although there have been attempts to extend quad-component methods to digital logic,
the intricacy o f the required interconnections grows much too rapidly. Fortunately, there
is a systematic alternative that takes advantage o f the static discipline and level regenera­
tion that are inherent properties of digital logic. In addition, it has the nice feature that
it can be applied at any level of module, from a single gate on up to an entire computer.
The technique is to substitute in place o f a single module a set of replicas o f that same
module, all operating in parallel with the same inputs, and compare their outputs with a
device known as a voter. This basic strategy is called N-modular redundancy, or NMR.
When N has the value 3 the strategy is called triple-modular redundancy, abbreviated
TMR. When other values are used for N the strategy is named by replacing the N of
NMR with the number, as in 5MR. The combination of N replicas of some module and

FIGURE 8.7
A quad-component superdiode. The dotted line represents an optional bridging connection,
which allows the superdiode to tolerate a different set of failures, as described in the text.

the voting system is sometimes called a supermodule. Several different schemes exist for
interconnection and voting, only a few of which we explore here.
The simplest scheme, called fail-vote, consists of NMR with a majority voter. One
assembles N replicas of the module and a voter that consists of an N-way comparator and
some counting logic. If a majority of the replicas agree on the result, the voter accepts
that result and passes it along to the next system component. If any replicas disagree with
the majority, the voter may in addition raise an alert, calling for repair o f the replicas that
were in the minority. If there is no majority, the voter signals that the supermodule has
failed. In failure-tolerance terms, a triply-redundant fail-vote supermodule can mask the
failure o f any one replica, and it is fail-fast if any two replicas fail in different ways.
If the reliability, as was defined in Section 8.2.2, of a single replica module is R and
the underlying fault mechanisms are independent, a TMR fail-vote supermodule will
operate correctly if all 3 modules are working (with reliability R³) or if 1 module has
failed and the other 2 are working (with reliability R²(1 − R)). Since a single-module
failure can happen in 3 different ways, the reliability of the supermodule is the sum,

    R_supermodule = R³ + 3R²(1 − R) = 3R² − 2R³                         Eq. 8-10

but the supermodule is not always fail-fast. If two replicas fail in exactly the same way,
the voter will accept the erroneous result and, unfortunately, call for repair of the one
correctly operating replica. This outcome is not as unlikely as it sounds because several
replicas that went through the same design and production process may have exactly the
same set of design or manufacturing faults. This problem can arise despite the indepen-
dence assumption used in calculating the probability of correct operation. That
calculation assumes only that the probability that different replicas produce correct
answers be independent; it assumes nothing about the probability of producing specific
wrong answers. Without more information about the probability of specific errors and
their correlations the only thing we can say about the probability that an incorrect result
will be accepted by the voter is that it is not more than

    (1 − R_supermodule)

These calculations assume that the voter is perfectly reliable. Rather than trying to
create perfect voters, the obvious thing to do is replicate them, too. In fact, everything—
modules, inputs, outputs, sensors, actuators, etc.— should be replicated, and the final
vote should be taken by the client o f the system. Thus, three-engine airplanes vote with
their propellers: when one engine fails, the two that continue to operate overpower the
inoperative one. On the input side, the pilot’ s hand presses forward on three separate
throttle levers. A fully replicated TM R supermodule is shown in Figure 8 .8 . With this
interconnection arrangement, any measurement or estimate of the reliability, R, o f a
component module should include the corresponding voter. It is actually customary
(and more logical) to consider a voter to be a component o f the next module in the chain
rather than, as the diagram suggests, the previous module. This fully replicated design is
sometimes described as recursive.

The numerical effect o f fail-vote TM R is impressive. If the reliability o f a single mod­


ule at time T is 0.999, equation 8-10 says that the reliability o f a fail-vote TM R
supermodule at that same time is 0.999997. TM R has reduced the probability o f failure
from one in a thousand to three in a million. This analysis explains why airplanes
intended to fly across the ocean have more than one engine. Suppose that the rate o f
engine failures is such that a single-engine plane would fail to complete one out o f a thou­
sand trans-Atlantic flights. Suppose also that a 3-engine plane can continue flying as long
as any 2 engines are operating, but it is too heavy to fly with only 1 engine. In 3 flights
out o f a thousand, one o f the three engines will fail, but if engine failures are indepen­
dent, in 999 out o f each thousand first-engine failures, the remaining 2 engines allow the
plane to limp home successfully.
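
Equation 8-10 and the numbers quoted above are easy to check with a short sketch:

    def tmr_fail_vote_reliability(r):
        """Reliability (eq. 8-10) of a fail-vote TMR supermodule of replicas with reliability r."""
        return r ** 3 + 3 * r ** 2 * (1 - r)     # equivalently 3r^2 - 2r^3

    r = 0.999
    print(f"supermodule reliability: {tmr_fail_vote_reliability(r):.6f}")               # 0.999997
    print(f"single-replica failure probability: {1 - r:.6f}")                           # 0.001000
    print(f"supermodule failure probability:    {1 - tmr_fail_vote_reliability(r):.6f}")  # 0.000003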
Although TM R has greatly improved reliability, it has not made a comparable impact
on MTTF. In fact, the MTTF o f a fail-vote TM R supermodule can be smaller than the
MTTF o f the original, single-replica module. The exact effect depends on the failure
process o f the replicas, so for illustration consider a memoryless failure process, not
because it is realistic but because it is mathematically tractable. Suppose that airplane
engines have an MTTF of 6,000 hours, they fail independently, the mechanism o f
engine failure is memoryless, and (since this is a fail-vote design) we need at least 2 oper­
ating engines to get home. When flying with three engines, the plane accumulates 6,000
hours of engine running time in only 2,000 hours o f flying time, so from the point o f
view of the airplane as a whole, 2,000 hours is the expected time to the first engine fail­
ure. While flying with the remaining two engines, it will take another 3,000 flying hours
to accumulate 6,000 more engine hours. Because the failure process is memoryless we
can calculate the MTTF o f the 3-engine plane by adding:
Mean time to first failure 2000 hours (three engines)
Mean time from first to second failure 3000 hours (two engines)
Total mean time to system failure 5000 hours
Thus the mean time to system failure is less than the 6,000 hour MTTF o f a single
engine. What is going on here is that we have actually sacrificed long-term reliability in
order to enhance short-term reliability. Figure 8.9 illustrates the reliability o f our hypo-

FIGURE 8.8
Triple-modular redundant supermodule, with three inputs, three voters, and three outputs.

thetical airplane during its 6 hours o f flight, which amounts to only 0.001 o f the single­
engine M TTF— the mission time is very short compared with the MTTF and the reli­
ability is far higher. Figure 8.10 shows the same curve, but for flight times that are
comparable with the MTTF. In this region, if the plane tried to keep flying for 8000
hours (about 1.4 times the single-engine MTTF), a single-engine plane would fail to
complete the flight in 3 out o f 4 tries, but the 3-engine plane would fail to complete the
flight in 5 out o f 6 tries. (One should be wary o f these calculations because the assump­
tions o f independence and memoryless operation may not be met in practice. Sidebar 8.2
elaborates.)

FIGURE 8.9
Reliability with triple modular redundancy, for mission times much less than the MTTF of 6,000
hours. The vertical dotted line represents a six-hour flight. (In the plot, the horizontal axis is
mission time in units of the MTTF; the three-engine curve sits near 0.999997 and the one-engine
curve near 0.999 at that mission time.)

FIGURE 8.10
Reliability with triple modular redundancy, for mission times comparable to the MTTF of 6,000
hours. The two vertical dotted lines represent mission times of 6,000 hours (left) and 8,400
hours (right). (Plot labels: roughly 0.25 for a single engine and 0.15 for three engines at the
longer mission time.)

Sidebar 8.2: Risks of manipulating MTTFs  The apparently casual manipulation of MTTFs
in Sections 8.4.3 and 8.4.4 is justified by assumptions of independence of failures and
memoryless processes. But one can trip up by blindly applying this approach without
understanding its limitations. To see how, consider a computer system that has been observed
for several years to have a hardware crash an average of every 2 weeks and a software crash an
average of every 6 weeks. The operator does not repair the system, but simply restarts it and
hopes for the best. The composite MTTF is 1.5 weeks, determined most easily by considering
what happens if we run the system for, say, 60 weeks. During that time we expect to see

10 software failures
30 hardware failures
40 system failures in 60 weeks → 1.5 weeks between failures

New hardware is installed, identical to the old except that it never fails. The MTTF should
jump to 6 weeks because the only remaining failures are software, right?

Perhaps—but only if the software failure process is independent of the hardware failure process.

Suppose the software failure occurs because there is a bug (fault) in a clock-updating procedure:
The bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is started—if it gets
a chance to run that long. The old hardware was causing crashes so often that the software bug
only occasionally had a chance to do its thing—only about once every 6 weeks. Most of the
time, the recovery from a hardware failure, which requires restarting the system, had the side
effect of resetting the process that triggered the software bug. So, when the new hardware is
installed, the system has an MTTF of only 2.5 weeks, much less than hoped.

MTTFs are useful, but one must be careful to understand what assumptions go into their
measurement and use.
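The composite-MTTF arithmetic in the sidebar follows from the fact that, for independent memoryless processes, failure rates add. A minimal sketch of that calculation:

    mttf_hw = 2.0    # weeks between hardware crashes
    mttf_sw = 6.0    # weeks between software crashes

    composite_rate = 1 / mttf_hw + 1 / mttf_sw   # failures per week; rates add
    print(1 / composite_rate)                    # 1.5 weeks between failures

    # The same answer via the sidebar's 60-week argument:
    weeks = 60
    failures = weeks / mttf_hw + weeks / mttf_sw  # 30 + 10 = 40 failures
    print(weeks / failures)                       # 1.5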

If we had assumed that the plane could limp home with just one engine, the MTTF
would have increased, rather than decreased, but only modestly. Replication provides a
dramatic improvement in reliability for missions of duration short compared with the
MTTF, but the MTTF itself changes much less. We can verify this claim with a little
more analysis, again assuming memoryless failure processes to make the mathematics
tractable. Suppose we have an NMR system with the property that it somehow continues
to be useful as long as at least one replica is still working. (This system requires using fail-
fast replicas and a cleverer voter, as described in Section 8.4.4 below.) If a single replica
has an MTTF_replica = 1, there are N independent replicas, and the failure process is
memoryless, the expected time until the first failure is MTTF_replica/N, the expected time
from then until the second failure is MTTF_replica/(N - 1), etc., and the expected time
until the system of N replicas fails is the sum of these times,

    MTTF_system = 1 + 1/2 + 1/3 + ... + 1/N                            Eq. 8-11


which for large N is approximately ln(N). As we add to the cost by adding more replicas,
MTTF_system grows disappointingly slowly—proportional to the logarithm of the cost. To
multiply MTTF_system by a factor K, the number of replicas required is e^K—the cost grows
exponentially. The significant conclusion is that in systems for which the mission time is
long compared with MTTF_replica, simple replication escalates the cost while providing little
benefit. On the other hand, there is a way of making replication effective for long mis-
sions, too. The method is to enhance replication by adding repair.
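A small Python sketch of Eq. 8-11 makes the logarithmic growth concrete (the numbers below assume MTTF_replica = 1, as in the text):

    import math

    def mttf_system(n, mttf_replica=1.0):
        """Eq. 8-11: expected time until all n memoryless replicas have failed."""
        return mttf_replica * sum(1.0 / k for k in range(1, n + 1))

    for n in (1, 3, 10, 100, 1000):
        print(f"N = {n:>4}  MTTF_system = {mttf_system(n):.2f}  ln(N) = {math.log(n):.2f}")
    # MTTF_system grows roughly like ln(N): each constant-factor gain in MTTF
    # costs exponentially more replicas.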

8.4.4 Repair
Let us return now to a fail-vote TMR supermodule (that is, it requires that at least two
replicas be working) in which the voter has just noticed that one of the three replicas is
producing results that disagree with the other two. Since the voter is in a position to
report which replica has failed, suppose that it passes such a report along to a repair
person who immediately examines the failing replica and either fixes or replaces it. For this
approach, the mean time to repair (MTTR) measure becomes of interest. The super-
module fails if either the second or third replica fails before the repair to the first one can
be completed. Our intuition is that if the MTTR is small compared with the combined
MTTF of the other two replicas, the chance that the supermodule fails will be similarly
small.
The exact effect on chances of supermodule failure depends on the shape of the reli-
ability function of the replicas. In the case where the failure and repair processes are both
memoryless, the effect is easy to calculate. Since the rate of failure of 1 replica is 1/MTTF,
the rate of failure of 2 replicas is 2/MTTF. If the repair time is short compared with
MTTF, the probability of a failure of 1 of the 2 remaining replicas while waiting a time
T for repair of the one that failed is approximately 2T/MTTF. Since the mean time to
repair is MTTR, we have

    Pr(supermodule fails while waiting for repair) = 2 x MTTR / MTTF    Eq. 8-12

Continuing our airplane example and temporarily suspending disbelief, suppose that
during a long flight we send a mechanic out on the airplane's wing to replace a failed
engine. If the replacement takes 1 hour, the chance that one of the other two engines fails
during that hour is approximately 1/3000. Moreover, once the replacement is complete,
we expect to fly another 2000 hours until the next engine failure. Assuming further that
the mechanic is carrying an unlimited supply of replacement engines, completing a
10,000 hour flight—or even a longer one—becomes plausible. The general formula for
the MTTF of a fail-vote TMR supermodule with memoryless failure and repair processes
is (this formula comes out of the analysis of continuous-transition birth-and-death
Markov processes, an advanced probability technique that is beyond our scope):

    MTTF_supermodule = (MTTF_replica / 3) x (MTTF_replica / (2 x MTTR_replica))
                     = (MTTF_replica)^2 / (6 x MTTR_replica)            Eq. 8-13


Thus, our 3-engine plane with hypothetical in-flight repair has an MTTF of 6 million
hours, an enormous improvement over the 6,000 hours of a single-engine plane. This
equation can be interpreted as saying that, compared with an unreplicated module, the
MTTF has been reduced by the usual factor of 3 because there are 3 replicas, but at the
same time the availability of repair has increased the MTTF by a factor equal to the ratio
of the MTTF of the remaining 2 engines to the MTTR.
Replacing an airplane engine in flight may be a fanciful idea, but replacing a magnetic
disk in a computer system on the ground is quite reasonable. Suppose that we store 3
replicas of a set of data on 3 independent hard disks, each of which has an MTTF of 5
years (using as the MTTF the expected operational lifetime, not the "MTTF" derived
from the short-term failure rate). Suppose also, that if a disk fails, we can locate, install,
and copy the data to a replacement disk in an average of 10 hours. In that case, by eq.
8-13, the MTTF of the data is

    MTTF_supermodule = (MTTF_replica)^2 / (6 x MTTR_replica)
                     = (5 years)^2 / (6 x (10 hours)/(8760 hours/year))
                     = 3650 years                                       Eq. 8-14

In effect, redundancy plus repair has reduced the probability of failure of this supermod-
ule to such a small value that for all practical purposes, failure can be neglected and the
supermodule can operate indefinitely.
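A small sketch of Eq. 8-13, plugged with the two examples just given (the function name is illustrative):

    def mttf_with_repair(mttf_replica, mttr_replica):
        """Eq. 8-13: fail-vote TMR with memoryless failure and repair processes."""
        return mttf_replica**2 / (6 * mttr_replica)

    # Airplane: 6,000-hour engines, 1-hour in-flight engine replacement.
    print(mttf_with_repair(6000, 1))                   # 6,000,000 hours

    # Disks (Eq. 8-14): 5-year disks, 10-hour repair time.
    hours_per_year = 8760
    print(mttf_with_repair(5, 10 / hours_per_year))    # 3650.0 years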
Before running out to start a company that sells superbly reliable disk-storage sys-
tems, it would be wise to review some of the overly optimistic assumptions we made in
getting that estimate of the MTTF, most of which are not likely to be true in the real
world:
• Disks fail independently. A batch of real-world disks may all come from the same
vendor, where they acquired the same set of design and manufacturing faults. Or,
they may all be in the same machine room, where a single earthquake—which
probably has an MTTF of less than 3,650 years—may damage all three.
• Disk failures are memoryless. Real-world disks follow a bathtub curve. If, when disk
#1 fails, disk #2 has already been in service for three years, disk #2 no longer has
an expected operational lifetime of 5 years, so the chance of a second failure while
waiting for repair is higher than the formula assumes. Furthermore, when disk #1
is replaced, its chances of failing are probably higher than usual for the first few
weeks.
• Repair is also a memoryless process. In the real world, if we stock enough spares that
we run out only once every 10 years and have to wait for a shipment from the
factory, but doing a replacement happens to run us out of stock today, we will
probably still be out of stock tomorrow and the next day.
• Repair is done flawlessly. A repair person may replace the wrong disk, forget to copy
the data to the new disk, or install a disk that hasn't passed burn-in and fails in the
first hour.


Each of these concerns acts to reduce the reliability below what might be expected from
our overly simple analysis. Nevertheless, NMR with repair remains a useful technique,
and in Chapter 10 [on-line] we will see ways in which it can be applied to disk storage.
One of the most powerful applications of NMR is in the masking of transient errors.
When a transient error occurs in one replica, the NMR voter immediately masks it.
Because the error is transient, the subsequent behavior of the supermodule is as if repair
happened by the next operation cycle. The numerical result is little short of extraordi-
nary. For example, consider a processor arithmetic logic unit (ALU) with a 1 gigahertz
clock and which is triply replicated with voters checking its output at the end of each
clock cycle. In equation 8-13 we have MTTR_replica = 1 (in this application, equation
8-13 is only an approximation because the time to repair is a constant rather than the
result of a memoryless process), and MTTF_supermodule = (MTTF_replica)^2 / 6
cycles. If MTTF_replica is 10^10 cycles (1 error in 10 billion cycles, which at this clock speed
means one error every 10 seconds), MTTF_supermodule is 10^20 / 6 cycles, about 500 years.
TMR has taken three ALUs that were for practical use nearly worthless and created a
super-ALU that is almost infallible.
The reason things seem so good is that we are evaluating the chance that two transient
errors occur in the same operation cycle. If transient errors really are independent, that
chance is small. This effect is powerful, but the leverage works in both directions, thereby
creating a potential hazard: it is especially important to keep track of the rate at which
transient errors actually occur. If they are happening, say, 20 times as often as hoped,
MTTF_supermodule will be 1/400 of the original prediction—the super-ALU is likely to fail
once per year. That may still be acceptable for some applications, but it is a big change.
Also, as usual, the assumption of independence is absolutely critical. If all the ALUs came
from the same production line, it seems likely that they will have at least some faults in
common, in which case the super-ALU may be just as worthless as the individual ALUs.
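The super-ALU arithmetic can be checked the same way. The sketch below assumes one vote per clock cycle and treats the next cycle as the "repair":

    clock_hz = 1e9                      # 1 GHz ALU, voters check every cycle
    mttf_replica = 1e10                 # cycles between transient errors (one per 10 s)

    mttf_super_cycles = mttf_replica**2 / 6        # Eq. 8-13 with MTTR = 1 cycle
    years = mttf_super_cycles / clock_hz / (3600 * 24 * 365)
    print(years)                                   # roughly 500 years

    # If transient errors occur 20 times as often, the MTTF drops by a factor of 400:
    print((mttf_replica / 20)**2 / 6 / clock_hz / (3600 * 24 * 365))  # about 1.3 years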
Several variations on the simple fail-vote structure appear in practice:
• Purging. In an NMR design with a voter, whenever the voter detects that one
replica disagrees with the majority, the voter calls for its repair and in addition
marks that replica DOWN and ignores its output until hearing that it has been
repaired. This technique doesn't add anything to a TMR design, but with higher
levels of replication, as long as replicas fail one at a time and any two replicas
continue to operate correctly, the supermodule works. (A small sketch of a purging
voter appears after this list.)
• Pair-and-compare. Create a fail-fast module by taking two replicas, giving them the
same inputs, and connecting a simple comparator to their outputs. As long as the
comparator reports that the two replicas of a pair agree, the next stage o f the system
accepts the output. If the comparator detects a disagreement, it reports that the
module has failed. The major attraction of pair-and-compare is that it can be used
to create fail-fast modules starting with easily available commercial, off-the-shelf
components, rather than commissioning specialized fail-fast versions. Special
high-reliability components typically have a cost that is much higher than off-the-
shelf designs, for two reasons. First, since they take more time to design and test,


the ones that are available are typically o f an older, more expensive technology.
Second, they are usually low-volume products that cannot take advantage of
economies o f large-scale production. These considerations also conspire to
produce long delivery cycles, making it harder to keep spares in stock. An
important aspect o f using standard, high-volume, low-cost components is that one
can afford to keep a stock o f spares, which in turn means that M TTR can be made
small: just replace a failing replica with a spare (the popular term for this approach
is pair-and-spare) and do the actual diagnosis and repair at leisure.
• NMR with fail-fast replicas. If each of the replicas is itself a fail-fast design (perhaps
using pair-and-compare internally), then a voter can restrict its attention to the
outputs of only those replicas that claim to be producing good results and ignore
those that are reporting that their outputs are questionable. With this organization,
a TMR system can continue to operate even if 2 of its 3 replicas have failed, since
the 1 remaining replica is presumably checking its own results. An NMR system
with repair and constructed of fail-fast replicas is so robust that it is unusual to find
examples for which N is greater than 2.
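The following Python sketch illustrates the purging variation; the class and method names are hypothetical, not taken from any particular system:

    from collections import Counter

    class PurgingVoter:
        """Sketch of an NMR voter with purging: a replica whose output disagrees
        with the majority is marked DOWN and ignored until reported repaired."""

        def __init__(self):
            self.down = set()                     # replicas marked DOWN

        def vote(self, outputs):
            """outputs maps replica id -> output value for one operation cycle."""
            live = {r: v for r, v in outputs.items() if r not in self.down}
            winner, count = Counter(live.values()).most_common(1)[0] if live else (None, 0)
            if count < 2:
                raise RuntimeError("supermodule failed: no two live replicas agree")
            for r, v in live.items():
                if v != winner:
                    self.down.add(r)              # purge and call for repair
            return winner

        def repaired(self, replica):
            self.down.discard(replica)

    voter = PurgingVoter()
    print(voter.vote({0: 7, 1: 7, 2: 9, 3: 7, 4: 7}))   # 7; replica 2 is purged
    print(voter.down)                                    # {2}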
Figure 8.11 compares the ability to continue operating until repair arrives of 5MR
designs that use fail-vote, purging, and fail-fast replicas. The observant reader will note
that this chart can be deemed guilty of a misleading comparison, since it claims that the
5MR system continues working when only one fail-fast replica is still running. But if that
fail-fast replica is actually a pair-and-compare module, it might be more accurate to say
that there are two still-working replicas at that point.
Another technique that takes advantage o f repair, can improve availability, and can
degrade gracefully (in other words, it can be fail-soft) is called partition. If there is a
choice o f purchasing a system that has either one fast processor or two slower processors,
the two-processor system has the virtue that when one o f its processors fails, the system

FIGURE 8.11
Failure points of three different 5MR supermodule designs, if repair does not happen in time.
(The chart plots the number of replicas still working correctly against time: the fail-vote design
fails first, then the design with purging, and last the design with fail-fast replicas.)


can continue to operate with half of its usual capacity until someone can repair the failed
processor. An electric power company, rather than installing a single generator of capac-
ity K megawatts, may install N generators of capacity K/N megawatts each.
When equivalent modules can easily share a load, partition can extend to what is
called N + 1 redundancy. Suppose a system has a load that would require the capacity of
N equivalent modules. The designer partitions the load across N + 1 or more modules.
Then, if any one o f the modules fails, the system can carry on at full capacity until the
failed module can be repaired.
N + 1 redundancy is most applicable to modules that are completely interchangeable,
can be dynamically allocated, and are not used as storage devices. Examples are proces­
sors, dial-up modems, airplanes, and electric generators. Thus, one extra airplane located
at a busy hub can mask the failure o f any single plane in an airline’ s fleet. When modules
are not completely equivalent (for example, electric generators come in a range of capac­
ities, but can still be interconnected to share load), the design must ensure that the spare
capacity is greater than the capacity o f the largest individual module. For devices that
provide storage, such as a hard disk, it is also possible to apply partition and N + 1 redun­
dancy with the same goals, but it requires a greater level of organization to preserve the
stored contents when a failure occurs, for example by using RAID, as was described in
Section 8.4.1, or some more general replica management system such as those discussed
in Section 10.3.7.
For some applications an occasional interruption o f availability is acceptable, while in
others every interruption causes a major problem. When repair is part o f the fault toler­
ance plan, it is sometimes possible, with extra care and added complexity, to design a
system to provide continuous operation. Adding this feature requires that when failures
occur, one can quickly identify the failing component, remove it from the system, repair
it, and reinstall it (or a replacement part) all without halting operation of the system. The
design required for continuous operation of computer hardware involves connecting and
disconnecting cables and turning off power to some components but not others, without
damaging anything. When hardware is designed to allow connection and disconnection
from a system that continues to operate, it is said to allow hot swap.
In a computer system, continuous operation also has significant implications for the
software. Configuration management software must anticipate hot swap so that it can
stop using hardware components that are about to be disconnected, as well as discover
newly attached components and put them to work. In addition, maintaining state is a
challenge. If there are periodic consistency checks on data, those checks (and repairs to
data when the checks reveal inconsistencies) must be designed to work correctly even
though the system is in operation and the data is perhaps being read and updated by
other users at the same time.
Overall, continuous operation is not a feature that should be casually added to a list
of system requirements. When someone suggests it, it may be helpful to point out that
it is much like trying to keep an airplane flying indefinitely. Many large systems that
appear to provide continuous operation are actually designed to stop occasionally for
maintenance.


8.6 Wrapping up Reliability

8.6.1 Design Strategies and Design Principles


Standing back from the maze of detail about redundancy, we can identify and abstract
three particularly effective design strategies:
• N-modular redundancy is a simple but powerful tool for masking failures and
increasing availability, and it can be used at any convenient level o f granularity.
• Fail-fast modules provide a sweeping simplification o f the problem of containing
errors. When containment can be described simply, reasoning about fault
tolerance becomes easier.
• Pair-and-compare allows fail-fast modules to be constructed from commercial,
off-the-shelf components.
Standing back still further, it is apparent that several general design principles are
directly applicable to fault tolerance. In the formulation of the fault-tolerance design pro-
cess in Section 8.1.2, we invoked be explicit, design for iteration, keep digging, and the
safety margin principle, and in exploring different fault tolerance techniques we have seen
several examples of adopt sweeping simplifications. One additional design principle that
applies to fault tolerance (and also, as we will see in Chapter 11 [on-line], to security)
comes from experience, as documented in the case studies o f Section 8.8:

Avoid rarely used components


Deterioration and corruption accumulate unnoticed—until the next use.

Whereas redundancy can provide masking o f errors, redundant components that are
used only when failures occur are much more likely to cause trouble than redundant
components that are regularly exercised in normal operation. The reason is that failures
in regularly exercised components are likely to be immediately noticed and fixed. Fail­
ures in unused components may not be noticed until a failure somewhere else happens.
But then there are two failures, which may violate the design assumptions o f the masking
plan. This observation is especially true for software, where rarely-used recovery proce­
dures often accumulate unnoticed bugs and incompatibilities as other parts o f the system
evolve. The alternative o f periodic testing o f rarely-used components to lower their fail­
ure latency is a band-aid that rarely works well.
In applying these design principles, it is important to consider the threats, the conse­
quences, the environment, and the application. Some faults are more likely than others,


some failures are more disruptive than others, and different techniques may be appropri­
ate in different environments. A computer-controlled radiation therapy machine, a deep-
space probe, a telephone switch, and an airline reservation system all need fault tolerance,
but in quite different forms. The radiation therapy machine should emphasize fault
detection and fail-fast design, to avoid injuring patients. Masking faults may actually be
a mistake. It is likely to be safer to stop, find their cause, and fix them before continuing
operation. The deep-space probe, once the mission begins, needs to concentrate on fail­
ure masking to ensure mission success. The telephone switch needs many nines o f
availability because customers expect to always receive a dial tone, but if it occasionally
disconnects one ongoing call, that customer will simply redial without thinking much
about it. Users o f the airline reservation system might tolerate short gaps in availability,
but the durability o f its storage system is vital. At the other extreme, most people find
that a digital watch has an MTTF that is long compared with the time until the watch
is misplaced, becomes obsolete, goes out o f style, or is discarded. Consequently, no pro­
vision for either error masking or repair is really needed. Some applications have built-in
redundancy that a designer can exploit. In a video stream, it is usually possible to mask
the loss of a single video frame by just repeating the previous frame.

8.6.2 How about the End-to-End Argument?


There is a potential tension between error masking and an end-to-end argument. An end-
to-end argument suggests that a subsystem need not do anything about errors and should
not do anything that might compromise other goals such as low latency, high through­
put, or low cost. The subsystem should instead let the higher layer system o f which it is
a component take care o f the problem because only the higher layer knows whether or
not the error matters and what is the best course o f action to take.
There are two counter arguments to that line of reasoning:
• Ignoring an error allows it to propagate, thus contradicting the modularity goal of
error containment. This observation points out an important distinction between
error detection and error masking. Error detection and containment must be
performed where the error happens, so that the error does not propagate wildly.
Error masking, in contrast, presents a design choice: masking can be done locally
or the error can be handled by reporting it at the interface (that is, by making the
module design fail-fast) and allowing the next higher layer to decide what masking
action— if any— to take.
• The lower layer may know the nature o f the error well enough that it can mask it
far more efficiently than the upper layer. The specialized burst error correction
codes used on DVDs come to mind. They are designed specifically to mask errors
caused by scratches and dust particles, rather than random bit-flips. So we have a
trade-off between the cost of masking the fault locally and the cost of letting the
error propagate and handling it in a higher layer.


These two points interact: When an error propagates it can contaminate otherwise
correct data, which can increase the cost of masking and perhaps even render masking
impossible. The result is that when the cost is small, error masking is usually done locally.
(That is assuming that masking is done at all. Many personal computer designs omit
memory error masking. Section 8.8.1 discusses some o f the reasons for this design
decision.)
A closely related observation is that when a lower layer masks a fault it is important
that it also report the event to a higher layer, so that the higher layer can keep track of
how much masking is going on and thus how much failure tolerance there remains.
Reporting to a higher layer is a key aspect o f the safety margin principle.

8.6.3 A Caution on the Use of Reliability Calculations


Reliability calculations seem to be exceptionally vulnerable to the garbage-in, garbage-
out syndrome. It is all too common that calculations o f mean time to failure are under­
mined because the probabilistic models are not supported by good statistics on the failure
rate o f the components, by measures o f the actual load on the system or its components,
or by accurate assessment o f independence between components.
For computer systems, back-of-the-envelope calculations are often more than suffi­
cient because they are usually at least as accurate as the available input data, which tends
to be rendered obsolete by rapid technology change. Numbers predicted by formula can
generate a false sense of confidence. This argument is much weaker for technologies that
tend to be stable (for example, production lines that manufacture glass bottles). So reli­
ability analysis is not a waste o f time, but one must be cautious in applying its methods
to computer systems.

8.6.4 Where to Learn More about Reliable Systems


Our treatment of fault tolerance has explored only the first layer o f fundamental con­
cepts. There is much more to the subject. For example, we have not considered another
class o f fault that combines the considerations o f fault tolerance with those o f security:
faults caused by inconsistent, perhaps even malevolent, behavior. These faults have the
characteristic that they generate inconsistent error values, possibly error values that are specif-
ically designed by an attacker to confuse or confound fault tolerance measures. These
faults are called Byzantine faults, recalling the reputation o f ancient Byzantium for mali­
cious politics. Here is a typical Byzantine fault: suppose that an evil spirit occupies one
of the three replicas o f a TM R system, waits for one o f the other replicas to fail, and then
adjusts its own output to be identical to the incorrect output o f the failed replica. A voter
accepts this incorrect result and the error propagates beyond the intended containment
boundary. In another kind of Byzantine fault, a faulty replica in an NM R system sends
different result values to each of the voters that are monitoring its output. Malevolence
is not required— any fault that is not anticipated by a fault detection mechanism can pro­
duce Byzantine behavior. There has recently been considerable attention to techniques


that can tolerate Byzantine faults. Because the tolerance algorithms can be quite com­
plex, we defer the topic to advanced study.
We also have not explored the full range o f reliability techniques that one might
encounter in practice. For an example that has not yet been mentioned, Sidebar 8.4
describes the heartbeat, a popular technique for detecting failures o f active processes.
This chapter has oversimplified some ideas. For example, the definition o f availability
proposed in Section 8.2 o f this chapter is too simple to adequately characterize many
large systems. If a bank has hundreds o f automatic teller machines, there will probably
always be a few teller machines that are not working at any instant. For this case, an avail­
ability measure based on the percentage o f transactions completed within a specified
response time would probably be more appropriate.
A rapidly moving but in-depth discussion of fault tolerance can be found in Chapter
3 o f the book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas
Reuter. A broader treatment, with case studies, can be found in the book Reliable Com­
puter Systems: Design and Evaluation, by Daniel P. Siewiorek and Robert S. Swarz.
Byzantine faults are an area o f ongoing research and development, and the best source is
current professional literature.
This chapter has concentrated on general techniques for achieving reliability that are
applicable to hardware, software, and complete systems. Looking ahead, Chapters 9[on­
line] and 10[on-line] revisit reliability in the context o f specific software techniques that
permit reconstruction o f stored state following a failure when there are several concur­
rent activities. Chapter 11 [on-line], on securing systems against malicious attack,
introduces a redundancy scheme known as defense in depth that can help both to contain
and to mask errors in the design or implementation o f individual security mechanisms.

Sidebar 8.4: Detecting failures with heartbeats  An activity such as a Web server is usually
intended to keep running indefinitely. If it fails (perhaps by crashing) its clients may notice that
it has stopped responding, but clients are not typically in a position to restart the server.
Something more systematic is needed to detect the failure and initiate recovery. One helpful
technique is to program the thread that should be performing the activity to send a periodic
signal to another thread (or a message to a monitoring service) that says, in effect, "I'm still
OK". The periodic signal is known as a heartbeat and the observing thread or service is known
as a watchdog.

The watchdog service sets a timer, and on receipt of a heartbeat message it restarts the timer. If
the timer ever expires, the watchdog assumes that the monitored service has gotten into trouble
and it initiates recovery. One limitation of this technique is that if the monitored service fails
in such a way that the only thing it does is send heartbeat signals, the failure will go undetected.

As with all fixed timers, choosing a good heartbeat interval is an engineering challenge. Setting
the interval too short wastes resources sending and responding to heartbeat signals. Setting the
interval too long delays detection of failures. Since detection is a prerequisite to repair, a long
heartbeat interval increases MTTR and thus reduces availability.
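A minimal Python sketch of the heartbeat/watchdog pattern; the interval values and names are illustrative only:

    import threading, time

    HEARTBEAT_INTERVAL = 1.0     # seconds between "I'm still OK" signals
    WATCHDOG_TIMEOUT = 3.0       # silence longer than this triggers recovery

    class Watchdog:
        """On every heartbeat, restart a timer; if it ever expires, assume the
        monitored activity is in trouble and initiate recovery."""

        def __init__(self, on_failure):
            self.on_failure = on_failure
            self.timer = None
            self.heartbeat()                    # arm the timer initially

        def heartbeat(self):
            if self.timer is not None:
                self.timer.cancel()
            self.timer = threading.Timer(WATCHDOG_TIMEOUT, self.on_failure)
            self.timer.daemon = True
            self.timer.start()

    def monitored_activity(watchdog, crash_after):
        """Simulate a server loop that sends heartbeats, then crashes silently."""
        start = time.time()
        while time.time() - start < crash_after:
            watchdog.heartbeat()
            time.sleep(HEARTBEAT_INTERVAL)

    wd = Watchdog(on_failure=lambda: print("watchdog: heartbeat missed, initiating recovery"))
    monitored_activity(wd, crash_after=2.5)
    time.sleep(WATCHDOG_TIMEOUT + 1)   # the watchdog fires about 3 s after the last heartbeat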

Chapter 8

Topics in Distributed Coordination and Distributed Transactions

This chapter contains parts of the book chapter:

G. Coulouris, J. Dollimore, T. Kindberg. Distributed Systems:
Concepts and Design. Third Edition. Chapters 11 (except 11.2
and 11.3) and 13, pp. 419-423, 436-464, and 515-552 (72 of 772).
Addison-Wesley, 2001. ISBN: 0201-61918-0

In addition, the chapter also contains the paper:

F. B. Schneider. Implementing fault-tolerant services using the
state machine approach: a tutorial. ACM Comput. Surv. 22(4),
pp. 299-319 (21 of 409), 1990. Doi: 10.1145/98163.98167

Achieving both h ig h av a ila b ility and a t o m ic it y simultaneously is a hard


problem. As we have seen previously, two strategies employed for availability
are redundancy via replication and distribution of system functionality. When
these techniques are employed, atomicity of our higher-level service is typically
compromised. An important question is whether atomicity can be reestab­
lished even when replication or distribution are employed. The text explores
techniques from distributed systems to provide us with an only partially pos­
itive answer, which depends fundamentally on the assumptions made about
the underlying failure model. First, the text explores the primitive of multi­
cast communication, which can be used as a building block in a replication
protocol. Second, we turn to distributed transactions, which can be used to
atomically interact with a system with distributed functionality. The ultimate
goal of this portion of the material is to provide us with an overview of the
results in distributed systems related to synchronous replication and distributed
transactions, as well as basic protocols to achieve both, in particular totally-
ordered multicast and two-phase commit.
The learning goals for this portion of the material are listed below.

• Explain the difficulties of guaranteeing atomicity in a replicated dis­


tributed system.

• Explain the notion of state-machine replication and the ISIS approach to
totally ordered multicast among replicas.

• Describe the implications of the FLP impossibility result and possible


workarounds.

• Explain mechanisms necessary for distributed transactions, such as dis­


tributed locking and distributed recovery.

• Explain the operation of the two-phase commit protocol (2PC).

• Predict outcomes of 2PC under failure scenarios.

Schneider’ s tutorial paper on state-machine replication is given to deepen


understanding of this approach to replication; however, it is to be considered
as an additional reading and not fundamental to the attainment of the learning
goals above.
By contrast, particular attention should be paid above to the material on
totally-ordered multicast (Section 11.4.3) and the overview of the consensus
problem and its related theoretical results (Sections 11.5.1, 11.5.2, and 11.5.4),
as well as to the material on distributed transactions, especially the two-phase
commit protocol (Coulouris et al.'s Chapter 13, especially Section 13.3).
COORDINATION AND AGREEMENT
11.1 Introduction
11.2 Distributed mutual exclusion
11.3 Elections
11.4 Multicast communication
11.5 Consensus and related problems
11.6 Summary

In this chapter, we introduce some topics and algorithms related to the issue of how
processes coordinate their actions and agree on shared values in distributed systems,
despite failures. The chapter begins with algorithms to achieve mutual exclusion among
a collection of processes, so as to coordinate their accesses to shared resources. It goes
on to examine how an election can be implemented in a distributed system. That is, it
describes how a group of processes can agree on a new coordinator of their activities after
the previous coordinator has failed.
The second half examines the related problems of multicast communication,
consensus, byzantine agreement and interactive consistency. In multicast, the issue is
how to agree on such matters as the order in which messages are to be delivered.
Consensus and the other problems generalize from this: how can any collection of
processes agree on some value, no matter what the domain of the values in question? We
encounter a fundamental result in the theory of distributed systems: that under certain
conditions - including surprisingly benign failure conditions - it is impossible to
guarantee that processes will reach consensus.


11.1 Introduction

This chapter introduces a collection of algorithms whose goals vary but that share an aim
that is fundamental in distributed systems: for a set of processes to coordinate their
actions or to agree on one or more values. For example, in the case of a complex piece
of machinery such as a spaceship, it is essential that the computers controlling it agree
on such conditions as whether the spaceship’ s mission is proceeding or has been
aborted. Furthermore, the computers must coordinate their actions correctly with respect
to shared resources (the spaceship’ s sensors and actuators). The computers must be able
to do so even where there is no fixed master-slave relationship between the components
(which would make coordination particularly simple). The reason for avoiding fixed
master-slave relationships is that we often require our systems to keep working correctly
even if failures occur, so we need to avoid single points of failure, such as fixed masters.
An important distinction for us, as in Chapter 10, will be whether the distributed
system under study is asynchronous or synchronous. In an asynchronous system we can
make no timing assumptions. In a synchronous system, we shall assume that there are
bounds on the maximum message transmission delay, on the time to execute each step
of a process, and on clock drift rates. The synchronous assumptions allow us to use
timeouts to detect process crashes.
Another important aim of the chapter while discussing algorithms is to consider
failures, and how to deal with them when designing algorithms. Section 2.3.2 introduced
a failure model, which we shall use in this chapter. Coping with failures is a subtle
business, so we begin by considering some algorithms that tolerate no failures and
progress through benign failures until we consider how to tolerate arbitrary failures. We
encounter a fundamental result in the theory of distributed systems. Even under
surprisingly benign failure conditions, it is impossible to guarantee in an asynchronous
system that a collection of processes can agree on a shared value - for example, for all
of a spaceship’s controlling processes to agree ‘mission proceed’ or ‘mission abort’.
Section 11.2 examines the problem of distributed mutual exclusion. This is the
extension to distributed systems of the familiar problem of avoiding race conditions in
kernels and multi-threaded applications. Since much of what occurs in distributed
systems is resource sharing, this is an important problem to solve. Next, Section 11.3
introduces a related but more general issue of how to ‘ elect’one of a collection of
processes to perform a special role. For example, in Chapter 10 we saw how processes
synchronized their clocks to a designated time server. If this server fails and several
surviving servers can fulfil that role, then for the sake of consistency it is necessary to
choose just one server to take over.
Multicast communication is the subject of Section 11.4. As Section 4.5.1
explained, multicast is a very useful communication paradigm, with applications from
locating resources to coordinating the updates to replicated data. Section 11.4 examines
multicast reliability and ordering semantics, and gives algorithms to achieve the
variations. Multicast delivery is essentially a problem of agreement between processes:
the recipients agree on which messages they will receive, and in which order they will
receive them. Section 11.5 discusses the problem of agreement more generally,
primarily in the forms known as consensus and byzantine agreement.


Figure 11.1 A network partition

The treatment followed in this chapter involves stating the assumptions and the
goals to be met, and giving an informal account of why the algorithms presented are
correct. There is insufficient space to provide a more rigorous approach. For that, we
refer the reader to a text that gives a thorough account of distributed algorithms, such as
Attiya and Welch [1998] and Lynch [1996].
Before presenting the problems and algorithms, we discuss failure assumptions
and the practical matter of detecting failures in distributed systems.

11.1.1 Failure assumptions and failure detectors


For the sake of simplicity, this chapter assumes that each pair of processes is connected
by reliable channels. That is, although the underlying network components may suffer
failures, the processes use a reliable communication protocol that masks these failures
- for example, by retransmitting missing or corrupted messages. Also for the sake of
simplicity, we assume that no process failure implies a threat to the other processes’
ability to communicate. This means that none of the processes depends upon another to
forward messages.
Note that a reliable channel eventually delivers a message to the recipient’s input
buffer. In a synchronous system, we suppose that there is hardware redundancy where
necessary, so that a reliable channel not only eventually delivers each message despite
underlying failures but it does so within a specified time bound.
In any particular interval of time, communication between some processes may
succeed while communication between others is delayed. For example, the failure of a
router between two networks may mean that a collection of four processes is split into
two pairs, such that intra-pair communication is possible over their respective networks;
but inter-pair communication is not possible while the router has failed. This is known
as a network partition (Figure 11.1). Over a point-to-point network such as the Internet,
complex topologies and independent routing choices mean that connectivity may be
asymmetric: communication is possible from process p to process q, but not vice versa.
Connectivity may also be intransitive: communication is possible from p to q and from
q to r, but p cannot communicate directly with r. Thus our reliability assumption entails
that eventually any failed link or router will be repaired or circumvented. Nevertheless,
the processes may not all be able to communicate at the same time.


The chapter assumes, unless we state otherwise, that processes only fail by
crashing - an assumption that is good enough for many systems. In Section 11.5, we
shall consider how to treat the cases where processes have arbitrary (byzantine) failures.
Whatever the type of failure, a correct process is one that exhibits no failures at any
point in the execution under consideration. Note that correctness applies to the whole
execution, not just to a part of it. So a process that suffers a crash failure is ‘
non-failed’
before that point, not ‘ correct’before that point.
One of the problems in the design of algorithms that can overcome process crashes
is that of deciding when a process has crashed. A failure detector [Chandra and Toueg
1996, Stelling et al. 1998] is a service that processes queries about whether a particular
process has failed. It is often implemented by an object local to each process (on the
same computer) that runs a failure-detection algorithm in conjunction with its
counterparts at other processes. The object local to each process is called a local failure
detector. We shall outline how to implement failure detectors shortly, but first we shall
concentrate on some of the properties o f failure detectors.
A failure ‘ detector’is not necessarily accurate. Most fall into the category of
unreliable failure detectors. An unreliable failure detector may produce one of two
values when given the identity of a process: Unsuspected or Suspected. Both of these
results are hints, which may or may not accurately reflect whether the process has
actually failed. A result of Unsuspected signifies that the detector has recently received
evidence suggesting that the process has not failed; for example, a message was recently
received from it. But of course the process can have failed since then. A result of
Suspected signifies that the failure detector has some indication that the process may
have failed. For example, it may be that no message from the process has been received
for more than a nominal maximum length of silence (even in an asynchronous system,
practical upper bounds can be used as hints). The suspicion may be misplaced: for
example, the process could be functioning correctly, but on the other side of a network
partition; or it could be running more slowly than expected.
A reliable failure detector is one that is always accurate in detecting a process’ s
failure. It answers processes’queries with either a response of Unsuspected - which, as
before, can only be a hint - or Failed. A result of Failed means that the detector has
determined that the process has crashed. Recall that a process that has crashed stays that
way, since by definition a process never takes another step once it has crashed.
It is important to realize that, although we speak of one failure detector acting for
a collection of processes, the response that the failure detector gives to a process is only
as good as the information available at that process. A failure detector may sometimes
give different responses to different processes, since communication conditions vary
from process to process.
We can implement an unreliable failure detector using the following algorithm.
Each process p sends a ‘ p is here’message to every other process, and it does this every
T seconds. The failure detector uses an estimate of the maximum message transmission
time of D seconds. If the local failure detector at process q does not receive a ‘ p is here’
message within T + D seconds of the last one, then it reports to q that p is Suspected.
However, if it subsequently receives a ‘ p is here’message, then it reports to q that p is
OK.
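A minimal Python sketch of the local failure detector just described; the values of T and D are arbitrary, for illustration only:

    import time

    T = 5.0    # seconds between "p is here" messages
    D = 10.0   # estimated maximum message transmission delay

    class LocalFailureDetector:
        """Unreliable failure detector: suspect p if no 'p is here' message has
        arrived within T + D seconds of the previous one."""

        def __init__(self):
            self.last_heard = {}          # process id -> time of last "p is here"

        def on_alive_message(self, p):
            self.last_heard[p] = time.time()

        def query(self, p):
            last = self.last_heard.get(p)
            if last is not None and time.time() - last <= T + D:
                return "Unsuspected"      # only a hint: p may have failed since
            return "Suspected"            # also a hint: p may just be slow or partitioned

    fd = LocalFailureDetector()
    fd.on_alive_message("p1")
    print(fd.query("p1"))                 # Unsuspected
    print(fd.query("p2"))                 # Suspected (never heard from)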
In a real distributed system, there are practical limits on message transmission
times. Even email systems give up after a few days, since it is likely that communication


links and routers will have been repaired in that time. If we choose small values for T
and D (so that they total 0.1 second, say), then the failure detector is likely to suspect
non-crashed processes many times, and much bandwidth will be taken up with ‘ p is
here’messages. If we choose a large total timeout value (a week, say) then crashed
processes will often be reported as Unsuspected.
A practical solution to this problem is to use timeout values that reflect the
observed network delay conditions. If a local failure detector receives a ‘p is here’ in 20
seconds instead of the expected maximum of 10 seconds, then it could reset its timeout
value for p accordingly. The failure detector remains unreliable, and its answers to
queries are still only hints, but the probability of its accuracy increases.
In a synchronous system, our failure detector can be made into a reliable one. We
can choose D so that it is not an estimate but an absolute bound on message transmission
times; the absence of a ‘p is here’ message within T + D seconds entitles the local
failure detector to conclude that p has crashed.
The reader may wonder whether failure detectors are of any practical use.
Unreliable failure detectors may suspect a process that has not failed (they may be
inaccurate); and they may not suspect a process that has in fact failed (they may be
incomplete). Reliable failure detectors, on the other hand, require that the system is
synchronous (and few practical systems are).
We have introduced failure detectors because they help us to think about the
nature of failures in a distributed system. And any practical system that is designed to
cope with failures must detect them - however imperfectly. But it turns out that even
unreliable failure detectors with certain well-defined properties can help us to provide
practical solutions to the problem of coordinating processes in the presence of failures.
We return to this point in Section 11.5.

11.2 Distributed mutual exclusion

Distributed processes often need to coordinate their activities. If a collection of


processes share a resource or collection of resources, then often mutual exclusion is
required to prevent interference and ensure consistency when accessing the resources.
This is the critical section problem, familiar in the domain of operating systems. In a
distributed system, however, neither shared variables nor facilities supplied by a single
local kernel can be used to solve it, in general. We require a solution to distributed
mutual exclusion; one that is based solely on message passing.
In some cases shared resources are managed by servers that also provide
mechanisms for mutual exclusion. Chapter 12 describes how some servers synchronize
client accesses to resources. But in some practical cases a separate mechanism for
mutual exclusion is required.
Consider users who update a text file. A simple means of ensuring that their
updates are consistent is to allow them to access it only one at a time, by requiring the
editor to lock the file before updates can be made. NFS file servers, described in Chapter
8, are designed to be stateless and therefore do not support file locking. For this reason,
UNIX systems provide a separate file-locking service, implemented by the daemon
lockd, to handle locking requests from clients.

Taking the example just given, suppose that p3 either had not failed but was
running unusually slowly (that is, the assumption that the system is synchronous is
incorrect) or that p3 had failed but is then replaced. Just as p2 sends its coordinator
message, p3 (or its replacement) does the same. p2 receives p3’s coordinator message
after it sent its own and so sets elected_2 = p3. Due to variable message transmission
delays, p1 receives p2’s coordinator message after p3’s and so eventually sets
elected_1 = p2. Condition E1 has been broken.
With regard to the performance of the algorithm, in the best case the process with
the second highest identifier notices the coordinator’s failure. Then it can immediately
elect itself and send N - 2 coordinator messages. The turnaround time is one message.
The bully algorithm requires O(N²) messages in the worst case - that is, when the
process with the least identifier first detects the coordinator’s failure. For then N - 1
processes altogether begin elections, each sending messages to processes with higher
identifiers.

11.4 Multicast communication

Section 4.5.1 described IP multicast, which is an implementation of group


communication. Group, or multicast, communication requires coordination and
agreement. The aim is for each of a group of processes to receive copies of the messages
sent to the group, often with delivery guarantees. The guarantees include agreement on
the set of messages that every process in the group should receive and on the delivery
ordering across the group members.
Group communication systems are extremely sophisticated. Even IP multicast,
which provides minimal delivery guarantees, requires a major engineering effort. Time
and bandwidth efficiency are important concerns, and are challenging even for static
groups of processes. The problems are multiplied when processes can join and leave
groups at arbitrary times.
Here we study multicast communication to groups of processes whose
membership is known. Chapter 14 will expand our study to fully fledged group
communication, including the management of dynamically varying groups.
Multicast communication has been the subject of many projects, including the V-
system [Cheriton and Zwaenepoel 1985], Chorus [Rozier et al. 1988], Amoeba
[Kaashoek et al. 1989, Kaashoek and Tanenbaum 1991], Trans/Total [Melliar-Smith et
al. 1990], Delta-4 [Powell 1991], Isis [Birman 1993], Horus [van Renesse et al. 1996],
Totem [Moser et al. 1996] and Transis [Dolev and Malki 1996] - and we shall cite other
notable work in the course of this section.
The essential feature of multicast communication is that a process issues only one
multicast operation to send a message to each of a group of processes (in Java this
operation is aSocket.send(aMessage)) instead of issuing multiple send operations to
individual processes. Communication to all processes in the system, as opposed to a
sub-group of them, is known as broadcast.
The use of a single multicast operation instead of multiple send operations
amounts to much more than a convenience for the programmer. It enables the


implementation to be efficient and allows it to provide stronger delivery guarantees than
would otherwise be possible.

Efficiency. The information that the same message is to be delivered to all processes
in a group allows the implementation to be efficient in its utilization of bandwidth. It
can take steps to send the message no more than once over any communication link,
by sending the message over a distribution tree; and it can use network hardware
support for multicast where this is available. The implementation can also minimize
the total time taken to deliver the message to all destinations, instead of transmitting
it separately and serially.
To see these advantages, compare the bandwidth utilization and the total
transmission time taken when sending the same message from a computer in London
to two computers on the same Ethernet in Palo Alto, (a) by two separate UDP sends,
and (b) by a single IP-multicast operation. In the former case, two copies of the
messages are sent independently, and the second is delayed by the first. In the latter
case, a set of multicast-aware routers forward a single copy of the message from
London to a router on the destination LAN. The final router then uses hardware
multicast (provided by the Ethernet) to deliver the message to the destinations,
instead of sending it twice.

Delivery guarantees: If a process issues multiple independent send operations to


individual processes, then there is no way for the implementation to provide delivery
guarantees that affect the group of processes as a whole. If the sender fails half-way
through sending, then some members of the group may receive the message while
others do not. And the relative ordering of two messages delivered to any two group
members is undefined. In the particular case of IP multicast, no ordering or reliability
guarantees are in fact offered. But stronger multicast guarantees can be made, and we
shall shortly define some.

System model ◊ The system contains a collection of processes, which can communicate
reliably over one-to-one channels. As before, processes may fail only by crashing.
The processes are members of groups, which are the destinations of messages sent
with the multicast operation. It is generally useful to allow processes to be members of
several groups simultaneously - for example, to enable processes to receive information
from several sources by joining several groups. But to simplify our discussion of
ordering properties, we shall sometimes restrict processes to being members of at most
one group at a time.
The operation multicast(g, m) sends the message m to all members of the group g
of processes. Correspondingly, there is an operation deliver(m) that delivers a message
sent by multicast to the calling process. We use the term deliver rather than receive to
make clear that a multicast message is not always handed to the application layer inside
the process as soon as it is received at the process’s node. This is explained when we
discuss multicast delivery semantics shortly.
Every message m carries the unique identifier of the process sender(m) that sent
it, and the unique destination group identifier group(m). We assume that processes do
not lie about the origin or destinations of messages.
A group is said to be closed if only members of the group may multicast to it
(Figure 11.9). A process in a closed group delivers to itself any message that it


Figure 11.9  Open and closed groups
multicasts to the group. A group is open if processes outside the group may send to it.
(The categories ‘ open’and ‘ closed’also apply with analogous meanings to mailing
lists.) Closed groups of processes are useful, for example, for cooperating servers to
send messages to one another that only they should receive. Open groups are useful, for
example, for delivering events to groups of interested processes.
Some algorithms assume that groups are closed. The same effect as openness can
be achieved with a closed group by picking a member of the group and sending it a
message (one-to-one) for it to multicast to its group. Rodrigues et al. [1998] discuss
multicast to open groups.

11.4.1 Basic multicast


It is useful to have at our disposal a basic multicast primitive that guarantees, unlike IP
multicast, that a correct process will eventually deliver the message, as long as the
multicaster does not crash. We call the primitive B-multicast and its corresponding basic
delivery primitive is B-deliver. We allow processes to belong to several groups, and
each message is destined for some particular group.
A straightforward way to implement B-multicast is to use a reliable one-to-one
send operation, as follows:
To B-multicast(g, m): for each process p ∈ g, send(p, m);
On receive(m) at p: B-deliver(m) at p.
The implementation may use threads to perform the send operations concurrently, in an
attempt to reduce the total time taken to deliver the message. Unfortunately, such an
implementation is liable to suffer from a so-called ack-implosion if the number of
processes is large. The acknowledgments sent as part of the reliable send operation are
liable to arrive from many processes at about the same time. The multicasting process’s
buffers will rapidly fill and it is liable to drop acknowledgments. It will therefore
retransmit the message, leading to yet more acknowledgments and further waste of


network bandwidth. A more practical basic multicast service can be built using IP
multicast, and we invite the reader to show this.
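A minimal Python sketch of B-multicast built on a reliable one-to-one send; the send primitive is a stand-in supplied by the caller, and the in-process "inboxes" are purely illustrative:

    def b_multicast(group, message, send):
        """To B-multicast(g, m): for each process p in g, send(p, m).
        'send' is assumed to be a reliable one-to-one send primitive."""
        for p in group:
            send(p, message)

    def on_receive(message, b_deliver):
        """On receive(m) at p: B-deliver(m) at p."""
        b_deliver(message)

    # Toy in-process demonstration with message queues standing in for channels:
    inboxes = {"p1": [], "p2": [], "p3": []}
    b_multicast(inboxes, "m1", send=lambda p, m: inboxes[p].append(m))
    print(inboxes)        # every member of the group received one copy of "m1"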

11.4.2 Reliable multicast


Section 2.3.2 defined reliable one-to-one communication channels between pairs of
processes. The required safety property is called integrity - that any message delivered
is identical to one that was sent, and that no message is delivered twice. The required
liveness property is called validity: that any message is eventually delivered to the
destination, if it is correct.
Following Hadzilacos and Toueg [1994] and Chandra and Toueg [1996], we now
define a reliable multicast, with corresponding operations R-multicast and R-deliver.
Properties analogous to integrity and validity are clearly highly desirable in reliable
multicast delivery. But we add another: a requirement that all correct processes in the
group must receive a message if any of them does. It is important to realize that this is
not a property of the B-multicast algorithm that is based on a reliable one-to-one send
operation. The sender may fail at any point while B-multicast proceeds, so some
processes may deliver a message while others do not.
A reliable multicast is one that satisfies the following properties; we explain the
properties after stating them.

Integrity: A correct process p delivers a message m at most once. Furthermore,


p ∈ group(m) and m was supplied to a multicast operation by sender(m). (As with
one-to-one communication, messages can always be distinguished by a sequence
number relative to their sender.)

Validity: If a correct process multicasts message m then it will eventually deliver m.

Agreement: If a correct process delivers message m, then all other correct processes
in group(m) will eventually deliver m.

The integrity property is analogous to that for reliable one-to-one communication. The
validity property guarantees liveness for the sender. This may seem an unusual property,
because it is asymmetric (it mentions only one particular process). But notice that
validity and agreement together amount to an overall liveness requirement: if one
process (the sender) eventually delivers a message m then, since the correct processes
agree on the set of messages they deliver, it follows that m will eventually be delivered
to all the group’s correct members.
The advantage of expressing the validity condition in terms of self-delivery is
simplicity. What we require is that the message be delivered eventually by some correct
member of the group.
The agreement condition is related to atomicity, the property of ‘ all or nothing’,
applied to delivery of messages to a group. If a process that multicasts a message crashes
before it has delivered it, then it is possible that the message will not be delivered to any
process in the group; but if it is delivered to some correct process, then all other correct
processes will deliver it. Many papers in the literature use the term ‘ atomic’to include
a total ordering condition; we define this shortly.

Figure 11.10 Reliable multicast algorithm

On initialization
    Received := {};
For process p to R-multicast message m to group g
    B-multicast(g, m);    // p ∈ g is included as a destination
On B-deliver(m) at process q with g = group(m)
    if (m ∉ Received)
    then
        Received := Received ∪ {m};
        if (q ≠ p) then B-multicast(g, m); end if
        R-deliver m;
    end if

Implementing reliable multicast over B-multicast ◊ Figure 11.10 gives a reliable


multicast algorithm, with primitives R-multicast and R-deliver, which allows processes
to belong to several closed groups simultaneously. To R-multicast a message, a process
B-multicasts the message to the processes in the destination group (including itself).
When the message is B-delivered, the recipient in turn B-multicasts the message to the
group (if it is not the original sender), and then R-delivers the message. Since a message
may arrive more than once, duplicates of the message are detected and not delivered.
This algorithm clearly satisfies validity, since a correct process will eventually B-
deliver the message to itself. By the integrity property of the underlying communication
channels used in B-multicast, the algorithm also satisfies the integrity property.
Agreement follows from the fact that every correct process B-multicasts the
message to the other processes after it has B-delivered it. If a correct process does not
R-deliver the message, then this can only be because it never B-delivered it. That in turn
can only be because no other correct process B-delivered it either; therefore, none will
R-deliver it.
The reliable multicast algorithm that we have described is correct in an
asynchronous system, since we made no timing assumptions. But the algorithm is
inefficient for practical purposes. Each message is sent |g| times to each process.
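A direct transcription of Figure 11.10 into Python might look as follows. The basic-multicast object, the R-deliver callback and the msg_id and sender fields on messages are assumptions; the unique identifier stands in for the sender/sequence-number pair that makes messages distinguishable.

# Sketch of R-multicast over B-multicast (Figure 11.10). Messages are assumed to
# carry a unique 'msg_id' and a 'sender' field so that duplicates can be detected.
class ReliableMulticast:
    def __init__(self, pid, basic, r_deliver):
        self.pid = pid                  # this process's identifier
        self.basic = basic              # an object offering b_multicast(group, msg)
        self.r_deliver = r_deliver      # application-level R-deliver callback
        self.received = set()           # Received := {}

    def r_multicast(self, group, msg):
        # p ∈ group is included as a destination
        self.basic.b_multicast(group, msg)

    def on_b_deliver(self, group, msg):
        if msg.msg_id not in self.received:
            self.received.add(msg.msg_id)
            if msg.sender != self.pid:
                # re-multicast before delivering, so agreement survives our crash
                self.basic.b_multicast(group, msg)
            self.r_deliver(msg)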

Reliable multicast over IP multicast ◊ An alternative realization of R-multicast is to use


a combination of IP multicast, piggybacked acknowledgments (that is,
acknowledgements attached to other messages), and negative acknowledgments. This
R-multicast protocol is based on the observation that IP multicast communication is
often successful. In the protocol, processes do not send separate acknowledgment
messages; instead, they piggyback acknowledgments on the messages that they send to
the group. Processes send a separate response message only when they detect that they
have missed a message. A response indicating the absence of an expected message is
known as a negative acknowledgement.
The description assumes that groups are closed. Each process p maintains a
sequence number Sp for each group g to which it belongs. The sequence number is
initially zero. Each process also records Rq, the sequence number of the latest message
it has delivered from process q that was sent to group g.


Figure 11.11 The hold-back queue for arriving multicast messages


For p to R-multicast a message to group g, it piggybacks onto the message the
value Sp and acknowledgments, of the form <q, Rq>. An acknowledgement states, for
some sender q, the sequence number of the latest message from q destined for g that p
has delivered since it last multicast a message. The multicaster p then IP-multicasts the
message with its piggybacked values to g, and increments Sp by one.
The piggybacked values in a multicast message enable the recipients to learn
about messages that they have not received. A process R-delivers a message destined for
g bearing the sequence number S from p if and only if S = Rp + 1, and it increments
Rp by one immediately after delivery. If an arriving message has S ≤ Rp, then it has
delivered the message before and it discards it. If S > Rp + 1, or if R > Rq for an
enclosed acknowledgement <q, R>, then there are one or more messages that it has not
yet received (and which are likely to have been dropped, in the first case). It keeps any
message for which S > Rp + 1 in a hold-back queue (Figure 11.11) – such queues are
often used to meet message delivery guarantees. It requests missing messages by
sending negative acknowledgements – to the original sender or to a process q from
which it has received an acknowledgement <q, R> with R no less than the required
sequence number.
The hold-back queue is not strictly necessary for reliability but it simplifies the
protocol by enabling us to use sequence numbers to represent sets of delivered
messages. It also provides us with a guarantee of delivery order (see Section 11.4.3).
The integrity property follows from the detection of duplicates and the underlying
properties of IP multicast (which uses checksums to expunge corrupted messages). The
validity property holds because IP multicast has that property. For agreement we
require, first, that a process can always detect missing messages. That in turn means that
it will always receive a further message that enables it to detect the omission. As this
simplified protocol stands, we guarantee detection of missing messages only in the case
where correct processes multicast messages indefinitely. Second, the agreement
property requires that there is always an available copy of any message needed by a
process that did not receive it. We therefore assume that processes retain copies of the
messages they have delivered – indefinitely, in this simplified protocol.
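The per-sender delivery decision just described can be sketched as follows, for a single group. Sequence numbers are taken to start at 1 in this sketch; the deliver and send_nack callbacks are assumptions, and the processing of piggybacked acknowledgements is omitted so that only the sequence-number check, the hold-back queue and the negative acknowledgements are shown.

# Sketch of the R-deliver decision of the IP-multicast-based protocol, for one group.
# Each message carries (sender, seqno, payload); 'deliver' is the R-deliver upcall and
# 'send_nack' an assumed callback for requesting a missing message.
class IPReliableReceiver:
    def __init__(self, deliver, send_nack):
        self.deliver = deliver
        self.send_nack = send_nack
        self.R = {}          # R[q]: sequence number of the latest message delivered from q
        self.holdback = {}   # (q, seqno) -> payload, for messages that arrived too early

    def on_ip_deliver(self, sender, seqno, payload):
        expected = self.R.get(sender, 0) + 1
        if seqno < expected:
            return                                     # duplicate: already delivered
        if seqno > expected:
            self.holdback[(sender, seqno)] = payload   # hold back the out-of-order message
            for missing in range(expected, seqno):     # negative acknowledgements for the gap
                self.send_nack(sender, missing)
            return
        # seqno == expected: deliver it, then drain anything now in order
        self.deliver(sender, payload)
        self.R[sender] = seqno
        nxt = seqno + 1
        while (sender, nxt) in self.holdback:
            self.deliver(sender, self.holdback.pop((sender, nxt)))
            self.R[sender] = nxt
            nxt += 1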


Neither of the assumptions we made to ensure agreement is practical (see Exercise
11.14). However, agreement is practically addressed in the protocols from which ours is
derived: the Psync protocol [Peterson et al. 1989], Trans protocol [Melliar-Smith et al.
1990] and scalable reliable multicast protocol [Floyd et al. 1997]. Psync and Trans also
provide further delivery ordering guarantees.
Uniform properties ◊ The definition of agreement given above refers only to the
behaviour of correct processes – processes that never fail. Consider what would happen
in the algorithm of Figure 11.10 if a process is not correct and crashed after it had R-
delivered a message. Since any process that R-delivers the message must first B-
multicast it, it follows that all correct processes will still eventually deliver the message.
Any property that holds whether or not processes are correct is called a uniform
property. We define uniform agreement as follows:
Uniform agreement: If a process, whether it is correct or fails, delivers message m,
then all correct processes in group(m) will eventually deliver m.
Uniform agreement allows a process to crash after it has delivered a message, while still
ensuring that all correct processes will deliver the message. We have argued that the
algorithm of Figure 11.10 satisfies this property, which is stronger than the non-uniform
agreement property defined above.
Uniform agreement is useful in applications where a process may take an action
that produces an observable inconsistency before it crashes. For example, consider that
the processes are servers that manage copies of a bank account, and that updates to the
account are sent using reliable multicast to the group of servers. If the multicast does not
satisfy uniform agreement, then a client that accesses a server just before it crashes may
observe an update that no other server will process.
It is interesting to note that if we reverse the lines ‘R-deliver m’ and ‘if (q ≠ p)
then B-multicast(g, m); end if’ in Figure 11.10, then the resultant algorithm does not
satisfy uniform agreement.
Just as there is a uniform version of agreement, there are also uniform versions of
any multicast property, including validity and integrity and the ordering properties that
we are about to define.

11.4.3 Ordered multicast


The basic multicast algorithm of Section 11.4.1 delivers messages to processes in an
arbitrary order, due to arbitrary delays in the underlying one-to-one send operations.
This lack of an ordering guarantee is not satisfactory for many applications. For
example, in a nuclear power plant it may be important that events signifying threats to
safety conditions and events signifying actions by control units are observed in the same
order by all processes in the system.
The common ordering requirements are total ordering, causal ordering, FIFO
ordering and the hybrids total-causal and total-FIFO. To simplify our discussion, we
define these orderings under the assumption that any process belongs to at most one
group. We shall later discuss the implications of allowing groups to overlap.
FIFO ordering: If a correct process issues multicast(g, m) and then multicast(g, m′),
then every correct process that delivers m′ will deliver m before m′.

Figure 11.12 Total, FIFO and causal ordering of multicast messages
Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related
messages F1 and F2 and the causally related messages C1 and C3 – and the otherwise
arbitrary delivery ordering of messages.

Causal ordering: If multicast(g, m) → multicast(g, m′), where → is the
happened-before relation induced only by messages sent between the members of g,
then any correct process that delivers m′ will deliver m before m′.

Total ordering: If a correct process delivers message m before it delivers m′, then
any other correct process that delivers m′ will deliver m before m′.

Causal ordering implies FIFO ordering, since any two multicasts by the same process
are related by happened-before. Note that FIFO ordering and causal ordering are only
partial orderings: not all messages are sent by the same process, in general; similarly,
some multicasts are concurrent (not ordered by happened-before).
Figure 11.12 illustrates the orderings for the case of three processes. Close
inspection of the figure shows that the totally ordered messages are delivered in the
opposite order to the physical time at which they were sent. In fact, the definition of total

Figure 11.13 Display from bulletin board program

Bulletin board: os.interesting


Item    From           Subject
23      A.Hanlon       Mach
24      G.Joseph       Microkernels
25      A.Hanlon       Re: Microkernels
26      T.L’Heureux    RPC performance
27      M.Walker       Re: Mach
end

ordering allows message delivery to be ordered arbitrarily, as long as the order is the
same at different processes. Since total ordering is not necessarily also a FIFO or causal
ordering, we define the hybrid of FIFO-total ordering as one for which message delivery
obeys both FIFO and total ordering; similarly, under causal-total ordering message
delivery obeys both causal and total ordering.
The definitions of ordered multicast do not assume or imply reliability. For
example, the reader should check that, under total ordering, if correct process p delivers
message m and then delivers m', then a correct process q can deliver m without also
delivering m' or any other message ordered after m.
We can also form hybrids of ordered and reliable protocols. A reliable totally
ordered multicast is often referred to in the literature as an atomic multicast. Similarly,
we may form reliable FIFO multicast, reliable causal multicast and reliable versions of
the hybrid ordered multicasts.
Ordering the delivery of multicast messages, as we shall see, can be expensive in
terms o f delivery latency and bandwidth consumption. The ordering semantics that we
have described may delay the delivery of messages unnecessarily. That is, at the
application level, a message may be delayed for another message that it does not in fact
depend upon. For this reason, some have proposed multicast systems that use the
application-specific message semantics alone to determine the order of message
delivery [Cheriton and Skeen 1993, Pedone and Schiper 1999].

The example of the bulletin board ◊ To make multicast delivery semantics more
concrete, consider an application in which users post messages to bulletin boards. Each
user runs a bulletin-board application process. Every topic of discussion has its own
process group. When a user posts a message to a bulletin board, the application
multicasts the user’s posting to the corresponding group. Each user’s process is a
member of the group for the topic in which he or she is interested, so that the user will
receive just the postings concerning that topic.
Reliable multicast is required if every user is to receive every posting eventually.
The users also have ordering requirements. Figure 11.13 shows the postings as they
appear to a particular user. At a minimum, FIFO ordering is desirable, since then every
posting from a given user – ‘A.Hanlon’, say – will be received in the same order, and
users can talk consistently about A.Hanlon’s second posting.

Note that the messages whose subjects are ‘Re: Microkernels’ (25) and ‘Re: Mach’
(27) appear after the messages to which they refer. A causally ordered multicast is
needed to guarantee this relationship. Otherwise, arbitrary message delays could mean
that, say, a message ‘ Re: Mach’could appear before the original message about Mach.
If the multicast delivery was totally ordered, then the numbering in the left-hand
column would be consistent between users. Users could refer unambiguously, for
example, to ‘ message 24’ .
In practice, the USENET bulletin board system implements neither causal nor
total ordering. The communication costs of achieving these orderings on a large scale
outweigh their advantages.
Implementing FIFO ordering ◊ FIFO-ordered multicast (with operations FO-multicast
and FO-deliver) is achieved with sequence numbers, much as we would achieve it for
one-to-one communication. We shall consider only non-overlapping groups. The reader
should verify that the reliable multicast protocol that we defined on top of IP multicast
in Section 11.4.2 also guarantees FIFO ordering, but we shall show how to construct a
FIFO-ordered multicast on top of any given basic multicast. We use the variables Sp and
Rq held at process p from the reliable multicast protocol of Section 11.4.2: Sp is a count
of how many messages p has sent to g and, for each q, Rq is the sequence number of the
latest message p has delivered from process q that was sent to group g.
For p to FO-multicast a message to group g, it piggybacks the value Sp onto the
message, B-multicasts the message to g and then increments Sp by 1. Upon receipt of a
message from q bearing the sequence number S, p checks whether S = Rq + 1. If so,
this message is the next one expected from the sender q and p FO-delivers it, setting
Rq := S. If S > Rq + 1, it places the message in the hold-back queue until the intervening
messages have been delivered and S = Rq + 1.
Since all messages from a given sender are delivered in the same sequence, and
since a message’ s delivery is delayed until its sequence number has been reached, the
condition for FIFO ordering is clearly satisfied. But this is so only under the assumption
that groups are non-overlapping.
Note that we can use any implementation of B-multicast in this protocol.
Moreover, if we use a reliable R-multicast primitive instead of B-multicast, then we
obtain a reliable FIFO multicast.
Implementing total ordering ◊ The basic approach to implementing total ordering is to
assign totally ordered identifiers to multicast messages so that each process makes the
same ordering decision based upon these identifiers. The delivery algorithm is very
similar to the one we described for FIFO ordering; the difference is that processes keep
group-specific sequence numbers rather than process-specific sequence numbers. We
only consider how to totally order messages sent to non-overlapping groups. We call the
multicast operations TO-multicast and TO-deliver.
We discuss two main methods for assigning identifiers to messages. The first of
these is for a process called a sequencer to assign them (Figure 11.14). A process
wishing to TO-multicast a message m to group g attaches a unique identifier id(m) to it.
The messages for g are sent to the sequencer for g, sequencer(g), as well as to the
members of g. (The sequencer may be chosen to be a member of g.) The process
sequencer(g) maintains a group-specific sequence number sg, which it uses to assign
increasing and consecutive sequence numbers to the messages that it B-delivers. It

Figure 11.14 Total ordering using a sequencer

1. Algorithm for group member p
On initialization: rg := 0;
To TO-multicast message m to group g
    B-multicast(g ∪ {sequencer(g)}, <m, i>);
On B-deliver(<m, i>) with g = group(m)
    Place <m, i> in hold-back queue;
On B-deliver(morder = <“order”, i, S>) with g = group(morder)
    wait until <m, i> in hold-back queue and S = rg;
    TO-deliver m;    // (after deleting it from the hold-back queue)
    rg := S + 1;

2. Algorithm for sequencer of g
On initialization: sg := 0;
On B-deliver(<m, i>) with g = group(m)
    B-multicast(g, <“order”, i, sg>);
    sg := sg + 1;

announces the sequence numbers by B-multicasting order messages to g (see Figure
11.14 for the details).
A message will remain in the hold-back queue indefinitely until it can be TO-
delivered according to the corresponding sequence number. Since the sequence numbers
are well defined (by the sequencer), the criterion for total ordering is met. Furthermore,
if the processes use a FIFO-ordered variant of B-multicast, then the totally ordered
multicast is also causally ordered. We leave the reader to show this.
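A Python rendering of Figure 11.14 might look roughly as follows. The b_multicast and to_deliver callbacks, the per-message identifier i, and the assumption that the group is given as a set of process identifiers are all illustrative additions, not part of the original protocol description.

# Sketch of sequencer-based total ordering (Figure 11.14). Members hold data
# messages back until the matching order message arrives and its sequence number
# is the next one expected.
class TotalOrderMember:
    def __init__(self, group, sequencer, b_multicast, to_deliver):
        self.group, self.sequencer = group, sequencer
        self.b_multicast, self.to_deliver = b_multicast, to_deliver
        self.r_g = 0              # next sequence number expected in this group
        self.held = {}            # message identifier i -> message
        self.orders = {}          # sequence number S -> message identifier i

    def to_multicast(self, m, i):
        self.b_multicast(self.group | {self.sequencer}, ('data', i, m))

    def on_b_deliver(self, kind, *fields):
        if kind == 'data':
            i, m = fields
            self.held[i] = m
        else:                     # ('order', i, S) from the sequencer
            i, S = fields
            self.orders[S] = i
        self._drain()

    def _drain(self):
        # TO-deliver in sequence-number order once both parts of a message are here
        while self.r_g in self.orders and self.orders[self.r_g] in self.held:
            i = self.orders.pop(self.r_g)
            self.to_deliver(self.held.pop(i))
            self.r_g += 1

class TotalOrderSequencer:
    def __init__(self, group, b_multicast):
        self.group, self.b_multicast, self.s_g = group, b_multicast, 0

    def on_b_deliver(self, kind, i, m):
        if kind == 'data':        # number each data message and announce the order
            self.b_multicast(self.group, ('order', i, self.s_g))
            self.s_g += 1
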
The obvious problem with a sequencer-based scheme is that the sequencer may
become a bottleneck and is a critical point of failure. Practical algorithms exist that
address the problem of failure. Chang and Maxemchuk [1984] first suggested a
multicast protocol employing a sequencer (which they called a token site). Kaashoek et
al. [1989] developed a sequencer-based protocol for the Amoeba system. These
protocols ensure that a message is in the hold-back queue at f + 1 nodes before it is
delivered; up to f failures can thus be tolerated. Like Chang and Maxemchuk, Birman et
al. [1991] also employ a token-holding site that acts as a sequencer. The token can be
passed from process to process so that, for example, if only one process sends totally-
ordered multicasts then that process can act as the sequencer, saving communication.
The protocol of Kaashoek et al. uses hardware-based multicast - available on an
Ethernet, for example - rather than reliable point-to-point communication. In the
simplest variant of their protocol, processes send the message to be multicast to the
sequencer, one-to-one. The sequencer multicasts the message itself, as well as the
identifier and sequence number. This has the advantage that the other members of the

Figure 11.15 The ISIS algorithm for total ordering

group receive only one message per multicast; its disadvantage is increased bandwidth
utilization. The protocol is described in full at www.cdk3.net/coordination.
The second method that we examine for achieving totally ordered multicast is one
in which the processes collectively agree on the assignment of sequence numbers to
messages in a distributed fashion. A simple algorithm - similar to one that was
originally developed to implement totally ordered multicast delivery for the ISIS toolkit
[Birman and Joseph 1987a] - is shown in Figure 11.15. Once more, a process B-
multicasts its message to the members of the group. The group may be open or closed.
The receiving processes propose sequence numbers for messages as they arrive and
return these to the sender, which uses them to generate agreed sequence numbers.
Each process q in group g keeps Aq, the largest agreed sequence number it has
observed so far for group g, and Pq, its own largest proposed sequence number. The
algorithm for process p to multicast a message m to group g is as follows:
1. p B-multicasts <m, i> to g, where i is a unique identifier for m.
2. Each process q replies to the sender p with a proposal for the message’s agreed
sequence number of Pq := Max(Aq, Pq) + 1. In reality, we must include process
identifiers in the proposed values Pq to ensure a total order, since otherwise
different processes could propose the same integer value; but for the sake of
simplicity we shall not make that explicit here. Each process provisionally assigns
the proposed sequence number to the message and places it in its hold-back queue,
which is ordered with the smallest sequence number at the front.
3. p collects all the proposed sequence numbers and selects the largest one a as the
next agreed sequence number. It then B-multicasts <i, a> to g. Each process q in
g sets Aq := Max(Aq, a) and attaches a to the message (which is identified by i).
It reorders the message in the hold-back queue if the agreed sequence number
differs from the proposed one. When the message at the front of the hold-back
queue has been assigned its agreed sequence number, it is transferred to the tail of
the delivery queue. Messages that have been assigned their agreed sequence
number but are not at the head of the hold-back queue are not yet transferred,
however.
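Before turning to correctness, the following sketch renders the bookkeeping of this algorithm in Python, under simplifying assumptions: sequence numbers are (integer, process identifier) pairs so that ties are broken as mentioned in step 2, b_multicast, send_to and to_deliver are assumed callbacks, and the surrounding messaging layer is assumed to route the one-to-one replies to on_propose.

# Sketch of the agreed-sequence-number (ISIS-style) algorithm above, at one process.
class ISISMember:
    def __init__(self, pid, group, b_multicast, send_to, to_deliver):
        self.pid, self.group = pid, group
        self.b_multicast, self.send_to, self.to_deliver = b_multicast, send_to, to_deliver
        self.A = (0, pid)            # largest agreed sequence number observed
        self.P = (0, pid)            # largest sequence number this process has proposed
        self.holdback = {}           # i -> [sequence number, message, agreed?]
        self.proposals = {}          # i -> proposals collected by the original sender

    def to_multicast(self, m, i):                        # step 1
        self.proposals[i] = []
        self.b_multicast(self.group, ('msg', i, m, self.pid))

    def on_b_deliver(self, kind, *fields):
        if kind == 'msg':                                # step 2: propose a number
            i, m, sender = fields
            self.P = (max(self.A[0], self.P[0]) + 1, self.pid)
            self.holdback[i] = [self.P, m, False]
            self.send_to(sender, ('propose', i, self.P))
        else:                                            # ('agreed', i, a): step 3
            i, a = fields
            self.A = max(self.A, a)
            self.holdback[i][0], self.holdback[i][2] = a, True
            self._drain()

    def on_propose(self, i, proposal):                   # step 3, at the sender
        self.proposals[i].append(proposal)
        if len(self.proposals[i]) == len(self.group):
            self.b_multicast(self.group, ('agreed', i, max(self.proposals[i])))

    def _drain(self):
        # deliver while the entry with the smallest sequence number has been agreed
        while self.holdback:
            i, (seq, m, agreed) = min(self.holdback.items(), key=lambda kv: kv[1][0])
            if not agreed:
                break
            del self.holdback[i]
            self.to_deliver(m)
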
If every process agrees the same set of sequence numbers and delivers them in the
corresponding order, then total ordering is satisfied. It is clear that correct processes
ultimately agree on the same set of sequence numbers, but we must show that they are
monotonically increasing and that no correct process can deliver a message prematurely.
Assume that a message m1 has been assigned an agreed sequence number and has
reached the front of the hold-back queue. By construction, a message that is received
after this stage – and that should therefore be delivered after it – will have a larger
proposed sequence number and thus a larger agreed sequence number than m1. So let
m2 be any other message that has not yet been assigned its agreed sequence number but
which is on the same queue. We have that:

agreedSequence(m2) ≥ proposedSequence(m2)

by the algorithm just given. Since m1 is at the front of the queue:

proposedSequence(m2) > agreedSequence(m1)

Therefore:

agreedSequence(m2) > agreedSequence(m1)

and total ordering is assured.


This algorithm has higher latency than the sequencer-based multicast: three
messages are sent serially between the sender and the group before a message can be
delivered.
Note that the total ordering chosen by this algorithm is not also guaranteed to be
causally or FIFO-ordered: any two messages are delivered in an essentially arbitrary
total order, influenced by communication delays.
For other approaches to implementing total ordering, see Melliar-Smith et al.
[1990], Garcia-Molina and Spauster [1991] and Hadzilacos and Toueg [1994].
Implementing causal ordering ◊ We give an algorithm for non-overlapping closed
groups based on that developed by Birman et al. [1991], shown in Figure 11.16, in
which the causally-ordered multicast operations are CO-multicast and CO-deliver. The
algorithm takes account of the happened-before relationship only as it is established by
multicast messages. If the processes send one-to-one messages to one another, then
these will not be accounted for.
Each process pi (i = 1, 2, ..., N) maintains its own vector timestamp (see
Section 10.4). The entries in the timestamp count the number of multicast messages
from each process that happened-before the next message to be multicast.
To CO-multicast a message to group g, the process adds 1 to its entry in the
timestamp, and B-multicasts the message along with its timestamp to g.
When a process pi B-delivers a message from pj, it must place it in the hold-back
queue before it can CO-deliver it: until it is assured that it has delivered any messages
that causally preceded it. To establish this, pi waits until (a) it has delivered any earlier
message sent by pj, and (b) it has delivered any message that pj had delivered at the

Figure 11.16 Causal ordering using vector timestamps

Algorithm for group member pi (i = 1, 2, ..., N)
On initialization
    Vi[j] := 0 (j = 1, 2, ..., N);
To CO-multicast message m to group g
    Vi[i] := Vi[i] + 1;
    B-multicast(g, <Vi, m>);
On B-deliver(<Vj, m>) from pj, with g = group(m)
    place <Vj, m> in hold-back queue;
    wait until Vj[j] = Vi[j] + 1 and Vj[k] ≤ Vi[k] (k ≠ j);
    CO-deliver m;    // after removing it from the hold-back queue
    Vi[j] := Vi[j] + 1;

time it multicast the message. Both of those conditions can be detected by examining
vector timestamps, as shown in Figure 11.16. Note that a process can immediately CO-
deliver to itself any message that it CO-multicasts, although this is not described in
Figure 11.16.
Each process updates its vector timestamp upon delivering any message, to
maintain the count of causally precedent messages. It does this by incrementing the jth
entry in its timestamp by one. This is an optimization of the merge operation that appears
in the rules for updating vector clocks in Section 10.4. We can make the optimization in
view of the delivery condition in the algorithm of Figure 11.16, which guarantees that
only the jth entry will increase.
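A compact Python sketch of this vector-timestamp bookkeeping follows; processes are numbered from 0 rather than 1, b_multicast and co_deliver are assumed callbacks, and the hold-back queue is rescanned after every delivery, which is simple but not efficient.

# Sketch of causally ordered multicast with vector timestamps (Figure 11.16).
class CausalMember:
    def __init__(self, i, N, group, b_multicast, co_deliver):
        self.i, self.group = i, group
        self.b_multicast, self.co_deliver = b_multicast, co_deliver
        self.V = [0] * N          # number of messages delivered from each process
        self.holdback = []        # entries (sender j, timestamp Vj, message m)

    def co_multicast(self, m):
        self.V[self.i] += 1
        self.b_multicast(self.group, (self.i, list(self.V), m))

    def on_b_deliver(self, j, Vj, m):
        if j == self.i:
            self.co_deliver(m)    # a process CO-delivers its own messages at once
            return
        self.holdback.append((j, Vj, m))
        self._drain()

    def _deliverable(self, j, Vj):
        # (a) the previous message from j has been delivered here, and
        # (b) every message j had delivered before sending has been delivered here
        return Vj[j] == self.V[j] + 1 and all(
            Vj[k] <= self.V[k] for k in range(len(self.V)) if k != j)

    def _drain(self):
        delivered = True
        while delivered:
            delivered = False
            for entry in list(self.holdback):
                j, Vj, m = entry
                if self._deliverable(j, Vj):
                    self.holdback.remove(entry)
                    self.co_deliver(m)
                    self.V[j] += 1
                    delivered = True
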
We outline the proof of the correctness of this algorithm as follows. Suppose that
multicast(g, m) → multicast(g, m′). Let V and V′ be the vector timestamps of m and
m′, respectively. It is straightforward to prove inductively from the algorithm that
V ≤ V′. In particular, if process pk multicast m, then V[k] ≤ V′[k].
Consider what happens when some correct process pi B-delivers m′ (as opposed
to CO-delivering it) without first CO-delivering m. By the algorithm, Vi[k] can increase
only when pi delivers a message from pk, when it increases by 1. But pi has not
received m, and therefore Vi[k] cannot increase beyond V[k] − 1. It is therefore not
possible for pi to CO-deliver m′, since this would require that Vi[k] ≥ V′[k], and
therefore that Vi[k] ≥ V[k].
The reader should check that if we substitute the reliable R-multicast primitive in
place of B-multicast, then we obtain a multicast that is both reliable and causally
ordered.
Furthermore, if we combine the protocol for causal multicast with the sequencer-
based protocol for totally ordered delivery, then we obtain message delivery that is both
total and causal. The sequencer delivers messages according to the causal order and
multicasts the sequence numbers for the messages in the order in which it receives them.
The processes in the destination group do not deliver a message until they have received
an order message from the sequencer and the message is next in the delivery sequence.

Since the sequencer delivers messages in causal order, and since all other processes
deliver messages in the same order as the sequencer, the ordering is indeed both total
and causal.

Overlapping groups ◊ We have considered only non-overlapping groups in the


definitions and algorithms for FIFO, total and causal ordering semantics. This simplifies
the problem but it is not satisfactory, since in general processes need to be members of
multiple overlapping groups. For example, a process may be interested in events from
multiple sources, and thus join a corresponding set of event-distribution groups.
We can extend the ordering definitions to global orders [Hadzilacos and Toueg
1994], in which we have to consider that if message m is multicast to g, and if message
m′ is multicast to g′, then both messages are addressed to the members of g ∩ g′.

Global FIFO ordering: If a correct process issues multicast(g, m) and then
multicast(g′, m′), then every correct process in g ∩ g′ that delivers m′ will deliver
m before m′.

Global causal ordering: If multicast(g, m) → multicast(g′, m′), where → is
the happened-before relation induced by any chain of multicast messages, then any
correct process in g ∩ g′ that delivers m′ will deliver m before m′.

Pairwise total ordering: If a correct process delivers message m sent to g before it
delivers m′ sent to g′, then any other correct process in g ∩ g′ that delivers m′ will
deliver m before m′.

Global total ordering: Let ‘<’ be the relation of ordering between delivery events.
We require that ‘<’ obeys pairwise total ordering and that it is acyclic – under
pairwise total ordering, ‘<’ is not acyclic by default.

One way of implementing these orders would be to multicast each message m to the
group of all processes in the system. Each process either discards or delivers the
message according to whether it belongs to group(m). This would be an inefficient and
unsatisfactory implementation: a multicast should involve as few processes as possible
beyond the members of the destination group. Alternatives are explored in Birman et al.
[1991], Garcia-Molina and Spauster [1991], Hadzilacos and Toueg [1994], Kindberg
[1995] and Rodrigues et al. [1998].

Multicast in synchronous and asynchronous systems ◊ In this section, we have described


algorithms for reliable unordered multicast, (reliable) FIFO-ordered multicast, (reliable)
causally ordered multicast and totally ordered multicast. We also indicated how to
achieve a multicast that is both totally and causally ordered. We leave the reader to
devise an algorithm for a multicast primitive that guarantees both FIFO and total
ordering. All the algorithms that we have described work correctly in asynchronous
systems.
We did not, however, give an algorithm that guarantees both reliable and totally
ordered delivery. Surprising though it may seem, while possible in a synchronous
system, a protocol with these guarantees is impossible in an asynchronous distributed
system - even one that at worst suffered a single process crash failure. We return to this
point in the next section.

11.5 Consensus and related problems

This section introduces the problem of consensus [Pease et al. 1980, Lamport et al.
1982] and the related problems of byzantine generals and interactive consistency. We
shall refer to these collectively as problems of agreement. Roughly speaking, the
problem is for processes to agree on a value after one or more of the processes has
proposed what that value should be.
For example, in Chapter 2 we described a situation in which two armies should
decide consistently to attack or retreat. Similarly, we may require that all the correct
computers controlling a spaceship’s engines should decide ‘proceed’, or all of them
decide ‘abort’, after each has proposed one action or the other. In a transaction to transfer
funds from one account to another, the computers involved must consistently agree to
perform the respective debit and credit. In mutual exclusion, the processes agree on
which process can enter the critical section. In an election, the processes agree on which
is the elected process. In totally ordered multicast, the processes agree on the order of
message delivery.
Protocols exist that are tailored to these individual types of agreement. We
described some of them above, and Chapters 12 and 13 examine transactions. But it is
useful for us to consider more general forms o f agreement, in a search for common
characteristics and solutions.
This section defines consensus more precisely and relates it to three related
agreement problems: byzantine generals, interactive consistency and totally ordered
multicast. We go on to examine under what circumstances the problems can be solved,
and sketch some solutions. In particular, we shall discuss the well-known impossibility
result of Fischer et al. [1985], which states that in an asynchronous system a collection
of processes containing only one faulty process cannot be guaranteed to reach
consensus. Finally, we consider how it is that practical algorithms exist despite the
impossibility result.

11.5.1 System model and problem definitions


Our system model includes a collection of processes pi (i = 1, 2, ..., N)
communicating by message passing. An important requirement that applies in many
practical situations is for consensus to be reached even in the presence of faults. We
assume, as before, that communication is reliable but that processes may fail. In this
section, we shall consider byzantine (arbitrary) process failures, as well as crash failures.
We shall sometimes specify an assumption that up to some number f of the N processes
are faulty - that is, they exhibit some specified types of fault; the remainder of the
processes are correct.
If arbitrary failures can occur, then another factor in specifying our system is
whether the processes digitally sign the messages that they send (see Section 7.4). If
processes sign their messages, then a faulty process is limited in the harm it can do.
Specifically, during an agreement algorithm it cannot make a false claim about the
values that a correct process has sent to it. The relevance of message signing will
become clearer when we discuss solutions to the byzantine generals problem. By
default, we assume that signing does not take place.

Figure 11.17 Consensus for three processes

Definition of the consensus problem ◊ To reach consensus, every process begins in
the undecided state and proposes a single value vi, drawn from a set D
(i = 1, 2, ..., N). The processes communicate with one another, exchanging values.
Each process then sets the value of a decision variable di. In doing so it enters the
decided state, in which it may no longer change di (i = 1, 2, ..., N). Figure 11.17
shows three processes engaged in a consensus algorithm. Two processes propose
‘proceed’ and a third proposes ‘abort’ but then crashes. The two processes that remain
correct each decide ‘proceed’.
The requirements of a consensus algorithm are that the following conditions
should hold for every execution of it:
Termination: Eventually each correct process sets its decision variable.
Agreement: The decision value of all correct processes is the same: if pi and pj are
correct and have entered the decided state, then di = dj (i, j = 1, 2, ..., N).
Integrity: If the correct processes all proposed the same value, then any correct
process in the decided state has chosen that value.

Variations on the definition of integrity may be appropriate, according to the
application. For example, a weaker type of integrity would be for the decision value to
equal a value that some correct process proposed - not necessarily all of them. We shall
use the definition stated above.
To help in understanding how the formulation of the problem translates into an
algorithm, consider a system in which processes cannot fail. It is then straightforward to
solve consensus. For example, we can collect the processes into a group and have each
process reliably multicast its proposed value to the members of the group. Each process
waits until it has collected all N values (including its own). It then evaluates the function
majority(v1, v2, ..., vN), which returns the value that occurs most often among its
arguments, or the special value ⊥ ∉ D if no majority exists. Termination is guaranteed
by the reliability of the multicast operation. Agreement and integrity are guaranteed by
the definition of majority, and the integrity property of a reliable multicast. Every
process receives the same set of proposed values, and every process evaluates the same
function of those values. So they must all agree, and if every process proposed the same
value, then they all decide on this value.
Note that majority is only one possible function that the processes could use to
agree upon a value from the candidate values. For example, if the values are ordered then
the functions minimum and maximum may be appropriate.
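The failure-free scheme just described is small enough to sketch directly; r_multicast and collect_all are assumed primitives of a reliable multicast layer, and BOTTOM stands for the special value ⊥.

# Sketch of consensus when processes cannot fail: every process reliably multicasts
# its proposal, collects all N values and applies the same majority function.
from collections import Counter

BOTTOM = object()

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else BOTTOM   # strict majority, or ⊥

def consensus_no_failures(group, my_value, r_multicast, collect_all):
    r_multicast(group, my_value)          # propose our value to every member
    values = collect_all(len(group))      # wait for all N proposals, ours included
    return majority(values)               # all processes evaluate the same function
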
If processes can crash then this introduces the complication of detecting failures,
and it is not immediately clear that a run of the consensus algorithm can terminate. In
fact, if the system is asynchronous then it may not; we shall return to this point shortly.
If processes can fail in arbitrary (byzantine) ways, then faulty processes can in
principle communicate random values to the others. This may seem unlikely in practice,
but it is not beyond the bounds of possibility for a process with a bug to fail in this way.
Moreover, the fault may not be accidental but the result of mischievous or malevolent
operation. Someone could deliberately make a process send different values to different
peers in an attempt to thwart the others, which are trying to reach consensus. In case of
inconsistency, correct processes must compare what they have received with what other
processes claim to have received.
The byzantine generals problem ◊ In the informal statement of the byzantine generals
problem [Lamport et al. 1982], three or more generals are to agree to attack or to retreat.
One, the commander, issues the order. The others, lieutenants to the commander, are to
decide to attack or retreat. But one or more of the generals may be ‘treacherous’ – that
is, faulty. If the commander is treacherous, he proposes attacking to one general and
retreating to another. If a lieutenant is treacherous, he tells one of his peers that the
commander told him to attack and another that they are to retreat.
The byzantine generals problem differs from consensus in that a distinguished
process supplies a value that the others are to agree upon, instead of each of them
proposing a value. The requirements are:
Termination: Eventually each correct process sets its decision variable.
Agreement: The decision value of all correct processes is the same: if pi and pj are
correct and have entered the decided state, then di = dj (i, j = 1, 2, ..., N).
Integrity: If the commander is correct, then all correct processes decide on the value
that the commander proposed.
Note that, for the byzantine generals problem, integrity implies agreement when the
commander is correct; but the commander need not be correct.
Interactive consistency ◊ The interactive consistency problem is another variant of
consensus, in which every process proposes a single value. The goal of the algorithm is
for the correct processes to agree on a vector of values, one for each process. We shall
call this the ‘decision vector’. For example, the goal could be for each of a set of
processes to obtain the same information about their respective states.
The requirements for interactive consistency are:
Termination: Eventually each correct process sets its decision variable.
Agreement: The decision vector of all correct processes is the same.

Integrity: If pi is correct, then all correct processes decide on vi as the ith
component of their vector.
Relating consensus to other problems ◊ Although it is common to consider the
byzantine generals problem with arbitrary process failures, in fact each of the three
problems - consensus, byzantine generals and interactive consistency - is meaningful
in the context of either arbitrary or crash failures. Similarly, each can be framed
assuming either a synchronous or an asynchronous system.
It is sometimes possible to derive a solution to one problem using a solution to
another. This is a very useful property, both because it increases our understanding of
the problems and because by reusing solutions we can potentially save on
implementation effort and complexity.
Suppose that there exist solutions to consensus (C), byzantine generals (BG) and
interactive consistency (IC) as follows:
Ci(v1, v2, ..., vN) returns the decision value of pi in a run of the solution to the
consensus problem, where v1, v2, ..., vN are the values that the processes proposed.
BGi(j, v) returns the decision value of pi in a run of the solution to the byzantine
generals problem, where pj, the commander, proposes the value v.
ICi(v1, v2, ..., vN)[j] returns the jth value in the decision vector of pi in a run of the
solution to the interactive consistency problem, where v1, v2, ..., vN are the values
that the processes proposed.
The definitions of Ci, BGi and ICi assume that a faulty process proposes a single
notional value, even though it may have given different proposed values to each of the
other processes. This is only a convenience: the solutions will not rely on any such
notional value.
It is possible to construct solutions out of the solutions to other problems. We give
three examples:
IC from BG: We construct a solution to IC from BG by running BG N times, once
with each process pj (j = 1, 2, ..., N) acting as the commander:

    ICi(v1, v2, ..., vN)[j] = BGi(j, vj) (i, j = 1, 2, ..., N)

C from IC: We construct a solution to C from IC by running IC to produce a vector
of values at each process, then applying an appropriate function on the vector's
values to derive a single value:

    Ci(v1, ..., vN) = majority(ICi(v1, ..., vN)[1], ..., ICi(v1, ..., vN)[N])

(i = 1, 2, ..., N), where majority is as defined above.
BG from C: We construct a solution to BG from C as follows:

• The commander pj sends its proposed value v to itself and each of the
remaining processes;
• All processes run C with the values v1, v2, ..., vN that they receive (pj may be
faulty);


Figure 11.18 Consensus in a synchronous system

Algorithm for process pi ∈ g; algorithm proceeds in f + 1 rounds
On initialization
    Values_i^1 := {vi}; Values_i^0 := {};
In round r (1 ≤ r ≤ f + 1)
    B-multicast(g, Values_i^r − Values_i^(r−1));    // Send only values that have not been sent
    Values_i^(r+1) := Values_i^r;
    while (in round r)
    {
        On B-deliver(Vj) from some pj
            Values_i^(r+1) := Values_i^(r+1) ∪ Vj;
    }
After (f + 1) rounds
    Assign di = minimum(Values_i^(f+1));

• They derive BGi(j, v) = Ci(v1, v2, ..., vN) (i = 1, 2, ..., N).


The reader should check that the termination, agreement and integrity conditions are
preserved in each case. Fischer [1983] relates the three problems in more detail.
Solving consensus is equivalent to solving reliable and totally ordered multicast:
given a solution to one, we can solve the other. Implementing consensus with a reliable
and totally ordered multicast operation RTO-multicast is straightforward. We collect all
the processes into a group g. To achieve consensus, each process pi performs RTO-
multicast(g, vi). Then each process pi chooses di := mi, where mi is the first value that
pi RTO-delivers. The termination property follows from the reliability of the multicast.
The agreement and integrity properties follow from the reliability and total ordering of
multicast delivery. Chandra and Toueg [1996] demonstrate how reliable and totally
ordered multicast can be derived from consensus.
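This reduction is short enough to render directly; rto_multicast and rto_deliver_next are assumed primitives of a reliable, totally ordered multicast layer.

# Sketch of consensus from reliable, totally ordered multicast: every process
# multicasts its proposal and decides on the first value it RTO-delivers.
def consensus_from_rto(group, my_value, rto_multicast, rto_deliver_next):
    rto_multicast(group, my_value)
    return rto_deliver_next()   # the same first value at every correct process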

11.5.2 Consensus in a synchronous system


This section describes an algorithm that uses only a basic multicast protocol to solve
consensus in a synchronous system. The algorithm assumes that up to f of the N
processes exhibit crash failures.
To reach consensus, each correct process collects proposed values from the other
processes. The algorithm proceeds in f + 1 rounds, in each of which the correct
processes B-multicast the values between themselves. At most f processes may crash, by
assumption. At worst, all f crashes occurred during the rounds, but the algorithm
guarantees that at the end of the rounds all the correct processes that have survived are
in a position to agree.
The algorithm, shown in Figure 11.18, is based on that by Dolev and Strong
[1983] and its presentation by Attiya and Welch [1998]. The variable Values_i^r holds the
set of proposed values known to process pi at the beginning of round r. Each process
multicasts the set of values that it has not sent in previous rounds. It then takes delivery
of similar multicast messages from other processes and records any new values.
Although this is not shown in Figure 11.18, the duration of a round is limited by setting
a timeout based on the maximum time for a correct process to multicast a message. After
f + 1 rounds, each process chooses the minimum value it has received as its decision
value.
Termination is obvious from the fact that the system is synchronous. To check the
correctness of the algorithm, we must show that each process arrives at the same set of
values at the end of the final round. Agreement and integrity will then follow, because
the processes apply the minimum function to this set.
Assume, to the contrary, that two processes differ in their final set of values.
Without loss of generality, some correct process pi possesses a value v that another
correct process pj (i ≠ j) does not possess. The only explanation for pi possessing a
proposed value v at the end that pj does not possess is that any third process, pk say,
that managed to send v to pi crashed before v could be delivered to pj. In turn, any
process sending v in the previous round must have crashed, to explain why pk possesses
v in that round but pj did not receive it. Proceeding in this way, we have to posit at least
one crash in each of the preceding rounds. But we have assumed that at most f crashes
can occur, and there are f + 1 rounds. We have arrived at a contradiction.
It turns out that any algorithm to reach consensus despite up to f crash failures
requires at least f + 1 rounds of message exchanges, no matter how it is constructed
[Dolev and Strong 1983]. This lower bound also applies in the case of byzantine failures
[Fischer and Lynch 1980].
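The round structure can be made concrete with a small in-memory simulation, which is not a real networked implementation: rounds are driven by a loop rather than timeouts, a crashing process here reaches either all processes or none in its final round, and the process names and crash schedule are illustrative.

# In-memory simulation of the f+1-round algorithm of Figure 11.18. It only
# illustrates the bookkeeping, not the partial sends during a crash that the
# f+1-round bound is really defending against.
def synchronous_consensus(proposals, f, crash_round=None, crashed=()):
    values = {p: {v} for p, v in proposals.items()}    # Values_i^1 := {v_i}
    sent = {p: set() for p in proposals}               # values already multicast by p
    alive = {p: True for p in proposals}
    for r in range(1, f + 2):                          # rounds 1 .. f+1
        outgoing = {p: values[p] - sent[p] for p in proposals if alive[p]}
        for p in outgoing:
            sent[p] |= outgoing[p]                     # send only values not sent before
        if crash_round == r:                           # crash the scheduled processes
            for p in crashed:
                alive[p] = False
        for q in proposals:                            # collect this round's multicasts
            if alive[q]:
                for vs in outgoing.values():
                    values[q] |= vs
    # after f+1 rounds, each surviving process decides the minimum value it knows
    return {p: min(values[p]) for p in proposals if alive[p]}

# example: p3 proposes 9 but crashes after round 1; the survivors still agree on 3
print(synchronous_consensus({'p1': 7, 'p2': 3, 'p3': 9}, f=1, crash_round=1, crashed=('p3',)))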

11.5.3 The byzantine generals problem in a synchronous system


We discuss the byzantine generals problem in a synchronous system. Unlike the
algorithm for consensus described in the previous section, here we assume that
processes can exhibit arbitrary failures. That is, a faulty process may send any message
with any value at any time; and it may omit to send any message. Up to f of the N
processes may be faulty. Correct processes can detect the absence of a message through
a timeout; but they cannot conclude that the sender has crashed, since it may be silent
for some time and then send messages again.
We assume that the communication channels between pairs of processes are
private. If a process could examine all the messages that other processes send, then it
could detect the inconsistencies in what a faulty process sends to different processes.
Our default assumption of channel reliability means that no faulty process can inject
messages into the communication channel between correct processes.
Lamport et al. [1982] considered the case of three processes that send unsigned
messages to one another. They showed that there is no solution that guarantees to meet
the conditions of the byzantine generals problem if one process is allowed to fail. They
generalized this result to show that no solution exists if N ≤ 3f. We shall demonstrate
these results shortly. They went on to give an algorithm that solves the byzantine
generals problem in a synchronous system if N ≥ 3f + 1, for unsigned (they call them
‘oral’) messages.

Figure 11.19 Three byzantine generals

Impossibility with three processes ◊ Figure 11.19 shows two scenarios in which just one
of three processes is faulty. In the left configuration one of the lieutenants, p3, is faulty;
on the right the commander, p1, is faulty. Each scenario in Figure 11.19 shows two
rounds of messages: the values the commander sends, and the values that the lieutenants
subsequently send to each other. The numeric prefixes serve to specify the sources of
messages and to show the different rounds. Read the symbol ‘:’ in messages as ‘says’;
for example, ‘3:1:u’ is the message ‘3 says 1 says u’.
In the left-hand scenario, the commander correctly sends the same value v to each
of the other two processes, and p2 correctly echoes this to p3. However, p3 sends a
value u ≠ v to p2. All p2 knows at this stage is that it has received differing values; it
cannot tell which were sent out by the commander.
In the right-hand scenario, the commander is faulty and sends differing values to
the lieutenants. After p3 has correctly echoed the value x that it received, p2 is in the
same situation as it was in when p3 was faulty: it has received two differing values.
If a solution exists, then process p2 is bound to decide on value v when the
commander is correct, by the integrity condition. If we accept that no algorithm can
possibly distinguish between the two scenarios, p2 must also choose the value sent by
the commander in the right-hand scenario.
Following exactly the same reasoning for p3, assuming that it is correct, we are
forced to conclude, by symmetry, that p3 also chooses the value sent by the commander
as its decision value. But this contradicts the agreement condition (the commander sends
differing values if it is faulty). So no solution is possible.
Note that this argument rests on our intuition that nothing can be done to improve
a correct general’s knowledge beyond the first stage, where it cannot tell which process
is faulty. It is possible to prove the correctness of this intuition [Pease et al. 1980].
Byzantine agreement can be reached for three generals, with one of them faulty, if the
generals digitally sign their messages.
Impossibility with N ≤ 3f ◊ Pease et al. generalized the basic impossibility result for
three processes, to prove that no solution is possible if N ≤ 3f. In outline, the argument
is as follows. Assume that a solution exists with N ≤ 3f. Let each of three processes p1,
p2 and p3 use the solution to simulate the behaviour of n1, n2 and n3 generals,
respectively, where n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3. We assume, furthermore,
that one of the three processes is faulty. Those of p1, p2 and p3 that are correct simulate
correct generals: they simulate the interactions of their own generals internally and send
messages from their generals to those simulated by other processes. The faulty process’s
simulated generals are faulty: the messages that it sends as part of the simulation to the
other two processes may be spurious. Since N ≤ 3f and n1, n2, n3 ≤ N/3, at most f
simulated generals are faulty.
Because the algorithm that the processes run is assumed to be correct, the
simulation terminates. The correct simulated generals (in the two correct processes)
agree and satisfy the integrity property. But now we have a means for the two correct
processes out of the three to reach consensus: each decides on the value chosen by all of
their simulated generals. This contradicts our impossibility result for three processes,
with one faulty.
Solution with one faulty process ◊ There is not sufficient space to describe fully the
algorithm of Pease et al. that solves the byzantine generals problem in a synchronous
system with N ≥ 3f + 1. Instead, we give the operation of the algorithm for the case
N ≥ 4, f = 1 and illustrate it for N = 4, f = 1.
The correct generals reach agreement in two rounds of messages:
• In the first round, the commander sends a value to each of the lieutenants.
• In the second round, each of the lieutenants sends the value it received to its peers.
A lieutenant receives a value from the commander, plus N − 2 values from its peers. If
the commander is faulty, then all the lieutenants are correct and each will have gathered
exactly the set of values that the commander sent out. Otherwise, one of the lieutenants
is faulty; each of its correct peers receives N - 2 copies of the value that the commander
sent, plus a value that the faulty lieutenant sent to it.
In either case, the correct lieutenants need only apply a simple majority function
to the set of values they receive. Since N ≥ 4, N − 2 ≥ 2. Therefore, the majority
function will ignore any value that a faulty lieutenant sent, and it will produce the value
that the commander sent if the commander is correct.
We now illustrate the algorithm that we have just outlined for the case of four
generals. Figure 11.20 shows two scenarios similar to those in Figure 11.19, but in this
case there are four processes, one of which is faulty. As in Figure 11.19, in the left-hand
configuration one of the lieutenants, p3, is faulty; on the right, the commander, p1, is
faulty.
In the left-hand case, the two correct lieutenant processes agree, deciding on the
commander's value:
p2 decides on majority(v, u, v) = v
p4 decides on majority(v, v, w) = v
In the right-hand case the commander is faulty, but the three correct processes agree:
p2, p3 and p4 decide on majority(u, v, w) = ⊥ (the special value ⊥ applies
where no majority of values exists).
The algorithm takes account of the fact that a faulty process may omit to send a message.
If a correct process does not receive a message within a suitable time limit (the system
is synchronous), it proceeds as though the faulty process had sent it the value ⊥.
Discussion ◊ We can measure the efficiency of a solution to the byzantine generals
problem - or any other agreement problem - by asking:

Figure 11.20 Four byzantine generals

Faulty processes are shown shaded

• How many message rounds does it take? (This is a factor in how long it takes for
the algorithm to terminate.)
• How many messages are sent, and of what size? (This measures the total
bandwidth utilization and has an impact on the execution time.)
In the general case (f > 1) the Lamport et al. algorithm for unsigned messages operates
over f + 1 rounds. In each round, a process sends to a subset of the other processes the
values that it received in the previous round. The algorithm is very costly: it involves
sending O(N^(f+1)) messages.
Fischer and Lynch [1982] proved that any deterministic solution to consensus
assuming byzantine failures (and hence to the byzantine generals problem, as Section
11.5.1 showed) will take at least f + 1 message rounds. So no algorithm can operate
faster in this respect than that of Lamport et al. But there have been improvements in the
message complexity, for example Garay and Moses [1993].
Several algorithms, such as that of Dolev and Strong [1983], take advantage of
signed messages. Dolev and Strong’s algorithm again takes f + 1 rounds, but the
number of messages sent is only O(N²).
The complexity and cost of the solutions suggest that they are applicable only
where the threat is great. If faulty hardware is the source of the threat, then the likelihood
of truly arbitrary behaviour is small. Solutions that are based on more detailed
knowledge of the fault model may be more efficient [Barborak et al. 1993]. If malicious
users are the source of the threat, then a system to counter them is likely to use digital
signatures; a solution without signatures is impractical.

11.5.4 Impossibility in asynchronous systems


We have provided solutions to consensus and the byzantine generals problem (and
hence, by derivation, to interactive consistency) in synchronous systems. However, all
these solutions relied upon the system being synchronous. The algorithms assume that
message exchanges take place in rounds, and that processes are entitled to timeout and
assume that a faulty process has not sent them a message within the round, because the
maximum delay has been exceeded.
Fischer et al. [1985] proved that no algorithm can guarantee to reach consensus in
an asynchronous system, even with one process crash failure. In an asynchronous
system, processes can respond to messages at arbitrary times, so a crashed process is
indistinguishable from a slow one. Their proof, which is beyond the scope of this book,
involves showing that there is always some continuation of the processes’ execution that
avoids consensus being reached.
We immediately know from the result of Fischer et al. that there is no guaranteed
solution in an asynchronous system to the byzantine generals problem, to interactive
consistency or to totally ordered and reliable multicast. If there were such a solution
then, by the results of Section 11.5.1, we would have a solution to consensus -
contradicting the impossibility result.
Note the word ‘guarantee’ in the statement of the impossibility result. The result
does not mean that processes can never reach distributed consensus in an asynchronous
system if one is faulty. It allows that consensus can be reached with some probability
greater than zero, confirming what we know in practice. For example, despite the fact
that our systems are often effectively asynchronous, transaction systems have been
reaching consensus regularly for many years.
One approach to working around the impossibility result is to consider partially
synchronous systems, which are sufficiently weaker than synchronous systems to be
useful as models of practical systems, and sufficiently stronger than asynchronous
systems for consensus to be solvable in them [Dwork et al. 1988]. That approach is
beyond the scope of this book. However, three other techniques for working around the
impossibility result that we shall now outline are fault masking, and reaching consensus
by exploiting failure detectors and by randomizing aspects of the processes’ behaviour.
Masking faults ◊ The first technique is to avoid the impossibility result altogether by
masking any process failures that occur (see Section 2.3.2 for an introduction to fault-
masking). For example, transaction systems employ persistent storage, which survives
crash failures. If a process crashes, then it is restarted (automatically, or by an
administrator). The process places sufficient information in persistent storage at critical
points in its program so that if it should crash and be restarted, it will find sufficient data
to be able to continue correctly with its interrupted task. In other words, it will behave
like a process that is correct, but which sometimes takes a long time to perform a
processing step.
Of course, fault masking is generally applicable in system design. Chapter 13
discusses how transactional systems take advantage of persistent storage. Chapter 14
describes how process failures can also be masked by replicating software components.
Consensus using failure detectors ◊ Another method for circumventing the
impossibility result is to employ failure detectors. Some practical systems employ
‘perfect by design’ failure detectors to reach consensus. No failure detector in an
asynchronous system that works solely by message passing can really be perfect.
However, processes can agree to deem a process that has not responded for more than a
bounded time to have failed. An unresponsive process may not really have failed, but
the remaining processes act as if it had done. They make the failure ‘fail-silent’ by
discarding any subsequent messages that they do in fact receive from a ‘failed’ process.


In other words, we have effectively turned an asynchronous system into a synchronous
one. This technique is used in the ISIS system [Birman 1993].
This method relies upon the failure detector usually being accurate. When it is
inaccurate, then the system has to proceed without a group member that otherwise could
potentially have contributed to the system’s effectiveness. Unfortunately, making the
failure detector reasonably accurate involves using long timeout values, forcing
processes to wait a relatively long time (and not perform useful work) before concluding
that a process has failed. Another issue that arises for this approach is network
partitioning, which we discuss in Chapter 14.
A quite different approach is to use imperfect failure detectors, and to reach
consensus while allowing suspected processes to behave correctly instead of excluding
them. Chandra and Toueg [1996] analysed the properties that a failure detector must
have in order to solve the consensus problem in an asynchronous system. They showed
that consensus can be solved in an asynchronous system, even with an unreliable failure
detector, if fewer than N/2 processes crash and communication is reliable. The weakest
type of failure detector for which this is so is called an eventually weak failure detector.
This is one that is:
Eventually weakly complete: each faulty process is eventually suspected
permanently by some correct process;
Eventually weakly accurate: after some point in time, at least one correct process is
never suspected by any correct process.
Chandra and Toueg show that we cannot implement an eventually weak failure detector
in an asynchronous system by message passing alone. However, we described a
message-based failure detector in Section 11.1 that adapts its timeout values according
to observed response times. If a process or the connection to it is very slow, then the
timeout value will grow so that cases of falsely suspecting a process become rare. In the
case of many real systems, this algorithm behaves sufficiently closely to an eventually
weak failure detector for practical purposes.
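As a hedged illustration of the adaptive-timeout idea just described (a sketch under our own assumptions, not the book's algorithm), a failure detector might grow its timeout as larger response delays are observed; all names below are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Sketch of an adaptive-timeout failure detector: a process is suspected once it has
// not been heard from within a timeout derived from observed response times.
class AdaptiveFailureDetector {
    private final Map<String, Long> lastHeard = new HashMap<String, Long>(); // process -> last message time (ms)
    private long timeoutMillis = 1000;                                        // current adaptive timeout

    // Record a message (for example, a heartbeat) received from a process.
    void onMessage(String processId, long nowMillis, long observedDelayMillis) {
        lastHeard.put(processId, nowMillis);
        // Grow the timeout towards the largest delay seen, so that cases of
        // falsely suspecting a slow process become rare over time.
        timeoutMillis = Math.max(timeoutMillis, 2 * observedDelayMillis);
    }

    // A process is suspected once the adaptive timeout has elapsed without a message.
    boolean suspects(String processId, long nowMillis) {
        Long last = lastHeard.get(processId);
        return last == null || nowMillis - last > timeoutMillis;
    }
}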
Chandra and Toueg’s consensus algorithm allows falsely suspected processes to
continue their normal operations and allows processes that have suspected them to
receive messages from them and process those messages normally. This makes the
application programmer’s life complicated, but it has the advantage that correct
processes are not wasted by being falsely excluded. Moreover, timeouts for detecting
failures can be set less conservatively than with the ISIS approach.
Consensus using randomization ◊ The result of Fischer et al. depends on what we can
consider to be an ‘adversary’. This is a ‘character’ (actually just a collection of random
events) who can exploit the phenomena of asynchronous systems so as to foil the
processes’ attempts to reach consensus. The adversary manipulates the network to delay
messages so that they arrive at just the wrong time, and similarly it slows down or speeds
up the processes just enough so that they are in the ‘wrong’ state when they receive a
message.
The third technique that addresses the impossibility result is to introduce an
element of chance in the processes’ behaviour, so that the adversary cannot exercise its
thwarting strategy effectively. Consensus might still not be reached in some cases, but
this method enables processes to reach consensus in a finite expected time. A
probabilistic algorithm that solves consensus even with byzantine failures can be found
in Canetti and Rabin [1993].

11.6 Summary

The chapter began by discussing the need for processes to access shared resources under
conditions of mutual exclusion. Locks are not always implemented by the servers that
manage the shared resources, and a separate distributed mutual exclusion service is then
required. Three algorithms were considered that achieve mutual exclusion: one
employing a central server, a ring-based algorithm, and a multicast-based algorithm
using logical clocks. None of these mechanisms can withstand failure as we described
them, although they can be modified to tolerate some faults.
Then the chapter considered a ring-based algorithm and the bully algorithm,
whose common aim is to elect a process uniquely from a given set - even if several
elections take place concurrently. The Bully algorithm could be used, for example, to
elect a new master time server, or a new lock server, when the previous one fails.
The chapter described multicast communication. It discussed reliable multicast, in
which the correct processes agree on the set of messages to be delivered; and multicast
with FIFO, causal and total delivery ordering. We gave algorithms for reliable multicast
and for all three types of delivery ordering.
Finally, we described the three problems of consensus, byzantine generals and
interactive consistency. We defined the conditions for their solution and we showed
relationships between these problems - including the relationship between consensus
and reliable, totally ordered multicast.
Solutions exist in a synchronous system, and we described some of them. In fact,
solutions exist even when arbitrary failures are possible. We outlined part of the solution
to the byzantine generals problem of Lamport et al. More recent algorithms have lower
complexity, but in principle none can better the f + 1 rounds taken by this algorithm,
unless messages are digitally signed.
The chapter ended by describing the fundamental result of Fischer et al.
concerning the impossibility of guaranteeing consensus in an asynchronous system. We
discussed how it is that, nonetheless, systems regularly do reach agreement in
asynchronous systems.

EXERCISES

11.1 Is it possible to implement either a reliable or an unreliable (process) failure detector
using an unreliable communication channel? page 422

11.2 If all client processes are single-threaded, is mutual exclusion condition ME3, which
specifies entry in happened-before order, relevant? page 425
11.3 Give a formula for the maximum throughput of a mutual exclusion system in terms of
the synchronization delay. page 425


11.4 In the central server algorithm for mutual exclusion, describe a situation in which two
requests are not processed in happened-before order. page 426
11.5 Adapt the central server algorithm for mutual exclusion to handle the crash failure of any
client (in any state), assuming that the server is correct and given a reliable failure
detector. Comment on whether the resultant system is fault tolerant. What would happen
if a client that possesses the token is wrongly suspected to have failed? page 426
11.6 Give an example execution of the ring-based algorithm to show that processes are not
necessarily granted entry to the critical section in happened-before order. page 427
11.7 In a certain system, each process typically uses a critical section many times before
another process requires it. Explain why Ricart and Agrawala’s multicast-based mutual
exclusion algorithm is inefficient for this case, and describe how to improve its
performance. Does your adaptation satisfy liveness condition ME2? page 429
11.8 In the Bully algorithm, a recovering process starts an election and will become the new
coordinator if it has a higher identifier than the current incumbent. Is this a necessary
feature of the algorithm? page 434
11.9 Suggest how to adapt the Bully algorithm to deal with temporary network partitions
(slow communication) and slow processes. page 436
11.10 Devise a protocol for basic multicast over IP multicast. page 438
11.11 How, if at all, should the definitions of integrity, agreement and validity for reliable
multicast change for the case of open groups? page 439
11.12 Explain why reversing the order of the lines ‘R-deliver m’ and ‘if (q ≠ p) then B-
multicast(g, m); end if’ in Figure 11.10 makes the algorithm no longer satisfy uniform
agreement. Does the reliable multicast algorithm based on IP multicast satisfy uniform
agreement? page 440
11.13 Explain whether the algorithm for reliable multicast over IP multicast works for open as
well as closed groups. Given any algorithm for closed groups, how, simply, can we
derive an algorithm for open groups? page 440
11.14 Consider how to address the impractical assumptions we made in order to meet the
validity and agreement properties for the reliable multicast protocol based on IP
multicast. Hint: add a rule for deleting retained messages when they have been delivered
everywhere; and consider adding a dummy ‘heartbeat’ message, which is never
delivered to the application, but which the protocol sends if the application has no
message to send. page 440
11.15 Show that the FIFO-ordered multicast algorithm does not work for overlapping groups,
by considering two messages sent from the same source to two overlapping groups, and
considering a process in the intersection of those groups. Adapt the protocol to work for
this case. Hint: processes should include with their messages the latest sequence
numbers of messages sent to all groups. page 445
11.16 Show that, if the basic multicast that we use in the algorithm of Figure 11.14 is also
FIFO-ordered, then the resultant totally-ordered multicast is also causally ordered. Is it
the case that any multicast that is both FIFO-ordered and totally ordered is thereby
causally ordered? page 446


11.17 Suggest how to adapt the causally ordered multicast protocol to handle overlapping
groups. page 449
11.18 In discussing Maekawa’s mutual exclusion algorithm, we gave an example of three
subsets of a set of three processes that could lead to a deadlock. Use these subsets as
multicast groups to show how a pairwise total ordering is not necessarily acyclic.
page 450
11.19 Construct a solution to reliable, totally ordered multicast in a synchronous system, using
a reliable multicast and a solution to the consensus problem. page 450
11.20 We gave a solution to consensus from a solution to reliable and totally ordered multicast,
which involved selecting the first value to be delivered. Explain from first principles
why, in an asynchronous system, we could not instead derive a solution by using a
reliable but not totally ordered multicast service and the ‘majority’ function. (Note that,
if we could, then this would contradict the impossibility result of Fischer et al.!) Hint:
consider slow/failed processes. page 455
11.21 Show that byzantine agreement can be reached for three generals, with one of them
faulty, if the generals digitally sign their messages. page 457
11.22 Explain how to adapt the algorithm for reliable multicast over IP multicast to eliminate
the hold-back queue - so that a received message that is not a duplicate can be delivered
immediately, but without any ordering guarantees. Hint: use sets instead of sequence
numbers to represent the messages that have been delivered so far. page 441



DISTRIBUTED TRANSACTIONS
13.1 Introduction
13.2 Flat and nested distributed transactions
13.3 Atomic commit protocols
13.4 Concurrency control in distributed transactions
13.5 Distributed deadlocks
13.6 Transaction recovery
13.7 Summary

This chapter introduces distributed transactions - those that involve more than one
server. Distributed transactions may be either flat or nested.
An atomic commit protocol is a cooperative procedure used by a set of servers
involved in a distributed transaction. It enables the servers to reach a joint decision as to
whether a transaction can be committed or aborted. This chapter describes the two-phase
commit protocol, which is the most commonly used atomic commit protocol.
The section on concurrency control in distributed transactions discusses how
locking, timestamp ordering and optimistic concurrency control may be extended for use
with distributed transactions.
The use of locking schemes can lead to distributed deadlocks. Distributed deadlock
detection algorithms are discussed.
Servers that provide transactions include a recovery manager whose concern is to
ensure that the effects of transactions on the objects managed by a server can be
recovered when it is replaced after a failure. The recovery manager saves the objects in
permanent storage together with intentions lists and information about the status of each
transaction.


13.1 Introduction

In Chapter 12, we discussed flat and nested transactions that accessed objects at a single
server. In the general case, a transaction, whether flat or nested, will access objects
located in several different computers. We use the term distributed transaction to refer
to a flat or nested transaction that accesses objects managed by multiple servers.
When a distributed transaction comes to an end, the atomicity property of
transactions requires that either all of the servers involved commit the transaction or all
of them abort the transaction. To achieve this, one of the servers takes on a coordinator
role, which involves ensuring the same outcome at all of the servers. The manner in
which the coordinator achieves this depends on the protocol chosen. A protocol known
as the ‘two-phase commit protocol’ is the most commonly used. This protocol allows
the servers to communicate with one another to reach a joint decision as to whether to
commit or abort.
Concurrency control in distributed transactions is based on the methods discussed
in Chapter 12. Each server applies local concurrency control to its own objects, which
ensures that transactions are serialized locally. Distributed transactions must be
serialized globally. How this is achieved depends on whether locking, timestamp
ordering or optimistic concurrency control is in use. In some cases, the transactions may
be serialized at the individual servers, but at the same time a cycle of dependencies
between the different servers may occur and a distributed deadlock arise.
Transaction recovery is concerned with ensuring that all the objects involved in
transactions are recoverable. In addition to that, it guarantees that the values of the
objects reflect all the changes made by committed transactions and none of those made
by aborted ones.

13.2 Flat and nested distributed transactions

A client transaction becomes distributed if it invokes operations in several different
servers. There are two different ways that distributed transactions can be structured: as
flat transactions and as nested transactions.
In a flat transaction, a client makes requests to more than one server. For example,
in Figure 13.1(a), transaction T is a flat transaction that invokes operations on objects in
servers X, Y and Z. A flat client transaction completes each of its requests before going
on to the next one. Therefore, each transaction accesses servers’ objects sequentially.
When servers use locking, a transaction can only be waiting for one object at a time.
In a nested transaction, the top-level transaction can open subtransactions, and
each subtransaction can open further subtransactions down to any depth of nesting.
Figure 13.1(b) shows a client’s transaction T that opens two subtransactions T1 and T2,
which access objects at servers X and Y. The subtransactions T1 and T2 open further
subtransactions T11, T12, T21 and T22, which access objects at servers M, N and P. In the
nested case, subtransactions at the same level can run concurrently, so T1 and T2 are
concurrent, and as they invoke objects in different servers, they can run in parallel. The
four subtransactions T11, T12, T21 and T22 also run concurrently.

Figure 13.1 Distributed transactions: (a) flat transaction; (b) nested transactions

Consider a distributed transaction in which a client transfers $10 from account A
to C and then transfers $20 from B to D. Accounts A and B are at separate servers X and
Y and accounts C and D are at server Z. If this transaction is structured as a set of four
nested transactions, as shown in Figure 13.2, the four requests (two deposit and two
withdraw) can run in parallel and the overall effect can be achieved with better
performance than a simple transaction in which the four operations are invoked
sequentially.

Figure 13.2 Nested banking transaction

T = openTransaction
    openSubTransaction
        a.withdraw(10);
    openSubTransaction
        b.withdraw(20);
    openSubTransaction
        c.deposit(10);
    openSubTransaction
        d.deposit(20);
closeTransaction


Figure 13.3 A distributed banking transaction

T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction

[Diagram: the client's openTransaction and closeTransaction go to the coordinator; a participant at each of the servers BranchX (account A), BranchY (account B) and BranchZ (accounts C and D) joins the transaction at the coordinator. Note: the coordinator is in one of the servers, e.g. BranchX.]

13.2.1 The coordinator of a distributed transaction


Servers that execute requests as part of a distributed transaction need to be able to
communicate with one another to coordinate their actions when the transaction commits.
A client starts a transaction by sending an openTransaction request to a coordinator in
any server, as described in Section 12.2. The coordinator that is contacted carries out the
openTransaction and returns the resulting transaction identifier to the client.
Transaction identifiers for distributed transactions must be unique within a distributed
system. A simple way to achieve this is for a TID to contain two parts: the server
identifier (for example, an IP address) of the server that created it and a number unique
to the server.
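A minimal sketch, assuming a Java setting, of such a two-part TID; the class name TransactionId and its fields are our own illustration, not a prescribed format.

// Hypothetical sketch of a globally unique transaction identifier (TID): the creating
// server's identifier plus a number unique within that server.
final class TransactionId {
    final String serverId;   // e.g. the server's IP address (and port)
    final long localSeqNo;   // unique within the creating server

    TransactionId(String serverId, long localSeqNo) {
        this.serverId = serverId;
        this.localSeqNo = localSeqNo;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TransactionId)) {
            return false;
        }
        TransactionId other = (TransactionId) o;
        return localSeqNo == other.localSeqNo && serverId.equals(other.serverId);
    }

    @Override
    public int hashCode() {
        return serverId.hashCode() * 31 + (int) localSeqNo;
    }

    @Override
    public String toString() {
        return serverId + ":" + localSeqNo;
    }
}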
The coordinator that opened the transaction becomes the coordinator for the
distributed transaction and at the end is responsible for committing or aborting it. Each
of the servers that manages an object accessed by a transaction is a participant in the
transaction and provides an object we call the participant. Each participant is
responsible for keeping track of all of the recoverable objects at that server involved in
the transaction. The participants are responsible for cooperating with the coordinator in
carrying out the commit protocol.
During the progress of the transaction, the coordinator records a list of references
to the participants, and each participant records a reference to the coordinator.
The interface for Coordinator shown in Figure 12.3 provides an additional
method, join, which is used whenever a new participant joins the transaction:

join(Trans, reference to participant)


Informs a coordinator that a new participant has joined the transaction Trans.
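The following sketch (an illustration only, not the book's actual interface definition) shows how a coordinator interface with join might look in Java; it reuses the TransactionId sketch above, and Participant is left as a placeholder until the commit operations are introduced later in this chapter.

// Illustrative sketch only: a coordinator interface including the join operation, which a
// participant invokes the first time it handles an object belonging to transaction trans.
interface Participant { }

interface Coordinator {
    TransactionId openTransaction();
    void closeTransaction(TransactionId trans);
    void abortTransaction(TransactionId trans);

    // Informs the coordinator that a new participant has joined the transaction trans.
    void join(TransactionId trans, Participant participant);
}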


The coordinator records the new participant in its participant list. The fact that the
coordinator knows all the participants and each participant knows the coordinator will
enable them to collect the information that will be needed at commit time.
Figure 13.3 shows a client whose (flat) banking transaction involves accounts A,
B, C and D at servers BranchX, BranchY and BranchZ. The client’s transaction, T,
transfers $4 from account A to account C and then transfers $3 from account B to
account D. The transaction described on the left is expanded to show that
openTransaction and closeTransaction are directed to the coordinator, which would be
situated in one of the servers involved in the transaction. Each server is shown with a
participant, which joins the transaction by invoking the join method in the coordinator.
When the client invokes one of the methods in the transaction, for example
b.withdraw(T, 3), the object receiving the invocation (B at BranchY in this case) informs
its participant object that the object belongs to the transaction T. If it has not already
informed the coordinator, the participant object uses the join operation to do so. In this
example, we show the transaction identifier being passed as an additional argument so
that the recipient can pass it on to the coordinator. By the time the client calls
closeTransaction, the coordinator has references to all of the participants.
Note that it is possible for a participant to call abortTransaction in the coordinator
if for some reason it is unable to continue with the transaction.

13.3 Atomic commit protocols

Transaction commit protocols were devised in the early 1970s, and the two-phase
commit protocol appeared in Gray [1978]. The atomicity of transactions requires that
when a distributed transaction comes to an end, either all of its operations are carried out
or none of them. In the case of a distributed transaction, the client has requested the
operations at more than one server. A transaction comes to an end when the client
requests that a transaction be committed or aborted. A simple way to complete the
transaction in an atomic manner is for the coordinator to communicate the commit or
abort request to all of the participants in the transaction and to keep on repeating the
request until all of them have acknowledged that they had carried it out. This is an
example of a one-phase atomic commit protocol.
This simple one-phase atomic commit protocol is inadequate because, in the case
when the client requests a commit, it does not allow a server to make a unilateral
decision to abort a transaction, Reasons that prevent a server from being able to commit
its part o f a transaction generally relate to issues of concurrency control. For example,
if locking is in use, the. resolution o f a deadlock can lead to the aborting of a transaction
without the client being aware unless it makes another request to the server. If optimistic
concurrency control is in use, the failure of validation at a server would cause it to decide
to abort the transaction. The coordinator may not know when a server has crashed and
been replaced during the progress of a distributed transaction - such a server will need
to abort the transaction.
The two-phase commit protocol is designed to allow any participant to abort its
part of a transaction. Due to the requirement for atomicity, if one part of a transaction is
aborted, then the whole transaction must also be aborted. In the first phase of the
protocol, each participant votes for the transaction to be committed or aborted. Once a
participant has voted to commit a transaction, it is not allowed to abort it. Therefore,
before a participant votes to commit a transaction, it must ensure that it will eventually
be able to carry out its part of the commit protocol, even if it fails and is replaced in the
interim. A participant in a transaction is said to be in a prepared state for a transaction
if it will eventually be able to commit it. To make sure of this, each participant saves in
permanent storage all of the objects that it has altered in the transaction, together with
its status - prepared.
In the second phase of the protocol, every participant in the transaction carries out
the joint decision. If any one participant votes to abort, then the decision must be to abort
the transaction. If all the participants vote to commit, then the decision is to commit the
transaction.
The problem is to ensure that all of the participants vote and that they all reach the
same decision. This is fairly simple if no errors occur, but the protocol must work
correctly even when some of the servers fail, messages are lost or servers are temporarily
unable to communicate with one another.

Failure model for the commit protocols ◊ Section 12.1.2 presents a failure model for
transactions that applies equally to the two-phase (or any other) commit protocol.
Commit protocols are designed to work in an asynchronous system in which servers
may crash and messages may be lost. It is assumed that an underlying request-reply
protocol removes corrupt and duplicated messages. There are no byzantine faults -
servers either crash or else they obey the messages they are sent.
The two-phase commit protocol is an example of a protocol for reaching a
consensus. Chapter 11 asserts that consensus cannot be reached in an asynchronous
system if processes sometimes fail. However, the two-phase commit protocol does reach
consensus under those conditions. This is because crash failures of processes are masked
by replacing a crashed process with a new process whose state is set from information
saved in permanent storage and information held by other processes.

13.3.1 The two-phase commit protocol


During the progress of a transaction, there is no communication between the coordinator
and the participants apart from the participants informing the coordinator when they join
the transaction. A client’s request to commit (or abort) a transaction is directed to the
coordinator. If the client requests abortTransaction, or if the transaction is aborted by
one of the participants, the coordinator informs the participants immediately. It is when
the client asks the coordinator to commit the transaction that the two-phase commit protocol
comes into use.
In the first phase of the two-phase commit protocol the coordinator asks all the
participants if they are prepared to commit; and in the second, it tells them to commit
(or abort) the transaction. If a participant can commit its part of a transaction, it will
agree as soon as it has recorded the changes and its status in permanent storage - and is
prepared to commit. The coordinator in a distributed transaction communicates with the
participants to carry out the two-phase commit protocol by means of the operations
summarized in Figure 13.4. The methods canCommit, doCommit and doAbort are


Figure 13.4 Operations for two-phase commit protocol


canCommit?(trans) → Yes / No
Call from coordinator to participant to ask whether it can commit a transaction.
Participant replies with its vote.
doCommit( trans)
Call from coordinator to participant to tell participant to commit its part of a
transaction.
doAbort(trans)
Call from coordinator to participant to tell participant to abort its part of a transaction.
haveCommitted(trans, participant)
Call from participant to coordinator to confirm that it has committed the transaction.
getDecision(trans) → Yes / No
Call from participant to coordinator to ask for the decision on a transaction after it
has voted Yes but has still had no reply after some delay. Used to recover from server
crash or delayed messages.

methods in the interface of the participant. The methods haveCommitted and
getDecision are in the coordinator interface.
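A hedged Java rendering of these operations (our own names and signatures, reusing the TransactionId sketch from earlier) might look as follows.

enum Vote { YES, NO }

// Sketch of the Figure 13.4 operations as Java interfaces.
interface TwoPCParticipant {
    Vote canCommit(TransactionId trans);   // coordinator asks whether this part can commit
    void doCommit(TransactionId trans);    // coordinator tells participant to commit its part
    void doAbort(TransactionId trans);     // coordinator tells participant to abort its part
}

interface TwoPCCoordinator {
    // Participant confirms that it has committed the transaction.
    void haveCommitted(TransactionId trans, TwoPCParticipant participant);

    // Participant asks for the decision after voting Yes but receiving no reply.
    Vote getDecision(TransactionId trans);
}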
The two-phase commit protocol consists of a voting phase and a completion phase
as shown in Figure 13.5. By the end of step (2) the coordinator and all the participants
that voted Yes are prepared to commit. By the end of step (3) the transaction is
effectively completed. At step (3a) the coordinator and the participants are committed,
so the coordinator can report a decision to commit to the client. At (3b) the coordinator
reports a decision to abort to the client.
At step (4) participants confirm that they have committed so that the coordinator
knows when the information it has recorded about the transaction is no longer needed.
This apparently straightforward protocol could fail due to one or more of the
servers crashing or due to a breakdown in communication between the servers. To deal
with the possibility of crashing, each server saves information relating to the two-phase
commit protocol in permanent storage. This information can be retrieved by a new
process that is started to replace a crashed server. The recovery aspects of distributed
transactions are discussed in Section 13.6.
The exchange of information between the coordinator and participants can fail
when one of the servers crashes, or when messages are lost. Timeouts are used to avoid
processes blocking for ever. When a timeout occurs at a process, it must take an
appropriate action. To allow for this the protocol includes a timeout action for each step
at which a process may block. These actions are designed to allow for the fact that in an
asynchronous system, a timeout may not necessarily imply that a server has failed.

Timeout actions in the two-phase commit protocol ◊ There are various stages in the
protocol at which the coordinator or a participant cannot progress its part of the protocol
until it receives another request or reply from one of the others.


Figure 13.5 The two-phase commit protocol


Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the
transaction.
2. When a participant receives a canCommit? request it replies with its vote (Yes or
No) to the coordinator. Before voting Yes, it prepares to commit by saving objects
in permanent storage. If the vote is No the participant aborts immediately.
Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes the coordinator decides to
commit the transaction and sends a doCommit request to each of the
participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort
requests to all participants that voted Yes.
4. Participants that voted Yes are waiting for a doCommit or doAbort request from
the coordinator. When a participant receives one of these messages it acts
accordingly and in the case of commit, makes a haveCommitted call as
confirmation to the coordinator.
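The coordinator's side of this protocol can be sketched as follows; this is a minimal illustration that reuses the interfaces sketched above and omits the logging of the decision to permanent storage, the timeouts and the failure handling that a real implementation needs.

import java.util.ArrayList;
import java.util.List;

// Illustrative coordinator-side logic for Figure 13.5 (a sketch only).
class TwoPCCoordinatorLogic {
    boolean completeTransaction(TransactionId trans, List<TwoPCParticipant> participants) {
        // Phase 1 (voting): collect a vote from every participant.
        List<TwoPCParticipant> votedYes = new ArrayList<TwoPCParticipant>();
        boolean allYes = true;
        for (TwoPCParticipant p : participants) {
            if (p.canCommit(trans) == Vote.YES) {
                votedYes.add(p);
            } else {
                allYes = false;          // a single No (or missing vote) forces an abort
            }
        }
        // Phase 2 (completion): act on the joint decision.
        if (allYes) {
            for (TwoPCParticipant p : participants) {
                p.doCommit(trans);       // all voted Yes: commit everywhere
            }
        } else {
            for (TwoPCParticipant p : votedYes) {
                p.doAbort(trans);        // only participants that voted Yes need to be told to abort
            }
        }
        return allYes;
    }
}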

Consider first the situation where a participant has voted Yes and is waiting for the
coordinator to report on the outcome of the vote by telling it to commit or abort the
transaction. See step (2) in Figure 13.6. Such a participant is uncertain of the outcome
and cannot proceed any further until it gets the outcome of the vote from the coordinator.
The participant cannot decide unilaterally what to do next, and meanwhile the objects
used by its transaction cannot be released for use by other transactions. The participant
makes a getDecision request to the coordinator to determine the outcome of the
transaction. When it gets the reply it continues the protocol at step (4) in Figure 13.5. If

Figure 13.6 Communication in two-phase commit protocol

1. Coordinator (status: prepared to commit, waiting for votes) sends canCommit? to the participant.
2. Participant (status: prepared to commit, uncertain) replies Yes.
3. Coordinator (status: committed) sends doCommit to the participant.
4. Participant (status: committed) replies haveCommitted; the coordinator is then done.


the coordinator has failed, the participant will not be able to get the decision until the
coordinator is replaced, which can result in extensive delays for participants in the
uncertain state.
Other alternative strategies are available for the participants to obtain a decision
cooperatively instead of contacting the coordinator. These strategies have the advantage
that they may be used when the coordinator has failed. See Exercise 13.5 and Bernstein
et al. [1987] for details. However, even with a cooperative protocol, if all the
participants are in the uncertain state, they will be unable to get a decision until the
coordinator or a participant with the knowledge is available.
Another point at which a participant may be delayed is when it has carried out all
its client requests in the transaction but has not yet received a canCommit? call from the
coordinator. As the client sends the closeTransaction to the coordinator, a participant
can only detect such a situation if it notices that it has not had a request in a particular
transaction for a long time, for example by a timeout period on a lock. As no decision
has been made at this stage, the participant can decide to abort unilaterally after some
period of time.
The coordinator may be delayed when it is waiting for votes from the participants.
As it has not yet decided the fate of the transaction it may decide to abort the transaction
after some period of time. It must then announce doAbort to the participants who have
already sent their votes. Some tardy participants may try to vote Yes after this, but their
votes will be ignored and they will enter the uncertain state as described above.
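The timeout action of a participant left in the uncertain state can be sketched as follows (again an illustration reusing the interfaces above; retry logic and the case of an unreachable coordinator are omitted).

// Sketch of the uncertain participant's timeout action: after voting Yes and hearing
// nothing, it asks the coordinator for the decision and then continues at step (4).
class UncertainParticipantTimeout {
    void onTimeoutAfterVotingYes(TransactionId trans,
                                 TwoPCCoordinator coordinator,
                                 TwoPCParticipant self) {
        Vote decision = coordinator.getDecision(trans);  // may itself block until the coordinator is replaced
        if (decision == Vote.YES) {
            self.doCommit(trans);
            coordinator.haveCommitted(trans, self);
        } else {
            self.doAbort(trans);
        }
    }
}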

Performance of the two-phase commit protocol ◊ Provided that all goes well - that is,
that the coordinator and participants and the communication between them do not fail,
the two-phase commit protocol involving N participants can be completed with N
canCommit? messages and replies, followed by N doCommit messages. That is, the cost
in messages is proportional to 3N, and the cost in time is three rounds of messages. The
haveCommitted messages are not counted in the estimated cost of the protocol, which
can function correctly without them - their role is to enable servers to delete stale
coordinator information.
In the worst case, there may be arbitrarily many server and communication
failures during the two-phase commit protocol. However, the protocol is designed to
tolerate a succession of failures (server crashes or lost messages) and is guaranteed to
complete eventually, although it is not possible to specify a time limit within which it
will be completed.
As noted in the section on timeouts, the two-phase commit protocol can cause
considerable delays to participants in the uncertain state. These delays occur when the
coordinator has failed and cannot reply to getDecision requests from participants. Even
if a cooperative protocol allows participants to make getDecision requests to other
participants, delays will occur if all the active participants are uncertain.
Three-phase commit protocols have been designed to alleviate such delays. They
are more expensive in the number of messages and the number of rounds required for
the normal (failure-free) case. For a description of three-phase commit protocols, see
Exercise 13.2 and Bernstein et al. [1987].


Figure 13.7 Operations in coordinator for nested transactions


openSubTransaction(trans) → subTrans
Opens a new subtransaction whose parent is trans and returns a unique
subtransaction identifier.
getStatus(trans) → committed, aborted, provisional
Asks the coordinator to report on the status of the transaction trans. Returns values
representing one of the following: committed, aborted, provisional.

13.3.2 Two-phase commit protocol for nested transactions


The outermost transaction in a set of nested transactions is called the top-level
transaction. Transactions other than the top-level transaction are called subtransactions.
In Figure 13.1(b), T is the top-level transaction and T1, T2, T11, T12, T21 and T22 are
subtransactions. T1 and T2 are child transactions of T, which is referred to as their parent.
Similarly, T11 and T12 are child transactions of T1, and T21 and T22 are child transactions
of T2. Each subtransaction starts after its parent and finishes before it. Thus, for example,
T11 and T12 start after T1 and finish before it.
When a subtransaction completes, it makes an independent decision either to
commit provisionally or to abort. A provisional commit is not the same as being
prepared: it is just a local decision and is not backed up on permanent storage. If the
server crashes subsequently, its replacement will not be able to carry out a provisional
commit. For this reason, a two-phase commit protocol is required for nested transactions
to allow servers of provisionally committed transactions that have failed to abort them.
A coordinator for a subtransaction will provide an operation to open a
subtransaction, together with an operation enabling the coordinator of a subtransaction
to enquire whether its parent has yet committed or aborted, as shown in Figure 13.7.
A client starts a set of nested transactions by opening a top-level transaction with
an openTransaction operation, which returns a transaction identifier for the top-level
transaction. The client starts a subtransaction by invoking the openSubTransaction
operation, whose argument specifies its parent transaction. The new subtransaction
automatically joins the parent transaction, and a transaction identifier for a
subtransaction is returned.
An identifier for a subtransaction must be an extension of its parent’s TID,
constructed in such a way that the identifier of the parent or top-level transaction of a
subtransaction can be determined from its own transaction identifier. In addition, all
subtransaction identifiers should be globally unique. The client makes a set of nested
transactions come to completion by invoking closeTransaction or abortTransaction on
the coordinator o f the top-level transaction.
Meanwhile, each of the nested transactions carries out its operations. When they
are finished, the server managing a subtransaction records information as to whether the
subtransaction committed provisionally or aborted. Note that if its parent aborts, then the
subtransaction will be forced to abort too.
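One way (ours, purely illustrative) to build subtransaction identifiers as extensions of the parent's TID is to carry the top-level TID together with the path of child positions, so that the parent and top-level identifiers can always be derived:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical subtransaction identifier: the top-level TID plus the path of child
// positions (reusing the TransactionId sketch above).
final class SubTransactionId {
    final TransactionId topLevel;
    final List<Integer> path;          // e.g. [2, 1] would identify T21

    SubTransactionId(TransactionId topLevel, List<Integer> path) {
        this.topLevel = topLevel;
        this.path = Collections.unmodifiableList(new ArrayList<Integer>(path));
    }

    // The parent's identifier is obtained by dropping the last element of the path.
    SubTransactionId parent() {
        if (path.isEmpty()) {
            return this;               // already the top-level transaction
        }
        return new SubTransactionId(topLevel, path.subList(0, path.size() - 1));
    }
}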


Figure 13.8 Transaction T decides whether to commit

Recall from Chapter 12 that a parent transaction - including a top-level
transaction - can commit even if one of its child subtransactions has aborted. In such
cases, the parent transaction will be programmed to take different actions according to
whether a subtransaction has committed or aborted. For example, consider a banking
transaction that is designed to perform all the ‘standing orders’ at a branch on a
particular day. This transaction is expressed as several nested Transfer subtransactions,
each of which consists of nested deposit and withdraw subtransactions. We assume that
when an account is overdrawn, withdraw aborts and then the corresponding Transfer
aborts. But there is no need to abort all the standing orders just because one Transfer
subtransaction aborts. Instead of aborting, the top-level transaction will note the
Transfer subtransactions that aborted and take appropriate actions.
Consider the top-level transaction T and its subtransactions shown in Figure 13.8,
which is based on Figure 13.1(b). Each subtransaction has either provisionally
committed or aborted. For example, T12 has provisionally committed and T11 has
aborted, but the fate of T12 depends on its parent T1 and eventually on the top-level
transaction, T. Although T21 and T22 have both provisionally committed, T2 has aborted
and this means that T21 and T22 must also abort. Suppose that T decides to commit in
spite of the fact that T2 has aborted, also that T1 decides to commit in spite of the fact
that T11 has aborted.
When a top-level transaction completes, its coordinator carries out a two-phase
commit protocol. The only reason for a participant subtransaction being unable to
complete is if it has crashed since it completed its provisional commit. Recall that when
each subtransaction was created, it joined its parent transaction. Therefore, the
coordinator of each parent transaction has a list of its child subtransactions. When a
nested transaction provisionally commits, it reports its status and the status of its
descendants to its parent. When a nested transaction aborts, it just reports abort to its
parent without giving any information about its descendants. Eventually, the top-level
transaction receives a list of all the subtransactions in the tree, together with the status
of each. Descendants of aborted subtransactions are actually omitted from this list.
The information held by each coordinator in the example shown in Figure 13.8 is
shown in Figure 13.9. Note that T12 and T21 share a coordinator as they both run at server
N. When subtransaction T2 aborted, it reported the fact to its parent, T, but without
passing on any information about its subtransactions T21 and T22. A subtransaction is an
orphan if one of its ancestors aborts, either explicitly or because its coordinator crashed.

Figure 13.9 Information held by coordinators of nested transactions

Coordinator of transaction   Child transactions   Participant            Provisional commit list   Abort list
T                            T1, T2               yes                    T1, T12                   T11, T2
T1                           T11, T12             yes                    T1, T12                   T11
T2                           T21, T22             no (aborted)                                     T2
T11                                               no (aborted)                                     T11
T12, T21                                          T12 but not T21        T21, T12
T22                                               no (parent aborted)    T22

In our example, subtransactions T21 and T22 are orphans because their parent aborted
without passing information about them to the top-level transaction. Their coordinator
can, however, make enquiries about the status of their parent by using the getStatus
operation. A provisionally committed subtransaction of an aborted transaction should be
aborted, irrespective of whether the top-level transaction eventually commits.
The top-level transaction plays the role of coordinator in the two-phase commit
protocol, and the participant list consists of the coordinators of all the subtransactions in
the tree that have provisionally committed but do not have aborted ancestors. By this
stage, the logic of the program has determined that the top-level transaction should try
to commit whatever is left, in spite of some aborted subtransactions. In Figure 13.8, the
coordinators of T, T1 and T12 are participants and will be asked to vote on the outcome.
If they vote to commit, then they must prepare their transactions by saving the state of
the objects in permanent storage. This state is recorded as belonging to the top-level
transaction of which it will form a part. The two-phase commit protocol may be
performed in either a hierarchic manner or in a flat manner.
The second phase of the two-phase commit protocol is the same as for the non-
nested case. The coordinator collects the votes and then informs the participants as to
the outcome. When it is complete, coordinator and participants will have committed or
aborted their transactions.

Hierarchic two-phase commit protocol ◊ In this approach, the two-phase commit
protocol becomes a multi-level nested protocol. The coordinator of the top-level
transaction communicates with the coordinators of the subtransactions for which it is the
immediate parent. It sends canCommit? messages to each of the latter, which in turn
pass them on to the coordinators of their child transactions (and so on down the tree).
Each participant collects the replies from its descendants before replying to its parent.
In our example, T sends canCommit? messages to the coordinator of T1 and then T1
sends canCommit? messages to T12, asking about descendants of T1. The protocol does
not include the coordinators of transactions such as T2, which has aborted. Figure 13.10
shows the arguments required for canCommit? The first argument is the TID of the top-
level transaction, for use when preparing the data. The second argument is the TID of
the participant making the canCommit? call. The participant receiving the call looks in
its transaction list for any provisionally committed transaction or subtransaction

Figure 13.10 canCommit? for hierarchic two-phase commit protocol
canCommit?(trans, subTrans) → Yes / No
Call from coordinator to the coordinator of a child subtransaction to ask whether it can
commit a subtransaction subTrans. The first argument, trans, is the transaction identifier
of the top-level transaction. Participant replies with its vote Yes / No.

Figure 13.11 canCommit? for flat two-phase commit protocol


canCommit?(trans, abortList) → Yes / No
Call from coordinator to participant to ask whether it can commit a transaction.
Participant replies with its vote Yes / No.

matching the TID in the second argument. For example, the coordinator of T12 is also
the coordinator of T21, since they run in the same server, but when it receives the
canCommit? call, the second argument will be T1 and it will deal only with T12.
If a participant finds any subtransactions that match the second argument, it
prepares the objects and replies with a Yes vote. If it fails to find any, then it must have
crashed since it performed the subtransaction and it replies with a No vote.

Flat two-phase commit protocol ◊ In this approach, the coordinator of the top-level
transaction sends canCommit? messages to the coordinators of all of the subtransactions
in the provisional commit list. In our example, to the coordinators of T1 and T12. During
the commit protocol, the participants refer to the transaction by its top-level TID. Each
participant looks in its transaction list for any transaction or subtransaction matching
that TID. For example, the coordinator of T12 is also the coordinator of T21, since they
run in the same server (N).
Unfortunately, this does not provide sufficient information to enable correct
actions by participants such as the coordinator at server N that have a mix of
provisionally committed and aborted subtransactions. If N’s coordinator is just asked to
commit T it will end up by committing both T12 and T21, because, according to its local
information, both have provisionally committed. This is wrong in the case of T21,
because its parent, T2, has aborted. To allow for such cases, the canCommit? operation
for the flat commit protocol has a second argument that provides a list of aborted
subtransactions, as shown in Figure 13.11. A participant can commit descendants of the
top-level transaction unless they have aborted ancestors. When a participant receives a
canCommit? request, it does the following:

• If the participant has any provisionally committed transactions that are
descendants of the top-level transaction, trans:

- check that they do not have aborted ancestors in the abortList. Then prepare to
commit (by recording the transaction and its objects in permanent storage);


- those with aborted ancestors are aborted;


- send a Yes vote to the coordinator.
• If the participant does not have a provisionally committed descendant of the top-
level transaction, it must have failed since it performed the subtransaction and it
sends a No vote to the coordinator (a sketch of this check follows the list).
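A sketch of this participant-side check for the flat protocol, reusing the Vote, TransactionId and SubTransactionId types sketched earlier (names and details are our own; aborting and preparing are left as stubs):

import java.util.List;
import java.util.Set;

// Sketch of a participant handling canCommit?(trans, abortList) in the flat protocol.
class FlatCommitParticipant {
    private final Set<SubTransactionId> provisionallyCommitted;  // held at this server

    FlatCommitParticipant(Set<SubTransactionId> provisionallyCommitted) {
        this.provisionallyCommitted = provisionallyCommitted;
    }

    Vote canCommit(TransactionId trans, List<SubTransactionId> abortList) {
        boolean hasDescendant = false;
        for (SubTransactionId sub : provisionallyCommitted) {
            if (!sub.topLevel.equals(trans)) {
                continue;                                  // not a descendant of this top-level transaction
            }
            hasDescendant = true;
            if (hasAbortedAncestor(sub, abortList)) {
                abortLocally(sub);                         // e.g. T21 when T2 appears in the abort list
            } else {
                prepare(sub);                              // record objects and prepared status durably
            }
        }
        // No provisionally committed descendant: the server must have failed since it
        // performed the subtransaction, so it votes No.
        return hasDescendant ? Vote.YES : Vote.NO;
    }

    private boolean hasAbortedAncestor(SubTransactionId sub, List<SubTransactionId> abortList) {
        for (SubTransactionId aborted : abortList) {
            if (sub.path.size() > aborted.path.size()
                    && sub.path.subList(0, aborted.path.size()).equals(aborted.path)) {
                return true;                               // an ancestor appears in the abort list
            }
        }
        return false;
    }

    private void abortLocally(SubTransactionId sub) { /* omitted in this sketch */ }
    private void prepare(SubTransactionId sub) { /* omitted in this sketch */ }
}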
A comparison of the two approaches ◊ The hierarchic protocol has the advantage that at
each stage, the participant only need look for subtransactions of its immediate parent,
whereas the flat protocol needs to have the abort list in order to eliminate transactions
whose parents have aborted. Moss [1985] preferred the flat algorithm because it allows
the coordinator of the top-level transaction to communicate directly with all of the
participants, whereas the hierarchic variant involves passing a series of messages down
and up the tree in stages.
Timeout actions ◊ The two-phase commit protocol for nested transactions can cause the
coordinator or a participant to be delayed at the same three steps as in the non-nested
version. There is a fourth step at which subtransactions can be delayed. Consider
provisionally committed child subtransactions of aborted subtransactions: they do not
necessarily get informed of the outcome of the transaction. In our example, T21 is such
a subtransaction - it has provisionally committed, but as its parent T2 has aborted, it does
not become a participant. To deal with such situations, any subtransaction that has not
received a canCommit? message will make an enquiry after a timeout period. The
getStatus operation in Figure 13.7 allows a subtransaction to enquire whether its parent
has committed or aborted. To make such enquiries possible, the coordinators of aborted
subtransactions need to survive for a period. If an orphaned subtransaction cannot
contact its parent, it will eventually abort.

13.4 Concurrency control in distributed transactions

Each server manages a set of objects and is responsible for ensuring that they remain
consistent when accessed by concurrent transactions. Therefore, each server is
responsible for applying concurrency control to its own objects. The members of a
collection of servers of distributed transactions are jointly responsible for ensuring that
they are performed in a serially equivalent manner.
This implies that if transaction T is before transaction U in their conflicting access
to objects at one of the servers then they must be in that order at all of the servers whose
objects are accessed in a conflicting manner by both T and U.

13.4.1 Locking
In a distributed transaction, the locks on an object are held locally (in the same server).
The local lock manager can decide whether to grant a lock or make the requesting
transaction wait. However, it cannot release any locks until it knows that the transaction
has been committed or aborted at all the servers involved in the transaction. When
locking is used for concurrency control, the objects remain locked and are unavailable
for other transactions during the atomic commit protocol, although an aborted
transaction releases its locks after phase 1 of the protocol.
As lock managers in different servers set their locks independently of one another,
it is possible that different servers may impose different orderings on transactions.
Consider the following interleaving of transactions T and U at servers X and Y:

T                                    U
Write(A) at X    locks A
                                     Write(B) at Y    locks B
Read(B) at Y     waits for U
                                     Read(A) at X     waits for T

The transaction T locks object A at server X and then transaction U locks object B at
server Y. After that, T tries to access B at server Y and waits for U’s lock. Similarly,
transaction U tries to access A at server X and has to wait for T’s lock. Therefore, we
have T before U in one server and U before T in the other. These different orderings can
lead to cyclic dependencies between transactions and a distributed deadlock situation
arises. The detection and resolution of distributed deadlocks is discussed in the next
section of this chapter. When a deadlock is detected, a transaction is aborted to resolve
the deadlock. In this case, the coordinator will be informed and will abort the transaction
at the participants involved in the transaction.

13.4.2 Timestamp ordering concurrency control


In a single server transaction, the coordinator issues a unique timestamp to each
transaction when it starts. Serial equivalence is enforced by committing the versions of
objects in the order of the timestamps of transactions that accessed them. In distributed
transactions, we require that each coordinator issue globally unique timestamps. A
globally unique transaction timestamp is issued to the client by the first coordinator
accessed by a transaction. The transaction timestamp is passed to the coordinator at each
server whose objects perform an operation in the transaction.
The servers of distributed transactions are jointly responsible for ensuring that
they are performed in a serially equivalent manner. For example, if the version of an
object accessed by transaction U commits after the version accessed by T at one server,
then if T and U access the same object as one another at other servers, they must commit
them in the same order. To achieve the same ordering at all the servers, the coordinators
must agree as to the ordering of their timestamps. A timestamp consists of a pair <local
timestamp, server-id>. The agreed ordering of pairs of timestamps is based on a
comparison in which the server-id part is less significant.
The same ordering of transactions can be achieved at all the servers even if their
local clocks are not synchronized. However, for reasons of efficiency it is required that
the timestamps issued by one coordinator be roughly synchronized with those issued by
the other coordinators. When this is the case, the ordering of transactions generally
corresponds to the order in which they are started in real time. Timestamps can be kept
roughly synchronized by the use of synchronized local physical clocks (see Chapter 10).
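A minimal sketch of such a timestamp pair and its agreed ordering, in which the server identifier is the less significant part (the class is our own illustration):

// Sketch of a globally unique transaction timestamp: compared primarily by the local
// timestamp and, only on a tie, by the server identifier (the less significant part).
final class GlobalTimestamp implements Comparable<GlobalTimestamp> {
    final long localTimestamp;
    final String serverId;

    GlobalTimestamp(long localTimestamp, String serverId) {
        this.localTimestamp = localTimestamp;
        this.serverId = serverId;
    }

    @Override
    public int compareTo(GlobalTimestamp other) {
        if (localTimestamp != other.localTimestamp) {
            return Long.compare(localTimestamp, other.localTimestamp);
        }
        return serverId.compareTo(other.serverId);   // tie-break on the server identifier
    }
}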
When timestamp ordering is used for concurrency control, conflicts are resolved
as each operation is performed. If the resolution of a conflict requires a transaction to be
aborted, the coordinator will be informed and it will abort the transaction at all the
participants. Therefore, any transaction that reaches the client request to commit should
always be able to commit. Therefore, a participant in the two-phase commit protocol
will normally agree to commit. The only situation in which a participant will not agree
to commit is if it had crashed during the transaction.

13.4.3 Optimistic concurrency control


Recall that with optimistic concurrency control, each transaction is validated before it is
allowed to commit. Transaction numbers are assigned at the start of validation and
transactions are serialized according to the order of the transaction numbers. A
distributed transaction is validated by a collection of independent servers, each of which
validates transactions that access its own objects. The validation at all of the servers
takes place during the first phase of the two-phase commit protocol.
Consider the following interleavings of transactions T and U, which access objects
A and B at servers X and Y, respectively.

T U
Read(A) at X Read(B) at Y
Write(A) Write(B)
Read(B) at Y Read(A) at X
Write(B) Write(A)

The transactions access the objects in the order T before U at server X and in the order
U before T at server Y. Now suppose that T and U start validation at about the same time,
but server X validates T first and server Y validates U first. Recall that Section 12.5
recommends a simplification of the validation protocol that makes a rule that only one
transaction may perform validation and update phases at a time. Therefore each server
will be unable to validate the other transaction until the first one has completed. This is
an example of commitment deadlock.
The validation rules in Section 12.5 assume that validation is fast, which is true
for single-server transactions. However, in a distributed transaction, the two-phase
commit protocol may take some time and will delay other transactions from entering
validation until a decision on the current transaction has been obtained. In distributed
optimistic transactions, each server applies a parallel validation protocol. This is an
extension of either backward or forward validation to allow multiple transactions to be
in the validation phase at the same time. In this extension, rule 3 must be checked as well
as rule 2 for backward validation. That is, the write set of the transaction being validated
must be checked for overlaps with the write set of earlier overlapping transactions. Kung
and Robinson [1981] describe parallel validation in their paper.
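The extra check can be illustrated by the following rough sketch (Python; the read-set/write-set representation and the function name are assumptions made for the illustration, not the validation code of any particular system).

def backward_validate_parallel(candidate, overlapping):
    """Backward validation when several transactions may validate at once (sketch).

    candidate   -- dict with 'read_set' and 'write_set' (sets of object ids)
    overlapping -- earlier overlapping transactions (same structure) that have
                   entered validation or committed since the candidate started
    """
    for earlier in overlapping:
        # Rule 2: the candidate must not have read anything an earlier transaction wrote.
        if candidate["read_set"] & earlier["write_set"]:
            return False
        # Rule 3 (needed for parallel validation): the candidate's writes must not
        # overlap the writes of earlier overlapping transactions.
        if candidate["write_set"] & earlier["write_set"]:
            return False
    return True

if __name__ == "__main__":
    T = {"read_set": {"A"}, "write_set": {"A"}}
    U = {"read_set": {"B"}, "write_set": {"B", "A"}}
    print(backward_validate_parallel(U, overlapping=[T]))  # False: write sets overlap on A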


If parallel validation is used, transactions will not suffer from commitment
deadlock. However, if servers simply perform independent validations, it is possible that
different servers of a distributed transaction may serialize the same set of transactions in
different orders, for example with T before U at server X and U before T at server Y in
our example.
The servers of distributed transactions must prevent this from happening. One approach
is that after a local validation by each server, a global validation is carried out [Ceri and
Owicki 1982]. The global validation checks that the combination of the orderings at the
individual servers is serializable; that is, that the transaction being validated is not
involved in a cycle.
Another approach is that all of the servers of a particular transaction use the same
globally unique transaction number at the start of the validation [Schlageter 1982]. The
coordinator of the two-phase commit protocol is responsible for generating the globally
unique transaction number and passes it to the participants in the canCommit? messages.
As different servers may coordinate different transactions, the servers must (as in the
distributed timestamp ordering protocol) have an agreed order for the transaction
numbers they generate.
Agrawal et al. [1987] have proposed a variation of Kung and Robinson's
algorithm that favours read-only transactions, together with an algorithm called MVGV
(multi-version generalized validation). MVGV is a form of parallel validation that
ensures that transaction numbers reflect serial order, but it requires that the visibility of
some transactions be delayed after having committed. It also allows the transaction
number to be changed so as to permit some transactions to validate that otherwise would
have failed. The paper also proposes an algorithm for committing distributed
transactions. It is similar to Schlageter’s proposal in that a global transaction number has
to be found. At the end of the read phase, the coordinator proposes a value for the global
transaction number and each participant attempts to validate their local transactions
using that number. However, if the proposed global transaction number is too small,
some participants may not be able to validate their transaction and they negotiate with
the coordinator for an increased number. If no suitable number can be found, then that
participant will have to abort its transaction. Eventually, if all of the participants can
validate their transactions the coordinator will have received proposals for transaction
numbers from each of them. If common numbers can be found then the transaction will
be committed.

13.5 Distributed deadlocks

The discussion of deadlocks in Section 12.4 shows that deadlocks can arise within a
single server when locking is used for concurrency control. Servers must either prevent
or detect and resolve deadlocks. Using timeouts to resolve possible deadlocks is a
clumsy approach - it is difficult to choose an appropriate timeout interval, and
transactions are aborted unnecessarily. With deadlock detection schemes, a transaction
is aborted only when it is involved in a deadlock. Most deadlock detection schemes
operate by finding cycles in the transaction wait-for graph. In a distributed system
involving multiple servers being accessed by multiple transactions, a global wait-for


Figure 13.12 Interleavings of transactions U, V and W

U                                  V                                  W
d.deposit(10)   lock D
                                   b.deposit(10)   lock B
a.deposit(20)   lock A (at X)
                                                                      c.deposit(30)   lock C (at Z)
b.withdraw(30)  wait at Y
                                   c.withdraw(20)  wait at Z
                                                                      a.withdraw(20)  wait at X

graph can in theory be constructed from the local ones. There can be a cycle in the global
wait-for graph that is not in any single local one - that is, there can be a distributed
deadlock. Recall that the wait-for graph is a directed graph in which nodes represent
transactions and objects, and edges represent either an object held by a transaction or a
transaction waiting for an object. There is a deadlock if and only if there is a cycle in the
wait-for graph.
Figure 13.12 shows the interleavings of the transactions U, V and W involving the
objects A and B managed by servers X and Y and objects C and D managed by server Z.
The complete wait-for graph in Figure 13.13(a) shows that a deadlock cycle
consists of alternate edges, which represent a transaction waiting for an object and an
object held by a transaction. As any transaction can only be waiting for one object at a
time, objects can be left out of wait-for graphs, as shown in Figure 13.13(b).
Detection of a distributed deadlock requires a cycle to be found in the global
transaction wait-for graph that is distributed among the servers that were involved in the
transactions. Local wait-for graphs can be built by the lock manager at each server, as
discussed in Chapter 12. In the above example, the local wait-for graphs of the servers
are:

server Y: U -> V (added when U requests b.withdraw(30))
server Z: V -> W (added when V requests c.withdraw(20))
server X: W -> U (added when W requests a.withdraw(20))

As the global wait-for graph is held in part by each of the several servers involved,
communication between these servers is required to find cycles in the graph.
A simple solution is to use centralized deadlock detection, in which one server
takes on the role of global deadlock detector. From time to time, each server sends the
latest copy of its local wait-for graph to the global deadlock detector, which
amalgamates the information in the local graphs in order to construct a global wait-for
graph. The global deadlock detector checks for cycles in the global wait-for graph.
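A minimal sketch of the cycle check that such a global detector might run over the amalgamated graph is shown below (Python; the graph representation and the function name are hypothetical, and the sketch covers only cycle detection, not the rest of the detector).

def find_cycle(wait_for):
    """Find a cycle in a transaction wait-for graph given as {waiter: {holders}}.

    Returns a list of transactions forming a cycle, or None.
    """
    visited, on_stack = set(), []

    def dfs(t):
        if t in on_stack:
            return on_stack[on_stack.index(t):]      # cycle found
        if t in visited:
            return None
        visited.add(t)
        on_stack.append(t)
        for holder in wait_for.get(t, ()):
            cycle = dfs(holder)
            if cycle:
                return cycle
        on_stack.pop()
        return None

    for t in list(wait_for):
        cycle = dfs(t)
        if cycle:
            return cycle
    return None

if __name__ == "__main__":
    # Local graphs from the example (Y: U->V, Z: V->W, X: W->U), amalgamated globally.
    global_graph = {"U": {"V"}, "V": {"W"}, "W": {"U"}}
    print(find_cycle(global_graph))   # e.g. ['U', 'V', 'W']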

Figure 13.13 Distributed deadlock: (a) complete wait-for graph including objects; (b) wait-for graph with the objects omitted

When it finds a cycle, it makes a decision on how to resolve the deadlock and informs
the servers as to the transaction to be aborted to resolve the deadlock.
Centralized deadlock detection is not a good idea, because it depends on a single
server to carry it out. It suffers from the usual problems associated with centralized
solutions in distributed systems - poor availability, lack of fault tolerance and no ability
to scale. In addition, the cost of the frequent transmission of local wait-for graphs is
high. If the global graph is collected less frequently, deadlocks may take longer to be
detected.
Phantom deadlocks ◊ A deadlock that is 'detected' but is not really a deadlock is called
a phantom deadlock. In distributed deadlock detection, information about wait-for
relationships between transactions is transmitted from one server to another. If there is
a deadlock, the necessary information will eventually be collected in one place and a
cycle will be detected. As this procedure will take some time, there is a chance that one
of the transactions that holds a lock will meanwhile have released it, in which case the
deadlock will no longer exist.
Consider the case of a global deadlock detector that receives local wait-for graphs
from servers X and Y, as shown in Figure 13.14. Suppose that transaction U then releases
an object at server X and requests the one held by V at server Y. Suppose also that the
global detector receives server Y's local graph before server X's. In this case, it would
detect a cycle T -> U -> V -> T, although the edge T -> U no longer exists. This is an
example of a phantom deadlock.
The observant reader will have realized that if transactions are using two-phase
locks, they cannot release objects and then obtain more objects, and phantom deadlock
cycles cannot occur in the way suggested above. Consider the situation in which a cycle

Figure 13.14 Local and global wait-for graphs (panels: local wait-for graph, local wait-for graph, global deadlock detector)

T -> U -> V -> T is detected: either this represents a deadlock or each of the transactions
T, U and V must eventually commit. It is actually impossible for any of them to commit,
because each of them is waiting for an object that will never be released.
A phantom deadlock could be detected if a waiting transaction in a deadlock cycle
aborts during the deadlock detection procedure. For example, if there is a cycle
T -> U -> V -> T and U aborts after the information concerning U has been collected,
then the cycle has been broken already and there is no deadlock.
Edge chasing ◊ A distributed approach to deadlock detection uses a technique called
edge chasing or path pushing. In this approach, the global wait-for graph is not
constructed, but each of the servers involved has knowledge about some of its edges.
The servers attempt to find cycles by forwarding messages called probes, which follow
the edges of the graph throughout the distributed system. A probe message consists of
transaction wait-for relationships representing a path in the global wait-for graph.
The question is: when should a server send out a probe? Consider the situation at
server X in Figure 13.13. This server has just added the edge W -> U to its local wait-for
graph and at this time, transaction U is waiting to access object B, which transaction V
holds at server Y. This edge could possibly be part of a cycle such as
V -> T1 -> T2 -> ... -> W -> U -> V, involving transactions using objects at other
servers. This indicates that there is a potential distributed deadlock cycle, which could
be found by sending out a probe to server Y.
Now consider the situation a little earlier when server Z added the edge V -> W to
its local graph: at this point in time, W is not waiting. Therefore, there would be no point
in sending out a probe.
Each distributed transaction starts at a server (called the coordinator of the
transaction) and moves to several other servers (called participants in the transaction),
which can communicate with the coordinator. At any point in time, a transaction can be
either active or waiting at just one of these servers. The coordinator is responsible for
recording whether the transaction is active or is waiting for a particular object, and
participants can get this information from their coordinator. Lock managers inform
coordinators when transactions start waiting for objects and when transactions acquire
objects and become active again. When a transaction is aborted to break a deadlock, its
coordinator will inform the participants and all of its locks will be removed, with the
effect that all edges involving that transaction will be removed from the local wait-for
graphs.

Figure 13.15 Probes transmitted to detect deadlock

Edge-chasing algorithms have three steps - initiation, detection and resolution.

Initiation. When a server notes that a transaction T starts waiting for another
transaction U, where U is waiting to access an object at another server, it initiates
detection by sending a probe containing the edge < T -> U > to the server of the
object at which transaction U is blocked. If U is sharing a lock, probes are sent to all
the holders of the lock. Sometimes further transactions may start sharing the lock
later on, in which case probes can be sent to them too.

Detection. Detection consists of receiving probes and deciding whether deadlock
has occurred and whether to forward the probes.
For example, when a server of an object receives a probe < T -> U >
(indicating that T is waiting for a transaction U that holds a local object), it checks to
see whether U is also waiting. If it is, the transaction it waits for (for example, V) is
added to the probe (making it < T -> U -> V >), and if the new transaction (V) is
waiting for another object elsewhere, the probe is forwarded.
In this way, paths through the global wait-for graph are built one edge at a time.
Before forwarding a probe, the server checks to see whether the transaction (for
example, T) it has just added has caused the probe to contain a cycle (for example,
< T -> U -> V -> T >). If this is the case, it has found a cycle in the graph and
deadlock has been detected.

Resolution. When a cycle is detected, a transaction in the cycle is aborted to break
the deadlock. A rough sketch of the detection step is given below.
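The sketch that follows (Python, with simplified and hypothetical bookkeeping; lock sharing, coordinators and message transmission are not modelled) shows how a server receiving a probe might extend it with the holder of the local object and either report a deadlock or forward the probe.

def receive_probe(probe, holder, blocked_at):
    """Handle a probe arriving at the server of an object (sketch, hypothetical names).

    probe      -- list of transactions; e.g. ['W', 'U'] represents the path W -> U
    holder     -- transaction holding the local object the last transaction is waiting for
    blocked_at -- dict mapping a transaction to the server where it is blocked,
                  or None if it is active (its coordinator would supply this)
    """
    extended = probe + [holder]
    if holder in probe:                        # the path now contains a cycle
        return ("deadlock detected", extended)
    server = blocked_at.get(holder)
    if server is not None:                     # holder is itself waiting elsewhere
        return ("forward probe", extended, server)
    return ("discard probe",)                  # holder is active, nothing to do

if __name__ == "__main__":
    # The running example: U waits at Y, V waits at Z, W waits at X.
    blocked = {"U": "Y", "V": "Z", "W": "X"}
    # Server X initiates detection with probe <W -> U>, sent to Y where U is blocked.
    print(receive_probe(["W", "U"], holder="V", blocked_at=blocked))
    # ('forward probe', ['W', 'U', 'V'], 'Z') -- server Y forwards <W -> U -> V> to Z
    print(receive_probe(["W", "U", "V"], holder="W", blocked_at=blocked))
    # ('deadlock detected', ['W', 'U', 'V', 'W']) -- server Z finds the cycle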

In our example, the following steps describe how deadlock detection is initiated and the
probes that are forwarded during the corresponding detection phase.

• Server X initiates detection by sending probe < W -> U > to the server of B (server Y).

• Server Y receives probe < W -> U >, notes that B is held by V and appends V to
the probe to produce < W -> U -> V >. It notes that V is waiting for C at server Z.
This probe is forwarded to server Z.

• Server Z receives probe < W -> U -> V > and notes that C is held by W and appends
W to the probe to produce < W -> U -> V -> W >.

This path contains a cycle. The server detects a deadlock. One of the transactions in the
cycle must be aborted to break the deadlock. The transaction to be aborted can be chosen
according to transaction priorities, which are described shortly.
Figure 13.15 shows the progress of the probe messages from the initiation by the
server of A to the deadlock detection by the server of C. Probes are shown as heavy
arrows, objects as circles and transaction coordinators as rectangles. Each probe is
shown as going directly from one object to another. In reality, before a server transmits
a probe to another server, it consults the coordinator of the last transaction in the path to
find out whether the latter is waiting for another object elsewhere. For example, before
the server of B transmits the probe < W -> U -> V > it consults the coordinator of V to
find out that V is waiting for C. In most of the edge-chasing algorithms, the servers of
objects send probes to transaction coordinators, which then forward them (if the
transaction is waiting) to the server of the object the transaction is waiting for. In our
example, the server of B transmits the probe < W -> U -> V > to the coordinator of V,
which then forwards it to the server of C. This shows that when a probe is forwarded,
two messages are required.
The above algorithm should find any deadlock that occurs, provided that waiting
transactions do not abort and there are no failures such as lost messages or servers
crashing. To understand this, consider a deadlock cycle in which the last transaction, W,
starts waiting and completes the cycle. When W starts waiting for an object, the server
initiates a probe that goes to the server of the object held by each transaction that W is
waiting for. The recipients extend and forward the probes to the servers of objects
requested by all waiting transactions they find. Thus every transaction that W waits for
directly or indirectly will be added to the probe unless a deadlock is detected. When
there is a deadlock, W is waiting for itself indirectly. Therefore, the probe will return to
the object that W holds.
It might appear that large numbers of messages are sent in order to detect
deadlock. In the above example, we see two probe messages to detect a cycle involving
three transactions. Each of the probe messages is in general two messages (from object
to coordinator and then from coordinator to object).
A probe that detects a cycle involving N transactions will be forwarded by (N - 1)
transaction coordinators via (N - 1) servers of objects, requiring 2(N - 1) messages.
Fortunately, the majority of deadlocks involve cycles containing only two transactions,
and there is no need for undue concern about the number of messages involved. This
observation has been made from studies of databases. It can also be argued by
considering the probability of conflicting access to objects. See Bernstein et al. [1987].

Transaction priorities ◊ In the above algorithm, every transaction involved in a deadlock


cycle can cause deadlock detection to be initiated. The effect of several transactions in a
cycle initiating deadlock detection is that detection may happen at several different
servers in the cycle with the result that more than one transaction in the cycle is aborted.

Figure 13.16 Two probes initiated: (a) initial situation; (b) detection initiated at object requested by T; (c) detection initiated at object requested by W

In Figure 13.16(a), consider transactions T, U, V and W, where U is waiting for W
and V is waiting for T. At about the same time, T requests the object held by U and W
requests the object held by V. Two separate probes < T -> U > and < W -> V > are
initiated by the servers of these objects and are circulated until deadlock is detected by
each of two different servers. See Figure 13.16(b), where the cycle is
< T -> U -> W -> V -> T >, and (c), where the cycle is < W -> V -> T -> U -> W >.
In order to ensure that only one transaction in a cycle is aborted, transactions are
given priorities in such a way that all transactions are totally ordered. Timestamps, for
example, may be used as priorities. When a deadlock cycle is found, the transaction with
the lowest priority is aborted. Even if several different servers detect the same cycle,
they will all reach the same decision as to which transaction is to be aborted. We write
T > U to indicate that T has higher priority than U. In the above example, assume
T > U > V > W. Then the transaction W will be aborted when either of the cycles
< T -> U -> W -> V -> T > or < W -> V -> T -> U -> W > is detected.
It might appear that transaction priorities could also be used to reduce the number
of situations that cause deadlock detection to be initiated, by using the rule that detection
is initiated only when a higher-priority transaction starts to wait for a lower-priority one.
In our example in Figure 13.16, as T > U the initiating probe < T -> U > would be sent,
but as W < V the initiating probe < W -> V > would not be sent. If we assume that when
a transaction starts waiting for another transaction it is equally likely that the waiting
transaction has higher or lower priority than the waited-for transaction, then the use of
this rule is likely to reduce the number of probe messages by about half.
Transaction priorities could also be used to reduce the number of probes that are
forwarded. The general idea is that probes should travel 'downhill' - that is, from
transactions with higher priorities to transactions with lower priorities. To do this,
servers use the rule that they do not forward any probe to a holder that has higher priority
than the initiator. The argument for doing this is that if the holder is waiting for another
transaction then it must have initiated detection by sending a probe when it started
waiting.

Figure 13.17 Probes travel downhill: (a) V stores the probe when U starts waiting; (b) the probe is forwarded (from V's probe queue) when V starts waiting for C

However, there is a pitfall associated with these apparent improvements. In our
example in Figure 13.15, transactions U, V and W are executed in an order in which U is
waiting for V and V is waiting for W when W starts waiting for U. Without priority rules,
detection is initiated when W starts waiting by sending a probe < W -> U >. Under the
priority rule, this probe will not be sent, because W < U and deadlock will not be
detected.
The problem is that the order in which transactions start waiting can determine
whether or not deadlock will be detected. The above pitfall can be avoided by using a
scheme in which coordinators save copies of all the probes received on behalf of each
transaction in a probe queue. When a transaction starts waiting for an object, it forwards
the probes in its queue to the server of the object, which propagates the probes on
downhill routes.
In our example in Figure 13.15, when U starts waiting for V, the coordinator of V
will save the probe < U -> V >. See Figure 13.17(a). Then when V starts waiting for W,
the coordinator of W will store < V -> W > and V will forward its probe queue
< U -> V > to W. See Figure 13.17(b), in which W's probe queue has < U -> V > and
< V -> W >. When W starts waiting for A it will forward its probe queue
< U -> V -> W > to the server of A, which also notes the new dependency W -> U and
combines it with the information in the probe received to determine that
U -> V -> W -> U. Deadlock is detected.
When an algorithm requires probes to be stored in probe queues, it also requires
arrangements to pass on probes to new holders and to discard probes that refer to
transactions that have been committed or aborted. If relevant probes are discarded,
undetected deadlocks may occur, and if outdated probes are retained, false deadlocks
may be detected. This adds much to the complexity of any edge-chasing algorithm.
Readers who are interested in the details of such algorithms should see Sinha and
Natarajan [1985] and Choudhary et al. [1989], who present algorithms for use with
exclusive locks. But they will see that Choudhary et al. showed that Sinha and
Natarajan's algorithm is incorrect and fails to detect all deadlocks and may even report
false deadlocks. Kshemkalyani and Singhal [1991] corrected the algorithm of
Choudhary et al. (which fails to detect all deadlocks and may report false deadlocks) and
provide a proof of correctness for the corrected algorithm. In a subsequent paper,
Kshemkalyani and Singhal [1994] argue that distributed deadlocks are not very well
understood because there is no global state or time in a distributed system. In fact, any
cycle that has been collected may contain sections recorded at different times. In
addition, sites may hear about deadlocks but may not hear that they have been resolved
until after random delays. The paper describes distributed deadlocks in terms of the
contents of distributed memory, using causal relationships between events at different
sites.

13.6 Transaction recovery

The atomic property of transactions requires that the effects of all committed
transactions and none of the effects of incomplete or aborted transactions are reflected
in the objects they accessed. This property can be described in terms of two aspects:
durability and failure atomicity. Durability requires that objects are saved in permanent
storage and will be available indefinitely thereafter. Therefore, an acknowledgment of a
client's commit request implies that all the effects of the transaction have been recorded
in permanent storage as well as in the server's (volatile) objects. Failure atomicity
requires that effects of transactions are atomic even when the server crashes. Recovery
is concerned with ensuring that a server's objects are durable and that the service
provides failure atomicity.
Although file servers and database servers maintain data in permanent storage,
other kinds of servers of recoverable objects need not do so except for recovery
purposes. In this chapter, we assume that when a server is running it keeps all of its
objects in its volatile memory and records its committed objects in a recovery file or
files. Therefore, recovery consists of restoring the server with the latest committed
versions of its objects from permanent storage. Databases need to deal with large
volumes of data. They generally hold the objects in stable storage on disk with a cache
in volatile memory.
The two requirements for durability and for failure atomicity are not really
independent of one another and can be dealt with by a single mechanism - the recovery
manager. The task of a recovery manager is:

• to save objects in permanent storage (in a recovery file) for committed transactions;

• to restore the server's objects after a crash;

• to reorganize the recovery file to improve the performance of recovery;

• to reclaim storage space (in the recovery file).

In some cases, we require the recovery manager to be resilient to media failures -
failures of its recovery file so that some of the data on the disk is lost, either by being
corrupted during a crash, by random decay or by a permanent failure. In such cases, we
need another copy of the recovery file. This can be in stable storage, which is
implemented so as to be very unlikely to fail by using mirrored disks or copies at a
different location.

Intentions list ◊ Any server that provides transactions needs to keep track of the
objects accessed by clients' transactions. Recall from Chapter 12 that when a client
opens a transaction, the server first contacted provides a new transaction identifier and
returns it to the client. Each subsequent client request within a transaction up to and
including the commit or abort request includes the transaction identifier as an argument.
During the progress of a transaction, the update operations are applied to a private set of
tentative versions of the objects belonging to the transaction.
At each server, an intentions list is recorded for all of its currently active
transactions - an intentions list of a particular transaction contains a list of the references
and the values of all the objects that are altered by that transaction. When a transaction
is committed, that transaction's intentions list is used to identify the objects it affected.
The committed version of each object is replaced by the tentative version made by that
transaction, and the new value is written to the server's recovery file. When a transaction
aborts, the server uses the intentions list to delete all the tentative versions of objects
made by that transaction.
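A minimal in-memory sketch of this bookkeeping (Python; hypothetical names, no recovery file and no concurrency control) is given below: each transaction has a private set of tentative versions, and the intentions list is derived from them.

class Server:
    """Per-transaction tentative versions and intentions lists (sketch)."""

    def __init__(self, committed):
        self.committed = dict(committed)   # object id -> committed value
        self.tentative = {}                # transaction id -> {object id -> tentative value}

    def write(self, tid, obj, value):
        self.tentative.setdefault(tid, {})[obj] = value

    def intentions_list(self, tid):
        # References and values of all objects altered by the transaction.
        return sorted(self.tentative.get(tid, {}).items())

    def commit(self, tid):
        # The committed version of each object in the intentions list is replaced by
        # the tentative version (the new values would also go to the recovery file).
        self.committed.update(self.tentative.pop(tid, {}))

    def abort(self, tid):
        # The intentions list identifies the tentative versions to discard.
        self.tentative.pop(tid, None)

if __name__ == "__main__":
    s = Server({"A": 100, "B": 200})
    s.write("T", "A", 80); s.write("T", "B", 220)
    print(s.intentions_list("T"))   # [('A', 80), ('B', 220)]
    s.commit("T")
    print(s.committed)              # {'A': 80, 'B': 220}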
Recall also that a distributed transaction must carry out an atomic commit protocol
before it can be committed or aborted. Our discussion of recovery is based on the two-
phase commit protocol, in which all the participants involved in a transaction first say
whether they are prepared to commit and then, later on if all the participants agree, they
all carry out the actual commit actions. If the participants cannot agree to commit, they
must abort the transaction.
At the point when a participant says it is prepared to commit a transaction, its
recovery manager must have saved both its intentions list for that transaction and the
objects in that intentions list in its recovery file, so that it will be able to carry out the
commitment later on, even if it crashes in the interim.
When all the participants involved in a transaction agree to commit it, the
coordinator informs the client and then sends messages to the participants to commit
their part of the transaction. Once the client has been informed that a transaction has
committed, the recovery files of the participating servers must contain sufficient
information to ensure that the transaction is committed by all of the servers, even if some
of them crash between preparing to commit and committing.
Entries in recovery file ◊ To deal with recovery of a server that can be involved in
distributed transactions, further information in addition to the values of the objects is
stored in the recovery file. This information concerns the status of each transaction -
whether it is committed, aborted or prepared to commit. In addition, each object in the
recovery file is associated with a particular transaction by saving the intentions list in
the recovery file. Figure 13.18 shows a summary of the types of entry included in a
recovery file.
The transaction status values relating to the two-phase commit protocol are
discussed in Section 13.6.4 on recovery of the two-phase commit protocol. We shall
now describe two approaches to the use of recovery files: logging and shadow versions.

13.6.1 Logging
In the logging technique, the recovery file represents a log containing the history of all
the transactions performed by a server. The history consists of values of objects,


Figure 13.18 Types of entry in a recovery file

Type of entry        Description of contents of entry

Object               A value of an object.

Transaction status   Transaction identifier, transaction status (prepared, committed,
                     aborted) and other status values used for the two-phase commit
                     protocol.

Intentions list      Transaction identifier and a sequence of intentions, each of which
                     consists of <identifier of object>, <position in recovery file of
                     value of object>.

transaction status entries and intentions lists of transactions. The order of the entries in
the log reflects the order in which transactions have prepared, committed and aborted at
that server. In practice, the recovery file will contain a recent snapshot of the values of
all the objects in the server followed by a history of transactions after the snapshot.
During the normal operation of a server, its recovery manager is called whenever
a transaction prepares to commit, commits or aborts a transaction. When the server is
prepared to commit a transaction, the recovery manager appends all the objects in its
intentions list to the recovery file, followed by the current status of that transaction
(prepared) together with its intentions list. When a transaction is eventually committed
or aborted, the recovery manager appends the corresponding status of the transaction to
its recovery file.
It is assumed that the append operation is atomic in the sense that it writes one or
more complete entries to the recovery file. If the server fails, only the last write can be
incomplete. To make efficient use of the disk, several subsequent writes can be buffered
and then written as a single write to disk. An additional advantage of the logging
technique is that sequential writes to disk are faster than writes to random locations.
After a crash, any transaction that does not have a committed status in the log is
aborted. Therefore, when a transaction commits, its committed status entry must be
forced to the log - that is, written to the log together with any other buffered entries.
The recovery manager associates a unique identifier with each object so that the
successive versions of an object in the recovery file may be associated with the server's
objects. For example, a durable form of a remote object reference such as a CORBA
persistent reference will do as an object identifier.
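The sequence of appends might be sketched as follows (Python; the entry formats, the buffering and the 'force' flag are simplifications assumed for the illustration rather than the layout of any real recovery file).

class Log:
    """Append-only recovery file for the logging technique (in-memory sketch)."""

    def __init__(self):
        self.entries = []      # would be a sequential file on disk
        self.buffered = []     # writes buffered before being forced to disk

    def append(self, entry, force=False):
        self.buffered.append(entry)
        if force:
            # A committed status entry must be forced: it is written to the
            # recovery file together with any other buffered entries.
            self.entries.extend(self.buffered)
            self.buffered.clear()
        return len(self.entries) + len(self.buffered) - 1   # position of this entry

    def prepare(self, tid, intentions):
        positions = {obj: self.append(("object", obj, value))
                     for obj, value in intentions}
        self.append(("status", tid, "prepared", positions))

    def commit(self, tid):
        self.append(("status", tid, "committed"), force=True)

if __name__ == "__main__":
    log = Log()
    log.prepare("T", [("A", 80), ("B", 220)])
    log.commit("T")
    for entry in log.entries:
        print(entry)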
Figure 13.19 illustrates the log mechanism for the banking service transactions T
and U in Figure 12.7. The log was recently reorganized, and entries to the left of the
double line represent a snapshot of the values of A, B and C before transactions T and U
started. In this diagram, we use the names A, B and C as unique identifiers for objects.
We show the situation when transaction T has committed and transaction U has prepared
but not committed. When transaction T prepares to commit, the values of objects A and
B are written at positions P1 and P2 in the log, followed by a prepared transaction status
entry for T with its intentions list (<A, P1>, <B, P2>). When transaction T commits, a
committed transaction status entry for T is put at position P4. Then when transaction U

Figure 13.19 Log for banking service

P0 (checkpoint; snapshot of the objects before T and U):
     Object: A = 100   Object: B = 200   Object: C = 300
P1:  Object: A = 80
P2:  Object: B = 220
P3:  Trans: T, prepared, intentions list <A, P1>, <B, P2>   (previous status entry: P0)
P4:  Trans: T, committed                                    (previous status entry: P3)
P5:  Object: C = 278
P6:  Object: B = 242
P7:  Trans: U, prepared, intentions list <C, P5>, <B, P6>   (previous status entry: P4)

prepares to commit, the values of objects C and B are written at positions P5 and P6 in
the log, followed by a prepared transaction status entry for U with its intentions list
(<C, P5>, <B, P6>).
Each transaction status entry contains a pointer to the position in the recovery file
of the previous transaction status entry to enable the recovery manager to follow the
transaction status entries in reverse order through the recovery file. The last pointer in
the sequence of transaction status entries points to the checkpoint.
Recovery of objects ◊ When a server is replaced after a crash, it first sets default initial
values for its objects and then hands over to its recovery manager. The recovery
manager is responsible for restoring the server's objects so that they include all the
effects of all the committed transactions performed in the correct order and none of the
effects of incomplete or aborted transactions.
The most recent information about transactions is at the end of the log. There are
two approaches to restoring the data from the recovery file. In the first, the recovery
manager starts at the beginning and restores the values of all of the objects from the most
recent checkpoint. It then reads in the values of each of the objects, associates them with
their intentions lists and for committed transactions replaces the values of the objects. In
this approach, the transactions are replayed in the order in which they were executed and
there could be a large number of them. In the second approach, the recovery manager
will restore a server's objects by 'reading the recovery file backwards'. The recovery file
has been structured so that there is a backwards pointer from each transaction status
entry to the next. The recovery manager uses transactions with committed status to
restore those objects that have not yet been restored. It continues until it has restored all
of the server's objects. This has the advantage that each object is restored once only.
To recover the effects of a transaction, a recovery manager gets the corresponding
intentions list from its recovery file. The intentions list contains the identifiers and
positions in the recovery file of values of all the objects affected by the transaction.
If the server fails at the point reached in Figure 13.19, its recovery manager will
recover the objects as follows. It starts at the last transaction status entry in the log (at
P7) and concludes that transaction U has not committed and its effects should be
ignored. It then moves to the previous transaction status entry in the log (at P4) and
concludes that transaction T has committed. To recover the objects affected by
transaction T, it moves to the previous transaction status entry in the log (at P3) and finds
the intentions list for T (<A, P1>, <B, P2>). It then restores objects A and B from the
values at P1 and P2. As it has not yet restored C, it moves back to P0, which is a
checkpoint, and restores C.
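Reading the recovery file backwards can be sketched roughly as follows (Python; the simplified entry format is assumed for the illustration, and the example log mirrors Figure 13.19, where U has prepared but not committed).

def recover(entries, all_object_ids):
    """Restore committed object values by scanning the log backwards (sketch)."""
    restored = {}
    committed = set()
    # Walk the log from the most recent entry back towards the checkpoint.
    for entry in reversed(entries):
        kind = entry[0]
        if kind == "status":
            _, tid, status = entry[:3]
            if status == "committed":
                committed.add(tid)
        elif kind == "intentions" and entry[1] in committed:
            # Restore each object from this transaction unless a later committed
            # transaction (already processed) has restored it.
            for obj, value in entry[2]:
                restored.setdefault(obj, value)
        elif kind == "checkpoint":
            for obj, value in entry[1].items():
                restored.setdefault(obj, value)
    return {obj: restored.get(obj) for obj in all_object_ids}

if __name__ == "__main__":
    log = [
        ("checkpoint", {"A": 100, "B": 200, "C": 300}),
        ("intentions", "T", [("A", 80), ("B", 220)]),
        ("status", "T", "prepared"),
        ("status", "T", "committed"),
        ("intentions", "U", [("C", 278), ("B", 242)]),
        ("status", "U", "prepared"),
    ]
    print(recover(log, ["A", "B", "C"]))   # {'A': 80, 'B': 220, 'C': 300}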
To help with subsequent reorganization of the recovery file, the recovery manager
notes all the prepared transactions it finds during the process of restoring the server's
objects. For each prepared transaction, it adds an aborted transaction status to the
recovery file. This ensures that in the recovery file, every transaction is eventually
shown as either committed or aborted.
The server could fail again during the recovery procedures. It is essential that
recovery be idempotent in the sense that it can be done any number of times with the
same effect. This is straightforward under our assumption that all the objects are restored
to volatile memory. In the case of a database, which keeps its objects in permanent
storage, with a cache in volatile memory, some of the objects in permanent storage will
be out of date when a server is replaced after a crash. Therefore, its recovery manager
has to restore the objects in permanent storage. If it fails during recovery, the partially
restored objects will still be there. This makes idempotence a little harder to achieve.

Reorganizing the recovery file ◊ A recovery manager is responsible for reorganizing its
recovery file so as to make the process of recovery faster and to reduce its use of space.
If the recovery file is never reorganized, then the recovery process must search
backwards through the recovery file until it has found a value for each of its objects.
Conceptually, the only information required for recovery is a copy of the committed
versions of all the objects in the server. This would be the most compact form for the
recovery file. The name checkpointing is used to refer to the process of writing the
current committed values of a server's objects to a new recovery file, together with
transaction status entries and intentions lists of transactions that have not yet been fully
resolved (including information related to the two-phase commit protocol). The term
checkpoint is used to refer to the information stored by the checkpointing process. The
purpose of making checkpoints is to reduce the number of transactions to be dealt with
during recovery and to reclaim file space.
Checkpointing can be done immediately after recovery but before any new
transactions are started. However, recovery may not occur very often. Therefore,
checkpointing may need to be done from time to time during the normal activity of a
server. The checkpoint is written to a future recovery file, and the current recovery file
remains in use until the checkpoint is complete. Checkpointing consists of 'adding a
mark' to the recovery file when the checkpointing starts, writing the server's objects to
the future recovery file and then copying (1) entries before the mark that relate to as yet
unresolved transactions and (2) all entries after the mark in the recovery file to the future
recovery file. When the checkpoint is complete, the future recovery file becomes the
recovery file.
The recovery system can reduce its use of space by discarding the old recovery
file. When the recovery manager is carrying out the recovery process, it may encounter
a checkpoint in the recovery file. When this happens, it can restore immediately all
outstanding objects from the checkpoint.


Figure 13.20 Shadow versions

Map at start            Map when T commits
A -> P0                 A -> P1
B -> P0'                B -> P2
C -> P0''               C -> P0''

Version store:  P0    P0'   P0''  P1    P2    P3    P4
                100   200   300   80    220   278   242
                (P0, P0' and P0'' form the checkpoint)

13.6.2 Shadow versions


The logging technique records transaction status entries, intentions lists and objects all
in the same file - the log. The shadow versions technique is an alternative way to
organize a recovery file. It uses a map to locate versions of the server's objects in a file
called a version store. The map associates the identifiers of the server's objects with the
positions of their current versions in the version store. The versions written by each
transaction are shadows of the previous committed versions. The transaction status
entries and intentions lists are dealt with separately. Shadow versions are described first.
When a transaction is prepared to commit, any of the objects changed by the
transaction are appended to the version store, leaving the corresponding committed
versions unchanged. These new as yet tentative versions are called shadow versions.
When a transaction commits, a new map is made by copying the old map and entering
the positions of the shadow versions. To complete the commit process, the new map
replaces the old map.
To restore the objects when a server is replaced after a crash, its recovery manager
reads the map and uses the information in the map to locate the objects in the version
store.
This technique is illustrated with the same example involving transactions T and
U in Figure 13.20. The first column in the table shows the map before transactions T and
U, when the balances of the accounts A, B and C are $100, $200 and $300, respectively.
The second column shows the map after transaction T has committed.
The version store contains a checkpoint, followed by the versions of A and B at P1
and P2 made by transaction T. It also contains the shadow versions of B and C made by
transaction U, at P3 and P4.
The map must always be written to a well-known place (for example, at the start
of the version store or a separate file) so that it can be found when the system needs to
be recovered.
The switch from the old map to the new map must be performed in a single atomic
step. To achieve this it is essential that stable storage is used for the map - so that there
is guaranteed to be a valid map even when a file write operation fails. The shadow
versions method provides faster recovery than logging because the positions of the
current committed objects are recorded in the map, whereas recovery from a log requires
searching throughout the log for objects. Logging should be faster than shadow versions

during the normal activity of the system. This is because logging requires only a
sequence of append operations to the same file, whereas shadow versions requires an
additional stable storage write (involving two unrelated disk blocks).
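A rough sketch of the shadow versions technique follows (Python; hypothetical names, with the atomic switch of maps modelled simply by replacing a reference rather than by a write to stable storage).

class ShadowStore:
    """Shadow versions with an append-only version store and a committed map (sketch)."""

    def __init__(self, initial):
        self.version_store = list(initial.values())                 # append-only versions
        self.map = {obj: pos for pos, obj in enumerate(initial)}    # committed map
        self.shadow_positions = {}                                  # tid -> {obj: position}

    def prepare(self, tid, updates):
        # Append shadow versions, leaving the committed versions unchanged.
        for obj, value in updates.items():
            self.version_store.append(value)
            self.shadow_positions.setdefault(tid, {})[obj] = len(self.version_store) - 1

    def commit(self, tid):
        # Copy the old map, enter the shadow positions, then switch maps atomically.
        new_map = dict(self.map)
        new_map.update(self.shadow_positions.pop(tid))
        self.map = new_map        # must be a single atomic step on stable storage

    def read_committed(self, obj):
        return self.version_store[self.map[obj]]

if __name__ == "__main__":
    store = ShadowStore({"A": 100, "B": 200, "C": 300})
    store.prepare("T", {"A": 80, "B": 220})
    store.commit("T")
    print([store.read_committed(o) for o in "ABC"])   # [80, 220, 300]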
Shadow versions on their own are not sufficient for a server that handles
distributed transactions. Transaction status entries and intentions lists are saved in a file
called the transaction status file. Each intentions list represents the part of the map that
will be altered by a transaction when it commits. The transaction status file may, for
example, be organized as a log.
The figure below shows the map and the transaction status file for our current
example when T has committed and U is prepared to commit.

Map (in stable storage)        Transaction status file

A -> P1                        T: prepared      T: committed      U: prepared
B -> P2                           <A, P1>                            <B, P3>
C -> P0''                         <B, P2>                            <C, P4>

There is a chance that a server may crash between the time when a committed status is
written to the transaction status file and the time when the map is updated - in which
case the client will not have been acknowledged. The recovery manager must allow for
this possibility when the server is replaced after a crash, for example by checking
whether the map includes the effects of the last committed transaction in the transaction
status file. If it does not, then the latter should be marked as aborted.

13.6.3 The need for transaction status and intentions list entries in a recovery file
It is possible to design a simple recovery file that does not include entries for transaction
status items and intentions lists. This sort of recovery file may be suitable when all
transactions are directed to a single server. The use of transaction status items and
intentions lists in the recovery file is essential for a server that is intended to participate
in distributed transactions. This approach can also be useful for servers of non-
distributed transactions for various reasons, including the following:
• Some recovery managers are designed to write the objects to the recovery file
early - under the assumption that transactions normally commit.
• If transactions use a large number of big objects, the need to write them
contiguously to the recovery file may complicate the design of a server. When
objects are referenced from intentions lists, they can be found wherever they are.
• In timestamp ordering concurrency control, a server sometimes knows that a
transaction will eventually be able to commit and acknowledges the client - at this
time the objects are written to the recovery file (see Chapter 12) to ensure their
permanence. However, the transaction may have to wait to commit until earlier
transactions have committed. In such situations, the corresponding transaction
status entries in the recovery file will be waiting to commit and then committed to
ensure timestamp ordering of committed transactions in the recovery file. On
recovery, any waiting-to-commit transactions can be allowed to commit, because

Figure 13.21 Log with entries relating to two-phase commit protocol

Trans: T          Coord'r: T        Trans: T      Trans: U          Part'pant: U        Trans: U     Trans: U
prepared          participant       committed     prepared          coordinator: ...    uncertain    committed
intentions list   list: ...                       intentions list

the ones they were waiting for have either just committed or if not have to be
aborted due to failure of the server.

13.6.4 Recovery of the two-phase commit protocol


In a distributed transaction, each server keeps its own recovery file. The recovery
management described in the previous section must be extended to deal with any
transactions that are performing the two-phase commit protocol at the time when a
server fails. The recovery managers use two new status values: done, uncertain. These
status values are shown in Figure 13.6. A coordinator uses committed to indicate that the
outcome o f the vote is Yes and done to indicate that the two-phase commit protocol is
complete. A participant uses uncertain to indicate that it has voted Yes but does not yet
know the outcome. Two additional types of entry allow a coordinator to record a list of
participants and a participant to record its coordinator:

Type of entry    Description of contents of entry

Coordinator      Transaction identifier, list of participants

Participant      Transaction identifier, coordinator

In phase 1 of the protocol, when the coordinator is prepared to commit (and has already
added a prepared status entry to its recovery file), its recovery manager adds a
coordinator entry to its recovery file. Before a participant can vote Yes, it must have
already prepared to commit (and must have already added a prepared status entry to its
recovery file). When it votes Yes, its recovery manager records a participant entry and
adds an uncertain transaction status to its recovery file as a forced write. When a
participant votes No, it adds an abort transaction status to its recovery file.
In phase 2 of the protocol, the recovery manager of the coordinator adds either a
committed or an aborted transaction status to its recovery file, according to the decision.
This must be a forced write. Recovery managers of participants add a commit or abort
transaction status to their recovery files according to the message received from the
coordinator. When a coordinator has received a confirmation from all of its participants,
its recovery manager adds a done transaction status to its recovery file - this need not be
forced. The done status entry is not part of the protocol but is used when the recovery
file is reorganized. Figure 13.21 shows the entries in a log for transaction T, in which the
server played the coordinator role, and for transaction U, in which the server played the


Figure 13.22 Recovery of the two-phase commit protocol


Role           Status       Action of recovery manager

Coordinator prepared No decision had been reached before the server failed. It
sends abortTransaction to all the servers in the participant
list and adds the transaction status aborted in its recovery
file. Same action for state aborted. If there is no participant
list, the participants will eventually timeout and abort the
transaction.

Coordinator committed A decision to commit had been reached before the server
failed. It sends a doCommit to all of the participants in its
participant list (in case it had not done so before) and
resumes the two-phase protocol at step 4 (see Figure 13.5).

Participant committed The participant sends a haveCommitted message to the
coordinator (in case this was not done before it failed).
This will allow the coordinator to discard information
about this transaction at the next checkpoint.

Participant uncertain The participant failed before it knew the outcome of the
transaction. It cannot determine the status of the
transaction until the coordinator informs it of the decision.
It will send a getDecision to the coordinator to determine
the status of the transaction. When it receives the reply it
will commit or abort accordingly.

Participant prepared The participant has not yet voted and can abort the
transaction.

Coordinator done No action is required.

participant role. For both transactions, the prepared transaction status entry comes first.
In the case of a coordinator it is followed by a coordinator entry, and a committed
transaction status entry. The done transaction status entry is not shown in Figure 13.21.
In the case of a participant, the prepared transaction status entry is followed by a
participant entry whose state is uncertain and then a committed or aborted transaction
status entry.
When a server is replaced after a crash, the recovery manager has to deal with the
two-phase commit protocol in addition to restoring the objects. For any transaction
where the server has played the coordinator role, it should find a coordinator entry and
a set of transaction status entries. For any transaction where the server played the
participant role, it should find a participant entry and a set of transaction status entries.
In both cases, the most recent transaction status entry - that is, the one nearest the end
of the log - determines the transaction status at the time of failure. The action of the
recovery manager with respect to the two-phase commit protocol for any transaction

depends on whether the server was the coordinator or a participant and on its status at
the time of failure, as shown in Figure 13.22.
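The decision table of Figure 13.22 might be coded roughly as shown below (Python; the action strings merely summarize the steps described in the figure, and all names are hypothetical).

def recovery_action(role, status):
    """Map (role, most recent status entry in the log) to the recovery action (sketch)."""
    actions = {
        ("coordinator", "prepared"):  "send abortTransaction to all participants; log aborted",
        ("coordinator", "aborted"):   "send abortTransaction to all participants; log aborted",
        ("coordinator", "committed"): "resend doCommit to the participants; resume phase 2",
        ("coordinator", "done"):      "no action required",
        ("participant", "committed"): "send haveCommitted to the coordinator",
        ("participant", "uncertain"): "send getDecision to the coordinator; await the outcome",
        ("participant", "prepared"):  "abort the transaction (it has not yet voted)",
    }
    return actions[(role, status)]

if __name__ == "__main__":
    print(recovery_action("participant", "uncertain"))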

Reorganization of recovery file ◊ Care must be taken when performing a checkpoint to
ensure that coordinator entries of transactions without status done are not removed from
the recovery file. These entries must be retained until all the participants have confirmed
that they have completed their transactions. Entries with status done may be discarded.
Participant entries with transaction state uncertain must also be retained.

Recovery of nested transactions ◊ In the simplest case, each subtransaction of a nested
transaction accesses a different set of objects. As each participant prepares to commit
during the two-phase commit protocol, it writes its objects and intentions lists to the
local recovery file, associating them with the transaction identifier of the top-level
transaction. Although nested transactions use a special variant of the two-phase commit
protocol, the recovery manager uses the same transaction status values as for flat
transactions.
However, abort recovery is complicated by the fact that several subtransactions at
the same and different levels in the nesting hierarchy can access the same object. Section
12.4 describes a locking scheme in which parent transactions inherit locks and
subtransactions acquire locks from their parents. The locking scheme forces parent
transactions and subtransactions to access common data objects at different times and
ensures that accesses by concurrent subtransactions to the same objects must be
serialized.
Objects that are accessed according to the rules of nested transactions are made
recoverable by providing tentative versions for each subtransaction. The relationship
between the tentative versions of an object used by the subtransactions of a nested
transaction is similar to the relationship between the locks. To support recovery from
aborts, the server of an object shared by transactions at multiple levels provides a stack
of tentative versions - one for each nested transaction to use.
When the first subtransaction in a set of nested transactions accesses an object, it
is provided with a tentative version that is a copy of the current committed version of
the object. This is regarded as being at the top of the stack, but unless any other
subtransactions access the same object, the stack will not materialize.
When one of its subtransactions does access the same object, it copies the version
at the top of the stack and pushes the copy onto the stack. All of that subtransaction's
updates are applied to the tentative version at the top of the stack. When a subtransaction
provisionally commits, its parent inherits the new version. To achieve this, both the
subtransaction's version and its parent's version are discarded from the stack and then
the subtransaction's new version is pushed back on to the stack (effectively replacing its
parent's version). When a subtransaction aborts, its version at the top of the stack is
discarded. Eventually, when the top-level transaction commits, the version at the top of
the stack (if any) becomes the new committed version.
For example, in Figure 13.23, suppose that transactions T1, T11, T12 and T2 all
access the same object, A, in the order T1; T11; T12; T2. Suppose that their tentative
versions are called A1, A11, A12 and A2. When T1 starts executing, A1 is based on the
committed version of A and is pushed on the stack. When T11 starts executing, it bases
its version A11 on A1 and pushes it on the stack; when it completes, it replaces its parent's

Figure 13.23 Nested transactions (the stack of tentative versions of A at each stage, with the top of the stack marked)

version on the stack. Transactions T12 and T2 act in a similar way, finally leaving the
result of T2 at the top of the stack.
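The stack of tentative versions could be sketched as follows (Python; hypothetical names, a single object and no locking), with provisional commit replacing the parent's version and abort simply popping the top of the stack.

class VersionStack:
    """Stack of tentative versions of one object shared by nested transactions (sketch)."""

    def __init__(self, committed_value):
        self.committed = committed_value
        self.stack = []                      # entries are (transaction id, tentative value)

    def access(self, tid):
        # The first access by a subtransaction copies the version below it
        # (or the committed version) and pushes the copy.
        base = self.stack[-1][1] if self.stack else self.committed
        self.stack.append((tid, base))
        return base

    def update(self, tid, value):
        assert self.stack and self.stack[-1][0] == tid
        self.stack[-1] = (tid, value)

    def provisional_commit(self, tid, parent):
        # The parent inherits the new version: discard both and push it back for the parent.
        _, value = self.stack.pop()
        if self.stack and self.stack[-1][0] == parent:
            self.stack.pop()
        self.stack.append((parent, value))

    def abort(self, tid):
        self.stack.pop()

    def top_level_commit(self):
        if self.stack:
            self.committed = self.stack.pop()[1]
        return self.committed

if __name__ == "__main__":
    a = VersionStack(committed_value=100)
    a.access("T1");  a.update("T1", 110)        # A1 based on the committed version of A
    a.access("T11"); a.update("T11", 120)       # A11 based on A1
    a.provisional_commit("T11", parent="T1")    # A11 replaces A1 on the stack
    print(a.top_level_commit())                 # 120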

13.7 Summary

In the most general case, a client's transaction will request operations on objects in
several different servers. A distributed transaction is any transaction whose activity
involves several different servers. A nested transaction structure may be used to allow
additional concurrency and independent committing by the servers in a distributed
transaction.
The atomicity property of transactions requires that the servers participating in a
distributed transaction either all commit it or all abort it. Atomic commit protocols are
designed to achieve this effect, even if servers crash during their execution. The two-
phase commit protocol allows a server to decide to abort unilaterally. It includes timeout
actions to deal with delays due to servers crashing. The two-phase commit protocol can
take an unbounded amount of time to complete but is guaranteed to complete eventually.
Concurrency control in distributed transactions is modular - each server is
responsible for the serializability of transactions that access its own objects. However,
additional protocols are required to ensure that transactions are serializable globally.
Distributed transactions that use timestamp ordering require a means of generating an
agreed timestamp ordering between the multiple servers. Those that use optimistic
concurrency control require global validation or a means of forcing a global ordering on
committing transactions.
Distributed transactions that use two-phase locking can suffer from distributed
deadlocks. The aim of distributed deadlock detection is to look for cycles in the global
wait-for graph. If a cycle is found, one or more transactions must be aborted to resolve
the deadlock. Edge chasing is a non-centralized approach to the detection of distributed
deadlocks.
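Since detection amounts to looking for cycles in the wait-for graph, a minimal sketch may help; the following Java fragment (the graph representation and the use of transaction names as strings are assumptions of this sketch) performs a depth-first search for a cycle, which is the core check behind both a centralized detector and edge chasing.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: detect a cycle in a wait-for graph, where an edge
// T -> U means transaction T waits for a lock held by transaction U.
class WaitForGraph {
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    void addEdge(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    boolean hasDeadlock() {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String t : waitsFor.keySet()) {
            if (dfs(t, visited, onPath)) return true;
        }
        return false;
    }

    private boolean dfs(String t, Set<String> visited, Set<String> onPath) {
        if (onPath.contains(t)) return true;          // found a cycle: deadlock
        if (!visited.add(t)) return false;            // already fully explored
        onPath.add(t);
        for (String next : waitsFor.getOrDefault(t, Set.of())) {
            if (dfs(next, visited, onPath)) return true;
        }
        onPath.remove(t);
        return false;
    }
}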
Transaction-based applications have strong requirements for the long life and
integrity of the information stored, but they do not usually have requirements for
immediate response at all times. Atomic commit protocols are the key to distributed
transactions, but they cannot be guaranteed to complete within a particular time limit.
Transactions are made durable by performing checkpoints and logging in a recovery
file, which is used for recovery when a server is replaced after a crash. Users of a
transaction service would experience some delay during recovery. Although it is


assumed that the servers of distributed transactions exhibit crash failures and run in an
asynchronous system, they are able to reach consensus about the outcome of
transactions because crashed servers are replaced with new processes that can acquire
all the relevant information from permanent storage or from other servers.

EXERCISES

13.1 In a decentralized variant of the two-phase commit protocol the participants communicate directly with one another instead of indirectly via the coordinator. In phase 1, the coordinator sends its vote to all the participants. In phase 2, if the coordinator's vote is No, the participants just abort the transaction; if it is Yes, each participant sends its vote to the coordinator and the other participants, each of which decides on the outcome according to the vote and carries it out. Calculate the number of messages and the number of rounds it takes. What are its advantages or disadvantages in comparison with the centralized variant? page 520
13.2 A three-phase commit protocol has the following parts:
Phase 1: is the same as for two-phase commit.
Phase 2: the coordinator collects the votes and makes a decision; if it is No, it aborts and informs participants that voted Yes; if the decision is Yes, it sends a preCommit request to all the participants. Participants that voted Yes wait for a preCommit or doAbort request. They acknowledge preCommit requests and carry out doAbort requests.
Phase 3: the coordinator collects the acknowledgments. When all are received, it Commits and sends doCommit to the participants. Participants wait for a doCommit request. When it arrives they Commit.
Explain how this protocol avoids delay to participants during their 'uncertain' period due to the failure of the coordinator or other participants. Assume that communication does not fail. page 523
13.3 Explain how the two-phase commit protocol for nested transactions ensures that if the
top-level transaction commits, all the right descendants are committed or aborted.
page 524
13.4 Give an example of the interleavings of two transactions that is serially equivalent at each server but is not serially equivalent globally. page 528
13.5 The getDecision procedure defined in Figure 13.4 is provided only by coordinators. Define a new version of getDecision to be provided by participants for use by other participants that need to obtain a decision when the coordinator is unavailable. Assume that any active participant can make a getDecision request to any other active participant. Does this solve the problem of delay during the 'uncertain' period? Explain your answer. At what point in the two-phase commit protocol would the coordinator inform the participants of the other participants' identities (to enable this communication)? page 520


13.6 Extend the definition of two-phase locking to apply to distributed transactions. Explain
how this is ensured by distributed transactions using strict two-phase locking locally.
page 528 and Chapter 12
13.7 Assuming that strict two-phase locking is in use, describe how the actions of the two-
phase commit protocol relate to the concurrency control actions of each individual
server. How does distributed deadlock detection fit in? pages 520 and 528
13.8 A server uses timestamp ordering for local concurrency control. What changes must be
made to adapt it for use with distributed transactions? Under what conditions could it be
argued that the two-phase commit protocol is redundant with timestamp ordering?
pages 520 and 529
13.9 Consider distributed optimistic concurrency control in which each server performs local
backward validation sequentially (that is, with only one transaction in the validate and
update phase at one time), in relation to your answer to Exercise 13.4. Describe the
possible outcomes when the two transactions attempt to commit. What difference does
it make if the servers use parallel validation? Chapter 12 and page 530
13.10 A centralized global deadlock detector holds the union of local wait-for graphs. Give an
example to explain how a phantom deadlock could be detected if a waiting transaction
in a deadlock cycle aborts during the deadlock detection procedure. page 533
13.11 Consider the edge-chasing algorithm (without priorities). Give examples to show that it
could detect phantom deadlocks. page 534
13.12 A server manages the objects a1, a2, ..., an. It provides two operations for its clients:
Read(i) returns the value of ai
Write(i, Value) assigns Value to ai
The transactions T, U and V are defined as follows:
T: x = Read(i); Write(j, 44);
U: Write(i, 55); Write(j, 66);
V: Write(k, 77); Write(k, 88);
Describe the information written to the log file on behalf of these three transactions if strict two-phase locking is in use and U acquires ai and aj before T. Describe how the recovery manager would use this information to recover the effects of T, U and V when the server is replaced after a crash. What is the significance of the order of the commit entries in the log file? pages 540-542
13.13 The appending of an entry to the log file is atomic, but append operations from different transactions may be interleaved. How does this affect the answer to Exercise 13.12? pages 540-542


13.14 The transactions T, U and V of Exercise 13.12 use strict two-phase locking and their requests are interleaved as follows:

T                  U                  V
x = Read(i);
                                      Write(k, 77);
                   Write(i, 55);
Write(j, 44);
                                      Write(k, 88);
                   Write(j, 66);

Assuming that the recovery manager appends the data entry corresponding to each Write operation to the log file immediately instead of waiting until the end of the transaction, describe the information written to the log file on behalf of the transactions T, U and V. Does early writing affect the correctness of the recovery procedure? What are the advantages and disadvantages of early writing? pages 540-542
13.15 Transactions T and U are run with timestamp ordering concurrency control. Describe the information written to the log file on behalf of T and U, allowing for the fact that U has a later timestamp than T and must wait to commit after T. Why is it essential that the commit entries in the log file be ordered by timestamps? Describe the effect of recovery if the server crashes (i) between the two Commits and (ii) after both of them.

T                  U
x = Read(i);
                   Write(i, 55);
                   Write(j, 66);
Write(j, 44);
Commit
                   Commit

What are the advantages and disadvantages of early writing with timestamp ordering? page 545
13.16 The transactions T and U in Exercise 13.15 are run with optimistic concurrency control
using backward validation and restarting any transactions that fail. Describe the
information written to the log file on their behalf. Why is it essential that the commit
entries in the log file be ordered by transaction numbers? How are the write sets of
committed transactions represented in the log file? pages 540-542
13.17 Suppose that the coordinator of a transaction crashes after it has recorded the intentions
list entry but before it has recorded the participant list or sent out the canCommit?
requests. Describe how the participants resolve the situation. What will the coordinator
do when it recovers? Would it be any better to record the participant list before the
intentions list entry? page 546
Implementing Fault-Tolerant Services Using the State Machine
Approach: A Tutorial
FRED B. SCHNEIDER
Department of Computer Science, Cornell University, Ithaca, New York 14853

The state machine approach is a general method for implementing fault-tolerant services
in distributed systems. This paper reviews the approach and describes protocols for two
different failure models—Byzantine and fail stop. System reconfiguration techniques for
removing faulty components and integrating repaired components are also discussed.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—network operating systems; D.2.10 [Software Engineering]: Design—methodologies; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.7 [Operating Systems]: Organization and Design—interactive systems, real-time systems

General Terms: Algorithms, Design, Reliability


Additional Key Words and Phrases: Client-server, distributed services, state machine
approach

INTRODUCTION

Distributed software is often structured in terms of clients and services. Each service comprises one or more servers and exports operations that clients invoke by making requests. Although using a single, centralized, server is the simplest way to implement a service, the resulting service can only be as fault tolerant as the processor executing that server. If this level of fault tolerance is unacceptable, then multiple servers that fail independently must be used. Usually, replicas of a single server are executed on separate processors of a distributed system, and protocols are used to coordinate client interactions with these replicas. The physical and electrical isolation of processors in a distributed system ensures that server failures are independent, as required.

The state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas.¹ The approach also provides a framework for understanding and designing replication management protocols. Many protocols that involve replication of data or software—be it for masking failures or simply to facilitate cooperation without centralized control—can be derived using the state machine approach. Although few of these protocols actually were obtained in this manner, viewing them in terms of state machines helps in understanding how and why they work.

This paper is a tutorial on the state machine approach. It describes the approach and its implementation for two representative environments. Small examples suffice to illustrate the points. However, the

1 The term "state machine" is a poor one, but, nevertheless, is the one used in the literature.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
© 1990 ACM 0360-0300/90/1200-0299 $01.50

CONTENTS

INTRODUCTION
1. STATE MACHINES
2. FAULT TOLERANCE
3. FAULT-TOLERANT STATE MACHINES
   3.1 Agreement
   3.2 Order and Stability
4. TOLERATING FAULTY OUTPUT DEVICES
   4.1 Outputs Used Outside the System
   4.2 Outputs Used Inside the System
5. TOLERATING FAULTY CLIENTS
   5.1 Replicating the Client
   5.2 Defensive Programming
6. USING TIME TO MAKE REQUESTS
7. RECONFIGURATION
   7.1 Managing the Configuration
   7.2 Integrating a Repaired Object
8. RELATED WORK
ACKNOWLEDGMENTS
REFERENCES

approach has been successfully applied to larger examples; some of these are mentioned in Section 8. Section 1 describes how a system can be viewed in terms of a state machine, clients, and output devices. Coping with failures is the subject of Sections 2 to 5. An important class of optimizations—based on the use of time—is discussed in Section 6. Section 7 describes dynamic reconfiguration. The history of the approach and related work are discussed in Section 8.

1. STATE MACHINES

Services, servers, and most programming language structures for supporting modularity define state machines. A state machine consists of state variables, which encode its state, and commands, which transform its state. Each command is implemented by a deterministic program; execution of the command is atomic with respect to other commands and modifies the state variables and/or produces some output. A client of the state machine makes a request to execute a command. The request names a state machine, names the command to be performed, and contains any information needed by the command. Output from request processing can be to an actuator (e.g., in a process-control system), to some other peripheral device (e.g., a disk or terminal), or to clients awaiting responses from prior requests.

In this tutorial, we will describe a state machine simply by listing its state variables and commands. As an example, state machine memory of Figure 1 implements a time-varying mapping from locations to values. A read command permits a client to determine the value currently associated with a location, and a write command associates a new value with a location.

For generality, our descriptions of state machines deliberately do not specify how command invocation is implemented. Commands might be implemented in any of the following ways:

• Using a collection of procedures that share data and are invoked by a call, as in a monitor.
• Using a single process that awaits messages containing requests and performs the actions they specify, as in a server.
• Using a collection of interrupt handlers, in which case a request is made by causing an interrupt, as in an operating system kernel. (Disabling interrupts permits each command to be executed to completion before the next is started.)

For example, the state machine of Figure 2 implements commands to ensure that at all times at most one client has been granted access to some resource. In it, x°y denotes the result of appending y to the end of list x, head(x) denotes the first element of list x, and tail(x) denotes the list obtained by deleting the first element of list x. This state machine would probably be implemented as part of the supervisor-call handler of an operating system kernel.

Requests are processed by a state machine one at a time, in an order that is consistent with potential causality. Therefore, clients of a state machine can make the following assumptions about the order in which requests are processed:

O1: Requests issued by a single client to a given state machine sm are processed by sm in the order they were issued.

O2: If the fact that request r was made to a state machine sm by client c could have caused a request r' to be made by a client c' to sm, then sm processes r before r'.

Note that due to communications network delays, O1 and O2 do not imply that a state machine will process requests in the order made or in the order received.

memory: state_machine
  var store: array[0..n] of word

  read: command(loc: 0..n)
    send store[loc] to client
  end read;

  write: command(loc: 0..n, value: word)
    store[loc] := value
  end write
end memory

Figure 1. A memory.

mutex: state_machine
  var user: client_id init Φ
      waiting: list of client_id init Φ

  acquire: command
    if user = Φ → send OK to client;
                  user := client
    □ user ≠ Φ → waiting := waiting ° client
    fi
  end acquire

  release: command
    if waiting = Φ → user := Φ
    □ waiting ≠ Φ → send OK to head(waiting);
                    user := head(waiting);
                    waiting := tail(waiting)
    fi
  end release
end mutex

Figure 2. A resource allocator.

To keep our presentation independent of the interprocess communication mechanism used to transmit requests to state machines, we will program client requests as tuples of the form

⟨state_machine.command, arguments⟩

and postulate that any results from processing a request are returned using messages. For example, a client might execute

⟨memory.write, 100, 16.2⟩;
⟨memory.read, 100⟩;
receive v from memory

to set the value of location 100 to 16.2, request the value of location 100, and await that value, setting v to it upon receipt.

The defining characteristic of a state machine is not its syntax but that it specifies a deterministic computation that reads a stream of requests and processes each, occasionally producing output:

Semantic Characterization of a State Machine. Outputs of a state machine are completely determined by the sequence of requests it processes, independent of time and any other activity in a system.

Not all collections of commands necessarily satisfy this characterization. Consider the following program to solve a simple process-control problem in which an actuator is adjusted repeatedly based on the value of a sensor. Periodically, a client reads a sensor, communicates the value read to state machine pc, and delays approximately D seconds:

monitor: process
  do true → val := sensor;
            ⟨pc.adjust, val⟩;
            delay D
  od
end monitor

State machine pc adjusts an actuator based on past adjustments saved in state variable q, the sensor reading, and a control function F:

pc: state_machine
  var q: real;

  adjust: command(sensor_val: real)
    q := F(q, sensor_val);
    send q to actuator
  end adjust
end pc
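As a concrete illustration of the "single process that awaits messages containing requests" implementation style listed above, the following Java sketch realizes the memory state machine of Figure 1 as a server loop. The Request record, the reply queue, and the fixed store size are assumptions made for this sketch, not part of the paper.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the memory state machine of Figure 1 as a single server process
// that awaits request messages; the request shape and reply channel are
// illustrative assumptions.
class MemoryStateMachine implements Runnable {
    record Request(String command, int loc, int value, BlockingQueue<Integer> replyTo) {}

    private final int[] store = new int[1024];                       // state variable
    private final BlockingQueue<Request> requests = new LinkedBlockingQueue<>();

    void submit(Request r) { requests.add(r); }

    @Override
    public void run() {
        while (true) {
            try {
                Request r = requests.take();                         // one request at a time
                switch (r.command()) {
                    case "read"  -> r.replyTo().add(store[r.loc()]); // send store[loc] to client
                    case "write" -> store[r.loc()] = r.value();      // store[loc] := value
                    default      -> { /* ignore unknown commands */ }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}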

Although it is tempting to structure pc as a single command that loops—reading from the sensor, evaluating F, and writing to actuator—if the value of the sensor is time varying, then the result would not satisfy the semantic characterization given above and therefore would not be a state machine. This is because values sent to actuator (the output of the state machine) would not depend solely on the requests made to the state machine but would, in addition, depend on the execution speed of the loop. In the structure used above, this problem has been avoided by moving the loop into monitor.

In practice, having to structure a system in terms of state machines and clients does not constitute a real restriction. Anything that can be structured in terms of procedures and procedure calls can also be structured using state machines and clients—a state machine implements the procedure, and requests implement the procedure calls. In fact, state machines permit more flexibility in system structure than is usually available with procedure calls. With state machines, a client making a request is not delayed until that request is processed, and the output of a request can be sent someplace other than to the client making the request. We have not yet encountered an application that could not be programmed cleanly in terms of state machines and clients.

2. FAULT TOLERANCE

Before turning to the implementation of fault-tolerant state machines, we must introduce some terminology concerning failures. A component is considered faulty once its behavior is no longer consistent with its specification. In this paper, we consider two representative classes of faulty behavior:

Byzantine Failures. The component can exhibit arbitrary and malicious behavior, perhaps involving collusion with other faulty components [Lamport et al. 1982].

Fail-stop Failures. In response to a failure, the component changes to a state that permits other components to detect that a failure has occurred and then stops [Schneider 1984].

Byzantine failures can be the most disruptive, and there is anecdotal evidence that such failures do occur in practice. Allowing Byzantine failures is the weakest possible assumption that could be made about the effects of a failure. Since a design based on assumptions about the behavior of faulty components runs the risk of failing if these assumptions are not satisfied, it is prudent that life-critical systems tolerate Byzantine failures. For most applications, however, it suffices to assume fail-stop failures.

A system consisting of a set of distinct components is t fault tolerant if it satisfies its specification provided that no more than t of those components become faulty during some interval of interest.² Fault tolerance traditionally has been specified in terms of mean time between failures (MTBF), probability of failure over a given interval, and other statistical measures [Siewiorek and Swarz 1982]. Although it is clear that such characterizations are important to the users of a system, there are advantages in describing fault tolerance of a system in terms of the maximum number of component failures that can be tolerated over some interval of interest. Asserting that a system is t fault tolerant makes explicit the assumptions required for correct operation; MTBF and other statistical measures do not. Moreover, t fault tolerance is unrelated to the reliability of the components that make up the system and therefore is a measure of the fault tolerance supported by the system architecture, in contrast to fault tolerance achieved simply by using reliable components. MTBF and other statistical reliability measures of a t fault-tolerant system can be derived from statistical reliability measures for the components used in constructing that system—in particular, the probability that there will be t or more failures during the operating interval of interest. Thus, t is typically chosen based on statistical measures of component reliability.

2 A t fault-tolerant system might continue to operate correctly if more than t failures occur, but correct operation cannot be guaranteed.

3. FAULT-TOLERANT STATE MACHINES

A t fault-tolerant version of a state machine can be implemented by replicating that

state machine and running a replica on each of the processors in a distributed system. Provided each replica being run by a nonfaulty processor starts in the same initial state and executes the same requests in the same order, then each will do the same thing and produce the same output. Thus, if we assume that each failure can affect at most one processor, hence one state machine replica, then by combining the output of the state machine replicas of this ensemble, we can obtain the output for the t fault-tolerant state machine.

When processors can experience Byzantine failures, an ensemble implementing a t fault-tolerant state machine must have at least 2t + 1 replicas, and the output of the ensemble is the output produced by the majority of the replicas. This is because with 2t + 1 replicas, the majority of the outputs remain correct even after as many as t failures. If processors experience only fail-stop failures, then an ensemble containing t + 1 replicas suffices, and the output of the ensemble can be the output produced by any of its members. This is because only correct outputs are produced by fail-stop processors, and after t failures one nonfaulty replica will remain among the t + 1 replicas.

The key, then, for implementing a t fault-tolerant state machine is to ensure the following:

Replica Coordination. All replicas receive and process the same sequence of requests.

This can be decomposed into two requirements concerning dissemination of requests to replicas in an ensemble.

Agreement. Every nonfaulty state machine replica receives every request.

Order. Every nonfaulty state machine replica processes the requests it receives in the same relative order.

Notice that Agreement governs the behavior of a client in interacting with state machine replicas and that Order governs the behavior of a state machine replica with respect to requests from various clients. Thus, although Replica Coordination could be partitioned in other ways, the Agreement-Order partitioning is a natural choice because it corresponds to the existing separation of the client from the state machine replicas.

Implementations of Agreement and Order are discussed in Sections 3.1 and 3.2. These implementations make no assumptions about clients or commands. Although this generality is useful, knowledge of commands allows Replica Coordination, hence Agreement and Order, to be weakened and thus allows cheaper protocols to be used for managing the replicas in an ensemble. Examples of two common weakenings follow.

First, Agreement can be relaxed for read-only requests when fail-stop processors are being assumed. When processors are fail stop, a request r whose processing does not modify state variables need only be sent to a single nonfaulty state machine replica. This is because the response from this replica is—by definition—guaranteed to be correct and because r changes no state variables, the state of the replica that processes r will remain identical to the states of replicas that do not.

Second, Order can be relaxed for requests that commute. Two requests r and r' commute in a state machine sm if the sequence of outputs and final state of sm that would result from processing r followed by r' is the same as would result from processing r' followed by r. An example of a state machine where Order can be relaxed appears in Figure 3. State machine tally determines which from among a set of alternatives receives at least MAJ votes and sends this choice to SYSTEM. If clients cannot vote more than once and the number of clients Cno satisfies 2MAJ > Cno, then every request commutes with every other. Thus, implementing Order would be unnecessary—different replicas of the state machine will produce the same outputs even if they process requests in different orders. On the other hand, if clients can vote more than once or 2MAJ < Cno, then reordering requests might change the outcome of the election.
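The replica counts above translate directly into how an ensemble's outputs are combined. The following Java sketch is illustrative only (the generic Output type and equality-based voting are assumptions): under Byzantine failures a value backed by at least t + 1 of the 2t + 1 outputs is taken, while under fail-stop failures any single output can be used.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of combining an ensemble's outputs, following the replica counts
// discussed above; the Output type and voting by equality are assumptions.
class OutputCombiner<Output> {
    // Byzantine failures: with 2t + 1 replicas, take the value produced by a
    // majority (at least t + 1) of the replica outputs.
    Output combineByzantine(List<Output> outputs, int t) {
        Map<Output, Integer> counts = new HashMap<>();
        for (Output o : outputs) {
            int c = counts.merge(o, 1, Integer::sum);
            if (c >= t + 1) return o;
        }
        throw new IllegalStateException("no majority: more than t replicas appear faulty");
    }

    // Fail-stop failures: any output produced by a replica is correct, so the
    // first response from any of the t + 1 replicas can be used.
    Output combineFailStop(List<Output> outputs) {
        return outputs.get(0);
    }
}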

tally: state_machine
  var votes: array[candidate] of integer init 0

  cast_vote: command(choice: candidate)
    votes[choice] := votes[choice] + 1;
    if votes[choice] ≥ MAJ → send choice to SYSTEM;
                             halt
    □ votes[choice] < MAJ → skip
    fi
  end cast_vote
end tally

Figure 3. Election.

Theories for constructing state machine ensembles that do not satisfy Replica Coordination are proposed in Aizikowitz [1989] and Mancini and Pappalardo [1988]. Both theories are based on proving that an ensemble of state machines implements the same specification as a single replica does. The approach taken in Aizikowitz [1989] uses temporal logic descriptions of state sequences, whereas the approach in Mancini and Pappalardo [1988] uses an algebra of action sequences. A detailed description of this work is beyond the scope of this tutorial.

3.1 Agreement

The Agreement requirement can be satisfied by using any protocol that allows a designated processor, called the transmitter, to disseminate a value to some other processors in such a way that

IC1: All nonfaulty processors agree on the same value.

IC2: If the transmitter is nonfaulty, then all nonfaulty processors use its value as the one on which they agree.

Protocols to establish IC1 and IC2 have received considerable attention in the literature and are sometimes called Byzantine Agreement protocols, reliable broadcast protocols, or simply agreement protocols. The hard part in designing such protocols is coping with a transmitter that fails part way through an execution. See Strong and Dolev [1983] for protocols that can tolerate Byzantine processor failures and Schneider et al. [1984] for a (significantly cheaper) protocol that can tolerate (only) fail-stop processor failures.

If requests are distributed to all state machine replicas by using a protocol that satisfies IC1 and IC2, then the Agreement requirement is satisfied. Either the client can serve as the transmitter or the client can send its request to a single state machine replica and let that replica serve as the transmitter. When the client does not itself serve as the transmitter, however, the client must ensure that its request is not lost or corrupted by the transmitter before the request is disseminated to the state machine replicas. One way to monitor for such corruption is by having the client be among the processors that receive the request from the transmitter.

3.2 Order and Stability

The Order requirement can be satisfied by assigning unique identifiers to requests and having state machine replicas process requests according to a total ordering relation on these unique identifiers. This is equivalent to requiring the following, where a request is defined to be stable at smi once no request from a correct client and bearing a lower unique identifier can be subsequently delivered to state machine replica smi:

Order Implementation. A replica next processes the stable request with the smallest unique identifier.

Further refinement of Order Implementation requires selecting a method for assigning unique identifiers to requests and devising a stability test for that assignment method. Note that any method for assigning unique identifiers is constrained by O1 and O2 of Section 1, which imply that if request ri could have caused request rj to be made then uid(ri) < uid(rj) holds, where uid(r) is the unique identifier assigned to a request r.

In the sections that follow, we give three refinements of the Order Implementation. Two are based on the use of clocks; a third uses an ordering defined by the replicas of the ensemble.
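Before the individual identifier schemes are presented, a minimal sketch of the Order Implementation itself may be helpful. In the following Java fragment (the numeric identifier type and the pluggable stability predicate are assumptions of this sketch), a replica buffers delivered requests and repeatedly executes the stable request with the smallest unique identifier.

import java.util.PriorityQueue;
import java.util.function.Predicate;

// Sketch of the Order Implementation: a replica processes the stable request
// with the smallest unique identifier. The stability predicate is left
// abstract; each stability test described below is one way to realize it.
class OrderedReplica {
    record PendingRequest(double uid, Runnable command) {}

    private final PriorityQueue<PendingRequest> pending =
        new PriorityQueue<>((a, b) -> Double.compare(a.uid(), b.uid()));

    void deliver(PendingRequest r) { pending.add(r); }

    // isStable(r) must guarantee that no request with a smaller uid can still
    // be delivered from a correct client.
    void processStableRequests(Predicate<PendingRequest> isStable) {
        while (!pending.isEmpty() && isStable.test(pending.peek())) {
            pending.poll().command().run();   // next stable request, smallest uid first
        }
    }
}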

3.2.1 Using Logical Clocks

A logical clock [Lamport 1978a] is a mapping T from events to the integers. T(e), the "time" assigned to an event e by logical clock T, is an integer such that for any two distinct events e and e', either T(e) < T(e') or T(e) > T(e'), and if e might be responsible for causing e' then T(e) < T(e'). It is a simple matter to implement logical clocks in a distributed system. Associated with each process p is a counter Tp. In addition, a timestamp is included in each message sent by p. This timestamp is the value of Tp when that message is sent. Tp is updated according to the following:

LC1: Tp is incremented after each event at p.

LC2: Upon receipt of a message with timestamp τ, process p resets Tp:  Tp := max(Tp, τ) + 1.
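A direct transcription of LC1 and LC2 into code may make the update rules concrete. The sketch below is in Java; the way the processor identifier is packed into the unique identifier is an assumption of this sketch, not of the paper.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a logical clock following LC1 and LC2.
class LogicalClock {
    private final AtomicLong tp = new AtomicLong(0);  // counter Tp of process p
    private final int processorId;                    // appended to make T(e) unique

    LogicalClock(int processorId) { this.processorId = processorId; }

    // LC1: Tp is incremented after each event at p.
    long localEvent() { return tp.incrementAndGet(); }

    // LC2: on receipt of a message with timestamp tau, Tp := max(Tp, tau) + 1.
    long onReceive(long tau) {
        return tp.updateAndGet(cur -> Math.max(cur, tau) + 1);
    }

    // T(e): the counter value with a fixed-length processor id appended
    // (here by packing both into one long).
    long uniqueIdentifier() { return (tp.get() << 16) | (processorId & 0xFFFF); }
}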
The value of T(e) for an event e that occurs at processor p is constructed by appending a fixed-length bit string that uniquely identifies p to the value of Tp when e occurs. Figure 4 illustrates the use of this scheme for implementing logical clocks in a system of three processors, p, q, and r. Events are depicted by dots, and an arrow is drawn between events e and e' if e might be responsible for causing event e'. For example, an arrow between events in different processes starts from the event corresponding to the sending of a message and ends at the event corresponding to the receipt of that message. The value of Tp(e) for each event e is written above that event.

Figure 4. Logical clock example. (The figure shows event timelines for processors p, q, and r, with the logical clock value of each event written above it.)

If T(e) is used as the unique identifier associated with a request whose issuance corresponds to event e, the result is a total ordering on the unique identifiers that satisfies O1 and O2. Thus, a logical clock can be used as the basis of an Order Implementation if we can formulate a way to determine when a request is stable at a state machine replica.

It is pointless to implement a stability test in a system in which Byzantine failures are possible and a processor or message can be delayed for an arbitrary length of time without being considered faulty. This is because no deterministic protocol can implement agreement under these conditions [Fischer et al. 1985].³ Since it is impossible to satisfy the Agreement requirement, there is no point in satisfying the Order requirement. The case in which relative speeds of nonfaulty processors and messages is bounded is equivalent to assuming that they have synchronized real-time clocks and will be considered shortly. This leaves the case in which fail-stop failures are possible and a process or message can be delayed for an arbitrary length of time without being considered faulty. Thus, we now turn to devising a stability test for that environment.

3 The result of Fischer et al. [1985] is actually stronger than this. It states that IC1 and IC2 cannot be achieved by a deterministic protocol in an asynchronous system with a single processor that fails in an even less restrictive manner—by simply halting.

By attaching sequence numbers to the messages between every pair of processors, it is trivial to ensure the following property holds of communications channels:

FIFO Channels. Messages between a pair of processors are delivered in the order sent.

For fail-stop processors, we can also assume the following:

Failure Detection Assumption. A processor p detects that a fail-stop processor q has failed only after p has received the last message sent to p by q.

The Failure Detection Assumption is consistent with FIFO Channels, since the failure event for a fail-stop processor necessarily happens after the last message sent by the processor and, therefore, should be received after all other messages.

Under these two assumptions, the following stability test can be used:

Logical Clock Stability Test Tolerating Fail-stop Failures. Every client

periodically makes some—possibly null—request to the state machine. A request is stable at replica smi if a request with larger timestamp has been received by smi from every client running on a nonfaulty processor.

To see why this stability test works, we show that once a request r is stable at smi, no request with smaller unique identifier (timestamp) will be received. First, consider clients that smi does not detect as being faulty. Because logical clocks are used to generate unique identifiers, any request made by a client c must have a larger unique identifier than was assigned to any previous request made by c. Therefore, from the FIFO Channels assumption, we conclude that once a request from a nonfaulty client c is received by smi, no request from c with a smaller unique identifier than uid(r) can be received by smi. This means that once requests with larger unique identifiers than uid(r) have been received from every nonfaulty client, it is not possible to receive a request with a smaller unique identifier than uid(r) from these clients. Next, for a client c that smi detects as faulty, the Failure Detection Assumption implies that no request from c will be received by smi. Thus, once a request r is stable at smi, no request with a smaller timestamp can be received from a client—faulty or nonfaulty.

3.2.2 Synchronized Real-Time Clocks

A second way to produce unique request identifiers satisfying O1 and O2 is by using approximately synchronized real-time clocks.⁴ Define Tp(e) to be the value of the real-time clock at processor p when event e occurs. We can use Tp(e) followed by a fixed-length bit string that uniquely identifies p as the unique identifier associated with a request made as event e by a client running on a processor p. To ensure that O1 and O2 (of Section 1) hold for unique identifiers generated in this manner, two restrictions are required. O1 follows provided no client makes two or more requests between successive clock ticks. Thus, if processor clocks have a resolution of R seconds, then each client can make at most one request every R seconds. O2 follows provided the degree of clock synchronization is better than the minimum message delivery time. In particular, if clocks on different processors are synchronized to within δ seconds, then it must take more than δ seconds for a message from one client to reach another. Otherwise, O2 would be violated because a request r made by the one client could have a unique identifier that was smaller than a request r' made by another, even though r was caused by a message sent after r' was made.

When unique request identifiers are obtained from synchronized real-time clocks, a stability test can be implemented by exploiting these clocks and the bounds on message delivery delays. Define Δ to be a constant such that a request r with unique identifier uid(r) will be received by every correct processor no later than time uid(r) + Δ according to the local clock at the receiving processor. Such a Δ must exist if requests are disseminated using a protocol that employs a fixed number of rounds, like the ones cited above for establishing IC1 and IC2.⁵ By definition, once the clock on a processor p reaches time τ, p cannot subsequently receive a request r such that uid(r) < τ − Δ. Therefore, we have the following stability test:

4 A number of protocols to achieve clock synchronization while tolerating Byzantine failures have been proposed [Halpern et al. 1984; Lamport and Melliar-Smith 1984]. See Schneider [1986] for a survey. The protocols all require that known bounds exist for the execution speed and clock rates of nonfaulty processors and for message delivery delays along nonfaulty communications links. In practice, these requirements do not constitute a restriction. Clock synchronization achieved by the protocols is proportional to the variance in message delivery delay, making it possible to satisfy the restriction—necessary to ensure O2—that message delivery delay exceeds clock synchronization.

5 In general, Δ will be a function of the variance in message delivery delay, the maximum message delivery delay, and the degree of clock synchronization. See Cristian et al. [1985] for a detailed derivation for Δ in a variety of environments.

Real-time Clock Stability Test Tolerating Byzantine Failures I. A request r is stable at a state machine replica smi being executed by processor p if the local clock at p reads τ and uid(r) < τ − Δ.

One disadvantage of this stability test is that it forces the state machine to lag behind its clients by Δ, where Δ is proportional to the worst-case message delivery delay. This disadvantage can be avoided. Due to property O1 of the total ordering on request identifiers, if communications channels satisfy FIFO Channels, then a state machine replica that has received a request r from a client c can subsequently receive from c only requests with unique identifiers greater than uid(r). Thus, a request r is also stable at a state machine replica provided a request with a larger unique identifier has been received from every client.

Real-time Clock Stability Test Tolerating Byzantine Failures II. A request r is stable at a state machine replica smi if a request with a larger unique identifier has been received from every client.

This second stability test is never passed if a (faulty) processor refuses to make requests. However, by combining the first and second test so that a request is considered stable when it satisfies either test, a stability test results that lags clients by Δ only when faulty processors or network delays force it. Such a combined test is discussed in [Gopal et al. 1990].

3.2.3 Using Replica-Generated Identifiers

In the previous two refinements of the Order Implementation, clients determine the order in which requests are processed—the unique identifier uid(r) for a request r is assigned by the client making that request. In the following refinement of the Order Implementation, the state machine replicas determine this order. Unique identifiers are computed in two phases. In the first phase, which can be part of the agreement protocol used to satisfy the Agreement requirement, state machine replicas propose candidate unique identifiers for a request. Then, in the second phase, one of these candidates is selected and it becomes the unique identifier for that request.

The advantage of this approach to computing unique identifiers is that communication between all processors in the system is not necessary. When logical clocks or synchronized real-time clocks are used in computing unique request identifiers, all processors hosting clients or state machine replicas must communicate. In the case of logical clocks, this communication is needed in order for requests to become stable; in the case of synchronized real-time clocks, this communication is needed in order to keep the clocks synchronized.⁶ In the replica-generated identifier approach of this section, the only communication required is among processors running the client and state machine replicas.

6 This communications cost argument illustrates an advantage of having a client forward its request to a single state machine replica that then serves as the transmitter for disseminating the request. In effect, that state machine replica becomes the client of the state machine, and so communication need only involve those processors running state machine replicas.

By constraining the possible candidates proposed in phase 1 for a request's unique identifier, it is possible to obtain a simple stability test. To describe this stability test, some terminology is required. We say that a state machine replica smi has seen a request r once smi has received r and proposed a candidate unique identifier for r. We say that smi has accepted r once that replica knows the ultimate choice of unique identifier for r. Define cuid(smi, r) to be the candidate unique identifier proposed by replica smi for request r. Two constraints that lead to a simple stability test are:

UID1: cuid(smi, r) ≤ uid(r).

UID2: If a request r' is seen by replica smi after r has been accepted by smi, then uid(r) < cuid(smi, r').

If these constraints hold throughout execution, then the following test can be used to determine whether a request is stable at a state machine replica:

Replica-Generated Identifiers Stability Test. A request r that has been accepted by smi is stable provided there is no

request r' that has (i) been seen by smi, (ii) not been accepted by smi, and (iii) for which cuid(smi, r') ≤ uid(r) holds.

To prove that this stability test works, we must show that once an accepted request r is deemed stable at smi, no request with a smaller unique identifier will be subsequently accepted at smi. Let r be a request that, according to the Replica-Generated Identifiers Stability Test, is stable at replica smi. Due to UID2, for any request r' that has not been seen by smi, uid(r) < cuid(smi, r') holds. Thus, by transitivity using UID1, uid(r) < uid(r') holds, and we conclude that r' cannot have a smaller unique identifier than r. Now consider the case in which request r' has been seen but not accepted by smi and—because the stability test for r is satisfied—uid(r) < cuid(smi, r') holds. Due to UID1, we conclude that uid(r) < uid(r') holds and, therefore, r' does not have a smaller unique identifier than r. Thus, we have shown that once a request r satisfies the Replica-Generated Identifiers Stability Test at smi, any request r' that is accepted by smi will satisfy uid(r) < uid(r'), as desired.

Unlike clock-generated unique identifiers for requests, replica-generated ones do not necessarily satisfy O1 and O2 of Section 1. Without further restrictions, it is possible for a client to make a request r, send a message to another client causing request r' to be issued, yet have uid(r') < uid(r). However, O1 and O2 will hold provided that once a client starts disseminating a request to the state machine replicas, the client performs no other communication until every state machine replica has accepted that request. To see why this works, consider a request r being made by some client and suppose some request r' was influenced by r. The delay ensures that r is accepted by every state machine replica before r' is seen. Thus, from UID2 we conclude uid(r) < cuid(smi, r') and, by transitivity with UID1, that uid(r) < uid(r'), as required.

To complete this Order Implementation, we have only to devise protocols for computing unique identifiers and candidate unique identifiers such that:

• UID1 and UID2 are satisfied.   (1)
• r ≠ r' ⇒ uid(r) ≠ uid(r').   (2)
• Every request that is seen eventually becomes accepted.   (3)

One simple solution for a system of fail-stop processors is the following:

Replica-generated Unique Identifiers. In a system with N clients, each state machine replica smi maintains two variables: SEENi is the largest cuid(smi, r) assigned to any request r so far seen by smi, and ACCEPTi is the largest uid(r) assigned to any request r so far accepted by smi.

Upon receipt of a request r, each replica smi computes

cuid(smi, r) := max(⌊SEENi⌋, ⌊ACCEPTi⌋) + 1 + i/N.   (4)

(Notice, this means that all candidate unique identifiers are themselves unique.) The replica then disseminates (using an agreement protocol) cuid(smi, r) to all other replicas and awaits receipt of a candidate unique identifier for r from every nonfaulty replica, participating in the agreement protocol for that value as well. Let NF be the set of replicas from which candidate unique identifiers were received. Finally, the replica computes

uid(r) := max { cuid(smj, r) : smj ∈ NF }   (5)

and accepts r.
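The two assignments above can be transcribed into a small sketch. The following Java fragment is illustrative only: the dissemination of candidates via an agreement protocol is abstracted into an array argument, and the use of doubles to carry the i/N fraction is an assumption of this sketch.

// Illustrative sketch of the Replica-generated Unique Identifiers protocol;
// assignment (4) produces this replica's candidate and assignment (5) picks
// the final identifier from the candidates received from nonfaulty replicas.
class ReplicaIdGenerator {
    private final int i;           // index of this replica
    private final int n;           // N as used in the i/N fraction of assignment (4)
    private double seen = 0.0;     // SEENi: largest candidate proposed by this replica
    private double accepted = 0.0; // ACCEPTi: largest uid accepted by this replica

    ReplicaIdGenerator(int replicaIndex, int n) { this.i = replicaIndex; this.n = n; }

    // Assignment (4): cuid(smi, r) := max(floor(SEENi), floor(ACCEPTi)) + 1 + i/N.
    double proposeCandidate() {
        double cuid = Math.max(Math.floor(seen), Math.floor(accepted)) + 1 + ((double) i) / n;
        seen = Math.max(seen, cuid);
        return cuid;
    }

    // Assignment (5): uid(r) := max of the candidates received from the
    // nonfaulty replicas in NF; the request is then accepted.
    double acceptRequest(double[] candidatesFromNF) {
        double uid = Double.NEGATIVE_INFINITY;
        for (double c : candidatesFromNF) uid = Math.max(uid, c);
        accepted = Math.max(accepted, uid);
        return uid;
    }
}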

We prove that this protocol satisfies (1)-(3) as follows. UID1 follows from using assignment (5) to compute uid(r), and UID2 follows from assignment (4) to compute cuid(smi, r). To conclude that (2) holds, we argue as follows. Because an agreement protocol is used to disseminate candidate unique identifiers, all replicas receive the same values from the same replicas. Thus, all replicas will execute the same assignment statement (5), and all will compute the same value for uid(r). To establish that these uid(r) values are unique for each request, it suffices to observe that maximums of disjoint subsets of a collection of unique values—the candidate unique identifiers—are also unique. Finally, to establish (3), that every request that is seen is eventually accepted, we must prove that for each replica smj, a replica smi eventually learns cuid(smj, r) or learns that smj has failed. This follows trivially from the use of an agreement protocol to distribute the cuid(smj, r) and the definition of a fail-stop processor.

An optimization of our Replica-generated Unique Identifiers protocol is the basis for the ABCAST protocol in the ISIS Toolkit [Birman and Joseph 1987] developed at Cornell. In this optimization, candidate unique identifiers are returned to the client instead of being disseminated to the other state machine replicas. The client then executes assignment (5) to compute uid(r). Finally, an agreement protocol is used by the client in disseminating uid(r) to the state machine replicas. Some unique replica takes over for the client if the client fails.

It is possible to modify our Replica-generated Unique Identifiers protocol for use in systems where processors can exhibit Byzantine failures, have synchronized real-time clocks, and communications channels have bounded message-delivery delays—the same environment as was assumed for using synchronized real-time clocks to generate unique identifiers. The following changes are required. First, each replica smi uses timeouts so that smi cannot be forever delayed waiting to receive and participate in the agreement protocol for disseminating a candidate unique identifier from a faulty replica smj. Second, if smi does determine that smj has timed out, smi disseminates "smj timeout" to all replicas (by using an agreement protocol). Finally, NF is the set of replicas in the ensemble less any smj for which "smj timeout" has been received from t + 1 or more replicas. Notice, Byzantine failures that cause faulty replicas to propose candidate unique identifiers not produced by (4) do not cause difficulty. This is because candidate unique identifiers that are too small have no effect on the outcome of (5) at nonfaulty replicas and those that are too large will satisfy UID1 and UID2.

4. TOLERATING FAULTY OUTPUT DEVICES

It is not possible to implement a t fault-tolerant system by using a single voter to combine the outputs of an ensemble of state machine replicas into one output. This is because a single failure—of the voter—can prevent the system from producing the correct output. Solutions to this problem depend on whether the output of the state machine implemented by the ensemble is to be used within the system or outside the system.

4.1 Outputs Used Outside the System

If the output of the state machine is sent to an output device, then that device is already a single component whose failure cannot be tolerated. Thus, being able to tolerate a faulty voter is not sufficient—the system must also be able to tolerate a faulty output device. The usual solution to this problem is to replicate the output device and voter. Each voter combines the output of each state machine replica, producing a signal that drives one output device. Whatever reads the outputs of the system is assumed to combine the outputs of the replicated devices. This reader, which is not considered part of the computing system, implements the critical voter.

If output devices can exhibit Byzantine failures, then by taking the output produced by the majority of the devices, 2t + 1-fold replication permits up to t faulty output devices to be tolerated. For example, a flap on an airplane wing might be designed so that when the 2t + 1 actuators that control it do not agree, the flap always moves in the direction of the majority (rather than twisting). If output devices exhibit only fail-stop failures, then only t + 1-fold replication is necessary to tolerate t failures because any output produced by a fail-stop output device can be assumed correct. For example, video display terminals usually present information with

enough redundancy so that they can be treated as fail stop—failure detection is implemented by the viewer. With such an output device, a human user can look at one of t + 1 devices, decide whether the output is faulty, and only if it is faulty, look at another, and so on.

4.2 Outputs Used Inside the System

If the output of the state machine is to a client, then the client itself can combine the outputs of state machine replicas in the ensemble. Here, the voter—a part of the client—is faulty exactly when the client is, so the fact that an incorrect output is read by the client due to a faulty voter is irrelevant. When Byzantine failures are possible, the client waits until it has received t + 1 identical responses, each from a different member of the ensemble, and takes that as the response from the t fault-tolerant state machine. When only fail-stop failures are possible, the client can proceed as soon as it has received a response from any member of the ensemble, since any output produced by a replica must be correct.

When the client is executed on the same processor as one of the state machine replicas, optimization of client-implemented voting is possible.⁷ This is because correctness of the processor implies that both the state machine replica and client will be correct. Therefore, the response produced by the state machine replica running locally can be used as that client's response from the t fault-tolerant state machine. And, if the processor is faulty, we are entitled to view the client as being faulty, so it does not matter what state machine responses the client receives. Summarizing, we have the following:

Dependent-Failures Output Optimization. If a client and a state machine replica run on the same processor, then even when Byzantine failures are possible, the client need not gather a majority of responses to its requests to the state machine. It can use the single response produced locally.

7 Care must be exercised when analyzing the fault tolerance of such a system because a single processor failure can now cause two system components to fail. Implicit in most of our discussions is that system components fail independently. It is not always possible to transform a t fault-tolerant system in which clients and state machine replicas have independent failures to one in which they share processors.

5. TOLERATING FAULTY CLIENTS

Implementing a t fault-tolerant state machine is not sufficient for implementing a t fault-tolerant system. Faults might result in clients making requests that cause the state machine to produce erroneous output or that corrupt the state machine so that subsequent requests from nonfaulty clients are incorrectly processed. Therefore, in this section we discuss various methods for insulating the state machine from faults that affect clients.

5.1 Replicating the Client

One way to avoid having faults affect a client is by replicating the client and running each replica on hardware that fails independently. This replication, however, also requires changes to state machines that handle requests from that client. This is because after a client has been replicated N-fold, any state machine it interacts with receives N requests—one from each client replica—when it formerly receives a single request. Moreover, corresponding requests from different client replicas will not necessarily be identical. First, they will differ in their unique identifiers. Second, unless the original client is itself a state machine and the methods of Section 3 are used to coordinate the replicas, corresponding requests from different replicas can also differ in their content. For example, if a client makes requests based on the value of some time-varying sensor, then its replicas will each read their sensors at slightly different times and, therefore, make different requests.

We first consider modifications to a state machine sm for the case in which requests from different client replicas are known to differ only in their unique identifiers. For this case, modifications are needed for coping with receiving N requests instead of a single one. These modifications involve changing each command so that instead of processing every request received, requests
are buffered until enough⁸ have been received; only then is the corresponding command performed (a single time). In effect, a voter is being added to sm to control invocation of its commands. Client replication can be made invisible to the designer of a state machine by including such a voter in the support software that receives requests, tests for stability, and orders stable requests by unique identifier.

⁸ If Byzantine failures are possible, then a t fault-tolerant client requires 2t + 1-fold replication and a command is performed after t + 1 requests have been received. If failures are restricted to fail stop, then t + 1-fold replication will suffice, and a command can be performed after a single request has been received.
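To make the voter's mechanics concrete, the following C sketch (not part of the original paper; names such as invoke_command and the fixed-size table are hypothetical) buffers requests that carry the same unique identifier until a threshold number of copies has arrived and only then performs the corresponding command, a single time:

    /* Hypothetical sketch of a voter placed in the request-delivery layer of a
       state machine replica. Requests from the N client replicas carry the same
       unique identifier; the command is invoked once, after `threshold` copies
       have been received (e.g., t + 1 under Byzantine failures, 1 under
       fail-stop, as in footnote 8). */
    #include <stdbool.h>

    #define MAX_PENDING 128

    struct pending { long uid; int copies; bool done; };

    static struct pending table[MAX_PENDING];
    static int threshold = 1;      /* derived from the failure assumptions */

    static void invoke_command(long uid) { (void)uid; /* apply request to sm (stub) */ }

    void voter_deliver(long uid)
    {
        struct pending *slot = NULL;
        for (int i = 0; i < MAX_PENDING; i++)          /* already buffered? */
            if (table[i].copies > 0 && table[i].uid == uid) { slot = &table[i]; break; }
        if (slot == NULL)                              /* otherwise claim a free entry */
            for (int i = 0; i < MAX_PENDING; i++)
                if (table[i].copies == 0) { slot = &table[i]; slot->uid = uid; slot->done = false; break; }
        if (slot == NULL)
            return;                                    /* table full; not handled in this sketch */
        slot->copies++;
        if (!slot->done && slot->copies >= threshold) {
            slot->done = true;                         /* perform the command a single time */
            invoke_command(uid);
        }
    }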
Modifying the state machine for the case in which requests from different client replicas can also differ in their content typically requires exploiting knowledge of the application. As before, the idea is to transform multiple requests into a single one. For example, in a t fault-tolerant system, if 2t + 1 different requests are received, each containing the value of a sensor, then a single request containing the median of those values might be constructed and processed by the state machine. (Given at most t Byzantine faults, the median of 2t + 1 values is a reasonable one to use because it is bounded from above and below by a nonfaulty value.) A general method for transforming multiple requests containing sensor values into a single request is discussed in Marzullo [1989]. That method is based on viewing a sensor value as an interval that includes the actual value being measured; a single interval (sensor) is computed from a set of intervals by using a fault-tolerant intersection algorithm.
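A small C sketch of this transformation (hypothetical, not from the paper) collapses 2t + 1 buffered sensor readings into one request carrying their median:

    /* Hypothetical sketch: turn 2t + 1 sensor-value requests from client replicas
       into a single request whose value is the median. With at most t Byzantine
       faults, the median is bounded from above and below by nonfaulty readings. */
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* `values` holds the 2t + 1 readings; returns the value to submit to sm. */
    double median_request(double *values, int t)
    {
        int n = 2 * t + 1;
        qsort(values, (size_t)n, sizeof values[0], cmp_double);
        return values[t];          /* middle element of the sorted readings */
    }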
5.2 Defensive Programming

Sometimes a client cannot be made fault tolerant by using replication. In some circumstances, due to the unavailability of sensors or processors, it simply might not be possible to replicate the client. In other circumstances, the application semantics might not afford a reasonable way to transform multiple requests from client replicas into the single request needed by the state machine. In all of these circumstances, careful design of state machines can limit the effects of requests from faulty clients. For example, memory (Figure 1) permits any client to write to any location. Therefore, a faulty client can overwrite all locations, destroying information. This problem could be prevented by restricting write requests from each client to only certain memory locations—the state machine can enforce this.

Including tests in commands is another way to design a state machine that cannot be corrupted by requests from faulty clients. For example, mutex as specified in Figure 2, will execute a release command made by any client—even one that does not have access to the resource. Consequently, a faulty client could issue such a request and cause mutex to grant a second client access to the resource before the first has relinquished access. A better formulation of mutex ignores release commands from all but the client to which exclusive access has been granted. This is implemented by changing the release in mutex to

    release: command
        if user ≠ client → skip
        □ waiting = Φ ∧ user = client → user := Φ
        □ waiting ≠ Φ ∧ user = client →
            send OK to head(waiting);
            user := head(waiting);
            waiting := tail(waiting)
        fi
    end release

Sometimes, a faulty client not making a request can be just as catastrophic as one making an erroneous request. For example, if a client of mutex failed and stopped while it had exclusive access to the resource, then no client could be granted access to the resource. Of course, unless we are prepared to bound the length of time that a correctly functioning process can retain exclusive access to the resource, there is little we can do about this problem. This is because there is no way for a state machine to distinguish between a client that has stopped executing because it has failed and one that is executing very slowly. However, given an upper bound B on the interval between an acquire and the following release, the acquire command of mutex can automatically schedule release on behalf of a client.
We use the notation

    schedule (REQUEST) for +τ

to specify scheduling (REQUEST) with a unique identifier at least τ greater than the identifier on the request being processed. Such a request is called a timeout request and becomes stable at some time in the future, according to the stability test being used for client-generated requests. Unlike requests from clients, requests that result from executing schedule need not be distributed to all state machine replicas of the ensemble. This is because each state machine replica will independently schedule its own (identical) copy of the request.

We can now modify acquire so that a release operation is automatically scheduled. In the code that follows, TIME is assumed to be a function that evaluates to the current time. Note that mutex might now process two release commands on behalf of a client that has acquired access to the resource: one command from the client itself and one generated by its acquire request. The new state variable time_granted, however, ensures that superfluous release commands are ignored. The code is

    acquire: command
        if user = Φ →
            send OK to client;
            user := client;
            time_granted := TIME;
            schedule (mutex.timeout, time_granted) for +B
        □ user ≠ Φ → waiting := waiting ∘ client
        fi
    end acquire

    timeout: command(when_granted: integer)
        if when_granted ≠ time_granted → skip
        □ waiting = Φ ∧ when_granted = time_granted → user := Φ
        □ waiting ≠ Φ ∧ when_granted = time_granted →
            send OK to head(waiting);
            user := head(waiting);
            time_granted := TIME;
            waiting := tail(waiting)
        fi
    end timeout

6. USING TIME TO MAKE REQUESTS

A client need not explicitly send a message to make a request. Not receiving a request can trigger execution of a command—in effect, allowing the passage of time to transmit a request from client to state machine [Lamport 1984]. Transmitting a request using time instead of messages can be advantageous because protocols that implement IC1 and IC2 can be costly both in total number of messages exchanged and in delay. Unfortunately, using time to transmit requests has only limited applicability, since the client cannot specify parameter values.

The use of time to transmit a request was used in Section 5 when we revised the acquire command of mutex to foil clients that failed to release the resource. There, a release request was automatically scheduled by acquire on behalf of a client being granted the resource. A client transmits a release request to mutex simply by permitting B (logical clock or real-time clock) time units to pass. It is only to increase utilization of the shared resource that a client might use messages to transmit a release request to mutex before B time units have passed.

A more dramatic example of using time to transmit a request is illustrated in connection with tally of Figure 3. Assume that

• all clients and state machine replicas have (logical or real time) clocks synchronized to within Γ, and

• the election starts at time Strt and this is known to all clients and state machine replicas.

Using time, a client can cast a vote for a default by doing nothing; only when a client casts a vote different from its default do we require that it actually transmits a request message. Thus, we have:

Transmitting a Default Vote. If a client has not made a request by time Strt + Γ, then a request with that client's default vote has been made.

Notice that the default need not be fixed nor even known at the time a vote is cast.
For example, the default vote could be "vote for the first client that any client casts a nondefault vote for." In that case, the entire election can be conducted as long as one client casts a vote by using actual messages.⁹

⁹ Observe that if Byzantine failures are possible, then a faulty client can be elected. Such problems are always possible when voters do not have detailed knowledge about the candidates in an election.
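A minimal sketch of how a replica might apply this rule is given below (hypothetical C; now(), Strt, and GAMMA stand in for the synchronized clock, the election start, and the skew bound Γ assumed above, and the default is simplified to a fixed per-client value):

    /* Hypothetical sketch of Transmitting a Default Vote: if client c has not
       voted by time Strt + GAMMA, each replica behaves as though a request with
       c's default vote had been made, without any message being sent. */
    #include <stdbool.h>

    #define NCLIENTS 8

    static long Strt, GAMMA;                    /* election start and skew bound */
    static bool voted[NCLIENTS];                /* set when an explicit vote arrives */
    static int  default_vote[NCLIENTS];

    static long now(void) { return 0; /* stand-in for the synchronized clock */ }
    static void apply_vote(int c, int v) { (void)c; (void)v; /* tally's command (stub) */ }

    void check_default_votes(void)
    {
        if (now() < Strt + GAMMA)
            return;                             /* explicit votes may still arrive */
        for (int c = 0; c < NCLIENTS; c++)
            if (!voted[c]) {
                voted[c] = true;                /* the passage of time transmitted it */
                apply_vote(c, default_vote[c]);
            }
    }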
7. RECONFIGURATION

An ensemble of state machine replicas can tolerate more than t faults if it is possible to remove state machine replicas running on faulty processors from the ensemble and add replicas running on repaired processors. (A similar argument can be made for being able to add and remove copies of clients and output devices.) Let P(τ) be the total number of processors at time τ that are executing replicas of some state machine of interest, and let F(τ) be the number of them that are faulty. In order for the ensemble to produce the correct output, we must have

    Combining Condition: P(τ) − F(τ) > Enuf for all 0 ≤ τ, where

        Enuf = P(τ)/2   if Byzantine failures are possible,
               0        if only fail-stop failures are possible.

A processor failure may cause the Combining Condition to be violated by increasing F(τ), thereby decreasing P(τ) − F(τ). When Byzantine failures are possible, if a faulty processor can be identified, then removing it from the ensemble decreases Enuf without further decreasing P(τ) − F(τ); this can keep the Combining Condition from being violated. When only fail-stop failures are possible, increasing the number of nonfaulty processors—by adding one that has been repaired—is the only way to keep the Combining Condition from being violated because increasing P(τ) is the only way to ensure that P(τ) − F(τ) > 0 holds. Therefore, provided the following conditions hold, it may be possible to maintain the Combining Condition forever and thus tolerate an unbounded total number of faults over the life of the system:

F1: If Byzantine failures are possible, then state machine replicas being executed by faulty processors are identified and removed from the ensemble before the Combining Condition is violated by subsequent processor failures.

F2: State machine replicas running on repaired processors are added to the ensemble before the Combining Condition is violated by subsequent processor failures.

F1 and F2 constrain the rates at which failures and repairs occur.
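The condition is easy to restate in code; the following C fragment (hypothetical) is the kind of check a reconfiguration mechanism could apply before trusting the ensemble's output:

    /* Hypothetical restatement of the Combining Condition: with P(tau) replicas
       of which F(tau) are faulty, output is correct as long as P - F > Enuf,
       where Enuf is P/2 under Byzantine failures and 0 under fail-stop failures. */
    #include <stdbool.h>

    bool combining_condition_holds(int P, int F, bool byzantine)
    {
        double enuf = byzantine ? P / 2.0 : 0.0;
        return (double)(P - F) > enuf;
    }

For example, with Byzantine failures an ensemble with P = 5 and F = 2 still satisfies the condition (3 > 2.5), whereas F = 3 does not.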
time complexity of such protocols.
A processor failure may cause the Com ­ Adding or removing a client from the
bining Condition to be violated by increas­ system is simply a matter of changing the
ing F(t), thereby decreasing P(t) — F(t). state machine so that henceforth it re­
When Byzantine failures are possible, if a sponds to or ignores requests from that
faulty processor can be identified, then re­ client. Adding an output device is also
moving it from the ensemble decreases straightforward—the state machine starts
Enuf without further decreasing P(t) — sending output to that device. Removing
F(t); this can keep the Combining Condi­ an output device from a system is achieved
tion from being violated. When only fail- by disabling the device. This is done by
stop failures are possible, increasing the putting the device in a state that prevents
number of nonfaulty processors—by add­ it from affecting the environment. For ex­
ing one that has been repaired— is the only ample, a CR T terminal can be disabled by
way to keep the Combining Condition from turning off the brightness so that the screen
being violated because increasing P(r) is can no longer be read; a hydraulic actuator
the only way to ensure that P(r) — F ( ) t controlling the flap on an airplane wing can
> 0 holds. Therefore, provided the follow­ be disabled by opening a cutoff valve so
ing conditions hold, it may be possible to that the actuator exerts no pressure on that
maintain the Combining Condition forever control surface. As suggested by these ex­
amples, however, it is not always possible
to disable a faulty output device: Turning
9Observe that if Byzantine failures are possible, then
a faulty client can be elected. Such problem s are off the brightness might have no effect on
always possible when voters do not have detailed the screen, and the cutoff valve might not
knowledge about the candidates in an election. work. Thus, there are systems in which no
more than a total of t actuator faults can be tolerated because faulty actuators cannot be disabled.

The configuration of a system structured in terms of a state machine and clients can be described using three sets: the clients C, the state machine replicas S, and the output devices O. S is used by the agreement protocol and therefore must be known to clients and state machine replicas. It can also be used by an output device to determine which send operations made by state machine replicas should be ignored. C and O are used by state machine replicas to determine from which clients requests should be processed and to which devices output should be sent. Therefore, C and O must be available to all state machine replicas.

Two problems must be solved to support changing the system configuration. First, the values of C, S, and O must be available when required. Second, whenever a client, state machine replica, or output device is added to the configuration, the state of that element must be updated to reflect the current state of the system. These problems are considered in the following two sections.

7.1 Managing the Configuration

The configuration of a system can be managed using the state machine in that system. Sets C, S, and O are stored in state variables and changed by commands. Each configuration is valid for a collection of requests—those requests r such that uid(r) is in the range defined by two successive configuration-change requests. Thus, whenever a client, state machine replica, or output device performs an action connected with processing r, it uses the configuration that is valid for r. This means that a configuration-change request must schedule the new configuration for some point far enough in the future so that clients, state machine replicas, and output devices all find out about the new configuration before it actually comes into effect.

There are various ways to make configuration information available to the clients and output devices of a system. (The information is already available to the state machine.) One is for clients and output devices to query the state machine periodically for information about relevant pending configuration changes. Obviously, communication costs for this scheme are reduced if clients and output devices share processors with state machine replicas. Another way to make configuration information available is for the state machine to include information about configuration changes in messages it sends to clients and output devices in the course of normal processing. Doing this requires periodic communication between the state machine and clients and between the state machine and output devices.

Requests to change the configuration of the system are made by a failure/recovery detection mechanism. It is convenient to think of this mechanism as a collection of clients, one for each element of C, S, or O. Each of these configurators is responsible for detecting the failure or repair of the single object it manages and, when such an event is detected, for making a request to alter the configuration. A configurator is likely to be part of an existing client or state machine replica and might be implemented in a variety of ways.

When elements are fail stop, a configurator need only check the failure-detection mechanism of that element. When elements can exhibit Byzantine failures, detecting failures is not always possible. When it is possible, a higher degree of fault tolerance can be achieved by reconfiguration. A nonfaulty configurator satisfies two safety properties:

C1: Only a faulty element is removed from the configuration.

C2: Only a nonfaulty element is added to the configuration.

A configurator that does nothing satisfies C1 and C2. Changing the configuration enhances fault tolerance only if F1 and F2 also hold. For F1 and F2 to hold, a configurator must also (1) detect faults and cause elements to be removed and (2) detect repairs and cause elements to be added. Thus, the degree to which a configurator enhances fault tolerance is directly related to the degree to which (1) and (2) are achieved.

Here, the semantics of the application can be helpful. For example, to infer that a client is faulty, a state machine can compare requests made by different clients or by the same client over a period of time. To determine that a processor executing a state machine replica is faulty, the state machine can monitor messages sent by other state machine replicas during execution of an agreement protocol. And, by monitoring aspects of the environment being controlled by actuators, a state machine replica might be able to determine that an output device is faulty. Some elements, such as processors, have internal failure-detection circuitry that can be read to determine whether that element is faulty or has been repaired and restarted. A configurator for such an element can be implemented by having the state machine periodically poll this circuitry.

In order to analyze the fault tolerance of a system that uses configurators, failure of a configurator can be considered equivalent to the failure of the element that the configurator manages. This is because with respect to the Combining Condition, removal of a nonfaulty element from the system or addition of a faulty one is the same as that element failing. Thus, in a t fault-tolerant system, the sum of the number of faulty configurators that manage nonfaulty elements and the number of faulty components with nonfaulty configurators must be bounded by t.

7.2 Integrating a Repaired Object

Not only must an element being added to a configuration be nonfaulty, it also must have the correct state so that its actions will be consistent with those of the rest of the system. Define e[r_i] to be the state that a non-faulty system element e should be in after processing requests r_0 through r_i. An element e joining the configuration immediately after request r_join must be in state e[r_join] before it can participate in the running system.

An element is self-stabilizing [Dijkstra 1974] if its current state is completely defined by the previous k inputs it has processed for some fixed k. Running such an element long enough to ensure that it has processed k inputs is all that is required to put it in state e[r_join]. Unfortunately, the design of self-stabilizing state machines is not always possible.

When elements are not self-stabilizing, processors are fail stop, and logical clocks are implemented, cooperation of a single state machine replica sm_i is sufficient to integrate a new element e into the system. This is because state information obtained from any state machine replica sm_i must be correct. In order to integrate e at request r_join, replica sm_i must have access to enough state information so that e[r_join] can be assembled and forwarded to e.

• When e is an output device, e[r_join] is likely to be only a small amount of device-specific setup information—information that changes infrequently and can be stored in state variables of sm_i.

• When e is a client, the information needed for e[r_join] is frequently based on recent sensor values read and can therefore be determined by using information provided to sm_i by other clients.

• And, when e is a state machine replica, the information needed for e[r_join] is stored in the state variables and pending requests at sm_i.

The protocol for integrating a client or output device e is simple—e[r_join] is sent to e before the output produced by processing any request with a unique identifier larger than uid(r_join). The protocol for integrating a state machine replica sm_new is a bit more complex. It is not sufficient for replica sm_i simply to send the values of all its state variables and copies of any pending requests to sm_new. This is because some client request might be received by sm_i after sending e[r_join] but delivered to sm_new before its repair. Such a request would neither be reflected in the state information forwarded by sm_i to sm_new nor received by sm_new directly. Thus, sm_i must, for a time, relay to sm_new requests received from clients.¹⁰

¹⁰ Duplicate copies of some requests might be received by sm_new.

Since requests from a given client are received by sm_new in the order sent and in ascending order by request identifier,

once sm_new has received a request directly (i.e., not relayed) from a client c, there is no need for requests from c with larger identifiers to be relayed to sm_new. If sm_new informs sm_i of the identifier on a request received directly from each client c, then sm_i can know when to stop relaying to sm_new requests from c.

The complete integration protocol is summarized in the following:

Integration with Fail-stop Processors and Logical Clocks. A state machine replica sm_i can integrate an element e at request r_join into a running system as follows:

If e is a client or output device, sm_i sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on r_join.

If e is a state machine replica sm_new, then sm_i

(1) sends the values of its state variables and copies of any pending requests to sm_new, and then

(2) sends to sm_new every subsequent request r received from each client c such that uid(r) < uid(r_c), where r_c is the first request sm_new received directly from c after being restarted.
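The bookkeeping sm_i needs for step (2) can be sketched as follows (hypothetical C; the request type and the forwarding routine are assumptions, not part of the paper):

    /* Hypothetical sketch of sm_i relaying client requests to sm_new during
       integration with logical clocks. Relaying for client c stops once sm_new
       reports uid(r_c), the identifier of the first request it received directly
       from c; only requests with smaller identifiers are still relayed. */
    #include <stdbool.h>

    #define NCLIENTS 8

    struct request { int client; long uid; };

    static bool have_direct[NCLIENTS];          /* sm_new has heard from c directly */
    static long first_direct_uid[NCLIENTS];

    static void forward_to_sm_new(const struct request *r) { (void)r; /* network send (stub) */ }

    /* Called when sm_new reports the first request it received directly from c. */
    void note_direct(int c, long uid)
    {
        have_direct[c] = true;
        first_direct_uid[c] = uid;
    }

    /* Called by sm_i for every client request it receives while integrating sm_new. */
    void maybe_relay(const struct request *r)
    {
        if (!have_direct[r->client] || r->uid < first_direct_uid[r->client])
            forward_to_sm_new(r);               /* sm_new might not see this one directly */
    }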
The existence of synchronized real-time clocks permits this protocol to be simplified because sm_i can determine when to stop relaying messages based on the passage of time. Suppose, as in Section 3.2.2, there exists a constant Δ such that a request r with unique identifier uid(r) will be received by every (correct) state machine replica no later than time uid(r) + Δ according to the local clock at the receiving processor. Let sm_new join the configuration at time τ_join. By definition, sm_new is guaranteed to receive every request that was made after time τ_join on the requesting client's clock. Since unique identifiers are obtained from the real-time clock of the client making the request, sm_new is guaranteed to receive every request r such that uid(r) > τ_join. The first such request r must be received by sm_i by time τ_join + Δ according to its clock. Therefore, every request received by sm_i after τ_join + Δ must also be received directly by sm_new. Clearly, sm_i need not relay such requests, and we have the following protocol:

Integration with Fail-stop Processors and Real-time Clocks. A state machine replica sm_i can integrate an element e at request r_join into a running system as follows:

If e is a client or output device, then sm_i sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on r_join.

If e is a state machine replica sm_new, then sm_i

(1) sends the values of its state variables and copies of any pending requests to sm_new, and then

(2) sends to sm_new every request received during the next interval of duration Δ.

When processors can exhibit Byzantine failures, a single state machine replica sm_i is not sufficient for integrating a new element into the system. This is because state information furnished by sm_i might not be correct—sm_i might be executing on a faulty processor. To tolerate t failures in a system with 2t + 1 state machine replicas, t + 1 identical copies of the state information and t + 1 identical copies of relayed messages must be obtained. Otherwise, the protocol is as described above for real-time clocks.

7.2.1 Stability Revisited

The stability tests of Section 3 do not work when requests made by a client can be received from two sources—the client and via a relay. During the interval that messages are being relayed, sm_new, the state machine replica being integrated, might receive a request r directly from c but later receive r', another request from c, with uid(r) > uid(r'), because r' was relayed by sm_i. The solution to this problem is for

sm_new to consider requests received directly from c stable only after no relayed requests from c can arrive. Thus, the stability test must be changed:

Stability Test During Restart. A request r received directly from a client c by a restarting state machine replica sm_new is stable only after the last request from c relayed by another processor has been received by sm_new.

An obvious way to implement this new stability test is for a message to be sent to sm_new when no further requests from c will be relayed.

8. RELATED WORK

The state machine approach was first described in Lamport [1978a] for environments in which failures could not occur. It was generalized to handle fail-stop failures in Schneider [1982], a class of failures between fail-stop and Byzantine failures in Lamport [1978b], and full Byzantine failures in Lamport [1984]. These various state machine implementations were first characterized using the Agreement and Order requirements and a stability test in Schneider [1985].

The state machine approach has been used in the design of significant fault-tolerant process control applications [Wensley et al. 1978]. It has also been used in the design of distributed synchronization—including read/write locks and distributed semaphores [Schneider 1980], input/output guards for CSP and conditional Ada SELECT statements [Schneider 1982]—and in the design of a fail-stop processor approximation using processors that can exhibit arbitrary behavior in response to a failure [Schlichting and Schneider 1983; Schneider 1984]. A stable storage implementation described in Bernstein [1985] exploits properties of a synchronous broadcast network to avoid explicit protocols for Agreement and Order and uses Transmitting a Default Vote (as described in Section 6). The notion of Δ common storage, suggested in Cristian et al. [1985], is a state machine implementation of memory that uses the Real-time Clock Stability Test. The decentralized commit protocol of Skeen [1982] can be viewed as a straightforward application of the state machine approach, whereas the two-phase commit protocol described in Gray [1978] can be obtained from decentralized commit simply by making restrictive assumptions about failures and performing optimizations based on these assumptions. The Paxon Synod commit protocol [Lamport 1989] also can be understood in terms of the state machine approach. It is similar to, but less expensive to execute than, the standard three-phase commit protocol. Finally, the method of implementing highly available distributed services in Liskov and Ladin [1986] uses the state machine approach, with clever optimizations of the stability test and agreement protocol that are possible due to the semantics of the application and the use of fail-stop processors.

A critique of the state machine approach for transaction management in database systems appears in Garcia-Molina et al. [1986]. Experiments evaluating the performance of various of the stability tests in a network of SUN Workstations are reported in Pittelli and Garcia-Molina [1989]. That study also reports on the performance of request batching, which is possible when requests describe database transactions, and the use of null requests in the Logical Clock Stability Test Tolerating Fail-stop Failures of Section 3.

Primitives to support the Agreement and Order requirements for Replica Coordination have been included in two operating systems toolkits. The ISIS Toolkit [Birman 1985] provides ABCAST and CBCAST for allowing an applications programmer to control the delivery order of messages to the members of a process group (i.e., collection of state machine replicas). ABCAST ensures that all state machine replicas process requests in the same order; CBCAST allows more flexibility in message ordering and ensures that causally related requests are delivered in the correct relative order. ISIS has been used to implement a number of prototype applications. One example is the RNFS (replicated NFS) file system, a

network file system that is tolerant to fail-stop failures and runs on top of NFS, that was designed using the state machine approach [Marzullo and Schmuck 1988].

The Psync primitive [Peterson et al. 1989], which has been implemented in the x-kernel [Hutchinson and Peterson 1988], is similar to the CBCAST of ISIS. Psync, however, makes available to the programmer the graph of the message "potential causality" relation, whereas CBCAST does not. Psync is intended to be a low-level protocol that can be used to implement protocols like ABCAST and CBCAST; the ISIS primitives are intended for use by applications programmers and, therefore, hide the "potential causality" relation while at the same time include support for group management and failure reporting.

ACKNOWLEDGMENTS

This material is based on work supported in part by the Office of Naval Research under contract N00014-86-K-0092, the National Science Foundation under Grants Nos. DCR-8320274 and CCR-8701103, and Digital Equipment Corporation. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.

Discussions with O. Babaoglu, K. Birman, and L. Lamport over the past 5 years have helped me formulate the ideas in this paper. Useful comments on drafts of this paper were provided by J. Aizikowitz, O. Babaoglu, A. Bernstein, K. Birman, R. Brown, D. Gries, K. Marzullo, and B. Simons. I am very grateful to Sal March, managing editor of ACM Computing Surveys, for his thorough reading of this paper and many helpful comments.

REFERENCES

Aizikowitz, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040.

Bernstein, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.

Birman, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), ACM, pp. 79-86.

Birman, K. P., and Joseph, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.

Cristian, F., Aghili, H., Strong, H. R., and Dolev, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.

Dijkstra, E. W. 1974. Self stabilization in spite of distributed control. Commun. ACM 17, 11 (Nov.), 643-644.

Fischer, M., Lynch, N., and Paterson, M. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374-382.

Garcia-Molina, H., Pittelli, F., and Davidson, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47.

Gopal, A., Strong, R., Toueg, S., and Cristian, F. 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990).

Gray, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science, Vol. 60. Springer-Verlag, New York, pp. 393-481.

Halpern, J., Simons, B., Strong, R., and Dolev, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102.

Hutchinson, N., and Peterson, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88—Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75.

Lamport, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.

Lamport, L. 1978b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.

Lamport, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280.

Lamport, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.

Lamport, L., and Melliar-Smith, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74.

Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401.

Liskov, B., and Ladin, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th ACM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39.

Mancini, L., and Pappalardo, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192.

Marzullo, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.

Marzullo, K., and Schmuck, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.

Peterson, L. L., Bucholz, N. C., and Schlichting, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246.

Pittelli, F. M., and Garcia-Molina, H. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60.

Schlichting, R. D., and Schneider, F. B. 1983. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS 1, 3 (Aug.), 222-238.

Schneider, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.

Schneider, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195.

Schneider, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154.

Schneider, F. B. 1985. Paradigms for distributed programs. Distributed Systems: Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430.

Schneider, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.

Schneider, F. B., Gries, D., and Schlichting, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15.

Siewiorek, D. P., and Swarz, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.

Skeen, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.

Strong, H. R., and Dolev, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.

Wensley, J. H., Lamport, L., Goldberg, J., Green, M. W., Levitt, K. N., Melliar-Smith, P. M., Shostak, R. E., and Weinstock, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.

Received November 1987; final revision accepted January 1990.

Chapter 9

Communication and End-to-End Argument

This chapter contains parts of the book chapter:

A. S. Tanenbaum and M. V. Steen. Distributed Systems: Principles and Paradigms. Second Edition. Chapter 4, pp. 124-125 and 140-177 (40 of 686). Pearson International Edition, 2007. ISBN: 0-13-613553-6

In addition, the chapter includes the papers:

D. Pritchett. BASE: An ACID Alternative. Queue 6, 3 (May 2008), pp. 48-55 (8 of 72), 2008. Doi: 10.1145/1394127.1394128

J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Trans. Comput. Syst. 2(4), pp. 277-288 (12 of 359), 1984. Doi: 10.1145/357401.357402

In this chapter, we observe that an important programming tactic to obtain availability is to decouple system components through asynchronous communication abstractions. The text reviews communication abstractions that follow this model, with particular emphasis on Message-Oriented Middleware (MOM). In addition, the text also discusses other important communication abstractions, such as streaming communication as well as gossiping. Armed with this background, we turn to papers which discuss how to architect applications with such abstractions. First, the BASE (basically available, soft state, eventually consistent) methodology is introduced, which argues for decoupling system functions by use of reliable MOM intermediaries. Second, the end-to-end argument is introduced, which debates at a higher level which functionality and guarantees a system must actually provide, and which functionality can be left to applications. The ultimate goal of this portion of the material is to encourage us to reflect on the relationship between application and system architecture, and how the use of different system abstractions may affect high-level properties such as availability.
The learning goals for this portion of the material are listed below.

• Describe different approaches to design communication abstractions, e.g., transient vs. persistent and synchronous vs. asynchronous.

• Explain the design and implementation of message-oriented middleware (MOM).

• Discuss alternative communication abstractions such as data streams and multicast/gossip.

• Explain how to organize systems employing a BASE methodology.

• Discuss the relationship of BASE to eventual consistency and the CAP theorem.

• Apply the end-to-end argument to system design situations.



as a middleware service, without being modified. This approach is somewhat an­


alogous to offering UDP at the transport level. Likewise, middleware communica­
tion services may include message-passing services comparable to those offered
by the transport layer.
In the remainder of this chapter, we concentrate on four high-level middle­
ware communication services: remote procedure calls, message queuing services,
support for communication of continuous media through streams, and multicast­
ing. Before doing so, there are other general criteria for distinguishing (middle­
ware) communication which we discuss next.

4.1.2 Types of Communication

To understand the various alternatives in communication that middleware can


offer to applications, we view the middleware as an additional service in client-
server computing, as shown in Fig. 4-4. Consider, for example an electronic mail
system. In principle, the core of the mail delivery system can be seen as a
middleware communication service. Each host runs a user agent allowing users to
compose, send, and receive e-mail. A sending user agent passes such mail to the
mail delivery system, expecting it, in turn, to eventually deliver the mail to the
intended recipient. Likewise, the user agent at the receiver’ s side connects to the
mail delivery system to see whether any mail has come in. If so, the messages are
transferred to the user agent so that they can be displayed and read by the user.

Figure 4-4. Viewing middleware as an intermediate (distributed) service in application-level communication. (The figure marks three possible synchronization points: at request submission, at request delivery, and after processing by the server.)

An electronic mail system is a typical example in which communication is


persistent. With persistent communication, a message that has been submitted
for transmission is stored by the communication middleware as long as it takes to
deliver it to the receiver. In this case, the middleware will store the message at
one or several of the storage facilities shown in Fig. 4-4. As a consequence, it is


not necessary for the sending application to continue execution after submitting the message. Likewise, the receiving application need not be executing when the message is submitted.

In contrast, with transient communication, a message is stored by the communication system only as long as the sending and receiving application are executing. More precisely, in terms of Fig. 4-4, if the middleware cannot deliver a message due to a transmission interrupt, or because the recipient is currently not active, it will simply be discarded. Typically, all transport-level communication services offer only transient communication. In this case, the communication system consists of traditional store-and-forward routers. If a router cannot deliver a message to the next one or the destination host, it will simply drop the message.

Besides being persistent or transient, communication can also be asynchronous or synchronous. The characteristic feature of asynchronous communication is that a sender continues immediately after it has submitted its message for transmission. This means that the message is (temporarily) stored immediately by the middleware upon submission. With synchronous communication, the sender is blocked until its request is known to be accepted. There are essentially three points where synchronization can take place. First, the sender may be blocked until the middleware notifies that it will take over transmission of the request. Second, the sender may synchronize until its request has been delivered to the intended recipient. Third, synchronization may take place by letting the sender wait until its request has been fully processed, that is, up to the time that the recipient returns a response.

Various combinations of persistence and synchronization occur in practice. Popular ones are persistence in combination with synchronization at request submission, which is a common scheme for many message-queuing systems, which we discuss later in this chapter. Likewise, transient communication with synchronization after the request has been fully processed is also widely used. This scheme corresponds with remote procedure calls, which we also discuss below.

Besides persistence and synchronization, we should also make a distinction between discrete and streaming communication. The examples so far all fall in the category of discrete communication: the parties communicate by messages, each message forming a complete unit of information. In contrast, streaming involves sending multiple messages, one after the other, where the messages are related to each other by the order they are sent, or because there is a temporal relationship. We return to streaming communication extensively below.

4.2 REMOTE PROCEDURE CALL

Many distributed systems have been based on explicit message exchange between processes. However, the procedures send and receive do not conceal
communication at all, which is important to achieve access transparency in distributed


Figure 4-13. Client-to-server binding in DCE.

Performing an RPC

The actual RPC is carried out transparently and in the usual way. The client
stub marshals the parameters to the runtime library for transmission using the pro­
tocol chosen at binding time. When a m essage arrives at the server side, it is
routed to the correct server based on the end point contained in the incom ing m es­
sage. The runtime library passes the m essage to the server stub, which unmarshals
the parameters and calls the server. The reply goes back by the reverse route.
D C E provides several semantic options. The default is at-most-once opera­
tion, in which case no call is ever carried out m ore than once, even in the face o f
system crashes. In practice, what this means is that if a server crashes during an
RPC and then recovers quickly, the client does not repeat the operation, for fear
that it might already have been carried out once.
Alternatively, it is possible to mark a remote procedure as idem potent (in the
ID L file), in which case it can be repeated multiple times without harm. For ex­
ample, reading a specified block from a file can be tried over and over until it
succeeds. When an idem potent RPC fails due to a server crash, the client can wait
until the server reboots and then try again. Other semantics are also available (but
rarely used), including broadcasting the RPC to all the m achines on the local net­
work. W e return to RPC semantics in Chap. 8, when discussing RPC in the pres­
ence o f failures.

4.3 MESSAGE-ORIENTED COMMUNICATION


R em ote procedure calls and remote object invocations contribute to hiding
com m unication in distributed systems, that is, they enhance a ccess transparency.
Unfortunately, neither m echanism is always appropriate. In particular, when it
cannot be assumed that the receiving side is executing at the time a request is


issued, alternative communication services are needed. Likewise, the inherent


synchronous nature o f RPCs, by which a client is blocked until its request has
been processed, sometimes needs to be replaced by something else.
That something else is messaging. In this section w e concentrate on message-
oriented communication in distributed systems by first taking a closer look at
what exactly synchronous behavior is and what its im plications are. Then, w e dis­
cuss m essaging systems that assume that parties are executing at the time o f co m ­
munication. Finally, we will examine m essage-queuing systems that allow proc­
esses to exchange information, even if the other party is not executing at the time
communication is initiated.

4.3.1 Message-Oriented Transient Communication

Many distributed systems and applications are built directly on top o f the sim ­
ple message-oriented m odel offered by the transport layer. T o better understand
and appreciate the message-oriented systems as part o f m iddleware solutions, we
first discuss m essaging through transport-level sockets.

Berkeley Sockets

Special attention has been paid to standardizing the interface o f the transport
layer to allow programmers to make use o f its entire suite o f (messaging) proto­
cols through a sim ple set o f primitives. Also, standard interfaces make it easier to
port an application to a different machine.
As an example, we briefly discuss the sockets interface as introduced in the
1970s in Berkeley UNIX. Another important interface is XTI, which stands for
the X/Open Transport Interface, formerly called the Transport Layer Interface
(TLI), and developed by AT&T. Sockets and X T I are very similar in their m odel
o f network programming, but differ in their set o f primitives.
Conceptually, a socket is a com m unication end point to which an application
can write data that are to be sent out over the underlying network, and from which
incoming data can be read. A socket form s an abstraction over the actual com m u ­
nication end point that is used by the local operating system for a specific tran­
sport protocol. In the follow in g text, w e concentrate on the socket prim itives for
TCP, which are shown in Fig. 4-14.
Servers generally execute the first four primitives, normally in the order
given. W hen calling the socket primitive, the caller creates a new com m unication
end point for a specific transport protocol. Internally, creating a com m unication
end point means that the local operating system reserves resources to a cco m m o ­
date sending and receiving m essages for the specified protocol.
The bind primitive associates a local address with the newly-created socket.
For example, a server should bind the IP address o f its machine together with a
(possibly well-known) port number to a socket. Binding tells the operating system
that the server wants to receive messages only on the specified address and port.


Primitive Meaning
Socket Create a new communication end point
Bind Attach a local address to a socket
Listen Announce willingness to accept connections
Accept Block caller until a connection request arrives
Connect Actively attempt to establish a connection
Send Send some data over the connection
Receive Receive some data over the connection
Close Release the connection

Figure 4-14. The socket primitives for TCP/IP.

The listen primitive is called only in the case o f connection-oriented com m u ­


nication. It is a nonblocking call that allows the local operating system to reserve
enough buffers for a specified m aximum number o f connections that the caller is
w illing to accept.
A call to accept block s the caller until a connection request arrives. When a
request arrives, the local operating system creates a new socket with the sam e pro­
perties as the original one, and returns it to the caller. This approach will allow the
server to, for example, fork o ff a process that w ill subsequently handle the actual
com m unication through the new connection. The server, in the meantime, can go
back and wait for another connection request on the original socket.
Let us now take a look at the client side. Here, too, a socket must first be
created using the socket primitive, but explicitly binding the socket to a local ad­
dress is not necessary, since the operating system can dynam ically allocate a port
when the connection is set up. The connect primitive requires that the caller speci­
fies the transport-level address to which a connection request is to be sent. The
client is blocked until a connection has been set up successfully, after which both
sides can start exchanging information through the send and receive primitives.
Finally, closin g a connection is symmetric when using sockets, and is established
by having both the client and server call the close primitive. The general pattern
follow ed by a client and server for connection-oriented com m unication using
sockets is shown in Fig. 4-15. Details about network program m ing using sockets
and other interfaces in a UNIX environment can be found in Stevens (1998).
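A minimal C sketch of this pattern is shown below (error handling is omitted, and the port number 8080 is arbitrary). The server walks through socket, bind, listen, and accept, then echoes one message back over the accepted connection; a client would correspondingly call socket, connect, write, read, and close.

    /* Minimal sketch of the server side of connection-oriented socket use
       (assumptions: IPv4, port 8080, a single connection, no error handling). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);     /* socket: create an end point */

        struct sockaddr_in addr;                       /* bind: attach a local address */
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(srv, (struct sockaddr *)&addr, sizeof addr);

        listen(srv, 5);                                /* listen: allow up to 5 pending connections */

        int conn = accept(srv, NULL, NULL);            /* accept: block until a request arrives */

        char buf[128];                                 /* receive and send over the connection */
        ssize_t n = read(conn, buf, sizeof buf);
        if (n > 0)
            write(conn, buf, (size_t)n);               /* echo the data back */

        close(conn);                                   /* close: release the connection */
        close(srv);
        return 0;
    }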

The Message-Passing Interface (MPI)

With the advent o f high-performance multicomputers, developers have been


lookin g for m essage-oriented primitives that w ould allow them to easily write
highly efficient applications. This means that the prim itives should be at a con ­
venient level of abstraction (to ease application development), and that their


Figure 4-15. Connection-oriented communication pattern using sockets. (The figure shows the server proceeding through socket, bind, listen, accept, read/write, and close, and the client through socket, connect, write, read, and close, with a synchronization point at connection setup and communication in between.)

implementation incurs only minimal overhead. Sockets were deem ed insufficient


for two reasons. First, they were at the wrong level o f abstraction by supporting
only sim ple send and receive primitives. Second, sockets had been designed to
comm unicate across networks using general-purpose protocol stacks such as
TCP/IP. They were not considered suitable for the proprietary protocols devel­
oped for high-speed interconnection networks, such as those used in high-perfor­
mance server clusters. T hose protocols required an interface that could handle
m ore advanced features, such as different forms o f buffering and synchronization.
The result was that m ost interconnection networks and high-performance
multicomputers were shipped with proprietary com m unication libraries. These
libraries offered a wealth o f high-level and generally efficient com m unication
primitives. O f course, all libraries were mutually incompatible, so that application
developers now had a portability problem.
The need to be hardware and platform independent eventually led to the
definition o f a standard for m essage passing, sim ply called the M essage-Passing
Interface or MPI. MPI is designed for parallel applications and as such is
tailored to transient communication. It makes direct use o f the underlying net­
work. Also, it assumes that serious failures such as process crashes or network
partitions are fatal and do not require automatic recovery.
MPI assumes comm unication takes place within a known group o f processes.
Each group is assigned an identifier. Each process within a group is also assigned
a (local) identifier. A (groupID , processID ) pair therefore uniquely identifies the
source or destination o f a message, and is used instead o f a transport-level ad­
dress. There may be several, possibly overlapping groups o f processes involved in
a computation and that are all executing at the same time.
At the core o f MPI are m essaging primitives to support transient com m unica­
tion, o f which the m ost intuitive ones are summ arized in Fig. 4-16.
Transient asynchronous communication is supported by means of the MPI_bsend primitive. The sender submits a message for transmission, which is generally first copied to a local buffer in the MPI runtime system. When the message has been copied, the sender continues. The local MPI runtime system will remove the message from its local buffer and take care of transmission as soon as a receiver has called a receive primitive.


Primitive        Meaning
MPI_bsend        Append outgoing message to a local send buffer
MPI_send         Send a message and wait until copied to local or remote buffer
MPI_ssend        Send a message and wait until receipt starts
MPI_sendrecv     Send a message and wait for reply
MPI_isend        Pass reference to outgoing message, and continue
MPI_issend       Pass reference to outgoing message, and wait until receipt starts
MPI_recv         Receive a message; block if there is none
MPI_irecv        Check if there is an incoming message, but do not block

Figure 4-16. Some of the most intuitive message-passing primitives of MPI.

There is also a blocking send operation, called MPI_send, of which the semantics are implementation dependent. The primitive MPI_send may either block the caller until the specified message has been copied to the MPI runtime system at the sender's side, or until the receiver has initiated a receive operation. Synchronous communication by which the sender blocks until its request is accepted for further processing is available through the MPI_ssend primitive. Finally, the strongest form of synchronous communication is also supported: when a sender calls MPI_sendrecv, it sends a request to the receiver and blocks until the latter returns a reply. Basically, this primitive corresponds to a normal RPC.

Both MPI_send and MPI_ssend have variants that avoid copying messages from user buffers to buffers internal to the local MPI runtime system. These variants correspond to a form of asynchronous communication. With MPI_isend, a sender passes a pointer to the message after which the MPI runtime system takes care of communication. The sender immediately continues. To prevent overwriting the message before communication completes, MPI offers primitives to check for completion, or even to block if required. As with MPI_send, whether the message has actually been transferred to the receiver or that it has merely been copied by the local MPI runtime system to an internal buffer is left unspecified.

Likewise, with MPI_issend, a sender also passes only a pointer to the MPI runtime system. When the runtime system indicates it has processed the message, the sender is then guaranteed that the receiver has accepted the message and is now working on it.

The operation MPI_recv is called to receive a message; it blocks the caller until a message arrives. There is also an asynchronous variant, called MPI_irecv, by which a receiver indicates that it is prepared to accept a message. The receiver can check whether or not a message has indeed arrived, or block until one does.

The semantics of MPI communication primitives are not always straightforward, and different primitives can sometimes be interchanged without affecting

327
SEC. 4.3 MESSAGE-ORIENTED COMMUNICATION 145

the co r r e c tn e s s o f a p rogra m . T h e o f f ic ia l r e a s o n w h y s o m a n y d iffe re n t fo r m s o f


c o m m u n ic a tio n are s u p p o r te d is that it g iv e s im p le m e n te r s o f M P I s y s te m s
e n o u g h p o s s ib ilit ie s fo r o p t im iz in g p e r fo rm a n ce . C y n ic s m ig h t s a y the c o m m it t e e
c o u ld n ot m a k e u p its c o lle c t iv e m in d, s o it th rew in ev ery th in g. M P I h as b e e n
d e s ig n e d fo r h ig h - p e r fo rm a n ce p a ra llel a p p lica tio n s, w h ich m a k e s it e a s ie r to
u n d ersta n d its d iv e r sity in d iffe re n t c o m m u n ic a t io n p r im itiv e s.
M o r e o n M P I ca n b e f o u n d in G r o p p et al. (1998b) T h e c o m p le t e r e fe r e n c e in
w h ich the o v e r 100 fu n c tio n s in M P I are e x p la in e d in detail, ca n b e fo u n d in S n ir
et al. (1998) and G r o p p et al. (1998a)
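
To make the table above concrete, the short C program below exchanges a message between two processes using a blocking and a nonblocking send. It is only a sketch: it assumes an installed MPI implementation (for example, compiled with mpicc and started with mpirun -np 2), and it uses the capitalization of the standard C binding (MPI_Send, MPI_Isend, MPI_Recv) rather than the lowercase names of Fig. 4-16.

/* Minimal MPI sketch: transient message passing between two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocking send: returns once the message has been copied to a local
         * or remote buffer, or handed to the receiver (implementation
         * dependent, as described above). */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Asynchronous variant: only a reference to the buffer is passed;
         * MPI_Wait checks for completion before the buffer may be reused. */
        MPI_Request req;
        MPI_Isend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        int recv1, recv2;
        /* Blocking receive: blocks the caller until a message arrives. */
        MPI_Recv(&recv1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&recv2, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d and %d\n", recv1, recv2);
    }

    MPI_Finalize();
    return 0;
}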

4.3.2 Message-Oriented Persistent Communication

We now come to an important class of message-oriented middleware services, generally known as message-queuing systems, or just Message-Oriented Middleware (MOM). Message-queuing systems provide extensive support for persistent asynchronous communication. The essence of these systems is that they offer intermediate-term storage capacity for messages, without requiring either the sender or receiver to be active during message transmission. An important difference with Berkeley sockets and MPI is that message-queuing systems are typically targeted to support message transfers that are allowed to take minutes instead of seconds or milliseconds. We first explain a general approach to message-queuing systems, and conclude this section by comparing them to more traditional systems, notably the Internet e-mail systems.

Message-Queuing Model

The basic idea behind a message-queuing system is that applications communicate by inserting messages in specific queues. These messages are forwarded over a series of communication servers and are eventually delivered to the destination, even if it was down when the message was sent. In practice, most communication servers are directly connected to each other. In other words, a message is generally transferred directly to a destination server. In principle, each application has its own private queue to which other applications can send messages. A queue can be read only by its associated application, but it is also possible for multiple applications to share a single queue.

An important aspect of message-queuing systems is that a sender is generally given only the guarantee that its message will eventually be inserted in the recipient's queue. No guarantees are given about when, or even if, the message will actually be read, which is completely determined by the behavior of the recipient.

These semantics permit communication loosely coupled in time. There is thus no need for the receiver to be executing when a message is being sent to its queue. Likewise, there is no need for the sender to be executing at the moment its message is picked up by the receiver. The sender and receiver can execute completely independently of each other. In fact, once a message has been deposited in a queue, it will remain there until it is removed, irrespective of whether its sender or receiver is executing. This gives us four combinations with respect to the execution mode of the sender and receiver, as shown in Fig. 4-17.

Figure 4-17. Four combinations for loosely-coupled communications using queues: (a) sender running, receiver running; (b) sender running, receiver passive; (c) sender passive, receiver running; (d) sender passive, receiver passive.

In Fig. 4-17(a), both the sender and receiver execute during the entire transmission of a message. In Fig. 4-17(b), only the sender is executing, while the receiver is passive, that is, in a state in which message delivery is not possible. Nevertheless, the sender can still send messages. The combination of a passive sender and an executing receiver is shown in Fig. 4-17(c). In this case, the receiver can read messages that were sent to it, but it is not necessary that their respective senders are executing as well. Finally, in Fig. 4-17(d), we see the situation that the system is storing (and possibly transmitting) messages even while sender and receiver are passive.

Messages can, in principle, contain any data. The only important aspect from the perspective of middleware is that messages are properly addressed. In practice, addressing is done by providing a systemwide unique name of the destination queue. In some cases, message size may be limited, although it is also possible that the underlying system takes care of fragmenting and assembling large messages in a way that is completely transparent to applications. An effect of this approach is that the basic interface offered to applications can be extremely simple, as shown in Fig. 4-18.

Primitive Meaning
Put Append a message to a specified queue
Get Block until the specified queue is nonempty, and remove the first message
Poll Check a specified queue for messages, and remove the first. Never block
Notify Install a handler to be called when a message is put into the specified queue

Figure 4-18. Basic interface to a queue in a message-queuing system.

The put primitive is called by a sender to pass a message to the underlying system that is to be appended to the specified queue. As we explained, this is a nonblocking call. The get primitive is a blocking call by which an authorized pro­
cess can remove the longest pending message in the specified queue. The process
is blocked only if the queue is empty. Variations on this call allow searching for a
specific message in the queue, for example, using a priority, or a matching pat­
tern. The nonblocking variant is given by the poll primitive. If the queue is empty,
or if a specific message could not be found, the calling process simply continues.
Finally, most queuing systems also allow a process to install a handler as a
c a l l b a c k f u n c t io n , which is automatically invoked whenever a message is put into
the queue. Callbacks can also be used to automatically start a process that will
fetch messages from the queue if no process is currently executing. This approach
is often implemented by means of a daemon on the receiver's side that continu­
ously monitors the queue for incoming messages and handles them accordingly.
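
As a rough illustration, the interface of Fig. 4-18 could be declared along the following lines in C. All type and function names here are hypothetical, introduced only for this sketch; they are not taken from any particular message-queuing product.

/* Hypothetical C declaration of the basic queue interface of Fig. 4-18. */
#include <stddef.h>
#include <stdbool.h>

typedef struct mq_queue mq_queue;                  /* opaque queue handle */
typedef void (*mq_handler)(mq_queue *q, void *ctx);

/* Append a message to the specified queue (nonblocking). */
int  mq_put(mq_queue *q, const void *msg, size_t len);

/* Block until the queue is nonempty, then remove the first message. */
int  mq_get(mq_queue *q, void *buf, size_t buflen, size_t *msglen);

/* Check the queue; remove the first message if present, never block. */
bool mq_poll(mq_queue *q, void *buf, size_t buflen, size_t *msglen);

/* Install a callback invoked whenever a message is put into the queue. */
int  mq_notify(mq_queue *q, mq_handler handler, void *ctx);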

General Architecture of a M essage-Queuing System

Let us now take a closer look at what a general message-queuing system looks
like. One of the first restrictions that we make is that messages can be put only
into queues that are local to the sender, that is, queues on the same machine, or no
worse than on a machine nearby such as on the same LAN that can be efficiently
reached through an RPC. Such a queue is called the sou rce queue. Likewise,
messages can be read only from local queues. However, a message put into a
queue will contain the specification of a destination queue to which it should be
transferred. It is the responsibility of a message-queuing system to provide queues
to senders and receivers and take care that messages are transferred from their
source to their destination queue.
It is important to realize that the collection of queues is distributed across
multiple machines. Consequently, for a message-queuing system to transfer mes­
sages, it should maintain a mapping of queues to network locations. In practice,
this means that it should maintain a (possibly distributed) database of queue
names to network locations, as shown in Fig. 4-19. Note that such a mapping is
completely analogous to the use of the Domain Name System (DNS) for e-mail in
the Internet. For example, when sending a message to the logical mail address
steen@cs.vu.nl, the mailing system will query DNS to find the network (i.e., IP)
address of the recipient's mail server to use for the actual message transfer.


Figure 4-19. The relationship between queue-level addressing and network-level addressing.
Queues are managed by queue managers. Normally, a queue manager inter­
acts directly with the application that is sending or receiving a message. However,
there are also special queue managers that operate as routers, or relays: they for­
ward incoming messages to other queue managers. In this way, a message­
queuing system may gradually grow into a complete, application-level, overlay
network, on top of an existing computer network. This approach is similar to the
construction of the early MBone over the Internet, in which ordinary user proc­
esses were configured as multicast routers. As it turns out, multicasting through
overlay networks is still important as we will discuss later in this chapter.
Relays can be convenient for a number of reasons. For example, in many mes­
sage-queuing systems, there is no general naming service available that can dy­
namically maintain queue-to-location mappings. Instead, the topology of the
queuing network is static, and each queue manager needs a copy of the queue-to-
location mapping. It is needless to say that in large-scale queuing systems, this ap­
proach can easily lead to network-management problems.
One solution is to use a few routers that know about the network topology.
When a sender A puts a message for destination B in its local queue, that message
is first transferred to the nearest router, say R1, as shown in Fig. 4-20. At that
point, the router knows what to do with the message and forwards it in the direc­
tion of B. For example, R l may derive from B ’s name that the message should be
forwarded to router R2. In this way, only the routers need to be updated when
queues are added or removed, while every other queue manager has to know only
where the nearest router is.
Relays can thus generally help build scalable message-queuing systems. How­
ever, as queuing networks grow, it is clear that the manual configuration of net­
works will rapidly become completely unmanageable. The only solution is to
adopt dynamic routing schemes as is done for computer networks. In that respect,
it is somewhat surprising that such solutions are not yet integrated into some of
the popular message-queuing systems.

Figure 4-20. The general organization of a message-queuing system with routers.

Another reason why relays are used is that they allow for secondary proc­
essing of messages. For example, messages may need to be logged for reasons of
security or fault tolerance. A special form of relay that we discuss in the next sec­
tion is one that acts as a gateway, transforming messages into a format that can be
understood by the receiver.
Finally, relays can be used for multicasting purposes. In that case, an incom­
ing message is simply put into each send queue.

M essage Brokers

An important application area of message-queuing systems is integrating


existing and new applications into a single, coherent distributed information sys­
tem. Integration requires that applications can understand the messages they re­
ceive. In practice, this requires the sender to have its outgoing messages in the
same format as that of the receiver.
The problem with this approach is that each time an application is added to
the system that requires a separate message format, each potential receiver will
have to be adjusted in order to understand that format.
An alternative is to agree on a common message format, as is done with tradi­
tional network protocols. Unfortunately, this approach will generally not work for
message-queuing systems. The problem is the level of abstraction at which these

system s operate. A com m on m essage format makes sense only if the collection o f
p ro cesses that make use o f that format indeed have enough in comm on. If the c o l­
lection o f applications that make up a distributed information system is highly di­
verse (which it often is), then the best com m on format may w ell be no m ore than
a sequ en ce o f bytes.
A lth ough a few com m on m essage formats for specific application domains
have been defined, the general approach is to learn to live with different formats,
and try to provide the means to make conversions as sim ple as possible. In m es­
sage-queuing systems, conversions are handled by special nodes in a queuing net­
work, k n ow n as m essage brokers. A m essage broker acts as an application-level
gateway in a m essage-queuing system. Its main purpose is to convert incom ing
m essages so that they can be understood by the destination application. Note that
to a m essage-queuing system, a m essage broker is just another application, as
show n in Fig. 4-21. In other words, a m essage broker is generally not considered
to be an integral part o f the queuing system.

Figure 4-21. The general organization of a message broker in a message-queuing system.

A m essa ge broker can be as sim ple as a reformatter for messages. For ex­
ample, assum e an incom ing m essage contains a table from a database, in which
records are separated by a special end-of-record delimiter and fields within a rec­
ord have a known, fixed length. If the destination application expects a different
delimiter betw een records, and also expects that fields have variable lengths, a
m essage broker can be used to convert m essages to the format expected by the
destination.
In a m ore advanced setting, a m essage broker may act as an application-level
gateway, such as one that handles the conversion between two different database
applications. In such cases, frequently it cannot be guaranteed that all information


contained in the incom ing m essage can actually be transformed into something
appropriate for the outgoing message.
However, m ore com m on is the use o f a m essage broker for advanced enter­
prise application integration (EAI) as w e discussed in Chap. 1. In this case,
rather than (only) converting m essages, a broker is responsible for matching appli­
cations based on the m essages that are being exchanged. In such a model, called
publish/subscribe, applications send m essages in the form o f publishing. In par­
ticular, they may publish a m essage on topic X, which is then sent to the broker.
Applications that have stated their interest in m essages on topic X, that is, who
have subscribed to those m essages, will then receive these m essages from the
broker. M ore advanced forms o f mediation are also possible, but w e will defer
further discussion until Chap. 13.
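
The matching step itself can be illustrated with a small in-process sketch: the broker keeps a list of subscriptions and forwards every published message to the applications that subscribed to its topic. The topic names, callback interface, and fixed-size table below are invented for the example; a real broker is of course far more elaborate.

/* Minimal sketch of topic-based publish/subscribe matching in a broker. */
#include <stdio.h>
#include <string.h>

#define MAX_SUBS 8

typedef void (*subscriber_cb)(const char *topic, const char *msg);

struct subscription { const char *topic; subscriber_cb cb; };
static struct subscription subs[MAX_SUBS];
static int nsubs;

/* An application states its interest in messages on a topic. */
static void subscribe(const char *topic, subscriber_cb cb)
{
    if (nsubs < MAX_SUBS)
        subs[nsubs++] = (struct subscription){ topic, cb };
}

/* The broker forwards a published message to every matching subscriber. */
static void publish(const char *topic, const char *msg)
{
    for (int i = 0; i < nsubs; i++)
        if (strcmp(subs[i].topic, topic) == 0)
            subs[i].cb(topic, msg);
}

static void print_msg(const char *topic, const char *msg)
{
    printf("received on %s: %s\n", topic, msg);
}

int main(void)
{
    subscribe("stock/ACME", print_msg);
    publish("stock/ACME", "price=42");     /* delivered to the subscriber */
    publish("stock/OTHER", "price=7");     /* no subscriber: dropped      */
    return 0;
}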
At the heart o f a m essage broker lies a repository o f rules and program s that
can transform a message of type T1 to one of type T2. The problem is defining
the rules and developing the programs. M ost m essage broker products com e with
sophisticated developm ent tools, but the bottom line is still that the repository
needs to be filled by experts. Here w e see a perfect exam ple where com m ercial
products are often misleadingly said to provide “intelligence,” where, in fact, the
only intelligence is to be found in the heads o f those experts.

A Note on M essage-Queuing Systems

Considering what w e have said about message-queuing systems, it w ould


appear that they have long existed in the form o f implementations for e-mail ser­
vices. E-mail systems are generally im plem ented through a collection o f mail ser­
vers that store and forward m essages on behalf o f the users on hosts directly con ­
nected to the server. Routing is generally left out, as e-mail system s can make
direct use o f the underlying transport services. For example, in the mail protocol
for the Internet, SM TP (Postel, 1982), a m essage is transferred by setting up a
direct TCP connection to the destination mail server.
What makes e-mail systems special com pared to message-queuing system s is
that they are primarily aimed at providing direct support for end users. This
explains, for example, why a number o f groupware applications are based directly
on an e-mail system (Khoshafian and Buckiewicz, 1995). In addition, e-mail sys­
tems may have very specific requirements such as automatic m essage filtering,
support for advanced m essaging databases (e.g., to easily retrieve previously
stored messages), and so on.
General message-queuing systems are not aimed at supporting only end users.
An important issue is that they are set up to enable persistent com m unication b e­
tween processes, regardless o f whether a process is running a user application,
handling access to a database, perform ing computations, and so on. This approach
leads to a different set o f requirements for m essage-queuing systems than pure e-
mail systems. For example, e-mail systems generally need not provide guaranteed


m essage delivery, m essage priorities, loggin g facilities, efficient multicasting,


load balancing, fault tolerance, and so on for general usage.
General-purpose m essage-queuing systems, therefore, have a w ide range o f
applications, including e-mail, workflow, groupware, and batch processing. H ow ­
ever, as w e have stated before, the m ost important application area is the integra­
tion o f a (possibly widely-dispersed) collection o f databases and applications into
a federated information system (Hohpe and Woolf, 2004). For example, a query
spanning several databases may need to be split into subqueries that are for­
warded to individual databases. M essage-queuing system s assist by providing the
basic means to package each subquery into a m essage and routing it to the ap­
propriate database. Other comm unication facilities we have discussed in this
chapter are far less appropriate.

4.3.3 Example: IBM's WebSphere Message-Queuing System

To help understand how message-queuing systems work in practice, let us
take a look at one specific system, namely the message-queuing system that is
part of IBM's WebSphere product. Formerly known as MQSeries, it is now
referred to as W ebSph ere MQ. There is a wealth o f documentation on W eb­
Sphere MQ, and in the follow in g w e can only resort to the basic principles. Many
architectural details concerning message-queuing networks can be found in IBM
(2005b, 2005d). Program ming message-queuing networks is not something that
can be learned on a Sunday afternoon, and MQ's programming guide (IBM,
2005a) is a g o o d exam ple show ing that goin g from principles to practice may
require substantial effort.

Overview

The basic architecture o f an M Q queuing network is quite straightforward,


and is shown in Fig. 4-22. All queues are m anaged by queue managers. A
queue manager is responsible for rem oving m essages from its send queues, and
forwarding those to other queue managers. Likewise, a queue manager is respon­
sible for handling incom ing m essages by picking them up from the underlying
network and subsequently storing each m essage in the appropriate input queue. T o
give an im pression o f what m essaging can mean: a m essage has a maximum de­
fault size o f 4 MB, but this can be increased up to 100 MB. A queue is normally
restricted to 2 GB o f data, but depending on the underlying operating system, this
maximum can be easily set higher.
Queue managers are pairwise connected through m essage channels, which
are an abstraction o f transport-level connections. A m essage channel is a unidirec­
tional, reliable connection between a sending and a receiving queue manager,
through which queued m essages are transported. For example, an Internet-based
message channel is implemented as a TCP connection. Each of the two ends of a


m essage channel is managed by a message channel agent (MCA). A sending


M C A is basically doing nothing else than checking send queues for a message,
wrapping it into a transport-level packet, and sending it along the connection to its
associated receiving MCA. Likewise, the basic task o f a receivin g M C A is listen­
ing for an incom ing packet, unwrapping it, and subsequently storing the unwrap­
ped m essage into the appropriate queue.

Figure 4-22. General organization of IBM's message-queuing system.

Queue managers can be linked into the same process as the application for
which it manages the queues. In that case, the queues are hidden from the applica­
tion behind a standard interface, but effectively can be directly manipulated by the
application. An alternative organization is one in which queue managers and ap­
plications run on separate machines. In that case, the application is offered the
same interface as when the queue manager is colocated on the same machine.
However, the interface is im plem ented as a proxy that com m unicates with the
queue manager using traditional RPC-based synchronous communication. In this
way, M Q basically retains the m odel that only queues local to an application can
be accessed.

Channels

An important com ponent o f M Q is formed by the m essage channels. Each


m essage channel has exactly one associated send queue from which it fetches the
m essages it should transfer to the other end. Transfer along the channel can take
place only if both its sending and receiving M C A are up and running. Apart from
starting both M C A s manually, there are several alternative ways to start a chan­
nel, some of which we discuss next.

One alternative is to have an application directly start its end o f a channel by


activating the sending or receiving MCA. However, from a transparency point o f
view, this is not a very attractive alternative. A better approach to start a sending
MCA is to configure the channel's send queue to set off a trigger when a message
is first put into the queue. That trigger is associated with a handler to start the
sending M C A so that it can rem ove m essages from the send queue.
Another alternative is to start an M C A over the network. In particular, if one
side of a channel is already active, it can send a control message requesting that
the other MCA be started. Such a control message is sent to a daemon listening
to a well-known address on the same machine as where the other M C A is to be
started.
Channels are stopped automatically after a specified time has expired during
which no m ore m essages were dropped into the send queue.
Each M C A has a set o f associated attributes that determine the overall b e ­
havior o f a channel. Som e o f the attributes are listed in Fig. 4-23. Attribute values
o f the sending and receiving M C A should be com patible and perhaps negotiated
first before a channel can be set up. For example, both M C A s should obviously
support the same transport protocol. An exam ple o f a nonnegotiable attribute is
whether or not m essages are to be delivered in the same order as they are put into
the send queue. If one M C A wants FIFO delivery, the other must comply. An ex ­
ample o f a negotiable attribute value is the m axim um m essage length, which will
sim ply be chosen as the minimum value specified by either MCA.

Attribute Description
Transport type Determines the transport protocol to be used
FIFO delivery Indicates that messages are to be delivered in the order they are sent
Message length Maximum length of a single message
Setup retry count Specifies maximum number of retries to start up the remote MCA
Delivery retries Maximum times MCA will try to put received message into queue

Figure 4-23. Some attributes associated with message channel agents.

M essage Transfer

T o transfer a m essage from one queue manager to another (possibly remote)


queue manager, it is necessary that each m essage carries its destination address,
for which a transmission header is used. An address in M Q consists o f two parts.
The first part consists o f the name o f the queue manager to which the m essage is
to be delivered. The second part is the name of the destination queue residing
under that manager, to which the message is to be appended.
B esides the destination address, it is also necessary to specify the route that a
message should follow. Route specification is done by providing the name of the

local send queue to which a m essage is to be appended. Thus it is not necessary to


provide the full route in a message. R ecall that each m essage channel has exactly
one send queue. By telling to which send queue a m essage is to be appended, we
effectively specify to which queue manager a message is to be forwarded.
In m ost cases, routes are explicitly stored inside a queue manager in a routing
table. An entry in a routing table is a pair (destQM , sendQ ), where destQM is the
name o f the destination queue manager, and sendQ is the name o f the local send
queue to which a m essage for that queue manager should be appended. (A routing
table entry is called an alias in MQ.)
It is possible that a m essage needs to be transferred across multiple queue
managers before reaching its destination. Whenever such an intermediate queue
manager receives the message, it sim ply extracts the name o f the destination
queue manager from the m essage header, and does a routing-table look-up to find
the local send queue to which the m essage should be appended.
It is important to realize that each queue manager has a system w ide unique
name that is effectively used as an identifier for that queue manager. The problem
with using these names is that replacing a queue manager, or changing its name,
will affect all applications that send m essages to it. Problem s can be alleviated by
using a local alias for queue manager names. An alias defined within a queue
manager M1 is another name for a queue manager M2, but which is available only
to applications interfacing to M1. An alias allows the use of the same (logical)
name for a queue, even if the queue manager o f that queue changes. Changing the
name o f a queue manager requires that we change its alias in all queue managers.
However, applications can be left unaffected.

Figure 4-24. The general organization of an MQ queuing network using routing tables and aliases.


The principle of using routing tables and aliases is shown in Fig. 4-24. For
example, an application linked to queue manager QMA can refer to a remote
queue manager using the local alias LA1. The queue manager will first look up
the actual destination in the alias table to find it is queue manager QMC. The
route to QMC is found in the routing table, which states that messages for QMC
should be appended to the outgoing queue SQ1, which is used to transfer mes­
sages to queue manager QMB. The latter will use its routing table to forward the
message to QMC.
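
The lookup performed by a queue manager such as QMA can be sketched in a few lines of C. The tables below contain only the entries mentioned in the example above (alias LA1 for QMC, and a route to QMC via send queue SQ1); all names and types are illustrative, not part of any real product.

/* Hypothetical sketch of the lookup done by a queue manager (here QMA):
 * an alias table maps local aliases to destination queue managers, and a
 * routing table maps destination queue managers to local send queues. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct alias_entry { const char *alias;   const char *dest_qm;    };
struct route_entry { const char *dest_qm; const char *send_queue; };

static const struct alias_entry alias_table[]   = { { "LA1", "QMC" } };
static const struct route_entry routing_table[] = { { "QMC", "SQ1" } };

/* Resolve a (possibly aliased) destination queue manager to a send queue. */
static const char *resolve_send_queue(const char *dest)
{
    for (size_t i = 0; i < sizeof alias_table / sizeof *alias_table; i++)
        if (strcmp(alias_table[i].alias, dest) == 0) {
            dest = alias_table[i].dest_qm;   /* local alias -> real QM name */
            break;
        }
    for (size_t i = 0; i < sizeof routing_table / sizeof *routing_table; i++)
        if (strcmp(routing_table[i].dest_qm, dest) == 0)
            return routing_table[i].send_queue;
    return NULL;                             /* no route known */
}

int main(void)
{
    /* A message addressed via LA1 ends up in send queue SQ1, and is thus
     * forwarded towards QMB, which routes it further to QMC. */
    printf("send queue for LA1: %s\n", resolve_send_queue("LA1"));
    return 0;
}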
F ollow ing this approach o f routing and aliasing leads to a program m ing inter­
face that, fundamentally, is relatively simple, called the M essage Queue Inter­
face (MQI). The m ost important primitives o f M Q I are summ arized in Fig. 4-25.

Primitive Description
MQopen Open a (possibly remote) queue
MQclose Close a queue
MQput Put a message into an opened queue
MQget Get a message from a (local) queue

Figure 4-25. Primitives available in the message-queuing interface.

T o put m essages into a queue, an application calls the MQopen primitive,


specifying a destination queue in a specific queue manager. The queue manager
can be named using the locally-available alias. Whether the destination queue is
actually remote or not is com pletely transparent to the application. MQopen
should also be called if the application wants to get m essages from its local queue.
Only local queues can be opened for reading in com ing m essages. When an appli­
cation is finished with accessin g a queue, it should clo se it by calling MQclose.
M essages can be written to, or read from, a queue using MQput and MQget,
respectively. In principle, m essages are rem oved from a queue on a priority basis.
M essages with the sam e priority are rem oved on a first-in, first-out basis, that is,
the longest pending m essage is rem oved first. It is also possible to request for spe­
cific messages. Finally, M Q provides facilities to signal applications when m es­
sages have arrived, thus avoiding that an application w ill continuously have to
poll a m essage queue for in com ing messages.
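
A rough impression of how an application might use these primitives is given below. The names and signatures follow the simplified interface of Fig. 4-25, not the actual WebSphere MQ C API (whose verbs are MQCONN, MQOPEN, MQPUT, MQGET, MQCLOSE, and so on), and the stub bodies only print what a real queue manager would otherwise do.

/* Illustrative sketch of the simplified MQI primitives of Fig. 4-25. */
#include <stdio.h>
#include <string.h>

typedef int mqi_handle;

static mqi_handle MQopen(const char *qmgr_alias, const char *queue)
{   /* stand-in: a real implementation contacts the (local) queue manager */
    printf("open queue %s at queue manager %s\n", queue, qmgr_alias);
    return 1;
}
static void MQclose(mqi_handle q)              { printf("close %d\n", q); }
static void MQput(mqi_handle q, const char *m) { printf("put '%s' on %d\n", m, q); }
static void MQget(mqi_handle q, char *buf, int n)
{   (void)q; strncpy(buf, "new-order:42", (size_t)n - 1); buf[n - 1] = '\0'; }

int main(void)
{
    char buf[128];

    /* The destination is named via a local alias (LA1); whether the
     * destination queue is remote is transparent to the application. */
    mqi_handle out = MQopen("LA1", "ORDERS");
    MQput(out, "new-order:42");
    MQclose(out);

    /* Only local queues can be opened for reading incoming messages. */
    mqi_handle in = MQopen("QMC", "ORDERS");
    MQget(in, buf, sizeof buf);    /* messages are removed on a priority basis */
    printf("got '%s'\n", buf);
    MQclose(in);
    return 0;
}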

M anaging Overlay Networks

From the description so far, it should be clear that an important part o f m anag­
ing M Q systems is connecting the various queue managers into a consistent over­
lay network. M oreover, this network needs to be maintained over time. For small
networks, this maintenance will not require much m ore than average administra­
tive work, but matters becom e com plicated when m essage queuing is used to
integrate and disintegrate large existing systems.

A major issue with M Q is that overlay networks need to be manually adminis­


trated. This administration not only involves creating channels between queue
managers, but also filling in the routing tables. Obviously, this can grow into a
nightmare. Unfortunately, management support for M Q system s is advanced only
in the sense that an administrator can set virtually every possible attribute, and
tweak any thinkable configuration. However, the bottom line is that channels and
routing tables need to be manually maintained.
At the heart o f overlay management is the channel control function co m ­
ponent, which logically sits between m essage channel agents. This com ponent
allow s an operator to monitor exactly what is goin g on at two end points o f a
channel. In addition, it is used to create channels and routing tables, but also to
manage the queue managers that host the m essage channel agents. In a way, this
approach to overlay management strongly resem bles the management o f cluster
servers where a single administration server is used. In the latter case, the server
essentially offers only a remote shell to each machine in the cluster, along with a
few collective operations to handle groups o f machines. The g o o d news about dis­
tributed-systems management is that it offers lots o f opportunities if you are lo o k ­
ing for an area to explore new solutions to serious problems.

4.4 STREAM-ORIENTED COM M UNICATION

Com m unication as discussed so far has concentrated on exchanging more-or-


less independent and com plete units o f information. Exam ples include a request
for invoking a procedure, the reply to such a request, and m essages exchanged be­
tween applications as in message-queuing systems. The characteristic feature o f
this type o f communication is that it does not matter at what particular point in
time comm unication takes place. Although a system may perform too slow or too
fast, timing has no effect on correctness.
There are also forms o f comm unication in which timing plays a crucial role.
Consider, for example, an audio stream built up as a sequence o f 16-bit samples,
each representing the amplitude o f the sound wave as is done through Pulse C od e
M odulation (PCM). A lso assume that the audio stream represents C D quality,
meaning that the original sound wave has been sam pled at a frequency o f 44,100
Hz. T o reproduce the original sound, it is essential that the samples in the audio
stream are played out in the order they appear in the stream, but also at intervals
o f exactly 1/44,100 sec. Playing out at a different rate will produce an incorrect
version o f the original sound.
The question that w e address in this section is which facilities a distributed
system should offer to exchange time-dependent information such as audio and
video streams. Various network protocols that deal with stream-oriented com m u ­
nication are discussed in Halsall (2001). Steinmetz and Nahrstedt (2004) provide

an overall introduction to multimedia issues, part of which forms stream-oriented communication. Query processing on data streams is discussed in Babcock et al. (2002).

4.4.1 Support for Continuous Media

Support for the exchange o f time-dependent information is often formulated


as support for continuous media. A m edium refers to the means by which infor­
mation is conveyed. These means include storage and transmission media, pres­
entation media such as a monitor, and so on. An important type o f m edium is the
way that information is represented. In other words, how is information encoded
in a computer system? Different representations are used for different types o f in­
formation. For example, text is generally encoded as A SC II or Unicode. Im ages
can be represented in different formats such as GIF or JPEG. A udio streams can
be encoded in a computer system by, for example, taking 16-bit sam ples using
PCM.
In continuous (representation) media, the temporal relationships betw een
different data items are fundamental to correctly interpreting what the data actual­
ly means. W e already gave an exam ple o f reproducing a sound wave by playing
out an audio stream. As another example, consider motion. M otion can be repres­
ented by a series o f im ages in which su ccessive im ages must be displayed at a
uniform spacing T in time, typically 30-40 m sec per image. Correct reproduction
requires not only show ing the stills in the correct order, but also at a constant fre­
quency o f l/T im ages per second.
In contrast to continuous media, discrete (representation) media, is charac­
terized by the fact that temporal relationships between data items are not funda­
mental to correctly interpreting the data. Typical exam ples o f discrete media
include representations o f text and still images, but also object co d e or executable
files.

Data Stream

T o capture the exchange o f time-dependent information, distributed systems


generally provide support for data streams. A data stream is nothing but a se­
quence o f data units. Data streams can be applied to discrete as well as continuous
media. For example, UNIX pipes or TCP/IP connections are typical exam ples o f
(byte-oriented) discrete data streams. Playing an audio file typically requires set­
ting up a continuous data stream between the file and the audio device.
Tim ing is crucial to continuous data streams. T o capture timing aspects, a dis­
tinction is often made between different transmission modes. In asynchronous
transmission m ode the data items in a stream are transmitted one after the other,
but there are no further timing constraints on when transmission o f items should
take place. This is typically the case for discrete data streams. For example, a file


can be transferred as a data stream, but it is m ostly irrelevant exactly when the
transfer o f each item completes.
In synchronous transmission mode, there is a maximum end-to-end delay
defined for each unit in a data stream. Whether a data unit is transferred much fas­
ter than the maximum tolerated delay is not important. For example, a sensor may
sample temperature at a certain rate and pass it through a network to an operator.
In that case, it may be important that the end-to-end propagation time through the
network is guaranteed to be low er than the time interval between taking samples,
but it cannot do any harm if samples are propagated much faster than necessary.
Finally, in isochronous transmission mode, it is necessary that data units are
transferred on time. This means that data transfer is subject to a maximum and
minimum end-to-end delay, also referred to as bounded (delay) jitter. Isochronous
transmission m ode is particularly interesting for distributed multimedia systems,
as it plays a crucial role in representing audio and video. In this chapter, w e co n ­
sider only continuous data streams using isochronous transmission, which we will
refer to sim ply as streams.
Streams can be simple or complex. A sim ple stream consists o f only a single
sequence o f data, whereas a com plex stream consists o f several related sim ple
streams, called substreams. The relation between the substreams in a com plex
stream is often also time dependent. For example, stereo audio can be transmitted
by means o f a com plex stream consisting o f tw o substreams, each used for a sin­
gle audio channel. It is important, however, that those tw o substreams are continu­
ously synchronized. In other words, data units from each stream are to be c o m ­
municated pairwise to ensure the effect o f stereo. Another exam ple o f a com plex
stream is one for transmitting a movie. Such a stream could consist o f a single
video stream, along with two streams for transmitting the sound o f the m ovie in
stereo. A fourth stream might contain subtitles for the deaf, or a translation into a
different language than the audio. Again, synchronization o f the substreams is im ­
portant. If synchronization fails, reproduction o f the m ovie fails. W e return to
stream synchronization below.
From a distributed systems perspective, w e can distinguish several elements
that are needed for supporting streams. For simplicity, w e concentrate on stream­
ing stored data, as opposed to streaming live data. In the latter case, data is cap­
tured in real time and sent over the network to recipients. The main difference b e ­
tween the tw o is that streaming live data leaves less opportunities for tuning a
stream. Follow ing W u et al. (2001), w e can then sketch a general client-server ar­
chitecture for supporting continuous multim edia streams as shown in Fig. 4-26.
This general architecture reveals a number o f important issues that need to be
dealt with. In the first place, the multim edia data, notably video and to a lesser
extent audio, will need to be com pressed substantially in order to reduce the re­
quired storage and especially the network capacity. M ore important from the per­
spective o f comm unication are controlling the quality o f the transmission and syn­
chronization issues. W e discuss these issues next.


Figure 4-26. A general architecture for streaming stored multimedia data over a network.

4.4.2 Streams and Quality of Service

Tim ing (and other nonfunctional) requirements are generally expressed as


Quality of Service (QoS) requirements. These requirements describe what is
needed from the underlying distributed system and network to ensure that, for ex­
ample, the temporal relationships in a stream can be preserved. Q oS for continu­
ous data streams mainly concerns timeliness, volume, and reliability. In this se c­
tion we take a closer look at Q oS and its relation to setting up a stream.
M uch has been said about how to specify required Q oS (see, e.g., Jin and
Nahrstedt, 2004). From an application’ s perspective, in many cases it boils down
to specifying a few important properties (Halsall, 2001):

1. The required bit rate at which data should be transported.

2. The maximum delay until a session has been set up (i.e., when an ap­
plication can start sending data).

3. The maximum end-to-end delay (i.e., how long it will take until a
data unit makes it to a recipient).

4. The m aximum delay variance, or jitter.

5. The maximum round-trip delay.

It should be noted that many refinements can be made to these specifications, as
explained, for example, by Steinmetz and Nahrstedt (2004). However, when deal­
ing with stream-oriented com m unication that is based on the Internet protocol
stack, w e sim ply have to live with the fact that the basis o f com m unication is
form ed by an extremely simple, best-effort datagram service: IP. W hen the goin g
gets tough, as may easily be the case in the Internet, the specification o f IP allows
a protocol implementation to drop packets whenever it sees fit. Many, if not all

distributed systems that support stream-oriented communication, are currently


built on top o f the Internet protocol stack. S o much for Q oS specifications. (Actu­
ally, IP does provide som e Q oS support, but it is rarely implemented.)

Enforcing QoS

Given that the underlying system offers only a best-effort delivery service, a
distributed system can try to conceal as much as possible o f the lack o f quality o f
service. Fortunately, there are several mechanisms that it can deploy.
First, the situation is not really so bad as sketched so far. For example, the
Internet provides a means for differentiating classes o f data by means o f its dif­
ferentiated services. A sending host can essentially mark ou tgoing packets as
belongin g to one o f several classes, including an expedited forwarding class that
essentially specifies that a packet should be forwarded by the current router with
absolute priority (Davie et al., 2002). In addition, there is also an assured for­
warding class, by which traffic is divided into four subclasses, along with three
ways to drop packets if the network gets congested. Assured forwarding therefore
effectively defines a range o f priorities that can be assigned to packets, and as
such allows applications to differentiate time-sensitive packets from noncritical
ones.
B esides these network-level solutions, a distributed system can also help in
getting data across to receivers. Although there are generally not many tools avail­
able, one that is particularly useful is to use buffers to reduce jitter. The principle
is simple, as shown in Fig. 4-27. A ssum ing that packets are delayed with a cer­
tain variance when transmitted over the network, the receiver sim ply stores them
in a buffer for a maximum amount o f time. This will allow the receiver to pass
packets to the application at a regular rate, know ing that there will always be
enough packets entering the buffer to be played back at that rate.

Figure 4-27. Using a buffer to reduce jitter.

Of course, things may go wrong, as is illustrated by packet #8 in Fig. 4-27.
The size of the receiver's buffer corresponds to 9 seconds of packets to pass to the
application. Unfortunately, packet #8 took 11 seconds to reach the receiver, at


which time the buffer will have been com pletely emptied. The result is a gap in
the playback at the application. The only solution is to increase the buffer size.
The obvious drawback is that the delay at which the receiving application can
start playing back the data contained in the packets increases as well.
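
The playout-buffer principle can be simulated in a few lines of C. The per-packet delays and the amount of buffering below are made-up numbers, chosen so that one packet, like packet #8 above, arrives after its scheduled playback instant and causes a gap.

/* Minimal simulation of a playout buffer used to mask jitter. Packets are
 * produced once per second but arrive with variable network delay; the
 * receiver starts playback only BUFFER_DELAY seconds after a packet was
 * sent, so playback proceeds at a constant rate as long as no packet is
 * delayed by more than that amount. */
#include <stdio.h>

#define NPACKETS     10
#define BUFFER_DELAY 4.0    /* seconds of buffering before playback */

int main(void)
{
    /* Simulated one-way network delay (seconds) of each packet. */
    double delay[NPACKETS] = { 1.0, 1.5, 2.0, 1.2, 3.5, 2.5, 1.0, 5.5, 1.0, 1.5 };

    for (int i = 0; i < NPACKETS; i++) {
        double sent    = i;                    /* packet i departs at t = i  */
        double arrival = sent + delay[i];
        double play    = sent + BUFFER_DELAY;  /* scheduled playback instant */
        if (arrival <= play)
            printf("packet %d: arrives %.1fs, played on time at %.1fs\n",
                   i, arrival, play);
        else
            printf("packet %d: arrives %.1fs, after its slot %.1fs -> gap in playback\n",
                   i, arrival, play);
    }
    return 0;
}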
Other techniques can be used as well. R ealizing that w e are dealing with an
underlying best-effort service also means that packets may be lost. T o com pensate
for this loss in quality o f service, w e need to apply error correction techniques
(Perkins et al., 1998; and Wah et al., 2000). Requesting the sender to retransmit a
m issing packet is generally out o f the question, so that forward error correction
(FEC) needs to be applied. A well-known technique is to encode the outgoing
packets in such a way that any k out o f n received packets is enough to reconstruct
k correct packets.
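
A very small instance of this idea is to add one XOR parity packet per group of k data packets (so n = k + 1), which lets the receiver rebuild any single lost packet. The sketch below uses made-up packet contents; real schemes such as Reed-Solomon codes tolerate more losses.

/* Tiny forward-error-correction sketch: one XOR parity packet per group. */
#include <stdio.h>

#define K        4          /* data packets per group     */
#define PKT_SIZE 8          /* bytes per packet (example) */

int main(void)
{
    unsigned char pkt[K][PKT_SIZE] = {
        "audio-0", "audio-1", "audio-2", "audio-3"
    };
    unsigned char parity[PKT_SIZE] = { 0 };

    /* Sender: parity byte j is the XOR of byte j of every data packet. */
    for (int i = 0; i < K; i++)
        for (int j = 0; j < PKT_SIZE; j++)
            parity[j] ^= pkt[i][j];

    /* Receiver: suppose packet 2 was lost; XOR of the parity packet and the
     * remaining data packets reconstructs it. */
    unsigned char rebuilt[PKT_SIZE];
    for (int j = 0; j < PKT_SIZE; j++) {
        rebuilt[j] = parity[j];
        for (int i = 0; i < K; i++)
            if (i != 2)
                rebuilt[j] ^= pkt[i][j];
    }
    printf("reconstructed packet 2: %s\n", (char *)rebuilt);
    return 0;
}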
One problem that may occur is that a single packet contains multiple audio
and video frames. As a consequence, when a packet is lost, the receiver may actu­
ally perceive a large gap when playing out frames. This effect can be somewhat
circumvented by interleaving frames, as shown in Fig. 4-28. In this way, when a
packet is lost, the resulting gap in su ccessive frames is distributed over time.
Note, however, that this approach does require a larger receive buffer in c o m ­
parison to noninterleaving, and thus im poses a higher start delay for the receiving
application. For example, when considering Fig. 4-28(b), to play the first four
frames, the receiver w ill need to have four packets delivered, instead o f only one
packet in com parison to noninterleaved transmission.

Figure 4-28. The effect of packet loss in (a) noninterleaved transmission and (b) interleaved transmission.
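
Interleaving itself is easy to express: frame f is placed in packet f mod n instead of packing consecutive frames together. The sketch below, with made-up sizes, shows that losing one packet then costs every n-th frame rather than a contiguous run.

/* Sketch of frame interleaving: losing one packet loses frames spread
 * over time instead of a contiguous gap. */
#include <stdio.h>

#define NFRAMES    16
#define NPACKETS    4
#define PER_PACKET (NFRAMES / NPACKETS)

int main(void)
{
    int packet[NPACKETS][PER_PACKET];

    /* Sender: interleave frames 0..15 over the packets. */
    for (int f = 0; f < NFRAMES; f++)
        packet[f % NPACKETS][f / NPACKETS] = f;

    /* Receiver: packet 1 is lost; print which frames are missing. */
    printf("frames lost when packet 1 is dropped:");
    for (int j = 0; j < PER_PACKET; j++)
        printf(" %d", packet[1][j]);    /* 1, 5, 9, 13: spread over time */
    printf("\n");
    return 0;
}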


4.4.3 Stream Synchronization

An important issue in multimedia systems is that different streams, possibly in


the form o f a com plex stream, are mutually synchronized. Synchronization o f
streams deals with maintaining temporal relations between streams. T w o types o f
synchronization occur.
The sim plest form o f synchronization is that between a discrete data stream
and a continuous data stream. Consider, for example, a slide show on the W eb
that has been enhanced with audio. Each slide is transferred from the server to the
client in the form o f a discrete data stream. At the sam e time, the client should
play out a specific (part o f an) audio stream that matches the current slide that is
also fetched from the server. In this case, the audio stream is to be synchronized
with the presentation o f slides.
A m ore demanding type o f synchronization is that between continuous data
streams. A daily exam ple is playing a m ovie in which the video stream needs to
be synchronized with the audio, com m only referred to as lip synchronization.
Another exam ple o f synchronization is playing a stereo audio stream consisting o f
two substreams, one for each channel. Proper play out requires that the two sub­
streams are tightly synchronized: a difference of more than 20 µsec can distort the
stereo effect.
Synchronization takes place at the level o f the data units o f which a stream is
made up. In other words, we can synchronize two streams only between data
units. The ch oice o f what exactly a data unit is depends very much on the level o f
abstraction at which a data stream is viewed. T o make things concrete, consider
again a CD-quality (single-channel) audio stream. At the finest granularity, such a
stream appears as a sequence o f 16-bit samples. With a sam pling frequency o f
44,100 Hz, synchronization with other audio streams could, in theory, take place
approximately every 23 µsec. For high-quality stereo effects, it turns out that syn­
chronization at this level is indeed necessary.
However, when we consider synchronization between an audio stream and a
video stream for lip synchronization, a much coarser granularity can be taken. As
we explained, video frames need to be displayed at a rate o f 25 H z or more. Tak­
ing the widely-used N T SC standard o f 29.97 Hz, we could group audio samples
into logical units that last as long as a video frame is displayed (33 msec). With an
audio sampling frequency o f 44,100 Hz, an audio data unit can thus be as large as
1470 samples, or 11,760 bytes (assuming each sam ple is 16 bits). In practice,
larger units lasting 40 or even 80 m sec can be tolerated (Steinmetz, 1996).

Synchronization Mechanisms

Let us now see how synchronization is actually done. T w o issues need to be


distinguished: (1) the basic mechanisms for synchronizing tw o streams, and (2)
the distribution of those mechanisms in a networked environment.

Synchronization mechanisms can be view ed at several different levels o f


abstraction. At the low est level, synchronization is done explicitly by operating on
the data units o f sim ple streams. This principle is shown in Fig. 4-29. In essence,
there is a process that simply executes read and write operations on several sim ple
streams, ensuring that those operations adhere to specific timing and synchroniza­
tion constraints.

Figure 4-29. The principle of explicit synchronization on the level of data units.
For example, consider a m ovie that is presented as tw o input streams. The
video stream contains uncom pressed low-quality im ages o f 320x240 pixels, each
encoded by a single byte, leading to video data units o f 76,800 bytes each.
A ssum e that im ages are to be displayed at 30 Hz, or one im age every 33 msec.
The audio stream is assumed to contain audio samples grouped into units of 11,760
bytes, each corresponding to 33 m s o f audio, as explained above. If the input proc­
ess can handle 2.5 MB/sec, w e can achieve lip synchronization by sim ply alternat­
ing between reading an im age and reading a block o f audio sam ples every 33 ms.
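
The alternating read loop just described might look as follows in C. The unit sizes are those from the example; the I/O routines and the timer are placeholders standing in for real device interfaces.

/* Sketch of explicit, low-level lip synchronization: one process alternates
 * between reading a 76,800-byte video frame and an 11,760-byte block of
 * audio samples every 33 ms. */
#include <stdint.h>
#include <stdio.h>

#define VIDEO_UNIT 76800u    /* 320x240 pixels, 1 byte each           */
#define AUDIO_UNIT 11760u    /* 33 ms of 44,100 Hz, 16-bit mono audio */
#define PERIOD_MS  33

/* Placeholder I/O routines (assumed to exist on a real system). */
static void read_video(uint8_t *buf)  { (void)buf; }
static void read_audio(uint8_t *buf)  { (void)buf; }
static void display(const uint8_t *v) { (void)v; }
static void play(const uint8_t *a)    { (void)a; }
static void wait_until_ms(long t)     { (void)t; }  /* sleep until time t */

int main(void)
{
    static uint8_t video[VIDEO_UNIT], audio[AUDIO_UNIT];
    long deadline = 0;

    for (int frame = 0; frame < 10; frame++) {
        read_video(video);          /* one image ...                    */
        read_audio(audio);          /* ... and the matching 33 ms audio */
        display(video);
        play(audio);
        deadline += PERIOD_MS;      /* next data units are due 33 ms later */
        wait_until_ms(deadline);
    }
    printf("done\n");
    return 0;
}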
The drawback o f this approach is that the application is made com pletely
responsible for im plem enting synchronization while it has only low-level facilities
available. A better approach is to offer an application an interface that allows it to
m ore easily control streams and devices. Returning to our example, assume that
the video display has a control interface that allow s it to specify the rate at which
im ages should be displayed. In addition, the interface offers the facility to register
a user-defined handler that is called each time k new im ages have arrived. An
analogous interface is offered by the audio device. With these control interfaces,
an application developer can write a sim ple monitor program consisting o f two
handlers, one for each stream, that jointly check if the video and audio stream are
sufficiently synchronized, and if necessary, adjust the rate at which video or audio
units are presented.
This last exam ple is illustrated in Fig. 4-30, and is typical for many m ul­
timedia m iddleware systems. In effect, multimedia middleware offers a collection
o f interfaces for controlling audio and video streams, including interfaces for con ­
trolling devices such as monitors, cameras, microphones, etc. Each device and

stream has its own high-level interfaces, including interfaces for notifying an ap­
plication when som e event occurred. The latter are subsequently used to write
handlers for synchronizing streams. Exam ples o f such interfaces are given in Blair
and Stefani (1998).

Figure 4-30. The principle of synchronization as supported by high-level interfaces.

The distribution o f synchronization m echanism s is another issue that needs to


be looked at. First, the receiving side o f a com plex stream consisting o f sub­
streams that require synchronization, needs to know exactly what to do. In other
words, it must have a com plete synchronization specification locally available.
C om m on practice is to provide this information im plicitly by m ultiplexing the dif­
ferent streams into a single stream containing all data units, including those for
synchronization.
This latter approach to synchronization is follow ed for M PE G streams. The
MPEG (Motion Picture Experts Group) standards form a collection o f algo­
rithms for com pressing video and audio. Several M PEG standards exist. MPEG-2,
for example, was originally designed for com pressing broadcast quality video into
4 to 6 Mbps. In MPEG-2, an unlimited number o f continuous and discrete streams
can be m erged into a single stream. Each input stream is first turned into a stream
o f packets that carry a timestamp based on a 90-kHz system clock. These streams
are subsequently m ultiplexed into a program stream then consisting o f variable-
length packets, but which have in com m on that they all have the same time base.
The receiving side demultiplexes the stream, again using the timestamps o f each
packet as the basic mechanism for interstream synchronization.
Another important issue is whether synchronization should take place at the
sending or the receiving side. If the sender handles synchronization, it may be
possible to m erge streams into a single stream with a different type o f data unit.
Consider again a stereo audio stream consisting o f tw o substreams, one for each
channel. One possibility is to transfer each stream independently to the receiver
and let the latter synchronize the samples pairwise. Obviously, as each substream
may be subject to different delays, synchronization can be extremely difficult. A

better approach is to m erge the two substreams at the sender. The resulting stream
consists o f data units consisting o f pairs o f samples, one for each channel. The re­
ceiver now m erely has to read in a data unit, and split it into a left and right sam ­
ple. Delays for both channels are now identical.

4.5 MULTICAST COMMUNICATION


An important topic in comm unication in distributed system s is the support for
sending data to m ultiple receivers, also known as m ulticast communication. For
many years, this topic has belonged to the domain o f network protocols, where
numerous proposals for network-level and transport-level solutions have been im ­
plemented and evaluated (Janic, 2005; and Obraczka, 1998). A major issue in all
solutions was setting up the comm unication paths for information dissemination.
In practice, this involved a huge management effort, in many cases requiring
human intervention. In addition, as long as there is no convergence of proposals,
ISPs have been reluctant to support multicasting (Diot et al., 2000).
With the advent o f peer-to-peer technology, and notably structured overlay
management, it becam e easier to set up com m unication paths. As peer-to-peer
solutions are typically deployed at the application layer, various application-level
multicasting techniques have been introduced. In this section, w e w ill take a brief
look at these techniques.
Multicast communication can also be accomplished in other ways than setting
up explicit com m unication paths. As w e also explore in this section, gossip-based
information dissem ination provides sim ple (yet often less efficient) ways for m ul­
ticasting.

4.5.1 Application-Level Multicasting

The basic idea in application-level multicasting is that nodes organize into an
overlay network, which is then used to disseminate information to its members.
An important observation is that network routers are not involved in group
membership. A s a consequence, the connections between nodes in the overlay
network may cross several physical links, and as such, routing m essages within
the overlay may not be optimal in com parison to what could have been achieved
by network-level routing.
A crucial design issue is the construction o f the overlay network. In essence,
there are two approaches (El-Sayed, 2003). First, nodes may organize themselves
directly into a tree, m eaning that there is a unique (overlay) path between every
pair o f nodes. An alternative approach is that nodes organize into a mesh network
in which every node w ill have multiple neighbors and, in general, there exist m ul­
tiple paths betw een every pair o f nodes. The main difference between the tw o is
that the latter generally provides higher robustness: if a connection breaks (e.g.,


because a node fails), there will still be an opportunity to disseminate information
without having to immediately reorganize the entire overlay network.
To make matters concrete, let us consider a relatively simple scheme for con­
structing a multicast tree in Chord, which we described in Chap. 2. This scheme
was originally proposed for Scribe (Castro et al., 2002) which is an application-
level multicasting scheme built on top of Pastry (Rowstron and Druschel, 2001).
The latter is also a DHT-based peer-to-peer system.
Assume a node wants to start a multicast session. To this end, it simply gen­
erates a multicast identifier, say mid, which is just a randomly chosen 160-bit key.
It then looks up succ(mid), which is the node responsible for that key, and pro­
motes it to become the root of the multicast tree that will be used for sending data
to interested nodes. In order to join the tree, a node P simply executes the opera­
tion LOOKUP(mid), having the effect that a lookup message with the request to
join the multicast group mid will be routed from P to succ(mid). As we men­
tioned before, the routing algorithm itself will be explained in detail in Chap. 5.
On its way toward the root, the join request will pass several nodes. Assume it
first reaches node Q. If Q had never seen a join request for mid before, it will
become a forwarder for that group. At that point, P will become a child of Q,
whereas the latter will continue to forward the join request to the root. If the next
node on the route, say R, is also not yet a forwarder, it will become one, record
Q as its child, and continue to send the join request.
On the other hand, if Q (or R) is already a forwarder for mid, it will also
record the previous sender as its child (i.e., P or Q, respectively), but there will
not be a need to send the join request to the root anymore, as Q (or R) will already
be a member of the multicast tree.
Nodes such as P that have explicitly requested to join the multicast tree are,
by definition, also forwarders. The result of this scheme is that we construct a
multicast tree across the overlay network with two types of nodes: pure forward­
ers that act as helpers, and nodes that are also forwarders, but have explicitly re­
quested to join the tree. Multicasting is now simple: a node merely sends a multi­
cast message toward the root of the tree by again executing the LOOKUP(mid) op­
eration, after which that message can be sent along the tree.
We note that this high-level description of multicasting in Scribe does not do
justice to its original design. The interested reader is therefore encouraged to take
a look at the details, which can be found in Castro et al. (2002).
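The following sketch (Python) mimics how forwarders are created as a join request travels toward the root. The lookup_path function, which stands in for the DHT routing performed by Chord or Pastry, is an assumption of this illustration, not part of Scribe itself.

# forwarders[mid] maps each forwarder of group mid to the set of its children.
# lookup_path(p, mid) is assumed to return the overlay route from p to succ(mid),
# starting with p itself and ending at the root of the multicast tree.

def join(p, mid, forwarders, lookup_path):
    """Node p joins the multicast group mid, creating forwarders along the route."""
    tree = forwarders.setdefault(mid, {})
    tree.setdefault(p, set())                  # an explicitly joining node is also a forwarder
    prev = p
    for node in lookup_path(p, mid)[1:]:
        already_forwarder = node in tree
        tree.setdefault(node, set()).add(prev)  # record the previous sender as a child
        if already_forwarder:
            return                              # the rest of the tree already exists
        prev = node

def multicast(msg, root, mid, forwarders, deliver):
    """Send msg from the root down the tree; deliver is called at every tree node
    (a pure forwarder would only forward, but the traversal is the same)."""
    tree = forwarders.get(mid, {})
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deliver(node, msg)
        stack.extend(tree.get(node, ()))

# Example with a fixed routing function: P and S both route toward the root R via Q.
path = {'P': ['P', 'Q', 'R'], 'S': ['S', 'Q', 'R']}
fwd = {}
join('P', 'mid1', fwd, lambda p, m: path[p])
join('S', 'mid1', fwd, lambda p, m: path[p])
multicast("hello", 'R', 'mid1', fwd, lambda n, m: print(n, m))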

Overlay Construction

From the high-level description given above, it should be clear that although
building a tree by itself is not that difficult once we have organized the nodes into
an overlay, building an efficient tree may be a different story. Note that in our
description so far, the selection of nodes that participate in the tree does not take


into account any performance metrics: it is purely based on the (logical) routing of
messages through the overlay.

Figure 4-31. The relation between links in an overlay and actual network-level routes.

To understand the problem at hand, take a look at Fig. 4-31 which shows a
small set of four nodes that are organized in a simple overlay network, with node
A forming the root of a multicast tree. The costs for traversing a physical link are
also shown. Now, whenever A multicasts a message to the other nodes, it is seen
that this message will traverse each of the links <B, Rb>, <Ra, Rb>, <Rc, Rd>,
and <D, Rd> twice. The overlay network would have been more efficient if we
had not constructed an overlay link from B to D, but instead from A to C. Such a
configuration would have saved the double traversal across links <Ra, Rb> and
<Rc, Rd>.
The quality of an application-level multicast tree is generally measured by
three different metrics: link stress, stretch, and tree cost. Link stress is defined
per link and counts how often a packet crosses the same link (Chu et al., 2002). A
link stress greater than 1 comes from the fact that although at a logical level a
packet may be forwarded along two different connections, part of those connec­
tions may actually correspond to the same physical link, as we showed in Fig. 4-
31.
The stretch or Relative Delay Penalty (RDP) measures the ratio in the delay
between two nodes in the overlay, and the delay that those two nodes would
experience in the underlying network. For example, in the overlay network, mes­
sages from B to C follow the route B → Rb → Ra → Rc → C, having a total cost
of 59 units. However, messages would have been routed in the underlying net­
work along the path B → Rb → Rd → Rc → C, with a total cost of 47 units, lead­
ing to a stretch of 1.255. Obviously, when constructing an overlay network, the
goal is to minimize the aggregated stretch, or similarly, the average RDP meas­
ured over all node pairs.
Finally, the tree cost is a global metric, generally related to minimizing the
aggregated link costs. For example, if the cost of a link is taken to be the delay be­
tween its two end nodes, then optimizing the tree cost boils down to finding a
minimal spanning tree in which the total time for disseminating information to all
nodes is minimal.
To simplify matters somewhat, assume that a multicast group has an associ­
ated and well-known node that keeps track of the nodes that have joined the tree.
When a new node issues a join request, it contacts this rendezvous node to obtain
a (potentially partial) list of members. The goal is to select the best member that
can operate as the new node’ s parent in the tree. Who should it select? There are
many alternatives and different proposals often follow very different solutions.
Consider, for example, a multicast group with only a single source. In this
case, the selection of the best node is obvious: it should be the source (because in
that case we can be assured that the stretch will be equal to 1). However, in doing
so, we would introduce a star topology with the source in the middle. Although
simple, it is not difficult to imagine the source may easily become overloaded. In
other words, selection of a node will generally be constrained in such a way that
only those nodes may be chosen that have k or fewer neighbors, with k being a
design parameter. This constraint severely complicates the tree-establishment al­
gorithm, as a good solution may require that part of the existing tree is reconfig­
ured.
Tan et al. (2003) provide an extensive overview and evaluation of various
solutions to this problem. As an illustration, let us take a closer look at one specif­
ic family, known as switch-trees (Helder and Jamin, 2002). The basic idea is
simple. Assume we already have a multicast tree with a single source as root. In
this tree, a node P can switch parents by dropping the link to its current parent in
favor of a link to another node. The only constraints imposed on switching links are
that the new parent can never be a member of the subtree rooted at P (as this
would partition the tree and create a loop), and that the new parent will not have
too many immediate children. The latter is needed to limit the load of forwarding
messages by any single node.
There are different criteria for deciding to switch parents. A simple one is to
optimize the route to the source, effectively minimizing the delay when a message
is to be multicast. To this end, each node regularly receives information on other
nodes (we will explain one specific way of doing this below). At that point, the
node can evaluate whether another node would be a better parent in terms of delay
along the route to the source, and if so, initiates a switch.
Another criterion could be whether the delay to the potential other parent is
lower than to the current parent. If every node takes this as a criterion, then the
aggregated delays of the resulting tree should ideally be minimal. In other words,
this is an example of optimizing the cost of the tree as we explained above. How­
ever, more information would be needed to construct such a tree, but as it turns
out, this simple scheme is a reasonable heuristic leading to a good approximation
of a minimal spanning tree.
As an example, consider the case where a node P receives information on the
neighbors of its parent. Note that the neighbors consist of P's grandparent, along
with the other siblings of P's parent. Node P can then evaluate the delays to each
of these nodes and subsequently choose the one with the lowest delay, say Q , as
its new parent. To that end, it sends a switch request to Q . To prevent loops from
being formed due to concurrent switching requests, a node that has an outstanding
switch request will simply refuse to process any incoming requests. In effect, this
leads to a situation where only completely independent switches can be carried
out simultaneously. Furthermore, P will provide Q with enough information to
allow the latter to conclude that both nodes have the same parent, or that Q is the
grandparent.
An important problem that we have not yet addressed is node failure. In the
case of switch-trees, a simple solution is proposed: whenever a node notices that
its parent has failed, it simply attaches itself to the root. At that point, the optimi­
zation protocol can proceed as usual and will eventually place the node at a good
point in the multicast tree. Experiments described in Helder and Jamin (2002)
show that the resulting tree is indeed close to a minimal spanning one.
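A sketch of the switching rule, assuming each node can measure its delay to a small set of candidate nodes, might look as follows; the subtree test and the fan-out limit correspond to the two constraints mentioned above. This is only an illustration of the idea, not the algorithm of Helder and Jamin.

MAX_CHILDREN = 4   # design parameter k: maximum fan-out per node

class TreeNode:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = set()

def in_subtree(root, node):
    """True if node lies in the subtree rooted at root."""
    stack = [root]
    while stack:
        n = stack.pop()
        if n is node:
            return True
        stack.extend(n.children)
    return False

def try_switch(p, candidate, delay):
    """Switch p's parent to candidate if the switch is legal and reduces delay.
    delay(a, b) is assumed to return the measured delay between two nodes."""
    if candidate is p or in_subtree(p, candidate):
        return False                       # would partition the tree and create a loop
    if len(candidate.children) >= MAX_CHILDREN:
        return False                       # candidate already has too many children
    if p.parent is not None and delay(p, candidate) >= delay(p, p.parent):
        return False                       # no improvement over the current parent
    if p.parent is not None:
        p.parent.children.discard(p)
    candidate.children.add(p)
    p.parent = candidate
    return True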

4.5.2 Gossip-Based Data Dissemination

An increasingly important technique for disseminating information is to rely
on epidemic behavior. Observing how diseases spread among people, researchers
have since long investigated whether simple techniques could be developed for
spreading information in very large-scale distributed systems. The main goal of
these epidemic protocols is to rapidly propagate information among a large col­
lection of nodes using only local information. In other words, there is no central
component by which information dissemination is coordinated.
To explain the general principles of these algorithms, we assume that all up-
dates for a specific data item are initiated at a single node. In this way, we simply
avoid write-write conflicts. The following presentation is based on the classical
paper by Demers et al. (1987) on epidemic algorithms. A recent overview of epi­
demic information dissemination can be found in Eugster et al. (2004).

Information Dissemination Models

As the name suggests, epidemic algorithms are based on the theory of epi­
demics, which studies the spreading of infectious diseases. In the case of large-
scale distributed systems, instead of spreading diseases, they spread information.
Research on epidemics for distributed systems also aims at a completely different
goal: whereas health organizations will do their utmost best to prevent infectious
diseases from spreading across large groups of people, designers of epidemic al­
gorithms for distributed systems will try to “ infect”all nodes with new informa­
tion as fast as possible.
Using the terminology from epidemics, a node that is part of a distributed sys­
tem is called infected if it holds data that it is willing to spread to other nodes. A
node that has not yet seen this data is called susceptible. Finally, an updated
node that is not willing or able to spread its data is said to be removed. Note that
we assume we can distinguish old from new data, for example, because it has
been timestamped or versioned. In this light, nodes are also said to spread updates.
A popular propagation model is that of anti-entropy. In this model, a node P
picks another node Q at random, and subsequently exchanges updates with Q.
There are three approaches to exchanging updates:

1. P only pushes its own updates to Q

2. P only pulls in new updates from Q

3. P and Q send updates to each other (i.e., a push-pull approach)

When it comes to rapidly spreading updates, only pushing updates turns out to
be a bad choice. Intuitively, this can be understood as follows. First, note that in a
pure push-based approach, updates can be propagated only by infected nodes.
However, if many nodes are infected, the probability of each one selecting a sus­
ceptible node is relatively small. Consequently, chances are that a particular node
remains susceptible for a long period simply because it is not selected by an
infected node.
In contrast, the pull-based approach works much better when many nodes are
infected. In that case, spreading updates is essentially triggered by susceptible
nodes. Chances are large that such a node will contact an infected one to subse­
quently pull in the updates and become infected as well.
It can be shown that if only a single node is infected, updates will rapidly
spread across all nodes using either form of anti-entropy, although push-pull
remains the best strategy (Jelasity et al., 2005a). Define a round as spanning a
period in which every node will at least once have taken the initiative to exchange
updates with a randomly chosen other node. It can then be shown that the number
of rounds to propagate a single update to all nodes takes O(log(N)), where N is
the number of nodes in the system. This indicates indeed that propagating updates
is fast, but above all scalable.
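The behavior of the three exchange styles is easy to reproduce in a small simulation; the sketch below is a simplified model under the stated assumptions, not the protocol of any particular system, and simply counts the rounds needed until every node holds the update.

import random

def anti_entropy_rounds(n, mode="push-pull", seed=1):
    """Rounds until all n nodes are infected, starting from a single infected node."""
    random.seed(seed)
    infected = {0}
    rounds = 0
    while len(infected) < n:
        rounds += 1
        newly = set()
        for p in range(n):                 # every node contacts one random peer per round
            q = random.randrange(n)
            if mode in ("push", "push-pull") and p in infected:
                newly.add(q)               # P pushes its update to Q
            if mode in ("pull", "push-pull") and q in infected:
                newly.add(p)               # P pulls the update from Q
        infected |= newly
    return rounds

for m in ("push", "pull", "push-pull"):
    print(m, anti_entropy_rounds(10000, m))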
One specific variant of this approach is called rumor spreading, or simply
gossiping. It works as follows. If node P has just been updated for data item x, it
contacts an arbitrary other node Q and tries to push the update to Q . However, it is
possible that Q was already updated by another node. In that case, P may lose
interest in spreading the update any further, say with probability 1/k. In other
words, it then becomes removed.
Gossiping is completely analogous to real life. When Bob has some hot news
to spread around, he may phone his friend Alice telling her all about it. Alice, like
Bob, will be really excited to spread the gossip to her friends as well. However,
she will become disappointed when phoning a friend, say Chuck, only to hear that

the news has already reached him. Chances are that she will stop phoning other
friends, for what good is it if they already know?
Gossiping turns out to be an excellent way of rapidly spreading news. How­
ever, it cannot guarantee that all nodes will actually be updated (Demers et al.,
1987). It can be shown that when there is a large number of nodes that participate
in the epidemics, the fraction s of nodes that will remain ignorant of an update,
that is, remain susceptible, satisfies the equation:
s = e^(-(k+1)(1-s))

Fig. 4-32 shows ln(s) as a function of k. For example, if k = 4, ln(s) = -4.97,
so that s is less than 0.007, meaning that less than 0.7% of the nodes remain sus­
ceptible. Nevertheless, special measures are needed to guarantee that those nodes
will also be updated. Combining anti-entropy with gossiping will do the trick.
Figure 4-32. The relation between the fraction s of update-ignorant nodes and
the parameter k in pure gossiping. The graph displays ln(s) as a function of k.
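The value of s for a given k can be obtained by simple fixed-point iteration on the equation s = e^(-(k+1)(1-s)) given above; the short sketch below (my own illustration, not from the chapter) reproduces ln(s) ≈ -4.97 for k = 4.

import math

def ignorant_fraction(k, iterations=100):
    """Solve s = exp(-(k+1)(1-s)) by fixed-point iteration."""
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))
    return s

for k in range(1, 8):
    s = ignorant_fraction(k)
    print(k, round(s, 6), round(math.log(s), 2))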

One of the main advantages of epidemic algorithms is their scalability, due to
the fact that the number of synchronizations between processes is relatively small
compared to other propagation methods. For wide-area systems, Lin and Marzullo
(1999) show that it makes sense to take the actual network topology into account
to achieve better results. In their approach, nodes that are connected to only a few
other nodes are contacted with a relatively high probability. The underlying
assumption is that such nodes form a bridge to other remote parts of the network;
therefore, they should be contacted as soon as possible. This approach is referred
to as directional gossiping and comes in different variants.
This problem touches upon an important assumption that most epidemic solu­
tions make, namely that a node can randomly select any other node to gossip with.
This implies that, in principle, the complete set of nodes should be known to each
member. In a large system, this assumption can never hold.

Fortunately, there is no need to have such a list. As we explained in Chap. 2,
maintaining a partial view that is more or less continuously updated will organize
the collection of nodes into a random graph. By regularly updating the partial
view of each node, random selection is no longer a problem.

Removing Data

Epidemic algorithms are extremely good for spreading updates. However,
they have a rather strange side-effect: spreading the deletion of a data item is
hard. The essence of the problem lies in the fact that deletion of a data item des­
troys all information on that item. Consequently, when a data item is simply re­
moved from a node, that node will eventually receive old copies of the data item
and interpret those as updates on something it did not have before.
The trick is to record the deletion of a data item as just another update, and
keep a record of that deletion. In this way, old copies will not be interpreted as
something new, but merely treated as versions that have been updated by a delete
operation. The recording of a deletion is done by spreading death certificates.
Of course, the problem with death certificates is that they should eventually
be cleaned up, or otherwise each node will gradually build a huge local database
of historical information on deleted data items that is otherwise not used. Demers
et al. (1987) propose to use what they call dormant death certificates. Each death
certificate is timestamped when it is created. If it can be assumed that updates
propagate to all nodes within a known finite time, then death certificates can be
removed after this maximum propagation time has elapsed.
However, to provide hard guarantees that deletions are indeed spread to all
nodes, only a very few nodes maintain dormant death certificates that are never
thrown away. Assume node P has such a certificate for data item x. If by any
chance an obsolete update for x reaches P , P will react by simply spreading the
death certificate for x again.
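A minimal sketch of this bookkeeping is shown below, under the assumption that each replica keeps one timestamped entry per item and that a deletion is stored as a tombstone value; names such as ReplicaStore and purge_certificates are mine, not from Demers et al. The anti-entropy exchange itself would use the gossiping machinery described earlier.

import time

class ReplicaStore:
    """Per-node store: item -> (timestamp, value); value None marks a death certificate."""
    def __init__(self):
        self.items = {}

    def update(self, key, value, ts=None):
        ts = ts if ts is not None else time.time()
        old = self.items.get(key)
        if old is None or ts > old[0]:     # newer timestamps win
            self.items[key] = (ts, value)

    def delete(self, key):
        # Record the deletion as just another (newer) update: a death certificate.
        self.update(key, None)

    def merge_from(self, other):
        # Anti-entropy merge: an old copy of a deleted item can never resurrect it,
        # because the death certificate carries the more recent timestamp.
        for key, (ts, value) in other.items.items():
            self.update(key, value, ts)

    def purge_certificates(self, max_propagation_time):
        # Ordinary nodes drop certificates after the maximum propagation time;
        # dormant certificates would be kept only on a few designated nodes.
        now = time.time()
        for key in [k for k, (ts, v) in self.items.items()
                    if v is None and now - ts > max_propagation_time]:
            del self.items[key]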

Applications

To finalize this presentation, let us take a look at some interesting applications
of epidemic protocols. We already mentioned spreading updates, which is perhaps
the most widely-deployed application. Also, in Chap. 2 we discussed how provid­
ing positioning information about nodes can assist in constructing specific topolo­
gies. In the same light, gossiping can be used to discover nodes that have a few
outgoing wide-area links, to subsequently apply directional gossiping as we men­
tioned above.
Another interesting application area is simply collecting, or actually aggregat­
ing information (Jelasity et al., 2005b). Consider the following information

exchange. Every node i initially chooses an arbitrary number, say x_i. When node i
contacts node j, they each update their value as:
x_i, x_j ← (x_i + x_j) / 2

Obviously, after this exchange, both i and j will have the same value. In fact, it is
not difficult to see that eventually all nodes will have the same value, namely the
average of all initial values. Propagation speed is again exponential.
What use does computing the average have? Consider the situation that all
nodes i have set x_i to zero, except for x_1, which has set it to 1.
If there are N nodes, then eventually each node will compute the average, which is
1/N. As a consequence, every node i can estimate the size of the system as being
1/x_i. This information alone can be used to dynamically adjust various system pa­
rameters. For example, the size of the partial view (i.e., the number of neighbors
that each node keeps track of) should be dependent on the total number of parti­
cipating nodes. Knowing this number will allow a node to dynamically adjust the
size of its partial view. Indeed, this can be viewed as a property of self-manage-
ment.
Computing the average may prove to be difficult when nodes regularly join
and leave the system. One practical solution to this problem is to introduce
epochs. Assuming that node 1 is stable, it simply starts a new epoch now and then.
When node i sees a new epoch for the first time, it resets its own variable x_i to
zero and starts computing the average again.
Of course, other results can also be computed. For example, instead of having
a fixed node (x_1) start the computation of the average, we can easily pick a ran­
dom node as follows. Every node i initially sets x_i to a random number from the
same interval, say [0,1], and also stores it permanently as m_i. Upon an exchange
between nodes i and j, each change their value to:

x_i, x_j ← max(x_i, x_j)
Each node i for which m_i < x_i will lose the competition for being the initiator in
starting the computation of the average. In the end, there will be a single winner.
Of course, although it is easy to conclude that a node has lost, it is much more dif­
ficult to decide that it has won, as it remains uncertain whether all results have
come in. The solution to this problem is to be optimistic: a node always assumes it
is the winner until proven otherwise. At that point, it simply resets the variable it
is using for computing the average to zero. Note that by now, several different
computations (in our example computing a maximum and computing an average)
may be executing concurrently.
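Both computations described above are easy to sketch; in the simulation below (my own illustration, not taken from the chapter) node 0 starts the size estimation with the value 1, and the max-based exchange elects the initiator.

import random

def average_exchange(x, i, j):
    x[i] = x[j] = (x[i] + x[j]) / 2.0      # both nodes adopt the pairwise average

def max_exchange(x, i, j):
    x[i] = x[j] = max(x[i], x[j])          # both nodes adopt the maximum

def gossip(x, exchange, rounds, seed=3):
    random.seed(seed)
    n = len(x)
    for _ in range(rounds):
        i, j = random.randrange(n), random.randrange(n)
        if i != j:
            exchange(x, i, j)

n = 50

# Size estimation: node 0 starts with 1, all others with 0; every node converges
# toward the average 1/N and can therefore estimate N as 1/x_i.
x = [1.0] + [0.0] * (n - 1)
gossip(x, average_exchange, rounds=2000)
print("estimated size at node 0:", round(1.0 / x[0]))   # close to n once converged

# Initiator election: each node keeps m_i (its original draw) and gossips the max;
# a node has lost as soon as m_i < x_i.
m = [random.random() for _ in range(n)]
x = list(m)
gossip(x, max_exchange, rounds=2000)
winners = [i for i in range(n) if m[i] == x[i]]
print("remaining winner(s):", winners)   # with enough rounds, exactly one node remains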


4.6 SUMMARY

Having powerful and flexible facilities for communication between processes
is essential for any distributed system. In traditional network applications,
communication is often based on the low-level message-passing primitives offered
by the transport layer. An important issue in middleware systems is to offer a
higher level of abstraction that will make it easier to express communication
between processes than the support offered by the interface to the transport layer.
One of the most widely used abstractions is the Remote Procedure Call (RPC).
The essence of an RPC is that a service is implemented by means of a procedure,
of which the body is executed at a server. The client is offered only the signature
of the procedure, that is, the procedure's name along with its parameters. When
the client calls the procedure, the client-side implementation, called a stub, takes
care of wrapping the parameter values into a message and sending that to the
server. The latter calls the actual procedure and returns the results, again in a
message. The client's stub extracts the result values from the return message and
passes it back to the calling client application.
RPCs offer synchronous communication facilities, by which a client is blocked
until the server has sent a reply. Although variations of either mechanism exist by
which this strict synchronous model is relaxed, it turns out that general-purpose,
high-level message-oriented models are often more convenient.
In message-oriented models, the issues are whether or not communication is
persistent, and whether or not communication is synchronous. The essence of
persistent communication is that a message that is submitted for transmission is
stored by the communication system as long as it takes to deliver it. In other
words, neither the sender nor the receiver needs to be up and running for message
transmission to take place. In transient communication, no storage facilities are
offered, so that the receiver must be prepared to accept the message when it is
sent.
In asynchronous communication, the sender is allowed to continue immediately
after the message has been submitted for transmission, possibly before it has even
been sent. In synchronous communication, the sender is blocked at least until a
message has been received. Alternatively, the sender may be blocked until message
delivery has taken place or even until the receiver has responded, as with RPCs.
Message-oriented middleware models generally offer persistent asynchronous
communication, and are used where RPCs are not appropriate. They are often
used to assist the integration of (widely-dispersed) collections of databases into
large-scale information systems. Other applications include e-mail and workflow.
A very different form of communication is that of streaming, in which the
issue is whether or not two successive messages have a temporal relationship. In
continuous data streams, a maximum end-to-end delay is specified for each
message. In addition, it is also required that messages are sent subject to a minimum
end-to-end delay. Typical examples of such continuous data streams are video and
audio streams. Exactly what the temporal relations are, or what is expected from
the underlying communication subsystem in terms of quality of service, is often
difficult to specify and to implement. A complicating factor is the role of jitter.
Even if the average performance is acceptable, substantial variations in delivery
time may lead to unacceptable performance.
Finally, an important class of communication protocols in distributed systems
is multicasting. The basic idea is to disseminate information from one sender to
multiple receivers. We have discussed two different approaches. First, multicasting
can be achieved by setting up a tree from the sender to the receivers. Considering
that it is now well understood how nodes can self-organize into peer-to-peer
systems, solutions have also appeared to dynamically set up trees in a decentralized
fashion.
Another important class of dissemination solutions deploys epidemic protocols.
These protocols have proven to be very simple, yet extremely robust. Apart from
merely spreading messages, epidemic protocols can also be efficiently deployed
for aggregating information across a large distributed system.

PROBLEMS

1. In many layered protocols, each layer has its own header. Surely it would be more
efficient to have a single header at the front of each message with all the control in it
than all these separate headers. Why is this not done?
2. Why are transport-level communication services often inappropriate for building dis­
tributed applications?
3. A reliable multicast service allows a sender to reliably pass messages to a collection of
receivers. Does such a service belong to a middleware layer, or should it be part of a
lower-level layer?
4. Consider a procedure incr with two integer parameters. The procedure adds one to
each parameter. Now suppose that it is called with the same variable twice, for ex­
ample, as incr(i, i). If i is initially 0, what value will it have afterward if call-by-refer-
ence is used? How about if copy/restore is used?
5. C has a construction called a union, in which a field of a record (called a struct in C)
can hold any one of several alternatives. At run time, there is no sure-fire way to tell
which one is in there. Does this feature of C have any implications for remote proce­
dure call? Explain your answer.
6. One way to handle parameter conversion in RPC systems is to have each machine
send parameters in its native representation, with the other one doing the translation, if
need be. The native system could be indicated by a code in the first byte. However,
since locating the first byte in the first word is precisely the problem, can this work?


7. Assume a client calls an asynchronous RPC to a server, and subsequently waits until
the server returns a result using another asynchronous RPC. Is this approach the same
as letting the client execute a normal RPC? What if we replace the asynchronous
RPCs with synchronous RPCs?

8. Instead of letting a server register itself with a daemon as in DCE, we could also
choose to always assign it the same end point. That end point can then be used in ref­
erences to objects in the server’s address space. What is the main drawback of this
scheme?

9. Would it be useful also to make a distinction between static and dynamic RPCs?
10. Describe how connectionless communication between a client and a server proceeds
when using sockets.
11. Explain the difference between the primitives MPI_bsend and MPI_isend in MPI.
12. Suppose that you could make use of only transient asynchronous communication
primitives, including only an asynchronous receive primitive. How would you imple­
ment primitives for transient synchronous communication?

13. Suppose that you could make use of only transient synchronous communication primi­
tives. How would you implement primitives for transient asynchronous communica­
tion?

14. Does it make sense to implement persistent asynchronous communication by means of


RPCs?

15. In the text we stated that in order to automatically start a process to fetch messages
from an input queue, a daemon is often used that monitors the input queue. Give an
alternative implementation that does not make use of a daemon.
16. Routing tables in IBM WebSphere, and in many other message-queuing systems, are
configured manually. Describe a simple way to do this automatically.
17. With persistent communication, a receiver generally has its own local buffer where
messages can be stored when the receiver is not executing. To create such a buffer, we
may need to specify its size. Give an argument why this is preferable, as well as one
against specification of the size.
18. Explain why transient synchronous communication has inherent scalability problems,
and how these could be solved.
19. Give an example where multicasting is also useful for discrete data streams.
20. Suppose that in a sensor network measured temperatures are not timestamped by the
sensor, but are immediately sent to the operator. Would it be enough to guarantee only
a maximum end-to-end delay?
21. How could you guarantee a maximum end-to-end delay when a collection of com­
puters is organized in a (logical or physical) ring?
22. How could you guarantee a minimum end-to-end delay when a collection of com­
puters is organized in a (logical or physical) ring?

tional groups across databases. Splitting data within functional areas across
multiple databases, or sharding,1 adds the second dimension to horizontal scaling.
The diagram in figure 1 illustrates horizontal data-scaling strategies.
As figure 1 illustrates, both approaches to horizontal scaling can be applied at
once. Users, products, and transactions can be in separate databases. Additionally,
each functional area can be split across multiple databases for transactional
capacity. As shown in the diagram, functional areas can be scaled independently
of one another.

FUNCTIONAL PARTITIONING
Functional partitioning is important for achieving high degrees of scalability.
Any good database architecture will decompose the schema into tables grouped
by functionality. Users, products, transactions, and communication are examples
of functional areas. Leveraging database concepts such as foreign keys is a
common approach for maintaining consistency across these functional areas.
Relying on database constraints to ensure consistency across functional groups
creates a coupling of the schema
are updated. Using an ACID-style transaction, the SQL would be as shown in
figure 3.
The total bought and sold columns in the user table can be considered a cache
of the transaction table. It is present for efficiency of the system. Given this, the
constraint on consistency could be relaxed. The buyer and seller expectations can
be set so their running balances do not reflect the result of a transaction
immediately. This is not uncommon, and in fact people encounter this delay
between a transaction and their running balance regularly (e.g., ATM withdrawals
and cellphone calls).
How the SQL statements are modified to relax consistency depends upon how
the running balances are defined. If they are simply estimates, meaning that some
transactions can be missed, the changes are quite simple, as shown in figure 4.

FIG 4
Begin transaction
  Insert into transaction(id, seller_id, buyer_id, amount);
End transaction
Begin transaction
  Update user set amt_sold = amt_sold + $amount where id = $seller_id;
  Update user set amt_bought = amt_bought + $amount where id = $buyer_id;
End transaction

We've now decoupled the updates to the user and transaction tables. Consistency
between the tables is not guaranteed. In fact, a failure between the first and second
transaction will result in the user table being permanently inconsistent, but if the
contract stipulates that the running totals are estimates, this may be adequate.
What if estimates are not acceptable, though? How can you still decouple the
user and transaction updates? Introducing a persistent message queue solves the
problem. There are several choices for implementing persistent messages. The
most critical factor in implementing the queue, however, is ensuring that the
backing persistence is on the same resource as the database. This is necessary to
allow the queue to be transactionally committed without involving a 2PC. Now
the SQL operations look a bit different, as shown in figure 5.
FIG 5
Begin transaction
  Insert into transaction(id, seller_id, buyer_id, amount);
  Queue message "update user("seller", seller_id, amount)";
  Queue message "update user("buyer", buyer_id, amount)";
End transaction
For each message in queue
  Begin transaction
    Dequeue message
    If message.balance == "seller"
      Update user set amt_sold = amt_sold + message.amount
        where id = message.id;
    Else
      Update user set amt_bought = amt_bought + message.amount
        where id = message.id;
    End if
  End transaction
End for

This example takes some liberties with syntax and oversimplifies the logic to
illustrate the concept. By queuing a persistent message within the same transaction
as the insert, the information needed to update the running balances on the user
has been captured. The transaction is contained on a single database instance and
therefore will not impact system availability. A separate message-processing
component will


dequeue each message and apply the information to the user table. The example
appears to solve all of the issues, but there is a problem. The message persistence
is on the transaction host to avoid a 2PC during queuing. If the message is
dequeued inside a transaction involving the user host, we still have a 2PC
situation.
One solution to the 2PC in the message-processing component is to do nothing.
By decoupling the update into a separate back-end component, you preserve the
availability of your customer-facing component. The lower availability of the
message processor may be acceptable for business requirements.
Suppose, however, that 2PC is simply never acceptable in your system. How
can this problem be solved? First, you need to understand the concept of
idempotence. An operation is considered idempotent if it can be applied one time
or multiple times with the same result. Idempotent operations are useful in that
they permit partial failures, as applying them repeatedly does not change the final
state of the system.
The selected example is problematic when looking for idempotence. Update
operations are rarely idempotent. The example increments balance columns in
place. Applying this operation more than once obviously will result in an incorrect
balance. Even update operations that simply set a value, however, are not
idempotent with regard to order of operations. If the system cannot guarantee that
updates will be applied in the order they are received, the final state of the system
will be incorrect. More on this later.
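As a small illustration of the point (Python; the helper names are mine, not the article's), an in-place increment is not idempotent, whereas recording each update under its transaction identifier, as described next, makes redelivery harmless.

def apply_increment(balance, amount):
    # Not idempotent: applying the same update twice changes the result.
    return balance + amount

def apply_once(applied, trans_id, amount):
    """Idempotent: re-applying a message with the same trans_id has no effect."""
    if trans_id not in applied:
        applied[trans_id] = amount
    return sum(applied.values())

applied = {}
assert apply_once(applied, "t1", 10) == 10
assert apply_once(applied, "t1", 10) == 10                 # duplicate delivery is harmless
assert apply_increment(apply_increment(0, 10), 10) == 20   # duplicate changes the state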
In the case of balance updates, you need a way to track which updates have been
applied successfully and which are still outstanding. One technique is to use a
table that records the transaction identifiers that have been applied.
The table shown in figure 6 tracks the transaction ID, which balance has been
updated, and the user ID where the balance was applied. Now our sample
pseudocode is as shown in figure 7.

FIG 7
Begin transaction
  Insert into transaction(id, seller_id, buyer_id, amount);
  Queue message "update user("seller", seller_id, amount)";
  Queue message "update user("buyer", buyer_id, amount)";
End transaction
For each message in queue
  Peek message
  Begin transaction
    Select count() as processed from updates_applied
      where trans_id = message.trans_id
      and balance = message.balance and user_id = message.user_id
    If processed == 0
      If message.balance == "seller"
        Update user set amt_sold = amt_sold + message.amount
          where id = message.id;
      Else
        Update user set amt_bought = amt_bought + message.amount
          where id = message.id;
      End if
      Insert into updates_applied
        (message.trans_id, message.balance, message.user_id);
    End if
  End transaction
  If transaction successful
    Remove message from queue
  End if
End for

This example depends upon being able to peek a message in the queue and
remove it once successfully processed. This can be done with two independent
transactions if necessary: one on the message queue and one on the user database.
Queue operations are not committed unless



as closed loops. We think about the predictability of their behavior in terms of
predictable inputs producing predictable outputs. This is a necessity for creating
correct software systems. The good news in many cases is that using BASE
doesn't change the predictability of a system as a closed loop, but it does require
looking at the behavior in total.
A simple example can help illustrate the point. Consider a system where users
can transfer assets to other users. The type of asset is irrelevant—it could be
money or objects in a game. For this example, we will assume that we have
decoupled the two operations of taking the asset from one user and giving it to
the other with a message queue used to provide the decoupling.
Immediately, this system feels nondeterministic and problematic. There is a
period of time where the asset has left one user and has not arrived at the other.
The size of this time window can be determined by the messaging system design.
Regardless, there is a lag between the begin and end states where neither user
appears to have the asset.
If we consider this from the user's perspective, however, this lag may not be
relevant or even known. Neither the receiving user nor the sending user may
know when the asset arrived. If the lag between sending and receiving is a few
seconds, it will be invisible or certainly tolerable to users who are directly
communicating about the asset transfer. In this situation the system behavior is
considered consistent and acceptable to the users, even though we are relying
upon soft state and eventual consistency in the implementation.

EVENT-DRIVEN ARCHITECTURE
What if you do need to know when state has become consistent? You may have
algorithms that need to be applied to the state but only when it has reached a
consistent state relevant to an incoming request. The simple approach is to rely
on events that are generated as state becomes consistent.
Continuing with the previous example, what if you need to notify the user that
the asset has arrived? Creating an event within the transaction that commits the
asset to the receiving user provides a mechanism for performing further processing
once a known state has been reached. EDA (event-driven architecture) can provide
dramatic improvements in scalability and architectural decoupling. Further
discussion about the application of EDA is beyond the scope of this article.

CONCLUSION
Scaling systems to dramatic transaction rates requires a new way of thinking
about managing resources. The traditional transactional models are problematic
when loads need to be spread across a large number of components. Decoupling
the operations and performing them in turn provides for improved availability
and scale at the cost of consistency. BASE provides a model for thinking about
this decoupling. Q

REFERENCES
1. http://highscalability.com/unorthodox-approach-database-design-coming-shard.
2. http://citeseer.ist.psu.edu/544596.html.

DAN PRITCHETT is a Technical Fellow at eBay where he has been a member of
the architecture team for the past four years. In this role, he interfaces with the
strategy, business, product, and technology teams across eBay marketplaces,
PayPal, and Skype. With more than 20 years of experience at technology
companies such as Sun Microsystems, Hewlett-Packard, and Silicon Graphics,
Pritchett has a depth of technical experience, ranging from network-level protocols
and operating systems to systems design and software patterns. He has a B.S. in
computer science from the University of Missouri, Rolla.
© 2008 ACM 1542-7730/08/0500 $5.00


End-To-End Arguments in System Design
J. H. SALTZER, D. P. REED, and D. D. CLARK
Massachusetts Institute of Technology Laboratory for Computer Science

This paper presents a design principle that helps guide placement of functions among the modules of
a distributed computer system. The principle, called the end-to-end argument, suggests that functions
placed at low levels of a system may be redundant or of little value when compared with the cost of
providing them at that low level. Examples discussed in the paper include bit-error recovery, security
using encryption, duplicate message suppression, recovery from system crashes, and delivery acknowl­
edgment. Low-level mechanisms to support these functions are justified only as performance enhance­
ments.
CR Categories and Subject Descriptors: C.0 [General] Computer System Organization—system
architectures; C.2.2 [Computer-Com m unication Networks]: Network Protocols—protocol archi­
tecture; C.2.4 [Computer-Com m unication Networks]: Distributed Systems; D.4.7 [Operating
Systems]: Organization and Design—distributed systems
General Terms: Design
Additional Key Words and Phrases: Data communication, protocol design, design principles

1. INTRODUCTION
Choosing the proper boundaries between functions is perhaps the primary activity
of the computer system designer. Design principles that provide guidance in this
choice of function placement are among the most important tools of a system
designer. This paper discusses one class of function placement argument that
has been used for many years with neither explicit recognition nor much convic­
tion. However, the emergence of the data communication network as a computer
system component has sharpened this line of function placement argument by
making more apparent the situations in which and the reasons why it applies.
This paper articulates the argument explicitly, so as to examine its nature and
to see how general it really is. The argument appeals to application requirements
and provides a rationale for moving a function upward in a layered system closer
to the application that uses the function. We begin by considering the commu­
nication network version of the argument.

This is a revised version of a paper adapted from End-to-End Arguments in System Design by J. H.
Saltzer, D.P. Reed, and D.D. Clark from the 2nd International Conference on Distributed Systems
(Paris, France, April 8-10) 1981, pp. 509-512. © IEEE 1981
This research was supported in part by the Advanced Research Projects Agency of the U.S.
Department of Defense and monitored by the Office of Naval Research under contract N00014-75-
C-0661.
Authors’address: J. H. Saltzer and D. D. Clark, M.I.T. Laboratory for Computer Science, 545
Technology Square, Cambridge, MA 02139. D. P. Reed, Software Arts, Inc., 27 Mica Lane, Wellesley,
MA 02181.
Permission to copy without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by permission of the Association
for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific
permission.
© 1984 ACM 0734-2071/84/1100-0277 $00.75
ACM Transactions on Computer Systems, Vol. 2, No. 4, November 1984, Pages 277-288.


In a system that includes communications, one usually draws a modular
boundary around the communication subsystem and defines a firm interface
between it and the rest of the system. When doing so, it becomes apparent that
there is a list of functions each of which might be implemented in any of several
ways: by the communication subsystem, by its client, as a joint venture, or
perhaps redundantly, each doing its own version. In reasoning about this choice,
the requirements of the application provide the basis for the following class of
arguments:
The function in question can completely and correctly be implemented only with
the knowledge and help of the application standing at the endpoints of the
communication system. Therefore, providing that questioned function as a feature
of the communication system itself is not possible. (Sometimes an incomplete
version of the function provided by the communication system may be useful as a
performance enhancement.)

We call this line of reasoning against low-level function implementation the
end-to-end argument. The following sections examine the end-to-end argument
in detail, first with a case study of a typical example in which it is used—the
function in question is reliable data transmission—and then by exhibiting the
range o f functions to which the same argument can be applied. For the case of
the data communication system, this range includes encryption, duplicate mes­
sage detection, message sequencing, guaranteed message delivery, detecting host
crashes, and delivery receipts. In a broader context, the argument seems to apply
to many other functions of a computer operating system, including its file system.
Examination of this broader context will be easier, however, if we first consider
the more specific data communication context.

2. CAREFUL FILE TRANSFER

2.1 End-to-End Caretaking


Consider the problem of c a r e f u l f i l e tr a n s fe r . A file is stored by a file system in
the disk storage of computer A. Computer A is linked by a data communication
network with computer B, which also has a file system and a disk store. The
object is to move the file from computer A’ s storage to computer B ’ s storage
without damage, keeping in mind that failures can occur at various points along
the way. The application program in this case is the file transfer program, part
of which runs at host A and part at host B. In order to discuss the possible
threats to the file's integrity in this transaction, let us assume that the following
specific steps are involved:

(1) At host A the file transfer program calls upon the file system to read the file
from the disk, where it resides on several tracks, and the file system passes
it to the file transfer program in fixed-size blocks chosen to be disk format
independent.
(2) Also at host A, the file transfer program asks the data communication system
to transmit the file using some communication protocol that involves splitting
the data into packets. The packet size is typically different from the file
block size and the disk track size.

(3) The data communication network moves the packets from computer A to
computer B.
(4) At host B, a data communication program removes the packets from the
data communication protocol and hands the contained data to a second part
of the file transfer application that operates within host B.
(5) At host B, the file transfer program asks the file system to write the received
data on the disk of host B.
With this model of the steps involved, the following are some of the threats to
the transaction that a careful designer might be concerned about:
(1) The file, though originally written correctly onto the disk at host A, if read
now may contain incorrect data, perhaps because of hardware faults in the
disk storage system.
(2) The software of the file system, the file transfer program, or the data
communication system might make a mistake in buffering and copying the
data of the file, either at host A or host B.
(3) The hardware processor or its local memory might have a transient error
while doing the buffering and copying, either at host A or host B.
(4) The communication system might drop or change the bits in a packet or
deliver a packet more than once.
(5) Either of the hosts may crash part way through the transaction after
performing an unknown amount (perhaps all) of the transaction.
How would a careful file transfer application then cope with this list of threats?
One approach might be to reinforce each of the steps along the way using
duplicate copies, time-out and retry, carefully located redundancy for error
detection, crash recovery, etc. The goal would be to reduce the probability of
each of the individual threats to an acceptably small value. Unfortunately,
systematic countering of threat (2) requires writing correct programs, which is
quite difficult. Also, not all the programs that must be correct are written by the
file transfer-application programmer. If we assume further that all these threats
are relatively low in probability—low enough for the system to allow useful work
to be accomplished—brute force countermeasures, such as doing everything three
times, appear uneconomical.
The alternate approach might be called end-to-end check and retry. Suppose
that as an aid to coping with threat (1), stored with each file is a checksum that
has sufficient redundancy to reduce the chance of an undetected error in the file
to an acceptably negligible value. The application program follows the simple
steps above in transferring the file from A to B. Then, as a final additional step,
the part of the file transfer application residing in host B reads the transferred
file copy back from its disk storage system into its own memory, recalculates the
checksum, and sends this value back to host A, where it is compared with the
checksum of the original. Only if the two checksums agree does the file transfer
application declare the transaction committed. If the comparison fails, something
has gone wrong, and a retry from the beginning might be attempted.
If failures are fairly rare, this technique will normally work on the first try;
occasionally a second or even third try might be required. One would probably
consider two or more failures on the same file transfer attempt as indicating that
some part of this system is in need of repair.
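A sketch of this end-to-end check-and-retry discipline might look as follows (Python; SHA-256 merely stands in for whatever checksum the application stores with the file, and send_file stands for the possibly faulty path through steps (1)-(5) above).

import hashlib

def checksum(path):
    """Application-level checksum over the whole file as stored on disk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

def careful_transfer(src_path, dst_path, send_file, max_attempts=3):
    """Transfer src to dst; commit only when the checksum read back from the
    destination's storage matches the checksum of the original."""
    original = checksum(src_path)
    for attempt in range(1, max_attempts + 1):
        send_file(src_path, dst_path)          # steps (1)-(5): may fail silently
        if checksum(dst_path) == original:     # read back from B's disk and compare
            return attempt                     # transaction committed
    raise IOError("repeated end-to-end failures: some component needs repair")

# Example with an ordinary local copy standing in for the network transfer:
# import shutil
# careful_transfer("a.bin", "b.bin", shutil.copyfile)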

Now let us consider the usefulness of a common proposal, namely, that the
communication system provide, internally, a guarantee o f reliable data transmis­
sion. It might accomplish this guarantee by providing selective redundancy in
the form of packet checksums, sequence number checking, and internal retry
mechanisms, for example. With sufficient care, the probability of undetected bit
errors can be reduced to any desirable level. The question is whether or not this
attempt to be helpful on the part of the communication system is useful to the
careful file transfer application.
The answer is that threat (4) may have been eliminated, but the careful file
transfer application must still counter the remaining threats; so it should still
provide its own retries based on an end-to-end checksum of the file. If it does,
the extra effort expended in the communication system to provide a guarantee
of reliable data transmission is only reducing the frequency of retries by the file
transfer application; it has no effect on inevitability or correctness of the outcome,
since correct file transmission is ensured by the end-to-end checksum and retry
whether or not the data transmission system is especially reliable.
Thus, the argument: In order to achieve careful file transfer, the application
program that performs the transfer must supply a file-transfer-specific, end-to-
end reliability guarantee—in this case, a checksum to detect failures and a retry-
commit plan. For the data communication system to go out of its way to be
extraordinarily reliable does not reduce the burden on the application program
to ensure reliability.

2.2 A Too-Real Example


An interesting example of the pitfalls that one can encounter turned up recently
at the Massachusetts Institute of Technology. One network system involving
several local networks connected by gateways used a packet checksum on each
hop from one gateway to the next, on the assumption that the primary threat to
correct communication was corruption of bits during transmission. Application
programmers, aware of this checksum, assumed that the network was providing
reliable transmission, without realizing that the transmitted data were unpro­
tected while stored in each gateway. One gateway computer developed a transient
error: while copying data from an input to an output buffer a byte pair was
interchanged, with a frequency of about one such interchange in every million
bytes passed. Over a period of time many of the source files of an operating
system were repeatedly transferred through the defective gateway. Some of these
source files were corrupted by byte exchanges, and their owners were forced to
the ultimate end-to-end error check: manual comparison with and correction
from old listings.

2.3 Performance Aspects


However, it would be too simplistic to conclude that the lower levels should play
no part in obtaining reliability. Consider a network that is somewhat unreliable,
dropping one message of each hundred messages sent. The simple strategy
outlined above, transmitting the file and then checking to see that the file has
arrived correctly, would perform more poorly as the length of the file increased.
The probability that all packets of a file arrive correctly decreases exponentially
with the file length, and thus the expected time to transmit the file grows
exponentially with file length. Clearly, some effort at the lower levels to improve
network reliability can have a significant effect on application performance. But
the key idea here is that the lower levels need not provide “perfect” reliability.
Thus the amount of effort to put into reliability measures within the data
communication system is seen to be an engineering trade-off based on perform­
ance, rather than a requirement for correctness. Note that performance has
several aspects here. If the communication system is too unreliable, the file
transfer application performance will suffer because of frequent retries following
failures of its end-to-end checksum. If the communication system is beefed up
with internal reliability measures, those measures also have a performance cost,
in the form of bandwidth lost to redundant data and added delay from waiting
for internal consistency checks to complete before delivering the data. There is
little reason to push in this direction very far, when it is considered that the end-
to-end check of the file transfer application must still be implemented no matter
how reliable the communication system becomes. The proper trade-off requires
careful thought. For example, one might start by designing the communication
system to provide only the reliability that comes with little cost and engineering
effort, and then evaluate the residual error level to ensure that it is consistent
with an acceptable retry frequency at the file transfer level. It is probably not
important to strive for a negligible error rate at any point below the application
level.
Using performance to justify placing functions in a low-level subsystem must
be done carefully. Sometimes, by examining the problem thoroughly, the same
or better performance enhancement can be achieved at the high level. Performing
a function at a low level may be more efficient, if the function can be performed
with a minimum perturbation of the machinery already included in the low-level
subsystem. But the opposite situation can occur—that is, performing the function
at the lower level may cost more—for two reasons. First, since the lower level
subsystem is common to many applications, those applications that do not need
the function will pay for it anyway. Second, the low-level subsystem may not
have as much information as the higher levels, so it cannot do the job as
efficiently.
Frequently, the performance trade-off is quite complex. Consider again the
careful file transfer on an unreliable network. The usual technique for increasing
packet reliability is some sort of per-packet error check with a retry protocol.
This mechanism can be implemented either in the communication subsystem or
in the careful file transfer application. For example, the receiver in the careful
file transfer can periodically compute the checksum of the portion of the file thus
far received and transmit this back to the sender. The sender can then restart
by retransmitting any portion that has arrived in error.
The end-to-end argument does not tell us where to put the early checks, since
either layer can do this performance-enhancement job. Placing the early retry
protocol in the file transfer application simplifies the communication system but
may increase overall cost, since the communication system is shared by other
applications and each application must now provide its own reliability enhance­
ment. Placing the early retry protocol in the communication system may be more
efficient, since it may be performed inside the network on a hop-by-hop basis,
reducing the delay involved in correcting a failure. At the same time there may
be some application that finds the cost of the enhancement is not worth the
result, but it now has no choice in the matter.1 A great deal of information about
system implementation is needed to make this choice intelligently.
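A minimal sketch of such an application-level early check, under the assumption of hypothetical send_chunk and recv_checksum transport hooks; the receiver reports a checksum for each portion it has stored, and the sender retransmits only the portions that arrived in error:

    import zlib

    def send_with_early_checks(chunks, send_chunk, recv_checksum, max_tries=5):
        # Early, per-portion retry inside the file transfer application. A final
        # end-to-end check over the whole file is still needed, as the text argues;
        # this only reduces how much must be retransmitted after a failure.
        for index, data in enumerate(chunks):
            expected = zlib.crc32(data)
            for _ in range(max_tries):
                send_chunk(index, data)
                if recv_checksum(index) == expected:
                    break                              # this portion arrived intact
            else:
                raise RuntimeError("portion %d repeatedly failed" % index)
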

3. OTHER EXAMPLES OF THE END-TO-END ARGUMENT

3.1 Delivery Guarantees


The basic argument that a lower level subsystem that supports a distributed
application may be wasting its effort in providing a function that must, by nature,
be implemented at the application level anyway can be applied to a variety of
functions in addition to reliable data transmission. Perhaps the oldest and most
widely known form of the argument concerns acknowledgment of delivery. A
data communication network can easily return an acknowledgment to the sender
for every message delivered to a recipient. The ARPANET, for example, returns
a packet known as Request For Next Message (RFNM) [1] whenever it delivers
a message. Although this acknowledgment may be useful within the network as
a form of congestion control (originally the ARPANET refused to accept another
message to the same target until the previous RFNM had returned), it was never
found to be very helpful for applications using the ARPANET. The reason is
that knowing for sure that the message was delivered to the target host is not
very important. What the application wants to know is whether or not the target
host acted on the message; all manner of disaster might have struck after message
delivery but before completion of the action requested by the message. The
acknowledgment that is really desired is an end-to-end one, which can be
originated only by the target application—“I did it,” or “I didn’t.”
Another strategy for obtaining immediate acknowledgments is to make the
target host sophisticated enough that when it accepts delivery of a message it
also accepts responsibility for guaranteeing that the message is acted upon by
the target application. This approach can eliminate the need for an end-to-end
acknowledgment in some, but not all, applications. An end-to-end acknowledg­
ment is still required for applications in which the action requested of the target
host should be done only if similar actions requested of other hosts are successful.
This kind of application requires a two-phase commit protocol [5,10, 15], which
is a sophisticated end-to-end acknowledgment. Also, if the target application
either fails or refuses to do the requested action, and thus a negative acknowl­
edgment is a possible outcome, an end-to-end acknowledgment may still be a
requirement.
3.2 Secure Transmission of Data
Another area in which an end-to-end argument can be applied is that of data
encryption. The argument here is threefold. First, if the data transmission system
performs encryption and decryption, it must be trusted to securely manage the
required encryption keys. Second, the data will be in the clear and thus vulnerable
1For example, real-time transmission of speech has tighter constraints on message delay than on bit-
error rate. Most retry schemes significantly increase the variability of delay.

as they pass into the target node and are fanned out to the target application.
Third, the authenticity of the message must still be checked by the application.
If the application performs end-to-end encryption, it obtains its required authen­
tication check and can handle key management to its satisfaction, and the data
are never exposed outside the application.
Thus, to satisfy the requirements of the application, there is no need for the
communication subsystem to provide for automatic encryption of all traffic.
Automatic encryption of all traffic by the communication subsystem may be
called for, however, to ensure something else—that a misbehaving user or
application program does not deliberately transmit information that should not
be exposed. The automatic encryption of all data as they are put into the network
is one more firewall the system designer can use to ensure that information does
not escape outside the system. Note however, that this is a different requirement
from authenticating access rights of a system user to specific parts of the data.
This network-level encryption can be quite unsophisticated—the same key can
be used by all hosts, with frequent changes of the key. No per-user keys complicate
the key management problem. The use of encryption for application-level au­
thentication and protection is complementary. Neither mechanism can satisfy
both requirements completely.
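A small illustration of the application performing its own end-to-end protection (only the authenticity check is shown; real encryption and key management are omitted, and the 32-byte tag layout is an assumption of this sketch):

    import hashlib
    import hmac

    def seal(message: bytes, key: bytes) -> bytes:
        # The application authenticates its own data before handing it to the
        # transport, so nothing below the application must be trusted with the key.
        tag = hmac.new(key, message, hashlib.sha256).digest()
        return tag + message

    def open_sealed(blob: bytes, key: bytes) -> bytes:
        tag, message = blob[:32], blob[32:]
        expected = hmac.new(key, message, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("end-to-end authenticity check failed")
        return message
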

3.3 Duplicate Message Suppression


A more sophisticated argument can be applied to duplicate message suppression.
A property of some communication network designs is that a message or a part
of a message may be delivered twice, typically as a result of time-out-triggered
failure detection and retry mechanisms operating within the network. The
network can watch for and suppress any such duplicate messages, or it can simply
deliver them. One might expect that an application would find it very troublesome
to cope with a network that may deliver the same message twice; indeed, it is
troublesome. Unfortunately, even if the network suppresses duplicates, the ap­
plication itself may accidentally originate duplicate requests in its own failure/
retry procedures. These application-level duplications look like different mes­
sages to the communication system, so it cannot suppress them; suppression
must be accomplished by the application itself with knowledge of how to detect
its own duplicates.
A common example of duplicate suppression that must be handled at a high
level is when a remote system user, puzzled by lack of response, initiates a new
login to a time-sharing system. Another example is that most communication
applications involve a provision for coping with a system crash at one end of a
multisite transaction: reestablish the transaction when the crashed system comes
up again. Unfortunately, reliable detection of a system crash is problematical:
the problem may just be a lost or long-delayed acknowledgment. If so, the retried
request is now a duplicate, which only the application can discover. Thus, the
end-to-end argument again: If the application level has to have a duplicate­
suppressing mechanism anyway, that mechanism can also suppress any dupli­
cates generated inside the communication network; therefore, the function can
be omitted from that lower level. The same basic reasoning applies to completely
omitted messages, as well as to duplicated ones.
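As an illustration of duplicate suppression done where the knowledge exists, the sketch below has the application tag each request with its own identifier and cache the reply; the request-id scheme is an assumption made for this example, not something prescribed by the paper:

    class DuplicateSuppressingServer:
        # Remembers request ids it has already acted on, so a retried request
        # (whether duplicated by the network or by the client's own retry logic)
        # is answered with the cached reply instead of being executed twice.
        def __init__(self, handler):
            self.handler = handler
            self.completed = {}          # request_id -> cached reply

        def handle(self, request_id, payload):
            if request_id in self.completed:
                return self.completed[request_id]
            reply = self.handler(payload)
            self.completed[request_id] = reply
            return reply
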

3.4 Guaranteeing FIFO Message Delivery


Ensuring that messages arrive at the receiver in the same order in which they
are sent is another function usually assigned to the communication subsystem.
The mechanism usually used to achieve such first-in, first-out (FIFO) behavior
guarantees FIFO ordering among messages sent on the same virtual circuit.
Messages sent along independent virtual circuits, or through intermediate proc­
esses outside the communication subsystem, may arrive in a different order from
the order sent. A distributed application in which one node can originate requests
that initiate actions at several sites cannot take advantage of the FIFO ordering
property to guarantee that the actions requested occur in the correct order.
Instead, an independent mechanism at a higher level than the communication
subsystem must control the ordering of actions.
3.5 Transaction Management
We have now applied the end-to-end argument in the construction of the
SWALLOW distributed data storage system [15], where it leads to significant
reduction in overhead. SWALLOW provides data storage servers called reposi­
tories that can be used remotely to store and retrieve data. Accessing data at a
repository is done by sending it a message specifying the object to be accessed,
the version, and type of access (read/write), plus a value to be written if the
access is a write. The underlying message communication system does not
suppress duplicate messages, since (a) the object identifier plus the version
information suffices to detect duplicate writes, and (b) the effect of a duplicate
read-request message is only to generate a duplicate response, which is easily
discarded by the originator. Consequently, the low-level message communication
protocol is significantly simplified.
The underlying message communication system does not provide delivery
acknowledgment either. The acknowledgment that the originator of a write
request needs is that the data were stored safely. This acknowledgment can be
provided only by high levels of the SWALLOW system. For read requests, a
delivery acknowledgment is redundant, since the response containing the value
read is sufficient acknowledgment. By eliminating delivery acknowledgments,
the number of messages transmitted is halved. This message reduction can have
a significant effect on both host load and network load, improving performance.
This same line of reasoning has also been used in development of an experimental
protocol for remote access to disk records [6]. The resulting reduction in path
length in lower level protocols has been important in maintaining good perform­
ance on remote disk access.

4. IDENTIFYING THE ENDS


Using the end-to-end argument sometimes requires subtlety of analysis of appli­
cation requirements. For example, consider a computer communication network
that carries some packet voice connections, that is, conversations between digital
telephone instruments. For those connections that carry voice packets, an un­
usually strong version of the end-to-end argument applies: If low levels of the
communication system try to accomplish bit-perfect communication, they will
probably introduce uncontrolled delays in packet delivery, for example, by re-
questing retransmission of damaged packets and holding up delivery of later
packets until earlier ones have been correctly retransmitted. Such delays are
disruptive to the voice application, which needs to feed data at a constant rate
to the listener. It is better to accept slightly damaged packets as they are, or even
to replace them with silence, a duplicate of the previous packet, or a noise burst.
The natural redundancy of voice, together with the high-level error correction
procedure in which one participant says “excuse me, someone dropped a glass.
Would you please say that again?” will handle such dropouts, if they are relatively
infrequent.
However, this strong version of the end-to-end argument is a property of the
specific application—two people in real-time conversation—rather than a prop­
erty, say, of speech in general. If, instead, one considers a speech message system,
in which the voice packets are stored in a file for later listening by the recipient,
the arguments suddenly change their nature. Short delays in delivery of packets
to the storage medium are not particularly disruptive, so there is no longer any
objection to low-level reliability measures that might introduce delay in order to
achieve reliability. More important, it is actually helpful to this application to
get as much accuracy as possible in the recorded message, since the recipient, at
the time of listening to the recording, is not going to be able to ask the sender to
repeat a sentence. On the other hand, with a storage system acting as the
receiving end of the voice communication, an end-to-end argument does apply to
packet ordering and duplicate suppression. Thus the end-to-end argument is not
an absolute rule, but rather a guideline that helps in application and protocol
design analysis; one must use some care to identify the endpoints to which the
argument should be applied.

5. HISTORY, AND APPLICATION TO OTHER SYSTEM AREAS


The individual examples of end-to-end arguments cited in this paper are not
original; they have accumulated over the years. The first example of questionable
intermediate delivery acknowledgments noticed by the authors was the “wait”
message of the Massachusetts Institute of Technology Compatible Time-Sharing
System, which the system printed on the user’s terminal whenever the user
entered a command [3]. (The message had some value in the early days of the
system, when crashes and communication failures were so frequent that inter­
mediate acknowledgments provided some needed reassurance that all was well.)
The end-to-end argument relating to encryption was first publicly discussed
by Branstad in a 1973 paper [2]; presumably the military security community
held classified discussions before that time. Diffie and Hellman [4] and Kent [8]
developed the arguments in more depth, and Needham and Schroeder [11] devised
improved protocols for the purpose.
The two-phase-commit data update protocols of Gray [5], Lampson and Sturgis
[10] and Reed [13] all use a form of end-to-end argument to justify their existence;
they are end-to-end protocols that do not depend for correctness on reliability,
FIFO sequencing, or duplicate suppression within the communication system,
since all of these problems may also be introduced by other system component
failures as well. Reed makes this argument explicitly in the second chapter of his
Ph.D. dissertation on decentralized atomic actions [14].

End-to-end arguments are often applied to error control and correctness in
application systems. For example, a banking system usually provides high-level
auditing procedures as a matter of policy and legal requirement. Those high-level
auditing procedures will uncover not only high-level mistakes, such as performing
a withdrawal against the wrong account, but they will also detect low-level
mistakes such as coordination errors in the underlying data management system.
Therefore, a costly algorithm that absolutely eliminates such coordination errors
may be arguably less appropriate than a less costly algorithm that just makes
such errors very rare. In airline reservation systems, an agent can be relied upon
to keep trying through system crashes and delays until a reservation is either
confirmed or refused. Lower level recovery procedures to guarantee that an
unconfirmed request for a reservation will survive a system crash are thus not
vital. In telephone exchanges, a failure that could cause a single call to be lost is
considered not worth providing explicit recovery for, since the caller will probably
replace the call if it matters [7]. All of these design approaches are examples of
the end-to-end argument being applied to automatic recovery.
Much of the debate in the network protocol community over datagrams, virtual
circuits, and connectionless protocols is a debate about end-to-end arguments. A
modularity argument prizes a reliable, FIFO sequenced, duplicate-suppressed
stream of data as a system component that is easy to build on, and that argument
favors virtual circuits. The end-to-end argument claims that centrally provided
versions of each of those functions will be incomplete for some applications, and
those applications will find it easier to build their own version of the functions
starting with datagrams.
A version of the end-to-end argument in a noncommunication application was
developed in the 1950s by system analysts whose responsibility included reading
and writing files on large numbers of magnetic tape reels. Repeated attempts to
define and implement a reliable tape subsystem repeatedly foundered, as flaky
tape drives, undependable system operators, and system crashes conspired against
all narrowly focused reliability measures. Eventually, it became standard practice
for every application to provide its own application-dependent checks and recov­
ery strategy, and to assume that lower level error detection mechanisms, at best,
reduced the frequency with which the higher level checks failed. As an example,
the Multics file backup system [17], even though it is built on a foundation of
magnetic tape subsystem format that provides very powerful error detection and
correction features, provides its own error control in the form of record labels
and multiple copies of every file.
The arguments that are used in support of reduced instruction set computer
(RISC) architecture are similar to end-to-end arguments. The RISC argument is
that the client of the architecture will get better performance by implementing
exactly the instructions needed from primitive tools; any attempt by the computer
designer to anticipate the client’s requirements for an esoteric feature will
probably miss the target slightly and the client will end up reimplementing that
feature anyway. (We are indebted to M. Satyanarayanan for pointing out this
example.)
Lampson, in his arguments supporting the open operating system [9], uses an
argument similar to the end-to-end argument as a justification. Lampson argues
against making any function a permanent fixture of lower level modules; the
function may be provided by a lower level module, but it should always be
replaceable by an application’s special version of the function. The reasoning is
that for any function that can be thought of, at least some applications will find
that, of necessity, they must implement the function themselves in order to meet
correctly their own requirements. This line of reasoning leads Lampson to
propose an “open” system in which the entire operating system consists of
replaceable routines from a library. Such an approach has only recently become
feasible in the context of computers dedicated to a single application. It may be
the case that the large quantity of fixed supervisor functions typical of large-
scale operating systems is only an artifact of economic pressures that have
demanded multiplexing of expensive hardware and therefore a protected super­
visor. Most recent system “kernelization” projects have, in fact, focused at least
in part on getting function out of low system levels [12,16]. Though this function
movement is inspired by a different kind of correctness argument, it has the side
effect of producing an operating system that is more flexible for applications,
which is exactly the main thrust of the end-to-end argument.

6. CONCLUSIONS
End-to-end arguments are a kind of “Occam’s razor” when it comes to choosing
the functions to be provided in a communication subsystem. Because the com­
munication subsystem is frequently specified before applications that use the
subsystem are known, the designer may be tempted to “help” the users by taking
on more function than necessary. Awareness of end-to-end arguments can help
to reduce such temptations.
It is fashionable these days to talk about layered communication protocols, but
without clearly defined criteria for assigning functions to layers. Such layerings
are desirable to enhance modularity. End-to-end arguments may be viewed as
part of a set of rational principles for organizing such layered systems. We hope
that our discussion will help to add substance to arguments about the “proper”
layering.

ACKNOWLEDGMENTS
Many people have read and commented on an earlier draft of this paper, including
David Cheriton, F. B. Schneider, and Liba Svobodova. The subject was also
discussed at the ACM Workshop in Fundamentals of Distributed Computing, in
Fallbrook, Calif., December 1980. Those comments and discussions were quite
helpful in clarifying the arguments.

REFERENCES
1. Bolt Beranek and Newman Inc. Specifications for the interconnection of a host and an
IMP. Tech. Rep. 1822, Bolt Beranek and Newman Inc., Cambridge, Mass., Dec. 1981.
2. Branstad, D.K. Security aspects of computer networks. AIAA Paper 73-427, AIAA Computer
Network Systems Conference, Huntsville, Ala., Apr. 1973.
3. Corbato, F.J., Daggett, M.M., Daley, R.C., Creasy, R.J., Helliwig, J.D., Orenstein, R.H.,
and Korn, L.K. The Compatible Time-Sharing System, A Programmer's Guide. Massachusetts
Institute of Technology Press, Cambridge, Mass., 1963, p. 10.
4. Diffie, W., and Hellman, M.E. New directions in cryptography. IEEE Trans. Inf. Theory
IT-22, 6 (Nov. 1976), 644-654.
5. Gray, J.N. Notes on database operating systems. In Operating Systems: An Advanced Course.
Lecture Notes in Computer Science, vol. 60. Springer-Verlag, New York, 1978, 393-481.
6. Greenwald, M. Remote virtual disk protocol specifications. Tech. Memo, Massachusetts
Institute of Technology Laboratory for Computer Science, Cambridge, Mass. In preparation.
7. Keister, W., Ketchledge, R.W., and Vaughan, H.E. No. 1 ESS: System organization and
objectives. Bell Syst. Tech. J. 53, 5, Pt. 1 (Sept. 1964), 1841.
8. Kent, S.T. Encryption-based protection protocols for interactive user-computer communica-
tion. S.M. thesis, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute
of Technology, Cambridge, Mass., May 1976. Also available as Tech. Rep. TR-162, Massachusetts
Institute of Technology Laboratory for Computer Science, May 1976.
9. Lampson, B.W., and Sproull, R.F. An open operating system for a single-user machine. In
Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif.,
Dec. 10-12). ACM, New York, 1979, pp. 98-105.
10. Lampson, B., and Sturgis, H. Crash recovery in a distributed data storage system. Working
paper, Xerox PARC, Palo Alto, Calif., Nov. 1976 and Apr. 1979. Submitted for publication.
11. Needham, R.M., and Schroeder, M.D. Using encryption for authentication in large networks
of computers. Commun. ACM 21, 12 (Dec. 1978), 993-999.
12. Popek, G.J., et al. UCLA data secure Unix. In Proceedings of the 1979 National Computer
Conference. AFIPS Press, Reston, Va., pp. 355-364.
13. Reed, D.P. Implementing atomic actions on decentralized data. ACM Trans. Comput. Syst. 1,
1 (Feb. 1983), 3-23.
14. Reed, D.P. Naming and synchronization in a decentralized computer system. Ph.D. disserta-
tion, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer
Science, Cambridge, Mass., September 1978. Also available as Massachusetts Institute of Tech-
nology Laboratory for Computer Science Tech. Rep. TR-205, Sept. 1978.
15. Reed, D.P., and Svobodova, L. SWALLOW: A distributed data storage system for a local
network. In A. West and P. Janson, Eds., Local Networks for Computer Communications,
Proceedings of the IFIP Working Group 6.4 International Workshop on Local Networks
(Zurich, Aug. 27-29, 1980), North-Holland, Amsterdam, 1981, pp. 355-373.
16. Schroeder, M.D., Clark, D.D., and Saltzer, J.H. The Multics kernel design project. In
Proceedings of the 6th Symposium on Operating Systems Principles. Oper. Syst. Rev. 11, 5
(Nov. 1977), 43-56.
17. Stern, J.A. Backup and recovery of on-line information in a computer utility. S.M. thesis,
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Tech-
nology, Cambridge, Mass., Aug. 1973. Available as Project MAC Tech. Rep. TR-116, Massachu-
setts Institute of Technology, Jan. 1974.

Received February 1983; accepted June 1983



Chapter 10

Data Processing - External Sorting

This chapter contains the book chapter:

H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete
Book. Chapter 11.4, pp. 525-533 (9 of 1119). Prentice Hall, 2002. ISBN:
0-13-031995-3

A vast number of applications must process and analyze large quantities of
data. These applications need the system property of scalability with data
volumes. Classic algorithms are developed with the RAM model in mind, and
often break or perform very poorly over multi-level memories. This necessitates
the study of algorithms under different models. In the following, the text
studies algorithms for data processing under the external memory model. In
this chapter, the ubiquitous problem of sorting is studied. The ultimate goal of
this portion of the material is to allow us to reflect on how the external memory
model differs from the classic model of algorithm design, and to explore in detail
the example of sorting as a first data processing service implemented under the
external memory model.
The learning goals for this portion of the material are listed below.

• Identify the problem of processing large data collections with external
memory algorithms.

• Discuss the implications of the external memory model on algorithm design.

• Explain the classic sort-merge external sorting algorithm, and in particular
its two-phase, multiway variants.


Exercise 11.3.4: At the end of Example 11.3 we suggested that the maximum
density of tracks could be reduced if we divided the tracks into three regions,
with different numbers of sectors in each region. If the divisions between the
three regions could be placed at any radius, and the number of sectors in each
region could vary, subject only to the constraint that the total number of bytes
on the 16,384 tracks of one surface be 8 gigabytes, what choice for the five
parameters (radii of the two divisions between regions and the numbers of
sectors per track in each of the three regions) minimizes the maximum density
of any track?

11.4 Using Secondary Storage Effectively


In most studies of algorithms, one assumes that the data is in main memory,
and access to any item of data takes as much time as any other. This model
of computation is often called the “RAM model” or random-access model of
computation. However, when implementing a DBMS, one must assume that
the data does not fit into main memory. One must therefore take into account
the use of secondary, and perhaps even tertiary storage in designing efficient
algorithms. The best algorithms for processing very large amounts of data thus
often differ from the best main-memory algorithms for the same problem.
In this section, we shall consider primarily the interaction between main
and secondary memory. In particular, there is a great advantage in choosing an
algorithm that uses few disk accesses, even if the algorithm is not very efficient
when viewed as a main-memory algorithm. A similar principle applies at each
level of the memory hierarchy. Even a main-memory algorithm can sometimes
be improved if we remember the size of the cache and design our algorithm so
that data moved to cache tends to be used many times. Likewise, an algorithm
using tertiary storage needs to take into account the volume of data moved
between tertiary and secondary memory, and it is wise to minimize this quantity
even at the expense of more work at the lower levels of the hierarchy.

11.4.1 The I/O Model of Computation


Let us imagine a simple computer running a DBMS and trying to serve a number
of users who are accessing the database in various ways: queries and database
modifications. For the moment, assume our computer has one processor, one
disk controller, and one disk. The database itself is much too large to fit in
main memory. Key parts of the database may be buffered in main memory, but
generally, each piece of the database that one of the users accesses will have to
be retrieved initially from disk.
Since there are many users, and each user issues disk-I/O requests frequently,
the disk controller often will have a queue of requests, which we assume it
satisfies on a first-come-first-served basis. Thus, each request for a given user
will appear random (i.e., the disk head will be in a random position before the
request), even if this user is reading blocks belonging to a single relation, and
that relation is stored on a single cylinder of the disk. Later in this section we
shall discuss how to improve the performance of the system in various ways.
However, in all that follows, the following rule, which defines the I/O model of
computation, is assumed:
Dominance of I/O cost: If a block needs to be moved between
disk and main memory, then the time taken to perform the read
or write is much larger than the time likely to be used manip­
ulating that data in main memory. Thus, the number of block
accesses (reads and writes) is a good approximation to the time
needed by the algorithm and should be minimized.
In examples, we shall assume that the disk is a Megatron 747, with 16K-
byte blocks and the timing characteristics determined in Example 11.5. In
particular, the average time to read or write a block is about 11 milliseconds.
Example 11.6: Suppose our database has a relation R and a query asks for
the tuple of R that has a certain key value k. As we shall see, it is quite desirable
that an index on R be created and used to identify the disk block on which the
tuple with key value k appears. However it is generally unimportant whether
the index tells us where on the block this tuple appears.
The reason is that it will take on the order of 11 milliseconds to read this
16K-byte block. In 11 milliseconds, a modern microprocessor can execute mil­
lions of instructions. However, searching for the key value k once the block is
in main memory will only take thousands of instructions, even if the dumbest
possible linear search is used. The additional time to perform the search in
main memory will therefore be less than 1% of the block access time and can
be neglected safely. □
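A back-of-the-envelope check of this claim (the instruction counts and processor speed below are assumptions, not figures from the text):

    # One disk block access vs. a linear search of the block in main memory.
    block_access_ms = 11.0                     # average Megatron 747 block access time
    records_per_block = 100
    instructions_per_comparison = 10           # assumed cost of one search step
    instructions_per_ms = 100_000              # assumed 10^8 instructions per second

    search_ms = records_per_block * instructions_per_comparison / instructions_per_ms
    print(search_ms)                           # 0.01 ms
    print(100 * search_ms / block_access_ms)   # well under 1% of the block access time
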

11.4.2 Sorting Data in Secondary Storage


As an extended example of how algorithms need to change under the I/O model
of computation cost, let us consider sorting data that is much larger than main
memory. To begin, we shall introduce a particular sorting problem and give
some details of the machine on which the sorting occurs.

Example 11.7: Let us assume that we have a large relation R consisting of
10,000,000 tuples. Each tuple is represented by a record with several fields, one
of which is the sort key field, or just “key field” if there is no confusion with
other kinds of keys. The goal of a sorting algorithm is to order the records by
increasing value of their sort keys.
A sort key may or may not be a “key” in the usual SQL sense of a primary
key, where records are guaranteed to have unique values in their primary key.
If duplicate values of the sort key are permitted, then any order of records
with equal sort keys is acceptable. For simplicity, we shall assume sort keys are
unique.

The records (tuples) of R will be divided into disk blocks of 16,384 bytes per
block. We assume that 100 records fit in one block. That is, records are about
160 bytes long. With the typical extra information needed to store records in a
block (as discussed in Section 12.2, e.g.), 100 records of this size is about what
can fit in one 16,384-byte block. Thus, R occupies 100,000 blocks totaling 1.64
billion bytes.
The machine on which the sorting occurs has one Megatron 747 disk and
100 megabytes of main memory available for buffering blocks of the relation.
The actual main memory is larger, but the rest of main-memory is used by the
system. The number of blocks that can fit in 100M bytes of memory (which,
recall, is really 100 x 2^20 bytes), is 100 x 2^20/2^14, or 6400 blocks. □

If the data fits in main memory, there are a number of well-known algorithms
that work well;5 variants of “Quicksort” are generally considered the fastest.
The preferred version of Quicksort sorts only the key fields, carrying pointers
to the full records along with the keys. Only when the keys and their pointers
were in sorted order, would we use the pointers to bring every record to its
proper position.
Unfortunately, these ideas do not work very well when secondary memory
is needed to hold the data. The preferred approaches to sorting, when the data
is mostly in secondary memory, involve moving each block between main and
secondary memory only a small number of times, in a regular pattern. Often,
these algorithms operate in a small number of passes; in one pass every record
is read into main memory once and written out to disk once. In Section 11.4.4,
we see one such algorithm.

11.4.3 Merge-Sort
You may be familiar with a main-memory sorting algorithm called Merge-Sort
that works by merging sorted lists into larger sorted lists. To merge two sorted
lists, we repeatedly compare the smallest remaining keys of each list, move the
record with the smaller key to the output, and repeat, until one list is exhausted.
At that time, the output, in the order selected, followed by what remains of the
nonexhausted list, is the complete set of records, in sorted order.

Example 11.8: Suppose we have two sorted lists of four records each. To
make matters simpler, we shall represent records by their keys and no other
data, and we assume keys are integers. One of the sorted lists is (1,3,4,9) and
the other is (2,5,7,8). In Fig. 11.10 we see the stages of the merge process.
At the first step, the head elements of the two lists, 1 and 2, are compared.
Since 1 < 2, the 1 is removed from the first list and becomes the first element
of the output. At step (2), the heads of the remaining lists, now 3 and 2,
are compared; 2 wins and is moved to the output. The merge continues until

5 See D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching,
2nd Edition, Addison-Wesley, Reading, MA, 1998.


Step   List 1    List 2    Output
start  1,3,4,9   2,5,7,8   none
1)     3,4,9     2,5,7,8   1
2)     3,4,9     5,7,8     1,2
3)     4,9       5,7,8     1,2,3
4)     9         5,7,8     1,2,3,4
5)     9         7,8       1,2,3,4,5
6)     9         8         1,2,3,4,5,7
7)     9         none      1,2,3,4,5,7,8
8)     none      none      1,2,3,4,5,7,8,9

Figure 11.10: Merging two sorted lists to make one sorted list

step (7), when the second list is exhausted. At that point, the remainder of the
first list, which happens to be only one element, is appended to the output and
the merge is done. Note that the output is in sorted order, as must be the case,
because at each step we chose the smallest of the remaining elements. □

The time to merge in main memory is linear in the sum of the lengths of the
lists. The reason is that, because the given lists are sorted, only the heads of
the two lists are ever candidates for being the smallest unselected element, and
we can compare them in a constant amount of time. The classic merge-sort
algorithm sorts recursively, using log2 n phases if there are n elements to be
sorted. It can be described as follows:
BASIS: If there is a list of one element to be sorted, do nothing, because the
list is already sorted.
INDUCTION: If there is a list of more than one element to be sorted, then
divide the list arbitrarily into two lists that are either of the same length, or as
close as possible if the original list is of odd length. Recursively sort the two
sublists. Then merge the resulting sorted lists into one sorted list.
The analysis of this algorithm is well known and not too important here. Briefly,
T(n), the time to sort n elements, is some constant times n (to split the list and
merge the resulting sorted lists) plus the time to sort two lists of size n/2. That
is, T(n) = 2T(n/2) + an for some constant a. The solution to this recurrence
equation is T(n) = O(n log n), that is, proportional to n log n.
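As an illustration (not code from the book), a straightforward in-memory two-way merge along the lines of Example 11.8 can be written as follows:

    def merge(list1, list2):
        # Merge two already-sorted lists into one sorted list by repeatedly
        # moving the smaller of the two head elements to the output.
        output = []
        i = j = 0
        while i < len(list1) and j < len(list2):
            if list1[i] <= list2[j]:
                output.append(list1[i])
                i += 1
            else:
                output.append(list2[j])
                j += 1
        # One list is exhausted; append whatever remains of the other.
        output.extend(list1[i:])
        output.extend(list2[j:])
        return output

    # Example 11.8: merge([1, 3, 4, 9], [2, 5, 7, 8]) == [1, 2, 3, 4, 5, 7, 8, 9]
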

11.4.4 Two-Phase, Multiway Merge-Sort


We shall use a variant of Merge-Sort, called Two-Phase, Multiway Merge-Sort
(often abbreviated TPMMS), to sort the relation of Example 11.7 on the ma­
chine described in that example. It is the preferred sorting algorithm in many
database applications. Briefly, this algorithm consists of:

• Phase 1: Sort main-memory-sized pieces of the data, so every record is
part of a sorted list that just fits in the available main memory. There
may thus be any number of these sorted sublists, which we merge in the
next phase.

• Phase 2: Merge all the sorted sublists into a single sorted list.

Our first observation is that with data on secondary storage, we do not want
to start with a basis to the recursion that is one record or a few records. The
reason is that Merge-Sort is not as fast as some other algorithms when the
records to be sorted fit in main memory. Thus, we shall begin the recursion
by taking an entire main memory full of records, and sorting them using an
appropriate main-memory sorting algorithm such as Quicksort. We repeat the
following process as many times as necessary:

1. Fill all available main memory with blocks from the original relation to
be sorted.
2. Sort the records that are in main memory.
3. Write the sorted records from main memory onto new blocks of secondary
memory, forming one sorted sublist.

At the end of this first phase, all the records of the original relation will have
been read once into main memory, and become part of a main-memory-size
sorted sublist that has been written onto disk.
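The following sketch of phase one makes these steps concrete. It is an illustration, not the book's implementation: records are modeled as newline-terminated text lines, memory_capacity plays the role of M/R (records per memory-load), and each sorted sublist is written to its own temporary file.

    import tempfile

    def phase_one(lines, memory_capacity):
        # Repeatedly fill "main memory" with records, sort them with a
        # main-memory algorithm, and write each sorted run to its own file.
        run_files, buffer = [], []

        def flush():
            buffer.sort()                            # e.g., Quicksort in practice
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.writelines(buffer)
            run.close()
            run_files.append(run.name)
            buffer.clear()

        for line in lines:
            buffer.append(line)
            if len(buffer) == memory_capacity:       # memory full: emit a sorted sublist
                flush()
        if buffer:
            flush()                                  # the last, possibly shorter, sublist
        return run_files
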

Example 11.9: Consider the relation described in Example 11.7. We de-
termined that 6400 of the 100,000 blocks will fill main memory. We thus fill
memory 16 times, sort the records in main memory, and write the sorted sub­
lists out to disk. The last of the 16 sublists is shorter than the rest; it occupies
only 4000 blocks, while the other 15 sublists occupy 6400 blocks.
How long does this phase take? We read each of the 100,000 blocks once,
and we write 100,000 new blocks. Thus, there are 200,000 disk I/O’s. We have
assumed, for the moment, that blocks are stored at random on the disk, an
assumption that, as we shall see in Section 11.5, can be improved upon greatly.
However, on our randomness assumption, each block read or write takes about
11 milliseconds. Thus, the I/O time for the first phase is 2200 seconds, or 37
minutes, or over 2 minutes per sorted sublist. It is not hard to see that, at
a processor speed of hundreds of millions of instructions per second, we can
create one sorted sublist in main memory in far less than the I/O time for that
sublist. We thus estimate the total time for phase one as 37 minutes. □
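The arithmetic behind the 37-minute estimate, spelled out with the numbers of the running example:

    # Phase 1 I/O cost from Example 11.9.
    blocks = 100_000
    ms_per_block_io = 11                       # average block access time
    phase_one_ios = 2 * blocks                 # read every block once, write every block once
    print(phase_one_ios * ms_per_block_io / 1000)        # 2200 seconds
    print(phase_one_ios * ms_per_block_io / 1000 / 60)   # about 37 minutes
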

Now, let us consider how we complete the sort by merging the sorted sublists.
We could merge them in pairs, as in the classical Merge-Sort, but that would
involve reading all data in and out of memory 2 log2 n times if there were n
sorted sublists. For instance, the 16 sorted sublists of Example 11.9 would be

read in and out of secondary storage once to merge into 8 lists; another complete
reading and writing would reduce them to 4 sorted lists, and two more complete
read/write operations would reduce them to one sorted list. Thus, each block
would have 8 disk I/O’s performed on it.
A better approach is to read the first block of each sorted sublist into a
main-memory buffer. For some huge relations, there would be too many sorted
sublists from phase one to read even one block per list into main memory, a
problem we shall deal with in Section 11.4.5. But for data such as that of
Example 11.7, there are relatively few lists, 16 in that example, and a block
from each list fits easily in main memory.
We also use a buffer for an output block that will contain as many of the
first elements in the complete sorted list as it can hold. Initially, the output
block is empty. The arrangement of buffers is suggested by Fig. 11.11. We
merge the sorted sublists into one sorted list with all the records as follows.

[Figure 11.11 (diagram): input buffers, one for each sorted sublist, feed a
“select smallest unchosen for output” step that fills a single output buffer.]

Figure 11.11: Main-memory organization for multiway merging

1. Find the smallest key among the first remaining elements of all the lists.
Since this comparison is done in main memory, a linear search is suffi­
cient, taking a number of machine instructions proportional to the num­
ber of sublists. However, if we wish, there is a method based on “priority-
queues” 6 that takes time proportional to the logarithm of the number of
sublists to find the smallest element.
2. Move the smallest element to the first available position of the output
block.

6 See Aho, A. V., and J. D. Ullman. Foundations of Computer Science, Computer Science
Press, 1992.


How Big Should Blocks Be?


We have assumed a 16K byte block in our analysis of algorithms using
the Megatron 747 disk. However, there are arguments that a larger block
size would be advantageous. Recall from Example 11.5 that it takes about
a quarter of a millisecond for transfer time of a 16K block and 10.63
milliseconds for average seek time and rotational latency. If we doubled
the size of blocks, we would halve the number of disk I/O’s for an algorithm
like TPMMS. On the other hand, the only change in the time to access
a block would be that the transfer time increases to 0.50 millisecond. We
would thus approximately halve the time the sort takes. For a block size
of 512K (i.e., an entire track of the Megatron 747) the transfer time is 8
milliseconds. At that point, the average block access time would be 20
milliseconds, but we would need only 12,500 block accesses, for a speedup
in sorting by a factor of 14.
However, there are reasons to limit the block size. First, we cannot
use blocks that cover several tracks effectively. Second, small relations
would occupy only a fraction of a block, so large blocks would waste space
on the disk. There are also certain data structures for secondary storage
organization that prefer to divide data among many blocks and therefore
work less well when the block size is too large. In fact, we shall see in
Section 11.4.5 that the larger the blocks are, the fewer records we can
sort by TPMMS. Nevertheless, as machines get faster and disks more
capacious, there is a tendency for block sizes to grow.

3. If the output block is full, write it to disk and reinitialize the same buffer
in main memory to hold the next output block.

4. If the block from which the smallest element was just taken is now ex­
hausted of records, read the next block from the same sorted sublist into
the same buffer that was used for the block just exhausted. If no blocks
remain, then leave its buffer empty and do not consider elements from
that list in any further competition for smallest remaining elements.

In the second phase, unlike the first phase, the blocks are read in an unpre­
dictable order, since we cannot tell when an input block will become exhausted.
However, notice that every block holding records from one of the sorted lists is
read from disk exactly once. Thus, the total number of block reads is 100,000
in the second phase, just as for the first. Likewise, each record is placed once in
an output block, and each of these blocks is written to disk. Thus, the number
of block writes in the second phase is also 100,000. As the amount of second-
phase computation in main memory can again be neglected compared to the
I/O cost, we conclude that the second phase takes another 37 minutes, or 74

minutes for the entire sort.
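A sketch of the second phase to pair with the phase-one sketch given earlier (again an illustration, not the book's code): one input buffer is opened per sorted run, and the standard-library heapq.merge performs the repeated select-the-smallest-remaining-key step while reading every run exactly once.

    import heapq

    def phase_two(run_files, output_path):
        # Merge all sorted runs into a single sorted output file. Each open file
        # is an iterator over its (already sorted) lines, so the merge streams
        # the data without loading whole runs into memory.
        inputs = [open(name, "r") for name in run_files]
        try:
            with open(output_path, "w") as out:
                for line in heapq.merge(*inputs):
                    out.write(line)
        finally:
            for f in inputs:
                f.close()

Tying the two phases together, a usage sketch would be phase_two(phase_one(open("input.txt"), memory_capacity=640_000), "sorted.txt"), where the file names and the capacity of 6400 blocks x 100 records are again only illustrative.
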

11.4.5 Multiway Merging of Larger Relations


The Two-Phase, Multiway Merge-Sort (TPMMS) described above can be used
to sort some very large sets of records. To see how large, let us suppose that:

1. The block size is B bytes.


2. The main memory available for buffering blocks is M bytes.

3. Records take R bytes.

The number of buffers available in main memory is thus M/B. In the
second phase, all but one of these buffers may be devoted to one of the sorted
sublists; the remaining buffer is for the output block. Thus, the number of
sorted sublists that may be created in phase one is (M/B) - 1. This quantity
is also the number of times we may fill main memory with records to be sorted.
Each time we fill main memory, we sort M/R records. Thus, the total number
of records we can sort is (M/R)((M/B) - 1), or approximately M^2/RB records.

Example 11.10: If we use the parameters outlined in Example 11.7, then
M = 104,857,600, B = 16,384, and R = 160. We can thus sort up to M^2/RB
= 4.2 billion records, occupying two thirds of a terabyte. Note that relations
this size will not fit on a Megatron 747 disk. □
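The same calculation spelled out (parameter values from Example 11.7; the helper name is only for illustration):

    def tpmms_capacity(M, B, R):
        # Records sorted by TPMMS: (M/B - 1) runs, each holding M/R records;
        # one buffer is reserved for the output block in the second phase.
        runs = M // B - 1
        records_per_run = M // R
        return runs * records_per_run

    M, B, R = 104_857_600, 16_384, 160
    print(tpmms_capacity(M, B, R))        # roughly 4.2 billion records
    print(tpmms_capacity(M, B, R) * R)    # roughly two thirds of a terabyte
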

If we need to sort more records, we can add a third pass. Use TPMMS to
sort groups of M^2/RB records, turning them into sorted sublists. Then, in a
third phase, we merge up to (M/B) - 1 of these lists in a final multiway merge.
The third phase lets us sort approximately M^3/RB^2 records occupying
M^3/B^3 blocks. For the parameters of Example 11.7, this amount is about
27 trillion records occupying 4.3 petabytes. Such an amount is unheard of
today. Since even the 0.67 terabyte limit for TPMMS is unlikely to be reached
in practice, we suggest that the two-phase version of Multiway Merge-Sort is
likely to be enough for all practical purposes.

11.4.6 Exercises for Section 11.4


Exercise 11.4.1: Using TPMMS, how long would it take to sort the relation
of Example 11.7 if the Megatron 747 disk were replaced by the Megatron 777
disk described in Exercise 11.3.1, and all other characteristics of the machine
and data remained the same?

Exercise 11.4.2: Suppose we use TPMMS on the machine and relation R of
Example 11.7, with certain modifications. Tell how many disk I/O’s are needed
for the sort if the relation R and/or machine characteristics are changed as
follows:

* a) The number of tuples in R is doubled (all else remains the same).


b) The length of tuples is doubled to 320 bytes (and everything else remains
as in Example 11.7).
* c) The size of blocks is doubled, to 32,768 bytes (again, as throughout this
exercise, all other parameters are unchanged).
d) The size of available main memory is doubled to 200 megabytes.

! Exercise 11.4.3: Suppose the relation R of Example 11.7 grows to have as
many tuples as can be sorted using TPMMS on the machine described in that
example. Also assume that the disk grows to accommodate R, but all other
characteristics of the disk, machine, and relation R remain the same. How long
would it take to sort R?

* Exercise 11.4.4: Let us again consider the relation R of Example 11.7, but
assume that it is stored sorted by the sort key (which is in fact a “key” in the
usual sense, and uniquely identifies records). Also, assume that R is stored in
a sequence of blocks whose locations are known, so that for any i it is possible
to locate and retrieve the ith block of R using one disk I/O. Given a key value
K, we can find the tuple with that key value by using a standard binary search
technique. What is the maximum number of disk I/O’s needed to find the tuple
with key K?

!! Exercise 11.4.5: Suppose we have the same situation as in Exercise 11.4.4,
but we are given 10 key values to find. What is the maximum number of disk
I/O’s needed to find all 10 tuples?

* Exercise 11.4.6: Suppose we have a relation whose n tuples each require R
bytes, and we have a machine whose main memory M and disk-block size B
are just sufficient to sort the n tuples using TPMMS. How would the maximum
n change if we doubled: (a) B; (b) R; (c) M?

! Exercise 11.4.7: Repeat Exercise 11.4.6 if it is just possible to perform the
sort using Three-Phase, Multiway Merge-Sort.

*! Exercise 11.4.8: As a function of parameters R, M, and B (as in Exer-
cise 11.4.6) and the integer k, how many records can be sorted using a k-phase,
Multiway Merge-Sort?

11.5 Accelerating Access to Secondary Storage


The analysis of Section 11.4.4 assumed that data was stored on a single disk and
that blocks were chosen randomly from the possible locations on the disk. That
assumption may be appropriate for a system that is executing a large number
of small queries simultaneously. But if all the system is doing is sorting a large
Chapter 11

Data Processing - Basic Relational Operators and Joins

This chapter contains the book chapter:

H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete
Book. Chapter 15, pp. 713-774 (62 of 1119). Prentice Hall, 2002. ISBN:
0-13-031995-3

Other than sorting, a set of operations commonly applied to bulk data
sets are the operations of the relational algebra. This chapter discusses how
to design algorithms for and implement these operators under the external
memory model, maintaining the property of scalability with data volumes.
The ultimate goal of this portion of the material is to equip us with a variety of
algorithmic design alternatives for services which must analyze large data sets,
in particular memory management approaches, indexing, sorting, and hashing.
The learning goals for this portion of the material are listed below.

• Explain the design of operators for basic relational operations, including
their scan-based, one-pass variants as well as associated memory management
issues.

• Analyze algorithms for the join operation, including the multiple variants
based on nested loops, indexing, sorting, and hashing as well as one, two,
and many passes.

• Apply external memory algorithms to the implementation of data processing
operators.

Chapter 15

Query Execution

Previous chapters gave us data structures that allow efficient execution of basic
database operations such as finding tuples given a search key. We are now ready
to use these structures to support efficient algorithms for answering queries. The
broad topic of query processing will be covered in this chapter and Chapter 16.
The q u e r y p r o c e s s o r is the group of components of a DBMS that turns user
queries and data-modification commands into a sequence of database operations
and executes those operations. Since SQL lets us express queries at a very high
level, the query processor must supply a lot of detail regarding how the query
is to be executed. Moreover, a naive execution strategy for a query may lead to
an algorithm for executing the query that takes far more time than necessary.
Figure 15.1 suggests the division of topics between Chapters 15 and 16.
In this chapter, we concentrate on query execution, that is, the algorithms
that manipulate the data of the database. We focus on the operations of the
extended relational algebra, described in Section 5.4. Because SQL uses a bag
model, we also assume that relations are bags, and thus use the bag versions of
the operators from Section 5.3.
We shall cover the principal methods for execution of the operations of rela­
tional algebra. These methods differ in their basic strategy; scanning, hashing,
sorting, and indexing are the major approaches. The methods also differ on
their assumption as to the amount of available main memory. Some algorithms
assume that enough main memory is available to hold at least one of the re­
lations involved in an operation. Others assume that the arguments of the
operation are too big to fit in memory, and these algorithms have significantly
different costs and structures.

Preview of Query Compilation

Query compilation is divided into the three major steps shown in Fig. 15.2.

a) Parsing, in which a parse tree, representing the query and its structure, is constructed.


Figure 15.1: The major parts of the query processor

b) Query rewrite, in which the parse tree is converted to an initial query plan,
which is usually an algebraic representation of the query. This initial plan
is then transformed into an equivalent plan that is expected to require less
time to execute.
c) Physical plan generation, where the abstract query plan from (b), often
called a logical query plan, is turned into a physical query plan by selecting
algorithms to implement each of the operators of the logical plan, and by
selecting an order of execution for these operators. The physical plan, like
the result of parsing and the logical plan, is represented by an expression
tree. The physical plan also includes details such as how the queried
relations are accessed, and when and if a relation should be sorted.

Parts (b) and (c) are often called the query optimizer, and these are the
hard parts of query compilation. Chapter 16 is devoted to query optimization;
we shall learn there how to select a "query plan" that takes as little time as
possible. To select the best query plan we need to decide:

1. Which of the algebraically equivalent forms of a query leads to the most


efficient algorithm for answering the query?
2. For each operation of the selected form, what algorithm should we use to
implement that operation?
3. How should the operations pass data from one to the other, e.g., in a
pipelined fashion, in main-memory buffers, or via the disk?

Each of these choices depends on the metadata about the database. Typical
metadata that is available to the query optimizer includes: the size of each


Figure 15.2: Outline of query compilation

relation; statistics such as the approximate number and frequency of different values for an attribute; the existence of certain indexes; and the layout of data on disk.

15.1 Introduction to Physical-Query-Plan Operators
Physical query plans are built from operators, each of which implements one
step of the plan. Often, the physical operators are particular implementations
for one of the operators of relational algebra. However, we also need physical
operators for other tasks that do not involve an operator of relational algebra.
For example, we often need to "scan" a table, that is, bring into main memory each tuple of some relation that is an operand of a relational-algebra expression.

In this section, we shall introduce the basic building blocks of physical query plans. Later sections cover the more complex algorithms that implement operators of relational algebra efficiently; these algorithms also form an essential part of physical query plans. We also introduce here the "iterator" concept, which is an important method by which the operators comprising a physical query plan can pass requests for tuples and answers among themselves.

15.1.1 Scanning Tables


Perhaps the most basic thing we can do in a physical query plan is to read the
entire contents of a relation R. This step is necessary when, for example, we
take the union or join of R with another relation. A variation of this operator
involves a simple predicate, where we read only those tuples of the relation R
that satisfy the predicate. There are two basic approaches to locating the tuples
of a relation R.

1. In many cases, the relation R is stored in an area of secondary memory,


with its tuples arranged in blocks. The blocks containing the tuples of R
are known to the system, and it is possible to get the blocks one by one.
This operation is called table-scan.

2. If there is an index on any attribute of R, we may be able to use this index


to get all the tuples of R. For example, a sparse index on R, as discussed
in Section 13.1.3, can be used to lead us to all the blocks holding R, even if
we don't know otherwise which blocks these are. This operation is called
index-scan.

We shall take up index-scan again in Section 15.6.2, when we talk about implementation of the σ operator. However, the important observation for now
is that we can use the index not only to get all the tuples of the relation it
indexes, but to get only those tuples that have a particular value (or sometimes
a particular range of values) in the attribute or attributes that form the search
key for the index.

15.1.2 Sorting While Scanning Tables


There are a number of reasons why we might want to sort a relation as we
read its tuples. For one, the query could include an ORDER BY clause, requiring
that a relation be sorted. For another, various algorithms for relational-algebra
operations require one or both of their arguments to be sorted relations. These
algorithms appear in Section 15.4 and elsewhere.
The physical-query-plan operator sort-scan takes a relation R and a speci­
fication of the attributes on which the sort is to be made, and produces R in
that sorted order. There are several ways that sort-scan can be implemented:

a) If we are to produce a relation R sorted by attribute a, and there is a


B-tree index on a, or R is stored as an indexed-sequential file ordered by
a, then a scan of the index allows us to produce R in the desired order.

b) If the relation R that we wish to retrieve in sorted order is small enough


to fit in main memory, then we can retrieve its tuples using a table scan
or index scan, and then use one of many possible efficient, main-memory
sorting algorithms.

c) If R is too large to fit in main memory, then the multiway merging approach covered in Section 11.4.3 is a good choice. However, instead of storing the final sorted R back on disk, we produce one block of the sorted R at a time, as its tuples are needed.

15.1.3 The Model of Computation for Physical Operators

A query generally consists of several operations of relational algebra, and the corresponding physical query plan is composed of several physical operators. Often, a physical operator is an implementation of a relational-algebra operator, but as we saw in Section 15.1.1, other physical plan operators correspond to operations like scanning that may be invisible in relational algebra.

Since choosing physical plan operators wisely is an essential of a good query processor, we must be able to estimate the "cost" of each operator we use. We shall use the number of disk I/O's as our measure of cost for an operation. This measure is consistent with our view (see Section 11.4.1) that it takes longer to get data from disk than to do anything useful with it once the data is in main memory. The one major exception is when answering a query involves communicating data across a network. We discuss costs for distributed query processing in Sections 15.9 and 19.4.4.

When comparing algorithms for the same operations, we shall make an assumption that may be surprising at first:

• We assume that the arguments of any operator are found on disk, but the result of the operator is left in main memory.

If the operator produces the final answer to a query, and that result is indeed written to disk, then the cost of doing so depends only on the size of the answer, and not on how the answer was computed. We can simply add the final write-back cost to the total cost of the query. However, in many applications, the answer is not stored on disk at all, but printed or passed to some formatting program. Then, the disk I/O cost of the output either is zero or depends upon what some unknown application program does with the data.

Similarly, the result of an operator that forms part of a query (rather than the whole query) often is not written to disk. In Section 15.1.6 we shall discuss "iterators," where the result of one operator is constructed in main memory, perhaps a small piece at a time, and passed as an argument to another operator. In this situation, we never have to write the result to disk, and moreover, we save the cost of reading from disk this argument of the operator that uses the result. This saving is an excellent opportunity for the query optimizer.

15.1.4 Parameters for Measuring Costs

Now, let us introduce the parameters (sometimes called statistics) that we use to express the cost of an operator. Estimates of cost are essential if the optimizer is to determine which of the many query plans is likely to execute fastest. Section 16.5 introduces the exploitation of these cost estimates.

We need a parameter to represent the portion of main memory that the operator uses, and we require other parameters to measure the size of its argument(s). Assume that main memory is divided into buffers, whose size is the same as the size of disk blocks. Then M will denote the number of main-memory buffers available to an execution of a particular operator. When evaluating the cost of an operator, we shall not count the cost — either memory used or disk I/O's — of producing the output; thus M includes only the space used to hold the input and any intermediate results of the operator.

Sometimes, we can think of M as the entire main memory, or most of the main memory, as we did in Section 11.4.4. However, we shall also see situations where several operations share the main memory, so M could be much smaller than the total main memory. In fact, as we shall discuss in Section 15.7, the number of buffers available to an operation may not be a predictable constant, but may be decided during execution, based on what other processes are executing at the same time. If so, M is really an estimate of the number of buffers available to the operation. If the estimate is wrong, then the actual execution time will differ from the predicted time used by the optimizer. We could even find that the chosen physical query plan would have been different, had the query optimizer known what the true buffer availability would be during execution.

Next, let us consider the parameters that measure the cost of accessing argument relations. These parameters, measuring size and distribution of data in a relation, are often computed periodically to help the query optimizer choose physical operators.

We shall make the simplifying assumption that data is accessed one block at a time from disk. In practice, one of the techniques discussed in Section 11.5 might be able to speed up the algorithm if we are able to read many blocks of the relation at once, and they can be read from consecutive blocks on a track. There are three parameter families, B, T, and V:

• When describing the size of a relation R, we most often are concerned with the number of blocks that are needed to hold all the tuples of R. This number of blocks will be denoted B(R), or just B if we know that relation R is meant. Usually, we assume that R is clustered; that is, it is stored in B blocks or in approximately B blocks. As discussed in Section 13.1.6, we may in fact wish to keep a small fraction of each block holding R empty for future insertions into R. Nevertheless, B will often be a good-enough approximation to the number of blocks that we must read from disk to see all of R, and we shall use B as that estimate uniformly.

• Sometimes, we also need to know the number of tuples in R, and we denote this quantity by T(R), or just T if R is understood. If we need the number of tuples of R that can fit in one block, we can use the ratio T/B. Further, there are some instances where a relation is stored distributed among blocks that are also occupied by tuples of other relations. If so, then a simplifying assumption is that each tuple of R requires a separate disk read, and we shall use T as an estimate of the disk I/O's needed to read R in this situation.

• Finally, we shall sometimes want to refer to the number of distinct values that appear in a column of a relation. If R is a relation, and one of its attributes is a, then V(R, a) is the number of distinct values of the column for a in R. More generally, if [a1, a2, ..., an] is a list of attributes, then V(R, [a1, a2, ..., an]) is the number of distinct n-tuples in the columns of R for attributes a1, a2, ..., an. Put formally, it is the number of tuples in δ(π_a1,a2,...,an(R)).

15.1.5 I/O Cost for Scan Operators

As a simple application of the parameters that were introduced, we can represent the number of disk I/O's needed for each of the table-scan operators discussed so far. If relation R is clustered, then the number of disk I/O's for the table-scan operator is approximately B. Likewise, if R fits in main memory, then we can implement sort-scan by reading R into memory and performing an in-memory sort, again requiring only B disk I/O's.

If R is clustered but requires a two-phase multiway merge sort, then, as discussed in Section 11.4.4, we require about 3B disk I/O's, divided equally among the operations of reading R in sublists, writing out the sublists, and rereading the sublists. Remember that we do not charge for the final writing of the result. Neither do we charge memory space for accumulated output. Rather, we assume each output block is immediately consumed by some other operation; possibly it is simply written to disk.

However, if R is not clustered, then the number of required disk I/O's is generally much higher. If R is distributed among tuples of other relations, then a table-scan for R may require reading as many blocks as there are tuples of R; that is, the I/O cost is T. Similarly, if we want to sort R, but R fits in memory, then T disk I/O's are what we need to get all of R into memory. Finally, if R is not clustered and requires a two-phase sort, then it takes T disk I/O's to read the subgroups initially. However, we may store and reread the sublists in clustered form, so these steps require only 2B disk I/O's. The total cost for performing sort-scan on a large, unclustered relation is thus T + 2B.

Finally, let us consider the cost of an index-scan. Generally, an index on a relation R occupies many fewer than B(R) blocks. Therefore, a scan of the entire R, which takes at least B disk I/O's, will require significantly more I/O's than does examining the entire index. Thus, even though index-scan requires examining both the relation and its index,

• We continue to use B or T as an estimate of the cost of accessing a clustered or unclustered relation in its entirety, using an index.

Why Iterators?

We shall see in Section 16.7 how iterators support efficient execution when they are composed within query plans. They contrast with a materialization strategy, where the result of each operator is produced in its entirety — and either stored on disk or allowed to take up space in main memory. When iterators are used, many operations are active at once. Tuples pass between operators as needed, thus reducing the need for storage. Of course, as we shall see, not all physical operators support the iteration approach, or "pipelining," in a useful way. In some cases, almost all the work would need to be done by the Open function, which is tantamount to materialization.

However, if we only want part of R, we often are able to avoid looking at the entire index and the entire R. We shall defer analysis of these uses of indexes to Section 15.6.2.
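To make these cost rules concrete, here is a minimal Python sketch (the function names and the example numbers are illustrative assumptions, not part of the text) that applies the clustered/unclustered and one-pass/two-phase distinctions above to estimate scan costs from B, T, and M.

def table_scan_cost(B, T, clustered):
    # Clustered: read the B blocks holding R; unclustered: roughly one
    # disk I/O per tuple, i.e., T.
    return B if clustered else T

def sort_scan_cost(B, T, M, clustered):
    # If R fits in M buffers, read it once and sort in memory.
    if B <= M:
        return B if clustered else T
    # Otherwise a two-phase multiway merge sort is needed:
    # clustered: read + write + reread the sublists = 3B;
    # unclustered: T to read the tuples, then 2B for the clustered sublists.
    return 3 * B if clustered else T + 2 * B

# Illustrative numbers: B = 10,000 blocks, T = 200,000 tuples, M = 1,000.
print(table_scan_cost(10000, 200000, clustered=True))        # 10000
print(sort_scan_cost(10000, 200000, 1000, clustered=False))  # 220000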

15.1.6 Iterators for Implementation of Physical Operators

Many physical operators can be implemented as an iterator, which is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time. The three functions forming the iterator for an operation are:

1. Open. This function starts the process of getting tuples, but does not get a tuple. It initializes any data structures needed to perform the operation and calls Open for any arguments of the operation.

2. GetNext. This function returns the next tuple in the result and adjusts data structures as necessary to allow subsequent tuples to be obtained. In getting the next tuple of its result, it typically calls GetNext one or more times on its argument(s). If there are no more tuples to return, GetNext returns a special value NotFound, which we assume cannot be mistaken for a tuple.

3. Close. This function ends the iteration after all tuples, or all tuples that the consumer wanted, have been obtained. Typically, it calls Close on any arguments of the operator.

When describing iterators and their functions, we shall assume that there is a "class" for each type of iterator (i.e., for each type of physical operator implemented as an iterator), and the class supports Open, GetNext, and Close methods on instances of the class.


Example 15.1: Perhaps the simplest iterator is the one that implements the table-scan operator. The iterator is implemented by a class TableScan, and a table-scan operator in a query plan is an instance of this class parameterized by the relation R we wish to scan. Let us assume that R is a relation clustered in some list of blocks, which we can access in a convenient way; that is, the notion of "get the next block of R" is implemented by the storage system and need not be described in detail. Further, we assume that within a block there is a directory of records (tuples) so that it is easy to get the next tuple of a block or tell that the last tuple has been reached.

Open() {
    b := the first block of R;
    t := the first tuple of block b;
}

GetNext() {
    IF (t is past the last tuple on block b) {
        increment b to the next block;
        IF (there is no next block)
            RETURN NotFound;
        ELSE /* b is a new block */
            t := first tuple on block b;
    } /* now we are ready to return t and increment */
    oldt := t;
    increment t to the next tuple of b;
    RETURN oldt;
}

Close() {
}

Figure 15.3: Iterator functions for the table-scan operator over relation R

Figure 15.3 sketches the three functions for this iterator. We imagine a block pointer b and a tuple pointer t that points to a tuple within block b. We assume that both pointers can point "beyond" the last block or last tuple of a block, respectively, and that it is possible to identify when these conditions occur. Notice that Close in this example does nothing. In practice, a Close function for an iterator might clean up the internal structure of the DBMS in various ways. It might inform the buffer manager that certain buffers are no longer needed, or inform the concurrency manager that the read of a relation has completed. □
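For readers who prefer runnable code, the following is a minimal Python rendering of the table-scan iterator of Fig. 15.3. It is only a sketch, assuming the blocks of R are available as a list of lists of tuples; the names TableScan and NOT_FOUND are illustrative.

NOT_FOUND = object()          # plays the role of the special value NotFound

class TableScan:
    def __init__(self, blocks):
        self.blocks = blocks  # relation R, modeled as a list of blocks

    def open(self):
        self.b = 0            # current block index
        self.t = 0            # current tuple index within block b

    def get_next(self):
        # If block b is exhausted, move on to the next (nonempty) block.
        while self.b < len(self.blocks) and self.t >= len(self.blocks[self.b]):
            self.b += 1
            self.t = 0
        if self.b >= len(self.blocks):
            return NOT_FOUND  # no next block: R is exhausted
        oldt = self.blocks[self.b][self.t]
        self.t += 1           # advance to the next tuple of b
        return oldt

    def close(self):
        pass                  # nothing to clean up in this sketch

# Usage: stream R one tuple at a time.
r = TableScan([[(1, 'a'), (2, 'b')], [(3, 'c')]])
r.open()
while (t := r.get_next()) is not NOT_FOUND:
    print(t)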

Example 15.2: Now, let us consider an example where the iterator does most of the work in its Open function. The operator is sort-scan, where we read the tuples of a relation R but return them in sorted order. Further, let us suppose that R is so large that we need to use a two-phase, multiway merge-sort, as in Section 11.4.4.

We cannot return even the first tuple until we have examined each tuple of R. Thus, Open must do at least the following:

1. Read all the tuples of R in main-memory-sized chunks, sort them, and store them on disk.

2. Initialize the data structure for the second (merge) phase, and load the first block of each sublist into the main-memory structure.

Then, GetNext can run a competition for the first remaining tuple at the heads of all the sublists. If the block from the winning sublist is exhausted, GetNext reloads its buffer. □
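The blocking behaviour of Example 15.2, where Open does essentially all the work, can be illustrated with the sketch below. For brevity it sorts in memory rather than running a true two-phase merge sort on disk, and all names (ListScan, SortScan) are illustrative assumptions.

NOT_FOUND = object()

class ListScan:
    """Tiny stand-in child iterator over an in-memory list of tuples."""
    def __init__(self, tuples): self.tuples = tuples
    def open(self): self.i = 0
    def get_next(self):
        if self.i == len(self.tuples):
            return NOT_FOUND
        t = self.tuples[self.i]
        self.i += 1
        return t
    def close(self): pass

class SortScan:
    def __init__(self, child, key):
        self.child, self.key = child, key

    def open(self):
        # Blocking behaviour: consume the whole input before the first
        # GetNext. A real implementation would run a two-phase multiway
        # merge sort on disk; here we simply sort in memory.
        self.child.open()
        rows = []
        while (t := self.child.get_next()) is not NOT_FOUND:
            rows.append(t)
        self.child.close()
        rows.sort(key=self.key)
        self.rows = iter(rows)

    def get_next(self):
        return next(self.rows, NOT_FOUND)

    def close(self): pass

s = SortScan(ListScan([(3, 'c'), (1, 'a'), (2, 'b')]), key=lambda t: t[0])
s.open()
while (t := s.get_next()) is not NOT_FOUND:
    print(t)          # (1, 'a'), (2, 'b'), (3, 'c')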

Example 15.3: Finally, let us consider a simple example of how iterators can be combined by calling other iterators. It is not a good example of how many iterators can be active simultaneously, but that will have to wait until we have considered algorithms for physical operators like selection and join, which exploit this capability of iterators better.

Our operation is the bag union R ∪ S, in which we produce first all the tuples of R and then all the tuples of S, without regard for the existence of duplicates. Let R and S denote the iterators that produce relations R and S, and thus are the "children" of the union operator in a query plan for R ∪ S. Iterators R and S could be table scans applied to stored relations R and S, or they could be iterators that call a network of other iterators to compute R and S. Regardless, all that is important is that we have available functions R.Open, R.GetNext, and R.Close, and analogous functions for iterator S. The iterator functions for the union are sketched in Fig. 15.4. One subtle point is that the functions use a shared variable CurRel that is either R or S, depending on which relation is being read from currently. □
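A Python counterpart of this union iterator (sketched in pseudocode in Fig. 15.4 below) might look as follows; the shared variable CurRel becomes an ordinary attribute, and the small stand-in child iterator is purely illustrative.

NOT_FOUND = object()

class Union:
    """Bag union: stream all tuples of R, then all tuples of S."""
    def __init__(self, r, s):
        self.r, self.s = r, s            # child iterators for R and S

    def open(self):
        self.r.open()
        self.cur = self.r                # the relation currently being read

    def get_next(self):
        if self.cur is self.r:
            t = self.r.get_next()
            if t is not NOT_FOUND:
                return t
            self.s.open()                # R is exhausted: switch to S
            self.cur = self.s
        return self.s.get_next()         # NOT_FOUND from S ends the union too

    def close(self):
        self.r.close()
        self.s.close()

class Rows:
    """Tiny stand-in child iterator over an in-memory list of tuples."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.it = iter(self.rows)
    def get_next(self): return next(self.it, NOT_FOUND)
    def close(self): pass

u = Union(Rows([1, 2]), Rows([2, 3]))
u.open()
while (t := u.get_next()) is not NOT_FOUND:
    print(t)                             # 1, 2, 2, 3: duplicates are kept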

15.2 One-Pass Algorithms for Database Operations

We shall now begin our study of a very important topic in query optimization: how should we execute each of the individual steps — for example, a join or selection — of a logical query plan? The choice of an algorithm for each operator is an essential part of the process of transforming a logical query plan into a physical query plan. While many algorithms for operators have been proposed, they largely fall into three classes:

1. Sorting-based methods. These are covered primarily in Section 15.4.


Open() {
    R.Open();
    CurRel := R;
}

GetNext() {
    IF (CurRel = R) {
        t := R.GetNext();
        IF (t <> NotFound) /* R is not exhausted */
            RETURN t;
        ELSE /* R is exhausted */ {
            S.Open();
            CurRel := S;
        }
    }
    /* here, we must read from S */
    RETURN S.GetNext();
    /* notice that if S is exhausted, S.GetNext()
       will return NotFound, which is the correct
       action for our GetNext as well */
}

Close() {
    R.Close();
    S.Close();
}

Figure 15.4: Building a union iterator from iterators R and S

2. Hash-based methods. These are mentioned in Section 15.5 and Section 15.9, among other places.

3. Index-based methods. These are emphasized in Section 15.6.

In addition, we can divide algorithms for operators into three "degrees" of difficulty and cost:

a) Some methods involve reading the data only once from disk. These are the one-pass algorithms, and they are the topic of this section. Usually, they work only when at least one of the arguments of the operation fits in main memory, although there are exceptions, especially for selection and projection as discussed in Section 15.2.1.
b) Some methods work for data that is too large to fit in available main
memory but not for the largest imaginable data sets. An example of such

an algorithm is the two-phase, multiway merge sort of Section 11.4.4. These two-pass algorithms are characterized by reading data a first time from disk, processing it in some way, writing all, or almost all of it to disk, and then reading it a second time for further processing during the second pass. We meet these algorithms in Sections 15.4 and 15.5.
c) Some methods work without a limit on the size of the data. These meth­
ods use three or more passes to do their jobs, and are natural, recursive
generalizations of the two-pass algorithms; we shall study multipass meth­
ods in Section 15.8.
In this section, we shall concentrate on the one-pass methods. However,
both in this section and subsequently, we shall classify operators into three
broad groups:

1. Tuple-at-a-time, unary operations. These operations — selection and projection — do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main-memory buffer, and produce our output.

2. Full-relation, unary operations. These one-argument operations require seeing all or most of the tuples in memory at once, so one-pass algorithms are limited to relations that are approximately of size M (the number of main-memory buffers available) or less. The operations of this class that we consider here are γ (the grouping operator) and δ (the duplicate-elimination operator).

3. Full-relation, binary operations. All other operations are in this class: set and bag versions of union, intersection, difference, joins, and products. Except for bag union, each of these operations requires at least one argument to be limited to size M, if we are to use a one-pass algorithm.

15.2.1 One-Pass Algorithms for Tuple-at-a-Time Operations

The tuple-at-a-time operations σ(R) and π(R) have obvious algorithms, regardless of whether the relation fits in main memory. We read the blocks of R one at a time into an input buffer, perform the operation on each tuple, and move the selected tuples or the projected tuples to the output buffer, as suggested by Fig. 15.5. Since the output buffer may be an input buffer of some other operator, or may be sending data to a user or application, we do not count the output buffer as needed space. Thus, we require only that M ≥ 1 for the input buffer, regardless of B.

The disk I/O requirement for this process depends only on how the argument relation R is provided. If R is initially on disk, then the cost is whatever it takes to perform a table-scan or index-scan of R. The cost was discussed in Section 15.1.5; typically it is B if R is clustered and T if it is not clustered.


Figure 15.5: A selection or projection being performed on a relation R

Extra Buffers Can Speed Up Operations


Although tuple-at-a-time operations can get by with only one input buffer
and one output buffer, as suggested by Fig. 15.5, we can often speed up
processing if we allocate more input buffers. The idea appeared first in
Section 11.5.1. If R is stored on consecutive blocks within cylinders, then
we can read an entire cylinder into buffers, while paying for the seek time
and rotational latency for only one block per cylinder. Similarly, if the
output of the operation can be stored on full cylinders, we waste almost
no time writing.

However, we should remind the reader again of the important exception when
the operation being performed is a selection, and the condition compares a
constant to an attribute that has an index. In that case, we can use the index
to retrieve only a subset of the blocks holding R , thus improving performance,
often markedly.
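Written as Python generators, an illustrative sketch of the tuple-at-a-time algorithms looks like this (tuples are represented as dicts and a relation as a list of blocks; these representation choices are assumptions for the example, not the book's notation).

def select(blocks, predicate):
    # One-pass selection: one block of R in the input buffer at a time,
    # qualifying tuples are streamed to the output.
    for block in blocks:
        for t in block:
            if predicate(t):
                yield t

def project(blocks, attrs):
    # One-pass projection: keep only the listed attributes of each tuple.
    for block in blocks:
        for t in block:
            yield {a: t[a] for a in attrs}

R = [[{'a': 1, 'b': 10}, {'a': 2, 'b': 20}], [{'a': 3, 'b': 30}]]
print(list(select(R, lambda t: t['a'] >= 2)))   # tuples with a >= 2
print(list(project(R, ['b'])))                  # only attribute b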

15.2.2 One-Pass Algorithms for Unary, Full-Relation Operations

Now, let us consider the unary operations that apply to relations as a whole, rather than to one tuple at a time: duplicate elimination (δ) and grouping (γ).

Duplicate Elimination

To eliminate duplicates, we can read each block of R one at a time, but for each tuple we need to make a decision as to whether:

1. It is the first time we have seen this tuple, in which case we copy it to the output, or

2. We have seen the tuple before, in which case we must not output this tuple.

To support this decision, we need to keep in memory one copy of every tuple we have seen, as suggested in Fig. 15.6. One memory buffer holds one block of R's tuples, and the remaining M − 1 buffers can be used to hold a single copy of every tuple seen so far.


Figure 15.6: Managing memory for a one-pass duplicate-elimination

When storing the already-seen tuples, we must be careful about the main-
memory data structure we use. Naively, we might just list the tuples we have
seen. When a new tuple from R is considered, we compare it with all tuples
seen so far, and if it is not equal to any of these tuples we both copy it to the
output and add it to the in-memory list of tuples we have seen.
However, if there are n tuples in main memory, each new tuple takes processor time proportional to n, so the complete operation takes processor time proportional to n². Since n could be very large, this amount of time calls into serious question our assumption that only the disk I/O time is significant. Thus, we need a main-memory structure that allows each of the operations:

1 . Add a new tuple, and


2. Tell whether a given tuple is already there

to be done in time that is close to a constant, independent of the number of tuples n that we currently have in memory. There are many such structures known. For example, we could use a hash table with a large number of buckets, or some form of balanced binary search tree.1 Each of these structures has some

1 See Aho, A. V., J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, 1984 for discussions of suitable main-memory structures. In particular, hashing takes on average O(n) time to process n items, and balanced trees take O(n log n) time; either is sufficiently close to linear for our purposes.


space overhead in addition to the space needed to store the tuples; for instance, a main-memory hash table needs a bucket array and space for pointers to link the tuples in a bucket. However, the overhead tends to be small compared with the space needed to store the tuples. We shall thus make the simplifying assumption of no overhead space and concentrate on what is required to store the tuples in main memory.

On this assumption, we may store in the M − 1 available buffers of main memory as many tuples as will fit in M − 1 blocks of R. If we want one copy of each distinct tuple of R to fit in main memory, then B(δ(R)) must be no larger than M − 1. Since we expect M to be much larger than 1, a simpler approximation to this rule, and the one we shall generally use, is:

• B(δ(R)) ≤ M

Note that we cannot in general compute the size of δ(R) without computing δ(R) itself. Should we underestimate that size, so B(δ(R)) is actually larger than M, we shall pay a significant penalty due to thrashing, as the blocks holding the distinct tuples of R must be brought into and out of main memory frequently.
frequently.

Grouping

A grouping operation γ_L gives us zero or more grouping attributes and presumably one or more aggregated attributes. If we create in main memory one entry for each group — that is, for each value of the grouping attributes — then we can scan the tuples of R, one block at a time. The entry for a group consists of values for the grouping attributes and an accumulated value or values for each aggregation. The accumulated value is, except in one case, obvious:

• For a MIN(a) or MAX(a) aggregate, record the minimum or maximum value, respectively, of attribute a seen for any tuple in the group so far. Change this minimum or maximum, if appropriate, each time a tuple of the group is seen.

• For any COUNT aggregation, add one for each tuple of the group that is seen.

• For SUM(a), add the value of attribute a to the accumulated sum for its group.

• AVG(a) is the hard case. We must maintain two accumulations: the count of the number of tuples in the group and the sum of the a-values of these tuples. Each is computed as we would for a COUNT and SUM aggregation, respectively. After all tuples of R are seen, we take the quotient of the sum and count to obtain the average.

When all tuples of R have been read into the input buffer and contributed to the aggregation(s) for their group, we can produce the output by writing the tuple for each group. Note that until the last tuple is seen, we cannot begin to create output for a γ operation. Thus, this algorithm does not fit the iterator framework very well; the entire grouping has to be done by the Open function before the first tuple can be retrieved by GetNext.

In order that the in-memory processing of each tuple be efficient, we need to use a main-memory data structure that lets us find the entry for each group, given values for the grouping attributes. As discussed above for the δ operation, common main-memory data structures such as hash tables or balanced trees will serve well. We should remember, however, that the search key for this structure is the grouping attributes only.

The number of disk I/O's needed for this one-pass algorithm is B, as must be the case for any one-pass algorithm for a unary operator. The number of required memory buffers M is not related to B in any simple way, although typically M will be less than B. The problem is that the entries for the groups could be longer or shorter than tuples of R, and the number of groups could be anything equal to or less than the number of tuples of R. However, in most cases, group entries will be no longer than R's tuples, and there will be many fewer groups than tuples.
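A compact illustration of the one-pass grouping algorithm follows; it keeps one in-memory entry per group, keyed by the grouping attribute, and maintains the MIN/COUNT/SUM accumulators plus the pair needed for AVG. Using a single grouping attribute and a single aggregated attribute, and the names gamma, dept, and sal, are illustrative simplifications.

def gamma(blocks, group_attr, agg_attr):
    # One-pass grouping: one entry per value of the grouping attribute.
    groups = {}
    for block in blocks:                       # scan R one block at a time
        for t in block:
            g = t[group_attr]
            acc = groups.setdefault(g, {'min': None, 'count': 0, 'sum': 0})
            v = t[agg_attr]
            acc['min'] = v if acc['min'] is None else min(acc['min'], v)
            acc['count'] += 1
            acc['sum'] += v
    # Output can be produced only after the entire input has been seen.
    for g, acc in groups.items():
        yield {group_attr: g, 'MIN': acc['min'], 'COUNT': acc['count'],
               'SUM': acc['sum'], 'AVG': acc['sum'] / acc['count']}

R = [[{'dept': 'A', 'sal': 10}, {'dept': 'B', 'sal': 30}],
     [{'dept': 'A', 'sal': 20}]]
for row in gamma(R, 'dept', 'sal'):
    print(row)   # dept A: MIN 10, COUNT 2, SUM 30, AVG 15.0; dept B: ...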

15.2.3 One-Pass Algorithms for Binary Operations

Let us now take up the binary operations: union, intersection, difference, product, and join. Since in some cases we must distinguish the set- and bag-versions of these operators, we shall subscript them with B or S for "bag" and "set," respectively; e.g., ∪B for bag union or −S for set difference. To simplify the discussion of joins, we shall consider only the natural join. An equijoin can be implemented the same way, after attributes are renamed appropriately, and theta-joins can be thought of as a product or equijoin followed by a selection for those conditions that cannot be expressed in an equijoin.

Bag union can be computed by a very simple one-pass algorithm. To compute R ∪B S, we copy each tuple of R to the output and then copy every tuple of S, as we did in Example 15.3. The number of disk I/O's is B(R) + B(S), as it must be for a one-pass algorithm on operands R and S, while M = 1 suffices regardless of how large R and S are.

Other binary operations require reading the smaller of the operands R and S into main memory and building a suitable data structure so tuples can be both inserted quickly and found quickly, as discussed in Section 15.2.2. As before, a hash table or balanced tree suffices. The structure requires a small amount of space (in addition to the space for the tuples themselves), which we shall neglect. Thus, the approximate requirement for a binary operation on relations R and S to be performed in one pass is:

• min(B(R), B(S)) ≤ M


Operations on Nonclustered Data


Remember that all our calculations regarding the number of disk I/O’ s re­
quired for an operation are predicated on the assumption that the operand
relations are clustered. In the (typically rare) event that an operand R is
not clustered, then it may take us T(R) disk I/O ’ s, rather than B(R) disk
I/O’ s to read all the tuples of R. Note, however, that any relation that is
the result of an operator may always be assumed clustered, since we have
no reason to store a temporary relation in a nonclustered fashion.

This rule assumes that one buffer will be used to read the blocks of the larger
relation, while approximately M buffers are needed to house the entire smaller
relation and its main-memory data structure.
We shall now give the details of the various operations. In each case, we
assume R is the larger of the relations, and we house S in main memory.

Set Union

We read S into M − 1 buffers of main memory and build a search structure where the search key is the entire tuple. All these tuples are also copied to the output. We then read each block of R into the Mth buffer, one at a time. For each tuple t of R, we see if t is in S, and if not, we copy t to the output. If t is also in S, we skip t.

Set Intersection

Read S into M − 1 buffers and build a search structure with full tuples as the search key. Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output, and if not, ignore t.

Set Difference

Since difference is not commutative, we must distinguish between R −S S and S −S R, continuing to assume that R is the larger relation. In each case, read S into M − 1 buffers and build a search structure with full tuples as the search key.

To compute R −S S, we read each block of R and examine each tuple t on that block. If t is in S, then ignore t; if it is not in S then copy t to the output.

To compute S −S R, we again read the blocks of R and examine each tuple t in turn. If t is in S, then we delete t from the copy of S in main memory, while if t is not in S we do nothing. After considering each tuple of R, we copy to the output those tuples of S that remain.

Bag Intersection

We read S into M − 1 buffers, but we associate with each distinct tuple a count, which initially measures the number of times this tuple occurs in S. Multiple copies of a tuple t are not stored individually. Rather we store one copy of t and associate with it a count equal to the number of times t occurs.

This structure could take slightly more space than B(S) blocks if there were few duplicates, although frequently the result is that S is compacted. Thus, we shall continue to assume that B(S) ≤ M is sufficient for a one-pass algorithm to work, although the condition is only an approximation.

Next, we read each block of R, and for each tuple t of R we see whether t occurs in S. If not we ignore t; it cannot appear in the intersection. However, if t appears in S, and the count associated with t is still positive, then we output t and decrement the count by 1. If t appears in S, but its count has reached 0, then we do not output t; we have already produced as many copies of t in the output as there were copies in S.

Bag Difference

To compute S −B R, we read the tuples of S into main memory, and count the number of occurrences of each distinct tuple, as we did for bag intersection. When we read R, for each tuple t we see whether t occurs in S, and if so, we decrement its associated count. At the end, we copy to the output each tuple in main memory whose count is positive, and the number of times we copy it equals that count.

To compute R −B S, we also read the tuples of S into main memory and count the number of occurrences of distinct tuples. We may think of a tuple t with a count of c as c reasons not to copy t to the output as we read tuples of R. That is, when we read a tuple t of R, we see if t occurs in S. If not, then we copy t to the output. If t does occur in S, then we look at the current count c associated with t. If c = 0, then copy t to the output. If c > 0, do not copy t to the output, but decrement c by 1.

Product

Read S into M − 1 buffers of main memory; no special data structure is needed. Then read each block of R, and for each tuple t of R concatenate t with each tuple of S in main memory. Output each concatenated tuple as it is formed.

This algorithm may take a considerable amount of processor time per tuple of R, because each such tuple must be matched with M − 1 blocks full of tuples. However, the output size is also large, and the time per output tuple is small.

Natural Join

In this and other join algorithms, let us take the convention that R(X, Y) is being joined with S(Y, Z), where Y represents all the attributes that R and S

What if M is not Known?


While we present algorithms as if M, the number of available memory
blocks, were fixed and known in advance, remember that the available M
is often unknown, except within some obvious limits like the total memory
of the machine. Thus, a query optimizer, when choosing between a one-
pass and a two-pass algorithm, might estimate M and make the choice
based on this estimate. If the optimizer is wrong, the penalty is either
thrashing of buffers between disk and memory (if the guess of M was too
high), or unnecessary passes if M was underestimated.
There are also some algorithms that degrade gracefully when there
is less memory than expected. For example, we can behave like a one-
pass algorithm, unless we run out of space, and then start behaving like
a two-pass algorithm. Sections 15.5.6 and 15.7.3 discuss some of these
approaches.

have in common, X is all attributes of R that are not in the schema of S , and
Z is all attributes of S that are not in the schema of R . We continue to assume
that S is the smaller relation. To compute the natural join, do the following:

1. Read all the tuples of S and form them into a main-memory search struc­
ture with the attributes of Y as the search key. As usual, a hash table or
balanced tree are good examples of such structures. Use M — 1 blocks of
memory for this purpose.

2. Read each block of R into the one remaining main-memory buffer. For
each tuple t of R , find the tuples of S that agree with t on all attributes
of Y, using the search structure. For each matching tuple of S, form a
tuple by joining it with t, and move the resulting tuple to the output.

Like all the one-pass, binary algorithms, this one takes B ( R ) + B ( S ) disk I/O’
s
to read the operands. It works as long as B ( S ) < M — 1, or approximately,
B ( S ) < M . Also as for the other algorithms we have studied, the space required
by the main-memory search structure is not counted but may lead to a small,
additional memory requirement.
We shall not discuss joins other than the natural join. Remember that an
equijoin is executed in essentially the same way as a natural join, but we must
account for the fact that "equal" attributes from the two relations may have
different names. A theta-join that is not an equijoin can be replaced by an
equijoin or product followed by a selection.
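The two steps of the one-pass natural join translate directly into code. The sketch below is illustrative, representing tuples as dicts and assuming that S fits in memory; it builds a hash table on the join attributes Y of S and then streams the blocks of R past it.

from collections import defaultdict

def natural_join(r_blocks, s_blocks, y_attrs):
    # Step 1: read S and index its tuples by their Y-values (hash table).
    index = defaultdict(list)
    for block in s_blocks:
        for s in block:
            index[tuple(s[a] for a in y_attrs)].append(s)
    # Step 2: read R one block at a time and probe the hash table.
    for block in r_blocks:
        for r in block:
            for s in index.get(tuple(r[a] for a in y_attrs), []):
                yield {**s, **r}        # join each matching pair of tuples

R = [[{'y': 1, 'x': 'r1'}, {'y': 2, 'x': 'r2'}]]
S = [[{'y': 1, 'z': 's1'}, {'y': 3, 'z': 's3'}]]
print(list(natural_join(R, S, ['y'])))  # only the tuples with y = 1 join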

15.2.4 Exercises for Section 15.2


Exercise 15.2.1: For each of the operations below, write an iterator that uses the algorithm described in this section.

* a) Projection.

* b) Distinct (δ).

c) Grouping (γ_L).

* d) Set union.

e) Set intersection.

f) Set difference.

g) Bag intersection.

h) Bag difference.

i) Product.

j) Natural join.

Exercise 15.2.2: For each of the operators in Exercise 15.2.1, tell whether the operator is blocking, by which we mean that the first output cannot be produced until all the input has been read. Put another way, a blocking operator is one whose only possible iterators have all the important work done by Open.

Exercise 15.2.3: Figure 15.9 summarizes the memory and disk-I/O requirements of the algorithms of this section and the next. However, it assumes all arguments are clustered. How would the entries change if one or both arguments were not clustered?

! Exercise 15.2.4: Give one-pass algorithms for each of the following join-like operators:

* a) The semijoin R ⋉ S, assuming R fits in memory (see Exercise 5.2.10 for a definition of the semijoin).

* b) The semijoin R ⋉ S, assuming S fits in memory.

c) The antisemijoin of R and S, assuming R fits in memory (see Exercise 5.2.11 for a definition of the antisemijoin).

d) The antisemijoin of R and S, assuming S fits in memory.

* e) The left outerjoin of R and S, assuming R fits in memory (see Section 5.4.7 for definitions involving outerjoins).

f) The left outerjoin of R and S, assuming S fits in memory.


g) The right outerjoin of R and S, assuming R fits in memory.

h) The right outerjoin of R and S, assuming S fits in memory.

i) The full outerjoin of R and S, assuming R fits in memory.

15.3 Nested-Loop Joins


Before proceeding to the more complex algorithms in the next sections, we shall turn our attention to a family of algorithms for the join operator called "nested-loop" joins. These algorithms are, in a sense, "one-and-a-half" passes, since in each variation one of the two arguments has its tuples read only once, while the other argument will be read repeatedly. Nested-loop joins can be used for relations of any size; it is not necessary that one relation fit in main memory.

15.3.1 Tuple-Based Nested-Loop Join

The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join R(X, Y) ⋈ S(Y, Z) as follows:

FOR each tuple s in S DO


FOR each tuple r in R DO
IF r and s join to make a tuple t THEN
output t;

If we are careless about how we buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O's. However, there are many situations where this algorithm can be modified to have much lower cost. One case is when we can use an index on the join attribute or attributes of R to find the tuples of R that match a given tuple of S, without having to read the entire relation R. We discuss index-based joins in Section 15.6.3. A second improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop. We shall consider this block-based version of nested-loop join in Section 15.3.3.

15.3.2 An Iterator for Tuple-Based Nested-Loop Join

One advantage of a nested-loop join is that it fits well into an iterator framework, and thus, as we shall see in Section 16.7.3, allows us to avoid storing intermediate relations on disk in some situations. The iterator for R ⋈ S is easy to build from the iterators for R and S, which support functions R.Open(), and so on, as in Section 15.1.6. The code for the three iterator functions for nested-loop join is in Fig. 15.7. It makes the assumption that neither relation R nor S is empty.

Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}

GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) { /* R is exhausted for
                               the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound;
                /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    }
    UNTIL (r and s join);
    RETURN the join of r and s;
}

Close() {
    R.Close();
    S.Close();
}

Figure 15.7: Iterator functions for tuple-based nested-loop join of R and S

15.3.3 A Block-Based Nested-Loop Join Algorithm

We can improve on the tuple-based nested-loop join of Section 15.3.1 if we compute R ⋈ S by:

1. Organizing access to both argument relations by blocks, and

2. Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop.

Point (1) makes sure that when we run through the tuples of R in the inner loop, we use as few disk I/O's as possible to read R. Point (2) enables us to join each tuple of R that we read with not just one tuple of S, but with as many tuples of S as will fit in memory.

As in Section 15.2.3, let us assume B(S) ≤ B(R), but now let us also assume that B(S) > M; i.e., neither relation fits entirely in main memory. We repeatedly read M − 1 blocks of S into main-memory buffers. A search structure, with search key equal to the common attributes of R and S, is created for the tuples of S that are in main memory. Then we go through all the blocks of R, reading each one in turn into the last block of memory. Once there, we compare all the tuples of R's block with all the tuples in all the blocks of S that are currently in main memory. For those that join, we output the joined tuple. The nested-loop structure of this algorithm can be seen when we describe the algorithm more formally, in Fig. 15.8.

FOR each chunk of M-1 blocks of S DO BEGIN


read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t;
output the join of t with each of these tuples;
END;
END;
END;

Figure 15.8: The nested-loop join algorithm

The program of Fig. 15.8 appears to have three nested loops. However, there
really are only two loops if we look at the code at the right level of abstraction.
The first, or outer loop, runs through the tuples of S . The other two loops
run through the tuples of R . However, we expressed the process as two loops
to emphasize that the order in which we visit the tuples of R is not arbitrary.
Rather, we need to look at these tuples a block at a time (the role of the second
loop), and within one block, we look at all the tuples of that block before moving
on to the next block (the role of the third loop).
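As a companion to Fig. 15.8, here is a minimal Python sketch of the block-based join. It is not the book's code: read_s_blocks and read_r_blocks are assumed callables that yield one block (a list of tuples) at a time, key_s and key_r extract the common join attributes, and m plays the role of M.

    from itertools import islice

    def block_nested_loop_join(read_s_blocks, read_r_blocks, key_s, key_r, m):
        """Repeatedly buffer M-1 blocks of S, build a search structure on them,
        and probe it with every block of R, as in Fig. 15.8."""
        s_blocks = iter(read_s_blocks())
        while True:
            chunk = list(islice(s_blocks, m - 1))      # next M-1 blocks of S
            if not chunk:
                break
            table = {}                                 # search structure on the join attributes
            for block in chunk:
                for s in block:
                    table.setdefault(key_s(s), []).append(s)
            for r_block in read_r_blocks():            # one block of R at a time
                for r in r_block:
                    for s in table.get(key_r(r), []):
                        yield (r, s)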
Example 15.4: Let B(R) = 1000, B(S) = 500, and M = 101. We shall use
100 blocks of memory to buffer S in 100-block chunks, so the outer loop of
Fig. 15.8 iterates five times. At each iteration, we do 100 disk I/O's to read the
chunk of S, and we must read R entirely in the second loop, using 1000 disk
I/O's. Thus, the total number of disk I/O's is 5500.
Notice that if we reversed the roles of R and S, the algorithm would use
slightly more disk I/O's. We would iterate 10 times through the outer loop and
do 600 disk I/O's at each iteration, for a total of 6000. In general, there is a
slight advantage to using the smaller relation in the outer loop. □

The algorithm of Fig. 15.8 is sometimes called “nested-block join.” We shall
continue to call it simply nested-loop join, since it is the variant of the nested-loop
idea most commonly implemented in practice. If necessary to distinguish
it from the tuple-based nested-loop join of Section 15.3.1, we can call Fig. 15.8
“block-based nested-loop join.”

15.3.4 Analysis of Nested-Loop Join


The analysis of Example 15.4 can be repeated for any B ( R ), B ( S ) , and M .
Assuming S is the smaller relation, the number of chunks, or iterations of the
outer loop, is B(S)/(M − 1). At each iteration, we read M − 1 blocks of S and
B(R) blocks of R. The number of disk I/O's is thus

    B(S)(M − 1 + B(R)) / (M − 1)

or

    B(S) + B(S)B(R)/(M − 1)

Assuming all of M , B ( S ), and B ( R ) are large, but M is the smallest of


these, an approximation to the above formula is B ( S ) B ( R ) / M . That is, the
cost is proportional to the product of the sizes of the two relations, divided by
the amount of available main memory. We can do much better than a nested-
loop join when both relations are large. But for reasonably small examples
such as Example 15.4, the cost of the nested-loop join is not much greater than
the cost of a one-pass join, which is 1500 disk I/O ’ s for this example. In fact,
if B ( S ) < M — 1, the nested-loop join becomes identical to the one-pass join
algorithm of Section 15.2.3.
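As a quick check of this formula, a few lines of Python (mine, not the text's) reproduce the counts of Example 15.4; the function name and the use of an integer ceiling for the number of chunks are illustrative choices.

    import math

    def nested_loop_io(b_r, b_s, m):
        """Disk I/O's for block-based nested-loop join with S (b_s blocks) in the outer loop."""
        chunks = math.ceil(b_s / (m - 1))      # iterations of the outer loop
        return b_s + chunks * b_r              # read S once, and R once per chunk

    # Example 15.4: B(R)=1000, B(S)=500, M=101 gives 5500;
    # with the roles of R and S reversed the cost is 6000.
    assert nested_loop_io(1000, 500, 101) == 5500
    assert nested_loop_io(500, 1000, 101) == 6000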
Although nested-loop join is generally not the most efficient join algorithm
possible, we should note that in some early relational DBMS’ s, it was the only
method available. Even today, it is needed as a subroutine in more efficient
join algorithms in certain situations, such as when large numbers of tuples from
each relation share a common value for the join attribute(s). For an example
where nested-loop join is essential, see Section 15.4.5.

15.3.5 Summary of Algorithms so Far


The main-memory and disk I/O requirements for the algorithms we have discussed
in Sections 15.2 and 15.3 are shown in Fig. 15.9. The memory requirements
for γ and δ are actually more complex than shown, and M = B is only
a loose approximation. For γ, M grows with the number of groups, and for δ,
M grows with the number of distinct tuples.

15.3.6 Exercises for Section 15.3


Exercise 15.3.1: Give the three iterator functions for the block-based version
of nested-loop join.

    Operators          Approximate M required    Disk I/O         Section
    σ, π               1                         B                15.2.1
    γ, δ               B                         B                15.2.2
    ∪, ∩, −, ×, ⋈      min(B(R), B(S))           B(R) + B(S)      15.2.3
    ⋈                  any M ≥ 2                 B(R)B(S)/M       15.3.3

Figure 15.9: Main memory and disk I/O requirements for one-pass and nested-loop algorithms

* Exercise 15.3.2: Suppose B(R) = B(S) = 10,000, and M = 1000. Calculate
the disk I/O cost of a nested-loop join.

Exercise 15.3.3: For the relations of Exercise 15.3.2, what value of M would
we need to compute R ⋈ S using the nested-loop algorithm with no more than
a) 100,000 ! b) 25,000 ! c) 15,000 disk I/O's?
! Exercise 15.3.4: If R and S are both unclustered, it seems that nested-loop
join would require about T(R)T(S)/M disk I/O's.

a) How can you do significantly better than this cost?

b) If only one of R and S is unclustered, how would you perform a nested-loop
join? Consider both the cases that the larger is unclustered and that
the smaller is unclustered.

! Exercise 15.3.5: The iterator of Fig. 15.7 will not work properly if either R
or S is empty. Rewrite the functions so they will work, even if one or both
relations are empty.

15.4 Two-Pass Algorithms Based on Sorting


We shall now begin the study of multipass algorithms for performing relational-algebra
operations on relations that are larger than what the one-pass algorithms
of Section 15.2 can handle. We concentrate on two-pass algorithms,
where data from the operand relations is read into main memory, processed in
some way, written out to disk again, and then reread from disk to complete the
operation. We can naturally extend this idea to any number of passes, where
the data is read many times into main memory. However, we concentrate on
two-pass algorithms because:

a) Two passes are usually enough, even for very large relations,
b) Generalizing to more than two passes is not hard; we discuss these exten­
sions in Section 15.8.

In this section, we consider sorting as a tool for implementing relational


operations. The basic idea is as follows. If we have a large relation R , where
B ( R ) is larger than M , the number of memory buffers we have available, then
we can repeatedly:

1. Read M blocks of R into main memory.

2. Sort these M blocks in main memory, using an efficient, main-memory


sorting algorithm. Such an algorithm will take an amount of processor
time that is just slightly more than linear in the number of tuples in main
memory, so we expect that the time to sort will not exceed the disk I/O
time for step (1).

3. Write the sorted list into M blocks of disk. We shall refer to the contents
of these blocks as one of the s o r t e d s u b l i s t s of R.

All the algorithms we shall discuss then use a second pass to “


merge”the sorted
sublists in some way to execute the desired operator.
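A minimal Python sketch of this first pass follows; it is illustrative only, and read_blocks, write_sublist, and sort_key are assumed helpers rather than anything defined in the text.

    def make_sorted_sublists(read_blocks, m, write_sublist, sort_key=None):
        """Read M blocks at a time, sort their tuples in memory, and write each
        sorted run back to disk as one sublist. Returns the number of sublists."""
        buffer, runs = [], 0
        for block in read_blocks():
            buffer.append(block)
            if len(buffer) == m:                     # M blocks are full: sort and spill
                tuples = sorted((t for b in buffer for t in b), key=sort_key)
                write_sublist(tuples)
                buffer, runs = [], runs + 1
        if buffer:                                   # final, possibly shorter, run
            write_sublist(sorted((t for b in buffer for t in b), key=sort_key))
            runs += 1
        return runs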

15.4.1 Duplicate Elimination Using Sorting


To perform the δ(R) operation in two passes, we sort the tuples of R in sublists
as described above. We then use the available main memory to hold one block
from each sorted sublist, as we did for the multiway merge sort of Section 11.4.4.
However, instead of sorting the tuples from these sublists, we repeatedly copy
one to the output and ignore all tuples identical to it. The process is suggested
by Fig. 15.10.

[Figure: R is read M blocks at a time and sorted into sublists on disk; the same M buffers are then used to hold one block from each sublist during the merge.]

Figure 15.10: A two-pass algorithm for eliminating duplicates

More precisely, we look at the first unconsidered tuple from each block, and
we find among them the first in sorted order, say t. We make one copy of t in

the output, and we remove from the fronts o f the various input blocks all copies
of t. If a block is exhausted, we bring into its buffer the next block from the
same sublist, and if there are t's on that block we remove them as well.

Example 15.5: Suppose for simplicity that tuples are integers, and only two
tuples fit on a block. Also, M = 3; i.e., there are three blocks in main memory.
The relation R consists of 17 tuples:

    2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3

We read the first six tuples into the three blocks of main memory, sort them,
and write them out as the sublist R1. Similarly, tuples seven through twelve
are then read in, sorted and written as the sublist R2. The last five tuples are
likewise sorted and become the sublist R3.
To start the second pass, we can bring into main memory the first block
(two tuples) from each of the three sublists. The situation is now:

    Sublist    In memory    Waiting on disk
    R1:        1, 2         2, 2;  2, 5
    R2:        2, 3         4, 4;  4, 5
    R3:        1, 1         2, 3;  5

Looking at the first tuples of the three blocks in main memory, we find that
1 is the first tuple in sorted order. We therefore make one copy of 1 on the
output, and we remove all 1's from the blocks in memory. When we do so, the
block from R3 is exhausted, so we bring in the next block, with tuples 2 and 3,
from that sublist. Had there been more 1's on this block, we would eliminate
them. The situation is now:

    Sublist    In memory    Waiting on disk
    R1:        2, 2         2, 2;  2, 5
    R2:        2, 3         4, 4;  4, 5
    R3:        2, 3         5

Now, 2 is the least tuple at the fronts of the lists, and in fact it happens
to appear on each list. We write one copy of 2 to the output and eliminate
2's from the in-memory blocks. The block from R1 is exhausted and the next
block from that sublist is brought to memory. That block has 2's, which are
eliminated, again exhausting the block from R1. The third block from that
sublist is brought to memory, and its 2 is eliminated. The present situation is:

    Sublist    In memory    Waiting on disk
    R1:        5
    R2:        3            4, 4;  4, 5
    R3:        3            5

Now, 3 is selected as the least tuple, one copy of 3 is written to the output,
and the blocks from R2 and R3 are exhausted and replaced from disk, leaving:

    Sublist    In memory    Waiting on disk
    R1:        5
    R2:        4, 4         4, 5
    R3:        5

To complete the example, 4 is next selected, consuming most of list R2. At the
final step, each list happens to consist of a single 5, which is output once and
eliminated from the input buffers. □

The number o f disk I/ O ’s perform ed by this algorithm, as always ignoring


the handling of the output, is:

1. B(R) to read each block of R when creating the sorted sublists.

2. B(R) to write each of the sorted sublists to disk.

3. B(R) to read each block from the sublists at the appropriate time.

Thus, the total cost of this algorithm is 3B(R), compared with B(R) for the
single-pass algorithm of Section 15.2.2.
On the other hand, we can handle much larger files using the two-pass
algorithm than with the one-pass algorithm. Assuming M blocks of memory
are available, we create sorted sublists of M blocks each. For the second pass,
we need one block from each sublist in main memory, so there can be no more
than M sublists, each M blocks long. Thus, B ≤ M² is required for the two-pass
algorithm to be feasible, compared with B ≤ M for the one-pass algorithm.
Put another way, to compute δ(R) with the two-pass algorithm requires only
√B(R) blocks of main memory, rather than B(R) blocks.
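The second pass can be sketched with Python's heapq.merge, which performs exactly this kind of multiway merge of sorted runs; the helper below and the runs used to exercise it follow Example 15.5, but the code is an illustration, not the book's algorithm statement.

    import heapq

    def merge_eliminate_duplicates(sublists):
        """Merge already-sorted sublists and emit each distinct tuple once."""
        previous = object()                 # sentinel that equals no real tuple
        for t in heapq.merge(*sublists):    # streams the runs block by block
            if t != previous:               # copy the first occurrence, skip the rest
                yield t
                previous = t

    # The 17-tuple relation of Example 15.5 produces these three sorted runs;
    # merging them outputs 1, 2, 3, 4, 5.
    runs = [[1, 2, 2, 2, 2, 5], [2, 3, 4, 4, 4, 5], [1, 1, 2, 3, 5]]
    assert list(merge_eliminate_duplicates(runs)) == [1, 2, 3, 4, 5]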

15.4.2 Grouping and Aggregation Using Sorting


The two-pass algorithm for γL(R) is quite similar to the algorithm of Section
15.4.1 for δ(R). We summarize it as follows:

1. Read the tuples o f R into memory, M blocks at a time. Sort each M


blocks, using the grouping attributes of L as the sort key. Write each
sorted sublist to disk.

2 . Use one main-memory buffer for each sublist, and initially load the first
block o f each sublist into its buffer.

3. R epeatedly find the least value o f the sort key (grouping attributes)
present am ong the first available tuples in the buffers. This value, v ,
becom es the next group, for which we:

(a) Prepare to com pute all the aggregates on list L for this group. As
in Section 15.2.2, use a count and sum in place o f an average.

(b) Exam ine each o f the tuples with sort key v, and accum ulate the
needed aggregates.

(c) If a buffer becom es empty, replace it with the next block from the
same sublist.

When there are no m ore tuples with sort key v available, output a tuple
consisting of the grouping attributes of L and the associated values o f the
aggregations we have com puted for the group.

As for the δ algorithm, this two-pass algorithm for γ takes 3B(R) disk I/O's,
and will work as long as B(R) ≤ M².
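A compact Python sketch of the second pass for grouping with AVG follows; group_key and value are assumed accessors, and the count-and-sum bookkeeping mirrors the description above (this is an illustration, not the text's code).

    import heapq
    from itertools import groupby

    def sort_based_group_avg(sublists, group_key, value):
        """Merge runs sorted on the grouping attributes and, per group, keep a
        count and a sum so the average can be produced at the end."""
        merged = heapq.merge(*sublists, key=group_key)
        for g, tuples in groupby(merged, key=group_key):
            count, total = 0, 0
            for t in tuples:                 # all tuples of a group are adjacent
                count += 1
                total += value(t)
            yield (g, total / count)         # one output tuple per group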

15.4.3 A Sort-Based Union Algorithm


When bag-union is wanted, the one-pass algorithm of Section 15.2.3, where we
simply copy both relations, works regardless of the size of the arguments, so
there is no need to consider a two-pass algorithm for ∪B. However, the one-pass
algorithm for ∪S only works when at least one relation is smaller than
the available main memory, so we should consider a two-pass algorithm for
set union. The methodology we present works for the set and bag versions of
intersection and difference as well, as we shall see in Section 15.4.4. To compute
R ∪S S, we do the following:

1. Repeatedly bring M blocks of R into main memory, sort their tuples, and
write the resulting sorted sublist back to disk.

2. Do the same for S, to create sorted sublists for relation S.

3. Use one main-memory buffer for each sublist of R and S. Initialize each
with the first block from the correspon din g sublist.

4. Repeatedly find the first remaining tuple t am ong all the buffers. Copy
t to the output, and remove from the buffers all copies of t (if R and S
are sets there should be at m ost two copies). If a buffer becom es empty,
reload it with the next block from its sublist.

We observe that each tuple of R and S is read twice into main memory,
once when the sublists are being created, and the second time as part of one of
the sublists. The tuple is also written to disk once, as part of a newly formed
sublist. Thus, the cost in disk I/O's is 3(B(R) + B(S)).
The algorithm works as long as the total number o f sublists am ong the two
relations does not exceed M, because we need one buffer for each sublist. Since
each sublist is M blocks long, that says the sizes of the two relations must not
exceed M²; that is, B(R) + B(S) ≤ M².

15.4.4 Sort-Based Intersection and Difference


Whether the set version or the bag version is wanted, the algorithm s are es­
sentially the same as that of Section 15.4.3, except that the way we handle the
copies of a tuple t at the fronts o f the sorted sublists differs. In general we
create the sorted sublists of M blocks each for both argument relations R and
S. We use one main-memory buffer for each sublist, initially loaded with the
first block of the sublist.
We then repeatedly consider the least tuple t among the remaining tuples
in all the buffers. We count the number of tuples of R that are identical to t
and we also count the number of tuples of S that are identical to t. Doing so
requires that we reload buffers from any sublists whose currently buffered block
is exhausted. The following indicates how we determine whether t is output,
and if so, how many times (a code sketch follows the list):

• For set intersection, output t if it appears in both R and S.

• For bag intersection, output t the minimum of the number of times it
appears in R and in S. Note that t is not output if either of these counts
is 0; that is, if t is missing from one or both of the relations.

• For set difference, R −S S, output t if and only if it appears in R but not
in S.

• For bag difference, R −B S, output t the number of times it appears in R
minus the number of times it appears in S. Of course, if t appears in S
at least as many times as it appears in R, then do not output t at all.
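The four output rules can be captured in a small helper; the Python below is not from the text and simply takes the counts of t in R and S and returns how many copies to emit.

    def output_multiplicity(op, count_r, count_s):
        """Copies of the current least tuple t to output, given its counts in R and S."""
        if op == "set_intersection":
            return 1 if count_r > 0 and count_s > 0 else 0
        if op == "bag_intersection":
            return min(count_r, count_s)
        if op == "set_difference":                  # R -S S
            return 1 if count_r > 0 and count_s == 0 else 0
        if op == "bag_difference":                  # R -B S
            return max(count_r - count_s, 0)
        raise ValueError(op)

    # Example 15.6: tuple 2 occurs five times in R and once in S, so the bag
    # difference emits it four times.
    assert output_multiplicity("bag_difference", 5, 1) == 4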

Example 15.6: Let us make the same assumptions as in Example 15.5: M =
3, tuples are integers, and two tuples fit in a block. The data will be almost
the same as in that example as well. However, here we need two arguments, so
we shall assume that R has 12 tuples and S has 5 tuples. Since main memory
can fit six tuples, in the first pass we get two sublists from R, which we shall
call R1 and R2, and only one sorted sublist from S, which we refer to as S1.²
After creating the sorted sublists (from unsorted relations similar to the data
from Example 15.5), the situation is:

    Sublist    In memory    Waiting on disk
    R1:        1, 2         2, 2;  2, 5
    R2:        2, 3         4, 4;  4, 5
    S1:        1, 1         2, 3;  5

Suppose we want to take the bag difference R −B S. We find that the least
tuple among the main-memory buffers is 1, so we count the number of 1's among
the sublists of R and among the sublists of S. We find that 1 appears once in R
and twice in S. Since 1 does not appear more times in R than in S, we do not
output any copies of tuple 1. Since the first block of S1 was exhausted counting
1's, we loaded the next block of S1, leaving the following situation:

    Sublist    In memory    Waiting on disk
    R1:        2, 2         2, 2;  2, 5
    R2:        2, 3         4, 4;  4, 5
    S1:        2, 3         5

We now find that 2 is the least remaining tuple, so we count the number
of its occurrences in R, which is five occurrences, and we count the number of
its occurrences in S, which is one. We thus output tuple 2 four times. As we
perform the counts, we must reload the buffer for R1 twice, which leaves:

    Sublist    In memory    Waiting on disk
    R1:        5
    R2:        3            4, 4;  4, 5
    S1:        3            5

Next, we consider tuple 3, and find it appears once in R and once in S. We


therefore do not output 3 and remove its copies from the buffers, leaving:

    Sublist    In memory    Waiting on disk
    R1:        5
    R2:        4, 4         4, 5
    S1:        5

Tuple 4 occurs three times in R and not at all in S, so we output three
copies of 4. Last, 5 appears twice in R and once in S, so we output 5 once. The
complete output is 2, 2, 2, 2, 4, 4, 4, 5. □

² Since S fits in main memory, we could actually use the one-pass algorithms of Section 15.2.3, but we shall use the two-pass approach for illustration.

The analysis o f this family o f algorithm s is the same as for the set-union
algorithm described in Section 15.4.3:

• 3(B(R) + B(S)) disk I/O's.

• Approximately B(R) + B(S) ≤ M² for the algorithm to work.

15.4.5 A Simple Sort-Based Join Algorithm


There are several ways that sorting can be used to join large relations. Before
exam ining the join algorithms, let us observe one problem that can occur when
we com pute a join but was not an issue for the binary operations considered
so far. W hen taking a join, the number of tuples from the two relations that
share a com m on value of the join attribute(s), and therefore need to be in main
m emory simultaneously, can exceed what fits in memory. The extreme example
is when there is only one value of the join attribute(s), and every tuple of one

relation joins with every tuple of the other relation. In this situation, there is
really no choice but to take a nested-loop join of the two sets of tuples with a
com m on value in the join-attribute(s).
T o avoid facing this situation, we can try to reduce main-memory use for
other aspects of the algorithm, and thus make available a large number of buffers
to hold the tuples with a given join-attribute value. In this section we shall dis­
cuss the algorithm that makes the greatest possible number o f buffers available
for join in g tuples with a com m on value. In Section 15.4.7 we consider another
sort-based algorithm that uses fewer disk I/ O ’s, but can present problem s when
there are large numbers o f tuples with a com m on join-attribute value.
Given relations R(X,Y) and S(Y,Z) to join, and given M blocks of main
memory for buffers, we do the following (a code sketch of the merge step appears
after the list):

1. Sort R, using a two-phase, multiway merge sort, with Y as the sort key.

2. Sort S similarly.

3. Merge the sorted R and S. We generally use only two buffers, one for the
current block of R and the other for the current block of S. The following
steps are done repeatedly:

   (a) Find the least value y of the join attributes Y that is currently at
       the front of the blocks for R and S.
   (b) If y does not appear at the front of the other relation, then remove
       the tuple(s) with sort key y.
   (c) Otherwise, identify all the tuples from both relations having sort key
       y. If necessary, read blocks from the sorted R and/or S, until we are
       sure there are no more y's in either relation. As many as M buffers
       are available for this purpose.
   (d) Output all the tuples that can be formed by joining tuples from R
       and S with a common Y-value y.
   (e) If either relation has no more unconsidered tuples in main memory,
       reload the buffer for that relation.
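A Python sketch of the merge in step (3) is given below; it is not the book's code. sorted_r and sorted_s are assumed to be iterables already sorted on Y, and y_r, y_s extract the join attributes. Each pair of matching Y-groups is buffered in memory, mirroring the assumption that the tuples sharing a Y-value fit in the available buffers.

    from itertools import groupby

    def simple_sort_join(sorted_r, sorted_s, y_r, y_s):
        """Merge two fully sorted inputs and join the groups with equal Y-values."""
        groups_r = groupby(sorted_r, key=y_r)
        groups_s = groupby(sorted_s, key=y_s)
        try:
            kr, r_group = next(groups_r)
            ks, s_group = next(groups_s)
            while True:
                if kr < ks:
                    kr, r_group = next(groups_r)   # R-tuples with no match are dropped
                elif ks < kr:
                    ks, s_group = next(groups_s)   # S-tuples with no match are dropped
                else:
                    rs, ss = list(r_group), list(s_group)   # buffer the matching groups
                    for r in rs:
                        for s in ss:
                            yield (r, s)
                    kr, r_group = next(groups_r)
                    ks, s_group = next(groups_s)
        except StopIteration:
            return                                  # one input exhausted: no more joins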

Example 15.7: Let us consider the relations R and S from Example 15.4.
Recall these relations occupy 1000 and 500 blocks, respectively, and there are
M = 101 main-memory buffers. When we use two-phase, multiway merge sort
on a relation, we do four disk I/O's per block, two in each of the two phases.
Thus, we use 4(B(R) + B(S)) disk I/O's to sort R and S, or 6000 disk I/O's.
When we merge the sorted R and S to find the joined tuples, we read each
block of R and S a fifth time, using another 1500 disk I/O's. In this merge we
generally need only two of the 101 blocks of memory. However, if necessary, we
could use all 101 blocks to hold the tuples of R and S that share a common
Y-value y. Thus, it is sufficient that for no y do the tuples of R and S that
have Y-value y together occupy more than 101 blocks.

Notice that the total number of disk I/ O ’ s perform ed by this algorithm


is 7500, compared with 5500 for nested-loop join in Exam ple 15.4. However,
nested-loop join is inherently a quadratic algorithm, taking tim e proportional
to B(R)B(S), while sort-join has linear I/O cost, taking time proportional to
B(R) + B(S). It is only the constant factors and the small size o f the example
(each relation is only 5 or 10 times larger than a relation that fits entirely
in the allotted buffers) that make nested-loop join preferable. Moreover, we
shall see in Section 15.4.7 that it is usually possible to perform a sort-join in
3 (B(R) + B (S )) disk I/ O ’
s, which would be 4500 in this exam ple and which is
below the cost of nested-loop join. □

If there is a Y-value y for which the number of tuples with this Y-value does
not fit in M buffers, then we need to modify the above algorithm.

1. If the tuples from one of the relations, say R, that have Y-value y fit in
M − 1 buffers, then load these blocks of R into buffers, and read the blocks
of S that hold tuples with y, one at a time, into the remaining buffer. In
effect, we do the one-pass join of Section 15.2.3 on only the tuples with
Y-value y.

2. If neither relation has sufficiently few tuples with Y-value y that they all
fit in M − 1 buffers, then use the M buffers to perform a nested-loop join
on the tuples with Y-value y from both relations.

Note that in either case, it may be necessary to read blocks from one relation
and then ignore them, having to read them later. For example, in case (1), we
might first read the blocks of S that have tuples with Y-value y and find that
there are too many to fit in M − 1 buffers. However, if we then read the tuples
of R with that Y-value we find that they do fit in M − 1 buffers.

15.4.6 Analysis of Simple Sort-Join


As we noted in Example 15.7, our algorithm performs five disk I/O's for every
block of the argument relation. The exception would be if there were so many
tuples with a common Y-value that we needed to do one of the specialized
joins on these tuples. In that case, the number of extra disk I/O's depends
on whether one or both relations have so many tuples with a common Y-value
that they require more than M − 1 buffers by themselves. We shall not go into
all the cases here; the exercises contain some examples to work out.
We also need to consider how big M needs to be in order for the simple sort-join
to work. The primary constraint is that we need to be able to perform the
two-phase, multiway merge sorts on R and S. As we observed in Section 11.4.4,
we need B(R) ≤ M² and B(S) ≤ M² to perform these sorts. Once done, we
shall not run out of buffers, although as discussed before, we may have to
deviate from the simple merge if the tuples with a common Y-value cannot fit
in M buffers. In summary, assuming no such deviations are necessary:

• The simple sort-join uses 5(B(R) + B(S)) disk I/O's.

• It requires B(R) ≤ M² and B(S) ≤ M² to work.

15.4.7 A More Efficient Sort-Based Join


If we do not have to worry about very large numbers of tuples with a common
value for the join attribute(s), then we can save two disk I/O's per block
by combining the second phase of the sorts with the join itself. We call this
algorithm sort-join; other names by which it is known include “merge-join”
and “sort-merge-join.” To compute R(X,Y) ⋈ S(Y,Z) using M main-memory
buffers:

1. Create sorted sublists o f size M, using Y as the sort key, for both R and
S.

2. Bring the first block of each sublist into a buffer; we assume there are no
m ore than M sublists in all.

3. Repeatedly find the least Y-value y among the first available tuples of all
the sublists. Identify all the tuples of both relations that have Y-value
y, perhaps using some of the M available buffers to hold them, if there
are fewer than M sublists. Output the join of all tuples from R with all
tuples from S that share this common Y-value. If the buffer for one of
the sublists is exhausted, then replenish it from disk.

Example 15.8: Let us again consider the problem of Example 15.4: joining
relations R and S of sizes 1000 and 500 blocks, respectively, using 101 buffers.
We divide R into 10 sublists and S into 5 sublists, each of length 100, and
sort them.³ We then use 15 buffers to hold the current blocks of each of the
sublists. If we face a situation in which many tuples have a fixed Y-value, we
can use the remaining 86 buffers to store these tuples, but if there are more
tuples than that we must use a special algorithm such as was discussed at the
end of Section 15.4.5.
Assuming that we do not need to modify the algorithm for large groups of
tuples with the same Y-value, then we perform three disk I/O's per block of
data. Two of those are to create the sorted sublists. Then, every block of every
sorted sublist is read into main memory one more time in the multiway merging
process. Thus, the total number of disk I/O's is 4500. □

³ Technically, we could have arranged for the sublists to have length 101 blocks each, with the last sublist of R having 91 blocks and the last sublist of S having 96 blocks, but the costs would turn out exactly the same.

This sort-join algorithm is more efficient than the algorithm of Section 15.4.5
when it can be used. As we observed in Example 15.8, the number of disk I/O's
is 3(B(R) + B(S)). We can perform the algorithm on data that is almost as
large as that of the previous algorithm. The sizes of the sorted sublists are
M blocks, and there can be at most M of them among the two lists. Thus,
B(R) + B(S) ≤ M² is sufficient.
We might wonder whether we can avoid the trouble that arises when there
are many tuples with a common Y-value. Some important considerations are:

1. Sometimes we can be sure the problem will not arise. For example, if Y
is a key for R, then a given Y-value y can appear only once among all the
blocks of the sublists for R. When it is y's turn, we can leave the tuple
from R in place and join it with all the tuples of S that match. If blocks of
S's sublists are exhausted during this process, they can have their buffers
reloaded with the next block, and there is never any need for additional
space, no matter how many tuples of S have Y-value y. Of course, if Y
is a key for S rather than R, the same argument applies with R and S
switched.

2. If B(R) + B(S) is much less than M², we shall have many unused buffers
for storing tuples with a common Y-value, as we suggested in Example 15.8.

3. If all else fails, we can use a nested-loop join on just the tuples with a
common Y-value, using extra disk I/O's but getting the job done correctly.
This option was discussed in Section 15.4.5.

15.4.8 Summary of Sort-Based Algorithms


In Fig. 15.11 is a table of the analysis of the algorithm s we have discussed in
Section 15.4. As discussed in Sections 15.4.5 and 15.4.7, m odifications to the
time and m em ory requirements are necessary if we join two relations that have
many tuples with the sam e value in the join attribute(s).

    Operators    Approximate M required    Disk I/O           Section
    γ, δ         √B                        3B                 15.4.1, 15.4.2
    ∪, ∩, −      √(B(R) + B(S))            3(B(R) + B(S))     15.4.3, 15.4.4
    ⋈            √(max(B(R), B(S)))        5(B(R) + B(S))     15.4.5
    ⋈            √(B(R) + B(S))            3(B(R) + B(S))     15.4.7

Figure 15.11: Main memory and disk I/O requirements for sort-based algorithms


15.4.9 Exercises for Section 15.4


Exercise 15.4.1: Using the assumptions of Example 15.5 (two tuples per
block, etc.),

a) Show the behavior of the two-pass duplicate-elimination algorithm on the
sequence of thirty one-component tuples in which the sequence 0, 1, 2, 3,
4 repeats six times.

b) Show the behavior of the two-pass grouping algorithm computing the
relation γa,AVG(b)(R). Relation R(a, b) consists of the thirty tuples t0
through t29, and the tuple ti has i modulo 5 as its grouping component
a, and i as its second component b.

E x e r c is e 15.4.2 : For each o f the operations below, write an iterator that uses
the algorithm described in this section.

* a) Distinct (δ).

b) Grouping (γL).

* c) Set intersection.

d) B ag difference.

e) Natural join.

E x e r c is e 15.4.3: If B ( R ) = B ( S ) = 10,000 and M = 1000, what are the disk


I/O requirements of:

a) Set union.

* b) Sim ple sort-join.

c) The m ore efficient sort-join o f Section 15.4.7.

! E x e r c is e 15.4.4: Suppose that the second pass of an algorithm described


in this section does not need all M buffers, because there are fewer than M
sublists. How might we save disk I/ O ’ s by using the extra buffers?

! Exercise 15.4.5: In Example 15.7 we discussed the join of two relations R and
S, with 1000 and 500 blocks, respectively, and M = 101. However, we pointed
out that there would be additional disk I/O's if there were so many tuples with
a given value that neither relation's tuples could fit in main memory. Calculate
the total number of disk I/O's needed if:

* a) There are only two Y-values, each appearing in half the tuples of R and
half the tuples of S (recall Y is the join attribute or attributes).

b) There are five Y-values, each equally likely in each relation.



c) There are 10 Y-values, each equally likely in each relation.

! E x e r c is e 15.4.6: Repeat Exercise 15.4.5 for the m ore efficient sort-join of


Section 15.4.7.

E x e r c is e 15.4.7: How much m em ory do we need to use a two-pass, sort-based


algorithm for relations of 10,000 blocks each, if the operation is:

* a) δ.

b) γ.

c) A binary operation such as join or union.

E x e r c is e 15.4.8: D escribe a two-pass, sort-based algorithm for each of the


join-like operators of Exercise 15.2.4.

! E x e r c is e 15.4.9: Suppose records could b e larger than blocks, i.e., we could


have spanned records. How would the m em ory requirements o f two-pass, sort-
based algorithms change?

!! Exercise 15.4.10: Sometimes, it is possible to save some disk I/O's if we leave
the last sublist in memory. It may even make sense to use sublists of fewer than
M blocks to take advantage of this effect. How many disk I/O's can be saved
this way?

!! Exercise 15.4.11: OQL allows grouping of objects according to arbitrary,
user-specified functions of the objects. For example, one could group tuples
according to the sum of two attributes. How would we perform a sort-based
grouping operation of this type on a set of objects?

15.5 Two-Pass Algorithms Based on Hashing


There is a family o f hash-based algorithm s that attack the sam e problem s as
in Section 15.4. The essential idea behind all these algorithm s is as follows.
If the data is too big to store in main-memory buffers, hash all the tuples of
the argument or arguments using an appropriate hash key. For all the com m on
operations, there is a way to select the hash key so all the tuples that need to be
considered together when we perform the operation have the same hash value.
We then perform the operation by w orking on one bucket at a time (or on
a pair of buckets with the same hash value, in the case o f a binary operation).
In effect, we have reduced the size of the operand(s) by a factor equal to the
number of buckets. If there are M buffers available, we can pick M as the
number of buckets, thus gaining a factor o f M in the size o f the relations we
can handle. Notice that the sort-based algorithm s of Section 15.4 also gain
a factor o f M by preprocessing, although the sorting and hashing approaches
achieve their similar gains by rather different means.

15.5.1 Partitioning Relations by Hashing


To begin, let us review the way we would take a relation R and, using M buffers,
partition R into M — 1 buckets of roughly equal size. We shall assume that
h is the hash function, and that h takes com plete tuples o f R as its argument
(i.e., all attributes o f R are part of the hash key). We associate one buffer with
each bucket. The last buffer holds blocks o f R, one at a time. Each tuple t in
the block is hashed to bucket h(t) and copied to the appropriate buffer. If that
buffer is full, we write it out to disk, and initialize another block for the sam e
bucket. At the end, we write out the last block of each bucket if it is not empty.
The algorithm is given in more detail in Fig. 15.12. Note that it assumes that
tuples, while they may be variable-length, are never too large to fit in an empty
buffer.

initialize M-1 buckets using M-1 empty buffers;


FOR each block b of relation R DO BEGIN
read block b into the Mth buffer;
FOR each tuple t in b DO BEGIN
IF the buffer for bucket h(t) has no room for t THEN
BEGIN
copy the buffer to disk;
initialize a new empty block in that buffer;
END;
copy t to the buffer for bucket h(t);
END;
END;
FOR each bucket DO
IF the buffer for this bucket is not empty THEN
write the buffer to disk;

Figure 15.12: Partitioning a relation R into M-1 buckets

15.5.2 A Hash-Based Algorithm for Duplicate Elimination

We shall now consider the details of hash-based algorithms for the various
operations of relational algebra that might need two-pass algorithms. First,
consider duplicate elimination, that is, the operation δ(R). We hash R to
M − 1 buckets, as in Fig. 15.12. Note that two copies of the same tuple t will
hash to the same bucket. Thus, δ has the essential property we need: we can
examine one bucket at a time, perform δ on that bucket in isolation, and take as
the answer the union of δ(Ri), where Ri is the portion of R that hashes to the
ith bucket. The one-pass algorithm of Section 15.2.2 can be used to eliminate

duplicates from each Ri in turn and write out the resulting unique tuples.
This method will work as long as the individual Ri's are sufficiently small to
fit in main memory and thus allow a one-pass algorithm. Since we assume the
hash function h partitions R into equal-sized buckets, each Ri will be approximately
B(R)/(M − 1) blocks in size. If that number of blocks is no larger than
M, i.e., B(R) ≤ M(M − 1), then the two-pass, hash-based algorithm will work.
In fact, as we discussed in Section 15.2.2, it is only necessary that the number
of distinct tuples in one bucket fit in M buffers, but we cannot be sure that
there are any duplicates at all. Thus, a conservative estimate, with a simple
form in which M and M − 1 are considered the same, is B(R) ≤ M², exactly
as for the sort-based, two-pass algorithm for δ.
The number of disk I/O's is also similar to that of the sort-based algorithm.
We read each block of R once as we hash its tuples, and we write each block
of each bucket to disk. We then read each block of each bucket again in the
one-pass algorithm that focuses on that bucket. Thus, the total number of disk
I/O's is 3B(R).

15.5.3 Hash-Based Grouping and Aggregation


To perform the γL(R) operation, we again start by hashing all the tuples of
R to M − 1 buckets. However, in order to make sure that all tuples of the
same group wind up in the same bucket, we must choose a hash function that
depends only on the grouping attributes of the list L.
Having partitioned R into buckets, we can then use the one-pass algorithm
for γ from Section 15.2.2 to process each bucket in turn. As we discussed
for δ in Section 15.5.2, we can process each bucket in main memory provided
B(R) ≤ M².
However, on the second pass, we only need one record per group as we
process each bucket. Thus, even if the size of a bucket is larger than M, we
can handle the bucket in one pass provided the records for all the groups in
the bucket take no more than M buffers. Normally, a group's record will be no
larger than a tuple of R. If so, then a better upper bound on B(R) is M² times
the average number of tuples per group.
As a consequence, if there are few groups, then we may actually be able
to handle much larger relations R than is indicated by the B(R) ≤ M² rule.
On the other hand, if M exceeds the number of groups, then we cannot fill
all buckets. Thus, the actual limitation on the size of R as a function of M is
complex, but B(R) ≤ M² is a conservative estimate. Finally, we observe that
the number of disk I/O's for γ, as for δ, is 3B(R).

15.5.4 Hash-Based Union, Intersection, and Difference


When the operation is binary, we must make sure that we use the same hash
function to hash tuples of both arguments. For example, to compute R ∪S S,
we hash both R and S to M − 1 buckets each, say R1, R2, ..., RM−1 and
S1, S2, ..., SM−1. We then take the set-union of Ri with Si for all i, and
output the result. Notice that if a tuple t appears in both R and S, then for
some i we shall find t in both Ri and Si. Thus, when we take the union of these
two buckets, we shall output only one copy of t, and there is no possibility of
introducing duplicates into the result. For ∪B, the simple bag-union algorithm
of Section 15.2.3 is preferable to any other approach for that operation.
To take the intersection or difference of R and S, we create the 2(M − 1)
buckets exactly as for set-union and apply the appropriate one-pass algorithm
to each pair of corresponding buckets. Notice that all these algorithms require
B(R) + B(S) disk I/O's. To this quantity we must add the two disk I/O's per
block that are necessary to hash the tuples of the two relations and store the
buckets on disk, for a total of 3(B(R) + B(S)) disk I/O's.
In order for the algorithms to work, we must be able to take the one-pass
union, intersection, or difference of Ri and Si, whose sizes will be approximately
B(R)/(M − 1) and B(S)/(M − 1), respectively. Recall that the one-pass
algorithms for these operations require that the smaller operand occupies
at most M − 1 blocks. Thus, the two-pass, hash-based algorithms require that
min(B(R), B(S)) ≤ M², approximately.

15.5.5 The Hash-Join Algorithm


To compute R(X,Y) ⋈ S(Y,Z) using a two-pass, hash-based algorithm, we
act almost as for the other binary operations discussed in Section 15.5.4. The
only difference is that we must use as the hash key just the join attributes,
Y. Then we can be sure that if tuples of R and S join, they will wind up in
corresponding buckets Ri and Si for some i. A one-pass join of all pairs of
corresponding buckets completes this algorithm, which we call hash-join.⁴

⁴ Sometimes, the term “hash-join” is reserved for the variant of the one-pass join algorithm of Section 15.2.3 in which a hash table is used as the main-memory search structure. Then, the two-pass hash-join algorithm described here is called “partition hash-join.”
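A minimal in-memory Python sketch of hash-join follows; it is not the book's code, and the partitions are kept in Python dictionaries rather than on disk, but the two passes (partition, then a one-pass join per pair of buckets) are the same.

    from collections import defaultdict

    def hash_join(tuples_r, tuples_s, y_r, y_s, buckets):
        """Partition both inputs on the join attributes with the same hash
        function, then join each pair of corresponding buckets in memory."""
        part_r, part_s = defaultdict(list), defaultdict(list)
        for r in tuples_r:                               # "first pass": write out buckets
            part_r[hash(y_r(r)) % buckets].append(r)
        for s in tuples_s:
            part_s[hash(y_s(s)) % buckets].append(s)
        for i in range(buckets):                         # "second pass": bucket i of R with bucket i of S
            index = {}
            for s in part_s.get(i, []):                  # search structure on S's bucket
                index.setdefault(y_s(s), []).append(s)
            for r in part_r.get(i, []):
                for s in index.get(y_r(r), []):
                    yield (r, s)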

Example 15.9: Let us renew our discussion of the two relations R and S from
Example 15.4, whose sizes were 1000 and 500 blocks, respectively, and for which
101 main-memory buffers are made available. We may hash each relation to
100 buckets, so the average size of a bucket is 10 blocks for R, and 5 blocks
for S. Since the smaller number, 5, is much less than the number of available
buffers, we expect to have no trouble performing a one-pass join on each pair
of buckets.
The number of disk I/O's is 1500 to read each of R and S while hashing
into buckets, another 1500 to write all the buckets to disk, and a third 1500 to
read each pair of buckets into main memory again while taking the one-pass
join of corresponding buckets. Thus, the number of disk I/O's required is 4500,
just as for the efficient sort-join of Section 15.4.7. □

We may generalize Example 15.9 to conclude that:


• Hash-join requires 3(B(R) + B(S)) disk I/O's to perform its task.

• The two-pass hash-join algorithm will work as long as approximately
min(B(R), B(S)) ≤ M².

The argument for the latter point is the same as for the other binary operations:
one of each pair of buckets must fit in M − 1 buffers.

15.5.6 Saving Some Disk I/O's
If there is more memory available on the first pass than we need to hold one
block per bucket, then we have some opportunities to save disk I/O's. One
option is to use several blocks for each bucket, and write them out as a group,
in consecutive blocks of disk. Strictly speaking, this technique doesn't save disk
I/O's, but it makes the I/O's go faster, since we save seek time and rotational
latency when we write.
latency when we write.
However, there are several tricks that have been used to avoid writing some
of the buckets to disk and then reading them again. The most effective of them,
called hybrid hash-join, works as follows. In general, suppose we decide that to
join R ⋈ S, with S the smaller relation, we need to create k buckets, where k
is much less than M, the available memory. When we hash S, we can choose
to keep m of the k buckets entirely in main memory, while keeping only one
block for each of the other k − m buckets. We can manage to do so provided
the expected size of the buckets in memory, plus one block for each of the other
buckets, does not exceed M; that is:

    m·B(S)/k + (k − m) ≤ M        (15.1)

In explanation, the expected size of a bucket is B(S)/k, and there are m buckets
in memory.
Now, when we read the tuples of the other relation, R , to hash that relation
into buckets, we keep in memory:

1 . The m buckets of S that were never written to disk, and

2. One block for each of the k − m buckets of R whose corresponding buckets
of S were written to disk.

If a tuple t of R hashes to one of the first m buckets, then we immediately
join it with all the tuples of the corresponding S-bucket, as if this were a one-pass,
hash-join. The result of any successful joins is immediately output. It
is necessary to organize each of the in-memory buckets of S into an efficient
search structure to facilitate this join, just as for the one-pass hash-join. If t
hashes to one of the buckets whose corresponding S-bucket is on disk, then t
is sent to the main-memory block for that bucket, and eventually migrates to
disk, as for a two-pass, hash-based join.

On the second pass, we join the corresponding buckets of R and S as usual.
However, there is no need to join the pairs of buckets for which the S-bucket
was left in memory; these buckets have already been joined and their result
output.
The savings in disk I/O's is equal to two for every block of the buckets of S
that remain in memory, and their corresponding R-buckets. Since m/k of the
buckets are in memory, the savings is 2(m/k)(B(R) + B(S)). We must thus
ask how to maximize m/k, subject to the constraint of equation (15.1). The
surprising answer is: pick m = 1, and then make k as small as possible.
The intuitive justification is that all but k − m of the main-memory buffers
can be used to hold tuples of S in main memory, and the more of these tuples,
the fewer the disk I/O's. Thus, we want to minimize k, the total number of
buckets. We do so by making each bucket about as big as can fit in main
memory; that is, buckets are of size M, and therefore k = B(S)/M. If that is
the case, then there is only room for one bucket in the extra main memory; i.e.,
m = 1.
In fact, we really need to make the buckets slightly smaller than B(S)/M,
or else we shall not quite have room for one full bucket and one block for the
other k − 1 buckets in memory at the same time. Assuming, for simplicity, that
k is about B(S)/M and m = 1, the savings in disk I/O's is

    2(M/B(S))(B(R) + B(S))

and the total cost is

    (3 − 2M/B(S))(B(R) + B(S))
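These costs can be checked with a few lines of Python; the function below is an illustration only, and it picks k the way the example that follows does, as the smallest bucket count satisfying constraint (15.1) with m = 1.

    def hybrid_hash_join_io(b_r, b_s, m):
        """Disk I/O's for hybrid hash-join with one S-bucket kept in memory.
        k is the smallest count with B(S)/k + (k - 1) <= M."""
        k = next(k for k in range(1, b_s + 1) if b_s / k + (k - 1) <= m)
        in_mem = b_s // k                       # expected size of the resident S-bucket
        write_s = b_s - in_mem                  # S blocks spilled to disk
        write_r = round(b_r * (k - 1) / k)      # R blocks spilled to disk
        return (b_s + b_r) + (write_s + write_r) + (write_s + write_r)

    # B(R)=1000, B(S)=500, M=101 gives k = 6 and about 4000 disk I/O's.
    assert hybrid_hash_join_io(1000, 500, 101) == 4000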

Example 15.10: Consider the problem of Example 15.4, where we had to join
relations R and S, of 1000 and 500 blocks, respectively, using M = 101. If we
use a hybrid hash-join, then we want k, the number of buckets, to be about
500/101. Suppose we pick k = 5. Then the average bucket will have 100 blocks
of S's tuples. If we try to fit one of these buckets and four extra blocks for the
other four buckets, we need 104 blocks of main memory, and we cannot take
the chance that the in-memory bucket will overflow memory.
Thus, we are advised to choose k = 6. Now, when hashing S on the first
pass, we have five buffers for five of the buckets, and we have up to 96 buffers
for the in-memory bucket, whose expected size is 500/6 or 83. The number
of disk I/O's we use for S on the first pass is thus 500 to read all of S, and
500 − 83 = 417 to write five buckets to disk. When we process R on the first
pass, we need to read all of R (1000 disk I/O's) and write 5 of its 6 buckets
(833 disk I/O's).
On the second pass, we read all the buckets written to disk, or 417 + 833 =
1250 additional disk I/O's. The total number of disk I/O's is thus 1500 to read
R and S, 1250 to write 5/6 of these relations, and another 1250 to read those
tuples again, or 4000 disk I/O's. This figure compares with the 4500 disk I/O's
needed for the straightforward hash-join or sort-join. □

15.5.7 Summary of Hash-Based Algorithms


Figure 15.13 gives the memory requirements and disk I/O's needed by each of
the algorithms discussed in this section. As with other types of algorithms, we
should observe that the estimates for γ and δ may be conservative, since they
really depend on the number of duplicates and groups, respectively, rather than
on the number of tuples in the argument relation.

    Operators    Approximate M required    Disk I/O                        Section
    γ, δ         √B                        3B                              15.5.2, 15.5.3
    ∪, ∩, −      √B(S)                     3(B(R) + B(S))                  15.5.4
    ⋈            √B(S)                     3(B(R) + B(S))                  15.5.5
    ⋈            √B(S)                     (3 − 2M/B(S))(B(R) + B(S))      15.5.6

Figure 15.13: Main memory and disk I/O requirements for hash-based algorithms; for binary operations, assume B(S) ≤ B(R)

Notice that the requirements for sort-based and the corresponding hash-
based algorithms are almost the same. The significant differences between the
two approaches are:

1. Hash-based algorithms for binary operations have a size requirement that


depends only on the smaller of two arguments rather than on the sum of
the argument sizes, as for sort-based algorithms.

2. Sort-based algorithms sometimes allow us to produce a result in sorted


order and take advantage of that sort later. The result might be used in
another sort-based algorithm later, or it could be the answer to a query
that is required to be produced in sorted order.

3. Hash-based algorithms depend on the buckets being of equal size. Since


there is generally at least a small variation in size, it is not possible to
use buckets that, on average, occupy M blocks; we must limit them to a
somewhat smaller figure. This effect is especially prominent if the number
of different hash keys is small, e.g., performing a group-by on a relation
with few groups or a join with very few values for the join attributes.
4. In sort-based algorithms, the sorted sublists may be written to consecutive
blocks of the disk if we organize the disk properly. Thus, one of the three
disk I/O ’s per block may require little rotational latency or seek time

and therefore may be much faster than the I/O’


s needed for hash-based
algorithms.
5. Moreover, if M is much larger than the number of sorted sublists, then
we may read in several consecutive blocks at a time from a sorted sublist,
again saving some latency and seek time.
6 . On the other hand, if we can choose the number of buckets to be less than
M in a hash-based algorithm, then we can write out several blocks of a
bucket at once. We thus obtain the same benefit on the write step for
hashing that the sort-based algorithms have for the second read, as we
observed in (5). Similarly, we may be able to organize the disk so that a
bucket eventually winds up on consecutive blocks of tracks. If so, buckets
can be read with little latency or seek time, just as sorted sublists were
observed in (4) to be writable efficiently.

15.5.8 Exercises for Section 15.5


Exercise 15.5.1: The hybrid-hash-join idea, storing one bucket in main memory,
can also be applied to other operations. Show how to save the cost of storing
and reading one bucket from each relation when implementing a two-pass,
hash-based algorithm for: * a) δ b) γ c) ∪B d) −S.

Exercise 15.5.2: If B ( S ) = B ( R ) = 10,000 and M = 1000, what is the


number of disk I/O’
s required for a hybrid hash join?

Exercise 15.5.3: Write iterators that implement the two-pass, hash-based
algorithms for a) δ b) −S c) ∪B d) γ e) ⋈.

*! Exercise 15.5.4: Suppose we are performing a two-pass, hash-based grouping


operation on a relation R of the appropriate size; i.e., B ( R ) < M 2. However,
there are so few groups, that some groups are larger than M; i.e., they will not
fit in main memory at once. What modifications, if any, need to be made to
the algorithm given here?

! Exercise 15.5.5: Suppose that we are using a disk where the time to move
the head to a block is 100 milliseconds, and it takes 1/2 millisecond to read
one block. Therefore, it takes k/2 milliseconds to read k consecutive blocks,
once the head is positioned. Suppose we want to compute a two-pass hash-join
R ⋈ S, where B(R) = 1000, B(S) = 500, and M = 101. To speed up the join,
we want to use as few buckets as possible (assuming tuples distribute evenly
among buckets), and read and write as many blocks as we can to consecutive
positions on disk. Counting 100.5 milliseconds for a random disk I/O and
100 + k/2 milliseconds for reading or writing k consecutive blocks from or to
disk:

a) How much time does the disk I/O take?



b) How much time does the disk I/O take if we use a hybrid hash-join as
described in Example 15.10?

c) How much time does a sort-based join take under the same conditions,
assuming we write sorted sublists to consecutive blocks of disk?

15.6 Index-Based Algorithms


The existence of an index on one or more attributes of a relation makes available
some algorithms that would not be feasible without the index. Index-based
algorithms are especially useful for the selection operator, but algorithms for
join and other binary operators also use indexes to very good advantage. In
this section, we shall introduce these algorithms. We also continue with the
discussion of the index-scan operator for accessing a stored table with an index
that we began in Section 15.1.1. To appreciate many of the issues, we first need
to digress and consider “ clustering”indexes.

15.6.1 Clustering and Nonclustering Indexes


Recall from Section 15.1.3 that a relation is “clustered” if its tuples are packed
into roughly as few blocks as can possibly hold those tuples. All the analyses
we have done so far assume that relations are clustered.
We may also speak of clustering indexes, which are indexes on an attribute
or attributes such that all the tuples with a fixed value for the search key of this
index appear on roughly as few blocks as can hold them. Note that a relation
that isn't clustered cannot have a clustering index,⁵ but even a clustered relation
can have nonclustering indexes.

Example 15.11: A relation R(a, b) that is sorted on attribute a and stored in
that order, packed into blocks, is surely clustered. An index on a is a clustering
index, since for a given a-value a1, all the tuples with that value for a are
consecutive. They thus appear packed into blocks, except possibly for the first
and last blocks that contain a-value a1, as suggested in Fig. 15.14. However, an
index on b is unlikely to be clustering, since the tuples with a fixed b-value will
be spread all over the file unless the values of a and b are very closely correlated.
□

⁵ Technically, if the index is on a key for the relation, so only one tuple with a given value in the index key exists, then the index is always “clustering,” even if the relation is not clustered. However, if there is only one tuple per index-key value, then there is no advantage from clustering, and the performance measure for such an index is the same as if it were considered nonclustering.


[Figure: a run of consecutive blocks holding all the a1 tuples, with tuples of other a-values only at the two ends.]

Figure 15.14: A clustering index has all tuples with a fixed value packed into
(close to) the minimum possible number of blocks

15.6.2 Index-Based Selection


In Section 15.1.1 we discussed implementing a selection σC(R) by reading all
the tuples of relation R, seeing which meet the condition C, and outputting
those that do. If there are no indexes on R, then that is the best we can do;
the number of disk I/O's used by the operation is B(R), or even T(R), the
number of tuples of R, should R not be a clustered relation.⁶ However, suppose
that the condition C is of the form a = v, where a is an attribute for which
an index exists, and v is a value. Then one can search the index with value v
and get pointers to exactly those tuples of R that have a-value v. These tuples
constitute the result of σa=v(R), so all we have to do is retrieve them.
If the index on R.a is clustering, then the number of disk I/O's to retrieve the
set σa=v(R) will average B(R)/V(R, a). The actual number may be somewhat
higher, because:

1. Often, the index is not kept entirely in main memory, and therefore some
disk I/O's are needed to support the index lookup.

2. Even though all the tuples with a = v might fit in b blocks, they could
be spread over b + 1 blocks because they don't start at the beginning of
a block.

3. Although the index is clustering, the tuples with a = v may be spread
over several extra blocks. Two reasons why that situation might occur
are:

   (a) We might not pack blocks of R as tightly as possible because we
       want to leave room for growth of R, as discussed in Section 13.1.6.
   (b) R might be stored with some other tuples that do not belong to R,
       say in a clustered-file organization.

Moreover, we of course must round up if the ratio B(R)/V(R, a) is not an
integer. Most significant is that should a be a key for R, then V(R, a) = T(R),
which is presumably much bigger than B(R), yet we surely require one disk
I/O to retrieve the tuple with key value v, plus whatever disk I/O's are needed
to access the index.

⁶ Recall from Section 15.1.3 the notation we developed: T(R) for the number of tuples in R, B(R) for the number of blocks in which R fits, and V(R, L) for the number of distinct tuples in πL(R).


Notions of Clustering
We have seen three different, although related, concepts called “
clustering”
or “
clustered.”

1. In Section 13.2.2 we spoke of the “ clustered-file organization,”where


tuples of one relation R are placed with a tuple of some other relation
S with which they share a common value; the example was grouping
movie tuples with the tuple of the studio that made the movie.
2 . In Section 15.1.3 we spoke of a “ clustered relation,”meaning that
the tuples of the relation are stored in blocks that are exclusively, or
at least predominantly, devoted to storing that relation.

3. Here, we have introduced the notion of a clustering index — an index


in which the tuples having a given value of the search key appear in
blocks that are largely devoted to storing tuples with that search-
key value. Typically, the tuples with a fixed value will be stored
consecutively, and only the first and last blocks with tuples of that
value will also have tuples of another search-key value.

The clustered-file organization is one example of a way to have a clustered


relation that is not packed into blocks which are exclusively its own. Suppose
that one tuple of the relation S is associated with many R-tuples in a
clustered file. Then, while the tuples of R are not packed in blocks exclusively
devoted to R, these blocks are "predominantly" devoted to R, and
we call R clustered. On the other hand, S will typically not be a clustered
relation, since its tuples are usually on blocks devoted predominantly to
R-tuples rather than S-tuples.

Now, let us consider what happens when the index on R . a is nonclustering.


To a first approximation, each tuple we retrieve will be on a different block,
and we must access T(R)/V(R,a) tuples. Thus, T(R)/V(R,a) is an estimate
of the number of disk I/O’ s we need. The number could be higher because we
may also need to read some index blocks from disk; it could be lower because
fortuitously some retrieved tuples appear on the same block, and that block
remains buffered in memory.
E x a m p le 15.12: Suppose B ( R ) = 1000, and T ( R ) = 20,000. That is, R has
20,000 tuples that are packed 20 to a block. Let a be one of the attributes of
R, suppose there is an index on a, and consider the operation σ_{a=v}(R). Here
are some possible situations and the worst-case number of disk I/O’ s required.
We shall ignore the cost of accessing the index blocks in all cases.
1. If R is clustered, but we do not use the index, then the cost is 1000 disk
I/O's. That is, we must retrieve every block of R.

2. If R is not clustered and we do not use the index, then the cost is 20,000
disk I/O ’s.
3. If V(R,a) = 100 and the index is clustering, then the index-based algo­
rithm uses 1000/100 = 10 disk I/O's.
4. If V(R,a) = 10 and the index is nonclustering, then the index-based
algorithm uses 20,000/10 = 2000 disk I/O's. Notice that this cost is
higher than scanning the entire relation R, if R is clustered but the index
is not.
5. If V(R,a) = 20,000, i.e., a is a key, then the index-based algorithm takes 1
disk I/O plus whatever is needed to access the index, regardless of whether
the index is clustering or not. □
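To make these estimates concrete, the following Java sketch (not from the text; the
class and method names are invented for illustration) evaluates the cost formulas of
this section for σ_{a=v}(R), ignoring the cost of accessing the index itself, as in
Example 15.12.

// A minimal sketch of the selection-cost estimates of Section 15.6.2.
// B = B(R), T = T(R), V = V(R,a); all values are assumed given by the catalog.
class SelectionCostEstimator {
    static long tableScan(long B, long T, boolean clustered) {
        return clustered ? B : T;                               // B(R) if clustered, else up to T(R)
    }
    static long clusteringIndex(long B, long V) {
        return Math.max(1, (long) Math.ceil((double) B / V));   // ceil(B(R)/V(R,a)), at least 1
    }
    static long nonclusteringIndex(long T, long V) {
        return Math.max(1, (long) Math.ceil((double) T / V));   // ceil(T(R)/V(R,a)), at least 1
    }
    public static void main(String[] args) {
        // The statistics of Example 15.12: B(R) = 1000, T(R) = 20,000.
        System.out.println(tableScan(1000, 20000, true));       // 1000
        System.out.println(clusteringIndex(1000, 100));         // 10
        System.out.println(nonclusteringIndex(20000, 10));      // 2000
    }
}

Running it with the statistics of Example 15.12 reproduces the numbers in cases 1, 3,
and 4 above.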


Index-scan as an access method can help in several other kinds of selection
operations.

a) An index such as a B-tree lets us access the search-key values in a given


range efficiently. If such an index on attribute a of relation R exists, then
we can use the index to retrieve just the tuples of R in the desired range
for selections such as σ_{a>10}(R), or even σ_{a>10 AND a<20}(R).
b) A selection with a complex condition C can sometimes be implemented by
an index-scan followed by another selection on only those tuples retrieved
by the index-scan. If C is of the form a = v AND C', where C' is any
condition, then we can split the selection into a cascade of two selections,
the first checking only for a = v, and the second checking condition C ' .
The first is a candidate for use of the index-scan operator. This splitting
of a selection operation is one of many improvements that a query op­
timizer may make to a logical query plan; it is discussed particularly in
Section 16.7.1.

15.6.3 Joining by Using an Index


All the binary operations we have considered, and the unary full-relation
operations of γ and δ as well, can use certain indexes profitably. We shall leave
most of these algorithms as exercises, while we focus on the matter of joins. In
particular, let us examine the natural join R(X, Y) ⋈ S(Y, Z); recall that X,
Y, and Z can stand for sets of attributes, although it is adequate to think of
them as single attributes.
For our first index-based join algorithm, suppose that S has an index on the
attribute(s) Y. Then one way to compute the join is to examine each block of
R, and within each block consider each tuple t. Let t_Y be the component or

components of t corresponding to the attribute(s) Y. Use the index to find all
those tuples of S that have t_Y in their Y-component(s). These are exactly the
tuples of S that join with tuple t of R, so we output the join of each of these
tuples with t.
The number of disk I/O's depends on several factors. First, assuming R is
clustered, we shall have to read B(R) blocks to get all the tuples of R. If R is
not clustered, then up to T(R) disk I/O's may be required.
For each tuple t of R we must read an average of T(S)/V(S,Y) tuples
of S. If S has a nonclustered index on Y, then the number of disk I/O's
required to read S is T(R)T(S)/V(S,Y), but if the index is clustered, then
only T(R)B(S)/V(S,Y) disk I/O's suffice.⁷ In either case, we may have to add
a few disk I/O's per Y-value, to account for the reading of the index itself.
Regardless of whether or not R is clustered, the cost of accessing tuples of
S dominates. Ignoring the cost of reading R, we shall take T(R)T(S)/V(S,Y)
or T(R)(max(1, B(S)/V(S,Y))) as the cost of this join method, for the cases
of nonclustered and clustered indexes on S, respectively.
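The following sketch (an added illustration, not from the text) spells out the index-join
loop just described, using in-memory stand-ins: tuples are maps from attribute names to
values, and the index on S.Y is a hash map from Y-values to lists of S-tuples.

import java.util.*;

// A minimal sketch of the index-based join R(X,Y) ⋈ S(Y,Z): scan R once and, for
// each tuple, probe an index on S.Y with its Y-component.
class IndexJoinSketch {
    static List<Map<String, Object>> indexJoin(List<Map<String, Object>> r,
                                               Map<Object, List<Map<String, Object>>> indexOnSY) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> t : r) {                         // one pass over R
            Object ty = t.get("Y");                               // the Y-component t_Y
            for (Map<String, Object> s : indexOnSY.getOrDefault(ty, Collections.emptyList())) {
                Map<String, Object> joined = new HashMap<>(t);    // output t joined with each
                joined.putAll(s);                                 // matching S-tuple
                result.add(joined);
            }
        }
        return result;
    }
}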

Example 15.13: Let us consider our running example, relations R(X, Y) and
S(Y, Z) covering 1000 and 500 blocks, respectively. Assume ten tuples of either
relation fit on one block, so T(R) = 10,000 and T(S) = 5000. Also, assume
V(S, Y) = 100; i.e., there are 100 different values of Y among the tuples of S.
Suppose that R is clustered, and there is a clustering index on Y for S. Then
the approximate number of disk I/O's, excluding what is needed to access the
index itself, is 1000 to read the blocks of R (neglected in the formulas above)
plus 10,000 × 500/100 = 50,000 disk I/O's. This number is considerably above
the cost of other methods for the same data discussed previously. If either R
or the index on S is not clustered, then the cost is even higher. □

While Example 15.13 makes it look as if an index-join is a very bad idea,


there are other situations where the join R ⋈ S by this method makes much
more sense. Most common is the case where R is very small compared with S,
and V(S, Y) is large. We discuss in Exercise 15.6.5 a typical query in which
selection before a join makes R tiny. In that case, most of S will never be
examined by this algorithm, since most Y-values don't appear in R at all.
However, both sort- and hash-based join methods will examine every tuple of
S at least once.

15.6.4 Joins Using a Sorted Index


When the index is a B-tree, or any other structure from which we easily can
extract the tuples of a relation in sorted order, we have a number of other op­
portunities to use the index. Perhaps the simplest is when we want to compute
R(X, Y) ⋈ S(Y, Z), and we have such an index on Y for either R or S. We
can then perform an ordinary sort-join, but we do not have to perform the
intermediate step of sorting one of the relations on Y.
As an extreme case, if we have sorting indexes on Y for both R and S,
then we need to perform only the final step of the simple sort-based join of
Section 15.4.5. This method is sometimes called zig-zag join, because we jump
back and forth between the indexes finding Y-values that they share in common.
Notice that tuples from R with a Y-value that does not appear in S need never
be retrieved, and similarly, tuples of S whose Y-value does not appear in R
need not be retrieved.

⁷But remember that B(S)/V(S,Y) must be replaced by 1 if it is less, as discussed in
Section 15.6.2.

Example 15.14: Suppose that we have relations R(X, Y) and S(Y, Z) with
indexes on Y for both relations. In a tiny example, let the search keys (Y-
values) for the tuples of R be in order 1, 3, 4, 4, 4, 5, 6, and let the search-key
values for S be 2, 2, 4, 4, 6, 7. We start with the first keys of R and S, which are
1 and 2, respectively. Since 1 < 2, we skip the first key of R and look at the
second key, 3. Now, the current key of S is less than the current key of R, so
we skip the two 2's of S to reach 4.
At this point, the key 3 of R is less than the key of S, so we skip the key
of R. Now, both current keys are 4. We follow the pointers associated with
all the keys 4 from both relations, retrieve the corresponding tuples, and join
them. Notice that until we met the common key 4, no tuples of the relation
were retrieved.
Having dispensed with the 4's, we go to key 5 of R and key 6 of S. Since
5 < 6, we skip to the next key of R. Now the keys are both 6, so we retrieve
the corresponding tuples and join them. Since R is now exhausted, we know
there are no more pairs of tuples from the two relations that join. □
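The zig-zag traversal of Example 15.14 can be sketched as follows (an added illustration,
not from the text); it reports the common Y-values of two sorted key sequences, which is
the point at which a real implementation would follow the index pointers and join the
matching R- and S-tuples.

import java.util.*;

// A minimal sketch of the zig-zag scan over two sorted lists of search keys.
class ZigZagJoinSketch {
    static List<Integer> commonKeys(int[] rKeys, int[] sKeys) {   // both sorted ascending
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < rKeys.length && j < sKeys.length) {
            if (rKeys[i] < sKeys[j]) i++;                // skip R-keys that cannot match
            else if (rKeys[i] > sKeys[j]) j++;           // skip S-keys that cannot match
            else {                                       // a common key: join all tuples with it
                int k = rKeys[i];
                out.add(k);
                while (i < rKeys.length && rKeys[i] == k) i++;
                while (j < sKeys.length && sKeys[j] == k) j++;
            }
        }
        return out;
    }
    public static void main(String[] args) {
        // The keys of Example 15.14: the output is [4, 6].
        System.out.println(commonKeys(new int[]{1, 3, 4, 4, 4, 5, 6},
                                      new int[]{2, 2, 4, 4, 6, 7}));
    }
}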

If the indexes are B-trees, then we can scan the leaves of the two B-trees in
order from the left, using the pointers from leaf to leaf that are built into the
structure, as suggested in Fig. 15.15. If R and S are clustered, then retrieval of
all the tuples with a given key will result in a number of disk I/O ’ s proportional
to the fractions of these two relations read. Note that in extreme cases, where
there are so many tuples from R and S that neither fits in the available main
memory, we shall have to use a fixup like that discussed in Section 15.4.5.
However, in typical cases, the step of joining all tuples with a common Y-value
can be carried out with only as many disk I/O's as it takes to read them.

Figure 15.15: A zig-zag join using two indexes

Example 15.15 : Let us continue with Example 15.13, to see how joins using
a combination of sorting and indexing would typically perform on this data.
First, assume that there is an index on Y for S that allows us to retrieve the
tuples of S sorted by Y. We shall, in this example, also assume both relations
and the index are clustered. For the moment, we assume there is no index on
R.
Assuming 101 available blocks of main memory, we may use them to create
10 sorted sublists for the 1000-block relation R. The number of disk I/O ’
s is
2000 to read and write all of R. We next use 11 blocks of memory — 10 for
the sublists of R and one for a block of S ’s tuples, retrieved via the index. We
neglect disk I/O ’s and memory buffers needed to manipulate the index, but if
the index is a B-tree, these numbers will be small anyway. In this second pass,
we read all the tuples of R and S, using a total of 1500 disk I/O's, plus the small
amount needed for reading the index blocks once each. We thus estimate the
total number of disk I/O ’ s at 3500, which is less than that for other methods
considered so far.
Now, assume that both R and S have indexes on Y. Then there is no need
to sort either relation. We use just 1500 disk I/O ’ s to read the blocks of R
and S through their indexes. In fact, if we determine from the indexes alone
that a large fraction of R or S cannot match tuples of the other relation, then
the total cost could be considerably less than 1500 disk I/O's. However, in any
event we should add the small number of disk I/O ’ s needed to read the indexes
themselves. □

15.6.5 Exercises for Section 15.6


Exercise 15.6.1: Suppose there is an index on attribute R.a. Describe how
this index could be used to improve the execution of the following operations.
Under what circumstances would the index-based algorithm be more efficient
than sort- or hash-based algorithms?*

* a) R ∪_S S (assume that R and S have no duplicates, although they may
have tuples in common).

b) R ∩_S S (again, with R and S sets).

c) δ(R).

Exercise 15.6.2: Suppose B(R) = 10,000 and T(R) = 500,000. Let there
be an index on R.a, and let V(R, a) = k for some number k. Give the cost
of σ_{a=0}(R), as a function of k, under the following circumstances. You may
neglect disk I/O ’
s needed to access the index itself.

* a) The index is clustering.

b) The index is not clustering.

c) R is clustered, and the index is not used.

Exercise 15.6.3: Repeat Exercise 15.6.2 if the operation is the range query
σ_{C≤a AND a≤D}(R). You may assume that C and D are constants such that k/10
of the values are in the range.

! Exercise 15.6.4 : If R is clustered, but the index on R.a is not clustering, then
depending on k we may prefer to implement a query by performing a table-scan
of R or using the index. For what values of k would we prefer to use the index
if the relation and query are as in:

a) Exercise 15.6.2.

b) Exercise 15.6.3.

* Exercise 15.6.5: Consider the SQL query:

SELECT birthdate
FROM Starsln, MovieStar
WHERE movieTitle = ’King Kong’ AND starName = name;

This query uses the “


movie”relations:

Starsln(movieTitle, movieYear, starName)


MovieStar(name, address, gender, birthdate)

If we translate it to relational algebra, the heart is an equijoin between

σ_{movieTitle='King Kong'}(StarsIn)

and MovieStar, which can be implemented much as a natural join R ⋈ S. Since
there were only two movies named "King Kong," T(R) is very small. Suppose
that S, the relation MovieStar, has an index on name. Compare the cost of an
index-join for this R ⋈ S with the cost of a sort- or hash-based join.

! E xercise 15.6.6: In Example 15.15 we discussed the disk-I/O cost of a join


R ⋈ S in which one or both of R and S had sorting indexes on the join
attribute(s). However, the methods described in that example can fail if there
are too many tuples with the same value in the join attribute(s). What are
the limits (in number of blocks occupied by tuples with the same value) under
which the methods described will not need to do additional disk I/O's?


15.7 Buffer Management


We have assumed that operators on relations have available some number M
of main-memory buffers that they can use to store needed data. In practice,
these buffers are rarely allocated in advance to the operator, and the value
of M may vary depending on system conditions. The central task of making
main-memory buffers available to processes, such as queries, that act on the
database is given to the buffer manager. It is the responsibility of the buffer
manager to allow processes to get the memory they need, while minimizing the
delay and unsatisfiable requests. The role of the buffer manager is illustrated
in Fig. 15.16.

Figure 15.16: The buffer manager responds to requests for main-memory access
to disk blocks

15.7.1 Buffer Management Architecture


There are two broad architectures for a buffer manager:

1. The buffer manager controls main memory directly, as in many relational


DBMS’ s, or

2. The buffer manager allocates buffers in virtual memory, allowing the op­
erating system to decide which buffers are actually in main memory at
any time and which are in the “ swap space”on disk that the operating
system manages. Many “ main-memory”DBM S’ s and “object-oriented”
DBMS’ s operate this way.

Whichever approach a DBMS uses, the same problem arises: the buffer
manager should limit the number of buffers in use so they fit in the available

Memory Management for Query Processing


We are assuming that the buffer manager allocates to an operator M
main-memory buffers, where the value for M depends on system condi­
tions (including other operators and queries underway), and may vary
dynamically. Once an operator has M buffers, it may use some of them
for bringing in disk pages, others for index pages, and still others for sort
runs or hash tables. In some DBMS’ s, memory is not allocated from a
single pool, but rather there are separate pools of memory — with sepa­
rate buffer managers — for different purposes. For example, an operator
might be allocated D buffers from a pool to hold pages brought in from
disk, S buffers from a separate memory area allocated for sorting, and H
buffers to build a hash table. This approach offers more opportunities for
system configuration and “ tuning,”but may not make the best global use
of memory.

main memory. When the buffer manager controls main memory directly, and
requests exceed available space, it has to select a buffer to empty, by returning
its contents to disk. If the buffered block has not been changed, then it may
simply be erased from main memory, but if the block has changed it must be
written back to its place on the disk. When the buffer manager allocates space
in virtual memory, it has the option to allocate more buffers than can fit in
main memory. However, if all these buffers are really in use, then there will
be “ thrashing,”a common operating-system problem, where many blocks are
moved in and out of the disk’ s swap space. In this situation, the system spends
most of its time swapping blocks, while very little useful work gets done.
Normally, the number of buffers is a parameter set when the DBMS is
initialized. We would expect that this number is set so that the buffers occupy
the available main memory, regardless of whether the buffers are allocated in
main or virtual memory. In what follows, we shall not concern ourselves with
which mode of buffering is used, and simply assume that there is a fixed-size
buffer pool, a set of buffers available to queries and other database actions.

15.7.2 Buffer Management Strategies

The critical choice that the buffer manager must make is what block to throw
out of the buffer pool when a buffer is needed for a newly requested block. The
buffer-replacement strategies in common use may be familiar to you from other
applications of scheduling policies, such as in operating systems. These include:

Least-Recently Used (LRU)


The LRU rule is to throw out the block that has not been read or written for the
longest time. This method requires that the buffer manager maintain a table
indicating the last time the block in each buffer was accessed. It also requires
that each database access make an entry in this table, so there is significant
effort in maintaining this information. However, LRU is an effective strategy;
intuitively, buffers that have not been used for a long time are less likely to be
accessed in the near future than those that have been accessed recently.
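A minimal sketch of an LRU-managed buffer pool directory, based on Java's
LinkedHashMap in access order (an added illustration with invented names, not from the
text; real buffer managers also track dirty bits, pin counts, and write evicted dirty
blocks back to disk):

import java.util.*;

// The least-recently-used block id is evicted when the pool exceeds its capacity.
class LruBufferPool {
    private final int capacity;
    private final LinkedHashMap<Long, byte[]> buffers;   // block id -> in-memory page

    LruBufferPool(int capacity) {
        this.capacity = capacity;
        this.buffers = new LinkedHashMap<Long, byte[]>(capacity, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > LruBufferPool.this.capacity;   // evict the LRU block
            }
        };
    }

    byte[] access(long blockId) {
        // Any access (read or write) moves the block to the most-recently-used position.
        return buffers.computeIfAbsent(blockId, id -> new byte[4096]);  // stand-in for a disk read
    }
}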

First-In-First-Out (FIFO)
When a buffer is needed, under the FIFO policy the buffer that has been oc­
cupied the longest by the same block is emptied and used for the new block.
In this approach, the buffer manager needs to know only the time at which the
block currently occupying a buffer was loaded into that buffer. An entry into a
table can thus be made when the block is read from disk, and there is no need
to modify the table when the block is accessed. FIFO requires less maintenance
than LRU, but it can make more mistakes. A block that is used repeatedly, say
the root block of a B-tree index, will eventually become the oldest block in a
buffer. It will be written back to disk, only to be reread shortly thereafter into
another buffer.

The “Clock”Algorithm (“Second Chance”)


This algorithm is a commonly implemented, efficient approximation to LRU.
Think of the buffers as arranged in a circle, as suggested by Fig. 15.17. A
“hand”points to one of the buffers, and will rotate clockwise if it needs to find
a buffer in which to place a disk block. Each buffer has an associated “ flag,”
which is either 0 or 1. Buffers with a 0 flag are vulnerable to having their
contents sent back to disk; buffers with a 1 are not. When a block is read into
a buffer, its flag is set to 1. Likewise, when the contents of a buffer is accessed,
its flag is set to 1.
When the buffer manager needs a buffer for a new block, it looks for the
first 0 it can find, rotating clockwise. If it passes 1's, it sets them to 0. Thus,
a block is only thrown out of its buffer if it remains unaccessed for the time it
takes the hand to make a complete rotation to set its flag to 0 and then make
another complete rotation to find the buffer with its 0 unchanged. For instance,
in Fig. 15.17, the hand will set to 0 the 1 in the buffer to its left, and then move
clockwise to find the buffer with 0, whose block it will replace and whose flag
it will set to 1.
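The clock strategy can be sketched as follows (an added illustration, not from the text);
pinning, discussed below, is omitted:

// Buffers are arranged in a circle, each with a use flag; the hand clears flags set
// to 1 and evicts the first buffer whose flag is already 0 ("second chance").
class ClockReplacer {
    private final int[] flag;      // 0 = vulnerable, 1 = recently used
    private int hand = 0;          // current position of the clock hand

    ClockReplacer(int numBuffers) { this.flag = new int[numBuffers]; }

    // Called whenever the block in buffer i is loaded or accessed.
    void accessed(int i) { flag[i] = 1; }

    // Choose a buffer to replace: sweep clockwise, giving flagged buffers a second chance.
    int pickVictim() {
        while (true) {
            if (flag[hand] == 0) {
                int victim = hand;
                hand = (hand + 1) % flag.length;
                return victim;                 // the new block will set the flag via accessed()
            }
            flag[hand] = 0;                    // second chance: clear the flag and move on
            hand = (hand + 1) % flag.length;
        }
    }
}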

System Control
The query processor or other components of a DBMS can give advice to the
buffer manager in order to avoid some of the mistakes that would occur with

Figure 15.17: The clock algorithm visits buffers in a round-robin fashion and
replaces 01⋯1 with 10⋯0

More Tricks Using the Clock Algorithm


The “ clock”algorithm for choosing buffers to free is not limited to the
scheme described in Section 15.7.2. where flags had values 0 and 1. For
instance, one can start an important page with a number higher than 1
as its flag, and decrement the flag by 1 each time the “ hand”passes that
page. In fact, one can incorporate the concept of pinning blocks by giving
the pinned block an infinite value for its flag, and then having the system
release the pin at the appropriate time by setting the flag to 0.

a strict policy such as LRU, FIFO, or Clock. Recall from Section 12.3.5 that
there are sometimes technical reasons why a block in main memory can not
be moved to disk without first modifying certain other blocks that point to it.
These blocks are called “ pinned,”and any buffer manager has to modify its
buffer-replacement strategy to avoid expelling pinned blocks. This fact gives us
the opportunity to force other blocks to remain in main memory by declaring
them “ pinned,”even if there is no technical reason why they could not be
written to disk. For example, a cure for the problem with FIFO mentioned
above regarding the root of a B-tree is to “
pin”the root, forcing it to remain in
memory at all times. Similarly, for an algorithm like a one-pass hash-join, the
query processor may “ pin”the blocks of the smaller relation in order to assure
that it will remain in main memory during the entire time.

15.7.3 The Relationship Between Physical Operator


Selection and Buffer Management
The query optim izer will eventually select a set of physical operators that will
be used to execute a given query. This selection of operators may assume that a
certain number of buffers M is available for execution of each of these operators.

However, as we have seen, the buffer manager may not be willing or able to
guarantee the availability of these M buffers when the query is executed. There
are thus two related questions to ask about the physical operators:

1. Can the algorithm adapt to changes in the value of M , the number of


main-memory buffers available?

2. When the expected M buffers are not available, and some blocks that are
expected to be in memory have actually been moved to disk by the buffer
manager, how does the buffer-replacement strategy used by the buffer
manager impact the number of additional I/O ’s that must be performed?

Example 15.16 : As an example of the issues, let us consider the block-based


nested-loop join of Fig. 15.8. The basic algorithm does not really depend on
the value of M, although its performance depends on M. Thus, it is sufficient
to find out what M is just before execution begins.
It is even possible that M will change at different iterations of the outer
loop. That is, each time we load main memory with a portion of the relation S
(the relation of the outer loop), we can use all but one of the buffers available at
that time; the remaining buffer is reserved for a block of R, the relation of the
inner loop. Thus, the number of times we go around the outer loop depends on
the average number of buffers available at each iteration. However, as long as
M buffers are available on average , then the cost analysis of Section 15.3.4 will
hold. In the extreme, we might have the good fortune to find that at the first
iteration, enough buffers are available to hold all of S, in which case nested-loop
join gracefully becomes the one-pass join of Section 15.2.3.
If we pin the M − 1 blocks we use for S on one iteration of the outer loop,
then we shall not lose their buffers during the round. On the other hand, more
buffers may become available during that iteration. These buffers allow more
than one block of R to be kept in memory at the same time, but unless we are
careful, the extra buffers will not improve the running time of the nested-loop
join.
For instance, suppose that we use an LRU buffer-replacement strategy, and
there are k buffers available to hold blocks of R. As we read each block of R,
in order, the blocks that remain in buffers at the end of this iteration of the
outer loop will be the last k blocks of R. We next reload the M − 1 buffers for
S with new blocks of S and start reading the blocks of R again, in the next
iteration of the outer loop. However, if we start from the beginning of R again,
then the k buffers for R will need to be replaced, and we do not save disk I/O ’ s
just because k > 1.
A better implementation of nested-loop join, when an LRU buffer-replace­
ment strategy is used, visits the blocks of R in an order that alternates: first-
to-last and then last-to-first (called rocking). In that way, if there are k buffers
available to R , we save k disk I/O ’ s on each iteration of the outer loop except
the first. That is, the second and subsequent iterations require only B(R) − k

disk I/O ’
s for R. Notice that even if k = 1 (i.e., no extra buffers are available
to R), we save one disk I/O per iteration. □
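The "rocking" visit order can be sketched as follows (an added illustration with invented
names, not from the text):

// On even-numbered iterations of the outer loop the blocks of R are visited
// first-to-last, on odd-numbered iterations last-to-first, so the k blocks still
// buffered under LRU from the previous pass are reused instead of re-read.
class RockingOrder {
    static int[] visitOrder(int numBlocksOfR, int outerIteration) {
        int[] order = new int[numBlocksOfR];
        for (int i = 0; i < numBlocksOfR; i++) {
            order[i] = (outerIteration % 2 == 0) ? i : numBlocksOfR - 1 - i;
        }
        return order;
    }
    public static void main(String[] args) {
        // With B(R) = 4: iteration 0 visits 0,1,2,3 and iteration 1 visits 3,2,1,0, so the
        // last blocks read (still buffered) are exactly the first ones needed next.
        System.out.println(java.util.Arrays.toString(visitOrder(4, 0)));
        System.out.println(java.util.Arrays.toString(visitOrder(4, 1)));
    }
}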

Other algorithms also are impacted by the fact that M can vary and by the
buffer-replacement strategy used by the buffer manager. Here are some useful
observations.

•If we use a sort-based algorithm for some operator, then it is possible to


adapt to changes in M. If M shrinks, we can change the size of a sublist,
since the sort-based algorithms we discussed do not depend on the sublists
being the same size. The major limitation is that as M shrinks, we could
be forced to create so many sublists that we cannot then allocate a buffer
for each sublist in the merging process.

•The main-memory sorting of sublists can be performed by a number of


different algorithms. Since algorithms like merge-sort and quicksort are
recursive, most of the time is spent on rather small regions of memory.
Thus, either LRU or FIFO will perform well for this part of a sort-based
algorithm.

•If the algorithm is hash-based, we can reduce the number of buckets if M


shrinks, as long as the buckets do not then become so large that they do
not fit in allotted main memory. However, unlike sort-based algorithms,
we cannot respond to changes in M while the algorithm runs. Rather,
once the number of buckets is chosen, it remains fixed throughout the first
pass, and if buffers become unavailable, the blocks belonging to some of
the buckets will have to be swapped out.

15.7.4 Exercises for Section 15.7


Exercise 15.7.1: Suppose that we wish to execute a join R ⋈ S, and the
available memory will vary between M and M/2. In terms of M, B(R), and
B ( S ), give the conditions under which we can guarantee that the following
algorithms can be executed:

* a) A one-pass join.

* b) A two-pass, hash-based join.

c) A two-pass, sort-based join.

! Exercise 15.7.2: How would the number of disk I/O ’ s taken by a nested-loop
join improve if extra buffers became available and the buffer-replacement policy
were:

a) First-in-first-out.

b) The clock algorithm.


!! Exercise 15.7.3 : In Example 15.16, we suggested that it was possible to take


advantage of extra buffers becoming available during the join by keeping more
than one block of R buffered and visiting the blocks of R in reverse order on
even-numbered iterations of the outer loop. However, we could also maintain
only one buffer for R and increase the number of buffers used for S . Which
strategy yields the fewest disk I/O’
s?

15.8 Algorithms Using More Than Two Passes


While two passes are enough for operations on all but the largest relations, we
should observe that the principal techniques discussed in Sections 15.4 and 15.5
generalize to algorithms that, by using as many passes as necessary, can process
relations of arbitrary size. In this section we shall consider the generalization
of both sort- and hash-based approaches.

15.8.1 Multipass Sort-Based Algorithms


In Section 11.4.5 we alluded to how the two-phase multiway merge sort could be
extended to a three-pass algorithm. In fact, there is a simple recursive approach
to sorting that will allow us to sort a relation, however large, completely, or if
we prefer, to create n sorted sublists for any particular n.
Suppose we have M main-memory buffers available to sort a relation R ,
which we shall assume is stored clustered. Then do the following:
BASIS: If R fits in M blocks (i.e., B(R) ≤ M), then read R into main memory,
sort it using your favorite main-memory sorting algorithm, and write the sorted
relation to disk.
INDUCTION: If R does not fit into main memory, partition the blocks holding
R into M groups, which we shall call R_1, R_2, ..., R_M. Recursively sort R_i for
each i = 1, 2, ..., M. Then, merge the M sorted sublists, as in Section 11.4.4.
If we are not merely sorting R, but performing a unary operation such as γ
or δ on R, then we modify the above so that at the final merge we perform the
operation on the tuples at the front of the sorted sublists. That is,

• For a δ, output one copy of each distinct tuple, and skip over copies of
the tuple.

• For a γ, sort on the grouping attributes only, and combine the tuples with
a given value of these grouping attributes in the appropriate manner, as
discussed in Section 15.4.2.

When we want to perform a binary operation, such as intersection or join, we


use essentially the same idea, except that the two relations are first divided into
a total of M sublists. Then, each sublist is sorted by the recursive algorithm
above. Finally, we read each of the M sublists, each into one buffer, and we

perform the operation in the manner described by the appropriate subsection
of Section 15.4.
We can divide the M buffers between relations R and S as we wish. However,
to minimize the total number of passes, we would normally divide the buffers
in proportion to the number of blocks taken by the relations. That is, R gets
M × B(R)/(B(R) + B(S)) of the buffers, and S gets the rest.
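The recursive structure of the multipass sort can be sketched as follows (an added
illustration, not from the text). Blocks are simulated as fixed-size chunks of an
in-memory list; a real implementation would create sorted sublists on disk and merge
them with one buffer per sublist.

import java.util.*;

// BASIS: if the input fits in M "blocks" it is sorted directly. INDUCTION: otherwise
// it is split into M pieces, each sorted recursively, and the sorted pieces are merged.
class MultipassSortSketch {
    static List<Integer> sort(List<Integer> r, int m, int blockSize) {
        int blocks = (r.size() + blockSize - 1) / blockSize;
        if (blocks <= m) {                                   // BASIS: fits in M blocks
            List<Integer> copy = new ArrayList<>(r);
            Collections.sort(copy);
            return copy;
        }
        List<List<Integer>> sorted = new ArrayList<>();      // INDUCTION: M recursive sorts
        int pieceSize = (r.size() + m - 1) / m;
        for (int i = 0; i < r.size(); i += pieceSize) {
            sorted.add(sort(r.subList(i, Math.min(r.size(), i + pieceSize)), m, blockSize));
        }
        // Final pass: merge the sorted sublists using a priority queue of cursors.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] c) -> sorted.get(c[0]).get(c[1])));
        for (int i = 0; i < sorted.size(); i++) {
            if (!sorted.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] c = heap.poll();
            out.add(sorted.get(c[0]).get(c[1]));
            if (c[1] + 1 < sorted.get(c[0]).size()) heap.add(new int[]{c[0], c[1] + 1});
        }
        return out;
    }
}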

15.8.2 Performance of Multipass, Sort-Based Algorithms


Now, let us explore the relationship between the number of disk I/O's required,
the size of the relation(s) operated upon, and the size of main memory. Let
s(M, k) be the maximum size of a relation that we can sort using M buffers
and k passes. Then we can compute s(M, k) as follows:

BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) ≤ M. Put
another way, s(M, 1) = M.

INDUCTION: Suppose k > 1. Then we partition R into M pieces, each of
which must be sortable in k − 1 passes. If B(R) = s(M, k), then s(M, k)/M,
which is the size of each of the M pieces of R, cannot exceed s(M, k − 1). That
is: s(M, k) = M s(M, k − 1).

If we expand the above recursion, we find

s(M, k) = M s(M, k − 1) = M^2 s(M, k − 2) = ··· = M^(k−1) s(M, 1)

Since s(M, 1) = M, we conclude that s(M, k) = M^k. That is, using k passes,
we can sort a relation R if B(R) ≤ s(M, k), which says that B(R) ≤ M^k. Put
another way, if we want to sort R in k passes, then the minimum number of
buffers we can use is M = (B(R))^(1/k).
Each pass of a sorting algorithm reads all the data from disk and writes it
out again. Thus, a k-pass sorting algorithm requires 2kB(R) disk I/O's.
Now, let us consider the cost of a multipass join R(X, Y) ⋈ S(Y, Z), as
representative of a binary operation on relations. Let j(M, k) be the largest
number of blocks such that in k passes, using M buffers, we can join relations
of j(M, k) or fewer total blocks. That is, the join can be accomplished provided
B(R) + B(S) ≤ j(M, k).
On the final pass, we merge M sorted sublists from the two relations.
Each of the sublists is sorted using k − 1 passes, so they can be no longer
than s(M, k − 1) = M^(k−1) each, or a total of M s(M, k − 1) = M^k. That is,
B(R) + B(S) can be no larger than M^k, or put another way, j(M, k) = M^k.
Reversing the role of the parameters, we can also state that to compute the join
in k passes requires (B(R) + B(S))^(1/k) buffers.
To calculate the number of disk I/O's needed in the multipass algorithms,
we should remember that, unlike for sorting, we do not count the cost of writing
the final result to disk for joins or other relational operations. Thus, we use
2(k − 1)(B(R) + B(S)) disk I/O's to sort the sublists, and another B(R) + B(S)
disk I/O's to read the sorted sublists in the final pass. The result is a total of
(2k − 1)(B(R) + B(S)) disk I/O's.
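A small sketch applying these formulas (an added illustration, not from the text): the
smallest k with M^k ≥ B(R) is the number of passes needed, and a k-pass sort then costs
2kB(R) disk I/O's.

class MultipassCost {
    // Smallest k such that M^k >= B, i.e., the number of passes needed to sort B blocks.
    static int passesToSort(long B, long M) {
        int k = 1;
        for (long capacity = M; capacity < B; capacity *= M) k++;
        return k;
    }
    public static void main(String[] args) {
        long B = 20000, M = 101;
        int k = passesToSort(B, M);
        System.out.println(k);             // 3 passes, since 101^2 < 20,000 <= 101^3
        System.out.println(2L * k * B);    // 2kB(R) disk I/O's to sort
    }
}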

15.8.3 Multipass Hash-Based Algorithms


There is a corresponding recursive approach to using hashing for operations on
large relations. We hash the relation or relations into M - 1 buckets, where M
is the number of available memory buffers. We then apply the operation to each
bucket individually, in the case of a unary operation. If the operation is binary,
such as a join, we apply the operation to each pair of corresponding buckets, as
if they were the entire relations. For the common relational operations we have
considered — duplicate-elimination, grouping, union, intersection, difference,
natural join, and equijoin — the result of the operation on the entire relation(s)
will be the union of the results on the bucket(s). We can describe this approach
recursively as:
BASIS: For a unary operation, if the relation fits in M buffers, read it into
memory and perform the operation. For a binary operation, if either relation
fits in M − 1 buffers, perform the operation by reading this relation into main
memory and then read the second relation, one block at a time, into the Mth
buffer.

INDUCTION: If no relation fits in main memory, then hash each relation into
M − 1 buckets, as discussed in Section 15.5.1. Recursively perform the operation
on each bucket or corresponding pair of buckets, and accumulate the output
from each bucket or pair.
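The recursion can be sketched for one unary operation, duplicate elimination, as follows
(an added illustration, not from the text); buckets are kept in memory here, whereas a
real implementation writes each bucket to disk with one output buffer per bucket, and
the hash function must change from pass to pass.

import java.util.*;

// BASIS: if the input fits in "memory" it is processed directly. INDUCTION: otherwise
// it is hashed into M-1 buckets, and each bucket is handled recursively.
class MultipassHashSketch {
    static List<String> distinct(List<String> r, int m, int memoryLimit, int pass) {
        if (r.size() <= memoryLimit) {                           // BASIS: fits in memory
            return new ArrayList<>(new LinkedHashSet<>(r));
        }
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < m - 1; i++) buckets.add(new ArrayList<>());
        for (String t : r) {                                      // INDUCTION: hash into M-1 buckets,
            int b = Math.floorMod(Objects.hash(t, pass), m - 1);  // varying the hash per pass
            buckets.get(b).add(t);
        }
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {                     // recurse on each bucket and
            out.addAll(distinct(bucket, m, memoryLimit, pass + 1));  // accumulate the output
        }
        return out;
    }
}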

15.8.4 Performance of Multipass Hash-Based Algorithms


In what follows, we shall make the assumption that when we hash a relation,
the tuples divide as evenly as possible among the buckets. In practice, this as­
sumption will be met approximately if we choose a truly random hash function,
but there will always be some unevenness in the distribution of tuples among
buckets.
First, consider a unary operation, like γ or δ on a relation R using M buffers.
Let u(M, k) be the number of blocks in the largest relation that a k-pass hashing
algorithm can handle. We can define u recursively by:

BASIS: u(M, 1) = M, since the relation R must fit in M buffers; i.e., B(R) ≤
M.

INDUCTION: We assume that the first step divides the relation R into M − 1
buckets of equal size. Thus, we can compute u(M, k) as follows. The buckets
for the next pass must be sufficiently small that they can be handled in k − 1
passes; that is, the buckets are of size u(M, k − 1). Since R is divided into M − 1
buckets, we must have u(M, k) = (M − 1) u(M, k − 1).


If we expand the recurrence above, we find that u(M, k) = M(M − 1)^(k−1),
or approximately, assuming M is large, u(M, k) = M^k. Equivalently, we can
perform one of the unary relational operations on relation R in k passes with
M buffers, provided M ≥ (B(R))^(1/k).
We may perform a similar analysis for binary operations. As in Section
15.8.2, let us consider the join. Let j(M, k) be an upper bound on the size of
the smaller of the two relations R and S involved in R(X, Y) ⋈ S(Y, Z). Here,
as before, M is the number of available buffers and k is the number of passes
we can use.

BASIS: j(M, 1) = M − 1; that is, if we use the one-pass algorithm to join, then
either R or S must fit in M − 1 blocks, as we discussed in Section 15.2.3.

INDUCTION: j(M, k) = (M − 1) j(M, k − 1); that is, on the first of k passes,
we can divide each relation into M − 1 buckets, and we may expect each bucket
to be 1/(M − 1) of its entire relation, but we must then be able to join each
pair of corresponding buckets in k − 1 passes.

By expanding the recurrence for j(M, k), we conclude that j(M, k) = (M − 1)^k.
Again assuming M is large, we can say approximately j(M, k) = M^k. That
is, we can join R(X, Y) ⋈ S(Y, Z) using k passes and M buffers provided
M^k ≥ min(B(R), B(S)).

15.8.5 Exercises for Section 15.8


Exercise 15.8.1: Suppose B ( R ) = 20,000, B ( S ) = 50,000, and M = 101.
Describe the behavior of the following algorithms to compute R ⋈ S:

* a) A three-pass, sort-based algorithm,

b) A three-pass, hash-based algorithm.

! Exercise 15.8.2: There are several “ tricks”we have discussed for improving
the performance of two-pass algorithms. For the following, tell whether the
trick could be used in a multipass algorithm, and if so, how?

a) The hybrid-hash-join trick of Section 15.5.6.

b) Improving a sort-based algorithm by storing blocks consecutively on disk


(Section 15.5.7).

c) Improving a hash-based algorithm by storing blocks consecutively on disk


(Section 15.5.7).
Chapter 12

Data Processing - Parallelism

This chapter contains the papers:

D. DeWitt and J. Gray. Parallel database systems: the future
of high performance database systems. Commun. ACM 35, pp.
85-98 (14 of 1868), 1992. Doi: 10.1145/129888.129894

J. Dean and S. Ghemawat. MapReduce: a flexible data process­
ing tool. Commun. ACM 53, 1, pp. 72-77 (6 of 159), 2010. Doi:
10.1145/1629175.1629198

G. Graefe. Encapsulation of parallelism in the Volcano query
processing system. SIGMOD Rec. 19, 2, pp. 102-111 (10 of 632),
1990. Doi: 10.1145/93605.98720

An important technique for scalability with data volumes is the use of
parallelism. Parallel data processing methodologies allow us to add more pro­
cessing and data transfer components to a computer system, with the goal
to either obtain speedups or to scale up with data. In this final chapter, the
text explores basic concepts in parallel data processing. The ultimate goal of
this portion of the material is to provide us with the basic terminology and ap­
proaches to parallel processing of large data volumes, which can be employed to
reflect on the design and implementation of data analysis services.

The learning goals for this portion of the material are listed below.

• Identify the main metrics in parallel data processing, namely speed-up
and scale-up.

• Describe different models of parallelism (partition, pipelined) and archi­
tectures for parallel data processing (shared-memory, shared-disk, shared-
nothing).

• Explain different data partitioning strategies as well as their advantages
and disadvantages.

• Apply data partitioning to achieve parallelism in data processing opera­
tors.

• Explain the main algorithms for parallel processing, in particular parallel
scan, parallel sort, and parallel joins.

• Explain the relationship between MapReduce and partitioned parallel
processing strategies.

Graefe's paper on interface design for parallel operators is given to deepen
understanding; however, it is to be considered as an additional reading and not
fundamental to the attainment of the learning goals above.
contributed articles

DOI:10.1145/1629175.1629198

MapReduce advantages over parallel databases
include storage-system independence and
fine-grain fault tolerance for large jobs.

BY JEFFREY DEAN AND SANJAY GHEMAWAT

MapReduce: A Flexible Data Processing Tool

MapReduce is a programming model for processing
and generating large data sets.4 Users specify a
map function that processes a key/value pair to
generate a set of intermediate key/value pairs and
a reduce function that merges all intermediate
values associated with the same intermediate key.
We built a system around this programming model
in 2003 to simplify construction of the inverted
index for handling searches at Google.com. Since
then, more than 10,000 distinct programs have been
implemented using MapReduce at Google, including
algorithms for large-scale graph processing, text
processing, machine learning, and statistical machine
translation. The Hadoop open source implementation
of MapReduce has been used extensively outside of
Google by a number of organizations.10,11
To help illustrate the MapReduce programming model,
consider the problem of counting the number of
occurrences of each word in a large collection of
documents. The user would write code like the
following pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated
count of occurrences (just "1" in this simple example).
The reduce function sums together all counts emitted
for a particular word.
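A plain-Java rendering of this word count (an illustration added for this compendium,
not part of the original article) runs the map and reduce phases sequentially in one
process; the actual system distributes the map tasks, shuffles the intermediate
(word, count) pairs by key, and runs the reducers in parallel:

import java.util.*;

// "Map" emits (word, 1) pairs; "reduce" sums the counts per word.
class WordCountSimulation {
    static Map<String, Integer> wordCount(List<String> documents) {
        Map<String, List<Integer>> intermediate = new HashMap<>();
        for (String doc : documents) {                        // map: emit (w, 1) for each word
            for (String w : doc.split("\\s+")) {
                if (!w.isEmpty()) {
                    intermediate.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
                }
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : intermediate.entrySet()) {
            int result = 0;                                   // reduce: sum all counts for a word
            for (int v : e.getValue()) result += v;
            counts.put(e.getKey(), result);
        }
        return counts;
    }
}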
MapReduce automatically parallelizes and executes the
program on a large cluster of commodity machines. The
runtime system takes care of the details of partitioning
the input data, scheduling the program's execution
across a set of machines, handling machine failures,
and managing required inter-machine communication.
MapReduce allows programmers with no experience with
parallel and distributed systems to easily utilize the
resources of a large distributed system. A typical
MapReduce computation processes many terabytes of data
on hundreds or thousands of machines. Programmers find
the system easy to use, and more than 100,000 MapReduce
jobs are executed on Google's clusters every day.

Compared to Parallel Databases
The query languages built into parallel database
systems are also used to


express the type of computations sup­ support a new storage system by de­ would need to read only that sub-range
ported by MapReduce. A 2009 paper fining simple reader and writer imple­ instead of scanning the entire Bigtable.
by Andrew Pavlo et al. (referred to here mentations that operate on the storage Furthermore, like Vertica and other col­
as the “ comparison paper” 13) com ­ system. Examples of supported storage umn-store databases, we will read data
pared the performance o f MapReduce systems are files stored in distributed only from the columns needed for this
and parallel databases. It evaluated file systems,7 database query results,2'9 analysis, since Bigtable can store data
the open source H adoop implementa­ data stored in Bigtable,3and structured segregated by columns.
tion10of the MapReduce programming input files (such as B-trees). A single Yet another example is the process­
model, DBMS-X (an unidentified com ­ MapReduce operation easily processes ing of log data within a certain date
mercial database system), and Vertica and com bines data from a variety of range; see the Join task discussion in
(a column-store database system from storage systems. the comparison paper, where the Ha­
a company co-founded by one of the Now consider a system in which a doop benchmark reads through 155
authors of the comparison paper). Ear­ parallel DBMS is used to perform all million records to process the 134,000
lier b log posts by som e of the paper’ s data analysis. The input to such analy­ records that fall within the date range
authors characterized MapReduce as sis must first be copied into the parallel of interest. Nearly every logging sys­
“a major step backwards.” 1’
6 In this DBMS. This loading phase is inconve­ tem we are familiar with rolls over to
article, we address several m isconcep­ nient. It may also be unacceptably slow, a new log file periodically and embeds
tions about MapReduce in these three especially if the data will be analyzed the rollover time in the name of each
publications: only once or twice after being loaded. log file. Therefore, we can easily run a
► MapReduce cannot use indices and For example, consider a batch-oriented MapReduce operation over just the log
implies a full scan of all input data; Web-crawling-and-indexing system files that may potentially overlap the
► MapReduce input and outputs are that fetches a set of Web pages and specified date range, instead o f reading
always simple files in a file system; and generates an inverted index. It seems all log files.
► MapReduce requires the use of in­ awkward and inefficient to load the set
efficient textual data formats. of fetched pages into a database just so Complex Functions
We also discuss other important is­ they can be read through once to gener­ Map and Reduce functions are often
sues: ate an inverted index. Even if the cost of fairly simple and have straightforward
► MapReduce is storage-system inde­ loading the input into a parallel DBMS SQL equivalents. However, in many
pendent and can process data without is acceptable, we still need an appropri­ cases, especially for Map functions, the
first requiring it to be loaded into a da­ ate loading tool. Here is another place function is too complicated to be ex­
tabase. In many cases, it is possible to MapReduce can be used; instead of pressed easily in a SQL query, as in the
run 50 or more separate MapReduce writing a custom loader with its own ad following examples:
analyses in complete passes over the hoc parallelization and fault-tolerance ► Extracting the set o f outgoing links
data before it is possible to load the data support, a simple MapReduce program from a collection of HTML documents
into a database and complete a single can be written to load the data into the and aggregating by target document;
analysis; parallel DBMS. ► Stitching together overlapping sat­
► Complicated transformations are ellite images to remove seams and to
often easier to express in MapReduce Indices select high-quality imagery for Google
than in SQL; and The comparison paper incorrectly said Earth;
► Many conclusions in the compari­ that MapReduce cannot take advan­ ► Generating a collection of inverted
son paper were based on implementa­ tage of pregenerated indices, leading index files using a com pression scheme
tion and evaluation shortcomings not to skewed benchmark results in the tuned for efficient support of Google
fundamental to the MapReduce model; paper. For example, consider a large search queries;
we discuss these shortcomings later in data set partitioned into a collection ► Processing all road segments in the
this article. o f nondistributed databases, perhaps world and rendering map tile images
We encourage readers to read the using a hash function. An index can that display these segments for Google
original MapReduce paper1 and the be added to each database, and the Maps; and
comparison paper13for more context. result of running a database query us­ ► Fault-tolerant parallel execution of
ing this index can be used as an input programs written in higher-level lan­
Heterogenous Systems to MapReduce. If the data is stored in guages (such as Sawzall14 and Pig Lat­
Many production environments con­ D database partitions, we will run D in12) across a collection of input data.
tain a mix of storage systems. Customer database queries that will becom e the Conceptually, such user defined
data may be stored in a relational data­ D inputs to the MapReduce execution. functions (UDFs) can be com bined
base, and user requests may be logged Indeed, som e o f the authors of Pavlo et with SQL queries, but the experience
to a file system. Furthermore, as such al. have pursued this approach in their reported in the comparison paper indi­
environments evolve, data may migrate more recent work.11 cates that UDF support is either buggy
to new storage systems. MapReduce Another example of the use o f in­ (in DBMS-X) or m issing (in Vertica).
provides a simple m odel for analyzing dices is a MapReduce that reads from These concerns may go away over the
data in such heterogenous systems. Bigtable. If the data needed maps to a long term, but for now, MapReduce is a
End users can extend MapReduce to4 7 sub-range of the Bigtable row space, we better framework for doing more com-


plicated tasks (such as those listed ear­ of protocol buffers uses an optim ized
lier) than the selection and aggregation binary representation that is more
that are SQL’ s forte. com pact and much faster to encode
and decode than the textual formats
Structured Data and Schemas used by the Hadoop benchmarks in the
Pavlo et al. did raise a good point that MapReduce is comparison paper. For example, the
schemas are helpful in allowing multi­
ple applications to share the same data.
a highly effective automatically generated code to parse
a Rankings protocol buffer record
For example, consider the following and efficient runs in 20 nanoseconds per record as
schema from the comparison paper:
CREATE TABLE R a n k in g s (
tool for large-scale compared to the 1,731 nanoseconds
required per record to parse the tex­
pageURL VARCHAR(IOO) fault-tolerant tual input format used in the Hadoop
PRIMARY KEY,
pa geR a n k INT.
data analysis. benchmark mentioned earlier. These
measurements were obtained on a JVM
a v g D u r a t io n INT ); running on a 2.4GHz Intel Core-2 Duo.
The Java code fragments used for the
The corresponding H adoop bench­ benchmark runs were:
marks in the com parison paper used
an inefficient and fragile textual for­ // F ragm en t 1: p r o t o c o l b u f­
mat with different attributes separated f e r p a r s in g
by vertical bar characters: f o r ( in t i = 0; i < n u m ltera -
tio n s ; i++) {
1371h t t p ://www. s o m e h o s t .com/ r a n k i n g s .p a rse F ro m (v alu e);
i n d e x .h tm l 1602 p a g e r a n k = r a n k in g s , g e t -
P ageran k O ;
In contrast to ad hoc, inefficient }
formats, virtually all MapReduce op ­
erations at Google read and write data // F ra gm en t 2: t e x t f o r ­
in the Protocol Buffer format.8A high- mat p a r s i n g ( e x t r a c t e d from
level language describes the input and B ench m ark1j ava
output types, and compiler-generated // from t h e s o u r c e c o d e
code is used to hide the details o f en- p o s t e d b y P a v lo e t al.)
coding/decoding from application f o r ( in t i = 0; i < n u m ltera -
code. The corresponding protocol buf­ t i o n s ; i++) {
fer description for the Rankings data S t r i n g data[] = v a lu e . t o -
would be: S t r in g O .sp lit("\\| ” );
p a geran k = In teger.
m e s sa g e R a n k in g s { valu e0 f(d ata[0 ]);
r e q u i r e d s t r i n g p a g e u r l = 1: }
r e q u i r e d int32 p a gera n k = 2;
r e q u ire d int32 avgdu ration = 3; Given the factor o f an 80-fold dif­
} ference in this record-parsing bench­
mark, we suspect the absolute num ­
The following Map function frag­ bers for the H adoop benchmarks in
ment processes a Rankings record: the com parison paper are inflated and
cannot be used to reach conclusions
R a n k in g s r = new R a n k in g s 0) about fundamental differences in the
r .p a rse F ro m (v alu e) ; performance of MapReduce and paral­
i f (r.getPagerankO > 10) { ... } lel DBMS.

The protocol buffer framework Fault Tolerance


allows types to be upgraded (in con­ The MapReduce implementation uses
strained ways) without requiring exist­ a pull model for moving data between
ing applications to be changed (or even mappers and reducers, as opposed to
recom piled or rebuilt). This level of a push model where mappers write di­
schema support has proved sufficient rectly to reducers. Pavlo et al. correctly
for allowing thousands of Google engi­ pointed out that the pull model can re­
neers to share the same evolving data sult in the creation of many small files
types. and many disk seeks to move data be­
Furthermore, the implementation tween mappers and reducers. Imple-


mentation tricks like batching, sorting, format for structured data (protocol
and grouping of intermediate data and buffers) instead of inefficient textual
smart scheduling of reads are used by formats.
G oogle’ s MapReduce implementation Reading unnecessary data. The com ­
to mitigate these costs. parison paper says, “ MR is always forced
MapReduce implementations tend to start a query with a scan of the entire
not to use a push model due to the input file.”MapReduce does not require
ACM fault-tolerance properties required
by G oogle’ s developers. Most MapRe­
a full scan over the data; it requires only
an implementation of its input inter­
Transactions on duce executions over large data sets
encounter at least a few failures; apart
face to yield a set of records that match
som e input specification. Examples of

Accessible from hardware and software problems,


G oogle’ s cluster scheduling system can
input specifications are:
► All records in a set of files;

Computing preempt MapReduce tasks by killing


them to make room for higher-priority
► All records with a visit-date in the
range [2000-01-15..2000-01-22]; and
tasks. In a push model, failure of a re­ ► All data in Bigtable table T whose
ducer would force re-execution of all “language”column is “ Turkish.”
Map tasks. The input may require a full scan
We suspect that as data sets grow over a set o f files, as Pavlo et al. sug­
larger, analyses will require more gested, but alternate implementations
computation, and fault tolerance will are often used. For example, the input
becom e more important. There are al­ may be a database with an index that
ready more than a dozen distinct data provides efficient filtering or an in­
sets at Google more than 1PB in size dexed file structure (such as daily log
and dozens more hundreds o f TBs files used for efficient date-based fil­
in size that are processed daily using tering of log data).
MapReduce. Outside o f Google, many This mistaken assum ption about
users listed on the Hadoop users list11 MapReduce affects three of the five
are handling data sets of multiple hun­ benchmarks in the com parison paper
dreds o f terabytes or more. Clearly, as (the selection, aggregation, and join
data sets continue to grow, more users tasks) and invalidates the conclusions
will need a fault-tolerant system like in the paper about the relative perfor­
MapReduce that can be used to process m ance o f MapReduce and parallel da­
these large data sets efficiently and ef­ tabases.
Merging results. The measurements of Hadoop in all five benchmarks in the comparison paper included the cost of a final phase to merge the results of the initial MapReduce into one file. In practice, this merging is unnecessary, since the next consumer of MapReduce output is usually another MapReduce that can easily operate over the set of files produced by the first MapReduce, instead of requiring a single merged input. Even if the consumer is not another MapReduce, the reducer processes in the initial MapReduce can write directly to a merged destination (such as a Bigtable or parallel database table).

Data loading. The DBMS measurements in the comparison paper demonstrated the high cost of loading input data into a database before it is analyzed. For many of the benchmarks in the comparison paper, the time needed to load the input data into a parallel database is five to 50 times the time needed to analyze the data via Hadoop. Put another way, for some of


the benchmarks, starting with data in a collection of files on disk, it is possible to run 50 separate MapReduce analyses over the data before it is possible to load the data into a database and complete a single analysis. Long load times may not matter if many queries will be run on the data after loading, but this is often not the case; data sets are often generated, processed once or twice, and then discarded. For example, the Web-search index-building system described in the MapReduce paper4 is a sequence of MapReduce phases where the output of most phases is consumed by one or two subsequent MapReduce phases.

Conclusion
The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems. In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis. However, a few useful lessons can be drawn from this discussion:

Startup latency. MapReduce implementations should strive to reduce startup latency by using techniques like worker processes that are reused across different invocations;

Data shuffling. Careful attention must be paid to the implementation of the data-shuffling phase to avoid generating O(M*R) seeks in a MapReduce with M map tasks and R reduce tasks (illustrated in the note after this list);

Textual formats. MapReduce users should avoid using inefficient textual formats;

Natural indices. MapReduce users should take advantage of natural indices (such as timestamps in log file names) whenever possible; and

Unmerged output. Most MapReduce output should be left unmerged, since there is no benefit to merging if the next consumer is another MapReduce program.
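To see where the O(M*R) figure in the data-shuffling lesson comes from (an illustrative calculation of ours, not a measurement from either paper): each of the M map tasks partitions its output into R pieces, one per reduce task, so an implementation that leaves every piece in its own file and reads each piece separately performs on the order of M x R seeks. With, say, M = 10,000 map tasks and R = 500 reduce tasks, that is roughly five million seeks; batching, sorting, and grouping the intermediate data, as discussed earlier, is what keeps this cost manageable.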
MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.

References
1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases (Lyon, France, 2009); http://db.cs.yale.edu/hadoopdb/
2. Aster Data Systems, Inc. In-Database MapReduce for Rich Analytics; http://www.asterdata.com/product/mapreduce.php
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating System Design and Implementation (Seattle, WA, Nov. 6-8). Usenix Association, 2006; http://labs.google.com/papers/bigtable.html
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, CA, Dec. 6-8). Usenix Association, 2004; http://labs.google.com/papers/mapreduce.html
5. DeWitt, D. and Stonebraker, M. MapReduce: A Major Step Backwards. Blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
6. DeWitt, D. and Stonebraker, M. MapReduce II. Blog post; http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
7. Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 19-22). ACM Press, New York, 2003; http://labs.google.com/papers/gfs.html
8. Google. Protocol Buffers: Google's Data Interchange Format. Documentation and open source release; http://code.google.com/p/protobuf/
9. Greenplum. Greenplum MapReduce: Bringing Next-Generation Analytics Technology to the Enterprise; http://www.greenplum.com/resources/mapreduce/
10. Hadoop. Documentation and open source release; http://hadoop.apache.org/core/
11. Hadoop. Users list; http://wiki.apache.org/hadoop/PoweredBy
12. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Auckland, New Zealand, June 2008); http://hadoop.apache.org/pig/
13. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference (Providence, RI, June 29-July 2). ACM Press, New York, 2009; http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
14. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13, 4, 227-298; http://labs.google.com/papers/sawzall.html

Jeffrey Dean (jeff@google.com) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.

Sanjay Ghemawat (sanjay@google.com) is a Google Fellow in the Systems Infrastructure Group of Google, Mountain View, CA.

© 2010 ACM 0001-0782/10/0100 $10.00


Encapsulation of Parallelism
in the Volcano Query Processing System

Goetz Graefe

University of Colorado
Boulder, CO 80309-0430
graefe@boulder.colorado.edu

Abstract

Volcano is a new dataflow query processing system we have developed for database systems research and education. The uniform interface between operators makes Volcano extensible by new operators. All operators are designed and coded as if they were meant for a single-process system only. When attempting to parallelize Volcano, we had to choose between two models of parallelization, called here the bracket and operator models. We describe the reasons for not choosing the bracket model, introduce the novel operator model, and provide details of Volcano's exchange operator that parallelizes all other operators. It allows intra-operator parallelism on partitioned datasets and both vertical and horizontal inter-operator parallelism. The exchange operator encapsulates all parallelism issues and therefore makes implementation of parallel database algorithms significantly easier and more robust. Included in this encapsulation is the translation between demand-driven dataflow within processes and data-driven dataflow between processes. Since the interface between Volcano operators is similar to the one used in "real," commercial systems, the techniques described here can be used to parallelize other query processing engines.

1. Introduction template processes that encompass specific operators Wc


call the new method of parallelizing the operator model In
In order to provide a testbed for database systems
this paper, we describe this new method and contrast it
education and research, we decided to implement an extensi­
with the method used in GAMMA and Bubba, which we
ble and modular query processmg system One important
call the bracket model Since we developed, implemented,
goal was to achieve flexibility and extensibility without sac-
and tested the operator model within the framework o f the
nficing efficiency The result is a small system, consisung
Volcano system, we will describe it as realized m Volcano
o f less than two dozen core modules with a total o f about
15,000 lines o f C cod e These modules include a file sys­ Volcano was designed to be extensible, its design and
tem, buffer management, sorting, top-down B*-trees, and implementation follow s many o f the ideas outlined by
two algorithms each for natural join, semi-join, outer jom, Batory et al for the GENESIS design [5] In this paper,
anti-join, aggregation, duplicate elimination, division, union, we do not focus on or substantiate the claim to extensibility
intersection, difference, anU-difference, and Cartesian pro­ and instead refer the reader to [17], suffice it to point out
duct Moreover, a single module allows parallel processing that if new operators use and provide Volcano's standard
o f all algonthms listed above interface between operators, they can easily be included in a
Volcano query evaluation plan and parallelized by the
The last module, called the exchange module, is the
exchange operator
focus o f this paper It was designed and implemented after
most o f the other query processing modules The design Volcano’ s mechanism to synchronize mulnplc opera­
goal was to parallelize all existing query processmg algo­ tors in complex query trees withm a smgle process and to
rithms without modifying their implementations exchange data items between operators are very similar to
Equivalently, the goal was to allow parallelizing new algo­ many commercial database systems, e g , Ingres and the
nthms not yet invented without requinng that these algo­ System R family o f database systems Therefore, it seems
nthms be implemented with concern for parallelism This fairly straightforward to apply the techniques developed for
goal was met almost entirely, the only change to the exist­ Volcano's exchange operator and outlined m this paper to
ing modules concerned device names and numbers to allow parallelize the query processing engines of such systems
horizontal partitioning over multiple disks, also called disk This paper is organized as follows In the following
stnping [25] secuon, we briefly review previous work that influenced our
Parallelizing a query evaluation eng me using an design, and introduce the bracket model of parallelization
operator is a novel idea earlier research projects used In Section 3, we provide a more detailed description o f
Volcano The operator model of parallelization and
Volcano's exchange operator arc described in Section 4
Permission to copy without fee all or part o f this material is granted provided
that the copies are not made or distributed for direct commercial advantage the We present experimental performance measurements in Sec­
ACM copyright notice and the title o f the publication and its date appear and tion 5 that show the exchange operator's low overhead
notice is given that copying is by permission o f the Association for Computing
Machinery To copy otherwise or to republish, requires a fee and/or specific Section 6 contains a summary and our conclusions from this
permission effort
© 1990 ACM 0-89791-365-5/90/0005/0101 $1.50

2. Previous Work
Since so many different system have been developed
to process large dataset efficiently, we only survey the sys­
tems that have strongly influenced the design o f Volcano
At the start in 1987, we felt that some decisions in
WiSS [11] and GAMMA [12] were not optimal for perfor­
mance or generality For instance, the decisions to protect
W iSS’
s butfer space by copying a data record in or out for
each request and to re-request a buffer page for every
record during a scan seemed to inflict too much overhead1
However, many of the design decisions in Volcano were
strongly influenced by experiences with WiSS and
GAMMA The design of the data exchange mechanism
between operators, the focus o f this paper, is one o f the
few radical departures from GAM M A’ s design
During the design o f the EXODUS storage manager
[10], many o f these issues were revisited Lessons learned
and tradeoffs explored in these discussions certainly helped
form the ideas behind Volcano The development o f E [24]
influenced the strong emphasis on iterators for query pro­
cessing The design o f GENESIS [5] emphasized the
importance o f a uniform iterator interface
Finally, a number o f conventional (relational) and
extensible systems have influenced our design Without
further discussion, we mention Ingres [27], System R [3],
Bubba [2], Starburst [26], Postgres [28], and XPRS [29]
Furthermore, there has been a large amount o f research and
development in the database machine area, such that there
is an international workshop on the topic Almost all data­
base machine proposals and unplementauons utilize parallel­
ism in som e form We certainly have learned from this
work and tned to include its lessons in the design and Figure 1 Bracket M odel o f Parallelization.
implementation o f Volcano In particular, we have strived
for simplicity in the design, mechanisms that can support a
In the bracket model, there is a generic process tem­
multitude o f policies, and efficiency in all details We
plate that can receive and send data and can execute
believe that the query execution engme should provide
exactly one operator at any point o f time A schematic
mechanisms, and that the query optimizer should incorporate
diagram o f such a template process is shown in Figure 1
and decide on policies
with two possible operators, join and aggregation The
Independently o f our work. Tandem Computers has code that makes up the generic template invokes the opera­
designed an operator called the parallel operator which is tor which then controls execution, network I/O on the
very similar to V olcano’s exchange operator It has proven receiving and sending sides are performed as service to the
useful m Tandem's query execuuon engine [14], but is not operator on request, implemented as procedures to be called
yet documented in the open literature We learned about by the operator. The number o f inputs that can be active
this operator through one o f the referees Furthermore, the at any point o f Ume is limited to two since there are only
distributed database system R* used a technique similar to unary and binary operators m most database systems The
ours to transfer data between nodes [31] However, this operator is surrounded by generic template code which
operation was used only to effect data transfer and did not shields it from its environment, for example the operator(s)
support data or intra-operator parallelism that produce its input and consume its output
2.1. The Bracket Model of Parallelization One problem with the bracket model is that each
When attempting to parallelize exisung single-process locus o f control needs to be created This is typically done
Volcano software, we considered two paradigms or models by a separate scheduler process, requiring software develop­
o f parallelization The first one, which we call the bracket ment beyond the actual operators, both initially and for each
model, has been used in a number o f systems, for example extension to the set o f query processing algorithms Thus,
GAM M A [12] and Bubba [2] The second one, which we the bracket model seems unsuitable for an extensible sys­
call the operator model, is novel and is described in detail tem
in Section 4 In a query processing system using the bracket
model, operators are coded m such a way that network I/O
1 This statement only pertains to the original version of is their only means o f obtaining input and delivering output
WiSS as described in [11] Both decisions were reconsidered for (with the exception o f scan and store operators) The rea­
the version o f WiSS used in GAMMA son is that each operator is its own locus of control and
network flow control must be used to coordinate muluplc

operators, e.g., to match two operators' speed in a producer-consumer relationship. Unfortunately, this also means that passing a data item from one operator to another always involves expensive inter-process communication (IPC) system calls, even in the cases when an entire query is evaluated on a single machine (and could therefore be evaluated without IPC in a single process) or when data do not need to be repartitioned among nodes in a network. An example for the latter is the three-way join query "joinCselAselB" in the Wisconsin Benchmark [6,9] which uses the same join attribute for both two-way joins. Thus, in queries with multiple operators (meaning almost all queries), IPC and its overhead are mandatory rather than optional.

In most (single-process) query processing engines, operators schedule each other much more efficiently by means of procedure calls rather than system calls. The concepts and methods needed for operators to schedule each other using procedure calls are the subject of the next section.

3. Volcano System Design

In this section, we provide an overview of the modules in Volcano. Volcano's file system is rather conventional. It includes modules to manage devices, buffer pools, files, records, and B+-trees. For a detailed discussion, we refer to [17].

The file system routines are used by the query processing routines to evaluate complex query plans. Queries are expressed as complex algebra expressions; the operators of this algebra are query processing algorithms. All algebra operators are implemented as iterators, i.e., they support a simple open-next-close protocol similar to conventional file scans.

Associated with each algorithm is a state record. The arguments for the algorithms are kept in the state record. All operations on records, e.g., comparisons and hashing, are performed by support functions which are given in the state records as arguments to the iterators. Thus, the query processing modules could be implemented without knowledge of or constraint on the internal structure of data objects.

In queries involving more than one operator (i.e., almost all queries), state records are linked together by means of input pointers. The input pointers are also kept in the state records. They are pointers to a QEP structure that consists of four pointers: the entry points of the three procedures implementing the operator (open, next, and close) and a state record. All state information for an iterator is kept in its state record; thus, an algorithm may be used multiple times in a query by including more than one state record in the query. An operator does not need to know what kind of operator produces its input, and whether its input comes from a complex query tree or from a simple file scan. We call this concept anonymous inputs or streams. Streams are a simple but powerful abstraction that allows combining any number of operators to evaluate a complex query. Together with the iterator control paradigm, streams represent the most efficient execution model in terms of time (overhead for synchronizing operators) and space (number of records that must reside in memory at any point of time) for single-process query evaluation.
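The structures just described can be pictured with a small C sketch. This is our rendering for illustration only: qep and NEXT_RECORD follow the names used in the paper, but the remaining identifiers are invented, and the real system keeps considerably more information in its state records.

    #include <stddef.h>

    typedef struct {                 /* return value of next: record id plus    */
        long  rid;                   /* the record's address in the buffer pool */
        void *addr;
    } NEXT_RECORD;

    typedef enum { NEXT_OK, END_OF_STREAM } next_status;

    /* A QEP node: four pointers -- the entry points of the three
       procedures implementing the operator, plus its state record. */
    typedef struct qep qep;
    struct qep {
        void        (*open) (qep *self);
        next_status (*next) (qep *self, NEXT_RECORD *out);
        void        (*close)(qep *self);
        void         *state;         /* operator-specific state record */
    };

    /* Example state record for a selection operator: an input
       pointer (an anonymous stream) plus a support function. */
    typedef struct {
        qep  *input;                               /* any operator at all */
        int (*predicate)(const NEXT_RECORD *rec);  /* support function    */
    } select_state;

    /* Driving a plan needs no knowledge of which operators it
       contains -- the point of anonymous inputs and streams. */
    void run_plan(qep *root)
    {
        NEXT_RECORD rec;
        root->open(root);
        while (root->next(root, &rec) == NEXT_OK) {
            /* consume rec, then unfix it in the buffer pool */
        }
        root->close(root);
    }

The open, next, and close calls cascade through the tree exactly as described next, and the exchange operator of Section 4 can be spliced in anywhere precisely because it exports this same interface.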
Calling open for the top-most operator results in instantiations for the associated state record, e.g., allocation of a hash table, and in open calls for all inputs. In this way, all iterators in a query are initiated recursively. In order to process the query, next for the top-most operator is called repeatedly until it fails with an end-of-stream indicator. Finally, the close call recursively "shuts down" all iterators in the query. This model of query execution matches very closely the one being included in the E programming language design [24] and the algebraic query evaluation system of the Starburst extensible relational database system [22].

The tree-structured query evaluation plan is used to execute queries by demand-driven dataflow. The return value of next is, besides a status value, a structure called NEXT_RECORD that consists of a record identifier and a record address in the buffer pool. This record is pinned (fixed) in the buffer. The protocol about fixing and unfixing records is as follows. Each record pinned in the buffer is owned by exactly one operator at any point in time. After receiving a record, the operator can hold on to it for a while, e.g., in a hash table, unfix it, e.g., when a predicate fails, or pass it on to the next operator. Complex operations like join that create new records have to fix them in the buffer before passing them on, and have to unfix their input records.

For intermediate results, Volcano uses virtual devices. Pages of such a device exist only in the buffer, and are discarded when unfixed. Using this mechanism allows assigning unique RID's to intermediate result records, and allows managing such records in all operators as if they resided on a real (disk) device. The operators are not affected by the use of virtual devices, and can be programmed as if all input comes from a disk-resident file and output is written to a disk file.

4. The Operator Model of Parallelization

When porting Volcano to a multi-processor machine, we felt it desirable to use the single-process query processing code described above without any change. The result is very clean, self-scheduling parallel processing. We call this novel approach the operator model of parallelizing a query evaluation engine. In this model, all issues of control are localized in one operator that uses and provides the standard iterator interface to the operators above and below in a query tree.

The module responsible for parallel execution and synchronization is called the exchange iterator in Volcano. Notice that it is an iterator with open, next, and close procedures; therefore, it can be inserted at any one place or at multiple places in a complex query tree. Figure 2 shows a complex query execution plan that includes data processing operators, e.g., file scan and join, and exchange operators. This section describes how the exchange iterator implements vertical and horizontal parallelism, followed by a detailed example and a discussion of alternative modes of operation of Volcano's exchange operator.

4.1. Vertical Parallelism

The first function of exchange is to provide vertical parallelism or pipelining between processes. The open procedure creates a new process after creating a data structure in shared memory called a port for synchronization and data exchange.

[Figure 2. Operator Model of Parallelization (diagram; only the operator labels PRINT, XCHG, and JOIN survive in this copy).]

The child process, created using the UNIX fork system call, is an exact duplicate of the parent process. The exchange operator then takes different paths in the parent and child processes.

The parent process serves as the consumer and the child process as the producer in Volcano. The exchange operator in the consumer process acts as a normal iterator; the only difference from other iterators is that it receives its input via inter-process communication rather than iterator (procedure) calls. After creating the child process, open_exchange in the consumer is done. Next_exchange waits for data to arrive via the port and returns them a record at a time. Close_exchange informs the producer that it can close, waits for an acknowledgement, and returns.

The exchange operator in the producer process becomes the driver for the query tree below the exchange operator, using open, next, and close on its input. The output of next is collected in packets, which are arrays of NEXT_RECORD structures. The packet size is an argument in the exchange iterator's state record, and can be set between 1 and 32,000 records. When a packet is filled, it
is inserted into a linked list originating in the port and a semaphore is used to inform the consumer about the new packet. Records in packets are fixed in the shared buffer and must be unfixed by a consuming operator.

When its input is exhausted, the exchange operator in the producer process marks the last packet with an end-of-stream tag, passes it to the consumer, and waits until the consumer allows closing all open files. This delay is necessary in Volcano because files on virtual devices must not be closed before all their records are unpinned in the buffer. In other words, it is a peculiarity due to other design decisions in Volcano rather than inherent in the exchange iterator or the operator model of parallelization.

The alert reader has noticed that the exchange module uses a different dataflow paradigm than all other operators. While all other modules are based on demand-driven dataflow (iterators, lazy evaluation), the producer-consumer relationship of exchange uses data-driven dataflow (eager evaluation). There are two reasons for this change in paradigms. First, we intend to use the exchange operator also for horizontal parallelism, to be described below, which is easier to implement with data-driven dataflow. Second, this scheme removes the need for request messages. Even though a scheme with request messages, e.g., using a semaphore, would probably perform acceptably on a shared-memory machine, we felt that it creates unnecessary control overhead and delays. Since we believe that very high degrees of parallelism and very high-performance query evaluation require a closely tied network, e.g., a hypercube, of shared-memory machines, we decided to use a paradigm for data exchange that has been proven to perform well in a shared-nothing database machine [12,13].

A run-time switch of exchange enables flow control or back pressure using an additional semaphore. If the producer is significantly faster than the consumer, the producer may pin a significant portion of the buffer, thus impeding overall system performance. If flow control is enabled, after a producer has inserted a new packet into the port, it must request the flow control semaphore. After a consumer has removed a packet from the port, it releases the flow control semaphore. The initial value of the flow control semaphore, e.g., 4, determines how many packets the producers may get ahead of the consumers.

Notice that flow control and demand-driven dataflow are not the same. One significant difference is that flow control allows some "slack" in the synchronization of producer and consumer and therefore truly overlapped execution, while demand-driven dataflow is a rather rigid structure of request and delivery in which the consumer waits while the producer works on its next output. The second significant difference is that data-driven dataflow is easier to combine efficiently with horizontal parallelism and partitioning.
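The port and packet mechanics described above can be summarized in a short C sketch. This is our reconstruction for illustration, not Volcano code: the structure and function names are invented, POSIX unnamed semaphores stand in for whatever DYNIX primitives Volcano actually used, and details such as process creation, shared-memory allocation, the end-of-stream handshake, and error handling are omitted.

    #include <semaphore.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define PACKET_CAPACITY 128          /* 1 .. 32,000 records per packet */

    typedef struct { long rid; void *addr; } NEXT_RECORD;

    typedef struct packet packet;
    struct packet {
        NEXT_RECORD records[PACKET_CAPACITY];
        int         count;
        bool        end_of_stream;       /* tag on the producer's last packet   */
        packet     *next;                /* linked list originating in the port */
    };

    typedef struct {                     /* allocated in shared memory */
        packet *head, *tail;
        sem_t   list_lock;               /* short-term lock on the linked list   */
        sem_t   packets_available;       /* informs the consumer of new packets  */
        sem_t   flow_control;            /* initial value = allowed slack, e.g. 4 */
    } port;

    /* Producer side: called by the driver when a packet is full. */
    void port_put(port *p, packet *pkt, bool flow_control_on)
    {
        sem_wait(&p->list_lock);
        pkt->next = NULL;
        if (p->tail) p->tail->next = pkt; else p->head = pkt;
        p->tail = pkt;
        sem_post(&p->list_lock);
        sem_post(&p->packets_available);    /* wake the consumer              */
        if (flow_control_on)
            sem_wait(&p->flow_control);     /* block once too far ahead       */
    }

    /* Consumer side: next_exchange takes one packet and then hands
       its records to the operator above, one at a time. */
    packet *port_get(port *p, bool flow_control_on)
    {
        sem_wait(&p->packets_available);
        sem_wait(&p->list_lock);
        packet *pkt = p->head;
        p->head = pkt->next;
        if (p->head == NULL) p->tail = NULL;
        sem_post(&p->list_lock);
        if (flow_control_on)
            sem_post(&p->flow_control);     /* let the producer run ahead again */
        return pkt;
    }

With the flow-control semaphore initialized to four, for instance, a producer blocks once it has inserted a handful of packets that the consumer has not yet drained, which is the back-pressure behavior described above; disabling flow control simply skips the final wait/post pair.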
4.2. Horizontal Parallelism

There are two forms of horizontal parallelism, which we call bushy parallelism and intra-operator parallelism. In bushy parallelism, different CPU's execute different subtrees of a complex query tree. Bushy parallelism and vertical parallelism are forms of inter-operator parallelism. Intra-operator parallelism means that several CPU's perform the same operator on different subsets of a stored dataset or an

intermediate result2 only when we move to an environment wnh muluple
shared-memory machines* Others have also observed the
Bushy parallelism can easily be implemented by
inserting one or two exchange operators into a query tree high cost o f process creation and have provided alternatives,
For example, in order to sort two inputs into a merge-join in parucular "light-weight" processes in various forms, c g ,
in parallel the First or both inputs are separated from the in Mach [1]

merge-join by an exchange operation3 The parent process After all producer processes are forked, they run
turns to the second sort immediately after forking the child without further synchromzauon among themselves, with two
process that will produce the first input m sorted order excepuons First, when accessing a shared data structure,
Thus, the two sort operations are workmg in parallel e g , the port to the consumers or a buffer table, short-term
locks must be acquired for the durauon of one linked-list
Intra-operator parallelism requires data partiuonmg
insertion Second, when a producer group is also a consu­
Partiuoning o f stored datasets is achieved by using muluple
mer group, i e , there are at least two exchange operators
files, preferably on different devices Partitioning o f inter­
and three process groups involved in a vertical pipeline, the
mediate results is implemented by including multiple queues
processes that are both consumers and producers synchronize
in a port If there are multiple consumer processes, each
twice During the (very short) interval between synchroni-
uses its own input queue The producers use a support
zauons, the master o f this group creates a port which serves
function to decide into which o f the queues (or actually,
all processes m its group
into which o f the packets being Filled by the producer) an
output record must go Using a support function allows When a close request is propagated down the tree
implementing round-robin-, key-range-, or hash-partitioning and reaches the first exchange operator, the master
consumer’ s close_exchange procedure informs all producer
If an operator or an operator subtree is executed in
processes that they are allowed to close down using the
parallel by a group o f processes, one o f them is designated
semaphore mentioned above m the discussion on verucal
the master When a query tree is opened, only one process
parallelism If the producer processes are also consumers,
is running, which is naturally the master When a master
the master o f the process group informs its producers, etc.
forks a child process m a producer-consumer relationship,
In this way, all operators are shut down in an orderly
the child process becomes the master within its group The
fashion, and the entire query evaluation is self-scheduling
First action o f the master producer is to determine how
many slaves are needed by calling an appropriate support 4.3. An Example
function If the producer operation is to run in parallel, the Let us consider an example Assume a query with
master producer forks the other producer processes four operators. A, B, C, and D such that A calls B ’ s, B
Gerber pointed out that such a centralized scheme is calls C ’s, and C calls D ’ s open, close, and next pro­
suboptimal for high degrees o f parallelism [15] When we cedures Now assume that this query plan is to be run m
changed our initial implementation from forking all producer three process groups, called A, B C , and D This requires
processes by the master to using a propagation tree scheme, an exchange operator between operators A and B, say X,
we observed significant performance improvements In such and one between C and D , say Y B and C continue to
a scheme, the master forks one slave, then both fork a new pass records via a simple procedure call to the C ’ s next
slave each, then all four fork a new slave each, etc This procedure without crossmg process boundaries Assume
scheme has been used very effectively for broadcast com ­ further that A runs as a single process, Ao. while BC and
munication and synchronization in binary hyper cubes D run in parallel in processes BCo to B C i and Do to Dj,
for a total o f eight processes
Even after optimizing the forking scheme, its over­
head is not negligible W e have considered using pruned A calls X ’s open, close, and next procedures instead
processes, i e , processes that are always present and wait of B ’s (Figure 2a), without knowledge that a process boun­
for work packets Primed processes arc used in many com ­ dary will be crossed, a consequence of anonymous inputs in
mercial database systems Since portable distribution o f Volcano When X is opened, it creates a port with one
compiled code (for support functions) is not trivial, we input queue for Ao and forks B Co (Figure 2b), which in
delayed this change and plan on using pruned processes turn forks BCi and B C i (Figure 2c) When the BC group
opens 7, B C a to BC i synchronize, and wait until the Y
operator in process B C o has initialized a port with three
2 A fourth form o f parallelism is inter-query parallelism, input queues B C o creates the port and stores its location
i e , the ability o f a database management system to work on
at an address known only to the BC processes Then BC o
several queries concurrently In the current version. Volcano
to B C i synchronize again, and BC i and B C i gel the port
docs not support inter-query parallelism A fifth and sixth form
mformauon from its location Next, BCo forks Do (Figure
o f parallelism that can be used for database operations involve
2d) which m turn forks D j to D j (Figure 2c)
hardware vector processing [30] and pipelining in the instrucuon
execution Since Volcano is a software architecture and folio w- When the D operators have exhausted their inputs m
ing the analysis in [8], we do not consider hardware parallelism D o to Dj, they return an end-of-stream indicator to the
further driver parts o f f In each D process, Y flags its last
packets to each of the BC processes (i e . a total o f 3x4=12
3 In general, sorted streams can be piped directly into the flagged packets) with an end-of-stream tag and then waits
jom, both in the single-process and the multi-process case on a semaphore for permission to close The copies o f the
Volcano's sort operator includes a parameter "final merge fan-in”
that allows sharing the merge space by two sort operators per­
forming the final merge in an interleaved fashion as requested by * In fact, this work is currently under way
the merge join operator


Figure 3a-c. Creating the BC processes.

Figure 3f-h. Closing all processes down.

y operator in the B C p ro ce s s e s cou n t the num ber o f tagged C to f ’s next p ro ced u re w ill return an end-of-stream indi­
packets, after four tags (the num ber o f p rodu cers or D cator In effect, the end-or-stream indicator has been p ro ­
processes), they have exhausted their inputs, and a call by pagated from the D operators to the C operators In due

turn, C , B, and then the driver part o f X will receive an record A third argument to next_exchange is used to com ­
end-of-stream indicator After receiving three tagged pack­ municate the required producer from the merge to the
ets, X 's next procedure in Ao will indicate end-of-stream to exchange iterator Further modifications included increasing
A the number of input buffers used by exchange, ihc number
o f semaphores (including for flow control) used between
When end-of-stream reaches the root operator o f the
producer and consumer part o f exchange, and the logic for
query, A , the query tree is closed Closing the exchange
end-of-stream All these modificauons were implemented in
operator X includes releasing the semaphore that allows the
such a way that they support multi-level merge trees, e g , a
BC processes to shut down (Figure 3f) The X driver in
parallel binary merge tree as used in [7] The merging
each BC process closes its input, operator B B closes C,
paths are selected automatically such that the load is distri­
and C closes Y Closing 7 in BC t and flC 2 is an empty
buted as evenly as possible m each level
operation When the process BC o closes the exchange
operator Y, Y permits the D processes to shut down by Second, we implemented a sort algorithm that sorts
releasing a semaphore After the processes o f the D group data randomly partiUoned over multiple disks into a range-
have closed all files and deallocated all temporary data partitioned file with sorted partitions, i e , a sorted file dis­
structures, e g , hash tables, they indicate the fact to f in tributed over multiple disks When using the same number
B C o using another semaphore, and Y’ s close procedure o f processors and disks, we used two processes per CPU,
returns to its caller, C ’ s close procedure, while the D one to perform the file scan and parti uon the records and
processes terminate (Figure 3g) When all B C processes another one to sort them W e realized that creating and
have closed down, X 's close procedure indicates the fact to running more processes than processors inflicted a signifi­
Ao and query evaluation terminates (Figure 3h) cant cost, since these processes competed for the CPU’ s and
therefore required operaung system scheduling While the
4.4. Variants of the Exchange Operator
scheduling overhead may not be too significant, in our
There are a number o f situations for which the environment with a central run queue allowing processes to
exchange operator described so far required some modifica- migrate freely and a large cache associated with each CPU,
nons or extensions In this section, we outline additional the frequent cache migration adds a significant cost.
capabilities implemented m V olcano’ s exchange operator
In order to make better use of the available process­
For som e operations, it is desirable to replicate or ing power, we decided to reduce the number of processes
broadcast a stream to all consumers For example, one o f by half, effectively moving to one process per disk This
the two partitioning methods for hash-division [19] requires required modificauons to the exchange operator Unul then,
that the divisor be replicated and used with each parution the exchange operator could "live' only at the top or the
of the dividend Another example is Barn's parallel join bottom o f the operator tree in a process Since the modifi­
algorithm in which one of the two mput relations is not cation, the exchange operator can also be in the middle of
moved at all while the other relation is sent through all a process’ operator tree When the exchange operator is
processors [4] T o support these algorithms, the exchange opened, it does not fork any processes but establishes a
operator can be directed (by setting a switch in the state communication port for data exchange The next operation
record) to send all records to all consumers, after pinning requests records from its input tree, possibly sending them
them appropriately multiple times in the buffer pool off to other processes in the group, until a record for its
Notice that it is not necessary to copy the records since own partition is found
they reside in a shared buffer pool, it is sufficient to pm
them such that each consumer can unpin them as if it were This mode of operauon3 also makes flow control
the only process using them After we implemented this obsolete A process runs a producer (and produces input
feature, parallelizing our hash-division programs usmg both for the other processes) only if it does not have mput for
divisor partitioning and quotient parutioning [19] took only the consumer Therefore, if the producers are in danger of
about three hours and yielded not insignificant speedups overrunning the consumers, none o f the producer operators
gets scheduled, and the consumers consume the available
When we implemented and benchmarked parallel sort­ records
ing [21], we found it useful to add two more features to
exchange First, we wanted to implement a merge network In summary, the operator model of parallel query
in which some processors produce sorted streams merge evaluauon provides for self-scheduling parallel query evalua­
concurrently by other processors V olcano’s sort iterator tion in an extensible database system The most important
can be used to generate a sorted stream A merge iterator properties o f this novel approach are that the new module
was easily derived from the sort module It uses a single implements three forms o f parallel processing within a sin­
level merge, instead o f the cascaded merge o f runs used in gle module, that it makes parallel query processing entirely
sort. The input o f a merge iterator is an exchange Dif­ self-scheduling, and that it did not require any changes in
ferently from other operators, the merge iterator requires to the existing query processing modules, thus leveraging signi­
distinguish the input records by their producer As an ficantly the ume and effort spent on them and allowing
example, for a join operation it does not matter where the easy parallel implemeniauon o f new algorithms
mput records were created, and all inputs can be accumu­
lated in a single input stream For a merge operauon. it is
crucial to distinguish the mput records by their producer m 3 Whether exchange forks new producer processes (the ori­
order to merge multiple sorted streams correctly ginal exchange design describe in Section 4 1) or uses the exist­
ing process group to execute the producer operations is a run­
W e modified the exchange module such that it can time switch
keep the input records separated according to their produc­
ers, switched by setting an argument field m the state

5. Overhead and Performance

From the beginning of the Volcano project, we were very concerned about performance and overhead. In this section, we report on experimental measurements of the overhead induced by the exchange operator. This is not meant to be an extensive or complete analysis of the operator's performance and overhead; the purpose of this section is to demonstrate that the overhead can be kept in acceptable limits.

We measured elapsed times of a program that creates records, fills them with four random integers, passes the records over three process boundaries, and then unfixes the records in the buffer. The measurements are elapsed times on a Sequent Symmetry with twelve Intel 16 MHz 80386 CPU's. This is a shared-memory machine with a 64 KB cache for each CPU. Each CPU delivers about 4 MIPS in this machine. The times were measured using the hardware microsecond clock available on such machines. Sequent's DYNIX operating system provides exactly the same interface as Berkeley 4.2 BSD or System V UNIX and runs (i.e., executes system calls) on all processors.

First, we measured the program without any exchange operator. Creating 100,000 records and releasing them in the buffer took 20.28 seconds. Next, we measured the program with the exchange operator switched to the mode in which it does not create new processes. In other words, compared to the last experiment, we added the overhead of three procedure calls for each record. For this run, we measured 28.00 seconds. Thus, the three exchange operators in this mode added (28.00 sec - 20.28 sec) / 3 / 100,000 = 25.73 µsec overhead per record and exchange operator.

When we switched the exchange operator to create new processes, thus creating a pipeline of four processes, we observed an elapsed time of 16.21 seconds with flow control enabled, or 16.16 seconds with flow control disabled. The fact that these times are less than the time for single-process program execution indicates that data transfer using the exchange operator is very fast, and that pipelined multi-process execution is warranted.

We were particularly concerned about the granularity of data exchange between processes and its impact on Volcano's performance. In a separate experiment, we reran the program multiple times varying the number of records per exchange packet. Table 1 shows the performance for transferring 100,000 records from a producer process group through two intermediate process groups to a single consumer process. Each of these three groups included three processes; thus, each of the producer processes created 33,333 records. All these experiments were conducted with flow control enabled with three "slack" packets per exchange. We used different partitioning (hash) functions for each exchange iterator to ensure that records were passing along all possible data paths, not only along three independent pipelines.

    Packet Size    Elapsed Time
    [Records]      [Seconds]
          1          176.4
          2           97.6
          5           45.27
         10           27.67
         20           20.15
         50           15.71
        100           13.76
        200           12.87
        250           12.73

    Table 1. Exchange Performance

As can be seen in Table 1, the performance penalty for very small packets was significant. The elapsed time was almost cut in half when the packet size was increased from 1 to 2 records, from 176 seconds to 98 seconds. As the packet size was increased further, the elapsed time shrank accordingly, to 15.71 seconds for 50 records per packet and 12.73 seconds for 250 records per packet.

It seemed reasonable to speculate that for small packets, most of the elapsed time was spent on data exchange. To verify this hypothesis, we calculated regression and correlation coefficients of the number of data packets (100,000 divided over the packet size) and the elapsed times. We found an intercept (base time) of 12.18 seconds, a slope of 0.001654 seconds per packet, and a correlation of more than 0.99. Considering that we exchanged data over three process boundaries and that on two of those boundaries there were three producers and three consumers, we estimate that the overhead was 1654 µsec / 1.667 = 992 µsec per packet and process boundary.
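As a quick check of this regression (our arithmetic, using only the numbers reported above), the model predicts an elapsed time of about 12.18 s + 0.001654 s x (100,000 / packet size). For one record per packet this gives 12.18 + 165.4 = 177.6 seconds against the measured 176.4 seconds, and for 250 records per packet it gives 12.18 + 0.66 = 12.84 seconds against the measured 12.73 seconds, consistent with the reported correlation of more than 0.99.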
Two conclusions can be drawn from these experiments. First, vertical parallelism can pay off even for very simple query plans if the overhead of data transfer is small. Second, since the packet size can be set to any value, the overhead of Volcano's exchange iterator is negligible.

6. Summary and Conclusions

We have described Volcano, a new query evaluation system, and how parallel query evaluation is encapsulated in a single module or operator. The system is operational on both single- and multi-processor systems, and has been used for a number of database query processing studies [19-21,23].

Volcano utilizes dataflow techniques within processes as well as between processes. Within a process, demand-driven dataflow is implemented by means of iterators. Between processes, data-driven dataflow is used to exchange data between producers and consumers efficiently. If necessary, Volcano's data-driven dataflow can be augmented with flow control or back pressure. Horizontal partitioning is used both on stored and intermediate datasets to allow intra-operator parallelism. The design of the exchange operator embodies the parallel execution mechanism for vertical, bushy, and intra-operator parallelism, and it performs the transitions from demand-driven to data-driven dataflow and back.

Using an operator to encapsulate parallelism as explored in the Volcano project has a number of advantages over the bracket model. First, it hides the fact that parallelism is used from all other operators. Thus, other operators can be implemented without consideration for parallelism. Second, since the exchange operator uses the same interface to its input and output, it can be placed anywhere in a tree and combined with any other operators. Hence, it can be used to parallelize new operators, and effectively
combines extensibility and parallelism Third, it does not Leonard Shapiro Jerry Borgvedt implemented a prototype
require a separate scheduler process since scheduling distnbuted-memory exchange operator — NSF supported
(including initialization, flow control, and final clean-up) is this work with contracts DU-8805200 and IRI-8912618
part o f the operator and therefore performed within the stan­ Sequent Computer Systems provided machine time for
dard open-next-close iterator paradigm This turns into an experiments on a large machine.
advantage in two situations When a new operator is
References
integrated into the system, the scheduler and the template
process would have to be modified, while the exchange 1 M Accetta, R Baron, W. Bolosky, D Golub, R.
operator does not require any modifications When the sys­ Rashid, A Tevaman and M Young, “Mach. A New
tem is ported to a new environment, only one module Kernel Foundation for UNIX Development”, Sumner
requires modifications, the exchange iterator, not two Conference Proceedings 1986,
modules, the template process and the scheduler Fourth, it 2 W Alexander and G. Copeland, “Process and
does not require that operators m a parallel query evaluation Dataflow Control in Distributed Data-Intensive
system use 1PC to exchange data. Thus, each process can Systems”, Proceedings o f the ACM S1CMOD
execute an arbitrary subtree o f a complex query evaluation Conference, Chicago, IL * June 1988, 90-98
plan. Fifth, a smgle process can have any number of
inputs, not just one or two Finally, the operator can be 3 M M . Astrahan, M W. Blasgen, D. D Chamberlin,
(and has been) implemented in such a way that it can mul­ K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F.
King, R. A Lone, P R. McJones, J W MehL G.
tiplex a smgle process between a producer and a consumer
R Putzolu, I L Traiger, B W Wade and V.
In some respects, it efficiently implements appheauon-
Watson, “System R* A Relational Approach to
specific co-routines or threads
Database Management”, ACM Transactions on
We plan on several extensions o f the exchange opera­ Database Systems 1, 2 (June 1976), 97-137.
tor First, we plan on extending our design and implemen­
4. C. K. Baru, O. Fneder, D. Kandlur and M Segal,
tation to support both shared and distributed memory
("shared-nothing architecture") and to allow combining these “Join on a Cube* Analysis, Simulation, and
Implementauon”, Proceedings o f the 5th International
concepts in a closely tied network o f shared-memory multi­
computers while maintaining the encapsulation properties. Workshop on Database Machines, 1987.
This might require using a pool o f "pruned" processes and 5 D S. Batory, “GENESIS* A Project to Develop an
interpreting support functions W e believe that in the long Extensible Database Management System”,
run, high-performance database machines, both for transac­ Proceedings o f the lnt'l Workshop on Object-Oriented
tion and query processing, will employ this architecture Database Systems, Pacific Grove, CA , September
Second, we plan on devising a error and exception manage­ 1986, 207-208.
ment scheme that makes excepuon notification and handling
6. D. Bitton, D. J. DeWitt and C. TurbyfiU,
transparent across process and machine boundaries. Third,
“Benchmarking Database Systems: A Systematic
we plan on using the exchange operator to parallelize query
Approach”, Proceeding o f the Conference on Very
processing in object-oriented database systems [16], In our
Large Data Bases, Florence, Italy, October-November
model, a complex object is represented in memory by a
1983, 8-19
pointer to the root component (pinned m the buffer) with
pointers to the sub-components (also pinned) and passed 7 D. Bitton, H. Boral, D. J DeWitt and W. K.
between operators by passing the root component [18] Wilkinson, “Parallel Algorithms for the Execuuon o f
While the current design already allows passing complex Relauonal Database Operauons”, ACM Transactions
objects in a shared-memory environment, more functionality on Database Systems 8, 3 (September 1983), 324-353
is needed in a distributed-memory system where objects 8. H. Boral and D. J DeWitt, "Database Machines. An
need to be packaged for network transfer Idea Whose Tim e Has Passed? A Critique o f the
Volcano is the first implemented query evaluation Future o f Database Machines”, Proceeding o f the
system that combines extensibility and parallelism. Encap­ International Workshop on Database Machines,
sulating all parallelism issues into one module was essential Munich, 1983
to making this combinauon possible The encapsulauon of 9. H. Boral and D J DeWitt, " A Methodology for
parallelism in Volcano allows for new query processing Database System Performance Evaluauon”,
algorithms to be coded for single-process execution but run Proceedings o f the ACM SIGMOD Conference,
in a highly parallel environment without modifications We Boston, MA, June 1984, 176-185
expect that this will speed parallel algorithm development
and evaluation significantly Since the operator model o f 10 M J Carey, D J DeWitt, J E Richardson and E
parallel query processing and Volcano’ s exchange operator J. Shekita, "O bject and File Management in the
encapsulates parallelism and both uses and provides an itera­ EXODUS Extensible Database System”, Proceedings
tor interface similar to many existing database systems, the o f the Conference on Very Large Data Bases, Kyoto,
concepts explored and outlined in this paper may very well Japan, August 1986, 91-100
be useful in parallelizing other database query processing 11 H T. Chou, D J DeWitt, R H Katz and A. C.
software Klug, “Design and Implementauon o f the Wisconsin
Storage System", Software - Practice and Experience
Acknowledgements
15, 10 (October 1985), 943-962
A number o f friends and colleagues were great
12. D J DeWitt, R H Gerber, G Graefe, M L.
sounding boards during the design and implementauon of
Heytens, K B Kumar and M Muraliknshna,
parallelism in Volcano, most notably Frank Symonds and

“GAMMA - A High Performance Dataflow Database Systems, Pacific Grove, C A , September 1986, 85-92.
Machine", Proceedings o f the Conference on Very
27 M Stonebraker, E. Wong. P. Kreps and G. D. Held,
Large Data Bases, Kyoto, Japan, August 1986, 228-
“The Design and Implementation of INGRES”, ACM
237
Transactions on Database Systems 1, 3 (September
13 D J DeWitt, S Ghandeharadizeh, D Schneider, A 1976), 189-222
Bncker, H I Hsiao and R Rasmussen, “The
28. M. Stonebraker and L A Rowe, "T he Design o f
Gamma Database Machine Project’
’, IEEE
POSTGRES", Proceedings o f the ACM SIGMOD
Transactions on Knowledge and Data Engineering 2,
Conference, Washington, DC., May 1986, 340-355.
1 (March 1990)
29 M Stonebraker, R. Katz, D. Patterson and J
14 S Englert, J Gray, R Kocher and P Shah, "A
Ousterhout, "T he Design of XPRS”, Proceedings o f
Benchmark o f NonStop SQL Release 2 Demonstrating
the Conference on Very Large Databases, Los
Near-Linear Speedup and Scaleup on Large
Angeles, CA, August 1988, 318-330.
Databases’’,Tandem Computer Systems Technical
Report 894 (May 1989) 30. S. Torn, K. Kojima, Y. Kanada, A. Sakata, S.
Yoshizumi and M. Takahashi, "Accelerating
15 R Gerber, “Dataflow Query Processing using
Nonnumencal Processing by an Extended Vector
Multiprocessor Hash-Partitioned Algorithms”, P hD
Processor”, Proceedings o f the IEEE Conference on
Thesis, Madison. October 1986 Data Engineering, Los Angeles, C A , February 1988,
16 G Graefe and D Maier, “Query Optimization in 194-201
Object-Oriented Database Systems A Prospectus", in
31. P. Williams, D. Darnels, L. Haas, G Lapis, B.
Advances m Object-Oriented Database Systems, vol
Lindsay, P. Ng, R. Obermarck, P Selinger, A.
334 , K, R Dittrich (editor), Spnnger-Verlag,
Walker, P W ilms and R. Yost, “R*. An Overview
September 1988, 358-363 o f the Architecture”, m Readings m Database
17 G. Graefe, “Volcano An Extensible and Parallel Systems, M. Stonebraker (editor), Morgan-Kanfman,
Dataflow Query Processing System”, Oregon San Mateo, C A , 1988.
Graduate Center, Computer Science Technical Report,
Beaverton, O R , June 1989
18 G Graefe, "S et Processing and Complex Object
Assembly m Volcano and the REVELATION
Project”, O regon Graduate Center, Computer Science
Technical Report, Beaverton, O R, June 1989
19 G. Graefe, “Relational Division. Four Algorithms
and Their Performance”, Proceedings o f the IEEE
Conference on Data Engineering, Los Angelos, CA,
February 1989, 94-101
20 G. Graefe and K Ward, “Dynamic Query Evaluation
Plans", Proceedings of the ACM S1GMOD
Conference, Portland, OR, May-June 1989, 358
21 G Graefe, “Parallel External Sorting in Volcano”,
submitted fo r publication, February 1990
22 L. M Haas, W F Cody, J C Freytag, G Lapis, B
G. Lmdsay, G. M Lohman, K Ono and H Pirahesh,
“An Extensible Processor for an Extended Relational
Query Language”, Computer Science Research
Report, San Jose, C A , April 1988
23 T Keller and G Graefe, "T he One-to-One Match
Operator of the Volcano Query Processing System",
O regon Graduate Center, Computer Science Technical
Report, Beaverton, O R , June 1989
24 J E Richardson and M J Carey, “Programming
Constructs for Database System Implementation in
E X O D U S", Proceedings o f the ACM SIGMOD
Conference, San Francisco, CA., May 1987, 208-219
25 K Salem and H Garcia-Molina, “Disk Stnpmg",
Proceedings o f the IEEE Conference on Data
Engineering, Los Angeles, C A , February 1986, 336
26 P Schwarz, W Chang, J C Freytag, G Lohman, J
McPherson, C Mohan and H Pirahesh, “Extensibility
in the Star burst Database System”, Proceedings o f
the Ini'I Workshop on Object-Oriented Database
