
Atomic Commit Protocols

A thesis submitted to the University of Manchester for the degree of


M.Phil. in the Faculty of Humanities.

2005

Sumit Agarwal

School of Informatics

CONTENTS
1. Introduction..........................................................................................................15

1.1 Aims and Objectives ....................................................................................16

1.2 Chapter Outline............................................................................................18

2. Background ..........................................................................................................19

2.1 Transaction Processing ................................................................................19

2.1.1 Flat Transactions.................................................................................. 20

2.1.2 Spheres of Control ............................................................................... 23

2.1.3 Nested Transactions ............................................................................. 28

2.1.4 SAGAS ................................................................................................ 30

2.1.5 ASSET ................................................................................................. 30

2.1.6 Distributed Transaction processing ..................................................... 30

2.2 Fault Tolerant Distributed Computing.........................................................32

2.2.1 Failure .................................................................................................. 32

2.2.2 Properties of a distributed system: Liveness and Safety...................... 33

2.2.3 Fault tolerance...................................................................................... 33

2.3 Consensus Problem......................................................................................36

2.4 Atomic Commit Protocols ...........................................................................36

2.4.1 Generic Commit Protocol .................................................................... 37

2.4.2 Two Phase Commit (2PC) Protocol..................................................... 39

2.4.3 Non-blocking atomic commit algorithms ............................................ 41

2.4.4 Backup Commit ................................................................................... 49

2.5 Summary ......................................................................................................50

3. Algorithm Overview ............................................................................................51

3.1 Two Phase Commit (2PC) ...........................................................................51

3.2 Paxos Commit..............................................................................................55

3.3 Messaging Analysis .....................................................................................59

3.3.1 Two Phase Commit.............................................................................. 59

3.3.2 Paxos Commit...................................................................................... 60

3.4 Analysis of Disk Writes ...............................................................................61

3.4.1 Two phase commit............................................................................... 61

3.4.2 Paxos Commit...................................................................................... 62

3.5 Recommended Optimisation........................................................................62

3.6 Alternate algorithm ......................................................................................63

3.7 Summary ......................................................................................................64

4. Design and Implementation of Simulation Framework.......................................65

4.1 Motivation for the Design............................................................................65

4.1.1 Overview of Simulations ..................................................................... 65

4.1.2 Assumptions......................................................................................... 66

4.2 Simulated Message Passing .........................................................................66

4.2.1 Simulation............................................................................................ 68

4.2.2 Messages.............................................................................................. 68

4.2.3 Counters ............................................................................................... 69

4.2.4 2PC Processes ...................................................................................... 69

4.2.5 Paxos Commit Processes ..................................................................... 69

4.2.6 Optimisation Implementation .............................................................. 70

4.3 Simulation Using Message Passing Library ................................................71

4.3.1 Counters ............................................................................................... 71

4.4 Design of Modified Paxos ...........................................................................72

4.4.1 Disk Write Analysis............................................................................. 73

4.5 Constants Used.............................................................................................73

4.5.1 Simulated Messaging........................................................................... 73

4.5.2 Message Passing .................................................................................. 74

4.6 Summary ......................................................................................................74

5. Results and Analysis ............................................................................................76

5.1 Glossary .......................................................................................................76

5.2 Overview......................................................................................................77

5.3 Failure Free Environment ............................................................................78

5.3.1 Simulated Message Passing ................................................................. 78

5.3.2 Message Passing .................................................................................. 88

5.4 Message Failures..........................................................................................91

5.4.1 Simulated Message Passing ................................................................. 92

5.4.2 Message Passing ................................................................................ 102

5.5 Process and Message Failures....................................................................104

5.5.1 Simulated Message Passing ............................................................... 104

5.5.2 Message Passing ................................................................................ 114

5.6 Disk Write Analysis...................................................................................115

5.6.1 Paxos.................................................................................................. 115

5.6.2 Modified Paxos .................................................................................. 116

5.7 Summary ....................................................................................................116

6. Conclusions and Further Work ..........................................................................117

6.1 Conclusions................................................................................................118

6.2 Further Work..............................................................................................119

6.2.1 Simulation Analyser .......................................................................... 119

6.2.2 Low Latency Non-Blocking Atomic Commit ................................... 120

6.2.3 Modified Paxos .................................................................................. 121

6.2.4 Disk Write Optimisation.................................................................... 121

7. References..........................................................................................................122

Appendix A Algorithm for Leader Election...........................................................125

Appendix B Result for 2PC with mean message delay of 1.4ms...........................126

Appendix C Average Transaction Commit Time for 2PC .....................................127

List of Figures
Figure 1 SoC for containing intermediate results ........................................................25

Figure 2 Example to illustrate Dynamic creation of SoC ............................................26

Figure 3 Tracing back and creation of Sphere F as recovery environment .................27

Figure 4 Extension of Sphere F to include all dependent spheres ...............................27

Figure 6 Commitment control in Nested Transaction..................................................29

Figure 7 Commitment control in Distributed Transaction...........................................31

Figure 9 Set diagram for Fault and Fault Tolerance ....................................................35

Figure 10 ......................................................................................................................38

Figure 11 ......................................................................................................................39

Figure 12 ......................................................................................................................43

Figure 13 ......................................................................................................................46

Figure 14 State Transition Diagram Legend................................................................51

Figure 15 State Transition Diagram for 2PC Transaction Manager ............................52

Figure 16 State transition diagram for 2PC Resource Manager ..................................53

Figure 17 State Transition Diagram for Client ............................................................54

Figure 18 State Transition Diagram for Registrar .......................................................56

Figure 19 State Transition Diagram for RM................................................................57

Figure 20 State Diagram for an Instance of the Paxos Algorithm on an Acceptor .....59

Figure 21 Class Diagram for the various processes of 2PC.........................................67

Figure 22 Interactions between the various classes .....................................................67

Figure 23 Internal working of the message class.........................................................69

Figure 24 Class diagram for the processes of Paxos Commit .....................................70

List of Tables
Table 1 Summary of Results for 2PC in fault free environment .................................78

Table 2 Transaction completion times for 2PC in fault free environment ..................79

Table 3 Results for Paxos Commit without optimisation in the failure free
environment .................................................................................................................79

Table 4 Commit and Abort times for Paxos Commit without optimisation in failure
free environment ..........................................................................................................80

Table 5 Results for Paxos Commit with message bundling in failure free environment
......................................................................................................................................81

Table 6 Transaction completion times for Paxos Commit with bundling in fault free
environment .................................................................................................................81

Table 7 Results for Paxos commit without optimisation in the fault free environment
where one fault is tolerated ..........................................................................................82

Table 8 Transaction completion times for Paxos Commit with one fault setting in fault
free environment ..........................................................................................................83

Table 9 Results for Paxos commit with message bundling with one fault to tolerate in
fault free environment..................................................................................................84

Table 10 Transaction completion times for Paxos commit with one fault to tolerate
with message bundling in fault free environment........................................................84

Table 11 Results for Paxos commit with disk write optimisation with one fault to
tolerate in fault free environment.................................................................................85

Table 12 Transaction completion times for Paxos commit with disk write optimization
with one fault to tolerate in fault free environment .....................................................85

Table 13 Results for Paxos with one fault to tolerate with disk write optimisation and
message bundling in fault free environment................................................................86

Table 14 Transaction completion times for Paxos with one fault to tolerate with disk
write optimisation and message bundling in fault free environment...........................86

Table 15 Results for Modified Paxos in fault free environment..................................87

Table 16 Transaction commit times for Modified Paxos in fault free environment....87

Table 17 Results for Modified Paxos where it tolerates one fault in fault free
environment .................................................................................................................88

Table 18 Transaction commit times for Modified Paxos with one fault tolerance in
fault free environment..................................................................................................88

Table 19 Results for 2PC in fault free environment ....................................................89

Table 20 Transaction completion times for 2PC in fault free environment ................89

Table 21 Results for Modified Paxos in fault free environment with no faults to
tolerate..........................................................................................................................90

Table 22 Transaction completion times for Modified Paxos in fault free environment
with no faults to tolerate ..............................................................................................90

Table 23 Results for Modified Paxos in fault free environment with one fault to
tolerate..........................................................................................................................91

Table 24 Transaction completion times for Modified Paxos in fault free environment
......................................................................................................................................91

Table 25 Results for 2PC where messages fail............................................................92

Table 26 Transaction completion times for 2PC with message failures......................93

Table 27 Results for Paxos Commit with message failures.........................................94

Table 28 Transaction completion time for Paxos Commit with message failures ......95

Table 29 Results for Paxos Commit with Durable Write optimisation where
environment has message failures ...............................................................................96

Table 30 Transaction completion times for Paxos Commit with disk write
optimisation where message failures occur .................................................................97

Table 31 Results for Paxos commit with optimisation and bundling in message failure
environment .................................................................................................................98

Table 32 Transaction completion times for Paxos commit with optimisation and
bundling in environment with message failures ..........................................................99

Table 33 Results for Modified Paxos in environment with message failures............100

Table 34 Transaction completion times for Modified Paxos in message failure environment
....................................................................................................................................101

Table 35 Results for 2PC where there are message failures......................................102

Table 36 Transaction completion times for 2PC where there are message failures ..102

Table 37 Results for Modified Paxos in message failure environment .....................103

Table 38 Transaction completion times for Modified Paxos where message failures
occur...........................................................................................................................103

Table 39 Results for 2PC with all failures .................................................................104

Table 40 Transaction completion times for 2PC with all failures .............................105

Table 41 Results for Paxos commit with all failures .................................................106

Table 42 Transaction completion time for Modified Paxos with all failures ............108

Table 43 Results for Paxos commit with disk optimisation ......................................108

Table 44 Transaction completion times for Paxos commit with disk optimisation...110

Table 45 Results for Paxos commit with optimisation and message bundling..........110

Table 46 Transaction completion times for Paxos commit with optimisation and
message bundling.......................................................................................................112

Table 47 Results for Modified Paxos.........................................................................112

Table 48 Transaction completion time for Modified Paxos ......................................113

Table 49 Results for 2PC with all failures .................................................................114

Table 50 Transaction completion times for 2PC with all failures .............................114

Table 51 Results for Modified Paxos with all failures ..............................................115

Table 52 Transaction completion times for Modified Paxos.....................................115

Table 53 Results for 2PC with 1.4ms mean message delay ......................................126

Table 54 Transaction commit times for 2PC with 1.4ms mean message delay ........126

Abstract
A transaction is a logical collection of tasks that must either all complete successfully
or all fail as a group. A system that provides tools to accomplish this is known
as a transaction processing system.

In a distributed environment where the tasks that make up a transaction are distributed
across physically distinct nodes, a protocol is required to decide if the transaction
should be completed successfully (Commit) or cancelled (Abort); this protocol is
called an Atomic Commit Protocol. The protocol that is used in most distributed
transaction processing systems is Two Phase Commit Protocol (2PC). In 2PC a single
coordinator is used to collect each participant’s status of the work; the coordinator
decides to Commit the transaction if all participants have successfully completed their
part of the transaction, otherwise the transaction is Aborted.

2PC is a blocking protocol, i.e. if a coordinator fails while the participants are waiting
for a decision, all the participants remain in a blocked state till the coordinator
recovers and terminates the transaction. As alternatives, Non-Blocking Atomic
Commit (NB-AC) protocols have been suggested; all NB-AC protocols require a
synchronous environment.

Paxos Commit is an Atomic Commit protocol that solves a weaker form of NB-AC;
the protocol is non-blocking only if the network is non-faulty for long enough. It uses
an instance of the Paxos Consensus Algorithm to achieve consensus on each
participant’s status. Paxos Consensus is a fault tolerant consensus algorithm that uses
2F+1 acceptors (acceptors are processes that receive values from proposing processes
and pass the first value to the Leader) and proceeds if F+1 of them are active.

The study investigates and develops an alternative algorithm, Modified Paxos, based
on the Paxos consensus algorithm, and a disk write optimisation to Paxos Commit
that utilises the available redundancy to improve performance. The algorithms have
been simulated in a framework designed to observe overall transaction commit times
in varying failure conditions.

The results show that 2PC is faster than Paxos Commit and Modified Paxos in a fault
free environment. This is true even in the case where the two Paxos consensus based
commit algorithms are run with fault tolerance switched off. The disk write
optimisation improves the performance of Paxos Commit for committed transactions
by 46%. Modified Paxos, when compared with unoptimised Paxos Commit, is
4.5% faster for committed transactions. Modified Paxos also has the advantage of not
using a single Transaction Manager process; the current leader of the acceptor
processes acts as the Transaction Manager.

DECLARATION

No portion of the work referred to in the thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.

COPYRIGHT STATEMENT
i. Copyright in text of this thesis rests with the author. Copies (by any process) either
in full, or of extracts, may be made only in accordance with instructions given by the
author and lodged in the John Rylands University Library of Manchester. Details may
be obtained from the Librarian. This page must form part of any such copies made.
Further copies (by any process) of copies made in accordance with such instructions
may not be made without the permission (in writing) of the author.

ii. The ownership of any intellectual property rights which may be described in this
thesis is vested in The University of Manchester, subject to any prior agreement to the
contrary, and may not be made available for use by third parties without the written
permission of the University, which will prescribe the terms and conditions of any
such agreement.

iii. Further information on the conditions under which disclosures and exploitation
may take place is available from the Head of the School of Informatics.

ACKNOWLEDGMENT
I would like to thank Dr. Jim Gray for his guidance and the initial framework which
helped in the implementation of the Paxos Commit.

I would like to thank Prof. John Keane for all the guidance and support he has
provided throughout my research.

Finally I am really grateful to have the support of my family and friends who have
been there to help me through everything.

1. Introduction
“In computer science, a transaction is a group of logical operations that must all
succeed or fail as a group. Systems dedicated to supporting such operations are known
as transaction processing systems. [1]”

Transaction processing is a part of everyday life. With an increasing amount of data
being processed every day, the data needs to be split over multiple databases. This
gives rise to transactions being processed in a distributed manner. There are various
transaction properties that need to be maintained for distributed transaction
processing.

The group of logical actions that form a transaction must succeed or fail as a group;
the actions across all databases that are part of a distributed transaction either all
succeed or all fail in updating the data, i.e. a consistent outcome is reached. An Atomic
Commit Protocol ensures that transactions terminate consistently across all participating
resources even in the presence of failure [2]. In certain failure scenarios the most
common Atomic Commit Protocol, Two Phase Commit (2PC) [3], is unable to decide
a consistent outcome without waiting for a failed site to recover; the protocol is said
to be blocking [4].

In large real-world distributed systems, failures are the rule rather than the exception
and so must be dealt with at all levels – problem specification, algorithm design, and
analysis [5].

Various protocols [4, 6, 7] have been suggested over the years that attempt to solve
the problem of blocking in Atomic Commit Protocols; these are termed Non-Blocking
Atomic Commit (NB-AC) Protocols. NB-AC protocols permit all correct
participants to come to a commit decision and terminate the transaction. There have
been various studies to understand NB-AC protocols [2, 8, 9].

The problem of choosing a single value out of multiple values proposed by a
collection of processes is the consensus problem [10]. Atomic commit, in the
literature, has been compared with the consensus problem [11] and there have been
various studies about the solvability of consensus in asynchronous environments
[12-14].

In an asynchronous system no assumptions can be made about process execution
speeds and/or message delivery delays [15]. Much research has been done on the
subject of consensus and NB-AC in asynchronous systems with failures, but NB-AC
protocols have not been empirically evaluated in an asynchronous environment.

1.1 Aims and Objectives

Chapter 2 briefly describes existing atomic commit protocols. 2PC, Paxos Commit
and Backup Commit are implementable in a simple asynchronous system. All the
other protocols discussed either require a synchronous system or advanced underlying
frameworks, such as virtual synchrony, to implement.

Paxos Commit is a NB-AC protocol with weakened requirements for Non-Blocking
and Non-Triviality; it also has a message delay that is only one message cycle more
than 2PC. These characteristics show that Paxos Commit has the potential to be
implemented in real world systems, as messaging in modern local area networks is
cheap. In comparison, 3PC has not been used in real world systems despite being
around for many years [16].

The aim is to:

1. Investigate asynchronous NB-AC protocols

2. Develop a simulation of a complete transaction system with the Paxos Commit
protocol

3. Compare the performance of the Paxos Commit protocol with 2PC by
measuring the overall transaction completion times for both the protocols in
varying failure conditions.

4. Develop an alternate algorithm, using the Paxos Consensus algorithm, which
improves on the performance of Paxos Commit.

Paxos is non-blocking only if the network is non-faulty for long enough; there will
still be blocking if the entire network is faulty throughout the transaction. Thus, in the
study, the overall number of aborted and committed transactions over a set period of
time will be compared for the two protocols. The average amount of time taken will
also be measured.

The performance analysis outlined by Gray et al. [6] for the Paxos Commit
algorithm focuses on the number of messages that are sent between the participants
and the coordinators and the number of forced writes that have to be done during the
commit decision making process. Although the number of messages sent and the
forced disk writes are very important indicators of the performance of a distributed
commit algorithm, other factors also contribute to the performance of an algorithm.

The algorithms are compared based on:

1. The performance of the algorithms in a failure free environment. The results in a
failure free environment provide a base value for the performance of the
algorithms. These results can be used to determine the performance overhead of
using a consensus based atomic commit protocol.

2. Resilience to failure – when failure occurs, a blocking system such as 2PC grinds
to a halt; the Paxos Commit Algorithm, on the other hand, is non-blocking if a
sufficiently large network of nodes is non-faulty for long enough. The non-
blocking behaviour has a weakened condition and thus it is essential to measure the
resilience of the Paxos Commit algorithm to various failures. Using the results
obtained for the algorithms running in a failure free environment, a comparison
can be made to determine the resilience of the algorithm to failures.

The faults that can be simulated are:

• process failures – any of the processes can fail at random,

• message non-delivery – the messaging system might be unreliable and
messages may not get delivered to the recipient,

• late delivery or out of order delivery – in addition to messages not being
delivered at all, messages may be delivered late; in scenarios where
processes are coordinating actions, timely arrival of messages can be
important (a sketch of how such failures might be injected is given below).
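
The following is a minimal, illustrative sketch (in Python) of how such failures could be injected into a simulated message-passing layer. It is not the framework used in this study; the class names, parameters and failure probabilities are invented purely for illustration.

import random

class UnreliableChannel:
    # Illustrative message channel that can lose, delay and reorder messages.
    def __init__(self, loss_rate=0.05, mean_delay_ms=1.0, jitter_ms=2.0):
        self.loss_rate = loss_rate          # probability of message non-delivery
        self.mean_delay_ms = mean_delay_ms  # base network latency
        self.jitter_ms = jitter_ms          # random extra delay; causes reordering

    def send(self, now_ms, msg):
        # Returns (delivery_time_ms, msg), or None if the message is lost.
        if random.random() < self.loss_rate:
            return None                     # message non-delivery
        delay = self.mean_delay_ms + random.uniform(0, self.jitter_ms)
        return (now_ms + delay, msg)        # late / out-of-order delivery

class SimulatedProcess:
    # Illustrative process that may crash at random and thereafter takes no actions.
    def __init__(self, crash_probability=0.01):
        self.crash_probability = crash_probability
        self.crashed = False

    def step(self):
        if not self.crashed and random.random() < self.crash_probability:
            self.crashed = True             # crash failure: halts permanently
        return not self.crashed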

In this study the NB-AC Algorithm with weakened requirements suggested by Gray
et al. [6] has been simulated and compared with the 2PC protocol in an asynchronous
environment with failures.

1.2 Chapter Outline

Chapter 2 reviews the literature to present an introduction to transaction processing,
failure and fault tolerance, consensus, and Atomic Commit.

Chapter 3 goes into details of the two Atomic Commit protocols, 2PC and Paxos
Commit, that have been simulated. This chapter also presents detailed algorithms for
these Atomic Commit protocols.

Paxos Commit utilises redundancy to achieve fault tolerance. The available
redundancy can also be used to gain a performance advantage by not performing disk
writes, which are important for failure recovery. This optimisation is analysed
in Chapter 3.

A different method of implementing the Paxos Consensus Algorithm to achieve a
Commit algorithm was designed. This is also discussed in Chapter 3.

A simulation framework was designed in order to obtain results that can be analysed
to compare the transaction commit/abort numbers and the transaction completion
times for the 2PC and Paxos Commit algorithms. Gray et al. [6] compared 2PC with
Paxos Commit theoretically by analysing the number of message delays and disk
write delays. The simulation here aims to test these empirically. This appears
to be the first reported simulation of Paxos Commit [17] and thus the first
empirical comparison with the 2PC protocol. Chapter 4 presents details on the design
motivation and the design used.

The results and analysis are presented in Chapter 5. The findings of the study are
presented as a conclusion in Chapter 6.

During the course of the study aspects of further work were identified to improve NB-
AC in asynchronous systems; these are examined in Section 6.2.

2. Background
A distributed commit protocol is essential to maintain consistency in Distributed
Transaction systems. In this study, simulations for two distributed commit protocols,
Paxos Commit and Two Phase Commit, have been implemented, and a new algorithm
based on Paxos Commit has been designed and implemented. The results from these
simulations have been compared on the basis of average transaction completion time
and commit to abort ratio for the protocol running for a certain duration of time in
varying fault conditions. This chapter introduces the concepts of transaction
processing, distributed transaction processing and commit protocols.

In a system that operates within agreed specification, no extra primitives are required
to maintain well known transaction properties. If failure occurs, extra primitives are
required to ensure that these transaction properties hold. This chapter also gives a
description of failures and fault tolerance and how these affect commit protocols.

Section 2.2 gives an outline of failure and fault tolerance. Section 2.3 gives an
introduction to the consensus problem. Section 2.4 defines atomic commitment and
then presents a generic commit protocol followed by a description of some known
commit protocols.

2.1 Transaction Processing

A transaction processing system provides tools to ease or automate application
programming, execution, and administration. Transaction processing applications
typically support a network of devices that submit queries and updates to the
application. Based on these inputs, the application maintains a database representing
some real-world state. Application responses and outputs typically drive real-world
actuators and transducers that alter or control the real-world states. The applications,
database, and network tend to evolve over several decades. Increasingly, the systems
are geographically distributed, heterogeneous, continuously available and have
stringent response time requirements [18].

A transaction is a state change to a database representing some real-world states. For
example, Customer A deposits £500 into his account; the account is part of the
database and the deposit is the state change to that database. The database is updated
to reflect the new balance of the account.

A transaction can be considered a collection of actions with the following properties
[18]:

Atomicity – a state transition is said to be atomic if it appears to jump from the initial
state to the result state without any intermediate steps or if it appears as though it
never left the initial state. A transaction is atomic when it behaves atomically to any
outside observer, i.e., the outside observer should not be able to view any of the
intermediate results generated by any intermediate steps.

Consistency – a transaction produces consistent results only; otherwise it aborts. A
result is consistent if the new state of the database fulfils all the rules or integrity
constraints of the application. If an integrity constraint states that all accounts must
have a positive balance, then any transaction violating this rule will be aborted.

Isolation – means a program running under transaction protection must behave
exactly as it would in single-user mode. This behaviour is only applicable in terms of
an outside observer. Thus, the results obtained at the end of a transaction are the same
irrespective of whether it was a single-user system or a multi-user system. However,
the data is still shared between transactions.

Durability – requires that results of transactions that have completed successfully
must not be forgotten by the system. Once the system acknowledges the execution of
a transaction, it must be able to re-establish its results after any type of subsequent
failure.

2.1.1 Flat Transactions

There are various types of transactions based on the control structure they provide for
controlling how the ACID properties, described above, are implemented. Flat
transactions represent the simplest type of transaction. Flat transactions are the basic
building block for organising an application into atomic actions. These kinds of
transactions are called flat because there is only one layer of control by the
application. Everything inside the BEGIN and COMMIT brackets is at the same level;
that is, the whole transaction will either be committed together, or it will be rolled
back as a whole. Flat transactions are also called ACID transactions (because of the
four properties Atomicity, Consistency, Isolation, and Durability as above). Flat
transactions are very simple and easy to implement, but as a result are not applicable
to more complex problems. One of the major restrictions of flat transactions is that
there is no way of either committing or aborting parts of such transactions, or
committing results in several stages [18].

An example where a flat transaction is applicable is a banking system where a debit
or credit needs to be applied to an account. The system is also required to update the
running balance of the teller and the branch, and finally file a history record with the
details of the transaction. These changes may be carried out in four steps:

• the account is updated with a debit or credit

• the teller’s balance is updated

• the branch balance is updated

• the history record is inserted

These four changes will be part of a single Flat or ACID transaction performed
within a single BEGIN and COMMIT block, as sketched below. If any execution performs only a part of
these tasks, the execution will be considered inconsistent and will not be allowed to
take effect.
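
As an illustration only, the four steps can be grouped into a single flat transaction, as in the following Python sketch using the standard sqlite3 module; the table and column names are invented here and are not taken from any system described in this thesis.

import sqlite3

def post_debit_credit(conn, account_id, teller_id, branch_id, amount):
    # The 'with conn:' block starts a transaction, commits it if the body
    # succeeds, and rolls it back if any statement raises an error, so the
    # four updates behave as one atomic (flat) transaction.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, account_id))
        conn.execute("UPDATE tellers SET balance = balance + ? WHERE id = ?",
                     (amount, teller_id))
        conn.execute("UPDATE branches SET balance = balance + ? WHERE id = ?",
                     (amount, branch_id))
        conn.execute("INSERT INTO history (account_id, amount) VALUES (?, ?)",
                     (account_id, amount))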

Flat Transactions isolate the application from faults resulting from an execution not
adhering to the ACID properties. In addition to the protection provided to the
application, Flat Transactions are simple to implement. However, the simple control
structure of Flat Transactions does not fit well with the requirements of complex
applications that may require a higher degree of control. Examples illustrating this are
provided in [18]:

• Trip Planning – A person wants to travel from place A to place F. Since there
is no mode of travel that could get the person directly from A to F, the travel
agent will need to break the journey down into several segments, which may
involve travel via air, train, rental car and may require overnight stays in hotels.

BEGIN WORK

S1. Book Flight from A to B

S2. Book Flight from B to C, same day

S3. Book Flight from C to D, same day

The only way of getting from D to F on the same day is by Rental Car,
but the person does not wish to drive at night.

There may be other options for the person to travel from D to F via another
place E. Alternatively the person may travel from place C using a different
route and mode of travel. In a normal flat transaction the entire booking would
need to be cancelled, which is unnecessary as there is no need to cancel the
first two steps. The application instead requires a control structure where
selective rollback can be carried out.

• Bulk Updates – At the end of a month a bank updates all its accounts with the
debit or credit interest accumulated for that month. A flat transaction will be
fine if the entire process completes without any faults. In a scenario where the
database crashes after 100,000 accounts have been updated, the flat transaction
will require that all these updates are rolled back. These updates would have
already required much processing time and it is unnecessary to roll them back.
Instead, a better option would be to continue from the last updated account
when the database recovers from the crash (see the sketch below).
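
A sketch of this restartable alternative is given below: the job commits in small batches and records the last account processed, so that after a crash it can resume rather than repeat the completed work. The schema, the single-row progress table and the batch size are hypothetical and only serve to illustrate the idea.

def apply_monthly_interest(conn, rate, batch_size=1000):
    # Resume from the last account recorded by a previous (possibly crashed) run.
    last_id = conn.execute("SELECT last_account FROM interest_progress").fetchone()[0]
    while True:
        rows = conn.execute(
            "SELECT id FROM accounts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        with conn:  # each batch is a small flat transaction of its own
            for (account_id,) in rows:
                conn.execute(
                    "UPDATE accounts SET balance = balance * (1 + ?) WHERE id = ?",
                    (rate, account_id))
            last_id = rows[-1][0]
            conn.execute("UPDATE interest_progress SET last_account = ?", (last_id,))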

The traditional Flat transaction approach is the best approach when dealing with hard-
coded systems in which the ACID properties are essential [18]. However, in many
modern applications, the ACID properties may be

• unnecessary – e.g., maintaining consistency of some street traffic management
data may be unnecessary, since the data will be overwritten with new data
anyway.

• impossible to satisfy – e.g., it is impossible to model a travel reservation
activity involving multiple airlines and hotels as a transaction, because the
different companies will not allow atomic commit protocols to be run across
their separate databases.

• undesirable – e.g., it is obviously undesirable to model the activities of two
collaborating humans as transactions, because then they would (by definition)
be isolated, i.e., uncollaborative.

For the above reasons, more expressive primitives are needed to model composite
activities, and the commitment of their components. Although the flat transaction
model [18] is simple and powerful, it is still found to be lacking in functionality and
performance when applied to applications that have reactive, open-ended or
collaborative activities [19], like the Trip Planning and Bulk Update examples
provided above. There are many such application scenarios that are not efficiently
handled by Flat Transactions. To address such problems found in the standard flat
transaction, extensions to the transaction model have been proposed [19-23]. New
types of applications normally result in a new, specialised transaction model. Each
transaction model is at least partially incompatible with the other models [18]. For
example, the commit and abort dependencies between sub-transactions are very different
in nested transactions and SAGAS. A brief description of some of the transaction
models is provided in the following sections.

2.1.2 Spheres of Control

Davies [24] developed the concept of spheres of control which triggered the
subsequent development of the transaction paradigm. This was the first attempt to
investigate the interrelated issues of control, recovery, and concurrency in integrated
systems in a coherent fashion. A framework is presented for better definition of the
audit and control aspects of data processing applications. It introduces the concept of
spheres of control, which are logical boundaries that exist in all data processing
systems. Spheres of Control define process bounding for recovery and process
commitment [18].

At the core of the notion of spheres of control is the observation that controlling
computations in a distributed multi-user environment primarily means:

1. Containing the effects of arbitrary operations as long as there might be a
necessity to revoke them, and

2. Monitoring the dependencies of operations on each other in order to be able
to trace the execution history in case faulty data is found at some point.

Gray et al. [18] describe the dynamic creation of spheres of control around actions
accessing data. In this model all the dependencies between data and computation are
recorded so that in case of a subsequent problem the execution history can be traced
back and all affected computation can be redone or invalidated.

Spheres of control may be defined statically by structuring the system into a hierarchy
of abstract data types, or created due to the dynamic interaction among spheres of
control on shared data. The dynamic creation of spheres of control brings to attention
the concept of commitment control. Spheres of control become dependent on each
other due to shared intermediate data. The spheres of control sharing this data are
then encapsulated within a dynamically created sphere of control. A complete log of
dependencies between data and processing is maintained, to be able to trace back all
affected processes in case of a failure.

The following definitions are quoted from Gray et al. [18].

• Process control. Process control ensures that the information required by an
atomic process is not modified by others, and it constrains the dependencies
that another process may place on the update made by this process.

• Process atomicity. Process atomicity is the amount of processing one wishes
to consider as having identity, i.e., an outside observer considers atomic. It is
the control over processing that permits an operator to be atomic at one level
of control, while the implementation of that operator may consist of many
parallel and/or serial atomic operators at the next lower level of control.

• Process commitment. While a function is in process, changes of state are
being made only by that function or they are expressly known to that function.
This allows the establishment of a single point to which one can return for
rerun purposes independently of the error detected or the number of functions
since that point. Preventing process commitment by holding (controlling) the
use of its results permits the system to perform a unilateral back out (process
undo) over much larger units of process. Unilateral here means without
having to request permission from each participant in a process. In summary,
process commitment control is the containment of the effects of a process,
even beyond the end of the process.

In the concept of spheres of control (SoC), the commitment of data produced by an
atomic process is not strictly bound to the completion of the process. It is possible to
dynamically create a new enclosing process that will control the commitment of the
result of the first process.

Figure 1 SoC for containing intermediate results

Figure 1 demonstrates an example of the case above. Each solid sphere represents a
predefined hierarchy of functions, each executing in its own sphere of control as an
atomic process. The highest level processes A1 and A2 are executed sequentially. If
for some reason A1 decides that its data D is not ready to be committed to the outside
world, A2 can start working on this data after a new, higher-level sphere of control, S,
has been created. S now contains the joint activities of A1, A2 and whatever might
follow after that. S can be terminated when all processes depending on A1’s result
have been terminated, and A1 has decided to commit its result. The dynamically
created spheres for commitment control allow processing data that are not ready for
access by the outside world and would otherwise require the application to stop or
undergo some checking procedure.

The concept of dynamic creation of spheres of control also allows the description of
process dependencies on data that are found to be incorrect long after they have been
committed.

[Figure: spheres of control A, B, C and E producing data items D1–D6 along a time axis]
Source: Gray, J. and A. Reuter, Transaction Processing Concepts and Techniques, 1993, Morgan Kaufmann Publishers
Figure 2 Example to illustrate Dynamic creation of SoC

In Figure 2, the dotted line denotes current time and time progresses from left to right.
There are three completed spheres of control A, C and E. While sphere B is executing,
a problem is found. There is no way to unilaterally back out of the work done so
far, because other processes, which may or may not depend on the faulty data, could have
already committed their results; in Figure 2, the process in sphere E has completed
and produced result D6. The problem is dealt with by the dynamic creation of Spheres
of Control in the following way:

• The processing is traced back to find which process created the incorrect data
(process A in this case) and to enclose the already-completed SoC into a
dynamically created new SoC (item F) to serve as the recovery environment
(Figure 3).

• Item A will be checked to determine the source of the incorrect data.

• The process now moves forward in time to recover all processes that have
become dependent on any data produced by the process that created the
incorrect data D1. This leads to an extension of SoC F (Figure 4).

Figure 3 Tracing back and creation of Sphere F as recovery environment

[Figure: sphere F enclosing spheres A, B, C and E with data items D1–D6 along a time axis]
Source: Gray, J. and A. Reuter, Transaction Processing Concepts and Techniques, 1993, Morgan Kaufmann Publishers
Figure 4 Extension of Sphere F to include all dependent spheres

The SoC F (Figure 4) indicates the necessity to go backwards and forwards through
various levels of commitment. The steps taken in the actual recovery of SoC F are
application dependent.

The notion of spheres of control is both general and powerful, but has never been
fully formalised. Due to its generality, it is likely to be difficult to build a system
exploiting it in full [18]. Flat transactions, on the other hand, are simple and easy to
implement but they only describe the bottom level of atomic actions in the spheres of
control model; flat transactions only implement Spheres of Control to contain
intermediate data from being accessed by an outside observer, as shown in Figure 1.
Spheres of control characterise activities more generally than database transactions.
Conceptually, SoCs are related to a number of extended transaction models that have
been proposed [25]. As stated earlier, at the core of SoCs is the notion of not making
data changes public as long as there might be a necessity to revoke them, and of
monitoring the dependencies to be able to trace the execution history if faulty data is
found. These concepts are utilised in extended transactions. For example, in nested
transactions the data changes made by a child transaction are not externalised but are
only available to its parent transaction and in SAGAS the dependencies are
maintained so as to be able to run compensating transactions to undo the changes
made by partial executions. Nested Transactions and SAGAS are discussed in the
following sections.

2.1.3 Nested Transactions

To maintain Isolation in transactions, concurrency controls are required. Moss [23]
presents a method for concurrency control in transactions using Nested Transactions.
Nested transactions provide nested universes of synchronisation and recovery from
failures. Advantages of nested transactions over flat transactions are that they provide
concurrency control within transactions by serialising sub-transactions appropriately,
and that they permit parts of transactions to fail without necessarily aborting the entire
transaction. The method suggested uses locking for concurrency control.

In nested transactions, a transaction is composed of an arbitrary number of sub-
transactions that may be run concurrently, organised in a tree form. Figure 5 shows an
example of how a nested transaction may be organised. The root transaction T
initiates multiple sub-transactions T1, T2 and T3; T1 subsequently initiates sub-
transactions T11 and T12.

A sub-transaction acquires a lock on the data that it wishes to access; on
completion this lock is not released but passed to the parent transaction, thus making
the results available to the parent. Failure of a child transaction does not
cause the whole transaction to fail, as the rest of the transaction may still be
committed. Results of the whole computation are externalised only on
commit of the parent transaction. If the parent transaction aborts, all other sub-
transactions are also aborted.

[Figure: tree of nested transactions with root T, sub-transactions T1, T2 and T3, T1's children T11 and T12, T12's child T121, and T3's child T31]
Figure 5 Nested Transaction Organisation

[Figure: three panels showing trigger/wait dependencies between transactions and the system. A) Root transaction T is running. B) Subtransaction T1 has been started. C) T1 has created subtransaction T11, and after that the root transaction starts another subtransaction, T2.]
Source: Gray, J. and A. Reuter, Transaction Processing Concepts and Techniques, 1993, Morgan Kaufmann Publishers
Figure 6 Commitment control in Nested Transaction

Figure 6 shows commitment control in a nested transaction. The abort of a higher-
level transaction will trigger the abort of all sub-transactions, e.g., the abort of
transaction T will trigger the abort of T1 and T2, and the abort of T1 will
subsequently trigger the abort of T11. The commit of a sub-transaction is not
externalised unless the parent is committed. Thus, it waits on the commit of the
parent.
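
As a rough, single-connection analogy (not Moss's scheme, which also covers concurrency control between sub-transactions), SQL savepoints reproduce the commitment behaviour described above: a child can be rolled back without aborting its parent, and nothing becomes visible outside until the outermost transaction commits. The sketch below uses Python's sqlite3 module with invented table names.

import sqlite3

def transfer_with_nested_step(conn):
    conn.isolation_level = None                      # manage transactions explicitly
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    try:
        conn.execute("SAVEPOINT child")              # start the 'sub-transaction'
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
        conn.execute("RELEASE SAVEPOINT child")      # child result passed up to the parent
    except sqlite3.Error:
        conn.execute("ROLLBACK TO SAVEPOINT child")  # child aborts; parent continues
    conn.execute("COMMIT")                           # results externalised only here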

2.1.4 SAGAS

Long-lived transactions hold on to database resources for relatively long periods of
time, delaying the termination of shorter and more common transactions. The notion
of a saga solves this problem. A long-lived transaction is a saga if it can be split into a
sequence of transactions that can be interleaved with other transactions. The database
management system guarantees that either all the transactions making up a saga are
completed successfully, or compensating transactions are run to amend the changes
made by the partial execution [21].
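
A minimal sketch of the saga idea follows: the long-lived activity is split into steps, each paired with a compensating action, and on failure the compensations for the completed steps are run in reverse order. The step functions are placeholders rather than part of any system described here.

def run_saga(steps):
    # steps: a list of (do, compensate) pairs of callables
    completed = []
    try:
        for do, compensate in steps:
            do()                               # run the next sub-transaction
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()                       # amend the partial execution
        raise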

2.1.5 ASSET

ASSET is a system for supporting extended transactions. ASSET consists of a set of
transaction primitives that allow users to define custom transaction semantics to
match the needs of specific applications. These transaction primitives can be used to
specify a variety of transaction models such as nested transactions, split transactions
and sagas [20].

2.1.6 Distributed Transaction processing

A distributed transaction is the execution of a program accessing shared data at
multiple sites [26]. A distributed transaction is typically a flat transaction that runs in
a distributed environment and therefore has to visit several nodes in the network,
depending on the location of the data. The conceptual difference between a distributed
transaction and a nested transaction can be put as follows:

The structure of nested transactions is determined by the functional decomposition of
the application, that is, by what the application views as Spheres of Control. The
structure of a distributed transaction depends on the distribution of data in a network.
In other words, even for a flat transaction from the application’s point of view, a
distributed transaction may have to be executed if the data involved are scattered
across a number of nodes [18].

Sub-transactions are spawned to access data at nodes other than the local node. For
example, a transaction T running on node a requires data P and Q. The data P and Q are
not available locally at node a, but on nodes b and c respectively over the network.
Sub-transactions T1 and T2 of transaction T will be spawned on the respective nodes.

[Figure: two panels showing trigger/wait dependencies. A) Root transaction T is running. B) Subtransactions T1 and T2 have been started at the same level; abort of the subtransactions triggers an abort of the parent.]
Figure 7 Commitment control in Distributed Transaction

The decomposition into sub-transactions does not reflect a hierarchical structure in the
programs to be executed, but is induced by the placement of the data in the network.
Consequently, the sub-transactions are not really executing at a conceptually lower level
of control, as is the case in nested transactions. The sub-transactions are parts of the
same top-level transaction. If a sub-transaction commits its work, it signals the
commit of the entire transaction, which forces all other sub-transactions to commit. In
comparison, a commit of a sub-transaction in a nested transaction is only local to that
sub-transaction, the final success of which is dependent on its ancestors committing
[18]. The abort of a sub-transaction in a distributed transaction causes the abort of the
whole transaction; this can be seen in Figure 7.

Atomic Commit Problem

The atomic commit problem is concerned with deciding a consistent outcome for a
distributed transaction even in the presence of failure [14].
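
The outcome rule at the heart of this problem can be written in a few lines; the sketch below is only an abstraction of the decision function and says nothing about how votes are collected or how failures are handled, which is where the protocols described in Section 2.4 differ.

def decide(votes):
    # votes: mapping from participant to "yes" or "no"; a missing or unknown
    # vote must be treated as "no" to preserve safety.
    return "commit" if votes and all(v == "yes" for v in votes.values()) else "abort"

assert decide({"rm1": "yes", "rm2": "yes"}) == "commit"
assert decide({"rm1": "yes", "rm2": "no"}) == "abort"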

2.2 Fault Tolerant Distributed Computing

This study is about Atomic Commit Protocols. As stated earlier, an Atomic Commit
Protocol must provide a consistent outcome even in the presence of failure. It is very
important to understand what failure is and how it affects the decision-making
process. This section presents some definitions and concepts related to failure.

2.2.1 Failure

“Whenever the service of a system, as seen by the user of the system deviates from
the agreed specification of the system, the system has failed. A failure is thus an event
occurring at a particular point in real time.

Most computer system failures can be traced back to an incorrect internal state of the
computer, for instance, a wrong data element in the memory or a register. We call
such an incorrect internal state an error. An error is thus an unintended state that lasts
for a given interval of time. The cause of an error, and thus indirectly of a failure, is
called a fault.” [27]

A variety of failure models have been proposed in connection with distributed
systems. The models are based on assigning responsibility for faulty behaviour to the
system’s components – processors and communication channels. Common failure
models are listed below [15]:

• Failstop. A processor fails by halting. Once it halts, the processor remains in
that state. The fact that a processor has failed is detectable by other processors.

• Crash. A processor fails by halting. Once it halts, the processor remains in
that state. The fact that the processor has failed may not be detectable by other
processors.

• Crash+Link. A processor fails by halting. Once it halts, the processor remains
in that state. A link fails by losing some messages, but does not delay,
duplicate, or corrupt messages.

• Receive Omission. A processor fails by receiving only a subset of the
messages that have been sent to it or by halting and remaining halted.

• Send Omission. A processor fails by transmitting only a subset of the
messages that it actually attempts to send or by halting and remaining halted.

• General Omission. A processor fails by receiving only a subset of the
messages that have been sent to it, by transmitting only a subset of the
messages that it actually attempts to send, and/or by halting and remaining
halted.

• Byzantine Failure. A processor fails by exhibiting arbitrary behaviour.

Failstop failures are the least disruptive, because processors never perform erroneous
actions and failures are detectable. Other processors can safely perform actions on
behalf of a failed failstop processor.

Unless a system is synchronous, it is not possible to distinguish between a processor


that is executing very slowly and one that has halted due to a crash failure. A
processor that is halted due to a crash cannot take any further action, but a slowly
executing processor will continue processing. Another processor can safely perform
actions on behalf of a processor that has halted due to a crash, but not for a slow
processor, because a slow processor may perform subsequent actions that will be
inconsistent with the actions performed on its behalf by others. Thus, crash failures in
asynchronous systems are harder to deal with than failstop failures. In a synchronous
system crash and failstop failures are equivalent.

2.2.2 Properties of a distributed system: Liveness and Safety

In general, there are two kinds of correctness properties that an algorithm must
satisfy: safety and liveness. Intuitively, a safety property describes what is not allowed
to happen, and a liveness property describes what must happen. The safety property of
a system defines that during the execution a “bad thing” never happens and the
liveness property specifies that a “good thing” eventually happens [28].

2.2.3 Fault tolerance

Informally, fault tolerance is the ability of a system to behave in a well-defined


manner once faults occur [29]. A fault-tolerant system aims to detect and take
measures, so as to prevent an error, due to a fault.

A formal definition of fault tolerance has been given by Gärtner [29]:

“A distributed program A is said to tolerate faults from a fault class F for an invariant
P if there exists a predicate T for which the following three requirements hold:

• At any configuration where P holds, T also holds (i.e., P ⇒ T).

• Starting from any state where T holds, if any actions of A or F are executed, the resulting state will always be one in which T holds (i.e., T is closed in A and T is closed in F).

• Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.”

var x ∈ {0,1,2,3} init 1      {* local state *}
begin
   {* normal program actions *}
   x = 1 → x := 2
   x = 2 → x := 1
   {* fault that should be tolerated *}
   true → x := 0
   {* protection mechanism *}
   x = 0 → x := 1
end
Figure 8 A Fault tolerant program

Figure 8 shows a program for a simple process that keeps changing its only
variable x between the values 1 and 2. The process tolerates only a single type of fault,
specified by the fault action1 true → x := 0; the protection mechanism2 resets the
value of x to 1.

The definition can be explained using this program. The invariant of the process is
P ≡ x ∈ {1,2}, and the fault class to tolerate is F = {true → x := 0}. The
predicate T is called a fault span; it is the limit to the changes that will be made by the
faults from F. Thus, T ≡ x ∈ {0,1,2}.
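To make the definition concrete, the behaviour of the program in Figure 8 can be sketched in a few lines of Java; the random scheduling of the fault action and all class and variable names are assumptions introduced purely for this illustration.

import java.util.Random;

// A minimal sketch of the fault-tolerant program of Figure 8.
// The variable x normally alternates between 1 and 2; the fault action
// (true -> x := 0) may move it outside the invariant P = {1,2}, and the
// protection mechanism (x = 0 -> x := 1) returns it to P.
public class FaultTolerantToggle {
    static int x = 1;                        // local state, x in {0,1,2,3}
    static final Random rnd = new Random();

    static void step() {
        if (rnd.nextInt(10) == 0) {          // simulated fault action: true -> x := 0
            x = 0;
            return;
        }
        if (x == 1) x = 2;                   // normal program action
        else if (x == 2) x = 1;              // normal program action
        else if (x == 0) x = 1;              // protection mechanism restores the invariant
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            step();
            assert x >= 0 && x <= 2;         // the fault span T = {0,1,2} always holds
            System.out.println("x = " + x);
        }
    }
}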

The definition states that faults from F are handled by knowing the fault span T. As
long as such faults f ∈ F occur, the system may leave the set P (Figure 9 shows the
invariant and the predicate as sets), but will always remain within T. When fault

1 A fault action is the additional virtual programming used to transform a fault-free program A into a program A’ that may be subject to faults. Note that A’ will never be explicitly implemented; the transformation process is just a means to be able to reason about programs which are subject to faults.
2 A protection mechanism is a component or concept that enriches the system to provide protection against faults from the fault class.

actions are not executed for a sufficiently long period of time, but only normal
program actions t ∉ F, the system will eventually reach P again and resume “normal”
behaviour.

Figure 9 Set diagram for Fault and Fault Tolerance

A key method to providing fault-tolerance is to introduce redundancy [29].


Redundancy refers to extra resources that are not used during the normal operations of
the system. For example, in Figure 8, the code occurring in the protection mechanism
(x = 0 f x := 1) is extra code that is present to return the program to normal
execution after a fault has occurred. This code is not used in the normal execution of
the program.

In an asynchronous system, fault detection is required in order to utilise the available
redundancy. It is a contributing factor in achieving fault tolerance [30]. For example, consider two
processes p and q, both of which can crash, which form a distributed system. Process
q is the redundant process and starts making decisions when p crashes. In this setup of
the distributed system, process q cannot find out whether process p has crashed. If p sends
messages to q at regular intervals, the absence of a message, in an asynchronous
system, cannot indicate that the process has crashed. It could just be that process p is
running very slowly; thus there is a need for a fault detection method in this system.
Fault detection is being increasingly used in the design, analysis and implementation
of many fault-tolerant distributed algorithms [31].

2.3 Consensus Problem

Consensus is a paradigm of agreement problems. Consensus is required whenever


processes have to agree on the same action to take. In the consensus problem, every
correct process proposes a value vi and all the correct processes have to decide on the
same value v. The consensus problem can be defined by the following properties
[31]:

• C-Termination: Every correct process eventually decides on some value.

• C-Validity: If a process decides v, then v was proposed by some process.

• C-Agreement: No two correct processes decide differently.

The atomic commit problem is a specialised case of the consensus problem. The
various processes in the distributed system need to decide on whether to commit the
transaction or not.

2.4 Atomic Commit Protocols

In a distributed transaction system, a transaction is performed over multiple nodes or


participants. The transaction terminates in a globally consistent manner by utilising an
Atomic Commit Protocol. The protocol aims to decide amongst two possible decision
values, commit or abort, for each participant. Deciding commit causes all the
participants to make the transaction changes permanent, while deciding abort will
cause none of them to make the changes. Each participant’s decision is irreversible.
The commit decision is only made if all participants unanimously vote YES [2].

The requirements of the atomic commit protocol are:

1. All participants that decide reach the same decision.

2. If any participant decides commit, then all participants must have voted YES.

3. If all participants vote YES and no failures occur, then all participants decide
commit.

4. Each participant decides at most once, i.e., a decision is irreversible.

2.4.1 Generic Commit Protocol

Each of the participants in a distributed transaction is a process called a Resource


Manager (RM). The commit protocol is initiated when one of the resource managers
issues a request either to commit or to abort the transaction. For the transaction to be
committed, each participating RM must be willing to commit it; otherwise, the
transaction must be aborted. Prior to the commit request, any RM may decide to abort
its part of the transaction. The fundamental requirement is that all RMs must agree on
whether the transaction is committed or aborted; this ensures that the transaction is
Atomic and Consistent. Once an RM has decided to abort or commit, the decision
cannot be changed [6].

The RM may be viewed as a state machine and the commit protocol as a set of state
transitions.

The states of the RM are as follows:

• Working – the RM is currently processing its sub-transaction

• Prepared – the RM has finished processing its sub-transaction successfully and


is willing to commit

• Aborted – the RM has aborted the transaction either due to the RM not
completing its sub-transaction successfully or because one of the other RMs
has decided to abort.

• Committed – all the RMs unanimously wish to commit the transaction and
thus the protocol decides that the transaction is to be committed. The RM then
makes the changes for its sub-transaction permanent.

Figure 10 shows the state transition diagram for an RM.

Figure 10 State transition diagram for a Resource Manager: Working moves to Prepared on work completion or to Aborted on abort; Prepared moves to Committed on commit or to Aborted on abort. (Source: Gray, J. and Lamport, L., Consensus on Transaction Commit, Microsoft Research, 2004.)

All RMs start in the working state. The protocol ensures that either all the RMs reach
the Committed or the Aborted state.

Requirements of the Generic Commit Protocol [6]

• Stability – once an RM enters the committed or aborted state, it remains in


that state forever

• Consistency – one RM must not be in the committed state at the same time as
another one is in the aborted state.

These two properties imply that, once an RM enters the committed state, no other RM
can enter the aborted state, and vice versa [6].

Each RM also has a prepared state. We require that

• An RM can enter the committed state only after all RMs have been in the
prepared state.

These requirements imply that the transaction can commit, meaning that all RMs
reach the committed state, only by the following sequence of events:

1. All the RMs enter the prepared state, in any order.

2. All the RMs enter the committed state, in any order.

The protocol allows the following event that prevents the transaction from
committing:

• Any RM in the working state can enter the aborted state.

The stability and consistency conditions imply that this abort event cannot occur if
some RM has entered the committed state. In practice, an RM will abort when it
learns that a failure has occurred that prevents the transaction from committing.
However, the abstract representation of the problem permits an RM to enter the
aborted state in the absence of failure as well [6].
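These requirements can be illustrated with a small state-machine sketch in Java (a hypothetical illustration, not the specification from [6]): an RM may only enter the committed state once every RM has been observed in the prepared state, and the committed and aborted states are absorbing.

import java.util.List;

// Sketch of the generic commit protocol's RM state machine.
enum RmState { WORKING, PREPARED, COMMITTED, ABORTED }

class ResourceManager {
    RmState state = RmState.WORKING;

    void prepare() {
        if (state == RmState.WORKING) state = RmState.PREPARED;
    }

    void abort() {
        // A working or prepared RM may still be aborted; committed RMs never are.
        if (state == RmState.WORKING || state == RmState.PREPARED) state = RmState.ABORTED;
    }

    // Commit is only legal once all RMs have reached the prepared state.
    void commit(List<ResourceManager> all) {
        boolean allPrepared = all.stream()
                .allMatch(rm -> rm.state == RmState.PREPARED || rm.state == RmState.COMMITTED);
        if (allPrepared && state == RmState.PREPARED) state = RmState.COMMITTED;
        // Stability: COMMITTED and ABORTED are absorbing states, so no transition leaves them.
    }
}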

2.4.2 Two Phase Commit (2PC) Protocol

The 2PC [3] is an implementation of the Generic Commit Protocol. The protocol uses
a Transaction Manager (TM) process to coordinate the decision making process.

Figure 11 Finite state automaton (FSA) for a participating site of the 2PC protocol.

The RM has the same states as the transaction commit protocol (working, prepared,
aborted, committed). The TM has the init, preparing, aborted and committed states.

In the 2PC protocol, one of the RMs enters the prepared state and sends a prepared
message to the TM. The TM on receiving the prepared message sends a prepare
message to all the RMs. The RMs that are still in the working state can enter the
prepared state. Once the TM receives a prepared message from each of the working
RMs, the TM makes a commit decision. If any of the RMs in the working state decide

to abort, an abort message is sent to the TM; the TM may also decide to abort. The
TM then sends an abort message to all the RMs.

Once the TM makes a commit decision, a commit message is sent to all the RMs. On
receiving the commit message the RMs move from the Prepared state to the
Committed state. Once an RM enters the committed state it must not move to the
aborted state.

In the 2PC protocol the coordinator, i.e. the TM, goes through the following two-phase
process (a short sketch of this logic is given after the list):

1. Prepare: first, it instructs all resource managers to get ready to “go either
way” on the transaction. In practice this means that each participant in the
process, i.e. RM, must force all log records for local resources used by the
transaction out to its own physical log. If the forced write is
successful, the resource manager replies “Yes” to the coordinator;
otherwise it replies “No.”

2. Commit: When the coordinator has received replies from all participants, it
forces a record to its own physical log, recording its decision regarding the
transaction. If all replies were “Yes,” then the decision is “commit”; if any reply
was “No,” the decision is “abort.” Either way, the coordinator then informs
each participant of its decision, and each participant must then commit or roll
back the transaction locally, as instructed.
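A minimal sketch of this coordinator logic is shown below (illustrative Java; the Participant interface and the method names are assumptions made for the sketch, not part of the protocol definition).

import java.util.List;

// Sketch of the 2PC coordinator's decision logic.
class TwoPhaseCoordinator {
    enum Decision { COMMIT, ABORT }

    interface Participant {
        boolean prepare();              // force log records, reply "Yes" (true) or "No" (false)
        void inform(Decision decision); // commit or roll back locally
    }

    Decision run(List<Participant> participants) {
        // Phase 1: ask every participant to prepare ("go either way").
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allYes = false; break; }
        }

        // Phase 2: force the decision to the coordinator's own log, then inform everyone.
        Decision decision = allYes ? Decision.COMMIT : Decision.ABORT;
        forceLog(decision);
        for (Participant p : participants) {
            p.inform(decision);
        }
        return decision;
    }

    private void forceLog(Decision d) { /* write the decision to stable storage */ }
}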

Effects of Faults on the 2PC Protocol

For example, consider a system with three participants, one of which performs the role
of the coordinator, using 2PC to ensure commit atomicity. Consider a scenario where
one of the participants and the coordinator fails after the third site has voted but has
not yet received the commit decision.

There are several possible execution states; two of these are of interest:

1. Either one of the failed sites may have aborted the transaction.

2. All sites may have decided to commit the transaction.

In the second case, if the coordinator failed during sending the commit messages and
if the second site failed after receiving the commit message, then the transaction has

been committed at the second site. As the third site has no way of knowing the status
of the transaction, it cannot safely proceed. The third site has to wait for the
coordinator to recover to determine the commit decision. The execution of the
transaction is blocked at site three until the coordinator recovers.

Performance and consistency need to be correctly balanced for an optimal
transaction management system. Even though 2PC’s blocking behaviour is well
known, it remains the most widely used commit protocol due to its overall performance.

2.4.3 Non-blocking atomic commit algorithms

Protocols that allow operational sites to continue transaction processing even though
site failures have occurred are called non-blocking.

A protocol that never requires operational sites to block until a failed site has
recovered is called a non-blocking protocol [4].

The fundamental non-blocking theorem as defined by Skeen [4] is:

“A protocol is non-blocking if and only if it satisfies both of the following conditions


(for all participating sites):

1. there exists no local state such that its concurrency set3 contains both an abort
and a commit state

2. there exists no noncommittable state whose concurrency set contains a commit


state”

Analysing Figure 11, which displays the finite state automaton (FSA) for a
participating site of the 2PC, with the non-blocking theorem it can be seen that:

1. state w’s concurrency set contains both an abort and a commit state and,

2. the noncommittable state w has the commit state in its concurrency set.

Thus, the 2PC is an example of a blocking protocol: operational sites sometimes wait
on the recovery of a failed site. Locks are held while the transaction is blocked.

3 The concurrency set is the set of states that may be concurrently occupied by other sites when the local site i is in state si.

The Non-Blocking Atomic Commit (NB-AC) [11] problem has the following
conditions

• Uniform Agreement – forbids any two participants (correct or not) to decide


differently.

• Uniform Validity – if a participant decides commit, all participants have voted


yes.

• Termination – every correct participant eventually decides.

• Non-Triviality – commit must be decided if all participants vote yes, and there
is no failure.

The conditions for NB-AC are similar to the properties for the consensus problem
described in section 2.3. The Non-Triviality condition is additional to avoid a trivial
solution to NB-AC where the algorithm always decides to abort.

In the case of a site failing, a non-blocking protocol initiates a termination or recovery


protocol. The termination protocol aims at terminating the transaction at the
operational sites as quickly as possible. The recovery protocol aims at resuming the
transaction processing at the operational sites. Both of these methods must ensure
ACID properties.

Non-blocking algorithms generally have a higher latency [32] due to extra message
overheads.

Three Phase Commit Protocol

Skeen [4] presents the three phase commit (3PC) protocol as a non-blocking atomic
commit protocol. There is a buffer state introduced between the waiting or prepare
and commit state of the 2PC protocol. So the coordinator and the resource managers
move to a prepared state and once all of them acknowledge the prepared state, then
they move to the commit state. A backup coordinator can wake up and commit the
transaction in case of failure.

Figure 12 Finite state automaton (FSA) for a participating site of the 3PC protocol, with the buffer state p introduced between the wait state w and the commit state.

Analysing Figure 12, which displays the FSA for the 3PC, with the non-blocking
theorem, the following can be said:

1. there is no state that has both an abort and a commit in its concurrency set, as
the w state’s concurrency set now has the abort and buffer state in its
concurrency set.

2. the noncommittable state w does not have the commit state in its concurrency
set.

The 3PC protocol is non-blocking only in the case of a non-faulty messaging system
[2].

Non-Blocking Atomic Commit (NB-AC) in Asynchronous systems

Studies have compared atomic commit with consensus [11, 14]. NB-AC has been
considered a harder problem than consensus; the non-triviality condition of NB-AC
(commit must be decided if all participants vote “Yes” and there is no failure), which
avoids trivial solutions to the problem, requires precise knowledge about failures.
Guerraoui [11] proves that a weaker form of NB-AC is equivalent to consensus and
can be solved with weak failure detectors in an asynchronous system.

In the following sections NB-AC algorithms for asynchronous systems are discussed.

Low latency Non-blocking Atomic Commit Algorithm

Jimenez-Peris, et al. [7] present a low latency non-blocking atomic commit algorithm.
The algorithm is characterised by using multiple coordinators and using optimistic
commit4; additionally, the commit message is sent before the disk write is done.
This protocol aims to reduce the latency for non-blocking atomic commit. The
optimistic commit suggested may need roll-back: in the majority of cases the decision
would be correct but in the few cases where the decision was incorrect the
transactions will be rolled back.

Jimenez-Peris et al. optimise the non-blocking algorithm not only by not flushing


entries to the log but also by reducing the messaging overhead. The protocol reduces
the messaging overhead by using the virtual synchronous uniform multicast [33].

In distributed systems processes may communicate via broadcast5 or multicast6. A


message transmitted through a broadcast protocol is available to all hosts in the
network. The Broadcast protocol is defined in terms of two primitives: broadcast(m)
and deliver(m)7, where m is a message from a set of M possible messages. In the
presence of failure, the broadcast may not reach all the hosts on the network.
Hadzilacos et al. [34] present a collection of broadcast primitives which consider
failures. Uniform Reliable Broadcast [34] is a broadcast primitive with the following
properties

• Validity. If a correct process broadcasts a message m, then all correct


processes in the system eventually deliver m.

• Uniform Integrity. For any message m, each process in the system delivers m
at most once, and only if some process actually broadcasts m.

4 Optimistic commit refers to the process where the commit message is sent to the TM group before the votes become uniform across the Coordinator Group.
5 Broadcast protocols send messages to all hosts on the network.
6 Multicast protocols send messages to a selected subset of hosts on the network.
7 The deliver primitive in a host sends a message received by the host to all its neighbours.

• Uniform Timeliness. There exists a known constant ∆b such that if the
broadcast of m is initiated at real-time t, no process in the system delivers m
after real-time t + ∆b.

• Uniform Agreement. If any process (correct or not) in the system delivers a


message m, then all correct processes in the system eventually deliver m.

This broadcast primitive has mostly been considered in synchronous systems and has
been presented by Schiper et al. [33] in a virtually synchronous environment8.

The non-blocking behaviour is achieved by the coordinator using virtual synchronous


uniform multicast9 protocol. This guarantees that either all or none of the participants
know the outcome of the transaction. Uniformity is very expensive in terms of the
delay it introduces and the delay is dependent on the size of the network. The
algorithm uses the uniform multicast only over a small subset of participants, i.e., the
commit servers and it hides the latency introduced by the uniform multicast behind
tasks that need to be performed for the transaction commit.

The flushing of log records to disk is a source of increased latency in AC protocols.


This algorithm sends messages instead of flushing log records; the main memory of a
replicated group is used as stable storage.

8 A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instantaneously, in other words, synchronously.
9 A uniform reliable multicast of a message m has the property that if m has been received by any destination process (faulty or not), then m is received by all processes that reach a decision.

Figure 13 The Commit Server (CS) group (C1, C2, C3) and the TM group (TM1, TM2, TM3): each TM multicasts its Vote to the CS group, and the CS group sends an Opt-Commit message to the TM group.

Protocol Overview

The AC protocol starts when a client requests to commit a transaction. The commit
request arrives at a TM, TMi, which then starts the protocol. The protocol involves
several rounds of messages in two phases [7]:

First phase

1. Upon delivering the commit request, TMi multicasts a reliable prepare to commit
message to the TM group. This message contains the transaction identifier (tid) to
be committed and the number of participants involved (the number of TMs
contacted during the execution of the transaction).

2. Upon delivering the prepare to commit message, each participant uniform


multicasts its vote and the number of participants to the CS group. If a participant
has not yet written the corresponding entries to its local log when the prepare to
commit message arrives, it sends the log entry in addition to its vote without
waiting to write to the log. After the message has been sent, it then writes the log
entry to its local disk. (This is illustrated in Figure 13)

Second phase

1. Upon opt-delivering a vote message, the processes of the commit server decide
who will act as proxy coordinator for the protocol based on the tid of the

transaction and the current view. Assume this site is Ci. The rest of the processes
in the CS group act as backup in case Ci fails. If a no vote is opt-delivered, the
transaction is aborted immediately and an abort message is reliable multicast to
the TM group. If all votes are yes, as soon as the last vote is opt-delivered at Ci, Ci
sends a reliable multicast with an opt-commit message to the TM group. (This is
illustrated in Figure 13)

2. Upon delivering an abort message, a participant aborts the transaction. Upon


delivering an opt-commit message, the participant changes the transaction locks to
opt mode.

3. If all votes are affirmative when they have been uni-delivered at Ci, Ci reliably
multicasts a commit message to the TM group.

4. When a participant delivers a commit or abort message, it releases all locks (both
opt and non-opt) held by the transaction and returns the corresponding
transactions that were on hold to their normal state.

5. If all the votes are affirmative but the coordinator opt-commits the transaction and
is then excluded from the majority view (before being able to commit the
transaction), and one or more votes do not reach the majority view, the transaction
will be aborted by the new coordinator.

Figure 13 displays a representation of the system with the messages as would be sent
from the second step from Phase 1 and the first step from Phase 2 of the protocol.

Paxos Commit Protocol

The Paxos commit protocol [6] uses the Paxos consensus algorithm [10] on the
commit/abort decision of each participant to achieve fault tolerance in distributed
commit.

Gray et al. [6], while specifying the requirements of an atomic commit protocol,
give two liveness requirements:

• Non-Triviality If the entire network is nonfaulty throughout the execution


of the protocol, then (a) if all RMs reach the prepared state, then all RMs
reach the committed state, and (b) if some RM reaches the aborted state, then
all RMs reach the aborted state.

• Non-Blocking If, at any time, a sufficiently large network of nodes is
nonfaulty for long enough, then every RM executed on those nodes will reach
either the committed or aborted state.

As stated before, NB-AC is harder to solve in asynchronous systems and only a


weaker form of NB-AC is solvable in an asynchronous system. Gray et al. have
weakened the Non-Triviality requirement, so that participants are only required to reach a
decision if the entire network remains nonfaulty throughout the execution of the protocol.

The Paxos consensus algorithm [10] is an algorithm for implementing a fault-tolerant


distributed system. The algorithm uses multiple acceptors to reach consensus on
choosing a proposed value. A unique value is chosen when a majority of acceptors
receive the same proposed value. Paxos commit uses 2F+1 coordinators and makes
progress if at least F+1 of them are active.

The distributed commit problem can be made fault tolerant by implementing the
Paxos consensus for the TM’s commit/abort decision. The Paxos commit algorithm
implements an instance of the Paxos consensus algorithm for choosing the
prepared/abort decision for each of the RMs. The transaction is only committed if all
the RMs are prepared.

The safety property of the Paxos consensus algorithm does not impose any requirements
on processing speed or any upper bound on message delay. This absence of timing
assumptions classifies the system as asynchronous.

Paxos Consensus In Brief

Paxos consensus is an algorithm for choosing a unique proposed value. The algorithm
uses a collection of acceptor processes to decide on a proposed value. At the start
there is a unique leader process that proposes a value to the acceptors. The acceptors
reply to the leader and once the leader receives a message from a majority of the
acceptors, the value is chosen. The leader informs any other processes about the value
[10].

This algorithm is fault tolerant: a new leader is elected if the original leader fails.
The new leader is able to continue processing if enough information is available;
otherwise the transaction is aborted.

The Paxos algorithm proceeds in the following manner [6] (a sketch of the acceptor’s behaviour is given after the phase descriptions):

Phase 1a The leader chooses a ballot number that it thinks is higher than any other
ballot number that has been processed. A message is sent to all the acceptors
with this ballot number.

Phase 1b The acceptor sends a message to the leader with the value for the highest
ballot number that it has received and the highest ballot number for which it
has received a Phase 1a message.

Phase 2a Once the leader receives a Phase 1b message for the ballot number from a
majority of acceptors, it knows the status of the ballot number:

Free – A value has not been chosen for the ballot number

Forced – A value has already been chosen for the ballot number

If the ballot is free the leader can propose any value to be chosen by the
acceptors. In the forced situation, the value chosen earlier is proposed again to
be selected.

Phase 2b If the acceptor has not received a Phase 1a or Phase 2a message for a higher
ballot number; then it sends a Phase 2b message for the current ballot number.

Phase 3 Once the leader receives the Phase 2b message from a majority of acceptors,
the value proposed for this ballot number is chosen. The value is
communicated to all interested processes with a Phase 3 message.
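The acceptor’s side of these phases can be summarised in a short sketch (illustrative Java; the class, field and message names are assumptions made for this sketch and do not follow the notation of [10]).

// Sketch of a single Paxos acceptor's state for one consensus instance.
class PaxosAcceptor {
    static class Phase1b {
        final long promisedBallot;
        final long acceptedBallot;
        final String acceptedValue;
        Phase1b(long promised, long accepted, String value) {
            this.promisedBallot = promised;
            this.acceptedBallot = accepted;
            this.acceptedValue = value;
        }
    }

    private long highestBallotSeen = -1;  // highest ballot for which a Phase 1a message was received
    private long acceptedBallot = -1;     // ballot of the last accepted (Phase 2a) proposal, if any
    private String acceptedValue = null;  // value accepted at that ballot

    // Phase 1b: report the value (if any) accepted so far and the highest ballot seen.
    Phase1b onPhase1a(long ballot) {
        if (ballot > highestBallotSeen) {
            highestBallotSeen = ballot;
        }
        return new Phase1b(highestBallotSeen, acceptedBallot, acceptedValue);
    }

    // Phase 2b: accept the proposal unless a higher ballot has already been seen.
    boolean onPhase2a(long ballot, String value) {
        if (ballot >= highestBallotSeen) {
            highestBallotSeen = ballot;
            acceptedBallot = ballot;
            acceptedValue = value;
            return true;    // a Phase 2b message is sent for this ballot
        }
        return false;       // ignored: a higher ballot is in progress
    }
}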

2.4.4 Backup Commit

Reddy et al. [16] describe the Backup Commit (BC) protocol, which reduces the
blocking in 2PC by using multiple backup sites. In the algorithm there are multiple
backup sites connected to the coordinator. The coordinator, on deciding, broadcasts its
decision to the backup sites before informing the participants. When the coordinator
fails, the participants can check the decision for the transaction with one of the
backup sites.

Reddy et al. have compared BC with 3PC, which has high latency due to an extra
round of messaging between the participants and the coordinator. BC has lower
latency as the backup sites are placed closer to the coordinator to reduce network
latency. It is stated that the performance of commit processing depends on the
message processing latency of the coordinator. As BC does not introduce an extra

round of messaging between the coordinator and the participants, an overhead is not
incurred.

2.5 Summary

This chapter has given an overview of the atomic commit problem. The chapter
started by introducing Transaction Processing, ACID properties and various
Transaction Processing Models. Failure and Fault Tolerance were described before
describing the consensus problem and how it is made more difficult due to failures.
Atomic Commit in Distributed Databases is a form of the consensus problem; the
chapter looked at a number of algorithms from the literature.

In the following chapter, the algorithms for 2PC and Paxos Commit are explained in
detail; this discussion will form the basis for the design of the simulation framework.

3. Algorithm Overview
An introduction to transaction processing concepts was provided in the previous
chapter. The chapter also described how failure affects transaction processing,
especially distributed transactions. Since this study compares atomic commit
protocols a better understanding of these protocols is required. This chapter provides a
detailed analysis of the atomic commit protocols that are simulated in this study.

All processes that are part of a transaction processing system are modelled as state
machines. This chapter presents a state transition diagram for each of these processes.
Figure 14 shows how the state transition diagrams are drawn.

Figure 14 State Transition Diagram Legend: each transition arrow leads from an initial state to a destination state and is labelled with the message received and the message sent.

3.1 Two Phase Commit (2PC)

The 2PC protocol described in section 2.4.2 uses a single coordinator, the Transaction
Manager (TM). The various processes in the transaction system including the TM are
described below:

1. Transaction Manager (TM) The TM in the 2PC works as the coordinator


and is responsible for servicing all requests from the Client and the Resource
Managers (RM). The state transition diagram for the TM can be seen in Figure 15.

• It starts in the Free State and assigns a transaction id to every transaction


started and then moves to the Working State.

• It keeps track of all the RMs that are part of the transaction

• Once the Client Process completes assigning work to the RMs, it sends a
Do Commit message to the TM. This causes the TM to move from the
Working State to the Preparing State.

• Each RM informs the TM whether it is ready to commit

• Based on the individual RM states, it decides whether to commit or abort,


and the TM moves to either the Aborted or Committed State.

Figure 15 State Transition Diagram for 2PC Transaction Manager: the TM moves from Free to Working on Begin Transaction, from Working to Preparing on Do Commit (sending Prepare to the RMs), and from Preparing to Committed or Aborted according to the Prepared/Aborted responses.

2. Resource Manager (RM) The RM is the process that does the work in a
transaction processing system. The state transition diagram can be seen in
Figure 16.

• The RM starts in the Free State. The Client requests the RM to do work for
a transaction. The RM moves to the Joining State, by sending a message to
the TM.

• The RM knows that the registration with the TM for the transaction is
complete when it receives the Joined Transaction message. The RM moves
to the Working State.

• After completing the work the RM informs the TM of its status and moves
to the Prepared or Abort State.

• When the TM decides whether the transaction should be committed or
aborted, the RM can discard the changes or make them permanent
and finally move to the Aborted or Committed State.

Figure 16 State transition diagram for 2PC Resource Manager: the RM moves through the Free, Joining, Working and Prepared states, and finally to Committed or Aborted on the Commit or Abort message.

3. Client The client is the process that utilises the transaction service. The State
Transition Diagram can be seen in Figure 17. The client:

• initiates the transaction, by sending a Begin Transaction message to the
TM and moves from the Free State to the Joining State;

• the Client receives a Began Transaction message from the TM with the
transaction id. The Client moves from Joining State to Working State and
asks the different RMs to do processing by sending them Do Work
messages.

Figure 17 State Transition Diagram for Client: the Client moves from Free to Joining on Begin Transaction, to Working on Began Transaction (sending Do Work messages), to Prepared once Did Work messages have been received (sending Do Commit), and finally to Committed or Aborted.

• Each RM sends a message back to the Client when it has completed the
work. The Client moves to the Prepared State when it receives a Did Work
message from all the RMs; it sends a Do Commit message to the TM at
this point

• finally the TM informs the Client of the Final state of the transaction and
the Client moves to the Aborted or the Committed State.

3.2 Paxos Commit

Paxos commit is a non-blocking atomic commit protocol that has been briefly
introduced on Page 48. Paxos commit utilises the Paxos consensus algorithm on the
commit/abort decision of each of the RMs, to achieve fault tolerance and thus non-
blocking behaviour.

In Paxos commit the same set of 2F+1 acceptors and leader is used for each instance
of the Paxos Algorithm. The execution of the Paxos commit begins when an RM
decides to prepare and sends a BeginCommit message to the Leader. The Leader then
sends a Prepare message to all the RMs. The RMs then send a phase 2a message with
ballot number 0 to all the 2F+1 acceptors with either an Abort or Prepared message,
depending on its state. For each instance, each acceptor sends a phase 2b message to
the Leader. Once the Leader receives F+1 phase 2b messages for an instance, it knows
the outcome for that instance. The Leader can then send a Phase 3 message regarding
the decision for that instance to all interested parties, all the RMs in this case. The
Phase 3 messages for all the instances are combined into a single Commit/Abort
message, depending on the outcome for each of the instances of the Paxos algorithm.
All the instances must agree on the Prepared value to send a Commit message. If even
one of the instances agrees on the Abort value the Abort message is sent.
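The leader’s global decision rule can be sketched as follows (illustrative Java, assuming a per-instance tally of Phase 2b messages; the class and method names are hypothetical).

import java.util.HashMap;
import java.util.Map;

// Sketch of the Paxos Commit leader's global commit/abort decision.
class PaxosCommitLeader {
    final int f;                                       // number of tolerated faults (2F+1 acceptors)
    // For each RM instance, count the Phase 2b "Prepared" and "Aborted" messages received.
    final Map<String, Integer> preparedCount = new HashMap<>();
    final Map<String, Integer> abortedCount = new HashMap<>();

    PaxosCommitLeader(int f) { this.f = f; }

    void onPhase2b(String rmInstance, boolean prepared) {
        Map<String, Integer> counts = prepared ? preparedCount : abortedCount;
        counts.merge(rmInstance, 1, Integer::sum);
    }

    // The transaction commits only if every RM's instance has chosen Prepared
    // (i.e. F+1 matching Phase 2b messages); any instance choosing Aborted aborts it.
    Boolean decision(Iterable<String> rmInstances) {
        for (String rm : rmInstances) {
            if (abortedCount.getOrDefault(rm, 0) >= f + 1) return Boolean.FALSE;  // abort
            if (preparedCount.getOrDefault(rm, 0) < f + 1) return null;           // still undecided
        }
        return Boolean.TRUE;                                                      // commit
    }
}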

There are four types of participants in the Paxos commit algorithm; they are
described below:

1. Registrar The registrar is the process that contains the logic for transaction
management. The state transition diagram for the Registrar can be seen in
Figure 18. The registrar handles the following tasks in the Paxos Commit
Protocol:

• The creation of the transaction.

• The registration of the RMs of the transaction, i.e. which RMs are part of
the transaction.

An instance of the Paxos consensus algorithm is also run for the registrar
process, this is initiated when the Client requests the registrar to begin a

transaction. The Registrar moves from the “Free” state to “Working” State.
Once the client has completed the transaction it indicates to the registrar that
the transaction has been completed with a “Do Commit” message. The
Registrar sends a “Prepare” message to all the RMs and moves to the
“Preparing” state. An instance of the Paxos consensus algorithm is then
initiated for each of the RMs that are a part of the transaction.

When a transaction is created, an instance of the Paxos algorithm is initiated


for the registrar process. This sets a timeout for the main transaction, because
if the Paxos algorithm for this instance does not choose a value for the set of RMs
(S), then the leader will propose a special value B, which would cause the
transaction to abort.

Figure 18 State Transition Diagram for Registrar: the Registrar moves from Free to Working on Begin Transaction (initiating a Paxos instance), from Working to Preparing on Do Commit (sending Prepare), and from Preparing to Committed or Aborted on the corresponding Phase 2b messages.

Once a value S has been chosen by the Paxos algorithm, an instance of the
Paxos algorithm is started for each one of the RMs. This in effect initiates a

timeout on each of the RMs. If a value is not chosen for one of the RMs
instances the leader will propose an abort value for that instance.

The Registrar controls the registration of the RMs for a transaction, i.e. it does the
job of the TM. It then runs an instance of the Paxos algorithm to get its list
accepted.

The Registrar moves to the Aborted or Committed state on receiving a Phase


2b message related to the commit decision.

2. Resource Manager (RM) The resource manager is a participant in the


transaction. The resource manager is the process that performs work as part of
the transaction.

Figure 19 State Transition Diagram for RM: the RM moves from Free to Joining on Join Transaction, to Working on Joined Transaction (sending Did Work), to Prepared on Prepare (sending a Phase 2a message), and to Committed or Aborted on the Phase 2b messages.

It is the RM that needs to know the decision of the coordinator. The RM registers
with the Registrar for the transaction when it is asked to do work by the client.
Figure 19 shows the state transition diagram for the RM.

• The RM starts in the “Free” state. It moves to the “Joining” state, when
the Client requests the RM to Do Work, by sending a “Join Transaction”
message.

• When the Registrar responds to the “Join Transaction” message with a


“Joined Transaction” message, the RM can finish the work and move to
the “Working” state by sending a “Did Work” message.

• The RM moves to the “Prepared” state after it receives a “Prepare”


message and it sends a Phase 2a message to all the acceptors regarding its
Prepared State.

• The RM moves to the “Abort” or “Commit” state after it receives Phase


2b messages relating to the commit decision of the entire transaction.

3. Acceptor The acceptor is the process that contains the Paxos consensus logic.
Figure 20 shows the state transition diagram for an instance of the Paxos
algorithm that runs at an acceptor. An instance of the Paxos consensus
algorithm is run to obtain agreement on the decision each RM makes of
whether to prepare or abort (represented by the value Prepared or Aborted);
an instance is also run for the Registrar process’s list of RMs. Each
instance of the Paxos consensus algorithm works to achieve consensus on the
prepared decision of the specific RM. When the leader knows the prepared
decision for all the RMs, from at least F+1 acceptors, it can make the commit
decision.

For the Paxos consensus to work, one of the acceptors has to act as the leader.

Figure 20 State Diagram for an Instance of the Paxos Algorithm on an Acceptor: the instance moves from Free to Forced when a Phase 2a message is received and a Phase 2b message is sent.

As part of the optimisations recommended by Gray et al. [6], the phase 2b


messages are sent as a bundled message to the leader by the acceptors.

In 2PC the TM can unilaterally abort a transaction. In the Paxos commit


algorithm the leader can only abort an RM’s decision to be prepared; the leader
sends a Phase 2b abort message with a higher ballot number than the one from
the RM.

4. Client The client is the process that:

• initiates a transaction.

• assigns work to each of the RMs and

• requests that the transaction be committed.

The Client finds out the final outcome of the transaction. The state transition
diagram shown in Figure 17 for the 2PC is also applicable to the Paxos
Commit Algorithm.

3.3 Messaging Analysis

The messaging delay in atomic commit protocols is a key part of the performance
analysis. In this section an analysis of the messaging delay for 2PC and Paxos
Commit is provided [6]. This analysis is for the normal case, where the transaction is
committed.

3.3.1 Two Phase Commit

The 2PC protocol sends the following sequence of messages in the normal case:

• The initiating RM enters the prepared state and sends a Prepared message to
the TM. (1 message)

• The TM sends a Prepare message to every other RM. (N-1 messages)

• Each other RM sends a Prepared message to the TM. (N-1 messages)

• The TM sends a Commit message to every RM. (N messages)

Thus, in the normal case, the RMs learn that the transaction has been committed after
four message delays. A total of 3N-1 messages are sent. In a faster form of the 2PC
the prepare messages from the TM can be avoided, as all the RMs can inform the TM
as and when they become ready; this reduces the message complexity to 2N [6].
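For example, with N = 5 resource managers the normal case involves 1 + 4 + 4 + 5 = 14 = 3N-1 messages over four message delays, or 2N = 10 messages in the faster variant.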

3.3.2 Paxos Commit

Assume that there are N RMs; assume a system that can tolerate F faults, so there are
2F +1 acceptors. In the normal case, the Paxos Commit algorithm uses the following
(potentially inter-node) messages [6] (The details of the Paxos Algorithm messages
can be found on Page 48):

• The first RM to prepare sends a BeginCommit message to the leader (1


message)

• The leader sends a Prepare message to every other RM. (N-1 messages)

• Each RM sends a ballot 0 phase 2a Prepared message for its instance of


Paxos to every acceptor. ((2F + 1)N messages)

• For each RM’s instance of Paxos, every acceptor sends a phase 2b Prepared
message to the leader. However, an acceptor can bundle the messages for all
those instances into a single message. (2F + 1 messages)

• The leader sends a single Commit message to each RM containing a phase 3


Prepared message for every instance of Paxos. (N messages)

The RMs therefore learn after five message delays that the transaction has been
committed. A total of 2F(N + 1) + 3N + 1 messages are sent. If the leader is on the
same node as one of the acceptors, then one of the phase 2b messages is free, so the
total number of messages is (2F + 3)N + 2F. If each acceptor is on the same node as

an RM, then 2F + 1 of the phase 2a messages are intra-node messages that can be
discounted [6].

As observed above, phase 3 of Paxos algorithm can be eliminated by having each


acceptor send its phase 2b messages directly to all the RMs. This allows the RMs to
learn the outcome in only four message delays, but a total of 2F(2N +1)+3N +1
messages are required. If each acceptor is on the same node as an RM, then 2(2F +1)
of those messages can be discounted, leaving 2F(2N - 2F - 1) + 3N + 1 messages.

So far Paxos commit requires five message delays, which can be reduced to four by
eliminating phase 3 and having acceptors send extra phase 2b messages. Two of those
message delays result from the sending of Prepare messages to the RMs. As stated in
section 3.3.1, these delays can be eliminated by allowing the RMs to prepare
spontaneously, leaving just two message delays [6].
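The message counts discussed in this section can be captured as simple formulas; the following sketch (illustrative Java, with hypothetical names) computes the totals for the basic variants described above, ignoring the co-location discounts.

// Normal-case message counts for the basic protocol variants discussed above.
final class MessageCounts {
    // 2PC: 1 + (N-1) + (N-1) + N = 3N - 1 messages, four message delays.
    static int twoPhaseCommit(int n) {
        return 3 * n - 1;
    }

    // Paxos Commit with Phase 3: 1 + (N-1) + (2F+1)N + (2F+1) + N = 2F(N+1) + 3N + 1 messages.
    static int paxosCommit(int n, int f) {
        return 2 * f * (n + 1) + 3 * n + 1;
    }

    // Paxos Commit without Phase 3 (acceptors send Phase 2b directly to the RMs):
    // 2F(2N+1) + 3N + 1 messages over four message delays, as stated above.
    static int paxosCommitNoPhase3(int n, int f) {
        return 2 * f * (2 * n + 1) + 3 * n + 1;
    }

    public static void main(String[] args) {
        // Example: N = 5 RMs, tolerating F = 1 fault (3 acceptors).
        System.out.println(twoPhaseCommit(5));         // 14
        System.out.println(paxosCommit(5, 1));         // 28
        System.out.println(paxosCommitNoPhase3(5, 1)); // 38
    }
}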

3.4 Analysis of Disk Writes

Disk writes for recording the states of the processes in a transaction processing system
also account for a large part of the system’s performance. This is in addition to the state
changes and messaging which were described in the previous two sections for 2PC
and Paxos commit. In this section an analysis of the disk writes performed by
the two protocols is provided.

3.4.1 Two phase commit

In 2PC there are two important decisions or states that need to be recorded to stable
storage. These are essential so that if a failure occurs the system is able to read from
stable storage the last state for the transaction and continue processing.

These are [6]:

1. The RM’s internal decision about whether to Abort or Commit.

2. The TM’s decision, after receiving all RM decisions, about whether to Abort
or Commit.

3.4.2 Paxos Commit

Paxos commit has disk write requirements similar to those of 2PC, but they vary slightly due to
the introduction of the Acceptors. Paxos consensus requires a write to stable storage
for [6]:

1. The leader’s decision to send a phase 2a message

2. The acceptor’s decision to send a phase 1b or phase 2b message

These two in the context of Paxos Commit can be stated as

1. RMs record their individual decisions to disk.

As per the Paxos commit protocol, the phase 2a message is sent by the RM to
the acceptors for the prepared decision; thus a write is required by the RM
before sending its prepared message.

2. According to the Paxos Consensus algorithm each of the acceptors is supposed


to write to disk before sending its Phase 2b message. This is applicable for
every instance of the Paxos algorithm. This also applies to the instance of the
Paxos algorithm that is run for the Registrar process and the selection of the
list of the RMs that are taking part in the transaction.

This means that each of the acceptors needs to do a disk write for each of the
RMs and for the Registrar process. Gray et al. [6] suggest that the disk writes for the
RM processes should be bundled together, similarly to the phase 2b messages, to
reduce the number of disk writes that are done.

3.5 Recommended Optimisation

While analysing the Paxos commit algorithm, scope for further improvement was
identified. Since disk writes are much more time consuming than messaging, if
the delay associated with the algorithm’s disk writes can be reduced, the additional
messaging delay caused by the extra processes can be overcome.

The Acceptor writes to disk before sending the Phase 2b message. The reason for
performing the write before sending the Phase 2b message is to ensure that, when an
acceptor recovers from a fault, it knows that the Phase 2b message was sent or was ready to
be sent. This delays the process of choosing a correct value. Paxos can tolerate F
faults and the likelihood of multiple faults at the same time is fairly low. If the

62
algorithm is adjusted to send the message concurrently with the disk write and the disk
write fails at F acceptors, there are still F+1 acceptors to complete the decision-making
process. In the rare case that there are more than F failures and the system is
unable to recompute the decision that was optimistically taken as a result of the
concurrent messaging and disk writing, the transaction can be rolled back.
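A sketch of the adjusted acceptor behaviour is shown below (illustrative Java; the Log and Network interfaces and all names are assumptions made for the sketch): the Phase 2b message is sent without waiting for the disk write to complete.

import java.util.concurrent.CompletableFuture;

// Sketch of the proposed optimisation: the Phase 2b message is sent concurrently
// with the acceptor's disk write instead of after it.
class OptimisticAcceptor {
    interface Log { void forceWrite(String record); }          // stable-storage write (slow)
    interface Network { void send(String destination, String message); }

    private final Log log;
    private final Network net;

    OptimisticAcceptor(Log log, Network net) {
        this.log = log;
        this.net = net;
    }

    void acceptAndReply(String instance, String value, String leader) {
        String record = instance + ":" + value;
        // Start the disk write asynchronously; up to F acceptors may lose this
        // record on failure, and the remaining F+1 can still complete the decision.
        CompletableFuture<Void> write = CompletableFuture.runAsync(() -> log.forceWrite(record));
        // Send the Phase 2b message without waiting for the write to finish.
        net.send(leader, "Phase2b " + record);
        write.join();   // the write still completes, but its latency is masked by the send
    }
}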

A similar optimisation was also proposed by Jimenez-Peris et al. [7]. They
propose multicasting the RM’s decision to a coordination group without
flushing the result to the log. Multicasting is used in small groups to reduce latency; the
architecture uses the coordinator group for multicasting. When the RM is required to
send its decision to the coordinator, if the RM has not already written its record to disk,
it multicasts its decision to the coordinator group concurrently with writing to the log.
The latency of the multicast is masked behind the disk write, which is essential in any
case, and the coordinator becomes aware of the RM’s decision earlier.

Both approaches use the redundancy in the acceptors to achieve the optimisation. The
optimisation suggested by Jimenez-Peris et al. can also be applied to Paxos
commit: the message from the RM can be sent to the acceptors before the RM writes
the decision to disk. The multiple acceptors will act as durable storage for the RMs.
This concept has been called an online data store or persistent memory by Mehra et al.
[35].

3.6 Alternate algorithm

The results for Paxos commit displayed a problem with using the Registrar process.
The instance of the Paxos algorithm that is run for the Registrar does not complete in
time; the RM messages are received before the acceptor finds out the RMs that are a
part of the transaction.

The other problem with the Paxos Commit algorithm is that while the Registrar is
non-operational due to failure, the whole transaction system remains unavailable for
that same duration. This occurs as the transaction processing system does not have a
process doing the TM function of initiating a transaction.

An alternate algorithm was designed to address these problems. In the alternate


algorithm (Modified Paxos) the leader process of the Paxos algorithm performs the
role of the TM.

In Paxos Commit, if the Registrar process fails, the transaction fails. The same applies in
the Modified Paxos algorithm if the current Leader fails. The Leader informs the
acceptors of the list at the time the Client sends the ‘Do Commit’ message to the
Leader. In a situation where the Leader fails before the Client sends the ‘Do Commit’
message, the new Leader does not have a list of the registered RMs and the
transaction terminates unsuccessfully.

The Leader has to additionally send messages to all the participants in the transaction
process so that all the processes know the current leader.

3.7 Summary

In this chapter 2PC and Paxos Commit algorithms have been explained in detail and
optimisations and alternate algorithms have been suggested. The next chapter
presents the purpose of the simulation and outlines its design and implementation.

4. Design and Implementation of Simulation
Framework
This chapter provides both a motivation and description of the design used to
implement the simulation.

The aim is to implement the Paxos Commit Algorithm and Two Phase Commit (2PC)
in a complete transaction processing system simulation. The overall number of
transactions committed and aborted over a set period of time will be compared for the
two Atomic Commit Protocols.

4.1 Motivation for the Design

As stated in the previous chapters Paxos commit is fault tolerant and non-blocking.
Thus the simulation needs to be able to introduce various different kinds of faults into
the system to be able to measure the benefits provided by the fault tolerance. The
faults that need to be introduced are as follows:

• Process Failure and Recovery – the processes can fail intermittently and then
recover after a certain amount of time.

• Message Failures – the messaging system may have a failure and because of
this the message does not get delivered to the recipient.

• Message Delay – there is a certain amount of delay on all messages, some


messages may get delayed more than others and hence create an out of
order delivery scenario.

This appears to be the first attempt at simulating the Paxos Commit Algorithm [17].

4.1.1 Overview of Simulations

As part of the study two types of simulations were designed

1. Simulated Message Passing – in which a queue is used to simulate message


passing between the various processes. In this simulation a high number of
failures can be introduced and a large number of transaction cycles can be
simulated in a short span of time. This is possible because the simulation operates
a virtual clock which advances based on the message delivery time
rather than actual time; this is explained in section 4.2.1.

2. Simulation using message passing library – this simulation uses a message


passing library for passing messages between the various processes instead
of using a queue to simulate the message passing. With this approach the
network load and number of messages sent would impact the messaging delay
and the effects of this can be seen on both the 2PC and Paxos Commit
Algorithms. In the simulation that uses a queue for the messages, the message
delay is a theoretical value; this simulation uses a real network which may
behave differently to the theoretical values calculated in the first simulation.

4.1.2 Assumptions

The following assumptions are made about the environment that is being simulated:

• The simulation implements the Receive omission fault model (described in


section 2.2.1).

• The failure of a node does not affect the permanent storage.

• The messaging system is asynchronous and failure prone.

• All sites fail randomly. At failure the site loses its volatile states. The stable
storage does not fail, and when the site recovers it is able to read its last state
from stable storage.

• All the participating resource managers perform work for every transaction.
The Client progresses by assigning a task to each of the Resource Managers
for every transaction.

• A failed node in the system does not behave incorrectly; it stops responding to
messages and remembers its previous state when it returns to working state.

• The messaging delay is based on a LAN.

4.2 Simulated Message Passing

The solution that was developed to account for all the requirements utilised a series of
classes which model the various types of processes (Figure 21 shows the class

diagram for 2PC and Figure 24 shows the class diagram for Paxos Commit). A
‘simulation’ class was designed; this object controls the entire simulation. The
simulation class instantiates objects from the process classes depending on the
parameters set. The messaging is modelled as a message queue.

When a process sends a message it is added to the queue and when the message is
scheduled for delivery the simulation class, depending on the target for the message,
passes it to the specific object instance to be processed (Figure 22 shows some of the
interactions between the various classes of the simulation).

Figure 21 Class Diagram for the various processes of 2PC


The failure of the processes is triggered by the simulation sending a failure message to
the specific instance. In the failed state the process does not process any messages
until it receives the message to recover from the failure. The timing of the failure is
triggered by a random distribution around the mean time to failure set in the
simulation. A random distribution is used so that processes fail at random times in
relation to the mean time to failure. If the mean time to failure were used all the
processes would fail at the same time.

Figure 22 Interactions between the various classes

Message transmission also fails randomly; in this case the simulation does not deliver
the message to the receiving process. The message delay is simulated by the messages
being queued in order of delivery time. The delivery time is calculated by using a
random distribution with a constant value for mean delay. A random distribution is
used for the messages so that each transaction that does not cause any failure does not
complete in the same amount of time. Using random distribution brings the simulated
messaging system closer to a network messaging system, where each message can
have a different delay due to the various factors that affect message delay.

The framework allows for the execution of millions of transaction commit cycles with a
high failure rate within a reasonable amount of time (approximately an hour).

4.2.1 Simulation

Each transaction commit cycle is started by the simulation process. It sends a message
to the client to start a new transaction. The simulation process detects if the last
transaction has finished and starts a new one. The simulation process maintains a
virtual clock. Every message that is processed by the simulation has a delivery time;
the virtual clock advances based on this delivery time; the delivery time of the
message is set based on the virtual clock. The simulation can be set to either run for a
certain number of transactions or for a certain amount of virtual time.

4.2.2 Messages

The simulation uses a queuing system to simulate message passing between multiple
processes. A single message queue simulates the arrival order of messages to the
various processes (illustrated in Figure 23).

The transaction commit decision process requires communication between the various
processes. The messages are simulated as a single queue sorted by time of arrival, so
that the simulation can pick up the next message to be delivered and the virtual clock
is advanced in a sequential order. Each message that is to be sent has an associated
delay; this may be zero, i.e. for immediate delivery. The messaging system adds the
network associated delay and calculates the expected arrival time based on a random
distribution associated with the mean delay set in the simulation (The values used for
the mean delay are listed in section 4.5).

Figure 23 Internal working of the message class: a message is added to the queue with its delivery time, the queue is ordered by time (Time, Message, Destination), and the next message to be delivered is retrieved from the head of the queue.


As the simulation progresses the next message in the queue is picked up. The virtual
time is advanced to the arrival time for the message.
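A minimal sketch of this virtual-clock message queue is shown below (illustrative Java; the class names and the use of a normal distribution for the delay are assumptions made for the sketch, not the simulation’s actual implementation).

import java.util.PriorityQueue;
import java.util.Random;

// Sketch of the simulated message queue: messages are ordered by delivery time
// and the virtual clock advances to the delivery time of the next message.
class SimulatedMessaging {
    static class Message implements Comparable<Message> {
        final double deliveryTime;
        final String destination;
        final String payload;
        Message(double t, String dest, String payload) {
            this.deliveryTime = t; this.destination = dest; this.payload = payload;
        }
        public int compareTo(Message other) {
            return Double.compare(deliveryTime, other.deliveryTime);
        }
    }

    private final PriorityQueue<Message> queue = new PriorityQueue<>();
    private final Random rnd = new Random();
    private double virtualClock = 0.0;

    // Enqueue a message; the delivery time is the current virtual time plus a
    // randomly distributed network delay around the configured mean.
    void send(String destination, String payload, double meanDelay) {
        double delay = Math.max(0.0, meanDelay + rnd.nextGaussian() * meanDelay / 4);
        queue.add(new Message(virtualClock + delay, destination, payload));
    }

    // Deliver the next message and advance the virtual clock to its arrival time.
    Message deliverNext() {
        Message next = queue.poll();
        if (next != null) virtualClock = next.deliveryTime;
        return next;
    }
}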

4.2.3 Counters

At the end of each transaction cycle, the relevant processes record the results of the
transaction and other associated statistics.

The time taken for a transaction is calculated by checking the elapsed time for each of
the processes in the transaction and using the largest amount of time taken by any of
them. This could be an RM that failed before receiving the commit decision and only
completed the transaction after recovering.
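
A small sketch of this rule, assuming the per-process elapsed times for a transaction have already been collected (the names are illustrative only):

import java.util.Collections;
import java.util.List;

class TransactionTimer {
    // The recorded time for a transaction is the largest elapsed time among
    // the participating processes (client, TM, RMs and, for Paxos, acceptors),
    // e.g. an RM that only finished the transaction after recovering.
    static double transactionTimeMs(List<Double> perProcessElapsedMs) {
        return Collections.max(perProcessElapsedMs);
    }
}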

4.2.4 2PC Processes

The 2PC processes are designed as per the state diagrams described in section 3.1.

4.2.5 Paxos Commit Processes

Each of the Paxos Commit processes is designed according to the state transition
diagrams discussed in section 3.2. Some additional implementation decisions had to
be made which are described below.

Figure 24 Class diagram for the processes of Paxos Commit

Acceptors

For the Paxos consensus algorithm to make progress, one of the Acceptors has to act as
the leader process. The leader is decided at the start of the algorithm, and a new leader
is elected if/when the previous one fails. The (n+4)-stable leader election with message
loss [36] has been used as the algorithm for leader election amongst the acceptors.
This algorithm was chosen because it is a stable10 leader election algorithm and it can
tolerate message loss. The algorithm is provided in Appendix A.

Client

The client sets a timeout when it has finished assigning work to all the RMs, just
before sending the ‘Do Commit’ message to the Registrar; this timeout uses the
commit request timeout constant.

4.2.6 Optimisation Implementation

Gray et al. [6], while outlining the Paxos Commit algorithm, use the Paxos
Consensus algorithm for each RM's abort/commit decision. The discussion suggests
optimisations to the algorithm which should provide improved performance. In the
simulations some of these optimisations have been tested against normal execution.
For example, as discussed in section 3.3.2, if the Acceptor and RM are on the same
node, the message sent between these two processes does not need to be counted. To
achieve this, the node process is used to assign physical locations to the processes; if
the message's source and destination are the same, the message is delivered
instantaneously.

10. An algorithm is stable if it ensures that once a leader is elected, it remains the leader for as long as it
does not crash and its links have been behaving well.

Node Process

A node object allows us to specify the location of processes; the message delay can be
reduced to zero if two processes are located on the same node. The mapping of the
nodes and the processes is parameterised. For Paxos Commit each of the Acceptors is
assigned to the same node as one of the RMs.

The failure message can be sent directly to the node. The node can then send
messages to each associated process.
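
The sketch below illustrates the idea of the node mapping; the class and method names are assumptions, not the thesis code.

import java.util.HashMap;
import java.util.Map;

// Processes are mapped to nodes; messages between two processes on the same
// node incur no network delay, otherwise the mean network delay is used.
class NodeMap {
    private final Map<String, Integer> processToNode = new HashMap<>();
    private final double meanNetworkDelayMs;

    NodeMap(double meanNetworkDelayMs) {
        this.meanNetworkDelayMs = meanNetworkDelayMs;
    }

    void assign(String processName, int nodeId) {
        processToNode.put(processName, nodeId);
    }

    // Zero delay for co-located processes, the configured mean delay otherwise.
    double delayBetween(String source, String destination) {
        Integer a = processToNode.get(source);
        Integer b = processToNode.get(destination);
        return (a != null && a.equals(b)) ? 0.0 : meanNetworkDelayMs;
    }
}

For Paxos Commit, assigning "Acceptor1" and "RM1" to the same node id, "Acceptor2" and "RM2" to another, and so on, gives the pairing described above, so that messages between a paired Acceptor and RM are delivered instantaneously.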

4.3 Simulation Using Message Passing Library

The simulated message passing provides a good comparison of the 2PC and Paxos
Commit protocols, but it utilises mean values for network message delays and disk
writes. The JPVM [37] message passing library is therefore used to develop a
simulation where the various participants communicate over a network. As the
participants in this simulation communicate over the network, the message delay is
not based on a random distribution but is the actual delay within the network at that
time, which depends on the network load. This kind of simulation should take the
network load into account in a better way.

The message passing simulation is not ideal for running millions of transaction
cycles, as the time spent is real time rather than virtual time. In the simulated
message passing, a virtual time of approximately 28 hours is executed in approximately
2 hours of real time; if the message passing simulation were run for the same
workload, it would take the full 28 hours.

4.3.1 Counters

In the message passing simulation, the statistics for the transaction and messaging are
recorded at the end of each transaction. The difference in this simulation is that the
time taken by each transaction is not calculated by taking the maximum of all the
processes' elapsed times, but is taken from the Client alone. In this simulation it was
harder to collect the transaction completion times from all the processes and compute
the maximum time taken for each transaction, so the decision was made to use the
completion time at the Client, as this is where the new transaction is started once the
previous transaction has completed.

In the case of an RM failure the client is likely to know the outcome of the transaction
before the RM. In simulated message passing the elapsed time is not calculated until
all the participants have finished processing that transaction. In the message passing
simulation a single RM failure can therefore cause more than one transaction to be
aborted, whereas in the simulated message passing scenario only a single transaction
will be aborted, as a new transaction is not started until the old transaction has finished
processing at all the participants.

4.4 Design of Modified Paxos

In the Paxos consensus algorithm the leader process is only required to ensure
progress. In the Modified Paxos commit model, the leader also takes over the role of
the TM.

The Paxos consensus algorithm uses ballot numbers for proposing values because
different proposers may choose different values. In the case of distributed commit and
the RM's decision on a prepared/abort value, there is only ever a single proposer,
which selects only one value at a time; as a result, ballot numbers are not used in this
implementation.

The Modified Paxos commit algorithm omits Phase 1 of the Paxos consensus
algorithm. Phase 1 finds the highest ballot number that has been executed and
determines whether the leader is free to propose a value; since ballot numbers are not
used here, Phase 1 is not required. The commit algorithm starts in phase 2b of the
consensus algorithm, with the RM proposing a value. Each of the acceptors
communicates this value to the leader. The leader, acting as the TM, is the only party
interested in the value and makes the commit decision based on it. The commit
decision is then communicated to all the RMs.

If a new leader is elected, it queries the state of the RMs' commit/abort decisions from
all the other acceptors. This equates to Phase 1 through to Phase 2a of the Paxos
consensus algorithm.
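
The sketch below summarises the leader's side of this failure-free flow. The names are illustrative; leader election, the Registrar, recovery and the new-leader catch-up just described are all elided, and this is not the thesis implementation.

import java.util.HashMap;
import java.util.Map;

enum Vote { PREPARED, ABORTED }
enum Decision { COMMIT, ABORT }

// Modified Paxos, failure-free path: each RM sends its vote (no ballot
// numbers, no Phase 1); every acceptor records it and forwards it to the
// leader, which also acts as the TM and makes the global decision.
class ModifiedPaxosLeader {
    private final Map<String, Vote> votes = new HashMap<>();
    private final int registeredRmCount;

    ModifiedPaxosLeader(int registeredRmCount) {
        this.registeredRmCount = registeredRmCount;
    }

    // Called when an acceptor forwards an RM's vote to the leader/TM.
    // Returns the global decision, or null while the outcome is still open.
    Decision onForwardedVote(String rmName, Vote vote) {
        if (vote == Vote.ABORTED) {
            return Decision.ABORT;   // any abort aborts the transaction
        }
        votes.put(rmName, vote);
        return (votes.size() == registeredRmCount) ? Decision.COMMIT : null;
    }
}

Once the decision is non-null, the leader would send it to all the RMs.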

4.4.1 Disk Write Analysis

The acceptors need to do a write to stable storage before sending the phase 2b
message (in terms of our algorithm, the prepared message). The disk writes for all
the RMs are bundled into a single disk write.
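
As an illustration of this bundling, the sketch below buffers each RM's prepared/abort record at an acceptor and flushes them with a single write; the names and the storage format are assumptions, not the thesis code.

import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// An acceptor collects the records for all RMs of a transaction and writes
// them to stable storage in one disk write before sending its message on.
class AcceptorLog {
    private final List<String> pending = new ArrayList<>();

    void record(String rmName, String vote) {
        pending.add(rmName + ":" + vote);
    }

    // One disk write covering every RM's record for this transaction.
    void flush(Writer stableStorage) throws IOException {
        stableStorage.write(String.join(",", pending) + System.lineSeparator());
        stableStorage.flush();
        pending.clear();
    }
}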

4.5 Constants Used

4.5.1 Simulated Messaging

The simulations used values for calculating message delays, disk delays, process
lifetimes and the various timeouts. The table below provides a list of these constants
[38]:

Process Lifetime 100 seconds

The Process Lifetime is set to such a low value in order to cause a high number of
process failures.

Process Recovery Time 1 second

The recovery time is also very short; since the processes fail so often, they need to
recover in a comparably short amount of time.

Mean message delay 100 milliseconds

In modern local area networks messaging is effectively instantaneous; this value has
been chosen to model real-world LAN messaging.

Mean Disk delay 5 milliseconds

A disk write is more expensive than a network message; this value models the
difference.

Request Timeout (amount of time to wait for a message response) 50 milliseconds

Transaction Timeout (the transaction fails after this time if it is not already in the
Commit decision stage) 100 milliseconds

The timeouts were chosen after some testing of the simulations; these values were
selected so as to allow transactions to complete under regular processing, where
processes were active.

Paxos Commit

This timeout value is only used for the Paxos Commit simulation. This is the value
used by the Client to set a timeout after the Client finishes assigning work to RMs.

Commit Request Timeout 80 milliseconds

4.5.2 Message Passing

These constants are used for calculating delays for timeouts and process failures in the
Message Passing Simulations. The reasons for the values are the same as the ones for
the values in section 4.5.1.

Process Lifetime 100 seconds

Process Recovery Time 1 second

Transaction Timeout 350 milliseconds

Request Timeout 150 milliseconds

Mean Disk Delay 5 milliseconds

4.6 Summary

This chapter detailed the motivations for the design of the simulation framework. The
architecture for the simulator has been described with details of how 2PC and Paxos
Commit are implemented on the simulation framework.

The optimisations tested as part of the work have been described and a description of
the Modified Paxos Algorithm is given. The chapter also listed the assumptions made
for the simulations and the constants used to emulate the environment.

The next chapter presents the results and analysis for the tests run on the simulation
framework.

5. Results and Analysis
The previous chapters have explained the concept of Atomic Commit in a distributed
transaction processing system and the concepts that are essential to understanding
Commit Protocols. Algorithms for Two Phase Commit (2PC), Paxos Commit and
Modified Paxos have been explained, along with the framework used to simulate
these protocols. This chapter presents the results obtained by running these
simulations with various settings to emulate different system conditions and protocol
configurations.

5.1 Glossary

Before the results are presented, a description of the fields is provided:

Row Headings Description

Events The number of steps processed by the simulation.

Messages The number of messages sent.

Messages Failed The number of messages that were not delivered.

Messages Dead Letter The number of messages that reached the process, when
the process had failed.

Process Failure The number of processes that failed during the whole
simulation.

Application Messages The number of messages that are related to the transaction
application. (The rest of the messages are sent to control
the processes and the simulation, e.g. messages to cause
process failures.)

Commits The number of transactions that were committed.

Aborts The number of transactions that were aborted.

Virtual Time The amount of time that was simulated, in milliseconds.

5.2 Overview

The results provided are for various test sets as follows:

1. Failure Free Environment – Paxos commit, Modified Paxos and 2PC are run
in an environment where there are no failures. Paxos commit and Modified
Paxos are run with no fault tolerance; in this setting the Paxos commit and
Modified Paxos are equivalent to 2PC. This allows for them to be compared
for message overheads. In a failure free environment the commit protocols
should have a 100% commit record; if the tests reveal a different result this
will be due to an error in the commit protocol. Paxos Commit protocol is run
with and without the optimisations discussed before, namely

• Bundling of Phase 2 messages – the bundling of Phase 2 messages reduces the
number of messages sent, thus reducing the message overhead in Paxos
Commit. Running the simulations with and without this optimisation
allows for the analysis of the impact of the number of messages on the
overall performance of the Paxos commit protocol.

• Durable Write Optimisation – this optimisation uses the redundancy in the
Paxos commit to improve performance by sending the Phase 2 message in
parallel to the durable write. This shows the impact of the durable write on the
overall performance of the Paxos commit protocol.

2. Message Failure – the tests that are run in the failure free environment are
also run in an environment where the messaging system is unreliable, i.e.,
messages may not be delivered. The various simulations in this environment
show how the protocols deal with intermittent message failures.

3. Process and Message Failure – the Process and Message failure environment
models an asynchronous environment, where processes can fail, messages
may be delayed or not delivered and the application that starts the transaction
can also fail. All the tests described above for the failure free environment are
also run in the process and message failure environment. This allows us to
measure how the protocols would behave in such an environment and cope
with all kinds of failures.
The simulations have been run with the number of Resource Managers (RMs) set to 3,
5, 7 and 9. This was done so that the number of Acceptors and the number of RMs
increase in the same increments. In Paxos Commit, an Acceptor and an RM should be
placed on the same node to improve the protocol's performance; having the same
increments allows these processes to be paired. It is not necessary to have RMs in
these increments, and the 2PC protocol was also run with 1, 2, 4, 8, 16 and 32 RMs; a
graph showing the average transaction commit times is presented in Appendix C.

5.3 Failure Free Environment

To be able to analyse the level of improvement provided by the Paxos Commit
Algorithm, the first set of results compares the performance of the Two Phase Commit
(2PC) Algorithm and the Paxos Commit Algorithm with varying properties when no
faults occur.

5.3.1 Simulated Message Passing

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
events: 247,335,530 353,139,638 446,300,916 529,084,520
messages: 354,514,266 499,001,673 626,260,973 739,361,714
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 247,335,530 353,139,638 446,300,916 529,084,520
commits: 8,244,517 7,676,948 7,198,402 6,783,135
aborts: - - - -
virtualTime (ms): 100,000,004,057 100,000,004,762 100,000,002,912 100,000,001,583
Table 1 Summary of Results for 2PC in fault free environment

3 RM 5 RM 7 RM 9 RM
time in ms Commits Commits Commits Commits
10 30,727 2 - -
11 4,886,544 293,425 1,443 -
12 3,225,152 4,740,978 794,973 23,101
13 101,555 2,504,814 4,359,850 1,362,612
14 537 136,061 1,903,714 3,831,489
15 1 1,660 135,745 1,441,363
16 - 7 2,655 121,139
17 - - 19 3,383
18 - - 1 43
19 - - - 1
34 1 - - -
35 - 1 - -
36 - - 1 2
37 - - - 1
Average
Transaction
Time 11.41 12.32 13.19 14.04
Table 2 Transaction completion times for 2PC in fault free environment

In the failure free situation all transactions are completed, as no failure occurs. The
completion time is dependent on the message delay. Results in Table 2 demonstrate
this: as the number of resource managers increases, the average amount of time
required to complete a transaction increases.

Paxos Commit

No Faults to Tolerate

To compare the overheads of the Paxos commit protocol, the Paxos commit is
simulated in the configuration where it does not tolerate any faults. The number of
acceptors in this case is 1. The protocol in this situation is equivalent to 2PC.

3 RM 5 RM 7 RM 9 RM
events: 173,060,514 224,153,100 282,562,178 347,363,780
messages: 239,734,602 300,191,334 368,098,482 442,186,228
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 173,060,514 224,153,100 282,562,178 347,363,780
commits: 1,287,691 916,615 734,563 621,942
aborts: 1,032,623 1,066,787 1,077,021 1,078,823
amnesia: - - - -
virtualTime (ms): 100,000,000,640 100,000,000,349 100,000,000,640 100,000,000,640
Table 3 Results for Paxos Commit without optimisation in the failure free environment

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
10 1,249 - - - - - - -
11 639,609 - 18,020 - 40 - - -
12 621,010 - 515,092 - 54,549 - 944 -
13 25,649 - 359,924 - 428,518 - 95,916 -
14 174 - 23,271 - 231,789 - 352,534 -
15 - - 307 - 19,207 - 157,005 -
16 - - - - 457 - 15,054 -
17 - - - - 2 - 477 -
18 - - - - 1 - 12 -
80 - 51,611 - 67 - - - -
81 - 832,218 - 202,918 - 5,349 - 28
82 - 147,120 - 733,288 - 358,739 - 34,315
83 - 1,671 - 127,605 - 606,289 - 471,053
84 - 3 - 2,894 - 103,016 - 487,652
85 - - - 15 - 3,593 - 82,005
86 - - - - - 35 - 3,710
87 - - - - - - - 59
88 - - - - - - - 1
Average
Time 11.5215 81.0957 12.4248 81.9347 13.2954 82.7594 14.1460 83.5827
Table 4 Commit and Abort times for Paxos Commit without optimisation in failure free
environment

As the system has no failures, the commit protocol should commit all transactions. In
this set of results, however, a very high abort rate can be seen. The Registrar's instance
of Paxos must finish before Phase 2 messages are received by the acceptors; this is to
ensure that the acceptors have a list of registered RMs for the current transaction. In
this scenario one of the Phase 2a messages is received before the Registrar's instance
of Paxos finishes. The acceptors ignore the Phase 2a message because they do not yet
know the list of registered RMs, and the transaction is aborted.

Overall for this scenario there are fewer events compared to the results for 2PC
(Table 1), because the aborts are initiated only after the Commit Request Timeout
fires, 80 milliseconds after the transaction starts; as a result the average transaction
time increases from roughly 11 ms for 2PC to roughly 46 ms ((11.52 + 81.10)/2) for
Paxos Commit.
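
As a cross-check (computed here from the counts in Table 3 and the averages in Table 4, not a figure from the original analysis), weighting each average by the corresponding number of commits and aborts for 3 RMs gives a value of the same order:

(1,287,691 × 11.52 ms + 1,032,623 × 81.10 ms) / (1,287,691 + 1,032,623) ≈ 42.5 ms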

Phase 2 Messages Bundled
3 RM 5 RM 7 RM 9 RM
events: 157,376,204 185,037,763 212,499,750 238,969,966
messages: 224,070,165 261,073,341 298,095,996 333,861,742
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 157,376,204 185,037,763 212,499,750 238,969,966
commits: 1,288,524 915,925 736,198 623,292
aborts: 1,033,071 1,067,496 1,077,291 1,079,084
virtualTime: 100,000,000,640 100,000,000,640 100,000,000,640 100,000,000,640
Table 5 Results for Paxos Commit with message bundling in failure free environment

When the Phase 2b messages are bundled, the overall number of transactions
executed (committed and aborted) increases in comparison to the results from Table
3, while the number of events and the number of messages decrease. This shows that
the system has less network traffic and transactions are processed faster than when
the messages are not bundled.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
10 2,498 - - - - - - -
11 684,753 - 27,209 - 97 - - -
12 578,231 - 540,369 - 70,496 - 1,676 -
13 22,873 - 328,269 - 439,076 - 115,388 -
14 169 - 19,809 - 209,751 - 351,832 -
15 - - 269 - 16,441 - 141,351 -
16 - - - - 333 - 12,662 -
17 - - - - 4 - 375 -
18 - - - - - - 8 -
80 - 52,071 - 54 - - - -
81 - 831,314 - 204,160 - 5,468 - 18
82 - 148,007 - 732,548 - 359,185 - 33,910
83 - 1,673 - 127,866 - 605,999 - 471,373
84 - 6 - 2,851 - 103,005 - 488,032
85 - - - 17 - 3,588 - 81,988
86 - - - - - 46 - 3,700
87 - - - - - - - 63
Average
Time 11.48 81.10 12.37 81.93 13.23 82.76 14.08 83.58
Table 6 Transaction completion times for Paxos Commit with bundling in fault free environment

The average time taken for transactions to complete is reduced marginally; for 3 RMs
the average commit time reduces from 11.52 ms without message bundling (Table 4)
to 11.48 ms with message bundling.

Durable Write Optimised

In the scenario where Paxos Commit is run in the no-fault-tolerance setting, the disk
write optimisation should not be used, as there is no redundancy available (F = 0;
there is only one Acceptor).

One Fault to Tolerate


3 RM 5 RM 7 RM 9 RM
events: 339,942,114 459,831,023 601,737,477 763,562,448
messages: 454,427,403 581,920,459 731,715,650 901,411,437
messagesFailed: 0 0 0 0
messagesDeadLetter: 0 0 0 0
processFailures: 0 0 0 0
applicationMessages: 339,942,114 459,831,023 601,737,477 763,562,448
commits: 664,594 409,383 288,080 217,353
aborts: 959,579 1,031,676 1,064,846 1,080,972
amnesia: 0 0 0 0
virtualTime (ms): 100,000,000,640 100,000,000,640 100,000,000,006 100,000,000,640
Table 7 Results for Paxos commit without optimisation in the fault free environment where one
fault is tolerated

As fault tolerance is introduced in the Paxos commit, the commit-to-abort ratio
becomes considerably lower compared to the results from Table 3: 69:100 where one
fault is tolerated compared to 124:100 where faults are not tolerated (Table 3), for
3 RMs. The overall number of transactions processed is also lower; this is due to the
average transaction completion time increasing to 46.63 ms from 46.31 ms. The
number of messages sent has increased; additional messages need to be sent to the
additional acceptors introduced because of fault tolerance.

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
10 423 - - - - - - -
11 328,962 - 9,099 - 21 - - -
12 314,884 - 235,634 - 25,018 - 472 -
13 11,389 - 149,230 - 170,419 - 39,007 -
14 51 - 8,479 - 81,280 - 121,570 -
15 - - 94 - 5,858 - 47,777 -
16 - - - - 110 - 4,068 -
17 - - - - 1 - 118 -
18 - - - - - - 3 -
60 32 - - - - - - -
61 5,411 - 432 - 8 - - -
62 3,342 - 4,554 - 908 - 34 -
63 99 - 1,784 - 3,348 - 1,309 -
64 1 - 77 - 1,059 - 2,355 -
65 - - - - 49 - 608 -
66 - - - - 1 - 30 -
67 - - - - - - 1 -
80 - 49,158 - 83 - - - -
81 - 744,245 - 195,756 - 5,618 - 25
82 - 128,697 - 691,234 - 352,949 - 34,660
83 - 1,410 - 119,478 - 588,311 - 468,702
84 - - - 2,666 - 99,315 - 482,100
85 - - - 16 - 3,359 - 80,731
86 - - - - - 45 - 3,696
87 - - - - - - - 57
88 - - - - - 1 - 1
Average 12.1814 81.0892 13.2243 81.9296 14.1704 82.7541 15.0696 83.5796
Table 8 Transaction completion times for Paxos Commit with one fault setting in fault free
environment

In Table 4 the commits complete in the 10 ms to 18 ms range; in this table some
transactions commit in the 60 ms to 67 ms region. This happens when a transaction
completes only after a timeout re-requests the result.

As the number of RMs is increased the average amount of time required per
transaction also increases. The number of transactions that are aborted also increases
when more RMs are involved in a transaction.

Phase 2 Messages Bundled
3 RM 5 RM 7 RM 9 RM
events: 301,805,563 364,742,664 428,756,371 492,620,007
messages: 415,690,018 485,986,280 557,728,823 629,449,785
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 301,805,563 364,742,664 428,756,371 492,620,007
commits: 580,994 321,577 207,140 146,368
aborts: 991,139 1,070,787 1,103,733 1,117,580
amnesia: - - - -
virtualTime (ms): 100,000,000,640 100,000,000,640 100,000,000,640 100,000,000,640
Table 9 Results for Paxos commit with message bundling with one fault to tolerate in fault free
environment

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
10 396 - - - - - - -
11 270,891 - 5,947 - 8 - - -
12 291,036 - 173,552 - 14,633 - 269 -
13 12,874 - 129,710 - 117,605 - 21,935 -
14 83 - 9,026 - 66,765 - 81,186 -
15 2 - 121 - 5,775 - 37,613 -
16 - - - - 108 - 3,709 -
17 - - - - 2 - 105 -
18 - - - - - - 3 -
60 23 - - - - - - -
61 3,443 - 184 - 2 - - -
62 2,187 - 2,133 - 360 - 10 -
63 59 - 861 - 1,398 - 426 -
64 - - 42 - 465 - 880 -
65 - - - - 19 - 218 -
66 - - - - - - 14 -
80 - 50,174 - 71 - - - -
81 - 783,805 - 204,431 - 5,512 - 20
82 - 137,274 - 728,640 - 367,915 - 35,514
83 - 1,469 - 127,149 - 617,667 - 487,570
84 - 3 - 2,750 - 105,093 - 503,793
85 - - - 20 - 3,588 - 84,515
86 - - - - - 45 - 3,786
87 - - - - - 1 - 78
Average 12.0405 81.0926 12.9452 81.9324 13.8513 82.7577 14.6838 83.5819
Table 10 Transaction completion times for Paxos commit with one fault to tolerate with message
bundling in fault free environment

The average transaction commit times are better with message bundling than without
it (Table 8); the average transaction commit time is 12.04 ms where messages are
bundled and 12.18 ms where messages are not bundled.

The average times are shown in these two tables because the total number of commits
was lower in the message bundling simulation, even though the transaction commit
times should have been quicker.

Durable Write Optimised


3 RM 5 RM 7 RM 9 RM
events: 1,066,173,039 1,675,596,123 2,364,412,167 3,119,683,368
messages: 1,295,685,319 1,956,387,875 2,690,312,335 3,485,635,521
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 1,066,173,039 1,675,596,123 2,364,412,167 3,119,683,368
commits: 8,398,483 7,834,749 7,350,550 6,927,652
aborts: 365 2 1 -
amnesia: - - - -
virtualTime: 100,000,000,640 100,000,000,640 100,000,000,020 100,000,000,640
Table 11 Results for Paxos commit with disk write optimisation with one fault to tolerate in fault
free environment

The results when 3 acceptors are used, as compared to 1 acceptor, show that fewer
transactions are completed, but only marginally. The optimisation is only viable when
there is redundancy. The results are far better compared to the non-optimised
simulation. The problem of the Registrar's instance of Paxos not finishing before the
Phase 2b messages, described above for the no-fault-tolerance setting, does not occur
with the disk optimisation, as the Registrar's instance of Paxos always completes
before the Phase 2b message is received from any of the RMs.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
5 12,061 - - - - - - -
6 4,455,494 - 220,003 - 899 - - -
7 3,788,372 - 4,619,627 - 704,552 - 19,425 -
8 141,726 - 2,824,422 - 4,386,080 - 1,296,215 -
9 828 - 168,584 - 2,094,271 - 3,908,158 -
10 1 - 2,100 - 161,159 - 1,559,982 -
11 - - 13 - 3,555 - 139,587 -
12 - - - - 33 - 4,220 -
13 - - - - - - 64 -
80 - 209 - - - - - -
81 - 156 - 2 - - - -
83 - - - - - 1 - -
Average 6.4837 80.4274 7.3763 81.0000 8.2341 83.0000 9.0746 0.0000
Table 12 Transaction completion times for Paxos commit with disk write optimization with one
fault to tolerate in fault free environment

The transactions complete considerably faster than when using 2PC: the average
transaction commit times are 6 ms, 7 ms, 8 ms and 9 ms for Paxos, compared to
11 ms, 12 ms, 13 ms and 14 ms for 2PC and 12 ms, 13 ms, 14 ms and 15 ms for Paxos
Commit without optimisation; that is an improvement of 46% when compared with
Paxos Commit without optimisation.

Phase 2 Messages Bundled and Durable Write Optimised


3 RM 5 RM 7 RM 9 RM
events: 890,814,519 1,184,711,969 1,443,180,490 1,672,504,037
messages: 1,120,320,498 1,465,506,105 1,769,086,715 2,038,444,381
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 890,814,519 1,184,711,969 1,443,180,490 1,672,504,037
commits: 8,398,457 7,834,934 7,350,732 6,927,709
aborts: 365 2 - -
amnesia: - - - -
virtualTime: 100,000,000,640 100,000,000,635 100,000,000,013 100,000,000,018
Table 13 Results for Paxos with one fault to tolerate with disk write optimisation and message
bundling in fault free environment

The number of commits achieved with bundling and optimisation is equivalent to that
achieved with optimisation alone. The number of messages sent is lower compared to
the results from Table 11.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
5 13,602 - 2 - - - - -
6 4,526,377 - 227,779 - 961 - - -
7 3,721,652 - 4,677,157 - 719,403 - 19,532 -
8 136,092 - 2,768,068 - 4,425,495 - 1,319,475 -
9 729 - 159,950 - 2,048,899 - 3,924,746 -
10 3 - 1,966 - 152,747 - 1,525,917 -
11 - - 12 - 3,191 - 133,963 -
12 - - - - 35 - 4,024 -
13 - - - - - - 51 -
14 - - - - - - 1 -
29 1 - - - - - - -
30 - - - - 1 - - -
80 - 253 - - - - - -
81 - 112 - 2 - - - -
Average 6.4742 80.3068 7.3658 81.0000 8.2235 0.0000 9.0646 0.0000

Table 14 Transaction completion times for Paxos with one fault to tolerate with disk write
optimisation and message bundling in fault free environment

The improvement can be seen in the transaction completion times. More transactions
are committed in 6 ms for the 3 RM scenario than in Table 12. The Phase 2b message
bundling causes fewer messages to be sent, reducing the overall transaction
completion time; this can be seen in the reduction of the average transaction commit
time from 6.4837 ms (Table 12) to 6.4742 ms.

Modified Paxos

No Faults To Tolerate
3 RM 5 RM 7 RM 9 RM
events: 579,844,464 803,440,533 1,014,712,973 1,215,852,287
messages: 725,074,531 986,816,737 1,231,721,110 1,462,770,231
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 579,844,464 803,440,533 1,014,712,973 1,215,852,287
commits: 8,230,719 7,668,810 7,192,620 6,778,686
aborts: - - - -
virtualTime: 100,000,000,257 100,000,000,074 100,000,000,025 100,000,000,200
Table 15 Results for Modified Paxos in fault free environment

The results for Modified Paxos are similar to those for 2PC, but the number of
messages sent has doubled even though there is only one acceptor. The results from
the Message Passing simulation will demonstrate more clearly how the increased
number of messages affects performance.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Commits Commits Commits
10 13,425 - - -
11 4,117,047 183,164 689 -
12 3,922,992 4,273,609 576,692 13,763
13 176,032 2,998,716 4,138,385 1,083,933
14 1,220 210,189 2,273,830 3,777,826
15 2 3,117 198,343 1,725,434
16 - 15 4,646 172,130
17 - - 35 5,540
18 - - - 60
34 1 - - -
Average 11.5182 12.4232 13.2929 14.1438
Table 16 Transaction commit times for Modified Paxos in fault free environment

The completion times are also similar to those for 2PC.

One Fault to Tolerate
3 RM 5 RM 7 RM 9 RM
events: 813,698,447 1,060,214,193 1,291,850,289 1,511,188,517
messages: 1,018,527,166 1,303,218,365 1,568,533,321 1,817,836,970
messagesFailed: - - - -
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 813,698,447 1,060,214,193 1,291,850,289 1,511,188,517
commits: 8,202,050 7,650,208 7,180,116 6,770,263
aborts: - - - -
virtualTime: 100,000,000,640 100,000,000,521 100,000,000,021 100,000,000,640
Table 17 Results for Modified Paxos where it tolerates one fault in fault free environment

The results are still similar to 2PC and the results from Table 15, but because of the
fault tolerance there are now more processes and thus more messages.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Commits Commits Commits
10 8,686 1 - -
11 4,163,096 163,405 513 -
12 3,883,194 4,349,041 562,670 12,281
13 146,272 2,952,483 4,209,563 1,086,018
14 799 183,015 2,226,141 3,830,781
15 3 2,257 177,462 1,681,394
16 - 6 3,727 155,324
17 - - 39 4,408
18 - - 1 57
Average 11.5083 12.4133 13.2825 14.1322
Table 18 Transaction commit times for Modified Paxos with one fault tolerance in fault free
environment.
In comparison to Table 16 there are fewer commits in the 10 ms range for the 3 RM
simulation. In the fault tolerant mode, due to the extra Acceptors, Modified Paxos
sends more messages; this can be seen from the increase in messages from
725,074,531 (Table 15), where faults are not tolerated, to 1,018,527,166 where one
fault is tolerated. This increase in messages causes fewer transactions to commit in
the 10 ms range.

5.3.2 Message Passing

The results for the simulation performed with message passing cannot be compared
directly against the results for the simulated message passing. The simulations have
been implemented in different languages and the processing time will be different for
the two run-time environments.

The JPVM messaging libraries add a degree of delay on top of the network
messaging. The study by Stankovic et al. [39] compares various Java-based message
passing libraries, and according to them JPVM on a PC has a message latency of 74 ms.
Similar results to the ones presented in the following section were achieved by setting
the mean message delay in the simulated message passing to 1.4 ms, in comparison to
the 100 ms used for all the tests presented in the previous section. The results for 2PC
with 3 RMs and a mean message delay of 1.4 ms in the fault free environment are
shown in the appendix.

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
TotalSimulation: 5,061 3,859 3,233 2,768
Messages: 131,562 154,102 174,156 187,486
MessageDeadLetter: 0 0 0 0
MessagesFailed: 0 0 0 0
ApplicationMessages: 131,559 154,098 174,153 187,483
processFailures: 0 0 0 0
Aborts: 0 4 2 4
Commits: 5,060 3,850 3,223 2,756
Time: 200,000,000 200,000,000 200,000,000 200,000,000
Table 19 Results for 2PC in fault free environment

The results show that there are almost no aborts in 2PC in the fault free environment.
The number of transactions completed reduces as more RMs are introduced.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 2 - - - - - - -
10 4 - - - - - - 1
20 374 - 3 - - - - -
30 1,321 - 175 - 13 - - -
40 1,436 - 701 - 249 - 21 -
50 804 - 1,076 - 681 - 177 -
60 296 - 745 - 943 - 576 -
70 125 - 344 - 527 - 722 -
80 71 - 113 - 263 - 434 -
90 47 - 64 - 133 - 230 -
100-150 76 - 140 - 168 2 265 2
151-200 3 - 13 - 11 - 32 1
201-299 1 - 13 - 3 - 20 1
300-450 1 1 1 - 1 - 3 2
MaxTime - - - - - - - 1
Table 20 Transaction completion times for 2PC in fault free environment

The commit times increase as more RMs are used for the transaction.

Modified Paxos

No Faults To Tolerate
3 RM 5 RM 7 RM 9 RM
TotalSimulation: 3,811 2,885 2,504 1,982
Messages: 135,775 151,732 174,175 174,846
MessageDeadLetter: - - - -
MessagesFailed: - - - -
ApplicationMessages: 130,923 144,742 164,695 163,606
processFailures: - - - -
Aborts: 34 39 53 22
Commits: 3,777 2,846 2,451 1,960
Table 21 Results for Modified Paxos in fault free environment with no faults to tolerate

The results show that, in the same amount of time as 2PC, fewer transactions are
completed. A few aborts can be seen in these results; the results in Table 22 show that
the majority of these happen as soon as the transaction starts. The leader election
algorithm has a small lead time before a Leader is elected. If a Client sends a “Begin
Transaction” message during this time, the transaction will be aborted since there is
no Leader to process the message; this is the cause of the aborts at the beginning of
the simulation.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 32 - 35 - 39 - 12
20 11 - - - - - - -
30 207 - 109 1 - - - -
40 842 - 3 - 52 - - -
50 1,512 - 509 - 407 - 35 -
60 749 - 1,018 - 659 - 261 1
70 295 - 628 - 488 - 465 -
80 90 - 281 - 298 - 376 -
90 24 - 102 - 171 1 246 -
100-150 34 1 187 - 320 - 487 -
151-200 9 1 5 2 32 1 53 -
201-299 4 3 1 16 5 25 2
300-399 - - 1 - 8 3 12 4
400-1100 - - - - - 4 - 2
MaxTime - - - - - - - 1
Table 22 Transaction completion times for Modified Paxos in fault free environment with no
faults to tolerate

The average transaction commit time is higher in Modified Paxos than in 2PC.

One Fault Tolerated
3 RM 5 RM 7 RM 9 RM
TotalSimulation: 2,383 1,691 1,173 849
Messages: 128,564 132,167 124,512 116,185
MessageDeadLetter: - - - -
MessagesFailed: - - - -
ApplicationMessages: 123,890 125,277 115,609 105,404
processFailures: - - - -
Aborts: 39 34 33 54
Commits: 2,344 1,657 1,140 795
Table 23 Results for Modified Paxos in fault free environment with one fault to tolerate

The number of messages sent is similar to the results in Table 21, but the number of
transactions committed is lower. As there are more processes in the transaction
system, there are more messages.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 26 - 20 - 21 - 17
30 21 1 - 1 - - - -
40 121 - - - - - - -
50 451 - 10 - - - - -
60 603 - 59 - - - - -
70 386 - 259 - 6 - - -
80 197 - 348 - 37 - - -
90 117 - 258 - 107 - 1 -
100-150 232 - 420 - 524 - 236 1
151-200 88 5 71 2 89 2 96 -
201-299 125 4 214 - 331 2 360 6
300-399 3 - 18 3 45 5 96 29
400-1100 - 2 - 4 1 3 6 17
MaxTime - 1 - 4 - - - 1
Table 24 Transaction completion times for Modified Paxos in fault free environment

The average transaction completion time is higher than that in Table 22, which is due
to more messages being sent to complete a transaction.

5.4 Message Failures

The simulations for Paxos and modified Paxos without fault tolerance have only been
run in the fault free system to compare the overhead added by the protocol without
adding any fault tolerance.

In this scenario both the messages and the application fail randomly. The application
failure causes the transaction to abort.

5.4.1 Simulated Message Passing

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
events: 237,792,016 333,676,955 415,392,133 485,930,299
messages: 341,093,184 471,919,348 583,473,266 679,815,538
messagesFailed: 197,968 282,704 354,747 417,390
messagesDeadLetter: 0 0 0 0
processFailures: 0 0 0 0
applicationMessages: 237,792,016 333,676,955 415,392,133 485,930,299
commits: 7,914,251 7,237,814 6,680,896 6,208,084
aborts: 24,255 36,768 46,941 56,512
virtualTime: 100,000,004,883 100,000,003,901 100,000,000,204 100,000,000,026
Table 25 Results for 2PC where messages fail

The results show that not all message failures cause the transaction to abort. As was
described earlier, the various processes set timeouts to make sure that they receive a
timely response; when such a timeout fires, the process resends the message.

The results from Table 26 show that message failures add more variance to the
transaction completion times. The transactions that take longer than 50 ms to
complete have done so after one of the processes has received a timeout.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 10,084 - 9,320 - 8,434 - 7,853
1 - 12,472 - 16,325 - 15,257 - 14,304
2 - 1,230 - 9,279 - 14,413 - 14,616
3 - 16 - 1,029 - 6,771 - 12,396
4 - - - 9 - 726 - 4,933
5 - - - 1 - 17 - 586
6 - - - - - - - 22
10 28,679 - 2 - - - - -
11 4,575,292 - 267,568 - 1,280 - 1 -
12 3,027,043 - 4,302,961 - 702,093 - 19,690 -
13 95,575 - 2,273,627 - 3,839,792 - 1,169,205 -
14 495 - 122,619 - 1,678,556 - 3,280,466 -
15 - - 1,518 - 119,402 - 1,236,229 -
16 - - 11 - 2,307 - 104,211 -
17 - - 1 - 17 - 2,814 -
18 - - - - - - 37 -
20 - 77 - 72 - 83 - 54
21 - 191 - 246 - 252 - 244
22 - 30 - 231 - 383 - 396
23 - - - 27 - 246 - 472
24 - - - - - 46 - 239
25 - - - - - - - 39
26 - - - - - - - 1
30 557 - - - - - - -
31 100,618 - 10,205 - 104 - - -
32 80,797 - 157,093 - 37,460 - 1,751 -
33 3,182 - 91,819 - 194,430 - 78,555 -
34 16 - 5,874 - 90,204 - 212,364 -
35 1 - 78 - 7,198 - 83,248 -
36 - - - - 159 - 7,553 -
37 - - - - 3 - 206 -
38 - - - - - - 3 -
50 22 58 - 47 - 64 - 49
51 967 75 173 80 10 53 - 57
52 946 10 2,411 78 817 98 62 62
53 46 - 1,671 4 4,393 63 2,211 87
54 - - 139 1 2,348 12 6,335 53
55 - - 3 - 213 - 2,619 10
56 - - - - 4 - 275 -
57 - - - - - - 16 -
70 1 - - 1 - - - 1
71 7 3 3 2 - 1 - 1
72 6 - 23 2 10 6 - -
73 - - 12 1 59 1 46 1
74 - - 3 - 35 - 132 5
75 - - - - 2 - 46 -
76 - - - - - - 7 -
MaxTime - 6 - 4 - 2 - 4
Table 26 Transaction completion times for 2PC with message failures

Paxos Commit
3 RM 5 RM 7 RM 9 RM
events: 335,574,102 451,771,097 587,991,410 743,112,556
messages: 449,801,508 573,448,269 717,368,941 880,183,141
messagesFailed: 279,976 388,178 516,156 663,316
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 335,574,102 451,771,097 587,991,410 743,112,556
commits: 649,050 397,736 279,220 211,221
aborts: 959,942 1,037,250 1,074,915 1,095,748
virtualTime: 100,000,000,625 100,000,003,229 100,000,002,903 100,000,000,179
Table 27 Results for Paxos Commit with message failures

The results in the table above cannot be compared directly to the results for 2PC with
message failure. Paxos has more failures due to the Registrar's instance of the Paxos
algorithm not finishing before the RMs start sending their messages; this was
explained earlier under No Faults to Tolerate in section 5.3.1.

The results can be compared to the results from Table 7. There are a large number of
message failures in this simulation, but the number of commits does not decrease by
much in the Paxos case.

The results from Table 28 show that most of the aborts happen after the Paxos
Commit Timeout or the Transaction Timeout has fired. This shows that the algorithm
is unable to progress any further due to a message that failed to be delivered; when
the timeout fires it triggers the algorithm to complete by aborting the transaction.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 1,938 - 1,789 - 1,612 - 1,587
1 - 2,286 - 3,035 - 2,986 - 2,848
2 - 253 - 1,744 - 2,804 - 2,876
3 - 3 - 218 - 1,390 - 2,497
4 - - - 2 - 139 - 1,101
10 399 - - - - - - -
11 310,505 - 8,117 - 27 - - -
12 298,299 - 215,897 - 22,396 - 426 -
13 10,902 - 137,858 - 151,643 - 33,406 -
14 57 - 7,781 - 72,261 - 105,110 -
15 - - 100 - 5,115 - 41,432 -
16 - - - - 110 - 3,547 -
20 - 16 - 12 - 13 - 10
21 - 28 - 47 - 48 - 42
22 - 5 - 45 - 76 - 81
23 - - - 6 - 42 - 115
24 - - - - - 7 - 46
31 10,825 - 1,239 - 25 - - -
32 8,881 - 12,888 - 3,806 - 306 -
33 377 - 6,565 - 12,661 - 6,590 -
34 5 - 416 - 4,936 - 11,465 -
35 - - 5 - 333 - 3,624 -
36 - - - - 10 - 276 -
37 - - - - - - 11 -
50 - 9 - 13 - 13 - 10
51 105 6 25 6 2 15 - 11
52 104 - 216 2 106 7 9 11
53 3 - 126 - 339 1 208 11
54 - - 10 - 135 - 412 3
55 - - - - 10 - 166 -
60 29 - - - - - - -
61 5,125 - 381 - 6 - - -
62 3,185 - 4,159 - 913 - 25 -
63 90 - 1,714 - 3,200 - 1,259 -
64 2 - 75 - 938 - 2,107 -
65 - - - - 57 - 643 -
80 1 47,440 - 69 - - - -
81 79 719,529 12 186,646 - 5,264 - 25
82 70 125,139 93 658,125 32 330,623 2 31,701
83 4 1,368 49 113,525 109 551,826 65 432,458
84 - 4 1 2,586 38 93,415 87 445,559
85 - - - 19 4 3,173 29 75,099
86 - - - - - 35 2 3,313
100 - 1,155 - 7 - - - 1
101 2 12,292 - 7,131 - 456 - 3
102 - 1,768 1 16,797 - 15,485 - 2,940
103 - 11 - 2,207 - 18,181 1 23,825
104 - - - 37 - 2,365 - 17,780
105 - - - - - 50 - 2,352
121 - 63 - 48 - 1 - -
122 - 15 - 143 - 161 - 35
123 - - - 16 - 214 - 321
124 - - - 3 - 56 - 336
MaxTime - 3 - 1 - 1 - 2
Table 28 Transaction completion time for Paxos Commit with message failures

Durable Write Optimised
3 RM 5 RM 7 RM 9 RM
events: 919,921,563 1,338,663,511 1,756,689,457 2,169,857,296
messages: 1,128,648,188 1,579,497,748 2,020,703,273 2,451,066,592
messagesFailed: 784,538 1,172,932 1,571,715 1,968,840
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 919,921,563 1,338,663,511 1,756,689,457 2,169,857,296
commits: 7,076,660 6,119,092 5,345,294 4,722,463
aborts: 43,881 50,512 54,776 57,701
virtualTime: 100,000,002,118 100,000,000,001 100,000,002,250 100,000,000,032
Table 29 Results for Paxos Commit with Durable Write optimisation where environment has
message failures

From Table 29, compared to the results from Table 11, there are considerably more
aborts and fewer events take place. Comparing the results with those for 2PC in
Table 25, there are fewer commits and more aborts; this is due to the high volume of
message failures.

Looking at the transaction completion times in Table 30, it can be observed that most
of the transaction failures occur either during the initiation of the transaction or after
the Paxos Commit Timeout has fired.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 8,912 - 7,818 - 6,946 - 5,992
1 - 10,860 - 13,933 - 12,195 - 10,801
2 - 1,122 - 7,858 - 11,772 - 10,994
3 - 11 - 859 - 5,393 - 9,499
4 - - - 17 - 577 - 3,714
5 9,741 - - - - 22 - 462
6 3,643,655 - 162,903 - 591 - 1 15
7 3,099,992 - 3,422,936 - 475,888 - 12,171 -
8 115,697 - 2,081,143 - 2,938,810 - 797,696 -
9 663 - 123,763 - 1,382,673 - 2,362,813 -
10 2 - 1,542 - 105,432 - 925,320 -
11 - - 4 - 2,301 - 81,872 -
12 - - 1 - 15 - 2,422 -
20 - 73 - 56 - 43 - 46
21 - 143 - 229 - 192 - 182
22 - 25 - 182 - 332 - 339
23 - - - 23 - 184 - 364
24 - - - 1 - 31 - 196
25 1,052 - 2 - - 1 - 31
26 109,084 - 19,172 - 598 - 7 2
27 90,224 - 194,604 - 72,136 - 6,831 -
28 4,153 - 100,799 - 250,501 - 148,388 -
29 33 - 6,251 - 97,442 - 272,144 -
30 - - 93 - 7,406 - 87,302 -
31 - - 1 - 187 - 7,441 -
32 - - - - 2 - 235 -
46 1,114 - 288 - 14 - 1 -
47 1,176 - 3,377 - 1,660 - 200 -
48 49 - 2,020 - 6,542 - 4,697 -
49 - - 141 - 2,702 - 9,111 -
50 - 59 3 54 230 43 3,200 34
51 - 26 - 60 5 44 296 45
52 - - - 13 - 36 6 48
53 - - - - - 15 - 31
66 6 - - - - - - -
67 10 - 25 - 20 - 2 -
68 1 - 18 - 90 - 45 -
69 - - 4 - 41 - 178 -
70 - - - - 2 - 65 -
80 1 1,297 - 1 - - - -
81 3 17,763 - 3,522 - 89 - -
82 - 3,102 2 13,023 - 5,444 2 431
83 - 44 - 2,276 5 9,104 4 6,295
84 - - - 45 - 1,587 2 6,229
85 - - - - - 55 - 1,125
86 - - - - - - - 47
100 - 34 - 7 - 7 - 6
101 - 191 - 87 - 1 - -
102 - 35 - 189 - 154 - 20
103 - - - 25 - 195 - 200
104 - - - 1 - 19 - 182
105 - - - - - 1 - 15
MaxTime - 5 - 8 - 7 - 6
Table 30 Transaction completion times for Paxos Commit with disk write optimisation where
message failures occur

Phase 2 Messages Bundled and Durable Write Optimised
3 RM 5 RM 7 RM 9 RM
events: 785,328,355 999,760,413 1,174,570,796 1,319,795,058
messages: 996,320,766 1,248,271,518 1,453,718,686 1,624,430,607
messagesFailed: 649,239 828,121 974,153 1,096,716
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 785,328,355 999,760,413 1,174,570,796 1,319,795,058
commits: 7,220,376 6,446,662 5,832,799 5,328,286
aborts: 45,185 53,102 59,392 65,382
virtualTime: 100,000,000,016 100,000,002,821 100,000,000,082 100,000,000,012
Table 31 Results for Paxos commit with optimisation and bundling in message failure
environment

In the results where message bundling is done in addition to the disk write
optimisation, more transactions are processed. The average commit-to-abort ratio is
higher; this is due to fewer messages being sent, and thus fewer message failures.

The results for transaction completion time do not show any new trend.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 9,148 - 8,119 - 7,306 - 6,838
1 - 11,255 - 14,335 - 13,343 - 12,191
2 - 1,096 - 8,312 - 12,768 - 12,287
3 - 9 - 928 - 5,893 - 10,805
4 - - - 14 - 686 - 4,254
5 11,188 - - - - 20 - 489
6 3,794,588 - 178,730 - 744 - - 19
7 3,131,281 - 3,704,782 - 540,250 - 14,088 1
8 114,495 - 2,201,117 - 3,335,021 - 947,264 -
9 684 - 128,056 - 1,553,213 - 2,834,228 -
10 1 - 1,559 - 116,585 - 1,105,656 -
11 - - 6 - 2,467 - 97,052 -
12 - - - - 24 - 2,829 -
13 - - - - - - 34 -
20 - 76 - 72 - 45 - 55
21 - 152 - 225 - 223 - 189
22 - 31 - 218 - 297 - 362
23 - 1 - 32 - 231 - 394
24 - - - - - 36 - 194
25 281 - - - - 3 - 24
26 81,708 - 7,056 - 57 - - 4
27 80,430 - 129,792 - 28,027 - 1,318 -
28 3,953 - 85,528 - 162,299 - 61,444 -
29 25 - 6,029 - 80,641 - 175,238 -
30 - - 90 - 6,876 - 72,382 -
31 - - 2 - 172 - 7,046 -
32 - - - - 3 - 258 -
46 773 - 106 - - - - -
47 891 - 2,130 - 566 - 35 -
48 54 - 1,520 - 3,581 - 1,708 -
49 1 - 115 - 1,980 - 5,053 -
50 - 64 3 48 194 56 2,183 55
51 - 21 - 56 6 43 257 51
52 - 1 - 21 - 33 11 47
53 - 1 - 10 - 47
67 5 - 16 - 8 - 1 -
68 - - 15 - 43 - 28 -
69 - - - - 31 - 104 -
70 - 1 - - 5 - 51 1
80 4 1,326 - 5 - - - -
81 3 18,228 2 3,867 - 90 - 1
82 - 3,283 4 13,906 3 5,952 3 507
83 - 36 1 2,293 3 9,872 3 7,140
84 - - - 61 - 1,694 1 7,279
85 - - - - - 63 - 1,239
86 - - - - - 3 - 58
100 - 27 - 9 - 4 - 2
101 - 192 - 90 - 7 - -
102 - 42 - 205 - 182 - 18
103 - - - 37 - 198 - 229
104 - - - - - 23 - 202
105 - - - - - - - 25
MaxTime - 7 - 3 - 2 - 6
Table 32 Transaction completion times for Paxos commit with optimisation and bundling in
environment with message failures

Modified Paxos
3 RM 5 RM 7 RM 9 RM
events: 790,150,978 1,019,150,566 1,231,535,029 1,431,284,485
messages: 988,798,494 1,251,061,846 1,491,696,179 1,715,871,661
messagesFailed: 707,649 926,577 1,131,767 1,323,517
messagesDeadLetter: - - - -
processFailures: - - - -
applicationMessages: 790,150,978 1,019,150,566 1,231,535,029 1,431,284,485
commits: 7,703,072 7,034,314 6,479,874 6,013,461
aborts: 23,445 35,504 46,032 54,604
virtualTime: 100,000,000,077 100,000,000,438 100,000,000,436 100,000,000,007
Table 33 Results for Modified Paxos in environment with message failures

The results for Modified Paxos in comparison to 2PC in Table 25 are very promising,
as Modified Paxos is able to achieve a higher commit-to-abort ratio than 2PC (328:1
compared to 326:1 for 2PC) despite having more message failures. Modified Paxos,
due to its increased number of processes, sends more messages; as the message
failure rate is fixed, there are more message failures, but this does not translate into
more aborts.

Looking at the transaction completion times in Table 34 it can be seen that most of the
Aborts occur during the transaction initialisation stage and thus resources do not get
blocked.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 9,708 - 8,931 - 8,281 - 7,826
1 - 11,937 - 15,667 - 14,819 - 13,777
2 - 1,191 - 8,952 - 14,191 - 13,998
3 - 23 - 978 - 6,561 - 11,788
4 - - - 14 - 779 - 4,781
5 - - - - - 13 - 540
6 - - - - - - - 34
10 7,884 - - - - - - -
11 3,786,984 - 143,674 - 406 - -
12 3,546,132 1 3,808,874 - 474,603 - 10,096 -
13 133,515 - 2,589,966 - 3,555,308 2 886,365 -
14 674 - 160,837 - 1,881,399 2 3,128,608 1
15 48 - 2,063 - 151,035 1 1,374,510 -
16 20 1 36 - 3,064 - 127,228 -
17 - - 15 - 47 - 3,694 -
18 - - - - 8 - 100 -
19 - - - - 1 - 5 -
20 - 76 - 66 - 50 - 48
21 - 176 - 221 - 249 - 232
22 - 28 - 236 - 364 - 421
23 - - - 36 - 252 - 456
24 - - - - - 32 - 233
25 - - - - - 3 - 24
30 1,966 - 5 - - - - -
31 121,490 2 20,048 - 655 - 5 -
32 97,373 1 187,760 1 60,803 2 5,436 1
33 4,469 - 107,919 - 228,574 2 113,320 -
34 31 - 7,276 - 105,142 1 244,822 -
35 - - 121 - 8,877 - 95,412 -
36 2 - 3 - 241 - 8,967 -
37 - - - - 3 - 283 -
50 19 74 - 64 - 48 - 49
51 1,250 96 332 73 6 48 - 40
52 1,134 22 3,137 115 1,388 94 164 59
53 54 - 2,010 17 5,424 102 3,410 101
54 1 - 162 - 2,519 22 7,430 78
55 - - 3 - 225 - 3,025 12
56 1 - - - 6 - 310 2
71 10 3 1 - - 3 - -
72 9 - 41 5 12 1 - 3
73 - - 26 - 67 3 53 2
74 - - 2 - 41 - 139 4
75 - - - - 3 - 52 1
76 - - - - 1 - 8 -
100 - 8 - 12 - 9 - 1
MaxTime - 13 - 10 - 8 - 13
Table 34 Transaction completion times for Modified Paxos in message failure environment

5.4.2 Message Passing

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
TotalSimulation: 2,891 2,356 1,992 1,827
Messages: 58,417 65,495 66,290 73,328
MessageDeadLetter: - - - -
MessagesFailed: 587 679 698 690
ApplicationMessages: 57,830 64,811 65,582 72,637
processFailures: - - - -
Aborts: 847 957 983 961
Commits: 1,925 1,315 924 788
Table 35 Results for 2PC where there are message failures

Comparing the results for 2PC in the failure free environment (Table 19) with those
where there are message failures shows that fewer transactions are processed in the
same amount of time; this signifies that the overall transaction completion time has
increased.

The results from Table 36 show that all aborts happen in the over-100 ms time range.
This is in line with the previous observation that the transaction completion times
increase: due to message failures, resources wait for timeouts to fire and only then do
the transactions complete, either by aborting or committing.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
10 1 - - - - - - -
20 84 - - - - - - -
30 280 - 18 - - - - -
40 484 - 128 - 7 - - -
50 410 - 263 - 53 - 7 -
60 234 - 250 - 106 - 26 -
70 106 - 178 - 144 - 70 -
80 50 - 132 - 159 - 107 -
90 38 - 67 - 105 - 131 -
100-150 69 292 104 254 166 184 242 155
151-200 108 392 88 572 45 604 46 597
201-300 49 50 73 58 125 99 128 128
301-400 12 10 14 5 11 16 27 16
401-1000 - 1 - - 3 3 4 -
MaxTime - 102 - 68 - 77 - 65
Table 36 Transaction completion times for 2PC where there are message failures

Modified Paxos
3 RM 5 RM 7 RM 9 RM
TotalSimulation: 787 781 830 903
Messages: 42,850 57,344 70,706 85,084
MessageDeadLetter: - - - -
MessagesFailed: 355 449 568 680
ApplicationMessages: 37,572 49,611 60,090 72,682
processFailures: - - - -
Aborts: 250 305 400 513
Commits: 537 476 430 390
Table 37 Results for Modified Paxos in message failure environment

The results seen for Modified Paxos are similar to those seen for 2PC: there are fewer
transactions processed in the same amount of time. In 2PC 29% of all transactions
fail; for Modified Paxos 31% of all transactions fail. There is an increased number of
aborts because more messages are sent per transaction.

The transaction completion times show the same trend as that for 2PC.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 32 - 28 - 42 - 39
10 - 5 - 10 - 5 - 3
20 1 6 - 4 - 8 - 7
30 4 - - - - 1 - 2
40 24 1 - - - 1 - -
50 84 - 3 - - 1 - -
60 97 1 24 - 3 - - -
70 84 - 35 - 9 1 1 1
80 37 - 59 1 28 1 3 -
90 26 - 70 - 40 - 10 -
100 16 - 40 - 36 1 13 -
101-150 35 37 75 31 122 35 112 32
151-200 53 96 31 135 37 186 47 245
201-300 64 12 116 16 131 13 149 58
301-400 4 13 15 28 17 23 46 41
401-900 2 3 3 4 3 4 3 13
MaxTime 6 44 4 48 4 78 6 72
Table 38 Transaction completion times for Modified Paxos where message failures occur

5.5 Process and Message Failures

5.5.1 Simulated Message Passing

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
events: 254,119,729 337,951,033 406,717,685 465,655,469
messages: 352,643,763 467,833,442 561,614,537 641,732,005
messagesFailed: 193,745 269,912 331,547 383,229
messagesDeadLetter: 13,806,947 13,037,167 13,255,445 13,551,204
processFailures: 4,986 6,849 8,849 10,842
applicationMessages: 240,302,810 324,900,168 393,444,542 452,082,581
commits: 7,520,712 6,754,906 6,092,870 5,560,481
aborts: 79,476 126,009 174,216 210,260
virtualTime: 100,000,002,330 100,000,002,616 100,000,004,351 100,000,000,201
Table 39 Results for 2PC with all failures

In the case where messages fail to be delivered and processes fail, the 2PC algorithm
has 8% fewer commits in comparison to the non-failure case. The average transaction
completion time increases.

Results in Table 40 show that the majority of transactions abort in the 50 millisecond
range; this reduces the number of transactions that can be processed.

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 9,688 - 8,556 - 7,954 - 7,292
1 - 11,343 - 15,269 - 14,275 - 13,028
2 - 1,136 - 8,550 - 13,201 - 13,150
3 - 18 - 961 - 6,124 - 11,084
4 - - - 18 - 702 - 4,482
5 - - - - - 17 - 525
6 - - - - - - - 23
10 27,209 - 3 - - - - -
11 4,349,561 10 250,122 - 1,127 - - -
12 2,874,721 9 4,015,499 8 637,130 4 17,824 -
13 90,458 - 2,122,125 8 3,504,207 15 1,045,231 4
14 496 - 113,964 1 1,530,501 4 2,939,691 11
15 1 - 1,349 - 109,123 - 1,106,813 3
16 - - 7 - 2,196 - 93,404 -
17 - - - - 12 - 2,484 -
18 - - - - - - 34 -
20 - 65 - 69 - 59 - 49
21 - 165 - 234 - 232 - 227
22 - 30 - 257 - 375 - 369
23 - - - 41 - 229 - 423
24 - - - 1 - 38 - 222
25 - - - 2 - 31
30 521 - 1 - - - - -
31 95,987 1 9,632 1 94 - - -
32 76,694 2 146,490 4 34,221 1 1,586 -
33 3,093 - 85,630 1 177,517 8 70,835 2
34 26 - 5,680 - 82,765 2 190,218 6
35 - - 77 - 6,624 - 75,119 2
36 - - - - 157 - 6,722 -
37 - - - - 2 - 216 -
38 - - - - - - 2 -
50 15 41,426 1 40,198 - 42,895 - 42,160
51 968 15,409 170 38,199 8 41,365 - 39,337
52 890 505 2,388 12,280 766 34,099 55 37,723
53 37 4 1,562 628 4,081 9,751 1,911 28,407
54 - - 138 9 1,999 746 5,564 7,636
55 - - 3 - 192 15 2,294 652
56 - - - - 8 - 245 20
70 - 201 - 214 - 250 - 219
71 6 222 3 607 - 752 - 667
72 6 17 14 355 8 1,022 2 1,217
73 1 - 23 28 59 411 42 1,107
74 - - 1 - 35 53 94 422
75 - - - - 2 2 61 45
91 - 4 - 1 - 4 - 3
92 - - - 6 - 12 - 13
93 - - - - 1 12 - 31
99 - 32 - 24 - 30 - 30
100 - 238 - 323 - 334 - 392
101 - 14 - 76 - 105 - 85
102 - - - 12 - 38 - 86
103 - - - 2 - 9 - 27
MaxTime - 13,737,576 - 12,926,904 - 13,105,754 - 13,370,326
Table 40 Transaction completion times for 2PC with all failures

If the registrar fails, the simulation keeps sending messages to the Client to start a new
transaction. The elapsed time for these transactions is -1, as a transaction id is never
issued. This causes the high number in the MaxTime row; it is not accounted for in
the analysis that shows 8% fewer commits for 2PC with failures in comparison to the
non-failure case.

Paxos Commit
3 RM 5 RM 7 RM 9 RM
events: 349,727,127 453,892,002 571,076,067 701,908,917
messages: 461,796,497 572,789,962 696,626,844 834,196,622
messagesFailed: 273,702 372,085 481,265 605,132
messagesDeadLetter: 15,584,255 15,344,182 16,335,484 16,069,384
processFailures: 8,243 9,832 12,184 13,661
applicationMessages: 334,120,336 438,518,378 554,702,331 685,793,967
commits: 598,744 361,128 246,481 181,991
aborts: 946,265 1,015,820 1,047,609 1,065,816
virtualTime: 100,000,000,512 100,000,000,611 100,000,000,013 100,000,002,261
Table 41 Results for Paxos commit with all failures

Paxos Commit does have fewer Commits than it does in the fault free environment,
but comparatively more transactions finish in the 50ms time range.

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 1,889 - 1,741 - 1,571 - 1,533
1 - 2,281 - 2,918 - 2,765 - 2,539
2 - 199 - 1,657 - 2,573 - 2,691
3 - 2 - 198 - 1,175 - 2,257
4 - 9 - 118 - 968
5 - 4 - 94
10 385 -
11 286,272 - 7,551 - 15 -
12 275,361 - 195,995 - 19,535 - 386 -
13 10,315 - 124,823 - 134,546 - 28,806 -
14 59 - 7,137 - 63,365 - 91,142 -
15 89 - 4,661 - 35,043 -
16 1 - 91 - 3,160 -
17 87 -
20 - 11 - 11 - 14 - 7
21 - 34 - 45 - 37 - 43
22 - 7 - 51 - 68 - 80
23 - 5 - 32 - 82
24 - 7 - 33
30 93 -
31 9,819 - 1,171 - 24 -
32 7,940 - 11,765 - 3,356 - 254 -
33 379 - 5,938 - 11,208 - 5,475 -
34 5 - 332 - 4,322 - 10,047 -
35 12 - 312 - 3,018 -
36 10 - 249 -
50 1 21,504 - 21,077 - 22,287 - 20,102
51 117 7,289 22 18,338 - 20,794 - 20,399
52 83 140 183 5,170 79 17,460 10 20,800
53 4 1 119 204 278 4,498 174 13,953
54 11 3 115 215 368 3,063
55 15 5 104 198
56 4 3
60 30 1
61 4,801 22 371 1 5 -
62 2,886 10 3,841 16 841 3 25 -
63 62 - 1,532 5 2,664 7 1,028 5
64 1 - 68 - 826 1 1,863 -
65 1 - 50 - 519 2
66 29 -
70 - 97 - 105 - 105 - 89
71 1 100 - 259 - 354 - 340
72 1 4 - 135 1 485 - 622
73 1 - 2 6 3 170 4 568
74 2 7 6 148
75 5 10
80 1 45,495 - 61
81 72 688,700 10 174,661 - 4,826 - 19
82 50 119,474 105 616,736 22 301,261 - 28,956
83 4 1,292 48 106,767 104 503,669 45 388,805
84 - 3 1 2,375 28 85,823 99 400,705
85 - 15 1 2,959 26 67,084

86 - - - - - 30 3 3,095
87 - - - - - - - 56
100 - 1,933 - 771 - 874 - 868
101 1 11,767 - 6,365 - 459 - 22
102 - 1,571 - 15,600 - 14,224 - 2,606
103 - 14 - 2,069 - 16,670 - 21,656
104 - 49 1 2,156 1 15,744
105 - 56 - 2,042
106 - 1 - 77
121 - 58 - 41 - 4
122 - 10 - 147 - 147 - 32
123 - 27 - 199 - 293
124 - 27 - 261
125 - 1 - 49
MaxTime - 13,950,195 - 13,457,235 - 14,114,659 - 13,562,897
Table 42 Transaction completion times for Paxos Commit with all failures

Durable Write Optimised


3 RM 5 RM 7 RM 9 RM
events: 887,387,691 1,256,146,821 1,604,444,892 1,941,779,866
messages: 1,086,847,088 1,483,137,252 1,849,502,063 2,199,650,887
messagesFailed: 739,356 1,082,457 1,416,823 1,738,772
messagesDeadLetter: 15,102,425 15,392,213 16,144,742 16,561,429
processFailures: 7,968 9,900 11,822 14,030
applicationMessages: 872,263,390 1,240,724,912 1,588,262,600 1,925,172,855
commits: 6,540,268 5,549,421 4,739,031 4,111,643
aborts: 102,443 123,312 144,801 161,772
virtualTime: 100,000,000,024 100,000,000,002 100,000,000,699 100,000,001,771
Table 43 Results for Paxos commit with disk optimisation

The percentage of Aborts in Paxos Commit with the optimisation is considerably higher; Paxos Commit appears to be affected by failures much more than 2PC.

A higher number of transactions complete in the 80 ms time range. Even though individual transactions complete in less time than under 2PC, the overall transaction throughput is lower in Paxos Commit.

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 8,254 - 7,121 - 6,168 - 5,418
1 - 10,175 - 12,760 - 10,988 - 9,766
2 - 1,011 - 7,076 - 10,566 - 9,717
3 - 12 - 750 - 4,695 - 8,421
4 - - - 19 - 542 - 3,244
5 8,976 - - - - 22 - 376
6 3,364,252 - 147,800 - 604 - - 19
7 2,867,332 - 3,101,466 1 420,205 - 10,504 1
8 107,560 - 1,888,432 - 2,602,269 - 690,963 -
9 642 - 112,494 - 1,228,157 - 2,056,438 -
10 1 - 1,358 - 94,014 - 808,751 -
11 - - 7 - 2,008 - 71,311 -
12 - - - - 32 - 2,197 -
13 - - - - 49 -
20 - 69 - 51 - 39 - 29
21 - 150 - 174 - 192 - 148
22 - 21 - 153 - 281 - 281
23 - 1 - 33 - 182 - 332
24 - - - 2 - 26 - 170
25 927 - 1 - - 2 - 21
26 101,023 - 17,793 - 512 - 3 1
27 83,420 - 177,172 - 64,315 - 5,917 -
28 3,953 - 91,636 - 223,403 - 129,605 -
29 30 - 5,693 - 86,729 - 237,621 -
30 - - 82 - 6,579 - 76,002 -
31 - - - - 147 - 6,443 -
32 - - - - 2 - 205 -
33 - - - - - - 2 -
46 1,007 - 283 - 8 - - -
47 1,053 - 3,078 - 1,466 - 173 -
48 66 - 1,892 - 5,799 - 4,089 -
49 1 - 160 - 2,445 - 8,118 -
50 - 20,780 2 21,198 183 21,115 2,724 21,025
51 - 7,092 - 18,446 5 21,751 263 19,589
52 - 133 - 5,598 - 17,195 2 20,404
53 - - - 196 - 4,270 - 14,238
54 - - - 1 - 220 - 3,210
55 - - - - - 5 - 195
56 - - - - - - - 5
66 7 - 2 - - - - -
67 6 - 42 - 17 - 1 -
68 1 - 20 - 86 - 65 -
69 - - 2 - 33 - 123 -
70 - 93 1 123 4 111 57 107
71 - 86 - 301 - 367 12 352
72 - 3 - 171 - 514 - 603
73 - - - 11 - 161 - 534
74 - - - - - 11 - 154
75 - - - - - - - 12

80 1 2,891 - 5 - - - -
81 4 42,707 - 9,031 1 225 - -
82 - 7,448 2 32,059 5 14,362 - 1,230
83 - 72 1 5,624 1 23,539 2 17,174
84 - 1 - 132 - 4,089 1 17,955
85 - - - 1 - 119 - 3,021
86 - - - - - 1 - 147
100 - 317 - 314 - 346 - 335
101 - 657 - 326 - 70 - 55
102 - 100 - 776 - 658 - 173
103 - - - 110 - 764 - 981
104 - - - 2 - 95 - 688
105 - - - - - 7 - 94
MaxTime - 13,376,582 - 13,443,074 - 13,983,736 - 14,115,387

Table 44 Transaction completion times for Paxos commit with disk optimisation

Phase 2 Messages Bundled and Durable Write Optimised


3 RM 5 RM 7 RM 9 RM
events: 765,621,165 948,821,459 1,088,325,072 1,195,644,921
messages: 967,156,910 1,182,997,570 1,346,971,872 1,473,087,033
messagesFailed: 612,197 769,048 884,978 973,640
messagesDeadLetter: 15,909,022 15,066,268 15,424,586 16,574,630
processFailures: 8,077 9,849 11,906 14,192
applicationMessages: 749,689,921 933,725,694 1,072,862,827 1,179,024,140
commits: 6,670,780 5,855,641 5,176,481 4,617,180
aborts: 102,291 124,688 148,942 171,703
virtualTime: 100,000,003,072 100,000,000,253 100,000,001,698 100,000,001,926
Table 45 Results for Paxos commit with optimisation and message bundling

The results are slightly better than those in Table 43: there are fewer message failures, which in turn increases the number of Commits, and the percentage of failures is lower.

The transaction completion times do not show any new trends.

time 3 RM 5 RM 7 RM 9 RM
in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 8,410 - 7,642 - 6,720 - 6,111
1 - 10,456 - 13,270 - 12,220 - 10,893
2 - 1,011 - 7,516 - 11,447 - 10,886
3 - 12 - 832 - 5,314 - 9,222
4 - - - 17 - 636 - 3,663
5 10,281 - 1 - - 18 - 442
6 3,504,474 - 162,762 - 618 - - 24
7 2,894,468 - 3,363,715 - 479,351 - 12,103 -
8 106,246 - 2,000,268 - 2,957,904 - 820,938 -
9 589 - 116,345 - 1,379,870 1 2,456,162 -
10 1 - 1,475 - 103,850 - 958,813 -
11 - - 6 - 2,250 - 83,610 -
12 - - - - 25 - 2,562 -
13 - - 1 - - - 19 -
20 - 46 - 54 - 48 - 44
21 1 164 - 229 - 197 1 164
22 - 22 - 199 - 322 - 322
23 - - - 23 - 216 - 376
24 - - - 1 - 28 - 184
25 246 - 1 - - 2 - 25
26 75,026 - 6,259 - 59 - 1 -
27 74,067 - 117,953 - 24,692 - 1,045 -
28 3,654 - 77,827 - 143,739 - 53,256 -
29 33 - 5,399 - 72,052 - 151,863 -
30 - - 92 - 6,214 - 62,324 -
31 - - 1 - 171 - 6,159 -
32 - - - - 1 - 214 -
46 731 - 95 - - - - -
47 875 - 1,878 - 505 - 42 -
48 59 - 1,387 - 3,165 - 1,427 -
49 - - 133 - 1,745 - 4,263 -
50 - 21,569 1 21,051 188 21,222 2,008 21,613
51 - 6,821 - 17,909 3 21,287 212 21,646
52 - 134 - 5,383 - 16,348 11 21,143
53 - - - 199 - 4,280 - 14,295
54 - - - 1 - 207 - 3,220
55 - - - - - 7 - 173
56 - - - - - - - 7
66 6 - 1 - - - - -
67 10 - 18 - 5 - 1 -
68 1 - 14 - 31 - 15 -
69 - - 4 - 29 - 83 -
70 - 106 - 114 8 101 35 130
71 - 97 - 328 - 367 3 369
72 - 3 - 147 - 478 - 588
73 - - - 14 - 167 - 590
74 - - - - - 3 - 165
75 - - - - - - - 17
80 2 2,785 - 2 - - - -
81 6 41,792 1 9,125 - 258 - 2
82 1 7,399 2 32,413 - 14,822 - 1,329
83 - 71 1 5,710 3 24,803 4 18,031

84 - - - 137 - 4,141 1 18,672
85 - - - - - 135 - 3,124
86 - 130
100 - 326 - 321 - 336 - 391
101 - 648 - 376 - 82 - 73
102 - 105 - 783 - 647 - 173
103 - 1 - 122 - 784 - 996
104 - 2 - 113 - 732
105 - 2 - 101
121 - 2 - 3 - 1 - 2
122 - 6 - 9 - 1
123 - 16 - 15
124 - 2 - 18
125 - 2
MaxTime - 14,269,470 - 13,176,138 - 13,287,196 - 14,220,126

Table 46 Transaction completion times for Paxos commit with optimisation and message
bundling

Modified Paxos
3 RM 5 RM 7 RM 9 RM
events: 790,263,290 1,019,166,544 1,231,649,454 1,431,074,240
messages: 988,918,831 1,251,057,789 1,491,818,447 1,715,584,636
messagesFailed: 707,432 926,763 1,131,622 1,324,560
messagesDeadLetter: 955,547 1,527,474 2,151,195 2,727,824
processFailures: 10 12 14 16
applicationMessages: 789,301,813 1,017,629,338 1,229,484,155 1,428,328,511
commits: 7,703,861 7,033,394 6,480,145 6,011,188
aborts: 23,362 35,438 45,999 54,371
virtualTime: 100,000,002,016 100,000,000,154 100,000,001,959 100,000,000,765
Table 47 Results for Modified Paxos

Modified Paxos appears more resilient to failures. The percentage of Aborts is considerably lower than for 2PC, even though there are more message failures. This behaviour is explained by the very small number of process failures.

The majority of the transactions complete in the 10 ms time range.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 9,752 - 8,940 - 8,153 - 7,633
1 - 11,723 - 15,424 - 14,714 - 13,687
2 - 1,164 - 8,960 - 14,204 - 13,745
3 - 9 - 992 - 6,537 - 11,997
4 - - - 17 - 738 - 4,730
5 - - - - - 17 - 513
6 - - - - - - - 25
10 7,787 - - - - - - -
11 3,788,316 - 143,436 - 422 - - -
12 3,545,546 1 3,807,072 - 473,670 - 10,038 -
13 134,389 - 2,589,663 - 3,556,595 - 885,653 -
14 759 - 161,503 - 1,882,533 - 3,126,315 1
15 50 - 1,988 - 150,448 - 1,373,988 -
16 19 1 54 - 3,120 - 126,913 -
17 1 - 15 - 67 - 3,610 -
18 - - - - 8 - 69 -
19 - - - 1 - - 7 -
20 - 67 - 59 - 51 - 46
21 - 158 - 298 - 237 - 222
22 - 26 - 239 - 416 - 378
23 - - - 34 - 255 - 417
24 - - - 1 - 36 - 250
25 - - - - - 5 - 32
26 - - - - - - - 1
30 1,949 - 3 - - - - -
31 120,920 - 20,098 - 597 - 3 -
32 97,126 - 188,008 1 59,874 - 5,569 -
33 4,462 - 108,415 1 228,979 2 113,444 -
34 24 - 7,255 - 104,677 - 245,417 1
35 5 - 105 - 8,980 - 96,234 -
36 - - 1 - 201 - 8,997 -
37 - - - - 3 - 262 -
50 19 154 - 112 - 109 - 80
51 1,250 147 311 127 14 116 1 73
52 1,165 21 3,209 117 1,411 134 144 119
53 56 - 2,064 13 5,603 122 3,461 172
54 1 - 129 1 2,553 19 7,535 98
55 - - 5 - 239 1 2,962 16
56 - - - - 11 - 280 -
57 - - - - - - 16 -
70 - - - 1 - 3 - -
71 4 4 2 1 - 3 - -
72 10 - 37 2 18 7 2 4
73 - - 17 - 79 6 65 7
74 - - 1 - 39 1 132 5
75 - - - - - - 50 1
76 - - - - - - 11 -
MaxTime - 70,932 - 70,208 - 73,294 - 18,540
Table 48 Transaction completion time for Modified Paxos

5.5.2 Message Passing

Two Phase Commit


3 RM 5 RM 7 RM 9 RM
TotalSimulation: 2,830 2,295 2,019 1,904
Messages: 56,684 61,556 66,074 69,828
MessageDeadLetter: 57 76 67 104
MessagesFailed: 554 640 657 708
ApplicationMessages: 56,070 60,839 65,349 69,016
processFailures: 7 13 12 13
Aborts: 844 1,011 1,006 1,122
Commits: 1,874 1,206 922 703
Table 49 Results for 2PC with all failures

The results for 2PC with all failures show that the throughput of the transaction system is reduced, and that the numbers of transaction Commits and Aborts are similar to those in the message-failure-only simulation.

Transaction completion times have also increased: in Table 36 there were a few transactions finishing in the 100-150 ms range, whereas in Table 50 there are none in this range.

time in 3 RM 5 RM 7 RM 9 RM
ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 2 - 1 - 4 - 2
10 2 - - - - - - -
20 91 - - - - - - -
30 291 - 16 - 2 - - -
40 403 - 84 - 35 - 4 -
50 382 - 226 - 149 - 35 -
60 209 - 259 - 208 - 94 -
70 115 - 181 - 149 - 107 -
80 76 - 110 - 86 - 102 -
90 50 - 64 - 49 - 70 -
100-149 96 - 92 - 86 2 133 -
150-199 94 691 56 839 32 763 30 824
200-299 57 53 106 88 121 134 120 210
300-599 8 8 12 11 5 26 8 9
MaxTime - 90 - 72 - 77 - 77
Table 50 Transaction completion times for 2PC with all failures

Modified Paxos
3 RM 5 RM 7 RM 9 RM
TotalSimulation: 1,235 1,032 808 881
Messages: 61,493 68,881 61,282 73,928
MessageDeadLetter: 173 56 322 439
MessagesFailed: 559 604 518 581
ApplicationMessages: 55,745 61,096 51,054 61,407
processFailures: 12 9 25 30
Aborts: 395 461 509 569
Commits: 840 571 299 312
Table 51 Results for Modified Paxos with all failures

The trend in the results is similar to that seen when Modified Paxos was run in the environment with only message failures, discussed in the Modified Paxos section on page 103.

3 RM 5 RM 7 RM 9 RM
time in ms Commits Aborts Commits Aborts Commits Aborts Commits Aborts
0 - 37 - 32 - 30 - 41
10 - 5 - 4 - 6 - 1
20 1 12 - 7 - 2 - 5
30 14 - - 5 - 1 - 7
40 101 1 1 1
50 161 - 10 - - 2
60 167 - 39 2 1 -
70 116 - 57 - 7 1
80 47 - 60 - 15 - 1 -
90 28 - 50 - 18 - 6 -
100 12 - 21 - 13 - 7 -
101-150 38 87 79 20 76 12 72 15
151-200 74 155 78 254 20 256 32 296
201-300 74 2 153 26 126 72 153 88
301-400 2 24 15 35 14 43 33 27
401-900 2 5 5 4 3 10 5 13
MaxTime 3 66 3 70 4 76 2 74
Table 52 Transaction completion times for Modified Paxos

5.6 Disk Write Analysis

5.6.1 Paxos

The Phase 2b message is sent directly to all the interested parties. Where the message send is optimised by sending the message concurrently with the disk write, the acceptors receiving the Phase 2b message act as durable storage. In case of failure, the acceptors can be queried instead of reading from the disk.
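
As a rough illustration of this optimisation, the sketch below issues the durable write and the Phase 2b broadcast at the same time instead of one after the other. This is a minimal sketch in Java, assuming hypothetical DurableLog and Messenger interfaces; none of these names are taken from the simulation code.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-ins for the simulation's storage and transport layers.
interface DurableLog { void append(Object record); }
interface Messenger { void send(int destination, Object message); }

class OptimisedAcceptor {
    private final DurableLog log;       // local durable storage
    private final Messenger messenger;  // message transport
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    OptimisedAcceptor(DurableLog log, Messenger messenger) {
        this.log = log;
        this.messenger = messenger;
    }

    // Record the Phase 2b decision and announce it concurrently.
    void acceptAndAnnounce(Object phase2b, List<Integer> interestedParties) {
        // Start the durable write and the broadcast at the same time,
        // rather than writing first and sending afterwards.
        CompletableFuture<Void> diskWrite =
                CompletableFuture.runAsync(() -> log.append(phase2b), pool);
        CompletableFuture<Void> broadcast = CompletableFuture.runAsync(
                () -> interestedParties.forEach(p -> messenger.send(p, phase2b)),
                pool);
        // The recipients double as durable storage: if this acceptor fails
        // before its disk write completes, the decision can be recovered by
        // querying any process that received the Phase 2b message.
        CompletableFuture.allOf(diskWrite, broadcast).join();
    }
}

The saving comes from overlapping the disk latency with the message latency; the acceptor no longer waits for the write to finish before the rest of the protocol can make progress.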

5.6.2 Modified Paxos

Each acceptor sends only a single prepared message to the leader, so this send cannot be optimised in the same way, as there is not enough redundancy. The leader does not need to write to durable storage because every acceptor durably writes its individual decision.

5.7 Summary

The analysis of the results can be summarised as follows:

• The average transaction completion time increases as more RMs are used for a transaction.

• For Paxos Commit, the delayed completion of the Registrar’s instance of Paxos introduces a large number of Aborts into the results.

• The message bundling optimisation improves performance by approximately 1% where Paxos Commit is set to tolerate one fault.

• As fault tolerance is introduced to Paxos Commit, the number of messages sent increases because more processes participate in the same transaction; this increases the average transaction completion time.

• The durable write optimisation in Paxos Commit halves the transaction completion time compared with Paxos Commit without the optimisation, and improves on 2PC by approximately 46% in the fault-free environment.

• In the environment with failures, Modified Paxos is the best performing algorithm in the simulated message passing tests, with a better commit-to-abort ratio than 2PC, but not in the JPVM simulation. The average transaction commit time is quickest for the 2PC protocol.

6. Conclusions and Further Work
The overall scope of this work was to investigate and implement a simulation of a
complete transaction system that uses the Paxos Commit algorithm to achieve atomic
commit in a distributed transaction environment. A simulation for Two Phase Commit
(2PC) was to be implemented and the results of 2PC and Paxos Commit were to be
compared.

Optimisations were suggested for Paxos Commit, which were to be tested in the simulation. An alternative algorithm was also designed to overcome the problems faced with Paxos Commit. The algorithms were to be compared on how they differ under perfect conditions and on how each of them performs when the environment is faulty.

Two simulation frameworks, the simulated message passing framework and the JPVM based message passing framework, were designed, and various configurations were run on them. An alternative algorithm was designed that used the basic principles of the Paxos Commit algorithm but did not use Paxos’s phase-based messaging or the Registrar process.

The results demonstrated that under failure-free conditions 2PC is faster than Paxos Commit and Modified Paxos. This is true even when the two Paxos consensus based commit algorithms are run with fault tolerance switched off. The disk write optimisation improves the performance of Paxos Commit for committed transactions by 46%. Modified Paxos, when compared with unoptimised Paxos Commit, is 4.5% faster for committed transactions. Modified Paxos also has the advantage of not using a single Transaction Manager process; the current leader of the acceptor processes acts as the Transaction Manager.

This study has examined the performance and behaviour of 2PC, Paxos Commit and Modified Paxos in environments with varying degrees of failure. After analysing this data, the following conclusions can be presented.

6.1 Conclusions

The 2PC protocol that has been simulated blocks the Resource Manager (RM) if the RM is in the prepared state and the Transaction Manager (TM) fails: the RM waits until the TM recovers from the failure. In the Paxos Commit protocol, one of the backup coordinators takes over the role of the TM and completes the execution.

The various processes in the transaction system set timeouts to keep track of responses from other processes. The Client and the TM set a timeout for the overall transaction to ensure that the system does not block indefinitely: if the transaction has not completed after a certain amount of time, it is cancelled. A request timeout is set before certain messages are sent; depending on the local state of the process, the transaction may then be aborted or the request re-sent.
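
A minimal sketch of how these two kinds of timeout could be set, assuming a hypothetical pair of callbacks (abortTransaction and onNoResponse); the actual simulation code may organise this differently.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch only: the callbacks passed in are hypothetical, not the simulation's API.
class TimeoutHandling {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    // Overall transaction timeout set by the Client or the TM: if the
    // transaction has not completed when the timer fires, it is cancelled.
    ScheduledFuture<?> startTransactionTimeout(long millis, Runnable abortTransaction) {
        return timer.schedule(abortTransaction, millis, TimeUnit.MILLISECONDS);
    }

    // Request timeout set just before a message is sent: depending on the
    // local state of the process, onNoResponse re-sends the request or aborts.
    ScheduledFuture<?> startRequestTimeout(long millis, Runnable onNoResponse) {
        return timer.schedule(onNoResponse, millis, TimeUnit.MILLISECONDS);
    }
}

If the expected response arrives in time, the returned ScheduledFuture is cancelled so that the abort or re-send never runs.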

If any process other than the coordinator fails, the transaction fails. The only redundancy available is in the commit decision-making process; the protocol does not provide redundancy for the RM, Client and Registrar processes, so a failure of one of these processes results in the transaction not being successful.

A failed RM causes subsequent transactions to abort until the RM recovers.

In the past, disk storage and the network had separate tasks: disk storage was used for storing local data and networks for communication between independent computers. With advances in technology the difference between the two is disappearing, which has led to network-attached storage devices [40]. This advance is exploited in the suggested optimisation: the disk write is performed concurrently with the messaging, improving the transaction commit time considerably. In the Paxos Commit algorithm, redundancy is used to achieve fault tolerance; here that redundancy is also used to improve performance, by essentially treating the additional processes as durable storage. The disk write before sending messages normally ensures that the protocol can restart at the correct point after a failure. Sending the message to multiple processes replicates this behaviour: as the likelihood of all the processes failing together is very low, the protocol can retrieve its status from one of these processes instead of from the disk. If all the processes do fail together, no other process can act on the decision, which is lost with them.
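
The recovery side of this idea could look roughly like the following sketch; the Peer interface and the fall-back rule are illustrative assumptions, not taken from the thesis simulation.

import java.util.List;
import java.util.Optional;

// Hypothetical view of a process that may have received the decision message.
interface Peer {
    Optional<String> lastDecisionFor(long transactionId);
}

class RecoveryByRedundancy {
    // After a restart, ask the processes that were sent the decision before
    // falling back to whatever made it to the local disk.
    static Optional<String> recoverDecision(long txId, List<Peer> recipients) {
        for (Peer peer : recipients) {
            Optional<String> decision = peer.lastDecisionFor(txId);
            if (decision.isPresent()) {
                return decision;   // one surviving recipient is enough
            }
        }
        // Only if every recipient has also failed does recovery depend on
        // whether the local disk write completed before the crash.
        return Optional.empty();
    }
}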

The Paxos Commit protocol is not a non-blocking protocol; it is a fault-tolerant protocol. It tolerates only the failure of the TM, by having multiple coordinators that may perform the tasks required of the TM.

If the RM and an Acceptor are on the same node, a node failure leads to the transaction being aborted, as the RM fails too. The simulation produces more process failures when there are more nodes, because each node fails according to the same mean time to process failure.
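
As a back-of-the-envelope illustration of this effect, if every node fails independently with the same mean time to failure, the expected number of process failures in a run grows linearly with the number of nodes. The figures below are made-up values chosen only to show the shape of the calculation, not the simulation's parameters.

// Illustrative only: the MTTF and run length are assumed values.
class FailureRateExample {
    public static void main(String[] args) {
        double mttfMs = 10_000_000;        // assumed mean time to failure per node
        double runLengthMs = 100_000_000;  // assumed length of a simulation run
        for (int nodes : new int[] {3, 5, 7, 9}) {
            // Expected failures ~ nodes * runLength / MTTF for independent nodes.
            double expectedFailures = nodes * runLengthMs / mttfMs;
            System.out.printf("%d nodes -> about %.0f expected process failures%n",
                    nodes, expectedFailures);
        }
    }
}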

The results show that the optimisation of writing to disk concurrently with sending the message gives a considerable performance advantage; it improves on 2PC with respect to the average time taken per transaction commit, with transactions committing in approximately 6 ms with the disk write optimisation compared with approximately 11 ms for 2PC. In addition to being quicker than 2PC, Paxos Commit is an NB-AC protocol. With the performance improvement achieved, and coordinator failure not affecting transaction commit, Paxos Commit appears to be an algorithm that could be used to advantage in transaction processing systems.

The message passing simulation does show considerably slower performance for Paxos Commit compared with 2PC. This is most likely due to the simulation being executed in Java and the additional overhead of messaging using the JPVM library, which was measured to be approximately 14 times slower than the constant message delay used in the simulated message passing framework.

Modified Paxos performed approximately 4.5% better than Paxos Commit when the two are compared on average transaction commit time without the disk optimisation. If the disk optimisation were implemented in Modified Paxos, the algorithm would be an ideal choice for transaction processing, as Registrar failures do not block the whole transaction system.

6.2 Further Work

6.2.1 Simulation Analyser

A log of all messages sent and received could be maintained, together with the exact time of each process failure. A tool could then be built to analyse this log and present a graphical representation of transaction failures in relation to process failures.
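
One possible shape for such a log is sketched below; the field names are hypothetical and would have to match whatever the simulation actually records.

import java.util.ArrayList;
import java.util.List;

// Sketch of a log entry the proposed analyser tool could consume.
class MessageLogEntry {
    final long virtualTimeMs;    // when the event occurred in simulated time
    final long transactionId;    // -1 if no transaction id was ever issued
    final String sender;         // e.g. "TM", "RM-2", "Acceptor-1"
    final String receiver;
    final String messageType;    // e.g. "PREPARE", "PREPARED", "COMMIT"
    final boolean delivered;     // false for failed or dead-letter messages

    MessageLogEntry(long virtualTimeMs, long transactionId, String sender,
                    String receiver, String messageType, boolean delivered) {
        this.virtualTimeMs = virtualTimeMs;
        this.transactionId = transactionId;
        this.sender = sender;
        this.receiver = receiver;
        this.messageType = messageType;
        this.delivered = delivered;
    }
}

// In-memory log; exact process-failure times could be recorded alongside it.
class MessageLog {
    private final List<MessageLogEntry> entries = new ArrayList<>();
    void record(MessageLogEntry entry) { entries.add(entry); }
    List<MessageLogEntry> entries() { return entries; }
}

A tool reading such a log could group entries by transaction id, overlay the recorded process failures, and plot transaction aborts against the failures that preceded them.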

The tool could be used to drill into each and every transaction that was processed, see
the messages that were sent and received by each of the processes, and if any
messages failed.

The tool could also be used to fine-tune the timeout values. It would enable the user to view each message transmitted and received, and to place the failure of a process or a message on the transaction's timeline; the user could then compare average message response times and set the timeout values accordingly.

The tool would also enable the user to find the cause of a transaction abort. The process that initiated the message causing the abort could be identified, as could the case where the transaction aborted because messages about the prepared state of one or more RMs were not received. Identifying the exact cause of each transaction failure would help to expose any shortfalls in the protocol. In the simulated framework only the total number of transaction failures is accumulated; fault identification would allow the failures to be classified by cause, and hence allow an exact measurement of the improvement provided by the NB-AC protocol.

6.2.2 Low Latency Non-Blocking Atomic Commit

The Low-Latency Non-Blocking Atomic Commit (NBAC) protocol suggested by Jiménez-Peris et al. [7] has not, according to the literature search, been implemented. This algorithm requires a virtually synchronous environment with a uniform broadcast primitive.

The protocol utilises coordinator and Resource Manager uniform multicast groups. It also uses the concepts of optimistic commit and of concurrent messaging and disk flushing. Uniform multicast is slow but essential; in the protocol it is performed in parallel with other essential tasks, such as recording the final commit decision and informing all the processes of it, thus reducing the latency of the protocol.

The optimisation used for Paxos Commit performs the disk write in parallel with the messaging, improving the performance of the protocol considerably. This optimisation is similar to the technique described for the Low Latency NBAC.

A simulation of the Low Latency NBAC could be constructed, taking into consideration the performance impact of introducing virtual synchrony and uniform multicast. The simulation could be run on a framework similar to the one used for this study, where transactions are started and the overall time is calculated from when all the processes finish with each transaction.

The results could be compared with the results of this study to determine how a specialist protocol that uses a virtually synchronous environment with a uniform multicast primitive to reduce latency compares with a protocol designed specifically for an asynchronous environment.

6.2.3 Modified Paxos

Further variations of Modified Paxos can be made; for example, the ready-to-commit messages could be sent to all the parties instead of only to the Leader. Each Acceptor's decision would then be stored with all the other processes and could be retrieved from them in case of failure. This would allow each acceptor to send the message without waiting for the disk write to finish.

As seen in the results for Paxos Commit with the disk optimisation, sending messages concurrently with disk writes provides a considerable performance improvement. Modified Paxos performed better than Paxos Commit when the optimisation was not used, so with this optimisation it should also perform better than Paxos Commit with the disk optimisation. Modified Paxos has the added advantage of no downtime associated with the failure of the Registrar process, as it uses the current Leader for Transaction Management functions.

6.2.4 Disk Write Optimisation

The disk write optimisation that has been suggested and tested provides a performance improvement of up to 46% in the simulations with the various system-environment parameters used. The informal analysis of a pathological case does not show any drawbacks in the suggested approach; a formal analysis of this optimisation should still be performed.

7. References
1. Transaction processing, Wikipedia, the free encyclopedia., [Online],
Available: http://en.wikipedia.org/wiki/Transaction_processing, 2005,
(Accessed 10 October 2005)
2. Babaoglu, O. and S. Toueg, Understanding non-blocking atomic commitment,
1993, University of Bologna, TR UBLCS-93-2, Available:
http://citeseer.ist.psu.edu/babaoglu93understanding.html
3. Gray, J. Notes on Data Base Operating Systems. in Operating Systems, An
Advanced Course. 1978: R. Bayer, et al. eds., Lecture Notes In Computer
Science Vol. 60, p. 393-481. Available:
http://research.microsoft.com/%7EGray/papers/DBOS.doc
4. Skeen, D. Nonblocking commit protocols. in Proceedings of the ACM
SIGMOD International Conference on Management of Data. 1981, ACM
Press, p. 133-142.
5. Fischer, M.J. and M. Merritt, Appraising two decades of distributed
computing theory research. Distributed Computing, 2003. 16(2-3): p. 239-247.
6. Gray, J. and L. Lamport, Consensus on Transaction Commit, 2004, Microsoft
Research, MSR-TR-2003-96, Available:
http://research.microsoft.com/research/pubs/view.aspx?tr_id=701, (Accessed 3
February 2004)
7. Jiménez-Peris, R., et al. A Low-Latency Non-Blocking Commit service. in
Proceedings of the 15th International Conference on Distributed Computing.
2001, Lecture Notes In Computer Science Vol. 2180, Springer-Verlag, p. 93-
107.
8. Guerraoui, R. and P. Kouznetsov, On the Weakest Failure Detector for Non-
Blocking Atomic Commit in Proceedings of the IFIP 17th World Computer
Congress - TC1 Stream / 2nd IFIP International Conference on Theoretical
Computer Science: Foundations of Information Technology in the Era of
Networking and Mobile Computing 2002 Kluwer, B.V. p. 461-473
9. Guerraoui, R., Non-blocking atomic commit in asynchronous distributed
systems with failure detectors. Distributed Computing, 2002. 15(1): p. 17-25.
10. Lamport, L., Paxos Made Simple. ACM SIGACT news distributed computing
column 5, 2001. 32(4): p. 51-58.
11. Guerraoui, R., Revisiting the relationship between non-blocking atomic
commitment and consensus, in Proceedings of the 9th International Workshop
on Distributed Algorithms. 1995, Springer-Verlag. p. 87-100.
12. Chandra, T.D., V. Hadzilacos, and S. Toueg, The weakest failure detector for
solving consensus. J. ACM, 1996. 43(4): p. 685-722.
13. Fischer, M.J., N.A. Lynch, and M.S. Paterson, Impossibility of distributed
consensus with one faulty process. J. ACM, 1985. 32(2): p. 374-382.

14. Hadzilacos, V., On the relationship between the atomic commitment and
consensus problems, in Fault-tolerant distributed computing, B. Simons and
A. Spector, Editors. 1990, Springer-Verlag. p. 201-208. Springer Lecture
Notes In Computer Science; Vol. 448
15. Schneider, F.B., What good are models and what models are good?, in
Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM
Press/Addison-Wesley Publishing Co. p. 17-26.
16. Reddy, P.K. and M. Kitsuregawa, Reducing the blocking in two-phase commit
with backup sites. Inf. Process. Lett., 2003. 86(1): p. 39-47.
17. Gray, J., Personal Communication: Implementation of Paxos Commit
Algorithm, 25/03/2004.
18. Gray, J. and A. Reuter, Transaction processing: concepts and techniques.
1993: Morgan Kaufmann Publishers.
19. Chrysanthis, P.K. and K. Ramamritham, ACTA: The SAGA Continues, in
Database Transaction Models for Advanced Applications, A.K. Elmagarmid,
Editor. 1992, Morgan Kaufmann Publishers. p. 349 - 397.
20. Biliris, A., et al. ASSET: A System for Supporting Extended Transactions. in
International Conference on Management of Data. 1994. Minneapolis,
Minnesota: R.T. Snodgrass and M. Winslett eds., ACM Sigmod, p. 44-54.
21. Garcia-Molina, H. and K. Salem. Sagas. in Proceedings of the 1987 ACM
SIGMOD international conference on Management of data. 1987. San
Francisco, California, United States, International Conference on Management
of Data, ACM Press, p. 249-259.
22. Prasad, K.H., T.K. Nayak, and R.K. Ghosh. DiET: a distributed extended
transaction processing framework. in Third International Conference on High-
Performance Computing. 1996. Trivandrum, INDIA, p. 114-119.
23. Moss, J.E.B., Nested transactions: an approach to reliable distributed
computing. 1981, Massachusetts Institute of Technology.
24. C. Davies, J., Data processing spheres of control. IBM Systems Journal, 1978.
17(2): p. 179-198.
25. Singh, M.P. Multiagent Systems as Spheres of Commitment. in International
Conference on Multiagent Systems (ICMAS) Workshop on Norms,
Obligations, and Conventions. 1996. Kyoto, Japan. Available:
http://www.csc.ncsu.edu/faculty/mpsingh/papers/mas/icmas-96-norms.pdf
26. Lampson, B.W., Atomic Transactions, in Distributed Systems - Architecture
and Implementation, An Advanced Course. 1981, Springer-Verlag. p. 246-265.
27. Kopetz, H. and P. Veríssimo, Real time and dependability concepts, in
Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM
Press/Addison-Wesley Publishing Co. p. 411-446.
28. Alpern, B. and F.B. Schneider, Defining Liveness. Information Processing
Letters, 1984. 21(4): p. 181-185. Available:
http://www.cs.cornell.edu/fbs/publications/85-650.ps
29. Gärtner, F.C., Fundamentals of fault-tolerant distributed computing in
asynchronous environments. ACM Computing Surveys, 1999. 31(1): p. 1-26.

30. Szentiványi, D., Performance and Availability Trade-offs in Fault-Tolerant
Middleware, in Department of Computer and Information Science. 2002,
Linköping University. p. 7-26. Available:
http://www.ep.liu.se/lic/science_technology/09/82/digest.pdf
31. Reynal, M., A short introduction to failure detectors for asynchronous
distributed systems. SIGACT News, 2005. 36(1): p. 53-70.
32. Keidar, I. and S. Rajsbaum, On the cost of fault-tolerant consensus when there
are no faults: preliminary version. SIGACT News, 2001. 32(2): p. 45-63.
33. Schiper, A. and A. Sandoz. Uniform Reliable Multicast in a Virtually
Synchronous Environment. in Proceedings of the 13th International
Conference on Distributed Computing Systems. 1993, p. 561-568. Available:
http://ieeexplore.ieee.org/iel2/912/7169/00287667.pdf?isnumber=&arnumber=
287667
34. Hadzilacos, V. and S. Toueg, Fault-tolerant broadcasts and related problems,
in Distributed systems (2nd Ed.), S. Mullender, Editor. 1993, ACM
Press/Addison-Wesley Publishing Co. p. 97-145.
35. Mehra, P. and S. Fineberg. Fast and flexible persistence: the magic potion for
fault-tolerance, scalability and performance in online data stores. in 18th
International Parallel and Distributed Processing Symposium. 2004, p. 206-.
Available: Hewlett-Packard Co., CA, USA
36. Aguilera, M.K., et al. Stable Leader Election. in Proceedings of the 15th
International Conference on Distributed Computing. 2001, Lecture Notes in
Computer Science Vol. 2180, Springer-Verlag, p. 108-122.
37. Adam Ferrari, JPVM: network parallel computing in Java. Concurrency:
Practice and Experience, 1998. 10(11-13): p. 985-992. Available: Department
of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
38. Gray, J., Personal Communication: Simulation Material for Paxos Commit,
25/03/2004.
39. Stankovic, N. and K. Zhang, An evaluation of Java implementations of
message-passing. Software—Practice & Experience, 2000. 30(7): p. 741-763.
40. Gibson, G.A. and R.V. Meter, Network attached storage architecture.
Communications of the ACM, 2000. 43(11): p. 37-45.

Appendix A Algorithm for Leader Election
Code for each process:
procedure StartRound(s)                                  { executed upon start of a new round }
    if p ≠ s mod n then send (START, s) to all           { bring all to new round }
    r ← s                                                { update current round }
    leader ← ⊥                                           { demote previous leader but do not elect leader quite yet }
    restart timer

on initialization:
    StartRound(0)
    start tasks 0 and 1

task 0:                                                  { leader/candidate sends OK every δ time }
    loop forever
        if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

task 1:
    upon receive (OK, k) with k = r do                   { current leader/candidate is active }
        if leader = ⊥ and received at least two (OK, k) messages
            then leader ← k mod n                        { now elect leader }
        restart timer
    upon timer > 2δ do                                   { timeout on current leader/candidate }
        StartRound(r + 1)                                { start next round }
    upon receive (OK, k) or (START, k) with k > r do
        StartRound(k)                                    { jump to round k }
    upon receive (OK, k) or (START, k) from q with k < r do
        send (START, r) to q                             { update process in old round }

Appendix B Result for 2PC with mean message
delay of 1.4ms
3 RM
events: 150,710
messages: 216,024
messagesFailed: -
messagesDeadLetter: -
applicationMessages: 150,710
commits: 5,023
aborts: -
virtualTime: 200,001,082
Table 53 Results for 2PC with 1.4ms mean message delay

time in 3 RM time in 3 RM
ms Commits ms Commits
20 1 40 280
21 2 41 199
22 5 42 185
23 10 43 182
24 27 44 153
25 44 45 107
26 50 46 71
27 108 47 73
28 132 48 64
29 153 49 32
30 208 50 26
31 265 51 25
32 318 52 28
33 328 53 9
34 336 54 10
35 308 55 9
36 346 56 2
37 335 57 5
38 298 61 1
39 286 62 1
64 1
Table 54 Transaction commit times for 2PC with 1.4ms mean message delay

Appendix C Average Transaction Commit Time for
2PC

[Chart: Average Transaction Commit Time. Y-axis: Average Time (ms), approximately 10 to 26 ms. X-axis: Number of Resource Managers, 0 to 36.]

