
UNIT IV SYNCHRONIZATION AND REPLICATION

Introduction - Clocks, events and process states - Synchronizing physical


clocks- Logical time and logical clocks - Global states – Coordination and
Agreement – Introduction - Distributed mutual exclusion – Elections –
Transactions and Concurrency Control– Transactions -Nested transactions –
Locks – Optimistic concurrency control - Timestamp ordering – Atomic
Commit protocols -Distributed deadlocks – Replication – Case study –
Coda.

Two mark Questions

1. Define clock skew and clock drift.


2. What is Coordinated Universal time?
3. How will you synchronize physical clock?
4. What is clock correctness?
5. What is a synchronized distributed system?
6. What is Network time protocol?
7. Explain why computer clock synchronization is necessary.
8. Define distributed mutual exclusion.
9. Differentiate Reliable multicast and IP multicast.
10. Explain on consensus problem.
11. Show what is the use of transaction?
12. Formulate the ACID properties.
13. Illustrate what is concurrency control? Give its use.
14. Show how will you make use of nested transaction? What are its rules?
15. Define deadlock.
16. Discuss what are the advantages and drawbacks of multi version timestamp
ordering in comparison with ordinary timestamp ordering?
17. Describe how flat and nested transaction differ from each other?
18. Formulate the need for atomic commit protocol.
19. Define the two phase commit protocol.
20. Analyse the distributed deadlocks.
21. Analyse and list the need for transaction status and intentions list entries in a
recovery file?
22. Define Linearizability and sequential consistency.
23. What is Coda file system?

16 Mark Questions

24. Distinguish and examine the process of active and passive replication model.
25. Describe in detail Cristian’s and the Berkeley algorithm for synchronizing
clocks.
26. Examine Briefly about global states
27. Design Flat transaction and nested transaction with example.
28. Explain detail about two phase commit protocol.
29. Examine on atomic commit protocol.
30. What is the goal of an election algorithm? Explain it in detail. (8)
31. Examine how mutual exclusion is handled in a distributed system. (8)
32. Summarize the internal and external synchronization of Physical clocks.(8)
33. Give Chandy and Lamport’s snapshot algorithm for determining the
global states of distributed systems. (8)
34. Discuss the use of NTP in detail.
35. Discuss that Byzantine agreement can be reached for three generals, with one
of them faulty, if the generals digitally sign their messages.
36. Examine a solution to reliable, totally ordered multicast in a synchronous
system, using a reliable multicast and a solution to the consensus problem.
37. Illustrate an example execution of the ring-based algorithm to show that
processes are not necessarily granted entry to the critical section in happened-
before order.
38. Summarize in detail about CODA.
39. Describe about Distributed dead locks.
40. Examine briefly about optimistic concurrency control.

*****ANSWERS ****

Clock drift and ‘clock drift rate’

Clock drift is the relative amount by which a computer clock differs from a perfect clock.
Computer clocks drift from perfect time and their drift rates differ from one another. Even if
the clocks on all computers in a DS are set to the same time, they will eventually
vary quite significantly unless corrections are applied.
Clock drift rate: the difference per unit of time from some ideal reference clock.
Ordinary quartz clocks drift by about 1 sec in 11-12 days (about 10^-6 secs/sec). High-
precision quartz clocks have a drift rate of about 10^-7 or 10^-8 secs/sec.

Clock skew
The difference between the times on two clocks (at any instant) is skew
Coordinated Universal Time (UTC)

International Atomic Time is based on very accurate physical clocks. UTC is an


international standard for time keeping. It is based on atomic time, but
occasionally adjusted to astronomical time. It is broadcast from radio stations on
land and satellite (e.g. GPS) Computers with receivers can synchronize their clocks
with these timing signals.

Synchronizing physical clock

External synchronization
– A computer’s clock Ci is synchronized with an external authoritative time
source S, so that:
– |S(t) - Ci(t)| < D for i = 1, 2, … N over an interval, I of real time
– The clocks Ci are accurate to within the bound D.

Internal synchronization
– The clocks of a pair of computers are synchronized with one another so that:
– |Ci(t) - Cj(t)| < D for i, j = 1, 2, … N over an interval, I of real time
– The clocks Ci and Cj agree within the bound D.

Clock correctness
A hardware clock, H, is said to be correct if its drift rate is within a bound x > 0
(e.g. x = 10^-6 secs/sec).

Synchronous distributed system


A synchronous distributed system is one in which the following bounds are
defined.
– the time to execute each step of a process has known lower and upper bounds
– each message transmitted over a channel is received within a known bounded time
– each process has a local clock whose drift rate from real time has a known bound

Cristian’s method for synchronizing clocks [External synchronization – for


asynchronous distributed systems]

Cristian [1989] suggested the use of a time server, connected to a device that receives
signals from a source of UTC, to synchronize computers externally. Upon request,
the server process S supplies the time according to its clock, as shown in the figure.
- A process p requests the time in a message mr, and receives the time value t
in a message mt.
- Process p records the total round-trip time Tround. A simple estimate of the
time to which p should set its clock is t + Tround/2.
- Let min be the minimum message transmission time. The earliest point at which S
could have placed the time in mt was min after p dispatched mr. The latest point at
which it could have done this was min before mt arrived at p. The time by S’s clock
when the reply message arrives is therefore in the range [t + min, t + Tround – min].
The width of this range is Tround – 2min, so the accuracy is ±(Tround/2 – min).
- The method achieves synchronization only if the observed round-trip times
between client and server are sufficiently short compared with the required
accuracy.
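A minimal Python sketch of the client-side estimate is given below; request_time_from_server is a hypothetical helper standing in for the mr/mt exchange with the time server.

import time

def cristian_sync(request_time_from_server):
    # request_time_from_server() is an assumed helper: it sends the request
    # message mr to the time server S and returns the time t carried in mt.
    t0 = time.monotonic()                # just before sending mr
    t = request_time_from_server()       # server's clock value t from mt
    t_round = time.monotonic() - t0      # observed round-trip time Tround
    estimate = t + t_round / 2           # set the clock to t + Tround/2
    # Given a known minimum one-way delay min_delay, the accuracy bound
    # would be +/- (t_round/2 - min_delay).
    return estimate, t_round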

Berkeley algorithm (Internal synchronization)

– An algorithm for internal synchronization of a group of computers


– A master polls to collect clock values from the others (slaves)
– The master uses round trip times to estimate the slaves’ clock values
– It takes an average (eliminating any above some average round trip time or with
faulty clocks)
– It sends the required adjustment to the slaves (better than sending the time which
depends on the round trip time)
– If master fails, can elect a new master to take over
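A small sketch of the master's averaging step, under the assumption that the slaves' clock readings have already been corrected for round-trip delay:

def berkeley_adjustments(master_time, slave_times, max_deviation):
    # Combine the master's and slaves' readings, ignoring clocks that
    # deviate too far from the average (treated as faulty).
    clocks = [master_time] + list(slave_times)
    average = sum(clocks) / len(clocks)
    good = [c for c in clocks if abs(c - average) <= max_deviation] or clocks
    target = sum(good) / len(good)
    # The master sends each machine its adjustment, not the new time itself.
    return [target - c for c in clocks]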

Network Time Protocol (NTP)


Cristian’s method and the Berkeley algorithm are intended for intranets
NTP: a time service for the Internet - synchronizes clients to UTC
 Reliability from redundant paths, scalable, authenticates time sources
 The synchronization subnet can reconfigure if failures occur, e.g.
– a primary that loses its UTC source can become a secondary
– a secondary that loses its primary can use another primary
NTP - synchronisation of servers

3 Modes of synchronization:
- Multicast: A server within a high speed LAN multicasts time to others which set
clocks assuming some delay (not very accurate)
- Procedure call: A server accepts requests from other computers (like Cristian’s
algorithm). Higher accuracy. Useful if no hardware multicast is available.
- Symmetric: Pairs of servers exchange messages containing time information
Used where very high accuracies are needed (e.g. for higher levels)

Logical time and logical clocks


Instead of synchronizing clocks, event ordering can be used
– If two events occurred at the same process pi (i = 1, 2, … N) then they occurred in
the order observed by pi.
– When a message, m is sent between two processes, send(m) happened before
receive(m)
“happened-before” relation: obtained by generalizing the above two relations
– denoted by →
– HB1, HB2 are formal statements of the above two relations
– HB3 means happened-before is transitive
Not all events are related by →. Consider events a and e occurring at different
processes with no chain of messages relating them: neither a → e nor e → a holds.
Such events are said to be concurrent, written a || e.
Example: a → b (at p1); c → d (at p2); b → c because of message m1; and d → f
because of message m2.
Lamport’s logical clocks

Lamport clocks are counters that are updated according to the happened-before
relationship between events.
A logical clock is a monotonically increasing software counter. It need not relate to a
physical clock
- Each process pi has a logical clock, Li which can be used to apply logical time
stamps to events
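A minimal sketch of the Lamport clock update rules (increment before each local or send event; on receipt, take the maximum of the local value and the message timestamp, then increment):

class LamportClock:
    def __init__(self):
        self.value = 0

    def local_event(self):
        self.value += 1                    # tick before each event at pi
        return self.value

    def send(self):
        return self.local_event()          # timestamp piggybacked on the message

    def receive(self, msg_timestamp):
        # on receipt, merge with the sender's timestamp, then tick
        self.value = max(self.value, msg_timestamp) + 1
        return self.value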

Vector clocks

Vector clocks are an improvement on Lamport clocks:


– we can tell whether two events are ordered by happened-before or are concurrent
by comparing their vector timestamps
Vector clock Vi at process pi is an array of N integers
- Each process keeps its own vector clock Vi ,used to timestamp local events
- Vi[i] is the number of events that pi has time stamped
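A sketch of the vector clock rules and of the comparison that distinguishes happened-before from concurrency (the number of processes N and the process index i are assumed to be known):

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n                   # one entry per process
        self.i = i                         # index of this process

    def local_event(self):
        self.v[self.i] += 1                # tick own entry before timestamping

    def send(self):
        self.local_event()
        return list(self.v)                # piggyback a copy on the message

    def receive(self, msg_vector):
        # merge componentwise, then tick own entry
        self.v = [max(a, b) for a, b in zip(self.v, msg_vector)]
        self.local_event()

def happened_before(u, w):
    # u -> w iff u <= w componentwise and u != w
    return all(a <= b for a, b in zip(u, w)) and u != w

# Two events are concurrent iff neither happened_before(u, w) nor happened_before(w, u).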

Global states

Global state means a set of process states + channel states.


It is sometimes desirable to store checkpoints of a distributed system to be able to
restart from a well-defined past state after a crash.
Detecting global properties
Distributed garbage collection: An object is considered to be garbage if there are no
longer any references to it anywhere in the distributed system. When we consider
such properties of a system, we must include the state of the communication channels
as well as the states of the processes.

Distributed deadlock detection: A distributed deadlock occurs when each of a


collection of processes waits for another process to send it a message, and where
there is a cycle in the graph of this ‘waits-for’ relationship.

Distributed termination detection: The problem here is how to detect that a


distributed algorithm has terminated. To see that this is not easy, consider a
distributed algorithm executed by two processes p1 and p2, each of which may
request values from the other. A process that appears passive may become active
again on receipt of a message from another process, so processes cannot simply be
observed in isolation.

Distributed debugging: Distributed systems are complex to debug. The variables


change as the program executes, but tracking their values in all the processes is
complicated.

Global states and consistent cuts


Determining a global state – the state of the collection of processes – is much harder to address.
The essential problem is the absence of global time. The approach is therefore to
assemble a meaningful global state from local states recorded at different real times.
The history of a process is the sequence of all events that take place in that process:
history(pi) = hi = <e_i^0, e_i^1, e_i^2, ...>

- A cut C can be represented by a curve in the time-process diagram which


crosses all process lines.
- C divides all events into PC (those that happened before the cut) and FC (the
future events that happen after the cut).
- Cut C is consistent if there is no message whose send event is in FC and
whose receive event is in PC.

The lattice of global states

A lattice represents a partial order. All consistent global states can be arranged in
the ‘lattice of global states’, and every possible execution of the system corresponds
to one path through this lattice.
SNAPSHOT algorithm analog: census taking

Chandy and Lamport [1985] describe a SNAPSHOT algorithm for determining


global states of DS. The goal is to record a set of process and channel states (a
snapshot) for a set of processes pi (i = 1, 2, …, N)
“Census taking in ancient kingdom”: want to take census counting all people, some
of whom may be traveling on highways

Census taking algorithm

- Close all gates into/out of each village (process) and count the people (record the
process state) in the village; these actions need not be synchronized with other villages.
- Open each outgoing gate and send an official with a red cap (the special marker
message).
- Open each incoming gate and count all travellers (record the channel state =
messages sent but not yet received) who arrive ahead of the official.
- Tally the counts from all villages.

Algorithm SNAPSHOT

- All processes are initially white; messages sent by white (red) processes are also
white (red).
- MSend [marker sending rule for process P]:
  – Suspend all other activities until done.
  – Record P’s state.
  – Turn red.
  – Send one marker over each output channel of P.
- MReceive [marker receiving rule for P], on receiving a marker over channel C:
  – If P is white: record the state of channel C as empty; invoke MSend.
  – Else: record the state of C as the sequence of white messages received since P
turned red.
  – Stop when a marker has been received on each incoming channel.
Snapshots taken by SNAPSHOT algorithm
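The marker rules can be sketched as follows in Python; send(channel, msg) and local_state() are assumed helpers supplied by the surrounding messaging layer.

MARKER = "MARKER"                          # the special marker message

class SnapshotProcess:
    def __init__(self, in_channels, out_channels, send, local_state):
        self.in_channels = set(in_channels)
        self.out_channels = out_channels
        self.send = send
        self.local_state = local_state
        self.state = None                  # recorded process state (None = still white)
        self.channel_state = {}            # recorded messages per incoming channel
        self.markers_seen = set()

    def record_and_turn_red(self):         # MSend: marker sending rule
        self.state = self.local_state()
        for c in self.out_channels:
            self.send(c, MARKER)

    def on_message(self, channel, msg):    # MReceive: marker receiving rule
        if msg == MARKER:
            self.channel_state.setdefault(channel, [])
            if self.state is None:         # P was white: record C as empty
                self.channel_state[channel] = []
                self.record_and_turn_red()
            self.markers_seen.add(channel)
            if self.markers_seen == self.in_channels:
                return self.state, self.channel_state   # local snapshot complete
        elif self.state is not None and channel not in self.markers_seen:
            # white message arriving after P turned red: part of the channel state
            self.channel_state.setdefault(channel, []).append(msg)
        return None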

COORDINATION AND AGREEMENT

In Distributed systems the computers must coordinate their actions correctly with
respect to shared resources.

Distributed mutual exclusion

- Distributed mutual exclusion is for resource sharing without conflicts.

- When a collection of processes share resources, mutual exclusion is needed to prevent
interference and ensure consistency (the critical section problem).

- No shared variables or facilities provided by a single local kernel can be used to
solve it; a solution based solely on message passing is required.

- Application-level protocol for executing a critical section:

  enter()          // enter critical section – block if necessary

  resourceAccess() // access shared resources

  exit()           // leave critical section – other processes may now enter

Essential requirements:

- ME1 (safety): at most one process may execute in the critical section at a time.

- ME2 (liveness): requests to enter and exit the critical section eventually succeed.

- ME3 (ordering): if one request to enter the CS happened-before another, then entry to
the CS is granted in that order.

ME2 implies freedom from both deadlock and starvation. Starvation is a fairness
condition concerning the order in which processes enter the critical section. It is
not possible to use request times to order entry because of the lack of a global clock,
so happened-before ordering is usually used to order requests.

Performance Evaluation

- Bandwidth consumption, which is proportional to the number of messages
sent in each entry and exit operation.

- The client delay incurred by a process at each entry and exit operation.

- The throughput of the system: the rate at which the collection of processes as a whole
can access the critical section. The effect is measured using the synchronization
delay between one process exiting the critical section and the next process
entering it; the shorter the delay, the greater the throughput.

Central server algorithm – managing the critical section for distributed mutual
exclusion

- The simplest way to grant permission to enter the critical section is to employ
a server.

- A process sends a request message to the server and awaits a reply from it.

- The reply constitutes a token signifying permission to enter the critical
section.

- If no other process has the token at the time of the request, then the server
replies immediately with the token.

- If the token is currently held by another process, the server does not reply but
queues the request.

- On exiting the critical section, the client sends a message to the server, giving it back
the token.
Bandwidth: entering takes two messages (request followed by a grant), delayed by the
round-trip time; exiting takes one release message, and does not delay the exiting process.

Throughput is measured by synchronization delay, the round-trip of a release


message and grant message.
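The server side of this scheme fits in a few lines; reply(pid) below is an assumed transport helper that delivers the token grant to process pid.

from collections import deque

class MutexServer:
    def __init__(self, reply):
        self.reply = reply
        self.holder = None                 # process currently holding the token
        self.queue = deque()               # queued requests

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.reply(pid)                # grant the token immediately
        else:
            self.queue.append(pid)         # token busy: queue, do not reply yet

    def on_release(self, pid):
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.reply(self.holder)        # pass the token to the next waiter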

Ring-based Algorithm

- The simplest way to arrange mutual exclusion between N processes without
requiring an additional process is to arrange them in a logical ring.

- Each process pi has a communication channel to the next process in the ring,
p(i+1) mod N.

- The unique token is a message passed from process to process around the ring
in a single direction (clockwise).

- If a process does not require entry to the CS when it receives the token, it
immediately forwards the token to its neighbour.

- A process that requires the token waits until it receives it, and then retains it
while it is in the CS.

- To exit the critical section, the process sends the token on to its neighbour.

Bandwidth: the token continuously consumes bandwidth except when a process is
inside the CS. Exit requires only one message.

Delay: the delay experienced by a process ranges from 0 messages (it has just
received the token) to N messages (the token has just passed it).

Throughput: the synchronization delay between one exit and the next entry is
anywhere from 1 (the next process in the ring) to N (the same process again) message
transmissions.
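A sketch of one ring member; forward(msg) is an assumed helper that sends to the next process in the ring, p(i+1) mod N.

class RingMutexProcess:
    def __init__(self, forward):
        self.forward = forward
        self.wants_cs = False
        self.in_cs = False

    def request_cs(self):
        self.wants_cs = True               # then wait until the token arrives

    def on_token(self):
        if self.wants_cs:
            self.in_cs = True              # retain the token and enter the CS
        else:
            self.forward("TOKEN")          # not interested: pass it on at once

    def exit_cs(self):
        self.wants_cs = False
        self.in_cs = False
        self.forward("TOKEN")              # on exit, send the token to the neighbour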

Using Multicast and logical clocks: Ricart and Agrawala algorithm

- Mutual exclusion between N peer processes, based upon multicast.

- Processes that require entry to a critical section multicast a request message, and
can enter it only when all the other processes have replied to this message.
- The conditions under which a process replies to a request are designed to
ensure that ME1, ME2 and ME3 are met.
- Each process pi keeps a Lamport clock. Messages requesting entry are of the
form <T, pi>.
- Each process records its state of either RELEASED, WANTED or HELD in a
variable state.
  – If a process requests entry and all other processes are in state RELEASED, then
all processes reply immediately.
  – If some process is in state HELD, then that process will not reply until
it has finished with the CS.
  – If some process is in state WANTED and has a smaller timestamp than
the incoming request, it will queue the request until it has finished.
  – If two or more processes request entry at the same time, then
whichever bears the lowest timestamp will be the first to collect N-1
replies.

Ricart and Agrawala’s algorithm

On initialization
state := RELEASED;
To enter the section
state := WANTED;
Multicast request to all processes; request processing deferred here
T := request’s timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
then
queue request from pi without replying;
else
reply immediately to pi;
end if
To exit the critical section
state := RELEASED;
reply to any queued requests;

Multicast synchronization (example)
- P1 and P2 request the CS concurrently. The timestamp of P1’s request is 41 and that
of P2’s is 34. When P3 receives their requests, it replies to both immediately. When P2
receives P1’s request, it finds that its own request has the lower timestamp, so it does
not reply and holds P1’s request in its queue. P1, however, replies to P2, so P2 enters
the CS. After P2 finishes, it replies to P1 and P1 then enters the CS.
- Granting entry takes 2(N-1) messages: N-1 to multicast the request and N-1
replies. Bandwidth consumption is therefore high.
- Client delay is again one round-trip time.
- Synchronization delay is one message transmission time.

Maekawa’s voting algorithm

- It is not necessary for all peers to grant access. A process only needs to obtain
permission to enter from a subset of its peers, as long as the subsets used by
any two processes overlap.
- Think of processes as voting for one another to enter the CS. A candidate
process must collect sufficient votes to enter.
- Processes in the intersection of two sets of voters ensure the safety property
ME1 by casting their votes for only one candidate at a time.
- A voting set Vi is associated with each process pi.
- There is at least one common member of any two voting sets, and for fairness all
voting sets are of the same size:

  Vi ⊆ {p1, p2, ..., pN}, such that for all i, j = 1, 2, ..., N:
  pi ∈ Vi
  Vi ∩ Vj ≠ ∅
  |Vi| = K
  Each process is contained in M of the voting sets Vi.
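One simple (non-optimal) way to build voting sets satisfying these conditions is to arrange the N processes in a square grid and let Vi be pi's row plus its column, so any two sets intersect; the optimal set size K is approximately sqrt(N). The helper below is a sketch that assumes N is a perfect square.

import math

def grid_voting_sets(n):
    side = math.isqrt(n)
    assert side * side == n, "this sketch assumes N is a perfect square"
    sets = []
    for i in range(n):
        row, col = divmod(i, side)
        members = {row * side + c for c in range(side)}    # pi's row
        members |= {r * side + col for r in range(side)}   # pi's column
        sets.append(members)
    return sets

# grid_voting_sets(9)[0] == {0, 1, 2, 3, 6}; every pair of sets overlaps,
# which is exactly what ME1 relies on.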

On initialization
state := RELEASED;
voted := FALSE;
For pi to enter the critical section
state := WANTED;
Multicast request to all processes in Vi;
Wait until (number of replies received = K);
state := HELD;
On receipt of a request from pi at pj
if (state = HELD or voted = TRUE)
then
queue request from pi without replying;
else
send reply to pi;
voted := TRUE;
end if
For pi to exit the critical section
state := RELEASED;
Multicast release to all processes in Vi;
On receipt of a release from pi at pj
if (queue of requests is non-empty)
then
remove head of queue – from pk, say;
send reply to pk;
voted := TRUE;
else
voted := FALSE;
end if
- ME1 is met: if two processes could enter the CS at the same time, the processes in
the intersection of their two voting sets would have had to vote for both. The
algorithm allows a process to make at most one vote between successive
receipts of a release message, so this cannot happen.

- The algorithm is deadlock prone. For example, consider p1, p2 and p3 with V1 = {p1, p2},
V2 = {p2, p3} and V3 = {p3, p1}. If the three processes concurrently request entry to the
CS, then it is possible for p1 to reply to itself and hold off p2, for p2 to reply to itself and
hold off p3, and for p3 to reply to itself and hold off p1. Each process has then received
one out of its two required replies, and none can proceed.

- If processes queue outstanding requests in happened-before order, ME3 can be
satisfied and the algorithm becomes deadlock free.

- Bandwidth utilization is 2*sqrt(N) messages per entry to the CS and sqrt(N) per
exit.

- Client delay is the same as for Ricart and Agrawala’s algorithm: one round-trip
time.

- Synchronization delay is one round-trip time, which is worse than Ricart and
Agrawala’s.

Fault tolerance

- What happens when messages are lost?

- What happens when a process crashes?

- None of the algorithms that we have described would tolerate the loss of
messages if the channels were unreliable.

  – The ring-based algorithm cannot tolerate any single process crash
failure.

  – Maekawa’s algorithm can tolerate some process crash failures: a
crashed process is tolerated if it is not in any voting set that is required.

  – The central server algorithm can tolerate the crash failure of a client
process that neither holds nor has requested the token.

  – The Ricart and Agrawala algorithm as we have described it can be
adapted to tolerate the crash failure of such a process, by taking it to
grant all requests implicitly.
Elections

An algorithm for choosing a unique process to play a particular role is called an
election algorithm. For example, with a central server for mutual exclusion, one
process must be elected as the server, and everybody must agree on the choice. If the
server wishes to retire, then another election is required to choose a replacement.

A ring based election algorithm

- All processes are arranged in a logical ring.

- Each process has a communication channel to the next process.

- All messages are sent clockwise around the ring.

- Assume that no failures occur, and that the system is asynchronous.

- The goal is to elect a single coordinator process: the one with the largest identifier.

- (Figure: the election was started by process 17; the highest process identifier
encountered so far is 24; participant processes are shown darkened.)
1. Initially, every process is marked as a non-participant. Any process can begin
an election.

2. The starting process marks itself as a participant and places its identifier in a
message to its neighbour.

3. When a process receives an election message, it compares the identifier in it with
its own. If the arrived identifier is larger, it passes the message on.

4. If the arrived identifier is smaller and the receiver is not a participant, it substitutes
its own identifier in the message and forwards it. It does not forward the message
if it is already a participant.

5. On forwarding an election message in either case, the process marks itself as a
participant.

6. If the received identifier is that of the receiver itself, then this process’s
identifier must be the greatest, and it becomes the coordinator.

7. The coordinator marks itself as a non-participant, sets elected_i and sends an
elected message to its neighbour, enclosing its identifier.

8. When a process receives an elected message, it marks itself as a non-participant,
sets its variable elected_i and forwards the message (unless it is the coordinator itself).
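The steps above translate almost directly into code; forward(msg) is an assumed helper that sends to the clockwise neighbour.

class RingElectionProcess:
    def __init__(self, my_id, forward):
        self.my_id = my_id
        self.forward = forward
        self.participant = False
        self.elected = None

    def start_election(self):
        self.participant = True
        self.forward(("ELECTION", self.my_id))

    def on_election(self, candidate_id):
        if candidate_id > self.my_id:
            self.participant = True
            self.forward(("ELECTION", candidate_id))   # pass the larger id on
        elif candidate_id < self.my_id and not self.participant:
            self.participant = True
            self.forward(("ELECTION", self.my_id))     # substitute own id
        elif candidate_id == self.my_id:
            self.participant = False                   # own id came back:
            self.elected = self.my_id                  # this process is coordinator
            self.forward(("ELECTED", self.my_id))

    def on_elected(self, coordinator_id):
        self.participant = False
        self.elected = coordinator_id
        if coordinator_id != self.my_id:
            self.forward(("ELECTED", coordinator_id))  # stops once back at the coordinator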

The bully algorithm

1. A process begins an election by sending an election message to those processes
that have a higher identifier and awaits an answer in response. If none arrives within
time T, the process considers itself the coordinator and sends a coordinator message
to all processes with lower identifiers. Otherwise, it waits a further time T’ for a
coordinator message to arrive; if none arrives, it begins another election.

2. If a process receives a coordinator message, it sets its variable elected_i to the
identifier of the coordinator.

3. If a process receives an election message, it sends back an answer message and
begins another election, unless it has begun one already.
Consensus problem

The problem is for processes to agree on a value after one or more of the
processes has proposed what that value should be (e.g. all controlling
computers should agree on whether to let a spacecraft proceed or to abort after
one computer proposes an action).

Transactions and Concurrency Control

The goal of transactions


The objects managed by a server must remain in a consistent state
- when they are accessed by multiple transactions and
- in the presence of server crashes

Transactions

Some applications require a sequence of client requests to a server to be atomic in the


sense that:
1. They are free from interference by operations being performed on behalf of other
concurrent clients; and
2. Either all of the operations must be completed successfully or they must have no
effect at all in the presence of server crashes.

ACID properties

Atomicity: a transaction must be all or nothing;


Consistency: a transaction takes the system from one consistent state to another
consistent state;
Isolation: each transaction must be performed without interference from other
transactions;
Durability: after a transaction has completed successfully, all its effects are saved in
permanent storage.

Atomicity of transactions

Transactions are intended to be atomic.


All or nothing:
– it either completes successfully, and the effects of all of its operations are recorded
in the objects, or (if it fails or is aborted) it has no effect at all. This all-or-nothing
effect has two further aspects of its own:
– failure atomicity: the effects are atomic even when the server crashes;
– durability: after a transaction has completed successfully, all its effects are saved
in permanent storage.
– Each transaction must be performed without interference from other transactions -
there must be no observation by other transactions of a transaction's intermediate
effects

Isolation
One way to achieve isolation is to perform the transactions serially – one at a time.
- The aim for any server that supports transactions is to maximize concurrency.
- Concurrency control ensures isolation.
- Transactions are allowed to execute concurrently, having the same effect as a serial
execution.
  – That is, they are serially equivalent or serializable.

Concurrency control
The ‘lost update’ and ‘inconsistent retrievals’ problems can occur in the absence of
appropriate concurrency control:
– a lost update occurs when two transactions both read the old value of a variable
and use it to calculate a new value;
– inconsistent retrievals occur when a retrieval transaction observes values that are
involved in an ongoing updating transaction.
Serially equivalent executions of transactions avoid these problems. We assume that
the operations deposit, withdraw, getBalance and setBalance are synchronized
operations – that is, their effect on the account balance is atomic.

Serial equivalence
- If each one of a set of transactions has the correct effect when done on its own,
then if they are done one at a time in some order the combined effect will also be correct.
- A serially equivalent interleaving is one in which the combined effect is the same as
if the transactions had been done one at a time in some order.
- ‘The same effect’ means that the read operations return the same values and that
the instance variables of the objects have the same values at the end.
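The lost update problem can be reproduced with a toy in-memory account (a hypothetical Account class, used only for illustration):

class Account:
    def __init__(self, balance):
        self.balance = balance
    def get_balance(self):
        return self.balance
    def set_balance(self, b):
        self.balance = b

a = Account(100)

# Transactions T and U each want to deposit 10 into account a.
t_read = a.get_balance()       # T reads 100
u_read = a.get_balance()       # U reads 100 (interleaved before T writes back)
a.set_balance(t_read + 10)     # T writes 110
a.set_balance(u_read + 10)     # U also writes 110 -- T's update is lost
assert a.get_balance() == 110  # any serially equivalent execution would leave 120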
Nested transactions:

- To a parent, a sub-transaction is atomic with respect to failures and concurrent
access.
- Transactions at the same level (e.g. T1 and T2) can run concurrently, but access to
common objects is serialized.
- A sub-transaction can fail independently of its parent and of other sub-transactions.
  – When it aborts, its parent decides what to do, e.g. start another sub-transaction
or give up.

Advantages of nested transactions (over flat ones)

- Sub-transactions may run concurrently with other sub-transactions at the same level.
  – This allows additional concurrency within a transaction.
  – When sub-transactions run in different servers, they can work in parallel.
    For example, the branchTotal operation can be implemented by invoking
    getBalance at every account in the branch; these invocations can be done in
    parallel when they are located at different servers.
- Sub-transactions can commit or abort independently.
  – This is potentially more robust: a parent can decide on different actions
    according to whether a sub-transaction has aborted or not.

Commitment of nested transactions

- A transaction may commit or abort only after its child transactions have completed.
- A sub-transaction decides independently whether to commit provisionally or to
abort; its decision to abort is final.
- When a parent aborts, all of its sub-transactions are aborted.
- When a sub-transaction aborts, the parent can decide whether to abort or not.
- If the top-level transaction commits, then all of the sub-transactions that have
provisionally committed can commit too, provided that none of their ancestors has
aborted.
Replication

A basic architectural model for the management of replicated data

Replicated data architecture (system model)

There are five phases in performing a request on replicated data:

Request
The front end issues the request, either sending it to a single replica manager or
multicasting it to all replica managers.

Coordination
The replica managers coordinate in preparation for executing the request, i.e. they
agree on whether the request is to be performed and on the ordering of the request
relative to others (FIFO ordering, causal ordering or total ordering).

Execution
The replica managers execute the request, perhaps tentatively.

Agreement
The replica managers reach consensus on the effect of the request, e.g. they agree to
commit or abort in a transactional system.

Response
One or more replica managers respond to the front end.

Transactions on replicated data


One copy serializability

Replicated transactional service

Each replication manager provides concurrency control and recovery of its own data
items in the same way as it would for non-replicated data.

The effects of transactions performed by various clients on replicated data items are
the same as if they had been performed one at a time on single data items.

Additional complications: failures, network partitions


Failures should be serialized with transactions, i.e. any failure observed by a
transaction must appear to have happened before a transaction started.

Replication Schemes

Primary Copy

Read one – Write All

Cannot handle network partitions

Schemes that can handle network partitions

Available copies with validation

Quorum consensus

Virtual Partition

Read-one write-all

Each write operation sets a write lock at each replica manager.

Each read sets a read lock at one replica manager.

Two-phase commit

With replication, two-phase commit becomes a two-level nested protocol
(coordinator -> workers): if either the coordinator or a worker is a replica manager, it
has to communicate with the other replica managers as well.
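A plain (single-level) two-phase commit, seen from the coordinator, can be sketched as follows; each worker is assumed to expose can_commit(), do_commit() and do_abort(), and with replicated data each of these calls would itself fan out to that worker's replica managers.

def two_phase_commit(workers):
    # Phase 1 (voting): ask every worker whether it can commit.
    votes = [w.can_commit() for w in workers]

    # Phase 2 (completion): commit only if every worker voted yes.
    if all(votes):
        for w in workers:
            w.do_commit()
        return "COMMITTED"
    for w in workers:
        w.do_abort()
    return "ABORTED"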

Primary copy replication

All client requests are directed to a single primary server.

Available copies replication

It can handle the situation where some replica managers are unavailable because they
have failed or because of a communication failure.

Reads can be performed by any available replica manager, but writes must be
performed by all available replica managers.
The normal case is therefore like read-one/write-all, as long as the set of available
replica managers does not change during a transaction.

Available copies replication – failure case

One copy serializability requires that failures and recovery be serialized with
transactions.

This is not achieved when different transactions make conflicting failure observations.

Example shows local concurrency control not enough.

Additional concurrency control procedure (called local validation) has to be


performed to ensure correctness.

Available copies with local validation assumes no network partition - i.e. functioning
replica managers can communicate with one another.

Local validation - example

Assume X fails just after T has performed getBalance, and N fails just after U has
performed getBalance.

Assume also that X and N fail before T and U have performed their deposit operations.

T’s deposit will then be performed at M and P, while U’s deposit will be performed at Y.
The concurrency control on A at X does not prevent U from updating A at Y; similarly,
the concurrency control on B at N does not prevent T from updating B at M and P.

Local concurrency control is therefore not enough.

4.8 Case study – Coda.

Features:

Disconnected operation for mobile clients; reintegration of data from disconnected
clients.

Bandwidth adaptation

Failure Resilience

read/write replication servers

resolution of server/server conflicts

Handles network failures that partition the servers

Handles disconnection of clients

Performance and scalability

Client side persistent caching of files, directories and attributes for high performance

Write back caching

Security

Kerberos like authentication

Access control lists (ACL's)

Well defined semantics of sharing

Freely available source code


Coda- replication

Communication
Inter process communication in Coda is performed using RPCs. However,
the RPC2 system for Coda is much more sophisticated than traditional RPC systems
such as ONC RPC, which is used by NFS.

(Figure: the internal organization of a Virtue workstation.)

RPC2 allows the client and the server to set up a separate connection for transferring
the video data to the client on time. Connection setup is done as a side effect of an RPC
call to the server. For this purpose, the RPC2 runtime system provides an interface of
side-effect routines that is to be implemented by the application developer. For
example, there are routines for setting up a connection and routines for transferring
data. These routines are automatically called by the RPC2 runtime system at the client
and server, respectively, but their implementation is otherwise completely
independent of RPC2.

Side effects in Coda’s RPC2 system.

a) Sending an invalidation message one at a time; b) sending invalidation messages
in parallel.

The problem is caused by the fact that an RPC may fail. Invalidating files in a strict
sequential order may be delayed considerably because the server cannot reach a
possibly crashed client, but will give up on that client only after a relatively long
expiration time. Meanwhile, other clients will still be reading from their local copies.
Parallel RPCs are implemented by means of the MultiRPC system, which is part of
the RPC2 package. An important aspect of MultiRPC is that the parallel invocation of
RPCs is fully transparent to the callee. In other words, the receiver of a MultiRPC call
cannot distinguish that call from a normal RPC. At the caller’s side, parallel execution
is also largely transparent. For example, the semantics of MultiRPC in the presence of
failures are much the same as that of a normal RPC. Likewise, the side-effect
mechanisms can be used in the same way as before.

Multi-RPC is implemented by essentially executing multiple RPCs in parallel. This


means that the caller explicitly sends an RPC request to each recipient. However,
instead of immediately waiting for a response, it defers blocking until all requests have
been sent. In other words, the caller invokes a number of one-way RPCs, after which
it blocks until all responses have been received from the non-failing recipients. An
alternative approach to parallel execution of RPCs in MultiRPC is provided by setting
up a multicast group, and sending an RPC to all group members using IP multicast

Processes

Coda maintains a clear distinction between client and server processes. Clients are
represented by Venus processes; servers appear as Vice processes. Both types of
processes are internally organized as a collection of concurrent threads. Threads in
Coda are non-preemptive and operate entirely in user space. To account for
continuous operation in the face of blocking I/O requests, a separate thread is used to
handle all I/O operations, which it implements using low-level asynchronous I/O
operations of the underlying operating system. This thread effectively emulates
synchronous I/O without blocking an entire process.

Naming

As we mentioned, Coda maintains a naming system analogous to that of UNIX. Files


are grouped into units referred to as volumes. A volume is similar to a UNIX disk
partition (i.e., an actual file system), but generally has a much smaller granularity. It
corresponds to a partial sub tree in the shared name space as maintained by the Vice
servers. Usually a volume corresponds to a collection of files associated with a user.
Examples of volumes include collections of shared binary or source files, and so on.
Like disk partitions, volumes can be mounted. Volumes are important for two reasons.
First, they form the basic unit by which the entire name space is constructed. This
construction takes place by mounting volumes at mount points. A mount point in
Coda is a leaf node of a volume that refers to the root node of another volume. Using
the terminology introduced here only root nodes can act as mounting points (i.e., a
client can mount only the root of a volume). The second reason why volumes are
important, is that they form the unit for server-side replication. We return to this
aspect of volumes below. Considering the granularity of volumes, it can be expected
that a name lookup will cross several mount points. In other words, a path name will
often contain several mount points. To support a high degree of naming transparency,
a Vice file server returns mounting information to a Venus process during name
lookup. This information will allow Venus to automatically mount a volume into the
client’s name space when necessary. This mechanism is similar to crossing mount
points as supported in NFS version 4.

Clients in Coda have access to a single shared name space.

File Identifiers

Considering that the collection of shared files may be replicated and distributed across
multiple Vice servers, it becomes important to uniquely identify each file in such a
way that it can be tracked to its physical location, while at the same time maintaining
replication and location transparency.

Synchronization

Many distributed file systems, including Coda’s ancestor, AFS, do not provide UNIX
file-sharing semantics but instead support the weaker session semantics. Given its
goal to achieve high availability, Coda takes a different approach and makes an
attempt to support transactional semantics, albeit a weaker form than normally
supported by transactions.

The problem that Coda wants to solve is that in a large distributed file system it may
easily happen that some or all of the file servers are temporarily unavailable. Such
unavailability can be caused by a network or server failure, but may also be the result
of a mobile client deliberately disconnecting from the file service. Provided that the
disconnected client has all the relevant files cached locally, it should be possible to use
these files while disconnected and reconcile later when the connection is established
again.

Sharing Files in Coda

The transactional behavior in sharing files in Coda.

Transactional Semantics

In Coda, the notion of a network partition plays a crucial role in defining transactional
semantics. A partition is a part of the network that is isolated from the rest and which
consists of a collection of clients or servers, or both. The basic idea is that series of file
operations should continue to execute in the presence of conflicting operations across
different partitions. Recall that two operations are said to conflict if they both operate
on the same data, and at least one is a write operation.

Let us first examine how conflicts may occur in the presence of network partitions.
Assume that two processes A and B hold identical replicas of various shared data
items just before they are separated as the result of a partitioning in the network.
Ideally, a file system supporting transactional semantics would implement one-copy
serializability, which is the same as saying that the execution of operations by A and B,
respectively, is equivalent to a joint serial execution of those operations on non-
replicated data items shared by the two processes. The main problem in the face of
partitions is to recognize serializable executions after they have taken place within a
partition. In other words, when recovering from a network partition, the file system is
confronted with a number of transactions that have been executed in each partition
(possibly on shadow copies, i.e., copies of files that were handed out to clients to
perform tentative modifications analogous to the use of shadow blocks in the case of
transactions). It will then need to check whether the joint executions can be serialized
in order to accept them. In general, this is an intractable problem.
Caching and Replication

Client Caching

Client-side caching is crucial to the operation of Coda for two reasons. First, and in
line with the approach followed in AFS, caching is done to achieve scalability. Second,
caching provides a higher degree of fault tolerance as the client becomes less
dependent on the availability of the server. For these two reasons, clients in Coda
always cache entire files. In other words, when a file is opened for either reading or
writing, an entire copy of the file is transferred to the client, where it is subsequently
cached.

Unlike many other distributed file systems, cache coherence in Coda is maintained by
means of callbacks. For each file, the server from which a client had fetched the file
keeps track of which clients have a copy of that file cached locally. A server is said to
record a callback promise for a client. When a client updates its local copy of the file
for the first time, it notifies the server, which, in turn, sends an invalidation message
to the other clients. Such an invalidation message is called a callback break, because
the server will then discard the callback promise it held for the client it just sent an
invalidation.

Server Replication

Coda allows file servers to be replicated. As we mentioned, the unit of replication is a


volume. The collection of servers that have a copy of a volume, are known as that
volume’s Volume Storage Group, or simply VSG. In the presence of failures, a client
may not have access to all servers in a volume’s VSG. A client’s Accessible Volume
Storage Group (AVSG) for a volume consists of those servers in that volume’s VSG
that the client can contact.

(Figure: the use of local copies when opening a session in Coda.)

If the AVSG is empty,
the client is said to be disconnected. Coda uses a replicated-write protocol to maintain
consistency of a replicated volume. In particular, it uses a variant of Read-One, Write-
All (ROWA). When a client needs to read a file, it contacts one of the members in its
AVSG of the volume to which that file belongs. However, when closing a session on
an updated file, the client transfers it in parallel to each member in the AVSG. This
parallel transfer is accomplished by means of multi RPC as explained before.


Fault Tolerance

Coda has been designed for high availability, which is mainly reflected by its
sophisticated support for client-side caching and its support for server replication. An
interesting aspect of Coda that needs further explanation is how a client can continue
to operate while being disconnected, even if disconnection lasts for hours or days.

Recoverable Virtual Memory

Besides providing high availability, the AFS and Coda developers have also looked at
simple mechanisms that help in building fault-tolerant processes. A simple and
effective mechanism that makes recovery much easier, is Recoverable Virtual
Memory (RVM). RVM is a user-level mechanism to maintain crucial data structures
in main memory while being ensured that they can be easily recovered after a crash
failure. The details of RVM are described elsewhere.

The basic idea underlying RVM is relatively simple: data that should survive crash
failures are normally stored in a file that is explicitly mapped into memory when
needed. Operations on that data are logged, similar to the use of a write ahead log in
the case of transactions. In fact, the model supported by RVM is close to that of flat
transactions, except that no support is provided for concurrency control. Once a file
has been mapped into main memory, an application can perform operations on that
data that are part of a transaction. RVM is unaware of data structures. Therefore, the
data in a transaction is explicitly set by an application as a range of consecutive bytes
of the mapped-in file. All (in-memory) operations on that data are recorded in a
separate write ahead log that needs to be kept on stable storage. Note that due to the
generally relatively small size of the log, it is feasible to use a battery power-supplied
part of main memory, which combines durability with high performance.

Security

Coda inherits its security architecture from AFS, which consists of two parts. The first
part deals with setting up a secure channel between a client and a server using secure
RPC and system-level authentication. The second part deals with controlling access to
files. We will not examine each of these in turn.
