
University of Wroclaw

Computer Science Institute

"Distributed Systems"
(A. Tanenbaum, M. van Steen)

Solutions

Robert Schulze

January 21, 2008


Contents
1 Introduction

2 Communication

3 Processes

4 Naming

5 Synchronization

6 Consistency and Replication

7 Fault Tolerance

8 Security

9 Distributed Object-based Systems

10 Distributed File Systems
1 Introduction

1) What is the role of middleware in a distributed system?

First of all we look at the definition of middleware: Middleware (in the context of distributed
systems) is an additional layer on top of a network operating system which serves two
purposes:

• It hides the heterogeneity of the underlying layers

• It improves distribution transparency

With that in mind we see that middleware gets us the best of the two "opposing" network
software models:

• The scalability and openness of network operating systems

• The transparency and ease of use of distributed operating systems

Middleware is placed between a network operating system and the distributed applications.
It leaves the management of the computers and the simple connection mechanisms to the
underlying system, and provides higher-level interfaces for communication and services (e.g.
POP / FTP / transaction interfaces). In this way applications don't have to deal with
machine-specific interfaces (purpose 1) and can be easily ported (purpose 2).

Tanenbaum mentions some other benefits:

• Naming: Middleware can provide facilities to name services and files. Applications
can rely on a telephone-book-like directory of services and files.

• Persistence: Middleware often offers persistence services (e.g. through a distributed
filesystem or a database). Applications don't have to implement machine-specific file
storage mechanisms.

• Distributed Transactions: Middleware can implement distributed transactions in which
data from multiple computers is involved. These transactions fulfil the ACID properties.
Applications can use these hard-to-implement transactions.

• Security: Because applications don't use the underlying operating systems anymore,
some basic security and protection mechanisms must be implemented. Middleware
usually offers a security model.

3) Why is it sometimes so hard to hide the occurrence and recovery from failures
in a distributed system?

We aim to make the distributed system robust against failures (failure transparency). In
particular, the user should not notice when individual resources do not work properly.

The main difficulty is that we are almost unable to distinguish between a dead resource
and a busy resource. In both cases we cannot expect status messages from the in-doubt
resource. That is why we have to poll it for its status. We must base our decision on
a timeout value, which may be too short (busy resources may be classified as dead) or too
long (the user will be unsatisfied with the speed of the distributed system).

5) What is an open distributed system and what benefits does openness provide?

We define openness as the degree to which a distributed system behaves according to standard
rules that describe the syntax and semantics of services. For example, protocols (like
HTTP) define the structure, the possible contents and the meaning of messages.

Well-defined interfaces allow arbitrary processes to use them without relying on, or even
knowing, the implementation. This aims at decoupling and easy interchangeability
(interoperability) of the components of the distributed system. A related effect
(portability) makes it possible, for example, to execute an application which was developed
for distributed system A on distributed system B (which implements the same interface).

Another benefit is flexibility: it should be easy to configure a system out of
different components (e.g. the persistence functionality is realized by a distributed file
system or by a relational database). The exchange of one component does not affect other
components.

7) Scalability can be achieved by applying different techniques. What are these
techniques?

The definition of scalability covers three issues:

• Size: Can we add more users and resources to the system?

• Geographical scalability: What happens if users and resources lie far apart?

• Administrative scalability: Is the system easy to administer, even if it spans many
independent organizations?

I want to cover only the first two points. There are three main techniques for achieving
scalability:

• Hiding communication latencies

– We try to avoid waiting for the responses to remote service requests as much as
possible. This basically means asynchronous, non-blocking communication: while
we are waiting for an answer we can do other work on the client.

– Hiding latencies this way means that the distributed system appears to be
responsive and fast.

– This works well for independent requests, e.g. in batch processing or certain
parallel algorithms, and not so well for cases where the client depends on the
answer. An example might be an interactive remote shell, where the user
has to wait for the processed commands before continuing.

• Distribution

– We take a component / resource and split it into parts, which we distribute over
our system. No part exists twice in the system. If a client requests some parts
of a resource, it gets those parts from the corresponding servers.

– In this way the load of delivering the whole resource from a single server every
time is spread over many servers, each of which delivers only parts of the resource.

– An example is the Domain Name System (DNS) of the Internet. A client sends
a request like "What is the IP of ai.cs.tu-dresden.de?". Instead of getting the
IP for the full address from one server, the request is sent to a server which
only has answers for the zone "de". This server passes the request on to a server
which handles the zone "tu-dresden.de". This goes on until the last DNS server
(responsible for "ai.cs.tu-dresden.de") returns the IP.

• Replication

– Replication means spreading the same resource across different computers in the
system.

– This increases, on the one hand, the availability of resources and, on the other
hand, scalability, because balancing the load between different servers means that
the user can access the geographically closest copy. This again hides latency.

– A special kind of replication is caching. Caching means replicating very close
to the client, sometimes even in the client (e.g. browser caches). The main
difference is that the decision to cache something is usually up to the client.

– One common problem with replication is how to keep the replicated resources
consistent. The easiest way to do that is to update all copies when one is updated
(strong consistency). Of course this might affect scalability negatively. Sometimes
it is acceptable to work with not-too-old copies of a resource (e.g. in a web browser
a 5-minute-old web page from the local cache may be fine, as long as it is not a
stock price page).

9) A multicomputer with 256 CPUs is organized as a 16x16 grid. What is the
worst-case delay (in hops) that a message might have to take?

The worst case happens when two nodes on opposite corners of the grid want to
communicate. In this case the message has to traverse 15 links vertically and 15 links
horizontally; in general the worst case for an n x n grid is 2(n-1) hops, so for n = 16
this makes 30 hops.

11) What is the difference between a distributed operating system (DOS) and
a network operating system (NOS)?

DOSes are usually used for multiprocessor systems and homogeneous multicomputer systems.
They provide a tightly-coupled system with a uniform interface to applications or middleware
("emulating" a single system). Their main goal is thus to hide and manage the underlying
hardware resources.

NOSes are usually used for heterogeneous multicomputer systems. They still manage
the underlying hardware (unlike middleware), but are more targeted towards offering local
services to remote clients. They don't provide applications with a uniform view of the whole
system, but instead offer services of local, loosely-coupled computers (e.g. rlogin or NFS).

13) Explain the principal operation of a page-based distributed shared memory
system.

We want to provide applications with a virtual shared memory, which is formed by a
multicomputer system. Applications can then use shared memory techniques like semaphores
and monitors instead of message passing. This has several advantages, like easier programming
and no worries about reliable communication, buffering and blocking.

One approach is to divide the (virtual) address space into pages (e.g. 4 kB) which are spread
over the system. If a processor references a local page, everything is fine. If it references
a page which is not locally present, a trap occurs; the operating system fetches the
remote page and restarts the faulting instruction, which can then continue normally. The only
difference to normal paging is that the operating system fetches pages from remote RAM
instead of from local swap space.
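
To illustrate the mechanism, here is a minimal Python sketch of the page-fault path. page_table, owner_of() and fetch_remote_page() are hypothetical stand-ins for what the operating system would actually do:

PAGE_SIZE = 4096

# page number -> local frame; None means the page lives on a remote machine
page_table = {0: "frame-0", 1: None}

def owner_of(page):
    # hypothetical directory lookup: which machine currently holds the page
    return "node-2"

def fetch_remote_page(owner, page):
    # hypothetical network transfer; returns a local frame holding the page
    print "fetching page", page, "from", owner
    return "frame-%d" % page

def on_page_fault(address):
    page = address / PAGE_SIZE
    if page_table.get(page) is None:
        # remote page: fetch it over the network instead of from local swap
        page_table[page] = fetch_remote_page(owner_of(page), page)
    # the operating system would now restart the faulting instruction

on_page_fault(1 * PAGE_SIZE + 123)   # references page 1, triggers a remote fetch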

15) Explain what false sharing is in distributed shared memory systems. What
possible solutions do you see?

We somehow have to choose the page size in a distributed shared memory system. If we
choose it too large, a phenomenon called false sharing can occur. This means that one
page holds resources of two independent jobs running on two different computers. On every
reference the operating system needs to transfer the page from one computer to the other.
Because of the update, the original page on the first computer is invalidated. When the
first computer then references the page again, the (now updated) page must be fetched back
from the second computer, and the process repeats.

So one solution would of course be to make the page sizes smaller. This would reduce the
chance that two processes on two computers use the same page by accident. Unfortunately,
we then get more network traffic due to the larger number of page transfers and the high
relative cost of setting up each transfer.

Another solution would be to build some false-sharing detection mechanism into the operating
system. E.g. one could measure the executed instructions per second per job. If a job
suddenly slows down (but does not stop entirely, which would indicate a code bug), one could
reset it and assign it a new address space (analogous to deadlock detection).
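
As a toy illustration of the resulting ping-pong effect, here is a Python sketch in which two machines each access only their own variable, yet both variables happen to lie on the same page. page_owner and the transfer counter are simplifications for illustration:

PAGE_SIZE = 4096
addr_a, addr_b = 100, 200     # two independent variables, both on page 0
page_owner = {0: "M1"}
transfers = 0

def access(machine, addr):
    global transfers
    page = addr / PAGE_SIZE
    if page_owner[page] != machine:
        page_owner[page] = machine   # the page ping-pongs over the network
        transfers += 1

for i in range(1000):                # M1 and M2 alternate on their own data
    access("M1", addr_a)
    access("M2", addr_b)
print transfers                      # 1999 transfers for 2000 accesses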

17) What is a three-tiered client-server architecture?

Three-tiered client-server architectures distribute the three principal levels of services
(user interface / processing / data) over physically distributed computers.

The user works with a user interface (GUI) on a client computer. The GUI sends requests
to a processing layer, which is separated from the GUI layer and typically runs on a remote
server. This processing level might need additional data, e.g. from a relational database,
which is stored at the data level. This data functionality is located on a server of its own.

19) Consider a chain of processes P(1), P(2), ... P(n) implementing a multitiered
client-server architecture. Process P(i) is client of process P(i+1), and P(i) will
return a reply to P(i-1) only after receiving a reply from P(i+1). What are the
main problems with this organization when taking a look at the request-reply
performance at process P(1)?

P(1) seems to depend only on the services of process P(2). But P(2) in turn depends on
P(3), so P(1) actually depends on all of P(2) ... P(n).

If one of the P(i), i > 1, is busy but not failing, then P(i-1) will have to wait for it
before it can send a reply to P(i-2). That is why one busy layer can affect and slow down
the whole system, and there is nothing the other layers can do about it (except some busy
detection and handling).

2 Communication

1. In many layered protocols, each layer has its own header. Surely it would be
more efficient to have a single header at the front of each message with all the
control information in it than all these separate headers. Why is this not done?

Usually we have a layered (stacked) network architecture, which spans 7 levels in the
OSI model. Each level implements its own protocol and defines its own headers. The
headers are prepended one after another on the sender's side, and the message has to be
unpacked 7 times on the receiver's side (once for each level).

When using a single header, we could still stick to the layered network architecture model.
The only difference would be that we would pack all information into a single header. This
would surely save us some overhead for packing and unpacking (compared to packing and
unpacking 7 independent headers). Nevertheless, the saving with regard to the total time
from sending to receiving a message would be small. Modern computers are so fast that
packing all the headers is cheap compared to transferring the message over a (possibly
large) network.

Even worse, we would lose a lot of flexibility. Having independent headers, which
don't "know" each other, means that we can easily change protocols (e.g. for a
technologically improved version) without affecting or modifying the underlying protocols
(and headers). We can also leave higher levels out (e.g. when we need no protocol for
session management) without affecting lower protocols. This means that our model works for
virtually every use case. With a single header, we would have to change the header
format for every use case. Alternatively we could make the header format highly flexible,
but this would of course mean much more complexity (compared to our 7 independent,
simple header formats).

3. A reliable multicast service allows a sender to reliably pass messages to a
collection of receivers. Does such a service belong to a middleware layer, or
should it be part of a lower-level layer?

The answer has many aspects. Simple multicasting could already be realized by the network
layer: we could route a packet (e.g. an IP packet) to all destination hosts (based on
routing tables). Unfortunately, this would be neither reliable (packets could be dropped
along the route) nor connection-oriented (for that we would need e.g. TCP).

To make this service easy to use for applications (i.e. high-level, with a standardized
interface) and to provide convenient naming for the receivers, we could have a middleware
layer which provides multicasting services. This would give us the choice between different
lower protocols, based on the needs of the applications.

5. C has a construction called a union, in which a field of a record (called a struct
in C) can hold any of several alternatives. At run-time there is no sure-fire way
to tell which one is in there. Does this feature of C have any implications for
remote procedure call? Explain your answer.

Because the RPC client and server stubs need to marshal and unmarshal the parameter
values, they must be sure of their types. Otherwise a parameter value cannot be
transformed into a machine-neutral data type.

This means that we cannot use plain unions as parameter types for remote procedure calls.
Alternatively, we can tell the runtime system explicitly which type the union value has, but
this carries the danger of specifying the wrong type, which could crash the application.
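
A minimal Python sketch of that workaround: the stub can only marshal the value when an explicit type tag accompanies it. The tag constants and the wire format are assumptions for illustration:

import struct

TAG_INT, TAG_FLOAT = 0, 1

def marshal(tag, value):
    # without the tag, the stub could not choose an encoding for the value
    if tag == TAG_INT:
        return struct.pack("!Bi", TAG_INT, value)
    elif tag == TAG_FLOAT:
        return struct.pack("!Bd", TAG_FLOAT, value)
    raise ValueError("unknown tag")

def unmarshal(data):
    tag = struct.unpack_from("!B", data)[0]
    if tag == TAG_INT:
        return struct.unpack_from("!i", data, 1)[0]
    return struct.unpack_from("!d", data, 1)[0]

print unmarshal(marshal(TAG_INT, 42))     # 42
print unmarshal(marshal(TAG_FLOAT, 2.5))  # 2.5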

7. Assume a client calls an asynchronous RPC to a server, and subsequently
waits until the server returns a result using another asynchronous RPC. Is this
approach the same as letting the client execute a normal RPC?

No, it is not the same. With synchronous RPC, only two messages are sent between client
and server: one for calling the procedure and one for returning the results. In the meantime
the client is blocked and cannot tell what happened on the network and server side. E.g. it
cannot know whether the server got the request at all and started processing (if not, it will
wait forever for an answer). The server will likewise not know whether the client got the
result or whether the result message was somehow lost (in this case the server does not have
to worry too much).

With two asynchronous RPC calls we have four messages altogether, because there is now
an acknowledgement after each call. So the client will know whether its message was delivered
and the server will know whether the answer was delivered. In case of failure, the messages
can be sent again. Because the client waits for an answer in both cases, there are no other
differences besides these error recovery possibilities.

9. Give an example implementation of an object reference that allows a client
to bind to a transient remote object.

TBD

11. Would it be useful to make a distinction between static and dynamic RPCs?

This distinction is given by design. Static RPCs are known at compile time, so one
could use a simple syntax like remoteProcedure(param1, param2, param_out).
Invocation would be transparent to the programmer, and one could check at compile time
whether the requested RPC is valid (parameters, ...).

Dynamic RPCs have to be constructed at runtime, e.g. based on user input. One must use
a more complicated, but also more flexible syntax like:
invoke(id(procedure), param1, param2, param_out)
The programmer has full control over the construction of RPCs, but has to make the safety
checks at runtime (Is the method available? Are the parameters valid? ...).
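
A small Python sketch of the dynamic flavour. The registry and the procedure names are made up for illustration; the checks that a static RPC system performs at compile time happen here at runtime:

# hypothetical registry mapping procedure names to implementations
procedures = {"next": lambda x: x + 1, "add": lambda x, y: x + y}

def invoke(proc_id, *params):
    # runtime check instead of a compile-time check
    if proc_id not in procedures:
        raise ValueError("no such remote procedure: " + proc_id)
    return procedures[proc_id](*params)

print invoke("next", 41)     # 42
print invoke("add", 20, 22)  # 42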

13. Describe how connectionless communication between a client and a server
proceeds when using sockets.

Connectionless sockets do not create a connection to transfer data. Instead of creating an
incoming-connection listener, the server creates some kind of incoming-packet listener.
We would not use TCP, but UDP.

Let us look at the server first. It would create a new socket with socket(). With
bind() it would associate the newly created socket with a local address and a port. Then
it would go into a blocking listening mode with recvfrom(). When a message from a client
is received, the operating system returns the sender address and the message. The
server can process the message and send something back with sendto(). Finally it closes
the socket with close().

The client would set up a socket with socket(). Then it would resolve the server address
with gethostbyname() in preparation for sending messages. To send something, the client
uses sendto(), which is non-blocking and asynchronous. To receive an answer it would
immediately go into the blocking listening mode with recvfrom(). It can also close the
socket with close().

One effect of connectionless communication is that the server typically does not fork or
create a thread for each new connection; instead it answers all the messages in one thread.

15. Suppose that you could make use of only transient, asynchronous communication
primitives, including only an asynchronous receive primitive. How would
you implement primitives for transient synchronous communication?

We need primitives for synchronous transient sending and receiving of messages. Assume
that we have two asynchronous primitives: send_asy() for sending messages and recv_asy()
for receiving messages.

After sending, we must block the sender somehow until we can be sure that the message
was received by the (running) server. Note that communication should be transient, which
means that we cannot expect a successful transfer when the server is not running at all.

A simple implementation would be to send the message to the server using send_asy().
Because communication should be synchronous, we cannot simply proceed with the program.
But because we used an asynchronous send primitive, we cannot expect an automatic
acknowledgement from the server (we could expect that only in the case of synchronous
communication). We therefore have to implement the server in such a way that it sends
an acknowledgement with send_asy() immediately after reading a new message. The client
then polls its operating system for this ack from the server, or blocks itself until the
ack arrives.

SynchronousTransientSend(receiver, message) {
    id++
    request = create_message(receiver, message, id)
    send_asy(request)
    reply = recv_asy(receiver, "ack", id) // block until we receive the acknowledgement
}

(sender, message) SynchronousTransientReceive() {
    message, id, sender = recv_asy()
    reply = create_message(sender, "ack", id) // create_message is assumed to be overloaded
    send_asy(reply)
}
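
The same idea as a runnable Python sketch. The per-host mailboxes and host names merely emulate the assumed primitives send_asy() and recv_asy(); they are not real operating system facilities:

import threading, time, Queue

net = {"server": Queue.Queue(), "client": Queue.Queue()}  # one mailbox per host

def send_asy(dest, msg):
    net[dest].put(msg)                # non-blocking send

def recv_asy(me):
    try:
        return net[me].get_nowait()   # asynchronous receive: returns immediately
    except Queue.Empty:
        return None

def sync_send(me, dest, msg, msg_id):
    send_asy(dest, (me, msg, msg_id))
    while True:                       # poll until the matching ack arrives
        reply = recv_asy(me)
        if reply is not None and reply[1] == "ack" and reply[2] == msg_id:
            return
        time.sleep(0.01)

def server():
    while True:
        req = recv_asy("server")
        if req is not None:
            sender, msg, msg_id = req
            send_asy(sender, ("server", "ack", msg_id))  # ack right after reading
            print "server got:", msg
            return
        time.sleep(0.01)

threading.Thread(target=server).start()
sync_send("client", "server", "hello", 1)
print "client: send completed synchronously"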

17. Does it make sense to implement persistent asynchronous communication
by means of RPC?

One could theoretically think of this possibility. There are basically two issues:

• In the case of normal (transient) asynchronous RPC, the caller cannot be sure what
happened when it gets no acknowledgement after issuing a request to the server: either
the request message got lost (possibly because the server was down), or the message
arrived but processing has not started yet. In the case of persistent asynchronous
RPC, the caller can be sure that its message will be delivered to the server, even if
the server is currently down (the request is then buffered "somewhere" in the network).
So the programmer does not have to deal with timeouts on the client.

• The problem is that RPCs should be transparent to programmers and users and, if
possible, as fast as local calls. While normal RPC can use small timeouts to limit the
total time to execute a remote procedure, persistent transfer techniques are usually
designed to "survive" long periods reliably. So it can happen that the server comes up
again only after days and then processes the buffered call; there is no guarantee that
the server gets the call fast enough. Because the client might go offline in the
meantime, the same problem can occur with the response message.

19. Routing tables in IBM MQSeries, and in many other message-queuing
systems, are configured manually. Describe a simple way of doing this
automatically.

Three methods come to mind:

• One could set up a central routing server; the network topology is then a star. Every
queue manager sends outgoing messages to the central server, which manages one big
routing list. This list can be assembled automatically, because every queue manager
has to register and unregister at the router. Of course the whole approach leads to
enormous scalability problems, although it is conceptually simple.

• A decentralized approach: the first queue manager in the network starts with an
empty routing table. Every new manager sends a multicast to all members of the
manager network, and they can then add a direct route to the new host to their
routing tables. Of course the members have to be known somehow (e.g. through a
directory server).

• Another decentralized approach: the first queue manager in the network starts with
an empty routing table. Every new manager sends a multicast to all its neighbours.
They add the new host to their routing tables and then forward the announcement
to all their own neighbours (which are already in their routing tables), which do the
same. When the announcement is received a second time by a host, it is not forwarded
anymore. This means that the members of the manager network need not be known
in advance, and every host gets at least an indirect route to the new host. Of course
this flooding algorithm has scalability problems of its own.

21. With persistent communication, a receiver generally has its own local buffer
where messages can be stored when the receiver is not executing. To create
such a buffer, we may need to specify its size. Give an argument why this is
preferable, as well as one against specification of the size.

Pro specifying the size:

• It limits the buffer size and the needed memory capacity on the receiving host.

• An unlimited buffer silently stores messages even when the receiver is overloaded.
With a limited buffer, a rejection message is sent back when new messages cannot be
accepted anymore; if the host is only an intermediate stop on the route to another
host, a different route can then be chosen.

Contra specifying the size:

• It does not allow receiving messages which are larger than the limit.

• A size limit is always empirical and possibly not appropriate for certain use cases.

23. Give an example where multicasting is also useful for discrete data streams.

Multicasting (especially when realized by a lower-level layer, e.g. by IP) gives no guarantees
on data rates or on the correct order of the received data items. It must therefore be
possible to reorder the items after receiving them all.

Under that condition, multicasting is still useful for transferring streams that are neither
time- nor order-critical, like images, software updates and backups.

25. How could one guarantee a minimum end-to-end delay when a collection of
computers is organized in a (logical or physical) ring?

A theoretical minimum delay for all possible connections is given by the time light needs
to travel between the two physically closest members of the ring.

If a message would otherwise be delivered too fast, one can always use buffers or triggers
which delay its delivery. One could also send the message the "long" way around the ring
to the endpoint. There are usually more possibilities to make a network slow than to make
it fast.

27. For this exercise you are to implement a simple client-server system using
RPC. The server offers one procedure, next, which takes an integer as input
and returns its successor as output. Write a stub procedure called next for use
on the client side. Its job is to send the parameter to the server using UDP
and wait for the response, timing out if the response takes too long. The server
procedure should listen on a known port, accept requests, carry them out and
send back the results.

I implemented the client-server system in Python, which offers the socket semantics without
dealing with low-level data structures. As the exercise demands, client and server
communicate over UDP, and the client times out if the server's response takes too long.

The server

#!/usr/bin/python

from socket import *

def server():
    # set up the connection parameters
    HOST = "localhost"
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)
    # create a UDP socket
    serversock = socket(AF_INET, SOCK_DGRAM)
    # bind the local address to it
    serversock.bind(ADDR)

    while 1:
        print "Waiting for a request..."
        # block until a request datagram arrives
        data, addr = serversock.recvfrom(BUFSIZ)
        print "request from: ", addr
        # unpack the sent string, add one, pack it and send it back
        serversock.sendto(repr(int(data) + 1), addr)
    serversock.close()

server()

The client

#!/usr/bin/python

from socket import *

def next(number):
    # define the connection parameters
    HOST = "localhost"
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    # set up a UDP socket that times out if the response takes too long
    udpCliSock = socket(AF_INET, SOCK_DGRAM)
    udpCliSock.settimeout(2.0)

    try:
        # send the parameter to the server
        udpCliSock.sendto(repr(number), ADDR)
        # block until the response arrives (or the timeout expires)
        successor, addr = udpCliSock.recvfrom(BUFSIZ)
    except timeout:
        print "Error: the server did not answer in time."
        successor = None
    udpCliSock.close()
    return successor

# make the RPC
number = int(raw_input("Please enter an integer: "))
number_succ = next(number)
print "The successor of", number, "is", number_succ

3 Processes

1. In this problem you are to compare reading a file using a single-threaded file
server and a multi-threaded server. It takes 15 ms to get a request for work,
dispatch it, and do the rest of the necessary processing, assuming that the data
needed are in a cache in main memory. If a disc operation is needed, as is
the case one-third of the time, an additional 75 ms is required, during which
time the thread sleeps. How many requests can the server handle if it is single
threaded? If it is multi-threaded?

Although the assumptions are highly idealistic (e.g. we do not consider the load of the
network connection), we can draw the following conclusions:

• Generally: in 2/3 of all cases we have a latency of 15 ms (file in RAM). In 1/3 of all
cases an additional disc operation of 75 ms is needed, i.e. 90 ms in total (file not in
RAM, must be fetched from disc, thread blocks). These assumptions describe statistical
system behaviour and apply to both the single- and the multi-threaded server.

• Single-threaded file server: because we have only one thread, all requests get serialized
and we can use the weighted average: 2/3 * 15 ms + 1/3 * 90 ms = 40 ms per request.
This makes 1000 / 40 = 25 requests/s.

• Multi-threaded file server: we create a new thread for each incoming request.

– User-level threads: when one thread blocks, the whole process gets blocked, and
implicitly also the other threads. The performance is the same as in the case of
the single-threaded server (25 requests/s).

– Kernel-level threads: when one thread sleeps on the disc, the other threads can
proceed, so all the waiting for the disc is overlapped with useful work. The CPU
is busy for 15 ms per request, so we can handle 1000 / 15 ≈ 66.7 requests/s (see
the quick check below).

3. In the text, we described a multi-threaded file server, showing why it is better
than a single-threaded server and a finite-state machine server. Are there any
circumstances in which the single-threaded server might be better? Give an
example.

There can be situations in which a single-threaded server should be preferred over a
multi-threaded server:

1. Complexity and safety. It is obvious that programming multi-threaded applications
is much harder than programming single-threaded applications. One has to take care
of mutexes, locks and inter-thread communication. Thus multi-threaded file servers
should not be used when low complexity and safety from errors have top priority.

2. Few blocking system calls. When a big file cache (a big RAM) is available, we will
have only few disc accesses and thus only few blocking system calls. In this case the
performance advantages of threads vanish and a single-threaded server might reach a
similar performance.

3. Processing time known. If the time to process a request is known and we have many
requests, we could prioritize them in a single-threaded system and process the shortest
task first, so the overall delay until results arrive would be shorter. With threads we
could not do this, because their execution order is determined by the operating system.

5. Having only a single lightweight process (LWP) per process is also not such
a good idea. Why not?

The main problem is one of performance. It occurs every time we have a thread switch,
e.g. when a thread blocks itself on a mutex variable. We then have to schedule another
thread, i.e. continue a sleeping thread. If we have only one LWP per process, all the other
(still) sleeping threads are assigned to LWPs in other processes. In order to switch to them
we would have to do an entire kernel context switch to the other process, which is very
expensive (we would have to save the register values, the program counter and the stack,
flush the address buffers in the MMU and the TLB, and change the memory mappings).
Only then could we execute the thread (in the environment of the other process).

If we have multiple LWPs per process, we can schedule a thread on another LWP in the
same process. In this case we do not enter kernel space and only make a relatively cheap
save of the CPU registers.

Another consequence is that communication between threads in different processes
becomes very expensive (we have to enter kernel space every time).

7. Proxies can support replication transparency by invoking each replica, as
explained in the text. Can (the server side of) an object be subject to a replicated
invocation?

The question is quite ambiguous. I will try to sketch the principle of replication transparency
with proxies, and some of its problems.

We can achieve replication transparency with a proxy on the client side, which distributes
remote object invocations over all object replicas. One drawback is of course that the proxy
has to know the locations of all the replicas (we would need e.g. an object directory).
Otherwise, if we update only a subset of the replicas, we lose consistency.

So there is no single server side (for the client), but many peer server objects. When we
update one of them, we must ensure the consistency of the others. One could update all
the others too, but in this case we have to ensure a globally correct order of updates; in a
distributed system this can be quite hairy (locking, global time, etc.). Alternatively we
could update only a master copy and invalidate all other copies (which are used as buffers
for read access). The drawback is that the master object then becomes the bottleneck for
all following accesses.

A better approach may be to put the "intelligence" into the object itself. The client would
see only one object on one server. The object would distribute itself based on an object-
specific distribution policy; e.g. a distributed weather forecast object would distribute itself
only onto high-speed servers, while a distributed customer management object would
distribute itself only onto fault-tolerant servers. At the least it should provide enough
information to the server so that the server can distribute the object. The difference is that
the distribution logic is on the more natural side (the server side) and that clients would
not use a proxy anymore.

9. Sketch the design of a multi-threaded server that supports multiple protocols
using sockets as its transport-level interface to the underlying operating system.

In Python it could look something like this. Note that the server is not listening on several
ports, but on one well-known port (21567), where requests get dispatched into different
threads (in which they can be treated according to the requested protocol).

The client:

#!/usr/bin/python

from socket import *

def connect(request):
    # define the connection parameters
    HOST = "localhost"
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)

    print 'Connecting to server ...'

    # set up the connection
    tcpCliSock = socket(AF_INET, SOCK_STREAM)
    tcpCliSock.connect(ADDR)

    print 'Connection established ...'

    # pack the request number into a string and send it
    tcpCliSock.send(repr(request))
    # go into blocking receiving mode (buffer size specified)
    answer = tcpCliSock.recv(BUFSIZ)
    if answer:
        print answer
    tcpCliSock.close()

# ask for the service
while 1:
    service = raw_input("Please enter a service to request (FTP, HTTP, SMTP): ")
    if service == 'FTP':
        request = 21
        break
    elif service == 'HTTP':
        request = 80
        break
    elif service == 'SMTP':
        request = 25
        break
    else:
        print 'Error: Please enter a valid service name.'

connect(request)

The server:

#!/usr/bin/python

from socket import *
from thread import start_new_thread

def handler(clientsock, addr):
    # the per-connection handler resides in its own function and thread
    while 1:
        data = clientsock.recv(BUFSIZ)
        if not data: break
        # depending on the requested service, we signal that we are listening
        if int(data) == 21:
            clientsock.send('FTP server listening')
        elif int(data) == 25:
            clientsock.send('SMTP server listening')
        elif int(data) == 80:
            clientsock.send('HTTP server listening')
    clientsock.close()

if __name__ == '__main__':
    # set up the connection parameters
    HOST = "localhost"
    PORT = 21567
    BUFSIZ = 1024
    ADDR = (HOST, PORT)
    NOOFTHREADS = 3

    def server():
        # create a socket
        serversock = socket(AF_INET, SOCK_STREAM)
        # bind the local address to it
        serversock.bind(ADDR)
        # blocking listen mode
        serversock.listen(NOOFTHREADS)

        print 'Server is now listening ...'

        while 1:
            # we got a connection
            clientsock, addr = serversock.accept()
            print "New connection from: ", addr
            # start a new thread; the handler closes the socket when done
            start_new_thread(handler, (clientsock, addr))
        serversock.close()

    server()

11. Explain what an object adapter is.

An object adapter is part of an object server (which is used to provide remote clients with
an interface to local objects).

Usually one wants different behaviours for different objects. E.g. transient objects could
be created on the fly at the first client request and deleted when the last client releases
the object. An example could be a huge file object, which would waste RAM if no client
uses it. Alternatively we could create a transient object at the time the server is initialized
and keep it in memory until the server shuts down. One could want this e.g. for
a web application object, which has to serve users as fast as possible (we do not want to
risk delays through expensive load operations).

Other examples of object behaviour are the degree of isolation between objects (can objects
share memory or not?) and the threading behaviour (one thread per object, or one thread
per request and object?). Tanenbaum calls these behaviours activation policies.

We can think of an object adapter as an implementation of a specific activation policy. It
manages one or more objects and invokes them according to the policy. Because an object
server should be flexible, there can be many object adapters in one object server. When a
new request hits the server, it is dispatched to the appropriate object adapter (based on the
requested object), and by the object adapter redirected to the matching object stub (this
time based on the requested method and the specific activation policy). It is important to
note that object adapters are unaware of the exact semantics of their objects; otherwise
they could not be generic enough.

13. Change the procedure thread_per_object() in the example of the object
adapters, so that all objects under the control of the adapter are handled by a
single thread.

The code of the modified version, let us call it thread_per_adapter(), is very similar to
thread_per_object().

When the dispatcher calls the adapter, it puts a message into the buffer of the adapter
thread. The adapter thread invokes the matching stub and creates the response message.

#include <header.h>
#include <thread.h>

#define MAX_OBJECTS 100
#define NULL 0
#define ANY -1

METHOD_CALL invoke[MAX_OBJECTS];   /* array of pointers to stubs */
THREAD *root;                      /* demultiplexer thread */
THREAD *adapter_thread;            /* change: one thread per adapter */

void thread_per_adapter(void) {    /* change: one thread serves all objects */
    message *req, *res;            /* request and response message */
    unsigned size;                 /* size of messages */
    char **results;                /* array with all results */

    while (TRUE) {
        get_msg(&size, (char*) &req);          /* block for invocation request */

        /* Pass request to the appropriate stub. The stub is assumed to */
        /* allocate memory for storing the results. */
        (invoke[req->object_id])(req->size, req->data, &size, &results);

        res = malloc(sizeof(message) + size);  /* create response message */
        res->object_id = req->object_id;       /* identify object */
        res->method_id = req->method_id;       /* identify method */
        res->size = size;                      /* set size of invocation results */
        memcpy(res->data, results, size);      /* copy results into response */
        put_msg(root, sizeof(res), res);       /* append response to buffer */
        free(req);                             /* free memory of request */
        free(results);                         /* free memory of results */
    }
}

void invoke_adapter(long oid, message *request) {
    /* change: all objects of this adapter are served by the single thread */
    put_msg(adapter_thread, sizeof(request), request);
}

15. Imagine a web server that maintains a table in which client IP addresses
are mapped to the most recently visited web pages. When a client connects to
a server, the server looks up the client in its table, and if found, returns the
registered page. Is this server stateful or stateless?

It is in principle stateful, because it maintains the current state of each client in a table
(client, last visited page). This means that the current session of each client is preserved
over time.

One could argue that IP addresses do not identify clients unambiguously: when dynamically
assigned IP addresses are used, the client behind an IP can change over time. The server
would still work statefully, although clients could then get incorrect web pages.

17. Strong mobility in Unix systems could be supported by allowing a process
to fork a child on a remote machine. Explain how this would work.

With strong mobility we are able to stop the execution of a program, move it to another
machine and resume it where we left off. One can think of two techniques:

1. Process migration. We take the process and its address space as a whole and start the
migration via an external signal or an external migration manager. First of all we stop
the process (freeze it) and package it. Then we send it to the remote machine, unpack it
and resume it. Of course we have to take care of all dependencies the process needs, but
this also applies to process cloning. It is important to note that we need external help,
because the program could not do anything by itself after it has been put to sleep.

2. Process cloning. We use a cloning technique which is very similar to normal Unix
process creation. First the program issues a fork() system call. The operating system
creates an exact copy of the process and its address space. Both processes keep running
and can tell from the return value of fork() whether they are the parent or the child.
In Unix the child would now issue exec() and replace itself; in our case, however, the
parent puts the child to sleep, packages it, sends it to the remote machine and terminates
itself. The remote machine is then responsible for unpacking and resuming the process.

Process cloning could look somewhat like this (a sketch; put_asleep(), package() and
send() are assumed helpers from the hypothetical migration-helper.h):

#include <stdlib.h>
#include <unistd.h>
#include <migration-helper.h>   /* hypothetical helper library */

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {             /* child: the copy that will migrate */
        pause();                /* simply wait to be packaged */
    } else if (pid > 0) {       /* parent process */
        put_asleep(pid);        /* freeze the sleeping child... */
        package(pid);           /* ...package it... */
        send(pid);              /* ...and ship it to the remote machine */
        exit(0);                /* terminate locally (strong mobility) */
    } else {                    /* an error occurred */
        exit(1);
    }
    /* the remote machine unpacks and resumes the process */
    return 0;
}

19. Consider a process P that requires access to a file F that is locally available
on the machine where P is currently running (let's call it M1). When P moves
to another machine (M2), it still requires access to F. If the file-to-machine
binding is fixed, how could the system-wide reference to F be implemented?

We have to establish a system-wide reference to the file on the previous machine, so that
we can still access the (fixed) file from the new machine. One could do that with two
approaches:

• Proxy. One could create a proxy on M2 which captures all references to F. The proxy
would redirect the requests to machine M1, which provides the file; M1 would need
some file server software. The process P would not notice the redirection, because it is
still using the same interface to F. A drawback is of course the network delay penalty:
even if the network is not a data rate bottleneck, we would always introduce some
additional latency. Another problem is that if M1 crashes, M2 will stop working
too.

• Local copy. If we assume that the file is small and thus movable to M2, we could
at least provide P with a read-only copy of F. For consistency reasons, writes would
still take place on M1 (through a proxy on M2, as in the first approach) and invalidate
the copy. This approach does of course not work with fixed (e.g. really large) files like
databases or movie collections.

21. Compare the architecture of D'Agents with that of an agent platform in
the FIPA model.

D'Agents is rather an existing implementation than a specification. In contrast, FIPA
defines more of a general model for software agents, which can be implemented fully or in
part (one example is the JADE platform). — TODO: Agent management, Directory service,
Agent communication

23. Where does an agent communication language fit in the OSI model, when
it is implemented on top of a system for handling e-mail, such as in D'Agents?
What is the benefit of such an approach?

Agent communication languages (ACLs), e.g. FIPA ACL, are designed to establish a common
communication standard between agents (which could be written in different languages
and run on different heterogeneous machines under different agent platforms).

Because Tanenbaum does not say too much about inter-agent communication in D'Agents, I
would like to look at ACLs in general first. One could argue that ACLs implement
functionalities of the session layer, the presentation layer and the application layer:

• Session layer. The session layer should provide synchronization facilities and
checkpointing for error recovery. An ACL provides predefined high-level protocols
for agent communication, e.g. a negotiation protocol with simple propose, agree and
disagree directives. These protocols can be thought of as a session in which the agents
communicate. Protocols implicitly define error treatment (e.g. with timeouts) and
synchronization.

• Presentation layer. The presentation layer should give the transferred bits a meaning:
unstructured, flat messages are turned into meaningful, structured records. Agent
communication is useless if the two agents cannot agree on common message semantics.
E.g. FIPA messages can specify an ontology field, which says that the message is e.g.
a library record or a petrol station price offer.

• Application layer. The application layer should implement a high-level end-user
service, e.g. messaging or file transfer. Because inter-agent communication is a
high-level service, it should be clear that ACLs also belong to this layer.

Communication in D'Agents is done with e-mails. The advantage of such an approach
is that e-mail is available virtually everywhere and is already high-level, which eases the
implementation of the ACL. Communication can be done over the well-known ports 25 and
110 without drilling new holes in a firewall.

4 Naming

1. Give an example of where an address of an entity E needs to be further
resolved into another address to actually access E.

It is important not to confuse entities, access points, names, addresses and identifiers, so I
want to give short definitions of these terms:

• Entity. An entity can be anything in a distributed system, e.g. files, users, hardware
devices (printers, hosts, ...), processes, mailboxes or messages. Entities have an
interface on which one can operate.

• Name. A name is a string of characters that refers to an entity. Every entity has a
name, although this name need not be globally known or globally unique. Examples
would be "125.23.12.46" or "xterm@user1@121.43.57.22".

• Access point. An access point is an entity which allows us to operate on another
entity. There can be many access points for an entity, and access points are typically
dynamic (they might change over time).

• Address. An address is the name of an access point of an entity. Thus we can use
an address to refer to an entity. Examples are a telephone number for the "telephone
entry point" of a person, or "www.office-server:80". An address might not be sufficient
to access the entity directly (if the access point is a pointer to another address).

• Identifier. An identifier is a name for an entity which fulfils two properties:

– Every entity within the system has exactly one identifier, and an identifier can be
used by at most one entity.

– The mapping between entities and identifiers is static; this means that an
identifier, once assigned to an entity, can never be reassigned to another entity.

So the question actually asks the following: give an example of a two-step resolution of
an address into an identifier (of an entity), where the first address is a pointer to another
address.

One example would be an Internet address like "www.microsoft.com". The Domain Name
System (DNS) resolves this into the IP address 207.46.193.254. Although this name is
Internet-wide unique, the actual access on the data link level needs another resolution step
into the Ethernet address of 207.46.193.254 (via ARP). This last step is of course not visible
to the user. So the IP and the MAC are both addresses of the same entity, but the IP alone
is not enough to access the host.

A more contrived example would be the following: I want to call "Mr. X", but forgot
his number. So I look up his number in the telephone book, where I find "53727463"
(resolution step 1). When I call this number on the telephone, my telephone provider
resolves the number into a connection (resolution step 2). "Mr. X" and "53727463" are
both addresses of the same entity (provided that both occur globally only once), but the
address "Mr. X" alone is not enough to reach him.

3. Give some examples of true identifiers.

We must take care to preserve uniqueness. Examples would be:

• For a person: the personal ID on the identity card. This ID is an identifier for the
person within one country. If we can make sure that e.g. different countries use
different ID name spaces, then the ID is also a global identifier.

• For files: hash functions (e.g. MD5) provide identifiers for files. Although hash
collisions can occur (which would violate uniqueness), in practice the probability of a
collision is negligible. Sometimes one combines the hash value of a file with its host
address to obtain a global identifier (see the sketch after this list).

• For Internet hosts: an IP address is not an identifier in the strict sense, because one
IP address can belong to a subnet (masquerading) and thus to many hosts. We should
combine the IP address with e.g. the MAC address to get a real identifier.

– Note: in practice the IP address can be used as an identifier, because
masquerading takes special measures to discriminate between hosts behind the
same IP address.
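
A minimal Python sketch of such a hash-based file identifier; the example path and the combination format are made up for illustration:

import hashlib, socket

def file_identifier(path):
    digest = hashlib.md5(open(path, "rb").read()).hexdigest()
    # combining the hash with the host name makes cross-host collisions harmless
    return "%s@%s" % (digest, socket.gethostname())

print file_identifier("/etc/hosts")   # hypothetical example file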

5. Jade is a distributed file system that uses per-user name spaces (Rao and
Peterson, 1993). In other words, each user has its own, private name space. Can
names from such name spaces be used to share resources between two different
users?

We want to share files and directories between different users, who can each see only their
own local name space. The problem reduces to the problem of accessing a file in another
name space.

A Jade file name is resolved within a local (per-user) scope. So what we definitely cannot
use are soft links to the shared file, because a soft link cannot "escape" from the local
user name space.

We can think of using hard links, so that we have multiple paths (in multiple name spaces)
linking to one file. Usually hard links work only within one name space, but we can extend
this and allow hard links to point to locations in a global, system-wide view. In this way we
could access globally available files. With the same extension we could also use soft links
which point to shared locations in a global system view. This extension approach works, but
it has one major drawback: it violates the condition that each user can only see his local
name space, and thus breaks the name space isolation.

Of course one could also mount the other name space (or only the needed shared files) into
one's own name space. Then one user could access files of another name space.

As we see, it is quite difficult to share files between users with local name spaces,
compared to sharing between users in a global name space environment.

7. Is an identifier allowed to contain information on the entity it refers to?

(For some definitions see answer 1.) It is not only allowed, it even should contain information
on the entity it refers to, because we want to make the identifier unique among all identifiers.
If we could use no information from the entity, we would have to take extra measures to
make sure that the identifier is unique. If we can use some information from the entity, we
can use it to add some distinguishing attributes to the identifier.

A (contrived) example: imagine a red-haired person with the identifier "A". If we get
another person and do not know the first person (in particular not his identifier), we
have the problem of giving the new person a unique identifier. Of course one could check
all other persons and choose a not-yet-used identifier. If we don't want to do that, we could
call the second person "black hair, blue eyes, living in Berlin, Straße des 17. Juni 3, height
175 cm". So we used information from the entity to create a unique identifier.

Note: this is of course no guarantee that there is not another person with the same
characteristics "somewhere" in the system, but it provides a nice starting point for an
identifier.

9. Give an example of how the closure mechanism for a URL could work.

Usually we have an address of an entity and we want to use a naming system to get the
associated identifier. The term "closure mechanism" refers to the starting point of such
a name resolution: we must know what to pass to the naming system and how.

A URL (e.g. "http://news.bbc.co.uk/2/hi/uk_news/7030558.stm") contains three parts:

• Protocol ("http" = HTTP protocol)

• Host address ("news.bbc.co.uk")

• Local file name ("/2/hi/uk_news/7030558.stm"), which is relative to some web space
directory

The client now does the following:

• It separates the three parts.

• It passes the host address to a local name resolving module.

– This module either returns a locally cached IP address, or

– it passes the address to a locally known DNS server, which does some magic
to find the associated IP address and returns it.

∗ Note: DNS uses an iterative approach. This means: if the DNS server
which we contacted does not know the IP address, it at least gives us
the IP address of another DNS server, which manages a part of the domain
("subzone"). In this case we have to ask that server. The process may repeat.

• It extracts the protocol and uses a matching protocol module (here: the HTTP
module) and the IP address to contact the host. For this it usually uses a well-known
(implicit) port, e.g. 80 for HTTP.

• Once a connection is established, the client passes the local file name. The server
starts its own, local file resolution and can finally return the requested file.

As we see, we actually have two closure steps, one at DNS and one on the server for the file
resolution.
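
As a sketch, the client-side part of this closure mechanism maps almost directly onto the Python standard library (a rough illustration, not a full HTTP client):

from urlparse import urlparse
import socket

url = "http://news.bbc.co.uk/2/hi/uk_news/7030558.stm"
parts = urlparse(url)

protocol = parts.scheme   # "http": choose the HTTP module and the implicit port 80
host = parts.netloc       # "news.bbc.co.uk": to be resolved via DNS
path = parts.path         # resolved later, locally, by the web server

ip = socket.gethostbyname(host)   # closure step 1: host name -> IP address
print protocol, host, ip, path
# closure step 2 happens on the server: it resolves 'path' inside its web root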

11. High-level name servers in DNS, that is, name servers implementing nodes
in the DNS name space that are close to the root, generally do not support
recursive name resolution. Can we expect much performance improvement if
they did?

In case a DNS server cannot resolve an address immediately, recursive resolution does not
return each intermediate result back to the client (like iterative resolution does), but passes
the request directly to the next responsible name server. Thus we get a recursive chain down
to the point where the address finally can be resolved. Once the IP address is found, it is
not immediately returned to the client, but sent back up the tree, and the root server
returns it.

The theoretical main advantage of this approach is efficient caching. The root server (and
all other higher servers) learn IP addresses from the backwards messages. The next time
they are asked for the same address, they can return it immediately, without doing
the whole descending recursion again.

Can we expect much performance improvement from this in a real-world scenario? Maybe
not, because:

• We would put double the load on the root server (plus the processing of the backwards
messages).

• The higher servers would have to cache host addresses from the administrational
(lower) layers, which change really often (compared to the higher-level zones they store
anyway). To provide up-to-date results, the servers would have to invalidate the caches
frequently and update them (per recursion). Thus the efficiency of caching is limited.

13. A special form of locating an entity is called anycasting, by which a service
is identified by means of an IP address (see, for example, Partridge et al., 1993).
Sending a request to an anycast address returns a response from a server
implementing the service identified by that anycast address. Outline the
implementation of an anycast service based on the hierarchical service described
in Sec. 4.2.4.

First of all some words to anycast: We can think of an anycast service as a service, which is
represented by an IP address (or a host address), but actually provided by the ”nearest” to
the client or ”fastest” host. The hosts are transparently chosen from a pool (based on some
criterion, which does not interest us) by an anycast service on the user-visible endpoint
server.

We have now the following problem: A pool of servers, all known by their host names, must
be managed (on the anycast endpoint) in a dynamic tree. Leaves represent available hosts
and their IP addresses. We must choose a host (based on the criterion) and look up its
IP address (and pass it to the client). Because the pool of hosts is highly dynamic and
potentially big, we need efficient operations for lookup, inserting and deleting of nodes.

We could organize the pool as a DNS domain (e.g. pool.uni.wroc.pl), so that the pool
servers are subdomain leaves (e.g. serv2.group3.pool.uni.wroc.pl). The highest common
domain name would be the root of our tree. The hierarchy of subdomains would form the
rest of the tree. If a node is a leaf, it knows its IP address. Otherwise it has a directory
of all its descendants (children, grandchildren, ...), where each entry is a pointer to the next
child on the path to that descendant.

How to do lookups? We pass the searched host name to any node (for efficiency we would
have to think about a good starting node). If that node happens to be the searched host,
we are done. If not and it is a leaf, we ask the father node. If we ask some node which
is not a leaf, we have two choices:

• The node has no matching entry in its directory: No subnode of this node is the
searched host. We pass the request upwards to the father node.

• If the node has a matching entry, we follow the pointer to the goal.

This ascending or descending can require many steps. In the worst case we have to ascend
to the root and descend down to the lowest level again.

When something changes (an update, e.g. of the associated IP address), we have to look up
the host, make the update and (if necessary) pass a recursive UPDATE request to the
parent to change the pointers.

When we add a new host to the pool, we add a new node under the matching subdomain
node and announce the insertion with a recursive, ascending message to the father node.
So we create new pointers along the path to the root node. Deleting leaves happens the same
way, but now we have to remove the pointer from all parent nodes.

Of course one could apply many optimizations, such as caching.
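
A minimal sketch of the lookup procedure just described; the names Node and lookup are
hypothetical:

class Node:
    def __init__(self, name, parent=None, address=None):
        self.name = name
        self.parent = parent        # None for the root
        self.address = address      # set only on leaves
        self.directory = {}         # descendant host name -> child on the path to it

def lookup(node, hostname):
    if node.name == hostname:
        return node.address                                 # found the host itself
    if hostname in node.directory:
        return lookup(node.directory[hostname], hostname)   # descend along the pointer
    if node.parent is None:
        return None                                         # not in the pool at all
    return lookup(node.parent, hostname)                    # ascend towards the root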

15. Suppose that it is known that a specific mobile entity will almost never
move outside domain D, and if it does, it can be expected to return soon. How
can this information be used to speed up the lookup operation in a hierarchical
mobile location service?

Think of a mobile device, which changes its name only between er.uni.wroc.pl, sie.uni.wroc.pl
and es.uni.wroc.pl, but e.g. not (or very seldom) to du.uni.warszawa.pl. So we stay in the
domain uni.wroc.pl.

When we use the hierarchical naming system of chapter 4.2.4, we can exploit effective caching.
We know that the path from the root node to the domain node (e.g. uni.wroc.pl) will never
change; the only things that will change are the pointer in the domain node and the actual
leaf nodes. So we could add to every node of the tree a pointer to the domain node. Thus,
in case of a lookup, we do not have to ascend (possibly to the root) and descend again, but
can make a ”crossjump” directly to the domain node.

In case of a cache miss, which happens if we have crosspointers and the device moved out
of the domain, we do the following: We make the crossjump and notice that the device
is not in the domain where it should be. From here we can make a standard lookup, as
described in answer 13.

17. Consider an entity moving from location A to B, while passing several
intermediate locations where it will reside for only a relatively short time. When
arriving at B, it settles down for a while. Changing an address in a hierarchical
location service may still take a relatively long time to complete, and should
therefore be avoided when visiting an intermediate location. How can the entity
be located at an intermediate location?

Updates in the hierarchical tree take so long because they have to be propagated upwards
(up to the root) to update all the pointers. So we do not want any updates in the naming
tree. We could also get inconsistencies, e.g. when a lookup returns an IP address but the
device has moved to another location in the meantime (outdated IP).

For briefly visited intermediate locations we could use forwarding pointers. When we move
to a new intermediate location, we leave a reference to the new location at the old location.
This is very simple and we can follow the chain to the current location. This approach
works for a few nodes, but of course it scales badly with long chains and there is always the
risk of broken links.

If we work in a small network, we could also make a broadcast to find out the current
location of the device.
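
Following such a chain of forwarding pointers could look like this (a sketch; the forward
attribute and the hop limit are assumptions):

class Location:
    def __init__(self, address):
        self.address = address
        self.forward = None       # set when the entity moves on

def locate(loc, max_hops=32):
    hops = 0
    while loc.forward is not None:    # follow the references left behind
        loc = loc.forward
        hops += 1
        if hops > max_hops:           # guard against the long or broken chains mentioned above
            raise RuntimeError("chain too long or broken")
    return loc.address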

19. Make clear that weighted reference counting is more efficient than simple
reference counting. Assume communication is reliable.

We want to find out if a distributed object is still referenced by some clients, and if not,
we want to delete it (garbage collection). Because of the high latencies in distributed
systems this can get really tricky (communication is assumed to be reliable here, as stated
in the question).

With simple reference counting (SRC) the object keeps a counter of the number of references
to it. Clients send a message for every new or released reference. If the counter reaches zero,
we delete the object.

For weighted reference counting (WRC) we assign every object a total weight, and we do not
track the number of references to the object, but the weight of the references to it. If a
reference is too light, we can not copy it anymore (or we use something like indirection).

There is one main advantage of WRC over SRC: SRC requires frequent counter updates on the
object side. This can slow down performance significantly, especially in mutex situations
and over networks. With WRC, the burden of managing the weight is put on the clients:
copying a reference only splits its weight locally and requires no message to the object. The
object does not keep track of any frequently updated counter; it only knows the static total
weight (and compares it against the weight returned on releases).
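
A minimal sketch of the client-side weight splitting; the total weight of 128 and all names
are illustrative:

TOTAL_WEIGHT = 128

class RemoteObject:                  # object side
    def __init__(self):
        self.returned = 0            # weight handed back by released references
    def release(self, weight):       # the only message the object ever receives
        self.returned += weight
        if self.returned == TOTAL_WEIGHT:
            print("no references left -- collect the object")

class Reference:                     # client side
    def __init__(self, obj, weight):
        self.obj, self.weight = obj, weight
    def copy(self):                  # note: no message to the object is needed
        if self.weight < 2:
            raise RuntimeError("too light to copy -- indirection needed")
        half = self.weight // 2
        self.weight -= half
        return Reference(self.obj, half)
    def release(self):
        self.obj.release(self.weight)

Copying only splits a number locally; SRC would need a message to the object for every
single copy.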

21. Is it possible in generation reference counting that an entry G[i] becomes
less than 0?

Generation reference counting (GRC) is an alternative to weighted reference counting.
Instead of assigning a weight to each reference, we assign a generation number (GN) to each
reference. The first reference gets the GN 0. A copy of a reference with GN = j assigns GN =
j + 1 to the new reference. The object itself manages a generation table, which is a list of
all generations i = 0 ... n and the number of references belonging to each generation, G[i].
If there are no references in any generation, the object can be deleted.

An entry G[j] can only be decreased by the following operation: When a reference is
released, the client sends a message to the object, containing the generation number j of the
reference and the total number NR_CP of copies which have been made from this reference.
(By the way: with this method we do not have to announce new, copied references to the
object as in simple reference counting, which saves us network traffic.) When the object
receives such a message, it decrements G[j] by 1 (because one reference of this generation
was released) and increments G[j+1] by NR_CP (because the object now knows that there
are NR_CP references in the next generation).

Can G[i] be less than 0? Yes, in the following case: Say G[j] = 3 and G[j+1] = 2. Now
we make one new copy of a reference of generation j. Thus G[j+1] should be increased
by one, but the algorithm does not send this information to the object immediately, only
implicitly when the original reference of generation j is released. Thus G[j+1] stays 2.
When all three references of generation j+1 are released, the object receives three messages
and decreases G[j+1] three times, so that it is now -1.

The algorithm nevertheless stays correct: the object is only deleted once all entries are zero,
and the pending increment arrives at the latest when the copied-from reference is released.
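
The scenario can be replayed in a few lines (a sketch; deletion requires every entry of the
table to be zero):

from collections import defaultdict

G = defaultdict(int)    # generation table kept by the object
G[0] = 1                # the initial reference has generation 0

def released(generation, copies_made):
    # message sent when a client drops a reference
    G[generation] -= 1
    G[generation + 1] += copies_made

# the generation-0 reference is copied twice; the object hears nothing yet
released(1, 0)          # first generation-1 copy is dropped
released(1, 0)          # second generation-1 copy is dropped
print(G[1])             # -2: the copies were never announced explicitly
released(0, 2)          # dropping the original finally reports its two copies
print(G[1])             # 0 again -- the table is consistent, nothing was lost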

5 Synchronization

1. Name at least three sources of delay that can be introduced between WWV
broadcasting the time and the processors in a distributed system setting their
internal clocks.

WWV is a service of the American National Institute of Standards and Technology (NIST),
which broadcasts Coordinated Universal Time (UTC) on a shortwave radio band. This time
can be viewed as the reference time for all computer systems which synchronize with it.

Although there exist many advanced synchronization algorithms, there will always be a time
drift between local system time and reference time. There are two main reasons (along with
their sub-reasons):

• Accelerated / delayed local timer. Local time measurement is usually done by a quartz
crystal. It oscillates at a well-known frequency and thus, one can derive the time from
it.

– Unfortunately there are always small frequency variations from crystal to crystal.
They can sum up to huge time drifts, especially in distributed systems (where
we have now suddenly n different times).

• Network delays. Time synchronization over volatile network links is done as follows:
The client asks the time server for the reference time. The time server sends its
current time back and the client adopts it as its new time.

– The problem is that sending the answer over the network introduces an addi-
tional delay. Because we have no reliable global time, there is no way to tell
exactly how long the answer was under way.

– In most networks we do not know which way the packets will take, so they
might take longer or shorter routes.
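
The network-delay problem is classically softened by compensating with half the measured
round-trip time, as in this hedged sketch of a Cristian-style client (get_server_time stands
for the actual network request and is an assumption):

import time

def synchronized_time(get_server_time):
    t0 = time.monotonic()
    server_time = get_server_time()   # ask the time server for the reference time
    t1 = time.monotonic()
    # assume the reply was (t1 - t0) / 2 seconds old when it arrived;
    # this is only an estimate, since routes and loads vary (see above)
    return server_time + (t1 - t0) / 2.0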

3. Add a new message to Fig. 5-7 that is concurrent with message A, that is,
it neither happens before A nor after A.

Figure 5-7 illustrates three processes P1, P2 and P3 with differently fast local clocks, which
exchange the messages A and B:

P1 P2 P3
0 0 0
6 A–> 8 10
12 16 –>A 20
18 24 B–> 30
24 32 40 –>B
30 40 50
36 48 60
42 56 70

The new message shall happen neither before nor after A. For two events on the same
process we can say for sure which one happened first. The model does not allow sending two
messages at the same time from one process; otherwise we could send a message C at time
6 from process P1 to process P3. This would surely be concurrent with message A (trivial
case).

So we must send the message from either P2 or P3. Let's consider P2. At the times 0 and
8 we can not send any message, because we have no knowledge about P1's time. At time
16 we receive message A from P1, which was sent at P1's local time 6. Unfortunately P2
is now already at time 16 and it is forbidden to turn back time. If we were at local time 4
and received A, we could fast-forward to 6 and send a message C to P3, which would then
be concurrent with A (although it is only logically concurrent, not physically).

The same problem applies to process P3, which gets the knowledge of the sending time of
A too late.
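
For completeness, here is the standard event-based formulation of Lamport's clock rule that
underlies the figure (a simplification of the figure's free-running clocks: each event advances
the clock, and a receive never moves time backwards):

class LamportClock:
    def __init__(self):
        self.time = 0
    def local_event(self):
        self.time += 1
        return self.time
    def send(self):
        self.time += 1
        return self.time            # the timestamp piggy-backed on the message
    def receive(self, msg_time):
        # jump forward if the sender's timestamp is ahead of our own clock
        self.time = max(self.time, msg_time) + 1
        return self.time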

5. Consider a communication layer in which messages are delivered only in
the order that they were sent. Give an example in which even this ordering is
unnecessarily restrictive.

A total order of sent and delivered messages seems nice, and indeed there are many
applications where this property is useful, e.g. for deposits and withdrawals on bank accounts.
In these cases the order of the received messages matters: The end result is different if we
withdraw 100 dollars first and then add 2% interest to the account, or if we first add the
interest and withdraw subsequently.

The problem is that synchronizing the order of sent and received messages is costly and
sometimes simply not needed. An example would be a Voice-over-IP application. Low
latencies matter more than the correct ordering of messages. Of course, if all messages are
delivered in the wrong order, the system will not work properly, but humans can still
understand the meaning of sentences even if little parts (words, etc.) are reversed or not
correctly transferred.

7. In Fig. 5-12 we have two ELECTION messages circulating simultaneously.
While it does not harm to have two of them, it would be more elegant if one
could be killed off. Devise an algorithm for doing this without affecting the
operation of the basic election algorithm.

The algorithm is similar to the normal ring leader election algorithm. We must somehow
find double ELECTION messages and sort them out. We will do this by checking passing-by
ELECTION messages in all other initiating nodes. Note that although there can still be
multiple ELECTION messages circulating at the same time, only one will reach its sender,
because the others will be killed on their way.

Once a node recognizes that the current leader is down, it sends an ELECTION message to
its neighbour. Other nodes might recognize the failure at the same time, so there may be
multiple ELECTION messages circulating. When an arbitrary node N receives an ELECTION
message, it must decide what to do:

• If N has not previously initiated any ELECTION message (i.e. it has not recognized
the failure), it adds its node number to the ELECTION message and forwards it to
its next neighbour. (No difference to the standard algorithm here.)

• If N has initiated an ELECTION message, we have the following cases:

– It finds its own number in the list. Then the message was initiated by this node
and we remove it from the ring, because it has circulated once. We convert the
ELECTION message into a COORDINATOR message and send it around the
ring one more time. This serves to inform all ring members of who the new
leader is (the one with the highest number in the list of the ELECTION
message). So here again, we have no difference to the standard algorithm.

– It does not find its own number in the list. Then it was not the sender of the
ELECTION message. The node compares its own number to that of the initiating
node (which is somehow encoded in the message).

∗ If its own number is higher, the message is killed. In this way we make sure
that only the ELECTION message from the initiating node with the highest
number is allowed to pass.

∗ If its own number is lower, the node adds its own number to the list and
forwards the message.

In this algorithm all initiating nodes send ELECTION messages. At the same time they
make sure, that only one of the messages reaches the goal.
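
The decision rules can be summarized in a small handler (a sketch; a message is assumed to
carry its initiator's number and the list of node numbers collected so far):

def on_election(node_id, has_initiated, message):
    initiator, numbers = message
    if node_id in numbers:
        # the message has circulated once: announce the winner
        return ("COORDINATOR", max(numbers))
    if not has_initiated:
        numbers.append(node_id)                  # standard algorithm: tag and forward
        return ("FORWARD", (initiator, numbers))
    if node_id > initiator:
        return ("KILL", None)                    # our own message will survive instead
    numbers.append(node_id)                      # the other initiator outranks us
    return ("FORWARD", (initiator, numbers))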

9. In the centralized approach to mutual exclusion (Fig. 5-13), upon receiving
a message from a process releasing its exclusive access to the critical region it
was using, the coordinator normally grants permission to the first process on
the queue. Give another possible algorithm for the coordinator.

The question is how to choose the next process that we grant the right to enter the critical
region. Tanenbaum proposed a queue, i.e. a FIFO strategy. This strategy is fair (requests
are granted in the order they were received) and free of starvation (no process will wait
forever).

We could use another strategy, e.g. LIFO with a stack. Then the last process that asked
for permission will get access. This comes with heavy disadvantages: We are not fair
anymore and, if there is always more than one pending request, the earliest requests will
starve. The only advantage is that even in busy situations, when we have many requests,
there is still a chance that new requests get immediate access. In general the clients can not
tell if the server is under heavy request load or not, because the server's behaviour seems
completely stochastic (from the viewpoint of the clients).

We could also think of randomized strategies, which would manage a pool of access
requests and grant access based on a random decision. These strategies are also not fair,
although they are (depending on the probability distribution) free of starvation, i.e. every
process in the pool will eventually be chosen.

All in all we might still choose FIFO, because it has the most advantages.

Above we have only changed the strategy. We could also invent a new algorithm which
works as follows: When the critical area is free for access, the server multicasts a FREE
message to all clients. When a client wants to enter, it sends a REQUEST message to the
server. The server lets the client enter (by sending an OK message) and immediately locks
the critical area by sending a LOCKED message to all other clients. Further requests are
discarded by the server (killed, i.e. not answered). When the client releases the critical area
(by sending a RELEASE to the server), the server announces this with FREE messages.
This algorithm eats a lot of network bandwidth, but it has several advantages:

• It is fair: the first request, which reaches the server is allowed to enter.

• In theory, it should have no starvation, because all processes have the same chance to
enter.

• Based on the frequency of incoming FREE and LOCKED messages, the clients can
derive the load of the server.
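
A sketch of the centralized coordinator with Tanenbaum's FIFO strategy; swapping the
deque for a stack or a random pool yields the LIFO and randomized variants discussed
above (all names are illustrative):

from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = None
        self.queue = deque()             # FIFO: fair and starvation-free

    def on_request(self, process):
        if self.holder is None:
            self.holder = process
            return "OK"                  # grant immediately
        self.queue.append(process)       # no reply yet: the requester blocks
        return None

    def on_release(self, process):
        assert process == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder               # the next process to receive an OK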

11. Ricart and Agrawala's algorithm has the problem that if a process has
crashed and does not reply to a request from another process to enter a critical
region, the lack of response will be interpreted as denial of permission. We
suggest that all requests be answered immediately, to make it easy to detect
crashed processes. Are there any circumstances where even this method is
insufficient? Discuss.

I will not explain the whole algorithm, but the main problem is, that it uses blocking
communication. When one process wants to enter the critical region, it has to ask all the
other processes for permission. When one of the other processes crashes, it will wait forever
for an answer and the whole system gets frozen.

A solution would be to send a confirmation message after receiving a request, which either
grants access (OK message) or denies access (in this case the initiator would block and wait
for a later OK message). And here lies the problem. Because it should get an answer (either
OK or DENY), the initiating process can tell with a timeout counter whether the other
process crashed or not. But after it has received DENY it will put itself to sleep, and if the
other process crashes now, it will never be woken up. Thus, unfortunately, the system is not
100% deadlock safe, as it looked at first sight.

A smaller (and usual) problem with timeouts is that they can not reliably tell if a
system has really crashed or if it is just busy.

13. A distributed system may have multiple, critical regions. Imagine that
process 0 wants to enter critical region A and process 1 wants to enter critical
region B. Can Ricart and Agrawala's algorithm lead to deadlocks? Explain
your answer.

Of course it can lead to deadlocks. Suppose a system with two processes 0 and 1, where the
critical region A (e.g. a printer) resides on process 1 and the critical region B on process
0. Now process 0 tries to get into region A. It asks process 1 for access and subsequently
may enter. Now process 1 wants to get into B and asks process 0 for access. Because process
0 is not in B, it says ”OK”. Thus 1 is now in B.

Now comes the deadlock: Additionally process 0 wants to enter B and asks process 1 for
permission. Because 1 is already in B, 1 does not answer and puts 0 in its queue. In the
meantime 0 has put itself to sleep, waiting for an ”OK” from 1. Now suppose that 1 wants
to enter A. It asks the sleeping 0, but will never get an answer. Thus both processes can
not go on and we have a deadlock.

15. In Fig. 5-25(d) three schedules are shown, two illegal and one legal. For
the same transactions, give a complete list of all values that x might have at the
end, and state which are legal and which are illegal.

We aim for maximal parallelism, although we must take care that all transactions are
logically isolated from each other. The final result of concurrent transactions must be the
same as if the transactions were executed sequentially one after another in some particular
order.

In the example we have three transactions:

• A: x=0, x+1

• B: x=0, x+2

• C: x=0, x+3

If we execute them sequentially and assume that x is a shared variable, we could get the
following results, depending on which transaction runs last:

• x=1 if A is the last

• x=2 if B is the last

• x=3 if C is the last

So x=1, 2 or 3 are legal final values. If we execute the transactions somehow parallel, we
could get x=1, 2, 3, 4, 5 and 6 as result. Obviously only some schedules would be legal.

17. Give the full algorithm for whether an attempt to lock a file should succeed
or fail. Consider both read and write locks, and the possibility that the file was
unlocked, read locked or write locked.

If a process wants exclusive access to a resource, it must apply to a lock manager. To
allow concurrency, there are two kinds of locks:

• Read locks. The process only wants to read the resource. Other processes may not
write the resource, but they are still allowed to read it (because the value is assumed
not to change).

• Write locks. The process wants to change the resource. Thus other processes may,
for the duration of the lock, neither read nor write the resource.

There are three cases:

• The object is unlocked. Other processes are allowed to set read or write locks.

• The object is read-locked. Other processes may acquire read locks, but no write locks.

• The object is write-locked. Other processes may acquire neither read nor write
locks.
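
Put as code, the decision procedure could look as follows (a sketch with assumed names; it
also anticipates the upgrade rule of the next question):

def may_grant(readers, writer, requester, mode):
    # readers: set of processes holding read locks; writer: holder of the write lock
    if mode == "read":
        # reading is allowed unless someone else holds the write lock
        return writer is None or writer == requester
    if mode == "write":
        if writer is not None:
            return writer == requester   # the writer itself may keep writing
        # an upgrade from read to write succeeds only if we are the sole reader
        return len(readers - {requester}) == 0
    raise ValueError("unknown lock mode: " + mode)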

18. Systems that use locking for concurrency control usually distinguish read
locks from write locks. What should happen if a process has already acquired a
read lock and now wants to change it into a write lock? What about changing
a write lock into a read lock?

Let's assume that we have a read lock on an object. Other transactions could also have
read locks on this object, but no write locks. So we can not simply convert our read lock
into a write lock, because write locks exclude read locks. The only chance to do this is when
there are no other read locks on the object (i.e. no other reader).

If the transaction has a write lock on an object, then downgrading it into a read lock is no
problem, because the previous write lock excludes any other write or read locks.

19. With timestamp ordering in distributed transactions, suppose a write operation
write(T1, x) can be passed to the data manager, because the only, possibly
conflicting operation write(T2, x) had a lower timestamp. Why would it make
sense to let the scheduler postpone passing write(T1, x) until transaction T2
finishes?

Transaction T2 has changed x, but not yet committed. Although the timestamp of x is now
that of T2, the change is not permanent yet. If we let write(T1, x) pass, we would definitely
have to abort, because the timestamp of x is higher than that of T1 (T1 would update a
value which was externally changed since T1's last read).

If we do not pass it on and instead wait until T2 has finished, we have a chance that we do
not have to abort: It might happen that T2 itself subsequently gets aborted because of some
illegal read or write operations on some other values. Then all of T2's operations would be
rolled back, especially all timestamps. In this case write(T1, x) would not fail anymore.

22. We have repeatedly said that when a transaction is aborted, the world is
restored to its previous state, as though the transaction never happened. We
lied. Give an example where resetting the world is impossible.

Every transaction which changes the world in a way that is physically not reversible is itself
not reversible. E.g. the effects of a transaction which prints a document on a printer can
not be reversed, because the ink has been used and can not be brought back (except by
magic).

6 Consistency and Replication

1. Access to a shared Java object can be serialized by declaring its methods
as being synchronized. Is this enough to guarantee serialization when such an
object is replicated?

Synchronization with monitors as in Java gives guarantees only for a single object in a
single address space. The only thing the programmer can be sure of is that concurrent
accesses to one single object are automatically serialized.

In the case of replicated Java objects this guarantee does not work anymore. We have two
possibilities:

• The programmer must take measures so that the object also handles its distribution
(i.e. make the object aware of its eventual distribution and the related concurrency
issues). Letting the object handle distributed concurrent accesses is very much in line
with using a concurrency-unaware, ”dumb” object adapter, as we usually would do in
the non-replicated case. The advantage is that we could implement an object-specific
concurrency strategy (at the cost of a high implementation effort).

• Alternatively we could use a distributed Java runtime or a middleware which is
replication-aware. The programmer then does not need to worry about replication
issues anymore and can leave the responsibility for managing concurrent accesses to
the lower layers. A clear advantage of this solution is its simplicity.

I am not sure if there exist distributed Java runtimes, but there definitely exist distributed
object replication middlewares (e.g. JavaParty, http://svn.ipd.uni-karlsruhe.de/trac/javaparty).

3. Explain in your own words what the main reason is for actually considering
weak consistency models.

All stronger consistency models (strict, sequential, causal, FIFO) ensure at least that writes
by one single process are seen by all other processes in order. Sometimes one does not need
this property, because intermediate results need not be consistent and only some end result
must be seen by all other processes.

For example, consider processes which execute transactions on a database. Transactions can
be aborted at any time. The involved processes are responsible for leaving the system in
a consistent state after finishing or aborting a transaction. While processing a transaction,
the system can (and will) be in an inconsistent state. As long as concurrent transactions
are well protected, e.g. with locks, and the end result is consistent, this is no problem. In
the case of transactions, other processes do not need to see intermediate results in the right
order. Thus the corresponding consistency model is weak consistency.

It has the following properties:

• We have a global synchronization operation. All calls are seen by all processes in a
globally sequential order. This can be seen as a sequence of serialized transactions.

• The synchronization stands in some way for finishing a transaction and starting a new
one.

– Once a process has called the synchronization operation, it blocks until
all other processes have synchronized (by calling the synchronization operation).
This assures that updates are really pushed out to all processes before the next
transaction starts.

∗ Note: Actually it is enough to forbid accesses to the data store, but to
allow e.g. local calculations.

– Synchronization flushes the queue of write requests from all processes. This
means that all writes from all processes which are in progress have to be finished
before the synchronization completes. With this property we can use synchro-
nization to let other processes complete their writes (which we would expect
anyway).

Another nice effect of weak consistency models (and sometimes the reason for them) is the
high performance we get due to the lack of implicit global synchronization (the user has full
control over synchronization).

5. During the discussion of consistency models, we often referred to the contract
between the software and data store. Why is such a contract needed?

The contract is basically a guarantee, that the data store gives its clients: If you (the client)
obey some rules, I (the data store) assure working according to some rules / constraints. If
the programmer is aware of the read and write rules and implements its programs according
to them, he can expect the data store to work in the specified way.

There is a tendency: The stronger the contract is and the more rules the programmer has to
obey, the better are the guarantees, the slower is the performance of the data store, and the
harder it is to implement the data store. An example is strict consistency, which guarantees
that a read operation always returns the globally most recent written value of the data. It
can be shown that in the absence of an exact global time it is (physically) impossible to
implement such a data store.

Of course the programmer can always choose to break the contract (sometimes he does this
by accident, then it is a bug). In this case the data store will probably show some strange
behaviour.

In every case, the programmer has to think carefully about which guarantees he needs and
how far he is willing to deal with possibly inconsistent data before he chooses a consistency
model.

7. A multiprocessor has a single bus. Is it possible to implement strictly consistent
memory?

The defining property of strict consistency is: Any read on a data item x returns a value
corresponding to the result of the most recent write on x.

As Tanenbaum shows, this can be physically impossible to guarantee in a distributed
system: Consider two servers A and B which are quite far away from each other. A client
writes a data item x on A and another client reads x on B 1 ns later. Because the information
about the updated value of x can not reach B faster than with the speed of light, it will reach
B only after the read request has been processed. Thus we either read an outdated value of
x instead of the most recently written value, or we break the physical laws, which should be
quite hard.

The problem is that we have no globally unique timestamp which we can attach to every
written data item and use to identify the most recent versions. In a multiprocessor system
with a single bus, we can take the global (mainboard) clock as timestamp. Because the
bus can serve only one processor at a time, the timestamps are globally unique and we can
be sure that no two operations from two processors get the same timestamp. Thus we can
implement strictly consistent memory.

Note that if we allow multiple processors to operate on the same data at the same time
(e.g. through multiple buses), we have the risk of multiple operations within the same
interval (timestamp) and can not implement strict consistency anymore.

9. In Fig. 6-7, is 000000 a legal output for a distributed shared memory that is
only FIFO consistent? Explain your answer.

We have the following concurrent processes (all values are initialized with 0):

P1: x = 1; print(y,z)
P2: y = 1; print(x,z)
P3: z = 1; print(x,y)

An output of 000000 would usually mean that the print statements are executed first. While
this seems impossible with the stronger consistency models, it can appear to happen with
FIFO consistency. FIFO consistency basically says that writes by a single source are seen
by other processes in order, but it gives absolutely no guarantees about when they actually
see the writes. In the worst case they see the writes only after they have executed their last
statements. So the print statements are not executed before the assignments, but they do
not see the effects of the concurrent write operations in time.

In our example one could imagine the following execution:

P1: x = 1 (but: the write is not yet visible to the other processes)
P2: y = 1 (but: the write is not yet visible to the other processes)
P3: z = 1 (but: the write is not yet visible to the other processes)
P1: print(y,z) --> ”00”
P1 sees the other writes and finishes execution
P2: print(x,z) --> ”00”
P2 sees the other writes and finishes execution
P3: print(x,y) --> ”00”
P3 sees the other writes and finishes execution

11. At the end of Sec. 6.2.2., we discussed a formal model that said every set of
operations on a sequential consistent data store can be modelled by a string, H,
from which all the individual process sequences can be derived. For processes
P1 and P2 in Fig. 6-9, give all the possible values of H. Ignore processes P3
and P4 and do not include their operations in H.

Process 1 and 2 do the following:

• P1: W(x)a W(x)c

• P2: R(x)a W(x)b

I will use indices to refer to a process, e.g. R2(x)a means that process 2 reads the value a
for x.

The string H represents a history, that is, a valid, sequentially consistent execution order. A
history has to fulfil two constraints:

• The program order is maintained: For every process, the order of the statements in
the process must be preserved in H.

• Data coherence must be respected: A read on a data item x always returns the most
recently written value of x.

If an execution order fulfils both constraints, it is a valid history.

4! = 24 orderings are possible. I will list them and check them against the constraints:

W1(x)a W1(x)c R2(x)a W2(x)b: no data coherence (c written, a read)
W1(x)a W1(x)c W2(x)b R2(x)a: program order for P2 not preserved
W1(x)a R2(x)a W1(x)c W2(x)b: OK
W1(x)a R2(x)a W2(x)b W1(x)c: OK
W1(x)a W2(x)b W1(x)c R2(x)a: program order for P2 not preserved
W1(x)a W2(x)b R2(x)a W1(x)c: program order for P2 not preserved
W1(x)c W1(x)a R2(x)a W2(x)b: program order for P1 not preserved
W1(x)c W1(x)a W2(x)b R2(x)a: program order for P1 and P2 not preserved
W1(x)c R2(x)a W1(x)a W2(x)b: program order for P1 not preserved
W1(x)c R2(x)a W2(x)b W1(x)a: program order for P1 not preserved
W1(x)c W2(x)b W1(x)a R2(x)a: program order for P1 and P2 not preserved
W1(x)c W2(x)b R2(x)a W1(x)a: program order for P1 and P2 not preserved
R2(x)a W1(x)a W1(x)c W2(x)b: OK
R2(x)a W1(x)a W2(x)b W1(x)c: OK
R2(x)a W1(x)c W1(x)a W2(x)b: program order for P1 not preserved
R2(x)a W1(x)c W2(x)b W1(x)a: program order for P1 not preserved
R2(x)a W2(x)b W1(x)a W1(x)c: read values change without write
R2(x)a W2(x)b W1(x)c W1(x)a: program order for P1 not preserved
W2(x)b W1(x)a W1(x)c R2(x)a: program order for P2 not preserved
W2(x)b W1(x)a R2(x)a W1(x)c: program order for P2 not preserved
W2(x)b W1(x)c W1(x)a R2(x)a: program order for P1 and P2 not preserved
W2(x)b W1(x)c R2(x)a W1(x)a: program order for P1 and P2 not preserved
W2(x)b R2(x)a W1(x)a W1(x)c: program order for P2 not preserved
W2(x)b R2(x)a W1(x)c W1(x)a: program order for P1 and P2 not preserved

As we see, there are only four valid histories.
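
The enumeration can also be checked mechanically. The sketch below follows the interpreta-
tion used in the table above: program order must be preserved, a read must return the most
recently written value, and a read before any write is accepted only if the very next write
produces exactly the value read:

from itertools import permutations

OPS = [("P1", "W", "a"), ("P1", "W", "c"), ("P2", "R", "a"), ("P2", "W", "b")]
PROGRAM = {"P1": [("W", "a"), ("W", "c")], "P2": [("R", "a"), ("W", "b")]}

def program_order_kept(history):
    for proc, prog in PROGRAM.items():
        if [(kind, val) for p, kind, val in history if p == proc] != prog:
            return False
    return True

def coherent(history):
    last_write = None
    for i, (_, kind, value) in enumerate(history):
        if kind == "W":
            last_write = value
        elif last_write is not None:
            if value != last_write:      # a read must see the latest write
                return False
        else:
            upcoming = [v for _, k, v in history[i:] if k == "W"]
            if not upcoming or upcoming[0] != value:
                return False
    return True

valid = [h for h in permutations(OPS) if program_order_kept(h) and coherent(h)]
print(len(valid))    # prints 4, matching the table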

13. It is often argued that weak consistency models impose an extra burden for
programmers. To what extent is this statement actually true?

First of all, most programmers are used to single-processor or multiprocessor programming.
Two approaches are common in these environments:

• All such systems pretend strict consistency and are thus very intuitive to program.
Actually all these systems are sequentially consistent, because all accesses get
serialized by the system bus (see question 7). Note that the system keeps this kind
of consistency without any explicit synchronization constructs.

• On the other hand, many software systems work transaction-oriented, e.g. SQL
databases. Programmers have to use special constructs to mark the beginnings and
endings of transactions. Other typical examples are mutex locks for shared-memory
programming, where the programmer has to manage synchronization himself.

So yes, it is true that weak consistency models require explicit calls of a synchronization
operation, but programmers might already be used to these techniques.

15. Does Orca offer sequential consistency or entry consistency? Explain your
answer.

There are two kinds of synchronization mechanisms in Orca:

• Explicitly: The user can specify so-called guards, which are attached to a method. If
all guards are false, the operation is blocked until one guard becomes true. We can
use this to implement method-specific synchronization mechanisms, e.g. wait until
another method leaves some critical region.

• Implicitly: All methods are virtually marked ”synchronized” as in Java, but this time
also across the network (see question 1). This means that the runtime system takes
care of serializing two method calls on the same (possibly distributed) object, i.e.
treats them like critical regions, where one process is blocked until the other leaves
the region.

Both mechanisms protect some kind of critical region, so this corresponds to entry consis-
tency. However, we also get some kind of sequential consistency, because the critical regions
(which can be seen as data items) have to be entered in an order which is the same from
the point of view of all objects.

17. What kind of consistency would you use to implement an electronic stock
market? Explain your answer.

Of course one might want strict consistency in such a sensitive application. But before
making a really well-founded decision, we must think about the application. In a stock
market we have different stocks, each with a unique market price, which changes faster or
slower over time. We have buy and sell orders, which refer to a stock and an amount. Traders
usually base their buy and sell decisions on the stock price: if it is low, they buy; if it is
high, they sell. We can think of the n traders as n processes, which read (see the current
stock price) and write (buy or sell stocks) data items (stocks). Buying and selling has an
automatic effect on the price and thus on the next reads by other traders.

Sequential consistency would be nice. All traders would see all write orders by all other
traders on all stocks in the same order. This is nice, but we can assume that a decision to
buy or sell a stock A can be made solely by watching the history of A and not by watching
the history of other stocks. Thus we only need to see the causally related operations on A
in order, and we can use causal consistency. This guarantees us that all traders who traded
stock X at least once will see the same history (price, buys, sells) of X. For other stocks
arbitrary orders can be accepted.

19. Describe a simple implementation of read-your-writes consistency for
displaying web pages that have just been updated.

This is a client-centric consistency model in which we have to obey the following condition:
The effect of a write operation by a process on data item x will always be seen by a successive
read operation on x by the same process.

This means that we always have to complete the write operation (which might be in
progress) and deliver the most up-to-date value of x when the data store receives a read
request by the client. For our example we have to guarantee that the client always fetches
the most recent version of the file. We can do this by

• deactivating all caches between server and client (especially any browser caches), or
designing the browser cache in such a way that it always checks the server for a newer
version of the file before returning a result to the user. This of course reduces the
performance benefit of the cache to zero.

• Additionally, the server must be told to block (delay) incoming read requests while
it performs an update of its hosted files.

As we see, these strong requirements come with strong performance losses. Nevertheless,
most users are still satisfied if they see pages, which are 1 hour or so old.
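
A hedged sketch of the second variant, a browser cache that revalidates on every read; the
head and get callables stand for the actual HTTP requests and are assumptions:

class RevalidatingCache:
    def __init__(self, head, get):
        self.head = head            # head(url) -> current version on the server
        self.get = get              # get(url)  -> (version, body)
        self.store = {}             # url -> (version, body)

    def read(self, url):
        version = self.head(url)    # always check the server first
        cached = self.store.get(url)
        if cached and cached[0] == version:
            return cached[1]        # unchanged: the cached copy is safe to show
        version, body = self.get(url)
        self.store[url] = (version, body)
        return body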

21. When using a lease, is it necessary that the clocks of a client and the server,
respectively, are tightly synchronized?

A lease protocol is a mixture between pull and push protocols. A client can request a lease
from the server, which guarantees that the server delivers push-based updates for a
particular period of time. After this time has expired, the client can try to extend the lease
or alternatively has to pull further updates itself.

The decision about granting or refusing a lease is made by the server (based on server
time and some criterion). Thus the client time and any synchronization issues do not matter.
If the client has doubts about the remaining life span of the lease, because the server time
might run much faster than the client time, it can request extensions earlier than at the
end of the life span.

23. For active replication to work in general, it is necessary, that all opera-
tions be carried out in the same order at each replica. Is this ordering always
necessary?

In active replication systems we have a distributed data store, which consists of multiple
replicas for each data item which are all updated in case of a write operation (in contrast to
primary-based protocols). Thus we have an up-to-date copy everywhere and can read from
the nearest server.

Usually we would use totally ordered multicasts for all write and read operations, but this
can be unnecessarily restrictive. If we have, for example, two or more read operations, it
does not matter in which order we execute them at different servers; the results will always
be the same. The same goes for commutative write operations, such as ones that replace
the file with itself. The problem in the latter case is that we are hardly able to tell if two
writes are commutative without looking deeper into the operations (which we do not want
to do for performance reasons).

25. A file is replicated on 10 servers. List all the combinations of read quorum
and write quorum that are permitted by the voting algorithm.

We assume that files can be replicated in multiple versions on multiple servers in a dis-
tributed data store. When we use a quorum-based protocol to ensure consistency (we want
to read only the most up-to-date version), all clients must agree on a special contract. For
N servers which keep a replica of a file (in possibly different versions), we have to adhere to
a read quorum Nr ≤ N and a write quorum Nw ≤ N.

Nr and Nw must fulfil the following two constraints:

• Nr + Nw > N

• Nw > N/2

When we want to read a file, the client must first contact Nr servers and read the version
numbers of the requested file. We have now two possibilities:

• The version numbers are all the same. We have read exactly from servers on which
the last update was written, and everything is fine (because of the first condition).

• We read different versions, i.e. some up-to-date versions and some outdated versions.
On writing, every file is written to at least Nw servers. Because of the first constraint,
we must have read at least one server which was also in the last write set, so we
always read at least one up-to-date version of the file. We can therefore be sure that
the version with the highest version number in the read set is the most recent version.

The second condition is needed to prevent write-write conflicts. A write must implicitly
invalidate all outdated versions, and this is ensured by the second condition. If, with N = 10
servers, we for example wrote only to three servers (Nw = 3) and subsequently to three
different servers again, we could (by accident) read the first three servers and assume that
we have read the most recent version. With the second condition we always overwrite at
least the majority of all replicas and ensure that at least some of the new versions are
read.

Valid combinations of (read quorum, write quorum) are: (1,10), (2,9), (3,8), (4,7), (5,6).
These are the ”minimal combinations”. Because the first condition permits the sum of Nr
and Nw to be somewhat greater than N, there are many more valid combinations, e.g.
(2,10), (3,9), (3,10), (4,8), (4,9), (4,10), (5,7), (5,8), (5,9), (5,10). In general, every pair
with Nw > 5 and Nr + Nw > 10 is permitted. In these cases we write to more servers
than strictly necessary.
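
The complete list can be generated directly from the two constraints:

N = 10
pairs = [(nr, nw)
         for nr in range(1, N + 1)
         for nw in range(1, N + 1)
         if nr + nw > N       # every read set overlaps the last write set
         and 2 * nw > N]      # any two write sets overlap
print(pairs)                  # (1,10), (2,9), ..., (5,6) plus the larger variants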

26. In the text, we described a sender-based scheme for preventing replicated
invocations. In a receiver-based scheme, a receiving replica recognizes copies of
incoming messages that belong to the same invocation. Describe how replicated
invocations can be prevented in a receiver-based scheme.

In active replication we have the problem, that distributed object replicas can uninten-
tionally invoke other objects many times (when they should invoke them only once). This
happens, because objects are unaware of their replicas and act independently. We must
somehow detect such multiple invocations and prevent them.

One way to deal with replicated invocations is a receiver-based approach. Suppose we have
an object A which wants to invoke an object B. Unfortunately (for this problem), B is
replicated, so the request by A gets multicasted to all B's (either by a sequencer server or
by broadcasting with Lamport timestamps). Because B depends on another object C, each
B calls C, although one call to C would be enough to reach the goal.

The solution could look like this: We assign each call A --> B a unique identifier. When
a B receives such a call, it checks with all of its fellow replicas whether one of them has
already received a request with that identifier. If not, the request is processed; if yes, the
request is discarded. The main problem is that the checking step is quite costly.

Ensuring that only a single reply is sent from C back to one B (and not to all B's) can be
done similarly: each B checks the identifier of the reply from C and discards it if it is not
(blocking) waiting for that message from C.
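
Reduced to its core, the receiver-side check is a set of already seen invocation identifiers (a
sketch; the costly cross-replica coordination mentioned above is left out, so this version only
catches duplicates that arrive at the same replica):

class ReceiverReplica:
    def __init__(self):
        self.seen = set()

    def handle(self, invocation_id, operation):
        if invocation_id in self.seen:
            return None              # duplicate of an invocation already processed
        self.seen.add(invocation_id)
        return operation()           # execute the invocation exactly once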

7 Fault Tolerance

1. Dependable systems are often required to provide a high degree of security.
Why?

Wikipedia describes security as ”protecting information and information systems from unau-
thorized access, use, disclosure, disruption, modification or destruction”. This, of course, is
a very broad definition and we need to find some more specific reasons why a distributed
system should adhere to certain security standards.

Because of the distributed nature of distributed systems, we have no single point of attack
anymore, but n points of attack. Servers have to trust their clients and the other way round.
For example, in a dependable distributed data store we must trust the replication hosts
not to maliciously modify replicas on purpose (e.g. infect them with viruses). Here we need
strong protection against outside attacks.

We could say that the definition of a dependable system even embraces security, because if
some distributed hosts are corrupted, the whole system can not keep up its normal operation
(without countermeasures) and is thus not dependable anymore. Unfortunately, the more
security measures we take (authentication, code checking, memory protection, etc.), the
more we have to pay in terms of performance.

3. Consider a Web browser that returns an outdated cached page instead of a
more recent one that had been updated at the server. Is this a failure, and if
so, what kind of failure?

A system is said to fail, if it can not provide the functionality it promises. So we have to
look at the promises of a web browser and its cache.

• If we implement (client-centric) read-your-writes consistency, we always have to
provide the most recent version of the page. If the cache nevertheless delivers a cached
and outdated version, we break the promise and have a response error (more specifi-
cally: a value error).

– Note 1 (definition): We have a response error if the server's response was incor-
rect. In this case we have to think of the cache as the server, because the user
”interacts” with it and not with the web server.

– Note 2: Giving and fulfilling this promise reduces the performance of the cache
to that of a browser without a cache. This is because we have to check the server
for a new version every time before we deliver a page.

• We can give the promise that a displayed page is at most x time units old. Then we
have a response error if the page on the server was updated in the last x time units
and we deliver an older version. Otherwise we have no error.

• If we do not give any consistency promises at all, we have no error.

5. How many failed elements (devices plus voters) can Fig. 7-2 handle? Give
an example of the worst case that can be masked.

Fig. 7-2 shows an example of TMR (Triple Modular Redundancy), a physical-redundancy
technique for fault tolerance. The circuit provides three stages, each representing a device
which depends on its previous device (pipelining). For fault tolerance we triplicate each
device on each stage and interpose voter elements, which pass the result of the majority
of the previous stage on to the next stage. Because voters can produce wrong decisions too,
we triplicate them as well.

What is the worst case for a device stage? We can mask failures with a majority vote
if at most one device per stage fails and at least two work correctly. Then we can provide
a completely correct input for the next stage. Two failures per stage already produce a
”wrong” majority or no majority at all, and thus they are not maskable.
What is the worst case for a voter stage i? Again, at most one failure, because when we
produce two correct inputs and one wrong input for the next stage, we can compensate the
voter failure in the next voter stage i+1 (the majority wins). Note that this works only if the
next device stage works correctly, because newly introduced device failures can destroy the
fragile majority.

As we see, for our example with three device stages and three voter stages, we can mask
at most one failure per stage and still get the correct end result (= 6 masked failures in
total). Note that this is an upper bound, because not every combination of six failing
elements still produces the correct end result. An example of such a worst case is when all
the upper elements (devices and voters) fail. We still get the correct end result, because we
never introduce new device failures into a correct majority, so that the voters only have to
correct failures from the device stage and the last voter stage. In other words: we can only
reach the worst case if every erroneous device is fed only with the output of an erroneous
voter.
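
The voter element itself is just a majority function (a sketch):

def vote(a, b, c):
    # pass on the value produced by at least two of the three inputs
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None    # no majority: two failures in one stage can not be masked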

7. For each of the following applications, do you think at-least-once semantics
or at-most-once semantics is best? Discuss.
(a) Reading and writing files from a file server.
(b) Compiling a program.
(c) Remote banking.

These techniques can be applied in the case of a server crash. The problem here is that the
client can not know if the crash happened before the requested action was executed on the
server or after. In the first case he can safely reissue the request, but in the second case if
he sends the request again, really bad things can happen.

• The first approach is called ”at-least-once” and means that the client polls the server
until it gets an answer. Possibly we get multiple executions, which is only safe for
idempotent operations.

• The second approach is called ”at-most-once”: it gives up immediately (after a time-
out) and reports an error to the client. Possibly we get no execution at all, but we
avoid harmful multiple executions.

In the example we have:

• (a) Reading files is always safe, because it does not touch the data. Thus we can
use at-least-once semantics. For writing it depends:

– When we want to create a file or replace a file by another file, we can use at-
least-once semantics.

– E.g. for appending some chunk of data to an existing file, we must use at-
most-once, because this operation is not idempotent (idempotent = has the same
result after multiple applications).

• (b) Compiling a program on a server. Apart from wasted CPU time on the server,
reissuing the compile request (at-least-once) should do no harm. Note that this task
can take very long, so we should use a rather large timeout value.

• (c) Remote banking. This one is critical (imagine real money transfers). We should
use at-most-once semantics: stop after a timeout and report the error instead of hiding
it from the user. Then it is up to the user to find out the problem and fix what went
wrong.
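
Both semantics can be sketched around an unreliable send primitive (send returns None on
a timeout; everything here is illustrative):

def at_least_once(send, request):
    while True:
        reply = send(request)        # keep retrying until an answer arrives;
        if reply is not None:        # the request may thus execute several times
            return reply

def at_most_once(send, request):
    reply = send(request)            # exactly one try
    if reply is None:
        # the request was executed zero or one times -- report, don't retry
        raise TimeoutError("server did not answer; outcome unknown")
    return reply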

9. Give an example in which group communication requires no message ordering
at all.

For this task we assume a middleware layer, which receives multicast messages, orders them
and finally delivers them as a queue to an application. In this way we can implement
different message order semantics.

In the case of unordered delivery, we have no guarantees about the order in which messages
are delivered to processes (as long as they are delivered at all --> ”reliable” unordered
multicasting). This might be enough when the multicasted information already contains
some implicit ordering information. Think for example of an audio file which we multicast
piecewise. If each audio frame contains information about its position in the file, we can
still compile the complete file from the unordered multicasted packages.

Sometimes, if the intervals between the multicasted packages are small enough, order does
not matter anyway, because the multicasted pieces of information have no dependencies on
each other. Think e.g. of multicasted weather and news information. It is not important
what we receive first, as long as the information is not completely outdated.

11. To what extent is scalability of atomic multicasting important?

An atomic multicast is a reliable multicast which reaches either all of its receivers or none,
and in which incoming messages are globally totally ordered. Thus it is perfectly suited e.g.
for delivering transactions to replicated databases. For small receiver groups scalability
might not be critical, but for big, possibly geographically wide-spanning groups it can be
a huge problem. Note again that the nice properties of atomic multicasting come at the
cost of huge performance losses.

13. Virtual synchrony is analogous to weak consistency in distributed data
stores, with group view changes acting as synchronization points. In this con-
text, what would be the analogon of strong consistency?

In the context of synchronization, strong consistency means that updates are propagated
more or less immediately and automatically to all processes in a group (in contrast to
the user-defined synchronization points we have with weak consistency). In virtual
synchrony we have special messages to announce group membership changes, which act as
barriers between sets of multicasts. This means that no multicast can pass them, because
we have to guarantee that multicasts reach only the group of receivers they were
specified for.

The multicast analogon of strong consistency would not give us such guarantees: Group
membership can change dynamically and at any time. However, we have to make sure that
messages reach their specified receivers atomically (all non-faulty receivers or none), even
under these hostile conditions. Membership changes have to be announced by multicasts.
It is important to have total message order, so that we do not mix up events.

15. Adapt the protocol for installing a next view Gi+1 in the case of virtual
synchrony so that it can tolerate process failures.

We want to implement virtually synchronous reliable multicasting. We assume two
properties that we can exploit to simplify the algorithm:

• We have reliable point-to-point communication, which means that each message is
guaranteed to be received. Note that we have no such guarantee for multicast
messages, which are composed of several reliable point-to-point messages, because the
sender could fail after sending only some of them.

• We have total message ordering for a sender and receiver pair. In practice this can be
realized with TCP frame sequence numbers.

Tanenbaum gives an algorithm which solves the problem of delivering all not-yet-received
messages of a group view Gi before installing the next group view Gi+1. The algorithm uses
the notion of ”unstable messages”: messages which, at the moment of the group change, are
not yet delivered to all receivers. Processes should get messages only when they are stable,
i.e. guaranteed to be received by all other group members. To guarantee this in the case of
a group view change, all members multicast their buffered unstable messages. Because of
the first property we can be sure that all unstable messages for Gi are then received by all
members of Gi. To indicate that it has flushed its buffer and is ready to install the new
group view Gi+1, a process multicasts a flush message. When a process has received the
flush messages from all other processes in Gi, it is ready to install the new view Gi+1 and
can proceed normally.

This (possibly too) short description of the algorithm reveals a major problem: What
happens if a process X crashes while it is flushing its buffer? If X is the only process
holding the unstable message A, then the other processes will never receive A (additionally
they are blocked, because they have to wait for the flush message from X). The solution is
to interpret the situation after the crash as a new view Gi+2, now without X. Then view
change messages and flush messages for different views can be flowing around at the same
time, and different group views can be installed at the same time.

The algorithm could work as follows:

• When a process receives a group change message for Gi, it multicasts all its unstable,
buffered messages to all members of Gi. Why to all members of Gi? Because a
member which has left the previous view could not receive the message anyhow, and
because new members can safely discard incoming unstable messages if they belong
to an older view (see the next case). After that, the process multicasts a flush message
to all members of Gi. The messages are now stable, just as in the standard algorithm.

• Now what happens when a process receives an unstable, now stable message for Gi?

– If the process received a message for a group view in which it is no member
(i.e. when the process joined the group as a result of a newer view Gi+k), then
it can safely discard the message. In other words: The message was not meant
for the process and thus has no relevance.

– The process can also discard a message, if it is a duplicate of an already received


message. This can happen, if the process got it already from another process (Re-
member: On a group view change, the same unstable message can be multicasted
by different processes).

– The process delivers a message to the application, if it is not a duplicate and if


the process was in the right group.

– In the worst case it can happen, that a process receives a unstable message for
a group view, which he does not know, i.e. was not informed by other processes
about yet. Then it has to buffer the messages somehow, until it is informed about
the new group and can proceed according to the first three rules.

We are now resistant against crashes in the group view change phase. If such a crash
happens, we can treat it simply as a new group view, now with one member less. We can
handle situations in which processes join the group at such critical moments analogously. In
any case, the membership information will be safely and reliably delivered to all processes
(but possibly asynchronously and delayed).
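
To make the four rules concrete, here is a minimal, hypothetical Java sketch of the receiver
side; the Message class, the view numbering and the method names are my own illustration,
not Tanenbaum's:

import java.util.*;

class Message { String id; byte[] payload; }

class ViewChangeHandler {
    int highestKnownView;                       // highest view id installed so far
    Set<Integer> memberViews = new HashSet<>(); // views this process was a member of
    Set<String> delivered = new HashSet<>();    // ids of already delivered messages
    List<Message> pending = new ArrayList<>();  // messages for not-yet-known views

    void onUnstableMessage(Message m, int viewId) {
        if (viewId > highestKnownView) {           // rule 4: view unknown yet, buffer
            pending.add(m);
            return;
        }
        if (!memberViews.contains(viewId)) return; // rule 1: we never were in that view
        if (delivered.contains(m.id)) return;      // rule 2: duplicate, discard
        delivered.add(m.id);                       // rule 3: deliver exactly once
        deliverToApplication(m);
    }

    void deliverToApplication(Message m) { /* hand over to the application */ }
}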

17. In our explanation of three-phase commit, it appears that committing a
transaction is based on majority voting. Is this true?

No, it has nothing to do with majority voting, although it may appear so. We just have to
make sure that we do not block when either the coordinator or a participant crashes. The
protocol is completely deterministic, and each participant decides on its own how to proceed.

Imagine, e.g., the situation where a participant P is in the state READY (in the 2nd protocol
stage), waiting for a prepare-commit message from the coordinator C. Now P concludes from
a timeout that C has crashed, and it has to find out into what state to move next. For that, P
contacts all other participants and sees that they are all in the same state as P (READY).
In the meantime (before the timeout) other processes may have crashed and cannot be
contacted anymore. Thus we have no information about their state and follow a pessimistic
approach. In this case we abort the transaction, i.e. move to state ABORT. This is
necessary because a crashed participant may, e.g., recover to state INIT. In the worst
case we abort a transaction which would have succeeded, but this is not harmful. In each
case it may appear that the majority wins, but actually the processes follow a strictly
deterministic protocol.
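
A compact sketch of this deterministic termination rule; the state enum and the decision
method are simplified assumptions of mine, not the full 3PC termination protocol:

import java.util.*;

enum State { INIT, READY, PRECOMMIT, COMMIT, ABORT }

class Participant {
    // Decide the next state based on the states of all *reachable* participants.
    State decideAfterCoordinatorTimeout(List<State> reachableStates) {
        if (reachableStates.contains(State.COMMIT)) return State.COMMIT;
        if (reachableStates.contains(State.ABORT))  return State.ABORT;
        if (reachableStates.contains(State.PRECOMMIT)) return State.PRECOMMIT;
        // Everyone reachable is READY: an unreachable, crashed process might
        // recover to INIT, so the pessimistic, deterministic choice is ABORT.
        return State.ABORT;
    }
}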

19. Explain how the write-ahead log in distributed transactions can be used to
recover from failures.

We apply recovery techniques to bring the system into an error-free state after we have
detected an error. The write-ahead log (WAL) is a special technique of backward recovery,
which tries to reset the system to a former state.

The concept of a WAL is to log all read and write operations before actually performing
them. From time to time we create checkpoints; typically we would do this at the beginning
of each transaction. Thus WALs are incremental logs.

What happens in the case of a failure, e.g. a crash? We can replay the log from the
last checkpoint to the last recorded log entry (roll-forward). Then we can continue the
transaction.
When a transaction is "normally" aborted, we must act differently: now we must undo
all changes to reset to the state before the transaction. We can do this by reading the log
backwards (roll-back) and undoing the operations step by step.
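
A minimal in-memory sketch of this roll-forward / roll-back idea (the record layout and
names are assumptions for illustration; a real WAL lives on stable storage):

import java.util.*;

class LogRecord {
    final String key; final String oldValue; final String newValue;
    LogRecord(String k, String o, String n) { key = k; oldValue = o; newValue = n; }
}

class WalStore {
    final Map<String, String> data = new HashMap<>();
    final List<LogRecord> log = new ArrayList<>();

    void write(String key, String value) {
        log.add(new LogRecord(key, data.get(key), value)); // log first ...
        data.put(key, value);                              // ... then apply
    }

    // Roll-forward after a crash: replay the log from the last checkpoint.
    void redo(List<LogRecord> records) {
        for (LogRecord r : records) data.put(r.key, r.newValue);
    }

    // Roll-back on abort: read the log backwards and undo step by step.
    void undo() {
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            if (r.oldValue == null) data.remove(r.key); else data.put(r.key, r.oldValue);
        }
        log.clear();
    }
}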

21. Receiver-based message logging is generally considered better than sender-
based logging. Why?

When, for performance reasons, we create checkpoints only seldom, we have to go back to
a very old state in case of a failure. Thus it is better to additionally trace all (incremental)
changes since the last checkpoint and use this information to recover to a more recent
failure-free state.

For sent and received messages we have two possibilities:

• Sender-based logging: Each sending process logs its outgoing messages before sending
them.

• Receiver-based logging: Each receiving process logs its incoming messages before
delivering them to the application.

When a process crashes, it first recovers to a checkpoint and subsequently replays all logged
incoming messages. Note: we can neglect outgoing messages, because we can think of them
as (deterministic) reactions to incoming messages and the current state. Thus we can
"reconstruct" them just from the available information (checkpoint + log).

When the system works normally, both approaches should be equal in terms of performance.
In the error case, receiver-based logging has an advantage, because all needed information is
locally available. With sender-based logging we would additionally have to keep track of a
sender list and ask all senders to resend their messages.
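
A small sketch of receiver-based logging with replay on recovery (the types and the
stable-storage handling are simplified assumptions):

import java.util.*;

class ReceiverLogger {
    final List<byte[]> messageLog = new ArrayList<>(); // stable storage in reality

    void onReceive(byte[] msg) {
        messageLog.add(msg);   // log first (to stable storage) ...
        deliver(msg);          // ... then deliver to the application
    }

    // After a crash: restore the checkpoint, then replay all logged messages.
    void recover(byte[] checkpoint) {
        restoreState(checkpoint);
        for (byte[] msg : messageLog) deliver(msg);
        // Outgoing messages need not be logged: they are deterministic
        // reactions to the replayed input and the restored state.
    }

    void deliver(byte[] msg) { /* application processing */ }
    void restoreState(byte[] checkpoint) { /* load snapshot */ }
}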

8 Security

1. Which mechanisms could a distributed system provide as security services to
application developers that believe only in the end-to-end argument in system's
design, as discussed in Chap. 5?

First of all we must have a clear grasp of the end-to-end argument. It basically says that
functionality should be placed as high as possible in a functionality stack. Think, e.g., of the
usual Internet protocol stack (TCP/IP). A question could be: where to place a total message
sequencing functionality? We could implement it, e.g., at the IP level. This would mean that
all applications are provided total message ordering. On the other hand, in the end-to-end
approach we would implement it at the application level, because not all applications need
strict total message ordering (think, e.g., of voice-over-IP or video streaming). The clear
advantage is that lower layers are kept slim, fast and easy. Systems which were designed
with the end-to-end argument in mind are called end-to-end systems.

If we think about security features in an end-to-end system (encryption, authentication,
etc.), we may not implement them in some middleware or even lower layer. We can only
implement them on a per-application basis; thus we have to reimplement them for every
application (or use libraries which provide security mechanisms). The approach gives
per definition high flexibility, but since distributed, dependable systems embrace security (see
the answer to chapter 7, question 1), end-to-end designs will always be more difficult to develop.

3. Suppose you were asked to develop a distributed application that would allow
teachers to set up exams. Give at least three statements that would be part of
the security policy for such an application.

A security policy for a system describes precisely which entities (people, data, services,
objects which invoke other objects, etc.) are allowed to perform which actions and which not.
Before one can set up a security policy, one needs a specification of the system's functionality.

Our system should allow teachers to upload exams onto a (dependable) distributed system
and use this system for computer-based examination of students. The results should be
reliably stored and only be accessible to the teacher.

Now that we have a rough system description, we can derive a simple security policy:

• There are two groups of users: students and teachers.

• Users have to authenticate with the system.

– Unambiguous identification of users is the basis for all other security mechanisms.

• Teachers can: set up questions; modify, organize and delete their own questions (not
those of other teachers); schedule exams; conduct exams; access exam results.

• Students can: take part in scheduled exams and see their results.

• One more rule: all communication should be encrypted.

– This is for privacy reasons; e.g., when results are sent back to the server, we do
not want attackers to eavesdrop on the messages.

5. Why is it not necessary in Fig. 8-15 for the KDC to know for sure it was
talking to Alice when it receives a request for a secret key that Alice can share
with Bob?

For n participants who want to communicate with each other, we have n·(n−1)/2 mutual
(symmetric) keys. Instead of distributing all keys to all clients, we could use a central key
server (Key Distribution Center, KDC). Then each client A has to store only one secret
key KA,KDC , which is used for communication with the KDC.

The protocol works as follows: if client A(lice) wants to set up a (symmetrically)
encrypted connection to B(ob), it sends the KDC a plain message containing the identities
A and B. The KDC looks up KA,B or generates a new KA,B and subsequently sends two
messages:

• KA,KDC (KA,B ) to A.

• KB,KDC (KA,B ) to B.

Because both parties have the corresponding secret keys, they are the only participants
able to decrypt the messages and extract the requested key.

This is also the reason why it does not matter to the KDC whom it was talking to. Even
more: the KDC could send out encrypted messages of the above form to arbitrary clients
without doing harm, because only the qualified receiver can decrypt them and benefit.
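
A minimal sketch of the KDC's reply step with the JDK crypto API (javax.crypto); the
key handling, cipher mode and message framing are simplified assumptions of mine:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

class Kdc {
    // kaKdc and kbKdc are the long-term secret keys shared with A and B.
    byte[][] handleRequest(SecretKey kaKdc, SecretKey kbKdc) throws Exception {
        SecretKey kab = KeyGenerator.getInstance("AES").generateKey(); // fresh K_{A,B}
        Cipher c = Cipher.getInstance("AES");

        c.init(Cipher.ENCRYPT_MODE, kaKdc);
        byte[] forAlice = c.doFinal(kab.getEncoded()); // K_{A,KDC}(K_{A,B})

        c.init(Cipher.ENCRYPT_MODE, kbKdc);
        byte[] forBob = c.doFinal(kab.getEncoded());   // K_{B,KDC}(K_{A,B})

        return new byte[][] { forAlice, forBob };      // sent to A and B respectively
    }
}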

7. In message 2 of the Needham-Schroeder authentication protocol, the ticket
is encrypted with the secret key shared between Alice and the KDC. Is this
encryption necessary?

Message 2 is sent from the KDC to Alice and has the following form:
KA,KDC (RA1 , B, KA,B , KB,KDC (A, KA,B ))

The ticket KB,KDC (A, KA,B ) will subsequently be passed from Alice to Bob. As we see, it
is (together with other information) encrypted with Alice's key (KA,KDC ). In fact we
could transmit the ticket unencrypted, because no one except Bob can read and use it
(it is encrypted with Bob's key).

Anyway, it is sent encrypted for one reason: in sensitive authentication systems we must
do everything to prevent even unlikely attack possibilities. Thus, if we submit the ticket
unencrypted along with encrypted information, we reveal information which an attacker
could exploit in a (today possibly unknown) way.

9. Devise a simple authentication protocol using signatures in a public-key
cryptosystem.

The protocol should authenticate Alice and Bob mutually. This means: Alice must be sure
that she is communicating with Bob and vice versa. Secret exchange of a session key is
not demanded (and would anyway be impossible if we use only signatures).

In contrast to centralized, symmetric-key approaches (like a KDC) we use a public-key
cryptosystem. We can assume that each party knows the following:

• Alice: KA− (Alice's private key), KB+ (Bob's public key)

• Bob: KB− (Bob's private key), KA+ (Alice's public key)

Big warning: it must be guaranteed that each party really has the public key of the other
party, and not the public key of someone else who pretends to be the other party!

The protocol would work as follows:

• 1: Alice sends Bob the following: KA−(R). R is some message, e.g. "Please authenticate
me.". The message is encrypted with Alice's private key, and thus signed. Bob
decrypts it with Alice's public key (as everybody else could also do). He can be sure
that he is indeed talking to Alice, because only Alice could have encrypted the message.

• 2: Bob sends Alice the following: KB−(R). R is Bob's response for Alice. When Alice
receives the message, she decrypts it with Bob's public key. Now she can be sure that
she is indeed talking to Bob, because only Bob was able to encrypt message 2 (he
alone has the matching private key).
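
A sketch of this exchange with the JDK's java.security.Signature API; note that it uses a
real RSA signature instead of literally "encrypting with the private key", which is how such
signatures are realized in practice (all names are illustrative assumptions):

import java.security.*;

class SignatureAuth {
    // Alice signs the challenge R with her private key (step 1).
    static byte[] sign(PrivateKey priv, byte[] challenge) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(priv);
        s.update(challenge);
        return s.sign();
    }

    // Bob verifies with Alice's public key: only Alice could have signed.
    static boolean verify(PublicKey pub, byte[] challenge, byte[] sig) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(pub);
        s.update(challenge);
        return s.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        KeyPair alice = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        byte[] r = "Please authenticate me.".getBytes();
        byte[] sig = sign(alice.getPrivate(), r);               // message 1
        System.out.println(verify(alice.getPublic(), r, sig));  // true at Bob's side
    }
}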

11. How can role changes be expressed in an access control matrix?

An access control matrix (ACM) models the relation between subjects (which request
operations) and objects (which provide functionality). The subjects are placed as rows in
the matrix, the objects as columns. Each cell lists the allowed operations / rights for a pair
(subject, object).

A role is an alias for a subject. Think, e.g., of a person who is a father, a project manager
and a friend. In each role he has to behave differently and has different rights. In a
computer system a role is usually assumed by logging into the system with a specific username
and password. A role with many rights (e.g. the administrator role) implicitly contains
roles with fewer rights (an administrator can usually do everything that standard users
can do). On the other hand, we cannot allow every user to change over to the administrator
role (without having the right password).

Now let us think of a role as an object in an ACM. This role object would provide a subject
(user) all rights necessary to function in this role. To "access" a role, we grant a user a right
(or not). In the end, the ACM behaves like a database similar to /etc/passwd, associating
single users with the (possibly many) roles they can take. Dynamic role changes would depend
on the other allowed roles of a user. The approach is somewhat cumbersome, because to
implement the system intuitively, we have to grant rights for all "lower" roles explicitly, e.g. we
have to give an administrator also all user-role rights. Alternatively, we can make an agreement
that we implicitly grant access to all lower roles without having to list them explicitly. This
would of course require a linear order of roles (every role is directly comparable with every
other role).
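
A toy sketch of an ACM in which roles appear as ordinary objects (columns); all names
are hypothetical:

import java.util.*;

class AccessControlMatrix {
    // matrix.get(subject).get(object) = set of allowed operations
    final Map<String, Map<String, Set<String>>> matrix = new HashMap<>();

    void grant(String subject, String object, String right) {
        matrix.computeIfAbsent(subject, s -> new HashMap<>())
              .computeIfAbsent(object, o -> new HashSet<>())
              .add(right);
    }

    boolean allowed(String subject, String object, String right) {
        return matrix.getOrDefault(subject, Map.of())
                     .getOrDefault(object, Set.of())
                     .contains(right);
    }

    public static void main(String[] args) {
        AccessControlMatrix acm = new AccessControlMatrix();
        acm.grant("alice", "role:admin", "assume"); // alice may change to admin
        acm.grant("alice", "role:user",  "assume"); // lower role granted explicitly
        System.out.println(acm.allowed("alice", "role:admin", "assume")); // true
        System.out.println(acm.allowed("bob",   "role:admin", "assume")); // false
    }
}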

15. Name three problems that will be encountered when developers of interfaces
to local resources are required to insert calls to enable and disable privileges
to protect against unauthorized access by mobile programs as explained in the
text.

There are several problems:

• It is easy to introduce additional programming errors because of the higher complexity.
Programming errors are especially harmful if they allow malicious code to circumvent
the security measures and access forbidden resources.

• The security model must have clear answers on recursive accesses. Imagine a process
is allowed to enter object A. Object A itself is not allowed to invoke object B, but
the process is. Should we grant the process access rights or not?

• The calls must be placed correctly and completely: a developer who forgets to disable
privileges again after a protected section leaves them permanently enabled, which
mobile code can exploit; conversely, a missing enable call breaks legitimate accesses.

17. The Diffie-Hellman key-exchange protocol can also be used to establish a
shared secret key between three parties. Explain how.

The goal is to establish a shared, secret key between three parties (Alice, Bob, Chuck)
without already having shared keys, as in the KDC approach or in the Needham-Schroeder
protocol. For this I extend the Diffie-Hellman approach (DH) for two parties to three parties.

First of all, Alice, Bob and Chuck have to agree on a large prime n and a generator g (as
in standard DH). They can do this in public, because attackers cannot derive the final secret
key from these two numbers (as we will see later). Now Alice generates a large number x,
Bob similarly y and Chuck z. All three must keep their numbers secret, because they can be
used to construct the secret key (together with the public n and g).

We need to transfer six messages:

• 1: Alice sends Bob g^x mod n

• 2: Bob sends Chuck g^y mod n

• 3: Chuck sends Alice g^z mod n

– Up to this point the receivers can calculate the following values:

∗ Alice: (g^z mod n)^x = g^xz mod n

∗ Bob: (g^x mod n)^y = g^xy mod n

∗ Chuck: (g^y mod n)^z = g^yz mod n

• 4: Alice sends Bob g^xz mod n

• 5: Bob sends Chuck g^xy mod n

• 6: Chuck sends Alice g^yz mod n

– Now the receivers can calculate the following values:

∗ Alice: (g^yz mod n)^x = g^xyz mod n

∗ Bob: (g^xz mod n)^y = g^xyz mod n

∗ Chuck: (g^xy mod n)^z = g^xyz mod n

In the end, all three have the same shared secret key g^xyz mod n. It is mathematically
almost impossible to calculate x, y or z from one of the messages of the form g^... mod n
(except by brute-force attacks, but this is why we have chosen large numbers). Note that
we never have to transfer the secret numbers x, y or z to arrive at this result.
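
The exchange can be verified with a few lines of Java using java.math.BigInteger; the toy
parameter sizes and the choice of g are my own assumptions for illustration:

import java.math.BigInteger;
import java.security.SecureRandom;

class ThreePartyDh {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger n = BigInteger.probablePrime(512, rnd); // public prime
        BigInteger g = BigInteger.valueOf(2);              // public base (toy choice)

        BigInteger x = new BigInteger(256, rnd); // Alice's secret
        BigInteger y = new BigInteger(256, rnd); // Bob's secret
        BigInteger z = new BigInteger(256, rnd); // Chuck's secret

        // Messages 1-3: g^x -> Bob, g^y -> Chuck, g^z -> Alice
        BigInteger gx = g.modPow(x, n), gy = g.modPow(y, n), gz = g.modPow(z, n);

        // Messages 4-6: each party exponentiates what it received and forwards it
        BigInteger gxz = gz.modPow(x, n); // Alice -> Bob
        BigInteger gxy = gx.modPow(y, n); // Bob -> Chuck
        BigInteger gyz = gy.modPow(z, n); // Chuck -> Alice

        // Each party raises the second value to its own secret: all get g^(xyz)
        BigInteger kAlice = gyz.modPow(x, n);
        BigInteger kBob   = gxz.modPow(y, n);
        BigInteger kChuck = gxy.modPow(z, n);
        System.out.println(kAlice.equals(kBob) && kBob.equals(kChuck)); // true
    }
}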

19. Give a straightforward way how capabilities in Amoeba can be revoked.

In this context a capability is a 128-bit token which is assigned to each resource and
which specifies the access rights associated with that resource. There can be only one capability
at a time (for a certain object, only one global rights assignment is allowed). Protection
against manipulation of the token by users (i.e. increasing rights) works the following way:
the resource creator initially gets a full-rights capability token. On creation, the server also
generates a 48-bit random check number and includes it in the capability. This number is
additionally stored in tables on the server for checking purposes. To verify a capability,
the server takes the check number from its tables, XORs it with the rights field from the
capability and calculates a new check number with a (secret) one-way function. If both
check numbers match, the server can be sure that there was no manipulation.

We see: because of the one-way function, it is not possible to change the capability without
the manipulation being detected. Thus revoking (invalidating) a capability for a resource
is very easy: we just have to change the check number of the object in the server
table; then all verification results will mismatch and all capabilities for the object
will be invalid. Of course this approach is very crude, but because all clients hold the same
capability for an object, Amoeba cannot revoke at a finer resolution.
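
A sketch of the verification and revocation steps; SHA-256 stands in here for Amoeba's
secret one-way function, and the field sizes are simplified assumptions:

import java.security.MessageDigest;
import java.util.Arrays;

class CapabilityServer {
    private byte[] checkNumber = new byte[6]; // 48-bit random check number

    // Recompute f(rights XOR checkNumber) and compare with the capability's check field.
    boolean verify(byte[] rights, byte[] capabilityCheckField) throws Exception {
        byte[] mixed = new byte[6];
        for (int i = 0; i < 6; i++) mixed[i] = (byte) (rights[i] ^ checkNumber[i]);
        byte[] expected = MessageDigest.getInstance("SHA-256").digest(mixed);
        return Arrays.equals(expected, capabilityCheckField);
    }

    // Revocation: change the stored check number; every outstanding
    // capability for this object now fails verification.
    void revokeAll(byte[] freshRandom) {
        checkNumber = freshRandom.clone();
    }
}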

21. What is the role of the timestamp in message 6 in Fig. 8-38, and why does
it need to be encrypted?

This figure illustrates the Kerberos protocol. Message 6 is sent from Alice to the Ticket
Granting Service (TGS) and has the following form: KAS,TGS (A, KA,TGS ), B, KA,TGS (t).
As we see, the timestamp t is encrypted with KA,TGS , the shared secret between Alice and
the TGS. Thus only Alice and the TGS can see it.

t protects the protocol against so-called replay attacks, as I will explain in a moment.
It ties messages 6 and 7 logically together and therefore acts like a nonce (e.g. as in the
Needham-Schroeder protocol). For a successful replay attack, Chuck, an attacker, needs
the following: an intercepted old message 6, KAS,TGS (A, KA,TGS ), B, KA,TGS (t_old). Note
that t_old is the timestamp of an older message. Furthermore, assume he has for some
reason access to the secret key KAS,TGS , which is shared by the Authentication Server (AS)
and the TGS. In theory this should never happen, but even in this worst case the timestamp
will prevent any harm. With this knowledge Chuck can decrypt the first part of the message,
modify it, encrypt it again and send the TGS the following message:
KAS,TGS (C, KC,TGS ), B, KA,TGS (t_old).
KC,TGS is a faked secret key. Obviously Chuck wants the TGS to send him a ticket for a
connection to Bob. Because Chuck presents a seemingly valid ticket KAS,TGS (C, KC,TGS ),
the TGS trusts him at first sight.

But Chuck was not able to manipulate the timestamp. The TGS will notice that Chuck
replays an outdated message and will refuse the request. In the best case, the TGS and AS
now agree on a new shared secret key, because the old one was obviously compromised.

23. Consider the communication between Alice and an authentication service
AS as in SESAME. What is the difference, if any, between a message
m1 = KAS+ (KA,AS (data)) and m2 = KA,AS (KAS+ (data))?

There is no difference. Only a receiver who has both the (symmetric) secret key KA,AS and
the private key KAS− can decrypt the whole message. Only the AS knows both keys and is
thus able to decrypt both messages.

25. A customer in an e-cash system should preferably wait a random time
before using coins it has withdrawn from the bank. Why?

Because the customer wants to stay anonymous towards his bank and the merchant. Thus he
wants to reveal as little so-called microdata as possible, i.e. data which can be used to
infer the identity of a customer. A short time between withdrawing e-coins and spending
e-coins helps to identify a customer, e.g. if one limits the search scope to all customers
who withdraw and spend e-coins within one hour.

Anyway, because customers must not be allowed to copy e-coins, e-coins must always be
registered with the bank and thus indirectly with the customer who bought them. The
technique above maybe makes customer identification a little harder, but it cannot prevent it.

27. Consider an electronic payment system in which a customer sends cash to
a (remote) merchant. Give a table like the ones used in Figs. 8-44 and 8-45
expressing the hiding of information.

The money has to be transferred from the customer's bank to the merchant's bank. In this
example we have 3 steps:

• 1. Withdrawal. The customer withdraws money from his bank.

• 2. Sending. The customer sends the money to the merchant (by "traditional" mail).

• 3. Deposit. The merchant receives the (real) money and deposits it in his bank.

             Merchant   Customer         Date   Amount   Item
Merchant     Full       Partial / Full   Full   Full     Full
Customer     Full       Full             Full   Full     Full
Bank         None       None             None   None     None
Observer     Full       Full             None   Full     None

• If the customer prefers to stay anonymous and sends only the money together with
a transaction code, the merchant is not able to identify the customer (only partial
information). However, the merchant must still be able to associate the sent money
with a certain sale (this should be enough for him).

• The customer is in full control of the transaction and necessarily also knows the
merchant's identity.

• It is obvious that the banks see only withdrawals and deposits. They cannot associate
these transactions with the details of the deal and the identities of the customer and
merchant.

• Tanenbaum defines an "observer" as someone who can (ideally placed) watch the
transaction. In our example this might be a malicious mail employee who opens
letters without authorization. He can see who sent money to whom (sender's address,
receiver's address) and the amount (he must count the money), but, e.g., not the
purpose of the transfer.

9 Distributed Object-based Systems

1. Why is it useful to define the interfaces of an object in an Interface Definition
Language?

The main idea of distributed object-based systems is to provide programmers transparent
access to the services of other components. It should not matter where these components
are stored and in which language they are written. In other words: we want to completely
hide the functionality of components (objects) behind their interfaces.

To describe the interface (including functions, parameters, return values and their types) we
use an Interface Definition Language. The advantage is that such a definition is language-
neutral, which means that it specifies neither the implementation language nor the actual
implementation.

CORBA IDL looks like this:

module Stats {
interface EUStats {
string getMainLangs(in string countryname);
long getPopulation(in string countryname);
string getCapital(in string countryname);
};
};

(from http://www.cs.swan.ac.uk/~csneal/InternetComputing/CorbaEx.html). As we can see
in the example, the programmer gets a clear grasp of how the object works from the IDL
description.

In CORBA we would have to translate the interface to a target language, e.g. to Java:

class EUStats {
public String getMainLangs(String countryname){...}
public int getPopulation(String countryname){...}
public String getCapital(String countryname){...}
}

Translation to stubs for the server or client side can be done automatically from IDL to
(possibly many) target languages. This can even be done at runtime, i.e. in dynamic
invocation.

3. Which of the six forms of communication discussed in Sec. 2.4 are supported
by CORBA's invocation model?

CORBA offers the programmer three invocation schemes:

• Synchronous. The client sends the server an invocation and simply blocks until the
invocation is processed and the results are returned. This was the first model and
turned out to be too simple for most applications (e.g. sometimes we do not want to
block, but continue with something different). Semantics are at-most-once.

• One-way request. The client sends the server a request and continues execution
(asynchronously). This communication form is somewhat limited in CORBA, because
the called method is forbidden to return values and delivery is not guaranteed. Thus it
realizes an asynchronous at-most-once notification of the server without the possibility
of returning results ("one-way").

• Deferred-synchronous request. This is a combination of the previous ones. After
sending the request, the client continues execution (asynchronously), like in the second
case. The called method is allowed to return results, but the delivery of results is not
guaranteed (just like delivery of the request). In other words, the client can never
know whether the request was successful or not. Deferred-synchronous requests can
also be seen as two one-way requests.

In section 2.4 of the book, there are six combinations of two communication properties:

• Transient / persistent. Transient channels discard messages when the receiver of a
message is not up. Persistent communication stores messages in the messaging system
and delivers a message later (when the receiver comes up).

• Synchronous / asynchronous. Synchronous sending means that the sender blocks
until it gets an acknowledgement or results from the receiver. This event can either
be the simple receiving and saving of the message to the receiver's network buffer
(receipt), the start of processing the request (delivery) or the results (response).

So we have the following correspondences:

• Synchronous. This scheme corresponds to response-based transient communication.

• One-way request. This corresponds to transient asynchronous communication.

• Deferred-synchronous request. It does not directly correspond to a model of
section 2.4, but because it consists of essentially two one-way requests, it corresponds
to transient asynchronous communication.

5. Should the client and server-side CORBA objects for asynchronous method
invocation be persistent?

CORBA objects can be implemented with or without persistence with regard to asyn-
chronous method invocation (based on the application's needs):

• Without (transient). Communication would only work when client and server are
up and responsive. In terms of failure tolerance this would not be optimal, because
the communication system cannot promise not to lose messages.

• Persistent. The communication system would buffer messages in case the receiver
of a message is down. Thus the whole system could still work flawlessly, even if a
host is down (at least it appears to work normally).

7. Does CORBA support the thread-per-object invocation policy that we
explained in Chap. 3?

This activation policy demands one thread per object (shared by all clients, which may
use this object in parallel). The advantage is that all requests are automatically serialized
and that there are no thread synchronization issues. Policies are usually implemented by
an object adapter.

In the case of CORBA, the object adapter (on clients and servers) is called the Portable
Object Adapter (POA). In the CORBA reference model the POA is quite simple and
inflexible. An object is registered at the POA with

ObjectId activate_object(Servant servant)

"Servant" is a pointer to the actual implementation of the object. The ObjectId is used
to manage all objects in the "Active Object Map". By nature there is no way to explicitly
specify a thread-per-object policy to the POA. If we want a single-thread policy, we must
take care that we call activate_object() only once for each servant / object.

9. In the text, we state that when binding to a CORBA object, additional secu-
rity services may be selected by the client ORB based on the object reference.
How does the client ORB know about these services?

All security requirements must somehow be coded in the Interoperable Object Reference
(IOR), which is a language-independent object reference. The right place in the IOR is
the components field, which stores additional information.

Before binding an object, the client ORB must collect the security requirements of the caller
(e.g. encrypted transmissions). They are stored in a set of so-called policy objects on the
client side. For simplicity there are reasonable default policy objects.

10. If a CORBA ORB uses several interceptors that are not related to security,
to what extent is the order in which interceptors are called important?

The order of interceptors is essential and influences the system's functionality. Assume two
interceptors A and B, where A converts messages to a MIME-based text format and B
compresses messages with GZIP and expects MIME text as input. Obviously only the order
A, B makes sense, and not B, A.

If one interceptor is security-related, the situation gets even worse. Consider interceptor A,
which encrypts messages, and interceptor B, which compresses messages. Because encrypted
messages usually look like random data, compressing encrypted messages does not make
much sense (only the order B, A would, i.e. compressing before encrypting).

13. In DCOM, consider a method m that is contained in interface X, and which
accepts a pointer to interface Y as input parameter. Explain what happens
with respect to marshaling when a client holding an interface pointer to a proxy
implementation of X invokes m.

We have the following situation: an object on a server offers interface X with the method
m to clients. m expects a pointer to an interface as input. Because interface pointers (locally)
represent objects, this actually means that m needs another object as input. Note that this
object can be on the same host or on a remote host.

Assume also that a client is bound to X, which means that the client holds an interface
pointer to a (local) proxy, which marshals all requests and forwards them to X. When the
client calls m with an interface pointer as input, the input must obviously be marshalled
somehow, because calling m with the client's local reference does not make sense. In
DCOM there are two approaches for this:

• Standard marshalling. Every DCOM interface is described with an IDL and can
be found via type libraries with a unique Interface Identifier (IID). The client could
send the IID of Y to the server. The server could then look up the host (and port) of
Y, establish a connection, construct the proxy for Y from Y's IDL description and
bind to Y. All these steps can be done automatically and handled transparently by the
underlying marshaller.

• Custom marshalling. This approach works entirely differently. Instead of sending
only an IID and assuming that the server of X constructs a proxy for Y by itself, we
send the client's proxy for Y directly. This means that we actually marshal executable
code (which must be guaranteed to run on both hosts). The client sends the code
together with a special call to unmarshal and install the proxy. The copied proxy
provides X with all information to bind to Y.

15. Outline an algorithm for migrating an object in DCOM to another server.

Why would one like to move objects from a server A to another server B? One example
are agent objects, which "visit" every host once in a while. Another example is placing
objects as close as possible to clients. The main problems in moving objects are keeping all
client bindings consistent (the bindings must remain valid at the new location) and
preserving the object's state.

A possible 6-step algorithm is:

1. The current server A creates a snapshot of the object, i.e. a 1:1 copy in memory. The
running object is "locked" in the sense that all new requests on it are buffered
chronologically and stored along with the copy.

2. We assume that there is a DCOM object for object migration on the target server B.
A binds to it and calls a special method (e.g. transfer_object(Object o)), which expects
the object as input value. A marshals the object and sends it to B.

3. B unmarshals the object and sets it up, but still keeps it frozen.

4. A sends the log of buffered requests for the object to B. B can now unlock the object
and "replay" them. Basically we now have a consistent copy of the state, but what
about the bindings?

5. A sets up a forward pointer, which redirects all incoming new requests to B. All clients
which are still bound to the old location are kindly asked to rebind to the object on B.

6. A announces the new location of the object in the Active Directory. New clients can
now bind directly to B.
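
A compact sketch of these six steps; all interfaces, method names and the directory call are
hypothetical placeholders, not actual DCOM APIs:

import java.util.*;

class MigrationSource {
    void migrate(ServerObject obj, MigrationTarget b, Directory dir) {
        obj.lock();                          // 1: freeze; new requests are buffered
        byte[] snapshot = obj.snapshot();    //    1:1 in-memory copy
        b.transferObject(snapshot);          // 2+3: B unmarshals, keeps it frozen
        b.replay(obj.bufferedRequests());    // 4: B unlocks and replays the log
        obj.setForwardPointer(b.address());  // 5: redirect late requests to B
        dir.announce(obj.id(), b.address()); // 6: new clients bind directly to B
    }
}

interface ServerObject {
    void lock(); byte[] snapshot(); List<byte[]> bufferedRequests();
    void setForwardPointer(String address); String id();
}
interface MigrationTarget {
    void transferObject(byte[] snapshot); void replay(List<byte[]> requests); String address();
}
interface Directory { void announce(String objectId, String address); }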

17. Explain what happens when two clients on different machines use the same
file moniker to bind to a single object. Will the object be instantiated once or
twice?

DCOM objects are usually transient, which means that they are destroyed when there are no
more references to them. To make an object persistent, one can create a persistent reference
(moniker) to it. The most common moniker is the file moniker, which reconstructs an object
and its state from a file.

Assume that the first client used the moniker to reconstruct the object. The moniker
returned an interface pointer to the client and registered the object in the Running Object
Table (ROT) to trace all users of the object. When the binding request from the second
client comes in, the file moniker looks in the ROT and decides what to do.

• Possibility 1. It sees that the object has already been created and binds the second
client to the existing instance. This means that the object is shared between different
clients. This is the default approach and corresponds to the usual DCOM model,
where the client programmer is responsible for serializing requests and keeping
consistency.

• Possibility 2. The moniker decides to create a new instance of the object and
binds the second client to it. This has the conceptual drawback that multiple object
instances can now be in different states. Still, it is possible to use this strategy, e.g. if
the object is stateless and we want to implement a crude object-per-request strategy.

19. Give an example in which the (inadvertent) use of callback mechanisms can
easily lead to an unwanted situation.

This question refers to the Globe system. Globe only supports synchronous communication
(and no callbacks). The advantage is that if we have a program which executes on multiple
objects, only one object is active at a time (because the calling object blocks when it calls
another object). In this way we avoid inconsistent states. The lack of a callback mechanism
and asynchronous communication is why Globe objects are called "passive": they only do
something if they are explicitly requested to.

The main issue with (asynchronous) callbacks is that the state of the caller may have
changed in the time between the request and the callback. If the programmer does not take
care, this can have unwanted consequences. Consider, e.g., this (constructed) situation: a
client is bound to a file management object on a server and asks it to delete all data of
the client which is in a special trash directory (per callback). The client really wants to
delete only the files currently in there, but because it does not block after the request, new
files can be added to the trash directory in the meantime. The server takes some time and
finally makes the callback. All files in the directory are now wiped out, including the files
the client initially did not want to be removed.

21. Assume a Globe object server has just installed a persistent local object.
Also, assume that this object offers a contact address. Should that address be
preserved as well when the server shuts down?

Persistent local objects are locally marshalled and saved to stable storage when the last
client releases its binding to them. They can be reconstructed at any time, and the question
is whether we should also preserve their contact address.

For that we have to know what a contact address in Globe actually is. A contact address
describes exactly where and how an instance of an object can be reached by clients. The
contact address is stored along with an object identifier in the Globe location service. Note
that there can be multiple contact addresses for one object identifier, e.g. when the object
is replicated.

Thus, if the object server shuts down and gets a new address during rebooting (e.g.
per DHCP), it must announce the changed contact address for all stored persistent objects
to the location service. If the address does not change (static address), there is no need to
change the contact address (we can preserve it).

10 Distributed File Systems

1. Is a file server implementing NFS version 3 required to be stateless?

A stateless implementation of NFSv3 may not keep any information about its clients between
requests. In other words: after processing a request, the server completely forgets about
the requesting client. Besides an easier server design, this approach has the advantage of
theoretically higher failure tolerance: when a server crashes, we do not need to recover the
before-crash states of all clients.

The price for stateless designs is that each request must provide all information necessary
to process it and cannot rely on previous requests. For example: a stateless read() request
must specify a file and a range (start + offset). The server implicitly performs open(), read()
and close(). A stateful design would do each step as a single request and could feed read()
with a relative file handle.

The file system operations of NFSv3 are in principle designed for stateless operation. The
main problem is performance. Consider a server implementation which wants to delay
writing out the results of write() operations for a collective flush. This kind of cache needs
to keep an internal state (not a state of clients, but still a state). Thus, a purely stateless
approach has to immediately write out the results of write() operations (slow).

Another problem comes with duplicate requests. Assume that a client did not get a reply
to its request within a certain time. It will then reissue the request, even when the server
is still processing the original request. The problem here is that the server must be able to
detect duplicate requests. For idempotent operations, multiple executions obviously do not
matter, but non-idempotent operations may cause trouble. For the detection of duplicates,
the server again needs an internal state.

3. Give a simple extension to the NFS lookup operation that would allow
iterative name lookup in combination with a server exporting directories that
it mounted from another server.

Iterative name lookup in this context means that we look up a file in iterative steps,
beginning from the file system root /. For example, resolving /home/robert/file means that
we iteratively resolve / first, then /home and finally /home/robert (which returns us the
file handles in this directory).

A principle of NFS is that a server cannot export directories which are mounted from other
servers. Thus, a client first has to mount all directories it wants before doing a lookup.

A simple extension of the lookup operation gives us the possibility to mount files more
flexibly: if a client looks up a directory which contains directories mounted from another
server (say server B), server A returns a file handle with a pointer (e.g. the IP address)
to B. The client will then contact B and continue its lookup there.

5. Using an automounter that installs symbolic links as described in the text
makes it harder to hide the fact that mounting is transparent. Why?

The automounter mounts a remote NFS directory on demand, i.e. at the first moment it
is accessed. In the indirect approach, we do not mount directories directly, but instead use
a mirrored directory hierarchy. E.g., assume that we want to mount /home/robert. The
mirrored hierarchy resides in /tmp_mount/home/robert. All potentially mounted directories
are actually only symbolic links into the mirrored hierarchy. In our example, /home/robert
contains something like "link:/tmp_mount/home/robert".

A problem arises from the fact that the automounter returns a symbolic link to
/tmp_mount/home/robert when the user accesses /home/robert. He will suddenly find
himself in a directory he did not want to open. Because of that, many UNIX programs
(such as ls) have been adapted and pretend to be in /home/robert.

6. Suppose the current denial state of a file in NFS is WRITE. Is it possible
that another client can first successfully open that file and then request a write
lock?

There are two independent locking mechanisms in NFS:

• One is the "classical" file locking interface, which is implemented with RPC calls like
lock() and lockt() (test for a lock).

• The other one is the share reservation system. When a client opens a file, it specifies
which operations it wants to perform (READ, WRITE or BOTH) and which operations
the server should deny other clients for the session time (READ, WRITE, BOTH).

Now suppose that a file is opened by a client with WRITE denial state. Thus, opening
the file for writing by another client will fail.

One possible solution would be to open the file with READ as the requested access level.
After the file has been opened for reading, the client can issue a special lock() request of
the first locking system. This lock() would record the second client in a FIFO queue and
grant write access as soon as the first client releases the lock.

7. Taking into account cache coherence as discussed in Chap. 6, which kind of
cache-coherence protocol does NFS implement?

NFS implements mainly caching of files, file attributes, file handles and directories. In v3
the question of caching (strategies) was completely left open to the implementer, while v4
requires caching. What makes it even more difficult is that for each item (files, handles,
etc.) NFSv4 suggests a different caching strategy.

First of all, I want to give a short classification of caching strategies:

• Coherence detection strategy. When are inconsistencies detected?

– Static caching. The compiler does an analysis of possibly inconsistent data at
compile time. This is only a theoretical approach.

– Dynamic caching. Inconsistencies are detected by special routines at runtime.
This approach is typical for distributed systems. We can further distinguish by
strategy:

∗ Pessimistic. Before accessing a data item in a cache, we check if it is
consistent with the original data at the server.

∗ Optimistic. We access the data item and check consistency afterwards.
This approach is only applicable in transactional systems, because we may
have to roll back the access.

• Coherence enforcement strategy. This determines how client caches are kept
consistent with the server copies. In other words: what should happen when a data
item on the server is changed?

– Non-sharing. We forbid caching altogether. While this solves all consistency
issues, it is obviously useless.

– Sharing.

∗ Invalidations. When data items are modified, the server sends invalidation
messages to all caches.

∗ Update propagations. When data items are modified, the server sends
updates to all caches.

∗ Client invalidation checks. Before accessing cached data items, the client
checks with the server whether the data items are still up to date.

∗ Leases. Clients have to regularly renew their leases (check for updates),
otherwise they invalidate their cache.

• Client-update propagation strategies. This issue is similar to the previous one,
but now: what should happen when a client modifies cached data?

– Nothing. If the client implements only a read-only cache, it can change data
only on the server and thus avoids all consistency issues. In read-write client
caches, doing nothing is of course not an option.

– Write-through. All modifications are immediately forwarded from clients to the
server.

– Write-back. Similar to write-through, but for performance reasons clients delay
update propagation and instead mark modified items as "dirty".

Now back to NFS:

• Client cache for files. This cache is (of course) dynamic and pessimistic. Before
accessing cached data items, the clients poll the server for new versions; thus we have
client invalidation checks. When a client closes a file, all modified data items must
immediately be flushed back to the server; thus we have write-through.

• Client caches for file attributes, file handles and directories. This is very
similar to the previous strategy, but the designers of NFS assumed that these items
do not change as frequently as files. The main difference to caching files is that we
use leases instead of client invalidation checks.

9. We stated that NFS implements the remote access model to file handling. It
can be argued that it also supports the upload/download model. Explain why.

In the remote access model, the client transparently accesses a remote file server. It is kept
unaware of the actual location and storage of the file and is offered an interface to access it.
The important thing is that the server is responsible for allowing / forbidding access and
for the whole management of files. This model corresponds to NFSv3 and v4.

In the upload / download model, the client downloads a file, stores and manipulates it
locally and uploads it again. This basically means that the responsibility for files lies with
the clients. A classical example is the FTP service. NFSv4 supports a similar model with
"open delegations": servers can delegate files and rights to clients in order to improve
performance (they shift load from servers to clients). Now the client is allowed to grant
other clients access rights for a file or not; it is in full control of what should happen with
the file. That is why open delegations correspond to the upload / download model. A major
difference is that NFS servers can recall delegations by force, while in the upload / download
model clients voluntarily turn control back to the server.

11. To what extent will the duplicate-request cache as described in the text
actually succeed in implementing at-most-once semantics?

At-most-once semantics mean that zero or one executions of a request are guaranteed, but
never more; in the failure case the client eventually gives up and reports back an error.

A duplicate-request cache on the server side stores the message identifiers of all incoming
requests. When a duplicate arrives at the server, it can recognize it and will not execute
the same request multiple times.

It remains to be seen whether communication follows at-most-once semantics if the server
implements a duplicate-request cache. When we assume that the application itself issues
one request and the underlying proxy transparently handles the communication (e.g.
eventually resending requests), then we can make the following case analysis:

• Server down. In this case the client will get no answer after the first request. It
can safely reissue the request, because the duplicate-request cache on the server would
recognize duplicates. If the client still gets no reply after a couple of tries, it will give
up and report an error. No execution happened.

• Server up.

– Standard case. The client sends a request, the request is processed quickly
and the results are sent back before a timeout on the client is triggered. One
execution happened.

– Busy server. The client sends a request; the server starts processing, but very
slowly. Because the client gets no answer within a certain time, it assumes that
the request or the reply was lost and reissues the message with the same message
identifier as in the original request. The server will detect that it received a
duplicate and can safely discard it. One execution happened.

– Crossing messages. The client sends a request; the server starts processing,
but very slowly, just like in the previous case. At a certain point it finishes and
sends back the results. Right after the server finishes, but before receiving the
answer, the client triggers the timeout and resends the request. In other words:
both messages cross each other. The server will detect the duplicate and can
safely ignore it, because it has already sent the results. One execution happened.

– Lost request. When the original request was lost, the client will resend the
request after some time. No duplicate will be discovered. One execution
happened.

– Lost reply. If the server's response was lost, the client will trigger the timeout
and resend the request. The server will detect the duplicate and now has a choice.
If we are careful, we have only one execution and possibly multiple idempotent
executions.

∗ Execute again. This is an option only for idempotent operations (e.g.
reading the first 100 bytes of a file).

∗ Send cached results. The server stores the results of all operations in the
duplicate-request cache. This has two consequences: 1. The cache blows
up strongly in size, just to avoid re-execution in this unlikely case. 2. More
cached values mean "more" state on the server and potentially more trouble
in the case of server recovery.

There are no other cases. We see that the duplicate-request cache gives us everything we
need to implement at-most-once semantics.
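
A minimal sketch of such a duplicate-request cache (hypothetical types; a real server would
also age out old entries and keep the cache in stable or kernel memory):

import java.util.concurrent.ConcurrentHashMap;

class DuplicateRequestCache {
    private static final byte[] IN_PROGRESS = new byte[0]; // sentinel marker
    // message id -> cached reply, or IN_PROGRESS while the request is running
    private final ConcurrentHashMap<String, byte[]> seen = new ConcurrentHashMap<>();

    byte[] handle(String messageId, byte[] request) {
        byte[] prev = seen.putIfAbsent(messageId, IN_PROGRESS);
        if (prev != null) {
            // Duplicate: either still in progress (drop it) or answered (resend reply).
            return prev == IN_PROGRESS ? null : prev;
        }
        byte[] reply = execute(request); // executed exactly once per message id
        seen.put(messageId, reply);      // cache the reply for retransmissions
        return reply;
    }

    byte[] execute(byte[] request) { /* the actual file system operation */ return new byte[1]; }
}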

13. Fig. 10-21 suggests that clients have complete transparent access to Vice.
To what extent is this indeed true?

Coda is indeed highly transparent:

• Access transparency. Coda presents itself to client workstations (Virtue) as a
traditional UNIX file system. For example, a user can transparently access a mounted
Coda file system in /coda, but "feels" no difference to other (local) directories, just
like with NFS. The user never has to deal with Vice servers directly.

• Location, migration, relocation and replication transparency. Apart from
the Vice server a client is currently accessing, users cannot tell on which other Vice
servers a file is stored. Files may be transparently relocated and replicated; the user
cannot detect it.

• Failure transparency. The Coda client-server mesh can be split temporarily into
independent parts. Clients will then just use cached copies of files, without noticing
the split.

• Concurrency transparency. Coda provides clients with transactional semantics
and can handle concurrent transactions of different users.

The last two points reveal some problems:

• Failure transparency: if the network splits and a file is written in each of the halves
independently, the conflict cannot be solved automatically (transparently) any more.

• If two clients want to write a file at the same time, both are allowed to start their
transactions. Anyway, because one transaction will read outdated values, it will be
aborted and rolled back.

Obviously Coda is designed to provide a lot of transparency, but in hopeless situations it
cannot do magic.

15. If a physical volume in Coda is moved from server A to server B, which
changes are needed in the volume replication database and volume location
database?

The question is how naming in the Coda file system works. (Physical) volumes are the
smallest mountable unit. They are located on a certain Vice server and have a unique
Volume Identifier (VID). This VID is location-independent, which means that clients have
to use a VID directory service, the Volume Location Database. This database gives clients
the actual physical location of a volume (e.g. an IP address). Thus, if a physical volume
moves from server A to server B, we have to change the record of the volume to point to
the new server.

Furthermore, a physical volume can be replicated on multiple servers. A replicated group
of physical volumes is called a logical volume. Each group has a Replicated Volume Identifier
(RVID). To resolve an RVID into a list of VIDs, the client uses the Volume Replication
Database. Because a replication group consists of VIDs, and the VIDs themselves are not
changed (only the pointer to the real location changes), we do not have to change anything
within the Volume Replication Database.

17. Explain how Coda solves read-write conflicts on a file that is shared between
multiple readers and only a single writer.

Because of its distributed nature, Coda does not use traditional UNIX semantics (sequential
consistency), but transactional semantics. Each time a client opens a file, a session is
started, which lasts until the client closes the file.

For our example of multiple readers and one writer, there is only one allowed order:

• The first reader opens a file for reading. A copy of the file is transferred to the client.
The server records that the client has a copy.

• All the other readers do the same.

• Now the writer opens the file for writing. Again, a copy is transferred to him and
registered at the server.

Only this order is allowed, because during write sessions all attempts to read the file by
other clients are forbidden. We can think of the sessions as independent transactions. When
the writer closes its session, the modified copy is transferred back to the server
(write-through) and the server notifies all clients that they are now reading from an
outdated copy. Now it is up to the clients what they do:

• If they really need the most up-to-date data, they can roll back their transaction.

• Because of the transactional semantics they can also complete their (consistent)
transactions. The problem here is that their results are based on an outdated version
of the file.

So it is not completely clear what to do. The only way to make such situations as seldom
as possible is to make all sessions as short (atomic) as possible.

19. If a file on a Vice server needs its own specific access control list, how can
this be achieved?

The problem is that Coda ACLs can refer only to directories, never to single files.
Thus, if we want to implement per-file ACLs, we must put every file into its own directory.

21. Can union directories in Plan 9 replace the PATH variable in UNIX sys-
tems?

Theoretically yes. The UNIX PATH variable is a list of directories which is searched for a
valid path to an executable every time the user types a command. Plan 9 has a feature called
union directories, which can be "abused" to implement the PATH variable directly in the
file system.

A union directory is simply a directory in which multiple other directories are mounted. So,
if we have PATH=/usr/local/bin:/usr/bin:, we could mount /usr/local/bin and /usr/bin
in, e.g., /variables/PATH. If an executable is in both subdirectories, Plan 9 will sequentially
search /variables/PATH (in the order of mounts). The first found result will be executed,
just as with the classical PATH variable.

23. What is the main advantage of using stripe groups in xFS compared to the
approach in which a segment is fragmented across all storage servers?

xFS splits data segments into n fragments plus one parity fragment, which can be used for
reconstructing a lost fragment. For performance and failure tolerance, the n + 1 fragments
must somehow be distributed over m file servers.

In the "Zebra" model, we try to distribute all fragments over as many servers as possible, so
that each server holds at most one fragment. The xFS approach is to partition all servers
into "stripe groups". All fragments of a segment are distributed only over the servers of one
stripe group, so that one server can hold more than one fragment.

There are two advantages of this approach:

• In the Zebra model, each server holds fragments of segments from all over the file
system. When one server goes down, all segments are affected and have to be
reconstructed with the parity fragments, while in the xFS model only the segments of
one stripe group are affected.

• For fetching one segment in the Zebra model, a client potentially has to contact all
servers in the network. In the xFS approach, only the servers of one stripe group have
to be contacted. This is a clear advantage for server and network load.
