
Distributed Systems Course Notes

Original notes by Ian Wakeman with revisions by Dan Chalmers

Contents

1 Introduction
  1.1 Distributed Systems Trailer
    1.1.1 Detailed content of Distributed Systems
  1.2 Aims and Learning Outcomes
  1.3 Prerequisites
  1.4 Teaching methods
  1.5 Assessment
  1.6 Course programme
  1.7 Reading list

2 Lecture Notes
  2.1 Introduction
    2.1.1 Course Outline
    2.1.2 What's a Distributed System?
    2.1.3 Example Distributed Systems
    2.1.4 What do we want from a Distributed System?
    2.1.5 Elements of a Distributed System
    2.1.6 Conclusion
  2.2 Bits and Bytes
    2.2.1 Bits, Bytes, Integers etc.
    2.2.2 Prefixes
    2.2.3 Memory and Packets
    2.2.4 Bit Manipulation
  2.3 Foundations of Distributed Systems
    2.3.1 Physical Concepts
    2.3.2 Packets
    2.3.3 Conclusion: Network Properties
  2.4 Operating System Support
    2.4.1 Protocols
    2.4.2 Layering
    2.4.3 Reliable Transmission: The Basic Techniques
    2.4.4 Group Communication
    2.4.5 Sockets
    2.4.6 Request and Response
    2.4.7 Conclusion
  2.5 Remote Procedure Call (RPC)
    2.5.1 Send/Receive
    2.5.2 Message Styles
    2.5.3 Remote Procedure Call
    2.5.4 Cross domain communication
    2.5.5 Problems with RPC
    2.5.6 Summary
  2.6 Distributed Objects: The Java Approach
    2.6.1 What's an Object?
    2.6.2 Why Distributed Objects?
    2.6.3 How to build Distributed Object Systems
    2.6.4 Objects and RPC systems
    2.6.5 Java RMI
    2.6.6 Summary
  2.7 Enterprise computing and CORBA
    2.7.1 Computers in Business
    2.7.2 Three Tier Models and the Web
    2.7.3 CORBA
    2.7.4 Web Services - Business to Business
    2.7.5 Summary
  2.8 Computer Security: Why you should never trust a computer system
    2.8.1 Definitions
    2.8.2 Authentication
    2.8.3 Authentication in distributed systems: Private Key Encryption
    2.8.4 Public Key Encryption
    2.8.5 Secure Socket Layer
    2.8.6 Authorisation
    2.8.7 Enforcement
    2.8.8 Trusted Computing Platform
    2.8.9 Firewalls
    2.8.10 Classes of security problems
    2.8.11 Lessons
  2.9 Names and Naming Services
    2.9.1 Main Points
    2.9.2 Why names?
    2.9.3 What does one do with names?
    2.9.4 What's a name?
    2.9.5 Partitioned names
    2.9.6 Descriptive names
    2.9.7 Object Location from name - broadcast
    2.9.8 Location through database
    2.9.9 Distributed Name Servers
    2.9.10 Availability and performance
    2.9.11 Maintaining consistency for distributed name services
    2.9.12 Client and Name server interaction
    2.9.13 Summary
  2.10 Distributed File Systems
    2.10.1 Main Points
    2.10.2 Client Implementation
    2.10.3 No Caching
    2.10.4 NFS - Sun Network File System
    2.10.5 Andrew File System
    2.10.6 Summary
  2.11 Peer to Peer (p2p) Services and Overlay Networks
    2.11.1 Overlay Networks
    2.11.2 Gnutella
    2.11.3 Chord - A Distributed Hash Table Example
    2.11.4 Current Research Challenges
    2.11.5 Summary
  2.12 Content Distribution Networks
    2.12.1 Getting Content over a Network
    2.12.2 Web Caches
    2.12.3 Pre-fetching Data
    2.12.4 Using your Peers: BitTorrent
    2.12.5 Summary
  2.13 Replication: Availability and Consistency
    2.13.1 What is Replication?
    2.13.2 Issues in Replication
    2.13.3 Consistency
    2.13.4 Updating Server state
    2.13.5 Multicast and Process Groups
    2.13.6 Message Ordering
    2.13.7 Summary
  2.14 Shared Data and Transactions
    2.14.1 Servers and their state
    2.14.2 Atomicity
    2.14.3 Automatic Teller Machines and Bank accounts
    2.14.4 Transactions
    2.14.5 Serial Equivalence
    2.14.6 Summary
  2.15 Concurrency Control and Transactions
    2.15.1 Why concurrency control?
    2.15.2 Locking
    2.15.3 Optimistic Concurrency Control
    2.15.4 Timestamping
    2.15.5 Summary
  2.16 Distributed Transactions
    2.16.1 Single Server Transactions
    2.16.2 Distributed Transactions
    2.16.3 Atomic Commit Protocols
    2.16.4 Distributed Concurrency Control
    2.16.5 Summary
  2.17 Transactions: Coping with Failure
    2.17.1 Failure Modes
    2.17.2 Recovery
    2.17.3 Network Partition
    2.17.4 Summary

3 Exercises and answers
  3.1 Exercises
    3.1.1 Communication System Fundamentals
    3.1.2 Devising a Routing Protocol
    3.1.3 Layering
    3.1.4 Serialization
    3.1.5 Remote Procedure Call
    3.1.6 Security
    3.1.7 Names and distributed filing systems
    3.1.8 Availability and Ordering
    3.1.9 Transactions and Concurrency
    3.1.10 Distributed Transactions
  3.2 The Answers
    3.2.1 Communication System Fundamentals
    3.2.2 Devising a Routing Protocol
    3.2.3 Layering
    3.2.4 Serialization
    3.2.5 Remote Procedure Call
    3.2.6 Names and Distributed File Systems
    3.2.7 Transactions and Concurrency
    3.2.8 Distributed Transactions
  3.3 Sample Exam Question
    3.3.1 Availability and Ordering
    3.3.2 The answer

Remember to check Sussex Direct for the times and rooms, as these vary between UG and PG students and from week to week.

Week       Lecture                          Lecture                          Exercise Class        Assignment
1  (8/1)   Introduction                     Fundamentals                     Programming Exercise
2  (15/1)  OS Issues                        RPC                              Programming Exercise
3  (22/1)  Object Systems                   Enterprise Computing and CORBA   Programming Exercise
4  (29/1)  Security                         Security                         Security              Ass 1 due
5  (5/2)   Naming                           Distributed File Systems         Marking
6  (12/2)  P2P Networks                     P2P Networks                     Names
7  (19/2)  Replication                      Replication                      Exam sample
8  (26/2)  Transactions                     Concurrency Control              Programming Exercise  Ass 2 due
9  (5/3)   Distributed Transactions         Coping with Failure              Marking
10 (12/3)  Pervasive Computing, Management  No lecture                       No class              Portfolio due

Table 1: Course Timetable 2006

Chapter 1

Introduction
This is the online version of the course information sheet. These notes will be updated from time to time as the course progresses. For 2006-7, questions and comments should be directed to Dan Chalmers, either in a timetabled session, in my office hour (9:30-10:30 on Mondays), or by email to d.chalmers@sussex.ac.uk.

1.1 Distributed Systems Trailer

Learn about even more Internet applications!!! Learn about distributed Operating Systems!!!! Learn how to program Distributed Systems!!!!!

1.1.1 Detailed content of Distributed Systems

How to build software systems on multiple machines connected by networks.
- RPC, distributed objects, Content Distribution Networks, Peer to Peer applications, replication and concurrency control, and other stuff
- Pre-requisite: Multimedia Communication Systems (or be prepared to learn fast)
- Languages: Java. Learn Java sockets, RMI, and how to build simulations
- 2 assignments; 40% coursework, 60% exam
- Classes are a mixture of programming and problem solving


1.2 Aims and Learning Outcomes

This course aims to convey an understanding of the problems of programming distributed systems. After taking the course, the student will be able to program over a networked system, and will be able to criticise algorithms and designs for distributed systems.

1.3 Prerequisites

The course assumes that the courses Introduction to Operating Systems and Multimedia Communications Technology have been taken. If you haven't taken these courses, be prepared to do additional work to keep up. It also assumes programming skills in the Java programming language and a rudimentary knowledge of computer hardware.

1.4 Teaching methods

Two lectures and one exercise class per week, together with 2 programming assignments. Check http://www.informatics.sussex.ac.uk/courses/dist-sys/ for copies of lecture slides and other course material.

1.5 Assessment

This is a one term course, taught in the spring term of the final year. The undergraduate course is assessed by coursework (40%) and by unseen examination (60%). The postgraduate course is assessed solely through coursework (100%).

1.6 Course programme

This programme is subject to change. Remember to check Sussex Direct for the times and rooms, as these vary between UG and PG students and from week to week.

1.7 Reading list

Distributed Systems: Concepts and Design. George F. Coulouris, Jean Dollimore and Tim Kindberg, Addison-Wesley, fourth edition, 2005. This is the course textbook, and I will occasionally be recommending readings from it. The notes are based on the third edition (2001), although updates to reflect the fourth edition are being made and this is the most current version.

Computer Networking: A Top-Down Approach Featuring the Internet. J. Kurose and K. Ross, Addison-Wesley, 2000. ISBN 0-201-47711-4. The course textbook from Multimedia Communications Technology, full of useful and overlapping material.

Operating System Concepts. Abraham Silberschatz, Peter B. Galvin and Greg Gagne, John Wiley & Sons, sixth edition. If you already have this book from the Operating Systems course, the material on distributed systems is very relevant.


Chapter 2

Lecture Notes
2.1 Introduction

Main Points:
- What we are going to cover
- What distributed systems are
- What characteristics well-designed systems have

2.1.1 Course Outline

What are we going to do?
- 1 lecture on computer communications and networks and loosely coupled systems, revising material from the Multimedia Communications Technology course. The postgraduate course will get an additional two hours in their seminar slot.
- 5 lectures on Remote Procedure Call, distributed objects and classic distributed systems such as NFS
- 3 lectures on integrating distributed systems through the web, content distribution networks and Peer to Peer computing
- 5 lectures on closely coupled systems for distributed transactions
- 1 lecture on recent advances in networks and distributed systems

Exercises

Two exercises:
1. TBA (33% weighted)

2. TBA (66% weighted)


The exercises will be in Java and, where appropriate, will build upon and extend provided skeleton code. The exercises will be peer-assessed. In the next exercise class after the hand-in you will be required to mark two of your classmates' assignments. Successfully marking an assignment will provide you with 10 marks towards the possible 100 on that assignment. At the end of the term, you will hand in your two assignments along with the assessments from your classmates. These will be formally given a course mark out of 40 marks.

Detailed Timetable

See Table 1 (Course Timetable 2006). Remember to check Sussex Direct for the times and rooms, as these vary between UG and PG students and from week to week.

MSc Programme
- Lectures shared with undergraduates
- Details to be discussed in the MSc seminar
- Read from the reading list
- Work on the programming exercises

2.1.2 What's a Distributed System?

A distributed system: physically separate computers working together.
- Cheaper and easier to build lots of simple computers
- Easier to add power incrementally
- Machines may necessarily be remote, but the system should work together
- Higher availability - one computer crashes, others carry on working
- Better reliability - store data in multiple locations
- More security - each piece is easier to secure to the right level

The real world...

In real life, you can get:
- Worse availability - every machine must be up. A distributed system is one where some machine you've never heard of fails and prevents you from working.
- Worse reliability
- Worse security

Problem: coordination is more difficult, because multiple people are involved and communication is over a network.

Your Task: What are the distributed interactions when you log in at an X terminal?

Figure: Simplified interactions at an X terminal.

2.1.3 Example Distributed Systems

Electronic Mail: Mail delivered to a remote mailbox. Requires a global name space to identify users, and transport mechanisms to get mail to the mailbox.

Distributed Information - WWW: Remote information hidden below a hypertext browser. Caching and other features operate transparently.


Distributed File System: Files stored on many machines, generally not the machine you're working on. Files are accessed transparently, by the OS knowing they're remote and doing remote operations such as read and write on them, e.g. the Network File System (NFS).

Trading Floor System: Bids made, stocks sold, screens updated.

Network assumptions
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

(Source: The Eight Fallacies of Distributed Computing - Peter Deutsch)

2.1.4 What do we want from a Distributed System?

1. Resource Sharing
2. Openness
3. Concurrency
4. Scalability
5. Fault Tolerance
6. Transparency

Your Task: Order the importance of each of these features for the example systems in the previous section.


2.1.5 Elements of a Distributed System

Another way to view a system...
- Communications system
- Messages
- Machines
- Processes on machines
- Programs
- People

2.1.6 Conclusion

It's difficult to design a good distributed system: there are a lot of problems in getting good characteristics, not the least of which is people. Over the next ten weeks you will gain some insight into how to design a good distributed system.

2.2 Bits and Bytes

Main Points:
- Bits and Bytes
- Kilo, Mega, Giga, Tera
- Memory, Packets and Bit Manipulation

2.2.1 Bits, Bytes, Integers etc.

- All data in computers is held as a sequence of ones and zeros.
- A number is represented as a base 2 number held in some pre-ordained, fixed number of bits.
- Almost all other data is represented as various sets of numbers.
- A byte is 8 bits sequenced together - what is the maximum number?
- Integers (in Java and most other languages nowadays) are 32 bits long - what is the maximum number?
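The two questions above can be checked directly in Java. A small sketch (note that Java's `int` is signed, so its maximum is 2^31 - 1, not 2^32 - 1):

```java
public class BitLimits {
    public static void main(String[] args) {
        // A byte is 8 bits: the largest unsigned value is 2^8 - 1
        int maxUnsignedByte = (1 << 8) - 1;
        System.out.println(maxUnsignedByte);        // 255

        // Java ints are 32 bits but signed, so the maximum is 2^31 - 1
        System.out.println(Integer.MAX_VALUE);      // 2147483647
    }
}
```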


2.2.2 Prefixes

In communications we often talk about throughput in bits/second, or about moving files of some particular size. We use magnitude prefixes for convenience:

kilo  10^3
mega  10^6
giga  10^9
tera  10^12

There is often confusion as to whether a kilobyte is 1000 or 1024 bytes. When dealing with processor architectures, it's generally 1024. When dealing with communications, it's generally 1000. State assumptions if it is not obvious from context.
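A quick illustration of why the 1000-vs-1024 convention matters: the time to move a "1 megabyte" file over a 1 megabit/s link differs by almost 5% depending on which convention defined the file size. The figures here are a worked example, not from the notes:

```java
public class Prefixes {
    public static void main(String[] args) {
        double linkBitsPerSecond = 1e6;     // a 1 megabit/s link (mega = 10^6)
        double fileDecimal = 1_000_000;     // 1 "megabyte", communications convention
        double fileBinary  = 1L << 20;      // 1 "megabyte", memory convention (1048576 bytes)

        // Transfer time = bits to send / bits per second
        System.out.println(fileDecimal * 8 / linkBitsPerSecond); // 8.0 seconds
        System.out.println(fileBinary  * 8 / linkBitsPerSecond); // 8.388608 seconds
    }
}
```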

2.2.3 Memory and Packets

A computer stores data as bits in memory. When it wants to send this data to another computer, it copies the bits into the memory of the communications device. The communications device sends the bits through the network to the other machine (we'll cover the details of this in the coming weeks). The other machine's communications device places the bits into memory which the other machine can access. The tricky bit comes in ensuring that both machines interpret the bits correctly.
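The interpretation problem can be seen with Java's `ByteBuffer`: the same four bytes in memory yield different integers depending on the byte order the receiver assumes, so both machines must agree on one. A small sketch:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WireFormat {
    public static void main(String[] args) {
        // Sender: copy an integer into a buffer in an agreed (big-endian) byte order
        ByteBuffer out = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        out.putInt(1027);
        byte[] wire = out.array();   // these 4 bytes travel across the network

        // Receiver: reading with the agreed order recovers the value...
        int ok = ByteBuffer.wrap(wire).order(ByteOrder.BIG_ENDIAN).getInt();
        // ...but reading the same bits with the wrong order misinterprets them
        int bad = ByteBuffer.wrap(wire).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println(ok);   // 1027
        System.out.println(bad);  // 50593792
    }
}
```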

2.2.4 Bit Manipulation

Not only must the data be sent, but also accompanying information allowing the computers to interpret the context of the data. Communications software must be able to pick out arbitrary bits from an opaque bit sequence and interpret their meaning. We do this using bitwise operations - and, or, exclusive or, negation.
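A sketch of picking fields out of a packet header with shifts and masks. The 32-bit layout here (4-bit version, 12-bit flags, 16-bit length) is hypothetical, chosen only to illustrate the technique:

```java
public class HeaderBits {
    public static void main(String[] args) {
        // Hypothetical header word: top 4 bits = version,
        // next 12 bits = flags, low 16 bits = payload length
        int header = (4 << 28) | (1 << 16) | 64;

        // Shift each field down to bit 0, then mask off everything above it
        int version = (header >>> 28) & 0xF;
        int flags   = (header >>> 16) & 0xFFF;
        int length  = header & 0xFFFF;

        System.out.println(version + " " + flags + " " + length); // 4 1 64
    }
}
```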

2.3 Foundations of Distributed Systems

Aim: How do we send messages between computers across a network?
- Physical Concepts: bandwidth, latency
- Packet Concepts: data, headers
- Routing Concepts: shared media, switches and routing tables, routing protocols

For more detailed notes, see Dan Chalmers' Multimedia Communications Technology notes at http://www.informatics.sussex.ac.uk/courses/mct/.

2.3.1 Physical Concepts

What is a message? A piece of information which needs to move from one process to another, e.g. a request to open a file, an email message, the results of a remote print request. For the network, this is just a sequence of bits in memory. We need to communicate this sequence of bits across the communications system to the other host, signalling to the other side whether each bit in the sequence is 0 or 1.

To communicate, each end needs access to a common substrate, and the sender changes the value of a physical characteristic of that substrate. Examples: the voltage level of a piece of wire, the light level in an optical fibre, the frequency of a radio wave in air.

Signal characteristics

If the physical characteristic has two levels then the signal is binary. If it has more than two levels, a number of bits can be encoded in each level. With 4 levels, level 0 = 00, level 1 = 01, level 2 = 10, level 3 = 11. If there are 8 levels, how many bits can be encoded?

Bandwidth

Adjusting the level at one end, and recognising that the level has changed at the other, takes finite time. This finite time limits the speed at which bits can be signalled. The rate of signalling is the bandwidth, measured in bits per second. If it takes 20 microseconds to raise the voltage on a wire, and 10 milliseconds to recognise the new level, what is the possible bandwidth if there are two levels? Eight levels?

Noise and errors

Can we get infinite bandwidth by increasing the number of levels? No, since noise makes it impossible to correctly distinguish the levels. Noise is random change in the physical characteristic, to which all physical phenomena are prone. There is always some probability that a level will be misinterpreted, and hence always some probability of error in a message.
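The bandwidth exercise above can be worked directly: one level change takes the raise time plus the recognition time, each symbol carries log2(levels) bits, and the bandwidth is symbols per second times bits per symbol. A sketch using the figures from the text:

```java
public class SignallingRate {
    public static void main(String[] args) {
        double raise = 20e-6;      // 20 microseconds to change the level
        double recognise = 10e-3;  // 10 milliseconds to recognise the new level
        double symbolsPerSecond = 1.0 / (raise + recognise);

        // log2(levels) bits are carried per level change
        double twoLevels   = symbolsPerSecond * (Math.log(2) / Math.log(2));
        double eightLevels = symbolsPerSecond * (Math.log(8) / Math.log(2));

        // roughly 99.8 bit/s with two levels, 299.4 bit/s with eight
        System.out.printf("%.1f bit/s, %.1f bit/s%n", twoLevels, eightLevels);
    }
}
```

Note how the slow recognition time dominates: the 20 microsecond raise time barely matters.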


The goal of communications engineers is to make this probability as low as necessary.

Latency

Does the signal propagate along the wire infinitely fast? No, it is limited by the speed of light, and hence described as propagation delay. Latency is the time taken for a signal to travel from one end of the communication system to the destination. Since the communication system may reconstruct the message at intermediate points, the time taken to reconstruct the message is also part of latency, known as switching delay.

Your Task: Describe the bandwidth, propagation delay and switching delay in a game of Chinese whispers.

2.3.2 Packets

If there is an error in a message, how is it detected and rectified?
1. Compute a checksum over the message
2. Send the checksum with the message
3. Calculate a new checksum over the received message
4. Compare the checksums - if different, then the message is in error
5. Ask for the message to be resent

The probability of error rises with the length of the message. Thus messages are sent in separate lumps with a maximum number of bits per lump, known as packets. If the message fits in one packet, good. Otherwise the message is split across many packets.

Addressing

How do we direct the packet to the correct recipient(s)? Put header bits on the front of the packet, analogous to the address and other information on an envelope in the postal system. Add the source of the packet to allow returns:

  Destination | Source | Packet body

If the destination and source share the same physical medium, then the destination can listen for its own address (Ethernet, wireless, a single point-to-point link). If they don't share the same LAN, we use routing.
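The error-detection steps above can be sketched with a toy checksum. This is a simple additive 16-bit sum for illustration only; real protocols use stronger codes (IP uses a ones'-complement sum, link layers typically use CRCs):

```java
public class ChecksumDemo {
    // A simple additive 16-bit checksum (illustrative, not the Internet checksum)
    static int checksum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum = (sum + (b & 0xFF)) & 0xFFFF;   // keep the running sum in 16 bits
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] sent = "hello".getBytes();
        int sentSum = checksum(sent);            // travels with the packet

        byte[] received = sent.clone();
        received[0] ^= 0x01;                     // simulate a one-bit error in transit

        // Receiver recomputes and compares: a mismatch means "ask for a resend"
        System.out.println(checksum(received) == sentSum); // false
    }
}
```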

Figure: Processes in user space communicate through sockets provided by the operating system. A packet carries source and destination port, IP and MAC headers, and travels via network interfaces and switches across the local network and out to the Internet.
Names, addresses and routes

Name: Identifier of an object, e.g. Ian Wakeman, Mum
Address: Location of an object, e.g. Rm 4C6, COGS
Route: How to get there from here - turn left, first right and up the stairs, first door on the left

Internet: the name is a Domain Name, such as www.cogs.susx.ac.uk (more later). To get a packet through the network, turn the name into an address by asking the Domain Name Service (DNS). Place the address in the packet header and hand the packet to the nearest router. The router locates the MAC address corresponding to the IP address, if they're on the same LAN.

Indirect Addressing

Architecture of a switch

A switch is a specialised computer; you can turn a PC into a router using free software.
1. A packet arrives on a link
2. The switch gets the destination of the packet from the packet header
3. The switch looks up the destination in a routing table in memory and discovers the output link
4. The switch queues the packet to be sent on the output link
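The forwarding steps above can be sketched as a table lookup. The addresses and link numbers here are made up, and real routers match on address prefixes (longest-prefix match) rather than exact addresses:

```java
import java.util.HashMap;
import java.util.Map;

public class SwitchSketch {
    // Routing table: destination address -> output link number (illustrative)
    private final Map<String, Integer> routingTable = new HashMap<>();
    private final int defaultLink;

    SwitchSketch(int defaultLink) { this.defaultLink = defaultLink; }

    void addRoute(String destination, int link) { routingTable.put(destination, link); }

    // Steps 2-4: read the destination from the header, look it up,
    // and pick the output link, falling back to the default link
    int outputLinkFor(String destination) {
        return routingTable.getOrDefault(destination, defaultLink);
    }

    public static void main(String[] args) {
        SwitchSketch sw = new SwitchSketch(0);               // link 0 is the default route
        sw.addRoute("192.168.1.5", 2);
        System.out.println(sw.outputLinkFor("192.168.1.5")); // 2
        System.out.println(sw.outputLinkFor("10.9.9.9"));    // 0 (default link)
    }
}
```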


Figure: Architecture of a switch - a packet arrives on an input link; the processor looks up the packet's destination in the routing table held in memory, and queues the packet on the appropriate output link.

Distributed Routing Table Maintenance

The situation:

- People are network administrators (and end users)
- Communication systems are links (possibly multipoint)
- Machines are switches
- Messages may be lost

The problem: given an address, how does a router know which output link to send the packet on?

Choices:
1. The packet could contain a list of all output links - source routing. This requires the source to look up and determine the route, which may be a good thing.
2. The router could look up the address in a local routing table, and send out of the corresponding link. If the address is not in the table, send out of a default link.

How do we construct tables? We could install entries by hand, known as static routing. But this is:
- Limited to entries people know about
- Unable to ensure consistency and absence of loops
- Unable to respond to changing topology, sharks gnawing through undersea cables etc.

The Internet has no centralised authority, so we use a distributed routing algorithm.

Distance Vector Routing

1. Each switch knows its own address.


2. Each link has a cost, such as a value of 1 per link, or a measure of delay.
3. Each switch starts out with a distance vector consisting of 0 for itself, and infinity for everyone else.
4. Switches exchange distance vectors with neighbour switches, and whenever the info changes.
5. Each switch saves the most recent vector from its neighbours.
6. Each switch calculates its own distance vector by examining the cost advertised by each neighbour and adding the cost of the link to that neighbour.
7. Use the link with minimum cost to the destination as the link to route out.

Examples include RIP and BGP.

Link State Routing

1. Each switch knows the addresses of its direct neighbours.
2. Each switch constructs packets saying who its neighbours are - link state packets.
3. Link state packets are flooded to all other switches.
4. Each switch constructs a complete graph using the most recent link state packets from all other switches.
5. Use Dijkstra's shortest path algorithm to figure out the routing table.

Examples include OSPF.
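One round of the distance-vector calculation (steps 5-7) can be sketched as a Bellman-Ford update: for every destination, take the minimum over all neighbours of the link cost plus the neighbour's advertised distance. The topology in `main` is made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class DistanceVector {
    // Combine each neighbour's advertised vector with the cost of the
    // link to that neighbour, keeping the minimum cost per destination
    static Map<String, Integer> computeVector(String self,
                                              Map<String, Integer> linkCost,
                                              Map<String, Map<String, Integer>> neighbourVectors) {
        Map<String, Integer> vector = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> nb : neighbourVectors.entrySet()) {
            int toNeighbour = linkCost.get(nb.getKey());
            for (Map.Entry<String, Integer> dest : nb.getValue().entrySet()) {
                vector.merge(dest.getKey(), toNeighbour + dest.getValue(), Math::min);
            }
        }
        vector.put(self, 0);   // the distance to ourselves is always 0
        return vector;
    }

    public static void main(String[] args) {
        // A has a cost-1 link to B and a cost-4 link to C; B advertises C at cost 1
        Map<String, Integer> links = Map.of("B", 1, "C", 4);
        Map<String, Map<String, Integer>> vectors = Map.of(
                "B", Map.of("B", 0, "C", 1),
                "C", Map.of("C", 0));
        // A=0, B=1 (direct), C=2 (via B, cheaper than the direct cost-4 link)
        System.out.println(computeVector("A", links, vectors));
    }
}
```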

2.3.3 Conclusion: Network Properties

Packet switched networks present certain fundamental problems to the distributed systems programmer:
- While switches converge on a consistent set of routes, packets will bounce around in the network, and suffer delays or get dropped.
- Switches are shared with other packets, so queues form. Total latency is therefore variable.
- The loss rate is variable (noise, available switch buffers).
- Queue size management (aka congestion control) changes the bandwidth available to machines. Bandwidth is therefore variable.


2.4 Operating System Support

Aim: What do we need from an operating system to support distributed systems?

Main Points:
- Layering: an organisational principle for modular communications
- Protocol techniques: acknowledgements, timers, windows etc.
- Operating system abstractions: processes, threads, sockets, daemons

The basic service provided by the network offers variable loss, bandwidth and latency. We need to layer other services on top.

2.4.1 Protocols

Protocol: an agreement between two (or more) parties as to how information is to be transmitted. At minimum, it will include the interpretation of the bits in the packet. It may also include finite state machine transitions between senders and receivers. Protocol information is carried in headers at the front (or trailers at the end) of packets.

2.4.2 Layering

Networks are used to transmit messages between processes. Protocols are used to give messaging functionality.

Packets in reality      Messaging abstraction
Limited size            Arbitrary size
Unordered (sometimes)   Ordered
Unreliable              Reliable
Machine to machine      Process to process
Local area net          Routed anywhere
Asynchronous            Synchronous
Insecure                Secure

Table 2.2: Packets and messages

Reason for Layering
- Easier to build functions with higher abstractions
- Define an ordering of abstractions to simplify the necessary functions
- Provides modularity

Figure: Layering in the Internet - applications exchange messages; the transport layer provides datagrams (UDP) or streams (TCP); the Internet layer carries IP packets; the network interface carries network-specific frames over the underlying network.

Figure: The headers in an Ethernet frame - the application message is wrapped in a TCP header, then an IP header, then an Ethernet header to form the Ethernet frame.

2.4.3 Reliable Transmission: The Basic Techniques

These are some of the basic protocol techniques used in lower-layer protocols. They are often reused in higher-level protocols.

Labelling
- Split the message into smaller chunks, and place a label in each header indicating which part of the message it is.
- e.g. "abcdefg" becomes "1 of 3: abc", "2 of 3: def", "3 of 3: g".
- Labels are typically sequential fixed-field integers.

Acknowledging
To tell the sender data has been received correctly (after checking checksums etc), use acknowledgements.


- When a data packet is received, send an acknowledgement message back to the sender. The acknowledgement message contains the label of the message being acknowledged.
- Acknowledgement of a packet can sometimes implicitly acknowledge reception of all previously sent packets (go-back-N).
- Multiple labels can be sent in one acknowledgement (selective acknowledgement).
- If data is being returned to the sender, acknowledgement information can be piggybacked on the returning data packet. The acknowledgement information is part of the header, e.g. in TCP.

Timeouts and retransmission
- The sender measures the expected time between sending a message and receiving an acknowledgement.
- The sender starts a timer after sending a packet. If the timer expires before an acknowledgement is received, the packet is resent.
- The receiver must be able to deal with duplicate packets.
Questions: 1. How can the expected round-trip time be measured? 2. What value should the timer be set to?

Negative acknowledgement
Sometimes the pattern of data exchange makes it easier to use negative acknowledgements: if the data flow is constant, the receiver knows when a packet is expected, and can send a negative acknowledgement (NAK) indicating that an expected message hasn't been received.

Windowing
- To increase utilisation and decrease acknowledgement overhead, multiple packets can be sent before the sender waits for an acknowledgement. The number of packets that can be outstanding is the window.
- The window needs to be fixed to avoid overloading the network or the receiver.

- Careful adjustment of the window size is key to avoiding and controlling congestion, and to dynamic performance (slow start in TCP).

State synchronisation
- Both ends of a protocol exchange typically need to agree on some starting state. In labelling systems, both ends need to agree on the initial label value.
- Use a handshake of messages containing suggestions for the state, with return messages agreeing the value of the state.
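The labelling, acknowledgement and timeout techniques above can be sketched together as a stop-and-wait sender over UDP loopback. This is a minimal illustration, not a real protocol: the class name, the fixed 200 ms timer and the receiver that deliberately drops the first datagram (to force a retransmission) are all our inventions; a real sender would adapt the timer from measured round-trip times.

```java
import java.net.*;
import java.nio.charset.StandardCharsets;

public class StopAndWait {
    // Runs a 3-packet transfer and returns how many retransmissions occurred.
    public static int run() throws Exception {
        final DatagramSocket recv = new DatagramSocket(0);   // receiver, OS-chosen port
        final int port = recv.getLocalPort();

        Thread receiver = new Thread(() -> {
            try {
                boolean droppedOnce = false;
                int expectedSeq = 0;
                while (expectedSeq < 3) {
                    byte[] buf = new byte[64];
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    recv.receive(p);
                    if (!droppedOnce) { droppedOnce = true; continue; } // simulate one loss
                    int seq = buf[0];
                    if (seq == expectedSeq) expectedSeq++;   // in-order delivery; duplicates ignored
                    byte[] ack = { (byte) seq };             // the ACK carries the label
                    recv.send(new DatagramPacket(ack, 1, p.getAddress(), p.getPort()));
                }
                recv.close();
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        receiver.start();

        DatagramSocket sender = new DatagramSocket();
        sender.setSoTimeout(200);                            // the retransmission timer
        InetAddress dst = InetAddress.getLoopbackAddress();
        int retransmissions = 0;
        for (int seq = 0; seq < 3; seq++) {
            byte[] data = ("chunk" + seq).getBytes(StandardCharsets.UTF_8);
            byte[] pkt = new byte[data.length + 1];
            pkt[0] = (byte) seq;                             // sequence-number label in the header
            System.arraycopy(data, 0, pkt, 1, data.length);
            boolean acked = false;
            while (!acked) {
                sender.send(new DatagramPacket(pkt, pkt.length, dst, port));
                try {
                    DatagramPacket ack = new DatagramPacket(new byte[1], 1);
                    sender.receive(ack);                     // wait for the ACK...
                    acked = ack.getData()[0] == (byte) seq;
                } catch (SocketTimeoutException t) {
                    retransmissions++;                       // ...or the timer expires: resend
                }
            }
        }
        receiver.join();
        sender.close();
        return retransmissions;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("retransmissions: " + run());
    }
}
```

Because the receiver swallows the first copy of the first packet, at least one timeout fires and the sender resends; sequence numbers let the receiver discard any duplicates that result.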

2.4.4 Group Communication

Often a sender wants the same packet replicated to multiple receivers, e.g. game updates, mirroring etc.
- The network offers multicast to provide this functionality.
- Receivers join a specially addressed group: class D addresses in IP, of the form 1110XXXX.XXXXXXXX.XXXXXXXX.XXXXXXXX.
- The network conspires to deliver packets sent to this address efficiently to all receivers.
- Multicast is available in local area networks, but (unfortunately) not always available in the wide area.

2.4.5 Sockets

Sockets are the near-universal abstraction for using TCP and UDP. They provide data structures for holding addresses and other context information, and methods for sending and receiving data. Clients use sockets to talk to servers, often known as daemons1.

Socket Usage: Single-threaded TCP server
try {
    Create a socket
1 from the Hacker's Dictionary: daemon /day'mn/ or /dee'mn/ n. [from the mythological meaning, later rationalized as the acronym Disk And Execution MONitor] A program that is not invoked explicitly, but lies dormant waiting for some condition(s) to occur. The idea is that the perpetrator of the condition need not be aware that a daemon is lurking (though often a program will commit an action only because it knows that it will implicitly invoke a daemon). [...] Daemons are usually spawned automatically by the system, and may either live forever or be regenerated at intervals.

[Figure: threads of control — processes in user space, each with an address space (thread stacks, heap, code etc) and file and socket descriptors; sockets are the boundary into the operating system kernel, where TCP and UDP sit above IP and the network interfaces.]

    Bind the address to the socket
    loop {
        Accept connection on socket
        Receive data on connection
        Do some work
        Send response
        Close connection on socket
    }
} catch and deal with any exceptions

Socket Usage: TCP client
try {
    Create a socket
    Bind the address of the server to the socket
        // Allows the operating system to choose port
        // and address for the receiver
    Open connection on socket    // Connection setup
    Send data on connection
    Wait for response
    Close connection on socket
} catch and deal with any exceptions

Gotchas
- The chunk of data written to a socket is not necessarily read as a chunk at the remote socket - packetisation may fragment the chunks.


- Both ends have to agree on how to interpret the data written to and read from the socket - the concrete syntax of the data stream.
- Data interpretation follows application standards defining bit fields, e.g. the RFCs defining HTTP, SMTP, FTP etc.
- Interpreting data can require very messy bit manipulation.
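One common remedy for the loss of chunk boundaries is a length-prefixed concrete syntax: write the message length, then the bytes, so the reader can recover the original chunks however the stream was packetised. A minimal sketch (the class name `Framing` is ours, and an in-memory byte stream stands in for the socket):

```java
import java.io.*;

public class Framing {
    public static void writeFrame(DataOutputStream out, byte[] msg) throws IOException {
        out.writeInt(msg.length);    // 4-byte big-endian length prefix
        out.write(msg);
    }

    public static byte[] readFrame(DataInputStream in) throws IOException {
        byte[] msg = new byte[in.readInt()];
        in.readFully(msg);           // blocks until the whole frame has arrived
        return msg;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for the socket
        DataOutputStream out = new DataOutputStream(wire);
        writeFrame(out, "abc".getBytes("UTF-8"));
        writeFrame(out, "defg".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(new String(readFrame(in), "UTF-8")); // prints "abc"
        System.out.println(new String(readFrame(in), "UTF-8")); // prints "defg"
    }
}
```

Real protocols make the same agreement in their standards documents; HTTP, for instance, uses a Content-Length header for the same purpose.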

Socket Usage: Concurrent TCP server
A single-threaded server can only deal with one client at a time. This is often unacceptable in terms of performance, e.g. for web servers.
try {
    Create a socket
    Bind the address to the socket
    loop {
        Accept connection on socket
        Spawn a worker thread to deal with the connection
    }
} catch and deal with any exceptions

and the worker thread goes:
try {
    Receive data on connection
    Do some work
    Send response
    Close connection
} catch and deal with any exceptions

Thread pools
- Spawning a thread for every connection can consume too many resources if connections arrive too quickly.
- An alternative approach is to have a fixed-size pool of threads. When a connection is received, pass it to the next available thread from the pool.
- When all the threads are busy, the server blocks and doesn't accept more connections.
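The thread-pool pattern above can be sketched with java.util.concurrent. The class name `PooledEchoServer` and its trivial echo "protocol" are illustrative inventions; the demo runs server and client in one JVM over loopback:

```java
import java.io.*;
import java.net.*;
import java.util.concurrent.*;

public class PooledEchoServer {
    // Accepts one connection via a fixed-size pool, echoes one line back,
    // and returns the client's view of the response.
    public static String echoOnce(String msg) throws Exception {
        ServerSocket server = new ServerSocket(0);        // bind to any free port
        ExecutorService pool = Executors.newFixedThreadPool(4); // the fixed-size pool

        pool.submit(() -> {                               // worker: one accept for the demo
            try (Socket conn = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(conn.getInputStream()));
                 PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                out.println(in.readLine());               // "do some work": echo
            } catch (IOException e) { throw new UncheckedIOException(e); }
            return null;
        });

        try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                        server.getLocalPort());
             PrintWriter out = new PrintWriter(client.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()))) {
            out.println(msg);                             // send data on connection
            return in.readLine();                         // wait for response
        } finally {
            pool.shutdown();
            server.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("hello"));            // prints "hello"
    }
}
```

A production server would loop on accept and submit each connection to the pool; the bounded pool is what keeps resource use under control when connections arrive quickly.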


2.4.6 Request and Response

Design patterns for client-server communications are very stereotyped. Can we automatically generate the client-server code? Yes! We can model a server as an object waiting for methods to be called; the client then obtains a reference to the object and calls its methods. This leads to distributed object systems and Remote Method Invocation (next lecture).

2.4.7 Conclusion

- Layering is a modularisation approach allowing each layer to improve upon the services offered by the layers below.
- Protocol techniques from lower layers are re-used again and again (see the End-to-End Argument paper).
- Sockets encapsulate TCP/UDP endpoints and can be used to construct clients and servers.

2.5 Remote Procedure Call (RPC)

Main Points:
- Send/Receive
- One-way vs two-way communications
- Remote Procedure Call
- Cross-address-space vs cross-machine communication

2.5.1 Send/Receive

How do you program distributed applications? We need to synchronise multiple threads which run on different machines (we can't use test&set at the bottom as on a single machine). Instead, the atomic operations are Send and Receive - these don't require shared memory for synchronising cooperating threads. A mailbox is a temporary holding area for messages (ports).

The send abstraction
send(mbox, message)
Send a message, possibly over the network, to the specified mailbox. When does send return?
1. When the receiver process gets the message?
2. When the message is safely buffered on the destination machine?
3. Immediately, if the message is buffered on the source node?
The choice depends upon the system designer.

The receive abstraction
receive(mbox, buffer)
Wait until mbox has a message, then copy the message into the buffer. In this abstraction, send and receive are atomic:
- you never get a portion of a message (all or nothing) - so you need to ensure the buffer is of sufficient size;
- two receivers can't get the same message - there is local synchronisation on the mbox.
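Within one machine, the mailbox abstraction above maps naturally onto a bounded blocking queue. A minimal sketch (the class name `Mailbox` is ours; a real implementation would carry messages across the network):

```java
import java.util.concurrent.*;

public class Mailbox {
    // Bounded queue: send blocks when the mailbox is full, giving flow control.
    private final BlockingQueue<String> box = new ArrayBlockingQueue<>(10);

    public void send(String message) throws InterruptedException {
        box.put(message);        // blocks if the mailbox is full
    }

    public String receive() throws InterruptedException {
        return box.take();       // blocks until a message is available;
                                 // each message goes to exactly one receiver
    }

    public static void main(String[] args) throws Exception {
        Mailbox mbox = new Mailbox();
        new Thread(() -> {       // producer
            try { mbox.send("make coke"); } catch (InterruptedException ignored) {}
        }).start();
        System.out.println(mbox.receive()); // consumer: prints "make coke"
    }
}
```

Note how the queue gives both atomicity properties for free: take() delivers whole messages, and its internal locking ensures two receivers can never obtain the same one.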

2.5.2 Message Styles

- 1-way: messages flow in one direction (BSD Unix pipes)
- 2-way: request-response (Remote Procedure Call)

1-way example:
Producer:
int msg1[100]   // maximum message size 100 bytes
while(1) {
    prepare message   /* make coke */
    send(msg1, mbox)
}

Consumer:
int msg2[100]
while(1) {
    receive(msg2, mbox)
    process message   /* drink coke */
}


The producer/consumer doesn't worry about space in the mailbox - this is handled by send/receive forcing the process to block if there is no space.

Request-response example: read a file on a remote machine. Also known as client-server: client = requester, server = responder. The server provides a service (file storage) to the client.

Client: (requesting the file)
char response[1000];
send("read /etc/passwd", mbox1)
receive(response, mbox2)

Server:
char command[1000], answer[1000];
receive(command, mbox1)
decode command
read file into answer
send(answer, mbox2)

The server has to decode the command, just as the OS has to decode a message to find the mailbox. What if the file is too big for one response? Then use a big-message protocol (e.g. TCP).

2.5.3 Remote Procedure Call

Call a procedure on a remote machine. The client calls:
rpc_read("/tsunb/random.txt")
which is translated into a call on the server:
read("/tsunb/random.txt")

RPC implementation:
- request-response message passing
- a stub provides the glue on client and server

[Figure: the RPC call path — the client (caller) calls its stub, which bundles the arguments and sends a packet via the transport; the server's packet handler receives it, the server stub unbundles the arguments and calls the servant (callee); return values are bundled and travel back the same way.]
RPC pseudo-code
Client stub:
    build message
    send message
    wait for response
    unpack reply
    return result

Server stub:
    create N threads to wait for work to do
    while(1) {
        wait for command
        decode and unpack request parameters
        call procedure
        build reply with results
        send reply
    }

RPC and procedure call
In a normal procedure call:
1. arguments are pushed on the stack,
2. the name is converted to an address,
3. the return address is pushed on the stack,
4. results are returned either in a register or on the stack.
The RPC equivalents:

- Parameters: the request message
- Result: the reply message
- Name of procedure: passed in the request message
- Return address: the caller's mailbox

Implementation issues
Stub generator: generates stubs automatically. It uses the procedure signature (name, types of arguments and return values) and generates code on the client to pack and send the message, and code on the server to receive and unpack it.

How does the client know where to send the message?
Definition: Binding is linking a service to a location.
- static: fixed at compile time (C)
- dynamic: fixed at runtime (Lisp, RPC)
Dynamic binding in RPC is done via a name service, which provides the translation from service to mailbox. Runtime binding allows:
- Access control: check who is permitted to use the service
- Fail-over: if a server fails, use another

What if there are multiple servers? Can they use the same mailbox? Yes, as long as no state is carried forward from one call to the next - e.g. open, seek, read, close, where each call uses the context of the previous operation.
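The server stub's "name converted to address" step can be sketched with reflection: the procedure name arrives in the request message and is looked up among the servant's methods. The `Dispatcher` and `FileService` classes below are hypothetical, and a real stub would also unmarshal the arguments from a byte stream:

```java
import java.lang.reflect.Method;

public class Dispatcher {
    public static class FileService {               // hypothetical servant
        public String read(String path) { return "contents of " + path; }
    }

    // "decode and unpack request parameters; call procedure"
    public static Object dispatch(Object servant, String procName, Object[] args)
            throws Exception {
        for (Method m : servant.getClass().getMethods()) {
            if (m.getName().equals(procName)
                    && m.getParameterCount() == args.length) {
                return m.invoke(servant, args);     // call procedure; result becomes the reply
            }
        }
        throw new NoSuchMethodException(procName);  // unknown procedure name
    }

    public static void main(String[] args) throws Exception {
        Object reply = dispatch(new FileService(), "read",
                                new Object[]{"/tsunb/random.txt"});
        System.out.println(reply); // prints "contents of /tsunb/random.txt"
    }
}
```

This is essentially what the generic dispatcher in Java RMI does (see the next lecture), with marshalling and a TCP connection in front of it.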

2.5.4 Cross domain communication

If processes are in different address spaces, RPC can be used to communicate between processes on the same machine as well as on different machines.

Microkernel operating systems
- In the conventional monolithic structure, the OS has all services running in kernel space. This is difficult to design because of cross-dependence, and difficult to debug.
- Idea: use RPC to communicate between services. Split the kernel up into application servers running in separate address spaces. All requests to the operating system work via RPC mechanisms. The kernel then becomes a message passer.
- Examples include Mach and QNX.

[Figure: monolithic structure (file system, VM, windowing, threads, networking all inside the kernel) vs microkernel services (file system, windowing, threads run as user-space servers, communicating with applications via RPC over a minimal microkernel that provides address spaces and message passing).]

Advantages:
- Fault isolation: bugs are more isolated (firewalled)
- Enforces modularity
- Allows incremental upgrade of pieces of software
- Location transparency: a service can be remote or local, e.g. X

2.5.5 Problems with RPC

RPC provides location transparency except for:
- Failures: there are more failure modes in a distributed system than on a single machine, since machines or the network may crash. We need support for exceptions.
- Performance: cost of a local procedure call << same-machine RPC << network RPC. Programmers must be aware that RPC is cheap but not free. Caching helps, but makes failures more complex.
We will examine these problems more fully in the coming lectures.

2.5.6 Summary

Remote procedure call hides the message request-response exchanges between clients and servers behind programming semantics that resemble normal procedure calls.


Remote Procedure Call provides a good abstraction for programming distributed systems across address spaces and across machines... but it comes at a performance cost.

2.6 Distributed Objects: The Java Approach

Main Points:
- Why distributed objects
- Distributed object design points
- Java RMI
- Dynamic code loading

2.6.1 What's an Object?

An object is an autonomous entity having state manipulable only by a set of methods.

public interface BankAccount extends Remote {
    public void deposit(float amount) throws RemoteException;
    public void withdraw(float amount) throws RemoteException;
    public float balance() throws RemoteException;
}

2.6.2 Why Distributed Objects?

Distributed systems multiply complexity:
- multiple machines
- multiple people
- multiple organisations
There is a large amount of communication between system designers in producing distributed systems. The problem is how to manage complexity at design time.

Software engineering
Software design should produce well-engineered software which satisfies its requirements:
- Comprehensible, so that it is easy to maintain and modify, and easier to test
- Reusable: cheaper than rebuilding, and fewer bugs
Objects as a basis for distributed systems give you techniques to manage complexity:
- Abstraction hides unnecessary details, keeping the system comprehensible
- Encapsulation allows elements to be extracted: comprehensibility and reusability
- Concurrency control allows easy management of concurrent activities

2.6.3 How to build Distributed Object Systems

What are the various entities?
- Programmers using existing services
- Programs running on various machines offering services
- Packets using an RPC protocol to invoke methods in programs
How do we communicate between these things?

[Figure: programmers communicate through the Interface Definition Language; programs communicate through the RPC system; packets follow the concrete syntax and RPC protocols.]

[Figure: Java RMI architecture — the remote object binds a name in the rmiregistry; the client object looks up the name, fetches the interface and stub classes from a code repository (instantiated through a classloader), and calls methods on the remote stub; the stub makes remote calls over TCP to a generic dispatcher in the server JVM, which invokes the method on the remote object.]

2.6.4 Objects and RPC systems

There is no real distinction between distributed method invocation and RPC systems.
Pure object systems:
- Provide dynamic binding through a name service, possibly with migration and other features
- Protocol processing can be part of the OS, allowing asynchronous processing when appropriate
- Examples include Java Remote Method Invocation (RMI) and CORBA
Static RPC systems, such as Sun RPC:
- Binding of services to machines is done by the programmer
- Processing is synchronous, since protocol processing happens in the user thread

2.6.5 Java RMI

Java has RPC built in as Remote Method Invocation (RMI). There is no separate IDL - Java itself is used for interface definition.

Remote interfaces
- An interface in Java specifies a set of methods that an object implementing that interface will provide.
- Java RMI uses interfaces which extend java.rmi.Remote as a way of specifying which methods can be invoked remotely.

public interface Foo extends Remote {
    public void myRemoteMethod() throws RemoteException;
}

public class Bar extends RemoteObject implements Foo {
    ...
    public void myRemoteMethod() throws RemoteException {
        ...
    }
    ...
}

Remote objects and remote references
- To use a remote object, an object must acquire a remote reference. In Java, a remote reference looks just like a normal object reference.
- To provide the necessary communications code, a remote object must extend java.rmi.RemoteObject or one of its subclasses.
- Java will then provide remote references to the object when a reference is passed out of the local JVM, typically as a method result, or as a field in another result object.

Stub files and generic method dispatch
- To invoke a remote method, code acting as a proxy or stub for the remote object must run on the local machine. This code implements the appropriate interfaces, and marshals the required method and arguments before sending them as a byte stream on a TCP connection to the remote machine.
- At the remote end, a generic dispatcher uses reflection to determine which method is wanted, and calls the invoke method from the reflection package to call it. Results or exceptions are then returned to the caller.
- rmic is the tool that generates the stub file from the implementation of the remote object.

Java distributed garbage collection
- Garbage collection (GC) is the removal of objects when they are no longer needed.
- Single-address-space GC basically checks for references to objects. If no references are found, the object is removed.


- Distributed GC is complicated because the traffic needed to check all possible references is infeasible - references can be passed arbitrarily from machine to machine.
- Instead, a remote reference corresponds to a proxy in the local machine. The proxy informs the remote object that it holds a reference. When the proxy is GCed, it tells the remote object that it no longer holds a reference. When a remote object knows of zero proxies, it is a candidate for GC.

Parameter and result passing
- In local method invocation, object references are passed as arguments and results - call-by-reference.
- In remote method invocation, only objects which are accessible remotely can be passed by reference. Other objects must be passed by value - call-by-value - and instantiated as copies on the remote machine.
- Objects passed by value must be capable of being passed by value - i.e. they must support the java.io.Serializable interface. Serializable objects and their associated object graph can be flattened into a byte array. If an object implements Serializable and all of its references are Serializable, it can be passed by value.

Remote exceptions
- The number and probability of failure modes are far higher in distributed systems. The designers of RMI decided to make this explicit by forcing programmers to deal with a possible RemoteException in all remote invocations. Therefore all methods in a Remote interface must throw RemoteException.

RMIRegistry
- How do classes get the initial remote reference (bootstrap)?
- Remote objects bind themselves against a given textual name (e.g. myRemoteName) with the rmiregistry.
- Objects can then resolve the name remotely by querying the rmiregistry.

- The rmiregistry will return a remote reference and, hidden from the programmer, the location of the relevant server class files - the interface and the stub files.
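The bootstrap sequence can be sketched in one JVM: export a remote object, bind it in an rmiregistry under a textual name, look the name up and invoke through the remote reference. The `Greeter` interface and class names are our inventions; note that modern JDKs generate the stub dynamically at export time, so the rmic step described above is no longer needed in practice:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class RmiBootstrap {
    public interface Greeter extends Remote {             // the remote interface
        String greet(String who) throws RemoteException;
    }

    public static class GreeterImpl implements Greeter {  // the servant
        public String greet(String who) { return "hello " + who; }
    }

    public static String run() throws Exception {
        Registry registry = LocateRegistry.createRegistry(0); // in-process registry, anonymous port
        GreeterImpl impl = new GreeterImpl();
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(impl, 0);
        registry.bind("greeter", stub);                   // bind the textual name

        Greeter remote = (Greeter) registry.lookup("greeter"); // resolve the name
        String reply = remote.greet("world");             // remote call over TCP

        UnicastRemoteObject.unexportObject(impl, true);   // tidy up so the JVM can exit
        UnicastRemoteObject.unexportObject(registry, true);
        return reply;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "hello world"
    }
}
```

In a real deployment the registry, server and client run in separate JVMs, and the client locates the registry by host and port with LocateRegistry.getRegistry.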

Downloading of classes
- The layout of classes and the bytecodes implementing class methods are detailed in class files. Class files are loaded on demand as objects are created or static methods are invoked.
- Normal class loading goes through the default classloader, which searches the CLASSPATH. Additional classloaders can be used by programmers to load class files from more exotic places.

ClassLoaders
- RMI must allow the interface and stub files for remote objects to be downloaded over the network - it uses the RMIClassLoader.
- Code loaded from arbitrary places is a security risk. Java provides for a security policy to be defined for a classloader, so that all classes from that classloader can have their actions sandboxed. Typically these actions are network access, file access, screen access etc, and are specified in Java policy files.

Activation
- Using an active thread continuously for an object which is accessed infrequently may be a poor use of resources - consider machines with millions of objects.
- Instead, allow objects to change state from active to passive and vice versa. When active, they are normal remote objects. When passive, the object's state is stored in persistent storage, e.g. a file, and responsibility for accepting calls to that object is handed over to an activator.
- When the activator receives a call, it creates a new instance of the object and instantiates its state from the stored state.
- Compare the use of inetd to control typical Internet services such as ftp, telnet etc.
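The call-by-value mechanics described under parameter and result passing rest on Java serialisation: the object graph is flattened into a byte array and re-instantiated as a copy on the far side. A minimal sketch, with an in-memory byte array standing in for the network (the `Account` class is a hypothetical value object):

```java
import java.io.*;

public class ValuePassing {
    public static class Account implements Serializable {  // must be Serializable to pass by value
        String owner;
        float balance;
        Account(String owner, float balance) { this.owner = owner; this.balance = balance; }
    }

    public static Account copyByValue(Account a) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(a);                       // flatten the object graph
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Account) in.readObject();         // instantiate the copy
        }
    }

    public static void main(String[] args) throws Exception {
        Account original = new Account("alice", 10f);
        Account copy = copyByValue(original);
        // A distinct object with the same state -- mutations to the copy
        // would not be seen by the holder of the original.
        System.out.println(copy != original && copy.owner.equals("alice")); // prints "true"
    }
}
```

This is exactly why changes made to a by-value argument on the server are invisible to the client: the server works on a copy.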


2.6.6 Summary

We have described the key elements of Java RMI. Refer to these when using RMI to help in understanding some of the problems that occur. Other possible choices for distributed objects follow in the next lecture.

2.7 Enterprise computing and CORBA

Main Points:
- What are computers used for?
- Three-tier models
- CORBA

2.7.1 Computers in Business

The hard job in commercial computing is not writing word processors... but integrating the various business activities: selling, buying, managing, coordinating production.
- The key asset in business computing is the data that the business has built up over the years, generally held in a database management system.
- Business computing therefore has to integrate the front-end activities carried out by people with the data held in the databases. Changes in the data and requests from the front end must be interpreted according to the business logic.

""" # # # "#### """ "#### #"""#"# #"###" """" """# ####" "### """# "####" """# "####" #""""#"# """"" #### ##### """" ###" """# "####" """# "####" #""""# "#### """#" #"#### """"" ##### """"" ##### """" "#"#"
Business Objects
ORB ORB ORB ORB

A@A ( @ 989898989 8 ( ))() 767 6 545454545 4 && ''&' 23 3 2 101010101 0 $$ %%$%


Business Objects
ORB ORB ORB ORB

!  !          

View Objects

The view objects - Javascript, XML and friends

View Objects

44

2.7.2

These can be a mixture of xml, html, javascript and java in a web browser. Functions include views in the warehouse of the stock control system, business overviews for decision support systems, Point Of Sale views for the tills. The front end is the user interface, which will provide various views, depending on function.

Three Tier Models and the Web


Or they can be fully blown applications talking directly back to the server objects

Server objects and middleware
- The server objects encapsulate the business logic - how the business uses its data.
- This is subject to tweaking and change as the business evolves - i.e. a high-maintenance activity.
- Object, and particularly component, based approaches are most useful here.
- The CORBA object standards intend to provide business objects to support the implementation of business logic in a component framework.


Legacy systems and software wrappers
- Most companies have a substantial investment in their existing software... and this software already does its job.
- To integrate this legacy software into new systems requires software wrappers that can speak to the new systems.

2.7.3 CORBA

CORBA is a middleware design to allow application programs to communicate with each other irrespective of their programming language.
- Standardised by the Object Management Group (OMG), an industrial consortium of well over 100 companies.
- There are many different implementations, which can all interoperate.

The CORBA object model
[Figure: a client program invokes Servant A in a server via a proxy for A and the ORB core; requests and replies pass between the ORB cores; on the server side an object adapter dispatches through a skeleton (or dynamic invocation) to Servant A; an implementation repository and an interface repository support activation and dynamic invocation.]

- The client program calls a method in the remote servant A.
- The proxy for A marshals the arguments in invocation requests and unmarshals replies, and is generated from the IDL. It is available at compile time.
- The ORB core implements the standardised communication systems, and provides services for converting remote references to strings etc.
- The object adapter bridges between the ORB and the target language, providing dispatch for method invocations, remote references and activation control.
- Skeletons are in the target language and do the unmarshalling of arguments and marshalling of results.
- The servant for A does the actual work.
- The implementation repository activates registered servers on demand and locates running servers, using the object adapter name.
- The interface repository provides information about registered IDL interfaces, so as to enable dynamic invocation - cf. reflection in Java.

The Interface Definition Language - IDL
- Modules: CORBA IDL provides modules which function in a similar fashion to Java packages, providing scope control for names.
- Interfaces: similar to Java interfaces, providing collections of methods offered by an object. IDL supports inheritance. Attributes can be declared in IDL - the compiler generates accessor methods (get/set) automatically.
- Methods: method descriptions are similar to Java, except that parameters are tagged: in (value passed into the method), out (value returned from the method, in addition to the return value of the call), inout (value passed in and returned).
- Exceptions: methods can raise exceptions, which may return values.
- Primitive types: such as byte, short, int, long, char, double.
- Constructed types: sequence (variable-length list), array, struct (collection of values of different types), enumerated (set of named integer values), union (choice of values, depending on a discriminator tag of enumerated type).
- Remote object references: objects are not passed by value. Instead, references to objects can be returned and passed around. These references are interpreted by the local communications module (the Object Request Broker, ORB).

IDL example
// From file Person.idl
struct Person {
    string name;
    string place;
    long year;
};

interface PersonList {
    readonly attribute string listname;
    void addPerson(in Person p);
    void getPerson(in string name, out Person p);
    long number();
};

Language bindings
In normal systems languages - C, C++, Java - the bindings are quite straightforward, as long as the programmer is aware of the semantics of the IDL. But some of the semantics need special support. Consider:
void getPerson(in string name, out Person p);
The Java equivalent is:
void getPerson(String name, PersonHolder p);
where PersonHolder contains an instance of the returned out value of the Person argument.

CORBA services
CORBA includes specifications for a number of services:
- Naming Service: uses a naming context to look up a name. Naming contexts can be linked, so that a name actually points to another naming context, allowing hierarchical names.
- Event and Notification Service: provides for event management, including pattern matching for notification.
- Security Service: provides for management of authentication and access control, audit trails etc.
- Trading Service: allows services to be located by description, rather than name.


- Transaction and Concurrency Control Service: implements transactional mechanisms to provide concurrency control and ACID semantics for operations.
- Persistent Object Service: allows objects to be stored in passive form when not required and activated on demand.

Transport issues - IIOP
- There is a standard protocol for use between ORBs - the General Inter-ORB Protocol, GIOP. The implementation almost universally used is the Internet Inter-ORB Protocol, IIOP.
- GIOP specifies the concrete syntax for data placed into the byte stream, the Common Data Representation (CDR). IIOP specifies the layout of messages, and the standard layout for remote object references.

2.7.4 Web Services - Business to Business

HTML/JavaScript and friends are good for rendering interfaces to people... but how should business computers talk to each other?
- CORBA has never bridged the business-to-business gap, for various reasons.
- Can XML-based solutions bridge the gap?

SOAP, WSDL and XML

- Web Services are business-to-business RPC systems based on XML and HTTP.
- Messages are described using an XML variant called SOAP. The concrete syntax is thus ASCII tags and content.
- The interface definition language is the Web Services Description Language (WSDL). The IDL is thus (very verbose) XML.
- More detailed notes can be found in Vladimiro's Internet Technologies notes.


2.7.5 Summary

- The majority of programmers in the commercial world create bespoke solutions to manage business processes.
- Three-tier models of business solutions provide flexibility yet control the complexity of the business logic.
- CORBA provides a middleware framework in which to implement the business logic.

2.8 Computer Security: Why you should never trust a computer system

Goal: prevent misuse of computers.
- Definitions
- Authentication
- Private- and public-key encryption
- Access control and capabilities
- Enforcement of security policies
- Examples of security problems

2.8.1 Definitions

Types of misuse:
- Accidental
- Intentional
Protection is to prevent either accidental or intentional misuse. Security is to prevent intentional misuse.
There are three pieces to security:
- Authentication: who the user is
- Authorisation: who is allowed to do what
- Enforcement: ensuring that people only do what they are allowed to do
A loophole in any of these can cause problems, e.g.:
1. Log in as super-user
2. Log in as anyone, do anything
3. Can you trust the software making decisions about 1 and 2?


2.8.2 Authentication

Common approach: passwords, a shared secret between the two parties. Since only I know my password, the machine can assume it is me.
Problem 1: the system must keep a copy of the secret to check the password against. What if a malicious user gains access to this list of passwords?

Encryption
A transformation on data that is difficult to reverse - in particular, secure digest functions.

Secure digest functions
A secure digest function h = H(M) has the following properties:
1. Given M, it is easy to compute h.
2. Given h, it is hard to compute M.
3. Given M, it is hard to compute M' such that H(M') = H(M).

For example, the Unix /etc/passwd file: password → one-way transform → encrypted password. The system stores only the encrypted version, so it is OK if someone reads the file. When you type in your password, the system encrypts it and compares against the stored encrypted version. Over the years, password protection has evolved from DES (using a well-known string as the input data and the password as the key), through MD5, to SHA-1.

Passwords as human factors
Problem: passwords must be long and obscure. Paradox: short passwords are easy to crack, but long ones people write down. Improving technology means we have to use ever longer passwords. Consider that Unix initially required only lowercase, 5-letter passwords. How long for an exhaustive search? 26^5 ≈ 10,000,000 attempts.
- In 1975, at 10 ms to check a password: about 1 day.
- In 2003, at 0.00001 ms to check a password: about 0.1 seconds.
Most people choose even simpler passwords, such as English words - it takes even less time to check all the words in a dictionary.

Some solutions:
1. Extend everyone's password with a unique number (a salt, stored in the passwd file), so an attacker can't crack multiple passwords at a time.

2. Require more complex passwords, e.g. 7 characters drawn from lower case, upper case, digits and specials: 70^7 is roughly 8,000 billion combinations, or about 1 day. But people pick common patterns, e.g. 6 lower-case letters plus a number.

3. Make it take a long time to check each password. For example, delay every login attempt by 1 second.

4. Assign long passwords. Give everyone a calculator or smart card to carry around to remember password, with PIN to activate. Need physical theft to steal card.

Problem 3: can you trust the encryption algorithm? Recent example: techniques thought to be safe, such as DES, may have back doors (intentional or unintentional). If there is a back door, you don't need to do a complete exhaustive search.

2.8.3 Authentication in distributed systems: Private Key Encryption

In distributed systems, the network between the machine on which the password is typed and the machine the password is authenticating on is accessible to everyone. Two roles for encryption:

1. Authentication - do we share the same secret?

2. Secrecy - I don't want anyone to know this data (e.g. medical records)

Use an encryption algorithm that can easily be reversed given the correct key, but is difficult to reverse without the key.


(Diagram: the sender ("spy") encrypts plaintext with a password on the secure side; the ciphertext crosses the insecure network; the receiver ("MI6") decrypts the ciphertext with the same password to recover the plaintext.)

From the ciphertext, you can't decode the message without the password. From the plaintext and ciphertext together, you can't derive the password. As long as the password stays secret, we get both secrecy and authentication.

Symmetric Encryption
Symmetric ciphers use exclusive-ors, addition, multiplication, shifts and transpositions, all of which are fast on modern processors.
DES - The Data Encryption Standard (DES) was released in 1977. Its 56-bit key is too weak for most uses now; instead 3DES (triple DES) is used, which has a 168-bit key.
IDEA - The International Data Encryption Algorithm has 128-bit keys, and has proven strong against a large body of analysis.
AES - The US-based NIST has defined a new encryption standard based on Rijndael, which offers 128, 192 or 256 bit keys.

Authentication Server - Kerberos
Operation: the server keeps a list of passwords, and provides a way for two parties, A and B, to talk to each other, as long as they both trust the server.

Notation: Kxy is a key for talking between x and y. (..)Kxy means encrypt message (..) with key Kxy.
Simplistic overview:
A asks the server for a key: A -> S: (Hi! I'd like a key for A,B)
The server gives back a fresh session key encrypted in A's key, along with a ticket to give to B: S -> A: (Use Kab, (This is A! Use Kab)Ksb)Ksa
A gives B the ticket: A -> B: (This is A! Use Kab)Ksb
Details: add timestamps to limit how long each key exists; use single-use authenticators so that clients are sure of server identities and a machine cannot replay messages later. Encrypted checksums must also be included to prevent a malicious user inserting garbage into messages.
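The message flow above can be run end to end with a toy sketch. The "encryption" here is a keystream XOR derived with SHA-256 - not real cryptography, just a reversible stand-in for DES/AES so the ticket structure is concrete; all keys and strings are invented:

```python
import hashlib

def keystream(key, n):
    # Derive n pseudo-random bytes from the key (toy construction).
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key, message):
    return bytes(a ^ b for a, b in zip(message, keystream(key, len(message))))

decrypt = encrypt  # XOR with the same keystream is its own inverse

# The server S shares a long-term key with each principal.
Ksa = b"long-term key of A"
Ksb = b"long-term key of B"

def server_issue_key():
    # S invents a session key Kab and returns (Kab, ticket-for-B)Ksa.
    Kab = hashlib.sha256(b"fresh session key").digest()[:16]
    ticket = encrypt(Ksb, b"This is A! Use " + Kab)   # only B can open
    return encrypt(Ksa, Kab + ticket)                 # only A can open

# A decrypts the reply, keeps Kab, and forwards the ticket to B.
reply = server_issue_key()
plain = decrypt(Ksa, reply)
Kab, ticket = plain[:16], plain[16:]
note_for_b = decrypt(Ksb, ticket)
```

B decrypts the ticket with Ksb, learns Kab, and both parties can now talk under Kab without the server.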

2.8.4 Public Key Encryption

Public key encryption is a much slower alternative to private (shared) key encryption; it separates authentication from secrecy.
Each key is a pair K, K^-1.
With a private key: ((text)K)K = text.
With a public key: ((text)K)K^-1 = text, but ((text)K)K != text; ((text)K^-1)K = text, but ((text)K^-1)K^-1 != text.
You can't derive K from K^-1, or vice versa.

Public key directory
Idea: K is kept secret, K^-1 is made public, e.g. in a public directory.
For example:
(I'm Ian)K - everyone can read it, but only I could have sent it (authentication).
(Hi!)K^-1 - anyone can send it, but only I can read it (secrecy).
((I'm Ian)K Hi!)K^-1 - at first glance, only I can send it, and only you can read it. What's wrong with this assumption?
Problem: how do you trust the directory of public keys? Maybe somebody lied to you in giving you a key.
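The K / K^-1 relationship can be made concrete with toy textbook RSA. These primes are far too small to be secure; the point is only that the two exponents undo each other and neither reveals the other (the variable names are mine):

```python
# Classic textbook parameters: n = 61 * 53 = 3233, e = 17.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # published exponent (the public K^-1)
d = pow(e, -1, phi)        # secret exponent (the private K); Python 3.8+

def apply_key(block, exponent):
    # Encryption and decryption are the same modular exponentiation.
    return pow(block, exponent, n)

m = 65
secret = apply_key(m, e)   # anyone can produce it; only d recovers m
signed = apply_key(m, d)   # only the key holder can produce it
```

apply_key(secret, d) recovers m (secrecy), and apply_key(signed, e) recovers m (authentication), matching the (text)K / (text)K^-1 notation above.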


2.8.5 Secure Socket Layer

Provides a technique for data sent over a TCP connection to be encrypted. Uses public-key technology to agree upon a key, then 3DES or similar to encrypt the session. Data is encrypted in blocks, optionally with compression first. Used in http as https, and for telnet, ftp etc.

SSL handshake protocol

(The handshake, client <-> server:
1. ClientHello / ServerHello - establish protocol version, session ID and cipher suite, and exchange random values.
2. ServerCertificate, CertificateRequest - the server sends its certificate and optionally requests a client certificate.
3. (PreMasterKey)Kserver - the client sends a premaster key, encrypted in the server's public key, from which the server generates the session key.
4. ClientCertificate - the client sends its certificate if requested.
5. ClientDone / ServerDone - all messages are now encrypted in the session key.)

2.8.6 Authorisation

Who can do what...
Access control matrix: a formalisation of all the permissions in the system, with one row per user and one column per object:

          file1   file2   file3   ...
user A    rw      r
user B    rw
user C                    r
...

For example, one box represents "C can read file3". There is a potentially huge number of users and objects, so it is impractical to store the whole matrix.
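Because the matrix is sparse, a system stores only the non-empty cells, e.g. as a dictionary per object. A minimal sketch, with users, files and rights invented for illustration:

```python
# Sparse access-control matrix: only the non-empty cells are stored,
# keyed by object -- i.e. an access-control list per object.
acl = {
    "file1": {"A": {"read", "write"}, "B": {"read", "write"}},
    "file2": {"A": {"read"}},
    "file3": {"C": {"read"}},
}

def allowed(user, obj, op):
    # Default deny: anything not explicitly granted is refused.
    return op in acl.get(obj, {}).get(user, set())
```

A capability list would store the same information the other way round: per user (or per process), a set of tickets naming the objects it may operate on.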

Approaches to authorisation
1. Access control list - store the permissions for all users with each object. There might still be a large number of users: Unix addresses this by having rwx permissions for user, group and world. More recent systems provide a way of specifying groups of users and permissions for each group.
2. Capability list - each process stores tickets for the objects it can operate on.

Digital Rights Management
Digital content is produced by people earning a living, and they wish to protect their investment. Digital Rights Management is the catch-all term for the technologies controlling the use of digital content. Two main approaches:
Containment - the content is wrapped in an encrypted shell, so that users have to prove their capability of using the content through knowledge of the key.
Watermarking - the content is marked so that devices know that the data is protected.
The problem is how to enforce the checking of capabilities and enforce the no-circumvention requirement.

2.8.7 Enforcement

The enforcer is the program that checks passwords, access control lists, etc. Any bug in the enforcer is a way for a malicious user to gain the ability to do anything. In Unix, the superuser has all the powers of the Unix kernel and can do anything. Because of coarse-grained access control, lots of software has to run as superuser to work; if there is a bug in any such program, you're in trouble.
Paradox:
1. Make the enforcer as small as possible - easier to get correct, but a simple-minded protection model (more programs run with high privilege).
2. Fancy protection - only minimal privilege is granted, but it is hard to get right.


2.8.8 Trusted Computing Platform

Question: how do we ensure no software transgresses digital rights? Answer: by only allowing approved software to have access to the data. The basic computer and operating system must be running validated software. There must be an untamperable part of the computer/OS that can hold keys and hand them over only to validated software. The untamperable part of the computer is the "Fritz" chip, a co-processor which holds a unique certificate that it is running validated code.

Startup Sequence
(Diagram: the Fritz co-processor sits with the processor, ROM and memory on the main bus, which connects through an IO bus interface to devices such as the Ethernet adapter.)

1. Fritz checks the boot ROM has been signed, executes it, and validates the state of the machine.
2. Fritz checks the first section of the OS has been signed, loads and executes it, and validates the state of the machine.
3. As hardware comes up, the serial number of each device is checked for validity, the associated driver is checked for a signature, and the state is validated.
4. If new hardware is added that is not valid, then the machine must be re-validated.
5. Once the hardware is fully up, control is handed over to the validated OS.

Trusted Operating Systems
The initial design came from mobile-code thinking on how to protect programmable network components. Digital content can require that only validated software can be used to play it, and that it can only be played on a specific machine.

The Fritz chip uses the key it holds and, with the inner kernel, can unpack data, checking that it is being passed to the certified software and is still on the correct machine. Palladium(2) is Microsoft's name for their secure OS. Ethical issues? Many... Economic issues? Many...

2.8.9 Firewalls

TCP/IP has port numbers which indicate the services that packets should go to. Firewalls inspect each packet going in and out of the network, applying pattern matches on port numbers (and other fields). Firewalls can prevent packets entering or leaving networks for some services. Filters can be triggered and stateful, e.g. if a telnet connection is ongoing to a host, then allow X connections from that host to the telnet source, otherwise disallow X connections. Filtering is generally done at static routers installed at the ingress points of networks, although it can be more distributed, e.g. through tunnelling.
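A stateless filter of the kind described can be sketched as an ordered rule list with first-match-wins semantics and a default deny (the rules and field names here are invented for illustration):

```python
# Each rule: (action, direction, protocol, destination port or None for any).
RULES = [
    ("allow", "in",  "tcp", 80),    # web server reachable from outside
    ("allow", "out", "tcp", None),  # any outbound TCP connection
    ("deny",  "in",  "tcp", 23),    # no incoming telnet
]

def filter_packet(direction, proto, dport):
    for action, r_dir, r_proto, r_port in RULES:
        if r_dir == direction and r_proto == proto and r_port in (None, dport):
            return action            # first matching rule decides
    return "deny"                    # default: drop anything unmatched
```

A stateful firewall would additionally consult a table of ongoing connections before applying these static rules.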

2.8.10 Classes of security problems

Abuse of privilege - if the superuser is evil, then we're all in trouble; nothing can be done.
Imposter - break into the system by pretending to be someone else. For example, in Unix you can set up a .rhosts file to allow logins from one machine to another without having to retype a password. This also allows rsh - a command to do an operation on a remote node. The combination means: send an rsh request, pretending to be from the trusted user, to install a .rhosts file granting the imposter full access. Similarly, with an open X windows connection over the network, an attacker can send messages appearing to be key strokes from a window, but which really are commands to allow the imposter access. Without encryption, there is no way to stop this.
Trojan Horse - the Greeks presented Troy with the gift of a wooden horse, but an army was hidden inside. A Trojan Horse appears helpful, but really does something harmful.
(2) Next Generation Secure Computing Base


Salami Attack - Richard Pryor in Superman 3. The idea is to build up a little bit at a time. What happens to the partial pennies from interest on bank and mortgage accounts? The bank keeps them. Re-program the system so that the partial pennies go to the programmer's account: millions of customers adds up quickly. See the Internet Worm later...

Eavesdropping - a listener can tap into the serial line on the back of a terminal, or onto the Ethernet, and see everything typed in, as almost everything goes over the network unencrypted. With telnet, the password goes over the network unencrypted. How can these be prevented? It is hard to build a system that is both useful and prevents misuse.

Tenex - a popular operating system at universities in the early seventies, before Unix. It was thought to be secure. To demonstrate this, a team was created to find holes. Given the source and documentation (the authors wanted the source to be given away, as Unix's was) and a normal account, within 48 hours they had every password on the system.
The code for the password check was:

    for(i = 0; i < 8; i++)
        if(userPasswd[i] != realPasswd[i])
            goto error;

It appears that you'd have to try all 256^8 combinations. But Tenex used virtual memory, and it interacts badly with the code above. Key idea: force a page break at an inopportune time; the difference in timing reveals how far the check has proceeded. Arrange the guess so that its first character is the last character in a page, with the rest on the next page. Arrange that the page with the first character is in memory and the page with the remainder is on disk, e.g. by referencing lots of other pages and then referencing the first page: a|bcdefgh, where a is in memory and bcdefgh is on disk. By timing how long the password check takes, you can figure out whether the first character is correct: if fast, the first character is wrong; if slow, the first character is correct - there was a page fault, and some later character is incorrect. Try all first characters till one is slow. Then put the first two characters in memory with the remainder on disk, and try all second characters till one is slow. On average this takes 128 x 8 tries to crack a password.
The fix is easy: don't stop until you have looked at all the characters. But how would this be known in advance?
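The Tenex attack can be simulated directly. In this sketch the early-exit comparison leaks how many leading characters matched; the "work" counter stands in for the page-fault timing channel, and the secret and alphabet are invented:

```python
SECRET = "wombat"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def leaky_check(guess):
    # Early-exit comparison, as in the Tenex code: the amount of
    # work done before failing reveals the mismatch position.
    work = 0
    for g, s in zip(guess.ljust(len(SECRET)), SECRET):
        if g != s:
            return False, work       # failed after `work` correct chars
        work += 1
    return True, work + 1            # a full match takes longest of all

def crack():
    known = ""
    while len(known) < len(SECRET):
        # "Try all first characters till one is slow": keep the letter
        # whose check does the most work, then move to the next position.
        known += max(ALPHABET, key=lambda c: leaky_check(known + c)[1])
    return known
```

crack() needs about 26 tries per position rather than 26^6 overall, which is the essence of the 128 x 8 figure above.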

Internet Worm - in 1988 a worm infected thousands of machines over the Internet. Three attacks:
Dictionary lookup on the passwd file.
Sendmail - debug mode, if configured wrongly, will execute commands as root.
fingerd - e.g. finger ianw@csrp.crn. Fingerd didn't check the length of the string, and only allocated a fixed-size array on the stack. By passing a carefully crafted, really long string, the attacker could overwrite the stack and get the program to call arbitrary code.
It got caught because the idea was to launch attacks onto other machines from whatever systems were broken into. It ended up cutting the salami too thickly: machines were attacked multiple times, dragging the CPU down so much that people noticed.
Variant attack: the kernel checks the arguments to a call before using them. In a multithreaded program, a non-calling thread can be co-ordinated to change the arguments after the kernel check.

Thompson's self-replicating program
Bury Trojan Horses in binaries, so there is no evidence in the source. It replicates to every Unix system in the world, and even onto new Unixes on different platforms, with no visible sign. It would give Ken Thompson, one of the original Unix designers, the ability to log on to any system in the world.
1. Make it possible (easy)
2. Hide it (difficult)

Step 1: Modify login.c:

    A: if(name == "ken")
           don't check password, log in as root

The idea is then to hide the change so no one can see it.

Step 2: Modify the compiler. Instead of putting the code in login.c, have code in the compiler:

    B: if(see trigger)
           insert A into input stream

Whenever the compiler sees a trigger (/*gobbledygook*/), it puts A into the source stream passing through the compiler. Now we don't need A in login.c, just the trigger. We still need to get rid of the evidence in the compiler source.

Step 3: Modify the compiler to contain:

    C: if(see trigger2)
           insert B and C into input stream

Note that this can now be self-replicating code.
Step 4: Compile the compiler with C present, so C is now in the binary for the compiler.
Step 5: Replace the code for C in the compiler source with trigger2.
Result: everything is in the binary. As long as trigger2 remains in the compiler source, it will replicate C, with B, with A, into every compiler built. Every time login.c is recompiled, the backdoor is created. Every time the compiler is recompiled, the compiler re-inserts the backdoor inserter. When porting to a new machine, the compiler is used to generate the new cross-compiler, and then the new compiler from source code written in C.

2.8.11 Lessons

1. It is hard to re-secure a system after penetration. How do you remove the backdoor - remove the triggers? But what if there is another trigger in the editor? If it observes the trigger being removed, it re-inserts the trigger on saving the file.
2. It is hard to detect when a system has been penetrated: it is easy to make the system forget.
3. Any system has loopholes, and every system has bugs.
4. The more complex the system, the more bugs - KISS.

2.9 Names and Naming Services

2.9.1 Main Points

Use of names
Structure of names
Name Services
Domain Name Service (DNS) - The Internet Name Service

Definitions:
Name - what it's called
Address - where it is
Route - how to get there

2.9.2 Why names?

Object Identification - a service or resource we want to use, e.g. a filename, a telecommunications provider, a person.
Allow Sharing - communicating processes can pass names and thus share resources.
Location Independence - if we separate the name from the address, we can migrate the object transparently.
Security - if there is a large number of possible names, knowing the name of an object means that it must explicitly have been passed. If the entire system is constructed so that names are passed with authorisation, then knowing a name implies a chain of trust allowing access to the object.

2.9.3 What does one do with names?

Use as arguments to functions, e.g. to call a service.
Create names: if an object comes into creation, it must be named. The name should meet the rules of the system, e.g. be unique; therefore the naming authority (possibly the object itself) must keep track of what names it has allocated.
Delete names: when an object disappears from the system, at some point we may wish to delete the name and allow its re-use.

2.9.4 What's a name?

Part of the design space for any distributed system is how to construct the name. The choice of design has ramifications for the design of the rest of the system and for the performance of the service.

Unique Identifier

Unique Identifier (UID), aka flat names or primitive names. No internal structure, just a string of bits. The only use is comparison against other UIDs, e.g. for lookup in a table containing information on the named object. UIDs provide location independence and uniformity. But:
It is difficult to name different versions of the same object, e.g. Distributed Systems edition 1 vs Distributed Systems edition 2. Real objects have relationships to each other, which are useful to reflect in naming practice.
It is difficult to discover an address from a name - you need to search the entire name space.

2.9.5 Partitioned names

Add partition structure to the name to enable some functions to be more efficient, typically location and name allocation.

(Diagram: the DNS naming tree - top-level domains such as com, edu and uk; ac and co beneath uk; and susx, ucl, cogs further down.)

Domain Name Service (DNS) name: unix.rn.informatics.scitech.sussex.ac.uk. Each part of the name comes from a flat name space. The partitioning divides the name space, and the possible objects, into smaller spaces: uk reduces the objects to those within the UK, ac to those administered by academic networks, and so on. When allocating names, simply add to the lowest part of the partition: there is low risk of collision, since there is only a small number of other objects, all administered by the same authority. When looking up a name, you can guess where the information will reside.

2.9.6 Descriptive names

It is necessary to have a unique name, but useful to name objects in different ways, such as by the service they offer, e.g. Postman Pat, John of Gwent, www.informatics.sussex.ac.uk. A name can be created using attributes of the object. Note that objects can therefore have multiple names, not all of which are unique. The choice of name structure depends on the system. DNS chooses to partition according to the administration of creation, with aliases to allow naming by service, e.g. ftp.informatics.sussex.ac.uk is also doc-sun.rn.informatics.scitech.sussex.ac.uk.

2.9.7 Object Location from name - broadcast

Ask all possible objects within the system if they respond to that name. This can be efficient if the network supports broadcast, e.g. Ethernet and other LANs. It is equivalent to distributing the name table across all objects, with each object storing the names referring to it. We only want positive responses - responses from everyone would generate a lot of traffic. There is a scaling problem when we move into the wide area:
A greater number of hosts implies a greater probability of failure.
Broadcasts consume a higher proportion of bandwidth, made worse by the higher failure rate needing more location requests.
Broadcast is therefore used only in small LAN-based systems (and in the initial location of a directory).


2.9.8 Location through database

Keep information about the object in a table, indexed by name; location is just another piece of information. In DNS, the table is stored as {name, attribute} pairings, in resource records. If the database is centralised, then:
The whole system fails if the database machine fails.
The database machine acts as a bottleneck for the performance of the system as a whole.
In wide-area systems, authority should be shared amongst the controlling organisations.
So name information is usually distributed.

2.9.9 Distributed Name Servers

Parts of the name table are distributed to different servers, e.g. in the Domain Name Service, servers are allocated portions of the name space below certain domains, such as the root, ac.uk, susx.ac.uk. Names can be partitioned between servers based on:
algorithmic clustering - apply a well-known hash function to the name to map it to a server. This may result in the server being remote from the object; it is the only technique available for UIDs.
structural clustering - if the name is structured, use the structure to assign names to particular servers, as in DNS.
attribute clustering - if names are constructed using attributes, then servers take responsibility for certain attributes.

2.9.10 Availability and performance

If a single server is responsible for a name space, there is still a single point of failure, and a performance bottleneck. Most systems therefore:
Replicate the name list to other servers. This also increases performance for heavily accessed parts of the name space, e.g. secondary servers in DNS.
Cache information received by lookup: there is no need to repeat a lookup if asked for the same information again. This increases performance, and is implemented both client side and server side (in recursive calls) in DNS.
If information is cached, how do we know when it is invalid? We may attempt to use inconsistent information.


2.9.11 Maintaining consistency for distributed name services

The problem is alleviated by the following features of some distributed systems:
In most systems objects change slowly, so names live for a long time and are created infrequently.
If the address of an object is wrong, it causes an error. The user of the address can recover if it assumes that one of the possible problems is inconsistent information.
Obsolete information can be fixed by the addressed object leaving redirection pointers - equivalent to leaving a forwarding address to a new home.
However, there are always systems which break these assumptions, e.g. a highly dynamic distributed object system, creating lots of objects and names and deleting lots of names.

2.9.12 Client and Name server interaction

Interaction is hidden behind an RPC interface. The client knows of a server to ask (either installed in a file, or found through broadcast location). The client calls with arguments of the name and the required attribute, e.g. the address; in DNS the arguments are the name and the type of the requested attribute. The server returns a result with either the required attribute or an error message.

Lookup Modes
If the name is not stored on this server, it may be stored on another. Two options:
Recursive - the server asks another possible server about the name and attribute, which may then have to ask another server, and so on.

(Diagram: the client sends its request to name server 1, which forwards it to name server 2, which forwards it to name server 3; the answer returns along the same chain.)

Iterative - the server returns the address of another possible server to the client, who then resends the request to the new server.


(Diagram: the client asks name server 1 and is told "try ns2"; it asks name server 2 and is told "try ns3"; it asks name server 3 and receives the answer.)
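Iterative lookup can be sketched over a toy name-server graph: each server either answers authoritatively for names in its zone or refers the client onward. All server names, zones and the address below are invented for illustration:

```python
# Each server knows either answers for its zone or referrals by suffix.
SERVERS = {
    "ns-root": {"refer": {"uk.": "ns-uk"}},
    "ns-uk":   {"refer": {"ac.uk.": "ns-acuk"}},
    "ns-acuk": {"answer": {"susx.ac.uk.": "139.184.1.1"}},
}

def iterative_lookup(name, server="ns-root"):
    for _ in range(len(SERVERS)):            # bounded: no referral loops
        zone = SERVERS[server]
        if name in zone.get("answer", {}):
            return zone["answer"][name]      # authoritative answer
        for suffix, next_server in zone.get("refer", {}).items():
            if name.endswith(suffix):
                server = next_server         # "try ns2": client re-asks
                break
        else:
            return None                      # no referral applies
    return None
```

A recursive lookup would move the loop into the servers themselves: each server would call the next one and relay the answer back, so the client sends only a single request.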

2.9.13 Summary

Names can be flat, or they can be structured.
Centralised name servers suffer from poor availability; distributed name servers suffer from inconsistency.
Interaction with a name server is best modelled by RPC.

2.10 Distributed File Systems

2.10.1 Main Points

A Distributed File System provides transparent access to files stored on a remote disk.
Themes:
Failures - what happens when the server crashes but the client doesn't, or vice versa?
Performance
Caching - use caching at both the clients and the server to improve performance.
Cache Coherence - how do we make sure each client sees the most up-to-date copy of a file?


2.10.2 Client Implementation

A request for an operation on a file goes to the OS. The OS recognises that the file is remote and constructs an RPC to the remote file server. The server receives the call, does the operation and returns the results. Subtrees in the directory structure generally map to file systems. This provides access transparency.

(Diagram: a client directory tree with subtrees such as local and bin mounted from remote file servers rsuna and tsunb.)

2.10.3 No Caching

Simple approach: use RPC to forward each file system request to the remote server (as in older versions of Novell Netware). Example operations: open, seek, read, write, close. The server implements operations as it would for a local request and passes the result back to the client.


Advantages and disadvantages of an uncached remote file service
Advantage: the server provides a consistent view of the file system to all clients (e.g. both A and B).
Disadvantage: performance can be lousy:
Going over the network is slower than going through memory.
Lots of network traffic - congestion.
The server can be a bottleneck - what if there are lots of clients?

2.10.4 NFS - Sun Network File System

Main idea: use caching to reduce the network load. Cache file blocks, file headers etc. in both the client's and the server's memory. More recent NFS implementations use a disk cache at the client as well.

(Diagram: clients cache copies of block X; reads are served from the local cache, while writes pass through to the server's copy of X.)

Advantage: if open/read/write/close can be done locally, there is no network traffic.
Issues: failure and cache consistency.

Failures
What if the server crashes? Can the client wait until the server comes back up and continue as before?
1. Any data in server memory but not yet on disk can be lost.
2. If there is shared state across RPCs, e.g. open, seek, read: what if the server crashes after the seek? Then when the client does the read, it will fail.


3. Message retries: suppose the server crashes after it does "rm foo" but before it sends the acknowledgement. The message system will retry and send the request again. How does the system know not to delete the file again (someone else may have created a new foo)? Transactions could be used, but NFS takes a more ad hoc approach.
What if the client crashes?
1. It might lose modified data in the client cache.

NFS Protocol
1. Write-through caching: when a file is modified, all modified blocks are sent immediately to the server disk. To the client, the write doesn't return until all bytes are stored on disk.
2. Stateless protocol: the server keeps no state about clients, except as hints to improve performance. Each read request gives enough information to do the entire operation: ReadAt(inumber, position), not Read(openFile). When the server crashes and restarts, it can start processing requests immediately, as if nothing had happened.
3. Operations are idempotent: all requests are ok to repeat (all requests are done at least once). So if the server crashes between the disk I/O and the message send, the client can resend the message and the server does the request over again. Reads and writes of file blocks are easy - just re-read or re-write the block, so there are no side-effects. What about remove? NFS just ignores the problem: it does the remove twice and returns an error if the file is not found; an application breaks if it inadvertently removes some other file.
4. Failures are transparent to the client system. Is this a good idea? What should happen if the server crashes while an application is in the middle of reading a file? The options are:
(a) Hang until the server returns (next week...).
(b) Return an error? But the networked file service is meant to be transparent: the application doesn't know the network is there, and many Unix applications ignore errors and crash if there is a problem.
NFS has both options - the administrator can select which one. Usually hang, and return an error only if absolutely necessary.
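The stateless, idempotent style of points 2 and 3 can be sketched as follows, with a plain dict standing in for the server's disk (the inumber and contents are invented):

```python
DISK = {7: b"hello distributed world"}   # inumber -> file contents

def read_at(inumber, position, count):
    # Every request carries the file identity and offset -- no open-file
    # state on the server -- so a retried request after a crash or
    # timeout returns exactly the same bytes.
    return DISK[inumber][position:position + count]

def write_at(inumber, position, data):
    # Idempotent: repeating the same write leaves the file unchanged.
    old = DISK[inumber]
    DISK[inumber] = old[:position] + data + old[position + len(data):]
```

Contrast this with a stateful Read(openFile): after a server restart the open-file handle would be meaningless, and the client's next call would fail.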

Cache consistency

What if multiple clients share the same files? It is easy if they are both reading - each gets a local copy in its cache. What if one is writing? How do updates happen? Note that NFS has a write-through cache policy: if one client modifies a file, it writes through to the server. How does the other client find out about the change?

NFS and weak consistency
In NFS, the client polls the server periodically to check if the file has changed: it polls if the data hasn't been checked in the last 3-30 seconds (the exact time is a tunable parameter).

(Diagram: at t=0 both clients cache X, matching X on the server's disk; at t=30 a client asks the server "is X still ok?")

When the file is changed on one client, the server is notified, but other clients use the old version of the file until the timeout. They then check the server and get the new version. What if multiple clients write to the same file? In NFS you get either version, or a mixed version - completely arbitrary!

Sequential ordering constraints
What should happen if one CPU changes a file, and before it completes, another CPU reads the file? We want isolation between operations, so the read should get the old file if it completes before the write starts, and the new version if it starts after the write completes. Either all new or all old, not any other way - cf. serialisability.

NFS Pros and Cons
It's simple.
It's highly portable.
But it's sometimes inconsistent.
But it doesn't scale to large numbers of clients.
Note that this describes NFS version 3.


2.10.5 Andrew File System

AFS (CMU, late 80s) -> DCE DFS (commercial product).
1. Callbacks: the server records who has copies of a file.
2. Write-through on close: if a file changes, the server is updated (on close), and the server then immediately tells all those with old copies.

AFS Session Semantics
Session semantics: updates are only visible on close. In Unix (on a single machine), updates are visible immediately to other programs that have the file open. In AFS, everyone who has the file open sees the old version; anyone who opens the file again will see the new version.
In AFS:
1. On open with a cache miss: get the file from the server and set up a callback.
2. On close after writing: send a copy to the server, which tells all clients with copies to fetch the new version from the server on their next open.
Files are cached on the local disk; NFS (used to) cache only in memory.
(Diagram: the server holds X on disk and records callbacks {X: A, B, ...}; when one client writes X, the others are told to fetch the new version the next time X is opened.)

What if the server crashes? It loses all its callback state, and must reconstruct the information from the clients - go and ask everyone which files they have cached.
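The callback bookkeeping can be sketched in a few lines (all names are invented; a real server would also persist or rebuild this state after a crash, as noted above):

```python
callbacks = {}    # filename -> set of clients holding a cached copy
stale = set()     # (client, filename) pairs that must re-fetch

def open_fetch(client, filename):
    # Cache miss on open: fetch the file and register a callback.
    callbacks.setdefault(filename, set()).add(client)
    stale.discard((client, filename))

def close_after_write(writer, filename):
    # Write-through on close: break the callback for every other
    # client, so each fetches the new version on its next open.
    for client in callbacks.get(filename, set()):
        if client != writer:
            stale.add((client, filename))

open_fetch("A", "x")
open_fetch("B", "x")
close_after_write("A", "x")   # B's cached copy of "x" is now stale
```

This is what gives session semantics: B keeps reading its old copy until it re-opens "x" and clears its stale entry.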

AFS Pros and Cons
Disk as cache -> more files cached locally.
Callbacks -> the server is not involved if a file is read-only (the majority of file accesses are read-only).
But on fast LANs, local disk is slower than remote memory.
NFS version 4 will provide session semantics when it is deployed by vendors in the 2005 timeframe.

2.10.6 Summary

Remote file systems need caching to get decent performance, but a central server is a bottleneck:
Performance bottleneck: all data is written through to the server, and all cache misses go to the server.
Availability bottleneck: the server is a single point of failure.
Cost bottleneck: server machines have a high cost relative to workstations.

2.11 Peer to Peer (p2p) Services and Overlay Networks

Main Points:
What is an Overlay Network?
Gnutella - free-form searching.
Distributed Hash Tables - object location.


2.11.1 Overlay Networks

The Internet is successful because the management is decentralised, yet connectivity is maintained. Can we build applications that allow people to connect their machines, yet maintain control over their own machines? Yes, by building an overlay network connecting instances of the application. Examples include the web, file sharing, and DoS protection.

A Generic Overlay Network

Figure 2.1: Generic Overlay Networks (application nodes connected by overlay links across a WAN)

Rather than routing messages directly to the target, the overlay routes messages between the constituent machines. The overlay network builds routing tables to meet application needs.



2.11.2 Gnutella

Used to share mp3 files, and much other copyrighted content. A simple protocol based on flooding and caching; many different implementations interoperate.

Definitions
Servent - the entities making up the Gnutella network, emphasising that clients are servers and vice versa.
Ping - a message sent into the network. When a servent receives the Ping, it responds with a Pong to the sender.
Pong - a message containing one or more servent addresses and some information about the data being shared.
Query - a free-text description of the data asked for. A servent responds with a QueryHit if a match is found against its local data set.
QueryHit - provides enough information to acquire the data matching the query.

Joining the Network
A servent joining the network can send a Ping request to discover other nodes. Servents receiving the Ping can choose to return a Pong with servent details, and will then flood the Ping message to the other servents they are connected to. Pong messages return along the same path as Ping messages: servents cache recently seen Ping identifiers so as to return Pong messages back along the same path. If a Pong's identifier is not in the cache, the servent discards the Pong message. There is a TimeToLive field in each message, decremented each time the message passes through a servent; when the field reaches zero, the message is discarded. Servents can choose to initiate connections to other discovered servents.
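Ping flooding with a TTL can be sketched over a made-up servent topology. The `seen` set models the duplicate cache: each servent forwards a given Ping at most once, and a message with TTL 0 is discarded:

```python
# Invented topology: servent -> list of connected servents.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def flood(origin, ttl):
    seen = {origin}                  # identifiers already cached
    frontier = [(origin, ttl)]
    while frontier:
        node, t = frontier.pop()
        if t == 0:
            continue                 # TimeToLive exhausted: discard
        for peer in GRAPH[node]:
            if peer not in seen:     # already seen: drop the duplicate
                seen.add(peer)
                frontier.append((peer, t - 1))
    return seen                      # the servents the Ping reached
```

With ttl=2 the Ping from "a" stops short of "e"; with ttl=3 it covers the whole (toy) network, which is why the TTL bounds how far a flood spreads.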

Queries
Queries are flooded through the network in a similar manner to Pings. When a servent receives a Query, it returns a QueryHit if the query text matches its search criteria. How the servent interprets the search field is entirely a local matter: it could do only exact matching, or just return everything in its data set. The QueryHit returns along the same path as the Query, allowing caching. The QueryHit holds a servent-specific identifier and information about the file. If the user chooses to download the file, their servent initiates a direct connection to the holding servent and retrieves the data using http on the Gnutella port.

Bootstrapping
GWebCache is a set of web servers which store the IP addresses of live Gnutella hosts. Most clients now implement some form of UltraPeer: well-connected machines are promoted above others, and have 10-100 leaf nodes and fewer than 10 connections to other ultrapeers. Ultrapeers can then filter traffic for leaf nodes using tables uploaded from the leaf nodes. Ultrapeers are the nodes who advertise themselves in the GWebCache.

Gnutella Issues
Security - how can you trust the data you've downloaded? Gnutella is well known for carrying viruses and trojans.
Bandwidth Usage - when a node becomes an UltraPeer, its bandwidth usage increases enormously. This can cause problems for the institution in which the machine is sited.
FreeLoaders - the network works because people share files and donate bandwidth and disk space. What happens when people just take without giving?
Copyright - most of the material on Gnutella is copyright to someone or other. Should people abuse copyright in this way?


CHAPTER 2. LECTURE NOTES

2.11.3 Chord - A Distributed Hash Table Example

When the name of the object is known, there are more efficient search structures, such as hash tables. A hash table takes an input key k, calculates the hash function on the key h(k), and uses this as an index into a table. In a distributed hash table (DHT), the table into which h(k) indexes is distributed across the nodes in the DHT. The key is provided, the hash h(k) is calculated, and h(k) is used to route to the node which would hold the object corresponding to the key.

Basic Operations

Chord is a research project which designed one of the first DHTs. Many others have since been designed.

Function           Description
insert(key,value)  Inserts a key/value binding in the DHT.
lookup(key)        Returns the value associated with key.
join(n)            Causes a node to add itself into the Chord system.
leave()            Causes a node to leave the Chord system.

Table 2.3: The Chord API

Identifiers and Keys

A node generates its identifier by picking a value randomly from the hash space, e.g. 128 bits of a SHA hash applied to its DNS name. The node joins the DHT and determines who its predecessor and successor are in the table.

Predecessor(n) The node with the highest identifier less than n's identifier, allowing for wraparound.
Successor(n) The node with the lowest identifier greater than n's identifier, allowing for wraparound.

A node is then responsible for its own identifier and the identifiers between its identifier and its predecessor's identifier.

The Identifier Circle



Figure 2.2: Identifier circle for 3-bit identifiers (nodes 0, 1 and 3: node 0 is responsible for keys 4-0, node 1 for key 1, node 3 for keys 2-3)

Routing in Chord

Idea: route to a node that halves the distance to the node responsible for the target key. So for identifier length N, each node with identifier n maintains a table with the addresses of the nodes that succeed the identifiers n + 2^0 mod 2^N, n + 2^1 mod 2^N, n + 2^2 mod 2^N, through to n + 2^(N-1) mod 2^N. To route to a target key k, look up the identifier in the routing table that precedes k, and pass the request on to that node.

Routing Example

www.pdos.lcs.mit.edu/chord/
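The finger-table routing just described can be sketched in the 3-bit identifier space of the figures (nodes 0, 1 and 3). This ignores joins, failures and real RPC, and the function names are illustrative, not Chord's.

```python
BITS = 3
SPACE = 2 ** BITS
NODES = sorted([0, 1, 3])          # node identifiers on the circle

def successor(k):
    """First node at or clockwise after identifier k."""
    k %= SPACE
    for n in NODES:
        if n >= k:
            return n
    return NODES[0]                # wrap around the circle

def finger_table(n):
    """Successors of n + 2^0, n + 2^1, ..., n + 2^(BITS-1), mod 2^BITS."""
    return [successor((n + 2 ** i) % SPACE) for i in range(BITS)]

def lookup(start, key):
    """Route greedily via fingers; return the path of nodes visited."""
    path, node = [start], start
    while True:
        if (key - node) % SPACE == 0:
            return path                      # this node holds the key
        succ = successor(node + 1)           # immediate successor on the ring
        if 0 < (key - node) % SPACE <= (succ - node) % SPACE:
            path.append(succ)                # key lies in (node, succ]
            return path
        # forward to the finger most closely preceding the key
        node = max((f for f in finger_table(node)
                    if 0 < (f - node) % SPACE < (key - node) % SPACE),
                   key=lambda f: (f - node) % SPACE)
        path.append(node)
```

`lookup(3, 1)` reproduces the routing example below: node 3's fingers all point at node 0, and node 0's immediate successor 1 is responsible for key 1, so the path is 3, 0, 1.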

Expected Number of Routing Hops

Each hop in the routing will, on average, decrease the distance to the target by one half of the current distance.

The finger tables for nodes 0, 1 and 3 (3-bit identifiers):

Node 0: targets 1 2 4, successors 1 3 0
Node 1: targets 2 3 5, successors 3 3 0
Node 3: targets 4 5 7, successors 0 0 0

Figure 2.3: Routing example - lookup(1), going from node 3 to the node responsible for key 1

If the size of the identifier space is N, then the average number of hops is, with high probability, O(log(N)).

Maintenance: Joins and Leaves

On joining, the node generates its identifier n and locates its immediate successor by looking up n. The node then transfers from its successor the (key,value) pairs for the identifiers in the DHT between n and the successor's identifier, and updates the routing tables of the relevant nodes:

for i = 1 to N
    p = findPredecessor(n - 2^(i-1))
    p.updateTables()

The Chord implementation does this through timed checks on the correctness of the successor and predecessor relations.

2.11.4 Current Research Challenges

Gnutella is good for open search (as is Google). DHTs are good for object location. Can we build overlay networks which provide looser search properties, yet retain the O(log(n)) messages of DHTs?

2.11.5 Summary

Overlay networks are an extension of the Internet design philosophy to distributed applications. They provide routing mechanisms that are robust to individual failures, with redundant paths, and are the genesis of much of the current research in networks and distributed systems.

2.12 Content Distribution Networks

Main Points:
- Building content caches
- Pre-fetching data
- Using your neighbours - BitTorrent


2.12.1 Getting Content over a Network

Users want to download content from servers as quickly as possible.

What structures can we use to improve their experience, and reduce the impact upon the network?


2.12.2 Web Caches

Large organisations can improve web performance by placing a web cache in front of HTTP connections. The cache inspects an incoming HTTP request to see if it can satisfy the request from locally cached objects. If yes, then the object is returned, and the last reference time is updated. If no, then the object is retrieved and copied locally if allowed. Web objects can be marked as non-cacheable. Caching is a major part of the HTTP standard.

Cache Performance

The performance of a web cache is difficult to model, since the performance is a complex mixture of interactions between TCP, HTTP and content. Caches work because of temporal locality, due to the popularity of content, and spatial locality, due to the structure of HTML documents.
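The cache decision described above (return-and-touch on a hit, fetch-and-store on a miss, honour the non-cacheable marking) can be sketched as follows. The `fetch` callback and all names are illustrative, not any real proxy's API.

```python
import time

class WebCache:
    """Toy forward cache: serve from the local store when possible."""

    def __init__(self, fetch):
        self.fetch = fetch            # function: url -> (body, cacheable)
        self.store = {}               # url -> [body, last_reference_time]
        self.hits = self.requests = 0

    def get(self, url):
        self.requests += 1
        if url in self.store:         # hit: return object, touch timestamp
            self.hits += 1
            entry = self.store[url]
            entry[1] = time.time()
            return entry[0]
        body, cacheable = self.fetch(url)   # miss: go to the origin server
        if cacheable:                 # objects may be marked non-cacheable
            self.store[url] = [body, time.time()]
        return body

    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0
```

The hit rate this reports is the "request hit rate" in the JANET measurements quoted below.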

Measurements of web proxies (based on JANET caches) give the following results:
- Request hit rate is about 60%.
- Volume hit rate is about 30%.
- Latency improvement is around a factor of 3 on average.

Problems with Caching

Not all content is marked as cacheable, e.g. because the site wants accurate records of who looks at content. All hosts behind a cache appear to come from one address. Question: Why is this a problem?

2.12.3 Pre-fetching Data

Can we improve performance by pro-actively distributing content to caches? Yes. . .

Active Content Distribution

The HTML uses links to the CDN domain name, e.g. akamai.com. The Internet service provider has entries in its local DNS for akamai.com pointing to a machine on the ISP's network. Content will therefore be supplied from the local machine rather than the origin machine. Customer interaction is improved, and the bandwidth requirements of the servers are reduced.


2.12.4 Using your Peers: BitTorrent

Why not use the other people receiving the content? BitTorrent downloads from other machines.

Basic Idea: To download, the host contacts a machine tracking those already downloading the torrent. Peers are selected at random from the tracker. Pieces are selected to download from those available in the downloader's peer set, until all pieces of the file have been received.

The .torrent file

To start downloading from a torrent, the consumer must first locate a .torrent file. This contains information about the file length, name and hashes of the file blocks, and the URL of a tracker.

The file is split into 250 KByte pieces, each having a SHA1 hash calculated. A tracker holds the IP addresses of current downloaders.

Locating peers

After receiving the torrent file, the downloader contacts the tracker. The tracker inserts the downloader into its list of downloaders, and returns a random list of other downloaders. This list becomes the downloader's peers. Question: What is the shape of the overlay graph?

Choosing Pieces

The downloader contacts its peers to discover what pieces they have. It then chooses a piece to download:
- The first choice is made randomly, so as to spread load.
- Subsequent pieces are chosen on a rarest-first approach, to increase the probability that all pieces remain available.
When a peer has downloaded a new piece which matches the SHA1 hash, it notifies its peers that it has a new piece.

Choosing Downloaders

A request to upload is accepted if the requester recently uploaded to it. This provides an incentive to machines to share data. Periodically, other machines are tried for upload.
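The piece-choice rule above can be sketched as follows: count how many peers hold each piece we still need, pick randomly for the very first piece, and the least-replicated piece thereafter. Function and parameter names are illustrative.

```python
import random
from collections import Counter

def choose_piece(have, peer_bitmaps, first=False):
    """Pick the next piece index to request, rarest-first.

    have: set of piece indices we already hold
    peer_bitmaps: list of sets, the pieces each peer advertises
    """
    counts = Counter()
    for bitmap in peer_bitmaps:
        for piece in bitmap:
            if piece not in have:
                counts[piece] += 1
    if not counts:
        return None                      # nothing useful is on offer
    if first:
        return random.choice(list(counts))   # first piece: random, spreads load
    return min(counts, key=lambda p: counts[p])  # otherwise the rarest piece
```

With peers advertising {0,1,2}, {1,2} and {2}, and piece 0 already held, piece 1 is the rarest still needed (held by two peers rather than three), so it is chosen next.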

2.12.5 Summary

- Web caching improves performance by a reasonable factor, dependent on situation.
- Pro-active content distribution can reduce latency and improve bandwidth usage for popular services.
- BitTorrent can improve bandwidth usage by spreading load across peers.


2.13 Replication: Availability and Consistency

- Motivation for replication
- Multicasting updates to a group of replicas
- Total ordering
- Causal ordering
- Techniques for ordering protocols: ISIS CBCAST

2.13.1 What is Replication?

Multiple copies of dynamic state stored on multiple machines, e.g. copies of files stored on different machines, or name servers storing name-to-address mappings. Caching can be seen as a form of replication.

Why is Replication used?

Performance enhancement A single server acts as a bottleneck; if we can balance load amongst multiple servers, we get an apparent performance gain. If clients are geographically distributed, we can site servers near clients and reduce communication costs.
Availability If a machine fails, then we can still provide a service.

The probability of total failure, such as all data being lost, is reduced, since data is replicated across multiple machines. If the probability of failure is pr(fail) for a given machine among n machines, then the probability of loss of service is pr(fail)^n and the availability of the service is 1 - pr(fail)^n. E.g., if the mean time between failures for each of 3 machines is 5 days and the repair time is 4 hours, then, assuming independence of failures, pr(fail) = 4/(5 x 24 + 4) = 0.03, and availability = 1 - 0.03^3 = 99.997%.

Fault Tolerance

Even in the presence of failure, the service will continue to give the correct service. Stronger than availability, since it can provide real-time guarantees (with extra work!). Can protect against arbitrary failures where machines feed in wrong information (Byzantine failure).
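The availability arithmetic in the example above can be checked numerically, assuming independent failures and the 5-day MTBF / 4-hour repair figures given:

```python
mtbf_hours = 5 * 24          # mean time between failures per machine
repair_hours = 4             # mean time to repair

# fraction of time a single machine is down
p_fail = repair_hours / (mtbf_hours + repair_hours)

# the service is lost only if all three replicas are down at once
availability = 1 - p_fail ** 3

print(round(p_fail, 3))               # about 0.032
print(round(availability * 100, 3))   # about 99.997 (percent)
```

Note how quickly the exponent works in our favour: one machine is unavailable about 3% of the time, but three independent replicas lose service for only a few minutes a year.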


2.13.2 Issues in Replication

A collection of replicas should behave as if state was stored at one single site. When accessed by a client, the view should be consistent. Replication should be transparent - the client is unaware that servers are replicated. If we are providing a replica service, replicas can be passive or active.

Passive Replication

Passive replicas are standbys, to maintain service on failure. No performance improvement. Standbys must monitor and copy the state of the active server. Provides availability in a simple manner. Used for highly available systems, e.g. space applications.

2.13.3 Consistency

Clients can modify a resource on any of the replicas. What happens if another client requests the resource before the replica has informed the others of the modification, as in cache consistency in distributed file systems? The answer depends upon the application...

Example: Distributed Bulletin Board System (BBS)
[Diagram: clients connect through front ends to any of several replica managers]

Users read and submit articles through a Front End. Articles are replicated across a number of servers. Front Ends can connect to any server. Servers propagate articles between themselves so that all servers hold copies of all articles.

2.13. REPLICATION: AVAILABILITY AND CONSISTENCY User membership of a given bbs is tightly controlled. Questions on BBS: How should messages be passed between replicas?

89

Should order of presentation of articles to clients be the same across all replicas? Are weaker ordering semantics possible? When a client leaves bbs group, can they see articles submitted after they have left? Is this desireable? What should happen when replicas are temporarily partitioned?

2.13.4 Updating Server state

Clients read and update state at any of the replicated servers, e.g. submitting messages in the BBS. To maintain consistency, three things are important:

Multicast communication Messages are delivered to all servers in the group replicating the data.
Ordering of messages Updates occur in the same order at each server.
Failure recovery When servers or the network fail and come back, the replicas must be able to regain consistency. Done through voting and transactions (later in the course).

2.13.5 Multicast and Process Groups

A process group is a collection of processes that co-operate towards a common goal. Multicast communication: one message is sent to the members of a process group. Idea: instead of knowing the address of a process, you just need to know an address representing the service; lower levels take care of routing messages. Useful for:

Replicated Services One update message goes to all replicas, which perform identical operations. Reduces communication costs.
Locating objects in distributed services A request for an object goes to all processes implementing the service, but only the process holding the object replies.

Group Services

Maintenance of group information is a complex function of the name service (for tightly managed groups).

Create Group Create a group identifier that is globally unique.


Join Group Join a group. Requires the joining process's information to be disseminated to the message routing function. May require authentication and notification of existing members.

Leave Group Remove a process from a group. May require authentication, may occur as a result of failure or partition. Need to notify message routing function, may notify other members.

Member List Supply the list of processes within a group. Needed for reliable message delivery, may require authentication.

2.13.6 Message Ordering

If two processes multicast to a group, the messages may be arbitrarily ordered at any member of the group. Process P1 multicasts message a to a group comprising processes P1, P2, P3 and P4. Process P2 multicasts message b to the same group. The order of arrival of a and b at members of the group can be different.

[Diagram: messages a and b arrive in different orders at different members of the group]

Ordering example

[Diagram: create object and delete object multicasts arrive in opposite orders at P2 and P3]

Order of operations may be important - delete object, create object. If delete object arrives before create object, then the operation is not completed.

Ordering Definitions

Various definitions of order, with increasing complexity in the multicasting protocol:

FIFO Ordering Messages from one process are processed at all group members in the same order.
Causal Ordering All events which preceded the message transmission at a process precede the message reception at other processes. Events are message receptions and transmissions.
Total Ordering Messages are processed at each group member in the same order.
Sync Ordering For a sync-ordered message, either an event occurred before message reception at all processes, or after it. Other events may be causally or totally ordered.

FIFO ordering

Achieved by a process adding a sequence number to each message. A group member orders incoming messages with respect to the sequence number. Applicable when each process's state is separate, or operations don't modify state, just add incremental updates or read.

Total Ordering

When several messages are sent to a group, all members of the group receive the messages in the same order. Two techniques for implementation:


Sequencer Elect a special sequencing node. All messages are sent to the sequencer, which then sends the messages on to the replicas. FIFO ordering from the sequencer guarantees total ordering. Suffers from a single point of failure (recoverable by election) and a bottleneck.

Holdback Queue Received messages are not passed to the application immediately, but are held in a holdback queue until the ordering constraints are met.

Sequence Number Negotiation The sender negotiates a largest sequence number with all replicas.

1. Replicas store the largest final sequence number yet seen, Fmax, and the largest proposed sequence number, Pmax.
2. The sender sends all replicas the message with a temporary ID.
3. Each replica i replies with a suggested sequence number of max(Fmax, Pmax) + 1. The suggested sequence number is provisionally assigned to the message, and the message is placed in the holdback queue (ordered with the smallest sequence number at the front).
4. The sending site chooses the largest suggested sequence number and notifies the replicas of the final sequence number. Replicas replace the provisional sequence number with the final one.
5. When the item at the front of the queue has an agreed final sequence number, deliver the message.

Causal Ordering

"Cause" is used loosely: since we don't know the application, messages might have a causal ordering. a and b are events, generally the sending and receiving of messages. We define the causal relation a -> b if:
1. a and b are events at the same process, and a happened before b; or
2. a is the sending of a message by process P1 and b is the arrival of the same message at P2.

In the bulletin board, an article titled "re: Multicast Routing", in response to an article called "Multicast Routing", should always come after it, even though it may be received before the initial article.

CBCAST - Causal ordering in ISIS


ISIS is a real commercial distributed system, based on process groups. Causal ordering for multicast within a group is based around vector timestamps. The vector VT has an entry for each member of the group, typically an integer. Vector timestamps have one operation defined: merge(u, v)[k] = max(u[k], v[k]), for k = 1..n. Incoming messages are placed on a holdback queue until all messages which causally precede the message have been delivered.

CBCAST Implementation

1. All processes pi initialise the vector to zero.
2. When pi multicasts a new message, it first increments VTi[i] by 1; it piggybacks vt = VTi on the message.
3. Messages are delivered to the application in process pj when:
   - the message is the next in sequence from pi, i.e. vt[i] = VTj[i] + 1; and
   - all causally prior messages that have been delivered to pi have also been delivered to pj, i.e. VTj[k] >= vt[k] for k != i.
4. When a message bearing a timestamp vt is delivered to pj, pj's timestamp is updated as VTj = merge(vt, VTj).

In words: the incoming vector timestamp is compared to the current timestamp. If the conditions for delivery to the process are not met, the message is placed on the holdback queue. When an incoming message is delivered, the timestamp is updated by the merge, and all messages in the holdback queue are re-examined to see if they can now be delivered. CBCAST requires reliable delivery.
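The delivery rule and holdback queue above can be sketched compactly, assuming reliable delivery and a fixed group whose process ids index the vector. This is illustrative, not the ISIS code.

```python
class CbcastProcess:
    """Causal multicast delivery using vector timestamps (CBCAST rules)."""

    def __init__(self, pid, group_size):
        self.pid = pid
        self.vt = [0] * group_size    # one entry per group member
        self.holdback = []
        self.delivered = []

    def multicast_stamp(self):
        """Increment own entry; return the timestamp to piggyback."""
        self.vt[self.pid] += 1
        return list(self.vt)

    def _deliverable(self, sender, vt):
        if vt[sender] != self.vt[sender] + 1:   # next in sequence from sender?
            return False
        return all(vt[k] <= self.vt[k]          # all causal predecessors seen?
                   for k in range(len(vt)) if k != sender)

    def receive(self, sender, vt, msg):
        self.holdback.append((sender, vt, msg))
        progress = True
        while progress:               # re-scan the queue after each delivery
            progress = False
            for entry in list(self.holdback):
                s, t, m = entry
                if self._deliverable(s, t):
                    self.holdback.remove(entry)
                    self.delivered.append(m)
                    self.vt = [max(a, b) for a, b in zip(self.vt, t)]  # merge
                    progress = True
```

This reproduces the causal example below: if P3 receives P2's (1,1,0) message before P1's (1,0,0) message, the former waits on the holdback queue until the latter arrives, and both are then delivered in causal order.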


[Diagram - causal example: P1 multicasts with timestamp (1,0,0); P2 delivers it immediately and multicasts with (1,1,0); at P3 the (1,1,0) message is delayed on the holdback queue until P1's message arrives, after which both can be delivered]

Group View Changes

When group membership changes, what set of messages should be delivered to the members of the changed group? What happens to the undelivered messages of failed members? What messages should a new member get? ISIS solves this by sending a sync-ordered message announcing that the group view has changed. Messages thus belong to a particular group view. A coordinator is used to decide which messages belong to which view.

2.13.7 Summary

- Replication of services and state increases availability.
- Replication increases performance.
- Replication increases fault tolerance.
- To maintain consistency, multicast updates to all replicas.
- Use sequence numbers to maintain FIFO ordering.
- Use vector timestamps to maintain causal ordering.
- Use elected sequencers or identifier negotiation to maintain total ordering.

2.14 Shared Data and Transactions

- Stateful servers
- Atomicity
- Transactions
- ACID
- Serial equivalence


2.14.1 Servers and their state

Servers manage a resource, such as a database or a printer. We attempt to limit the problems of distributed access by making the server stateless, such that each request is independent of other requests:
- Servers can crash in between servicing clients.
- Client requests cannot interfere with each other (assuming concurrency control in the server).
But we can't always design stateless servers...

Stateful Servers

Some applications are better modelled as extended conversations, e.g. retrieving a list of records in a large database is better modelled as getting a batch of records at a time. If the application requires state to be consistent across a number of machines, then each machine must recognise when it can update internal data, and needs to keep track of the state of the distributed conversation. If the conversation is long-lived, the server needs to be aware of state. If other conversations need to go on - e.g. modifying records during retrieval - how do we allow concurrency? What happens if a machine fails? We need to recover, and should also aim to be fault tolerant.

2.14.2 Atomicity

Stateful servers have two requirements:
1. Accesses from different clients shouldn't interfere with each other.
2. Clients should get fast access to the server.

Definition

We define atomicity as:

All or Nothing A client's operation on a server's resource should complete successfully, and the results hold thereafter (yea, even unto a server crash), or it should fail and the resource should show no effect of the failed operation.
Isolation Each operation should proceed without interference from other clients' operations - intermediate effects should not be visible.

Examples

Mutual Exclusion For a multi-threaded server, if two or more threads attempt to modify the same piece of data, then the updates require mutual exclusion to provide isolation, using semaphores or monitors.
Synchronisation In situations such as Producer-Consumer, we need to allow one operation to finish so that a second operation can use its results, again needing isolation.

2.14.3 Automatic Teller Machines and Bank accounts


[Diagram: an ATM controller connecting automatic teller machines to bank accounts held at this bank and at other banks]

An ATM or cash machine allows transfer of funds between accounts. Accounts are held on various machines belonging to different banks. Accounts offer the following operations:

deposit Place an amount of money in an account.
withdraw Take an amount of money from an account.
balance Get the current value in an account.

Operations are implemented as read() and write() of values, so "withdraw x from A and deposit x in B" is implemented as:
1. A.write(A.read() - x)
2. B.write(B.read() + x)

2.14.4 Transactions

Transactions are a technique for grouping operations on data so that either all complete or none complete. Typically the server offers a transaction service, such as:

beginTransaction(transId) Record the start of a transaction, and associate operations carrying this transId with this transaction.
commitTransaction(transId) Commit all the changes the operations in this transaction have made to permanent storage.
abortTransaction(transId) Abort all the changes the transaction's operations have made, and roll back to the previous state.

ACID

Transactions are described by the ACID mnemonic:

Atomicity Either all or none of the transaction's operations are performed. If a transaction is interrupted by failure, the partial changes are undone.
Consistency The system moves from one self-consistent state to another.
Isolation An incomplete transaction never reveals partial state or changes before committing.
Durability After committing, the system never loses the results of the transaction, independent of any subsequent failure.

Concurrency Problems
- Lost updates
- Inconsistent retrievals
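The transaction service interface above (begin, commit, abort with rollback) can be sketched as a toy single-threaded in-memory store; all names are illustrative and no concurrency control or durability is attempted.

```python
class TransactionService:
    """In-memory store offering begin/commit/abort with rollback."""

    def __init__(self):
        self.data = {}        # committed state
        self.pending = {}     # transId -> tentative copy of the data

    def begin_transaction(self, trans_id):
        self.pending[trans_id] = dict(self.data)   # work on a tentative copy

    def read(self, trans_id, key):
        return self.pending[trans_id][key]

    def write(self, trans_id, key, value):
        self.pending[trans_id][key] = value

    def commit_transaction(self, trans_id):
        self.data = self.pending.pop(trans_id)     # make the changes permanent

    def abort_transaction(self, trans_id):
        self.pending.pop(trans_id)                 # roll back: discard the copy
```

A committed transfer updates both accounts together; an aborted one leaves the committed state untouched, giving the all-or-nothing property.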

2.14.5 Serial Equivalence

Definition: Two transactions are serial if all the operations in one transaction precede the operations in the other, e.g. the following actions are serial: Ri(x) Wi(x) Ri(y) Rj(x) Wj(y)

Definition: Two operations are in conflict if:


Lost update example:

  Transaction T: A.withdraw(4,T); B.deposit(4,T)
  Transaction U: C.withdraw(3,U); B.deposit(3,U)

  T: balance = A.read()      (100)
  T: A.write(balance - 4)    (96)
  U: balance = C.read()      (300)
  U: C.write(balance - 3)    (297)
  T: balance = B.read()      (200)
  U: balance = B.read()      (200)
  U: B.write(balance + 3)    (203)
  T: B.write(balance + 4)    (204)  - U's deposit is lost

Inconsistent retrieval example:

  Transaction T: A.withdraw(100,T); B.deposit(100,T)
  Transaction U: Bank.total(U)

  T: balance = A.read()               (200)
  T: A.write(balance - 100)           (100)
  U: balance = A.read()               (100)
  U: balance = B.read() + balance     (300)
  U: balance = C.read() + balance     (300 + ...)
  T: balance = B.read()               (200)
  T: B.write(balance + 100)           (300)

  U's total misses the 100 in transit from A to B.

- At least one is a write.
- They both act on the same data.
- They are issued by different transactions.

E.g. Ri(x)Rj(x)Wi(x)Wj(y)Ri(y) has Rj(x), Wi(x) in conflict.

Definition: Two schedules are computationally equivalent if:
- The same operations are involved (possibly reordered).
- For every pair of operations in conflict, the same operation appears first in each schedule.

So, a schedule is serialisable if the schedule is computationally equivalent to a serial schedule. Question: Is the following schedule for these two transactions serially equivalent? Ri(x)Ri(y)Rj(y)Wj(y)Ri(x)Wj(x)Wi(y)

Transaction Nesting

Transactions may themselves be composed of multiple transactions, e.g. Transfer is a composition of withdraw and deposit transactions, which are themselves composed of read and write transactions. Benefits:
- Nested transactions can run concurrently with other transactions at the same level in the hierarchy.
- If lower levels abort, we may not need to abort the whole transaction, and can instead use other means of recovery.
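The conflict and equivalence tests above can be sketched directly: two operations conflict when at least one is a write, they touch the same item, and they come from different transactions. The tuple encoding and function names are illustrative; operations are assumed distinct.

```python
def conflicts(op1, op2):
    """op = (transaction, kind, item); kind is 'R' or 'W'."""
    t1, k1, x1 = op1
    t2, k2, x2 = op2
    return t1 != t2 and x1 == x2 and 'W' in (k1, k2)

def equivalent(s1, s2):
    """Computationally equivalent: same operations, and every conflicting
    pair appears in the same order in both schedules."""
    if sorted(s1) != sorted(s2):
        return False
    for i, a in enumerate(s1):
        for b in s1[i + 1:]:
            if conflicts(a, b) and s2.index(a) > s2.index(b):
                return False
    return True

# The schedule Ri(x) Rj(x) Wi(x) Wj(y) Ri(y), with Rj(x)/Wi(x) in conflict:
sched = [('i', 'R', 'x'), ('j', 'R', 'x'), ('i', 'W', 'x'),
         ('j', 'W', 'y'), ('i', 'R', 'y')]
serial_j_first = [('j', 'R', 'x'), ('j', 'W', 'y'),
                  ('i', 'R', 'x'), ('i', 'W', 'x'), ('i', 'R', 'y')]
serial_i_first = [('i', 'R', 'x'), ('i', 'W', 'x'), ('i', 'R', 'y'),
                  ('j', 'R', 'x'), ('j', 'W', 'y')]
```

Here `sched` is serialisable: its conflicting pairs (Rj(x) before Wi(x), Wj(y) before Ri(y)) put all of j's conflicting operations first, so it is equivalent to the serial schedule with j before i, but not to the one with i before j.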

2.14.6 Summary

- Transactions provide a technique for managing stateful servers.
- Need to worry about concurrency control.
- Need to worry about aspects of distribution.
- Need to worry about recovery from failure.

2.15 Concurrency Control and Transactions

- Problem restatement
- Locking
- Optimistic control
- Timestamping


2.15.1 Why concurrency control?

To increase performance, multiple transactions must be able to carry on work simultaneously... but if data is shared, this can lead to problems such as lost updates and inconsistent retrievals. So we must ensure that schedules of access to data for concurrent transactions are computationally equivalent to a serial schedule of the transactions.

2.15.2 Locking

As in operating systems, locks control access for different clients. The granularity of the data locked should be small so as to maximise concurrency, with a trade-off against complexity. To prevent intermediate leakage, once a lock is obtained, it must be held until the transaction commits or aborts.

Conflict rules

Conflict rules determine the rules of lock usage:
- If operations are not in conflict, then locks can be shared - read locks are shared.
- Operations in conflict imply that operations should wait on the lock - write waits on a read or write lock, read waits on a write lock.
- Since we can't predict other item usage until the end of the transaction, locks must be held until the transaction commits or aborts.
- If an operation needs to do another operation on the same data, it promotes the lock if necessary and possible - the operation may conflict with an existing shared lock.

Rules for strict two-phase locking

1. When an operation accesses a data item within a transaction:
(a) If the item isn't locked, then the server locks it and proceeds.
(b) If the item is held in a conflicting lock by another transaction, the transaction must wait until the lock is released.
(c) If the item is held by a non-conflicting lock, the lock is shared and the operation proceeds.
(d) If the item is already locked by the same transaction, the lock is promoted if possible (refer to rule b).
2. When the transaction commits or aborts, the locks are released.
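The strict two-phase rules above can be sketched as a toy lock manager: shared read locks, exclusive write locks, promotion only for a sole owner, and release only at commit or abort. It is single-threaded, so a conflicting request returns False ("must wait") rather than blocking; all names are illustrative.

```python
class LockManager:
    """Strict two-phase locking over named data items (toy, non-blocking)."""

    def __init__(self):
        self.locks = {}     # item -> (mode, set of transaction ids)

    def lock(self, tx, item, mode):
        """Try to take item in mode 'read'/'write'; False means wait."""
        held = self.locks.get(item)
        if held is None:                        # rule (a): item is unlocked
            self.locks[item] = (mode, {tx})
            return True
        held_mode, owners = held
        if owners == {tx}:                      # rule (d): sole owner, promote
            self.locks[item] = (mode if mode == 'write' else held_mode, owners)
            return True
        if held_mode == 'read' and mode == 'read':   # rule (c): share reads
            owners.add(tx)
            return True
        return False                            # rule (b): conflict, must wait

    def release_all(self, tx):
        """On commit or abort, drop every lock the transaction holds."""
        for item in list(self.locks):
            mode, owners = self.locks[item]
            owners.discard(tx)
            if not owners:
                del self.locks[item]
```

Note how promotion can block: a transaction holding a read lock shared with another cannot promote to write until the other releases, which is exactly where deadlock can arise.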

Locking Implementation

Locks are generally implemented by a lock manager:

lock(transId, dataItem, lockType) Lock the specified item if possible, else wait according to the rules above.
unLock(transId) Release all locks held by the transaction.

The lock manager is generally multi-threaded, requiring internal synchronisation - a heavyweight implementation.

Example

Transactions T and U. T: RT(i), WT(j,44). U: WU(i,55), RU(j), WU(j,66).
Question: What are the possible schedules allowed under strict locking?
Question: Are there any schedules computationally equivalent to a serial schedule which are disallowed?

Deadlocks

Locks imply deadlock, under the following conditions:
1. Limited access (e.g. a mutex or finite buffer).
2. No preemption (if someone has a resource, it can't be taken away).
3. Hold and wait: independent threads possess some of their needed resources while waiting for the remainder to become free.
4. A circular chain of requests and ownership.

The most common way of protecting against deadlock is through timeouts. After a timeout, a lock becomes vulnerable and can be broken if another transaction attempts to gain it, leading to aborted transactions.

Drawbacks of Locking
- Locking is overly restrictive on the degree of concurrency.
- Deadlocks produce unnecessary aborts.
- Lock maintenance is an overhead that may not be required.


2.15.3 Optimistic Concurrency Control

Most transactions do not conflict with each other. So proceed without locks, and check at the close of the transaction that there were no conflicts. Conflicts are analysed in a validation process: if the conflicts could result in a non-serialisable schedule, abort one or more transactions, else commit.

Implementation of Optimistic Concurrency Control

A transaction has the following phases:
1. A read phase, in which clients read values and acquire tentative versions of the items they wish to update.
2. A validation phase, in which operations are checked to see if they are in conflict with other transactions - the complex part. If invalid, then abort.
3. If validated, the tentative versions are written to permanent storage, and the transaction can commit (or abort).

Validation approaches
- Validation is based upon the conflict rules for serialisability.
- Validation can be either against completed transactions or active transactions - backward and forward validation.
- Simplify by ensuring only one transaction is in the validation and write phase at any one time.
- Trade-off between the number of comparisons, and the transactions that must be stored.

Forward Validation

1. A transaction in validation is compared against all transactions that haven't yet committed.
2. Writes may affect ongoing reads.
3. The write set of the validating transaction is compared against the read sets of all other active transactions.
4. If the sets conflict, then either abort the validating transaction, delay validation until the conflicting transaction completes, or abort the conflicting transaction.

Backward validation

1. Writes of the current transaction can't affect previous transactions' reads, so we only worry about reads overlapping with transactions that have committed.
2. If the current read set conflicts with the write set of an already-validated overlapping transaction, then abort the validating transaction.

2.15.4 Timestamping

Timestamping operates on tentative versions of data. Each transaction receives a globally unique timestamp on initiation. Every object x in the system or database carries the maximum (i.e. youngest) timestamp of the last transaction to read it, RTM(x), and the maximum of the last transaction to write it, WTM(x). If a transaction requests an operation that conflicts with a younger transaction, the older transaction is restarted with a new timestamp. Transactions are committed in order of timestamps, so a transaction may have to wait for an earlier transaction to commit or abort before committing. Since a tentative version is only written when the transaction commits, read operations may have to wait until the last transaction to write has committed.

An operation in transaction Ti with start time TSi is valid if:
- The operation is a read operation and the object was last written by an older transaction, i.e. TSi > WTM(x). If the read is permissible, RTM(x) = MAX(TSi, RTM(x)).
- The operation is a write operation and the object was last read and written by older transactions, i.e. TSi > RTM(x) and TSi > WTM(x). If permissible, WTM(x) = TSi.
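The validity rules above can be sketched per object: a failed check means the requesting transaction must restart with a new timestamp. Tentative versions and commit-order waiting are omitted, and the names are illustrative.

```python
class TimestampedObject:
    """Basic timestamp-ordering checks for one object."""

    def __init__(self):
        self.rtm = 0    # timestamp of youngest transaction to have read
        self.wtm = 0    # timestamp of youngest transaction to have written

    def read(self, ts):
        if ts <= self.wtm:        # last written by a younger transaction
            return False          # requester must restart
        self.rtm = max(ts, self.rtm)
        return True

    def write(self, ts):
        if ts <= self.rtm or ts <= self.wtm:   # read/written by younger
            return False                       # requester must restart
        self.wtm = ts
        return True
```

For example, once a transaction with timestamp 7 has read the object, a write from an older transaction with timestamp 6 is rejected, preserving timestamp order.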

2.15.5 Summary

- Locks are the commonest way of providing consistent concurrency.
- Optimistic concurrency control and timestamping are used in some systems.
- But consistency in concurrency is application dependent: for shared editors, people may prefer to trade speed of execution against the possibility of conflict resolution. Problems can occur with long-term network partition. Approaches based on notification and resolution by people are becoming popular.
(RTM: Read Timestamp Maximum; WTM: Write Timestamp Maximum)


2.16 Distributed Transactions

- Models for distributed transactions
- Attaining distributed commitment
- Distributed concurrency control

2.16.1 Single Server Transactions

[Diagram: clients 1..N each run a transaction through a single transaction manager in front of a resource server]

Till now, transactions have referred to multiple clients and a single server. How do we have multiple clients interacting with multiple servers, e.g. a complicated funds transfer involving different accounts from different banks? We generalise transactions to the distributed case...

2.16.2 Distributed Transactions

Distributed Transaction Requirements

General characteristics of distributed systems:
Independent failure modes
No global time
Inconsistent state

Need to consider:
how to achieve distributed commitment (or abort)
how to achieve distributed concurrency control

Models

[Figure: two models. Simple Distributed model - a client's transaction T runs operations at servers X, Y and Z. Nested Transaction - a client's transaction T is split into subtransactions T1 and T2, which split again into T11, T12, T21 and T22 running at servers X, Y, M, N and P.]

If a client runs several transactions, each transaction must complete before proceeding to the next. If transactions are nested, transactions at the same level can run in parallel. The client uses a single server to act as coordinator for all other transactions; the coordinator handles all communication with the other servers. Question: what are the requirements of transaction ids?
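As one hedged answer to the question: transaction ids must be globally unique without needing global coordination, and for nested transactions they must identify the parent. A sketch follows; the <server id, counter> scheme is a common convention, not the only one.

```python
# Hypothetical transaction-id scheme: a <server id, local sequence number>
# pair is globally unique provided server ids are unique; subtransaction
# ids extend the parent id with a child index (T1 -> T11, T12, ...).

import itertools

class TransIdFactory:
    def __init__(self, server_id):
        self.server_id = server_id
        self.counter = itertools.count(1)  # never repeats on this server

    def new_top_level(self):
        return (self.server_id, next(self.counter))

def subtransaction(parent_id, index):
    # Append the child's index, so the id encodes the nesting path.
    return parent_id + (index,)
```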

2.16.3 Atomic Commit Protocols

Distribution implies independent failure modes, i.e. a machine can fail at any time, and the others may not discover this. With a one phase commit, the client requests commit, but one of the servers may have failed - there is no way of ensuring durability. Instead, commit in 2 phases, thus allowing a server to request abort.

2 Phase Commit

One coordinator is responsible for initiating the protocol. The other entities in the protocol are called participants. If the coordinator or a participant is unable to commit, all parts of the transaction are aborted. Two phases:
Phase 1 Reach a common decision
Phase 2 Implement that decision at all sites

2 Phase Commit Details

1. Phase 1: The coordinator sends a CanCommit? message to all participants in the transaction.
2. Participants reply with vote yes or no. If the vote is no, the participant aborts immediately.
3. Phase 2: The coordinator collects the votes, including its own:
(a) If all votes are yes, the coordinator commits the transaction and sends DoCommit to all participants.
(b) Otherwise the transaction is aborted, and the coordinator sends abortTransaction to all participants.
4. When a participant receives DoCommit, it commits its part of the transaction and confirms using HaveCommitted.

[Figure: 2 Phase Commit Diagram. Coordinator: 1. prepared to commit (waiting for votes) - sends CanCommit?; 3. committed (or aborted) - sends DoCommit; 5. done - on receiving HaveCommitted. Participant: 2. prepared to commit (uncertain) - replies Yes; 4. commit - replies HaveCommitted.]

Note:

If a participant crashes after having voted to commit, it can ask the coordinator about the result of the vote. Timeouts are used when messages are expected. The protocol introduces a new state in the transaction: prepared to commit.
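The vote-collection logic can be illustrated with a toy, in-process sketch. Function calls stand in for the CanCommit?/DoCommit messages; no failures, logging or timeouts are modelled.

```python
# Toy sketch of the two-phase-commit decision rule. Real implementations
# add stable-storage logging, timeouts and retransmission; here each
# participant is just a function returning its vote.

def two_phase_commit(participants):
    """participants: dict of name -> zero-argument function returning
    'yes' or 'no' in answer to CanCommit?. Returns the decision message."""
    votes = {name: vote() for name, vote in participants.items()}  # phase 1
    if all(v == "yes" for v in votes.values()):                    # phase 2
        return "DoCommit"
    return "abortTransaction"
```

A single no vote (or, in a real system, a timeout waiting for a vote) forces the whole transaction to abort.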


2.16.4 Distributed Concurrency Control

Locking

Locking is done per item, not per client, so there are no problems generalising to multiple servers... ...except in dealing with distributed deadlock. The techniques are the same as usual, but distributed deadlock detection is interesting.

Optimistic Concurrency Control

Need to worry about distributed validation. The simple model of validation had only one transaction being validated at a time - this can lead to deadlock if different coordinating servers attempt to validate different transactions. We also need to validate in the correct serialisable order. One solution is globally to allow only one transaction to validate at a time. Another solution is to validate in two phases, with timestamp allocation first local, then global, to enforce ordering.

Timestamping

If clocks are approximately synchronised, then timestamps can be <local timestamp, coordinating server id> pairs, with an ordering defined upon server ids.
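The pair ordering can be shown directly. A trivial sketch: Python compares tuples lexicographically, which is exactly "local timestamp first, server id to break ties".

```python
# <local timestamp, coordinating server id> pairs ordered lexicographically:
# local timestamps decide, server ids break ties between equal timestamps.

def older(a, b):
    """True if timestamp pair a precedes pair b in the global order."""
    return a < b
```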

2.16.5 Summary

Nested transactions are the best model for distributed transactions. The two phase commit protocol is suitable for almost all cases. Distributed concurrency control is only slightly more difficult than for the single server case.

2.17 Transactions: Coping with Failure

Failure modes
Recovery techniques
Partitions and quorum voting


2.17.1 Failure Modes

For transactions to be atomic and durable, we need to examine failures:
1. Transaction-local failures detected by the application, which calls abort, e.g. insufficient funds. No information is lost; the changes made need to be undone.
2. Transaction-local failures not detected by the application but by the system as a whole, e.g. divide by zero. The system calls abort.
3. System failures affecting transactions in progress but not media, e.g. CPU failure. Loss of volatile store and possibly of all transactions in progress. On recovery, a special recovery manager undoes the effects of all transactions in progress at the failure.
4. Media failures affecting the database, e.g. a head crash. No way of protecting against this.

2.17.2 Recovery

We assume a machine crashes, but then is fixed and returns to operation.5 We need to recover state to ensure that the guarantees of the transactional system are kept. Use a recovery file or log that is kept on permanent storage.

The Recovery Manager

Recovery from failure is handled by an entity called the Recovery Manager. It keeps information about changes to the resource in a recovery file (also called a log) kept in stable storage - i.e. something that will survive failure. When coming back up after a failure, the recovery manager looks through the recovery file and undoes changes (or redoes changes) so that uncommitted transactions didn't happen, and committed transactions happened. Events are recorded in the recovery file for each change to an object in the database.

Recovery File

Information recorded per event includes:
Transaction Id To associate the change with a transaction
Record Id The identifier of the object
Action type Create/Delete/Update etc
5 Hence sidestepping the impossibility of Byzantine agreement in asynchronous systems, because we assume the machine always returns.

Old Value To enable changes to be undone
New Value To enable changes to be redone


Also log beginTransaction, prepareToCommit, commit, and abort actions, with their associated transaction id.

Recovering

If after failure the database is undamaged, undo all changes made by transactions executing at the time of failure. If the database is damaged, restore the database from archive and redo all changes from committed transactions since the archive date. The recovery file entry is made and committed to stable storage before the change is made - so incomplete transactions can be undone, and committed transactions redone. What might happen if the database were changed before the recovery file was written? Note that recovery files hold the information needed to undo transactions.

Checkpointing

Calculation of which transactions to undo and redo on large logs can be slow, and recovery files can get too large. Instead, augment the recovery file with a checkpoint:
Force the recovery file to stable storage
Write a checkpoint record to stable store with
1. A list of currently active transactions
2. For each transaction, a pointer to the first record in the recovery file for that transaction
Force the database to disk
Write the address of the checkpoint record to the restart location atomically

Recovering with checkpoints

To recover, keep undo and redo lists. Add all transactions active at the last checkpoint to the undo list, then:
1. Forwards from the checkpoint to the end,
(a) if we find beginTransaction, add the transaction to the undo list
(b) if we find a commit record, add the transaction to the redo list

(c) if we find an abort record, remove the transaction from the undo list

2. Backwards from the end to the first record of the checkpointed transactions, execute undo for all transaction operations on the undo list.
3. Forwards from the checkpoint to the end, redo operations for transactions on the redo list.
At a checkpoint we can discard all of the recovery file up to the first logged record of the checkpointed transactions.

Recovery of the Two Phase Commit Protocol
[Figure: the 2 Phase Commit diagram again - coordinator states prepared (waiting for votes), committed (or aborted), done; participant states prepared (uncertain), committed; messages CanCommit?, Yes, DoCommit, HaveCommitted.]
The coordinator logs prepared to signal starting the protocol, commit on sending DoCommit, and done to indicate the end of the protocol in its recovery file. A participant logs uncertain to indicate that it has replied yes to the commit request, and committed when it receives DoCommit. On recovery, the coordinator aborts transactions which only reached prepared, and resends DoCommit when in the commit state but not done. A participant requests the decision from the coordinator if it is in the uncertain state but not committed.
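The undo/redo list construction from the checkpointing steps earlier can be sketched as follows. An illustration only; a log record here is just a (transaction id, action) pair.

```python
# Build the undo and redo lists after a crash, following the checkpointed
# recovery steps: seed undo with the transactions active at the checkpoint,
# then scan the log forwards from the checkpoint.

def build_lists(active_at_checkpoint, log_after_checkpoint):
    undo = set(active_at_checkpoint)
    redo = set()
    for tid, action in log_after_checkpoint:
        if action == "beginTransaction":
            undo.add(tid)
        elif action == "commit":
            redo.add(tid)      # its operations are redone in step 3
        elif action == "abort":
            undo.discard(tid)  # per the notes, abort removes it from undo
    return undo, redo
```

Note that a transaction active at the checkpoint which later commits can appear on both lists: its operations are undone in step 2 and then redone in step 3, leaving the committed state.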

2.17.3 Network Partition

Transactions are often used to keep replicas consistent. If the network partitions (a cable breaks), the replicas are divided into two or more sets (possibly with common members).


[Figure: three replicas A, B and C, with the link between A and B broken.]

Link AB broken. Routing takes time to determine a reroute from AB via C. So clients behind A can see only A and C; clients behind B can see only B and C; clients behind C see all replicas.

Can we still write and read from any of the sets? Yes, but we must reduce the possible read and write sets to maintain consistency, or relax the consistency requirements and resolve problems when the partition is healed.

Quorum Consensus

Consider a set of replicas, where replicated objects have version numbers at each replica.
1. Assign a weighting of votes to each replica, indicating importance.
2. For a client to perform an operation, it must gather votes from all the replicas it can talk to (denote these X).
3. The votes gathered, X, must meet the read quorum R to enable a read.
4. The votes gathered, X, must meet the write quorum W to enable a write.
5. As long as W > half the total number of votes, and R + W > the total number of votes in the group, each read quorum and each write quorum will have at least one member in common.

Partition Example

For three replicas R1, R2, R3, we can allocate votes to give different properties depending upon requirements:

Replica   config 1   config 2   config 3
R1        1          2          1
R2        0          1          1
R3        0          1          1

1. What should R and W be set to in the three configurations?
2. What properties result from these configurations?
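The quorum constraints in step 5 can be checked mechanically; the R/W choices in the comments are one possible answer to the first question (counted in votes, not servers).

```python
# Check the quorum-intersection constraints: W must exceed half the total
# votes, and R + W must exceed the total, so any read quorum overlaps any
# write quorum in at least one replica.

def valid_quorums(votes, r, w):
    total = sum(votes)
    return w > total / 2 and r + w > total

# config 1 (1,0,0): total 1, so R = W = 1 - only R1 matters.
# config 2 (2,1,1): total 4, so e.g. R = 2, W = 3.
# config 3 (1,1,1): total 3, so e.g. R = W = 2 (simple majority).
```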

Read and Write Operations

When a write happens, the object's version number is incremented. To read, collect votes from replica managers along with the version numbers of the object. We are guaranteed at least one up-to-date copy in any read quorum, from which the read occurs. To write, collect votes from replica managers with the version numbers of the object. If a write quorum containing an up-to-date copy is not discovered, then copy the up-to-date version around to create a write quorum; then the write is allowed. Manipulating R and W gives different characteristics, e.g. R = 1 and W = number of copies gives unanimous update. Cached copies of objects can be incorporated as weak representatives with 0 votes, but usable for reads.

2.17.4 Summary

Atomicity comes from using logging techniques on operations at the server, where the log is kept on stable storage. Voting can be used to give availability for resources on partitioned replicas.

Chapter 3

Exercises and answers


3.1 Exercises

3.1.1 Communication System Fundamentals

1. Ethernet uses a carrier sense multiple access with collision detect scheme. Explain the terms in italics. Even using CSMA/CD, there is still a finite probability of collision between packets. Explain why.
2. Assume that two computers are sending 1000 byte packets over a shared channel that operates at 64000 bits per second. To ensure fairness, after sending a packet, the sending machine must wait 200 microseconds before sending another packet, whilst a waiting machine can send a packet 100 microseconds after a packet is sent. How long will it take the two machines to each send a 1 MByte file?
3. Switched networks are taking over from shared media networks as Local Area Networks. Why?
The answer in Section 3.2.1.

3.1.2 Devising a Routing Protocol

Your task is to devise a routing protocol based on distance vector routing. You should follow these steps:
1. What functionality is required from the exchange of messages?
2. What information has to be exchanged?
3. How should this information be represented? Define class structures down to primitive types for this section.


4. How should the packets be laid out?
5. When should events occur in the protocol? Think of simple state transition diagrams.
The answer in Section 3.2.2.

3.1.3 Layering

1. For the following:
Bandwidth 2 MBit/s; 1000 byte packets, 10 byte ack packet; cable distance 3000 km, propagation speed 2 x 10^5 km/s.
What size window is required to keep the line fully utilised?
2. For a sequence space of 3 bits, what is the maximum window size? For a sequence space of size N?
3. If the bandwidth of a network is 100 MBit/s, the sequence space is 32 bits, and each byte is labelled, what is the maximum packet lifetime?
4. The memory buffer capacity of a certain bottlenecked router in the Internet is less than one packet per connection traversing it. What effects will this have on TCP?
The answer in Section 3.2.3.

3.1.4 Serialization

In a remote procedure call system, a key decision is how to represent the data structures within the messages passed between clients and servers. In this exercise, we will think about some of the necessary mechanisms.

1. A remote procedure call system has a simple interface representation, of which the following is an example.

const MAXLENGTH = 256;
struct id {
    string name<MAXLENGTH>;
    int pin;
};
program BANKACCOUNT {
    version BANKVERS {
        void deposit(id,float) = 1;
        void withdraw(id,float) = 2;
        float balance(id) = 3;
    } = 1;
} = 0x28786554;


Design a representation system for laying out the bits within the request and the response packets. You should consider which information will be known by both ends, and whether there needs to be explicit representation of type information or of data length information.
2. Java uses object references as well as primitive types. Discuss the problems in serializing Java objects, with particular reference to encoding chains of objects, and to deciding whether references are remote or local.

3.1.5 Remote Procedure Call

1. Design a Java interface for a voting service, which allows a user to retrieve a list of candidates and their manifestoes, to discover the current number of votes a candidate has, and to vote for a particular candidate.
2. Convert the Java interface above to a remotely invokable interface. What other classes are needed to provide a complete implementation?
3. Design the equivalent XDR interface for the above voting service.
4. Sun rpc can work over either udp or tcp - what are the advantages and disadvantages of the two approaches?
5. Exception handling is a weak part of the sun rpc system, relying principally on the unix approach of global variables in the program environment (errno). What are the desirable features of exception handling, and what are the problems in extending sun rpc to support exception handling?
The answer in Section 3.2.5.

3.1.6 Security

1. Describe some of the ways in which conventional email is vulnerable to eavesdropping, masquerading, tampering, replay and denial of service.
2. Suggest methods by which email could be protected against each of these forms of attack.


3.1.7 Names and distributed filing systems

1. Many of the problems in distributed file systems arise because the files are writable. How does the design space change if files are only ever created, read and destroyed? How would you design a scalable file service for such a system? Consider the scalability problems of lookup and retrieval.
2. Discuss the problems raised by the use of aliases in a name service, and indicate how, if at all, these problems may be overcome.
3. The DNS uses a ptr resource record to allow reverse mapping from IP addresses to DNS names. Outline how a lookup must proceed in looking up a ptr record.
4. What data must each NFS client module hold on behalf of each user-level process? Declare a suitable data structure in C to hold the data.
The answer in Section 3.2.6.

3.1.8 Availability and Ordering

1. In a multi-user game, players at separate workstations move figures around a common scene. The figures may throw projectiles at one another, and a hit debilitates the unfortunate recipient for some pre-determined interval. What type of ordering is required here?
2. The game incorporates magic devices which may be picked up by a player to assist her. What type of ordering should be applied to the pick up device operation?
3. Three processes, 0, 1, and 2, use the ISIS total ordering protocol, as described in the notes, using sequence number agreements. A message is sent from process 0, which reaches process 1 ok, but is delayed in reaching process 2. Process 1 then sends a message which reaches process 2 ok. The original message from process 0 then reaches process 2. In which order are the messages delivered to the processes?
4. A process group using CBcast has three members, 0, 1 and 2. Process 0 sends two messages with vector timestamps (6,9,10) and (7,10,11). Process 1 sends two messages with vector timestamps (5,9,10) and (5,10,10). Process 2 sends two messages with vector timestamps (6,9,11) and (6,9,12). Show a possible ordering of the transmission and delivery events at these processes.
5. What guarantees on message delivery are required for CBcast?
The answer in Section ??.


3.1.9 Transactions and Concurrency

1. A server manages data items, and allows read and write operations. Give three serialisable schedules of the following transactions:
T: R(j), R(i), W(j,44), W(i,33)
U: R(k), W(i,55), R(i), W(k,66)
2. For transactions T and U, explain which of the following interleavings can occur with strict two phase locking and with optimistic concurrency control. Compare the possible outcomes under timestamping where TS_T < TS_U and where TS_T > TS_U.
(a) R_T(i), W_U(i,55), W_T(j,44), W_U(j,66)
(b) R_T(i), W_T(j,44), W_U(i,55), W_U(j,66)
(c) W_U(i,55), W_U(j,66), R_T(i), W_T(j,44)
(d) W_U(i,55), R_T(i), W_U(j,66), W_T(j,44)
(e) W_U(i,55), R_T(i), W_T(j,44), W_U(j,66)
3. Define an inconsistent retrieval. What pattern of operation conflicts may lead to an inconsistent retrieval?
4. Why might the start time of a transaction not be the best time to allocate its timestamp? Given the timestamps of two committed transactions, can you always draw their serialisation graphs? Compare the overhead of implementing locking with that of timestamping.
The answer in Section 3.2.7.

3.1.10 Distributed Transactions

1. Transactional file systems have been implemented to provide stronger protection for application programs. Show a possible interface and describe the semantics of a transactional file system that buffers operation results until committed. What issues should be addressed to ensure consistency across machine crashes?
2. Suggest two situations in the two phase commit protocol in which all the workers voted yes in the first phase, yet the coordinator decides to abort the transaction.
3. A server manages the data items a_1, a_2, ..., a_n. The server provides read and write operations on these items. Consider the following transactions:
T: x = read(i); write(j,44);
U: write(i,55); write(j,66);

V: write(k,77); write(k,88);
Describe the information written to the log file if strict two phase locking is in use, and U acquires a_i and a_j before T. Describe how the recovery manager would use this information to recover the effects of T, U and V when the server restarts after a crash. What is the significance of the commit entries in a log file?

4. Five replicas A, B, C, D and E use quorum consensus with weights A = 3, B = C = 2, D = E = 1. State the possible values that may be chosen for a write quorum. Which choices allow the service to continue when one of the servers is unavailable? Give the choice of read quorums for each write quorum, and for each read quorum, how the quorum may be formed and the minimum number of servers involved.
The answer in Section 3.2.8.

3.2 The Answers

3.2.1 Communication System Fundamentals

1. Ethernet uses a carrier sense multiple access with collision detect scheme. Explain the terms in italics. Even using CSMA/CD, there is still a finite probability of collision between packets. Explain why.
Because it takes a finite time for the signal to propagate along the cable. If at time t a station transmits, it won't be heard until t + dt at the other stations. If they check and transmit before t + dt then a collision will occur. The longer the wire, the longer the latency, and so the greater the possibility of collision. This is also why there is a minimum packet length, so that the jamming signal can be detected.
2. Assume that two computers are sending 1000 byte packets over a shared channel that operates at 64000 bits per second. After each packet, the receiving machine sends an acknowledgement packet of 10 bytes. How long will it take to send a 1 MByte file between the two machines? State your assumptions.
A data packet takes (1000 x 8)/64000 seconds = 125 milliseconds. An acknowledgement packet takes (10 x 8)/64000 seconds = 1.25 milliseconds. 1 MByte contains approximately 1000 packets. The machines alternate sending data and acknowledgements, so each packet takes 125 + 1.25 = 126.25 ms. So the total time is approximately 126 s, assuming negligible latency and no loss.
3. Switched networks are taking over from shared media networks as Local Area Networks. Why?


Matches the actual wiring laid out in buildings - most wiring is a star laid out from a dry riser.
More secure.
More fault tolerant - someone inadvertently unplugging a wire doesn't break the whole network.
Each machine can get the full bandwidth of its wire.
The switch hub can be scaled more easily.

3.2.2 Devising a Routing Protocol

Answers can be found by looking at RFC2453 RIP Version 2. G. Malkin.

3.2.3 Layering

1. Data transmission time = packet size / bandwidth = (1000 x 8)/(2 x 10^6) = 4 ms.
Propagation delay = distance / signal speed = 3000/(2 x 10^5) = 15 ms.
Ack transmission time = packet size / bandwidth = (10 x 8)/(2 x 10^6) = 0.04 ms.
Round trip time = 2 x propagation delay + data transmission time + ack transmission time = 2 x 15 + 4 + 0.04 = 34.04 ms.
For 100% utilisation, window size x data transmission time >= rtt, so window size = 34.04/4, rounded up to 9. We ignore processing delay at sender and receiver.
2. Assume a single channel, i.e. one in which packets arrive in order but can be lost. Then the maximum window size is 2^N - 1, otherwise the window may advance into space which hasn't been acknowledged, assuming go-back-n. If the network can delay, duplicate and reorder packets arbitrarily, then the maximum window size depends upon the maximum packet lifetime.
3. There are 2^32 labelled bytes. These can be transmitted in 2^32/(12.5 x 10^6) = 343 s. The maximum lifetime of a packet must therefore be less than 343 seconds.
4. TCP breaks. Basically a lot of the TCP flows throttle back due to congestion control until they are no longer sending, and those packets that are sent are often lost, leading to high numbers of retransmissions.
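The arithmetic in answer 1 can be reproduced directly; this is pure calculation, with no protocol machinery.

```python
import math

# Window size needed to keep a 2 MBit/s, 3000 km link fully utilised,
# following answer 1 above.

bandwidth = 2e6                                # bits per second
data_tx_ms = 1000 * 8 / bandwidth * 1000       # 4 ms per data packet
ack_tx_ms = 10 * 8 / bandwidth * 1000          # 0.04 ms per ack
prop_ms = 3000 / 2e5 * 1000                    # 15 ms one-way propagation
rtt_ms = 2 * prop_ms + data_tx_ms + ack_tx_ms  # 34.04 ms round trip
window = math.ceil(rtt_ms / data_tx_ms)        # 9 packets in flight
```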

3.2.4 Serialization

1. This is a sample of Sun rpc and its associated data representation, xdr. Sun rpc is described in rfc 1831, whilst the concrete syntax for sun xdr is available in rfc 1832. Some bits of the answer follow - the remainder is for you to fill in.

struct rpc_msg {
    unsigned int xid;
    union switch (msg_type mtype) {
    case CALL:
        call_body cbody;
    case REPLY:
        reply_body rbody;
    } body;
};

struct call_body {
    unsigned int rpcvers;  /* must be equal to two (2) */
    unsigned int prog;
    unsigned int vers;
    unsigned int proc;
    opaque_auth cred;
    opaque_auth verf;
    /* procedure specific parameters start here */
};

union reply_body switch (reply_stat stat) {
case MSG_ACCEPTED:
    accepted_reply areply;
case MSG_DENIED:
    rejected_reply rreply;
} reply;

struct accepted_reply {
    opaque_auth verf;
    union switch (accept_stat stat) {
    case SUCCESS:
        opaque results[0];
        /*
         * procedure-specific results start here
         */
    case PROG_MISMATCH:
        struct {
            unsigned int low;
            unsigned int high;
        } mismatch_info;
    default:
        /*
         * Void. Cases include PROG_UNAVAIL, PROC_UNAVAIL,
         * GARBAGE_ARGS, and SYSTEM_ERR.
         */
        void;
    } reply_data;
};

2. This is Java serialization. The tricky part is how to represent Java objects efficiently.

3.2.5 Remote Procedure Call

1.
2.
3.

3.2.6 Names and Distributed File Systems

1. Aliases are useful in allowing service names, e.g. www, ftp, smtp etc, to map onto a single host. This works by using a CNAME for the alias name which points back to the canonical name of the host. Disadvantages are:
(a) A performance hit through the additional indirection.
(b) Administrative problems, since there is no backwards pointer from the canonical host name entry to the aliases. This can be made easier by decent tools.
(c) Using the alias name where a reverse or PTR lookup is required means that the names don't correspond.
2. IP addresses have their own portion of the dns tree under in-addr.arpa. This space is managed by ARIN, who keep the net numbers on the root servers, which have entries for all net numbers. Each net number has an NS entry to point to the particular dns server which maintains the tables for the individual hosts.

3.2.7 Transactions and Concurrency

1. R_T(j), R_T(i), W_T(j), W_T(i), R_U(k), W_U(i), R_U(i), W_U(k)
R_U(k), W_U(i), R_U(i), W_U(k), R_T(j), R_T(i), W_T(j), W_T(i)
R_U(k), R_T(j), R_T(i), W_T(j), W_T(i), W_U(i), R_U(i), W_U(k)
2. An inconsistent retrieval occurs when a transaction reads values that another transaction has only partially updated, leaving the set of values in an inconsistent state. An example pattern is R_T(x), W_U(y), R_T(y), W_U(x).
3. (a) Impossible under strict locking or timestamping with TS_T > TS_U; possible under occ or timestamping with TS_T < TS_U.
(b) Possible under locking, occ, and timestamping with TS_T < TS_U; impossible under timestamping with TS_T > TS_U.

(c) Possible under locking, occ, and timestamping with TS_T > TS_U; impossible under timestamping with TS_T < TS_U.
(d) Possible under timestamping with TS_T > TS_U; impossible otherwise.
(e) Impossible under locking and timestamping. Possible under optimistic concurrency control. Why? Because the writes don't actually reveal their changed values until the transactions commit.

4. A transaction may do a lot of work before it accesses the shared data. Its timestamp may thus be very old by the time it accesses the data, giving it a high probability of being aborted. A better time to allocate the timestamp is when the transaction first accesses data.

3.2.8 Distributed Transactions

1. Java-like interface:

TransId beginTransaction() throws TransactionException;
void write(TransId, Object, Value) throws TransactionAbortedException, TransactionException;
Value read(TransId, Object) throws TransactionAbortedException, TransactionException;
void commitTransaction(TransId) throws TransactionAbortedException, TransactionException;
void abortTransaction(TransId) throws TransactionException;

A Java equivalent may extend the Reader and Writer classes to TransactionReader and TransactionWriter, and require transaction ids to be passed in on creation and for the commit and abort operations. Flashier type systems which ensure commit and abort operations happen are good topics for research. Consistency across crashes needs recovery managers and logs.

3.3. SAMPLE EXAM QUESTION W = 5, R = >=5 W = 6, R = >=4 W = 7, R = >=3 W = 8, R = >=2 W = 9, R = >=1 Writing unavailable when any server is down.

123

3.3 Sample Exam Question

In this section, I'll introduce the standard approach taken in exam questions for this course. Each question attempts to measure the learning outcomes achieved for a particular subtopic of the course. The first part of the question attempts to measure whether you have learnt the basic knowledge, asking you to reproduce basic definitions or similar from the notes. The second part asks you to apply your knowledge in the solution of some problem, generally in a manner that you will have seen before. The final part attempts to discover whether you can extrapolate from your knowledge to new situations, or meld your knowledge with other areas.

3.3.1 Availability and Ordering

This is a sample question for 20 marks. Attempt the question using the notes, as you would for an exam.
1. Define the following:
(a) Total ordering [2 marks]
(b) Causal ordering [2 marks]
2. Describe the CBcast causal ordering protocol. [6 marks]
3. What guarantees on message delivery are required for CBcast? [3 marks]
4. A process group using CBcast has three members, 0, 1 and 2. Process 0 sends two messages with vector timestamps (6,9,10) and (7,10,11). Process 1 sends two messages with vector timestamps (5,9,10) and (5,10,10). Process 2 sends two messages with vector timestamps (6,9,11) and (6,9,12). Show a possible ordering of the transmission and delivery events at these processes. [7 marks]

3.3.2 The answer

1. (a) Total ordering - each process in the group receives messages in the same order.
(b) Causal ordering - before a message is received, each process in the group receives all messages received at the sending process before the message was sent.


2. 1 mark for mentioning vector timestamps, 2 marks for describing the vector timestamp as an array of sequence numbers per process, 2 marks for describing the conditions for delivering the message to the application, 1 mark for describing how the local vector timestamp is updated.
3. 3 marks for noting that reliable delivery is required.
4. 2 marks for a diagram as below, 5 marks depending upon how correct the answers are. If there is no diagram but the answers are perfectly correct, 7 marks would be awarded - however a mistake would then result in no marks.
[Figure: a space-time diagram of the three processes showing send and delivery events for the messages with vector timestamps (5,9,10), (5,10,10), (6,9,10), (6,9,11), (6,9,12) and (7,10,11).]
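The deliverability test that underlies part 4 can be sketched as follows. This is the standard CBcast rule as usually stated; the local vectors used in the checks below are hypothetical intermediate states, not taken from the model answer.

```python
# CBcast delivery condition: a message from sender j carrying vector
# timestamp v is deliverable at a process with local vector `local` iff it
# is the next message expected from j and nothing causally earlier is missing.

def deliverable(v, sender, local):
    if v[sender] != local[sender] + 1:
        return False               # not the next message from the sender
    return all(v[k] <= local[k] for k in range(len(v)) if k != sender)
```

For example, at a process with local vector (5,9,10), process 0's message (6,9,10) is deliverable, but its later message (7,10,11) must be buffered until the intervening messages have arrived.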
