A Brief Introduction To High Assurance Cloud Computing With Isis2

7/11/2013
Ken Birman
Cornell University

About the Lecturer

An introduction to the lecturer

Ken Birman
Researcher in high assurance computing since joining Cornell in 1982 (PhD, U.C. Berkeley).
Currently Cornell's N. Rama Rao Professor of Computer Science.
ACM Fellow, winner of the IEEE Tsutomu Kanai Award.
Built the distributed software infrastructure used for a decade by the New York Stock Exchange, and still used in the French Air Traffic Control System, the US Navy AEGIS and several other mission-critical systems.
Contact information at http://www.cs.cornell.edu/ken
Segment I: The Cloud Landscape

Introducing terminology
Informal description of goals

High Assurance and the Cloud

Cloud Computing: The new universal standard
  A technology for federating network services
  Easy to share data, deeply integrated with web pages
  Supports a wide range of media types
But the cloud can't offer high assurance today!
  A wave of sensitive applications is approaching (areas like mHealth, smart power grid, eBanking, smart cars...)
  They need strong guarantees... what can we do to help?

How does today's cloud work?

Client platform: browsers and apps, which are programs that exploit a stripped-down browser API
Internet transports the data
Data centers run web services that produce the pages we see, stream videos, etc.
Each step embodies weaknesses

The client system is vulnerable to loss of connectivity, compromise by downloaded code, and infection by viruses and worms.
The Internet layer is potentially unreliable
  The mapping of domain names to IP addresses is very complex (a consequence of the cloud's need to steer traffic)
  Network reliability is much lower than it needs to be
  Much too easy to snoop on traffic or attack connections
The Web Services infrastructure can fail or reconfigure abruptly, forcing the client to reconnect

Recipe for high assurance

Design a system to fail only in safe ways
  Nobody gets hurt, but perhaps the system reports that it has gone offline
Then do everything practical to enhance reliability, consistency, security, other needed properties
Today: Focus on the web services running on the cloud data center
Tradeoffs in cloud space

The properties we need are in tension!

Snappy response: Every 100ms matters
Elasticity: Load varies suddenly and dramatically, service replication levels need to vary accordingly
Consistency: If distinct service replicas talk to multiple clients about something, they don't say contradictory things.
Fault-tolerance: If a replica crashes, the cloud self-heals
Attack-tolerance: The service is very hard to attack.
Security: Authenticated clients are limited to performing authorized actions in accordance with a policy
Privacy: I can control who uses my data and how

[Slide annotations beside this list: "Required" and "Often weak or lacking"]

Today's cloud: As fast as possible

In the race to offer the fastest possible services to the largest possible number of clients, today's cloud often gives up on other assurance properties
In some sense the cloud is insecure and inconsistent by design!
... but does it have to be that way?

Tomorrow: A high assurance cloud!

A single system needs to tell multiple kinds of assurance stories and not all in the same way
An mHealth application:
  Needs to reassure the user that it is trustworthy
  Needs to help the developer make the right choices
  Must implement complex protocols correctly
  Must be a good citizen on the cloud data center
Segment II: Examples

A few slides each on some challenging problems
Each needs the cloud... but each needs some form of strong assurance guarantee too
Example 1: Power grid

Today's power grid has serious issues
  Wasteful: As much as 15% of power is lost just moving it around, and a great deal of renewable energy (solar, wind, tides) is lost because of poor integration with the standard grid
  Rigid: Ideally, the grid should adapt and move parcels of power much as the Internet moves packets.
  Dumb: even when it is obvious that we could optimize behavior, the grid uses old, inefficient techniques
Goal: A smart power grid!

How a small power grid operates

Power flows like water
  Path of least resistance
  Governed by Kirchhoff's Law
Power enters at every generator, exits at every load
Hierarchical structure:
  Primary power busses
  Secondary smaller local feeds

[Figure: 10-Generator, 39-bus New England System]
Technology to enable a smart grid

We'll need to monitor power loads, frequency, current in real-time, reliably and securely
Use this data to estimate the state of the grid and to predict its evolution over time
Use those predictions to plan control actions: increase/decrease generation, borrow reactive power from neighboring regions, adapt pricing, etc.
Ultimately the grid will become a new kind of network. But it must also be safe, efficient, and secure against both mishaps and attack!

Even mundane problems can hurt

California: Repeated episodes of market manipulation aimed at increasing profits for companies such as Enron that speculate on pricing
Multi-state and multi-national rolling outages
  Causes turmoil for air traffic, ground traffic, telephone outages
Will smartness also make the grid more fragile? Risk of cyberattacks?
Control of the smart power grid

Suppose that a cloud control system speaks with two voices
In physical infrastructure settings, consequences can be very costly

[Figure: the control system issues two contradictory messages, "Canadian 50KV bus going offline" and "Switch on the 50KV Canadian bus". Bang!]
Power grid summary

To make it smart we need to monitor at a massive scale and use that to initiate control actions
But for this to be safe, we need more than fast response and elasticity
  We also need security (so that attackers can't take the grid down)
  ... and consistency (as we just saw)
  ... and fault-tolerance (since power systems often experience failures of various kinds)

Example 2: mHealth

A term for everything outside the doctor's office (but might be linked to electronic health records)
Goal is to make your life better and healthier
  Encourage activity
  Discourage poor nutrition choices
  Help patients with chronic conditions manage their complex medical devices and medications
  Offer caregivers a window into health so that the patient can maintain independence

What properties are needed in remote medical care systems?

[Figure: home healthcare scenario. Labels as extracted:]
Motion sensor, fall-detector
Medication station tracks, dispenses pills
Integrated glucose monitor and insulin pump receives instructions wirelessly
Home healthcare application
Cloud Infrastructure
Healthcare provider monitors large numbers of remote patients
Patient Records DB
"Mrs. Marsh has been dizzy. Her stomach is upset and she hasn't been eating well, yet her blood sugars are high."
"Let's stop the oral diabetes medication and increase her insulin, but we'll need to monitor closely for a week"

Durability... scalability... fast response
Need: Strong consistency and durability for data
What do these terms mean?

Consistency: Even if accessed by multiple users concurrently, the data looks like a single database
  This sounds like it should obviously be true, but when the data is spread over multiple computers, if they don't coordinate their actions, consistency can easily be violated
  For example, perhaps machine 1 shows updates machine 2 never saw. Perhaps machine 3 sees all the updates but has the order confused. Each of these cases can cause serious inconsistencies.
Durability: Even if system components crash and then recover later, data will not be lost.
  Updates confuse things: before the update occurs, clearly it isn't durable
  After the update is finished, it must have durable effect
  Question to pose: exactly when did it need to be durable?
  Usual answer: If the effect of an update survives a crash, then the update itself should also survive the crash
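
The consistency failures described above (one machine never sees an update, another applies the same updates in a different order) can be reproduced with a few lines of ordinary C#. This is only a stand-alone sketch: the `ReplicaDemo` class, the `Update` type and the `insulin-units` key are made up for illustration and are not part of Isis2.

```csharp
using System;
using System.Collections.Generic;

// Stand-alone model of three replicas; nothing here is part of the Isis2 API.
public static class ReplicaDemo
{
    // One update: set Key to Value.
    public struct Update
    {
        public readonly string Key;
        public readonly double Value;
        public Update(string key, double value) { Key = key; Value = value; }
    }

    // Apply the updates in the order given and return the resulting state.
    public static Dictionary<string, double> Apply(IEnumerable<Update> updates)
    {
        var state = new Dictionary<string, double>();
        foreach (var u in updates)
        {
            state[u.Key] = u.Value;   // the last write to a key wins
        }
        return state;
    }

    public static void Main()
    {
        // Hypothetical key and values, chosen only for the example.
        var first = new Update("insulin-units", 10.0);
        var second = new Update("insulin-units", 14.0);

        var machine1 = Apply(new[] { first, second });  // saw both, in order
        var machine2 = Apply(new[] { first });          // never saw the second update
        var machine3 = Apply(new[] { second, first });  // saw both, order confused

        Console.WriteLine("machine 1: " + machine1["insulin-units"]);
        Console.WriteLine("machine 2: " + machine2["insulin-units"]);
        Console.WriteLine("machine 3: " + machine3["insulin-units"]);
    }
}
```

Only machine 1 ends with the intended value of 14; machines 2 and 3 both end with 10, each for a different reason. A consistency model that delivers the same updates in the same order to every replica rules out both cases.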
What do these terms mean? (continued)

Scalability: As we make the system larger, performance remains good
  It needs to be able to support large numbers of clients and run on large numbers of cloud computing systems
Fast response: Queries shouldn't delay for long. Updates should have rapid effect on the data.

Guarantees versus best effort

Today's cloud systems work well in all of these ways but without providing strong guarantees, except in certain very specialized cases, like Google's new Spanner database
Our challenge: can normal people who aren't on the Google Spanner development team also create trustworthy cloud computing solutions?
mHealth summary

The needs of the system vary depending on what part of the system we focus on
  In our example, some aspects need durability in the sense of a logged database update, while others might accept durability through in-memory replication
  This illustrates one of many such tradeoffs
If we had more time we could identify a number of additional issues of this kind

How The Cloud Was Built

It is very hard to create software to run in cloud computing systems
  Everything must be automated
  You must follow many rules and use many packages
So open source tools have become popular
  Examples: Hadoop (a version of MapReduce), Zookeeper, Graphlab, Pregel, Vowpal Wabbit, global file systems like GFS, etc.
In this short class we will focus on process group tools and will use Isis2 as our main example.
An obsession with speed...

At very large scale, either a thing is extremely fast, or unacceptably slow
So everything we do must be shaped by speed!
  High assurance is not an option if the solution would be dramatically slower
  For example, the cloud computing community avoids databases. They founded the NoSQL movement (storage, but not as strong as a SQL database) for this reason.
Similarly we must have speed in mind at all times!

Concept: Critical paths

To understand speed, understand the limiting factors
This forces us to think about critical paths
What limits responsiveness?

Top priority: delay until a client receives a reply
Critical path traces actions that contribute to this delay

[Figure: a request ("Update the monitoring and alarms criteria for Mrs. Marsh as follows") reaches a service instance, which answers "Confirmed". The critical path through the service instance determines the service response delay. Response delay seen by end-user would include Internet latencies.]

Critical path with complex services?

When we replicate information but want to be sure the data won't be lost, critical path extends into the replicas

[Figure: the same request and "Confirmed" reply, with the critical path extending from the service instance into its replicas. Response delay seen by end-user would include Internet latencies.]
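
The two critical-path pictures above can be turned into a small back-of-the-envelope model. The sketch below is not Isis2 code; it assumes, for illustration only, that the service updates its replicas in parallel and waits for every acknowledgement before answering, and all the millisecond figures in `Main` are invented.

```csharp
using System;
using System.Linq;

// Stand-alone timing model; names and numbers are hypothetical.
public static class CriticalPath
{
    // Delay until the service can answer "Confirmed".
    // With no replicas the critical path is just the local work; with replicas
    // (contacted in parallel) it extends to the slowest acknowledgement.
    public static double ServiceDelayMs(double localWorkMs, double[] replicaAckMs)
    {
        double waitForReplicas = replicaAckMs.Length == 0 ? 0.0 : replicaAckMs.Max();
        return localWorkMs + waitForReplicas;
    }

    // The end user additionally pays the Internet round trip.
    public static double EndUserDelayMs(double internetRoundTripMs, double serviceDelayMs)
    {
        return internetRoundTripMs + serviceDelayMs;
    }

    public static void Main()
    {
        double unreplicated = ServiceDelayMs(2.0, new double[0]);
        double replicated = ServiceDelayMs(2.0, new[] { 1.5, 0.8, 6.0 });

        Console.WriteLine("service delay, no replicas:   " + unreplicated + " ms");
        Console.WriteLine("service delay, three replicas: " + replicated + " ms");
        Console.WriteLine("end-user delay, three replicas: " + EndUserDelayMs(40.0, replicated) + " ms");
    }
}
```

In this model a single slow replica (6.0 ms here) sets the service delay for everyone, which is why delays on the critical path deserve the most attention.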
Why do critical paths matter?

When we build complex systems it is hard to imagine how they will behave when we run them
By thinking about the critical performance-limiting paths, we can focus our attention on specific elements and not think about the whole system
By avoiding delays on the critical path, we bring benefits to the whole system!

There are many critical applications

Cloud-hosted system to control transportation (think of Google's smart cars)
  The cars have autonomy but they depend on data from the cloud and would have a much harder challenge if that data couldn't be trusted
Banking systems
  Today's online banking systems are growing, but as this happens, more and more security issues arise
Process control
  Chemical refineries, manufacturing plants, ...

And they come with similar stories

In each case we can identify properties that are
  Absolutely needed for a cloud deployment
  Absolutely needed for safety
And beyond that we might have other assurance properties that a particular use case doesn't need
The challenge will be to analyze each application, and then to translate its needs into cloud solutions

Segment III: Consistency

We'll drill down on the tradeoffs between durability and consistency
Many cloud systems believe that consistency isn't possible: CAP theorem
Yet consistency underlies so many other guarantees
Virtual synchrony model
We're going to drill down

... on data and service replication
Replication is at the center of cloud computing:
  With many replicas a service can handle many clients
  And those replicas need as much of the critical data to be local as possible
So replication is a key technology.
It even underlies security: we need to replicate the policy database and certificates that identify principals (clients, servers, etc.)

Consistency for replication

There are many ways to replicate information
But it becomes tricky if the data or even the service evolves over time.
  Replication of changing data can leave a confusing mess if a request encounters stale versions.
  In some situations these errors can harm the client.
  In others, they could cause security violations.
What do we mean by consistency?

A consistent distributed system will often have many components, but users observe behavior indistinguishable from that of a single-component reference system.
Our power system example illustrated a form of inconsistency ("Bang!")

Theory of Consistency

There are some famous impossibility results
  Fischer, Lynch and Paterson: the FLP theorem proves that any correct fault-tolerant protocol strong enough to solve consensus (a form of agreement) can also wedge in the event of certain sequences of failures. But those sequences turn out to be very rare.
  Brewer's CAP theorem posits that you can only have two from {Consistency, Availability and Partition Tolerance}. But the proof holds only for a service running in a WAN, not for one in a single data center.
Relate consistency to speed?

How costly is strong consistency?
  The cloud computing community debates this topic! It is a very contemporary question
We usually pose the question in connection to replicating data.
  Strongly consistent data means guaranteed to be correct and current. Can cloud systems afford strong consistency?
  Weakly consistent data means best effort but can have mistakes. Facebook, eBay, Google all use weak consistency

We will learn more about these topics

In today's lecture we won't drill down
But in lecture 4 we will look more closely at these theoretical questions
  Mathematics is a valuable tool for cloud computing
  By making a correspondence of computing ideas to mathematics we can reason more rigorously
  Yet we will also find that some of the existing theory has limitations of its own
Segment IV: Isis2

How does consistency look to the end user?
What is it like to program with a powerful high assurance library like Isis2?

Isis2 System

A prebuilt technology that automates many of the hard tasks involved in replicating services and the data on which they depend
Targets cloud computing settings
Available in open-source from isis2.codeplex.com
Intended to be easy to use but still at an early stage of development
Isis2 System

C# library (but callable from any .NET language) offering replication techniques for cloud computing developers
Based on a model that fuses virtual synchrony and state machine replication models
Research challenges center on creating protocols that function well despite cloud events
  Elasticity (sudden scale changes)
  Potentially heavy loads
  High node failure rates
  Concurrent (multithreaded) apps
  Long scheduling delays, resource contention
  Bursts of message loss
  Need for very rapid response times
  Community skeptical of assurance properties

Isis2 makes developer's life easier

Benefits of Using Formal model
  Formal model permits us to achieve correctness
  Isis2 is too complex to use formal methods as a development tool, but the model does facilitate debugging (model checking)
  Think of Isis2 as a collection of modules, each with rigorously stated properties
Importance of Sound Engineering
  Isis2 implementation needs to be fast, lean, easy to use
  Developer must see it as easier to use Isis2 than to build from scratch
  Seek great performance under cloudy conditions
  Forced to anticipate many styles of use

    // Handle for the group named "myGroup", plus this member's local copy of the data
    Group g = new Group("myGroup");
    Dictionary<string,double> Values = new Dictionary<string,double>();

    // Runs each time the group membership (the view) changes
    g.ViewHandlers += delegate(View v) {
        Console.Title = "myGroup members: " + v.members;
    };
    // Runs in every member when an UPDATE multicast arrives
    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };
    // Runs in every member when a LOOKUP query arrives
    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };
    g.Join();

    // Multicast an update, then query all members and collect their replies
    g.Send(UPDATE, "Harry", 20.75);
    List<double> resultlist = new List<double>();
    int nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

First sets up group
Join makes this entity a member. State transfer isn't shown
Then can multicast, query. Runtime callbacks to the delegates as events arrive
Easy to request security (g.SetSecure), persistence
Consistency model dictates the ordering seen for event upcalls and the assumptions user can make
Concept: A multi-query

Our lookup is
  Multicast to the group
  All members respond
A chance for parallelism
  Each can do part of the job: e.g. search 1/nth of a database
  Reduces response delays
With n replicas... we get an n times speedup!

[Figure: a front end multicasts "Lookup Harry in the Ithaca phone directory" to the group; the replies combine into "Names with Harry in them: ...."]
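
The "each member searches 1/nth of the database" idea above can be sketched without any group library. In the stand-alone C# below, the `MultiQuery` class, the rank-based split and the sample directory are all assumptions made for illustration; the loop over ranks stands in for members that would really work in parallel, so the n-times speedup is the ideal case, not something this sketch measures.

```csharp
using System;
using System.Collections.Generic;

// Stand-alone model of a multi-query; not Isis2 code.
public static class MultiQuery
{
    // Work done by member number `rank` (0-based) out of `n` members:
    // it scans only entries rank, rank + n, rank + 2n, ...
    public static List<string> SearchShare(IList<string> directory, string needle, int rank, int n)
    {
        var hits = new List<string>();
        for (int i = rank; i < directory.Count; i += n)
        {
            if (directory[i].Contains(needle)) hits.Add(directory[i]);
        }
        return hits;
    }

    // Front end: gather every member's partial answer into one reply.
    public static List<string> Query(IList<string> directory, string needle, int n)
    {
        var all = new List<string>();
        for (int rank = 0; rank < n; rank++)
        {
            all.AddRange(SearchShare(directory, needle, rank, n));
        }
        all.Sort(StringComparer.Ordinal);
        return all;
    }

    public static void Main()
    {
        // Made-up directory entries.
        var directory = new List<string>
        {
            "Ann Lee", "Harry Smith", "Bob Harryman", "Carol Diaz", "Harry Jones", "Dan Wu"
        };
        foreach (var name in Query(directory, "Harry", 3))
        {
            Console.WriteLine(name);
        }
    }
}
```

Because the shares are disjoint and together cover every entry, the combined answer is the same for any number of members; only the amount of work per member changes.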
Our example was overly simple

... it didn't show the state transfer code
  Corresponds to the white arrows in time-line figure
  In Isis2 we have a way to make checkpoints
State transfer: Some active member makes a checkpoint, and the joiner loads the state from it.
The code looks like other operations in our example
Checkpoints can also be used to save group state during periods when all members are inactive

[Figure: time-line of group members p, q, r, s, t; time axis from 0 to 70]

Adding security: Just one line!

    // Group creation and view handler as in the earlier example
    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };
    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };
    g.SetSecure(myKey);   // the one added line
    g.Join();
    g.Send(UPDATE, "Harry", 20.75);

Easy to request security, persistence, tunnelling on TCP...
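
The state transfer step described above (an active member makes a checkpoint and the joiner loads it) can be illustrated in plain C#. This sketch does not use the Isis2 checkpoint API, which the slides do not show; the `CheckpointDemo` class and its tab-separated text format are assumptions chosen only to keep the example short, and it assumes keys contain no line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Stand-alone illustration of checkpoint-based state transfer; not Isis2 code.
public static class CheckpointDemo
{
    // Active member: serialize the replicated state, one "key<TAB>value" line per entry.
    public static string MakeCheckpoint(Dictionary<string, double> values)
    {
        var writer = new StringWriter();
        foreach (var kv in values)
        {
            writer.WriteLine(kv.Key + "\t" + kv.Value.ToString("R", CultureInfo.InvariantCulture));
        }
        return writer.ToString();
    }

    // Joining member: rebuild the state from the checkpoint before handling new events.
    public static Dictionary<string, double> LoadCheckpoint(string checkpoint)
    {
        var values = new Dictionary<string, double>();
        var reader = new StringReader(checkpoint);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;
            int tab = line.LastIndexOf('\t');
            string key = line.Substring(0, tab);
            values[key] = double.Parse(line.Substring(tab + 1), CultureInfo.InvariantCulture);
        }
        return values;
    }

    public static void Main()
    {
        var active = new Dictionary<string, double> { { "Harry", 20.75 }, { "Sally", 3.5 } };
        var joiner = LoadCheckpoint(MakeCheckpoint(active));
        Console.WriteLine("joiner now holds " + joiner.Count + " entries; Harry = " + joiner["Harry"]);
    }
}
```

The same checkpoint string could be written to disk, which matches the remark that checkpoints can also preserve group state while all members are inactive.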
Some uses for process groups

To replicate data maintained by the members in memory
To replicate actions taken on an external service such as a replicated database
To ensure that all replicas are configured the same way
To coordinate the processing of requests and load-balance
To offer a way to parallelize processing by having each group member do part of the work
Fault-tolerance via a backup scheme

Isis2 Summary

A library that you can invoke from a normal program written in a normal way
It does the work of creating groups and sending multicasts and ensuring that the consistency model will be enforced
The developer just tells it what to do.
  She thinks about a parallel distributed application.
  Virtual synchrony eliminates many hard problems
Why not build it yourself from scratch?

These systems are complex, especially if you want to run on platforms like EC2
By using Isis2 you inherit 30 years of research on how to make it work

SafeSend and Send are two of the protocol components hosted over what we call the large-scale properties sandbox. The sandbox addresses issues like flow control, security, etc. All protocols share and benefit from those properties
The SandBox itself is mostly composed of convergent protocols that use probabilistic methods

[Figure: Isis2 architecture. Labels as extracted:]
Isis2 user objects; other group members
Send, CausalSend, OrderedSend, SafeSend, Query...
Isis2 library: group instances and multicast protocols; flow control; group membership; reliable sending; fragmentation; platform security; group security; TCP tunnels (overlay)
Membership Oracle; Large Group Layer; Dr. Multicast; Views
Oracle membership; self-stabilizing bootstrap protocol
Sense runtime environment; message library; socket mgt/send/rcv; wrapped locks; report suspected failures; bounded buffers

Why focus on Isis2?

This is a good question to ask
In fact we could focus on any of a number of other technologies, including other multicast products
  Such as Spread, JGroups, C-Ensemble...
But Isis2 is open source and specifically designed for cloud settings. (Also, Ken built it!)
So since our class is short, we will look at Isis2 examples
Segment V: Performance

Can Isis2 applications achieve the kinds of scalable performance and elasticity required in large cloud deployments?

Revisit our notion of consistency

Let's look again at our mHealth example
We want the best possible performance but we also want to be sure that the application is safe for this kind of use
  We need consistency, yet also need snappy response and elasticity, especially in the monitoring component
  After all, it continuously monitors huge numbers of patients.
What limits scalability?
Speed of updates

Isis2 offers several ways to do updates (we will visit them more carefully later)
They have big performance implications
But speed can have more than one definition!

Example: Speed of updates

Isis2 offers many ways to do updates
  RawSend, Send, CausalSend, OrderedSend, SafeSend
  Each has different consistency / durability guarantees
As a developer, you'll want to use the fastest option that is still safe in your setting
  ... Hence will need to understand how each works
  ... and how fast each solution will be
Today we'll just look at this superficially
Isis2: Send vs. in-memory SafeSend

[Graph comparing Send with in-memory SafeSend]
Send scales best, but SafeSend with in-memory (rather than disk) logging and small numbers of acceptors isn't terrible.

Latency, ops/second

Latency: Delay before external user sees action
Ops/second: total throughput
For most purposes systems like Isis2 offer basic performance of about 1000 ops/second
But by grouping requests into batches of ~50/request, services that can support ~50,000 ops/second are feasible
Building them is challenging, but we won't focus on that engineering topic in these lectures
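
The batching arithmetic quoted above (about 1000 operations per second, batches of ~50, hence ~50,000 per second) can be sketched as follows. The reading that "~50/request" means roughly 50 client requests carried by one batched operation is an assumption, and the `Batching` class is a stand-alone illustration, not part of Isis2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-alone illustration of request batching; not Isis2 code.
public static class Batching
{
    // Client requests served per second when each underlying operation carries one batch.
    public static int RequestsPerSecond(int operationsPerSecond, int batchSize)
    {
        return operationsPerSecond * batchSize;
    }

    // Group pending requests into batches of at most batchSize, preserving order.
    public static List<List<T>> MakeBatches<T>(IEnumerable<T> requests, int batchSize)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");
        var batches = new List<List<T>>();
        var current = new List<T>();
        foreach (var request in requests)
        {
            current.Add(request);
            if (current.Count == batchSize)
            {
                batches.Add(current);
                current = new List<T>();
            }
        }
        if (current.Count > 0) batches.Add(current);   // final, partly filled batch
        return batches;
    }

    public static void Main()
    {
        Console.WriteLine("requests/second: " + RequestsPerSecond(1000, 50));
        var batches = MakeBatches(Enumerable.Range(0, 120), 50);
        Console.WriteLine("120 requests need " + batches.Count + " batches");
    }
}
```

Batching trades latency for throughput: a request may wait for its batch to fill, which is one reason speed can have more than one definition.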
Jitter: how steady are latencies?

[Graph: spread of latencies for Send and SafeSend]
The spread of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays

Flush delay as function of shard size

[Graph: flush delay as a function of shard size]
Flush is fairly fast if we only wait for acks from 3-5 members, but slow if we wait for all members. Isis2 lets the developer set the threshold.
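
Why the acknowledgement threshold matters so much can be seen from a simple model: if a flush completes once k acknowledgements have arrived, its delay is the k-th smallest acknowledgement delay. The C# below is a stand-alone sketch with invented delay figures; `FlushModel` is a hypothetical name, and the way Isis2 actually exposes the threshold is not shown in these slides.

```csharp
using System;

// Stand-alone model of waiting for k-of-n acknowledgements; not Isis2 code.
public static class FlushModel
{
    // Time until `threshold` acknowledgements have arrived,
    // that is, the threshold-th smallest acknowledgement delay.
    public static double FlushDelayMs(double[] ackDelaysMs, int threshold)
    {
        if (threshold < 1 || threshold > ackDelaysMs.Length)
        {
            throw new ArgumentOutOfRangeException("threshold");
        }
        var sorted = (double[])ackDelaysMs.Clone();   // leave the caller's array untouched
        Array.Sort(sorted);
        return sorted[threshold - 1];
    }

    public static void Main()
    {
        // Invented delays: most members answer quickly, two are slow (scheduling delay, load).
        double[] acks = { 2.0, 3.0, 3.0, 4.0, 5.0, 40.0, 120.0 };

        Console.WriteLine("wait for 3 acks:   " + FlushDelayMs(acks, 3) + " ms");
        Console.WriteLine("wait for all acks: " + FlushDelayMs(acks, acks.Length) + " ms");
    }
}
```

With these numbers, waiting for three members costs 3 ms while waiting for all seven costs 120 ms: the slowest member alone decides the delay when the threshold equals the group size.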
Cornell (Birman): No distribution restrictions.
So I want Send+Flush, right?

The problem is that the different solutions offer different guarantees
  The fastest solutions have weaker guarantees
  Using them safely involves understanding these properties in order to decide whether they are good enough for the desired purpose
But there are subtle issues we don't have time to discuss in today's lecture. We will revisit tomorrow.

Raw speed isn't the whole story!

When building a system such as this we need to look at performance but also at steady behavior
Here's an example of a problem we ran into when doing the experiments I just showed you
As we'll see, Isis2 had an instability. We think we've fixed it but it illustrates an important point
The experiment we did

We made a timeline picture from left to right
One node (the bottom one) sends multicasts
The others log the time of receipt
We graphed the delay, sorted from slowest (top) to fastest (bottom) delays
Here's what we saw

Debugging: Stabilization bug

[Graph: multicast delivery delays, before the fix]
Birman: DARPA MRC Kickoff, Washington, Nov 3-4 2011

As the application ran, it slowed down!

At first the system was fast: even the slowest nodes at the top had short delays
But within a few multicasts they slowed down
Then something resets them and they speed up

Debugging: Stabilization bug fixed

[Graph: multicast delivery delays, after the fix]
We tracked it down to a problem with garbage collection in our system
Modifying that protocol helped smooth things out
Debugging: 358-node run slowdown
[Graph: 358-node run slowdown]
[Graph: 358-node run slowdown: Zoom in]
[Graph: 358-node run slowdown: Filter]

Summary of insights from example?

Tools like Isis2 enable us to build cloud-scale replication based services with strong guarantees
But today, at least, they demand a lot from the developer, who needs to really understand the choices and their implications
As Isis2 evolves, this problem will be reduced: the system will eventually automate many decisions, including picking the right update primitives for you
Segment VI: Conclusions

We've scratched the surface but there is much more to be explored
Cornell's high assurance researchers are creating solutions for tomorrow's demanding applications

Key take-away points

Cloud computing, today, isn't very friendly to high assurance applications
This is a problem because those applications are increasingly forced to migrate to the cloud for reasons of cost, scalability or just because the cloud is the dominant paradigm today
But we can already use tools like Isis2 to solve these problems and as they become easier to work with, the community able to build these solutions will grow
Key take-away points

With Isis2 we can easily create programs that run on cloud platforms like EC2 or even Android mobile
  They form into groups and coordinate or replicate data or actions via group primitives
  The concept is powerful and easily visualized
But tuning and doing sophisticated fault-tolerance remains challenging. In the remaining lectures we will explore these issues

The last word...

The word on the street is that cloud computing will rule but that the cloud can't do high assurance
But the word in the hallways at Cornell differs!
  We see Isis2 as our proof-by-demonstration that it can be done
  Even so, the engineering challenge remains enormous
Learning more

Stay in the class. We'll show you how!
Download the Isis2 system from isis2.codeplex.com
  You can access the user's manual
  The code itself (currently v2.xxx, a very stable release)
  And we maintain a discussion and issues board there

My textbook covers this topic in depth
  Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Ken Birman. Springer Verlag, February 2012.
A paper focused entirely on today's topic is:
  Overcoming CAP with Consistent Soft-State Replication. Kenneth P. Birman, D. Freedman, Q. Huang and Patrick Dowell. IEEE Computer Magazine (special issue on The Growing Impact of the CAP Theorem). Volume 12. pp. 50-58. February 2012.
You can download a copy from:
  http://www.cs.cornell.edu/projects/quicksilver/pubs.html