available systems
in Erlang
Joe Armstrong
Why Erlang?
Erlang was
designed to program
fault-tolerant
systems
Saturday, March 3, 2012
Overview
Types of HA systems
Architecture/Algorithms
HA data
Types of HA
Washing machine/pacemaker
...
Internet HA
Always on-line
Soft real-time
P = probability of losing data on one machine = 10^-3
Probability of losing data with 4 machines = 10^-12
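The 10^-12 figure follows if the four machines fail independently of each other, which is an assumption the slide does not state explicitly. Under that assumption the arithmetic is:

```latex
P(\text{data lost on all } n \text{ machines}) = p^{n},
\qquad p = 10^{-3},\; n = 4
\;\Rightarrow\; p^{4} = \left(10^{-3}\right)^{4} = 10^{-12}
```

Correlated failures (shared power, shared rack, the same software bug) would make the real probability higher than this estimate.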
Where is my data?
data
Computer
Architectures/algorithms
[Figure: traditional architectures. Labels: S = Server, C = Client, Load balancer]
Chord
S1 IP = 235.23.34.12
S2 IP = 223.23.141.53
S3 IP = 122.67.12.23
..
md5(ip(s1)) = C82D4DB065065DBDCDADFBC5A727208E
md5(ip(s2)) = 099340C20A42E004716233AB216761C3
md5(ip(s3)) = A0E607462A563C4D8CCDB8194E3DEC8B
Sorted
099340C20A42E004716233AB216761C3 => s2
A0E607462A563C4D8CCDB8194E3DEC8B => s3
C82D4DB065065DBDCDADFBC5A727208E => s1
...
Main idea
Hash keys & IP addresses into
the same namespace
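A minimal sketch of this idea, written in Python for brevity rather than Erlang. It hashes the three server IPs from the slide with MD5, sorts the hashes, and assigns a key to the first server whose hash is greater than or equal to the key's hash, wrapping around at the end of the ring. The function names and the key `"mail-23412"` are hypothetical, the hashes it computes are not claimed to match the values printed on the slide, and it covers only the placement rule, not Chord's routing tables.

```python
import bisect
import hashlib


def md5_hex(text):
    """Return the upper-case hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


def build_ring(servers):
    """Hash every server IP and return sorted (hash, server_name) pairs."""
    return sorted((md5_hex(ip), name) for name, ip in servers.items())


def successor(ring, point):
    """Return the first server whose hash is >= point, wrapping around."""
    hashes = [h for h, _ in ring]
    # Digests are fixed-length hex strings, so string order equals numeric order.
    index = bisect.bisect_left(hashes, point)
    return ring[index % len(ring)][1]


def lookup(ring, key):
    """Hash the key into the same namespace as the servers and place it."""
    return successor(ring, md5_hex(key))


# Server IPs taken from the slide; the key is a made-up example.
servers = {"s1": "235.23.34.12", "s2": "223.23.141.53", "s3": "122.67.12.23"}
ring = build_ring(servers)
owner = lookup(ring, "mail-23412")
print(owner)
```

Because keys and servers share one namespace, any client that knows the server list can compute where a key lives without asking a central directory.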
Failure probabilities
So making 5
replicas takes
the same time
as two
The problem of
reliable storage
of data
has been solved
How do we
write
the
code?
SIX RULES
ONE
ISOLATION
Isolation
TWO
CONCURRENCY
Concurrency
World is concurrent
THREE
MUST
DETECT FAILURES
Failure detection
FOUR
FAULT
IDENTIFICATION
Fault Identification
FIVE
LIVE
CODE
UPGRADE
SIX
STABLE
STORAGE
Stable storage
QUOTES
GRAY
As with hardware, the key to software fault-tolerance is to
hierarchically decompose large systems into modules, each module being
a unit of service and a unit of failure. A failure of a module does
not propagate beyond the module.
...
The process achieves fault containment by sharing no state with
other processes; its only contact with other processes is via messages
carried by a kernel message system
Jim Gray
Why do computers stop and what can be done about it?
Technical Report 85.7, Tandem Computers, 1985
GRAY
Fail fast
The process approach to fault isolation advocates that the process
software be fail-fast, it should either function correctly or it
should detect the fault, signal failure and stop operating.
Processes are made fail-fast by defensive programming. They check
all their inputs, intermediate results and data structures as a matter
of course. If any error is detected, they signal a failure and stop. In
the terminology of [Cristian], fail-fast software has small fault
detection latency.
Gray
Why ...
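The fail-fast behaviour Gray describes can be sketched as follows, in Python for brevity rather than Erlang. The `Failure` exception and the `transfer` function are hypothetical names invented for this illustration. The function checks its inputs, checks an invariant on its intermediate result, and signals failure and stops as soon as anything is wrong instead of continuing with suspect data.

```python
class Failure(Exception):
    """Signals that a fault was detected and the operation stopped."""


def transfer(balances, src, dst, amount):
    """Move `amount` (integer cents) from src to dst, or fail fast."""
    # Check all inputs before doing any work.
    if src not in balances or dst not in balances:
        raise Failure("unknown account")
    if type(amount) is not int or amount <= 0:
        raise Failure("amount must be a positive integer")
    if balances[src] < amount:
        raise Failure("insufficient funds")

    total_before = sum(balances.values())

    # Work on a copy so a detected fault leaves the caller's data untouched.
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount

    # Check the intermediate result: money must be conserved.
    if sum(updated.values()) != total_before:
        raise Failure("invariant violated: total changed")
    return updated


result = transfer({"a": 100, "b": 0}, "a", "b", 30)
print(result)
```

The error is raised at the point where the fault is first visible, which keeps the fault detection latency small.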
Fail early
A fault in a software system can cause one or more
errors. The latency time which is the interval between
the existence of the fault and the occurrence of the
error can be very high, which complicates the
backwards analysis of an error ...
For an effective error handling we must detect errors and
failures as early as possible
Renzel, Error Handling for Business Information Systems,
Software Design and Management GmbH & Co. KG, München, 2003
KAY
Folks -- Just a gentle reminder that I took some pains at the last OOPSLA to
try to remind everyone that Smalltalk is not only NOT its syntax or
the class library, it is not even about classes. I'm sorry that I long ago
coined the term "objects" for this topic because it gets many people to
focus on the lesser idea.
The big idea is "messaging" -- that is what the kernel of Smalltalk/
Squeak is all about (and it's something that was never quite completed
in our Xerox PARC phase)....
http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
SCHNEIDER
Halt on failure: in the event of an error, a processor
should halt instead of performing a possibly erroneous
operation.
Failure status property: when a processor fails,
other processors in the system must be informed. The
reason for failure must be communicated.
Stable storage property: the storage of a processor
should be partitioned into stable storage (which
survives a processor crash) and volatile storage (which
is lost if a processor crashes).
Schneider
ACM Computing Surveys 22(4):299-319, 1990
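The failure status property can be illustrated with a small sketch, here in Python for brevity rather than Erlang. A worker that fails halts immediately, and a message carrying the reason is delivered to whoever is watching it. The `monitored` helper, the `"DOWN"` tag, and the worker name are hypothetical names chosen for this example. Python threads share memory, so this shows only the notification pattern and not real isolation; Erlang provides links and monitors between isolated processes for this purpose.

```python
import queue
import threading


def monitored(name, fn, mailbox):
    """Run fn in a thread and report how it ended to the watcher's mailbox."""
    def run():
        try:
            fn()
            mailbox.put(("DOWN", name, "normal"))
        except Exception as exc:
            # Halt on failure: stop here and tell the watcher why.
            mailbox.put(("DOWN", name, repr(exc)))

    thread = threading.Thread(target=run, name=name)
    thread.start()
    return thread


def crashing_worker():
    raise ValueError("bad input")


mailbox = queue.Queue()
worker = monitored("worker-1", crashing_worker, mailbox)
worker.join()

# The watcher learns that the worker stopped and the reason for it.
tag, who, reason = mailbox.get(timeout=1)
print(tag, who, reason)
```

The watcher decides what to do next (restart, escalate, log); the failed worker does no recovery of its own.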
ARMSTRONG
Programming
How do we program
our six rules?
Use a library?
Erlang was
designed
to program
fault-tolerant
systems
Rule 1 = Isolation
Rule 2 = Concurrency
A is a black box.
It might be an entire machine.
If an entire machine crashes,
another machine must fix the problem.
bar(X) ->
...
Erlang
Properties
No sharing
Hot code replacement
Pure message passing
No locks
Lots of computers (= fault tolerant scalable ...)
Functional programming (controlled side effects)
What is COP?
Machine
Process
Message
No Mutable State
Projects
CouchDB
Amazon SimpleDB
MochiWeb (Facebook chat)
Scalaris
Nitrogen
ejabberd (XMPP)
RabbitMQ (AMQP)
Riak
Companies
Ericsson
Amazon
Tail-f
Klarna
Facebook
...
Books
http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf
QUESTIONS