
Validating Computer System and Network Trustworthiness
Prof. William H. Sanders
Department of Electrical and Computer Engineering and
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
whs@uiuc.edu
www.mobius.uiuc.edu
www.perform.csl.uiuc.edu
www.iti.uiuc.edu
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 1

Course Outline
Issues in Model-Based Validation of High-Availability
Computer Systems/Networks
Combinatorial Modeling
Stochastic Activity Network Concepts
Analytic/Numerical State-Based Modeling
Case Study: Embedded Fault-Tolerant Multiprocessor System
Solution by Simulation
Symbolic State-space Exploration and Numerical Analysis of
State-sharing Composed Models
Case Study: Security Evaluation of a Publish and Subscribe
System
The Art of System Trust Evaluation / Conclusions


Slide 2

What is Validated? -- Dependability

Dependability is the ability of a system to deliver a specified service.


System service is classified as proper if it is delivered as specified; otherwise it is improper.
System failure is a transition from proper to improper service.
System restoration is a transition from improper to proper service.
[Diagram: a two-state model in which a failure transition takes the system from proper service to improper service, and a restoration transition returns it to proper service.]

The properness of service depends on the user's viewpoint!


Reference: J.C. Laprie (ed.), Dependability: Basic Concepts and Terminology, Springer-Verlag, 1992.

Slide 5

Basic Validation Terms

Measures -- What you want to know about a system. Used to determine whether a realization meets a specification.

Models -- An abstraction of the system at an appropriate level of detail for determining the desired measures about a realization.

Dependability Model Solution Methods -- The methods by which one determines measures from a model. Models can be solved by a variety of techniques:
    Combinatorial Methods -- The structure of the model is used to obtain a simple arithmetic solution.
    Analytical/Numerical Methods -- A system of linear differential equations or linear equations is constructed and solved to obtain the desired measures.
    Simulation -- The realization of the system is executed, and estimates of the measures are calculated from the resulting executions (also known as sample paths or trajectories).

Möbius supports performance/reliability/availability validation by analytical/numerical and simulation-based methods.

Slide 6

Dependability Measures: Availability


Availability - quantifies the alternation between deliveries of proper and improper service.
    A(t) is 1 if service is proper at time t, 0 otherwise.
    E[A(t)] (the expected value of A(t)) is the probability that service is proper at time t.
    A(0,t) is the fraction of time the system delivers proper service during [0,t].
    E[A(0,t)] is the expected fraction of time service is proper during [0,t].
    P[A(0,t) > t*] (0 ≤ t* ≤ 1) is the probability that service is proper more than 100t*% of the time during [0,t].
    A(0,t) as t → ∞ is the fraction of time that service is proper in steady state.
    E[A(0,t)] and P[A(0,t) > t*] as t → ∞ are defined as above.


Slide 7

Other Dependability Measures

Reliability - a measure of the continuous delivery of service.
    R(t) is the probability that a system delivers proper service throughout [0,t].

Safety - a measure of the time to catastrophic failure.
    S(t) is the probability that no catastrophic failures occur during [0,t].
    Analogous to reliability, but concerned with catastrophic failures.

Time to Failure - a measure of the time to failure from the last restoration. (The expected value of this measure is referred to as MTTF - mean time to failure.)

Maintainability - a measure of the time to restoration from the last experienced failure. (The expected value of this measure is referred to as MTTR - mean time to repair.)

Coverage - the probability that, given a fault, the system can tolerate the fault and continue to deliver proper service.


Slide 8

How is Validation Done?


[Taxonomy diagram:
Validation divides into Measurement and Modeling.
Measurement may be passive (no fault injection) or active (fault injection on a prototype); fault injection may be hardware-implemented (with or without contact) or software-implemented.
Models may be continuous-state or discrete-event (state), deterministic or non-deterministic; non-deterministic models may be probabilistic or non-probabilistic.
Model solution may be by simulation (sequential or parallel, for stand-alone systems or networks/distributed systems) or by analysis/numerical methods (state-space-based, or non-state-space-based/combinatorial).
Möbius supports model-based validation of the italicized (red) items.]
Slide 9

Integrated Validation Procedure


[Diagram: requirements (R) are decomposed into a functional model of the relevant subset of the system -- modules (ModuleA, ModuleB, ..., ModuleZ) containing elements labeled M1-M6, AA1-AA3, AP1-AP2, and L1-L3. The functional model of the system (probabilistic or logical) rests on stated assumptions and on supporting logical arguments and experimentation.]
Slide 10

Probability Review: Exponential Random Variables


An exponential random variable X with parameter λ has the CDF

    P[X ≤ t] = F_X(t) = { 0,            t ≤ 0
                        { 1 − e^(−λt),  t > 0.

The density function is given by f_X(t) = (d/dt) F_X(t):

    f_X(t) = { 0,           t ≤ 0
             { λe^(−λt),    t > 0.

Its mean is 1/λ and its variance is 1/λ².

The exponential random variable is the only continuous random variable that is
memoryless.

To see this, let X be an exponential random variable representing the time at which an event occurs (e.g., a fault arrival).
Important Fact 1: P[X > t + s | X > s] = P[X > t] (memoryless property)!
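To make the memoryless property concrete, the following short Python sketch (an illustration added here, not from the original slides) estimates both sides of Important Fact 1 by Monte Carlo; the rate λ = 0.5 and the times s and t are arbitrary choices.

    import random

    lam, s, t, n = 0.5, 2.0, 1.0, 1_000_000
    samples = [random.expovariate(lam) for _ in range(n)]

    # P[X > t + s | X > s]: among samples that survive past s,
    # the fraction that also survive past s + t.
    survive_s = [x for x in samples if x > s]
    lhs = sum(x > s + t for x in survive_s) / len(survive_s)

    # P[X > t]: the unconditional survival probability.
    rhs = sum(x > t for x in samples) / n

    print(lhs, rhs)   # both should be close to exp(-lam*t) ~ 0.6065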

Slide 12

Probability Review: Exponential Event Rate


The fact that the exponential random variable has the memoryless property indicates that the rate at which events occur is constant, i.e., it does not change over time.

Often, the event associated with a random variable X is a failure, so the event rate is often called the failure rate or the hazard rate.

The event rate of X is defined as the probability that the event associated with X occurs within the small interval [t, t + Δt], given that the event has not occurred by time t, per the interval size Δt:

    P[t < X ≤ t + Δt | X > t] / Δt.

This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the probability of the event occurring per unit of time at time t.

Important Fact 2: The exponential random variable has a constant failure rate!

Slide 13

Probability Review: Minimum of Two Independent Exponentials


Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable.

Let A and B be independent exponential random variables with rates λ and μ respectively, and define X = min{A,B}. What is F_X(t)?

    F_X(t) = P[X ≤ t]
           = P[min{A,B} ≤ t]
           = P[A ≤ t OR B ≤ t]
           = 1 − P[A > t AND B > t]
           = 1 − P[A > t] P[B > t]
           = 1 − (1 − P[A ≤ t])(1 − P[B ≤ t])
           = 1 − (1 − F_A(t))(1 − F_B(t))
           = 1 − (1 − [1 − e^(−λt)])(1 − [1 − e^(−μt)])
           = 1 − e^(−λt) e^(−μt)
           = 1 − e^(−(λ+μ)t)

Important Fact 3: The minimum of two independent exponential random variables is itself exponential, with rate the sum of the two rates!

Slide 14

Probability Review: Competition of Two Independent Exponentials


If A and B are independent and exponential with rates λ and μ respectively, and A and B are competing, then we know that one will win with an exponentially distributed time (with rate λ + μ). But what is the probability that A wins?

    P[A < B] = ∫ P[A < B | A = x] P[A = x] dx
             = ∫₀^∞ P[A < B | A = x] f_A(x) dx
             = ∫₀^∞ P[A < B | A = x] λe^(−λx) dx
             = ∫₀^∞ P[x < B] λe^(−λx) dx
             = ∫₀^∞ (1 − P[B ≤ x]) λe^(−λx) dx
             = ∫₀^∞ (1 − [1 − e^(−μx)]) λe^(−λx) dx
             = ∫₀^∞ e^(−μx) λe^(−λx) dx
             = λ ∫₀^∞ e^(−(λ+μ)x) dx = λ/(λ + μ)

Important Fact 4: If A and B are independent, competing exponentials with rates λ and μ respectively, the probability that A occurs before B is λ/(λ + μ)!
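A quick Monte Carlo sketch (again an added illustration, not part of the original slides) checks Facts 3 and 4 at once for arbitrarily chosen rates λ = 2 and μ = 3:

    import random
    from statistics import mean

    lam, mu, n = 2.0, 3.0, 1_000_000
    a = [random.expovariate(lam) for _ in range(n)]
    b = [random.expovariate(mu) for _ in range(n)]

    # Fact 3: min{A,B} should be exponential with rate lam + mu,
    # so its mean should be 1/(lam + mu) = 0.2.
    print(mean(min(x, y) for x, y in zip(a, b)))

    # Fact 4: P[A < B] should be lam/(lam + mu) = 0.4.
    print(sum(x < y for x, y in zip(a, b)) / n)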

Slide 15

Course Outline
Issues in Model-Based Validation of High-Availability
Computer Systems/Networks
Combinatorial Modeling
Stochastic Activity Network Concepts
Analytic/Numerical State-Based Modeling
Case Study: Embedded Fault-Tolerant Multiprocessor System
Solution by Simulation
Symbolic State-space Exploration and Numerical Analysis of
State-sharing Composed Models
Case Study: Security Evaluation of a Publish and Subscribe
System
The Art of System Trust Evaluation / Conclusions


Slide 16

Combinatorial Methods


Slide 17

Introduction to Combinatorial Methods


Combinatorial validation methods are the simplest kind of
analytical/numerical techniques and can be used for reliability and
availability modeling under certain assumptions.
Assumptions are that component failures are independent, and for
availability, repairs are independent.
When these assumptions hold, simple formulas for reliability and
availability exist.


Slide 18

Lecture Outline
Review definition of reliability
Failure rate
System reliability
Maximum
Minimum
k of N
Reliability formalisms
Reliability block diagrams
Fault trees
Reliability graphs
Reliability modeling process


Slide 19

Reliability

One key to building highly available systems is the use of reliable components
and systems.

Reliability: The reliability of a system at time t (R(t)) is the probability that the
system operation is proper throughout the interval [0,t].

Probability theory and combinatorics can be directly applied to reliability models.

Let X be a random variable representing the time to failure of a component. The reliability of the component at time t is given by

    R_X(t) = P[X > t] = 1 − P[X ≤ t] = 1 − F_X(t).

Similarly, we can define the unreliability at time t by

    U_X(t) = P[X ≤ t] = F_X(t).


Slide 20

Failure Rate
What is the rate at which a component fails at time t? This is the probability that a component that has not yet failed fails in the interval (t, t + Δt), as Δt → 0.

Note that we are not looking at P[X ∈ (t, t + Δt)]/Δt → f_X(t). Rather, we are seeking P[X ∈ (t, t + Δt) | X > t]/Δt:

    P[X ∈ (t, t + Δt) | X > t] = P[X ∈ (t, t + Δt), X > t] / P[X > t]
                               = P[X ∈ (t, t + Δt)] / (1 − F_X(t)),

which, per unit Δt, tends to

    r_X(t) = f_X(t) / (1 − F_X(t)).

r_X(t) is called the failure rate or hazard rate.


Slide 21

Typical Failure Rate

[Figure: the bathtub curve -- r_X(t) versus time, with a decreasing break-in region, a flat normal-operation region, and an increasing wear-out region.]

Slide 22

System Reliability
While FX can give the reliability of a component, how do you compute the
reliability of a system?
System failure can occur when one, all, or some of the components fail. If one
makes the independent failure assumption, system failure can be computed quite
simply. The independent failure assumption states that all component failures of a
system are independent, i.e., the failure of one component does not cause another
component to be more or less likely to fail.
Given this assumption, one can determine:
1) Minimum failure time of a set of components
2) Maximum failure time of a set of components
3) Probability that k of N components have failed at a particular time t.


Slide 23

Maximum of n Independent Failure Times


Let X₁, . . . , Xₙ be independent component failure times. Suppose the system fails at time S if all the components fail. Thus,

    S = max{X₁, . . . , Xₙ}.

What is F_S(t)?

    F_S(t) = P[S ≤ t]
           = P[X₁ ≤ t AND X₂ ≤ t AND . . . AND Xₙ ≤ t]
           = P[X₁ ≤ t] P[X₂ ≤ t] . . . P[Xₙ ≤ t]        (by independence)
           = F_X₁(t) F_X₂(t) . . . F_Xₙ(t)              (by definition)
           = Π_{i=1}^{n} F_Xᵢ(t)


Slide 24

Minimum of n Independent Component Failure Times


Let X₁, . . . , Xₙ be independent component failure times. A system fails at time S if any of the components fail. Thus,

    S = min{X₁, . . . , Xₙ}.

What is F_S(t)?

    F_S(t) = P[S ≤ t] = P[X₁ ≤ t OR X₂ ≤ t OR . . . OR Xₙ ≤ t]

Trick: If Aᵢ is an event and Āᵢ is its set complement, such that Aᵢ ∩ Āᵢ = ∅ and Aᵢ ∪ Āᵢ = Ω, then

    P[A₁ OR A₂ OR . . . OR Aₙ] = 1 − P[Ā₁ AND Ā₂ AND . . . AND Āₙ].

This is an application of the law of total probability (LOTP).

Slide 25

Minimum cont.
    F_S(t) = P[X₁ ≤ t OR X₂ ≤ t OR . . . OR Xₙ ≤ t]
           = 1 − P[X₁ > t AND X₂ > t AND . . . AND Xₙ > t]               (by trick)
           = 1 − P[X₁ > t] P[X₂ > t] . . . P[Xₙ > t]                     (by independence)
           = 1 − (1 − P[X₁ ≤ t])(1 − P[X₂ ≤ t]) . . . (1 − P[Xₙ ≤ t])    (by LOTP)
           = 1 − Π_{i=1}^{n} (1 − F_Xᵢ(t))


Slide 26

k of N
Let X₁, . . . , X_N be component failure times that have identical distributions (i.e., F_X₁(t) = F_X₂(t) = . . .). The system fails at time S if k of the N components fail.

    F_S(t) = P[at least k components failed by time t]
           = P[k failed OR k + 1 failed OR . . . OR N failed]
           = P[k failed] + P[k + 1 failed] + . . . + P[N failed]    (by independence and the axioms of probability)

What is P[exactly k failed]?

    P[exactly k failed] = P[k failed and (N − k) have not]
                        = (N choose k) F_X(t)^k (1 − F_X(t))^(N−k),

where F_X(t) is the failure distribution of each component. Thus,

    F_S(t) = Σ_{i=k}^{N} (N choose i) F_X(t)^i (1 − F_X(t))^(N−i)
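The three failure-time formulas translate directly into code. Here is a small Python sketch (added for illustration; the function names are mine, not from the slides) that computes the system failure CDF at a fixed time t from component failure probabilities:

    from math import comb

    def parallel_fail(F):          # all components must fail (max of failure times)
        p = 1.0
        for Fi in F:
            p *= Fi
        return p

    def series_fail(F):            # any component failing fails the system (min)
        p = 1.0
        for Fi in F:
            p *= (1.0 - Fi)
        return 1.0 - p

    def k_of_n_fail(k, n, Fx):     # at least k of n identical components fail
        return sum(comb(n, i) * Fx**i * (1 - Fx)**(n - i) for i in range(k, n + 1))

    # Example: components with failure probability 0.1 at time t.
    print(parallel_fail([0.1, 0.1, 0.1]))   # 0.001
    print(series_fail([0.1, 0.1, 0.1]))     # 0.271
    print(k_of_n_fail(2, 3, 0.1))           # 0.028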


Slide 27

k of N in General
For non-identical failure distributions, we must sum over all combinations of at least k failures.

Let G_k be the set of all subsets of {X₁, . . . , X_N} such that each element of G_k is a set of size at least k, i.e.,

    G_k = {gᵢ ⊆ {X₁, . . . , X_N} : |gᵢ| ≥ k}.

The set G_k represents all the possible failure scenarios. Now F_S is given by

    F_S(t) = Σ_{g∈G_k} Π_{X∈g} F_X(t) Π_{X∉g} (1 − F_X(t))
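Enumerating G_k is straightforward with itertools. This sketch (illustrative; the name k_of_n_general is mine) evaluates the general formula for components with different failure probabilities:

    from itertools import combinations

    def k_of_n_general(k, F):
        """F[i] = failure probability of component i at time t."""
        n = len(F)
        total = 0.0
        for size in range(k, n + 1):
            for g in combinations(range(n), size):   # one failure scenario
                p = 1.0
                for i in range(n):
                    p *= F[i] if i in g else (1.0 - F[i])
                total += p
        return total

    # 2-of-3 with non-identical components:
    print(k_of_n_general(2, [0.1, 0.2, 0.3]))   # 0.098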


Slide 28

Component Building Blocks


Complex systems can be analyzed hierarchically.
Example: A computer fails if both power supplies fail, or both memories fail, or the CPU fails.

    F_S(t) = 1 − (1 − F_P1(t)F_P2(t))(1 − F_M1(t)F_M2(t))(1 − F_C(t))
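Numerically, the hierarchical example is a composition of the parallel and series rules. A minimal sketch with assumed failure probabilities (the values below are invented for illustration):

    # Failure probabilities at some fixed time t (assumed values).
    FP1 = FP2 = 0.01   # power supplies
    FM1 = FM2 = 0.02   # memories
    FC = 0.005         # CPU

    # Each redundant pair fails only if both members fail (parallel rule);
    # the system fails if any subsystem fails (series rule).
    F_power = FP1 * FP2
    F_mem = FM1 * FM2
    F_sys = 1 - (1 - F_power) * (1 - F_mem) * (1 - FC)
    print(F_sys)   # ~0.0055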


Slide 29

Summary
A system comprises N components, where the component failure times are given by the random variables X₁, . . . , X_N. The system fails at time S with distribution F_S if:

Condition:                                   Distribution:
all components fail                          F_S(t) = Π_{i=1}^{N} F_Xᵢ(t)
one component fails                          F_S(t) = 1 − Π_{i=1}^{N} (1 − F_Xᵢ(t))
k components fail, identical distributions   F_S(t) = Σ_{i=k}^{N} (N choose i) F_X(t)^i (1 − F_X(t))^(N−i)
k components fail, general case              F_S(t) = Σ_{g∈G_k} Π_{X∈g} F_X(t) Π_{X∉g} (1 − F_X(t))

Slide 30

Reliability Formalisms
There are several popular graphical formalisms for expressing system reliability. At the core of their solvers are the methods we have just examined. In particular, we will examine
Reliability Block Diagrams
Fault Trees
Reliability Graphs
There is nothing particularly special about these formalisms except their popularity. It is easy to implement these formalisms, or design your own, in a spreadsheet, for example.


Slide 31

Reliability Block Diagrams

Blocks represent components. A system failure occurs if there is no path from source to sink.

Series: the system fails if any component fails.
    [source → C1 → C2 → C3 → sink]

Parallel: the system fails if all components fail.
    [source → {C1, C2, C3 in parallel} → sink]

k of N: the system fails if at least k of N components fail.
    [source → {C1, C2, C3} under a "2 of 3" condition → sink]

Slide 32

Example
A NASA satellite architecture under study is designed for high reliability. The
major computer system components include the CPU system, the high-speed
network for data collection and transmission, and the low-speed network for
engineering and control. The satellite fails if any of the major systems fail.
There are 3 computers, and the computer system fails if 2 or more of the computers
fail. Failure distribution of a computer is given by FC.
There is a redundant (2) high-speed network, and the high-speed network system
fails if both networks fail. The distribution of a high-speed network failure is given
by FH.
The low-speed network is arranged similarly, with a failure distribution of FL.


Slide 33

RBD Example

[Block diagram: source feeds three "computer" blocks under a 2-of-3 condition, in series with two parallel HSN blocks and two parallel LSN blocks, to sink.]

    F_S(t) = 1 − [1 − Σ_{i=2}^{3} (3 choose i) F_C(t)^i (1 − F_C(t))^(3−i)] [1 − (F_H(t))²] [1 − (F_L(t))²]
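Evaluating this expression numerically is straightforward. The sketch below is an added illustration with assumed exponential component lifetimes and an assumed mission time of 1000 hours; none of the rates are from the slides.

    from math import comb, exp

    def F_exp(rate, t):                 # exponential failure CDF
        return 1.0 - exp(-rate * t)

    t = 1000.0                          # hours (assumed mission time)
    FC = F_exp(1e-5, t)                 # computer (assumed rate)
    FH = F_exp(2e-5, t)                 # high-speed network (assumed rate)
    FL = F_exp(2e-5, t)                 # low-speed network (assumed rate)

    F_comp = sum(comb(3, i) * FC**i * (1 - FC)**(3 - i) for i in range(2, 4))
    FS = 1 - (1 - F_comp) * (1 - FH**2) * (1 - FL**2)
    print(FS)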

Slide 34

Fault Trees

Components are leaves in the tree. A failed component takes the logical value true, otherwise false. The nodes in the tree are boolean AND, OR, and k-of-N gates. The system fails if the root evaluates to true.

AND gates: true if all of the inputs are true (failed).
OR gates: true if any of the inputs is true (failed).
k of N gates: true if at least k of the inputs are true (failed).

[Figure: an AND gate over C1, C2, C3; an OR gate over C1, C2, C3; and a "2 of 3" gate over C1, C2, C3.]
Slide 35

Fault Tree Example

[Fault tree: an OR gate at the root, whose inputs are a 2-of-3 gate over C1, C2, C3; an AND gate over H1, H2; and an AND gate over L1, L2.]

Slide 36

Combinatorial Methods: Review


A system comprises N components, where the component failure times are given by the random variables X₁, . . . , X_N. The system fails at time S with distribution F_S if:

Condition:                                   Distribution:
all components fail                          F_S(t) = Π_{i=1}^{N} F_Xᵢ(t)
one component fails                          F_S(t) = 1 − Π_{i=1}^{N} (1 − F_Xᵢ(t))
k components fail, identical distributions   F_S(t) = Σ_{i=k}^{N} (N choose i) F_X(t)^i (1 − F_X(t))^(N−i)
k components fail, general case              F_S(t) = Σ_{g∈G_k} Π_{X∈g} F_X(t) Π_{X∉g} (1 − F_X(t))


Slide 37


Reliability Graphs

The arcs represent components and have failure distributions. A failure occurs if there is no path from source to sink.

Can implement series:
    [source → arc F_C1 → arc F_C2 → sink]

Can implement parallel:
    [source → arcs F_C1, F_C2, F_C3 in parallel → sink]


Slide 44

Reliability Graph Example


Reliability graphs can implement more complex interactions. For example, a telephone network fails if there is no path from source to sink.

[Bridge network: links A and B leave the source, links D and E enter the sink, and link C connects the midpoints of the two paths.]

How do we solve this?

Slide 45

Solving by Conditioning
Recall that P[E | F] = P[E ∩ F] / P[F].

If F and F̄ are complementary events, i.e., F ∩ F̄ = ∅ and F ∪ F̄ = Ω, then there is a trick:

    P[E] = P[E ∩ F] + P[E ∩ F̄]
    P[E] = P[E | F]P[F] + P[E | F̄]P[F̄]

If you can solve P[E | F], P[F], P[E | F̄], and P[F̄], then you can solve P[E].


Slide 46

[Bridge network: links A and B leave the source, links D and E enter the sink, and link C connects the midpoints of the two paths.]

First, condition the system on link C being failed. Then the system becomes the series AD in parallel with the series BE:

    F_S|C fail(t) = P[S ≤ t | C ≤ t] = (1 − (1 − F_A(t))(1 − F_D(t)))(1 − (1 − F_B(t))(1 − F_E(t))),

and P[C ≤ t] = F_C(t).

Slide 47

Second, condition the system on link C being up. Then A and B are in parallel, in series with D and E in parallel:

    F_S|C up(t) = P[S ≤ t | C > t] = 1 − (1 − F_A(t)F_B(t))(1 − F_D(t)F_E(t)),

and P[C > t] = 1 − P[C ≤ t] = 1 − F_C(t).

Thus, F_S(t) = F_S|C fail(t) F_C(t) + F_S|C up(t)(1 − F_C(t)).
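The conditioning argument is easy to check in code. This sketch (an added illustration; the component failure probabilities are assumed) computes the bridge network's failure probability both by the conditioning formula and by brute-force enumeration of all 2⁵ component states:

    from itertools import product

    FA, FB, FC, FD, FE = 0.1, 0.2, 0.05, 0.1, 0.2   # assumed values at time t

    # Conditioning on link C:
    F_c_fail = (1 - (1 - FA) * (1 - FD)) * (1 - (1 - FB) * (1 - FE))
    F_c_up = 1 - (1 - FA * FB) * (1 - FD * FE)
    F_cond = F_c_fail * FC + F_c_up * (1 - FC)

    # Brute force: sum P[state] over states with no working source-sink path.
    def works(a, b, c, d, e):
        # paths: A-D, B-E, A-C-E, B-C-D (C bridges the midpoints)
        return (a and d) or (b and e) or (a and c and e) or (b and c and d)

    F_brute = 0.0
    probs = [FA, FB, FC, FD, FE]
    for state in product([0, 1], repeat=5):      # 1 = component up
        p = 1.0
        for up, F in zip(state, probs):
            p *= (1 - F) if up else F
        if not works(*state):
            F_brute += p

    print(F_cond, F_brute)   # the two values should match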


Slide 48

Conditioning Fault Trees


It is also possible to use conditioning to solve more complex fault trees. If the same component appears more than once in a fault tree, it violates the independent failure assumption. However, a conditioned fault tree can be solved.

Example: a component C appears multiple times in the fault tree. Then

    F_S(t) = F_S|C fail(t) F_C(t) + F_S|C up(t)(1 − F_C(t)),

where S | C fail is the system given that C has failed, and S | C up is the system given that C has not failed.


Slide 49

Reliability/Availability Point Estimates

Frequently, the desired measure of a reliability model is the reliability at some time t. Thus, the distribution of the system reliability is superfluous; R(t) is the only thing of interest.

This condition simplifies computation, because all that is necessary for solution is the reliability of the components at time t. Solution then becomes a straightforward computation.

If a system is described in terms of the availability of components at time t, then we may compute the system availability in the same way that reliability is computed. The restriction is that all component behaviors must be independent of one another.


Slide 50

Reliability/Availability Tables
A system comprises N components. The reliability of component i at time t is given by R_Xi(t), and the availability of component i at time t is given by A_Xi(t).

Condition                              System Reliability                                              System Availability
system fails if all                    R_S(t) = 1 − Π_{i=1}^{N} (1 − R_Xi(t))                          A_S(t) = 1 − Π_{i=1}^{N} (1 − A_Xi(t))
components fail
system fails if one                    R_S(t) = Π_{i=1}^{N} R_Xi(t)                                    A_S(t) = Π_{i=1}^{N} A_Xi(t)
component fails
system fails if at least k             R_S(t) = 1 − Σ_{i=k}^{N} (N choose i) (1 − R_X(t))^i R_X(t)^(N−i)    A_S(t) = 1 − Σ_{i=k}^{N} (N choose i) (1 − A_X(t))^i A_X(t)^(N−i)
components fail, identical
distributions
system fails if at least k             R_S(t) = 1 − Σ_{g∈G_k} Π_{X∈g} (1 − R_X(t)) Π_{X∉g} R_X(t)      A_S(t) = 1 − Σ_{g∈G_k} Π_{X∈g} (1 − A_X(t)) Π_{X∉g} A_X(t)
components fail, general case

Slide 51

Estimating Component Reliability

For hardware, MIL-HDBK-217 is widely used.
    Not always current with modern components.
    Lacks distributions; it contains only failure rates.
    While not perfect, it seems to be the best source that exists. However, numbers from MIL-HDBK-217 should be used with caution.

Due to the nature of software, no accepted mechanism exists to predict software reliability before the software is built.
    The best guess is the reliability of previously built similar software.

In all cases, numbers should be used with caution and adjusted based on observation and experience.

There is no substitute for empirical observation and experience!


Slide 52

Modeling Process

Reliability models are built only after proper service is specified.

Reliability models are built to answer the question: "What subsystems or components must be proper for the system to be proper?"

Build models hierarchically out of subsystems.

Estimates and guesses are acceptable, but state them explicitly.

If unsure, do sensitivity analysis to see how much it matters.


Slide 53

Reliability Modeling Process

Realistic systems result in large RBDs and must be managed hierarchically.

RBD Process(system)
Define the system
Define proper service
Create RBD out of components
for each component
if component is simple
obtain reliability data of component
else
Do RBD Process(component)
end if
Compute reliability of system
Do results meet specification?
Modify design and repeat as necessary


Slide 54

Summary
Reliability: review of definition
Failure rate
System reliability
Independent failure assumption
Minimum, maximum, k of N
Reliability block diagrams, fault trees, reliability graphs
Reliability modeling process


Slide 55

Stochastic Activity Network Concepts


Slide 56

Introduction
Stochastic activity networks, or SANs, are a convenient, graphical, high-level
language for describing system behavior. SANs are useful in capturing the
stochastic (or random) behavior of a system.
Examples:
The amount of time a program takes to execute can be computed precisely if
all factors are known, but this is nearly impossible and sometimes useless.
At a more abstract level, we can approximate the running time by a random
variable.
Fault arrivals almost always must be modeled by a random process.
We begin by describing a subset of SANs: stochastic Petri nets.


Slide 58

Stochastic Petri Net Review


One of the simplest high-level modeling formalisms is called stochastic Petri nets. A stochastic Petri net is composed of the following components:

Places: which contain tokens, and are like variables
Tokens: which are the value or state of a place
Transitions: which change the number of tokens in places
Input arcs: which connect places to transitions
Output arcs: which connect transitions to places


Slide 59

Firing Rules for SPNs


A stochastic Petri net (SPN) executes according to the following rules:

A transition is said to be enabled if, for each place connected by input arcs, the number of tokens in the place is ≥ the number of input arcs connecting the place and the transition.

[Example: places P1 and P2 each hold a token and are connected by input arcs to transition t1; transition t1 is enabled.]


Slide 60

Firing Rules, cont.

A transition may fire if it is enabled. (More about this later.)

If a transition fires, for each input arc a token is removed from the corresponding place, and for each output arc a token is added to the corresponding place.

[Example: when t1 fires, tokens are removed from input places P1 and P2, and tokens are added to output places P3 and P4.]

Note: tokens are not necessarily conserved when a transition fires.


Slide 61

Specification of Stochastic Behavior of an SPN

A stochastic Petri net is made from a Petri net by assigning an exponentially distributed time to all transitions.
    Time represents the delay between enabling and firing of a transition.
    Transitions execute in parallel with independent delay distributions.

Since the minimum of multiple independent exponentials is itself exponential, the time between transition firings is exponential.

If a transition t becomes enabled, and before t fires some other transition fires and changes the state of the SPN such that t is no longer enabled, then t aborts; that is, t will not fire.

Since the exponential distribution is memoryless, one can say that transitions that remain enabled continue or restart, as is convenient, without changing the behavior of the network. A simulation sketch of these rules follows.
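Here is a minimal Python sketch of SPN execution under the race interpretation just described (an illustration added here, not Möbius code; the toy net structure and rates are assumptions):

    import random

    # A toy SPN: transition -> (rate, input places, output places).
    transitions = {
        "arrive": (1.0, [], ["queue"]),
        "serve":  (2.0, ["queue"], ["done"]),
    }
    marking = {"queue": 0, "done": 0}

    def enabled(t):
        # Enabled iff every input place holds at least one token
        # (one input arc per listed place in this toy example).
        _, inputs, _ = transitions[t]
        return all(marking[p] >= 1 for p in inputs)

    time = 0.0
    for _ in range(20):
        live = [t for t in transitions if enabled(t)]
        if not live:
            break
        # Race: each enabled transition samples an exponential delay and
        # the minimum wins (Facts 3 and 4 from the probability review);
        # memorylessness lets us resample all delays at every step.
        delays = {t: random.expovariate(transitions[t][0]) for t in live}
        winner = min(delays, key=delays.get)
        time += delays[winner]
        _, inputs, outputs = transitions[winner]
        for p in inputs:
            marking[p] -= 1
        for p in outputs:
            marking[p] += 1
        print(f"t={time:.3f}  {winner} fires  {marking}")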


Slide 62

Notes on SPNs

SPNs are much easier to read, write, modify, and debug than Markov chains.

SPN-to-Markov-chain conversion can be automated, affording numerical solution of the underlying Markov chain.

Most SPN formalisms include a special type of arc called an inhibitor arc, which enables a transition only if there are zero tokens in the associated place. (Example: modify an SPN to give writes priority.)

SPNs are limited in their expressive power: they may only perform +, −, >, and test-for-zero operations.
    These very limited operations make it very difficult to model complex interactions.
    Simplicity allows for certain analyses; e.g., deadlock may be detected in a network protocol modeled by an SPN (if inhibitor arcs are not used).

More general and flexible formalisms are needed to represent real systems.


Slide 65

Stochastic Activity Networks


The need for more expressive modeling languages has led to several extensions to stochastic Petri nets. One extension that we will examine is called stochastic activity networks. Because there are a number of subtle distinctions relative to SPNs, stochastic activity networks use different words to describe ideas similar to those of SPNs.

Stochastic activity networks have the following properties:
    A general way to specify that an activity (transition) is enabled
    A general way to specify a completion (firing) rule
    A way to represent zero-timed events
    A way to represent probabilistic choices upon activity completion
    State-dependent parameter values
    General delay distributions on activities


Slide 66

SAN Symbols
Stochastic activity networks (hereafter SANs) have four new symbols in addition to those of SPNs:

Input gate: used to define complex enabling predicates and completion functions
Output gate: used to define complex completion functions
Cases: (small circles on activities) used to specify probabilistic choices
Instantaneous activities: used to specify zero-timed events


Slide 67

SAN Terms
1. activation - time at which an activity begins
2. completion - time at which activity completes
3. abort - time, after activation but before completion, when activity is no longer
enabled
4. active - the time after an activity has been activated but before it completes or
aborts.


Slide 73

Illustration of SAN Terms


[Figure: timelines illustrating the terms. In one, an activity is activated while enabled, its activity time elapses, and it completes. In another, the activity is activated but is aborted before its activity time elapses. In a third, an activity completes and is immediately activated again, with back-to-back activity times while it remains enabled.]


Slide 74

Completion Rules
When an activity completes, the following events take place (in the order listed),
possibly changing the marking of the network:
1. If the activity has cases, a case is (probabilistically) chosen.
2. The functions of all the connected input gates are executed (in an
unspecified order).
3. Tokens are removed from places connected by input arcs.
4. The functions of all the output gates connected to the chosen case are
executed (in an unspecified order).
5. Tokens are added to places connected by output arcs connected to the
chosen case.
Ordering is important, since effect of actions can be marking-dependent.

Slide 75

General Delay Distributions


SANs (and their implementation in Möbius) support many activity time distributions, including: exponential, hyperexponential, deterministic, Weibull, conditional Weibull, normal, Erlang, gamma, beta, uniform, binomial, and negative binomial.

All distribution parameters can be marking-dependent.

The obvious implication of general delay distributions is that there is no conversion to a CTMC. Hence, solution methods for CTMCs are not applicable. However, simulation is still possible.

Analytical/numerical solution is possible for certain mixes of exponential and deterministic activities. See the Möbius manual for details.

See [Kececioglu 91], for example, for appropriate use of some of these distributions.

Slide 80

Fault-Tolerant Computer Failure Model Example


A fault-tolerant computer system is made up of two redundant computers. Each computer is composed of three redundant CPU boards. A computer is operational if at least 1 CPU board is operational, and the system is operational if at least 1 computer is operational.

CPU boards fail at a rate of 1/10⁶ per hour; there is a 0.5% chance that a board failure will cause a computer failure, and a 0.8% chance that a board will fail in a way that causes a catastrophic system failure.


Slide 81

SAN "computer" for Computer Failure Model

[Figure: the SAN submodel "computer", with places CPUboards1 and NumComp, activity CPUfail1 with three cases, input gate Enabled1, and output gates Covered1, Uncovered1, and Catastrophic1.]

Slide 82

Activity Case Probabilities and Input Gate Definition

Activity     Case    Probability
CPUfail1     1       0.987
             2       0.005
             3       0.008

Gate         Definition
Enabled1     Predicate:
                 MARK(CPUboards1) > 0 && MARK(NumComp) > 0
             Function:
                 MARK(CPUboards1)--;


Slide 83

Output Gate Definitions

Gate           Definition
Covered1       Function:
                   if (MARK(CPUboards1) == 0)
                       MARK(NumComp)--;
Uncovered1     Function:
                   MARK(CPUboards1) = 0;
                   MARK(NumComp)--;
Catastrophic1  Function:
                   MARK(CPUboards1) = 0;
                   MARK(NumComp) = 0;
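To see how the pieces fit together, here is a Python sketch that mirrors this SAN's logic in a discrete-event simulation and estimates system reliability at t = 10⁵ hours (an added illustration of the model's semantics, not Möbius output; the mission time is an arbitrary choice):

    import random

    RATE = 1e-6                      # board failure rate, per hour
    P_CASE = [0.987, 0.005, 0.008]   # covered, uncovered, catastrophic
    T_MISSION = 1e5                  # hours (assumed)

    def run():
        boards = [3, 3]              # CPUboards1, CPUboards2
        num_comp = 2                 # NumComp
        t = 0.0
        while num_comp > 0:
            t += random.expovariate(RATE * sum(boards))   # next board failure
            if t > T_MISSION:
                return True          # survived the mission
            # pick the computer whose board failed, weighted by board count
            comp = random.choices([0, 1], weights=boards)[0]
            boards[comp] -= 1                       # input gate: remove a board
            case = random.choices([0, 1, 2], weights=P_CASE)[0]
            if case == 0:                           # Covered1
                if boards[comp] == 0:
                    num_comp -= 1
            elif case == 1:                         # Uncovered1: computer fails
                boards[comp] = 0
                num_comp -= 1
            else:                                   # Catastrophic1: system fails
                return False
        return False

    n = 100_000
    print(sum(run() for _ in range(n)) / n)   # estimated reliability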


Slide 84

Reward Variables
Reward variables are a way of measuring performance- or dependability-related
characteristics about a model.
Examples:
Expected time until service
System availability
Number of misrouted packets in an interval of time
Processor utilization
Length of downtime
Operational cost
Module or system reliability


Slide 85

Reward Structures
Reward may be accumulated two different ways:
A model may be in a certain state or states for some period of time, for
example, CPU idle states. This is called a rate reward.
An activity may complete. This is called an impulse reward.
The reward variable is the sum of the rate reward and the impulse reward structures.


Slide 86

Reward Structure Example


A web server failure model is used to predict profits. When the web server is fully operational, profits accumulate at $N/hour. In a degraded mode, profits accumulate at $(1/6)N/hour. Repairs cost $K.

    R(m) = { N,        m is a fully functioning marking
           { (1/6)N,   m is a degraded-mode marking
           { 0,        otherwise

    C(a) = { −K,   a is an activity representing repair
           { 0,    otherwise

By carefully integrating the reward structure from 0 to t, we get the profit at time t. This is an example of an interval-of-time variable.
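A small sketch of how such a reward variable accumulates along one sampled trajectory (an added illustration; the trajectory, N, and K are invented for the example):

    N, K = 100.0, 250.0   # $/hour when fully up; $ per repair (assumed)

    # One sampled trajectory: (duration in hours, mode) segments,
    # with an impulse of -K at each repair completion.
    segments = [(10.0, "up"), (4.0, "degraded"), (2.0, "down"), (8.0, "up")]
    repairs = 1            # one repair completed along this trajectory

    rate = {"up": N, "degraded": N / 6.0, "down": 0.0}
    profit = sum(dt * rate[mode] for dt, mode in segments) - K * repairs
    print(profit)   # accumulated reward (profit) over the interval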


Slide 87

Reward Variables
A reward variable is the sum of the impulse and rate reward structures over a certain time. Let [t, t + l] be the interval of time defined for a reward variable:
    If l = 0, then the reward variable is called an instant-of-time reward variable.
    If l > 0, then the reward variable is called an interval-of-time reward variable.
    If l > 0, then dividing an interval-of-time reward variable by l gives a time-averaged interval-of-time reward variable.


Slide 88

Reward Variable Specification


[Figure: taxonomy of reward variables defined over a reward structure. An instant-of-time variable is evaluated at a time t; an interval-of-time variable is accumulated over [t, t + l]; a time-averaged interval-of-time variable divides the accumulation over [t, t + l] by l. Steady-state versions of each are obtained in the limit as t (or l) goes to infinity.]

Slide 89

Reward Variables for Computer Failure Model


Reliability
    Rate rewards: subnet = computer
        Predicate: MARK(NumComp) > 0
        Function: 1
    Impulse reward: none

NumBoardFailures
    Rate reward: none
    Impulse reward: subnet = computer
        activity = CPUfail1, value = 1
        activity = CPUfail2, value = 1


Slide 92

Reward Variables for Computer Failure Model


Performability
    Rate rewards: subnet = computer
        Predicate: 1
        Function: MARK(NumComp)
    Impulse reward: none

NumBoards
    Rate reward: subnet = computer
        Predicate: 1
        Function: MARK(CPUboards1) + MARK(CPUboards2)
    Impulse reward: none


Slide 93

Model Composition
A composed model is a way of connecting different SANs together to form a larger
model.
Model composition has two operations:
Replicate: Combine 2 or more identical SANs and reward structures
together, holding certain places common among the replicas.
Join: Combine 2 or more different SANs and reward structures together,
combining certain places to permit communication.


Slide 94

Composed Model Specification

[Figure: a composition tree. A Join combines two or more submodels, making certain places in different submodels common to permit communication; a Replicate makes a certain number of copies of a submodel, holding certain places common to all replicas.]


Slide 95

Rationale
There are many good reasons for using composed models.
    Building highly reliable systems usually involves redundancy. The Replicate operation models redundancy in a natural way.
    Systems are usually built in a modular way. Replicates and Joins are well suited to connecting together similar and different modules.
    Tools can take advantage of the Strong Lumping Theorem, which allows a tool to generate a Markov process with a smaller state space (to be described in Session 7).


Slide 96

Computer Failure Model Revisited: Single computer Model

[Figure: the single "computer" SAN submodel.]
(Note: the initial marking of NumComp is two, since there will be two computers in the composed model.)

Slide 98

Composed Model for Computer Failure Model

Node    Reps    Common Places
Rep1    2       NumComp

Slide 99

Composed Model
How does adding an additional computer affect reliability?
    In the composed model: change the number of replications to 3 and update the affected reward variables - easy. (Use a global variable if you suspect you may want to do this.)
    In a flat model: add another computer - hard.
In the composed model, the number of states in the underlying Markov chain is much smaller, especially for large numbers of replications. (Details will be given in Session 7.)


Slide 102

Analytic/Numerical State-Based Modeling


Slide 103

Session Outline
Review of Markov process theory and fundamentals
Methods for constructing state-level models from SANs
Analytic/numerical solution techniques
Transient solution
Standard uniformization (instant-of-time variables)
Adaptive uniformization (instant-of-time variables)
Interval-of-time uniformization (interval-of-time variables)
Steady-state solution (steady-state instant-of-time variables)
Direct solution
Iterative solution


Slide 104

Weaknesses of Simulation

Simulation relies on good pseudo-random number generation, sufficient observations, and good statistical techniques to produce an approximate solution.

Increasing accuracy by a factor of n requires on the order of n² more work, which can be prohibitively expensive.

For example, a 5-nines system reliability model will require approximately 100,000 observations to observe one failure. One digit of accuracy can easily require over 1,000,000 observations! (For many models, 1,000,000 observations can be generated quickly, but as system failure becomes even rarer, standard simulation quickly becomes infeasible.)


Slide 105

The Case for Analytical/Numerical Techniques


If you can model using exponential delays and your model is sufficiently small,
continuous time Markov chains (CTMCs) offer some advantages. These include:
Typically faster solution time for systems with rare events
Typically takes less time to get more accurate answers
Typically more confidence in the solution
In order to understand when we get these advantages, we must better understand the
methods of obtaining solutions to CTMCs.


Slide 106

Random Variable Review


It is often convenient to assign a (real) number to every element in the sample space Ω. This assignment, or rule, or function, is called a random variable.

[Figure: a mapping X : Ω → ℝ assigning each outcome in Ω a real number.]


Slide 107

Random Process Review


Random processes are useful for characterizing the behavior of real systems. A random process is a collection of random variables indexed by time.

Example: X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that the time domain is T = {1, 2, 3, . . .}.

One can ask:
    P[X(2) = 12] = 1/36
    P[X(3) = 14 | X(1) = 2] = 1/36
    E[X(n)] = 3.5n


Slide 108

Describing a Random Process


Recall that for a random variable X, we can use the cumulative distribution F_X to describe the random variable. In general, no such simple description exists for a random process.

However, a random process can often be described succinctly in various ways. For example, if Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls, then we can describe X(t) by

    X(t) − X(t − 1) = Y,
    P[X(t) = i | X(t − 1) = j] = P[Y = i − j],
    or X(t) = Y₁ + Y₂ + . . . + Y_t, where the Yᵢ's are independent.


Slide 110

Classifying Random Processes: Characteristics of T


If the number of time points defined for a random process, i.e., |T|, is finite or countable (e.g., the integers), then the random process is said to be a discrete-time random process.

If |T| is uncountable (e.g., the real numbers), then the random process is said to be a continuous-time random process.

Example: Let X(t) be the number of fault arrivals in a system up to time t. Since t ∈ T is a real number, X(t) is a continuous-time random process.


Slide 111

Classifying Random Processes: State Space Type


Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e.,

    S = {y : X(t) = y, for some t ∈ T}.

If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.


Slide 112

Random Process State Spaces


If the state space S of a random process X is finite or countable (e.g., S = {1, 2, 3, . . .}), then X is said to be a discrete-state random process.
    Example: Let X be a random process that represents the number of bad packets received over a network. X is a discrete-state random process.
If the state space S of a random process X is infinite and uncountable (e.g., S = ℝ), then X is said to be a continuous-state random process.
    Example: Let X be a random process that represents the voltage on a telephone line. X is a continuous-state random process.
We examine only discrete-state processes in this lecture.


Slide 113

Stochastic-Process Classification Examples

                        Time: Continuous               Time: Discrete
State: Continuous       Analog signal                  A-to-D converter
State: Discrete         Computer availability model    Round-based network protocol model


Slide 114

Markov Process
A special type of random process that we will examine in detail is called the
Markov process. A Markov process can be informally defined as follows.
Given the state (value) of a Markov process X at time t (X(t)), the future
behavior of X can be described completely in terms of X(t).
Markov processes have the very useful property that their future behavior is
independent of past values.


Slide 115

Markov Chains
A Markov chain is a Markov process with a discrete state space.
We will always make the assumption that a Markov chain has a state space in
{1,2, . . .} and that it is time-homogeneous.
A Markov chain is time-homogeneous if its future behavior does not depend on
what time it is, only on the current state (i.e., the current value).
We make this concrete by looking at a discrete-time Markov chain (hereafter DTMC). A DTMC X has the following property:

    P[X(t + k) = j | X(t) = i, X(t − 1) = n_{t−1}, X(t − 2) = n_{t−2}, . . . , X(0) = n₀]
        = P[X(t + k) = j | X(t) = i]    (1)
        = P_ij^(k)                      (2)


Slide 116

DTMCs
Notice that given i, j, and k, P_ij^(k) is a number!

P_ij^(k) can be interpreted as the probability that if X has value i, then after k time-steps, X will have value j.

Frequently, we write P_ij to mean P_ij^(1).


Slide 117


State Occupancy Probability Vector


Let π be a row vector. We denote by πᵢ the i-th element of the vector. If π is a state occupancy probability vector, then πᵢ(k) is the probability that a DTMC has value i (or is in state i) at time-step k.

Assume that a DTMC X has a state-space size of n, i.e., S = {1, 2, . . . , n}. We say formally

    πᵢ(k) = P[X(k) = i].

Note that Σ_{i=1}^{n} πᵢ(k) = 1 for all times k.


Slide 120

Computing State Occupancy Vectors: A Single Step Forward in Time


If we are given π(0) (the initial probability vector) and P_ij for i, j = 1, . . . , n, how do we compute π(1)?

Recall the definition of P_ij:

    P_ij = P[X(k + 1) = j | X(k) = i] = P[X(1) = j | X(0) = i].

Since Σ_{i=1}^{n} πᵢ(0) = 1,

    π_j(1) = P[X(1) = j]
           = P[X(1) = j | X(0) = 1]P[X(0) = 1] + . . . + P[X(1) = j | X(0) = n]P[X(0) = n]
           = Σ_{i=1}^{n} P[X(1) = j | X(0) = i]P[X(0) = i]
           = Σ_{i=1}^{n} P_ij πᵢ(0)
           = Σ_{i=1}^{n} πᵢ(0) P_ij


Slide 121

Transition Probability Matrix


We have π_j(1) = Σ_{i=1}^{n} πᵢ(0) P_ij, which holds for all j. Notice that this resembles vector-matrix multiplication.

In fact, if we arrange the matrix P = {P_ij}, that is,

    P = [ p₁₁ . . . p₁ₙ ]
        [  .         .  ]
        [ pₙ₁ . . . pₙₙ ]

then p_ij = P_ij, and π(1) = π(0)P, where π(0) and π(1) are row vectors, and π(0)P is a vector-matrix multiplication.

The important consequence of this is that we can easily specify a DTMC in terms of an occupancy probability vector π and a transition probability matrix P.


Slide 122

Transient Behavior of Discrete-Time Markov Chains


Given π(0) and P, how can we compute π(k)? We can generalize from earlier that

    π(k) = π(k − 1)P.

Also, we can write π(k − 1) = π(k − 2)P, and so

    π(k) = [π(k − 2)P]P = π(k − 2)P².

Similarly, π(k − 2) = π(k − 3)P, and so

    π(k) = [π(k − 3)P]P² = π(k − 3)P³.

By repeating this, it should be easy to see that

    π(k) = π(0)Pᵏ.


Slide 123

A Simple Example
Suppose the weather at Urbana-Champaign, Illinois, can be modeled the following way:
    If it's sunny today, there's a 60% chance of being sunny tomorrow, a 30% chance of being cloudy, and a 10% chance of being rainy.
    If it's cloudy today, there's a 40% chance of being sunny tomorrow, a 45% chance of being cloudy, and a 15% chance of being rainy.
    If it's rainy today, there's a 15% chance of being sunny tomorrow, a 60% chance of being cloudy, and a 25% chance of being rainy.
If it's rainy on Friday, what is the forecast for Monday?


Slide 124

Simple Example, cont.


Clearly, the weather model is a DTMC:
1) Future behavior depends on the current state only
2) Discrete time, discrete state
3) Time-homogeneous

The DTMC has 3 states. Let us assign 1 to sunny, 2 to cloudy, and 3 to rainy. Let time 0 be Friday.

    π(0) = (0, 0, 1)

    P = [ .6   .3   .1  ]
        [ .4   .45  .15 ]
        [ .15  .6   .25 ]


Slide 125

Simple Example Solution


The weather on Saturday (π(1)) is

    π(1) = π(0)P = (0, 0, 1) [ .6   .3   .1  ]  = (.15, .6, .25),
                             [ .4   .45  .15 ]
                             [ .15  .6   .25 ]

that is, a 15% chance sunny, 60% chance cloudy, 25% chance rainy.

The weather on Sunday (π(2)) is

    π(2) = π(1)P = (.15, .6, .25)P = (.3675, .465, .1675).

The weather on Monday (π(3)) is

    π(3) = π(2)P = (.4316, .42, .1484),

that is, a 43% chance sunny, 42% chance cloudy, and 15% chance rainy.


Slide 126

Solution, cont.
Alternatively, we could compute P³, since we found π(3) = π(0)P³.

Working out solutions by hand can be tedious and error-prone, especially for larger models (i.e., models with many states). Software packages are used extensively for this sort of analysis.

Software packages compute π(k) as (. . . ((π(0)P)P)P . . .)P rather than computing Pᵏ, since computing the latter results in large fill-in. A sketch of this computation follows.
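A minimal numpy sketch of the weather computation (an added illustration; it reproduces the numbers above by repeated vector-matrix multiplication):

    import numpy as np

    P = np.array([[0.60, 0.30, 0.10],
                  [0.40, 0.45, 0.15],
                  [0.15, 0.60, 0.25]])
    pi = np.array([0.0, 0.0, 1.0])    # rainy on Friday

    for day in ["Saturday", "Sunday", "Monday"]:
        pi = pi @ P                   # pi(k) = pi(k-1) P
        print(day, pi)
    # Monday: [0.431625 0.42 0.148375]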


Slide 127

Graphical Representation
It is frequently useful to represent the DTMC as a directed graph. Nodes represent states, and edges are labeled with probabilities. For example, our weather prediction model would look like this:

[Graph: three nodes (1 = sunny day, 2 = cloudy day, 3 = rainy day), with self-loops labeled .6, .45, and .25, and edges between nodes labeled with the corresponding entries of P (e.g., .3 from 1 to 2, .4 from 2 to 1, .1 from 1 to 3, .15 from 2 to 3, .15 from 3 to 1, .6 from 3 to 2).]


Slide 128

Simple Computer Example


[Graph: a three-state DTMC with states X = 1 (computer idle), X = 2 (computer working), and X = 3 (computer failed), and transition probabilities P_idle, P_arr, P_fi, P_com, P_busy, P_fb, P_r, and P_ff.]

    P = [ P_idle  P_arr   P_fi ]
        [ P_com   P_busy  P_fb ]
        [ P_r     0       P_ff ]


Slide 129

Limiting Behavior of DTMCs


It is sometimes useful to know the time-limiting behavior of a DTMC. This
translates into the long term, where the system has settled into some steady-state
behavior.
Formally, we are looking for lim_{n→∞} π(n).
To compute this, what we want is lim_{n→∞} π(0)Pⁿ.
There are various ways to compute this. The simplest is to calculate π(n) for
increasingly large n, and when π(n + 1) ≈ π(n), we can believe that π(n) is a good
approximation to steady state. This can be rather inefficient if n needs to be large.
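As a concrete illustration of this "iterate until π(n + 1) ≈ π(n)" approach, here is a minimal sketch (our own code, assuming NumPy), applied to the weather DTMC from the earlier example:

    import numpy as np

    P = np.array([[0.60, 0.30, 0.10],
                  [0.40, 0.45, 0.15],
                  [0.15, 0.60, 0.25]])
    pi = np.array([0.0, 0.0, 1.0])

    # Iterate pi(n+1) = pi(n) P until successive iterates agree to 1e-10.
    n = 0
    while True:
        nxt = pi @ P
        n += 1
        if np.max(np.abs(nxt - pi)) < 1e-10:
            break
        pi = nxt
    print("approximate steady state:", nxt, "after", n, "iterations")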

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 130

Classifications
It is much easier to solve for the steady-state behavior of some DTMCs than
others. To determine if a DTMC is easy to solve, we need to introduce some
definitions.
Definition: A state j is said to be accessible from state i if there exists an n ≥ 0 such
that P_ij^(n) > 0. We write i → j.
Note: recall that P_ij^(n) = P[X(n) = j | X(0) = i].
If one thinks of accessibility in terms of the graphical representation, a state j is
accessible from state i if there exists a path of non-zero edges (arcs) from node i to
node j.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 131

State Classification in DTMCs


Definition: A DTMC is said to be irreducible if every state is accessible from every
other state.
Formally, a DTMC is irreducible if
    i → j for all i, j ∈ S.
A DTMC is said to be reducible if it is not irreducible.
It turns out that irreducible DTMCs are simpler to solve. One need only solve one
linear equation:
    π = πP.
We will see why this is so, but first there is one more issue we must confront.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 132

Periodicity
Consider the following DTMC:

[Figure: two states, each moving to the other with probability 1.]

With π(0) = (1, 0), does lim_{n→∞} π(n) exist? No!
However,

    lim_{n→∞} (1/n) Σ_{i=1}^{n} π(i)

does exist; it is called the time-averaged steady-state distribution, and is
denoted by π*.
Definition: A state i is said to be periodic with period d if P_ii^(n) > 0 only when n is
some multiple of d. If d = 1, then i is said to be aperiodic.
A steady-state solution for an irreducible DTMC exists if all the states are aperiodic.
A time-averaged steady-state solution for an irreducible DTMC always exists.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 133

Steady-State Solution of DTMCs


The steady-state behavior can be computed by solving the linear equation
π = πP, with the constraint that Σ_i π_i = 1. For irreducible DTMCs, it can be
shown that this solution is unique. If the DTMC is periodic, then this solution
yields π*.
One can understand the equation π = πP in two different ways:
1) In steady state, the probability distribution π(n + 1) = π(n)P, and by
   definition π(n + 1) = π(n) in steady state.
2) Flow equations.
Flow equations require some visualization. Imagine a DTMC graph, where the
nodes are assigned the occupancy probability, or the probability that the DTMC has
the value of the node.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 134

Probability must be conserved, i.e., Σ_i π_i = 1.

Flow Equations

Let π_i P_ij be the probability mass that moves from state i to state j in one time step.
Since probability must be conserved, the probability mass entering a state must
equal the probability mass leaving a state:

    Prob. mass in = Prob. mass out
    Σ_{j=1}^{n} π_j P_ji = π_i Σ_{j=1}^{n} P_ij = π_i

Written in matrix form, π = πP.
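Equivalently, since π = πP together with Σ_i π_i = 1 is just a linear system, it can be solved directly: transpose the balance equations and replace one of them with the normalization constraint. A minimal sketch (our own code, assuming NumPy):

    import numpy as np

    P = np.array([[0.60, 0.30, 0.10],
                  [0.40, 0.45, 0.15],
                  [0.15, 0.60, 0.25]])
    n = P.shape[0]

    # pi (P - I) = 0  <=>  (P - I)^T pi^T = 0; replace the last balance
    # equation with sum_i pi_i = 1 so the system has a unique solution.
    A = (P - np.eye(n)).T
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    print("steady-state distribution:", pi)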

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 135

Continuous Time Markov Chains (CTMCs)


For most systems of interest, events may occur at any point in time. This leads us
to consider continuous time Markov chains. A continuous time Markov chain
(CTMC) has the following property:

    P[X(t + τ) = j | X(t) = i, X(t − t₁) = k₁, X(t − t₂) = k₂, …, X(t − tₙ) = kₙ]
    = P[X(t + τ) = j | X(t) = i]
    = p_ij(τ)

for all τ > 0, 0 < t₁ < t₂ < … < tₙ.
A CTMC is completely described by the initial probability distribution π(0) and the
transition probability matrix P(t) = [p_ij(t)]. Then we can compute π(t) = π(0)P(t).
The problem is that p_ij(t) is generally very difficult to compute.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 136

CTMC Properties
This definition of a CTMC is not very useful until we understand some of the
properties.
First, notice that p_ij(τ) is independent of how long the CTMC has previously been in
state i, that is,

    P[X(t + τ) = j | X(u) = i for u ∈ [0, t]]
    = P[X(t + τ) = j | X(t) = i]
    = p_ij(τ)
There is only one random variable that has this property: the exponential random
variable. This indicates that CTMCs have something to do with exponential
random variables. First, we examine the exponential r.v. in some detail.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 137

Exponential Random Variables


Recall the properties of the exponential random variable. An exponential random
variable X with parameter λ has the CDF

    F_X(t) = P[X ≤ t] = { 0              t ≤ 0
                         { 1 − e^(−λt)    t > 0

The density function is given by f_X(t) = (d/dt) F_X(t):

    f_X(t) = { 0             t ≤ 0
              { λe^(−λt)      t > 0

The exponential random variable is the only continuous random variable that is
memoryless. To see this, let X be an exponential random variable representing the
time at which an event occurs (e.g., a fault arrival).
We will show that P[X > t + s | X > s] = P[X > t].

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 138

Memoryless Property
Proof of the memoryless property:

    P[X > t + s | X > s] = P[X > t + s, X > s] / P[X > s]
                         = P[X > t + s] / P[X > s]
                         = (1 − F_X(t + s)) / (1 − F_X(s))
                         = e^(−λ(t + s)) / e^(−λs)
                         = e^(−λt) e^(−λs) / e^(−λs)
                         = e^(−λt)
                         = P[X > t]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 139

Event Rate
The fact that the exponential random variable has the memoryless property
indicates that the rate at which events occur is constant, i.e., it does not change
over time.
Often, the event associated with a random variable X is a failure, so the event rate
is often called the failure rate or the hazard rate.
The event rate of X is defined as the probability that the event associated with X
occurs within the small interval [t, t + Δt], given that the event has not occurred by
time t, per the interval size Δt:

    P[t < X ≤ t + Δt | X > t] / Δt

This can be thought of as looking at X at time t, observing that the event has not
occurred, and measuring the number of events (probability of the event) that occur
per unit of time at time t.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 140

Observe that:

    P[t < X ≤ t + Δt | X > t] / Δt
    = P[t < X ≤ t + Δt, X > t] / (P[X > t] Δt)
    = P[t < X ≤ t + Δt] / (P[X > t] Δt)
    = (F_X(t + Δt) − F_X(t)) / ((1 − F_X(t)) Δt)
    → f_X(t) / (1 − F_X(t))   as Δt → 0, in general.

In the exponential case,

    f_X(t) / (1 − F_X(t)) = λe^(−λt) / (1 − (1 − e^(−λt))) = λe^(−λt) / e^(−λt) = λ.

This is why we often say a random variable X is exponential with rate λ.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 141

Minimum of Two Independent Exponentials


Another interesting property of exponential random variables is that the minimum
of two independent exponential random variables is also an exponential random
variable.
Let A and B be independent exponential random variables with rates λ and μ,
respectively. Let us define X = min{A, B}. What is F_X(t)?

    F_X(t) = P[X ≤ t]
           = P[min{A, B} ≤ t]
           = P[A ≤ t OR B ≤ t]
           = 1 − P[A > t AND B > t]        (see comb. methods section)
           = 1 − P[A > t] P[B > t]
           = 1 − (1 − P[A ≤ t])(1 − P[B ≤ t])
           = 1 − (1 − F_A(t))(1 − F_B(t))
           = 1 − (1 − [1 − e^(−λt)])(1 − [1 − e^(−μt)])
           = 1 − e^(−λt) e^(−μt)
           = 1 − e^(−(λ + μ)t)

Thus, X is exponential with rate λ + μ.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 142

Competition of Two Independent Exponentials


If A and B are independent and exponential with rates λ and μ, respectively, and A
and B are competing, then we know that one will win with an exponentially
distributed time (with rate λ + μ). But what is the probability that A wins?

    P[A < B] = ∫₀^∞ P[A < B | A = x] f_A(x) dx
             = ∫₀^∞ P[A < B | A = x] λe^(−λx) dx
             = ∫₀^∞ P[x < B] λe^(−λx) dx
             = ∫₀^∞ (1 − P[B ≤ x]) λe^(−λx) dx
             = ∫₀^∞ (1 − [1 − e^(−μx)]) λe^(−λx) dx
             = ∫₀^∞ e^(−μx) λe^(−λx) dx
             = λ ∫₀^∞ e^(−(λ + μ)x) dx = λ / (λ + μ)
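Both results are easy to check by Monte Carlo simulation. A minimal sketch (our own code; the rates 2.0 and 3.0 are arbitrary illustrative values):

    import random

    lam, mu, N = 2.0, 3.0, 100_000
    a_wins, total_min = 0, 0.0
    for _ in range(N):
        a = random.expovariate(lam)   # sample of A
        b = random.expovariate(mu)    # sample of B
        total_min += min(a, b)
        a_wins += (a < b)

    print("E[min(A,B)] ~", total_min / N, "(theory:", 1 / (lam + mu), ")")
    print("P[A < B]    ~", a_wins / N, "(theory:", lam / (lam + mu), ")")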

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 143

Competing Exponentials in CTMCs

[Figure: a three-state chain starting in state 1 (P[X(0) = 1] = 1), with arcs
1 → 2 at rate λ and 1 → 3 at rate μ.]

Imagine a random process X with state space S = {1, 2, 3} and X(0) = 1. X goes to
state 2 (takes on a value of 2) after an exponentially distributed time with parameter
λ. Independently, X goes to state 3 after an exponentially distributed time with
parameter μ. These state transitions are like competing random variables.
We say that from state 1, X goes to state 2 with rate λ and to state 3 with rate μ.
X remains in state 1 for an exponentially distributed time with rate λ + μ. This is
called the holding time in state 1. Thus, the expected holding time in state 1 is
1/(λ + μ).
The probability that X goes to state 2 is λ/(λ + μ). The probability X goes to state 3
is μ/(λ + μ).
This is a simple continuous-time Markov chain.


2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 144

Competing Exponentials vs. a Single Exponential with Choice

Consider the following two scenarios:
1. Event A will occur after an exponentially distributed time with rate λ. Event
   B will occur after an independent exponential time with rate μ.
2. After waiting an exponential time with rate λ + μ, event A occurs with
   probability λ/(λ + μ), and event B occurs with probability μ/(λ + μ).


These two scenarios are indistinguishable. In fact, we frequently interchange the
two scenarios rather freely when analyzing a system modeled as a CTMC.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 145

State-Transition-Rate Matrix
A CTMC can be completely described by an initial distribution π(0) and a
state-transition-rate matrix. A state-transition-rate matrix Q = [q_ij] is defined
as follows:

    q_ij = { rate of going from state i to state j,   i ≠ j
           { −Σ_{k≠i} q_ik,                           i = j

Example: A computer is idle, working, or failed. When the computer is idle, jobs
arrive with rate λ; when it is working, jobs are completed with rate μ. The computer
fails with rate ω_w when it is working, and with rate ω_i when it is idle.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 146

Simple Computer CTMC

[Figure: CTMC with states 1 (idle), 2 (working), and 3 (failed); arcs 1 → 2 at
rate λ, 2 → 1 at rate μ, 1 → 3 at rate ω_i, and 2 → 3 at rate ω_w.]

Let X = 1 represent the system is idle, X = 2 the system is working, and X = 3 a
failure.

        | −(λ + ω_i)    λ             ω_i |
    Q = | μ             −(μ + ω_w)    ω_w |
        | 0             0             0   |

If the computer is repaired with rate ρ (returning it to the idle state), the new
CTMC looks like:

        | −(λ + ω_i)    λ             ω_i |
    Q = | μ             −(μ + ω_w)    ω_w |
        | ρ             0             −ρ  |
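Given numeric rates, the steady-state distribution of the repairable model follows from π*Q = 0 with Σ_i π*_i = 1 (the equation developed later in this module). A minimal sketch (our own code and illustrative rate values, not taken from the slides; it assumes NumPy):

    import numpy as np

    lam, mu = 2.0, 4.0                 # arrival / completion rates (illustrative)
    w_i, w_w, rho = 0.01, 0.02, 1.0    # idle/working failure rates, repair rate

    Q = np.array([[-(lam + w_i),  lam,          w_i ],
                  [  mu,         -(mu + w_w),   w_w ],
                  [  rho,          0.0,        -rho ]])

    # Solve pi Q = 0 with the normalization sum(pi) = 1.
    A = Q.T.copy()
    A[-1, :] = 1.0
    b = np.zeros(3)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    print("pi* =", pi, " steady-state availability =", pi[0] + pi[1])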
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 147

Analysis of Simple Computer Model


Some questions that this model can be used to answer:
What is the availability at time t?
What is the steady-state availability?
What is the expected time to failure?
What is the expected number of jobs lost due to failure in [0,t]?
What is the expected number of jobs served before failure?
What is the throughput of the system (jobs per unit time), taking into
account failures and repairs?

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 148

State-Space Generation from SANs


If the activity delays are exponential, it is straightforward to convert a SAN to a
CTMC. We first look at the simple case, where there is no composed model.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 149

State Space (Generated by Möbius)


State No. | CPUboards1 | CPUboards2 | NumComp | (Next State, Rate)
----------+------------+------------+---------+---------------------------------------------------------
    1     |     3      |     3      |    2    | (2, λp1), (3, λp2), (4, λp3), (5, λp1), (6, λp2), (7, λp3)
    2     |     2      |     3      |    2    | (8, λp1), (3, λp2), (4, λp3), (9, λp1), (10, λp2), (11, λp3)
    3     |     0      |     3      |    1    | (12, λp1), (13, λ(p2+p3))
    4     |     0      |     3      |    0    | —
    5     |     3      |     2      |    2    | (9, λp1), (12, λp2), (14, λp3), (15, λp1), (6, λp2), (7, λp3)
    6     |     3      |     0      |    1    | (10, λp1), (13, λ(p2+p3))
    7     |     3      |     0      |    0    | —
    8     |     1      |     3      |    2    | (3, λ(p1+p2)), (4, λp3), (16, λp1), (17, λp2), (18, λp3)
    9     |     2      |     2      |    2    | (16, λp1), (12, λp2), (14, λp3), (19, λp1), (10, λp2), (11, λp3)
   10     |     2      |     0      |    1    | (17, λp1), (13, λ(p2+p3))
   11     |     2      |     0      |    0    | —
   12     |     0      |     2      |    1    | (20, λp1), (13, λ(p2+p3))
   13     |     0      |     0      |    0    | —
   14     |     0      |     2      |    0    | —
   15     |     3      |     1      |    2    | (19, λp1), (20, λp2), (21, λp3), (6, λ(p1+p2)), (7, λp3)
   16     |     1      |     2      |    2    | (12, λ(p1+p2)), (14, λp3), (22, λp1), (17, λp2), (18, λp3)
   17     |     1      |     0      |    1    | (13, λ)
   18     |     1      |     0      |    0    | —
   19     |     2      |     1      |    2    | (22, λp1), (20, λp2), (21, λp3), (10, λ(p1+p2)), (11, λp3)
   20     |     0      |     1      |    1    | (13, λ)
   21     |     0      |     1      |    0    | —
   22     |     1      |     1      |    2    | (20, λ(p1+p2)), (21, λp3), (17, λ(p1+p2)), (18, λp3)

Here λ denotes the (marking-dependent) failure rate of the corresponding activity,
p1, p2, p3 are its case probabilities, and states with no successors are absorbing.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 150

Underlying Markov Model (State Transition Rates Not Shown)


[Figure: the underlying Markov model drawn as a directed graph over the 22
generated states; state transition rates are not shown.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 151

Reduced Base Model Construction

Reduced Base Model construction techniques make use of composed model
structure to reduce the number of states generated.

A state in the reduced base model is composed of a state tree and an impulse
reward.

During reduced base model construction, the use of state trees permits an
algorithm to automatically determine valid lumpings based on symmetries in the
composed model.

The reduced base model is constructed by finding all possible (state tree,
impulse reward) combinations and computing the transition rates between states.

Generation of the detailed base model is avoided.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 152

Example Reduced Base Model State Generation

[Figure: the composed model — a Rep node replicating the computer submodel.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 153

Example Reduced Base Model States and Transitions


[Figure: example reduced base model states and transitions, shown as state trees.
State 1: R(NumComp = 2) with both replicas in computer(CPUboards = 3).
A covered failure leads to state 2: R(NumComp = 2), with one replica in
computer(CPUboards = 3) and one in computer(CPUboards = 2).
An uncovered failure leads to state 3: R(NumComp = 1), with replicas in
computer(CPUboards = 3) and computer(CPUboards = 0).
A catastrophic failure leads to state 4: R(NumComp = 0), with replicas in
computer(CPUboards = 3) and computer(CPUboards = 0).]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 154

Markov Chain of Reduced Base Model


(State Transition Rates not Shown)
[Figure: the Markov chain of the reduced base model, drawn over the reduced
states; state transition rates are not shown.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 155

State-Space Generation in Möbius

(For generating random process representations of models with all exponential
or exponential/deterministic timed activities)

[Screenshot: the Möbius state-space generator dialog. Options include:]
– Print out states and reward variables.
– Print out absorbing states; useful to detect problems when attempting a
  steady-state solution.
– Place comments, as specified by edit comments, in file.

State-space generation must be done before all analytic/numerical solutions are done.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 156

Numerical/Analytical Solution Techniques


1) Transient Solution
Standard Uniformization (instant-of-time variables)
Adaptive Uniformization (instant-of-time variables)
Interval-of-time Uniformization (expected value, interval-of-time variables)
2) Steady-state Solution
Direct Solution (instant-of-time steady-state variables)
Iterative Solution (instant-of-time steady-state variables)

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 157

CTMC Transient Solution


We have seen that it is easy to specify a CTMC in terms of the initial probability
distribution (0) and the state-transition-rate matrix.
Earlier, we saw that the transient solution of a CTMC is given by (t) = (0)P(t),
and we noted that P(t) was difficult to define.
Due to the complexity of the math, we omit the derivation and simply state the
relationship

    (d/dt) P(t) = QP(t) = P(t)Q,

where Q is the state-transition-rate matrix of the Markov chain.
Solving this differential equation in some form is difficult but necessary to compute
a transient solution.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 158

Transient Solution Techniques


Solutions to (d/dt)P(t) = QP(t) can be done in many (dubious) ways*:
– Direct: If the CTMC has N states, one can write N² ODEs with N² initial
  conditions and solve N² linear equations.
– Laplace transforms: unstable with multiple poles.
– Nth-order differential equations: uses determinants and hence is numerically
  unstable.
– Matrix exponentiation: P(t) = e^(Qt), where e^(Qt) = I + Σ_{n=1}^{∞} (Qt)ⁿ/n!.

Matrix exponentiation has some potential. Directly computing e^(Qt) by performing
I + Σ_{n=1}^{∞} (Qt)ⁿ/n! can be expensive and prone to instability.
If Q is diagonalizable, it is possible to take advantage of the fact that
Q = ADA⁻¹, where D is a diagonal matrix. Computing e^(Qt) then becomes
Ae^(Dt)A⁻¹, where e^(Dt) = diag(e^(d₁₁t), e^(d₂₂t), …, e^(dₙₙt)).

* See C. Moler and C. Van Loan, Nineteen Dubious Ways to Compute the Exponential of a Matrix, SIAM
Review, vol. 20, no. 4, pp. 801-836, October 1978.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 159

Standard Uniformization
Starting with the CTMC state-transition-rate matrix Q, construct:
1. A Poisson process with rate α ≥ max_i |q(i,i)|
2. A DTMC with P = I + Q/α

Then:

    π(t) = π(0) Σ_{k=0}^{∞} ((αt)ᵏ e^(−αt) / k!) Pᵏ

where (αt)ᵏ e^(−αt)/k! is the probability of k transitions in time t, and Pᵏ gives the
k-step state transition probabilities.

In actual computation:

    π(t) ≈ Σ_{k=0}^{N_s} ((αt)ᵏ e^(−αt) / k!) π(k),   with π(k + 1) = π(k)P.

– Choose the truncation point N_s to obtain the desired accuracy.
– Compute the π(k) iteratively, to avoid fill-in.
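The procedure above translates almost line-for-line into code. A minimal sketch (our own code, assuming NumPy; it is a naive implementation suitable only for modest values of αt, since the leading Poisson term e^(−αt) can underflow):

    import numpy as np

    def uniformize(Q, pi0, t, eps=1e-10, kmax=100_000):
        """Transient solution pi(t) of a CTMC by standard uniformization."""
        alpha = max(-Q.diagonal())             # uniformization rate
        P = np.eye(Q.shape[0]) + Q / alpha     # DTMC: P = I + Q/alpha
        v = pi0.astype(float).copy()           # v = pi(k), built iteratively
        weight = np.exp(-alpha * t)            # Poisson term for k = 0
        result = weight * v
        accum = weight
        for k in range(1, kmax):
            if 1.0 - accum <= eps:             # truncation point N_s reached
                break
            v = v @ P                          # pi(k) = pi(k-1) P
            weight *= alpha * t / k            # next Poisson term
            result += weight * v
            accum += weight
        return result

The value returned is a lower bound, as the next slide explains; the loop stops once the accumulated Poisson mass leaves less than eps unaccounted for.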
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 160

Error Bound in Uniformization

Answer computed is a lower bound, since each term in summation is positive,


and summation is truncated.
Number of iterations to achieve a desired accuracy bound can be computed
easily:

    Error for each state ≤ 1 − Σ_{k=0}^{N_s} ((αt)ᵏ / k!) e^(−αt)

Choose the error bound, then compute N_s on the fly, as uniformization is done.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 161

Transient Uniformization Solver

(for transient solution of instant-of-time variables)

– Instant-of-time variable time points of interest. Multiple time points may be
  specified, separated by spaces.
– Number of digits of accuracy in the solution. The solution reported is a lower
  bound.
– Volume of intermediate results reported. 1 gives the greatest volume, greater
  numbers less.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 163

Accumulated Reward Solver (ars)

(solves for expected values of interval-of-time and time-averaged interval-of-time
variables on intervals [t0, t1] when both t0 and t1 are finite)

– Number of digits of accuracy in the solution. The solution reported is a lower
  bound.
– Volume of intermediate results reported. 1 gives the greatest volume, greater
  numbers less.
– Series of time intervals for which solution is desired. Intervals are separated
  by spaces. Each interval can be specified as t1:t2.

The accumulated reward solver is based on uniformization, so the hints given for
the transient solver apply here as well.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 167

Steady-State Behavior of CTMCs, cont.


This yields the elegant equation π*Q = 0, where π* = lim_{t→∞} π(t) is the
steady-state probability distribution. If the CTMC is irreducible, then π* can be
computed with the constraint that

    Σ_{i=1}^{n} π*_i = 1.

Definition: A CTMC is irreducible if every state in the CTMC is reachable from


every other state.
If the CTMC is not irreducible, then more complex solution methods are required.
Notice that for irreducible CTMCs, the steady-state distribution is independent of
the initial-state distribution.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 170

Direct Steady-State Solution


One steady-state solver in Möbius is the direct steady-state solver. This solver
solves the augmented system π*Q̃ = e_iᵀ using a form of Gaussian elimination.
Pros: Can get a very accurate solution in a fixed amount of time;
      stiffness (described later) does not affect solution time.
Cons: Solution complexity is O(n³), so it does not scale well to large models;
      memory requirements are high due to fill-in and are not known a priori.
Recommendation: Use for small CTMCs (tens of states) or medium-sized and stiff
CTMCs (hundreds to a few thousands of states), or when high accuracy is required.
Reminder: High accuracy in solution does not mean high accuracy in prediction.
Use accuracy to do relative comparisons.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 171

Direct Steady-State Solver (dss)

(for steady-state solution of instant-of-time variables)


– Stopping criterion used in the iterative refinement phase, after the direct
  solution is done.
– Volume of intermediate results reported. 1 gives the greatest volume, greater
  numbers less.
– Number of rows to search for the best pivot when performing LU
  decomposition.
– Grace factor by which elements may become pivots.
– Value that, when multiplied by the smallest matrix element, is the threshold
  at which elements may be dropped in LU decomposition.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 172

Iterative Solution Methods


The simplest iterative solution methods are called stationary iterative methods, and
they can be expressed as

    π(k + 1) = π(k)M,

where M is a constant (stationary) matrix. Computing π(k + 1) from π(k) requires one
vector-matrix multiplication, or one iteration, which on modern workstations is
extremely fast.
The simplest stationary iterative method for CTMCs is called the power method.
Recall π*Q = 0. Let M = Q + I. Then

    π(M − I) = 0
    πM − π = 0
    πM = π,

so we iterate π(k + 1) = π(k)(Q + I).
The power method typically converges (gets close to the answer) slowly.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 174

Iterative Solution Characteristics


Stationary iterative solution methods have the following characteristics:
Low memory usage (no fill-in); predictable memory usage
Low time per iteration, proportional to the number of non-zero entries
Fast solution time for non-stiff matrices (tens or hundreds of iterations)
Stop when sufficiently accurate
Slow solution time for stiff matrices
Difficult to quantify accuracy, especially for stiff matrices
Easy to implement

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 175

Gauss-Seidel
One of the most widely used stationary iterative methods is called Gauss-Seidel.
The algorithm appears as follows:

    for k = 1 to convergence
        for i = 1 to n
            π_i^(k+1) = −(1/q_ii) [ Σ_{j=1}^{i−1} π_j^(k+1) q_ji + Σ_{j=i+1}^{n} π_j^(k) q_ji ]
        end for
    end for

An intuitive explanation for this algorithm:

    −q_ii π_i^(k+1)  =  Σ_{j=1}^{i−1} π_j^(k+1) q_ji + Σ_{j=i+1}^{n} π_j^(k) q_ji
    (flow out of node i)             (flow into node i)
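A minimal sketch of this sweep for πQ = 0 (our own code; a production solver would use sparse storage and would renormalize periodically rather than only at the end):

    import numpy as np

    def gauss_seidel(Q, max_iters=10_000, tol=1e-12):
        n = Q.shape[0]
        pi = np.full(n, 1.0 / n)                # initial guess
        for _ in range(max_iters):
            delta = 0.0
            for i in range(n):
                # Flow into i divided by the flow-out rate; uses the newest
                # values pi[j] for j < i and the previous values for j > i.
                inflow = sum(pi[j] * Q[j, i] for j in range(n) if j != i)
                new = inflow / -Q[i, i]
                delta = max(delta, abs(new - pi[i]))
                pi[i] = new
            if delta < tol:
                break
        return pi / pi.sum()                    # normalize to a distribution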

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 178

SOR
There is an extension to Gauss-Seidel called successive over-relaxation, or SOR,
that sometimes gives better performance.

Let Δx_i = x_i^(k+1) − x_i^(k), where x^(k) and x^(k+1) are the kth and (k + 1)th
Gauss-Seidel iterates. The (k + 1)th SOR iterate, x̃^(k+1), is computed as

    x̃_i^(k+1) = x_i^(k) + ω Δx_i,   where 0 < ω < 2.

Choosing ω is a hard problem in general. Automatic techniques for choosing ω
exist but are not implemented in Möbius.
Note: ω = 1 is the same as Gauss-Seidel.
Recommendation: Leave ω = 1 unless you are solving a similar system many times
and the matrix is stiff.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 179

Iterative Steady-State Solver (iss)

(for steady-state solution of instant-of-time variables)

– Stopping criterion, expressed as 10^(−x), where x is given. The criterion used
  is the infinity difference norm.
– SOR weight factor ω. Values < 1 guarantee convergence, but slow it.
  Values ≥ 1 speed convergence, but may not converge.
– Maximum number of iterations allowed.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 180

Möbius Analytical Solvers

Analytic solvers (for reward variables only):

Model class: all activities exponential
– Steady-state, instant-of-time^a: mean, variance, and distribution → dss and iss
– Transient, instant-of-time: mean, variance, and distribution → trs and atrs
– Transient, interval-of-time: mean → ars

Model class: exponential and deterministic activities
– Steady-state, instant-of-time^b: mean, variance, and distribution → diss and adiss

a. If only rate rewards are used, the time-averaged interval-of-time steady-state
   measure is identical to the instant-of-time steady-state measure (if both exist).
b. Provided the instant-of-time steady-state distribution is well-defined. Otherwise,
   the time-averaged interval-of-time steady-state variable is computed and only
   results for rate rewards should be derived.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 183

Case Study: Fault-Tolerant Embedded Multiprocessor System

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 184

Problem Origin

This problem was originally posed in 1992 as a reliability model of a large,


embedded fault-tolerant computer, presumably for space-borne applications. It
was posed as a hierarchical model with non-perfect coverage at each level, with
the purpose of showing the inadequacy of existing techniques.
Combinatorial methods were incapable of including coverage at all levels of
the hierarchy, thus grossly overstating the reliability.
Markov- or SPN-based methods create far too many states to solve.
Monte-Carlo simulation works, but provides only an estimate (which is
often not good enough).
A specialized tool was developed to do numerical integration of a semi-Markov
process to solve this and similar problems.
In Möbius, we solve a smaller version of the same architecture exactly using
Markov models generated by SANs. This is made possible by automatic state
lumping using composed models.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 186

Problem Description

System consists of 2 computers


Each computer consists of
3 memory modules (2 must be operational)
3 CPU units (2 must be operational)
2 I/O ports (1 must be operational)
2 error-handling chips (non-redundant)
Each memory module consists of
41 RAM chips (39 must be operational)
2 interface chips (non-redundant)
A CPU consists of 6 non-redundant chips
An I/O port consists of 6 non-redundant chips
10 to 20 year operational life

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 187

Diagram of Fault-Tolerant Multiprocessor System


[Figure: block diagram of the fault-tolerant multiprocessor system. Each computer
contains three memory modules (41 RAMs and 2 interface chips each) and a
2-chip error-handler unit attached to an interface bus, plus three CPU modules
(6 CPU chips each) and two I/O ports (6 I/O chips each); the system consists of
multiple such computers.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 188

Definition of Operational

The system is operational if at least one computer is operational


A computer is operational if all the modules are operational
A memory module is operational if at least 39 RAM chips and both interface
chips are operational.
A CPU unit is operational if all 6 CPU chips are operational
An I/O port is operational if all 6 I/O chips are operational
The error-handling unit is operational if both error-handling chips are
operational
Failure rate per chip is 100 failures per 1 billion hours

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 189

Coverage

This system could be modeled using combinatorial methods if we did not take
coverage into account. Coverage is the chance that the failure of a chip will not
cause the larger system to fail even if sufficient redundancy exists. I.e.,
coverage is the probability that the fault is contained.
The coverage probabilities are given in the following table:

    Redundant Component | Fault Coverage Probability
    RAM Chip            | 0.998
    Memory Module       | 0.95
    CPU Unit            | 0.995
    I/O Port            | 0.99
    Computer            | 0.95

For example, if a RAM chip fails, there is a 0.2% chance the memory module
will fail even if sufficient redundancy exists. If the memory module fails, there
is a 5% chance the computer will fail. If a computer fails, there is a 5% chance
the system will fail.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 190

Outline of Solution: List of SANs

The model is composed of four SANs:


1. memory_module
2. cpu_module
3. errorhandlers
4. io_port_module

Each SAN models the behavior of the module in the event of a module
component failure.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 191

List of Places

Seven places represent the state of the system:


1. cpus – the number of operational CPU modules
2. ioports – the number of operational I/O modules
3. errorhandlers – whether the two error-handler chips are operational
4. computer_failed – the number of failed computers
5. memory_failed – the number of failed memory modules
6. memory_chips – the number of operational RAM chips
7. interface_chips – the number of operational interface chips

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 192

List of Activities

Five activities represent failures in the system


1. cpu_failure – the failure of any CPU chip
2. ioport_failure – the failure of any I/O chip
3. errorhandling_chip_failure – the failure of either error-handler chip
4. memory_chip_failure – the failure of a RAM chip
5. interface_chip_failure – the failure of a memory interface chip

Cases on these activities represent behavior based on coverage or non-coverage.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 193

Tricks of the Trade


Since we intend to solve this model analytically, we want the fewest number of
states possible.
– We don't care which component failed or what particular failed state the
  model is in. Therefore, we lump all failure states into the same state.
– We don't care which computer or which module is in what state. Therefore,
  we make use of replication to further reduce the number of states.
– We use marking-dependent rates to model RAM chip failure, making use of
  the fact that the minimum of independent exponentials is an exponential.
– We use cases to denote coverage probabilities, and adjust the probabilities
  depending on the state of the system.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 194

Composed Model

[Figure: the composed model, built from Rep and Join nodes. A Rep node (Rep1,
3 replicas) shares the common places computer_failed and memory_failed; a Join
node (Join1) joins submodels over the common places of Subtree 1
(computer_failed, memory_failed, cpus, errorhandlers, ioports); a second Rep node
(Rep2) shares the common place computer_failed.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 195

cpu_modules SAN

[Figure: the cpu_modules SAN.]

    Place           | Initial Marking
    cpus            | 3
    ioports         | 2
    errorhandlers   | 2
    memory_failed   | 0
    computer_failed | 0

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 196

cpu_modules SAN, cont.

cpu_modules input gate predicates and functions:


    Gate | Enabling Predicate              | Function
    IG1  | (MARK(cpus) > 1) &&             | identity
         | (MARK(memory_failed) < 2) &&    |
         | (MARK(computer_failed) < 2)     |

cpu_modules activity time distributions:

    Activity    | Distribution
    cpu_failure | expon(0.0052596 * MARK(cpus))
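The constant 0.0052596 appears to be the per-module failure rate expressed in failures per year: 6 chips, each failing at 100 failures per 10⁹ hours, over 8766 hours (365.25 days) per year. A small sanity check (our own arithmetic, not from the slides):

    chips_per_cpu = 6
    rate_per_chip_per_hour = 100 / 1e9    # 100 failures per billion hours
    hours_per_year = 8766                 # 365.25 days * 24 hours

    rate = chips_per_cpu * rate_per_chip_per_hour * hours_per_year
    print(rate)                           # -> 0.0052596

Multiplying by MARK(cpus) then gives the minimum-of-exponentials rate over all operational CPU modules.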

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 197

cpu_modules SAN, cont.


cpu_modules case probabilities for activity module_cpu_failure:

    Case 1 (chip failure covered):
        if (MARK(cpus) == 3) return(0.995); else return(0.0);
    Case 2 (chip failure causes computer failure):
        if (MARK(cpus) == 3) return(0.00475); else return(0.95);
    Case 3 (chip failure causes system (catastrophic) failure):
        if (MARK(cpus) == 3) return(0.00025); else return(0.05);

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 198

cpu_modules SAN, cont.


cpu_modules output gate functions:

    Gate | Function
    OG1  | if (MARK(cpus) == 3) MARK(cpus)--;
    OG2  | MARK(cpus) = 0; MARK(ioports) = 0; MARK(errorhandlers) = 0;
         | MARK(memory_failed) = 2; MARK(computer_failed)++;
    OG3  | MARK(cpus) = 0; MARK(ioports) = 0; MARK(errorhandlers) = 0;
         | MARK(memory_failed) = 2; MARK(computer_failed) = 2;

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 199

Model Solution
For the modeled two-computer system with non-perfect coverage at all levels (i.e.,
the model as described), the state space contains 10,114 states. The 10-year mission
reliability was computed to be 0.995579.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 213

Impact of Coverage

Coverage can have a large impact on reliability and state-space size. Various
coverage schemes were evaluated with the following results.

    Design description                               | State-space size | Reliability (10-year mission time)
    100% coverage at all levels                      | 4278             | 0.999539
    Nonperfect coverage considered at all levels     | 10114            | 0.995579
    Nonperfect coverage considered at all levels,    | 1335             | 0.987646
      no spare memory module                         |                  |
    Nonperfect coverage considered at all levels,    | 3299             | 0.973325
      no spare CPU module                            |                  |
    Nonperfect coverage considered at all levels,    | 3299             | 0.985419
      no spare IO port                               |                  |
    Nonperfect coverage considered at all levels,    | 511              | 0.935152
      no spare memory module, CPU module, or IO port |                  |
    100% coverage at all levels, no spare memory     | —                | 0.702240
      module, CPU module, IO port, or RAM chips      |                  |

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 214

Solution by Simulation

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 216

Motivation

High-level formalisms (like SANs) make it easy to specify realistic systems, but
they also make it easy to specify systems that have unreasonably large state
spaces.
State-of-the-art tools (like Möbius) can handle state-level models with a few
tens of millions of states, but not more.
When state spaces become too large, discrete event simulation is often a viable
alternative.
Discrete-event simulation can be used to solve models with arbitrarily large state
spaces, as long as the desired measure is not based on a rare event.
When rare events are present, variance reduction techniques can sometimes be
used.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 218

Advantages of Simulation
Simulation can be applied to any SAN model. The most
prominent difference, compared with analytic solvers, is that
generally distributed activities can be used.
Simulation does not require the generation of a state space and
therefore does not require a finite state space. Therefore, much
more detailed models can be solved.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 219

Disadvantages of Simulation
Simulation only provides an estimate of the desired measure. An
approximate confidence interval is constructed that contains the actual
result with some user-specified probability.
Higher desired accuracy dramatically increases the necessary simulation
time. As a rule, to make the confidence interval n times narrower, the
simulation has to be run n² times as long.
The rare event problem may arise. If simulation is used to estimate a
small probability, such as the reliability of a highly-reliable system,
extremely long simulations may have to be performed to encounter the
particular event often enough.
Complicated models can require long simulation times, even if the rare
event problem is not an issue. The simulators in Möbius perform the
necessary event scheduling very efficiently, but it should be realized that
simulation is not a panacea.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 220

Simulation as Model Experimentation

State-based methods (such as Markov chains) work by enumerating all possible


states a system can be in, and then invoking a numerical solution method on the
generated state space.
Simulation, on the other hand, generates one or more trajectories (possible
behaviors from the high-level model), and collects statistics from these
trajectories to estimate the desired performance/dependability measures.
Just how this trajectory is generated depends on the:
nature of the notion of state (continuous or discrete)
type of stochastic process (e.g., ergodic, reducible)
nature of the measure desired (transient or steady-state)
types of delay distributions considered (exponential or general)
We will consider each of these issues in this module, as well as the simulation of
systems with rare events.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 221

Types of Simulation
Continuous-state simulation is applicable to systems where the notion of state is
continuous and typically involves solving (numerically) systems of differential
equations. Circuit-level simulators are an example of continuous-state simulation.
Discrete-event simulation is applicable to systems in which the state of the system
changes at discrete instants of time, with a finite number of changes occurring in
any finite interval of time.
Since we will focus on validating end-to-end systems, rather than circuits, we will
focus on discrete-event simulation.
There are two types of discrete-event simulation execution algorithms:
Fixed-time-stamp advance
Variable-time-stamp advance

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 222

Fixed-Time-Stamp Advance Simulation

Simulation clock is incremented a fixed time Δt at each step of the simulation.


After each time increment, each event type (e.g., activity in a SAN) is checked
to see if it should have completed during the time of the last increment.
All event types that should have completed are completed and a new state of the
model is generated.
Rules must be given to determine the ordering of events that occur in each
interval of time.
Example:

[Figure: a timeline divided into steps of length Δt (2Δt, 3Δt, 4Δt, 5Δt, …), with
events e1–e6 falling between the ticks; all events that occur within an interval
are completed at the end of that interval.]

Good for all models where most events happen at fixed increments of time (e.g.,
gate-level simulations).
Has the advantage that no future event list needs to be maintained.
Can be inefficient if events occur in a bursty manner, relative to time-step used.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 223

Variable-Time-Step Advance Simulation

Simulation clock advanced a variable amount of time each step of the


simulation, to time of next event.
If all event times are exponentially distributed, the next event to complete and
time of next event can be determined using the equation for the minimum of n
exponentials (since memoryless), and no future event list is needed.
If event times are general (have memory) then future event list is needed.
Has the advantage (over fixed-time-stamp increment) that periods of inactivity
are skipped over, and models with a bursty occurrence of events are not
inefficient.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 224

Basic Variable-Time-Step Advance Simulation Loop for SANs
A) Set list_of_active_activities to null.
B) Set current_marking to initial_marking.
C) Generate potential_completion_time for each activity that may complete in the
current_marking and add to list_of_active_activities.
D) While list_of_active_activities null:
1) Set current_activity to activity with earliest potential_completion_time.
2) Remove current_activity from list_of_active_activities.
3) Compute new_marking by selecting a case of current_activity, and executing
appropriate input and output gates.
4) Remove all activities from list_of_active_activities that are not enabled in
new_marking.
5) Remove all activities from list_of_active_activities for which new_marking is a
reactivation marking.
6) Select a potential_completion_time for all activities that are enabled in
new_marking but not on list_of_active_activities and add them to
list_of_active_activities.
E) End While.
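Because all delays here are exponential (and therefore memoryless), the loop above can be rendered very compactly: it is statistically valid, if inefficient, to redraw a fresh completion time for every enabled activity after each state change, so no future event list is needed. A minimal sketch (our own code; the names and interfaces are illustrative and are not the Möbius API):

    import random

    def simulate(activities, enabled, fire, marking, t_end):
        """activities: dict name -> rate; enabled(m, a) -> bool;
        fire(m, a) -> new marking. Exponential delays only."""
        t = 0.0
        while True:
            candidates = [(t + random.expovariate(r), a)
                          for a, r in activities.items() if enabled(marking, a)]
            if not candidates:
                break                      # no activity enabled
            t, a = min(candidates)         # earliest potential completion time
            if t > t_end:
                break
            marking = fire(marking, a)
        return marking

    # Tiny demo: a component that fails (rate 0.1) and is repaired (rate 1.0).
    acts = {"fail": 0.1, "repair": 1.0}
    can = lambda m, a: (m == "up") == (a == "fail")
    nxt = lambda m, a: "down" if a == "fail" else "up"
    print(simulate(acts, can, nxt, "up", t_end=100.0))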
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 225

Types of Discrete-Event Simulation


Basic simulation loop specifies how the trajectory is generated, but does not
specify how measures are collected, or how long the loop is executed.
How measures are collected, and how long (and how many times) the loop is
executed depends on type of measures to be estimated.
Two types of discrete-event simulation exist, depending on what type of measures
are to be estimated.
Terminating - Measures to be estimated are measured at fixed instants of
time or intervals of time with fixed finite point and length. This may also
include random but finite (in some sense) times, such as a time to failure.
Steady-state - Measures to be estimated depend on instants of time or
intervals whose starting points are taken to be t → ∞.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 226

Issues in Discrete-Event Simulation


1) How to generate potential completion times for events
2) How to estimate dependability measures from generated trajectories
Transient measures
Steady-state measures
3) How to implement the basic simulation loop
Sequential or parallel

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 227

Generation of Potential Completion Times


1) Generation of uniform [0,1] random variates
Used as a basis for all random variate samples
Types
Linear congruential generators
Tausworthe generators
Other types of generators
Tests of uniform [0,1] generators
2) Generation of non-uniform random variates
Inverse transform technique
Convolution technique
Composition technique
Acceptance-rejection technique
Technique for discrete random variates
3) Recommendations/Issues
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 228

Generation of Uniform [0,1] Random Number Samples


Goal: Generate sequence of numbers that appears to have come from uniform [0,1]
random variable.
Importance: Can be used as a basis for all random variates.
Issues:
1) Goal isn't to be random (non-reproducible), but to appear to be random.
2) Many methods to do this (historically), many of them bad (picking
numbers out of phone books, computing π to a million digits, counting
gamma rays, etc.).
3) Generator should be fast, and not need much storage.
4) Should be reproducible (hence the appearance of randomness, not the
reality).
5) Should be able to generate multiple sequences or streams of random
numbers.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 229

Linear Congruential Generators (LCGs)

Introduced by D. H. Lehmer (1951). He obtained

    x_n = aⁿ mod m, computed as x_n = (a·x_{n−1}) mod m.

Today, LCGs take the following form:

    x_n = (a·x_{n−1} + b) mod m, where
    – the x_n are integers between 0 and m − 1
    – a, b, m are non-negative integers

If a, b, m chosen correctly, sequence of numbers can appear to be uniform and


have large period (up to m).

LCGs can be implemented efficiently, using only integer arithmetic.

LCGs have been studied extensively; good choices of a, b, and m are known.
See, e.g., Law and Kelton (1991), Jain (1991).
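A minimal LCG sketch (our own code; the constants shown are the well-known "minimal standard" Lehmer parameters a = 16807, b = 0, m = 2³¹ − 1):

    class LCG:
        """x_n = (a * x_{n-1} + b) mod m, scaled into [0, 1)."""
        def __init__(self, seed=1, a=16807, b=0, m=2**31 - 1):
            self.x, self.a, self.b, self.m = seed, a, b, m

        def next(self):
            self.x = (self.a * self.x + self.b) % self.m
            return self.x / self.m

    gen = LCG(seed=42)
    print([round(gen.next(), 4) for _ in range(5)])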

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 230

Tausworthe Generators

Proposed by Tausworthe (1965), and are related to cryptographic methods.

Operate on a sequence of binary digits (0,1). Numbers are formed by selecting


bits from the generated sequence to form an integer or fraction.

A Tausworthe generator has the following form:

    b_n = c_{q−1}b_{n−1} ⊕ c_{q−2}b_{n−2} ⊕ … ⊕ c_0b_{n−q}

where b_n is the nth bit, ⊕ denotes exclusive-or, and the c_i (i = 0 to q − 1) are
binary coefficients.

As with LCGs, analysis has been done to determine good choices of the ci.

Less popular than LCGs, but fairly well accepted.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 231

Generation of Non-Uniform Random Variates

Suppose you have a uniform [0,1] random variable, and you wish to have a
random variable X with CDF FX. How do we do this?

All other random variates can be generated from uniform [0,1] random variates.

Methods to generate non-uniform random variates include:


Inverse Transform - Direct computation from single uniform [0,1] variable
based on observation about distribution.
Convolution - Used for random variables that can be expressed as sum of
other random variables.
Composition - Used when the distribution of the desired random variable
can be expressed as a weighted sum of the distributions of other random
variables.
Acceptance-Rejection - Uses multiple uniform [0,1] variables and a function
that majorizes the density of the random variate to be generated.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 232

Inverse Transform Technique


Suppose we have a uniform [0,1] random variable U.
If we define X = F⁻¹(U), then X is a random variable with CDF F_X = F.
To see this,

    F_X(a) = P[X ≤ a]
           = P[F⁻¹(U) ≤ a]
           = P[U ≤ F(a)]
           = F(a)
Thus, by starting with a uniform random variable, we can generate virtually any
type of random variable.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 233

Example of Inverse Transform

Let X be an exponentially distributed random variable with parameter λ. Let U be a
uniform [0,1] random variable generated by a pseudo-random number generator.

    F_X(a) = 1 − e^(−λa)
    X = F_X⁻¹(U) = −(1/λ) ln(1 − U)
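In code, the inverse transform for the exponential is one line (our own sketch; note that U and 1 − U are identically distributed, so implementations often use ln(U) directly):

    import math
    import random

    def exponential(lam):
        u = random.random()                # uniform [0,1) variate
        return -math.log(1.0 - u) / lam    # X = F_X^-1(U)

    samples = [exponential(0.5) for _ in range(100_000)]
    print(sum(samples) / len(samples))     # should be near 1/lambda = 2.0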

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 234

Convolution Technique

Technique can be used for all random variables X that can be expressed as the
sum of n random variables
X = Y1 + Y2 + Y3 + . . . + Yn

In this case, one can generate a random variate X by generating n random


variates, one from each of the Yi, and summing them.

Examples of random variables:


Sum of n Bernoulli random variables is a binomial random variable.
Sum of n exponential random variables is an n-Erlang random variable.
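For example, an n-Erlang variate with rate λ can be generated by summing n exponential variates, as in the minimal sketch below (our own code):

    import random

    def erlang(n, lam):
        # Sum of n independent exponential(lam) variates.
        return sum(random.expovariate(lam) for _ in range(n))

    mean = sum(erlang(3, 2.0) for _ in range(100_000)) / 100_000
    print(mean)    # should be near n/lambda = 1.5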

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 235

Composition Technique

Technique can be used when the distribution of a desired random variable can be
expressed as a weighted sum of other distributions.

In this case F(x) can be expressed as

    F(x) = Σ_{i=0}^{∞} p_i F_i(x),   where p_i ≥ 0 and Σ_{i=0}^{∞} p_i = 1.

The composition technique is as follows:


1) Generate random variate i such that P[I = i] = pi for i = 0, 1, . . .
(This can be done as discussed for discrete random variables.)
2) Return x as random variate from distribution Fi(x), where i is as chosen
above.

A variant of composition can also be used if the density function of the desired
random variable can be expressed as weighted sum of other density functions.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 236

Acceptance-Rejection Technique

Indirect method for generating random variates that should be used when other
methods fail or are inefficient.
Must find a function m(x) that majorizes the density function f(x) of the
desired distribution. m(x) majorizes f(x) if m(x) ≥ f(x) for all x.
Note:

    c = ∫ m(x) dx ≥ ∫ f(x) dx = 1,

so m(x) is not necessarily a density function, but m̂(x) = m(x)/c is a density
function.
If random variates for m̂(x) can be easily computed, then random variates for f(x)
can be found as follows:
1) Generate y with density m̂(x)
2) Generate u with uniform [0,1] distribution
3) If u ≤ f(y)/m(y), return y; else go to 1.
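A minimal sketch (our own example, not from the slides): generating variates with density f(x) = 2x on [0, 1], majorized by the constant function m(x) = 2, so that m(x)/c is simply uniform on [0, 1]:

    import random

    def accept_reject():
        while True:
            y = random.random()        # variate with density m(x)/c = uniform[0,1]
            u = random.random()        # uniform [0,1]
            if u <= (2.0 * y) / 2.0:   # accept if u <= f(y)/m(y)
                return y               # otherwise repeat (rejection)

    samples = [accept_reject() for _ in range(100_000)]
    print(sum(samples) / len(samples))   # E[X] = 2/3 for f(x) = 2x on [0,1]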

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 237

Generating Discrete Random Variates

Useful for generating any discrete distribution, e.g., case probabilities in a SAN.
More efficient algorithms exist for special cases; we will review most general
case.
Suppose the random variable has probability distribution p(0), p(1), p(2), … on the
non-negative integers. Then a random variate for this random variable can be
generated using the inverse transform method:
1) Generate u with distribution uniform [0,1]
2) Return j satisfying

    Σ_{i=0}^{j−1} p(i) ≤ u < Σ_{i=0}^{j} p(i)
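In code (our own sketch), this is a linear scan of the cumulative distribution; it can be used, e.g., to select a case of a SAN activity:

    import random

    def discrete(probs):
        """Return index j with probability probs[j] (inverse transform)."""
        u = random.random()
        cum = 0.0
        for j, p in enumerate(probs):
            cum += p
            if u < cum:
                return j
        return len(probs) - 1    # guard against floating-point round-off

    counts = [0, 0, 0]
    for _ in range(100_000):
        counts[discrete([0.995, 0.00475, 0.00025])] += 1
    print(counts)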

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 238

Recommendations/Issues in Random Variate Generation

Use standard/well-tested uniform [0,1] generators. Don't assume that because a
method is complicated, it produces good random variates.

method is complicated, it produces good random variates.

Make sure the uniform [0,1] generator that is used has a long enough period.
Modern simulators can consume random variates very quickly (multiple per
state change!).

Use separate random number streams for different activities in a model system.
Regular division of a single stream can cause unwanted correlation.

Consider multiple random variate generation techniques when generating
non-uniform random variates. Different techniques have very different efficiencies.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 239

Estimating Dependability Measures: Estimators and Confidence Intervals

An execution of the basic simulation loop produces a single trajectory (one


possible behavior of the system).

Common mistake is to run the basic simulation loop a single time, and presume
observations generated are the answer.

Many trajectories and/or observations are needed to understand a system's
behavior.

Need concept of estimators and confidence intervals from statistics:


Estimators provide an estimate of some characteristic (e.g., mean or
variance) of the measure.
Confidence intervals provide an estimate of how accurate an estimator is.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 240

Typical Estimators of a Simulation Measure

Can be:
Instant-of-time, at a fixed t, or in steady-state
Interval-of-time, for fixed interval, or in steady-state
Time-averaged interval-of-time, for fixed interval, or in steady-state
Estimators on these measures include:
Mean
Variance
Interval - Probability that the measure lies in some interval [x,y]
– Don't confuse this with an interval-of-time measure.
– Can be used to estimate the density and distribution function.
Percentile - The 100γ-th percentile is the smallest value x such that F(x) ≥ γ.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 241

Different Types of Processes and Measures Require Different Statistical Techniques

Transient measures (terminating simulation):


Multiple trajectories are generated by running basic simulation loop multiple
times using different random number streams. Called Independent
Replications.
Each trajectory used to generate one observation of each measure.
Steady-State measures (steady-state simulation):
Initial transient must be discarded before observations are collected.
If the system is ergodic (irreducible, recurrent non-null, aperiodic), a single
long trajectory can be used to generate multiple observations of each
measure.
For all other systems, multiple trajectories are needed.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 242

Confidence Interval Generation: Terminating Simulation


Approach:
Generate multiple independent observations of each measure, one
observation of each measure per trajectory of the simulation.
Observations of each measure will be independent of one another if different
random number streams are used for each trajectory.
From a practical point of view, new stream is obtained by continuing to
draw numbers from old stream (without resetting stream seed).
Notation (for subsequent slides):
– Let F(x) = P[X ≤ x] be the measure to be estimated.
– Define μ = E[X], σ² = E[(X − μ)²].
– Define x_i as the ith observation value of X (ith replication, for terminating
  simulation).
Issue: How many trajectories are necessary to obtain a good estimate?
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 243

Terminating Simulation: Estimating the Mean of a Measure I

Wish to estimate μ = E[X].
The standard point estimator of μ is the sample mean

    μ̂ = (1/N) Σ_{n=1}^{N} x_n

(μ̂ is unbiased, i.e., E[μ̂] = μ, and Var[μ̂] = σ²/N, where σ² = Var[X].)
To compute a confidence interval, we need to compute the sample variance:

    s² = (1/(N − 1)) Σ_{n=1}^{N} (x_n − μ̂)²
       = (1/(N − 1)) Σ_{n=1}^{N} x_n² − (N/(N − 1)) μ̂²

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 244

Terminating Simulation: Estimating the Mean of a Measure II

Then, the (1 − α) confidence interval about μ̂ can be expressed as:

    μ̂ − t_{N−1}(1 − α/2)·s/√N  ≤  μ  ≤  μ̂ + t_{N−1}(1 − α/2)·s/√N

where
– t_{N−1}(1 − α/2) is the 100(1 − α/2)th percentile of the Student's t distribution
  with N − 1 degrees of freedom (values of this distribution can be found in tables),
– s = √(s²) is the sample standard deviation,
– N is the number of observations.
The equation assumes the x_n are distributed normally (a good assumption for a
large number of x_i).
The interpretation of the equation is that with (1 − α) probability the real value
(μ) lies within the given interval.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 245

Terminating Simulation: Estimating the Variance of a Measure I

Computation of an estimator and confidence interval for the variance could be done
like that done for the mean, but the result is sensitive to deviations from the
normality assumption.
So, use a technique called jackknifing, developed by Miller (1974).
Define

    s_i² = (1/(N − 2)) Σ_{n≠i} (x_n − μ̂_i)²,   where   μ̂_i = (1/(N − 1)) Σ_{n≠i} x_n

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 246

Terminating Simulation: Estimating the Variance of a Measure II

Now define

    Z_i = N s² − (N − 1) s_i²,  for i = 1, 2, …, N,   and   Z̄ = (1/N) Σ_{i=1}^{N} Z_i

(where s² is the sample variance as defined for the mean), and

    s_Z² = (1/(N − 1)) Σ_{i=1}^{N} (Z_i − Z̄)²

Then

    Z̄ − t_{N−1}(1 − α/2)·s_Z/√N  ≤  σ²  ≤  Z̄ + t_{N−1}(1 − α/2)·s_Z/√N

is a (1 − α) confidence interval about σ².
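A minimal sketch of the jackknife procedure above (our own code, assuming SciPy for the t percentile; the data values are made up for illustration):

    import math
    from scipy.stats import t as t_dist

    def jackknife_variance_ci(xs, alpha=0.05):
        n = len(xs)
        mean = sum(xs) / n
        s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)    # sample variance
        zs = []
        for i in range(n):
            rest = xs[:i] + xs[i + 1:]                     # leave observation i out
            m_i = sum(rest) / (n - 1)
            s2_i = sum((x - m_i) ** 2 for x in rest) / (n - 2)
            zs.append(n * s2 - (n - 1) * s2_i)             # pseudovalue Z_i
        zbar = sum(zs) / n
        s2_z = sum((z - zbar) ** 2 for z in zs) / (n - 1)
        half = t_dist.ppf(1 - alpha / 2, n - 1) * math.sqrt(s2_z / n)
        return zbar - half, zbar + half                    # CI about sigma^2

    print(jackknife_variance_ci([4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7]))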

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 247

Terminating Simulation: Estimating the Percentile of an Interval About an Estimator

Computed in a manner similar to that for mean and variance.

Formulation can be found in Lavenberg, ed., Computer Performance Modeling


Handbook, Academic Press, 1983.

Such estimators are very important, since mean and variance are not enough to
plan from when simulating a single system.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 248

Confidence Interval Generation: Steady-State Simulation

Informally speaking, steady-state simulation is used to estimate measures that


depend on the long run behavior of a system.
Note that the notion of steady-state is with respect to a measure (which has
some initial transient behavior), not a model.
Different measures in a model will converge to steady state at different rates.
Simulation trajectory can be thought of as having two phases: the transient phase
and the steady-state phase (with respect to a measure).
Multiple approaches to collect observations and generate confidence intervals:
Replication/Deletion
Batch Means
Regenerative Method
Spectral Method
Which method to use depends on characteristics of the system being simulated.
Before discussing these methods, we need to discuss how the initial transient is
estimated.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 249

Estimating the Length of the Transient Phase


Problem: Observations of measures are different during the so-called transient
phase, and should be discarded when computing an estimator for steady-state
behavior.
Need: A method to estimate the length of the transient phase, to determine when
we should begin to collect observations.
Approaches:
Let the user decide: not sophisticated, but a practical solution.
Look at long-term trends: take a moving average and measure differences.
Use more sophisticated statistical measures, e.g., standardized time series
(Schruben 1982).
Recommendation:
Let the user decide, since automated methods can fail.
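
For the moving-average approach, one possible sketch (the window size and
tolerance are tuning knobs invented for this example; rules like this can fail,
which is why letting the user decide is recommended):

def transient_cutoff(obs, window=50, tol=0.01):
    """Scan successive windows of observations and report the first index at
    which consecutive window averages agree to within a relative tolerance;
    None means no settling was detected."""
    prev = None
    for k in range(window, len(obs) + 1, window):
        avg = sum(obs[k - window:k]) / window
        if prev is not None and abs(avg - prev) <= tol * abs(prev):
            return k - window   # observations before this index are discarded
        prev = avg
    return None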

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 250

Methods of Steady-State Measure Estimation: Replication/Deletion

Statistics similar to those for terminating simulation, but observations are
collected only on the steady-state portion of each trajectory.
One or more observations are collected per trajectory:

[Figure: n independent trajectories; each begins with a transient phase,
followed by steady-state observations O_i1, O_i2, O_i3, O_i4, ... on
trajectory i.]

Compute

    x_i = (1/M_i) Σ_{j=1..M_i} O_ij

as the i-th observation, where M_i is the number of observations in
trajectory i.
The x_i are considered to be independent, and confidence intervals are
generated as for terminating simulations.
Useful for a wide range of models/measures (the system need not be ergodic),
but slower than other methods, since the transient phase must be repeated
multiple times.
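
A sketch of the replication/deletion estimator (run_trajectory and its seed
parameter are hypothetical stand-ins for the simulator interface; the transient
cutoff is assumed to be known):

def replication_deletion(run_trajectory, n_reps, cutoff):
    """One steady-state observation x_i per trajectory: the average of the
    post-transient observations O_i1 .. O_iMi of that trajectory."""
    xs = []
    for rep in range(n_reps):
        obs = run_trajectory(seed=rep)   # independent replication
        steady = obs[cutoff:]            # delete the transient phase
        xs.append(sum(steady) / len(steady))
    return xs  # treated as i.i.d.; feed to a confidence-interval routine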
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 251

Methods of Steady-State Measure Estimation: Batch Means

Similar to replication/deletion, but constructs observations from a single
trajectory by breaking it into multiple batches. Example:

[Figure: one long trajectory; after the initial transient, the observations
are grouped into batches: O_11 ... O_1n1, then O_21 ... O_2n2, then
O_31 ... O_3n3, and so on.]

Observations from each batch are combined to construct a single observation;


these observations are assumed to be independent and are used to construct the
point estimator and confidence interval.
Issues:
How to choose batch size?
Only applicable to ergodic systems (i.e., those for which a single trajectory
has the same statistics as multiple trajectories).
Initial transient only computed once.
In summary, a good method, often used in practice.
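
A sketch of batch-means observation construction (batch size is the critical
user-chosen parameter; see the summary slide later in this section):

def batch_means(obs, cutoff, batch_size):
    """Split one long trajectory (after the initial transient) into batches
    and return one observation (the batch mean) per batch; batch_size must
    be large enough that the batch means are practically uncorrelated."""
    steady = obs[cutoff:]
    n_batches = len(steady) // batch_size
    return [sum(steady[b * batch_size:(b + 1) * batch_size]) / batch_size
            for b in range(n_batches)]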

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 252

Other Steady-State Measure Estimation Methods I

Regenerative Method (Crane and Iglehart 1974, Fishman 1974)


Uses renewal points in processes to divide batches.
Results in batches that are independent, so approach used earlier to generate
confidence intervals applies.
However, usually no guarantee that renewal points will occur at all, or that
they will occur often enough to efficiently obtain an estimator of the
measure.

Autoregressive Method (Fishman 1971, 1978)


Uses (as do the two following methods) the autocorrelation structure of the
process to estimate the variance of the measure.
Assumes the process is covariance stationary and can be represented by an
autoregressive model.
The above assumption is often questionable.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 253

Other Steady-State Measure Estimation Methods II

Spectral Method (Heidelberger and Welch 1981)


Assumes process is covariance stationary, but does not make further
assumptions (as previous method does).
Efficient method, if certain parameters chosen correctly, but choice requires
sophistication on part of user.

Standardized Time Series (Schruben 1983)


Assumes the process is strictly stationary and phi-mixing.
Phi-mixing means that O_i and O_{i+j} become uncorrelated as j grows large.
As with spectral method, has parameters whose values must be chosen
carefully.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 254

Summary: Measure Estimation and Confidence Interval Generation
1) Only use the mean as an estimator if it has meaning for the situation being
studied. Often a percentile gives more information. This is a common mistake!
2) Use some confidence interval generation method! Even if the results rely on
assumptions that may not always be completely valid, the methods give an
indication of how long a simulation should be run.
3) Pick a confidence interval generation method that is suited to the system that
you are studying. In particular, be aware of whether the system being studied is
ergodic.
4) If batch means is used, be sure that batch size is large enough that batches are
practically uncorrelated. Otherwise the simulation can terminate prematurely
with an incorrect result.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 255

Summary/Conclusions: Simulation-Based Validation Techniques
1) Know how random variates are generated in the simulator you use. Make sure:
A good uniform [0,1] generator is used
Independent streams are used when appropriate
Non-uniform random variates are generated in a proper way.
2) Compute and use confidence intervals to estimate the accuracy of your
measures.
Choose the correct confidence interval computation method based on the nature
of your measures and process.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 256

Simulator Editor

[Screenshot: the Möbius simulator editor. Fields include the maximum and
minimum number of replications to run, the number of batches between each
calculation of the variance, the trace level for debugging, and the file name
of the output file.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 261

Batch and Replication Outputs (Variable Output Option)


Typical batch output (one line per batch):

  Variable     Batch    Simulation     Time      Batch          Mean                             Variance
  Name         Number   Time           (CPU s)   Mean
  utilization  10       1.100000e+04   41        8.467695e-01   8.447065e-01 +/- 1.516121e-03    4.417886e-02 +/- 5.035103e-04

Typical replication output (one line per replication):

  Variable     Replication   Simulation     Time      Current        Mean                             Sample         Variance
  Name         Number        Time           (CPU s)   Value                                           Variance
  utilization  2400          1.000000e+02   1498      1.000000e+00   8.466667e-01 +/- 8.196275e-03    4.196934e-02   4.196934e-02 +/- 2.588252e-03

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 262

Möbius Simulation Techniques

Simulation characteristics and the applicable simulator:

  Steady-state    Instant-of-time or    Mean, Variance,    Variable              Applicable
  or Transient    Interval-of-time      or Distribution                          Simulator
  -------------   -------------------   ----------------   -------------------   --------------
  Transient       Instant-of-time and   Mean, Variance,    Reward Variable and   tsim and itsim
                  Interval-of-time      and Distribution   Activity Variable
  Steady-state    Instant-of-time       Mean, Variance,    Reward Variable and   ssim
                                        and Distribution   Activity Variable

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 263

Symbolic State-space Exploration and Numerical Analysis of State-sharing Composed Models

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 265

Motivation
State-space (SS) explosion or largeness problem in
discrete-state systems
Costly generation and representation of SS (space and
time)
Costly representation of CTMC (space)
Costly representation of solution vector (space) and
costly iteration/solution time (time)
Typical solutions:
Largeness avoidance, e.g., using lumping techniques
CTMC level
Model level
Largeness tolerance using BDD, MDD, MTBDD,
Kronecker, or Matrix Diagrams (MD)
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 266

What Is New?

Our approach combines:
  Model-level lumping induced by structural symmetries
    Number of states → solution vector size
    Number of states → iteration time
  MDD and Matrix Diagram (MD) data structures
    Enables us to represent lumped CTMCs not possible using a
    sparse matrix
    An order of magnitude faster than the unlumped sparse representation,
    although it induces a slowdown in solution time compared to the lumped
    sparse representation
  State-sharing composed models, as opposed to action synchronization
    Maintains almost the same generality

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 267

State-sharing Composed Models

Join and Replicate operators:

[Figure: a Join of atomic models M1 and M2 sharing state variable SV1, and a
Rep(3) node replicating atomic model M1 three times.]

Any atomic model formalism that can share state variables can be used,
e.g., SAN, PEPAk, and Buckets and Balls.
Replicate induces symmetry.
Global and local actions.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 268

Introduction to MDD

Represents a function f: S_1 × S_2 × ... × S_K → {0, 1, ..., n}.
Special case: n = 1, so f represents a set of vectors.

[Figure: a 3-level MDD, with local-state indices 0/1/2, 0/1, and 0/1/2 at its
levels, encoding the set
{(0,0,1), (0,0,2), (0,1,1), (0,1,2),
(1,0,1), (1,0,2), (1,1,0), (1,1,1),
(2,0,0), (2,0,1), (2,1,1), (2,1,2)}]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 269

Introduction to MDD

Represents a function f: S_1 × S_2 × ... × S_K → {0, 1, ..., n}.
Special case: n = 1, so f represents a set of vectors.

Representation of a set of states of a discrete-state model:
  Partition the set of SVs.
  Assign an index to each unique value assignment of the variables of
  each block.
  A vector of indices represents a state.

[Figure: the same 3-level MDD, encoding the state set
{(0,0,1), (0,0,2), (0,1,1), (0,1,2),
(1,0,1), (1,0,2), (1,1,0), (1,1,1),
(2,0,0), (2,0,1), (2,1,1), (2,1,2)}]
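
As a concrete illustration, a minimal MDD sketch in Python (hash-consed nested
tuples; an idealization, not the Möbius data structure), encoding the example
state set from this slide:

def build_mdd(vectors, sizes, unique=None):
    """Encode a set of index vectors as an MDD: a node at level k is a tuple
    with one child per possible value of variable k; terminals are booleans.
    Equal subgraphs are shared through the 'unique' table."""
    if unique is None:
        unique = {}
    if not sizes:                             # below the last level
        return bool(vectors)
    children = []
    for v in range(sizes[0]):                 # partition by the first index
        sub = [vec[1:] for vec in vectors if vec[0] == v]
        children.append(build_mdd(sub, sizes[1:], unique))
    node = tuple(children)
    return unique.setdefault(node, node)      # hash-cons: share equal nodes

def member(node, state):
    for idx in state:                         # follow one edge per MDD level
        node = node[idx]
    return node                               # True iff state is in the set

S = [(0,0,1), (0,0,2), (0,1,1), (0,1,2), (1,0,1), (1,0,2), (1,1,0), (1,1,1),
     (2,0,0), (2,0,1), (2,1,1), (2,1,2)]
mdd = build_mdd([list(v) for v in S], sizes=[3, 2, 3])
print(member(mdd, (1, 1, 0)), member(mdd, (2, 1, 0)))   # True False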

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 270

Introduction to MDD

Represents a function f: S_1 × S_2 × ... × S_K → {0, 1, ..., n}.
Special case: n = 1, so f represents a set of vectors.

Representation of a set of states of a discrete-state model:
  Partition the set of SVs.
  Assign an index to each unique value assignment of the variables of
  each block.
  A vector of indices represents a state.
  Augment by state offsets.

[Figure: the same 3-level MDD augmented with state offsets on its edges
(e.g., 12 at the top level), which allow a state's index to be computed from
its path through the diagram; the encoded state set is
{(0,0,1), (0,0,2), (0,1,1), (0,1,2),
(1,0,1), (1,0,2), (1,1,0), (1,1,1),
(2,0,0), (2,0,1), (2,1,1), (2,1,2)}]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 271

MDD data structure by example

Partitioning SVs based on the composition structure:
  Maximizes efficiency of local SS exploration
  Simplifies global SS exploration

[Figure: dependability model for a multicomputer system, composed as Rep2(N)
of a Join of cpu, error handler, IO port, and Rep1(M) of memory; the MDD
level assignment gives one level per composition node (Rep2, Join, Rep1,
mem), with the outer replicate annotated "3" and the inner replicate
annotated "2 + M".]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 272

Algorithm Overview
1. Generate MDD representation of the unlumped SS
2. Build MD representation of the CTMC
3. Convert the unlumped SS to the lumped SS
4. Solve the CTMC by iterating through the MD data structure

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 273

Symbolic Generation of Unlumped SS

Maintain the set of visited states and the set of unexplored states.
The state space expands using sequences of firings of local actions and
using single firings of global actions.
Never generate potential or unreachable states.
Create the necessary matrices and data structures to construct the MD of
the CTMC at a later stage.
No consideration of lumping properties at this stage.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 274

Symbolic SSG (Local Actions)

Restriction: immediate actions are local.
On-the-fly elimination of vanishing states.

[Figure: a local transition from state i to state j; the intermediate
vanishing state between actions A and B is eliminated, leaving a direct
i-to-j transition.]

Local SS expansion in levels corresponding to atomic models.
No assumption of knowing the local state space in advance.
Online computation of the transitive closure, based on Ibaraki and Katoh's
algorithm.
  Avoids costly computation of the transitive closure from scratch.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 275

Symbolic SSG (Global Actions)

A global action a in component c affects more than one level.
  No product-form-like restriction.
  The effect of a on each level need not be determined locally.
More difficult to handle than synchronizing actions.
An expensive operation.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 276

Lumping

Redundant states (paths):

[Figure: a Rep node over three replicas of atomic model AM; paths through
the corresponding MDD that differ only in the order of the replica
sub-states represent equivalent states.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 277

Lumping

Redundant states (paths).
A Rep node c implies an equivalence relation R_c.

[Figure: the same Rep / AM diagram.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 278

Lumping

Redundant states (paths).
A Rep node c implies an equivalence relation R_c.

[Figure: the same Rep / AM diagram.]

Overall equivalence relation (combining the relations R_c of all Rep nodes c).
Canonical representative state in each class: min(v).
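
A sketch of how the canonical representative can be computed, assuming a state
is a tuple in which the sub-states of a Rep node's replicas may be permuted
freely (the block layout is hypothetical):

def canonical(state, rep_blocks):
    """min(v): states that differ only by a permutation of replica sub-states
    are equivalent; sorting each replica block picks the lexicographically
    smallest member of the equivalence class. rep_blocks lists the index
    ranges of state that hold replica sub-states."""
    s = list(state)
    for lo, hi in rep_blocks:
        s[lo:hi] = sorted(s[lo:hi])
    return tuple(s)

# three replicas of an atomic model occupying positions 0..2:
assert canonical((2, 0, 1), [(0, 3)]) == (0, 1, 2)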

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 279

Lumping

Redundant states (paths).
A Rep node c implies an equivalence relation R_c.

[Figure: the same Rep / AM diagram.]

Overall equivalence relation (combining the relations R_c of all Rep nodes c).
Canonical representative state in each class: min(v).
The MDD of representative states may become exponentially large →
break it up into many much smaller MDDs → faster computation.


2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 280

Lumping

The lumped state space is the set of all states v where min(v) = v.
It may become huge → break it up into many much smaller MDDs.
It is often less structured than the unlumped state space, and therefore
larger in number of MDD nodes.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 281

SSG and Lumping Performance

Worst-case example: no local behavior.
Drastic decrease in the number of states in the lumped SS (up to 6 orders of
magnitude).
Increase in the number of nodes in the lumped state space, but still small
compared to the other entities.
Very small unlumped and lumped SS representations.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 282

CTMC Generation and Enumeration

Use Matrix Diagrams (MD) (Ciardo/Miner)


CTMC of largest example has <40000 nodes and takes <3MB of
memory

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 283

CTMC Generation and Enumeration

Use Matrix Diagrams (MD) (Ciardo/Miner).
CTMC of the largest example has <40,000 nodes and takes <3 MB of memory.
Projection of the MD on the lumped SS? Problem: some needed transitions are
deleted.

[Figure: wrong vs. correct projection of MD transitions onto the lumped SS.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 284

CTMC Generation and Enumeration

Use Matrix Diagrams (MD) (Ciardo/Miner).
CTMC of the largest example has <40,000 nodes, takes <3 MB of memory, and
takes at most a few seconds to build.
Projection of the MD on the lumped SS? Problem: some needed transitions are
deleted.
  Project rows on the lumped SS and columns on the unlumped SS.
  Redirect transitions on-the-fly.
DFS-based enumeration of the MD using a sorting MDD.

[Figure: wrong vs. correct projection of MD transitions onto the lumped SS.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 285

CTMC Enumeration Performance

Fairly fast iteration: less than 6 times slower than a lumped sparse matrix.
Enables solving larger CTMCs.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 286

Integration into Möbius

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 287

Case Study: Survivability Evaluation

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 288

Defending Against a Wide Variety of Attacks

[Figure: spectrum of attackers and motives, ordered from HIGH to LOW in
innovation, planning, stealth, and coordination. Nation-states, terrorists,
and multinationals: economic intelligence, information terrorism, military
spying, disciplined strategic cyber attack, selling secrets. Serious hackers:
civil disobedience, harassment, embarrassing organizations, stealing credit
cards, collecting trophies. Script kiddies: copy-cat attacks, curiosity,
thrill-seeking.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 289

Intrusion Tolerance: A New Paradigm for Security


1st Generation: Protection. Prevent Intrusions
  (Access Controls, Cryptography, Trusted Computing Base,
  Physical Security, Multiple Security Levels).
  But intrusions will occur.

2nd Generation: Detection. Detect Intrusions, Limit Damage
  (Boundary Controllers, Firewalls, Intrusion Detection Systems,
  Virtual Private Networks, PKI).
  But some attacks will succeed.

3rd Generation: Tolerance. Tolerate Attacks
  (Redundancy, Diversity, Deception, Wrappers, Proof-Carrying Code,
  Proactive Secret Sharing; big-board view of attacks, real-time situation
  awareness and response, graceful degradation, hardened operating systems).


2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 290

Validation of Computer System/Network Survivability

Security is no longer absolute.
Trustworthy computer systems/networks must operate through attacks,
providing proper service in spite of possibly partially successful attacks.
Intrusion tolerance claims to provide proper operation under such conditions.
Validation of security/survivability must be done:
  During all phases of the design process, to make design choices, and
  During testing, deployment, operation, and maintenance,
  to gain confidence that the amount of intrusion tolerance provided is as
  advertised.
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 291

Validating Computer System Security: Research Goal

CONTEXT: Create robust software and hardware that are fault-tolerant,
attack-resilient, and easily adaptable to changes in functionality and
performance over time.
GOAL: Create an underlying scientific foundation,
methodologies, and tools that will:
Enable clear and concise specifications,
Quantify the effectiveness of novel solutions,
Test and evaluate systems in an objective manner, and
Predict system assurance with confidence.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 292

Existing Security/Survivability Validation Approaches


Most traditional approaches to security validation have focused on
avoiding intrusions (non-circumventability), or have not been
quantitative, instead focusing on and specifying procedures that
should be followed during the design of a system (e.g., the Security
Evaluation Criteria [DOD85, ISO99]).
When quantitative methods have been used, they have typically either
been based on formal methods (e.g., [Lan81]), aiming to prove that
certain security properties hold given a specified set of assumptions,
or been quite informal, using a team of experts (often called a red
team, e.g., [Low01]) to try to compromise a system.
Both of these approaches have been valuable in identifying system
vulnerabilities, but probabilistic techniques are also needed.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 293

Example Probabilistic Validation Study

Evaluation of the DPASA-DV project design:
  Designing Protection and Adaptation into a Survivability Architecture:
  Demonstration and Validation
Design of a Joint Battlespace Infosphere (JBI):
  Publish, Subscribe, and Query features (PSQ)
  Ability to fulfill its mission in the presence of attacks, failures, or
  accidents
Uses multiple, synergistic validation techniques.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 294

JBI Design Overview

[Figure: the JBI core consists of four redundant quadrants plus the JBI
management staff, layered into an executive zone, an operations zone, and a
crumple zone facing the network. Each access proxy runs isolated process
domains in SE-Linux (Domain1-Domain6); the local controller first restarts
domains and eventually restarts the host. Proxy logic inspects, forwards,
and rate-limits traffic (PS, DC, PSQImpl, sensor reports) carried over
protocols such as TCP, UDP, STCP, IIOP, RMI, and Eascii.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 295

Survivability/Security Validation Goal

Provide convincing evidence that the design, when implemented, will provide
satisfactory mission support under real use scenarios and in the face of
cyber-attacks.
More specifically, determine whether the design, when implemented, will meet
the project goals.
This assurance case is supported by:
  Rigorous logical arguments
  Experimental evaluation
  A detailed executable model of the design
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 296

Goal: Design a Publish and Subscribe Mechanism that

Provides 100% of critical functionality when under sustained attack by a
Class-A red team with 3 months of planning.

Detects 95% of large-scale attacks within 10 min. of attack initiation, and
99% of attacks within 4 hours, with less than a 1% false alarm rate.

Prevents 95% of attacks from achieving attacker objectives for 12 hours.

Reduces low-level alerts by a factor of 1000 and displays meaningful attack
state alarms.

Shows survivability versus cost/performance trade-offs.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 297

Integrated Survivability Validation Procedure


[Figure: a requirement R undergoes requirement decomposition over a
functional model of the relevant subset of the system: models for the
client, access proxy, and PSQ server (components AA1-AA3, AP1-AP2, M1-M6),
connected through network domains L1-L3 (ADF). The functional model of the
system may be probabilistic or logical; its assumptions are backed by
supporting logical arguments and experimentation.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 298

Integrated Survivability Validation Procedure


Steps:

1. A precise statement of the requirements.

2. High-level functional model description:
   a) Data and alert flows for the processes related to the requirements,
   b) Assumed attacks and attack effects.
   [Threat/vulnerability analysis; whiteboarding]

[Figure: the functional model of the relevant subset of the system, as on
the previous slide.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 299

Integrated Survivability Validation Procedure


Steps (continued):

3. Detailed descriptions of model component behaviors representing 2a and
   2b, along with statements of the underlying assumptions made for each
   component.
   [Probabilistic modeling or logical argumentation, depending on the
   requirement]

[Figure: the same functional model.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 300

Integrated Survivability Validation Procedure


Steps (continued):

4. Construct an executable functional model.
   [Probabilistic modeling, if the model constructed in Step 3 is
   probabilistic]

In parallel:

5. a) Verification of the modeling assumptions of Step 3
      [logical argumentation] and,
   b) where possible, justification of the model parameter values chosen in
      Step 4 [experimentation].

[Figure: the same functional model.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 301

Integrated Survivability Validation Procedure


Steps (continued):

6. Run the executable model for the measures that correspond to the
   requirements of Step 1.
   [Probabilistic modeling]

[Figure: the same functional model.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 302

Integrated Survivability Validation Procedure


Steps (continued):

7. Comparison of the results obtained in Step 6, noting in particular the
   configurations and parameter values for which the requirements of Step 1
   are satisfied.

Note that if the requirement being addressed is not quantitative, Steps 4
and 6 are skipped.

[Figure: the same functional model, with the satisfied requirement S marked.]
2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 303

Step 1: Requirement Specification

Expressed in an argument graph:

[Argument graph: "JBI critical mission objectives" is supported by JBI
critical functionality, the JBI mission detection / correlation
requirements, and the IDS objectives. JBI critical functionality decomposes
into: initialized JBI provides essential services; JBI properly initialized;
authorized publish, subscribe, query, and join/leave processed successfully;
unauthorized activity properly rejected; and confidential info not exposed.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 304

Argument Graph for the Design


Requirements decomposition for the design, expressed as an argument graph:

[Argument graph: the JBI survivability requirements (PIP requirements 1-4)
branch into the intrusion detection / correlation requirements and the
functional requirements (initialized JBI provides essential services;
authorized publish, subscribe, query, and join/leave processed successfully;
JBI properly initialized; unauthorized activity properly rejected;
confidential info not exposed). Quantitative claims on dataflow, timeliness,
integrity, and confidentiality come from executing the functional model,
which must be faithful to the design (design team review) and whose
assumptions must hold. The assumption subtree covers the attack model
(parameter selection, CERT vulnerability DB analysis, attacks originate
outside the platform) and the component models: QIS incorruptibility,
communication cutoff, input integrity, and function correctness (hard-wired
configuration, electrical and physical isolation, gate configuration and
truth table); access proxy function correctness and application-layer
integrity/confidentiality; PSQ server IO integrity and client
confidentiality; SM Byzantine agreement (correctness of modified ITUA
protocols); sensor false alarm rate, detection delay, and detection
probability, and correlator false alarm rate (IDS experimental evaluation);
ADF policy server input correctness and synchronization; ADF NIC correctness
and physical security; platform mechanisms (SELinux type enforcement,
hardened kernels, process domain policies); private key confidentiality (DoD
Common Access Card, PKCS #11, key length and lifetime, tamperproofing); and
network topology and restricted routing.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 305

Step 2: System and Attack Assumption Definition


Example high-level description:

  Steps 4-5: The access proxy verifies that the client is in a valid session
  by sending the session key accompanying the IO to the Downstream
  Controller for verification.

  Step 6: The access proxy forwards the IO to the PSQ Server in its
  quadrant.
  ...

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 306

Attack Model Description


Definitions:
  Intrusion, prevented intrusion, tolerated intrusion
  New vulnerabilities
Assumptions:
  Outside attackers only
  Attacker(s) with unlimited resources
  Consider successful (and harmful) attacks only
  No patches applied for vulnerabilities found during the mission/scenario
  execution

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 307

Attack Model Description


Attack propagation:
  MTTD: mean time to discovery of a vulnerability
  MTTE: mean time to exploitation of a vulnerability
3 types of vulnerabilities:
  Infrastructure-level vulnerabilities → attacks in depth
    OS vulnerability
    Non-JBI-specific application-level vulnerability
    p_common: common-mode failure
  Data-level vulnerabilities → attacks in breadth
    Using the application data of JBI software
  Across process domains
    Flaw in protection domains
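
As an illustrative sketch, discovery and exploitation times can be sampled
from these means (exponentially distributed delays are an assumption of this
sketch, not something the attack model above prescribes):

import random

def sample_attack_times(mttd, mtte, n_vulns, horizon=720.0):
    """For each vulnerability, sample a discovery time (mean MTTD) and a
    later exploitation time (mean MTTE); events past the mission horizon
    (e.g., 720 min = 12 h) are dropped."""
    events = []
    for _ in range(n_vulns):
        t_discover = random.expovariate(1.0 / mttd)
        t_exploit = t_discover + random.expovariate(1.0 / mtte)
        if t_exploit <= horizon:
            events.append((t_discover, t_exploit))
    return sorted(events)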


2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 308

Attack Model Description


Attack effects:
  Compromise
    Launching pad for further attacks
    Malicious behavior
  Crash
    Attack propagation stopped
    Denial of service (DoS)
Distinction between OSes with and without protection domains.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 309

Attack Model Description


Intrusion detection:
  p_detect = 0 if the sensors are compromised
  p_detect > 0 otherwise
Attack responses:
  Restart processes
  Secure reboot
  Permanent isolation

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 310

Infrastructure Attacks Example


[Figure: example infrastructure attack. At T = 85 min., a vulnerability is
discovered on the Main process domain of OS 1. The attack, originating
outside, propagates from a publishing client running OS 1 through the
crumple-zone access proxy (Quad 1, OS 1) to all Quadrant 1 components
running OS 1: policy server, downstream controller (DC), PSQ server,
guardian, correlator, and SM, each shown with its ADF NIC, sensors (Se),
actuators (Ac), and local controller (LC). Access proxies and PSQ servers in
Quads 2-4, and SM replicas running OS 2, OS 3, and OS 4, are not affected by
the OS 1 vulnerability.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 311

Model of Access Proxy

Step 3: Detailed descriptions of model component behaviors and assumptions
(Access Proxy)

4.4 Access Proxy

4.4.1 Model Description
AM1: If a process domain in the DJM proxy is not corrupted, it forwards the
traffic it is designated to handle from the quadrant isolation switch to the
core quadrant elements and vice versa. All traffic being forwarded is
well-formed (if the proxy is correct). The following kinds of traffic are
handled:
1. IOs (together with tokens) sent from publishing clients to the core (we
   do not distinguish between IOs sent via different protocols such as RMI
   or SOAP/HTTP).
...
AM2: Attacks on an access proxy are enabled if either or both of the
following hold:
1. The quadrant isolation switch is ON, and one or more clients are
   corrupted, leading to:
   a) Direct attacks: can cause the corruption of the process domain
      corresponding to the domain of the attacking process on the
      compromised client.
...
AM3: If an attack occurs on the access proxy, it can have the following
effects:
1. Direct attacks leading to process corruption:
   a) Enable corruption of other process domains on the host.
...

4.4.2 Facts and Simplifications
AF1: Each access proxy runs on a dedicated host machine.
AF2: DoS attacks result in increased delays.
...

4.4.3 Assumptions
AA1: Only well-formed traffic is forwarded by a correct access proxy.
AA2: The access proxy cannot access cryptographic keys used to sign messages
     that pass through it.
AA3: The access proxy cannot access the contents of an IO if
     application-level end-to-end encryption is being used.
AA4: Attacks on an access proxy can only be launched from compromised
     clients, or from corrupted core elements that interact with the access
     proxy during the normal course of a mission.
...

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 312

Step 4: Construct Executable Functional Model

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 313

Step 5: Supporting Logical Arguments

[Argument graph: the quantitative requirements (JBI critical mission
objectives, mission awareness, critical functionality, and IDS objectives,
with their dataflow, timeliness, integrity, and confidentiality claims) are
supported by the functional model execution. The supporting tree establishes
that the functional model is faithful to the design (design team review) and
that its assumptions hold: the origin and propagation of attacks on clients,
access proxy, DC, guardian, PSQ server, sensors, actuators, local
controllers, correlator, and SM; access proxy function correctness,
application-layer integrity and confidentiality, and DoS prevention
(correctness of flow control mechanisms); QIS incorruptibility,
communication cutoff, input integrity, and function correctness (hard-wired
configuration, electrical and physical isolation, gate configuration and
truth table); client-core communication integrity and confidentiality
(registration, reattachment, and certificate exchange protocols, restricted
routing, network topology); SM Byzantine agreement (correctness of modified
ITUA protocols); sensor false alarm rate, detection delay, and detection
probability, plus correlator false alarm rate and alert integrity (IDS
experimental evaluation); process corruption assumptions backed by platform
mechanisms (SELinux, Trusted Solaris, Windows 2000, type enforcement,
hardened kernels, kernel loadable wrappers, VMWare over SELinux, process
isolation, component-specific policies); ADF correctness (NIC firmware, key,
and agent initialization, protocol and agent correctness, policy server
integrity, VPG integrity and confidentiality, policy correctness, host
independence, NIC physical security); and private key confidentiality (DoD
Common Access Card, PKCS #11, keys not guessable via algorithmic framework,
key length, and key lifetime, physical protection of the CAC device,
protection of CAC authentication data, no compromise of authorized processes
accessing the CAC, tamperproofing).]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 314

Logical Argument Sample


[Argument graph sample: the PSQ server model and the access proxy model feed
the functional model. Model assumptions SA3 (IO integrity in PSQ server),
SA4 (client confidentiality in PSQ server), AA2 (AP application-layer
integrity), and AA3 (AP application-layer confidentiality) rest on private
key confidentiality. Supporting arguments: keys protected from theft (no
unauthorized direct access via the DoD Common Access Card and PKCS #11
compliance, not preconfigured, not reconfigurable, physical protection of
the CAC device, protection of CAC authentication data, tamperproof; no
unauthorized indirect access via no compromise of authorized processes
accessing the CAC, ADF NIC services protected, and no cryptography in the
access proxy) and keys not guessable (algorithmic framework, key length,
key lifetime).]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 315

Steps 6 and 7: Measures and Results

Assumptions: C_PUB is the conjunction of
  C1_PUB = the publishing client is successfully registered with the core
  C2_PUB = the publishing client's mission application interacts with the
  client as intended
Definition of a successful publish: E_PUB is the conjunction of
  E1_PUB = the data flow for the IO is correct
  E2_PUB = the time required for the publish operation is less than t_max
  E3_PUB = the content of the IO received by the subscriber has the same
  essential content as that assembled by the publisher
Measure: P[E_PUB | C_PUB]
  Fraction of successful publishes in a 12-hour period
  Between clients that cannot be compromised
Objective: P[E_PUB | C_PUB] ≥ p_PUB for a 12-hour mission
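
A sketch of how such a measure is estimated from independent mission
replications (simulate_mission is a hypothetical stand-in for the executable
model; the normal-approximation interval is adequate for large n_reps):

import math

def estimate_p_success(simulate_mission, n_reps, z=1.96):
    """Estimate P[E_PUB | C_PUB] as the fraction of successful publishes
    over n_reps independent 12-hour mission replications, with an
    approximate 95% confidence interval."""
    successes = sum(1 for r in range(n_reps) if simulate_mission(seed=r))
    p_hat = successes / n_reps
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_reps)
    return p_hat, (p_hat - half, p_hat + half)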


2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 316

Vulnerability Discovery Rate Study

[Plots: fraction of successful publishes versus MTTD, and number of
successful intrusions versus MTTD.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
Slide 317

Varying the Number of OSes and OSes with Process Domains

[Bar chart: fraction of successful publishes for experiment configurations
varying the total number of OSes (1 to 4) and the number of process domains
(0 to 4 p.d.), with and without data attacks. Values range from 1.00 and
0.94 (4 OSes, 4 process domains, AP OS different from the core OSes) down to
roughly 0.52-0.57 (1 OS, 0-1 process domains); more OS diversity and more
process domains consistently increase the fraction of successful publishes.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 318

Autonomic Distributed Firewall (ADF) NIC policies


[Plots: fraction of successful publishes and total number of intrusions
versus MTTD (100 to 1000 min), for three ADF NIC policy granularities:
per process domain, per component, and no restriction.]

Per-process-domain policies considerably increase the performance (10%
unavailability vs. 1.5% at MTTD = 100 minutes).
ADF NICs can handle per-port policies => this feature should be exploited,
which implies setting the communication ports in advance.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 319

Design and Implementation Oriented Validation of Survivable Systems


A. Agbaria, T. Courtney, M. Ihde, W. H. Sanders, M. Seri, and S. Singh

[Poster overview combining design-phase and implementation-phase validation.

Design-phase validation. Step 1: formulate a precise statement of the
requirement R. Let PUB be the requirement to successfully process a publish
request, C the preconditions, and E the desired event (a successful
publish). E is a conjunction of:
  E1 = the data flow of the publish is correct
  E2 = timeliness
  E3 = integrity
  E4 = confidentiality
The requirement: PUB: P[E | C] ≥ p.

Step 2: if R is logically decomposable, decompose it iteratively. A study of
the design reveals that integrity and confidentiality can be regarded as
probability-1 events. We obtain the following logical decomposition:
  PUB1: P[E1 ∧ E2 | E3 ∧ E4 ∧ C] ≥ p
  PUB2: P[E3 | C] = 1
  PUB3: P[E4 | C] = 1
It can be shown that (PUB1 ∧ PUB2 ∧ PUB3) ⇒ PUB.

Step 3: for every atomic requirement Ra, apply logical argumentation (model
assumptions such as AA2: AP application-layer integrity and AA3: AP
application-layer confidentiality, supported by arguments such as PKCS #11
compliance, key length and lifetime, tamperproofing, and physical protection
of the CAC device).

Step 4: detailed description of components; build a high-level description
of the system and its operational environment. For quantitative requirements
an attack tree is built (e.g., "Defeat confidentiality of IO data": read
data on the client, in transit, or on the core, by compromising a client,
escalating privileges, defeating firewall access control or crypto, stealing
keys/certificates, sniffing packets, ARP spoofing, or re-routing traffic),
from which an attack graph is constructed automatically.

Step 5: justify the modeling assumptions of Step 4; verify assumptions and
parameter values. Step 6: construct a simulation model (a probabilistic
model of the system and its operational environment, including
infrastructure-level attacks on the survivable publish-subscribe system).
Step 7: evaluation and comparison with the requirement; the system is either
valid or not valid with respect to the requirement.]

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.
INFORMATION TRUST INSTITUTE, University of Illinois at Urbana-Champaign
www.iti.uiuc.edu
Slide 320

The Art of Dependability Evaluation / Conclusions

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 321

Course Outline Revisited

Issues in Model-Based Validation of High-Availability


Computer Systems/Networks
Stochastic Activity Network Concepts
Analytic/Numerical State-Based Modeling
Case Study: Embedded Fault-Tolerant Multiprocessor System
Solution by Simulation
The Art of System Dependability / Conclusions

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 322

Model Solution Issues

In general:
  Use tricks from probability theory to reduce the complexity of the model.
  Choose the right solution method.
Simulation:
  The result is just an estimate based on a statistical experiment.
  Estimation of the accuracy of the estimate is essential:
  use confidence intervals!
Analytic/numerical model solution:
  Avoid state-space explosion:
    Limit model complexity.
    Use the structure of the model (symmetries) to reduce state-space size.
  Understand the accuracy/limitations of the chosen numerical method:
    Transient solution
    (Iterative or direct) steady-state solution

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 323

The Art of Performance and Dependability Validation

Performance and dependability validation is an art because:
  There is no recipe for producing a good analysis.
  The key is knowing how to abstract away unimportant details, while
  retaining important components and relationships.
  This intuition only comes from experience.
  Experience comes from making mistakes.
  There are many ways to make mistakes.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 324

Doing it Right: Model Construction

Understand the desired measure before you build the model.
The desired measure determines the type of model and the level of detail
required. No model is universal!
Steps in constructing a model:
1. Choose the desired measures:
   The choice of measures forms a basis for comparison.
   It's easy to choose the wrong measure and see patterns where none exist.
   Measures should be refined during the design and validation process.
2. Choose the appropriate level of detail/abstraction for model components:
   The key is to represent the model at the right level of detail for the
   chosen measures.
   It is almost never possible or practical to include all system aspects.
   Model the system at the highest level possible to obtain a good estimate
   of the desired measures.
3. Build the model:
   Decide how to break up the model into modules, and how the modules will
   interact with one another.
   Test the model as you build it, to ensure it executes as intended.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 325

Doing it Right: Model Solution

Use the appropriate model solution technique:
  Just because you have a hammer doesn't mean the world is a nail.
  There is no universal model solution technique (not even simulation!).
  The appropriate model solution technique depends on model characteristics.
Use representative input values:
  The results of a model solution are only as good as the inputs.
  The inputs will never be perfect.
  Understand how uncertainty in inputs affects measures:
  do sensitivity analysis.
Include important points in the design/parameter space:
  Parameterize choices when design or input values are not fixed.
  A complete parametric study is usually not possible:
  some parameters will have to be fixed at nominal values.
  Make sure you vary the important ones.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 326

Doing it Right: Model Interpretation/Documentation

Make all your assumptions explicit:
  Results from models are only as good as the assumptions that were made in
  obtaining them.
  It's easy to forget assumptions if they are not recorded explicitly.
Understand the meaning of the obtained measures:
  Numbers are not insights.
  Understand the accuracy of the obtained measures, e.g., confidence
  intervals for simulation.
Keep social aspects in mind:
  Performance and dependability analysts almost always bring bad news.
  Bearers of bad news are rarely welcomed.
  In presentations, concentrate on results, not the process.

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 327

Next Steps
You have:
  Learned theory related to reliability, availability, and
  performance validation using SANs and Möbius.
  Learned about the advantages and disadvantages of various
  (analytical/numerical and simulation-based) solution
  algorithms.
There are many places to go for further information:
  Möbius Software Web pages
  (www.mobius.uiuc.edu)
  Performability Engineering Research Group Web pages
  (www.perform.csl.uiuc.edu)

2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author.

Slide 328
