Self Healing Systems Lecture

Fault-tolerance Techniques for
Software Systems

Dipankar Das
GM R&D, India Science Lab
March 2012, IIT Kharagpur
Automotive ECS Trends
100 Million lines of code executing on a
distributed embedded system of 50-70
ECUs and 7+ buses
2000 features with high interdependencies
More complex features in future
Hybrid PT, Fuel Cell, Displacement on Demand,
Braking
65 and sub-65 nm design of memory and
processors
Short development/testing time -- market
pressure
Multi-core/Distributed implementations
[Figure: Electronics & Software growth in the last decade. $400 to $1182 (+196%); 20 ECUs to 50 ECUs (+150%); 1M lines of code to 100M lines of code (+9900%).]
Autonomic Computing
A computer system which can recover from
faults without human intervention.
Continued execution in spite of faults (availability ++)
Short time between fault-detection and correction
(security and maintenance ++)
The broader topic is self-aware systems,
introduced by IBM's Autonomic Computing effort
Self-healing + self-optimizing + self-protecting + self-configuring
Self-* requires: self-awareness and context-awareness
Self-awareness: aware of self-state and behaviors
Context-awareness: aware of context of operation (i.e.
the environment)
Self-Aware Systems
Self-configuring: reconfigure automatically and
dynamically to changes in installation, update, integration
and composition.
Self-optimizing: adjust performance and resource
allocation to satisfy users.
Main concerns: QoS, Time-delay, Utilization
Self-protecting: detect and recover from security
breaches.
Self-Healing: self-diagnosis and self-repair. Can recover
from disruptions, either due to the environment or the
system.
Fault-tolerance for SW Systems
[Figure: Adaptation processes in a self-adaptive system [1]]
Fault-tolerance for SW Systems
The goal is fault-tolerance:
Fault -> Error (can be logged) -> Failure (undesirable effects)
Prevent faults from graduating into failures!
Applicable to both random and systematic faults.
Most hardware faults are random (wear-tear)
Most software faults are systematic (bugs), some hardware faults
are also systematic (HW is after all a program!)
Alternative strategies for fault-tolerance.
Replication, Diversity, Recovery-block/re-execution
Repair the current run vs. correct future runs (or both).
Will focus mostly on systematic faults
Examples of Systematic Faults
Systematic Faults
A design fault which is built-into the system
May be triggered randomly but are not random
Nondeterministic interleaving of events
Causes of Systematic faults:
Incorrect/Incomplete/Inconsistent requirements
Software is inconsistent with requirements
Implementation errors: memory object overrun, type-faults,
data-races, incorrect synchronization
Configuration errors: incorrect estimation of message queue
size.
Stack Faults -- Buffer Overflow
A buffer overflow occurs when a program writes more
data into an allocated buffer than it can hold.
When this happens, the next contiguous chunk of
memory is overwritten, such as
Return address
Function pointer
Previous frame pointer, etc.
An attacker can also inject attack code this way.
This can lead to a serious security problem.
The Typical AUTOSAR Stack
Example Code
Overwriting Return Address
Illegal Event Sequence Example
Nondeterministic: will trigger only if TaskA is interrupted by Interrupt B.
Very hard to reproduce!
Handling illegal sequences:
Locks to ensure mutual exclusion
Contention-free design of software and system
Software-transactional memories
Re-executes code in case of data contentions
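The re-execution idea can be sketched in C11 with a compare-and-swap retry loop. This is only an illustration in the spirit of the bullet above, not a software-transactional memory; the names shared_x, add_to_x and read_x are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared variable X; both the task and an interrupt handler may update it. */
static _Atomic int shared_x = 0;

/* Hypothetical task step: read X, compute on it, write X back.
 * If X was changed in between (for example by an interrupt), the
 * compare-exchange fails and the computation is re-executed on the
 * fresh value. Returns the number of attempts that were needed. */
int add_to_x(int delta)
{
    int seen = atomic_load(&shared_x);
    int attempts = 1;
    while (!atomic_compare_exchange_weak(&shared_x, &seen, seen + delta)) {
        /* 'seen' now holds the current value of X; try again. */
        attempts++;
    }
    assert(attempts >= 1);
    return attempts;
}

int read_x(void)
{
    return atomic_load(&shared_x);
}
```

The update is never lost, but the work inside the loop may run more than once, which is the cost that re-execution schemes accept instead of taking a lock.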
[Figure: a process reads X, processes based on the value of X, writes X, and completes processing. A hardware interrupt/preemption runs an Interrupt Service Routine that also reads and writes X.]
Examples of Random Faults
Soft Errors
Soft errors are an issue if silicon is < 65nm. State-of-the-art is 28nm
Effects are seen at the software level, so recovery/prevention
strategies can be built at the software level.
Impact of Neutron Strike on a Si Device
Secondary source of upsets: alpha particles from packaging
Strikes release electron & hole pairs that can be absorbed by the source & drain to alter the state of the device
[Figure: neutron strike on a transistor device, showing the source, the drain, and the released positive and negative charges.]
[Figure: decision chart for a strike on a state bit (e.g., in a register file)]
Bit read?
  no: benign fault, no error
  yes: Bit has error protection?
    yes, error can be corrected (e.g., ECC): no error
    yes, error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
    no: Does bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error
Some Acronyms
SDC = Silent Data Corruption
DUE = Detected & unrecoverable error
SER = Soft Error Rate = Total of SDC & DUE
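The difference between detecting and correcting an upset can be sketched with a single parity bit. This is a minimal illustration with hypothetical names (parity32, pw_store, pw_check), not a description of any particular hardware scheme.

```c
#include <assert.h>
#include <stdint.h>

/* Even-parity bit of a 32-bit word: 1 if the number of set bits is odd. */
unsigned parity32(uint32_t word)
{
    word ^= word >> 16;
    word ^= word >> 8;
    word ^= word >> 4;
    word ^= word >> 2;
    word ^= word >> 1;
    return word & 1u;
}

/* A stored word plus the parity bit computed when it was written. */
struct protected_word {
    uint32_t value;
    unsigned parity;
};

struct protected_word pw_store(uint32_t value)
{
    struct protected_word w = { value, parity32(value) };
    return w;
}

/* Returns 1 if the word still matches its parity bit, 0 if an upset is
 * detected. Parity catches any odd number of flipped bits but cannot
 * tell which bit flipped, so a detected error here is a DUE. */
int pw_check(const struct protected_word *w)
{
    assert(w != 0);
    return parity32(w->value) == w->parity;
}
```

A second flip in the same word restores the parity and goes unnoticed, which is one way a protected bit can still end up as silent data corruption.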
Evidence of Cosmic Ray Strikes
Documented strikes in large servers found in
error logs
Normand, Single Event Upset at Ground Level, IEEE Transactions on
Nuclear Science, Vol. 43, No. 6, December 1996.
Sun Microsystems, 2000
Cosmic ray strikes on L2 cache with defective error protection
caused Sun's flagship servers to suddenly and mysteriously crash!
Companies affected
Baby Bell (Atlanta), America Online, eBay, & dozens of others
Verisign moved to IBM Unix servers (for the most part)
Toyota Prius Recall
http://www.reuters.com/article/idUSTRE6293IC20100310
Plant Faults
GM CONFIDENTIAL
The plant is the mostly mechanical
component of a cyber-physical system,
which is controlled by electronics and
software.
Plant faults mainly result from mechanical
wear and tear.
They result in deviation from expected
behavior.
Hardware Faults
[Figure: stuck-at-1 (S-a-1) fault on the 5th bit of an address line.]
A State-based Definition of
System Behavior
The Computer System
[Figure: one computer system node. A processor core (P) with register file, instruction memory (I) holding a listing at addresses 0x0393-0x0396, data memory, a load/store queue (Q-mem), a communication queue (Q-com), and a communication controller (CC) attached to communication media 1 to k.]
Computer System State
The state of a computer system is the status of all the
components for all computer system nodes
Currently executing instruction/statement
Current status of communication and load/store queues
Current status of instruction and program memory
Current status of communication controllers indirectly
captures messages being transmitted in the system.
$E = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k)$
where, for each node 0 to k, I is the instruction memory, M the memory, Q the queues, P the processor core and CC the communication controller.
[Figure: each instruction memory is drawn as a listing at addresses 0x0393-0x0398.]
System + Environment
The state of the environment captures
The state of the mechanical/biological/chemical components
(the plant).
The input generation system (users)*
$E = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \mathit{Plant}, \mathit{Users}^*)$
Execution Space: Set of All Possible States
[Figure: the execution space, with the system-shutdown state marked. Each point is a state $E = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \mathit{Plant})$.]
System Behavior
[Figure: execution space with states S1, S2, S3, S4 and the system-shutdown state.]
Runs are sequences of states visited by the system
Transitions indicate a state change, say <S1, S2>
Transitions may be due to implemented system behavior, or due to environment
conditions (user pressing keys, mechanical systems response)
Some transitions (implementation behavior + environment behavior) can be faulty,
others can be fault-free or benign
Legend: some transitions are guided by implemented system behavior, others are due to the environment.
Functional Properties Characterize States
MULT: For all runs starting from this state,
if input1 = 5 and input2 = 3, then
there is a state in each run where
output = 15
[Figure: execution space with a start state where input1 = 5 and input2 = 3, states where output = 15, and the system-shutdown state.]
Correct States
Correct States: A set of states.
Execution states which satisfy
functional correctness properties (F)
and are reachable from the start state
(CTL* properties)
Correct Run:
A run traversing only correct states
Incorrect Run:
A run containing at least one incorrect state
[Figure: execution space with the correctness envelope, a correct run, an incorrect run, and the system-shutdown state.]
Acceptable States
Acceptable States:
Execution states which satisfy a weaker acceptability property (A). Acceptability
properties are basic properties which the system designer deems should be satisfied.
Ex: System must not crash vs. functional property MULT
[Figure: execution space with the acceptability envelope, which includes states that are incorrect but acceptable.]
Unacceptable Runs
[Figure: execution space with an unacceptable run.]
Unacceptable Run: One which goes through one or more unacceptable states
Examples:
The return address of a function is overwritten by a buffer overflow
The result of a solution varies from golden value by more than 20%.
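The second example can be turned into a small classifier. The sketch below assumes a scalar output and a known golden value; the enum and function names are hypothetical, and the 20% bound is passed in as a parameter.

```c
#include <assert.h>

enum state_class { STATE_CORRECT, STATE_ACCEPTABLE, STATE_UNACCEPTABLE };

/* Classify one output against the golden value.
 * tolerance is a fraction of the golden value, e.g. 0.20 for 20%. */
enum state_class classify_output(double output, double golden, double tolerance)
{
    double diff = output - golden;
    double bound = tolerance * golden;
    assert(golden != 0.0 && tolerance >= 0.0);
    if (diff < 0.0)
        diff = -diff;
    if (bound < 0.0)
        bound = -bound;
    if (diff == 0.0)
        return STATE_CORRECT;
    if (diff <= bound)
        return STATE_ACCEPTABLE;
    return STATE_UNACCEPTABLE;
}

/* A run is unacceptable if it goes through at least one unacceptable state. */
int run_is_acceptable(const double *outputs, int n, double golden, double tolerance)
{
    for (int i = 0; i < n; i++) {
        if (classify_output(outputs[i], golden, tolerance) == STATE_UNACCEPTABLE)
            return 0;
    }
    return 1;
}
```

With a golden value of 15, an output of 16 is incorrect but acceptable, while an output of 20 is off by a third and makes the whole run unacceptable.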
Unacceptable Runs
Unacceptable runs may be due to incorrect system implementation or due to
environment-triggered transitions
[Figure: execution space with an unacceptable run and the system behavior.]
Fail-Stop Execution
A mechanism which forces the execution to be stopped when the acceptability
boundary is breached.
Example: Program execution is stopped when return address (in stack) is overwritten
(in GCC 4.1 and above using the ProPolice mechanism)
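The guard-word idea behind such stack protection can be imitated by hand. The sketch below is not how the compiler implements it: it places a guard value after a buffer inside one struct so that the simulated overflow stays within a single object. The guard pattern and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CANARY_VALUE 0xDEADC0DEu   /* arbitrary guard pattern (assumption) */

struct guarded_buf {
    char     data[8];
    uint32_t canary;   /* plays the role of the guard before the return address */
};

void guarded_init(struct guarded_buf *g)
{
    memset(g->data, 0, sizeof g->data);
    g->canary = CANARY_VALUE;
}

/* Models the buggy code: copies len bytes starting at data[0] without
 * checking len against sizeof data. len is capped at the struct size so
 * that the demonstration itself never leaves the object. */
void buggy_copy(struct guarded_buf *g, const char *src, size_t len)
{
    assert(len <= sizeof *g);
    memcpy((unsigned char *)g, src, len);
}

int guarded_ok(const struct guarded_buf *g)
{
    return g->canary == CANARY_VALUE;
}

/* Fail-stop: halt as soon as the acceptability boundary is breached. */
void fail_stop_check(const struct guarded_buf *g)
{
    if (!guarded_ok(g))
        abort();
}
```

Calling fail_stop_check() after each write stops the program before the corrupted state can be used, which is the behavior the slide describes.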
Safe Exit
A mechanism which forces the execution to be altered such that it goes to a safe
exit point, before halting the program.
Example: Releasing locks before a program stops due to a detected stack smashing
attack. Writing back EEPROM before shutdown in automotive ECUs.
Resilient Execution
A mechanism which makes corrective action such that the execution remains within
the acceptability envelope.
Two types of corrections: One which leads to the run entering the correctness
envelope, and others which do not lead to correctness.
Examples: Reactive systems which execute code cyclically, with each iteration
reading an input and producing the corresponding output. Resetting of automotive
controllers on detection of stack overflows.
Self-healing Execution
Self-healing = Resilient execution (acceptability) + ensuring that the faulty run
does not happen in the future (Continued execution + Automated Repair)
If same inputs/environment conditions are available then
How do we prove that the corrected run and the modified behavior are
acceptable?
Abstraction of the State Space
Losing information to solve problems
Discretization of Time
How can we model the plant?
How about the memory? Is it not a set of transistors, each having
analogue behavior?
When trying to define a reasonable state system we perform
discretization on the variable time.
$E = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \mathit{Plant})$
Abstraction of State
There are multiple abstractions of data forming a lattice.
Lattice = Elements + partial-order (more details)
Many nice properties follow when we add/reduce information in this manner
Do we have a lattice for the discretization of time?
[Figure: a lattice of abstractions of the same memory contents. Top (T): least information. In between: memory seen as typed cells (int, short, double, return address), then as a static segment plus stack frame 1; many more abstractions are possible. Bottom: the raw bits (e.g., 001010011010100101, 64-bits): most information.]
Level of Abstraction = Our Interpretation of the Program

double = double;
short = short;
Static segment bound between [X,Y],
double = double + double;
return-address = 151;
Existence of a static segment
B = 10.1;
Memory objects in static segment.
B = 10.1;
A = 2;
[Figure: the memory as seen at each level, from typed cells (int, short, B (double), int, short, return address, short) and stack frame 1 down to raw bits.]
Level of Abstraction = What Checks Can Be Performed
double = double;
short = short;
(type checks)
double = double + double;
(corruption of return address)
B = 10.1;
(reason about B's value: partial functional
correctness)
B = 10.1;
A = 2; (Everything on memory)
[Figure: the memory as seen at each level, from typed cells and stack frame 1 down to raw bits.]
Granularity of Time = What Checks Can Be Performed
Sequence of events:
1. Buffer B overflows and overwrites ret.
2. Function corresponding to stack-frame-1 completes and returns.
3. Task-time budget completes and context switch happens.
4. Check at context switch.
Can we detect the overwriting of the return address?
[Figure: stack frame 1 (int, short, double[], int, short, ret., short) and the static segment.]
Key difference: data-abstraction vs. time granularity
We can refine our view to observe additional faults and take necessary
actions.
No scope for refinement after the completion of actions
Granularity of Time = What Healing Can Be Done
Sequence of events:
1. Buffer B overflows and overwrites ret.
2. Function corresponding to stack-frame-1 completes and returns.
3. Program Sequence Monitor catches fault after 10,000 instructions.
4. Task-time budget completes and context switch happens.
5. Check at context switch.
Can we recover from the fault?
[Figure: stack frame 1 (int, short, double[], int, short, ret., short) and the static segment.]
Often possible to log faults and detect them later.
For recovery/healing, the amount of data which must be backed up
is much larger than what is needed for detection, so selecting the
correct time granularity is critical.
Self-healing System Design Checklist
Level(s) of abstraction
What do we want to check and recover from (abstract check)
Decides the cost of checking + cost of recovery
May adaptively vary abstraction level.
Time Granularity
Decides what faults we can recover from.
Direct bearing on performance (make checks thin; they will run most of the time)
The Complexity of SH mechanism
We are adding additional code which should not make system more unreliable.
Decouple SH component from native code (thin interface)
Make SH code simple and small (formal verification possible)
Which resource to spend on SH
Trade-off between Flash, RAM, Cache, Processing, magnetic memory, network.
What is the acceptability of the repair
Realizability: What infra support does the SH mechanism need
Continued Execution vs. Automated Repair
Repairing Data Structure by Goal
Directed Reasoning
The Problem
Broken Data Structure:
Missing elements
Inappropriate sharing
Dangling references
Out of bounds array indices
Inconsistent values
[Figure: data-structure nodes with fields F and G (e.g., F = 20, G = 5; F = 20, G = 10).]
The Solution
Broken Data Structure -> Repair Algorithm -> Consistent Data Structure
Summary of Technique, One Step
Broken Bits -> (Model Definition Rules) -> Broken Abstract Model
Broken Abstract Model -> (Abstract Repair) -> Repaired Abstract Model
Repaired Abstract Model -> (Generate Concrete Data Structure Updates Using Planning) -> Repaired Bits
Repair strategy based on actions in abstract model
Corresponding data-structure repair found by goal-directed planning
Summary of Technique
Abstract data structure construction
Compute repaired abstract data structure
Goal-Directed Planning:
Actions: Abstraction rules
Initial State: Concrete data structure
Goal State: Any data structure which is close to the original data structure in terms of the repair action taken (in the abstract data structure)
An amalgamation of abstraction and goal-directed
reasoning.
File System Example: Concrete Data-Structure
struct disk {
int blockbitmap;
entry dir[numentries];
block block[numblocks];
}
struct entry {
byte name[Length];
int firstblock;
}
struct block {
int nextblock;
byte data[blocksize];
}
struct blockbitmap
subtype block {
int nextblock;
bit bitmap[numblocks];
}
The Original and Correct FS
[Figure: directory entries and disk blocks of two file systems. A correct file system: intro, 1011, 0, 2, -1, -1, 3, -1, with the block bitmap marked. The original file system: intro, -5, 2, -1, -1, 3, -1.]
The Abstraction
Sets of objects
set Block of block : Used | Free
set Used of block : Bitmap
Relations between objects
relation Next : Used, Used
relation BlockStatus : Block, boolean
[Figure: the set Block split into Used and Free, with Bitmap inside Used; the relation Next on Used blocks; the relation BlockStatus from Block to boolean.]
Note: This is the abstraction of the
file system, while the actual file
system is a bit map with multiple
components.
Abstraction in terms of sets, relations between sets, and sub-setting
Rules for Abstraction
$\forall i \in [0..\texttt{numentries}-1],\ 0 \le \texttt{d.dir[i].firstblock} \Rightarrow \texttt{d.block[d.dir[i].firstblock]} \in \mathit{Used}$
$\forall b \in \mathit{Used},\ 0 \le \texttt{b.nextblock} \Rightarrow (b, \texttt{d.block[b.nextblock]}) \in \mathit{Next}$
$\forall b \in \mathit{Used},\ 0 \le \texttt{b.nextblock} \Rightarrow \texttt{d.block[b.nextblock]} \in \mathit{Used}$
For all directory entries, the first block is used
For each used block, if next-block index >= 0, then the tuple containing the said
block and the block with this index is contained in the Next(-block) relation
If a block has a valid next-block index, then the block pointed to by this index is Used
Quantifier + Condition on Concrete Data-str => Set inclusion
Rules for Abstraction
$\forall b \in [0..\texttt{numblocks}-1],\ \neg(\texttt{d.block[b]} \in \mathit{Used}) \Rightarrow \texttt{d.block[b]} \in \mathit{Free}$
$\mathit{true} \Rightarrow \texttt{d.block[d.blockbitmap]} \in \mathit{Bitmap}$
$\forall j \in [0..\texttt{numblocks}-1],\ \forall b \in \mathit{Bitmap},\ \mathit{true} \Rightarrow \langle \texttt{d.block[j]}, \texttt{b.bitmap[j]} \rangle \in \mathit{BlockStatus}$
If a block is not in Used, then it is in Free
The block pointed to by blockbitmap is contained in the set Bitmap
The block-status relation is given by the elements of the bitmap block
Quantifier + Condition on Concrete Data-str => Set inclusion
Example of the Abstraction
[Figure: abstract model of the file system (intro, -5, 2, -1, -1, 3, -1): blocks 0 to 3 grouped into the Used and Free sets, the Bitmap set, and the Next relation.]
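Consistency Constraints: The Checks
$|\mathit{Bitmap}| = 1$
$\forall u \in \mathit{Used},\ u.\mathit{BlockStatus} = \mathit{true}$
$\forall f \in \mathit{Free},\ f.\mathit{BlockStatus} = \mathit{false}$
$\forall b \in \mathit{Used},\ |\mathit{Next}.b| \le 1$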
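The two BlockStatus constraints can be checked and repaired in a few lines of C. This is a simplified sketch, not the authors' tool: it assumes a tiny file system whose bitmap is stored inline rather than in a disk block, it ignores the Bitmap set itself, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMBLOCKS  4
#define NUMENTRIES 1

/* Simplified concrete structure (assumption: inline bitmap). */
struct tiny_fs {
    int  firstblock[NUMENTRIES];  /* d.dir[i].firstblock, -1 if unused    */
    int  nextblock[NUMBLOCKS];    /* d.block[b].nextblock, -1 at chain end */
    bool bitmap[NUMBLOCKS];       /* BlockStatus: true means marked used   */
};

/* Abstraction: derive the Used set from directory entries and Next chains. */
void build_used(const struct tiny_fs *fs, bool used[NUMBLOCKS])
{
    assert(fs != 0);
    for (int b = 0; b < NUMBLOCKS; b++)
        used[b] = false;
    for (int i = 0; i < NUMENTRIES; i++) {
        int b = fs->firstblock[i];
        /* Follow the chain; the used[] test also stops on cyclic chains. */
        while (b >= 0 && b < NUMBLOCKS && !used[b]) {
            used[b] = true;
            b = fs->nextblock[b];
        }
    }
}

/* Check: count blocks whose BlockStatus disagrees with Used/Free membership. */
int count_violations(const struct tiny_fs *fs)
{
    bool used[NUMBLOCKS];
    int violations = 0;
    build_used(fs, used);
    for (int b = 0; b < NUMBLOCKS; b++) {
        if (fs->bitmap[b] != used[b])
            violations++;
    }
    return violations;
}

/* Repair: rewrite BlockStatus so both constraints hold; returns the fix count. */
int repair_bitmap(struct tiny_fs *fs)
{
    bool used[NUMBLOCKS];
    int fixed = 0;
    build_used(fs, used);
    for (int b = 0; b < NUMBLOCKS; b++) {
        if (fs->bitmap[b] != used[b]) {
            fs->bitmap[b] = used[b];
            fixed++;
        }
    }
    return fixed;
}
```

The check and the repair both work on the abstract Used set rather than on raw bits, which is the point of building the model first.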
The Diagnosis
Evaluate consistency properties, find violations
$|\mathit{Bitmap}| = 1$ is violated: the Bitmap set is empty
Repairing Violations of Model Consistency Properties
Violation provides binding for quantified variables
Convert body of the constraint to disjunctive normal form
$(p_1 \wedge \ldots \wedge p_n) \vee \ldots \vee (q_1 \wedge \ldots \wedge q_m)$
$p_1, \ldots, p_n, q_1, \ldots, q_m$ are basic propositions
Choose a conjunction to satisfy
Repair violated basic propositions in conjunction
Repairing Violations of Basic Propositions
Inequality constraints on values of numeric fields
$V.R = E$, $V.R < E$, $V.R \le E$, $V.R > E$, $V.R \ge E$
Compute value of expression, assign relation
Presence of required number of objects
$|S| = C$, $|S| \le C$, $|S| \ge C$
Remove or insert objects from/to set
Topology of region surrounding each object
$|V.R| = C$, $|V.R| \le C$, $|V.R| \ge C$
$|R.V| = C$, $|R.V| \le C$, $|R.V| \ge C$
Remove or insert tuples from/to relation
Inclusion constraints: $V \in S$, $V_1 \in V_2.R$, $(V_1, V_2) \in R$
Remove or add the object or tuple from/to set or
relation
Repairing Inconsistencies
Repair the violation of ($|\mathit{Bitmap}| = 1$) (DNF format) by adding a
block to the Bitmap set
Goal-Directed Reasoning
Abstract repairs add or remove objects (or
tuples) to sets (or relations)
Goal: find concrete data structure updates with
same effect
1) Find model definition rules that construct the relevant
set or relation
2) Basic strategy:
For removals, appropriately falsify the guards of all
these model definition rules.
For additions, appropriately satisfy the guard of one
of these model definition rules.
Goal-Directed Reasoning in Example
Abstract Repair: add block 0 to the Bitmap set
Abstraction Rules:
$\forall i \in [0..\texttt{numentries}-1],\ 0 \le \texttt{d.dir[i].firstblock} \Rightarrow \texttt{d.block[d.dir[i].firstblock]} \in \mathit{Used}$
$\forall b \in \mathit{Used},\ 0 \le \texttt{b.nextblock} \Rightarrow (b, \texttt{d.block[b.nextblock]}) \in \mathit{Next}$
$\forall b \in [0..\texttt{numblocks}-1],\ \neg(\texttt{d.block[b]} \in \mathit{Used}) \Rightarrow \texttt{d.block[b]} \in \mathit{Free}$
$\mathit{true} \Rightarrow \texttt{d.block[d.blockbitmap]} \in \mathit{Bitmap}$
$\forall j \in [0..\texttt{numblocks}-1],\ \forall b \in \mathit{Bitmap},\ \mathit{true} \Rightarrow \langle \texttt{d.block[j]}, \texttt{b.bitmap[j]} \rangle \in \mathit{BlockStatus}$
Relevant Abstraction Rule:
$\mathit{true} \Rightarrow \texttt{d.block[d.blockbitmap]} \in \mathit{Bitmap}$ (guard already satisfied)
Action Taken: d.block[d.blockbitmap] = block 0
Corresponding Data Structure Update:
d.blockbitmap = index of block 0 in d.block
array
Repair in Example
[Figure: original file system (intro, -5, 2, -1, -1, 3, -1) and updated file system (intro, 0, 2, -1, -1, 3, -1), with the block bitmap marked and its contents shown as xxxx.]
Note: Bitmap details
are still abstracted
Multiple Repairs
Some broken data structures may require
multiple repairs
Reconstruct model
Reevaluate consistency constraints
Perform any required additional repairs
Multiple repairs needed either due to complex
repair or due to refinement (some repair rule
acts on a more refined model of the system)
Re-abstracted Model
[Figure: re-abstracted model with blocks 0 to 3 in the Used and Free sets, the Bitmap set, the Next relation, and the BlockStatus relation mapping blocks to true or false.]
Note: The BlockStatus relation has been refined in this case.
Refinement assigns arbitrary values to unassigned variables
Can be treated as an environment action
Diagnosis in New Model
Re-evaluate model constraints, find violations of
$\forall u \in \mathit{Used},\ u.\mathit{BlockStatus} = \mathit{true}$ and $\forall f \in \mathit{Free},\ f.\mathit{BlockStatus} = \mathit{false}$
-- Rule violations due to refinement of the BlockStatus relation
Action 2: Fix BlockStatus
Repair violations of
$\forall u \in \mathit{Used},\ u.\mathit{BlockStatus} = \mathit{true}$ and $\forall f \in \mathit{Free},\ f.\mathit{BlockStatus} = \mathit{false}$
by modifying the BlockStatus relation
Repaired File System
[Figure: repaired file system (intro, 1011, 0, 2, -1, -1, 3, -1), with the block bitmap marked.]
Repair Plan Graph
[Figure: repair plan graph. Nodes are state predicates (an abstraction of the state); edges are actions or environment actions.]
State predicates:
$|\mathit{Bitmap}| \ne 1$
f.BlockStatus = true for any free block, f.BlockStatus = false for any used block
Actions:
Add block to Bitmap
Replace <f,true/false> with <f,false/true> in BlockStatus
b.bitmap[j] = false for j = indexof(f)
Remove tuples <f,true>, <f,false> from BlockStatus by removing Bitmap
Environment action when refining:
Assign arbitrary tuples to f.BlockStatus
Experience
We acquired five benchmarks (written in C/C++)
AbiWord
x86 emulator
CTAS (air-traffic control tool)
Simplified Linux file system
Freeciv interactive game
We developed specifications for all five
Little development time (days, not weeks)
Most of time spent figuring out Freeciv and CTAS
Each benchmark has
Workload
Bug or fault insertion methodology
Ran benchmarks with and without repair
Snapshot of Results
Application    Time to Check Consistency (ms)    Time to Check and Repair (ms)
AbiWord        0.06                              0.55
CTAS           0.07                              0.15
Freeciv        3.62                              15.66
File system    4.22                              263.14
Checklist summary:
Level of abstraction: Multiple. Set-relation abstraction of data: object level for blocks, bit level for bitmap, type level for next. Needs traversal of about 1/Kth of memory, with K integers in a block.
Time granularity: Variable, possibly at a suitable execution-block level
Complexity of SH mechanism: 20,000 lines of code, deeply related to data, complex operations on data
Resource spent on SH: Mainly CPU
Acceptability of the repair: No guarantees, empirically observed to be successful
Realizability: Access to data and the ability to modify it; no memory protection
Continued execution vs. automated repair: Continued execution
ClearView: Code Patching Using
Online Learning
Detect, Learn, Repair
[Figure: all executions pass through an attack detector (pluggable; does not depend on learning), which separates normal executions from attacks (or bugs). Learning [Lin & Ernst 2003] turns normal executions into predictive constraints; repair produces a patch.]
Learn normal behavior (constraints)
from successful runs
Check constraints during attacks
Predictive constraints: true on every good run, false during every attack
Patch to re-establish constraints
Restores normal behavior
Evaluate and distribute patches
Learning Normal Behavior
Observe normal behavior: clients do local inference (e.g., copy_len <= buff_size)
Generalize observed behavior: clients send inference results; the server generalizes (merges results)
[Figure: community machines and server.]
Attack Detection & Suppression
Detector collects information
and terminates application
Detectors used in our research:
Code injection (Memory Firewall)
Memory corruption (Heap Guard)
Many other possibilities exist
[Figure: community machines and server.]
Learning Attack Behavior
What was the effect of the attack?
Instrumentation continuously evaluates learned behavior
Clients send difference in behavior: violated constraints
Server correlates constraints to attack
[Figure: community machines and server.]
Repair
Propose a set of patches for each behavior that predicts the attack
Predictive: copy_len <= buff_size
Server generates a set of patches
Candidate patches:
1. Set copy_len = buff_size
2. Set copy_len = 0
3. Set buff_size = copy_len
4. Return from procedure
Repair
Distribute patches to the community
Ranking:
Patch 1: 0
Patch 2: 0
Patch 3: 0
Repair
Evaluate patches
Success = no detector is triggered
Detector is still running on clients
When attacked, clients send outcome to server
Server ranks patches
Ranking:
Patch 3: +5
Patch 2: 0
Patch 1: -5
Repair
Redistribute the best patches
Server redistributes the most effective patches (here, Patch 3)
Dynamic Invariant Detection

Daikon generalizes observed program
executions
Many optimizations for accuracy and speed
Data structures, code analysis, statistical tests,
We further enhanced the technique
Candidate constraints:
copy_len < buff_size
copy_len <= buff_size
copy_len = buff_size
copy_len >= buff_size
copy_len > buff_size
copy_len != buff_size
Observation:
copy_len: 22
buff_size: 42
Remaining candidates:
copy_len < buff_size
copy_len <= buff_size
copy_len != buff_size
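The elimination step can be sketched as a bit set of candidate relations that each observation filters. This is only the core idea; Daikon itself adds the optimizations and statistical tests mentioned above, and the names below are hypothetical.

```c
#include <assert.h>

/* One bit per candidate relation between two observed values a and b. */
enum {
    REL_LT  = 1 << 0,   /* a <  b */
    REL_LE  = 1 << 1,   /* a <= b */
    REL_EQ  = 1 << 2,   /* a == b */
    REL_GE  = 1 << 3,   /* a >= b */
    REL_GT  = 1 << 4,   /* a >  b */
    REL_NE  = 1 << 5,   /* a != b */
    REL_ALL = (1 << 6) - 1
};

/* Drop every candidate that the observation (a, b) falsifies. */
unsigned eliminate(unsigned candidates, long a, long b)
{
    unsigned holds = 0;
    assert((candidates & ~(unsigned)REL_ALL) == 0);
    if (a <  b) holds |= REL_LT;
    if (a <= b) holds |= REL_LE;
    if (a == b) holds |= REL_EQ;
    if (a >= b) holds |= REL_GE;
    if (a >  b) holds |= REL_GT;
    if (a != b) holds |= REL_NE;
    return candidates & holds;
}
```

Starting from all six candidates, the observation copy_len = 22, buff_size = 42 leaves the three relations that hold for that pair; a later run in which the two values are equal would leave only copy_len <= buff_size.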
Repair Example
if (!(copy_len <= buff_size))
copy_len = buff_size;
The repair checks the predictive constraint
If constraint is not violated, no need to repair
If constraint is violated, an attack is (probably) underway
The patch does not depend on the detector
Should fix the problem before the detector is triggered
Repair is not identical to what a human would write
Unacceptable to wait for human response
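Written out at source level, the repair wraps the vulnerable copy as below. ClearView applies its patches to the running binary, so this C function is only an illustration of the effect; patched_copy and its parameter order are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies at most buff_size bytes from src into buff and returns the
 * number of bytes actually copied. */
size_t patched_copy(char *buff, size_t buff_size, const char *src, size_t copy_len)
{
    assert(buff != 0 && src != 0);
    /* Repair: re-establish the learned constraint copy_len <= buff_size. */
    if (!(copy_len <= buff_size))
        copy_len = buff_size;
    memcpy(buff, src, copy_len);
    return copy_len;
}
```

On a normal run the check is a no-op. During an attack the copy is truncated, so the program keeps running with possibly incomplete data instead of a corrupted stack or heap.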
Example Constraints & Repairs
$v_1 \le v_2$
if (!(v1 <= v2)) v1 = v2;
$v \ge c$
if (!(v >= c)) v = c;
$v \in \{c_1, c_2, c_3\}$
if (!(v==c1 || v==c2 || v==c3)) v = ci;
Return from enclosing procedure
if (!(...)) return;
Modify a use: convert "call *v" to
if (...) call *v;
Constraint on v (not negated)
ClearView Was Successful
Detected all attacks, prevented all exploits
For 7/10 vulnerabilities, generated a patch that
maintained functionality
No observable deviation from desired behavior
After an average of 4.9 minutes and 5.4 attacks
Handled polymorphic attack variants
Handled simultaneous & intermixed attacks
No false positives
Low overhead for detection & repair
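Checklist summary:
Level of abstraction: Program invariants (predicate abstraction)
Time granularity: At the level of granularity of the checkers, e.g. Heap Guard
Complexity of SH mechanism: Implementation size unknown. Changes code control-flow.
Resource spent on SH: Mainly CPU
Acceptability of the repair: No guarantees, empirically observed to work. Changes code control-flow. Very difficult to verify!
Realizability: Virtual instruction cache; restricted to JIT setups
Continued execution vs. automated repair: Automated repair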
Exterminator: Memory Fault
Recovery
Diagnosing Buffer Overflows
Canonical buffer overflow:
Allocate object too small
Write past end -- corrupts object δ bytes forward
Not necessarily contiguous
char * str = new char[8];
strcpy(str, "goodbye cruel world");
[Figure: heap of objects 8 10 2 9 4 5 1 7, with the bad object (too small) and the corruption δ bytes past its end.]
Isolating Buffer Overflows
Canaries in freed space detect corruption
Run multiple times with DieFast allocator
Key insight: Overflow must be at same δ
=> object 9 overflowed, with high probability
[Figure: randomized heap layouts from several runs (e.g., 8 10 2 9 3 4 5 1 7; 1 8 7 5 3 2 9 10 6 4; 4 9 6 3 8 5 7 2 1). Red = possible bad object, green = not bad object.]
Isolating Dangling Pointers
Dangling pointer error:
Live object freed too soon
Overwritten by some other object
int * v = new int[4];
delete [] v; // oops
char * str = new char[16];
strcpy(str, "die, pointer");
v[3] = 4;
// use of v[0]
Isolating Dangling Pointers
Unlike buffer overflow:
dangling pointer => same corruption in all
[Figure: three randomized heap layouts of objects 1 to 12.]
Correcting Allocator
Generate runtime patches to correct errors
Track object call sites in allocator
Prevent overflows: pad overflowed objects
malloc(8) => malloc(8 + δ)
Prevent dangling pointers: defer frees
free(ptr) => delay δ mallocs; free(ptr)
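Both corrections can be sketched as thin wrappers around malloc and free. This is a simplified illustration, not Exterminator's implementation: the padding is passed in by the caller instead of being looked up per call site, and frees are deferred by a fixed number of later frees rather than by a number of mallocs. All names and the queue length are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define DEFER_SLOTS 4   /* how many frees are held back (assumption) */

static void  *deferred[DEFER_SLOTS];
static size_t deferred_count = 0;   /* number of occupied slots */
static size_t deferred_head = 0;    /* index of the oldest entry */

/* Pad the request so that a small overflow lands in slack space. */
void *padded_malloc(size_t size, size_t pad)
{
    if (pad > (size_t)-1 - size)
        return NULL;                 /* size + pad would wrap around */
    return malloc(size + pad);
}

/* Queue the pointer; the oldest queued pointer is really freed only
 * once the queue is full. */
void deferred_free(void *ptr)
{
    if (ptr == NULL)
        return;
    if (deferred_count == DEFER_SLOTS) {
        free(deferred[deferred_head]);
        deferred[deferred_head] = ptr;
        deferred_head = (deferred_head + 1) % DEFER_SLOTS;
    } else {
        deferred[(deferred_head + deferred_count) % DEFER_SLOTS] = ptr;
        deferred_count++;
    }
}

size_t deferred_pending(void)
{
    return deferred_count;
}

/* Really free everything that is still queued, e.g. at shutdown. */
void deferred_flush(void)
{
    while (deferred_count > 0) {
        free(deferred[deferred_head]);
        deferred_head = (deferred_head + 1) % DEFER_SLOTS;
        deferred_count--;
    }
    assert(deferred_count == 0);
}
```

A write through a stale pointer shortly after deferred_free() still hits memory that has not been handed to another object, which is why deferring frees masks dangling-pointer errors at the cost of extra memory.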
Exterminator Runtime Overhead
25%
Checklist summary:
Level of abstraction: Memory objects
Time granularity: At each malloc/free call
Complexity of SH mechanism: Not a very complex implementation
Resource spent on SH: Mainly CPU, memory (but then that would be needed anyways)
Acceptability of the repair: Probabilistic guarantees that the repair will work.
Realizability: None.
Continued execution vs. automated repair: Automated repair
Exercise 1
Write a C program in which a local array overflows (say in a for loop)
significantly (enough to overwrite the return address).
1. Compile and run this code on your Linux workstation. What do you
observe? What is the explanation of your observation? Prepare a 1-
page write-up on this.
2. If possible repeat the same on a Windows machine.
3. Re-write this code in C#. What happens in this case?
Please email your write-up for these three questions to Satya Gautam
(TA): vsatyagautam@gmail.com
Exercise 2
Write your own fault-tolerant my_malloc() and my_free() functions
which do the following:
Maintain a priority queue with elements containing pointers,
additional lifetime, and additional buffer space. Set the free-delay and
additional buffer space randomly.
Additional buffer space = random number between 0 and ceil{size/10}
Additional life = random number between 0-5 events. Event = a malloc/free call.
For each my_malloc() allocate additional memory, for each my_free() delay
the actual free operation by additional lifetime.
Try it out on a buggy code which you have not been able to fix!
Thank You
Acyclic Repair Dependences
Questions
Isn't it possible for the repair of one
constraint to invalidate another constraint?
What about infinite repair loops?
What about unsatisfiable specifications?
Answer
We require specifications to have no cyclic
repair dependences between constraints
So all repair sequences terminate
Repair can fail only because of resource
limitations
References
Automatically patching errors in deployed software by Jeff H.
Perkins, Sunghun Kim, Sam Larsen, Saman Amarasinghe,
Jonathan Bachrach, Michael Carbin, Carlos Pacheco, Frank
Sherwood, Stelios Sidiroglou, Greg Sullivan, Weng-Fai Wong,
Yoav Zibin, Michael D. Ernst, and Martin Rinard. In Proceedings
of the 21st ACM Symposium on Operating Systems Principles,
(Big Sky, MT, USA), October 12-14, 2009, pp. 87-102.
Brian Demsky, Martin C. Rinard, "Goal-Directed Reasoning
for Specification-Based Data Structure Repair," IEEE
Transactions on Software Engineering, vol. 32, no. 12, pp.
931-951, Dec. 2006, doi:10.1109/TSE.2006.122
References
Self-adaptive Software: Landscape and Research
Challenges. ACM TAAS, March 2009
Soft errors: Soft errors in circuits and systems, IBM Journal
of R&D, Vol. 52, No. 3, 2008
http://researchweb.watson.ibm.com/journal/rd52-3.html
Martin Rinard, Acceptability Oriented Computing, ACM
SIGPLAN Notices, Vol. 38, Issue 12, December 2003
Read other related works of Martin Rinard:
http://people.csail.mit.edu/rinard/acceptability_oriented_computing/
Marco Schneider, Self-Stabilization, ACM Computing
Surveys, Volume 25, Issue 1, March 1993.
