Article history:
Received 24 January 2014
Received in revised form 15 August 2014
Accepted 21 September 2014
Available online 30 September 2014

Keywords: Availability modeling; Reliability; Security; Risk assessment

Abstract

This article presents a concise reliability analysis of network security abstracted from stochastic modeling, reliability, and queuing theories. Network security analysis is composed of threats, their impacts, and the recovery of failed systems. A unique framework with a collection of the key reliability models is presented here to guide the determination of system reliability based on the strength of malicious acts and the performance of the recovery processes. A unique model, called the Attack-obstacle model, is also proposed here for analyzing systems with immunity growth features. Most computer science curricula do not contain courses in reliability modeling applicable to different areas of computer engineering. Hence, the topic of reliability analysis is often too diffuse for most computer engineers and researchers dealing with network security. This work is thus aimed at shedding some light on this issue, which can be useful in identifying models, their assumptions, and practical parameters for estimating the reliability of threatened systems and for assessing the performance of recovery facilities. It can also be useful for the classification of processes and states regarding the reliability of information systems. Systems with stochastic behaviors undergoing queue operations and random state transitions can also benefit from the approaches presented here.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

One of the major reasons for developing concrete theories is to enhance their practical applicability to some desired technology and engineering fields. Our objective here is to find an efficient way of bringing parts of reliability theory into a practical technique for the analysis of information security. As is known, the main objective of information security relies on the provision of the general triad called CIA: Confidentiality, Integrity, and Availability. Within this context, availability (as a measure of service degradation) is the most critical factor, since its loss can impose immense cost on the communication infrastructure and on general business outcomes. We usually have effective protection mechanisms for providing confidentiality and integrity. However, protecting against threats causing unavailability is more complex and often requires additional mechanisms (e.g., redundant or standby systems), which may themselves be subject to the same types of threats, leading to additional losses.

Building reliable models for analyzing failures and impacts caused by different threat types on information systems can be extremely complicated, or models to describe these processes may not even exist. Therefore, in order to achieve at least analytical tractability, we need to separate the problem domain into three major parts: (i) attack and failure modeling, (ii) impact modeling, and (iii) recovery modeling. Theoretical approaches alone often cannot satisfy these needs for rapidly evolving IT systems and emerging networking concepts. Simulations and empirical studies serve to observe and assess system behaviors in order to substantiate the reliability in practical situations. Theoretical models used for simulations and experiments must also be justified and matched to well-established modes of operation. Probabilistic approaches can be used to build impact models and estimate the loss due to system failures caused by threats with predefined probability and hazard distributions. Additionally, queuing theory and stochastic processes (e.g., Markov chains) can be used to guide stationary analysis of system failures and hazard functions together with the associated repair models.

The major problem for many information security engineers is as follows. Although theoretical frameworks are critical in guiding research, in some contexts they can be confusing and cumbersome to apply. Especially for complex network structures facing complicated threat types, matching a theoretical security model to an overall analysis needs to be inductive, tractable, explanatory, and well thought out in order to give access to concretely measurable results obtained from many interrelated concepts and their influencing parameters. Reliability models dealing with complex systems are numerous, and naturally, some of them are too diffuse for some computer engineers to apply. As the complexity increases alongside the drastically growing diversity of information systems, adapting a concrete model becomes

Tel.: +90 232 488 8256. E-mail address: suleyman.kondakci@ieu.edu.tr
http://dx.doi.org/10.1016/j.ress.2014.09.021
0951-8320/© 2014 Elsevier Ltd. All rights reserved.
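The stationary analysis that queuing theory and Markov chains support can be illustrated with the simplest repairable-system model: a two-state Markov chain that alternates between an up state (failing at rate λ) and a down state (repaired at rate μ), whose long-run availability is μ/(λ + μ). The sketch below, with invented rates, checks this closed form against a simulation of the alternating process:

```python
import random

# Two-state repairable system: up --(failure, rate lam)--> down,
# down --(repair, rate mu)--> up. The stationary availability of this
# chain is mu / (lam + mu); the simulation estimates the same quantity
# as the long-run fraction of time spent in the up state.

def simulate_availability(lam, mu, horizon=1_000_000.0, seed=1):
    rng = random.Random(seed)
    t, up_time = 0.0, 0.0
    while t < horizon:
        ttf = rng.expovariate(lam)        # exponential time to failure
        up_time += min(ttf, horizon - t)  # count only time inside the horizon
        t += ttf
        if t < horizon:
            t += rng.expovariate(mu)      # exponential repair time
    return up_time / horizon

lam, mu = 0.01, 0.5                       # failures/h and repairs/h (illustrative)
analytic = mu / (lam + mu)
estimate = simulate_availability(lam, mu)
print(f"stationary availability: analytic {analytic:.4f}, simulated {estimate:.4f}")
```

Because the time-average of the up indicator converges to the stationary probability, the two numbers agree closely for a long horizon; the same closed form reappears in the repair models discussed later.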
276 S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299
more and more complicated. Therefore, obtaining practically sound reliability functions that can address the different paths of complex and redundant security structures is necessary.

The objective of this article is thus to present and discuss several useful reliability models dealing with the availability analysis of information systems. Two major reasons have triggered the development of this work: (1) the lack of a holistic approach to network reliability analyses in the current literature and (2) the need to enhance the knowledge of formal reliability analysis in the computer engineering discipline. In fact, the various works we find in the literature mainly consider the analysis of singular systems, such as the analysis of a specific type of worm/virus propagation or the analysis of software (SW) and hardware (HW) faults. With this work, we intend to bring forward an umbrella framework that can facilitate the analysis of a broad spectrum of network structures, their components, the interdependence among interconnected structures, the impacts (cost of service degradation) caused by the associated threats, and the performance of the facilities used to repair the threatened systems.

The second reason triggering this idea is that most universities and institutions around the world do not provide system reliability courses in their computer science departments. Thus, we have a good reason to provide fundamental modeling approaches that are applicable to the analysis of various aspects of reliability and security of computer networks. This paper will therefore emphasize this issue by presenting a set of practical models derived from reliability theory and queuing systems.

Obviously, the complexity of the reliability models increases in parallel with the system complexity, which is then multiplied by the level of system redundancy, where implemented. Communication networks, Internet search engines, cloud computing environments, smart grid networks, and resources of grid networks are good examples of such complexity, containing a high degree of redundancy and discrepancy in the overall system structure and the services provided by these globally distributed systems. Defining a separate reliability function for each subset of such conglomerate structures, and integrating them under a framework, will always introduce additional complexities. Estimating the overall reliability of such a complex structure is facilitated if we are able to get down to concrete models from the theoretical reliability concepts. Accordingly, this paper is intended to provide a framework of models applicable to the reliability analysis of computer networks.

Primarily, we customize a set of functions and models from the reliability concept and lay down model assumptions that are specific to the analysis of information security. An appropriate model can then be selected to determine system states as a metric reflecting the degree of system availability. The results obtained can then be used for risk assessment of systems under different situations. Related to this, as a special case, we present a numerical analysis that determines the reliability measure of a network of susceptible computers, which are vulnerable to virus attacks and software failures. A reliability measure is composed of a set of parameters, such as mean failure and recovery rates, total down times, service efficiency, and repairman utilization. Some systems may experience alternating states, while others experience increasing or decreasing failure states, depending on the cause of the failure or the efficiency of the recovery operations. That is, we will analyze time-dependent states of suspected systems and determine expressions representing their failure and recovery rates as well as the stationary characteristics describing the long-run reliability figures of these systems. Accordingly, throughout the paper, we define an unreliable node as a repairable/renewable system, since the node can be restored to operation after eventual failures.

1.1. Outline of the paper

Following the introduction and the objective of this work in Section 1, a brief review of the related work is given in Section 2. Section 3 describes the terminology used in this paper, presents a concise overview of the key reliability models, describes the failure sources (threats) and associated failures, presents the method for constructing reliability structures from network structures, and outlines the main steps of the reliability analysis used here. Section 4 summarizes the threat categories and tabulates the candidate reliability models for modeling attacks and failures. Assumptions and limitations regarding the discussed reliability models are presented in Section 5. Section 6 defines a new class of reliability patterns associated with the reliability of network security. Section 7 presents a detailed discussion of the reliability models and their applications to the analysis of network security threats and failures. Impact analysis of the threats is presented in Section 8. Section 9 presents models for describing and analyzing repair processes and the service efficiency of recovery facilities, whereas Section 10 goes through a case study dealing with the recovery operations and service degradation caused by a virus infection scenario. A brief discussion of the presented material and the future extension of the work is given in Section 11. Finally, a detailed background of the model theories is presented in Appendix A.

2. Related work

There exist numerous works within the general context of reliability engineering. However, it is hard to find approaches specific to the reliability analysis of network security. An early software failure analysis model was developed by [1], which presents a stochastic model for the software failure phenomenon based on a nonhomogeneous Poisson process. A service reliability approach for distributed software is considered in [2], where a distributed system was modeled as a single service system shared by distributed clients. This can be interpreted as a single service system shared among multiple customers using a control center that allows access for the client machines. The system availability of the control center is determined by the probability of it being available, which is also the overall reliability measure representing its clients. As is known, wireless communication networks often have degraded throughput of broadcast packets due to the nature of the transmission characteristics. Related to this, a composite reliability analysis is given by [3], which illustrates three modeling approaches for composite performance and availability analysis. A high-level description language, a stochastic reward net, and a continuous-time Markov chain are used to construct models for evaluating the performability measures of a channel allocation scheme in a wireless network. Service reliability in a grid system with star topology is considered by [4], and a topological view of the reliability of a large-scale distributed system is presented by [5].

One of the main objectives of our work is to facilitate the quantitative reliability analysis of interconnected systems. Wherever applicable, a candidate model defined here will embody the fundamental steps for assessing the information security risks of a given network. The selection of the candidate model depends basically on the underlying assumptions for the applicability of the model to the network under consideration.

Reliability engineering, covering various types of system safety, encompasses a wide spectrum of theoretical areas, each of which needs a closer look for accurate adaptation to more specific engineering problems. Therefore, the discussion here can be considered a dedicated focus on information security compared to that of
the topics we usually encounter in the recent literature on reliability engineering.

Computer networks are complex systems. There exist a number of theoretical approaches intended for both the computation and the evaluation of the reliability of complex systems, e.g., [6], in which a measure of complexity was defined by the number of paths found in the system, and reliability functions of the system are given. Using this definition, the reliabilities of some typical redundant systems with complexity were also obtained. Kondakci [7] suggests models for the analysis of complex attack patterns and their effects on information systems. A review of different aspects of the predictability problem in dynamical systems complexity is given in [8]. Applying these approaches to modern networked systems is a rather complicated task. It would be more fruitful if we could bring the general theories down to practically sound ground, so that we can directly apply them to more specific engineering problems. Additionally, there exist numerous works dealing with the availability analysis of computer networks, e.g., [9] presents a review of software reliability in distributed environments. Jian and Shaoping [10] suggest that the availability of interconnected networks is dependent on their topological structure and the network services performed simultaneously, and discuss an integrated availability model that considers both topological connectivity and operational performance. Gupta and Dharmaraja [11] have studied the combined effects of resource degradation and security breaches on the Quality of Service of Voice over IP systems with a dependability model using the semi-Markov process.

It is hard to find a rigorous study in the most recent literature for determining the reliability of information systems that also incorporates the analysis of system security. Nevertheless, some studies have been appropriately derived from the classical reliability models, e.g., [12,13]. Our approach is based on a stochastic model delivering quantitative results, which can be easily applied to the risk assessment of safety systems. A comprehensive treatment of stochastic models related to system performance and reliability is presented in [14]. A Markov model for a multi-state repairable system is presented in [15]. Worm and malware attacks can cause degradation in system availability. These types of analyses are considered in depth by [16], where a recurrent stochastic model was developed to define the states of a recurrent epidemic model called REM. Accordingly, a concise cost analysis of Internet malware is presented in [17].

Mass-mailing can also be used as a typical denial of service attack aiming at degrading system availability and creating serious frustration and annoyance for many users.

Solutions to complex problems alone would have little impact unless we could provide additional optimizations. Relying only on powerful algorithms and brute-force approaches can be too computation-intensive. We can refer to a wide range of sources and topics for discussions of the optimization issue; e.g., a tutorial on multi-objective optimization by [18] gives a closer look at an approach using the genetic algorithm. Another related source is [19].

Reliability analysis covers the entire engineering community with almost no boundaries. Hence, when dealing with risk analyses, we may encounter several unforeseen surprises in many of the broad fields [20]. Methods dealing with the analysis of reliability and risk estimation have been deeply considered by several researchers and institutions with high-availability concepts, e.g., NASA, where it has been stated that availability prediction and assessment methods can provide quantitative performance measures that may be used in assessing a given design. We also agree that quantitative results obtained from these methods can lead us to collect accurate maintenance costs and help to develop alternatives to reduce life-cycle costs. Furthermore, analyses based on reliability predictions will guide us in assessing design options that can lead to precise definitions of maintenance support concepts, which can increase future system availability, anticipate logistics and maintenance resource needs, and provide long-term savings in operations and maintenance costs based on necessary optimizations.

In many cases we need to discuss several types of parameters in order to obtain more realistic and accurate results when analyzing information systems that are either under dedicated attacks or vulnerable to frequent failures. As will be discussed shortly, we have several classes of systems to consider regarding this issue, e.g., web servers, mail servers, firewalls, database systems, ERP, and project development systems. Each of these systems possesses distinct vulnerability characteristics against threats that can cause reductions in system availability. Topics such as probability, statistics, reliability, stochastic processes, and queuing theory are of fundamental importance for a security engineer to cover in order to successfully analyze the reliability of a given problem domain. For general background, readers can refer to several sources, among which the book of Trivedi [21] can be a good choice to start with.

Computer networks are in the class of complex systems whose analysis requires further knowledge of "complex system analysis", which can bear rather special assumptions. A discussion of the reliability of system complexity is presented in [6]. Most networked systems are consecutively affected by each other via propagated failures/errors that are triggered in a system element within a given structure. For example, a jitter effect in a badly designed software package can cause malfunctioning in other software packages if they interact with each other. Or, a worm entering a system can quickly spread to many interconnected systems in the manner of a branching process.

Approaches regarding the theoretical analysis of such consecutive system behaviors are discussed in [22–25]. Those mathematical models can be modified to match the reliability analysis of many types of security attacks whose individuals reproduce in a manner that is influenced by time and the degree of system protection. Determining whether and how interactions quantitatively impact operational choices, when dealing with risk assessments, can be a great challenge to many of us. A perspective on this subject is considered by [26].

3. Reliability structures and approaches

A threat is a circumstance or event bearing the potential to cause harm to information systems through disclosure, destruction, falsification, or fabrication of data, and disruption of systems (denial of service). An attack is an assault on system security that derives from an intelligent threat, that is, a deliberate attempt to realize a given threat or a set of threats on a system in order to evade security services and cause harm to the system. We generalize two sets of definitions here: (i) the terms threats, attacks, worms, viruses, Trojans, pop-ups, and any other malware are interchangeably used to denote the cause (or source) of a failure; (ii) components, nodes, hosts, machines, computers, and systems are also interchangeably used to denote the system under analysis. Though there exist several aspects of Internet malware, such as malware intelligence, discovery, deployment, and defense strategies, we consider mainly the reliability analyses of IT systems without explicitly considering the defense factor. Some models may deal with systems containing combat (or defense) capabilities, which will be explicitly stated when we consider them.

Reliability is defined as the probability that a given system operates properly for a specified period of time. As a companion definition to reliability, the availability of a system for its users is defined as the relative frequency with which the system works. Here, the percentage of the successful up times of a system is considered as a measure of the system reliability. From another perspective,
unreliability is the probability that a system fails during a specified period of time.

Unless otherwise stated, an entire network is termed a system, a node within the network is denoted a subsystem, and an element within a subsystem is termed a component. That is, obeying the use of the reliability terminology, the term system is sometimes interchangeably used to denote anything varying from a single component to a variety of systems that are interconnected through any communication infrastructure.

3.1. A brief overview of reliability models

In order to analyze system states in terms of probabilistic lifetime (survival time) distributions with respect to non-repairable and repairable constraints, several models have been proposed; Table 1 summarizes the key modeling approaches. A lifetime distribution model can be represented by a probability density function (pdf), denoted f(t), whose arguments are the time t and the rate parameters, often denoted by λ for failure rates.

[Table 1. Candidate lifetime distribution models (partially reconstructed from the recoverable fragments); f(t) is the pdf, h(t) the hazard rate, and R(t) the reliability function:

- Exponential (parameter λ): f(t) = λ e^(−λt); R(t) = e^(−λt); constant hazard rate λ.
- Weibull (parameters λ, α): f(t) = λα t^(α−1) e^(−λt^α); h(t) = λα t^(α−1); R(t) = e^(−λt^α).
- Gamma (parameters λ, α): appropriate for modeling systems with IFR, DFR, and AFR, and systems of n stages of subsystems.
- Log-logistic (parameters λ, κ): h(t) = λκ(λt)^(κ−1) / [1 + (λt)^κ]; R(t) = 1 / [1 + (λt)^κ]; avoids the strict bounding of IFR, DFR, and AFR, with h(t) DFR for κ ≤ 1 and AFR for κ > 1.
- Extreme value (parameters a, b): also called the extreme value distribution; the hazard rate is initially constant and then increases rapidly with time; f(t) involves the term e^v.
- Modified NHPP fault count model (parameters a, b): used for software reliability assessment; m(t) denotes the expected failure count observed by time t, and the count probability is P(N(t) = y) = m(t)^y e^(−m(t)) / y!; appropriate when attacks are branching and propagation rates are high but failure events are rare.]
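As a quick numerical companion to Table 1, the sketch below evaluates the reliability and hazard functions of the exponential, Weibull, and log-logistic models in the parameterizations quoted above, together with the NHPP failure-count probability. The Goel–Okumoto-style mean value function m(t) = a(1 − e^(−bt)) is an assumption consistent with the a, b parameters of the table (not a formula taken verbatim from it), and all numeric parameter values are invented for illustration:

```python
import math

# Numerical companion to Table 1: reliability R(t) and hazard h(t) for
# three lifetime models, plus the NHPP failure-count probability.
# All numeric parameter values below are invented for illustration.

def exponential(t, lam):
    """R(t) = exp(-lam*t); the hazard is the constant lam."""
    return math.exp(-lam * t), lam

def weibull(t, lam, alpha):
    """R(t) = exp(-lam*t**alpha); h(t) = lam*alpha*t**(alpha-1)."""
    return math.exp(-lam * t**alpha), lam * alpha * t ** (alpha - 1)

def loglogistic(t, lam, kappa):
    """R(t) = 1/(1+(lam*t)**kappa); h(t) is DFR for kappa <= 1."""
    base = (lam * t) ** kappa
    R = 1.0 / (1.0 + base)
    h = lam * kappa * (lam * t) ** (kappa - 1) / (1.0 + base)
    return R, h

def nhpp_mean(t, a, b):
    """Assumed Goel-Okumoto-style mean value: m(t) = a*(1-exp(-b*t))."""
    return a * (1.0 - math.exp(-b * t))

def poisson_count_prob(y, m):
    """P(N(t) = y) = m**y * exp(-m) / y!, with m = m(t)."""
    return m**y * math.exp(-m) / math.factorial(y)

R, h = weibull(10.0, lam=0.01, alpha=1.5)
m = nhpp_mean(100.0, a=50.0, b=0.02)
print(f"Weibull: R(10) = {R:.4f}, h(10) = {h:.4f}")
print(f"NHPP: m(100) = {m:.2f}, P(N=40) = {poisson_count_prob(40, m):.4f}")
```

Swapping the model in an analysis then amounts to swapping the pair (R, h), which is what the candidate-model selection in Section 4 exploits.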
[Fig. 1. An example network consisting of a dB server, web server, mail server, application server, file server, project dB server, and a printing service; components are tagged with their protection types A, SA, and E.]
information and other authentication data used for accessing confidential areas.

As can be seen in Fig. 1, some components are tagged with A, SA, and E. These specify the protection type for the tagged component. Attacks on the confidentiality and integrity of information systems are far more difficult to mount but nevertheless important. Confidentiality and integrity measures are mostly implemented to ensure the security of mission-critical systems, and they require rather different approaches to analyze. Reliability analysis of cryptography is a completely different issue compared to that of the availability analysis of IT systems. Hence, our main concern in this paper is the determination of appropriate models that can be used to analyze the availability of IT systems. Authentication (A) is the simplest form of a control mechanism that allows authenticated access of principals to a resource, often using simple PINs and passwords. Such an authentication system can be easily broken by use of password crackers. A system with secure authentication (SA) is cryptographically protected against intentional or unintentional attempts made to access the system.

As mentioned earlier, the security of a resource relies on the provision of the grand triad called CIA: confidentiality, integrity, and availability. Confidentiality of a system is provided by cryptographic functions consisting of cryptographic algorithms, encryption (E) and decryption (D), which disallow the disclosure of a resource to others than the authorized owners and principals. Denoting simply, the function C = E(M, K, A) encrypts the message M to produce the ciphertext C using the key K and the algorithm A. For decrypting the ciphertext C, the function M = D(C, K, A) is performed. As can be noted, the same algorithm and key are used for both the encryption and decryption functions. There exist several algorithms devoted to securing critical resources, e.g., RSA, DES, and Triple-DES, of which DES uses 56-bit keys (2^56 different combinations of a key). Breaking a cryptographic algorithm is extremely hard, requiring intensive work performed only by intelligence communities or cryptanalysts. The most critical part of a cryptographic system (or function) is the length of the key used for encrypting the data. It has been shown that dedicated hardware with a cost of one million dollars can search all possible DES keys in about a couple of hours. In fact, the DES algorithm was retired and replaced by the Triple-DES algorithm. Since computer resources are improving rapidly these days, the Triple-DES algorithm with a key length of 168 bits is no longer secure either. However, according to NIST, with the strongest hardware it will still take more than about 260 years to crack a Triple-DES key.

Integrity of a message is provided by so-called cyclic redundancy checks or by employing secure hash algorithms. That is, a set of extra bits is added to the original message for both error detection and correction. Although these mechanisms are quite robust, there still exist threats to the integrity of a message, e.g., eavesdropping, session hijacking, and the denial of service attack.

3.4. Constructing reliability structures

Computer networks contain complex system elements, where each element consists of at least four groups of interacting subsystems: (1) hardware, (2) operating system, (3) communication unit (hardware and software), and (4) a set of service software (so-called applications). Block diagrams of two typical systems, a user computer (PC) and a network printer, are shown in Fig. 2. Prior to the reliability analysis, these structures need to be converted to a reliability diagram consisting of only series (and eventually some parallel) components together with their interdependence structures. To do so, we can apply the path-tracing technique to identify all possible paths from the input end to the output of the entire system. We can further apply the reduction-to-series-elements method and/or the minimal cut algorithm to simplify the computation of the overall reliability structure. There exist several techniques and methods in the literature for simplifying different reliability structures, e.g., [29], which uses a formerly known recursive decomposition algorithm.
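The symmetric relation C = E(M, K, A) and M = D(C, K, A) discussed above can be made concrete with a deliberately toy cipher. The sketch below uses a repeating-key XOR keystream, which is self-inverse, so the very same function and key serve as both E and D; it is emphatically not a secure algorithm like DES or Triple-DES, only an illustration of the shared-key symmetry:

```python
# Toy symmetric cipher: C = E(M, K) and M = D(C, K) with the same key.
# XOR with a repeating key is its own inverse, so E and D coincide.
# This is NOT cryptographically secure; it only illustrates the
# symmetric-key relation discussed in the text.

def xor_cipher(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

M = b"confidential message"
K = b"shared secret key"        # illustrative key material

C = xor_cipher(M, K)            # C = E(M, K)
recovered = xor_cipher(C, K)    # M = D(C, K): same function, same key

print(recovered == M, C.hex()[:16])
```

For a real cipher, the security rests on the key length: a 56-bit DES key gives a 2^56 keyspace, small enough for the exhaustive search mentioned above, which is why longer keys (e.g., 168-bit Triple-DES) are required.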
[Fig. 2. Block diagrams of (a) a user computer (PC) and (b) a network printer; the printer comprises printer hardware (PHW), printer software (PSW), an operating system, and a spooler with input and output job queues. Fig. 3. Reliability diagram of the user computer. Abbreviations: OS = operating system; MHW = main hardware; CHW = communication hardware; CSW = communication software; NA = network application; APP = user application.]
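The path-tracing technique described in Section 3.4 can be sketched programmatically: enumerate every input-to-output path of the component graph, then combine the path events by inclusion–exclusion (valid for independent components). The graph below is a hypothetical, simplified stand-in for the PC of Figs. 2 and 3, and the component reliabilities are invented:

```python
from itertools import combinations

# Hypothetical component graph loosely modeled on the PC: communication
# hardware/software in series, two redundant network applications in
# parallel, then OS and main hardware in series. Reliabilities invented.
graph = {
    "IN":  ["CHW"],
    "CHW": ["CSW"],
    "CSW": ["NA1", "NA2"],     # redundant network applications
    "NA1": ["OS"],
    "NA2": ["OS"],
    "OS":  ["MHW"],
    "MHW": ["OUT"],
}
rel = {"CHW": 0.99, "CSW": 0.98, "NA1": 0.95, "NA2": 0.95,
       "OS": 0.97, "MHW": 0.995}

def paths_from(node):
    """All component paths from node to OUT (IN/OUT markers excluded)."""
    if node == "OUT":
        return [[]]
    tails = [p for nxt in graph[node] for p in paths_from(nxt)]
    return tails if node == "IN" else [[node] + p for p in tails]

def system_reliability(paths):
    """Inclusion-exclusion over path sets of independent components."""
    total = 0.0
    for k in range(1, len(paths) + 1):
        for subset in combinations(paths, k):
            comps = set().union(*subset)   # components on the chosen paths
            prob = 1.0
            for c in comps:
                prob *= rel[c]
            total += (-1) ** (k + 1) * prob
    return total

paths = [frozenset(p) for p in paths_from("IN")]
R = system_reliability(paths)
print(f"{len(paths)} paths, system reliability = {R:.5f}")
```

The same number falls out of the series–parallel reduction R = R_CHW · R_CSW · [1 − (1 − R_NA)^2] · R_OS · R_MHW, which is exactly the reduction-to-series simplification the text refers to.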
To illustrate the determination of the reliability diagram, we consider here only the PC, whose structure is shown in Fig. 2(a). By referring to the roles and interdependency structure of the components of the PC, one can easily derive reliability diagrams of other related systems, e.g., that of the network printer shown in Fig. 2(b). In contrast to the similarity in their HW structures, the major differences between the PC and the printer are that the network printer has only one application (the so-called spooler) and, additionally, two queues representing input and output jobs. The reliability diagram of the user computer, shown in Fig. 3, contains two sets of applications, network applications (NA) and user applications (APP), with some relations to each other, while being independent of the common communication unit.

Following the construction of the reliability diagram, we define a reliability model (or function) together with its associated parameters, and outline the underlying assumptions. Furthermore, we combine the reliability functions of the components into a single structure in order to obtain an overall structure that expresses the reliability measure of the entire system. The reliability measure, together with some other system-specific parameters (e.g., importance factor, threat and vulnerability measures), can then be used to derive an impact metric for the risk estimation regarding the system at hand.

The overall reliability of a series system with n components is given by

R = ∏_{i=1}^{n} R_i,   (4)

and the reliability of a parallel system with n components is

R = 1 − ∏_{i=1}^{n} (1 − R_i).   (5)

Thus, it is trivial to determine the reliability of a serial–parallel or any other combination of a system, which is left as an exercise to the reader, if so desired.

3.5. Main steps of the analysis process

Most of the existing reliability models have been used for the analysis of production reliability, life tests, and failure counts of service facilities. Network security analysis is a relative newcomer to this area. Thus, the applicability of these models to assessing the reliability of network (or information) security must be substantially discussed together with the underlying model assumptions. Namely, it is important to take the key modeling approaches, analyze their underlying assumptions, find their limitations, and justify their applicability to the analysis of identifiable security threats.

It is not easy to directly analyze system failures caused by unknown security threats. In order to easily identify system failures, their causes must be identified first. Thus, before diving into the failure analysis, the identification and modeling of threats is more advantageous. This can help determine the correlation between the threats and the failures caused by them. Due to the large variations in computer platforms, their applications, behaviors, and the threats to them, we have to consider several critical issues prior to choosing an appropriate model for the reliability analysis of a given system. Hence, we need to

- identify threats/attacks to the system,
- unify test and observation models needed for experimenting and model matching,
- unify necessary models via generalized reliability concepts,
- design a candidate model and identify the underlying parameters,
- determine the underlying assumptions and limitations of the candidate model,
- identify methods to obtain estimates of model parameters, e.g., the MLE method,
- identify and eliminate practical and theoretical difficulties in parameter estimations,
- validate the applicability of the model to the analysis of the problem domain at hand, e.g., with a goodness-of-fit test.
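Eqs. (4) and (5) above translate directly into code; the serial–parallel combination left as an exercise is then mere composition of the two functions (the component reliabilities below are invented):

```python
from functools import reduce

def series(*rs):
    """Eq. (4): R = prod_i R_i for independent components in series."""
    return reduce(lambda acc, r: acc * r, rs, 1.0)

def parallel(*rs):
    """Eq. (5): R = 1 - prod_i (1 - R_i) for components in parallel."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), rs, 1.0)

# Serial-parallel combination: a firewall in series with two redundant
# servers (illustrative reliabilities).
R_system = series(0.99, parallel(0.95, 0.95))
print(f"system reliability = {R_system:.6f}")   # 0.99 * (1 - 0.05**2)
```

Because both functions accept any number of arguments and return a plain reliability value, arbitrarily nested series–parallel structures are evaluated by nesting the calls, mirroring the reduction of a reliability diagram.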
Although we are not going to detail these issues separately, a sufficient framework covering most of the fundamental concepts will be considered here.

4. Threat categories

Generally, we have several categories of threats to networked systems that can affect a system in different ways. IT systems can be threatened by a specific type of threat or a combination of various threat types, e.g., virus infections, Trojan injections, software failures, or service disruptions caused by some protocol attacks. Apparently, in order to analyze the reliability of a victim (target) system, or to measure the strength of eventual attacks, we need to select an appropriate reliability model to proceed with. To be consistent with the subject covered here, five major threat categories are considered, where the reliability analysis of each category can be satisfactorily bound to a realistic model.

Worms/viruses: Stochastic and deterministic worm propagation and extinction models usually give realistic results for randomized mass worm attacks. In an earlier work [16], we modeled an infection process as a binomial process and susceptibility to infection as a Poisson process; arrivals of suspected hosts at a quarantine service were likewise modeled as a Poisson process, and the respective recovery operation was modeled as a birth–death Markov process. Related to the analysis of the propagation of malware (worms, viruses, Trojans, etc.), a variety of complex approaches mostly based on stochastic processes have been discussed in [16,30–33]. It follows that branching processes for rapidly propagating worms and malware [16], and Poisson and Bernoulli processes for infrequently occurring incidents, are good candidates. Verification of the candidate models and parameter estimation in this category is relatively easy compared to that of the protocol-based threat analyses. Often, simulations and real-time experiments are used to collect failure data, estimate the model parameters from the failure data, and perform a goodness-of-fit test for checking the appropriateness of the model. Experimenting with worms can be easily performed by randomly "seeding" a certain number of worms in a network of computers, each having a different protection and usage profile. Observation of the failures is then carried out within a given time duration while counting the number of incidents.

Denial of Service (DoS): DoS attacks mostly make use of protocol deficiencies and are aimed at degrading system availability. DoS attacks are often reported to completely shut down systems within several hours. A TCP-SYN flood attack is an example of a typical DoS scenario, in which the adversary finds a hole through a deficiency in the TCP protocol implementation based on a loose protocol description. Such a DoS attack launches a huge number of concurrent connection requests against its victim machine, which is lured into believing that the requests are coming from legitimate machines and hence tries to accept the connection requests from the attackers. While trying to accept the requests, the victim machine quickly fills up the multiplexing buffer of its TCP protocol, which then freezes the operating system; consequently, it becomes unavailable. DoS attacks are implemented in a variety of forms, e.g., Tribe Flood Network (TFN) and Distributed Denial of Service (DDoS). The TFN system is composed of a set of computer tools that conduct various DDoS attacks such as ICMP-flood, SYN-flood, Trin00, Stacheldraht, TFN2K, UDP-flood, Smurf-attack, and so forth. A TFN system is made up of a master server, clients, and numerous daemon (slave server) programs. Most of these attacks have deterministic attack rates, which can also lead to constant failure rates. Stochastic models involving repeated trials, e.g., Attack-obstacle, exponential, Poisson, Weibull, Gamma, and Log-logistic, are the strongest candidates for modeling the variants of DoS attacks, depending on the way they can be realized, i.e., in random but sequential phases or in parallel phases.

Buffer overflow: Threats of this type (e.g., Trojans) can be deliberately inserted into applications, or they may come from misconfigurations, which when applied repeatedly may cause kernel area violations and excessive storage usage and, in turn, result in denial of service. These threats can propagate to other interoperating software residing on both local and remote hosts. Threats that randomly propagate among different applications on different hosts can be modeled as a Brownian motion. Buffer overflow failures involving randomly repeated trials can be treated as a stochastic process, which can be modeled as a binomial process, whereas the density of the attacks causing the failures can be modeled as a Poisson process. As of Assumptions 1–3, a modified Gamma model can be used for systems that have immunity growth capability. Variations of the Weibull and Pareto models can be candidates for tests dealing with strength-stress assessments of physical characteristics of memory systems and page faults in operating systems.

Remote malware injection: This category contains a broad spectrum of threats consisting of Trojans, SQL-injection, identity theft, spyware injection, key-loggers, phishing, unsolicited bulk messages (spam), cross-site scripting, pop-ups, and malicious e-mail attachments. Though the attack density of these threats is specific to each class, incidents caused by them can be modeled as Bernoulli trials using the parameters generated from the protection profiles of the attacked systems and user behaviors (e.g., users' surf-pattern on the Internet and security awareness of the user).

Human-related security: Users and administrators of systems may also cause vulnerabilities to systems. User vigilance, information leakage by social engineering, and unawareness of the security threats can lead to exploits caused by all categories defined above. As an example, a spam trap is the inclusion of an option in an online web-form that is preselected by default with the expectation that the user will fail to notice the option. If the form is submitted with the option selected, the user has given the adversary permission to send junk e-mails. Consequently, a number of infected e-mail attachments and spear phishing by forged e-mails will continue to show up in the user's mailbox. Additionally, system maintenance tasks including backup, restore, patch, upgrade, and misconfigurations can introduce vulnerabilities while being performed by the users and system administrators. Frequent software upgrades to newer releases for increasing the functionality of a system can also be considered a reliability issue. Reliability of test and debug operations, practical implications of the company security policy, reliability of protection system configurations, and flaws (e.g., zero-day attacks) in emerging system installations are also considered among the human-related security issues. It is hard to designate a known reliability model for this category; user and system-administrator activities should be separately modeled and verified.
Table 2 summarizes the threat categories and candidate models for the analysis of attacks and failures. Among its entries: worm/virus attacks are branching with high propagation rates but rare failure events, and since many nodes are involved, a k-out-of-n analysis applies; massive attacks (e.g., by TFN) are hyperexponential; due to protocol intelligence, DoS calls for models involving AFR, e.g., Attack-obstacle; propagation of threats follows Brownian motion, with hypoexponential (serial-system) failure rates and Pareto models for CPU- and memory-related failures; remote-malware failure analysis is binomially distributed; and human behavior is difficult to model, mostly bearing AFR characteristics.

Common to each threat category are Assumptions 1–3, if a related failure correction is applied. If not explicitly stated, the remaining assumptions are considered as default for all threat categories. It should be noted that the candidate models shown are not customary, i.e., a given model may be used for many types of analysis. Before choosing a model, we need to perform a sensitivity analysis on the model parameters. How the model behaves to changes in parameter values and how it responds to changes in the structure of the model are important issues in designing an experiment and related models. Failure data and plots are useful in determining the system behavior. Following the analysis of the data and plots, an estimate of the model parameters can be obtained by use of a method, e.g., maximum likelihood. Furthermore, an operation for model matching needs to be done by substituting the estimated parameter values in the selected model. Finally, in order to … is required.
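The maximum-likelihood step mentioned above can be sketched for the simplest case (an illustration with simulated, exponentially distributed failure times; the true rate is invented and is not from the paper's experiments). The MLE of an exponential failure rate is the number of failures divided by the total observed time:

```python
import random

def mle_exponential_rate(failure_times):
    """MLE for the exponential failure rate: lambda_hat = n / sum(t_i)."""
    return len(failure_times) / sum(failure_times)

random.seed(42)
true_rate = 0.5  # illustrative failures per hour
sample = [random.expovariate(true_rate) for _ in range(10_000)]
lam_hat = mle_exponential_rate(sample)  # close to 0.5
```

In practice the estimate obtained this way would then be substituted into the selected model and checked with a goodness-of-fit test, as the text describes.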
For example, as rigorously discussed and modeled by [17], most malware attacks taking advantage of the vulnerable applications on a computer occur at random times. As extensively discussed by [16,34–36], usage profiles, basically affected by the user vigilance, are also of a stochastic character. Consequently, unless otherwise specified, attacks, exploits, vulnerabilities, risks, failures, and the overall impact are considered as random processes, random functions, or random variables, as appropriate.

5.1. Significance of protection profiles

Depending on the company policy, a network can be configured to have some level of protection. A large network, if not strictly applying necessary security mechanisms, will always have randomness in the overall vulnerability figure. This is due to the diversity of operating systems and the large number of software applications with several different vulnerability patterns. Some networks, e.g., many home networks, usually make use of freely available anti-virus programs on their computers, and no other defense mechanisms at all. Some PCs and servers in some networks may be individually protected by their users, while other PCs have either none or some arbitrary level of protection. Obviously, a network with several different operating systems and hundreds of different software applications, each developed by a different vendor and through distinct design and production plans, will always show stochastic behaviors against many undiscovered security threats. Besides, attacks on the elements of a network take random patterns and occur at random time epochs. Attacks exploiting vulnerabilities inherent in an application may also result in nondeterministic failure patterns.

5.2. General assumptions

At this point, it may serve quite well to partition the model assumptions so that they specifically reflect two different categories of analysis, responsive and nonresponsive, respectively. The responsive analysis relies on an immunity growth factor in the failed systems such that the failure rates decrease with the increasing immunity factor while the causes of the failures are being eliminated. This means that intermediate failures caused by attacks/vulnerabilities are being fixed so that the victim system will have increased strength against further threats; consequently, it will have decreasing failure rates. On the contrary, a nonresponsive analysis totally omits system fixes and focuses merely on the cumulative failure types, which in most cases results in increasing failure rates. We have to explicitly declare whether the applied method is responsive or nonresponsive. We consider here the following major assumptions regarding both of these two categories.

Assumption 1. Responsiveness: In order to capture the effect of the immunity growth (responsiveness), vulnerabilities are fixed with certainty in a negligible time. Most software applications do not contain automatic system corrections; thus, the necessary corrections are done manually by patches and/or by upgrades as appropriate. Depending on the system's reaction to a given correction, a patch operation may result in increasing, decreasing, alternating, or constant (towards the end) failure rates. But in most cases the cumulative result is assumed to be DFR.

Assumption 2. Responsive system reliability: Reliability is a function of all remaining failures. Assuming that all necessary corrections have been made, the expected failure rate will ideally exhibit a decreasing (DFR) behavior. In line with this, most of our experiments show that the failure-time distributions exhibit alternating mean values after each correction. That is, the system reliability distributions following the corrections are all different, $R_1(t) < R_2(t) < \cdots < R_n(t)$, whose failure times (or incident counts) are also independent of each other.

Assumption 3. Protocol intelligence: Some communication protocols may have built-in mechanisms to observe the input activity and make the system counteract accordingly. Upon the observation of a threat pattern, the intelligent (or expert) mechanism tries to thwart the attack by discarding the malicious packets. In such realizations, the effect of the attacks is assumed to be higher during the observation (learning) period than in the following periods. Hence, the reliability figure of the intelligent protocols exhibits first increasing failure rates (IFR) followed by decreasing failure rates (DFR). This is due to the learned expertise gained by the protocol.

Assumption 4. Randomness: Unless otherwise specified, threat events and the number of vulnerabilities in an application, as well as in a communication protocol, are random. Some applications may already contain discovered vulnerabilities denoted as deterministic; however, system failures often occur at random times and under random circumstances, including the user vigilance and protection policies.

Assumption 5. Non-uniformity: Unless otherwise specified, all threats (attacks and vulnerabilities) contribute unequally to the (un)reliability of a system. The same target may react differently at different times regardless of the type of the threat.

Assumption 6. Reliability metric: Where convenient, we will explicitly specify the reliability measure as failure counts, working times, attack counts, attack success rates, and/or attack durations. For example, the success rate of a DoS attack, the number of infected e-mail boxes, the effect of spear phishing (e-mail spoofing), and the success rate of a scan worm (e.g., how many nodes can be Trojan-injected by a worm in Δt) are also typical counting processes.

Assumption 7. Independent failure cases: Failures detected during nonoverlapping time intervals are independent of each other.

Assumption 8. Working time distribution: Unless otherwise specified, working times between failures are independently distributed random variables.

Assumption 9. Rare events: Homogeneous Poisson processes (HPP) are used to model events with "rare" occurrences; hence, their rates $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ are also exponentially distributed independent random variables defined by the nature of threats and the usage profile of the system under consideration. Three main properties of a Poisson process are: (1) events occurring in different epochs are independent of each other, (2) events occur at a constant rate, and (3) few events occur within a short interval of time.

Assumption 10. Non-homogeneous Poisson: Events (attacks or failures) described by non-homogeneous Poisson processes (NHPP) evolve as time-dependent random functions or processes, as appropriate. Hence, as opposed to HPP, events of NHPP exhibit randomly alternating rates. Time-dependent probabilities of a given number of events with the mean rate M(t) for the NHPP model can be calculated by

$$P\{N(t) = k\} = \frac{M(t)^k}{k!} e^{-M(t)}.$$

6. Reliability patterns

We classify reliability models according to failure type and complexity. The first class deals with three types of failure rates: (1) increasing failure rate (IFR), (2) decreasing failure rate (DFR), and (3) alternating failure rate (AFR). The second class considers the system complexity based on the diversity of the systems and their components: (1) homogeneous and (2) heterogeneous systems; see Section 7.6. Common to both classes is the variety of threats.
Fitting a suitable model to any of the threat categories is usually nontrivial, because some threats (or attacks) and their ultimate effects often take place in n interleaved stages. Related to this, eventual system recovery and impact analyses following the failure phase will multiply the complexity of the model at hand. In fact, in most cases the actual failure of a system occurs such that the system fails when n independent failures have occurred. The sequence of failures can be in step with the strength of their associated attacks. Therefore, it can be advantageous to first analyze the attacks in order to facilitate the analysis of failures and impacts. Causal and harmonic effects of the threats can also arise due to consecutive attacks emanating from different sources [34–38].
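The n-stage failure mechanism described above can be sketched numerically (an illustration assuming identically distributed exponential stages, which mirrors the r-stage Erlang forms derived later in the text; the rate and stage count are invented):

```python
import math

def n_stage_failure_cdf(t, lam, n):
    """If the system fails only after n independent exponential stage
    failures (rate lam each), time to system failure is n-stage Erlang:
    F(t) = 1 - e^{-lam*t} * sum_{k=0}^{n-1} (lam*t)^k / k!."""
    return 1.0 - math.exp(-lam * t) * sum(
        (lam * t) ** k / math.factorial(k) for k in range(n))

# Illustrative: three stages must fail, each at rate 1 per hour.
f_by_2h = n_stage_failure_cdf(2.0, 1.0, 3)
```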
Fig. 5. Correlation between the attack bursts and the CPU utilization; 3λ = 10% CPU usage.
describing the number of effective CPU-peaks at time t as

$$P\{\xi(t) = k \mid N(t) = n\} = \begin{cases} \dbinom{n}{k} p^k (1-p)^{n-k} & \text{for } 0 \le k \le n, \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$

By the law of total probability we have

$$P\{\xi(t) = k\} = \sum_{n=k}^{\infty} \binom{n}{k} p^k (1-p)^{n-k} \frac{(\lambda t)^n}{n!} e^{-\lambda t} = \frac{(\lambda t p)^k}{k!} e^{-\lambda t p}. \qquad (10)$$

The effect of attack bursts is illustrated in Fig. 5. As can be easily seen, the number of effective CPU-peaks at time t has a Poisson distribution with parameter λtp, which corresponds to a certain percentage of CPU load. Depending on the system, some percentage of the CPU usage corresponds to a certain number of attack bursts. For example, the scale used in Fig. 5 is such that 3 parallel bursts produce a 10 percent CPU load. But this is not always an accurate metric; since each operating system has different paging and process scheduling schemes, the scale parameter must be empirically determined.

Let n be the number of attack bursts and p be the probability of successfully hitting the peak once; then the probability of exactly k attack errors is given by

$$P\{e = k \mid n\} = \binom{n}{k} p^{n-k} (1-p)^k, \qquad (11)$$

and the probability of a successful peak hit resulting from n attack bursts is given by

$$P\{e = 0 \mid n\} = s_n = p^n, \quad 0 < s_n < 1, \ n = 1, 2, \ldots$$

There exist several IFR attack classes aimed at degrading system availability. For instance, the TFN attack and the massive worm attack called SQL-Slammer engage their attack agents in the form of a branching process. Some scan worms develop extremely many branches and grow exponentially by propagating their malware via vulnerable applications. An extensive model developed by [16] shows that the scan-worm propagation evolves as a branching process. There exist other deterministic and stochastic models defined for the spread and detection of scan worms, e.g., [31]. See also Eqs. (20) and (21) for that matter. A method similar to that presented by [16] has been suggested by [39], which determines the total progeny of the branching process as a Borel–Tanner distribution, i.e.,

$$P\{I = k\} = \frac{I_0}{k (k - I_0)!} (k\gamma)^{k - I_0} e^{-k\gamma}, \qquad (12)$$

where I_0 denotes the number of initially infected hosts, P{I = k} denotes the probability that the total number of hosts infected is k (varying as k = I_0, I_0 + 1, …), and γ denotes the mean Poisson rate.

Example: threat sampling. We have experimented with a fault-seeding scenario, where a virus (Conficker) was randomly seeded via e-mail messages in a network of PCs with and without virus protection tools. In order to justify the applicability of the predicted model, we have also implemented the scenario in a simulation containing 100 randomly protected PCs. The unprotected PCs are nonreactive, so that the same PC might be repeatedly infected by the same virus [16], hence leading to IFR system behavior. The description of the chosen model (Poisson) is as follows. Let N(t) be the current (at time t) number of faults detected and M the expected number of total virus incidents during the entire observation time. We assumed that the times to detect the incidents, denoted T_1, T_2, …, T_M, are mutually independent and identically distributed random processes, each defined by a common function ξ(t). That is, the injection process of the seeded virus is defined as a common Poisson process, because a unique type of virus has been seeded, and the PCs were all identical, having the same operating system and the same set of applications. It follows that the probability of detecting incident i at time t is ξ_i(t) = p. Thus, we have

$$P\{N(t) = k \mid M\} = \binom{M}{k} [\xi(t)]^k [1 - \xi(t)]^{M-k}. \qquad (13)$$

Since M is Poisson distributed, its pdf would be given as

$$P\{M(t) = m\} = \frac{(\lambda t)^m e^{-\lambda t}}{m!}. \qquad (14)$$

From the above assumptions, we can obtain the number of such incidents at time t as

$$P\{N(t) = k \mid \lambda, \xi(t)\} = \frac{[\lambda \xi(t)]^k e^{-\lambda \xi(t)}}{k!}. \qquad (15)$$

As a corollary to this, [30] has achieved the following deductions. If the number of attacks is large and the infection probability of each attack is small, then the process is Poisson distributed. If, on the other hand, the infection probability of each attack is relatively high, which is often the case for insufficiently protected small environments, then the process is binomially distributed, i.e., the probability of n infections among m incoming messages (both viral and clean) is binomially distributed and can be determined by [16]

$$p_n = \binom{m}{n} p^n (1-p)^{m-n}.$$
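The thinning argument behind Eqs. (13)–(15) — a Poisson number of incidents, each detected independently with probability ξ(t) = p, yields a Poisson number of detections with mean λξ(t) — can be checked numerically. The sketch below uses invented values of λ and p, not the paper's Conficker data:

```python
import math
import random

def detect_prob(k, lam, p):
    """Eq. (15): detections form a thinned Poisson with mean lam*p."""
    m = lam * p
    return m ** k * math.exp(-m) / math.factorial(k)

def poisson_sample(lam, rng):
    """Knuth's method for sampling a Poisson(lam) variate."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

rng = random.Random(7)
lam, p = 20.0, 0.3          # invented illustration values
trials = 20_000
hits = 0
for _ in range(trials):
    incidents = poisson_sample(lam, rng)
    detected = sum(1 for _ in range(incidents) if rng.random() < p)
    if detected == 6:
        hits += 1
empirical = hits / trials   # close to detect_prob(6, lam, p)
```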
That is, combining the features of both birth and death Markov processes, we can model many realistic problems related to worm propagations, massive attacks, network traffic analyses, maintenance problems, and queuing systems dealing with recovery problems. As a special case (considered in Section 7.5), we have modeled the failures caused by a DDoS attack as a "birth" process and the reactiveness of the attacked system as a "death" process. Availability of the victim system degrades during the birth period and increases during the death (reactive) period. The reactiveness is an explicitly built-in intelligence within the victim system, which automatically obstructs the attack packets, hence decreasing the cumulative failures following each obstacle period. In addition to Appendix A, we can find the necessary material on the derivation of the related mathematical models in diverse sources, e.g., [16]. In this model a process is a combination of both the birth and death processes whose transitions lead to the generator matrix:

$$A = \begin{pmatrix} -\lambda_0 & \lambda_0 & 0 & 0 & 0 & \cdots \\ \mu_1 & -(\lambda_1 + \mu_1) & \lambda_1 & 0 & 0 & \cdots \\ 0 & \mu_2 & -(\lambda_2 + \mu_2) & \lambda_2 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & \ddots \end{pmatrix}$$

Based on the generator matrix, we can see that the respective forward Kolmogorov equation describing the birth and death process is

$$P'_n(t) = -(\lambda_n + \mu_n) P_n(t) + \lambda_{n-1} P_{n-1}(t) + \mu_{n+1} P_{n+1}(t), \qquad (22)$$

with the initial conditions

$$P_n(0) = \begin{cases} 1 & \text{if } n = i, \\ 0 & \text{if } n \ne i. \end{cases}$$

If both the birth and death processes take place linearly dependent on the population size, i.e., λ_n = nλ, μ_n = nμ, and ρ = λ/μ, we get

$$P_0(t) = \mu \, \frac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}}, \qquad P_n(t) = (1 - P_0(t))(1 - \rho P_0(t))(\rho P_0(t))^{n-1}. \qquad (23)$$

Assuming that the only nonzero transition probabilities of this process are [17]

$$\lambda_0 = \mu, \quad \lambda_1 = -(\lambda + \mu), \quad \text{and} \quad \lambda_2 = \lambda,$$

then we get a set of useful functions describing the probabilities of the first and the last states as

$$P_0(t) = \mu\gamma, \qquad P_n(t) = (1 - \lambda\gamma)(1 - \mu\gamma)(\lambda\gamma)^{n-1}, \quad n \ge 1, \qquad (24)$$

where

$$\gamma = \begin{cases} \dfrac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}} & \text{if } \lambda \ne \mu, \\[2ex] \dfrac{t}{1 + \lambda t} & \text{if } \lambda = \mu. \end{cases} \qquad (25)$$

7.1. Poisson and binomial distributions

To validate failures involving the exponential distribution and having the memoryless Markov property, we can make use of properties inherent to Poisson events. As an example, the times between different types of attacks launched on different machines within the same network can be modeled as a Poisson process, each with density parameter λ_i. That is, for each machine receiving different attack densities, λ_1, …, λ_m, the probability of exactly n attacks being observed in t time units can be computed by

$$p_n(t) = \frac{[(\lambda_1 + \lambda_2 + \cdots + \lambda_m)t]^n}{n!} \, e^{-(\lambda_1 + \lambda_2 + \cdots + \lambda_m)t}. \qquad (26)$$

One can refer to the additive property of the Poisson process to verify Eq. (26) as follows. Let X_1(t) and X_2(t) be two Poisson arrivals with average rates λ_1 and λ_2, respectively. Now, let X(t) = X_1(t) + X_2(t); for t ≥ 0 we have

$$P\{X_1(t) = n_1\} = e^{-\lambda_1 t} \frac{(\lambda_1 t)^{n_1}}{n_1!}, \qquad P\{X_2(t) = n_2\} = e^{-\lambda_2 t} \frac{(\lambda_2 t)^{n_2}}{n_2!}. \qquad (27)$$

Using these, we can show that X(t) is also a Poisson process. Hence,

$$\begin{aligned} P\{X(t) = n\} &= \sum_{n_2 = 0}^{n} P\{X_1(t) = n - n_2\}\, P\{X_2(t) = n_2\} \\ &= \sum_{n_2 = 0}^{n} e^{-\lambda_1 t} \frac{(\lambda_1 t)^{n - n_2}}{(n - n_2)!} \, e^{-\lambda_2 t} \frac{(\lambda_2 t)^{n_2}}{n_2!} \\ &= e^{-(\lambda_1 + \lambda_2)t} \, t^n \sum_{n_2 = 0}^{n} \frac{\lambda_1^{n - n_2} \lambda_2^{n_2}}{(n - n_2)! \, n_2!} \\ &= \frac{[(\lambda_1 + \lambda_2)t]^n}{n!} e^{-(\lambda_1 + \lambda_2)t} \sum_{n_2 = 0}^{n} \binom{n}{n_2} \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{n - n_2} \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{n_2} \\ &= \frac{[(\lambda_1 + \lambda_2)t]^n}{n!} e^{-(\lambda_1 + \lambda_2)t}. \end{aligned} \qquad (28)$$

Example. Suppose a pair of independent DoS attackers, say A and B, bursting attack packets independently on a router with the Border Gateway Protocol (BGP). Attacker A performs a Resource Exhaustion attack and B carries on with a Link Cutting attack, both of which eat up the CPU time and storage resources of the BGP protocol independently. The ultimate effect of the attacks is service degradation. The number of attacks succeeded by A has a Poisson distribution with mean 5, and the number of attacks succeeded by B has a Poisson distribution with mean 4. Let us find the probability that the total number of successes is 10. By referring to Eq. (28), we obtain the result as

$$P\{X(t) = 10\} = e^{-(5 + 4)} \frac{(5 + 4)^{10}}{10!} = 0.119.$$
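The BGP example can be checked numerically with a direct transcription of Eq. (28) (no assumptions beyond the stated means of 5 and 4):

```python
import math

def poisson_pmf(n, mean):
    """P{N=n} for a Poisson variable with the given mean."""
    return mean ** n * math.exp(-mean) / math.factorial(n)

def merged_pmf(n, means):
    """Eq. (28): superposed independent Poisson attack streams are
    Poisson with the summed mean."""
    return poisson_pmf(n, sum(means))

p10 = merged_pmf(10, [5, 4])   # about 0.119, matching the example
```

The convolution in the derivation of Eq. (28) gives the same number, which is a useful sanity check on the additive property.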
The Pareto distribution is given by

$$f(x) = \alpha \lambda^{\alpha} x^{-\alpha - 1}, \quad x \ge \lambda; \ \alpha, \lambda > 0, \qquad (29)$$

the cdf:

$$F(x) = \begin{cases} 1 - \left(\dfrac{\lambda}{x}\right)^{\alpha} & \text{for } x \ge \lambda, \\ 0 & \text{for } x < \lambda, \end{cases}$$

the failure rate:

$$h(x) = \begin{cases} \dfrac{\alpha}{x} & \text{for } x \ge \lambda, \\ 0 & \text{for } x < \lambda, \end{cases}$$

and the reliability for x ≥ λ is

$$R(x) = \left(\frac{\lambda}{x}\right)^{\alpha}. \qquad (30)$$

It follows from Section 6.2 that the overall effect of superimposed attacks can be modeled as a hyper-Erlang distribution, assuming that the distribution is discrete. A hyper-Erlang distributed random variable X has the following pdf:

$$S(x) = \sum_{i=0}^{n} p_i E_{r_i}(x), \quad 0 < p_i \le 1, \qquad (31)$$

where E_{r_i} denotes an r-stage Erlang distribution, each stage having the probability p_i. As will be discussed later, for situations where the attack times are exponentially distributed with different rates (λ_i ≠ λ_{i+1}) and random time intervals (T_i ≠ T_{i+1}), the events can be modeled by an r-stage hyperexponential distribution:

$$f(t) = \sum_{i=1}^{r} \alpha_i \lambda_i e^{-\lambda_i t}, \quad 0 < \alpha_i \le 1, \ t > 0. \qquad (32)$$

Suppose a victim system has a maximum tolerable stress level, say S_m, which when exceeded will cause random peak failures. Such failures caused by massive attacks (bursts of attack packets) can be defined in terms of peak stresses, which are assumed to follow a Poisson distribution having the attack effect rate parameter λ_s t, defined for the time (t, t+s). The effect of the attacks varies with the use of CPU and memory resources of the target system. The number of peak stresses (point unreliability), S_t, during the time interval (0, t] is given by

$$P\{S_t = r \mid \lambda\} = \frac{e^{-\lambda t} (\lambda t)^r}{r!}, \quad \lambda > 0, \ r = 0, 1, 2, \ldots \qquad (33)$$

As of the property of Poisson events, the number of attacks is very high while the probability of causing failures is relatively low [16]. The reliability of the system prior to the peak stress period can be expressed in terms of the victim's uptime X related to S_t satisfying [X > t] = [S_t < r]. Thus, the reliability of the system prior to the peak stress period can be expressed as

$$R(t) = P\{X > t\} = P\{S_t < r\} = \sum_{k=0}^{r-1} P\{S_t = k \mid \lambda\} = e^{-\lambda t} \sum_{k=0}^{r-1} \frac{(\lambda t)^k}{k!}. \qquad (34)$$

The cumulative distribution function, F(t), as the complement of R(t), is defined as

$$F(t) = 1 - R(t) = 1 - e^{-\lambda t} \sum_{k=0}^{r-1} \frac{(\lambda t)^k}{k!}, \quad t, \lambda > 0, \ r = 1, 2, \ldots \qquad (35)$$

It should be recalled from the related theorem that if X_1, X_2, …, X_r are mutually independent and identically distributed random variables, where each X_i is exponentially distributed, then the random variable $\sum_{i=1}^{r} X_i$ has an r-stage Erlang distribution with parameter λ. Hence, by this assumption Eq. (35) is an r-stage Erlang distribution function. It follows that we can express this function in terms of an Erlang hazard rate as

$$h(t) = \frac{\lambda^r t^{r-1}}{(r-1)! \displaystyle\sum_{k=0}^{r-1} \frac{(\lambda t)^k}{k!}}, \quad t, \lambda > 0, \ r = 1, 2, \ldots \qquad (36)$$

This implies that the victim's lifetime has an r-stage Erlang distribution expressed by

$$f(t) = \frac{\lambda^r t^{r-1} e^{-\lambda t}}{(r-1)!}, \quad t, \lambda > 0, \ r = 1, 2, \ldots \qquad (37)$$

The Weibull distribution belongs to one of the most widely used parametric categories of distributions, often used for hardware fatigue analyses [41]. A nonlinear hazard function is used when the failure rate does not change linearly with time. We can obtain an IFR, DFR, or constant failure distribution by a proper choice of its shape parameter, α. A DoS- or buffer-overflow-attacked system will often go through a mortality phase of first IFR followed by a constant failure rate. The latter phase is thus the most unreliable state of the system and leads to a restart, during which a natural DFR phase is active.

7.5. Attack-obstacle model: a case study

Attackers may take advantage of deficiencies found in a protocol implementation. Attacks are generally tunneled via these protocols to gain access to specific applications. There are several other protocol deficiencies that allow an attacker to build fraudulent TCP/IP packets leading to disruptions. But these attacks can be thwarted with the aid of secure protocol implementations, which have some built-in intelligence. Intelligent protocols first track the network traffic in a learning period and, in turn, try to obstruct eventual attacks, all in real time. For example, some firewalls and routers may have an intelligent combat module used as a real-time attack blocker. Such systems with quasi-renewal capabilities can be approximated by an immunity growth model, Assumption 3. An immunity growth model composed of a failure period (birth) followed by a short recovery (death) period can be ideally modeled by a birth and death Markov process. Analysis of firewalls and intrusion detection systems with real-time responsiveness can benefit from this model.

Given the probability of the attack accumulation (or density) a(t) at time t, which is defined by the relation between the attack rate λ and blocking rate μ of a massive attack by

$$a(t) = \frac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}}, \qquad (38)$$

we get the probability of having n new offspring of DDoS attacks at time t leading to the peak stress on a single victim as

$$P_n(t) = [1 - \lambda a(t)][1 - \mu a(t)][\lambda a(t)]^{n-1}. \qquad (39)$$

As discussed in detail by [16], the above formulation assumes that the attack intensity initially takes the form of a branching process; see Eq. (24). This function can be further expressed in terms of the CPU usage of the victim, which also clarifies the relation between the attack and blocking behaviors together with the attack density.
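Equations (34), (38), and (39) above can be sketched numerically as follows (a minimal illustration; the attack and blocking rates are invented, chosen so that blocking outpaces attacking, μ > λ):

```python
import math

def peak_stress_reliability(t, lam, r):
    """Eq. (34): R(t) = e^{-lam*t} * sum_{k=0}^{r-1} (lam*t)^k / k!."""
    return math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k)
                                    for k in range(r))

def attack_density(t, lam, mu):
    """Eq. (38): a(t) for attack rate lam and blocking rate mu."""
    g = math.exp((lam - mu) * t)
    return (1.0 - g) / (mu - lam * g)

def offspring_prob(n, t, lam, mu):
    """Eq. (39): probability of n new DDoS offspring at time t."""
    a = attack_density(t, lam, mu)
    return (1.0 - lam * a) * (1.0 - mu * a) * (lam * a) ** (n - 1)

r_5 = peak_stress_reliability(1.0, 2.0, 5)  # P(fewer than 5 peaks by t=1)
a_long = attack_density(50.0, 1.0, 2.0)     # tends to 1/mu = 0.5
```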
Sometimes the occurrence of attacks, their impacts resulting in system failures, and the process of system recovery evolve in multiple sequential stages. For example, the so-called Watering hole attack [42] uses three contiguous stages to compromise an organization's resources. Some organized attacks may occur in parallel or in some combined forms. Besides, a modern IT environment consists of HW and SW components, each with different vulnerability characteristics and different protection possibilities. Due to these, building an appropriate model is a complex process, where different models can be combined to achieve more realistic results. An IT environment with mostly identical systems is treated as a homogeneous system, while one with a mixture of different systems is treated as a heterogeneous or nonhomogeneous system. Reliability analysis of heterogeneous systems is significantly more complex than that of homogeneous systems. Nevertheless, for both categories the decomposition steps of the analysis need to be handled with great care. Below are the two most widely used mathematical models that can be applied to solving problems involving sequential and parallel phases.

… and the related failure (hazard) rate is given by

$$h(t) = \frac{\lambda_1 \lambda_2 (e^{-\lambda_1 t} - e^{-\lambda_2 t})}{\lambda_2 e^{-\lambda_1 t} - \lambda_1 e^{-\lambda_2 t}}. \qquad (43)$$

It shows that a hypoexponential distribution exhibits IFR from zero up to min(λ_1, λ_2).

7.6.2. Parallel phases

If a process consists of alternate phases, where each phase has some incoming events each with probability α, then the phases can be modeled as exponentially distributed. It follows that the overall distribution gives rise to a hyperexponential distribution. The density function of a k-phase hyperexponential random event chain is given by

$$f(t) = \sum_{i=1}^{k} \alpha_i \lambda_i e^{-\lambda_i t}, \quad t, \lambda_i > 0 \text{ and } \alpha_i > 0 \text{ with } \sum_{i=1}^{k} \alpha_i = 1, \qquad (44)$$

and its cumulative distribution function is

$$F(t) = \sum_{i=1}^{k} \alpha_i (1 - e^{-\lambda_i t}), \quad t \ge 0. \qquad (45)$$

The hazard function is thus

$$h(t) = \frac{\sum_i f(i)}{\sum_i R(i)} = \frac{\sum_i \alpha_i \lambda_i e^{-\lambda_i t}}{\sum_i \alpha_i e^{-\lambda_i t}}, \quad 1 \le i \le k, \ t \ge 0. \qquad (46)$$

Here, α_i denotes the probability of the ith event and λ_i denotes the occurrence rate of the ith event. The hazard function, Eq. (46), is a DFR starting from the upper range given by ∑_i α_i λ_i down to min(λ_1, λ_2, …, λ_k). An application of the hyperexponential distribution dealing with the analysis of network performance models is presented in [45]. Since the hyperexponential distribution exhibits … stages with nonidentical behaviors. If the interconnected stages behave identically against threats, then the cumulative reliability can be computed by applying the Bernoulli process as

$$R(p, k, n) = \sum_{j=0}^{n-k} \binom{n}{j} p^{n-j} (1-p)^{j}. \qquad (48)$$

A Bernoulli process is a sequence of independent identically distributed (iid) Bernoulli trials. Independence of the trials implies that the process is memoryless. Given that the probability p is known, past outcomes provide no information about future outcomes. However, the past informs about the future indirectly through inferences about p while the process undergoes the trials. On the contrary, if the stages do not behave identically (variable
DFR it can be used with immunity growth models applied to IFR/DFR) then we need to analyze every possible path of the
systems having different random parameters for different events reliability structure in order to accurately determine the overall
each showing various vulnerabilities under different situations. Or, reliability of the system at hand. In most basic cases where only
it can be used to model an attack that completes its mission in qualitative results are required then a Fault Trees method can be
several consecutive phases. For example, a targeted attack takes used to simplify the solution. However, analysis of sequential and
place in four phases: incursion, discovery, capture, and exfiltration. functional interdependencies between components can be cum-
Each of these events has a different probability distribution for bersome to deal with the fault-tree method, instead one may
ðα1 ; α2 ; α3 ; α4 Þ and a different cause effect (λ) [46]. Another example choose a more deductive approach using Bayesian belief networks
of consecutive events is performed by a Trojan, which can [47]. There exist some other mathematical models that can be
propagate through many computers in a network whose impact modified and used further. For example, reliability analysis of
on the visited computers are random. consecutive k-out-of-n systems with non-identical components
Similar to the TFN attacks, illustrated in Fig. 4, an attacker is lifetimes considered by [22].
often interested in finding a way to (stealthily) access the resources It is also necessary to consider the hazard function of the
within a network of distributed resources. This is a so-called systems when attacked, whether the failure rate increases/
reconnoissance process often performed by portscanner tools, e.g., decreases linearly by time or the stages go through constant
nmap and Nessus, which launch their scans in parallel; often via a failure rates. This complexity can be modeled as k-out-of-n system,
single scanner (also pseudo-parallel). Since the scanned system where each stage having a different failure function. Many com-
believes that the scan is a benign connection request, it uses its CPU plex k-out-of-n systems having constant failure rate with expo-
and memory resources to blindly respond the reconnoissance nential distributions are modeled as
process. As a result, the service quality of the scanned systems will !
significantly degrade. Suppose that interarrival times of a portscan
k n
Rx ðtÞ ¼ ∑ ðe λt Þj ð1 e λt Þn j : ð49Þ
burst are exponentially distributed, and exactly one portscan has j¼0 j
occurred in the interval ð0; t, i.e., there is no overlap between the
Consequently, due to numerous variety of systems with also
successive portscan times. By referring to the properties of the
immense complexity, analysis of complex system has always been
Poisson process, we can easily show that the conditional distribu-
an open problem. Though there exist methods dealing with
tion of scan time T1 is uniform over ð0; t. If Tk denotes the kth scan
conventional complex structures, it is quite difficult to calculate
event, Tk is a k-stage Erlang random variable, whose conditional
the collapse probability of a system having indeterminate struc-
distribution T k ð1 rk rnÞ can be determined as follows:
ture with many possible modes or paths to complete failure, which
Pfð NðτÞ ¼ 1 Þ \ ð NðtÞ NðτÞ ¼ 0 Þg may propagate the problem to remain unsolved for the repair case
P T 1 rτ∣NðtÞ ¼ 1 ¼
PfNðtÞ ¼ 1g too. We have always some parallels between the failure and
λτe λτ e λðt τÞ τ recovery structures of a given system. Several models used for
¼ ¼ : ð47Þ
λte λt t failure analysis are already discussed above. The most appropriate
approach to model the reliability of a majority of complex systems
This can be generalized to n cases to give the joint pdf of the arrival
with predictable failure pats can be a k-out-of-n parallel system
times as
described as [6]
n!
f t 1 ; t 2 ; …; t n ∣NðtÞ ¼ n ¼ n : n n
t Rs ¼ ∑ Rx Ri0 ð1 R0 Þn i ; ð50Þ
i¼k i
The proof is based on the relation:
where Rx denotes the individual complexity of a failure path,
PðA \ BÞ
PðA∣BÞ ¼ ; which is defined as (a discrete geometric distribution):
PðBÞ
n
and left to the reader. 1
Rx ¼ q i ; q e α; q A ½0 1; α A ½0 1:
7.6.3. Consecutive system reliability Eq. (50) assumes identical reliability function, R0, for each of the
Let us consider a typical DoS attack, where the attacker may components, where the summation components are subscripted
issue more than a single attack by launching a series of consecu- by i starting at k. Thus, the step parameter ðniÞ denotes the number
tive sub-attacks. That is, first bypassing (fire-walking) a firewall, of paths of a k-out-of-n system with n components of which i at
than finding exact local IP-addresses, and than launching various step i are working. In a more general case, depending on the
other attacks (e.g., IP-masquerading combined with UDP-squelch) components' interaction structure with each other, we can model
in several stages till the victim is compromised. There are two the system as a series–parallel system. The overall risk is then
major cases to consider: (i) stages with identical behaviors and (ii) dominated by the cascade causality among the interacting
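As a quick numerical check of the k-out-of-n structure discussed above, the following Python sketch (our own illustration, not from the paper) evaluates the binomial reliability of Eq. (50) under the simplifying assumption that every failure path has complexity R_x = 1 and all components share the same reliability R_0:

```python
from math import comb

def k_out_of_n_reliability(k: int, n: int, r0: float) -> float:
    """P(at least k of n identical components survive), i.e. the
    binomial form of Eq. (50) with the path-complexity factor R_x = 1."""
    return sum(comb(n, i) * r0**i * (1 - r0)**(n - i) for i in range(k, n + 1))

# A 2-out-of-3 stage structure with per-stage reliability 0.9:
print(round(k_out_of_n_reliability(2, 3, 0.9), 4))  # 0.972
```

For a 2-out-of-3 structure the sum has only two terms, C(3,2)·0.9²·0.1 + C(3,3)·0.9³ = 0.243 + 0.729 = 0.972, which the sketch reproduces.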
Bayesian belief networks can be effectively used for the majority of cases, where the causal interdependence among system components is definable. Some related models can be found in [25,48].

7.7. Model verification

While analyzing the system we collect attack data (if observable) and failure data corresponding to the attacks. The first step following the data aggregation is to create plots of the observed data. Since the plot can display the behavior of the system reliability during the observation period, it can guide us in finding a closer model to start with. We can then estimate values for the model parameters from the failure data. We have a variety of methods for the parameter estimation. Because of its good statistical properties, one often uses the maximum likelihood estimation (MLE) method for estimating a small set of parameters. We can now use the estimated parameter values to obtain a fitted model by substituting the estimated values of the parameters for the parameters of the chosen model. Furthermore, as the last step, we need to verify the chosen model. This process is called a goodness-of-fit test, which evaluates whether the chosen model fits the observed data. There are also various methods for checking a model; e.g., the Kolmogorov–Smirnov or Anderson–Darling test can be used to test whether two samples drawn from identical distributions match or converge at some point. If the model chosen does not fit, we have to collect more data or search for another model. Perhaps simulation is the easiest method for performing a goodness-of-fit test. Besides, if the complexity of the model parameters is high, which makes it difficult to guess a model, then we can use the simulation method.

We applied the simulation method to the experiment presented above. The experiment produced a set of availability (CPU-usage) data for 50 different DoS attacks. Since the target system is a responsive system, the chosen model would rely on the immunity growth exhibiting an AFR behavior. We have repeated the experiment on the same target system at different times and with different attack schemes, and plotted the observation data and the simulation data. Both the experiment and the simulation results are plotted in Fig. 7. Mean values of all experiment data at different points can be compared to the corresponding simulation data in order to find a correlation between the experiment and the simulation. By a further analysis of the correlation data one can easily compute the estimated model parameters. As mentioned earlier, the attack-obstacle model was designed to match the experiment discussed above; thus, the attack-obstacle model was verified to match a system with responsiveness, see Assumptions 1–3.

Fig. 7. Plots of the results from the simulation and experiments (fluctuating curves). [CPU-utilization with λ = 1.8, μ = 1.74; the plotted curves show immunity growth, cumulative degradations and recoveries, and failure, over time 0–50.]

8. Impact analysis

The models given above are devoted to attack and failure analyses; however, regarding the impact analysis, we need to develop, or apply existing (e.g., [17]), models to estimate an overall loss involving system failure rates, system down-times, and the entire cost from the start of the failure phase until the total recovery, if repairable. Risks in a network describe the degree of exposure of the system to failures. Impact is defined as the "cost" (or risk) of unavailability of a resource, which is a factor of the resource weight (importance), the reliability measure of the resource, and a defense factor (if the resource is protected). There can be multiple simultaneous attacks on a single system or on a series of systems in a network, causing multiple failures. Accordingly, the overall network reliability should be reconsidered in order to estimate the total cost as a measure of unreliability of all the affected systems. Some systems may have defense strategies while others are unprotected against threats. That is, the impact measure depends also on existing protection measures and, indirectly, on the strength of the threats that lead to system failures. Therefore, in order to estimate the overall impact for a network covering all security threats, each component must have been assigned:

w_i  a quantitative importance factor (weight), 0 < w_i < 100,
T_i  a threat/attack list given as T_i⟨t_1, ..., t_n⟩,
d_i  a quantitative vulnerability factor associated to each threat/attack in the list, which can be represented either in terms of the strength of the threat/attack or alternatively in terms of the strength of the defense strategy, 0 < d_i < 1,
h_i  a quantitative hazard factor/function associated to each threat/attack in the list, 0 < h_i < 1.

The weight of a system is a subjective metric chosen by the evaluator, whereas the threat list and the remaining factors are determined from the existing threat list and working environment. Regarding the vulnerability factor, an empirical method for determining the attack strength defined in terms of attack success ratio is presented in [7], where a vector was built to contain the attack success (or hazard) data from a variety of attack types as

\vec{h} = (h_1, h_2, ..., h_n) = \left( \frac{s_1}{a_1}, \frac{s_2}{a_2}, ..., \frac{s_n}{a_n} \right).

Here, parameter a_i depicts the number of attacks of type i, and s_i holds the number of successes of that attack type. A total impact factor of m components, each having n threats, can be readily computed by

I = \sum_{j=1}^{m} w_j \left( \sum_{i=1}^{n} h_i \right).

Table 3
Impact computation for some network units.

Component      Threat list   \sum h    Weight   Impact
Router         T_1⟨⋯⟩        0.7826    100      78.2600
E-mail server  T_2⟨⋯⟩        0.2250     75      16.8750
PC             T_3⟨⋯⟩        0.5000     60      30.0000
Printer        T_3⟨⋯⟩        0.4300     50      21.5000
dB-server      T_4⟨⋯⟩        0.4091     80      32.7280
Web-server     T_2⟨⋯⟩        0.8958     90      80.6220

This formula is kept simple and has no bound normalization.
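The impact figures in Table 3 follow directly from the definition I = Σ_j w_j Σ_i h_i. The short Python sketch below (illustrative only; the per-component Σh and weight values are copied from Table 3) reproduces the per-component impacts and the total:

```python
# (component, sum of hazard factors, weight) -- values from Table 3
components = [
    ("Router",        0.7826, 100),
    ("E-mail server", 0.2250,  75),
    ("PC",            0.5000,  60),
    ("Printer",       0.4300,  50),
    ("dB-server",     0.4091,  80),
    ("Web-server",    0.8958,  90),
]

def impact(sum_h: float, weight: float) -> float:
    """Per-component impact w_j * sum_i h_i."""
    return weight * sum_h

for name, sum_h, weight in components:
    print(f"{name:14s} impact = {impact(sum_h, weight):8.4f}")

# Total impact I over all m components
total = sum(impact(h, w) for _, h, w in components)
print(f"Total impact I = {total:.4f}")  # 259.9850
```

For example, the Router row gives 100 × 0.7826 = 78.2600, matching the table.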
Nevertheless, it can satisfactorily display the impacts in a quantitative manner. Indeed, the computation of the hazard factor for some components may also include defense factors. Since defense strategies generally change from company to company, their reliability analysis can be generalized by eliminating the unknown factor of the defense strategy. For this reason we consider the defense factor as a constant parameter in the impact analysis. Some components (see Fig. 1) and their items used for the impact computation are shown in Table 3.

9. Recovery models

The recovery process of a failed computer consists of a number of operations such as patching, rebuilding, and reinstalling of a variety of corrupted software and/or hardware components. There may also be an entire recovery process dealing with disc recovery and a thorough system rescue operation, which may require significantly high resources for recovery, and much longer time for all required operations. Suppose that we have n computers enqueued randomly for repair, patch, and update operations, where the incoming traffic is of a Poisson type, Eq. (15), with the arrival rate λ. Furthermore, λΔt + o(Δt) is the probability of accepting at least one repair request in the small time interval Δt. Total repair (holding) times are exponentially distributed with random time intervals; assuming that each computer waits for T random time units and requires another random time unit t to be repaired, then from [16],

P{T \ge t} = e^{-\mu t}   (51)

can be used to determine the probability of the total time required for the recovery (queuing + repair) operation. The times between recoveries, R, is also a Poisson process with the exponential distribution parameter μ. Thus, the continuous random variable R has the probability density function

f_R(t) = \mu e^{-\mu t},   (52)

and the mean time between recoveries is

E(R) = \frac{1}{\mu}.   (53)

For a set of m machines maintained by the recovery facility, the state probabilities satisfy the recurrence

p_{n+1} = (m - n) \frac{\lambda}{\mu} p_n,  n = 1, 2, ..., m - 1.   (54)

Solving Eq. (54) with the normalization condition \sum_{n=0}^{m} p_n = 1, we obtain the probability of exactly n machines among m being down as

p_n = \frac{m!}{(m - n)!} \left( \frac{\lambda}{\mu} \right)^{n} p_0,  n = 1, 2, ..., m - 1,   (55)

where p_0 is the proportion of time the system is idle, i.e., no recovery taking place.

9.2. Utilization of the recovery facility

Recovery utilization (or traffic intensity), defined as the proportion of time the repair facility (or software) is busy with processing of the incoming machines, is expressed as 1 − p_0. Let us assume a stable state (the steady state), where the arrival rate of the failed machines is always less than or equal to the repair rate of the repair facility, and the corresponding utilization factor of the repair facility is expressed as ρ = λ/μ.

The overall throughput ratio for a repair facility of a single repairman (or a single utility tool) and n repair requests of identical failures is computed by

\rho = \left( \frac{\lambda}{\mu} \right)^{n},   (56)

and therefore, from queuing theory we have

p_n = p_0 \rho^{n},  for n = 1, 2, ...,   (57)

where, implicitly,

p_0 = \frac{1}{1 + \sum_{n=1}^{\infty} \rho^{n}} = \left( \sum_{n=0}^{\infty} \rho^{n} \right)^{-1} = 1 - \rho.   (58)

Thus, the steady-state probability for state n (i.e., n machines in the recovery service) is

p_n = p_0 \rho^{n} = (1 - \rho) \rho^{n},  for n = 1, 2, ....   (59)

From this, the expected number of machines in the recovery system (queue + service) can be easily computed as E(N) = \rho / (1 - \rho), Eq. (60).
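As an illustration of Eqs. (56)-(60), the sketch below (helper names are ours; the rates λ = 18 and μ = 20 per hour are the ones used in the case study of Section 10) computes the single-repairman steady-state quantities:

```python
def mm1_recovery(lam: float, mu: float):
    """Single-repairman recovery queue: utilization rho = lam/mu,
    idle probability p0 = 1 - rho (Eq. (58)), state probabilities
    p_n = (1 - rho) * rho**n (Eq. (59)), and E(N) = rho/(1 - rho) (Eq. (60))."""
    rho = lam / mu
    if rho >= 1:
        raise ValueError("steady state requires lam < mu")
    p0 = 1 - rho
    p = lambda n: p0 * rho**n
    expected_n = rho / (1 - rho)
    return rho, p0, p, expected_n

rho, p0, p, en = mm1_recovery(18, 20)
print(round(rho, 3), round(p0, 3))  # 0.9 0.1
print(round(en, 3))                 # 9.0
```

With λ = 18 and μ = 20 this yields ρ = 0.900, p_0 = 0.100 and E(N) = 9, matching the case-study figures below.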
For a multiple-repair facility with s repairmen, the probability of being in state n is defined as [16]

p_n = \frac{\rho^{n}}{n!} p_0  if 0 \le n \le s;   p_n = \frac{\rho^{n}}{s! \, s^{n-s}} p_0  if n \ge s,

with

p_0 = \left[ \sum_{n=0}^{s-1} \frac{\rho^{n}}{n!} + \frac{\rho^{s}}{s!(1 - \rho_s)} \right]^{-1}.   (62)

Setting ρ_s = λ/(sμ) as the throughput (or service rate), we get the expected number of machines in the recovery system (queue + service) with a multiple-repair facility as the sum of the expected queue length waiting for repair and the throughput ratio of the queue:

L = L_q + \frac{\lambda}{\mu} = \frac{p_0 \rho^{s} \rho_s}{s!(1 - \rho_s)^{2}} + \rho.   (63)

10. A case study: analysis of a recovery facility

First, we consider the input dynamics of the recovery facility, so that we are able to analyze the time-dependent states of the suspected computers to determine the failure and recovery functions, as well as the stationary states describing the long-run reliability figure of the entire process. Susceptibility of a node is identified by its expected service degradation rate, β. Time/resource usage for computing β is not considered in this example; however, for large networks this process can require a significant amount of resources, which it may be necessary to incorporate into the computation. The probability distribution for having s susceptible hosts among m is binomial:

p(\beta, s, m) = \binom{m}{s} \beta^{s} (1 - \beta)^{m-s},  s > 1.

Since the output from the failure queue is directly cascaded to the recovery queue, the scenario becomes a mixed situation of both input to the recovery and output from the recovery queue. Thus, modeling the dynamic interaction of the mixed queues can be simply expressed by the following differential equation set:

\partial s / \partial t = \beta s(t) - \lambda f(t),
\partial f / \partial t = \lambda f(t) - \beta s(t),
\partial r / \partial t = \mu r(t) - \lambda f(t).   (64)

As clearly seen, this equation set denotes the dynamic behavior of discovering susceptible computers, ∂s/∂t, their queuing at the recovery center, ∂f/∂t, and the recovery operation, ∂r/∂t, where the associated rates are β, λ, and μ, respectively. Given the failure distribution

f(t) = \lambda e^{-\lambda t},

the reliability function for the continuous time analysis can be derived from the complement expression, R(t) + F(t) = 1, as

R(t) = 1 - F(t) = 1 - \int_{0}^{t} \lambda e^{-\lambda u} \, du = e^{-\lambda t}.

In order to give a clearer view, we present the analysis with a series of related questions, each leading to the solution of a specific part of the problem at hand.

Q1: What portion of time is the recovery service busy with processing the suspected (slowed-down) computers? Primarily, we compute the traffic intensity (input/output flow ratio of the recovery facility) defined by Eq. (56) as

\rho = \frac{\lambda}{\mu} = \frac{18}{20} = 0.900.

Thus, the proportions of the process times of failed, ρ_f, and non-failed (healthy), ρ_h, computers are computed as

\rho_f = \rho p_f = 0.900 \times 0.300 = 0.270,  and
\rho_h = \rho p_h = \rho (1 - p_f) = 0.900 \times (1 - 0.300) = 0.900 \times 0.700 = 0.630.

Note that p_f denotes the failure (infection) probability, and hence p_h = 1 − p_f.

Q2: The probability of exactly one computer being in the recovery center can be obtained using the recurrence formula in a Markovian system, see Appendix A. First, we compute the idle probability p_0 = 1 − ρ = 0.100.

¹ An M/M/1 queue represents a stochastic queue system having a single server, where arrivals are determined by a Poisson process and job service times have an exponential distribution.
Thus, the probability of having exactly one computer being in the system is

p_1 = (1 - \rho)\rho = (1 - 0.900) \times 0.900 = 0.090.

Q3: What is the probability that there is more than one failed computer in the system? First, the arrival rate of the failed (infectious) computers is determined as

\lambda_f = p_f \lambda = 0.300 \times 18 = 5.400 computers/h,

and the probability of having exactly one infected computer is thus

p_{\langle f,1 \rangle} = \frac{\lambda_f}{\mu} p_{\langle f,0 \rangle} = \frac{\lambda_f}{\mu} \cdot \frac{\mu - \lambda_f}{\mu} = \frac{5.400}{20} \times \frac{20 - 5.400}{20} = 0.197.

Hence, the probability of having more than one infected computer, p_{\langle f,k \rangle}, in the recovery center is

p_{\langle f,k \rangle} = \sum_{j=2}^{\infty} p_{\langle f,j \rangle} = 1 - p_{\langle f,0 \rangle} - p_{\langle f,1 \rangle} = 1 - \frac{20 - 5.400}{20} - 0.197 = 0.073.

The tendency of p_{\langle f,k \rangle}, with lim_{k \to \infty} p_{\langle f,k \rangle} = 0, is at some point near the equilibrium state, which verifies that the system is stable within the steady-state range.

Q4: What is the average number of computers in the system, both infected and healthy? From Eq. (60),

E(N) = \frac{\rho}{1 - \rho} = \frac{\lambda / \mu}{1 - (\lambda / \mu)} = \frac{\lambda}{\mu - \lambda} = \frac{18}{20 - 18} = 9,

and the average number of failed computers in the system is

E(f) = \frac{\rho_f}{1 - \rho_f} = \frac{\lambda_f / \mu}{1 - (\lambda_f / \mu)} = \frac{\lambda_f}{\mu - \lambda_f} = \frac{5.400}{20 - 5.400} = 0.370.

The estimated number of failed hosts per day is 24 × 0.370 = 8.880, and per month 30 × 8.880 = 266.400.

Q5: What is the average time spent on each of these computers in the system, both infected and healthy?

E(w) = \frac{\rho}{1 - \rho} \cdot \frac{1}{\lambda} = \frac{1}{\mu - \lambda} = 0.500 h,

and the average time spent on each of the failed computers is 1/(μ − λ_f) = 1/(20 − 5.400) = 0.068 h. The average time (in minutes) spent on each failed computer is thus 0.068 × 60 = 4.080 min, which makes a total of 18.115 h per month to maintain 266 attacked computers. Consequently, given the infection probability of 0.300, the arrival rate of 18 for incoming hosts, and the service rate of 20, the average time spent on 1260 computers for scanning and recovering 266 of them is 9 × 0.500 = 4.5 h per day. The impact of this threat is not only holding 266 computers down for 4.5 hours per month, but also the business loss caused by each of the failed computers during 4.5 hours a month, plus the labor time spent for scanning 1260 machines and recovering 266 of them, plus the extent of damage to the reputation of the company caused by this impact.

11. Concluding remarks

The vast majority of the theorems and models presented here date back at least some decades. However, their application to the reliability analysis of computer networks has not been clearly identified yet. It is important to determine the cost of unavailability of information systems using quantitative approaches that are theoretically and practically sound. We have presented an umbrella framework with the hope that it can facilitate the analysis of a broad spectrum of network structures, their components, interdependence among the interconnected structures, impacts caused by the associated threats, and performance analysis of the recovery facilities. A model with clearly identified underlying assumptions can significantly simplify the application of the model to a network of systems, a subsystem within the network, or a component within the subsystem.

Reliability analysis of complex systems can be too diffuse to many of us, and most computer security engineers have difficulties in getting a clear overview due to the lack of a scholarly background in reliability engineering. Some research guided by the theoretical context is thus diffuse and cumbersome for many computer security engineers to apply to real-world problems. Therefore, in this paper we have presented a concise framework for determining the reliability of systems that are under security threats. The framework makes use of the most fundamental reliability, stochastic process, and queuing theories to guide the analysis. Choosing appropriate models for threat, failure, and performance of recovery services plays the fundamental role for a successful analysis.
Models that have proven applicability to some environments/systems will be considered as is. Other models bearing uncertainty should be thoroughly verified for being an adequate model candidate for a specific system analysis.

In practice, neither testing nor proving of a model on a specific environment can guarantee complete confidence in the appropriateness of the model, because computer networks and the threats to them have highly complex structures that require a thorough analysis of each system and the threats to it in order to asymptotically match a specific model. The models discussed here can be considered as a complementary approach to assess the reliability of a specific part of a given environment rather than being used as a competing tool. Due to the imperfectness of the models in assuring accurate results, sometimes a set of approaches needs to be evaluated for collecting data, experimenting, choosing a model, estimating model parameters, and developing appropriate methods that can obtain a fitted model.

Furthermore, how the model behaves to changes in parameter values and how it responds to changes in the structure of the model are important. Failure data and plots are useful in determining the system behavior. Following the analysis of the data and plots, an estimate of the model parameters can be obtained by use of a method, e.g., MLE. A model matching needs to be done by substituting the estimated parameter values in the selected model. Finally, in order to eliminate eventual uncertainties, a goodness-of-fit test has to be performed. If the chosen model does not fit, we have to gather additional data and reassess the tests, or try another model from the candidate model list. It may be hard to predict the size of the additional data to collect or to find a better model. In that case, carefully designed simulations may help to guide a sensitivity analysis and eliminate eventual uncertainties in the chosen model and the underlying parameters and assumptions.

As already noticed, we have frequently used Poisson processes for modeling both the attack success counts and failure counts for massive attacks, e.g., DoS and scan worm spreading. Error searching for deterministic failures can be facilitated by matching a Bernoulli process, where defects such as infections, back doors, and Trojans are discovered by security tools and the failures are binomially counted. Overall reliability of complex systems mostly featuring serial-parallel interoperational structures requires k-out-of-n type analysis. For example, having n machines in a network, each using a specific application that is vulnerable to a specific threat under a random operational sequence, can be modeled as a k-out-of-n system.

Systems with immunity growth features require the determination of appropriate growth functions and sensitivity analysis for their parameter values. This can be extremely difficult if we are not able to collect sufficient data for the parameter estimation and goodness-of-fit tests. Variants of Gamma and Poisson distributions can be modified to contain the immunity growth function, assuming that the underlying model assumptions and parameters are carefully chosen. That is, both the attack behavior and its correlation to the observed system failures need to be studied in detail for laying down the required parameters and assumptions.

Regarding more complex structures, consecutively occurring attacks can be modeled by hypoexponential distributions, whereas threats occurring concurrently can be modeled using hyperexponential and Erlang distributions. Again, a Gamma distribution combined with any of these exponential distributions is almost the only choice if the system has the immunity growth feature.

As a brief note, we have summarized the key reliability models, suggested models for decomposing network structures into reliability structures, and modified and defined a set of reliability approaches applicable to modeling attacks and failures. Hence, approaches covering attacks, failures, impact analysis, and recovery processes have been provided in some detail, but without much concern for justifying them. As we have already considered in Section 6 and Appendix A, Markov processes can substantially facilitate the analysis of long-run characteristics of system states, stochastic failure changes, random attack changes, and stochastic recovery and queue operations. A Markov chain is a random process usually characterized as memoryless, i.e., the next state depends only on the current state and not on the events that occurred earlier. Also, systems that undergo transitions forgetfully from one state to another can simply adopt this model. In short, Markov processes can be used to model recovery services and several other reliability structures having IFR, DFR, and AFR characteristics.

Immunity (reliability) growth is an important phenomenon in network security, because systems are often recovered and patched to thwart failure sources. Some systems (e.g., routers, firewalls, and network protocols) may have built-in capability for eliminating failures in real time. This aspect requires a specific approach to analyze the attacks, failures, and the correlation between the attacks and failures. Accordingly, a theoretically justified model, [16], called the Attack-obstacle model, was also modified and tested for DDoS analysis. Due to the limitation on the scope of this article, the modification of this model was not substantially verified; the verification was done by additional simulations. We also said that threats that can randomly propagate among different applications on different hosts can be modeled as a Brownian motion, but this was not substantiated either. Because of its complexity, a future work is planned to investigate the fault propagation within heterogeneous application environments. We did not spend much time on the verification of some of the candidate models either. An extensive study is thus needed for testing and verifying them in order to validate their application to system analysis, especially for systems with immunity growth features.

Acknowledgments

The author is grateful to the referees for their inspiring comments and suggestions, which have significantly improved the presentation of this article. Their superior knowledge in the field and their challenging comments have contributed a lot to create a more fruitful paper that will hopefully guide many computer scientists, researchers, and engineers dealing with the reliability analysis of network security.

Appendix A. Steady-state analysis

Assume a network of m machines, which are maintained by a repair facility. The time to failure (up time) of a machine and its repair time are exponentially distributed random variables with rates λ and μ, respectively. This can be considered as a classical machine repairman problem. Assume further that repairs occur as a Markovian process at rate μ min[N(t), R(t)], where N(t) is the number of machines failed and awaiting repair at time t, and R(t) is the number of repairmen on duty. This process can be modeled as a simple birth–death Markov process, wherein jumps in state, N(t), occur at exponentially distributed intervals defined as follows: the failure probability of a machine before time t is

p_f(t) = P{time to failure < t} = 1 - e^{-\lambda t},  t \ge 0,

and the probability that there will be no failure before t is given by

\bar{p}_f(t) = P{time to failure \ge t} = e^{-\lambda t},  t \ge 0.   (A.1)

Time-to-repair expressions are analogous to the above definitions, i.e., the probability of the recovery time of a machine being less than time t is

p_r(t) = P{time to repair < t} = 1 - e^{-\mu t},  t \ge 0.
S. Kondakci / Reliability Engineering and System Safety 133 (2015) 275–299 297
n + 1 (birth); when a machine departs from the recovery service, the state decrements to n − 1 (death) during (t, t + Δt). Fig. 9 illustrates the input–output relation of the recovery center. The dynamics of the state transition probabilities can then be easily deduced from the one-step transition equations, Eq. (A.3). The requirement that the probabilities sum to one, ∑_{n=0}^∞ p_n = 1, implies

  [1 + ∑_n ∏_{i=0}^{n} (λ_i/μ_{i+1})] p_0 = 1, ∀ i ≥ 0.

If this series converges (i.e., ρ < 1), we can obtain the solution for p_0, and consequently for p_n; that is,

  ∑_{n=0}^∞ ρ^n = 1/(1 − ρ), ρ < 1.

Considering the single-repairman model, the recurrence formula, Eq. (A.13), of moving from p_{n−1} to p_n can be used to build the probabilities for the state transition space S = {p_1, p_2, …, p_n}, given that we have the equation set (A.7). As well, in the long run, the rates will converge to their average values, i.e.,

  λ ≅ ∏_{i=0}^{n} λ_i  and  μ ≅ ∏_{j=1}^{n+1} μ_j.

Hence, the initial throughput during the idle period of the repairman will become

  ∏_{i=0}^{n+1} (λ_i/μ_{i+1}) p_0 = (λ/μ) p_0 = ρ p_0.    (A.12)

These assignments will recursively lead us to the steady-state solution for states 0…n, i.e., the set of steady-state probabilities from state 0 to state n:

  S = {p_0, p_1 = (λ/μ) p_0, (λ/μ)^2 p_0, (λ/μ)^3 p_0, …, (λ/μ)^n p_0}.    (A.13)

It is obvious that the recurrence formula given above holds for equally constant mean rates, i.e., λ_i = λ_{i+1} = λ_{i+2} = … = λ_n and μ_i = μ_{i+1} = μ_{i+2} = … = μ_n, which significantly simplifies the steady-state solution. Since the steady-state probabilities must sum to 1, we can easily obtain p_0 for the general case as

  1 = ∑_{n=0}^∞ p_n = ∑_{n=0}^∞ (λ/μ)^n p_0 = p_0 ∑_{n=0}^∞ ρ^n = p_0/(1 − ρ).

Thus,

  p_0 = 1 − ρ.    (A.14)

The value of p_0 gives the proportion of time the repair system is idle (or unavailable), not necessarily being hacked down; a repair system can also be under repair or casual maintenance. Hence, by referring to the transition probability space (A.13), we obtain the probability, p_n, that there are n machines on repair (queue + recovery) in the system as

  p_n = ρ^n p_0 = ρ^n (1 − ρ), for n > 0.

Thus, for λ < μ (that is, ρ < 1), the expected number of jobs (requests) in this Markov process under steady state is the mean of p_n = ρ^n (1 − ρ), i.e.,

  L = ∑_{n=0}^∞ n p_n = ∑_{n=0}^∞ n (1 − ρ) ρ^n = ρ/(1 − ρ) = (λ/μ)/(1 − λ/μ).    (A.15)

The average number of requests currently under service is

  L_s = ρ = λ/μ = 1 − p_0.

Then, the average number of requests in the queue is

  Q = L − L_s = ∑_{n=1}^∞ (n − 1) p_n = ρ^2/(1 − ρ).

By referring to Little's law, L = λW, and combining it with Eq. (A.15), we can obtain the average waiting time, W, as

  W = L/λ = 1/(μ − λ).

Similarly, the average delay in the queue is

  D = Q/λ = W − 1/μ = λ/(μ(μ − λ)).
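Pulling the steady-state formulas together, a small Python helper (our own illustrative code; the names lam, mu, and the returned dictionary keys are assumptions, not from the article) computes p_0, p_n, L, L_s, Q, W, and D for given rates and cross-checks Little's law:

```python
def mm1_metrics(lam, mu):
    """Steady-state metrics of the single-repairman (M/M/1) recovery model.

    Requires lam < mu (i.e., rho < 1); keys follow Appendix A:
    p0 = 1 - rho (A.14), L (A.15), Ls, Q, W, and D.
    """
    if lam >= mu:
        raise ValueError("steady state requires lam < mu (rho < 1)")
    rho = lam / mu
    return {
        "rho": rho,
        "p0": 1 - rho,                    # idle probability, Eq. (A.14)
        "L": rho / (1 - rho),             # mean number in system, Eq. (A.15)
        "Ls": rho,                        # mean number under service
        "Q": rho**2 / (1 - rho),          # mean number waiting in the queue
        "W": 1 / (mu - lam),              # mean time in system (Little's law)
        "D": lam / (mu * (mu - lam)),     # mean delay in the queue, W - 1/mu
    }

def pn(n, lam, mu):
    """Geometric steady-state distribution p_n = (1 - rho) * rho**n."""
    rho = lam / mu
    return (1 - rho) * rho**n
```

For λ = 1, μ = 2 (ρ = 0.5) this gives p_0 = 0.5, L = 1, L_s = 0.5, Q = 0.5, W = 1, and D = 0.5, consistent with Eqs. (A.14)–(A.15).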
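As an independent sanity check on Eq. (A.15), here is a short discrete-event simulation of the single-repairman model (our own illustrative code, assuming exponential interfailure and repair times as in Appendix A):

```python
import random

def simulate_mm1(lam, mu, horizon, seed=1):
    """Time-average number of machines in the repair system (queue + service),
    simulated as an M/M/1 queue over [0, horizon]."""
    rng = random.Random(seed)
    t, n, area = 0.0, 0, 0.0              # clock, machines in system, integral of n(t)
    next_arrival = rng.expovariate(lam)
    next_departure = float("inf")         # no departure while the system is empty
    while True:
        t_next = min(next_arrival, next_departure, horizon)
        area += n * (t_next - t)          # accumulate n(t) between events
        t = t_next
        if t >= horizon:
            break
        if next_arrival <= next_departure:  # a machine fails and joins the queue
            n += 1
            next_arrival = t + rng.expovariate(lam)
            if n == 1:                      # repairman was idle: start a service
                next_departure = t + rng.expovariate(mu)
        else:                               # a repair completes
            n -= 1
            next_departure = t + rng.expovariate(mu) if n > 0 else float("inf")
    return area / horizon
```

For λ = 1, μ = 2 and a long horizon, the time average should settle near L = ρ/(1 − ρ) = 1.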